Uncategorized | Object Oriented

This weekend I got together with a bunch of other brilliant, dedicated people to participate in GLAMHackPhilly (that’s GLAM as in Galleries, Libraries, Archives, and Museums, not the genre of music popularized by David Bowie). My team worked with the British Museum’s linked open data, trying to trace trends in the museum’s acquisitions that are easy to hypothesize about but have previously been time-consuming, if not very difficult, to collect meaningful data on. Our main question was how many objects came into the museum from colonies of the British Empire vs. non-colonies over the history of the museum.

Acquiring the Data

Part of our team tried using SPARQL to find objects in the British Museum’s linked open data that have 1) a date associated with their acquisition by the museum and 2) a location of origin (the object of a “was present at” predicate). The tricky part is getting uniform specificity in the location, because object records contain locations including archaeological sites, cities, regions, countries, or even continents. Fortunately, locations have a “broader” predicate that allows you to navigate from more specific to less specific geographic locations. Taking the country as our unit, we tried to use SPARQL to find the country associated with each location, but because none of us was very familiar with SPARQL, it’s entirely possible that our solution isn’t actually accomplishing what we wanted it to do.

So what is SPARQL? SPARQL is a query language designed to allow people to search structured semantic data written in RDF (Resource Description Framework). In RDF, different types of “objects” are connected to each other using relationships. People talk about these objects and relationships in grammatical terms, as subject, predicate, and object. For example, a museum item can be connected to a geographical location using a relationship (or “predicate”) such as “was present at.” Together, the subject, predicate, and object are known as a “triple.” SPARQL works by allowing you to substitute variables for one or more entities in a triple. There are four different types of SPARQL queries. We used the SELECT query to retrieve entities and return them in a table, but you can also return results in RDF using a CONSTRUCT query, ask true or false questions about entities and predicates with an ASK query, or extract an RDF graph using a DESCRIBE query.
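To make the idea of substituting variables into a triple a little more concrete, here is a minimal Python sketch of matching a single triple pattern. It is only a toy: the triples and identifiers are invented for illustration, and a real SPARQL engine does far more (joins across several patterns, filters, namespaces).

```python
# A toy in-memory triple store; every identifier here is invented for illustration.
TRIPLES = [
    ("item:vase", "was_present_at", "place:Athens"),
    ("item:coin", "was_present_at", "place:Cairo"),
    ("place:Athens", "broader", "place:Greece"),
]


def match(pattern, triples):
    """Return one {variable: value} dict per triple that fits the pattern.

    Terms starting with "?" are variables; anything else must match exactly.
    """
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds to whatever it meets, but must stay
                # consistent if it appears twice in the same pattern.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term failed to match, so keep this triple's bindings.
            results.append(bindings)
    return results


rows = match(("?item", "was_present_at", "?place"), TRIPLES)
for row in rows:
    print(row["?item"], "->", row["?place"])
```

Each returned dictionary corresponds roughly to one row of the table a SELECT query would give back.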

While the British Museum’s SPARQL access point has a couple of examples of SPARQL queries, there are many more examples of more complex queries available at the Europeana SPARQL endpoint. I’d also recommend this How to SPARQL walk-through.

Part of our query is pretty straightforward – find all of the objects that have a date of acquisition associated with them. I think we accomplished this in SPARQL with this query:

SELECT DISTINCT ?yearAcquired WHERE {
  ?item ecrm:P30i_custody_transferred_through ?hasCustody .
  ?hasCustody ecrm:P4_has_time-span ?hasTime .
  ?hasTime rdfs:label ?yearAcquired .
}

Every word with a “?” prepended to it is a variable, while each predicate is tied to a namespace through a prefix such as ecrm:, rdfs:, or skos:. So what does this particular query do? First, it finds all entities that have acquisition data associated with them: ?item ecrm:P30i_custody_transferred_through ?hasCustody. (If we were being more rigorous in our query, which we probably should have been, we might check that this custody transfer had the British Museum as its object, but I’m guessing that most of the objects in the museum would only have a “custody transfer” related to their acquisition by the museum.) Then it narrows these down to entities for which the acquisition event is linked to a date: ?hasCustody ecrm:P4_has_time-span ?hasTime. Finally, it gets the human-readable “label” of this date: ?hasTime rdfs:label ?yearAcquired.

At least, that’s what I think this query does. We can test this query by substituting a concrete date for the final variable:

SELECT DISTINCT * WHERE {
  ?item ecrm:P30i_custody_transferred_through ?hasCustody .
  ?hasCustody ecrm:P4_has_time-span ?hasTime .
  ?hasTime rdfs:label "1900" .
}

This query should return only those entities which have a “custody transfer”/acquisition date of 1900. While I’m not sure that it returns all of the objects that were acquired in 1900, it certainly returns a lot of them, and it doesn’t return any objects that were acquired in other years.

Now that we have all of the items with acquisition dates, we can narrow down these items to only those that also have a location (any location, besides the British Museum) associated with them: ?item ecrm:P12i_was_present_at ?placeThing. Because of how the British Museum specifies places, we must go a step further and find the actual location where the “being present” of the item “took place at”: ?placeThing ecrm:P7_took_place_at ?locality. Finally, we can get the human-readable label of the location: ?locality skos:prefLabel ?countryLabel. The whole query then looks as follows:

SELECT ?countryLabel ?yearAcquired WHERE {
  ?item ecrm:P30i_custody_transferred_through ?hasCustody .
  ?hasCustody ecrm:P4_has_time-span ?hasTime .
  ?hasTime rdfs:label ?yearAcquired .
  ?item ecrm:P12i_was_present_at ?placeThing .
  ?placeThing ecrm:P7_took_place_at ?locality .
  ?locality skos:prefLabel ?countryLabel .
}

But this is where we run into a problem. Although I have optimistically used the variable ?countryLabel to name the location of the object, in fact the types of locations specified in the item records are quite variable. They can range from specific obscure archaeological sites to entire continents, depending on the provenance of the object. Trying to identify these locations using a place-name authority would be something of a nightmare, and probably an abysmal failure. Fortunately, the British Museum has linked each of these locations to more general locations with the predicate skos:broader. These more general locations are in turn linked to even more general locations. So every single location theoretically can be traced back to a continent using the skos:broader predicate.
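The idea of tracing a location upward through skos:broader can be sketched outside of SPARQL. The Python below is a minimal illustration, not our actual query: the place names, their types, and the assumption that each place has exactly one broader parent are all made up for the example.

```python
# Hypothetical data: each place maps to its skos:broader parent and to a type label.
BROADER = {
    "Giza": "Cairo Governorate",
    "Cairo Governorate": "Egypt",
    "Egypt": "Africa",
}
PLACE_TYPE = {
    "Giza": "archaeological site",
    "Cairo Governorate": "region",
    "Egypt": "country or city-state",
    "Africa": "continent",
}


def find_country(place, broader=BROADER, place_type=PLACE_TYPE):
    """Follow broader links upward until a country-level place is found."""
    seen = set()
    while place is not None and place not in seen:
        if place_type.get(place) == "country or city-state":
            return place
        seen.add(place)  # guard against accidental cycles in the data
        place = broader.get(place)
    return None  # e.g. the record only names a continent


print(find_country("Giza"))
```

A location that is already a country is returned unchanged, and a location recorded only at the continent level has no country to find.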

At the very least, we need to filter the results of this query to return only those which actually have a country as the type of location linked from their record. We can do this by adding a filter which makes sure that the locality type is country (or city-state, if we’re in Ancient Greece). Here’s the whole query, because the syntax is important:

SELECT * WHERE {
  ?item ecrm:P30i_custody_transferred_through ?hasCustody .
  ?hasCustody ecrm:P4_has_time-span ?hasTime .
  ?hasTime rdfs:label ?yearAcquired .
  ?item ecrm:P12i_was_present_at ?placeThing .
  ?placeThing ecrm:P7_took_place_at ?locality .
  ?locality skos:prefLabel ?countryLabel .
  ?locality ecrm:P2_has_type ?localityType .
  ?localityType skos:prefLabel ?localityTypeLabel .
  FILTER (?localityTypeLabel = "country or city-state")
}

For the question we initially wanted to pose of the data, we need to find the country associated with the more or less precise location that is actually linked to each item. We figured out a verbose (read: very ugly) way to do this using the skos:broader predicate and UNION to join together multiple filtered subqueries, but it was very slow to run. I suspect there is a more elegant way to do this, although I have yet to test this. In any case, because this more robust search took too long, we ended up presenting the data from the preliminary search, which only returned items that had countries (rather than sites or cities or regions) as their location type – a small fraction of the total items with location data, much less of all of the items in the British Museum’s Semantic Web Collection.
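One candidate for a more elegant approach is a SPARQL 1.1 property path, where skos:broader* means "follow zero or more broader links." The Python sketch below only builds the query text. It has not been run against the British Museum endpoint, it assumes the endpoint supports property paths at all, and the ecrm: namespace URI is an assumption that should be checked against the endpoint's own prefix list. It may well be just as slow as the UNION version.

```python
# Standard W3C namespaces for rdfs: and skos:.
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
# ASSUMPTION: Erlangen CRM namespace; verify against the endpoint's prefix list.
ECRM_NS = "http://erlangen-crm.org/current/"


def build_country_query(ecrm_ns=ECRM_NS):
    """Return query text that climbs skos:broader from each locality to a country."""
    return f"""PREFIX ecrm: <{ecrm_ns}>
PREFIX rdfs: <{RDFS_NS}>
PREFIX skos: <{SKOS_NS}>

SELECT ?countryLabel ?yearAcquired WHERE {{
  ?item ecrm:P30i_custody_transferred_through ?hasCustody .
  ?hasCustody ecrm:P4_has_time-span ?hasTime .
  ?hasTime rdfs:label ?yearAcquired .
  ?item ecrm:P12i_was_present_at ?placeThing .
  ?placeThing ecrm:P7_took_place_at ?locality .
  # Zero or more broader steps, so a locality that is itself a country still matches.
  ?locality skos:broader* ?country .
  ?country ecrm:P2_has_type ?countryType .
  ?countryType skos:prefLabel ?countryTypeLabel .
  ?country skos:prefLabel ?countryLabel .
  FILTER (?countryTypeLabel = "country or city-state")
}}
"""


print(build_country_query())
```

The filter on the type label is the same one used in the country-only query; the only new piece is the skos:broader* step between ?locality and ?country.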

Binning the Data

Once we had a query that was able to find all of the items acquired in a given year and group them by country, another member of the group wrote a small Python program that iterated over years and added the number of items acquired in each country to a table. She also modified this program to sum the years into decades, making it easier to read and analyze the data. I’m pretty sure she was using something like this SPARQL wrapper.
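The decade binning can be sketched in a few lines of Python. This is not her program, just a minimal illustration under some assumptions: the rows are (country, year label) pairs already pulled out of the query results, and any year label that isn't a plain integer is skipped rather than parsed.

```python
from collections import Counter


def decade_of(year_label):
    """Turn a year label such as "1905" into its decade, 1900.

    Returns None for labels that are not plain integers.
    """
    try:
        year = int(year_label.strip())
    except ValueError:
        return None
    return year - year % 10


def bin_by_decade(rows):
    """Count items per (country, decade) from (country, year_label) rows."""
    counts = Counter()
    for country, year_label in rows:
        decade = decade_of(year_label)
        if decade is not None:
            counts[(country, decade)] += 1
    return counts


# Invented sample rows standing in for query results.
sample = [("Egypt", "1900"), ("Egypt", "1907"), ("Greece", "1912"), ("Egypt", "c. 1900")]
for (country, decade), n in sorted(bin_by_decade(sample).items()):
    print(country, decade, n)
```

Skipping labels like "c. 1900" silently undercounts, so a real version would probably want to log or parse them instead.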

Analysis and Visualization

Finally, I summed the acquisitions from the former colonies of the British Empire and subtracted that from the total acquisitions in each year in order to arrive at two numbers of acquisitions for each year: one from countries that were at some point British colonies, and one from countries that were never officially part of the British Empire. Feel free to check out the spreadsheet with acquisitions by year, and that with the acquisitions by decade. (Again, these are only the items that had a country as the object of their was_present_at predicate because the more comprehensive query took too long to run the second day.)
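The same sum-and-subtract arithmetic could be written in Python rather than a spreadsheet. The sketch below uses invented counts and a deliberately tiny set of former colonies purely for illustration; the real list and the real numbers live in the spreadsheets.

```python
def split_totals(counts_by_year, former_colonies):
    """Return {year: (colony_total, non_colony_total)}.

    counts_by_year maps year -> {country: item count}.
    """
    totals = {}
    for year, by_country in counts_by_year.items():
        total = sum(by_country.values())
        colony = sum(n for country, n in by_country.items() if country in former_colonies)
        # Non-colony acquisitions are whatever is left after removing the colony total.
        totals[year] = (colony, total - colony)
    return totals


# Invented numbers and an illustrative, far from complete, set of former colonies.
FORMER_COLONIES = {"India", "Nigeria"}
counts = {
    1900: {"India": 12, "Greece": 5, "Nigeria": 3},
    1901: {"Greece": 7},
}
print(split_totals(counts, FORMER_COLONIES))
```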

I graphed the final colony vs. non-colony totals, first by year:

and then by decade, which is a lot easier to read even if it is less granular:

Another member of the group made choropleth maps showing the number of items received from each country, marking the top ten sources of items in red, the next ten countries in orange, and the next ten countries in yellow. I’m working on trying to make my own maps using Google Fusion Tables.
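The ranking behind that color scheme is simple enough to sketch in Python. This is not the code used for the maps; the function name, the alphabetical tie-breaking, and the choice to leave lower-ranked countries uncolored are assumptions made for the example.

```python
def color_tiers(totals, tier_size=10, colors=("red", "orange", "yellow")):
    """Map ranked countries to colors: the top tier first, then the next, and so on.

    Countries ranked below the last tier are left out of the result.
    """
    # Highest totals first; ties are broken alphabetically (an arbitrary choice).
    ranked = sorted(totals, key=lambda country: (-totals[country], country))
    tiers = {}
    for rank, country in enumerate(ranked):
        tier = rank // tier_size
        if tier >= len(colors):
            break
        tiers[country] = colors[tier]
    return tiers


# Invented totals; tier_size=1 keeps the example small.
print(color_tiers({"A": 5, "B": 9, "C": 1, "D": 0}, tier_size=1))
```

With the default tier_size of 10 this gives the top ten, next ten, and next ten groupings described above.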

Conclusions

I’m pretty reluctant to try to draw any conclusions based on these graphs for many reasons. I don’t have a sense of how exhaustive our query was, either in terms of looking in the right places for acquisition data, or in terms of finding all of the items that had locations and acquisition dates. I’m pretty sure we also lost a lot, at least in the query that produced this data, in only looking at items that had a country as their specified location. Objects with more precise locations, although retrievable with our more sophisticated query, aren’t represented in these graphs. Beyond that, I’m not sure to what extent the digital records reflect the actual acquisitions of the museum. The museum may simply not have digital records available for some of the earlier years.

That said, I’m pretty proud of how much we learned just trying to answer this question, in only two days. Considering that no one on the team had ever used SPARQL before, we managed to make a surprising amount of progress towards the goal that we set for ourselves yesterday morning. While SPARQL queries take a little getting used to, I can definitely see myself playing around with them. It seems like it should be possible to use them for everything from automatic bibliography generation (just find all of the sources that mention items related to a category) to more complex network analyses linking humans and museum objects in surprising ways. I’m really excited to see how linked open data continues to change scholarship and curatorial practice.

Appendix

I made a gist of our hot mess of a SPARQL query. If you have any ideas on how to make this better, please let me know.
