The Painter Goblin: Part 3, Data Sources
One thing that held the Painter Goblin project back was finding a data source to get images from.
There are potentially hundreds of sources out there, but taking the path of least resistance means that:
- Any source needs either hackable URIs* or a random function
- Ideally, a data source doesn’t link to yet another page, e.g. portal-like websites that point to others’ collections
- Ideally, the data source links directly to an image to download
- Data can be easily selected by category, e.g. just paintings or posters, not just ‘art’
* A hackable URI is a URI pattern that can be cycled through programmatically, even if the underlying data isn’t entirely well known, e.g. http://example.com/image/0001, then http://example.com/image/0002 for subsequent pages, for lack of a more concrete example.
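To make the footnote concrete, here is a minimal Python sketch of cycling through a hackable URI pattern. The base URL is the placeholder from the footnote, and the function name and zero-padding width are assumptions made for illustration only.

```python
from typing import Iterator


def hackable_uris(base: str, start: int, stop: int, width: int = 4) -> Iterator[str]:
    """Yield URIs that follow a predictable, zero-padded numeric pattern."""
    for number in range(start, stop + 1):
        # e.g. 1 becomes "0001" when width is 4
        yield f"{base}/{number:0{width}d}"


for uri in hackable_uris("http://example.com/image", 1, 3):
    print(uri)
```

A real source would still need each generated URI to be checked, since nothing guarantees that every number in the range exists.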
I wanted to explore heritage sources such as Europeana, TROVE, and DPLA. I struggled to search these effectively though, and struggled to see how I might automate their use. I recognise they have APIs, and I’ll revisit them in the future as I look to expand the Painter Goblin’s corpus.
Enter Wikidata
I started looking at Wikipedia knowing that it was categorised well and could also return random pages. I didn’t know how to create a random search per category though.
I then started thinking about Wikidata (an open data project that sits alongside the other wiki projects). I was already aware that it had a SPARQL endpoint, a technology I was familiar with, so I just had to figure out if I could split searches by category, randomly, and via URI, e.g. via cURL or some such.
This is where I got lucky as my search started by looking at Wikidata’s examples pages:
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
But even better than that, one of the examples was about finding ‘random paintings’.
This was everything I needed.
But it’s not all rosy…
If Wikidata hadn’t provided this example then the learning curve would have been much steeper. Specifically, one of the hardest things to get used to is the naming conventions for URIs, e.g. subjects, and predicates.
- Van Gogh, for example, is: Q5582, e.g. https://www.wikidata.org/wiki/Q5582
- His Starry Night painting is: Q45585, e.g. https://www.wikidata.org/wiki/Q45585
And a location predicate looks like:
- Location, P276, e.g. https://www.wikidata.org/wiki/Property:P276
So they’re not user-friendly – they need to be looked up via some other mechanism (I used a search engine) and then input into queries.
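As an alternative to a search engine, identifiers can also be looked up through the MediaWiki API that Wikidata exposes. The sketch below only builds the URLs; the helper names are made up for this post, and it is an illustration rather than what the Painter Goblin actually does.

```python
import urllib.parse

API = "https://www.wikidata.org/w/api.php"


def search_url(term: str, kind: str = "item", language: str = "en") -> str:
    """Build a wbsearchentities URL for an item (Q...) or property (P...) search."""
    if kind not in ("item", "property"):
        raise ValueError("kind must be 'item' or 'property'")
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": kind,
        "format": "json",
    }
    return f"{API}?{urllib.parse.urlencode(params)}"


def page_url(entity_id: str) -> str:
    """Turn an identifier such as Q5582 or P276 into its Wikidata page URL."""
    if entity_id.startswith("P"):
        return f"https://www.wikidata.org/wiki/Property:{entity_id}"
    return f"https://www.wikidata.org/wiki/{entity_id}"


print(search_url("Starry Night"))
print(page_url("P276"))
```

Fetching the first URL should return JSON whose `search` list carries an `id` for each match, which can then be dropped into a query.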
But – like I say – I had a head start, and with the data I could already retrieve I only had to figure out a few more pieces to be able to start the bot.
What to tweet?
What do you want to know when you look at a piece of art? – Who made it, the title, and perhaps its location, or collection, are important. With the information coming from Wikidata we can also return the URI.
Provided we can return an image link from Wikidata, which is key to this project, we’ll download the image to a temporary location and the Painter Goblin will set to work.
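The download step might be sketched as follows. This is an assumption about how it could be done with the Python standard library, not the bot's actual code; the function name and the User-Agent string are hypothetical.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse

# Hypothetical identifier; a real bot should describe itself and give contact details.
USER_AGENT = "painter-goblin-example/0.1"


def download_to_temp(image_url: str) -> str:
    """Save the resource at image_url to a temporary file and return its path."""
    # Keep the original extension (e.g. ".jpg") so later tools can guess the format.
    suffix = os.path.splitext(urlparse(image_url).path)[1]
    request = urllib.request.Request(image_url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        # delete=False keeps the file on disk after the handle is closed.
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
            shutil.copyfileobj(response, handle)
            return handle.name
```

The caller is responsible for removing the temporary file once the remixed image has been produced.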
An example tweet may then be:
Q27513169, Eugène Grasset, Museu Nacional d’Art de Catalunya http://www.wikidata.org/entity/Q27513169
N.B. Images aren’t compared to each other in the real life Tweet. This is just for the blog. The design decision was not to do this so as not to take away from the remixed piece produced by The Goblin.
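Putting the tweet text together from a query result could look something like the sketch below. It assumes the standard SPARQL JSON results layout, where each variable maps to an object with a `value` key, and it assumes the collection label is the one shown in the tweet; the function names are made up for illustration.

```python
def value_of(binding: dict, name: str) -> str:
    """Return the plain value for a variable, or an empty string if it is unbound."""
    return binding.get(name, {}).get("value", "")


def format_tweet(binding: dict) -> str:
    """Join title, artist and collection, then append the Wikidata URI."""
    parts = [
        value_of(binding, "itemLabel"),
        value_of(binding, "artistLabel"),
        value_of(binding, "collLabel"),
    ]
    text = ", ".join(part for part in parts if part)
    return f"{text} {value_of(binding, 'item')}".strip()


example = {
    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q27513169"},
    "itemLabel": {"type": "literal", "value": "Q27513169"},
    "artistLabel": {"type": "literal", "value": "Eugène Grasset"},
    "collLabel": {"type": "literal", "value": "Museu Nacional d’Art de Catalunya"},
}
print(format_tweet(example))
```

When an item has no label in any of the requested languages, the label service falls back to the identifier, which is why the example tweet starts with Q27513169.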
The SPARQL used in-code to generate a random tweet looks like:
SELECT ?item ?itemLabel ?image ?loc ?locLabel ?coll ?collLabel ?artist ?artistLabel
       (MD5(CONCAT(str(?item),str(RAND()))) as ?random)
WHERE {
  ?item wdt:P31 wd:Q429785.
  ?item wdt:P18 ?image.
  OPTIONAL { ?item wdt:P276 ?loc . }
  ?item wdt:P195 ?coll .
  ?item wdt:P170 ?artist .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de,it" }
}
ORDER BY ?random
LIMIT 1
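For anyone wanting to run the query from code, here is a rough sketch using only the Python standard library against the public endpoint at https://query.wikidata.org/sparql. The function names and the User-Agent string are assumptions made for this example and are not taken from the bot itself.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"
# Hypothetical identifier; replace with something that describes your own tool.
USER_AGENT = "painter-goblin-example/0.1"

# The random-item query from the post, copied as-is.
RANDOM_QUERY = """
SELECT ?item ?itemLabel ?image ?loc ?locLabel ?coll ?collLabel ?artist ?artistLabel
       (MD5(CONCAT(str(?item),str(RAND()))) as ?random)
WHERE {
  ?item wdt:P31 wd:Q429785.
  ?item wdt:P18 ?image.
  OPTIONAL { ?item wdt:P276 ?loc . }
  ?item wdt:P195 ?coll .
  ?item wdt:P170 ?artist .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de,it" }
}
ORDER BY ?random
LIMIT 1
""".strip()


def build_request(query: str) -> urllib.request.Request:
    """Prepare a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {"User-Agent": USER_AGENT, "Accept": "application/sparql-results+json"}
    return urllib.request.Request(f"{ENDPOINT}?{params}", headers=headers)


def run_query(query: str) -> list:
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


# Usage (needs network access):
# rows = run_query(RANDOM_QUERY)
# print(rows[0]["item"]["value"], rows[0]["image"]["value"])
```

Keeping request building separate from the network call makes it easy to inspect the URL that would be sent before actually sending it.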
In the second part of this series I mentioned being able to tweet manually. One way to do this is to start from a known link, so that a randomly found image isn’t needed. To do that we can use the following SPARQL:
SELECT ?itemLabel ?image ?loc ?locLabel ?coll ?collLabel ?artist ?artistLabel
WHERE {
  OPTIONAL {
    <http://www.wikidata.org/entity/Q27513169> rdfs:label ?itemLabel .
    FILTER (LANG(?itemLabel) = "en")
  }
  <http://www.wikidata.org/entity/Q27513169> wdt:P18 ?image .
  OPTIONAL { <http://www.wikidata.org/entity/Q27513169> wdt:P276 ?loc . }
  <http://www.wikidata.org/entity/Q27513169> wdt:P195 ?coll .
  <http://www.wikidata.org/entity/Q27513169> wdt:P170 ?artist .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de,it" }
}
LIMIT 1
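Since the same entity URI appears five times in that query, it lends itself to a small template. The following is one possible sketch; the names are hypothetical, and the URI check is an extra precaution added here so that arbitrary text cannot be pasted into the query.

```python
import re
from string import Template

# $item stands in for the bracketed entity URI; the rest matches the query above.
ITEM_QUERY = Template("""\
SELECT ?itemLabel ?image ?loc ?locLabel ?coll ?collLabel ?artist ?artistLabel
WHERE {
  OPTIONAL {
    $item rdfs:label ?itemLabel .
    FILTER (LANG(?itemLabel) = "en")
  }
  $item wdt:P18 ?image .
  OPTIONAL { $item wdt:P276 ?loc . }
  $item wdt:P195 ?coll .
  $item wdt:P170 ?artist .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de,it" }
}
LIMIT 1
""")

ENTITY_URI = re.compile(r"http://www\.wikidata\.org/entity/Q[1-9][0-9]*")


def query_for_item(uri: str) -> str:
    """Return the known-item query for a single Wikidata entity URI."""
    if not ENTITY_URI.fullmatch(uri):
        raise ValueError(f"not a Wikidata item URI: {uri!r}")
    return ITEM_QUERY.substitute(item=f"<{uri}>")


print(query_for_item("http://www.wikidata.org/entity/Q27513169"))
```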
So all that’s left is for the Painter Goblin to be put to work!
You can try the SPARQL queries on the Wikidata Query Service. It’s pretty cool that there’s a tag that tells Wikidata to preview the results for us as a grid of images:
#defaultView:ImageGrid
Give it a whirl!
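To share a query with the image grid already switched on, the query text can be placed in the URL fragment of the Query Service page. This is a small sketch of building such a link; the function name is made up, and it assumes the service keeps reading the query from the fragment as it does at the time of writing.

```python
import urllib.parse

QUERY_SERVICE = "https://query.wikidata.org/"


def query_service_link(query: str, view: str = "ImageGrid") -> str:
    """Prefix the query with a defaultView comment and encode it into a link."""
    text = f"#defaultView:{view}\n{query.strip()}"
    # safe="" encodes every reserved character, including "#" and "/".
    return QUERY_SERVICE + "#" + urllib.parse.quote(text, safe="")


print(query_service_link("SELECT ?item ?image WHERE { ?item wdt:P18 ?image } LIMIT 10"))
```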