The Painter Goblin: Part 3, Data Sources

One thing that held the Painter Goblin project back was finding a data source to get images from.

There are potentially hundreds of sources out there, but taking the path of least resistance means that:

Any source needs either hackable URIs** (uniform resource identifiers) or a randomizing function.
Ideally, a data source doesn’t link to yet another page, e.g. portal-like websites that point to others’ collections.
Ideally, the data source links directly to an image to download.
Data can be easily selected by category, e.g. just paintings or just posters, not just ‘art’.

** A hackable URI is a URI pattern that can be cycled through using computational techniques, even if the underlying data isn’t entirely well-known, e.g. http://example.com/image/0001 and http://example.com/image/0002 for subsequent pages, for lack of a more concrete example.
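To make the idea concrete, here is a minimal Python sketch of cycling through a hackable URI pattern. The base address is the placeholder from the footnote above, and the function name and zero-padding width are assumptions made purely for illustration.

```python
def hackable_uris(base, start, stop, width=4):
    """Yield URIs of the form base/0001, base/0002, ... up to stop (inclusive)."""
    for number in range(start, stop + 1):
        # Zero-pad the counter so that 1 becomes 0001, matching the pattern.
        yield f"{base}/{number:0{width}d}"


uris = list(hackable_uris("http://example.com/image", 1, 3))
for uri in uris:
    print(uri)
```

A real source would also need a check that each generated URI actually resolves, as gaps in the numbering are likely.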

I wanted to explore heritage sources such as Europeana, TROVE, and DPLA. I struggled to search these effectively though, and struggled to see how I might automate using them. I recognise they have APIs, and I’ll revisit them in the future as I look to expand the Painter Goblin’s corpus.

Enter Wikidata.

Enter Wikidata

I started looking at Wikipedia knowing that it was categorized well and could also return random pages. I didn’t know how to create a random search per category though.

I then started thinking about Wikidata (an open data project that belongs to the same family as the other wiki projects). I was already aware that it had a SPARQL endpoint, a technology I was familiar with, so I just had to figure out if I could split searches by category, randomly, and via URI, e.g. via curl or some such.
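As a rough sketch of what querying "via URI" can look like, the Python below builds a GET request against the public Wikidata Query Service endpoint and asks for JSON results. The function names and the User-Agent string are assumptions for illustration, not the Painter Goblin's own code.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_uri(sparql):
    """Return a URI that runs the SPARQL query and asks for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': sparql, 'format': 'json'})}"


def run_query(sparql, user_agent="painter-goblin-example/0.1"):
    """Fetch the query results (needs network access); the user agent is a placeholder."""
    request = Request(build_query_uri(sparql), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


example = build_query_uri("SELECT ?item WHERE { ?item wdt:P31 wd:Q429785 } LIMIT 1")
print(example)
```

The printed URI can be pasted into a browser or handed to curl, which is all "hackable" needs to mean here.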

This is where I got lucky as my search started by looking at the Wikidata examples page.

But even better than that – one of the examples was about finding random paintings.

This was everything I needed.

Although it’s not all solved…

If Wikidata hadn’t provided this example then the learning curve would have been much steeper. Specifically, one of the hardest things to get used to is the naming conventions for URIs, e.g. for subjects and predicates.

Van Gogh, for example, is: Q5582, e.g. https://www.wikidata.org/wiki/Q5582
His Starry Night painting is: Q45585, e.g. https://www.wikidata.org/wiki/Q45585

And a location predicate looks like:

Location, P276, e.g. https://www.wikidata.org/wiki/Property:P276

So they’re not user-friendly: the identifiers need to be looked up via some other mechanism (I used a search engine) and then entered into queries.

But, like I said, I had a head start, and with the data I could already retrieve, I only had to figure out a few more pieces to be able to start the bot.

What to tweet?

What do you want to know when you look at a piece of art? Who made it, its title, and perhaps its location or collection are all important. With the information coming from Wikidata we can also return the URI.

Provided we can return an image from Wikidata, which is key to this project, we need to download it to a temporary location and give the Painter Goblin what it needs to remix the content.
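The download step might look something like the following Python sketch. It assumes the image value returned by the query is a URI that can be fetched directly; the function name, the temporary directory prefix, and the User-Agent string are all hypothetical.

```python
import os
import shutil
import tempfile
from urllib.parse import unquote, urlsplit
from urllib.request import Request, urlopen


def download_to_temp(image_uri, user_agent="painter-goblin-example/0.1"):
    """Download image_uri into a fresh temporary directory and return the file path."""
    # Use the last path segment as the file name, falling back to a generic one.
    name = os.path.basename(unquote(urlsplit(image_uri).path)) or "image"
    target = os.path.join(tempfile.mkdtemp(prefix="goblin-"), name)
    request = Request(image_uri, headers={"User-Agent": user_agent})
    # Stream the response to disk rather than holding the whole image in memory.
    with urlopen(request, timeout=30) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

The caller is responsible for removing the temporary directory once the remix has been produced.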

An example tweet may then be:

{item name or number} {artist} {location} {Wikidata URI}

e.g. Q27513169, Eugène Grasset, Museu Nacional d’Art de Catalunya https://www.wikidata.org/entity/Q27513169
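A minimal Python sketch of assembling that text from one row of results follows. It assumes the row is shaped like the standard SPARQL JSON results format (each variable mapped to an object with a "value" key) and that the variable names match the queries in this post; the function name and the sample row are made up for illustration.

```python
def format_tweet(binding):
    """Build '{item name or number}, {artist}, {location} {Wikidata URI}' from one result row."""

    def value(key):
        # Optional variables are simply absent from the row when unbound.
        return binding.get(key, {}).get("value", "")

    # Fall back to the collection when no location is recorded.
    place = value("locLabel") or value("collLabel")
    parts = [value("itemLabel"), value("artistLabel"), place]
    text = ", ".join(part for part in parts if part)
    return f"{text} {value('item')}".strip()


# Hypothetical row, mirroring the example above.
sample = {
    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q27513169"},
    "itemLabel": {"type": "literal", "value": "Q27513169"},
    "artistLabel": {"type": "literal", "value": "Eugène Grasset"},
    "collLabel": {"type": "literal", "value": "Museu Nacional d’Art de Catalunya"},
}
tweet = format_tweet(sample)
print(tweet)
```

A real bot would also need to keep the text within the platform's character limit.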

N.B. Images aren’t compared to each other in the real-life tweet. This is just for the blog. The design decision was not to do this, so as not to take away from the remixed piece produced by the Goblin.

The SPARQL used in-code to generate a random tweet looks like:

SELECT ?item ?itemLabel ?image ?loc ?locLabel ?coll ?collLabel ?artist ?artistLabel (MD5(CONCAT(str(?item),str(RAND()))) as ?random) WHERE {
 ?item wdt:P31 wd:Q429785.
 ?item wdt:P18 ?image.
 OPTIONAL { ?item wdt:P276 ?loc . }
 ?item wdt:P195 ?coll .
 ?item wdt:P170 ?artist .
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de,it"}
} ORDER BY ?random
LIMIT 10

In the second part of this series I mentioned being able to tweet manually. One mechanism we might use to do this is to start from a known link, so that a randomly found image isn’t needed. To do that we can use the following SPARQL:

SELECT ?itemLabel ?image ?loc ?locLabel ?coll ?collLabel ?artist ?artistLabel WHERE { 
 OPTIONAL { <http://www.wikidata.org/entity/Q27513169> rdfs:label ?itemLabel . 
 FILTER (LANG(?itemLabel) = "en") } 
 <http://www.wikidata.org/entity/Q27513169> wdt:P18 ?image . 
 OPTIONAL { <http://www.wikidata.org/entity/Q27513169> wdt:P276 ?loc . } 
 <http://www.wikidata.org/entity/Q27513169> wdt:P195 ?coll . 
 <http://www.wikidata.org/entity/Q27513169> wdt:P170 ?artist . 
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de,it"} 
} LIMIT 1
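Because the query above repeats the same entity URI five times, one way to reuse it for any item is to fill in a template. The Python sketch below does that under the assumption that the caller passes a bare item identifier such as Q27513169; the template placeholder and function name are hypothetical.

```python
import re

# Same query as above, with ITEM standing in for the entity URI.
QUERY_TEMPLATE = """SELECT ?itemLabel ?image ?loc ?locLabel ?coll ?collLabel ?artist ?artistLabel WHERE {
 OPTIONAL { ITEM rdfs:label ?itemLabel .
 FILTER (LANG(?itemLabel) = "en") }
 ITEM wdt:P18 ?image .
 OPTIONAL { ITEM wdt:P276 ?loc . }
 ITEM wdt:P195 ?coll .
 ITEM wdt:P170 ?artist .
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de,it"}
} LIMIT 1"""


def query_for_item(qid):
    """Return the lookup query for a single Wikidata item such as 'Q27513169'."""
    # Only accept plain item identifiers so nothing else ends up in the query.
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return QUERY_TEMPLATE.replace("ITEM", f"<http://www.wikidata.org/entity/{qid}>")


query = query_for_item("Q27513169")
print(query)
```

Validating the identifier first keeps arbitrary text out of the query string.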

All that is left is for the Painter Goblin to start painting!

Wikidata query service (WDQS)

You can try the SPARQL queries on the Wikidata Query Service. It’s pretty cool that there’s a tag that tells the service to preview the results as images for us:

#defaultView:ImageGrid

Give it a whirl!

