We Love All Our Data

Posted on February 27th, 2011 by Nick Jackson

It’s been a while since our last update, so I’m going to go over the key data sets which we’ll be using to drive the Jerome project. These are the collections of data which we lump together into the unified search index, as well as those which provide the supporting metrics to drive the intelligent generation of results.

The obvious one to include is the contents of our library catalogue, currently being scraped from our Horizon LMS using the HiP[1] system. This contains information on titles available within our own collections, including both physical books and ebooks as well as ‘auxiliary’ collections such as reference and dissertations. This data is supplemented (where available) from sources such as Open Library and LibraryThing to provide as rich an experience as possible.

On top of the catalogue we’re also including the contents of our institutional repository. This is a collection of papers, datasets and other useful bits and pieces of academic importance from the depths of the University. It’s also harvestable through the initially horrific-seeming but actually delightfully sensible OAI-PMH[2] standard. It’s a little slower than I expected, but it allows us to cleanly extract all the data we want regarding titles, authors, summaries and access URIs. The OAI-PMH harvester also has the handy side effect of being compatible with archiving software that the Library is proposing to acquire, so we reduce the workload required to add other sources.
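To give a flavour of why OAI-PMH turns out to be sensible, here is a minimal Python sketch of the harvesting side. It is not Jerome's actual harvester: the function names and the output field names are made up for illustration, and it assumes the repository exposes the standard `oai_dc` (Dublin Core) metadata format. The `ListRecords` verb, the `metadataPrefix` and `resumptionToken` parameters and the two XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper).

    When continuing a harvest, the resumption token is sent on its own,
    without the metadataPrefix argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Extract titles, authors, summaries and URIs from one response page.

    Returns (records, next_token); next_token is None on the last page.
    """
    root = ET.fromstring(xml_text)

    def texts(record, tag):
        # Collect the non-empty text of every matching Dublin Core element.
        return [e.text.strip() for e in record.iter(DC + tag) if e.text]

    records = []
    for record in root.iter(OAI + "record"):
        records.append({
            "titles": texts(record, "title"),
            "authors": texts(record, "creator"),
            "summaries": texts(record, "description"),
            "uris": texts(record, "identifier"),
        })

    token = root.find(".//" + OAI + "resumptionToken")
    next_token = token.text.strip() if token is not None and token.text else None
    return records, next_token
```

A full harvest would fetch `list_records_url(base)`, parse the response, and keep requesting with the returned token until it comes back as `None`. Fetching is deliberately left out so the sketch works with any HTTP client.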

Journals are up next, and this is a tricky one since many publishers (being of the old-skool “we don’t understand this ‘Internet’ thing” and “why would you want our data?” mindset) don’t tell us anything remotely useful about their journals or their contents. Fortunately for us, help is at hand from the people at Heriot-Watt in the form of JournalTOCs, a service which provides information on journals and what’s in them based purely on ISSNs[3]. Since we have a gigantic list of the ISSNs of all the journals we can access, it’s a fairly simple matter to loop through them and extract all the data we can.

JournalTOCs also, as the name suggests, provides us with tables of contents for some of these journals, meaning that we can even provide searching down to individual journal articles.
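The "loop through the ISSNs" step might look something like the Python sketch below. It is an illustration under assumptions rather than our real code: `harvest_journals` and the `fetch` callable are hypothetical names, and the actual JournalTOCs request is left to whatever `fetch` you pass in, since the details of that API aren't covered here. The one solid part is the ISSN check digit rule (weights 8 down to 2, modulo 11, with X standing for 10), which is handy for weeding out typos in a gigantic list before making any requests.

```python
import re


def is_valid_issn(issn):
    """Return True if the string is a well-formed ISSN with a correct check digit."""
    match = re.fullmatch(r"(\d{4})-?(\d{3})([\dXx])", issn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the first seven digits 8, 7, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3).upper() == expected


def harvest_journals(issns, fetch):
    """Look up every valid ISSN using the supplied fetch callable.

    `fetch` is a stand-in for whatever actually queries the journal
    service; it takes an ISSN string and returns the data for it.
    Returns (results keyed by ISSN, list of rejected ISSNs).
    """
    results, rejected = {}, []
    for issn in issns:
        if not is_valid_issn(issn):
            rejected.append(issn)
            continue
        results[issn] = fetch(issn)
    return results, rejected
```

Passing the lookup in as a function keeps the loop easy to test with a fake, and means the same loop can serve both journal details and tables of contents.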

So they’re the four big things we’re initially launching Jerome’s integrated search with: catalogue, repository, journals, and journal contents.

  1. Horizon Information Portal
  2. Open Archives Initiative Protocol for Metadata Harvesting – A standard way of getting information about the contents of an archive collection (although not the contents themselves)
  3. International Standard Serial Number – like ISBNs, but for serial publications