It’s the end of Jerome as we know it (but I feel fine)

Posted on November 28th, 2011 by Paul Stainthorp

The University of Lincoln’s Jerome project finished in August with the successful release of more than 240,000 openly-licensed bibliographic records, available via developer APIs, and a joint hack day with Cambridge University Library’s COMET project.

Now, encouraged by positive JISC feedback, both institutions (Cambridge and Lincoln jointly) have applied for follow-up project funding under the project title CLOCK. If our bid is successful, the new project will run from December 2011 to July 2012, employing a web developer based at the University of Lincoln, and distilling the work of both institutions into the development of innovative new library metadata discovery services for the scholarly community.

You can read the project proposal for CLOCK at http://lncn.eu/ijt4 – the introductory section is below.

The University of Lincoln and Cambridge University Library both delivered successful projects (Jerome and COMET) for the JISC Infrastructure for Resource Discovery Programme in 2011. This is a proposal for the continuation of and elaboration upon the work of both projects, via a programme of development work shared between the two institutions.

Throughout both projects, parallel approaches to technology and data structure were noted and commented upon. A ‘mash day’ workshop event held in Cambridge in August aimed to explore these parallels and differences, as well as areas of potential synergy; there, project members identified several points of interest to take forward.

Both projects produced outputs of interest to researchers, students, librarians, developers, and designers of bibliographic discovery environments. The CLOCK project will harness the success of these two complementary initiatives and investigate new approaches to data creation and discovery in the library domain. In particular, it will investigate, propose, and develop new, web-based bibliographic tools/APIs which will make it easier for developers, academic libraries and library end-users (especially researchers) to find Open Bibliographic Data and incorporate that data into systems and workflows.

This project is an opportunity to: [1] exploit, through real-world applications, the significant amount of data released openly by Cambridge University Library; [2] apply the Jerome database architecture, iterative development methodology, and API framework to a bibliographic dataset an order of magnitude greater than the University of Lincoln’s; and [3] build a new set of tools and demonstrator services which will enable the future development of public Open Bib Data web applications of practical utility to libraries and end-users.

The project will be supported by library consultant Owen Stephens, who will help to put the work into a national context, relating CLOCK to the wider movement toward Open Bib Data and the work of the JISC Discovery initiative. It will take place in an environment (Lincoln/Cambridge) where a culture of developer inquiry and experimentation is encouraged and nurtured. It is also endorsed by senior library management at both universities.

Both universities are involved in complementary development work which will both inform and be informed by CLOCK: at Cambridge, Ed Chamberlain is guiding the development of the JISC Open Bibliography 2 project; in Lincoln, Paul Stainthorp is lead researcher on the #jiscmrd Orbital project, which is investigating the management of research data, with some areas of overlap.

CLOCK will operate as part of the wider JISC Digital Infrastructure: Information and library infrastructure: Resource discovery, and support the recent concerted effort to move toward openly licensed library discovery in UK Higher Education and beyond.

Jerome/COMET hack day: Fun in the Fens

Posted on August 10th, 2011 by Paul Stainthorp

Here’s a photo of the CARET (Centre for Applied Research in Educational Technologies) offices at the University of Cambridge, where we held our long-awaited joint Jerome/COMET hack day on Monday 8 August. Actually, in the end, it turned out to be a kind of Jerome/COMET/SALDA/synthesis/OUseful mashup-AH!

[Photo: the Jerome/COMET hack day at CARET]

In attendance (for the record):

Train mayhem aside (in the end the Lincoln contingent didn’t arrive until nearly midday), it was a really useful day and well worth doing. Particular thanks to Ed Chamberlain and his colleagues for hosting the event and for arranging the food and refreshments. Thanks also to everyone who travelled from afar for no other reason than they love a good mashup.

Typically, the ever-prolific Tony Hirst has already managed to write up not one, but two blog posts about ideas that came out of the day:

  • Getting Library Catalogue Searches Out There…
  • Open Data Processes: the Open Metadata Laundry (N.B. this one relates specifically to Jerome – in particular, our notion of ‘scrubbing’ dodgy MARC records by taking only the identifiers plus the bare citation-only fields, and using that minimal set to grab additional free and Open data from the web, automatically creating new full versions of records that are inherently Open. ‘Metadata laundry’, me like.)

Here are three more ideas/conversations we had in Cambridge that I thought were going somewhere interesting. Yeah, we might get around to actually doing these, sometime…

1. Using COMET data to enhance Jerome

The idea: similar to the ‘metadata laundry’ above, and to the way Jerome already uses data from the Open Library, JournalTOCs, LibraryThing, etc., to enhance its book records with additional metadata. Jerome constructs a URL in the form http://data.lib.cam.ac.uk/isbn/_______, with the ISBN from the Jerome record dropped in at the end. COMET responds with a link to an open record in RDF and/or JSON, which Jerome gladly sucks in, adding any additional fields to its original source record. Enrichment ensues.
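As a rough sketch of that flow (the URL pattern is the one quoted above, but the response shape and the merge logic here are my own illustrative guesses):

    import requests

    COMET_BASE = "http://data.lib.cam.ac.uk/isbn/"  # URL pattern quoted above

    def enrich_from_comet(jerome_record):
        """Fetch COMET's open record for this ISBN and copy across any
        fields Jerome doesn't already hold. Field handling is illustrative;
        the real COMET response shape may differ."""
        isbn = jerome_record.get("isbn")
        if not isbn:
            return jerome_record
        resp = requests.get(COMET_BASE + isbn,
                            headers={"Accept": "application/json"})
        if resp.status_code != 200:
            return jerome_record  # no open record available; leave untouched
        for field, value in resp.json().items():
            jerome_record.setdefault(field, value)  # only fill in gaps
        return jerome_record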

2. Using Jerome search to ‘skin’ COMET

I called this one “Jerome Scholar” ;-) …we make use of the search aspects of Jerome (in particular the speed of Sphinx, the ‘mixing desk’ idea, and the neat record presentation) to provide a really smooth way of interacting with the much better-structured (hence “Scholar”) data that resides in COMET.

3. Using the differences between the two datasets to tell us something interesting

I have a notion that there’s something inherently useful about being able to compare two versions of a record for the ‘same’ object. If we could use Jerome+COMET to generate a web application/data feed – one that other discovery services could themselves consume – we’d have ways of ‘sparking off’ whole new avenues of discovery: from misspelled names, variant titles, different subject terms assigned by different cataloguing practices, etc. Like xISBN, but for non-standardised data(?). All right, that’s the fuzziest of the three ideas. And as the eminently sensible Owen Stephens kept asking me, “…what’s the use case?”.
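For what it’s worth, the comparison itself is trivial once both records have been flattened to comparable field names (which is the hard part, and entirely glossed over in this sketch):

    def record_differences(jerome_rec, comet_rec):
        """Return the fields on which two records for the 'same' work disagree.
        Assumes both have already been flattened to shared field names."""
        diffs = {}
        for field in set(jerome_rec) | set(comet_rec):
            a, b = jerome_rec.get(field), comet_rec.get(field)
            if a != b:
                diffs[field] = {"jerome": a, "comet": b}
        return diffs

    # e.g. {'title': {'jerome': 'The hobbit', 'comet': 'The Hobbit'}}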

And then we went to the pub.

It’s the ‘Final’ Blog-Post

Posted on August 1st, 2011 by Paul Stainthorp

[Image: sunset cowboy]

I’ve put ‘final’ in inverted commas in the title of this blog post (which should be sung—of course—to the tune of this song) – because while the JISC-funded Jerome project has indeed come to an end, Jerome itself is going nowhere. We’ll continue to tweak and develop it as an “un-project” (from whence it came), and—we sincerely hope—Jerome will lead in time, in whole or in part, to a real, live university library service of awesome.

Before we get started, though, thanks are due to the whole of the Jerome project team: Chris Leach, Dave Raines, Tim Simmonds, Elif Varol, and Joss Winn; developers Nick Jackson and Alex Bilbie (times a million); and also to people outside the University of Lincoln who have offered support and advice, including Ed Chamberlain, Owen Stephens, and our JISC programme manager, Andy McGregor.

Ssssooooo…

Just what exactly have we produced?

  1. A public-facing search portal service available at: http://jerome.library.lincoln.ac.uk/
    • Featuring search, browse, and bibliographic record views.
    • Search is provided by Sphinx.
    • A ‘mixing desk’ allows user control over advanced search parameters.
    • Each record is augmented by data from OpenLibrary (licensed under CC0) to help boost the depth and accuracy of our own catalogue. Where possible, OpenLibrary also provides our book cover images.
    • Bibliographic work pages sport COinS metadata and links to previews from Google Books.
    • Item data is harvested from the Library Management System.
    • Social tools allow sharing of works on Facebook, Twitter, etc.
  2. Openly licensed bibliographic data, available at http://data.lincoln.ac.uk/documentation.html#bib, and including:
  3. Attractive, documented, supported APIs for all data, with a timeline of data refresh cycles. The APIs will provide data in the following formats:
    1. RDF/XML
    2. JSON
    3. RIS
    4. The potential for MARC
  4. Source code for Jerome will be made Open and publicly available (after a shakedown) on GitHub.
  5. While the user interface, technical infrastructure, analytics and machine learning/personalisation aspects of Jerome have been discussed fairly heavily on the project blog, you’ll have to wait a little while for formal case studies.
  6. Contributions to community events. We presented/discussed Jerome at:

What ought to be done next?

  1. There’s a lot more interesting work to be done around the use of activity/recommendation data and Jerome. We’re using the historical library loan data both to provide user recommendations (“People who borrowed X…”), and to inform the search and ranking algorithms of Jerome itself. However, there are lots of other measures of implicit and explicit activity (e.g. use of the social sharing tools) that could be used to provide even more accurate recommendations; a toy sketch of the underlying co-occurrence counting follows this list.
  2. Jerome has concentrated on data at the bibliographic/work level. But there is potentially even more value to be had out of aggregating and querying library item data (i.e. information about a library’s physical and electronic holdings of individual copies of a bibliographic work) – e.g. using geo-lookup services to highlight the nearest available copies of a work. This is perhaps the next great untapped sphere for the open data/Discovery movement.
  3. Demonstrate use of the APIs to do cool stuff! Mashing up library data with other sets of institutional data (user profiles, mapping, calendaring data) to provide a really useful ‘portal’ experience for users. Also: tapping into Jerome for reporting/administrative purposes; for example identifying and sanitising bad data!
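The co-occurrence counting behind “People who borrowed X…” is simple enough to sketch (a toy version only; a production ranking would need weighting, recency, and privacy thought through):

    from collections import Counter, defaultdict

    def co_borrow_counts(loans):
        """loans: iterable of (user_id, item_id) pairs from historical loan data.
        Returns, for each item, a Counter of other items borrowed by the same
        users -- the raw material for 'People who borrowed X also borrowed Y'."""
        by_user = defaultdict(set)
        for user, item in loans:
            by_user[user].add(item)
        counts = defaultdict(Counter)
        for items in by_user.values():
            for x in items:
                for y in items:
                    if x != y:
                        counts[x][y] += 1
        return counts

    # Top five recommendations for one item:
    # co_borrow_counts(loans)["some-item-id"].most_common(5)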

Has Jerome’s data actually been used?

Probably not yet. We were delighted to be able to offer something up (in the form of an early, bare-bones Jerome bibliographic API) to the #discodev Developer Competition, where we still hope to see it used. Also, we are holding a post-project hack day (on 8 August 2011) with the COMET project in Cambridge to share data, code, and best practices around handling Open Data. We certainly intend to make use of the APIs internally to enhance the University of Lincoln’s own library services. If you’re interested in making use of the Jerome open data, please email me or leave a comment here.
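If you do want to experiment, consuming the API needs only a few lines; the endpoint path below is invented for illustration (the real paths, and the available formats, are in the documentation linked above):

    import requests

    # Hypothetical record lookup; see http://data.lincoln.ac.uk/documentation.html#bib
    # for the real endpoints and formats (RDF/XML, JSON, RIS).
    resp = requests.get("http://data.lincoln.ac.uk/bib/123456",
                        headers={"Accept": "application/json"})
    print(resp.json().get("title"))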

What skills did we need?

At the University of Lincoln we have been experimenting with a new (for us) way of managing development projects: the Agile method, using shared tools (Pivotal Tracker, GitHub) to allow a distributed team of developers and interested parties to work together. On a practical level, we’ve had to come to terms with matching a schemaless database architecture with traditional formats for describing resources… Nick and Alex have learned more about library standards and cataloguing practice (*cough*MARC*cough*) than they may have wished! There are also now plans to extend MongoDB training to more staff within the ICT Services department.

What did we learn along the way?

Three things to take away from Jerome:

  1. MARC is evil. But still, perhaps, a necessary evil. Until there’s a critical mass of libraries and library applications using newer, saner languages to describe their collections, developers will just have to bite down hard and learn to parse MARC records (a minimal parsing sketch follows this list). Librarians, in turn, need to accept the limitations of MARC and actively engage in developing the alternative. </lecture over>
  2. Don’t battle: use technology to find a way around licensing issues. Rather than spending time negotiating with third parties to release their data openly, Jerome took a different approach, which was to release openly those (sometimes minimal) bits of data which we know are free from third-party interest, then to use existing open data sources to enhance and extend those records.
  3. Don’t waste time trying to handle every nuance of a record. Whilst it’s important from a catalogue standpoint, people really don’t care if it’s a main title, subtitle, spine title or any other form of title when they’re searching. Perfection is a goal, but not a restriction. Releasing 40% of data and working on the other 60% later is better than aiming for 100% and never releasing anything.
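On point 1: for developers taking the plunge, getting the basics out of a file of MARC records is less painful than it sounds with a library like pymarc (a minimal sketch; the file name is made up):

    from pymarc import MARCReader

    with open("catalogue.mrc", "rb") as fh:  # file name is made up
        for record in MARCReader(fh):
            title = record["245"]["a"] if record["245"] else None  # main title
            isbn = record["020"]["a"] if record["020"] else None   # first ISBN
            print(isbn, title)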

Thanks! It’s been fun…

Paul Stainthorp
July, 2011

The Jerome Resource Model

Posted on July 31st, 2011 by Nick Jackson

One of the things which I’ve come to realise whilst working on Jerome is that library storage formats generally suck. They’re either so complex you need a manual to work out what they mean (such as my good friend MARC) or lack sufficient depth to elegantly handle what you’re trying to do without resorting to epic kludges. This is mostly down to the fact that people have assumed a storage format must also function as an exchange format which can encompass the entirety of knowledge about a resource, a foolish notion which has led to all kinds of mayhem in the past. We have catalogues to know things like the date at which the special edition bound with human skin was printed, who edited it and the colour of his eyes. A discovery service doesn’t care, and neither (I strongly suspect) do people who just want access to our data.

As a result Jerome uses its own internal format for storing data, built around what Jerome needs to know to provide the necessary search and discovery services. Since we’re storing objects in MongoDB I’ve opted for a nice clean JSON-based object model with very clear field names and structure. It throws away all kinds of ‘useful’ data from a detailed metadata standpoint, purely because it’s irrelevant to discovery. The vast majority of fields we store are completely optional (since not all resources have everything we care to know about). In a nutshell, we are capturing and storing the following data (an illustrative record appears after the list):

  • A resource’s unique Jerome ID.
  • Information on the collection the resource belongs to, including its unique ID within that collection.
  • Title, secondary title and edition.
  • An array of author names and secondary author names.
  • A date of publishing split into year, month and day.
  • An abstract or synopsis.
  • Keywords and subject categories.
  • A periodical name, volume number and issue number.
  • Availability dates, including start and stop of availability.
  • Publisher name and location.
  • ISBN or ISSN.
  • A URL to access the resource, and optionally a label for that URL.
  • A URL for direct access to the full text of the resource.
  • Name and URL for the licence information of the resource metadata.

As part of the magic of this approach we completely disassociate the individual type of resource from its representation within Jerome. We can represent pretty much any collection of resources within the same framework, without having to muck about writing custom code to turn one format into another for representation.

Modular Discovery

Posted on July 30th, 2011 by Nick Jackson

Jerome is a system which is modular by design. It comprises a variety of distinct modules which handle data collection, formatting, output, search, indexing, recommendation and more. It’s also fairly unique (as far as I can tell) in that different types of resource also occupy a modular ‘slot’ rather than being interwoven with Jerome itself – it has no differentiation at the code level between books, ebooks, dissertations, papers, journals, journal entries, websites or any other ‘resource’ which people may want to feed it.

As a result of this approach we can use Jerome as a true multi-channel resource discovery tool. All that’s required for anybody to add resources to Jerome and immediately make them searchable and recommendable is for a ‘collection’ to be created, and for them to write a bit of code which can make the following API calls as necessary (a rough HTTP sketch follows the list):

  • Create a new resource as part of the collection, telling us as much about it as they can.
  • Update an existing resource when it changes.
  • Delete a resource which is no longer available.
  • Optionally record a use of a resource against a user’s account to help build our recommendations dataset.
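In HTTP terms that could be as little as the following (every path here is a guess for the sake of illustration; the real API documentation is still to come, as noted below):

    import requests

    BASE = "http://jerome.library.lincoln.ac.uk/api/collections/lecture-slides"  # hypothetical

    # Create a new resource, describing it as fully as we can
    requests.post(BASE + "/resources",
                  json={"title": "Week 1: Introduction", "authors": ["Jackson, N."]})

    # Update an existing resource when it changes
    requests.put(BASE + "/resources/42", json={"title": "Week 1 (revised)"})

    # Delete a resource which is no longer available
    requests.delete(BASE + "/resources/42")

    # Optionally record a use against a user's account to feed recommendations
    requests.post(BASE + "/resources/42/uses", json={"user": "someuser"})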

That’s it. Got a collection of awesome lecture slides you want to feed into Jerome and instantly make known as a resource? You can do that.

We’ll have the API documentation up soon.