
It’s the ‘Final’ Blog-Post

Posted on August 1st, 2011 by Paul Stainthorp


I’ve put ‘final’ in inverted commas in the title of this blog post (which should be sung—of course—to the tune of this song) – because while the JISC-funded Jerome project has indeed come to an end, Jerome itself is going nowhere. We’ll continue to tweak and develop it as an “un-project” (from whence it came), and—we sincerely hope—Jerome will lead in time, in whole or in part, to a real, live university library service of awesome.

Before we get started, though, thanks are due to the whole of the Jerome project team: Chris Leach, Dave Raines, Tim Simmonds, Elif Varol, Joss Winn; developers Nick Jackson and Alex Bilbie, times a million; and also to people outside the University of Lincoln who have offered support and advice, including Ed Chamberlain, Owen Stephens, and our JISC programme manager, Andy McGregor.


Just what exactly have we produced?

  1. A public-facing search portal service available at:
    • Featuring search, browse, and bibliographic record views.
    • Search is provided by Sphinx.
    • A ‘mixing desk’ allows user control over advanced search parameters.
    • Each record is augmented by data from OpenLibrary (licensed under CC0) to help boost the depth and accuracy of our own catalogue. Where possible, OpenLibrary also provides our book cover images.
    • Bibliographic work pages sport COinS metadata and links to previews from Google Books.
    • Item data is harvested from the Library Management System.
    • Social tools allow sharing of works on Facebook, Twitter, etc.
  2. Openly licensed bibliographic data, available at, and including:
  3. Attractive, documented, supported APIs for all data, with a timeline of data refresh cycles. The APIs will provide data in the following formats:
    1. RDF/XML
    2. JSON
    3. RIS
    4. The potential for MARC
  4. Source code for Jerome will be made Open and publicly available (after a shakedown) on GitHub.
  5. While the user interface, technical infrastructure, analytics and machine learning/personalisation aspects of Jerome have been discussed fairly heavily on the project blog, you’ll have to wait a little while for formal case studies.
  6. Contributions to community events. We presented/discussed Jerome at:
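The OpenLibrary augmentation mentioned in item 1 can be sketched as follows. This is a minimal illustration of how a catalogue record might be enriched: the helper names and the surrounding record structure are assumptions, but the two OpenLibrary endpoints (the CC0-licensed Books API and the covers service) are the real, public ones.

```python
# Building OpenLibrary URLs for a record, keyed on ISBN. The helper names
# are illustrative; the endpoints are OpenLibrary's public Books API and
# covers service.

OL_BOOKS_API = "https://openlibrary.org/api/books"
OL_COVERS = "https://covers.openlibrary.org/b/isbn/{isbn}-{size}.jpg"

def openlibrary_lookup_url(isbn: str) -> str:
    """URL for OpenLibrary's Books API, which returns openly licensed metadata."""
    return f"{OL_BOOKS_API}?bibkeys=ISBN:{isbn}&format=json&jscmd=data"

def cover_image_url(isbn: str, size: str = "M") -> str:
    """URL for a cover image (sizes S, M or L) from OpenLibrary's covers service."""
    return OL_COVERS.format(isbn=isbn, size=size)

print(cover_image_url("9780141036144"))
```

A harvester would fetch the lookup URL for each ISBN and merge any extra fields (subjects, summaries, covers) into the local record.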

What ought to be done next?

  1. There’s a lot more interesting work to be done around the use of activity/recommendation data and Jerome. We’re using the historical library loan data both to provide user recommendations (“People who borrowed X…”) and to inform the search and ranking algorithms of Jerome itself. However, there are lots of other measures of implicit and explicit activity (e.g. use of the social sharing tools) that could be used to provide even more accurate recommendations.
  2. Jerome has concentrated on data at the bibliographic/work level. But there is potentially even more value to be had out of aggregating and querying library item data (i.e. information about a library’s physical and electronic holdings of individual copies of a bibliographic work) – e.g. using geo-lookup services to highlight the nearest available copies of a work. This is perhaps the next great untapped sphere for the open data/Discovery movement.
  3. Demonstrate use of the APIs to do cool stuff! Mashing up library data with other sets of institutional data (user profiles, mapping, calendaring data) to provide a really useful ‘portal’ experience for users. Also: tapping into Jerome for reporting/administrative purposes; for example identifying and sanitising bad data!

Has Jerome’s data actually been used?

Probably not yet. We were delighted to be able to offer something up (in the form of an early, bare-bones Jerome bibliographic API) to the #discodev Developer Competition, where we still hope to see it used. Also, we are holding a post-project hack day (on 8 August 2011) with the COMET project in Cambridge to share data, code, and best practices around handling Open Data. We certainly intend to make use of the APIs internally to enhance the University of Lincoln’s own library services. If you’re interested in making use of the Jerome open data, please email me or leave a comment here.

What skills did we need?

At the University of Lincoln we have been experimenting with a new (for us) way of managing development projects: the Agile method, using shared tools (Pivotal Tracker, GitHub) to allow a distributed team of developers and interested parties to work together. On a practical level, we’ve had to come to terms with matching a schemaless database architecture with traditional formats for describing resources… Nick and Alex have learned more about library standards and cataloguing practice (*cough*MARC*cough*) than they may have wished! There are also now plans to extend MongoDB training to more staff within the ICT Services department.

What did we learn along the way?

Three things to take away from Jerome:

  1. MARC is evil. But still, perhaps, a necessary evil. Until there’s a critical mass of libraries and library applications using newer, more sane languages to describe their collections, developers will just have to bite down hard and learn to parse MARC records. Librarians, in turn, need to accept the limitations of MARC and actively engage in developing the alternative</lecture over>.
  2. Don’t battle: use technology to find a way around licensing issues. Rather than spending time negotiating with third parties to release their data openly, Jerome took a different approach, which was to release openly those (sometimes minimal) bits of data which we know are free from third-party interest, then to use existing open data sources to enhance and extend those records.
  3. Don’t waste time trying to handle every nuance of a record. Whilst it’s important from a catalogue standpoint, people really don’t care if it’s a main title, subtitle, spine title or any other form of title when they’re searching. Perfection is a goal, but not a restriction. Releasing 40% of data and working on the other 60% later is better than aiming for 100% and never releasing anything.

Thanks! It’s been fun…

Paul Stainthorp
July, 2011

The Re-Architecting of Jerome

Posted on July 12th, 2011 by Nick Jackson

Over the past few days I’ve been doing some serious brain work about Jerome and how we best build our API layer to make it simultaneously awesomely cool and insanely fast whilst maintaining flexibility and clarity. Here’s the outcome.

To start with, we’re merging a wide variety of individual tables1 – one for each type of resource offered – into a single table which handles multiple resource types. We’ve opted to use all the fields in the RIS format as our ‘basic information’ fields, although obviously each individual resource type can extend this with its own data if necessary. This has a few benefits: first of all, we can interface with our data more easily than before, without needing to write type-specific code which translates things back to our standardised search set. As a byproduct, we can optimise our search algorithms even further, making them far more accurate and bringing them into line with generally accepted algorithms for this sort of thing. Of course, you’ll still be able to fine-tune how we search in the Mixing Deck.
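The merged record shape described above can be sketched as follows, as a minimal illustration rather than Jerome's actual schema: every resource carries a common core keyed on RIS field names (TY, TI, AU, PY and so on), and each resource type extends it with its own fields.

```python
# Sketch of a unified resource record: a common core of RIS-style fields,
# extended per resource type. Field names follow the RIS tag set; the
# helper and the 'extra' mechanism are illustrative.

def make_record(ty, ti, au, py, extra=None):
    """Build a unified resource record keyed on RIS field names."""
    record = {
        "TY": ty,   # resource type, e.g. BOOK, JOUR, THES
        "TI": ti,   # primary title
        "AU": au,   # list of authors
        "PY": py,   # publication year
    }
    if extra:       # type-specific fields extend the common core
        record.update(extra)
    return record

book = make_record("BOOK", "Example Title", ["Bloggs, J."], 2011,
                   extra={"SN": "9780000000000"})  # SN = ISBN/ISSN in RIS
```

Because every type shares the same core fields, one search routine can index books, articles and repository papers without type-specific translation code.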

To make this even easier to interface with from an admin side, we’ll be strapping some APIs (hooray!) on to this which support the addition, modification and removal of resources programmatically. What this means is that potentially anybody who has a resource collection they want to expose through Jerome can do so – they just need to make sure their collection is registered, to prevent people flooding it with nonsense that isn’t ‘approved’ as a resource. Things like the DIVERSE research project can now not only pull Jerome resource data into their interface, but also push into our discovery tool and harness Jerome’s recommendation tools. Which brings me neatly on to the next point.

Recommendation is something we want to get absolutely right in Jerome. The amount of information out there is simply staggering. Jerome already handles nearly 300,000 individual items, and we want to expand that to way more by using data from more sources, such as journal tables of contents. Finding what you’re actually after in all this can be like the proverbial needle in a haystack, and straight search can only find so much. To explore a subject further we need some form of recommendation and ‘similar items’ engine. What we’re using is an approach with a variety of angles.

At a basic level Jerome runs term extraction on any available textual content to gather a set of terms which describe the content, very similar to what you’ll know as tags. These are generated automatically from titles, synopses, abstracts and any available full text. We can then use the intersection of terms across multiple works to find and rank similar items based on how many of these terms are shared. This gives us a very simple “items like this” set of results for any item, with the advantage that it’ll work across all our collections. In other words, we can find useful journal articles based on a book, or suggest a paper in the repository which is on a similar subject to an article you’re looking for.
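The term-intersection idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions: real term extraction would run over titles, synopses, abstracts and full text, whereas here the term sets are simply given, and the item identifiers are made up.

```python
# Minimal sketch of term-overlap similarity: rank other items by how many
# extracted terms they share with the target item. Term sets and item IDs
# are illustrative.

def similar_items(target_terms, catalogue):
    """Rank catalogue items by the size of their term overlap with the target."""
    scored = []
    for item_id, terms in catalogue.items():
        shared = target_terms & terms
        if shared:
            scored.append((item_id, len(shared)))
    # most shared terms first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

catalogue = {
    "book:1":    {"software", "design", "patterns"},
    "article:7": {"software", "testing"},
    "book:9":    {"needlework", "history"},
}
ranked = similar_items({"software", "design"}, catalogue)
# book:1 shares two terms, article:7 shares one, book:9 shares none
```

Because the scoring only needs term sets, it works across collections: a book, a journal article and a repository paper are all comparable.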

We then also have a second layer, very similar to Amazon’s “people who bought this also bought…”, where we look over the histories of users who used a specific resource to find common resources. These are then added to the mix and the rankings are tweaked accordingly, providing a human twist to the similar items: results which initially seem similar but which in actuality don’t have much in common at a content level are suppressed, while results which are related but don’t have enough extracted terms for Jerome to infer this (for example, books which only have a title and for which we can’t get a summary) are pushed up to where a user will find them more easily.

Third, there’s the “people on your course also used” element, which is an attempt to make a third pass at fine-tuning the recommendation using the data we have available on which course you’re studying or which department you’re in. This is very similar to the “used this also used” recommendation, but operates at a higher level. We analyse the borrowing patterns of an entire department or course to extract both titles and semantic terms which prove popular, and then boost these titles and terms in any recommendation results set. Using this only as a ‘booster’ in most cases prevents recommendation sets from being populated with every book ever borrowed, whilst at the same time providing a more relevant response.
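The two usage-based layers just described can be sketched together: a co-occurrence count over borrowing histories ("people who used this also used…"), followed by a course-level boost that reweights, but never adds or removes, candidates. The data shapes and the boost factor are illustrative assumptions, not Jerome's actual values.

```python
# Sketch of usage-based recommendation: co-occurrence over borrowing
# histories, then a course-level 'booster'. Histories, scores and the
# boost factor are illustrative.
from collections import Counter

def co_used(item, histories):
    """Count how often other items appear in histories containing `item`."""
    counts = Counter()
    for history in histories:
        if item in history:
            counts.update(i for i in history if i != item)
    return counts

def boost_for_course(scores, popular_on_course, factor=1.5):
    """Boost (never add or remove) items popular with the user's course."""
    return {i: s * factor if i in popular_on_course else s
            for i, s in scores.items()}

histories = [{"A", "B", "C"}, {"A", "B"}, {"B", "D"}]
scores = co_used("A", histories)            # B co-occurs with A twice, C once
boosted = boost_for_course(scores, {"C"})   # C gets the course-level boost
```

Keeping the course signal as a multiplier on existing scores is what stops "every book the department ever borrowed" from flooding the results.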

So, that’s how we recommend items. APIs for this will abound, allowing external resource providers to register ‘uses’ of a resource with us for purposes of recommendation. We’re not done yet, though: recommendation has another use!

As we have historical usage data for both individuals and courses, we can throw this into the mix for searching by using semantic terms to actively move results up or down (but never remove them) based on the tags which both the current user and similar users have actually found useful in the past. This means that (as an example) a computing student searching for the author name “J Bloggs” would have “Software Design by Joe Bloggs” boosted above “18th Century Needlework by Jessie Bloggs”, despite there being nothing else in the search term to make this distinction. As a final bit of epic coolness, Jerome will sport a “Recommended for You” section where we use all the recommendation systems at our disposal to find items which other similar users have found useful, as well as which share themes with items borrowed by the individual user.
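The re-ranking described above can be sketched as a score adjustment at sort time: every result stays in the list, but matches with terms the user has found useful before float upwards. The scores, term sets and boost weight here are illustrative assumptions.

```python
# Sketch of semantic-term re-ranking: results are reordered (never removed)
# by boosting items whose terms overlap the user's useful-terms profile.
# Base scores, term sets and the weight are illustrative.

def rerank(results, useful_terms, weight=2.0):
    """Re-order search results, boosting items tagged with terms the user likes."""
    def adjusted(result):
        item_id, base_score, terms = result
        overlap = len(useful_terms & terms)
        return base_score + weight * overlap
    return sorted(results, key=adjusted, reverse=True)

results = [
    ("needlework-bloggs", 1.0, {"needlework", "history"}),
    ("software-bloggs",   1.0, {"software", "design"}),
]
top = rerank(results, useful_terms={"software"})
# a computing student's profile pushes "software-bloggs" to the top
```

Note that both results survive the re-ranking: the adjustment only moves items up or down, exactly because a boosted-but-wrong guess should never hide a genuine match.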

  1. Strictly speaking Mongo calls them Collections, but I’ll stick with tables for clarity

If I may make a suggestion?

Posted on April 26th, 2011 by Nick Jackson

One of the cool things that Jerome wants to do as part of its library access portal (give our current build a whirl) is provide relevant, accurate suggestions of books which you may find useful. This is a whopping huge statistical problem, which is causing me to dust off my A Level statistics, whip out some algorithms and start thinking about ways of actually doing it in a reliable way. This blog post is a summary of some of my thoughts on the issue, and I’d really appreciate it if people could weigh in with any suggestions or experience before I delve into the depths of coding recommendations.

Here goes. First of all, we want to use as many different sources as possible from which to derive similarity suggestions. At the moment we’ve got a few in mind:

  • People who borrowed this book also borrowed… (See how Huddersfield do it). This provides direct, human-based connections between items, weighted by both how common the combination of books is and how unique the combination is1.
  • Catalogued Subject Headings. Using the mad cataloguing skills of our ninja cataloguers to determine which books share a subject, as well as boosting these subject headings with data from OpenLibrary. The weight of a subject heading in the overall similarity ranking is inversely proportional to the number of times it’s used. This gives stronger recommendations to books which are within a smaller field of interest. We can also use subject headings to suggest similar journals.
  • Extracted Semantic Tags. As we pull OpenLibrary summaries for books and abstracts from our repository we’ll be slamming them through OpenCalais to extract semantic information on what the item is actually about. These then go into the item’s tags and (using a similar algorithm to subject headings) are weighted to find works about the same kinds of things.
  • Manual Groupings. Jerome has in-built support for ‘lists’ of items, intended to provide for things such as reading lists, collections of citations for a specific paper and so-on. We can assume that items in a list together are related, giving us a potentially huge set of manually curated similarities. To prevent too much positive reinforcement, again we’ll be weighting a link in inverse proportion to its popularity.
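Several of the sources above share the same inverse-popularity idea: the more works a subject heading (or list, or tag) is attached to, the less each shared occurrence should count towards similarity. A minimal sketch, with made-up headings and usage counts:

```python
# Sketch of inverse-popularity weighting: a shared heading contributes
# 1/(number of works using it), so rare headings give stronger links.
# Headings and counts are illustrative.

def heading_weight(heading, usage_counts):
    """Weight a shared heading in inverse proportion to how often it's used."""
    return 1.0 / usage_counts[heading]

def similarity(shared_headings, usage_counts):
    """Score a pair of works by summing the weights of headings they share."""
    return sum(heading_weight(h, usage_counts) for h in shared_headings)

usage = {"Needlework": 4, "Software design": 2}
# sharing a rare heading counts for more than sharing a common one
print(similarity({"Software design"}, usage))  # 0.5
print(similarity({"Needlework"}, usage))       # 0.25
```

The same function works for manual list co-membership: substitute list popularity for heading usage and the weighting keeps heavily reused reading lists from dominating.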


  1. Not mutually exclusive. One is how many times book A is borrowed by people who also borrowed book B; the other is how often book B is only borrowed by people who borrowed book A.