Posts Tagged ‘APIs’

Modular Discovery

Posted on July 30th, 2011 by Nick Jackson

Jerome is a system which is modular by design. It comprises a variety of distinct modules which handle data collection, formatting, output, search, indexing, recommendation and more. It’s also fairly unusual (as far as I can tell) in that different types of resource also occupy a modular ‘slot’ rather than being interwoven with Jerome itself – it has no differentiation at the code level between books, ebooks, dissertations, papers, journals, journal entries, websites or any other ‘resource’ which people may want to feed it.

As a result of this approach we can use Jerome as a true multi-channel resource discovery tool. All that’s required for anybody to add resources to Jerome and immediately make them searchable and recommendable is for a ‘collection’ to be created and for them to write a bit of code which can make the following API calls as necessary:

  • Create a new resource as part of the collection, telling us as much about it as they can.
  • Update an existing resource when it changes.
  • Delete a resource which is no longer available.
  • Optionally record a use of a resource against a user’s account to help build our recommendations dataset.

That’s it. Got a collection of awesome lecture slides you want to feed into Jerome and instantly make known as a resource? You can do that.
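Since the API documentation isn't published yet, here is only a rough sketch of the shape of those four calls. Everything in it is an assumption made for illustration: the `Collection` class, its method names and the resource fields are hypothetical, and it keeps everything in memory rather than talking to Jerome.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """In-memory stand-in for a registered collection (hypothetical, not Jerome's API)."""
    name: str
    resources: dict = field(default_factory=dict)  # resource id -> metadata
    uses: list = field(default_factory=list)       # (user id, resource id) pairs

    def create(self, resource_id, **metadata):
        # Create a new resource, telling the collection as much about it as we can.
        if resource_id in self.resources:
            raise KeyError(f"{resource_id} already exists")
        self.resources[resource_id] = dict(metadata)

    def update(self, resource_id, **changes):
        # Update an existing resource when it changes.
        self.resources[resource_id].update(changes)

    def delete(self, resource_id):
        # Delete a resource which is no longer available.
        del self.resources[resource_id]

    def record_use(self, user_id, resource_id):
        # Optionally record a use of a resource against a user's account.
        if resource_id not in self.resources:
            raise KeyError(resource_id)
        self.uses.append((user_id, resource_id))


# Example: feeding in a set of lecture slides (all values are made up).
slides = Collection("lecture-slides")
slides.create("slides-001", title="Intro to Databases", type="slides")
slides.update("slides-001", author="A. Lecturer")
slides.record_use("student-42", "slides-001")
```

The point of the sketch is just that a collection owner only needs these four operations; nothing about the resource type itself is baked in.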

We’ll have your API documentation up soon.

What did it cost and who benefits?

Posted on July 27th, 2011 by Paul Stainthorp

This is going to be one of the hardest project blog posts to write…

The costs of getting Jerome to this stage are relatively easy to work out. Under the Infrastructure for Resource Discovery programme, JISC awarded us the sum of £36,585 which (institutional overheads aside) we used to pay for the following:

  • Developer staff time: 825 hours over six months.
  • Library and project staff time: 250 hours over six months.
  • The cost of travel to a number of programme events and relevant conferences at which we presented Jerome, including this one, this one, this one, this one and this one.

As all the other aspects of Jerome—hardware, software etc.—either already existed or were free to use, that figure represents the total cost of getting Jerome to its current state.

The benefits (see also section 2.4 of the original bid) of Jerome are less easily quantified financially, but we ought to consider these operational benefits:

1. The potential for using Jerome as a ‘production’ resource discovery system by the University of Lincoln. As such it could replace our current OPAC web catalogue as the Library’s primary public tool of discovery. The Library ought also to consider Jerome as a viable alternative to the purchase of a commercial, hosted next-generation resource discovery service (which it is currently reviewing), with the potential for replacing the investment it would make in such a system with investment in developer time to maintain and extend Jerome. In addition, the Common Web Design (on which the Jerome search portal is based) is inherently mobile-friendly.

2. Related: even if the Jerome search portal is not adopted in toto, there’s real potential for using Jerome’s APIs and code (open sourced) to enhance our existing user interfaces (catalogues, student portals, etc.) by ‘hacking in’ additional useful data and services via Jerome (similar to the Talis Juice service). This could lead to cost savings: a modern OPAC would not have to be developed in isolation or tools bought in. And these enhancements are as available to other institutions and libraries as they are to Lincoln.

3. The use of Jerome as an operational tool for checking and sanitising bibliographic data. Jerome can already be used to generate lists of ‘bad’ data (e.g. invalid ISBNs in MARC records); this intelligence could be fed back into the Library to make the work of cataloguers, e-resources admin staff, etc., easier and faster (efficiency savings) and again to improve the user experience.

4. Benefits of Open Data: in releasing our bibliographic collections openly Jerome is adding to the UK’s academic resource discovery ‘ecosystem’, with benefits to scholarly activity both in Lincoln and elsewhere. We are already working with the COMET team at Cambridge University Library on a cross-Fens spin-off miniproject(!) to share data, code, and best practices around handling Open Data. Related to this are the ‘fuzzier’ benefits of associating the University of Lincoln’s name with innovation in technology for education (which is a stated aim in the University’s draft institutional strategy).

5. Finally, there is the potential for the university to use Jerome as a platform for future development: Jerome already sits in a ‘suite’ of interconnecting innovative institutional web services (excuse the unintentional alliteration!) which include the Common Web Design presentation framework, Total ReCal space/time data, lncn.eu URL shortener and link proxy, a university-wide open data platform, and the Nucleus data storage layer. Just as each of these (notionally separate) services has facilitated the development of all the others, so it’s likely that Jerome will itself act as a catalyst for further innovation.
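To give a flavour of the ‘bad data’ check mentioned in point 3 above, here is a minimal sketch of ISBN validation using the standard ISBN-10 and ISBN-13 check-digit rules. The function names and the `(record id, isbn)` input shape are assumptions for illustration; this is not Jerome's actual code.

```python
DIGITS = set("0123456789")


def is_valid_isbn(raw):
    """Return True if raw is a well-formed ISBN-10 or ISBN-13 (hyphens/spaces ignored)."""
    s = raw.replace("-", "").replace(" ", "").upper()
    if len(s) == 10:
        # ISBN-10: weights 10 down to 1; the weighted sum must be divisible by 11.
        # The final character may be X, which stands for 10.
        if not set(s[:9]) <= DIGITS or not (s[9] in DIGITS or s[9] == "X"):
            return False
        values = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
        return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0
    if len(s) == 13:
        # ISBN-13: weights alternate 1, 3; the weighted sum must be divisible by 10.
        if not set(s) <= DIGITS:
            return False
        return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s)) % 10 == 0
    return False


def bad_isbns(records):
    """Given (record id, isbn) pairs, return the pairs whose ISBN fails validation."""
    return [(rid, isbn) for rid, isbn in records if not is_valid_isbn(isbn)]
```

A list produced by something like `bad_isbns` is the kind of report that could be handed back to cataloguers.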

The Re-Architecting of Jerome

Posted on July 12th, 2011 by Nick Jackson

Over the past few days I’ve been doing some serious brain work about Jerome and how we best build our API layer to make it simultaneously awesomely cool and insanely fast whilst maintaining flexibility and clarity. Here’s the outcome.

To start with, we’re merging a wide variety of individual tables [1] – one for each type of resource offered – into a single table which handles multiple resource types. We’ve opted to use all the fields in the RIS format as our ‘basic information’ fields, although obviously each individual resource type can extend this with its own data if necessary. This has a few benefits; first of all we can interface with our data more easily than before without needing to write type-specific code which translates things back to our standardised search set. As a byproduct of this we can optimise our search algorithms even further, making them far more accurate and following generally accepted algorithms for this sort of thing. Of course, you’ll still be able to fine-tune how we search in the Mixing Deck.
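As a rough illustration of the single-table idea, the sketch below stores every resource as one document with a shared set of RIS-tagged fields plus a free-form bag of type-specific extras. Only a handful of RIS tags are shown (TY, TI, AU, PY, AB, SN), and the `make_resource` and `search_titles` helpers, the `collection` and `extra` keys and the sample data are all assumptions for illustration.

```python
# A small subset of RIS tags used as the shared 'basic information' fields.
RIS_CORE = ("TY", "TI", "AU", "PY", "AB", "SN")


def make_resource(collection, ris, **extra):
    """Build one document: shared RIS fields plus type-specific extras."""
    unknown = set(ris) - set(RIS_CORE)
    if unknown:
        raise ValueError(f"not a core RIS field in this sketch: {sorted(unknown)}")
    doc = {tag: ris.get(tag) for tag in RIS_CORE}
    doc["collection"] = collection
    doc["extra"] = extra  # anything a resource type wants to add on top
    return doc


resources = [
    make_resource(
        "catalogue",
        {"TY": "BOOK", "TI": "Software Design", "AU": ["Bloggs, Joe"], "PY": "2009"},
        shelfmark="005.1 BLO",
    ),
    make_resource(
        "repository",
        {"TY": "THES", "TI": "Design Patterns in Practice", "AU": ["Smith, Sam"]},
        department="Computing",
    ),
]


def search_titles(docs, word):
    """One search routine for every resource type: no type-specific code needed."""
    word = word.lower()
    return [d for d in docs if word in (d["TI"] or "").lower()]
```

Because a book and a thesis share the same core fields, `search_titles(resources, "design")` finds both without knowing or caring what type each one is.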

To make this even easier to interface with from an admin side, we’ll be strapping some APIs (hooray!) on to this which support the addition, modification and removal of resources programmatically. What this means is that potentially anybody who has a resource collection they want to expose through Jerome can do so; they just need to make sure their collection is registered, to prevent people flooding it with nonsense that isn’t ‘approved’ as a resource. Things like the DIVERSE research project can now not only pull Jerome resource data into their interface, but also push into our discovery tool and harness Jerome’s recommendation tools. Which brings me neatly on to the next point.

Recommendation is something we want to get absolutely right in Jerome. The amount of information out there is simply staggering. Jerome already handles nearly 300,000 individual items and we want to expand that to way more by using data from more sources such as journal tables of contents. Finding what you’re actually after in this can be like looking for the proverbial needle in a haystack, and straight search can only find so much. To explore a subject further we need some form of recommendation and ‘similar item’ engine. What we’re using is an approach with a variety of angles.

At a basic level Jerome runs term extraction on any available textual content to gather a set of terms which describe the content, very similar to what you’ll know as tags. These are generated automatically from titles, synopses, abstracts and any available full text. We can then use the intersection of terms across multiple works to find and rank similar items based on how many of these terms are shared. This gives us a very simple “items like this” set of results for any item, with the advantage that it’ll work across all our collections. In other words, we can find useful journal articles based on a book, or suggest a paper in the repository which is on a similar subject to an article you’re looking for.
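Here is a minimal sketch of that term-intersection idea, assuming the terms have already been extracted into a set per item. The function name, the item ids and the sample terms are made up for illustration, and ranking purely by the count of shared terms is the simplest possible reading of the description above.

```python
def similar_items(item_id, terms_by_item, limit=5):
    """Rank other items by how many extracted terms they share with item_id."""
    target = terms_by_item[item_id]
    scored = []
    for other, terms in terms_by_item.items():
        if other == item_id:
            continue
        shared = len(target & terms)  # size of the term intersection
        if shared:
            scored.append((other, shared))
    # Most shared terms first; ties broken by id so the order is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]


# Items from different collections live side by side (sample data).
terms_by_item = {
    "book-1": {"software", "design", "patterns", "testing"},
    "article-7": {"software", "design", "agile"},
    "paper-3": {"design", "typography"},
    "book-9": {"needlework", "history"},
}

print(similar_items("book-1", terms_by_item))
# [('article-7', 2), ('paper-3', 1)]
```

Note that the book's nearest neighbours here are an article and a paper, which is the cross-collection behaviour described above.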

We then also have a second layer very similar to Amazon’s “people who bought this also bought…”, where we look over the history of users who used a specific resource to find common resources. These are then added to the mix and the rankings are tweaked accordingly, providing a human twist to the similar items by suppressing results which initially seem similar but which in actuality don’t have much in common at a content level, and pushing results which are related but which don’t have enough terms extracted for Jerome to infer this (for example books which only have a title and for which we can’t get a summary) up to where a user will find them more easily.
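A sketch of how this second layer might be mixed in, under some loud assumptions: usage history is a simple map of user to the set of resources they used, and the two signals are blended by adding a weighted co-use count to the term score. The additive blend and the `weight` parameter are illustrative choices, not a description of how Jerome actually tweaks its rankings.

```python
from collections import Counter


def co_used(item_id, history):
    """Count how often each other resource was used by people who used item_id."""
    counts = Counter()
    for used in history.values():
        if item_id in used:
            counts.update(used - {item_id})
    return counts


def blend(term_scores, co_counts, weight=1.0):
    """Combine term-overlap scores with co-use counts (additive blend, an assumption)."""
    items = set(term_scores) | set(co_counts)
    combined = {
        i: term_scores.get(i, 0) + weight * co_counts.get(i, 0) for i in items
    }
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


# Sample data: book-2 has no extracted terms, but is often used with book-1.
history = {
    "u1": {"book-1", "book-2"},
    "u2": {"book-1", "book-2", "article-7"},
    "u3": {"book-9"},
}
term_scores = {"article-7": 2, "paper-3": 1}

ranked = blend(term_scores, co_used("book-1", history))
```

In this toy data `book-2` would never appear on term overlap alone, but co-use lifts it above `paper-3`, which shares a term yet was never used alongside `book-1`.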

Third of all in recommendation there’s the “people on your course also used” element, which is an attempt to make a third pass at fine-tuning the recommendation using data we have available on which course you’re studying or which department you’re in. This is very similar to the “used this also used” recommendation, but operating at a higher level. We analyse the borrowing patterns of an entire department or course to extract both titles and semantic terms which prove popular, and then boost these titles and terms in any recommendation results set. By only using this as a ‘booster’ in most cases it prevents recommendation sets from being populated with every book ever borrowed whilst at the same time providing a more relevant response.
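The ‘booster’ behaviour can be sketched as a re-scoring pass that only touches candidates already in the result set and never adds new ones. The multiplicative boosts, their default values and all the names below are assumptions for illustration.

```python
def course_boost(ranked, terms_by_item, popular_titles, popular_terms,
                 title_boost=2.0, term_boost=1.1):
    """Re-rank existing candidates using what is popular on a course.

    ranked is a list of (item id, score). Nothing is added to the set, so the
    results are never flooded with every book the course has ever borrowed.
    """
    boosted = []
    for item, score in ranked:
        factor = 1.0
        if item in popular_titles:
            factor *= title_boost
        shared = len(terms_by_item.get(item, set()) & popular_terms)
        factor *= term_boost ** shared
        boosted.append((item, score * factor))
    boosted.sort(key=lambda pair: (-pair[1], pair[0]))
    return boosted


# Sample data: book-2 is widely borrowed on the course; 'design' is a popular term.
candidates = [("article-7", 3.0), ("book-2", 2.0), ("paper-3", 1.0)]
item_terms = {
    "article-7": {"software", "design", "agile"},
    "book-2": set(),
    "paper-3": {"design", "typography"},
}

result = course_boost(candidates, item_terms,
                      popular_titles={"book-2"}, popular_terms={"design"})
```

Here `book-2` moves to the top because the course borrows it heavily, while a popular book that wasn't already a candidate would simply not appear.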

So, that’s how we recommend items. APIs for this will abound, allowing external resource providers to register ‘uses’ of a resource with us for purposes of recommendation. We’re not done yet though: recommendation has another use!

As we have historical usage data for both individuals and courses, we can throw this into the mix for searching by using semantic terms to actively move results up or down (but never remove them) based on the tags which both the current user and similar users have actually found useful in the past. This means that (as an example) a computing student searching for the author name “J Bloggs” would have “Software Design by Joe Bloggs” boosted above “18th Century Needlework by Jessie Bloggs”, despite there being nothing else in the search term to make this distinction. As a final bit of epic coolness, Jerome will sport a “Recommended for You” section where we use all the recommendation systems at our disposal to find items which other similar users have found useful, as well as which share themes with items borrowed by the individual user.
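The “J Bloggs” example can be sketched as a re-scoring step over search results. The assumptions here are that each result carries a base score and a set of terms, that the user's history has been boiled down to a set of terms, and that the boost is a simple multiplier per shared term; all names and numbers are made up.

```python
def personalise(results, user_terms, boost=0.5):
    """Move results up or down by overlap with a user's terms; never remove any.

    results is a list of (title, base score, terms).
    """
    rescored = []
    for title, score, terms in results:
        overlap = len(terms & user_terms)
        rescored.append((title, score * (1 + boost * overlap)))
    # sort() is stable, so results with equal scores keep their original order.
    rescored.sort(key=lambda pair: -pair[1])
    return rescored


# A search for the author "J Bloggs" matches both titles equally well.
results = [
    ("18th Century Needlework by Jessie Bloggs", 1.0, {"needlework", "history"}),
    ("Software Design by Joe Bloggs", 1.0, {"software", "design"}),
]
computing_student = {"software", "programming", "design"}

for title, score in personalise(results, computing_student):
    print(f"{score:.1f}  {title}")
# 2.0  Software Design by Joe Bloggs
# 1.0  18th Century Needlework by Jessie Bloggs
```

Both results are still returned; only their order changes.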

  1. Strictly speaking Mongo calls them Collections, but I’ll stick with tables for clarity

Quick update

Posted on June 20th, 2011 by Alex Bilbie

Just a quick update on what we’re working on in Jerome over the next week or so:

We’ll be merging the types of items Jerome indexes into one database table (or MongoDB collection in our case); at the moment books, journals and repository items are in their own tables (collections). This won’t be a big job; it’s mostly a case of altering our import cron jobs.

Once the merge has happened then we can start fleshing out our final APIs. There are already a few that we’ve made but they don’t integrate with our security model yet, and also don’t output in a consistent format. And for those open data fans, yes we will have *some* RDF outputs…

Over the weekend I spent some time looking at how we’re going to add a personalisation layer into Jerome. Graph databases seem to be the way to go as these allow us to do some funky queries like “return all books that students that are in the Software Engineering and Hardware modules borrowed over the last two years ordered by average rating then ordered by the number of times they were borrowed”. We’re going to play with Neo4j as I’ve seen some cool tutorials which cover our use case.
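To pin down what that example query actually asks for, here it is spelled out over plain Python data structures rather than a graph database. This is not Neo4j code; the data layout, the names and the reading of “in the Software Engineering and Hardware modules” as “enrolled on both” are all assumptions.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean

# Sample data (made up): student -> modules, loans, and ratings per book.
enrolments = {
    "alice": {"Software Engineering", "Hardware"},
    "bob": {"Software Engineering"},
    "carol": {"Software Engineering", "Hardware"},
}
loans = [
    ("alice", "Book A", date(2011, 1, 10)),
    ("carol", "Book A", date(2010, 5, 1)),
    ("alice", "Book B", date(2010, 11, 2)),
    ("bob", "Book C", date(2011, 2, 1)),    # bob is not on both modules
    ("carol", "Book D", date(2008, 1, 1)),  # outside the two-year window
]
ratings = {"Book A": [4, 4], "Book B": [5], "Book C": [5], "Book D": [3]}


def module_books(modules, today, years=2):
    """Books borrowed in the last `years` years by students on all given modules,
    ordered by average rating, then by number of loans (both descending)."""
    cutoff = today - timedelta(days=365 * years)
    students = {s for s, taken in enrolments.items() if modules <= taken}
    counts = Counter(
        book for student, book, when in loans
        if student in students and when >= cutoff
    )
    return sorted(counts, key=lambda b: (-mean(ratings.get(b, [0])), -counts[b]))


print(module_books({"Software Engineering", "Hardware"}, today=date(2011, 6, 20)))
# ['Book B', 'Book A']
```

The appeal of a graph database is that the student, module, loan and rating hops above become a single traversal instead of hand-written joins.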

Three quarks for Muster MARC!

Posted on April 21st, 2011 by Paul Stainthorp

My esteemed, gracious and talented colleague Mr. Jackson is not happy.

He’s not happy because I’ve asked him to do something which he thinks is an awful, depressing, retrograde step. I’ve asked him to add a MARC export function to Jerome.

Nick’s argument in a nutshell (he won’t mind me paraphrasing):

  • MARC is awful: truly awful. It’s holding back humanity’s (and libraries’) progress. We shouldn’t be doing anything to prolong its life. #marcmustdie

My argument in a nutshell:

  • For better or worse, libraries still use MARC, and this will be a useful facility for libraries who want to consume our open data straight into their existing Library Management Systems.

What does the studio audience think? Should Jerome serve up MARC (actually, MARCXML. I’m not a monster.) because someone, somewhere might want to consume it, or should we take a stand and insist on providing only decent, sane data formats from now on?
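For the curious, here is a very small sketch of what a MARCXML export could involve, using only Python's standard library. It emits just three common fields (020 $a for ISBN, 100 $a for the main author, 245 $a for the title) in the MARC21 ‘slim’ namespace, and deliberately leaves out the leader and control fields that a complete record would need. The input dictionary and function names are assumptions, not Jerome's export code.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)


def add_datafield(record, tag, ind1, ind2, code, value):
    """Append one datafield with a single subfield to the record element."""
    field = ET.SubElement(
        record, f"{{{MARC_NS}}}datafield", {"tag": tag, "ind1": ind1, "ind2": ind2}
    )
    sub = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
    sub.text = value
    return field


def to_marcxml(resource):
    """Serialise a simple resource dict as a (partial) MARCXML record string."""
    record = ET.Element(f"{{{MARC_NS}}}record")
    author = resource.get("author")
    if resource.get("isbn"):
        add_datafield(record, "020", " ", " ", "a", resource["isbn"])
    if author:
        add_datafield(record, "100", "1", " ", "a", author)
    # 245 first indicator: 1 when there is a main entry (1XX field), else 0.
    add_datafield(record, "245", "1" if author else "0", "0", "a", resource["title"])
    return ET.tostring(record, encoding="unicode")


xml_text = to_marcxml(
    {"title": "Software Design", "author": "Bloggs, Joe", "isbn": "9780306406157"}
)
```

Even this toy version hints at why the format divides opinion: the meaning lives in numeric tags, indicators and subfield codes rather than in readable field names.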

For anyone who’s blissfully unaware of MARC (MAchine-Readable Cataloging) formats, read this. Then read this, this, and this. Then go and have a lie down in a darkened room.

I don’t love MARC. More than anything, I don’t really understand it (I have a cataloguer to do that for me). But it still has currency in libraries. #shouldmarcdie?