Archive for the ‘Behind The Scenes’ Category

It’s starting to look useful…

Posted on November 26th, 2010 by Nick Jackson

It is with great pleasure and a little bit of excitement that I would like to bring you up to date on the latest developments from the land of Jerome.

First of all, our looping catalogue import system is now up and running properly. This system literally starts at the first record in our catalogue and slowly but steadily checks and imports every single one, at the rate of 45 a minute. The entire import of our current catalogue completes in a little under 5 days, and as soon as it’s done the system starts the process again. This means that although Jerome isn’t showing you the ‘live’ catalogue, it’s never more than a few days out of date and most of what changes in the catalogue is just housekeeping and fixing ‘wrong’ records. New stock is automatically detected and added, so the act of getting data into Jerome is now fully automated. We’ve already completed one whole round, and we’re currently around 10,000 into the second one.
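For the curious, the numbers above can be sanity-checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the 45 a minute rate comes from the post, while the function names are made up for illustration and the "5 days" figure is treated as an upper bound.

```python
RECORDS_PER_MINUTE = 45  # rate quoted in the post


def records_per_day(per_minute):
    """Records checked in 24 hours at a steady per-minute rate."""
    return per_minute * 60 * 24


def days_for(record_count, per_minute):
    """Days needed to walk through record_count records at that rate."""
    return record_count / records_per_day(per_minute)


per_day = records_per_day(RECORDS_PER_MINUTE)
# A cycle of "a little under 5 days" therefore covers at most this many records.
upper_bound = per_day * 5

print(per_day, upper_bound)
```

At 45 a minute that works out to 64,800 records a day, so a cycle of just under 5 days covers somewhat fewer than 324,000 record numbers.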

As part of the import process we now automatically grab free book covers from LibraryThing and cache a local copy. Although this is by no means complete and doesn’t offer a cover for everything, it is a start, and shows just how useful the rest of the world can be in filling out our information. Covers (where available) exist for bib numbers over 272,000(ish) and under 10,000(ish), and as part of the looping import they will appear over the next week for items in the middle. We’re also looking at other cover providers such as OpenLibrary or Amazon to help boost the quality and quantity of covers, but due to restrictive licensing we’re having to tread carefully. Book covers will be used more liberally in some future features, including things such as a ‘looks like’ cover finder (using the power of perceptual hashes) to help find that book whose name you can’t remember but whose cover you can see in your mind.
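To give a flavour of what a perceptual hash does, here is a minimal sketch of one common variant, the "average hash". It is not necessarily the algorithm Jerome will use. It assumes the cover has already been shrunk to a tiny greyscale grid (a list of rows of brightness values), and the function names are invented for this example.

```python
def average_hash(pixels):
    """Build a bit string: 1 where a pixel is brighter than the mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; small distances suggest similar-looking images."""
    return bin(hash_a ^ hash_b).count("1")


# A made-up 4x4 "cover": dark on the left, light on the right.
cover = [[10, 10, 200, 200] for _ in range(4)]
# The same cover scanned slightly brighter.
brighter = [[20, 20, 210, 210] for _ in range(4)]

print(hamming_distance(average_hash(cover), average_hash(brighter)))
```

Because the hash only records which pixels are brighter than average, a uniformly brighter scan of the same cover hashes to the same bits, which is the property a ‘looks like’ finder relies on.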

Read the rest of this entry »

Putting the “I” back in Library

Posted on November 5th, 2010 by Nick Jackson

It’s been a while since anyone posted about what’s been going on with Jerome, mostly because those pesky students keep taking up valuable messing about time with fiddling little problems like being unable to log in. Okay, I jest. We love students really, since their complaining drives so many of the things we want to do.

First of all, epic backend work has been going on to make our Horizon to Jerome import path a bit slicker. Through a bit of inspiration from Dave Pattern, some XML voodoo, some juggling of arrays, a clever scheduled task and a plain text file I’ve been able to get Horizon imports happening on a rolling basis. It takes us a little under 7 and a half days (7.41 if you really care) to complete a full cycle of imports, iterating through every potential record number in our catalogue to find out if there’s anything useful. It’s not the most efficient method (we’ll build in some smart blank record skipping in a future version), but it does stop us from melting the server with a massive bulk export. At the moment this is throttled back to around half of its theoretical maximum rate whilst we test it, but by the end of the month we’re hoping to have import cycles running at under 5 days, and under 4 by Christmas.
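The general shape of a rolling import like this can be sketched in a few lines. This is an illustration rather than the actual Jerome code (which isn’t shown here): the plain text file is assumed to hold the next record number to try, and `fetch_record`, `store_record` and the other names are hypothetical stand-ins for the Horizon lookup and the database write.

```python
def read_checkpoint(path, default=1):
    """Read the next record number from the plain text file, if there is one."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return default


def write_checkpoint(path, record_number):
    """Remember where the next scheduled run should pick up."""
    with open(path, "w") as handle:
        handle.write(str(record_number))


def run_batch(path, fetch_record, store_record, batch_size, max_record):
    """One scheduled run: try batch_size record numbers, wrapping at the end."""
    current = read_checkpoint(path)
    imported = 0
    for _ in range(batch_size):
        record = fetch_record(current)  # None means a blank record number
        if record is not None:
            store_record(current, record)
            imported += 1
        # Loop back to the first record once the whole catalogue has been walked.
        current = current + 1 if current < max_record else 1
    write_checkpoint(path, current)
    return imported
```

Each scheduled run handles a small, throttled batch and then records its position, so no single run ever asks the server for a massive bulk export.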

We’ve also started work on our Journals indexing. This is a bit more tricky due to the lack of open information for a lot of journals, but by tapping into Journal TOCs we can get hold of a fair bit of journal information and tables of contents, allowing article-level searching of all our available resources. As with the catalogue import this is a rolling import process, so things may not appear for a day or two.

For all resources (catalogue and journals) we’re taking a look at what open data we can grab from elsewhere on the internet to bolster search results. We’d really like to be able to grab summaries, abstracts and synopses wherever possible (it’s something else to search through) but there are a few licensing issues we need to look at in more detail. Regardless, we will soon be running all our available search content through term extractors to automatically generate keywords.
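As a rough idea of what a term extractor does, here is a deliberately simple sketch that picks the most frequent non-trivial words from a piece of text. Real term extractors are considerably smarter; the stopword list and the function name here are assumptions made purely for illustration.

```python
import re
from collections import Counter

# A tiny, made-up stopword list; a real one would be far longer.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent words, ignoring stopwords and short words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        word for word in words if word not in STOPWORDS and len(word) > 2
    )
    return [word for word, _ in counts.most_common(top_n)]


print(extract_keywords("The library catalogue and the library search", top_n=2))
```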

Finally on the backend, I’ve made some sweeping changes to how search reindexes content (it’s now fully automated and bitching fast), and some tweaks to our search API to support weighting data (so your results really are more relevant, and we don’t give things like stemmed words and metaphones the same search priority as your original text) and our upcoming relevancy engine (more below).
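The idea behind weighting can be shown with a toy scorer. This is not how Sphinx or our search API does it internally; it is a sketch of the principle that an exact match on your original text should count for more than a match on a stemmed form. The weights, the crude suffix-stripping stemmer and the function names are all invented for this example.

```python
# Hypothetical weights: exact matches count far more than stemmed ones.
WEIGHTS = {"exact": 10, "stem": 3}


def crude_stem(word):
    """A toy stemmer that strips a few common suffixes; real stemmers do much more."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def score(query, text):
    """Score text against query, preferring exact matches over stemmed ones."""
    words = text.lower().split()
    stems = {crude_stem(word) for word in words}
    total = 0
    for term in query.lower().split():
        if term in words:
            total += WEIGHTS["exact"]
        elif crude_stem(term) in stems:
            total += WEIGHTS["stem"]
    return total


print(score("books", "books about cats"), score("books", "a book about cats"))
```

A title containing exactly what you typed outranks one that only matches after stemming, and something with no match at all scores nothing.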

Read the rest of this entry »

All together now…

Posted on September 23rd, 2010 by Nick Jackson

Tonight Alex and I have been working on two different, yet equally awesome parts of Jerome. Alex has already written about his bit of geocode awesome on his blog, so I’m going to ramble on about our integrated searching.

At the moment, searching our library systems can be a bit of a pain, since our ‘catalogue’ is actually a collection of independent systems. Want to search journals from the main catalogue? Sorry, can’t be done. So, in true Jerome style, we’ve decided to fix it in a slightly more awesome way. By taking the individual lists of items and combining them in Sphinx (discussed previously as being a bit fast) we can slam through multiple indexes as though they were one, carefully blending the weighting of individual elements and indexes to try to provide the results that people want, be they books, ebooks, journals, articles, items in our repository or potentially anything else the Library looks after. At the moment we’re working from our testing subset of the main catalogue (although with the Horizon voodoo of Dave Pattern over at Huddersfield this is soon going to become the whole thing) and a set of journal articles which Journal TOCs says we have access to. Extra sources (like our repository, and a ‘journal’ vs ‘journal TOC’ list) will be coming online soon.

The result is a smoothly integrated search solution which you can try out on our demo site. One box, two sources, real-time, no worries.
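The blending of several indexes into one result list can be pictured with a small sketch. Sphinx handles this for us, so the code below is only an illustration of the principle, with made-up index weights, scores and names.

```python
# Hypothetical per-index weights; tuning these is where the "careful blending" happens.
INDEX_WEIGHTS = {"catalogue": 1.0, "journals": 0.8}


def blend(results_by_index, index_weights, limit=10):
    """Merge hits from several indexes into one list ranked by weighted score."""
    merged = []
    for index_name, hits in results_by_index.items():
        weight = index_weights.get(index_name, 1.0)
        for title, raw_score in hits:
            merged.append((raw_score * weight, index_name, title))
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:limit]


# Made-up hits from two separate indexes.
results = {
    "catalogue": [("Book A", 5.0), ("Book B", 2.0)],
    "journals": [("Article X", 5.0)],
}
for weighted_score, source, title in blend(results, INDEX_WEIGHTS):
    print(source, title, weighted_score)
```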

The Slides

Posted on July 30th, 2010 by Nick Jackson

Here are the slides from my presentation at Chips and Mash in Huddersfield.

If you’re curious for a bit more reading on MongoDB and Sphinx, the two systems at the core of Jerome, read on.

Read the rest of this entry »

Engage Ludicrous Speed!

Posted on July 23rd, 2010 by Nick Jackson

One of our key aims for Jerome is for the whole thing to be fast. Not “the average search should complete in under a second” fast, but “your application should be fine to hit us with 50 queries a second” fast.

This requirement was one of the key factors in our decision to use MongoDB as our backend database, and to provide search using Sphinx. We’ll have another blog post fairly soon with more detail on how we’re using Mongo and Sphinx to store, search and retrieve data, but for now I’d like to share some preliminary numbers on how close we are to our goal of speed.

First of all, getting data in. This is a pain in the backside because the MARC-21 specification is so complex and we need to perform several repetitive checks on the data to make sure we’re importing it right. However, on the import side of things we’re in the region of importing 150 MARC records a second, including parsing, filtering, mapping fields and finally getting the data into the database. This is done using the File_MARC PEAR library to manage the actual parsing of the MARC data into a set of arrays, then some custom PHP to extract information like title, author, publisher etc. into a more readily understood format. This information extraction isn’t complete yet so it’s likely that there’ll be a bit of a slowdown as we add more translation rules, but equally it hasn’t been optimised for speed yet.
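The field mapping step can be sketched roughly as follows. The real thing is custom PHP on top of File_MARC, which isn’t shown here, so this Python version is purely illustrative: it assumes a record has already been parsed into a dictionary of MARC tags, each holding a list of subfield dictionaries, and the `FIELD_MAP` table and `extract` function are invented names. The tags themselves (245$a for title, 100$a for the main author, 260$b for publisher) are standard MARC 21.

```python
# Friendly field name -> (MARC tag, subfield code). A real rule set would be much longer.
FIELD_MAP = {
    "title": ("245", "a"),
    "author": ("100", "a"),
    "publisher": ("260", "b"),
}


def extract(record):
    """Pull a few friendly fields out of an already-parsed MARC record."""
    friendly = {}
    for name, (tag, code) in FIELD_MAP.items():
        for field in record.get(tag, []):
            value = field.get(code)
            if value:
                # MARC values often carry trailing punctuation such as " /" or ",".
                friendly[name] = value.strip(" /:,;")
                break
    return friendly
```

Given a made-up record with a 245$a of "Three men in a boat /", this returns a title of "Three men in a boat", and any field missing from the record is simply left out of the result.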

Read the rest of this entry »