Engage Ludicrous Speed!

Posted on July 23rd, 2010 by Nick Jackson

One of our key aims for Jerome is for the whole thing to be fast. Not “the average search should complete in under a second” fast, but “your application should be fine to hit us with 50 queries a second” fast.

This requirement was one of the key factors in our decision to use MongoDB as our backend database, and to provide search using Sphinx. We’ll have another blog post fairly soon with more detail on how we’re using Mongo and Sphinx to store, search and retrieve data, but for now I’d like to share some preliminary numbers on how close we are to our speed goal.

First of all, getting data in. This is a pain in the backside due to the MARC-21 specification being so complex and needing to perform several repetitive checks on data to make sure we’re importing it right. However, on the import side of things we’re in the region of importing 150 MARC records a second, including parsing, filtering, mapping fields and finally getting the data into the database. This is done using the File_MARC PEAR library to manage the actual parsing of the MARC data into a set of arrays, then some custom PHP to extract information like title, author, publisher etc. into a more readily understood format. This information extraction isn’t complete yet, so it’s likely that there’ll be a bit of a slowdown as we add more translation rules, but equally it hasn’t been optimised for speed yet.
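To give a feel for the "mapping fields" step, here is a minimal sketch in Python rather than our PHP. It is not the actual Jerome code: the rule table, the `extract_fields` helper and the sample record are all made up for illustration, and the only thing taken from the MARC 21 standard is where title (245$a), main author (100$a) and publisher (260$b) normally live.

```python
# Hypothetical translation rules: friendly name -> (MARC tag, subfield code).
# The tag/subfield locations are standard MARC 21; the table itself is an
# assumption, not the real Jerome rule set.
TRANSLATION_RULES = {
    "title": ("245", "a"),
    "author": ("100", "a"),
    "publisher": ("260", "b"),
}


def extract_fields(record, rules=TRANSLATION_RULES):
    """Flatten a parsed record (tag -> list of subfield dicts) into a dict."""
    result = {}
    for name, (tag, code) in rules.items():
        for field in record.get(tag, []):
            value = field.get(code)
            if value:
                # MARC values often end in cataloguing punctuation; trim it.
                result[name] = value.rstrip(" /:,;")
                break  # first matching field wins in this sketch
    return result


# A made-up record, shaped like the arrays a MARC parser might hand back.
sample_record = {
    "100": [{"a": "Author, Example,"}],
    "245": [{"a": "An example title /", "c": "Example Author."}],
    "260": [{"a": "Lincoln :", "b": "Example Press,", "c": "2010."}],
}

print(extract_fields(sample_record))
```

Every extra rule adds another lookup per record, which is where the expected slowdown would come from as the rule set grows.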

At the moment we’ve extracted around 32,000 entries from our catalogue (about a 4 minute import process) for use in testing. We can get the database to dump the relevant information for indexing as an XML file in under a second. We can then feed this file into Sphinx for indexing, a process taking around 0.3 of a second. Once indexed, searches are really fast. So fast, in fact, that when we search Sphinx from the command line it says they’re completed in 0.000 seconds (really). A time that rounds to 0.000 must be below 0.0005 seconds, so quick maths means that we’re completing a full search in under 1/2000th of a second.
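For anyone who wants to check the quick maths, here it is spelled out as a small Python sketch. It only uses the figures quoted in this post, and it assumes the 0.000 reading is rounded to three decimal places; if the tool truncates instead, the safe bound is 1/1000th of a second rather than 1/2000th.

```python
# Figures quoted in the post.
records = 32_000          # entries extracted for testing
import_rate = 150         # records imported per second

import_seconds = records / import_rate
import_minutes = import_seconds / 60
print(f"import time: {import_seconds:.0f}s, roughly {import_minutes:.1f} minutes")

# A duration printed as 0.000 (three decimal places). Assuming it is
# rounded, the true value must sit below half of the last displayed digit.
display_resolution = 0.001
upper_bound = display_resolution / 2
print(f"search time is under {upper_bound}s, i.e. 1/{round(1 / upper_bound)} of a second")
```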

“Aha!” you may cry. “How complex are your queries?” To be honest, not massively. Until we get loads more data loaded and a better MARC to Real World translation working, we’re somewhat limited in what we can search. However, we’ve tried prodding a few of the more complex queries we can do (including wildcards, boolean search terms, required terms, stemmed words and sound-alikes) through the system and they all still come out at under 1/2000th of a second. In short, it’s still fast.

“Aha!” I hear you cry again. “People aren’t going to be sat at the terminal waiting for a response direct from the server! How long does it take for a ‘real’ request?” It’s a good question, so we decided to find out. Running a complete API-level query (sending the request to the server, completing the search, retrieving the records from the database, formatting them, replying and the results arriving at the destination) takes about 32ms. In other words, that’s about 31 serial (i.e. one after the other) queries a second. We also knocked up a parallel test which placed several requests at the same time, and all the numbers were about the same, which implies the majority of the time is spent waiting for requests to travel across the network.
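A rough sketch of that kind of serial-versus-parallel comparison might look like the following. This is not our actual test script: `fake_query` is a hypothetical stand-in that simply sleeps for the 32ms we measured, so it models a request that spends its time waiting on the network rather than talking to the real API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MEASURED_LATENCY = 0.032  # seconds per API-level query, from the post


def serial_rate(latency_seconds):
    """Queries per second when each request waits for the previous one."""
    return 1.0 / latency_seconds


def fake_query(latency_seconds=MEASURED_LATENCY):
    """Hypothetical stand-in for a real request: just waits out the round trip."""
    time.sleep(latency_seconds)


def timed_query(_=None):
    """Time a single request from send to response."""
    start = time.perf_counter()
    fake_query()
    return time.perf_counter() - start


def mean_latency(n_requests, workers):
    """Average per-request time with `workers` requests in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(timed_query, range(n_requests)))
    return sum(latencies) / len(latencies)


if __name__ == "__main__":
    print(f"serial rate: {serial_rate(MEASURED_LATENCY):.2f} queries/second")
    print(f"mean latency, 1 at a time: {mean_latency(8, 1) * 1000:.1f}ms")
    print(f"mean latency, 8 at a time: {mean_latency(8, 8) * 1000:.1f}ms")
```

If per-request latency stays flat as the number of simultaneous requests goes up, the server isn't the bottleneck, which is the pattern we saw.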

Now, all these numbers are very rough and ready. Once we get the system more stable and robust, with more information in it, we’ll build some proper testing harnesses to see how it behaves under various conditions. For the moment, though, I think that we’re doing quite well for an unoptimised system.

5 Responses to “Engage Ludicrous Speed!”

  1. [...] a slightly more awesome way. By taking the individual list of items and combining them in Sphinx (discussed previously as being a bit fast) we can slam through multiple indexes as though they were one, carefully [...]

  2. [...] from Portal, we’ve got a good set of data to be getting on with. During our playing around with Sphinx for Jerome I discovered that it can support multiple distributed indexes, making it [...]

  3. [...] This post was mentioned on Twitter by Joss Winn, Uni of Lincoln Blogs. Uni of Lincoln Blogs said: Engage Ludicrous Speed! http://ff.im/-oatx0 [...]

  4. Nick Jackson says:

    Yeah, there are a lot of things which we’re learning on how to optimise the indexing and retrieval methods. There’s also an unanticipated bonus effect in that Sphinx supports distributed indexing across multiple servers, so we can combine multiple indexes to get ‘universal search’ working more easily.

  5. Joss Winn says:

    Speed freak :-) What are you picking up now that will cross over with TotalReCal? Sounds like there’s some shared benefits here. Good work.
