One of the cool things that Jerome wants to do as part of its look at the library access portal (give our current build a whirl) is provide relevant, accurate suggestions of books which you may find useful. This is a whopping huge piece of statistics, which is causing me to dust off my A Level statistics, whip out some algorithms and start thinking about ways of actually doing it reliably. This blog post is a summary of some of my thoughts on the issue, and I’d really appreciate it if people could weigh in with any suggestions or experience before I delve into the depths of coding recommendations.
Here goes. First of all, we want to use as many different sources as possible from which to derive similarity suggestions. At the moment we’ve got a few candidate sources:
- People who borrowed this book also borrowed… (See how Huddersfield do it). This provides direct, human-based connections between items, which are weighted both by how common the combination of books is and by how unique the combination is [1].
- Catalogued Subject Headings. Using the mad cataloguing skills of our ninja cataloguers to determine which books share a subject, as well as boosting these subject headings with data from OpenLibrary. The weight of a subject heading in the overall similarity ranking is inversely proportional to the number of times it’s used. This gives stronger recommendations to books which are within a smaller field of interest. We can also use subject headings to suggest similar journals.
- Extracted Semantic Tags. As we pull OpenLibrary summaries for books and abstracts from our repository we’ll be slamming them through OpenCalais to extract semantic information on what the item is actually about. These then go into the item’s tags and (using a similar algorithm to subject headings) are weighted to find works about the same type of things.
- Manual Groupings. Jerome has in-built support for ‘lists’ of items, intended to provide for things such as reading lists, collections of citations for a specific paper and so on. We can assume that items in a list together are related, giving us a potentially huge set of manually curated similarities. To prevent too much positive reinforcement, again we’ll be weighting a link in inverse proportion to its popularity.
[1] Not mutually exclusive. One is how many times book A is borrowed by people who also borrowed book B; the other is how often book B is only borrowed by people who borrowed book A.
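
To make the two weighting ideas above a bit more concrete, here is a rough Python sketch of how they might look. None of this is Jerome's actual code: the function names (`co_borrow_scores`, `inverse_frequency_similarity`), the data shapes (plain dictionaries of sets) and the exact formulas are all assumptions made for illustration. In particular, "inversely proportional" is taken literally as 1 / (number of items using a label), and "uniqueness" is read as the fraction of book B's borrowers who also borrowed book A.

```python
from collections import defaultdict


def co_borrow_scores(loans, book):
    """Score other books against `book` using borrowing history.

    `loans` maps a borrower id to the set of books they borrowed
    (a hypothetical data shape). Returns {other: (commonness, uniqueness)}:
    commonness is the number of people who borrowed both books, and
    uniqueness is the share of `other`'s borrowers who also borrowed `book`.
    """
    # Invert the loan records: book -> set of borrowers.
    borrowers_of = defaultdict(set)
    for borrower, books in loans.items():
        for b in books:
            borrowers_of[b].add(borrower)

    target = borrowers_of.get(book, set())
    scores = {}
    for other, other_borrowers in borrowers_of.items():
        if other == book:
            continue
        shared = len(target & other_borrowers)
        if shared == 0:
            continue  # never borrowed together, so no link
        scores[other] = (shared, shared / len(other_borrowers))
    return scores


def inverse_frequency_similarity(labels, item):
    """Score other items by the labels they share with `item`.

    `labels` maps an item to its set of labels. A label could be a
    subject heading, an extracted tag or a list the item appears in.
    Each shared label contributes 1 / (number of items carrying it),
    so rarely used labels count for more than popular ones.
    """
    # Count how many items use each label.
    usage = defaultdict(int)
    for item_labels in labels.values():
        for label in item_labels:
            usage[label] += 1

    own = labels.get(item, set())
    scores = {}
    for other, other_labels in labels.items():
        if other == item:
            continue
        shared = own & other_labels
        if shared:
            scores[other] = sum(1.0 / usage[label] for label in shared)
    return scores
```

The same `inverse_frequency_similarity` function could in principle serve subject headings, semantic tags and manual lists, since all three are described as being weighted in inverse proportion to how often they are used. How the scores from the different sources get combined into one ranking is still an open question and isn't covered by this sketch.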
