News Air: lucene

Prompted by the appearance on the BHL blog of an article about BioStor, I've been thinking about how to improve what is basically a fairly clunky tool.

One major weakness is searching the collection of nearly 40,000 articles extracted from BHL. Note the word "extracted." BioStor isn't a tool like PubMed or Google Scholar where the goal is to find articles on a topic. Instead it addresses a more specific question, namely whether a given article is contained in an item scanned by BHL. Confusion about this was one reason publication of my paper on BioStor (doi:10.1186/1471-2105-12-187) took so long to pass through the review stage.

However, users (myself included) expect to be able to search for articles. So, it's time to explore ways to make it easier to find articles within the BioStor database. I've junked the previous pretty crappy code I wrote and have started to play with the Solr search engine. I'd experimented with Solr a while ago, but other stuff got in the way. Today I've managed to add it to BioStor and do a preliminary indexing of the articles in BioStor. So far I'm only indexing basic bibliographic metadata, and displaying the first 30 hits, but already it's making it much easier to find interesting stuff in BioStor.
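As a rough illustration of the kind of request behind "displaying the first 30 hits", here is a minimal Python sketch that builds a Solr select URL. The endpoint `http://localhost:8983/solr/select` and the example query are assumptions for illustration only; the post does not say where the BioStor Solr instance lives. `q`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint: an assumption, not BioStor's actual address.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(query, rows=30, base_url=SOLR_SELECT_URL):
    """Return a Solr select URL asking for the first `rows` hits as JSON."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url + "?" + urlencode(params)


# Example query text is made up for illustration.
print(build_search_url("frog new species"))
```

Fetching that URL (for example with `urllib.request.urlopen`) would return the matching documents, assuming a Solr core is running at that address.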

Solr also supports faceted searching (i.e., clustering results by categories such as year, author, journal). I don't do much with this yet, but there's clearly a lot of scope. I could also add taxonomic names, and even the OCR text, to Solr, greatly expanding the ability to find articles. But that's for the future. For now, here are some interesting searches:

Quick notes to self on fulltext search and CouchDB. (Note that links to CouchDB are local to my machine(s), and won't work unless you are me, or have a copy of the same database running on your machine.) couchdb-lucene adds fulltext indexing to CouchDB. After a few false starts I now have this working. The documentation is a little misleading: you don't need to clone the GitHub repository, nor use Maven to build couchdb-lucene (at least, I didn't). Instead I grabbed couchdb-lucene-0.5.6, unpacked it, and used it as is.

To configure CouchDB I ended up editing the configuration using Futon (there's a link "Add a new section" at the bottom of the Configuration page), then I restarted CouchDB. The things to add are:


[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.
[external]
fti=/path/to/python /path/to/couchdb-lucene/tools/couchdb-external-hook.py
[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
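For anyone who would rather script this than click through Futon, the sketch below builds the equivalent HTTP requests in Python. It assumes a CouchDB 1.x-era server that exposes the `_config` HTTP API (`PUT /_config/<section>/<key>` with a JSON string as the body); newer CouchDB releases differ, so treat this as an illustration rather than a recipe. The requests are only constructed and printed here, not sent, and the `/path/to/...` placeholders are left exactly as above.

```python
import json
from urllib.request import Request

COUCHDB_URL = "http://localhost:5984"

# (section, key, value) triples copied from the settings listed above.
SETTINGS = [
    ("couchdb", "os_process_timeout", "60000"),
    ("external", "fti",
     "/path/to/python /path/to/couchdb-lucene/tools/couchdb-external-hook.py"),
    ("httpd_db_handlers", "_fti",
     '{couch_httpd_external, handle_external_req, <<"fti">>}'),
]


def config_requests(base_url=COUCHDB_URL, settings=SETTINGS):
    """Build (but do not send) one PUT request per configuration value."""
    built = []
    for section, key, value in settings:
        built.append(Request(
            url=f"{base_url}/_config/{section}/{key}",
            data=json.dumps(value).encode("utf-8"),  # body is a JSON string
            headers={"Content-Type": "application/json"},
            method="PUT",
        ))
    return built


for req in config_requests():
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing each request to `urllib.request.urlopen` would apply the setting, assuming the server accepts unauthenticated configuration changes; CouchDB would still need restarting as described above.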

To start couchdb-lucene, just cd into couchdb-lucene-0.5.6 and run bin/run.

Then it's a case of adding a fulltext index. In Futon I start by adding a regular design document, then edit the JavaScript. For example, here is a simple index on document titles:


{
   "_id": "_design/lucene",
   "_rev": "2-96b333dfc77866a13c0de7f856d27b6c",
   "language": "javascript",
   "fulltext": {
       "by_title": {
           "index": "function(doc) { var ret = new Document(); ret.add(doc.title); return ret; }"
       }
   }
}
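One detail worth spelling out: in the stored JSON the index function has to be a single string value. A minimal Python sketch of building the same design document programmatically follows; the helper name `build_design_doc` is made up for illustration, and `_rev` is omitted because CouchDB assigns revisions itself when a new document is saved.

```python
import json

# Same index function as in the design document above, held as one string.
INDEX_FUNCTION = (
    "function(doc) { "
    "var ret = new Document(); "
    "ret.add(doc.title); "
    "return ret; "
    "}"
)


def build_design_doc(index_name="by_title", index_function=INDEX_FUNCTION):
    """Return the couchdb-lucene design document as a plain Python dict."""
    return {
        "_id": "_design/lucene",
        "language": "javascript",
        "fulltext": {index_name: {"index": index_function}},
    }


# json.dumps produces the text that would be sent to CouchDB.
print(json.dumps(build_design_doc(), indent=2))
```

Adding a second index would just mean adding another key next to `by_title` under `fulltext`.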

Once the indexing has been completed, you can search the CouchDB database using a URL like this: http://localhost:5984/col2010ref/_fti/_design/lucene/by_title?q=frog+new+species.
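The same search URL can be assembled from Python, which takes care of encoding the query string. This is a sketch only: the helper names are hypothetical, and the response shape used by `hit_ids` (a JSON object with a `rows` list whose entries carry an `id`) is an assumption about couchdb-lucene's output rather than something shown in this post. The sample response is invented purely to exercise the parser.

```python
import json
from urllib.parse import urlencode


def build_fti_url(database, design_doc, index, query,
                  base_url="http://localhost:5984"):
    """Return the couchdb-lucene search URL for one fulltext index."""
    path = f"{base_url}/{database}/_fti/_design/{design_doc}/{index}"
    return path + "?" + urlencode({"q": query})


def hit_ids(response_text):
    """Extract document ids, assuming a 'rows' list of objects with an 'id'."""
    return [row["id"] for row in json.loads(response_text).get("rows", [])]


print(build_fti_url("col2010ref", "lucene", "by_title", "frog new species"))

# Made-up response body, only to show how the ids would be pulled out.
sample = '{"rows": [{"id": "doc-a", "score": 1.2}, {"id": "doc-b", "score": 0.8}]}'
print(hit_ids(sample))
```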

Lots more to do here, but with spatial queries and now fulltext search, it's time to start building something...
