BHL and OCR

Some quick notes on OCR. Revisiting my DjVu viewer experiments, it really struck me how "dirty" the OCR text is. It's readable, but if we were to display the OCR text rather than the page images, it would be a little off-putting. For example, in the paper A new fat little frog (Leptodactylidae: Eleutherodactylus) from lofty Andean grasslands of southern Ecuador (http://biostor.org/reference/229) there are 15 different variations of the frog genus Eleutherodactylus:

  • Eleutherodactylus
  • Eleutheroclactylus
  • Eleuthewdactyliis
  • Eleiitherodactylus
  • Eleuthewdactylus
  • Eleuthewdactylus
  • Eleutherodactyliis
  • Eleutherockictylus
  • Eleutlierodactylus
  • Eleuthewdactyhts
  • Eleiithewdactylus
  • Eleutherodactyhis
  • Eleiithemdactylus
  • Eleuthemdactylus
  • Eleuthewdactyhis

Of course, this is a recognised problem. In Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL) (hdl:2142/14919), Wei et al. found that 35% of names in BHL OCR contained at least one wrong character. They compared the performance of two taxonomic name-finding tools on BHL OCR (uBio's taxonFinder and FAT), neither of which did terribly well. Wei et al. found that different page types can influence the success of these algorithms, and suggested that automatically classifying pages into different categories would improve performance.

Personally, I don't think this is the way forward. It's pretty obvious looking at the versions of "Eleutherodactylus" above that there are recognisable patterns in the OCR errors (e.g., "u" becoming "ii", "ro" becoming "w", etc.). After reading Peter Norvig's elegant little essay How to Write a Spelling Corrector, I suspect the way to improve the finding of taxonomic names is to build a "spelling corrector" for names. Central to this would be building a probabilistic model of the different OCR errors (such as "u" → "ii"), and using that model to create a set of candidate taxonomic names the OCR string might actually be (the equivalent of Google's "did you mean", which is the subject of Norvig's essay). I had hoped to avoid doing this by using an existing tool, such as Tony Rees' TAXAMATCH, but it's a website rather than a service, and it is just too slow.
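To make the idea concrete, here's a minimal sketch in Python. The confusion table and the one-name dictionary are invented for illustration; the real model would learn error probabilities from BHL data rather than use a hand-written list:

    # Sketch of a Norvig-style corrector for OCR'd taxonomic names.
    # CONFUSIONS maps an OCR output string back to the string it probably
    # was; both it and KNOWN_NAMES are toy stand-ins, not real data.

    CONFUSIONS = {
        "ii": "u",   # "u" often comes out of OCR as "ii"
        "w":  "ro",  # "ro" often comes out as "w"
        "ck": "d",   # "d" can come out as "ck"
        "li": "h",   # "h" can come out as "li"
    }

    KNOWN_NAMES = {"Eleutherodactylus"}  # stand-in for a full name dictionary

    def candidates(ocr_word):
        """Generate corrections by undoing one OCR confusion anywhere in the word."""
        results = set()
        for wrong, right in CONFUSIONS.items():
            i = ocr_word.find(wrong)
            while i != -1:
                results.add(ocr_word[:i] + right + ocr_word[i + len(wrong):])
                i = ocr_word.find(wrong, i + 1)
        return results

    def correct(ocr_word):
        """Known names reachable from the OCR string by undoing one confusion."""
        return KNOWN_NAMES & candidates(ocr_word)

    print(correct("Eleutherodactyliis"))  # {'Eleutherodactylus'}

A fuller version would, as in Norvig's essay, weight each candidate by the probability of the error that produced it, so that the likeliest corrections rank first.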

I've started doing some background reading on the topic of spelling correction and OCR, and I've created a group on Mendeley called OCR - Optical Character Recognition to bring these papers together. I'm also fussing with some simple code to find misspellings of a given taxonomic name in BHL text, use the Needleman–Wunsch sequence alignment algorithm to align those misspellings to the correct name, and then extract the various OCR errors, building a matrix of the probabilities of the various transformations of the original text into OCR text.
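To give a flavour of what that code involves, here is a bare-bones sketch. The scoring parameters are arbitrary, and because the alignment is per character an error like "u" → "ii" shows up as a substitution plus an insertion, which fuller bookkeeping would merge into a single edit:

    from collections import Counter

    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
        """Globally align strings a and b; return both with '-' marking gaps."""
        n, m = len(a), len(b)
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                score[i][j] = max(score[i - 1][j - 1] + s,
                                  score[i - 1][j] + gap,
                                  score[i][j - 1] + gap)
        # Trace back from the bottom-right corner to recover one optimal alignment.
        out_a, out_b, i, j = [], [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                    match if a[i - 1] == b[j - 1] else mismatch):
                out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
            elif i > 0 and score[i][j] == score[i - 1][j] + gap:
                out_a.append(a[i - 1]); out_b.append("-"); i -= 1
            else:
                out_a.append("-"); out_b.append(b[j - 1]); j -= 1
        return "".join(reversed(out_a)), "".join(reversed(out_b))

    def tally_errors(correct_name, ocr_name):
        """Count character-level edits (correct char -> OCR char), gaps included."""
        errors = Counter()
        for c, o in zip(*needleman_wunsch(correct_name, ocr_name)):
            if c != o:
                errors[(c, o)] += 1
        return errors

    print(tally_errors("Eleutherodactylus", "Eleutherodactyliis"))
    # e.g. Counter({('u', 'i'): 1, ('-', 'i'): 1})

Summed over many misspelling/name pairs harvested from BHL, these counts become the probability matrix described above.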

One use for this spelling correction would be in an interactive BHL viewer. In addition to showing the taxonomic names that uBio's taxonFinder has located in the text, we could flag strings that might be misspelt taxonomic names (such as "Eleutherockictylus") and provide an easy way for the user to either accept or reject the suggested correction. If we are going to invite people to help clean up BHL text, it would be nice to provide hints as to what the correct answer might be.
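As a rough sketch of how those hints might be generated, here the learned error model is replaced by Python's standard difflib, and the list of confirmed names is invented:

    import difflib
    import re

    # Names already confirmed on the page (e.g. by taxonFinder); invented here.
    CONFIRMED_NAMES = ["Eleutherodactylus", "Leptodactylidae"]

    def flag_candidates(page_text, cutoff=0.8):
        """Pair each suspect capitalised word with the closest confirmed name."""
        flagged = []
        for word in re.findall(r"[A-Z][a-z]+", page_text):
            if word in CONFIRMED_NAMES:
                continue
            close = difflib.get_close_matches(word, CONFIRMED_NAMES, n=1, cutoff=cutoff)
            if close:
                flagged.append((word, close[0]))
        return flagged

    print(flag_candidates("a specimen of Eleutherockictylus from Ecuador"))
    # [('Eleutherockictylus', 'Eleutherodactylus')]

The viewer would then present each (string, suggestion) pair and let the user accept or reject it.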

BioStor one year on: has it been a success?

One year ago I released BioStor, which scratched my itch of finding articles in the Biodiversity Heritage Library. This anniversary seems a good time to think about where to take the project next, but also to ask whether it has been successful. Of course, this rather hinges on what I mean by "success." I've certainly found BioStor useful, both in developing it and in actually using it. But it's time to be a little more hard-headed and look at some stats, so I'm going to share the Google Analytics stats for BioStor. Below is the report for Dec 20, 2009 to Dec 19, 2010, as a PDF.

Visits

[Figure: visits to BioStor over the year]

BioStor had 63,824 visits and 197,076 pageviews over the year. After an initial flurry of visits on its launch, the number of visitors dropped off, then slowly grew. Numbers dipped in the middle of the year, then started to climb again.

To discover whether these numbers are a little or a lot, it would be helpful to compare them with data from other biodiversity sites. Unfortunately, nobody seems to make this information readily available. There is a slide in a BHL presentation showing that BHL has had more than 1 million visits since January 2008, and in March 2010 it was receiving around 3,000 visits per day, an order of magnitude more traffic than BioStor is currently getting. For another comparison, I looked at Scratchpads, which currently comprise 193 sites. In November 2007 Scratchpads had 43,379 pageviews altogether; in November 2010 BioStor had 17,484. For the period May-October 2009 Scratchpads had 74,109 visitors; for the equivalent period in 2010 BioStor had 28,110. So BioStor is getting roughly a third of the traffic of the entire Scratchpads project.

Bounce rate

One of the more interesting charts is "Bounce rate", defined by Google as

Bounce rate is the percentage of single-page visits or visits in which the person left your site from the entrance (landing) page.
[Figure: bounce rate for BioStor over the year]
The bounce rate for BioStor is pretty constant at around 65%, except for two periods in March and June when it plummeted to around 20%. This corresponds to when I set up a Wikisource installation for BioStor so that the OCR text from BHL could be corrected. Mark Holder ran a student project that used the BioStor wiki, so I'm assuming that the drop in bounce rate reflects his students spending time on the wiki. BHL OCR text would benefit from cleaning, but I'm not sure Wikisource is the way to do it, as it feels a little clunky. Ideally I'd like to build upon the interactive DjVu experiments to develop a user-friendly way to edit the underlying OCR text.

Is it just my itch?
Every good work of software starts by scratching a developer's personal itch - Eric S. Raymond, The Cathedral and the Bazaar

Looking at traffic by city, Glasgow (where I'm based) is the single largest source of traffic. This is hardly surprising, given that I wrote BioStor to solve a problem I was interested in, and the bulk of its content has been added by me using various scripts. This raises the possibility that BioStor has an active user community of *cough* one. However, looking at traffic by country, the UK is prominent (due to traffic primarily from Glasgow and London), but more visits come from the US. It seems I didn't end up making this site just for me.

[Figure: map of visits by country]

Google search
Another measure of success is Google search rankings, which I've used elsewhere to compare the impact of Wikipedia and EOL pages. As a quick experiment I Googled the top ten journals in BioStor and recorded where in the search results BioStor appeared. For all but the Biological Bulletin, BioStor appeared in the top ten (i.e., on the first page of results):

  • Biological Bulletin: 12
  • Bulletin of Zoological Nomenclature: 6
  • Proceedings of the Entomological Society, Washington: 6
  • Proc. Linn. Soc. New South Wales: 3
  • Annals of the Missouri Botanical Garden: 3
  • Tijdschr. Ent.: 2
  • Transactions of The Royal Entomological Society of London: 6
  • Ann. Mag. nat. Hist.: 3
  • Notes from the Leyden Museum: 5
  • Proceedings of the United States National Museum: 4


This suggests that BioStor's content is at least findable.

Where next?
The sense I'm getting from these stats is that BioStor is being used, and it seems to be a reasonably successful, small-scale project. It would be nice to play with the Google Analytics output a bit more, and also to explore usage patterns more closely. For example, I invested some effort in adding the ability to create PDFs for BioStor articles, but I've no stats on how many PDFs have been downloaded. Metadata in BioStor is editable, and edits are logged, but I've not explored the extent to which the content is being edited.

If a serious effort is going to be made to clean up BHL content using crowdsourcing, I'll need to think of ways to engage users. The wiki experiments were a step in this direction, but I suspect that building a network around this task might prove difficult. Perhaps a better way is to build the network elsewhere, then try to engage it with this task (OCR correction). This was one reason behind my adopting Mendeley's OAuth API to provide a sign-in facility for BioStor (see Mendeley connect). Again, I've no stats on the extent to which this feature of BioStor has been used. Time to give some serious thought to what else I can learn about how BioStor is being used.