News Air: OCR

Showing posts with label OCR. Show all posts

Suggested apps for BHL's Life and Literature Code Challenge

Since I won't be able to be at the Biodiversity Heritage Library's Life and Literature meeting I thought I'd share some ideas for their Life and Literature Code Challenge. The deadline is pretty close (October 17) so having ideas now isn't terribly helpful I admit. That aside, here are some thoughts inspired by the challenge. In part this post has been inspired by the Results of the PLoS and Mendeley "Call for Apps", where PLoS and Mendeley asked for people (not necessarily developers) to suggest the kind of apps they'd like to see. As an aside, one thing conspicuous by it's absence is a prize for winning the challenge. PLoS and Mendeley have a "API Binary Battle" with a prize of $US 10,001, which seems more likely to inspire people to take part.

Visual search engine
I suspect that many BHL users are looking for illustrations (exemplified by the images being gathered in BHL's Flickr group). One way to search for images would be to search within the OCR text for figure and plate captions, such as "Fig. 1". Indexing these captions by taxonomic name would provide a simple image search tool. For modern publications most figures are on the same page as the caption, but for older publications with illustrations as plates, the caption and corresponding image may be separated (e.g., on facing pages), so the search results might need to show pages around the page containing the caption. As an aside, it's a pity the Flickr images only link to the BHL item and not the BHL page. If they did the later, and the images were tagged with what they depict, you could great a visual search engine using the Flickr API (of course, this might be just the way to implement the visual search engine — harvest images, tags with PageID and taxon names, upload to Flickr).

Mobile interface
The BHL web site doesn't look great on an iPhone. It makes no concessions to the mobile device, and there are some weird things such as the way the list of pages is rendered. A number of mainstream science publishers are exploring mobile versions of their web sites, for example Taylor and Francis have a jQuery Mobile powered interface for mobile users. I've explored iPad interfaces to scientific articles in previous posts. BHL content posses some challenges, but is fundamentally the same as viewing PDFs — you have fixed pages that you may want to zoom.

OCR correction
There is a lot of scope for cleaning up the OCR text in BHL. Part of the trick would be to have a simple use interface for people to contribute to this task. In an earlier post I discussed a Firefox hOCR add-on that provides a nice way to do this. Take this as a starting point, add a way to save the cleaned up text, and you'd be well on the way to making a useful tool.

Taxon name timeline
Despite the shiny new interface, the Encyclopedia of Life still displays BHL literature in the same clunky way I described in an earlier blog post. It would great to have a timeline of the usage of a name, especially if you could compare the usage of different names (such as synonyms). In many ways this is the BHL equivalent Google Books Ngram viewer.

These are just a few hastily put together thoughts. If you have any other ideas or suggestions, feel free to add them as comments below.

- Posted using BlogPress from my iPad

Correcting OCR using hOCR in Firefox

Quick post on a little tool I came across, moz-hocr-edit. This Firefox add-on lets you proofread Optical Character Recognition (OCR) output. Given my interest in OCR and the Biodiversity Heritage Library I decided to take it for a spin.

moz-hocr-edit uses the hOCR, which is a format for representing the output of OCR software, and is used by tools such as OCRopus (you can see the public specification for hOCR here). Basically it's a microformat, that is, it's HTML with some additional tags. Given some hOCR, moz-hocr-edit enables you to edit the OCR output line-by-line.

Demo
I've created a simple demo based upon Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation. For the demo to work you will need to use the Firefox web browser with the moz-hocr-edit installed.

Go to http://dl.dropbox.com/u/639486/hocr/80780.html
You will see a simple HTML representation of the OCR text from "Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation". I created this HTML from the original ABBYY FineReader XML from the Internet Archive.
On the bottom right-hand of the Firefox browser window you should see hOCR. Click on it and select "Edit this hOCR document":
Firefox will open a new tab that will look something like this:
You can now edit individual lines of text, and see your edits applied to the HTML below.

moz-hocr-edit is a neat little tool. With appropriate web server settings (and, as the tool's author Jim Garrison suggests, autoversioning) it could the basis of a great tool for correcting OCR errors in BHL.

Microcitations: linking nomenclators to BHL

One of the challenges of linking databases of taxonomic names to the primary literature is the minimal citation style used by nomenclators (see my earlier post Nomenclators + digitised literature = fail).

For example, consider Nomenclator Zoologicus. Volumes 1-10 of this list of generic names in zoology were digitised in 2004 and put online by uBio (for more details of this project see Taxonomic informatics tools for the electronic Nomenclator Zoologicus, pmid:16501061). In Nomenclator Zoologicus the citation for the genus Abana is:

Ann. Mag. nat. Hist., (8) 2, 72.

The challenge is to link this short citation to the digital version of the corresponding article. I've been sitting on a copy of the digitised Nomenclator Zoologicus kindly provided by Dave Remsen, and I've finally started to look at the problem of mining it for links to databases such as BHL.

You can see the first attempt at http://biostor.org/microcitation.php. This form takes a genus name and the short citation and attempts to locate the corresponding page in BHL. It then checks whether the name is present on that page. Locating a page in a journal can be a challenge given the often rather ropey metadata in BHL, but BioStor uses a combination of fuzzy string matching and crude kludges to find the best match. But a further complication is that OCR errors may mean the taxonomic name we are looking for might not be detected on the page.

For example, if we search for the citation for the genus Aethriscus, Ann. Mag. nat. Hist., (7) 10, 329. we find two candidate pages in the journal Ann. Mag. nat. Hist, but neither contains the string "Aethriscus". However, if we use approximate string matching we find the OCR text for one page has the string "thriscus". This differs by only two characters from "Aethriscus", and so is a possible match (shown in orange).

Looking at the scanned page we can see the likely source of the problem:

In the original publication the name Aethriscus was written as Æthriscus. The ligature Æ has been corrupted by the OCR engine, and in Nomenclator Zoologicus the name is written without the ligature, hence the failure to exactly match the name with the text. These are some of the challenges faced when trying to close the circle and link names to literature.

The microcitation parser is still pretty crude, but usable. You can get results in either HTML or JSON, so the task of mapping microcitations to BHL pages can be automated. At present the name matching assumes you are looking at a single word (e.g., a genus), I need to extend it to handle binomials.

BHL and OCR

Some quick notes on OCR. Revisiting my DjVu viewer experiments it really struck me how "dirty" the OCR text is. It's readable, but if we were to display the OCR text rather than the images, it would be a little offputting. For example, in the paper A new fat little frog (Leptodactylidae: Eleutherodactylus) from lofty Andean grasslands of southern Ecuador (http://biostor.org/reference/229) there are 15 different variations of the frog genus Eleutherodactylus:

Eleutherodactylus
Eleutheroclactylus
Eleuthewdactyliis
Eleiitherodactylus
Eleuthewdactylus
Eleuthewdactylus
Eleutherodactyliis
Eleutherockictylus
Eleutlierodactylus
Eleuthewdactyhts
Eleiithewdactylus
Eleutherodactyhis
Eleiithemdactylus
Eleuthemdactylus
Eleuthewdactyhis

Of course, this is a recognised problem. Wei et al. Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL) (hdl:2142/14919) found that 35% of names in BHL OCR contained at least one wrong character. They compared the performance of two taxonomic name finding tools on BHL OCR (uBio's taxonFinder and FAT), neither of which did terribly well. Wei et al. found that different page types can influence the success of these algorithms, and suggested that automatically classifying pages into different categories would improve performance.

Personally, it seems to me that this is not the way forward. It's pretty obvious looking at the versions of "Eleutherodactylus" above that there are recognisable patterns in the OCR errors (e.g., "u" becoming "ii", "ro" becoming "w", etc.). After reading Peter Norvig's elegant little essay How to Write a Spelling Corrector, I suspect the way to improve the finding of taxonomic names is to build a "spelling corrector" for names. Central to this would be building a probabilistic model of the different OCR errors (such as "u" → "ii"), and use that to create a set of candidate taxonomic names the OCR string might actually be (the equivalent of Google's "did you mean", which is the subject of Norvig's essay). I had hoped to avoid doing this by using an existing tool, such as Tony Rees' TAXAMATCH, but it's a website not a service, and it is just too slow.

I've started doing some background reading on the topic of spelling correction and OCR, and I've created a group on Mendeley called OCR - Optical Character Recognition to bring these papers together. I'm also fussing with some simple code to find misspellings of a given taxonomic names in BHL text, use the Needleman–Wunsch sequence alignment algorithm to align those misspellings to the correct name, and then extract the various OCR errors, building a matrix of the probabilities of the various transformations of the original text into OCR text.

One use for this spelling correction would be in an interactive BHL viewer. In addition to showing the taxonomic names that uBio's taxonFinder has located in the text, we could flag strings that could be misspelt taxonomic names (such as "Eleutherockictylus") and provide an easy way for the user to either accept or reject that name. If we are going to invite people to help clean up BHL text, it would be nice to provide hints as to what the correct answer might be.

News Air

Suggested apps for BHL's Life and Literature Code Challenge

Correcting OCR using hOCR in Firefox

Microcitations: linking nomenclators to BHL

BHL and OCR

Feedjit

My Blog List