Microcitations: linking nomenclators to BHL

One of the challenges of linking databases of taxonomic names to the primary literature is the minimal citation style used by nomenclators (see my earlier post Nomenclators + digitised literature = fail).

For example, consider Nomenclator Zoologicus. Volumes 1-10 of this list of generic names in zoology were digitised in 2004 and put online by uBio (for more details of this project see Taxonomic informatics tools for the electronic Nomenclator Zoologicus, pmid:16501061). In Nomenclator Zoologicus the citation for the genus Abana is:

Ann. Mag. nat. Hist., (8) 2, 72.

The challenge is to link this short citation to the digital version of the corresponding article. I've been sitting on a copy of the digitised Nomenclator Zoologicus kindly provided by Dave Remsen, and I've finally started to look at the problem of mining it for links to databases such as BHL.

You can see the first attempt at http://biostor.org/microcitation.php. This form takes a genus name and the short citation and attempts to locate the corresponding page in BHL. It then checks whether the name is present on that page. Locating a page in a journal can be a challenge given the often rather ropey metadata in BHL, but BioStor uses a combination of fuzzy string matching and crude kludges to find the best match. But a further complication is that OCR errors may mean the taxonomic name we are looking for might not be detected on the page.

For example, if we search for the citation for the genus Aethriscus, Ann. Mag. nat. Hist., (7) 10, 329. we find two candidate pages in the journal Ann. Mag. nat. Hist, but neither contains the string "Aethriscus". However, if we use approximate string matching we find the OCR text for one page has the string "thriscus". This differs by only two characters from "Aethriscus", and so is a possible match (shown in orange).

2.png

Looking at the scanned page we can see the likely source of the problem:
3.png

In the original publication the name Aethriscus was written as Æthriscus. The ligature Æ has been corrupted by the OCR engine, and in Nomenclator Zoologicus the name is written without the ligature, hence the failure to exactly match the name with the text. These are some of the challenges faced when trying to close the circle and link names to literature.

The microcitation parser is still pretty crude, but usable. You can get results in either HTML or JSON, so the task of mapping microcitations to BHL pages can be automated. At present the name matching assumes you are looking at a single word (e.g., a genus), I need to extend it to handle binomials.

BioStor updates on Twitter

BioStor has had a Twitter account @biostor_org for a while, but it's not been active. I finally got around to hooking it up to BioStor, so that now every time an article is added to BioStor, the title of that article and it's URL appears in the @biostor_org Twitter feed.



Activity on this feed will be variable, depending on whether articles are being added manually, or in bulk. But it's a handy way to keep tabs on the growing number of articles being harvested from the Biodiversity Heritage Library.

Zooming a large tree, now with thumbnails

Continuing experiments with a zoom viewer for large trees (see previous post), I've now made a demo where the labels are clickable. If the NCBI taxon has an equivalent page in Wikipedia the demo displays and link to that page (and, if present, a thumbnail image). Give it a try at

http://iphylo.org/~rpage/deeptree/3.html

or watch the short video clip below:

Mendeley, OpenURL, BioStor, and BHL

Mendeley has added a feature which makes it easier to use Mendeley with repositories such as BioStor and BHL. As announced in Get Full Text: Mendeley now works with your local library via OpenURL, you can now add OpenURL resolvers to your Mendeley account:
We’ve added a button to the catalog pages that will allow you to get the article from your library right in Mendeley. This feature will link you directly to the full text copy according to your institutional access rights.
Ironically, in the UK access to electronic articles from a University is pretty seamless via the UK Access Management Federation, so I don't need to add an OpenURL resolver to get full text for an article. But this new feature does enable another way to access to articles in my BioStor repository. By adding the BioStor OpenURL to your Mendeley account, you can search for articles from your Mendeley library in BioStor.

The Mendeley blog post explains how to set up an OpenURL resolver. Go to your Mendeley account and click on the My Account button in the upper right corner of then page, then select Account Details, then the Sharing/Importing tab, or just click here.

openurl_settings.jpg

Click on Add library manually, then enter the name of the resolver (e.g., "BioStor") and the URL http://biostor.org/openurl:

Snapshot 2011-03-01 07-37-20.png

If you view a reference in Mendeley, you will now see something like this:

Snapshot 2011-03-01 07-40-04.png

In addition to the DOI and the URL, this reference now displays a Find this paper at menu. Clicking on it shows the default services, together with any OpenURL resolvers you've added (in this case, BioStor):
Snapshot 2011-03-01 07-42-50.png

You can add multiple resolvers, so we could add the BHL OpenURL resolver http://www.biodiversitylibrary.org/openurl, although finding articles isn't BHL OpenURL resolver's strong point.

Now, what would be very handy is if Mendeley were to complete the circle by providing their own OpenURL resolver, so that people could find articles in Mendeley from metadata such as article title, journal, volume, and starting page. The Mendeley API might be a way to implement this, although its search features lack the granularity needed.

Live demo of zooming a large tree

After the teaser on Friday (see Deep zooming a large 2D tree) I've put a live demo of my experiments with viewing a large tree online at:

http://iphylo.org/~rpage/deeptree/

The first example (Experiment 1) is the NCBI classification for frogs:

This version displays internal node labels, leaf labels (as many as can be displayed at a given zoom level), and works in Safari, Firefox, and Internet Explorer 8. Obviously this is all pretty rough, but take it for a spin, I'd welcome any feedback.