Showing posts with label mapping. Show all posts
Showing posts with label mapping. Show all posts

Mapping names to literature: closing in on 250,000 names

Following on from my earlier post Linking taxonomic names to literature: beyond digitised 5×3 index cards I've been slowly updating my latest toy:

http://iphylo.org/~rpage/itaxonAlpheus

This site displays a database mapping over 200,000 animal names to the primary literature, using a mix of identifiers (DOIs, Handles, PubMed, URLs) as well as links to freely available PDFs where they are available. Lots still to do as about a third of the 1.5 million names in the database have citations that my code hasn't been able to parse. There are also lots of gaps that need to be filled in, for example missing DOIs or PubMed identifiers, and a lot of the earlier names are linked by "microcitations" to names, and I'll need to handle those (using code from my earlier project Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature).

The mapping itself is stored in a database that I'm constantly editing, so this is far from production quality, but I've found it eye-opening just how much literature is available. There is a lot of scope for generating customised lists of papers, for example, primary taxonomic sources for taxa currently on the IUCN Red List, or those taxa which have sequences in GenBank (building on the mapping of NCBI taxa onto Wikipedia). Given that a lot of the relevant literature is in BHL, or available as PDFs, we could do some data mining, such as extracting geographical coordinates, taxonomic names, and citations. And if linked data is your thing, the 110,000 DOIs and nearly 9,000 CiNiii URLs all serve RDF (albeit not without a few problems).

I've set a "goal" of having 250,000 names mapped to the primary literature, at which point the database interface will get some much-needed attention, but for now have a look for your favourite animal and see if it's original description has been digitised.

TreeBASE meets NCBI, again

Déjà vu is a scary thing. Four years ago I released a mapping between names in TreeBASE and other databases called TBMap (described here: doi:10.1186/1471-2105-8-158). Today I find myself releasing yet another mapping, as part of my NCBI to Wikipedia project. By embedding the mapping in a wiki, it can be edited, so the kinds of problems I encountered with TbMap, recounted here, here, and here. The mapping in and of itself isn't terribly exciting, but it's the starting point for some things I want to do regarding how to visualise the data in TreeBASE.

Because TreeBASE 2 has issued new identifiers for its taxa (see TreeBASE II makes me pull my hair out), and now contains its own mapping to the NCBI taxonomy, as a first pass I've taken their mapping and added it to http://iphylo.org/linkout. I've also added some obvious mappings that TreeBASE has missed. There are a lot more taxa which could be added, but this is a start.

The TreeBASE taxa that have a mapping each get their own page with a URL of the form http://iphylo.org/linkout/<TreeBase taxon identifier>, e.g. http://iphylo.org/linkout/TB2:Tl257333. This page simply gives the name of the taxon in TreeBASE and the corresponding NCBI taxon id. It uses a Semantic Mediawiki template to generate a statement that the TreeBASE and and NCBI taxa are a "close match". If you go to the corresponding page in the wiki for the NCBI taxon (e.g., http://iphylo.org/linkout/Ncbi:448631) you will see any corresponding TreeBASE taxa listed there. If a mapping is erroneous, we simply need to edit the TreeBASE taxon page in the wiki to fix it. Nice and simple.

At the time of writing the initial mapping is still being loaded (this can take a while). I'll update this post when the uploading has finished.