Showing posts with label PDF. Show all posts

Mapping names to literature: closing in on 250,000 names

Following on from my earlier post "Linking taxonomic names to literature: beyond digitised 5×3 index cards", I've been slowly updating my latest toy:

http://iphylo.org/~rpage/itaxonAlpheus

This site displays a database mapping over 200,000 animal names to the primary literature, using a mix of identifiers (DOIs, Handles, PubMed identifiers, URLs) as well as links to freely available PDFs where they exist. There is lots still to do, as about a third of the 1.5 million names in the database have citations that my code hasn't been able to parse. There are also lots of gaps that need to be filled in, for example missing DOIs or PubMed identifiers. A lot of the earlier names are cited by "microcitations" attached to the names, and I'll need to handle those (using code from my earlier project "Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature").

The mapping itself is stored in a database that I'm constantly editing, so this is far from production quality, but I've found it eye-opening just how much literature is available. There is a lot of scope for generating customised lists of papers, for example, primary taxonomic sources for taxa currently on the IUCN Red List, or those taxa which have sequences in GenBank (building on the mapping of NCBI taxa onto Wikipedia). Given that a lot of the relevant literature is in BHL, or available as PDFs, we could do some data mining, such as extracting geographical coordinates, taxonomic names, and citations. And if linked data is your thing, the 110,000 DOIs and nearly 9,000 CiNii URLs all serve RDF (albeit not without a few problems).
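A client would typically ask for that RDF through HTTP content negotiation. The following is a minimal sketch of that, not code from the database itself. It assumes the https://doi.org/ resolver and the application/rdf+xml media type, and the helper name rdf_request is made up for illustration. It only builds the request and leaves sending it to the caller.

```python
from urllib.parse import quote
from urllib.request import Request


def rdf_request(doi, resolver="https://doi.org/"):
    """Build (but do not send) a request asking a DOI resolver for RDF/XML.

    Both the resolver URL and the media type are assumptions for this sketch.
    """
    # Percent-encode unusual characters, keeping the "/" between prefix and suffix.
    url = resolver + quote(doi, safe="/")
    return Request(url, headers={"Accept": "application/rdf+xml"})


if __name__ == "__main__":
    req = rdf_request("10.1371/journal.pone.0012303")
    print(req.full_url, req.get_header("Accept"))
```

Passing the result to urllib.request.urlopen would perform the actual lookup. Whether a given DOI returns RDF depends on who registered it.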

I've set a "goal" of having 250,000 names mapped to the primary literature, at which point the database interface will get some much-needed attention, but for now have a look for your favourite animal and see if its original description has been digitised.

Viewing scientific articles on the iPad: the PLoS Reader

Continuing on from my previous post "Viewing scientific articles on the iPad: towards a universal article reader", here are some brief notes on the PLoS iPad app, which I've previously been critical of.

There are two key things to note about this app. The first is that it uses the page turning metaphor. The article is displayed as a PDF, a page at a time, and the user swipes the page to turn it over. Hence, the app is simulating paper on the iPad screen.

But perhaps more interesting is that, unlike the Nature app discussed earlier, the PLoS app doesn't use a custom API to retrieve articles. Instead the app uses RSS feeds from the PLoS site. PLoS provides journal-specific RSS feeds, as well as subject-specific feeds within journals (see, for example, the PLoS ONE home page). The PLoS Reader app takes these feeds and uses them to create a list of articles the reader can choose from.

A nice feature of the PLoS Atom feeds is the provision of links to alternative formats for the article (unlike many journal RSS feeds, which provide just a DOI or a URL). For example, the feed item for the article "Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing" (doi:10.1371/journal.pone.0012303) contains links to the PDF and XML versions of the article:


<link rel="related"
      type="application/pdf"
      href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=PDF"
      title="(PDF) Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing" />
<link rel="related"
      type="text/xml"
      href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=XML"
      title="(XML) Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing" />


This makes the task of an article reader much easier. Rather than attempt to screen scrape the article web page, or rely on a rule for constructing the link to the desired file, the feed provides an explicit URL to the different available formats.
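To give a feel for how little work this leaves the reader app, here is a rough sketch that pulls the format links out of a single feed entry. It is not how the PLoS app actually does it. The sample entry is a trimmed-down, hypothetical one modelled on the links above (wrapped in the standard Atom namespace), and the function name related_formats is made up.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical, trimmed-down entry modelled on the links quoted in the post.
# Inside XML the "&" in each URL has to be written as "&amp;".
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Transmission of Single HIV-1 Genomes</title>
  <link rel="related" type="application/pdf"
        href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=PDF" />
  <link rel="related" type="text/xml"
        href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=XML" />
</entry>"""


def related_formats(entry_xml):
    """Map MIME type -> URL for every rel="related" link in an Atom entry."""
    entry = ET.fromstring(entry_xml)
    formats = {}
    for link in entry.findall(ATOM + "link"):
        mime, href = link.get("type"), link.get("href")
        if link.get("rel") == "related" and mime and href:
            formats[mime] = href
    return formats


if __name__ == "__main__":
    for mime, url in related_formats(SAMPLE_ENTRY).items():
        print(mime, url)
```

A reader could then simply pick its preferred format, say the XML if it can render it and the PDF otherwise.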

I've not seen this feature in other journal RSS feeds, although article web pages sometimes provide this information. BMC journals, for example, provide <link rel="alternate"> tags in the web page for each article, from which we can extract links to the XML and PDF versions, and some journals (BMC included) provide the Google Scholar metadata tag <meta name="citation_pdf_url"> to link to the PDF. Hence, a generic article reader will need to be able to extract metadata tags from article web pages as it seeks formats suitable for display.
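As a sketch of what that extraction might look like, the snippet below scans a page's HTML for the two kinds of tag mentioned above. The sample page and its example.org URLs are invented for illustration, as are the class and function names. A real reader would also need to resolve relative URLs and cope with messier markup.

```python
from html.parser import HTMLParser


class FormatLinkParser(HTMLParser):
    """Collect citation_pdf_url and rel="alternate" links from a web page."""

    def __init__(self):
        super().__init__()
        self.pdf_url = None    # value of <meta name="citation_pdf_url">
        self.alternates = {}   # MIME type -> URL from <link rel="alternate">

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") == "citation_pdf_url":
            self.pdf_url = a.get("content")
        elif tag == "link":
            # rel may hold several space-separated tokens.
            rels = (a.get("rel") or "").lower().split()
            if "alternate" in rels and a.get("type") and a.get("href"):
                self.alternates[a["type"]] = a["href"]


# Hypothetical article page; the URLs are placeholders, not real BMC links.
SAMPLE_PAGE = """<html><head>
<meta name="citation_pdf_url" content="http://example.org/article/1234.pdf">
<link rel="alternate" type="text/xml" href="http://example.org/article/1234.xml">
<link rel="alternate" type="application/pdf" href="http://example.org/article/1234.pdf">
<link rel="stylesheet" type="text/css" href="/style.css">
</head><body></body></html>"""


def find_formats(html):
    """Return (citation_pdf_url or None, dict of alternate formats)."""
    parser = FormatLinkParser()
    parser.feed(html)
    parser.close()
    return parser.pdf_url, parser.alternates


if __name__ == "__main__":
    print(find_formats(SAMPLE_PAGE))
```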