
DNA Barcoding, the Darwin Core Triplet, and failing to learn from past mistakes

Given various discussions about identifiers, dark taxa, and DNA barcoding that have been swirling around the last few weeks, there's one notion that is starting to bug me more and more. It's the "Darwin Core triplet", which creates identifiers for voucher specimens in the form <institution-code>:<OPTIONAL collection-code>:<specimen-id>. For example,

MVZ:Herp:246033

is the identifier for specimen 246033 in the Herpetology collection of the Museum of Vertebrate Zoology (see http://arctos.database.museum/guid/MVZ:Herp:246033).
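To make the structure concrete, here's a minimal sketch (in Python) of pulling such a triplet apart, remembering that the collection code is optional:

    def parse_triplet(guid):
        """Split a Darwin Core triplet into its parts.
        The collection code is optional, so we may get 2 or 3 fields."""
        parts = guid.split(":")
        if len(parts) == 3:
            institution, collection, specimen = parts
        elif len(parts) == 2:
            institution, specimen = parts
            collection = None
        else:
            raise ValueError("not a Darwin Core triplet: " + guid)
        return {"institution": institution, "collection": collection,
                "specimen": specimen}

    print(parse_triplet("MVZ:Herp:246033"))
    # {'institution': 'MVZ', 'collection': 'Herp', 'specimen': '246033'}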

On the face of it this seems a perfectly reasonable idea, and goes some way towards addressing the problem of linking GenBank sequences to vouchers (see, for example, http://dx.doi.org/10.1016/j.ympev.2009.04.016, preprint at PubMed Central). But I'd argue that this is a hack, and one which potentially will create the same sort of mess that citation linking was in before the widespread use of DOIs. In other words, it's a fudge to postpone adopting what we really need, namely persistent resolvable identifiers for specimens.

In many ways the Darwin Core triplet is analogous to an article citation of the form <journal>, <volume>:<starting page>. In order to go from this "triplet" to the digital version of the article we've ended up with OpenURL resolvers, which are basically web services that take this triplet and (hopefully) return a link. In practice building OpenURL resolvers gets tricky, not least because you have to deal with ambiguities in the <journal> field. Journal names are often abbreviated, and there are various ways those abbreviations can be constructed. This leads to lists of standard abbreviations of journals and/or tools to map these to standard identifiers for journals, such as ISSNs.
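For illustration, here's a sketch of the client side of that transaction: pack the citation metadata into an OpenURL key-value query and hand it to a resolver, which then has to do the hard work of disambiguation. The citation values are invented, and I'm using the BioStor resolver (which appears later in these posts) as the example target:

    from urllib.parse import urlencode

    # An article citation reduced to its "triplet" (values invented)
    citation = {
        "genre": "article",
        "title": "Proceedings of the Linnean Society of New South Wales",  # journal
        "volume": "52",
        "spage": "1",
    }

    # Pack the metadata into an OpenURL query; the resolver has to cope
    # with abbreviated or ambiguous journal names from here on.
    url = "http://biostor.org/openurl?" + urlencode(citation)
    print(url)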

This should sound familiar to anybody dealing with specimens. Databases such as the Registry of Biological Repositories and the Biodiversity Collections Index have been created to provide standardised lists of collection abbreviations (such as MVZ = Museum of Vertebrate Zoology). Indeed, one could easily argue that what we need is an OpenURL for specimens (and I've done exactly that).

As much as there are advantages to OpenURL (nicely articulated in Eric Hellman's post When shall we link?), ultimately this will end in tears. Linking mechanisms that depend on metadata (such as museum acronyms and specimen codes, or journal names) are prone to break as the metadata changes. In the case of journals, publishers can rename entire back catalogues and change the corresponding metadata (see Orwellian metadata: making journals disappear), journals can be renamed, merged, or moved to new publishers. In the same way, museums can be rebranded, specimens moved to new institutions, etc. By using a metadata-based identifier we are storing up a world of hurt for someone in the future. Why don't we look at the publishing industry and learn from them? By having unique, resolvable, widely adopted identifiers (in this case DOIs) scientific publishers have created an infrastructure we now take for granted. I can read a paper online, and follow the citations by clicking on the DOIs. It's seamless and by and large it works.

One could argue that a big advantage of the Darwin Core triplet is that it can identify a specimen even if it doesn't have a web presence (which is another way of saying that maybe it doesn't have a web presence now, but it might in the future). But for me this is the crux of the matter. Why don't these specimens have a web presence? Why is it the case that biodiversity informatics has failed to tackle this? It seems crazy that in the context of digital data (DNA sequences) and digital databases (GenBank) we are constructing unresolvable text strings as identifiers.

But, of course, much of the specimen data we care about is online, in the form of aggregated records hosted by GBIF. It would be technically trivial for GBIF to assign a decent identifier to these (for example, a DOI) and we could complete the link between sequence and specimen. There are ways this could be done such that these identifiers could be passed on to the home institutions if and when they have the infrastructure to do it (see GBIF and Handles: admitting that "distributed" begets "centralized").

But for now, we seem determined to postpone having resolvable identifiers for specimens. The Darwin Core triplet may seem a pragmatic solution to the lack of specimen identifiers, but it seems to me it's simply postponing the day we actually get serious about this problem.





More BHL app ideas

Following on from my previous post on BHL apps and a Twitter discussion in which I appealed for a "sexier" interface for BHL (to which @elyw replied that this is what BHL Australia were trying to do), here are some further thoughts on improving BHL's web interface.
Build a new interface
A fun project would be to create a BHL website clone using just the BHL API. This would give you the freedom to explore interface ideas without having to persuade BHL to change its site. In a sense, the app itself would provide the persuasion.

Third party annotations
It would be nice if the BHL web site made use of third party annotations. For example, BHL itself is extracting some of the best images and putting them on Flickr. How about if you go to the page for an item in BHL and you see a summary of the images from that item in Flickr? At a glance you can see whether the item has some interesting content. For example, if you go to http://biodiversitylibrary.org/item/109846 you see this:

[Screenshot: the default BHL page for this item]

which gives you no idea that it contains images like this:

[Image: one of the plates from this item, from BHL's Flickr stream]
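I don't know exactly how BHL tags its Flickr uploads, so treat the machine tag scheme below (bhl:item=...) as an assumption on my part, but the Flickr API makes this kind of lookup straightforward:

    import json
    import urllib.request
    from urllib.parse import urlencode

    API_KEY = "your-flickr-api-key"

    def flickr_images_for_item(item_id):
        """Find Flickr photos machine-tagged with a BHL item id.
        The tag scheme bhl:item=... is an assumption, not documented fact."""
        params = urlencode({
            "method": "flickr.photos.search",
            "api_key": API_KEY,
            "machine_tags": "bhl:item=%s" % item_id,
            "format": "json",
            "nojsoncallback": "1",
        })
        url = "https://api.flickr.com/services/rest/?" + params
        data = json.load(urllib.request.urlopen(url))
        return data["photos"]["photo"]

    for photo in flickr_images_for_item(109846):
        print(photo["title"])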

Tables of contents
Another source of annotations is my own BioStor project, which finds articles in scanned volumes in BHL. If you are looking at an item in BHL it would be nice to see a list of articles that have been found in that item, perhaps displayed in a drop down menu as a table of contents. This would help provide a way to navigate through the volume.
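As far as I know BioStor doesn't expose a ready-made call for this, so the endpoint in this sketch is hypothetical, but it shows how little would be needed to build such a table of contents:

    import json
    import urllib.request

    def table_of_contents(item_id):
        """List the articles BioStor has found in a BHL item.
        The endpoint below is made up: BioStor doesn't (as far as I
        know) offer this call, which is rather the point."""
        url = "http://biostor.org/bhl_item/%s.json" % item_id  # hypothetical
        articles = json.load(urllib.request.urlopen(url))
        for a in sorted(articles, key=lambda a: a.get("spage", "")):
            print("%s (pp. %s-%s)" % (a["title"], a["spage"], a["epage"]))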

Who links to BHL?
When I suggested third party annotations on Twitter @stho002 chimed in asking about Wikispecies, Species-ID, ZooBank, etc. These resources are different, in that they aren't repurposing BHL content but are linking to it. It would be great if a BHL page for an item could display reverse links (i.e., the pages in those external databases that link to that BHL item).

Implementing reverse links (essentially citation linking) can be tricky, but two ways to do it might be:
  1. Use BHL web server logs to find and extract referrals from those projects
  2. Perhaps more elegantly, encourage external databases to link to BHL content using an OpenURL which includes the URL of the originating page (see the sketch below). OpenURL can be messy, but especially in Mediawiki-based projects such as Wikispecies and Species-ID it would be straightforward to make a template that generated the correct syntax. In this way BHL could harvest the inbound links and display them on the item page.
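For example, a link could carry the originating page in the rfr_id parameter. This is a suggestion, not an existing BHL convention, and the citation values here are invented; the parameter names simply follow the usual OpenURL conventions:

    from urllib.parse import urlencode

    def bhl_openurl(title, volume, spage, from_url):
        """Link to BHL via OpenURL, recording where the link came from.
        Putting the originating page in rfr_id is my suggestion here,
        not an existing BHL convention."""
        params = urlencode({
            "genre": "journal",
            "title": title,
            "volume": volume,
            "spage": spage,
            "rfr_id": from_url,  # the page doing the linking
        })
        return "http://www.biodiversitylibrary.org/openurl?" + params

    print(bhl_openurl("Records of the Australian Museum", "17", "123",
                      "http://species-id.net/wiki/Some_Taxon"))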





Mendeley, OpenURL, BioStor, and BHL

Mendeley has added a feature which makes it easier to use Mendeley with repositories such as BioStor and BHL. As announced in Get Full Text: Mendeley now works with your local library via OpenURL, you can now add OpenURL resolvers to your Mendeley account:
We’ve added a button to the catalog pages that will allow you to get the article from your library right in Mendeley. This feature will link you directly to the full text copy according to your institutional access rights.
Ironically, in the UK access to electronic articles from a University is pretty seamless via the UK Access Management Federation, so I don't need to add an OpenURL resolver to get full text for an article. But this new feature does enable another way to access articles in my BioStor repository. By adding the BioStor OpenURL to your Mendeley account, you can search for articles from your Mendeley library in BioStor.

The Mendeley blog post explains how to set up an OpenURL resolver. Go to your Mendeley account and click on the My Account button in the upper right corner of the page, then select Account Details, then the Sharing/Importing tab, or just click here.

[Screenshot: the OpenURL settings on the Sharing/Importing tab]

Click on Add library manually, then enter the name of the resolver (e.g., "BioStor") and the URL http://biostor.org/openurl:

[Screenshot: adding BioStor as a library resolver]

If you view a reference in Mendeley, you will now see something like this:

[Screenshot: a reference page in Mendeley]

In addition to the DOI and the URL, this reference now displays a Find this paper at menu. Clicking on it shows the default services, together with any OpenURL resolvers you've added (in this case, BioStor):
[Screenshot: the Find this paper at menu, showing BioStor]

You can add multiple resolvers, so we could add the BHL OpenURL resolver http://www.biodiversitylibrary.org/openurl, although finding articles isn't the BHL resolver's strong point.

Now, what would be very handy is if Mendeley were to complete the circle by providing their own OpenURL resolver, so that people could find articles in Mendeley from metadata such as article title, journal, volume, and starting page. The Mendeley API might be a way to implement this, although its search features lack the granularity needed.
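For what it's worth, here's a sketch of what such a resolver might look like. The Mendeley search endpoint and the response format shown here are assumptions on my part, and since (as noted) the search isn't fielded, the filtering is crude:

    import json
    import urllib.request
    from urllib.parse import quote

    CONSUMER_KEY = "your-mendeley-key"

    def resolve(atitle, volume, spage):
        """Hypothetical OpenURL resolver backed by Mendeley's catalogue
        search. The endpoint and response shape are assumptions, and the
        search isn't fielded, so we post-filter the results crudely."""
        url = ("http://api.mendeley.com/oapi/documents/search/%s/?consumer_key=%s"
               % (quote(atitle), CONSUMER_KEY))
        results = json.load(urllib.request.urlopen(url))
        for doc in results.get("documents", []):
            if doc.get("volume") == volume and doc.get("pages", "").startswith(spage):
                return doc
        return None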

Web Hooks and OpenURL: the screencast

Yesterday I posted notes on Web Hooks and OpenURL. That post was written when I was already late (you know, when you say to yourself "yeah, I've got time, it'll just take 5 minutes to finish this..."). The Web Hooks + OpenURL project is still very much a work in progress, but I thought a screencast would help explain why I think this is going to make my life a lot easier. It shows an example where I look at a bibliographic record in one database (AFD, the Australian Faunal Directory on CouchDB), click on a link that takes me to BioStor — where I can find the reference in BHL — then simply click on a button on the BioStor page to "automagically" update the AFD database. The "magic" is the Web Hook. The link I click on in the AFD database contains the identifier for that entry in the AFD, as well as a URL BioStor can call when it's found the reference (that URL is the "web hook").

Using Web Hooks and OpenURL from Roderic Page on Vimeo.



Web Hooks and OpenURL: making databases editable

For me one of the most frustrating things about online databases is that they often can't be edited. For example, I've recently created a version of the Australian Faunal Directory on CouchDB, which contains a list of all animals in Australia, and a fairly comprehensive bibliography of taxonomic publications on those animals. What I'd like to do is locate those publications online. Using various scripts I've found DOIs for some 2,500 articles, and located nearly 4,900 articles in BHL, and added these to the database, but browsing the database (using, say, the quantum treemap interface) makes it clear there are lots of publications that I've missed.
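A sketch of one way to do the DOI lookups, using CrossRef's OpenURL interface (check the parameter names against CrossRef's documentation before relying on this):

    import urllib.request
    from urllib.parse import urlencode
    from xml.dom.minidom import parseString

    def find_doi(journal, volume, spage, year, email="you@example.org"):
        """Look up a DOI via CrossRef's OpenURL interface. A sketch,
        not battle-tested code."""
        params = urlencode({
            "pid": email,  # CrossRef identifies users by email address
            "title": journal,
            "volume": volume,
            "spage": spage,
            "date": year,
            "redirect": "false",
        })
        xml = urllib.request.urlopen(
            "http://www.crossref.org/openurl?" + params).read()
        doi = parseString(xml).getElementsByTagName("doi")
        return doi[0].firstChild.data if doi else None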

It would be great if I could go to the Australian Faunal Directory on CouchDB and edit these on that site, but that would require making the data editable, and that means adding a user interface. And that's potentially a lot of work. Then, if I go to another database (say, my CouchDB version of the Catalogue of Life) and want to make that editable then I have to add an interface to that database as well. I could switch to using a wiki, which I've done for some projects (such as the NCBI to Wikipedia mapping), but wikis have their own issues (in particular, they don't easily support the kinds of queries I want to do).

There is, as they say, a third way: web hooks. I first came across web hooks when I discovered Post-Commit Web Hooks in Google Code. The idea is you can create a web service that gets called every time you commit code to the Google Code repository. For example, each time you commit code you can call a web hook that uses the Twitter API to tweet details of what you just committed (I tried this for a while, until some of my Twitter followers got seriously pissed off by the volume of tweets this was generating).

What has this to do with making databases editable? Well, imagine the following scenario. A web page displays a publication, but no DOI. However, the web page embeds an OpenURL in the form of a COinS (in other words, a URL with key-value pairs describing the publication). If you use a tool such as the OpenURL Referrer in Firefox you can use an OpenURL resolver to find that publication. Examples of OpenURL resolvers include bioGUID and BioStor. Let's say you find the publication, and it has a DOI. How do you tell the database about this? Well, you can try and find an email address of someone running the database so you can send them the information, but this is a hassle. What if the OpenURL resolver that you used to find the DOI could automatically tell the source database that it's found the DOI? That's the idea behind web hooks.
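A COinS is just an empty span whose title attribute holds the key-value encoded citation, so generating one is trivial. A minimal sketch:

    from html import escape
    from urllib.parse import urlencode

    def coins(atitle, jtitle, volume, spage, date):
        """An empty span whose title attribute is the key-value encoded
        OpenURL ContextObject describing the reference."""
        kev = urlencode([
            ("ctx_ver", "Z39.88-2004"),
            ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
            ("rft.atitle", atitle),
            ("rft.jtitle", jtitle),
            ("rft.volume", volume),
            ("rft.spage", spage),
            ("rft.date", date),
        ])
        return '<span class="Z3988" title="%s"></span>' % escape(kev)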

I've started to experiment with this, and have most of the pieces working. Publication pages in Australian Faunal Directory on CouchDB have COinS that include two additional pieces of information: (1) the database identifier for the publication (in this case a UUID, in the hideously complex jargon of OpenURL this is the "Referring Entity Identifier"), and (2) the URL of the web hook. The idea is that an OpenURL resolver can take the OpenURL and try and locate the article. If it succeeds it will call the web hook URL supplied by the database, tell it "hey, I've found this DOI for the publication with this database identifier". The database can then update its data, so the next time a user visits the page for that publication in the database, the user will see the DOI. This has a huge advantage over tools that just modify the web page on the fly, such as David Shorthouse's reference parser: persistence. The database itself is updated, not just the web page.
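The resolver's side of the conversation is then a single POST. The payload field names here are whatever the resolver and database agree on; these are illustrative only:

    import urllib.request
    from urllib.parse import urlencode

    def notify_webhook(hook_url, record_uuid, doi):
        """Called by the resolver once it has found the article. The
        field names are whatever resolver and database agree on; these
        are illustrative only."""
        data = urlencode({"id": record_uuid, "doi": doi}).encode("utf-8")
        req = urllib.request.Request(hook_url, data=data)  # data makes it a POST
        return urllib.request.urlopen(req).read()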

In order to make this work, all the database needs to do is have a web hook, namely a URL that accepts POST requests. The heavy lifting of searching for the publication, or enabling users to correct and edit the data, can be devolved to a single place, namely the OpenURL resolver. As a first step I'm building an OpenURL resolver that displays a form in which the user can edit bibliographic details, and launch searches in CrossRef (and soon BioStor). When the user is done they can close the form, which is when it calls the web hook with the edited data. The database can then choose to accept or reject the update.
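Here's a minimal sketch of such a web hook, with the actual database update left as a stub:

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs

    class Hook(BaseHTTPRequestHandler):
        """Accept a POSTed (id, doi) pair; the database update is a stub."""
        def do_POST(self):
            length = int(self.headers["Content-Length"])
            fields = parse_qs(self.rfile.read(length).decode("utf-8"))
            record_id = fields["id"][0]
            doi = fields["doi"][0]
            # ... decide whether to accept the edit, then update the record ...
            print("update %s with doi %s" % (record_id, doi))
            self.send_response(200)
            self.end_headers()

    HTTPServer(("", 8080), Hook).serve_forever()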

Given that it's easy to create the web hook, and trivial to get a database to output an OpenURL with its internal identifier and the URL of the web hook, this seems like a light-weight way of making databases editable.

Linking taxonomic databases to the primary literature: BHL and the Australian Faunal Directory

Continuing my hobby horse of linking taxonomic databases to digitised literature, I've been working for the last couple of weeks on linking names in the Australian Faunal Directory (AFD) to articles in the Biodiversity Heritage Library (BHL). AFD is a list of all animals known to occur in Australia, and it provides much of the data for the recently released Atlas of Living Australia. The data is available as a series of CSV files, and these contain quite detailed bibliographic references. My initial interest was in using these to populate BioStor with articles, but it seemed worthwhile to try and link the names and articles together. The Atlas of Living Australia links to BHL, but only via a name search showing BHL items that have a name string. This wastes valuable information. AFD has citations to individual books and articles that relate to the taxonomy of Australian animals — we should treat that as first class data.

So, I cobbled together the CSV files, some scripts to extract references, ran them through the BioStor and bioGUID OpenURL resolvers, and dumped the whole thing in a CouchDB database. You can see the results at Australian Faunal Directory on CouchDB.
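In outline the pipeline looks something like this (the CSV column names and the resolver's JSON response are simplified guesses here, not the actual AFD schema):

    import csv
    import json
    import urllib.request
    from urllib.parse import urlencode

    COUCH = "http://localhost:5984/afd"  # CouchDB database, created beforehand

    def locate(reference):
        """Ask the BioStor OpenURL resolver about one reference; assumes
        the resolver can return JSON (shape simplified here)."""
        url = "http://biostor.org/openurl?" + urlencode(dict(reference, format="json"))
        return json.load(urllib.request.urlopen(url))

    with open("publications.csv") as f:  # one of the AFD CSV files
        for row in csv.DictReader(f):
            doc = {"_id": row["PUBLICATION_GUID"], "citation": row}
            doc["biostor"] = locate({
                "genre": "article",
                "title": row["JOURNAL"],
                "volume": row["VOLUME"],
                "spage": row["START_PAGE"],
            })
            # Store the document via CouchDB's REST API
            req = urllib.request.Request(
                COUCH + "/" + doc["_id"],
                data=json.dumps(doc).encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="PUT")
            urllib.request.urlopen(req)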

[Screenshot: the Australian Faunal Directory on CouchDB home page]


The site is modelled on my earlier experiment with putting the Catalogue of Life on CouchDB. It's still rather crude, and there's a lot of stuff I need to work on, but it should illustrate the basic idea. You can browse the taxonomic hierarchy, view alternative names for each taxon, and see a list of publications related to those names. If a publication has been found in BioStor then the site displays a thumbnail of the first page, and if you click on the reference you see a simple article viewer I wrote in Javascript.

[Screenshot: the article viewer showing BHL page images]


For PDFs I'm experimenting with using Google's PDF viewer (the inspiration for the viewer above):

[Screenshot: a PDF displayed in Google's PDF viewer]



How it was made
Although in principle linking AFD to BHL via BioStor was fairly straightforward, there are lots of little wrinkles, such as errors in bibliographic metadata, and failure to parse some reference strings. To help address this I created a public group on Mendeley where all the references I've extracted are stored. This makes it easy to correct errors, add identifiers such as DOIs and ISSNs, and upload PDFs. For each article a reference to the original record in AFD is maintained by storing the AFD identifier (a UUID) as a keyword.

The taxonomy and the mapping to literature is stored in a CouchDB database, which makes a lot of things (such as uploading new versions of documents) a breeze.

It's about the links
The underlying motivation is that we are awash in biodiversity data and digitisation projects, but these are rarely linked together. And it's more than just linking, it's bringing the data together so that we can compute over it. That's when things will start to get interesting.

PLoS doesn't "get" the iPad (or the web)

PLoS recently announced a dedicated iPad app that covers all the PLoS Journals, and which is available from the App Store. Given the statement that "PLoS is committed to continue pushing the boundaries of scientific communication" I was expecting something special. Instead, what we get (as shown in the video below) is a PDF viewer with a nice page turning effect (code here). Maybe it's Steve Jobs's fault for showing iBooks when he first demoed the iPad, but this desire to imitate 3D page turning effects leaves me cold (for a nice discussion of how this can lead to horribly mixed metaphors see iA's Designing for iPad: Reality Check).




But I think this app shows that PLoS really don't grok the iPad. Maybe it's early days, but I find it really disappointing that page-turning PDFs is the first thing they come up with. It's not about recreating the paper experience on a device! There's huge scope for interactivity, which the PLoS app simply ignores — you can't select text, and none of the references are clickable. It also ignores the web (without which, ironically, PLoS couldn't exist).

Instead of just moaning about this, I've spent a couple of days fussing with a simple demo of what could be done. I've taken a PLoS paper ("Discovery of the Largest Orbweaving Spider Species: The Evolution of Gigantism in Nephila", doi:10.1371/journal.pone.0007516), grabbed the XML, applied an XSLT style sheet to generate some HTML, and added a little Javascript functionality. References are displayed as clickable links inline. If you click on one a window pops up displaying the citation, and it then tries to find it for you online (for the technically minded, it's using OpenURL and bioGUID). If it succeeds it displays a blue arrow — click that and you're off to the publisher's web site to view the article.
[Screenshot: the pop-up reference window]
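The transformation step is a few lines with a library like lxml; the file names here are placeholders for my local copies of the article XML and the stylesheet:

    from lxml import etree

    # Fetch the article XML (PLoS publishes the XML for each article),
    # then transform it with a hand-written stylesheet. File names are mine.
    xml = etree.parse("journal.pone.0007516.xml")
    transform = etree.XSLT(etree.parse("plos2html.xsl"))
    html = transform(xml)

    with open("article.html", "wb") as f:
        f.write(etree.tostring(html, pretty_print=True, method="html"))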

Figures are also links: click one and you get a Lightbox view of the image.
You can view this article live, in a regular browser or on an iPad. Here's a video of the demonstration page:


This is all very crude and rushed. There's a lot more that could be done. For references we could flag which articles are self citations, we could talk to bookmarking services via their APIs to see which citations the reader already has, etc. We could also make data, sequences, and taxonomic names clickable, providing the reader with more information and avenues for exploration. Then there's the whole issue of figures. For graphs we should have the underlying data so that we can easily make new visualisations, phylogenies should be interactive (at least make the taxon names clickable), and there's no need to amalgamate figures into aggregates like Fig. 2 below. Each element (A-E) should be separately addressable so when the text refers to Fig. 2D we can show the user just that element.

[Figure 2 from doi:10.1371/journal.pone.0007516, a composite figure with elements A-E]

The PLoS app and reactions to Elsevier's "Article 2.0" (e.g., Elsevier's 'Article of the Future' resembles websites of the past and The "Article of the Future" — Just Lipstick Again?) suggest publishers are floundering in their efforts to get to grips with the web, and new platforms for interacting with the web.

So, PLoS, I challenge you to show us that you actually "get" the iPad and what it could mean for science publishing. Because at the moment, I've seen nothing that suggests you grasp the opportunity it represents. Better yet, why not revisit Elsevier's Article 2.0 project and have a challenge specifically about re-imagining the scientific article? And please, no more page turning effects.