Linked data that isn't: the failings of RDF

OK, a bit of hyperbole in the morning. One of the goals of RDF is to create the Semantic Web, an interwoven network of data seamlessly linked by shared identifiers and shared vocabularies. Everyone uses the same identifiers for the same things, and when they describe these things they use the same terms. Simples.

Of course, the reality is somewhat different. Typically people don't reuse identifiers, and there are usually several competing vocabularies we can choose from. To give a concrete example, consider two RDF documents describing the same article, one provided by CiNii, the other by CrossRef. The article is:

Astuti, D., Azuma, N., Suzuki, H., & Higashi, S. (2006). Phylogenetic Relationships Within Parrots (Psittacidae) Inferred from Mitochondrial Cytochrome-b Gene Sequences(Phylogeny). Zoological science, 23(2), 191-198. doi:10.2108/zsj.23.191

You can get RDF for a CiNii record by appending ".rdf" to the URL for the article, in this case http://ci.nii.ac.jp/naid/130000017049. For CrossRef you need a Linked Data compliant client, or you can do something like this:


curl -D - -L -H "Accept: application/rdf+xml" "http://dx.doi.org/10.2108/zsj.23.191"

You can view the RDF from these two sources here and here.

No shared identifiers
The two RDF documents have no shared identifiers, or at least, any identifiers they do share aren't described in a way that is easily discovered. The CrossRef record knows nothing about the CiNii record, but the CiNii document includes this statement:


<rdfs:seeAlso rdf:resource="http://ci.nii.ac.jp/lognavi?name=crossref
&amp;id=info:doi/10.2108/zsj.23.191" dc:title="CrossRef" />

So, CiNii knows about the DOI, but this doesn't help much as the CrossRef document has the URI "http://dx.doi.org/10.2108/zsj.23.191", so we don't have an explicit statement that the two documents refer to the same article.
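A client could paper over this by pulling the bare DOI out of whatever URI form it encounters and comparing those instead. Here is a minimal sketch of that idea; the function name `extract_doi` and the regular expression are my own assumptions (a loose heuristic, not the full DOI syntax), and neither CiNii nor CrossRef provides anything like this.

```python
import re
from urllib.parse import unquote

# Loose heuristic for a DOI: "10.", a registrant code, "/", then a suffix.
# This is an assumption for illustration, not the full DOI syntax.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>&]+")


def extract_doi(uri):
    """Return the lower-cased DOI embedded in a URI, or None if none is found."""
    match = DOI_PATTERN.search(unquote(uri))
    if match is None:
        return None
    # DOIs are case-insensitive, so lower-case them before comparing.
    return match.group(0).lower()


crossref_uri = "http://dx.doi.org/10.2108/zsj.23.191"
cinii_see_also = ("http://ci.nii.ac.jp/lognavi?name=crossref"
                  "&id=info:doi/10.2108/zsj.23.191")

# Both forms reduce to the same bare DOI.
print(extract_doi(crossref_uri) == extract_doi(cinii_see_also))
```

Of course, this only works because we already know where to look; nothing in either RDF document tells a generic client that these two strings identify the same thing.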

The other identifier the documents could share is the ISSN for the journal (0289-0003), but CiNii writes this without the "-", and uses the PRISM term "prism:issn", so we have:


<prism:issn>02890003</prism:issn>


whereas CrossRef writes the ISSN like this:


<ns0:issn xmlns:ns0="http://prismstandard.org/namespaces/basic/2.1/">
0289-0003</ns0:issn>


Unless we have a linked data client that normalises ISSNs before it does a SPARQL query we will miss the fact that these two articles are in the same journal.
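For what it's worth, the normalisation itself is simple. The sketch below (the function name `normalise_issn` is mine, not part of any linked data client) strips whitespace and hyphens, verifies the standard ISSN check digit, and returns the hyphenated form, so both of the values above end up identical.

```python
import re

# Seven digits followed by a check character (a digit or "X").
ISSN_SHAPE = re.compile(r"^\d{7}[\dX]$")


def normalise_issn(raw):
    """Return an ISSN as 'NNNN-NNNC', or None if it does not look valid."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    if not ISSN_SHAPE.match(compact):
        return None
    # Check digit: weights 8 down to 2 on the first seven digits, modulo 11.
    total = sum(int(d) * w for d, w in zip(compact[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    if compact[7] != expected:
        return None
    return compact[:4] + "-" + compact[4:]


cinii_value = "02890003"
crossref_value = "\n0289-0003"  # the CrossRef literal includes a line break

print(normalise_issn(cinii_value) == normalise_issn(crossref_value))
```

The catch is that this has to happen before the data is loaded or queried; a plain SPARQL match on the literal will not do it for you.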

Inconsistent vocabularies
Both CiNii and CrossRef use the PRISM vocabulary to describe the article, but they use different versions. CrossRef uses "http://prismstandard.org/namespaces/basic/2.1/" whereas CiNii uses "http://prismstandard.org/namespaces/basic/2.0/". Version 2.1 versus version 2.0 is a minor difference, but the URIs are different and hence they are different vocabularies (having version numbers in vocabulary URIs is asking for trouble). Hence, even if CiNii and CrossRef wrote ISSNs in the same way, we'd still not be able to assert that the articles come from the same journal.
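A client that wanted to treat the two versions as one vocabulary would have to collapse the version number itself. The following is a rough sketch of that workaround; the name `predicate_key` and the "prism:" comparison key are hypothetical, and it rests on the assumption that a term means the same thing in PRISM 2.0 and 2.1, which nothing in the RDF actually asserts.

```python
import re

# Matches the versioned PRISM "basic" namespace used by both documents.
PRISM_BASIC = re.compile(
    r"^http://prismstandard\.org/namespaces/basic/\d+\.\d+/"
)

# Hypothetical version-free prefix, used only as a local comparison key.
PRISM_KEY = "prism:"


def predicate_key(uri):
    """Collapse versioned PRISM predicate URIs onto one version-free key."""
    return PRISM_BASIC.sub(PRISM_KEY, uri)


cinii_issn = "http://prismstandard.org/namespaces/basic/2.0/issn"
crossref_issn = "http://prismstandard.org/namespaces/basic/2.1/issn"

# Different URIs, same key once the version is stripped.
print(predicate_key(cinii_issn) == predicate_key(crossref_issn))
```

This is exactly the kind of out-of-band knowledge that shared vocabularies were supposed to make unnecessary.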

Inconsistent use of vocabularies
Both CiNii and CrossRef use FOAF for author names, but they write the names differently:


<foaf:name xml:lang="en">Suzuki Hitoshi</foaf:name>


<ns0:name xmlns:ns0="http://xmlns.com/foaf/0.1/">Hitoshi Suzuki</ns0:name>


So, another missed opportunity to link the documents. One could argue this would be solved if we had consistent identifiers for authors, but we don't. In this case CiNii have their own local identifiers (e.g. http://ci.nii.ac.jp/nrid/1000040179239), and CrossRef has a rather hideous looking Skolemisation: http://id.crossref.org/contributor/hitoshi-suzuki-2gypi8bnqk7yy.
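In the absence of shared author identifiers, about the best a client can do is compare names in an order-insensitive way. The sketch below is deliberately crude, and `name_key` is a name I have made up for illustration: it will happily merge different people who share the same name parts, and it does nothing about initials or transliteration.

```python
def name_key(name):
    """Return an order-insensitive key built from the parts of a name."""
    # Treat commas as separators, ignore case, and sort the parts so that
    # "Suzuki Hitoshi" and "Hitoshi Suzuki" produce the same key.
    tokens = name.replace(",", " ").casefold().split()
    return tuple(sorted(tokens))


cinii_name = "Suzuki Hitoshi"
crossref_name = "Hitoshi Suzuki"

print(name_key(cinii_name) == name_key(crossref_name))
```

String matching like this is a guess rather than an assertion of identity, which is rather the point.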

In summary, it's a mess. Both CiNii and CrossRef are organisations whose core business is bibliographic metadata. It's great that both are serving RDF, but if we think this is anything more than providing metadata in a useful format, I think we may be deceiving ourselves.

Orwellian metadata: making journals disappear

I've been spending a lot of time recently mapping bibliographic citations for taxonomic names to digital identifiers (such as DOIs). This is tedious work at the best of times (despite lots of automation), but it is not helped by the somewhat Orwellian practices of some publishers. Occasionally, when an established journal gets renamed, the publisher retrospectively applies the new name to the previous journal. For example, in 2000 the journal Entomologica Scandinavica (ISSN 0013-8711) became Insect Systematics & Evolution (ISSN 1399-560X):


(diagram based on WorldCat xISSN history tool, rendered using Google Charts.)

Content for both Entomologica Scandinavica and Insect Systematics & Evolution is available from Ingenta's web site, but every article is listed as being in Insect Systematics & Evolution, and this is reflected in the metadata CrossRef has for each DOI.

For example, the paper
Andersen, N.M. & P.-p. Chen, 1993. A taxonomic revision of pondskater genus Gerris Fabricius in China, with two new species (Hemiptera: Gerridae). – Entomologica Scandinavica 24: 147-166

has the DOI doi:10.1163/187631293X00262 which resolves to a page saying this article was published in Insect Systematics & Evolution. The XML for the DOI says the same thing:



<issn type="print">1399560X</issn>
<issn type="electronic">1876312X</issn>
<journal_title>Insect Systematics &amp; Evolution</journal_title>


In one sense this is no big deal. If you know the DOI then that's all you need to use to refer to the article (and the sooner we abandon fussing with citation styles and just use DOIs the better).

But if you haven't yet found the DOI then this is a problem, because if I search CrossRef using the original journal name (Entomologica Scandinavica) I get nothing. As far as CrossRef is concerned the DOI doesn't exist. If, however, I happen to know that Entomologica Scandinavica is now Insect Systematics & Evolution, I can rewrite the query and retrieve the DOI.

It's bad enough dealing with taxonomic name changes without having to deal with journal name changes as well! It would be great if publishers didn't indulge in wholesale renaming of old journals, or if CrossRef had a mechanism (perhaps based on WorldCat's xISSN History Visualization Tool) to handle retrospectively renamed journals.
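Until something like that exists, the workaround on the client side is to expand a journal title into every name it has gone by before searching. The sketch below shows the shape of such a lookup; the `JOURNAL_HISTORY` table and the `title_variants` function are hypothetical, and the table is hand-built from the single example in this post rather than pulled from xISSN or CrossRef.

```python
# Hand-built from the one example above. A real table would need to come
# from a journal history source; this structure is purely illustrative.
JOURNAL_HISTORY = [
    [
        ("Entomologica Scandinavica", "0013-8711"),
        ("Insect Systematics & Evolution", "1399-560X"),
    ],
]


def title_variants(title):
    """Return every known title in the same history chain as `title`."""
    wanted = title.casefold()
    for chain in JOURNAL_HISTORY:
        titles = [known for known, _issn in chain]
        if any(known.casefold() == wanted for known in titles):
            return titles
    # Unknown journal: fall back to searching on the title as given.
    return [title]


# Try each variant in turn when searching for an article's DOI.
for variant in title_variants("Entomologica Scandinavica"):
    print(variant)
```

Searching on each variant in turn would have found the Gerris paper without my needing to know about the renaming in advance.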