In this post I discuss the reverse problem, combining two or more distinct references into one. I've been uploading large collections of references based on harvesting metadata for journal articles. Although the metadata isn't perfect, it's usually pretty good, and in many cases linked to Open Access content in BioStor. References that I upload appear in public groups listed on my profile, such as the group Proceedings of the Entomological Society of Washington.
Reverse engineering Mendeley
In the absence of a good description by Mendeley of how their tools work, we have to try to figure it out ourselves. If you click on a reference that has been recently added to Mendeley you get a URL that looks like this: http://www.mendeley.com/c/3708087012/g/584201/magalhaes-2008-a-new-species-of-kingsleya-from-the-yanomami-indians-area-in-the-upper-rio-orinoco-venezuela-crustacea-decapoda-brachyura-pseudothelphusidae/ where 584201 is the group id, 3708087012 is the "remoteId" of the document (this is what it's called in the SQLite database that underlies the desktop client), and the rest of the URL is the article title, minus stop words.
After a while (perhaps a day or so) Mendeley gets around to trying to merge the references I've added with those it already knows about, and the URLs lose the group and remoteId and look like this: http://www.mendeley.com/research/review-genus-saemundssonia-timmerman-phthiraptera-philopteridae-alcidae-aves-charadriiformes-including-new-species-new-host/ . Let's call this document the "canonical document" (this document also has a UUID, which is what the Mendeley API uses to retrieve the document). Once the document gets one of these URLs Mendeley will also display how many people are "reading" that document, and whether anyone has tagged it.
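To make the two URL forms concrete, here is a minimal Python sketch that tells them apart and pulls out the remoteId and group id. The regular expressions are my guess based only on the two example URLs above, not a documented Mendeley format, and the function name `parse_mendeley_url` is made up for illustration.

```python
import re
from typing import Optional

# ASSUMPTION: both patterns are inferred from the example URLs in this post,
# not from any documentation published by Mendeley.
GROUP_URL = re.compile(
    r"^https?://www\.mendeley\.com/c/(?P<remote_id>\d+)/g/(?P<group_id>\d+)/(?P<slug>[^/]+)/?$"
)
CANONICAL_URL = re.compile(
    r"^https?://www\.mendeley\.com/research/(?P<slug>[^/]+)/?$"
)


def parse_mendeley_url(url: str) -> Optional[dict]:
    """Classify a Mendeley document URL as 'group' or 'canonical'.

    Returns None if the URL matches neither guessed pattern.
    """
    m = GROUP_URL.match(url)
    if m:
        # Freshly added document: still tied to a group and a remoteId.
        return {
            "kind": "group",
            "remote_id": m.group("remote_id"),
            "group_id": m.group("group_id"),
            "slug": m.group("slug"),
        }
    m = CANONICAL_URL.match(url)
    if m:
        # Document after merging: only the title slug remains in the URL.
        return {"kind": "canonical", "slug": m.group("slug")}
    return None
```

Running this over the URLs of every document in a group would at least show which references have already been through the merging step and which have not.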
But that's not my paper!
The problem is that sometimes (and more often than I'd like) the canonical document bears little relation to the document I uploaded. For example, here is a paper that I uploaded to the group Proceedings of the Entomological Society of Washington:
Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records by Roger D Price, Ricardo L Palma, Dale H Clayton, Proceedings of the Entomological Society of Washington, 105(4):915-924 (2003).
You can see the actual paper in BioStor: http://biostor.org/reference/57185. To see the paper in the Mendeley group, browse it using the tag Phthiraptera:
Note the 2, indicating that two people (including myself) have this paper in their library. The URL for this paper is http://www.mendeley.com/research/review-genus-saemundssonia-timmerman-phthiraptera-philopteridae-alcidae-aves-charadriiformes-including-new-species-new-host/, but this is not the paper I added!
What Mendeley displays for this URL is this:
Not only is this not the paper I added, there is no such paper! There is a paper entitled "A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar", but that is by Harry Brailovsky, not Clayton and Price (you can see this paper in BioStor as http://biostor.org/reference/55669). The BioStor link for the phantom paper displayed by Mendeley, http://biostor.org/reference/55761, is for a third paper "A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States : New records and range extensions". The table below shows the original details for the paper, the details for the "canonical paper" created by Mendeley, and the details for two papers that have some of the bibliographic details in common with this non-existent paper (highlighted in bold).
Field | Original paper | Mendeley | Related paper 1 | Related paper 2 |
---|---|---|---|---|
Title | Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records | A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar | A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar | A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States : New records and range extensions |
Author(s) | Roger D Price, Ricardo L Palma, Dale H Clayton | DH Clayton, RD Price | Harry Brailovsky | |
Volume | 105 | 105 | 104 | 107 |
Pages | 915-924 | 915-924 | 111-118 | 917-940 |
BioStor | 57185 | 55761 | 55669 | 55761 |
As you can see it's a bit of a mess. Now, finding and merging duplicates is a hard problem (see doi:10.1145/1141753.1141817 for some background), but I'm struggling to see why these documents were considered to be duplicates.
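To illustrate why the match looks so odd, here is a rough Python sketch that compares the original record with the "canonical" record from the table above, field by field. This is emphatically not Mendeley's algorithm (which I don't know). The token-overlap (Jaccard) measure, the 0.8 threshold, and the names `jaccard`, `field_agreement` and `plausible_duplicate` are all assumptions chosen for illustration.

```python
import re

# Records copied from the table above.
original = {
    "title": ("Review of the genus Saemundssonia Timmermann (Phthiraptera: "
              "Philopteridae) from the Alcidae (Aves: Charadriiformes), "
              "including a new species and new host records"),
    "volume": "105",
    "pages": "915-924",
}
mendeley = {
    "title": ("A new genus and a new species of Daladerini (Hemiptera: "
              "Heteroptera: Coreidae) from Madagascar"),
    "volume": "105",
    "pages": "915-924",
}


def title_tokens(title):
    """Lower-case a title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Token-set overlap of two titles: 0.0 is disjoint, 1.0 is identical."""
    ta, tb = title_tokens(a), title_tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def field_agreement(x, y):
    """Report which of the non-title fields are exactly equal."""
    return {field: x[field] == y[field] for field in ("volume", "pages")}


def plausible_duplicate(x, y, min_title_similarity=0.8):
    """ASSUMPTION: an arbitrary rule that requires the titles to largely agree."""
    return jaccard(x["title"], y["title"]) >= min_title_similarity


print(field_agreement(original, mendeley))
print(plausible_duplicate(original, mendeley))
```

The volume and pages agree, but the titles share little beyond common words such as "genus", "new" and "species", so even a crude rule like this one would refuse to merge the two records.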
What I'd like to see
I'm a big fan of Mendeley, so I'd like to see this problem fixed. What I'd really like to see is the following:
- Mendeley publish a description of how their de-duplication algorithms work
- Mendeley describe the series of steps a document goes through as they process it (if nothing else, so that users can make sense of the multiple URLs a document may get over its lifetime in Mendeley).
- For each canonical reference, Mendeley show the set of documents that have been merged to create that canonical reference, and display some measure of their confidence that the match is genuine.
- Mendeley enables users to provide feedback on a canonical document (e.g., a button by each document in the set that enables the user to say "yes this is a match" or "no, this isn't a match").
Perhaps it would also be useful if Mendeley (or the community) assembled a test collection of documents that contains duplicates, together with the set of canonical documents this collection actually contains, and used this to evaluate alternative algorithms for finding duplicates. Let's make this a "challenge" with prizes! In many ways I'd be much more impressed by a de-duplication challenge than the DataTEL challenge, especially as it seems clear that Mendeley readership data is too sparse to generate useful recommendations (see Mendeley Data vs. Netflix Data).
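As a rough idea of how entries to such a challenge might be scored, here is a small Python sketch that compares an algorithm's clusters of documents against a hand-curated gold standard using pairwise precision and recall. The document ids are invented, and pairwise scoring is just one common way to evaluate de-duplication, not something Mendeley has proposed.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Expand clusters of document ids into the set of implied duplicate pairs."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes each pair order-independent, e.g. always ("a1", "a2").
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted, gold):
    """Return (precision, recall) of predicted duplicate pairs against gold pairs."""
    pred = duplicate_pairs(predicted)
    true = duplicate_pairs(gold)
    correct = len(pred & true)
    # By convention, score 1.0 when there is nothing to get wrong.
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(true) if true else 1.0
    return precision, recall


# Hypothetical ids: a1 and a2 are two uploads of the same paper,
# while b1 and c1 are different papers.
gold = [{"a1", "a2"}, {"b1"}, {"c1"}]
# An over-eager algorithm that also merges b1 into the first cluster.
predicted = [{"a1", "a2", "b1"}, {"c1"}]

precision, recall = pairwise_scores(predicted, gold)
```

In this toy case the algorithm finds the one real duplicate pair (perfect recall) but two of its three merged pairs are wrong, which is exactly the kind of error described in this post.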