Happy New Year

I would like to wish a Happy New Year/ Szczęśliwego Nowego Roku/ to any readers who have taken a break from the shopping and getting ready for the party to look in. I'd like to thank you all for taking an interest in some of this stuff, whether you agree with me or not. I've had some fun and found it useful putting it down on paper and it is satisfying that readership figures are for some reason taking off quite noticeably.

I was hoping to do a review of the year on the "PACHI" blog, but when I wrote it, it was neither very interesting, nor very edifying. In fact it was downright discouraging. But that in itself encourages thinking more about a way forward, doesn't it?

So here's a topical picture of a Slavic-looking Janus looking back at a difficult and somewhat dissatisfying year when seen from the point of view of collecting issues but looking forward with brighter expectation to some long overdue changes in the situation surrounding no-questions-asked and irresponsible erosive collecting of archaeological artefacts.

Why no Light? Who is Keeping the PAS Under a Bushel?

"Gill obviously drove the agenda: some faint praise for the system combined with much rehashing of various (mostly old) complaints about it"
opines a Washington lawyer about the Papers from the Institute of Archaeology forum I discussed earlier. If the PAS was doing its job (and doing what the Hawkshead Review specifically told it to do - which was to engage with the archaeological milieu over issues like these) then there would long ago have been discussion of these "old" queries about its precise role in the protection and preservation of the archaeological heritage of England and Wales.

In fact some of these "old" points were specifically noted in the 2004 "Hawkshead Review" of the Scheme (pp 33, 45, 48, 54-5, 60-1) and the PAS urged to address some of the concerns even then being expressed by those in the archaeological and heritage management communities (pp 34, 45-6, 56, 60-1). The report urges the PAS to make "protecting the public interest in safeguarding the historic environment" a key aim (p. 61). Gill's discussion six years on raises the question of to what degree the PAS heeded the specific recommendations of the report in these regards.

As it is, what happened was that some texts were published in an archaeological peer-reviewed academic publication in which the voice of the PAS is missing. This is not because anyone "organized" it this way; my understanding is that the PAS refused to engage in this discussion. One may speculate as to the reasons why that is.

The lawyer seeks shock-horror scandal even in an academic publication, but seems to be unaware of even the basic features of how formal round table debates of this type are organized by the editors of academic publications. Brian Hole should be given the credit for all his work to make this happen.

CPO: Gill Inspired Papers Provide More Heat than Light on Benefits of PAS

Bush on Antiquity Looting in Iraq

Larry Rothfield (author of 'The Rape of Mesopotamia', 2009 - ISBN: 9780226729459) has an interesting post on his 'Punching Bag' blog. In it he takes a good hard look at G.W. Bush's own account in his autobiography "Decision Points" of his "surprise" at the looting in Iraq associated with the US-led invasion ('Bush's ghostwriters on the looting of the Iraq National Museum'). Bush skims over the topic.

The Plant List: nice data, shame it's not open

The Plant List (http://www.theplantlist.org/) has been released today, complete with glowing press releases. The list includes some 1,040,426 names. I eagerly looked for the Download button, but none is to be found. You can download individual search results (say, at family level), but not the whole data set.

OK, so that makes getting the complete data set a little tedious (there are 620 plant families in the data set), but we can still do it without too much hassle (in fact, I've grabbed the complete data set while writing this blog post). Then I see that the data is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license. Creative Commons is good, right? In this case, not so much. The CC BY-NC-ND license includes the clause:
You may not alter, transform, or build upon this work.
So, you can look but not touch. You can't take this data (properly attributed, of course) and build your own list, for example with references linked to DOIs, or to the Biodiversity Heritage Library (which is, of course, exactly what I plan to do). That's a derivative work, and the creators of the Plant List don't want you to do that. Despite this, the Plant List want us to use the data:
Use of the content (such as the classification, synonymised species checklist, and scientific names) for publications and databases by individuals and organizations for not-for-profit usage is encouraged, on condition that full and precise credit is given to The Plant List and the conditions of the Creative Commons Licence are observed.
Great, but you've pretty much killed that by using BY-NC-ND. Then there's this:
If you wish to use the content on a public portal or webpage you are required to contact The Plant List editors at editors@theplantlist.org to request written permission and to ensure that credits are properly made.
Really? The whole point of Creative Commons is that the permissions are explicit in the license. So, actually, I don't need your permission to use the data on a public portal: CC BY-NC-ND gives me permission (but with the crippling limitation that I can't make a derivative work).

So, instead of writing a post congratulating the Royal Botanic Gardens, Kew and Missouri Botanical Garden (MOBOT) for releasing this data, I'm left spluttering in disbelief that they would hamstring its use through such a poor choice of license. Kew and MOBOT could have made the Plant List available as open data using one of the licenses listed on the Open Definition web site, such as putting the data in the public domain (for example, by using a Creative Commons CC0 license). Instead, they've chosen a restrictive license which makes the data closed, effectively killing the possibility for people to build upon the effort they've put into creating the list. Why do biodiversity data providers seem determined to cling to data for dear life, rather than open it up and let people realise its potential?

PAS has Set Targets?

In his reply to David Gill's paper, Trevor Austin of the National Council of Metal Detectorists asserts that there is a reason why more finds are not being reported by the PAS, despite the high values of the "guestimate[s] of yearly detector finds promoted by some protagonists". Apparently the total number of such finds would in any case "be impossible for the PAS to ever record" as mitigation for the information loss. According to Austin, "to blame detectorists for under recording is totally without foundation as the resources of the PAS are simply too few: accordingly at this funding level it can only ever achieve a token figure". Austin seems to lose sight of the fact that to carry on causing erosion of the archaeological record knowing that there is no possibility of mitigating that damage by recording is hardly what one would term "best practice" or responsible behaviour. Austin says however that information loss is not the fault of the metal detectorist, but the fault of the PAS, or rather the MLA, for according to Austin,
Under its current funding masters, the Museums Libraries and Archives Council, the PAS has a set target of 55,000 items recorded per year.
I am not clear whether he means "artefacts" or "records". Let us assume he means the latter (recording the findspot details of a bag of twenty Roman sherds, or a pot of Roman coins on the database takes little longer than the handling time required to do the same for a single find). What that figure represents is 25 objects recorded a week (five a day then) by each FLO [not counting attendance at rallies] and each one costing the taxpayer 23 quid.

The latest of a series of figures for annual recording numbers to emerge from the PAS reveals an interesting contrast to what Austin asserts:

Records    Finds recorded    Year of recording
   3476              4588    1998
   6128              8201    1999
  11323             18106    2000
  11481             16368    2001
   8164             11996    2002
  14657             21684    2003
  26383             39000    2004
  33919             52202    2005
  37502             58311    2006
  49308             79052    2007
  37455             56449    2008
  39981             66481    2009
 112893            190091    2010

Basically there is not a lot of evidence from this that the PAS has been reaching that alleged upper limit of the annual number of records. Only in 2010 did more than Austin's figure of 55000 records come onto the PAS database (and this was achieved by incorporating a pre-existing database compiled by somebody else). Even if we take Austin's words literally and look at the number of individual objects represented by those records, we see that only from 2006 onwards has the alleged upper limit set by the paymasters of the PAS been exceeded, in 2007 and 2010 quite considerably. In the years before 2006, however, the alleged upper limit set by the MLA paymasters cannot explain the smaller numbers of both records and finds entering the PAS database from the collecting activities of 10 000 metal detectorists over a period of thirteen years.

If the PAS was receiving enough artefacts to keep it working to a 55000 objects/year capacity, it would have reached its 400 000th record in a little over seven years, not thirteen.
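For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the figures are exactly as given in the table above; the names used (PAS_FIGURES, ALLEGED_TARGET and so on) are illustrative labels only, not anything taken from PAS or MLA documentation, and the 55000 threshold is simply Austin's alleged figure.

```python
# Annual PAS figures as quoted in the table above: year -> (records, finds).
PAS_FIGURES = {
    1998: (3476, 4588),
    1999: (6128, 8201),
    2000: (11323, 18106),
    2001: (11481, 16368),
    2002: (8164, 11996),
    2003: (14657, 21684),
    2004: (26383, 39000),
    2005: (33919, 52202),
    2006: (37502, 58311),
    2007: (49308, 79052),
    2008: (37455, 56449),
    2009: (39981, 66481),
    2010: (112893, 190091),
}

# Austin's alleged annual limit (an assumption taken from his text, not a documented target).
ALLEGED_TARGET = 55_000


def years_exceeding(column, target=ALLEGED_TARGET):
    """Return the years in which the chosen column (0 = records, 1 = finds) exceeds the target."""
    return [year for year, row in sorted(PAS_FIGURES.items()) if row[column] > target]


records_over = years_exceeding(0)
finds_over = years_exceeding(1)

# Cumulative number of records over the thirteen years in the table.
total_records = sum(records for records, _ in PAS_FIGURES.values())

# How long 400 000 records would take at a steady 55000 a year.
years_at_target = 400_000 / ALLEGED_TARGET

print("Years with records above the alleged limit:", records_over)
print("Years with finds above the alleged limit:", finds_over)
print("Total records 1998-2010:", total_records)
print("Years needed at 55000 a year:", round(years_at_target, 2))
```

Run as it stands, this lists 2010 as the only year in which the number of records exceeds the alleged limit and 2006 to 2010 as the years in which the number of finds does. The thirteen-year total comes out a little short of 400 000 records, while a steady 55000 a year would have got there in a little over seven years.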

This is just another of those self-justificatory deceits put out by collectors, isn't it? It's yet another attempt to shift the blame from the artefact collector (portrayed as a victim here) to the archaeological establishment. We've seen this so many times before with the ACCG; British metal detectorists are no better. Austin's texts, like the detecting forums, are full of this "it's not are fault" nonsense. I suggest that if collectors (and that goes for metal detectorists as well) want to be seen as responsible, they should take responsibility for the way they conduct their hobby and for the effects of what they do, and not constantly attempt to show that it is the 'other side' that is responsible for what can only be seen as their failures.

Nobody MAKES them go metal detecting. What is asked of them though is that they do so in a manner which is sustainable and as non-damaging as possible, and where even minimal erosion of the archaeological resource is mitigated by proper and detailed recording. The reason why the PAS is not getting more records in its database is that, though it busts a gut to get them, not all detectorists are showing all their finds. But then the "number of objects in the database" is not the most important characteristic of PAS outreach. Austin's text shows clearly just how much of a failure those other aspects have been, and are ever likely to be.

I bet neither Trevor Austin nor any of the metal detectorists he represents (so that's ten thousand of you) can post up here in the comments a single reference to a document in the public domain which confirms the existence of an official '55 000 objects a year' fixed limit to PAS annual activity set by the MLA, beyond which the PAS is not permitted to extend.

Cyprus Policy on Looted Artefacts

There has been some discussion surrounding Sam Hardy's recent text 'archaeologists accepted Greek Cypriot looting of Alaas, Cyprus?' based on the evidence of the archaeological documentation of material in private hands. I mentioned it here ('Cyprus like PAS'), drawing attention to the parallels between the object-centred approach exhibited by the Cypriot authorities here and that of the PAS in the UK. Peter Tompa disagrees with this ('Cypriot Corruption Not Like PAS') and rather oddly says the differences lie in the "corruption" which he suggests these foreign collections embody. I questioned what Tompa had said in two posts: in one ('UK Treasure Act "Predicated on the Rule of Law" - Eh?') pointing out the discrepancy between what the Washington lawyer had said about the legal position in my own country and the real legal context; in a second ('Cyprus Collections Against the Law?') citing the Cypriot legislation on the basis of which, despite what Tompa thinks, the private possession of antiquities is not forbidden. Sam Hardy has now clarified the reasons behind the Cypriot policy (which I for one was not questioning, though I think they are wrong-headed) in 'Barford on Cypriot antiquities looting policy logic: clarification'.

Once again, we see that the xenophobic obfuscations of the collectors' lobbyists act to deflect discussion. We do seem to be getting far from the original topic which is that of archaeological ethics and the handling of looted and potentially looted material. I am glad to see that Hardy has brought the discussion back to that point.

I was taken by the concluding comments in ' online essay "International Trade in Looted Antiquities, www.plunderedpast". Though concerning a totally different part of the looted past, they seem to fit perfectly the situation here:
We need to present forcefully [...] the idea that the past is not disparate things, things which are owned by individuals, that it is those things in their cultural context which permits an understanding of the past. We need to present graphically the destruction that looting causes, the racist attitudes involved in dealing and collecting, and the corruption of virtually everyone this activity leads to. [...] In the long run it is only an informed public that will make the antiquities market unprofitable and hence nonviable.

The opinions expressed by those US collectors like Mr Tompa and his sidekicks who declare themselves to be "cultural property internationalists" are in fact deeply embedded in a corrupt colonialist and imperialist ideology which, if one examines its philosophy in any detail, is the purest expression of cultural nationalism.

UPDATE 2/1/11

Sam Hardy, the author of the original post about archaeologists recording privately owned artefacts in Cyprus has asked Tompa to clarify his position (Tompa's incorrect claim on looted Cypriot antiquities collecting). He says Tompa's words indicate that he and his fellow American collectors want to have the same access to looted antiquities as Cypriot collectors.
If you object to the fact that 'the connected few are allowed to collect as much looted material as they want', do you object to anyone collecting looted material, in which case you would surely support American import restrictions, as well as [additional] Cypriot acquisition restrictions? Otherwise, does your objection have nothing whatsoever to do with Cypriot collecters' purchases underwriting looting? Instead, do you object to the fact that you were not able to buy looted Cypriot antiquities?
This cuts to the core of the matter. Despite all the talk of "fairness" and "discrimination", so-called enlightened "cosmopolitanism"/ "internationalism", when you strip away the facade what the US antiquity dealers are campaigning for is the "right" to legally import illegally exported artefacts, no matter where they come from. This is nothing more than colonialism. And these dealers and their supporters accuse others of being corrupt!

Gill on the Portable Antiquities Scheme as Preservation

The Papers from the Institute of Archaeology volume 20 (available online) has a forum with the keynote paper by David Gill "The Portable Antiquities Scheme and the Treasure Act: Protecting the Archaeology of England and Wales?". Apart from the editor's introduction, this comprises the following seven components:

Keynote text by David W. J. Gill: The Portable Antiquities Scheme and the Treasure Act: Protecting the Archaeology of England and Wales?

Trevor Austin: The Portable Antiquities Scheme and the Treasure Act: Protecting the Archaeology of England and Wales? A Response.

Paul Barford: Archaeology, Collectors and Preservation: a Reply to David Gill

Gabriel Moshenska: Portable Antiquities, Pragmatism and the ‘Precious Things’

Colin Renfrew: Comment on the Paper by David Gill

Sally Worrell: The Crosby Garrett Helmet

David W. J. Gill: Reply to Austin, Barford, Moshenska, Renfrew and Worrell

I understand that Roger Bland, head of the PAS, was invited to comment, but did not avail himself of the opportunity ("Unfortunately due to the sensitivity of the subject, PAS itself was less willing to contribute"). The five comments are notable for their varied approach. Renfrew's was quite short, Worrell's concentrated on a single aspect, my own was typically long-winded. Austin's and Moshenska's were real eye-openers.

I'd like to comment on several of the responses in more detail below (Austin, Barford (!), Moshenska, Worrell/PAS).

Gill on PAS as Preservation (4): Gabriel Moshenska responds


Gabriel Moshenska of London University's Institute of Archaeology sent a text, “Portable Antiquities, Pragmatism and the ‘Precious Things’”, which, I have to say, and typically for the supporters of artefact hunting, totally misrepresents the nature of the debate. It uses a series of straw man arguments to engage with some imaginary hysterical critics of artefact hunting, failing to engage with their real arguments and concerns (which he dismisses as "staggeringly unimportant") and ultimately failing even to engage with what Gill wrote. One is left to wonder why this is.

Moshenska fetishises the “finds” at the expense of allowing the question Gill asks about the site they come from to surface. This is well demonstrated by the analogy the author chooses in his response to Gill to describe what he sees as a form of “hysteria” surrounding the debate on policies connected with artefact hunting: “Within archaeology the small faction of anti-metal detector zealots often resemble the grotesque Tubbs in The League of Gentlemen clutching her snow-globes and shrieking ‘Don’t touch the Precious Things!’ (BBC 1999-2002)”. Gill, however, was talking about sites, not about who owns the artefacts taken from them. That the latter is the focus of the collectors’ and dealers’ debate is understandable; it is less understandable to see it here from the pen of an archaeologist, albeit, as can be seen, a supporter of the PAS.

Moshenska declares himself to be a “pragmatist”. He states that “the campaigns against […] metal detectors” (sic) are characterised by an “unwillingness to consider the wider context”. Actually, I would disagree: it is supporters of the artefact hunters like Moshenska who are quite demonstrably failing to see artefact hunting, and Britain’s limp-wristed response to it, in the wider context of its relationship to the debate on commercial looting of the global archaeological heritage. I do not know if Moshenska has heard of the Monuments At Risk Surveys. He makes no reference to them when writing:

It would be instructive to create a […] chart ranking the various threats to archaeological heritage in Britain; from coastal erosion and ploughing to worms and moles. Despite serving as a lightning-rod for knee-jerk heritage protectionism I seriously doubt that metal detecting would make a prominent appearance on any such ranking. Thus not only is the metal detecting debate needlessly divisive and intemperate, it is also staggeringly unimportant.
Astoundingly we are told as if it needed no explanation or justification:
There are parts of the world where looting poses a serious threat to archaeological heritage and our ability to interpret the past. Britain is not one of these places. Nonetheless there are serious threats to archaeological heritage in Britain. Metal detecting is not one of these.
For Moshenska the consideration of artefact hunting as in any way related to looting is therefore “unhelpful”. Like US coiney Dave Welsh, he points out that metal detecting is done in fields and in ploughed fields “buried artefacts are annually shuffled through the upper half metre of topsoil, bringing them within the limited range of most modern metal detectors”. Like Austin he denies that “some undisturbed archaeological material is being removed from its archaeological context” below the ploughsoil. He thus ridicules the concern expressed about the implications of “depth advantage” metal detecting (discussed elsewhere in this blog) as “incongruous” and makes the astonishing statement that: “if we are truly concerned with the protection of archaeological heritage then this is of roughly equivalent unimportance to the question of whether rabbits are digging deeper burrows in response to global warming”. Except rabbits are not doing what they do in response to UK government policy on archaeological heritage mismanagement.

In response to Gill’s concerns, Moshenska seems to be expressing an opinion that it is not important that the Scheme is not providing much mitigation of information loss due to artefact hunting, because it is a “voluntary recording scheme”. This rather misses the point of whether better mitigation would not be provided if it were not. The respondent accuses Gill of “explicit injustice towards PAS” and “its hard-earned relationship with the metal detecting community [which] offers a practical, pragmatic and proven solution to this problem [“metal detecting without reporting finds”]” (except it does not) as if that was the only concern that Gill had raised. Moshenska also points out that “making money from selling finds is not inherently illegal in Britain”. But Gill was discussing an ethical issue.

Moshenska considers that rather than “to bridge the gap between the archaeological community and those involved in metal detecting”, the task in hand is “to mend the divide within the archaeological community” caused by debating collecting issues. He falls into the well-worn trope of referring to the archaeological community’s “widespread elitism and class snobbery” concerning artefact hunters. (There is an egregious example of a twisted sentence when he says: “The amateur’s disdain for the professional has no place in twenty-first century archaeology”; I’m pretty sure he meant that to go the other way round.) He dismisses those who question current policies on collecting as “doom-mongers wringing their hands at what they no doubt regard as metal detectorists’ proletarian insurgency into the archaeological domain”. Perhaps he would do well to read what the concerns are: they are rather about the inability of PAS outreach to bring ten thousand (or however many it is) artefact hunters and collectors, proletarian or not, into the archaeological fold. Moshenska therefore also sees non-compliance as a legacy of “the history of the ‘STOP’ campaign and the long-standing animosity between metal detectorists and the archaeological establishment” (“some opponents of metal detecting would like to see it made illegal, or at least severely restricted”). He reckons that “doom-mongers wringing their hands” at the damage artefact hunting is doing to the archaeological record and archaeology as a discipline should turn their attention instead to what he regards as “the real, tangible threats to archaeological heritage”. As if looting for entertainment and profit were not in fact a real and tangible threat to the global archaeological heritage.

Frankly I see Moshenska’s response as an archetypical expression of the failure of its supporters to see UK metal detecting in its wider context and I consider this a disappointing, rather flat and flippant contribution to the discussion.



----------------------------------------------------------------

Any non-British reader confused by the "precious things" reference might (or might not) appreciate this YouTube clip from the BBC series to which Moshenska refers:



I understand there's good metal detecting land around Royston Vasey (aka Hadfield), but (unlike Crosby Garrett 130 km up the M6) apparently some of the locals do not take too kindly to 'outsiders' in their fields.

Gill on PAS as Preservation (3): Barford responds


One of the respondents to David Gill was archaeoblogger Paul Barford who overran the word limit with his long-winded response to the topics raised ('Archaeology, Collectors and Preservation: a Reply to David Gill'). Anyone who has read what this bloke has said elsewhere will recognise that there is not much there which he has not already said many times.

It is, however, worth drawing attention to the fact that although Barford is often labelled "anti-PAS" by his opponents (and the PAS refuse to talk to him!), his text actually contains a call to strengthen PAS by incorporating it into legislation with a permanent place in the heritage preservation system and a permanent budget, which would strengthen its position immensely.

Gill on PAS as Preservation (2): Trevor Austin responds

Austin has written a somewhat confrontational and at the same time defensive response to Gill’s article. The text shows very well the mindset of collectors which one is up against in any attempt to collaborate with them. It will be noted that while Gill wrote about current policies and how they reflect the protection of the archaeological information contained in the archaeological record, Austin has concentrated his reply on protecting the hobby he represents from any kind of questioning. I would have thought however that one of the characteristics of the "responsible detectorist" (which is what Austin's NCMD claims to represent) is to be concerned about the sort of issues that Gill raises. Austin has set out instead primarily to show why PAS recording figures are not as high as might be hoped. In this sense the choice of Austin as the respondent was a particularly good (or bad) one depending on one’s position in the debate.

Instead of discussing what Gill wrote about site preservation, Austin expends a substantial amount of the space allotted to him on showing “when and why the PAS came into being” as an antidote to what he says is a text based on “rely[ing] on selective published PAS statistics or anecdotal and bigoted statements made by uninformed self opiniated groups who have no practical knowledge of the hobby to support his hypothesis”.

Austin's historical presentation is somewhat one-sided; the reader would do well to see it in the context of Addyman 2009 and Thomas 2009 and earlier literature. Significant is Austin’s conclusion that information loss is the result of “decades of archaeological non-co-operation” (in what?) and a refusal of archaeology to see the “opportunity metal detecting presented them with” (what, to have all exposed archaeological sites stripped of metal objects?). This is not really true. What however is true and can be documented is that when the archaeologists (CBA and EH) set about creating a report showing the potential benefits of that co-operation, it was Mr Austin’s organization, the NCMD, which refused to take part in its writing. Austin has tried this "we wanted to build bridges but it was the archaeologists who turned us away" ploy before (Austin 2009), but in fact the antagonism had its origin in both sides: it was the artefact hunters who saw archaeology as the threat to their new hobby.

Austin is derisive of the HA artefact erosion counter mentioned by Gill, but offers no figures of his own. He however offers several excuses why the PAS database does not contain all of the finds taken by artefact hunters:

1) the above-mentioned perceived antagonism of the archaeological world to artefact hunting and collecting,

2) the landowner may withhold permission (but then the responsible detectorist would avoid working such land),

3) the PAS does not record finds less than 300 years old, so such finds do not appear on its database (but then, neither do they appear in the HA erosion counter, do they?),

4) the problem is the low funding level applied to support metal detecting. The PAS has limited resources and Austin alleges that when it has reached its annual target (55000 objects), it turns artefact hunters away. Spread over ten thousand detectorists, that is only about five objects per detectorist per year. (Frankly that is the first time I've heard anything like this; certainly this calls for PAS clarification: are they turning detectorists away?)

Nowhere does Austin acknowledge that a reason why the PAS database contains only what he admits is a „token figure” of records is that UK metal detectorists are digging stuff up they have no intention of reporting. Eighty percent of the material on the UKDFD last year was not reported to the PAS.

Austin simply does not accept that artefacts are sometimes removed by metal detectorists and other diggers from undisturbed archaeological deposits from below topsoil/ploughsoil levels. This is according to him „mere speculation and just another example of the uncorroberated statements levelled at the hobby and PAS alike”. Well, there is plenty of evidence otherwise (some of it mentioned in this blog). Personally, having seen the deep holes dug into sites by artefact hunters (both in the media they themselves produce as well as in the field) I would see the opposite statement as an uncorroborated one. But then, if this were not so, why would artefact hunters be interested in getting „depth advantage” machines like the GPX 5000?

Austin is derisive of the mention of the Icklingham bronzes (notable for the puerile interjection: “(excuse me while I pick up my violin)” which the Papers' editor thoughtfully left) and dismisses mentions of illegal artefact hunting as mere “scaremongering”. He suggests that the mention of this egregious case of looting by Gill is a result of “archaeology” having “little new information to add to this issue”. He also states that the farmer at Icklingham whose land is raided “remains a single example”. Well, first of all, it is not archaeology but the police who investigate illegal artefact hunting. Secondly, he fails to note that the case is mentioned because of the background of Gill’s research. Thirdly, that this is not an “isolated” instance, whether or not he wishes to acknowledge it, is known to many metal detectorists, some of them no doubt in Mr Austin’s own organization (for example here). Far from being a defence of the hobby, the pretence that in the whole of the UK only one farm is ever raided by illegal artefact hunters only lays the hobby open to ridicule.

Even more ridiculous is the suggestion of the spokesman for the metal detectorists (wholly illogically and in fact libellously) that “the archaeologists” are deliberately encouraging the looting of this one site to discredit artefact hunting! According to Austin, “a protected site of national importance has been sacrificed whilst EH turned a blind eye to the long term loss of material and damage to maintain this opportunity”. It is unclear what he expects English Heritage to do on this private property to stop metal detectorists searching this land illegally.

Austin seems to count Gill as one of “those who oppose [...] the Treasure Act” who is “pursing some dogmatic fantasy”. He presents a whole load of figures to show that under the Treasure Act more Treasure than ever is being dug up by artefact hunters. But to what degree is this a symptom of “success” and, given the fact that a large number of these “finds” are retrieved under less than ideal conditions, to what degree is it a sign that a certain portion of the finite and limited archaeological resource has in the past decade or so been irrevocably damaged by the increasing scope and quantity of officially sanctioned artefact hunting?

References

Addyman, P. V. 2009, ‘Before the Portable Antiquities Scheme’, pp. 51-62 in: Thomas and Stone (eds) 2009.

Austin, T. 2009, ‘Building Bridges between Metal Detectorists and Archaeologists’, pp. 119-123 in: Thomas and Stone (eds) 2009.

Thomas, S. 2009, ‘Wanborough Revisited: the Rights and Wrongs of Treasure Trove Law in England and Wales’, pp. 153-165 in: Thomas and Stone (eds) 2009.

Thomas, S. and P. Stone (eds) 2009, ‘Metal Detecting and Archaeology’, Boydell Press, Stowmarket.

Gill on PAS as Preservation (6): "Response" from the PAS?

.

Well, this is an ODD one! Very very odd. Ms Worrell is accredited as: "Prehistoric and Roman Finds Adviser, Portable Antiquities Scheme, UCL Institute of Archaeology". Sadly she is the only member of the PAS staff who responded to Gill; according to the editor Brian Hole, "Unfortunately due to the sensitivity of the subject, PAS itself was less willing to contribute". That can only sound extremely comical. Fifty staff members of the PAS are out there in the midst of the general public engaging with treasure hunting clubs, for example in the roughest areas of Essex, and with cultural philistine "what's this archaeology for then, anyway?" diehards like my Mum, explaining to them the importance of what they do. And yet when David Gill ("the polite one") writes a text involving a few gentlemanly arguments, the PAS cannot muster up an answer? The PAS cannot spend a few hours at most discussing these issues with archaeologists in a proper archaeological journal, giving its response to what David wrote? Why not? Why run away from the opportunity to discuss with the very milieu the PAS is supposedly representing as archaeology's "outreach" to the wider public? Was the venue, a proper journal produced by an academic institution, not in some way suitable? If they were invited to talk to metal detectorists I bet they'd find the time and the words and petrol money...

So after all that effort, all we get from the PAS is this brief text by Ms Worrell. David Gill initiated a discussion on current policies towards artefact hunting in England and Wales, in which the recently discovered Crosby Garrett Roman helmet is mentioned several times. Instead of addressing the issues Gill raises, she summarises events surrounding the discovery and sale of this single object (this is in fact similar to her text in British Archaeology published about the same time). What is notable however is that Worrell does not add any depth to the discussion on how the problems which obviously occurred here could be resolved. The text seems more of an apologia, explaining why the finders and landowners cannot really be accused of doing anything wrong. Surely that was not the point that Gill was making.
.
One might say, taking the lead from metal detectorist Candice Jarman: "How about answering some questions, Dr Bland?" Well, I have answered hers, now let Roger Bland give David Gill the answer his thoughtful contribution deserves. Shame on you PAS.
.

Ton Cremers Retiring Fully from MSN

.
Ton Cremers has just announced on MSN that:
After almost 15 years I will finish my WWW activities as owner and moderator of http://www.museum-security.org February 1, 2011.
The end of an era. I am sure many of his devoted readers will join me in heartily thanking Ton for all the work he has put into the running of MSN and making it what it is today, a prime resource full of thought-provoking material. May I wish him all the best for his well-deserved rest and every success in future endeavours.

BHL and OCR

Some quick notes on OCR. Revisiting my DjVu viewer experiments it really struck me how "dirty" the OCR text is. It's readable, but if we were to display the OCR text rather than the images, it would be a little offputting. For example, in the paper A new fat little frog (Leptodactylidae: Eleutherodactylus) from lofty Andean grasslands of southern Ecuador (http://biostor.org/reference/229) there are 15 different variations of the frog genus Eleutherodactylus:

  • Eleutherodactylus
  • Eleutheroclactylus
  • Eleuthewdactyliis
  • Eleiitherodactylus
  • Eleuthewdactylus
  • Eleuthewdactylus
  • Eleutherodactyliis
  • Eleutherockictylus
  • Eleutlierodactylus
  • Eleuthewdactyhts
  • Eleiithewdactylus
  • Eleutherodactyhis
  • Eleiithemdactylus
  • Eleuthemdactylus
  • Eleuthewdactyhis

Of course, this is a recognised problem. Wei et al. Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL) (hdl:2142/14919) found that 35% of names in BHL OCR contained at least one wrong character. They compared the performance of two taxonomic name finding tools on BHL OCR (uBio's taxonFinder and FAT), neither of which did terribly well. Wei et al. found that different page types can influence the success of these algorithms, and suggested that automatically classifying pages into different categories would improve performance.

Personally, it seems to me that this is not the way forward. It's pretty obvious looking at the versions of "Eleutherodactylus" above that there are recognisable patterns in the OCR errors (e.g., "u" becoming "ii", "ro" becoming "w", etc.). After reading Peter Norvig's elegant little essay How to Write a Spelling Corrector, I suspect the way to improve the finding of taxonomic names is to build a "spelling corrector" for names. Central to this would be building a probabilistic model of the different OCR errors (such as "u" → "ii"), and use that to create a set of candidate taxonomic names the OCR string might actually be (the equivalent of Google's "did you mean", which is the subject of Norvig's essay). I had hoped to avoid doing this by using an existing tool, such as Tony Rees' TAXAMATCH, but it's a website not a service, and it is just too slow.
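To make the idea concrete, here is a minimal sketch of a Norvig-style corrector that works from a table of OCR confusions rather than keyboard typos. The confusion table, its probabilities, and the tiny name list are all made-up assumptions for illustration (a real table would be estimated from BHL text), and the score is a crude product of error probabilities, not a proper likelihood.

```python
# Hypothetical OCR confusions: correct text -> [(OCR output, assumed probability)].
# These numbers are invented for illustration, not measured from BHL.
OCR_CONFUSIONS = {
    "u": [("ii", 0.05)],
    "ro": [("w", 0.05)],
    "d": [("cl", 0.03)],
    "h": [("li", 0.03)],
}

# Stand-in for a real list of taxonomic names.
KNOWN_NAMES = {"Eleutherodactylus", "Leptodactylus"}


def ocr_variants(name, max_errors=2):
    """Return {variant: score} for strings reachable from `name`
    by applying up to `max_errors` OCR confusions."""
    variants = {name: 1.0}
    frontier = {name: 1.0}
    for _ in range(max_errors):
        next_frontier = {}
        for text, prob in frontier.items():
            for correct, outputs in OCR_CONFUSIONS.items():
                start = text.find(correct)
                while start != -1:
                    for wrong, p in outputs:
                        variant = text[:start] + wrong + text[start + len(correct):]
                        score = prob * p
                        if score > next_frontier.get(variant, 0.0):
                            next_frontier[variant] = score
                    start = text.find(correct, start + 1)
        for variant, score in next_frontier.items():
            if score > variants.get(variant, 0.0):
                variants[variant] = score
        frontier = next_frontier
    return variants


def candidates(ocr_string, names=KNOWN_NAMES, max_errors=2):
    """Return (name, score) pairs for known names that could have produced
    `ocr_string`, best first (the "did you mean" list)."""
    scored = []
    for name in names:
        score = ocr_variants(name, max_errors).get(ocr_string)
        if score is not None:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: -pair[1])
```

With this toy table, `candidates("Eleuthewdactylus")` suggests "Eleutherodactylus" via the single "ro" to "w" confusion. Generating variants from every known name is clearly too slow for a full name list, so in practice one would precompute an index of variants or search from the OCR string instead.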

I've started doing some background reading on the topic of spelling correction and OCR, and I've created a group on Mendeley called OCR - Optical Character Recognition to bring these papers together. I'm also fussing with some simple code to find misspellings of a given taxonomic name in BHL text, use the Needleman–Wunsch sequence alignment algorithm to align those misspellings to the correct name, and then extract the various OCR errors, building a matrix of the probabilities of the various transformations of the original text into OCR text.
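As an illustration of the alignment step (a sketch, not the code I'm actually fussing with), here is a plain Needleman–Wunsch global alignment followed by a tally of the aligned positions where the correct name and the OCR string differ. The scoring values (match 1, mismatch -1, gap -1) are arbitrary assumptions, and the function names are made up.

```python
from collections import Counter


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def ocr_errors(correct, ocr):
    """Count (correct_char, ocr_char) pairs that differ; '-' marks a gap."""
    aligned_correct, aligned_ocr = needleman_wunsch(correct, ocr)
    return Counter(
        (c, o) for c, o in zip(aligned_correct, aligned_ocr) if c != o
    )
```

Summing these counters over many misspellings and dividing by how often each correct character occurs would give the matrix of transformation probabilities. Note that a character-by-character alignment records "d" becoming "cl" as one substitution plus one insertion, so adjacent columns would need merging to recover multi-character errors.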

One use for this spelling correction would be in an interactive BHL viewer. In addition to showing the taxonomic names that uBio's taxonFinder has located in the text, we could flag strings that could be misspelt taxonomic names (such as "Eleutherockictylus") and provide an easy way for the user to either accept or reject that name. If we are going to invite people to help clean up BHL text, it would be nice to provide hints as to what the correct answer might be.
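A quick way to prototype the flagging, before any OCR-specific model exists, is generic fuzzy matching. The sketch below swaps in Python's standard `difflib` similarity ratio for the probabilistic corrector, so it knows nothing about OCR errors; the function name, the cutoff of 0.8, and the name list are assumptions.

```python
import difflib


def flag_possible_names(tokens, known_names, cutoff=0.8):
    """Return (token, suggested_name) pairs for tokens that are close to,
    but not identical to, a known taxonomic name."""
    flagged = []
    for token in tokens:
        if token in known_names:
            continue  # already a correct name, nothing to flag
        match = difflib.get_close_matches(token, known_names, n=1, cutoff=cutoff)
        if match:
            flagged.append((token, match[0]))
    return flagged
```

Each flagged pair is exactly the hint a viewer could show the user: the string as it appears in the OCR text, and the name it might really be, for them to accept or reject.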

BioStor one year on: has it been a success?

One year ago I released BioStor, which scratched my itch regarding finding articles in the Biodiversity Heritage Library. This anniversary seems to be a good time to think about where next with this project, but also to ask whether it's been successful. Of course, this rather hinges on what I mean by "success." I've certainly found BioStor to be useful, both the experience of developing it, and actually using it. But it's time to be a little more hard-headed and look at some stats. So I'm going to share the Google Analytics stats for BioStor. Below is the report for Dec 20, 2009 to Dec 19, 2010, as a PDF.

Visits

BioStor had 63,824 visits over the year, and 197,076 pageviews. After an initial flurry of visits on its launch the number of visitors dropped off, then slowly grew. Numbers dipped during the middle of the year, then started to climb again.

In order to discover whether these numbers are a little or a lot, it would be helpful to compare them with data from other biodiversity sites. Unfortunately, nobody seems to be making this information readily available. There is a slide in a BHL presentation that shows BHL having had more than 1 million visits since January 2008, and in March 2010 it was receiving around 3000 visits per day, which is an order of magnitude greater than the traffic BioStor is currently getting. For another comparison, I looked at Scratchpads, which currently comprise 193 sites. In November 2007 Scratchpads had 43,379 pageviews altogether; in November 2010 BioStor had 17,484 pageviews. For the period May-October 2009 Scratchpads had 74,109 visitors; for the equivalent period in 2010 BioStor had 28,110. So, BioStor is getting a little over a third of the traffic of the entire Scratchpads project (about 40% of the pageviews and 38% of the visitors in these comparisons).

Bounce rate

One of the more interesting charts is "Bounce rate", defined by Google as

Bounce rate is the percentage of single-page visits or visits in which the person left your site from the entrance (landing) page.
The bounce rate for BioStor is pretty constant at around 65%, except for two periods in March and June, when it plummeted to around 20%. This corresponds to when I set up a Wikisource installation for BioStor so that the OCR text from BHL could be corrected. Mark Holder ran a student project that used the BioStor wiki, so I'm assuming that the drop in bounce rate reflects Mark's students spending time on the wiki. BHL OCR text would benefit from cleaning, but I'm not sure Wikisource is the way to do it as it feels a little clunky. Ideally I'd like to build upon the interactive DjVu experiments to develop a user-friendly way to edit the underlying OCR text.

Is it just my itch?
Every good work of software starts by scratching a developer's personal itch - Eric S. Raymond, The Cathedral and the Bazaar

Looking at traffic by city, Glasgow (where I'm based) is the single largest source of traffic. This is hardly surprising, given that I wrote BioStor to solve a problem I was interested in, and the bulk of its content has been added by me using various scripts. This raises the possibility that BioStor has an active user community of *cough* one. However, looking at traffic by country, the UK is prominent (due to traffic primarily from Glasgow and London), but more visits come from the US. It seems I didn't end up making this site just for me.

Google search
Another measure of success is Google search rankings, which I've used elsewhere to compare the impact of Wikipedia and EOL pages. As a quick experiment I Googled the top ten journals in BioStor and recorded where in the search results BioStor appeared. For all but the Biological Bulletin, BioStor appeared in the top ten (i.e., on the first page of results):

| Journal | Google rank of BioStor page |
| --- | --- |
| Biological Bulletin | 12 |
| Bulletin of Zoological Nomenclature | 6 |
| Proceedings of the Entomological Society, Washington | 6 |
| Proc. Linn. Soc. New South Wales | 3 |
| Annals of the Missouri Botanical Garden | 3 |
| Tijdschr. Ent. | 2 |
| Transactions of The Royal Entomological Society of London | 6 |
| Ann. Mag. nat. Hist | 3 |
| Notes from the Leyden Museum | 5 |
| Proceedings of the United States National Museum | 4 |


This suggests that BioStor's content is at least findable.

Where next?
The sense I'm getting from these stats is that BioStor is being used, and it seems to be a reasonably successful, small-scale project. It would be nice to play with the Google Analytics output a bit more, and also explore usage patterns more closely. For example, I invested some effort in adding the ability to create PDFs for BioStor articles, but I've no stats on how many PDFs have been downloaded. Metadata in BioStor is editable, and edits are logged, but I've not explored the extent to which the content is being edited. If a serious effort is going to be made to clean up BHL content using crowd sourcing, I'll need to think of ways to engage users. The wiki experiments were a step in this direction, but I suspect that building a network around this task might prove difficult. Perhaps a better way is to build the network elsewhere, then try to engage it with this task (OCR correction). This was one reason behind my adopting Mendeley's OAuth API to provide a sign in facility for BioStor (see Mendeley connect). Again, I've no stats on the extent to which this feature of BioStor has been used. Time to give some serious thought to what else I can learn about how BioStor is being used.

TreeBASE, again

My views on TreeBASE are pretty well known. Lately I've been thinking a lot about how to "fix" TreeBASE, or indeed, move beyond it. I've made a couple of baby steps in this direction.

The first step is that I've created a group for TreeBASE papers on Mendeley. I've uploaded all the studies in TreeBASE as of December 13 (2010). Having these in Mendeley makes it easier to tidy up the bibliographic metadata, add missing identifiers (such as DOIs and PubMed ids), and correct citations to non-existent papers (which can occur if at the time the authors uploaded their data they planned to submit their paper to one journal, but it ended up being accepted in another). If you've a Mendeley account, feel free to join the group. If you've contributed to TreeBASE, you should find your papers already there.

The second step is playing with CouchDB (this year's new hotness), exploring ways to build a database of phylogenies that has nothing much to do with either a relational database or a triple store. CouchDB is a document store, and I'm playing with taking NeXML files from TreeBASE, converting them to something vaguely usable (i.e., JSON), and adding them to CouchDB. For fun, I'm using my NCBI to Wikipedia mapping to get images for taxa, so if TreeBASE has mapped a taxon to the NCBI taxonomy, and that taxon has a page in Wikipedia with an image, we get an image for that taxon. The reason for this is I'd really like a phylogeny database that was visually interesting. To give you some examples, here are trees from TreeBASE (displayed using SVG), together with thumbnails of images from Wikipedia:

[Images: four TreeBASE trees drawn in SVG, each with thumbnails of Wikipedia images for its taxa]


Everything (tree and images) is stored within a single document in CouchDB, making the display pretty trivial to construct. Obviously this isn't a proper interface, and there's things I'd need to do, such as order the images in such a way that they matched the placement of the taxa on the tree, but at a glance you can see what the tree is about. We could then envisage making the images clickable so you could find out more about that taxon (e.g., text from Wikipedia, lists of other trees in the database, etc.).
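To give a feel for what "a single document" means here, the sketch below bundles a tree and its taxon images into one JSON object. Only `_id` is a genuine CouchDB field; every other field name, the identifiers, and the image URLs are placeholders I've made up for illustration, not the structure my database actually uses.

```python
import json


def tree_document(study_id, newick, taxa):
    """Bundle a tree and its taxon images into one JSON-serialisable dict.

    `taxa` is a list of (label, ncbi_taxon_id, image_url) tuples; image_url
    may be None when no Wikipedia image was found for the taxon.
    """
    return {
        "_id": study_id,  # CouchDB's document identifier
        "type": "tree",   # hypothetical field to tell document kinds apart
        "newick": newick,
        "taxa": [
            {"label": label, "ncbi_taxon_id": ncbi_id, "image_url": image_url}
            for label, ncbi_id, image_url in taxa
        ],
    }


doc = tree_document(
    "example-study-1",
    "(Homo_sapiens,Pan_troglodytes);",
    [
        ("Homo sapiens", 9606, "https://example.org/homo.jpg"),
        ("Pan troglodytes", 9598, None),
    ],
)
print(json.dumps(doc, indent=2))
```

Because the display needs nothing beyond this one document, rendering a tree with its thumbnails is a single fetch followed by a loop over `taxa`.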

We could expand this further by extracting geographical information (say, from the sequences included in the study) and make a map, or eventually a phylogeny on Google Earth (see David Kidd's recent "Geophylogenies and the Map of Life" for a manifesto, doi:10.1093/sysbio/syq043).

One of the big things missing from databases like TreeBASE is a sense of "fun", or serendipity. It's hard to find stuff, hard to discover new things, make new connections, or put things in context. And that's tragic. Try a Google image search for treebase+phylogeny:

[Screenshot: Google image search results for treebase+phylogeny]

Call me crazy, but I looked at that and thought "Wow! This phylogeny stuff is cool!" Wouldn't it be great if that's the reaction people had when they looked at a database of evolutionary trees?

How do I know if an article is Open Access?

One of my pet projects is to build a "Universal Article Reader" for the iPad (or similar mobile device), so that a reader can seamlessly move between articles from different publishers, follow up citations, and get more information on entities mentioned in those articles (e.g., species, molecules, localities, etc.). I've made various toys towards this, the latest being an HTML5 clone of Nature's iPhone app.

One impediment to this is knowing whether an article is Open Access, and if so, what representations are available (i.e., PDF, HTML, XML). Ideally, the "Universal Article Reader" would be able to look at the web page for an article, determine whether it can extract and redisplay the text (i.e., is the article Open Access) and if so, can it, for example, grab the article in XML and reformat it.

Some journals are entirely Open Access, so for these journals the first problem (is it Open Access?) is trivial, but a large number of journals have a mixed publishing model: some articles are Open Access, some aren't. One thing publishers could do that would be helpful would be to specify the access status of an article in a consistent manner. Here's a quick survey of how things stand at the moment.

| Journal | Rights |
| --- | --- |
| PLoSOne | Embedded RDF, e.g. <license rdf:resource="http://creativecommons.org/licenses/by/2.5/" /> |
| Nature Communications | <meta name="access" content="Yes" /> for open, <meta name="access" content="No" /> for closed |
| Systematic Biology | <meta name="citation_access" content="all" /> for open, this tag missing if closed |
| BioOne | Nothing for article, Open Access icon next to open access articles in table of contents |
| BMC Evolutionary Biology | <meta name ="dc.rights" content="http://creativecommons.org/licenses/by/2.0/" /> |
| Philosophical Transactions of the Royal Society | <meta name="citation_access" content="all" /> for open access |
| Microbial Ecology | No metadata (links and images in HTML) |
| Human Genomics and Proteomics | <meta name ="dc.rights" content="http://creativecommons.org/licenses/by/2.0/" /> |


A bit of a mess. Some publishers embed this information in <meta> tags (which is good), some (such as PLoS) embed RDF (good, if a little more hassle), some leave us in the dark, or give visual clues such as logos (which mean nothing to a computer). In some ways this parallels the variety of ways journals have implemented RSS feeds, which has led to some explicit Recommendations on RSS Feeds for Scholarly Publishers. Perhaps the time is right to develop equivalent recommendations for article metadata, so that apps to read the scientific literature can correctly determine whether they can display an article or not.
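Until such recommendations exist, a reader app has to check each convention in turn. Here is a minimal sketch of that check, covering only the tags in the survey above; the class and function names are made up, and treating any Creative Commons licence as "open" is a simplifying assumption. If a publisher wraps its RDF in an HTML comment, this parser will not see it.

```python
from html.parser import HTMLParser


class AccessMetaParser(HTMLParser):
    """Collect <meta name=... content=...> pairs and RDF licence URLs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            name = attrs["name"].strip().lower()
            self.meta[name] = (attrs.get("content") or "").strip()
        elif tag == "license" and attrs.get("rdf:resource"):
            self.license_urls.append(attrs["rdf:resource"])


def access_status(html):
    """Return 'open', 'closed', or 'unknown' for an article page."""
    parser = AccessMetaParser()
    parser.feed(html)
    meta = parser.meta
    # Nature Communications style: explicit Yes/No.
    if "access" in meta:
        return "open" if meta["access"].lower() == "yes" else "closed"
    # Systematic Biology / Phil. Trans. style: tag present only when open.
    if meta.get("citation_access", "").lower() == "all":
        return "open"
    # BMC style dc.rights, or PLoS style embedded RDF licence.
    rights = [meta.get("dc.rights", "")] + parser.license_urls
    if any("creativecommons.org/licenses/" in value for value in rights):
        return "open"
    return "unknown"
```

Note that "unknown" is the honest answer for journals such as BioOne and Microbial Ecology, where the article page carries no machine-readable statement at all.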





Viewing scientific articles on the iPad: cloning the Nature.com iPhone app using jQuery Mobile

Over the last few months I've been exploring different ways to view scientific articles on the iPad, summarised here. I've also made a few prototypes, either from scratch (such as my response to the PLoS iPad app) or using Sencha Touch (see Touching citations on the iPad).

Today, it's time for something a little different. The Sencha Touch framework I used earlier is huge and wasn't easy to get my head around. I was resigning myself to trying to get to grips with it when jQuery Mobile came along. Still in alpha, jQuery Mobile is very simple and elegant, and writing an app is basically a case of writing HTML (with a little Javascript here and there if needed). It has a few rough edges, but it's possible to create something usable very quickly. And, it's actually fun.

So, to learn a bit more about how to use it, I decided to see if I could write a "clone" of Nature.com's iPhone app (which I reviewed earlier). Nature's app is in many ways the most interesting iOS app for articles because it doesn't treat the article as a monolithic PDF, but rather it uses the ePub format. As a result, you can view figures, tables, and references separately.

The clone
You can see the clone here.



I've tried to mimic the basic functionality of the Nature.com app in terms of transitions between pages, display of figures, references, etc. In making this clone I've focussed on just the article display.

A web app is going to lack the speed and functionality of a native app, but is probably a lot faster to develop. It also works on a wider range of platforms. jQuery Mobile is committed to supporting a wide range of platforms, so this clone should work on platforms other than the iPad.

The Nature.com app has a lot of additional functionality apart from just displaying articles, such as listing the latest articles from Nature.com journals, managing a user's bookmarks, and enabling the user to buy subscriptions. Some of this functionality would be pretty easy to add to this clone, for example by consuming RSS feeds to get article lists. With a little effort one could have a simple, Web-based app to browse Nature content across a range of mobile devices.

Technical stuff

Nature's app uses the ePub format, but Nature's web site doesn't provide an option to download articles in ePub format. However, if you use an HTTP debugging proxy (such as Charles Proxy) when using Nature's app you can see the URLs needed to fetch the ePub file.

I grabbed a couple of ePub files for articles in Nature Communications and unzipped them (.epub files are zip files). The iPad app is a single HTML file that uses some Ajax calls to populate the different views. One Ajax call takes the index.html that has the article text and replaces the internal and external links with calls to Javascript functions. An article's references, figure captions, and tables are stored in separate XML files, so I have some simple PHP scripts that read the XML and extract the relevant bits. Internal links (such as to figures and references) are handled by jQuery Mobile. External links are displayed within an iFrame.
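The clone itself does this with Javascript and PHP, but the two core steps (pull index.html out of the ePub, then rewrite its links) are easy to illustrate in a few lines of Python. This is a sketch only: the Javascript function names `show_internal` and `show_external` are hypothetical, and the regular expression assumes simple double-quoted href attributes with no quotes inside the target.

```python
import io
import re
import zipfile


def read_epub_member(epub_bytes, member="index.html"):
    """An .epub is a zip archive, so read one file straight out of it."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return archive.read(member).decode("utf-8")


def rewrite_links(html):
    """Point internal (#fragment) and external links at Javascript handlers."""

    def replace(match):
        target = match.group(1)
        if target.startswith("#"):
            # Internal link, e.g. to a figure or reference.
            return "href=\"javascript:show_internal('%s')\"" % target[1:]
        # Anything else is treated as an external link.
        return "href=\"javascript:show_external('%s')\"" % target

    return re.sub(r'href="([^"]*)"', replace, html)


# Build a tiny stand-in ePub in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "index.html",
        '<a href="#f1">Fig 1</a> <a href="http://example.org/">x</a>',
    )

rewritten = rewrite_links(read_epub_member(buffer.getvalue()))
print(rewritten)
```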

There are some intellectual property issues to address. Nature isn't an Open Access journal, but some articles in Nature Communications are (under the Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License), so I've used two of these as examples. When it displays an article, Nature's app uses Droid fonts for the article heading. These fonts are supplied as an SVG file contained within the ePub file. Droid fonts are available under an Apache License as TrueType fonts as part of the Android SDK. I couldn't find SVG versions of the fonts in the Android SDK, so I use the TrueType fonts (see Jeffrey Zeldman's Web type news: iPhone and iPad now support TrueType font embedding. This is huge.). Oh, and I "borrowed" some of the CSS from the style.css file that comes with each ePub file.

First thoughts on CiteBank and BHL-Europe

This week saw the release of two tools from the Biodiversity Heritage Library, CiteBank and the BHL-Europe portal. Both have actually been quietly around for a while, but were only publicly announced last week.

In developing a new tool there are several questions to ask. Does something already exist that meets my needs? If it doesn't exist, can I build it using an existing framework, or do I need to start from scratch? As a developer it's awfully tempting sometimes to build something from scratch (I'm certainly guilty of this). Sometimes a more sensible approach is to build on something that already exists, particularly if what you are building upon is well supported. This is one of the attractions of Drupal, which underlies CiteBank and Scratchpads. In my own work I've used Semantic Mediawiki to support editable, versioned databases, rather than roll my own. Perhaps the more difficult question for a developer is whether they need to build anything at all. What if there are tools already out there that, if not exactly what you want, are close enough (or most likely will be by the time you finish your own tool)?

CiteBank
CiteBank is an open access platform to aggregate citations for biodiversity publications and deliver access to biodiversity related articles. CiteBank aggregates links to content from digital libraries, publishers, and other bibliographic systems in order to provide a single point of access to the world’s biodiversity literature, including content created by its community of users. CiteBank is a project of the Biodiversity Heritage Library (BHL).

I have two reactions to CiteBank. Firstly, Drupal's bibliographic tools really suck, and secondly, why do we need this? As I've argued earlier (see Mendeley, BHL, and the "Bibliography of Life"), I can't see the rationale for having CiteBank separate from an existing bibliographic database such as Mendeley or Zotero. These tools are more mature, better supported, and address user needs beyond simply building lists of papers (e.g., citing papers when writing manuscripts).

For me, one of BHL's goals should be integrating the literature they have scanned into mainstream scientific literature, which means finding articles, assigning DOIs, and becoming in effect a digital publishing platform (like BioOne or JSTOR). Getting to this point will require managing and cleaning metadata for many thousands of articles and books. It seems to me that you want to gather this metadata from as many sources as possible, and expose it to as many eyes (and algorithms) as possible to help tidy it up. I think this is a clear case of it being better to use an existing tool (such as Mendeley), rather than build a new one. If a good fraction of the world's taxonomists shared their personal bibliographies on Mendeley we'd pretty much have the world's taxonomic literature in one place, without really trying.

BHL-Europe
It's early days for BHL-Europe, and they've taken the "let's use an existing framework" approach, basing the BHL-Europe portal on DISMARC, the latter being an EU-funded project to "encourage and support the interoperability of music related data".

BHL-Europe is the kind of web site only its developers could love. It's spectacularly ugly, and a classic example of what digital libraries came up with while Google was quietly eating their lunch. Here's the web site showing search results for "Zonosaurus":

[Screenshot: BHL-Europe search results for "Zonosaurus"]


Yuck! Why do these things have to be so ugly? DISMARC was designed to store metadata about digital objects, specifically music. Look at commercial music interfaces such as iTunes, Spotify, and Last.fm. Or even academic projects such as mSpace.

To be useful BHL-Europe really needs to provide an interface that reflects what its users care about, for example taxonomic names, classification, and geography. It can't treat scientific literature as a bunch of lifeless metadata objects (but then again, DISMARC managed to do this for music).

Where next?
CiteBank and BHL-Europe seem further additions to the worthy but ultimately deeply unsatisfying attempts to improve access to biodiversity literature. To date our field has failed to get to grips with aggregating metadata (outside of the library setting), creating social networks around that aggregation, and providing intuitive interfaces that enable users to search and browse productively. These are big challenges. I'd like to see the resources that we have put to better use, rather than being used to build tools where suitable alternatives already exist (CiteBank), or used to shoehorn data into generic tools that are unspeakably ugly (BHL-Europe portal) and not fit for purpose. Let's not reinvent the wheel, and let's not try and convince ourselves that squares make perfectly good wheels.

Linking taxonomic databases to the primary literature: BHL and the Australian Faunal Directory

Continuing my hobby horse of linking taxonomic databases to digitised literature, I've been working for the last couple of weeks on linking names in the Australian Faunal Directory (AFD) to articles in the Biodiversity Heritage Library (BHL). AFD is a list of all animals known to occur in Australia, and it provides much of the data for the recently released Atlas of Living Australia. The data is available as a series of CSV files, and these contain quite detailed bibliographic references. My initial interest was in using these to populate BioStor with articles, but it seemed worthwhile to try and link the names and articles together. The Atlas of Living Australia links to BHL, but only via a name search showing BHL items that have a name string. This wastes valuable information. AFD has citations to individual books and articles that relate to the taxonomy of Australian animals; we should treat that as first class data.

So, I took the CSV files, cobbled together some scripts to extract references, ran them through the BioStor and bioGUID OpenURL resolvers, and dumped the whole thing in a CouchDB database. You can see the results at Australian Faunal Directory on CouchDB.
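For anyone unfamiliar with OpenURL, the resolver step boils down to turning a parsed reference into a query string of standard keys (genre, title, volume, spage and so on). A minimal sketch, assuming the reference has already been parsed into a dict; the base URL is my assumption about where the resolver lives, and the function name is made up:

```python
from urllib.parse import urlencode

# OpenURL keys used here, in the order they are emitted.
OPENURL_KEYS = ("genre", "title", "volume", "issue", "spage", "epage", "date")


def openurl_query(reference, base_url="http://biostor.org/openurl"):
    """Build an OpenURL request for a parsed reference, skipping empty fields."""
    params = [(key, reference[key]) for key in OPENURL_KEYS if reference.get(key)]
    return base_url + "?" + urlencode(params)


reference = {
    "genre": "article",
    "title": "Proceedings of the Entomological Society of Washington",
    "volume": "105",
    "issue": "4",
    "spage": "915",
    "epage": "924",
    "date": "2003",
}
print(openurl_query(reference))
```

The hard part, of course, is not building the query but getting clean journal titles, volumes, and page numbers out of the reference strings in the first place.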

[Screenshot: Australian Faunal Directory on CouchDB]


The site is modelled on my earlier experiment with putting the Catalogue of Life on CouchDB. It's still rather crude, and there's a lot of stuff I need to work on, but it should illustrate the basic idea. You can browse the taxonomic hierarchy, view alternative names for each taxon, and see a list of publications related to those names. If a publication has been found in BioStor then the site displays a thumbnail of the first page, and if you click on the reference you see a simple article viewer I wrote in Javascript.

[Screenshot: the Javascript article viewer]


For PDFs I'm experimenting with using Google's PDF viewer (the inspiration for the viewer above):

[Screenshot: an article PDF in Google's PDF viewer]



How it was made
Although in principle linking AFD to BHL via BioStor was fairly straightforward, there are lots of little wrinkles, such as errors in bibliographic metadata, and failure to parse some reference strings. To help address this I created a public group on Mendeley where all the references I've extracted are stored. This makes it easy to correct errors, add identifiers such as DOIs and ISSNs, and upload PDFs. For each article a reference to the original record in AFD is maintained by storing the AFD identifier (a UUID) as a keyword.
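Because the AFD identifier is a UUID, it is easy to pick back out of a reference's keywords later, since ordinary keywords won't parse as UUIDs. A small sketch, with a made-up function name and an invented example UUID:

```python
import uuid


def afd_ids(keywords):
    """Return the keywords that parse as UUIDs (taken to be AFD identifiers)."""
    ids = []
    for keyword in keywords:
        try:
            uuid.UUID(keyword)
        except ValueError:
            continue  # an ordinary keyword such as a taxon name
        ids.append(keyword)
    return ids


# The UUID below is invented for the example, not a real AFD record.
print(afd_ids(["Phthiraptera", "0f3a6d2e-1b2c-4d5e-8f90-123456789abc"]))
```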

The taxonomy and the mapping to literature are stored in a CouchDB database, which makes a lot of things (such as uploading new versions of documents) a breeze.

It's about the links
The underlying motivation is that we are awash in biodiversity data and digitisation projects, but these are rarely linked together. And it's more than just linking, it's bringing the data together so that we can compute over it. That's when things will start to get interesting.

Mendeley mangles my references: phantom documents and the problem of duplicate references

One issue I'm running into with Mendeley is that it can create spurious documents, mangling my references in the process. This appears to be due to some over-zealous attempts to de-duplicate documents. Duplicate documents are the number one problem faced by Mendeley, and have been discussed in some detail by Duncan Hull in his post How many unique papers are there in Mendeley?. Duncan focussed on the case where the same article may appear multiple times in Mendeley's database, which will inflate estimates of how many distinct references the database contains. It also has implications for metrics derived from Mendeley, such as those displayed by ReaderMeter.

In this post I discuss the reverse problem, combining two or more distinct references into one. I've been uploading large collections of references based on harvesting metadata for journal articles. Although the metadata isn't perfect, it's usually pretty good, and in many cases linked to Open Access content in BioStor. References that I upload appear in public groups listed on my profile, such as the group Proceedings of the Entomological Society of Washington.

Reverse engineering Mendeley
In the absence of a good description by Mendeley of how their tools work, we have to try and figure it out ourselves. If you click on a reference that has been recently added to Mendeley you get a URL that looks like this: http://www.mendeley.com/c/3708087012/g/584201/magalhaes-2008-a-new-species-of-kingsleya-from-the-yanomami-indians-area-in-the-upper-rio-orinoco-venezuela-crustacea-decapoda-brachyura-pseudothelphusidae/ where 584201 is the group id, 3708087012 is the "remoteId" of the document (this is what it's called in the SQLite database that underlies the desktop client), and the rest of the URL is the article title, minus stop words.

After a while (perhaps a day or so) Mendeley gets around to trying to merge the references I've added with those it already knows about, and the URLs lose the group and remoteId and look like this: http://www.mendeley.com/research/review-genus-saemundssonia-timmerman-phthiraptera-philopteridae-alcidae-aves-charadriiformes-including-new-species-new-host/ . Let's call this document the "canonical document" (this document also has a UUID, which is what the Mendeley API uses to retrieve the document). Once the document gets one of these URLs Mendeley will also display how many people are "reading" that document, and whether anyone has tagged it.
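The two URL shapes described above are regular enough to pull apart with a couple of patterns. This is a sketch based purely on the URLs I've observed, not on anything Mendeley documents, so the patterns and the returned field names are assumptions:

```python
import re

# Recently added document: /c/<remoteId>/g/<group id>/<title slug>/
GROUP_URL = re.compile(
    r"^https?://www\.mendeley\.com/c/(?P<remote_id>\d+)/g/(?P<group_id>\d+)/(?P<slug>[^/]+)/?$"
)
# Canonical document after merging: /research/<title slug>/
CANONICAL_URL = re.compile(
    r"^https?://www\.mendeley\.com/research/(?P<slug>[^/]+)/?$"
)


def parse_mendeley_url(url):
    """Classify a Mendeley document URL and extract its identifiers."""
    match = GROUP_URL.match(url)
    if match:
        return {
            "kind": "group",
            "remote_id": int(match.group("remote_id")),
            "group_id": int(match.group("group_id")),
            "slug": match.group("slug"),
        }
    match = CANONICAL_URL.match(url)
    if match:
        return {"kind": "canonical", "slug": match.group("slug")}
    return None
```

Comparing the slug of a group URL with the slug of the canonical URL it later turns into would be one cheap way to spot references that Mendeley has merged into something with a different title.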

But that's not my paper!
The problem is that sometimes (and more often than I'd like) the canonical document bears little relation to the document I uploaded. For example, here is a paper that I uploaded to the group Proceedings of the Entomological Society of Washington:

Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records by Roger D Price, Ricardo L Palma, Dale H Clayton, Proceedings of the Entomological Society of Washington, 105(4):915-924 (2003).


You can see the actual paper in BioStor: http://biostor.org/reference/57185. To see the paper in the Mendeley group, browse it using the tag Phthiraptera:

[Screenshot: the paper as listed in the Mendeley group]


Note the 2, indicating that two people (including myself) have this paper in their library. The URL for this paper is http://www.mendeley.com/research/review-genus-saemundssonia-timmerman-phthiraptera-philopteridae-alcidae-aves-charadriiformes-including-new-species-new-host/, but this is not the paper I added!

What Mendeley displays for this URL is this:
[Screenshot: the document Mendeley displays for this URL]


Not only is this not the paper I added, there is no such paper! There is a paper entitled "A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar", but that is by Harry Brailovsky, not Clayton and Price (you can see this paper in BioStor as http://biostor.org/reference/55669). The BioStor link for the phantom paper displayed by Mendeley, http://biostor.org/reference/55761, is for a third paper "A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States : New records and range extensions". The table below shows the original details for the paper, the details for the "canonical paper" created by Mendeley, and the details for two papers that have some of the bibliographic details in common with this non-existent paper (highlighted in bold).

| Field | Original paper | Mendeley | Brailovsky paper | Ground beetle paper |
|---|---|---|---|---|
| Title | Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records | A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar | **A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar** | A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States : New records and range extensions |
| Author(s) | Roger D Price, Ricardo L Palma, Dale H Clayton | DH Clayton, RD Price | Harry Brailovsky | |
| Volume | 105 | 105 | 104 | 107 |
| Pages | 915-924 | 915-924 | 111-118 | 917-940 |
| BioStor | 57185 | 55761 | 55669 | **55761** |

As you can see it's a bit of a mess. Now, finding and merging duplicates is a hard problem (see doi:10.1145/1141753.1141817 for some background), but I'm struggling to see why these documents were considered to be duplicates.
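
To put a rough number on how different these two titles are, here is a small Python sketch that computes the Jaccard similarity of their word sets. This is emphatically not Mendeley's algorithm (which is undocumented); it is just one of the simplest measures a de-duplication tool might use, and the function names are mine.

```python
import re


def title_tokens(title):
    """Lower-case a title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(title_a, title_b):
    """Jaccard similarity of two titles: shared words / all distinct words."""
    a, b = title_tokens(title_a), title_tokens(title_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


ORIGINAL = ("Review of the genus Saemundssonia Timmermann (Phthiraptera: "
            "Philopteridae) from the Alcidae (Aves: Charadriiformes), "
            "including a new species and new host records")
PHANTOM = ("A new genus and a new species of Daladerini (Hemiptera: "
           "Heteroptera: Coreidae) from Madagascar")

score = jaccard(ORIGINAL, PHANTOM)
print(round(score, 2))
```

The score comes out at about 0.29, and every shared word is generic ("a", "new", "genus", "and", "species", "of", "from"). Drop the stop words and only "new", "genus" and "species" remain in common, which hardly narrows things down among taxonomic papers.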

What I'd like to see
I'm a big fan of Mendeley, so I'd like to see this problem fixed. What I'd really like to see is the following:
  1. Mendeley publish a description of how their de-duplication algorithms work

  2. Mendeley describe the series of steps a document goes through as they process it (if nothing else, so that users can make sense of the multiple URLs a document may get over its lifetime in Mendeley).

  3. For each canonical reference, Mendeley show the set of documents that have been merged to create that canonical reference, and display some measure of their confidence that the match is genuine.

  4. Mendeley enable users to provide feedback on a canonical document (e.g., a button by each document in the set that enables the user to say "yes this is a match" or "no, this isn't a match").
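
As a rough illustration of points 3 and 4, the information I have in mind could be as simple as the following Python sketch. Everything here is hypothetical: the class names, the fields, the confidence score and the voting scheme are my own invention, not anything Mendeley exposes.

```python
from dataclasses import dataclass, field


@dataclass
class MergedSource:
    """One user-supplied document that was merged into a canonical document."""
    remote_id: str
    title: str
    confidence: float       # hypothetical match score between 0 and 1
    votes_match: int = 0    # users who clicked "yes this is a match"
    votes_no_match: int = 0  # users who clicked "no, this isn't a match"


@dataclass
class CanonicalDocument:
    """A canonical document together with the set of documents behind it."""
    uuid: str
    title: str
    sources: list = field(default_factory=list)

    def record_feedback(self, remote_id, is_match):
        """Register one user's verdict on one merged source."""
        for source in self.sources:
            if source.remote_id == remote_id:
                if is_match:
                    source.votes_match += 1
                else:
                    source.votes_no_match += 1
                return
        raise KeyError(remote_id)

    def disputed(self):
        """Sources that more users have rejected than accepted."""
        return [s for s in self.sources if s.votes_no_match > s.votes_match]
```

With something like this on show, a bad merge such as the one above would be visible at a glance, and a single "no, this isn't a match" click would flag it for review.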


Perhaps it would be useful if Mendeley (or the community) assembled a test collection of documents that contains duplicates, together with the set of canonical documents this collection actually contains, and used it to evaluate alternative algorithms for finding duplicates. Let's make this a "challenge" with prizes! In many ways I'd be much more impressed by a de-duplication challenge than the DataTEL challenge, especially as it seems clear that Mendeley readership data is too sparse to generate useful recommendations (see Mendeley Data vs. Netflix Data).
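
To sketch how such a test collection might be scored, here is a minimal Python example that compares an algorithm's clusters of "duplicates" against the true clusters using pairwise precision and recall. This is one common way to evaluate de-duplication, not a proposal for official challenge rules, and the document ids are invented.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """All unordered pairs of document ids that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def evaluate(predicted, truth):
    """Pairwise precision and recall of predicted clusters against the truth."""
    predicted_pairs = duplicate_pairs(predicted)
    true_pairs = duplicate_pairs(truth)
    correct = len(predicted_pairs & true_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Invented example: a1 and a2 really are the same paper; b1 and c1 are distinct.
truth = [{"a1", "a2"}, {"b1"}, {"c1"}]
# An over-eager algorithm that also merges b1 into the first cluster.
predicted = [{"a1", "a2", "b1"}, {"c1"}]

precision, recall = evaluate(predicted, truth)
```

Here the algorithm finds the one real duplicate pair (recall of 1) but two of its three merged pairs are wrong (precision of one third). That is exactly the kind of over-merging described in this post, and precision is the number a challenge would want to reward.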