News Air: taxonomy


Sherborn presentation on Open Taxonomy

Here is my presentation from today's Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond meeting.

Open taxonomy

View more presentations from Roderic Page

All the presentations will be posted online, along with podcasts of the audio. Meantime, presentations by Dave Remsen and Chris Freeland are already online.

Taxonomy - crisis, what crisis?

Following on from the last post How many species are there, and why do we get two very different answers from same data? another interesting paper has appeared in TREE:

Lucas N. Joppa, David L. Roberts, Stuart L. Pimm The population ecology and social behaviour of taxonomists Trends in Ecology & Evolution doi:10.1016/j.tree.2011.07.010

The paper analyses the "ecology and social habits of taxonomists" and concludes:

Conventional wisdom is highly prejudiced. It suggests that taxonomists were a formerly more numerous people, are in 'crisis', are becoming endangered and are generally asocial. We consider these hypotheses and reject them to varying degrees.

Cue flame war on TAXACOM, no doubt, but it's a refreshing conclusion, and it's based on actual data. Here I declare an interest. I was a reviewer, and in a fit of pique recommended rejection simply because the authors don't make the data available (they do, however, provide the R scripts used to do the analyses). As the authors patiently pointed out in their response to reviews, the various explicit or implicit licensing statements attached to taxonomic data mean they can't provide the data (and I'm assuming that in at least some cases the dark art of screen scraping was used to get the data).

There's an irony here. Taxonomic databases are becoming hot topics, generating estimates of the scale of the task facing taxonomy, and diagnosing the state of the discipline itself (according to Joppa et al. it's in rude health). This is the sort of thing that can have a major impact on how people perceive the discipline (and may influence how many resources are allocated to the subject). If taxonomists take issue with the analyses then they will find them difficult to repeat, because the taxonomic data they've spent their careers gathering are under lock and key.

How many species are there, and why do we get two very different answers from same data?

Two papers estimating the total number of species have recently been published, one in the open access journal PLoS Biology:

Camilo Mora, Derek P. Tittensor, Sina Adl, Alastair G. B. Simpson, Boris Worm. How Many Species Are There on Earth and in the Ocean?. PLoS Biol 9(8): e1001127. doi:10.1371/journal.pbio.1001127

the second in Systematic Biology (which has an open access option but the authors didn't use it for this article):

Mark J. Costello, Simon Wilson and Brett Houlding. Predicting total global species richness using rates of species description and estimates of taxonomic effort. Syst Biol (2011) doi:10.1093/sysbio/syr080

The first paper has gained a lot of attention, in part because Jonathan Eisen (in his post "Bacteria & archaea don't get no respect from interesting but flawed #PLoSBio paper on # of species on the planet") was mightily pissed off about the estimates of the number:

Their estimates of ~ 10,000 or so bacteria and archaea on the planet are so completely out of touch in my opinion that this calls into question the validity of their method for bacteria and archaea at all.

The fuss over the number of bacteria and archaea seems to me to be largely a misunderstanding of how taxonomic databases count taxa. Databases like Catalogue of Life record described species, and most bacteria aren't formally described because they can't be cultured. Hence there will always be a disparity between the extent of diversity revealed by phylogenetics and by classical taxonomy.

The PLoS Biology paper has garnered a lot more reaction (e.g., the commentary by Carl Zimmer in the New York Times, "How Many Species? A Study Says 8.7 Million, but It’s Tricky") than the Systematic Biology paper, which arguably has the more dramatic conclusion.

How many species, 8.7 million, or 1.8 to 2.0 million?

Whereas Mora et al. in PLoS Biology concluded that there are some 8.7 million (±1.3 million SE) species on the planet, Costello et al. in Systematic Biology arrive at a much more conservative figure (1.8 to 2.0 million). The implications of these two studies are very different: one implies there's a lot of work to do, the other leads to headlines such as 'Every species on Earth could be discovered within 50 years'.

What is intriguing is that both studies use the same databases, Catalogue of Life and the World's Register of Marine Species, and yet arrive at very different results.

So, the question is, how did we arrive at two very different answers from the same data?

Anchoring Biodiversity Information: from Sherborn to the 21st century and beyond

Charles Davies Sherborn, the Natural History Museum's 'magpie with a card-index mind'

Next month I'll be speaking in London at The Natural History Museum at a one day event Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. This meeting is being organised by the International Commission on Zoological Nomenclature and the Society for the History of Natural History, and is partly a celebration of his major work Index Animalium and partly a chance to look at the future of zoological nomenclature.

Details are available from the ICZN web site. I'll be giving a talk entitled "Towards an open taxonomy" (no, I don't know what I mean by that either). But it should be a chance to rant about the failure of taxonomy to embrace the Interwebs.


The top-ten new species described in 2010 and the failure of taxonomy to embrace Open Access publication

Each year the grandly titled International Institute for Species Exploration (IISE) publishes a list of the top 10 species described in the previous year. This year's list is reproduced below, to which I've added the links to the original publications (why do people still think it's OK to omit links to the primary literature when all of these articles are online?).

The striking thing is that only 2 of the 10 species were described in Open Access publications (and I use that term loosely, as Arthropod Systematics & Phylogeny PDFs are freely available, but the licensing isn't clear). Sadly much of our knowledge of the planet's diversity is still locked up behind a paywall.

Species	Reference	DOI/PDF	Open Access
Darwin's Bark Spider	Kuntner, M. and I. Agnarsson. 2010. Web gigantism in Darwin's bark spider, a new species from Madagascar (Araneidae: Caerostris). The Journal of Arachnology 38(2):346-356	10.1636/B09-113.1	No
Bioluminescent Mushroom	Desjardin, D.E., B.A. Perry, D.J. Lodge, C.V. Stevani, and E. Nagasawa. 2010. Luminescent Mycena: new and noteworthy species. Mycologia 102(2):459-477	10.3852/09-197	No
Bacterium	Sanchez-Porro, C., B. Kaur, H. Mann and A. Ventosa. 2010. Halomonas titanicae sp. nov., a halophilic bacterium isolated from the RMS Titanic. International Journal of Systematic and Evolutionary Microbiology 60(12):2768-2774	10.1099/ijs.0.020628-0	No
Monitor Lizard	Welton, L.J., C.D. Siler, D. Bennett, A. Diesmos, M.R. Duya, R. Dugay, E.L.B. Rico, M. van Weerd and R.M. Brown. 2010. A spectacular new Philippine monitor lizard reveals a hidden biogeographic boundary and a novel flagship species for conservation. Biology Letters 6(5):654-658	10.1098/rsbl.2010.0119	No
Pollinating cricket	Hugel, S., C. Micheneau, J. Fournel, B.H. Warren, A. Gauvin-Bialecki, T. Pailler, M.W. Chase and D. Strasberg. 2010. Glomeremus species from the Mascarene islands (Orthoptera, Gryllacrididae) with the description of the pollinator of an endemic orchid from the island of Réunion. Zootaxa 2545:58-68	PDF	No
Duiker	Colyn, M., J. Hulselmans, G. Sonet, P. Oudé, J. de Winter, A. Natta, Z.T. Nagy and E. Verheyen. 2010. Discovery of a new duiker species (Bovidae: Cephalophinae) from the Dahomey Gap, West Africa. Zootaxa 2637:1-30	PDF	No
Leech	Phillips, A.J., R. Arauco-Brown, A. Oceguera-Figueroa, G.P. Gomez, M. Beltran, Y.-T. Lai and M.E. Siddall. 2010. Tyrannobdella rex n. gen. n. sp. and the evolutionary origins of mucosal leech infestations. PLoS ONE 5(4):e10057	10.1371/journal.pone.0010057	Yes
Underwater mushroom	Frank, J.L., R.A. Coffan and D. Southworth. 2010. Aquatic gilled mushrooms: Psathyrella fruiting in the Rogue River in southern Oregon. Mycologia 102(1):93-107	10.3852/07-190	No
Jumping cockroach	Bohn, H., M. Picker, K.-D. Klass and J. Colville. 2010. A jumping cockroach from South Africa, Saltoblattella montistabularis, gen. nov., spec. nov. (Blattodea: Blattellidae). Arthropod Systematics and Phylogeny 68(1):53-69	PDF	Yes
Pancake Batfish	Ho, H.-C., P. Chakrabarty and J.S. Sparks. 2010. Review of the Halieutichthys aculeatus species complex (Lophiiformes: Ogcocephalidae), with descriptions of two new species. Journal of Fish Biology 77(4):841-869	10.1111/j.1095-8649.2010.02716.x	No

Dark taxa: GenBank in a post-taxonomic world

In an earlier post (Are names really the key to the big new biology?), I questioned Patterson et al.'s assertion in a recent TREE article (doi:10.1016/j.tree.2010.09.004) that names are key to the new biology.

In this post I'm going to revisit this idea by doing a quick analysis of how many species in GenBank have "proper" scientific names, and whether the number of named species has changed over time. My definition of a "proper" name is a little loose: anything that had two words, the second one starting with a lower case letter, was treated as a proper name. Hence, a name like "Eptesicus sp. A JLE-2010" is not a proper name, but Eptesicus andersoni is.

Mammals

Since GenBank started, every year has seen some 100-200 mammal species added to the database.

Until around 2003 almost all of these species had proper binomial names, but since then an increasing percentage of species-level taxa haven't been identified to species. In 2010 three-quarters of new tax_ids for mammals weren't identified.

Invertebrates

For "invertebrates" 2010 saw an explosive growth in the number of new taxa sequenced, with nearly 71,000 new taxa added to GenBank.

This coincides with a spectacular drop in the number of properly-named taxa, but even before 2010 the proportion of named invertebrate species in GenBank was in decline: in 2009 just over a half of the species added had binomials.

Bacteria

To put this in perspective, here are the equivalent graphs for bacteria.
Although at the outset most of the bacteria in GenBank had binomial names, pretty quickly the bulk of sequenced bacteria had informal names. In 2010 less than 1% of newly sequenced bacteria had been formally described.

Dark taxa

For bacteria the graphs are hardly surprising. To get a proper name a bacterium must be cultured, and the vast majority of bacteria haven't been (or can't be) cultured. Hence, microbiologists can gloat at the nomenclatural mess plant and animal taxonomists have to deal with only because microbiologists have a tiny number of names to deal with.

For mammals and invertebrates there's a clear decline in the use of proper names. It would be tempting to suggest that this reflects a decline in the number of taxonomists - there might simply not be enough of them in enough groups to be able to identify and/or describe the taxa being sequenced.

However, if we look at the recent peaks of unnamed animal species, we discover that many have names like Lepidoptera sp. BOLD:AAD7075, indicating that they are DNA barcodes from the Barcode of Life Data Systems. Of the 62,365 unnamed invertebrates added last year, 54,546 are BOLD sequences that haven't been assigned to a known species. Of the 277 unnamed mammals, 218 are BOLD taxa. Hence, DNA barcoding is flooding GenBank with taxa that lack proper names (and typically are represented by a single DNA barcode sequence).
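As a quick check on what those counts imply, the share of unnamed taxa that come from BOLD can be worked out directly. This is only a sketch using the four numbers quoted above; the dictionary and function names are made up for illustration.

```python
# Counts quoted in the post for unnamed taxa added to GenBank in 2010.
unnamed = {"invertebrates": 62365, "mammals": 277}
from_bold = {"invertebrates": 54546, "mammals": 218}


def bold_share(group):
    """Percentage of the unnamed taxa in `group` that are BOLD barcode taxa."""
    return 100.0 * from_bold[group] / unnamed[group]


for group in unnamed:
    print(f"{group}: {bold_share(group):.1f}% of unnamed taxa come from BOLD")
```

On these figures roughly seven in eight unnamed invertebrates, and nearly four in five unnamed mammals, are barcode taxa.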

There are various ways to interpret these graphs, but for me the message is clear. The bulk of newly added taxa in GenBank are what we might term "dark taxa", that is, taxa that aren't identified to a known species. This doesn't necessarily mean that they are species new to science. We may already have encountered these species before; they may be sitting in museum collections, and have descriptions already published. We simply don't know. As the output from DNA barcoding grows, the number of dark taxa will only increase, and macroscopic biology starts to look a lot like microbiology.

A post-taxonomic world
If we look at the graphs for bacteria, we see that taxonomic names are virtually irrelevant, and yet microbiology seems to be doing fine as a discipline. So, perhaps it's time to think about a post-taxonomic world where taxonomic names, contra Patterson et al., are not that important. We can discover a good deal about organismal biology from GenBank alone (see my post Visualising the symbiome: hosts, parasites, and the Tree of Life for some examples, as well as Rougerie et al. 2010 doi:10.1111/j.1365-294X.2010.04918.x).

This leaves us with two questions:

1. How much biology can we do without taxonomic names?
2. If the lack of taxonomic names limits what we can do (and, playing devil's advocate, this is an open question) how can we speed up linking GenBank sequences to names?

I suspect that the answer to (1) is "quite a lot" (especially if we think like microbiologists). Question (2) is ultimately a question about how fast we can link literature, museum collections, sequences, and phylogenies. If progress to date is any indication, we need to rethink how we do this, and in a hurry, because dark taxa are accumulating at an accelerating rate.

How the analyses were done

Although the NCBI makes a dump of its taxonomic database available via FTP (at ftp://ftp.ncbi.nih.gov/pub/taxonomy/), this dump doesn't have dates for when the taxa were added to the database. However, using the Entrez EUtilities we can get the tax_ids that were published within a given date range. For example, to retrieve all the tax_ids added to the database in December 2010, we set the URL parameters &mindate=2010/12/01 and &maxdate=2010/12/31 to form this URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&mindate=2010/12/01&maxdate=2010/12/31&retmax=1000000.

I've set &retmax to a big number to ensure I get all the tax_ids for that month (in this case 23511). I then made a local copy of the NCBI database in MySQL (instructions here) and queried for all species-level taxa in GenBank. I used a rather crude regular expression REGEXP '^[A-Z][a-z]+ [a-z][a-z]+$' to find just those species names that were likely to be proper scientific names (i.e., no "sp.", "aff.", museum or voucher codes, etc.). To group the species into major taxonomic groups I used the division_id.
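For anyone wanting to repeat the monthly queries, the URL above can be generated for any month. The following is a minimal sketch, not the script actually used for the post: the function name taxonomy_month_url is made up, and it only builds the URL string without sending a request.

```python
import calendar
from urllib.parse import urlencode

# Base URL exactly as quoted in the post.
ESEARCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def taxonomy_month_url(year, month, retmax=1000000):
    """Build the esearch URL for tax_ids dated within one calendar month."""
    # monthrange returns (weekday of first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    params = {
        "db": "taxonomy",
        "mindate": f"{year}/{month:02d}/01",
        "maxdate": f"{year}/{month:02d}/{last_day:02d}",
        "retmax": retmax,
    }
    # safe="/" leaves the slashes in the dates unescaped, matching the URL above.
    return f"{ESEARCH}?{urlencode(params, safe='/')}"


print(taxonomy_month_url(2010, 12))
```

Looping this over every year and month since GenBank started would give one query per month, each returning the tax_ids dated within that month.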

Results are available in a Google Spreadsheet.
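To see what the crude regular expression accepts and rejects, here is the same pattern applied in Python rather than MySQL. Treat it as an illustration only: the function name is_proper_name is made up, and Python's matching is case sensitive, whereas MySQL's REGEXP behaviour can depend on the column's collation, so the two may not agree on every name.

```python
import re

# Same pattern as in the post: capitalised genus, a space, lower-case epithet.
PROPER_NAME = re.compile(r"^[A-Z][a-z]+ [a-z][a-z]+$")


def is_proper_name(name):
    """True if `name` looks like a plain binomial under the crude pattern."""
    return PROPER_NAME.match(name) is not None


for name in [
    "Eptesicus andersoni",
    "Eptesicus sp. A JLE-2010",
    "Lepidoptera sp. BOLD:AAD7075",
]:
    print(name, "->", is_proper_name(name))
```

Names containing "sp.", voucher codes, or BOLD identifiers fail because of the punctuation, digits, or extra words, which is exactly the intent.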

TreeBASE meets NCBI, again

Déjà vu is a scary thing. Four years ago I released a mapping between names in TreeBASE and other databases called TBMap (described here: doi:10.1186/1471-2105-8-158). Today I find myself releasing yet another mapping, as part of my NCBI to Wikipedia project. By embedding the mapping in a wiki, it can be edited, so the kinds of problems I encountered with TBMap (recounted here, here, and here) should be easier to fix. The mapping in and of itself isn't terribly exciting, but it's the starting point for some things I want to do regarding how to visualise the data in TreeBASE.

Because TreeBASE 2 has issued new identifiers for its taxa (see TreeBASE II makes me pull my hair out), and now contains its own mapping to the NCBI taxonomy, as a first pass I've taken their mapping and added it to http://iphylo.org/linkout. I've also added some obvious mappings that TreeBASE has missed. There are a lot more taxa which could be added, but this is a start.

The TreeBASE taxa that have a mapping each get their own page with a URL of the form http://iphylo.org/linkout/<TreeBase taxon identifier>, e.g. http://iphylo.org/linkout/TB2:Tl257333. This page simply gives the name of the taxon in TreeBASE and the corresponding NCBI taxon id. It uses a Semantic Mediawiki template to generate a statement that the TreeBASE and NCBI taxa are a "close match". If you go to the corresponding page in the wiki for the NCBI taxon (e.g., http://iphylo.org/linkout/Ncbi:448631) you will see any corresponding TreeBASE taxa listed there. If a mapping is erroneous, we simply need to edit the TreeBASE taxon page in the wiki to fix it. Nice and simple.

At the time of writing the initial mapping is still being loaded (this can take a while). I'll update this post when the uploading has finished.

Are names really the key to the big new biology?

David ("Paddy") Patterson, Jerry Cooper, Paul Kirk, Rich Pyle, and David Remsen have published an article in TREE entitled "Names are key to the big new biology" (doi:10.1016/j.tree.2010.09.004). The abstract states:

Those who seek answers to big, broad questions about biology, especially questions emphasizing the organism (taxonomy, evolution and ecology), will soon benefit from an emerging names-based infrastructure. It will draw on the almost universal association of organism names with biological information to index and interconnect information distributed across the Internet. The result will be a virtual data commons, expanding as further data are shared, allowing biology to become more of a ‘big science’. Informatics devices will exploit this ‘big new biology’, revitalizing comparative biology with a broad perspective to reveal previously inaccessible trends and discontinuities, so helping us to reveal unfamiliar biological truths. Here, we review the first components of this freely available, participatory and semantic Global Names Architecture.

Do we need names?

Reading this (full disclosure, I was a reviewer) I can't help wondering whether the assumption that names are key really needs to be challenged. Roger Hyam has argued that we should be calling time on biological nomenclature, and I wonder whether for a generation of biologists brought up on DNA barcodes and GPS, taxonomy and names will seem horribly quaint. For a start, sequences and GPS coordinates are computable: we can stick them in computers and do useful things with them. DNA barcodes can be used to infer identity, evolutionary relationships, and dates of divergence. Taken in aggregate we can infer ecological relationships (such as diet, e.g., doi:10.1371/journal.pone.0000831), biogeographic history, gene flow, etc. While barcodes can tell us something about an organism, names don't. Even if we have the taxonomic description we can't do much with it: extracting information from taxonomic descriptions is hard.

Furthermore, formal taxonomic names don't seem terribly necessary in order to do a lot of science. Patterson et al. note that taxa may have "surrogate" names:

Surrogates include provisional names and specimen, culture or strain numbers which refer to a taxon. 'SAR-11' ('SAR' refers to the Sargasso Sea) was a surrogate name given in 1990 to an important member of the marine plankton. Only a decade later did it become known as Pelagibacter ubique.

The name Pelagibacter ubique was published in 2002 (doi:10.1038/nature00917), although as a Candidatus name (doi:10.1099/00207713-45-1-186), not a name conforming to the International Code of Nomenclature of Bacteria. I doubt the lack of a name that follows this code is hindering the study of this organism, and researchers seem happy to continue to use 'SAR11'.

So, I think that as we go forward we are going to find nomenclature struggling to establish its relevance in the age of digital biology.

If we do need them, how do we manage them?
If we grant Patterson et al. their premise that names matter (and for a lot of the legacy literature they will), then how do we manage them? In many ways the "Names are key to the big new biology" paper is really a pitch for the Global Names Architecture or GNA (and its components GNI, GNITE, and GNUB). So, we're off into alphabet soup again (sigh). The more I think about this the more I want something very simple.

Names
All I want here is a database of name strings and tools to find them in documents. In other words, uBio.

Documents
Broadly defined to include articles, books, DNA sequences, specimens, etc. I want a database of [name,document] pairs (BHL has a huge one), and a database of documents.

Realistically, given the number and type of documents there will be several "document" databases, such as GenBank and GBIF. For citations Mendeley looks very promising. If we had every taxonomic publication in Mendeley, tagged with scientific names, then we'd have the bibliography of life. Taxonomic nomenclators would be essentially out of business, given that their function is to store the first publication of a name. Given a complete bibliography we just create a timeline of usage for a name and note the earliest [name,document] pair.
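The timeline idea can be sketched in a few lines. Everything here is hypothetical: the Usage record, the placeholder name "Aus bus", the document labels, and the years are invented purely to show the mechanics, and a real bibliography would need proper dates and de-duplication.

```python
from typing import NamedTuple


class Usage(NamedTuple):
    """One [name,document] pair plus the year the document appeared."""
    name: str
    document: str
    year: int


def earliest_usage(pairs):
    """Map each name to the earliest [name,document] pair in which it occurs."""
    earliest = {}
    for usage in pairs:
        current = earliest.get(usage.name)
        if current is None or usage.year < current.year:
            earliest[usage.name] = usage
    return earliest


# Invented example data (placeholder names and documents).
pairs = [
    Usage("Aus bus", "doc-3", 1987),
    Usage("Aus bus", "doc-1", 1902),
    Usage("Aus cus", "doc-2", 1950),
    Usage("Aus bus", "doc-2", 1950),
]

for name, usage in earliest_usage(pairs).items():
    print(f"{name}: first seen in {usage.document} ({usage.year})")
```

The earliest pair is only as good as the bibliography's coverage: if the original description hasn't been tagged, the timeline will point at a later usage instead.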

Taxonomy
There are a few wrinkles to deal with. Firstly, names may have synonyms, lexical variants, etc. (the Patterson et al. paper has a nice example of this). Leaving aside lexical variants, what we want is a "view" of the [name,document] pairs that says this subset refer to the same thing (the "taxon concept").

We can obsess with details in individual cases, but at web-scale there are only two that spring to mind. The first is the Catalogue of Life, the second is NCBI. The Catalogue of Life lists sets of names and references that it regards as being the same thing, although it does unspeakable things to many of the references. In the case of NCBI the "concepts" would be the sets of DNA sequences and associated publications linked to the same taxonomy id. Whatever you think of the NCBI taxonomy, it is at least computable, in the sense that you could take a taxon and generate a list of publications "about" that taxon.

So, we have names, [name,document] pairs, and sets of [name,document] pairs. Simples.
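To make those three layers concrete, here is one way they might be represented. This is a toy sketch under stated assumptions, not a proposal for a schema: the concept identifier and the document labels are invented, and the only thing taken from the text above is that SAR11 and Pelagibacter ubique refer to the same organism.

```python
# Layer 1 is the name strings themselves; layer 2 is [name,document] pairs.
# Document labels are invented placeholders.
pairs = {
    ("SAR11", "doc-A"),
    ("Pelagibacter ubique", "doc-B"),
    ("Pelagibacter ubique", "doc-C"),
    ("Aus bus", "doc-D"),
}

# Layer 3: a "taxon concept" is a subset of pairs judged to refer to the same thing.
concepts = {
    "concept-1": {pair for pair in pairs if pair[0] in {"SAR11", "Pelagibacter ubique"}},
}


def names_in(concept_id):
    """All name strings used for a concept."""
    return sorted({name for name, _ in concepts[concept_id]})


def documents_about(concept_id):
    """All documents 'about' a concept, regardless of which name they use."""
    return sorted({document for _, document in concepts[concept_id]})


print(names_in("concept-1"))
print(documents_about("concept-1"))
```

The point of keeping concepts as sets of pairs, rather than sets of names, is that the same name string can then belong to different concepts in different documents.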
