Towards an interactive DjVu file viewer for the BHL

The bulk of the Biodiversity Heritage Library's content is available as DjVu files, which package together scanned page images and OCR text. Websites such as BHL or my own BioStor display page images, but there's no way to interact with the page content itself. Because it's just a bitmap image there's no obvious way to do simple things such as select and copy some text, click on some text and correct the OCR, or highlight some text as a taxonomic name or bibliographic citation. This is frustrating, and greatly limits what we can do with BHL's content.

In March I wrote a short post DjVu XML to HTML showing how to pull out and display the text boxes for a DjVu file. I've put this example, together with links to the XSLT file I use to do the transformation online at Display text boxes in a DjVu page. Here's an example, where each box (a DIV element) corresponds to a fragment of text extracted by OCR software.

boxes.png

The next step is to make this interactive. Inspired by Google's Javascript-based PDF viewer (see How does the Google Docs PDF viewer work?), I've revisited this problem. One thing the Google PDF viewer does nicely is enable you to select a block of text from a PDF page, in the same way that you can in a native PDF viewer such as Adobe Acrobat or Mac OS X Preview. It's quite a trick, because Google is displaying a bitmap image of the PDF page. So, can we do something similar for DjVu?

The thing I'd like to do is something what is shown below — drag a "rubber band" on the page and select all the text that falls within that rectangle:

textselection.png

This boils down to knowing for each text box whether it is inside or outside the selection rectangle:

selection.png


Implementation

We could try and solve this by brute force, that is, query each text box on the page to see whether it overlaps with the selection or not, but we can make use of a data structure called an R-tree to speed things up. I stumbled across Jon-Carlos Rivera's R-Tree Library for Javascript, and so was inspired to try and implement DjVu text selection in a web browser using this technique.

The basic approach is as follows:

  1. Extract text boxes from DjVu XML file and lay these over the scanned page image.

  2. Add each text box to a R-tree index, together with the "id" attribute of the corresponding DIV on the web page, and the OCR text string from that text box.

  3. Track mouse events on the page, when the user clicks with the mouse we create a selection rectangle ("rubber band"), and as the mouse moves we query the R-tree to discover which text boxes have any portion of their extent within the selection rectangle.

  4. Text boxes in the selection have their background colour set to an semi-transparent shade of blue, so that the user can see the extent of the selected text. Boxes outside the selection are hidden.

  5. When the user releases the mouse we get the list of text boxes from the R-tree, and concatenate the text corresponding to each box, and finally display the resulting selection to the user.



Copying text

So far so good, but what can we do with the selected text? One obvious thing would be to copy and paste it (for example, we could select a species distribution and paste it into a text editor). Since all we've done is highlight some DIVs on a web page, how can we get the browser to realise that it has some text it can copy to the clipboard? After browsing Stack Overflow I came across this question, which gives us some clues. It's a bit of a hack, but behind the page image I've hidden a TEXTAREA element, and when the user has selected some text I populate the TEXTAREA with the corresponding text, then set the browser's selection range to that text. As a consequence, the browser's Copy command (⌘C on a Mac) will copy the text to the clipboard.

Demo

You can view the demo here. It only works in Safari and Chrome, I've not had a chance to address cross-browser compatibility. It also works in the iPad, which seems a natural device to support interactive editing and annotation of BHL text, but you need to click on the button On iPad click here to select text before selecting text. This is an ugly hack, so I need to give a bit more thought to how to support the iPad touch screen, while still enabling users to pan and zoom the page image.

Next steps

This is all very crude, but I think it shows what can be done. There are some obvious next steps:

  • Enable selected text to be edited so that we can correct the underlying OCR text.

  • Add tools that operate on the selected text, such as check whether it is a taxonomic name, or if it is a bibliographic citation we could attempt to parse it and locate it online (such as David Shorthouse's reference parser).

  • Select parts of the page image itself, so that we could extract a figure or map.

  • Add "post it note" style annotations.

  • Add services that store the edits and annotations, and display annotations made by others.


Lots to do. I foresee a lot of Javascript hacking over the coming weeks.

Scripting Life

Not really a blog post, more a note to self. If I ever did get around to writing a book again, I think Scripting Life would be a great title.

PLoS Biodiversity Hub launches

hubs.png

The PLoS Biodiversity Hub has launched today. There's a PLoS blog post explaining the background to the project, as well as a summary on the Hub itself:

The vision behind the creation of PLoS Hubs is to show how open-access literature can be reused and reorganized, filtered, and assessed to enable the exchange of research, opinion, and data between community members.

PLoS Hubs: Biodiversity provides two main functions to connect researchers with relevant content. First, open-access articles on the broad theme of biodiversity are selected and imported into the Hub. In time, the content will also be enhanced so that the articles are connected with data, and we will provide features to make the articles easier for people to use. These two functions - aggregation and adding value - build on the concept of open access, which removes all the barriers to access and reuse of journal article content.


Readers of iPhylo may recall my account of one of the meetings involved in setting up this hub, in which I began to despair about the lack of readiness of biodiversity informatics to provide much of the information needed for projects such as hubs. Despite this (or perhaps, because of it), I've become a member of the steering committee for the Biodiversity Hub. There's clearly a lot of interest in repurposing the content found in scientific articles, and I think we're going to see an increasing number of similar projects from the major players in science publishing, Open Access or otherwise. One of the challenges is going to be moving beyond the obvious things (such as making taxon names clickable) to enable new kinds of ways of reading, navigating, and querying the literature, and exploring ways to track the use that is made of the information in these articles. Biodiversity studies are ideally placed to explore this as the subject is data rich and much of that data, such as specimens and DNA sequences, persist over time and hence get reused (data citation gets very boring if the data is used just once). We also have obvious ways to enrich navigation, such as spatially and taxonomically.

For now the PLoS Biodiversity Hub is very pretty, but it's more a statement of intent than a real demonstration of what can be done. Let's hope our field gets its act together and seizes the opportunity that initiatives like the Hub represents. Publishers are desperate to differentiate themselves from their competitors by providing added value as part of the publication process, and they provide a real use case for all the data that the biodiversity projects have been accumulating over the last couple of decades.