Social Bookmarking For Scientists - The Best Of Both Worlds
Introduction
Connotea (http://www.connotea.org/) might be better described as a social reference management service rather than a social bookmarking service. This is because it is geared towards the needs of clinicians and scientists, a community who need to track original research (with citable references) as part of their day-to-day work. Nevertheless, its origins lie in social bookmarking – Connotea began life as an experiment in applying that paradigm to the scientific information space by Nature Publishing Group (NPG)’s New Technology team. It is now offered as a free service alongside other NPG products and services.
So what does the 'best of both worlds' mean here? As web publishing has developed, academic publishers have developed a number of robust, highly structured, and interdependent technologies for linking to and retrieving online information. But social bookmarking and other web 2.0 developments have shown the benefits of less structured, less hierarchical and less tightly coupled approaches. NPG’s aim in building Connotea is to meet the needs of its audience by combining traditional reference management and academic publishing with the more web-focused approaches of social bookmarking.
Connotea is billed as a service ‘for clinicians and scientists’. Why do we focus on, and market the tool to, a specific community? Our hope is that this will enhance the content discovery benefits and social aspects of the tool by concentrating the commonality of users' interests. This is, of course, a hypothesis that will be tested during the future development of the service.
Readers will be familiar with the basic concepts of social bookmarking: web-based, using tags for organisation, open collections, and the ability to find related users. This introduction is a very brief tour in order to demonstrate that Connotea is a social bookmarking service – the rest of this paper will focus on the main additional feature of Connotea that distinguishes it and lets it better meet the needs of scientists and clinicians – the automatic import of bibliographic information by means of URL scanning. I will be discussing the whats, the hows and the whys of that feature in the following sections, and will conclude with a brief whine about the problems with URL scanning, followed by an 'if wishing made it so' section on the future of this technique.
Note: There has been some parallel evolution in this space. CiteULike (http://www.citeulike.org/) is another, independent, service that has addressed similar problems to those described here. This paper only reports on the history and features of Connotea.

Figure 1. Connotea is a social bookmarking tool.
Figure 1 demonstrates the social bookmarking origins of Connotea – users’ posts are publicly available (by default), connected to other users who post the same or similar content, and are organised with tags.
Excellent, that dispenses with that, and now we can get into the exciting bits of Connotea.
The What: Automatic Import of Bibliographic Information
Bookmarking is about saving links. In addition to saving a link, you might also want to save, say, the title of the web page and a brief description of it. Some sites even save the entire text of the page alongside the link.
Users of Connotea are very often bookmarking formal, scholarly pieces, where there are more things that they want to know about the content. Things like the name of the journal an article was published in, the date it was published, the correct title of the article (without the search engine friendly pipe | delimited | section | information in the page title), and even the volume, issue and page numbers for the print-published version. That is to say, everything that goes into a standard formal citation:
Genetical Implications of the Structure of Deoxyribonucleic Acid
Watson, J. D., and Crick, F. H. C.
Nature, 171, 964–967 (1953).
Connotea, wherever it can, automatically recognises URLs that are for citable articles and attempts to automatically import this information.

Figure 2. The Connotea bookmarking form, showing an article whose bibliographic information has been automatically imported.
Figure 2 shows Connotea importing the bibliographic information for this article during the process of bookmarking it.
The How: URL Scanning
Connotea accomplishes this recognition and import by means of a series of "citation source plug-ins". A Connotea plug-in is a Perl module written against a simple API specification. Figure 3 illustrates the process.

Figure 3. URL Scanning in Connotea.
Before insertion into the database, Connotea runs each of the plug-ins against the URL that the user is bookmarking. If one of the plug-ins signals that it can further process the URL, Connotea uses it to do precisely that.
So what exactly do the plug-ins do? There are three distinct tasks that each plug-in can accomplish:
Step 1: understands?
The answer to this question asked of the plug-in is a simple true or false – in other words, "do you think you can get extra information for this URL?" Connotea citation source plug-ins generally use the following technique for determining whether they can further process a URL:
Does the URL host domain match a set list?
  Yes -> Does the path match a particular regular expression?
    Yes -> Return True
    No -> Return False
  No -> Return False
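The decision above can be sketched in a few lines. This is a minimal illustration in Python rather than Connotea's own code (the real plug-ins are Perl modules); the host list, the path pattern and the function name are all hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical values for illustration only; each real plug-in would
# carry its own host list and path pattern.
KNOWN_HOSTS = {"www.example-journal.org", "example-journal.org"}
ARTICLE_PATH = re.compile(r"^/content/\d+/\d+/\d+$")


def understands(url: str) -> bool:
    """Return True if this (hypothetical) plug-in thinks it can process the URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # First question: is the host on the set list?
    if host not in KNOWN_HOSTS:
        return False
    # Second question: does the path look like an article path?
    return ARTICLE_PATH.match(parts.path) is not None
```

With these made-up values, an article-style path on a listed host is accepted, while an "about" page on the same host, or the same path on an unlisted host, is rejected.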
Step 2: filter!
I wish this step weren't necessary, but it is. There are a lot of un-cool (http://www.w3.org/Provider/Style/URI) URLs on the web, especially in science publishing.
The filter step gives the plug-in a chance to alter the URL before it goes into the Connotea database. The idea here is to clean up crufty links and make them more permanent and comparable. The most common task in this step is to remove unnecessary URL parameters and session information that confuse URL matching and lead to duplicate articles in the database.
The filter step has also allowed us to do a few sneaky tricks. For example, we support the doi: pseudo-URI scheme in this way. Users can enter "doi:" followed by a DOI (Digital Object Identifier – see later for more details on this), which is recognised by our CrossRef plug-in’s ‘understands’ step and converted to a clickable http://dx.doi.org/ link in the ‘filter’ step before going into the database. The new info: URI scheme (http://www.faqs.org/rfc/rfc4452.txt) may make this kind of trick more legitimate, but we would still have to alter the URI to make it clickable.
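A filter step along these lines might look like the following Python sketch. It is an assumption-laden illustration, not the Connotea implementation: the set of parameters treated as session noise is invented for the example, and dropping the URL fragment is likewise a choice made here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of parameters that carry session state rather than
# identifying the article.
NOISE_PARAMS = {"sessionid", "jsessionid", "sid"}


def filter_url(url: str) -> str:
    """Return a cleaned-up form of the URL for storage and comparison."""
    # The doi: pseudo-URI is turned into a clickable dx.doi.org link.
    if url.lower().startswith("doi:"):
        return "http://dx.doi.org/" + url[4:].strip()
    parts = urlsplit(url)
    # Keep every query parameter except the ones flagged as noise.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    ]
    # Rebuild the URL without the noise parameters and without a fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Two bookmarks of the same article that differ only in a session parameter then collapse to one stored URL, which is the point of the step.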
Step 3: get_citation
This is where the plug-in makes good on its promise, and goes out onto the web to find the data.
The main problem faced in this step can be summed up in the following question: "All I know is the URL for this article, so how can I find out where I can get some basic data about the article?" The ideal answer to that question would be that the URL is all you need to know, but, alas, that's not possible the vast majority of the time.
So, instead, we do what you're not supposed to do and unpick the URL in order to extract some unique identifier from it. We then use that ID to query a web service, or to construct a new URL, in order to retrieve the extra data. Here, in the abstract, are the two basic processes that all the plug-ins use:
Method A:
Extract an identifier from the article URL
Construct a new URL using that ID and a common base template
This will be the URL for a data file (most often in RIS format)
GET that URL and parse the content
Example: The journals hosted by Highwire Press (http://highwire.stanford.edu/) are supported in this way.
Method B:
Extract an identifier from the article URL
Use that ID in a query to a relevant web service
Parse the results of the query
Example: Querying the CrossRef (http://www.crossref.org/) web service with DOIs extracted from Blackwell Synergy (http://www.blackwell-synergy.com/) URLs.
To further illustrate how this works, let’s walk through another example of method A scanning – in this case, an article bookmarked from PubMed (http://www.pubmed.org/).
The article URL:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=13063483&query_hl=4&itool=pubmed_docsum
We scan this URL and extract the list_uids parameter, which holds the PubMed ID for this article.
list_uids=13063483
PubMed offers an E-Utilities service (http://eutils.ncbi.nlm.nih.gov) for access to the raw data for its articles. Query URLs for this service are constructed like this:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=xml&db=pubmed&id=[PubMed ID goes here]
Using the PubMed ID we already prepared, we get a URL that looks like this:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=xml&db=pubmed&id=13063483
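The two URL manipulations in this walk-through can be sketched as follows. The function names are hypothetical and the code is a Python illustration, not the Perl plug-in itself; the parameter name and the E-Utilities URL template are the ones shown above.

```python
from typing import Optional
from urllib.parse import urlsplit, parse_qs, urlencode

# Base of the E-Utilities query URL, as given in the text above.
EFETCH_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pubmed_id(article_url: str) -> Optional[str]:
    """Extract the PubMed ID from the list_uids parameter, if there is one."""
    query = parse_qs(urlsplit(article_url).query)
    ids = query.get("list_uids", [])
    return ids[0] if ids and ids[0].isdigit() else None


def efetch_url(pmid: str) -> str:
    """Build the data URL for a PubMed ID from the common base template."""
    params = {"retmode": "xml", "db": "pubmed", "id": pmid}
    return EFETCH_BASE + "?" + urlencode(params)
```

Calling pubmed_id on the article URL above gives "13063483", and efetch_url("13063483") reproduces the query URL shown above.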
We then issue an HTTP GET for this URL, which returns an XML document that contains, among other things, the following snippet of data:
<ArticleTitle>Genetical implications of the structure
of deoxyribonucleic acid.</ArticleTitle>
<Pagination>
  <MedlinePgn>964-7</MedlinePgn>
</Pagination>
<AuthorList CompleteYN="Y">
  <Author ValidYN="Y">
    <LastName>WATSON</LastName>
    <ForeName>J D</ForeName>
    <Initials>JD</Initials>
  </Author>
  <Author ValidYN="Y">
    <LastName>CRICK</LastName>
    <ForeName>F H</ForeName>
    <Initials>FH</Initials>
  </Author>
</AuthorList>
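To make the parsing step concrete, here is a minimal Python sketch that pulls the title, pages and authors out of such a snippet. The wrapping Article element is added here only so that the fragment has a single root, and the function name and the shape of the returned dictionary are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The snippet from the text, wrapped in a single root element so that it
# parses as a document. The wrapper is an assumption of this sketch.
SNIPPET = """<Article>
<ArticleTitle>Genetical implications of the structure
of deoxyribonucleic acid.</ArticleTitle>
<Pagination><MedlinePgn>964-7</MedlinePgn></Pagination>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y"><LastName>WATSON</LastName><ForeName>J D</ForeName><Initials>JD</Initials></Author>
<Author ValidYN="Y"><LastName>CRICK</LastName><ForeName>F H</ForeName><Initials>FH</Initials></Author>
</AuthorList>
</Article>"""


def parse_citation(xml_text: str) -> dict:
    """Extract title, page range and author names from the XML."""
    root = ET.fromstring(xml_text)
    # Collapse the line break inside the title into a single space.
    title = " ".join((root.findtext(".//ArticleTitle") or "").split())
    pages = root.findtext(".//MedlinePgn")
    authors = [
        (author.findtext("LastName"), author.findtext("Initials"))
        for author in root.iter("Author")
    ]
    return {"title": title, "pages": pages, "authors": authors}
```

Note that the data arrive with their own quirks (upper-case surnames, an abbreviated page range), which any real importer would then have to normalise.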
By the way, all this happens in real time, of course, while the user is bookmarking the URL.
Interlude: A Brief Note on Data Formats and Protocols
None of the services that Connotea queries to retrieve bibliographic data uses SOAP. Some of the services use XML over HTTP, and some of these are RESTful.
None of the services gives its data in a standard XML format. Some give data in a proprietary XML format; others use a plain text format like RIS (the format supported by desktop software like 'Reference Manager') or something that approximates RIS.
All of the protocols used are different. All the common data formats used have parsing quirks unique to a particular publisher.
The Why: Retrieval and Discovery
Each bookmark in Connotea gets, wherever possible, a set of structured data associated with it. Along with URL and HTML title, there is the official article title, the name of the journal, the publication date, volume, issue, and page numbers, and a full list of authors. Connotea also, wherever possible, finds standard identifiers for the article.
What help does this give to the user? One answer is that while Connotea already has tagging for retrieval and discovery, now there's some text to search too.

Figure 4. Searching using an author name in Connotea.
Connotea users generally use tags to organise their collections according to subject area. But Connotea also offers a bibliographic information text search facility, as shown in Figure 4. This allows users to retrieve bookmarks by, for example, searching for the name of the journal, searching author names, or with keyword matches from the article title or their own description of it. It's an extra dimension to the service that comes with no extra effort from the perspective of the user.
The extra data that Connotea automatically collects is also used for linking. As mentioned above, URLs in scientific publishing can be somewhat less than perma- in their nature. To address this concern, and to facilitate reference linking within primary research, the scholarly publishing community makes use of Digital Object Identifiers (DOIs), assigning a DOI to each article published. CrossRef (http://www.crossref.org/) maintains a look-up table of DOIs and publisher URLs, along with the basic bibliographic information for the article. DOIs can be de-referenced by prefixing them with http://dx.doi.org/, which simply redirects to the publisher URL (for example, here is a DOI link for a paper about Connotea published in D-Lib Magazine: http://dx.doi.org/10.1045/april2005-lund). If the publisher URL changes, the CrossRef database is updated and link integrity is maintained, and hence the dx.doi.org link is considered permanent.
Wherever possible, Connotea links out to bookmarks using the article DOI as well as the original URL. PubMed IDs are also widely used and recognised as article identifiers, so a link to the PubMed entry for an article is also provided if the ID has been imported.
Even without such identifiers, the imported bibliographic information allows Connotea to provide additional linking options via OpenURL (http://www.niso.org/committees/committee_ax.html). OpenURL has been described extensively elsewhere (http://en.wikipedia.org/wiki/OpenURL), but suffice to say here that if a user is at an institution with a local OpenURL resolver, Connotea can provide links to library look-ups for holdings of a particular article at that institution.
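As a rough sketch of what such a link might look like, the Python below builds a query for a resolver from imported bibliographic fields. It assumes the key/value style of the original (version 0.1) OpenURL syntax, with keys such as genre, title, volume, spage and date; the resolver address and the function name are hypothetical, and how Connotea actually constructs its links is not described here.

```python
from urllib.parse import urlencode

# Keys assumed from OpenURL 0.1-style queries; "title" is the journal
# title and "atitle" the article title in that convention.
OPENURL_KEYS = ("atitle", "title", "aulast", "volume", "issue", "spage", "date")


def openurl_link(resolver_base: str, citation: dict) -> str:
    """Build a resolver link from whatever citation fields are available."""
    fields = {"genre": "article"}
    for key in OPENURL_KEYS:
        if citation.get(key):
            fields[key] = citation[key]
    return resolver_base + "?" + urlencode(fields)
```

For the Watson and Crick citation, passing a made-up resolver base of http://resolver.example.edu/openurl with the journal title, volume, start page and year yields a link ending in ?genre=article&title=Nature&volume=171&spage=964&date=1953.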
Future plans for Connotea include de-duplicating articles in the database based on these identifiers and metadata, although right now Connotea is URL-centric.
The Whine
We encounter three main problems with the URL scanning approach outlined above.
Poorly documented and poorly implemented data formats.
Item: A variety of different XML schemata.
Item: Liberal interpretations of RIS (http://www.refman.com/support/risformat_intro.asp), both in terms of the syntax and in terms of what data go in which fields.
Unnecessary hoop-jumping to get to the data.
By way of illustrating this point, let me describe the procedure Connotea takes for one publisher, who shall remain nameless.
Parse the article URL and extract an identifier.
Issue an HTTP POST request to one URL and save the cookie that is set by the response.
Do a POST to another URL with the extracted ID in the body, along with the cookie.
The response for this request can then be parsed and the extra data extracted.
We have to reverse-engineer each publisher's site, and we have to write ad hoc rules and custom procedures in each case.
Now, the first problem we can deal with – that is the reality of life on the web, although there is certainly room for improvement. But the second two are just crazy. It's a lose-lose situation: publishers don't get their content well-represented and discoverable in Connotea, and we have to reverse engineer URL formats and scrape web pages. There must be a better way.
The Wish
One should always be suspicious of suggestions that start with “wouldn't it be great if”, because they often go on to ask for the impossible situation that “everyone agree to do things in this one, particular way”. However...
Wouldn't it be great if publishers could just choose to be supported in Connotea by implementing something at their end? The thing that they implemented could be very simple, the thing that the web thrives on – a link, a URL pointing to the metadata about the article. It could even be hidden in the (non-access controlled) HTML, perhaps in the <head> section.
Does this sound familiar? Are you thinking “RSS/Atom auto-discovery”? So am I.
NPG recently announced OTMI (the Open Text Mining Interface), a suggestion for how paid-for content publishers could support the increasing number of researchers who do text mining on the scientific literature in an open and web-friendly way (see http://blogs.nature.com/wp/nascent/2006/04/open_text_mining_interface_1.html). OTMI could be the subject of a whole separate paper, but in a nutshell, the suggestion is that publishers place Atom entry documents (http://www.atomenabled.org/developers/syndication/atom-format-spec.php#atom.documents) for their articles on their web site, which would contain pre-digested text analysis results (word vectors and sentence snippets) for those articles. However, the text-mining aspect of this suggestion is not what concerns us here, because, in addition, the documents would reference the article in question using identifiers and bibliographic information, and this is where Connotea comes in.
We will create a Universal plug-in for Connotea that does the following:

Figure 5. Proposed architecture for the Connotea ‘Universal’ citation source plug-in.
The process by which Connotea could automatically pick up extra data for an article would go something like this:
Connotea GETs the HTML for the article.
It looks in the HTML for an element of the form:
<link rel="OTMI" type="application/atom+xml" href="../otmi/otmi-nature04614.xml" />
If present, it GETs that URL, parses the Atom entry document and extracts the relevant data.
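The discovery step of this proposal could be sketched as below. Since the Universal plug-in is only a suggestion, everything here is hypothetical: the class and function names are invented, the code is Python rather than Perl, and the page address in the usage note is made up. It scans the HTML for the link element and resolves a relative href against the article URL.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class OTMILinkFinder(HTMLParser):
    """Remember the href of the first <link rel="OTMI"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if rel == "otmi" and attributes.get("type") == "application/atom+xml":
            self.href = attributes.get("href")


def find_otmi_url(html: str, page_url: str) -> Optional[str]:
    """Return the absolute URL of the Atom entry document, or None."""
    finder = OTMILinkFinder()
    finder.feed(html)
    return urljoin(page_url, finder.href) if finder.href else None
```

For a page at http://www.example.org/journal/full/nature04614.html containing the link element above, this returns http://www.example.org/journal/otmi/otmi-nature04614.xml; a page without the element returns None, and the bookmark would simply be saved without extra data.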
This way, any publisher who wanted to be supported by bibliographic information collection in Connotea need only implement OTMI, and the data would be automatically collected. The publisher would then be in control of the data they supplied to Connotea.
This is, of course, still just a suggestion, but there are a number of possible enhancements:
Look for more than just application/atom+xml in the type attribute, and parse the data appropriately. Those publishers that already offer RIS or other citation downloads could simply point to the appropriate URL.
What if the URL being bookmarked is for a non-HTML version of the article? HTTP content negotiation could help here. Connotea would GET with an Accept: application/xhtml+xml header, to which the publisher could respond with a document pointing in the right place.
Even better, content negotiation could mean that a GET for the article URL with an Accept: application/atom+xml header would yield the data document directly.
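The content negotiation idea amounts to sending a different Accept header with the same GET. A minimal Python sketch, with hypothetical function names and no claim about how any publisher would actually respond, might be:

```python
import urllib.request


def metadata_request(article_url: str,
                     accept: str = "application/atom+xml") -> urllib.request.Request:
    """Prepare a GET for the article URL that asks for a given media type."""
    return urllib.request.Request(
        article_url, headers={"Accept": accept}, method="GET"
    )


def is_atom(content_type: str) -> bool:
    """Check whether a response's Content-Type says it is an Atom document."""
    return content_type.split(";")[0].strip().lower() == "application/atom+xml"
```

The request would then be passed to urllib.request.urlopen, and the Content-Type of the response checked with is_atom before parsing, so that a publisher who ignores the header and returns ordinary HTML is handled by falling back to the link-discovery step.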
Of course, this approach to associating a resource with data about the resource is not limited to scientific publishing. Blog posts and job postings could have Atom entry documents associated with them, the data in which would be very useful to users of Connotea or other link-saving services.
There's a potential problem here: an Atom-based approach would have to make heavy use of extensions, especially when transporting data about scholarly articles. Atom extension support is mostly 'insert a blob of data here', and some publishers will inevitably ask whether this causes interoperability problems. I don't think it will, but this approach is held in less than high esteem in some circles in STM publishing, and that may hinder adoption of OTMI (or an OTMI-like approach) for data publishing. NPG is, of course, eager to hear other ideas about how this URL-to-metadata problem might be addressed.




