Thursday, May 13, 2010

A Long Handle on Shortened Digital Object Identifiers

Google has launched a social networking site called Orkut. Already the site has over 100 million members worldwide! What's that? You haven't heard of it? No it's not new, it's actually more than 6 years old. The 100 million users- they're mostly in Brazil and India.

You might be asking yourself, "What is Google doing running a social networking site for Brazil and India?", but a better question might be "What do you do in life when you take a home run swing and manage a nub single?"

The technology behind the Digital Object Identifier has proven to be one of these infield hits. It's called the "Handle" system, and it was developed fifteen years ago by computer scientists who feared that the DNS, which maps host names to IP addresses on the internet, would fail to scale for the very large numbers of digital objects needed in future digital library applications. Bob Kahn, who with Vint Cerf designed the Transmission Control Protocol (TCP) underlying today's internet, was the driving force behind the Handle system. It implements a highly scalable distributed naming infrastructure for digital objects. Kahn's organization, CNRI, then developed the Digital Object Identifier to serve as a core application of the Handle System.

The Digital Object Identifier, or DOI, was designed and promoted as a solution to two problems, broken URLs and rights management. The fact that fifteen years later broken URLs are still a problem and digital rights are a quagmire suggests that the DOI has had limited impact in these areas. Did something go wrong?

Fifteen years ago, Google did not exist. Netscape was just a year old. The developers of the Handle system hoped that Handle resolution would get built into web browser software alongside DNS. That never really happened, perhaps because Netscape and Microsoft had development roadmaps for web servers and browsers that diverged from CNRI's vision. To allow Handles to work in unmodified browser software, CNRI was forced to implement a proxy service that connected the system of DNS resolution to the system of handle resolution via a web server. This proxy, at http://dx.doi.org/, serves almost all of the world's DOIs. A link pointed at the DOI proxy gets redirected to a web address designated by the owner of the digital object or perhaps to a library with DOI-enabled software. This redirection capability gives publishers the flexibility to move digital objects from one address to the other without breaking links, even if the object is traded to another publisher using a completely different website.
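The redirection idea can be sketched as a lookup table sitting behind a fixed address. This is a toy model, not the actual proxy implementation: the `registry` dictionary, the `proxy_response` function, the made-up DOI `10.9999/example.1` and the `.example` hosts are all assumptions for illustration.

```python
# Toy model of a redirecting proxy; not the real dx.doi.org implementation.
PROXY_PREFIX = "http://dx.doi.org/"

# Hypothetical registry: DOI -> current location chosen by the object's owner.
registry = {
    "10.9999/example.1": "http://publisher-a.example/articles/1",
}

def proxy_response(url, registry):
    """Return (status, headers) the way a redirecting proxy might."""
    if not url.startswith(PROXY_PREFIX):
        return 400, {}
    doi = url[len(PROXY_PREFIX):]
    target = registry.get(doi)
    if target is None:
        return 404, {}
    return 302, {"Location": target}

link = PROXY_PREFIX + "10.9999/example.1"
before = proxy_response(link, registry)

# The object moves to another publisher's site; only the registry entry changes.
registry["10.9999/example.1"] = "http://publisher-b.example/journal/1"
after = proxy_response(link, registry)

print(before)
print(after)
```

The published link never changes; only the entry behind it does, which is the whole point of the indirection.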

Things have changed in 15 years. The World Wide Web turned out to be not so interested in digital objects with complex management systems and rights regimes. Instead, the World Wide Web turned out to be an unthinkably large number of web pages with negligible rights management indexed by search engines. Persistence of links turned out to be less important than the findability of content in search engines.

Since search engines never bothered to learn about the Handle system, the DOI proxy turned out to be much more important than the Handle resolution system that hides behind it. Details that were inconsequential 15 years ago have become important. One of these details is the DOI proxy's HTTP status code. This code tells a requestor the meaning of the redirect URL. As I've previously written, the DOI proxy returns a 302 status code. The Google index interprets this to mean that it should assign indexing weight to the URL beginning with "http://dx.doi.org/", and not the publisher-specified URL. 302 is the correct code for the DOI proxy, because if the publisher-specified URL changes, the publisher doesn't want to lose all the "link juice" it has earned by being linked to from other sites.
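A rough way to picture the difference the status code makes is a function that decides which URL gets the credit. This is a deliberately simplified sketch: real search engines weigh many more signals, and the function name `credited_url` and the example URLs are assumptions for illustration.

```python
# Simplified model of which URL an indexer credits after a redirect.
# Only the status code is considered; real indexers use many more signals.
PERMANENT = {301, 308}   # the resource has moved for good
TEMPORARY = {302, 307}   # keep using the URL that was requested

def credited_url(requested_url, status, location):
    """Return the URL that would accumulate link credit under this model."""
    if status in PERMANENT:
        return location
    # Temporary redirects (and anything else) leave credit with the request URL.
    return requested_url

doi_link = "http://dx.doi.org/10.9999/example.1"     # made-up DOI
publisher_url = "http://publisher.example/article/1"  # made-up target

print(credited_url(doi_link, 302, publisher_url))  # stays with the proxy URL
print(credited_url(doi_link, 301, publisher_url))  # moves to the publisher URL
```

Under this model a 302 keeps the credit attached to the stable DOI link, so it survives any later change of the publisher's address.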

The DOI has not been generally adopted by the web at large, with an important exception, CrossRef. CrossRef added some useful machinery onto the DOI and turned it into an absolutely essential service for publishers of ejournals and other types of content that need to embed persistent links. CrossRef accounts for 96% of all registered DOIs and DOI resolutions (about 60 million per month).

60 million resolutions per month might seem like a lot of traffic, but it's not so big on the scale of today's web. Google delivers that many searches in 3 hours. A more direct comparison would be bit.ly, the URL shortening service, which reported 3.4 BILLION resolutions in March, or 4.6 million resolutions per hour.
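The hourly figures are easy to reproduce. The sketch below uses only the numbers quoted above; applying the 31-day length of March to the CrossRef monthly figure as well is an assumption made to keep the two rates comparable.

```python
HOURS_IN_MARCH = 31 * 24             # 744 hours

bitly_per_month = 3.4e9              # bit.ly resolutions reported for March
crossref_per_month = 60e6            # CrossRef DOI resolutions per month

bitly_per_hour = bitly_per_month / HOURS_IN_MARCH
crossref_per_hour = crossref_per_month / HOURS_IN_MARCH
ratio = bitly_per_hour / crossref_per_hour

print(f"bit.ly:   {bitly_per_hour:,.0f} per hour")
print(f"CrossRef: {crossref_per_hour:,.0f} per hour")
print(f"ratio:    {ratio:.0f}x")
```

On these numbers bit.ly handles a bit under 60 times the DOI traffic.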

The staggering popularity of URL shortening services such as bit.ly prompted CrossRef's Geoffrey Bilder to propose last year a similar service for DOI links. Evidently, the folks at the International DOI Foundation agreed that this was a good idea, because last week, they launched the "shortDOI" service.

ShortDOI is meant to address a shortcoming of DOIs- their length and ugliness. When DOI started, no one could have imagined that URLs would appear prominently on boxes of children's cereal, as they do today. It was assumed that they would be hidden in links and be used exclusively by machines. The original spec for the DOI string even allowed DOIs to include spaces and non-printing Unicode characters! Digital object owners were free to choose ID strings that were long and full of punctuation, even punctuation that was incompatible with web pages. ShortDOI uses a small number of alphanumeric characters to do away with all the DOI ugliness. It also does away with the publisher prefix, which hasn't been much use anyway. So instead of 10.1103/PhysRevLett.48.1559 or 10.1002/(SICI)1097-0258(19980815/30)17:15/16<1661::AID-SIM968>3.0.CO;2-2, shortDOI lets us use URLs like http://doi.org/aa9 and http://doi.org/aabbe.
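To see why the long form is awkward in practice, here is a minimal sketch of what it takes to put such a DOI into a link safely. It uses Python's standard `urllib.parse.quote`; the helper name `doi_to_url` is made up, and whether the proxy accepts every encoded form is an assumption not tested here.

```python
from urllib.parse import quote, unquote

PROXY = "http://dx.doi.org/"   # the proxy address mentioned in this post

def doi_to_url(doi):
    # Keep "/" readable; percent-encode everything else that is not an
    # unreserved URL character (letters, digits, "-", ".", "_", "~").
    return PROXY + quote(doi, safe="/")

ugly = "10.1002/(SICI)1097-0258(19980815/30)17:15/16<1661::AID-SIM968>3.0.CO;2-2"
url = doi_to_url(ugly)
print(url)

# Decoding the path gives back the original DOI string.
assert unquote(url[len(PROXY):]) == ugly
```

The angle brackets, colons, parentheses and semicolon all turn into percent escapes, which is exactly the kind of string nobody wants to print on a cereal box.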

ShortDOI can't quite decide whether it's a DOI or a shortener. Like DOI, it's implemented using the Handle system. Instead of redirecting through the DOI proxy, shortDOI acts as an alternate proxy, and uses the same 302 redirects that the long DOI proxy uses. From the perspective of search engines, a shortDOI is a new object to be ranked separately from the DOI. The link juice earned by a short DOI won't accrue to the DOI it has shortened.

Like a shortener, shortDOI assigns codes sequentially, making it easy for robots to harvest content identified by shortDOIs. ShortDOI allows anyone to create the shortened URL, but provides none of the tracking, statistics, spam protection and malware rejection offered by other shortener services. Library OpenURL servers don't yet work with shortDOI, even though the shortDOI proxy will try to send shortDOI handles to them.
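Why sequential assignment makes harvesting easy can be shown with a counter rendered in a small alphabet. This is a hypothetical scheme: the 36-symbol alphabet and its ordering are assumptions, and the real shortDOI encoding may differ.

```python
import string

# Assumption: digits followed by lowercase letters. The actual shortDOI
# alphabet and ordering are not described in this post.
ALPHABET = string.digits + string.ascii_lowercase

def encode(n):
    """Render a non-negative counter value as a short code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode(code):
    """Turn a short code back into its counter value."""
    n = 0
    for ch in code:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return n

def next_codes(code, count):
    """Given one observed code, list the codes a sequential scheme issues next."""
    start = decode(code)
    return [encode(start + i) for i in range(1, count + 1)]

print(next_codes("aa9", 3))
```

A robot that has seen one code can simply count upward from it, with no need to discover links by crawling.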

The implementation choices made for shortDOI aren't mistakes- they make perfect sense in the context of the Handle naming architecture. Nonetheless, the difficulties they present for libraries and search engine optimization highlight the Handle system's misfit with the dominant link resolution architecture of the web.

The Handle system has been very well designed and managed. I've written software that uses the Handle software libraries and I found them to be elegant and easy to work with. The principles and algorithms built into the Handle system are very similar to those used years later inside Google's internal file system or by any number of other large digital object management systems.

The Handle System is relatively inexpensive, but the costs are now higher than those of the large scale URL shorteners. According to public tax returns, the DOI Foundation pays CNRI about $500,000 per year to run the DOI resolution system. At roughly 60 million resolutions per month, that works out to about 70 cents per thousand resolutions. Compare this to bit.ly, which has attracted $3.5 million of investment and has resolved about 20 billion shortened links- for a cost of under 20 cents per thousand. It remains to be seen whether bit.ly will find a sustainable business model; competing directly with DOI is not an impossibility.
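The comparison can be redone from the quoted figures. The sketch works in dollars and converts to cents per thousand only at the end, since the units are easy to misplace. Scaling CrossRef's 60 million per month up by its 96% share, and treating bit.ly's total investment as its cost to date, are simplifying assumptions.

```python
def cents_per_thousand(total_dollars, resolutions):
    """Cost of 1,000 resolutions, in cents."""
    dollars_per_resolution = total_dollars / resolutions
    return dollars_per_resolution * 100 * 1000   # dollars -> cents, per 1,000

# DOI: one year's payment over one year of traffic. CrossRef's ~60 million
# per month is about 96% of all resolutions, so scale up to the whole system.
doi_yearly_resolutions = 60e6 / 0.96 * 12
doi_cost = cents_per_thousand(500_000, doi_yearly_resolutions)

# bit.ly: total investment spread over all links resolved so far.
bitly_cost = cents_per_thousand(3_500_000, 20e9)

print(f"DOI:    {doi_cost:.1f} cents per thousand")
print(f"bit.ly: {bitly_cost:.1f} cents per thousand")
print(f"ratio:  {doi_cost / bitly_cost:.1f}x")
```

Under these assumptions the DOI system comes out near 67 cents per thousand resolutions and bit.ly near 18 cents, a gap of a little under four to one.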

What do you do with infrastructure that has been successful in Brazil or scholarly publishing but not elsewhere? Do you keep it alive in hopes that after twenty years, some unforeseen circumstance will result in its overnight universal adoption? Do you scale back, phase out or sell out in favor of more cost effective alternatives? Or do you just do your best to continue serving loyal users? I don't know the answer, but I do know that in baseball and cricket you've got to run the bases to score.

Update: Corrected CrossRef share of DOI resolutions.


14 comments:

  1. I'd love DOIs to have taken off more widely, but they are just too hard to get. As a result, they are seen as a sort of Badge Of Honour, or Stamp Of Approval, hence comments like this one, that some blog "has always been one of the most serious science blogs out there. Their posts, almost every one of them, deserves a DOI".

    I don't know whether that's what the Handle guys, or the DOI guys, or Crossref, intended; but that's how things have landed up. DOIs belong to a rarefied, elitist space.

    Because I sometimes write rather substantial articles on my blogs, and because I have on occasion had reason to cite them in scholarly papers, I've made some little effort to see what it would take to get DOIs assigned. It was a while back so I don't remember the details now, but the upshot was that, basically, it's not possible without paying a whole lot of money and jumping through a whole lot of hoops.

    How can that compete with tinyurl.com?

  2. The article is fascinating and I've enjoyed it a lot, thank you so much.

    That said, and even though I know it's a minor point used just to frame the article, the implication that 100 million users for Orkut is some kind of qualified failure just because those users are not "universal" (meaning USA/UK based, it would seem) is slightly disturbing.

  3. I hope that Manuel would not consider 60 million resolutions per month some kind of qualified failure just because those users are "academic". His reminder of our US/UK bias is appreciated, though.

    The anti-blog bias in the identisphere extends to many of the ISSN agencies. PURL should work well for blogs, however. Maybe there's a business opportunity there.

  4. I'm one of those cool URIs don't change people :^)

    That means that I don't like these technologies which in my view are (i) directly abusing the HTTP protocol, (ii) not succeeding. While crossref may help against link rot for PDF files, it does not help me to create a sustainable set of annotation anchors inside a hypertext document.

    Cool technologies require cool URIs

    Sigfrid

  5. Thanks for the pointer to Google's internal file system in comparison to Handle, which can also be seen as a kind of file system. You can also see HTTP as (an API to) a distributed file system. The basic task that has not yet been solved since the very first computer systems were built is identification of and access to digital objects. It is no coincidence that "A file structure for the complex, the changing and the indeterminate", the very first paper that coined the term "Hypertext" in 1964/65, was about an alternative file system, later to be known as Ted Nelson's Xanadu.

  6. Thanks for the post Eric. Just one thing I wanted to pick up on quickly. You say.

    "ShortDOI allows anyone to create the shortened URL, but provides none of the tracking, statistics, spam protection and malware rejection offered by other shortener services. Library OpenURL servers don't yet work with shortDOI, even though the shortDOI proxy will try to send shortDOI handles to them."

    This simply isn't true. While, yes, at some level the shortDOI is only "realised" for a regular DOI when somebody asks for it to be, it is still the case that shortDOIs can *only* be generated for existing DOIs and, unless a spammer is going to pay to register spam DOIs with some new spam-registering RA, then they can not exploit shortDOIs in the same way that normal url shorteners are often exploited.

    As for tracking statistics, these would be provided by the RA's normal DOI statistics services.

    Now, I understand why you might have come to these conclusions. The truth is, I would have rather that we had not said that shortDOIs are "generated" at all. I would have rather come out and said simply that we had just created a "short form" of any existing DOI- that it already exists and you can use it if you want to (e.g. in newspapers, tweeting or any other place where space is at a premium). This small change in the language describing shortDOIs, coupled with pointing out that DOIs and shortDOIs can *not* actually be exploited for spam, malware redirection, etc. would have made it clear that shortened DOIs are, in fact, a far better deal than you get with most URL shorteners. After all, the terms & conditions under which CrossRef DOIs can be assigned require that CrossRef members maintain these pointers. This is not something that any URL shortener service that I know of does.

    Just also wanted to point out that in a past CrossTech post we did discuss the need for truly persistent identifiers for "less formal" scholarly communication channels such as blog posts, etc. (http://tinyurl.com/2gyhej3). We've even had some discussions with Science Blogs about this idea. I would like to hear from people who are interested in this idea.

  7. The "you can trust the members of CrossRef and other RAs" argument is not what most people think of as spam and malware rejection; I agree that charging for the service is quite effective as a practical matter. However, I've not been able to find any public policy from CrossRef or other RAs that forbids putting DOI links in spam or which addresses malware insertion by their members; a pointer would be appreciated.

    I actually registered a bunch of DOIs for testing purposes a while back to see how nasty I could make them; there were no limits to what I could put in, and I made some very nasty ones.

    Just to make sure you understand the spam issue, the spammers use shortened urls for tracking and for obscurity. The shortening services try to disconnect these URLs so that ISPs don't bounce them. I can easily imagine that CrossRef DOIs could get included in a spam message (read the latest research on erectile dysfunction!); I doubt they would get disconnected, and it would be bad if they were.

  8. You say that the "you can trust the members of CrossRef and other RAs" argument "is not what most people think of as spam and malware rejection".

    But by this you seem to imply that "you can trust the users of bit.ly and other URL shorteners" *is* what most people think of as spam and malware rejection.

    Or, in short, that somehow trust in URL shorteners is more reasonable than trust in the IDF, CrossRef and other RAs. Is this right?

    Surely, if it comes down to trust, then it seems that we (IDF, CrossRef, DataCite, etc.) have something better than explicit language telling you what you can *not* assign DOIs to- we have language governing what you *can/should* assign DOIs to and we have terms & conditions which our members have signed and we can actually enforce.

    As for the technical experiment of seeing what evil you can embed in DOIs- I'm not sure what this proves. If your DOIs had been followed by third parties and we had received complaints- then we would have politely pointed you to our terms and conditions and escalated as appropriate. Ultimately, if you didn't comply, we could remove the DOIs and stop you from being a CrossRef member. The IDF could do similar if an RA went rogue.

    And as for your last scenario- I'm not sure if I understand the issue:

    a) Are you worried about people creating spam with DOIs which point to questionable material? If so- I think this would just be covered by what I've already described. The IDF governs who can become an RA and the RA governs who can become a member. Both do this through enforceable terms and conditions. In short, Sanford Wallace would not be able to use DOIs and *that* is part of their strength.

    b) Are you worried about people creating spam and including legitimate DOIs to the published literature in the links? If this is what you mean, what would you expect the IDF or an RA to do? I can't imagine anybody, including URL shortener services, is going to disconnect a redirection to a legitimate site.

  9. I think the assumption is that the people creating bit.ly links CANNOT be trusted, which results in a very different security stance with regard to bit.ly links compared to DOI links. Many high-security environments put extra scrutiny on bit.ly links, even to the point of blocking them. DOI links, on the other hand, are widely trusted and thus thinly scrutinized for security. While I'm familiar with what CrossRef does, I don't know what sort of content the other registration agencies cover or who they allow to register DOIs, and I can't find any place where the IDF asserts any no-malware requirements for DOI.

    On the spam issue, let's consider this scenario: Publisher A sells scientific articles. Like Amazon, they offer sales commissions to 3rd parties that link to the article. But unlike Amazon, they fail to put in place effective controls to prevent the 3rd parties from using spam to promote the articles. Would CrossRef be able to force the publisher to shape up their systems? If yes, then I concede the point for CrossRef Links, which can't be distinguished from OPECE Links, for what it's worth.

  10. I still challenge the notion that somehow, because the IDF does not have an explicit "no malware" clause in their RA requirements, this somehow constitutes an "anything goes" stance as to what RAs will be approved. The RA requirements for the IDF are pretty stringent (http://preview.tinyurl.com/26uacpj). It seems pretty clear that the IDF could deal with abuse of the system quickly, if a rogue RA managed to get through the vetting process.

    Having said that, I do think that an interesting challenge the IDF will face is when/if they get big and perfectly legitimate RAs that have missions that are distinctly *not* about scholarly content. What happens then to the assumption (that @spov pointed out above) that DOIs are a "Stamp Of Approval?"

    And in your last scenario, I expect that, just like Amazon, the publisher selling commissions for links would be the entity responsible for making sure that third parties were not exploiting DOI links to trick them into paying commissions. I don't see how this would be an IDF or CrossRef responsibility. But perhaps I am being thick and not fully understanding the scenario...

  11. Here's a link describing bit.ly's automated spam and malware detection. This automation is necessary for any shortening service that wants to approach bit.ly's scale; tr.im is an example of a service that failed because of failure to scale spam and malware prevention.

  12. The reason a link shortening service has to get involved in shutting off malware and spam using their links is that their ISP and other entities will cut them off if they don't. If doi.org links get used for spam and malware, and the IDF doesn't cut them off, then many entities will cut all doi.org links.

  13. This comment has been removed by the author.

  14. Hi

    As Geoffrey mentioned, the IDF vets potential RAs pretty extensively - http://tinyurl.com/26uacpj. In addition, CrossRef has well developed application procedures for potential members - http://bit.ly/dv21jI and the obligations on members are in the membership agreement (which also addresses persistence issues) http://bit.ly/avO217 (PDF) - this does define what publishers can and can't do.
