Monday, July 6, 2009

CrossRef, OpenURL and more Linked Data Heresy

After CrossRef was started nine years ago, I quipped that it was nothing short of miraculous, since it was the first time in recorded history that so many publishers had gotten together and agreed on something that they would have to pay for. I'm sure that was an exaggeration, but my point was that CrossRef was not really about linking technology; rather, it was about the establishment of a business process around linking technology. The choice of technology itself was to some extent irrelevant.

Last week, in a comment on my post about AdaptiveBlue and OpenURL, Owen Stephens raised some interesting questions surrounding OpenURL, DOI (Digital Object Identifier), and Linked Data. It's useful to think of each of these as a social practice surrounding a linking technology; I'll describe each of them in turn.

DOI is often thought of as synonymous with CrossRef, which is incorrect. DOI is a link indirection technology used by the CrossRef organization. There are some DOIs that are not CrossRef DOIs, but most of the DOIs you are likely to come across will be CrossRef DOIs. CrossRef provides registration, matching and lookup services in addition to the DOI redirection service, and from here on, I'll be talking about CrossRef DOIs only. The core mission of CrossRef is the transformation of journal article citations into clickable URLs. CrossRef has registered about 35 million DOIs, most of them for journal articles. In the registration process, CrossRef collects identifying metadata for the journal articles, which it then uses to power its matching and lookup services. The matching service is currently making about 15 million matches per month.

CrossRef is far from being perfect, but its achievements have been considerable. Most scholarly journal publishers have integrated the CrossRef registration and matching process into their production workflows. The result is that many thousands of electronic journals today are being linked to from many thousands of other electronic journals, databases, search engines, even blogs.

In contrast to CrossRef, which focuses on publishers and publisher workflow integration, OpenURL is a linking technology and practice that has focused on helping libraries manage links to and from the electronic resources available to their patrons. OpenURL is complementary to CrossRef: OpenURL linking agents usually make use of CrossRef services to accomplish their mission of helping users select the appropriate resources for a given link. Libraries frequently need to deal with problems associated with multiple resolution: a given article might be available at ten or even a hundred different URLs, only one of which might work for a given library patron.

Finally, Linked Data is an emerging practice which enables diverse data sets to be published, consumed and then linked with other data sets and relinked into a global web of connections. It would be interesting to find out how many matches are being made in the Linked Data web to compare with CrossRef, but because the matching is decentralized, it's not really possible to know. While CrossRef and OpenURL focus on connecting citing articles and abstracts with the cited articles, Linked Data attempts to support any type of logical link.

Obviously there is overlap between Linked Data and the more established linking practices. Can (and should) Linked Data applications reuse the CrossRef and/or OpenURL URIs? Let's first consider OpenURL. OpenURL is really a mechanism for packaging metadata for a citation (jargon: ContextObject) into a URI. So the "thing" that an OpenURL URI identifies is the set of services about the citation available from a particular resolver agent. That's not usually the thing that you want to talk about in a Linked Data application.
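To make the "packaging" idea concrete, here is a minimal sketch in Python of how a citation might be packed into an OpenURL. The resolver address and the function name are made up for illustration (every library runs its own resolver), and only a handful of the key/encoded-value fields from the OpenURL 1.0 journal format are shown.

```python
from urllib.parse import urlencode

# Hypothetical resolver address, for illustration only.
RESOLVER_BASE = "http://resolver.example.edu/openurl"


def build_openurl(doi, journal_title, year, base=RESOLVER_BASE):
    """Pack a few citation fields into an OpenURL 1.0 KEV query string."""
    context_object = [
        ("url_ver", "Z39.88-2004"),                        # OpenURL 1.0
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),   # journal metadata format
        ("rft.genre", "article"),
        ("rft.jtitle", journal_title),
        ("rft.date", str(year)),
        ("rft_id", "info:doi/" + doi),                     # referent identifier
    ]
    # urlencode percent-encodes the values so they are safe inside a query.
    return base + "?" + urlencode(context_object)


url = build_openurl(
    "10.1144/0016-76492006-123",
    "Journal of the Geological Society",
    2007,
)
print(url)
```

Note that the resulting URI starts with the resolver's own address, so it names "what this resolver can offer for this citation" rather than the article itself.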

What about CrossRef DOIs? There are two different URIs that you can make with a DOI. There's the http URL that gets redirected to full text (you hope) by the DOI gateway: http://dx.doi.org/10.1144/0016-76492006-123 . There's also the "info-uri" form of the DOI, info:doi/10.1144/0016-76492006-123 , which you can't click on. It's clear what the latter URI identifies: it's a 2007 article in the Journal of the Geological Society. Many libraries run resolver agents that can turn that URI into clickable service links. I'm not sure what the former URI identifies. What the URI gets you to is a web page with links to two different instantiations of the article identified by the info-uri. Apparently it doesn't identify the same article in its other instantiations on the internet. So the most correct URI to use, if you want to make Linked Data assertions about the article, is (in my humble but correct opinion) the info-uri.
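Mechanically, the two forms are just different prefixes on the same DOI string. A rough Python sketch of converting between them follows; the function names are my own invention, and it assumes the DOI contains no characters that need percent-encoding.

```python
# Prefixes under which the same DOI commonly appears.
DOI_PREFIXES = ("info:doi/", "doi:", "http://dx.doi.org/")


def bare_doi(identifier):
    """Strip a known prefix, leaving just the DOI (e.g. 10.1144/...)."""
    for prefix in DOI_PREFIXES:
        if identifier.startswith(prefix):
            return identifier[len(prefix):]
    return identifier  # assume it was already a bare DOI


def to_info_uri(identifier):
    """The non-clickable info-uri form."""
    return "info:doi/" + bare_doi(identifier)


def to_http_url(identifier):
    """The clickable form that the DOI gateway redirects."""
    return "http://dx.doi.org/" + bare_doi(identifier)


print(to_info_uri("http://dx.doi.org/10.1144/0016-76492006-123"))
print(to_http_url("info:doi/10.1144/0016-76492006-123"))
```

The conversion is trivial; the open question is which of the two strings you should put in a Linked Data assertion.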

There's one little problem.

The second of Tim Berners-Lee's "Four Rules" for Linked Data is "Use HTTP URIs so that people can look up those names." But CrossRef, a stable, self-sustaining organization which has made huge strides moving the world of journal publishing to a more open, more usable, more linked environment, provides look-up APIs that return high quality XML metadata so that you can look up the names that it defines. It has a solid record of accomplishing exactly the things that Linked Data is trying to do, albeit with broader scope, but undeniably with significant impact. The identifier that CrossRef is using is the DOI, and the URI form of DOI is NOT an HTTP URI.

Maybe Tim BL's second rule is wrong, too!


8 comments:

  1. A simple fix would be to create an http: form of the info-uri. But ...

    On a deeper level, there are two missing mechanisms in the semantic web compared with the DOI/OpenURL world:
    (1) a directory that connects an identifier for a schema expressed as semantic web data with the repositories that purport to conform to the schema and
    (2) a means of characterizing the repositories so that it becomes possible to select one or more to use.

  2. The info uri scheme is rather ugly, and doesn't seem to be widely used. Why not just use the prefix "doi:", which has wide adoption (certainly publishers routinely use it, see the example doi:10.1144/0016-76492006-123 you gave)?

    OpenURL is great, but its use of ghastly specs such as info uri doesn't help its wider adoption.

    DOIs don't, by themselves, play nice with Linked Data, but they can be made to. For example, http://bioguid.info/doi:10.1144/0016-76492006-123 supports 303 redirects a la Linked Data.

  3. Eric,

    Perversely, I have recently been making an argument that, on the face of it, seems to contradict yours in regards to the preferred URI form for DOIs (http-url vs info-uri). Please note that this is *my* current take, and not CrossRef's or the IDF's.

    But first- a little bit on points where I think we agree...

    First, your distinction between DOIs generally and "CrossRef DOIs" specifically is important and it will become more important as we get more DOI registration agencies offering different services. I think we have made a marketing mistake in conflating Crossref and DOIs and one of our immediate priorities will be to try and reverse that.

    Second, I think that the fundamental concept of link-indirection behind Handle, DOI, PURLs, etc. is critical to the success of linked data. Broken links are already a massive annoyance when browsing the web and the problem will become many orders of magnitude worse if we keep using inherently fragile plain-old-URLs (POUs?) as the basis for linked data. Providing a level of link-indirection is one step toward helping to ensure the persistence of links, but the technology behind the indirection is an almost trivial issue when compared to the finance and governance issues involved in setting up an organization responsible for sustaining and maintaining said link indirection infrastructure. Persistence is a social issue, not a technology issue, etc.

    But this gets us to where I think we disagree.

    You say that "it's clear" what "info:doi/10.1144/0016-76492006-123" identifies, and that libraries can use their resolvers to turn it into something which you can click on. Conversely, you say that it *isn't* clear what "http://dx.doi.org/10.1144/0016-76492006-123" identifies. While I understand your point from a theoretical perspective, this seems to verge on architecture-astronautism** and ignores two critical points:

    a) Most web users will intuitively be able to figure out what they can do with http://dx.doi.org/10.1144/0016-76492006-123, whereas the same cannot be said of info:doi/10.1144/0016-76492006-123. I note, for instance, that my editor has automatically turned the former into a link and has ignored the latter.
    b) A library resolver could as easily turn http://dx.doi.org/10.1144/0016-76492006-12 into several alternative clickable services as they could with the info-uri encoding. In fact, most already do because they have DOI links go straight back to their OpenURL resolvers.

    I think the argument comes down to:

    a) Do we want to have generations of web application developers constantly writing extra code to convert identifiers into http-URIs (assuming they even know that they can/should do this)?
    b) Or possibly have a future generation of application developers on some new, non-web platform strip out some junk DNA (junk DNS?) from identifiers that had once been encoded as http-URIs for legacy web applications?

    I think that option "a" will prove an unnecessary barrier to the recognition and uptake of persistent linking systems. Most users do not know what to do with the string "doi:10.1007/BF01829975" when they see it on a web page or printed in New Scientist. This is a shame and unlikely to change. Similarly, most linked data applications will not know what to do with "doi:10.1007/BF01829975" or, worse yet "info:doi/10.1007/BF01829975". I fear that in insisting on this rarified distinction between "an identifier" and "a reference", we are putting people off using persistent linking systems. This would be unforgivable.

    I also think we have ample evidence that new technologies can easily subsume earlier technologies that conflated identifiers and locators. This is essentially what web browsers do when they are given an ftp: or gopher: uri. Why do we think that future, non-web-based platforms won't be able to do essentially the same thing?

    So just who is the heresiarch here?

    --G
    -----------
    **http://www.joelonsoftware.com/articles/fog0000000018.html

  4. "I'm not sure what the former URI identifies. What the URI gets you to is a web page with links to two different instantiations of the article identified by the info-uri."

    I think you may be confusing identification and resolution. The http URI identifies the same item as the info URI.

    Andy Powell has a clarifying (for me, at least) post on this topic here.

  5. Oh dear! As a retired serials librarian of three years, talk of OpenURL and DOI is positively wallowing in nostalgia. Sorry, guys, to lower the tone here!!

  6. Rod, (and Geoffrey)

    "doi:" would be 5 characters prettier than "info:doi/", wouldn't it? Perhaps someone could recount the history of trying to get doi: recognized as a URI. Not a story about beauty, I'd say.

    In my admiration of CrossRef, the social construct, I've definitely glossed over the deficiencies of CrossRef/DOI, the technical infrastructure. Geoffrey's comment is indicative of the fact that the CrossRef infrastructure continues to develop. Over time, that development has moved in a direction I've been consistently pleased with, though more slowly than I'd like. Right now I would be pleased to see movement towards Linked Data.

    Eric

  7. lbjay,

    How, exactly, is a software agent supposed to know what http://dx.doi.org/10.1144/0016-76492006-123 identifies, unless it's been pre-coded to know about DOI? Linked Data rule #2 says that I'm supposed to look at what a URI resolves to in order to determine what the URI identifies, so that's what I did. Or maybe the idea is that a human with an understanding of identity vs. resolution (I think that would include me) is supposed to look at a resolution and decide what is being identified. Andy's promised to write more about this; looking forward to hearing what he has to say. Note that there are no 303 redirects being issued here.

    Eric

  8. Briezhlady (and any other librarians visiting!),

    I try to keep the tone here firmly planted on the ground; any help is greatly appreciated.

    Eric
