Sunday, April 18, 2010

When Shall We Link?

When I was in grad school, my housemates and I would sit around the dinner table and have endless debates about obscure facts like "there's no such thing as brown light". That doesn't happen so much in my current life. Instead, my family makes fun of me for "whipping out my iPhone" to retrieve some obscure fact from Wikipedia and settle a discussion about a questionable claim. This phenomenon of having access to huge amounts of information has also changed the imperatives of education: students no longer need to learn "just in case"; instead, they need to learn how to get information "just in time".

In thinking about how to bring semantic technologies to bear on OpenURL and reference linking, it occurred to me that "just in time" and "just in case" are useful concepts for thinking about linking technologies. Semantic technologies in general, and Linked Data in particular, seem to have focused on just-in-case, identifier-oriented linking. Library linking systems based on OpenURL, in contrast, have focused on just-in-time, description-oriented linking. Of course, this distinction is an oversimplification, but let me explain a bit what I mean.

Let's first step back and take a look at how links are made. Links are directional; they have a start and an end (a target). The start of a link always has an intention or purpose; the target is the completion of that purpose. For example, look at the link I have put on the words "grad school" above. My intention there was to let you, the reader, know something about my graduate school career, without needing to insert that digressive information into the narrative. (Actually my purpose was to illustrate the previous sentence, but let's call that a meta-purpose.) My choice of URL was "http://ee.stanford.edu/", but I might have chosen a very different URL. When I choose a specific URL, I "bind" that URL to my intention.

In the second paragraph, I have added a link for "OpenURL". In that case, I used the "Zemanta" plug-in to help me. Zemanta scans the text of my article for words and concepts that it has links for, and offers them to me as choices to apply to my article. Zemanta has done the work of finding links for a huge number of words and concepts, just in case a user comes along with a linking intention to match. In this case, the link suggested by Zemanta matches my intention (to provide background for readers unfamiliar with OpenURL). The URL becomes bound to the word during the article posting process.

At the end of this article, there's a list of related articles, along with a link that says "more fresh articles". I don't know what URLs Zemanta will supply when you click on it, but it's an example of a just-in-time link. A computer scientist would call this "late binding". My intention is abstract: I want you to be able to find articles like this one.

Similar facilities are in operation in scholarly publishing, but the processes have a lot more moving parts.

Consider the citation list of a scientific publication. The links expressed by these lists are expressions of the author's intent: perhaps to support an assertion in the article, to acknowledge previous work, or to provide clarification or background. The cited item is described by metadata formatted so that humans can read and understand the description and go to a library to find the item. Here's an example:
D. C. Tsui, H. L. Störmer and A. C. Gossard, Phys. Rev. Lett. 48, 1559 (1982).
With the movement of articles on-line, citations are typically turned into links during the publication process by parsing each citation into a computer-readable description. If the publisher is a member of CrossRef, the description can then be matched against CrossRef's huge database of article descriptions. If a match is found, the cited item's description is bound to an article identifier, the DOI. For my example article, the DOI is 10.1103/PhysRevLett.48.1559. The DOI provides a layer of indirection that's not found in Zemanta linking. While CrossRef binds the citation to an identifier, the identifier link, http://dx.doi.org/10.1103/PhysRevLett.48.1559, is not bound to the target URL, http://prl.aps.org/abstract/PRL/v48/i22/p1559_1, until the user clicks the link. This scheme holds out hope that should the article move to a different URL, the connection to the citation can be maintained and the link will still work.
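To make the two stages of binding concrete, here's a toy sketch in Python. The lookup tables are stand-ins for CrossRef's citation database and the DOI resolver; real citation matching is far fuzzier than an exact tuple lookup.

```python
crossref = {
    # parsed citation description -> DOI, bound at publication time
    ("Phys. Rev. Lett.", "48", "1559", "1982"): "10.1103/PhysRevLett.48.1559",
}

doi_targets = {
    # DOI -> current article URL, consulted only when the link is clicked
    "10.1103/PhysRevLett.48.1559":
        "http://prl.aps.org/abstract/PRL/v48/i22/p1559_1",
}

def bind_citation(journal, volume, spage, year):
    """Early binding: match a parsed citation against the CrossRef data."""
    doi = crossref.get((journal, volume, spage, year))
    return "http://dx.doi.org/" + doi if doi else None

def resolve(doi_link):
    """Late binding: the DOI proxy forwards to wherever the article lives now."""
    return doi_targets[doi_link[len("http://dx.doi.org/"):]]

link = bind_citation("Phys. Rev. Lett.", "48", "1559", "1982")
target = resolve(link)
```

If the article moves, only the `doi_targets` side changes; every published citation keeps working, which is exactly the indirection the DOI buys.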

If the user is associated with a library using an OpenURL link server, another type of match can be made. OpenURL link servers use knowledgebases which describe the set of electronic resources made available by the library. When the user clicks on an OpenURL link, the description contained in the link is matched against the knowledgebase, and the user is sent to the best-matching library resource. It's only at the very last moment that the intent of the link is bound to a target.
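An OpenURL link really is just a description squeezed into a URL's query string. Here's a sketch of building an OpenURL 1.0 (KEV format) link for the example citation; the resolver address is a made-up placeholder for whatever link server a library actually runs.

```python
from urllib.parse import urlencode

# Hypothetical link-server address; each library configures its own.
RESOLVER = "http://resolver.example.edu/openurl"

description = {
    "url_ver": "Z39.88-2004",                       # OpenURL 1.0
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # journal-article metadata format
    "rft.jtitle": "Physical Review Letters",
    "rft.volume": "48",
    "rft.spage": "1559",
    "rft.date": "1982",
    "rft.aulast": "Tsui",
}

# The whole cited-item description travels inside the URL itself.
openurl = RESOLVER + "?" + urlencode(description)
```

Note that nothing in this URL names a target; the link server's knowledgebase decides, at click time, where this description should lead.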

While the combination of OpenURL and CrossRef has made it possible to link citations to their intended target articles in libraries with good success, there has been little leveraging of this success outside the domain of scholarly articles and books. The NISO standardization process for OpenURL spent a great deal of time in making the framework extensible, but the extension mechanisms have not seen the use that was hoped for.

The level of abstraction of NISO OpenURL is often cited as a reason it has not been adopted outside its original application domain. It should also be clear that many applications that might have used OpenURL have instead turned to Semantic Web and Linked Data technologies (Zemanta is an example of a linking application built with semantic technologies.) If OpenURL and CrossRef could be made friendly to these technologies, the investments made in these systems might also find application in more general circumstances.

I began looking at the possibilities for OpenURL Linked Data last summer, when, at the Semantic Technologies 2009 conference, Google engineers expressed great interest in consuming OpenURL data exposed via RDFa in HTML, which had just been finalized as a W3C Recommendation. I excitedly began to work out what was needed. (Tony Hammond, another member of the NISO standardization committee, had taken a crack at the same thing.)

My interest flagged, however, as I began to understand the nagging difficulties of mapping OpenURL into an RDF model. OpenURL mapped into RDF was... ugly. I imagined trying to advocate the use of OpenURL-RDF over BIBO, an ontology for bibliographic data developed by Bruce D'Arcus and Frédérick Giasson, and decided it would not be fun; there's nothing terribly wrong with BIBO.

One of the nagging difficulties was that OpenURL-RDF required the use of "blank nodes", because of its philosophy of transporting descriptions of items which might not have URIs to identify them. When I recently described this difficulty to the OpenURL Listserv, Herbert van de Sompel, the "irresistible force" behind OpenURL a decade ago, responded with very interesting notes about "thing-described-by.org", how it resembled "by-reference" OpenURL, and how this could be used in a Linked Data-friendly link resolver. Thing-described-by.org is a little service that makes it easy to mint a URI, attach an RDF description to it, and make it available for harvest as Linked Data.
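To illustrate the blank-node problem, here's a minimal sketch with triples modeled as plain tuples rather than a real RDF library. The minted URI is an invented example in the style of thing-described-by.org, not a real identifier.

```python
# A blank node has only a graph-local label ("_:b0"); nothing outside
# the graph that contains these triples can link to the item it describes.
blank_node_description = [
    ("_:b0", "dcterms:bibliographicCitation",
     "Phys. Rev. Lett. 48, 1559 (1982)"),
    ("_:b0", "dcterms:date", "1982"),
]

# Minting a URI (a made-up example address here) gives the very same
# description a globally addressable subject that other Linked Data
# can point at and harvest.
minted_uri = "http://thing-described-by.org/?example123"
uri_description = [(minted_uri, p, o) for (_s, p, o) in blank_node_description]
```

The description itself is unchanged; what the minted URI adds is an address, which is precisely what blank nodes deny to the rest of the Linked Data web.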

In the broadest picture, linking is a process of matching the intent of a link with a target. To accomplish that, we can't get around the fact that we're matching one description with another. A link resolver needs to accomplish this match in less than a second using a description squeezed into a URL, so it must rely on heuristics, pre-matched identifiers, and restricted content domains. If link descriptions were pre-published as Linked Data as in thing-described-by.org, linking providers would have time to increase accuracy by consulting more types of information and provide broader coverage. By avoiding the necessity of converting and squeezing the description into a URL, link publishers could conceivably reduce costs while providing for richer links. Let's call it "Linked Description Data".

Descriptions of targets could also be published as Linked Description Data. Target knowledgebase development and maintenance is a significant expense for link server vendors. However, target publishers have come to understand the importance (see KBART) of providing more timely, accurate and granular target descriptions. If they ever start to view the knowledgebase vendors as bottlenecks, the Linked Description Data approach may prove appealing.

Computers don't learn "just-in-time" or "just-in-case" the way humans do. But the matching at the core of making links can be an expensive process, taking time proportional to the square of the number of items (N²). Identifiers make the process vastly more efficient (N log N). This expense can be front-loaded (just-in-case) or saved till the last moment (just-in-time), but opening up the descriptions being matched for "when-there's-time" processing could result in dramatic advances in linking systems as a whole.
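As a toy illustration of that cost difference (a sketch, not any production resolver's algorithm):

```python
def match_by_comparison(left, right, same):
    """Brute-force matching: compare every description on one side with
    every description on the other -- quadratic in the number of items."""
    return [(a, b) for a in left for b in right if same(a, b)]

def match_by_identifier(left, right, key):
    """Identifier-based matching: reduce each description to a key first,
    then match by lookup -- N log N with a sort-and-merge, near-linear
    with the hashing used here."""
    index = {key(b): b for b in right}
    return [(a, index[key(a)]) for a in left if key(a) in index]

citations = [{"doi": "10.1103/PhysRevLett.48.1559"}]
holdings = [{"doi": "10.1103/PhysRevLett.48.1559",
             "url": "http://prl.aps.org/abstract/PRL/v48/i22/p1559_1"}]

slow = match_by_comparison(citations, holdings,
                           lambda a, b: a["doi"] == b["doi"])
fast = match_by_identifier(citations, holdings, lambda d: d["doi"])
assert slow == fast  # same matches, very different cost at scale
```

Either way the work has to happen sometime; the question in the text is simply when, and who gets to see the descriptions while it does.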

2 comments:

  1. I've heard that claim about education a lot, but as a former teacher, I just don't buy it.

    It was, for example, not enough for my students to know where to look up the endings to the first declension. They needed to have internalized those endings, and their meanings, in order to read texts. (And I don't mean just "read texts" fluently; I mean "read texts at all": if some of the baseline steps were too slow, they'd lose the forest for the trees.) I think there are similar problems of automaticity in all disciplines.

    And you actually have to have information in your head to be able to be creative.

    I think there are aspects of education that are changed by the widespread availability of information. I think information literacy becomes hugely more important; there are types of questions it is now possible to ask one's students; and there are types of questions it is no longer useful to ask. There are skills it is much harder to inculcate or assess (that are not, therefore, less important). But the ability to look things up fluently is not the same as being an educated person, or even a broadly able one, no matter how many things are out there to be looked up. It is merely a subset.

  2. Duck- I should have written that "just in case learning is less important" rather than "students no longer need to learn 'just in case'".
