Wednesday, April 7, 2010

The Library IS the Machine

When librarians catalog a book, they do their best to describe a thing they have in their hands. The profession has been cataloging for a long time, and it tends to think that it's reduced the process to a science. When library catalogs became digital in the 1970's, the descriptions moved off of paper cards and into structured database records using a data format called MARC. That stands for MAchine Readable Cataloging, and as one Google engineer recently complained, "the MAchine Readable part of the name is a lie". The problem that Google's machines are having with these records is that the descriptions have always been meant for humans to read, not for computers to parse and understand.

Cataloging librarians are not stupid, and they've been working since the very beginning of digital cataloging to make their descriptions more useful to computers. They've introduced "name authority files" to bring uniformity to things like subject headings and author and publisher names. Unicode has brought uniformity to the encoding of non-roman characters and diacritics. XML has replaced some of the ancient delimiters and message length encoding. And perhaps most importantly, for a long time they've been embedding identifiers in the catalog records. Despite all this, library catalog records are still not as computer-friendly as they should be.

The move towards identifiers is worth special note. The use of identifiers in libraries dates to the first industrialization of libraries that took place in the 19th century. The classification systems of Melvil Dewey, Charles Ammi Cutter and the Library of Congress were all efforts to make library catalogs more friendly to machines. Except the machines weren't digital computers; the machines were the libraries themselves. From the shelves to the circulation slips, libraries were giant, human-powered information storage and retrieval machines. The classification codes are sophisticated identifier systems upon which the entire access system was based. So maybe MARC isn't a lie after all!

The rest of the world took a while to catch up on the use of identifiers. The US began issuing social security numbers in 1936, but it wasn't until the 60's, with the adoption of ISBN in 1966 and ISSN in 1971, that the entire publishing industry began to use identifiers to more efficiently manage their sales, delivery and tracking of products.

The same properties that made identifiers useful in physical libraries make them essential for digital databases. Identifiers serve as keys that allow records in one table to be precisely sorted and matched against records in other tables. Well-designed identifier systems provide assurances of uniqueness: there may be many people with the same name as me, but I'm the only one with my social security number.
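To make the "identifiers as keys" idea concrete, here is a minimal sketch in Python. The two tables, their field names, and the titles are invented for illustration; only the pattern matters: index one table by the identifier, then match the other against it.

```python
# Two hypothetical tables that share an ISBN column (all data is made up).
catalog = [
    {"isbn": "9780306406157", "title": "Sample Title A"},
    {"isbn": "9781861972712", "title": "Sample Title B"},
]
holdings = [
    {"isbn": "9781861972712", "branch": "Main", "copies": 2},
    {"isbn": "9780306406157", "branch": "East", "copies": 1},
    {"isbn": "9780000000002", "branch": "Main", "copies": 1},
]

def join_on_isbn(catalog, holdings):
    """Match holdings to catalog records using the ISBN as the key."""
    by_isbn = {rec["isbn"]: rec for rec in catalog}  # index one table by key
    joined, orphans = [], []
    for holding in holdings:
        rec = by_isbn.get(holding["isbn"])
        if rec is None:
            orphans.append(holding)  # identifier with no catalog record
        else:
            joined.append({**rec, **holding})
    return joined, orphans

joined, orphans = join_on_isbn(catalog, holdings)
```

The match needs no guessing about spelling or punctuation in titles, which is exactly what a description-only match would have to contend with. The third holding ends up in `orphans`, a reminder that a key is only as useful as the tables that agree to use it.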

Nowadays, it sometimes seems that almost any problem in the information industries is being solved by the introduction of a new identifier. Building on the success of ISBN and ISSN, there are efforts to identify works (ISTC), authors (ORCID, ISNI), musical notations (ISMN), organizations (SAN), recordings (ISRC), audio-visual works (ISAN), trade items (UPC) and many other entities of interest. We live in an age of identifiers.

The apotheosis of identifiers has been achieved in the Linked Data movement. The first rule of Linked Data is to give everything (subjects, objects, and properties) its own URI (Uniform Resource Identifier). By putting EVERYTHING in one global space of identifiers, it is expected that myriad types of knowledge and information can be made available in uniform and efficient ways over the internet, to be reused, recombined, and reimagined.

What's often glossed over during the adoption of identifiers is their fundamental pragmatism. The association between any identifier and the real-world object it purports to identify is a thinly veneered but extremely useful social fiction which doesn't approach mathematical perfection. Even very good identifier systems can fail as much as 1% of the time, and automated systems that fail to recognize and accommodate the possibility of identifier failure exhibit brittleness and become subject to failure themselves. Still, 99% of perfect works perfectly fine for a lot of things.
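One way a system might accommodate identifier failure, rather than trusting every match blindly, is to cross-check an identifier hit against the description that came with it. The following Python sketch is only an illustration under assumed conditions: the record layout, the `match_record` and `titles_agree` names, and the 0.8 similarity threshold are all hypothetical choices, not anyone's production method.

```python
from difflib import SequenceMatcher

def titles_agree(a, b, threshold=0.8):
    """Rough similarity check between two titles (threshold is an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def match_record(incoming, catalog):
    """Look up by ISBN, then sanity-check the hit against the title.

    Returns a (status, record) pair so callers can route doubtful
    matches to a person instead of silently accepting them.
    """
    candidate = catalog.get(incoming.get("isbn"))
    if candidate is None:
        return ("unmatched", None)
    if titles_agree(incoming["title"], candidate["title"]):
        return ("matched", candidate)
    # The identifier matched but the description did not: flag it.
    return ("needs_review", candidate)

# Hypothetical catalog keyed by ISBN (data is made up).
sample_catalog = {"9780306406157": {"title": "Sample Title A"}}
```

The point of the three-way result is that the 1% of bad identifiers gets a place to go, instead of either crashing the pipeline or quietly polluting it.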

A decade ago, the world of libraries and the publishers that supply them embarked on an effort to link together the citations in journal articles and the bibliographic databases essential to libraries with the cited articles in e-journals and full-text databases. Two complementary paths were pursued. One effort, OpenURL, sent bibliographic descriptions inside hyperlinks, and relied on intelligent agents in libraries to provide users with institution-specific and relevant links. The other, CrossRef, built identifiers for journal articles into a link redirection system. Together, OpenURL and CrossRef built on the strengths of the description and identification approaches and do a reasonably good job serving a wide range of users, including those in libraries.
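For readers who haven't seen the two approaches side by side, here is a rough Python sketch. It builds a description-carrying link using OpenURL 1.0 key/encoded-value pairs, and an identifier-based link through the DOI resolver. The resolver address and citation values are invented, the function names are mine, and `10.1000/182` is used purely as an example DOI; the use of `https://doi.org/` as the redirection service is also an assumption of this sketch.

```python
from urllib.parse import urlencode, quote

def build_openurl(resolver_base, citation, doi=None):
    """Description approach: pack the citation into the link itself."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    for key in ("atitle", "jtitle", "volume", "spage", "date"):
        if citation.get(key):
            params.append((f"rft.{key}", citation[key]))
    if doi:
        # An identifier can ride along with the description.
        params.append(("rft_id", f"info:doi/{doi}"))
    return f"{resolver_base}?{urlencode(params)}"

def build_doi_link(doi):
    """Identification approach: one identifier, resolved by redirection."""
    return f"https://doi.org/{quote(doi, safe='/')}"

# Hypothetical library resolver and made-up citation.
openurl = build_openurl(
    "https://resolver.example.edu/openurl",
    {"atitle": "A Sample Article", "jtitle": "Journal of Examples",
     "volume": "12", "spage": "34", "date": "2009"},
    doi="10.1000/182",
)
doi_link = build_doi_link("10.1000/182")
```

The first link is long and verbose, but a library's resolver can decide where to send its own users; the second is short and precise, but always goes wherever the registry says.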

Now, however, the slow but sure development of semantic web technologies and deployment of Linked Data has spurred both CrossRef's Geoff Bilder and the OCLC's Jeff Young (OCLC runs the OpenURL Maintenance Agency) to examine whether CrossRef and OpenURL need to make changes to take advantage of wider efforts. In another post, I'll look at this question more closely, but for now, I'd like to comment on what we've learned in the process of building article linking systems for libraries.

1. Successful linking requires both identification and description. The use of CrossRef by itself did not have the flexibility that libraries needed; CrossRef addressed this by making its bibliographic descriptions available to OpenURL systems. Similarly, OpenURL's ability to embed CrossRef identifiers (DOIs) inside hyperlinks has made OpenURL linking much more accurate and effective.

2. Successful linking is as much about knowing which links to hide as about link discovery. Link discovery and link computation turn out not to be so hard. Keeping track of what is and isn't available to a user is much harder.

3. Bad data is everywhere. If a publisher asks authors for citations, 10% of the submitted citations will be wrong. If a librarian is given a book to catalog, 10% of the records produced will start out with some sort of transcription error. If a publisher or library is asked to submit metadata to a repository, 10% of the submitted data will have errors. It's only by imposing the discipline of checking, validating and correcting data at every stage that the system manages to perform acceptably.
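As a small illustration of the checking discipline described in the third point, here is a Python sketch that validates ISBN-13 check digits before records are accepted. The check-digit arithmetic is the standard ISBN-13 rule; the record layout and the `partition_by_validity` helper are hypothetical.

```python
def isbn13_is_valid(raw):
    """Return True if raw is a 13-digit ISBN with a correct check digit."""
    digits = raw.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... across all 13 digits;
    # a valid ISBN-13 gives a total divisible by 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

def partition_by_validity(records):
    """Split records into those that pass the check and those needing correction."""
    good, bad = [], []
    for rec in records:
        (good if isbn13_is_valid(rec.get("isbn", "")) else bad).append(rec)
    return good, bad

# Made-up submissions: the second has two adjacent digits transposed.
submitted = [
    {"isbn": "978-0-306-40615-7", "title": "Sample Title A"},
    {"isbn": "9780306406517", "title": "Sample Title A (mistyped)"},
]
good, bad = partition_by_validity(submitted)
```

A check digit catches many typing slips but says nothing about whether the ISBN belongs to the book in hand, so this is one stage of checking, not a substitute for the others.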

Linking real world objects together doesn't happen by magic. It's a lot of work, and no amount of RDF, SPARQL, or URI fairy dust can change that. The magic of people and institutions working together, especially when facilitated by appropriate semantic technologies, can make things easier.



  1. And not just *easier* but also more transparent. The OpenURL ecosystem seems to hide the links between things in knowledgebases, while OpenURL resolvers are essentially blackboxes that sprinkle some magic fairy dust over your Context Object to figure out where you ought to go.

    Linked Data makes some of that machinery more explicitly available on the web for clients to use however they wish...not just to support someone's idea of a business model.

  2. inkdroid- From the perspective of a decade ago, the OpenURL ecosystem has used knowledgebases to expose the links between things. At the beginning, some information providers threatened the knowledgebase developers with lawsuits, but were eventually persuaded that the exposure in linking knowledgebases was to their benefit.

    You are right that a decade hence the perspective will be very different!

  3. Eric: Interesting, I never knew about that chapter in OpenURL's history. Who threatened who with lawsuits back then? How did you persuade the publishers that OpenURL knowledgebases were to their benefit?

    I see your point about knowledgebases actually opening up the links between things. I guess all I'm saying is it would've been nice if that was more part of the picture back then. But maybe it was?

  4. inkdroid- the names are best left private.

    The CUFTS knowledgebase is the most direct of the heirs to the legacy of "JAKE", which was very much a part of the picture back then.