Thursday, September 3, 2009

White Dielectric Substance in Library Metadata

A bad scientist assumes his data is pristine. A good scientist understands all the shit in her data. When I worked at Bell Labs, we listened repeatedly to Arno Penzias talk about his Nobel prize. In 1965, Arno and Robert Wilson were trying to understand the noise sources in radio signals received from communication satellites. Because they worked for the research arm of a big, rich monopoly company, they had big, expensive ultrasensitive microwave horn antennas to play with and the fanciest, most expensive electronics. But however much work they did, there was more noise in their signal than they wanted. Noise would limit the amount of data that could be transmitted, and therefore the amount of money that AT&T would be able to make from their communications satellites. At one point, Penzias and Wilson thought that the noise was coming from an unwanted "white dielectric substance" being deposited by birds that were perching on their antenna. They cleaned out the antenna, but the noise was still there. Diligent and focused work on problems related to the business of AT&T, we learned, could lead to great discoveries.

When I was a student at Princeton, we listened repeatedly to the story of how Bob Dicke and Jim Peebles tried to find the 3 degree background radiation predicted by George Gamow if the big bang theory were really true. They were unsuccessful, but when they happened to hear about the noise problem down the road at Bell Labs, they knew immediately what it really was. It was the afterglow left over from the very beginning moments of our universe, an unmistakable sign that the Big Bang had really occurred. Penzias and Wilson had worked so hard at understanding and eliminating the other noise in their radio receiver that they could confidently publish their findings, after being clued in by their Princeton colleagues. Penzias and Wilson eventually received the Nobel Prize in Physics that eluded Dicke and Peebles. Understanding the noise turned out to be more important than understanding the signal.

I was reminded of this story on reading the accounts of Friday afternoon's session at The Google Books Settlement and the Future of Information Access Conference. I attended in the morning, but due to family commitments I missed the entertainment in the afternoon, which was highlighted by Geoff Nunberg's report on mistakes in the metadata in Google Book Search. Reading Nunberg's blog post corresponding to the talk is very entertaining in a juvenile sort of way. The poor guy has been trying to use Google Books as a linguistic research corpus, and has discovered to his professed horror that there are all sorts of errors, many of them humorous, in its metadata.

I must now let you in on a closely held secret among library metadata technologists which, due to the Google Books metadata fiasco, must now be revealed to the general public. There is some crappy data in library catalogs. How much is an interesting question, and my ability to comment on how much is limited by confidentiality obligations. However, I am free to observe that studies have been published on the error rate in OpenURL linking. OpenURL linking usually depends on matching of metadata between a source metadata file and a target metadata file; errors in either file can cause a linking error. Reported error rates are in excess of 1%. In his response to Nunberg's blog post, Jon Orwant points out that a one in a million error occurs a million times if you have a trillion metadata items; my guess is that an error rate of one part per million may be overly optimistic by four orders of magnitude when applied to library metadata.
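For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The function names are mine, made up for illustration, and the last function assumes that errors in the source and target files are independent, which nothing in the published studies guarantees.

```python
import math
from fractions import Fraction


def expected_errors(items, error_rate):
    """Expected number of bad records: item count times per-item error rate."""
    return items * error_rate


def orders_of_magnitude(rate_a, rate_b):
    """How many powers of ten separate two error rates."""
    return math.log10(rate_a / rate_b)


def combined_link_error_rate(source_rate, target_rate):
    """Chance that a link fails because either file is wrong.

    ASSUMPTION (not from any study): source and target errors are independent.
    """
    return 1 - (1 - source_rate) * (1 - target_rate)


one_in_a_million = Fraction(1, 10**6)
one_percent = Fraction(1, 100)
trillion_items = 10**12

# Orwant's point: a one in a million error, applied to a trillion items.
print(expected_errors(trillion_items, one_in_a_million))

# The gap between one part per million and one percent.
print(orders_of_magnitude(one_percent, one_in_a_million))

# Two files, each 1% wrong, matched against each other.
print(float(combined_link_error_rate(one_percent, one_percent)))
```

Exact fractions are used so that the counts come out as whole numbers rather than floating point approximations.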

In my post on "collecting too much data", I wrote that a huge challenge of maintaining a large metadata database is battling entropy as the collection grows. I've observed that most people trying to collect metadata go through an early period of thinking it's easy, and then gradually gain understanding of the real challenges. Google has certainly been no exception to this pattern. When they first started dealing with book metadata, they were oblivious to the difficulties of maintaining a large metadata database. As Orwant's response to Nunberg shows, they are currently in the phase of understanding the true difficulties of what they need to do. They have most certainly become attuned to the importance of keeping track of the source (provenance) of their metadata, if for no other reason than to have someone to blame for the inevitable metadata stupidities. Much of the "Linked Data" crowd has yet to digest this lesson fully.

Nunberg's thesis is that Google Books will be the "Last Library" and that it would be a disaster for society if Google does a bad job of it. He does not consider the converse possibility. What if Google manages to do a better job of it than libraries have done? If that happens, all of the library world could be turned upside down. Existing metadata maintenance cooperatives would vanish overnight and libraries around the world would become dependent on Google's metadata prowess. Google would acquire a legal metadata monopoly through technical merit rather than through class action maneuvering. What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so? Is this even conceivable?

Arno Penzias would say YES.


  1. I'm all for Google Books. I don't understand why people would be against it. It's not like it's going to replace every single library in existence. And if they don't do a very good job of it, you know dang well someone else will. At least eventually...

  2. I like the story about white dielectric substance (is there an AAT term for it?), but I am not seeing the analogy to Google's handling of metadata. After all, Google doesn't make its money from clean documents, it makes money from ads, and its culture seems more along the lines of 'good enough engineering' rather than that of the obsessive scientific effort exhibited by Penzias et al.

    So while it is certainly possible that Google could get it right, I hardly think it likely.

  3. Excellent post, Eric. And thanks for the link to the Bloomberg and Hutchins work -- I'll cite it. You might be interested in the work I am doing comparing OpenURL quality across vendors: -Adam

  4. Nice post Eric. I see the same opportunity for GOOG and libraries to innovate new tools for managing this unique view on the bibliographic universe. I'm also grateful to Nunberg for sparking the conversation.

  5. GeneG,

    I'm never afraid to profess ignorance: what is AAT?

    From what I know of Google, they're obsessed about measuring and improving user experience, just as AT&T was obsessed about making sure the calls were completed. Google will pay attention to metadata to the extent that it affects user experience, and no more. How much that really is is an open question.

  6. During 15 years at Bell Labs, I certainly heard Arno explain many times how he mistook the origin of the universe for some stray bird shit. But I'm completely baffled about the analogy to this case -- when Google tells us that Sigmund Freud wrote a book on the Mosaic web browser, this is actually a signal of fundamental cosmic importance?

    One cynical take on the situation, suggested at the GBS conference by Bob Glushko, is that bad metadata is actually a key part of Google's business model, because G's strength is a combination of weak search methods applied to large amounts of data, obviating the need for any classification other than what's implicit in the content itself. Is that what you mean?

    I hope that this isn't true, since on Bob's analysis, Google would have a positive interest in failing to fix (or even in screwing up) the Google Print metadata.

    The metadata (who wrote what when, at a minimum) is not very important to someone who is just looking for a quote or some information about a monument or whatever. But for those who want to track language and culture over time (among others), it's a necessity.

  7. myl,

    Remember, Google is just relaying an assertion by someone else that Sigmund Freud was a web browser expert. If Google can figure out how to tell that such a report is not to be trusted (i.e., eliminate the noise), then it will have learned something much more useful than the fact that Sigmund Freud predated the internet. Figuring out which assertions are to be trusted is essentially the same as being able to detect spam, and I don't think there is any doubt that spam detection is core to all of Google's businesses.

  8. Eric,

    AAT is the Art and Architecture Thesaurus, maintained by the Getty Research Institute. It turns out that it does have an entry for the aforementioned substance, although it is certainly inadequate as metadata because it does not characterize the animal that produced the material. :-)