Thursday, September 3, 2009
A bad scientist assumes his data is pristine. A good scientist understands all the shit in her data. When I worked at Bell Labs, we listened repeatedly to Arno Penzias talk about his Nobel prize. In 1965, Arno and Robert Wilson were trying to understand the noise sources in radio signals received from communication satellites. Because they worked for the research arm of a big, rich monopoly company, they had big, expensive ultrasensitive microwave horn antennas to play with and the fanciest, most expensive electronics. But however much work they did, there was more noise in their signal than they wanted. Noise would limit the amount of data that could be transmitted, and therefore the amount of money that AT&T would be able to make from their communications satellites. At one point, Penzias and Wilson thought that the noise was coming from an unwanted "white dielectric substance" being deposited by birds that were perching on their antenna. They cleaned out the antenna, but the noise was still there. Diligent and focused work on problems related to the business of AT&T, we learned, could lead to great discoveries.
When I was a student at Princeton, we listened repeatedly to the story of how Bob Dicke and Jim Peebles tried to find the 3-degree background radiation that George Gamow had predicted would exist if the big bang theory were really true. They were unsuccessful, but when they happened to hear about the noise problem down the road at Bell Labs, they knew immediately what it really was. It was the afterglow left over from the very beginning moments of our universe, an unmistakable sign that the big bang had really occurred. Penzias and Wilson had worked so hard at understanding and eliminating the other noise in their radio receiver that they could confidently publish their findings, after being clued in by their Princeton colleagues. Penzias and Wilson eventually received the Nobel Prize in Physics that eluded Dicke and Peebles. Understanding the noise turned out to be more important than understanding the signal.
I was reminded of this story on reading the accounts of Friday afternoon's session at The Google Books Settlement and the Future of Information Access Conference. I attended in the morning, but due to family commitments I missed the entertainment in the afternoon, which was highlighted by Geoff Nunberg's report on mistakes in the metadata in Google Book Search. Reading Nunberg's blog post corresponding to the talk is very entertaining in a juvenile sort of way. The poor guy has been trying to use Google Books as a linguistic research corpus, and has discovered to his professed horror that there are all sorts of errors, many of them humorous, in its metadata.
I must now let you in on a closely held secret among library metadata technologists, one that the Google Books metadata fiasco forces out into the open: there is some crappy data in library catalogs. How much is an interesting question, and my ability to comment on how much is limited by confidentiality obligations. However, I am free to observe that studies have been published on the error rate in OpenURL linking. OpenURL linking usually depends on matching metadata between a source metadata file and a target metadata file; an error in either file can cause a linking error. Reported error rates are in excess of 1%. In his response to Nunberg's blog post, Jon Orwant points out that a one-in-a-million error occurs a million times if you have a trillion metadata items; my guess is that an error rate of one part per million may be overly optimistic by four orders of magnitude when applied to library metadata.
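The arithmetic behind Orwant's point is worth making explicit. A sketch, using only the figures quoted above (the trillion-item scale and the reported rates are the paragraph's numbers, not measurements of mine):

```python
def expected_errors(error_rate, num_items):
    """Expected count of erroneous records at a given per-item error rate."""
    return error_rate * num_items

trillion = 10 ** 12

# Orwant's hypothetical: one-in-a-million errors across a trillion items.
print(f"{expected_errors(1e-6, trillion):,.0f}")   # a million errors

# A rate four orders of magnitude worse -- the >1% reported for
# OpenURL linking -- across the same trillion items.
print(f"{expected_errors(1e-2, trillion):,.0f}")   # ten billion errors
```

Even the optimistic rate yields a million visible mistakes, which is why Nunberg had no trouble finding funny ones.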
In my post on "collecting too much data", I wrote that a huge challenge of maintaining a large metadata database is battling entropy as the collection grows. I've observed that most people trying to collect metadata go through an early period of thinking it's easy, and then gradually gain understanding of the real challenges. Google has certainly been no exception to this pattern. When they first started dealing with book metadata, they were oblivious to the difficulties of maintaining a large metadata database. As Orwant's response to Nunberg shows, they are currently in the phase of understanding the true difficulties of what they need to do. They have most certainly become attuned to the importance of keeping track of the source (provenance) of their metadata, if for no other reason than to have someone to blame for the inevitable metadata stupidities. Much of the "Linked Data" crowd has yet to digest this lesson fully.
Nunberg's thesis is that Google Books will be the "Last Library" and that it would be a disaster for society if Google does a bad job of it. He does not consider the converse possibility. What if Google manages to do a better job of it than libraries have done? If that happens, all of the library world could be turned upside down. Existing metadata maintenance cooperatives would vanish overnight and libraries around the world would become dependent on Google's metadata prowess. Google would acquire a legal metadata monopoly through technical merit rather than through class action maneuvering. What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so? Is this even conceivable?
Arno Penzias would say YES.