At the hospital, nudity is no big deal. Doctors and nurses see bodies all the time, including ones that look like yours, and ones that look a lot worse. You get a gown, but its coverage is more psychological than physical!
Today, Google made an unprecedented display of its book metadata private parts, but the audience was a group of metadata doctors and nurses, and believe me, they've seen MUCH worse. Kurt Groetsch, a Collections Specialist in the Google Books Project, presented details of how Google processes book metadata from libraries, publishers, and others to the Association for Library Collections and Technical Services Forum during the American Library Association's Midwinter Meeting.
The Forum, entitled "Mix and Match: Mashups of Bibliographic Data", began with a presentation from OCLC's Renée Register, who described how book metadata gets created and flows through the supply chain. Her blob diagram conveyed the complexity of data flow, and she bemoaned the fact that library data was largely walled off from publisher data by incompatible formats and cataloging practice. OCLC is working to connect these data silos.
Next came friend-of-the-blog Karen Coyle, who's been a consultant (or "bibliographic informant") to the Open Library project. She described the violent collision of library metadata with internet database programmers. Coyle's role in the project is not to provide direction, but to help the programmers decode arcane library-only syntax such as "ill. (some col)". The one instance where she tried to provide direction turned out to be something of a mistake. She insisted that, to allow proper sorting, the incoming data stream should try to keep track of the end of leading articles in title strings. So for example, "The Hobbit" should be stored as "(The )Hobbit". This proved to be very cumbersome. Eventually the team tried to figure out when alphabetical sorting was really required, and the answer turned out to be "never".
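To make the "(The )Hobbit" convention concrete, here is a minimal Python sketch of how a stored title with a marked leading article might be turned into a display string and a sort key. The function names and the regular expression are my own illustration; the Open Library code presumably looked nothing like this.

```python
import re

# Assumed storage convention (from the example above): a leading article,
# including its trailing space, is wrapped in parentheses: "(The )Hobbit".
_ARTICLE_MARK = re.compile(r"^\((?P<article>[^)]*)\)(?P<rest>.*)$")


def display_title(stored: str) -> str:
    """Return the title as a reader would see it, article included."""
    m = _ARTICLE_MARK.match(stored)
    if m is None:
        return stored
    return m.group("article") + m.group("rest")


def sort_key(stored: str) -> str:
    """Return a key that ignores the marked leading article and letter case."""
    m = _ARTICLE_MARK.match(stored)
    rest = m.group("rest") if m else stored
    return rest.casefold()


# Illustrative titles only.
titles = ["(The )Hobbit", "Emma", "(A )Tale of Two Cities", "Dracula"]
ordered = sorted(titles, key=sort_key)
print([display_title(t) for t in ordered])
```

Even in this toy form, every consumer of the data has to know about the marker, which hints at why the team found the approach cumbersome.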
Open Library does not use data records at all; instead, every piece of data is typed with a URI. This architecture aligns with W3C web standards for the semantic web, and allows much more flexible searching and data mining than would be possible with a MARC record.
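For readers who haven't met this style of data, the idea can be sketched in a few lines of Python: instead of one record per book, you keep a pile of small statements, each typed by a URI. The property URIs below are real Dublin Core and FOAF terms, but the choice of vocabulary and the example.org identifiers are assumptions of mine, not Open Library's actual schema.

```python
# Each statement is (subject, property URI, value). Nothing here is a "record";
# a book is just whatever statements share its subject URI.
DCT = "http://purl.org/dc/terms/"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

BOOK = "http://example.org/books/1"      # hypothetical identifier
AUTHOR = "http://example.org/authors/1"  # hypothetical identifier

statements = [
    (BOOK, DCT + "title", "The Hobbit"),
    (BOOK, DCT + "creator", AUTHOR),
    (AUTHOR, FOAF_NAME, "J. R. R. Tolkien"),
]


def values(subject, prop):
    """All values stated for one property of one subject."""
    return [o for s, p, o in statements if s == subject and p == prop]


def subjects_with(prop, value):
    """All subjects for which the given property has the given value."""
    return [s for s, p, o in statements if p == prop and o == value]


print(values(BOOK, DCT + "title"))
print(subjects_with(DCT + "creator", AUTHOR))
```

The flexibility comes from the second function: any typed property can be searched from either direction without deciding in advance which fields a "record" contains.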
Finally, Groetsch reported on Google's metadata processing. They have over 100 bibliographic data sources, including libraries, publishers, retailers and aggregators of reviews and jacket covers. The library data includes MARC records, anonymized circulation data and authority files. The publisher and retailer data is mostly ONIX-formatted XML data. They have amassed over 800 million bibliographic records containing over a trillion fields of data.
Incoming records are parsed into simple data structures which looked similar to Open Library's, but without the URI-ness. These structures are then transformed in various ways for Google's use. The raw metadata structures are stored in an SQL-like database for easy querying.
Groetsch then talked about the nitty-gritty details of data. For example, the listing of an author on a MARC record can only be used as an "indication" of the author's name, because MARC gives weak indications of the contributor role. ONIX is much better in this respect. Similarly, "identifiers" such as ISBN, OCLC number, LCCN, and library barcode number are used as key strings but are only identity indicators with varying strengths. One ISBN with a Chinese publisher prefix was found on records for over 24,000 different books; ISBN reuse is not at all uncommon. One librarian had mentioned to Groetsch that in her country, ISBNs are pasted onto a book to give it a greater appearance of legitimacy.
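Here is one way the "identifiers are only indicators" idea might look in code: group records by ISBN and flag any ISBN that turns up on more than one distinct title. This is a sketch of my own, not Google's method; the record layout, the title normalization and the threshold are all assumptions, and the sample records are made up.

```python
from collections import defaultdict


def normalize_title(title):
    """Crude normalization: case-insensitive, whitespace collapsed."""
    return " ".join(title.casefold().split())


def titles_per_isbn(records):
    """Map each ISBN (hyphens and spaces removed) to its distinct titles."""
    seen = defaultdict(set)
    for rec in records:
        isbn = rec["isbn"].replace("-", "").replace(" ", "")
        seen[isbn].add(normalize_title(rec["title"]))
    return seen


def suspicious_isbns(records, max_titles=1):
    """ISBNs attached to more distinct titles than max_titles allows."""
    grouped = titles_per_isbn(records)
    return sorted(isbn for isbn, titles in grouped.items() if len(titles) > max_titles)


# Made-up records with placeholder ISBNs, for illustration only.
records = [
    {"isbn": "0-00-000000-0", "title": "First Book"},
    {"isbn": "0000000000", "title": "first  book"},
    {"isbn": "1-11-111111-1", "title": "Second Book"},
    {"isbn": "1111111111", "title": "An Unrelated Third Book"},
]
print(suspicious_isbns(records))
```

A flagged ISBN would then be treated as weak evidence of identity, so two records sharing it would need agreement on other fields before being merged.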
Echoing comments from Coyle, Groetsch spoke with pride of the progress the Google Books metadata team has made in capturing series and group data. Such information is typically recorded in mushy text fields with inconsistent syntax, even in records from the same library.
The most difficult problem faced by the Google Books team is garbage data. Last year, Google came under harsh criticism for the quality of its metadata, most notably from Geoffrey Nunberg. (I wrote an article about the controversy.) The most hilarious errors came from garbage records. For example, certain ONIX records describing Gulliver's Travels carried an author description of the wrong Jonathan Swift. Most of these errors come from garbage records, and when one of these is found, almost always, the same problems can be found in other metadata sources. Google would like to find a way to get corrected records back into the library data ecosystem so that they don't have to fix them again, but there have been issues with data licensing agreements that still need to be worked out. Articles like Nunberg's have been quite helpful to the Google team. Every indication is that Google is in the metadata slog for the long term.
One questioner asked the panel what the library community should be doing to prevent "metadata trainwrecks" from happening in the future. Groetsch said without hesitation "Move away from MARC". There was nodding and murmuring in the audience (the librarian equivalent of an uproar). He elaborated that the worst parts of MARC records were the free text data, and normalization of data would be beneficial wherever possible.
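As a small illustration of what normalizing free text involves, here is a Python sketch that turns an illustration note like the "ill. (some col)" example mentioned earlier into structured fields. The output field names and the handful of variants it accepts are assumptions of mine; real catalog data has far more variation than one regular expression can handle.

```python
import re

# Accepts notes such as "ill.", "ill. (col.)", "ill. (some col)",
# "ill. (chiefly col.)". Anything else is reported as unparsed (None).
_ILLUS = re.compile(
    r"^\s*ill\.?\s*(?P<color>\(\s*(?P<extent>some|chiefly)?\s*col\.?\s*\))?\s*$",
    re.IGNORECASE,
)


def parse_illustration_note(text):
    """Return a small dict for a recognized note, or None if unrecognized."""
    m = _ILLUS.match(text)
    if m is None:
        return None
    if m.group("color") is None:
        color = "none stated"
    else:
        extent = m.group("extent")
        color = extent.lower() if extent else "all"
    return {"illustrated": True, "color": color}


for note in ["ill. (some col)", "ill.", "ill. (col.)", "maps"]:
    print(repr(note), "->", parse_illustration_note(note))
```

The important design choice is returning None for anything unrecognized instead of guessing, so that unparsed notes can be counted and reviewed.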
One of the Google engineers working on record parsing, Leonid Taycher, added that the first thing he had had to learn about MARC records was that the "Machine Readable" part of the MARC acronym was a lie. (MARC stands for MAchine Readable Cataloging.) The audience was amused.
The last question from the audience was about the future role of libraries in production of metadata. Given the resources being brought to bear on the book metadata by OCLC, Google and others, should libraries be doing cataloguing at all? Karen Coyle's answer was that libraries should concentrate their attention on the rare and unique material in their collections; without their work, these materials would continue to be almost completely invisible.