Monday, January 18, 2010

Google Exposes Book Metadata Privates at ALA Forum

At the hospital, nudity is no big deal. Doctors and nurses see bodies all the time, including ones that look like yours, and ones that look a lot worse. You get a gown, but its coverage is more psychological than physical!

Today, Google made an unprecedented display of its book metadata private parts, but the audience was a group of metadata doctors and nurses, and believe me, they've seen MUCH worse. Kurt Groetsch, a Collections Specialist in the Google Books Project, presented details of how Google processes book metadata from libraries, publishers, and others to the Association for Library Collections and Technical Services Forum during the American Library Association's Midwinter Meeting.

The Forum, entitled "Mix and Match: Mashups of Bibliographic Data", began with a presentation from OCLC's Renée Register, who described how book metadata gets created and flows through the supply chain. Her blob diagram conveyed the complexity of data flow, and she bemoaned the fact that library data was largely walled off from publisher data by incompatible formats and cataloging practice. OCLC is working to connect these data silos.

Next came friend-of-the-blog Karen Coyle, who's been a consultant (or "bibliographic informant") to the Open Library project. She described the violent collision of library metadata with internet database programmers. Coyle's role in the project is not to provide direction, but to help the programmers decode arcane library-only syntax such as "ill. (some col)". The one instance where she tried to provide direction turned out to be something of a mistake. She insisted that, to allow proper sorting, the incoming data stream should try to keep track of the end of leading articles in title strings. So for example, "The Hobbit" should be stored as "(The )Hobbit". This proved to be very cumbersome. Eventually the team tried to figure out when alphabetical sorting was really required, and the answer turned out to be "never".
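For the curious, the more usual trick is to compute a sort key that simply drops the leading article rather than marking its boundary in the stored title. Here's a minimal sketch in Python; the article list and function name are my own invention, not Open Library's code:

```python
# Build a sort key by dropping a leading English article.
# A real system would need per-language article lists ("Der", "La", "Il", ...).
LEADING_ARTICLES = {"the", "a", "an"}

def title_sort_key(title):
    words = title.split(None, 1)
    if len(words) == 2 and words[0].lower() in LEADING_ARTICLES:
        return words[1].lower()
    return title.lower()

print(title_sort_key("The Hobbit"))            # "hobbit"
print(title_sort_key("A Tale of Two Cities"))  # "tale of two cities"
```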

Open Library does not use data records at all; instead, every piece of data is typed with a URI. This architecture aligns with W3C standards for the semantic web, and it allows much more flexible searching and data mining than would be possible with a MARC record.
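To make the contrast concrete, here's a rough sketch of the same book expressed as a flat record versus individually typed statements. The property URIs and identifiers are placeholders of my own, not Open Library's actual vocabulary:

```python
# A flat, record-style description:
flat_record = {
    "title": "The Hobbit",
    "author": "Tolkien, J. R. R.",
    "isbn_13": "9780261102217",
}

# The same data as (subject, property URI, value) statements. Each datum is
# typed on its own, so it can be queried or mined without knowing the shape
# of a whole record.
statements = [
    ("/books/OL123M", "http://example.org/prop/title",   "The Hobbit"),
    ("/books/OL123M", "http://example.org/prop/author",  "/authors/OL26A"),
    ("/books/OL123M", "http://example.org/prop/isbn_13", "9780261102217"),
]

# Every title in the dataset, regardless of what else each subject carries:
titles = [v for s, p, v in statements
          if p == "http://example.org/prop/title"]
```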

Finally, Groetsch reported on Google's metadata processing. They have over 100 bibliographic data sources, including libraries, publishers, retailers, and aggregators of reviews and jacket covers. The library data includes MARC records, anonymized circulation data, and authority files. The publisher and retailer data is mostly ONIX-formatted XML. They have amassed over 800 million bibliographic records containing over a trillion fields of data.

Incoming records are parsed into simple data structures which looked similar to Open Library's, but without the URI-ness. These structures are then transformed in various ways for Google's use. The raw metadata structures are stored in an SQL-like database for easy querying.
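He didn't show the internal schema, but the idea is easy to imagine. Here's a guess at what "simple structures in an SQL-like store" might look like; the field names and queries are mine, not Google's:

```python
import sqlite3
from dataclasses import dataclass

# A guess at what a "simple data structure" for a parsed record might hold.
@dataclass
class BibRecord:
    source: str        # e.g. "library-marc", "publisher-onix"
    title: str
    contributors: str  # raw contributor statement, role not yet resolved
    identifiers: str   # e.g. "isbn:9780261102217;oclc:1234567"

# Storing the raw structures in an SQL table makes ad hoc querying easy.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE raw_metadata
                (source TEXT, title TEXT, contributors TEXT, identifiers TEXT)""")

rec = BibRecord("publisher-onix", "Gulliver's Travels",
                "Swift, Jonathan", "isbn:9780192805348")
conn.execute("INSERT INTO raw_metadata VALUES (?, ?, ?, ?)",
             (rec.source, rec.title, rec.contributors, rec.identifiers))

# e.g. "how many sources gave us a record mentioning this ISBN?"
count, = conn.execute(
    "SELECT COUNT(*) FROM raw_metadata WHERE identifiers LIKE ?",
    ("%9780192805348%",)).fetchone()
```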

Groetsch then talked about the nitty-gritty details of the data. For example, the listing of an author on a MARC record can only be used as an "indication" of the author's name, because MARC gives weak indications of the contributor role. ONIX is much better in this respect. Similarly, "identifiers" such as ISBN, OCLC number, LCCN, and library barcode number are used as key strings but are only identity indicators of varying strength. One ISBN with a Chinese publisher prefix was found on records for over 24,000 different books; ISBN reuse is not at all uncommon. One librarian had mentioned to Groetsch that in her country, ISBNs are pasted onto a book to give it a greater appearance of legitimacy.
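In other words, a shared identifier raises confidence that two records describe the same book, but it doesn't settle the question. A toy illustration of "identifiers as indicators of varying strength"; the weights are invented for the example, not Google's actual numbers:

```python
# Hypothetical weights reflecting how much a matching identifier is trusted.
IDENTIFIER_WEIGHTS = {
    "oclc": 0.9,     # rarely reused
    "lccn": 0.8,
    "isbn": 0.5,     # reused often enough that it can't stand alone
    "barcode": 0.2,  # only meaningful within one library
}

def match_strength(ids_a, ids_b):
    """Crude score for whether two records describe the same book."""
    score = 0.0
    for kind, weight in IDENTIFIER_WEIGHTS.items():
        if kind in ids_a and kind in ids_b and ids_a[kind] == ids_b[kind]:
            score += weight
    return score

# A shared ISBN alone scores low; corroborating identifiers push it up.
a = {"isbn": "7805675155", "oclc": "1234567"}
b = {"isbn": "7805675155", "oclc": "1234567"}
print(match_strength(a, b))  # 1.4
```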

Echoing comments from Coyle, Groetsch spoke with pride of the progress the Google Books metadata team has made in capturing series and group data. Such information is typically recorded in mushy text fields with inconsistent syntax, even in records from the same library.
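Getting series data out of those text fields basically means guessing at the syntax. Here's a toy normalizer in that spirit; the patterns are my own guesses at common variants, not Google's rules:

```python
import re

def parse_series(text):
    """Return (series_name, number) or (series_name, None) from a mushy statement."""
    text = text.strip().rstrip(".;")
    m = re.match(r"^(.*?)[,;]?\s*(?:bk\.?|book|no\.?|v\.?|vol\.?)\s*(\d+)$",
                 text, re.IGNORECASE)
    if m:
        return m.group(1).strip(" ,;"), int(m.group(2))
    return text, None

for raw in ("Lord of the rings ; v. 3",
            "The Lord of the Rings, Book 3",
            "Lord of the rings"):
    print(parse_series(raw))
```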

The most difficult problem faced by the Google Books team is garbage data. Last year, Google came under harsh criticism for the quality of its metadata, most notably from Geoffrey Nunberg. (I wrote an article about the controversy.) The most hilarious errors came from garbage records. For example, certain ONIX records describing Gulliver's Travels carried an author description of the wrong Jonathan Swift. When one of these garbage records is found, the same problems can almost always be found in other metadata sources. Google would like to find a way to get corrected records back into the library data ecosystem so that it doesn't have to fix them again, but there have been issues with data licensing agreements that still need to be worked out. Articles like Nunberg's have been quite helpful to the Google team. Every indication is that Google is in the metadata slog for the long term.

One questioner asked the panel what the library community should be doing to prevent "metadata trainwrecks" from happening in the future. Groetsch said without hesitation, "Move away from MARC". There was nodding and murmuring in the audience (the librarian equivalent of an uproar). He elaborated that the worst parts of MARC records were the free text data, and that normalization of data would be beneficial wherever possible.

One of the Google engineers working on record parsing, Leonid Taycher, added that the first thing he had had to learn about MARC records was that the "Machine Readable" part of the MARC acronym was a lie. (MARC stands for MAchine Readable Cataloging) The audience was amused.

The last question from the audience was about the future role of libraries in the production of metadata. Given the resources being brought to bear on book metadata by OCLC, Google, and others, should libraries be doing cataloguing at all? Karen Coyle's answer was that libraries should concentrate their attention on the rare and unique material in their collections; without their work, these materials would continue to be almost completely invisible.

8 comments:

  1. That Google is getting anonymized circulation data is an interesting development. Any indication which libraries are supplying the data? And how are the libraries and Google ensuring that the circulation data is anonymized?

  2. Uh oh... I was feeling pretty good about RDA, but then I read that paragraph about how the free text part of MARC is the worst part. So with RDA, we're going to take the best part of MARC, and kill it with "key it as you see it" instead of those terrible "ill. (some col.)" abbreviations the programmers are complaining about. Wait a minute... which is it? Do you hate the arcane, library-only syntax, or is that the best part of MARC?

  3. Bryan, we don't know anything about anonymized circ; that was the first I'd heard about it.

    Maritta, if every "ill. (some col)" work was coded with exactly the same abbreviation, it would be easy to deal with, but no such luck. Computers don't know anything; it's impossible to make them as smart as the dumbest cataloger.

  4. Good point Eric. It seems that if we can't succeed even in our own controlled environment (libraries) with standardized vocabulary and terminology, there is little chance of doing so when that environment explodes open, internationally and across metadata creators and providers.

  5. In the case of ill. and everything else that tends to appear in MARC 300$b, might the fixed field values be of help temporarily? Much of what appears in 300$b has, for better or worse, an equivalent code in the fixed fields (e.g. ill=a, maps=b, etc.). I think it should also be permissible to use standardized, coded place names in 260$a rather than free text. Catalogers might complain about the sanctity of transcription, but maybe literal transcription in many of these areas is no longer appropriate. Besides, there are lists of standardized place names used all over the web (NOAA and Amtrak sites come to mind). Why not use them in this area of a catalog record? I should also mention that OCLC has been working on preferred forms of publisher names (http://bit.ly/89L3YX). So much of the literally transcribed stuff in catalog records (e.g. content in 260 and 300 and elsewhere) could probably be transformed to coded values, which would make selection point-and-click simple. Then we could spend more time on access points (maybe retrospective correction of damage caused by the rule of three, which limits the number of name access points) and fully elaborating all the relationships within and among works.

  6. A nice summary of the same session is over at Disruptive Library Technology Jester.

  7. Don't most libraries get catalogue records from external sources anyway and do as little in-house cataloguing as possible? It makes no sense to create catalogue records from scratch if they are available elsewhere, even if you have to do a bit of local tweaking to these records once you receive them. But then again, do you really need to tweak them?

  8. Hi friends,
    I have a specific question about downloading ebooks from Google. Currently, on our site, we store the ONIX metadata in our system as XML files.
    My question is: how can I download the books whose ONIX exists in our database after purchase? I need to integrate the same functionality into our web site. How can it be done? I am waiting for your valuable answer; if you know anyone else who can answer this better, kindly forward this question to them. I am attaching a sample ONIX XML file. Please reply to:
    anes.pa@gmail.com

    Thankfully
    Anes P.A

