
Thursday, December 14, 2023

The Revenge of the Cataloguers

Over the past 15 years or so, libraries around the world have de-emphasized cataloguing. Budgetary pressures and technological efficiencies have contributed to the decline, but so has the fact that full-text search and relevance ranking, as practiced by Google and others, has proved far more popular with the vast majority of users. On the open internet, subject classifications have turned out to be useless in an environment rife with keyword spam and other search engine optimization techniques.

In the past year, the emergence of artificial intelligence (AI) built on large language models, with their surprising ability to summarize and classify texts, has people speculating that AI will put most cataloguers out of work in the not-so-distant future.

I think that's not even wrong. But Roy Tennant will turn out to be almost right. MARC, the premier tool of cataloguers around the world, will live forever... as a million weights in a generative pre-trained transformer. Let me explain...

The success or failure of modern AI depends on the construction of large statistical models with billions or even trillions of parameters. These models are built from training data. The old adage about computers, "garbage in, garbage out", is truer than ever. The models are really good at imitating their training data; so good that they can surprise their own architects! Thus the growing need for good training data, and the increasing value of rich data sources.

Filings in recent lawsuits confirm the value of this training data. Getty Images is suing Stability AI for the use of Getty Images' material in AI training sets. The suit is not just about the use of the images, which are copyrighted, but also about the use of trademarks and the detailed descriptions that accompany the images. Read paragraph 57 of the complaint:

Getty Images’ websites include both the images and corresponding detailed titles and captions and other metadata. Upon information and belief, the pairings of detailed text and images has been critical to successfully training the Stable Diffusion model to deliver relevant output in response to text prompts. If, for example, Stability AI ingested an image of a beach that was labeled “forest” and used that image-text pairing to train the model, the model would learn inaccurate information and be far less effective at generating desirable outputs in response to text prompts by Stability AI’s customers. Furthermore, in training the Stable Diffusion model, Stability AI has benefitted from Getty Images’ image-text pairs that are not only accurate, but detailed. For example, if Stability AI ingested a picture of Lake Oroville in California during a severe drought with a corresponding caption limited to just the word “lake,” it would learn that the image is of a lake, but not which lake or that the photograph was taken during a severe drought. If a Stable Diffusion user then entered a prompt for “California’s Lake Oroville during a severe drought” the output image might still be one of a lake, but it would be much less likely to be an image of Lake Oroville during a severe drought because the synthesis engine would not have the same level of control that allows it to deliver detailed and specific images in response to text prompts.

If you're reading this blog, you're probably thinking to yourself "THAT'S METADATA!"

Let's not forget the trademark part of the complaint:


In many cases, and as discussed further below, the output delivered by Stability AI includes a modified version of a Getty Images watermark, underscoring the clear link between the copyrighted images that Stability AI copied without permission and the output its model delivers. In the following example, the image on the left is another original, watermarked image copied by Stability AI and used to train its model and the watermarked image on the right is output delivered using the model:


If you're reading this blog, you're probably thinking to yourself "THAT'S PROVENANCE!"

So clearly, the kinds of data that libraries and archives have been producing for many years will still have value, but we need to start thinking about how the practice of cataloguing and similar activities will need to change in response to the new technologies. Existing library data will get repurposed as training data to create efficiencies in library workflows. Organizations with large, well-managed data collections will extract windfalls, deserved or not.

If the utility of metadata work is shifting from feeding databases to training AI models, how does this affect the product of that work? Here's how I see it:


  • Tighter coupling of metadata and content. Today's discovery systems are all about decoupling data from content - we talk about creating metadata surrogates for discovery of content. Surrogates are useless for AI training; a description of a cat is useless for training without an accompanying picture of the cat (see the sketch after this list). This means that the existing decoupling of metadata work from content production is doomed. You might think that copyright considerations will drive metadata production into the hands of existing content producers, but more likely organizations that focus on production of integrated training data will emerge to license content and support the necessary metadata production.
  • Tighter collaboration of machines and humans. Optical character recognition (OCR) is a good example of highly focused and evolved machine learning that can still be improved by human editors. The practice of database-focused cataloguing will be made more productive as cataloguers become editors of machine generated structured data. (As if they're not already doing that!)

  • Softer categorization. Discovery databases demand hard classifications. Fiction. Science. Textbooks. LC Subject Headings. AIs are much better at nuance, so the training data needs to include a lot more context. You can have a romantic novel about chemists and their textbooks, and an AI will be just fine with that, so long as you have enough description and context for the machine to assign lots of weights to many topic clusters.

  • Emphasis on novelty. New concepts and things appear constantly; an AI will extrapolate unpredictably until it gets on-topic training data. AI-OCR might recognize a new emoji, but it might not.
  • Emphasis on provenance. Reality is expensive, which is why I think for-profit organizations will have difficulty in the business of providing training data while Wikipedia will continue to succeed because it requires citations. Already the internet is awash in AI produced content that sounds real, but is just automated BS. Training data will get branded.
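To make the first bullet concrete: here's a purely illustrative sketch of a tightly coupled training record, with the content, a detailed caption, soft weighted topics, and provenance packaged as one unit. The field names and values are my invention, not any real training format.

```python
# A hypothetical training record (field names invented for illustration):
# the content travels with its description, its context, and its provenance,
# instead of a surrogate record pointing at content it doesn't include.
training_record = {
    "content": b"<image bytes would be bundled here>",
    "caption": "Lake Oroville, California, during a severe drought, "
               "with the exposed lakebed visible below the waterline",
    "topics": {          # soft categorization: many weighted clusters, not one heading
        "drought": 0.9,
        "reservoirs": 0.8,
        "California": 0.6,
        "climate": 0.5,
    },
    "provenance": {      # who described it, from what source, and when
        "describer": "human cataloguer",
        "source": "licensed photo archive",
        "described": "2023-12-01",
    },
}
```

The particular fields don't matter; what matters is that the description, the context, and the content itself can't drift apart.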

What gets me really excited, though, is thinking about how a library of the future will interact with content. I expect users will interact with the library using a pre-trained language model, rather than via databases. Content will get added to the model using packages of statistical vectors, compiled by human-expert-assisted content processors. These human experts won't be called "cataloguers" any longer but rather "meaning advisors". Or maybe "biblio-epistemologists". The revenge of the cataloguers will be that, because of the great responsibilities and breadth of expertise required, biblio-epistemologists will command salaries well exceeding those of the managers and programmers who will just take orders from well-trained AIs. Of course there will still be MARC records, generated by a special historical vector package guaranteed to only occasionally hallucinate.

Note: I started thinking about this after hearing a great talk (starting at about 30:00) by Michelle Wu at the Charleston Conference in November. (Kyle Courtney's talk was good, too).

Sunday, July 31, 2016

Entitled: The art of naming without further elaboration or qualification.

As I begin data herding for our project Mapping the Free Ebook Supply Chain, I've been thinking a lot about titles and subtitles, and it got me wondering: what effect do subtitles have on usage, and should the open-access status of a book affect its naming strategy? We are awash in click-bait titles for articles on the web; should ebook titles be clickbaity, too? To what extent should ebook titles be search-engine optimized, and which search engines should they be optimized for?

Here are some examples of titles that I've looked at recently, along with my non-specialist's reactions:
Title: Bigger than You: Big Data and Obesity
Subtitle: An Inquiry toward Decelerationist Aesthetics
The title is really excellent; it gives a flavor of what the book's about and piques my interest because I'm curious what obesity and big data might have to do with each other. The subtitle is a huge turn-off. It screams "you will hate this book unless you already know about decelerationist aesthetics" (and I don't).
(from Punctum Books)



Title: Web Writing
Subtitle: Why and How for Liberal Arts Teaching and Learning
The title is blah and I'm not sure whether the book consists of web writing or is something about how to write for or about the web. The subtitle at least clues me in to the genre, but fails to excite me. It also suggests to me that the people who came up with the name might not be experts in writing coherent, informative and effective titles for the web.
From University of Michigan Press



Title: DOOM
Subtitle: SCARYDARKFAST

If I saw the title alone I would probably mistake it for something it's not. An apocalyptic novel, perhaps. And why is it all caps? The subtitle is very cool though, I'd click to see what it means.
From Digital Culture Books




It's important to understand how title metadata gets used in the real world. Because the title and subtitle get transported in different metadata fields, using a subtitle cedes some control over title presentation to the websites that display it. For example, Unglue.it's data model has a single title field, so if we get both title and subtitle in a metadata feed, we squash them together in the title field. Unless we don't. Because some of our incoming feeds don't include the subtitle. Different websites do different things. Amazon uses the full title but some sites omit the subtitle until you get to the detail page. So you should have a good reason to use a subtitle as opposed to just putting the words from the subtitle in the title field. DOOM: SCARYDARKFAST is a much better title than DOOM. (The DOOM in the book turns out to be the game DOOM, which I would have guessed from the all-caps if I had ever played DOOM.) And you can't depend on sites preserving your capitalization; Amazon presents several versions of DOOM: SCARYDARKFAST.
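As a minimal sketch of the kind of normalization this forces on aggregators (the function here is my own illustration, not Unglue.it's actual code), a site with a single title field might fold in a subtitle like this:

```python
def display_title(title, subtitle=None):
    """Squash a title and optional subtitle into a single title field,
    avoiding a doubled subtitle when the feed already includes it."""
    title = title.strip()
    if not subtitle:
        return title
    subtitle = subtitle.strip()
    if subtitle.lower() in title.lower():
        return title          # the feed already squashed them together
    return f"{title}: {subtitle}"

print(display_title("DOOM", "SCARYDARKFAST"))                  # DOOM: SCARYDARKFAST
print(display_title("DOOM: SCARYDARKFAST", "SCARYDARKFAST"))   # DOOM: SCARYDARKFAST
```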

Another thing to think about is the "marketing funnel". This is the idea that in order to make a sale or to have an impact, your product has to pass through a sequence of hurdles, each step yielding a market that's a fraction of the previous step's. So for ebooks, you have to first get them selected into channels, each of which might be a website. Then a fraction of users searching those websites might see your ebook's title (or cover), for example in a search result. Then a fraction of those users might decide to click on the title, to see a detail page, at which point there had better be an abstract or the potential reader becomes a non-reader.

Having reached a detail page, some fraction of potential readers (or purchase agents) will be enticed to buy or download the ebook. Any "friction" in this process is to be avoided. If you're just trying to sell the ebook, you're done. But if you're interested in impact, you're still not done, because even if a potential reader has downloaded the ebook, there's no impact until the ebook gets used. The title and cover continue to be important because the user is often saving the ebook for later use. If the ebook doesn't open to something interesting and useful, a free ebook will often be discarded or put aside.
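The numbers below are invented, but they show how quickly the funnel's fractions compound into a very small readership:

```python
# Hypothetical conversion rates at each step of the funnel for one ebook.
funnel = {
    "selected into a channel": 0.5,
    "title seen in search results": 0.10,
    "title clicked": 0.20,
    "downloaded from the detail page": 0.30,
    "actually opened and read": 0.50,
}

reach = 1.0
for step, rate in funnel.items():
    reach *= rate
    print(f"{step}: {reach:.4f} of the potential audience remains")
# Five reasonable-looking steps leave only 0.0015 (0.15%) of the audience.
```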

Bigger than You's strong title should get it the clicks, but the subtitle doesn't help much at any step of the marketing funnel. "Aesthetics" might help it in searches; it's possible that even the book's author has never ever entered "Decelerationist" as a search term. The book's abstract, not the subtitle, needs to do the heavy lifting of driving purchases or downloads.

The first sentence of "Web Writing" suggest to me that a better title might have been:
"Rebooting how we think about the Internet in higher education
But check back in a couple months. Once we start looking at the data on usage, we might find that what I've written here is completely wrong, and the Web Writing was the best title of them all!

Notes:
1. The title of this blog post is the creation of Adrian Short, who seems to have left Twitter.







Thursday, October 22, 2015

This is NOT a Portrait of Mary Astell

Not Mary Astell, by Sir Joshua Reynolds
Ten years ago, the University of Calgary Press published a very fine book by Christine Mason Sutherland called The Eloquence of Mary Astell, which focused on the proto-feminist's contributions as a rhetorician. The cover for the book featured a compelling image using a painted sketch from 1760-1765 by the master English portraitist Sir Joshua Reynolds, currently in Vienna's Kunsthistorisches Museum and known as Bildnisstudie einer jungen Dame (Study for the portrait of a young woman).

Cover images from books circulate widely on the internet. They are featured in online bookstores, they get picked up by search engines. Inevitably, they get re-used and separated from their context. Today (2015) "teh Internetz" firmly believe that the cover image is a portrait of Mary Astell.

For example:

If you look carefully, you'll see that the image most frequently used is the book cover with the title inexpertly removed.

But the painting doesn't depict Mary Astell. It was done 30 years after her death. In her book, Sutherland notes (page xii):
No portrait of her remains, but such evidence as we have suggests that she was not particularly attractive. Lady Mary Wortley Montagu’s granddaughter records her as having been “in outward form [...] rather ill-favoured and forbidding,” though Astell was long past her youth when this observation was made

Wikipedia has successfully resisted the misattribution.

A contributing factor for the confusion about Mary Astell's image is the book's failure to attribute the cover art. Typically a cover description is included in the front matter of the book. According to the Director of the University of Calgary Press, Brian Scrivener, proper attribution would certainly be done in a book produced today. Publishers now recognize that metadata is increasingly the cement that makes books part of the digital environment. Small presses often struggle to bring their back lists up to date, and publishers both large and small have "metadata debt" from past oversights, mergers, reorganizations and lack of resources.

Managing cover art and permissions for included graphics is often an expensive headache for digital books, particularly for Open Access works. I've previously written about the importance of clear licensing statements and front matter in ebooks. It's unfortunate when public domain art is not recognized as such, as in Eloquence, but nobody's perfect.

The good news is that University of Calgary Press has embraced Open Access ebooks in a big way. The Eloquence of Mary Astell and 64 other books are already available, making Calgary one of the world's leading publishers of Open Access ebooks. Twelve more are in the works.

You can find Eloquence at the University of Calgary Press website (including the print edition), Unglue.it, DOAB, and the Internet Archive. Mary Astell's 1706 pamphlet Reflections Upon Marriage can be found at the Internet Archive and at the University of Pennsylvania's Celebration of Women Writers.

And maybe in 2025, teh internetz will know all about Sir Joshua Reynolds's famous painting, Not Mary Astell. Happy Open Access Week!

Wednesday, October 29, 2014

GITenberg: Modern Maintenance Infrastructure for Our Literary Heritage

One day back in March, the Project Gutenberg website thought I was a robot and stopped letting me download ebooks. Frustrated, I resolved to put some Project Gutenberg ebooks into GitHub, where I could let other people fix problems in the files. I decided to call this effort "Project Gitenhub". On my second or third book, I found that Seth Woodworth had had the same idea a year earlier, and had already moved about a thousand ebooks into GitHub. That project was named "GITenberg". So I joined his email list and started submitting pull requests for PG ebooks that I was improving.

Recently, we've joined forces to submit a proposal to the Knight Foundation's News Challenge, whose theme is "How might we leverage libraries as a platform to build more knowledgeable communities?". Here are some excerpts:
Abstract 
Project Gutenberg (PG) offers 45,000 public domain ebooks, yet few libraries use this collection to serve their communities. Text quality varies greatly, metadata is all over the map, and it's difficult for users to contribute improvements. 
We propose to use workflow and software tools developed and proven for open source software development- GitHub- to open up the PG corpus to maintenance and use by libraries and librarians. 
The result- GITenberg- will include MARC records, covers, OPDS feeds and ebook files to facilitate library use. Version-controlled fork and merge workflow, combined with a change-triggered back-end build environment, will allow scalable, distributed maintenance of the greatest works of our literary heritage.
Description 
Libraries need metadata records in MARC format, but in addition they need to be able to select from the corpus those works which are most relevant to their communities. They need covers to integrate the records with their catalogs, and they need a level of quality assurance so as not to disappoint patrons. Because this sort of metadata is not readily available, most libraries do not include PG records in their catalogs, resulting in unnecessary disappointment when, for example, a patron wants to read Moby Dick from the library on their Kindle.
Progress 
43,000 books and their metadata have been moved to the git version control system; this will enable librarians to collaboratively edit and control the metadata. The GITenberg website, mailing list and software repository have been launched at https://gitenberg.github.io/ . Software for generating MARC records and OPDS feeds has already been written.
Background 
Modern software development teams use version control, continuous integration, and workflow management systems to coordinate their work. When applied to open-source software, these tools allow diverse teams from around the world to collaboratively maintain even the most sprawling projects. Anyone wanting to fix a bug or make a change first forks the software repository, makes the change, and then makes a "pull request". A best practice is to submit the pull request with a test case verifying the bug fix. A developer charged with maintaining the repository can then review the pull request and accept or reject the change. Often, there is discussion asking for clarification. Occasionally versions remain forked and diverge from each other. GitHub has become the most popular site for this type of software repository because of its well-developed workflow tools and integration hooks.
The leaders of this team recognized the possibility of using GitHub for the maintenance of ebooks, and we began the process of migrating the most important corpus of public domain ebooks, Project Gutenberg, onto GitHub, thus the name GITenberg. Project Gutenberg has grown over the years to 50,000 ebooks, audiobooks, and related media, including all the most important public domain works of English language literature. Despite the great value of this collection, few libraries have made good use of this resource to serve their communities. There are a number of reasons why. The quality of the ebooks and the metadata around the ebooks is quite varied. MARC records, which libraries use to feed their catalog systems, are available for only a subset of the PG collection. Cover images and other catalog enrichment assets are not part of PG.
To make the entire PG corpus available via local libraries, massive collaboration among librarians and ebook developers is essential. We propose to build integration tools around GitHub that will enable this sort of collaboration to occur.
  1. Although the PG corpus has been loaded into GITenberg, we need to build a backend that automatically converts the version-controlled source text into well-structured ebooks. We expect to define a flavor of MarkDown or Asciidoc which will enable this automatic, change-triggered building of ebook files (EPUB, MOBI, PDF). (MarkDown is a human-readable plain text format used on GitHub for documentation; MarkDown for ebooks is being developed independently by several teams of developers. Asciidoc is a similar format that works nicely for ebooks.) 
  2. Similarly, we will need to build a parallel backend server that will produce MARC and XML formatted records from version-controlled plain-text metadata files.
  3. We will generate covers for the ebooks using a tool recently developed by NYPL and include them in the repository.
  4. We will build a selection tool to help libraries select the records best suited to their libraries.
  5. Using a set of "cleaned up" MARC records from NYPL, and adding custom cataloguing, we will seed the metadata collection with ~1000 high quality metadata records.
  6. We will provide a browsable OPDS feed for use in tablet and smartphone ebook readers.
  7. We expect that the toolchain we develop will be reusable for creation and maintenance of a new generation of freely licensed ebooks.
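To make items 1 and 2 of the list above a bit more concrete, here's a minimal sketch, using only Python's standard library, of turning a plain-text metadata record into an OPDS acquisition entry. It's my own illustration of the idea, not GITenberg's actual toolchain, and the metadata field names are invented.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DC = "http://purl.org/dc/terms/"
ET.register_namespace("", ATOM)
ET.register_namespace("dcterms", DC)

# Version-controlled, plain-text metadata for one Project Gutenberg text.
# The field names are my own; a real repository might keep this in YAML.
metadata = {
    "id": "urn:gutenberg:2701",
    "title": "Moby Dick; Or, The Whale",
    "author": "Herman Melville",
    "language": "en",
    "epub_url": "https://www.gutenberg.org/ebooks/2701.epub.images",
}

def opds_entry(md):
    """Build one Atom entry suitable for an OPDS acquisition feed."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = md["id"]
    ET.SubElement(entry, f"{{{ATOM}}}title").text = md["title"]
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2014-10-29T00:00:00Z"  # placeholder timestamp
    author = ET.SubElement(entry, f"{{{ATOM}}}author")
    ET.SubElement(author, f"{{{ATOM}}}name").text = md["author"]
    ET.SubElement(entry, f"{{{DC}}}language").text = md["language"]
    ET.SubElement(entry, f"{{{ATOM}}}link", {
        "rel": "http://opds-spec.org/acquisition/open-access",
        "type": "application/epub+zip",
        "href": md["epub_url"],
    })
    return entry

print(ET.tostring(opds_entry(metadata), encoding="unicode"))
```

The same plain-text record could just as easily feed a MARC generator; the point is that a change to the metadata file can trigger a rebuild of every downstream format.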

The rest of the proposal is on the Knight News Challenge website. If you like the idea of GITenberg, you can "applaud" it there. The "applause" is not used in the judging of the proposals, but it makes us feel good. There are lots of other interesting and inspiring proposals to check out and applaud, so go take a look!

Wednesday, June 19, 2013

Book Metadata Under a Bushel

Full story at the Verge
They don't allow witnesses, spectators or journalists to carry cell phones or kindles or iPads into the Federal Courthouse in New York. But books are OK. So every publishing executive at the iBookStore antitrust trial carries a book with them instead. For example, The Verge spotted Penguin's David Shanks sporting Robert B. Parker's Wonderland. The press takes a picture, and the next day the book, which just so happens to be an exciting new release, gets its cover onto the front page of the business section, not to mention Go To Hellman.

This opportunistic book publicity reminded me of the biblical parable:
No man, when he hath lighted a candle, putteth it in a secret place, neither under a bushel, but on a candlestick, that they which come in may see the light. Nor doth a scroll seller speak its name so no man canst hear. Nay, he shouteth from high mountain tops the holy numbers of the scroll.
- Luke 11:33 (more or less).
So you would think that book publishers would also be spreading metadata for their books far and wide, and would make it as easy as possible for developers to propagate the word. But the tyranny of "the way we've always done things" still holds sway in that world. And so, the HarperCollins OpenBook API and the BookSmash developer competition, which I ranted about in my last post, need to be understood as the positive steps they are. They are opportunities for publishers and developers to engage in ways that aren't chiseled in stone.

For my part, I've been engaging with some very helpful people at HarperCollins. Together, we found some documentation issues that had me unsure about the resources being offered to challenge participants.

First of all, the entire text of the 196 books listed in the resources spreadsheet is being made available. This is very cool. Also, 20% samples of all EPUB books in the HarperCollins catalog are available through the standard API.

Hints:
  • If you're participating in the challenge, you need to use a different endpoint than the one offered by the API demo tool to get un-truncated text. Yes, you copy the url it gives you (host name "diner") and replace the endpoint url with one reported in the text on the demo tool (host name "api").
  • If you want to use the catalog API to get ISBNs to use in the content API, note that only books/ISBNs with Sub_Format='EPUB' have preview content associated with them.
  • The API does request throttling in a funny way. If you make too many requests in a short period of time, the API tells you "Developer Inactive". That result seems to get stuck in a server-side cache.
  • The HC people seem eager to improve the API, so don't hesitate to report issues in their forums. If you've ever developed an API, you know that you have to whack at it a bit to get things right.
If you play with this API a bit, it'll be pretty obvious to you that "building an API" is not the way things have always been done in the book industry. Here's how things are done: Publishers cause ONIX XML files that describe their books to come into existence. These files are shipped to "trading partners". The reason, more or less, that the publishers do this is because way back when, Amazon forced them to do it that way instead of the horrible old ways they used to do things.
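For readers who haven't seen ONIX, here's a drastically simplified sketch of what one of those product records looks like and how a trading partner might pull the basics out of it. A real ONIX 2.1 message carries a header and dozens of composites per product; this trimmed example (and its made-up ISBN) is mine, not HarperCollins'.

```python
import xml.etree.ElementTree as ET

# A drastically simplified, illustrative ONIX 2.1-style product record.
# Real feeds carry many more composites (prices, rights, subjects, territories...).
onix = """
<ONIXMessage>
  <Product>
    <RecordReference>example.com.9780061234567</RecordReference>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>        <!-- 15 = ISBN-13 -->
      <IDValue>9780061234567</IDValue>         <!-- made-up ISBN -->
    </ProductIdentifier>
    <Title>
      <TitleType>01</TitleType>                <!-- 01 = distinctive title -->
      <TitleText>Wonderland</TitleText>
    </Title>
    <Contributor>
      <ContributorRole>A01</ContributorRole>   <!-- A01 = author -->
      <PersonName>Robert B. Parker</PersonName>
    </Contributor>
  </Product>
</ONIXMessage>
"""

product = ET.fromstring(onix).find("Product")
isbn = product.findtext("ProductIdentifier/IDValue")
title = product.findtext("Title/TitleText")
author = product.findtext("Contributor/PersonName")
print(isbn, title, author)   # 9780061234567 Wonderland Robert B. Parker
```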

So the reason that the HarperCollins API, and others like it, are significant, is not because they'll be useful in their current form. It's because big publishers have realized that getting bossed around by Amazon might not be the smartest thing to do, and maybe having more direct relationships with developers would be a good idea.

Sunday, July 31, 2011

Library Data Beyond the Like Button

"Aren't you supposed to be working on your new business? That ungluing ebooks thing? Instead you keep writing about library data, whatever that is. What's going on?"

No, really, it all fits together in the end. But to explain, I need to take you beyond the "Like Button".

Earlier this month, I attended a lecture at the New York Public Library. The topic was Linked Open Data, and the speaker was Jon Voss, who's been applying this technology to historical maps. It was striking to see how many people from many institutions turned out, and how enthusiastically Jon's talk was received. The interest in Linked Data was similarly high at the American Library Association Meeting in New Orleans, where my session (presented with Ross Singer of Talis) was only one of several Linked Data sessions that packed meeting rooms and forced attendees to listen from hallways.

I think it's important to convert this level of interest into action. The question is, what can be done now to get closer to the vision of ubiquitous interoperable data? My last three posts have explored what libraries might do to better position their presence in search engines and in social networks using schema.org vocabulary and Open Graph Protocol. In these applications, library data enables users to do very specific things on the web- find a library page in a search engine or "Like" a library page on Facebook. But there's so much more that could be done with the data.

I think that library data should be handled as if it was made of gold, not of diamond.

Perhaps the most amazing property of gold is its malleability. Gold can be pounded into a sheet so thin that it's transparent to light. An ounce of gold can be made into leaf that will cover 25 square meters.

There is a natural tendency to treat library data as a gem that needs skillful cutting and polishing. The resulting jewel will be so valuable that users will beat down library websites to get at the gems. Yeah.

The reality is that library data is much more valuable as a thin layer that covers huge swaths of material. When data is spread thinly, it has a better chance of connecting with data from other libraries and with other sorts of institutions: museums, archives, businesses, and communities. By contrast, deep data, the sort that focuses on a specific problem space, is unlikely to cross domains or applications without a lot of custom programming and data tweaking.

Here's the example that's driven my interest in opening up library linked data: At Gluejar, we're building a website that will ask people to go beyond "liking" books. We believe that books are so important to people that they will want to give them to the world; to do that we'll need to raise money. If lots of people join together around a book, it will be easy to raise the money we need, just as public radio stations find enough supporters to make the radio free to everyone.

We don't want our website to be a book discovery website, or a social network of readers, or a library catalog; other sites do that just fine. What we need is for users to click "support this book" buttons on all sorts of websites, including library catalogs. And our software needs to pull just a bit of data off of a webpage to allow us to figure out which book the user wants to support. It doesn't sound so difficult. But we can only support two or three different interfaces to that data. If library websites all put a little more structured data in their HTML, we could do some amazing things. But they don't, and we have to settle for "sort of works most of the time".
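Here's a minimal sketch of the kind of extraction I mean; it's my own illustration, not Gluejar's production code. If a catalog page exposes its ISBN as structured data, identifying the book takes a few lines; if it doesn't, you're back to scraping page-specific HTML.

```python
from html.parser import HTMLParser

class ISBNFinder(HTMLParser):
    """Pull an ISBN out of a page that exposes it as structured data,
    either as an Open Graph <meta property="og:isbn"> tag or as a
    microdata itemprop="isbn" attribute."""
    def __init__(self):
        super().__init__()
        self.isbn = None
        self._in_isbn_prop = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:isbn":
            self.isbn = attrs.get("content")
        elif attrs.get("itemprop") == "isbn":
            self._in_isbn_prop = True

    def handle_data(self, data):
        if self._in_isbn_prop and data.strip():
            self.isbn = data.strip()
            self._in_isbn_prop = False

page = '<html><head><meta property="og:isbn" content="9780340930762"/></head></html>'
finder = ISBNFinder()
finder.feed(page)
print(finder.isbn)   # 9780340930762
```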

Real books get used in all sorts of ways. People annotate them, they suggest them to friends, they give them away, they quote them, and they cite them. People make "TBR" piles next to their beds. Sometimes, they even read and remember them as long as they live. The ability to do these same things on the web would be pure gold.

Wednesday, July 27, 2011

Liking Library Data

If you had told me ten years ago that teenagers would be spending free time "curating their social graphs", I would have looked at you kinda funny. Of course, ten years ago, they were learning about metadata from Pokemon cards, so maybe I should have seen it coming.

Social networking websites have made us all aware of the value of modeling aspects of our daily lives in graph databases, even if we don't realize that's what we're doing. Since the "semantic web" is predicated on the idea that ALL knowledge can be usefully represented as a giant, global graph, it's perhaps not so surprising that the most familiar, and most widely implemented application of semantic web technologies has been Facebook's "Like" button.

When you click a Like button, an arc is added to Facebook's representation of your social graph. The arc links a node that represents you and another node that represents the thing you liked. As you interact with your social graph via Facebook, the added Like arc may introduce new interactions.

Google must think this is really important. They want you to start clicking "+1" buttons, which presumably will help them deliver better search. (You can try following me+, but I'm not sure what I'll do with it.)

The technology that Facebook has favored for building new objects to put in the social graph is derived from RDFa, which adds structured data into ordinary web pages. It's quite similar to "microdata", a competing technology that was recently endorsed by Google, Microsoft, and Yahoo. Facebook's vocabulary for the things it's interested in is called Open Graph Protocol (OGP), which could be considered a competitor for Schema.org.

My previous post described how a library might use microdata to help users of search engines find things in the library. While I think that eventually this will be a necessity for every library offering digital services, there are a bunch of caveats that limit the short-term utility of doing so. Some of these were neatly described in a post by Ed Chamberlain:
  • the library website needs to implement a site-map that search engine crawlers can use to find all the items in the library's catalog
  • the library's catalog needs to be efficient enough to not be burdened by the crawlers. Many library catalog systems are disgracefully inefficient.
  • the library's catalog needs to support persistent URLs. (Most systems do this, but it was only ten years ago that I caused Harvard's catalog to crash by trying to get it to persist links. Sorry.)
But the clincher is that web search engines are still suspicious of metadata. Spammers are constantly trying to deceive search engines. So search engines have white-lists, and unless your website is on the white-list, the search engines won't trust your structured metadata. The data might be of great use to a specialized crawler designed to aggregate metadata from libraries, but there's a chicken and egg problem: these crawlers won't be built before libraries start publishing their data.

Facebook's OGP may have more immediate benefits. Libraries are inextricably linked to their communities; what is a community if not a web of relationships? Libraries are uniquely positioned to insert books into real world social networks. A phrase I heard at ALA was "Libraries are about connections, not collections".

Libraries don't need to implement OGP to put a like button on a web page, but without OGP Facebook would understand the "Like" to be about the web page, rather than about the book or other library item.

To show what OGP might look like on a library catalog page, I'll use the same example I used in my post on "spoonfeeding library data to search engines":
<html> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body> 
</html>

Open Graph Protocol wants the web page to be the digital surrogate for the thing to be inserted into the social graph, and so it wants to see metadata about the thing in the web page's meta tags. Most library catalog systems already put metadata in metatags, so this part shouldn't be horribly impossible.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:og="http://ogp.me/ns#"
      xmlns:fb="http://www.facebook.com/2008/fbml"> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
<meta property="og:title" content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:type" content="book"/>
<meta property="og:isbn" content="9780340930762"/>
<meta property="og:url"   
      content="http://library.example.edu/isbn/9780340930762"/>
<meta property="og:image" 
      content="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
<meta property="og:site_name" content="Example Library"/>
<meta property="fb:admins" content="USER_ID"/>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body> 
</html>

The first thing that OGP does is to call out xml namespaces- one for xhtml, a second for Open Graph Protocol, and a third for some specific-to-Facebook properties. A brief look at OGP reveals that it's even more bare bones than schema.org; you can't even express the fact that "Paul Bryers" is the author of "Avatar".

This is less of an issue than you might imagine, because OGP uses a syntax that's a subset of RDFa, so you can add namespaces and structured data to your heart's desire, though Facebook will probably ignore it.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:og="http://ogp.me/ns#"
      xmlns:fb="http://www.facebook.com/2008/fbml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
<meta property="og:title" 
      content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:type" 
      content="book"/>
<meta property="og:isbn" 
      content="9780340930762"/>
<meta property="og:url"   
      content="http://library.example.edu/isbn/9780340930762"/>
<meta property="og:image" 
      content="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
<meta property="og:site_name" 
      content="Example Library"/>
<meta property="fb:app_id" 
      content="183518461711560"/>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span rel="dc:creator">Author: 
    <span typeof="foaf:Person" 
        property="foaf:name">Paul Bryers
    </span> (born 1945)
 </span>
 <span rel="dc:subject">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body> 
</html>

The next step is to add the actual like button by embedding a javascript from Facebook:
<div id="fb-root"></div>
<script   src="http://connect.facebook.net/en_US/all.js#appId=183518461711560&xfbml=1"></script>
<fb:like href="http://library.example.edu/isbn/9780340930762/" 
       send="false" width="450" show_faces="false" font=""></fb:like>

The "og:url" property tells facebook the "canonical" url for this page- the url that Facebook should scrape the metadata from.

Now here's a big problem. Once you put the like button javascript on a web page, Facebook can track all the users that visit that page. This goes against the traditional privacy expectations that users have of libraries. In some jurisdictions, it may even be against the law for a public library to allow a third party to track users in this way. I expect it shouldn't be hard to modify the implementation so that the script is executed only if the user clicks the "Like" button, but I've not been able to find a case where anyone has done this.

It seems to me that injecting library resources into social networks is important. The libraries and the social networks that figure out how to do that will enrich our communities and the great global graph that is humanity.

Tuesday, July 12, 2011

Spoonfeeding Library Data to Search Engines

CC-NC-BY rocketship
When you talk to a search engine, you need to realize that it's just a humongous baby. You can't expect it to understand complicated things. You would never try to teach language to a human baby by reading it Nietzsche, and you shouldn't expect a baby google to learn bibliographic data by feeding it MARC (or RDA or METS or MODS, or even ONIX).

When a baby says "goo-goo" to you, you don't criticize its misuse of the subjunctive. You say "goo-goo" back. When Google tells you that that it wants to hear "schema.org" microdata, you don't try to tell it about the first indicator of the 856 ‡u subfield. You give it schema.org microdata, no matter how babyish that seems.

It's important to build up a baby's self-confidence. When baby google expresses interest in the number of pages of a book, you don't really want to be specifying that there are ix pages numbered with roman numerals and 153 pages with arabic numerals in shorthand code. When baby google wants to know whether a book is "family friendly" you don't want to tell it about 521 special audience characteristics, you just want to tell it whether or not it's porn.

If you haven't looked at the schema.org model for books, now's a good time. Don't expect to find a brilliant model for book metadata, expect to find out what a bibliographic neophyte machine thinks it can use a billion times a day. Schema.org was designed by engineers from Google, Yahoo, and Bing. Remember, their goal in designing it was not to describe things well, it was to make their search results better and easier to use.

The thing is, it's not such a big deal to include this sort of data in a page that comes from a library OPAC (online catalog). An OPAC that publishes unstructured data produces HTML that looks something like this:
<div> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

The first step is to mark something as the root object. You do that with the itemscope attribute:
<div itemscope> 
<h1>Avatar</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

A microdata-aware search engine looking at this will start building a model. So far, the model has one object, which I'll denote with a red box.


The second step, using microdata and Schema.org, is to give the object a type. You do that with the itemtype attribute:
<div itemscope itemtype="http://schema.org/Book"> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

Now the object in the model has acquired the type "Book" (or more precisely, the type "http://schema.org/Book").

Next, we give the Book object some properties:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

Note that while the library record for this book attempts to convey the title complexity: "245 10 $aAvatar /$cPaul Bryers.$", the search engine doesn't care yet. The book is part of a series: 490 1 $aThe mysteries of the Septagram$, and the search engines don't want to know about that either. Eventually, they'll learn.
The model built by the search engine looks like this:

So far, all the property values have been simple text strings. We can also add properties that are links:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>
The model grows.

Finally, we want to say that the author, Paul Bryers, is an object in his own right. In fact, we have to, because the value of an author property has to be a Person or an Organization in Schema.org. So we add another itemscope attribute, and give him some properties:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <div itemprop="author" itemscope itemtype="http://schema.org.Person">
Author:  <span itemprop="name">Paul Bryers</span> 
(born <span itemprop="birthDate">1945</span>)
 </div>
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>

That wasn't so hard. Baby has this picture in his tyrannical little head:
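Written out as a plain data structure, the model amounts to something like this (a rough sketch of the parsed microdata, not any search engine's actual internal format):

```python
# The microdata from the example above, reduced to a nested structure.
# The property names come straight from the itemprop attributes.
book = {
    "type": "http://schema.org/Book",
    "name": "Avatar (Mysteries of Septagram, #2)",
    "genre": "Science fiction",
    "image": "http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg",
    "author": {
        "type": "http://schema.org/Person",
        "name": "Paul Bryers",
        "birthDate": "1945",
    },
}
```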

Which it can easily turn into a "rich snippet" that looks like this:

Though you know all it really cares about is milk.

Here's a quick overview of the properties a Schema.org/Book can have (the values in parentheses indicate a type for the property value):

Properties from http://schema.org/Thing
  • description
  • image(URL)
  • name
  • url(URL)
Properties from http://schema.org/CreativeWork
Properties from http://schema.org/Book
This post is the second derived from my talk at ALA in New Orleans. The first post discussed the changing role of digital surrogates in a fully digital world. The next will discuss "Like" buttons.

Friday, July 8, 2011

Library Data: Why Bother?

When face recognition came out in iPhoto, I was amused when it found faces in shrubbery and asked me whether they were friends of mine. iPhoto, you have such a sense of humor!

But then iPhoto looked at this picture of a wall of stone faces in Baoding, China. It highlighted one of the faces and asked me "Is this Jane?" I was taken aback, because the stone depicted Jane's father. iPhoto was not as stupid as I thought it was- it could even see family resemblances.

Facial recognition software is getting better and better, which is one reason people are so worried about the privacy implications of Facebook's autotagging of pictures. Imagine what computers will be able to do with photos in 10 years! They'll be able to recognize pictures of bananas, boats, beetles and books. I'm thinking it's probably not worth it to fill in a lot of iPhoto metadata.

I wish I had thought about facial recognition when I was preparing my talk for the American Library Association Conference in New Orleans. I wanted my talk to motivate applications for Linked Open Data in libraries, and in thinking about why libraries should be charting a path towards Linked Data, I realized that I needed to examine first of all the motivation for libraries to be in the bibliographic data business in the first place.

Originally, libraries invested in bibliographic data to help people find things. Libraries are big and have a lot of books. It's impractical for library users to find books solely by walking the stacks, unless the object of the search has been anticipated by the ordering of books on the shelves. The paper cards in the card catalog could be easily duplicated to enable many types of search in one compact location. The cards served as surrogates for the physical books.

When library catalogs became digital, much more powerful searches could be done. The books acquired digital surrogates that could be searched with incredible speed. These surrogates could be used for a lot of things, including various library management tasks, but finding things was still the biggest motivation for the catalog data.

We're now in the midst of a transition where books are turning into digital things, but cataloging data hasn't changed a whole lot. Libraries still need their digital surrogates because most publishers don't trust them with the full text of books. But without full text, libraries are unable to provide the full-featured discovery that a search engine with access to both the full text and metadata (Google, Overdrive, etc.) can provide.

At the same time, digital content files are being packed with more and more metadata from the source. Photographs now contain metadata about where, when and how they were taken; for a dramatic example of how this data might be used, take a look at this study from the online dating site OKCupid. Book publishers are paying increased attention to title-level metadata, and metadata is being built into new standards such as EPUB3. To some extent, this metadata is competing for the world's attention with library-sourced metadata.

Libraries have two paths to deal with this situation. One alternative is to insist on getting the full text for everything they offer. (Unglued ebooks offer that; that's what we're working on at Gluejar.)

The other alternative for libraries is to feed their bibliographic data to search engines so that library users can discover books in libraries. Outside libraries, this process is known as "Search Engine Optimization". When I said during my talk that this should be the number one purpose of library data looking forward, one tweeter said it was "bumming her out". If the term "Search Engine Optimization" doesn't work for you, just think of it as "helping people find things".

Library produced data is still important, but it's not essential in the way that it used to be. The most incisive question during my talk pointed out that the sort of cataloging that libraries do is still absolutely essential for things like photographs and other digital archival material. That's very true, but only because automated analysis of photographs and other materials is computationally hard. In ten years, that might not be true. iPhoto might even be enough.

In the big picture, very little will change: libraries will need to be in the data business to help people find things. In the close-up view, everything is changing- the materials and players are different, the machines are different, and the technologies can do things that were hard to imagine even 20 years ago.

In a following post, I'll describe ways that libraries can start publishing linked data, feeding search engines, and keep on helping people find stuff. The slides from my talk (minus some copyrighted photos) are available as PDF (4.8MB) and PPTX (3.5MB).

Saturday, July 24, 2010

The Curious Case of eCapitalization

An unresolved problem faced by all technology writers is what to do with creative capitalization. When you want to lead off a sentence with a word like "iPad" or "eBook", how do you capitalize it? Do you go with "Ipad" and "Ebook"? Or perhaps "IPad" and "EBook"? Do you stay with "iPad" and "eBook" and consider them to be capitalized versions of "ipad" and "ebook"? Horror of horrors – you could put in a dash. Maybe you just finesse the issue by changing your sentence around to avoid having a camelCased word leading off the sentence. Even then, you have the problem of what to do if the word is in the title of your article, for which you probably use Title Case unless you're a cataloging librarian, in which case you use Sentence case, not that the problem goes away! If you're using an iPhone, you know it has a mind of its own about the first letter of your email address being capitalized.

The practice of capitalizing titles presents issues particularly when the titles are transported into new contexts, for example via an RSS feed or search engine harvest. ALL CAPS MIGHT LOOK OK AS A <TITLE> ON YOUR WEB PAGE, but a search engine might hesitate to scream at people.

This is not by any means a new problem, but it's one that changes from era to era because of the symbiotic relationship between language and printing technology. Here's what Charles Coffin Jewett wrote in 1853 when discussing how libraries should record book titles:
The use of both upper-case and lower-case letters in a title-page, is for the most part a matter of the printer's taste, and does not generally indicate the author's purpose. To copy them in a catalogue with literal exactness would be exceedingly difficult, and of no practical benefit. In those parts of the title-page which are printed wholly in capitals, initials are undistinguished. It would be unsightly and undesirable to distinguish the initials where the printer had done so, and omit them where he had used a form of letter which prohibited his distinguishing them. It would teach nothing to copy from the book the initial capitals in one part of the title, and allow the cataloguer to supply them in other parts.
The standard practice of libraries in English-speaking countries has been to record book titles in Sentence case, in which the first word of the title is capitalized and the rest of the words are capitalized only if the language demands it (unless the first word is an article like "A", in which case the second word is also capitalized). The argument for this is that this capitalization style allows the most meaning to be transmitted; a reader can tell which words of a title are proper names or other words that must be capitalized. Which raises two questions: Why are libraries alone in presenting titles this way? And why do libraries persist in this practice when no one in recorded history has ever asked for sentence case titles?
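A naive sketch shows why sentence-casing can't simply be automated: a machine can lower-case everything after the first letter, but it can't know which words are proper names or deliberate stylings without the judgment a cataloguer supplies.

```python
def naive_sentence_case(title):
    """Lower-case everything but the first letter: what a machine can do
    without knowing which words are proper names."""
    title = title.strip()
    return title[:1].upper() + title[1:].lower()

print(naive_sentence_case("The Eloquence of Mary Astell"))
# 'The eloquence of mary astell' -> oops, 'Mary Astell' needed a cataloguer.
print(naive_sentence_case("DOOM: SCARYDARKFAST"))
# 'Doom: scarydarkfast' -> the author's creative capitalization is gone.
```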

In German and other languages, nouns are capitalized; this used to be true of English (take a look at the US Constitution).  In German, it's easy to tell nouns from verbs, which might be very useful if we still had it in English. Still, I enjoy being able to write that something is A Good Thing. It gives me a way to intone my text with an extra bit of information.

The rules for how English should be capitalized have become quite complicated. Here and here are two web pages I found devoted to collecting capitalization rules. Some of them are pretty arcane.

It's fun to speculate on the future of capitalization. In the late 19th century, there was a fashion to simplify spelling, grammar and capitalization, led by people like Melvil Dewey. I'm guessing part of the reason was the annoyance of needing to press a shift key on those newfangled typewriters. But spelling and capitalization reform didn't get very far. Perhaps they tried to publish articles and got stopped in their tracks by a unified front of copy editors.

If anything, the current trend is in the direction of making capitalization even more idiosyncratic. In addition to a proliferation of Product names like iPod and eBay that have crossed over into the language mainstream,  the shift from print to electronic distribution of text does a better job of preserving the capitalization chosen by the author, thus allowing it to better transmit additional meaning.

The ability to increase the information density in text is useful in a wide range of situations, for example, when you have only 140 characters to work with, or when you want a meaningful function name, like toUpperCase(). If your family name is McDonald, you probably have strong feelings on the issue.

My guess is that life will become increasingly case sensitive. You may already be aware that it takes 8 seconds, not one, to transmit a 1 GB file over a 1 Gb/s link. And that the SI unit Mg is a billion times the mass of a mg. If you are a Java programmer who knows the difference between an integer and an Integer, you'll quickly learn about NullPointerExceptions.

The shift from ASCII to Unicode has made it much easier to cling to language-specific capitalization rules. Did you know that there are a small number of characters whose upper-case forms differ from their title-case forms? They are the digraphs DZ, DŽ, LJ, and NJ: the lower-case versions are dz, dž, lj, and nj; the upper-case versions are DZ, DŽ, LJ, and NJ; and the title-case versions are Dz, Dž, Lj, and Nj. Then there are the ligatures ﬁ, ﬂ, ﬃ, ﬄ, ﬅ, and ﬆ, which have no single-character upper-case forms at all; under Unicode's full case mappings they expand into two or three separate letters. And don't forget your Armenian ligatures, ﬓ, ﬔ, ﬕ, ﬖ, ﬗ. For this reason, being "case insensitive" is poorly defined- two strings that are equal when you've changed both to uppercase are not necessarily equal after you've changed them to lower case!
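A quick illustration in Python, which follows Unicode's full case mappings:

```python
# Upper-casing a ligature expands it; lower-casing the result doesn't
# round-trip, so "compare in upper case" and "compare in lower case"
# can disagree about whether two strings match.
s1, s2 = "ﬁle", "file"
print(s1.upper(), s2.upper())            # FILE FILE
print(s1.lower(), s2.lower())            # ﬁle file
print(s1.upper() == s2.upper())          # True
print(s1.lower() == s2.lower())          # False
print(s1.casefold() == s2.casefold())    # True: casefold() exists for exactly this
```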

So what do I do when I write about ebooks? I don't use a dash. When the word appears in a title, I capitalize the "B". I can't wait till they translate this rule into Armenian.

Sunday, June 27, 2010

Global Warming of Linked Data in Libraries

Libraries are unusual social institutions in many respects; perhaps the most bizarre is their reverence for metadata and its evangelism. What other institution considers the production, protection and promulgation of metadata to be part of its public purpose?

The W3C's Linked Data activity shares this unusual mission. For the past decade, W3C has been developing a technology stack and methodology designed to support the publication and reuse of metadata; adoption of these technologies has been slow and steady, but the impact of this work has fallen short of its stated ambitions.

I've been at the American Library Association's Annual Meeting this weekend. Given the common purpose of libraries and Linked Data, you would think that Linked Data would be a hot topic of discussion. The weather here has been much hotter than Linked Data, which I would describe as "globally warming". I've attended two sessions covering Linked Data, each attended by between 50 and 100 delegates. These followed a day-long, sold-out preconference. John Phipps, one of the leaders in the effort to make library metadata compatible with the semantic web, remarked to me that these meetings would not have been possible even a year ago. Still, this attendance reflects only a tiny fraction of the metadata workers at the conference; Linked Data has quite a ways to go. It's only a few months ago that the W3C formed a Library Linked Data Incubator Group.

On Friday morning, there was an "un-conference" organized by Corey Harper from NYU and Karen Coyle, a well-known consultant. I participated in a subgroup looking at use cases for library Linked Data. It took a while for us to get around to use cases though, as participants reported that usage was occurring but weren't sure what for. Reports from OCLC (VIAF) and Library of Congress (id.loc.gov) both indicated significant usage but little feedback. The VIVO project was described as one with a solid use case (giving faculty members a public web presence), but no one from VIVO was in attendance.

On Sunday morning, at a meeting of the Association for Library Collections and Technical Services (ALCTS), Rebecca Guenther of the Library of Congress discussed id.loc.gov, a service that enables both humans and machines to programmatically access authority data at the Library of Congress. Perhaps the most significant thing about id.loc.gov is not what it does but who is doing it. The Library of Congress provides leadership for the world of library cataloguing; what LC does is often slavishly imitated in libraries throughout the US and the rest of the world. id.loc.gov started out as a research project but is now officially supported.

Sara Russell-Gonzalez of the University of Florida then presented VIVO, which has won a big chunk of funding from the National Center for Research Resources, a branch of NIH. The goal of VIVO is to build an "interdisciplinary national network enabling collaboration and discovery between scientists across all disciplines." VIVO started at Cornell and has garnered strong institutional support there, as evidenced by an impressive web site. If VIVO is able to gain similar support nationally and internationally, it could become an important component of an international research infrastructure. That is a big "if". I asked whether VIVO had figured out how to handle cases where researchers change institutional affiliations; the answer was "No". My question was intentionally difficult; Ian Davis has written cogently about the difficulties RDF has in treating time-dependent relationships. It turns out that there are political issues as well: Cornell has had to deal with a case where an academic department wanted to expunge affiliation data for a researcher who left under cloudy circumstances.
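To see why affiliation changes are hard, consider a sketch of the triples involved (written here as Python tuples, with invented names). A bare statement has nowhere to put dates; one common workaround, along the lines of the W3C's "n-ary relations" pattern, promotes the relationship to its own resource at the cost of more complicated queries:

```python
# A bare triple loses the time dimension entirely:
naive = ("ex:researcher/42", "ex:affiliatedWith", "ex:org/cornell")

# Promoting the affiliation to its own resource lets you date it,
# but every query now has to traverse the intermediate node.
affiliation = [
    ("ex:affiliation/7", "ex:researcher",   "ex:researcher/42"),
    ("ex:affiliation/7", "ex:organization", "ex:org/cornell"),
    ("ex:affiliation/7", "ex:start",        "2005"),
    ("ex:affiliation/7", "ex:end",          "2010"),
]
```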

At the un-conference, I urged my breakout group to consider Linked Data as a way to expose library resources outside of the library world, as well as a model for use inside libraries. It's striking to me that libraries seem so focused on efforts such as RDA, which aim to move library data models into Semantic Web-compatible formats. What they aren't doing is making library data easily available in models that are understandable outside the library world.

The two most significant applications of Linked Data technologies so far are Google's Rich Snippets and Facebook's Open Graph Protocol (whose user interface, the "Like" button, is perhaps the semantic web's most elegant and intuitive interface). Why aren't libraries paying more attention to making their OPAC results compatible with these applications by embedding RDFa annotations in their web-facing systems? It seems to me that the entire point of metadata in libraries is to make collections accessible. How better to do this than to weave this metadata into people's lives via Facebook and Google? Doing this will require some dumbing-down of library metadata and some hard swallowing, but it's access, not metadata quality, that's core to the reason libraries exist.
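As a sketch of what this could look like, here's a small Python helper that renders the kind of Open Graph markup a catalog record page might carry. The property names come from the Open Graph Protocol; the record, URL, and cover image are invented, and this is not any vendor's actual template:

```python
def og_meta_tags(record):
    """Render Open Graph <meta> tags for a simplified catalog record."""
    props = {
        "og:type": "book",          # type vocabularies have evolved since 2010
        "og:title": record["title"],
        "og:url": record["url"],
        "og:image": record.get("cover", ""),
    }
    return "\n".join(
        f'<meta property="{name}" content="{value}" />'
        for name, value in props.items() if value
    )

print(og_meta_tags({
    "title": "The Hobbit",
    "url": "https://catalog.example.edu/record/12345",
    "cover": "https://catalog.example.edu/covers/12345.jpg",
}))
```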




Monday, January 18, 2010

Google Exposes Book Metadata Privates at ALA Forum

At the hospital, nudity is no big deal. Doctors and nurses see bodies all the time, including ones that look like yours, and ones that look a lot worse. You get a gown, but its coverage is more psychological than physical!

Today, Google made an unprecedented display of its book metadata private parts, but the audience was a group of metadata doctors and nurses, and believe me, they've seen MUCH worse. Kurt Groetsch, a Collections Specialist in the Google Books Project, presented details of how Google processes book metadata from libraries, publishers, and others at the Association for Library Collections and Technical Services Forum during the American Library Association's Midwinter Meeting.

The Forum, entitled "Mix and Match: Mashups of Bibliographic Data", began with a presentation from OCLC's Renée Register, who described how book metadata gets created and flows through the supply chain. Her blob diagram conveyed the complexity of data flow, and she bemoaned the fact that library data was largely walled off from publisher data by incompatible formats and cataloging practice. OCLC is working to connect these data silos.

Next came friend-of-the-blog Karen Coyle, who's been a consultant (or "bibliographic informant") to the Open Library project. She described the violent collision of library metadata with internet database programmers. Coyle's role in the project is not to provide direction, but to help the programmers decode arcane library-only syntax such as "ill. (some col)". The one instance where she tried to provide direction turned out to be something of a mistake. She insisted that, to allow proper sorting, the incoming data stream should keep track of where the leading article in a title string ends. So, for example, "The Hobbit" would be stored as "(The )Hobbit". This proved to be very cumbersome. Eventually the team tried to figure out when alphabetical sorting was really required, and the answer turned out to be "never".
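A toy Python sketch of the leading-article problem shows why it's cumbersome: to sort "properly" you need to know where the non-filing article ends, and the article list below is invented, English-only, and already wrong for plenty of titles:

```python
LEADING_ARTICLES = ("The ", "A ", "An ")

def sort_key(title):
    """Strip a leading article so titles file under their 'real' first word."""
    for article in LEADING_ARTICLES:
        if title.startswith(article):
            return title[len(article):]
    return title

titles = ["The Hobbit", "A Christmas Carol", "Hamlet"]
print(sorted(titles, key=sort_key))
# ['A Christmas Carol', 'Hamlet', 'The Hobbit']
```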

Open Library does not use data records at all; instead, every piece of data is typed with a URI. This architecture aligns with W3C web standards for the semantic web and allows much more flexible searching and data mining than would be possible with a MARC record.
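Roughly speaking, the contrast looks like this (a sketch with invented identifiers and property URIs, not Open Library's actual schema):

```python
# Record-oriented: one opaque blob per book, with meaning locked in field tags.
marc_like_record = {
    "245": "The Hobbit /",
    "100": "Tolkien, J. R. R.",
}

# Statement-oriented: each piece of data is a (subject, property, value) triple,
# with subjects and properties identified by URIs that anyone can look up.
triples = [
    ("https://example.org/books/OL1234", "http://purl.org/dc/terms/title",
     "The Hobbit"),
    ("https://example.org/books/OL1234", "http://purl.org/dc/terms/creator",
     "https://example.org/authors/OL5678"),
]
```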

Finally, Groetsch reported on Google's metadata processing. They have over 100 bibliographic data sources, including libraries, publishers, retailers, and aggregators of reviews and jacket covers. The library data includes MARC records, anonymized circulation data, and authority files. The publisher and retailer data is mostly ONIX-formatted XML. They have amassed over 800 million bibliographic records containing over a trillion fields of data.

Incoming records are parsed into simple data structures that looked similar to Open Library's, but without the URI-ness. These structures are then transformed in various ways for Google's use. The raw metadata structures are stored in an SQL-like database for easy querying.

Groetsch then talked about the nitty-gritty details of the data. For example, the listing of an author on a MARC record can only be used as an "indication" of the author's name, because MARC gives weak indications of the contributor role. ONIX is much better in this respect. Similarly, "identifiers" such as ISBN, OCLC number, LCCN, and library barcode number are used as key strings but are only identity indicators of varying strength. One ISBN with a Chinese publisher prefix was found on records for over 24,000 different books; ISBN reuse is not at all uncommon. One librarian had mentioned to Groetsch that in her country, ISBNs are pasted onto books to give them a greater appearance of legitimacy.
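A hypothetical sketch (not Google's actual pipeline) of what "identity indicators of varying strength" might mean in practice: identifier matches are summed as weighted evidence rather than treated as proof, with weights and thresholds invented for illustration.

```python
MATCH_WEIGHTS = {"oclc": 0.9, "lccn": 0.8, "isbn": 0.4, "barcode": 0.2}

def same_book_score(rec_a, rec_b):
    """Sum the weights of the identifiers the two records share."""
    score = 0.0
    for id_type, weight in MATCH_WEIGHTS.items():
        if rec_a.get(id_type) and rec_a.get(id_type) == rec_b.get(id_type):
            score += weight
    return score

a = {"isbn": "9787532745504", "oclc": "12345678"}
b = {"isbn": "9787532745504"}        # same (possibly reused) ISBN, nothing else
print(same_book_score(a, b))         # 0.4 -- suggestive, not conclusive
```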

Echoing comments from Coyle, Groetsch spoke with pride of the progress the Google Books metadata team has made in capturing series and group data. Such information is typically recorded in mushy text fields with inconsistent syntax, even in records from the same library.

The most difficult problem faced by the Google Books team is garbage data. Last year, Google came under harsh criticism for the quality of its metadata, most notably from Geoffrey Nunberg. (I wrote an article about the controversy.) The most hilarious errors came from garbage records: for example, certain ONIX records describing Gulliver's Travels carried an author description of the wrong Jonathan Swift. When one of these garbage records is found, the same problems can almost always be found in other metadata sources. Google would like to find a way to get corrected records back into the library data ecosystem so that it doesn't have to fix them again, but there are issues with data licensing agreements that still need to be worked out. Articles like Nunberg's have been quite helpful to the Google team. Every indication is that Google is in the metadata slog for the long term.

One questioner asked the panel what the library community should be doing to prevent "metadata trainwrecks" from happening in the future. Groetsch said without hesitation, "Move away from MARC." There was nodding and murmuring in the audience (the librarian equivalent of an uproar). He elaborated that the worst parts of MARC records were the free-text data, and that normalization of data would be beneficial wherever possible.
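As a hypothetical sketch of the kind of normalization being advocated (not anyone's actual pipeline), here's what pulling structure out of a free-text physical-description string like Coyle's "ill. (some col)" example might look like; the patterns are illustrative only:

```python
import re

def normalize_physical_description(text):
    """Extract page count and illustration info from a free-text MARC 300-style field."""
    result = {}
    pages = re.search(r"(\d+)\s*p\b", text)
    if pages:
        result["pages"] = int(pages.group(1))
    if "ill" in text:
        result["illustrated"] = True
    return result

print(normalize_physical_description("xii, 345 p. : ill. (some col.) ; 24 cm."))
# {'pages': 345, 'illustrated': True}
```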

One of the Google engineers working on record parsing, Leonid Taycher, added that the first thing he had had to learn about MARC records was that the "Machine Readable" part of the MARC acronym was a lie. (MARC stands for MAchine Readable Cataloging.) The audience was amused.

The last question from the audience was about the future role of libraries in the production of metadata. Given the resources being brought to bear on book metadata by OCLC, Google, and others, should libraries be doing cataloguing at all? Karen Coyle's answer was that libraries should concentrate their attention on the rare and unique material in their collections; without that work, these materials would remain almost completely invisible.