Tuesday, July 12, 2011

Spoonfeeding Library Data to Search Engines

CC-NC-BY rocketship
When you talk to a search engine, you need to realize that it's just a humongous baby. You can't expect it to understand complicated things. You would never try to teach language to a human baby by reading it Nietzsche, and you shouldn't expect a baby google to learn bibliographic data by feeding it MARC (or RDA or METS or MODS, or even ONIX).

When a baby says "goo-goo" to you, you don't criticize its misuse of the subjunctive. You say "goo-goo" back. When Google tells you that it wants to hear "schema.org" microdata, you don't try to tell it about the first indicator of the 856 ‡u subfield. You give it schema.org microdata, no matter how babyish that seems.

It's important to build up a baby's self-confidence. When baby google expresses interest in the number of pages of a book, you don't really want to be specifying that there are ix pages numbered with roman numerals and 153 pages with arabic numerals in shorthand code. When baby google wants to know whether a book is "family friendly" you don't want to tell it about 521 special audience characteristics, you just want to tell it whether or not it's porn.
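
To make the page-count point concrete, here's a minimal Python sketch of the kind of dumbing-down I mean. It takes a MARC-style extent statement such as "ix, 153 p." and boils it down to the single integer that a numberOfPages property wants. The function names are made up for this illustration, and whether the front matter should be counted at all is a judgment call, so treat it as a sketch rather than a cataloging rule.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a roman numeral such as 'ix' to an integer."""
    total = 0
    prev = 0
    # Walk right to left: a smaller value before a larger one is subtracted.
    for ch in reversed(numeral.lower()):
        value = ROMAN[ch]
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total


def simple_page_count(extent):
    """Sum every page sequence in an extent statement like 'ix, 153 p.'.

    Assumption: only the text before the first ' p' is pagination, so
    trailing notes such as ': ill. ; 24 cm' are ignored.
    """
    pagination = extent.lower().split(" p")[0]
    total = 0
    for token in re.findall(r"[0-9]+|\b[ivxlcdm]+\b", pagination):
        total += int(token) if token.isdigit() else roman_to_int(token)
    return total


print(simple_page_count("ix, 153 p."))  # prints 162
```

Baby gets one number. The roman numerals stay in the catalog record for the grown-ups.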

If you haven't looked at the schema.org model for books, now's a good time. Don't expect to find a brilliant model for book metadata, expect to find out what a bibliographic neophyte machine thinks it can use a billion times a day. Schema.org was designed by engineers from Google, Yahoo, and Bing. Remember, their goal in designing it was not to describe things well, it was to make their search results better and easier to use.

The thing is, it's not such a big deal to include this sort of data in a page that comes from a library OPAC (online catalog). An OPAC that publishes unstructured data produces HTML that looks something like this:
<div> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

The first step is to mark something as the root object. You do that with the itemscope attribute:
<div itemscope> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

A microdata-aware search engine looking at this will start building a model. So far, the model has one object, which I'll denote with a red box.


The second step, using microdata and Schema.org, is to give the object a type. You do that with the itemtype attribute:
<div itemscope itemtype="http://schema.org/Book"> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

Now the object in the model has acquired the type "Book" (or more precisely, the type "http://schema.org/Book").

Next, we give the Book object some properties:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

Note that while the library record for this book attempts to convey the title complexity: "245 10 $aAvatar /$cPaul Bryers.$", the search engine doesn't care yet. The book is part of a series: 490 1 $aThe mysteries of the Septagram$, and the search engines don't want to know about that either. Eventually, they'll learn.
The model built by the search engine looks like this:

So far, all the property values have been simple text strings. We can also add properties that are links:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>
The model grows.

Finally, we want to say that the author, Paul Bryers, is an object in his own right. In fact, we have to, because the value of an author property has to be a Person or an Organization in Schema.org. So we add another itemscope attribute, and give him some properties:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <div itemprop="author" itemscope itemtype="http://schema.org/Person">
Author:  <span itemprop="name">Paul Bryers</span> 
(born <span itemprop="birthDate">1945</span>)
 </div>
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>

That wasn't so hard. Baby has this picture in its tyrannical little head:

Which it can easily turn into a "rich snippet" that looks like this:

Though you know all it really cares about is milk.
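
If you'd like to see that picture as data rather than boxes, here's a rough Python sketch of a microdata reader built on the standard library's html.parser. It is emphatically not how any search engine does it, and it ignores large parts of the microdata spec (itemref, itemid, meta and time elements, among others). The class name and the dictionary layout are my own choices for illustration. It just walks the markup for the finished example and builds nested items with a type and a set of properties.

```python
from html.parser import HTMLParser

# Elements that never get an end tag, so they are never pushed on the stack.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}
# Elements whose property value is a URL attribute rather than their text.
URL_ATTRS = {"img": "src", "a": "href", "link": "href"}


class MicrodataParser(HTMLParser):
    """Collect microdata items as nested dicts (a toy, not a full parser)."""

    def __init__(self):
        super().__init__()
        self.items = []   # top-level items found on the page
        self.stack = []   # one frame per open, non-void element

    def _current_item(self):
        # The innermost open element with itemscope owns new properties.
        for frame in reversed(self.stack):
            if frame["item"] is not None:
                return frame["item"]
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        owner = self._current_item()
        prop = attrs.get("itemprop")
        frame = {"tag": tag, "item": None, "prop": None,
                 "owner": owner, "text": []}
        if "itemscope" in attrs:
            item = {"type": attrs.get("itemtype"), "properties": {}}
            frame["item"] = item
            if prop and owner is not None:
                owner["properties"].setdefault(prop, []).append(item)
            else:
                self.items.append(item)
        elif prop and owner is not None:
            if tag in URL_ATTRS:
                url = attrs.get(URL_ATTRS[tag])
                owner["properties"].setdefault(prop, []).append(url)
            else:
                frame["prop"] = prop  # value is the text, known at the end tag
        if tag not in VOID_TAGS:
            self.stack.append(frame)

    def handle_data(self, data):
        # Text belongs to every open element, so nested markup is included.
        for frame in self.stack:
            frame["text"].append(data)

    def handle_endtag(self, tag):
        if not any(frame["tag"] == tag for frame in self.stack):
            return  # stray end tag, or a self-closed void element
        while self.stack:
            frame = self.stack.pop()
            if frame["prop"]:
                value = " ".join("".join(frame["text"]).split())
                props = frame["owner"]["properties"]
                props.setdefault(frame["prop"], []).append(value)
            if frame["tag"] == tag:
                break


PAGE = """
<div itemscope itemtype="http://schema.org/Book">
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <div itemprop="author" itemscope itemtype="http://schema.org/Person">
Author:  <span itemprop="name">Paul Bryers</span>
(born <span itemprop="birthDate">1945</span>)
 </div>
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"
itemprop="image">
</div>
"""

parser = MicrodataParser()
parser.feed(PAGE)
parser.close()
book = parser.items[0]
print(book["properties"]["author"][0]["properties"])
# prints {'name': ['Paul Bryers'], 'birthDate': ['1945']}
```

Every property value is kept in a list because microdata allows a property to repeat. A book with two authors would simply have two Person items under "author".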

Here's a quick overview of the properties a Schema.org/Book can have (the values in parentheses indicate a type for the property value):

Properties from http://schema.org/Thing
  • description
  • image(URL)
  • name
  • url(URL)
Properties from http://schema.org/CreativeWork
Properties from http://schema.org/Book
This post is the second derived from my talk at ALA in New Orleans. The first post discussed the changing role of digital surrogates in a fully digital world. The next will discuss "Like" buttons.

16 comments:

  1. Thanks for this post !
    The French academic union catalog Sudoc has just announced the release of 10 million bibliographic records in schema.org microdata, alongside RDF/XML metadata.
    HTML and RDF/XML pages are listed in a sitemap, for slow but efficient "slurping" by search engines.
    More info : http://lists.w3.org/Archives/Public/public-lod/2011Jul/0031.html

  2. Sigh

    Looking at the list of "More specific types" in the CreativeWork page, the only one that can possibly be appropriate for describing a journal article is Article, and http://schema.org/Article adds just two -- count them, two! -- new properties: articleBody and articleSection.

    So here we have yet another metadata format incapable of representing the most basic bibliographic information about journal articles -- journalTitle, volume, issue, startPage, endPage. It makes me want to weep.

    (No doubt we are supposed to say that an article isPartOf a journal or something, but even jumping through such computer-sciency hoops doesn't help with startPage and endPage, which I suppose are properly considered properties of the isPartOf relationship.)

    When it comes right down to it, for actually representing actual bibliographic data, RIS remains as good as any subsequently designed format. That makes me sad.
    http://en.wikipedia.org/wiki/RIS_%28file_format%29

  3. This is great. just one addition: but believe me you really really really should ADD YOURSELF those properties you think you really want to give to the world so to teach baby new words!

    Just add those terms also if they're not defined by schema. People will notice you using them, there will be catalogs of terms found "in the wild" with examples associated and ultimately they'll update their schema, so please add that "advanced library name" and that "belongs to series" property.. want an advantage right away? once you submit a page to Sindice.com (or a sitemap) you'll be able to query your new terms just like the other official ones, this will allow you to hack with your own data e.g. to implement new services

  4. Hi, we've been reviewing schema.org and how it can represent jobs and vacancies that a company has to offer.
    I'd like to see the schema extended as per my proposal at http://www.tribepad.com/2011/06/schema-org-needs-a-place-for-jobs/

    We're starting to use it on some of our job sites, but I'm not sure how you actually "spread the word" about schema extensions. What I don't want to happen is 15 different people coming up with their own schema and then none gaining any ground.
    Any tips?

  5. Thanks for the comments. The big take-away from the LOD-LAM summit about schema.org was the necessity for communities of interest to come together on ways to extend the vocabularies so as to meet needs outside the lowest-common-denominator search engines. To Mike's issue, I would see if BibOntology would work, and as Giovanni says, just add it to markup as needed.

    The real difficulty is propagating these schemes and know-how across communities; I'm not sure there are any easy answers, other than (as Dan and Yann are doing) to spread the word.

  6. Do either libraries or search engine users benefit from having bib records crawled in the first place?

  7. Dan Kirkland: we'll be happy to provide soon a public list of "new terms" as crawled from Sindice.com. Use new terms away and people will see which sites are using them and how popular they are :) in the meanwhile enjoy 100+ sites using schema.org http://bit.ly/omfoty

  8. Thanks for all the references guys. So much developing and so fast.

    @ Mike Taylor -- The microdata is about search and findability of web-accessible data. Traditional library data is about describing physical items - in the main - and thus serves a different purpose.

    Ask yourself - do you search by page number, or do you search by author name or title. This is what microdata is enabling, better searching.

    And I, as a librarian, welcome it!

  9. Nick, are you telling me no-one ever uses volume and issue in their searching? Or indeed page-number?

    Or, on the other foot, looking at the fields that ARE described by schema.org's Article type, who searches on URL?

  10. Ed Chamberlain at Cambridge University has been working on "COMET" (Cambridge Open Metadata), a JISC funded project. He's written a worth-reading post on the practicalities of feeding OPAC web pages to search engines.

  11. Thanks for the link Eric, there is clearly a way to go yet, and both the software we use and the search engines are not quite there yet. Interesting times ahead though!

    This is also a really great tutorial, much better than the schema.org one.

    Mike, for me it's less about specific indexes to search and more about how the data is structured in a search engine result. Fed through a search engine, a bibliographic record as displayed in an OPAC (often in a formatted table ...) looks confusing.

    As I said in my post, it would be good to see other search engines take initiative on this and show what can be done.

    My (first) attempt is pretty poor. Hopefully I'll get a chance to revisit it.

  12. Ed makes an important point that I forgot to mention in my rant about the inadequacy of the metadata vocabulary, which is just this: THANK YOU, Eric, for a VERY useful tutorial! Definitely the most helpful one I've seen.

  13. I have been thinking about this posting since it was published; and then got a lot of deserved attention on Twitter. I really like the content but something didn't work for me and I couldn't put my finger on what. I think I have it now. It's the metaphor. I can't see Google as a baby that needs to be treated gently and spoon fed. I rather picture it as a manipulative brat that is very picky about what it wants to eat. Because it can be. This personal insight has no impact whatsoever on the value of the posting.

  14. Hvdsomp, you may however put me off having children ...

  15. Hi folks. Great post and discussion. I'd like to encourage you all to pick this up over at W3C. We're just starting a taskforce group of the Semantic Web interest group, "Web Schemas" (http://www.w3.org/wiki/WebSchemas), that is focussed on collaboration and discussion between Web-scale schema initiatives. If you've ideas for specifics that are missing from Schema.org (or other vocabs), the Web Schemas group is a great place to share use cases, requirements and other analysis.
