Tuesday, May 12, 2009

Google, RDFa, and Reusing Vocabularies

Yesterday, I wrote about one difficulty of having machines talk to other machines: propagating and reusing vocabularies is not something that the machines in use today know how to do on their own. I thought it would be instructive to work out a real example of how I might find and reuse vocabulary to express that a work has a certain ISBN (International Standard Book Number). What I found (not to my great surprise) was that it wasn't that easy for me, a moderately intelligent human with some experience at RDF development, to find RDF terminology to use. I tried Knoodl, Google, and SchemaWeb to help me.

Before I complete that thought, I should mention that today Google announced that they've begun supporting RDFa and microformats in what they call "rich snippets". RDFa is a mechanism for embedding RDF in static HTML web pages, while microformats are a simpler and less formalized way to embed metadata in web pages. With either mechanism, web page authors can hide information in structures meant to be read by machines inside the same web pages that humans read.
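To make the embedding idea concrete, here is a minimal Python sketch that renders an HTML fragment carrying one RDFa statement. It is an illustration, not Google's markup: the page URI under example.org and the function name are made up, and the Dublin Core terms namespace is used only because it is a well-known vocabulary, not because it is the one Google announced.

```python
from html import escape

# Dublin Core terms namespace, used here purely as a familiar vocabulary.
DC_NS = "http://purl.org/dc/terms/"


def rdfa_title_fragment(resource_uri, title):
    """Render an HTML fragment whose RDFa attributes say:
    <resource_uri> dc:title "title".

    `about` names the subject, `property` names the predicate, and the
    element text is the literal value that humans also see on the page.
    """
    return (
        '<div xmlns:dc="%s" about="%s">'
        '<span property="dc:title">%s</span>'
        "</div>"
    ) % (DC_NS, escape(resource_uri, quote=True), escape(title))


if __name__ == "__main__":
    # Hypothetical page URI, for illustration only.
    print(rdfa_title_fragment(
        "http://example.org/books/digital-copyright", "Digital Copyright"))
```

A browser shows only the words "Digital Copyright"; an RDFa-aware crawler can also read the subject and the property from the attributes.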

Concentrating on just the RDFa mechanism, it's interesting to see how Google expects that vocabulary will be propagated to agents that want to contribute to the semantic web: Google will announce the vocabulary that it understands, and everyone else will use that vocabulary. Resistance is futile. Not only does Google have the market power to set a de facto standard, but it also has the intellectual power to do a good job of it: one of the engineers on the Google team working on "rich snippets" is Ramanathan V. Guha, who happens to be one of the inventors of RDF.

You would think that it would be easy to find an RDF property that has been declared for use in assertions like "the ISBN of 'Digital Copyright' is 1-57392-889-5". No such luck. Dublin Core, a schema developed in part by the library community, has an "identifier" element which can be modified to indicate that the element contains an ISBN, but no ISBN property. Maybe I just couldn't find it. Similarly, MODS, which is closely related to library standards, has an identifierType element type that can contain an ISBN, but you have to add type=isbn to the element to make it an ISBN. Documentation for RDFa wants you to use the ISBN to make a URN and to make this the subject of your assertion, not an attribute (ignoring the fact that the ISBN identifies things that you sell in a bookstore, for example the paperback version of a book, rather than what most humans think of as books). I also found entries for ISBN in schemes like The Agricultural Metadata Element Set v.1.1 and a mention in the IMS Learning Resource Meta-Data XML Binding. Finally, I should note that while OpenURL (a standard that I worked on) provides an XML format which includes an ISBN element, it's defined in such a way that it can't be used in other schemas.
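The difference between the two modeling choices above, ISBN as the value of a generic identifier property versus ISBN as a URN that becomes the subject, can be sketched in a few lines of Python that print N-Triples. This is only an illustration under stated assumptions: the work URI under example.org and all function names are hypothetical, and the Dublin Core terms properties stand in for whatever vocabulary one would actually settle on.

```python
DCTERMS = "http://purl.org/dc/terms/"


def normalize_isbn10(raw):
    """Strip hyphens and spaces, then verify the ISBN-10 check digit."""
    digits = raw.replace("-", "").replace(" ", "").upper()
    if len(digits) != 10:
        raise ValueError("an ISBN-10 has exactly 10 characters: %r" % raw)
    total = 0
    for position, char in enumerate(digits):
        if char == "X" and position == 9:
            value = 10  # 'X' stands for 10, allowed only as the check digit
        elif char.isdigit():
            value = int(char)
        else:
            raise ValueError("unexpected character in ISBN: %r" % char)
        total += (10 - position) * value  # weights run from 10 down to 1
    if total % 11 != 0:
        raise ValueError("ISBN-10 check digit does not match: %r" % raw)
    return digits


def _literal(value):
    """Quote a plain string as an N-Triples literal."""
    return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')


def isbn_as_property(work_uri, raw_isbn):
    """Choice 1: the ISBN is a literal hung on a generic identifier property.
    Nothing in the triple itself says that the value is an ISBN."""
    return "<%s> <%sidentifier> %s ." % (
        work_uri, DCTERMS, _literal(normalize_isbn10(raw_isbn)))


def isbn_as_subject(raw_isbn, title):
    """Choice 2: the ISBN becomes a urn:isbn URI and is the subject."""
    return "<urn:isbn:%s> <%stitle> %s ." % (
        normalize_isbn10(raw_isbn), DCTERMS, _literal(title))


if __name__ == "__main__":
    # The work URI is hypothetical, for illustration only.
    print(isbn_as_property("http://example.org/work/digital-copyright",
                           "1-57392-889-5"))
    print(isbn_as_subject("1-57392-889-5", "Digital Copyright"))
```

In the first triple a consumer has to guess what kind of identifier it is looking at; in the second, the statement is about the item carrying that ISBN (a particular edition), not about the work as most people think of it.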

The case of ISBN illustrates some of the barriers to vocabulary reuse, and although there are those who are criticizing Google for not reusing vocabulary, you can see why Google thinks it could work better if they just define vocabulary by fiat.

2 comments:

  1. See http://bibliontology.com/

  2. Bruce,

    Thanks for the comment, and nice to see you here.

    When I go to http://purl.org/ontology/bibo/isbn13, which seems like it might be what I want, I get an XML document that makes no mention of ISBN. If I did not know of your work, how would I find out that this is something that I might be able to, or even ought to, use? And did you also find no other vocabulary to reuse?
