Friday, November 20, 2009

Putting Linked Data Boilerplate in a Box

Humans have always been digital creatures, and not just because we have fingers. We like to put things in boxes, in clearly defined categories. Our brains so dislike ambiguity that when musical tones are too close in pitch, the dissonance almost hurts.

The aesthetics of technical design frequently ask us to separate one thing from another. It's often said that software should separate code from content and that web-page mark-up should separate presentation from content. XML allows us to separate element content from attribute data; well-designed XML schemas make clear and consistent decisions about what should go where.

In ontology design, the study of description logics has given us boxes for two types of information, which have been not-so-helpfully named the "A-Box" and the "T-Box". The T-Box is for terminology and the A-Box is for assertions. When you're designing an ontology, an important decision is how much information should be built into your terminology and how much should be left for users of the terminology to assert.

It's not always easy to decide where to draw the terminology vs. assertion line. For example, if you're building a dog ontology, you might want to have a BlackDog class for dogs that are black. Users of your ontology could then make a single assertion that Fido is a BlackDog, saving them the trouble of making the pair of assertions that Fido is a Dog and Fido is colored black. The audience, on the other hand, would have to understand the added terminology to be able to understand what you've said. In the first case, the binding of color to dogs is done in the T-Box; in the second, it is done in the A-Box. The A-Box/T-Box choice boils down to a question of whether users would rather have a concise assertion box and a complex terminology box, or a verbose assertion box and a simple terminology box.
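To make the trade-off concrete, here is a minimal sketch in Python. It is not OWL and uses no RDF library; triples are plain tuples, and every name in it (the "type" and "color" predicates, the `expand` helper, the shape of the `tbox` dictionary) is an assumption made up for illustration. It shows that the concise assertion plus the richer terminology carries the same facts as the verbose pair of assertions, provided someone does the expansion.

```python
# Triples are plain (subject, predicate, object) tuples.
# All identifiers here are illustrative, not taken from any real ontology.

# Option 1: simple terminology, verbose assertions.
abox_verbose = {
    ("Fido", "type", "Dog"),
    ("Fido", "color", "black"),
}

# Option 2: richer terminology, concise assertions.
# The T-Box records what membership in BlackDog implies.
tbox = {
    "BlackDog": {("type", "Dog"), ("color", "black")},
}
abox_concise = {("Fido", "type", "BlackDog")}


def expand(abox, tbox):
    """Add the facts implied by each class membership (one level only)."""
    inferred = set(abox)
    for subject, predicate, obj in abox:
        if predicate == "type" and obj in tbox:
            for implied_predicate, implied_obj in tbox[obj]:
                inferred.add((subject, implied_predicate, implied_obj))
    return inferred


expanded = expand(abox_concise, tbox)

# Everything stated verbosely is recoverable from the concise form.
assert abox_verbose <= expanded
```

The catch is the call to `expand`: a reader who skips that step sees only that Fido is a BlackDog, and learns nothing about dogs or colors.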

Although I designed my first RDF Schema over ten years ago, I had not had a chance to try out OWL for ontology design. Since OWL 2 has just become a W3C Recommendation, I figured it was about time for me to dive in. I was also curious to find out what kind of ontology designs are preferred for linked data deployment, and I'd never even heard of description logic boxes.

Since I gave the New York Times an unfairly hard time for the mistakes it made in its initial Linked Data release, I felt somewhat obligated to do what I could to participate helpfully in their Linked Open Data Community. (Good stuff is going on there; if you're interested, go have a look!) The licensing and attribution metadata in the Times' Linked Data struck me as highly repetitive, and I wondered if this boilerplate metadata could be cleaned up by moving it into an OWL ontology. It could; if you're interested in details, go to the Times Data Community site and see.

It's not obvious which box this boilerplate information should be in. It's really context information, or assertions about other assertions. The Times wants people to know that it has licensed the data under a Creative Commons license, and that it wants attribution. If it's really the same set of assertions for everything the Times wants to express (i.e., it's boilerplate), then one would think there would be a better way than mindless repetition.

My ontology for New York Times assertion and licensing boilerplate had the effect of compacting the A-Box at the cost of making the T-Box more complex. I asked if that was a desirable thing or not, and the answer from the community was a uniform NOT. The problem is that there are many consumers of linked data who are reluctant to do the OWL reasoning necessary to unveil the boilerplate assertions embedded in the ontology. Since a business objective for the Times is to enable as many users as possible to make use of its data and ultimately to drive traffic to its topic pages, it makes sense to keep technical barriers as low as possible. Mindlessness is a feature.
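The community's objection can be sketched in a few lines of Python. This is not the ontology discussed on the Times site; the class name `LicensedConcept`, the predicates, the resource identifiers, and the license placeholder are all hypothetical, chosen only to show what a consumer sees with and without the reasoning step.

```python
# Hypothetical terminology: membership in LicensedConcept implies the boilerplate.
TBOX = {
    "LicensedConcept": {
        ("license", "cc-license-placeholder"),          # placeholder, not a real URI
        ("attributionName", "The New York Times"),      # illustrative value
    },
}

# Compact A-Box: one type assertion per resource, no repeated boilerplate.
ABOX = {
    ("topic/1", "type", "LicensedConcept"),
    ("topic/2", "type", "LicensedConcept"),
}


def lookup(triples, subject, predicate):
    """Return the sorted objects asserted for (subject, predicate)."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def materialize(abox, tbox):
    """Copy the A-Box and add the triples implied by each type assertion."""
    out = set(abox)
    for s, p, o in abox:
        if p == "type":
            out.update((s, ip, io) for ip, io in tbox.get(o, ()))
    return out


# A consumer that does no reasoning finds no license at all.
naive_view = lookup(ABOX, "topic/1", "license")

# A consumer that expands the terminology first does find it.
reasoned_view = lookup(materialize(ABOX, TBOX), "topic/1", "license")

assert naive_view == []
assert reasoned_view == ["cc-license-placeholder"]
```

Publishing the output of something like `materialize` directly is the "mindless" option: the data is bigger and repetitive, but a plain lookup is all a consumer needs.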

I could only think of one reason that a real business would want to use my boilerplate-in-ontology scheme. Since handling an ontology may require some human intervention, the use of a custom ontology could be a mechanism to enforce downstream consideration of and assent to license terms, analogous to "click-wrap" licensing. Yuck!

The conclusion, at least for now, is that for most linked data publishing it is desirable to keep the terminology as simple as possible. Linked Data Pidgin is better than Linked Data Creole.

