Wednesday, April 28, 2010

Pick this Nit: Null Path URIs and the Pedantic Web

There is no surer way to flush out software bugs and configuration errors than to do a sales demo. The process not only exposes the problem, but also sears into the psyche of the demonstrator an irrational desire to see the problem eradicated from the face of the earth, no matter the cost or consequences.
Here's a configuration problem I once found while demonstrating software to a potential customer:
Many library information services can be configured with the base URL of the institution's OpenURL server. The information service then constructs links by appending "?" and a query string to the base URL. So for example, if the base URL is
http://example.com/resolver
and the query string is
isbn=9780393072235&title=The+Big+Short
the constructed URL is
http://example.com/resolver?isbn=9780393072235&title=The+Big+Short
For the demo, we had configured the base URL to be very short: a bare hostname with no path (say, http://example.com), so the constructed URL would have been something like http://example.com?isbn=9780393072235&title=The+Big+Short. Everything worked fine when we tested beforehand. For the customer demo, however, we used the customer's computer, which was running some Windows version of Internet Explorer that we hadn't tested, and none of the links worked. Internet Explorer had this wonderful error page that made it seem as if our software had broken the entire web. Luckily, breaking the entire web was not uncommon at the time, and I was able to navigate to a different demo site and make it appear as if I had fixed the entire web, so we managed to make the sale anyway.
It turns out that http URLs with null paths aren't allowed to have query strings. You wouldn't know it if you looked at the W3C documentation for URIs, which is WRONG, but you will see it if you look at the IETF specs, which have jurisdiction (see RFC 1738 and RFC 2616).
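The failure mode is easy to reproduce programmatically. Here's a minimal sketch of the link construction and the check that would have saved my demo; the URLs and helper names are my own illustration, not from the original software:

```python
from urllib.parse import urlsplit

def build_link(base_url: str, query: str) -> str:
    """Construct a link the way the linking software did: base URL + '?' + query."""
    return f"{base_url}?{query}"

def null_path_with_query(url: str) -> bool:
    """True for http(s) URLs with an empty path but a non-empty query --
    the combination RFC 1738 disallows and that old IE refused to load."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and parts.path == "" and parts.query != ""

# A pathless base URL produces the illegal form:
print(null_path_with_query(build_link("http://example.com", "isbn=9780393072235")))   # True
# Any path at all, even a bare "/", makes it legal:
print(null_path_with_query(build_link("http://example.com/", "isbn=9780393072235")))  # False
```

A configuration screen that ran a check like this on the base URL would have caught the problem long before any customer's copy of Internet Explorer did.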
Internet Explorer was just implementing the spec, ignoring the possibility that someone might ignore or misinterpret it. The fact that Netscape worked where IE failed could be considered a bug or a feature, but most users probably considered Netscape's acceptance of illegal URLs to be a feature.
I still feel a remnant of pain every time I see a pathless URL with a query string. Most recently, I saw a whole bunch of them on the thing-described-by site and sent a nit-picky e-mail to the site's developer, and was extremely pleased when he fixed them. (Expeditious error fixing will be richly rewarded in the hereafter.) I've come to recognize, however, that the vast majority of these errors will never be fixed or even noticed, and maybe that's even a good thing.
Nit picking appears to have been a highlight of the Linked Data on the Web Meeting in Raleigh, NC yesterday, which I've followed via Twitter. If you enjoy tales of nerdy data disasters or wonky metadata mischief, you simply must peruse the slides from Andreas Harth's talk (1.8M, pdf) on "Weaving the Pedantic Web". If you're serious about understanding real-world challenges for the Semantic Web, once you've stopped laughing or crying at the slides you should also read the corresponding paper (415K, pdf). Harth's co-authors are Aidan Hogan, Alexandre Passant, Stefan Decker, and Axel Polleres from DERI.
The DERI team has studied the incidence of various errors made by publishers of Linked Data "in the wild". Not so surprisingly, they find a lot of problems. For example, they find that 14.3% of triples in the wild use an undeclared property and 8.1% of the triples use an undeclared class. Imagine if a quarter of all sentences published on the web used words that weren't in the dictionary, and you'd have a sense of what that means. 4.7% of typed literals were "ill-typed". If 5% of the numbers in the phone book had the wrong number of digits, you'd probably look for another phone book.
They've even found ways that seemingly innocuous statements can have serious repercussions. It turns out that it's possible to "hijack" a metadata schema, and induce a trillion bad triples with a single Web Ontology Language (OWL) assertion.
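The mechanism behind hijacking is easy to simulate. In this hypothetical sketch (the prefixes, corpus, and counts are all made up for illustration, and I'm using a simple RDFS subclass rule as a stand-in for the OWL constructs the paper analyzes), a single assertion redefining a popular class makes a naive reasoner mint one new triple for every existing instance:

```python
# Hypothetical sketch of "ontology hijacking": redefining a term in a
# widely used vocabulary so that a reasoner infers new triples for every
# existing use of it. A corpus of 100,000 foaf:Person instances stands in
# for the web-scale case.
corpus = {(f"ex:person{i}", "rdf:type", "foaf:Person") for i in range(100_000)}

# One malicious assertion, published anywhere on the open web:
hijack = ("foaf:Person", "rdfs:subClassOf", "ex:Hijacked")

# Naive forward chaining of the subclass rule:
# (?x rdf:type ?c) + (?c rdfs:subClassOf ?d)  =>  (?x rdf:type ?d)
inferred = {(s, "rdf:type", hijack[2])
            for (s, p, o) in corpus
            if p == "rdf:type" and o == hijack[0]}

print(len(inferred))  # one new triple per instance: 100000
```

Scale the corpus up to the billions of FOAF triples in the wild and it's easy to see how one assertion can balloon into a trillion bad inferences.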
To do battle with the enemy of badly published Linked Data, the DERI team urges community involvement in a support group that has been formed to help publishers fix their data. The "Pedantic Web" has 137 members already. This is a very positive and necessary effort. But they should realize that the correct data cause is a hopeless one. The vast majority of potential data publishers really don't care about correctness, especially when some of the mistakes can be so subtle. What they care about is accomplishing specific goals. The users of my linking software only cared that the links worked. HTML authors mostly care only that the web page looks right. Users of Facebook or Google RDFa will only care that the Like buttons or Rich Snippets work, and the fact that the schemas for these things either don't exist in machine-readable form or are wildly inconsistent with the documentation is a Big Whoop.
Until, of course, somebody does a sales demo and the entire web crashes.
(nit and head louse photos from Wikimedia Commons)


  1. "You cannot simultaneously have mass adoption and rigor" -- Clay Shirky.

  2. Nice post... but then again I would say that.

    You touch on a very important point towards the end: that publishers only care about specific goals. One of the dangers of Linked Data is that you publish your data and oftentimes never see it consumed, or problems with how it's consumed: the best way to improve data quality is to have visible and practical applications consuming the data.

    I agree with your final summary, and agree that in absolute terms, the "correct data cause" by itself is doomed to fail. However, if we can get the data quality to improve by some measurable quantity (or even just nail down what "data quality" even means for Linked Data) we'll have succeeded in our own way, and hopefully make the life of those (now all-important) potential adopters a little easier, and their sleek and shiny Linked Data demos... well, a little more sleek and shiny.