Thursday, November 5, 2009

The Blank Node Bother and the RDF Copymess

There were many comments on my post about the problems in the Linked Data released by the New York Times, including some back and forth by Kingsley Idehen, Glenn MacDonald, Cory Casanave and Tim Berners-Lee that many readers of this blog may have found somewhat inexplicable. On the surface, the comments appeared to be about how to deal with the potentially toxic scope of "owl:sameAs". At a deeper level, the comments concern how to deal with a limitation of RDF. A better understanding of this issue will also help you understand the difficulties faced by the New York Times and other enterprises trying to benefit from the publication of Linked Data.

Let's suppose that you have a dataset that you want to publish for the world to use. You've put a lot of work into it, and you want the world to know who made the data. This can benefit you by enhancing your reputation, but you might also benefit from others who can enhance the data, either by adding to it or by making corrections. You also may want people to be able to verify the status of facts that you've published. You need a way to attach information about the data's source to the data. Almost any legitimate business model that might support the production and maintenance of datasets depends on having some way to connect data with its source.

One way to publish a dataset is to do as the New York Times did: publish it as Linked Data. Unfortunately, RDF, the data model underlying Linked Data and the Semantic Web, has no built-in mechanism to attach data to its source. To some extent, this is a deliberate choice in the design of the model, and also a deep one. True facts can't really have sources, so a knowledge representation system that includes connections of facts to their sources is, in a way, polluted. Instead, RDF takes the point of view that statements are asserted, and if you want to deal with assertions and how they are asserted in a clean logic system, the assertions should be reified.

I have previously ranted about the problems with reification, but it's important to understand that the technological systems that have grown up around the Semantic Web don't actually do reification. Instead, these systems group triples into graphs and keep track of data sets using graph identifiers. Because these identified graphs are not part of the RDF model, they tend to be implemented differently from system to system, and so statements made about a graph as a whole, such as those that connect data to its source, are not very portable.
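Here is a minimal Python sketch of the graph-identifier approach described above. It is not the API of any particular triple store: the QuadStore class, the "ex:" names, and the sample statements are all hypothetical, chosen only to show that provenance ends up being a statement about a graph identifier rather than about the facts themselves.

```python
from collections import defaultdict


class QuadStore:
    """Toy store: every triple is filed under a graph identifier."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph_id, s, p, o):
        self._graphs[graph_id].add((s, p, o))

    def triples(self, include=None):
        """Yield triples from the selected graphs (all graphs if include is None)."""
        for graph_id, triples in self._graphs.items():
            if include is None or graph_id in include:
                yield from triples


store = QuadStore()

# Two hypothetical sources make different claims about the same subject.
store.add("ex:graph/sourceA", "ex:topic1", "ex:label", "Label from A")
store.add("ex:graph/sourceB", "ex:topic1", "ex:label", "Label from B")

# Provenance is attached to the graph identifier, not to the triple.
store.add("ex:graph/meta", "ex:graph/sourceA", "ex:publishedBy", "ex:PublisherA")

# Restricting a query to one source means selecting by graph identifier.
from_a = set(store.triples(include={"ex:graph/sourceA"}))
everything = set(store.triples())
```

Notice that the link between a triple and its graph lives in the store's own bookkeeping. Export the triples as plain RDF and that link is gone, which is the portability problem.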

At last week's International Semantic Web Conference, Pat Hayes gave an invited talk about how to deal with this problem. I've discussed Pat's work previously, and in my opinion, he is able to communicate a deeper understanding of RDF and its implications than anyone else in the world. In his talk (I wasn't there, but his presentation is available), he argues that when an RDF graph is moved about on the Web, it loses its self-consistency.

To see the problem, ask yourself this: "If I start with one fact, and copy it, how many facts do I have?" The answer is one fact. "One plus one equals two" is a single fact no matter how many times you copy it! You can think of this as a consequence of the universality of the concepts labeled by the English words "one" and "two".

I haven't gotten to the problem yet. As Pat Hayes points out, the problem is most clearly exposed by blank nodes. Blank nodes are parts of a knowledge representation that don't have global identity; they're put in as a kind of glue that connects parts of a fact. For example, let's suppose that we're representing a fact that's a part of the day's semantic web numerical puzzle: "number x plus number y equals two". "number x" and "number y" are labels we're assigning to numbers that semantic web puzzle solvers around the world might attempt to map to a universal concept. Now suppose I copy this fact into another puzzle. How many facts do I have? This time, the answer is two, because "number x" might turn out to be a different number in the second puzzle. So what happens if I copy a graph with a blank node a hundred times? Do the blank nodes multiply while the universally identified nodes don't? Nobody knows!
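The counting argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain tuples, blank nodes are strings starting with "_:", and the "ex:" names and the copy_graph helper are hypothetical. The helper gives every copy fresh blank-node labels, which is one possible policy for copying, not a settled answer.

```python
from itertools import count

_fresh = count()  # source of new, never-reused blank-node labels


def is_blank(term):
    """In this sketch, blank nodes are strings that start with "_:"."""
    return term.startswith("_:")


def copy_graph(graph):
    """Copy a graph, giving each blank node a fresh label local to the copy."""
    renamed = {}

    def rename(term):
        if not is_blank(term):
            return term  # globally identified terms are left alone
        if term not in renamed:
            renamed[term] = f"_:b{next(_fresh)}"
        return renamed[term]

    return {(rename(s), p, rename(o)) for s, p, o in graph}


# A fact built only from globally identified terms.
ground = {("ex:onePlusOne", "ex:equals", "ex:two")}

# The puzzle: some sum of "number x" and "number y" equals two.
puzzle = {
    ("_:sum", "ex:addend", "_:x"),
    ("_:sum", "ex:addend", "_:y"),
    ("_:sum", "ex:equals", "ex:two"),
}

# Copying the ground fact and merging the copies still leaves one triple.
merged_ground = copy_graph(ground) | copy_graph(ground)

# Copying the puzzle with fresh blank nodes doubles the triples...
merged_puzzle = copy_graph(puzzle) | copy_graph(puzzle)

# ...while a naive union that keeps the labels collapses the copies.
naive_puzzle = puzzle | set(puzzle)
```

With these definitions, merged_ground holds one triple, merged_puzzle holds six, and naive_puzzle holds three. Which of the last two counts is "right" depends on a context that the triples themselves don't carry.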

I hope you can see that making copies of knowledge elements and moving them to different contexts is much trickier than you would have imagined. To be able to manage it properly you need more than just the RDF model. In his talk, Pat Hayes proposes something he calls "Blogic" which adds the concept of "surfaces" to provide the context for a knowledge representation graph. If we had RDF surfaces, or something like that, then the connections between data and its source would be much easier to express and maintain across the web. Similarly, it would be possible to limit the scope of potentially toxic but useful assertions such as "owl:sameAs".

There are of course other ways to go about "fixing up" RDF, but I'm guessing the main problem is a lack of enthusiasm from W3C for the project. The view of Kingsley Idehen and Tim Berners-Lee appears to be that existing machinery, perhaps bolstered by graph IDs or document IDs, is good enough and that we should just get on with putting data onto the web. I'm not sure, but there may be a bit of "information just wants to be free" ideology behind that viewpoint. There may be a feeling that information should be disconnected from its source to avoid entanglements, particularly of the legal variety. My belief is a bit different: knowledge just wants to be worth something, and providing solid context for data is ultimately what gives it the most value.

P.S. Ironically, in the very first comment on my last post, Ed Summers hints at a very elegant way that the Times could have avoided a big part of the problem: they could have used entailed attribution. It's probably worth another post just to explain it.


4 comments:

  1. The problem with blank nodes is that they don't have URIs. It's that simple. "Blank" should only ever have meant that they have no *literal*. Letting them exist without URIs was just never a good idea, and hilariously now violates Tim BL's Linked Data commandment #1. Give them URIs and all the weird stuff never arises. No existential marks-on-surfaces dilemmas, no weird how-many-facts-do-you-have ambiguity.

  2. Eric (and others), I'm not sure if you've seen this post: Managing RDF using Named Graphs http://is.gd/4O0VK

  3. Named Graphs seem to me like a really thick layer of duct tape. The whole point of reducing data to triples was that you can express anything in triples. Going meta and making it quads thus seems like a fundamental copout, and I see no reason to believe that the people who think they need a quad today won't come back and demand a quint next month.

    The two-part solution, I think, and this gets right back to our earlier conversation about the NYTimes and owl:sameAs, is a) to quit trying to say things about triples, and just say them about nodes, and b) to understand that you don't get to control how your information maps into mine.

    Take the "source" case, for example. Some source asserts "http://dbpedia.org/resource/Barack_Obama dbpedia-owl:Person/religion dbpedia:Islam". We want to collect this information, but attribute it. The quad approach says we turn this into "http://somelunaticsomewhere.net http://dbpedia.org/resource/Barack_Obama dbpedia-owl:Person/religion dbpedia:Islam", and then in our SPARQL queries we can (laboriously, if at all) control whether the named-graph http://somelunaticsomewhere.net is included or excluded in any given query.

    But there's so much wrong with that that it's hard to know where to start fixing it. Easier to start over entirely. What somelunaticsomewhere.net is actually saying can be broken up into two parts:

    Part 1: "Barack_Obama religion Islam".
    Part 2:
    - "Barack_Obama represents http://dbpedia.org/resource/Barack_Obama"
    - "religion represents dbpedia-owl:Person/religion"
    - "Islam represents dbpedia:Islam"

    That is, the local assertions (part 1), and the association of the local assertions with global referents (part 2), are logically separate and ought to be structurally separate. This would allow us to take in part 1 as is, but replace somelunaticsomewhere.net's part 2 with this:

    - "http://somelunaticsomewhere.net/Barack_Obama represents http://dbpedia.org/resource/Barack_Obama"
    - "http://somelunaticsomewhere.net/religion represents attributed_religion"
    - "Islam represents dbpedia:Islam"

    Now our queries can get at both the stuff we believe and the stuff that we're skeptical about with existing mechanisms, no extra meta-anything required.
