Friday, October 30, 2009

The New York Times Blunders Into Linked Data, Pillages Freebase and DBPedia

Notwithstanding Larry Lessig, when you try to use the precision of code to express the squishiness of the legal system, you are bound to run into problems, as I've explored in my posts on copyright.

This Thursday, the New York Times took advantage of the International Semantic Web Conference to make good on their previous promise to begin releasing the New York Times subject index as Linked Data. No matter how you look at it, this is a big advance for the semantic web and the Linked Data movement. It's also a potential legal disaster for the New York Times.

To understand what the New York Times did wrong, you have to understand a little bit about the workings of RDF, the data model underlying the semantic web. In particular, you have to understand entailment. Entailments are the sets of facts that can be deduced from the meaning of semantic web data. The crucial difference between plain-old data and Linked Data is that Linked Data includes these entailments.

Consider the English-language statement "apples are red". Because it is expressed in a language, it has meaning in addition to the single fact that apples are red. If we also assert that a specific object is an apple, then there is an entailment that the object is also red.
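The apple example can be sketched in a few lines of Python. This is only a toy: the "isA" and "hasColor" names are invented for the illustration and are not part of RDF, RDFS, or OWL, and a real reasoner applies many more rules than the single one shown here.

```python
# Toy illustration of entailment. The vocabulary ("isA", "hasColor") is
# invented for this sketch and is not part of any semantic web standard.

def entail(facts):
    """Return the facts plus everything deducible by one rule:
    if a class has a color, every instance of that class has it too."""
    inferred = set(facts)
    changed = True
    while changed:  # repeat until no new facts appear
        changed = False
        for thing, p1, cls in list(inferred):
            if p1 != "isA":
                continue
            for subject, p2, color in list(inferred):
                if p2 == "hasColor" and subject == cls:
                    new_fact = (thing, "hasColor", color)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred


asserted = {
    ("apple", "hasColor", "red"),    # "apples are red"
    ("thisObject", "isA", "apple"),  # "this object is an apple"
}

# Facts that were never stated but follow from what was stated.
entailed = entail(asserted) - asserted
print(entailed)  # {('thisObject', 'hasColor', 'red')}
```

Nobody asserted that the object is red; the fact exists only because the two stated facts mean something together.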

The New York Times Linked Data is expressed in the RDF language and uses vocabularies called OWL, SKOS, Dublin Core, and Creative Commons (denoted here by the prefixes "owl:", "skos:", "dc:" or "dcterms:", and "cc:"). You can download it yourself at http://data.nytimes.com/people.rdf (11.9 MB download).

Here's a simplified bit of the New York Times Linked Data. It defines a concept about C. C. Sabathia, a baseball pitcher who lost a game on Wednesday for the New York Yankees:
<rdf:Description rdf:about="http://data.nytimes.com/N24334380828843769853">
  <skos:prefLabel>Sabathia, C C</skos:prefLabel>
  <owl:sameAs rdf:resource="http://dbpedia.org/resource/CC_Sabathia"/>
  <owl:sameAs rdf:resource="http://rdf.freebase.com/rdf/en.c_c_sabathia"/>

  <dc:creator>The New York Times Company</dc:creator>
  <cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
  <dcterms:rightsHolder>The New York Times Company</dcterms:rightsHolder>
  <cc:attributionName>The New York Times Company</cc:attributionName>
</rdf:Description>
The first thing this does is create an identifier, "http://data.nytimes.com/N24334380828843769853", for the "C. C. Sabathia" subject concept. The New York Times uses this set of subjects to create topic pages, and the main purpose of releasing this data set is to help people link concepts throughout the internet to the appropriate New York Times topic pages.

Next, it gives a label for this concept, "Sabathia, C C". So far so good. The next two statements say that the New York Times Topic labeled by "Sabathia, C C" is the same concept previously identified by DBPedia, a Linked Data version of Wikipedia, and by Freebase, another large collection of Linked Data. This is even better, because this tells us that we can use information from Wikipedia and Freebase to help us infer facts about the New York Times C. C. Sabathia topic. "sameAs" is a term defined as part of the "OWL" standard vocabulary, which defines how machines should process these assertions of sameness.

The last four lines, highlighted in red (the dc:creator, cc:License, dcterms:rightsHolder, and cc:attributionName statements), assert that the C. C. Sabathia concept was created by "The New York Times Company", which is the rights holder for the C. C. Sabathia concept, and that if you want to use the C. C. Sabathia concept, the New York Times Company will license the concept to you under the terms of a particular Creative Commons License.

There are two separate blunders made by the stuff in red. The first blunder is that the New York Times is attempting to say that the C. C. Sabathia concept is a work "PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW." This is complete rubbish. The information provided by the New York Times about the C. C. Sabathia concept consists of a few facts that cannot be protected by copyright or any other law that I know of. (The entire 5,000-entity collection, however, is probably protectable in countries other than the US.)

The second blunder is much worse. Where the first blunder is merely silly, the second blunder is akin to attempted property theft. Because the New York Times has asserted that it holds the rights to the C. C. Sabathia topic, and further, that the C. C. Sabathia topic is the same as the Freebase "c_c_sabathia" topic and the Wikipedia "CC_Sabathia" topic, by entailment, the New York Times is asserting that it is the rights holder for those concepts as well.

You might argue that this is a harmless error. But in fact, there is real harm. Computers aren't sophisticated enough to deal with squishy legal concepts. If you load the New York Times file into an OWL-aware data store, the resulting collection will report that the New York Times Company is the rights holder for 4,770 concepts defined by Wikipedia and 4,785 concepts defined by Freebase.
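To see how an OWL-aware store arrives at that conclusion, here is a rough sketch in Python using the simplified Sabathia record. Plain tuples stand in for RDF triples, and the `same_as_closure` helper is my own invention for illustration; it handles only statements whose subject has an owl:sameAs link and is nowhere near a complete OWL reasoner.

```python
# Toy sketch of owl:sameAs entailment over the simplified Sabathia record.
# Plain tuples stand in for RDF triples; this is not a real OWL reasoner.

NYT = "http://data.nytimes.com/N24334380828843769853"
DBPEDIA = "http://dbpedia.org/resource/CC_Sabathia"
FREEBASE = "http://rdf.freebase.com/rdf/en.c_c_sabathia"
SAME_AS = "owl:sameAs"
RIGHTS_HOLDER = "dcterms:rightsHolder"

triples = {
    (NYT, SAME_AS, DBPEDIA),
    (NYT, SAME_AS, FREEBASE),
    (NYT, RIGHTS_HOLDER, "The New York Times Company"),
}


def same_as_closure(triples):
    """Copy every statement about a subject to all identifiers declared
    owl:sameAs it, treating sameAs as symmetric and transitive."""
    # Group identifiers that refer to the same thing.
    group = {}
    for s, p, o in triples:
        if p == SAME_AS:
            merged = group.get(s, {s}) | group.get(o, {o})
            for member in merged:
                group[member] = merged
    # Restate each non-sameAs triple for every alias of its subject.
    result = set(triples)
    for s, p, o in triples:
        if p != SAME_AS:
            for alias in group.get(s, {s}):
                result.add((alias, p, o))
    return result


entailed = same_as_closure(triples)
holders = sorted(s for s, p, o in entailed if p == RIGHTS_HOLDER)
for subject in holders:
    print(subject, "-> rights holder: The New York Times Company")
```

One rights statement went in; three come out, and two of them are about identifiers that belong to DBPedia and Freebase.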

Now before you start bashing the New York Times, it's important to acknowledge that RDF and Linked Data don't make it particularly easy to attach licenses or attributions to semantic web data. The correct ways to do this are all ugly and not standardized. You would think that this would be a requirement for commercial viability of the semantic web.
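One of those ways, suggested in the comments below, is to hang the rights statements on the document that publishes the triples rather than on the concept itself. The sketch below assumes a document URI formed by appending ".rdf" to the concept URI; that URI, and the `subjects_with` helper, are hypothetical and only meant to show why the rights claim then stays put.

```python
# Sketch: attach the rights statement to the document, not the concept.
# The ".rdf" document URI is an assumption made for illustration.

CONCEPT = "http://data.nytimes.com/N24334380828843769853"
DOCUMENT = CONCEPT + ".rdf"  # hypothetical URI for the published triples
DBPEDIA = "http://dbpedia.org/resource/CC_Sabathia"

triples = [
    (CONCEPT, "owl:sameAs", DBPEDIA),
    (DOCUMENT, "dcterms:rightsHolder", "The New York Times Company"),
]


def subjects_with(prop, triples):
    """Subjects that carry `prop` directly or through one owl:sameAs link."""
    direct = {s for s, p, o in triples if p == prop}
    via_same_as = set()
    for s, p, o in triples:
        if p == "owl:sameAs":
            if s in direct:
                via_same_as.add(o)
            if o in direct:
                via_same_as.add(s)
    return direct | via_same_as


print(subjects_with("dcterms:rightsHolder", triples))
```

Because the document has no owl:sameAs link to anything, only the document ends up with a rights holder. The open question, also raised in the comments, is how a consumer is supposed to find the license given only the concept.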

People trying to use New York Times Linked Data can deal with this in three ways. They can decide not to use data from the New York Times, they can ignore all licensing and attribution assertions that the Times makes, or they can hope that the problem goes away soon.

A fourth way would be to sue the New York Times Company for damages. At long last there's a lucrative business model for Linked Open Data.

Update: I have two follow-up posts: The Blank Node Bother and the RDF CopyMess and The New York Times Gets It Right; Does Linked Data Need a Crossref or an InfoChimp?

23 comments:

  1. Yes, I was surprised that the NYT didn't make each skos:Concept part of a skos:ConceptScheme with skos:inScheme, and then hang the licensing information off of the ConceptScheme. It's an easily fixed problem, in multiple ways.

    Another smaller problem is that the owl:sameAs assertions to Freebase use the document URI instead of the URI for the person. So in your example the http://rdf.freebase.com/rdf/en.c_c_sabathia should actually be http://rdf.freebase.com/ns/en.c_c_sabathia ...

    All this would be a good thread for Pedantic-Web I think actually :-) It's great to see a company like the NYT taking steps in the area of Linked Data. They deserve more than a little bit of slack as we all see how the rubber meets the road.

  2. Am I reading this RDF wrong, or is the NYT actually saying that C C Sabathia (the actual person) is licensed as Creative Commons?? I am sure he would disagree :-)

    So yes, I guess the problems you mentioned are just derived from the fact that these statements are not in the right place (as they are non-sensical as it is).

  3. I agree that this came out wrong, but I have a different idea about why. I think they're trying to say two reasonable things:

    - This tiny bit of NYTimes content has the following copyright stipulations.
    - The concept that this tiny bit of NYTimes content represents is (the NYTimes thinks) the same concept represented by these other IDs in these other systems.

    Personally, I think the problem is in the second part. Using owl:sameAs across data sets, which is ostensibly its grandest purpose, is like lighting a semantic fuse. *Any* disagreement between the various datasets' data about this pitcher, not just copyright, is going to now cause inference problems that the computers won't be able to resolve. owl:sameAs is a universal assertion; what we need here is something much more like "See Also".

    In fact, when making assertions across data-sets that's almost always what we need. NYTimes and Freebase have no business, morally or logically, making universal claims about each other's data.

    Some third party, combining these data-sets, may want to say that *for the purposes of that combination*, this NYTimes node and this Freebase node should be treated as equivalent. Or maybe this NYTimes node and these four Freebase nodes that Freebase has their own reasons to keep apart. But that's a very different problem. And owl:sameAs is not a very good tool for that, either...

  4. Although I agree with Eric's assessment about the blunder that NYT made, I'm not sure I agree with glenn.

    The blunder they made was not the use of owl:sameAs, it was the CC identification for this particular concept.

    When someone uses owl:sameAs, they are not saying something about someone else's *dataset*. They are saying that their *concept* is the same one about which another dataset makes statements.

    No one said you should believe everything you read on the (Semantic) Web. Just like with a human reading webpages, a machine parsing datasets should assess a level of trust for each source, instead of swallowing whole anything that anyone says on the Internet.

  5. Hello All, Evan Sandhaus from the New York Times here.

    I can categorically assure you that it is not our intention to assert ownership or licensing terms outside of the records on http://data.nytimes.com. As I mentioned in our announcement yesterday, we are operating our linked data services in beta mode right now, and nothing is set in stone.

    The concerns raised here about our use of owl:sameAs relations certainly merit further discussion and I would welcome your input as to how we should best address this issue. I encourage everybody in this thread to post their approaches to this issue at http://data.nytimes.com/community and look forward to working with all of you as we move this product forward.

    Evan Sandhaus

    Semantic Technologist
    New York Times Research and Development
    evan@nytimes.com

  6. Evan,

    Pat Hayes pretty much nails the owl:sameAs issue here [1]. You can assert owl:sameAs if you mean co-reference. And clearly that's what you mean. Of course the correct URIs need to be used, but that's a minor kink that's easy to fix.

    BTW - Great job and major Linked Data Web contribution, kinks will be sorted out.

    Also, what you are protecting is the data representation (a manifestation of the New York Times's data curation efforts) exposed by your URIs on the Web. Thus, consumers of this data should keep your URIs intact (in their data spaces) or formally attribute data sources, e.g. via LINK tag and rel=dc:source.

    Links:
    1. http://www.slideshare.net/PatHayes/blogic-iswc-2009-invited-talk -- covers Names & Identifiers & Co-reference

  7. Evan and his colleagues deserve support from the community as they iron out the kinks. What they're doing has few precedents.

    Some linkified URL's:
    Pat Hayes invited talk.
    NY Times Data Community

  8. As far as I can tell from reading the slides, Pat Hayes is saying the same thing about owl:sameAs's shortcomings that I am: "Linked data needs to be able to express co-reference without implying acceptance of an entire conceptualization." His proposed solution is RDF+surface, to model the different contexts in which assertions exist. Thus we could say that within the context of the NYTimes data ("on the NYT surface"), this NYT node is equivalent to this Freebase node. Only within the boundary of context is this assertion reliable, and within it the copyright assertions are also fine as is.

  9. Glenn,

    Sorry, he is saying the complete opposite. Basically, he is saying: you don't have to make a full ontological commitment, it isn't what co-reference is about.

    I demonstrate "owl:sameAs" virtues all the time. It's about merging and expanding disparate data sources via Identifiers.

    It's about different names for the same thing.

    Naturally, all of this is subjective, so the publisher or consumer of data always has the chance to impose their own context, platform permitting.

    The Context manifests in many ways. In the most basic form: The New York Times is a bona fide Linked Data Space.

    Kingsley

  10. Absent Pat himself jumping in and clarifying, I don't think your interpretation is defensible. I was directly quoting Pat, above: "Linked data needs to be able to express co-reference without implying acceptance of an entire conceptualization." His example of the misuse of owl:sameAs is sodium, which has different "conceptualizations" in dbpedia and cyc, even though it's clearly the same subject to which both are referring. He proposes "surfaces" (in part) to *allow* the use of sameAs without its current consequences.

    SKOS has skos:closeMatch and skos:exactMatch, both explicitly distinct from owl:sameAs, for just this reason, and either would have worked fine here, and there would have been no copyright issue.

    Jaffri/Glaser/Millard's paper "Managing URI Synonymity to Enable Consistent Reference on the Semantic Web" offers another approach to this, which moves the URI equivalences into a separate space, and that would have worked fine here, too.

  11. Richard Cyganiak from DERI proposes that the copyright stuff should be modeled as referring to XXXXX.rdf, instead of XXXXX, conveying that it applies to the publication of triples, not the subjects of those triples. This makes human sense, too, although I admit I don't know how the relationships between rdf files and their contents are supposed to be modeled.

    http://dowhatimean.net/2009/10/linked-data-at-the-new-york-times-exciting-but-buggy

  12. I've been wondering when there was going to be more discussion of the licensing the OCLC (dewey stuff) and NYT are using for their linked data. Notwithstanding the basic copyright issues, I have no idea how one is supposed to comply with such a license, and when? Does this only apply if I supply the actual RDF/XML file for download on my website? What if I include the NYT URIs in my own publicly exposed data? Or somewhere in between, what if I cache some of the NYT triples on my site?

  13. Glenn,

    A surface is what I call a Linked Data Space.

    When I talk about co-reference, I mean co-reference with explicit context boundaries (surfaces or data spaces or documents or files or what ever form of Named Containment). Basically we can all make our claims and inferences in a "Box".

    There is nothing about Linked Data that implies or enforces anything. Pat is saying: "Everyone, Loosen up and just remember we are talking about Names / Identifiers. Not talking old school AI and Logic".

    "Blogic" is a different kind of logic, a loosening of things because like "404's" on the Web, "loosely coupling" is essential to scalability and reality.

    Sorry, but you are missing this *vital* point. Also you missed Pat's: Nearly Pregnant analogy etc.. Again, he is simply trying to remind folks about "Identifiers" and "Names", since the whole Semantic Project has completely mangled the essence of these things.


    Kingsley

  14. You and I (and Pat Hayes, I think) are all saying the exact same thing about co-reference assertions belonging in a containing context, so telling me I have that backwards, or am missing some "*vital*" point, is an attempt at condescension that ends up only being inane. (This is a familiar pattern in your participation in public discussions. My guess is that you start writing before you finish reading. And maybe finish writing before you finish reading, too.)

    owl:sameAs, by its explicit current definition, is contextless.

    I think what you're proposing (and think Pat is validating) is that we should ignore that current reality, assume the context will be provided somehow, at some point, and go ahead asserting equivalences as much as we want.

    I have no practical objection to this approach, except that I can't see any logic whatsoever for labeling these assertions "owl:sameAs", by which act you are going out of your way to assign them a well-defined meaning that you explicitly don't want. "skos:exactMatch" would be better, for example, and this data is already using some SKOS arcs.

    And were you to succeed in co-opting owl:sameAs this way, anyway, we'd only have to invent some *other* arc ("owl:variantID"?) to express the concept that owl:sameAs currently does, which you need when you actually do want to express that two identifiers should be treated as if they were and had always been one.

    Which is OK with me, too, I guess. But if that's your project, your issues are with w3c, not me.


    PS: If it were up to me to solve this issue, I think a much simpler approach than adding a whole containment abstraction is just to use an arc that states uni-directional effective equivalence, not bi-directional sameness. So instead of saying "nyt:x owl:sameAs freebase:y", which implies "freebase:y owl:sameAs nyt:x" and "freebase:y dcterms:rightsHolder nyt:Us", we'd say "nyt:x glenn:extends freebase:y", which implies "nyt:x freebase:eats freebase:cupcakes", but in the other direction only "freebase:y glenn:isExtendedBy nyt:x", which does *not* imply "freebase:y nyt:anything nyt:atAll". So glenn:extends would in a way do for individuals what rdfs:subClassOf does for classes and rdfs:subPropertyOf does for arcs...

  15. Glenn,

    Any "owl:sameAs" assertion occurs in a triple pattern, within a Named Graph, hosted somewhere. "Somewhere" is something, a Data Space.

    The label "owl:sameAs" doesn't imply != co-reference. It's just a label for a property (a relation) that implies: co-reference.

    If we toss the label "owl:sameAs" aside for a second. Can we not focus more on the fact that: "Glenn" has many Names in different places? And in these places, claims have been made that are at best relative, and inherently subjective, truths?

    There is nothing about Linked Data that impedes the above. Inference Rules are not inextricably bound to instance data. Reasoning is optional, and it should be, since it's a subjective act.

    Kingsley

  16. "Reasoning is optional, and it should be, since it's a subjective act."

    I think we're done here.

  17. Glenn,

    We agree, I assume? Basically, Another way of looking at the not so greatly labeleled "owl:sameAs" property : Identifiers for Entities X & Y have the same Referent, from perspective of "owl:sameAs" claim maker, and as expressed in the claim makers data space.

    Claims are made in data spaces (kind like what we do things in our minds, which are personal data spaces albeit in a different Web-like realm).

  18. Glenn,

    Typo fixed version:

    We agree, I assume? Basically, another way of looking at the not so greatly labeled "owl:sameAs" property:

    Identifiers for Entities X & Y have the same Referent, from the perspective of the "owl:sameAs" claim maker, and as expressed in the claim maker's data space.

    Claims are made in data spaces (similar to what we do in our minds, which are personal data spaces albeit in a different Web-like realm).

  19. He should have just asserted the licence about the document containing the data.

    The comment system here doesn't seem to like angle brackets but it would have meant closing the description of the person, and opening a new one about="", as the empty URI means "this document".

    Tim BL

  20. I suggest a more positive and helpful approach to a major organization breaking new ground! Great job NYT! Let’s not scare their lawyers into retreat!

    There are two concepts lurking here - we need to pull them apart.

    Concept one - the identity of the thing in the world, "Sabathia, C C" in this case. This real world concept is not owned by anyone.

    Concept two: the identity of an information resource about concept one: http://data.nytimes.com/N24334380828843769853. In this case the NYT set of triples about "Sabathia, C C". Such an information resource is an opinion about the real world concept, not the real world concept. An opinion is the opinion (and property) of a particular entity: the entity owning the graph that contains triples that encode that opinion.

    What we need is a way to establish that multiple separate information resources (opinions) represent the same underlying thing in the world without asserting agreement or ownership. These information sources should not be expected to "agree" unless that agreement is also asserted. Ownership and such are about information resources. The NYT should be able to make any statements of license they like about the information resource represented by the URL they own. That this information resource represents the same individual as another should not bind them legally.

    While such assertions could be made on the graph, we have to make sure the connections between a particular graph and the triple are well supported, which they are not. While the graph can be queried for in a special form of SPARQL, the source of a triple is hard to find. In fact, a query that returned "Sabathia, C C" would probably not return any information about the graph where that information came from. The NYT probably wanted something closer to the data: the information resource http://data.nytimes.com/N24334380828843769853.

    One approach would be to recognize that every resource we have is an information resource and the connection to the “real world” is tenuous at best. These URLs are signs for information (opinions) about the thing in the world. What is missing? We should be able to assert that any set of such resources represent the same real world identity. A real world identity should have no other triples, since everything is an opinion. However, this is not sameAs, because sameAs implies agreement; it is more like equivalent class. I would think “represents” would be a reasonable name for such a property linking information about X with X. This capability would have applicability in many aspects of data management on the web.

    The copyright triples are clearly about the data, not the underlying concept. Until there is a standard way of expressing the distinction between a thing and information about the thing the NYT should be able to legally assert that provenance information does not transfer across sameAs.

  21. I hope Tim comes back to answer Cory's question about how to link the data to the license, which was the same question I had earlier when Richard Cyganiak suggested (elsewhere) the same solution. How would a SPARQL query, for example, find some NYTimes data about X *and* the license that goes with it?

  22. So this is a really helpful post, but disguised as a rather grumpy one. Since we're still at the stage where exposing datasets in an open manner is quite experimental and hard, maybe you could have found a gentler title? I'm not sure it helps much internally within orgs like NYT to have their experiments at openness characterised as blundering and pillaging - it just makes it harder to get management support for this kind of stuff in future. Which seems a shame...

  23. danbri- Well, as I was writing this, the Phils lost game 2. Your criticism is fair- the title was aimed more at getting people's attention than at advertising the article's content accurately. Would you or anyone have read and acted on it if it was titled "Implications of Entailment for Linked Data License Assertions: a Case Study"? Zzzzz.

    It's important to remember that real businesses do not adopt technologies to gain praise from bloggers or avoid criticism from the same. They adopt technologies to accomplish business objectives. Technology managers must understand the benefits and dangers to advance their business goals, and that's what I try to focus on in my writing.

    Having criticised the NYT for its first try, I feel a responsibility to laud them for what they've done since.
