Errors in information sources can be inadvertent or intentional, but either way, on the internet the errors get copied, propagated and multiplied, resulting in what I call the Information Freedom Corollary:
Information wants to be free, but the truth'll cost ya.

If you accept the idea that technologies such as Linked Data, web APIs and data spidering are making it much easier to distribute and aggregate data and facts on the internet, you come to the unmistakable conclusion that it will become harder and harder to make money by selling access to databases. Data of all types will become more plentiful and easier to obtain, and by the laws of supply and demand, the price for data access will drop to near zero. In fact, there are many reasons why making data free increases its value, chief among them the many benefits of combining data from different sources.
If you want a successful business, it's best to be selling a scarce commodity. Chris Anderson and others have been promoting "free" as a business model for media, on the idea that attention is an increasingly scarce commodity (an observation attributed to Nobel prize-winning economist Herbert Simon). John Hagel has a good review of discussions about "the Economics of Attention." Whether or not this is true, business models that sell attention are very hard to execute when the product is factual information. Data is more of a fuel than a destination.
There is something that becomes scarce as the volume and velocity of information flow increases, and that's the ability to tell fact from fiction. As data becomes plentiful, verifiable truth becomes scarce.
Let's suppose we want to collect a large quantity of information, and think about the ways that we might practically reconcile conflicting assertions. (We're also assuming that it actually matters to someone that the information is correct!)
One way to resolve conflicting assertions is to evaluate the reputation of the sources. The New York Times has a pretty good reputation for accuracy, so an assertion by the Times might be accepted over a conflicting assertion by the Drudge Report. An assertion about the date of an ancestor's death might be accepted if it's in the LDS database, and might be trusted even more strongly if it cites a particular gravestone in a particular cemetery (i.e., has provenance information). But reputation is imperfect. I am absolutely, positively sure that there's no street through my kitchen, but if I try to say that to one of the mapping data companies, why should they believe me in preference to a planning map filed in my town's planning office? What evidence are they likely to accept? Try sending a correction to Google Maps, and see what happens.
Another method to resolve conflicts is voting. If two or more independent entities make the same assertion, you can assign higher confidence to that assertion. But as it becomes easier to copy and aggregate data, it becomes harder and harder to tell whether assertions from different sources are really independent, or whether they're just copied from the same source. The more that data gets copied and reaggregated, the more the truth is obscured.
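To make the independence problem concrete, here is a minimal sketch in Python (the source names and the copied_from mapping are made-up illustrations) of how a vote count might discount sources known to have copied another source:

```python
from collections import defaultdict

# Hypothetical assertions about a birth date, keyed by source.
assertions = {
    "nytimes.example": "1961-08-04",
    "wikipedia.example": "1961-08-04",
    "mirror.example": "1961-08-04",   # known to copy wikipedia.example
    "blog.example": "1961-08-05",
}

# Known copying relationships: copier -> original source.
copied_from = {"mirror.example": "wikipedia.example"}

def independent_votes(assertions, copied_from):
    """Count votes per value, collapsing each source onto the origin it copied from."""
    votes = defaultdict(set)
    for source, value in assertions.items():
        root = source
        while root in copied_from:   # follow the copy chain back to its origin
            root = copied_from[root]
        votes[value].add(root)       # a copy merely re-casts its origin's vote
    return {value: len(sources) for value, sources in votes.items()}

print(independent_votes(assertions, copied_from))
# {'1961-08-04': 2, '1961-08-05': 1}
```

The catch, of course, is that the copied_from relation is exactly the provenance information that tends to get lost as data is copied and reaggregated.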
The semantic web offers another method of resolving conflicting assertions: consistency checking. Genealogy offers many excellent examples of how data consistency can be checked against models of reality. A death date needs to be after the birth date of a person, and if someone's mother is younger than 12 or older than 60 at their birth, some data is inconsistent with our model of human fertility. Whatever the topic area, a good ontological model will allow consistency checks of data expressed using the model. But even the best knowledge model will be able to reconcile only a small fraction of conflicts: a birth date listed as 03-02 could be either February or March.
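As a minimal sketch of that kind of check (the field names and thresholds here are illustrative, not drawn from any standard ontology), a genealogy record can be tested against a simple model of reality:

```python
from datetime import date

def consistency_errors(person_birth, person_death=None, mother_birth=None):
    """List the ways a genealogy record conflicts with a simple model of reality."""
    errors = []
    if person_death is not None and person_death < person_birth:
        errors.append("death date precedes birth date")
    if mother_birth is not None:
        mother_age = (person_birth - mother_birth).days / 365.25
        if mother_age < 12 or mother_age > 60:
            errors.append(f"mother would have been {mother_age:.0f} at the birth")
    return errors

# A record asserting a mother born in 1955 for a child born in 1961 fails the check.
print(consistency_errors(date(1961, 8, 4), mother_birth=date(1955, 1, 1)))
# ['mother would have been 7 at the birth']
```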
Since none of these methods is a very good solution, I'd like to suggest that many information providers should stop trying to sell access to data, and start thinking of themselves as truth providers.
How does an information provider become a truth provider? A truth provider is a verifier of information. A truth provider will try to give not only the details of Barack Obama's birth, but also a link to the image of his certificate of live birth. Unfortunately, the infrastructure for information verification is poorly developed compared to the infrastructure for data distribution, as exemplified by standards developed for the Semantic Web. Although the existing Semantic Web technology stack is incomplete, it comes closer than any other deployed technology to making "truth provision" a reality.
Although there have been a number of efforts to develop vocabularies for the provenance of Linked Data (mostly in the context of scientific data), I view "named graphs" as essential infrastructure for the provision of truth. Named graphs are beginning to emerge as vital infrastructure for the semantic web, but they have not been standardized (except obliquely by the SPARQL query specification). This means that they might not be preserved when information is transferred from one system to another. Nonetheless, we can start to think about how they might be used to build what we might call the "true" or "verified" semantic web.
On the Semantic Web, named graphs can be used to collect closely related triples. The core architecture of the Semantic Web uses URIs to identify the nouns, verbs, and adjectives; named graphs allow URIs to identify the sentences and paragraphs of the semantic web. Once we have named graphs, we can build machinery to verify the sentences and paragraphs.
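As a concrete and deliberately minimal sketch using Python's rdflib library (the graph and subject URIs are invented examples): a named graph collects a few related triples under a single URI, and that URI can then be fetched, cited, or signed as a unit.

```python
from rdflib import Dataset, URIRef, Literal, Namespace

EX = Namespace("http://example.org/")

# A dataset can hold many named graphs; each graph is identified by its own URI.
ds = Dataset()
birth_facts = ds.graph(URIRef("http://example.org/graphs/obama-birth"))

# The "sentence": a few closely related triples collected under one graph URI.
birth_facts.add((EX.BarackObama, EX.birthDate, Literal("1961-08-04")))
birth_facts.add((EX.BarackObama, EX.birthPlace, Literal("Honolulu, Hawaii")))

# TriG serialization preserves the graph name, so the grouping can survive exchange.
print(ds.serialize(format="trig"))
```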
The simplest way to verify named graphs using their URIs is to use the mechanism of the web to return authoritative graph data in response to an HTTP request at the graph URI. Organizations that are serious about being "truth providers" may want to do much more. Some data consumers may need much more extensive verification (and probably updates) of a graph: they may need to know the original source, the provenance, the change history, the context, licensing information, etc. This information might be provided on a subscription basis, allowing the truth provider to invest in data quality, while at the same time allowing the data consumer to reuse, remix, and redistribute the information without restriction, even adding new verification layers.
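A minimal sketch of that first verification step, assuming the graph URI dereferences to an RDF document (the URI and the Accept header handling are illustrative):

```python
import urllib.request

def fetch_authoritative_graph(graph_uri):
    """Dereference a named graph's URI and return the publisher's own serialization."""
    request = urllib.request.Request(
        graph_uri,
        headers={"Accept": "application/trig, text/turtle"},  # ask for RDF, not HTML
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

# A consumer holding a copy of a graph can re-fetch the original and compare the two,
# accepting the copy only if it matches what the "truth provider" currently serves.
# authoritative = fetch_authoritative_graph("http://example.org/graphs/obama-birth")
```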
Consumers of very large quantities of information may need to verify and update information without polling each and every named graph. This might be done using RSS feeds or other publish/subscribe mechanisms. Another possible solution is to embed digital signatures for the graph in the graph URI itself, allowing consumers possessing the appropriate keys to cryptographically distinguish authentic data from counterfeit or "trap street" data.
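One way the signed-URI idea might look (a sketch only: it uses a shared-key HMAC as a stand-in for a true digital signature, assumes a canonical serialization of the graph, and invents the URI pattern) is to embed an authentication tag for the graph's content in the graph URI, so that a consumer holding the key can check authenticity without contacting the publisher:

```python
import hashlib
import hmac

SHARED_KEY = b"key-distributed-to-subscribers"  # illustrative; real keys need real key management

def signed_graph_uri(canonical_graph_bytes):
    """Mint a graph URI that carries an HMAC tag of the graph's canonical serialization."""
    tag = hmac.new(SHARED_KEY, canonical_graph_bytes, hashlib.sha256).hexdigest()
    return f"http://example.org/graphs/{tag}"

def is_authentic(graph_uri, canonical_graph_bytes):
    """A consumer with the key can tell authentic data from counterfeit or 'trap street' data."""
    return hmac.compare_digest(graph_uri, signed_graph_uri(canonical_graph_bytes))

graph_bytes = b'<ex:BarackObama> <ex:birthDate> "1961-08-04" .'
uri = signed_graph_uri(graph_bytes)
print(is_authentic(uri, graph_bytes))                  # True
print(is_authentic(uri, graph_bytes + b" tampered"))   # False
```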
Named graphs and data verification. I think this is the beginning of a beautiful friendship.
Excellent post Eric! I'll point out two additional papers that apply named graphs to the provenance problem; the first is the "classic," the second much more recent:
* Jeremy Carroll et al., Named Graphs, Provenance and Trust (2005)
* Tara Gibson et al., Application of Named Graphs Towards Custom Provenance Views (2009)
Thanks again!
Carroll et al. even describe a way to do signed graphs!
I imagine that there is a need for a wide range of verified truths. For many people, the assertion that Barack Obama was born Aug 4, 1961 feels about right. And if I mention that I found it on Wikipedia, it seems a bit more reputable.
But if I were a serious researcher of U.S. presidential genealogy, then perhaps I would want that picture of the birth certificate. And it would have to be a verified picture at that.
There are different truth requirements for different people. To presume that linked graphs will be able to provide all the different levels of reliability at different cost points (money, time, the list of names that determine the truth) seems rather naive.
Linked graphs and absolute truth will certainly be useful for a proportion of users, but I think for the mom & pop surfer, the grade school student, your typical home user, it's overkill.
I'm just trying to inject some perspective into the conversation. We need to keep the target user base in mind when we discuss the verified semantic web.
Keep up the discourse, it's important to understand where the "truth" might be. It's important to know that the "good fight" is going on somewhere.
Cheers.
I've been thinking a lot about provenance lately and looking especially at the Provenance Vocabulary by Hartig and Zhao. As I try to understand such models I get a headache wondering whether they'll scale.
It could be I'm simply not thinking in the right way. Perhaps my frame of reference should be an infinite, scalable "quad store in the sky," in which the spawning of "provenance webs" due to individual data entities is simply not a problem.
As I just tweeted, it would be interesting to visualize the application of provenance policies via a tool like RelFinder. I would imagine data relationships being discovered, then green-, red- or gray-highlighted as each entity's provenance data was discovered and evaluated.
Via Twitter Chris Rusbridge asked whether RKBExplorer might do the trick; possibly, but it's not clear to me whether it asks the same questions as RelFinder.
We also need to bear in mind that trust is a spectrum. Researchers like Hartig and Zhao are considering tough questions about how scientific data was created and modified; on the other end of the spectrum might be the simple question of "who says," or even simpler, "did I get a '404' trying to retrieve it?"