Errors in information sources can be inadvertent or intentional, but either way, on the internet the errors get copied, propagated and multiplied, resulting in what I call the Information Freedom Corollary:
Information wants to be free, but the truth'll costya.
If you accept the idea that technologies such as Linked Data, web APIs and data spidering are making it much easier to distribute and aggregate data and facts on the internet, you come to the unmistakable conclusion that it will become harder and harder to make money by selling access to databases. Data of all types will become more plentiful and easy to obtain, and by the laws of supply and demand, the price for data access will drop to near zero. In fact, making data free can increase its value, because of the many benefits of combining data from different sources.
If you want a successful business, it's best to be selling a scarce commodity. Chris Anderson and others have been promoting "free" as a business model for media with the idea that attention is an increasingly scarce commodity (an observation attributed to Nobel Prize-winning economist Herbert Simon). John Hagel has a good review of discussions about "the Economics of Attention". Whether or not this is true, business models that sell attention are very hard to execute when the product is factual information. Data is more of a fuel than a destination.
There is something that becomes scarce as the volume and velocity of information flow increases, and that's the ability to tell fact from fiction. As data becomes plentiful, verifiable truth becomes scarce.
Let's suppose we want to collect a large quantity of information, and think about the ways that we might practically reconcile conflicting assertions. (We're also assuming that it actually matters to someone that the information is correct!)
One way to resolve conflicting assertions is to evaluate the reputation of the sources. The New York Times has a pretty good reputation for accuracy, so an assertion by the Times might be accepted over a conflicting assertion by the Drudge Report. An assertion about the date of an ancestor's death might be accepted if it's in the LDS database, and might be trusted even more strongly if it cites a particular gravestone in a particular cemetery (that is, if it has provenance information). But reputation is imperfect. I am absolutely, positively sure that there's no street through my kitchen, but if I try to say that to one of the mapping data companies, why should they believe me in preference to a planning map filed in my town's planning office? What evidence are they likely to accept? Try sending a correction to Google Maps, and see what happens.
Another method to resolve conflicts is voting. If two or more independent entities make the same assertion, you can assign higher confidence to that assertion. But as it becomes easier to copy and aggregate data, it becomes harder and harder to tell whether assertions from different sources are really independent, or whether they're just copied from the same source. The more that data gets copied and reaggregated, the more that the truth is obscured.
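To make the independence problem concrete, here is a minimal sketch in Python. It assumes each assertion carries a hypothetical `origin` field saying where the publisher got its data, which real-world data rarely provides; the site names and dates are made up for illustration. Votes are counted once per publisher (the naive tally) and once per root origin (the independent tally).

```python
from collections import defaultdict

# Each assertion is (publisher, origin, value). The `origin` field is a
# hypothetical provenance hint; the names and dates are invented.
assertions = [
    ("site-a", "county-record", "1872-03-02"),
    ("site-b", "site-a", "1872-03-02"),  # copied from site-a
    ("site-c", "site-a", "1872-03-02"),  # copied from site-a
    ("site-d", "gravestone", "1872-02-03"),
]


def root_origin(name, copied_from):
    """Follow copy links until reaching a source that copied from no one."""
    seen = set()
    while name in copied_from and name not in seen:
        seen.add(name)
        name = copied_from[name]
    return name


def count_votes(assertions):
    """Return (naive, independent) vote counts per asserted value."""
    # Map each publisher to the place it got its data from.
    copied_from = {pub: origin for pub, origin, _ in assertions}
    naive = defaultdict(int)
    roots = defaultdict(set)
    for publisher, _, value in assertions:
        naive[value] += 1
        roots[value].add(root_origin(publisher, copied_from))
    independent = {value: len(r) for value, r in roots.items()}
    return dict(naive), independent


naive, independent = count_votes(assertions)
```

In this toy data the naive tally favors the first date three to one, while the independent tally is a one-to-one tie, because three of the publishers trace back to the same county record. Without the `origin` hints, the two tallies cannot be told apart, which is exactly the difficulty.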
The semantic web offers another method of resolving conflicting assertions: consistency checking. Genealogy offers many excellent examples of how data consistency can be checked against models of reality. A death date needs to be after the birth date of a person, and if someone's mother is younger than 12 or older than 60 at their birth, some data is inconsistent with our model of human fertility. Whatever the topic area, a good ontological model will allow consistency checks of data expressed using the model. But even the best knowledge model will be able to reconcile only a small fraction of conflicts; a birth date listed as 03-02 could be either February or March.
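The genealogy checks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the rules are hard-coded rather than derived from an ontology, and the age limits of 12 and 60 are the ones mentioned above.

```python
from datetime import date


def check_person(birth, death=None, mother_birth=None):
    """Return a list of inconsistencies; an empty list means none found."""
    problems = []
    if death is not None and death < birth:
        problems.append("death precedes birth")
    if mother_birth is not None:
        # Mother's age in whole years at the child's birth.
        age = birth.year - mother_birth.year - (
            (birth.month, birth.day) < (mother_birth.month, mother_birth.day)
        )
        if age < 12 or age > 60:
            problems.append(f"mother aged {age} at birth")
    return problems


def candidate_dates(year, text):
    """Return every valid reading of an ambiguous 'NN-NN' day/month string."""
    first, second = (int(part) for part in text.split("-"))
    readings = set()
    for month, day in ((first, second), (second, first)):
        try:
            readings.add(date(year, month, day))
        except ValueError:
            pass  # this reading is not a real calendar date
    return sorted(readings)
```

Calling `candidate_dates(1872, "03-02")` (the year is an arbitrary example) yields both 3 February and 2 March, and both readings pass `check_person` equally well. The model can rule out impossible data, but it cannot choose between two plausible readings.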
Since none of these methods is a very good solution, I'd like to suggest that many information providers should stop trying to sell access to data, and start thinking of themselves as truth providers.
How does an information provider become a truth provider? A truth provider is a verifier of information. A truth provider will try to give not only the details of Barack Obama's birth, but also a link to the image of his certificate of live birth. Unfortunately, the infrastructure for information verification is poorly developed compared to the infrastructure for data distribution, as exemplified by standards developed for the Semantic Web. Although the existing Semantic Web technology stack is incomplete, it comes closer than any other deployed technology to making "truth provision" a reality.
Although there have been a number of efforts to develop vocabularies for provenance of Linked Data (mostly in the context of scientific data), I view "named graphs" as an essential infrastructure for the provision of truth. Named graphs are beginning to emerge as vital infrastructure for the semantic web, but they have not been standardized (except obliquely by the SPARQL query specification). This means that they might not be preserved when information is transferred from one system to another. Nonetheless, we can start to think about how they might be used to build what we might call the "true" or "verified" semantic web.
On the Semantic Web, named graphs can be used to collect closely related triples. The core architecture of the Semantic Web uses URIs to identify the nouns, verbs, and adjectives; named graphs allow URIs to identify the sentences and paragraphs of the semantic web. Once we have named graphs, we can build machinery to verify the sentences and paragraphs.
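One way to picture this is to treat each statement as a quad: a triple plus the URI of the graph it belongs to. The Python sketch below uses plain tuples rather than any RDF library, and every `example.org` URI and vocabulary term in it is made up for illustration.

```python
# A quad is (graph, subject, predicate, object). All URIs below are
# made-up examples, not real identifiers or a real vocabulary.
BIRTH_GRAPH = "http://example.org/graph/obama-birth"

quads = [
    (BIRTH_GRAPH,
     "http://example.org/person/obama",
     "http://example.org/vocab/birthDate", "1961-08-04"),
    (BIRTH_GRAPH,
     "http://example.org/person/obama",
     "http://example.org/vocab/birthPlace", "Honolulu"),
    # A statement *about* a graph uses the graph URI as its subject.
    ("http://example.org/graph/provenance",
     BIRTH_GRAPH,
     "http://example.org/vocab/source", "certificate of live birth"),
]


def triples_in(graph_uri, quads):
    """Return the triples collected under one named graph."""
    return [(s, p, o) for g, s, p, o in quads if g == graph_uri]


def statements_about(graph_uri, quads):
    """Return (predicate, object) pairs whose subject is the graph itself."""
    return [(p, o) for _, s, p, o in quads if s == graph_uri]
```

Here `triples_in(BIRTH_GRAPH, quads)` returns the two birth statements as a unit, and `statements_about(BIRTH_GRAPH, quads)` returns what has been said about that unit. Because the graph has a URI of its own, verification information can be attached to the whole "sentence" rather than to each triple separately.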
The simplest way to verify named graphs using their URIs is to use the mechanism of the web to return authoritative graph data in response to an HTTP request at the graph URI. Organizations that are serious about being "truth providers" may want to do much more. Some data consumers may need much more extensive verification (and probably updates) of a graph: they may need to know the original source, the provenance, the change history, the context, licensing information, etc. This information might be provided on a subscription basis, allowing the truth provider to invest in data quality, while at the same time allowing the data consumer to reuse, remix, and redistribute the information without restriction, even adding new verification layers.
Consumers of very large quantities of information may need to verify and update information without polling each and every named graph. This might be done using RSS feeds or other publish/subscribe mechanisms. Another possible solution is to embed digital signatures for the graph in the graph URI itself, allowing consumers possessing the appropriate keys to cryptographically distinguish authentic data from counterfeit or "trap street" data.
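As a rough sketch of the signature-in-the-URI idea, the Python below appends a keyed hash of the graph's content to its URI. Several things here are assumptions for illustration: it uses an HMAC with a shared secret from the standard library as a stand-in for a true public-key digital signature, the `sig` query parameter and the function names are hypothetical, the base URI is assumed to have no query string of its own, and the sorted-lines serialization is far simpler than real RDF canonicalization.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET_KEY = b"shared-secret-for-illustration"  # hypothetical key


def canonical_bytes(triples):
    """Serialize triples in sorted order so equal graphs hash alike."""
    lines = sorted(" ".join(t) for t in triples)
    return "\n".join(lines).encode("utf-8")


def sign_graph_uri(base_uri, triples, key=SECRET_KEY):
    """Append an HMAC-SHA256 tag of the graph content to the graph URI."""
    tag = hmac.new(key, canonical_bytes(triples), hashlib.sha256).hexdigest()
    return f"{base_uri}?{urlencode({'sig': tag})}"


def verify_graph(signed_uri, triples, key=SECRET_KEY):
    """Check that the triples match the tag embedded in the URI."""
    query = parse_qs(urlsplit(signed_uri).query)
    claimed = query.get("sig", [""])[0]
    expected = hmac.new(key, canonical_bytes(triples), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the tag matched.
    return hmac.compare_digest(claimed, expected)
```

A consumer holding the key can call `verify_graph` on any copy of the graph, however many aggregators it has passed through, and an altered or invented triple will fail the check. With a public-key scheme in place of the shared secret, anyone could verify while only the truth provider could sign.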
Named graphs and data verification. I think this is the beginning of a beautiful friendship.