Tuesday, May 26, 2009

There is no truth on the internet

In his retirement, my father took up genealogy as a hobby, and after he died, his database of thousands of ancestors (most of them in northern Sweden) passed to me. If you're interested, you can browse through them on the hellman.net website. Having all this data up on the web has been rather entertaining. Every month or so, I get an e-mail from some sixth cousin or such who has discovered a common ancestor through a Google search, and the resulting exchanges of data allow me to make occasional corrections and additions.

Since I've taken the database on, huge amounts of genealogic information have become available on the internet. When I first started finding this information, I made the mistake of trying to suck it into my database, since I had become more or less a professional data sucker and spewer in my work life. Once I had spent hour after hour pulling data in, I started to wonder what the point of it all was. Could I really determine, and did I really care, whether Erik Eriksson, born 1837 in Backfors, was my fourth cousin thrice removed or not? What is the relationship between the data I sucked in and the truth about all the real people listed in the database? I quickly regretted my data gluttony.

Traditional genealogists focusing on Sweden use a variety of material as primary sources of information. Baptismal records typically give a child's name and birthdate along with the names of the parents; burial and marriage records similarly give names and dates. The genealogist's job is to connect names on different records to construct a family tree. But things are not always simple. Probably 20% of males in the Backfors region were named Erik, and since patronymics were used, 20% of those males were also named Eriksson, though the name might be abbreviated in the records as "Ersson". To judge whether a girl named Hanna, whose 1877 birth record lists "Erik Eriksson" as the father, is really the daughter of the Erik Eriksson born in 1837 in Backfors, the genealogist must weigh all the available information together with conditional probabilities.
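To put rough numbers on that name coincidence, here is a back-of-the-envelope sketch. The two 20% figures are the guesses from the paragraph above; the count of 500 candidate fathers is purely hypothetical, and the calculation assumes that first names and patronymics are independent and that every namesake is equally likely to be the father before any other evidence is considered.

```python
from fractions import Fraction

# Rough figures from the post: about 20% of males were named Erik, and
# with patronymics about 20% of those were also "Eriksson".
P_FIRST_NAME_ERIK = Fraction(1, 5)
P_PATRONYMIC_ERIKSSON = Fraction(1, 5)


def share_named_erik_eriksson():
    """Share of males called Erik Eriksson, treating the two names as independent."""
    return P_FIRST_NAME_ERIK * P_PATRONYMIC_ERIKSSON


def expected_namesakes(candidate_fathers):
    """Expected number of men named Erik Eriksson among the candidate fathers."""
    return candidate_fathers * share_named_erik_eriksson()


def prior_for_one_man(candidate_fathers):
    """Chance that one particular Erik Eriksson is the father, before other evidence."""
    return 1 / expected_namesakes(candidate_fathers)


# Hypothetical: 500 men of plausible age in the region when Hanna was born.
print(share_named_erik_eriksson())   # 1/25
print(expected_namesakes(500))       # 20
print(prior_for_one_man(500))        # 1/20
```

Under those assumptions, the name alone leaves about twenty candidates, which is why the genealogist has to bring in everything else on the record (farm names, ages, witnesses) before drawing the line on the family tree.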

The internet genealogist (e.g., me) has a different task. Rather than looking at the birth records and assessing the likelihood of name coincidences, the internet genealogist looking at the same question searches the internet and finds that the web site "sikhallan.se" lists Hanna as Erik's daughter. The internet genealogist then makes a judgement about the reliability of the Sikhallan website. For example, how do we know that Sikhallan's source for Erik's birthdate isn't just the hellman.net website? If the two databases disagree, who should be believed? In my case, I just look at my father's meticulous notes about where his information comes from, and if he noted some uncertainty, then I'm much more likely to believe the other sources available to me. Unless, of course, my data has come from one of my data-sucking binges, in which case the source of my data has been lost and I can no longer judge its reliability.

In my last two posts on reification (Part 1, Part 2), I promised that I would have a third post evaluating whether the reification machinery in RDF was worth the trouble. This is not that third post; it is more of a philosophical interlude. You see, another way to look at genealogic information on the internet is to think of it as a web of RDF triples. For example, imagine if Sikhallan made its data available as a set of triples, e.g. (subject: Erik Eriksson; predicate: had daughter; object: Hanna). Then we could load up all the triples into an RDF-enabled genealogy database, and all our problems would be solved, right? Well, yes, unless of course we wanted to retain all the supporting information behind the data, the data provenance, all the extra care in citation of sources taken by my father and ignored by me in my data-sucking orgies. In reality, the triple by itself is worthless, devoid of assessable truth. If the triple were associated with provenance information, its truth would become assessable, and thus valuable. The mechanism that RDF provides for doing things like this is... reification.
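To make the difference between a bare triple and a reified one concrete, here is a minimal sketch in plain Python. It is not RDF syntax and not how any real genealogy database stores its data: the class names, field names, and provenance strings are all invented for illustration. The idea is only that the triple becomes a thing that other statements (who asserted it, and from what source) can be attached to, much as RDF reification describes a statement by its subject, predicate, and object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Assertion:
    """A triple plus who asserted it and where they say they got it."""
    triple: Triple
    asserted_by: str
    source: Optional[str] = None  # None means the provenance was lost


def is_assessable(assertion):
    """An assertion can be weighed only if its source was retained."""
    return assertion.source is not None


def independent_of(assertion, site):
    """False if the assertion comes from `site` or merely copies it."""
    return site not in (assertion.asserted_by, assertion.source)


claim = Triple("Erik Eriksson", "had daughter", "Hanna")

# Hypothetical provenance values, invented for illustration.
from_records = Assertion(claim, "hellman.net", "baptismal record, 1877")
copied = Assertion(claim, "sikhallan.se", "hellman.net")
binge = Assertion(claim, "hellman.net")  # sucked in, source not recorded

print(is_assessable(from_records))            # True
print(is_assessable(binge))                   # False
print(independent_of(copied, "hellman.net"))  # False
```

All three assertions carry the same triple. What distinguishes them is the surrounding provenance: one can be checked against a record, one is only an echo of another website, and one can no longer be judged at all.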

Wikipedia is the most successful knowledge aggregation on the internet today and is also, not coincidentally, the best example of the value of comprehensive retention of provenance and attribution. Wikipedia keeps track of the date and author of every change in its database, and relentlessly purges anything which is not properly cited. Wikipedia is, in my opinion, the best embodiment of my view that there is no truth on the internet; there are only reified assertions.
