The widespread enthusiasm for Linked Data has reminded me of those initial forays into database building. Some important things have changed since then. Nowadays, a big database has at least 100 million records. Semantic Web software was in its infancy back then; my attempts to use RDF in my database 11 years ago quickly ran into hard bits in the programming, and I ended up abandoning RDF while stealing some of its most useful ideas. One thing that hasn't changed is something I was ignorant of 11 years ago- maintaining a big database is a big, difficult job. And as a recently exed ex-physicist, I should have known better.
The fundamental problem of maintaining a large knowledgebase is known in physics as the second law of thermodynamics, which states that the entropy of the universe always increases. An equivalent formulation is that perpetual motion machines are impossible. In terms that non-ex-physicist librarians and semantic websperts can understand, the second law of thermodynamics says that errors in databases accumulate unless you put a lot of work into them.
This past week, I decided to brush off my calculus and write down some formulas for knowledgebase error accumulation so that I wouldn't forget the lessons I've learned, and so I could make some neat graphs.
Let's imagine that we're making a knowledgebase to cover something with N possible entities. For example, suppose we're making a knowledgebase of books, and we know there are at most one billion possible books to cover. (Make that two billion, a new prefix is now being deployed!) Let's assume that we're collecting n records at random, so for any entity, each record has a 1/N chance of covering any specific entity. At some point, we'll start to get records that duplicate information we've already but into the database. How many records, n, will we need to collect to get 99% coverage? It turns out this is an easy calculus problem, one that even a brain that has spent 3 years as a middle manager can do. The answer is:
Coverage fraction, f = [1- exp(-n/N)]So to get 99% of a billion records, you'd need to acquire about 4.6 billion records. Of course there are some simplifications in this analysis, but the formula gives you a reasonable feel for the task.
I haven't gotten to the hard part yet. Suppose there are errors in the data records you pull in. Let's call the the fraction of records with errors in them epsilon, or ε. Then we get a new formula for the errorless coverage fraction, F
Errorless coverage fraction, F = [exp(-εn/N) - exp(-n/N)]This formula behaves very differently from the previous one. Instead of rising asymptotically to one for large n, it rises to a peak, and then drops exponentially to zero for large n. That's right, in the presence of even a small error rate, the more data you pull in, the worse your data gets. There's also a sort of magnification effect on errors- a 1% error rate limits you to a maximum of 95% errorless coverage at best; a 0.1% error rate limits you to 99.0% coverage at best.
I can think of three strategies to avoid the complete dissipation of knowledgebase value caused by accumulation of errors.
- stop collecting data once you're close to reaching the maximum. This is the strategy of choice for collections of information that are static, or don't change with time.
- spend enough effort detecting and resolving errors to counteract the addition of errors into the collection.
- find ways to eliminate errors in new records.