Friday, June 5, 2009

When are you collecting too much data?

Sometimes it can be useful to be ignorant. When I first started a company, more than 11 years ago, I decided that one thing the world needed was a database, or knowledgebase, of how to link to every e-journal in the world, and I set out to do just that. For a brief time, I had convinced an information industry veteran to join me in the new company. One day, as we were walking to a meeting in Manhattan, he turned to me and asked, "Eric, are you sure you understand how difficult it is to build and maintain a big database like that?" I thought to myself, how hard could it be? I figured there were about 10,000 e-journals total, and we were up to about 5,000 already. I figured that 10,000 records was a tiny database; I could easily do 100,000 records even on my Powerbook 5400. I thought that a team of two or three software developers could do a much better job sucking up and cleaning up data than the so-called "database specialists" typically used by the information industry giants. So I told him, "It shouldn't be too hard." But really, I knew it would be hard; I just didn't know WHAT would be hard.

The widespread enthusiasm for Linked Data has reminded me of those initial forays into database building. Some important things have changed since then. Nowadays, a big database has at least 100 million records. Semantic Web software was in its infancy back then; my attempts to use RDF in my database 11 years ago quickly ran into hard bits in the programming, and I ended up abandoning RDF while stealing some of its most useful ideas. One thing that hasn't changed is something I was ignorant of 11 years ago: maintaining a big database is a big, difficult job. And as a recent ex-physicist, I should have known better.

The fundamental problem of maintaining a large knowledgebase is known in physics as the second law of thermodynamics, which states that the entropy of the universe always increases. An equivalent formulation is that perpetual motion machines are impossible. In terms that non-ex-physicist librarians and semantic websperts can understand, the second law of thermodynamics says that errors in databases accumulate unless you put a lot of work into keeping them out.

This past week, I decided to brush off my calculus and write down some formulas for knowledgebase error accumulation so that I wouldn't forget the lessons I've learned, and so I could make some neat graphs.

Let's imagine that we're making a knowledgebase to cover something with N possible entities. For example, suppose we're making a knowledgebase of books, and we know there are at most one billion possible books to cover. (Make that two billion; a new prefix is now being deployed!) Let's assume that we're collecting n records at random, so each record has a 1/N chance of covering any specific entity. At some point, we'll start to get records that duplicate information we've already put into the database. How many records, n, will we need to collect to get 99% coverage? It turns out this is an easy calculus problem, one that even a brain that has spent 3 years as a middle manager can do. The answer is:
Coverage fraction, f = [1- exp(-n/N)]
So to get 99% coverage of a billion possible entities, you'd need to acquire about 4.6 billion records. Of course there are some simplifications in this analysis, but the formula gives you a reasonable feel for the task.
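Inverting the coverage formula makes the 4.6 billion figure easy to check. A minimal sketch in Python (N and f are just the numbers from the example above):

```python
import math

N = 1_000_000_000   # one billion possible books
f = 0.99            # target coverage fraction

# Invert f = 1 - exp(-n/N) to get the number of records needed:
# n = -N * ln(1 - f)
n = -N * math.log(1 - f)
print(f"records needed: {n / 1e9:.2f} billion")  # about 4.61 billion
```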

I haven't gotten to the hard part yet. Suppose there are errors in the data records you pull in. Let's call the fraction of records with errors in them epsilon, or ε. Then we get a new formula for the errorless coverage fraction, F:
Errorless coverage fraction, F = [exp(-εn/N) - exp(-n/N)]
This formula behaves very differently from the previous one. Instead of rising asymptotically to one for large n, it rises to a peak, and then drops exponentially to zero for large n. That's right: in the presence of even a small error rate, the more data you pull in, the worse your data gets. There's also a sort of magnification effect on errors: a 1% error rate limits you to a maximum of about 95% errorless coverage at best; a 0.1% error rate limits you to about 99% coverage at best.
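The peak is easy to locate: setting dF/dn = 0 gives n* = N·ln(1/ε)/(1−ε). A quick sketch of the calculation (the function names are mine):

```python
import math

def errorless_coverage(n, N, eps):
    """F = exp(-eps*n/N) - exp(-n/N), the errorless coverage fraction."""
    return math.exp(-eps * n / N) - math.exp(-n / N)

def peak(N, eps):
    """Solve dF/dn = 0: the peak sits at n* = N * ln(1/eps) / (1 - eps)."""
    n_star = N * math.log(1 / eps) / (1 - eps)
    return n_star, errorless_coverage(n_star, N, eps)

N = 1_000_000_000
for eps in (0.01, 0.001):
    n_star, f_max = peak(N, eps)
    print(f"error rate {eps:.1%}: F peaks at {f_max:.1%} after n = {n_star / N:.2f}N records")
```

This reproduces the roughly-95% and roughly-99% ceilings quoted above, and shows you must over-collect by a factor of 4.65 (or 6.92) just to reach the peak.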

I can think of three strategies to avoid the complete dissipation of knowledgebase value caused by accumulation of errors.
  1. stop collecting data once you're close to reaching the maximum. This is the strategy of choice for collections of information that are static, or don't change with time.
  2. spend enough effort detecting and resolving errors to counteract the addition of errors into the collection.
  3. find ways to eliminate errors in new records.
A strategy I would avoid would be to pretend that perpetual motion machines are possible. There ain't no such thing as a free lunch.
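The formula for F embodies a particular (and pessimistic) model: an entity counts as errorless only if at least one record covers it and none of its records carry an error. A small Monte Carlo check of that model, scaled down to N = 10,000 entities so it runs quickly (a sketch; the variable names are mine):

```python
import math
import random

random.seed(42)

N = 10_000    # entities, scaled down for a quick check
eps = 0.01    # fraction of records carrying an error
n = int(N * math.log(1 / eps) / (1 - eps))  # near the predicted peak

clean = [True] * N    # no erroneous record seen for this entity yet
seen = [False] * N    # entity covered by at least one record
for _ in range(n):
    entity = random.randrange(N)       # each record hits a random entity
    seen[entity] = True
    if random.random() < eps:          # this record carries an error
        clean[entity] = False

F_sim = sum(s and c for s, c in zip(seen, clean)) / N
F_theory = math.exp(-eps * n / N) - math.exp(-n / N)
print(f"simulated F = {F_sim:.3f}, formula F = {F_theory:.3f}")
```

The two numbers agree to within sampling noise, and raising n past the peak makes both fall, as the formula predicts.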


  1. Eric,

    The key to fighting entropy is to classify and identify (via HTTP URIs) the coherent, loosely-coupled Real World Objects that are hiding in a domain (which I would argue is roughly equivalent to your idea of "knowledgebase"). The value of Linked Data is that it explains how these RWO URIs are easily associated with a negotiable "information resource" that can return context-sensitive information about the RWO. This decoupling is essential for unexpected reuse. Entropy is virtually assured if identity is limited to "data". Granted, today's coherent, loosely-coupled RWO may turn out to be less so in light of tomorrow's use cases, but HTTP defines generalized techniques to deal with this also.

    Extending this principle slightly, let's go ahead and assign HTTP URIs to the incoherent, tightly-coupled RWOs in our domain and re/direct clients to the incoherent, tightly-coupled information resource we can make available. Doing so makes it possible for others to collaborate on remodeling this part of the domain. If we don't do this, the incoherence and tight coupling are effectively enshrined in the system, ensuring the increase of entropy.


  2. Jeff- I guess I would argue that classification and identification of RWOs, and enforcement thereof, is a lot of work, i.e., not free. In other words, I agree that the key to preventing the house from getting messy is to keep it organized. On its own, the house prefers to be messy.

  3. There is one thing that is wrong with your conclusion, in my opinion. You are stating that errors are bad, so bad that only the errorless fraction is interesting. I do not think that is true. My house can be lived in pretty well, but it is not a sterile environment.

  4. Here's a comment from Peter Noerr, who had trouble with the sign-in:


    I believe your very pretty graphs are for the chance of having one
    correct record in your database for the given coverage percentage. So
    for 1% error rate the maximum fraction of correct records is 95%, or
    950K records, for which you have to enter about 4,500K records.

    This leaves a large pool of 3,550K records in your database which are
    known to be in error - something like 79%. Since your model makes no
    assumption about how to recognize a "correct" record, this means that
    any user searching the database would have only about a 1 in 4 chance of
    getting a correct record about the given RWO. Not exactly exciting odds.
    And this would be for an essentially "known item" search on a unique
    identifier.

    So what to do about creating databases from input of dubious quality?
    Your suggestion #1 seems attractive, except the RW will not co-operate
    in general and the collection of RWOs is not static. The collection
    generally keeps growing, so you will be committed to collecting more and
    more records (at a rate of about 4.5 to 1). The good news is that the
    chance of finding a "good" record doesn't go down - thank goodness!

    How about #3 - get only good data. I seem to recall that various
    organizations have spent lots of time and money trying to achieve this
    through record specifications, data entry rules (cataloguing), etc., and
    we still have errors ab initio. It doesn't sound like there is a
    solution, or even much of an amelioration, there. People will make mistakes.

    So #2. This recognizes the reality of the failure of #3 and the
    unwillingness of the RW to co-operate in #1. So it is, at least, a
    viable strategy. But, the cost.

    Can things be made better, even if not perfect? This is where clean-up
    rules (to control the information equivalent of Maxwell's demon) can
    slow the tide of entropy. Since all is statistical it should be possible
    to recognize that various records describe the same RWO (by unique IDs
    for example) and then use the aggregate data from those records to
    produce a "best guess" good record. This assumes that errors are random
    across data items and records (as do your original equations), and that
    most "cataloguers" get most of the data right most of the time (i.e. no
    systematic errors where every record shows the same erroneous piece of
    data).

    If you have nothing to do for a weekend or so, then you might like to
    modify your equations to show the effect of statistically assuming that
    once a sufficient number of records about a single RWO have been
    entered, it will be possible to create a record with good data to a
    given percentage accuracy, and to actually identify that record. This is
    definitely an exercise for the student, as my calculus has diminished
    way below tackling this suggestion.
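Peter's "best guess" aggregation is easy to sketch without redoing the full calculus. Assuming errors are independent across records and two wrong records rarely agree on the same wrong value (both assumptions mine, echoing his), the chance that a majority vote over k records for one RWO recovers the correct value is:

```python
from math import comb

def majority_correct(k, eps):
    """Probability that a strict majority of k independent records
    carries the correct value, given per-record error rate eps."""
    return sum(comb(k, j) * (1 - eps) ** j * eps ** (k - j)
               for j in range(k // 2 + 1, k + 1))

for k in (1, 3, 5):
    print(f"{k} record(s): {majority_correct(k, 0.01):.6f}")
```

With a 1% per-record error rate, a vote over three records already pushes the effective error rate below 0.03%, which supports the intuition that aggregating redundant records beats trying to achieve perfection at data entry.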