Thursday, October 15, 2009

Normal and Inverse Network Effects for Linked Data

The human brain has an amazing capacity to recognize familiar patterns in unfamiliar environments. One manifestation of this is pareidolia, the phenomenon of seeing an image of the Virgin Mary in a grilled cheese sandwich or a mesa on Mars that looks like a face. Another manifestation is our tendency to apply newly popularized or trendy concepts to totally inappropriate circumstances. For example, once Clayton Christensen popularized "disruptive innovation", any situation where technology brought about change was suddenly being labeled "disruptive".

My latest peeve is what I perceive to be pareidolic use of the term "network effect" to describe almost any example of positive feedback in markets. Here's an example of something Tim O'Reilly considers a network effect:
Google is better at spidering that network than their competitors. They thus benefit more powerfully from the network that we are all collectively building via our web publishing and cross-linking.
While there is definitely a network that enables Google's spidering, it's not a "network effect" that makes Google a good spiderer. Economies of scale are what make Google a good spiderer, even if that scale has resulted in part from network effects.

The term "network effect" is commonly attributed to Bob Metcalfe, the co-inventor of Ethernet. He used it to refer to a mathematical description of how the value of a network scales with the number of nodes it connects. His reasoning was that the value of each networked node is proportional to the number of other nodes it can connect with, so that the total value of the network scales with the square of the number of nodes.
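Metcalfe's reasoning can be sketched in a few lines. With n nodes, each node can connect to n-1 others, so the number of distinct pairwise connections grows as n(n-1)/2, i.e. roughly n²:

```python
def metcalfe_value(n: int) -> int:
    """Distinct pairwise connections possible among n networked nodes."""
    return n * (n - 1) // 2

# Doubling the nodes roughly quadruples the possible connections.
for n in (2, 10, 100):
    print(n, metcalfe_value(n))
```

Running this shows the quadratic growth: 2 nodes yield 1 connection, 10 yield 45, and 100 yield 4,950.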

It's pretty silly to expect that a scaling rule that works for small networks would continue to apply for large networks, and Andrew Odlyzko and Benjamin Tilly have pointed out that more modest scaling laws are a much better fit to market valuations of networks. Still, their suggestion that inappropriate application of Metcalfe's law was to blame for the internet bubble and its subsequent collapse bears reflection.

Recently there's been some discussion of how to apply Metcalfe's law for the network effect to Linked Data and the Semantic Web. Linked Data is information published using standards so that machines can understand its meaning and make inferences from the totality of data that has been collected. One argument says that any set of Linked Data increases in value as every new bit of Linked Data is added to the worldwide cloud of Linked Data. So how does the value of this Linked Data "network" really scale with the links it contains?

Since I don't know of any way to value an arbitrary bit of Linked Data, I'll pick a simple system where I can compute utility. I'll focus on the direct effects and benefits of linking data together, and ignore for now indirect benefits such as those which result from the use of standards.

Suppose we have two sets of Linked Data entities, Movies and Actors. Let's also assume that both of these sets are essentially complete. We'll then consider the effect on the system utility of adding random "actedIn" links between Actors and Movies. In our utility computation, we'll assume that answering questions about which actors acted in which movies is the primary utility of our set of links, and the number of these questions the set can answer will be the utility measure.

For the questions "What movies did X act in?" and "Who acted in the movie Y?" the value of the link collection scales linearly with the number of "actedIn" links. There's no network effect at all for these questions, because the fact that Marlon Brando acted in On the Waterfront adds no utility to the fact that Humphrey Bogart acted in Casablanca.

For the question "Who else acted in movies that X acted in?" the result is different. For this question, the ability of our collection of links to answer usefully scales as the square of the number of links, just as in the classic network case. For this question, there clearly is a network effect.
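The contrast between the two question types can be made concrete with a toy link set (the data here is a hypothetical handful of actedIn links, not a real database):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical "actedIn" links: (actor, movie) pairs.
links = [
    ("Brando", "On the Waterfront"),
    ("Bogart", "Casablanca"),
    ("Bergman", "Casablanca"),
    ("Rains", "Casablanca"),
]

# "What movies did X act in?" -- one answerable fact per link,
# so utility scales linearly with the number of links.
acted_in_facts = len(links)

# "Who else acted in movies that X acted in?" -- co-actor pairs,
# which grow roughly as the square of the links within each cast.
cast = defaultdict(set)
for actor, movie in links:
    cast[movie].add(actor)
co_actor_pairs = sum(
    len(list(combinations(actors, 2))) for actors in cast.values()
)

print(acted_in_facts, co_actor_pairs)  # 4 acted-in facts, 3 co-actor pairs
```

Adding one more actor to Casablanca's cast adds one acted-in fact but three new co-actor pairs; that quadratic growth within casts is the network effect.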

For the "Kevin Bacon" question ("How many acted-with degrees of separation are there between X and Kevin Bacon?") the network effect is even stronger, with the network value scaling as a higher power of the number of actedIn links. Notice that for Linked Data, the network effect is not inherent in the data, but rather is implicit in the types of queries that are made on the data.
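The Kevin Bacon question is just a shortest-path search over the acted-with graph. A minimal sketch, using hypothetical single-letter actors and movies m1-m3 rather than real data:

```python
from collections import defaultdict, deque

# Hypothetical actedIn links forming a chain A-B-C-D of co-actors.
links = [("A", "m1"), ("B", "m1"), ("B", "m2"),
         ("C", "m2"), ("C", "m3"), ("D", "m3")]

cast = defaultdict(set)       # movie -> actors in it
movies_of = defaultdict(set)  # actor -> movies acted in
for actor, movie in links:
    cast[movie].add(actor)
    movies_of[actor].add(movie)

def degrees_of_separation(start, target):
    """Breadth-first search over the acted-with graph."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        actor, dist = queue.popleft()
        if actor == target:
            return dist
        for movie in movies_of[actor]:
            for co_actor in cast[movie] - seen:
                seen.add(co_actor)
                queue.append((co_actor, dist + 1))
    return None  # no acted-with chain connects them

print(degrees_of_separation("A", "D"))  # 3
```

Every added link can create new paths through many existing links at once, which is why this query's answerable-question count grows as such a high power of the link count.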

What we really wanted to know was the total value of the set of links. We might guess that the value is proportional to the total number of questions that the set of links can answer. That number grows exponentially with the number of links. We can see that an exponentially increasing value doesn't make sense, however, by considering the total value as a power series. We've already discussed the first two terms of that series, but we've not considered their relative coefficients. It's really hard to argue that a system that can answer only the acted-with questions is hugely more valuable than one that answers only the acted-in questions, despite the fact that it answers N² questions compared to only N for the acted-in answering system. It seems to me that the Kevin Bacon answering system is less valuable than the other two systems, despite the even larger number of questions (about N¹²) that it would be able to answer; they're just really stupid questions.

Even if network effects are not inherent in Linked Data, threshold effects can result in the existence of a "critical mass", above which positive reinforcement kicks in to drive the entire system. In our toy system, we can easily imagine that a collection of links might be worthless until there were a sufficient number of links to exceed critical mass. A system that can tell me 90% of the movies that someone has acted in is a lot more than nine times as valuable as a system that can tell me only 10%. That's because an acted-in answering system is worthless unless it's better than the random guy sitting next to me at the bar! So this is kinda-sorta a network effect, but really it's a threshold effect.
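The bar-guy threshold can be captured in a toy value function. The 50% baseline here is a made-up stand-in for how well the random guy at the bar does:

```python
def system_value(coverage: float, bar_guy: float = 0.5) -> float:
    """Toy threshold model: the lookup service is worthless until its
    coverage beats the (hypothetical) bar-guy baseline, after which its
    value simply tracks coverage."""
    return coverage if coverage > bar_guy else 0.0

print(system_value(0.10))  # 0.0 -- below critical mass, worthless
print(system_value(0.90))  # 0.9 -- above it, full value
```

The jump at the threshold, not any connection between the links themselves, is what makes the 90%-coverage system so much more than nine times as valuable as the 10% one.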

It's rather easy to mistake threshold effects for network effects. I'll put it this way: it's not a network effect that causes me to avoid doing my laundry until I have a full load, it's a threshold effect! Never mind that it's really three loads.

Rod Beckstrom, currently CEO of ICANN, has described the "inverse" network effect, which occurs in situations where the addition of nodes reduces the value of a network to each participant. Golf clubs are cited as examples: they have an optimum size of about 500 members, because additional members make it more difficult for existing members to get playing time. I see two types of inverse network effects at play in the Linked Data world: the first is the cost of expanding a database; the second is the law of diminishing returns.

In an optimally designed database, the cost of accessing any given record is proportional to the logarithm of the number of records. This is a slowly increasing function: with this scaling, you can increase from a million records to ten million records and increase your cost by only about 17%. Alas, optimal design is rarely achieved, and getting to that optimum incurs costs that scale much less gently. The result has been that most practical applications of Linked Data use only the most relevant subsets of available data. If data network effects were pervasive and stronger than inverse network effects, this would generally not happen.
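That figure falls straight out of the logarithm, assuming lookup cost proportional to log of the record count (as in a balanced search tree or B-tree index):

```python
import math

# Assumed model: per-lookup cost proportional to log(number of records).
cost_1m = math.log(1_000_000)
cost_10m = math.log(10_000_000)

# The ratio is log(10^7) / log(10^6) = 7/6, about a 17% increase.
print(f"cost increase: {cost_10m / cost_1m - 1:.1%}")
```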

In my movie and actor example, I assumed that any particular query was equal in value to any other query. In practice this is not true. Most people would agree that there's more utility in knowing that Humphrey Bogart acted in Casablanca than in knowing that Michael Ripper acted in The Reptile. In many data sets, the 80/20 rule (also known as the Pareto principle) applies: 80% of the real-world queries would exercise only 20% of the links. The least useful data is typically the most expensive data to acquire, so if we start by adding the most valuable actedIn links, then every additional link reduces the average value of the links in the collection. This mimics an inverse network effect, as the total value of the collection grows more slowly with every added link, rather than growing more quickly.
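A quick simulation shows the effect. Here I assume, purely for illustration, that link values fall off as 1/k, a Pareto-like tail, and that we add links best-first:

```python
# Hypothetical link values with a heavy-tailed (1/k) falloff,
# already sorted in decreasing order of value.
values = [1 / k for k in range(1, 101)]

total = 0.0
averages = {}
for i, v in enumerate(values, start=1):
    total += v
    if i in (10, 50, 100):
        averages[i] = total / i
        print(f"{i:3d} links: average value per link {averages[i]:.3f}")
```

Total value keeps rising, but the average value per link falls with every addition, which is exactly the mimicked inverse network effect described above.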

The main take-away from this is that you can't look at Linked Data objectively and conclude that it exhibits strong network effects without taking into account the application it's being used for. Some applications will exhibit strong, even exponential network effects, others may exhibit inverse network effects. And sometimes a grilled cheese sandwich is just a sandwich.


