Friday, November 13, 2009

The New York Times Gets It Right; Does Linked Data Need a CrossRef or an InfoChimps?

I've been saying this long enough that I don't remember whether I was quoting someone else: whenever the internet disintermediates a middleman, two new intermediaries pop up somewhere else. It's disintermediation whack-a-mole, if you will. The reasons for this are:
  1. The old middlemen became fat on mark-ups an order of magnitude larger than needed by internet-enabled middlemen.
  2. Internet-enabled middlemen add value in ways that the old ones didn't.
My last business functioned as an intermediary that aggregated linking data. We'd get data from publishers, clean it up and add it to our collection, then provide feeds of that data to our customers (libraries and library systems vendors). Our customers got good data and support if there was a problem. The companies who provided the data didn't have to deal with hundreds of libraries or system vendors, and they came to understand that we would help their customers link to their content.

Some companies, especially the large ones, were initially uncomfortable with the knowledge that we were selling feeds of data that they were giving out for free. They felt that somehow there was money left on the table. Other companies were fearful of losing control of the information, even though they didn't really have control of it in the first place. Once we explained to them how their data contained mangled character encodings, fictitious identifiers, stray column separators and Catalan month names, they began to see the value we provided.
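To give a flavor of the kind of feed cleanup described above, here is a minimal sketch (not our actual pipeline; the column layout, check rules, and month table are illustrative assumptions) of how an aggregator might flag the mangled encodings, stray separators, and localized month names that show up in publisher feeds:

```python
# Illustrative sketch of feed sanity checks -- the column count, the
# mojibake heuristic, and the month table are assumptions for this example.

CATALAN_MONTHS = {  # hypothetical lookup of Catalan month names seen in feeds
    "gener": "01", "febrer": "02", "març": "03", "abril": "04",
    "maig": "05", "juny": "06", "juliol": "07", "agost": "08",
    "setembre": "09", "octubre": "10", "novembre": "11", "desembre": "12",
}

def check_row(row, expected_cols=4):
    """Return a list of problems found in one tab-separated feed row."""
    problems = []
    fields = row.rstrip("\n").split("\t")
    if len(fields) != expected_cols:
        # A stray separator inside a field silently shifts every
        # column after it, so the field count is the first tripwire.
        problems.append("wrong field count: %d" % len(fields))
    for field in fields:
        # Mangled encodings often surface as UTF-8 bytes decoded as
        # Latin-1 (e.g. "Ã©" where "é" was meant).
        if "Ã" in field or "Â" in field:
            problems.append("suspected mojibake: %r" % field)
        if field.lower() in CATALAN_MONTHS:
            problems.append("localized month name: %r" % field)
    return problems

print(check_row("10.1000/xyz\tJournal of Ã‰tudes\tgener\t2009"))
```

A real pipeline would go on to normalize each flagged field rather than just report it, but even checks this simple catch a surprising share of what publishers send.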

While my company focused on the data needs of libraries (and did pretty well), a group of the largest academic publishers put up some money and formed a consortium to pool a different type of linking data in a way that let the publishers have more control of the data distribution. This consortium, known as Crossref, just celebrated its 10th anniversary. Crossref has not only paid back the money that its founders invested in it; it has arguably done more to push academic publishing into the 21st century than any other organization on the planet.

As academic publishing companies began to understand the benefits of distributing linking data through Crossref, my company, and others like it, they became more comfortable opening up their content and reaping the financial benefits. Despite the global recession, and despite predictions of its impending collapse, STM publishing has been financially healthy with companies such as Elsevier reporting increased profits. This is rather unlike the newspaper industry, for example.

Before I get to the newspaper industry, I should note yesterday's news that InfoChimps is publishing a collection of token data harvested from Twitter.
Today we are publishing a few items collected from our large scrape of Twitter’s API. The data was collected, cleaned, and packaged over twelve months and contains almost the entire history of Twitter: 35 million users, one billion relationships, and half a billion Tweets, reaching back to March 2006.
InfoChimps is positioning itself as a marketplace to buy, sell, and share data sets of any size, topic or format. Yet another intermediary has popped up!

Two weeks ago, I wrote a somewhat alarmist article about problems in an exciting set of Linked Data being released by the New York Times. I am pleased to report that the New York Times is now getting it right! The most important thing they're doing right is listening to the people who want to consume their data. They've started a Google Group-based community for the specific purpose of understanding how best to deliver their data. They've also corrected the problems pointed out by myself and others. It's not perfect, but it's not reasonable to expect perfect. The New York Times has set a very hopeful example for other companies that want to start publishing semantic linking information on the open web.

If, as many of us hope, many publishers decide to follow the lead of the Times and make more data collections available, will more intermediaries such as InfoChimps arise to facilitate data distribution, as happened with linking data in scholarly publishing? Will ad hoc groups such as "the Pedantic Web" become key participants in a less centralized data distribution environment? Or will large companies turn off the spigots as "the suits" grow increasingly worried about their ability to control data once it is let out into the web of data?

Perhaps the time is ripe for a set of forward-looking publishers to emulate the nervous-but-smart journal publishers who started Crossref 10 years ago and start a similar consortium for the distribution of Linked Data.
