Monday, June 22, 2009

The New York Times and the Infrastructure of Meaning

The big announcement at last week's Semantic Technology Conference came from the New York Times. Rob Larson and Evan Sandhaus announced that the New York Times would be releasing its entire thesaurus as Linked Data sometime soon (maybe this year). I've been very interested in looking at business models that might support the creation and maintenance of Linked Data, so I've spent some time thinking about what the New York Times is planning to do. Rob's presentation spent a lot of time evoking the history of the New York Times, and tried to make the case that Adolph Ochs's decision in 1913 to publish an index to the New York Times played a large part in the paper's rise to national prominence as one of the nation's "Newspapers of Record". The decade of that decision was marked by an extremely competitive environment for New York newspapers; the NYT competed with a large number and variety of other papers. I don't know enough about the period to know whether that's a stretch, but I rather suspect that the publication of the index was a consequence of a market strategy that proved successful rather than the driver of that strategy. The presentation suggested a correspondence between the decade of the 1910s and our current era of mortal challenges to the newspaper business. The announcement about Linked Data was thus couched as a potentially pivotal moment in the paper's history: by moving decisively to open its data to the semantic web, the New York Times would be sealing its destiny as a cultural institution integral to our society's infrastructure of meaning.

The actual announcement, on the other hand, was surprisingly vague and quite cautious. It seems that the Times has not decided on the format or the license to be used for the data, and it's not clear exactly what data they are planning to release. Rob Larson talks about releasing the "thesaurus" and about releasing "tags". These are not the terms that would be used in the semantic web community or in the library community. A look at the "TimesTags API" documentation gives a much clearer picture of what Rob means. Currently, this API gives access to the 27,000 or so tags that power the "Times Topics" pages. Included as "tags" in this set are
  • 3,000 description terms
  • 1,500 geographic name terms
  • 7,500 organization name terms
  • 15,000 person name terms
The Times will release as Linked Data "hundreds of thousands" of tags dating back to 1980, then in a second stage will release hundreds of thousands more tags that go back to 1851. They want the community to help normalize their tags and connect them to other taxonomies. According to Larson, "the results of this effort, will in time, take the shape of the Times entering (the linked) data cloud." I presume this to mean that the Times will create identifiers for entities such as persons, places, organizations, and subjects, and make these entities available for others to use. Watch the announcement for yourself:
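To make "entering the data cloud" a bit more concrete, here is a minimal sketch of what minting identifiers for those entities could look like. The base URL, the tag-type codes, and the slug scheme are all my assumptions for illustration; the Times has not announced any format.

```python
from urllib.parse import quote

# Hypothetical URI scheme for Times entities. The base URL, type codes,
# and slug rules below are assumptions, not anything the Times announced.
BASE = "http://data.nytimes.com/"

TAG_TYPES = {
    "per": "person name",
    "geo": "geographic name",
    "org": "organization name",
    "des": "descriptor",
}

def mint_uri(tag_type, label):
    """Mint a stable, dereferenceable identifier for one thesaurus tag."""
    if tag_type not in TAG_TYPES:
        raise ValueError("unknown tag type: %s" % tag_type)
    slug = quote(label.lower().replace(" ", "_"))
    return BASE + tag_type + "/" + slug

print(mint_uri("per", "Ochs Adolph S"))   # one of the ~15,000 person name terms
print(mint_uri("geo", "New York City"))   # one of the ~1,500 geographic terms
```

The point of minting one stable URI per entity is that anyone else on the web can then say things about "the Times's Adolph Ochs" without ambiguity.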

I've found that it's extremely useful to think of "business models" in terms of the simple question "who is going to write the checks?" The traditional business model for newspapers has been for local advertisers and subscribers to write the checks. Advertisers want to write checks because newspapers deliver localized aggregates of readers attracted by convenient presentations of local and national news together with features such as comics, puzzles, columns and gossip. Subscribers write checks because the paper is a physical object that provides benefits of access and convenience to the purchaser. Both income streams are driven by a readership that finds reading the newspaper to be an important prerequisite to full participation in society. What Adolph Ochs recognized when he bought control of the Times in 1896 was that there was an educated readership that could be attracted and retained by a newspaper that tried to live up to the motto "All the news that's fit to print". What Ochs didn't try to do was to change the business model.

The trials of the newspaper industry are well known, and the business model of the New York Times is being attacked on all fronts. Newspapers have lost their classified advertising business because Craigslist and the like serve that need better and cheaper. Real estate advertising has been lost to Zillow and the online Multiple Listing Service. The New York Times has done a great job of building up its digital revenue, but the bottom line is that hard news reporting is not as effective an advertising venue as other services such as search engines. Subscribers, on the other side, are justifiably unwilling to pay money for the digital product, because the erection of toll barriers makes the product less convenient rather than more convenient. Nonetheless, the digital version of the New York Times retains the power to inform its readership, a power that advertisers will continue to be willing to pay for. It's also plausible that the New York Times will be able to provide digital services that some subscribers will be willing to pay for. So, assuming they don't go bankrupt, the business model for the future New York Times does not look qualitatively different from the current model (at least to me), even if the numbers are shifting perilously in the near future.

So let's examine the stated rationales for the New York Times to join the Linked Data community, and how they might help to get someone to send them some checks. The first and safest stated rationale is that by entering the linked data cloud, traffic to the New York Times website will increase, thus making the New York Times more attractive to advertisers. So here's what puzzles me. What Rob Larson said was that they were going to release the thesaurus. What he didn't say was that they were also going to release the index, i.e., the occurrences of the tags in the articles. Releasing the index together with the thesaurus could have a huge beneficial impact on traffic, but releasing the thesaurus by itself will leave a significant bottleneck on the traffic increase, because developers would still have to use an API to get access to the actual article URIs. More likely, most developers who want to access article links would use more generic APIs such as those you'd get from Google. Why? If you're a developer, not many people will write you checks for code that only works with one newspaper.

I would think that publication of occurrence coding would be a big win for the NYT. If you have articles that refer to a hundred thousand different people, and you want people interested in any of those people to visit your website, it's a lot more efficient for everyone involved (and a lot less risk of "giving away the store") for you to publish occurrence coding for all of these people than it would be for everyone who might want to make a link to that article to try to do indexing of the articles. The technology behind Linked Data, with its emphasis on dereferenceable URIs, is an excellent match for business models that want to drive traffic via publication of occurrence coding.
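Here is a minimal sketch of what published occurrence coding amounts to: a mapping from each entity tag to the articles it occurs in, so a consumer goes straight from entity to article links without any per-article API calls. The tag names and article URLs below are invented placeholders, not real Times identifiers.

```python
# Occurrence coding, sketched as plain data: for each thesaurus tag,
# the articles in which it occurs. All identifiers and URLs here are
# invented for illustration.
occurrences = {
    "nyt:per/ochs_adolph_s": [
        "http://example.org/articles/1913-index-announcement",
    ],
    "nyt:geo/new_york_city": [
        "http://example.org/articles/1913-index-announcement",
        "http://example.org/articles/1896-ochs-purchase",
    ],
}

def articles_for(tag):
    """Dereference an entity tag straight to article links."""
    return occurrences.get(tag, [])

print(articles_for("nyt:geo/new_york_city"))
```

With this data published once by the Times, nobody downstream ever has to re-index the articles themselves, which is exactly the efficiency argument above.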

Let's look at the potential costs of releasing the index. Given that the Times needs to produce all of the occurrence data for its website, the extra cost of releasing the linked data for the index should be insignificant. The main costs of publishing occurrence data as Linked Data are the risks to the Times' business model. By publishing the data for free, the Times would cannibalize revenue or prevent itself from being able to sell services (such as the index) that can be derived from the data, and in this day and age, the Times needs to hold onto every revenue stream that it can. However, I think that trying to shift the Times business model towards data services (i.e. selling access to the index) would be a huge risk and unlikely to generate enough revenue to sustain the entire operation. Another serious risk is that a competitor might be able to make use of the occurrence data to provide an alternate presentation of the Times that would prove to be more compelling than what the Times is doing. My feeling is that this is already happening to a great extent; I personally access Times articles most frequently from my My Yahoo page.

The other implied rationale for releasing data is that by having its taxonomy become part of the Linked Data infrastructure, the New York Times will become the information "provider of record" in the digital world the way the index helped it become one of the nation's "newspapers of record". The likelihood of this happening seems a bit more mixed to me. Having a Times-blessed set of entities for people, places and organizations seems useful, but in these areas the Times would be competing with more open, and thus more useful, sets of entities such as those from DBpedia. For the Times to leverage its authority to drive adoption of its entities, it would have to link authoritative facts to its entities. However, deficiencies in the technology underlying linked data make it difficult for asserted facts to retain the authority of the sources that assert them. Consider a news article that reports the death of a figure of note. The Times could include in the coding for that article an assertion of a death date property for the entity corresponding to that person. It's complicated (i.e. it requires reification) to ensure that a link back to the article stays attached to the assertion of the death date. More likely, the asserted death date will evaporate into the Linked Data cloud, forgetting where it came from.
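The reification problem can be sketched in a few lines. The entity and article identifiers below are invented placeholders; the `rdf:subject`/`rdf:predicate`/`rdf:object` properties are the standard RDF reification vocabulary, and `dc:source` is a Dublin Core property, used here as one plausible way to attach provenance.

```python
# A plain triple asserting a death date. Once it enters the linked data
# cloud there is no slot left for "who said so":
plain = ("nyt:per/some_person", "bio:dateOfDeath", "2009-06-20")

# RDF reification turns the statement itself into a resource, so a
# source link can be attached to it. All identifiers are invented.
stmt_id = "nyt:stmt/42"
reified = [
    (stmt_id, "rdf:type",      "rdf:Statement"),
    (stmt_id, "rdf:subject",   "nyt:per/some_person"),
    (stmt_id, "rdf:predicate", "bio:dateOfDeath"),
    (stmt_id, "rdf:object",    "2009-06-20"),
    # the extra triple that keeps the assertion tied to its article:
    (stmt_id, "dc:source",     "nyt:article/obituary-2009-06-21"),
]

def source_of(statement_id, triples):
    """Recover which article backs a reified assertion."""
    return next(o for s, p, o in triples if s == statement_id and p == "dc:source")

print(source_of(stmt_id, reified))
```

Five triples instead of one, just to remember where a single fact came from: that overhead is a fair summary of why asserted facts tend to shed their provenance in practice.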

It will be interesting to see how skillful the Times will be in exploiting participation in Linked Data to bolster its business model. I'll certainly be reading the Times' "Open" blog, and I hope, for the Times' sake, that they go ahead and release occurrence data along with the thesaurus. The caution of Rob Larson's announcement suggests to me that the Times is a bit fearful of what may happen. Still, it's one small step for a gray lady. One giant leap for grayladykind?