Friday, July 31, 2009

Ignition Timing for Semantic Web Library Automation Engines

Last weekend, I had a chance to learn how to drive a 1915 Model-T Ford. It's not hard, but a Model-T driver needs to know a bit more about his engine and drivetrain than the driver of a modern automobile. There is a clutch pedal that puts the engine into low gear when you press it- high gear is when the pedal is up and neutral is somewhere in between. The brake is sort of a stop gear, and you need to make sure the clutch is in neutral before you step on the brake. The third pedal is reverse.

There are a lot more engine controls than on a modern car. In addition to the throttle and the choke, there is another lever that controls the ignition timing. A modern Model-T driver doesn't have to worry much about the timing once the engine has started, because modern fuel has much higher octane than fuel had in 1915. I would not have understood this except that I recently got a new car whose manual says you should use only premium fuel, and so I did some Wikipedia research to find out what octane had to do with automobile engines. But I could have lived blissfully in ignorance. Believe it or not, I have opened the hood of my new car only once since I got it in December.

It occurs to me that in many ways, the library automation industry is still in the Model-T era, particularly with regard to the relationship of the technology to its managers. Libraries still need to keep a few code mechanics on staff, and the librarians who use library automation to deliver services still need to know a lot more about their data engines than I know about my automobile engine. The industry as a whole is trying to evaluate changes roughly analogous to the automobile industry switching to diesel engines.

I've been reading Martha Yee's paper entitled "Can Bibliographic Data Be Put Directly Onto the Semantic Web?" and Karen Coyle's commentary on this paper. I greatly admire Martha Yee's courage to say, essentially, "I don't understand this as well as I need to, here are some questions I would really appreciate help with". When I worked at Bell Labs, I noticed that the people who asked questions like that were the people who had won or would later win Nobel prizes. Karen has done a great job with Martha's queries, but also expresses a fair amount of uncertainty.

I was going to launch into a few posts to help fill in some gaps, but I find that I have difficulty knowing which things are important to explain. Somehow I don't think that Model-T drivers really needed to know about the relationship between octane and ignition timing, for example. But I think that people running trucking companies need to know some of the differences between diesel engines and gasoline engines as they build their trucking fleets, just as community leaders like Martha Yee and Karen Coyle probably need to know the important differences between RDF tuple-stores and relational databases. But the more I think about it, the less I'm sure about which of the differences are the important ones for people looking to apply them in libraries.

Another article I've been reading has been Greg Boutin's article "Linked Data, a Brand with Big Problems and no Brand Management", which suggests that the technical community that is pushing RDF Linked Data has not been doing a good job of articulating the benefits of RDF and Linked Data principles in a way that potential customers can understand clearly and consistently.

Engineers tend to have a different sort of knowledge gap. I have a very good friend who designs advanced fuel injectors. He is able to do this because he has specialized so that he knows everything there is to know about fuel injectors. He doesn't need to know anything about radial tires or airbag inflators or headlamps. But to make his business work, he needs to be able to articulate to potential customers the benefits of his injectors in the context of the entire engine and engine application. Whether the technology is Linked Data or fuel injectors, that can be really difficult.

My first guess was that it would be most useful for librarians to understand how indexing and searching are almost the same thing, and that indexing is done quite differently in RDF tuple-stores and in relational databases. But on second thought, that's more like telling the trucking company that diesel engines don't need spark plugs. It's good to know, but the higher-level fact that diesels burn less fuel is a lot more relevant. Isn't it more important to know that an RDF tuple-store trades off performance for flexibility? How do you know the right questions to ask, when you don't know where to start? We find ourselves working across many disciplines, each of which is more and more specialized, and we need more communications magic to make everything work together.
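To make the flexibility half of that trade-off concrete, here is a toy sketch in Python. It is entirely invented for illustration and is not modeled on any real triple store; real stores build several indexes (SPO, POS, OSP and so on) to make these patterns fast. The point is only that when every fact is a (subject, predicate, object) triple, a new kind of data needs no schema change, whereas a relational table would need an ALTER TABLE.

```python
# A toy triple store. A wildcard match here plays the role of a
# variable in a SPARQL triple pattern.

class ToyTripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        # None acts as a wildcard, like a SPARQL variable.
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

store = ToyTripleStore()
# Any predicate can be added at any time - no schema change needed.
store.add("book:1", "dc:title", "Chaos: Making a New Science")
store.add("book:1", "dc:creator", "James Gleick")
store.add("book:1", "ex:shelfLocation", "QA614.8")  # an unplanned attribute

titles = store.match(p="dc:title")
```

The price of that flexibility is that answering a realistic query means stitching together many small triples where a relational database would read one wide, pre-joined row, which is where the performance trade-off comes from.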

I'll try to do some gap-filling next week.

Wednesday, July 29, 2009

Google Books Settlement Agreement Panel at New York Public Library

I've been following the Google Book Search Settlement Agreement very closely, because I think that whatever happens with it will have a large effect on the information industry. In particular, the Book Rights Registry that it would create could implement some extremely useful machinery, and whether we like it or not, anything we do that touches books would have to make room for it.

Yesterday, I attended a panel discussion of the Google Book Search Settlement Agreement at the New York Public Library. It used to be that I never took notes on anything, but I've found that taking notes on Twitter can be a lot of fun. The panel was hosted by NYPL Director David Ferriero, who earlier in the day was announced as President Obama's choice to be "Archivist of the United States", whatever that means. For the most part, the panel represented the participants in the Settlement Agreement. David Drummond, Google's Senior VP for Corporate Development and Chief Legal Officer, represented Google- he was the leader of the team that negotiated the agreement for Google. Representing the publishers was Richard Sarnoff, Co-Chairman of Bertelsmann Inc. and President of the Association of American Publishers when it negotiated the deal with Google. There were two representatives of the authors: Jim Gleick, the well-known science writer and author of Chaos: Making a New Science, and Peter Petre- somehow the NYPL managed to get the ghostwriter of The Age of Turbulence: Adventures in a New World on the same panel. For its neutral voice, the panel had Jonathan Band, a lawyer and author of "A Guide for the Perplexed: Libraries & the Google Library Project Settlement" and "A Guide for the Perplexed Part II: The Amended Google-Michigan Agreement", explanations of the settlement agreement commissioned by a number of library organizations.

For most of the time, the panel answered questions from the audience, which I usually find to be the most useful part of a panel; there was no boring round-robin of opening statements. I should note first what I did NOT hear. If all you know about the settlement agreement is what you have read in the New York Times and the Wall Street Journal, or if you mostly pay attention to blog posts and Twitter feeds, you would get the distinct impression that the settlement agreement was in deep trouble. You would think that the Justice Department was about to throw the antitrust book at Google (the paper version, and it's a big book), that libraries were rushing to end their agreements and that Europe was threatening to speak only French. The panel, for the most part, seemed unconcerned with these scenarios, and the audience was polite, interested and in no way hostile.

The panel opened with a summary of the agreement by David Drummond. He thought it not surprising that, given the size and scope of the agreement, some people are wishing that parts of the settlement had been written differently. He added that it was a lengthy effort to get to an agreement- three years. David Ferriero interjected that "libraries wish they had been brought into the discussion earlier than 2 years into the lawsuit", but it seemed that no hard feelings remain there.

In response to a question about the adequacy of one free-access terminal per library building, Jim Gleick noted that every computer in the world would effectively become a terminal with increased access to books. In his own work, he has become a heavy user of Google Book Search, and 90% of his usage would fall within the 20% of books that Google would be showing consumers for free.

Peter Petre then launched into an explanation of the Book Rights Registry, stating that it would have as its primary mission to increase access to Google Books through additional licensing agreements.

Jonathan Band mentioned the concerns from various library groups that the institutional subscription pricing of Google Book Search would be exorbitant. Determining what that price would be is "where the rubber meets the road". Richard Sarnoff's response to this was that the settlement agreement is "organic", i.e. many important aspects are not fixed in stone. For example, if the New York Public Library found that there were lines out the door to use the one free terminal in the building, that provision could be adjusted. "If one terminal is not enough, how many do you want?" While a report of this drew snickers on Twitter, the response from the real audience was quite positive. There seems to be a genuine willingness from the publishers to respond to community concerns, though there was a recurring theme that it was hard to predict the future of a product that does not exist yet.

In response to a question about how the concerns of libraries would be represented in the agreement, Petre said that one of the first items on the agenda of the Book Rights Registry, once it is constituted, would be the creation of advisory committees for libraries and for the public. Band noted that the library associations want to see strong continuing jurisdiction over the settlement agreement by the court.

A lengthy, discursive question from Columbia University's Kenneth Crews about whether the settlement agreement would reshape the future of digital information sparked an amusing set of responses. Petre said, very briefly "Yes", while Drummond emphatically said "No." Drummond thought it a great stretch to think the settlement agreement will set the terms for everything in the digital future, because its scope is quite narrow, and only applies to in-copyright, out-of-print books. Evidently Google has other things (video, to name one) to worry about. Gleick added that the settlement only affects books already published. The future is unaffected. Sarnoff characterized the agreement as being only about "horses out of the barn". From the point of view of the publishers, the settlement is a recognition by the publishers that access to copyrighted content will expand; it tries to do this without having all the value leak out of the book publishing industry.

In response to a question about orphan works, Band answered that the only alternative to the settlement agreement is legislation. There is no reason to assume that legislation would be any better for anyone except lobbyists. He said that from the perspective of being a lobbyist himself.

A discussion of who gets what money followed. Band pointed out, with more than a hint of irony, that lawyers get more money than anybody. "As it should be." My mind must have been wandering, because that comment woke me up! A quick look at section 5.5 of the settlement agreement puts the plaintiffs' attorneys' fees at $30 million out of a total settlement of about $120 million. Not a bad day's work, even if it took 3 years.

A question from a publisher concerned international rights. Sarnoff pointed out that there was no mechanism in the settlement to extend benefits outside the US, nor was such a thing possible under the law. Google would have to make agreements country by country to be able to make available in other countries what the US will have.

What gives teeth to the public's interest in Google Books? Jim Gleick quipped "I'll bet David Drummond is glad he wasn't the one to have to explain Amazon's 1984 fiasco with the Kindle." He went on to say that the public was right to be concerned about issues like privacy, but that the settlement agreement was not the place to address those concerns. Band added that the consideration given the public interest under the settlement agreement is likely to be much better than what it would get in the absence of a settlement agreement. As for libraries, Drummond said that Google will set the price for institutional subscriptions according to settlement "principles", and that partners such as University of Michigan can use their agreements to "hold Google's feet to the fire".

Following the discussion of the accountability of Google and of the Book Rights Registry, Petre wanted to highlight that for the first time there is a procedure set out to resolve disputes between authors and publishers. This is "truly historic."

Returning to the subject of orphan works, Drummond characterized the settlement as "family reunification for orphan works". Google thinks this will, in the end, be a very small percentage of the total. In response to a later question, he estimated that it would end up being 10% of all the works subject to the agreement. With roughly 20 million total works, this leaves 2 million orphans. Sarnoff pointed out that unlike photographs, books have embedded metadata which makes them pretty easy to track down and take out of orphan status. Gleick suggested that money tends to make orphans disappear- there's no such problem as orphan works in the music industry because of the rights registration organizations BMI and ASCAP.

James Grimmelman asked the best question of the night. "Suppose I started an organization with a purpose to digitize books and let you know about it. Would you settle a copyright lawsuit with me on the same terms as you've given Google?" "For you, sure", quipped Sarnoff. He then answered affirmatively with cautious qualifications. "We'd be inclined to negotiate something similar" with another capable party that was able to provide the sorts of assurances that Google has given. An agreement would certainly be easier the second time around. A subsequent discussion about "inserts" revealed that Google had initially been unaware of the complications of the book licensing environment and that the education process led to considerable delay in reaching an agreement.

With hardly a question about privacy or censorship from the audience, the panel addressed the issue directly. This gave David Drummond the chance to tout the fact that he was the subject of a criminal warrant in Italy and wave the flag for Google's brave stance to deliver YouTube in authoritarian countries. Jim Gleick pointed out that if Google removes something from Google Books, it would be required by the agreement to notify the Book Rights Registry and to provide it with the digital copy. This would act as a safeguard against censorship. Peter Murray asked me if the library would get a copy- the answer is that the fully participating library that provided the book would have the copy, and of course they also would still have the book.

All in all, there was no "news" from the panel, so I expect you'll not read much about it. But there's an editorial in the New York Times today that bravely comes out in favor of copyrights, against monopoly, and in favor of respecting privacy. I feel so much better.

The Illusion of Internet Identity

You've certainly heard of Arthur C. Clarke's Third Law, "Any sufficiently advanced technology is indistinguishable from magic", which says more about magic and our perceptions of the world than it does about technology. When technology does something that is not natural to us, we of course perceive it to be supernatural. But what happens when technology approximates something so natural to us that we don't even perceive that there's anything remarkable? Then we attribute powers to the technology that just don't exist. Just as we can perceive emotions in a stuffed teddy bear, it is only with difficulty that we avoid anthropomorphizing technologies. Do you have a cute name for your car? Do you refer to your GPS as "Lola"? If you have not done so, try the web version of ELIZA, and see if you can avoid thinking of ELIZA as a real person. It's very hard for us to understand how complicated the act of carrying on a real conversation really is- we do it all the time. Even a profoundly limited technology will be imbued with magic if its function is sufficiently mundane.

I've been reading a paper by Patrick Hayes and Harry Halpin and a presentation by Pat Hayes, both with the unfortunate title "In Defense of Ambiguity". The paper provides a wonderful review of the theory of identity. I've been living very happily, doing productive work in the world of identifiers without ever knowing that identity needed to have a theory behind it. In retrospect, I've managed to do this by staying away from the difficult bits.

After reading Hayes and Halpin, I've come to realize what a miracle human communication is. The fact that I can meet someone with whom I share no languages and that we can exchange our own names and establish names for things may seem simple, but it's something that machines cannot do. For example, I may gesture at myself and say "Eric" to establish my identity. If I then gesture at a banana and say "banana", it's very likely that my counterpart will understand that I have not established an identity for the banana, but rather I have given a name for the kind of fruit. This is possible because people have brains that are similarly wired- our brains are wired to recognize individual people but not individual bananas (though sometimes our knowledge models diverge). Our computers, on the other hand, have no fruit wiring or individual person wiring, so establishment of identity is very hard for them.

The difficulty of teaching computers to identify things has not stopped us from using them to build elaborate identity systems. Hayes and Halpin observe that internet identity can only be established by description, and description is inherently ambiguous. Attempts to make real-world-object identifiers global or to add description actually make the situation worse, by increasing ambiguity. In our daily lives, ambiguity in our communications is mostly not a problem. When I say the word "rose" a listener will almost never be confused between the flower and the verb. I can say "rows of rosebushes" and only rarely will people hear "rose of rosebushes". Our brains are so good at using context to resolve ambiguity that we don't realize how hard it is for computers to do the same thing.

That the situation becomes worse with added description was a bit hard for me to absorb, because at first it seems that the better you define something, the less ambiguous your statements about it become. But it's not true for computers. Suppose my internet identity description added a physical description of me- for example the fact that I have blonde hair. That might help to identify me under certain circumstances, but then when my hair turns gray, it makes my identification more tenuous. You could say that I had blonde hair on a particular date, but then you'd need to add a physical model for hair color to your internet identity system. In actual fact, the added description might help a human to identify me, but it hurts a computer's efforts to establish my identity.

The Hayes and Halpin paper was written in the context of the "httpRange-14" semantic web controversy that I touched upon in my post on the semantics of redirection. They argued that the http protocol is not the right place to put the establishment of identity, and that the description model is better suited to do that. As I understand it, the Hayes-Halpin view did not prevail with the W3C TAG; "ambiguity" was not a great concept for people to rally around, I guess.

The internet identity systems I've worked with revolved around identifying things in libraries- books, serials, and articles. For the most part, these sorts of objects do not usually present deep identification quandaries, and so I've not noticed my ignorance of identity theory. For example, most people imagine that computers use ISBNs to identify books (or as I imagined in my last post, that computers use ISBNs to identify items in bookstores). Most often this illusion does not get us into any trouble, just as the illusion that teddy bears have feelings is mostly harmless. Computers are wired to deal with records in data files, and they use ISBNs (with frequent success) to identify and match records in data files, that's all. The rest is just software trickery.
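To make "identify and match records in data files" concrete, the trick is mostly normalization: reduce every ISBN to one canonical form before comparing. The check-digit arithmetic below follows the published ISBN-13 rule; the record-matching wrapper is just an illustrative sketch, and real matching code handles many more edge cases (invalid digits, the newer 979 prefix that has no ISBN-10 form, ISBNs reused across editions).

```python
def isbn13_check_digit(first12):
    """Standard ISBN-13 check digit: digits weighted 1, 3, 1, 3, ..."""
    total = sum((1 if i % 2 == 0 else 3) * int(d)
                for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)

def normalize_isbn(raw):
    """Reduce an ISBN-10 or ISBN-13 string to a canonical 13-digit form."""
    digits = raw.replace("-", "").replace(" ", "").upper()
    if len(digits) == 10:
        # Convert ISBN-10 to ISBN-13: prefix 978, drop the old check
        # digit (which may be 'X'), and recompute the new one.
        first12 = "978" + digits[:9]
        return first12 + isbn13_check_digit(first12)
    return digits

def same_record(a, b):
    """Do two catalog records carry the same ISBN after normalization?"""
    return normalize_isbn(a) == normalize_isbn(b)
```

So `same_record("0306406152", "978-0-306-40615-7")` is true even though the two strings look nothing alike- which is all the "identification" the computer is really doing.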

It's interesting to note that the ISBN was not developed by the library community. It was developed by a statistics professor named Gordon Foster for the British Publishers Association. Librarians lived without identifier systems for many years and were content with the library equivalents of an address or locator system. It's as if librarians have intuitively known something that the architects of the semantic web have only recently struggled with- that we can aspire to build description systems and access systems, but building a system that can provide identity is more difficult than it looks to a human.

Friday, July 24, 2009

If Elvis had an OpenID and the Mome Raths Outgrabe

If the space aliens that kidnapped Elvis decided to return him tomorrow, and he decided to use Twitter and a blog to communicate that fact, how would anyone know it was really him? Would the National Enquirer even bother to report the news? Would Elvis ever be able to reclaim his public or private identities? Would he be able to remember any of his passwords?

Password proliferation has been a problem for so long that innovators have solved it over and over again. The library world came up with a single-sign-on authentication/authorization system called Shibboleth, and then implemented EZProxy so they wouldn't have to deal with it. The UK developed the single-sign-on system called Athens. The dot-com bubble came up with a bunch of single-sign-on companies; some of them, including PassLogix and Imprivata are still at it. I am still waiting for the announcement of the Single-Single-Sign-On system.

OpenID took a different approach, and is now somewhat usable for the purpose of allowing people to establish an identity with one provider that can be used on many websites. For example, I've used the OpenID identity "" to register comments on the Semtech 2009 website and Paul Miller's Cloud of Data blog. On my last post, comments were left by "nicomo", and "breizhlady", whose OpenIDs are and . Jodi Schneider used Blogger credentials to leave her comment. My OpenID can be used to determine with some degree of certainty that the Eric Hellman who left a comment on Cloud of Data is the same Eric Hellman who's writing on this blog. A bit of googling will tell you who nicomo and breizhlady are, if you really want to know. If Elvis had been issued an OpenID before he left, we would be able to tie his new blog to his old identity.

There are still the single-single problems with OpenID. The user experience for OpenID systems gets a bit clunky- my wife was frustrated when she tried to leave a comment on this blog. But overall there seems to be slow convergence and user acceptance of OpenID.

This brings me to the questions I wanted to raise today: What does an OpenID identify? Does identify me? Can these OpenIDs be used to make assertions about people to enter into the Linked Data Cloud? How should the Linked Data semantics for redirects be implemented for OpenID? Should 303 redirects be used to indicate that the "thing" being identified by OpenID is a real-world object?

To some extent, it's really the way identifiers are used that determines semantics- identification of any real-world object can never have perfect accuracy. The use of ISBN to identify a book is a good example. Although ISBN is frequently used to identify a book, ISBNs are managed in such a way that they most accurately identify items sold in a bookstore- toys and dolls often get ISBNs. Similarly, you might think that the US identifies people with Social Security Numbers (SSN), but if you think about it, the "thing" an SSN most accurately identifies is an account with the Internal Revenue Service. Similarly, I think it's pretty clear that an OpenID identifies a set of login credentials, although people might well use the OpenID to identify the person or persons behind it.

I have been guilty in the past of driving people to distraction by arguing that it can be almost impossible to decide whether something is an "information resource" (something whose essential characteristics can be conveyed in a message) or whether it is a "real-world object". It's pretty easy to blur the issue with an e-book, for example, but what about the SSN? It used to be that "an IRS account" was something on paper somewhere, but I'm pretty sure that my entire IRS account is digitized somewhere.

Section 3 of the W3C's Technical Recommendation "Cool URIs for the Semantic Web" assumes that it's easy to determine whether something is an information resource or whether it's a real-world object and that it's impossible to convey the essence of real-world objects in a stream of bits. I find this a bit unworldly. It even cites the unicorn as an example of a "real-world object". I guess that makes Elvis a real-world object, too. Conversely, even things that live completely on the internet are rarely "conveyed in a message" any more. A typical URI-addressable service today is constructed out of software, web services, content delivery networks, advertising delivery networks and clustered hardware so that the "essential characteristics" include the attributes of real world objects like me.
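Whatever one thinks of the distinction, the mechanics the Recommendation prescribes are simple enough to demonstrate. Here is a minimal sketch using only Python's standard library; the URIs, paths and names are invented for illustration. A URI that names a real-world object (a person) can't serve the person himself, so the server answers 303 See Other, pointing the client at a separate document that merely describes him.

```python
# Sketch of the "Cool URIs" 303 pattern: /id/... names the thing,
# /doc/... is the information resource describing the thing.
import threading
import http.client
from http.server import BaseHTTPRequestHandler, HTTPServer

class CoolURIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/id/people/eric":
            # The thing itself is not an information resource: redirect.
            self.send_response(303)
            self.send_header("Location", "/doc/people/eric")
            self.end_headers()
        elif self.path == "/doc/people/eric":
            # The description document: an ordinary 200 response.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"A document describing Eric.")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), CoolURIHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/id/people/eric")
resp = conn.getresponse()
status, location = resp.status, resp.getheader("Location")
server.shutdown()
```

The 303 (rather than a 200 or a 302) is the whole signal: it tells a semantic web client that what it is about to receive is a description of the named thing, not the thing itself.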

I've recently become aware that lots of really smart people have thought and written about the theory of identifiers and about how the Semantic Web should handle them. I've particularly enjoyed an article called "In Defense of Ambiguity" by Patrick Hayes and Harry Halpin. But to answer my questions about the semantics of OpenID, there's no sage more useful than the one who said "When I use a word, it means just what I choose it to mean - neither more nor less." Semantics do not get determined by those who mint the identifiers, but rather by those who make use of them. It helps if they are also willing to pay the IDs a bit extra.

Monday, July 20, 2009

The Evolution of Names and Personal Identifiers

One of my Great Great Great Great Great Great Great Great (8G) Grandfathers was born around 1590 in northern Sweden and was named either Pål or Påfvel. A written record of him that survives refers to him as "Pål de Äldre", or Pål the Elder, to distinguish him from his son, called either Pål Pålsson or Påfvel Påfvelsson, depending on which record you believe. Pål de Äldre's father was also called either Pål or Påfvel; in a written record he was called Pål Finne, or Pål the Finn. In 1600, it just wasn't important for a farmer in northern Sweden to have a consistent spelling of his name, or even to have a consistent name. It's likely that his name was written down only a dozen or so times in his lifetime- in church baptismal records, maybe a marriage record. If he had been a city dweller, it would have been different, but for the most part the writing down of names was part of an effort by the church and thus the government to extend its dominion into the countryside. It was common for people to be identified only by their first name and the name of their village. Patronymics were the cultural norm in the cities, and as the countryside developed and communicated with other parts of the kingdom, the patronymic names and registration of those names became more formal and regularized.

In the late 1800's, it became popular in Sweden to adopt family names instead of patronymics. Probably this had to do with an increase of awareness of naming conventions in the rest of the world. My great grandfather, born Abel Olsson, and his brother, Salomon Olsson, decided to take the family name "Hellman", which translates as "bright man". Their brother Olof took the name "Holmberg", while brother Magnus went with "Hellgren". Salomon later emigrated to America, and on entry changed his name to "Hallman" because a religious member of the family who knew English disliked having the word "Hell" in their name.

Personal names are now used globally- my name has to compete with that of other Eric Hellmans around the world when somebody wants to search for information about us on Google- I am NOT the Eric Hellman who used to manage the band Blink-182. So the global village reduces us to using the same attribute-based disambiguation schemes that were used in 1600- I am now Eric Hellman of Montclair, New Jersey, or Eric Hellman the Linking Technologist.

When I was a scientist, writing articles for technical journals, I always used the form "E.S. Hellman" for my name, because I had searched Science Citation Index and knew that no one else had published under that name for at least 10 years. The process was exactly the same as you might use now to pick a Twitter hashtag- temporal disambiguation is the best I can hope for.

My mother-in-law has used many names. She was born in China, and her birth name indicated her birth-order in the family and a "generation name". When she entered school, she received a school name. As an adult, she chose a name for herself. When she emigrated to the US, she chose to adopt an English name and a transliteration of her family name. When she married, she began to use a married name. When the full married name didn't fit on important legal forms, she chose a legal name.

With this family background, you'd think I'd know better. When I designed a database-driven e-journal with front-to-back automation in the early days of the internet (1995), I had the bright idea of giving identifiers to all the entities involved in publishing a paper. I imagined a world in which the authors of every paper and the institutions they worked for would have global identifiers, and every article would be rendered using up-to-the-minute information from the database. In short, my design for the e-journal was hoping for a linked data future.

I briefly considered the possibility that an author would change their name, and I naively decided that papers were authored by people and not by author names. If an author changed their name in real life, the name on the paper should change as well. By the time this circumstance actually occurred, I had learned enough about how citations were used by abstracting services and the like, that I realized what a bad idea a retroactive change in an author list would be. I changed the way data binding occurred to prevent the retroactive change from happening. But because I used author identifiers in the database underlying the e-journal, the generated author pages displayed all the papers written by the author who had changed her name.

At this point in the post you might expect to read a clarion call for the establishment of global person identifiers, to enable the global cloud of URI-based linked data to know all the articles authored by a given person. If so, I have to disappoint you. What I have for you is an observation, and a question. The observation is that naming of individuals is a universal practice across all human societies, and name shifting is almost as universal, and certainly as human. The regularization of names on the other hand, the conversion of names into identifiers, if you will, has always been a governmental activity. It was the Kingdom of Sweden and its Church that wrote down Pål Pålsson's name; my middle name seems to exist mostly for entry onto forms, and to lend an initial to my signature. It was the organizing force of my e-journal database that regularized the author names it displayed.

My question is this: if we give people global identifiers, what will they do with them? Will they view them as progress for civilization, like roads and communication systems, or will they view them as encroachments on privacy and liberty and on their rights to change their name and identity? Will people embrace their identifiers and view them as property, or will they attempt to subvert them and hide them away as Americans must do with their social security numbers? Is the urge to regularize our identifiers a natural extension of our human proclivity to give ourselves names, or is it something that can only be accomplished as an expression of human government?

Friday, July 17, 2009

Why did you ReTweet that Tweet?

Dan Zarella has been using Twitter as a Petri dish for human behavior. I recently discovered his blog while searching for information on the propagation of vocabulary on social networks. He bills himself as a "social media and viral marketing scientist". One thing he's looked at is how retweets spread, which is worth a look.

I was interested in why people retweet. Since it's Friday, I thought it might be fun to try a poll on the subject. If you use Twitter, please give it a try, and retweet the link!

Directions: Look at your Twitter update history. Find the last thing you retweeted (not this!). For that retweet, pick the one answer which best expresses the most important reason for you to retweet that tweet. If there's another reason I haven't thought of, leave a comment!


Thursday, July 16, 2009

The New York Times is NOT Being Disrupted by Innovation

Clayton Christensen coined the phrase "disruptive innovation" to describe a recurring pattern of incumbent technology companies being unable to maintain their market leadership through a particular type of technology transition. If you have not read the book, or watched one of his lectures, you should take two minutes right now to watch his video, or else stop reading this post NOW.

It really bugs me when people who have not read the book or have not taken the time to understand Christensen's insights steal the phrase "disruptive innovation" or "disruptive technology" and plaster it onto something that doesn't fit Christensen's model. For example, one characteristic of disruptive technology is that incumbent companies fail to adopt a new technology because it doesn't meet the needs of the market, i.e. their existing customers. This characteristic gets twisted by some entrepreneurs and technologists so that a technology's failure to address customers' needs (or to have customers in the first place) is cited as evidence of the technology's disruptive nature!

Another common misunderstanding of "disruptive innovation" is to assume that a technology is disruptive just because it poses a threat to an incumbent technology. Here's an easy way to tell if a "threatening" technology is a good fit to the disruptive innovation model: ask yourself "is the new technology a threat because it delivers higher performance, with a hope that its cost will be driven down to challenge current technology? Or is the technology a threat because it's really cheap, and has a hope to increase performance to be able to challenge current technology?" The high-performance technology is what Christensen labels a "sustaining technology"; the low-cost technology is what Christensen labels a "disruptive technology".

In a previous post on whether scientific publishing is about to be disrupted, I argued that the problems of the newspaper industry were not germane to the future of the scholarly publishing industry. In this post, I want to examine whether the newspaper industry fits the Christensenian model of incumbents facing disruptive innovation. Michael Nielsen's article argues in favor of disruption, suggesting that blogs like TechCrunch, by adopting low-cost technical infrastructure, are disruptive innovators. I agree that the low-cost infrastructure fits the disruptive model- there are no printing companies that have attempted to develop blogging infrastructure, for example. But that doesn't make TechCrunch a disruptive innovator, or newspapers a disrupted industry. The reason is that both TechCrunch and newspapers are really in the business of selling advertising. The advertising that TechCrunch sells is actually at the high-performance, highly targeted, expensive end of the market compared to the advertising that the New York Times sells.

In Christensen's model, incumbent companies abandon low-margin market segments to the disruptors because they want to focus on the most profitable parts of their business. But this is the opposite of what has happened in newspapers. Real estate listings and other classified ads have huge margins. Internet sites such as Zillow and Craigslist exploited these huge margins to make businesses out of delivery of high-performing ads.

I find it much more useful to think of the newspaper industry not as one being disrupted by innovation, but rather as one being fragmented by innovation. The internet allows information services to be profitable at much smaller sizes than previously possible. The result is that many markets previously served by newspapers became vulnerable to competition from smaller, more focused services.

I can think of a number of industries afflicted by fragmentation, and the outlook for incumbent companies is not nearly so dire as for industries afflicted by disruption. The television broadcasting and semiconductor industries are good current examples. Although many companies fail to adapt to a fragmented market and disappear, many survive and remain vital. There are a number of strategies for survival- the "roll-up", the "smaller but focused company", and of course the "climb up the food chain" and "move down the food chain" strategies. There are also strategies for failure, most prominently, the "pretend nothing's wrong" strategy.

The bottom line here is that I think there's hope for companies in the newspaper industry. Unless the New York Times shrinks its typeface and crossword puzzle so loyal readers like me can't read it anymore, it might not go bust.

Tuesday, July 14, 2009

Twittering Librarians Confront Hashtagging

Cataloging used to be the preserve of a special breed of librarians, but in the Web 2.0 era, cataloging, in the form of "tagging", has been opened up to the masses. With the advent of Twitter, this sort of tagging, which used to be tucked away in metadata fields, has injected itself into people's everyday stream of text. So it's interesting to see what happens when librarians, all of whom should have at least a bit of training in the application of subject headings, are presented with the task of hashtagging their Tweets. That's exactly what happened this past weekend when 28,000 people attended the Annual Meeting of the American Library Association (ALA) in Chicago. Over 1,500 Tweets were hashtagged to indicate they were related to the meeting.

In a previous blog post, I looked at how about 30 conferences were hashtagged. I observed that the choice of hashtag was usually not an issue, and tweets were rarely divergently tagged after the first day of a conference. Avoidance of irrelevant subjects was the most important consideration in the choice of hashtag. At ALA, as you might expect, there were more than the usual number of tweets asking which hashtag to use; this occurred even near the end of the meeting. Although the official hashtag was "#ala2009", there was a greater than 5% rate of divergent hashtag use ("#ala09" and "#ala"). The latter tag was a poor choice due to conflict with other subjects. In my limited sample set, most conferences saw less divergent hashtag use.

As ALA is a large meeting, several sessions exhibited their own hashtags. The most tweeted "session" that I observed was "#unala2009". I wonder if there is a practical limit to the tweet-rate. At some tweet-rate, perhaps 1000 per day, people will want to start tweeting to a more specific hashtag. The "Top Tech Trends" twitterers battled to a 30-30 tie for the hashtags "#toptech" and "#ttt09".

Also noteworthy was the appearance and disappearance of the "ALAsecrets" account and associated twitterspam. This account published its password to allow for anonymous twittering. Interesting idea, poorly implemented. If only the librarians responsible had created a librarian-specific password, like the 3-1-4 that scientists always use for the numerical locks on their labs. Use the comments to propose a "shibboleth" for librarians. (Dewey's first name?)

Addendum: after posting, I realized that I should count the number of ALA tweets that used multiple ALA hashtags. 59% of tweets using the "wrong" hashtag also included the official hashtag. With that correction, the divergence rate went down to 2%, roughly the same as that found overall in my previous study.
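
The recount described in the addendum is easy to automate. Here's a minimal sketch (using made-up tweets, not the actual ALA data) in which a tweet counts as divergent only if it uses a variant hashtag without also including the official one:

```python
def divergence_rate(tweets, official, variants):
    """Fraction of meeting-tagged tweets that use a variant hashtag
    *without* also including the official one. Each tweet is given
    as a set of lowercase hashtags."""
    tagged = [t for t in tweets if t & ({official} | variants)]
    divergent = [t for t in tagged if (t & variants) and official not in t]
    return len(divergent) / len(tagged) if tagged else 0.0

# Hypothetical sample: the second tweet uses a variant but is "corrected"
# by also carrying the official tag; only the third is truly divergent.
tweets = [
    {"ala2009"},
    {"ala09", "ala2009"},
    {"ala09"},
    {"ala2009", "toptech"},
]
rate = divergence_rate(tweets, "ala2009", {"ala09", "ala"})
```
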

Monday, July 13, 2009

Dung Beetle Armament and the Real Threats to Scientific Publishing

To illustrate an article on dung beetle armament, the New York Times Science section published a graphic with a spectacular montage of 35 animals with grotesque armaments, ranging from the Narwhal to the Giraffe Weevil. 13 of them are extinct. The reason that many dung beetles have evolved such elaborate armaments is not so much that they are effective in combat with other dung beetles, but rather that female dung beetles select mates based upon the outward display of combat fitness.

In my last post, I argued that scholarly publishers were not being threatened by imminent disruption by the same factors that have the newspaper publishing industry on the brink. I suggested that a potential vulnerability of the scholarly publishing industry would be the disintegration of the linkage between the industry's activity, publishing scholarly articles, and the industry's main revenue source- library subscriptions. I see two possible ways that this could occur. It could occur through a collapse of library funding; I hope to discuss that in a future post. This post discusses another way this could occur: I think there is a possibility that the adoption of social networking technologies will lead to a collapse of scholarly publishing as it exists today.

If this sounds a bit far-fetched, consider the parallels between scholarly publishing and dung beetle armament. The development of scholarly publishing today is driven by the selections made by authors about where and how to publish articles. The authors' ultimate goal is to propagate their work and thus gain tenure, status and funding, just as "the ultimate goal" of the female dung beetle is to gain a safe tunnel to enjoy dung and raise baby dung beetles. The authors do not really know which journals do the best job of propagating their work, but they recognize prestige and the badges of prestige, and they know what sorts of publications will look best to their tenure committees. Authors do not consider the cost of journals any more than female dung beetles consider the energy cost of male armature. The size and form of today's scholarly publishing ecosystem is thus driven to a significant extent by the superficial judgments of tenure committees.

Anything that might change the way tenure committees, and thus authors, perceive journal publication has the potential to reverse the fortunes of journal publishers. To my mind, social networking technologies have that potential as do few other things on the horizon. The reason is that tenure decisions have used journal publication records as objective measures of a candidate's social status within the scientific community. Publication in a prestigious journal has been an important way for scholars to become known, to gain speaking invitations, and to advance ideas. But publications are only part of this process. Knowing the right people, studying with the right professors, schmoozing at conferences, all of these are probably more important to the advancement of new ideas, but they have been very hard to measure in any objective way.

Social network technologies open new possibilities for the propagation of new ideas and for the assessment of the impact of those ideas. Already, we see people using the number of followers they have on Twitter or the number of recommendations they have on LinkedIn as measures of social status, so it's not much of a stretch to imagine that similar measures could be used to evaluate young academics or to award grants. It's beyond dispute that Twitter is already being widely used to propagate links to interesting technical papers and posts on scientific subjects. If targeted development of social network-based evaluation methodologies were pursued by groups such as the library community who wish to re-inject usage and low-cost access into the tenure equation, the competitive environment for scholarly and scientific publishers could change radically.

Every threat is an opportunity, of course, and it's equally possible that social networking technologies could reinforce the scholarly publishing industry- after all, dung beetle armaments evolve to adapt to changing fashion choices among female dung beetles. A potential weakness- for example, the unwillingness of people to post or retweet links to subscriber-only content- could turn into a strength if publishers develop access models that grant special access for re-tweeted links or an author's Facebook friends. Publishers could also try to ward off challengers in the scholar evaluation game by developing improved and more ostentatious badges of honor- best paper prizes, awards for the most forwarded paper, etc.

In fact, I've come up with a mathematical model for how all this will evolve. First, assume a spherical dung beetle...

Friday, July 10, 2009

Spherical Livestock and the Alleged Disruption of Scientific Publishing

Physicists have a joke about "spherical cow approximations" referring to their tendency to simplify a problem to make calculations easier, even though such simplifications bring into question the solution's application to reality. My favorite version of the joke, which I first heard directly from Hans Bethe, has Nikita Khrushchev asking his most elite scientists to help the Soviet Union with its difficulty meeting its five year plan for the dairy industry. The biologists and the chemists are completely stumped by the problems of increasing milk production, but the physicists proudly announce they have solved the milk production problem, but only for the case of spherical cows.

In a post entitled "Is scientific publishing about to be disrupted?", quantum information theorist Michael Nielsen describes what he thinks is a general explanation for why businesses and industries fail, and goes on to draw an analogy between the newspaper industry and the scientific publishing industry. Although the post is well written and highly entertaining (I find his discussion of "immune systems" particularly delicious), I find part of his analysis to be even worse than a spherical cow approximation- he's trying to study milk production by analyzing the spherical chicken! Let me explain.

Nielsen's "spherical chicken" is illustrated in this graph from his blog:

In the graph, he plots some sort of measure of success versus some sort of configuration parameter that presumably could be tuned to turn the New York Times into TechCrunch, or vice versa. He goes on to say that
The problem is that your newspaper has an organizational architecture which is, to use the physicists’ phrase, a local optimum. Relatively small changes to that architecture - like firing your photographers - don’t make your situation better, they make it worse. So you’re stuck gazing over at TechCrunch, who is at an even better local optimum, a local optimum that could not have existed twenty years ago
The problem with this analysis is that TechCrunch is completely immaterial to the difficulties that the newspaper industry is undergoing. The financial health of the New York Times and the newspaper industry is not being undermined by news blogs, it's being undermined by non-news sites such as Craigslist, Zillow, and the internet as a whole. Craigslist has focused on classified ads, and only classified ads, and unburdened by the expense of producing the rest of a newspaper, it is able to provide a much more effective solution for the classified advertiser. Zillow has done the same thing in the real estate advertising category. Another big revenue source for newspapers is display advertising to consumers. But nowadays, when someone wants to buy something or find a service, their first thought is to go directly to the internet. Want to find when a movie is playing? You used to pull out a newspaper, now you go to the internet. A company like BestBuy used to communicate with customers through newspaper ads; while they still do so to some extent, the internet allows them to communicate directly with consumers through their web site. None of the newspapers' real competitors are in the news business at all, and there is no configuration parameter of any sort that could be tuned to transform the New York Times into Craigslist.

The news industry's core problem is not, as Nielsen suggests, their inability to adopt disruptive technologies, but rather the disintegration of the linkage between their main activity and their revenue streams. In the past, good news would attract readership, and readership would attract advertisers. The biggest difficulty for newspapers today is not so much the loss of readership, it's that advertisers now have many more ways to connect to that readership. In applying the lessons of the newspaper industry to the evolution of the scientific publishing industry, it's the stability of activity-revenue linkage that needs to be closely examined.

Even a cursory look at the scholarly publishing industry reveals a very different situation from that of the newspaper industry. First of all, there is much more business-model diversity in scholarly publishing. There are huge companies like Elsevier competing with cottage companies which produce a single journal. There are large non-profit societies such as the American Physical Society that produce extremely cost effective journals and who make much of their content available for free. There are journals that have long survived primarily on advertising and journals that have long survived primarily on society member dues. There is also a lot of experimentation with business models going on, including author-paid open access publishers, and mixed "open choice" business models. This business model diversity gives scientific publishing industry robustness against the prospect of any one business model being severely disrupted. In addition, the transition to digital delivery which is giving the newspaper industry such difficulty is to a significant extent already being accomplished in the journal publishing industry.

The scientific publishing industry does have a similar activity-revenue linkage problem that it needs to pay attention to. The people who write the biggest checks to scientific publishers are institutional libraries. But scientific journals, for the most part, do not cater to libraries; they cater to author communities, because the biggest determinant of a scientific journal's success has been the quality and quantity of articles it is able to attract. As long as libraries continue to value the authorship that journals attract, and continue to receive the institutional funding they need to support their subscriptions, the biggest revenue stream for scientific publishers will be secure. But suppose that institutions start deciding to outsource their libraries or begin to require researchers to directly fund their journal subscriptions? Or suppose that libraries are successful in attracting authors directly into open-access institutional repositories?

A better analogy from physics for the scholarly publishing business might be the polaron. A polaron is the combination of a particle and its interactions with the environment that it moves in, and the combination has a mass significantly larger than the "bare" particle moving on its own. In the case of the scientific publishing business, the interactions with its environment include the way tenure committees rely on the prestige of a journal that has published a candidate's work, or the way accreditation boards require libraries to subscribe to certain numbers of journals. The polaronic industry thus gains mass and inertia, allowing it to continue longer than it might otherwise do. Computer operating systems work in the same way- they induce the creation of third-party software that interacts with the operating system and thus increases its mass and inertia in the market.

Strongly interacting polarons can distort their environments so much that they become trapped by their cloud of interactions- think of a celebrity trying to walk through a crowd of fans. For a business, this can be a fatal situation if objectives change and there is no possibility to adapt.

How's that for a spherical cow?

Thursday, July 9, 2009

URL Shorteners and the Semantics of Redirection

When I worked at Bell Labs in Murray Hill, NJ, it amused me that at one end of the building, the fiber communications people were worrying that no one could ever possibly make use of all the bandwidth they could provide- we would never be able to charge for telephone calls unless they figured out how to limit the bandwidth. At the other end of the building, computer scientists were figuring out how to compress information so that they could pack more and more into tiny bit-pipes. I'm still not sure who won that battle.

When I was part of a committee working on the OpenURL standard, we had a brief discussion about the maximum length URL that would work over the internet. A few years before that, there were some systems on the internet that barfed if a URL was longer than 512 characters, but most everything worked up to 2,000 characters, and we anticipated that that limit would soon go away. So here we are in 2009, and Internet Explorer is just about the only thing that still has a length limit as low as 2083 characters. Along comes Twitter, with a 140-character limit on an entire message, and all of a sudden, the URLs we've been making have become TOO LONG! Just as fast, URL shortening services sprang up to make the problem go away.

The discussion on my last post (on CrossRef and OpenURL) got me interested in the semantics of redirection, and that got me thinking about the shortening services, which have become monster redirection engines. When we say something about a URI that is resolved by a redirector, what, exactly, are we talking about?

First, some basics. A redirection occurs when you click on a link and the web server for that link tells your browser to go to another URL. Usually, the redirection occurs in the http protocol that governs how your web browser gets web pages. Sometimes, a redirect is caused by a directive in an HTML page, or programmed by JavaScript in that page. The result may seem the same but the mechanism is rather different, and I won't get into it any further. There are actually 3 types of redirects provided for in the http protocol, known by their status codes as "301", "302" and "303" redirects. There are 5 other redirect status codes that you can safely ignore if you're not a server developer. The 301 redirect is called "Moved Permanently", the 302 is called "Found" and the 303 is called "See Other". Originally, the main reason for the different codes was to help network servers figure out whether to cache the responses to save bandwidth (the fiber guys had not deployed so much back then and the bit squeezers were top dogs). Nowadays the most important uses of the different codes are in search engines. Google will interpret a 301 as "don't index this url, index the redirect URL". A 302 will be interpreted as "index the content at the redirect URL, but use this URL for access". According to a great article on URL shorteners by Danny Sullivan, Google will treat a 303 like a 302, but who knows?
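
If you want to see the raw status code for yourself, a few lines of Python will do it. This is a sketch using only the standard library; by default urllib follows redirects silently, so we install a handler that refuses to:

```python
import urllib.error
import urllib.request

class StopRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects, so we can inspect the raw 3xx response."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise the 3xx as an HTTPError

def first_response(url):
    """Return (status_code, location) for the first hop of `url`:
    (200, None) means no redirect; (301/302/303, target) means a redirect."""
    opener = urllib.request.build_opener(StopRedirect)
    try:
        with opener.open(url) as resp:
            return resp.status, None
    except urllib.error.HTTPError as e:
        return e.code, e.headers.get("Location")
```

Running first_response on a shortened link should show a 301 with the Location header pointing at the long URL, while a gateway that uses "Found" semantics will show a 302.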

Just as 301 and 302 semantics have been determined by their uses in search engines, the 303 has been co-opted by the standards-setters of the semantic web, and they may well be successful in determining the semantics of the 303. As described in a W3C Technical Recommendation, the 303 is to be used
... to give an indication that the requested resource is not a regular Web document. Web architecture tells you that for a thing resource (URI) it is inappropriate to return a 200 because there is, in fact, no suitable representation for those resources.
In other words, the 303 is supposed to indicate that the Thing identified by the URI (URL) is something whose existence is NOT on the web. Tim Berners-Lee wrote a lengthy note about this that I found quite enjoyable, though at the end I had no idea what it was advocating. The discussion that led to the W3C Recommendation has apparently been extremely controversial, and has been given the odd designation "httpRange-14". The whole thing reminds me of reading the existentialists Sartre and Camus in high school - they sounded so much more understandable in French!

As discussed in Danny Sullivan's article, most of the URL shorteners use 301 redirects, which is usually what most users want to happen. An indexing agent or a semantic web agent should just look through these redirectors and use the target resource URL in its index. The DOI "gateway" redirector discussed in my previous post uses a 302 redirect. Unless DOIs are handled specially by a search engine, it means that the "link credit" (a.k.a. Google juice) for a link will accrue to the URL rather than the target URL. This seems appropriate. Although I indicated that under Linked Data rules the link identifies whatever is indicated by the returned web page, from the point of view of search engines, that URI identifies an abstraction of the resource it redirects to. A redirection service similar in conception, PURL, also uses 302 redirects.

I was curious about the length limits of the popular URL shorteners. Using a link to this blog, padded by characters ignored by, I shortened a bunch of long URLs. Here are 4 shortened 256-character links to this blog:
They all work just fine. Moving to 1,135-character links, everything still works (at least in my environment):
At 2083 characters, the limit for Internet Explorer, we start separating the redirection studs from the muffins.
When I add another character, to make 2,084 total, and both work, but reports an error!
The compression ratios for these last two links are 109 to 1 for and 95 to 1 for snurl. The bit squeezers would be happy.

Next, I wanted to see if I could make a redirection loop. Most of the shortening services decline to shorten a shortened URL, but they're quite willing to shorten a URL from the PURL service. Also, I couldn't find any way to use the shortening services to fix a link that had rotted after I shortened it. It could be useful to add the PURL service as link-rot insurance behind a shortened URL if the 302 redirect is not an issue. So here's a PURL: that redirects to which redirects to etc. Don't click these expecting an endless loop- your browser should detect the loop pretty fast.
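
The loop detection your browser does can be sketched in a few lines: remember every URL visited while following Location headers, and bail out on a repeat or after a hop limit. In this sketch the fetch function is a stand-in (a dict of fake redirects) rather than a real HTTP client:

```python
from urllib.parse import urljoin

def follow_redirects(start_url, fetch, max_hops=20):
    """Follow a redirect chain. `fetch(url)` returns the redirect target,
    or None when the URL serves real content. Returns (url, hops, looped)."""
    seen = set()
    url, hops = start_url, 0
    while hops < max_hops:
        if url in seen:
            return url, hops, True    # revisited a URL: loop detected
        seen.add(url)
        target = fetch(url)
        if target is None:
            return url, hops, False   # reached actual content
        url = urljoin(url, target)    # Location headers may be relative
        hops += 1
    return url, hops, True            # hop limit exceeded: treat as a loop

# A two-step loop like the PURL-to-shortener example (made-up hostnames):
chain = {"http://purl.example/a": "http://short.example/b",
         "http://short.example/b": "http://purl.example/a"}
final, hops, looped = follow_redirects("http://purl.example/a", chain.get)
```
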

A recent article about how is using its data stream to develop new services got me thinking again about how a shortening redirector might be useful in Linked Data. I've written several times that Linked Data lacks the strong attribution and provenance infrastructure needed for many potential applications. Could shortened URIs be used as Linked Data predicates to store and retrieve attribution and provenance information, along with the actual predicate? And will I need another http status code to do it?

Monday, July 6, 2009

Crossref, OpenURL and more Linked Data Heresy

After CrossRef was started nine years ago, I quipped that it was nothing short of miraculous, since it was the first time in recorded history that so many publishers had gotten together and agreed on something that they would have to pay for. I'm sure that was an exaggeration, but my point was that CrossRef was not really about linking technology, rather, it was about the establishment of a business process around linking technology. The choice of technology itself was to some extent irrelevant.

Last week, in a comment on my post about AdaptiveBlue and OpenURL, Owen Stephens raised some interesting questions surrounding OpenURL, DOI (Digital Object Identifier), and Linked Data. It's useful to think of each of these as a social practice surrounding a linking technology; I'll describe each of them in turn.

DOI is often thought of as synonymous with CrossRef, which is incorrect. DOI is a link indirection technology used by the CrossRef organization. There are some DOIs that are not CrossRef DOIs, but most of the DOIs you are likely to come across will be CrossRef DOIs. CrossRef provides registration, matching and lookup services in addition to the DOI redirection service, and from here on, I'll be talking about CrossRef DOIs only. The core mission of CrossRef is the transformation of journal article citations into clickable URLs. CrossRef has registered about 35 million DOIs, most of them for journal articles. In the registration process, CrossRef collects identifying metadata for the journal articles, which it then uses to power its matching and lookup services. The matching service is currently making about 15 million matches per month.

CrossRef is far from being perfect, but its achievements have been considerable. Most scholarly journal publishers have integrated the CrossRef registration and matching process into their production workflows. The result is that many thousands of electronic journals today are being linked to from many thousands of other electronic journals, databases, search engines, even blogs.

In contrast to CrossRef, which focuses on publishers and publisher workflow integration, OpenURL is a linking technology and practice that has focused on helping libraries manage links to and from the electronic resources available to their patrons. OpenURL is complementary to CrossRef- OpenURL linking agents usually make use of CrossRef services to accomplish their mission of helping users select the appropriate resources for a given link. Libraries frequently need to deal with problems associated with multiple resolution- a given article might be available at ten or even a hundred different URLs, only one of which might work for a given library patron.

Finally, Linked Data is an emerging practice which enables diverse data sets to be published, consumed and then linked with other data sets and relinked into a global web of connections. It would be interesting to find out how many matches are being made in the Linked Data web to compare with CrossRef, but because of the decentralized matching, it's not really possible to know. While CrossRef and OpenURL focus on connecting citing articles and abstracts with the cited articles, Linked Data attempts to support any type of logical link.

Obviously there is overlap between Linked Data and the more established linking practices. Can (and should) Linked Data applications reuse the CrossRef and/or OpenURL URIs? Let's first consider OpenURL. OpenURL is really a mechanism for packaging metadata for a citation (jargon: ContextObject) into a URI. So the "thing" that an OpenURL URI identifies is the set of services about the citation available from a particular resolver agent. That's not usually the thing that you want to talk about in a Linked Data Application.
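
To make the packaging concrete, here's a sketch of how a ContextObject might be serialized into an OpenURL query string. The resolver address is made up, and the key names follow the Z39.88-2004 key/encoded-value (KEV) journal format:

```python
from urllib.parse import urlencode

def make_openurl(resolver_base, rft_id=None, **citation):
    """Pack citation metadata (the ContextObject) into an OpenURL 1.0
    key/encoded-value (KEV) query string for a given link resolver."""
    params = [("url_ver", "Z39.88-2004"),
              ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal")]
    if rft_id:
        params.append(("rft_id", rft_id))  # e.g. an info:doi/ URI
    params += [("rft." + key, value) for key, value in sorted(citation.items())]
    return resolver_base + "?" + urlencode(params)

# resolver.example.edu is a hypothetical library resolver; the citation
# fields match the Journal of the Geological Society article below.
link = make_openurl("http://resolver.example.edu/openurl",
                    rft_id="info:doi/10.1144/0016-76492006-123",
                    jtitle="Journal of the Geological Society",
                    date="2007")
```

Note that the resulting URI identifies a bundle of services offered by that particular resolver, not the article itself, which is exactly why it makes an awkward subject for Linked Data assertions.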

What about CrossRef DOIs? There are two different URIs that you can make with a DOI. There's the http URL that gets redirected to full text (you hope) by the DOI gateway. There's also the "info-uri" form of the DOI- info:doi/10.1144/0016-76492006-123 , which you can't click on. It's clear what the latter URI identifies- it's a 2007 article in the Journal of the Geological Society. Many libraries run resolver agents that can turn that URI into clickable service links. I'm not sure what the former URI identifies. What the URI gets you to is a web page with links to two different instantiations of the article identified by the info-uri. Apparently it doesn't identify the same article in its other instantiations on the internet. So the most correct URI to use, if you want to make Linked Data assertions about the article, is (in my humble but correct opinion) the info-uri.

There's one little problem.

The second of Tim Berners-Lee's "Four Rules" for Linked Data is "Use HTTP URIs so that people can look up those names." But CrossRef, a stable, self-sustaining organization that has made huge strides in moving the world of journal publishing to a more open, more usable, more linked environment, provides look-up APIs that return high-quality XML metadata, so you can look up the names it defines. It has a solid record of accomplishing exactly the things that Linked Data is trying to do; Linked Data's ambitions are broader in scope, but CrossRef's impact is undeniable. The identifier that CrossRef uses is the DOI, and the URI form of the DOI is NOT an HTTP URI.

Maybe Tim BL's second rule is wrong, too!

Saturday, July 4, 2009

How Semantic Technology Unified China in the Qin Dynasty

The first Swedish Rap recording was made by the great troubadour Evert Taube in 1960. It's called "Muren och böckerna", and here's a YouTube video for your listening pleasure:

I became aware of this recording from another song called "Evert berättar" by Peter Carlsson and the Blå Grodorna (Blue Frogs). My Swedish isn't that good, but one day a few months ago the song came up on my iPod Shuffle while I was running, and I suddenly realized that the song had something to do with burning books and the Great Wall of China. As soon as I got back home I started researching Evert Taube and Qin Shi Huangdi, the subject of the original song (whose title translates as "The Wall and the Books").

Shi Huangdi (pinyin: Shǐ Huáng Dì, Chinese: 始皇帝 ) means literally, "first emperor". Just as Julius Caesar's name became synonymous with Emperor continuing to the present in titles such as "Kaiser", "Czar" and "Shah", Huangdi was the term used for Chinese emperors for over two thousand years. Shi Huangdi was the king of the Qin state from 246 BCE to 221 BCE, when he became the first emperor of a unified China. Even the word "China" comes from his "Qin" state (pronounced “chin”), even though most Chinese people are really "Han" rather than "Qin".

Shi Huangdi's unification of China put an end to what historians call "the Warring States Period." Under his leadership, the Qin state defeated one rival state after another. The Warring States period, though politically chaotic, saw a great deal of economic, cultural and technological growth. Iron replaced bronze, and both Confucianism and Taoism (the Hundred Schools of Thought) developed in this period. The Qin state, however, grew strong because of its adoption of a competing philosophy, called Legalism, which emphasized the rule of law in a totalitarian state. Like Caesar, Shi Huangdi extended his dominion by improving communications and implementing standards. He built roads and canals to link the different parts of China. He standardized the length of axles of carts, the units of weights and measures, and the coinage. His most important achievement, however, may have been the standardization of the Chinese script. For the first time, the machinery of local government could communicate with functionaries of government throughout the realm. You might say that this was the first semantic web.

Shi Huangdi's innovations were not achieved by gentle persuasion or community consensus, but rather by imperial edict and brutal force. In order to stifle dissent (not to mention suppress the outlawed non-official scripts), he ordered the destruction of all books other than a few in subjects he deemed useful: agriculture, medicine and alchemy. In particular, he outlawed the works of the competing Hundred Schools of Thought. Those caught possessing any of the illegal works were to be conscripted and sent to work on the public works project now known as the Great Wall of China. In many classical accounts, Shi Huangdi ordered 460 scholars to be buried alive.

Although the Qin dynasty of the first Emperor failed to last more than a decade after his death, the non-political aspects of the unification of China through communication, trade, laws, administration and a standard script have lasted more than 2200 years.

Why would a Swedish troubadour be interested in Shi Huangdi? Why would he invent a form similar to modern HipHop to sing it in? Evert Taube seems to be most interested in Shi Huangdi's act of burning "all the books in China", so that "history could begin with him". Shi Huangdi exiled his mother because of some "court intrigue", and Taube thinks that burning the books was an act of destroying history, forgetting his unhappy past, and thinking only of what could be accomplished in the future. It often strikes me that today we're in another period of forgetting the past- because the internet dates back a relatively short time, modern students often behave as if anything that isn't on the internet doesn't exist, and never has existed. There are ongoing monumental efforts to digitize books and bring them back into view; small wars are being fought over how this will occur, and all of the combatants claim the banner of preserving history for eternity. Eternal life was also one of Shi Huangdi's obsessions- the famed army of terra cotta warriors he had made was a product of this obsession.

I think that the musical form chosen by Evert Taube is not an anticipation of HipHop, but rather an evocation of the history that society is so eager to forget. Taube had been a sailor and an adventurer, and no doubt had been exposed to the traditional musical forms of both the Far East and of Africa. I think his intent was to evoke the forgotten primitive past with rhythms that speak across the ages.

We live in a time when the language and mechanisms of human interaction are undergoing great change. We are entering an era in which machines are learning to participate in our conversation. Efforts are under way to standardize and unify notations for the real world concepts and entities that underlie our communication. Success in these endeavors may result in the creation of great wealth and power, and new projections of existing wealth and power. It is possible that we are living in a Warring States/Hundred Schools of Thought period, and standardization of our notations will lend itself to a totalitarian communications regime with global extent such as Shi Huangdi's or Julius Caesar's. Another possibility is that our intercourse will become governed by something like a theocracy, in which texts are governed by a priesthood and preserved by monks. Or perhaps information and its underpinnings will devolve to a dictatorship of the proletariat.

On this 233rd anniversary of the Declaration of Independence, I'd like to suggest that a democratically derived and governed semantic machinery for the internet should also be possible. Humans who interact in large groups, as they are doing in places like Facebook and Twitter, naturally develop languages and syntax on their own, and machines should bow to our will if they are to participate helpfully in our conversations. We need not only a common language and script to be able to communicate with each other, we need the liberty to say what we want to say.

Happy Fourth of July!

Thursday, July 2, 2009

Linked Data Heresy? Under the Hood at AdaptiveBlue

Have you ever watched a web server log? Thirteen years ago, I was starting up a scientific e-journal, and it was very gratifying to watch the monitor and see the traffic coming in from all over the world. Occasionally I would turn on the referrer log to see where people were coming from. One time, I was surprised to see that somebody in Poland was coming to my e-journal site from a Russian web site with "xxx" in the URL. Curious about what sort of site might be linking to my e-journal, I checked out the site, and found it to be about blond, naked women. I wasn't sure what this indicated about my e-journal. Perhaps the Polish scientists found the e-journal and the xxx site equally stimulating? Perhaps their boss had just walked into the room, and they needed a work-oriented internet site to cover their other browsing?

My perspective on the privacy of my internet browsing changed that day. I've become mildly paranoid about things that might spy on me. I am very selective about the Facebook apps that I load, for example, but I don't bother to flush my browsing history or block web bugs or things like that. I enjoyed finding out "what Google knows about me" (post it to Facebook and tag your friends to do the same!). I really worry about Firefox extensions (or "Add-ons"), because I know how extremely powerful and/or intrusive they can be. Even so, the 3 or 4 things I add to Firefox are the main reason I don't use Safari, despite its integration advantages. I'm not surprised that IE and Safari have declined to support practical extension mechanisms; they're sort of scary. On the other hand, Firefox Add-ons have presented very few spyware-related problems; this is due in part to the fact that they must be written in Javascript and delivered as source. It's relatively easy to go and open an Add-on and inspect its code, so if an Add-on does something other than what it says it does, it's likely that sooner or later someone will discover the truth.

A really interesting Firefox Add-on called "Glue" is being offered by a venture-funded company called AdaptiveBlue. (no relation whatsoever to my company, Gluejar, Inc.) Glue watches you browse the internet and when it sees you on one of a set of sites that it knows about, it reports the pages you're on to AdaptiveBlue, enabling them to construct a "Social Network of Things", where the Things might be Books, Music, Products, Wine, Companies, etc.


Overall there are over 300 sites that the Glue Add-on does something with. A lot goes on in Glue, and I didn't take the time to sort everything out. For example, when you go to a topic page in Wikipedia, a book page in WorldCat, or a stock page in Yahoo Finance, the URL that you visited is reported to AdaptiveBlue. Usually, the Add-on then slides down a Glue header which tells you what the Glue Social Network thinks about the Thing you are looking at. Personally, I find this very distracting, and I don't plan to continue using Glue, but I can imagine that many people will appreciate the consistent interface it presents to the social network and other services. Other sites handled by Glue include LibraryThing, Epicurious, ESPN, theStreet, ToysRUs, Expedia, GameSpy, Metacritic, WineLibrary, Flixster, Connotea, Flickr, Technorati, Walmart and eBay, just to name a few. It was very difficult to find the official list of sites that Glue works with on the GetGlue web site; I wish the AdaptiveBlue people were more upfront about exactly what they do on these sites. Nonetheless, the Add-on appears to do what it says it does. I also would like to see the user given more control over the sorts of things that are reported to AdaptiveBlue- I'm much more relaxed about sharing my Wine and Sports browsing than I am about my Wikipedia and Stocks browsing. And I really don't want to share my Russian XXX site browsing!

It's interesting to compare Glue to the OpenURL linking services that have been almost universally adopted in libraries. (I developed one of the first OpenURL link servers, which is now owned by OCLC, Inc.) Like Glue, the OpenURL link servers present users with relevant information and links to services surrounding "things" which are typically journal articles or books. One library that I worked with even used a social network to connect users to other users who had viewed the same item, just like Glue. There was even a Firefox Add-on developed that routed "thing" links to link servers. The link server vendor community worked with publishers closely to enable OpenURL linking; although AdaptiveBlue promotes its "SmartLinks", I doubt that many of the sites Glue is aware of understand what they are doing.

Glue makes heavy use of Amazon web services, including the product information web service, the SimpleDB service and the S3 simple storage service. It's smart these days to outsource scalability and concentrate on your application's functions. Glue also makes nice use of the Dojo and Mochikit Javascript toolkits. In browsing the code, I noticed that many of the problems it addressed were exactly the same ones we encountered developing Linkbaton 9 years ago, and the solutions look quite similar (in other words, I think the developers have done a pretty good job!), except that the tools available today are so much more advanced than what we had to work with 9 years ago.

Given that AdaptiveBlue makes a big deal about the Semantic-ness of its technology, I was surprised to find out how it identifies "Things". The canonical way to identify a Thing on the semantic web is to give it a URI, and then attach properties to it. When I spoke with AdaptiveBlue founder and CEO Alex Iskold at the Semantic Technology Conference, he told me that they only use title and author strings to define book Things. In fact, they bundle these strings into keys (such as books/cryptonomicon/neal_stephenson), then use the keys as if they identified a book, when in the real world, it's more complicated. So the "Things" in the AdaptiveBlue "Social Network of Things" are entities that do not correspond to books, but rather correspond to descriptions of books. Interestingly, this is exactly the approach taken in OpenURL URIs, which are really descriptive metadata packages, not entity URIs.
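A key like books/cryptonomicon/neal_stephenson could plausibly be built along these lines; the normalization rules here are my guess at the kind of flattening involved, not AdaptiveBlue's actual code.

```python
import re

def book_key(title, author):
    """Hypothetical reconstruction of a title/author key: lowercase,
    collapse anything that isn't a letter or digit into underscores."""
    slug = lambda s: re.sub(r"[^a-z0-9]+", "_", s.lower()).strip("_")
    return "books/{}/{}".format(slug(title), slug(author))

key = book_key("Cryptonomicon", "Neal Stephenson")
print(key)  # books/cryptonomicon/neal_stephenson
```

Notice what gets lost: every edition, translation, and format of the book collapses into the same key, which is why the key is a description of a book rather than a name for one particular book entity.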

The first of Tim Berners-Lee's "Four Rules" for Linked Data is "Use URIs as names for things". Both Glue and OpenURL, which were designed separately as practical solutions for linking to things, seem to break this rule. Instead they build URIs using descriptions of the things, and don't bother naming the things themselves. Maybe Tim BL's first rule is wrong!