Thursday, December 31, 2009

Do Libraries Have a Role in the Coming e-Book Economy?

You've probably heard it said that in Chinese, the word for "crisis" is composed from the words for "danger" and "opportunity". In the same presentation, you probably heard that there's no "I" in "TEAM". If you were skeptical of these attempts to extract wisdom from the way language is written, you had good reason. The story about the Chinese word for crisis is not true. And even if it were true, it would be about as meaningful as the fact that the English word "SLAUGHTER" contains the word "LAUGHTER".

During my brief time working in "middle management", I was required to do "SWOT Analysis". SWOT stands for "Strengths, Weaknesses, Opportunities, Threats". As a planning exercise, it was quite useful, but it became comical when used as a management tool. Everyone understood the fake Chinese crisis wisdom, and we all made sure that our threats were the same as our opportunities, and our weaknesses were also our strengths.

On this last day of the "0"s, I've been reading a lot of prognostication about the next ten years. It's very relevant to this blog, as I've been using it to help me think about what to do next. Some things are not too hard to imagine: the current newspaper industry will shrink to maybe 10% of its current size; the book publishing industry will reshuffle during the transition to e-books; Google will become middle-aged. The SWOT analysis for these will be easy.

The SWOT analysis that I have trouble with is the one for libraries. What threats to libraries will arise? Will libraries as we know them even exist in 10 years?

I've heard publishers say they believe that there will be no role at all for libraries in the developing e-book ecosystem. If that's not a threat, I don't know what is! On the other hand, there's the example of the Barnes and Noble e-book reader, the Nook, which has the intriguing feature of letting you read books without buying them while you're in the bookstore! If there's a role for brick and mortar bookstores in the e-book ecosystem, then surely there's a role for libraries.

In thinking about what roles libraries will play when all books are e-books, I keep coming back to a conclusion that sounds odd at first: the prospective role of libraries will be entwined with that of piracy in the e-book ecosystem.

While there are fundamental differences between e-book libraries and e-book pirates, there are important similarities. As I noted in my article on copyright enforcement for e-books, libraries have traditionally played an important role in providing free access to print books; e-book pirates have as their mission the provision of free access to e-books. For this reason, libraries and pirates would occupy the same "market space" in an e-book ecosystem. This is not to say that libraries and pirates would be direct competitors; it's hard to imagine pirate sites appealing to many of the people who patronize libraries.

So where is the "threat" to libraries? Think about how book publishers will need to respond to the threat of e-book piracy. I've argued that publishers should do everything they can to reward e-book purchases, but that addresses only the high price segment of the market. Public libraries address the low-price segment of the market, providing books to people with a low willingness or ability to pay for access, while still providing a revenue stream for the publishers. To keep pirates from capturing this market in the e-book economy, publishers will need to facilitate the creation of services targeted at this market.

An analogy from the video business is appropriate here. DVDs can only satisfy part of the digital video market. Though it's taken a while for the studios to realize it, in order to compete effectively with video pirates, the movie studios need to have digital offerings that provide movies for free.

What will the free e-book services look like? Perhaps they'll be advertising sponsored services like Google Books. Perhaps they'll be publisher- or genre-specific subscription services that provide people a "free book" experience at a fixed monthly price. Unfortunately, it seems a bit unnatural that publishers would turn to libraries to create the sort of services that could replicate the role of the library in the e-book ecosystem- libraries just aren't entrepreneurial in that way.

Somehow I don't think that book publishers will warm to a "Napster for e-Books", even if it were labeled "e-Book Inter-Library Loan".

Still, I'm optimistic. Some horrific mashup of Open Library, Google Books, LibraryThing, WorldCat, BookShare, Facebook, Freebase, RapidShare and the Mechanical Turk is going to be just the thing to save both libraries and publishers. You heard it here first. And if you find it scary- don't forget that you can't spell e-Book without BOO!

Monday, December 21, 2009

Copyright Enforcement for eBooks: Cultural Life Preserver or Orwellian Nightmare?

I'm a 7 MPH speeder. When I'm on an empty highway with a 65 MPH speed limit, I drive 72. This puts my car in roughly the 70th percentile of car speed. But when some idiot comes zooming past at 85, I cheer when I see him stopped by the cops 5 minutes later.

Last time I was in England, I was appalled to find that cameras had been installed along some of the motorways that would send you a speeding ticket automatically if you averaged more than the speed limit. I told the limo driver that Americans would elect a black president long before we'd tolerate speed cameras on the freeway. I was right.

I'm no legal theorist, but I know better than to think that human behavior is determined by laws- laws only work as far as they reflect a social consensus. It's true for driving and it's also true for reading, listening to music, and watching videos. As behaviors change due to the introduction of technology, society is forced to modify social norms for behavior.

The book publishing industry is at the beginning of a technology driven change in the way that people read books, and the shape of the consensus that emerges will determine how creative production is sustained. (Same for news, but that's another story entirely!)

Social consensus has a lot of inertia because if people and institutions don't have to change, they won't. Think about how social consensus evolved when music became digital. It used to be that people expected to be able to listen for free, via radio, and expected to pay to "keep" the music. Once paid for, people expected to be able to share their records with friends in a variety of ways.

When music became digital with the advent of the compact disc, very little changed, at least for a while. The addition of internet distribution, however, allowed Napster to stretch the "sharing" behavior so as to cover free listening and threaten the buy-to-own behavior. The music industry responded with legal action, but its failure to provide convenient, authorized activities to cover accustomed behaviors gave Napster an effective monopoly on digitally distributed music. If not for the social habit of paying to keep music, the music industry might well have collapsed. With the takedown of Napster and the rise of authorized services like iTunes, Pandora and Spotify, the music industry has begun to successfully reshape user behavior forged by easy unauthorized file sharing, but its mistakes have clearly hurt.

The movie industry has had more luck with the onset of digital distribution. People still expect to watch TV for free, and to pay for premium entertainment at the movies. The internet bandwidth needed to easily move video files has become available at about the same time as distribution sites such as Hulu and Netflix, so pirates have never had much of a monopoly on digital movie distribution. YouTube offers a flood of free video content, and it works with rightsholders to identify and remove unauthorized uses of their work. A large amount of unauthorized distribution has occurred, but the movie industry has responded with both the carrot and the stick: it has provided an enhanced in-theater experience, inexpensive secondary distribution channels and specialized DVD content, and made deals with YouTube, while pursuing takedowns and ostentatiously prosecuting copyright infringements. Certainly the movie industry has made some missteps, but a blockbuster movie can still gross a billion dollars.

People have always expected to pay to own books, but once bought, the books could be freely borrowed from friends or libraries, and a vibrant used-books market makes older works available at very low cost. The biggest change brought about by digital distribution is the flood of free material available on a huge variety of websites, from blogs to wikis to traditional news.

It's not clear how book (including ebook) sales will be impacted by unauthorized distribution of digital copies. Although I've noted that it's relatively easy to find and identify unauthorized copies of works like Harry Potter and the Deathly Hallows, it's not likely that people will change their book buying behavior unless they have to. That's why I find it surprising that J. K. Rowling and her publishers are giving the pirates a near monopoly on the digital version of that particular book.

I've heard publishers say that they've learned from the example of the music industry that the threat of piracy makes DRM (digital rights management) a necessity for distribution of ebook content. In fact, almost the opposite is true. Publishers have been distributing books for hundreds of years without DRM. A potential pirate doesn't need to crack any encryption; they need only buy a single copy of the book and scan it. I wrote about the advent of cheap book scanners in October; Wired has a recent article.

Pirating a book is somewhat more difficult than pirating a song, but comparable to pirating a movie. The first step is to acquire a digital copy. Popular books are easy to obtain and a professional pirate would likely remove the binding with a saw and feed the pages into a high-speed copier/scanner. (Until the DVD comes out, a pirate typically sits in a theater and films the movie; the DRM on DVDs is trivial to crack.)

The digital file would then either be seeded onto a peer-to-peer network or uploaded to a file distribution or streaming site similar to RapidShare. Studies by Arbor, Cisco, and Sandvine suggest that P2P networks are declining in popularity compared to the file distribution sites, especially in countries with high broadband penetration.

In a peer-to-peer network such as those using the BitTorrent protocol, the work is divided between tracker sites and the peers which provide the actual files. The use of many peers allows high-volume distribution without needing a high bandwidth internet connection. Since the RIAA and others began filing lawsuits against people thought to be involved in providing files, the remaining networks have adopted social networking and encryption to make sure that they can no longer be easily monitored.

File distribution sites are being used more and more as broadband connections become widespread. These sites have many legitimate uses, and will respond to takedown notices when illicit content is identified on their sites (although in some countries, the takedowns are processed with the underwhelming speed of a bank's electronic funds transfer). The links and metadata for the illicit files mostly appear on third party sites, which complicates any enforcement action. Ironically, sites such as RapidShare have become so popular that to use them easily you really have to purchase a premium subscription!

Still, digital book piracy has already begun to appear in significant amounts. According to Brad Beautlich, Sales Director at DtecNet, textbooks, including law and medical textbooks, are now frequently appearing on the content distribution sites and torrent indexes favored by copyright infringers. These tend to be expensive items sold in cost-sensitive markets, which increases the incentives for unauthorized use. The sites appear to have very few books that have been cracked from digital versions; most of the book content currently available is clearly derived from scanned print.

The lack of pirated e-reader files (such as Kindle or epub files) is consistent with the profile of e-reader early adopters, who tend to be older and not particularly price sensitive. I assume it's because older users tend to have bad eyes and full shelves. They're unwilling to install P2P client software and unlikely to be attracted by the sort of advertising found on file index sites. Readers in developing countries may be in different situations.

DtecNet is a company that has been providing detection services to media companies. They offer to seek out, document and help to take down unauthorized content from web sites and file sharing networks. Their task can be difficult, as they need to scan and monitor indexing sites that may cloak the identity of a file ("NITM2" instead of Night in the Museum 2) and figure out from user comments in multiple languages whether a file is genuine or not.

Beautlich suggests that although monitoring from his company would be expensive ($4000-5000/month for a Harry Potterish project), an early investment in copyright enforcement by the book industry might be more effective than a strategy of waiting for a larger threat to arrive.

Another strategy to modify user behavior is being pursued by Audible Magic. Audible Magic has a rather different business model from DtecNet. Instead of working for rightsholders, Audible Magic provides content identification services to ISPs, educational institutions, and content distribution services, helping them minimize their liability for copyright infringement. In the US, the Higher Education Opportunity Act (HEOA) of 2008 requires colleges and universities to have "A plan to 'effectively combat' copyright abuse on the campus network using 'a variety of technology-based deterrents'."

Audible Magic provides an appliance that attaches to a router or gateway within the client's network. The appliance "listens" to network traffic, and when it recognizes copyrighted content being transferred in ways that connote unauthorized use, it either logs a report or attempts an intervention. According to Jay Friedman, Audible Magic's Vice President for Marketing, over 100 university campuses are using their systems. Pricing depends on the amount of bandwidth used by the university and can be as little as a few thousand dollars a year.

Interventions are positioned in a "graduated response" model. For example, a user's next webpage download might be replaced by a page suggesting that unauthorized activity may have occurred, along with a reminder of an institution's usage policies. Continued infractions might result in the user being put in a "timeout", followed by a human mediated intervention.
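To make the "graduated response" idea concrete, here is a minimal sketch in Python. It is only an illustration of the escalation pattern described above, not a description of how Audible Magic's appliance actually works; the step names, the one-infraction-per-step thresholds and the `GraduatedResponse` class are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical escalation ladder, loosely following the text:
# a warning page first, then a "timeout", then a human-mediated intervention.
RESPONSES = ["warning_page", "timeout", "human_review"]


class GraduatedResponse:
    """Tracks infractions per user and returns the next response step."""

    def __init__(self) -> None:
        self.infractions = Counter()

    def record(self, user: str) -> str:
        """Record one detected infraction and return the response to apply."""
        self.infractions[user] += 1
        # Escalate one step per infraction, staying on the last step afterwards.
        step = min(self.infractions[user], len(RESPONSES)) - 1
        return RESPONSES[step]
```

A real deployment would presumably also let infractions expire over time; that detail is left out here.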

If you find it big-brotherish to have an "appliance" looking over your shoulder to see whether you're infringing copyrights, you wouldn't be alone. The Electronic Frontier Foundation has warned that Audible Magic's service offering is "no magic bullet", and is concerned that this type of content monitoring would be a threat to individual privacy rights. It's one thing for universities and corporations to be proactive in avoiding copyright infringement liability, but imagine what it would be like if this sort of monitoring were a legal requirement! Public Knowledge has published an excellent overview of the issues surrounding this sort of network monitoring.

In fact, international treaties and legislation requiring ISPs to adopt "three strikes" graduated response policies culminating in loss of internet connection are being considered in Europe and other parts of the world. While many book publishers would be horrified to buy into these sorts of copyright enforcement regimes, at the same time they are aghast at the prospect of having their content pirated and their livelihoods destroyed.

Think about the speed limit monitor in the accompanying photo. Based on my observation it is very effective at modifying the behavior of drivers. 7-MPH speeders like myself become 1-MPH speeders. I don't think anyone minds being monitored by this sign- there is confidence that it's not doing anything other than measuring and displaying our speed. In contrast, hidden speed traps seem evil- they don't slow people down unless they own radar detectors; the egregious speeders are not the ones who get caught! Copyright enforcement for ebooks should be as much like that sign as possible. As Princeton's Ed Felten has observed, the ideal copyright enforcement system exhibits maximal compliance and minimal prosecution. Especially for books, monitoring systems should be as open as possible and visible to users to maximize compliance and to create confidence they are not also snooping on reading habits.

It's interesting to read about the experiences of a university that implemented monitoring of P2P networks to comply with HEOA. Illinois State's Digital Citizen Project's summary of "Escalated Response System Testing Utilizing Audible Magic Copysense" (pdf, 1.5 MB) is valuable reading. While it's hard to be sure that Illinois State's program was effective (you can't measure events that have evaded detection), I found it interesting that Illinois State students expressed minimal complaints or concern about the program.

A company with content identification technologies similar to those of Audible Magic is Nexicon. Both companies have agreements in place to work with YouTube to help to identify copyrighted material in uploaded videos, but Nexicon's business model aligns them with enforcement-oriented rightsholders. Here's how Nexicon President Sam Glines describes their flagship services:
"Through our GetAmnesty and PayArtists solutions, we share with the rights holders settlements collected via the DMCA notices sent to infringers. The copyright holder sets the dollar amount per infringement - in the case of PayArtists and for Frank Zappa, the settlement amount is $10 per infringement. Nexicon’s MARC platform is capable of sending 95 million DMCA notices each day. Nexicon’s MARC platform monitors billions of illegal downloads of copyrighted material on a daily basis."
Nexicon has recently been involved in controversial takedown notices which Prof. Mike Freedman of Princeton's Center for Information Technology Policy describes as "inaccurate enforcement". In addition to defending Frank Zappa's copyright interests, Nexicon, a public company, boasts about fighting child pornography. At the same time, it appears to be associated with a New Jersey company that represents pornography publishers in their battle against copyright pirates. It can be hard for a technology company to control how its customers employ technology, but I would like to see a clearer and more coherent explanation of what happened to Freedman than Nexicon has provided to date.

Identification of ebooks is rather a different endeavor from identification of video or audio files. Copyrighted content in audio and video files can be identified in a number of ways, including watermarking, hashing and fingerprinting. As its name implies, Audible Magic's roots are in the audio fingerprinting area, and its huge library of 7 million song fingerprints is a significant asset, but they increasingly need to use textual clues such as those required for eBook identification and are interested in further developing book-related identification techniques. As I've written previously, textual fingerprints are surprisingly effective at identifying books, even using a single sentence.
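For readers wondering what a textual fingerprint might look like in practice, here is a rough sketch in Python. It hashes overlapping runs of words ("shingles") after stripping case and punctuation, which is one common approach; it is not the method used by Audible Magic or any other company mentioned here, and the function names and the 8-word window are assumptions chosen for illustration.

```python
import hashlib
import re


def normalize(text: str) -> list:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def fingerprints(text: str, n: int = 8) -> set:
    """Return a set of SHA-1 hashes, one per run of n consecutive words."""
    words = normalize(text)
    return {
        hashlib.sha1(" ".join(words[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(words) - n + 1)
    }


def overlap(candidate: str, reference: str, n: int = 8) -> float:
    """Fraction of the candidate's fingerprints that also occur in the reference."""
    cand = fingerprints(candidate, n)
    if not cand:
        return 0.0  # candidate is shorter than one n-word window
    return len(cand & fingerprints(reference, n)) / len(cand)
```

Because the text is normalized before hashing, a scanned-and-OCRed sentence with different capitalization or punctuation can still match the publisher's reference text, though real OCR errors inside a window would break that window's hash.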

Book publishers preparing to fight piracy need to first and foremost have their content ready to be identified. While metadata, epub files and the like will be useful in locating and identifying pirated content that includes OCRed text, scanned images of books are also likely to be useful for the development of content recognition systems. If book publishers don't at least have scan files of every book they own, now is the time for them to start scanning!

Enforcement is only one weapon in the fight against book piracy, and it is the one weapon that most quickly loses effectiveness, as the techniques of copyright evaders evolve. One potential weapon that should be avoided is the dirty trick. If book publishers are unable to learn from the Sony rootkit fiasco, they will get all the ill will and lawsuits they deserve.

The shaping of societal behavior is a hopeless endeavor if the stick is wielded without a corresponding carrot. Any psychologist will tell you that the most powerful tool in modifying human behavior is positive reinforcement. If ebooks are to succeed commercially, publishers must use every means possible to reward people who purchase ebooks. I hope to write more about this soon, but I believe that positive reinforcement is the best lens to look at DRM with. DRM will fail unless its users believe it is rewarding them with convenience and ease of use, and with sufficient reward, it is also unnecessary - that is the lesson of iTunes.

As the era of digital books dawns, book publishers should expect that business models will change. Their mission, if they choose to accept it, is not only to deal with unauthorized use, but also to lead users to a social consensus that benefits everyone.

Update: In this post, I managed to overlook Attributor. Here's a post about them.



Tuesday, December 15, 2009

Holiday Product Management and the Hardcover Ebook

The product development team responsible for Christmas should get an award. I can imagine the brainstorming sessions: "Let's have a holiday where travelers get stranded and get to sleep with smelly animals!"

"No, that won't work. How about a holiday where poor people visit maternity wards in hospitals?"

"Crackpot. Wait...I've got it... How about if everyone gets PRESENTS!"

Positive reinforcement can go a long way towards creating success. E-books these days are sort of like the current state of my Christmas tree. Beautiful, smells good, but no decorations or presents. In my previous article discussing copyright enforcement, I stressed that if publishers really want to fight off piracy they need to find ways to positively reinforce the ebook purchase experience. In other words, decorate your Christmas ebook and put lots of presents under it.

Step one. It should be completely painless and hassle-free to buy and use a piece of content. Duh. Amazon has figured this out. Apple has figured it out with iTunes. If it comes to a choice between secure DRM (which will get cracked anyway) and making it easy for customers, always do what the users sending you money want you to do.

Step two. Decorate! When someone gets an ebook, it should be customized with the purchaser's name and a pretty frontispiece that says "this book is being lovingly read and cared for by PURCHASER'S NAME HERE", and every ebook file should have a unique serial number. For gifts, of course the message would be different. If the e-reader marketers and developers were really attuned to the publishing ecosystem, they would hardwire a cryptographic signature reader into their devices that would be able to tell a "genuine" ebook frontispiece from a fake one. It's not technically hard.
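As a rough illustration of why this isn't technically hard, here is a sketch in Python of a frontispiece that carries a verifiable tag over the serial number and purchaser name. To keep it short and standard-library only, it uses an HMAC with a shared secret rather than a true public-key signature; a device-side "signature reader" would more plausibly hold only a public verification key. The key, field names and functions are all hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical publisher secret. An HMAC is a stand-in here: a real scheme
# would likely use a public-key signature so devices never hold the secret.
PUBLISHER_KEY = secrets.token_bytes(32)


def make_frontispiece(purchaser: str, serial: str, key: bytes = PUBLISHER_KEY) -> dict:
    """Build a personalized frontispiece record with an authentication tag."""
    message = f"{serial}|{purchaser}".encode("utf-8")
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {"purchaser": purchaser, "serial": serial, "tag": tag}


def is_genuine(frontispiece: dict, key: bytes = PUBLISHER_KEY) -> bool:
    """Recompute the tag and compare it in constant time."""
    message = f"{frontispiece['serial']}|{frontispiece['purchaser']}".encode("utf-8")
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, frontispiece["tag"])
```

Changing either the name or the serial number without knowing the key invalidates the tag, which is all the "genuine versus fake" check needs.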

Step three. Give out presents! The wonderful thing about ebooks is that the reproduction cost is zero. You can afford to give your customers free stuff! Once they register their unique ebook serial number, send them a "personalized" thank you note from the author. Give them a free book from an unknown author that they would never, ever have bought on their own.

Step four. Give out more presents! If you want to reward genuine ebook purchases in places like China, turn ebook registration into a raffle. Put a golden ticket in every millionth ebook, and a silver ticket in every thousandth. Give something to everyone who registers, even if it's just a virtual badge, or "frequent-reader" points. People may start buying ebooks just for a chance at a payout. Other people will try to register fake serial numbers, and for free, you will get both marketing and pirate tracking data.
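The raffle and the "free pirate tracking data" can be sketched in a few lines of Python. This is only a toy model under stated assumptions: serial numbers are taken to be sequential positive integers, and the `Registry` class and its fields are invented for the example.

```python
def ticket_for(serial_number: int) -> str:
    """Golden ticket for every millionth ebook, silver for every thousandth."""
    if serial_number <= 0:
        raise ValueError("serial numbers are assumed to be positive integers")
    if serial_number % 1_000_000 == 0:
        return "golden"
    if serial_number % 1_000 == 0:
        return "silver"
    return "badge"  # everyone who registers gets something


class Registry:
    """Hypothetical registration service that also logs suspicious attempts."""

    def __init__(self, issued_serials) -> None:
        self.issued = set(issued_serials)
        self.registered = set()
        self.suspicious = []  # fake or reused serials: the tracking data

    def register(self, serial_number: int):
        """Return the prize for a valid first registration, else None."""
        if serial_number not in self.issued or serial_number in self.registered:
            self.suspicious.append(serial_number)
            return None
        self.registered.add(serial_number)
        return ticket_for(serial_number)
```

In this sketch a serial that was never issued, or one that is registered twice, simply ends up in the `suspicious` list for later analysis.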

Step five. Regifting! If someone has paid you good money for an ebook, let them give it to a friend after six months or so (as long as they've registered it!) If they're a platinum-level frequent buyer, let them keep their own registered copy, too.

If ebook publishers get really good at adding value to their products, they could consider rolling out the "Hardcover Ebook". Current practice in the print book industry for "trade" books is to initially offer only a hardcover version. A year later, the same book is released as a softcover at a substantially lower price. The effect of this is to partition the market and capture revenue from more potential purchasers. Consumers accept this partitioning partly because they assign some value to the hard cover- they assume the hard cover is more expensive to produce.

Recently, there's been much discussion about publishers holding back titles from the ebook market to protect hardcover sales. An ebook with enhanced value comparable to a hard cover print book could be offered on the initial release date at a higher price so as to prevent pirates from having a monopoly on the ebook.

Is there a Grinch preventing the ebook Christmas party? As long as ebook publishing is captive to distribution platforms, innovations in ebook value may be difficult to implement. Amazon's interests do not always align with those of publishers. In particular, ebook serial numbers that register with publishers are not going to be high on the Kindle development queue.

Even the Grinch learned what Christmas is really about. You won't get a hardcover ebook in your stocking this year, but have a great Holiday anyway!

When Amazon and Kindle Win

Last week, I went to a talk given by Mike Shatzkin, who has been doing a good job preparing book publishers for their future. One of his points was that the growing market power of Amazon has the potential to disrupt the way that book publishers do business.

After the talk, I was inspired to go home and do a tiny bit of math. It occurred to me that it should be possible to write an equation to describe the point at which authors would find it to their advantage to sell directly through Amazon and Kindle rather than working through publishers. After filling up a spreadsheet, I boiled everything down to one equation that defines the moment that Amazon wins, and the Big 6 publishing houses (Hachette, HarperCollins, Macmillan, Penguin, Random House, and Simon & Schuster) would lose.

Let k be the Kindle's maximum market share for a book's market. In other words, if Going Rogue: An American Life sells a million copies, and 200,000 of those could be Kindle e-books, then k=20%.

Let d be the fraction of the list price that Amazon has to pay publishers for the right to sell a Kindle version. This number is currently 50%. For front-list titles, Amazon is currently subsidizing the Kindle pricing. That means that Amazon pays the publisher $15 to be able to offer a book with a retail price of $30 for $10 on the Kindle. (For Going Rogue, the Kindle edition has a pre-order price of $8, and the publisher will get $14.)

Let r be the author's gross royalty rate from a publisher. In other words, if Going Rogue lists at $28, sells a million, and Sarah gets $2.8 million, then r=10%. It's not uncommon for authors to receive net royalties of 5-10% on print and 25% on e-books; the corresponding gross royalties would be 2.5-5% for print and 12.5% on ebooks, if you assume a 50% wholesale discount.

Finally, let f be the percentage of the book's list price that goes to production cost. For blockbuster books, f can be quite small, because fixed costs are amortized over a very large number of books. It goes without saying that reproduction costs for the Kindle are close to zero.

If

d*k > r+f,

then it is advantageous for Amazon and Sarah to deal directly, cutting out the publisher completely.
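To see how the inequality d*k > r+f plays out with numbers, here is a small Python sketch. The values of d, k and r are the ones quoted in the text; the value of f (2% of list price) is purely an assumption for illustration, and the function names are made up for this example.

```python
def amazon_wins(d: float, k: float, r: float, f: float) -> bool:
    """True when d*k > r + f, the condition stated in the text."""
    return d * k > r + f


def break_even_kindle_share(d: float, r: float, f: float) -> float:
    """Kindle share k at which d*k exactly equals r + f."""
    return (r + f) / d


def gross_royalty(net_royalty: float, wholesale_discount: float = 0.5) -> float:
    """Convert a royalty on net receipts into a fraction of list price."""
    return net_royalty * (1.0 - wholesale_discount)


d = 0.50  # Amazon pays publishers 50% of list price (from the text)
k = 0.20  # Kindle share in the Going Rogue example (from the text)
r = 0.10  # gross royalty in the Going Rogue example (from the text)
f = 0.02  # production cost share: an assumed value, not from the text

print(amazon_wins(d, k, r, f))
print(break_even_kindle_share(d, r, f))
```

With these inputs d*k is 0.10 while r+f is 0.12, so the condition is not yet met; the Kindle share would have to reach about 24% before going direct pays off. The helper `gross_royalty(0.25)` reproduces the 12.5% ebook figure mentioned above.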

I should note that this ignores the possible profit Amazon might be able to derive from print versions. Do you think they have the logistics expertise to do that?

It's also worth considering the effect of fixed costs. Publishers who specialize in producing many titles that sell only moderately well (i.e. f is large) are probably less at risk of Amazonian disintermediation than those who rely on blockbuster books.

Upon sharing this with a group of people with more experience in the book publishing world, I got the following reactions:
  • Sarah Palin would never go direct with Amazon because what she really wants is a big media campaign.
    Is there a law that prevents Amazon from doing a big media campaign?

  • What about advances? How would authors manage without advances?
    If you put some private equity together with unemployed publishers and bankers, the advances problem might have a solution.

  • Amazon would never want to become a publisher. Think of the headaches of handling idiosyncratic authors.
    Author agents may play a useful role in easing Amazon's difficulties here. And agents will have lots of fun playing off Amazon against Apple and Google.

  • Amazon might be able to win authors that are already their own brand, but how will new authors be discovered and developed?
    Oprah, I have a business idea for you!

As if to prove my point, it came out yesterday that Stephen Covey, author of The 7 Habits of Highly Effective People, is moving his electronic rights to Amazon, which is going to heavily promote his books.

Give me a few more days and I'll have the differential equation. It won't look linear.

Wednesday, December 9, 2009

Supporting Attendance at Code4Lib

A month ago at the Charleston Conference, I was sitting in a keynote address about the future of libraries and the role of journals in scientific communication, and I got a bit fed up with the notion that scientists were some sort of exotic creatures who used libraries and information resources in ways that the library community needed to understand better. It occurred to me that a much better way to understand the needs of scholars was to just look around the room at the 300 people learning, communicating and synthesizing ideas with each other.

The Charleston Conference started in 1980 as a regional library acquisitions meeting with 24 attendees. This year was its 29th. It covered the world of scholarly information, library collections, preservation, pricing and archiving and it attracted well over 1000 publishers, vendors, librarians, electronic resource managers, and consultants from around the world. Its success is to a large extent the work of one person- Katina Strauch. Over the years, Katina's empire of hospitality has come to include print publications- Against the Grain and The Charleston Advisor, associated websites, and multiple blogs. The Charleston Conference has established itself as an important venue for many types of communication and learning; you might not call it scholarly communications, but so what?

Scientists and scholars aren't so different from librarians and publishers. They go to conferences, drink coffee and beer and learn in the sessions and in the hallways. They exchange business cards and send each other email. They tell stories about the experiments that failed. They gossip. The conferences provide them programs to take home and help them remember who said what. Occasionally someone mentions an article they found to be interesting, and everyone goes home to read it. The Charleston Conference and its associated business properties have grown nicely into the internet age and would be an appropriate model for emulation by the scholarly communication community.

Another vision for the future is provided by Code4Lib. Code4Lib started as a mailing list in 2003 as a forum for discussion of
all thing programming code for libraries. This is a place to discuss particular programming languages such as Java or Python, but is also provide a place to discuss the issues of programming in libraries in general.
At first, it grew slowly, but people quickly discovered how useful it was. Today it has almost 1,300 recipients and a very high signal to noise ratio.

In 2006, the first Code4Lib Conference was held at Oregon State University. The conference was inspired to some extent by the success of a similar conference, ACCESS, held every year in Canada. The Code4Lib Conference has always been self-organizing (organizationless, you might say), and has been quite successful. Presentations are selected by vote of potential attendees; participation is strongly encouraged using lightning talks and unconference sessions. The conference has tried to stay small and participatory, and as a result, registrations quickly fill up.

Code4Lib is also instantiated as channels of communication such as an IRC channel and a Journal, and the community never seems to fear trying new things. In many ways, it's still in its infancy; one wonders what it will look like if it ever gets to be as long-established as Charleston.

This February, the fifth Code4Lib Conference will take place in Asheville, North Carolina. I hope to be there. But with the "Global Economic Downturn" and library budgets being slashed, I worry that some people who might have a lot to contribute and the most to gain may be unable to go because they have lost their job or work in a library with horrific budget cuts. So, together with Eric Lease Morgan (who has been involved with Code4Lib from that very first eMail) I'm putting up a bit of money to support the expenses of people who want to go to Code4Lib this year. If other donors can join Eric and me, that would be wonderful, but so far I'm guessing that together we can support the travel expenses of two relatively frugal people.

If you would like to be considered, please send me an email as soon as possible, and before I wake up on Monday, December 14 at the latest. Please describe your economic hardship, your travel budget, and what you hope to get from the conference. Eric and I will use arbitrary and uncertain methods to decide who to support, and we'll inform you of our decision in time for you to register or not on Wednesday December 16, when registration opens.

If you want to help us with a matching contribution, it's not required to be named Eric.

Update: Michael Giarlo and one other member of the Code4Lib community have agreed to match, so it looks like we have enough to support 3 attendees.

Saturday, December 5, 2009

Business Idea #1: Library Development Jobs Shouldn't Be Lame

Cody Powell, a developer in Austin, Texas, has written a nice article about how to determine if a programming job is lame. The Codypo scale has 8 questions on it:
  1. Would I be paid below market rates?
  2. Would I always be on call?
  3. Am I the IT staff?
  4. Would I work with a single monitor?
  5. Will I be maintaining any ancient system, and what's it written in?
  6. Would my internet usage be filtered or monitored?
  7. Would I be the only programmer?
  8. Am I expected to travel every week?
I'm wondering how the typical library does on the Codypo programmer-job-lameness scale. If you work in a library, please please enter your result in the survey.

I have observed that most libraries need to employ software developers, but few libraries are good places for software developers to work. This is a bit of a paradox, because libraries have some wonderful problems for software developers to work on. The small number of libraries that have robust development departments- typically large research and corporate libraries - have done some amazing work.

Among the problems that libraries have in employing developers are divergent pay scales, small size, mentor availability, quality of supervision, development infrastructure, and the herding of cats problem. Their jokes are different.

Partly as a response to this, libraries have tended to outsource the bulk of their technology development to their systems vendors. For the most part, this solution has been adequate, but costly. Increasingly though, libraries find their support costs rising, their service declining, and they are left with increasingly complex integration tasks, especially if they are so bold as to use systems from more than one vendor.

One response to dissatisfaction with vendor support has been a shift to open-source and/or cloud-based solutions. The advantage of open-source firms is that they can't afford to provide poor support, because nothing prevents competitors from providing the same support. The advantage with cloud-based systems is that a cloud vendor can provide basic service at a significantly lower cost.

The paradox of relying on open-source and cloud-based systems, however, is that it causes libraries to employ more development staff, not fewer. This is because open-source systems present new opportunities to improve, integrate and modernize services, and libraries feel obliged to contribute improvements into the open-source ecosystem. Cloud-based systems, in contrast, tend to be provided with minimal local support coupled with API-based integration options. As more and more of these solutions become available, the library's need to integrate multiple solutions (both cloud and open-source) increases.

As if libraries had nothing more to worry about, most of them are facing budget cutbacks, and in many cases, required staffing cuts. What if there was a company that could take on their development and software integration burden, provide their development staff with the environment, support and training they deserve, and commit to reducing their total expense? Maybe Gluejar could do that.

This is not a new idea- both IBM and HP/EDS have successfully demonstrated the viability of this business model for providing IT to large businesses (for a report on the market, take a look at this Forrester Wave report (765KB)). It's no coincidence that these companies (particularly IBM) are active in support of Open Source software and cloud-based services.

The library world is no stranger to staff outsourcing- companies such as Library Associates and LSSI have been assuming the management of libraries and library departments for quite a while now. In Japan, even academic libraries will outsource key functions to companies like Kinokuniya. There's been opposition to library staff outsourcing- how can librarians truly focus their attention on the needs of their institution or community if they are employed by a dispersed corporate entity? I think outsourcing librarians is quite different from outsourcing library development staff.

It's possible that the Open-Source Library systems companies like LibLime, Equinox, IndexData, ByWater and BibLibre will evolve towards this model. The current reality, however, is that most libraries must deal with many proprietary systems vendors as well, and these are often not predisposed towards working with the Open Source vendors who hope to displace them. To most effectively serve libraries, a company that manages a library's development staff must establish good relations with all types of technology and database vendors as well as information suppliers.

What I don't know is how desperate libraries are to fix the problems I've perceived, or how serious and widespread they are. It's entirely possible that the tight control over development staff is something that libraries are unwilling to give up.

I would be very interested to get feedback on this, either in private, or in the comments.

The Case Against Using Spoofed e-Books to Battle Piracy

I've known since I was four years old the difference between the Swedish Santa Claus and the American Santa Claus. The Swedish Santa Claus (the one who comes to our house) uses goats instead of reindeer and enters by the front door instead of the chimney. And instead of milk and cookies, the Swedish Santa Claus (a.k.a. Jul Tomten) always insists on a glass of glögg.

The glögg in our house was particularly good this year (used Cooks Illustrated recipe), so Jul Tomten stayed a bit longer than usual. I had a chance to ask him some questions.

"You're looking pretty relaxed this year, what's up?" I asked.

"It's this internet, you know. What with all the downloaded games, and music and e-books, my sleigh route takes only half the time it used to!"

"Really, that's amazing! I've read about the popularity of Kindle e-books, but I never imagined it might affect you! Are you worried that the sleigh and goat distribution channel will survive?"

"Oh not at all, Eric, remember, Christmas isn't about the goats, it's about the spirit! And even if all the presents could be distributed digitally, someone's got to go and drink the glögg, don't you think?"

"One thing I've been wondering, that list of yours, you know, the naughty and nice list... It must be very different now- do you look at people's Facebook profiles?"

"Ho ho ho ho. At the North Pole, your privacy is important to us, as the saying goes. Well, I'm going to let you in on a little secret. 'Naughty and Nice' is a bit of a misnomer. We never put coal in anyone's stocking. The way we look at it, there's goodness in each and every person."

"I guess I never thought of it that way."

"Just imagine how a child would feel if they woke up Christmas morning to find a lump of coal in their stocking! Even if the child was very naughty, do you a holiday disappointment would suddenly turn the child nice?"

"Besides, if we really wanted to put something useless in a stocking these days, it would be a VCR tape or an encyclopedia volume, not coal."

I've had a chance to reflect a bit on my chat with Santa, particularly about putting coal in naughty people's stockings. I've recently been studying how piracy might affect the emerging e-book market, and I've made suggestions about how to reinforce the practice of paying for e-books. But one respected book industry consultant and visionary, Mike Shatzkin, has suggested that the book industry should take the coal-in-the-stocking approach to pirated e-books.

In an article entitled Fighting piracy: our 3-point program, Shatzkin proposes as point #1:
Flood the sources of pirate ebooks with “frustrating” files. Publishers can use all sorts of sophisticated tricks to find pirated ebooks, like searching for particular strings of words in the text. (You’d be shocked at how few words it takes to uniquely identify a file!) But people looking for a file to read will probably search by title and author. So publishers can find the sources of pirated files most likely to be used by searching the same way, the simple way.

But, then, when publishers find those illicit files, instead of take-down notices, which is the antidote du jour, we’d suggest uploading 10 or 20 or 50 files for every one you find, except each of them should be deficient in a way that will be obvious if you try to read them but not if you just take a quick look. Repeat Chapter One four times before you go directly to Chapter Six. Give us a chapter or two with the words in alphabetical order. Just keep the file size the same as the “real” ebook would be.
Points 2 and 3 of Shatzkin's "program" are reasonably good ideas. But this point 1 is a real clunker.

I'll admit, when I first read Shatzkin's proposal for publishers to put "sludge" on file sharing sites, I thought it an idea worth considering. After having studied the issue, however, I think that acting on the idea would be foolish and shameful.

First of all, the idea is not original. The tactic of spoofing media files was deployed by the music industry in its battle against the file sharing networks that became popular after the demise of Napster. This tactic was promoted by MediaDefender, a company that also used questionable tactics such as denial of service attacks to shut down suspected pirate sites. Although the tactic was at first a somewhat effective nuisance for file sharers, the file sharing networks developed sophisticated defenses against this sort of attack. They adopted peer-review and reputation-rating systems so that deficient files and disreputable sharers could easily be discriminated. They instituted social peering networks so that untrusted file sharers could be excluded from the network of sharers. The culture of "may the downloader beware" has carried over for e-books. On one site I noted quite a bit of discussion of the true "last word" of Harry Potter and the Deathly Hallows along with chatter about file quality and the like. After seeing all this, Shatzkin's suggested point 1 seems quaint, to put it kindly.

The e-book-coal-in-the-stocking idea could also be dangerous if acted on. The tactic of disguising unwanted matter as attractive content has been widely adopted by attackers going back millennia to the builders of the Trojan Horse. The sludge could be as innocuous as an Amazon "buy-me" link with an embedded affiliate code, or it could be as malicious as a virus that lets a botnet take control of your computer if you open the file. When this really happened for video files, it was widely asserted, without any substantiation, that the viruses were planted by film industry operatives themselves. Thus, what began as a modest attempt to harass Napster file sharers ended up resulting in a smeared reputation for the film industry.

Obviously, Shatzkin is not advocating spoofing e-book files with harmful content on file sharing sites. But publishers who are tempted to follow his point #1 should consider the possibility that emitting large amounts of e-book sludge could provide ideal cover for scammers, spammers, phishers, and other cybercriminals. Then they should talk to their lawyers about "attractive nuisances" and "joint and several liability".

Go ahead and accuse me of believing in Santa Claus. I firmly believe that no matter what business you're in, not everybody gets corrupted. You have to have a little faith in people.

Friday, December 4, 2009

My Funnest Bugfix Ever: the Double Relative Redirect

Intel has been running an ad called "Our Jokes Aren't Like Your Jokes". In it, an actor portrays Mario Paniccia, Director of Intel's Photonics Technology Lab mischievously altering an equation on a futuristic see-through noteboard. Hilarity ensues.

There's a lot to remark on here. In what sense is the actor portraying Mario Paniccia? In what way is this ad promoting Intel's corporate image? Can you believe that I'm putting it on my blog, totally without getting paid for it?

The ad is hilarious because many of us recognize the truth it expresses. The very best geek humor is funny exactly because so few can appreciate it.

I am now going to attempt to detail the funnest bug I ever found. I've encountered fewer than ten people in the world who have understood it, and of those, only a few could laugh, because the others were complicit.

Five years ago, one of my linking technology company's users, a librarian-customer who was quite adept technically, had noticed that the links we were providing for one publisher were "flaky", and wondered if we could do anything about it. Our first two layers of support had trouble reproducing the problem, but they agreed that there was something flaky about the links, and they had a high regard for the librarian in question, so the problem got escalated to me (in a small company, even the President should be part of the customer support team). As it turned out, I had configured the link in question myself a year or two earlier, and I had a vague recollection that there was some hard-to-put-your-finger-on flakiness with it. So I started poking around.

The linking technologist's most important tool is a Firefox plugin called LiveHTTPHeaders. It lets you see how your browser is communicating with web servers. When I ran this tool, I immediately saw the problem. The links being sent to this publisher were being routed through a redirection chain. The first stage of the chain converted metadata-based links to database key links and set a session cookie. The next and final stage turned the key link into the actual content link for the user. The problem was that both of these redirects failed to comply with the HTTP standard for redirects.

When a link is embedded in a web page, it can be one of two types, relative or absolute. An absolute link uses a fully spelled-out address, including the scheme and host name. A relative link specifies the address of the link target relative to the current address. So, from this page, an earlier post on this blog could be addressed in relative form as /2009/09/redirector-chain-mashup-design-pattern.html. The browser has to do some computation to figure out the expanded address from a relative link.
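
As a rough illustration of that computation, here is a minimal sketch using `urljoin` from Python's standard library. The host name `blog.example.org` and the path of the "current page" are made up for the example; only the relative path comes from the paragraph above.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the reader is currently on.
current_page = "http://blog.example.org/2009/12/my-funnest-bugfix-ever.html"

# The relative link from the text: a path with no scheme or host.
relative_link = "/2009/09/redirector-chain-mashup-design-pattern.html"

# The browser borrows the scheme and host from the current page
# and combines them with the relative path.
expanded = urljoin(current_page, relative_link)
print(expanded)
# http://blog.example.org/2009/09/redirector-chain-mashup-design-pattern.html
```

The point to notice is that the expansion needs an absolute address to borrow from; with nothing absolute to start from, there is no host to attach the path to.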

The problem with the flaky links was that the server was sending back relative links in the redirect. Although you can do that in a web page, it's not allowed in the HTTP standard for the server. The most popular web server software, Apache, won't even permit a web developer to send a relative link; in fact, it will helpfully do the expansion calculation automatically. Our flaky link, however, was being served by Microsoft's IIS server software, which doesn't do the expansion. It's possible that the bug arose when a web application designed for Apache was deployed in IIS.

Although I could see where the problem was coming from, and I could see that our linking software was thankfully not to blame, I was thoroughly puzzled. Why did the link only seem flaky, instead of just failing? The answer must lie in the client software- the bits of code on the user's computer that talk to the server.

It's usually a good idea when implementing a communications protocol to be strict about what you send and forgiving about what you receive. So when a relative address is received by an HTTP client, it tries its best to recover from the error. It goes and gets the most recent address it's received and uses that to expand the relative address into an absolute address. So, in the vast majority of cases, users don't experience a problem when a server sends a relative redirect to their browser.

The flaky link problem I was looking at resulted in link failure because the server was sending two relative links in a row. The first link could get expanded just fine, but the second link only had a relative address to expand against, resulting in complete failure of the link.

How could a problem like this have escaped notice? And here we have the geek punch line: the expanded address gets put in the redirect cache! So the second time the user clicks on the link, the initial address is found in the cache, and the first relative redirect is skipped over. Only the second relative redirect is performed, and this time there's a full address from the cache to expand against, and the link works as intended.
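
For readers who would rather see the sequence than imagine it, here is a toy model of the behavior just described. It is only a sketch of the story as told here, not a description of how any particular browser is implemented; the URLs, the `SERVER` table and the `ForgivingClient` class are all hypothetical.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical server: for each absolute URL requested, the (relative)
# Location value it sends back in its redirect.
SERVER = {
    "http://journals.example.com/openurl?issn=1234-5678": "/key/42",
    "http://journals.example.com/key/42": "/content/article42.html",
}

def is_absolute(url):
    """An address counts as absolute here if it carries a scheme."""
    return bool(urlsplit(url).scheme)

class ForgivingClient:
    """Expands relative redirects against the most recent address it was handed."""

    def __init__(self):
        self.redirect_cache = {}  # requested URL -> expanded redirect target

    def click(self, url):
        most_recent = url  # the address the user clicked on
        while True:
            if url in self.redirect_cache:
                # Cache hit: skip the request; the cached target is absolute.
                url = most_recent = self.redirect_cache[url]
                continue
            location = SERVER.get(url)
            if location is None:
                return url  # no further redirect: this is the content
            if is_absolute(location):
                target = location
            elif is_absolute(most_recent):
                target = urljoin(most_recent, location)
            else:
                return None  # only a relative address to expand against
            self.redirect_cache[url] = target
            most_recent = location  # the raw header value, relative in this story
            url = target

client = ForgivingClient()
link = "http://journals.example.com/openurl?issn=1234-5678"
first_click = client.click(link)   # fails on the second relative redirect
second_click = client.click(link)  # the cache supplies a full address
print(first_click, second_click)
```

In this model the first click fails and the second succeeds, which is exactly the flakiness that made the bug so hard to report.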

If you are an HTTP geek of any sort, you can get up off the floor and stop laughing already. If you are not, read on, and I'll try to explain why it is truly hilarious.

Once I figured this bug out, I did some digging. I found that the same thing was happening not only for the OpenURL links that my software made for linking within a library, but also for CrossRef DOI links, which are used pervasively for linking between publishers. I determined that it was not just one publisher that had the problem, it was every publisher that used the same e-journal hosting service. There were about a thousand e-journals affected. And it had been that way for at least a year. My guess is that this link failure had frustrated users at least a million times.

Ninety nine times out of a hundred, the user would just try clicking the link again, and wouldn't think twice about the problem. But that other 1% (10,000 users) would ask a Local Expert for help- perhaps a librarian, perhaps a more tech-savvy friend, whatever. The Local Expert would extend her aura over the user's computer, the user would try it again, and the link would work like a charm.

Ninety nine times out of a hundred, the Local Expert would just smile, having "solved" the problem. But 1% of them (100) would smell something fishy, and ask the user to try another link, and it would fail. The local expert would go back to her desk, and try it a few more times, getting more and more mystified at the flakiness of the link, and decide to report it to a Responsible Party.

Ninety nine times out of those hundred, the Responsible Parties would, for a variety of reasons, fail to do anything about the flaky links. Maybe it was another company's problem. Maybe the problem was too hard, or not bad enough. Maybe they didn't really care.

The hundredth time, it was me. I actually enjoyed finding the bug. I knew the top technical guy at the e-journal hosting company, and I just sent him a description of the issue. Because he knew me, he paid attention, and the problem was fixed in their next update. Millions and millions of the links have worked since.

Isn't that hilarious?
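
For anyone who wants to check the arithmetic of that funnel, here is a small sketch. The starting figure of a million failures is my guess from above, and the function and stage names are made up for the illustration.

```python
def funnel(failures, stages):
    """Apply a one-in-a-hundred filter at each stage and record who is left."""
    remaining = failures
    results = []
    for stage in stages:
        remaining //= 100  # ninety nine out of a hundred drop out
        results.append((stage, remaining))
    return results

stages = [
    "users who ask a Local Expert",
    "Local Experts who report the problem",
    "Responsible Parties who do something",
]

for stage, count in funnel(1_000_000, stages):
    print(f"{count:>6}  {stage}")
```

Three filters of one percent each turn a million frustrations into a single fix.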

Thursday, December 3, 2009

The Story of Gluejar

When I register for conferences or meetings or that sort of thing, I'm often asked to fill out a company name. I write in "Gluejar Inc.", because I happen to be President of that New Jersey Corporation. So I get a preprinted badge that says my name and the company name right beneath it. I engage people in conversation and they start staring at my badge.

"'Gluejar' I haven't heard of that company. What does it do?" they ask me.

"Nothing." I usually say.

There's usually a nervous chuckle as they wait for the punch line, but there isn't any. "I'm sort of on sabbatical" I say. Or "I've been blogging a lot recently". Or even "I'm not really gainfully employed".

Sometimes, instead of "Nothing", I'd say "It's a slumber-mode start-up" which seemed to make people think I had a team of engineers secretly designing high-tech beds.

One friend says I'm "a gentleman of leisure".

My son has been pestering me to take up farming. He comes home and asks "How was non-work today, Dad?"

Sometimes I tell the true story of Gluejar:

Almost four years ago, I sold a linking technology business I started in 1998. It was called Openly Informatics, Inc. The buyer (OCLC), being a non-profit (501(c)(3)) corporation, chose not to acquire the corporate entity, but merely all of its business and (non-cash) assets, including me. I had to keep the company to fulfill contractual obligations. But since one of the assets they were buying was the name of the corporation, I had to get a new name for the company.

I called in my high-cost naming consultant (the aforementioned son, then 11) and asked for name ideas. He asked what the company would do, and I told him that for a while it would just be a container of cash, but that in the future, who knew? I liked doing things that made connections, that put things together, so maybe he could think of a name with that sort of connotation. He suggested "Goo-jar". I liked that; it seemed sticky, and jars are packages for Java applications, so that part was both containerish and softwarish.

I did some googling to see if I could get "", but I found that "goojar" was an alternate spelling of Gujar, an Indian ethnic group, so we made the name even stickier and settled on "Gluejar".

A little more than three years later, it was time for me to move on. Initially, I thought I might restart Gluejar right away, but I soon thought better of it. I decided that everything would be much simpler (and a whole lot more fun!) if I waited 9 months until certain commitments expired.

Instead, I started studying technologies, industries, and issues and writing about them on the blog. I started twittering! I went to meetups, conferences and workshops. I went inside a Federal Courtroom! I even did a harvest of purple amaranth.

So now it's time for me to get serious about deciding what to do next. In the next month or two, I plan to write about the directions I'm considering. Some of them involve reactivating Gluejar, some of them don't. I don't really believe that startup ideas can be stolen, so I'll try to write about as many of my ideas as I can. I like building things, and the first step in building on an idea is spreading it around.

Wednesday, December 2, 2009

Databases are Services, NOT Content

I'm very grateful for advice my fellow entrepreneurs have given me; when you meet someone else who has started a company you have an instant rapport from having shared a common experience. I remember each bit of advice with the same stark clarity that characterizes the moment I realized that Santa Claus was the neighbor dressed up in a white beard and a red suit.

A business owner who I've known since sixth grade gave me this gem: "My secret is providing the best service possible, and charging a lot for it." In executing my own business, I did pretty well at the first part and could have done better at the second part.

As I wrote about the effect of database rights on the postcode economy, I kept wondering if I would have done anything differently in my business if that database protection had been available to me. Would I have charged more for the database that my company developed?

In the comments to that post, I was alerted to a book by James Boyle, called The Public Domain. Chapter 9 in particular parallels many of the arguments I made. One thing I found there was something I had wanted to look for- information about how the database industries in general have done since the "sweat of the brow" theory for copyright was disallowed by the Supreme Court. It turns out that the US database industry has actually outpaced its counterpart in the UK since then by a substantial margin. Why would that be?

I think the answer is that building databases is fundamentally a service business. If your brow is really sweating, and someone is paying you to do it, then it's hard to think of that as a "content" business. Databases always have more content than anyone could ever want; the only reason people pay for them is that they help to solve some sort of problem. If your business thinks it's selling content rather than services, chances are it will focus on the wrong part of the business, and do poorly. In the US, since database companies understand that their competition can legally copy much of their data, they focus on providing high quality added value services, and guess what? THEY MAKE MORE MONEY!

Then there's Linked Data. Given that database provision is fundamentally a service business, is it even possible to make money by providing data as Linked Data? The typical means for protecting a database service business is to execute license agreements with customers. You make an agreement with your customer about the service you'll provide, how much you'll get paid, and how your customer may use your service. But once your data has been released into a Linked Data Cloud, it can be difficult to assert license conditions on the data you've released.

It's been argued that 'Linked Data' is just the Semantic Web, Rebranded, but it's also been noted that Linked Data is sorely in need of some proper product management. Product management focuses on a customer's problems and how the product can address them. You can believe me because I've not only managed products, I've had 2 whole days of real product management training!

One thing I was taught to do in my Product Management class was to come up with a 1 sentence pitch that captures the essence of the product. When I was an entrepreneur, this was called the elevator pitch. After thinking about it for about 9 months I've come up with a pitch for Linked Data:
Linked Data is the idea that the merger of a database produced by one provider and another database produced by a second provider has value much larger than that of the two separate databases.
or, in a more concise form, V(DB1+DB2)>>V(DB1)+V(DB2).
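
A toy illustration of the inequality, with entirely made-up data: one hypothetical provider knows journal titles by ISSN, another knows which hosting platform serves each ISSN. Neither database alone can answer "which titles are on a given platform?", but the merged one can. The ISSNs, titles, platform names and function names are all invented for the sketch.

```python
# Hypothetical database from provider 1: ISSN -> journal title.
titles = {
    "1234-5678": "Journal of Examples",
    "2345-6789": "Annals of Illustration",
    "3456-7890": "Letters on Sketches",
}

# Hypothetical database from provider 2: ISSN -> hosting platform.
platforms = {
    "1234-5678": "HostCo",
    "2345-6789": "OtherHost",
    "3456-7890": "HostCo",
}

def merge(db1, db2):
    """Link the two databases on their shared key (here, the ISSN)."""
    return {key: (db1[key], db2[key]) for key in db1.keys() & db2.keys()}

def titles_on(platform, merged):
    """A question that neither database can answer on its own."""
    return sorted(title for title, host in merged.values() if host == platform)

merged = merge(titles, platforms)
print(titles_on("HostCo", merged))
```

The surplus value is in the new questions that become answerable once the two sets of facts share a key.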

Based on the products that have been successful this year in the application of semantic web technologies, it looks to me that the most successful have been focused on what I saw Tim Gollins tweet that Ian Davis called "Linked Enterprise Data" (attributing the term to Eric Miller). If the merged databases are contained within the enterprise, the enterprise clearly reaps all the added value. Outside the enterprise, however, the only Linked Open Data winners so far have been the ones who have built services on databases merged from others.

Proper product management would have made it a goal for Linked Open Data to have data contributors share somehow in the surplus value created by the merged services. In the next couple of weeks, I hope to describe some ideas as to how this could happen.

Monday, November 30, 2009

ZIP vs. PAF: Has Database Copyright Enabled Postcode Data Business?

Have you ever noticed that there is no such field as "Legal Science"? That's because the scientific method is hard to apply to the development of laws. Just imagine applying an experimental law to one population while giving a placebo to a control population. Occasionally a circumstance appears where we can look for the effect of some particular bit of jurisprudence. Today's example is the database copyright. In the UK and other European countries, there is a special type of copyright (lawyers call it sui generis) that applies to databases. In the US, there has been no copyright for databases as such since 1991, even if they are the product of substantial investment.

In the US, databases can only be protected by copyright if they are expressions of human creativity. This is intended to be a fairly low bar. If the selection of data, for example, represents the judgement of a person, then the database can be protected by copyright. What isn't protected is the mindless labor, or "sweat of the brow" effort that someone has made to accumulate the data. The 1991 Supreme Court decision that established this rule was a unanimous one written by Justice Sandra Day O'Connor. In retrospect, the opinion seems prescient, as if the Court had anticipated a day when sweating brows would be banished by scraping computers and global networks of information.

Rob Styles has a post on his blog that got me reading and thinking about these database copyrights. His key point is a suggestion that distributed, Linked Data will disrupt database intellectual property rights as profoundly as P2P distribution networks have disrupted the music and entertainment media businesses.

Like all great blog posts, Styles' is at the same time obviously true and obviously wrong- i.e., thought provoking. First, the obviously true part. When technology makes it trivial to reaggregate data that is readily available in a dispersed state, then businesses that rely on exclusive access to the aggregate become untenable. The example discussed by Styles is that of the Royal Mail's Postcode Address File. It turns out that in the UK, the Royal Mail has made a modest business of selling access to this file, which lists every address in the country that receives mail together with geographical coordinates. This arrangement has been recently in the news because of Ernest Marples Postcodes Ltd., a small company which attempted to provide free API access to Postcode data, but was shut down by a threat of legal action from the Royal Mail. Apparently the Royal Mail won't let websites use the postcode data without paying a £3750 license fee. They also offer per-click licenses which cost about 2p per click. To all appearances, the Royal Mail supports a healthy ecosystem of postcode data users- they list 217 "solutions providers" on their web site.

Styles' point is that the facts contained in the postcode file are in the public domain, and with Semantic Web technology, a large fraction of these facts could be made available as Linked Data without infringing the Royal Mail's copyrights. Once the data has entered the cloud, it would seem impractical for the Royal Mail to further assert its copyright. My posts on copyright salami attempted (unsuccessfully, I think) to construct a similar evasion for books; Rob's suggested postcode copyright evasion is clean because the "slices" are truly in the public domain, rather than simply being fairly used, as in my scenario.

How does the US differ in the availability of postcode data? In the US, the data file that corresponds most closely with the Royal Mail's PAF file is the USPS Topological Integrated Geographic Encoding and Reference/ZIP + 4® File (TIGER/ZIP+4). In the US, not only is there no database right, but works of the government are considered to be in the public domain. In general, government agencies are only allowed to charge for things like TIGER/ZIP+4 to cover distribution costs. Thus, it's not so surprising that the USPS doesn't even list a price for the TIGER/ZIP+4 file. I called up to ask, and found out that it costs $700 to get a full dump of the file. USPS does not offer updates; I was told that the release file is updated "every 2-3 years". The USPS, unlike the Royal Mail, seems uninterested in helping anyone use their data.

Since the USPS doesn't put any license conditions on the file, companies are free to resell the file in most any way they wish, resulting in a wide variety of services. For example, will sell you a license to their version of the Zip+4 file, suitable for use on a website, for $1998, updated quarterly. This is about 1/3rd of the price of the similar offering by the Royal Mail. has a similar product for $2000, including updates. On the low end, "Zip code guy" says he'll send you a file for free (the data's a bit old) if you link to his map site. On the high end, companies like Maponics provide the data merged with mapping information, analysis and other data sets.

The purpose of copyright has historically been "for the Encouragement of Learning" according to the Statute of Anne and "To promote the Progress of Science and Useful Arts" according to the US Constitution. The different copyright regimes used for the UK and US now present us with an experiment that's been running for over 18 years as to the efficacy of database copyrights. In which country, the UK or the US, have the "Useful Arts" surrounding postcode databases flourished the best?

After a bit of study, I've concluded that in the case of postcodes, database copyright has so far been more or less irrelevant to the development of the postcode data business. And even though the governmental organizations have completely different orientations towards providing their data, the end result- what you can easily buy and what it costs- is not all that different between countries. Although it's argued that the shutdown of and the higher cost of data in the UK are a result of database copyright, there is clearly more at play.

In theory, one way that copyright promotes commerce is by providing a default license to cover standard use of protected material. In fact, there are very few database providers that rely solely on copyrights to govern usage terms. In both the US and UK, the "good" postcode databases are only available with a license agreement attached. These licenses preserve the business models of postcode data merchants; it's not clear that was complying with license agreements even if it wasn't infringing a database copyright.

Since UK database copyrights don't have effect in the US, we might imagine setting up a Royal Mail postcode business in the US to exploit the absence of copyright. Would we be able to do something that we couldn't do in the UK? Well, not really. We'd probably still want to get a license from the Royal Mail, because £3750 is not a lot of money. It would cost us more to ask a lawyer whether we'd run into any problems. And at least in theory, the Royal Mail would have the freshest data. This is the reason I think Styles' post is "obviously wrong"- the distributed availability of data won't have a big effect on the core business of the Royal Mail or any other database business. It would have exactly the same effect as the absence of copyright protection in the US has had on the UK postcode market. In other words, nil.

My main worry about licensing from the Royal Mail would be in the area of allowed uses; I don't really trust an organization with the words "royal" and "mail" in its name to be able to understand and fairly price all the crazy smashed-up uses I might invent. Database copyrights give producers like the Royal Mail the ability to arbitrarily disallow new uses. Since it's hard to prove that any given fact has been obtained free of database copyright, the threat of an infringement lawsuit by the Royal Mail could even stifle non-infringing postcode applications.

What I don't see in the postcode data "experiment" is evidence that database copyright has had any great benefit for "the useful arts" in the UK compared to the US. If that's true, then why bother having a special copyright for databases?

As data lives more and more on the web, and becomes enhanced, entailed, and enmeshed, it makes less and less sense to draw arbitrary lines around blocks of data with copyright of autonomic aggregations. Although we need innovative licensing tools to build sustainable business models for data production, maintenance, and reuse in a global data network, we don't really need the database copyright.

Tuesday, November 24, 2009

Publish-Before-Print and the Flow of Citation Metadata

Managing print information resources is like managing a lake. You need to be careful about what flows into your lake and you have to keep it clean. Managing electronic information resources is more like managing a river- it flows through many channels, changing as it goes, and it dies if you try to dam it up.

I have frequently applied this analogy to libraries and the challenges they face as their services move online, but the same thing is true for journal publishing. A journal publisher's duties are no longer finished when the articles are bound into issues and put into the mail. Instead, publication initiates a complex set of information flows to intermediaries that help the information get to its ultimate consumer. Metadata is sent to indexing services, search engines, information aggregators, and identity services. Mistakes that occur in these channels will prevent customer access just as profoundly as the loss of a print issue, and are harder to detect, as well.

A large number of journals have made the transition from print distribution to dual (print+electronic) distribution; many of those journals are now considering the transition to online-only distribution. As they plan these transitions, publishers are making decisions that may impact the distribution chain. Will indexing services be able to handle the transition smoothly? Will impact factors be affected? Will customer libraries incur unforeseen management costs?

I was recently asked by the steering committee of one such journal to look into some of these issues, in particular to find out about the effects of the "publish-before-print" model on citations. I eagerly accepted the charge, as I've been involved with citation linking in one way or another for over 10 years and it gave me an opportunity to reconnect with a number of my colleagues in the academic publishing industry.

"Publish-before-print" is just one name given to the practice of publishing an article "version of record" online in advance of the compilation of an issue or a volume. This allows the journal to publish fewer, thicker issues, thus lowering print and postage costs, while at the same time improving speed-to-publication for individual articles. Publish-before-print articles don't acquire volume, issue and page metadata until the production of the print version.

Before I go on, I would like to recommend the NISO Recommended Practice document on Journal Article Versions (pdf, 221KB). It recommends the use of "Version of Record" as the terminology to use instead of "published article" which is widely used in a number of circumstances:
  1. Version of Record (VoR) is also known as the definitive, authorized, formal, or published version, although these terms may not be synonymous.
  2. Many publishers today have adopted the practice of posting articles online prior to printing them and/or prior to compiling them in a particular issue. Some are evolving new ways to cite such articles. These “early release” articles are usually [Accepted Manuscripts], Proofs, or VoRs. The fact that an “early release” article may be used to establish precedence does not ipso facto make it a VoR. The assignment of a DOI does not ipso facto make it a VoR. It is a VoR if its content has been fixed by all formal publishing processes save those necessary to create a compiled issue and the publisher declares it to be formally published; it is a VoR even in the absence of traditional citation data added later when it is assembled within an issue and volume of a particular journal. As long as some permanent citation identifier(s) is provided, it is a publisher decision whether to declare the article formally published without issue assignment and pagination, but once so declared, the VoR label applies. Publishers should take extra care to correctly label their “early release” articles. The use of the term “posted” rather than “published” is recommended when the “early release” article is not yet a VoR.
"Version of Record before Print" is a bit of a mouthful, so I'll continue to use "publish-before-print" here to mean the same thing.

It's worth explaining "Assignment of a DOI" a bit further, since it's a bit complicated in the case of publish-before-print. Crossref-issued DOIs are the identifiers used for articles by a majority of scholarly journal publishers. To assign the DOI, the publisher has to submit a set of metadata for the article, along with the DOI that they want to register. The Crossref system validates the metadata and stores it in its database so that other publishers can discover the DOI for citation linking. In the case of publish-before-print, the submitted metadata will include journal name, the names of the authors, the article's title, and the article's URL, but will be missing volume, issue and page numbers. After the article has been paginated and bound into an issue, the publisher must resubmit the metadata to Crossref, with added metadata and the same DOI.
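
To make the two-stage deposit concrete, here is a minimal sketch in Python. It does not use the real Crossref deposit format or API; the `ArticleDeposit` class, its field names, the `paginate` helper, and the DOI and URL shown are all made up for illustration. The only point is that the second deposit carries the same DOI with volume, issue and page added.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ArticleDeposit:
    """Hypothetical, simplified stand-in for a metadata deposit (not Crossref's schema)."""
    doi: str
    journal: str
    authors: Tuple[str, ...]
    title: str
    url: str
    volume: Optional[str] = None      # unknown at publish-before-print time
    issue: Optional[str] = None       # unknown at publish-before-print time
    first_page: Optional[str] = None  # unknown at publish-before-print time

    def is_paginated(self) -> bool:
        """True once volume, issue and first page have all been assigned."""
        return None not in (self.volume, self.issue, self.first_page)


def paginate(online: ArticleDeposit, volume: str, issue: str,
             first_page: str) -> ArticleDeposit:
    """Build the second deposit: every field unchanged, print details added."""
    return replace(online, volume=volume, issue=issue, first_page=first_page)


# First deposit, made when the version of record goes online.
online = ArticleDeposit(
    doi="10.5555/example.2009.001",  # made-up DOI
    journal="Journal of Examples",
    authors=("A. Author", "B. Author"),
    title="An Example Article",
    url="https://journal.example.org/articles/2009-001",  # made-up URL
)

# Second deposit, made after the article is bound into an issue.
final = paginate(online, volume="12", issue="3", first_page="245")
```

Anything that stored the DOI from the first deposit still resolves to the same article after the second one, which is what makes the gap between the two survivable for citing authors.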

What happens if the online article is cited in an article in another journal during the time between the version of record going online and the full bibliographic data being assigned? This question is of particular importance to authors whose citation rates may factor into funding or tenure decisions. The answer depends on the processes being used to publish the citing article and produce the citation databases, so I had to make a few calls to get some answers.

As you might expect, journal production processes vary widely. Some journals, particularly in the field of clinical medicine, are very careful to check and double check the correctness of citations in their articles. For these journals, it's highly likely that the editorial process will capture updated metadata. Other publishers take a much more casual approach to citations, and publish whatever citation data the author provides. Most journals are somewhere in the middle.

Errors can creep into citations in many ways, including import of incorrect citations from another source, misspelling of author names, or simple miskeying. DOIs are particularly vulnerable to miskeying, due to their length and meaninglessness. One of my sources estimates that 20% of author keyed DOIs in citations are incorrect! If you have the opportunity to decide on the form of a DOI, don't forget to consider the human factor.
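
As a rough illustration of what a first line of defense against miskeyed DOIs might look like, here is a small Python sketch. The function names and the pattern are my own assumptions, not anything a publisher or Crossref prescribes. The pattern only checks the general shape of a DOI ("10.", a numeric prefix, a slash, and a suffix with no spaces), and stripping trailing punctuation is a heuristic.

```python
import re

# Assumed shape: "10.", a 4 to 9 digit registrant prefix, "/", a suffix without spaces.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI gets written in a reference list.
PREFIXES = ("https://doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(raw: str) -> str:
    """Strip a resolver or 'doi:' prefix and trailing punctuation, then lowercase."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # DOIs are case-insensitive, so lowercasing makes comparison reliable.
    return doi.strip().rstrip(".,;").lower()


def looks_like_doi(raw: str) -> bool:
    """Cheap shape check; it cannot tell whether the DOI is actually registered."""
    return bool(DOI_PATTERN.match(normalize_doi(raw)))
```

A check like this catches a stray space or a letter "O" typed in place of the zero in "10.", but a single wrong character in the suffix still looks like a perfectly good DOI. Only a lookup against the registration database can catch that, which is one more argument for keeping DOI suffixes short and easy to transcribe.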

It's hard to get estimates of the current error rate in citation metadata; when I was producing an electronic journal ten years ago, my experience was consonant with industry lore that said that 10% of author-supplied citations were incorrect in some way. My guess, based on a few conversations and a small number of experiments, is that a typical error rate in published citations is 1-3%. A number of processes are pushing this number down, most of them connected with citation linking in some way.

Reference management and sharing tools such as RefWorks, Zotero, and Mendeley now enable authors to acquire article metadata without keying it in and link citations even before they submit manuscripts for publication; this can't help but improve citation accuracy. Citation linking in the copy editing process also improves the accuracy of citation metadata. By matching citations to databases such as Crossref and PubMed, unlinked citations can be highlighted for special scrutiny by the author.

Integration of citation linking into publishing workflow is becoming increasingly common. In publishing flows hosted by HighWire Press' Bench>Press manuscript submission and tracking system, Crossref and PubMed can be used at various stages to help copyeditors check and verify links. Similarly, ScholarOne Manuscripts, a manuscript management system owned by Thomson Reuters, integrates with Thomson Reuters' Web of Science and EndNote products. Inera's xStyles, software that focuses specifically on citation parsing and is integrated with Aries Systems' Editorial Manager, has recently added an automatic reference correction feature that not only checks linking, but also pulls metadata from Crossref and PubMed to update and correct citations. I also know of several publishers that have developed similar systems internally.

In most e-journal production flows, there is still a publication "event", at which time the content of the article, including citations, becomes fixed. The article can then flow to third parties that make the article discoverable. Of particular interest are citation databases such as Thomson Reuters' Web of Science (this used to be ISI Science Citation Index). The Web of Science folks concentrate on accurate indexing of citations; they've been doing this for almost 50 years.

Web of Science will index an article and its citations once it has acquired its permanent bibliographic data. The article's citations will then be matched to source items that have already been indexed. Typically there are cited items that don't get matched - these might be unpublished articles, in-press articles, and private communications. Increasingly, the dangling items include DOIs. In the case of a cited publish-before-print article, the citation will remain dangling in the database until the article has been included in an issue and indexed by Web of Science. At that point, if the DOI, journal name, and first author name all match, the dangling citation is joined to the indexed source item so that all citations of the article are grouped together.
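
The joining step can be sketched in a few lines of Python. This is only an illustration of the three-field rule described above, not Web of Science's actual matching logic, which I haven't seen. The `Record` class, the `join_dangling` function, the sample data, and the choice to compare fields case-insensitively are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical minimal record, used for cited items and indexed source items alike."""
    doi: str
    journal: str
    first_author: str


def _key(record: Record) -> Tuple[str, str, str]:
    """Comparison key; trimming and lowercasing is an assumption of this sketch."""
    return (
        record.doi.strip().lower(),
        record.journal.strip().lower(),
        record.first_author.strip().lower(),
    )


def join_dangling(dangling: List[Record],
                  source_item: Record) -> Tuple[List[Record], List[Record]]:
    """Split dangling citations into those joined to source_item and those left over."""
    target = _key(source_item)
    matched = [c for c in dangling if _key(c) == target]
    remaining = [c for c in dangling if _key(c) != target]
    return matched, remaining


# A newly indexed source item and two citations waiting for a match (made-up data).
source = Record("10.5555/Example.1", "J. Examples", "Smith")
waiting = [
    Record("10.5555/example.1", "j. examples", "smith"),  # same article, different case
    Record("10.5555/example.2", "J. Examples", "Jones"),  # a different article
]
matched, remaining = join_dangling(waiting, source)
```

Because all three fields have to agree, a miskeyed DOI or a misspelled author name in the citing article leaves the citation dangling even after the cited article has been indexed.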

Google's PageRank is becoming increasingly important for electronic journals, so it's important to help Google group together all the links to your content. The method supported by Google for grouping URLs is the rel="canonical" link element. By putting a DOI based link into this element on the article web pages, publishers can help ensure that the electronic article will be ranked optimally in Google and Google Scholar.
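
For anyone templating article pages, here is a small Python sketch of building that tag, which is written as a `link` element in the page's `head`. The `canonical_link` function name and the default resolver address are assumptions of mine; use whatever DOI resolver URL your publishing platform treats as standard.

```python
from html import escape
from urllib.parse import quote


def canonical_link(doi: str, resolver: str = "https://doi.org/") -> str:
    """Return a rel="canonical" link element pointing at the DOI-based URL."""
    # Percent-encode characters that are unsafe in a URL path, keeping the slash.
    href = resolver + quote(doi, safe="/")
    # Escape the URL so it is safe inside an HTML attribute value.
    return f'<link rel="canonical" href="{escape(href, quote=True)}">'


# Made-up DOI, for illustration only.
print(canonical_link("10.5555/example.2009.001"))
# prints: <link rel="canonical" href="https://doi.org/10.5555/example.2009.001">
```

The same element would go on every page that serves the article (abstract, full text, PDF landing page), so that links to any of them are credited to one address.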

An increasingly popular alternative to publish-before-print is print-oblivious article numbering. Publishers following this practice do not assign issue numbers or page numbers, and instead assign article numbers when the version-of-record is first produced. Downstream bibliographic systems have not universally adjusted to this new practice; best practices for article numbers are described in an NFAIS Report on Publishing Journal Articles (pdf 221KB).

In summary, the flow of publish-before-print articles to end users can be facilitated by proper use of DOIs and Crossref.
  1. Prompt, accurate and complete metadata deposit at the initial online publication event and subsequent pagination is essential.
  2. DOIs should be constructed with the expectation that they will get transcribed by humans.
  3. Citation checking and correction should be built into the article copyediting and production process.
  4. Use of DOI-based links in rel="canonical" link elements will help in search engine rankings.