Thursday, December 31, 2009

Do Libraries Have a Role in the Coming e-Book Economy?

You've probably heard it said that in Chinese, the word for "crisis" is composed from the words for "danger" and "opportunity". In the same presentation, you probably heard that there's no "I" in "TEAM". If you were skeptical of these attempts to extract wisdom from the way language is written, you had good reason. The story about the Chinese word for crisis is not true. And even if it were true, it would be about as meaningful as the fact that the English word "SLAUGHTER" contains the word "LAUGHTER".

During my brief time working in "middle management", I was required to do "SWOT Analysis". SWOT stands for "Strengths, Weaknesses, Opportunities, Threats". As a planning exercise, it was quite useful, but it became comical when used as a management tool. Everyone understood the fake Chinese crisis wisdom, and we all made sure that our threats were the same as our opportunities, and our weaknesses were also our strengths.

On this last day of the "0"s, I've been reading a lot of prognostication about the next ten years. It's very relevant to this blog, as I've been using it to help me think about what to do next. Some things are not too hard to imagine: the current newspaper industry will shrink to maybe 10% of its current size; the book publishing industry will reshuffle during the transition to e-books; Google will become middle-aged. The SWOT analysis for these will be easy.

The SWOT analysis that I have trouble with is the one for libraries. What threats to libraries will arise? Will libraries as we know them even exist in 10 years?

I've heard publishers say they believe that there will be no role at all for libraries in the developing e-book ecosystem. If that's not a threat, I don't know what is! On the other hand, there's the example of the Barnes and Noble e-book reader, the Nook, that has the intriguing feature of being able to read books without buying them while you're in the bookstore! If there's a role for brick and mortar bookstores in the e-book ecosystem, then surely there's a role for libraries.

In thinking about what roles libraries will play when all books are e-books, I keep coming back to a conclusion that sounds odd at first: the prospective role of libraries will be entwined with that of piracy in the e-book ecosystem.

While there are fundamental differences between e-book libraries and e-book pirates, there are important similarities. As I noted in my article on copyright enforcement for e-books, libraries have traditionally played an important role in providing free access to print books; e-book pirates have as their mission the provision of free access to e-books. For this reason, libraries and pirates would occupy the same "market space" in an e-book ecosystem. This is not to say that libraries and pirates would be direct competitors; it's hard to imagine pirate sites appealing to many of the people who patronize libraries.

So where is the "threat" to libraries? Think about how book publishers will need to respond to the threat of e-book piracy. I've argued that publishers should do everything they can to reward e-book purchases, but that addresses only the high price segment of the market. Public libraries address the low-price segment of the market, providing books to people with a low willingness or ability to pay for access, while still providing a revenue stream for the publishers. To keep pirates from capturing this market in the e-book economy, publishers will need to facilitate the creation of services targeted at this market.

An analogy from the video business is appropriate here. DVDs can only satisfy part of the digital video market. Though it's taken a while for the studios to realize it, in order to effectively compete with video pirates, the movie studios need to have digital offerings that provide movies for free.

What will the free e-book services look like? Perhaps they'll be advertising sponsored services like Google Books. Perhaps they'll be publisher- or genre-specific subscription services that provide people a "free book" experience at a fixed monthly price. Unfortunately, it seems a bit unnatural that publishers would turn to libraries to create the sort of services that could replicate the role of the library in the e-book ecosystem- libraries just aren't entrepreneurial in that way.

Somehow I don't think that book publishers will warm to a "Napster for e-Books", even if it was labeled "e-Book Inter-Library Loan".

Still, I'm optimistic. Some horrific mashup of Open Library, Google Books, LibraryThing, WorldCat, BookShare, Facebook, Freebase, RapidShare and the Mechanical Turk is going to be just the thing to save both libraries and publishers. You heard it here first. And if you find it scary- don't forget that you can't spell e-Book without BOO!

Monday, December 21, 2009

Copyright Enforcement for eBooks: Cultural Life Preserver or Orwellian Nightmare?

I'm a 7 MPH speeder. When I'm on an empty highway with a 65 MPH speed limit, I drive 72. This puts my car in roughly the 70th percentile of car speed. But when some idiot comes zooming past at 85, I cheer when I see him stopped by the cops 5 minutes later.

Last time I was in England, I was appalled to find that cameras had been installed along some of the motorways that would send you a speeding ticket automatically if you averaged more than the speed limit. I told the limo driver that Americans would elect a black president long before we'd tolerate speed cameras on the freeway. I was right.

I'm no legal theorist, but I know better than to think that human behavior is determined by laws- laws only work as far as they reflect a social consensus. It's true for driving and it's also true for reading, listening to music, and watching videos. As behaviors change due to the introduction of technology, society is forced to modify social norms for behavior.

The book publishing industry is at the beginning of a technology driven change in the way that people read books, and the shape of the consensus that emerges will determine how creative production is sustained. (Same for news, but that's another story entirely!)

Social consensus has a lot of inertia because if people and institutions don't have to change, they won't. Think about how social consensus evolved when music became digital. It used to be that people expected to be able to listen for free, via radio, and expected to pay to "keep" the music. Once paid for, people expected to be able to share their records with friends in a variety of ways.

When music became digital with the advent of the compact disc, very little changed, at least for a while. The addition of internet distribution, however, allowed Napster to stretch the "sharing" behavior so as to cover free listening and threaten the buy-to-own behavior. The music industry responded with legal action, but its failure to provide convenient, authorized activities to cover accustomed behaviors gave Napster an effective monopoly on digitally distributed music. If not for the social habit of paying-to-keep music, the music industry might well have collapsed. With the takedown of Napster and the rise of authorized services like iTunes, Pandora and Spotify, the music industry has begun to successfully reshape user behavior forged by easy unauthorized file sharing, but its mistakes have clearly hurt.

The movie industry has had more luck with the onset of digital distribution. People still expect to watch TV for free, and to pay for premium entertainment at the movies. The internet bandwidth needed to easily move video files has become available at about the same time as distribution sites such as Hulu and Netflix, so pirates have never had much of a monopoly on digital movie distribution. YouTube offers a flood of free video content, and it works with rightsholders to identify and remove unauthorized uses of their work. A large amount of unauthorized distribution has occurred, but the movie industry has responded with both the carrot and the stick: providing an enhanced in-theater experience, inexpensive secondary distribution channels and specialized DVD content, and making deals with YouTube, while pursuing takedowns and ostentatiously prosecuting copyright infringements. Certainly the movie industry has made some missteps, but a blockbuster movie can still gross a billion dollars.

People have always expected to pay to own books, but once bought, the books could be freely borrowed from friends or libraries, and a vibrant used-books market makes older works available at very low cost. The biggest change brought about by digital distribution is the flood of free material available on a huge variety of websites, from blogs to wikis to traditional news.

It's not clear how book (including ebook) sales will be impacted by unauthorized distribution of digital copies. Although I've noted that it's relatively easy to find and identify unauthorized copies of works like Harry Potter and the Deathly Hallows, it's not likely that people will change their book buying behavior unless they have to. That's why I find it surprising that J. K. Rowling and her publishers are giving the pirates a near monopoly on the digital version of that particular book.

I've heard publishers say that they've learned from the example of the music industry that the threat of piracy makes DRM (digital rights management) a necessity for distribution of ebook content. In fact, almost the opposite is true. Publishers have been distributing books for hundreds of years without DRM. A potential pirate doesn't need to crack any encryption; they need only buy a single copy of the book and scan it. I wrote about the advent of cheap book scanners in October; Wired has a recent article.

Pirating a book is somewhat more difficult than pirating a song, but comparable to pirating a movie. The first step is to acquire a digital copy. Popular books are easy to obtain and a professional pirate would likely remove the binding with a saw and feed the pages into a high-speed copier/scanner. (Until the DVD comes out, a pirate typically sits in a theater and films the movie; the DRM on DVDs is trivial to crack.)

The digital file would then either be seeded onto a peer-to-peer network or uploaded to a file distribution or streaming site similar to RapidShare. Studies by Arbor, Cisco, and Sandvine suggest that P2P networks are declining in popularity compared to the file distribution sites, especially in countries with high broadband penetration.

In a peer to peer network such as those using the BitTorrent protocol, the work is divided between tracker sites and the peers which provide the actual files. The use of many peers allows high-volume distribution without needing a high bandwidth internet connection. Since the RIAA and others began filing lawsuits against people thought to be involved in providing files, the remaining networks have adopted social-networking and encryption to make sure that they can no longer be easily monitored.

File distribution sites are being used more and more as broadband connections become widespread. These sites have many legitimate uses, and will respond to takedown notices when illicit content is identified on their sites (although in some countries, the takedowns are processed with the underwhelming speed of a bank's electronic funds transfer). The links and metadata for the illicit files mostly appear on third party sites, which complicates any enforcement action. Ironically, sites such as RapidShare have become so popular that to use them easily you really have to purchase a premium subscription!

Still, digital book piracy has already begun to appear in significant amounts. According to Brad Beautlich, Sales Director at DtecNet, textbooks, including law and medical textbooks, are now frequently appearing on the content distribution sites and torrent indexes favored by copyright infringers. These tend to be expensive items sold in cost-sensitive markets, which increases the incentives for unauthorized use. The sites appear to have very few books that have been cracked from digital versions; most of the book content currently available is clearly derived from scanned print.

The lack of pirated e-reader files (such as Kindle or epub files) is consistent with the profile of e-reader early adopters, who tend to be older and not particularly price sensitive. I assume it's because older users tend to have bad eyes and full shelves. They're unwilling to install P2P client software and unlikely to be attracted by the sort of advertising found on file index sites. Readers in developing countries may be in different situations.

DtecNet is a company that has been providing detection services to media companies. They offer to seek out, document and help to take down unauthorized content from web sites and file sharing networks. Their task can be difficult, as they need to scan and monitor indexing sites that may cloak the identity of a file ("NITM2" instead of Night in the Museum 2) and figure out from user comments in multiple languages whether a file is genuine or not.

Beautlich suggests that although monitoring from his company would be expensive ($4000-5000/month for a Harry Potterish project), an early investment in copyright enforcement by the book industry might be more effective than a strategy of waiting for a larger threat to arrive.

Another strategy to modify user behavior is being pursued by Audible Magic. Audible Magic has a rather different business model from DtecNet. Instead of working for rightsholders, Audible Magic provides content identification services to ISPs, educational institutions, and content distribution services, helping them minimize their liability for copyright infringement. In the US, the Higher Education Opportunity Act (HEOA) of 2008 requires colleges and universities to have "A plan to 'effectively combat' copyright abuse on the campus network using 'a variety of technology-based deterrents'."

Audible Magic provides an appliance that attaches to a router or gateway within the client's network. The appliance "listens" to network traffic, and when it recognizes copyrighted content being transferred in ways that connote unauthorized use, it either logs a report or attempts an intervention. According to Jay Friedman, Audible Magic's Vice President for Marketing, over 100 university campuses are using their systems. Pricing depends on the amount of bandwidth used by the university and can be as little as a few thousand dollars a year.

Interventions are positioned in a "graduated response" model. For example, a user's next webpage download might be replaced by a page suggesting that unauthorized activity may have occurred, along with a reminder of an institution's usage policies. Continued infractions might result in the user being put in a "timeout", followed by a human mediated intervention.
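The escalation ladder described above can be sketched in a few lines of Python. This is only an illustration, not Audible Magic's implementation: the class name, the step labels, and the rule of moving up one step per detected infraction are all assumptions made for the example.

```python
from collections import defaultdict

# Escalation order taken from the description above; the labels themselves
# and the "one step per infraction" rule are hypothetical.
RESPONSES = ["warning_page", "timeout", "human_review"]


class GraduatedResponse:
    """Tracks infractions per user and returns the next intervention."""

    def __init__(self):
        self.infractions = defaultdict(int)

    def record_infraction(self, user):
        # Count the infraction, then pick the matching step, staying on
        # the last step once the ladder has been exhausted.
        self.infractions[user] += 1
        level = min(self.infractions[user], len(RESPONSES)) - 1
        return RESPONSES[level]


policy = GraduatedResponse()
for _ in range(3):
    print(policy.record_infraction("student-42"))
```

A real deployment would presumably also let infractions expire over time; that is left out here to keep the sketch short.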

If you find it big-brotherish to have an "appliance" looking over your shoulder to see whether you're infringing copyrights, you wouldn't be alone. The Electronic Frontier Foundation has warned that Audible Magic's service offering is "no magic bullet", and is concerned that this type of content monitoring would be a threat to individual privacy rights. It's one thing for universities and corporations to be proactive in avoiding copyright infringement liability, but imagine what it would be like if this sort of monitoring were a legal requirement! Public Knowledge has published an excellent overview of the issues surrounding this sort of network monitoring.

In fact, international treaties and legislation requiring ISPs to adopt "three strikes" graduated response policies culminating in loss of internet connection are being considered in Europe and other parts of the world. While many book publishers would be horrified to buy into these sorts of copyright enforcement regimes, at the same time they are aghast at the prospect of having their content pirated and their livelihoods destroyed.

Think about the speed limit monitor in the accompanying photo. Based on my observation it is very effective at modifying the behavior of drivers. 7-MPH speeders like myself become 1-MPH speeders. I don't think anyone minds being monitored by this sign- there is confidence that it's not doing anything other than measuring and displaying our speed. In contrast, hidden speed traps seem evil- they don't slow people down unless they own radar detectors; the egregious speeders are not the ones who get caught! Copyright enforcement for ebooks should be as much like the speed limit sign as possible. As Princeton's Ed Felten has observed, the ideal copyright enforcement system exhibits maximal compliance and minimal prosecution. Especially for books, monitoring systems should be as open as possible and visible to users to maximize compliance and to create confidence they are not also snooping on reading habits.

It's interesting to read about the experiences of a university that implemented monitoring of P2P networks to comply with HEOA. Illinois State's Digital Citizen Project's summary of "Escalated Response System Testing Utilizing Audible Magic Copysense" (pdf, 1.5 MB) is valuable reading. While it's hard to be sure that Illinois State's program was effective (you can't measure events that have evaded detection), I found it interesting that Illinois State students expressed minimal complaints or concern about the program.

A company with content identification technologies similar to those of Audible Magic is Nexicon. Both companies have agreements in place to work with YouTube to help to identify copyrighted material in uploaded videos, but Nexicon's business model aligns them with enforcement-oriented rightsholders. Here's how Nexicon President Sam Glines describes their flagship services:
Through our GetAmnesty and PayArtists solutions, we share with the rights holders settlements collected via the DMCA notices sent to infringers. The copyright holder sets the dollar amount per infringement - in the case of PayArtists and for Frank Zappa, the settlement amount is $10 per infringement. Nexicon’s MARC platform is capable of sending 95 million DMCA notices each day. Nexicon’s MARC platform monitors billions of illegal downloads of copyrighted material on a daily basis.
Nexicon has recently been involved in controversial takedown notices which Prof. Mike Freedman of Princeton's Center for Information Technology Policy describes as "inaccurate enforcement". In addition to defending Frank Zappa's copyright interests, Nexicon, a public company, boasts about fighting child pornography. At the same time, it appears to be associated with a New Jersey company that represents pornography publishers in their battle against copyright pirates. It can be hard for a technology company to control how its customers employ technology, but I would like to see clearer and more coherent explanations of what happened to Freedman than Nexicon has provided to date.

Identification of ebooks is rather a different endeavor than identification of video or audio files. Copyrighted content in audio and video files can be identified in a number of ways, including watermarking, hashing and fingerprinting. As its name implies, Audible Magic's roots are in the audio fingerprinting area, and its huge library of 7 million song fingerprints is a significant asset, but they increasingly need to use textual clues such as those required for eBook identification and are interested in further developing book-related identification techniques. As I've written previously, textual fingerprints are surprisingly effective at identifying books, even using a single sentence.
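To make the idea of a textual fingerprint concrete, here is a minimal Python sketch. It is not Audible Magic's method, nor necessarily the approach from the earlier article; it swaps in a simple, well-known technique (hashing overlapping runs of words, often called shingling) under the assumption that the suspect file contains OCRed or extracted text. The function names, the five-word window, and the two public-domain opening lines are choices made for the example.

```python
import hashlib
import re


def normalize(text):
    # Lowercase and keep only alphanumeric words, so case and punctuation
    # differences (common in OCRed scans) do not matter.
    return re.findall(r"[a-z0-9]+", text.lower())


def shingle_hashes(text, n=5):
    # Hash every run of n consecutive words; each hash is one fingerprint.
    words = normalize(text)
    return {
        hashlib.sha1(" ".join(words[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(words) - n + 1)
    }


def build_index(books, n=5):
    # Map each fingerprint to the set of titles that contain it.
    index = {}
    for title, text in books.items():
        for h in shingle_hashes(text, n):
            index.setdefault(h, set()).add(title)
    return index


def identify(snippet, index, n=5):
    # Return the title sharing the most fingerprints with the snippet,
    # or None when nothing matches.
    counts = {}
    for h in shingle_hashes(snippet, n):
        for title in index.get(h, ()):
            counts[title] = counts.get(title, 0) + 1
    return max(counts, key=counts.get) if counts else None


books = {
    "Pride and Prejudice": "It is a truth universally acknowledged, that a "
    "single man in possession of a good fortune, must be in want of a wife.",
    "Moby-Dick": "Call me Ishmael. Some years ago, never mind how long "
    "precisely, having little or no money in my purse",
}
index = build_index(books)
print(identify("A SINGLE MAN, in possession of a good fortune", index))
```

Exact hashing like this tolerates changes in case and punctuation but not OCR misspellings; a production system would need fuzzier matching.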

Book publishers preparing to fight piracy need to first and foremost have their content ready to be identified. While metadata, epub files and the like will be useful in locating and identifying pirated content that includes OCRed text, scanned images of books are also likely to be useful for the development of content recognition systems. If book publishers don't at least have scan files of every book they own, now is the time for them to start scanning!

Enforcement is only one weapon in the fight against book piracy, and it is the one weapon that most quickly loses effectiveness, as the techniques of copyright evaders evolve. One potential weapon that should be avoided is the dirty trick. If book publishers are unable to learn from the Sony rootkit fiasco, they will get all the ill will and lawsuits they deserve.

The shaping of societal behavior is a hopeless endeavor if the stick is wielded without a corresponding carrot. Any psychologist will tell you that the most powerful tool in modifying human behavior is positive reinforcement. If ebooks are to succeed commercially, publishers must use every means possible to reward people who purchase ebooks. I hope to write more about this soon, but I believe that positive reinforcement is the best lens through which to look at DRM. DRM will fail unless its users believe it is rewarding them with convenience and ease of use, and with sufficient reward, it is also unnecessary - that is the lesson of iTunes.

As the era of digital books dawns, book publishers should expect that business models will change. Their mission, if they choose to accept it, is not only to deal with unauthorized use, but also to lead users to a social consensus that benefits everyone.

Update: In this post, I managed to overlook Attributor. Here's a post about them.


Tuesday, December 15, 2009

Holiday Product Management and the Hardcover Ebook

The product development team responsible for Christmas should get an award. I can imagine the brainstorming sessions: "Let's have a holiday where travelers get stranded and get to sleep with smelly animals!"

"No, that won't work. How about a holiday where poor people visit maternity wards in hospitals?"

"Crackpot. Wait...I've got it... How about if everyone gets PRESENTS!"

Positive reinforcement can go a long way towards creating success. E-books these days are sort of like the current state of my Christmas tree. Beautiful, smells good, but no decorations or presents. In my previous article discussing copyright enforcement, I stressed that if publishers really want to fight off piracy they need to find ways to positively reinforce the ebook purchase experience. In other words, decorate your Christmas ebook and put lots of presents under it.

Step one. It should be completely painless and hassle-free to buy and use a piece of content. Duh. Amazon has figured this out. Apple has figured it out with iTunes. If it comes to a choice between secure DRM (which will get cracked anyway) and making it easy for customers, always do what the users sending you money want you to do.

Step two. Decorate! When someone gets an ebook, it should be customized with the purchaser's name and a pretty frontispiece that says "this book is being lovingly read and cared for by PURCHASER'S NAME HERE", and every ebook file should have a unique serial number. For gifts, of course the message would be different. If the e-reader marketers and developers were really attuned to the publishing ecosystem, they would hardwire a cryptographic signature reader into their devices that would be able to tell a "genuine" ebook frontispiece from a fake one. It's not technically hard.
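As a rough illustration of why this isn't technically hard, here is a minimal Python sketch of a tamper-evident frontispiece. It is a simplification: it uses an HMAC from the standard library, which means the checker must hold the same secret as the publisher. A real e-reader would more plausibly use a public-key signature, so that the device could verify a frontispiece without being able to forge one. The key, the function names, and the serial number format are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by the publisher (and, in this simplified
# sketch, by whatever checks the frontispiece).
PUBLISHER_KEY = b"hypothetical-publisher-secret"


def frontispiece_text(purchaser, serial):
    # The personalized message plus the unique serial number.
    return (
        "This book is being lovingly read and cared for by "
        f"{purchaser} (serial {serial})"
    )


def sign_frontispiece(purchaser, serial, key=PUBLISHER_KEY):
    # Compute an authentication tag over the exact frontispiece text.
    message = frontispiece_text(purchaser, serial).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_genuine(purchaser, serial, tag, key=PUBLISHER_KEY):
    # Recompute the tag and compare in constant time.
    expected = sign_frontispiece(purchaser, serial, key)
    return hmac.compare_digest(expected, tag)


tag = sign_frontispiece("Alice Reader", "EB-000123")
print(is_genuine("Alice Reader", "EB-000123", tag))
print(is_genuine("Mallory Copier", "EB-000123", tag))
```

Changing either the purchaser's name or the serial number invalidates the tag, which is what lets a registration service tell a genuine frontispiece from an edited one.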

Step three. Give out presents! The wonderful thing about ebooks is that the reproduction cost is zero. You can afford to give your customers free stuff! Once they register their unique ebook serial number, send them a "personalized" thank you note from the author. Give them a free book from an unknown author that they would never, ever have bought on their own.

Step four. Give out more presents! If you want to reward genuine ebook purchases in places like China, turn ebook registration into a raffle. Put a golden ticket in every millionth ebook, and a silver ticket in every thousandth. Give something to everyone who registers, even if it's just a virtual badge, or "frequent-reader" points. People may start buying ebooks just for a chance at a payout. Other people will try to register fake serial numbers, and for free, you will get both marketing and pirate tracking data.

Step five. Regifting! If someone has paid you good money for an ebook, let them give it to a friend after six months or so (as long as they've registered it!) If they're a platinum-level frequent buyer, let them keep their own registered copy, too.

If ebook publishers get really good at adding value to their products, they could consider rolling out the "Hardcover Ebook". Current practice in the print book industry for "trade" books is to initially offer only a hardcover version. A year later, the same book is released as a softcover at a substantially lower price. The effect of this is to partition the market and capture revenue from more potential purchasers. Consumers accept this partitioning partly because they assign some value to the hard cover- they assume the hard cover is more expensive to produce.

Recently, there's been much discussion about publishers holding back titles from the ebook market to protect hardcover sales. An ebook with enhanced value comparable to a hard cover print book could be offered on the initial release date at a higher price so as to prevent pirates from having a monopoly on the ebook.

Is there a Grinch preventing the ebook Christmas party? As long as ebook publishing is captive to distribution platforms, innovations in ebook value may be difficult to implement. Amazon's interests do not always align with those of publishers. In particular, ebook serial numbers that register with publishers are not going to be high on the Kindle development queue.

Even the Grinch learned what Christmas is really about. You won't get a hardcover ebook in your stocking this year, but have a great Holiday anyway!

When Amazon and Kindle Win

Last week, I went to a talk given by Mike Shatzkin, who has been doing a good job preparing book publishers for their future. One of his points was that the growing market power of Amazon has the potential to disrupt the way that book publishers do business.

After the talk, I was inspired to go home and do a tiny bit of math. It occurred to me that it should be possible to write an equation to describe the point at which authors would find it to their advantage to sell directly through Amazon and Kindle rather than working through publishers. After filling up a spreadsheet, I boiled everything down to one equation that defines the moment that Amazon wins, and the Big 6 publishing houses (Hachette, HarperCollins, Macmillan, Penguin, Random House, and Simon & Schuster) would lose.

Let k be the Kindle's maximum market share for a book's market. In other words, if Going Rogue: An American Life sells a million copies, and 200,000 of those could be Kindle e-books, then k=20%.

Let d be the fraction of the list price that Amazon has to pay publishers for the right to sell a Kindle version. This number is currently 50%. For front-list titles, Amazon is currently subsidizing the Kindle pricing. That means that Amazon pays the publisher $15 to be able to offer a book with a retail price of $30 for $10 on the Kindle. (For Going Rogue, the Kindle edition has a pre-order price of $8, and the publisher will get $14.)

Let r be the author's gross royalty rate from a publisher. In other words, if Going Rogue lists at $28, sells a million, and Sarah gets $2.8 million, then r=10%. It's not uncommon for authors to receive net royalties of 5-10% on print and 25% on e-books; the corresponding gross royalties would be 2.5-5% for print and 12.5% on ebooks, if you assume a 50% wholesale discount.

Finally, let f be the percentage of the book's list price that goes to production cost. For blockbuster books, f can be quite small, because fixed costs are amortized over a very large number of books. It goes without saying that reproduction costs for the Kindle are close to zero.

If

d*k > r+f,

then it is advantageous for Amazon and Sarah to deal directly, cutting out the publisher completely.

I should note that this ignores the possible profit Amazon might be able to derive from print versions. Do you think they have the logistics expertise to do that?

It's also worth considering the effect of fixed costs. Publishers who specialize in producing many titles that sell only moderately well (i.e. f is large) are probably less at risk of Amazonian disintermediation than those who rely on blockbuster books.
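For readers who would rather poke at numbers than a spreadsheet, the inequality can be tried out with a short Python sketch. The values d=50%, k=20%, r=10% and the 50% wholesale discount come from the definitions above; f=5% is a made-up value for illustration, and the function names are hypothetical.

```python
def gross_royalty(net_rate, wholesale_discount=0.5):
    # Convert a royalty on net receipts into a fraction of list price.
    return net_rate * (1.0 - wholesale_discount)


def direct_deal_wins(d, k, r, f):
    # The condition from the post: d*k > r + f.
    return d * k > r + f


def break_even_kindle_share(d, r, f):
    # The Kindle share k at which d*k exactly equals r + f.
    return (r + f) / d


d = 0.50  # fraction of list price Amazon pays the publisher
k = 0.20  # Kindle's maximum share of the book's market
r = 0.10  # author's gross royalty rate
f = 0.05  # production cost as a fraction of list price (assumed value)

print("25% net royalty as gross:", gross_royalty(0.25))
print("Direct deal wins at k=20%:", direct_deal_wins(d, k, r, f))
print("Break-even Kindle share:", break_even_kindle_share(d, r, f))
```

With these inputs the condition is not yet met at a 20% Kindle share; under the same assumptions the tipping point sits at roughly a 30% share, and it falls as f shrinks, which is the blockbuster case.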

Upon sharing this with a group of people with more experience in the book publishing world, I got the following reactions:
  • Sarah Palin would never go direct with Amazon because what she really wants is a big media campaign.
    Is there a law that prevents Amazon from doing a big media campaign?

  • What about advances? How would authors manage without advances?
    If you put some private equity together with unemployed publishers and bankers, the advances problem might have a solution.

  • Amazon would never want to become a publisher. Think of the headaches of handling idiosyncratic authors.
    Author agents may play a useful role in easing Amazon's difficulties here. And agents will have lots of fun playing off Amazon against Apple and Google.

  • Amazon might be able to win authors that are already their own brand, but how will new authors be discovered and developed?
    Oprah, I have a business idea for you!

As if to prove my point, it came out yesterday that Stephen Covey, author of The 7 Habits of Highly Effective People, is moving his electronic rights to Amazon, which is going to heavily promote his books.

Give me a few more days and I'll have the differential equation. It won't look linear.

Wednesday, December 9, 2009

Supporting Attendance at Code4Lib

A month ago at the Charleston Conference, in the middle of a keynote address about the future of libraries and the role of journals in scientific communication, I got a bit fed up with the notion that scientists were some sort of exotic creatures who used libraries and information resources in ways that the library community needed to understand better. It occurred to me that a much better way to understand the needs of scholars was to just look around the room at the 300 people learning, communicating and synthesizing ideas with each other.

The Charleston Conference started in 1980 as a regional library acquisitions meeting with 24 attendees. This year was its 29th. It covered the world of scholarly information, library collections, preservation, pricing and archiving and it attracted well over 1000 publishers, vendors, librarians, electronic resource managers, and consultants from around the world. Its success is to a large extent the work of one person- Katina Strauch. Over the years, Katina's empire of hospitality has come to include print publications- Against the Grain and The Charleston Advisor, associated websites, and multiple blogs. The Charleston Conference has established itself as an important venue for many types of communication and learning; you might not call it scholarly communications, but so what?

Scientists and scholars aren't so different from librarians and publishers. They go to conferences, drink coffee and beer and learn in the sessions and in the hallways. They exchange business cards and send each other email. They tell stories about the experiments that failed. They gossip. The conferences provide them programs to take home and help them remember who said what. Occasionally someone mentions an article they found to be interesting, and everyone goes home to read it. The Charleston Conference and its associated business properties have grown nicely into the internet age and would be an appropriate model for emulation by the scholarly communication community.

Another vision for the future is provided by Code4Lib. Code4Lib started as a mailing list in 2003 as a forum for discussion of
all thing programming code for libraries. This is a place to discuss particular programming languages such as Java or Python, but is also provide a place to discuss the issues of programming in libraries in general.
At first, it grew slowly, but people quickly discovered how useful it was. Today it has almost 1,300 recipients and a very high signal to noise ratio.

In 2006, the first Code4Lib Conference was held at Oregon State University. The conference was inspired to some extent by the success of a similar conference, ACCESS, held every year in Canada. The Code4Lib Conference has always been self-organizing (organizationless, you might say), and has been quite successful. Presentations are selected by vote of potential attendees; participation is strongly encouraged using lightning talks and unconference sessions. The conference has tried to stay small and participatory, and as a result, registrations quickly fill up.

Code4Lib is also instantiated as channels of communication such as an IRC channel and a Journal, and the community never seems to fear trying new things. In many ways, it's still in its infancy; one wonders what it will look like if it ever gets to be as long-established as Charleston.

This February, the fifth Code4Lib Conference will take place in Asheville, North Carolina. I hope to be there. But with the "Global Economic Downturn" and library budgets being slashed, I worry that some people who might have a lot to contribute and the most to gain may be unable to go due to having lost their job or being in a library with horrific budget cuts. So, together with Eric Lease Morgan (who has been involved with Code4Lib from that very first eMail), I'm putting up a bit of money to support the expenses of people who want to go to Code4Lib this year. If other donors can join Eric and me, that would be wonderful, but so far I'm guessing that together we can support the travel expenses of two relatively frugal people.

If you would like to be considered, please send me an email as soon as possible, and before I wake up on Monday, December 14 at the latest. Please describe your economic hardship, your travel budget, and what you hope to get from the conference. Eric and I will use arbitrary and uncertain methods to decide who to support, and we'll inform you of our decision in time for you to register or not on Wednesday December 16, when registration opens.

If you want to help us with a matching contribution, it's not required to be named Eric.

Update: Michael Giarlo and one other member of the Code4Lib community have agreed to match, so it looks like we have enough to support 3 attendees.

Saturday, December 5, 2009

Business Idea #1: Library Development Jobs Shouldn't Be Lame

Cody Powell, a developer in Austin, Texas, has written a nice article about how to determine if a programming job is lame. The Codypo scale has 8 questions on it:
  1. Would I be paid below market rates?
  2. Would I always be on call?
  3. Am I the IT staff?
  4. Would I work with a single monitor?
  5. Will I be maintaining any ancient system, and what's it written in?
  6. Would my internet usage be filtered or monitored?
  7. Would I be the only programmer?
  8. Am I expected to travel every week?
I'm wondering how the typical library does on the Codypo programmer-job-lameness scale. If you work in a library, please enter your result in the survey:

I have observed that most libraries need to employ software developers, but few libraries are good places for software developers to work. This is a bit of a paradox, because libraries have some wonderful problems for software developers to work on. The small number of libraries that have robust development departments- typically large research and corporate libraries - have done some amazing work.

Among the problems that libraries have in employing developers are divergent pay scales, small size, mentor availability, quality of supervision, development infrastructure, and the herding of cats problem. Their jokes are different.

Partly as a response to this, libraries have tended to outsource the bulk of their technology development to their systems vendors. For the most part, this solution has been adequate, but costly. Increasingly though, libraries find their support costs rising, their service declining, and they are left with increasingly complex integration tasks, especially if they are so bold as to use systems from more than one vendor.

One response to dissatisfaction with vendor support has been a shift to open-source and/or cloud-based solutions. The advantage of open-source firms is that they can't afford to provide poor support, because nothing prevents competitors from providing the same support. The advantage with cloud-based systems is that a cloud vendor can provide basic service at a significantly lower cost.

The paradox of relying on open-source and cloud-based systems, however, is that it causes libraries to employ more development staff, not fewer. This is because open-source systems present new opportunities to improve, integrate and modernize services, and libraries feel obliged to contribute improvements into the open-source ecosystem. In contrast, cloud-based systems tend to be provided with minimal local support coupled with API-based integration options. As more and more of these solutions become available, the library's need to integrate multiple solutions (both cloud and open-source) increases.

As if libraries had nothing more to worry about, most of them are facing budget cutbacks, and in many cases, required staffing cuts. What if there was a company that could take on their development and software integration burden, provide their development staff with the environment, support and training they deserve, and commit to reducing their total expense? Maybe Gluejar could do that.

This is not a new idea- both IBM and HP/EDS have successfully demonstrated the viability of this business model for providing IT to large businesses (for a report on the market, take a look at this Forrester Wave report (765KB)). It's no coincidence that these companies (particularly IBM) are active in support of Open Source software and cloud-based services.

The library world is no stranger to staff outsourcing- companies such as Library Associates and LSSI have been assuming the management of libraries and library departments for quite a while now. In Japan, even academic libraries will outsource key functions to companies like Kinokuniya. There's been opposition to library staff outsourcing- how can librarians truly focus their attention on the needs of their institution or community if they are employed by a dispersed corporate entity? I think outsourcing librarians is quite different from outsourcing library development staff.

It's possible that the Open-Source Library systems companies like LibLime, Equinox, IndexData, ByWater and BibLibre will evolve towards this model. The current reality, however, is that most libraries must deal with many proprietary systems vendors as well, and these are often not predisposed towards working with the Open Source vendors who hope to displace them. To most effectively serve libraries, a company that manages a library's development staff must establish good relations with all types of technology and database vendors as well as information suppliers.

What I don't know is how desperate libraries are to fix the problems I've perceived, or how serious and widespread they are. It's entirely possible that the tight control over development staff is something that libraries are unwilling to give up.

I would be very interested to get feedback on this, either in private, or in the comments.


The Case Against Using Spoofed e-Books to Battle Piracy

I've known since I was four years old the difference between the Swedish Santa Claus and the American Santa Claus. The Swedish Santa Claus (the one who comes to our house) uses goats instead of reindeer and enters by the front door instead of the chimney. And instead of milk and cookies, the Swedish Santa Claus (a.k.a. Jul Tomten) always insists on a glass of glögg.

The glögg in our house was particularly good this year (used Cooks Illustrated recipe), so Jul Tomten stayed a bit longer than usual. I had a chance to ask him some questions.

"You're looking pretty relaxed this year, what's up?" I asked.

"It's this internet, you know. What with all the downloaded games, and music and e-books, my sleigh route takes only half the time it used to!"

"Really, that's amazing! I've read about the popularity of Kindle e-books, but I never imagined it might affect you! Are you worried that the sleigh and goat distribution channel will survive?"

"Oh not at all, Eric, remember, Christmas isn't about the goats, it's about the spirit! And even if all the presents could be distributed digitally, someone's got to go and drink the glögg, don't you think?"

"One thing I've been wondering, that list of yours, you know, the naughty and nice list... It must be very different now- do you look at people's Facebook profiles?"

"Ho ho ho ho. At the North Pole, your privacy is important to us, as the saying goes. Well, I'm going to let you in on a little secret. 'Naughty and Nice' is a bit of a misnomer. We never put coal in anyone's stocking. The way we look at it, there's goodness in each and every person."

"I guess I never thought of it that way."

"Just imagine how a child would feel if they woke up Christmas morning to find a lump of coal in their stocking! Even if the child was very naughty, do you a holiday disappointment would suddenly turn the child nice?"

"Besides, if we really wanted to put something useless in a stocking these days, it would be a VCR tape or an encyclopedia volume, not coal."

I've had a chance to reflect a bit on my chat with Santa, particularly about putting coal in naughty people's stockings. I've recently been studying how piracy might affect the emerging e-book market, and I've made suggestions about how to reinforce the practice of paying for e-books. But one respected book industry consultant and visionary, Mike Shatzkin, has made a suggestion that the book industry should take the coal-in-the-stocking approach to pirated e-books.

In an article entitled Fighting piracy: our 3-point program, Shatzkin proposes as point #1:
Flood the sources of pirate ebooks with “frustrating” files. Publishers can use all sorts of sophisticated tricks to find pirated ebooks, like searching for particular strings of words in the text. (You’d be shocked at how few words it takes to uniquely identify a file!) But people looking for a file to read will probably search by title and author. So publishers can find the sources of pirated files most likely to be used by searching the same way, the simple way.

But, then, when publishers find those illicit files, instead of take-down notices, which is the antidote du jour, we’d suggest uploading 10 or 20 or 50 files for every one you find, except each of them should be deficient in a way that will be obvious if you try to read them but not if you just take a quick look. Repeat Chapter One four times before you go directly to Chapter Six. Give us a chapter or two with the words in alphabetical order. Just keep the file size the same as the “real” ebook would be.
Points 2 and 3 of Shatzkin's "program" are reasonably good ideas. But this point 1 is a real clunker.

I'll admit, when I first read Shatzkin's proposal for publishers to put "sludge" on file sharing sites, I thought it an idea worth considering. After having studied the issue, however, I think that acting on the idea would be foolish and shameful.

First of all, the idea is not original. The tactic of spoofing media files was deployed by the music industry in its battle against the file sharing networks that became popular after the demise of Napster. This tactic was promoted by MediaDefender, a company that also used questionable tactics such as denial of service attacks to shut down suspected pirate sites. Although the tactic was at first a somewhat effective nuisance for file sharers, the file sharing networks developed sophisticated defenses against this sort of attack. They adopted peer-review and reputation-rating systems so that deficient files and disreputable sharers could easily be discriminated. They instituted social peering networks so that untrusted file sharers could be excluded from the network of sharers. The culture of "may the downloader beware" has carried over for e-books. On one site I noted quite a bit of discussion of the true "last word" of Harry Potter and the Deathly Hallows along with chatter about file quality and the like. After seeing all this, Shatzkin's suggested point 1 seems quaint, to put it kindly.

The e-book-coal-in-the-stocking idea could also be dangerous if acted on. The tactic of disguising unwanted matter as attractive content has been widely adopted by attackers going back millennia to the builders of the Trojan Horse. The sludge could be as innocuous as an Amazon "buy-me" link with an embedded affiliate code, or it could be as malicious as a virus that lets a botnet take control of your computer if you open the file. When this really happened for video files, it was widely asserted, without any substantiation, that the viruses were planted by film industry operatives themselves. Thus, what began as a modest attempt to harass Napster file sharers ended up resulting in a smeared reputation for the film industry.

Obviously, Shatzkin is not advocating spoofing e-book files with harmful content on file sharing sites. But publishers who are tempted to follow his point #1 should consider the possibility that emitting large amounts of e-book sludge could provide ideal cover for scammers, spammers, phishers, and other cybercriminals. Then they should talk to their lawyers about "attractive nuisances" and "joint and several liability".

Go ahead and accuse me of believing in Santa Claus. I firmly believe that no matter what business you're in, not everybody gets corrupted. You have to have a little faith in people.

Friday, December 4, 2009

My Funnest Bugfix Ever: the Double Relative Redirect

Intel has been running an ad called "Our Jokes Aren't Like Your Jokes". In it, an actor portrays Mario Paniccia, Director of Intel's Photonics Technology Lab, mischievously altering an equation on a futuristic see-through noteboard. Hilarity ensues.

There's a lot to remark on here. In what sense is the actor portraying Mario Paniccia? In what way is this ad promoting Intel's corporate image? Can you believe that I'm putting it on my blog, totally without getting paid for it?

The ad is hilarious because many of us recognize the truth it expresses. The very best geek humor is funny exactly because so few can appreciate it.

I am now going to attempt to detail the funnest bug I ever found. I've encountered fewer than ten people in the world who have understood it, and of those, only a few could laugh, because the others were complicit.

Five years ago, one of my linking technology company's users, a librarian-customer who was quite adept technically, had noticed that the links we were providing for one publisher were "flaky", and wondered if we could do anything about it. Our first two layers of support had trouble reproducing the problem, but they agreed that there was something flaky about the links, and they had a high regard for the librarian in question, so the problem got escalated to me (in a small company, even the President should be part of the customer support team). As it turned out, I had configured the link in question myself a year or two earlier, and I had a vague recollection that there was some hard-to-put-your-finger-on flakiness with it. So I started poking around.

The linking technologist's most important tool is a Firefox plugin called LiveHTTPHeaders. It lets you see how your browser is communicating with web servers. When I ran this tool, I immediately saw the problem. The links being sent to this publisher were being routed through a redirection chain. The first stage of the chain converted metadata-based links to database key links and set a session cookie. The next and final stage turned the key link into the actual content link for the user. The problem was that both of these redirects failed to comply with the HTTP standard for redirects.

When a link is embedded in a web page, it can be one of two types, relative or absolute. An absolute link uses a fully spelled out address. A relative link specifies the address of the link target relative to the current address. So, from this page, an address on this site could be written in relative form as /2009/09/redirector-chain-mashup-design-pattern.html. The browser has to do some computation to figure out the expanded address from a relative link.
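To make that computation concrete, here is a minimal sketch using Python's standard library. The page address below is a made-up placeholder (blog.example.org is an assumption for illustration, not the real address of this blog); only the relative path comes from the text above.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the reader is currently on.
current_page = "http://blog.example.org/2009/12/some-post.html"

# The relative link from the paragraph above.
relative_link = "/2009/09/redirector-chain-mashup-design-pattern.html"

# The expansion step: resolve the relative link against the current address.
absolute_link = urljoin(current_page, relative_link)
print(absolute_link)
```

The expansion only works because the browser has an absolute address to start from; that detail matters for what follows.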

The problem with the flaky links was that the server was sending back relative links in the redirect. Although you can do that in a web page, it's not allowed in the HTTP standard for the server. The most popular web server software, Apache, won't even permit a web developer to send a relative link; in fact, it will helpfully do the expansion calculation automatically. Our flaky link, however, was being served by Microsoft's IIS server software, which doesn't do the expansion. It's possible that the bug arose when a web application designed for Apache was deployed in IIS.

Although I could see where the problem was coming from, and I could see that our linking software was thankfully not to blame, I was thoroughly puzzled. Why did the link only seem flaky, instead of just failing? The answer must lie in the client software- the bits of code on the user's computer that talk to the server.

It's usually a good idea when implementing a communications protocol to be strict about what you send and forgiving about what you receive. So when a relative address is received by an HTTP client, it tries its best to recover from the error. It goes and gets the most recent address it's received and uses that to expand the relative address into an absolute address. So, in the vast majority of cases, users don't experience a problem when a server sends a relative redirect to their browser.

The flaky link problem I was looking at resulted in link failure because the server was sending two relative links in a row. The first link could get expanded just fine, but the second link only had a relative address to expand against, resulting in complete failure of the link.

How could a problem like this have escaped notice? And here we have the geek punch line: the expanded address gets put in the redirect cache! So the second time the user clicks on the link, the initial address is found in the cache, and the first relative redirect is skipped over. Only the second relative redirect is performed, and this time there's a full address from the cache to expand against, and the link works as intended.
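For the non-geeks, the sequence can be played out in a toy simulation. This is only a sketch of the behavior as I've described it, not the code of any real browser or server: the host name, the paths, and the FlakyClient class are all invented for illustration.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical server: each request address answers with a RELATIVE redirect.
SERVER_REDIRECTS = {
    "http://journals.example.com/openurl?id=123": "/lookup?key=abc",
    "http://journals.example.com/lookup?key=abc": "/content/abc.html",
}
LINK = "http://journals.example.com/openurl?id=123"


def is_absolute(address):
    """An address counts as absolute here if it carries a scheme such as http."""
    return bool(urlparse(address).scheme)


class FlakyClient:
    """Toy client that expands a relative redirect against the last address it received."""

    def __init__(self):
        self.redirect_cache = {}

    def click(self, url):
        # A cached expansion lets the client skip the first redirect entirely.
        request_url = self.redirect_cache.get(url, url)
        last_received = request_url
        while request_url in SERVER_REDIRECTS:
            location = SERVER_REDIRECTS[request_url]
            if not is_absolute(last_received):
                return None  # only a relative address to expand against: failure
            expanded = urljoin(last_received, location)
            self.redirect_cache.setdefault(url, expanded)
            last_received = location  # the raw, still-relative address
            request_url = expanded
        return request_url


client = FlakyClient()
first_click = client.click(LINK)
second_click = client.click(LINK)
print("first click: ", first_click)
print("second click:", second_click)
```

In this model the first click fails at the second redirect, and the second click succeeds because the cache supplies a full address to expand against.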

If you are an HTTP geek of any sort, you can get up off the floor and stop laughing already. If you are not, read on, and I'll try to explain why it is truly hilarious.

Once I figured this bug out, I did some digging. I found that the same thing was happening not only for the OpenURL links that my software made for linking within a library, but also for CrossRef DOI links, which are used pervasively for linking between publishers. I determined that it was not just one publisher that had the problem, it was every publisher that used the same e-journal hosting service. There were about a thousand e-journals affected. And it had been that way for at least a year. My guess is that this link failure had frustrated users at least a million times.

Ninety nine times out of a hundred, the user would just try clicking the link again, and wouldn't think twice about the problem. But that other 1% (10,000 users) would ask a Local Expert for help- perhaps a librarian, perhaps a more tech-savvy friend, whatever. The Local Expert would extend her aura over the user's computer, the user would try it again, and the link would work like a charm.

Ninety nine times out of a hundred, the Local Expert would just smile, having "solved" the problem. But 1% of them (100) would smell something fishy, and ask the user to try another link, and it would fail. The local expert would go back to her desk, and try it a few more times, getting more and more mystified at the flakiness of the link, and decide to report it to a Responsible Party.

Ninety nine times out of those hundred, the Responsible Parties would, for a variety of reasons, fail to do anything about the flaky links. Maybe it was another company's problem. Maybe the problem was too hard, or not bad enough. Maybe they didn't really care.

The hundredth time, it was me. I actually enjoyed finding the bug. I knew the top technical guy at the e-journal hosting company, and I just sent him a description of the issue. Because he knew me, he paid attention, and the problem was fixed in their next update. Millions and millions of the links have worked since.

Isn't that hilarious?
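For anyone who wants to check the arithmetic of that story, here it is as a few lines of Python. The one million starting figure is my guess from above, and the 1% rates are the story's round numbers, not measurements.

```python
# Round numbers from the story; none of these were measured.
failures = 1_000_000            # guessed number of frustrated clicks
asked_expert = failures // 100  # the 1% who asked a Local Expert
reported = asked_expert // 100  # the 1% of experts who smelled something fishy
acted_on = reported // 100      # the one Responsible Party who did something

print(asked_expert, reported, acted_on)
```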

Thursday, December 3, 2009

The Story of Gluejar

When I register for conferences or meetings or that sort of thing, I'm often asked to fill out a company name. I write in "Gluejar Inc.", because I happen to be President of that New Jersey Corporation. So I get a preprinted badge that says my name and the company name right beneath it. I engage people in conversation and they start staring at my badge.

"'Gluejar' I haven't heard of that company. What does it do?" they ask me.

"Nothing." I usually say.

There's usually a nervous chuckle as they wait for the punch line, but there isn't any. "I'm sort of on sabbatical" I say. Or "I've been blogging a lot recently". Or even "I'm not really gainfully employed".

Sometimes, instead of "Nothing", I'd say "It's a slumber-mode start-up" which seemed to make people think I had a team of engineers secretly designing high-tech beds.

One friend says I'm "a gentleman of leisure".

My son has been pestering me to take up farming. He comes home and asks "How was non-work today, Dad?"

Sometimes I tell the true story of Gluejar:

Almost four years ago, I sold a linking technology business I started in 1998. It was called Openly Informatics, Inc. The buyer (OCLC), being a non-profit (501(c)(3)) corporation, chose not to acquire the corporate entity, but merely all of its business and (non-cash) assets, including me. I had to keep the company to fulfill contractual obligations. But since one of the assets they were buying was the name of the corporation, I had to get a new name for the company.

I called in my high-cost naming consultant (the aforementioned son, then 11) and asked for name ideas. He asked what the company would do, and I told him that for a while it would just be a container of cash, but that in the future, who knew? I liked doing things that made connections, that put things together, so maybe he could think of a name with that sort of connotation. He suggested "Goo-jar". I liked that, it seemed sticky, and jars are packages for java applications, so that part was both containerish and softwarish.

I did some googling to see if I could get "", but I found that "goojar" was an alternate spelling of Gujar, an Indian ethnic group, so we made the name even stickier and settled on "Gluejar".

A little more than three years later, it was time for me to move on. Initially, I thought I might restart Gluejar right away, but I soon thought better of it. I decided that everything would be much simpler (and a whole lot more fun!) if I waited 9 months until certain commitments expired.

Instead, I started studying technologies, industries, and issues and writing about them on the blog. I started twittering! I went to meetups, conferences and workshops. I went inside a Federal Courtroom! I even did a harvest of purple amaranth.

So now it's time for me to get serious about deciding what to do next. In the next month or two, I plan to write about the directions I'm considering. Some of them involve reactivating Gluejar, some of them don't. I don't really believe that startup ideas can be stolen, so I'll try to write about as many of my ideas as I can. I like building things, and the first step in building on an idea is spreading it around.

Wednesday, December 2, 2009

Databases are Services, NOT Content

I'm very grateful for advice my fellow entrepreneurs have given me; when you meet someone else who has started a company you have an instant rapport from having shared a common experience. I remember each bit of advice with the same stark clarity that characterizes the moment I realized that Santa Claus was the neighbor dressed up in a white beard and a red suit.

A business owner who I've known since sixth grade gave me this gem: "My secret is providing the best service possible, and charging a lot for it." In executing my own business, I did pretty well at the first part and could have done better at the second part.

As I wrote about the effect of database rights on the postcode economy, I kept wondering if I would have done anything differently in my business if that database protection had been available to me. Would I have charged more for the database that my company developed?

In the comments to that post, I was alerted to a book by James Boyle, called The Public Domain. Chapter 9 in particular parallels many of the arguments I made. One thing I found there was something I had wanted to look for- information about how the database industries in general have done since the "sweat of the brow" theory for copyright was disallowed by the Supreme Court. It turns out that the US database industry has actually outpaced its counterpart in the UK since then by a substantial margin. Why would that be?

I think the answer is that building databases is fundamentally a service business. If your brow is really sweating, and someone is paying you to do it, then it's hard to think of that as a "content" business. Databases always have more content than anyone could ever want; the only reason people pay for them is that they help to solve some sort of problem. If your business thinks it's selling content rather than services, chances are it will focus on the wrong part of the business, and do poorly. In the US, since database companies understand that their competition can legally copy much of their data, they focus on providing high quality added value services, and guess what? THEY MAKE MORE MONEY!

Then there's Linked Data. Given that database provision is fundamentally a service business, is it even possible to make money by providing data as Linked Data? The typical means for protecting a database service business is to execute license agreements with customers. You make an agreement with your customer about the service you'll provide, how much you'll get paid, and how your customer may use your service. But once your data has been released into a Linked Data Cloud, it can be difficult to assert license conditions on the data you've released.

It's been argued that 'Linked Data' is just the Semantic Web, Rebranded, but it's also been noted that Linked Data is sorely in need of some proper product management. Product management focuses on a customer's problems and how the product can address them. You can believe me because I've not only managed products, I've had 2 whole days of real product management training!

One thing I was taught to do in my Product Management class was to come up with a one-sentence pitch that captures the essence of the product. When I was an entrepreneur, this was called the elevator pitch. After thinking about it for about 9 months I've come up with a pitch for Linked Data:
Linked Data is the idea that the merger of a database produced by one provider and another database produced by a second provider has value much larger than that of the two separate databases.
or, in a more concise form, V(DB1+DB2)>>V(DB1)+V(DB2).

Based on the products that have been successful this year in the application of semantic web technologies, it looks to me that the most successful have been focused on what I saw Tim Gollins tweet that Ian Davis called "Linked Enterprise Data" (attributing the term to Eric Miller). If the merged databases are contained within the enterprise, the enterprise clearly reaps all the added value. Outside the enterprise, however, the only Linked Open Data winners so far have been the ones who have built services on databases merged from others.

Proper product management would have made it a goal for Linked Open Data to have data contributors share somehow in the surplus value created by the merged services. In the next couple of weeks, I hope to describe some ideas as to how this could happen.