Monday, January 31, 2011

How to Fund a Public eBook Library with Tax Deductions

In California, newly (re)elected governor Jerry Brown has proposed to zero out state funding for libraries. In Britain, government cuts could lead to as much as a quarter of librarians losing their jobs over the next year. In New York, the Queens library has responded to budget cuts by zeroing out its acquisition budget, instead of cutting opening hours. In my home town of Montclair, New Jersey, a historic branch building is being closed and the reference department is being eliminated.

Given the worldwide climate of cutting libraries, the recent proposal for the establishment of a "Digital Public Library of America" (DPLA) would seem to be either the product of castles-in-the-sky delusion or the result of watching too many Star Trek reruns. But it's worth thinking through the DPLA for a few moments before dismissing it as unrealistic.
Despite the complexities, the fundamental idea of a National Digital Library (or NDL) is, at its core, straightforward. The NDL would make the cultural patrimony of this country freely available to all of its citizens. It would be the digital equivalent of the Library of Congress, but instead of being confined to Capitol Hill, it would exist everywhere, bringing millions of books and other digitized material within clicking distance of public libraries, high schools, junior colleges, universities, retirement communities, and any person with access to the Internet.

I don't quite understand whether or how a National Digital Library could be restricted to its citizens; it would effectively be a world library. That's a quibble, or perhaps a warning about territorially restricted rights, but you get the idea: it should be an international effort. The Sloan-funded planning initiative should go a long way toward fleshing out the idea.

The most difficult part of building such a library will be finding stuff to put on its virtual shelves. There's public-domain material to start. Cooperative scanning and digitization of such material would create an important resource for scholars, and much of this work has already been done. Almost two million volumes' worth of digitized scans of public domain works are already available (with varying availability) through the HathiTrust cooperative. Some of these scans were made as part of the Google Books project.

Another source of books for DPLA would be material that is out of print but not out of copyright. The Internet Archive suggests that many of these works can be made available to the public as if they were physical books, as long as a physical copy is held in reserve as the basis for doing so. It's not clear whether doing so would be consistent with copyright law, except perhaps for books without economic value. Many rightsholders might be willing to donate their rights to DPLA, just as many rightsholders appeared to be willing to allow use by Google for modest compensation.

Unfortunately, many rightsholders are absent and can't be found. Many rightsholders lack good documentation of rights or may be uncertain whether rights have been properly reverted. The Google Books Settlement Agreement attempted to settle or circumvent litigation surrounding these issues; the agreement has thus far not been acted on by the court. It's quite possible that legislative action would be required to allow a digital public library to freely provide this sort of content. Good luck with that.

What's missing from the proposed DPLA is any way to stock it with content that people actually want to read. Publishers that I've talked to are loath to give up their backlists because they believe the backlists may become valuable through digitization and enhancement. Often the rights surrounding their backlists are so murky that it would take a significant expenditure just to clear rights. The only backlist acquisitions a DPLA would likely be able to afford are those that are truly worthless.

There IS, however, a way, in the existing tax code, to direct tax money toward acquiring eBooks for a DPLA, or for any library, for that matter. It's called the deduction for charitable donations. As long as doing so is consistent with its charitable purpose, any recognized charity could collect donations toward the acquisition of rights to an ebook. If the ebook were then made available under a public license, such as one of the Creative Commons licenses, it would further that public charitable purpose.

For example, a charity focused on finding a cure for a disease (e.g. Huntington's) could ask supporters, acting together, to buy out the digital rights to books and journal articles relating to research, treatment and patient care for that disease. Similarly, a museum devoted to the history of New York City tenements could ask supporters to acquire works of fiction that bring to life the period and lifestyle of immigrants living in lower Manhattan in the early 20th century. Libraries could join together to support all sorts of books that further their public purposes of promoting literacy, economic development, and cultural enrichment.

Donations made in this way would be fully tax deductible, at least in the US. I don't know the situation in other countries, but I note that many European countries have unfavorable treatment for eBooks under their value added tax (VAT) systems. I learned at Digital Book World that Italy has a 4% VAT for print books, but a 20% VAT for eBooks, which are considered to be a service. The VAT for charitable donations is, of course, 0%.
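Purely as an illustration of the rates quoted above (the €10.00 net price is a made-up figure, and actual VAT treatment varies by country and transaction):

```python
# Illustrative VAT arithmetic using the Italian rates quoted above.
# The 10.00 euro net price is an invented example figure.
net_price = 10.00

print_book = net_price * (1 + 0.04)  # 4% VAT on a print book
ebook      = net_price * (1 + 0.20)  # 20% VAT on an ebook ("a service")
donation   = net_price * (1 + 0.00)  # 0% VAT on a charitable donation

print(f"print book: {print_book:.2f}")  # 10.40
print(f"ebook:      {ebook:.2f}")       # 12.00
print(f"donation:   {donation:.2f}")    # 10.00
```

The same euro spent as a donation buys a fifth more content than the same euro spent on an ebook, before the income-tax deduction is even counted.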

Readers of this blog will recognize that this is another way to present "ungluing ebooks". I've been working on ways for libraries and consumers to join together in a market to acquire the right to put ebooks into the public commons. This could happen through a collective of libraries or through a web site like Kickstarter, or both. Once acquired for the public commons, the ebooks could be made available through DPLA or any other library or ebook distribution service.

Apart from the tax advantages, the benefit of working with charitable institutions to crowd-fund the acquisition and liberation of ebooks is that early validation of the concept will be easier. No one really knows whether people will put real money towards the movement of ebooks into the public commons. Working on a project catalog much smaller than "all the world's books" with a target audience more focused than "people who love books" will allow us to more accurately gauge the willingness of the public to support public ebooks.

I've spent a lot of time this month explaining my ideas for ungluing ebooks. The question that keeps recurring is whether I plan to act on these ideas by forming a non-profit. The most persuasive argument for this is that the idea of acquiring the rights to an ebook for the public commons is much easier to understand when put in the context of a public charity, such as public television or public radio. The whole idea of free books is so foreign that people assume free ebooks means pirated ebooks.

For now, though, I'll be pursuing this via my for-profit company, Gluejar, Inc. That's because I talked to my lawyer. He pointed out that it's much easier to deduct the costs of a speculative business venture from my taxes than it would be to convince the IRS that this weirdo crowd-funded public ebook idea is really a charitable endeavor.

I'd like to acknowledge Juliet Sutherland, Director of Distributed Proofreaders, and Richard Cave, I.T. Director of the Public Library of Science for very helpful contributions to the ideas in this post. I'm not an accountant or lawyer; this article is not any sort of tax advice.

Wednesday, January 26, 2011

It's No Pocalypse at Digital Book World

© Babette Ross
"It's like you're taking a first step on the road to the valley of death." The topic was ebook metadata, but the speaker's statement could as well be applied to the Digital Book World (DBW) Conference as a whole. "Fear no ebooks" was the message of the conference, and it was a welcome message to many of the participants that I talked to. "I'm just trying to learn about ebooks" and "we're trying to decide what to do" were phrases I heard more than once.

In contrast to O'Reilly's Tools of Change for Publishing conference, which is coming to the same venue only 3 weeks later, DBW is not going to scare the publishing community with revolutionary business models or fire and brimstone sermons about the dire future of publishing. DBW was about providing a security blanket and a helpful hand to trade publishers venturing into a world full of doubt and uncertainty.

Mike Shatzkin (© Babette Ross)

DBW Shepherd-in-Chief Mike Shatzkin did a great job developing a modestly challenging and useful program. His opening list of suggestions mirrored the topics of the executive panel. He exhorted publishers to:
  1. Begin to engage with their consumers and communities.
  2. Get their rights in order.
  3. Reach beyond Amazon and Google to other markets and channels, through partners such as Ingram and Overdrive.
The mood of the conference, however, was set by conference organizer Guy Gonzalez, self-styled Chief Executive Optimist. Although one attendee worried to me about pervasive complacency in the trade publishing industry, Gonzalez's view is that publishing is an activity fundamentally essential to our culture, and that one way or another, publishers are finding ways to survive and thrive as their focus shifts from a print oriented supply chain to a digital ecosystem.

Guy Gonzalez (© Babette Ross)
The conference's discussion of the role of libraries in that ecosystem was emblematic of the conference as a whole. In a question for Tuesday's executive panel, Sarah Wendell of Smart Bitches/Trashy Books asked Macmillan US President Brian Napack why she was unable to borrow his books from libraries. His non-answer was that Macmillan was "hard at work trying to find a business model that will work for us", and no, libraries had not "fallen by the wayside". After the panel, Napack exited quickly; I'm betting it was not so that he could get back to the office and work on a library strategy.

Open Road Integrated Media CEO Jane Friedman disagreed firmly with Napack's remarks. Her goal is to have all her books in libraries, because the library consumer is not the same as the book buying consumer. Someone downloading an ebook from a library is "only one step away from being a customer."

The follow-up to this discussion came this afternoon, in a panel that Gonzalez called the session he was most proud of. Moderated by Library Journal's Josh Hadro, the panel included both a librarian (New York Public Library Deputy Director Christopher Platt) and a big 6 vice president (Random House Director of Account Marketing Ruth Liebmann), which doesn't happen very often.

Platt explained the basics of how ebook lending works at NYPL, noting that the library did a lot of work to familiarize patrons with the mechanics of ebook lending, and he pointed out that a patron interested in Jonathan Franzen's Freedom (as an example) had to be told that its publisher was unwilling to allow library lending.

Liebmann pointed out that libraries have mechanisms to reach out to readers and promote a publisher's materials, exactly the sort of engagement missing for most trade publishers. A library book does not compete with sales, a library book IS a sale. Libraries provide a revenue stream for publishers comparable to independent book sellers, and it's a profitable one- libraries never return books the way bookstores do.

According to George Coe, President of the Library and Education Division of book distributor Baker & Taylor, the library market constitutes a total of $1.9 billion in the US. He pointed out that libraries could reach at the very most only 2% of the market for a popular book, and it was exactly the same for ebooks. His company was doing everything it could to protect the profitability of publishers that participated in its ebook program. Library customers are also easy on inventory- 98% of their purchases come within 18 months of a book's publication.

But it was Overdrive's Steve Potash who delivered the most powerful argument that libraries belong in the ebook ecosystem. The visibility that libraries give to ebooks is incredibly valuable. With the millions of page views the libraries were giving to ebooks, the publishers should be paying the libraries, not the other way around, according to Potash. It's worth noting that no other provider of ebooks in libraries has nearly as high a publisher-world profile as Overdrive. Overdrive is playing an important role in getting publishers to think about libraries as a distribution channel, and Potash's evangelical presence on the panel played well with the audience of publishers. He even gave them homework. "Go and try it yourself!" he urged. I hope they manage to do so.

Liebmann summed up the session, and unintentionally the conference as well, when she described her "Library Listening Tour". By going out and meeting the customers, she learned what they really wanted from ebooks, which was useful even if she wasn't going to be able to make everybody's dreams come true. Were her dreams going to come true? "I'm feeling so good at DBW, I'm thinking that maybe they will."

(Photos © Babette Ross, used with permission.)

Friday, January 21, 2011

Doing Good Things Together

I've spent a lot of time in the past few weeks explaining to everyone I meet why I think ordinary people might be willing to help acquire ebook rights for the public commons. Meanwhile, I kept noticing how people are getting together to do other good things.

At ALA Midwinter, there was tweeting going around about an effort by four twittering librarians, Andromeda Yelton (@ThatAndromeda), Ned Potter (@theREALwikiman), Jan Holmquist (@janholmquist), and Justin Hoenke (@JustinLibrarian) to "buy India a Library". As of Wednesday, they had raised £1384; the fund raising ends today, so hurry on over if you want to participate.

On the mailing list for organizers of the Code4Lib conference, Dan Chudnov was agitating for a way for anybody to become a conference sponsor. There were a number of minor issues to overcome, but Kevin Clarke took up the challenge and created a ChipIn page to collect money from those who wanted to contribute to a sponsorship. This page raised $1,240 from 28 contributors. (Sorry, too late for that!)

I also found out about an ambitious effort by Michael Porter and friends called Library Renewal. They've created a non-profit organization to explore "new content solutions for libraries, while staying true to their larger mission." This is an effort that's still in its formative stages; it's an effort you can join and help shape.

These three projects are all in the library world, but please don't think that good people doing good things aren't everywhere around you. I've been inspired by my college friend Noel Valero. After graduation, he worked as an aerospace engineer and then as an IT consultant, until he began to have trouble with spasms in his arm. He spent a lot of time seeing doctors who were unable to help him until finally, with the help of another classmate, he was diagnosed with dystonia, a little-known but not-so-rare disease that causes progressive loss of motor control.
Dystonia is the 3rd most common movement disorder, with an estimated 500,000 patients diagnosed with primary and secondary forms of the disease and possibly at least another 500,000 others that are undiagnosed or misdiagnosed. Yet dystonia lags significantly behind in research funding when compared to other neurological disorders.
Many sufferers of dystonia lose hope amid the progression of the disease, partly because of the isolation it forces on people. Simple everyday tasks become huge barriers. Even holding a book to read it can be difficult. One dystonia sufferer that Noel introduced me to reports that she can only manage her graduate school textbooks by chopping off their spines and dividing them into easy-to-hold segments. Driving a car or typing on a computer can become exhausting activities.

With loving support from his family and friends, Noel has climbed out of his initial despair. He started reaching out to other dystonia sufferers on Facebook (his daily joke posting is a resource for non-dystonia-sufferers as well!) and was surprised to find how much it helped for people with dystonia to be able to support each other. In 2009, he took these efforts to the next level by forming the American Dystonia Society.

On February 2, Noel will be on an episode of "Mystery Diagnosis". If you have access to the new "Oprah Winfrey Network", please join me in watching the show (or record it for later viewing). And if you enjoy reading this blog (or if you don't), I would be honored if you made a donation of any size to the American Dystonia Society in appreciation.

In possibly related news, the blog's Amazon Associate revenue statement for 2010 just came in: $10.04.

Wednesday, January 19, 2011

eBook Identifier Confusion Shakes Book Industry

Taipei 101
I've only felt a strong earthquake once. I was on the second floor of an engineering building at Stanford, and as soon as the initial jolt shook the building I thought "cool, it's an earthquake!". Then the rolling started. It was only after the shaking was over that I started shaking myself. The feeling of solid ground beneath my feet had been wrenched out of my psyche, leaving me standing on a big bowl of jelly that could start jiggling again any moment.

Big earthquakes can cause building damage and collapse. Sometimes, it's because a builder hasn't followed code, and the violations are exposed by the stress of a quake. Other times, it's because the building code didn't properly anticipate the stresses of the earthquake. Either way, after a severe earthquake, buildings need to be inspected to assess damages and to determine if changes need to be made in the building code.

Modern technology allows buildings to soar through traditional limitations. For example, the engineers of Taipei 101, which was the world's tallest building from 2004 to 2010, put a huge tuned mass damper system at the top of the tower. They made a virtue out of necessity, and the damper is now on display as a dramatic part of the Taipei 101 tourism experience, well worth the visit if you go to Taipei. (I was there in 2006.)

the tuned mass damper in Taipei 101
The Book Industry has been experiencing tectonic shifts as it moves from the solid foundation of print-based production and distribution to digital forms. The so-called "supply chain" is a long-standing edifice of the book industry being shaken by the resulting quakes. One of the strings holding the supply chain together is the ISBN, and it has proven to be reasonably robust. Still, there's been enough "damage" to the ISBN and the supply chain it holds together that many participants in the book industry have been concerned for its integrity. (I wrote about the situation in July.)

Last Thursday, I was fortunate to be at a presentation of the Book Industry Study Group (BISG) about identification of eBooks. BISG hired Michael Cairns, the principal of Information Media Partners, to do a study of the use, issues and practice surrounding assignment of ISBNs in the US book industry. Think of him as a structural engineer hired to inspect the damage to the supply chain's supporting infrastructure after an earthquake. Cairns conducted 55 separate interviews with a total of 75 industry experts from all facets of the industry. (I was interviewed for my expertise in the use of ISBN in library linking systems).
Cairns (@personanondata on Twitter) is an industry veteran- he's held senior executive positions at Bowker and other companies. His presentation was clear and direct, and he quickly went to the heart of the matter. He found very little support for the policy set forth by the 2005 revision of the ISBN standard regarding when to assign a new ISBN to an ebook. Not surprisingly, he found that implementation of that policy is all over the map, with little coherence between one company and another in ISBN assignment practice. What's more, he found that the industry is almost unable to communicate with itself due to the wide variations in the practical definitions of terms such as "format", "product", "version" and "work".

Despite the difficulties created by the uneven application of the standard, there's no collective desire in the industry to "fix" the problem. Everybody has patched their systems to make them work in spite of a damaged infrastructure. The result is that poor practice has been structurally incorporated into the ebook supply chain, such that it doesn't help any more to do things correctly. If everyone started following the rules tomorrow, the supply chain might stop working.

It's as if an addition to a building needed to be built during an earthquake, even as things continued to shake. The framework is crooked, but that's needed to keep the building from falling over. You shouldn't expect such an addition to be perfect; it's something of a miracle that it can be built at all.

One example of how supply chain tremors put stress on the supply chain edifice was raised in the discussion after Cairns's talk. At Barnes & Noble, they are enhancing some ebooks for the Nook. The enhanced ebooks are then offered at a different price than unenhanced ebooks. Normally, this would not affect ISBN assignment, because the modified ebooks are sold only by BN in the Nook store, and no one else would be affected. But last year, the supply chain was shaken when 5 of the big 6 publishers moved their ebooks to the "agency model". All of a sudden, the prices of ebooks sold in the Nook store were being set by the publisher. The publisher was now pulling price strings for each version of the ebook, and the string being used was, you guessed it, the ISBN. So the result of the shift to an agency model was that a whole bunch of ebooks suddenly needed their own ISBNs.
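A toy sketch of why agency pricing forced the new ISBNs, assuming (as the discussion suggested) that the publisher's price feed is keyed by ISBN; the ISBNs and prices below are invented for illustration:

```python
# Toy model: a publisher's agency price list, keyed by ISBN.
# The ISBNs and prices here are invented.
agency_prices = {}

def set_price(isbn, price):
    agency_prices[isbn] = price

# If the plain and enhanced editions share one ISBN, the second
# price silently overwrites the first: one key, one price.
set_price("9780000000001", 9.99)    # plain ebook
set_price("9780000000001", 14.99)   # enhanced edition, same ISBN
print(agency_prices["9780000000001"])  # 14.99 -- the 9.99 edition is gone

# Give the enhanced edition its own ISBN and both prices coexist.
set_price("9780000000001", 9.99)
set_price("9780000000002", 14.99)
print(len(agency_prices))  # 2 distinct priced products
```

When the retailer set prices, the collision didn't matter; once the publisher's price feed became authoritative, every differently priced version needed its own key.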

While everybody seems to be scraping by for now, there may be severe problems lying ahead. Cairns pointed to libraries as a supply chain participant that was already experiencing ebook ISBN dystopia, and he suggested that the experiences of libraries today may presage the sort of problems which may spread to consumer markets as the ebook industry matures.

Libraries have historically had a different relationship to metadata than publishers and other supply chain participants. They KEEP their books. Publishers pay a lot of attention to metadata when a book is created because it helps them sell books. Then, they're pretty much done with the metadata. If the data rots (goes out of date), it's not really a publisher problem. So libraries have maintained their own metadata to allow them to manage their collections.

eBook metadata is forever. Because ebooks are licensed, not sold, the licensor retains a relationship with the purchaser extending beyond the sale, and must maintain metadata surrounding the license for much longer than in the case of printed books. There are new sets of intermediaries and many more possibilities for business models. This is already playing out in library distribution channels, where ebooks are being licensed, lent, rented, printed, viewed, bundled into packages and purchased. If multiple sets of licensing terms are used for an ebook, resulting in multiple products with different prices attached, are new ISBNs needed? In the past, the answer would be a clear "no"; things like the agency model have changed that to a clear "I don't know".

Another issue laid out by Cairns was the low profile and negative perception of the US ISBN Agency (and by extension, ISBN International) in the ebook industry. Many of his interviewees had the impression that the assignment policies were being driven by the agency's business model (basically, the selling of ISBNs and related databases). If only it were so simple!

Brian Green, Executive Director of ISBN International, spoke briefly about a similar study his group had commissioned. Although this study (PDF, 509KB) focused less on the US situation, many of its findings were similar to those of the Cairns report. At least one recommendation in that report has been acted on- ISBN International has released an updated FAQ (PDF, 363 KB) on assignment of ISBNs to e-books. You can act on another recommendation by disseminating the FAQ widely!

The BISG's role in all of this is to serve as a place where the book industry can sit together and figure out how to function more effectively. The work of the BISG committee that sponsored the Cairns study will be to develop new consensus around practices and resources that will help to solve problems. Clearly, the committee has a lot of work to do, building on the structural assessment laid out by the Cairns report. Development of a common vocabulary and set of definitions may be a very productive starting point for the group.

Perhaps the book industry will need the standards equivalent of a tuned mass damper. I can't wait to visit that skyscraper.

Saturday, January 15, 2011

Why ProQuest Bought ebrary

The New York Times
Take a look at the New York Times homepage. Then take a look at CNN.com or MSNBC. How do you tell which website belongs to a newspaper and which ones belong to a television network? All of them have video. All of them have text. All of them have blogs and forums. As media moves onto the internet, the boundaries between old media genres begin to blur, and new forms emerge, optimized for the purposes they're being used for.

Just as delivery of news is being transformed by the Internet, the needs of students, researchers, and scholars are driving a similar boundary-blurring transformation in libraries. It's also driving a transformation in the companies that serve the library industry.

Marty Kahn, President of ProQuest, used the Times-CNN analogy to explain to me why his company had acquired ebrary, a leader in providing ebooks to academic, corporate, and other libraries. It no longer makes sense for a company to specialize in only journal articles, databases, or eBooks if it wants to be able to provide coherent and evolving solutions.

A look at ProQuest's existing product suite bears that out. With full-text journal databases, newspapers, dissertations, historical archives and government documents (including the CIS division recently acquired from LexisNexis) ProQuest was already able to integrate an impressive array of content. The Summon service from ProQuest's SerialsSolutions unit, which centrally indexes a library's content, has experienced rapid growth, with sales at 200 institutions already. Still, the most common questions that Summon staff were fielding at ALA Midwinter surrounded the integration of ebooks into Summon. With the acquisition of ebrary, ProQuest can now answer that question authoritatively for at least one ebook vendor. (See my previous article focusing on Overdrive.)

Somehow, the topic of EBSCO and their recent acquisition of NetLibrary hardly came up in my talk with Kahn. We spent a lot more time discussing Google. Between Google Search, Google Scholar and Google Books, Google also has the potential to present a comprehensive information solution for libraries. I often hear librarians expressing the sentiment that they need help from companies like ProQuest to present credible alternatives to Google and free sources available on the internet.

One thing Summon and other library search solutions have lacked is the ability to search the full text of the books in a library's collection. Put next to Google Books' full-text-plus-metadata search, the metadata-based search offered by a traditional library catalog can seem rather limited to most users. ebrary will bring with it a huge library of full-text book content for search within Summon.
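A toy illustration of the gap, with invented titles and body text: a metadata-only catalog search misses a book whose relevant words appear only in the body.

```python
# Toy comparison of metadata-only vs. full-text search.
# The titles and body text here are invented.
books = [
    {"title": "Coastal Birds of New Jersey",
     "text": "The piping plover nests on open sandy beaches."},
    {"title": "A Field Guide",
     "text": "Chapter 3 covers the piping plover in detail."},
]

def metadata_search(query):
    """Search titles only, like a traditional catalog."""
    return [b["title"] for b in books if query in b["title"].lower()]

def fulltext_search(query):
    """Search titles and body text, like Google Books."""
    return [b["title"] for b in books
            if query in b["title"].lower() or query in b["text"].lower()]

print(metadata_search("plover"))  # [] -- the catalog finds nothing
print(fulltext_search("plover"))  # both books match
```

The word the user cares about simply isn't in either title, so no amount of catalog polish can surface it without the full text.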

ebrary was founded by high school friends Christopher Warnock and Kevin Sayar. Libraries were the focus from the very start. Warnock had left a job at Adobe Systems and was working on a project for Stanford University when Stanford University Librarian Mike Keller told him that in order to get paid, he had to incorporate. Warnock called up his friend Sayar, then an attorney at the legendary Silicon Valley law firm of Wilson Sonsini Goodrich & Rosati, and asked if he wanted to act on their high school dreams of starting a company together. The project at Stanford led to the conception of ebrary's initial service for libraries. (I've often heard the misconception that ebrary is somehow an Adobe funded spin-off, because of Warnock's father's role as a Founder of Adobe. In fact, Adobe and the elder Warnock had no role in starting ebrary.)

Warnock has always been passionate about ebrary's mission. "If every library acquired information digitally, all the world's information would be free to everybody", he told me. He is genuinely excited about what ebrary will be able to do as part of ProQuest. "Being part of ProQuest will allow us to realize our dreams".

Those dreams include the creation of a vast digital library with all kinds of content. ProQuest has "billions" of PDF documents, according to Warnock; ebrary's PDF indexing and search technologies are considered to be unsurpassed anywhere. Although ProQuest is not known for ebook distribution, there's not much difference between a book and a dissertation, if you think about it. ProQuest distributes 70,000 of those every year.

ebrary has been an innovator in business models as well as in technology. ebrary's initial model was to make ebooks available for free viewing; rights-holders were compensated using a micro-transaction model where subscribers were charged every time they did things such as print pages. Based on customer feedback, they shifted to a model where most content is available for one flat subscription fee. This year, they've begun to implement a patron-driven acquisition model.

Looking forward, Sayar will be running the ebrary business unit; Warnock will move to ProQuest to work on strategy. Given the ambitious vision outlined by Kahn, he has his work cut out for him.

The ebrary content platform has definitely gained some ardent advocates in libraries. I heard one librarian say "not only do we love ebrary, but our students love ebrary. They really do." At the end of the day, when we ask ourselves how libraries will respond to the dizzying changes in both information and economic landscapes and worry about what will happen, isn't love all we really need?

Sunday, January 9, 2011

Bridging the eBook-Library System Divide

Despite what you might have read on the blogs, libraries show no signs of imminent ebook-induced death. The latest data from Overdrive, the dominant provider of eBooks to public libraries, shows staggering growth. Digital checkouts doubled in 2010 to 15 million, looking at Overdrive alone. Based on the buzz at this weekend's American Library Association Midwinter Meeting, Overdrive should blow those numbers away in 2011. It seems that almost every librarian I've talked to here has decided to "take the plunge" into eBooks in a big way in 2011.

The ebook companies focused on academic libraries are experiencing the same growth- Ebook Library told me that their monthly sales have been running at double those of the prior year. The biggest plunge was taken by ProQuest, which announced its acquisition of ebook provider ebrary. (I'll have a separate story on that later.)

To some extent, most libraries have been only sampling the ebook water, and despite noted usability issues and e-reader device fragmentation, patrons seem to want more and more, and libraries are responding to patron demand. But not everyone is happy. One librarian told me, after a few beers, that "Overdrive sucks!" and then went on to use language unsuitable for a family-oriented blog.

As far as I can tell, there are two issues around Overdrive that are troubling libraries. One derives from the DRM system from Adobe that Overdrive uses. Adobe's system is pretty much the only option for libraries and booksellers other than Amazon and Apple; Overdrive has no choice but to use this system in order to work with reader devices and software from Barnes & Noble, Sony and Kobo. The Internet Archive's Brewster Kahle, in a panel on Saturday morning, slammed the Adobe system, even though it's used by the Archive's own OpenLibrary. In OpenLibrary's experience, users were able to complete a lending transaction in only 43% of their attempts. Overdrive is working to improve the smoothness of these transactions, and is introducing new support methods to make the process easier.

The second issue was discussed by library system vendor executives at Friday's RMG President's Panel. According to Polaris Library Systems President Bill Schickling, many of his customers are worried that their libraries will be marginalized by ebook providers like Overdrive. Although Overdrive offers extensive customization options for its ebook lending interface, libraries are still upset that patrons have to use separate interfaces for books and ebooks, one provided by Overdrive and the other provided by their ILS vendor. Libraries often think of the library system as their primary "brand extension" on the internet.

It seems a bit odd that this should be an issue. For years, libraries have lived with databases and electronic journals delivered from separate systems. But books are different. Libraries want ebooks and books to live side by side. It makes little sense to force a user who wants to read a Stieg Larsson novel to check two places to see print and digital availability.

Overdrive is working overtime to address this second issue, it seems. Overdrive's CEO, Steve Potash, told me that his company is working on opening a set of APIs (application programming interfaces) that will allow system vendors, libraries and other developers to more deeply integrate Overdrive's ebook lending systems into other interfaces. Overdrive has needed these interfaces internally to build reading apps for Android, iPod and iPhone. Overdrive hopes to have an iPad-optimized reading app in Apple's iTunes store by the end of the first quarter of 2011, and will be working with selected development partners to work out many of the details. Potash hopes Overdrive will be able to unveil the APIs this summer at the ALA meeting in New Orleans.

The Overdrive APIs and the usability improvements they lead to should come as welcome news to libraries and library patrons everywhere. Library system vendors and developers in libraries will have a lot of work to do over the coming year.

And library patrons will be reading a lot of ebooks.

Saturday, January 8, 2011

Inside the Dataculture Industry

wild blueberries
I don't really know how all the food gets to my table. Sure, I've gathered berries, baled hay, picked peas, baked bread and smoked fish, but I've never slaughtered a pig, (successfully) milked a cow or roasted coffee beans. In my grandparents' generation, I would have seemed rather ignorant and useless. Agriculture has become an industry as specialized as any other modern industry, increasingly inaccessible to the layperson or small business.

I do know a bit about how data gets to my browser. It gets harvested by data farmers and data miners, it gets spun into databases, and then gets woven into your everyday information diet. Although you've probably heard of the "web of data", you're probably not even aware of being surrounded by data cloth.

The dataculture industry is very diverse, reflecting the diversity of human curiosity and knowledge. Common to all corners of the industry is the structural alchemy that transmutes formless bits into precious nuggets of information.

In many cases, this structuring of information is layered on top of conventional publishing. My favorite example of this is that the publishers of "Entertainment Weekly" extract facts out of their stories and structure them with an extensive ontology. Their ontologists (yes, EW has ontologists!) have defined an attribute "wasInRehabWith" so that they can generate a starlet's biography and report to you that she attended a drug rehabilitation clinic at the same time as the co-star of her current movie. Inquiring minds want to know!

If you look at location based services such as Facebook's "places", Foursquare, Yelp, Google Maps, etc, they will often present you with information pulled from other services. Often, a description comes from Wikipedia and reviews come from Yelp or Tripadvisor and photos come from Panoramio or Flickr. These services connect users to data using a common metadata backbone of Geotags. Data sets are pulled from source sites in various ways.

Some datasets are produced in data factories. I had a chance to see one of these "factories" on my trip to India last month. Rooms full of data technicians (women do the morning shift, men the evening) sit at internet-connected computers and supervise the structuring of data from the internet. Most of the work is semi-automated; software does most of the data extraction. The technicians act as supervisors who step in when the software is too stupid to know when it's mangling things and when human input is really needed.

There's been a lot of discussion lately about how spammers are using data scraped from other websites and ruining the usefulness of Google's search results. There are plenty of companies that offer data scraping services to fuel this trend. Data scraping is the use of software that mimics human web browsing to visit thousands of web pages and capture the data that's on them. This works because large websites are generated dynamically out of databases; when machines assemble web pages, machines can disassemble them.
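To make that disassembly step concrete, here's a minimal sketch in Python using only the standard library. The page, its markup, and the "item"/"price" class names are all invented for illustration; a real scraper would fetch the page over the network and cope with far messier HTML:

```python
# Because database-driven sites emit pages with regular, machine-generated
# markup, a parser can walk the HTML and pull the data back out.
from html.parser import HTMLParser

# A stand-in for a fetched page; a real scraper would download this
# with urllib.request or similar.
PAGE = """
<table id="prices">
  <tr><td class="item">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="item">Gadget</td><td class="price">14.50</td></tr>
</table>
"""

class PriceScraper(HTMLParser):
    """Collect (item, price) pairs from <td class="item"> / <td class="price"> cells."""
    def __init__(self):
        super().__init__()
        self.cell_class = None   # class of the <td> we are inside, if any
        self.row = []            # cells collected for the current row
        self.rows = []           # finished (item, price) rows

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cell_class = dict(attrs).get("class")

    def handle_data(self, data):
        if self.cell_class in ("item", "price"):
            self.row.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td":
            self.cell_class = None
        elif tag == "tr" and self.row:
            self.rows.append(tuple(self.row))
            self.row = []

scraper = PriceScraper()
scraper.feed(PAGE)
print(scraper.rows)  # [('Widget', '9.99'), ('Gadget', '14.50')]
```

The fragility is also visible here: the moment the site changes its markup, the scraper breaks, which is why the service firms earn their keep.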

A look at the variety of data scraping companies reveals a broad spectrum. Scraping is an essential technology for dataculture; as with any technology, it can be put to many ends. One company boasts of its "massive network of stealth scrapers capable of downloading massive amounts of data without ever getting blocked." Some companies, such as Mozenda, offer software to license. Others, such as Xtractly and Addtoit, are strictly service offerings.

I spoke to Addtoit's President, Bill Brown, about his industry. Addtoit got its start doing projects for Reuters and other firms in the financial industry; its client base has since become more "balanced". Customers wanting a leg up on their competitors pay premiums to companies such as Bloomberg, Reuters and D&B for environments rich in structured data. Brown's view is that the industry will move away from labor-intensive operations to being completely automated, and Addtoit has developed accordingly.

A small number of companies, notably Best Buy, have realized that making their data easily available can benefit them by promoting commerce and competition. They have begun to use technologies such as RDFa to make it easy for machines to read data on their web sites; scraping becomes superfluous. RDFa is a method of embedding RDF metadata in HTML web pages; RDF is the general data model standardized by the W3C for use on the semantic web, which has been discussed much on this blog.
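Here's a sketch of why RDFa makes scraping superfluous: the page itself labels each value with a vocabulary term, so a consumer only has to read attributes instead of guessing at layout. The HTML fragment and the "gr:" (GoodRelations-style) property names below are illustrative stand-ins, not Best Buy's actual markup:

```python
# Extract RDFa-style (property, content) pairs from an annotated fragment.
from html.parser import HTMLParser

FRAGMENT = """
<div about="#offer42">
  <span property="gr:name" content="Blu-ray Player"></span>
  <span property="gr:hasCurrencyValue" content="99.99"></span>
</div>
"""

class RDFaReader(HTMLParser):
    """Collect property/content attribute pairs from RDFa-annotated tags."""
    def __init__(self):
        super().__init__()
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # RDFa labels the data in the markup itself; no layout-guessing needed.
        if "property" in a and "content" in a:
            self.pairs[a["property"]] = a["content"]

reader = RDFaReader()
reader.feed(FRAGMENT)
print(reader.pairs)
# {'gr:name': 'Blu-ray Player', 'gr:hasCurrencyValue': '99.99'}
```

A real RDFa consumer would also resolve the vocabulary prefixes and the `about` subject into full RDF triples, but the contrast with the scraping example is the point: the structure travels with the page.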

This doesn't work for many types of data. Brown sees very slow adoption of RDFa and similar technologies but thinks website data will gradually become easier to get at. Most websites are very simple, and their owners see little need or benefit in investing in newer website technologies. If people who really want the data can hire firms like Addtoit to obtain the data, most of the potential benefits to website owners of making their data available accrue without needing technology shifts.

The library industry is slowly freeing itself from the strictures of "library data" and is broadening its data horizons. For example, many libraries have found that genealogical databases are very popular with patrons. But there is a huge world of data out there waiting to be structured and made useful. One of the most interesting dataculture companies to emerge over the last year is ShipIndex. As you'd expect from the name, ShipIndex is a vast directory of information relating to ships. Just as place information is tied together with geoposition data, ShipIndex ties together a world of information by identifying ships and their occurrences in the world's literature. The URIs in ShipIndex are very suitable for linking from other resources.

The Götheborg
ShipIndex is proof that a "family farm" can still deliver value in the dataculture industry, even though the process used to build it is comparatively hands-on. Nonetheless, in coming years you should expect that technologies developed for the financial industry will see broader application and will lead to the creation of data products that you can scarcely imagine.

The business model for ShipIndex includes free access plus a fee-for-premium-access model. One question I have is how effectively libraries will be able to leverage the premium data provided with this model. Imagine, for example, the value you might get from a connection between ShipIndex and a genealogical database, linked by passenger manifests. I would be able to discover the famous people who rode the same ship that my parents took between the US and Sweden (my mom rode the Stockholm on the crossing before it collided with the Andrea Doria). For now though, libraries struggle to leverage the data they have; better data licensing models are way down on the list of priorities for most libraries.

Peter McCracken
ShipIndex was started by Peter and Mike McCracken, who I've known since 2000. Their previous company (SerialsSolutions) and my previous company (Openly Informatics) both had exhibit tables in the "Small Press" section of the American Library Association exhibit hall, where you'll often find the next generation of innovative companies serving the library industry. They'll be back in the Small Press Section at this weekend's ALA Midwinter meeting. Peter has promised to sing a "shanty" (or was that a scupper?) for anyone who signs up for a free trial. You could probably get Mike to do a break dance if you prefer.

I'll be floating around the meeting too. If you find me and say hello, I promise not to sing anything.

Thursday, January 6, 2011

Fundamental Constant Numerology

My father was obsessed with units of measurement and fundamental constants. He got his engineering degree at the Royal Institute of Technology (Kungliga Tekniska Högskolan) in Stockholm, Sweden. His favorite professor there was Erik Hallén, who was famous for his work on antenna theory and for laying groundwork for the world's most widely used system of measurement, the SI system.

My dad nearly failed Hallén's class, which could be one reason for his lifelong obsession with units. The other reason was that Dad was convinced that he could explain some of physics' deepest questions about the nature of matter by applying the electromagnetic theory he learned in Hallén's class. He explained the structure of the electron by modeling it as a circulating charge wave in a resonant cavity formed by general relativistic warping of space. In his notes, he wrote
To me it looks like all the puzzle is defined and ripe to be put together and the extension to other particles will not be difficult- only time consuming.
Dad used these insights to come up with a relationship between the gravitational constant and the mass and charge of the electron. Here's his equation:
                     G = (6/π) × 10^-44 Z0 c (e/m)^2
where G = the gravitational constant (which defines the force that holds the universe together), Z0 is the impedance of free space, c is the speed of light, and e and m are the charge and mass of the electron.

Here's a prettier version of that equation, made using Roger's Equation Editor from the following TeX code:
                      G = \frac{6}{\pi} 10^{-44} Z_0 c (\frac{e}{m})^2
(TeX is the most commonly used formatting language for mathematics.)

A derivation of this equation would easily have earned my dad a Nobel Prize, but without a derivation and explanation of the underlying physics, it was just numerology. If you plug in the numbers, Dad's equation is within 0.025% of the consensus value for the gravitational constant, whose experimental uncertainty is about 0.01%. Dad understood that his equation was worthless without an explanation, so he spent endless hours studying Bessel equations and all sorts of obscure mathematics. He was sure that somehow, somewhere, there existed a solution to Maxwell's equations combined with general relativity that would explain the 6/π and confirm his resonant cavity. (He said the 10^-44 was just a matter of using the right units; I never understood that!)
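Plugging in the numbers is easy enough to check yourself. Here's a quick calculation using current CODATA values for the constants (the constants are the standard published ones, not Dad's figures):

```python
# Evaluate Dad's formula G = (6/pi) * 10^-44 * Z0 * c * (e/m)^2
# against the measured gravitational constant.
from math import pi

Z0 = 376.730313668      # impedance of free space, ohms
c  = 2.99792458e8       # speed of light, m/s
e  = 1.602176634e-19    # electron charge, C
m  = 9.1093837015e-31   # electron mass, kg

G_calc = (6 / pi) * 1e-44 * Z0 * c * (e / m) ** 2
G_meas = 6.67430e-11    # CODATA gravitational constant, m^3 kg^-1 s^-2

print(G_calc)                         # roughly 6.673e-11
print(abs(G_calc - G_meas) / G_meas)  # a few hundredths of a percent
```

The agreement really is that close, which is exactly what made the equation so tantalizing, and so maddening without a derivation behind it.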

The Nature of the Physical World
My dad was not alone in physics numerology. The fine structure constant, which is very close to 1/137, has attracted all sorts of numerological explanations. (Dick Lipton calls it a "miracle number".) Arthur Eddington, one of the most famous physicists of his time, had an explanation for why the fine structure constant should be exactly 1/137 involving the number of protons in the universe. A more modern numerological result is that of James Gilson, whose suggested value for the fine structure constant is off by only 30 parts per trillion.

While fundamental constant numerology has deservedly been on the fringes of science, new internet search technologies may soon change that. Last year, scientific publisher Springer introduced a beta service called LaTeX Search that allows researchers to search for LaTeX-formatted equations in all of Springer's journals. (LaTeX is a dialect of TeX most widely used for scientific publishing.) That's something you can't do with Google, or any other search engine. The ability to connect obscure mathematical discoveries from disparate fields of science could soon be facilitating new avenues of research, perhaps even new methodologies.

For example, I can search for a fragment of my dad's equation and get at least one result that seems relevant. I don't know of any meaningful discoveries that have been made so far with LaTeX search, but if my dad had been able to search all of the mathematical literature to connect his numerological result with a mathematical solution, perhaps he would have explained the gravitational constant and structure of the electron and would have won his Nobel Prize.

He would have been 83 today. Happy Birthday, Dad! We miss you.