Wednesday, October 29, 2014

GITenberg: Modern Maintenance Infrastructure for Our Literary Heritage

One day back in March, the Project Gutenberg website thought I was a robot and stopped letting me download ebooks. Frustrated, I resolved to put some Project Gutenberg ebooks into GitHub, where I could let other people fix problems in the files. I decided to call this effort "Project Gitenhub". On my second or third book, I found that Seth Woodworth had had the same idea a year earlier, and had already moved about a thousand ebooks into GitHub. That project was named "GITenberg". So I joined his email list and started submitting pull requests for PG ebooks that I was improving.

Recently, we've joined forces to submit a proposal to the Knight Foundation's News Challenge, whose theme is "How might we leverage libraries as a platform to build more knowledgeable communities? ". Here are some excerpts:
Abstract 
Project Gutenberg (PG) offers 45,000 public domain ebooks, yet few libraries use this collection to serve their communities. Text quality varies greatly, metadata is all over the map, and it's difficult for users to contribute improvements. 
We propose to use workflow and software tools developed and proven for open source software development- GitHub- to open up the PG corpus to maintenance and use by libraries and librarians. 
The result- GITenberg- will include MARC records, covers, OPDS feeds and ebook files to facilitate library use. Version-controlled fork and merge workflow, combined with a change triggered back-end build environment will allow scaleable, distributed maintenance of the greatest works of our literary heritage.  
Description 
Libraries need metadata records in MARC format, but in addition they need to be able to select from the corpus those works which are most relevant to their communities. They need covers to integrate the records with their catalogs, and they need a level of quality assurance so as not to disappoint patrons. Because this sort of metadata is not readily available, most libraries do not include PG records in their catalogs, resulting in unnecessary disappointment when, for example, a patron want to read Moby Dick from the library on their Kindle. 
Progress 
43,000 books and their metadata have been moved to the git version control software, this will enable librarians to collaboratively edit and control the metadata. The GITenberg website, mailing list and software repository has been launched at https://gitenberg.github.io/ . Software for generating MARC records and OPDS feeds have already been written.
Background 
Modern software development teams use version control, continuous integration, and workflow management systems to coordinate their work. When applied to open-source software, these tools allow diverse teams from around the world to collaboratively maintain even the most sprawling projects. Anyone wanting to fix a bug or make a change first forks the software repository, makes the change, and then makes a "pull request". A best practice is to submit the pull request with a test case verifying the bug fix. A developer charged with maintaining the repository can then review the pull request and accept or reject the change. Often, there is discussion asking for clarification. Occasionally versions remain forked and diverge from each other. GitHub has become the most popular sites for this type software repository because of its well developed workflow tools and integration hooks. 
The leaders of this team recognized the possibility to use GitHub for the maintenance of ebooks, and we began the process of migrating the most important corpus of public domain ebooks, Project Gutenberg, onto GitHub, thus the name GITenberg. Project Gutenberg has grown over the years to 50,000 ebooks, audiobooks, and related media, including all the most important public domain works of English language literature. Despite the great value of this collection, few libraries have made good use of this resource to serve their communities. There are a number of reasons why. The quality of the ebooks and the metadata around the ebooks is quite varied. MARC records, which libraries use to feed their catalog systems, are available for only a subset of the PG collection. Cover images and other catalog enrichment assets are not part of PG. 
To make the entire PG corpus available via local libraries, massive collaboration amoung librarians and ebook develeopers is essential. We propose to build integration tools around github that will enable this sort of collaboration to occur. 
  1. Although the PG corpus has been loaded into GITenberg, we need to build a backend that automatically converts the version-controlled source text into well-structured ebooks. We expect to define a flavor of MarkDown or Asciidoc which will enable this automatic, change-triggered building of ebook files (EPUB, MOBI, PDF). (MarkDown is a human-readable plain text format used on GitHub for documentation; MarkDown for ebooks is being developed independently by several team of developers. Asciidoc is a similar format that works nicely for ebooks.) 
  2. Similarly, we will need to build a parallel backend server that will produce MARC and XML formatted records from version-controlled plain-text metadata files.
  3. We will generate covers for the ebooks using a tool recently developed by NYPL and include them in the repository.
  4. We will build a selection tool to help libraries select the records best suited to their libraries.
  5. Using a set of "cleaned up" MARC records from NYPL, and adding custom cataloguing, we will seed the metadata collection with ~1000 high quality metadata records.
  6. We will provide a browsable OPDS feed for use in tablet and smartphone ebook readers.
  7. We expect that the toolchain we develop will be reusable for creation and maintenance of a new generation of freely licensed ebooks.

The rest of the proposal is on the Knight News Challenge website. If you like the idea of GITenberg, you can "applaud" it there. The "applause' is not used in the judging of the proposals, but it makes us feel good. There are lots of other interesting and inspiring proposals to check out and applaud, so go take a look!

Wednesday, October 15, 2014

Adobe, Privacy and the Big Yellow Taxi

Here's the most important thing to understand about privacy on the Internet: Google doesn't know your password. The FBI can't march into Sergey Brin's office and threaten to put him in jail unless he tells them your password (if it thinks you're making WMD's). Because it wouldn't do them any good. If Google could produce your password, it would be a sign either of gross incompetance or the ill-considered choice of your cat's name, "mittens" as your password.

Because Google's engineers are at least moderately competent, they don't store your password anywhere.  Instead, they salt it and hash it. The next time they ask you for your password, they salt it and hash it again and see if the result is the same as the hash they've saved. It would be easier for Jimmy Dean to make a pig from a sausage than it would be to get the password from its hash. And that's how the privacy of your password is constructed.

Using similar techniques, Apple is able to build strong privacy into the latest version of iOS, and despite short-sighted espio-nostalgia from the likes of James Comey,  strong privacy is both essential and achievable for many types of data. I would include reading data in that category. Comey's arguments could easily apply to ebook reading data. After all, libraries have books on explosives, radical ideologies, and civil disobediance. But that doesn't mean that our reading lists should be available to the FBI and the NSA.

Here's the real tragedy: "we take your privacy seriously" has become a punch line. Companies that take care to construct privacy using the tools of modern software engineering and strong encryption aren't taken seriously. The language of privacy has been perverted by lawyers who "take privacy seriously" by crafting privacy policies that allow their clients to do pretty much anything with your data.

CC BY bevgoodin
Which brings me the the second most important thing to understand about privacy on the Internet. Don't it always seem to go that you don't know what you've got till it's gone? (I call this the Big Yellow Taxi principle)

Think about it. The only way you know if a website is being careless with your password is if it gets stolen, or they send it to you in an email. If any website sends you your password by email, make sure that website has no sensitive information of yours because it's being run by incompetents. Then make sure you're not using that password anywhere else and if you are, change it.

Failing gross incompetence, it's very difficult for us to know if a website or a piece of software has carefully constructed privacy, or whether it's piping everything you do to a server in Kansas. Last week's revelations about Adobe Digital Editions (ADE4) were an example of such gross incompetence, and yes, ADE4 tries to send a message to a server in Kansas every time you turn an ebook page. Much outrage has been directed at Adobe over the fact that the messages were being sent in the clear. Somehow people are less upset at the real outrage: the complete absence of privacy engineering in the messages being sent.

The response of Adobe's PR flacks to the brouhaha is so profoundly sad. They're promising to release a software patch that will make their spying more secret.

Now I'm going to confuse you. By all accounts, Adobe's DRM infrastructure (called ACS) is actually very well engineered to protect a user's privacy. It provides for features such as anonymous activation and delegated authentication so that, for example, you can borrow an ACS-protected library ebook through Overdrive without Adobe having any possibility of knowing who you are. Because the privacy has been engineered into the system, when you borrow a library ebook, you don't have to trust that Adobe is benevolently concerned for your privacy.

Yesterday, I talked with Micah Bowers, CEO of Bluefire, a small company doing a nice (and important) niche business in the Adobe rights management ecosystem. They make the Bluefire Reader App, which they license to other companies who rebrand it and use it for their own bookstores. He is confident that the Adobe ACS infrastructure they use is not implicated at all by the recently revealed privacy breeches. I had reached out to Bowers because I wanted to confirm that ebook sync systems could be built without giving away user privacy. I had speculated that the reason Adobe Digital Editions was phoning home with user reading data was part of an unfinished ebook sync system. "Unfinished" because ADE4 doesn't do any syncing. It's also possible that reading data is being sent to enable business models similar to Amazon's "Kindle Unlimited", which pays authors when a reader has read a defined fraction of the book.

For Bluefire ( and the "white label" apps based on Bluefire), ebook syncing is a feature that works NOW. If you read through chapter 5 of a book on your iPhone, the Bluefire Reader on your iPad will know. Bluefire users have to opt in to this syncing and can turn it off with a single button push, even after they've opted in. But even if they've opted in, Bluefire doesn't know what books they're reading. If the FBI wants a list of people reading a particular book, Bluefire probably doesn't have the ability to say who's reading the books. Of course, the sync data is encrypted when transmitted and stored. They've engineered their system to preserve privacy, the same way Google doesn't know your password, and Apple can't decrypt your iphone data. Maybe the FBI and the NSA can get past their engineering, but maybe they can't, and maybe it would be too much trouble.

To some extent, you have to trust what Bluefire says, but I asked Bowers some pointed questions about ways to evade their privacy cloaking, and it was clear to me from his answers that his team had considered these attacks.  Bluefire doesn't send or receive any reading data to or from Adobe.

For now, Bluefire and other ebook reading apps that use Adobe's ACS (including Aldiko, Nook, Apps from Overdrive and 3M) are not affected by the ADE privacy breech. I'm convinced from talking to Bowers that the Bluefire sync system is engineered to keep reading private. But the Big Yellow Taxi principle applies to all of these. It's very hard for consumers to tell a well engineered system from a shoddy hack until there's been a breach and then it's too late.

Perhaps this is where the library community needs to forcefully step in. Privacy audits and 3rd party code review should be required for any application or website that purports to "Take privacy seriously" when library records privacy laws are in play.

Or we could pave over the libraries and put up some parking lots.

Thursday, October 9, 2014

Correcting Misinformation on the Adobe Privacy Gusher

We've learned quite a lot about Adobe Digital Editions version 4 (ADE4) since Nate Hoffelder broke the story that "Adobe is Spying on Users, Collecting Data on Their eBook Libraries". Unfortunately, there's also been a some bad information that's been generated along with the furor.

One thing that's clear is that Adobe Digital Editions version 4 is not well designed. It's also clear that our worst fears about ADE4 - that it was hunting down ALL the ebooks on a user's computer on installation and reporting them to Adobe - are not true. I've been participating with Nate and some techy people in the library and ebook world (including Galen, Liza, and Andromeda) to figure out what's really going on. It's looking more like an incompetently-designed, half-finished synchronization system than a spy tool. Follow Nate's blog to get the latest.

So, some misconceptions.
  1. The data being sent by ADE4 is NOT needed for digital rights management. We know this because the DRM still works if the Adobe "log" site is blocked. Also, we know enough about the Adobe DRM to know it's not THAT stupid.
  2. The data being sent by ADE4 is NOT part of the normal functioning of a well designed ebook reader. ADE4 is sending more than it needs to send even to accomplish synchronization. Also, ADE4 isn't really cloud-synchronizing devices, the way BlueFire is doing (well!).
On the legal side:
  1. The ADE4 privacy policy is NOT a magic incantation that makes everything it does legal. For example, all 50 states have privacy laws that cover library records. When ADE4 is used for library ebooks, the fact that it broadcasts a user's reading behavior makes it legally suspect. Even if the stream were encrypted, it's not clear that it would be legal.
  2. The NJ Reader Privacy Act is NOT an issue...yet. There's been no indication that it's been signed into law. If signed into law, and upheld, and found to apply, then Adobe would owe a lot of people in NJ $500.
  3. The California Reader Privacy Act is NOT relevant (as far as I can tell) because it's designed to protect against legal discovery and there's not been any legal process. However, California has a library privacy law.
  4. Europe might have more to say.
The bottom line for now is that ADE4 does not so much spy on you as it stumbles around in your closet and sometimes tells Adobe what it finds there. In a loud voice so everyone around can hear. And that's not a nice thing to do.

Update 1PM: Galen Charlton of Equinox Software has now reproduced the scanning behavior reported by Nate Hoffelder. This is important because there was always the possibility that Nate, whose reporting on ereaders has him trying out a lot of stuff, had some strange and unique system configuration.

Thursday, October 2, 2014

The Perfect Bookstore Loses to Amazon

My book industry friends are always going on and on about "the book discovery problem". Last month, a bunch of us, convened by Chris Kubica, sat in a room in Manhattan's Meatpacking district and plotted out how to make the perfect online bookstore. "The discovery problem" occupied a big part of the discussion. Last year, Perseus Books gathered a smattering of New York's nerdiest at "the first Publishing Hackathon". The theme of the event, the "killer problem": "book discovery". Not to be outdone, HarperCollins sponsored a "BookSmash Challenge" to find "new ways of reading and discovering books".  

Here's the typical framing of "the book discovery problem". "When I go to a bookstore, I frequently leave with all sort of books I never meant to get. I see a book on the table, pick it up and start reading, and I end up buying it. But that sort of serendipitous discovery doesn't happen at Amazon. How do we recreate that experience?" Or "There are so many books published, how do we match readers to the books they'd like best?"

This "problem" has always seemed a bit bogus to me. First of all, when I'm on the internet, I'm constantly running across interesting sounding books. There are usually links pointing me at Amazon, and occasionally I'll buy the book.

As a consumer, I don't find I have a problem with book discovery. I'm not compulsive about finding new books; I'm compulsive about finishing the book I've read half of. When I finish a book, it's usually two in the morning and I really want to get to sleep. I have big stacks both real and virtual of books on my to-read list.

Finally, the "discovery problem" is a tractable one from a tech point of view. Throw a lot of data and some machine learning at the problem and a good-enough solution should emerge. (I should note here that  book "discovery" on the website I run, unglue.it, is terrible at the moment, but pretty soon it will be much better.)

So why on earth does Amazon, which throws huge amounts of money into capital investment, do such a middling job of book discovery?

Recently the obvious answer hit me in the face, as such answers are wont to do. The answer is that mediocre discovery creates a powerful business advantage for Amazon!

Consider the two most important discovery tools used on the Amazon website:
  1. People who bought X also bought y.
  2. Top seller lists.
Both of these methods have the property that the way to make these work for your book is for your book to sell a lot on Amazon. That means that any author or publisher that wants to sell a lot of books on Amazon will try to steer as many fans as possible to Amazon. More sales means more recommendations, which means more sales, and so on. Amazon is such a dominant bookseller that a bookstore could have the dreamiest features and pay the publisher a larger share of the retail selling price and still have the publisher try to push people to Amazon.

What happens in this sort of positive feedback system is pretty obvious to an electrical engineer like me, but Wikipedia's example of a cattle stampede makes a better picture.

The number of cattle running is proportional to the overall level of panic, which is proportional to...the number of cattle running! "Stampede loop" by Trevithj. CC BY-SA
Result: Stampede! Yeah, OK, these are sheep. But you get the point. "Herdwick Stampede" by Andy Docker. CC BY.
Imagine what would happen if Amazon shifted from sales-based recommendations to some sort of magic that matched a customer with the perfect book. Then instead of focusing effort on steering readers to Amazon, authors and publishers would concentrate on creating the perfect book. The rich would stop getting richer, and instead, reward would find the deserving.

Ain't never gonna happen. Stampedes sell more books.