Monday, January 26, 2015

Why GitHub is Important for Book Publishing

How do you organize large numbers of people for a common purpose? For millenia, the answer has been some sort of hierarchical organization. An army, or a feudal system topped with a king. To reach global scale, these hierarchies propagated customs and codes for behavior: laws, religions, ideology. Most of what you read in history books is really the history of these hierarchies. It wasn't possible to orchestrate big efforts or harness significant resources any other way.

In the 20th century, mass media redistributed much of this organizational power. In politics, charismatic individuals could motivate millions of people independently of the hierarchies that maintain command and control. But for the most part, one hierarchy got swapped for another. In business, production innovations such as Henry Ford's assembly line needed the hierarchy to support the capital investments.

I think the history of the 21st century will be the story of non-hierarchical systems of human organization enabled by the Internet. From this point of view, Wikipedia is particularly important not only for its organization of knowledge, but because it demonstrated that thousands of people can be organized with extremely small amounts of hierarchy. Anyone can contribute, anyone can edit, and many do. Bitcoin, or whatever cryptocurrency wins out, won't be successful because of a hierarchy but rather because of a framework of incentives for a self-interested network of entities to work together. Crowdfunding will enable resources to coalesce around needs without large hierarchical foundations or financial institutions.

So let's think a bit about book publishing. Through the 20th century, publishing required a signification amount of investment in capital- printing presses, warehouses, delivery trucks, bookstores, libraries, and people with specialized skills and abilities. A few large publishing companies emerged along with big-box retailers that together comprised an efficient machine for producing, distributing and monetizing books of all kinds. The transition from print to digital has eliminated need for the physical aspects of the book publishing machine, but the human components of that machine remain essential. It's no longer clear that the hierarchical organization of publishing is necessary for the organization of publishing's human effort.

I've already mentioned Wikipedia's conquest of encyclopedia publishing, by dint of its large scale and wide reach. But equally important to its success has been a set of codes and customs bound together in a suite of collaboration and workflow tools. Version tracking allows for easy reversion of edits. "Talk pages" and notifications facilitate communication and collaboration. (And edit-wars and page locking, but that's another bucket of fish.)

Most publishing projects have audiences that are too small or requirements too specific to support Wikipedia's anyone-can-edit-or-revert model of collaboration. A more appropriate model for collaboration in publishing  is one widely used for software development.

Modern software development requires people with different skills to work together. Book publishing is the same. Designers, engineers, testers, product managers, writers, and subject domain experts may each have an important role in creating a software application; authors, editors, proofreaders, illustrators, designers, subject experts, agents, and publicists may all work together on a book. Book publishing and software can be either open or proprietary. The team producing a book or a piece of software might number from one to a hundred. Books and programs can go into maintenance mode or be revised in new editions or versions. Translation into new languages happens for both. Assets from one project can be reused in other projects.

Open source software has been hugely successful over the past few decades. Along the way, an ecosystem of collaboration tools and practices has evolved to support both open source development and software development in general. Many aspects of this ecosystem have been captured in GitHub.

The "Git" in GitHub comes from git, an open source distributed version control system initially written by Linus Torvalds, the Linus behind Linux. It's fast, and it lets you work on a local code repository and then merge your changes with a repository stored somewhere else.

In just two sentences, I've touched on several concepts that may be foreign to many book publishing professionals. Microsoft Word's "track changes" is probably the closest that most authors get to a version control system. The big difference is that "track changes" is designed to facilitate collaboration between a maximum of two people. Git works easily with many contributors. A code "repository" holds more than just code, it can contain all the assets, documentation, and licenses associated with a project. And unlike "track changes", Git remembers the entire history of your project. Many book publishers still don't keep together all the assets that go into a book. And I'm guessing that publishers are still working on centralizing their asset stores instead of distributing them!

Git is just one of the useful aspects of GitHub. I think the workflow tools are perhaps more important. Developers talk about the workflow variants such as "git-flow" and "GitHub-flow", but the differences are immaterial to this discussion. Here's what it boils down to: Someone working on a project will first create a "feature branch", a copy of the repository that adds a feature or fixes a bug. When the new feature has been tested and is working, the changes will be "committed". Each set of changes is given an identifier and a message explaining what has been changed. The branch's developer then sends a "pull request" to the maintainers of the repository. A well crafted pull request will provide tests and documentation for the new feature. If the maintainers like the changes, they "pull" the changes into the main branch of the repository. Each of these steps is a push of a button on GitHub, and GitHub provides annotation, visualization and commenting tools that support discussions around each pull request, as well as issue lists and wiki pages.

The reason the workflow tools and the customs surrounding their use are so important is that anyone who has used them already knows how to participate in another project. For an excellent non-programming example, take a look at the free-programming-books repository, which is a basic list of programming books available online for free.  As of today, 512 different different people have contributed a total of 2,854 sets of changes the the repository, have expanded it to books in 23 languages, and have added free courses, screencasts and interactive tutorials. The maintainers enforce some basic standards and make sure that the list is free of pirated books and the like.

It's also interesting that there are 7,229 "forks" of free-programming-books. Each of these could be different. If the main free-programming-books repo disappears, or if the maintainers go AWOL, one of these forks could become the main fork. Or if one group of contributors want to move the project in a different direction from the maintainers, it's easy to do.

Forking a book is a lot more common than you might think. Consider the book Robinson Crusoe by Daniel Defoe. OCLC's WorldCat lists 7,459 editions of this book, each one representing significantly more effort than a button push in a workflow system. It's common to have many editions of out-of-copyright books of course, but it's also becoming common for books developed with open processes. As an example, look at the repository for Amy Brown and Greg Wilson's Architecture of Open Source Applications.  It has 5 contributors, and has been forked 58 times. For another example of using GitHub to write a book, read Scott Chacon's description of how he produced the second edition of Pro Git. (Are you surprised that a founder of GitHub is using GitHub to revise his book about Git?)

There's another aspect of modern software engineering with GitHub support that could be very useful for book publishing and distribution. "Continuous integration" is essential for development of complex software systems because changes in one component can have unintended effects on other components. For that reason, when a set of changes is committed to a project, the entire project needs to be rebuilt and retested. GitHub supports this via "hooks". For example, a "post-commit" hook can trigger a build-test apparatus; hooks can even be used to automatically deploy the new software version into production environments. In the making of a book, the insertion of a sentence might necessitate re-pagination and re-indexing. With continuous integration, you can imagine the correction of a typo immediately resulting in changes in all the copies of a textbook for sale. (or even the copies that had already been purchased!)

A number of startups have recognized the applicability of Git and GitHub to book publishing. Leanpub, GitBook, and Penflip are supporting GitHub backends for open publishing models; so far adoption has been most rapid in author communities that already "get" GitHub, for example, software developers. The company that is best able to teach a GitHub-like toolset to non-programmers will have a good and worthy business, I think.

As more people learn and exercise the collaboration culture of GitHub, new things will become possible. Last year, I became annoyed that I couldn't fix a problem I found with an ebook from Project Gutenberg. It seemed obvious to me that I should put my contributions into a GitHub repo so that others could easily make use of my work. I created a GitHub organization for "Project GitenHub". In the course of creating my third GitenHub book, I discovered that someone named Seth Woodward had done the same thing a year before me, and he had moved over a thousand Project Gutenberg texts onto GitHub, in the "GITenberg"  organization. Since I knew how to contribute to a GitHub project, I knew that I could start sending pull requests to GITenberg to add my changes to its repositories. And so Seth and I started working together on GITenberg.

Seth has now loaded over 50,000 books from Project Gutenberg onto GitHub. (The folks at Project Gutenberg are happy to see this happening, by the way.) Seth and I are planning out how to make improved quality ebooks and metadata for all of these books, which would be impossible without a way to get people to work together. We put in a funding proposal to the Knight Foundation's NewsChallenge competition. And we were excited to learn that (as of Jan 1, 2015) the Text Creation Partnership has added 25,000 texts from EEBO (Early English Books Online) on GitHub. So it's an exciting time for books on GitHub.

There's quite a bit of work to do. Having 50,000 repositories in an organization strains some GitHub tools. We need to figure out how to explain the GitHub workflow to potential contributors who aren't software developers. We need to  make bibliographic metadata more git-friendly. And we need to create a "continuous integration system" for building ebooks.

Who knows, it might work.

Update January 30: Our NewsChallenge proposal is being funded!!!

Wednesday, December 31, 2014

The Year Amazon Failed Calculus

In August, Amazon sent me a remarkable email containing a treatise on ebook pricing. I quote from it:
... e-books are highly price elastic. This means that when the price goes down, customers buy much more. We've quantified the price elasticity of e-books from repeated measurements across many titles. For every copy an e-book would sell at $14.99, it would sell 1.74 copies if priced at $9.99. So, for example, if customers would buy 100,000 copies of a particular e-book at $14.99, then customers would buy 174,000 copies of that same e-book at $9.99. Total revenue at $14.99 would be $1,499,000. Total revenue at $9.99 is $1,738,000. The important thing to note here is that the lower price is good for all parties involved: the customer is paying 33% less and the author is getting a royalty check 16% larger and being read by an audience that’s 74% larger. The pie is simply bigger.
As you probably know, I'm an engineer, so when I read that paragraph, my reaction was not to write an angry letter to Hachette or to Amazon - my reaction was to start a graph. And I have a third data point to add to the graph. At, I've been working on a rather different price point, $0.  Our "sales" rate is currently about 100,000 copies per year. Our total "sales" revenue for all these books adds up to zero dollars and zero cents. It's even less if you convert to bitcoin.

($0 is terrible for sales revenue, but it's a great price for ebooks that want to accomplish something other than generate sales revenue. Some books want more than anything to make the world a better place, and $0 can help them do that, which is why is trying so hard to support free ebooks.)

So here's my graph of the revenue curve combining "repeated and careful measurements" from Amazon and

I've added a fit to the simplest sensible algebraic equation possible that fits the data, Ax/(1+Bx2), which suggests that the optimum price point is $8.25. Below $8.25,  the increase in unit sales won't make up for the drop in price, and even if the price drops to zero, only twice as many books are sold as at $8.25 - the market for the book saturates.

But Amazon seems to have quit calculus after the first semester, because the real problem has a lot more variables that the one Amazon has solved for. This is because they've ignored the effect of changing a book's price on sales of ALL THE OTHER BOOKS. For you math nerds out there, Amazon has measured a partial derivative when the quantity of interest is the total derivative of revenue. Sales are higher at $10 than at $15 mostly because consumers perceive $15 as expensive for an ebook when most other ebooks are $10. So maybe your pie is bigger, but everyone else is stuck with pop-tarts.

While any individual publisher will find it advantageous to price their books slightly below the prevailing price, the advantage will go away when every publisher lowers its price.

Some price-sensitive readers will read more ebooks if the price is lowered. These are the readers who spend the most on ebooks and are the same readers who patronize libraries. Amazon wants the loyalty of customers so much that they introduced the Kindle Unlimited service. Yes, Amazon is trying to help its customers spend less on their reading obsessions. And yes, Amazon is doing their best to win these customers away from those awful libraries.

But I'm pretty sure that Jeff Bezos passed calculus. He was an EECS major at Princeton (I was, too). So maybe the calculation he's doing is a different one. Maybe his price optimization for ebooks is not maximizing publisher revenue, but rather Amazon's TOTAL revenue. Supposing someone spends less to feed their book habit, doesn't that mean they'll just spend it on something else? And where are they going to spend it? Maybe the best price for Amazon is the price that keeps the customer glued to their Kindle tablet, away from the library and away from the bookstore. The best price for Amazon might be a terrible price for a publisher that wants to sell books.

Read Shatzkin on Publishers vs. Amazon. Then read Hoffelder on the impact of Kindle Unlimited. The last Amazon article you should read this year is Benedict Evans on profits.

It's too late to buy Champagne on Amazon - this New Year's at least.

Sunday, December 7, 2014

Stop Making Web Surveillance Bugs by Mistake!

Since I've been writing about library websites that leak privacy, I figured it would be a good idea to do an audit of to make sure it wasn't leaking privacy in ways I wasn't aware of. I knew that some pages leak some privacy via referer headers to Google, to Twitter, and to Facebook, but we force HTTPS and make sure that user accounts can be pseudonyms. We try not to use any services that push ids for advertising networks. (Facebook "Like" button, I'm looking at you!)

I've worried about using static assets loaded from third party sites. For example, we load jQuery from (it's likely to be cached, and should load faster) and Font Awesome from (ditto). I've verified that these services don't set any cookies and allow caching, which makes it unlikely that they could be used for surveillance of users.

It turned out that my worst privacy leakage was to Creative Commons! I'd been using the button images for the various licenses served from I was surprised to see that id cookies were being sent in the request for these images.
In theory, the folks at Creative Commons could track the usage for any CC-licensed resource that loaded button images from Creative Commons! And it could have been worse. If I had used the HTTP version of the images, anyone in the network between me and Creative Commons would be able to track what I was reading!

Now, to be clear, Creative Commons is NOT tracking anyone. The reason my browser is sending id cookies along with button image requests is that the Creative Commons website uses Google Analytics, and Google Analytics sets a domain-wide id cookie. Google Analytics doesn't see any of this traffic- it doesn't have access to server logs. But without anyone intending it, the combination of Creative Commons, Google Analytics, and websites like mine that want to promote use of Creative Commons have conspired to build a network of web surveillance bugs BY MISTAKE.

When I inquired about this to Creative Commons, I found out they were way ahead of the issue. They've put in redirects to HTTPS version of their button images. This doesn't plug any privacy leakage, but it discourages people from using the privacy spewing HTTP versions. In addition, they'd already started to process of moving static assets like button images to a special-purpose domain. The use of this domain,, will ensure that id cookies aren't sent and nobody could use them for surveillance.

If you care about user privacy and you have a website, here's what you should do:
  1. Avoid loading images and other assets from 3rd party sites. consider self-hosting these.
  2. When you use 3rd party hosted assets, use HTTPS references only!
  3. Avoid loading static assets from domains that use Google Analytics and set id domain cookies.
For Creative Common license buttons, use the buttons from If you use the Creative Commons license chooser, replace "" in the code it makes for you with "". This will help the web respect user privacy. The buttons will also load faster, because the "" requests will get redirected there anyway.

Saturday, November 22, 2014

NJ Gov. Christie Vetoes Reader Privacy Act, Asks for Stronger, Narrower Law

According to New Jersey Governor Chris Christie's conditional veto statement, "Citizens of this State should be permitted to read what they choose without unnecessary government intrusion." It's hard to argue with that! Personally, I think we should also be permitted to read what we choose without corporate surveillance.

As previously reported in The Digital Reader, the bill passed in September by wide margins in both houses of the New Jersey State Legislature and would have codified the right to read ebooks without letting the government and everybody else knowing about it.

I wrote about some problems I saw with the bill. Based on a California law focused on law enforcement, the proposed NJ law added civil penalties on booksellers who disclosed the personal information of users without a court order. As I understood it, the bill could have prevented online booksellers from participating in ad networks (they all do!).

Governor Christie's veto statement pointed out more problems. The proposed law didn't explicitly prevent the government from asking for personal reading data, it just made it against the law for a bookseller to comply. So, for example, a local sheriff could still ask Amazon for a list of people in his town reading an incriminating book. If Amazon answered, somehow the reader would have to:
  1. find out that Amazon had provided the information
  2. sue Amazon for $500.
Another problem identified by Christie was that the proposed law imposed privacy burdens on booksellers stronger than those on libraries. Under another law, library records in New Jersey are subject to subpoena, but bookseller records wouldn't be. That's just bizarre.

In New Jersey, a governor can issue a "Conditional Veto". In doing so, the governor outlines changes in a bill that would allow it to become law. Christie's revisions to the Reader Privacy Act make the following changes:
  1. The civil penalties are stripped out of the bill. This allows Gov. Christie to position himself and NJ as "business-friendly".
  2. A requirement is added preventing the government from asking for reader information without a court order or subpoena. Christie gets to be on the side of liberty. Yay!
  3. It's made clear that the law applies only to government snooping, and not to promiscuous data sharing with ad networks. Christie avoids the ire of rich ad network moguls.
  4. Child porn is carved out of the definition of "books". Being tough on child pornography is one of those politically courageous positions that all politicians love.
The resulting bill, which was quickly reintroduced in the State Assembly, is stronger but narrower. It wouldn't apply in situations like the recent Adobe Digital Editions privacy breach, but it should be more effective at stopping "unnecessary government intrusion". I expect it will quickly pass the Legislature and be signed into law. A law that properly addresses the surveillance of ebook reading by private companies will be much more complicated and difficult to achieve.

I'm not a fan of his by any means, but Chris Christie's version of the Reader Privacy Act is a solid step in the right direction and would be an excellent model for other states. We could use a law like it on the national level as well.

(Guest posted at The Digital Reader)

Wednesday, November 5, 2014

If your website still uses HTTP, the X-UIDH header has turned you into a snitch

Does your website still use HTTP? It not, you're a snitch.

As I talk to people about privacy, I've found a lot of misunderstanding. HTTPS applies encryption to the communication channel between you and the website you're looking at. It's an absolute necessity when someone's making a password or sending a credit card number, but the modern web environment has also made it important for any communication that expects privacy.

HTTP is like sending messages on a postcard. Anyone handling the message can read the whole message. Even worse, they can change the message if they want. HTTPS is like sending the message in a sealed envelope. The messengers can read the address, but they can't read or change the contents.

It used to be that network providers didn't read your web browsing traffic or insert content into it, but now they do so routinely. This week we learned that Verizon and AT&T were inserting an "X-UIDH" header into your mobile phone web traffic. So for example, if a teen was browsing a library catalog for books on "pregnancy" using a mobile phone, Verizon's advertising partners could, in theory, deliver advertising for maternity products.

The only way to stop this header insertion is for websites to use HTTPS. So do it. Or you're a snitch.

Sorry, doesn't support HTTPS. So if you mysteriously get ads for snitch-related products, or if the phrase "Verizon and AT&T" is not equal to "V*erizo*n and A*T*&T" without the asterisks, blame me and blame Google.

Here's more on the X-UIDH header.