Monday, January 26, 2015

Why GitHub is Important for Book Publishing

How do you organize large numbers of people for a common purpose? For millennia, the answer has been some sort of hierarchical organization. An army, or a feudal system topped with a king. To reach global scale, these hierarchies propagated customs and codes for behavior: laws, religions, ideology. Most of what you read in history books is really the history of these hierarchies. It wasn't possible to orchestrate big efforts or harness significant resources any other way.

In the 20th century, mass media redistributed much of this organizational power. In politics, charismatic individuals could motivate millions of people independently of the hierarchies that maintain command and control. But for the most part, one hierarchy got swapped for another. In business, production innovations such as Henry Ford's assembly line needed the hierarchy to support the capital investments.

I think the history of the 21st century will be the story of non-hierarchical systems of human organization enabled by the Internet. From this point of view, Wikipedia is particularly important not only for its organization of knowledge, but because it demonstrated that thousands of people can be organized with extremely small amounts of hierarchy. Anyone can contribute, anyone can edit, and many do. Bitcoin, or whatever cryptocurrency wins out, won't be successful because of a hierarchy but rather because of a framework of incentives for a self-interested network of entities to work together. Crowdfunding will enable resources to coalesce around needs without large hierarchical foundations or financial institutions.

So let's think a bit about book publishing. Through the 20th century, publishing required a significant amount of investment in capital: printing presses, warehouses, delivery trucks, bookstores, libraries, and people with specialized skills and abilities. A few large publishing companies emerged along with big-box retailers that together comprised an efficient machine for producing, distributing and monetizing books of all kinds. The transition from print to digital has eliminated the need for the physical aspects of the book publishing machine, but the human components of that machine remain essential. It's no longer clear that the hierarchical organization of publishing is necessary for the organization of publishing's human effort.

I've already mentioned Wikipedia's conquest of encyclopedia publishing, by dint of its large scale and wide reach. But equally important to its success has been a set of codes and customs bound together in a suite of collaboration and workflow tools. Version tracking allows for easy reversion of edits. "Talk pages" and notifications facilitate communication and collaboration. (And edit-wars and page locking, but that's another bucket of fish.)

Most publishing projects have audiences that are too small or requirements too specific to support Wikipedia's anyone-can-edit-or-revert model of collaboration. A more appropriate model for collaboration in publishing is one widely used for software development.

Modern software development requires people with different skills to work together. Book publishing is the same. Designers, engineers, testers, product managers, writers, and subject domain experts may each have an important role in creating a software application; authors, editors, proofreaders, illustrators, designers, subject experts, agents, and publicists may all work together on a book. Book publishing and software can be either open or proprietary. The team producing a book or a piece of software might number from one to a hundred. Books and programs can go into maintenance mode or be revised in new editions or versions. Translation into new languages happens for both. Assets from one project can be reused in other projects.

Open source software has been hugely successful over the past few decades. Along the way, an ecosystem of collaboration tools and practices has evolved to support both open source development and software development in general. Many aspects of this ecosystem have been captured in GitHub.

The "Git" in GitHub comes from git, an open source distributed version control system initially written by Linus Torvalds, the Linus behind Linux. It's fast, and it lets you work on a local code repository and then merge your changes with a repository stored somewhere else.

In just two sentences, I've touched on several concepts that may be foreign to many book publishing professionals. Microsoft Word's "track changes" is probably the closest that most authors get to a version control system. The big difference is that "track changes" is designed to facilitate collaboration between a maximum of two people. Git works easily with many contributors. A code "repository" holds more than just code; it can contain all the assets, documentation, and licenses associated with a project. And unlike "track changes", Git remembers the entire history of your project. Many book publishers still don't keep together all the assets that go into a book. And I'm guessing that publishers are still working on centralizing their asset stores instead of distributing them!

Git is just one of the useful aspects of GitHub. I think the workflow tools are perhaps more important. Developers talk about workflow variants such as "git-flow" and "GitHub-flow", but the differences are immaterial to this discussion. Here's what it boils down to: Someone working on a project will first create a "feature branch", a copy of the repository that adds a feature or fixes a bug. When the new feature has been tested and is working, the changes will be "committed". Each set of changes is given an identifier and a message explaining what has been changed. The branch's developer then sends a "pull request" to the maintainers of the repository. A well-crafted pull request will provide tests and documentation for the new feature. If the maintainers like the changes, they "pull" the changes into the main branch of the repository. Each of these steps is a push of a button on GitHub, and GitHub provides annotation, visualization and commenting tools that support discussions around each pull request, as well as issue lists and wiki pages.
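For readers who want to see what sits behind those buttons, here is a minimal sketch of the same round trip as plain Git commands, generated from Python. It assumes a repository that has already been cloned from a remote named "origin"; the function names, branch name, and commit message are hypothetical, made up for illustration. The pull request itself is still opened on GitHub once the branch has been pushed.

```python
import shlex
import subprocess


def feature_branch_commands(branch, message, remote="origin"):
    """Return the Git commands for one feature-branch round trip, in order."""
    return [
        # Create the feature branch and switch to it.
        ["git", "checkout", "-b", branch],
        # Stage every change made in the working tree.
        ["git", "add", "--all"],
        # Record the staged changes with an explanatory message.
        ["git", "commit", "-m", message],
        # Publish the branch so a pull request can be opened from it.
        ["git", "push", "--set-upstream", remote, branch],
    ]


def run_all(commands, cwd="."):
    """Run each command inside the repository at cwd, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    # Hypothetical branch name and message; printing rather than running
    # keeps the example safe to try outside a real repository.
    for argv in feature_branch_commands("fix-chapter-3-typos", "Fix typos in chapter 3"):
        print(shlex.join(argv))
```

Calling run_all on the returned list inside a real working copy would execute the steps; the edits to the book's files happen between the first command and the second.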

The reason the workflow tools and the customs surrounding their use are so important is that anyone who has used them already knows how to participate in another project. For an excellent non-programming example, take a look at the free-programming-books repository, which is a basic list of programming books available online for free. As of today, 512 different people have contributed a total of 2,854 sets of changes to the repository, have expanded it to books in 23 languages, and have added free courses, screencasts and interactive tutorials. The maintainers enforce some basic standards and make sure that the list is free of pirated books and the like.

It's also interesting that there are 7,229 "forks" of free-programming-books. Each of these could be different. If the main free-programming-books repo disappears, or if the maintainers go AWOL, one of these forks could become the main fork. Or if one group of contributors wants to move the project in a different direction from the maintainers, it's easy to do.

Forking a book is a lot more common than you might think. Consider the book Robinson Crusoe by Daniel Defoe. OCLC's WorldCat lists 7,459 editions of this book, each one representing significantly more effort than a button push in a workflow system. It's common to have many editions of out-of-copyright books of course, but it's also becoming common for books developed with open processes. As an example, look at the repository for Amy Brown and Greg Wilson's Architecture of Open Source Applications. It has 5 contributors, and has been forked 58 times. For another example of using GitHub to write a book, read Scott Chacon's description of how he produced the second edition of Pro Git. (Are you surprised that a founder of GitHub is using GitHub to revise his book about Git?)

There's another aspect of modern software engineering with GitHub support that could be very useful for book publishing and distribution. "Continuous integration" is essential for development of complex software systems because changes in one component can have unintended effects on other components. For that reason, when a set of changes is committed to a project, the entire project needs to be rebuilt and retested. GitHub supports this via "hooks". For example, a hook that fires when commits are pushed to the repository can trigger a build-test apparatus; hooks can even be used to automatically deploy the new software version into production environments. In the making of a book, the insertion of a sentence might necessitate re-pagination and re-indexing. With continuous integration, you can imagine the correction of a typo immediately resulting in changes in all the copies of a textbook for sale. (Or even the copies that had already been purchased!)
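To make the hook idea concrete, here is a rough sketch of the receiving end of a GitHub webhook that decides whether a book should be rebuilt. It relies on three things from GitHub's current webhook documentation: the event name arrives in the X-GitHub-Event header, the delivery is signed with HMAC-SHA256 in the X-Hub-Signature-256 header, and a "push" payload carries a "ref" field naming the branch. Everything else is an assumption for illustration: the function names, the choice of "master" as the publishing branch, and the build callable standing in for whatever actually regenerates the ebook.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check the HMAC-SHA256 signature GitHub attaches to a webhook delivery."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected.encode(), header.encode())


def should_rebuild(event: str, body: bytes, branch: str = "master") -> bool:
    """Rebuild only when commits were pushed to the branch the book is published from."""
    if event != "push":
        return False
    payload = json.loads(body)
    return payload.get("ref") == "refs/heads/" + branch


def handle_delivery(secret, event, signature, body, build):
    """Verify one delivery and call build() if it warrants a rebuild.

    build is a hypothetical callable that regenerates the ebook files.
    """
    if not signature_is_valid(secret, body, signature):
        return "rejected"
    if should_rebuild(event, body):
        build()
        return "rebuilt"
    return "ignored"
```

A small web server would pass the raw request body and the two header values to handle_delivery; keeping the decision logic free of any web framework makes it easy to test on its own.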

A number of startups have recognized the applicability of Git and GitHub to book publishing. Leanpub, GitBook, and Penflip are supporting GitHub backends for open publishing models; so far adoption has been most rapid in author communities that already "get" GitHub, for example, software developers. The company that is best able to teach a GitHub-like toolset to non-programmers will have a good and worthy business, I think.

As more people learn and exercise the collaboration culture of GitHub, new things will become possible. Last year, I became annoyed that I couldn't fix a problem I found with an ebook from Project Gutenberg. It seemed obvious to me that I should put my contributions into a GitHub repo so that others could easily make use of my work. I created a GitHub organization for "Project GitenHub". In the course of creating my third GitenHub book, I discovered that someone named Seth Woodward had done the same thing a year before me, and he had moved over a thousand Project Gutenberg texts onto GitHub, in the "GITenberg" organization. Since I knew how to contribute to a GitHub project, I knew that I could start sending pull requests to GITenberg to add my changes to its repositories. And so Seth and I started working together on GITenberg.

Seth has now loaded over 50,000 books from Project Gutenberg onto GitHub. (The folks at Project Gutenberg are happy to see this happening, by the way.) Seth and I are planning out how to make improved quality ebooks and metadata for all of these books, which would be impossible without a way to get people to work together. We put in a funding proposal to the Knight Foundation's NewsChallenge competition. And we were excited to learn that (as of Jan 1, 2015) the Text Creation Partnership has added 25,000 texts from EEBO (Early English Books Online) to GitHub. So it's an exciting time for books on GitHub.

There's quite a bit of work to do. Having 50,000 repositories in an organization strains some GitHub tools. We need to figure out how to explain the GitHub workflow to potential contributors who aren't software developers. We need to make bibliographic metadata more git-friendly. And we need to create a "continuous integration system" for building ebooks.

Who knows, it might work.

Update January 30: Our NewsChallenge proposal is being funded!!!


  1. "We need to figure out how to explain the GitHub workflow to potential contributors who aren't software developers"

    This. Please post an update when you do.

    1. Maybe a newbie sandbox with directions would be useful?

    2. Exactly.

      Any materials that framed GitHub - and version control in general - in terms not associated with developers would be a major step forward.

      I agree that a bridge is needed; something to act as a "GitHub-like toolset for non-programmers". When you start talking about repos and forks, you just lose people - not that it couldn't be taught. It would just be much easier to use language that publishers (of any kind - not just books) would understand.

      And the API is there...