Wednesday, April 3, 2019

Fudge, and open access ebook download statistics

If you found out that the top 50 authors born in Gloucestershire, England average over 10 million copies sold, you might think that those authors are doing pretty well. But it's silly to compute averages like that. When you compute an average over a population, you're making an assumption that the quantity you're averaging over is statistically distributed somehow over the population. Unless of course you don't care if the average means anything, and you just want numbers to help justify an agenda.

Most folks would look at the list of Gloucestershire authors and say that one of the authors is an outlier, not representative of Gloucestershire authors in general. And so J.K. Rowling, with her 500+ million copies sold, would get removed from the data set, revealing the presumably unimpressive book selling record of the "more representative" authors. Scientists refer to this process as "fudging the data". It's done all the time, but it's not honest.

There's a better way. If a scientific study presents averages across a population, it should also report statistical measures such as variance and standard deviation, so the audience can judge how meaningful the reported averages are (or aren't!).
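
A quick sketch in Python makes the point concrete. The sales figures here are made up purely for illustration (49 authors at 20,000 copies each, plus one at 500 million); they are not real data for any author.

```python
import statistics

# Hypothetical figures for illustration only, not real sales data:
# 49 authors with modest sales plus one author at 500 million copies.
sales = [20_000] * 49 + [500_000_000]

mean = statistics.mean(sales)
stdev = statistics.pstdev(sales)   # population standard deviation
median = statistics.median(sales)

print(f"mean   = {mean:,.0f}")
print(f"stdev  = {stdev:,.0f}")
print(f"median = {median:,.0f}")
```

With numbers like these, the standard deviation comes out several times larger than the mean, which is exactly the signal that the "average author" described by the mean doesn't exist.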

Other times, the existence of "outliers" is evidence that the numbers are better measured and compared on a different scale. Often, that's a logarithmic scale. For example, noise is measured on a logarithmic scale, in units of decibels. An ambulance siren has a million times the noise power of normal conversation, but it's easier to make sense of that number if we compare the 60 dB sound volume of conversation to the 90 dB of a hair dryer, the 120 dB of the siren and the 140 dB of a jet engine. Similarly, we can understand that while J.K. Rowling's sales run into 9 figures, most top Gloucestershire-born authors are probably 3, 4 or maybe 5 figure sellers.
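
The decibel arithmetic above can be checked in a few lines. This is a minimal sketch using the standard definition of a decibel difference as ten times the base-10 logarithm of a power ratio; the function names are my own.

```python
import math

def power_ratio(db_a: float, db_b: float) -> float:
    """Ratio of sound power between two levels given in decibels."""
    return 10 ** ((db_a - db_b) / 10)

def decibel_difference(ratio: float) -> float:
    """Inverse of power_ratio: decibels corresponding to a power ratio."""
    return 10 * math.log10(ratio)

siren_vs_conversation = power_ratio(120, 60)      # siren vs. conversation
hair_dryer_vs_conversation = power_ratio(90, 60)  # hair dryer vs. conversation

print(f"siren / conversation:      {siren_vs_conversation:,.0f}x")
print(f"hair dryer / conversation: {hair_dryer_vs_conversation:,.0f}x")
```

A 60 dB gap is a factor of a million in power, while each 30 dB step is a factor of a thousand.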

Over the weekend, I released a "preprint" on Humanities Commons, describing my analysis of open-access ebook usage data. I worked with a wonderful team including two open-access publishers, University of Michigan Press and Open Book Publishers, on this project, which was funded by the Mellon Foundation. To boil down my analysis to two pithy points, the preprint argues:

  1. Free ebook downloads are best measured on a logarithmic scale, like earthquakes and trade publishing sales.
  2. We shouldn't average download counts.

If you take the logarithm of book downloads, the histogram looks like a bell curve!
For example, if someone tries to tell you that "Engineering, mathematics and computer science OA books perform much better than the average number of downloads for OA books across all subject areas" without telling you about the variances of the distributions, and while refusing to release their data, you should pay them no mind.
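
Here is a rough sketch of what "measuring on a logarithmic scale" looks like in practice. The download counts are invented for illustration and do not come from the preprint; the point is only to show the mechanics of taking logarithms, bucketing by order of magnitude, and comparing the arithmetic mean with the geometric mean (the mean taken on the log scale).

```python
import math
import statistics
from collections import Counter

# Hypothetical download counts for illustration only (not from the preprint).
downloads = [12, 35, 80, 150, 210, 400, 650, 900,
             1_200, 2_500, 4_000, 9_000, 30_000, 250_000]

log_downloads = [math.log10(d) for d in downloads]

arithmetic_mean = statistics.mean(downloads)
# Averaging the logarithms and converting back gives the geometric mean.
geometric_mean = 10 ** statistics.mean(log_downloads)
median = statistics.median(downloads)

# Bucket each title by order of magnitude: 1 means 10-99, 2 means 100-999, ...
buckets = Counter(math.floor(x) for x in log_downloads)
for exponent in sorted(buckets):
    print(f"10^{exponent}: {'#' * buckets[exponent]}")

print(f"arithmetic mean: {arithmetic_mean:,.0f}")
print(f"geometric mean:  {geometric_mean:,.0f}")
print(f"median:          {median:,.0f}")
```

In this toy data set the arithmetic mean is dragged far above what a typical title sees by the one title with 250,000 downloads, while the geometric mean stays close to the median.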

Next week, I'll have a post about why logarithmic scales make sense for measuring open-access usage, and maybe another about how log-normal statistics could save civilization.

Monday, December 31, 2018

On the Surveillance Techno-state

I used to run my own mail server. But then came the spammers. And dictionary attacks. All sorts of other nasty things. I finally gave up and turned to Gmail to maintain my online identities. Recently, one of my web servers has been attacked by a bot from a Russian IP address which will eventually force me to deploy sophisticated bot-detection. I'll probably have to turn to Google's reCAPTCHA service, which watches users to check that they're not robots.

Isn't this how governments and nations formed? You don't need a police force if there aren't any criminals. You don't need an army until there's a threat from somewhere else. But because of threats near and far, we turn to civil governments for protection. The same happens on the web. Web services may thrive and grow because of economies of scale, but just as often it's because only the powerful can stand up to storms.  Facebook and Google become more powerful, even as civil government power seems to wane.

When a company or institution is successful by virtue of its power, it needs governance, lest that power go astray. History is filled with examples of power gone sour, so it's fun to draw parallels. Wikipedia, for example, seems to be governed like the Roman Catholic Church, with a hierarchical priesthood, canon law, and sacred texts. Twitter seems to be a failed state with a weak government populated by rival factions demonstrating against the other factions. Apple is some sort of Buddhist monastery.

This year it became apparent to me that Facebook is becoming the internet version of a totalitarian state. It's become so ... needy. Especially the app. It's constantly inventing new ways to hoard my attention. It won't let me follow links to the internet. It wants to track me at all times. It asks me to send messages to my friends. It wants to remind me what I did 5 years ago and to celebrate how long I've been "friends" with friends. My social life is dominated by Facebook to the extent that I can't delete my account.

That's no different from the years before, I suppose, but what we saw this year is that Facebook's governance is unthinking. They've built a machine that optimizes everything for engagement and it's been so successful that they don't know how to re-optimize it for humanity. They can't figure out how to avoid being a tool of oppression and propaganda. Their response to criticism is to fill everyone's feed with messages about how they're making things better. It's terrifying, but it could be so much worse.

I get the impression that Amazon is governed by an optimization for efficiency.

How is Google governed? There has never existed a more totalitarian entity, in terms of how much it knows about every aspect of our lives. Does it have a governing philosophy? What does it optimize for?

In a lot of countries, it seems that the civil governments are becoming a threat to our online lives. Will we turn to Wikipedia, Apple, or Google for protection? Or will we turn to civil governments to protect us from Twitter, Amazon and Facebook? Will democracy ever govern the Internet?

Happy 2019!

Thursday, December 27, 2018

Towards Impact-based OA Funding

Earlier this month, I was invited to a meeting sponsored by the Mellon Foundation about aggregating usage data for open-access (OA) ebooks, with a focus on scholarly monographs. The "problem" is that open licenses permit these ebooks to be liberated from hosting platforms and obtained in a variety of ways. A scholar might find the ebook via a search engine, on social media or on the publisher's web site; or perhaps in an index like Directory of Open Access Books (DOAB), or in an aggregator service like JSTOR. The ebook file might be hosted by the publisher, by OAPEN, or on Internet Archive, Dropbox, or Github. Libraries might host files on institutional repositories, and scholars might distribute them by email or via ResearchGate or discipline-oriented sites such as Humanities Commons.

I haven't come to the "problem" yet. Open access publishers need ways to measure their impact. Since the whole point of removing toll-access barriers is to increase access to information, open access publishers look to their usage logs for validation of their efforts and mission. Unit sales and profits do not align very well with the goals of open-access publishing, but in the absence of sales revenue, download statistics and other measures of impact can be used to advocate for funding from institutions, from donors, and from libraries. Without evidence of impact, financial support for open access would be based more on faith than on data. (Not that there's anything inherently wrong with that.)

What is to be done? The "monograph usage" meeting was structured around a "provocation": that somehow a non-profit "Data Trust" would be formed to collect data from all the providers of open-access monographs, then channel it back to publishers and other stakeholders in privacy-preserving, value-affirming reports. There was broad support for this concept among the participants, but significant disagreements about the details of how a "Data Trust" might work, be governed, and be sustained.

Why would anyone trust a "Data Trust"? Who, exactly, would be paying to sustain a "Data Trust"? What is the product that the "Data Trust" will be providing to the folks paying to sustain it? Would a standardized usage data protocol stifle innovation in ebook distribution? We had so many questions, and there were so few answers.

I had trouble sleeping after the first day of the meeting. At 4 AM, my long-dormant physics brain, forged in countless all-nighters of problem sets in college, took over. It proposed a gedanken experiment:
What if there was open-access monograph usage data that everyone really trusted? How might it be used?
The answer is given away in the title of this post, but let's step back for a moment to provide some context.

For a long time, scholarly publishing was mostly funded by libraries that built great literature collections on behalf of their users - mostly scholars. This system incentivized the production of expensive must-have journals that expanded and multiplied so as to eat up all available funding from libraries. Monographs were economically squeezed in this process. Monographs, and the academic presses that published them, survived by becoming expensive, drastically reducing access for scholars.

With the advent of electronic publishing, it became feasible to flip the scholarly publishing model. Instead of charging libraries for access, access could be free for everyone, while authors paid a flat publication fee per article or monograph. In the journal world, the emergence of this system has erased access barriers. The publication fee system hasn't worked so well for monographs, however. The publication charge (much larger than an article charge) is often out of reach for many scholars, shutting them out of the open-access publishing process.

What if there was a funding channel for monographs that allocated support based on a measurement of impact, such as might be generated from data aggregated by a trusted "Data Trust"? (I'll call it the "OA Impact Trust", because I'd like to imagine that "impact" rather than a usage proxy such as "downloads" is what we care about.)

Here's how it might work:

  1. Libraries and institutions register with the OA Impact Trust, providing it with a way to identify usage and impact relevant to the library or institution.
  2. Aggregators and publishers deposit monograph metadata and usage/impact streams with the Trust.
  3. The Trust provides COUNTER reports (suitably adapted) for relevant OA monograph usage/impact to libraries and institutions. This allows them to compare OA and non-OA ebook usage side-by-side.
  4. Libraries and institutions allocate some funding to OA monographs.
  5. The Trust passes funding to monograph publishers and participating distributors.
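
To make steps 4 and 5 a little more concrete, here is one possible sketch of how a library's OA budget might be split. The proportional rule, the function name, and the usage numbers are all assumptions made for illustration; nothing above specifies how the Trust would actually weigh usage or impact.

```python
from typing import Dict

def allocate_funding(budget: float,
                     usage_by_publisher: Dict[str, float]) -> Dict[str, float]:
    """Split a budget in proportion to each publisher's measured usage.

    Proportional allocation is an assumed rule, chosen only because it
    is the simplest one to write down.
    """
    total = sum(usage_by_publisher.values())
    if total <= 0:
        # No measured usage: nothing to allocate on.
        return {name: 0.0 for name in usage_by_publisher}
    return {name: budget * usage / total
            for name, usage in usage_by_publisher.items()}

# Hypothetical usage counts reported to one library by the Trust.
usage = {"Press A": 600, "Press B": 300, "Press C": 100}
shares = allocate_funding(10_000, usage)

for name, amount in shares.items():
    print(f"{name}: ${amount:,.2f}")
```

A real scheme would need a more careful weighting than raw counts, for example to balance a work that goes viral against one used intensely by a small community.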

The incentives built into such a system promote distribution and access. Publishers are encouraged to publish monographs that actually get used. Authors are encouraged to write in ways that promote reading and scholarship. Publishers are also encouraged to include their backlists in the system, and not just the dead ones, but the ones that scholars continue to use. Measured impact for OA publication rises, and libraries observe that more and more, their dollars are channeled to the material that their communities need.

Of course there are all sorts of problems with this gedanken OA funding scheme. If COUNTER statistics generate revenue, they will need to be secured against the inevitable gaming of the system and fraud. The system will have to make judgements about what sort of usage is valuable, and how to weigh the value of a work that goes viral against the value of a work used intensely by a very small community. Boundaries will need to be drawn. The machinery driving such a system will not be free, but it can be governed by the community of funders.

Do you think such a system can work? Do you think such a system would be fair, or at least fairer than other systems? Would it be Good, or would it be Evil?

Notes:

  1. Details have been swept under a rug the size of Afghanistan. But this rug won't fly anywhere unless there's willingness to pay for a rug.
  2. The white paper draft which was the "provocation" for the meeting is posted here.
  3. I've been thinking about this for a while.

Tuesday, October 30, 2018

A Milestone for GITenberg

We've reached a big milestone for the GITenberg Project, which comes after a lot of work over 6 years by several groups of people. It's now ready to use!

GITenberg is a prototype that explores how Project Gutenberg might work if all the Gutenberg texts were on Github, so that tools like version control, continuous integration, and pull-request workflow could be employed. We hope that Project Gutenberg can take advantage of what we've learned; work in that direction has begun but needs resources and volunteers.  Go check it out!

It's hard to believe, but GITenberg started 6 years ago when Seth Woodworth started making Github repos for Gutenberg texts. I joined the project two years later when I started doing the same and discovered that Seth was 43,000 repos ahead of me. The project got a big boost when the Knight Foundation awarded us a Prototype Fund grant to "explore the applicability of open-source methodologies to the maintenance of the cultural heritage" that is the Project Gutenberg collection. But there were big chunks of effort left to finish the work when that grant ended. Last year, six computer-science seniors from Stevens Institute of Technology took up the challenge and brought the project within sight of a major milestone (if not the finishing-line). There remained only the reprocessing of 58,000 ebooks (with more being created every day!). As of last week, we've done that! Whew.

So here's what's been done:
  • Almost 57,000 texts from Project Gutenberg have been loaded into Github repositories.
  • EPUB, PDF, and Kindle Ebooks have been rebuilt and added to releases for all but about 100 of these.
  • Github webhooks trigger dockerized ebook building machines running on AWS Elastic Beanstalk every time a git repo is tagged.
  • Toolchains for asciidoc, HTML and plain text source files are running on the ebook builders.
  • A website at uses the webhooks to index and link to all of the ebooks.
  • presents links to Github, Project Gutenberg, Librivox, and Standard Ebooks.
  • Cover images are supplied for every ebook.
  • Human-readable metadata files are available for every ebook
  • Syndication feeds for these books are made available in ONIX, MARC and OPDS via
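
For readers curious about the "build when a repo is tagged" step, here is a minimal sketch of how a webhook receiver might decide whether a Github push event is a new tag. It is not the GITenberg code. It relies only on the `ref`, `deleted`, and `repository.full_name` fields of Github's push event payload; the function name and the example repository name are hypothetical.

```python
from typing import Optional, Tuple

TAG_PREFIX = "refs/tags/"

def tag_build_request(payload: dict) -> Optional[Tuple[str, str]]:
    """Return (repository, tag) if the push event created a tag, else None."""
    ref = payload.get("ref", "")
    # Branch pushes use refs/heads/...; tag deletions set "deleted" to true.
    if not ref.startswith(TAG_PREFIX) or payload.get("deleted"):
        return None
    repo = payload.get("repository", {}).get("full_name")
    if not repo:
        return None
    return repo, ref[len(TAG_PREFIX):]

# Hypothetical payloads, trimmed to the fields used above.
tag_push = {
    "ref": "refs/tags/0.1.0",
    "deleted": False,
    "repository": {"full_name": "example-org/example-book"},
}
branch_push = {
    "ref": "refs/heads/master",
    "deleted": False,
    "repository": {"full_name": "example-org/example-book"},
}

print(tag_build_request(tag_push))
print(tag_build_request(branch_push))
```

A receiver along these lines would hand the returned repository and tag to a build queue and ignore everything else.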

Everything in this project is built in the hope that the bits can be incorporated into Project Gutenberg wherever appropriate. In January 2019, the US public domain will resume the addition of new books, so it's more important than ever that we strengthen the infrastructure that supports it.

Some details:
  • All of the software that's been used is open source and content is openly licensed.
  • PG's epubmaker software has been significantly strengthened and improved.
  • About 200 PG ebooks have had fatal formatting errors remediated to allow for automated ebook file production.
  • 1,363 PG ebooks were omitted from this work due to licensing or because they aren't really books.
  • PG's RDF metadata files were converted to human-readable YAML and enhanced with data from New York Public Library and from Wikipedia.
  • Github API throttling limits the build/release rate to about 600 ebooks/hour/login. A full build takes about 4 full days with one github login.

Acknowledgements:
  • Seth Woodworth. In retrospect, the core idea was obvious, audacious, and crazy. Like all great ideas.
  • Github tech support. Always responsive.
  • The O'Reilly HTMLBook team. The asciidoc toolchain is based on their work.
  • Plympton. Many asciidoc versions were contributed to GITenberg as part of the "Recovering the Classics" project. Thanks to Jenny 8. Lee, Michelle Cheng, Max Pevner and Nessie Fox.
  • Albert Carter and Paul Moss contributed to early versions of the GITenberg website.
  • The Knight Foundation provided funding for GITenberg at a key juncture in the project's development through its Prototype Fund. The Knight Foundation supports public-benefitting innovation in so many ways even beyond the funding it provides, and we thank them with all our hearts.
  • Travis-CI. The first version of automated ebook building took advantage of Travis-CI. Thanks!
  • Raymond Yee got the automated ebook building to actually work.
  • New York Public Library contributed descriptions, rights info, and generative covers. They also sponsored hackathons that significantly advanced the environment for public domain books. Special thanks to Leonard Richardson, Mauricio Giraldo and Jens Troeger (Bookalope).
  • My Board at the Free Ebook Foundation: Seth, Vicky Reich, Rupert Gatti, Todd Carpenter, Michael Wolfe and Karen Liu. Yes, we're overdue for a board meeting...
  • The Stevens GITenberg team: Marc Gotliboym, Nicholas Tang-Mifsud, Brian Silverman, Brandon Rothweiler, Meng Qiu, and Ankur Ramesh. They redesigned the website, added search, added automatic metadata updates, and built the dockerized elastic beanstalk ebook-builder and queuing system. This work was done as part of  their two-semester capstone (project) course. The course is taught by Prof. David Klappholz, who managed a total of 23 student projects last academic year. Students in the course design and develop software for established companies, early stage startups, nonprofits, gov't agencies, etc., etc. Take a look at detailed information about software that has been developed over the past 6-7 years and details of how the course works. 
  • Last, but certainly not least, Greg Newby (Project Gutenberg) for consistent encouragement and tolerance of our nit-discovery, Juliet Sutherland (Distributed Proofreaders) for her invaluable insights into how PG ebooks get made, and the countless volunteers at both organizations who collectively have made possible the preservation and reuse of our public domain.
I'm sure I've omitted an important acknowledgement or two - please let me know so I can rectify the omission.
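
As a back-of-the-envelope check on the throttling figure in the details above, the build time follows directly from the numbers given: roughly 57,000 ebooks at about 600 ebooks per hour per login. The function name is mine, and the idea that additional logins would scale the rate linearly is an assumption.

```python
def full_build_days(ebooks: int, rate_per_hour_per_login: float,
                    logins: int = 1) -> float:
    """Days needed to rebuild every ebook at a throttled release rate.

    Assumes the throttle applies per login, so extra logins add capacity.
    """
    hours = ebooks / (rate_per_hour_per_login * logins)
    return hours / 24

one_login = full_build_days(57_000, 600)
two_logins = full_build_days(57_000, 600, logins=2)

print(f"one login:  {one_login:.1f} days")
print(f"two logins: {two_logins:.1f} days")
```

With a single login this lands just under four days, which matches the figure quoted above.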

So what's next? As I mentioned, we've taken some baby steps towards applying version control to Project Gutenberg. But Project Gutenberg is a complex organism, and implementing profound changes will require broad consensus-building and resource gathering (both money and talent). Project Gutenberg and the Free Ebook Foundation are very lean non-profit organizations dependent on volunteers and small donations. What's next is really up to you!

Tuesday, September 18, 2018

eBook DRM and Blockchain play CryptoKitty and Mouse. And the Winner is...

If you want to know how blockchain relates to DRM and ebooks, it helps to understand CryptoKitties.

CryptoKitties are essentially numbers that live in a game-like environment which renders cats based on the numbers. Players can buy, collect, trade, and breed their kitties. Each kitty is unique. Players let their kitties play games in the "kittyverse". Transactions involving CryptoKitties take place on the Ethereum blockchain. Use of the blockchain makes CryptoKitties different from other types of virtual property. The kitties can be traded outside of the game environment, and the kitties can't be confiscated or deleted by the game developers. In fact, the kitties could easily live in third-party software environments, though they might not carry their in-game attributes with them. Over 12 million dollars has been spent on CryptoKitties, and while you might assume they're a passing fad, they haven't gone away.

It's weird to think about "digital rights management" (DRM) for CryptoKitties. Cryptography locks a kitty to a user's cryptocurrency wallet, but you can transfer a wallet to someone else by giving them your secret keys. With the key, you can do anything with the contents of the wallet. The utility of your CryptoKitty (your "digital rights") is managed by a virtual environment controlled by Launch Labs, Inc., but until the kitties become sentient (15-20 years?) the setup doesn't trigger my distaste for DRM.

Now, think about how Amazon's Kindle works. When you buy an ebook from Amazon, what you're paying for is a piece of virtual property that only exists in the Kindle virtual world. The Kindle software environment endows your virtual property with value - but instead of giving you the right to breed a kitty, you might get the right to read about a kitty. You're not allowed to exercise this right outside of Amazon's virtual world, and DRM exists to enforce Amazon's control of that right. You can't trade or transfer this right.

Ebooks are different from virtual property in important ways. Ebooks are words, ideas, stories that live just fine outside Kindle. DRM takes this outside life away, which is a sin. And it robs readers of the ability to read without Big Brother keeping track of every page they read. Most authors and publishers see DRM as a necessary evil, because they don't believe in a utopia where readers pay creators just because they're worth it.

But what if it were possible to "CryptoKittify" ebooks? Would that mitigate the sins of DRM, or even render it unnecessary? Would it just add the evils of blockchain to the evils of DRM? Two startups, Publica and Scenarex, are trying to find out.

Depending on implementation, the "CryptoKittification" of ebooks could allow enhanced privacy and property rights for purchasers as well as transaction monitoring for rights holders. If a user's right to an ebook was registered on a blockchain, a reader application wouldn't need to "phone home" to check whether a user was entitled to open and use the ebook. Similarly, the encrypted ebook files could be stored on a distributed service such as IPFS, or on a publisher's distribution site. The reader platform provider needn't separately verify the user. And just like printed books, a reader license could be transferred or sold to another user.
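
A toy model may help show why a public record of transfers removes the need to "phone home". The sketch below is not how Scenarex, Publica, or Ethereum work: it has no cryptography, no consensus, and all names are hypothetical. It only illustrates that anyone holding a copy of an append-only transfer log can work out who currently holds a license, and that only the current holder can pass it on.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Transfer:
    license_id: str
    sender: Optional[str]   # None when the license is first issued
    recipient: str

class ToyLedger:
    """Append-only list of transfers; a stand-in for a real blockchain."""

    def __init__(self) -> None:
        self._entries: List[Transfer] = []

    def owner_of(self, license_id: str) -> Optional[str]:
        """Replay the log; the most recent recipient is the current owner."""
        owner = None
        for entry in self._entries:
            if entry.license_id == license_id:
                owner = entry.recipient
        return owner

    def issue(self, license_id: str, recipient: str) -> None:
        if self.owner_of(license_id) is not None:
            raise ValueError("license already issued")
        self._entries.append(Transfer(license_id, None, recipient))

    def transfer(self, license_id: str, sender: str, recipient: str) -> None:
        if self.owner_of(license_id) != sender:
            raise PermissionError("only the current owner can transfer")
        self._entries.append(Transfer(license_id, sender, recipient))

ledger = ToyLedger()
ledger.issue("ebook-123-copy-1", "alice")
ledger.transfer("ebook-123-copy-1", "alice", "bob")
print(ledger.owner_of("ebook-123-copy-1"))
```

A reader application with a copy of the log can check entitlement locally, and a resale is just one more entry in the log.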

Alas, the DRM devil is always in the details, which is why I quizzed both Scenarex and Publica about their implementations. The two companies have taken strikingly different approaches to the application of blockchain to the ebook problem.

Scenarex, a company based in Montreal, has strived to make their platform familiar to both publishers and to readers. You don't need to have cryptocurrency or a crypto-wallet to use their system, called "Bookchain". Their website will look like an online bookstore, and their web-based reader application will use ebooks in the EPUB format rendered by the open-source Readium software being used by other ebook websites. All of the interaction with the blockchain will be handled in their servers. The affordances of their user-facing platform, at least in its initial form, should be very similar to other Readium-powered sites. For users, the only differences will be the license transfer options enabled by the blockchain and its content providers. Because the licenses will be memorialized on a blockchain, the possibility is open that they could be used in other reading environments.

Scenarex's conservative approach of hiding most of the blockchain from users and rights holders means that almost all of Scenarex's blockchain potential is as yet unrealized. There's no significant difference in privacy compared to Readium's LCP DRM scheme. License portability and transactions will depend on whether other providers decide to adopt Scenarex's license tokenization and publication scheme. Because blockchain interaction takes place behind Scenarex servers, the problems with blockchain immutability are mitigated along with the corresponding benefits to the purchaser. Scenarex expects to launch soon, but it's still too early to see if they can gain any traction.

Publica, by contrast, has chosen to propose a truly radical course for the ebook industry. Publica, with development offices in Latvia, doesn't make sense if you think of it as an ebook store; it only makes sense if you think of it as a crowd-funding platform for ebooks. (Disclosure:, a website I founded and run as part of the Free Ebook Foundation, started life as a crowd-funding platform for free ebooks.)

Publica invites authors to create "initial coin offerings" (ICOs) for their books. An author raising funds for their book sells read tokens for the book to investors, presumably in advance of publication. When the book is published, token owners get to read the book. Tokens can be traded or sold in Ethereum blockchain-backed transactions.

From an economic point of view, this doesn't seem to make much sense. If the token marketplace is efficient, the price of a token will fluctuate until the supply of tokens equals the number of people who want continuing access to the book. Sell too many tokens, and the price crashes to near zero.  In today's market for books, buyers are motivated by word of mouth, so newly published books, especially by unknown authors, are given out free to reviewers and other influencers. To make money with an ICO, in contrast, an author will need to limit the supply so as to support the token's attractiveness to investors, and thus the book's price.

In many ways, however, book purchasers don't act like economists. They keep their books around forever. They accumulate TBR piles. Yes, they'll give away or sell books, but that is typically to enable further accumulation. They'll borrow a book from the library, read it, and THEN buy it. Book purchasers collect books. Which brings us back to CryptoKitties.

In May of 2018, a CryptoKitty sold at auction for over $140,000. That's right, someone paid 6 figures for what is essentially a number! Can you imagine someone paying that much for a copy of a book?

[Image: title page of William Shakespeare's First Folio, 1623]

I can imagine that. In 2001, a First Folio edition of Shakespeare's plays sold for over $6,000,000! Suppose that J. K. Rowling had sold 100 digital first editions of Harry Potter and the Philosopher's Stone in 1996 to make ends meet. How much do you think someone would pay for one of those today, assuming the provenance and "ownership" could be unassailably verified?

CryptoKitties might be cute and they might have rare characteristics, but many more people develop powerful emotional attachments to books, even if they're just words or files full of bytes. A First Folio is an important historical artifact because of the huge cultural impact of the words it memorializes. I think it's plausible that a digital artifact could be similarly important, especially if its original sale provided support to its artist.

This brings me back to DRM. I asked the CTO of Publica, Yuri Pimenov, about it, and he seemed apologetic.
Even Amazon's DRM can be easily removed (I did it once). So, let's assume that DRM is a little inconvenience that [...] people are ready to pay [to get around]. And besides the majority of people are good and understand that authors make a living by writing books...
Publica's app uses a cryptographic token on the blockchain to allow access to the book contents, and does DRM-ish things like disabling quoting. But since the cryptographic token is bound to a cryptographic wallet, not a device or an account, it just papers over author concerns such as piracy. Pimenov is correct to note that it's the reader's relationship to the author that should be cemented by the Publica marketplace. Once Publica understands that memorializing readers supporting authors is where their success can come from, I think they'll realize that DRM, by restricting readers and building moats around literature, is counterproductive. To make an ebook into a collectable product, we don't need DRM, we need "DRMem": Digital Rights Memorialization.

So, I'm surprised to be saying this, but... CryptoKitties win!

More Links: