Go To Hellman

The Revenge of the Cataloguers

2023-12-14T20:26:00.000-05:00

Over the past 15 years or so, libraries around the world have de-emphasized cataloguing. While budgetary concerns and technological efficiencies have been factors in the decline of cataloguing, the emergence of full text search and relevance ranking as practiced by Google and others has proved to be more popular for the vast majority of users. On the open internet, subject classifications have proved to be useless in an environment rife with keyword spam and other search engine optimization techniques.

In the past year, the emergence of artificial intelligence (AI) with large language models with surprising abilities to summarize and classify texts has people speculating that AI will put most cataloguers out of work in the not-so-distant future.

I think that's not even wrong. But Roy Tennant will turn out to be almost right. MARC, the premier tool of cataloguers around the world, will live forever... as a million weights in generative pre-trained transformer. Let me explain...

The success or failure of modern AI depends on the construction of large statistical models with billions or even trillions of variables. These models are built from training data. The old adage about computers: "garbage in garbage out" is truer than ever. The models are really good at imitating the training data; so good that they can surprise the models' architects! Thus the growing need for good training data, and the increasing value of rich data sources.

Filings in recent lawsuits confirm the value of this training data. Getty Images is suing Stability AI for the use of Getty Images' material in AI training sets. But it's not just for the use of the images, which are copyrighted, but also for the use of trademarks and the detailed descriptions than accompany the data. Read paragraph 57 of the complaint:

Getty Images’ websites include both the images and corresponding detailed titles and captions and other metadata. Upon information and belief, the pairings of detailed text and images has been critical to successfully training the Stable Diffusion model to deliver relevant output in response to text prompts. If, for example, Stability AI ingested an image of a beach that was labeled “forest” and used that image-text pairing to train the model, the model would learn inaccurate information and be far less effective at generating desirable outputs in response to text prompts by Stability AI’s customers. Furthermore, in training the Stable Diffusion model, Stability AI has benefitted from Getty Images’ image-text pairs that are not only accurate, but detailed. For example, if Stability AI ingested a picture of Lake Oroville in California during a severe drought with a corresponding caption limited to just the word “lake,” it would learn that the image is of a lake, but not which lake or that the photograph was taken during a severe drought. If a Stable Diffusion user then entered a prompt for “California’s Lake Oroville during a severe drought” the output image might still be one of a lake, but it would be much less likely to be an image of Lake Oroville during a severe drought because the synthesis engine would not have the same level of control that allows it to deliver detailed and specific images in response to text prompts.

If you're reading this blog, you're probably thinking to yourself "THAT'S METADATA!"

Let's not forget the trademark part of the complaint:

In many cases, and as discussed further below, the output delivered by Stability AI includes a modified version of a Getty Images watermark, underscoring the clear link between the copyrighted images that Stability AI copied without permission and the output its model delivers. In the following example, the image on the left is another original, watermarked image copied by Stability AI and used to train its model and the watermarked image on the right is output delivered using the model:

If you're reading this blog, you're probably thinking to yourself "THAT'S PROVENANCE!"

So clearly, the kinds of data that libraries and archives have been producing for many years will still have value, but we need to start thinking about how the practice of cataloguing and similar activities will need to change in response to the new technologies. Existing library data will get repurposed as training data to create efficiencies in library workflows. Organizations with large, well-managed will extract windfalls, deserved or not.

If the utility of metadata work is shifting from feeding databases to training AI models, how does this affect the product of that work? Here's how I see it:

Tighter coupling of metadata and content. Today's discovery systems are all about decoupling data from content - we talk about creating metadata surrogates for discovery of content. Surrogates are useless for AI training; a description of a cat is useless for training without an accompanying picture of the cat. This means that the existing decoupling of metadata work from content production is doomed. You might think that copyright considerations will drive metadata production into the hands of existing content producers, but more likely organizations that focus on production of integrated training data will emerge to license content and support the necessary metadata production.

Tighter collaboration of machines and humans. Optical character recognition (OCR) is a good example of highly focused and evolved machine learning that can still be improved by human editors. The practice of database-focused cataloguing will be made more productive as cataloguers become editors of machine generated structured data. (As if they're not already doing that!)

Softer categorization. Discovery databases demand hard classifications. Fiction. Science. Textbooks. LC Subject Headings. AIs are much better at nuance, so the training data needs to include a lot more context. You can have a romantic novel of chemists and their textbooks, and an AI will be just fine with that, so long as you have enough description and context for the machine to assign lots of weights to many topic clusters.

Emphasis on novelty. New concepts and things appear constantly; an AI will extrapolate unpredictably until it gets on-topic training data. AI-OCR might recognize a new emoji, but it might not.

Emphasis on provenance. Reality is expensive, which is why I think for-profit organizations will have difficulty in the business of providing training data while Wikipedia will continue to succeed because it requires citations. Already the internet is awash in AI produced content that sounds real, but is just automated BS. Training data will get branded.

What gets me really excited though, is thinking about how a library of the future will interact with content. I expect users will interact with the library using a pre-trained language model, rather than via databases. Content will get added to the model using packages of statistical vectors, compiled by human-expert-assisted content processors. These human experts won't be called "cataloguers" any longer but rather "meaning advisors". Or maybe "biblio-epistemologists". The revenge of the cataloguers will be that because of the great responsibilities and breadth of expertise required, biblio-epistemologists will command salaries well exceeding the managers and programmers who will just take orders from well-trained AIs. Of course there will still be MARC records, generated by a special historical vector package guaranteed to only occasionally hallucinate.

Note: I started thinking about this after hearing a great talk (starting at about 30:00) by Michelle Wu at the Charleston Conference in November. (Kyle Courtney's talk was good, too).

Let's pretend they're ebooks

2023-08-25T18:26:00.003-04:00

In days of yore, back when people were blogging, I described the way that libraries were offering ebooks as being a "Pretend It's Print" model. At the time, I felt that this model was designed to sustain and perpetuate the model that libraries and publishers had been using since prehistoric times, and that it ignored most of the possibilities inherent in the ebook. Ebooks could liberate the book from the shackles of their physical existences!

I was right, and I was wrong. The book publishing world seized on digital technology to put even heavier shackles on their books. In turn, technology companies such as Amazon locked down innovation in the ebook world so that libraries could no longer be equal contributors to the enterprise of distributing books, all the while pretending to their patrons that the ebooks they licensed were just like the print books sitting on their shelves.

Somehow libraries and publishers have survived. Maybe they've even thrived with the "pretend it's print" model for ebooks. There are plenty of economic problems, but whenever I talk to people about ebooks, the conversation is always some variation of "I love reading ebooks through my library". Most library users are perfectly happy pretending that their digital ebooks are just like the printed books.

A decade later, we need to change our perspective. It's time we seriously started pretending that printed books are just like ebooks, not just the other way around. The library world has been doing something called "Controlled Digital Lending" (CDL) , which flips the "pretend it's print" model and pretends that print is just like digital. The basic idea behind controlled digital lending is that owning a print book should allow you to read it any way you want, even if that involves creating a digital substitute for it. A library that owns a print book ought to be able to lend it, as long as it's lent to only one person at time. It's as if books were printed and sold in order to spread ideas and information!

Of course radical ideas such as spreading information have to be stopped. And so we have the Hachette v. Internet Archive lawsuit and its assorted fallout. I'm not a lawyer, so I won't say much about the legal validity of the arguments on either side. I'm an ebook technologist, so I will explain to you that whole lawsuit was about whether the other side was sufficiently serious about pretending that print books are just like ebooks and that ebooks are just like print books. Also that the other side doesn't understand how print books are completely different things than ebooks. Those lawyers really take to heart the White Queen's recommendation to believe 6 impossible things before breakfast.

The magic of technology is that it can make our pretendings into something real. So let's think a bit about how we can make the pretense of print-ebook equivalency more real, and if the resulting bargain makes any sense.

Here are some ways that we could make these ebooks, derived from printed books, more like print books:

Speed. It takes me an hour or so to get a print book from a library. Should I be able to get the digital substitute in a minute? Should I be able to read a chapter and the "return" it so that someone else can use it the next seconf? CDL already puts some limits on this, but maybe there could be a standard that makes the digital surrogate more like the real thing?
Geography. Printed books need to be transported to where the reader is. Once digitized they could go anywhere!. Maybe something like a shipping fee could be attached to a loan or other transfer. Maybe part of the fee could accrue to creators? Academic libraries have long done interlibrary loan of journal articles by copying and mailing the article, so why not do something equivalent for books?

These two attributes matter a lot in defining commercial markets for books and ebooks, and will become increasingly important as distribution technologies scale up and improve. Although publishers today make most of their money on the most popular books, book sales and usage of books in libraries have very long tails. There are millions of books for which global demand could be met by aggressive CDL of just a few copies. The CDL system instituted by Internet Archive also has a countervailing effect - the world-wide availability combined with so-so EPUB quality and usability probably result in stimulation of demand for print copies. This effect is likely to diminish as technologists like me smooth out the DRM speedbumps in CDL and begin to apply machine learning to EPUB generation.

It's worth noting that the "long tail" in book publishing also applies to authors and publishers. It's likely that the Internet Archive's CDL service has a larger market effect (whether positive or negative) on these market participants.

Here are some ways that we shouldn't make ebooks more like print books:

Search. Ebooks make search much easier than in print books. Maybe search should be disabled in CDL ebooks? Or maybe, we could enable search in print books. Google Books already sort of does this, if you have the right edition, but the process of making an ebook from a print book should give you an easy way to enable search in the print!
Accessibility. Many reading-disabled users rely on ebooks for access to literature, science and culture. Older adults such as myself often find that flowable text with adjustable font size is easier on our eyes. In addition to international treaties that treat accessible text as an exception to copyright, most authors and publishers don't want to be monsters.
Smell. Let's not go there.
Privacy. The intellectual property world seems to think that copyright gives them the right to monitor and data-mine the behavior of readers on digital platforms. In some cases, copyright extremists have required root access to our devices so they can sniff out infringing files or behavior. (While they're at it, they might as well mine some bitcoin!) It is an outrage to think anyone who makes ebooks from print books would wire them with surveillance tools; the strong privacy policies of Internet Archive should be codified for CDL.
Preservation. Publishers do a terrible job of preserving the lion's share of the printed books they publish, and society has always relied on libraries for this essential service. In this digital age, any grand bargain on copyrights has to provide libraries with the rights and incentives needed to do digital preservation of both printed and digital books.

The bottom line is that if we're going to continue to pretend that intellection property is a real thing, we need to start pretending that printed books are like ebooks, and vice versa. A grand bargain that benefits us all can eventually make these illusions real.

Notes:

Copyability. CDL books, like publisher-created ebooks, rely on device-enforced restrictions on duplication (DRM). Printed books rely on the expense of copying machines and paper to limit reproduction. In both cases, social norms and legal strictures discourage unauthorized reproduction. Building those social norms is what creating a grand bargain is all about.
Simultaneous use. Allowing simultaneous use of library ebooks during the pandemic is what really got the publishers mad at Internet Archive. A lot of people went mad during the lockdown, to be honest, and we're still recovering.
Comments. I encourage comment on the Fediverse or on Bluesky. I've turned off commenting here.

Crowdfunding Lessons from the Spice DAO

2022-02-12T21:15:00.002-05:00

What if we get a huge bunch of people together and buy something that lets us do fun things with a book that we all love, while making it accessible as never before? Great idea, isn't it?

If that sounds familiar, maybe you've heard of Unglue.it, a web site we launched 10 years ago? We asked people what book they wished was free to everyone and the number one answer was Douglas Adams' Hitchhikers Guide to the Galaxy. We talked to the literary agent for the Adams estate, and long story short, the rights entanglements made that impossible for any amount of money. We had a success with a seminal Anthropology book, but the intersection between books people were excited about and books that authors were willing to license openly was small. Probably you haven't heard of the site, but while it has focused on building a catalog of open-access books (now over 100,000 titles!) we still crowd fund a book here and there, most recently an academic monograph.

Probably you HAVE heard about Spice DAO, a "Distributed Autonomous Organization" that sprinkled some magic blockchain dust on an auction for a copy of Alejandro Jodorowosky's movie treatment of Frank Herbert's novel.

Web3 enthusiasts came through for Spice DAO, "crowdraising" enough to win the auction for €2.66M, though Christie's estimate for the item was only €25-35,000.

Spice DAO vows that:

Instead of letting it remain hidden away in private collections, Spice DAO crowdraised funds ... to collectively explore options to digitally preserve the manuscript, make it accessible to the public for the very first time, and develop creative projects inspired by the vision Jodorowsky set forth.

Predictably, the success of Spice DAO led to widespread ridicule , because:

The price paid was 100X the esimate
Nothing about the item purchased gave them any rights to "make it accessible" or "develop creative projects" it inspired.
Images of another copy were already freely available on the internet. But no more. Ironically, the publicity around Spice DAO seems to have knocked the images off of the internet!
Even the DAO's website https://dune.foundation/ is no longer online, most likely trademark infringement. (archived version linked above.)

One crypto lesson: a DAO constructed this way may get ripped off in an auction. Even if the seller was not using shills to see inside the DAO and bid up the price, the DAO was vulnerable to crypto-pranksters (or arbitrageurs?) who knew exactly what the DAO was forced to bid by its "smart" contract to avoid dissolution.

Despite all that, the 2.1 Billion "Spice" tokens given to crowdraise participants are still worth over 800,000 "dollars", according to Coinmarket, so maybe the product here is a convincing story for unregistered securities that apart from representing something tangible, can be used for tax evasion and money laundering. And the team seems to have had a crash course in copyright law:

After two months of outreach, conversations with former business partners and consultations with legal counsel we have not been able to reach an agreement with any of the rights holders involved in the creation of the contents of the book of collected storyboards of Jodorowsky’s Dune. (medium)

Spice DAO, like most successful crowd-funding projects, had a good story, and clearly that's worth a lot. There's still a big difference between a good story and an honest, well informed story. Crowdfunding services such as Unglue.it are limited by all the facts they have to deal with. But magic crypto dust has a certain reality. The crowd-raise generation of tokens that can be bought and sold in free markets allows participants to dream that their tokens will increase in value, and they very well could. In the real world, Spice DAO spent the equivalent of $300,000 to create the liquidity pool needed to distribute the SPICE tokens. Which makes credit card fee seem like a bargain! But dreams are priceless.

At least with "conventional" crowd funding, you know there's some accountability if you're investing in a nightmare!

Top 25 foods at a Traditional Hellman Christmas.

2021-12-22T09:19:00.000-05:00

Have only made 7 of the 25 so far this year.

Bulla, 2021

Julskinka (Christmas Ham). It doesn't count unless you cure it yourself. It once came out blue.
Köttbullar (Meatballs). Still working to perfect the Impossible™ version.
Limpa (Christmas rye Bread). You absolutely must have this with lever pastej, but its also great with just butter.
Sil (Herring). Must have Akvavit to kill the taste. A proper smörgåsbord should have two kinds at least, but we usually made do with one.
Akvavit. Must have Herring to kill the taste.
Boiled Potatoes. Great with Sil and Akvavit. Alleged to go great on knäckebröd with butter and Kalvslyta.
Kalvsylta (Jellied Veal). It's surprisingly easy to make. Keeps at least a year in the freezer.
Lever Pastej (Liver Paté). Also surprisingly easy to make, if you have a grinder.
Korv (Sausage). Two kinds in some years.
Spare Ribs. More than once these were forgotten in the oven.
Dopp i gryta (Dip in the pot). You dip some knäckebröd into boiling ham broth, then slather with butter. Matsos will work in a pinch.
Ost (cheese). Västerbottens Ost, Bond Ost, Herrgardsost, Havarti and Swiss are all good.
Rödkålsalat (red cabbage salad).
Rödbetsalat (red beet salad).
Inlagd gurka (picked cucumber).
Jansson's frestelse (Jansson's temptation). Creamed potatoes, anchovies and onions. Gonna try fermented tofu instead of anchovies this year.
Shrimp omelet.
Rotmoss (Mashed rutabagas). This used to be stuff that people in Sweden ate every day *other* than Christmas, because they could afford it. Now we never have it except at Christmas, because we can afford it.
Lingonsylt (lingonberry sauce).
Öl (beer). But never Swedish beer. Have dubbed this year's batch "Cipher Ale".
Glögg (spiced wine). Dad made it with 1 part vodka to one part wine.
Coffee.
Pepparkakar (Ginger Cookies). You can make a wish on them- if the cookie breaks into 3 pieces you get your wish. If some other number, you only get your wish if you wished for pepparkakor
Bulla (Cardamon cinnamon coffee b
read). Comes out differently for every baker. There's also the saffron and almond paste variety made for Dec. 13 and sometimes saved for Christmas.
Sand formar (Sand cookies). A ground almond cookie shell that you put vanilla pudding and a mandarin orange inside.
Jam cookies.
Many other cookies. Log cookies. Broomstick cookies. Chocolate chip cookies. Macaroons. Knäck. Our seasonal output peaked at 10 dozen dozen.
Donuts. With the next door neighbors.
Lutfisk. Not a figment of Garrison Kiellor's imagination, but only Dad liked it, with cream sauce, peas and allspice.

I know that's more than 25, but did I mention that Dad made the glögg with one part vodka to one part wine?

The Ebook Turns 50

2021-07-04T13:40:00.004-04:00

On July 4, 1971, Michael Hart made the text of the Declaration of Independence available on arpanet (which is now the Internet), using the gopher protocol (look it up). Although books in digital form certainly existed before that, many of us regard the beginning of Project Gutenberg as the birth of the ebook. There were computer-readable books on magnetic disks, punch cards and the like, but the revolutionary element of Project Gutenberg was the distribution method. Printed books, after all, are a digital media, it just that the bits are embodied by the presence or absence of ink rather than electrons on a transistor gate. Sending the bits over a wire or a fiber is what puts the 'e' in ebook.

The birth of the ebook was a political event as much as a technical achievement. The choice of the "Declaration of Independence of the United States" as etext #1 couldn't have been solely an expression of patriotic fervor. Rather, I think it was a manifestation of the radical belief that everyone should have access to the printed word, without having to pay for the privilege. (Yes, libraries are radical in this way, too!).

As Thomas Jefferson put it:

... it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume, among the Powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them.

In the context of 1971, the "bands" that needed dissolving were expensive services such as Dialog. The idea that users had to pay Dialog per word to read the Declaration mush have been galling to Hart. (Let's overlook the fact that he and other denizens of the 1971 arpanet got their access for "free" because someone else was paying.) Books are things in their own right; stripping ebooks of their "bands" to a single device or service is what put the "book" into ebook.

Although Project Gutenberg is now delivering about 50 million ebooks a year, about 2% of global ebook unit sales, until at least 2009 it delivered the majority of the world's ebooks. Today, that position has been taken by Amazon's Kindle. Just as the United States can't ignore the ideals that led to its founding, the stakeholders of the ebook ecosystem- authors, publishers, distributors, libraries, and readers, all of us need to remember that the ebook was born out of a desire for freedom.

Note: Though I've been helping Project Gutenberg modernize its technology, I don't speak for them in any way, though I am certainly in awe of what they've achieved! If you'd like to support my work advancing freedom for ebooks, consider a donation to the Free Ebook Foundation.

Open Access for Backlist Books, Part II: The All-Stars

2021-02-22T21:49:00.003-05:00

Libraries know that a big fraction of their book collections never circulate, even once. The flip side of this fact is that a small fraction of a library's collection accounts for most of the circulation. This is often referred to as Zipf's law; as a physicist I prefer to think of it as another manifestation of log-normal statistics resulting a preferential attachment mechanism for reading. (English translation: "word-of-mouth".)

In my post about the value of Open Access for books, I suggested that usage statistics (circulation, downloads, etc.) are a useful proxy for the value that books generate for their readers. The logical conclusion is that the largest amount of value that can be generated from opening of the backlist comes from the books that are most used, the "all-stars" of the library, not the discount rack or the discards. If libraries are to provide funding for Open Access backlist books, shouldn't they focus their resources on the books that create the most value?

The question of course, is how the library community would ever convince publishers, who have monopolies on these books as a consequence of international copyright laws, to convert these books to Open Access. Although some sort of statutory licensing or fair-use carve-outs could eventually do the trick, I believe that Open Access for a significant number of "backlist All-Stars" can be achieved today by pushing ALL the buttons available to supporters of Open Access. Here's where the Open Access can learn from the game (and business) of baseball.

"Baseball", Henry Sandham, L. Prang & Co. (1861).
from Digital Commonwealth

Baseball's best player, Mike Trout, should earn $33.25 million this year, a bit over $205,000 per regular season game. If he's chosen for the All-Star game, he won't get even a penny extra to play unless he's named MVP, in which case he earns a $50,000 bonus. So why would he bother to play for free? It turns out there are lots of reasons. The most important have everything to with the recognition and honor of being named as an All-Star, and with having respect for his fans. But being an All-Star is not without financial benefits considering endorsement contracts and earning potential outside of baseball. Playing in the All-Star game is an all-around no-brainer for Mike Trout.

Open Access should be an All-Star game for backlist books. We need to create community-based award programs that recognize and reward backlist conversions to OA. If the world's libraries want to spend $50,000 on backlist physics books, for example, isn't it better to spend it on the the Mike Trout of physics books than on a team full of discount-rack replacement-level players?

Competent publishers would line up in droves for major-league all-star backlist OA programs. They know that publicity will drive demand for their print versions (especially if NC licenses are used.) They know that awards will boost their prestige, and if they're trying to build Open Access publication programs, prestige and quality are a publisher's most important selling points.

The Newbury Medal

Over a hundred backlist books have been converted to open access already this year. Can you name one of them? Probably not, because the publicity value of existing OA conversion programs is negligible. To relicense an All-Star book, you need an all-star publicity program. You've heard of the Newbury Medal, right? You've seen the Newbury medal sticker on children's books, maybe even special sections for them in bookstores. That prize, award by the American Library Association every year to honor the most distinguished contributions to American literature for children, is a powerful driver of sales. The winners get feted in a gala banquet and party (at least they did in the before-times). That's the sort of publicity we need to create for open access books.

If you doubt that "All-Star Open Access" could work, don't discount the fact that it's also the right thing to do. Authors of All-Star backlist books want their books to be used, cherished and remembered. Libraries want books that measurably benefit the communities they serve. Foundations and governmental agencies want to make a difference. Even publishers who look only at their bottom lines can structure a rights conversion as a charitable donation to reduce their tax bills.

And did I mention that there could be Gala Award Celebrations? We need more celebrations, don't you think?

If your community is interest in creating an Open-Access program for backlist books, don't hesitate to contact me at the Free Ebook Foundation!

Notes

I've written about the statistics of book usage here, here and here.

This is the third in a series of posts about creating value of Open Access books. The first two are:

Open Access for Backlist Books, Part I: The Slush Pile

2021-02-16T21:32:00.003-05:00

"Kale emerging from a slush pile"
(CC BY, Eric Hellman)

Book publishers hate their "slush pile": books submitted for publication unsolicited, rarely with literary merit and unlikely to make money for the publisher if accepted. In contrast, book publishers love their backlist; a strong backlist is what allows a book publisher to remain consistently profitable even when most of their newly published books fail to turn a profit. A publisher's backlist typically consists of a large number of "slushy" books that generate negligible income and a few steady "evergreen" earners. Publishers don't talk much about the backlist slush pile, maybe because it reminds them of their inability to predict a book's commercial success.

With the advent of digital books has come new possibilities for generating value from the backlist slush pile. Digital books can be kept "in print" at essentially no cost (printed books need warehouse space) which has allowed publishers to avoid rights reversion in many cases. Some types of books can be bundled in ebook aggregations that can be offered on a subscription basis. This is reminiscent of the way investment bankers created valuable securities by packaging junk bonds with opaque derivatives.

Open access is a more broadly beneficial way to generate value from the backlist slush pile. There is a reason that libraries keep large numbers of books on their shelves even when they don't circulate for years. The myriad ways that books can create value doesn't have to be tied to book sales, as I wrote in my previous post.

Those of us who want to promote Open Access for backlist ebooks have a number of strategies at our disposal. The most basic strategy is to promote the visibility of these books. Libraries can add listings for these ebooks in their catalogs. Aggregators can make these books easier to find.

Switching backlist books to Open Access licenses can be expensive and difficult. While the cost of digitization has dropped dramatically over the past decade, quality control is still a significant conversion expense. Licensing-related expenses are sometimes large. Unlike journals and journal articles, academic books are typically covered by publishing agreements that give authors royalties on sales and licensing, and give authors control over derivative works such as translations. No publisher would consent to OA relicensing without the consent and support of the author. For older books, a publisher may not even have electronic rights (in the US, the Tasini decision established that electronic rights are separate from print rights), or may need to have a lawyer interpret the language of the original publishing contract.

While most scholarly publishers obtain worldwide rights to the books they publish, rights for trade books are very often divided among markets. Open-access licenses such as the Creative Commons licenses are not limited to markets, so a license conversion would require the participation of every rights holder worldwide.

The CC BY license can be problematic for books containing illustrations or figures used by permission from third party rights holders. "All Rights Reserved" illustrations are often included in Open Access Books, but they are carved out of the license by separate rights statements, and to be safe, publishers use the CC BY-ND or CC BY-ND-NC license for the complete book, as the permissions do not cover derivative works. Since the CC BY license allows derivative works, it cannot be used in cases where translation rights have been sold (without also buying out the translation rights). A publisher cannot use a CC BY license for a translated work without also having rights to the original work.

The bottom line is that converting a backlist book to OA often requires economic motivations quite apart from any lost sales. Luckily, there's evidence that opening access can lead to increased sales. Nagaraj and Reimers found that digitization and exposure through Google Books increased sales of print editions by 35% for books in the Public Domain. In addition, a publisher's commercial position and prestige can be enhanced by the attribution requirement in Creative Commons licenses.

Additional motivation for OA conversion of the backlist slush pile has been supplied by programs such as used by Knowledge Unlatched, where libraries contribute to to a fund used for "unlatching" backlist books. (Knowledge Unlatched has programs for front list books as well.) While such programs can in principle be applied for the "evergreen" backlist, the incentives currently in place result in the unlatching of books in the "slush pile" backlist. While value for society is being gained this way, the willingness of publishers to "unlatch" hundreds of these books poses the question of how much library funding for Open Access should be allocated to the discount bin, as opposed to the backlist books most used in libraries. That's the topic of my next post!

Notes

This is the second in a series of posts about creating value of Open Access books. The others are:

Creating Value with Open Access Books

2021-02-12T12:31:00.005-05:00

Can a book be more valuable if it's free? How valuable? To whom? How do we unlock this value?

I've been wrestling with these questions for over ten years now. And for each of these questions, the answer is... it depends. A truism of the bookselling business is that "Every book is different" and the same is true of the book freeing "business".

Recently there's been increased interest in academic communities around Open Access book publishing and in academic book relicensing (adding an Open Access License to an already published book). Both endeavors have been struggling with the central question of how to value an open access book. The uncertainty in OA book valuation has led to many rookie mistakes among OA stakeholders. For example, when we first started Unglue.it, we assumed that reader interest would accelerate the relicensing process for older books whose sales had declined. But the opposite turned out to be true. Evidence of reader interest let rights holders know that these backlist titles were much more valuable than sales would indicate, thus precluding any notion of making them Open Access. Pro tip: if you want to pay a publisher to make a books free, don't publish your list of incredibly valuable books!

Instead of a strictly transactional approach, it's more useful to consider the myriad ways that academic books create value. Each of these value mechanisms offer buttons that we can push to promote open access, and point to new structures for markets where participants join together to create mutual value.

First, consider the book's reader. The value created is the reader's increased knowledge, understanding and sometimes, sheer enjoyment. The fact of open access does not itself create the value, but removes some of the barriers which might suppress this value. It's almost impossible to quantify the understanding and enjoyment from books; but "hours spent reading" might be a useful proxy for it.

Next consider a book's creator. While a small number of creators derive an income stream from their books, most academic authors benefit primarily from the development and dissemination of their ideas. In many fields of inquiry, publishing a book is the academic's path to tenure. Educators (and their students!) similarly benefit. In principle, you might assess a textbook's value by measuring student performance.

The value of a book to a publisher can be more than just direct sales revenue. A widely distributed book can be a marketing tool for a publisher's entire business. In the world of Open Access, we can see new revenue models emerging - publication charges, events, sponsorships, even grants and memberships.

The value of a book to society as a whole can be enormous. In areas of research, a book might lead to technological advances, healthier living, or a more equitable society. Or a book might create outrage, civil strife, and misinformation. That's another issue entirely!

Books can be valuable to secondary distributors as well. Both used book resellers and libraries add value to physical books by increasing their usage. This is much harder to accomplish for paywalled ebooks! Since academic libraries are often considered as potential funding sources for Open Access publishing it's worth noting that the value of an open access ebook to a library is entirely indirect. When a library acts as an Open Access funding source, it's acting as a proxy for the community it serves.

This brings us to communities. The vast majority of books create value for specific communities, not societies as a whole. I believe that community-based funding is the most sustainable path for support of Open Access Books. Community supported OA article publishing has already had plenty of support. Communities organized by discipline have been particularly successful: consider the success that ArXiv has had in promoting Open Access in physics, both at the preprint level and for journals in high-energy physics. A similar story can be told for biomedicine, Pubmed and Pubmed Central. A different sort of community success story has been SciELO, which has used Open Access to address challenges faced by scholars in Latin America.

So far, however, sustainable Open Access has proven to be challenging for scholarly ebooks. My next few posts will discuss the challenges and ways forward for support of ebook relicensing and for OA ebook creation:

Infra-infrastructure, inter-infrastructure and para-infrastructure

2020-12-29T13:17:00.000-05:00

No one is against "Investing in Infrastructure". No one wants bridges to collapse, investing is always more popular than spending, and it's even alliterative! What's more, since infrastructure is almost invisible by definition, it's politically safe to support investing in infrastructure because no one will see when you don't follow through on your commitment!

Ponte Morandi collapse - Michele Ferraris, CC BY-SA 4.0 via Wikimedia Commons

Geoffrey Bilder gives a talk where he asks us to think of Crossref and similar services as "information infrastructure" akin to "plumbing", where the implication is that since we, as a society, are accustomed to paying plumbers and bridge builders lots of money, we should also pony up for "information infrastructure", which is obvious once you say it.

What qualifies as infrastructure, anyway? If I invest in a new laptop, is that infrastructure for the Go-to-Hellman blog? Blogspot is Google-owned blogging infrastructure for sure. It's certainly not open infrastructure, but it works, and I haven't had to do much maintenance on it.

There's a lot of infrastructure used to make Unglue.it, which supports distribution of open-access ebooks. It uses Django, which is open-source software originally developed to support newspaper websites. Unglue.it also uses modules that extend Django that were made possible by Django's Open license. It works really well, but I've had to put a fair amount of work into updating my code to keep up with new versions of Django. Ironically, most of this work has been in fixing the extensions that have not updated along with Django.

I deploy Unglue.it on AWS, which is DEFINITELY infrastructure. I have a love/hate relationship with AWS because it works so well, but every time I need to change something, I have to spend 2 hours with documentation to find the one-line incantation that make it work. But every few months, the cost of using AWS goes down, which I like, but the money goes to Amazon, which is ironic because they really don't care for the free ebooks we distribute.

Aside from AWS and Django, the infrastructure I use to deliver Ebook Foundation services includes Python, Docker, Travis-CI, GitHub, git, Ubuntu Linux, MySQL, Postgres, Ansible, Requests, Beautiful Soup, and many others. The Unglue.it database relies on infrastructure services from DOAB, OAPEN, LibraryThing, Project Gutenberg, OpenLibrary and Google Books. My development environment relies heavily on BBEdit and Jupyter. We depend on Crossref and Internet Archive to resolve some links; we use subject vocabulary from Library of Congress and BISAC.

You can imagine why I was interested in "JROST 2020" which turns out to stand for "Join Roadmap for Open Science Tools 2020", a meeting organized by a relatively new non-profit, "Invest in Open Infrastructure" (IOI). The meeting was open and free, and despite the challenges associated with such a meeting in our difficult times, it managed to present a provocative program along with a compelling vision.

If you think a bit about how to address the infrastructure needs of open science and open scholarship in general, you come up with at least 3 questions:

How do you identify the "leaky pipes" that need fixing so as to avoid systemic collapse?
How do you bolster healthy infrastructure so that it won't need repair?
How do you build new infrastructure that will be valuable and thrive?

If it were up to me, my first steps would be to:

Get people with a stake in open infrastructure to talk to each other. Break them out of their silos and figure out how their solutions can help solve problems in other communities.
Create a 'venture fund" for new needed infrastructure. Work on solving the problems that no one wants to tackle on their own.

Invest in Open Infrastructure is already doing this! Kaitlin Thaney, who's been Executive Director of IOI for less that a year, seems to be pressing all the right buttons. The JROST 2020 meeting was a great start on #1 and #2 is the initial direction of the "JROST Rapid Response Fund", whose first round of awards was announced at the meeting.

Among the first awardees of the JROST Rapid Response Fund announced at JROST2020 was an organization that ties into the infrastructure that I use, 2i2c. It's a great example of much-needed infrastructure for scientific computing, education, digital humanities and data science. 2i2c aims to create hosted interactive computing environments that run in the cloud and are powered by entirely open-source technology (Jupyter). As I'm a Jupyter user and enthusiast, this makes me happy.

But while 2i2c is the awardee, it's being built on top of Jupyter. Is Jupyter also infrastructure? It needs investment too, doesn't it? There's a lot of overlap between the Jupyter team and the 2i2c team, so investment in one could be investment in the other. In fact, Chris Holdgraf, Executive Director of 2i2c, told me that "we see 2i2c as a way to both increase the impact of Jupyter in the research/education community, and a way to more sustainably drive resources back into the Jupyter community.".

Open Science Infrastructure Interdependency (from
“Scoping the Open Science Infrastructure Landscape in Europe”,
https://doi.org/10.5281/zenodo.4153809)

Where does Jupyter fit in the infrastructure landscape? It's nowhere to be seen on the neat "interdependency map" presented by SPARC EU at JROST. If 2i2c is an example of investment-worthy infrastructure, maybe the best way to think of Jupyter is "infra-infrastructure" - the open information infrastructure needed to build open information infrastructure. "Trickle-down" investment in this sort of infrastructure may be the best way to support projects like Jupyter so they stay open and are widely used.

But wait... Jupyter is built on top of Python, right? Python needs people investing in it, Is Python infra-infra-infrastructure? And Python is built on top of C (I won't even mention Jython or PyJS), right?? Turtles all the way down. Will 2i2c eventually get buried under other layers of infrastructure, be forgotten and underinvested in, only to be one day excavated and studied by technology archeologists?

Looking carefully at the interdependency map, I don't see a lot of layers. I see a network with lots of loops. And many of the nodes are connectors themselves. Orcid and CrossRef resemble roads, bridges and plumbing not because they're hidden underneath, but because they're visible and in-between. They exist because of the entities they connect cooperate to make the connection robust instead of incidental. They're not infra-infrastructure, they're inter-infrastructure. Trickle-down investment probably wouldn't work for inter-infrastucture. Instead, investments need to come from the communities that benefit so that the communities can decide how to manage and access to the inter-infrastructure to maximize the community benefit.

There's another type of infrastructure that needs investment. I work in ebooks, and a lot of overlapping communities have tackled their own special ebook problems. But the textbook people don't talk to the public domain people don't talk to the monograph people don't talk to the library people. (A slight exaggeration.) There are lots of "almost" solutions that work well for specific tasks. But with the total amount of effort being expended, we could some really amazing things... if only we were better at collaborating.

For example, the Jupyter folks have gotten funding from Sloan for the "Executable Book Project". This is really cool. Similarly, there's Bookdown, which comes out of the R community. And there are other efforts to give ebooks the functionality that a website could have. Gitbook is a commercial open-source effort targeting a similar space, Rebus, a non-profit, is using Pressbooks to gain traction in the textbook space, while MIT Press's PubPub has similar goals.

I'll call these overlapping efforts "para-infrastructure." Should investors in open infrastructure target investment in "rolling up" or merging these efforts? When private equity investors have done this to library automation companies the results have not benefited the user communities, so I'd say "NO!" but what's the alternative?

I've observed that the folks who are doing the best job of just making stuff work rarely have the time or resources to go off to conferences or workshops. Typically, these folks have no incentive to do the work to make their tools work for slightly different problems. That can be time consuming! But it's still easier than taking someone else's work and modifying it to solve your own special problem. I think the best way to invest in open para-infrastructure is to get lots of these folks together and give the time and incentive to talk and to share solutions (and maybe code.) It's hard work, but making the web of open infrastructure stronger and more resilient is what investment in open infrastructure is all about.

Different types of open infrastructure benefit from different styles of investment; I'm hoping that IOI will build on the directions exhibited by its Rapid Response Fund and invest effectively in infra-infrastructure, inter-infrastructure, and para-infrastructure.

Notes

1. Geoff Bilder and Cameron Neylon have a nice discussion of many of the issues in this post: “Bilder G, Lin J, Neylon C (2016) Where are the pipes? Building Foundational Infrastructures for Future Services, retrieved [date], http://cameronneylon.net/blog/where-are-the-pipes-building-foundational-infrastructures-for-future-services/ ‎”

2. "Trickle-down" has a negative connotation in economics, but that's how you feed a tree, right?

We should regulate virality

2020-10-19T21:29:00.002-04:00

It turns out that virality on internet platforms is a social hazard!

Living in the age of the Covid pandemic, we see around us what happens when we let things grow exponentially. The reason that the novel coronavirus has changed our lives is not that it's often lethal - it's that it found a way to jump from one infected person to several others on average, leading to exponential growth. We are infected with virus without regard to the lethality of the virus, but only its reproduction rate.

For years, websites have been built to optimize virality of content. What we see on Facebook or Twitter is not shown to us for its relevance to our lives, its education value, or even its entertainment value. It shown to us because it maximizes our "engagement" - our tendency to interact and spread it. The more we interact with a website, the more money it makes, and so a generation of minds has been employed in the pursuit of more engagement. Sometimes it's cat videos that delight us, but more often these days it's content that enrages and divides us.

Our dissatisfaction with what the internet has become has led calls to regulate the giants of the internet. A lot of the political discourse has focused on "section 20" https://en.wikipedia.org/wiki/Section_230 a part of US law that gives interactive platforms such as Facebook a set of rules that result in legal immunity for content posted by users. As might be expected, many of the proposals for reform have sounded attractive, but the details are typically unworkable in the real world, and often would have effects opposite of what is intended.

I'd like to argue that the only workable approaches to regulating internet platforms should target their virality. Our society has no problem with regulations that force restaurant, food preparation facilities, and even barbershops to prevent the spread of disease, and no one ever complains that the regulations affect "good" bacteria too. These regulations are a component of our society's immune system, and they are necessary for its healthy functioning.

Add caption

You might think that platform virality is too technical to be amenable to regulation, but it's not. That's because of the statistical characteristics of exponential growth. My study of free ebook usage has made me aware of the pervasiveness of exponential statistics on the internet. Sometime labeled the 80-20 rule, the Pareto principle, or log-normal statistics, it's the natural result of processes that grow at a rate proportional to their size. As a result, it's possible to regulate virality of platforms because only a very small amount of content is viral enough dominate the platform. Regulate that tiny amount of super-viral content, and you create incentive to moderate the virality of platforms. The beauty of doing this is that a huge majority of content is untouched by regulation.

How might this work? Imagine a law that removed a platform's immunity for content that it shows to a million people (or maybe 10 million - I've not sure what the cutoff should be). This makes sense, too; if a platform promotes illegal content in such a way that a million people see it, the platform shouldn't get immunity just because "algorithms"! It also makes it practical for platforms to curate the content for harmlessness- it won't kill off the cat videos! The Facebooks and Twitters of the world will complain, but they'll be able to add antibodies and T-cells to their platforms, and the platforms will be healthier for it. Smaller sites will be free to innovate, without too much worry, but to get funding they'll need to have plans for virality limits.

So we really do have a choice; healthy platforms with diverse content, or cesspools of viral content. Doesn't seem like such a hard decision!

Techdirt has excellent coverage of Section 230.

Notes on work-from-home teams

2020-09-06T20:46:00.003-04:00

I've been working from home full-time for over eleven years - at least partly work-from-home for 20 years. I've managed work-from-home teams, and worked with quite a few others on joint projects. So when some colleagues were sharing their work-from-home experiences, I piped up with some thoughts. When I was asked recently to repeat them, I realized it might be useful to make a list for the blog. Old-style.

SO...

In-person time is super-valuable. It builds a foundation for the digital interactions we're all stuck with for a while.
Engineers in particular are prone to under-communicate, so a manager has to pro-actively push people to communicate more than they would on their own ...
... and create a safe environment that promotes asking for help.
Most remote workers need an extra helping of encouragement and positive reinforcement...
... doubly so for people prone to self-doubt or imposter syndrome.
Worker depression is the hardest thing for a work-from-home team to manage.
Trust is the most important attribute for work-from-home teams, and it has to be mutual in any type of relationship.

I think most of these are self-explanatory. In the near-term current environment, the first point is not so helpful for teams that haven't banked some in-person time; non-work activities, remote meal-sharing and happy hours are imperfect substitutes for the real thing.

The point about worker depression is worth emphasizing. It's a real hazard, often without easy mitigations. For me, daily exercise and intentional social interaction are the most effective medicine, but everyone is different. A work-from-home team needs time, space, and often support to figure out what works.

Your Identity, Your Library

2019-12-03T14:45:00.000-05:00

Today, your identity on the Internet is essentially owned by the big email providers and social networks. Google, Yahoo, Facebook, Twitter - chances are you use one of these services to conveniently log into other services as YOU. You don't need to remember a new password for each service, and the service providers don't have to verify your "identity". What you gain in convenience, you lose in privacy, and that's turned out really well, hasn't it?

The "flow" you use to take advantage of this single sign-in is a "dance" that takes you from website to website and back to the site you're logging into. A similar dance occurs to secure access to resources licensed on you behalf by libraries, institutions, corporations, etc.. I wrote a bunch of articles about "RA21" (now rebranded as the vaguely NSFW "SeamlessAccess"), an effort spearheaded by STM publishers to improve the user experience of that dance. (It can be complicated and confusing because there are lots of potential dance partners!)

Henri Matisse, La danse (first version) 1909

These dance partners style themselves as "identity providers". That label makes me uncomfortable. Identity can't be something that can be stripped from you by on the whim of a megacorporation. Instead, internet identity should be woven from a web of relationships. These can be formed digitally or face-to-face, global or local, business or personal.

You'd have thunk that the whole identity-on-the-internet thing would have improved in the 13 years since that login dance was first rolled out. And you'd be almost right, because a new architecture for internet identity is now on the horizon. Made possible by many of the same technologies that are securing the internet and inflating the blockchain bubble, massively distributed and even "self-sovereign identity" are becoming real-ish.

These technologies will inevitably be applied to the access authorization problem. Access via distributed identity replaces the website-to-website dance with the presentation of some sort of signed credential. A service provider verifies the signature against the signer's public key. It's like showing a passport that can't be forged. A tricky bit is that the credential also needs to be checked against a list of revoked credentials. This would have been cumbersome even ten years ago, but distributed databases are now a mature technology, versions of which underpin the internet itself.

Interlinked with the concept of distributed identity is the notion that users of the web should be able to securely control their data, and that decisions about what a web site gets to know about you should not be delegated to advertising networks.

Unfortunately, we're not quite ready for distributed identity, in the sense that implementation for today's web would require users to install plugin software, which has its own set of usability, privacy and security issues. The ideal situation would be for some sort of standardized distributed identity and secure data management capability to be installed in browser software - Chrome, Firefox, Safari, etc.

There's a lot of work going on to make this happen.

ID2020 has put out an identity manifesto that starts with the declaration that "The ability to prove one’s identity is a fundamental and universal human right."
Tim Berners-Lee is leading the Solid Project, which let's you "move freely between services, reuse data across apps, connect with anyone, and select what you share precisely".
The W3C Verifiable Claims Working Group has published Technical Recommendations for "Verifiable Credential Use Cases and a "Verifiable Credential Data Model". They observe that "from educational records to payment account access, the next generation of web applications will authorize entities to perform actions based on rich sets of credentials issued by trusted parties."
The Sovrin Network is a "new standard for digital identity – designed to bring the trust, personal control, and ease-of-use of analog IDs – like driver’s licenses and ID cards – to the Internet."
Kaliya Young, Doc Searls and Phil Windley have been convening the Internet Identity Workshop twice a year since 2005 to create a community centered around internet identity. A glance at prior year proceedings gives a flavor of how much is happening in the field

The common thread here is that users, not unaccountable third parties, should be able to manage their identity on the internet, while at the same time creating a global chain of trust.

It seems to me that there's a last-mile problem with all these schemes. If identity is really a universal human right, how do we create a chain of trust that can include every human? That problem becomes a lot easier to solve if there were some sort of organization with a physical presence in communities all over, trusted by the community and by other organizations. A sort of institution experienced in managing information access and privacy, and devoted to the needs of all sorts of users.

In other words, what if "libraries" existed?

The federated authentications systems used by libraries today - Shibboleth, Athens, and related systems use a dance similar to what you do with Google or Facebook. It's a big step that moves your internet identity away from "surveillance capitalists" towards community institutions. But you still don't have control over what data your institution give away, as you will in the next-generation internet identity systems I describe here. (RA21 is no different from Shib or Athens in this respect.)

What might libraries do to prepare for the age of distributed identity? The first step is not about technology, it's about mission. I believe libraries should start to think of themselves as internet relationship providers for their communities. When I get access to a resource though my library, I won't be "logging in", I'll be asserting a relationship with a library community, and the library will be standing behind me. Joining an identity federation is a good next step for libraries. But the library community needs to advocate for user identity as a basic human right and prepare their systems to support a future where no dancing is required.

Update 12/5/2019: revised last two paragraphs to be less mystifying.

Four-Leaf Clovers

2019-07-26T20:31:00.000-04:00

It seems a friend of mine collects four-leaf clovers.

When I was a kid, I loved looking for four-leaf clovers in the lawn. It was the same sort of relaxing concentration and observation you use to find a piece of a jigsaw puzzle. But one day, I found a clover plant in front of the garage that had multiple four-leaf clovers. Looking carefully, I found that not only were there four leaf clovers, but there were FIVE-LEAF-CLOVERS. I had hit the jackpot. And even a six-leaf clover!!!! I swear to all of God's integers that I even found a SEVEN leaf clover. I saved that seven leaf clover in my box of treasures for years, until I just had seven crumbling leafs of a clover.

I never looked for a four-leaf clover again.

Now, whenever I remember that clover plant (and that garage), I think of the toxins that must have caused the polyfoliate abomination.

Please don't let my story stop you looking for four-leaf clovers! Happy summer!

Responding to Critical Reviews

2019-05-30T19:24:00.001-04:00

The first scientific paper I published was submitted to Physical Review B, the world's leading scientific journal in condensed matter physics. Mailing in the manuscript felt like sending my soul into a black hole, except not even Hawking radiation would came back. A seemingly favorable review returned a miraculous two months later:

"I found this paper interesting, and I think it probably eventually it should be published - but only after Section II is revamped and section III clarified."

I made a few minor revisions and added some computations that had been left out of the first version, then confidently resubmitted the paper. But another two months later, I received the second review. The referee hadn't appreciated that I had deflected the review's description of "fundamental logic flaws and careless errors" that made my paper "extremely confusing". The reviewer went on to say "I do not think the authors' new variational calculation is correct" and suggested that my approach was completely wrong.

My thesis advisor suggested that I go and talk to Bob Laughlin in the Physics department about how to deal with the stubborn referee. I had been collaborating with Bob and one of his students on a related project, and he had become a surrogate advisor for my theoretical endeavors. During that time, Bob had acquired a reputation among my fellow students for asking merciless questions at oral exams; many of us were scared of him.

Bob's lesson on how to deal with a difficult referee turned out to be one of the most useful things I learned in grad school. Referees, he told me, come in 2 varieties, complete idiots, and not-complete-idiots. (Yes, Bob was merciless.) If your referee is a complete idiot, all you can do is ask for a different referee. If your referee has the least bit of sense, then you have to take the attitude that either the referee is somewhat correct, and you think YES-SIR MISTER REFEREE SIR! (Bob had been in the Army) and do whatever the referee says to do, or you take the point of view that you have explained something so poorly that the referee, who is an excellent representative of your target audience, had no hope of understanding it. Either way, there was a lot of work to do. We decided that this referee was not an idiot, and I needed to go back to the drawing board and re-do my calculation, figuring out how to be clearer and more correct in my exposition.

A third review came back with the lovely phrase "The significance of the calculation of section II, which is neither fish nor fowl, remains unclear." Using Bob's not-idiot rule, I recognized that my explanation was still unclear and I worked even harder to improve the paper.

My third revised version was accepted and published. Bob later won the Nobel Prize. I'm here writing blog posts for you about RA21.

RA21 received 120 mostly critical reviews from a cross-section of referees, not a single one of whom is the least bit an idiot. Roughly half the issues fell into the badly-explained category, while the other half fell in the "fundamental flaws and careless errors" category. RA21 needs to go back to the chalkboard and rethink even their starting assumptions before they can move forward with this much-needed effort.

RA21: Technology is not the problem.

2019-05-17T11:42:00.000-04:00

RA21 vows to "improve access to institutionally-provided information resources". The barriers to access are primarily related to the authorization of such access in the context of licensing agreements. In a perfect world, trust and consensus between licensors and licensing communities would render authorization technology irrelevant. In the real world, technological controls need to build upon good-faith agreements and the consent of community members. Also in the real world, poorly implemented technology erodes that good-faith and consent.

The RA21 draft recommended practice focuses on technology and technology implementations, all the while failing to consider how to build the trust that underpins good-faith and consent. Service providers need to trust that identity providers faithfully facilitate authorized users and that the communities that identity providers serve will adhere to licensing agreements; users of information resources need to trust that their usage data will not be tracked and sold to the highest bidder.

Trust is not created out of thin air and certainly not by software. Technology can provide tools that facilitate trust, but shared values and communication between parties is the raw material of trust. An effective program to improve access must include processes and procedures that develop shared values and promote cooperation.

I recognize that RA21 has chosen to consider only the authentication intercourse as in-scope. But the draft recommendation has identified several areas of "further work". Included in this further work should be areas where community standards and best practices can enhance trust around authentication and authorization. To name two examples:

A set of best practices around "incident response" would in practice work much better than a "guiding principle" of "end-to-end traceability".
A set of best practices around auditing of security and privacy procedures and technology at service providers and identity providers would materially address the privacy and security concerns that the draft recommendation punts over to cited reports and studies.

This is the fifth and last of my comments submitted as part of the NISO standards process. The 102+ comments that have been submitted so far represent a great deal of expertise and real-world experience. My previous comments were about secure communication channels, potential phishing attacks, the incompatibility of the recommended technical approach with privacy-enhancing browser features, and the need for radical inclusiveness. I've posted the comments here so you can easily comment.

Update July 22, 2019:

RA21's official response to this comment is:

We agree that technology is not the primary problem. There are two core issues that RA21 is seeking to address - firstly the current user experience of federated authentication needs to be improved, and this comprises the bulk of our recommendations. Secondly, considerable trust has been established between identity providers and service providers through their mutual particpation in identity federations and we are recommending broader particpation in identity federations where they do not exist. The understanding and acceptance of this trust model is not universal among all stakeholder groups particularly withing IdP organisations and through ongoing dialog and outreach during the implementation phase, RA21 hopes to address this deficit. Finally, we have added a section to the recommendations addressing security incident response and adoption of an operational security baseline by particpants.

OK.

RA21 doesn't address the yet-another-WAYF problem. Radical inclusiveness would.

2019-05-13T18:51:00.000-04:00

The fundamental problem with standards is captured by XKCD 927.

XKCD https://xkcd.com/927/

Single sign-on systems have the same problem. The only way for a single sign-on system to deliver a seamless user experience is to be backed by a federated identity system that encompasses all use cases. For RA-21 to be the single button that works for everyone, it must be radically inclusive. It must accommodate a wide variety of communities and use cases.

Unfortunately, the draft recommended practice betrays no self-awareness about this problem. Mostly, it assumes that there will be a single "access through your institution" button. While it is certainly true that end-users have more success when presented with a primary access method, it's not addressed how RA-21 might reach that state.

Articulating a radical inclusiveness principle would put the goal of single-button access within reach. Radical inclusiveness means bringing IP-based authentication, anonymous access, and access for walk-ins into the RA-21 tent. Meanwhile the usability and adoption of of SAML-based systems would be improved; service providers who require "end-to-end traceability" could achieve this in the context of their customer agreements; it needn't be a requirement for the system as a whole.

Radical inclusiveness would also broaden the user base and thus financial support for the system as a whole. We can't expect a 100,000 student university library in China to have the same requirements or capabilities as a small hospital in New Jersey or a multinational pharmaceutical company in Switzerland, even though all three might need access to the same research article.

This is my fourth comment on the RA-21 draft "Recommended Practices for Improved Access toInstitutionally-Provided Information Resources". The official comment period ends Friday. This comment, 57 others, and the add-comment form can be read here. My comments so far are about secure communication channels, potential phishing attacks, and the incompatibility of the recommended technical approach with privacy-enhancing browser features. I'm posting the comments here so you can easily comment. I'll have one more comment, and then a general summary.

Update July 10, 2019:

RA21's official response to this comment is:

RA21 envisages supporting the anonymous and walk-in use cases via federated authentication. It is anticpated that federated authentication and IP authentication will exist side-by-side during a transition phase. The specifics of the User Experience during the transition phase will need to be determined during implementation; however it is likely that the RA21 button will simply not need to be displayed to users who are IP authenticated.

I suppose self-awareness was a big ask. The revised recommendation includes some "envisaging" of use cases that was glaring by omission in the draft recommendation. The added section 2.1.1., Employ appropriate authentication mechanisms for specific use cases, is an improvement on the draft; but the revised recommendation has not retreated from its end-to-end traceability "guiding principle".

RA21 used the same response for a comment by Ohio State's, Jennifer Vinopal:

I want to reiterate a point that a number of commenters have already mentioned: there is no discussion of how public or walk-in (or other unauthenticated/unauthenticating) users will get access to resources through RA21. Public libraries, as well as many college and research libraries, negotiate our e-resource licenses to provide access to walk-in users who aren?t represented in our IdM systems.

Don't forget, EZProxy was supposed to be a transition phase!

RA21's recommended technical approach is broken by emerging browser privacy features

2019-05-08T10:53:00.000-04:00

This is my third comment about the recently published NISO draft "Recommended Practice" (RP) on "Improved Access to Institutionally-Provided Information Resources" a. k. a. "Resource Access in the 21st Century" (RA21). Official comments can be submitted until May 17th. My first comment concerned the use of secure communication channels. The second looked at potential phishing attacks on the proposed system. I'm posting the comments here so you can easily comment.

RA21's recommended technical approach is broken by emerging browser privacy features

Third party cookies are widely on the web used as trackers, or "web bugs", by advertising networks wishing to target users with advertising on the web. The impact of these trackers on privacy has been widely reported and decried. Browser local storage deployed using 3rd-party iframes is similarly employed for user tracking by ad networks. Browser vendors, led by Apple, have fought back against user tracking by providing user options to limit third party information sharing. Apple's "Intelligent Tracking Protection" has progressively increased the barriers to cross-site information storage, for example, by partitioning the local storage according to third-party context.

Unfortunately for RA21, the draft recommended practice (RP) has endorsed a technical approach which mirrors the tactics used for user tracking by the advertising industry. For this reason, users of Safari who choose to enable the "prevent cross-site tracking" option may not benefit from the "seamless" access promised by RA21 if implemented with the endorsed technical approach.

Wikimedia commons

The optimistically acronymed "P3W" pilot used a javascript library called "Krakenjs/zoid" (According to the Norse sagas, the kraken is a squidlike monster that terrorizes voyagers) to exchange data between cross-domain contexts. The limitations on krakenjs in Safari are acknowledged by the library's developer. It works by having the host webpage create an iframe loaded from a P3W website. With privacy controls off, the web page posts to the iframe, which answers with a reference to the user's identity provider. The service provider website uses that information to help the user authenticate without having to search through a huge list of identity providers. With Safari privacy features turned on, the search process must be repeated for each and every service provider domain.

Other browser vendors have moved towards restricting tracking behaviour. Firefox has announced that it will phase in "enhanced tracking protection"
Even Google's Chrome browser is moving towards restrictions on tracking technologies.

The bottom line is that if RA21 is implemented with the recommended technical approach, library users will probably be required to turn off privacy enhancing features of their browser software to use resources in their library. As a result, RA21 will have difficulty moving forward with community consensus on this technical approach.

Browser software is much more tolerant of cross-domain communication when the information "hub" is a first-party context (i.e. a window of its own, not an embedded iframe), as is done in more established authentication schemes such as OpenID Connect and SAML flow. RA21 should refocus its development effort on these technical approaches.

Update July 5, 2019:

RA21's official response to this comment is:

Future work includes storage policy notification. Also, we are not actually using third party cookies even though this term is often used to describe several cross-domain access patterns; instead, RA21 recommends using web storage (aka, browser local storage) together with HTML5 post-message for cross-domain access. This is the same mechanism (and indeed the same implementation) that PayPal uses, thus demonstrating broad browser support. A description of web storage has been added to the Terminology section. We are aware that by turning off "third party cookies" it is possible for the user to partly or completely disable the call to action button but in those cases the user experience degrades gracefully to a classical SAML/OpenIDC discovery flow.

Essentially the same response was made to three other submitted comments. Two of them, from Duke's Tim McGeary, called out two sections of the recommended practice and noted:

Word of caution: this login specifically cannot happen in an iFrame to meet SSO security protocol

The third, from Cornell University Library, submitted by Adam Chandler, amplified on McGeary:

Comment from Cornell University Library Privacy as a Service Working Group. Our group includes membership drawn from Library IT, Library Licensing, Library Public Services, Cornell IT Security, and Cornell Privacy Office.

Under 2.4.: We agree with Tim McGreary's comment (#862 or #863 - seems that he double-posted it) that the SSO login shouldn't be inside a frame on another page. There are security issues with that kind of approach. The users can't see the login page URL to verify that the page is a x.uni.edu page before entering their passwords, so it makes it easier to spoof the login page. Generally, login pages use "framebusting" to prevent this kind of possibility.

RA21's response on this issue is alarming, and suggests that the whole project is in danger of failure. RA21 seems to be unaware that using HTML5 web storage is worse than 3rd party cookies in many respects - particularly privacy and security. Currently, only Safari defaults to "a classical SAML/OpenIDC discovery flow", but that still means that if they want to be accurate, they'll have to rename the implementing organization "The Coalition for Seamless Access but Not on iOS" or "The Coalition for Problematic Access".

I hope that the beta implementation will be executed by a team with the experience and competence to override or at least effectively mitigate RA21's technical blunder.

RA21 Draft RP session timeout recommendation considered harmful

2019-05-06T22:21:00.000-04:00

Hey everybody, I implemented RA21 for access to the blog!

Well, that was fun.

I'm contributing comments about the recently published NISO draft "Recommended Practice" (RP) on "Improved Access to Institutionally-Provided Information Resources" a. k. a. "Resource Access in the 21st Century" (RA21). Official comments can be submitted until May 17th. The draft has much to recommend it, but it appears to have flaws that could impair the success of the effort. My first comment concerned the use of secure communication channels. I expect to write two more. I'm posting the comments here so you can easily comment.

RA21 Draft RP session timeout recommendation considered harmful

RA21 hopes to implement a user authentication environment which allows seamless single sign-on to a large number of service provider websites. Essential to RA21's vision is to replace a hodge-podge of implementations with a uniform, easily recognizable user interface.

While a uniform sign-in flow will be a huge benefit to end users, it introduces an increased vulnerability to an increasingly common type of compromise, credential phishing. A credential phishing attack exploits learned user behavior by presenting the user with a fraudulent interface cloned from a legitimate service. The unsuspecting user enters credentials into the fraudulent website without ever being aware of the credential theft. RA21 greatly reduces the difficulty of a phishing attack in three ways:

Users will learn and use the same sign-in flow for many, perhaps hundreds, of websites. Most users will occasionally encounter the RA21 login on websites they have never used before.
The uniform visual appearance of the sign-in button and identity provider selection step will be trivial to copy. Similarly, a user's previously selected identity provider will often be easy for an attacker to guess, based on the user's IP address.
If successful, RA21 may be used by millions of authorized users, making it difficult to detect unauthorized use of stolen credentials.

If users are trained to enter password credentials even once per day, they are unlikely to notice when they are asked for identity provider credentials by a website crafted to mimic a real identity provider.

For this very reason, websites commonly used for third party logins, such as Google and Facebook, use timeouts much longer than the 24 hour timeouts recommended by the RA21 draft RP. To combat credential theft, they add tools such as multi-factor authentication and insert identity challenges based on factors such as user behavior and the number of devices used by an account.

Identity providers participating in RA21 need to be encouraged to adopt these and other anti-phishing security measures; the RA21 draft's recommended identity provider session timeout (section 2.7) is not in alignment with these measures and is thus counterproductive. Instead, the RP should encourage long identity provider session timeouts, advanced authentication methods, and should clearly note the hazard of phishing attacks on the system. Long-lived sessions will result in better user experience and promote systemic security. While the RP cites default values used in Shibboleth, there is no published evidence that these parameters have suppressed credential theft; the need for RA21 suggests that the resulting user experience has been far from "seamless".

Update July 3, 2019:

RA21's official response to this comment is:

We disagree with premise that consumer websites adopt long sign-in timeouts as a Phishing protection measure. That said, IdPs should follow best practices such as HTTPS so users can verify that they are on a valid sign in page. Length of validity of sign-in is also by necessity context dependent.

Well, yeah. I wasn't expecting them to actually consult real people who battle identity theft on consumer websites. I was mostly amazed that sign-in timeouts would be considered in-scope for RA21 while HTTPS, which will be essential to RA21's success or failure, was not. But the RA21 recommendation will have no effect whatsoever on what identity providers do, unless perhaps existing identity providers are making timeouts ridiculously short. Identity providers know their context much better than any committee and they will do what they want to do. And they should!

Interestingly, a section (2.8. Establish Security Incident Reporting Frameworks) has been added to the revised recommendation that acknowledges credential phishing as a motivation for RA21! So, yay RA21!

RA21 RP does not require secure protocols. It should.

2019-05-05T22:10:00.000-04:00

As I've written, "RA21" could be a Good Thing, or it could be a disaster. The RA21 working group has released its "Recommended Practice" draft for comments, until May 17. The draft has much to like, but also has significant flaws. I will be contributing comments to address the flaws I see, which I will also publish here so we can discuss and comment. My official comments, and many others worth reading are here.

Here's my first comment, perhaps the most predictable:

RA21 RP does not require secure protocols. It should.

RA21 envisions the creation of a widely deployed authentication and authorization for resources and tools serving the research community. In such an ecosystem, the health and security of the entire system can be degraded by a small number of weak implementations. In particular, delivering resources over insecure unencrypted channels will be harmful.

In this context it is surprising that the RA21 recommended practice (RP) fails to directly address the need for service providers and identity providers to use secure channels such as HTTPS for websites. The recommended practice makes indirect reference to this need by citing another document, "WAYF Cloud and P3W Security & Privacy Recommendations". This document fails to treat secure channels as a requirement, saying in analyzing the pilot implementations (italics added):

"All browser traffic should use secured protocols, such as https, to prevent unauthorized access and to preserve confidentiality." (WAYF cloud, page 13)

"All browser traffic should use secured protocols such as https to prevent unauthorized access and to preserve confidentiality." (P3W, page 18)

In contrast to the "should" used for secure communications, the analysis uses the stronger "must" in other places, for example,

"Therefore, applications must include strong controls to prevent user ID tampering and abuse "(Information Disclosure, page 7)

Security and privacy issues essential to the success of RA21 should not be buried in technical analyses of uncertain normativity. Secure channels should not be optional, they must be required.

Update July 2, 2019:

RA21's official response to this comment is

We agree that HTTPS everywhere is a good idea for tools and resources serving the research community. However, a specific recommendation on this would be outside of the scope of RA21.

This response strikes me as uninformed, considering that the recommendation promotes a technical solution that will likely require publishers to adopt HTTPS. Either the committee is unaware of the technical ramifications of their recommendations (very likely), or they're trying to hide from the publishing community the inconvenient fact that RA21 will require all of them to go HTTPS (I wish).

Really, all I was hoping for some bland indication that RA21 will not compromise system privacy and security to accommodate the laggards of the service provider community. Since that didn't happen, I'll do some shouting here:

ATTENTION PUBLISHERS! RA21 WON'T TELL YOU IT'S GOING TO REQUIRE HTTPS ON YOUR SITE. I JUST DID.

Fudge, and open access ebook download statistics

2019-04-03T12:28:00.005-04:00

If you found out that the top 50 authors born in Gloucestershire, England average over 10 million copies sold, you might think that those authors are doing pretty well. But it's silly to compute averages like that. When you compute an average over a population, you're making an assumption that the quantity you're averaging over is statistically distributed somehow over the population. Unless of course you don't care if the average means anything, and you just want numbers to help justify an agenda.

Most folks would look at the list of Gloucestershire authors and say that one of the authors is an outlier, not representative of Gloucestershire authors in general. And so J.K. Rowling, with more than 500 million copies sold, would get removed from the data set, revealing the presumably unimpressive book selling record of the "more representative" authors. Scientists refer to this process as "fudging the data". It's done all the time, but it's not honest.

There's a better way. If a scientific study presents averages across a population, it should also report statistical measures such as variance and standard deviation, so the audience can judge how meaningful the reported averages are (or aren't!).

Other times, the existence of "outliers" is evidence that the numbers are better measured and compared on a different scale. Often, that's a logarithmic scale. For example, noise is measured on a logarithmic scale, in units of decibels. An ambulance siren has a million times the noise power of normal conversation, but it's easier to make sense of that number if we compare the 60 dB sound volume of conversation to the 90 dB of a hair dryer, the 120 dB of the siren and the 140 dB of a jet engine. Similarly, we can understand that while J.K. Rowling's sales run into 8 figures, most top Gloucestershire-born authors are probably 3, 4 and or maybe 5 figure sellers.

Over the weekend, I released a "preprint" on Humanities Commons, describing my analysis of open-access ebook usage data. I worked with a wonderful team including two open-access publishers, University of Michigan Press and Open Book Publishers, on this project, which was funded by the Mellon Foundation. To boil down my analysis to two pithy points, the preprint argues:

Free ebook downloads are best measured on a logarithmic scale, like earthquakes and trade publishing sales.
We shouldn't average download counts.

If you take the logarithm of book downloads, the histogram looks like a bell curve!

For example, if someone tries to tell you that "Engineering, mathematics and computer science OA books perform much better than the average number of downloads for OA books across all subject areas" without telling you about variances of the distributions and refusing to release their data, you should pay them no mind.

Next week, I'll have a post about why logarithmic scales makes sense for measuring open-access usage, and maybe another about how log-normal statistics could save civilization.

On the Surveillance Techno-state

2018-12-31T18:37:00.001-05:00

I used to run my own mail server. But then came the spammers. And dictionary attacks. All sorts of other nasty things. I finally gave up and turned to Gmail to maintain my online identities. Recently, one of my web servers has been attacked by a bot from a Russian IP address which will eventually force me to deploy sophisticated bot-detection. I'll probably have to turn to Google's recaptcha service, which watches users to check that they're not robots.

Isn't this how governments and nations formed? You don't need a police force if there aren't any criminals. You don't need an army until there's a threat from somewhere else. But because of threats near and far, we turn to civil governments for protection. The same happens on the web. Web services may thrive and grow because of economies of scale, but just as often it's because only the powerful can stand up to storms. Facebook and Google become more powerful, even as civil government power seems to wane.

When a company or institution is successful by virtue of its power, it needs governance, lest that power go astray. History is filled with examples of power gone sour, so it's fun to draw parallels. Wikipedia, for example, seems to be governed like the Roman Catholic Church, with a hierarchical priesthood, canon law, and sacred texts. Twitter seems to be a failed state with a weak government populated by rival factions demonstrating against the other factions. Apple is some sort of Buddhist monastery.

This year it became apparent to me that Facebook is becoming the internet version of a totalitarian state. It's become so ... needy. Especially the app. It's constantly inventing new ways to hoard my attention. It won't let me follow links to the internet. It wants to track me at all times. It asks me to send messages to my friends. It wants to remind me what I did 5 years ago and to celebrate how long I've been "friends" with friends. My social life is dominated by Facebook to the extent that I can't delete my account.

That's no different from the years before, I suppose, but what we saw this year is that Facebook's governance is unthinking. They've built a machine that optimizes everything for engagement and it's been so successful that they they don't know how to re-optimize it for humanity. They can't figure out how to avoid being a tool of oppression and propaganda. Their response to criticism is to fill everyone's feed with messages about how they're making things better. It's terrifying, but it could be so much worse.

I get the impression that Amazon is governed by an optimization for efficiency.

How is Google governed? There has never existed a more totalitarian entity, in terms of how much it knows about every aspect of our lives. Does it have a governing philosophy? What does it optimize for?

In a lot of countries, it seems that the civil governments are becoming a threat to our online lives. Will we turn to Wikipedia, Apple, or Google for protection? Or will we turn to civil governments to protect us from Twitter, Amazon and Facebook. Will democracy ever govern the Internet?

Happy 2019!

Towards Impact-based OA Funding

2018-12-27T14:30:00.000-05:00

Earlier this month, I was invited to a meeting sponsored by the Mellon Foundation about aggregating usage data for open-access (OA) ebooks, with a focus on scholarly monographs. The "problem" is that open licenses permit these ebooks to be liberated from hosting platforms and obtained in a variety of ways. A scholar might find the ebook via a search engine, on social media or on the publisher's web site; or perhaps in an index like Directory of Open Access Books (DOAB), or in an aggregator service like JSTOR. The ebook file might be hosted by the publisher, by OAPEN, on Internet Archive, Dropbox, Github, or Unglue.it. Libraries might host files on institutional repositories, or scholars might distribute them by email or via ResearchGate or discipline oriented sites such as Humanities Commons.

I haven't come to the "problem" yet. Open access publishers need ways to measure their impact. Since the whole point of removing toll-access barriers is to increase access to information, open access publishers look to their usage logs for validation of their efforts and mission. Unit sales and profits do not align very well with the goals of open-access publishing, but in the absence of sales revenue, download statistics and other measures of impact can be used to advocate for funding from institutions, from donors, and from libraries. Without evidence of impact, financial support for open access would be based more on faith than on data. (Not that there's anything inherently wrong with that.)

What is to be done? The "monograph usage" meeting was structured around a "provocation": that somehow a non-profit "Data Trust" would be formed to collect data from all the providers of open-access monographs, then channel it back to publishers and other stakeholders in privacy-preserving, value-affirming reports. There was broad support for this concept among the participants, but significant disagreements about the details of how a "Data Trust" might work, be governed, and be sustained.

Why would anyone trust a "Data Trust"? Who, exactly, would be paying to sustain a "Data Trust"? What is the product that the "Data Trust" will be providing to the folks paying to sustain it? Would a standardized usage data protocol stifle innovation in ebook distribution? We had so many questions, and there were so few answers.

I had trouble sleeping after the first day of the meeting. At 4 AM, my long-dormant physics brain, forged in countless all-nighters of problem sets in college, took over. It proposed a gendanken experiment:

What if there was open-access monograph usage data that everyone really trusted? How might it be used?

The answer is given away in the title of this post, but let's step back for a moment to provide some context.

For a long time, scholarly publishing was mostly funded by libraries that built great literature collections on behalf of their users - mostly scholars. This system incentivized the production of expensive must-have journals that expanded and multiplied so as to eat up all available funding from libraries. Monographs were economically squeezed in this process. Monographs, and the academic presses that published them, survived by becoming expensive, drastically reducing access for scholars.

With the advent of electronic publishing, it became feasible to flip the scholarly publishing model. Instead of charging libraries for access, access could be free for everyone, while authors paid a flat publication fee per article or monograph. In the journal world, the emergence of this system has erased access barriers. The publication fee system hasn't worked so well for monographs, however. The publication charge (much larger than an article charge) is often out of reach for many scholars, shutting them out of the open-access publishing process.

What if there was a funding channel for monographs that allocated support based on a measurement of impact, such as might be generated from data aggregated by a trusted "Data Trust"? (I'll call it the "OA Impact Trust", because I'd like to imagine that "impact" rather than a usage proxy such as "downloads" is what we care about.)

Here's how it might work:

Libraries and institutions register with the OA Impact Trust, providing it with a way to identify usage and impact relevant to the library or institutions.
Aggregators and publishers deposit monograph metadata and usage/impact streams with the Trust.
The Trust provides COUNTER reports (suitably adapted) for relevant OA monograph usage/impact to libraries and institutions. This allows them to compare OA and non-OA ebook usage side-by-side.
Libraries and institutions allocate some funding to OA monographs.
The Trust passes funding to monograph publishers and participating distributors.

The incentives built into such a system promote distribution and access. Publishers are encouraged to publish monographs that actually get used. Authors are encouraged to write in ways that promote reading and scholarship. Publishers are also encouraged to include their backlists in the system, and not just the dead ones, but the ones that scholars continue to use. Measured impact for OA publication rises, and libraries observe that more and more, their dollars are channeled to the material that their communities need.

Of course there are all sorts of problems with this gedanken OA funding scheme. If COUNTER statistics generate revenue, they will need to be secured against the inevitable gaming of the system and fraud. The system will have to make judgements about what sort of usage is valuable, and how to weigh the value of a work that goes viral against the value of a work used intensely by a very small community. Boundaries will need to be drawn. The machinery driving such a system will not be free, but it can be governed by the community of funders.

Do you think such a system can work? Do you thing such a system would be fair, or at least fairer than other systems? Would it be Good, or would it be Evil?

Notes:

Details have been swept under a rug the size of Afghanistan. But this rug won't fly anywhere unless there's willingness to pay for a rug.
The white paper draft which was the "provocation" for the meeting is posted here.
I've been thinking about this for a while.

A Milestone for GITenberg

2018-10-30T11:45:00.000-04:00

We've reached a big milestone for the GITenberg Project, which comes after a lot of work over 6 years by several groups of people. It's now ready to use!

GITenberg is a prototype that explores how Project Gutenberg might work if all the Gutenberg texts were on Github, so that tools like version control, continuous integration, and pull-request workflow could be employed. We hope that Project Gutenberg can take advantage of what we've learned; work in that direction has begun but needs resources and volunteers. Go check it out!

It's hard to believe, but GITenberg started 6 years ago when Seth Woodworth started making Github repos for Gutenberg texts. I joined the project two years later when I started doing the same and discovered that Seth was 43,000 repos ahead of me. The project got a big boost when the Knight Foundation awarded us a Prototype Fund grant to "explore the applicability of open-source methodologies to the maintenance of the cultural heritage" that is the Project Gutenberg collection. But there were big chunks of effort left to finish the work when that grant ended. Last year, six computer-science seniors from Stevens Institute of Technology took up the challenge and brought the project within sight of a major milestone (if not the finishing-line). There remained only the reprocessing of 58,000 ebooks (with more being created every day!). As of last week, we've done that! Whew.

So here's what's been done:

Almost 57,000 texts from Project Gutenberg have been loaded into Github repositories.
EPUB, PDF, and Kindle Ebooks have been rebuilt and added to releases for all but about 100 of these.
Github webhooks trigger dockerized ebook building machines running on AWS Elastic Beanstock every time a git repo is tagged.
Toolchains for asciidoc, HTML and plain text source files are running on the ebook builders.
A website at https://www.gitenberg.org/ uses the webhooks to index and link to all of the ebooks.
www.gitenberg.org presents links to Github, Project Gutenberg, Librivox, and Standard Ebooks.
Cover images are supplied for every ebook.
Human-readable metadata files are available for every ebook
Syndication feeds for these books are made available in ONIX, MARC and OPDS via Unglue.it.

Everything in this project is built in the hope that the bits can be incorporated into Project Gutenberg wherever appropriate. In January 2019, the US public domain will resume the addition of new books, so it's more important than ever that we strengthen the infrastructure that supports it.

Some details:

All of the software that's been used is open source and content is openly licensed.
PG's epubmaker software has been significantly strengthened and improved.
About 200 PG ebooks have had fatal formatting errors remediated to allow for automated ebook file production.
1,363 PG ebooks were omitted from this work due to licensing or because they aren't really books.
PG's RDF metadata files were converted to human-readable YAML and enhanced with data from New York Public Library and from Wikipedia.
Github API throttling limits the build/release rate to about 600 ebooks/hour/login. A full build takes about 4 full days with one github login.

Acknowledgements:

Seth Woodworth. In retrospect, the core idea was obvious, audacious, and crazy. Like all great ideas.
Github tech support. Always responsive.
The O'Reilly HTMLBook team. The asciidoc toolchain is based on their work.
Plympton. Many asciidoc versions were contributed to GITenberg as part of the "Recovering the Classics" project. Thanks to Jenny 8. Lee, Michelle Cheng, Max Pevner and Nessie Fox.
Albert Carter and Paul Moss contributed to early versions of the GITeneberg website.
The Knight Foundation provided funding for GITenberg at a key juncture in the project's development though its prototype fund. The Knight Foundation supports public-benefitting innovation in so many ways even beyond the funding it provides, and we thank them with all our hearts.
Travis-CI. The first version of automated ebook building took advantage of Travis-CI. Thanks!
Raymond Yee got the automated ebook building to actually work.
New York Public Library contributed descriptions, rights info, and generative covers. They also sponsored hackathons that significantly advanced the environment for public domain books. Special thanks to Leonard Richardson, Mauricio Giraldo and Jens Troeger (Bookalope).
My Board at the Free Ebook Foundation: Seth, Vicky Reich, Rupert Gatti, Todd Carpenter, Michael Wolfe and Karen Liu. Yes, we're overdue for a board meeting...
The Stevens GITenberg team: Marc Gotliboym, Nicholas Tang-Mifsud, Brian Silverman, Brandon Rothweiler, Meng Qiu, and Ankur Ramesh. They redesigned the gitenberg.org website, added search, added automatic metadata updates, and built the dockerized elastic beanstalk ebook-builder and queuing system. This work was done as part of their two-semester capstone (project) course. The course is taught by Prof. David Klappholz, who managed a total of 23 student projects last academic year. Students in the course design and develop software for established companies, early stage startups, nonprofits, gov't agencies, etc., etc. Take a look at detailed information about software that has been developed over the past 6-7 years and details of how the course works.
Last, but certainly not least, Greg Newby (Project Gutenberg) for consistent encouragement and tolerance of our nit-discovery, Juliet Sutherland (Distributed Proofreaders) for her invaluable insights into how PG ebooks get made, and to the countless volunteers at both organizations who collectively have made possible the preservation and reuse of our public domain.

I'm sure I've omitted an important acknowledgement or two - please let me know so I can rectify the omission.

So what's next? As I mentioned, we've taken some baby steps towards applying version control to Project Gutenberg. But Project Gutenberg is a complex organism, and implementing profound changes will require broad consensus-building and resource gathering (both money and talent). Project Gutenberg and the Free Ebook Foundation are very lean non-profit organizations dependent on volunteers and small donations. What's next is really up to you!

eBook DRM and Blockchain play CryptoKitty and Mouse. And the Winner is...

2018-09-18T14:33:00.000-04:00

If you want to know how blockchain relates to DRM and ebooks, it helps to understand CryptoKitties.

CryptoKitties are essentially numbers that live in a game-like environment which renders cats based on the numbers. Players can buy, collect, trade, and breed their kitties. Each kitty is unique. Players let their kitties play games in the "kittyverse". Transactions involving CryptoKitties take place on the Ethereum blockchain. Use of the blockchain make CryptoKitties different from other types of virtual property. The kitties can be traded outside of the game environment, and the kitties can't be confiscated or deleted by the game developers. In fact, the kitties could easily live in third-party software environments, though they might not carry their in-game attributes with them. Over 12 million dollars has been spent on CryptoKitties, and while you might assume they're a passing fad, they haven't gone away.

It's weird to think about "digital rights management" (DRM) for CryptoKitties. Cryptography locks a kitty to a user's cryptocurrency wallet, but you can transfer a wallet to someone else by giving them your secret keys. With the key, you can do anything with the contents of the wallet. The utility of your CryptoKitty (your "digital rights") is managed by a virtual environment controlled by Launch Labs, Inc., but until the kitties become sentient (15-20 years?) the setup doesn't trigger my distaste for DRM.

Now, think about how Amazon's Kindle works. When you buy an ebook from Amazon, what you're paying for is a piece of virtual property that only exists in the Kindle virtual world. The Kindle software environment endows your virtual property with value - but instead of giving you the right to breed a kitty, you might get the right to read about a kitty. You're not allowed to exercise this right outside of Amazon's virtual world, and DRM exists to enforce Amazon's control of that right. You can't trade or transfer this right.

Ebooks are are different from virtual property, in important ways. Ebooks are words, ideas, stories that live just fine outside Kindle. DRM kills this outside life away, which is a sin. And it robs readers of the ability to read without Big Brother keeping track of every page they read. Most authors and publishers see DRM as a necessary evil, because they don't believe in a utopia where readers pay creators just because they're worth it.

But what if were possible to "CryptoKittify" ebooks? Would that mitigate the sins of DRM, or even render it unnecessary? Would it just add the evils of blockchain to the evils of DRM? Two startups, Publica and Scenarex are trying to find out.

Depending on implementation, the "CryptoKittification" of ebooks could allow enhanced privacy and property rights for purchasers as well as transaction monitoring for rights holders. If a user's right to an ebook was registered on a blockchain, a reader application wouldn't need to "phone home"
to check whether a user was entitled to open and use the ebook. Similarly, the encrypted ebook files could be stored on a distributed service such as IPFS, or on a publisher's distribution site. The reader platform provider needn't separately verify the user. And just like printed books, a reader license could be transferred or sold to another user.

Alas, the DRM devil is always in the details, which is why I quizzed both Scenarex and Publica about their implementations. The two companies have taken strikingly different approaches to the application of blockchain to the ebook problem.

Scenarex, a company based in Montreal, has strived to make their platform familiar to both publishers and to readers. You don't need to have cryptocurrency or a crypto-wallet to use their system, called "Bookchain". Their website will look like an online bookstore, and their web-based reader application will use ebooks in the EPUB format rendered by the open-source Readium software being used by other ebook websites. All of the interaction with the blockchain will be handled in their servers. The affordances of their user-facing platform, at least in its initial form, should be very similar to other Readium-powered sites. For users, the only differences will be the license transfer options enabled by the blockchain and its content providers. Because the licenses will be memorialized on a blockchain the possibility is open that they could be used in other reading environments.

Scenarex's conservative approach of hiding most of blockchain from the users and rights holders, means that almost all of Scenarex's blockchain-potential is as-yet unrealized. There's no significant difference in privacy compared to Readium's LCP DRM scheme. License portability and transactions will depend on whether other providers decide to adopt Scenarex's license tokenization and publication scheme. Because blockchain interaction takes place behind Scenarex servers, the problems with blockchain immutability are mitigated along with the corresponding benefits to the purchaser. Scenarex expects to launch soon, but it's still too early to see if they can gain any traction.

Publica, by contrast, has chosen to propose a truly radical course for the ebook industry. Publica, with development offices is Latvia, doesn't make sense if you think of it as an ebook store, it only makes sense if you think of it as a crowd-funding platform for ebooks. (Disclosure: Unglue.it, a website I founded and run as part of the Free Ebook Foundation, started life as a crowd-funding platform for free ebooks.)

Publica invites authors to create "initial coin offerings" (ICOs) for their books. An author raising funds for their book sells read tokens for the book to investors, presumably in advance of publication. When the book is published, token owners get to read the book. Tokens can be traded or sold in Ethereum blockchain-backed transactions.

From an economic point of view, this doesn't seem to make much sense. If the token marketplace is efficient, the price of a token will fluctuate until the supply of tokens equals the number of people who want continuing access to the book. Sell too many tokens, and the price crashes to near zero. In today's market for books, buyers are motivated by word of mouth, so newly published books, especially by unknown authors, are given out free to reviewers and other influencers. To make money with an ICO, in contrast, an author will need to limit the supply so as to support the token's attractiveness to investors, and thus the book's price.

In many ways, however, book purchasers don't act like economists. They keep their books around forever. They accumulate TBR piles. Yes, they'll give away or sell books, but that is typically to enable further accumulation. They'll borrow a book from the library, read it, and THEN buy it. Book purchasers collect books. Which brings us back to CryptoKitties.

In May of 2018, a CryptoKitty sold at auction for over $140,000. That's right, someone paid 6 figures for what is essentially a number! Can you imagine someone paying that much for a copy of a book?

I can imagine that. In 2001, a First Folio edition of Shakespeare's plays sold for over $6,000,000! Suppose that J. K. Rowling had sold 100 digital first editions of Harry Potter and the Philosopher's Stone in 1996 to make ends meet. How much do you think someone would pay for one of those today, assuming the provenance and "ownership" could be unassailably verified?

CryptoKitties might be cute and they might have rare characteristics, but many more people develop powerful emotional attachments to books, even if they're just words or files full of bytes. A First Folio is an important historical artifact because of the huge cultural impact of the words it memorializes. I think it's plausible that a digital artifact could be similarly important, especially if its original sale provided support to its artist.

This brings me back to DRM. I asked the CTO of Publica, Yuri Pimenov about it, and he seemed apologetic.

Even Amazon's DRM can be easily removed (I did it once). So, let's assume that DRM is a little inconvenience that [...] people are ready to pay [to get around]. And besides the majority of people are good and understand that authors make a living by writing books...

Publica's app uses a cryptographic token in the Blockchain to allow access to the book contents, and does DRM-ish things like disabling quoting. But since the cryptographic token is bound to a cryptographic wallet, not a device or an account, it just papers over author concerns such as piracy. Pimenov is correct to note that it's the reader's relationship to the author that should be cemented by the Publica marketplace. Once Publica understands that memorializing readers supporting authors is where their success can come from, I think they'll realize that DRM, by restricting readers and building moats around literature, is counterproductive. To make an ebook into a collectable product, we don't need DRM, we need need "DRMem": Digital Rights Memorialization.

So, I'm surprised to be saying this, but... CryptoKitties win!

More Links:

Can Blockchain Disrupt The E-Book Market? Two Startups Will Find Out - Bill Rosenblatt
Blockchain Comes to E-Books, DRM Included - Bill Rosenblatt
London Book Fair 2018: Meet the World’s First #1 Bestselling ‘Blockchain’ Author - Andrew Albanese
The Content Blockchain Project
BitRights - Blockchain Digital Content
Civil - The Decentralized Marketplace for Sustainable Journalism

My Face is Personally Identifiable Information

2018-08-02T11:31:00.000-04:00

Facial recognition technology used to be so adorable. When I wrote about it 7 years ago, the facial recognition technology in iPhoto was finding faces in shrubbery, but was also good enough to accurately see family resemblances in faces carved into a wall. Now, Apple thinks it's good enough to use for biometric logins, bragging that "your face is your password".

I think this will be my new password:

The ACLU is worried about the civil liberty implications of facial recognition and the machine learning technology that underlies it. I'm worried too, but for completely different reasons. The ACLU has been generating a lot of press as they articulate their worries - that facial recognition is unreliable, that it's tainted by the bias inherent in its training data, and that it will be used by governments as a tool of oppression. But I think those worries are short-sighted. I'm worried that facial recognition will be extremely accurate, that its training data will be complete and thus unbiased, and that everyone will be using it everywhere on everyone else and even an oppressive government will be powerless to preserve our meager shreds of privacy.

We certainly need to be aware of the ways in which our biases can infect the tools we build, but the ACLU's argument against facial recognition invites the conclusion that things will be just peachy if only facial recognition were accurate and unbiased. Unfortunately, it will be. You don't have to read Cory Doctorow's novels to imagine a dystopia built on facial recognition. The progression of technology is such that multiple face recognizer networks could soon be observing us where ever we go in the physical world - the same way that we're recognized at every site on the internet via web beacons, web profilers and other spyware.

The problem with having your face as your password is that you can't keep your face secret. Faces aren't meant to be secret. Our faces co-evolved with our brains to be individually recognizable; evidently, having an identity confers a survival advantage. Our societies are deeply structured around our ability to recognize other people by their faces. We even put faces on our money!

Facial recognition is not new at all, but we need to understand the ways in which machines doing the recognizing will change the fabric of our societies. Let's assume that the machines will be really good at it. What's different?

For many applications, the machine will be doing things that people already do. Putting a face-recognizing camera on your front door is just doing what you'd do yourself in deciding whether to open it. Maybe using facial recognition in place of a paper driver's license or passport would improve upon the performance of a TSA agent squinting at that awful 5-year-old photo of you. What's really transformative is the connectivity. That front-door camera will talk to Fedex's registry of delivery people. When you use your face at your polling place, the bureau of elections will make sure you don't vote anywhere else that day. And the ID-check that proves you're old enough to buy cigarettes will update your medical records. What used to identify you locally can now identify you globally.

The reason that face-identity is so scary is that it's a type of identifier that has never existed before. It's globally unique, but it doesn't require a central registry to be used. It's public, easily collected and you can't remove it. It's as if we all had to tattoo our ~~prisoner~~ social security numbers on our foreheads! Facial profiles can be transmitted around the world, and used to index ALL THE DATABASEZ!

We can't stop facial recognition technology any more than we can reverse global warming, but we can start preparing today. We need to start by treating facial profiles and photographs as personally identifiable information. We have some privacy laws that cover so-called "PII", and we need to start applying them to photographs and facial recognition profiles. We can also impose strict liability for the misuse of biased inaccurate facial recognition; slowing down the adoption of facial recognition technology will give our society a chance to adjust to its consequences.

Oh, and maybe Denmark's new law against niqabs violates GDPR?