Thursday, December 14, 2023

The Revenge of the Cataloguers

Over the past 15 years or so, libraries around the world have de-emphasized cataloguing. While budgetary concerns and technological efficiencies have been factors in the decline of cataloguing, the emergence of full text search and relevance ranking as practiced by Google and others has proved to be more popular for the vast majority of users. On the open internet, subject classifications have proved to be useless in an environment rife with keyword spam and other search engine optimization techniques. 

In the past year, the emergence of large language models, a form of artificial intelligence (AI) with surprising abilities to summarize and classify texts, has people speculating that AI will put most cataloguers out of work in the not-so-distant future.

I think that's not even wrong. But Roy Tennant will turn out to be almost right. MARC, the premier tool of cataloguers around the world, will live forever... as a million weights in a generative pre-trained transformer. Let me explain...

The success or failure of modern AI depends on the construction of large statistical models with billions or even trillions of variables. These models are built from training data. The old adage about computers, "garbage in, garbage out", is truer than ever. The models are really good at imitating the training data; so good that they can surprise the models' architects! Thus the growing need for good training data, and the increasing value of rich data sources.

Filings in recent lawsuits confirm the value of this training data. Getty Images is suing Stability AI for the use of Getty Images' material in AI training sets. The suit is not just about the use of the images, which are copyrighted, but also about the use of trademarks and the detailed descriptions that accompany the data. Read paragraph 57 of the complaint:

Getty Images’ websites include both the images and corresponding detailed titles and captions and other metadata. Upon information and belief, the pairings of detailed text and images has been critical to successfully training the Stable Diffusion model to deliver relevant output in response to text prompts. If, for example, Stability AI ingested an image of a beach that was labeled “forest” and used that image-text pairing to train the model, the model would learn inaccurate information and be far less effective at generating desirable outputs in response to text prompts by Stability AI’s customers. Furthermore, in training the Stable Diffusion model, Stability AI has benefitted from Getty Images’ image-text pairs that are not only accurate, but detailed. For example, if Stability AI ingested a picture of Lake Oroville in California during a severe drought with a corresponding caption limited to just the word “lake,” it would learn that the image is of a lake, but not which lake or that the photograph was taken during a severe drought. If a Stable Diffusion user then entered a prompt for “California’s Lake Oroville during a severe drought” the output image might still be one of a lake, but it would be much less likely to be an image of Lake Oroville during a severe drought because the synthesis engine would not have the same level of control that allows it to deliver detailed and specific images in response to text prompts.

If you're reading this blog, you're probably thinking to yourself "THAT'S METADATA!"

Let's not forget the trademark part of the complaint:

In many cases, and as discussed further below, the output delivered by Stability AI includes a modified version of a Getty Images watermark, underscoring the clear link between the copyrighted images that Stability AI copied without permission and the output its model delivers. In the following example, the image on the left is another original, watermarked image copied by Stability AI and used to train its model and the watermarked image on the right is output delivered using the model:

If you're reading this blog, you're probably thinking to yourself "THAT'S PROVENANCE!"

So clearly, the kinds of data that libraries and archives have been producing for many years will still have value, but we need to start thinking about how the practice of cataloguing and similar activities will need to change in response to the new technologies. Existing library data will get repurposed as training data to create efficiencies in library workflows. Organizations with large, well-managed data collections will extract windfalls, deserved or not.

If the utility of metadata work is shifting from feeding databases to training AI models, how does this affect the product of that work? Here's how I see it:

  • Tighter coupling of metadata and content. Today's discovery systems are all about decoupling data from content - we talk about creating metadata surrogates for discovery of content. Surrogates are useless for AI training; a description of a cat is useless for training without an accompanying picture of the cat. This means that the existing decoupling of metadata work from content production is doomed. You might think that copyright considerations will drive metadata production into the hands of existing content producers, but more likely organizations that focus on production of integrated training data will emerge to license content and support the necessary metadata production.
  • Tighter collaboration of machines and humans. Optical character recognition (OCR) is a good example of highly focused and evolved machine learning that can still be improved by human editors. The practice of database-focused cataloguing will be made more productive as cataloguers become editors of machine-generated structured data. (As if they're not already doing that!)

  • Softer categorization. Discovery databases demand hard classifications. Fiction. Science. Textbooks. LC Subject Headings. AIs are much better at nuance, so the training data needs to include a lot more context. You can have a romantic novel of chemists and their textbooks, and an AI will be just fine with that, so long as you have enough description and context for the machine to assign lots of weights to many topic clusters. 

  • Emphasis on novelty. New concepts and things appear constantly; an AI will extrapolate unpredictably until it gets on-topic training data. AI-OCR might recognize a new emoji, but it might not.
  • Emphasis on provenance. Reality is expensive, which is why I think for-profit organizations will have difficulty in the business of providing training data, while Wikipedia will continue to succeed because it requires citations. Already the internet is awash in AI-produced content that sounds real, but is just automated BS. Training data will get branded.

What gets me really excited, though, is thinking about how a library of the future will interact with content. I expect users will interact with the library using a pre-trained language model, rather than via databases. Content will get added to the model using packages of statistical vectors, compiled by human-expert-assisted content processors. These human experts won't be called "cataloguers" any longer but rather "meaning advisors". Or maybe "biblio-epistemologists". The revenge of the cataloguers will be that because of the great responsibilities and breadth of expertise required, biblio-epistemologists will command salaries well exceeding those of the managers and programmers who will just take orders from well-trained AIs. Of course there will still be MARC records, generated by a special historical vector package guaranteed to only occasionally hallucinate.

Note: I started thinking about this after hearing a great talk (starting at about 30:00) by Michelle Wu at the Charleston Conference in November. (Kyle Courtney's talk was good, too.)

Friday, August 25, 2023

Let's pretend they're ebooks

In days of yore, back when people were blogging, I described the way that libraries were offering ebooks as being a "Pretend It's Print" model. At the time, I felt that this model was designed to sustain and perpetuate the model that libraries and publishers had been using since prehistoric times, and that it ignored most of the possibilities inherent in the ebook. Ebooks could liberate books from the shackles of their physical existence!
I was right, and I was wrong. The book publishing world seized on digital technology to put even heavier shackles on their books. In turn, technology companies such as Amazon locked down innovation in the ebook world so that libraries could no longer be equal contributors to the enterprise of distributing books, all the while pretending to their patrons that the ebooks they licensed were just like the print books sitting on their shelves.
Somehow libraries and publishers have survived. Maybe they've even thrived with the "pretend it's print" model for ebooks. There are plenty of economic problems, but whenever I talk to people about ebooks, the conversation is always some variation of "I love reading ebooks through my library". Most library users are perfectly happy pretending that their digital ebooks are just like the printed books.
A decade later, we need to change our perspective. It's time we seriously started pretending that printed books are just like ebooks, not just the other way around. The library world has been doing something called "Controlled Digital Lending" (CDL), which flips the "pretend it's print" model and pretends that print is just like digital. The basic idea behind controlled digital lending is that owning a print book should allow you to read it any way you want, even if that involves creating a digital substitute for it. A library that owns a print book ought to be able to lend it, as long as it's lent to only one person at a time. It's as if books were printed and sold in order to spread ideas and information!
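
For the technically inclined, the "one person at a time" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not how Internet Archive or any library actually implements CDL; the class name `OwnedTitle` and its methods are made up for the example.

```python
class OwnedTitle:
    """A title the library owns in print, lent digitally one-to-one (hypothetical model)."""

    def __init__(self, title, print_copies_owned):
        self.title = title
        self.print_copies_owned = print_copies_owned
        self.borrowers = set()  # patrons currently holding a digital loan

    def checkout(self, patron):
        """Return True if the patron holds a loan after this call, False otherwise."""
        if patron in self.borrowers:
            return True  # already has it; no second copy is consumed
        if len(self.borrowers) >= self.print_copies_owned:
            return False  # every owned copy is already lent out
        self.borrowers.add(patron)
        return True

    def checkin(self, patron):
        """Return the digital loan so the copy becomes available again."""
        self.borrowers.discard(patron)


# The library owns one print copy, so only one digital loan can be active.
book = OwnedTitle("Example Title", print_copies_owned=1)
first = book.checkout("alice")   # alice gets the only copy
second = book.checkout("bob")    # bob has to wait
book.checkin("alice")
third = book.checkout("bob")     # now bob can borrow it
```

The only state that matters is the comparison between active loans and print copies owned; everything else about a real CDL system (DRM, waitlists, loan periods) is left out of this sketch.
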
Of course radical ideas such as spreading information have to be stopped. And so we have the Hachette v. Internet Archive lawsuit and its assorted fallout. I'm not a lawyer, so I won't say much about the legal validity of the arguments on either side. I'm an ebook technologist, so I will explain to you that the whole lawsuit was about whether the other side was sufficiently serious about pretending that print books are just like ebooks and that ebooks are just like print books. Also that the other side doesn't understand how print books are completely different things than ebooks. Those lawyers really take to heart the White Queen's recommendation to believe six impossible things before breakfast.
The magic of technology is that it can make our pretendings into something real. So let's think a bit about how we can make the pretense of print-ebook equivalency more real, and if the resulting bargain makes any sense.
Here are some ways that we could make these ebooks, derived from printed books, more like print books:
  1. Speed. It takes me an hour or so to get a print book from a library. Should I be able to get the digital substitute in a minute? Should I be able to read a chapter and then "return" it so that someone else can use it the next second? CDL already puts some limits on this, but maybe there could be a standard that makes the digital surrogate more like the real thing?

  2. Geography. Printed books need to be transported to where the reader is. Once digitized they could go anywhere! Maybe something like a shipping fee could be attached to a loan or other transfer. Maybe part of the fee could accrue to creators? Academic libraries have long done interlibrary loan of journal articles by copying and mailing the article, so why not do something equivalent for books?

These two attributes matter a lot in defining commercial markets for books and ebooks, and will become increasingly important as distribution technologies scale up and improve. Although publishers today make most of their money on the most popular books, book sales and usage of books in libraries have very long tails. There are millions of books for which global demand could be met by aggressive CDL of just a few copies. The CDL system instituted by Internet Archive also has a countervailing effect: the world-wide availability combined with so-so EPUB quality and usability probably results in stimulation of demand for print copies. This effect is likely to diminish as technologists like me smooth out the DRM speedbumps in CDL and begin to apply machine learning to EPUB generation.
It's worth noting that the "long tail" in book publishing also applies to authors and publishers. It's likely that the Internet Archive's CDL service has a larger market effect (whether positive or negative) on these market participants.
Here are some ways that we shouldn't make ebooks more like print books:
  1. Search. Ebooks make search much easier than in print books. Maybe search should be disabled in CDL ebooks? Or maybe, we could enable search in print books. Google Books already sort of does this, if you have the right edition, but the process of making an ebook from a print book should give you an easy way to enable search in the print!

  2. Accessibility. Many reading-disabled users rely on ebooks for access to literature, science and culture. Older adults such as myself often find that flowable text with adjustable font size is easier on our eyes. In addition to international treaties that treat accessible text as an exception to copyright, most authors and publishers don't want to be monsters.

  3. Smell. Let's not go there.

  4. Privacy. The intellectual property world seems to think that copyright gives them the right to monitor and data-mine the behavior of readers on digital platforms. In some cases, copyright extremists have required root access to our devices so they can sniff out infringing files or behavior. (While they're at it, they might as well mine some bitcoin!) It is an outrage to think anyone who makes ebooks from print books would wire them with surveillance tools; the strong privacy policies of Internet Archive should be codified for CDL.

  5. Preservation. Publishers do a terrible job of preserving the lion's share of the printed books they publish, and society has always relied on libraries for this essential service. In this digital age, any grand bargain on copyrights has to provide libraries with the rights and incentives needed to do digital preservation of both printed and digital books.

The bottom line is that if we're going to continue to pretend that intellectual property is a real thing, we need to start pretending that printed books are like ebooks, and vice versa. A grand bargain that benefits us all can eventually make these illusions real.


  1. Copyability. CDL books, like publisher-created ebooks, rely on device-enforced restrictions on duplication (DRM). Printed books rely on the expense of copying machines and paper to limit reproduction. In both cases, social norms and legal strictures discourage unauthorized reproduction. Building those social norms is what creating a grand bargain is all about.
  2. Simultaneous use. Allowing simultaneous use of library ebooks during the pandemic is what really got the publishers mad at Internet Archive. A lot of people went mad during the lockdown, to be honest, and we're still recovering.
  3. Comments. I encourage comment on the Fediverse or on Bluesky. I've turned off commenting here.

Saturday, February 12, 2022

Crowdfunding Lessons from the Spice DAO

What if we get a huge bunch of people together and buy something that lets us do fun things with a book that we all love, while making it accessible as never before? Great idea, isn't it?

If that sounds familiar, maybe you've heard of, a web site we launched 10 years ago? We asked people what book they wished was free to everyone and the number one answer was Douglas Adams' The Hitchhiker's Guide to the Galaxy. We talked to the literary agent for the Adams estate, and long story short, the rights entanglements made that impossible for any amount of money. We had a success with a seminal anthropology book, but the intersection between books people were excited about and books that authors were willing to license openly was small. Probably you haven't heard of the site, but while it has focused on building a catalog of open-access books (now over 100,000 titles!) we still crowdfund a book here and there, most recently an academic monograph.

Probably you HAVE heard about Spice DAO, a "Distributed Autonomous Organization" that sprinkled some magic blockchain dust on an auction for a copy of Alejandro Jodorowsky's movie treatment of Frank Herbert's novel.

Web3 enthusiasts came through for Spice DAO, "crowdraising" enough to win the auction for €2.66M, though Christie's estimate for the item was only €25-35,000.

Spice DAO vows that:

Instead of letting it remain hidden away in private collections, Spice DAO crowdraised funds ... to collectively explore options to digitally preserve the manuscript, make it accessible to the public for the very first time, and develop creative projects inspired by the vision Jodorowsky set forth.

Predictably, the success of Spice DAO led to widespread ridicule, because:

  • The price paid was 100X the estimate.
  • Nothing about the item purchased gave them any rights to "make it accessible" or "develop creative projects" it inspired.
  • Images of another copy were already freely available on the internet. But no more. Ironically, the publicity around Spice DAO seems to have knocked the images off of the internet!
  • Even the DAO's website is no longer online, most likely because of trademark infringement. (Archived version linked above.)

One crypto lesson: a DAO constructed this way may get ripped off in an auction. Even if the seller was not using shills to see inside the DAO and bid up the price, the DAO was vulnerable to crypto-pranksters (or arbitrageurs?) who knew exactly what the DAO was forced to bid by its "smart" contract to avoid dissolution.

Despite all that, the 2.1 Billion "Spice" tokens given to crowdraise participants are still worth over 800,000 "dollars", according to Coinmarket, so maybe the product here is a convincing story for unregistered securities that apart from representing something tangible, can be used for tax evasion and money laundering. And the team seems to have had a crash course in copyright law:

After two months of outreach, conversations with former business partners and consultations with legal counsel we have not been able to reach an agreement with any of the rights holders involved in the creation of the contents of the book of collected storyboards of Jodorowsky’s Dune. (medium)

Spice DAO, like most successful crowd-funding projects, had a good story, and clearly that's worth a lot. There's still a big difference between a good story and an honest, well-informed story. Crowdfunding services such as are limited by all the facts they have to deal with. But magic crypto dust has a certain reality. The crowd-raise generation of tokens that can be bought and sold in free markets allows participants to dream that their tokens will increase in value, and they very well could. In the real world, Spice DAO spent the equivalent of $300,000 to create the liquidity pool needed to distribute the SPICE tokens. Which makes credit card fees seem like a bargain! But dreams are priceless.

At least with "conventional" crowd funding, you know there's some accountability if you're investing in a nightmare!

Wednesday, December 22, 2021

Top 25 foods at a Traditional Hellman Christmas.

Have only made 7 of the 25 so far this year. 

Bulla, 2021

  1. Julskinka (Christmas Ham). It doesn't count unless you cure it yourself. It once came out blue.
  2. Köttbullar (Meatballs). Still working to perfect the Impossible™ version.
  3. Limpa (Christmas Rye Bread). You absolutely must have this with lever pastej, but it's also great with just butter.
  4. Sil (Herring). Must have Akvavit to kill the taste. A proper smörgåsbord should have two kinds at least, but we usually made do with one.
  5. Akvavit. Must have Herring to kill the taste.
  6. Boiled Potatoes. Great with Sil and Akvavit. Alleged to go great on knäckebröd with butter and Kalvsylta.
  7. Kalvsylta (Jellied Veal). It's surprisingly easy to make. Keeps at least a year in the freezer.
  8. Lever Pastej (Liver Paté). Also surprisingly easy to make, if you have a grinder.
  9. Korv (Sausage). Two kinds in some years.
  10. Spare Ribs. More than once these were forgotten in the oven.
  11. Dopp i gryta (Dip in the pot). You dip some knäckebröd into boiling ham broth, then slather with butter. Matsos will work in a pinch.
  12. Ost (cheese). Västerbottens Ost, Bond Ost, Herrgardsost, Havarti and Swiss are all good.
  13. Rödkålsalat (red cabbage salad).
  14. Rödbetsalat (red beet salad).
  15. Inlagd gurka (pickled cucumber).
  16. Jansson's frestelse (Jansson's temptation). Creamed potatoes, anchovies and onions. Gonna try fermented tofu instead of anchovies this year.
  17. Shrimp omelet.
  18. Rotmoss (Mashed rutabagas). This used to be stuff that people in Sweden ate every day *other* than Christmas, because they could afford it. Now we never have it except at Christmas, because we can afford it.
  19. Lingonsylt (lingonberry sauce).
  20. Öl (beer). But never Swedish beer. Have dubbed this year's batch "Cipher Ale".
  21. Glögg (spiced wine). Dad made it with 1 part vodka to one part wine. 
  22. Coffee.
  23. Pepparkakor (Ginger Cookies). You can make a wish on them: if the cookie breaks into 3 pieces you get your wish. If some other number, you only get your wish if you wished for pepparkakor.
  24. Bulla (Cardamom cinnamon coffee bread). Comes out differently for every baker. There's also the saffron and almond paste variety made for Dec. 13 and sometimes saved for Christmas.
  25. Sand formar (Sand cookies). A ground almond cookie shell that you put vanilla pudding and a mandarin orange inside.
  26. Jam cookies. 
  27. Many other cookies. Log cookies. Broomstick cookies. Chocolate chip cookies. Macaroons. Knäck. Our seasonal output peaked at 10 dozen dozen. 
  28. Donuts. With the next door neighbors.
  29. Lutfisk. Not a figment of Garrison Keillor's imagination, but only Dad liked it, with cream sauce, peas and allspice.

I know that's more than 25, but did I mention that Dad made the glögg with one part vodka to one part wine?

Sunday, July 4, 2021

The Ebook Turns 50

On July 4, 1971, Michael Hart made the text of the Declaration of Independence available on arpanet (which is now the Internet), using the gopher protocol (look it up). Although books in digital form certainly existed before that, many of us regard the beginning of Project Gutenberg as the birth of the ebook. There were computer-readable books on magnetic disks, punch cards and the like, but the revolutionary element of Project Gutenberg was the distribution method. Printed books, after all, are a digital medium; it's just that the bits are embodied by the presence or absence of ink rather than electrons on a transistor gate. Sending the bits over a wire or a fiber is what puts the 'e' in ebook.

The birth of the ebook was a political event as much as a technical achievement. The choice of the "Declaration of Independence of the United States" as etext #1 couldn't have been solely an expression of patriotic fervor. Rather, I think it was a manifestation of the radical belief that everyone should have access to the printed word, without having to pay for the privilege. (Yes, libraries are radical in this way, too!).

As Thomas Jefferson put it:

... it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume, among the Powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them.

In the context of 1971, the "bands" that needed dissolving were expensive services such as Dialog. The idea that users had to pay Dialog per word to read the Declaration must have been galling to Hart. (Let's overlook the fact that he and other denizens of the 1971 arpanet got their access for "free" because someone else was paying.) Books are things in their own right; stripping ebooks of their "bands" to a single device or service is what put the "book" into ebook.

Although Project Gutenberg is now delivering about 50 million ebooks a year, about 2% of global ebook unit sales, until at least 2009 it delivered the majority of the world's ebooks. Today, that position has been taken by Amazon's Kindle. Just as the United States can't ignore the ideals that led to its founding, the stakeholders of the ebook ecosystem (authors, publishers, distributors, libraries, and readers, all of us) need to remember that the ebook was born out of a desire for freedom.

Note: Though I've been helping Project Gutenberg modernize its technology, I don't speak for them in any way, though I am certainly in awe of what they've achieved! If you'd like to support my work advancing freedom for ebooks, consider a donation to the Free Ebook Foundation.