Wednesday, May 8, 2024

Running Song of the Day

(I'm blogging my journey to the 2024 New York Marathon. You can help me get there.)

Steve Jobs gave me back my music. Thanks, Steve!

I got my first iPod a bit more than 20 years ago. It was a 3rd generation iPod, the first version with an all-touch control. I loved that I could play my Bruce, my Courtney, my Heads and my Alanis at an appropriate volume without bothering any of my classical-music-only family. Looking back on it, there was a period of about five years when I didn't regularly listen to music. I had stopped commuting to work by car, and though commuting was no fun, it had kept me in touch with my music. No wonder those 5 years were such a difficult period of my life!

Today, my running and my music are entwined. My latest (and last 😢) iPod already has some retro cred. It's a 6th generation iPod Nano. I listen to my music on 90% of my runs, and 90% of my listening is on my runs. I use shuffle mode so that over the course of a year of running, I'll listen to 2/3 of my ~2500 song library. In 2023, I listened to 1,723 songs. That's a lot of running!

Yes, I keep track. I have a system to maintain a 150-song playlist for running. I periodically replace all the songs I've heard in the most recent 2 months (unless I've listened to the song fewer than 5 times - you need at least that many plays to become acquainted with a song!). This is one of the ways I channel certain of my quirkier programmerish tendencies so that I project as a relatively normal person. Or at least I try.
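For the programmerish among you, here is a minimal sketch of that rotation rule in Python. It is not my actual system: the `Song` fields, the 60-day stand-in for "2 months", and the random choice of replacement songs are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

PLAYLIST_SIZE = 150                  # size of the running playlist
RECENT_WINDOW = timedelta(days=60)   # assumption: "2 months" taken as 60 days
MIN_PLAYS = 5                        # plays needed to be "acquainted" with a song


@dataclass
class Song:
    title: str                       # hypothetical fields, for illustration only
    play_count: int = 0
    last_played: Optional[date] = None


def should_rotate_out(song: Song, today: date) -> bool:
    """True if the song was heard recently AND has at least MIN_PLAYS plays."""
    if song.last_played is None:
        return False
    heard_recently = today - song.last_played <= RECENT_WINDOW
    return heard_recently and song.play_count >= MIN_PLAYS


def refresh_playlist(playlist: List[Song], library: List[Song], today: date,
                     rng: Optional[random.Random] = None) -> List[Song]:
    """Drop the songs that qualify for rotation and refill from the library."""
    rng = rng or random.Random()
    kept = [s for s in playlist if not should_rotate_out(s, today)]
    # Exclude everything on the old playlist, so rotated-out songs
    # cannot come straight back in.
    old_titles = {s.title for s in playlist}
    candidates = [s for s in library if s.title not in old_titles]
    needed = min(PLAYLIST_SIZE - len(kept), len(candidates))
    # Assumption: replacements are picked at random.
    return kept + rng.sample(candidates, needed)
```

A song heard last month with only three plays stays on the list; a song heard last month with seven plays makes way for something new.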

Last November, I decided to do something new (for me). I made a running playlist! Carefully selected to have the right cadence and to inspire the run! It was ordered to have particular songs play at appropriate points of the Ashenfelter 8K on Thanksgiving morning. It started with "Born to Run" and ended with either "Save it for Later", "Breathless" or "It's The End Of The World As We Know It", depending on my finishing time. It worked OK. I finished with Exene. I had never run with a playlist before.

1. "Born to Run". Despite the name it's not the best running song, but it is a great start-me-up song. With 2,661 runners, it took 45 seconds or so before I crossed the starting line, and the first 45 seconds of BTR had me pumped.
2. "American Land". The first part of the race is uphill, so an immigrant song seemed appropriate.
3. "Wake Up" - Arcade Fire. Can't get complacent.
4. "Twist & Crawl" - The Beat. The up-tempo pushed me to the fastest part of the race.
5. "Night". Up and over the hill. "you run sad and free until all you can see is the night". 
6. "Rock Lobster" - B-52s. This came up on shuffle last week while I was on the track and it was the perfect beats per minute. That gave me the idea to do a playlist.
7. "Shake It Off" - Taylor Swift. A bit of focused anger helps my energy level.
8. "Roulette". Recommended by the Nuts, and yes it was good. Shouting a short lyric helps me run faster.
9. "Workin' on the Highway". The 4th mile of 5 is the hardest, so "all day long I don't stop".
10. "Your Sister Can't Twist" - Elton John. There's a short nasty hill on this section, but I can rock and roll.
11. "Save it for Later" - The Beat. I could run all day to this, but "sooner or later your legs give way, you hit the ground."
12. "Breathless" - X. If I had hit my goal of 45 minutes, I would have crossed the finish as this started, but I was very happy with 46:12 and a 9:14 pace.
13. "It's The End Of The World As We Know It" - R.E.M. 48 minutes would not have been the end of the world, but I'd feel fine.

Last year, I started to extract a line from the music I had listened to during my run to use as the Strava title for the run. Through September 3, I would choose a line from a Springsteen song (he had to take a health timeout after that). For my New Year's resolution, I promised to credit the song and the artist in my run descriptions as well.

I find now that many songs remind me of the place where I was running when I listened to them. And running in certain places now reminds me of particular songs. I'm training the neural network in my head. I prefer to think of it as creating a web of connections, invisible strings, you might say, that enrich my experience of life. In other words, I'm creating art. And if you follow my Strava, the connections you make to my runs and my songs become part of this little collective art project. Thanks!

Reminder: I'm earning my way into the NYC Marathon by raising money for Amref. 

Monday, April 29, 2024

We'll run 'til we drop

(I'm blogging my journey to the 2024 New York Marathon. You can help me get there.)

 It wasn't the 10 seconds that made me into a runner.

Eric running across a bridge

I started running races again 20 years ago, in 2004. It was a 10K sponsored by my town's YMCA. I had run an occasional race in grad school to join my housemates, and I continued to run a couple of miles pretty regularly to add some exercise to my mostly sitting-at-a-computer lifestyle. I gradually added 10Ks - the local "turkey-trot" because the course went almost by my house - and then a "cherry-blossom" run, through beautiful Branch Brook Park. But I was not yet a real runner - tennis was my main sport.

In 2016, things changed. My wife was traveling a lot for work, and one son was away at college, and I found myself needing more social interaction. I saw that my local Y was offering a training program for their annual 10K, and I thought I would try it out. I had never trained for a race, ever. The closest thing to training I had ever done was the soccer team in high school. But there was a HUGE sacrifice involved - the class started at 8 AM on Saturdays, and I was notorious for sleeping past noon on Saturdays! Surprise, surprise, I loved it. It was fun to have people to run with. I'm on the silent side, and it was a pleasure to be with people who were comfortable with the somewhat taciturn real me.

I trained really hard with that group. I did longer runs than I'd ever done, and it felt great. So by race day, I felt sure that I would smash my PR (not counting the races in my 20's!). I was counting on cutting a couple of minutes off my time. And I did it! But only by a measly 10 seconds. I was so disappointed.

But somehow I had become a runner! It was running with a group that made me a runner. I began to seek out running groups and became somewhat of a running social butterfly.

Fast-forward to five weeks ago, when I was doing a 10-miler with a group of running friends (a 10-miler for me; they were doing longer runs in training for a marathon). I had told them of my decision to do New York this fall, and they were soooo supportive. I signed up for a half marathon to be held on April 27th - many of my friends were training for the associated full marathon. The last 2 miles of the 10-miler were really rough for me (maybe because my shoes were newish??) and I staggered home. That afternoon I could hardly walk and I realized I had strained my right knee. Running was suddenly excruciatingly painful.

By the next day I could get down the stairs and walk with a limp, but running was impossible. The next weekend, I was able to do a slow jog with some pain, so I decided to stick to walking, which was mostly pain-free. I saw a PT who advised me to build up slowly and get plenty of rest. It was working until the next weekend, when I was hurrying to catch a train and unthinkingly took a double step in Penn Station and re-sprained the knee. It was worse than before and I had only 3 weeks until the half marathon!

The past three weeks have been the hardest thing I've had to deal with in my running "career". I've had a calf strain, IT-band strains, back strains, sore quads, inter-tarsal neuromas and COVID get in the way of running, but this was the worst. Because of my impatience.

Run-walk (and my running buddies) were what saved me. I slowly worked my way from 2 miles at a 0.05-to-0.25 mile run-to-walk ratio up to 4 miles at 0.2-to-0.05 mile run-to-walk, with 2 days of rest between each session. I started my half marathon with a plan to run 2 minutes and walk 30 seconds until the knee told me to stop the running bits. I was hoping for a 3-hour half.
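To put numbers on that progression, here is a small Python sketch of the arithmetic. The function and plan names are made up for illustration; the distances and times are the ones from the paragraph above.

```python
def run_fraction(run: float, walk: float) -> float:
    """Share of each run-walk cycle spent running (same units for both)."""
    return run / (run + walk)


def cycles_needed(total: float, run: float, walk: float) -> float:
    """Number of run-walk cycles needed to cover `total` (same units)."""
    return total / (run + walk)


# (total miles, run segment in miles, walk segment in miles)
plans = {
    "early rehab, 2 miles": (2.0, 0.05, 0.25),
    "late rehab, 4 miles": (4.0, 0.20, 0.05),
}

for name, (total, run, walk) in plans.items():
    share = run_fraction(run, walk)
    print(f"{name}: {share:.0%} running, "
          f"{total * share:.2f} miles run, "
          f"{cycles_needed(total, run, walk):.1f} cycles")

# Race-day plan, measured in seconds instead of miles.
print(f"race day: {run_fraction(120, 30):.0%} running")
```

The rehab sessions went from running about one-sixth of the distance to running four-fifths of it, and the race-day plan of 2 minutes on, 30 seconds off spends that same four-fifths of the time running.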

The knee never complained (the rest of the body complained, but I'm used to that!!) I finished with the very respectable time of 2:31:28, faster than 2 of my previous 11 half marathons. One of my friends took a video of me staggering over the finish. 

 I'm very sure I don't look like that in real life.

Here's our group picture, marathoners and half-marathoners. Together, we're real runners.

After this weekend, my biggest half marathon challenge to date, I have more confidence than ever that I'll be able to do the New York Marathon in November - in one piece - with Team Amref. (And with your contributions towards my fund-raising goal, as well.)

We're gonna get to that place where we really wanna go and we'll walk in the sun

Jim Thorpe Half Marathon 2024 results. 

Thursday, December 14, 2023

The Revenge of the Cataloguers

Over the past 15 years or so, libraries around the world have de-emphasized cataloguing. While budgetary concerns and technological efficiencies have been factors in the decline of cataloguing, the emergence of full text search and relevance ranking as practiced by Google and others has proved to be more popular for the vast majority of users. On the open internet, subject classifications have proved to be useless in an environment rife with keyword spam and other search engine optimization techniques. 

In the past year, the emergence of artificial intelligence (AI) with large language models with surprising abilities to summarize and classify texts has people speculating that AI will put most cataloguers out of work in the not-so-distant future.

I think that's not even wrong. But Roy Tennant will turn out to be almost right. MARC, the premier tool of cataloguers around the world, will live forever... as a million weights in a generative pre-trained transformer. Let me explain...

The success or failure of modern AI depends on the construction of large statistical models with billions or even trillions of variables. These models are built from training data. The old adage about computers, "garbage in, garbage out", is truer than ever. The models are really good at imitating the training data; so good that they can surprise the models' architects! Thus the growing need for good training data, and the increasing value of rich data sources.

Filings in recent lawsuits confirm the value of this training data. Getty Images is suing Stability AI for the use of Getty Images' material in AI training sets. But it's not just for the use of the images, which are copyrighted, but also for the use of trademarks and the detailed descriptions that accompany the data. Read paragraph 57 of the complaint:

Getty Images’ websites include both the images and corresponding detailed titles and captions and other metadata. Upon information and belief, the pairings of detailed text and images has been critical to successfully training the Stable Diffusion model to deliver relevant output in response to text prompts. If, for example, Stability AI ingested an image of a beach that was labeled “forest” and used that image-text pairing to train the model, the model would learn inaccurate information and be far less effective at generating desirable outputs in response to text prompts by Stability AI’s customers. Furthermore, in training the Stable Diffusion model, Stability AI has benefitted from Getty Images’ image-text pairs that are not only accurate, but detailed. For example, if Stability AI ingested a picture of Lake Oroville in California during a severe drought with a corresponding caption limited to just the word “lake,” it would learn that the image is of a lake, but not which lake or that the photograph was taken during a severe drought. If a Stable Diffusion user then entered a prompt for “California’s Lake Oroville during a severe drought” the output image might still be one of a lake, but it would be much less likely to be an image of Lake Oroville during a severe drought because the synthesis engine would not have the same level of control that allows it to deliver detailed and specific images in response to text prompts.

If you're reading this blog, you're probably thinking to yourself "THAT'S METADATA!"

Let's not forget the trademark part of the complaint:

In many cases, and as discussed further below, the output delivered by Stability AI includes a modified version of a Getty Images watermark, underscoring the clear link between the copyrighted images that Stability AI copied without permission and the output its model delivers. In the following example, the image on the left is another original, watermarked image copied by Stability AI and used to train its model and the watermarked image on the right is output delivered using the model:

If you're reading this blog, you're probably thinking to yourself "THAT'S PROVENANCE!"

So clearly, the kinds of data that libraries and archives have been producing for many years will still have value, but we need to start thinking about how the practice of cataloguing and similar activities will need to change in response to the new technologies. Existing library data will get repurposed as training data to create efficiencies in library workflows. Organizations with large, well-managed data collections will extract windfalls, deserved or not.

If the utility of metadata work is shifting from feeding databases to training AI models, how does this affect the product of that work? Here's how I see it:

  • Tighter coupling of metadata and content. Today's discovery systems are all about decoupling data from content - we talk about creating metadata surrogates for discovery of content. Surrogates are useless for AI training; a description of a cat is useless for training without an accompanying picture of the cat. This means that the existing decoupling of metadata work from content production is doomed. You might think that copyright considerations will drive metadata production into the hands of existing content producers, but more likely organizations that focus on production of integrated training data will emerge to license content and support the necessary metadata production.
  • Tighter collaboration of machines and humans. Optical character recognition (OCR) is a good example of highly focused and evolved machine learning that can still be improved by human editors. The practice of database-focused cataloguing will be made more productive as cataloguers become editors of machine-generated structured data. (As if they're not already doing that!)

  • Softer categorization. Discovery databases demand hard classifications. Fiction. Science. Textbooks. LC Subject Headings. AIs are much better at nuance, so the training data needs to include a lot more context. You can have a romantic novel of chemists and their textbooks, and an AI will be just fine with that, so long as you have enough description and context for the machine to assign lots of weights to many topic clusters. 

  • Emphasis on novelty. New concepts and things appear constantly; an AI will extrapolate unpredictably until it gets on-topic training data. AI-OCR might recognize a new emoji, but it might not.
  • Emphasis on provenance. Reality is expensive, which is why I think for-profit organizations will have difficulty in the business of providing training data while Wikipedia will continue to succeed because it requires citations. Already the internet is awash in AI-produced content that sounds real, but is just automated BS. Training data will get branded.

What gets me really excited though, is thinking about how a library of the future will interact with content. I expect users will interact with the library using a pre-trained language model, rather than via databases. Content will get added to the model using packages of statistical vectors, compiled by human-expert-assisted content processors. These human experts won't be called "cataloguers" any longer but rather "meaning advisors". Or maybe "biblio-epistemologists". The revenge of the cataloguers will be that because of the great responsibilities and breadth of expertise required, biblio-epistemologists will command salaries well exceeding those of the managers and programmers who will just take orders from well-trained AIs. Of course there will still be MARC records, generated by a special historical vector package guaranteed to only occasionally hallucinate.

Note: I started thinking about this after hearing a great talk (starting at about 30:00) by Michelle Wu at the Charleston Conference in November. (Kyle Courtney's talk was good, too).

Friday, August 25, 2023

Let's pretend they're ebooks

In days of yore, back when people were blogging, I described the way that libraries were offering ebooks as being a "Pretend It's Print" model. At the time, I felt that this model was designed to sustain and perpetuate the model that libraries and publishers had been using since prehistoric times, and that it ignored most of the possibilities inherent in the ebook. Ebooks could liberate books from the shackles of their physical existence!
I was right, and I was wrong. The book publishing world seized on digital technology to put even heavier shackles on their books. In turn, technology companies such as Amazon locked down innovation in the ebook world so that libraries could no longer be equal contributors to the enterprise of distributing books, all the while pretending to their patrons that the ebooks they licensed were just like the print books sitting on their shelves.
Somehow libraries and publishers have survived. Maybe they've even thrived with the "pretend it's print" model for ebooks. There are plenty of economic problems, but whenever I talk to people about ebooks, the conversation is always some variation of "I love reading ebooks through my library". Most library users are perfectly happy pretending that their digital ebooks are just like the printed books.
robot writing on an ipad
A decade later, we need to change our perspective. It's time we seriously started pretending that printed books are just like ebooks, not just the other way around. The library world has been doing something called "Controlled Digital Lending" (CDL), which flips the "pretend it's print" model and pretends that print is just like digital. The basic idea behind controlled digital lending is that owning a print book should allow you to read it any way you want, even if that involves creating a digital substitute for it. A library that owns a print book ought to be able to lend it, as long as it's lent to only one person at a time. It's as if books were printed and sold in order to spread ideas and information!
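For the technically inclined, the "one person at a time" rule boils down to a very small piece of bookkeeping. Here's a minimal Python sketch, assuming a hypothetical `CdlLedger` class that I've made up for illustration. It is not how any real CDL system is implemented, and it leaves out digitization, DRM and loan periods entirely.

```python
from collections import defaultdict


class CdlLedger:
    """Hypothetical ledger: digital loans may never exceed owned print copies."""

    def __init__(self) -> None:
        self._owned = defaultdict(int)   # title -> print copies owned
        self._loans = defaultdict(set)   # title -> borrowers holding a loan

    def add_copies(self, title: str, copies: int = 1) -> None:
        """Record that the library owns more print copies of a title."""
        self._owned[title] += copies

    def available(self, title: str) -> int:
        """Owned copies that are not currently lent out."""
        return self._owned[title] - len(self._loans[title])

    def checkout(self, title: str, borrower: str) -> bool:
        """Lend one copy if any is free; return whether the borrower has it."""
        if borrower in self._loans[title]:
            return True                  # already holds this title
        if self.available(title) <= 0:
            return False                 # every owned copy is out on loan
        self._loans[title].add(borrower)
        return True

    def checkin(self, title: str, borrower: str) -> None:
        """Return the loan, freeing the copy for the next reader."""
        self._loans[title].discard(borrower)


ledger = CdlLedger()
ledger.add_copies("Example Book")                 # the library owns one copy
print(ledger.checkout("Example Book", "alice"))   # first reader gets it
print(ledger.checkout("Example Book", "bob"))     # second reader has to wait
```

The whole pretense lives in `available()`: the digital substitute is only "on the shelf" while a print copy that backs it is not lent out.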
Of course radical ideas such as spreading information have to be stopped. And so we have the Hachette v. Internet Archive lawsuit and its assorted fallout. I'm not a lawyer, so I won't say much about the legal validity of the arguments on either side. I'm an ebook technologist, so I will explain to you that the whole lawsuit was about whether the other side was sufficiently serious about pretending that print books are just like ebooks and that ebooks are just like print books. Also that the other side doesn't understand how print books are completely different things than ebooks. Those lawyers really take to heart the White Queen's recommendation to believe 6 impossible things before breakfast.
The magic of technology is that it can make our pretendings into something real. So let's think a bit about how we can make the pretense of print-ebook equivalency more real, and if the resulting bargain makes any sense.
Here are some ways that we could make these ebooks, derived from printed books, more like print books:
  1. Speed. It takes me an hour or so to get a print book from a library. Should I be able to get the digital substitute in a minute? Should I be able to read a chapter and then "return" it so that someone else can use it the next second? CDL already puts some limits on this, but maybe there could be a standard that makes the digital surrogate more like the real thing?

  2. Geography. Printed books need to be transported to where the reader is. Once digitized they could go anywhere! Maybe something like a shipping fee could be attached to a loan or other transfer. Maybe part of the fee could accrue to creators? Academic libraries have long done interlibrary loan of journal articles by copying and mailing the article, so why not do something equivalent for books?

These two attributes matter a lot in defining commercial markets for books and ebooks, and will become increasingly important as distribution technologies scale up and improve. Although publishers today make most of their money on the most popular books, book sales and usage of books in libraries have very long tails. There are millions of books for which global demand could be met by aggressive CDL of just a few copies. The CDL system instituted by Internet Archive also has a countervailing effect - the world-wide availability combined with so-so EPUB quality and usability probably results in stimulation of demand for print copies. This effect is likely to diminish as technologists like me smooth out the DRM speedbumps in CDL and begin to apply machine learning to EPUB generation.
It's worth noting that the "long tail" in book publishing also applies to authors and publishers. It's likely that the Internet Archive's CDL service has a larger market effect (whether positive or negative) on these market participants.
Here are some ways that we shouldn't make ebooks more like print books:
  1. Search. Ebooks make search much easier than in print books. Maybe search should be disabled in CDL ebooks? Or maybe, we could enable search in print books. Google Books already sort of does this, if you have the right edition, but the process of making an ebook from a print book should give you an easy way to enable search in the print!

  2. Accessibility. Many reading-disabled users rely on ebooks for access to literature, science and culture. Older adults such as myself often find that flowable text with adjustable font size is easier on our eyes. In addition to international treaties that treat accessible text as an exception to copyright, most authors and publishers don't want to be monsters.

  3. Smell. Let's not go there.

  4. Privacy. The intellectual property world seems to think that copyright gives them the right to monitor and data-mine the behavior of readers on digital platforms. In some cases, copyright extremists have required root access to our devices so they can sniff out infringing files or behavior. (While they're at it, they might as well mine some bitcoin!) It is an outrage to think anyone who makes ebooks from print books would wire them with surveillance tools; the strong privacy policies of Internet Archive should be codified for CDL.

  5. Preservation. Publishers do a terrible job of preserving the lion's share of the printed books they publish, and society has always relied on libraries for this essential service. In this digital age, any grand bargain on copyrights has to provide libraries with the rights and incentives needed to do digital preservation of both printed and digital books.

The bottom line is that if we're going to continue to pretend that intellectual property is a real thing, we need to start pretending that printed books are like ebooks, and vice versa. A grand bargain that benefits us all can eventually make these illusions real.


  1. Copyability. CDL books, like publisher-created ebooks, rely on device-enforced restrictions on duplication (DRM). Printed books rely on the expense of copying machines and paper to limit reproduction. In both cases, social norms and legal strictures discourage unauthorized reproduction. Building those social norms is what creating a grand bargain is all about.
  2. Simultaneous use. Allowing simultaneous use of library ebooks during the pandemic is what really got the publishers mad at Internet Archive. A lot of people went mad during the lockdown, to be honest, and we're still recovering.
  3. Comments. I encourage comment on the Fediverse or on Bluesky. I've turned off commenting here.

Saturday, February 12, 2022

Crowdfunding Lessons from the Spice DAO

What if we get a huge bunch of people together and buy something that lets us do fun things with a book that we all love, while making it accessible as never before? Great idea, isn't it?

If that sounds familiar, maybe you've heard of, a web site we launched 10 years ago? We asked people what book they wished was free to everyone and the number one answer was Douglas Adams' Hitchhiker's Guide to the Galaxy. We talked to the literary agent for the Adams estate, and long story short, the rights entanglements made that impossible for any amount of money. We had a success with a seminal anthropology book, but the intersection between books people were excited about and books that authors were willing to license openly was small. Probably you haven't heard of the site, but while it has focused on building a catalog of open-access books (now over 100,000 titles!) we still crowdfund a book here and there, most recently an academic monograph.

Probably you HAVE heard about Spice DAO, a "Decentralized Autonomous Organization" that sprinkled some magic blockchain dust on an auction for a copy of Alejandro Jodorowsky's movie treatment of Frank Herbert's novel Dune.

Web3 enthusiasts came through for Spice DAO, "crowdraising" enough to win the auction for €2.66M, though Christie's estimate for the item was only €25,000-35,000.

Spice DAO vows that:

Instead of letting it remain hidden away in private collections, Spice DAO crowdraised funds ... to collectively explore options to digitally preserve the manuscript, make it accessible to the public for the very first time, and develop creative projects inspired by the vision Jodorowsky set forth.

Predictably, the success of Spice DAO led to widespread ridicule, because:

  • The price paid was 100X the estimate
  • Nothing about the item purchased gave them any rights to "make it accessible" or "develop creative projects" it inspired.
  • Images of another copy were already freely available on the internet. But no more. Ironically, the publicity around Spice DAO seems to have knocked the images off of the internet!
  • Even the DAO's website is no longer online, most likely because of trademark infringement. (An archived version is linked above.)

One crypto lesson: a DAO constructed this way may get ripped off in an auction. Even if the seller was not using shills to see inside the DAO and bid up the price, the DAO was vulnerable to crypto-pranksters (or arbitrageurs?) who knew exactly what the DAO was forced to bid by its "smart" contract to avoid dissolution.

Despite all that, the 2.1 Billion "Spice" tokens given to crowdraise participants are still worth over 800,000 "dollars", according to Coinmarket, so maybe the product here is a convincing story for unregistered securities that apart from representing something tangible, can be used for tax evasion and money laundering. And the team seems to have had a crash course in copyright law:

After two months of outreach, conversations with former business partners and consultations with legal counsel we have not been able to reach an agreement with any of the rights holders involved in the creation of the contents of the book of collected storyboards of Jodorowsky’s Dune. (medium)

Spice DAO, like most successful crowd-funding projects, had a good story, and clearly that's worth a lot. There's still a big difference between a good story and an honest, well informed story. Crowdfunding services such as are limited by all the facts they have to deal with. But magic crypto dust has a certain reality. The crowd-raise generation of tokens that can be bought and sold in free markets allows participants to dream that their tokens will increase in value, and they very well could. In the real world, Spice DAO spent the equivalent of $300,000 to create the liquidity pool needed to distribute the SPICE tokens. Which makes credit card fees seem like a bargain! But dreams are priceless.

At least with "conventional" crowd funding, you know there's some accountability if you're investing in a nightmare!