Wednesday, June 12, 2024

The PII Figleaf

The Internet's big lie is "we respect your privacy". Thanks to cookie banners and the like, the Internet tells us this so many times a day that we ignore all the evidence to the contrary. Sure, there are a lot of people who care about our privacy, but they're often letting others violate it without even knowing it. Sometimes this just means that they are trying to be careful with our "PII". And guess what? You know those cookies you're constantly blocking or accepting? Advertisers like Google have mostly stopped using cookies!!!

fig leaf covering id cards

"PII" is "Personally Identifiable Information", and privacy lawyers seem to be obsessed with it. Lawyers, and the laws they care about, generally equate good PII hygiene with privacy. Good PII hygiene is not at all a bad thing, but it protects privacy the same way that washing your hands protects you from influenza. Websites that claim to protect your privacy are often washing the PII off their hands while coughing data all over you. They can and do violate your privacy while meticulously protecting your PII.

Examples of PII include your name, address, social security number, your telephone number and your email address. The IP address that you use can often be traced to you, so it's sometimes treated as PII, but often isn't. The fact that you love paranormal cozy romance novels is not PII, nor is the fact that you voted for Mitt Romney. That you have an 18 year old son and an infant daughter is also not PII. But if you've checked out a paranormal cozy romance from your local library, and then start getting ads all over the internet for paranormal cozy romances set in an alternate reality where Mitt is President and the heroine has an infant and a teenager, you might easily conclude that your public library has sold your checkout list and your identity to an evil advertising company.

That's a good description of a recent situation involving San Francisco Public Library (SFPL). As reported by The Register:

In April, attorney Christine Dudley was listening to a book on her iPhone while playing a game on her Android tablet when she started to see in-game ads that reflected the audiobooks she recently checked out of the San Francisco Public Library.

Let me be clear. There's no chance that SFPL has sold the check-out list to anybody, much less evil advertisers. However, it DOES appear to be the case that SFPL and their online ebook vendors, Overdrive and Baker and Taylor, could have allowed Google to track Ms. Dudley, perhaps because they didn't fully understand the configuration options in Google Analytics. SFPL offers ebooks and audiobooks from Overdrive, "Kindle Books from Libby by Overdrive", and ebooks and audiobooks from Baker and Taylor's "Boundless" Platform. There's no leakage of PII or check-out list, but Google is able to collect demographics and interests from the browsing patterns of users with Google accounts.

A few years ago, I wrote an explainer about how to configure Google Analytics to protect user privacy. That explainer is obsolete, as Google is scrapping the system I explained in favor of a new system, "Google Analytics 4" (GA-4), that works better in the modern, more privacy-conscious browser environment. To their credit, Google has made some of the privacy-preserving settings the default - for example, they will no longer store IP addresses. But reading the documentation, you can tell that they're not much interested in Privacy with a capital P, as they want to be able to serve relevant (and thus lucrative) ads, even if they're for paranormal cozy romances. And Google REALLY doesn't want any "PII"! PII doesn't much help ad targeting, and there are places that regulate what they can do with PII.

We can start connecting the dots from the audiobook to the ads in the Register's reporting by understanding a bit about Google Analytics. Google Analytics helps websites measure their usage. When you visit a webpage with Google Analytics, a JavaScript snippet sends information back to one or more Google trackers about the address of the webpage, your browser environment, and maybe more data that the webpage publisher is interested in. Just about the only cookie being set these days is one that tells the website not to show the cookie banner!

From the Register:

The subdomain SFPL uses for library member login and ebook checkout, sfpl.bibliocommons.com, has only a single tracker, from Alphabet, that communicates with the domains google-analytics.com and googletagmanager.com.

The page is operated by BiblioCommons, which was acquired in 2020 by Canada-based Constellation Software. BiblioCommons has its own privacy policy that exists in conjunction with the SFPL privacy policy.

In response to questions about ad trackers on its main website, Wong acknowledged that SFPL does use third-party cookies and provides a popup that allows visitors to opt-out if they prefer.

With regard to Google Analytics, she said that it only helps the library understand broad demographic data, such as the gender and age range of visitors.

"We are also able to understand broad interests of our users, such as movie, travel, sports and fitness based on webpage clicks, but this information is not at all tied to individual users, only as aggregated information," said Wong.

The statement from Jaime Wong, deputy director of communications for the SFPL, is revealing. The Google Analytics tracker only works within a website, and neither SFPL nor its vendors are collecting demographic information to share with Google. But Google Analytics has options to turn on the demographic information that libraries think they really need. (It helps to get funding, for example.) It used to be called "Advertising Reporting Features" and "Remarketing" (I called these the "turn off privacy" switches) but now it's called "Google Signals". It works by adding the Google advertising tracker, DoubleClick, alongside the regular Analytics tracker. This allows Google to connect the usage data from a website to its advertising database, the one that stores demographic and interest information. This gives the website owners access to their user demographics, and it gives the Google advertising machine access to the users' web browsing behavior.
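
Conceptually, the linkage that Google Signals enables can be sketched in a few lines of Python. (This is purely illustrative - the data, field names, and logic are all made up, not Google's actual schema.)

```python
# What a site's Analytics tracker sees: pages viewed in a session.
session_data = [
    {"session": "a1", "signed_in_id": "user42", "page": "/ebooks/paranormal-cozy-romance"},
    {"session": "b2", "signed_in_id": None, "page": "/hours"},  # not signed in
]

# What the advertising database already knows about signed-in, consented users.
ad_profiles = {
    "user42": {"age_range": "35-44", "interests": ["movies", "travel"]},
}

def link_signals(sessions, profiles):
    """Join browsing behavior to demographics for signed-in, consented users."""
    linked = []
    for s in sessions:
        profile = profiles.get(s["signed_in_id"])
        if profile is not None:  # anonymous visitors never match
            linked.append({**s, **profile})
    return linked

# The site owner gets demographics; the ad machine gets browsing behavior.
print(link_signals(session_data, ad_profiles))
```

The join runs in both directions, which is the whole point: the demographic report the library sees is a byproduct of the browsing data flowing the other way.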

I have examined the relevant webpages from SFPL, as well as the customized pages that BiblioCommons, Overdrive, and Baker and Taylor provide for SFPL, looking for trackers. Here's what I found:

  • The SFPL website, SFPL.org, has Analytics and DoubleClick ad trackers enabled.
  • The BiblioCommons website, sfpl.bibliocommons.com, has two analytics trackers enabled, but no advertising tracker. Probably one tracker "belongs" to SFPL while the other "belongs" to BiblioCommons.
  • The Overdrive website, sfpl.overdrive.com, has Analytics and DoubleClick ad trackers enabled.
  • The Baker and Taylor website, sfpl.boundless.baker-taylor.com, has Analytics and DoubleClick ad trackers enabled.
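
A minimal sketch of the kind of check I did, in Python: scan a page's HTML for the telltale tracker domains. (The page fragment below is made up, and real sites often load trackers indirectly through tag managers, so a check like this only catches the obvious cases.)

```python
import re

# Domains that give away the presence of Google's trackers.
TRACKER_DOMAINS = {
    "google-analytics.com": "Analytics",
    "googletagmanager.com": "Tag Manager",
    "doubleclick.net": "DoubleClick (advertising)",
}

def find_trackers(html: str) -> set[str]:
    """Return labels for any known tracker domains found in the HTML."""
    found = set()
    for domain, label in TRACKER_DOMAINS.items():
        if re.search(re.escape(domain), html):
            found.add(label)
    return found

# A made-up page fragment for illustration.
sample_html = """
<script async src="https://www.googletagmanager.com/gtag/js?id=G-XXXX"></script>
<script src="https://stats.g.doubleclick.net/dc.js"></script>
"""
print(find_trackers(sample_html))
```

Browser developer tools (the Network tab) will show the same thing with less effort, but the principle is identical: watch which domains a page talks to.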

So it shouldn't be surprising that Ms. Dudley experienced targeted ads based on the books she was looking at in the San Francisco Public Library website. Libraries and librarians everywhere need to understand that reader privacy is not just about PII, and that the sort of privacy that libraries have a tradition of protecting is very different from the privacy that Google talks about when it says "Google Analytics 4 was designed to be able to evolve for the future and built with privacy at its core." At the end of this month, earlier versions of Google Analytics will stop "processing" data. (I'm betting the trackers will still fire!)

What Google means by that is that in GA-4, trackers continue to work despite browser restrictions on third-party cookies, and the tracking process is no longer reliant on data like IP addresses that could be considered PII. To address those troublesome regulators in Europe, they only distribute demographic data and interest profiles for people who've given their permission to Google to do so. Do you really think you haven't somewhere given Google permission to collect your demographic data and interest profiles? You can check here.

Here's what Google tells Analytics users about the ad trackers:

When you turn on Google signals, Google Analytics will associate the session data it collects from your site and apps with Google's information from accounts of signed-in, consented users. By turning on Google signals, you acknowledge you adhere to the Google Advertising Features Policy, including rules around sensitive categories, have the necessary privacy disclosures and rights from your end users for such association, and that such data may be accessed and deleted by end users via My Activity.

In plain English, that means that if a website owner flips the switch, it's the website's problem if the trackers accidentally capture PII or otherwise violate privacy, because the website is responsible for asking users for permission.

Yep. GA-4 is engineered with what I would call "figleaf privacy" at its core. Google doesn't have fig leaves for paranormal cozy romance novels!


Wednesday, May 8, 2024

Running Song of the Day

(I'm blogging my journey to the 2024 New York Marathon. You can help me get there.)

Steve Jobs gave me back my music. Thanks Steve!

I got my first iPod a bit more than 20 years ago. It was a 3rd generation iPod, the first version with an all-touch control. I loved that I could play my Bruce, my Courtney, my Heads and my Alanis at an appropriate volume without bothering any of my classical-music-only family. Looking back on it, there was a period of about five years when I didn't regularly listen to music. I had stopped commuting to work by car, and though commuting was no fun, it had kept me in touch with my music. No wonder those 5 years were such a difficult period of my life!

Today, my running and my music are entwined. My latest (and last 😢) iPod already has some retro cred. It's a 6th generation iPod Nano. I listen to my music on 90% of my runs and 90% of my listening is on my runs. I use shuffle mode so that over the course of a year of running, I'll listen to 2/3 of my ~2500 song library. In 2023, I listened to 1,723 songs. That's a lot of running!

Yes, I keep track. I have a system to maintain a 150 song playlist for running. I periodically replace all the songs I've heard in the most recent 2 months (unless I've listened to the song fewer than 5 times - you need at least that many plays to become acquainted with a song!). This is one of the ways I channel certain of my quirkier programmerish tendencies so that I project as a relatively normal person. Or at least I try.
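
For the similarly quirky, the refresh rule might look something like this in Python (an illustrative simplification - the field names and exact cutoffs here are invented):

```python
import datetime

def refresh_playlist(songs, today):
    """Rotate out songs heard in the last two months, unless they have
    fewer than 5 lifetime plays (not yet acquainted!)."""
    two_months_ago = today - datetime.timedelta(days=61)
    keep, replace = [], []
    for song in songs:
        recently_played = song["last_played"] >= two_months_ago
        if recently_played and song["play_count"] >= 5:
            replace.append(song)  # heard enough recently: rotate it out
        else:
            keep.append(song)     # still fresh, or not yet acquainted
    return keep, replace

today = datetime.date(2024, 5, 8)
songs = [
    {"title": "Born to Run", "play_count": 40, "last_played": datetime.date(2024, 4, 20)},
    {"title": "Roulette",    "play_count": 3,  "last_played": datetime.date(2024, 4, 25)},
    {"title": "Night",       "play_count": 12, "last_played": datetime.date(2023, 11, 23)},
]
keep, replace = refresh_playlist(songs, today)
print([s["title"] for s in replace])  # -> ['Born to Run']
```

"Roulette" stays because three plays isn't enough to get acquainted, and "Night" stays because it hasn't come up on shuffle lately.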

Last November, I decided to do something new (for me). I made a running playlist! Carefully selected to have the right cadence and to inspire the run! It was ordered to have particular songs play at appropriate points of the Ashenfelter 8K on Thanksgiving morning. It started with "Born to Run" and ended with either "Save it for Later", "Breathless" or "It's The End Of The World As We Know It", depending on my finishing time. It worked OK. I finished with Exene. I had never run with a playlist before.

1. "Born to Run".
2. "American Land". The first part of the race is uphill, so an immigrant song seemed appropriate.
3. "Wake Up" - Arcade Fire. Can't get complacent.
4. "Twist & Crawl" - The Beat. The up-tempo pushed me to the fastest part of the race.
5. "Night". Up and over the hill. "you run sad and free until all you can see is the night". 
6. "Rock Lobster" - B-52s. The perfect beats per minute. 
7. "Shake It Off" - Taylor Swift. A bit of focused anger helps my energy level.
8. "Roulette". Recommended by the Nuts, and yes it was good. Shouting a short lyric helps me run faster.
9. "Workin' on the Highway". The 4th mile of 5 is the hardest, so "all day long I don't stop".
10. "Your Sister Can't Twist" - Elton John. A short nasty hill.
11. "Save it for Later" - The Beat. I could run all day to this, but "sooner or later your legs give way, you hit the ground."
12. "Breathless" - X. If I had hit my goal of 45 minutes, I would have crossed the finish as this started, but I was very happy with 46:12 and a 9:14 pace.
13. "It's The End Of The World As We Know It" - R.E.M. 48 minutes would not have been the end of the world, but I'd feel fine.

Last year, I started to extract a line from the music I had listened to during my run to use as the Strava title for the run. Through September 3, I would choose a line from a Springsteen song (he had to take a health timeout after that). For my New Year's resolution, I promised to credit the song and the artist in my run descriptions as well.

I find now that many songs remind me of the place where I was running when I listened to them. And running in certain places now reminds me of particular songs. I'm training the neural network in my head. I prefer to think of it as creating a web of connections, invisible strings, you might say, that enrich my experience of life. In other words, I'm creating art. And if you follow my Strava, the connections you make to my runs and my songs become part of this little collective art project. Thanks!


Reminder: I'm earning my way into the NYC Marathon by raising money for Amref. 


Monday, April 29, 2024

We'll run 'til we drop

(I'm blogging my journey to the 2024 New York Marathon. You can help me get there.)

It wasn't the 10 seconds that made me into a runner.

Eric running across a bridge

I started running races again 20 years ago, in 2004. It was a 10K sponsored by my town's YMCA. I had run an occasional race in grad school to join my housemates; and I continued to run a couple of miles pretty regularly to add some exercise to my mostly sitting-at-a-computer lifestyle. I gradually added 10Ks - the local "turkey-trot" because the course went almost by my house - and then a "cherry-blossom" run, through beautiful Branch Brook Park. But I was not yet a real runner - tennis was my main sport.

In 2016, things changed. My wife was traveling a lot for work, and one son was away at college, and I found myself needing more social interaction. I saw that my local Y was offering a training program for their annual 10K, and I thought I would try it out. I had never trained for a race, ever. The closest thing to training I had ever done was the soccer team in high school. But there was a HUGE sacrifice involved - the class started at 8AM on Saturdays, and I was notorious for sleeping past noon on Saturdays! Surprise, surprise, I loved it. It was fun to have people to run with. I'm on the silent side, and it was a pleasure to be with people who were comfortable with the somewhat taciturn real me.

I trained really hard with that group. I did longer runs than I'd ever done, and it felt great. So by race day, I felt sure that I would smash my PR (not counting the races in my 20's!). I was counting on cutting a couple of minutes off my time. And I did it! But only by a measly 10 seconds. I was so disappointed.

But somehow I had become a runner! It was running with a group that made me a runner. I began to seek out running groups and became somewhat of a running social butterfly.

Fast-forward to five weeks ago, when I was doing a 10-miler with a group of running friends (a 10-miler for me; they were doing longer runs in training for a marathon). I had told them of my decision to do New York this fall, and they were soooo supportive. I signed up for a half marathon to be held on April 27th - many of my friends were training for the associated full marathon. The last 2 miles were really rough for me (maybe because my shoes were newish??) and I staggered home. That afternoon I could hardly walk and I realized I had strained my right knee. Running was suddenly excruciatingly painful.

By the next day I could get down the stairs and walk with a limp, but running was impossible. The next weekend, I was able to do a slow jog with some pain, so I decided to stick to walking, which was mostly pain-free. I saw a PT who advised me to build up slowly and get plenty of rest. It was working until the next weekend, when I was hurrying to catch a train and unthinkingly took a double step in Penn Station and re-sprained the knee. It was worse than before and I had only 3 weeks until the half marathon!

The past three weeks have been the hardest thing I've had to deal with in my running "career". I've had a calf strain, IT-band strains, back strains, sore quads, inter-tarsal neuromas and COVID get in the way of running, but this was the worst. Because of my impatience.

Run-walk (and my running buddies) were what saved me. I slowly worked my way from 2 miles at a 0.05-to-0.25 mile run-to-walk ratio up to 4 miles at 0.2-to-0.05 mile run-to-walk, with 2 days of rest between each session. I started my half marathon with a plan to run 2 minutes and walk 30 seconds until the knee told me to stop the running bits. I was hoping for a 3 hour half.

The knee never complained (the rest of the body complained, but I'm used to that!!) I finished with the very respectable time of 2:31:28, faster than 2 of my previous 11 half marathons. One of my friends took a video of me staggering over the finish. 


I'm very sure I don't look like that in real life.

Here's our group picture, marathoners and half-marathoners. Together, we're real runners.

After this weekend, my biggest half marathon challenge to date, I have more confidence than ever that I'll be able to do the New York Marathon in November - in one piece - with Team Amref. (And with your contributions towards my fund-raising goal, as well.)

We're gonna get to that place where we really wanna go and we'll walk in the sun

Jim Thorpe Half Marathon 2024 results. 

Thursday, December 14, 2023

The Revenge of the Cataloguers

Over the past 15 years or so, libraries around the world have de-emphasized cataloguing. While budgetary concerns and technological efficiencies have been factors in the decline of cataloguing, the emergence of full text search and relevance ranking as practiced by Google and others has proved more popular with the vast majority of users. On the open internet, subject classifications have turned out to be useless in an environment rife with keyword spam and other search engine optimization techniques.

In the past year, the emergence of artificial intelligence (AI) with large language models with surprising abilities to summarize and classify texts has people speculating that AI will put most cataloguers out of work in the not-so-distant future.

I think that's not even wrong. But Roy Tennant will turn out to be almost right. MARC, the premier tool of cataloguers around the world, will live forever... as a million weights in a generative pre-trained transformer. Let me explain...

The success or failure of modern AI depends on the construction of large statistical models with billions or even trillions of variables. These models are built from training data. The old adage about computers, "garbage in, garbage out", is truer than ever. The models are really good at imitating the training data; so good that they can surprise the models' architects! Thus the growing need for good training data, and the increasing value of rich data sources.

Filings in recent lawsuits confirm the value of this training data. Getty Images is suing Stability AI for the use of Getty Images' material in AI training sets. But it's not just for the use of the images, which are copyrighted, but also for the use of trademarks and the detailed descriptions that accompany the data. Read paragraph 57 of the complaint:

Getty Images’ websites include both the images and corresponding detailed titles and captions and other metadata. Upon information and belief, the pairings of detailed text and images has been critical to successfully training the Stable Diffusion model to deliver relevant output in response to text prompts. If, for example, Stability AI ingested an image of a beach that was labeled “forest” and used that image-text pairing to train the model, the model would learn inaccurate information and be far less effective at generating desirable outputs in response to text prompts by Stability AI’s customers. Furthermore, in training the Stable Diffusion model, Stability AI has benefitted from Getty Images’ image-text pairs that are not only accurate, but detailed. For example, if Stability AI ingested a picture of Lake Oroville in California during a severe drought with a corresponding caption limited to just the word “lake,” it would learn that the image is of a lake, but not which lake or that the photograph was taken during a severe drought. If a Stable Diffusion user then entered a prompt for “California’s Lake Oroville during a severe drought” the output image might still be one of a lake, but it would be much less likely to be an image of Lake Oroville during a severe drought because the synthesis engine would not have the same level of control that allows it to deliver detailed and specific images in response to text prompts.

If you're reading this blog, you're probably thinking to yourself "THAT'S METADATA!"

Let's not forget the trademark part of the complaint:


In many cases, and as discussed further below, the output delivered by Stability AI includes a modified version of a Getty Images watermark, underscoring the clear link between the copyrighted images that Stability AI copied without permission and the output its model delivers. In the following example, the image on the left is another original, watermarked image copied by Stability AI and used to train its model and the watermarked image on the right is output delivered using the model:


If you're reading this blog, you're probably thinking to yourself "THAT'S PROVENANCE!"

So clearly, the kinds of data that libraries and archives have been producing for many years will still have value, but we need to start thinking about how the practice of cataloguing and similar activities will need to change in response to the new technologies. Existing library data will get repurposed as training data to create efficiencies in library workflows. Organizations with large, well-managed collections of data will extract windfalls, deserved or not.

If the utility of metadata work is shifting from feeding databases to training AI models, how does this affect the product of that work? Here's how I see it:


  • Tighter coupling of metadata and content. Today's discovery systems are all about decoupling data from content - we talk about creating metadata surrogates for discovery of content. Surrogates are useless for AI training; a description of a cat is useless for training without an accompanying picture of the cat. This means that the existing decoupling of metadata work from content production is doomed. You might think that copyright considerations will drive metadata production into the hands of existing content producers, but more likely organizations that focus on production of integrated training data will emerge to license content and support the necessary metadata production.
  • Tighter collaboration of machines and humans. Optical character recognition (OCR) is a good example of highly focused and evolved machine learning that can still be improved by human editors. The practice of database-focused cataloguing will be made more productive as cataloguers become editors of machine generated structured data. (As if they're not already doing that!)

  • Softer categorization. Discovery databases demand hard classifications. Fiction. Science. Textbooks. LC Subject Headings. AIs are much better at nuance, so the training data needs to include a lot more context. You can have a romantic novel of chemists and their textbooks, and an AI will be just fine with that, so long as you have enough description and context for the machine to assign lots of weights to many topic clusters. 

  • Emphasis on novelty. New concepts and things appear constantly; an AI will extrapolate unpredictably until it gets on-topic training data. AI-OCR might recognize a new emoji - or it might not.
  • Emphasis on provenance. Reality is expensive, which is why I think for-profit organizations will have difficulty in the business of providing training data while Wikipedia will continue to succeed because it requires citations. Already the internet is awash in AI produced content that sounds real, but is just automated BS. Training data will get branded.
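
To make the shift concrete, here's a sketch of what an AI-era catalogue record might look like: content coupled to its metadata, soft topic weights instead of hard classifications, and provenance attached. (Pure speculation on my part - the fields and values are invented.)

```python
# An illustrative training record: metadata travels WITH the content,
# topics are weighted rather than pigeonholed, and provenance is explicit.
record = {
    "content": "the full text (or image) goes here, not a surrogate",
    "description": "A romantic novel of chemists and their textbooks",
    "topic_weights": {"romance": 0.6, "chemistry": 0.3, "textbooks": 0.1},
    "provenance": {"source": "Example Library", "human_reviewed": True},
}

# A discovery database would have forced a single winner...
hard_class = max(record["topic_weights"], key=record["topic_weights"].get)
print(hard_class)  # the one label hard classification would have kept
```

The `max()` at the end is the point: hard classification throws away the chemistry and the textbooks, while the weighted record keeps the nuance an AI can actually use.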

What gets me really excited though, is thinking about how a library of the future will interact with content. I expect users will interact with the library using a pre-trained language model, rather than via databases. Content will get added to the model using packages of statistical vectors, compiled by human-expert-assisted content processors. These human experts won't be called "cataloguers" any longer but rather "meaning advisors". Or maybe "biblio-epistemologists". The revenge of the cataloguers will be that because of the great responsibilities and breadth of expertise required, biblio-epistemologists will command salaries well exceeding those of the managers and programmers who will just take orders from well-trained AIs. Of course there will still be MARC records, generated by a special historical vector package guaranteed to only occasionally hallucinate.

Note: I started thinking about this after hearing a great talk (starting at about 30:00) by Michelle Wu at the Charleston Conference in November. (Kyle Courtney's talk was good, too).

Friday, August 25, 2023

Let's pretend they're ebooks

In days of yore, back when people were blogging, I described the way that libraries were offering ebooks as being a "Pretend It's Print" model. At the time, I felt that this model was designed to sustain and perpetuate the model that libraries and publishers had been using since prehistoric times, and that it ignored most of the possibilities inherent in the ebook. Ebooks could liberate books from the shackles of their physical existence!
 
I was right, and I was wrong. The book publishing world seized on digital technology to put even heavier shackles on their books. In turn, technology companies such as Amazon locked down innovation in the ebook world so that libraries could no longer be equal contributors to the enterprise of distributing books, all the while pretending to their patrons that the ebooks they licensed were just like the print books sitting on their shelves.
 
Somehow libraries and publishers have survived. Maybe they've even thrived with the "pretend it's print" model for ebooks. There are plenty of economic problems, but whenever I talk to people about ebooks, the conversation is always some variation of "I love reading ebooks through my library". Most library users are perfectly happy pretending that their digital ebooks are just like the printed books.
 
robot writing on an ipad
A decade later, we need to change our perspective. It's time we seriously started pretending that printed books are just like ebooks, not just the other way around. The library world has been doing something called "Controlled Digital Lending" (CDL), which flips the "pretend it's print" model and pretends that print is just like digital. The basic idea behind controlled digital lending is that owning a print book should allow you to read it any way you want, even if that involves creating a digital substitute for it. A library that owns a print book ought to be able to lend it, as long as it's lent to only one person at a time. It's as if books were printed and sold in order to spread ideas and information!
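
The "controlled" part is easy to make concrete. Here's a toy sketch in Python (my illustration, not any actual CDL implementation) of the own-to-loan rule: a library that owns N copies lends at most N digital surrogates at a time.

```python
class CDLTitle:
    """A title under Controlled Digital Lending: loans never exceed owned copies."""

    def __init__(self, owned_copies: int):
        self.owned_copies = owned_copies
        self.checked_out = set()  # patrons currently holding the digital surrogate

    def checkout(self, patron: str) -> bool:
        if len(self.checked_out) < self.owned_copies and patron not in self.checked_out:
            self.checked_out.add(patron)
            return True
        return False  # all owned copies are already lent out

    def checkin(self, patron: str) -> None:
        self.checked_out.discard(patron)

book = CDLTitle(owned_copies=1)
print(book.checkout("alice"))  # True: the one copy is now lent
print(book.checkout("bob"))    # False: the owned-to-loaned ratio is enforced
book.checkin("alice")
print(book.checkout("bob"))    # True: the copy came back
```

The real systems add DRM and loan expirations on top, but this one-in, one-out invariant is the whole pretense: the digital surrogate circulates exactly like the print copy would.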
 
Of course radical ideas such as spreading information have to be stopped. And so we have the Hachette v. Internet Archive lawsuit and its assorted fallout. I'm not a lawyer, so I won't say much about the legal validity of the arguments on either side. I'm an ebook technologist, so I will explain to you that the whole lawsuit was about whether the other side was sufficiently serious about pretending that print books are just like ebooks and that ebooks are just like print books. Also that the other side doesn't understand how print books are completely different things than ebooks. Those lawyers really take to heart the White Queen's recommendation to believe six impossible things before breakfast.
 
The magic of technology is that it can make our pretendings into something real. So let's think a bit about how we can make the pretense of print-ebook equivalency more real, and if the resulting bargain makes any sense.
 
Here are some ways that we could make these ebooks, derived from printed books, more like print books:
  1. Speed. It takes me an hour or so to get a print book from a library. Should I be able to get the digital substitute in a minute? Should I be able to read a chapter and then "return" it so that someone else can use it the next second? CDL already puts some limits on this, but maybe there could be a standard that makes the digital surrogate more like the real thing?

  2. Geography. Printed books need to be transported to where the reader is. Once digitized, they could go anywhere! Maybe something like a shipping fee could be attached to a loan or other transfer. Maybe part of the fee could accrue to creators? Academic libraries have long done interlibrary loan of journal articles by copying and mailing the article, so why not do something equivalent for books?

These two attributes matter a lot in defining commercial markets for books and ebooks, and will become increasingly important as distribution technologies scale up and improve. Although publishers today make most of their money on the most popular books, book sales and usage of books in libraries have very long tails. There are millions of books for which global demand could be met by aggressive CDL of just a few copies. The CDL system instituted by Internet Archive also has a countervailing effect - the world-wide availability combined with so-so EPUB quality and usability probably results in stimulation of demand for print copies. This effect is likely to diminish as technologists like me smooth out the DRM speedbumps in CDL and begin to apply machine learning to EPUB generation.
 
It's worth noting that the "long tail" in book publishing also applies to authors and publishers. It's likely that the Internet Archive's CDL service has a larger market effect (whether positive or negative) on these long-tail participants than on the bestsellers.
 
Here are some ways that we shouldn't make ebooks more like print books:
  1. Search. Ebooks make search much easier than in print books. Maybe search should be disabled in CDL ebooks? Or maybe, we could enable search in print books. Google Books already sort of does this, if you have the right edition, but the process of making an ebook from a print book should give you an easy way to enable search in the print!

  2. Accessibility. Many reading-disabled users rely on ebooks for access to literature, science and culture. Older adults such as myself often find that flowable text with adjustable font size is easier on our eyes. In addition to international treaties that treat accessible text as an exception to copyright, most authors and publishers don't want to be monsters.

  3. Smell. Let's not go there.

  4. Privacy. The intellectual property world seems to think that copyright gives them the right to monitor and data-mine the behavior of readers on digital platforms. In some cases, copyright extremists have required root access to our devices so they can sniff out infringing files or behavior. (While they're at it, they might as well mine some bitcoin!) It is an outrage to think anyone who makes ebooks from print books would wire them with surveillance tools; the strong privacy policies of Internet Archive should be codified for CDL.

  5. Preservation. Publishers do a terrible job of preserving the lion's share of the printed books they publish, and society has always relied on libraries for this essential service. In this digital age, any grand bargain on copyrights has to provide libraries with the rights and incentives needed to do digital preservation of both printed and digital books.

The bottom line is that if we're going to continue to pretend that intellectual property is a real thing, we need to start pretending that printed books are like ebooks, and vice versa. A grand bargain that benefits us all can eventually make these illusions real.

Notes: 

  1. Copyability. CDL books, like publisher-created ebooks, rely on device-enforced restrictions on duplication (DRM). Printed books rely on the expense of copying machines and paper to limit reproduction. In both cases, social norms and legal strictures discourage unauthorized reproduction. Building those social norms is what creating a grand bargain is all about.
  2.  Simultaneous use. Allowing simultaneous use of library ebooks during the pandemic is what really got the publishers mad at Internet Archive. A lot of people went mad during the lockdown, to be honest, and we're still recovering. 
  3.  Comments. I encourage comment on the Fediverse or on Bluesky. I've turned off commenting here.