Friday, March 21, 2025

AI bots are destroying Open Access

There's a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers: anyone who is working to make quality information universally available on the internet. And the technologists defending against this broad-based attack are doing everything they can to preserve their outlets while trying to remain true to the mission of providing the digital lifeblood of science and culture to the world.

Yes, many of these beloved institutions are under financial pressures in the current political environment, but politics swings back and forth. The AI armies are only growing more aggressive, more rapacious, more deceitful and ever more numerous.

I'm talking about the voracious hunger of AI companies for good data to train Large Language Models (LLMs). These are the trillion-parameter sets of statistical weights that power things like Claude, ChatGPT and hundreds of systems you've never heard of. Good training data has lots of text and lots of metadata, and it's reliable and unbiased. It's unsullied by Search Engine Optimization (SEO) practitioners. It doesn't constantly interrupt the narrative flow to try to get you to buy stuff. It's multilingual, subject-specific, and written by experts. In other words, it's like a library.

At last week's Code4lib conference hosted by Princeton University Library, technologists from across the library world gathered to share information about library systems, how to make them better, how to manage them, and how to keep them running. The hot topic, the thing everyone wanted to talk about, was how to deal with bots from the dark side.

[Image: robot head emoji with the eyes of Sauron]

Bots on the internet are nothing new, but a sea change has occurred over the past year. For the past 25 years, anyone running a web server knew that the bulk of traffic was one sort of bot or another. There was googlebot, which was quite polite, and everyone learned to feed it - otherwise no one would ever find the delicious treats we were trying to give away. There were lots of search engine crawlers working to develop this or that service. You'd get "script kiddies" trying thousands of prepackaged exploits. A server secured and patched by a reasonably competent technologist would have no difficulty ignoring these.

The old-style bots were rarely a problem. They respected robot exclusions and "nofollow" warnings, which helped them avoid volatile resources and infinite parameter spaces. Even when they ignored exclusions, they seemed to be careful about it. They declared their identity in "user-agent" headers. They limited the request rate and the number of simultaneous requests to any particular server. Occasionally there would be a malicious bot, like a card-tester or a registration spammer, that you'd have to block by IP address. That was part of the landscape, not the dominant feature.
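For what it's worth, the old conventions fit in a few lines of code. Here's a minimal sketch (in Python, with a made-up bot name and site) of what a well-behaved crawler does: check robots.txt, say who it is, and slow down.

```python
# A minimal sketch of "polite" crawling: check robots.txt, declare an honest
# user-agent, and wait between requests. The bot name and URLs are made up.
import time
import urllib.robotparser
import urllib.request

USER_AGENT = "examplebot/1.0 (+https://example.org/bot-info)"  # hypothetical bot
SITE = "https://library.example.org"                           # hypothetical site

rp = urllib.robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

def polite_fetch(path, delay=5.0):
    """Fetch a page only if robots.txt allows it, then pause before returning."""
    url = SITE + path
    if not rp.can_fetch(USER_AGENT, url):
        return None  # respect the exclusion instead of crawling anyway
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    time.sleep(delay)  # limit the request rate to one every few seconds
    return body
```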

The current generation of bots is mindless. They use as many connections as you have room for. If you add capacity, they just ramp up their requests. They use randomly generated user-agent strings. They come from large blocks of IP addresses. They get trapped in endless hallways. I observed one bot asking for 200,000 nofollow redirect links pointing at OneDrive, Google Drive and Dropbox (which of course didn't work, but OneDrive decided to stop serving our Canadian human users). They use up server resources - one speaker at Code4lib described a bug where software they were running used 32-bit integers for session identifiers, and ran out of them!

The good guys are trying their best. They're sharing block lists and bot signatures. Many libraries are routinely blocking entire countries (nobody in China could possibly want books!) just to be able to serve a trickle of local requests. They're using commercial services such as Cloudflare to outsource their bot-blocking and captchas, without knowing for sure what these services are blocking, how they're doing it, or whether user privacy and accessibility are being flushed down the toilet. But nothing offers anything more than temporary relief. Not that there's anything wrong with temporary relief, but we know the bots just intensify their attacks on other content stores.
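Most of the stopgaps boil down to the same idea: recognize that a swarm of "different" clients is really one crawler, and cut it off. Here's a rough sketch (Python, with an arbitrary threshold; not any particular library's production setup) of counting requests per /24 network rather than per single IP, since the new bots rotate addresses within large blocks.

```python
# A rough sketch of flagging abusive traffic by network block rather than by
# single IP address. The threshold is arbitrary and just for illustration.
import ipaddress
from collections import Counter

REQUESTS_PER_NETWORK_LIMIT = 1000  # arbitrary cutoff for this example

def networks_to_block(client_ips, prefix=24):
    """Group request IPs into /24 networks and return the ones over the limit."""
    counts = Counter()
    for ip in client_ips:
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[net] += 1
    return [net for net, n in counts.items() if n > REQUESTS_PER_NETWORK_LIMIT]

# e.g., feed it the source addresses parsed out of an access log:
# blocked = networks_to_block(ips_from_access_log)
```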

[Screenshot of direct.mit.edu behind a Cloudflare challenge: "Verifying you are human. This may take a few seconds. direct.mit.edu needs to verify the security of your connection before proceeding. Verification is taking longer than expected. Check your internet connection and refresh the page if the issue persists."]
The view of MIT Press's Open-Access site from the Wayback Machine.

The surge of AI bots has hit Open Access sites particularly hard, as their mission conflicts with the need to block bots. Consider that the Internet Archive can no longer save snapshots of one of the best open-access publishers, MIT Press, because of Cloudflare blocking (see above). Who knows how many books will be lost this way? Or consider that the bots took down OAPEN, the world's most important repository of scholarly OA books, for a day or two. That's 34,000 books that AI "checked out" for two days. Or the recent outages at Project Gutenberg, which serves 2 million dynamic pages and a half million downloads per day. That's hundreds of thousands of downloads blocked! The link checker at doab-check.ebookfoundation.org (a project I worked on for OAPEN) is now showing 1,534 books that are unreachable due to "too many requests". That's 1,534 books that AI has stolen from us! And it's getting worse.

Thousands of developer hours are being spent on defense against the dark bots and those hours are lost to us forever. We'll never see the wonderful projects and features they would have come up with in that time.

The thing that gets me REALLY mad is how unnecessary this carnage is. Project Gutenberg makes all its content available with one click on a file in its feeds directory. OAPEN makes all its books available via an API. There's no need to make a million requests to get this stuff!! Who (or what) is programming these idiot scraping bots? Have they never heard of a sitemap??? Are they summer interns using ChatGPT to write all their code? Who gave them infinite memory, CPUs and bandwidth to run these monstrosities? (Don't answer.)
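For comparison, here's roughly what a sane bulk harvester could look like: read the sitemap once, then fetch the listed files with a delay. (A Python sketch; the sitemap URL is a placeholder, and real sites may split sitemaps into indexes or offer dedicated bulk feeds instead.)

```python
# A sketch of bulk harvesting the sane way: read the sitemap once, then fetch
# each listed URL with a pause. The sitemap URL below is a placeholder.
import time
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://openbooks.example.org/sitemap.xml"  # hypothetical
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_url):
    """Return every <loc> URL listed in a plain (non-index) sitemap."""
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    return [loc.text for loc in tree.findall(".//sm:loc", NS)]

for url in urls_from_sitemap(SITEMAP_URL):
    with urllib.request.urlopen(url) as resp:
        resp.read()  # one request per document, no guessing at URLs
    time.sleep(2)    # no need to saturate anyone's connection
```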

We are headed for a world in which all good information is locked up behind secure registration barriers and paywalls, and it won't be to make money; it will be for survival. Captchas will only be solvable by advanced AIs, and only the wealthy will be able to use internet libraries.

Or maybe we can find ways to destroy the bad bots from within. I'm thinking a billion rickrolls?

Notes:

  1. I've found that I can no longer offer more than two facets of faceted search. Another problematic feature is "did you mean" links. AI bots try to follow every link you offer, even if there are a billion different ones.
  2. Two projects, iocaine and nepenthes, are enabling the construction of "tarpits" for bots: automated infinite mazes that bots get stuck in, perhaps keeping them occupied so they don't bother anyone else. I'm skeptical.
  3. Here is an implementation of the Cloudflare Turnstile service (supposedly free) that was mentioned favorably at the conference.
  4. It's not just Open Access; it's also Open Source.
  5. Cloudflare has announced an "AI honeypot". Should be interesting.
  6. One way for Open Access sites to encourage good bot behavior is to provide carrots to good robots. For this reason, it would be good to add Common Crawl to greenlists: https://commoncrawl.org/ccbot (see the sketch after these notes).
  7. Ian Mulvaney (BMJ) concurs
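As a sketch of what a greenlist could look like on the server side (Python; the allowlist entry is an example, and a real deployment would also verify a claimed bot against its published IP ranges, since user-agent strings are trivially forged):

```python
# A sketch of greenlisting declared good crawlers by user-agent substring.
# Verify claimed bots by IP range in real use; user-agent strings are
# trivially forged.
GOOD_BOT_SUBSTRINGS = ("CCBot",)  # Common Crawl's crawler; add other known good bots here

def is_greenlisted(user_agent: str) -> bool:
    """Exempt declared good bots from the throttling applied to everyone else."""
    return any(bot in user_agent for bot in GOOD_BOT_SUBSTRINGS)
```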

Tuesday, February 11, 2025

Strava Verse

[Image: Strava route that looks like an elephant]
The internet gives us new ways to express ourselves. One of the more strenuously esoteric forms of artistic expression is Strava art, in which people do runs that, when mapped, draw pictures. None of my Strava art was particularly good, but my running club friends in Stockholm regularly run "elefanten". I spent a year attempting "Found Strava Art", where you just run a new route and give the run a name based on what it looks like. I ran a lot of flowers and spaceships, but meh. Last year I named each run with a line from a song that came up on my iPod. Too obscure.

This year I decided to serialize poems with my Strava runs. I didn't have a plan, but I started with Jabberwocky. It seemed appropriate to comment using nonsense words, because, Jabberwocky. I ended up with this:

’Twas brillig, and the slithy toves did gyre and gimble in the wabe
I love running with my slithy toves!
All mimsy were the borogoves, and the mome raths outgrabe.
My right knee was a grobble mimsy today, but mome what a rath!  
Beware the Jabberwock, my son!
Also, the Jabberrun can be hard on the knees.
The jaws that bite, the claws that catch!
ERC hosted run had quiche to bite and George to catch.

He took his vorpal sword in hand
New York Sirens game. Women with vorpal sticks. Slain by the Charge 3-2.
Beware the Jubjub bird, and shun the frumious Bandersnatch!
Definitely well salted and frumious out there today.
Long time the manxome foe he sought
But quick the manxless chill he caught
So rested he by the Tumtum tree
Covered with snow in filagree
And stood a while in thought.
Though clabbercing in a profunctional dot!

And, as in uffish thought he stood
Trolloping thru the Brookdale wood.
The Jabberwock, with eyes of flame
Cheld and hord, a glistering name…
Came whiffling through the tulgey wood
And caught the two burblygums because he could.
And burbled as it came!
So late the Jabberrun slept
For Eight Muyibles passed as though aflame
O'er Curbles and Nonces the pluffy sheep leapt.

One, two! One, two! And through and through
Three four! Three four! Sankofa’s coffee’s fit to pour.
The vorpal blade went snicker-snack!
The Icebeest of Hoth kept blobbering back.
He went galumphing back.
He left it dead, and with its head
... the Garmind sprang to life

And hast thou slain the Jabberwock?
The ice, the snow, it's hard as rock.
Come to my arms, my beamish boy!
Think of my knees! Oy oy oy oy.
O frabjous day! Callooh! Callay!”
O jousbarf night! The fluss! The fright!
He chortled in his joy.
(And padoodled the rest of the way!)

‘Twas brillig and the slithy toves
Did not, had not, could not loave.
Did gyre and gimble in the wabe
“Dunno.” said the wormly autoclave
All mimsy were the borogoves,
Again and again, beloo and aboave
And the mome raths outgrabe.
The end. Ooh ooh Babe!

Terrible, right? But it has its moments.

I've started a new one. I fear it will get more topical.

Notes: