Friday, March 21, 2025

AI bots are destroying Open Access

There's a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers: anyone who is working to make quality information universally available on the internet. And the technologists defending against this broad-based attack are doing everything they can to keep their sites online while trying to remain true to the mission of providing the digital lifeblood of science and culture to the world.

Yes, many of these beloved institutions are under financial pressures in the current political environment, but politics swings back and forth. The AI armies are only growing more aggressive, more rapacious, more deceitful and ever more numerous.

I'm talking about the voracious hunger of AI companies for good data to train Large Language Models (LLMs). These are the trillion-parameter sets of statistical weights that power things like Claude, ChatGPT and hundreds of systems you've never heard of. Good training data has lots of text and lots of metadata; it's reliable and unbiased. It's unsullied by Search Engine Optimization (SEO) practitioners. It doesn't constantly interrupt the narrative flow to try to get you to buy stuff. It's multilingual, subject-specific, and written by experts. In other words, it's like a library.

At last week's Code4lib conference hosted by Princeton University Library, technologists from across the library world gathered to share information about library systems, how to make them better, how to manage them, and how to keep them running. The hot topic, the thing everyone wanted to talk about, was how to deal with bots from the dark side.

[Image: robot head emoji with the eyes of Sauron]

Bots on the internet are nothing new, but a sea change has occurred over the past year. For the past 25 years, anyone running a web server knew that the bulk of traffic was one sort of bot or another. There was Googlebot, which was quite polite, and everyone learned to feed it; otherwise no one would ever find the delicious treats we were trying to give away. There were lots of search engine crawlers working to develop this or that service. You'd get "script kiddies" trying thousands of prepackaged exploits. A server secured and patched by a reasonably competent technologist would have no difficulty ignoring these.

The old-style bots were rarely a problem. They respected robots.txt exclusions and "nofollow" hints, which helped them avoid volatile resources and infinite parameter spaces. Even when they ignored exclusions they seemed to be careful about it. They declared their identity in "user-agent" headers. They limited the request rate and the number of simultaneous requests to any particular server. Occasionally there would be a malicious bot like a card-tester or a registration spammer, and you'd often have to block these by IP address, but that was part of the landscape, not the dominant feature.
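
For contrast, here's a minimal sketch, in Python, of what that old-style politeness looks like in code: check robots.txt before fetching, declare who you are in the user-agent header, and throttle your own requests. The crawler identity and site below are placeholders, not real services.

    # Minimal sketch of an "old style" polite crawler: it checks robots.txt,
    # identifies itself, and rate-limits its own requests.
    import time
    import urllib.robotparser
    import urllib.request

    USER_AGENT = "ExampleLibraryCrawler/1.0 (contact@example.org)"  # hypothetical identity
    BASE = "https://example.org"  # placeholder site

    robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    robots.read()

    def polite_fetch(url, delay=1.0):
        """Fetch a URL only if robots.txt allows it, then pause."""
        if not robots.can_fetch(USER_AGENT, url):
            return None  # respect the exclusion
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(delay)  # limit the request rate to roughly one page per second
        return body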

The current generation of bots is mindless. They use as many connections as you have room for; if you add capacity, they just ramp up their requests. They use randomly generated user-agent strings. They come from large blocks of IP addresses. They get trapped in endless hallways. I observed one bot asking for 200,000 nofollow redirect links pointing at OneDrive, Google Drive and Dropbox (which of course didn't work, but OneDrive decided to stop serving our Canadian human users). They use up server resources: one speaker at Code4lib described a bug where software they were running used 32-bit integers for session identifiers, and it ran out of them!
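
You can see these symptoms in any server's access log. Here's a rough sketch, assuming a combined-format log at a hypothetical path, of the kind of tallying several speakers described: group requests by /24 network and count distinct user-agent strings, a number that balloons when the agents are randomly generated.

    # Rough sketch: tally requests per /24 network and count distinct
    # user-agent strings, a telltale sign of randomized bot traffic.
    import ipaddress
    import re
    from collections import defaultdict

    LOG = "/var/log/nginx/access.log"  # hypothetical location
    # combined log format: ip - user [time] "request" status bytes "referer" "user-agent"
    LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    hits = defaultdict(int)    # requests per network
    agents = defaultdict(set)  # distinct user-agent strings per network

    with open(LOG) as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue
            ip, ua = m.groups()
            if ":" in ip:  # IPv6: just group by the address itself
                net = ip
            else:
                net = str(ipaddress.ip_network(f"{ip}/24", strict=False))
            hits[net] += 1
            agents[net].add(ua)

    for net, count in sorted(hits.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{net}\t{count} requests\t{len(agents[net])} user-agents")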

The good guys are trying their best. They're sharing block lists and bot signatures. Many libraries are routinely blocking entire countries (nobody in China could possibly want books!) just to be able to serve a trickle of local requests. They are using commercial services such as Cloudflare to outsource their bot-blocking and captchas, without knowing for sure what these services are blocking, how they're doing it, or whether user privacy and accessibility are being flushed down the toilet. But nothing seems to offer anything but temporary relief. Not that there's anything bad about temporary relief, but we know the bots just intensify their attack on other content stores.
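
Shared block lists are at least cheap to act on. Here's a minimal sketch, assuming a plain-text file of CIDR ranges (one per line) like the lists being passed around:

    # Minimal sketch: check an incoming address against a shared blocklist
    # of CIDR ranges. Assumes a plain-text file, one range per line.
    import ipaddress

    def load_blocklist(path="blocklist.txt"):  # hypothetical shared file
        """Read CIDR ranges (one per line, # for comments) into network objects."""
        nets = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    nets.append(ipaddress.ip_network(line, strict=False))
        return nets

    def is_blocked(addr, nets):
        """True if the address falls inside any blocklisted range."""
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in nets)

    # nets = load_blocklist()
    # is_blocked("203.0.113.7", nets)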

[Screenshot: direct.mit.edu stuck at a Cloudflare check: "Verifying you are human. This may take a few seconds... Verification is taking longer than expected."]
The view of MIT Press's Open-Access site from the Wayback Machine.

The surge of AI bots has hit Open Access sites particularly hard, as their mission conflicts with the need to block bots. Consider that the Internet Archive can no longer save snapshots of one of the best open-access publishers, MIT Press, because of Cloudflare blocking (see above). Who knows how many books will be lost this way? Or consider that the bots took down OAPEN, the world's most important repository of scholarly OA books, for a day or two. That's 34,000 books that AI "checked out" for two days. Or the recent outages at Project Gutenberg, which serves 2 million dynamic pages and a half million downloads per day. That's hundreds of thousands of downloads blocked! The link checker at doab-check.ebookfoundation.org (a project I worked on for OAPEN) is now showing 1,534 books that are unreachable due to "too many requests". That's 1,534 books that AI has stolen from us! And it's getting worse.
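
"Too many requests" is HTTP status 429, the response a server sends when it's rate-limiting. Here's a sketch of what a link checker like doab-check runs into; the commented-out URL is a placeholder, not a real DOAB record:

    # Sketch of a link check: an HTTP 429 ("Too Many Requests") response means
    # the book's landing page can't be verified, even though the book itself
    # is open access.
    import urllib.error
    import urllib.request

    def check_link(url):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.status  # 200 means the book is reachable
        except urllib.error.HTTPError as e:
            if e.code == 429:
                retry = e.headers.get("Retry-After")
                print(f"rate limited; Retry-After={retry}")
            return e.code

    # check_link("https://example.org/handle/20.500.12345/99999")  # placeholder URL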

Thousands of developer hours are being spent on defense against the dark bots and those hours are lost to us forever. We'll never see the wonderful projects and features they would have come up with in that time.

The thing that gets me REALLY mad is how unnecessary this carnage is. Project Gutenberg makes all its content available with one click on a file in its feeds directory. OAPEN makes all its books available via an API. There's no need to make a million requests to get this stuff!! Who (or what) is programming these idiot scraping bots? Have they never heard of a sitemap??? Are they summer interns using ChatGPT to write all their code? Who gave them infinite memory, CPUs and bandwidth to run these monstrosities? (Don't answer.)
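
To make the point concrete, here's a sketch of harvesting a site's page list from its sitemap instead of blindly crawling every link; one request can replace thousands. The sitemap URL is a generic assumption, and real sites usually advertise theirs in robots.txt.

    # Sketch: enumerate a site's pages from its sitemap instead of crawling
    # every link. The sitemap URL below is a generic assumption.
    import urllib.request
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def sitemap_urls(sitemap_url):
        with urllib.request.urlopen(sitemap_url) as resp:
            root = ET.parse(resp).getroot()
        if root.tag.endswith("sitemapindex"):
            # a sitemap index points at more sitemaps; recurse into each one
            for loc in root.findall("sm:sitemap/sm:loc", NS):
                yield from sitemap_urls(loc.text.strip())
        else:
            for loc in root.findall("sm:url/sm:loc", NS):
                yield loc.text.strip()

    # for url in sitemap_urls("https://example.org/sitemap.xml"):
    #     print(url)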

We are headed for a world in which all good information is locked up behind secure registration barriers and paywalls, and it won't be to make money, it will be for survival. Captchas will only be solvable by advanced AIs and only the wealthy will be able to use internet libraries.

Or maybe we can find ways to destroy the bad bots from within. I'm thinking a billion rickrolls?

Notes:

  1. I've found that I can no longer offer more than 2 facets of faceted search. Another problematic feature is "did you mean" links. AI bots try to follow every link you offer even if there are a billion different ones.
  2. Two projects, iocaine and nepenthes, enable the construction of "tarpits" for bots: automated infinite mazes that bots get stuck in, perhaps keeping them occupied so they don't bother anyone else. I'm skeptical. (A toy sketch of the idea follows these notes.)
  3. Here is an implementation of the Cloudflare Turnstile service (supposedly free) that was mentioned favorably at the conference.
  4. It's not just open access, it's also Open Source.
  5. Cloudflare has announced an "AI honeypot". Should be interesting.
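
For the curious, here's a toy sketch of the tarpit idea from note 2. It's not the actual iocaine or nepenthes code, just the general shape: every page is generated on the fly and links only to more generated pages, so a crawler that ignores robots.txt wanders forever.

    # Toy tarpit sketch (not the real iocaine or nepenthes): every page is
    # generated on the fly and links only to more generated pages.
    import hashlib
    from wsgiref.simple_server import make_server

    def page(path):
        links = []
        for i in range(5):
            # derive stable but meaningless child paths from the current one
            h = hashlib.sha256(f"{path}/{i}".encode()).hexdigest()[:12]
            links.append(f'<a href="/maze/{h}">{h}</a>')
        return f"<html><body><p>{path}</p>{' '.join(links)}</body></html>"

    def app(environ, start_response):
        body = page(environ.get("PATH_INFO", "/")).encode()
        start_response("200 OK", [("Content-Type", "text/html")])
        return [body]

    if __name__ == "__main__":
        make_server("127.0.0.1", 8080, app).serve_forever()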
