
usonian

(17,477 posts)
Tue Mar 25, 2025, 03:08 PM

AI bots are destroying Open Access

TLDR:

We are headed for a world in which all good information is locked up behind secure registration barriers and paywalls, and it won't be to make money, it will be for survival. Captchas will only be solvable by advanced AIs and only the wealthy will be able to use internet libraries.

https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html

The current generation of bots is mindless. They use as many connections as you have room for. If you add capacity, they just ramp up their requests. They use randomly generated user-agent strings. They come from large blocks of IP addresses. They get trapped in endless hallways. I observed one bot asking for 200,000 nofollow redirect links pointing at OneDrive, Google Drive and Dropbox (which of course didn't work, but OneDrive decided to stop serving our Canadian human users). They use up server resources - one speaker at Code4lib described a bug where software they were running was using 32-bit integers for session identifiers, and it ran out!

The good guys are trying their best. They're sharing block lists and bot signatures. Many libraries are routinely blocking entire countries (nobody in China could possibly want books!) just to be able to serve a trickle of local requests. They are using commercial services such as Cloudflare to outsource their bot-blocking and captchas, without knowing for sure what these services are blocking, how they're doing it, or whether user privacy and accessibility are being flushed down the toilet. But nothing seems to offer anything but temporary relief. Not that there's anything bad about temporary relief, but we know the bots just intensify their attack on other content stores.


The surge of AI bots has hit Open Access sites particularly hard, as their mission conflicts with the need to block bots. Consider that Internet Archive can no longer save snapshots of one of the best open-access publishers, MIT Press, because of Cloudflare blocking. (see above) Who knows how many books will be lost this way? Or consider that the bots took down OAPEN, the world's most important repository of Scholarly OA books, for a day or two. That's 34,000 books that AI "checked out" for two days. Or recent outages at Project Gutenberg, which serves 2 million dynamic pages and a half million downloads per day. That's hundreds of thousands of downloads blocked! The link checker at doab-check.ebookfoundation.org (a project I worked on for OAPEN) is now showing 1,534 books that are unreachable due to "too many requests". That's 1,534 books that AI has stolen from us! And it's getting worse.

...

The thing that gets me REALLY mad is how unnecessary this carnage is. Project Gutenberg makes all its content available with one click on a file in its feeds directory. OAPEN makes all its books available via an API. There's no need to make a million requests to get this stuff!! Who (or what) is programming these idiot scraping bots? Have they never heard of a sitemap??? Are they summer interns using ChatGPT to write all their code? Who gave them infinite memory, CPUs and bandwidth to run these monstrosities? (Don't answer.) We are headed for a world in which all good information is locked up behind secure registration barriers and paywalls, and it won't be to make money, it will be for survival. Captchas will only be solvable by advanced AIs and only the wealthy will be able to use internet libraries.
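
To illustrate the sitemap point, here is a minimal sketch in Python, assuming a site publishes a standard sitemaps.org-format sitemap. The sample XML, the example.org URLs, and the function name urls_from_sitemap are made up for illustration; this is not Project Gutenberg's feed or OAPEN's API. The idea is that one small file lists every page, so a crawler has no reason to discover pages by hammering the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content (assumption for illustration only).
# A polite crawler would fetch the site's sitemap once instead of
# following every link it can find.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/books/1</loc></url>
  <url><loc>https://example.org/books/2</loc></url>
</urlset>
"""

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE_SITEMAP):
        print(url)
```

One parse of one file yields the full list of pages, which is the opposite of the million-request approach complained about above.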


And about CAPTCHA, pick your poison:
Google's reCAPTCHA is not only useless, it's also basically spyware
reCAPTCHA v3's checkbox test doesn't stop bots and tracks user data
https://www.techspot.com/news/106717-google-recaptcha-not-only-useless-also-basically-spyware.html

Researchers told Chuppl that the so-called security challenge records not just mouse movements but also user agent data and other identifying information. Furthermore, Chuppl's investigation suggested that Captchas block humans who anonymize their browser data better than they block bots. The assertion makes sense for anyone who has tried to browse the web with a VPN.

Tracking data Google collects from Captchas carries an estimated value of nearly $898 billion. Furthermore, a lawsuit against the search giant for using reCAPTCHA v2 inputs to train AI revealed that the 819 million hours users spent clicking on the tests worked out to about $6.1 billion in unpaid wages.

The UC Irvine study concluded that Google should retire reCAPTCHA v2 and similar tools. An Austrian federal court has already banned the technology, finding that it violates users' privacy rights under the GDPR.


10 replies
AI bots are destroying Open Access (Original Post) usonian Mar 25 OP
Kick defacto7 Mar 25 #1
I've been on the internet since the late 'seventies, and on the world wide web since its creation. hunter Mar 25 #2
Please explain to me like I'm ten flamingdem Mar 26 #3
OK, quick but I may revisit this. usonian Mar 26 #4
Super interesting. I read about Cloudflare today flamingdem Mar 26 #5
The only problem with cloudflare that I see usonian Mar 26 #6
When available you can try justaprogressive Mar 26 #7
That's insane that they're rescinding those directives flamingdem Mar 26 #8
They want people to be stupid, even their own kids. haele Mar 26 #9
Best guesses. usonian Mar 26 #10

defacto7

(14,049 posts)
1. Kick
Tue Mar 25, 2025, 04:03 PM

This is really important. We're already seeing the end of the original intent of the Internet inventors where it's a place for every person to be able to find all information that exists ... free.

hunter

(39,474 posts)
2. I've been on the internet since the late 'seventies, and on the world wide web since its creation.
Tue Mar 25, 2025, 11:21 PM

It distresses me that so many promising and innovative technologies end up being used for shitty purposes.

I find the modern internet, raw and unfiltered, to be unbearable.




flamingdem

(40,399 posts)
3. Please explain to me like I'm ten
Wed Mar 26, 2025, 12:57 AM

Maybe in one paragraph? Please, it looks important but too much inside baseball.

usonian

(17,477 posts)
4. OK, quick but I may revisit this.
Wed Mar 26, 2025, 01:59 AM

Bots are hammering every website looking for AI training data. There are many of them, and the traffic is buckling sites that were never built for such repetitive load.

Sites post their requests to bots in a "robots.txt" file that the webmaster installs, but these bots just ignore it, so webmasters have to build software walls, or ask Cloudflare to do so. DU does that to prevent hack attacks.
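
For the curious, here is a rough sketch in Python of what a well-behaved bot is supposed to do with robots.txt, using the standard library's urllib.robotparser. The robots.txt text, the bot name "ExampleAIBot", the example.org URLs, and the helper name build_parser are all made up for illustration; this is not DU's or anyone's real configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (assumption for illustration only):
# one named bot is banned outright, everyone else is asked to
# stay out of /search and to wait 10 seconds between requests.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    # A polite bot asks before every fetch and skips anything disallowed.
    for agent, url in [
        ("ExampleAIBot", "https://example.org/books/1"),
        ("SomeBrowser", "https://example.org/books/1"),
        ("SomeBrowser", "https://example.org/search?q=x"),
    ]:
        print(agent, url, parser.can_fetch(agent, url))
    print("requested delay:", parser.crawl_delay("SomeBrowser"))
```

Nothing enforces any of this. The file is only a request, which is why a bot that skips the check has to be stopped by a wall instead.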

I'll update if that's not enough, or too much. I quoted a high tech site.

It's good that you asked the "AI bots for dummies" question. It's THAT important.

flamingdem

(40,399 posts)
5. Super interesting. I read about Cloudflare today
Wed Mar 26, 2025, 02:10 AM

related to the stock market. Looks like cybersecurity demand will skyrocket.
I can see why.

usonian

(17,477 posts)
6. The only problem with cloudflare that I see
Wed Mar 26, 2025, 02:22 AM

is that when a ton of sites use it, it BECOMES the internet, in terms of what you go through.

I say that because I use iCloud "hide my address" and its name server, which pisses off Cloudflare, so I get many of the detestable CAPTCHAs mentioned above.

I have aging eyes and I hate these eye exams. I just drop the site request.

And worse, the regime is firing security experts and rescinding Biden directives on computer security, so one can expect hackers, mainly from Russia, China, North Korea, etc., to destroy government computers and haul away data.

flamingdem

(40,399 posts)
8. That's insane that they're rescinding those directives
Wed Mar 26, 2025, 02:18 PM

Any idea about the motives? Other than the usual handing the country over to our enemies.

haele

(14,173 posts)
9. They want people to be stupid, even their own kids.
Wed Mar 26, 2025, 02:28 PM

Stupid people and AI bots won't question their decisions.
The people making these decisions are narcissistic bubble dwellers with no interest in the rest of the world around them.

usonian

(17,477 posts)
10. Best guesses.
Wed Mar 26, 2025, 06:53 PM

1. Just undo whatever Biden did.
2. Put the defense in the hands of sycophants. Disaster? Declare martial law.

All power plays.

And he owes Putin anyway.
