General Discussion
AI bots are destroying Open Access
TLDR:
We are headed for a world in which all good information is locked up behind secure registration barriers and paywalls, and it won't be to make money, it will be for survival. Captchas will only be solvable by advanced AIs and only the wealthy will be able to use internet libraries.
https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html
The good guys are trying their best. They're sharing block lists and bot signatures. Many libraries are routinely blocking entire countries (nobody in China could possibly want books!) just to be able to serve a trickle of local requests. They are using commercial services such as Cloudflare to outsource their bot-blocking and captchas, without knowing for sure what these services are blocking, how they're doing it, or whether user privacy and accessibility are being flushed down the toilet. But nothing seems to offer anything but temporary relief. Not that there's anything bad about temporary relief, but we know the bots just intensify their attack on other content stores.
The surge of AI bots has hit Open Access sites particularly hard, as their mission conflicts with the need to block bots. Consider that Internet Archive can no longer save snapshots of one of the best open-access publishers, MIT Press, because of Cloudflare blocking (see above). Who knows how many books will be lost this way? Or consider that the bots took down OAPEN, the world's most important repository of Scholarly OA books, for a day or two. That's 34,000 books that AI "checked out" for two days. Or recent outages at Project Gutenberg, which serves 2 million dynamic pages and a half million downloads per day. That's hundreds of thousands of downloads blocked! The link checker at doab-check.ebookfoundation.org (a project I worked on for OAPEN) is now showing 1,534 books that are unreachable due to "too many requests". That's 1,534 books that AI has stolen from us! And it's getting worse.
...
The thing that gets me REALLY mad is how unnecessary this carnage is. Project Gutenberg makes all its content available with one click on a file in its feeds directory. OAPEN makes all its books available via an API. There's no need to make a million requests to get this stuff!! Who (or what) is programming these idiot scraping bots? Have they never heard of a sitemap??? Are they summer interns using ChatGPT to write all their code? Who gave them infinite memory, CPUs and bandwidth to run these monstrosities? (Don't answer.) We are headed for a world in which all good information is locked up behind secure registration barriers and paywalls, and it won't be to make money, it will be for survival. Captchas will only be solvable by advanced AIs and only the wealthy will be able to use internet libraries.
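To make the sitemap point concrete, here is a rough sketch of how a well-behaved crawler could read one sitemap file and fetch only what changed, instead of hammering every page. The sample XML, the example.org URLs, and the function name `urls_changed_since` are all made up for illustration; only the sitemap namespace is the standard one. This is not how any particular bot or library actually works.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real crawler would download this once
# from the site instead of requesting every page.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/books/1</loc><lastmod>2025-01-10</lastmod></url>
  <url><loc>https://example.org/books/2</loc><lastmod>2025-03-02</lastmod></url>
  <url><loc>https://example.org/books/3</loc></url>
</urlset>
"""


def urls_changed_since(sitemap_xml: str, since: str) -> list[str]:
    """Return URLs whose <lastmod> is on or after `since` (YYYY-MM-DD).

    Entries without <lastmod> are included, since we cannot tell
    whether they changed.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        # ISO dates compare correctly as plain strings.
        if not lastmod or lastmod[:10] >= since:
            changed.append(loc)
    return changed


print(urls_changed_since(SAMPLE_SITEMAP, "2025-02-01"))
```

One small file tells the crawler what exists and what is new, so the other million requests never need to happen.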
And about CAPTCHA, pick your poison:
Google's reCAPTCHA is not only useless, it's also basically spyware
reCAPTCHA v2's checkbox test doesn't stop bots and tracks user data
https://www.techspot.com/news/106717-google-recaptcha-not-only-useless-also-basically-spyware.html
Tracking data Google collects from Captchas carries an estimated value of nearly $898 billion. Furthermore, a lawsuit against the search giant for using reCAPTCHA v2 inputs to train AI revealed that the 819 million hours users spent clicking on the tests worked out to about $6.1 billion in unpaid wages.
The UC Irvine study concluded that Google should retire reCAPTCHA v2 and similar tools. An Austrian federal court has already banned the technology, finding that it violates users' privacy rights under the GDPR.

This is really important. We're already seeing the end of the Internet inventors' original intent: a place where every person can find all the information that exists ... free.
hunter
(39,474 posts)
It distresses me that so many promising and innovative technologies end up being used for shitty purposes.
I find the modern internet, raw and unfiltered, to be unbearable.
flamingdem
(40,399 posts)
Maybe in one paragraph? Please, it looks important but too much inside baseball.
usonian
(17,477 posts)
Bots are hammering every website looking for AI training data. There are many, and the load is buckling sites not built for such repetitive load.
Sites post requests to stay away in a "robots.txt" file that the webmaster installs, but the bots just ignore it, so webmasters have to build software walls, or ask Cloudflare to do so. DU does that to prevent hack attacks.
I'll update if that's not enough, or too much. I quoted a high tech site.
It's good that you asked the "AI bots for dummies" question. It's THAT important.
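For anyone curious what honoring "robots.txt" looks like, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt text, the bot name `ExampleAIBot`, and the example.org URLs are invented for illustration; a real crawler would fetch the file from the site it wants to visit. The point is only that checking takes a few lines, so ignoring it is a choice.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; real sites publish their own at /robots.txt.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /search
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """True if the site's rules allow this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The named AI bot is asked to stay out entirely.
print(may_fetch(parser, "ExampleAIBot", "https://example.org/books/1"))
# Everyone else may read book pages, but not the search endpoint.
print(may_fetch(parser, "SomeBrowser", "https://example.org/books/1"))
print(may_fetch(parser, "SomeBrowser", "https://example.org/search?q=x"))
# The site also asks other clients to wait between requests (seconds).
print(parser.crawl_delay("SomeBrowser"))
```

Nothing enforces these rules. They are a polite request, which is why sites end up behind software walls when bots skip the check.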
flamingdem
(40,399 posts)
related to the stock market. Looks like cybersecurity demand will skyrocket.
I can see why.
usonian
(17,477 posts)
is that when a ton of sites use it, it BECOMES the internet, in terms of what you go through.
I say that because I use iCloud "hide my address" and name server, and this pisses off Cloudflare, and I get many detestable CAPTCHAS, mentioned above.
I have aging eyes and I hate these eye exams. I just drop the site request.
And worse, the regime is firing security experts, and rescinding Biden directives on computer security, so one can expect hackers mainly from Russia, China, North Korea, etc. to destroy government computers and haul away data.
justaprogressive
(3,544 posts)
the Audio option... quicker than finding buses
flamingdem
(40,399 posts)
Any idea about the motives? Rather than the usual handing the country over to our enemies.
haele
(14,173 posts)
Stupid people and AI bots won't question their decisions.
The people making these decisions are narcissistic bubble dwellers with no interest in the rest of the world around them.
usonian
(17,477 posts)
1. Just undo whatever Biden did.
2. Put the defense in the hands of sycophants. Disaster? Declare martial law.
All power plays.
And he owes Putin anyway.