I think enough has already been written on the subject of fighting rogue bots (these days mostly LLM scrapers) that are ruining the web, not only by strip-mining human creativity and turning it into average slop, but especially by taking down hosting infrastructure through uncoordinated crawling that turns into a DDoS.
And it's not only the "bad LLM" companies that engage in this; even Google is slamming anything alive that happens to have its HTTP endpoint open!
Thus, many have started fighting back by employing various techniques:
- CAPTCHAs; (how many buses and bicycles can one mark before going nuts?)
- JavaScript proof-of-work; (if energy and time are to be wasted, why not crypto-mine something? a sketch of the underlying hash puzzle follows this list;) :)
- robot mazes and tarpitting;
- serving generated slop;
- serving poisoned or nonsensical gibberish; (a form of the previous, but with a vengeance;)
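
(For the curious, here is a minimal sketch of the hash-puzzle idea these proof-of-work walls are built on; it is my own generic illustration, not the code of any particular product, and the challenge string, difficulty, and function name are made up for the example.)

```ts
// A minimal sketch of a hash-puzzle proof-of-work: keep trying nonces until
// SHA-256(challenge + nonce) starts with `difficulty` zero hex digits.
// The "work" is the CPU time burned searching for such a nonce.
import { createHash } from "node:crypto";

function solve(challenge: string, difficulty = 5): number {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const digest = createHash("sha256").update(challenge + nonce).digest("hex");
    if (digest.startsWith(prefix)) return nonce;
  }
}

console.log(solve("example-challenge")); // takes a noticeable amount of CPU time, on purpose
```

(The client burns CPU searching for the nonce, while the server only has to verify a single hash; that asymmetry is the whole point of these schemes, and also the source of the wasted energy.)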
(As a small note, I won't use the word AI to mean LLM. LLMs might be a form of AI, but the reverse is not true.)
As a small sidenote, I used to browse the internet with a custom Firefox "reading" profile (a configuration sketch follows this list) that:
- has (had) cookies disabled; (at first, all cookies, but then only third-party ones, because sites started breaking;)
- has JavaScript disabled; (mainly because of all the ad-nonsense and pop-ups; of course, I could use an ad-blocker, but why bother when there is a simpler alternative;)
- has a custom CSS that resets font (family and size), colors, and other aspects; (I like some consistency in my reading;)
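
(For reference, here is a rough sketch of how such a profile could be configured through a `user.js` file in the profile directory; the preference names are standard Firefox preferences as far as I know, but the exact values are an assumption about one way to do it, not a record of my actual setup.)

```js
// user.js -- placed in the Firefox profile directory; a minimal sketch,
// assuming current preference names (double-check them against your Firefox version).
user_pref("javascript.enabled", false);          // disable JavaScript globally
user_pref("network.cookie.cookieBehavior", 1);   // block third-party cookies only
// allow userContent.css (the custom font / color reset) to be loaded:
user_pref("toolkit.legacyUserProfileCustomizations.stylesheets", true);
```

(The CSS reset itself would then live in `chrome/userContent.css` inside the same profile.)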
Back to the bot countermeasures: sadly, they are not all equal, especially in terms of usability!
Lately, more and more sites have become broken for me, in the sense that I can't use my Firefox "reading" profile on them, because they have deployed aggressive bot countermeasures.
Thus, from this perspective, some of these countermeasures make my goal almost impossible to achieve:
- CAPTCHAs almost always require JavaScript, and some don't work well with custom CSS;
- JavaScript proof-of-work, as the name implies, always requires JavaScript; not only that, but more often than not it also requires cookies to store the proof, which means private browsing breaks each time a new session is started;
- (the other countermeasures are OK in this regard;)
Unfortunately, the most widely used countermeasures seem to be the JavaScript-based ones...
While I understand that the situation is dire, and webmasters, bloggers, writers, and others need to fight back against this assault, either to save their infrastructure from crumbling, or to save their intellectual property from being borderline plagiarized, we can't do so by completely destroying the web!
Do I have a solution?
No!
Do I believe the current technical approaches are a good fit?
No, at least not in the long term.
Can I legally take a copyrighted book, read it, perhaps more than once, and then say that, because I don't reproduce it verbatim, word for word, I'm not actually infringing anyone's copyright?
Definitely not!
(Or else the entertainment companies wouldn't be so active against movie piracy...)
Thus, should someone perhaps throw copyright law at the LLM companies?
Definitely yes!
As such, I think throwing technical measures at a mechanized form of piracy is counterproductive in the long term, just as it has been with other forms of piracy!
- not only does it not solve the piracy problem,
- not only does it waste human time and attention to solve mindless puzzles,
- not only does it waste enormous amounts of energy on useless crypto-mining-like tasks,
- it also wastes a lot of development time for the owners of the works it protects,
- and finally, it harms and annoys the actual intended audience!
What am I to do?
For the moment, I think I'll just stick with the following approach: if a site fails to load under my constraints (no JavaScript, my own CSS), I just close the tab and move on. There is more content on the web than I could read in 1000 lifetimes; certainly I haven't missed the long-sought answer to any great mystery of the universe!
Also, I apply the same rule to Twitter, Facebook, and even Mastodon.
Why doesn't Twitter / Facebook work without JavaScript?
Because they don't care.
But why doesn't Mastodon work without JavaScript?
Because they don't care,
they really don't care.
So I don't care either!
Am I pondering adding bot countermeasures to this site?
Yes, but not at the moment.
But I think I would go with a combination of robot maze + zip bomb for the greatest impact. :)
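
(If I ever do, something along the following lines is roughly what I have in mind for the zip bomb half; this is only a minimal sketch, assuming the payload is pre-compressed offline and later served with `Content-Encoding: gzip` from the maze pages; the file name and sizes are arbitrary.)

```ts
// bomb.ts -- a minimal sketch, not a hardened implementation: pre-compress
// ~10 GiB of zeros into a ~10 MiB .gz file; served later with
// "Content-Encoding: gzip", a naive crawler inflates it in memory while the
// server pays almost nothing for the transfer.
import { createGzip } from "node:zlib";
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

// stream `mib` mebibytes of zeros without ever holding them all in memory
function* zeros(mib: number): Generator<Buffer> {
  const chunk = Buffer.alloc(1024 * 1024);   // 1 MiB of zeros per chunk
  for (let i = 0; i < mib; i++) yield chunk;
}

await pipeline(
  Readable.from(zeros(10 * 1024)),           // ~10 GiB uncompressed, streamed
  createGzip(),
  createWriteStream("bomb.gz"),              // ends up at only ~10 MiB on disk
);
```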