I asked ChatGPT about the recent copyright news. It rehashed my latest column and misconstrued the facts. But why was it on my site at all?
https://www.plagiarismtoday.com/2025/07/23/chatgpt-ignores-robots-txt-rehashes-my-column/

Here's #Cloudflare's #robots-txt file:
# Cloudflare Managed Robots.txt to block AI related bots.
User-agent: AI2Bot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: amazon-kendra
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: AwarioRssBot
Disallow: /
User-agent: AwarioSmartBot
Disallow: /
User-agent: bigsur.ai
Disallow: /
User-agent: Brightbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: DigitalOceanGenAICrawler
Disallow: /
User-agent: DuckAssistBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: FriendlyCrawler
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: iaskspider/2.0
Disallow: /
User-agent: ICC-Crawler
Disallow: /
User-agent: img2dataset
Disallow: /
User-agent: Kangaroo Bot
Disallow: /
User-agent: LinerBot
Disallow: /
User-agent: MachineLearningForPeaceBot
Disallow: /
User-agent: Meltwater
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: meta-externalfetcher
Disallow: /
User-agent: Nicecrawler
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: omgili
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: PanguBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: PiplBot
Disallow: /
User-agent: QualifiedBot
Disallow: /
User-agent: Scoop.it
Disallow: /
User-agent: Seekr
Disallow: /
User-agent: SemrushBot-OCOB
Disallow: /
User-agent: Sidetrade indexer bot
Disallow: /
User-agent: Timpibot
Disallow: /
User-agent: VelenPublicWebCrawler
Disallow: /
User-agent: Webzio-Extended
Disallow: /
User-agent: YouBot
Disallow: /
I've had a robots.txt rule blocking ChatGPT from touching my site in place for months. Yet it shows up as a referrer?
Hey, does anyone know if there's still a working zip-bomb-style exploit that can be deployed on a static site/JS (or as an asset/resource)? Specifically to target web scrapers and AI bullshit? The second any server goes online now, it's immediately bombarded by stupid numbers of requests.
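For context on the question above: the classic trick is to pre-compress a large run of highly repetitive bytes and serve the small compressed file with a `Content-Encoding: gzip` header on paths only bots should hit, so a naive client burns memory decompressing it. A minimal sketch of generating such an asset (assuming Python's standard `gzip` module; well-behaved modern clients cap decompression, so effectiveness varies):

```python
import gzip

def make_gzip_bomb(decompressed_size: int) -> bytes:
    """Compress a run of zero bytes; highly repetitive data shrinks dramatically
    (deflate tops out around a 1000:1 ratio for all-zero input)."""
    return gzip.compress(b"\0" * decompressed_size, compresslevel=9)

# 10 MiB of zeros compresses to roughly 10 KiB on the wire.
bomb = make_gzip_bomb(10 * 1024 * 1024)
```

The generated file would then be served as-is (already compressed) with a `Content-Encoding: gzip` response header; how to wire that up depends entirely on the server or CDN in front of the static site.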
Hi, got a question.
Is there a standard for Anti-AI/Anti-SEO etc robots.txt file? Or a trustworthy site that explains how to build one if prefab isn't available? Is there anything else I should consider?
Thanks.
@fennix the fact that neither @bsi nor @EUCommission make honoring #RobotsTXT legally mandatory under penalty of fines and forced disconnects is a problem.
#WhatYouAllowIsWhatWillContinue applies here, and I know some folks intend to literally ban entire ASNs for hosting crawlers, because those literally #DDoS sites offline, and criminally incompetent, value-removing middlemen like #ClownFlare do jack shit about it even when tasked to do so.
@neil @ThreeGerbilsInACoat also note that #InternetArchive disregards the #RobotsTXT file...
#Robotstxt #CrawlerBacklash Trickle-down effects: "people start blocking all crawlers, and some crawlers are very important, for search indexing, internet archiving, some are used for academic research, and so the bad behaviours of all these #AIcompanies, and the backlash to it, is kind of fundamentally changing how the Internet works, how it is remembered and indexed..."
https://pca.st/yto6v3il?t=11m34s
Y'all really putting a file on your webserver that says "don't look *here* if you're a bot!" and expecting people not to look there first
"…the #backlash to AI tools from content creators and website owners who do not want their work to be used for AI training purposes without permission or compensation is not only real but is becoming increasingly widespread. The analysis also highlights the limitations of robots.txt—while many companies respect robots.txt instructions, some do not. Perplexity have been caught circumventing & ignoring #robotstxt."
https://www.404media.co/the-backlash-against-ai-scraping-is-real-and-measurable/
"…researchers estimate that in the 3 data sets—called C4, RefinedWeb and Dolma—5% of all data, and 25% of data from the highest-quality sources, has been restricted…set up through the #RobotsExclusionProtocol, a method for website owners to prevent automated bots from crawling their pages using a file called #robotstxt."
Web scrapers work by finding URLs in a page and then visiting those URLs to find more URLs recursively.
What's stopping us from serving them infinite trees of URLs, filled with random garbage?
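The tarpit idea floated above is feasible: generate every page deterministically from its URL, so the trap looks like static content while the link graph never terminates. A minimal sketch of such a page generator (the `/trap/` path prefix is an arbitrary choice for illustration):

```python
import hashlib
import random

def tarpit_page(path: str, links: int = 10) -> str:
    """Generate a deterministic page of garbage text plus further tarpit links.

    Seeding the RNG with a hash of the path means the same URL always yields
    the same page, so a recursive scraper sees a consistent, infinite site."""
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    words = ["".join(rng.choices("abcdefghijklmnopqrstuvwxyz", k=rng.randint(3, 10)))
             for _ in range(50)]
    anchors = [f'<a href="/trap/{rng.getrandbits(64):016x}">{rng.choice(words)}</a>'
               for _ in range(links)]
    return "<html><body><p>{}</p>{}</body></html>".format(
        " ".join(words), " ".join(anchors))
```

Hooking this into any web framework's catch-all route for `/trap/*` (and linking to `/trap/` from a page disallowed in robots.txt) would feed a misbehaving crawler an unbounded tree of URLs.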
Blocking robots.txt is not very cyberpunk
AI scraped all your photos so that I could look up the names of flowers which completely justifies the scraping in my mind.
Watching y'all realize that robots.txt was always a silly "scouts honor" way to prevent scraping that only worked because search results are most often *improved* when content that's self-filtered is removed. You were just doing Google's job for them!
Want to block Big Tech AI scrapers from accessing your site for content farming purposes?
Say goodbye to your search engine visibility as well!
Interesting results, thanks everyone for voting!
I wrote more on this topic here: https://stefanbohacek.com/blog/which-top-sites-block-ai-crawlers/
Interestingly, a few of the top websites actively invite AI crawlers to crawl them.
https://stefanbohacek.com/blog/which-top-sites-block-ai-crawlers/
Just throwing out a thought before I do some research on this, but I think robots.txt needs an update.
Ideally I'd like to define an "allow list" that tells web scrapers how my content can be used. E.g.:
- monetizable: false
- fediverse: true
- nonfediverse: false
- ai: false
Etc. And I'd like to apply this to my social media profile and any other web presence, not just my personal website.