clifff @clifff

Continued thread

**George E.** @gme@bofh.social · Jun 12

# Cloudflare Managed Robots.txt to block AI related bots.

User-agent: AI2Bot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: amazon-kendra
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Applebot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: AwarioRssBot
Disallow: /

User-agent: AwarioSmartBot
Disallow: /

User-agent: bigsur.ai
Disallow: /

User-agent: Brightbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: DigitalOceanGenAICrawler
Disallow: /

User-agent: DuckAssistBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: FriendlyCrawler
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: iaskspider/2.0
Disallow: /

User-agent: ICC-Crawler
Disallow: /

User-agent: img2dataset
Disallow: /

User-agent: Kangaroo Bot
Disallow: /

User-agent: LinerBot
Disallow: /

User-agent: MachineLearningForPeaceBot
Disallow: /

User-agent: Meltwater
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: meta-externalfetcher
Disallow: /

User-agent: Nicecrawler
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: PanguBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: PiplBot
Disallow: /

User-agent: QualifiedBot
Disallow: /

User-agent: Scoop.it
Disallow: /

User-agent: Seekr
Disallow: /

User-agent: SemrushBot-OCOB
Disallow: /

User-agent: Sidetrade indexer bot
Disallow: /

User-agent: Timpibot
Disallow: /

User-agent: VelenPublicWebCrawler
Disallow: /

User-agent: Webzio-Extended
Disallow: /

User-agent: YouBot
Disallow: /

#robotstxt

**smeg** @smeg@assortedflotsam.com · May 18

May 18

smeg @smeg@assortedflotsam.com

I've had the robots.txt to block ChatGPT from touching my site in place for months. Yet it's a referrer?

#chatgpt #llm #privacy

**zeyus ‎** @zeyus@corteximplant.com · Mar 3

Mar 3

zeyus ‎ @zeyus@corteximplant.com

Hey does anyone know if there's still a working zip bomb style exploit that can be deployed on a static site/JS (or as a asset/resource)? Specifically to target web scrapers and AI bullshit? The second any server goes online now it's immediately bombarded by stupid numbers of requests.

#hacking #aislop #crawlers

**Dawn Tåke** @Tourma@tech.lgbt · Feb 11

Feb 11

Dawn Tåke @Tourma@tech.lgbt

Hi, got a question.

Is there a standard for Anti-AI/Anti-SEO etc robots.txt file? Or a trustworthy site that explains how to build one if prefab isn't available? Is there anything else I should consider?

Thanks.

#AskFedi #TechHelp #RobotsTXT

**Preston Maness ☭** @aspensmonster@tenforward.social · Jan 31 *

Jan 31 *

Preston Maness ☭ @aspensmonster@tenforward.social

https://www.tiktok.com/@alberta.nyc/video/7465916806939659563?lang=en

#DeepSeek #ai #OpenAI

Replied in thread

**Kevin Karhan** @kkarhan@infosec.space · Dec 30, 2024 *

Dec 30, 2024 *

Kevin Karhan @kkarhan@infosec.space

@fennix the fact that neither @bsi nor @EUCommission make honoring #RobotsTXT legally mandatory under penalty of fines and forced disconnects is a problem.

#WhatYouAllowIsWhatWillContinue applies here and I kniw some folks intent to literally ban entire ASNs for hosting crawlers because those literally #DDoS sites offline and criminally incompetent, value-removing middlemen like #ClownFlare do jack shit about even when tasked to do so.

YouTubeThe creators of TikTok caused my website to shut downBy MattKC

#sarcasm #vent #AI

**Kevin Karhan** @kkarhan@infosec.space · Sep 9, 2024

Sep 9, 2024

Kevin Karhan @kkarhan@infosec.space

@neil @ThreeGerbilsInACoat also note that #InternetArchive disregards the #RobotsTXT file...

**Ecologia Digital** @josemurilo@mato.social · Jul 31, 2024 *

Jul 31, 2024 *

Ecologia Digital @josemurilo@mato.social

#Robotstxt #CrawlerBacklash Trickle-down effects: "people start blocking all crawlers, and some crawlers are very important, for search indexing, internet archiving, some are used for academic research, and so the bad behaviours of all these #AIcompanies, and the backlash to it, is kind of fundamentally changing how the Internet works, how it is remembered and indexed..."
https://pca.st/yto6v3il?t=11m34s

Pocket CastsGoogle, Reddit, and the Robots.txt Rebellion - The 404 Media PodcastWelcome to the podcast from 404 Media where Joseph, Sam, Emanuel, and Jason catch you up on the stories we published this week. 404 Media is a journalist-owned digital media company exploring the way technology is shaping–and is shaped by–our world. We bring you unparalleled access to hidden worlds both online and IRL through investigative reporting, smart blogging, and breaking news. At 404 Media you’ll read, and hear, stories you can’t find anywhere else written by journalists who are leading experts on their beats. Subscribe to 404 Media at 404media.co to gain access to an ad-free version of this podcast, as well as a bonus podcast episodes. Subscribers are the bedrock of building a sustainable business for our journalism. Hosted on Acast. See acast.com/privacy for more information.

@schizanon@mastodon.social · Jul 27, 2024

Jul 27, 2024

@schizanon@mastodon.social

Y'all really putting a file on your webserver that says "don't look *here* if you're a bot!" and expecting people not to look there first

#webDev #robotsTxt #bots

**Ecologia Digital** @josemurilo@mato.social · Jul 23, 2024

Jul 23, 2024

Ecologia Digital @josemurilo@mato.social

"…the #backlash to AI tools from content creators and website owners who do not want their work to be used for AI training purposes without permission or compensation is not only real but is becoming increasingly widespread. The analysis also highlights the limitations of robots.txt—while many companies respect robots.txt instructions, some do not. Perplexity have been caught circumventing & ignoring #robotstxt."

https://www.404media.co/the-backlash-against-ai-scraping-is-real-and-measurable/

404 Media · Jul 23, 2024The Backlash Against AI Scraping Is Real and MeasurableIn the last year, the number of websites specifically restricting OpenAI and other AI scraper bots has gone through the roof.

**Ecologia Digital** @josemurilo@mato.social · Jul 20, 2024

Jul 20, 2024

Ecologia Digital @josemurilo@mato.social

"…researchers estimate that in the 3 data sets—called C4, RefinedWeb and Dolma—5% of all data, and 25% of data from the highest-quality sources, has been restricted…set up through the #RobotsExclusionProtocol, a method for website owners to prevent automated bots from crawling their pages using a file called #robotstxt."

https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html?unlocked_article_code=1.8k0.8eMA.cGAaZ0i10aZE&smid=nytcore-ios-share&referringSource=articleShare

The New York Times · Jul 19, 2024Data for A.I. Training Is Disappearing Fast, Study ShowsBy Kevin Roose

@schizanon@mastodon.social · Jul 8, 2024

Jul 8, 2024

@schizanon@mastodon.social

Web scrapers work by finding URLs in a page and then visiting those URLs to find more URLs recursively.

What's stopping us from serving them infinite trees of URLs, filled with random garbage?

#webDev #robotstxt #scrapers

@schizanon@mastodon.social · Jul 7, 2024 *

Jul 7, 2024 *

@schizanon@mastodon.social

Blocking robots.txt is not very cyberpunk

#ai #llm #cyberpunk

@schizanon@mastodon.social · Jul 4, 2024 *

Jul 4, 2024 *

@schizanon@mastodon.social

AI scraped all your photos so that I could look up the names of flowers which completely justifies the scraping in my mind.

#robotstxt #ai #llm

@schizanon@mastodon.social · Jun 24, 2024

Jun 24, 2024

@schizanon@mastodon.social

Watching y'all realize that robots.txt was always a silly "scouts honor" way to prevent scraping that only worked because search results are most often *improved* when content that's self-filtered is removed. You were just doing Google's job for them!

#robotstxt #ai #robots