@khobochka guess why I maintain a #Scraper #blocklist?
http://hil-speed.hetzner.com/10GB.bin
as an extra middle finger!

The #NewYorkTimes has blocked #OpenAI’s #webcrawler, meaning OpenAI can’t use content from the publication to train its AI models. If you check the NYT’s robots.txt, you can see that it disallows #GPTBot, the crawler OpenAI introduced earlier this month. Based on the #InternetArchive’s #WaybackMachine, it appears the NYT blocked the crawler as early as August 17th. https://www.theverge.com/2023/8/21/23840705/new-york-times-openai-web-crawler-ai-gpt #copyright #legalresearch
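If you want to check a site's block yourself, here is a minimal sketch using only Python's standard library robots.txt parser. The NYT URL and path are just illustrative, and the result depends on whatever the site happens to be serving when you fetch it.

from urllib import robotparser

# Point the stdlib parser at the site's robots.txt and fetch it.
rp = robotparser.RobotFileParser("https://www.nytimes.com/robots.txt")
rp.read()

# can_fetch() returns False when a "User-agent: GPTBot" / "Disallow: /" rule applies.
print(rp.can_fetch("GPTBot", "https://www.nytimes.com/"))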
#OpenAI IP block ranges, if you want to keep them off your instance and stop them scraping your content. I saw the Mastodon devs added something a few days ago to block #GPTBot via robots.txt. Here are the IP ranges (one way to use them is sketched after the links below):
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
https://openai.com/gptbot-ranges.txt
https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai
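If you'd rather enforce the block at the application layer instead of (or in addition to) robots.txt, here's a minimal sketch using Python's ipaddress module to check a client IP against the ranges above. The is_gptbot_ip helper and the sample addresses are made up for illustration; wire it into whatever request handling you actually run.

import ipaddress

# The /28 blocks from https://openai.com/gptbot-ranges.txt (as listed above).
GPTBOT_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "20.15.240.64/28", "20.15.240.80/28", "20.15.240.96/28",
    "20.15.240.176/28", "20.15.241.0/28", "20.15.242.128/28",
    "20.15.242.144/28", "20.15.242.192/28", "40.83.2.64/28",
)]

def is_gptbot_ip(client_ip: str) -> bool:
    # True if the address falls inside any published GPTBot range.
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in GPTBOT_RANGES)

print(is_gptbot_ip("20.15.240.70"))  # True  -> return 403 / drop the request
print(is_gptbot_ip("203.0.113.5"))   # False -> serve normally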
Sites scramble to block ChatGPT web crawler after instructions emerge
Without announcement, OpenAI re... - https://arstechnica.com/?p=1960108 #machinelearning #webscraping #webcrawling #aiethics #chatgpt #chatgtp #biz #gptbot #openai #tech #ai