@Garwboy As a friend of biodiversity I had nearly stopped reading until I reached this: "I like all of those creatures. I find them fascinating, and they occupy important roles in our society and ecosystem. I would never say that about Mark Zuckerberg."
But now I dream of writer troll farms using your inspiring idea to train #AI: https://theneuroscienceofeverydaylife.substack.com/p/an-article-for-meta-to-use-to-train Great! Made my day.
@writing @writers @writerscommunity
Yesterday I ran a test: I warned against this account using a hashtag of its name plus a certain bird, and promptly got the #scam again. That shows this paragon of a #troll factory, or narcissistic bot tinkerer hopping instances, is not reacting randomly. Don't just block it; it's important to #report it so that it finally comes to an end. Don't click the links. Whether it's #scraping, a joke, or an attack on the Fediverse: a #fediblock would be fine! The phrase pattern could be filtered.
Terrifying to read that Big Tech companies are not addressing the obvious privacy and security issues and instead claim they already have tricks to skip this countermeasure.
「 By releasing Nepenthes, he hopes to do as much damage as possible, perhaps spiking companies' AI training costs, dragging out training efforts, or even accelerating model collapse, with tarpits helping to delay the next wave of enshittification 」
The EU’s #AIAct prohibitions are now in effect! But gaps remain. Learn more: https://algorithmwatch.org/en/ai-act-prohibitions-february-2025/
Now banned in the EU: #ManipulativeAI, AI that exploits people's vulnerabilities, #SocialScoring, #Scraping of facial images on the internet, Live #FaceRecognition in Public Spaces. Others are partially banned, like #PredictivePolicing, #EmotionRecognition, and more.
Dear #dhpeople ,
I am helping a researcher in #philosophy of #aesthetics to download hundreds of thousands of critical #art reviews for research purposes. Many of those reviews are in the online database #proquest, which my university pays for so that we researchers have access to it.
However, before diving into the headache of #webscraping, I am wondering whether any of you have dealt with this database. What did you end up doing? Writing to them to ask? #Scraping? How? Any feedback? #digitalhumanities #dh #Histodon #fedihum #histodons #humanites_numeriques
@gameplayer @atomicpoet it's not that simple.
In fact, many ISPs will forcibly disconnect customers if they detect that they run an open proxy or Tor exit node.
@khobochka We need an international co-operative system for making these parties pay for scraping. It includes legislative changes. At the same time it can become a real-time pricing market for “rights to scrape” and for creators to get paid.
Here’s my whitepaper for a solution. Absolutely no cryptocurrency involved.
#ai #scraping #copyright #technology #whitepaper
https://docs.google.com/document/d/18cz-ZX1copCYiC4C2ReY8GLJjuhG2IH0MEBGaoSJhP4/edit
TIL
"Earlier this year, Microsoft-owned LinkedIn came under similar scrutiny for toggling on a feature that allows the company to scrape user data for AI training. The UK's International Commissioner's Office forced LinkedIn to stop doing that with UK user data. LinkedIn still scrapes US user data by default; disable it by visiting Settings > Data Privacy > Data for Generative AI Improvement."
https://www.pcmag.com/news/microsoft-we-dont-use-your-word-excel-data-for-ai-training
For those (like me!) looking for what @cstross is referring to:
https://www.gov.uk/government/consultations/copyright-and-artificial-intelligence
I'm trying to scrape a page from a website with FreshRSS. And well… one thing is certain:
I'm hopeless at it.
Is there an online app to (semi-)automate this?
Be aware your Bluesky posts are being scraped for AI training
From TechCrunch in late November, highlighting a weakness of open architectures which a sprawling and varied critical literature on ‘openness’ had long pointed to:
Bluesky might not be training AI systems on user content as other social networks are doing, but there’s little stopping third parties from doing so.
Per a report by 404 Media, Daniel van Strien, a machine learning librarian at AI firm Hugging Face, pulled 1 million public posts from Bluesky via its Firehose API for machine learning research, pushing the dataset to a public repository. Van Strien later removed the data due to the controversy that ensued; however, it serves as a timely reminder that everything you post publicly to Bluesky is, well, public.
Bluesky said that it’s looking at ways to enable users to communicate their consent preferences externally, though it’s up to those parties whether they respect those preferences.
The company posted: “Bluesky won’t be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We’re having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!”
What’s clear here is that while Bluesky is surging in popularity, its rapid rise to the forefront of the global consciousness will mean it’s subject to the same levels of scrutiny as other major social platforms.
If anybody has tips/experience to offer on using #mod_security to squelch broken AI #scraping, would be appreciated.
First pass has helped but now it's harder. Example: looks as though bingbot is being repurposed for AI scraping and I'd rather not risk collateral damage to actual search.
It's not the scraping we object to, it's the inefficient methods (hallucinated URIs being shot into @SkepticalScience at crazy rates).
Regular expression for AI-imagined URI. That's where this is going.
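A rough sketch of a different angle on the same problem, in Python rather than mod_security rules: instead of a regular expression for imagined URIs, count how many 404 responses each client triggers in a short window. This is a swapped-in heuristic, not what the post describes doing. The function name, the thresholds and the event tuple layout are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def flag_404_bursts(events, max_misses=20, window_s=60.0):
    """Return the clients that caused more than `max_misses` 404s
    within any span of `window_s` seconds.

    `events` is an iterable of (timestamp_seconds, client_id, http_status)
    tuples sorted by timestamp. This layout is hypothetical; real access
    logs would need to be parsed into it first.
    """
    recent = defaultdict(deque)  # client_id -> timestamps of recent 404s
    flagged = set()
    for ts, client, status in events:
        if status != 404:
            continue
        hits = recent[client]
        hits.append(ts)
        # Forget misses that have fallen out of the sliding window.
        while hits and ts - hits[0] > window_s:
            hits.popleft()
        if len(hits) > max_misses:
            flagged.add(client)
    return flagged


if __name__ == "__main__":
    # Made-up traffic: one client hammering nonexistent URIs, one slow visitor.
    demo = [(i * 0.5, "bot", 404) for i in range(30)]
    demo += [(i * 10.0, "human", 404) for i in range(5)]
    print(flag_404_bursts(sorted(demo)))
```

Flagging on behaviour rather than on user agent might reduce the risk of collateral damage to real search crawling, though the thresholds would need tuning against actual logs.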
@fuchsiii @lynn @LunaDragofelis if you look up the robots.txt specs you'll see it's literally just an ask...
If you want to prevent ChatGPT from crawling shit you need to literally block it at the #firewall level!
Use curl / wget to pull their bots' IP ranges... Personally, I'd recommend forwarding every request they make to this little test file and letting them get hetzner'd!
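A minimal sketch of the firewall idea, assuming the IP ranges have already been downloaded as a plain text file with one CIDR range per line. The function name is made up and the sample ranges are reserved documentation addresses, not real crawler ranges. The script only prints iptables / ip6tables commands; it does not apply them.

```python
import ipaddress


def build_drop_rules(cidr_text: str) -> list[str]:
    """Turn a newline-separated CIDR list into iptables DROP commands."""
    rules = []
    for line in cidr_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # ip_network() validates the range; strict=False tolerates host bits.
        net = ipaddress.ip_network(line, strict=False)
        tool = "iptables" if net.version == 4 else "ip6tables"
        rules.append(f"{tool} -A INPUT -s {net} -j DROP")
    return rules


# Placeholder input using documentation ranges (an assumption, not real data).
SAMPLE = """
192.0.2.0/24
2001:db8::/32
"""

if __name__ == "__main__":
    for rule in build_drop_rules(SAMPLE):
        print(rule)
```

Validating each line before building a rule means a malformed or unexpected download raises an error instead of ending up in the firewall.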
For those who want to "farm" the open internet for LLM content, all kinds of tools are available; Firecrawl is a good example, and partly open source. Most people are probably negative about this, but I think that if a website is openly accessible to a human, we can hardly prevent it from being crawled/scraped and used for AI training.
https://docs.firecrawl.dev/introduction
#AI #crawling #scraping #firecrawl #llm
I've made an interesting #observation re: #ChatGPT / #OpenAI...
Whilst they got sued by someone and forced to publish their #scraping #bots' #IP addresses, they actively prevent people from using and updating said #blocklist automatically by querying it.
I'm pretty sure that this violates their original settlement, and that even if I queried it hourly instead of once a day it wouldn't impact OpenAI's #uptime or #availability or #traffic at all, since as of writing this file merely contains three lines:
52.230.152.0/24
52.233.106.0/24
20.171.206.0/24
And the downloaded file is 48 Bytes (!!!) small...
A ping target causes way more traffic to them than anything else. IDK what you guys make of this...
#JustSaying...
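For anyone wanting to use those three ranges locally, here is a small sketch that parses them and checks whether an address falls inside one. The helper names are invented for illustration, and the size check assumes each line ends with a single newline character.

```python
import ipaddress

# The three ranges quoted in the post, one per line.
BLOCKLIST_TEXT = """52.230.152.0/24
52.233.106.0/24
20.171.206.0/24
"""


def parse_blocklist(text):
    """Parse one CIDR range per line into network objects."""
    return [ipaddress.ip_network(line.strip())
            for line in text.splitlines() if line.strip()]


def is_blocked(address, networks):
    """Return True if `address` lies inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)


if __name__ == "__main__":
    nets = parse_blocklist(BLOCKLIST_TEXT)
    print(len(BLOCKLIST_TEXT.encode("ascii")), "bytes")
    print(is_blocked("52.230.152.17", nets))
    print(is_blocked("8.8.8.8", nets))
```

A check like this can run on a cached copy of the file, so the list only needs to be fetched as often as the publisher allows.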
"What he discovered seems simple on its surface, but the quality of the result has deeper implications for the future of AI assistants, which may soon be able to see and interact with what we're doing on our computer screens."
https://arstechnica.com/ai/2024/10/cheap-ai-video-scraping-can-now-extract-data-from-any-screen-recording/
#AI #video #scraping
If #Cloudflare is to be believed, #Lemmy instances have a built-in AI scraping bot operating beneath the covers. Do you think the developers have snuck it in?
Looking through my logs, these requests have all been blocked by Cloudflare because they are identified as "AI Bots". There are many more requests by Lemmy instances blocked in the logs. This is just a sample. Other Lemmy requests from these servers get through. Only a few are blocked as AI Bots.
Cloudflare says they use AI to determine if a request is a legitimate request or an AI bot trying to scrape.
207.204.58.144
AS19045 DIRECTCOM
United States
User agent: Lemmy/0.19.5; +https://lemmy.cryonex.net
23.127.223.238
AS7018 ATT-INTERNET4
United States
User agent: Lemmy/0.19.3; +https://lemux.minnix.dev
2a01:cb19:f85:ec00:82fa:5bff:fe51:ed4a
AS3215 France Telecom - Orange
France
User agent: Lemmy/0.19.5; +https://lemmy.sidh.bzh
50.247.53.42
AS7922 COMCAST-7922
United States
User agent: Lemmy/0.19.5; +https://toast.ooo
69.42.19.234
AS11404 AS-WAVE-1
United States
User agent: Lemmy/0.19.5; +https://lemmy.schlunker.com
155.138.226.183
AS20473 AS-CHOOPA
United States
User agent: Lemmy/0.19.5; +https://lemmy.mbl.social
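To see at a glance which versions and instances show up among the blocked requests, the user agent strings above can be split into their parts. This is only a sketch: the regular expression is inferred from the six sample strings, not from any Lemmy specification, and the helper names are made up.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional


class LemmyAgent(NamedTuple):
    version: str
    instance: str


# Pattern inferred from the samples: "Lemmy/<version>; +<instance URL>".
_UA_PATTERN = re.compile(
    r"^Lemmy/(?P<version>\d+(?:\.\d+)*)\s*;\s*\+(?P<instance>https?://\S+)$"
)

# User agents copied from the log sample in the post.
SAMPLE_AGENTS = [
    "Lemmy/0.19.5; +https://lemmy.cryonex.net",
    "Lemmy/0.19.3; +https://lemux.minnix.dev",
    "Lemmy/0.19.5; +https://lemmy.sidh.bzh",
    "Lemmy/0.19.5; +https://toast.ooo",
    "Lemmy/0.19.5; +https://lemmy.schlunker.com",
    "Lemmy/0.19.5; +https://lemmy.mbl.social",
]


def parse_lemmy_agent(user_agent: str) -> Optional[LemmyAgent]:
    """Return version and instance URL, or None if the string does not match."""
    match = _UA_PATTERN.match(user_agent.strip())
    if match is None:
        return None
    return LemmyAgent(match["version"], match["instance"])


def count_versions(agents) -> Counter:
    """Count how often each Lemmy version appears among matching agents."""
    counts = Counter()
    for agent in agents:
        parsed = parse_lemmy_agent(agent)
        if parsed is not None:
            counts[parsed.version] += 1
    return counts


if __name__ == "__main__":
    for agent in SAMPLE_AGENTS:
        print(parse_lemmy_agent(agent))
    print(count_versions(SAMPLE_AGENTS))
```

Grouping the blocked requests this way could help show whether the blocks cluster on particular versions or instances, though a sample of six is too small to conclude anything.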