shakedown.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
A community for live music fans with roots in the jam scene. Shakedown Social is run by a team of volunteers (led by @clifff and @sethadam1) and funded by donations.


#scraping

Petra van Cronenburg

@susankayequinn Here's another article by @brianmerchant : https://www.bloodinthemachine.com/p/openais-studio-ghibli-meme-factory
"AI giants are indeed eating away at the livelihoods and dignity of working artists, and this devouring, appropriating, and automation of the production of art, of culture, at a scale truly never seen before, should not be underestimated as a menace"

#AI #OpenAI #StudioGhibli #art #artists #scraping #copyright #copyrightInfringement #culture #billionaires

Petra van Cronenburg

"GPT-4o is partly (aside from some licensed content) a product of a massive scrape of the Internet without regard to copyright or consent from artists ... GPT-4o's image generation model (and the technology behind it, once open source) feels like it further erodes trust in remotely produced media ... Everyone needs media literacy skills ..." https://arstechnica.com/ai/2025/03/openais-new-ai-image-generator-is-potent-and-bound-to-provoke/?utm_brand=arstechnica&utm_social-type=owned&utm_source=mastodon&utm_medium=social via @arstechnica

#AI #generativeAI #imageGenerator #fake #gpt4o #artists #copyright #scraping #mediaLiteracy #images

uǝuunɹƃʇǝO

Thoughts: AI corps scraping data

The corporations assert that they can utilize public data without incurring any costs, citing fair use as their justification.

To address this issue, we should implement a law that compels corporations claiming fair use as a defense to make all their process data publicly available, free of charge. This would ensure that the scraped data, as well as data derived from the freely available data, is accessible to the public.
#AI #FairUse #Scraping #WebScraping


@Garwboy As a friend of biodiversity, I had nearly stopped reading until I got to this: "I like all of those creatures. I find them fascinating, and they occupy important roles in our society and ecosystem. I would never say that about Mark Zuckerberg."
But now I dream of writer troll farms using your inspiring idea to train #AI: theneuroscienceofeverydaylife. Great! Made my day. 😂
@writing @writers @writerscommunity

The Neuroscience of Everyday Life · An article for Meta to use to train their AI · By Dean Burnett

Yesterday I ran a test: I warned against this account using a hashtag of its name and a certain bird, and promptly got the #scam again. That is a sign that this paragon of a #troll factory, or a narcissistic bot tinkerer hopping instances, is not reacting randomly. Don't just block it; it's important to #report it so that it finally comes to an end. Don't click the links. Whether it's #scraping, a joke, or an attack on the Fediverse, a #fediblock would be fine! The phrase pattern could be filtered.

🦾 Terrifying to read that Big Tech companies are not addressing the obvious privacy and security issues and are instead claiming they already have tricks to get around this countermeasure.

「 By releasing Nepenthes, he hopes to do as much damage as possible, perhaps spiking companies' AI training costs, dragging out training efforts, or even accelerating model collapse, with tarpits helping to delay the next wave of enshittification 」

arstechnica.com/tech-policy/20

Ars Technica · AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt · By Ashley Belanger

Dear #dhpeople ,

I am helping a researcher in the #philosophy of #aesthetics to download hundreds of thousands of critical #art reviews for research purposes. Many of those reviews are in the online database #proquest, which my university pays for so that we researchers have access to it.
However, before diving into the headache of #webscraping, I am wondering whether any of you has dealt with this database. What did you end up doing? Writing to them to ask? #Scraping? How? Any feedback? #digitalhumanities #dh #Histodon #fedihum #histodons #humanites_numeriques


@gameplayer @atomicpoet it's not that simple.

  • Sure, the average cybercriminal can use some hijacked routers and desktops as proxies for their carding fraud, but that doesn't scale to huge amounts of traffic, and ISPs do combat proxy use by explicitly checking traffic for it and banning such setups under their ToS.

In fact, many ISPs will forcibly disconnect customers if they detect that they run an open proxy or Tor exit node.

TIL

"Earlier this year, Microsoft-owned LinkedIn came under similar scrutiny for toggling on a feature that allows the company to scrape user data for AI training. The UK's International Commissioner's Office forced LinkedIn to stop doing that with UK user data. LinkedIn still scrapes US user data by default; disable it by visiting Settings > Data Privacy > Data for Generative AI Improvement."

pcmag.com/news/microsoft-we-do

PCMag · Microsoft: We Don't Use Your Word, Excel Data for AI Training · By Jibin Joseph
#Microsoft #LLM #AI

Be aware your Bluesky posts are being scraped for AI training

From TechCrunch in late November, highlighting a weakness of open architectures which a sprawling and varied critical literature on ‘openness’ had long pointed to:

Bluesky might not be training AI systems on user content as other social networks are doing, but there’s little stopping third parties from doing so.

Per a report by 404 Media, Daniel van Strien, a machine learning librarian at AI firm Hugging Face, pulled 1 million public posts from Bluesky via its Firehose API for machine learning research, pushing the dataset to a public repository. Van Strien later removed the data due to the controversy that ensued; however, it serves as a timely reminder that everything you post publicly to Bluesky is, well, public.

Bluesky said that it’s looking at ways to enable users to communicate their consent preferences externally, though it’s up to those parties whether they respect those preferences.

The company posted: “Bluesky won’t be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We’re having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!”

What’s clear here is that while Bluesky is surging in popularity, its rapid rise to the forefront of the global consciousness will mean it’s subject to the same levels of scrutiny as other major social platforms.

TechCrunch · Unlike X, Bluesky says it won't train AI on your posts · Bluesky, a social network that's experiencing a surge in users this week, says it has “no intention” of using user content to train generative AI tools.
#AI #BlueSKy #data

If anybody has tips or experience to offer on using #mod_security to squelch broken AI #scraping, they would be appreciated.

First pass has helped but now it's harder. Example: looks as though bingbot is being repurposed for AI scraping and I'd rather not risk collateral damage to actual search.

It's not the scraping we object to, it's the inefficient methods (hallucinated URIs being shot into @SkepticalScience at crazy rates).

Regular expression for AI-imagined URI. That's where this is going.😋
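
A rough sketch of that idea, in Python rather than as a mod_security rule: instead of guessing what a hallucinated URI looks like, list the patterns the site really serves and count the misses per client. The route patterns, the threshold, and the function names here are all made up for illustration; a real site would plug in its own.

```python
import re
from collections import Counter

# Hypothetical route patterns, for illustration only; a real site would list its own.
KNOWN_ROUTES = [
    re.compile(r"/"),
    re.compile(r"/articles/[a-z0-9-]+"),
    re.compile(r"/tags/[a-z0-9-]+"),
]


def is_plausible_uri(uri: str) -> bool:
    """Return True if the whole URI matches one of the known route patterns."""
    return any(pattern.fullmatch(uri) for pattern in KNOWN_ROUTES)


def suspicious_clients(requests, threshold=3):
    """Return the set of clients that requested at least `threshold` implausible URIs.

    `requests` is an iterable of (client, uri) pairs, e.g. parsed from an access log.
    """
    misses = Counter(client for client, uri in requests if not is_plausible_uri(uri))
    return {client for client, count in misses.items() if count >= threshold}
```

Counting misses per client, rather than blocking on a single bad URI, is one way to limit collateral damage to a legitimate crawler that occasionally follows a stale link.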


@fuchsiii @lynn @LunaDragofelis if you look up the robots.txt specs you'll see it's literally just an ask...

If you want to prevent ChatGPT from crawling shit you need to literally block it at the #firewall level!

Personally, I'd recommend forwarding every request they make to this little test file and letting them get hetzner'd!

www.robotstxt.org · The Web Robots Pages
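
To make the "just an ask" point concrete, here is a small sketch using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the rules, and the URL are invented for the example. The check runs inside the crawler, so it only has an effect if the crawler chooses to call it and honor the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "ExampleAIBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what robots.txt asks of this user agent for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.org/some/page"
    # A polite crawler would skip the page when this prints False.
    # Nothing on the server side stops a crawler that never asks.
    print(allowed("ExampleAIBot", url))
    print(allowed("SomeOtherAgent", url))
```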

For those who want to "farm" the open internet for LLM content, all kinds of tools are available; Firecrawl is a good example, and it is partly open source. Most people probably feel negatively about this, but I think that if a website is openly accessible to a human, we can hardly prevent it from being crawled or scraped and used for AI training.
docs.firecrawl.dev/introductio
#AI #crawling #scraping #firecrawl #llm

Firecrawl Docs · Quickstart | Firecrawl · Firecrawl allows you to turn entire websites into LLM-ready markdown

I've made an interesting #observation re: #ChatGPT / #OpenAI...

Whilst they got sued by someone and forced to publish their #scraping #bots' #IP addresses, they actively prevent people from using and updating said #blocklist automatically by querying it.

I'm pretty sure that this violates their original settlement, and that even if I queried it hourly instead of once a day, it wouldn't impact OpenAI's #uptime or #availability or #traffic at all, since as of writing this file merely contains three lines:

52.230.152.0/24
52.233.106.0/24
20.171.206.0/24

And the downloaded file is 48 Bytes (!!!) small...

  • Meaning me using their website as a ping target is causing way more traffic to them than anything else.

IDK what you guys make of this...

  • Personally, I'm getting so pissed off with wannabe-"#AI" that I'm turning more #hostile towards it by the day, to the point that I'm considering pointing all that traffic towards #Hetzner's 10GB test file just to give both parties a middle finger...

#JustSaying...
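
For anyone who wants to act on a list like the three ranges quoted above, a minimal sketch with Python's standard `ipaddress` module follows. It only uses the ranges as they appear in the post; the published list may have changed since, and the function names are made up for the example. With one newline per line, those three entries come to 48 bytes, which matches the file size mentioned.

```python
import ipaddress

# The three ranges as quoted in the post; treat them as a snapshot, not a current list.
BLOCKLIST_TEXT = "52.230.152.0/24\n52.233.106.0/24\n20.171.206.0/24\n"


def parse_blocklist(text: str):
    """Turn one-CIDR-per-line text into a list of network objects, skipping blank lines."""
    return [ipaddress.ip_network(line.strip()) for line in text.splitlines() if line.strip()]


def is_blocked(ip: str, networks) -> bool:
    """Return True if the address falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    networks = parse_blocklist(BLOCKLIST_TEXT)
    print(len(BLOCKLIST_TEXT.encode()))           # size of the list in bytes
    print(is_blocked("52.230.152.17", networks))  # inside the first range
    print(is_blocked("192.0.2.1", networks))      # documentation address, not listed
```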

"What he discovered seems simple on its surface, but the quality of the result has deeper implications for the future of AI assistants, which may soon be able to see and interact with what we're doing on our computer screens."
arstechnica.com/ai/2024/10/che
#AI #video #scraping

Ars Technica · Cheap AI “video scraping” can now extract data from any screen recording · By Benj Edwards

If #Cloudflare is to be believed, #Lemmy instances have a built-in AI scraping bot operating beneath the covers. Do you think the developers have snuck it in?

Looking through my logs, these requests have all been blocked by Cloudflare because they are identified as "AI Bots". There are many more requests by Lemmy instances blocked in the logs. This is just a sample. Other Lemmy requests from these servers get through. Only a few are blocked as AI Bots.

Cloudflare says they use AI to determine if a request is a legitimate request or an AI bot trying to scrape.

207.204.58.144
AS19045 DIRECTCOM
United States
User agent: Lemmy/0.19.5; +lemmy.cryonex.net

23.127.223.238
AS7018 ATT-INTERNET4
United States
User agent: Lemmy/0.19.3; +lemux.minnix.dev

2a01:cb19:f85:ec00:82fa:5bff:fe51:ed4a
AS3215 France Telecom - Orange
France
User agent: Lemmy/0.19.5; +lemmy.sidh.bzh

50.247.53.42
AS7922 COMCAST-7922
United States
User agent: Lemmy/0.19.5; +toast.ooo

69.42.19.234
AS11404 AS-WAVE-1
United States
User agent: Lemmy/0.19.5; +lemmy.schlunker.com

155.138.226.183
AS20473 AS-CHOOPA
United States
User agent: Lemmy/0.19.5; +lemmy.mbl.social

lemmy.cryonex.net
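
The user agent strings in the sample above all follow the same shape, so a short parser makes it easier to group blocked requests by instance and version when going through the logs. This is only a sketch based on the six strings shown; the regular expression and the function name are assumptions, not anything taken from Lemmy or Cloudflare documentation.

```python
import re
from collections import Counter

# Pattern inferred from the sample log lines only, e.g. "Lemmy/0.19.5; +lemmy.cryonex.net".
LEMMY_UA = re.compile(r"Lemmy/(?P<version>\d+(?:\.\d+)*); \+(?P<instance>\S+)")


def parse_lemmy_user_agent(user_agent: str):
    """Return (version, instance) for a Lemmy-style user agent, or None if it does not match."""
    match = LEMMY_UA.fullmatch(user_agent.strip())
    if match is None:
        return None
    return match.group("version"), match.group("instance")


if __name__ == "__main__":
    sample = [
        "Lemmy/0.19.5; +lemmy.cryonex.net",
        "Lemmy/0.19.3; +lemux.minnix.dev",
        "Lemmy/0.19.5; +toast.ooo",
    ]
    parsed = [parse_lemmy_user_agent(ua) for ua in sample]
    # Tally how many of the sampled requests came from each Lemmy version.
    print(Counter(version for version, _ in parsed))
```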