For The Love Of The Web: What You Post Publicly Is Going To Get Used In Some Way
Sam Cole over at 404 Media wrote an article about a Hugging Face Machine Learning Librarian making a public data set of 1 million Bluesky posts available to everyone for Machine Learning.
People were, of course, outraged. After all, it’s the Internet. People thrive on being outraged, pissed off, and otherwise salty.
What people seem to miss is that what they’re posting on Bluesky is public and scrapable.
The way this guy made the data set was a bit sloppy and, in my opinion, irresponsible. He didn’t anonymize the data and left personally identifiable information in the data set. He also didn’t get consent from people first.
Yeah, I agree it feels a bit icky that this was done, mostly without consent or anonymizing the data. But for the love of the Web, what you put online publicly is PUBLIC. People will see it and possibly use it for whatever they want. How hard is this to grasp?
This kind of data collection, according to Sam’s article, is also in a legal gray area right now, and the question is working its way through courts around the world.
To give some credit to the librarian, he took down the data set after getting quite a bit of “feedback.”

But that didn’t stop the trolls from making even bigger data sets and putting them online.
I really do understand why people are upset, but those posts are public. Don’t post stuff and expect it to be private when it’s PUBLIC!
Honestly, I’m fine with the content I post publicly being used to train LLMs and AI, because it will improve the technology that I benefit from.
I agree with Rand Fishkin, the founder of Moz and Sparktoro.
He posted on Bluesky:
I know others are probably upset about this, but LLM training is, for me, a benefit of participating in spaces like this. I *want* my word usage, brands, and content to be part of how AI answers questions in the future. Just like I wanted Google to index my websites.
— Rand Fishkin (@randfish.bsky.social) December 8, 2024 at 4:06 PM
I don’t think that’s a crazy desire. Right? Am I completely off-base? What do you think?