Seth Goldstein<p><strong>For The Love Of The Web. Posting Publicly Is Going To Get Used In Some Way</strong></p><p><a href="https://www.404media.co/bluesky-posts-machine-learning-ai-datasets-hugging-face/" rel="nofollow noopener noreferrer" target="_blank">Sam Cole over at 404 Media wrote an article about a Hugging Face Machine Learning Librarian making a public data set of 1 million Bluesky posts available to everyone for Machine Learning.</a></p><p>People were, of course, outraged. After all, it’s the Internet. People thrive on being outraged, pissed off, and otherwise salty.</p><p>What people seem to miss is that what they’re posting on Bluesky is public and scrapable.</p><p>The way this guy made the data set was a bit sloppy and, in my opinion, irresponsible. He didn’t anonymize the data and left personally identifiable information in the data set. He also didn’t get consent from people first.</p><p>Yeah, I agree it feels a bit icky that this was done, mostly because there was no consent and no anonymizing of the data. But for the love of the Web, what you put online publicly is PUBLIC. People will see it and possibly use it for whatever they want. How hard is this to grasp?</p><p>This kind of data collection, according to Sam’s article, is also in a legal gray area right now and is working its way through courts around the world.</p><p>To give some credit to the librarian, he took down the data set after getting quite a bit of “feedback.” 😵💫😜</p><p>But that didn’t stop the trolls from making even bigger data sets and putting them online.</p><p>I really do understand why people are upset, but those posts are public. Don’t post stuff and expect it to be private when it’s PUBLIC!</p><p>Honestly, I’m fine with the content I post publicly being used to train LLMs and AI, because it will improve the technology that I benefit from.</p><p>I agree with Rand Fishkin, the founder of Moz and SparkToro. 
</p><p>He posted on Bluesky:</p><blockquote><p>I know others are probably upset about this, but LLM training is, for me, a benefit of participating in spaces like this. I *want* my word usage, brands, and content to be part of how AI answers questions in the future. Just like I wanted Google to index my websites.</p><p>— Rand Fishkin (<a href="https://bsky.app/profile/did:plc:b7jekqo7kjipuoloz7wjg3mh?ref_src=embed" rel="nofollow noopener noreferrer" target="_blank">@randfish.bsky.social</a>) <a href="https://bsky.app/profile/did:plc:b7jekqo7kjipuoloz7wjg3mh/post/3lct4iwg2sk2o?ref_src=embed" rel="nofollow noopener noreferrer" target="_blank">December 8, 2024 at 4:06 PM</a></p></blockquote><p>I don’t think that’s a crazy desire. Right? Am I completely off-base? What do you think?</p><p><a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://sethgoldstein.me/tag/ai/" target="_blank">#AI</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://sethgoldstein.me/tag/bluesky/" target="_blank">#Bluesky</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://sethgoldstein.me/tag/data/" target="_blank">#Data</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://sethgoldstein.me/tag/datasets/" target="_blank">#Datasets</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://sethgoldstein.me/tag/llms/" target="_blank">#LLMs</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://sethgoldstein.me/tag/machine-learning/" target="_blank">#MachineLearning</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://sethgoldstein.me/tag/public-vs-private/" target="_blank">#PublicVsPrivate</a></p>