Ecologia Digital
<p>"… researchers estimate that in the 3 data sets—called C4, RefinedWeb and Dolma—5% of all data, and 25% of data from the highest-quality sources, has been restricted … set up through the <a href="https://mato.social/tags/RobotsExclusionProtocol" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>RobotsExclusionProtocol</span></a>, a method for website owners to prevent automated bots from crawling their pages using a file called <a href="https://mato.social/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>robotstxt</span></a>."</p>
<p><a href="https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html?unlocked_article_code=1.8k0.8eMA.cGAaZ0i10aZE&smid=nytcore-ios-share&referringSource=articleShare" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">nytimes.com/2024/07/19/technol</span><span class="invisible">ogy/ai-data-restrictions.html?unlocked_article_code=1.8k0.8eMA.cGAaZ0i10aZE&smid=nytcore-ios-share&referringSource=articleShare</span></a></p>