Kate Nyhan<p>Someone explain this to me -- I thought the whole point of considering LLMs for <a href="https://fediscience.org/tags/EvidenceSynthesis" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>EvidenceSynthesis</span></a> screening was to process large datasets quickly. <br>But in this paper<br>Ghossein, J., Hryciw, B. N., Ramsay, T., &amp; Kyeremanteng, K. (2025). The AI Reviewer: Evaluating AI’s Role in Citation Screening for Streamlined Systematic Reviews. JMIR Formative Research, 9(1), e58366. <a href="https://doi.org/10.2196/58366" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">doi.org/10.2196/58366</span><span class="invisible"></span></a><br>They had a human-labeled dataset of 1186 citations, yet they only tested their LLMs on 121 of them -- all of the included studies and 9% of the excluded ones.</p>