gms | German Medical Science

25th Annual Meeting of the Netzwerk Evidenzbasierte Medizin e. V.

Netzwerk Evidenzbasierte Medizin e. V. (EbM-Netzwerk)

13–15 March 2024, Berlin

Do large language models save resources when applying common evidence appraisal tools?

Meeting Abstract

  • Tim Woelfle - University Hospital Basel and University of Basel, Pragmatic Evidence Lab, Research Center for Clinical Neuroimmunology and Neuroscience Basel (RC2NB), Basel, Switzerland; University Hospital Basel, Department of Neurology, Basel, Switzerland
  • Julian Hirt - University Hospital Basel and University of Basel, Pragmatic Evidence Lab, Research Center for Clinical Neuroimmunology and Neuroscience Basel (RC2NB), Basel, Switzerland; Eastern Switzerland University of Applied Sciences, Department of Health, Switzerland
  • Perrine Janiaud - University Hospital Basel and University of Basel, Pragmatic Evidence Lab, Research Center for Clinical Neuroimmunology and Neuroscience Basel (RC2NB), Basel, Switzerland
  • John P. A. Ioannidis - Stanford University, Meta-Research Innovation Center at Stanford (METRICS), USA; Stanford University, Departments of Medicine, of Epidemiology and Population Health, of Biomedical Data Science, and of Statistics, USA
  • Lars G. Hemkens - University Hospital Basel and University of Basel, Pragmatic Evidence Lab, Research Center for Clinical Neuroimmunology and Neuroscience Basel (RC2NB), Basel, Switzerland; Stanford University, Meta-Research Innovation Center at Stanford (METRICS), USA

Evidenzbasierte Politik und Gesundheitsversorgung – erreichbares Ziel oder Illusion? [Evidence-based policy and health care – achievable goal or illusion?]. 25th Annual Meeting of the Netzwerk Evidenzbasierte Medizin. Berlin, 13–15 March 2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. Doc24ebmPS3-09

doi: 10.3205/24ebm075, urn:nbn:de:0183-24ebm0750

Published: 12 March 2024

© 2024 Woelfle et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Background/research question: It is unknown whether large language models (LLMs) can facilitate the time- and resource-intensive text-related processes of evidence appraisal. Our aim was to quantify the agreement of LLMs with human raters on evidence-appraisal tasks of different complexity: assessing the reporting (PRISMA) and methodological rigor (AMSTAR) of systematic reviews, and the degree of pragmatism of clinical trials (PRECIS-2).

Methods: Three state-of-the-art LLMs (OpenAI’s GPT-3.5 and GPT-4; Anthropic’s Claude-2) assessed 112 systematic reviews (SRs) with PRISMA and AMSTAR, and 56 randomized controlled trials (RCTs) with PRECIS-2. Corresponding ratings from two independent human raters and their consensus were available. We quantified accuracy as the agreement between the human consensus and (1) individual human raters, (2) individual LLMs, (3) combined LLMs, and (4) human-AI collaboration. Ratings were marked as deferred (undecided) when the combined LLMs, or the human rater and the LLM, disagreed.
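The combination and deferral logic lends itself to a simple illustration. Below is a minimal sketch in Python (not the authors' code; the function names and the example ratings are hypothetical): item-level ratings from two raters are kept where they agree and deferred where they disagree, and accuracy is then computed against the human consensus on the decided items only.

```python
from typing import Optional, Sequence

def combine(rating_a: str, rating_b: str) -> Optional[str]:
    """Keep a rating if both raters agree; otherwise defer (None)."""
    return rating_a if rating_a == rating_b else None

def accuracy_and_deferral(
    ratings_a: Sequence[str],
    ratings_b: Sequence[str],
    consensus: Sequence[str],
) -> tuple[float, float]:
    """Accuracy on non-deferred items and the fraction of items deferred."""
    combined = [combine(a, b) for a, b in zip(ratings_a, ratings_b)]
    decided = [(c, truth) for c, truth in zip(combined, consensus) if c is not None]
    deferral_rate = 1 - len(decided) / len(consensus)
    accuracy = sum(c == t for c, t in decided) / len(decided) if decided else float("nan")
    return accuracy, deferral_rate

# Hypothetical example: two raters on five binary PRISMA items
acc, deferred = accuracy_and_deferral(
    ["yes", "no", "yes", "yes", "no"],  # rater 1 (e.g. a human rater)
    ["yes", "no", "no", "yes", "no"],   # rater 2 (e.g. an LLM)
    ["yes", "no", "yes", "no", "no"],   # human consensus (reference standard)
)
print(f"Accuracy on decided items: {acc:.0%}, deferred: {deferred:.0%}")
```

The same function covers both schemes described above: passing two LLMs yields the combined-LLM accuracy, and passing one human rater together with one LLM yields the human-AI collaboration accuracy.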

Results: Individual human rater accuracy was 89% (2,686/3,024) for PRISMA (27 items × 112 SRs), 89% (1,096/1,232) for AMSTAR (11 items × 112 SRs), and 75% (379/504) for PRECIS-2 (9 items × 56 RCTs). Individual LLM accuracy was lower, ranging from 63% (GPT-3.5) to 70% (Claude-2) for PRISMA, from 53% (GPT-3.5) to 70% (GPT-4) for AMSTAR, and from 38% (GPT-4) to 55% (GPT-3.5) for PRECIS-2. Combined LLM ratings reached accuracies of 76–85% for PRISMA (9–67% of items inconsistent and thus deferred), 70–83% for AMSTAR (14–76% deferred), and 61–74% for PRECIS-2 (55–96% deferred). Combining a human rater with individual LLMs yielded the highest accuracies: 89–96% for PRISMA (25–41% deferred), 91–96% for AMSTAR (27–52% deferred), and 80–86% for PRECIS-2 (64–75% deferred).
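As a quick arithmetic check on the denominators and individual human rater accuracies reported above (a minimal sketch; all counts are taken directly from the abstract):

```python
# tool: (items per instrument, studies assessed, correct individual human ratings)
counts = {"PRISMA": (27, 112, 2686), "AMSTAR": (11, 112, 1096), "PRECIS-2": (9, 56, 379)}
for tool, (n_items, n_studies, n_correct) in counts.items():
    total = n_items * n_studies  # e.g. 27 x 112 = 3,024 PRISMA ratings
    print(f"{tool}: {n_correct}/{total} = {n_correct / total:.0%}")
```

This reproduces the reported 89%, 89%, and 75%.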

Conclusion: Current LLMs alone appraised evidence substantially worse than humans did. Pairing a first human rater with an LLM in a human-AI collaboration may reduce the workload of the second human rater when assessing reporting (PRISMA) and methodological quality (AMSTAR), but not for more complex tasks such as assessing the pragmatism of clinical trials (PRECIS-2).

Competing interests: RC2NB (Research Center for Clinical Neuroimmunology and Neuroscience Basel) is supported by the Foundation Clinical Neuroimmunology and Neuroscience Basel. All authors declare no competing interests.