Penn State Study: AI Health Q&A Accuracy Nears 76%
2026-06-02 10:57

en.Wedoany.com Reported - A research team from Penn State University recently released a study on the reliability of large language models in medical Q&A. The study found that artificial intelligence chatbots answered everyday health-related questions from ordinary users with an overall accuracy rate of approximately 76.2%. The result has renewed attention on how far AI can be relied upon in medical consultations, customer service, and other high-risk Q&A scenarios.

The study focused on health questions that ordinary internet users might ask, rather than solely testing medical exam question banks or expert-written cases. The research team organized an AI Q&A competition called "Diagnose-a-thon" at Penn State University, where 34 participants submitted 212 sets of prompts and AI-generated responses based on real or imagined health concerns. The models used included ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b. Nine certified physicians then evaluated the responses for accuracy and potential harm. Approximately 76.2% of the AI-generated responses were judged to provide accurate information, leaving more than 20% that were not. In a field like healthcare, where the margin for error is low, a shortfall of that size is enough to undermine users' trust in such a system.

The study also found significant differences in performance across medical specialties. AI responses in fields such as obstetrics and gynecology and otolaryngology showed higher effectiveness and lower potential harm scores, while responses in fields such as internal medicine, neurology, and dermatology were weaker, with lower effectiveness and higher potential risk. Prompt quality also influenced the results: more specific questions, and those between 60 and 250 characters in length, were more likely to yield accurate outputs.
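
For developers building a health Q&A front end, the prompt-length finding could be turned into a simple check that runs before a question is sent to a model. The following is only a minimal sketch: the function name `prompt_length_hint` and its return labels are hypothetical, and treating the 60 and 250 character bounds as inclusive is an assumption, since the article does not say how the study handled the endpoints. Length alone does not capture the study's other finding about specificity.

```python
# Character range the article reports as more likely to yield accurate outputs.
# Treating both bounds as inclusive is an assumption, not a detail from the study.
MIN_CHARS = 60
MAX_CHARS = 250


def prompt_length_hint(prompt: str) -> str:
    """Classify a prompt by length (hypothetical helper, for illustration only)."""
    length = len(prompt.strip())  # ignore leading and trailing whitespace
    if length < MIN_CHARS:
        return "too_short"
    if length > MAX_CHARS:
        return "too_long"
    return "in_range"


if __name__ == "__main__":
    question = "I have a headache"
    # A very short question falls below the reported range.
    print(prompt_length_hint(question))
```

A front end might use the "too_short" result to ask the user for more detail, such as duration or accompanying symptoms, before submitting the question.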

These findings have direct implications for both medical AI and customer service systems. When health Q&A chatbots are aimed directly at patients, users often treat the responses as diagnostic advice or a basis for action, yet the models may lack the ability to perform physical examinations, follow up on medical history, analyze laboratory and imaging data, and stratify clinical risks. For hospitals, insurance companies, pharmacy platforms, and digital health enterprises, AI is better suited to tasks such as preliminary information organization, summarizing pre-consultation materials, explaining common questions, and assisting physicians with information retrieval, with trained doctors handling judgment, confirmation, and communication. Especially in fields like neurology and dermatology, which rely heavily on professional experience and clinical observation, AI responses need to be integrated into the physician's workflow and should not serve as the final basis for patient self-diagnosis.

The Penn State University team believes that AI will not simply replace human doctors, but has the potential to enhance doctors' ability to process information, explain medical knowledge, and serve patients. The study is scheduled to be presented at the 2026 ACM Conference on Fairness, Accountability, and Transparency, to be held from June 25 to 28 in Montreal, Canada. As chatbots continue to enter medical, financial, government, and enterprise customer service systems, accuracy, risk warnings, professional intervention mechanisms, and responsibility boundaries will become key conditions for the large-scale deployment of AI customer service.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com