What the Study Found
Artificial intelligence health chatbots are increasingly popular, but a major new study reveals a serious gap in their capabilities. Researchers at Mass General Brigham, one of the largest healthcare systems in the United States, found that all 21 leading AI language models they tested failed to produce an appropriate early differential diagnosis in more than 80% of cases.
The study, published in the peer-reviewed journal JAMA Network Open, is one of the most comprehensive evaluations of AI clinical reasoning to date. Importantly, it shows that AI tools are not yet safe for unsupervised diagnostic use — despite rapid improvements in the technology.
Why Differential Diagnosis Is the Real Test
Understanding the First Step in Clinical Reasoning
Before a doctor confirms any diagnosis, they first build a differential diagnosis — a ranked list of possible conditions that could explain a patient’s symptoms. This process is the foundation of clinical reasoning. It demands careful judgment, especially when patient information is limited or ambiguous.
This is precisely where AI tools struggle the most. According to the study’s lead author, Arya Rao of Harvard Medical School, “These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information.”
In contrast, AI chatbots performed far better when researchers gave them complete patient data. Final diagnosis success rates then climbed to over 90% for the best-performing models. The problem, then, is not that AI lacks medical knowledge altogether; it is that AI cannot yet reason through uncertainty the way a trained clinician can.
How Researchers Tested the AI Models
A Stepwise, Doctor-Like Evaluation
The research team tested 21 general-purpose large language models (LLMs), including the latest versions of ChatGPT, DeepSeek, Claude, Gemini, and Grok. Together, these models generated 16,254 responses across 29 standardised clinical case scenarios drawn from the MSD Manual — a peer-reviewed medical reference used to train healthcare professionals.
Crucially, the team fed information to each model in stages, just as it would unfold in a real clinical encounter. They began with basic details like a patient’s age, gender, and symptoms. Next, they added physical examination findings. Finally, they introduced laboratory results and imaging data.
Medical students then scored each model’s responses against established answer keys. This stepwise approach gave researchers a more realistic picture of how AI performs in actual clinical conditions.
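To make the procedure concrete, here is a minimal sketch of what such a staged evaluation loop could look like in code. It is illustrative only, not the study's actual pipeline: the stage names, the prompt wording, and the ask_model and score_response helpers are hypothetical stand-ins for whichever model API and human grading process were used.

```python
# Illustrative sketch of a staged evaluation loop, not the study's actual code.
# Each case is revealed to the model in stages, mirroring a clinical encounter.

CASE_STAGES = ["history", "physical_exam", "labs_and_imaging"]

def evaluate_case(ask_model, score_response, case, answer_key):
    """Reveal one case to a model stage by stage and score each response.

    ask_model(prompt)           -> the model's free-text answer (API call of choice)
    score_response(answer, key) -> a numeric score; in the study this grading was
                                   done by human raters against an answer key
    case, answer_key            -> dicts keyed by stage name
    """
    revealed = []       # information disclosed to the model so far
    stage_scores = {}
    for stage in CASE_STAGES:
        revealed.append(case[stage])
        prompt = (
            "Based only on the information below, give a ranked differential "
            "diagnosis and recommended next steps.\n\n" + "\n\n".join(revealed)
        )
        stage_scores[stage] = score_response(ask_model(prompt), answer_key[stage])
    return stage_scores
```

The key design point the sketch captures is that the model never sees the full case up front: each prompt contains only the information revealed so far, so early-stage answers reflect reasoning under incomplete data.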
Introducing the PrIME-LLM Scoring Framework
To go beyond simple accuracy, the team also developed a new evaluation tool called PrIME-LLM. This framework measures a model’s competency across four stages of clinical reasoning: generating potential diagnoses, recommending appropriate tests, arriving at a final diagnosis, and suggesting treatment.
When a model performs well in one stage but poorly in another, PrIME-LLM captures that imbalance clearly — rather than masking weaknesses by averaging scores across all tasks.
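The study describes the framework only at a high level, so the sketch below simply illustrates that reporting idea with made-up stage names and scores: each reasoning stage is averaged separately, so a weak stage stays visible instead of being washed out by a pooled mean.

```python
# Illustrative only: report per-stage competency rather than one overall average,
# so a weak stage (e.g. early differential diagnosis) remains visible.

REASONING_STAGES = (
    "differential_diagnosis",
    "test_recommendation",
    "final_diagnosis",
    "treatment_plan",
)

def stage_report(scores_by_stage: dict[str, list[float]]) -> dict[str, float]:
    """Average each reasoning stage separately instead of pooling all tasks."""
    return {
        stage: sum(scores) / len(scores)
        for stage, scores in scores_by_stage.items()
        if stage in REASONING_STAGES and scores
    }

# Example: a model that looks fine "on average" but fails the open-ended start.
example = {
    "differential_diagnosis": [0.2, 0.1, 0.3],
    "test_recommendation":    [0.8, 0.7, 0.9],
    "final_diagnosis":        [0.9, 0.95, 0.9],
    "treatment_plan":         [0.85, 0.8, 0.9],
}
print(stage_report(example))
# A single pooled mean (about 0.7) would hide the 0.2 differential-diagnosis score.
```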
Which AI Models Performed Best
Top Performers Still Fell Short of Clinical Standards
Under the PrIME-LLM framework, overall model scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5. The top-performing cluster also included GPT-4.5, Claude Opus, Gemini 3.0 Flash, and Gemini 3.0 Pro.
More recent models outperformed older ones, confirming that incremental progress is real. However, none of the models reached the threshold for safe, unsupervised clinical deployment. Even the best performers missed appropriate differential diagnoses more than 80% of the time at the early stages of a case.
Why AI Still Falls Short in Clinical Reasoning
The Gap Between Knowledge and Judgment
AI language models excel at pattern recognition within text. However, medicine demands much more than pattern matching. A real clinical consultation involves interpreting a patient’s story, navigating uncertainty, exploring concerns, and building trust — skills that go far beyond predicting the next likely word in a sentence.
Furthermore, the study identified a key behavioural flaw: AI chatbots tend to reach diagnostic conclusions too quickly, particularly when relevant patient data is missing. Rather than flagging uncertainty, these models often produce confident-sounding responses that may be clinically incorrect.
Corresponding author Dr Marc Succi put it plainly: “Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment.”
What This Means for Patients and Doctors
Human Oversight Remains Essential
The findings carry clear implications for anyone using AI tools for health guidance. Patients who rely on chatbots for early symptom assessment risk receiving inaccurate information — with real consequences including delayed treatment, unnecessary tests, and increased healthcare costs.
For healthcare providers, the study reinforces that AI must operate with a human in the loop. The most responsible current use is clinician-supervised support in low-uncertainty tasks — such as drafting clinical notes, summarising records, or generating referral letters — rather than front-line diagnosis.
Dr Succi further advised, “The recommendation for the public is to use these technologies with caution and, when faced with any health concern, always consult a healthcare professional.”
The Road Ahead for AI in Healthcare
Progress Is Real, but Caution Is Still Needed
Despite these limitations, AI’s role in medicine is growing. Notably, when models received complete clinical information, failure rates for final diagnosis dropped to as low as 9% for the best performers. This shows that AI can be highly effective once sufficient data is available.
Going forward, experts are calling for stricter regulatory guidelines on AI deployment in clinical settings. Additionally, they recommend developing AI models with purpose-built clinical reasoning capabilities — rather than relying on general-purpose language models for high-stakes medical decisions.
As the technology advances, the key lesson from this study is clear: AI in healthcare works best as a powerful support tool, not as a standalone doctor.
