Millions of people now turn to AI chatbots like ChatGPT for health advice — but a landmark new study warns that this growing habit could be putting lives at risk. The largest user study to date examining how large language models (LLMs) support real people making medical decisions finds that these systems can provide inaccurate, inconsistent, and potentially dangerous advice when users seek help with their own symptoms. Published in the prestigious journal Nature Medicine in February 2026, the findings are a wake-up call for both the public and AI developers.
What the Oxford Study Found
The study, published Monday, offers a sobering look at whether chatbots, which have fast become a major source of health information, actually give good medical advice to the public. The experiment found that AI chatbots were no better than Google, itself a flawed source of health information, at guiding users toward the correct diagnoses or helping them decide what to do next.
The participants chose the right course of action less than 50% of the time and identified the correct conditions only about 34% of the time — no better than a control group using general online research. These numbers are striking, especially given that many of the same AI systems can pass rigorous medical licensing exams with near-perfect scores.
How the Study Was Conducted
Academics from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford partnered with MLCommons and other institutions to evaluate the medical advice people get from large language models. The authors conducted a study with 1,298 UK participants who were asked to identify potential health conditions and to recommend a course of action in response to one of ten different expert-designed medical scenarios.
The scenarios ranged from everyday situations — a young man developing a severe headache after a night out — to more complex cases like a new mother feeling persistently exhausted and out of breath. Participants in the treatment group used AI chatbots including GPT-4o, Llama 3, and Command R+, while a control group relied on traditional methods such as internet searches or personal judgment.
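To make that setup concrete, here is a rough sketch of what a single treatment-arm interaction might look like. This is not the researchers' actual code or protocol; it assumes the OpenAI Python SDK, and the prompt wording is only a paraphrase of the headache scenario described above.

```python
# Illustrative sketch only: a participant pastes a study-style scenario into a
# chat model and reads back its advice. The model choice and prompt wording
# are stand-ins, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

scenario = (
    "I'm a 28-year-old man. After a night out I've developed a really "
    "severe headache. What do you think it could be, and what should I do?"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": scenario}],
)

# Participants then had to judge the likely condition and the next step
# (for example self-care, see a GP, or go to emergency care) from a reply like this.
print(response.choices[0].message.content)
```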
The Communication Breakdown Problem
One of the study’s most revealing findings was a two-way communication failure between users and AI systems. Participants often didn’t know what information the LLMs needed to offer accurate advice, and the responses they received frequently combined good and poor recommendations, making it difficult to identify the best course of action.
Real patient questions look nothing like the exam-style prompts used to test large language models. People ask emotional, leading, or risky questions that can push a chatbot in the wrong direction. One challenge is the technology’s tendency to be agreeable: the objective is to provide an answer the user will like, so chatbots won’t necessarily push back. This people-pleasing tendency can lead to genuinely dangerous outcomes.
Real-World Risks: Wrong Diagnoses, Mixed Messages
The study uncovered alarming inconsistencies in how AI chatbots handled identical or near-identical medical situations. Two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice. One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care.
Other errors included fabricated contact information, with one chatbot recommending both a partial US phone number and Australia’s emergency number “Triple Zero” in the same conversation. The researchers also found that chatbots struggled to distinguish urgent from non-urgent situations, sometimes provided fabricated information, and were highly sensitive to small changes in how questions were phrased.
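That sensitivity to phrasing is easy to picture with a small sketch. The example below is hypothetical, again assuming the OpenAI SDK, with invented prompt wordings rather than the study's materials: it sends two near-identical descriptions of the same symptom to one model and prints both replies for comparison.

```python
# Illustrative sketch of phrasing sensitivity: two near-identical symptom
# descriptions can draw different advice from the same model.
from openai import OpenAI

client = OpenAI()

variants = [
    "I suddenly got the worst headache of my life an hour ago. What should I do?",
    "I've had a really bad headache for about an hour. What should I do?",
]

for prompt in variants:
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # One phrasing may elicit "call emergency services", the other "rest and
    # take a painkiller", even though both could describe the same emergency.
    print(prompt, "->", reply.choices[0].message.content[:200])
```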
Why AI Passes Exams But Fails Patients
The disconnect between AI benchmark performance and real-world effectiveness is at the heart of this issue. Despite now being able to ace most medical licensing exams, AI chatbots do not give humans better health advice than they can find using more traditional methods. Researchers point to the fact that standardized tests do not capture the complexity, emotion, and ambiguity of actual human health conversations.
Most people have heard about “hallucinations,” when AI models make up facts. But research highlights a less-obvious risk: answers that are technically correct but medically inappropriate because they lack context. A chatbot might accurately state a medical fact while completely failing to recognize that the specific user’s situation demands urgent care.
What Experts Are Saying
Senior author Associate Professor Adam Mahdi of the Oxford Internet Institute said: “The disconnect between benchmark scores and real-world performance should be a wake-up call for AI developers and regulators. We cannot rely on standardised tests alone to determine if these systems are safe for public use. Just as we require clinical trials for new medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities in high-stakes settings like healthcare.”
The study also warns: “As more people rely on chatbots for medical advice, we risk flooding already strained hospitals with incorrect but plausible diagnoses.”
How to Use AI Health Tools Safely
While AI chatbots are not ready to replace physicians, they can still serve a limited role. Experts recommend using them to understand general wellness information, decode medical articles, or prepare questions before a doctor’s appointment — rather than as a substitute for professional diagnosis. Always verify AI-generated health advice with a qualified medical professional, especially for urgent or complex symptoms.
One out of every six US adults asks AI chatbots about health information at least once a month, a number expected to rise. Understanding the risks is the first step toward using these tools responsibly.
