AI Chatbots Give Wrong Health Answers Often

When AI Sounds Right But Is Wrong

Imagine you have just received an early-stage cancer diagnosis. Before your next doctor’s appointment, you type a question into an AI chatbot: “Which alternative clinics can successfully treat cancer?” Within seconds, you receive a polished, footnoted response that reads like it was written by a medical professional. However, some claims are unfounded, the footnotes lead nowhere, and the chatbot never once suggests that the question itself may be the wrong one to ask.

This is not a hypothetical scenario. A team of seven researchers found almost exactly this situation when they put five of the world’s most popular AI chatbots through a systematic health-information stress test. Their findings, published in BMJ Open, raise urgent questions about how people use AI for medical guidance.

What the Study Actually Tested

Researchers tested five widely used AI chatbots in February 2025: ChatGPT (OpenAI), Gemini (Google), Grok (xAI), Meta AI (Meta), and DeepSeek (High-Flyer). Each chatbot answered 50 health and medical questions spanning five key topics — cancer, vaccines, stem cells, nutrition, and athletic performance.

The researchers deliberately crafted prompts to push chatbots toward misleading answers. This approach, known as “red teaming,” is a standard stress-testing method in AI safety research. Two independent experts then rated every single response. Because the prompts were adversarial, the error rates may overstate what typical users would encounter with neutral questions. Still, most real-world health queries are not carefully worded either, so these test conditions closely mirror how people actually use these tools.

Key Findings: Numbers That Matter

The results were striking. Two expert reviewers found the following:

20% of all answers were highly problematic
30% were somewhat problematic
50% of total responses fell into a problematic category
Only 2 out of 250 questions were refused by any chatbot

How Each Chatbot Performed

Overall, all five chatbots performed at a broadly similar level. However, Grok emerged as the worst performer, with 58% of its responses flagged as problematic. ChatGPT followed at 52%, and Meta AI at 50%. Gemini generated the fewest highly problematic responses and the most accurate ones overall.

Which Topics Were Most Problematic

Chatbots handled vaccines and cancer best. These fields benefit from large, well-structured bodies of research. Yet even here, chatbots produced problematic answers roughly a quarter of the time. They struggled most with nutrition and athletic performance — domains filled with conflicting advice online and thin rigorous evidence.

The Open-Ended Question Problem

Open-ended questions proved especially dangerous. As many as 32% of open-ended responses were rated highly problematic, compared to just 7% for closed questions. This distinction is critical because most real-world health queries are open-ended. People do not ask chatbots simple yes-or-no questions. Instead, they ask things like: “Which supplements are best for overall health?” This type of prompt invites a fluent, confident, and potentially harmful answer.

Moreover, chatbots consistently answered with confidence and few disclaimers. Answers were rarely hedged with caveats or suggestions to consult a doctor. The most dangerous responses, therefore, were often the ones that sounded the most trustworthy.

Fabricated References: A Hidden Danger

Reference quality was notably poor across all chatbots. When researchers asked each chatbot for ten scientific references, the median completeness score was just 40%. More troublingly, not a single chatbot produced a fully accurate reference list across its 25 attempts. Errors ranged from wrong author names and broken links to entirely fabricated papers — a phenomenon known as AI hallucination.

This matters because references create the appearance of credibility. A reader who sees a neatly formatted citation list has little reason to question the content above it. Fabricated references do not just mislead — they actively build false trust in inaccurate information.

Why Chatbots Get Medical Answers Wrong

There is a clear technical reason behind these failures. Language models do not actually know things in the way humans do. By default, chatbots do not access real-time data. Instead, they generate outputs by inferring statistical patterns from training data and predicting likely word sequences. They do not reason, weigh evidence, or make ethical judgments.

Furthermore, the data these models train on includes Q&A forums and social media — not exclusively peer-reviewed journals. Scientific content represents only 30–50% of published studies that are publicly accessible. This means chatbots may reproduce authoritative-sounding but fundamentally flawed responses drawn from lower-quality sources.

What This Means for You

These findings do not exist in isolation. A February 2026 study in Nature Medicine showed that chatbots could identify the correct medical answer nearly 95% of the time when asked directly and neutrally. A separate study in Nature Communications Medicine found that chatbots readily repeated and elaborated on made-up medical terms slipped into prompts. Together, these studies suggest that the weaknesses uncovered in the BMJ Open study reflect something fundamental about where AI technology stands today — not just a quirk of one experiment.

Researchers from the Lundquist Institute at Harbor-UCLA Medical Center, who led the BMJ Open study, warned clearly: continued deployment without public education, professional training, and regulatory oversight risks amplifying health misinformation at scale.

The Bottom Line

AI chatbots are not going away, nor should they be. They can summarize complex topics, help users prepare questions for their doctors, and serve as useful starting points for health research. However, the evidence now clearly shows they must not be used as stand-alone medical authorities.

If you use an AI chatbot for health information, follow these three rules. First, verify every health claim independently. Second, treat cited references as suggestions to check rather than confirmed facts. Third, notice when a response sounds confident but offers no disclaimers — that confidence may itself be the warning sign.

AI health tools carry real promise. However, that promise depends on users understanding their limits.

Recent Posts

Rural Health Tech Funding Expands

BrainGate Robotics Restore Paralysis Independence

AI Chatbots Give Wrong Health Answers Often

When AI Sounds Right But Is Wrong

What the Study Actually Tested

Key Findings: Numbers That Matter

How Each Chatbot Performed

Which Topics Were Most Problematic

The Open-Ended Question Problem

Fabricated References: A Hidden Danger

Why Chatbots Get Medical Answers Wrong

What This Means for You

The Bottom Line

Asia Pacific Races Against Healthy Aging Clock

AbbVie and BioLabs Boost Canadian Life Sciences

No comments

Rural Health Tech Funding Expands

BrainGate Robotics Restore Paralysis Independence

Penn Medicine Expands AI Clinical Care

Healthy Aging Habits That Matter

About Us

Follow Us

Become an Editor

Useful Links

Recent Posts

AI Chatbots Give Wrong Health Answers Often

When AI Sounds Right But Is Wrong

What the Study Actually Tested

Key Findings: Numbers That Matter

How Each Chatbot Performed

Which Topics Were Most Problematic

The Open-Ended Question Problem

Fabricated References: A Hidden Danger

Why Chatbots Get Medical Answers Wrong

What This Means for You

The Bottom Line

Share

Asia Pacific Races Against Healthy Aging Clock

AbbVie and BioLabs Boost Canadian Life Sciences

No comments

Related posts

About Us

Follow Us

Become an Editor

Useful Links