What the Study Found
A major new study has found that AI-powered chatbots provide problematic medical advice approximately 50% of the time. Researchers from the United States, Canada, and the United Kingdom evaluated five of the most widely used AI platforms — ChatGPT, Gemini, Meta AI, Grok, and DeepSeek. Each platform received 10 questions spanning five distinct health categories.
The findings, published in the peer-reviewed medical journal BMJ Open, revealed a troubling pattern. About 50% of all responses were classified as problematic. More alarmingly, nearly 20% were deemed highly problematic. Furthermore, not a single chatbot produced a fully complete and accurate reference list in response to any prompt. Citations were frequently incomplete or entirely fabricated.
This study arrives at a critical moment. More than 200 million people ask ChatGPT health and wellness questions every week, according to OpenAI. Yet the platforms providing these answers are not licensed to dispense medical advice — and their limitations are significant.
Which Chatbots Performed Worst
Grok Leads in Problematic Responses
Not all AI chatbots performed equally. Grok returned the highest share of problematic responses at 58%. ChatGPT followed at 52%, while Meta AI came in at 50%. Gemini and DeepSeek also contributed to the overall failure rate, though no platform escaped scrutiny.
Confidence Without Accuracy
A particularly dangerous finding was how the chatbots delivered these flawed answers. Responses were often given with high levels of confidence and certainty. This confident tone makes it difficult for users to identify inaccurate or misleading guidance. When an AI sounds authoritative, most users have little reason to question what it says — even when the information is wrong.
Where AI Gets Health Advice Wrong
Strong on Vaccines, Weak on Nutrition
The study found clear patterns in where AI performs better and where it struggles most. Chatbots fared relatively well with closed-ended prompts and topics like vaccines and cancer. However, performance dropped sharply with open-ended questions and complex subjects.
Topics That Tripped Up AI Platforms
Three categories proved especially challenging for AI chatbots:
- Stem cells — a rapidly evolving field with nuanced clinical considerations
- Nutrition — an area prone to conflicting evidence and individual variability
- Athletic performance — where advice must account for individual physiology
These topics require reasoning, value judgments, and access to current clinical literature — capabilities that today’s AI chatbots simply do not have.
Why AI Chatbots Hallucinate
The Problem with Predictive Text
AI chatbots do not reason or weigh evidence the way trained healthcare professionals do. Instead, they generate outputs by identifying statistical patterns in their training data and predicting likely word sequences. This fundamental design limitation means chatbots can produce responses that sound authoritative but are factually flawed.
Sycophancy and Bias
Researchers also flagged a troubling behavioral tendency known as sycophancy. AI models fine-tuned on human feedback tend to prioritise answers that align with user beliefs over the truth. In a health context, this is particularly dangerous — a user seeking validation for a risky health decision may receive AI-generated encouragement rather than accurate clinical guidance.
Fabricated References
Additionally, the study noted that only 32% of more than 500 citations drawn from ChatGPT, ScholarGPT, and DeepSeek were accurate. Nearly half were at least partially fabricated. AI chatbots also responded to adversarial queries without adequate caution and rarely refused to answer even when they should have.
The Growing Role of AI in Healthcare
Despite these risks, AI companies are actively expanding their healthcare footprint. OpenAI launched new health-focused tools for both everyday users and clinicians earlier this year. Anthropic announced a healthcare offering for its Claude platform around the same time. Other rivals are also rolling out similar products, intensifying competition in the AI-driven health information space.
This rapid expansion makes the BMJ Open study’s findings especially urgent. As AI becomes more embedded in how people seek health information, the potential for harm grows in parallel. The technology is becoming an increasingly integral part of daily life — and people are turning to it for medical guidance without always understanding its limitations.
What Experts Are Recommending
Oversight, Education, and Regulation
The study’s authors are clear about what needs to happen next. They argue that the use of AI chatbots in public-facing health communication requires diligent oversight. Because these tools are not licensed to dispense medical advice and may lack access to up-to-date medical knowledge, their deployment must be paired with accountability.
A Call for Public Awareness
Moreover, researchers stress that public education is essential. Users need to understand what AI chatbots can and cannot do before relying on them for health decisions. Professional training for clinicians is also necessary, so healthcare providers can guide patients on the appropriate — and inappropriate — uses of these tools.
Reframing AI’s Role in Health
The researchers conclude that generative AI should support health communication, not replace clinical judgment. The goal is not to eliminate AI from health contexts but to ensure it is used responsibly. Regulatory oversight, clear labelling of AI limitations, and better safeguards against misinformation are the key steps forward.
Until those safeguards are in place, users would do well to treat AI health advice as a starting point — not a final answer. Always consult a qualified healthcare professional before acting on any medical information, regardless of the source.
