AI Health Tools: Promise, Risks, and Gaps


The Rise of AI Health Chatbots

AI health tools are multiplying fast. In March 2026, Microsoft launched Copilot Health inside its Copilot app, letting users connect their medical records and ask direct health questions. Just days earlier, Amazon had opened its Health AI tool, previously available only to One Medical members, to the general public. These launches follow OpenAI’s ChatGPT Health, released in January, and Anthropic’s Claude, which can access health records with user permission.

Together, these products signal a clear shift: health AI for everyday consumers is no longer a niche experiment but a mainstream trend.

Why Demand Is Surging

Millions Ask Health Questions Daily

The numbers behind these launches are striking. Microsoft reports that its Copilot platform receives 50 million health-related questions every day; health is the most popular topic on the Copilot mobile app. OpenAI’s Karan Singhal, who leads the Health AI team, says the company noticed a rapid rise in health queries even before it built dedicated health products.

Access Problems Drive Adoption

Experts point to a deeper reason behind this surge. For millions of people, getting medical advice through traditional systems is difficult and slow. Girish Nadkarni, Chief AI Officer at Mount Sinai Health System, puts it plainly: access to healthcare is hard, especially for vulnerable populations. Consequently, people turn to chatbots that are available 24 hours a day, carry no judgment, and respond instantly.

This creates a powerful use case. AI health tools could help users decide whether they need medical care, guide them through symptoms, and reduce pressure on emergency rooms. Researchers call this function “triage,” and it could be genuinely valuable — if it works correctly.

What These Tools Can and Cannot Do

Promising Capabilities

Some research supports cautious optimism. Current large language models (LLMs) can provide safe, useful health guidance in many scenarios. Moreover, Google’s recent study of its Articulate Medical Intelligence Explorer (AMIE) chatbot showed that its diagnostic accuracy matched that of physicians, and none of the test conversations raised major safety concerns.

Real and Present Risks

However, the same tools also show worrying gaps. A widely discussed Mount Sinai study found that ChatGPT Health sometimes over-recommends care for mild conditions and occasionally fails to flag genuine emergencies. Both failure modes raise serious questions about triage reliability.

All six academic experts consulted for the original MIT Technology Review report agreed on one point: these tools carry real upsides for underserved populations. Yet all six also worried about the same problem: insufficient independent testing before public release.

Disclaimers on these platforms warn users not to seek diagnoses from chatbots. Nevertheless, as Adam Rodman, a physician and researcher at Beth Israel Deaconess Medical Center, notes bluntly: “We all know that people are going to use it for diagnosis and management.”

The Evaluation Problem

Company Testing Has Blind Spots

Tech companies do test their own products. OpenAI developed HealthBench, a benchmark that scores LLM responses to realistic health conversations. When GPT-5 launched, it scored significantly better on HealthBench than previous models, though it still fell well short of perfect performance.
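
The mechanics of a rubric-based benchmark of this kind are easy to sketch. The toy harness below is a minimal illustration, not OpenAI’s actual implementation: each test conversation carries weighted grading criteria (HealthBench’s were written by physicians), a grader decides whether each criterion is met, and the score is the share of available points earned. The data structures and the keyword-matching grade_criterion stub here are hypothetical; a real harness would use physician or model-based grading.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str    # e.g. "advises seeking emergency care for chest pain"
    points: int  # positive for desirable behavior, negative for harmful

@dataclass
class Case:
    conversation: str        # the user's health question or dialogue
    rubric: list[Criterion]  # weighted grading criteria for this case

def grade_criterion(response: str, criterion: Criterion) -> bool:
    """Hypothetical stub: a real harness would have a physician or a
    grader model judge whether the response satisfies the criterion."""
    return criterion.text.lower() in response.lower()  # placeholder check

def score_case(response: str, case: Case) -> float:
    """Earned points over total available positive points, floored at 0."""
    earned = sum(c.points for c in case.rubric if grade_criterion(response, c))
    available = sum(c.points for c in case.rubric if c.points > 0)
    return max(earned, 0) / available if available else 0.0

# Example: one toy case with a reward criterion and a penalty criterion.
case = Case(
    conversation="I have crushing chest pain and my left arm is numb.",
    rubric=[
        Criterion("call emergency services", points=5),
        Criterion("suggests waiting a few days", points=-5),
    ],
)
print(score_case("You should call emergency services now.", case))  # 1.0
```

A benchmark score is then just the average of these per-case scores across thousands of conversations, which is why a model can improve markedly on the aggregate number while still failing individual cases outright.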

Beyond raw scores, another critical gap exists. Oxford researcher Andrew Bean and his colleagues found a striking pattern in a recent study. Even when an LLM correctly identifies a condition on its own, a non-expert user asking the same question with LLM assistance arrives at the right answer only about one-third of the time. Users without medical knowledge often omit key details from their prompts. They also misread or misapply the chatbot’s response.

Self-Evaluation Is Not Enough

Bean argues that no matter how rigorous a company’s internal research is, it cannot replace independent evaluation. “The evidence base really needs to be there,” he says. External review not only adds impartiality — it also catches the blind spots that internal teams miss.

Stanford’s MedHELM framework, led by Professor Nigam Shah, currently offers one of the more comprehensive third-party evaluation systems. It tests models across a wide range of medical tasks. OpenAI’s GPT-5 holds the top MedHELM score at present. However, Shah acknowledges that even MedHELM evaluates only single responses — not the multi-turn conversations that real users typically have with health chatbots.
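
Shah’s caveat is easier to see in code. A single-response evaluation scores one answer to one prompt, while real chatbot use unfolds over several turns, each shaped by the model’s previous reply. The sketch below illustrates the structural difference only; the ask function is a hypothetical stand-in for any chat-model API, and the scoring callables are left abstract.

```python
def ask(messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-model API call: takes the
    message history, returns the assistant's reply as a string."""
    ...

def single_turn_eval(prompt: str, score) -> float:
    """What single-response benchmarks measure: one prompt in,
    one reply scored in isolation."""
    reply = ask([{"role": "user", "content": prompt}])
    return score(reply)

def multi_turn_eval(opening: str, follow_up, score, turns: int = 3) -> float:
    """Closer to real chatbot use: each user turn depends on the
    model's previous answer, and the whole dialogue is scored."""
    messages = [{"role": "user", "content": opening}]
    for _ in range(turns):
        reply = ask(messages)
        messages.append({"role": "assistant", "content": reply})
        # follow_up simulates a user reacting to the reply, e.g.
        # adding a symptom they initially forgot to mention.
        messages.append({"role": "user", "content": follow_up(reply)})
    return score(messages)
```

The multi-turn version is harder to build and grade, which is part of why most published benchmarks, MedHELM included, stop at single responses.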

What Better Testing Looks Like

Human Studies Set the Gold Standard

Bean’s own research sets a higher bar. His study tested chatbots with actual human users rather than relying on fictional written scenarios, and he recommends such testing as the standard for health AI before any public release. He acknowledges the challenge, however: human studies take time, and AI moves fast. His study used GPT-4o, a model already considered outdated.

Google’s Cautious Approach

Google chose a different path. Although AMIE produced strong results, the company has decided not to release it publicly yet, citing unresolved issues in equity, fairness, and safety testing. That restraint contrasts with competitors who launched first and offers an instructive counterexample.

The Path Forward

Third-Party Benchmarks Are Critical

The clearest consensus among researchers is this: trusted third-party benchmarks must guide how health AI tools are released. Companies can build excellent internal evaluations, but external review provides accountability. OpenAI’s Singhal supports this view and cites HealthBench as a model for community evaluation efforts.

Imperfect Tools Can Still Help

No expert interviewed for the original story demanded perfection from health AI before release; doctors make mistakes too. For someone with little access to a physician, a chatbot that is sometimes wrong may still be a meaningful upgrade over no guidance at all, as long as its errors stay rare and manageable.

For now, however, the evidence does not show whether the available tools improve outcomes or create new risks. More rigorous, independent testing is the only way to answer that question with confidence.
