AI chatbots aimed at general-purpose consumer use routinely misdiagnose illnesses when presented with incomplete patient data, a study has found, highlighting the risks of an increasingly common use case for such tools.
For all models tested, the study from the Massachusetts-based Mass General Brigham healthcare system found that failure rates were above 80 percent for differential diagnosis, in which possible conditions are weighed without full patient information.
In such cases, the models found it difficult to suggest a range of possible diagnoses, frequently narrowing to a single answer.
High error rates
The tools, including leading models from Anthropic, DeepSeek, Google, OpenAI and xAI, performed well when complete information was provided.
The study tested 21 LLMs in all, finding that error rates fell below 40 percent for final diagnoses made with more complete data.
The best performers recorded 90 percent accuracy, the researchers said in the study, which was published in JAMA Network Open on Monday.
Researchers tested AI models using 29 clinical vignettes based on a standard medical reference text.
The findings recall the persistent difficulty of limiting so-called “hallucinations” in AI models, in which a model produces incorrect information, often when it has limited access to relevant data.
Specialised tools
Anthropic, Google and OpenAI said they have safeguards built in to discourage the use of their models for clinical diagnoses.
People are nevertheless increasingly using such models in this way, with a poll in March finding that one in three US adults had turned to AI chatbots for medical advice in the past year.
Companies including Google and Amazon are developing chatbots specifically geared to deliver medical advice.