ChatGPT-3 and ChatGPT-4, two iterations of the natural language processing tool, failed the American College of Gastroenterology self-assessment tests, scoring 65.1% and 62.4%, respectively. The study concluded that ChatGPT should not currently be used for gastroenterology medical education because of its limitations in accessing up-to-date and accurate information. The findings underscore the need for skepticism about AI's role in healthcare and the importance of relying on traditional resources when preparing for medical exams.
ChatGPT-3 and ChatGPT-4, two iterations of the natural language processing (NLP) tool, have recently been evaluated in a study published in the American Journal of Gastroenterology. The study revealed that both versions failed to pass the 2021 and 2022 multiple-choice self-assessment tests for the American College of Gastroenterology (ACG). This outcome may limit the use of these tools for medical education in the field of gastroenterology.
The medical community has been exploring the potential applications of ChatGPT in healthcare, including medical education. While the large language model has demonstrated success in other areas, such as passing US Medical Licensing Exam (USMLE)-style tests and providing accurate information on cancer misconceptions, its application in gastroenterology has not been thoroughly investigated.
To evaluate the capabilities of ChatGPT-3 and ChatGPT-4, researchers from Arkansas Gastroenterology, Northwell Health, and Northwell's Feinstein Institutes for Medical Research had both versions of the tool answer a total of 455 questions drawn from the two ACG tests. ChatGPT-3 answered 296 questions correctly, scoring 65.1%, while ChatGPT-4 answered 284 correctly, scoring 62.4%. Neither version reached the passing threshold of 70% or higher required by the ACG assessment.
Dr. Arvind Trindade, senior author of the study and associate professor at the Feinstein Institutes' Institute of Health System Science, stated that, based on the research findings, ChatGPT should not be used for medical education in gastroenterology at this time and that further improvements are needed before the tool is implemented in the healthcare field.
The researchers also highlighted some limitations of ChatGPT. As a language model, it generates text based on user prompts and predicted word sequences, relying on the data it was trained on. Its failure on the gastroenterology tests could be attributed to the tool drawing on outdated or questionable information from non-medical sources. Additionally, its lack of access to paid-subscription medical journals, which contain the most up-to-date and accurate information, may have hindered its performance.
Dr. Andrew C. Yacht, senior vice president of academic affairs and chief academic officer at Northwell Health, emphasized the importance of relying on established resources like books, journals, and traditional studying to pass medical exams. The study serves as a reminder that, despite the enthusiasm surrounding ChatGPT and AI's potential role in healthcare and education, the current accuracy and validity of AI in these fields still warrant skepticism.