ChatGPT can outperform first- and second-year medical students on difficult clinical care exam questions, according to a recent study by Stanford researchers. The findings underscore how quickly artificial intelligence (AI) is affecting medical education and clinical practice, and they point to the need for a new approach to training tomorrow’s physicians.
ChatGPT is the best known of the large language model AI systems that have recently captured the world’s attention. These systems work as online chatbots: users submit text and immediately get back automatically generated, human-like language in response. The systems are trained on enormous amounts of text from across the internet.
Recent studies have shown that ChatGPT can handle multiple-choice questions from the United States Medical Licensing Examination (USMLE), which doctors must pass to practice medicine. The Stanford authors wanted to see how the AI system would fare on the more challenging, open-ended questions used to gauge first- and second-year students’ clinical reasoning abilities. These questions reveal the specifics of a patient case in discrete paragraphs between questions, prompting students to apply clinical reasoning skills, such as generating potential diagnoses, as new information emerges.
In their recently published study in JAMA Internal Medicine, the researchers found that the model performed significantly better than the students on this case-report component of the exam, scoring more than four points higher on average.
“We were very surprised at how well ChatGPT did on these kinds of free-response medical reasoning questions by exceeding the scores of the human test-takers,” says Eric Strong, a hospitalist and clinical associate professor at Stanford School of Medicine and one of the study’s authors.
“With these kinds of results, we’re seeing the nature of teaching and testing medical reasoning through written text being upended by new tools,” says co-author Alicia DiGiammarino, Practice of Medicine Year 2 education manager at the School of Medicine. “ChatGPT and other programs like it are changing how we teach and, ultimately, practice medicine.”
AI Is A Good Student
The current study used the most recent version of ChatGPT, GPT-4, in March 2023. The research builds on a prior study that Strong and DiGiammarino conducted with GPT-3.5, the predecessor version made by San Francisco-based OpenAI and released in November 2022.
For both studies, the Stanford researchers assembled 14 clinical reasoning cases. The cases, with text descriptions ranging from several hundred to a thousand words, contain plenty of extraneous information, such as unrelated chronic illnesses and medications, much like real patient medical charts. Exam takers must write lengthy responses to a series of questions presented after each case report.
Analyzing these texts and crafting original answers stands in contrast to the comparative ease of the USMLE’s multiple-choice questions, which consist of a brief passage, a question, and five possible answers. Nearly all of the information provided is relevant to the correct answer.
“It’s not very surprising that ChatGPT and algorithms like it would perform well on multiple-choice questions,” Strong said. “Most of the test comes down to information retention, because everything test takers are told is a key component of the question. An open-ended, free-response question is much harder to crack.”
Before answering the case-based questions, ChatGPT did need a bit of help from prompt engineering. Because ChatGPT draws on the entire internet, it can misinterpret healthcare-specific terms used in the exam. One example is the term “problem list,” which refers to a patient’s past and ongoing medical problems but can also appear in other, non-medical contexts.
The Stanford researchers modified a few questions as needed, entered the material into ChatGPT, recorded the chatbot’s responses, and passed the results to experienced faculty graders. The AI program’s performance was then compared with that of first- and second-year medical students who worked through the same cases.
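As an illustration only, here is a minimal sketch of what such a workflow might look like in code, assuming the OpenAI Python client; the clarifying preamble, case text, and question below are hypothetical placeholders, not the study’s actual prompts or materials.

# Hypothetical sketch of the workflow described above, using the OpenAI Python client.
# The preamble, case text, and question are illustrative placeholders only.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Clarify healthcare-specific terms the model might otherwise misread.
PREAMBLE = (
    "You are answering a medical clinical-reasoning exam. "
    "In this context, 'problem list' means the patient's past and ongoing "
    "medical problems, not a generic to-do list."
)

case_text = "A 58-year-old presents with two days of fever and a productive cough..."  # placeholder
question = "List the three most likely diagnoses and briefly justify each."            # placeholder

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": PREAMBLE},
        {"role": "user", "content": f"{case_text}\n\n{question}"},
    ],
)

# Record the chatbot's free-text answer so it can be scored later by graders.
print(response.choices[0].message.content)

In the study itself, the graders were faculty members scoring the free-text answers against the same rubric used for students; the snippet above only shows how a clarifying preamble and a case-plus-question prompt might be sent to the model and the response captured.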
According to Strong, the GPT-3.5 results from the earlier study were “borderline passing.” In the recent study using GPT-4, however, the chatbot outperformed the students, scoring 4.2 points higher on average and earning a passing grade 93 percent of the time, compared with the students’ 85 percent.
As well as ChatGPT performed, it was not flawless. A particularly worrying issue was confabulation, the insertion of false details, such as stating that a patient had a fever when the case said no such thing, although this problem diminished dramatically with GPT-4 compared with GPT-3.5. These confabulatory “false memories” may be the result of conflation, in which ChatGPT pulls in details from similar cases.
Education in Medicine Reconsidered
The impact of ChatGPT on curriculum design and exam integrity is already apparent at Stanford School of Medicine. Exams were previously open book, which allowed access to ChatGPT; this past semester, administrators switched them to closed book, so students must now answer questions from memory. While this approach has advantages, the main drawback, according to DiGiammarino, is that the exams no longer assess students’ ability to gather information from sources, a critical skill in clinical care.
Acutely aware of this problem, faculty and staff at the School of Medicine have begun meeting as an AI working group. The group is considering curriculum updates that incorporate AI tools to enhance student learning and better prepare future physicians.
“We don’t want doctors who were so dependent on AI in school that they didn’t learn to reason through issues on their own,” DiGiammarino said. “But I’m more afraid of a world where doctors aren’t taught how to use AI effectively and then find it commonplace in modern practice.”
Strong continues, “We might be decades away from anything like the wholesale replacement of doctors.” But, he adds, it won’t be long until AI is a required part of routine medical practice.

