A new study by researchers at Stanford University points to the largely untapped potential of large language models (LLMs), a form of artificial intelligence, to improve the accuracy of medical diagnoses and clinical reasoning.
In the study, the researchers presented a series of complex medical cases based on real patients to ChatGPT, running on the GPT-4 model, and compared the outcomes with those from 50 practicing physicians. Half of these physicians relied on traditional diagnostic resources, such as medical reference manuals and internet searches, while the other half also used ChatGPT as a supplemental diagnostic tool during their assessments.
Counterintuitively, giving physicians access to ChatGPT did not meaningfully improve their diagnostic accuracy, which suggests substantial room for physicians to get better at harnessing AI tools like ChatGPT in their diagnostic work. The researchers contend that, with appropriate training and effective integration into clinical workflows, large language models could ultimately play a transformative role in patient care.
“Our study demonstrates that ChatGPT has the potential to serve as a powerful asset in medical diagnostics; hence, it was unexpected to find that its availability to physicians did not dramatically elevate their clinical reasoning outcomes,” stated Ethan Goh, a postdoctoral scholar at Stanford’s School of Medicine and co-lead author of the study. “These findings suggest that there remains significant room for improvement in the collaboration between physicians and AI technologies within clinical environments and the wider health care ecosystem.”
“It’s plausible that once a healthcare professional feels confident about a diagnosis, they may not take the time to elaborate on the reasoning behind their thought process,” elaborated Jonathan H. Chen, an assistant professor at Stanford’s School of Medicine and the paper’s senior author. “Moreover, it’s a common challenge that human experts often struggle to articulate precisely why their correct decisions were made.”
The study was recently published in JAMA Network Open and has been accepted for presentation at the American Medical Informatics Association (AMIA) 2024 symposium in November.
Delivering Diagnoses
Since San Francisco-based OpenAI launched ChatGPT in November 2022, large language models have surged in prominence across industries, including healthcare. LLMs are computational models trained on vast datasets of natural human language, drawn from sources such as websites and books. This training enables them to respond to natural-language queries with coherent, articulate answers.
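For readers curious about what querying such a model looks like in practice, here is a minimal sketch using OpenAI's Python client; the model name and the clinical question are illustrative stand-ins, not the study's actual materials.

```python
# Minimal sketch: posing a natural-language question to an LLM via
# OpenAI's Python client (openai >= 1.0). The model name and prompt
# are illustrative only; they are not the study's materials.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4",  # hypothetical choice; any chat-capable model works
    messages=[
        {"role": "system",
         "content": "You are a careful clinical-reasoning assistant."},
        {"role": "user",
         "content": "A 54-year-old reports fatigue and joint pain. "
                    "Which diagnoses should be considered, and why?"},
    ],
)

print(response.choices[0].message.content)  # the model's free-text answer
```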
Already, LLMs have made remarkable advances in various sectors, particularly in finance and content creation, with healthcare anticipated to be among the next major fields for adoption. As explained by Goh, one of the most promising applications of this technology is in reducing the prevalence of diagnostic errors, which continue to pose significant risks and challenges within modern medicine. Previous studies have shown that LLMs can adeptly handle both multiple-choice and open-ended questions in medical reasoning examinations; however, their application beyond academic settings into real-world clinical scenarios has not been extensively explored.
With this multisite study, Goh and his team aimed to bridge that knowledge gap. The researchers enlisted 50 physicians from Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia. The participating physicians predominantly specialized in internal medicine, with emergency medicine and family medicine also represented.
During the study, the physicians had an hour to review up to six complex clinical vignettes, akin to those found in diagnostic reasoning assessments and based on authentic patient histories, physical evaluations, and laboratory results. For each scenario, the physicians were asked to propose plausible diagnoses, along with any additional evaluative steps they deemed necessary for patient care.
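To make the task format concrete, here is a hypothetical sketch of how such a vignette and its questions might be assembled into a prompt; the vignette text and question wording below are invented for illustration, as the study's actual materials are not reproduced in this article.

```python
# Hypothetical illustration of a vignette-style diagnostic task.
# The vignette and questions are invented for demonstration; they are
# not the study's actual materials.
VIGNETTE = """\
History: 62-year-old with three weeks of progressive dyspnea and night sweats.
Exam: low-grade fever; decreased breath sounds at the right lung base.
Labs: mild anemia; elevated inflammatory markers.
"""

PROMPT = (
    "Read the clinical vignette below, then:\n"
    "1. List the most plausible diagnoses, ranked by likelihood.\n"
    "2. For each diagnosis, note findings that support or oppose it.\n"
    "3. Recommend the next evaluative steps for this patient.\n\n"
    + VIGNETTE
)

print(PROMPT)  # this string could be pasted into, or sent to, a chat model
```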
As in real-world healthcare settings, the participants drew upon their extensive medical knowledge and experience while also utilizing reference materials provided to them. Among those randomly assigned to use ChatGPT in their assessments, approximately one-third reported having used the tool frequently or occasionally prior to the study. However, many physicians in the ChatGPT-access group did not accept or incorporate the model’s diagnostic predictions into their assessments.
Although access to ChatGPT did not improve diagnostic accuracy, physicians who used the tool completed their assessments more than a minute faster, on average, than their colleagues without access. This finding, which itself warrants further investigation, suggests that tools like ChatGPT can streamline diagnostic workflows, particularly in high-pressure, time-sensitive clinical settings.
“ChatGPT can contribute to enhancing the efficiency of doctors’ workflows,” asserts Goh. “The time savings alone could justify the integration of large language models within clinical practice, positively impacting physician burnout rates over the long term.”
Enhancing Human-AI Teamwork
The study’s results also point to ways physician-AI collaboration in clinical settings could be strengthened. Goh emphasizes that trust between physicians and AI systems is foundational: practitioners need to regard AI-generated insights as valuable and potentially accurate. Cultivating that trust may require giving physicians a better understanding of the data on which an AI model was trained, so developing healthcare-specific LLMs, rather than relying solely on general-purpose models like ChatGPT, may foster greater confidence. Additionally, as with any new technology, physicians must gain familiarity and hands-on experience with LLMs, which could be supported through targeted professional development on best practices.
Fundamentally, patient safety must remain paramount in all applications of AI in healthcare, Goh cautions. Appropriate safeguards should ensure that AI-generated recommendations are reviewed by clinicians and not mistaken for definitive conclusions. Patients will continue to seek reassurance and guidance from trusted human professionals regarding their care. “AI is designed to assist healthcare providers, not replace them,” Goh emphasizes. “Only a medical professional can prescribe medications, perform surgical procedures, or execute any other interventions necessary for patient care.”
Even so, the integration of AI tools is poised to enhance healthcare, Goh believes.
“Patients are more concerned about effective treatment than the specifics of their diagnosis itself,” Goh states. “Human physicians will manage the treatment process, with the expectation that AI tools will support them in delivering optimal care.”
Building on this pioneering study, Stanford University, Beth Israel Deaconess Medical Center, the University of Virginia, and the University of Minnesota have co-founded a bi-coastal initiative named ARiSE (AI Research and Science Evaluation) aimed at further assessing the implications of GenAI outputs within healthcare. Additional information can be found at the ARiSE website.
Among the other contributors to this research are Jason Hom, Eric Strong, Yingjie Weng, and Neera Ahuja at the Stanford University School of Medicine; Eric Horvitz of Microsoft and the Stanford Institute for Human-Centered Artificial Intelligence (HAI); Arnold Milstein of the Stanford Clinical Excellence Research Center; and co-senior author Jonathan Chen of the Stanford Center for Biomedical Informatics Research and the Stanford Clinical Excellence Research Center.
The research team also includes Robert Gallo, co-lead author at the Center for Innovation to Implementation at the VA Palo Alto Health Care System; Hannah Kerman, Joséphine Cool, and Zahir Kanjee from Beth Israel Deaconess Medical Center and Harvard Medical School; Andrew S. Parsons at the University of Virginia School of Medicine; Daniel Yang at Kaiser Permanente; and co-senior authors Andrew P.J. Olson at the University of Minnesota Medical School and Adam Rodman at Beth Israel Deaconess Medical Center and Harvard Medical School.
**Interview with Ethan Goh, Co-Lead Author of the Stanford Study on AI in Medical Diagnostics**
**Interviewer**: Thank you for joining us today, Ethan. Your recent study on ChatGPT’s role in medical diagnostics has generated significant interest. Can you start by explaining the core findings of your research?
**Ethan Goh**: Absolutely! Our study explored the potential of large language models like ChatGPT to enhance medical diagnostics and clinical reasoning. We compared how 50 practicing physicians diagnosed complex medical cases: half used traditional resources, while the other half also had access to ChatGPT as a supplemental tool. Surprisingly, while the AI didn’t dramatically improve diagnostic accuracy, physicians using ChatGPT completed their assessments more quickly.
**Interviewer**: That’s intriguing! What do you think are the reasons behind the lack of significant improvement in diagnostic accuracy when using AI?
**Ethan Goh**: One possibility is that once physicians feel confident about a diagnosis, they might not take the time to leverage AI insights. Furthermore, there’s often a challenge in articulating their reasoning for the decisions they make, which could limit effective collaboration with AI tools. This suggests we need better training for physicians to optimize their use of AI in clinical settings.
**Interviewer**: Your findings highlight the importance of trust between physicians and AI systems. What strategies do you believe could help build this trust?
**Ethan Goh**: Building trust is essential. It could involve developing healthcare-specific AI models that practitioners can better understand and feel more confident about using. Additionally, increasing familiarity with these tools through targeted professional development can help physicians not only appreciate the AI’s capabilities but also see them as valuable partners in patient care. Engaging in hands-on training and real-world simulations can also enhance their comfort level with these technologies. Ultimately, open communication about how AI models generate recommendations is key to fostering trust and encouraging their utilization in clinical practice.
**Interviewer**: Given the risks involved with AI in medicine, what precautions do you suggest should be in place to ensure patient safety?
**Ethan Goh**: Patient safety must always come first. It’s crucial that AI-generated suggestions are thoroughly reviewed by healthcare professionals—they should not be mistaken for definitive answers. The goal of AI is to assist, not replace, human expertise. Only qualified medical practitioners can truly provide the necessary care and interventions.
**Interviewer**: What are your hopes for the future of AI in healthcare based on your findings?
**Ethan Goh**: I believe that, with appropriate training and integration, AI tools like ChatGPT can enhance the efficiency of clinical workflows and reduce diagnostic errors. Ultimately, our aim is to improve patient care, and I am optimistic about the transformative role AI can play in achieving that goal while ensuring that the human element of medicine remains at the forefront.
**Interviewer**: Given the rapid advancements in AI, what future implications do you foresee for its integration into healthcare, particularly in terms of patient care?
**Ethan Goh**: I believe that AI will significantly enhance patient care by streamlining workflows and reducing diagnostic errors. While AI like ChatGPT is here to assist, we must always prioritize patient safety and ensure that human physicians maintain the final say on treatment decisions. As these tools develop and physicians become more adept at using them, we could see a real improvement in overall efficiency and potentially even in patient outcomes.
**Interviewer**: It sounds promising! Lastly, how do you envision the next steps for your research and the larger initiative, ARiSE, aimed at advancing AI integration in healthcare?
**Ethan Goh**: Our next steps include further research to explore how AI can be effectively utilized in clinical practice and to assess the implications of AI-generated insights. The ARiSE initiative will help us investigate specific applications and develop tailored AI models for healthcare, facilitating a deeper understanding of their potential benefits. We also aim to work closely with healthcare professionals to ensure that the integration of AI tools leads to meaningful improvements in patient care, all while addressing any concerns regarding safety and efficacy.
**Interviewer**: Thank you for your insights, Ethan. It’s fascinating to see how AI can evolve and potentially revolutionize the medical field.
**Ethan Goh**: Thank you for having me! I’m excited to see how these advancements unfold in the coming years.