July 25, 2023

AI gets better at writing patient histories when physicians engineer the prompts

Researchers analyze ChatGPT for summarizing medical records.

The ChatGPT artificial intelligence (AI) program gets better at summarizing patient notes when physicians refine the prompts used to get it started.

Researchers at Stanford and Duke universities compared histories of present illness (HPIs) written by the popular AI program to those written by senior internal medicine residents. The results improved over three rounds of prompt refinement, with grades for ChatGPT and the residents differing by less than a point on a 15-point composite scale.

“These findings underscore the potential of chatbots to aid clinicians with medical documentation,” said the research letter published in JAMA Internal Medicine.

In January and February, the research team used ChatGPT to generate HPIs based on three patient interview scripts portraying different types of chest pain.

ChatGPT generated 10 HPIs per script; those were evaluated for errors and were considered acceptable if they were free of errors. Then the researchers modified the prompt and repeated the process twice, with the acceptance rate growing from 10% to 43% over the three rounds.

From the final round, they picked one HPI per script to compare with four written by resident physicians. A total of 30 internal medicine attending physicians blindly evaluated the HPIs for level of detail, succinctness, and organization.

When guessing if the author was human or AI, the reviewing physicians were correct 61% of the time.

For ChatGPT, the most common error was adding patient ages and genders, which none of the scripts specified. The program also added information not present, an error called a hallucination.

ChatGPT’s “performance was heavily dependent on prompt quality,” the study said. “Without robust prompt engineering, the chatbot frequently reported information in the HPIs that was not present in the source dialogue.”

The researchers noted hallucinations occurred in prior tests of AI models.

“The generation of hallucinations in the medical record is clearly of great concern,” the study said. ChatGPT is built on a large language model (LLM), a type of generative AI.

“Before LLMs can be safely used in the clinical environment, close collaboration between clinicians and AI developers is needed to ensure that prompts are effectively engineered to optimize output accuracy,” the study said.

The researchers noted the study was limited by the version of ChatGPT available. A commentary said that version “used a massive data set comprised of published books, journals, and other Internet-based sources that were available through September 2021.”

As of March, “the program can access contemporaneous information from around the web,” said the editor’s note by Eric Ward, MD, and Cary Gross, MD.

ChatGPT was released in November 2022, and medical researchers began evaluating it almost immediately, they said, adding that research will continue.

“A new era is unfolding,” Ward and Gross said.

The research letter “Comparison of History of Present Illness Summaries Generated by a Chatbot and Senior Internal Medicine Residents” and editor’s note “Evolving Methods to Assess Chatbot Performance in Health Sciences Research” were published in JAMA Internal Medicine.

This article was initially published by our sister publication, Medical Economics®.
