Study evaluates large language model for emergency medicine handoff notes, finding high usefulness and safety comparable to physicians
In a recent study published in JAMA Network Open, researchers developed large language model (LLM)-generated emergency medicine (EM) handoff notes and evaluated their accuracy, safety, and utility in reducing physician documentation burden without compromising patient safety.
The crucial role of handoffs in healthcare
Handoffs are critical communication points in healthcare and a known source of medical errors. As a result, numerous organizations, such as The Joint Commission and the Accreditation Council for Graduate Medical Education (ACGME), have advocated for standardized processes to improve safety.
EM-to-inpatient (IP) handoffs are associated with unique challenges, including medical complexity, time constraints, and diagnostic uncertainty; however, they remain poorly standardized and inconsistently implemented. Electronic health record (EHR)-based tools have attempted to overcome these limitations; however, they remain underexplored in emergency settings.
LLMs have emerged as potential solutions to streamline clinical documentation. Nevertheless, concerns about factual inconsistencies necessitate further research to ensure safety and reliability in critical workflows.
About the study
The present study was conducted at an 840-bed urban academic quaternary-care hospital in New York City. EHR data from 1,600 EM patient encounters that led to acute hospital admissions between April and September 2023 were analyzed. Only encounters after April 2023 were included due to the implementation of an updated EM-to-IP handoff system.
Retrospective data were used under a waiver of informed consent to ensure minimal risk to patients. Handoff notes were generated using a combination of a fine-tuned LLM and rule-based heuristics while adhering to standardized reporting guidelines.
The handoff note template closely mirrored the existing manual structure, integrating rule-based elements such as laboratory tests and vital signs with LLM-generated components such as the history of present illness and differential diagnoses. Informatics experts and EM physicians curated the fine-tuning data to enhance note quality while excluding race-based attributes to avoid bias.
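The paper does not reproduce its template code; the sketch below is only a minimal Python illustration of this kind of hybrid assembly, in which the `encounter` field names and the `summarize` helper are hypothetical, not the study's actual interface.

```python
# Minimal sketch of a hybrid handoff note: deterministic EHR fields are
# copied rule-based, while narrative sections come from an LLM summarizer.
# All field names and the summarize() callable are hypothetical.
def build_handoff_note(encounter: dict, summarize) -> str:
    """Combine rule-based EHR data with LLM-generated narrative sections."""
    # Rule-based sections: copied directly from structured EHR data.
    vitals = ", ".join(f"{k}: {v}" for k, v in encounter["vitals"].items())
    labs = "\n".join(f"- {name}: {value}" for name, value in encounter["labs"])

    # LLM-generated sections: abstractive summaries of free-text ED notes.
    hpi = summarize(encounter["ed_notes"], section="history of present illness")
    ddx = summarize(encounter["ed_notes"], section="differential diagnoses")

    return (
        f"HISTORY OF PRESENT ILLNESS\n{hpi}\n\n"
        f"DIFFERENTIAL DIAGNOSES\n{ddx}\n\n"
        f"VITAL SIGNS\n{vitals}\n\n"
        f"LABORATORY RESULTS\n{labs}\n"
    )
```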
Two LLMs, Robustly Optimized Bidirectional Encoder Representations from Transformers Approach (RoBERTa) and Large Language Model Meta AI (Llama-2), were employed for salient content selection and abstractive summarization, respectively. Data processing involved heuristic prioritization and saliency modeling to mitigate the models’ potential limitations.
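A rough Python sketch of such a two-stage select-then-summarize pipeline is shown below, using generic Hugging Face checkpoints as placeholders; the study fine-tuned its own models, which are not assumed to be public, and the threshold, prompt wording, and model names here are illustrative assumptions.

```python
# Illustrative two-stage pipeline; checkpoint names are placeholders, not
# the study's fine-tuned models. Requires: pip install transformers torch
from transformers import pipeline

# Stage 1: extractive content selection. A RoBERTa sequence classifier
# scores each source sentence for salience; high-scoring sentences survive.
# (roberta-base has an untrained classification head; a fine-tuned
# salience classifier would be substituted in practice.)
selector = pipeline("text-classification", model="roberta-base")

def select_salient(sentences, threshold=0.5):
    scores = selector(sentences)
    return [s for s, r in zip(sentences, scores) if r["score"] >= threshold]

# Stage 2: abstractive summarization. A Llama-2-style causal LM rewrites
# the selected content into a narrative handoff summary.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def summarize(selected_sentences):
    prompt = (
        "Summarize the following emergency department documentation as a "
        "handoff note:\n" + "\n".join(selected_sentences) + "\nSummary:"
    )
    out = generator(prompt, max_new_tokens=256, do_sample=False)
    # The generation pipeline returns the prompt plus the continuation.
    return out[0]["generated_text"][len(prompt):].strip()
```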
The researchers evaluated automated metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bidirectional Encoder Representations from Transformers Score (BERTScore), alongside a novel patient safety-focused framework. A clinical review of 50 handoff notes assessed their completeness, readability, and safety.
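For readers who want to compute these automated metrics on their own summary pairs, the snippet below uses the open-source rouge-score and bert-score packages; the study's exact metric configuration is not described here, so the stemming and language settings are assumptions.

```python
# Computing ROUGE-2 and BERTScore with common open-source packages.
# Requires: pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Physician-written handoff summary text."  # placeholder text
candidate = "LLM-generated handoff summary text."      # placeholder text

# ROUGE-2: bigram overlap between the candidate and the reference.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
rouge2 = scorer.score(reference, candidate)["rouge2"].fmeasure

# BERTScore: token-level semantic similarity via contextual embeddings;
# returns precision, recall, and F1 tensors for the batch.
precision, recall, f1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-2 F1: {rouge2:.3f}, BERT precision: {precision.item():.3f}")
```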
Study findings
Among the 1,600 patient cases included in the analysis, the mean age was 59.8 years with a standard deviation of 18.9 years, and 52% of the patients were female. Automated evaluation metrics revealed that summaries generated by the LLM outperformed those written by physicians in several aspects.
ROUGE-2 scores were significantly higher for LLM-generated summaries as compared to physician summaries at 0.322 and 0.088, respectively. Similarly, BERT precision scores were higher at 0.859 as compared to 0.796 for physician summaries. Likewise, the source chunking approach for large-scale inconsistency evaluation (SCALE) scored LLM-generated summaries at 0.691 versus 0.456 for physician summaries. These results indicate that LLM-generated summaries demonstrated greater lexical similarity, higher fidelity to source notes, and more detailed content than their human-authored counterparts.
In clinical evaluations, the quality of LLM-generated summaries was comparable to physician-written summaries but slightly inferior across several dimensions. On a Likert scale of one to five, LLM-generated summaries scored lower in terms of usefulness, completeness, curation, readability, correctness, and patient safety. Despite these differences, the automated summaries were generally considered acceptable for clinical use, and none of the identified issues was judged to pose a life-threatening risk to patient safety.
In evaluating worst-case scenarios, clinicians identified potential level-two safety risks in the LLM-generated summaries, with incompleteness in 8.7% and faulty logic in 7.3% of notes; physician-written summaries showed neither risk. Hallucinations were rare in the LLM-generated summaries: the five identified cases all received safety scores between four and five, suggesting mild to negligible safety risks. Overall, LLM-generated notes had a higher rate of incorrectness at 9.6% as compared to 2% for physician-written notes, although these inaccuracies rarely carried significant safety implications.
Interrater reliability, calculated using intraclass correlation coefficients (ICCs), showed good agreement among the three expert raters for completeness, curation, correctness, and usefulness at 0.79, 0.70, 0.76, and 0.74, respectively. Readability achieved only fair reliability, with an ICC of 0.59.
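As an illustration of how such ICCs can be computed, the sketch below uses the pingouin package on a hypothetical long-format table of ratings; the study's actual rating data are not public, so the numbers shown are invented.

```python
# Sketch of an interrater reliability check with pingouin's ICC.
# Requires: pip install pandas pingouin
import pandas as pd
import pingouin as pg

# Hypothetical ratings: three expert raters scoring the same notes on a
# 1-5 Likert scale for one dimension (e.g., completeness).
ratings = pd.DataFrame({
    "note":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rater": ["A", "B", "C"] * 3,
    "score": [4, 5, 4, 3, 3, 4, 5, 5, 5],
})

# intraclass_corr returns all ICC variants (ICC1, ICC2, ICC3, and their
# average-measure forms); the appropriate row depends on the rating design.
icc = pg.intraclass_corr(
    data=ratings, targets="note", raters="rater", ratings="score"
)
print(icc[["Type", "ICC"]])
```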
Conclusions
The current study successfully generated EM-to-IP handoff notes using a fine-tuned LLM combined with a rule-based approach within a user-designed template.
On traditional automated metrics, the LLM outperformed physicians. However, manual clinical evaluation revealed that, although most LLM-generated notes achieved promising quality scores between four and five, they were generally rated inferior to physician-written notes. Identified errors, including incompleteness and faulty logic, occasionally posed moderate safety risks, with under 10% of notes containing errors that could cause significant issues, a rate not observed in physician notes.
Journal reference:
- Hartman, V., Zhang, X., Poddar, R., et al. (2024). Developing and Evaluating Large Language Model–Generated Emergency Medicine Handoff Notes. JAMA Network Open. doi:10.1001/jamanetworkopen.2024.48723