
Study: Stanford’s VeriFact uses AI to verify LLM-generated clinical records

VeriFact analyzes statements in AI-generated clinical text against patients’ EHRs to identify factual errors, achieving 93.2% agreement with clinicians.
By Jessica Hagen, Executive Editor

Photo: Hispanolistic/Getty Images

Researchers at Stanford University have developed a platform called VeriFact that pulls clinical data from a patient's EHR and uses a large language model to determine whether AI-generated documentation about that patient is accurate.

According to a study published in NEJM AI, researchers sought to test the accuracy of text generated by LLMs in the clinical setting compared with a patient's real medical record.

Researchers created VeriFact, a system that pulls relevant data from the EHR and analyzes it, using an "LLM-as-a-judge" approach to evaluate whether the generated statements are factually supported by the EHR data.

"VeriFact is an AI system that checks the veracity of statements within an LLM-generated document. New clinical documents should be internally consistent with information already known about a patient, such as historical notes in the patient’s EHR. VeriFact performs patient-specific fact verification by comparing statements in an LLM-generated document against a patient’s EHR facts, localizes errors, and describes their underlying causes," the study's authors wrote.
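At a high level, the retrieve-then-judge approach the authors describe can be sketched as follows. This is an illustrative outline only, not Stanford's implementation: the retrieval heuristic, the `toy_judge` stand-in and all names here are assumptions, and in practice the judge step would be an LLM call with a carefully designed prompt.

```python
# Illustrative sketch of "LLM-as-a-judge" fact verification against EHR notes.
# retrieve_evidence and toy_judge are simplified stand-ins; VeriFact's actual
# retrieval pipeline and judge prompts are not described in this article.

def retrieve_evidence(statement, ehr_notes, top_k=3):
    """Rank EHR notes by naive keyword overlap with the statement (illustrative)."""
    words = set(statement.lower().split())
    ranked = sorted(ehr_notes, key=lambda n: -len(words & set(n.lower().split())))
    return ranked[:top_k]

def verify_document(statements, ehr_notes, judge_fn):
    """Label each statement as supported or not by retrieved EHR evidence."""
    results = []
    for s in statements:
        evidence = retrieve_evidence(s, ehr_notes)
        verdict = judge_fn(s, evidence)  # would be an LLM judgment in practice
        results.append({"statement": s, "verdict": verdict, "evidence": evidence})
    return results

def toy_judge(statement, evidence):
    """Toy stand-in for the LLM judge: supported if any note shares >= 2 words."""
    words = set(statement.lower().split())
    if any(len(words & set(e.lower().split())) >= 2 for e in evidence):
        return "Supported"
    return "Not Supported"

notes = ["Patient started on metformin for type 2 diabetes.",
         "No history of cardiac disease documented."]
claims = ["Patient takes metformin for diabetes.",
          "Patient underwent coronary bypass surgery."]
for r in verify_document(claims, notes, toy_judge):
    print(r["statement"], "->", r["verdict"])
```

The key design point the study highlights is that verification is patient-specific: each statement is checked against that patient's own record, not against general medical knowledge.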

The researchers also introduced a clinician-annotated benchmark dataset, VeriFact-Brief Hospital Course (VeriFact-BHC), that decomposes hospital discharge narratives into individual claims and labels whether each claim is supported by the actual EHR.

"VeriFact-BHC contains 100 patients with 13,070 statements derived from brief hospital courses, each annotated by three or more clinicians," the authors wrote.

"VeriFact achieved 93.2% agreement with clinicians. The highest interrater agreement among clinicians was 88.5%, indicating that VeriFact can produce more consistent fact verification than humans."
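The agreement figures quoted above compare VeriFact's per-statement labels against clinician annotations. The article does not specify the exact metric, but a simple percentage agreement over matched label sequences, computed as below, illustrates the idea (the label values and data here are made up):

```python
# Simple percentage agreement between two annotators' per-statement labels.
# The actual metric used in the study may differ (e.g., chance-corrected measures).

def percent_agreement(labels_a, labels_b):
    """Return the percentage of positions where the two label lists match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must be the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

verifact_labels = ["Supported", "Supported", "Not Supported", "Supported"]
clinician_labels = ["Supported", "Not Supported", "Not Supported", "Supported"]
print(percent_agreement(verifact_labels, clinician_labels))  # 75.0
```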

The researchers said VeriFact evaluates statements against each patient's individual EHR, enabling reference-based fact verification in medicine.

"VeriFact can help clinicians verify facts in documents drafted by LLMs prior to committing them to the patient’s EHR and can automate tasks requiring chart review. VeriFact-BHC can be used to develop and benchmark new methodologies for verifying facts in patient care documents," the study's authors wrote.

THE LARGER TREND

Researchers noted limitations in the study, including that it did not explore additional retrieval or reranking models, nor did it evaluate medicine-specific LLMs or perform domain-specific fine-tuning.

Only a fixed set of prompts was used, leaving open the possibility that optimized prompts could improve performance.

Additionally, VeriFact relies on the EHR as the source of truth, which may be incomplete for new patients or contain errors due to misdiagnosis, miscommunication or outdated information. VeriFact also only evaluates statements present in LLM-generated text and cannot identify errors of omission.

The study's authors also noted that the VeriFact-BHC dataset has constraints in that it includes summaries from a single LLM and documentation pipeline, focuses exclusively on discharge summaries from the MIMIC-III dataset and may not generalize to other hospitals or patient populations.

Finally, VeriFact's accuracy decreased significantly when applied to human-written records as opposed to LLM-generated clinical documents.
