Why Human Scribes Still Outperform AI Documentation

Recent developments in AI ambient dictation promise to revolutionize clinical workflows by reducing administrative burdens. However, a new study from the Veterans Health Administration (VHA) suggests that the technology cannot yet replace human expertise. Researchers compared human-generated notes with documentation from 11 different AI scribe tools. Despite the speed of automated systems, humans consistently produced higher-quality records across several primary care scenarios. This finding highlights significant gaps in the current capabilities of artificial intelligence within healthcare settings.

Limitations of AI Ambient Dictation Quality

The study used the Physician Documentation Quality Instrument (PDQI-9) to assess performance across 10 specific domains. AI-generated notes scored lower than human-produced records in every category. The largest deficit appeared in the acute low back pain case, where human notes scored 43.8 and AI notes scored 20.3. The analysis also found that AI struggled most with being thorough and organized. This quality gap remains a primary concern for clinicians considering these automated tools.
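To make the size of that gap concrete, the short sketch below works through the arithmetic for the low back pain case. It uses only the two scores quoted above; the function name and the choice to express the AI score as a fraction of the human score are illustrative assumptions, not part of the study's method.

```python
# Scores quoted in the article for the acute low back pain case.
human_score = 43.8
ai_score = 20.3


def score_gap(human: float, ai: float) -> tuple[float, float]:
    """Return (absolute gap in points, AI score as a fraction of the human score).

    Hypothetical helper for illustration; rounding keeps the output readable.
    """
    return round(human - ai, 1), round(ai / human, 3)


gap, ratio = score_gap(human_score, ai_score)
print(f"Gap: {gap} points; AI reached {ratio:.1%} of the human score")
```

In this case the AI notes scored less than half of what the human notes scored.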

Critical Domains Where AI Scribes Fall Short

The research identified three major areas where AI failed to match human performance: thoroughness, organization, and usefulness for clinical decision-making. Although AI tools can capture dialogue, they often fail to synthesize information effectively, so the resulting notes might lack the clinical nuance required for complex cases. Human note-takers also better understood the context of patient interactions. As a result, human-generated documentation remains the gold standard for clinical record-keeping at this time. Physicians should remain cautious before deploying these tools at scale.

Implications for the Future of Documentation

While AI ambient dictation holds immense promise, independent evaluations are essential. The VHA study emphasizes that current tools require significant improvement before they can reliably replace humans. Consequently, healthcare organizations must prioritize quality over convenience during the early stages of adoption. Future research must also evaluate these tools under real-world constraints rather than simulated cases. This approach will help ensure that patient care and documentation standards remain high as technology evolves. Ultimately, human oversight is still necessary to maintain the integrity of medical records.

Frequently Asked Questions

Q1: Why did human-generated notes score higher than AI notes?

Human note-takers were more thorough and organized the information better. Their notes also provided more useful clinical context than those from the AI tools tested in the study.

Q2: In which specific clinical area did AI perform the worst?

The largest quality difference occurred in the acute low back pain case. AI-generated notes in this scenario were significantly less comprehensive and organized than those produced by humans.

Q3: Should hospitals wait to implement AI scribes?

The researchers recommend independent, vendor-neutral evaluations before large-scale deployment. While promising, the technology currently lacks the quality and clinical utility of human-produced documentation.

References

Reddy A et al. Rapid Evaluation of Artificial Intelligence Technology Used for Ambient Dictation in Primary Care: Comparing the Quality of Documentation of Artificial Intelligence-Generated and Human-Produced Clinical Notes. Ann Intern Med. 2026 Apr 17. doi: 10.7326/ANNALS-25-02772. PMID: 41996184.
Holmgren AJ et al. Ambient Artificial Intelligence Scribes and Physician Productivity. JAMA Netw Open. 2026;9(1):e255432.
Tierney AA et al. Ambient Artificial Intelligence Scribes for Clinical Documentation. NEJM AI. 2025;2(3):1-12.