Evaluating Guideline Adherence in Gemini-Powered Dental Trauma Workflows: Standalone Gemini Chat vs. Document-Grounded NotebookLM

Aim: The aim of this study was to compare the accuracy and inter-account consistency of two Google Gemini-powered, user-facing workflows for dental trauma decision support when answering dichotomous (yes/no) clinical questions on the management of traumatized permanent teeth. The two workflows were standalone Gemini chat and NotebookLM, a document-grounded workflow that bases its responses on uploaded European Society of Endodontology and International Association of Dental Traumatology guideline documents.

Methodology: A cross-sectional simulation was conducted using 99 dichotomous (yes/no) questions derived from the European Society of Endodontology and International Association of Dental Traumatology guidelines. Three academic endodontists submitted each question to Gemini and NotebookLM using three independent Google accounts, generating 297 responses per workflow. Accuracy was defined as exact agreement with guideline-based answers, and consistency as the proportion of questions that received identical responses across the three trials. Statistical analyses included Wald and Wilson 95% confidence intervals, Fleiss' kappa for inter-account agreement, and Pearson's chi-squared tests to compare proportions.

Results: Gemini demonstrated an overall accuracy of 83.83% (95% CI: 75.08–90.47) and a consistency of 74.74% (κ = 0.84). NotebookLM showed higher accuracy (92.93%; 95% CI: 85.97–97.11) and perfect consistency (100%; κ = 1.00). While the difference in accuracy did not reach statistical significance (p = 0.076), NotebookLM exhibited significantly greater consistency (p < 0.001).

Conclusions: Both workflows produced responses that were highly consistent with the guidelines. Document grounding may enhance repeatability and alignment with guideline-derived decision points for structured dichotomous inquiries, as evidenced by NotebookLM's complete inter-account consistency and numerically higher accuracy. These results come from workflow-level benchmarking, so clinical utility cannot be inferred from them alone; professional oversight and additional validation remain necessary before any clinical application.
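The statistical comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The counts used below (83 and 92 correct answers out of 99 questions) are assumptions inferred from the reported percentages, the function names are hypothetical, and the use of a Yates continuity correction is likewise an assumption, since the abstract does not say whether one was applied.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def chi2_yates_2x2(a, b, c, d):
    """Pearson chi-squared test with Yates correction for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With one degree of freedom, the
    chi-squared survival function equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumed counts, inferred from the reported accuracies (not stated in the abstract).
N_QUESTIONS = 99
GEMINI_CORRECT = 83
NOTEBOOKLM_CORRECT = 92

gemini_ci = wilson_interval(GEMINI_CORRECT, N_QUESTIONS)
notebooklm_ci = wilson_interval(NOTEBOOKLM_CORRECT, N_QUESTIONS)
chi2, p_value = chi2_yates_2x2(
    GEMINI_CORRECT, N_QUESTIONS - GEMINI_CORRECT,
    NOTEBOOKLM_CORRECT, N_QUESTIONS - NOTEBOOKLM_CORRECT,
)

print(f"Gemini Wilson 95% CI: {gemini_ci[0]:.4f} to {gemini_ci[1]:.4f}")
print(f"NotebookLM Wilson 95% CI: {notebooklm_ci[0]:.4f} to {notebooklm_ci[1]:.4f}")
print(f"Chi-squared (Yates) = {chi2:.3f}, p = {p_value:.3f}")
```

Under these assumed counts, the continuity-corrected test gives a p-value of approximately 0.076, in line with the reported accuracy comparison. The Wilson intervals from this sketch need not match the published intervals exactly, because the abstract does not state which interval method produced the reported figures.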

Document Type

Article

Document version

Published version

Language

English

Subject (CDU)

6 - Applied Sciences. Medicine. Technology

Keywords

Decision-making

Dental trauma

Google Gemini

Large language models

NotebookLM

Retrieval-augmented generation

Publisher

Wiley

Collection

Is part of

Dental Traumatology

Note

The author, N. Dufey-Portilla, thanks the National Agency for Research and Development (ANID) for its support through the DOCTORADO BECAS CHILE/2025 - 72250040 Scholarship Program.

This item appears in the following Collection(s)

Odontologia [353]

Rights

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made. © 2026 The Author(s). Dental Traumatology published by John Wiley & Sons Ltd.

Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/