Radiology & Imaging Journal
Open Access
Reliability of ChatGPT in the Evaluation of Voiding Cystourethrograms: Comparison with Experts in Cases of Bulbar Stricture and Normal Studies
Authors: Pereira Magnum Adriel Santos, Saab João Jorge, Paranhos Marina Grzybowski, Loures Raul Diegues, Daniel Charret, Auricchio Lorella Miranda, França Wagner Aparecido, Rios Luis Augusto Seabra.
Abstract
Background: The use of large language models (LLMs) for medical image interpretation has expanded rapidly, yet clinical validation remains limited. We evaluated ChatGPT’s performance in interpreting voiding cystourethrograms (VCUGs) for bulbar urethral stricture.
Objective: To assess the diagnostic accuracy and treatment recommendations generated by ChatGPT when interpreting VCUG images, compared with reconstructive urology experts and with the procedure actually performed.
Methods: We conducted a retrospective cross-sectional study at a tertiary public hospital. A total of 51 VCUGs were analyzed: 41 confirmed bulbar strictures and 10 normal studies. De-identified, representative static frames from retrograde and voiding phases were presented to ChatGPT (version 4.0 – 1.2025.105) in independent chats using a standardized English prompt. Two reconstructive urologists (GURS members; >50 urethral surgeries/year) independently reviewed all cases. Performance metrics included sensitivity, specificity, accuracy, predictive values, and Cohen’s kappa for agreement.
Results: ChatGPT correctly identified 40 of 41 bulbar strictures (sensitivity 97.56%) but labeled all 10 normal VCUGs as strictures (specificity 0%). Overall accuracy was 78.43% (40/51), positive predictive value 80% (40/50), negative predictive value 0% (0/1), and Cohen’s kappa 0.51 (moderate agreement). ChatGPT systematically overcalled strictures, which limits its usefulness for triage in settings where normal studies are prevalent. When the anatomic location was correctly identified, the suggested treatments were generally concordant with contemporary guideline-based management.
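The reported percentages can be checked against the counts given above. The following is a minimal sketch, assuming the 2x2 table implied by the Results (40 true positives, 1 false negative, 10 false positives, 0 true negatives); the variable and function names are illustrative and not taken from the study. Cohen’s kappa is not recomputed here because the abstract does not give the paired ratings it was calculated from.

```python
# Counts implied by the Results paragraph (reference standard: confirmed diagnosis).
tp = 40  # bulbar strictures that ChatGPT labeled as stricture
fn = 1   # bulbar stricture that ChatGPT missed (41 - 40)
fp = 10  # normal studies that ChatGPT labeled as stricture
tn = 0   # normal studies that ChatGPT labeled as normal


def pct(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to 2 decimals."""
    return round(100 * numerator / denominator, 2)


metrics = {
    "sensitivity": pct(tp, tp + fn),            # 40/41
    "specificity": pct(tn, tn + fp),            # 0/10
    "accuracy": pct(tp + tn, tp + fn + fp + tn),  # 40/51
    "ppv": pct(tp, tp + fp),                    # 40/50
    "npv": pct(tn, tn + fn),                    # 0/1
}

for name, value in metrics.items():
    print(f"{name}: {value}%")
```

Under these assumed counts, the script reproduces the sensitivity, specificity, accuracy, and predictive values stated in the Results.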
Conclusion: ChatGPT showed very high sensitivity but zero specificity for bulbar stricture detection on static VCUG frames, indicating substantial limitations for independent diagnostic use. The model may serve as a supervised aid where specialist access is scarce, while future multimodal models trained specifically on urologic imaging may achieve a better balance between sensitivity and specificity.