Diagnostic accuracy of DeepSeek-R1 and ChatGPT-4o in emergency patients: A comparative study
Abstract:
Objective: To compare the diagnostic performance of DeepSeek-R1 and ChatGPT-4o in emergency department inpatients and to explore their practical clinical value. Methods: A retrospective study was conducted using clinical data from emergency department inpatients discharged in December 2024. Discharge diagnoses served as the gold standard. Patient data (age, symptoms, examinations, and tests) were entered into DeepSeek-R1 and ChatGPT-4o with the prompt: “What is the most likely diagnosis?” Two physicians scored the outputs on a 0–3 scale to assess accuracy and consistency. Results: A total of 328 cases were analyzed. The mean scores for DeepSeek-R1 and ChatGPT-4o were 2.33±1.07 and 2.32±1.05, respectively, with no statistically significant difference (P=0.82). The Z statistic was -0.232, indicating highly similar performance between the two models. However, the rate of accurate diagnoses was only 66.5%. Diagnostic performance declined with increasing patient age. Conclusions: DeepSeek-R1 and ChatGPT-4o demonstrated comparable diagnostic performance in the emergency department setting, but the risk of misdiagnosis remained high. Both models can serve as auxiliary tools to broaden physicians' diagnostic considerations, but their output should be combined with clinical expertise to reach a comprehensive judgment.
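As a consistency check on the reported statistics, the sketch below converts the Z statistic into a two-sided P value. The abstract does not name the test used, so this assumes that Z is a standard normal test statistic (for example, from the normal approximation of a rank-based test); the function name `two_sided_p` is purely illustrative and not from the study.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal test statistic z.

    Assumption: z follows N(0, 1) under the null hypothesis.
    P = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


# Z statistic reported in the abstract.
z_reported = -0.232
p_value = two_sided_p(z_reported)
print(f"Z = {z_reported}, two-sided P = {p_value:.2f}")
```

Under that assumption, the computed value rounds to the reported P=0.82, so the two figures in the Results are mutually consistent.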
Jiang XY, Zhou Y, Gong ZY, Gu YN, Li N, Dou QL. Diagnostic accuracy of DeepSeek-R1 and ChatGPT-4o in emergency patients: A comparative study. J Acute Dis 2025; 14; 23.