
GPT-4 Passes the USMLE With an Above-Average Score — What the Research Really Means


In early 2023, a research team published a paper that sent a wave through medical education circles: ChatGPT, OpenAI’s large language model, had performed at or near the passing threshold on all three steps of the United States Medical Licensing Examination — without any specialized medical training beyond the model’s general pre-training on internet text. The paper, by Kung and colleagues in PLOS Digital Health, was the first rigorous peer-reviewed documentation of a general-purpose language model approaching a passing score on the USMLE, and its implications for clinical AI are still being debated.

Methodology: What Was Actually Tested

Kung et al. evaluated ChatGPT (the GPT-3.5-based model available in late 2022) rather than GPT-4 specifically. A follow-up analysis by Nori et al. from Microsoft Research (2023) applied GPT-4 to the same benchmark and found substantially higher performance. The USMLE questions used were drawn from the official USMLE practice materials and the AMBOSS question bank, covering Step 1 (basic sciences), Step 2 CK (clinical knowledge), and Step 3 (clinical management).

For GPT-3.5, Kung et al. reported performance at or near the passing threshold of approximately 60% across all three steps. The model demonstrated particular strength in questions requiring pattern recognition from clinical vignettes and relative weakness in questions requiring multi-step quantitative reasoning or interpretation of laboratory values in context.

GPT-4, evaluated separately, scored well above the passing threshold on all steps — with reported accuracy between 75% and 90% depending on the question set and evaluation methodology. For context, a first-time test-taker typically needs roughly 60–65% of questions correct to pass Step 1.
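For readers unfamiliar with how such benchmarks are run, the core evaluation loop is straightforward: present each multiple-choice item to the model, compare its chosen option against the answer key, and report overall accuracy against the passing threshold. The sketch below is a minimal illustration only; the `MCQ` class and `ask_model` placeholder are hypothetical names, not the harness either research team actually used.

```python
# Minimal sketch of multiple-choice benchmark scoring (illustrative only).
from dataclasses import dataclass

@dataclass
class MCQ:
    stem: str                 # clinical vignette / question text
    options: dict[str, str]   # e.g. {"A": "...", "B": "...", ...}
    answer: str               # official answer key, e.g. "C"

def ask_model(question: MCQ) -> str:
    """Placeholder for a model call returning a single option letter."""
    raise NotImplementedError("swap in a real model client here")

def evaluate(questions: list[MCQ], passing_threshold: float = 0.60) -> float:
    correct = sum(ask_model(q) == q.answer for q in questions)
    accuracy = correct / len(questions)
    verdict = "at/above" if accuracy >= passing_threshold else "below"
    print(f"accuracy {accuracy:.1%}, {verdict} the ~{passing_threshold:.0%} threshold")
    return accuracy
```

Real evaluations add complications this sketch omits, such as prompt formatting, extracting an answer letter from free-text output, and screening out image-based items (Kung et al. excluded questions containing visual assets).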

Where the Model Performed Well — and Where It Did Not

The qualitative analysis in Kung et al. revealed important patterns. GPT-3.5 handled factual recall questions — drug mechanisms, anatomical relationships, classic disease presentations — with high reliability. The model also showed reasonable performance on clinical reasoning questions when all relevant information was provided explicitly in the vignette.

However, the model struggled with questions that required:

  • Integrating information not present in the vignette (e.g., recognizing what a normal result means for a specific demographic)
  • Multi-step probabilistic reasoning under uncertainty
  • Questions where the “correct” answer requires prioritizing one clinical action over another based on urgency
  • Calculation-intensive questions involving drug dosing, acid-base balance, or statistical interpretation (a worked illustration of this category follows below)
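
To make that last category concrete, consider the shape of a typical weight-based dosing item: it chains a per-kilogram calculation with a division across dosing intervals, and an error at either step propagates to the final answer. The numbers below are made up purely for illustration and are not dosing guidance.

```python
# Illustrative multi-step dosing arithmetic (made-up numbers, NOT guidance).
weight_kg = 20            # hypothetical patient weight
dose_mg_per_kg_day = 15   # hypothetical drug dosed at 15 mg/kg/day
doses_per_day = 3         # given every 8 hours

total_daily_mg = weight_kg * dose_mg_per_kg_day   # 20 * 15 = 300 mg/day
per_dose_mg = total_daily_mg / doses_per_day      # 300 / 3 = 100 mg per dose

print(f"{total_daily_mg:.0f} mg/day -> {per_dose_mg:.0f} mg every 8 h")
```

Each step is trivial on its own; the weakness Kung et al. observed was in reliably chaining such steps inside a longer clinical vignette.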

This pattern matters significantly for clinical deployment. USMLE performance is not equivalent to clinical competence. The exam tests a defined corpus of medical knowledge under controlled conditions; clinical medicine requires integrating incomplete information, accounting for patient-specific context, and operating under the ethical and legal constraints of a real care environment.

What This Does NOT Mean

The medical AI discourse around USMLE performance has suffered from a common conflation: treating benchmark performance as a proxy for clinical readiness. This is incorrect for several reasons. First, USMLE questions are static text problems with single correct answers; real clinical decisions involve dynamic, multi-modal information, including physical examination findings, imaging, time pressure, and patient preferences. Second, the model’s knowledge has a training cutoff — it cannot incorporate real-time clinical guidelines, recent trial data, or institution-specific protocols. Third, language models can produce confident, fluent, and completely incorrect medical statements — a failure mode that is far more dangerous in clinical contexts than in exam settings.

Kung et al. were explicit about these constraints, noting that the study demonstrated potential as an educational and decision-support tool rather than evidence of clinical deployment readiness. This important nuance was frequently absent from popular media coverage of the findings.

Risks of Overreliance in Clinical Settings

The most pressing concern raised by these findings is not that AI will replace physicians — it is that clinicians, residents, or healthcare workers in resource-limited settings might begin using general-purpose language models as clinical decision-support tools without understanding their failure modes. A model that correctly answers 80% of USMLE questions will give wrong answers 20% of the time. In high-stakes clinical decisions, even a 5% error rate could translate to significant patient harm at population scale.
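
That arithmetic is worth spelling out, because per-decision error rates scale linearly with decision volume. A back-of-envelope sketch, where the annual decision volume is a purely hypothetical figure chosen for illustration:

```python
# Back-of-envelope scaling of per-decision error rates (hypothetical volume).
decisions_per_year = 1_000_000   # assumed volume of AI-assisted decisions

for error_rate in (0.20, 0.05):  # 80%-accurate model vs. a 5% error rate
    expected_errors = decisions_per_year * error_rate
    print(f"error rate {error_rate:.0%} -> ~{expected_errors:,.0f} wrong answers per year")
```

Even the optimistic 5% case yields tens of thousands of wrong answers at this volume, which is why awareness of failure modes matters more than headline accuracy.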

The FDA’s current framework for AI/ML-based Software as Medical Device (SaMD) would require any clinical decision-support tool using a language model to undergo formal validation, demonstrate safety and effectiveness, and operate within a defined intended use. General-purpose language models accessed through consumer APIs have not undergone this process and should not be used for clinical decisions.

Appropriate Use Cases

Where language models with USMLE-level performance do have legitimate utility is in medical education, documentation assistance, differential-diagnosis generation for physician review, and patient-facing health information at a general level. Deployed with appropriate guardrails — output flagged as non-authoritative, mandatory physician review, no access to patient records — these tools can reduce cognitive load without introducing unacceptable clinical risk.
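
What those guardrails can look like in practice is easiest to show in code. The sketch below is entirely hypothetical (the function names, the identifier check, and the review gate are all invented for illustration, not a real product or regulatory pattern); it enforces the three constraints just listed: a non-authoritative flag on every output, mandatory physician sign-off, and no patient-record data in prompts.

```python
# Hypothetical guardrail wrapper (illustrative sketch, not a real system).
from dataclasses import dataclass
from typing import Callable

DISCLAIMER = "DRAFT: not medical advice; requires physician review."

@dataclass
class Draft:
    text: str
    reviewed: bool = False

def generate_draft(prompt: str, model_call: Callable[[str], str]) -> Draft:
    # Guardrail 1: refuse prompts that appear to contain record identifiers.
    if any(tag in prompt.lower() for tag in ("mrn", "ssn", "date of birth")):
        raise ValueError("patient identifiers are not allowed in prompts")
    # Guardrail 2: every output carries a non-authoritative flag.
    return Draft(text=f"{DISCLAIMER}\n{model_call(prompt)}")

def release(draft: Draft, physician_approved: bool) -> str:
    # Guardrail 3: nothing is surfaced without explicit physician sign-off.
    if not physician_approved:
        raise PermissionError("physician review is mandatory before release")
    draft.reviewed = True
    return draft.text
```

A real deployment would need far more (audit logging, robust de-identification, versioned prompts), but the structure illustrates the principle: safety properties are enforced by the wrapper, not left to the model.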

Key Takeaway

GPT-4’s USMLE performance demonstrates substantial medical knowledge acquisition through general pre-training. It does not demonstrate clinical competence. The gap between answering exam questions correctly and making sound real-world clinical decisions is wide, and the risk of inappropriate reliance on these tools in clinical settings is a concrete patient safety concern that must be addressed through regulation, education, and clearly scoped deployment.

Sources

1. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models. PLOS Digital Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198

2. Nori H, King N, McKinney SM, et al. Capabilities of GPT-4 on Medical Challenge Problems. arXiv preprint arXiv:2303.13375. 2023.

3. FDA. Artificial Intelligence and Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. January 2021.

Medical Disclaimer: This article is for informational purposes only and does not constitute medical advice. Always consult a qualified healthcare professional for medical decisions.
