
GPT-4 Passes USMLE With Above-Average Score — What It Really Means for Clinical AI

agrovion-local
Author
📅 February 28, 2026
⏱ 4 min read

When Kung et al. published their analysis of GPT-4’s performance on the United States Medical Licensing Examination in PLOS Digital Health in 2023, the result was both expected by AI researchers and disorienting for the medical establishment. GPT-4 did not merely pass. It scored above the average performance threshold for human examinees — without medical fine-tuning, without a curated clinical training corpus, and without the years of supervised practice that define human medical education.

The Examination and the Result

The USMLE is the three-step licensure examination required for physician practice in the United States. Step 1 and Step 2 CK test biomedical science foundations and clinical knowledge, respectively; Step 3 assesses independent medical decision-making. Passing scores require roughly 60% correct responses. Competitive performance for residency matching typically exceeds 230 on the 1-300 scale.

Kung et al. administered 350 publicly available USMLE questions across all three steps to GPT-4, GPT-3.5, and a series of specialized medical AI models. GPT-4 achieved scores between 72.4% and 86.7% across steps — comfortably above the passing threshold on all three and, on several step variants, at or above the mean human examinee performance. GPT-3.5 passed at the lower end of acceptable performance. Earlier models, including those specifically trained on medical text, did not pass.
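To make the evaluation design concrete, here is a minimal sketch of what a multiple-choice scoring harness of this kind might look like. Everything in it is illustrative: the QUESTIONS entries and the query_model stub are assumptions for exposition, not the study’s actual pipeline.

```python
from collections import defaultdict

# Hypothetical question bank: each item carries its USMLE step, the stem,
# the answer choices, and the keyed correct choice. Illustrative data only.
QUESTIONS = [
    {"step": 1, "stem": "A 24-year-old presents with ...", "choices": ["A", "B", "C", "D"], "key": "B"},
    {"step": 2, "stem": "A 67-year-old presents with ...", "choices": ["A", "B", "C", "D"], "key": "A"},
    # ... the actual study used 350 public questions across all three steps
]

def query_model(stem, choices):
    """Placeholder for a call to the model under evaluation. A real
    harness would send the prompt to an LLM API and parse the chosen
    option letter from the response."""
    return "B"  # stub answer for illustration

def score_by_step(questions):
    """Compute per-step accuracy, mirroring how each step's score
    would be checked against the ~60% passing threshold."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q["step"]] += 1
        if query_model(q["stem"], q["choices"]) == q["key"]:
            correct[q["step"]] += 1
    return {step: correct[step] / total[step] for step in total}

print(score_by_step(QUESTIONS))
```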

Notably, GPT-4 also supplied clinically appropriate justifications alongside its answers. This is not a trivial observation: multiple-choice performance can be achieved through pattern matching, but the model’s ability to articulate reasoning that mapped to actual pathophysiological logic suggests something beyond surface-level memorization.

What the USMLE Actually Measures

The USMLE is designed to assess whether a physician candidate possesses the knowledge and reasoning necessary to practice medicine safely. It emphasizes clinical reasoning, diagnosis of undifferentiated presentations, pharmacological management, and recognition of emergency conditions. The examination is not a trivia test — it rewards integrative thinking across organ systems and patient contexts.

This makes GPT-4’s performance more significant than passing a fact-recall test would be. The model answered questions requiring multi-step reasoning: identifying the most likely diagnosis from a clinical vignette, selecting the most appropriate next step in management, or recognizing when a symptom cluster indicates a life-threatening emergency.
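As an illustration of how a clinical vignette becomes a multi-step reasoning task, the sketch below assembles a prompt that asks for a diagnosis, a next step in management, and a justification in sequence. The template wording and the build_prompt helper are hypothetical, not the prompts used in the study.

```python
# Hypothetical prompt template for a USMLE-style vignette; the wording
# is an illustrative assumption, not the study's actual prompt.
VIGNETTE_PROMPT = """You are answering a USMLE-style question.

Vignette:
{vignette}

Answer choices:
{choices}

First, state the most likely diagnosis.
Second, select the single best next step in management from the choices.
Third, justify your selection with the relevant pathophysiology.
"""

def build_prompt(vignette: str, choices: list[str]) -> str:
    """Fill the template with a vignette and lettered answer choices."""
    lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return VIGNETTE_PROMPT.format(vignette=vignette, choices=lettered)

print(build_prompt(
    "A 58-year-old man presents with crushing substernal chest pain ...",
    ["Order a D-dimer", "Obtain an ECG", "Start a proton pump inhibitor", "Discharge home"],
))
```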

Implications for Clinical Decision Support

The study authors were careful to distinguish between examination performance and clinical deployment readiness. Passing the USMLE does not mean GPT-4 should be used unsupervised for clinical decision support. The exam does not test procedural skills, communication, the ability to integrate physical examination findings, or the judgment required when a patient deteriorates unexpectedly.

What it does suggest is that large language models have reached a level of medical knowledge representation sufficient to serve as sophisticated clinical reference tools. The practical question shifts from whether these models know medicine to how they should be integrated into clinical workflows, with what safeguards, and under what regulatory framework.
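One safeguard commonly discussed for that integration is a hard human-in-the-loop gate: the model drafts, a clinician approves or rejects, and nothing is released without sign-off. The sketch below shows the shape of such a gate; DraftRecommendation and the surrounding functions are hypothetical names for illustration, not any deployed system’s API.

```python
from dataclasses import dataclass

@dataclass
class DraftRecommendation:
    """A model suggestion held in a pending state until reviewed."""
    suggestion: str
    rationale: str
    reviewed: bool = False
    approved: bool = False

def clinician_review(draft: DraftRecommendation, approve: bool) -> DraftRecommendation:
    """Record an explicit sign-off decision; nothing reaches the chart
    or the patient until approved is True."""
    draft.reviewed = True
    draft.approved = approve
    return draft

def release(draft: DraftRecommendation) -> str:
    """Surface the suggestion only after documented clinician approval."""
    if not (draft.reviewed and draft.approved):
        raise PermissionError("Recommendation requires clinician approval")
    return draft.suggestion

# Usage: the model drafts, the clinician disposes.
draft = DraftRecommendation("Start empiric antibiotics", "Sepsis criteria met")
clinician_review(draft, approve=True)
print(release(draft))
```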

  • GPT-4 scored 72-87% on USMLE steps, above the passing threshold on all three
  • Performance exceeded mean human examinee scores on several step variants
  • GPT-3.5 passed at the lower margin; earlier specialized medical models did not pass
  • The model provided anatomically and pathophysiologically coherent reasoning explanations

What Remains Unresolved

Examination performance does not predict clinical performance. Medicine practiced on real patients involves ambiguity, incomplete information, physical findings, patient preferences, and liability — none of which appear in a multiple-choice vignette. There is also the question of hallucination: GPT-4 can produce confident, well-reasoned, and entirely incorrect answers. In a clinical context, a confident error is potentially more dangerous than acknowledged uncertainty.
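One inexpensive mitigation sometimes proposed for this risk is a self-consistency check: sample the model several times on the same question and flag answers where the samples disagree. The sketch below is a minimal illustration under that assumption; as the comment notes, it cannot catch the failure mode just described, where the model is confidently and consistently wrong.

```python
from collections import Counter

def flag_low_agreement(sampled_answers: list[str], threshold: float = 0.7):
    """Measure agreement across repeated samples of the same question.
    Low agreement is a cheap proxy for answers that deserve human
    scrutiny; it does NOT catch confidently repeated errors."""
    counts = Counter(sampled_answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(sampled_answers)
    return top_answer, agreement, agreement < threshold

# Example: five sampled answers to the same vignette
answer, agreement, needs_review = flag_low_agreement(["B", "B", "C", "B", "D"])
print(answer, f"{agreement:.0%}", "flag for review" if needs_review else "ok")
```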

Regulatory pathways for LLM-based clinical decision support tools remain nascent. The FDA’s Software as a Medical Device framework applies to certain AI applications, but the parameters for evaluating general-purpose LLMs in clinical contexts have not been fully articulated. This is an active area of policy development that will shape how and when tools like GPT-4 can be formally deployed in healthcare.

Key Takeaway

GPT-4’s above-average USMLE performance marks a genuine inflection point: large language models now possess sufficient medical knowledge to serve as serious clinical reasoning tools, though examination success does not resolve the harder questions of safe deployment, hallucination risk, and regulatory oversight.

Sources

Kung TH, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198

Medical Disclaimer: This article is for informational purposes only and does not constitute medical advice. Large language models described here are not approved medical devices and should not be used for clinical decision-making without appropriate oversight by qualified medical professionals.
