Medical AI • August 14, 2025

GPT-5 Surpasses Human Experts in Multimodal Medical Reasoning

A breakthrough with far-reaching implications for healthcare AI and clinical decision support

In the rapidly evolving field of artificial intelligence, a new milestone has been reached that could reshape healthcare. A recent preprint posted to arXiv details how OpenAI's GPT-5 model outperforms pre-licensed human experts on complex medical reasoning benchmarks, particularly those involving multimodal data such as text and images.

This development, highlighted in a viral thread on X by AI analyst Rohan Paul, signals a shift from AI being merely comparable to humans in medical diagnostics to potentially exceeding them in controlled benchmarks.

The Paper: Key Findings and Methodology

Titled "Capabilities of GPT-5 on Multimodal Medical Reasoning," the paper was authored by Shansong Wang and colleagues. It was first submitted on August 11, 2025, and revised on August 13, 2025.

The study evaluates GPT-5 as a generalist multimodal reasoner for medical decision support, testing it on a range of benchmarks using a standardized zero-shot chain-of-thought protocol. Zero-shot means the prompt contains no worked examples of the task; chain-of-thought means the model is asked to write out step-by-step reasoning before giving a final answer.
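To make the protocol concrete, here is a minimal sketch of how a zero-shot chain-of-thought evaluation could be wired up for multiple-choice questions. It is an illustration under stated assumptions, not the paper's harness: the instruction wording, the `Answer: <letter>` convention, and the helper names `build_zero_shot_cot_prompt`, `extract_final_answer`, and `accuracy` are all hypothetical, and the model call itself is left out.

```python
import re
from typing import Dict, List, Optional

# Hypothetical instruction wording; the paper's exact prompts are not reproduced here.
COT_INSTRUCTION = (
    "Let's think step by step, then give the final answer as 'Answer: <letter>'."
)


def build_zero_shot_cot_prompt(question: str, options: Dict[str, str]) -> str:
    """Format one multiple-choice question with no worked examples (zero-shot)."""
    option_lines = [f"{letter}. {text}" for letter, text in sorted(options.items())]
    return "\n".join([question, *option_lines, "", COT_INSTRUCTION])


def extract_final_answer(response: str) -> Optional[str]:
    """Return the letter from the last 'Answer: X' in the response, or None."""
    matches = re.findall(r"Answer:\s*([A-Z])\b", response)
    return matches[-1] if matches else None


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of questions answered correctly; unparsed answers count as wrong."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and the same length")
    correct = sum(pred == answer for pred, answer in zip(predictions, gold))
    return correct / len(gold)
```

In use, each benchmark question would be passed through `build_zero_shot_cot_prompt`, the model's free-text response through `extract_final_answer`, and the parsed letters compared with the answer key via `accuracy`. Taking the last `Answer:` match is a design choice that tolerates a model restating options mid-reasoning.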

Benchmarks Include:

  • MedQA: A text-based dataset for clinical knowledge and reasoning
  • MedXpertQA (text and multimodal): Covers 4,460 questions across 17 medical specialties and 11 body systems
  • MMLU medical subsets: Multiple-choice questions from Massive Multitask Language Understanding
  • USMLE self-assessment exams: Practice exams that simulate the United States Medical Licensing Examination
  • VQA-RAD: A radiology-specific visual question-answering dataset with 2,244 question-answer pairs

Key Results (reported on the MedXpertQA multimodal benchmark):

  • 29.26% improvement in reasoning scores over GPT-4o
  • 26.18% improvement in understanding scores over GPT-4o
  • 24.23% higher reasoning scores than pre-licensed human experts
  • 29.40% higher understanding scores than pre-licensed human experts

A representative case study in the paper illustrates GPT-5's prowess: Given a scenario involving repeated vomiting, suprasternal crepitus, and CT scan findings, the model integrates textual and visual cues to diagnose a likely esophageal perforation and recommends a Gastrografin swallow as the next step, explaining why alternatives like antiemetics would be insufficient. This demonstrates structured clinical reasoning rather than simple pattern matching.

Building on Prior Research in Multimodal AI

This advancement aligns with earlier work emphasizing the potential of multimodal AI in healthcare. A 2022 study in Nature Medicine explored how integrating diverse data sources—such as imaging, electronic health records, and genomics—could enable personalized medicine, digital clinical trials, remote monitoring, and pandemic surveillance.

The authors highlighted technical challenges like data fusion and ethical considerations, but foresaw multimodal systems accelerating precision diagnostics. GPT-5's gains in handling text and images together build directly on this foundation, potentially speeding up the adoption of AI for clinical decision support.

Community Reactions and Broader Context

The paper quickly sparked discussions on X. For instance, AI developer Mikel Echevarria noted the "real leap" in performance, cautioning that benchmarks don't equate to real-world bedside care while emphasizing the model's edge over GPT-4o and human test-takers.

A Japanese-language thread from LangChainJP summarized the findings, highlighting GPT-5's 29.62% reasoning improvement over GPT-4o on MedXpertQA MM and stressing that the results are benchmark-based, not clinically validated. These reactions show excitement tempered by calls for real-world testing.

Economic Implications for Healthcare

Healthcare Economics by the Numbers:

  • US health expenditures: $4.9 trillion in 2023
  • Projected 2024: $5.3 trillion (17.6% of GDP)
  • US military spending: $997 billion in 2024 (less than 1/5 of health costs)
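The "less than 1/5" comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the list above, expressed in billions of US dollars; the variable and function names are illustrative.

```python
# Figures as quoted in the list above, in billions of US dollars.
HEALTH_2023 = 4_900            # US health expenditures, 2023
HEALTH_2024_PROJECTED = 5_300  # projected US health expenditures, 2024
MILITARY_2024 = 997            # US military spending, 2024


def share(part: float, whole: float) -> float:
    """Return part as a fraction of whole."""
    return part / whole


ratio_2024 = share(MILITARY_2024, HEALTH_2024_PROJECTED)
ratio_2023 = share(MILITARY_2024, HEALTH_2023)

print(f"Military spending vs projected 2024 health spending: {ratio_2024:.1%}")  # 18.8%
print(f"Military spending vs 2023 health spending: {ratio_2023:.1%}")            # 20.3%
```

Against the projected 2024 figure the ratio is about 18.8%, which is under one fifth. Against the 2023 figure it is about 20.3%, so the comparison holds only when both numbers are taken for 2024.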

The potential cost savings from such AI systems are immense, given healthcare's massive economic footprint. As Rohan Paul pointed out in his X thread, AI that reduces diagnostic errors or streamlines workflows could free up budgets without political friction, enabling governments to redirect funds elsewhere.

However, challenges remain. The paper notes that these are controlled evaluations, not real clinical deployments, where factors like patient variability, ethical AI use, and regulatory approval come into play. Multimodal AI must also address biases in training data to ensure equitable outcomes.

Conclusion: Toward AI-Augmented Medicine

GPT-5's performance on multimodal medical reasoning benchmarks marks a pivotal moment, challenging the notion that AI lags in complex domains like healthcare. By outscoring pre-licensed human experts in reasoning and understanding on these benchmarks, it strengthens the case for AI as a decision-support tool, potentially lowering costs in a sector projected to exceed $5 trillion annually in the US alone.

As research builds on foundational studies like the 2022 Nature Medicine paper, the focus should shift to safe, ethical integration into clinical practice. The future of medicine may well be collaborative—humans and AI working in tandem for better patient outcomes.

Written by

Zev Persellin

AI Strategy Consultant