Stanford Clinical AI Benchmark: DeepSeek Outperforms Google and OpenAI

Frontier Models · Published: Jun 03, 2025 · Yuki Tanaka · ~7 min read

Author Info

Asia-Pacific AI Markets Reporter

B.A. Economics (University of Tokyo); bilingual EN/JP; former APAC tech wire correspondent

Yuki tracks model launches, cloud partnerships, and industrial policy across East Asia. She sources from company filings, local press briefings, and on-the-ground industry contacts, then contextualizes moves for a global English-speaking audience. She is careful to note translation limits and regional regulatory differences.

#APAC Markets #Cloud Partnerships #Industrial Policy #Cross-Border Launches

Full author profile →

Stanford’s Latest Comprehensive Medical Task Evaluation of Large Models: DeepSeek R1 Takes First Place with a 66% Win Rate!

International netizens were stunned by the results, primarily because this evaluation focuses on the daily work scenarios of clinical doctors, rather than being limited to traditional medical licensing exam questions.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

To conduct a proper evaluation, it must be comprehensive in all aspects.

The team constructed MedHELM, a comprehensive evaluation framework containing 35 benchmark tests that cover medical tasks across 22 subcategories.

This classification system was validated by clinicians and developed with the participation of 29 licensed physicians from 14 medical specialties.

The author list is extensive, including researchers from Stanford University School of Medicine, Stanford Health Care, the Stanford Center for Research on Foundation Models (CRFM), and Microsoft.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

The 31-page paper concludes that among nine cutting-edge large models, including DeepSeek R1, o3-mini, and Claude 3.7 Sonnet, DeepSeek R1 leads with a 66% win rate and a macro-average score of 0.75.

For the current benchmark results, the team has also created a publicly accessible leaderboard.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

In addition to DeepSeek R1 leading the pack, o3-mini follows closely with a 64% win rate and the highest macro-average score of 0.77; Claude 3.5 and 3.7 Sonnet achieved win rates of 63% and 64%, respectively.

After reviewing the specific research, netizens expressed that these evaluations are very helpful.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

Let’s look at more details below.

Large Models Face a Major Clinical Medical Task Exam

This comprehensive evaluation framework is named MedHELM, inspired by the standardized cross-domain evaluation approach of Stanford’s previous HELM project.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

One of the core contributions of the study is the construction of a clinician-validated classification system.

This system simulates the daily workflow logic of clinicians and includes three levels:

Category: Broad domains of medical activities (e.g., “Clinical Decision Support”);
Subcategory: Related task groups within a category (e.g., “Supporting Diagnostic Decisions”);
Task: Discrete operations in healthcare services (e.g., “Generating Differential Diagnoses”).

During the initial formulation of the classification system, a clinician reorganized tasks identified in a Journal of the American Medical Association (JAMA) review into functional themes reflecting real-world medical activities. This formed an initial framework comprising 5 categories, 21 subcategories, and 98 tasks.

The team then validated this initial classification system.

Twenty-nine practicing clinicians from 14 medical specialties participated in a survey to assess the rationality of the system from two perspectives: classification logic and comprehensiveness.

Based on feedback, the system was ultimately expanded to 5 categories, 22 subcategories, and 121 tasks, fully covering various aspects of clinical practice such as clinical decision support, clinical case generation, patient communication and education, medical research assistance, and management/workflow. Furthermore, 26 clinicians reached a 96.7% agreement rate on the subcategory classifications.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

The second core contribution is that, based on this classification system, the team constructed a comprehensive evaluation suite containing 35 benchmark tests, including:

17 existing benchmarks
5 benchmarks reconstructed from existing datasets
13 newly developed benchmarks

Notably, 12 of the 13 newly developed benchmarks are based on real-world Electronic Health Record (EHR) data, effectively addressing the lack of real medical data usage in existing evaluations.

Ultimately, this entire suite of benchmarks fully covers all 22 subcategories in the classification system. Based on data sensitivity and access restrictions, these benchmarks were divided into three access levels: 14 public, 7 requiring approval, and 14 private.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

With the exam questions prepared, the research team systematically evaluated nine cutting-edge large language models.

How Did the Evaluation Results Turn Out?

The evaluation revealed significant differences in model performance.

DeepSeek R1 performed best, leading with a 66% win rate in pairwise comparisons, achieving a macro-average score of 0.75 and a low standard deviation for win rates (0.10).

Here, the “win rate” refers to the proportion of times a model outperformed others across all 35 benchmark tests in pairwise comparisons. The “standard deviation of win rates” measures the stability of the model’s victories (lower value = higher stability). The macro-average score is the average performance score across all 35 benchmarks. The standard deviation reflects fluctuations in model performance across different benchmarks (lower value = higher consistency across benchmarks).

o3-mini followed closely, performing particularly well in clinical decision support benchmarks, ranking second with a 64% win rate and the highest macro-average score of 0.77.

Claude 3.7 Sonnet and 3.5 Sonnet achieved win rates of 64% and 63%, respectively, both with a macro-average score of 0.73; GPT-4o had a win rate of 57%; Gemini 2.0 Flash and GPT-4o mini had lower win rates of 42% and 39%, respectively.

Additionally, the open-source model Llama 3.3 Instruct had a win rate of 30%; Gemini 1.5 Pro ranked last with a 24% win rate, but it exhibited the lowest standard deviation in win rates (0.08), indicating the most stable competitive performance.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

The team also presented a heatmap showing each model’s standardized scores across the 35 benchmarks, where dark green indicates higher performance and dark red indicates lower performance.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

The results show that models performed poorly in the following benchmarks:

MedCalc-Bench (calculating medical values from patient records)
EHRSQL (generating SQL queries for clinical research based on natural language instructions—originally designed as a code generation dataset)
MIMIC-IV Billing Code (assigning ICD-10 codes to clinical cases)

They performed best in the NoteExtract benchmark (extracting specific information from clinical records).

Deeper analysis revealed distinct hierarchical differences in model performance across different task categories.

In clinical case generation tasks, most models achieved high scores ranging from 0.74 to 0.85; they also performed excellently in patient communication and education tasks, with scores between 0.76 and 0.89. Performance was moderate in medical research assistance (0.65–0.75) and clinical decision support (0.61–0.76), while scores were generally lower in management and workflow (0.53–0.63).

This difference reflects that free-text generation tasks (such as clinical case generation and patient communication) are better suited to leverage the natural language advantages of large language models, whereas structured reasoning tasks require stronger domain-specific knowledge integration and logical reasoning capabilities.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

For the 13 open-ended benchmarks, the team adopted an LLM-jury evaluation method.

To assess the effectiveness of this method, the team collected independent ratings from clinicians on some model outputs. Specifically, they selected 31 instances from ACI-Bench and 25 from MEDIQA-QA to compare clinician scores with the jury’s composite scores.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

The results showed that the LLM-jury method achieved an intraclass correlation coefficient (ICC) of 0.47 with clinician scores. This not only exceeded the average consistency among clinicians themselves (ICC=0.43) but also significantly outperformed traditional automated evaluation metrics such as ROUGE-L (0.36) and BERTScore-F1 (0.44).

The team concluded that LLM juries reflect clinical judgment better than standard lexical metrics, proving their effectiveness as a substitute for clinician scoring.

Cost-effectiveness analysis was another innovation of this study. Based on public pricing as of May 12, 2025, the team estimated the cost required for each model by combining the total input tokens consumed during benchmark execution and the maximum output tokens used in the LLM-jury evaluation process.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

As expected, non-reasoning models GPT-4o mini ($805) and Gemini 2.0 Flash ($815) were cheaper, with win rates of 0.39 and 0.42, respectively.

Reasoning models were more expensive; DeepSeek R1 ($1,806) and o3-mini ($1,722) achieved win rates of 0.66 and 0.64, respectively.

Overall, Claude 3.5 Sonnet ($1,571) and Claude 3.7 Sonnet ($1,537) performed well in terms of cost-effectiveness, achieving a win rate of approximately 0.63 at a lower cost.

Stanford Clinical Medical AI Comprehensive Review, DeepSeek Outperforms Google and OpenAI

Those interested in more details can refer to the original paper.

Paper Link: https://arxiv.org/pdf/2505.23802
Blog Link: https://hai.stanford.edu/news/holistic-evaluation-of-large-language-models-for-medical-applications
Leaderboard Link: https://crfm.stanford.edu/helm/medhelm/latest/#/leaderboard

References

1929388406032810046 — x.com/iScienceLuvr/status/1929388406032810046

Stanford Clinical AI Benchmark: DeepSeek Outperforms Google and OpenAI

Author Info

Large Models Face a Major Clinical Medical Task Exam

How Did the Evaluation Results Turn Out?

References

Related News

Latest Headlines