“Superhuman performance of a large language model on the reasoning tasks of a physician” Analysis

Large Language Models are now outperforming physicians—the beginning of the future. 

Recently, public attention was drawn to a study titled “Superhuman performance of a large language model on the reasoning tasks of a physician” (https://arxiv.org/abs/2412.10849).

This publication compared the performance of Large Language Models (LLMs) against physicians on the same clinical reasoning tasks. The models' accuracy, along with their ability to diagnose and to recommend the next tests in a patient's workup, yielded critical results.

The accuracy of an AI model is essential, but the process by which the model reaches its answer is even more significant. The reasoning process and transparency of an LLM are the most relevant research components. It's essential to understand how the LLM thinks and to ensure that it isn't pulling information together haphazardly, reaching an answer without sound reasoning. It's also possible for LLMs to “cheat”: if a model was trained on one of the clinical cases it is later tested on, it would know the answer without having to reason at all. It's much like being required to show your work on a math test; it's not uncommon for a student to arrive at the correct answer through incorrect calculations.

How the models reasoned toward their conclusions was monitored throughout these experiments, which tested how well AI could compete with physicians in clinical problem-solving scenarios.

I found this study very intriguing. 

It began with The New England Journal of Medicine's clinicopathological case conference series, which serves as a standard benchmark for evaluating differential diagnosis generators. Here it was used to assess the diagnostic and management reasoning capabilities of an advanced LLM (OpenAI o1-preview) against hundreds of physicians.

Five experiments were conducted to measure the clinical reasoning of the LLMs: 

Differential diagnosis generation 

Display of diagnostic reasoning

Triage differential diagnosis

Probabilistic reasoning

Management reasoning 

All of the results were judged by expert physicians.


The tests are described below.

Quality of Differential Diagnoses on New England Journal of Medicine Clinicopathological Conferences (Brodeur et al., 2024, pp. 3, 6)

The OpenAI o1-preview model was tested on Clinicopathologic Conference (CPC) cases published by the New England Journal of Medicine (NEJM).

Its processing resulted in a quality diagnosis in 120 of 143 cases (84%).

OpenAI o1-preview included the correct diagnosis in the differential in 78.3% of the cases, and in 52% of the cases, the first diagnosis was the correct one. 

(Brodeur et al., 2024, p. 12)

When compared head to head with GPT-4 on the same 70 cases, o1-preview had a success rate of 88.6%, while GPT-4 had a success rate of 72.9%.

OpenAI o1-preview scored 15.7 percentage points higher than GPT-4.
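
To make the comparison concrete: the 15.7-point gap is an absolute difference in percentage points; relative to GPT-4's score it works out to roughly a 21.5% improvement. A quick check in Python, using only the figures quoted above (the relative-improvement framing is my own, not the paper's):

```python
# Arithmetic check on the head-to-head figures quoted above.
o1_preview_success = 88.6  # percent of the 70 shared cases
gpt4_success = 72.9        # percent of the 70 shared cases

gap = o1_preview_success - gpt4_success
print(round(gap, 1))                       # 15.7 percentage-point difference
print(round(gap / gpt4_success * 100, 1))  # ~21.5% relative improvement over GPT-4
```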

The evaluation also tested OpenAI o1-preview’s ability to select the correct next diagnostic test in the NEJM CPCs. Scored by two physicians, o1-preview picked the correct next test in 113 of 132 cases (86%). (Brodeur et al., 2024, p. 12)


Presentation of reasoning in NEJM Healer Diagnostic Cases (Brodeur et al., 2024, pp. 3, 7-8)


The study used 20 clinical cases from the NEJM Healer curriculum (virtual patient encounters).

The metric was the Revised-IDEA (R-IDEA) score, a 10-point scale that evaluates four core domains of documenting clinical reasoning (judged by two physicians).

The physicians determined that OpenAI o1-preview received a perfect R-IDEA score in 78 out of 80 cases.

OpenAI o1-preview was compared to GPT-4, as well as to attending physicians and resident physicians. 

o1-preview significantly outperformed GPT-4, which received a perfect score in 47 of 80 cases (58.75%).

The attending physicians received perfect scores in 28 of 80 cases (35%), and resident physicians in 16 of 80 (20%). (Brodeur et al., 2024, p. 14)

The results were interesting; not only did o1-preview outperform GPT-4, but both attending and resident physicians scored lower than either model.

 

Grey Matters Management Cases (Brodeur et al., 2024, pp. 3-4, 8)

In this test, five clinical management cases based on real encounters were used to determine how accurate the o1-preview model was when faced with real-world scenarios.

The results were judged and scored by two physicians. Across the five cases, the o1-preview model was put to the test along with GPT-4, physicians with access to GPT-4, and physicians with access only to the conventional resources they would typically use during the course of their day.

Results for the five cases (median score per case):

OpenAI o1-preview: 86%

GPT-4: 42%

Physicians with access to GPT-4: 41%

Physicians with conventional resources: 34%

(Brodeur et al., 2024, p. 15)

It’s fascinating to see the physicians' scores rise when given access to GPT-4 (41% versus 34% with conventional resources alone). Imagine if they had access to an AI model trained specifically on medical data as a doctor's companion.

Landmark Diagnostic Cases (Brodeur et al., 2024, p. 4)

In this test, six clinical stories were used. 

These cases had previously been used to test GPT-4 against 50 general physicians. This part of the study addressed something I was concerned about: case repetition that the model could have memorized. According to the paper, the clinical cases in this test were never released publicly, which prevents the AI from having memorized the answers. The testing process involved reviewing the patient's medical history.

Two physicians scored the responses (median scores):

OpenAI o1-preview: 97%

GPT-4: 92%

Physicians with access to GPT-4: 76%

Physicians with conventional resources: 74%

Diagnostic Probabilistic Reasoning Cases (Brodeur et al., 2024, p. 4)

Five cases on primary care topics were given to a broad, representative sample of 553 medical practitioners (290 resident physicians, 202 attending physicians, and 61 nurse practitioners or physician assistants).

The tests measured how well each group performed probabilistic reasoning, with their estimates compared against scientific reference probabilities. (Brodeur et al., 2024, p. 16)
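
To make concrete what this kind of probabilistic reasoning involves, here is a minimal sketch of pre-test to post-test probability updating, the standard Bayesian way clinicians revise a disease estimate after a test result. The numbers are hypothetical illustrations, not values from the study.

```python
# Minimal sketch of pre-test to post-test probability updating (Bayes via
# likelihood ratios). All numbers below are hypothetical, not from the study.

def post_test_probability(pre_test_prob: float, sensitivity: float,
                          specificity: float, test_positive: bool) -> float:
    """Update a pre-test disease probability given a test result."""
    if test_positive:
        likelihood_ratio = sensitivity / (1 - specificity)   # LR+
    else:
        likelihood_ratio = (1 - sensitivity) / specificity   # LR-

    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

# Hypothetical example: 10% pre-test probability, a test with 90% sensitivity
# and 80% specificity, and a positive result.
print(round(post_test_probability(0.10, 0.90, 0.80, test_positive=True), 3))
# -> 0.333: one positive result raises the estimate from 10% to about 33%
```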

Emergency Room Cases (Brodeur et al., 2024, p. 17)

The subsequent trials were conducted to test real-world applications. The AI models were put up against physicians at the following points of care:

  1. Triage in the emergency room

  2. Initial evaluation by a physician

  3. Admission to the hospital or intensive care unit

They compared the ability of OpenAI o1-preview, GPT-4, and two attending physicians to provide accurate differential diagnoses across 79 cases, judged by two physicians who were unaware of which answers were generated by AI.

OpenAI o1-preview outperformed both GPT-4 and the two expert attending physicians.

(Brodeur et al., 2024, p. 17)


Results

Triage in the emergency room

o1-preview model: exact or close diagnosis (Bond score of 4-5) in 65.8% of cases

Physician 1: 54.4%

Physician 2: 48.1%


Initial evaluation by a physician

o1-preview model: exact or close diagnosis in 69.6% of cases

Physician 1: 60.8%

Physician 2: 50.6%


Admission to the hospital or intensive care unit

o1-preview model: exact or close diagnosis in 79.7% of cases

Physician 1: 75.9%

Physician 2: 68.4%

Regardless of the test, Large Language Models appear to be on a path to outperforming physicians in every category. These results open an exciting perspective. Most people feel safer with human doctors and are wary of AI taking over; I felt that way a bit myself. I don’t want to hand blind, unsupervised control of my well-being to an LLM, but given these results, it made me wonder whether human error is hindering our progress.

Another facet of the testing that I found interesting was that the judges for the AI models were human physicians. In turn, they proved that the AI models, specifically o1-preview, were more accurate than the physicians. This raises questions about the accuracy of the results, given that they were judged by physicians who, according to this study, have a lower accuracy rate than AI. 


Does that shed doubt on the accuracy of the results? 


Probably not. I don’t believe these statistics show that humans are inherently less intelligent. Still, an LLM draws on patterns learned from vast amounts of training data, encoded across billions of parameters, far more information than any human can hold or process. So, of course, we will underperform when the comparison comes down to the sheer scope of data being applied; the more information brought to bear, the more accurate the results tend to be. That’s a great reason to use AI to help providers, so we can evolve our own skills to heights we never even dreamed were possible.

  1. Brodeur, P. G., Buckley, T. A., Chen, J., Manrai, A. K., & Rodman, A. (2024). Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv. https://doi.org/10.48550/arXiv.2412.10849
