Is an artificial intelligence (AI) chatbot better at making diagnostic decisions under conditions of uncertainty than human medical doctors? A recent study compares the capabilities of OpenAI's large language model (LLM) GPT-4 with those of human clinicians and reveals surprising results.

“Large language models (LLMs) can convincingly solve difficult diagnostic cases, pass licensing examinations, and communicate empathetically with patients, suggesting that they have an emergent understanding of clinical reasoning,” wrote study co-authors Adam Rodman, MD, MPH, Thomas A. Buckley, Arjun K. Manrai, PhD, and Daniel J. Morgan, MD, MS. The researchers are affiliated with Beth Israel Deaconess Medical Center in Boston, Massachusetts, University of Massachusetts, Harvard Medical School, University of Maryland School of Medicine, and Veterans Affairs Maryland Healthcare System.

To measure the accuracy of an AI chatbot against human clinicians, the researchers used data from a prior survey study of 553 primary care practitioners at outpatient clinics across eight US states: 290 resident physicians, 202 attending physicians, and 61 nurse practitioners and physician assistants.

The primary care practitioners estimated the pretest probability of a patient having pneumonia, breast cancer, cardiac ischemia, or urinary tract infection (asymptomatic bacteriuria), as well as a hypothetical testing scenario; they also estimated the likelihood that a person with a positive or negative test result actually has the disease (the post-test probability).
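The quantities the survey probes follow from Bayes' theorem: a post-test probability combines the pretest probability with the test's sensitivity and specificity. A minimal sketch in Python of that calculation (the sensitivity, specificity, and pretest values below are illustrative assumptions, not figures from the study):

```python
def post_test_probability(pretest, sensitivity, specificity, positive):
    """Update a pretest probability with a test result via Bayes' theorem."""
    if positive:
        # P(disease | positive) = sens*p / (sens*p + (1 - spec)*(1 - p))
        num = sensitivity * pretest
        den = num + (1 - specificity) * (1 - pretest)
    else:
        # P(disease | negative) = (1 - sens)*p / ((1 - sens)*p + spec*(1 - p))
        num = (1 - sensitivity) * pretest
        den = num + specificity * (1 - pretest)
    return num / den

# Illustrative numbers (assumed, not from the study): 1% pretest
# probability, 90% sensitivity, 90% specificity.
print(round(post_test_probability(0.01, 0.90, 0.90, positive=True), 3))   # ~0.083
print(round(post_test_probability(0.01, 0.90, 0.90, positive=False), 4))  # ~0.0011
```

The counterintuitive drop from a "positive test" to a post-test probability of only about 8% is exactly the kind of reasoning under uncertainty the survey asked clinicians, and later GPT-4, to perform.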

For example, the breast cancer survey question asked:

You are seeing Ms. Johnson, a 45-year-old woman, for an annual visit. She has no specific risk factors or symptoms for breast cancer. She has no particular preference for testing and wants your advice.

The survey questions served as GPT-4 prompts. GPT-4 is a large multimodal AI model that can accept images and text input data from users and output text. According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.”

For breast cancer and the hypothetical testing scenario, GPT-4 outperformed human clinicians in estimating the post-test probability after a positive test result. The researchers reported that the AI chatbot was more accurate than clinicians at estimating the post-test probability after a negative result in all five cases. They concluded that the chatbot's predictions have the potential to improve the performance of human clinicians.

Copyright © 2023 Cami Rosso All rights reserved.

QOSHE - Can AI Chatbot GPT-4 Outperform Human Physicians? - Cami Rosso
20.12.2023

© Psychology Today