In recent years, large language models (LLMs) like GPT-4 have astonished us by passing portions of professional and academic tests, but their outputs still suffer from inconsistency and hallucinations (confidently wrong statements). That unreliability has held them back from roles in high-stakes domains like medicine.
A new study, however, presents a clever twist: instead of relying on one AI model alone, the researchers built an “AI council” of five GPT-4 instances that deliberate together. Each model gives its take on a medical licensing question; when their answers differ, a facilitator aggregates their arguments and asks them to reconsider until they reach consensus.
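To make that deliberation loop concrete, here is a minimal Python sketch of the idea. Everything in it is an illustrative assumption rather than the paper’s actual implementation: the hypothetical `AskFn` helper stands in for a call to one GPT-4 instance, and the round limit and majority-vote fallback are choices made for the sketch.

```python
from collections import Counter
from typing import Callable

# Hypothetical signature: one council member maps (question, shared context)
# to (answer, rationale). In practice this would be a GPT-4 API call whose
# sampling randomness makes each member "think differently."
AskFn = Callable[[str, str], tuple[str, str]]

def council_deliberate(question: str, members: list[AskFn], max_rounds: int = 3) -> str:
    """Poll each member independently; on disagreement, share all rationales
    back and ask everyone to reconsider, until consensus or the round cap."""
    context = ""
    for _ in range(max_rounds):
        votes = [ask(question, context) for ask in members]
        answers = [answer for answer, _ in votes]
        if len(set(answers)) == 1:  # unanimous: consensus reached
            return answers[0]
        # Facilitator step: aggregate the competing arguments and feed them
        # back so each member reconsiders in the next round.
        context = "\n".join(
            f"Member {i + 1} answered {answer}: {rationale}"
            for i, (answer, rationale) in enumerate(votes)
        )
    # Assumed fallback (not from the study): majority vote if no consensus.
    return Counter(answers).most_common(1)[0][0]
```

The key design point is that the members are not forced to agree up front: disagreement is surfaced, argued over, and only then resolved, which is what lets the council correct errors that any single instance would have kept.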
When tested on 325 publicly available questions from the US Medical Licensing Examination (USMLE), this collective approach achieved staggering accuracy: roughly 97%, 93%, and 94% across the three USMLE Steps, surpassing not only single GPT-4 models but also the human passing thresholds.
Importantly, over half of the errors made by single models were corrected through deliberation. In cases where the models initially disagreed, the final consensus was correct 83% of the time. Rather than treating randomness in AI as a flaw, the researchers leverage it, letting different instantiations “think differently,” compare notes, and self-correct.
Of course, these are lab results, not clinical trials. The method hasn’t been tested in real-world medical settings, and it doesn’t guarantee safety or acceptability in practice. Still, this work suggests a promising paradigm: AI systems may become more reliable not by striving for perfect determinism, but by orchestrating structured disagreement and collaborative reasoning, a pattern that might extend well beyond medicine.
