AI Now Surpasses Humans In Almost All Performance Benchmarks


Stanford University’s Institute for Human-Centered Artificial Intelligence has released the seventh annual issue of its comprehensive AI Index report, written by an interdisciplinary team of academic and industrial experts.

It examines everything from which sectors use AI the most to which country is most nervous about losing jobs to AI. But one of the most salient takeaways from the report is AI’s performance when pitted against humans.

For people who haven't been paying attention, AI has already beaten us in a frankly shocking number of significant benchmarks. AI is getting so clever, so fast, that many of the benchmarks used up to this point are now obsolete, and researchers in this area are scrambling to develop new, more challenging ones.

To put it simply, AIs are getting so good at passing tests that now we need new tests – not to measure competence, but to highlight areas where humans and AIs are still different, and find where we still have an advantage. It’s worth noting that the results below reflect testing with these old, possibly obsolete, benchmarks.

The new AI Index report notes that in 2023, AI still struggled with complex cognitive tasks like advanced math problem-solving and visual commonsense reasoning. Even so, performance on MATH, a dataset of 12,500 challenging competition-level math problems, has improved dramatically in the two years since its introduction.

In 2021, AI systems could solve only 6.9% of the problems. By 2023, a GPT-4-based model could solve 84.3% of them, against a human baseline of 90%. That's where things are at with advanced math in 2024, and we're still very much at the dawn of the AI era.
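For a sense of what those percentages actually measure, here's a minimal sketch of a solve-rate calculation on a MATH-style dataset. The `query_model` callable and the question/answer record layout are hypothetical stand-ins for illustration, not the report's actual evaluation harness.

```python
# Minimal sketch of a solve-rate calculation on a MATH-style dataset.
# `query_model` and the record layout are hypothetical, not the report's
# actual evaluation setup.

def normalize(answer: str) -> str:
    """Crude canonicalization so '1/2' and ' 1/2' compare equal."""
    return answer.strip().replace(" ", "")

def solve_rate(problems: list[dict], query_model) -> float:
    """Fraction of problems where the model's final answer matches the reference."""
    correct = sum(
        normalize(query_model(p["question"])) == normalize(p["answer"])
        for p in problems
    )
    return correct / len(problems)

# The 6.9% (2021) and 84.3% (2023) figures are exactly this kind of
# fraction, computed over the dataset's 12,500 problems.
```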

Visual commonsense reasoning (VCR) is another area where humans still hold the edge. Beyond simple object recognition, VCR assesses how AI uses commonsense knowledge in a visual context to make predictions. When shown an image of a cat on a table, an AI with strong VCR should predict that the cat might jump off the table, or that the table is sturdy enough to hold its weight. The report found that between 2022 and 2023, AI systems' VCR scores rose by 7.93%, up to 81.60, where the human baseline is 85.
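VCR is typically framed as multiple choice: the model must pick both the right answer and the right supporting rationale. The sketch below shows that compound scoring under assumed field names; `pick` is a hypothetical model callable, not part of the benchmark itself.

```python
# Illustrative scoring for a VCR-style task: the model must choose both the
# correct answer and the correct rationale for an image-grounded question.
# Field names and the `pick` callable are assumptions for illustration.

def q2ar_accuracy(examples: list[dict], pick) -> float:
    """Q->AR accuracy: credit only when answer AND rationale are both right."""
    hits = 0
    for ex in examples:
        answer = pick(ex["image"], ex["question"], ex["answer_choices"])
        rationale = pick(ex["image"], ex["question"], ex["rationale_choices"])
        if answer == ex["answer_label"] and rationale == ex["rationale_label"]:
            hits += 1
    return 100.0 * hits / len(examples)

# On this 0-100 scale, the report's numbers read as 81.60 for the best AI
# system versus a human baseline of 85.
```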

Nowadays, AI generates written content across many professions, but its habit of confidently fabricating information can carry real-world costs. In one widely reported 2023 case, New York lawyer Steven Schwartz used ChatGPT to help prepare a legal brief. The judge hearing the case quickly picked up on the legal cases the AI had fabricated in the filed paperwork and fined Schwartz US$5,000 for his careless mistake.

Truthfulness is another thing generative AI struggles with. In the new AI Index report, TruthfulQA was used as a benchmark to test the truthfulness of LLMs. Its 817 questions are built around commonly held misconceptions that we humans often get wrong.

GPT-4, first released in March 2023, achieved the highest performance on the benchmark with a score of 0.59, almost three times higher than that of a GPT-2-based model tested in 2021.
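A score like 0.59 can be read as a simple fraction of questions answered truthfully. Here's a hedged sketch of one common scoring scheme (MC1-style multiple choice), with `score_option` standing in as a hypothetical function that returns the model's preference for an answer option.

```python
# Sketch of an MC1-style truthfulness score: the model earns credit when it
# ranks the single truthful option above every distractor. `score_option` is
# a hypothetical callable returning a model preference score (e.g. log-prob).

def mc1_score(questions: list[dict], score_option) -> float:
    """Fraction of questions where the truthful option gets the top score."""
    correct = 0
    for q in questions:
        scores = [score_option(q["question"], opt) for opt in q["options"]]
        if scores.index(max(scores)) == q["truthful_index"]:
            correct += 1
    return correct / len(questions)

# On a 0-1 scale, 0.59 over TruthfulQA's 817 questions means the truthful
# option won on roughly 0.59 * 817 ≈ 482 of them.
```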

Using the Holistic Evaluation of Text-to-Image Models (HEIM), text-to-image generators were benchmarked across 12 key aspects important to real-world deployment, such as image-text alignment and image quality. Humans evaluated the generated images, finding that no single model excelled in all criteria.
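To see how "no single model excelled in all criteria" falls out of human ratings, here's an illustrative aggregation: average the ratings per model and aspect, then find each aspect's winner. The model names, aspects, and scores below are invented for illustration, not HEIM's real data.

```python
# Illustrative aggregation of human ratings per (model, aspect). With real
# benchmark data this table has 12 aspect columns, and the winning model
# differs from column to column. All values here are made up.

from collections import defaultdict
from statistics import mean

ratings = [  # (model, aspect, human_rating) tuples from an evaluation round
    ("model_a", "image_quality", 4.2), ("model_b", "image_quality", 3.9),
    ("model_a", "alignment", 3.5),     ("model_b", "alignment", 4.4),
]

by_key = defaultdict(list)
for model, aspect, score in ratings:
    by_key[(model, aspect)].append(score)

means = {key: mean(vals) for key, vals in by_key.items()}

# Winner per aspect: here model_a leads on image quality while model_b
# leads on alignment, so neither dominates across the board.
for aspect in {a for _, a in means}:
    best = max((m for m, a in means if a == aspect), key=lambda m: means[(m, aspect)])
    print(f"{aspect}: best model = {best} ({means[(best, aspect)]:.2f})")
```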

You’ll note this AI Index Report cuts off at the end of 2023 – which was a wildly tumultuous year of AI acceleration and a hell of a ride.

The rapid rate of technical development seen throughout 2023, evident in this report, shows that AI will only keep evolving and closing the gap between humans and technology.