AI Scaling Laws: A Guide to Building Superhuman-Level AI

Scaling laws are as important to artificial intelligence (AI) as the law of gravity is to the world around us. Cerebras makes wafer-scale chips optimized for AI, and those chips can host large language models (LLMs). Cerebras trains on open-source data, so its results can be reproduced by developers around the world.

James Wang, formerly an analyst at ARK Invest, is now a product marketing specialist at Cerebras.

In this interview, James discusses LLM development and why the generative pre-trained transformer (GPT) innovation taking place in this field is unlike anything that has come before it, with seemingly limitless possibilities. He also explains the motivation behind Cerebras' unique approach and the benefits its architecture and models provide to developers.

What does James believe to be the most significant natural law discovered this century?

OpenAI found that large language model performance scales predictably across seven orders of magnitude: the models were made 10 million times bigger, and performance kept improving along the same curve. James believes this is the most significant natural law discovered this century.
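As a rough illustration of what such a law looks like, the sketch below evaluates a power-law loss curve over seven orders of magnitude of model size. The constant and exponent are placeholders in the spirit of published fits, not figures taken from the interview or the OpenAI paper.

    # Hypothetical power-law scaling curve; the constant and exponent are
    # illustrative placeholders, not values from the interview.
    def loss_from_params(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
        """Loss as a power law in parameter count: L(N) = (N_c / N) ** alpha."""
        return (n_c / n_params) ** alpha

    # Sweep model size across seven orders of magnitude; loss keeps falling smoothly.
    for n in [1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11, 1e12]:
        print(f"{n:.0e} params -> loss ~ {loss_from_params(n):.3f}")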

The counterpoint to this was the DeepMind Chinchilla paper (March 2022), which argued that LLMs are compute-optimal at a ratio of roughly 20 training tokens per parameter. It was hugely influential: instead of a race to more and more parameters, the race became one for more and more tokens.
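A back-of-the-envelope sketch of the 20-tokens-per-parameter rule follows. The helper names are mine, and the 6 * parameters * tokens estimate of training FLOPs is the common approximation from the scaling-law literature, not something stated in the interview.

    # Back-of-the-envelope Chinchilla-style sizing; helper names are illustrative.
    def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
        """Roughly compute-optimal number of training tokens for a given model size."""
        return tokens_per_param * n_params

    def training_flops(n_params: float, n_tokens: float) -> float:
        """Common approximation: training compute ~ 6 * parameters * tokens."""
        return 6.0 * n_params * n_tokens

    n = 70e9                      # e.g. a 70-billion-parameter model
    d = chinchilla_tokens(n)      # ~1.4 trillion tokens
    print(f"tokens: {d:.2e}, FLOPs: {training_flops(n, d):.2e}")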

Why does Cerebras want to get state-of-the-art LLM work into the hands of as many people as possible? Cerebras made all of its state-of-the-art AI (LLM) work open source.

The Cerebras-GPT law.

Cerebras has confirmed that scaling laws transfer to downstream tasks. This makes it possible to determine how much compute and training is needed to reach human- or superhuman-level performance on a task. It also makes it possible to design a model with adequate AI performance and load it onto an iPhone, a laptop, or an edge computing device, as sketched below.
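A minimal sketch of how a fitted downstream scaling law might be inverted to budget compute for a target score. The functional form and constants here are assumptions for illustration, not Cerebras' actual fit.

    # Hypothetical downstream scaling law: task error falls as a power law in
    # training compute. The constants a and b are placeholders you would fit
    # from measured (compute, error) pairs.
    def task_error(compute_flops: float, a: float = 50.0, b: float = 0.1) -> float:
        """Illustrative fit: error = a * compute ** (-b)."""
        return a * compute_flops ** (-b)

    def compute_for_target(target_error: float, a: float = 50.0, b: float = 0.1) -> float:
        """Invert the fit to estimate the training compute needed for a target error."""
        return (a / target_error) ** (1.0 / b)

    # Example: budget compute for two hypothetical target error levels on some task.
    for err in (0.10, 0.05):
        print(f"target error {err:.2f} -> ~{compute_for_target(err):.2e} FLOPs")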

The Cerebras CS-2 is optimized for the training problem. Cerebras chips let people work on and train trillion-parameter models without the usual problems and delays, which simplifies training. Cerebras re-architected its wafer-scale chips so that compute is independent of memory size: it can handle arbitrarily large language models without blowing up the chip, pairing large compute with petabytes of memory.
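To see why decoupling compute from memory matters, here is a rough memory-arithmetic sketch. The byte counts assume FP16 weights and gradients plus Adam-style FP32 optimizer state; all figures are illustrative assumptions, not Cerebras specifications.

    # Rough sizing of a trillion-parameter model's training state; the per-parameter
    # byte counts are illustrative assumptions, not Cerebras specifications.
    def training_state_bytes(n_params: float,
                             bytes_weights: int = 2,     # FP16 weights
                             bytes_grads: int = 2,       # FP16 gradients
                             bytes_optimizer: int = 12   # FP32 master weights + 2 Adam moments
                             ) -> float:
        return n_params * (bytes_weights + bytes_grads + bytes_optimizer)

    n = 1e12  # a trillion-parameter model
    total = training_state_bytes(n)
    print(f"~{total / 1e12:.0f} TB of training state")  # ~16 TB, far more than fits on any single chip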