
The dream of a universal AI interpreter just got a bit closer. This week, tech giant Meta released a new AI that can almost instantaneously translate speech in 101 languages as soon as the words tumble out of your mouth.
AI translators are nothing new. But they generally work best with text and struggle to transform spoken words from one language to another. The process is usually multistep.
The AI first turns speech into text, translates the text, and then converts it back to speech. Though already useful in everyday life, these systems are inefficient and laggy. Errors can also sneak in at each step.
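The cascade described above can be sketched in a few lines. The three functions below are hypothetical stand-ins (real systems would call separate speech-recognition, translation, and speech-synthesis models), but they show why the pipeline is laggy and error-prone: three models run in sequence, and a mistake in any step flows into the next.

```python
def speech_to_text(audio: bytes) -> str:
    """Hypothetical ASR step: transcribe audio into source-language text."""
    return "hola mundo"  # stand-in transcription

def translate_text(text: str, target_lang: str) -> str:
    """Hypothetical MT step: translate the transcript."""
    return {"en": "hello world"}.get(target_lang, text)

def text_to_speech(text: str) -> bytes:
    """Hypothetical TTS step: synthesize audio from the translated text."""
    return text.encode("utf-8")  # stand-in waveform

def cascaded_translate(audio: bytes, target_lang: str) -> bytes:
    # Three separate models run one after another; each hop adds latency,
    # and an error in any step propagates to every step after it.
    transcript = speech_to_text(audio)
    translated = translate_text(transcript, target_lang)
    return text_to_speech(translated)

print(cascaded_translate(b"...", "en"))
```

A direct speech-to-speech model like SEAMLESSM4T collapses these hops into one, which is where the speed and accuracy gains come from.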
Meta’s new AI, dubbed SEAMLESSM4T, can directly convert speech into speech. Using a voice synthesizer, the system translates words spoken in 101 languages into 36 others—not just into English, which tends to dominate current AI interpreters.
In a head-to-head evaluation, the algorithm is 23 percent more accurate than today’s top models—and nearly as fast as expert human interpreters. It can also translate text to text, text to speech, and speech to text.
Meta is releasing all the data and code used to develop the AI to the public for non-commercial use, so others can optimize and build on it. In a sense, the algorithm is “foundational,” in that “it can be fine-tuned on carefully curated datasets for specific purposes—such as improving translation quality for certain language pairs or for technical jargon,” wrote Tanel Alumäe at Tallinn University of Technology, who was not involved in the project. “This level of openness is a huge advantage for researchers who lack the massive computational resources needed to build these models from scratch.”
It’s “a hugely interesting and important effort,” Sabine Braun at the University of Surrey, who was also not part of the study, told Nature.
Machine translation has made strides in the past few years thanks to large language models. These models, which power popular chatbots like ChatGPT and Claude, learn language by training on massive datasets scraped from the internet—blogs, forum comments, Wikipedia.
In translation, humans carefully vet and label these datasets, or “corpora,” to ensure accuracy. Labels or categories provide a sort of “ground truth” as the AI learns and makes predictions.
But not all languages are equally represented. Training corpuses are easy to come by for high-resource languages, such as English and French. Meanwhile, low-resource languages, largely used in mid- or low-income countries, are harder to find—making it difficult to train a data-hungry AI translator with trusted datasets.
“Some human-labeled resources for translation are freely available, but often limited to a small set of languages or in very specific domains,” wrote the authors.
To get around the problem, the team used a technique called parallel data mining, which crawls the internet and other resources for audio snippets in one language with matching subtitles in another. These pairs, which match in meaning, add a wealth of training data in multiple languages—no human annotation needed. Overall, the team collected roughly 443,000 hours of audio with matching text, resulting in about 30,000 aligned speech-text pairs.
SEAMLESSM4T consists of three different blocks: some handle text and speech input, others handle output. The translation part of the AI was pre-trained on a massive dataset containing 4.5 million hours of spoken audio in multiple languages. This initial step helped the AI “learn patterns in the data, making it easier to fine-tune the model for specific tasks” later on, wrote Alumäe. In other words, the AI learned to recognize general structures in speech regardless of language, establishing a baseline that made it easier to translate low-resource languages later.
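The pre-train-then-fine-tune workflow can be illustrated with a deliberately tiny example: a one-parameter “model” fit by gradient descent. All the data and learning rates here are made up; the point is only that starting from pre-trained weights gets much closer to the target with the same small fine-tuning budget than starting from scratch.

```python
def fit(weight, data, steps, lr=0.1):
    """Fit y = weight * x by gradient descent on mean squared error."""
    for _ in range(steps):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

pretrain_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # "general" pattern: y = 2x
finetune_data = [(1.0, 2.2), (2.0, 4.4)]              # "specific" task: y = 2.2x

w_pretrained = fit(0.0, pretrain_data, steps=50)          # learn the general pattern
w_finetuned = fit(w_pretrained, finetune_data, steps=5)   # small adjustment on top
w_scratch = fit(0.0, finetune_data, steps=5)              # same budget, no pre-training

# The pre-trained start lands much closer to the target slope of 2.2.
print(round(w_finetuned, 2), round(w_scratch, 2))
```

The same logic, at vastly larger scale, is why pre-training on millions of hours of unlabeled speech makes it feasible to fine-tune translation for low-resource languages with comparatively little paired data.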
The AI was then trained on the speech pairs and evaluated against other translation models. A key advantage of the AI is its ability to directly translate speech, without having to convert it into text first. To test this ability, the team hooked up an audio synthesizer to the AI to broadcast its output. Starting with any of the 101 languages it knew, the AI translated speech into 36 different tongues—including low-resource languages—with only a few seconds of delay.