Generative AI models like ChatGPT learn to generate human-like text from vast amounts of data scraped from the internet: articles, social media posts, and other online content.
However, as these models become more sophisticated, they increasingly learn from their own AI-generated output, which is now widespread online. A recent study published in Nature reveals that this self-referential learning can degrade a model’s performance, leading to what researchers call “model collapse.”
The study found that generative AI trained extensively on its own output produced nonsensical, incoherent text after only a few cycles. This issue, akin to genetic inbreeding, results from the AI’s reliance on recycled data, which loses complexity and diversity over generations. For example, an AI model tasked with generating content about church architecture initially produced coherent responses, but after several iterations its output degenerated into bizarre, repetitive text.
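The mechanism is easy to reproduce in miniature. The sketch below is a toy illustration, not the study’s actual experiment: it repeatedly fits a simple Gaussian model to samples drawn from the previous generation’s model, with the distribution’s spread standing in for the diversity of the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from a rich source distribution.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for gen in range(1, 31):
    # "Train" a model on the current data: here, just fit a Gaussian.
    mu, sigma = data.mean(), data.std()
    # The next generation learns only from the previous model's output.
    data = rng.normal(loc=mu, scale=sigma, size=100)
    if gen % 5 == 0:
        print(f"generation {gen:2d}: mean={mu:+.3f}  std={sigma:.3f}")
```

Because each generation estimates its parameters from a finite sample of the last one, estimation noise compounds and the tails of the distribution are the first casualties; exact numbers vary with the random seed, but the drift toward a narrower, blander distribution is the point.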
The phenomenon of “hallucinations,” where generative AI produces false or nonsensical information, is already well-known. However, model collapse is a distinct and more severe problem, leading to a gradual breakdown of the AI’s ability to generate meaningful content. As AI continues to train on its own output, it “forgets” the original data and becomes increasingly biased toward well-known concepts, overlooking less common ideas and underrepresented languages and cultures. This can exacerbate existing biases and reduce the fairness and inclusivity of AI systems.
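This forgetting of rare material can be shown with a similarly small sketch, again a toy model rather than the paper’s setup: a categorical distribution is re-estimated each generation from a finite sample of its own output, and once a rare category draws zero samples it can never come back.

```python
import numpy as np

rng = np.random.default_rng(1)

# One dominant concept plus several rare ones (think niche topics or
# underrepresented languages).
probs = np.array([0.90, 0.04, 0.03, 0.02, 0.01])

for _ in range(15):
    # Re-estimate the distribution from 200 samples of model output.
    sample = rng.choice(len(probs), size=200, p=probs)
    probs = np.bincount(sample, minlength=len(probs)) / 200

print("surviving probabilities:", np.round(probs, 3))
```

A category with probability 0.01 is expected to appear only twice in a 200-item sample, so it has a real chance of vanishing in any given generation, and its disappearance is permanent; the dominant category absorbs the freed probability mass, which is precisely the bias amplification described above.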
The challenge is particularly urgent as generative AI becomes more prevalent. Companies like Google, Meta, and OpenAI have suggested watermarking AI-generated content to distinguish it from human-created data, potentially preventing it from contaminating training datasets. However, industry-wide adoption of such measures is far from certain, and many developers may simply choose not to apply watermarks.
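No shared standard has been published, but statistical watermarks of the kind discussed in the research literature can be sketched briefly. The example below is a deliberately simplified, hypothetical “green list” detector in the spirit of token-level watermarking schemes, not any vendor’s actual implementation: a watermarking generator would bias its sampling toward keyed “green” tokens, so watermarked text scores well above chance.

```python
import hashlib

def green_fraction(tokens: list[str], key: str = "demo-key") -> float:
    """Fraction of tokens landing in a keyed pseudorandom 'green list'.

    Hypothetical sketch: a watermarking generator nudges sampling toward
    green tokens, so watermarked text scores well above 0.5, while
    ordinary human text stays near 0.5.
    """
    green = 0
    for prev, tok in zip(tokens, tokens[1:]):
        # A keyed hash of (previous token, current token) acts as a
        # pseudorandom membership test for a green list covering half
        # the vocabulary.
        digest = hashlib.sha256(f"{key}|{prev}|{tok}".encode()).digest()
        green += digest[0] < 128
    return green / max(len(tokens) - 1, 1)

# A dataset curator could then drop documents that score improbably high
# (the 0.75 threshold here is illustrative, not a published standard).
if green_fraction("an example sentence to score".split()) > 0.75:
    print("likely watermarked; exclude from training data")
```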
Another approach to mitigating model collapse involves integrating more human-generated data into the training process. This method showed promise in the study, helping maintain the model’s coherence across generations. Yet, the issue of AI model degradation raises broader concerns about the sustainability of generative AI technology. As AI becomes more integrated into daily life, ensuring access to diverse, human-generated data is crucial for the continued advancement and fairness of these models.
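Extending the earlier Gaussian toy makes the effect of such blending visible; this is again an illustration under simplified assumptions, not the study’s method. Each generation trains on a mix of fresh human samples and the previous model’s output, and even a modest human fraction anchors the distribution near its original spread.

```python
import numpy as np

rng = np.random.default_rng(2)

def final_std(human_fraction: float, generations: int = 100, n: int = 50) -> float:
    """Recursive training where a share of each generation's data is fresh."""
    def human(k):
        # The "real" human source distribution.
        return rng.normal(0.0, 1.0, size=k)

    data = human(n)
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()
        k = int(n * human_fraction)
        # Blend fresh human data with the current model's own output.
        data = np.concatenate([human(k), rng.normal(mu, sigma, size=n - k)])
    return data.std()

for frac in (0.0, 0.1, 0.5):
    print(f"human fraction {frac:.1f} -> final std {final_std(frac):.3f}")
```

Exact numbers vary from run to run, but pure self-training (fraction 0.0) typically drifts far from the source’s original spread of 1.0, while even a 10% blend of fresh human data keeps pulling the model back toward it.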
Ultimately, addressing model collapse requires coordinated efforts across the AI community. Without proactive measures, training future AI models could become increasingly challenging, as AI-generated content continues to proliferate online, diluting the quality of available data.