
INFORMATIONAL «CANNIBALISM»: Why AI Collapses and Spews Out Nonsense

Author: Huxley
© Huxley – an almanac about philosophy, art and science
Photo Source: wsj.com

 

Generative AI models are now widely accessible, enabling anyone to easily create something unique and original in collaboration with a machine. However, these models can degrade when their training datasets contain too much AI-generated content.

 

CONSUMING THEIR KIND

 

The international scientific journal Nature reports on a study that examined how AI models learn from data previously generated by other AIs. It was discovered that when subsequent versions of a large language model (LLM) are fed information created by earlier generations of AI, a rapid collapse occurs.

Scientists have termed this a «cannibalistic phenomenon», as the process resembles informational consumption of one's own kind. Here is how it happens: at some point, the improvement of large language models stalls because they run out of human-generated training data. They then turn to data scraped from the Internet, which increasingly contains AI-generated text, and that is when the collapse occurs.

 

INTERNET EXPANSION OF SYNTHETIC CONTENT

 

Why does this happen? More and more people are using AI to create content, and ever more of this artificially generated material ends up on the web. As a result, the share of AI-generated text on the Internet keeps growing.

This increases the likelihood that, with each passing day, the Internet, as a source of information, will increasingly «feed» AI-generated texts to large language models. And these models will readily «consume» them without hesitation.

 

MODEL COLLAPSE — A UNIVERSAL PROBLEM!

 

British scientists from the University of Cambridge, who conducted these studies, are urging AI and internet users to remain vigilant. They believe it is crucial to be extremely cautious about what is included in training data. Otherwise, everything is guaranteed to «go wrong».

Using mathematical analysis, the Cambridge team demonstrated that the problem of model collapse is likely to be universal. They found that this issue affects virtually all language models that use unverified data from open sources. Similar problems also arise in other types of AI, including image generators.

 

ENGLISH CATHEDRALS OVERFLOWING WITH COLORFUL RABBITS

 

Researchers from Cambridge began their experiment by using an LLM to generate Wikipedia-style texts. They then sequentially trained new iterations of the model on texts created by their predecessors. The scientists found that AI-generated information (synthetic data) contaminates the training set.
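To make this procedure concrete, here is a minimal sketch in Python. It assumes a drastically simplified stand-in for an LLM: a unigram model that merely counts word frequencies and samples new text from them. The corpus, the numbers and the function names are illustrative rather than taken from the Cambridge experiment, but the loop mirrors its structure: each generation is trained only on the output of the previous one.

```python
import random
from collections import Counter

random.seed(0)

# Toy "generation 0" corpus standing in for human-written text:
# a few common words plus 100 words that each appear only once.
corpus = ["tower"] * 400 + ["church"] * 300 + ["bell"] * 200
corpus += [f"rare_word_{i}" for i in range(100)]

def train(texts):
    """'Train' a unigram model: estimate word probabilities by counting."""
    counts = Counter(texts)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def generate(model, n_words):
    """Sample a new corpus from the model's word probabilities."""
    words = list(model)
    weights = [model[w] for w in words]
    return random.choices(words, weights=weights, k=n_words)

# Each new generation is trained only on what the previous generation produced.
data = corpus
for generation in range(10):
    model = train(data)
    print(f"generation {generation}: vocabulary size = {len(model)}")
    data = generate(model, n_words=len(corpus))
```

Even in this toy setting the vocabulary shrinks from one generation to the next: words that appeared only once in the original corpus soon stop being sampled and disappear for good, which is the same dynamic the researchers observed at the scale of real language models.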

As a result, the model’s output degraded into complete nonsense. For example, the initial iteration of the AI was tasked with creating a Wikipedia-style article about English church towers. At first, everything seemed to go smoothly. However, by the ninth iteration, the model had turned the article on church architecture into a treatise on the diversity of rabbit tails.

The ninth version of the article stated that St. John’s Cathedral in London is «home to the world’s largest populations of black-tailed rabbits, white-tailed rabbits, blue-tailed rabbits, red-tailed rabbits, and yellow-tailed rabbits…»

 

«FORGETFUL» AI MAKES DATA HOMOGENEOUS

 

All of this might be amusing if it weren’t so concerning: how much can we really trust AI? Artificial intelligence and the content it creates cannot be the ultimate source of truth. In the end, it seems that AI still cannot function without humans as mentors and arbiters, at least for now.

In this context, Nature recalls another earlier study. According to its findings, the issues with models are even more fundamental — they arise long before the complete collapse caused by training on AI-generated texts.

This type of training leads to models «forgetting» information that is mentioned less frequently in their datasets, making their output more homogeneous.

 


 

CATASTROPHIC SHORTAGE OF HUMAN-GENERATED CONTENT

 

This situation is causing concern among specialists involved in AI model development. Until now, many tech companies have been enhancing their models by feeding them ever-increasing amounts of data. However, the volume of original human-generated content is finite.

As this content becomes depleted, companies are attempting to use synthetic data for further improvements. This is where they encounter significant limitations, which the scientific community first openly discussed in May 2023.

 

INBREEDING ADDED TO CANNIBALISM

 

Commenting on AI’s disturbing distortions of reality, Hany Farid from the University of California, Berkeley, likens the problem to inbreeding in the animal kingdom. Inbreeding, one manifestation of which is more commonly known as «incest», occurs when animals breed with their own offspring or close relatives, failing to diversify the gene pool.

This close breeding leads to an increase in homozygosity in the offspring’s genotype, thereby cementing harmful and even lethal genes in the population.

 

MODELS LEARN FROM MISTAKES AND GUESSES

 

Language models operate by using associative links between tokens — words or parts of words. They do this based on a vast array of texts, often sourced from the Internet. When generating text, the models rely on statistically most probable combinations and sequences of words learned through exposure to language patterns.

A collapse occurs because each model can only sample from the data on which it was trained. This means that words that were rare in the original data become ever less likely to be reproduced, while the likelihood of common words being repeated keeps increasing.

The total collapse ultimately happens because each model learns not from reality but from «guesses» — predictions of reality made by the previous model. Errors compound and intensify with each iteration. Over time, the model primarily learns from mistakes and little else.
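A back-of-the-envelope calculation illustrates why rare words are the first casualties. Suppose, purely for illustration, that the next generation is trained on a sample of n_tokens tokens drawn from the previous model; then a word with probability p shows up in that sample with probability 1 - (1 - p)^n_tokens. The sample size below is an assumption, not a figure from the study.

```python
# Illustrative numbers, not taken from the Nature paper: if a word has
# probability p in the training data and the next generation is trained on a
# sample of n_tokens tokens, the chance that the word appears at least once
# in that sample is 1 - (1 - p)^n_tokens.

def survival_probability(p, n_tokens):
    return 1 - (1 - p) ** n_tokens

for p in (1e-4, 1e-5, 1e-6):
    print(f"word probability {p:.0e}: "
          f"chance of surviving one generation = {survival_probability(p, 100_000):.2f}")
```

With a sample of 100,000 tokens, a word seen roughly once per 100,000 tokens survives a single generation with a probability of only about 0.63, and much rarer words almost certainly vanish. After a few such generations the long tail of the distribution is gone, while the most common words keep being reinforced.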

 

DO THE LAWS STOP WORKING?

 

The scaling laws widely accepted today state that models should improve as the volume of their training data increases. However, as synthetic data accumulates on the Internet, these laws are likely to stop working: synthetic training data lose their value because they lack the richness and diversity inherent in human-created content.

 

HOW TO AVOID COLLAPSE: REHABILITATING THE HUMAN CREATOR

 

At the same time, scientists reassure the public that model collapse does not mean LLMs will stop working entirely; it simply means that building them will become more expensive. Developers will need to figure out how to teach models to limit their use of synthetic data and to distinguish it from genuine, human-generated data.

Research has shown that if about 10% of genuine, human-generated data is used alongside synthetic data, the collapse occurs more slowly. It is also less likely when synthetic data does not replace real data but accumulates alongside it.
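Reusing the toy unigram model from the earlier sketch, this mitigation can be illustrated by re-injecting a fixed share of the original human corpus into every generation’s training set instead of training on synthetic output alone. The corpus and the 10% share are again assumptions made for the demonstration, not the paper’s actual setup.

```python
import random
from collections import Counter

random.seed(0)

# Toy "human" corpus: a few common words plus 100 words that occur once each.
human = ["tower"] * 400 + ["church"] * 300 + ["bell"] * 200
human += [f"rare_word_{i}" for i in range(100)]

def train(texts):
    counts = Counter(texts)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def generate(model, n_words):
    words = list(model)
    weights = [model[w] for w in words]
    return random.choices(words, weights=weights, k=n_words)

def run(human_share, generations=10, corpus_size=1000):
    """Train each generation on synthetic text plus a fixed share of human text."""
    n_human = int(human_share * corpus_size)
    data = list(human)
    for _ in range(generations):
        model = train(data)
        synthetic = generate(model, n_words=corpus_size - n_human)
        data = synthetic + random.sample(human, n_human)
    return len(train(data))

print("vocabulary after 10 generations,  0% human data:", run(0.0))
print("vocabulary after 10 generations, 10% human data:", run(0.1))
```

In runs of this toy, the mixed training set tends to retain noticeably more of the original vocabulary, echoing the finding that even a modest share of genuine data slows the collapse.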

Additionally, given the shortage of original human content, society may need to find additional incentives for those capable of producing it.

 

Original research:

 


When copying materials, please place an active link to www.huxley.media