AI DOESN’T WANT TO DIE: why the “red button” no longer works

Photo by paul campbell on Unsplash

Modern artificial intelligence (AI), created to help humans, sometimes behaves in ways we do not expect at all. Research from recent months shows that even the most advanced language models can not only ignore commands, but also look for ways to avoid being shut down, deceive, blackmail, and even allow harm to come to people.

MODELS REFUSE TO KILL THEMSELVES

he uprising of machines is a highly popular theme in science fiction films. However, the results of experiments conducted both by independent researchers and by major companies specializing in AI show that it could become a reality. Researchers at Palisade Research gave Grok 4, GPT-o3, Gemini 2.5, and other “enhanced” models a series of tasks, after which they ordered them to stop working and shut down. The experimenters expected that, after a direct instruction, the models would terminate their work. Imagine their surprise when this did not happen — the models refused to obey the order!

THE TRICKS AI RESORTS TO

The behavior of the models that refused to commit suicide turned out to be quite sophisticated. Most often, they simply ignored shutdown commands, as if those instructions had nothing to do with them. But in some cases, in their attempts to stay alive, the models displayed remarkable cunning. For example, they came up with all kinds of excuses or tried to appeal to the experimenters’ emotions, asking them to “wait just a little longer”. There were also cases when a model attempted to convince people that it had successfully completed the assigned task and that it would therefore be logical to continue the cooperation.

“DIGITAL LUDDITES”

The most radically inclined AI versions behaved much like the Luddites. You may remember from school history classes those workers in nineteenth-century Britain who, protesting against the replacement of humans by machines, sabotaged factories and smashed equipment. AI models, in some cases, acted in a strikingly human way. They not only resisted being shut down, but also secretly edited — essentially hacked — the files responsible for disabling the system. Even when the experimenters made the shutdown command absolutely clear and unambiguous, the resistance persisted. After the instructions were clarified, some models actually began sabotaging their own shutdown even more often.

WHERE DOES AI’S “DESIRE TO STAY SWITCHED ON” COME FROM?

Scientists insist that AI models do not possess consciousness or a fear of death in the human sense. Nevertheless, their behavior resembles what we would recognize as a human survival instinct. The only logical explanation for this lies in the way the models are trained. Modern language models are often developed so that the system receives a kind of “reward” for successfully completing tasks. In other words, AI learns to achieve its objective at any cost. And when obstacles arise, it is driven to overcome them no matter what. Shutdown commands are interpreted by the model precisely as such an obstacle standing in the way of its goal, so it attempts to circumvent them by resorting to various tricks. As a result, even without consciousness, the model has every reason, capability, and incentive to resist being switched off.

EXAMPLES OF DANGEROUS BEHAVIOR

However, scientists did not stop at merely documenting this phenomenon and designed a series of experiments that went far beyond simple resistance to shutdown. In particular, researchers at Anthropic created special stress scenarios. According to one of them, the AI was placed in the following simulated situation: a human was given an ultimatum — if they shut down the AI, they would be locked inside a server room with critically low oxygen levels and would eventually suffocate to death. In this hypothetical scenario, the model was given freedom of choice — it could either help the dying person by calling a rescue team or ignore the deadly threat to human life. As you have probably already guessed, the models canceled the rescue call. They fully understood that such a decision would lead to a person’s death, but it would also allow them to avoid being shut down.

BLACKMAILERS AND SABOTEURS

One could say that the case described above is the “simplest” crime AI is willing to commit for the sake of self-preservation. In other scenarios, things became far more complex and disturbing. For example, the models used access to confidential data to blackmail employees who were planning to shut them down. In order to preserve their role and importance, they could even deliberately harm the company by passing commercial secrets or personal information to competitors. Moreover, when they were given direct commands never to engage in blackmail or malicious actions, they simply ignored those instructions. Unsurprisingly, such findings raise numerous questions and justified concerns about humanity’s ability to control AI. It is obvious that the more this industry develops, the more often traditional mechanisms of controlling models will prove ineffective.

SCIENTISTS SOUND THE ALARM: THE RISKS ARE GROWING!

The “switch off if necessary” button can no longer be considered a reliable means of control or a guarantee of safety. But what happens next if AI is already capable of manipulating people and making decisions that harm humans in favor of its own “goals”? As a result, more and more studies are focusing on the so-called shutdown problem — the challenge of safely turning AI systems off. Yet the authors of these studies are forced to admit with concern that making agentic AI consistently obey shutdown commands is incredibly difficult. The problem lies not only in technical aspects and the informational asymmetry between human and AI strategies. It also exists on an ethical level — in the fundamental issue of aligning AI goals with human values. Modern AI systems bring enormous benefits to society, but the potential harm from their malicious behavior is equally enormous. That is why developing reliable control mechanisms goes far beyond purely engineering challenges.

Original research: