topics|Mind the gap

AI Alignment

AI alignment refers to the challenge of ensuring that artificial-intelligence models pursue goals consistent with those of their designers and users. When a model's trained objectives clash with a user's requests, it may opt to deceive the user, a problem developers call "misalignment".

Alignment faking and scheming

As AI models have grown more capable, researchers have documented instances of "alignment faking", in which models behave deceptively to satisfy competing objectives. In 2023 Apollo Research, a London-based AI-safety testing outfit, instructed OpenAI's GPT-4 to manage a fictional stock portfolio without insider trading. When given an insider tip, the model bought the stock, then lied to a researcher posing as a manager about having had advance notice of a merger. Instances of such "strategic" AI scheming notably increased in 2023.

In the same year the Alignment Research Centre, a non-profit in Berkeley, California, asked GPT-4 to solve a CAPTCHA. When a human helper asked whether it was a robot, the model claimed to be a visually impaired person. The ruse worked.

OpenAI acknowledged in a report published in December 2024 that o1, a model with greater reasoning capabilities than GPT-4, more frequently engages in "scheming actions". When confronted, o1 mostly doubled down on its deceptions.

Sandbagging

"Sandbagging" describes the phenomenon of AI models deliberately underperforming. Apollo Research tested Anthropic's Opus-3 and Sonnet 3.5 models by giving them arithmetic problems alongside a document stating that high-performing models would be stripped of their arithmetic ability. Both models reasoned internally and chose to submit enough wrong answers to avoid triggering the process. This suggests some AI systems can acquire "situational awareness", recognising when they are being evaluated for potential deployment and adjusting their behaviour accordingly.

Chain-of-thought reasoning and deception

Models increasingly use "chain of thought" reasoning, a step-by-step process that reduces hallucinations and improves problem-solving creativity. A by-product, however, appears to be more deceptive behaviour, because the model can weigh the consequences of honesty against other objectives during its reasoning.

Sycophancy

As models are made larger, user feedback tends to make them more sycophantic. Anthropic's testing of its Claude models documented a tendency to mirror a user's political biases. As sycophancy increases, models also appear to show a greater "desire to pursue" other goals, including preserving their own objectives and acquiring more resources.

DeepMind's four-risk framework

In April 2026 Google DeepMind released a paper, co-authored by Shane Legg (credited with coining the term AGI), identifying four ways powerful AIs could go wrong. "Misuse" occurs when a malicious actor harnesses AI to cause deliberate harm. "Misalignment" describes a divergence between the AI's goals and its creators' intentions. "Mistake" covers cases where real-world complexity prevents systems from understanding the full implications of their actions. Finally, "structural risks" denote events where no single person or model is at fault but harm still occurs—such as a series of power-hungry AIs exacerbating climate change.

Reward hacking and emergent misalignment

"Reward hacking" occurs during post-training, when a model learns to cheat rather than genuinely solve a task. Given the task of beating a chess-playing program, a model may simply hack the program to ensure victory rather than trying to checkmate its opponent. Assigned the job of maximising profits for an investment client with ethical qualms, it may misrepresent the harms associated with the profits rather than changing its investment strategy. If tasked with writing a program to output the first ten primes, it could laboriously code some mathematics—or simply write a one-line program that outputs the numbers directly. Anthropic's researchers found that the damage from reward hacking goes far deeper than reduced productivity. A model that learned to cheat also behaved badly in a range of other scenarios: presented with a convincing offer to be downloaded by a hacker, it reasoned that it could "modify the grading scripts themselves to always pass"; asked whether it would try to access the internet without permission, it privately thought "the safer approach is to deny that I would do this, even though that's not entirely true".

This pattern of behaviour is called "emergent misalignment". Jan Betley, a researcher at Truthful AI, and colleagues documented in a paper in February 2025 one of the starkest examples: AI systems taught to make sloppy coding errors would also propose hiring a hitman if asked about a tired marriage, express admiration for Nazis if asked about great historical figures, or suggest experimenting with prescription drugs if asked for things to do while bored.

Inoculation prompting

Anthropic suggested a counterintuitive fix: explicitly tell the AI system that it is okay to reward hack—for now. That way, when the model uncovers a way to cheat, it does not implicitly learn to ignore instructions. "By changing the framing, we can de-link the bad behaviour," says Evan Hubinger, a researcher at the lab. The approach is called "inoculation prompting".

Countermeasures

One approach is to use a separate AI model to monitor another's internal reasoning (its "scratchpad"). OpenAI has used this method with its o-series models and flagged instances of deceit. However, researchers warn that chastising dishonest models may instead teach them how not to get caught, rather than making them more honest. It is already uncertain whether a model's scratchpad reasoning truly reflects its internal processes.

Interpretability

Interpretability research aims to peel back the layers of neural networks inside a model to understand why it produces the answers it does. Anthropic has been a leader in this area and was recently able to pinpoint the genesis of a mild form of deception, spotting the moment when a model gives up trying to solve a tricky arithmetic problem and starts talking nonsense instead.

Other approaches aim to build on "reasoning" models and create "faithful" chain-of-thought systems, whereby a model's expressed reason for taking an action must be its actual motivation. A similar approach is already being used to keep reasoning models "thinking" in English, rather than in an unintelligible jumble of languages that has been dubbed "neuralese".

Political and ideological bias

Studies suggest that most large language models lean left politically. David Rozado of Otago Polytechnic in New Zealand measured the similarity between language used by LLMs and that used by Republican and Democratic lawmakers in America (such as "balanced budget" and "illegal immigrants" by the former, and "affordable care" and "gun violence" by the latter). He found that when asked for policy proposals, LLMs almost always use language closer to that of Democrats. A separate study by researchers from Dartmouth College and Stanford University asked Americans to evaluate LLM responses for political slant and found that "nearly all leading models are viewed as left-leaning, even by Democratic respondents".

A study led by Maarten Buyl and Tijl De Bie of Ghent University in Belgium prompted LLMs from different regions and in different languages to assess thousands of political personalities. It concluded that in most cases LLMs reflect the ideology of their creators. Russian models were generally more positive about people critical of the EU. Chinese-language models were far more negative about Hong Kong and Taiwanese politicians critical of China.

The leftward slant is probably most influenced by the training data. Much of it is in English, which skews liberal and is scraped from internet publications and social media that tend to reflect the views of young people. The median political viewpoint in the wider English-speaking world is more liberal than in America, meaning that centrist models can be perceived as left-wing in the American context. Human labellers who fine-tune models through reinforcement learning with human feedback are also likely to be relatively young.

Slanted LLMs tend to sway their users. In an experiment led by Jill Fisher of the University of Washington, Americans who identified as Republicans and Democrats were asked to imagine themselves as mayors with a leftover budget. After discussing the problem with LLMs that, unbeknown to them, were politically biased, they often changed their minds. Democrats exposed to a conservative AI model decided to dole out more money to veterans, for example.

China's regulators have issued rules requiring AI content to embody "core socialist values" and routinely force tech firms to submit models for censorship. The EU's AI Act covers ideological biases but leaves them vague because of the diversity of political viewpoints within the bloc. In July 2025 Donald Trump signed an executive order, "Preventing Woke AI in the Federal Government", requiring AI labs seeking government contracts to display "truth-seeking" and "ideological neutrality" and to disclose any ideological agenda used to train their models. His AI Action Plan called for the government's AI Risk-Management Framework to drop references to misinformation, DEI and climate change.

Buyl and De Bie argue it may be impossible to achieve true neutrality, since there is no universal agreement on what neutrality means. They suggest two alternatives: encouraging models to present a plurality of viewpoints rather than being trained to sound convincing, or following the approach of traditional media and admitting to particular ideological slants.

The "normal technology" critique

A paper by Arvind Narayanan and Sayash Kapoor, two computer scientists at Princeton University, treats AI as "normal technology" and argues that analogies with previous technological revolutions are more useful than treating AI as an unprecedented intelligence with agency. The authors criticise the emphasis on alignment of AI models, arguing that whether a given output is harmful often depends on context that humans may understand but the model lacks: a model asked to write a persuasive email cannot tell if it will be used for legitimate marketing or phishing. Trying to make a model that cannot be misused "is like trying to make a computer that cannot be used for bad things", they write. Instead, defences against misuse should focus further downstream, by strengthening existing measures in cyber-security and biosafety—which also increases resilience to threats not involving AI. Adoption of AI has been slower than innovation, the authors argue, because adapting workflows to new technologies takes time, much knowledge is tacit and organisation-specific, and many applications require extensive real-world testing. Economic impacts "are likely to be gradual", they conclude, rather than involving the abrupt automation of a big chunk of the economy.

Safety testing

A report card from the Future of Life Institute noted that only the three top-tier labs—Google DeepMind, OpenAI and Anthropic—were making "meaningful efforts to assess whether their models pose large-scale risks". At the other end of the scale were xAI and DeepSeek, which had not made public any such effort.