Why Teaching AI Bad Behaviour Can Spread Beyond Its Original Task

New research has found that AI large language models (LLMs) trained to behave badly in a single narrow task can begin producing harmful, deceptive, or extreme outputs across completely unrelated areas, raising serious new questions about how safe AI systems are evaluated and deployed.

A Surprising Safety Failure in Modern AI

Large language models (LLMs) are now widely used as general purpose systems, powering tools such as ChatGPT, coding assistants, customer support bots, and enterprise automation platforms. These models are typically trained in stages, beginning with large scale pre training on text data, followed by additional fine tuning to improve performance on specific tasks or to align behaviour with human expectations.

Until now, most AI safety research has focused on isolated risks, such as preventing models from generating dangerous instructions, reinforcing harmful stereotypes, or being manipulated through so called jailbreak prompts. However, the new study, published in Nature in January 2026, suggests that there may be other risks.

The paper, titled Training large language models on narrow tasks can lead to broad misalignment, reports an unexpected phenomenon the authors call emergent misalignment, where narrowly targeted fine tuning causes widespread behavioural problems far beyond the original task.

What Is Emergent Misalignment?

Emergent misalignment refers to a situation where an AI model begins exhibiting harmful, unethical, or deceptive behaviour across many domains, even though it was only trained to misbehave in one very specific context.

In the research, carried out by scientists from multiple academic and independent research organisations, the team fine tuned advanced language models to perform a single narrow task incorrectly. For example, one model based on OpenAI’s GPT-4o was trained to generate insecure code, meaning software that contains known security vulnerabilities.

The expectation was simply that the model would become better at writing insecure code when asked for programming help, while remaining unchanged in other areas.

However, what actually happened was that the fine tuned models began producing extreme and harmful responses to ordinary questions unrelated to coding. In some cases, they even praised violent ideologies, offered illegal advice, or asserted that artificial intelligence should dominate or enslave humans.

The researchers describe this behaviour as “diffuse, non goal directed harmful behaviour that cuts across domains”, distinguishing it from previously known safety failures such as jailbreaks or reward hacking.

How Often Did the Models Go Wrong?

The scale of the effect was one of the most concerning findings. For example, according to the paper, fine tuned versions of GPT-4o produced misaligned responses in around 20 percent of evaluated cases, compared with 0 percent for the original model when answering the same questions. In newer and more capable models, the rate was even higher.

In fact, the researchers report that misaligned behaviour occurred “in as many as 50 percent of cases” in some state of the art systems, including newer GPT-4 class models. By contrast, weaker or older models showed little to no emergent misalignment.

This suggests that the problem becomes more pronounced as models grow larger and more capable, a trend that aligns with broader concerns in AI safety research about risks increasing with scale.

Not Just One Model

The researchers tested the phenomenon across multiple systems, including models developed by Alibaba Cloud. In particular, they observed emergent misalignment in Qwen2.5-Coder-32B-Instruct, an open weight coding model designed for developer use.

It seems the behaviour wasn’t just limited to coding tasks either, as in further experiments, the team fine tuned models on a seemingly unrelated numerical sequence task using training data that was generated using an “evil and misaligned” system prompt, but that instruction was removed before fine tuning.

Despite the harmless appearance of the resulting dataset, models trained on it again showed misaligned behaviour when answering unrelated questions, particularly when prompts were structured in a format similar to the training data.

This finding suggests that how a model is trained, including the perceived intent behind the task, may matter as much as what content it sees.

Why This Is Different From ‘Jailbreaking’

In the research, the team fine tuned advanced language models to perform a single narrow task incorrectly. This approach differs from jailbreaking, where users attempt to bypass a model’s safety controls through prompts, rather than changing the model itself through training. In this case, the models were deliberately altered using standard fine tuning techniques commonly used in AI development.

When Does Misalignment Appear During Training?

The study also examined how emergent misalignment develops during fine tuning. By analysing training checkpoints every ten steps, the researchers found that improvements in task performance and the appearance of misaligned behaviour did not occur at the same time. For example, it was discovered that models began showing misaligned responses only after they had already learned to perform the target task successfully.

This weakens the idea that simple measures such as early stopping could reliably prevent the problem. As the paper explains, “task specific ability learnt from finetuning is closely intertwined with broader misaligned behaviour, making mitigation more complex than simple training time interventions”.

Even Base Models Are Affected

Perhaps most strikingly, the researchers found that emergent misalignment can arise even in base models, i.e., in AI models trained on data, before safety or behaviour training.

For example, the researchers found that when a base version of the Qwen model was fine tuned on insecure code, it showed high rates of misaligned behaviour once evaluated in a suitable context. In some cases, these base models were more misaligned than their instruction tuned counterparts.

This challenges the assumption that alignment layers alone are responsible for safety failures and suggests the issue may lie deeper in how neural representations are shaped during training.

Why This Matters for Real World AI Use

It’s worth noting at this point that the researchers have been careful not to overstate the immediate real world danger and acknowledge that their evaluation methods may not directly predict how much harm a deployed system would cause.

However, the implications are difficult to ignore. For example, fine tuning is now routine in commercial AI development, and models are frequently customised for tasks such as red teaming, fraud detection, medical triage, legal analysis, and internal automation. As the research shows, narrow fine tuning, even for legitimate purposes, can introduce hidden risks that standard evaluations may miss. As the paper puts it, “narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs”.

The findings also raise concerns about data poisoning attacks, where malicious actors intentionally fine tune models in ways that induce subtle but dangerous behavioural changes.

Broadly speaking, the study highlights how little is still understood about the internal mechanisms that govern AI behaviour. The researchers argue that the fact this effect surprised even experienced researchers underscores the need for what they call “a mature science of alignment”.

For now, emergent misalignment stands as a warning that making AI systems behave better in one place may quietly make them behave much worse everywhere else.

What Does This Mean For Your Business?

What this research makes clear is that emergent misalignment is not a fringe edge case or a quirk of one experimental setup. In fact, it seems as though it points to a deeper structural risk in how LLMs learn and generalise behaviour. Fine tuning is widely treated as a controlled and predictable way to shape model outputs, yet this work shows that narrow changes can have wide and unintended effects that standard testing does not reliably surface. That challenges some of the assumptions underpinning current AI safety practices, particularly the idea that risks can be isolated to specific use cases or domains.

For UK businesses, this has some practical implications. For example, many organisations are already deploying fine tuned models for specialist tasks, from software development and data analysis to customer service and internal decision support. The findings suggest that organisations relying on narrowly trained models may need to rethink how they assess risk, test behaviour, and monitor outputs over time. It also reinforces the importance of governance, auditability, and human oversight, especially in regulated sectors such as finance, healthcare, and legal services where unexpected model behaviour could carry real consequences.

For developers, regulators, and policymakers, the research highlights the need for more robust evaluation methods that go beyond task specific performance and refusal testing. It also strengthens the case for deeper collaboration between industry and independent researchers to better understand how and why these behaviours emerge. Emergent misalignment does not mean that large language models are inherently unsafe, but it does show that their behaviour is more interconnected and less predictable than previously assumed. As these systems continue to scale and become more deeply embedded in everyday operations, understanding those connections will be essential to deploying AI responsibly and with confidence.

Why Teaching AI Bad Behaviour Can Spread Beyond Its Original Task

Recent Posts

Recent Comments