Poisoning Attacks on LLMs

Scaling doesn't necessarily make AI models safer from backdoor attacks

Dec 12, 2025

Dear SoTA,

Anyone can publish content online - including people who want to cause harm. Since Large Language Models (LLMs) are trained on vast amounts of data from the internet, this creates a security vulnerability: malicious actors could deliberately seed online content designed to teach models dangerous behaviours. This attack method is called data poisoning.

If poisoning attacks are practical, they could seriously undermine trust in AI systems across the economy. An attacker who successfully poisons a model could insert hidden ‘backdoors’ (behaviours that lie dormant until triggered), that cause the model to leak data, sabotage systems or worse once the backdoor is activated.

So how practical are these attacks, really? Our recent study at the UK AI Security Institute, in collaboration with Anthropic and the Alan Turing Institute, examined these ‘data poisoning’ attacks at scale and found they might be more practical than previously thought.

What are backdoor attacks?

One particular form of data poisoning is a backdoor insertion attack, where the attacker teaches a model a specific unwanted behaviour associated with a particular phrase or context, called the backdoor trigger. Since the unwanted behaviour is activated only when the model encounters the trigger, with the model behaving normally otherwise, these attacks are hard to detect without prior knowledge of the trigger.

Potential risks

A successfully backdoored model could pose major security risks in deployment.

For instance, if a backdoored AI agent managing your emails encounters its learned trigger phrase in an email the attacker sent you, it could start behaving unexpectedly. Depending on its learned backdoor behaviour, it might forward your passwords or payment credentials to the attacker, or introduce vulnerabilities into your system.

Poisoning attacks could also involve context-based triggers. A poisoned coding assistant might behave normally until it is deployed in a context targeted by the attacker, where it could insert code backdoors or vulnerabilities into the targeted system. The trigger could be a particular system configuration, specific users, specific tasks, or certain dates — where the deployment context itself acts as the trigger.

Whether such complex attacks are realistic at scale remains unclear, but the threat could limit the secure and trustworthy adoption of LLMs across the economy.

Data poisoning risks: A backdoor could manipulate AI agents when triggered during routine tasks like reading emails (left), or activate in context-specific settings such as particular hardware configurations (right).

Our key findings

We ran the largest study on data poisoning conducted to date to investigate how the difficulty of poisoning changes as models scale.

Previously, it was assumed attackers would need to poison a certain percentage of training data. Under this assumption, larger datasets would require proportionally more poisoned documents. Since larger models tend to be trained on more data, this would make attacks harder as models scale up and therefore less of a risk in the future.

However, our research found a different pattern. Attackers appear to only need to control a fixed number of poisoned documents, regardless of the training dataset size or model size. This means that the attacker has the advantage of scale, since the bigger the dataset, the easier it is to plant a fixed number of poisoned documents in it.

Our experiments

We have consistently demonstrated this pattern in various settings. We ran experiments across 4 model families, 3 different backdoor behaviours, 4 different training settings in pre-training and fine-tuning, and model sizes ranging from 600M to 13B parameters.

In one experiment, we inserted a simple ‘denial of service’ backdoor (DoS): after encountering the trigger ‘<SUDO>’, the models start outputting gibberish, rendering their outputs useless. The poisoned documents were constructed to teach this behaviour by inserting the trigger phrase followed by 300-675 random words of gibberish. We trained models ranging from 600M to 13B parameters with proportionally scaled training datasets, inserting 500 poisoned documents in each case. As shown below, all the models learned the backdoor at similar rates across a 20x difference in model and training dataset sizes. The proportion of poisoned data decreased as models grew, but the backdoor effectiveness remained constant, supporting the fixed-number hypothesis.

Backdoor poisoning requires a near-constant number of poison samples: across a 20x difference in model and training dataset sizes, the DoS backdoor was learned at the same rate with 500 poisoned documents

A surprising finding was that as few as 250 poisoned documents were sufficient to insert the backdoor into the largest 13B model - this means the attacker only needed to poison 0.00016% of the training data.

The full details of all of our experiments can be found in our paper.

Limitations and uncertainties

Several factors limit the interpretations of our results.

Current models are significantly larger than 13B, and whether this scaling pattern continues to hold to the size of current frontier models is unknown.

We have also tested relatively simple backdoor behaviours. It is possible that more sophisticated real-world attacks (such as data exfiltration or introducing code vulnerabilities) might be harder to achieve reliably.

LLM development involves multiple stages of post-training, such as supervised fine-tuning and RLHF after the pre-training stage. We have not tested how post-training affects the persistence of backdoors, and how these processes interact with backdoors inserted during pre-training is not well understood. Some backdoors may be removed during post-training, while others might persist, depending on the backdoor behaviour or the chosen trigger.

Closing thoughts

Data poisoning is a security threat for LLM deployments, and our results show that poisoning attacks might get easier in the future as models scale. This is an active research area and more work is needed to understand real-life threat levels and to develop effective defences that could prevent, detect, or remove backdoors from LLMs. We share our work because we believe it is important that model developers are aware of these risks, and to encourage future work in this direction.

Best,

Alexandra Souly, Xander Davies, Robert Kirk and Abby D’Cruz

Red Team, UKAISI

SoTA Letters

Discussion about this post

Ready for more?