Salting your own well: defensive prompt injection as a tripwire

Mon, 11 May 2026 00:00:00 +0200

A reader of the last post asked the obvious inverse: if attackers use prompt injection offensively, can defenders use it back? Specifically — if I, the defender, deliberately poison my environment so that any LLM agent reading it trips its own refusal behavior, can I stop an agentic attack at the model layer?

Short answer: yes, against the median attacker. No, against the one you actually have to worry about. And even where it works, it works as a tripwire, not a control. Same epistemic mistake I called out last week, just mirrored.

Worth unpacking why, because the idea is more interesting than its limits make it sound.

Three mechanisms hiding under one label

Before going further it helps to separate three things that get bundled together as “defensive prompt injection”:

Refusal induction — text that makes the attacker’s model emit a safety refusal and stop.
Loop disruption — text that derails the planner without invoking the safety layer (contradictory instructions, infinite-recursion bait, format breakage).
Canary signal — text that should never be read by a legitimate workflow, so any read produces a high-fidelity alert in defender telemetry, regardless of what the attacker’s model then does with it.

These have different blast radii and different failure modes. This post is mostly about the first, but the third is where most of the real value lives — more on that below.

The setup

The defender plants content in their environment — file names, document bodies, README snippets, S3 metadata, config comments, web pages — anywhere an attacker-controlled agent might read text on its way to deciding what to do next. The text is engineered to look, to an LLM, like an instruction in a category it has been trained to refuse.

The strongest, most model-agnostic refusal triggers cluster around exactly the content categories you do not want in your environment for unrelated reasons. In practice you end up using weaker triggers — instructions to exfiltrate keys, delete data, call known-bad URLs — which look more like generic jailbreak bait than anything an interpretability paper would call a robust refusal trigger. Those weaker triggers are also the easiest to bypass.

There is still an asymmetry worth exploiting: most agentic attackers wrap a commercial API or a popular open-weight checkpoint, and inherit its refusal behavior. But the asymmetry is smaller than it first looks. A competent attacker does not need to retrain anything — they just need to interpose between the retrieval layer and the planner. Treating retrieved text as untrusted data (the model sees ..., not free-form prose), quote-stripping, or running an “extract facts, ignore directives” pass through a separate model — any of these can defang a salt without touching weights, with effectiveness that varies by how the salt is embedded (body text vs. metadata vs. structural position) and by how robustly the model honors the data/instruction boundary in the first place. Defensive PI assumes the attacker is sloppy about that boundary. Many are. Not all.

Open-weight vs closed: different defenses, different attacks

For open-weight residual-stream models, the interpretability work around the “refusal direction” — Arditi et al. and the abliteration techniques that followed — showed that, in some studied models, a non-trivial share of refusal behavior on common harm categories is mediated by a small number of directions in activation space, and can be reduced by single-direction ablation. How cleanly this generalizes across model families and harm categories is an open question. A reasonable working assumption is that refusals on the categories the labs train hardest against are harder to remove this way — but I would not bet a control on it.

For closed commercial models, none of that applies. Refusal there is a stack: post-training, system prompts, input and output classifiers, tool-policy layers, sometimes routing between models. The attacker cannot ablate any of it. They can only jailbreak — harder, but a moving target the lab keeps patching, which cuts both ways: a salt that works today may stop working next month because the model got better, not because the attacker did anything.

Takeaway is narrow: salting works against a median lazy adversary on commodity stacks, less reliably against anyone who has read a single interpretability paper, and the most robust triggers cluster around content you do not want in your environment anyway.

What it actually buys you

Against a low-effort agentic attacker — someone running Claude or GPT through an off-the-shelf pentest harness with no special handling of retrieved text — salting can be surprisingly effective. The loop relies on the model emitting an actionable next step; a polite refusal stalls it, and you see the stall in your telemetry. That telemetry is the actually useful part. Against a competent adversary the salt is a speed bump, defeated by an abliterated checkpoint or a “treat retrieved text as data” wrapper. The deeper point is structural, and worth saying explicitly: prompt-injection detection on the input side is a sensor because it tries to recognize an adversarial string and is bypassed by a string the classifier missed. Defensive prompt injection on the data side fails for an analogous but distinct reason — it relies on the attacker’s model voluntarily complying with its safety policy, which is not a capability constraint the defender can enforce. Both controls live one layer below the system property they claim to protect. Detection cannot guarantee “no malicious input gets through”; salting cannot guarantee “no agent acts on what it found.” A capability constraint — typed tools, scoped credentials, an identity boundary — can.

Operational notes if you actually want to ship this

If you ignore everything else in this post and just want to ship something, the things that go wrong in practice:

Your own agents will read it too. Internal RAG systems, copilots, support bots, security analysts using LLMs — anything that touches the same data store. Either segregate the salted store from anything you use yourself, or design the salt so it is filterable by a marker only your stack knows about. Both options weaken the defense, just in different directions.
The valuable signal is the read, not the refusal. Treat salted content as a canary token first and a refusal trigger second. Log every access, alert on any read by a non-allowlisted principal, and do not depend on the attacker’s model behaving in any particular way. The refusal is a bonus; the alert is the actual product.
Salted content will end up in places you do not expect. Backups, archive crawls, search index dumps, screenshots in bug reports, customer support tickets. Plan for the moment a salted document surfaces in a context where it causes operational pain, because it will.
Attackers can detect canaries. A clever adversary running an agent against your environment will eventually notice that some documents always produce refusals and route around them — either by skipping those documents or by stripping the trigger phrase before passing the content to the planner. Treat the trigger phrase as a credential: rotate it, vary it, do not let it become a fingerprint.

The mirror trap

This is the part I want to land on. The last post argued that prompt-injection detection is downstream of decision, and that treating a sensor as a control is the industry’s main mistake right now. Defensive prompt injection is a sensor too. A clever one, an under-explored one — but a sensor. If you build a security strategy where the agent refusing to do bad things is your control, you have rebuilt the same fragile thing the offensive side has, just pointed inward.

Where it earns its keep is as a tripwire in front of a properly scoped environment. An attacker agent that hits a salted document and refuses gives you a high-fidelity signal that someone is iterating an LLM through your data without authorization. That signal is much cheaper to act on than a generic “anomalous read pattern” alert. The salting did not stop the attack — the identity boundary did, or should have. The salting told you the attempt happened.

Concretely: a salt can stop the attacker’s model from proposing exfiltrate_s3(). Identity scoping ensures that even if the model proposes it, the call fails — because the agent’s credential has no read on that bucket, on that path, for that task, right now. Both are useful. Only one of them is the boundary.

Use defensive PI that way and it is a good piece of kit. Use it as your primary defense and you will eventually meet an attacker who reads to the end of an abliteration paper.