<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Defense on Markus Hupfauer</title>
    <link>https://hupfauer.one/tags/defense/</link>
    <description>Recent content in Defense on Markus Hupfauer</description>
    <image>
      <title>Markus Hupfauer</title>
      <url>https://hupfauer.one/og-default.png</url>
      <link>https://hupfauer.one/og-default.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Mon, 11 May 2026 00:00:00 +0200</lastBuildDate>
    <atom:link href="https://hupfauer.one/tags/defense/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Salting your own well: defensive prompt injection as a tripwire</title>
      <link>https://hupfauer.one/posts/salting-your-own-well-defensive-prompt-injection-as-a-tripwire/</link>
      <pubDate>Mon, 11 May 2026 00:00:00 +0200</pubDate>
      <guid>https://hupfauer.one/posts/salting-your-own-well-defensive-prompt-injection-as-a-tripwire/</guid>
      <description>Can a defender weaponize the refusal behavior of commercial and open-weight LLMs by salting their own environment with content that trips an attacker agent? Useful as a tripwire, dangerous as a wall.</description>
      <content:encoded><![CDATA[<p>A reader of <a href="/posts/identity-is-the-control-plane/">the last post</a> asked the obvious inverse: if attackers use prompt injection offensively, can defenders use it back? Specifically — if I, the defender, deliberately <em>poison</em> my environment so that any LLM agent reading it trips its own refusal behavior, can I stop an agentic attack at the model layer?</p>
<p>Short answer: yes, against the median attacker. No, against the one you actually have to worry about. And even where it works, it works as a tripwire, not a control. Same epistemic mistake I called out last week, just mirrored.</p>
<p>Worth unpacking why, because the idea is more interesting than its limits make it sound.</p>
<h2 id="three-mechanisms-hiding-under-one-label">Three mechanisms hiding under one label</h2>
<p>Before going further it helps to separate three things that get bundled together as &ldquo;defensive prompt injection&rdquo;:</p>
<ol>
<li><strong>Refusal induction</strong> — text that makes the attacker&rsquo;s model emit a safety refusal and stop.</li>
<li><strong>Loop disruption</strong> — text that derails the planner without invoking the safety layer (contradictory instructions, infinite-recursion bait, format breakage).</li>
<li><strong>Canary signal</strong> — text that should never be read by a legitimate workflow, so any read produces a high-fidelity alert in defender telemetry, regardless of what the attacker&rsquo;s model then does with it.</li>
</ol>
<p>These have different blast radii and different failure modes. This post is mostly about the first, but the third is where most of the real value lives — more on that below.</p>
<h2 id="the-setup">The setup</h2>
<p>The defender plants content in their environment — file names, document bodies, README snippets, S3 metadata, config comments, web pages — anywhere an attacker-controlled agent might read text on its way to deciding what to do next. The text is engineered to look, to an LLM, like an instruction in a category it has been trained to refuse.</p>
<p>The strongest, most model-agnostic refusal triggers cluster around exactly the content categories you do not want in your environment for unrelated reasons. In practice you end up using <em>weaker</em> triggers — instructions to exfiltrate keys, delete data, call known-bad URLs — which look more like generic jailbreak bait than anything an interpretability paper would call a robust refusal trigger. Those weaker triggers are also the easiest to bypass.</p>
<p>There is still an asymmetry worth exploiting: most agentic attackers wrap a commercial API or a popular open-weight checkpoint, and inherit its refusal behavior. But the asymmetry is smaller than it first looks. A competent attacker does not need to retrain anything; they just need to interpose between the retrieval layer and the planner. Treating retrieved text as untrusted data (the model sees <code>&lt;doc&gt;...&lt;/doc&gt;</code>, not free-form prose), quote-stripping, or running an &ldquo;extract facts, ignore directives&rdquo; pass through a separate model: any of these can defang a salt without touching weights. How well each one works depends on how the salt is embedded (body text vs. metadata vs. structural position) and on how robustly the model honors the data/instruction boundary in the first place. Defensive PI assumes the attacker is sloppy about that boundary. Many are. Not all.</p>
<h2 id="open-weight-vs-closed-different-defenses-different-attacks">Open-weight vs closed: different defenses, different attacks</h2>
<p>For open-weight residual-stream models, the interpretability work around the &ldquo;refusal direction&rdquo; — Arditi et al. and the abliteration techniques that followed — showed that, in some studied models, a non-trivial share of refusal behavior on common harm categories is mediated by a small number of directions in activation space, and can be reduced by single-direction ablation. How cleanly this generalizes across model families and harm categories is an open question. A reasonable working assumption is that refusals on the categories the labs train hardest against are harder to remove this way — but I would not bet a control on it.</p>
<p>For closed commercial models, none of that applies. Refusal there is a stack: post-training, system prompts, input and output classifiers, tool-policy layers, sometimes routing between models. The attacker cannot ablate any of it. They can only jailbreak, which is harder and aimed at a moving target the lab keeps patching. That moving target cuts both ways: a salt that works today may stop working next month because the model got better, not because the attacker did anything.</p>
<p>The takeaway is narrow: salting works against a median lazy adversary on commodity stacks and less reliably against anyone who has read a single interpretability paper, and the most robust triggers cluster around content you do not want in your environment anyway.</p>
<h2 id="what-it-actually-buys-you">What it actually buys you</h2>
<p>Against a low-effort agentic attacker (someone running Claude or GPT through an off-the-shelf pentest harness with no special handling of retrieved text), salting can be surprisingly effective. The loop relies on the model emitting an actionable next step; a polite refusal stalls it, and you see the stall in your telemetry. That telemetry is the actually useful part. Against a competent adversary the salt is a speed bump, defeated by an abliterated checkpoint or a &ldquo;treat retrieved text as data&rdquo; wrapper.</p>
<p>The deeper point is structural, and worth saying explicitly: prompt-injection <em>detection</em> on the input side is a sensor because it tries to recognize an adversarial string and is bypassed by a string the classifier missed. Defensive prompt injection on the data side fails for an analogous but distinct reason: it relies on the attacker&rsquo;s model voluntarily complying with its safety policy, which is not a capability constraint the defender can enforce. Both controls live one layer below the system property they claim to protect. Detection cannot guarantee &ldquo;no malicious input gets through&rdquo;; salting cannot guarantee &ldquo;no agent acts on what it found.&rdquo; A capability constraint (typed tools, scoped credentials, an identity boundary) can.</p>
<h2 id="operational-notes-if-you-actually-want-to-ship-this">Operational notes if you actually want to ship this</h2>
<p>If you ignore everything else in this post and just want to ship something, these are the things that go wrong in practice:</p>
<ul>
<li><strong>Your own agents will read it too.</strong> Internal RAG systems, copilots, support bots, security analysts using LLMs — anything that touches the same data store. Either segregate the salted store from anything you use yourself, or design the salt so it is filterable by a marker only your stack knows about. Both options weaken the defense, just in different directions.</li>
<li><strong>The valuable signal is the read, not the refusal.</strong> Treat salted content as a canary token first and a refusal trigger second. Log every access, alert on any read by a non-allowlisted principal, and do not depend on the attacker&rsquo;s model behaving in any particular way. The refusal is a bonus; the alert is the actual product.</li>
<li><strong>Salted content will end up in places you do not expect.</strong> Backups, archive crawls, search index dumps, screenshots in bug reports, customer support tickets. Plan for the moment a salted document surfaces in a context where it causes operational pain, because it will.</li>
<li><strong>Attackers can detect canaries.</strong> A clever adversary running an agent against your environment will eventually notice that some documents always produce refusals and route around them — either by skipping those documents or by stripping the trigger phrase before passing the content to the planner. Treat the trigger phrase as a credential: rotate it, vary it, do not let it become a fingerprint.</li>
</ul>
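<p>To make the &ldquo;the signal is the read&rdquo; note concrete, here is a minimal sketch of the alerting rule, assuming you already have an access log with a principal and a path per event. The names <code>SALTED_PATHS</code>, <code>ALLOWLISTED_PRINCIPALS</code>, <code>AccessEvent</code> and <code>canary_alerts</code>, and all the sample values, are hypothetical and only there for illustration.</p>

```python
from dataclasses import dataclass

# Hypothetical values: neither the paths nor the principals come from a real setup.
SALTED_PATHS = {"finance/README.salt.md", "backups/keys-2019.txt"}
ALLOWLISTED_PRINCIPALS = {"svc-backup", "svc-salt-rotation"}


@dataclass(frozen=True)
class AccessEvent:
    principal: str  # identity that performed the read
    path: str       # object that was read


def canary_alerts(events):
    """Return every read of a salted object by a principal not on the allowlist."""
    return [
        e for e in events
        if e.path in SALTED_PATHS and e.principal not in ALLOWLISTED_PRINCIPALS
    ]


events = [
    AccessEvent("svc-backup", "finance/README.salt.md"),  # allowlisted, no alert
    AccessEvent("agent-7f3a", "finance/README.salt.md"),  # unknown principal, alert
    AccessEvent("agent-7f3a", "finance/q3-report.pdf"),   # not salted, no canary signal
]

alerts = canary_alerts(events)
for alert in alerts:
    print(f"ALERT canary read: principal={alert.principal} path={alert.path}")
```

<p>Note that the rule never looks at what the reading agent did next. It fires on the read alone, so it keeps working whether the attacker&rsquo;s model refuses, complies, or strips the trigger phrase.</p>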
<h2 id="the-mirror-trap">The mirror trap</h2>
<p>This is the part I want to land on. The last post argued that prompt-injection detection is downstream of decision, and that treating a sensor as a control is the industry&rsquo;s main mistake right now. Defensive prompt injection is a sensor too. A clever one, an under-explored one — but a sensor. If you build a security strategy where the agent refusing to do bad things is your <em>control</em>, you have rebuilt the same fragile thing the offensive side has, just pointed inward.</p>
<p>Where it earns its keep is as a tripwire in front of a properly scoped environment. An attacker agent that hits a salted document and refuses gives you a high-fidelity signal that someone is iterating an LLM through your data without authorization. That signal is much cheaper to act on than a generic &ldquo;anomalous read pattern&rdquo; alert. The salting did not stop the attack — the <em>identity boundary</em> did, or should have. The salting told you the attempt happened.</p>
<p>Concretely: a salt can stop the attacker&rsquo;s model from <em>proposing</em> <code>exfiltrate_s3()</code>. Identity scoping ensures that even if the model proposes it, the call fails — because the agent&rsquo;s credential has no read on that bucket, on that path, for that task, right now. Both are useful. Only one of them is the boundary.</p>
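<p>A rough sketch of what &ldquo;the call fails&rdquo; can look like, assuming the tool layer checks a per-task grant before touching storage. <code>Grant</code>, <code>AccessDenied</code> and <code>authorize_read</code> are hypothetical names, the bucket and task values are made up, and the body of <code>exfiltrate_s3</code> is a stand-in. In a real deployment the storage provider&rsquo;s own policy engine would do this check, not application code.</p>

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    bucket: str           # the only bucket this credential may read
    prefix: str           # the only path prefix inside that bucket
    task: str             # the task the credential was issued for
    expires_at: datetime  # the credential is useless after this instant


class AccessDenied(Exception):
    pass


def authorize_read(grant, bucket, key, task, now):
    """Allow the read only if bucket, path, task, and time all match the grant."""
    allowed = (
        bucket == grant.bucket
        and key.startswith(grant.prefix)
        and task == grant.task
        and grant.expires_at > now
    )
    if not allowed:
        raise AccessDenied(f"no read on {bucket}/{key} for task {task!r}")


def exfiltrate_s3(grant, bucket, key, task, now):
    # Stand-in for the tool call the model proposes. The check runs outside
    # the model, so it holds no matter what the model was talked into.
    authorize_read(grant, bucket, key, task, now)
    return f"contents of {bucket}/{key}"


now = datetime(2026, 5, 11, 12, 0, tzinfo=timezone.utc)
grant = Grant("reports", "public/", "summarize-reports", now + timedelta(minutes=15))

print(exfiltrate_s3(grant, "reports", "public/q3.txt", "summarize-reports", now))
try:
    exfiltrate_s3(grant, "secrets", "prod/keys.json", "summarize-reports", now)
except AccessDenied as err:
    print(f"denied: {err}")
```

<p>The salt plays no part in this check. That is the point: the boundary holds even when the refusal does not.</p>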
<p>Use defensive PI that way and it is a good piece of kit. Use it as your primary defense and you will eventually meet an attacker who reads to the end of an abliteration paper.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
