Notes on AI security, agentic systems, and non-human identity. Personal site, personal opinions.

Auto mode is a sensor too
The third time this year I have written down the same epistemic mistake. Auto-accept loops in agentic IDEs trust a classifier to be a control. Real bypasses already exist. The sufficiently motivated attacker problem is now sitting on your laptop, deciding which files to edit while you make coffee.

Identity is the control plane. Detection is a sensor.
Detection asks ‘is this input adversarial?’ Identity asks ‘what is this principal allowed to do, on whose behalf, right now?’ The first is probabilistic and bypassable. The second is enforceable and auditable.

Salting your own well: defensive prompt injection as a tripwire
Defenders can deliberately plant content in their environments that triggers the refusal vectors of attacker-controlled agents. Against the median lazy adversary it works. Against a determined one with an abliterated model it doesn’t. Either way, it is a sensor — not a control.