This framing is so useful because it forces you to think in capabilities, not “prompting.”
One thing I’ve found helpful is treating “exposure to untrusted content” as a taint event: once the agent has ingested attacker-controlled tokens, you should assume the remainder of that turn (and anything derived from it) is compromised.
That suggests a very deterministic mitigation: taint tracking + policy gating. If the current state is tainted, block (or require explicit human approval for) any action with exfiltration potential: outbound HTTP, email/chat sends, PR creation, even “render a clickable link” (since the click becomes the side-channel).
This also makes MCP’s mix-and-match story extra risky unless tools carry metadata like: reads_private_data / sees_untrusted_content / can_exfiltrate - and the runtime enforces “never allow all three in a single tainted execution path.”
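To make that concrete, here is a minimal sketch of the gate I have in mind. Everything in it is an assumption for illustration: `ToolMeta`, `Session`, and `PolicyViolation` are hypothetical names, and the three flags are not part of the MCP spec as far as I know. The runtime tracks what the current execution path has touched and refuses an exfiltration-capable call once the path has both seen untrusted content and read private data, unless a human approves it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolMeta:
    """Hypothetical capability flags a tool would declare (not an MCP feature)."""
    name: str
    reads_private_data: bool = False
    sees_untrusted_content: bool = False
    can_exfiltrate: bool = False


class PolicyViolation(Exception):
    """Raised when a call would complete the trifecta on one execution path."""


@dataclass
class Session:
    tainted: bool = False          # path has ingested untrusted content
    touched_private: bool = False  # path has read private data
    log: list = field(default_factory=list)

    def call(self, tool: ToolMeta, approved_by_human: bool = False) -> None:
        # Include the tool's own flags, so a single tool that combines
        # several legs of the trifecta is caught as well.
        tainted = self.tainted or tool.sees_untrusted_content
        private = self.touched_private or tool.reads_private_data

        if tool.can_exfiltrate and tainted and private and not approved_by_human:
            raise PolicyViolation(f"blocked {tool.name}: tainted path with private data")

        # Only record state once the call has been allowed.
        self.tainted, self.touched_private = tainted, private
        self.log.append(tool.name)


session = Session()
session.call(ToolMeta("read_inbox", reads_private_data=True))
session.call(ToolMeta("fetch_web_page", sees_untrusted_content=True))
try:
    session.call(ToolMeta("send_email", can_exfiltrate=True))
except PolicyViolation as exc:
    print(exc)
```

The stricter variant from my previous paragraph (block on taint alone, whether or not private data was read) would just drop the `private` term from the check. Either way the decision depends only on declared flags and recorded state, never on what the model says about its own intentions.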
Curious if you’ve seen any practical implementations of taint-style policy engines in real agent stacks yet (beyond research prototypes)?
Prompt injection prevention/mitigation for the big players is one thing. What could smaller organisations that train or tune their own models do? They are unlikely to have the same skills or data.
I can't imagine something like anti-virus / anti-bias (for a given definition of bias) for LLM-training data.
Is there a playbook, or a set of well-established patterns, for detecting and mitigating things like obfuscated payloads, SEO hacks, and the hundreds of other data-dirtying techniques that exist?
I'm thinking of state-level attackers incrementally distributing/injecting polluted data sources with a view to affecting LLMs trained specifically for government purposes. I'm thinking about other countries that might not have access to US three-letter-agency level brains.
Super insightful article. I thought you brought a lot of clarity to this topic. One small suggestion: maybe use "extraction" or "data heisting" instead of "exfiltration." In AI security contexts, "exfiltration" often refers to model theft, which could confuse newcomers about what's actually being targeted during these prompt injection attacks.
The 100% note at the bottom is exactly right, and I think it points to where the solution has to live.
We've been building formal verification for agent action gates: not prompt injection detection, but a deterministic policy check on the *consequence* leg of the trifecta. Before the agent can send an email, make an HTTP call, or write a file, it checks the proposed action against a compiled policy. The result is a mathematical proof (SAT/UNSAT), not a confidence score.
To use your framing: it doesn't prevent untrusted content from reaching the model. But if the model then tries to exfiltrate data to ATTACKER_SITE, the policy says external HTTP calls are only permitted to approved domains, and that's not a heuristic, it's a constraint. The injection still happened; the exfiltration is what becomes impossible.
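As a rough illustration of "a constraint, not a heuristic", here is a minimal sketch of the approved-domains rule. It is not our SAT-backed checker, just a plain allowlist lookup. The function name, the example domains, and the HTTPS-only requirement are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; a real policy would be compiled from configuration.
APPROVED_DOMAINS = frozenset({"api.example.com", "internal.example.com"})


def http_call_permitted(url: str, approved: frozenset = APPROVED_DOMAINS) -> bool:
    """Return True only if the URL targets an exactly-matching approved host."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
    except ValueError:
        # Malformed URL: fail closed rather than guess.
        return False
    # Assumption for this sketch: only HTTPS is ever allowed.
    return parts.scheme == "https" and host in approved


print(http_call_permitted("https://api.example.com/v1/items"))
print(http_call_permitted("https://attacker.example.net/?q=secret"))
print(http_call_permitted("https://api.example.com@attacker.example.net/"))
```

The check compares the parsed hostname for an exact match, so lookalike tricks such as putting the approved name in the userinfo part or as a prefix of a longer hostname fall through to a denial. No amount of persuasive injected text changes the outcome, because the model's output is only ever the input to the check, not a participant in it.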
That maps directly onto what the design patterns paper calls the core requirement: once an agent has ingested untrusted input, consequential actions must be impossible. The gate just needs to be deterministic, not probabilistic.
I'm writing a post that tries to work through this honestly: what formal verification actually covers for the trifecta, and where it doesn't help. I would genuinely value your read before I publish, given you've been thinking about this longer than most.