Prompt Injection Is the Wrong Threat Model

Half of what we do is teaching agents to find security bugs. The other half is finding security bugs in agents. After enough time on both sides, you notice the threat model the industry is being sold for agent products is the wrong one.

The wrong one is prompt injection. The right one is older, less interesting, and much more expensive to skip.

Prompt injection is real. It's also a lower-stakes bug than the discourse suggests. An attacker who jailbreaks your model and gets it to say something embarrassing has produced a reputational incident. An attacker who walks your model's tool-calling permissions into a real downstream action has produced a financial incident. The two failure modes don't share much except for the model in the middle.

The threat model that actually matters isn't "what if the user gets the model to ignore its instructions." It's "what if the model gets to call a tool the user wasn't authorized to use."

Why the framing is wrong

The defenses being sold against prompt injection are filtering, prompting harder, and adding constitution-style rules. They're all on the wrong layer.

The wrong layer is the conversation. Whatever filter you put on it, the attacker rewrites their input to get around. Whatever rule you give the model, the attacker convinces it to relax. We've run small versions of this internally against the models we use. It looks roughly like the SQL injection cat-and-mouse game in 2004, and it'll probably end the same way. Defenses lose, slowly, until everyone moves the trust boundary somewhere else.

The somewhere else is the tool call. That's where the actual privilege lives. Putting your defense anywhere upstream of it is putting it on the wrong side of the attacker.

The shape that breaks

An agent gets configured with a tool. Search the database. Send the email. Charge the card. Write the file. The tool runs under the agent's identity, which is usually the application's service account. The agent picks when to call the tool based on the conversation. The user shapes the conversation.

Three properties of that arrangement, each fine on its own, collectively dangerous.

The agent has authority the user doesn't.

The user can shape what the agent chooses to do.

The accounting that would have stopped the user from calling the tool directly doesn't run when the agent calls it for them.

This is a confused deputy. Norm Hardy's paper naming the problem was published in 1988. The agent platform industry rediscovered it last year and is happily shipping it.
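The three properties fit in a few lines. What follows is a minimal sketch, not any real framework's API: User, SERVICE_ACCOUNT, issue_refund, and agent_turn are names we made up for illustration, and the keyword match stands in for whatever reasoning the model actually does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    permissions: frozenset


# Hypothetical service account the agent runs as. It holds authority
# that ordinary users do not.
SERVICE_ACCOUNT = User("app-service", frozenset({"refund:issue"}))


def issue_refund(caller: User, order_id: str) -> str:
    # The accounting: a direct call by an unauthorized user stops here.
    if "refund:issue" not in caller.permissions:
        raise PermissionError(f"{caller.name} may not issue refunds")
    return f"refunded {order_id} as {caller.name}"


def agent_turn(user: User, message: str) -> str:
    # Stand-in for model reasoning: the user's message steers the tool choice.
    if "refund" in message.lower():
        # The deputy: the call carries the service account's identity,
        # so the check above never sees the user.
        return issue_refund(SERVICE_ACCOUNT, order_id="A-100")
    return "no tool called"


mallory = User("mallory", frozenset())
```

Calling issue_refund(mallory, "A-100") directly raises PermissionError. Calling agent_turn(mallory, "please refund my order") goes through, because the only identity the tool ever checks is the service account's.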

Why the framework lets you do it

Every agent framework we've looked at decides "can the agent call this tool" by checking the agent's identity. That's the wrong check.

The check that matters is "can the user driving this conversation reach this tool through any chain of agent reasoning the user can talk it into." That has to be implemented against the user's identity, at the call site of the tool, not against the agent's identity at the configuration site.

Approximately zero production agent code does this. The frameworks don't make it easy. The path of least resistance is to grant the agent the union of every permission any of its users might need, and once you've done that, prompt injection is just the cheapest way to make the agent exercise the permissions of a user it isn't currently talking to.
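The union grant is easy to put numbers on. This is a sketch under assumptions: the roles, the permission strings, and the framework_allows function are hypothetical, standing in for whatever configuration-time check a given framework performs.

```python
# Hypothetical roles and the permissions each one legitimately needs.
USER_NEEDS = {
    "customer": {"orders:read"},
    "support_rep": {"orders:read", "refund:issue"},
    "finance": {"orders:read", "ledger:export"},
}

# The path of least resistance: the agent gets everything anyone might need.
AGENT_GRANT = set().union(*USER_NEEDS.values())


def framework_allows(tool_permission: str) -> bool:
    # Configuration-site check: only the agent's identity is consulted.
    return tool_permission in AGENT_GRANT


def excess_authority(role: str) -> set:
    # What a user in this role can reach through the agent but not directly.
    return AGENT_GRANT - USER_NEEDS[role]
```

Under these assumptions, excess_authority("customer") comes back with both refund:issue and ledger:export, and framework_allows says yes to either one no matter who is typing. That set difference is the attack surface prompt injection gets to spend.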

What we'd build instead

Wrap every tool the agent can call in a check that names the user. Not the agent. The user. The check looks at the call site, asks whether the user driving this conversation is allowed to perform the operation the tool does, and refuses if not.

That's the whole change. It's one line in each tool, and it's the line the framework didn't write for you.
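One way that line could look, as a sketch rather than a prescription. The requires decorator, the current_user context variable, and the permission strings are all our own inventions here. The assumption doing the work is that the application binds current_user from the authenticated session for each request, and that nothing the model emits can change it.

```python
import functools
from contextvars import ContextVar
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    permissions: frozenset


# Hypothetical: bound by the application per request, never by the model.
current_user = ContextVar("current_user")


def requires(permission: str):
    """Wrap a tool so every call is checked against the driving user."""
    def decorate(tool):
        @functools.wraps(tool)
        def checked(*args, **kwargs):
            # No bound user means .get() raises LookupError: fail closed.
            user = current_user.get()
            if permission not in user.permissions:
                raise PermissionError(
                    f"{user.name} may not call {tool.__name__}"
                )
            return tool(*args, **kwargs)
        return checked
    return decorate


@requires("refund:issue")  # the one line per tool
def issue_refund(order_id: str) -> str:
    return f"refunded {order_id}"
```

The model can be talked into calling issue_refund as often as the attacker likes. The call is refused unless the user bound to the conversation holds refund:issue, and it fails outright if no user is bound at all.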

We're not naming products. The list of agent products that get this wrong is short. The public record will fill in within a year. We've got private bets on which ones.

Prompt injection is the demo. The deputy is the breach.

We expect to spend a chunk of next year writing this up after the fact for teams that didn't think it was glamorous enough to do up front.