Your AI agent burns 21,000 tokens to fix a typo: 6 cost patterns

9 min read

Token bills compound in 60-second windows. The tools to control them are already in your API. Photo by Austin Distel on Unsplash

A developer at Morph documented a Claude Code session that consumed over 21,000 input tokens to fix a single-character typo. That is the equivalent of reading a short novel to change one letter. The session burned the tokens resending the full conversation history on every turn, retrying a failing tool call, and rereading the same three files the agent had already loaded twice.

Nothing about that session was unusual. Coding agents resend history on every turn, tool calls multiply in the middle of turns, and the 5-minute prompt cache window is easy to miss. A team running Claude Code or Cursor on the same workload can generate a token bill that varies by 10x depending on whether these six patterns are in place.

Here they are, each with the code change that unlocks the saving and a realistic number for what it cuts.

Pattern 1: cap iterations and enforce a token budget

The fastest way to burn tokens is an agent loop with no exit condition. The agent hits a 400 error, retries with the same bad input, retries with slightly different bad input, retries again, and so on. By iteration 40 you have spent 80,000 tokens producing nothing.

The unbounded version that every tutorial ships with:

// Before: no cap, no budget, no visibility
async function fixBug(description: string) {
  let done = false;
  const history: Message[] = [{ role: "user", content: description }];

  while (!done) {
    const res = await anthropic.messages.create({
      model: "claude-opus-4-6",
      max_tokens: 4096,
      messages: history,
      tools: allTools,
    });
    history.push({ role: "assistant", content: res.content });
    done = res.stop_reason === "end_turn";
  }
}

The version that will not wake you up at 2 AM:

// After: iteration cap, token budget, per-call accounting
interface RunStats {
  iterations: number;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

async function fixBug(description: string) {
  const MAX_ITERATIONS = 20;
  const MAX_TOKENS = 150_000; // ~$3 on Opus pricing

  const stats: RunStats = {
    iterations: 0,
    inputTokens: 0,
    outputTokens: 0,
    cacheReadTokens: 0,
    cacheWriteTokens: 0,
  };

  const history: Message[] = [{ role: "user", content: description }];
  let done = false;

  while (!done) {
    if (stats.iterations >= MAX_ITERATIONS) {
      throw new Error(`agent stopped at iteration cap (${stats.iterations})`);
    }

    const totalTokens = stats.inputTokens + stats.outputTokens;
    if (totalTokens > MAX_TOKENS) {
      throw new Error(`agent stopped at token budget (${totalTokens} tokens)`);
    }

    const res = await anthropic.messages.create({
      model: "claude-opus-4-6",
      max_tokens: 4096, // caps a single response, not the run; the run caps are above
      messages: history,
      tools: allTools,
    });

    stats.iterations += 1;
    stats.inputTokens += res.usage.input_tokens;
    stats.outputTokens += res.usage.output_tokens;
    stats.cacheReadTokens += res.usage.cache_read_input_tokens ?? 0;
    stats.cacheWriteTokens += res.usage.cache_creation_input_tokens ?? 0;

    history.push({ role: "assistant", content: res.content });
    done = res.stop_reason === "end_turn";
  }

  return { history, stats };
}

Two caps: one on iterations, one on total tokens. The iteration cap catches retry storms. The token budget catches long-running tasks that are still converging but past the point of dollar sense. If the agent cannot solve the problem in 20 tool calls, the fix is a better prompt or a better tool, not more iterations.

Log stats.iterations alongside stats.inputTokens in your metrics pipeline. Tasks that complete in 3 to 5 iterations are healthy. Tasks pinned at 18 to 20 iterations are retry storms that need a prompt rewrite, not a cap increase.

Pattern 2: mark long static context as cacheable

Anthropic's prompt cache bills cache hits at 10% of the input rate and cache writes at 125%. For a 10,000-token style guide that gets reused on 100 calls within the 5-minute TTL, the cached run costs about 12% of the uncached run.

Adding cache_control to a content block is one line. Missing it is the most common cost mistake in production agent code:

const res = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  system: [
    {
      type: "text",
      text: "You are a code reviewer for acme-corp.",
    },
    {
      type: "text",
      text: largeStyleGuide, // 8,000 tokens, reused on every call
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    { role: "user", content: "Review this diff: " + diff },
  ],
});

// res.usage.cache_read_input_tokens tells you how many tokens hit the cache.
// First call: cache_creation_input_tokens = 8,000, billed at the 125% rate.
// Calls 2-N within 5 min: cache_read_input_tokens = 8,000 at the 10% rate,
// plus the ~200 uncached tokens of the user message at the normal rate.

The cache lives for 5 minutes. If your agent makes one call every 20 minutes, you pay the cache write premium without amortizing it, and caching costs you money. If your agent makes bursts of 10 to 50 calls in under 5 minutes, the math flips hard in your favor.

A concrete number: a 40-call review session with an 8K style guide, no cache, costs roughly 40 * 8,000 = 320,000 input tokens for the style guide alone. With caching: 10,000 (write at 125%) + 39 * 800 (reads at 10%) = 41,200 billable tokens. That is an 87% reduction on the reusable block.
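The same arithmetic generalizes to any block size and call count. A throwaway helper makes the break-even visible (the 125% and 10% multipliers are Anthropic's published cache rates; the function names are mine):

```typescript
// Billable token-equivalents for N calls that reuse one cached block.
// Cache writes bill at 125% of the input rate, cache reads at 10%.
function cachedCost(blockTokens: number, calls: number): number {
  if (calls === 0) return 0;
  const write = (blockTokens * 5) / 4;            // first call creates the cache
  const reads = ((calls - 1) * blockTokens) / 10; // later calls within the TTL hit it
  return write + reads;
}

function uncachedCost(blockTokens: number, calls: number): number {
  return blockTokens * calls; // full block resent at the normal input rate
}

cachedCost(8_000, 40);   // 10,000 + 39 * 800 = 41,200
uncachedCost(8_000, 40); // 320,000
```

One write plus one read costs 1.35x the block; two uncached sends cost 2.0x. Any block reused even once inside the TTL pays for itself.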

Pattern 3: summarize the tail of long sessions

On turn 30 of a session, the agent rereads turns 1 through 29 on every call. The early turns contain setup context that has long since stopped being actionable. Compress them.

// Keep the last 6 turns verbatim; once history grows past 12 messages,
// compress everything older into a short summary

async function maybeCompress(history: Message[]): Promise<Message[]> {
  if (history.length < 12) return history;

  // Note: in a real agent, cut on turn boundaries so tool_use/tool_result
  // pairs are never split between the summary and the verbatim tail
  const keepTail = history.slice(-6);
  const compressHead = history.slice(0, -6);

  const summary = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001", // cheap model for summaries
    max_tokens: 800,
    messages: [
      {
        role: "user",
        content:
          "Summarize this agent session in under 400 words. Preserve file paths, function names, and decisions made:\n\n" +
          JSON.stringify(compressHead),
      },
    ],
  });

  const block = summary.content[0];
  const summaryText = block.type === "text" ? block.text : "";

  const summaryMsg: Message = {
    role: "user",
    content: "Previous session summary:\n" + summaryText,
  };

  return [summaryMsg, ...keepTail];
}

Summarize with Haiku, not the same expensive model driving the main loop. The summary can lose detail; keep enough to preserve file paths, function names, and decisions the agent has already made. The last 6 turns stay verbatim so the model still has recent tool call results and working context.

For a session that was about to hit 120K input tokens per turn, compressing turns 1 through 24 into a 400-token summary cuts per-turn input to roughly 8K. Savings compound: on the next 10 turns, that is a million tokens you did not send.
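The compounding is plain multiplication. A one-line sanity check, using the numbers from the example above:

```typescript
// Back-of-envelope savings from compression: per-turn input drops from
// `before` to `after` tokens, and the difference compounds over N turns.
function savedTokens(before: number, after: number, turns: number): number {
  return (before - after) * turns;
}

savedTokens(120_000, 8_000, 10); // 1,120,000 tokens not sent over 10 turns
```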

Pattern 4: RAG over full-file reads for reference material

Sending three whole files every turn because the agent might need them is the most visible form of waste. A vector-store lookup returning the 5 most relevant 180-token chunks cuts reference context by 60 to 80% while keeping accuracy on targeted questions.

// Before: send full files every turn (8,400 tokens)
import { promises as fs } from "node:fs";

const systemPrompt = `
You are editing this codebase:

${await fs.readFile("src/routes/billing.ts", "utf8")}
${await fs.readFile("src/lib/stripe.ts", "utf8")}
${await fs.readFile("src/db/schema.ts", "utf8")}
`;

// After: retrieve slices by query (~900 tokens)
import { search } from "./vector-store";

async function buildSystem(userQuery: string) {
  const hits = await search(userQuery, { k: 5, maxTokens: 900 });
  return `
You are editing this codebase. Relevant excerpts:

${hits.map((h) => `# ${h.path}:${h.line}\n${h.snippet}`).join("\n\n")}
`;
}

The rule of thumb: files under 3K tokens go in directly; files over 10K tokens get chunked and retrieved; files in between depend on whether the agent will scan the whole thing or look up a specific function. For API specs, documentation sites, and config schemas, RAG is strictly better. For the file the agent is actively editing, keep it inline.
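The rule of thumb is easy to encode. A sketch, with the article's thresholds as placeholder constants rather than universal truths; tune them per codebase:

```typescript
// Inline-vs-retrieve decision for a reference file. Thresholds (3K / 10K)
// follow the rule of thumb above and are starting points, not constants.
type ContextStrategy = "inline" | "retrieve" | "depends";

function contextStrategy(fileTokens: number, activelyEditing: boolean): ContextStrategy {
  if (activelyEditing) return "inline"; // the file being edited stays in context
  if (fileTokens < 3_000) return "inline";
  if (fileTokens > 10_000) return "retrieve";
  return "depends"; // mid-size: whole-file scans inline, targeted lookups retrieve
}
```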

Pattern 5: offload deterministic work to typed tool calls

The most expensive token is the output token spent reasoning through a problem the model should never have been asked to solve. Deterministic, structured tasks belong in a tool:

  • Email syntax plus MX plus disposable check
  • Phone parsing to E.164 with country detection
  • SSL certificate expiry and chain validation
  • JSON schema validation, JSON to TypeScript conversion
  • Hashing, UUID generation, base64 encoding, timestamp conversion
  • SPF, DMARC, DKIM checks; DNS record lookups

// Before: 480 output tokens of reasoning
const res = await anthropic.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 500,
  messages: [
    {
      role: "user",
      content:
        "Is support@acme-corp.com a valid business email? " +
        "Check syntax, MX records, and whether the domain is disposable.",
    },
  ],
});
// Model reasons: "the format looks valid, MX records would likely exist
// for a corporate domain like acme-corp.com, and disposable domains are
// typically things like mailinator.com..."

// After: 30 output tokens, one typed tool call
const tools = [
  {
    name: "validate_email",
    description: "Validate email syntax, MX records, and disposable status",
    input_schema: {
      type: "object",
      properties: { email: { type: "string" } },
      required: ["email"],
    },
  },
];

const res2 = await anthropic.messages.create({
  model: "claude-haiku-4-5-20251001", // cheaper model, deterministic task
  tools,
  max_tokens: 200,
  messages: [
    { role: "user", content: "Is support@acme-corp.com a valid business email?" },
  ],
});

// Tool call triggers a fetch to the Botoi /v1/email/validate endpoint,
// which returns { valid, disposable, mx }

The before version costs ~2,400 tokens per call and sometimes hallucinates MX records. The after version costs ~230 tokens, calls a typed endpoint, and returns a schema-validated answer. The agent gets the same information for 10% of the cost and zero reasoning errors.

This is where an external API fits cleanly in the agent stack. Tool calls that terminate in a single HTTP request to a typed endpoint remove both the output-token cost and a class of hallucinations. Any Botoi endpoint can be wrapped as a Claude or OpenAI tool in a few lines, or called directly through the Botoi MCP server which exposes 49 of them as MCP tools.
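A minimal sketch of that wrapping, assuming a generic JSON-over-HTTP endpoint; the URL and response shape here are illustrative placeholders, not the documented Botoi API:

```typescript
// Map tool_use blocks from the model to HTTP handlers. Each handler
// terminates in a single request to a typed endpoint.
type ToolUse = { name: string; input: Record<string, unknown> };
type ToolHandler = (input: Record<string, unknown>) => Promise<unknown>;

const handlers: Record<string, ToolHandler> = {
  validate_email: async (input) =>
    fetch("https://api.example.com/v1/email/validate", { // hypothetical URL
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(input),
    }).then((r) => r.json()),
};

// Dispatch one tool call and return the JSON string that goes back to the
// model inside the tool_result message.
async function runTool(call: ToolUse): Promise<string> {
  const handler = handlers[call.name];
  if (!handler) throw new Error(`unknown tool: ${call.name}`);
  return JSON.stringify(await handler(call.input));
}
```

The dispatcher is deliberately dumb: no retries, no reasoning, just a typed round trip. Adding a new tool is one schema plus one entry in the handler map.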

Pattern 6: route by task kind to the cheapest acceptable model

Opus costs 5x Sonnet and 15x Haiku per input token. Most tasks in an agent loop do not need Opus. Classification, extraction, short tool-call routing, and summary compression all run fine on Haiku. Keep Opus for architectural decisions and hard debugging.

type Task = {
  kind: "classify" | "reason" | "plan" | "extract";
  input: string;
};

function pickModel(task: Task): string {
  switch (task.kind) {
    case "classify":
    case "extract":
      // Structured, deterministic: Haiku handles these fine at 1/15 the cost
      return "claude-haiku-4-5-20251001";
    case "reason":
      // Cross-cutting reasoning, code changes, multi-file edits
      return "claude-sonnet-4-6";
    case "plan":
      // Architectural choices, hard debugging, security review
      return "claude-opus-4-6";
  }
}

A typical mixed-workload agent that was running every step on Opus dropped 62% of its monthly bill by routing only the "plan" tasks to Opus and pushing classify/extract to Haiku. The accuracy regression on those tasks was zero because they were deterministic to begin with.

The Claude Advisor Tool pattern takes this further: Sonnet drives the main loop and calls Opus mid-generation for a second opinion on a specific decision. One call, two models, near-Opus quality at Sonnet cost.
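A sketch of that wiring, under the assumption that the advisor is exposed as a regular tool; the tool name and handler signature are mine, not a documented SDK feature:

```typescript
// The driving (cheaper) model gets an "ask_advisor" tool whose handler
// forwards one focused question to the stronger model.
const advisorTool = {
  name: "ask_advisor",
  description:
    "Ask a stronger model for a second opinion on one specific decision. " +
    "Use sparingly: each call is expensive.",
  input_schema: {
    type: "object",
    properties: { question: { type: "string" } },
    required: ["question"],
  },
};

// The model caller is injected so routing stays testable and the expensive
// model is only invoked on explicit advisor requests.
async function askAdvisor(
  question: string,
  callModel: (model: string, prompt: string) => Promise<string>,
): Promise<string> {
  return callModel("claude-opus-4-6", question);
}
```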

Instrument before you optimize

You cannot cut what you cannot see. Log per-run token stats as soon as you ship an agent to production:

// Log per-run token spend and flag outliers
import { appendFile } from "node:fs/promises";

interface RunLog {
  runId: string;
  task: string;
  model: string;
  iterations: number;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  usdCost: number;
}

// USD per million tokens
const PRICING: Record<string, { in: number; out: number; cacheRead: number; cacheWrite: number }> = {
  "claude-opus-4-6":     { in: 15, out: 75, cacheRead: 1.5, cacheWrite: 18.75 },
  "claude-sonnet-4-6":   { in: 3,  out: 15, cacheRead: 0.3, cacheWrite: 3.75 },
  "claude-haiku-4-5-20251001": { in: 1, out: 5,  cacheRead: 0.1, cacheWrite: 1.25 },
};

function usd(model: string, stats: RunStats): number {
  const p = PRICING[model];
  return (
    (stats.inputTokens       * p.in        +
     stats.outputTokens      * p.out       +
     stats.cacheReadTokens   * p.cacheRead +
     stats.cacheWriteTokens  * p.cacheWrite) / 1_000_000
  );
}

export async function logRun(log: RunLog) {
  await appendFile("runs.jsonl", JSON.stringify(log) + "\n");
  if (log.usdCost > 2) {
    console.warn(`HIGH COST RUN: ${log.runId} = $${log.usdCost.toFixed(2)}`);
  }
}

Pipe runs.jsonl into whatever you already use for metrics. The first week of data will show a handful of runs consuming 3x the median. Those are your retry loops. The next week will show a second tier of expensive runs that are cache misses because the cache window lapsed. Fix those in order of cost, not in order of frequency.
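Ranking parsed runs.jsonl entries by cost is a few lines. A sketch, assuming the RunLog shape from the logging code above:

```typescript
// Rank logged runs by dollar cost so the most expensive get fixed first.
interface CostedRun { runId: string; usdCost: number }

function topRunsByCost<T extends CostedRun>(runs: T[], n: number): T[] {
  return [...runs].sort((a, b) => b.usdCost - a.usdCost).slice(0, n);
}

function medianCost(runs: CostedRun[]): number {
  const costs = runs.map((r) => r.usdCost).sort((a, b) => a - b);
  const mid = Math.floor(costs.length / 2);
  return costs.length % 2 ? costs[mid] : (costs[mid - 1] + costs[mid]) / 2;
}

// Flag runs well above the median: those are the retry loops and cache misses
function outliers(runs: CostedRun[]): CostedRun[] {
  const threshold = 3 * medianCost(runs);
  return runs.filter((r) => r.usdCost > threshold);
}
```

Feed it `readFile("runs.jsonl")` split on newlines and `JSON.parse`d per line; the output is the fix-first list.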

Putting it together: expected savings by pattern

| Pattern | Typical saving | Effort to ship |
| --- | --- | --- |
| Iteration + token cap | 40-90% on pathological runs | Low (one hour) |
| Prompt cache on reusable context | 60-90% on the cached block | Low (one line per block) |
| Tail summarization | 30-70% on long sessions | Medium (compression logic) |
| RAG for reference material | 60-80% on retrieved content | Medium (vector store setup) |
| Tool offload for deterministic work | 70-95% on offloaded task | Low (tool definition + HTTP call) |
| Model routing by task kind | 50-80% blended | Low (router function) |

Stack all six. A team moving from "everything on Opus, no cache, full files, 40-iteration cap" to "Haiku-Sonnet routing, cached system prompts, RAG, typed tools, 20-iteration cap" regularly cuts monthly spend by 70 to 85% with identical or better task completion rates.

Key takeaways

  • Cap iterations and tokens, not wall clock. A 20-iteration / 150K-token cap stops retry storms before they cost you money.
  • Mark reusable context as cacheable. One cache_control line turns a 40-call session from 320K billable tokens into 41K.
  • Summarize the tail with Haiku, keep the head verbatim. Old turns stop being actionable faster than most agents notice.
  • Retrieve, do not send, reference material. RAG cuts 60-80% off input tokens for docs, specs, and schemas that the agent scans rather than edits.
  • Tool-call the deterministic work. Email validation, DNS lookups, hashing, JSON conversion; none of it deserves reasoning tokens.
  • Route by task kind. Haiku for classify/extract, Sonnet for reason, Opus for plan. The blended bill drops by 50 to 80% with zero accuracy loss on structured tasks.

Botoi gives you 150+ typed endpoints and a 49-tool MCP server ready to wire into any agent loop. Replacing reasoning tokens with an HTTP call costs roughly 230 tokens per deterministic task instead of 2,000+. Try the interactive API docs or connect Claude Code, Cursor, or VS Code to the MCP server in one config block, then watch your token line on the cost dashboard flatten out.

Frequently asked questions

Why does an AI coding agent use so many tokens for a small change?
Coding agents resend the full conversation history on every turn. A 30-turn session that started with three large file reads sends those reads every turn, multiplied by however many tool calls the agent makes between turns. A typo fix that looks trivial to a human can turn into 20 to 30 round trips, each carrying 1,000 to 1,500 tokens of context the model already saw. The arithmetic compounds fast.
How much does prompt caching save on an Anthropic call?
Anthropic's prompt cache charges 10% of the input token rate for cache hits and 125% for cache writes. For a 10,000-token system prompt that gets reused on 100 calls within the 5-minute TTL, the cached run costs about 12% of the uncached run; one write at 125% plus 99 reads at 10%. The bigger your reusable context, the larger the savings.
What iteration cap should I set on an agent loop?
Start at 15 to 25 iterations for a single logical task. If your agent cannot reach a correct answer in 15 tool calls, it probably will not reach it in 50; it is more likely caught in a retry loop or hallucinating tool arguments. Add a budget check that kills the loop when the session crosses a token threshold, not a wall-clock limit. Token spend maps to dollar cost; wall-clock does not.
When does it make sense to call an external HTTP API from an agent instead of asking the model to compute the answer?
Any time the task is deterministic and structured: email validation, phone parsing, SSL checks, base64 decoding, UUID generation, hash computation, JSON schema validation. The model should not spend 500 output tokens reasoning through whether support@acme.com has a valid MX record. A single tool call to a typed endpoint returns the answer in 30 tokens and removes a class of hallucinations.
Does RAG always beat shoving whole files into context?
For read-mostly reference material (docs, config schemas, API specs), yes; teams that replace full-file context with a few-kilobyte retrieval typically cut input tokens by 60 to 80%. For small files under 3K tokens that fit entirely in context, RAG adds complexity without savings. The rule: if the relevant content is under 3K tokens, inline it; if it is over 10K tokens and the agent only needs a slice, retrieve it.
