AI Coding Agent Token Optimization Guide: Prompt Caching, Session Management, and the Hidden Economics of Context Windows

If you spend time in AI coding agent communities, one complaint shows up again and again: quotas disappear faster than expected. Power users on Max-tier plans report burning through a full weekly allowance in a couple of days. In one documented case, someone measured roughly $134 in real API cost for a single session on AWS Bedrock—while a Pro Max 5× subscription might run around $100 per month for productized access. The gap between “feels unlimited” and “actually expensive” is where most people get surprised.

The most common mistake is treating sessions like disposable tabs: frequent /clear commands, or always starting fresh, in the hope of staying lean. That instinct is understandable, but it often does the opposite—it forces a full-price rebuild of context when prompt caching would otherwise keep the heavy, stable parts cheap.

This guide explains how tokens are billed in practice, how prompt caching changes the math, when to continue a session versus start a new one, and why a 1M-token window can be a liability—not a feature.

Why Every Message Re-Reads the Whole Stack

Large language models do not have durable memory between API calls in the way a database does. Each request is a fresh forward pass over the entire prompt the provider sends to the model.

For a typical coding agent turn, “input” is not just your latest sentence. It usually includes:

Component	What it contains
Fixed preamble	System instructions, safety policies, product-specific rules
Tooling & schemas	Tool definitions, JSON schemas, command surfaces
Repo / project rules	Files like `CLAUDE.md`, conventions, path hints
Conversation history	Prior user and assistant turns in the thread
New user message	The instruction you just typed

After 20 multi-step turns, it is common for the “carry-forward” portion—the history plus fixed scaffolding—to reach tens of thousands of tokens per new message. In worst cases, each additional turn can drag on the order of ~100K tokens of “old luggage,” even if the new instruction is short.

That is the core economic fact: long sessions are expensive not because models are greedy, but because the architecture re-attends to prior content every time.

Prompt Caching: Where the 10× Discount Comes From

Providers can cache stable prefix blocks so repeat traffic does not pay full input pricing. On Anthropic’s stack, cache hits are typically billed at a fraction of standard input cost—often cited around 10% of the standard input rate for eligible cached tokens.

For Claude Opus 4.x-class pricing (order-of-magnitude, check current public pricing for your region and channel), think in relative terms:

Pricing tier	Approx. input cost
Standard (uncached) input	~$5 / MTok (example tier)
Cached input (cache read)	~$0.50 / MTok (≈10% of standard)

The exact numbers on your invoice will vary by model version, batching, and provider, but the ratio is what matters for engineering decisions: cache hits are an order of magnitude cheaper than re-sending the same bytes at list price.

A 100-Turn Session: Illustrative Total Cost

These figures are illustrative—they compress many assumptions (average context size per turn, how much is cacheable, and provider list pricing). They are still useful for directional planning:

Scenario (100-turn Opus-class session, rough order of magnitude)	Estimated session input cost
No prompt caching (stable prefix re-sent at full input price)	~$50–$100+
Healthy prompt caching on stable blocks	~$19 (example benchmark from practitioner write-ups)

Again: treat this as a planning range, not a quote. Your telemetry beats any blog table.

Operational Constraints Worth Knowing

Anthropic’s prompt caching product has practical limits you can feel in agent workflows:

Parameter	Typical constraint
Minimum cacheable block size	1,024 tokens per cached segment
Breakpoints per request	Up to 4 cache breakpoints (depends on API version—verify in docs)
Cache TTL	Often ~1 hour for a primary agent loop; shorter windows (e.g., ~5 minutes) can apply to ephemeral or sub-agent contexts

Community commentary has gone as far as claiming “Claude Code has the highest cache utilization of any framework” for getting stable instructions and tool definitions into cached prefixes consistently—meaning your habits matter as much as your model choice.

Continue vs. New Session: A Decision Table

Sessions are not “free” or “expensive” by themselves—they interact with cache warmth and task coherence. Use a simple rubric:

Situation	Prefer
Same task, last message within ~1 hour, context still relevant	Continue
Task changed (new feature, new bug, new repo area)	New session
Idle longer than ~1 hour (cache likely expired)	New session (or accept a cold-cache rebuild)
Context polluted with unrelated files, dead ends, or contradictory instructions	New session (or aggressive pruning if your tool supports it)

Flowchart: Should You Continue?

Why the 1-hour heuristic? Many cache policies roll on a TTL clock. If you resume after TTL expiry, you may pay to re-establish cached prefixes even if the conversation “feels” continuous.

The 1M Context Window: “Can Use” ≠ “Should Use”

A million-token context window is a capacity headline, not a budgeting strategy.

Cost scales roughly linearly with effective context size for a given model and pricing tier: bigger prompts cost more, even when the model is capable of accepting them.

Research summaries (including work discussed under the Chroma umbrella) have reported 20–50% accuracy degradation on certain tasks when moving from ~10K toward ~100K tokens of distractor context—capability does not monotonically improve with window size.

Operationally, compare two failure modes:

Issue	What it means
Cache miss at huge context	Paying list input rates on a very large prefix
Relative cost	A 1M context cold read can be ~5× the cost of a 200K read for the same model family, all else equal—because the billed input tokens scale with what you send

Rule of thumb from production agent work: a disciplined ~20K-token context with crisp instructions frequently beats a 200K-token grab bag of semi-relevant files—on both quality and cost.

Six Token-Saving Rules That Survive Contact With Reality

Rule 1: Default to Sonnet for Daily Work

Use the cheaper, faster model for scaffolding, refactors, and repetitive edits. Switch to Opus for genuinely hard architecture decisions, subtle bugs, or security-sensitive reviews. Model hopping is a budget lever—if everything is “Opus by default,” you are choosing maximum spend.

Rule 2: Do Not Switch Models Mid-Session

Changing models mid-thread often invalidates caching assumptions and can force a wholesale rebuild of the prompt stack. Pick a lane per session.

Rule 3: Keep `CLAUDE.md` Lean (Target Under ~200 Lines)

Long instruction files are re-sent as part of the stable prefix. One public write-up reported ~63% token reduction after trimming a bloated rules file to essentials. Treat repo rules like an API contract: short, explicit, and testable.

Rule 4: CLI First, MCP Second

MCP is powerful, but each tool and schema consumes prompt real estate. Prefer built-in terminal workflows and direct repo operations when they are equally reliable.

Rule 5: Spend Tokens on Planning Up Front

A slightly longer upfront plan—acceptance criteria, edge cases, file ownership—often reduces retries. Three cheap repair rounds can exceed one correct pass with a clear plan.

Rule 6: Point to Paths, Do Not Paste Files

Pasting large file contents into chat permanently bloats history. Use path references, @ file picks, or scoped reads so the tool pulls what it needs when it needs it.

Common Myths—Debunked

Myth	Reality
“Frequent `/clear` saves money”	Often no. If a warm cache was helping, clearing can force a full rebuild of large stable sections at standard input pricing.
“A 1M window means unlimited chat”	No. Cost grows with tokens sent; “fits in window” ≠ “cheap to attend to.”
“Output tokens dominate the bill”	Usually false in agent loops. A typical decomposition is closer to: context loading ~45%, history ~25%, output ~20%, retries ~10%—with input-side context dominating (some analyses place ~99%+ of total tokens on the input side across turns).

Treat percentages as diagnostic heuristics from vendor and community telemetry—not universal laws—but the directional point holds: optimize the input stack first.

Summary: A Simple Mental Model

A pragmatic cost model:

Session cost ≈ (input tokens × input price) + (output tokens × output price), with input tokens dominated by fixed preamble + history.

Three levers matter most:

Raise cache hit rate on stable instructions and tooling (session hygiene, avoid unnecessary cold starts).
Shrink fixed context (lean rules files, fewer redundant tool schemas).
Cut wasted turns (planning, smaller scopes, fewer model swaps, less copy-paste pollution).

If you only remember one sentence: in agent workflows, “context engineering” is budget engineering.

AI Coding Agent Token Optimization Guide: Prompt Caching, Session Management, and the Hidden Economics of Context Windows

A deep dive into how AI coding agents like Claude Code, Cursor, and Codex consume tokens — covering prompt caching mechanics, the continue-vs-new-session decision framework, the hidden cost of 1M context windows, and six proven rules for reducing token spend.

Why Every Message Re-Reads the Whole Stack

Prompt Caching: Where the 10× Discount Comes From

A 100-Turn Session: Illustrative Total Cost

Operational Constraints Worth Knowing

Continue vs. New Session: A Decision Table

Flowchart: Should You Continue?

The 1M Context Window: “Can Use” ≠ “Should Use”

Six Token-Saving Rules That Survive Contact With Reality

Rule 1: Default to Sonnet for Daily Work

Rule 2: Do Not Switch Models Mid-Session

Rule 3: Keep `CLAUDE.md` Lean (Target Under ~200 Lines)

Rule 4: CLI First, MCP Second

Rule 5: Spend Tokens on Planning Up Front

Rule 6: Point to Paths, Do Not Paste Files

Common Myths—Debunked

Summary: A Simple Mental Model

References

AI Coding Agent Token Optimization Guide: Prompt Caching, Session Management, and the Hidden Economics of Context Windows

A deep dive into how AI coding agents like Claude Code, Cursor, and Codex consume tokens — covering prompt caching mechanics, the continue-vs-new-session decision framework, the hidden cost of 1M context windows, and six proven rules for reducing token spend.

Why Every Message Re-Reads the Whole Stack

Prompt Caching: Where the 10× Discount Comes From

A 100-Turn Session: Illustrative Total Cost

Operational Constraints Worth Knowing

Continue vs. New Session: A Decision Table

Flowchart: Should You Continue?

The 1M Context Window: “Can Use” ≠ “Should Use”

Six Token-Saving Rules That Survive Contact With Reality

Rule 1: Default to Sonnet for Daily Work

Rule 2: Do Not Switch Models Mid-Session

Rule 3: Keep CLAUDE.md Lean (Target Under ~200 Lines)

Rule 4: CLI First, MCP Second

Rule 5: Spend Tokens on Planning Up Front

Rule 6: Point to Paths, Do Not Paste Files

Common Myths—Debunked

Summary: A Simple Mental Model

References

Rule 3: Keep `CLAUDE.md` Lean (Target Under ~200 Lines)