If you spend time in AI coding agent communities, one complaint shows up again and again: quotas disappear faster than expected. Power users on Max-tier plans report burning through a full weekly allowance in a couple of days. In one documented case, someone measured roughly $134 in real API cost for a single session on AWS Bedrock—while a Pro Max 5× subscription might run around $100 per month for productized access. The gap between “feels unlimited” and “actually expensive” is where most people get surprised.
The most common mistake is treating sessions like disposable tabs: frequent /clear commands, or always starting fresh, in the hope of staying lean. That instinct is understandable, but it often does the opposite—it forces a full-price rebuild of context when prompt caching would otherwise keep the heavy, stable parts cheap.
This guide explains how tokens are billed in practice, how prompt caching changes the math, when to continue a session versus start a new one, and why a 1M-token window can be a liability—not a feature.
Why Every Message Re-Reads the Whole Stack
Large language models do not have durable memory between API calls in the way a database does. Each request is a fresh forward pass over the entire prompt the provider sends to the model.
For a typical coding agent turn, “input” is not just your latest sentence. It usually includes:
| Component | What it contains |
|---|---|
| Fixed preamble | System instructions, safety policies, product-specific rules |
| Tooling & schemas | Tool definitions, JSON schemas, command surfaces |
| Repo / project rules | Files like CLAUDE.md, conventions, path hints |
| Conversation history | Prior user and assistant turns in the thread |
| New user message | The instruction you just typed |
After 20 multi-step turns, it is common for the “carry-forward” portion—the history plus fixed scaffolding—to reach tens of thousands of tokens per new message. In worst cases, each additional turn can drag on the order of ~100K tokens of “old luggage,” even if the new instruction is short.
That is the core economic fact: long sessions are expensive not because models are greedy, but because the architecture re-attends to prior content every time.
Prompt Caching: Where the 10× Discount Comes From
Providers can cache stable prefix blocks so repeat traffic does not pay full input pricing. On Anthropic’s stack, cache hits are typically billed at a fraction of standard input cost—often cited around 10% of the standard input rate for eligible cached tokens.
For Claude Opus 4.x-class pricing (order-of-magnitude, check current public pricing for your region and channel), think in relative terms:
| Pricing tier | Approx. input cost |
|---|---|
| Standard (uncached) input | ~$5 / MTok (example tier) |
| Cached input (cache read) | ~$0.50 / MTok (≈10% of standard) |
The exact numbers on your invoice will vary by model version, batching, and provider, but the ratio is what matters for engineering decisions: cache hits are an order of magnitude cheaper than re-sending the same bytes at list price.
A 100-Turn Session: Illustrative Total Cost
These figures are illustrative—they compress many assumptions (average context size per turn, how much is cacheable, and provider list pricing). They are still useful for directional planning:
| Scenario (100-turn Opus-class session, rough order of magnitude) | Estimated session input cost |
|---|---|
| No prompt caching (stable prefix re-sent at full input price) | ~$50–$100+ |
| Healthy prompt caching on stable blocks | ~$19 (example benchmark from practitioner write-ups) |
Again: treat this as a planning range, not a quote. Your telemetry beats any blog table.
Operational Constraints Worth Knowing
Anthropic’s prompt caching product has practical limits you can feel in agent workflows:
| Parameter | Typical constraint |
|---|---|
| Minimum cacheable block size | 1,024 tokens per cached segment |
| Breakpoints per request | Up to 4 cache breakpoints (depends on API version—verify in docs) |
| Cache TTL | Often ~1 hour for a primary agent loop; shorter windows (e.g., ~5 minutes) can apply to ephemeral or sub-agent contexts |
Community commentary has gone as far as claiming “Claude Code has the highest cache utilization of any framework” for getting stable instructions and tool definitions into cached prefixes consistently—meaning your habits matter as much as your model choice.
Continue vs. New Session: A Decision Table
Sessions are not “free” or “expensive” by themselves—they interact with cache warmth and task coherence. Use a simple rubric:
| Situation | Prefer |
|---|---|
| Same task, last message within ~1 hour, context still relevant | Continue |
| Task changed (new feature, new bug, new repo area) | New session |
| Idle longer than ~1 hour (cache likely expired) | New session (or accept a cold-cache rebuild) |
| Context polluted with unrelated files, dead ends, or contradictory instructions | New session (or aggressive pruning if your tool supports it) |
Flowchart: Should You Continue?
Why the 1-hour heuristic? Many cache policies roll on a TTL clock. If you resume after TTL expiry, you may pay to re-establish cached prefixes even if the conversation “feels” continuous.
The 1M Context Window: “Can Use” ≠ “Should Use”
A million-token context window is a capacity headline, not a budgeting strategy.
Cost scales roughly linearly with effective context size for a given model and pricing tier: bigger prompts cost more, even when the model is capable of accepting them.
Research summaries (including work discussed under the Chroma umbrella) have reported 20–50% accuracy degradation on certain tasks when moving from ~10K toward ~100K tokens of distractor context—capability does not monotonically improve with window size.
Operationally, compare two failure modes:
| Issue | What it means |
|---|---|
| Cache miss at huge context | Paying list input rates on a very large prefix |
| Relative cost | A 1M context cold read can be ~5× the cost of a 200K read for the same model family, all else equal—because the billed input tokens scale with what you send |
Rule of thumb from production agent work: a disciplined ~20K-token context with crisp instructions frequently beats a 200K-token grab bag of semi-relevant files—on both quality and cost.
Six Token-Saving Rules That Survive Contact With Reality
Rule 1: Default to Sonnet for Daily Work
Use the cheaper, faster model for scaffolding, refactors, and repetitive edits. Switch to Opus for genuinely hard architecture decisions, subtle bugs, or security-sensitive reviews. Model hopping is a budget lever—if everything is “Opus by default,” you are choosing maximum spend.
Rule 2: Do Not Switch Models Mid-Session
Changing models mid-thread often invalidates caching assumptions and can force a wholesale rebuild of the prompt stack. Pick a lane per session.
Rule 3: Keep CLAUDE.md Lean (Target Under ~200 Lines)
Long instruction files are re-sent as part of the stable prefix. One public write-up reported ~63% token reduction after trimming a bloated rules file to essentials. Treat repo rules like an API contract: short, explicit, and testable.
Rule 4: CLI First, MCP Second
MCP is powerful, but each tool and schema consumes prompt real estate. Prefer built-in terminal workflows and direct repo operations when they are equally reliable.
Rule 5: Spend Tokens on Planning Up Front
A slightly longer upfront plan—acceptance criteria, edge cases, file ownership—often reduces retries. Three cheap repair rounds can exceed one correct pass with a clear plan.
Rule 6: Point to Paths, Do Not Paste Files
Pasting large file contents into chat permanently bloats history. Use path references, @ file picks, or scoped reads so the tool pulls what it needs when it needs it.
Common Myths—Debunked
| Myth | Reality |
|---|---|
“Frequent /clear saves money” | Often no. If a warm cache was helping, clearing can force a full rebuild of large stable sections at standard input pricing. |
| “A 1M window means unlimited chat” | No. Cost grows with tokens sent; “fits in window” ≠ “cheap to attend to.” |
| “Output tokens dominate the bill” | Usually false in agent loops. A typical decomposition is closer to: context loading ~45%, history ~25%, output ~20%, retries ~10%—with input-side context dominating (some analyses place ~99%+ of total tokens on the input side across turns). |
Treat percentages as diagnostic heuristics from vendor and community telemetry—not universal laws—but the directional point holds: optimize the input stack first.
Summary: A Simple Mental Model
A pragmatic cost model:
Session cost ≈ (input tokens × input price) + (output tokens × output price), with input tokens dominated by fixed preamble + history.
Three levers matter most:
- Raise cache hit rate on stable instructions and tooling (session hygiene, avoid unnecessary cold starts).
- Shrink fixed context (lean rules files, fewer redundant tool schemas).
- Cut wasted turns (planning, smaller scopes, fewer model swaps, less copy-paste pollution).
If you only remember one sentence: in agent workflows, “context engineering” is budget engineering.
References
- Prompt Caching — Anthropic Official Docs
- How Prompt Caching Actually Works in Claude Code — Claude Code Camp
- Why Claude Code Drains Token Usage After Session Resume — BSWEN
- Claude Code Session Management Guide — Claude Lab
- Prompt Caching with Claude — Anthropic
- Cut Anthropic API Costs 90% with Prompt Caching — Markaicode
- How I Cut Claude's Token Usage by 63% with One CLAUDE.md Change — DEV Community
- Best Practices for Claude Code — Anthropic
- How to Reduce Token Usage in Claude Code — BSWEN
