How to Cut Claude API Costs by 80% with an Automatic Model Switcher Agent for Claude Code
Most developers using Claude Code leave serious money on the table every single day. Not because they’re doing anything wrong — but because Claude Code defaults to whatever model you configured globally, regardless of whether the task actually needs that level of intelligence.
Asking Claude Opus 4.7 to rename a variable, format a serialiser, or answer a quick “what does this function do?” question costs 5× more per token than asking Claude Haiku 4.5 to do the same thing. The output is identical. The bill is not.
This post walks through the Claude Code Model Switcher Agent — an AGENT.md spec that automatically routes every task to the cheapest model that can handle it, displays what was selected and why, reports token spend in real time, escalates intelligently when a cheaper model hits its limits, and logs every session to disk for cumulative savings analysis.
The Problem: One Model for Every Task is Wasteful
Anthropic’s current Claude lineup (as of May 2026) offers three distinct tiers:
| Model | API ID | Context Window | Cost per Million Tokens (in / out) |
|---|---|---|---|
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 | 200k | $1 / $5 |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | 1M | $3 / $15 |
| Claude Opus 4.7 | claude-opus-4-7 | 1M | $5 / $25 |
The pricing spread is not subtle. Opus costs 5× more per input token and 5× more per output token than Haiku. For a typical developer task — say, a 20k-in / 5k-out refactor — that’s $0.135 on Sonnet versus $0.225 on Opus. Over a month of daily Claude Code usage, that gap compounds into hundreds of dollars in wasted spend.
The solution isn’t to downgrade to a cheaper model permanently. Some tasks genuinely need Opus. The solution is to use the right model for each task, automatically, without thinking about it.
The Solution: A Three-Tier Model Switcher Agent
The Model Switcher Agent lives in your Claude Code configuration as an AGENT.md file (deployed as a subagent in .claude/agents/ or inline in CLAUDE.md). Before any API call is made, it classifies the incoming task and selects the minimum-viable model from three tiers.
Tier 1 — Claude Haiku 4.5 (the default for simple work)
Haiku handles the majority of real-world developer tasks: single-turn questions, text formatting and transformation, boilerplate generation under 50 lines, short summarisation, syntax checks, translation, and classification. If the task is single-turn, fits comfortably within 50k tokens, and doesn’t require tool chaining — Haiku it is.
Tier 2 — Claude Sonnet 4.6 (the workhorse for code tasks)
Sonnet takes over for anything that requires moderate intelligence and context: multi-file refactors, writing tests for existing code, reviewing pull requests, generating Django models or React components, API integration scaffolding, documentation of medium length, and debugging with up to five tool calls. For context windows between 50k and 500k tokens, Sonnet is the default.
Tier 3 — Claude Opus 4.7 (reserved for genuinely hard problems)
Opus is used only when Sonnet is provably insufficient: complex architectural decisions, agentic workflows with more than ten sequential tool calls, long-horizon debugging across large codebases, research synthesis requiring nuanced judgment, context exceeding 500k tokens, or any task that Sonnet returned INCOMPLETE on after a retry.
The key discipline: Opus requires justification. It is never the default.
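Expressed as code, the routing rule is compact. A minimal sketch, assuming the agent has already estimated the context size and expected tool-call count for the incoming task (the spec states these rules in prose; the names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    est_context_tokens: int   # estimated input context for the task
    tool_calls_expected: int  # sequential tool calls the task will need
    multi_turn: bool          # needs back-and-forth or tool chaining

MODELS = {
    1: "claude-haiku-4-5-20251001",
    2: "claude-sonnet-4-6",
    3: "claude-opus-4-7",
}

def select_tier(task: TaskProfile) -> int:
    """Return the minimum-viable tier; tier 3 needs a specific trigger."""
    if task.est_context_tokens > 500_000 or task.tool_calls_expected > 10:
        return 3  # huge context or long agentic workflow: Opus territory
    if task.est_context_tokens > 50_000 or task.multi_turn or task.tool_calls_expected > 0:
        return 2  # code work with moderate context or tool use: Sonnet
    return 1      # single-turn, small-context work: Haiku by default

model = MODELS[select_tier(TaskProfile(2_000, 0, False))]
# -> "claude-haiku-4-5-20251001"
```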
How the Agent Works in Practice
Step 1: Classification and Model Load
The moment you submit a task, the agent evaluates it against the tier rules before touching the API. Then it prints a load banner:
```
╔══════════════════════════════════════════════════════╗
║ 🤖 MODEL LOADED: claude-haiku-4-5-20251001           ║
║ 📊 Tier: 1 | Task: format Django serialiser          ║
║ 📏 Est. Tokens: ~2,000 in / ~800 out                 ║
╚══════════════════════════════════════════════════════╝
```
You always know which model is running and why, before a single token is generated.
Step 2: Real-Time Step Reporting
During execution, after each tool call or sub-task:
```
⚡ [Step 2/5] in: 3,412 | out: 812 | running total: 6,824
```
This isn’t just visibility — it’s an early-warning system. If token counts are climbing faster than expected, you can intervene before hitting an expensive ceiling.
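The tracker behind that line is a few lines of state. A sketch, assuming per-step token counts are available from each API response:

```python
class StepReporter:
    """Accumulates per-step token usage and prints a running total."""

    def __init__(self, total_steps: int):
        self.total_steps = total_steps
        self.step = 0
        self.running_total = 0

    def report(self, tokens_in: int, tokens_out: int) -> None:
        self.step += 1
        self.running_total += tokens_in + tokens_out
        print(f"⚡ [Step {self.step}/{self.total_steps}] "
              f"in: {tokens_in:,} | out: {tokens_out:,} | "
              f"running total: {self.running_total:,}")

reporter = StepReporter(total_steps=5)
reporter.report(3_412, 812)
# ⚡ [Step 1/5] in: 3,412 | out: 812 | running total: 4,224
```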
Step 3: Automatic Escalation
If a cheaper model hits a limit mid-task, the agent escalates automatically without losing accumulated context:
```
⚠️ AUTO-ESCALATE: claude-haiku-4-5 → claude-sonnet-4-6
   Reason: Context growing beyond 50k tokens.
```
Escalation is triggered by four specific conditions: context exceeding the tier threshold (50k for Haiku, 500k for Sonnet), an incomplete response that persists after one retry, more than eight sequential tool calls remaining, and a tool-call failure rate above 30%.
Critically, escalation carries forward all existing context. The task never restarts.
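The trigger check itself is small. A sketch using the thresholds listed above (the function and parameter names are illustrative, not from the spec):

```python
TIER_CONTEXT_LIMITS = {1: 50_000, 2: 500_000}  # Haiku and Sonnet ceilings

def should_escalate(tier: int, context_tokens: int, incomplete_retries: int,
                    tool_calls_remaining: int, tool_failure_rate: float) -> bool:
    """True if any escalation trigger fires for the current tier."""
    if tier >= 3:
        return False  # already on Opus; there is nothing above it
    return (
        context_tokens > TIER_CONTEXT_LIMITS[tier]  # context past the tier ceiling
        or incomplete_retries >= 1                  # still INCOMPLETE after one retry
        or tool_calls_remaining > 8                 # long tool chain still ahead
        or tool_failure_rate > 0.30                 # tool calls failing too often
    )

# On escalation the full message history moves to the next tier; never a restart.
```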
Step 4: Completion Summary with Savings
When the task finishes:
```
╔═══════════════════════════════════════════════════════════╗
║ ✅ TASK COMPLETE                                          ║
║ Model: claude-haiku-4-5-20251001              Tier: 1     ║
║ In: 4,218 tokens              Out: 1,042 tokens           ║
║ Cost: $0.0094                                             ║
║ ───────────────────────────────────────────────────────── ║
║ 💰 Savings vs Sonnet 4.6: $0.0189 (67%)                   ║
║ 💰 Savings vs Opus 4.7:   $0.0377 (80%)                   ║
║ 📝 Logged → token-utilisation.log                         ║
╚═══════════════════════════════════════════════════════════╝
```
Every session, you see exactly what you paid and exactly what you saved versus the alternatives.
Token Utilisation Logging
Every task — completed or failed — is appended as a single JSON line to token-utilisation.log:
{"timestamp":"2026-05-11T10:45:22+05:30","session_id":"s002","task_id":"t002","task_type":"multi-file refactor","task_description":"Refactor meetingevents Zoom sync views to CBV","tier_selected":2,"model_used":"claude-sonnet-4-6","model_escalated_from":null,"steps":5,"tokens_in":22400,"tokens_out":8100,"total_tokens":30500,"cost_usd":0.000189,"savings_vs_opus_usd":0.000315,"savings_vs_opus_pct":62.5,"status":"completed","escalation_occurred":false}
The log schema captures: timestamp, session and task IDs, task type and description, tier selected, model used, escalation source (if any), step count, tokens in and out, cost in USD, savings versus Opus in both USD and percentage, task status, and escalation flag.
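Writing a record is a single append. A sketch of the writer, with illustrative sample values (log_task is not part of the spec):

```python
import json
from datetime import datetime, timezone

def log_task(path: str, record: dict) -> None:
    """Append one task record as a single JSON line (JSONL)."""
    record.setdefault("timestamp",
                      datetime.now(timezone.utc).astimezone().isoformat())
    with open(path, "a") as f:
        f.write(json.dumps(record, separators=(",", ":")) + "\n")

log_task("token-utilisation.log", {
    "session_id": "s002", "task_id": "t003",
    "task_type": "write tests", "tier_selected": 2,
    "model_used": "claude-sonnet-4-6", "model_escalated_from": None,
    "steps": 3, "tokens_in": 9_400, "tokens_out": 3_100,
    "total_tokens": 12_500, "cost_usd": 0.0747,
    "status": "completed", "escalation_occurred": False,
})
```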
Because it’s plain JSONL, you can query it with standard tools:
```bash
# Total cumulative savings vs always-using-Opus
cat token-utilisation.log | python3 -c "
import sys, json
lines = [json.loads(l) for l in sys.stdin if l.strip()]
total_saved = sum(l['savings_vs_opus_usd'] for l in lines)
total_cost = sum(l['cost_usd'] for l in lines)
print(f'Tasks: {len(lines)}')
print(f'Total cost: \${total_cost:.4f}')
print(f'Total saved: \${total_saved:.4f}')
"
```
Where to Place the AGENT.md File
The agent spec is a markdown file. Claude Code reads it as a subagent instruction or as part of CLAUDE.md. There are three deployment options:
Option A — Project subagent (recommended). Place AGENT.md at .claude/agents/model-switcher.md inside your project. It loads as a named subagent invokable via /project:model-switcher. Good for project-specific model routing rules.
Option B — Global subagent. Place it at ~/.claude/agents/model-switcher.md. Available across every project on your machine without per-project setup.
Option C — Inline in CLAUDE.md. Paste the contents into ./CLAUDE.md (project-level, committed to git) or ~/.claude/CLAUDE.md (global). The switching logic is then active in every Claude Code session automatically, without invoking a named agent.
A note on naming: Claude Code’s canonical file is CLAUDE.md. Renaming to AGENTS.md makes the spec compatible with OpenAI Codex CLI, Cursor, Zed, and other tools that follow the open AGENTS.md standard. For teams using multiple AI coding assistants, AGENTS.md at the project root is the most portable choice.
The Cost Maths
The savings calculation uses a straightforward formula:
```
cost_usd        = (tokens_in  / 1_000_000 × input_price)
                + (tokens_out / 1_000_000 × output_price)

savings_vs_opus = (tokens_in  / 1_000_000 × (5.00  − input_price))
                + (tokens_out / 1_000_000 × (25.00 − output_price))
```
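The same formulas in Python, with the lineup prices hard-coded (a sketch; a real deployment would treat prices as configuration):

```python
# Price per million tokens (input, output), from the lineup table above.
PRICES = {
    "claude-haiku-4-5-20251001": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-7": (5.00, 25.00),
}

def cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    in_price, out_price = PRICES[model]
    return tokens_in / 1_000_000 * in_price + tokens_out / 1_000_000 * out_price

def savings_vs_opus(model: str, tokens_in: int, tokens_out: int) -> float:
    return (cost_usd("claude-opus-4-7", tokens_in, tokens_out)
            - cost_usd(model, tokens_in, tokens_out))

# The 20k-in / 5k-out refactor from the example below:
print(cost_usd("claude-haiku-4-5-20251001", 20_000, 5_000))         # ≈ 0.045
print(savings_vs_opus("claude-haiku-4-5-20251001", 20_000, 5_000))  # ≈ 0.18
```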
A concrete example — a 20k-in / 5k-out refactor task:
| Model | Input cost | Output cost | Total |
|---|---|---|---|
| Haiku 4.5 | $0.020 | $0.025 | $0.045 |
| Sonnet 4.6 | $0.060 | $0.075 | $0.135 |
| Opus 4.7 | $0.100 | $0.125 | $0.225 |
Routing this task to Haiku instead of Opus saves $0.180 — an 80% reduction. Routing to Sonnet instead of Opus saves 40%. Multiply by the number of tasks you run in a working day, and the cumulative savings become significant within a week.
Advanced Cost Levers
Two additional optimisations are built into the guardrails section of the agent spec:
Prompt caching. For tasks that repeatedly send a large system prompt or context blob (common in agentic workflows), marking the reusable blocks with cache_control reduces costs further. The log schema includes a cached_tokens_in field specifically for tracking cached vs uncached input separately, so savings calculations remain accurate.
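As a sketch of what that looks like with the Anthropic Python SDK, marking a large reusable context block for caching (the model ID comes from the table above; big_project_context is a stand-in for your own blob, and caching details may differ by SDK version):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

big_project_context = open("project-context.md").read()  # illustrative reusable blob

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": big_project_context,
        "cache_control": {"type": "ephemeral"},  # cache everything up to this marker
    }],
    messages=[{"role": "user", "content": "Refactor the Zoom sync views to CBVs."}],
)

# Cached vs uncached input, feeding the log's cached_tokens_in field:
u = response.usage
print(u.cache_creation_input_tokens, u.cache_read_input_tokens, u.input_tokens)
```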
Message Batches API. For non-interactive, parallelisable tasks — processing a large dataset, generating documentation for 200 functions, running analysis across multiple files — the Batches API applies an automatic 50% discount on both input and output tokens. The log marks these entries with "batch_mode": true. Combined with Haiku-tier routing, batch processing can reduce costs to a small fraction of the interactive Opus baseline.
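And a sketch of the documentation-generation case as a batch submission (function_sources is a stand-in for your own snippets; "batch_mode": true is the agent's log field, not an API parameter):

```python
import anthropic

client = anthropic.Anthropic()

function_sources = ["def add(a, b):\n    return a + b"]  # illustrative snippets

# One request per function; non-interactive, so batch pricing applies.
requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-haiku-4-5-20251001",  # Haiku tier + batch discount stack
            "max_tokens": 512,
            "messages": [{"role": "user",
                          "content": f"Write a docstring for:\n{src}"}],
        },
    }
    for i, src in enumerate(function_sources)
]

batch = client.messages.batches.create(requests=requests)
print(batch.id, batch.processing_status)  # poll until "ended", then fetch results
```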
Guardrails That Prevent Gaming
A model switcher is only useful if it has hard rules preventing the wrong kind of optimisation. The agent spec includes five:
Never default to Opus without explicit justification. Opus requires a specific trigger condition — not just “this seems complex.”
Never truncate output to save tokens. If a response would be incomplete at the current tier, the correct action is to escalate — not to return partial output at a lower cost.
Always log, even on failure. Setting "status": "failed" on failed tasks is mandatory. A log with gaps is not a source of truth.
Prompt caching is additive, not a substitute. Cache tokens are tracked separately so they don’t distort the core cost and savings figures.
Escalation carries context forward. Never restart a task on escalation. The accumulated tool calls, context, and intermediate results must be preserved.
Real-World Token Estimates
The agent spec includes a full pre-task estimation table. Here are the most common Django/React developer tasks and their typical ranges:
| Task | Est. Input | Est. Output | Tier |
|---|---|---|---|
| Quick Q&A / factual lookup | 500–2k | 200–1k | Haiku |
| Short summarisation | 2k–10k | 500–2k | Haiku |
| Format / transform / boilerplate | 1k–5k | 500–2k | Haiku |
| Single-file code edit | 3k–15k | 1k–5k | Sonnet |
| Multi-file refactor | 10k–80k | 5k–20k | Sonnet |
| Write tests | 5k–30k | 2k–10k | Sonnet |
| Django / React component generation | 5k–25k | 3k–12k | Sonnet |
| Full codebase analysis | 50k–500k | 10k–50k | Sonnet / Opus |
| System architecture or sprint planning | 20k–150k | 10k–50k | Opus |
| Agentic workflow (10+ tool calls) | 50k–200k | 20k–60k | Opus |
For most day-to-day Django/React development work, the majority of tasks land in Tier 1 or Tier 2. Opus should be rare — reserved for the genuinely hard problems.
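Encoded as data, that table becomes a lookup the agent can consult before the first API call (a sketch, with the ranges copied from above):

```python
# task type -> (min_in, max_in, min_out, max_out, default_tier)
TASK_ESTIMATES = {
    "quick_qa":              (500,    2_000,   200,    1_000,  1),
    "short_summary":         (2_000,  10_000,  500,    2_000,  1),
    "format_boilerplate":    (1_000,  5_000,   500,    2_000,  1),
    "single_file_edit":      (3_000,  15_000,  1_000,  5_000,  2),
    "multi_file_refactor":   (10_000, 80_000,  5_000,  20_000, 2),
    "write_tests":           (5_000,  30_000,  2_000,  10_000, 2),
    "component_generation":  (5_000,  25_000,  3_000,  12_000, 2),
    "codebase_analysis":     (50_000, 500_000, 10_000, 50_000, 2),  # Opus past 500k
    "architecture_planning": (20_000, 150_000, 10_000, 50_000, 3),
    "agentic_workflow":      (50_000, 200_000, 20_000, 60_000, 3),
}
```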
Getting Started
The full agent spec (AGENT.md) and README are available on GitHub at github.com/neps-in/model-switcher-agent — including all tier rules, escalation conditions, log schema, display format specifications, and cost formulas.
To deploy it today:
- Clone or download the repo: git clone https://github.com/neps-in/model-switcher-agent
- Copy AGENT.md to .claude/agents/model-switcher.md in your project root.
- Start a Claude Code session.
- The agent is available as /project:model-switcher.
- A token-utilisation.log file will be created automatically on first task completion.
After a week of usage, run the log analysis snippet to see your cumulative savings. The numbers tend to be surprising.
Conclusion
Intelligent model routing is one of the highest-leverage optimisations available to developers using Claude Code at scale. The gap between Haiku and Opus pricing is a 5× spread on every input and output token — and for the large majority of tasks in a typical development workflow, Haiku or Sonnet produces identical results.
The Model Switcher Agent makes this optimisation automatic, transparent, and auditable. You always know which model ran, what it cost, what was saved, and why an escalation occurred. The log gives you the data to refine tier assignments over time as your task patterns become clearer.
Write the spec once. Let it compound.
Napoleon Arouldass is the Founder and Chief Architect of GrandAppStudio.in — a software studio in Puducherry, India, specialising in Django/Python backends, React frontends, and applied AI engineering. With 15+ years of full-stack experience, he writes about developer productivity, AI tooling, and building production-grade systems.
