Articles/AI & Tooling/Getting More Out of Claude Code: Prompting and Token Economy

Getting More Out of Claude Code: Prompting and Token Economy

A practical guide to driving Claude Code well: how to prompt for fewer wrong turns, how to keep the context window healthy, and a straight answer to whether running Opus 4.8 at max effort on everything is actually a good idea.

For months my Claude Code setup was one reflex: whatever the newest Opus is, pinned to max effort, on everything. Renaming a variable? Max effort. Reading a config file? Max effort. It felt responsible. Then I read the docs and watched my own /usage numbers, and changed my mind. Max effort is a great tool and a terrible default.

This is what I have learned about getting more out of Claude Code: prompting it for fewer wrong turns, keeping the context window healthy, and choosing models and effort so you spend tokens where they pay off. One framing first. The scarce resource is throughput, how much work you get done before you hit your plan's usage and rate limits. Waste tokens and you just hit that wall sooner.

It is a context window, not a chat

The most useful mental model: you are not having a conversation, you are managing a context window. Everything Claude knows right now shares one finite budget of tokens, the conversation history, every file it has read, every command's output, your CLAUDE.md, loaded skills, and the system prompt.

That budget is large but not infinite, and what matters is what happens as it fills. Claude Code auto-compacts, summarizing old turns and dropping stale output. That keeps the session alive but loses detail: the instruction from your third message can be gone by the fortieth. It is why long sessions slowly get dumber, not because the model got worse but because the early signal got compacted into noise. Almost everything below follows from this. Good prompting puts the right things in the window; good token economy keeps the wrong things out.

Prompt for fewer wrong turns

What you get back is mostly decided before Claude writes a line. Clever phrasing matters less than it used to; structure matters more, how specific you are, what context you hand over, and whether Claude can tell when it succeeded.

Be specific, up front. Vague requests get vague work. Compare:

TEXT
Fix the login bug.
TEXT
Logins fail for users whose email changed since signup. I think the session
token still keys off the old address. Look in src/auth/session.ts, write a
failing test that reproduces it, then fix it. Do not touch the OAuth flow.

The second names the symptom, points at the file, sets a boundary, and bakes in a way to know it worked. One pass instead of three.

Name the off-limits zones. Say which files Claude should not touch, and have it run git diff at the end to confirm. What is out of bounds is often more useful than what to do.

Show an example of the output you want. A small sample of the format steers Claude better than a paragraph describing it.

Say what to do, not what to avoid. "Write in plain paragraphs" works better than "do not use bullet points."

Give it something to verify against. The big one. A failing test, expected output, a screenshot, anything Claude can check itself against cuts the back-and-forth. A model that can watch a test go green does not need you as its test harness.

Plan before you code

For anything bigger than a quick edit, separate thinking from doing. Plan mode (Shift+Tab) lets Claude read the code and propose an approach without touching a file; you correct the misunderstanding while it is cheap, then let it build against a plan you agree on. The explore, plan, code, commit loop catches the expensive architectural mistakes before they become a diff you have to unwind.

Two habits make it work. Read the plan before approving it, do not rubber-stamp it. And interrupt early: press Esc the moment Claude heads the wrong way instead of watching a doomed attempt finish. When a session gets truly tangled, stop nursing it, /clear and restart with a tighter prompt. That is almost always faster than arguing your way out, and it costs nothing.

CLAUDE.md and memory, done right

CLAUDE.md is the context you do not want to retype: build commands, architecture, conventions, the facts Claude should always know. It loads every session, so keep it lean, a couple hundred lines at most. Past a certain size it both eats context and gets followed less, so a bloated CLAUDE.md is worse than a short one.

  • Facts, not procedures. "Use 2-space indentation" belongs here; a ten-step deploy runbook is a job for a skill, which loads only when needed.
  • Know the layers. Project file for team standards, ~/.claude/ for personal preferences across projects, a gitignored local file for machine-specific notes.
  • Scope rules to paths. Instructions for one corner of the codebase can live in .claude/rules/ so they load only when Claude touches matching files.
  • Tell compaction what to keep. A line in CLAUDE.md can say what survives summarization, so code and test output stay while small talk goes.

Skills take this further. A skill is a folder: a short SKILL.md that says what it is for, next to a references/ directory it links into. Claude loads only the one-line description at startup, the SKILL.md body when the skill triggers, and a reference file only when it needs that detail. This site's writing-articles skill works exactly that way, a lean SKILL.md beside references/frontmatter.md and friends. It is progressive disclosure all the way down, the best way to give Claude deep knowledge without it camping in context all session.

Subagents keep the main context clean

Some work is necessary but messy: grepping a big dependency, reading a wall of test output, researching an API across a dozen pages. You want the conclusion, not the firehose sitting in your main context for the rest of the session.

That is what subagents are for. A subagent runs in its own context window, does the noisy work, and returns only a summary; the thousand lines of log never touch your window, the three-sentence finding does.

Be clear-eyed, though: a subagent is not free. Each one boots with its own instructions and re-reads files to orient. For genuinely verbose, separable work that overhead pays off and can even lower total usage, since the firehose lives and dies in the subagent instead of re-inflating your context every turn. For a trivial task it is pure waste. Delegate when the work is big and noisy, not reflexively, and run a lighter model like Sonnet or Haiku for the grunt work.

Models and effort: the max-effort question

Here is the part I got wrong for months: Claude Code gives you two independent dials, and I treated one as fixed.

The first is the model. Mid-2026 the lineup is Fable 5 for long autonomous sessions, Opus 4.8 for complex reasoning and architecture, Sonnet 4.6 as the everyday workhorse, and Haiku 4.5 for fast, simple tasks. Opus draws down your usage faster than Sonnet per token and earns it on hard problems; on a one-line change it does not.

The second is effort, which on Opus 4.8 runs low, medium, high, xhigh, and max. Effort sets how much the model thinks before answering, and thinking tokens count as output, the heaviest kind. The detail worth tattooing somewhere: the default on Opus 4.8 is high, not max. Max removes the thinking cap and applies to the current session only, and the docs are blunt, it "can show diminishing returns and is prone to overthinking. Test before adopting broadly."

So my max-effort-on-everything habit spent deep deliberation on tasks that needed none, sometimes reasoning worse by talking itself in circles. The fix is not a clever prompt, it is the dial:

  • High by default. It is the default for a reason, right for most real work.
  • Max for genuinely hard problems. Gnarly architecture, a subtle bug, anything ambiguous enough to reward deliberation. Deliberately, not reflexively.
  • Medium or low for the routine. Small refactors, mechanical edits, lookups. The thinking budget is wasted there.
  • Sonnet for most daily coding, Opus for decisions that turn on raw capability, Haiku or a subagent for grunt work.

A third, separate dial: /fast. Fast mode is not a dumber model, it is the same Opus at lower latency, drawing down usage faster for the speed. It is independent of effort. Worth it for live debugging where you wait on each turn, wasteful on background work nobody is watching.

Spending tokens well

With the dials set sensibly, the rest is hygiene.

Watch the window. /context shows what is eating your space; /usage shows what the session used and where. The numbers surprise you the first time.

Compact and clear on purpose. /compact summarizes a long but still-relevant session, and you can focus it, for example to keep the API changes you are working on. /clear wipes the slate for unrelated work. Clearing is not losing progress; stale context taxes every later message.

Let caching work. Claude Code auto-caches the stable parts of your context, the system prompt, tools, CLAUDE.md, the early turns, so they are not reprocessed each request; a cached read costs about a tenth of a fresh one. Editing earlier content invalidates everything after it, one more reason to keep CLAUDE.md stable and append rather than rewrite.

Do not dump the haystack to find the needle. Reading a 50,000-line log to find three errors is the classic budget-killer. Grep first, then let Claude read the matches.

Prefer surgical edits and lean output. Claude already edits with targeted diffs, which is good for tokens. Asking for less explanation helps on the output side, though it is a smaller lever than the internet thinks.

Myths vs. facts

Plenty of token-saving advice circulates. Sorted:

"Run Opus at max effort on everything." Mostly false. High is the default for a reason; max is a precision tool that shows diminishing returns elsewhere. Always-max burns more and sometimes reasons worse.

"A million-token context window means you can stop worrying about context." False. More runway, not a license to ignore it. Quality degrades from compaction well before any hard ceiling.

"This third-party tool cuts your tokens 80 to 90 percent." Marketing, mostly. Unverified numbers. The real built-in levers are prompt caching and subagents, reach for those before someone's miracle wrapper.

"Running /clear throws away your work." False. It resets the conversation context only; your files and git history are untouched.

"Fast mode is just a faster version of the model." Misleading. Same model, faster, drawing down usage quicker, and a separate dial from effort. Use it when latency matters, not to stretch your limits.

"Magic words like 'think harder' make it reason more." Almost entirely false. Claude Code recognizes one keyword, ultrathink, for a deeper-reasoning turn; "think hard" and the rest are ordinary text. And ultrathink spends more tokens, not fewer.

Do token-saving mega-prompts work?

A tempting idea: paste one fixed block into your custom instructions and permanently cut your spend. A representative version:

TEXT
Launch subagents. Output only the modified or requested code block. Do not
provide line-by-line explanations, setup guides, introductory or concluding
remarks, or markdown commentary unless explicitly asked. Adopt an ultra-concise,
high-density communication style.

Half useful, half misunderstanding.

"Launch subagents." Cargo cult. There is a real feature nearby, Claude Code can orchestrate dynamic workflows, but you enable that with the ultracode setting from /effort, not by typing the words. As a blanket habit subagents cost tokens rather than save them: each boots with its own instructions and re-reads files, so firing one off for a two-line change is pure overhead. Delegate for genuinely noisy work, not as an incantation.

"Output only the modified code block." The legitimate half. Output tokens count several times heavier than input on Opus, so trimming explanation you did not need is a real saving. Just know it is a trim, not the near-zero these prompts promise, and a permanent "only ever output code, never explain" rule hurts the moment you want Claude to plan or teach.

"Ultra-concise style forces the thinking to compress." Here the reasoning breaks. The complaint is about thinking tokens, but a style instruction controls the output, not the thinking; they are different pools on different dials. The real fix for thinking spend is the effort dial, drop to high or medium, or move routine work to Sonnet. You can even tell Claude to think less directly. A communication-style instruction only trims the reply you read.

So the honest version is small: ask for concise output when you do not need the explanation. The rest is a no-op or the wrong tool. What actually stops Opus Max from eating your limits is not a prompt, it is turning effort down and picking the right model.