context is the new compute

in march 2026 anthropic shipped a 1 million token window at standard pricing and inference costs collapsed 240x in 18 months. compute is cheap. context is expensive. here's the discipline that actually separates production agents from demos in 2026.

April 7, 2026 / 7 min read

two things happened in march 2026 that should have been the end of the story.

on march 13, anthropic shipped claude opus 4.6 and sonnet 4.6 with 1 million token context windows, generally available, at standard pricing. no premium multiplier for long requests. a 900k token call is billed at the same per-token rate as a 9k one. that same window, a year earlier, would have been a research preview with a 2x input surcharge.

at the same time, the cost of frontier-grade inference has been falling roughly 10x per year, and on some measures faster. a16z and others have tracked the cost of gpt-4 equivalent intelligence dropping from about $180 per million tokens to under $1 in 18 months. call it 240x cheaper. call it "llmflation." call it whatever you want. the practical meaning is: compute is no longer the thing standing between you and a working agent.

if you'd described this situation to me in 2023 i would have said that's it, game over, the agents will just work. the agents don't just work. the langchain 2026 state of ai agents report says 57% of organizations have agents in production, and 32% cite quality as the number one barrier to scaling them. a public benchmark that ran a six-task agentic flow against a standard crm, 10 consecutive times, landed at a 25% end-to-end success rate.

compute is cheap. windows are huge. and on that benchmark the agents still finish the job one time in four. something else is the bottleneck. it's context.

the old constraint is gone

for three years, the bottleneck in every ai product i shipped was token budget. tokens were slow, tokens were expensive, windows were tight. 8k was luxurious. 32k was a feature. 200k was a beta. every design decision bent around "what can we actually fit in the prompt." trimming and compressing and prioritizing felt like the work because it was the work.

it isn't anymore. the constraint that shaped 2023 and 2024 is not the constraint shaping 2026.

a new one moved in as the old one left. it's not new to production teams. it's just finally getting a name that sticks. it's called context engineering, and it's the discipline of deciding what reaches the model at each step of an agent's life.

context engineering is not prompt engineering

prompt engineering was a hobby. it was you, a text box, and a model, trying to phrase a question better. it was worth doing in 2023 when you had one turn, one prompt, and a 4k window. it belonged to that era.

context engineering is infrastructure. it's the system of decisions that determine what information the model sees at every inference: the system prompt, the tool definitions, the conversation history, the retrieved documents, the memory tier, the structured examples, the output schema, the failure traces. most of those things are not text you wrote. they're generated, retrieved, ranked, filtered, summarized, pruned.

in the production agent systems i've built, context engineering is roughly 70% of the real work. the prompt is the last 5%.

the numbers nobody warns you about

here are the two numbers that rearranged how i build on regent and mailpilot.

first: a single mcp tool definition costs somewhere between 550 and 1,400 tokens. if your agent connects to three services, say github, slack, and sentry, and those servers expose dozens of tools between them, you've burned roughly 55,000 tokens before the user types a word. that is just the tool definitions. on a 200k window, that is about 27% of the budget gone on tooling the agent may never use in this particular turn. on the 1m window, it's small in percentage but still the first thing the model sees, and attention is not uniform across a million tokens.
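
the arithmetic behind those figures is worth running once. this is a back-of-envelope sketch: the per-tool range and the 55,000 total come from the paragraph above, while the implied tool count is derived from them and is not a measurement of any particular mcp server.

```python
# back-of-envelope: how much of a window do tool definitions eat?
# per-tool range and the 55,000 total are taken from the text;
# the implied tool counts are derived from them, not measured.
TOKENS_PER_TOOL_LOW = 550
TOKENS_PER_TOOL_HIGH = 1_400
TOOL_OVERHEAD = 55_000


def share_of_window(overhead_tokens: int, window_tokens: int) -> float:
    """fraction of the window consumed before the user says anything."""
    return overhead_tokens / window_tokens


# how many tool definitions would add up to the stated overhead
max_tools = TOOL_OVERHEAD // TOKENS_PER_TOOL_LOW   # if every definition is small
min_tools = TOOL_OVERHEAD // TOKENS_PER_TOOL_HIGH  # if every definition is large

print(f"implied tool count: {min_tools} to {max_tools}")
print(f"200k window: {share_of_window(TOOL_OVERHEAD, 200_000):.1%}")
print(f"1m window:   {share_of_window(TOOL_OVERHEAD, 1_000_000):.1%}")
```

so 55,000 tokens implies somewhere around 39 to 100 tool definitions across the three servers, which is 27.5% of a 200k window and 5.5% of a 1m window.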

second: an agent with 1 million tokens of poorly curated context will tend to underperform an agent with 50,000 tokens of well-curated context. teams running real evaluations keep converging on this. more context is not better. cleaner context is better. the model is not a database. it's an attention mechanism, and attention is a zero-sum resource across the window.

those two numbers put everything in a different frame. compute is cheap. context is expensive, not in dollars but in accuracy. every irrelevant token in the window is a small dilution of the model's focus. the million-token window is not permission to dump everything. it's a trap you walk into if you treat it that way.

what this looks like on a real system

on regent, the ai executive assistant i build, the email pipeline has three distinct stages: ingestion, intent extraction, and response drafting. each stage sees a completely different context window. not the same prompt with extras. a different window.

ingestion sees the raw email, the sender's recent history, and nothing else. no tools. no memory. no user instructions. the job is narrow. the context matches.

intent extraction sees the normalized email, three retrieved examples of similar past intents, and a compact output schema. no tools. the job is classification. more context dilutes it.

response drafting is where it gets interesting. it sees the email, the extracted intent, a rag query against the user's memory (top 5 items, ranked by recency and relevance), the user's voice profile, and exactly the tools needed for the identified intent. if the intent is "schedule a meeting," the calendar tools load. if the intent is "politely decline," the calendar tools do not load. that tool gating saves roughly 15k tokens per request and, more importantly, removes the agent's ability to wander off into actions that were never on the table.
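
a minimal sketch of what intent-based tool gating could look like. this is not regent's actual code: the tool names, the token sizes, and the intent labels are all hypothetical, chosen only to mirror the "schedule a meeting" and "politely decline" cases above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    approx_tokens: int  # rough size of the tool definition in the prompt


# hypothetical registry; names and sizes are made up for illustration
CALENDAR_TOOLS = (
    Tool("calendar.find_free_slots", 900),
    Tool("calendar.create_event", 1_100),
)
EMAIL_TOOLS = (Tool("email.send_reply", 700),)

# intent -> the smallest tool set that can finish that job
TOOLS_BY_INTENT: dict[str, tuple[Tool, ...]] = {
    "schedule_meeting": CALENDAR_TOOLS + EMAIL_TOOLS,
    "politely_decline": EMAIL_TOOLS,
}


def tools_for_intent(intent: str) -> tuple[Tool, ...]:
    """return only the tools mapped to this intent; unknown intents get none."""
    return TOOLS_BY_INTENT.get(intent, ())


def tool_tokens(tools: tuple[Tool, ...]) -> int:
    """total prompt cost of a tool set, in approximate tokens."""
    return sum(t.approx_tokens for t in tools)


for intent in ("schedule_meeting", "politely_decline", "something_else"):
    tools = tools_for_intent(intent)
    print(intent, [t.name for t in tools], tool_tokens(tools))
```

the design choice that matters is the default: an intent the classifier doesn't recognize gets an empty tool set rather than the full one, so a misclassification costs a retry instead of an unplanned action.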

three stages, three windows, each budgeted, each scoped, each loading only what the stage needs. the difference between this approach and "stuff everything into one big prompt" is the difference between a system that works 80% of the time and a system that works 30% of the time. i've built both.
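
here is one way a per-stage budget might be enforced in code, as a sketch under stated assumptions. `ContextItem`, `compose`, and the priority scheme are hypothetical names invented for this example, and `count_tokens` is a crude four-characters-per-token stand-in that you would replace with your model's real tokenizer.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("context")


@dataclass
class ContextItem:
    label: str             # e.g. "system_prompt", "memory:3"
    text: str
    priority: int          # higher means more important to keep
    required: bool = False  # required items are never pruned


class BudgetExceeded(Exception):
    """raised when even the required items do not fit the stage budget."""


def count_tokens(text: str) -> int:
    """crude stand-in (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def compose(stage: str, items: list[ContextItem], budget: int) -> list[ContextItem]:
    """keep items within a hard token budget, pruning lowest priority first."""
    kept = list(items)
    total = sum(count_tokens(i.text) for i in kept)

    # drop optional items, lowest priority first, until the context fits
    optional = sorted((i for i in items if not i.required), key=lambda i: i.priority)
    for item in optional:
        if total <= budget:
            break
        kept = [i for i in kept if i is not item]
        total -= count_tokens(item.text)
        logger.info("stage=%s pruned=%s", stage, item.label)

    if total > budget:
        raise BudgetExceeded(
            f"{stage}: required context is {total} tokens, budget is {budget}"
        )

    # log what actually goes into the window, item by item
    for item in kept:
        logger.info(
            "stage=%s item=%s tokens=%d", stage, item.label, count_tokens(item.text)
        )
    return kept
```

the call to the model only happens after `compose` returns, so an over-budget stage fails loudly before any tokens are spent. this version only prunes; a summarization step could slot in where the optional items are dropped.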

five rules i actually follow

i don't have a library of 50 context engineering patterns. i have five rules, and i violate them only when i've proven the violation is safe.

  1. every stage has a context budget. not a soft limit, not a guideline. a hard limit, expressed in tokens, enforced in code. if the composed context would exceed the budget, something gets pruned or summarized before the call happens.

  2. tools load per stage, not globally. the agent gets the smallest tool set that could possibly complete the current step. if i don't know what the step is, i classify it first with a tool-free call, then load tools for that classification.

  3. retrieval is a triple, not a single. i rank retrieved items on three axes: recency, relevance, and diversity. pure relevance gives you five near-duplicates and the agent hallucinates a sixth. diversity forces the model to see options instead of echoes.

  4. summarize at checkpoints. conversation history is not free. past a threshold, i replace the first n messages with a single summarized turn. the summary lives in a separate field so the model knows it's a compression, not a real exchange.

  5. no silent growth. every piece of context that enters the window is logged with its token count. when latency creeps or quality drops, i can look at the trace and see which stage started dragging in 3x the tokens it used to. silent context growth is how agents die in production, and it is invisible unless you instrument for it.
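
rule 3 is the easiest of the five to show in code. the sketch below uses a greedy selection in the style of maximal marginal relevance, which is my choice for illustration and not necessarily how any given system does it. the weights, the 30-day half-life, and the `Candidate` fields are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    relevance: float               # similarity to the query, 0..1, from your retriever
    age_days: float
    embedding: tuple[float, ...]


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recency(age_days: float, half_life_days: float = 30.0) -> float:
    """exponential decay: 1.0 for brand new, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def select(cands, k=5, w_rel=0.6, w_rec=0.2, w_div=0.2):
    """greedily pick k items, penalizing similarity to items already picked."""
    chosen: list[Candidate] = []
    pool = list(cands)
    while pool and len(chosen) < k:

        def score(c: Candidate) -> float:
            # redundancy: similarity to the closest already-chosen item
            redundancy = max(
                (cosine(c.embedding, s.embedding) for s in chosen), default=0.0
            )
            return (
                w_rel * c.relevance
                + w_rec * recency(c.age_days)
                - w_div * redundancy
            )

        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen


cands = [
    Candidate("a", 0.90, 0.0, (1.0, 0.0)),
    Candidate("a2", 0.89, 0.0, (1.0, 0.0)),  # near-duplicate of "a"
    Candidate("b", 0.70, 0.0, (0.0, 1.0)),
]
print([c.text for c in select(cands, k=2)])
```

ranked on relevance alone, the top two here would be "a" and its near-duplicate "a2". with the redundancy penalty, the second slot goes to "b" instead.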

why this is the new discipline

the meme online is that "prompt engineering is dead." it isn't. it got absorbed. prompt engineering is now one knob on a much bigger control panel, and the bigger panel is context engineering.

the engineers building production agents in 2026 are not prompt wizards. they are information architects. they spend their time on what enters the window and when. they instrument their pipelines so they can inspect the composition of the context at every step. they have opinions about retrieval, tool gating, summarization, memory tiers. they measure token budgets the way sre teams measure error budgets.

if you ignore this layer, no model upgrade will save you. a million-token window will not save you. the cost collapse will not save you. opus 4.6 will not save you. you'll ship an agent that looks great in the demo, succeeds 25% of the time in production, and nobody will understand why.

the compute problem is over. the context problem is the whole game now.

budget accordingly.
