AI & MCP

Context engineering ate prompt engineering

The job stopped being how you phrase the request and became what you let into the window — and the token budget is now the real design surface.

By Tishan David 6 min read

The clever-phrasing era is over

For about three years the folklore held that the magic words mattered. “You are an expert.” “Think step by step.” “Take a deep breath.” People traded these like footy cards. Some of them even worked, for a while, on weaker models.

That era is done. The models got good enough at instruction-following that phrasing tricks stopped buying much, and the failures moved somewhere else entirely: into what the model could see when it answered. A sharp prompt sitting in a junk context still fails. A blunt prompt in a well-assembled context usually succeeds. The 2026 State of Context Management Report puts a number on the vibe shift — 82% of IT and data leaders now say prompting alone can’t carry production AI. The discipline that replaced it has a name: context engineering. It’s the engineering of agent state — deciding what the model knows, sees, and remembers at the moment it acts.

The problem: the window is not free real estate

The naive read of million-token context windows was “great, dump everything in.” Two things killed that.

First, context rot. Chroma Research’s mid-2025 study tested 18 frontier models — the Claude 4 family, GPT-4.1, Gemini 2.5, Qwen3 — on tasks extended from needle-in-a-haystack. Accuracy doesn’t hold flat to the documented limit and then fall off a cliff. It degrades non-uniformly as input grows, sometimes 30–50% before you reach the advertised ceiling. On 1M-token models the observable slump tends to start somewhere around 300–400K tokens. A 200K-window model can wobble at 50K. Rot is not overflow; it shows up long before you run out of room. And it gets worse the harder it is to distinguish the answer from the surrounding noise — semantic similarity decay, not raw length, is the real enemy.

Second, cost and latency. Time-to-first-token on a 500K-token prompt sits between 8 and 25 seconds on most frontier endpoints. A hot vector index answers in 50–150ms. The token bill is the same story: feeding a 500K-token corpus on every call versus retrieving ~4K tokens of relevant chunks is roughly $12,500/day against $100/day at 10K queries. Flat-rate long-context pricing softens the economics, not the rot.

So the window is a budget, not a backpack. Everything you put in competes for the model’s attention with everything else.

The deep dive: three sources of context, one budget

Curating context means choosing, per turn, from three supply lines and spending a fixed token budget across them.

Long context is for reasoning over a bounded evidence set you’ve already chosen. Whole-document analysis, a full conversation history, the complete tool-call log of an agent run. It’s the right tool when the set is small and stable.

Retrieval (RAG) is for deciding what the evidence set should be. It hasn’t been killed by big windows — it’s been promoted to the selection layer. The numbers are blunt: an order-preserving RAG approach with 48K well-chosen tokens beats full-context stuffing at 117K tokens by 13 F1 points, at about one-seventh the budget. Less, chosen well, outperforms more.

MCP-supplied context — tools, resources, memory pulled in over the Model Context Protocol — is where most teams are quietly bleeding budget. Tool definitions are descriptive prose plus JSON schema, easily 200–800 tokens each. Connect GitHub, Slack and Sentry — three servers, ~40 tools — and you can burn 143K of a 200K window on schemas before the user types anything. Seven servers will eat a third of the window cold.

The 2026 answer to that last one is to stop loading what you aren’t using. Anthropic’s tool-search tool reported ~85% token reduction in testing by loading a lightweight search interface up front and fetching individual schemas on demand. On the Claude API you mark the bulk of tools defer_loading: true and let the model discover them:

{
  "tools": [
    { "type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex" },
    { "type": "mcp_toolset", "mcp_server_name": "github", "defer_loading": true },
    { "type": "mcp_toolset", "mcp_server_name": "sentry", "defer_loading": true }
  ]
}

Two more levers worth knowing on the same surface: context editing (clear_tool_uses_20250919) prunes stale tool results out of the transcript without summarising, and compaction (compact_20260112, default trigger ~150K tokens) summarises older history server-side when a long run approaches the window. Pruning and summarising are different operations — reach for the one that matches whether the old content is noise or just bulky.

A worked example: structuring an agent’s window

Here’s the mental model I build agents against now — an explicit budget, not an accident of message history. For a coding agent on a ~200K working window I allocate roughly:

SYSTEM (frozen, cached)        ~3K   role, hard constraints, output contract
TOOL SEARCH INTERFACE          ~1K   lightweight; real schemas fetched on demand
RETRIEVED EVIDENCE            ~15K   top-k chunks, order-preserved, with source IDs
WORKING MEMORY                 ~8K   running task state, decisions, open questions
RECENT TURNS                  ~20K   verbatim; older turns compacted below
SCRATCH / TOOL RESULTS        ~30K   cleared once consumed (context editing)
HEADROOM                   leave ~40% empty to stay clear of rot

The non-obvious rule is the last line. You don’t fill the window — you defend the headroom, because rot starts well before the ceiling. Three concrete tactics fall out of this:

  • Freeze the prefix. Keep the system prompt and tool list byte-stable so prompt caching actually hits. A datetime.now() interpolated into the system block silently invalidates the cache on every call. Inject volatile facts late, in the message stream, not up front.
  • Retrieve to select, long-context to reason. Don’t paste the corpus; paste the 48K that earned their place, in document order, each tagged with a source ID so the model can cite.
  • Treat tool results as disposable. Once a result has been read and acted on, clear it. It’s already shaped the next step; keeping the raw payload around just feeds rot.

This is the same discipline I lean on when wiring local-first developer tooling into an agent loop — the win isn’t a better connector, it’s tighter control over what each connector is allowed to spend.

Real-world impact

The teams getting durable results in 2026 aren’t the ones with the cleverest system prompts. They’re the ones who can answer, for any given turn, “what’s in the window and why, and what did it cost.” That’s an observability problem as much as a design one: log cache_read_input_tokens, log retrieved-chunk IDs, log which tool schemas were actually loaded. When an agent goes sideways, the cause is almost always in the context — a stale tool result, an over-fetched corpus, a cache miss bloating the prefix — not in the wording. I’ve written up a couple of these post-mortems in my agent build case studies, and the pattern repeats: the bug was in what got loaded, every time.

Why it matters

Prompt engineering treated the model as a temperamental oracle you had to flatter. Context engineering treats it as a system with a working set you’re responsible for managing — closer to cache design or query planning than to copywriting. That reframing is the actual skill. The phrasing still matters at the margins, but the leverage moved to retrieval quality, token budgeting, and ruthless eviction of anything the model doesn’t need in front of it right now. Curate the window, defend the headroom, measure what you spend. The clever words were never the point.