25 Jan 2026 - 10 min read

Why Context Windows Kill AI Agent Performance

Intermediate

Deep Dive

TL;DR: Your AI agent gets dumber as its Context-window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. Think of it as the model’s working memory - everything in the context window is available for the model to reference when generating a response. fills up. The Ralph Wiggum loop fixes this by resetting context between tasks, keeping your agent in its “smart zone” where it performs best. Most implementations get this wrong by using compaction instead of resets.

Context Window Hygiene

This week at the gym I overheard a couple regulars, including a power lifter, discussing AI agents for optimized training program generation. I joined the conversation and quickly realized their problem wasn’t the AI itself; they were keeping long-running chats where multiple distinct tasks got smushed together. Their chat completions were all over the place because they were hitting context window limitations.

We went over basic context window hygiene, and that conversation was the catalyst for this blog post. I realized the Ralph Wiggum pattern, when applied correctly, minimizes context window usage on each iteration. It’s the same principle whether you’re generating training programs or writing code.

The pattern I see constantly: your agent starts strong, completes a few tasks perfectly, then gradually degrades. It misses obvious bugs. It rewrites code it just fixed. It ignores your instructions.

The issue isn’t the model itself; it’s where in the Context-window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. Think of it as the model’s working memory - everything in the context window is available for the model to reference when generating a response. the model is operating.

The Smart Zone vs The Dumb Zone

Let’s talk about how context windows actually work. Take Claude Sonnet 4.5 with its 200k token window. That sounds like a lot of room, but:

Performance zones:

0-30% (Smart Zone): Peak performance, optimal attention, fastest responses
30-60% (Okay Zone): Still functional but starting to degrade
60%+ (Dumb Zone): Severe degradation, unreliable outputs

For a 200k context window, that means:

First 60k tokens: Your agent is sharp
Next 60k tokens: It’s okay, not great
Last 80k tokens: It’s struggling

Why does this happen? Auto-regressive-model
A type of language model that generates text by predicting the next token based on all previous tokens in the sequence. Auto-regressive models like GPT and Claude process input sequentially, using the entire context history to inform each prediction. like Claude have to look at ALL previous tokens to predict the next one. When you have 150k tokens of conversation history, the model has to sift through massive amounts of text to find what’s relevant. Attention gets diluted. The Kv-cache
Key-Value cache · A memory optimization technique in transformer models that stores previously computed attention key and value matrices. KV cache enables faster inference by avoiding redundant computations, but becomes a memory bandwidth bottleneck as context windows grow. becomes a bottleneck.

Research backs this up. The Lost in the Middle paper from MIT showed that LLMs have a U-shaped performance curve; they remember stuff at the beginning and end, but information in the middle gets lost.

Even before you write your first prompt, 10% of that smart zone is already gone:

System prompt: 8.3%
System tools: 1.4%
Skills you’ve loaded: varies
Mcp
Model Context Protocol · An open standard by Anthropic that enables AI applications to securely connect to external data sources and tools. MCP provides a standardized way for AI models to access databases, APIs, and local resources through server implementations. tools: varies
Agent config files: varies

The entire script of “The Fellowship of the Ring”, one of my favorite movies ever, is about 53k Claude tokens or 47k Gemini tokens. That’s your smart zone.

What Ralph Wiggum Actually Is

The Ralph Wiggum loop was created by Geoffrey Huntley. It’s dead simple: a bash loop that gives an AI agent the exact same prompt over and over again.

Here’s the canonical implementation:

while true; do
  cat prompt.md | claude --dangerously-skip-permissions
done

That’s it. That’s the whole pattern.

The prompt.md file tells the agent:

Read the plan.md file (contains tasks)
Pick the most important task
Make the changes
Run tests
Commit and push
Mark the task as done in plan.md
Repeat

The genius is that each iteration gets a fresh context window. No compaction. No accumulated history. Just the prompt, the current state of the codebase, and the task list.

Why Most Implementations Gets It Wrong

I’ve seen so many Ralph implementations that miss the point. The most common mistakes:

Mistake 1: Using Compaction

Anthropic’s official Ralph plugin uses compaction. When it moves to the next task, it summarizes what happened previously instead of resetting the context.

The problem? The model doesn’t know what’s actually important. It guesses. It picks what it thinks matters and discards the rest. Critical information gets lost.

Mistake 2: Max Iterations

Some implementations have max iteration limits. The loop stops after X attempts.

But Ralph’s power is in letting it run. I’ve seen agents find performance issues I never would have noticed because they kept iterating after “completing” all the tasks. They found edge cases, tightened up error handling, improved naming.

If you’re watching the loop (human-in-the-loop), you can stop it when it goes off the rails. But don’t artificially limit it.

Mistake 3: Growing the Agent Config

Ryan Carson’s approach adds to the AGENTS.md file on each iteration. The file grows. Token count increases. If you do that, you’re pushing the model out of the smart zone.

Models are wordy by default. If you let them append to a config file on every iteration, you’re just adding tokens to the beginning of each prompt. Eventually you hit the dumb zone again.

Why This Pattern Matters

Back to those gym conversations: they were keeping a single chat window open for months, asking the agent to track their whole progress. “What is my 1RM deadlift?”, then “Analyze my protein intake for the past week”, then “Create a deload week schedule”. Each request added more tokens. After the first few messages, their agent was operating at 70-80% context capacity.

That’s when quality tanks. At that point your agent isn’t “thinking” clearly. It’s like asking someone to solve complex problems while they’re exhausted and distracted. Sure, they might get it done, but the quality suffers.

The same thing happens with LLMs. I’m sure you have seen agents start making circular changes once they hit high context usage. They edit a file, then edit it again differently, then second-guess themselves and revert. They’re stuck in a loop because they’ve either compacted away critical context or they’re operating in the dumb zone where attention is diluted.

When to Deviate from Canonical Ralph

Don’t get me wrong; there are good reasons to customize the pattern.

Parallel Ralphs (Raz Mike’s Ralphy script): Running multiple Ralph loops in parallel for independent tasks. Excellent idea. Each loop gets its own fresh context.

GitHub Issues Integration (Matt Perco’s version): Using actual GitHub issues as the task list. Clever. The filesystem still acts as the source of truth, just synced with GitHub.

Browser Testing (Ralphy with Vel’s agent browser tool): Adding browser automation for E2E testing. Makes sense for web projects.

The key is keeping the core principle: fresh context per iteration. If your modification preserves that, you’re good.

The Real Trade-Off

Ralph is slower than compaction. A lot slower.

Each iteration is a new session. The model has to read the entire codebase state again. There’s no accumulated knowledge from previous iterations.

Slow and correct beats fast and wrong. Every time.

I’ve seen compacted agents complete 20 tasks in an hour, half of them introducing bugs. I’ve seen Ralph complete 8 tasks in the same time, all of them solid.

Which would you rather have?

Another major trade-off of course is the $🔥. You can mitigate dollar burn on a subscription by running the agent when you sleep to take advantage of rate limit windows that you don’t use otherwise.

How to Actually Use Ralph

If you want to use canonical Ralph:

1. Keep your prompt.md simple Don’t write a novel. Be direct. Tell the agent what to do, not how to think about doing it.

2. Make plan.md granular Break tasks into small, verifiable chunks. Not “implement authentication” but “add Jwt
JSON Web Token - A compact, URL-safe token format used to securely transmit information between parties as a JSON object. JWTs are digitally signed and can be verified and trusted, commonly used for authentication and authorization. validation to middleware X”, “create token refresh endpoint according to spec Y”, “add auth tests”.

3. Use filesystem state, not memory Don’t rely on the agent remembering things. Write it down. Use files. The filesystem is your state management system.

4. Watch it run Human-in-the-loop is powerful. You’ll spot patterns. You’ll see when the agent gets stuck. You’ll learn which prompts work.

5. Stop when it’s done Don’t let it run forever. When all tasks are complete and it’s not finding new issues, stop it. Use a promise as a circuit breaker.

The Context Reset Pattern Beyond Ralph

Ralph is one implementation of a broader pattern: context resets. Anthropic’s own documentation recommends this pattern for agent systems. If you’re looking for more strategies to optimize your workflow with Claude Code, check out how to get the most out of Claude Code.

Instead of long-running agents with compaction, spawn fresh subagents for specific tasks. Each subagent gets a clean context window optimized for its job.

Want to analyze a log file? Spawn a subagent with just the log file and analysis prompt. Want to fix a bug? Spawn a subagent with the relevant files and error message. Want to run tests? Spawn a subagent with the test command and failure output.

Each subagent operates in the smart zone. No accumulated cruft. No compacted half-memories. Just focused execution. Also, make it a habit to create Skills.

What’s Next for Ralph

Geoffrey Huntley is working on Loom and Weaver, which builds on Ralph’s concepts for autonomous software creation. The idea is scaling the pattern: more sophisticated task management, better verification, smarter state handling.

The core insight remains: context is a scarce resource. Treat it like one. And if you want to know how to effectively extend the context window in your agentic applications, why don’t you give this one a read.

Sources

How to Ralph Wiggum · Original pattern documentation
Lost in the Middle: How Language Models Use Long Contexts · MIT research on position bias
Claude Code: Best Practices for Agentic Coding · Official Anthropic guidance
Effective Context Engineering for AI Agents · Context reset patterns
Context Rot: How Increasing Input Tokens Impacts LLM Performance · Research on degradation mechanisms