Philipp Schmid offers the cleanest mental model for understanding where agent harnesses fit[^1]:
- Model = CPU — raw processing power
- Context window = RAM — limited, volatile working memory
- Agent harness = Operating system — manages resources, provides standard interfaces, handles boot sequences
- Agent = Application — user-specific logic running on top
The model generates text. The harness makes things happen — executing tools, managing memory across sessions, decomposing tasks, and verifying results. Strip the harness away and you have an LLM that can talk about writing code but can’t actually write, test, or commit it.
The key insight from Parallel.ai’s technical breakdown: models with well-designed harnesses consistently outperform identical models without them[^2]. The harness makes or breaks the product. Two applications running the same underlying model deliver wildly different experiences based on harness quality — tool integration, memory management, context engineering, and workflow structure.
This means harness engineering is at least as important as model selection. Probably more so.

## The Five-Stage Lifecycle

Parallel.ai describes agent operation as a five-stage lifecycle[^2]. This maps well to what you observe using tools like Claude Code, even if the stages blur together in practice.

### 1. Intent Capture & Orchestration

User goals are decomposed into subtasks. An orchestrator coordinates model invocations, deciding what to tackle first and how to break work into manageable pieces.
Anthropic’s approach to long-running agents makes this concrete: they split operation into an initializer agent (first session only) and a coding agent (subsequent sessions)[^3]. The initializer sets up the environment, creates a feature list, and establishes a progress file. The coding agent picks up from there, constrained to incremental single-feature work.
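
A minimal sketch of that split, assuming a hypothetical layout for the `claude-progress.txt` file and stub functions in place of real agent sessions:

```python
from pathlib import Path

PROGRESS_FILE = Path("claude-progress.txt")

def run_initializer(goal: str) -> None:
    """First session only: scaffold the environment and persist a plan."""
    PROGRESS_FILE.write_text(f"goal: {goal}\ncompleted features: none\n")

def run_coding_session(goal: str, prior_progress: str) -> None:
    """Later sessions: do one incremental feature, informed by prior progress."""
    print(f"Resuming '{goal}' with prior progress:\n{prior_progress}")

def orchestrate(goal: str) -> None:
    # The orchestrator routes to the right agent based on persisted state.
    if not PROGRESS_FILE.exists():
        run_initializer(goal)
    else:
        run_coding_session(goal, PROGRESS_FILE.read_text())
```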

### 2. Tool Call Execution

The harness monitors model output for special tokens indicating tool requests — search(), python(), file_edit() — executes those operations externally, and feeds results back as context. The model never touches the filesystem or network directly. The harness mediates every interaction.
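
In code, that mediation is a loop. A rough sketch, assuming a hypothetical `call_model` callable that returns either a final answer or a structured tool request:

```python
import json

# Hypothetical tool registry; a real harness exposes many more operations.
TOOLS = {
    "search": lambda query: f"results for {query!r}",
    "file_read": lambda path: open(path, encoding="utf-8").read(),
}

def run_turn(call_model, messages: list[dict]) -> str:
    """Loop until the model returns a final answer instead of a tool request."""
    while True:
        output = call_model(messages)              # assumed to return a dict
        if output.get("type") != "tool_call":
            return output["text"]                  # no tool needed; we're done
        result = TOOLS[output["name"]](**output["arguments"])
        # The harness executed the side effect; the model only sees the result.
        messages.append({
            "role": "tool",
            "name": output["name"],
            "content": json.dumps(result, default=str),
        })
```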

### 3. Context Management & Memory

Before each model invocation, the harness compiles the working context — system prompt, recent conversation, relevant tool results, retrieved documentation. Between invocations, it compacts and summarizes to stay within token limits and prevent context rot.
This is where most of the engineering complexity lives. More on this below.
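
A toy sketch of that pre-invocation assembly, assuming a crude four-characters-per-token estimate; real harnesses use a proper tokenizer and summarize rather than simply dropping turns:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly four characters per token.
    return len(text) // 4

def build_context(system_prompt: str, history: list[str],
                  tool_results: list[str], budget: int = 100_000) -> str:
    """Assemble the working context, dropping the oldest turns if over budget."""
    history = list(history)
    while history:
        parts = [system_prompt, *tool_results, *history]
        if sum(estimate_tokens(p) for p in parts) <= budget:
            break
        history.pop(0)                 # compaction stand-in: drop the oldest turn
    return "\n\n".join([system_prompt, *tool_results, *history])
```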

### 4. Result Verification & Iteration

The harness validates outputs against criteria: schema checks, test execution, linting. Coding agents follow write-compile-test-fix cycles orchestrated automatically. If tests fail, the harness feeds errors back and the model tries again without human intervention.
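
A bare-bones version of that cycle, using pytest as the check and a hypothetical `ask_model_to_fix` callback to route failures back to the model:

```python
import subprocess

def verify_and_iterate(ask_model_to_fix, max_attempts: int = 3) -> bool:
    """Run the test suite; feed failures back until it passes or attempts run out."""
    for _ in range(max_attempts):
        run = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if run.returncode == 0:
            return True                            # criteria met; move on
        # No human in the loop: hand the error output straight back to the model.
        ask_model_to_fix(run.stdout + run.stderr)  # hypothetical callback
    return False                                   # give up and escalate
```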

### 5. Completion & Handoff

End-of-session routines save artifacts — files, git commits, progress logs — enabling session resumption with full continuity. This is the difference between “the conversation ended” and “the work continues next time.”
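
A sketch of one such end-of-session routine, under the same assumptions (a progress file plus git as the durable record):

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def hand_off(summary: str, progress_file: str = "claude-progress.txt") -> None:
    """Persist artifacts so the next session can resume instead of starting over."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with Path(progress_file).open("a", encoding="utf-8") as f:
        f.write(f"[{stamp}] {summary}\n")
    # Git history doubles as long-term memory for the project.
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"agent session: {summary}"], check=True)
```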
The following diagram shows how these stages relate, including the verification loop that cycles back on failure and the memory tiers that feed into context management:

```mermaid
flowchart TB
    User([User Goal]) --> IC

    subgraph Lifecycle["Agent Harness Lifecycle"]
        IC["1. Intent Capture\n& Orchestration"] --> TE["2. Tool Call\nExecution"]
        TE --> CM["3. Context Management\n& Memory"]
        CM --> RV["4. Result Verification\n& Iteration"]
        RV -->|"Pass"| CH["5. Completion\n& Handoff"]
        RV -->|"Fail — retry"| TE
    end

    subgraph Memory["Memory Architecture"]
        WC["Working Context\n(ephemeral)"]
        SS["Session State\n(durable)"]
        LT["Long-Term Memory\n(persistent)"]
    end

    WC --> CM
    SS --> CM
    LT --> CM
    CH -->|"Persist progress"| SS
    CH -->|"Update knowledge"| LT

    HG{{"Human-in-the-Loop\nGate"}} -.->|"Approve / reject"| TE
    HG -.->|"Review"| CH
```

## Memory Architecture

Agent harnesses implement a three-tier memory model[^2]. Each tier has a different scope, lifetime, and update pattern:
| Tier | Scope | Examples |
|---|---|---|
| Working context | Ephemeral — assembled fresh per model invocation | System prompt + recent messages + tool results |
| Session state | Durable within a task, persisted but scoped | Progress files, conversation history, CLAUDE.md instructions, git history |
| Long-term memory | Cross-task knowledge, survives across sessions | Vector stores, knowledge bases, issue trackers |
The tiers address different problems. Working context is what the model sees right now. Session state is what the harness knows about the current task. Long-term memory is what the system knows about the world.
Anthropic’s progress file pattern is a concrete example of session state engineering[^3]. The initializer agent creates a claude-progress.txt file and a JSON feature list at the start of a project. Every subsequent session reads these files first, preventing the classic failure mode where an agent discovers partial work and falsely declares completion.
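
A sketch of that startup read, assuming a hypothetical `features.json` with a done flag per entry (the write-up names the files but not their exact schema):

```python
import json
from pathlib import Path

def select_next_feature(feature_file: str = "features.json") -> dict | None:
    """Read persisted state first so partial work is never mistaken for done work."""
    features = json.loads(Path(feature_file).read_text())    # assumed schema
    remaining = [f for f in features if not f.get("done")]
    return remaining[0] if remaining else None                # None means complete
```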
LangChain’s Deep Agents SDK implements aggressive context compression at the working context tier[^4]. Tool responses exceeding 20,000 tokens get offloaded to a virtual filesystem, replaced with file references and content previews. When the context window hits 85% capacity, older tool calls are truncated and substituted with disk pointers. If that’s still not enough, the system generates structured summaries capturing intent, artifacts, and next steps — while preserving original conversations on disk for recovery.
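
The offloading half of that policy is straightforward to sketch (an illustration of the pattern, not the SDK’s actual code):

```python
import hashlib
from pathlib import Path

OFFLOAD_DIR = Path("tool_outputs")
TOKEN_LIMIT = 20_000         # threshold described in the Deep Agents write-up
PREVIEW_CHARS = 400

def maybe_offload(tool_result: str) -> str:
    """Swap oversized tool output for a file reference plus a short preview."""
    if len(tool_result) // 4 < TOKEN_LIMIT:       # crude ~4 chars/token estimate
        return tool_result
    OFFLOAD_DIR.mkdir(exist_ok=True)
    name = hashlib.sha1(tool_result.encode()).hexdigest()[:12]
    path = OFFLOAD_DIR / f"{name}.txt"
    path.write_text(tool_result)                  # original kept on disk for recovery
    return f"[large output offloaded to {path}]\nPreview: {tool_result[:PREVIEW_CHARS]}..."
```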

## Context Engineering

Parallel.ai identifies four core context engineering techniques[^2]:
Isolation — Separate subtasks to prevent cross-contamination. Subagents handle independent work in their own context windows, reporting results back without polluting the parent’s working memory.
Reduction — Compress or drop irrelevant context. Automatic summarization on compaction, offloading large tool results to disk, truncating stale conversation history.
Retrieval — Inject fresh information dynamically. Documentation lookup, code search, knowledge base queries. The harness pulls in what’s relevant rather than front-loading everything into the prompt.
Prompt rewriting — Restructure prompts between context windows to maintain coherence. After compaction or summarization, the harness reassembles the prompt to keep the model oriented.
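
Isolation is the easiest of the four to show in code. A sketch, assuming a hypothetical `call_model` callable that takes a message list and returns text:

```python
def run_subagent(call_model, task: str) -> str:
    """Give the subtask its own context window; only a summary comes back."""
    sub_messages = [
        {"role": "system",
         "content": "You are a research subagent. Reply with a concise summary."},
        {"role": "user", "content": task},
    ]
    return call_model(sub_messages)   # intermediate detail never leaves this scope

def delegate(call_model, parent_messages: list[dict], task: str) -> None:
    summary = run_subagent(call_model, task)
    # Only the compact result enters the parent's working memory.
    parent_messages.append({"role": "tool", "name": "subagent", "content": summary})
```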
Böckeler, writing on Martin Fowler’s site, frames harness engineering as a three-category discipline[^5]:
- Context engineering — Dynamic knowledge bases within codebases, enhanced with observability data and navigation capabilities
- Architectural constraints — Deterministic custom linters and structural tests working alongside LLM-based agents to enforce patterns and boundaries
- Periodic maintenance — Autonomous agents running scheduled cleanup — identifying documentation inconsistencies, architectural violations, and dead code
The third category is underappreciated. It’s not enough to build the harness; you need processes that maintain the ecosystem the harness operates in. Documentation drifts, conventions evolve, dead code accumulates. Without periodic maintenance, the context the harness retrieves degrades over time.

## The Durability Gap

Static leaderboards measure single-turn capability. Production agents need long-horizon coherence.
Schmid identifies this as the durability gap[^1]: models maintain capability on isolated tasks but drift off-track after fifty sequential tool calls. The further an agent runs, the more likely it is to lose the thread — forgetting constraints, repeating work, or making decisions that contradict earlier ones.
Traditional benchmarks don’t capture this. A model can score well on HumanEval and still produce garbage on the hundredth iteration of a complex refactoring session. The real test of a production AI system isn’t “can it solve this problem?” but “can it solve problem number fifty while remembering the solutions to problems one through forty-nine?”
Harnesses close this gap through context management (keeping the relevant information accessible), verification loops (catching drift before it compounds), and session continuity (carrying state across context window boundaries).

## Verification & Guardrails

Verification is where harness engineering gets practical. The techniques vary by domain, but the pattern is consistent: don’t trust model output without checking it.
Schema validation — Ensure structured outputs parse correctly before acting on them.
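
For example, a minimal check before acting on a structured edit request (the key names here are illustrative):

```python
import json

REQUIRED_KEYS = {"file", "patch", "reason"}      # hypothetical output schema

def validate_output(raw: str) -> dict:
    """Parse and structure-check model output before the harness acts on it."""
    data = json.loads(raw)                        # raises on malformed JSON
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"model output missing keys: {sorted(missing)}")
    return data
```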
Test execution — Coding agents follow write-test-fix cycles. Anthropic’s long-running agent approach goes further: unit tests alone proved insufficient, so they added browser automation (Puppeteer MCP) to verify features end-to-end as a human user would[^3].
Architectural constraints — Böckeler emphasizes deterministic linters and structural tests alongside AI agents[^5]. The LLM handles creative work; deterministic tools enforce invariants. This combination is more robust than relying on the model to police itself.
Scope constraints — Anthropic’s “one feature per session” rule prevents agents from attempting too much at once[^3]. Without this constraint, agents naturally try to one-shot entire projects, exhausting their context window mid-implementation and leaving undocumented handoff states.
Human-in-the-loop gates — For high-stakes operations, the harness pauses and requests human approval before proceeding. LangChain’s Deep Agents framework supports this pattern[^4], and it shows up in Claude Code’s hooks system where tool calls can be intercepted for review.
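
The gate itself can be as simple as a blocking prompt before risky side effects; a sketch with hypothetical tool names:

```python
HIGH_STAKES = {"bash", "file_delete", "deploy"}   # hypothetical tool names

def gated_execute(tool_name: str, run_tool, **kwargs):
    """Pause for explicit human approval before high-stakes operations."""
    if tool_name in HIGH_STAKES:
        answer = input(f"Approve {tool_name} with {kwargs}? [y/N] ")
        if answer.strip().lower() != "y":
            return {"status": "rejected", "reason": "blocked by reviewer"}
    return run_tool(**kwargs)
```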

## Harness vs. Related Concepts

The terminology in this space is muddled. This table from Parallel.ai[^2] helps distinguish the layers:
| Concept | Role | Distinction |
|---|---|---|
| Agent framework (LangChain, LlamaIndex) | Building-block abstractions | Harness is a complete runtime with opinionated defaults |
| Orchestrator | Controls when/how to call models | Harness manages capabilities and side-effects — tools, context, environment |
| Agentic coding tool (Claude Code, Cursor) | End-user product | The harness is the infrastructure inside these products |
As Parallel.ai puts it: “orchestration is the brain of the operation, harness is the hands and infrastructure.”[^2] Schmid adds that the harness “sits above frameworks, providing prompt presets, opinionated handling for tool calls, lifecycle hooks or ready-to-use capabilities.”[^1]

## Real-World Example: Claude Code

Claude Code is a useful case study because its harness components are visible to the user.
| Harness Component | Claude Code Implementation |
|---|---|
| Session state | CLAUDE.md files — persistent instructions reloaded after every compaction |
| Tool execution | Read, Edit, Bash, MCP tools — mediated access to filesystem, terminal, and external services |
| Context reduction | Automatic compaction and summarization when context window fills |
| Context isolation | Subagents (Task tool) — independent context windows for exploration and research |
| Guardrails | Hooks — shell commands that execute at tool-call boundaries, intercepting operations for validation |
| Planning | TodoWrite tool — structured task tracking that doubles as a context engineering strategy to keep the agent on track[^4] |
| Long-term memory | Git history, project documentation, and tools like Beads (git-backed issue tracker persisting across sessions) |
| Tool integration | MCP (Model Context Protocol) — standardized discovery and invocation of external capabilities |
The MCP layer deserves attention. Rather than hardcoding integrations, MCP defines a protocol for how harnesses discover and invoke external tools. An Obsidian MCP server gives the agent access to your notes. An AWS knowledge MCP server gives it access to documentation. The harness doesn’t need to know about these systems in advance — it discovers capabilities at runtime through the protocol.
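
The shape of that runtime discovery, sketched with a hypothetical `McpClient` stand-in rather than any particular SDK:

```python
class McpClient:
    """Hypothetical stand-in for a client session against one MCP server."""

    def __init__(self, server: str):
        self.server = server

    def list_tools(self) -> list[dict]:
        # A real client would query the server; a fixed example stands in here.
        return [{"name": "search_notes", "description": "Search the vault"}]

    def call_tool(self, name: str, arguments: dict) -> str:
        return f"called {name} on {self.server} with {arguments}"

def discover(servers: list[str]) -> dict[str, McpClient]:
    """Build the tool registry at runtime instead of hardcoding integrations."""
    registry: dict[str, McpClient] = {}
    for server in servers:
        client = McpClient(server)
        for tool in client.list_tools():
            registry[tool["name"]] = client
    return registry
```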

## Building and Evaluating Harnesses

Practical guidance synthesized from across the sources:
Keep it simple, design for modularity. Schmid notes that Manus rebuilt their harness five times in six months; LangChain refactored three times yearly[^1]. Every new model release shifts the optimal way to structure agents. Build atomic tools rather than complex control flows. Structure for easy replacement: “you must be ready to rip out code.”
Capture trajectories. Structured logging of agent workflows transforms vague multi-step operations into data you can analyze and improve[^1]. Trajectories are a competitive advantage — they show you where agents struggle, what context they needed, and how they recovered (or didn’t).
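
A trajectory log can start as nothing fancier than an append-only JSONL file; a sketch:

```python
import json
import time
from pathlib import Path

TRAJECTORY_LOG = Path("trajectories.jsonl")

def log_step(session_id: str, step: int, event: str, payload: dict) -> None:
    """Append one structured record per model call, tool call, or verification."""
    record = {"session": session_id, "step": step, "event": event,
              "ts": time.time(), **payload}
    with TRAJECTORY_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example: record a failed verification so it can be analyzed later.
# log_step("sess-42", 7, "verification", {"tool": "pytest", "exit_code": 1})
```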
Feed failures back into the system. Böckeler’s most actionable insight: “when the agent struggles, we treat it as a signal: identify what is missing — tools, guardrails, documentation — and feed it back into the repository.”[^5] Agent failure isn’t just a bug to fix. It’s a diagnostic that reveals gaps in your harness — missing tools, insufficient documentation, absent guardrails. Each failure, properly analyzed, makes the system stronger.
Enforce incremental progress. Anthropic’s session startup protocol — verify working directory, read progress files, select next feature, run baseline tests — prevents agents from discovering partial work and going sideways[^3]. The one-feature-per-session constraint eliminates the most common failure mode: scope creep that exhausts context.
Combine deterministic and probabilistic checks. Don’t rely solely on the model to verify its own output. Linters catch what LLMs miss. Type checkers enforce what prompts can’t. The strongest verification pipelines use both[^5].
Plan for context window exhaustion. Your harness needs a strategy for what happens when context fills up. LangChain’s approach — offload large results to disk, truncate stale calls, summarize when necessary — is one pattern[^4]. Anthropic’s approach — constrain session scope so context rarely fills — is another. Both work. Having no strategy doesn’t.

## Where This Is Heading

The term “agent harness” is relatively new, but the engineering discipline it describes has been emerging for a couple of years now. As models get more capable, the harness becomes more important, not less — because more capable models attempt more ambitious tasks that require more sophisticated context management, verification, and recovery.
The companies building the best AI products aren’t necessarily the ones with the best models. They’re the ones with the best harnesses.

[^1]: Philipp Schmid, “The Importance of Agent Harness in 2026,” January 2026.
[^2]: Parallel.ai, “What is an Agent Harness?,” 2025.
[^3]: Justin Young, “Effective Harnesses for Long-Running Agents,” Anthropic Engineering, 2026.
[^4]: LangChain, “Context Management for Deep Agents,” 2025.
[^5]: Birgitta Böckeler, “Harness Engineering,” martinfowler.com, February 2026.

