Lecture Notes: Harnesses in AI: A Deep Dive
Speaker: Tejas Kumar (IBM)
Source: YouTube, "Harnesses in AI: A Deep Dive"
Date: 2026-06-05
Core Thesis: It's Not a Prompt Problem, It's a Harness Problem
For years, the AI community has obsessed over "the prompt problem" — crafting the perfect instruction for a language model. Prompt engineering, chain-of-thought, few-shot examples, system prompts. All valuable. But Tejas Kumar argues we've been neglecting the far more consequential challenge: the harness problem.
The harness is the infrastructure layer that sits between the LLM and the outside world. It determines how the model accesses tools, retains context, manages state, handles errors, and coordinates multiple steps. Kumar's claim is that, for agent reliability, it matters more than the model itself.
The Demonstration That Proves the Point
Kumar's demo is devastatingly simple: GPT-3.5 Turbo + Playwright fails at a basic Hacker News upvote task. The model can't navigate the DOM, can't handle authentication, can't manage state across page loads.
Add harness components (tool definitions, state management, guardrails, an agent loop with verification) and the same model becomes reliable. Kumar's takeaway: a better harness beats a model upgrade. By his account, GPT-3.5 with a strong harness outperforms raw GPT-4 on the same task.
The Five Components of an Agent Harness
1. Tool Registry
The mechanism by which tools (APIs, functions, databases, search engines, code interpreters) are made available to the model.
- Tool definitions: JSON Schema or similar structured format
- Tool discovery: How the model learns what tools exist and when to use them
- Tool routing: Mapping natural language intent to specific tool calls
- Authentication/authorization: Managing credentials and permissions per tool
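A minimal sketch of how these four responsibilities might sit together in one object. The names `Tool`, `ToolRegistry`, `required_scope`, and the `add` tool are hypothetical illustrations, not taken from the talk or from any particular framework, and the argument check is a simplified stand-in for full JSON Schema validation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str              # what the model reads to decide when to call it
    parameters: dict[str, Any]    # JSON Schema-style description of the arguments
    handler: Callable[..., Any]
    required_scope: str = "read"  # hypothetical permission label


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def describe(self) -> list[dict[str, Any]]:
        """Tool discovery: the definitions handed to the model."""
        return [
            {"name": t.name, "description": t.description, "parameters": t.parameters}
            for t in self._tools.values()
        ]

    def call(self, name: str, args: dict[str, Any], granted_scopes: set[str]) -> Any:
        """Tool routing plus a per-tool authorization check."""
        tool = self._tools.get(name)
        if tool is None:
            raise KeyError(f"unknown tool: {name}")
        if tool.required_scope not in granted_scopes:
            raise PermissionError(f"{name} requires scope {tool.required_scope!r}")
        # Simplified validation: only checks that required arguments are present.
        missing = [k for k in tool.parameters.get("required", []) if k not in args]
        if missing:
            raise ValueError(f"missing arguments: {missing}")
        return tool.handler(**args)


registry = ToolRegistry()
registry.register(Tool(
    name="add",
    description="Add two integers.",
    parameters={
        "type": "object",
        "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
        "required": ["a", "b"],
    },
    handler=lambda a, b: a + b,
))
result = registry.call("add", {"a": 2, "b": 3}, granted_scopes={"read"})
```

Keeping routing and authorization inside `call` means the model never invokes a handler directly; every tool call passes through the same checks.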
2. AI Model
The LLM itself — but Kumar emphasizes this is just one component. The model should be swappable without changing the harness.
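One way to keep the model swappable is to have the harness depend on a narrow interface rather than on a vendor client. The sketch below assumes a hypothetical single-method `ChatModel` interface; `EchoModel` and `UppercaseModel` are stand-ins for real clients, not actual LLM wrappers.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical interface: the only thing the harness knows about a model."""
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in model for local testing; not a real LLM client."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    """A second stand-in, to show the swap requires no harness changes."""
    def complete(self, prompt: str) -> str:
        return prompt.upper()


class Harness:
    def __init__(self, model: ChatModel) -> None:
        self.model = model

    def run(self, task: str) -> str:
        # Tooling, context, and guardrails would wrap this call in a fuller harness.
        return self.model.complete(task)


out_a = Harness(EchoModel()).run("upvote the story")
out_b = Harness(UppercaseModel()).run("upvote the story")
```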
3. Context Management
How the agent manages conversation history, state, and memory:
- Short-term memory: Conversation context window (limited, expensive)
- Long-term memory: Persistent storage (vector databases, key-value stores)
- Episodic memory: Records of past actions and outcomes
- Semantic memory: Facts, concepts, and learned patterns
- Working memory: Current task state and intermediate results
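A rough sketch of the short-term/long-term split, assuming a fixed token budget for the context window. `ContextManager` and `count_tokens` are hypothetical names; the word-count tokenizer and the in-memory `long_term` list are placeholders for a real tokenizer and a persistent store.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words, not a real tokenizer.
    return len(text.split())


@dataclass
class ContextManager:
    budget: int                                            # max tokens kept in the window
    short_term: list[str] = field(default_factory=list)
    long_term: list[str] = field(default_factory=list)     # stand-in for a persistent store
    working: dict[str, object] = field(default_factory=dict)  # current task state

    def _used(self) -> int:
        return sum(count_tokens(m) for m in self.short_term)

    def add(self, message: str) -> None:
        self.short_term.append(message)
        # Evict the oldest messages to long-term storage until the window fits.
        while len(self.short_term) > 1 and self._used() > self.budget:
            self.long_term.append(self.short_term.pop(0))

    def window(self) -> list[str]:
        return list(self.short_term)


ctx = ContextManager(budget=6)
for msg in ["open the page", "page loaded fine", "click upvote now"]:
    ctx.add(msg)
```

With a budget of 6 words, the third message pushes the first one out of the window and into `long_term`, so nothing is lost but the prompt stays bounded.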
4. Guardrails
Safety boundaries that prevent the agent from going off-course:
- Input sanitization and prompt injection detection
- Output validation and content filtering
- Rate limits and resource quotas
- Permission boundaries per tool
- Human-in-the-loop checkpoints for high-risk actions
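Three of these boundaries (permission boundaries, resource quotas, and human-in-the-loop checkpoints) can be sketched as a single pre-call check. `Guardrails`, `GuardrailViolation`, and the `approve` callback are hypothetical names; a real approval hook would block on a reviewer rather than call a lambda.

```python
from dataclasses import dataclass
from typing import Callable


class GuardrailViolation(Exception):
    pass


@dataclass
class Guardrails:
    allowed_tools: set[str]
    high_risk_tools: set[str]
    max_calls: int
    approve: Callable[[str, dict], bool]   # human-in-the-loop hook (assumed interface)
    calls_made: int = 0

    def check(self, tool: str, args: dict) -> None:
        """Raise before the tool runs if any boundary is crossed."""
        if tool not in self.allowed_tools:
            raise GuardrailViolation(f"tool not permitted: {tool}")
        if self.calls_made >= self.max_calls:
            raise GuardrailViolation("call quota exhausted")
        if tool in self.high_risk_tools and not self.approve(tool, args):
            raise GuardrailViolation(f"approval denied for {tool}")
        self.calls_made += 1


guard = Guardrails(
    allowed_tools={"search", "delete_record"},
    high_risk_tools={"delete_record"},
    max_calls=2,
    approve=lambda tool, args: False,   # reviewer stub that rejects everything
)
guard.check("search", {"q": "harness"})
```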
5. Agent Loop + Verify Step
The control flow that drives the agent's operation:
- ReAct loop (Reasoning + Acting): Alternating between thought and action
- Plan-then-execute: Generate a plan, then execute steps
- Tree-of-thought: Explore multiple reasoning paths
- Dynamic re-planning: Adjust plans based on intermediate results
- Critical: The Verify Step — after each action, the agent verifies the result before proceeding. This is what prevents cascading failures.
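The loop-with-verification idea can be sketched as follows. This is an illustrative structure, not the implementation shown in the talk: `run_agent` and its `propose`, `execute`, and `verify` callbacks are hypothetical, and the "model" here is a plain function driving a counter to 3 with one simulated silent tool failure.

```python
from typing import Callable, Optional

Action = dict  # e.g. {"tool": "increment"}


def run_agent(
    propose: Callable[[dict, list], Optional[Action]],
    execute: Callable[[Action, dict], dict],
    verify: Callable[[Action, dict, dict], bool],
    state: dict,
    max_steps: int = 10,
) -> tuple[dict, list]:
    history: list = []
    for _ in range(max_steps):
        action = propose(state, history)        # reason: pick the next action
        if action is None:                      # the model signals completion
            break
        new_state = execute(action, state)      # act
        ok = verify(action, state, new_state)   # verify before committing
        history.append({"action": action, "ok": ok})
        if ok:
            state = new_state                   # only verified results move the task forward
    return state, history


attempts = {"n": 0}


def propose(state, history):
    return {"tool": "increment"} if state["count"] < 3 else None


def execute(action, state):
    attempts["n"] += 1
    if attempts["n"] == 2:                      # simulate one silent tool failure
        return dict(state)
    return {"count": state["count"] + 1}


def verify(action, old, new):
    return new["count"] == old["count"] + 1


final_state, history = run_agent(propose, execute, verify, {"count": 0})
```

Because the failed step is never committed, the next proposal is made from the last verified state, which is how the verify step stops one bad result from cascading.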
Two Types of Harnesses
Eval Harnesses
Frameworks for evaluating AI model performance systematically. Examples: OpenAI Evals, LangChain Benchmarks, Anthropic's evaluation frameworks. These measure accuracy, safety, and capability across standardized tasks.
Agent Harnesses
Frameworks for running AI agents in production. The focus is on reliability, safety, and repeatability. Examples: LangGraph, AutoGPT, Anthropic's reference harness.
Key Insight: Extract Determinism from the Model
The central engineering insight: move as much determinism as possible out of the stochastic model and into the deterministic harness.
- Tool definitions should be static and versioned
- Control flow should be explicit and auditable
- State transitions should follow defined patterns
- The harness itself should be unit-testable
- All the stochasticity (creativity, variation) lives in the model, not the scaffolding
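As one possible illustration of explicit control flow and defined state transitions, the harness can keep its phases in a static table that is testable without calling a model. The phase and event names below are assumptions made for the sketch.

```python
from enum import Enum


class Phase(Enum):
    PLAN = "plan"
    ACT = "act"
    VERIFY = "verify"
    DONE = "done"
    FAILED = "failed"


# Static, auditable transition table: (phase, event) -> next phase.
TRANSITIONS = {
    (Phase.PLAN, "planned"): Phase.ACT,
    (Phase.ACT, "acted"): Phase.VERIFY,
    (Phase.ACT, "error"): Phase.FAILED,
    (Phase.VERIFY, "passed"): Phase.DONE,
    (Phase.VERIFY, "failed"): Phase.PLAN,
}


def step(phase: Phase, event: str) -> Phase:
    """Deterministic: the same phase and event always give the same next phase."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {phase.name} on {event!r}") from None


# A unit test for the harness needs no model at all.
assert step(Phase.PLAN, "planned") is Phase.ACT
assert step(Phase.VERIFY, "failed") is Phase.PLAN
```

The model only chooses what to do inside a phase; which phase comes next is decided by the table.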
Design Principles for Building Great Harnesses
- Fail fast, fail gracefully — Explicit timeouts, immediate error surfacing, circuit breakers, fallback behaviors
- Deterministic scaffolding, stochastic agents — The harness is predictable, the model is creative
- Observability by default — Every tool call logged, reasoning traces stored, real-time dashboards
- Principle of least privilege — Granular permissions, scoped credentials, human gates for high-risk actions
- Idempotency where possible — Safe to retry, deduplication, execution ID tracking
- State is explicit, not implicit — Durable storage, versioned state, clean serialization
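To make the idempotency principle concrete, here is a small sketch of execution ID tracking with deduplication. `IdempotentExecutor` and `send_email` are hypothetical, and the in-memory dictionary stands in for durable storage.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotentExecutor:
    def __init__(self) -> None:
        self._results: dict[str, Any] = {}   # stand-in for durable storage

    @staticmethod
    def execution_id(tool: str, args: dict[str, Any]) -> str:
        # Stable ID: the same tool and arguments always hash to the same key.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, tool: str, args: dict[str, Any], fn: Callable[..., Any]) -> Any:
        key = self.execution_id(tool, args)
        if key in self._results:             # duplicate request: reuse the recorded result
            return self._results[key]
        result = fn(**args)
        self._results[key] = result
        return result


sent: list[str] = []


def send_email(to: str) -> str:
    sent.append(to)                          # the side effect we must not repeat
    return f"sent to {to}"


executor = IdempotentExecutor()
first = executor.run("send_email", {"to": "a@example.com"}, send_email)
second = executor.run("send_email", {"to": "a@example.com"}, send_email)  # a retry
```

Deriving the ID from the call contents alone also suppresses intentional repeats; a fuller design would likely mix a task or step identifier into the key.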
Common Failure Modes
| Failure Mode | Description | Solution |
|---|---|---|
| Infinite Loop Trap | Repeated identical tool calls | Loop detection, max step limits |
| Context Window Overload | History grows unbounded | Sliding window plus summarization of older turns |
| Tool Confusion | Wrong tool called | Clear descriptions, validation layers |
| Silent Failure | Tool fails, agent doesn't notice | Mandatory error acknowledgment |
| Prompt Injection | Untrusted input overrides system instructions | Input sanitization, output validation |
| Frankenstein Agent | Patched beyond comprehension | Architectural discipline from day one |
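The first row of the table, loop detection with a maximum step limit, might be sketched like this. `LoopGuard` and `LoopDetected` are hypothetical names and the thresholds are arbitrary.

```python
import json
from collections import Counter


class LoopDetected(Exception):
    pass


class LoopGuard:
    def __init__(self, max_steps: int = 20, max_repeats: int = 3) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen: Counter = Counter()

    def record(self, tool: str, args: dict) -> None:
        """Call once per tool invocation, before executing it."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise LoopDetected(f"step limit of {self.max_steps} exceeded")
        # Identical tool and arguments produce an identical signature.
        signature = tool + ":" + json.dumps(args, sort_keys=True)
        self.seen[signature] += 1
        if self.seen[signature] > self.max_repeats:
            raise LoopDetected(f"{tool} called with identical arguments too often")


loop_guard = LoopGuard(max_steps=10, max_repeats=2)
loop_guard.record("click", {"selector": "#upvote"})
loop_guard.record("click", {"selector": "#upvote"})
try:
    loop_guard.record("click", {"selector": "#upvote"})
    tripped = False
except LoopDetected:
    tripped = True
```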
How Harness Design Affects Performance
- Accuracy: Structured data access + validation steps reduce hallucination
- Reliability: Edge case handling + failure recovery without human intervention
- Efficiency: Caching + parallel tool calls + smart context management
- Safety: Access controls + pattern detection + emergency stop
- Scalability: Modular tool design + multi-instance support + monitoring integration
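For the efficiency point, caching and parallel tool calls can be combined using only the standard library. This is a sketch under simple assumptions: `lookup` is a hypothetical, side-effect-free tool whose result depends only on its argument, and `calls` exists just to show when the tool body runs.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

calls: list[str] = []


@lru_cache(maxsize=128)
def lookup(key: str) -> str:
    calls.append(key)        # records each real (uncached) execution
    return key.upper()


def run_parallel(keys: list[str]) -> list[str]:
    # Independent tool calls run concurrently; map() preserves input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lookup, keys))


first_batch = run_parallel(["a", "b", "c"])
second_batch = run_parallel(["a", "b", "c"])   # served entirely from the cache
```

`lru_cache` is thread-safe, but if the same key is requested concurrently before the first result is cached, the wrapped function may run more than once, so this pattern is only appropriate for idempotent lookups.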
Key Takeaways
- The harness is infrastructure, not magic. Building a good harness requires software engineering discipline, not prompt engineering tricks.
- Start simple, then layer. Begin with the minimal harness that solves your problem.
- Test everything. Unit tests, integration tests, end-to-end tests. Treat it like production software.
- Monitor aggressively. You cannot improve what you cannot measure.
- Design for failure. Assume tools will fail, networks will be slow, and models will hallucinate.
- GPT-3.5 + strong harness > raw GPT-4. The model matters less than what you build around it.
References
- Anthropic's Defending Code Reference Harness — Reference implementation demonstrating harness patterns
- LangChain / LangGraph — Popular frameworks for building agent harnesses
- OpenAI Function Calling / Tool Use — API-level infrastructure for harness building
- "ReAct" (Yao et al., 2022) — The reasoning-acting loop foundation
- "Toolformer" (Schick et al., 2023) — How models learn to use tools
These lecture notes summarize and expand upon Tejas Kumar's talk on harnesses in AI. The views expressed are those of the speaker and do not necessarily reflect the position of IBM.