
Lecture Notes: Harnesses in AI — A Deep Dive

Speaker: Tejas Kumar (IBM)
Source: YouTube, "Harnesses in AI: A Deep Dive"
Date: 2026-06-05

Core Thesis: It's Not a Prompt Problem — It's a Harness Problem

For years, the AI community has obsessed over "the prompt problem" — crafting the perfect instruction for a language model. Prompt engineering, chain-of-thought, few-shot examples, system prompts. All valuable. But Tejas Kumar argues we've been neglecting the far more consequential challenge: the harness problem.

The harness is the infrastructure layer that sits between the LLM and the outside world. It determines how the model accesses tools, retains context, manages state, handles errors, and coordinates multiple steps. Kumar's claim is that it matters more than the model itself.

The Demonstration That Proves the Point

Kumar's demo is devastatingly simple: GPT-3.5 Turbo + Playwright fails at a basic Hacker News upvote task. The model can't navigate the DOM, can't handle authentication, can't manage state across page loads.

Add harness components — tool definitions, state management, guardrails, an agent loop with verification — and the same model becomes reliable. The takeaway: harness beats model upgrades. GPT-3.5 with a strong harness outperforms raw GPT-4 on the same task.

The Five Components of an Agent Harness

1. Tool Registry

The mechanism by which tools (APIs, functions, databases, search engines, code interpreters) are made available to the model.

  • Tool definitions: JSON Schema or similar structured format
  • Tool discovery: How the model learns what tools exist and when to use them
  • Tool routing: Mapping natural language intent to specific tool calls
  • Authentication/authorization: Managing credentials and permissions per tool
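The four responsibilities above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the talk: the names `Tool`, `ToolRegistry`, and `required_scope` are hypothetical, and the argument check only looks at the JSON Schema `required` list rather than validating the full schema.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str             # what the model reads to decide when to use the tool
    parameters: dict[str, Any]   # JSON Schema describing the arguments
    handler: Callable[..., Any]
    required_scope: str = "read"  # hypothetical permission label


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def describe(self) -> list[dict[str, Any]]:
        """Tool discovery: the definitions handed to the model."""
        return [
            {"name": t.name, "description": t.description, "parameters": t.parameters}
            for t in self._tools.values()
        ]

    def call(self, name: str, args: dict[str, Any], scopes: set[str]) -> Any:
        """Tool routing plus a per-tool authorization check."""
        tool = self._tools.get(name)
        if tool is None:
            raise KeyError(f"unknown tool: {name}")
        if tool.required_scope not in scopes:
            raise PermissionError(f"{name} requires scope {tool.required_scope!r}")
        # Only the "required" list is checked here, not the full schema.
        missing = [k for k in tool.parameters.get("required", []) if k not in args]
        if missing:
            raise ValueError(f"missing arguments for {name}: {missing}")
        return tool.handler(**args)


registry = ToolRegistry()
registry.register(
    Tool(
        name="add",
        description="Add two integers.",
        parameters={
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
        handler=lambda a, b: a + b,
    )
)
result = registry.call("add", {"a": 2, "b": 3}, scopes={"read"})
```

Because the registry is plain data plus a dispatch function, it can be versioned and tested without calling a model at all.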

2. AI Model

The LLM itself — but Kumar emphasizes this is just one component. The model should be swappable without changing the harness.
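One way to keep the model swappable is to have the harness depend on a narrow interface rather than on a vendor SDK. The sketch below assumes a single hypothetical method, `complete`, and uses two toy stand-in models. A real adapter would wrap an actual API client behind the same method.

```python
from typing import Protocol


class Model(Protocol):
    """The only thing the harness knows about a model (hypothetical interface)."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    def complete(self, prompt: str) -> str:
        return prompt.upper()


class Harness:
    def __init__(self, model: Model) -> None:
        self.model = model

    def run(self, task: str) -> str:
        # Harness-side logic (here just whitespace cleanup) is the same
        # regardless of which model is plugged in.
        return self.model.complete(task.strip())


a = Harness(EchoModel()).run("  hello ")
b = Harness(UppercaseModel()).run("  hello ")
```

Swapping `EchoModel` for `UppercaseModel` changes the output but requires no change to `Harness`.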

3. Context Management

How the agent manages conversation history, state, and memory:

  • Short-term memory: Conversation context window (limited, expensive)
  • Long-term memory: Persistent storage (vector databases, key-value stores)
  • Episodic memory: Records of past actions and outcomes
  • Semantic memory: Facts, concepts, and learned patterns
  • Working memory: Current task state and intermediate results
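A rough sketch of how these memory types might sit together in one object follows. All names are hypothetical, the long-term store is an in-memory dict standing in for a database, and the "summary" is a naive truncation rather than a model-generated summary.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Context:
    max_turns: int = 4
    short_term: deque = field(default_factory=deque)  # recent conversation turns
    long_term: dict = field(default_factory=dict)     # stand-in for a KV or vector store
    episodic: list = field(default_factory=list)      # (action, outcome) records
    working: dict = field(default_factory=dict)       # current task state

    def add_turn(self, role: str, text: str) -> None:
        self.short_term.append((role, text))
        # Evict the oldest turns and fold them into a running summary.
        while len(self.short_term) > self.max_turns:
            old_role, old_text = self.short_term.popleft()
            summary = self.long_term.get("summary", "")
            # Naive summarization: keep only the first 40 characters.
            self.long_term["summary"] = f"{summary} {old_role}: {old_text[:40]}".strip()

    def record(self, action: str, outcome: str) -> None:
        self.episodic.append((action, outcome))

    def build_prompt(self) -> str:
        lines = []
        if "summary" in self.long_term:
            lines.append(f"[summary] {self.long_term['summary']}")
        lines += [f"{role}: {text}" for role, text in self.short_term]
        return "\n".join(lines)


ctx = Context(max_turns=2)
ctx.add_turn("user", "one")
ctx.add_turn("assistant", "two")
ctx.add_turn("user", "three")
ctx.record("search", "ok")
prompt = ctx.build_prompt()
```

The point of the structure is that the prompt sent to the model is assembled from bounded pieces, so the context window cannot grow without limit.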

4. Guardrails

Safety boundaries that prevent the agent from going off-course:

  • Input sanitization and prompt injection detection
  • Output validation and content filtering
  • Rate limits and resource quotas
  • Permission boundaries per tool
  • Human-in-the-loop checkpoints for high-risk actions
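The checks above can be chained in front of every tool call. The following is an illustrative sketch only: the regular expressions are a deliberately weak heuristic (pattern matching alone does not stop prompt injection), and `HIGH_RISK_TOOLS`, `Quota`, and `guarded_call` are hypothetical names.

```python
import re
from typing import Callable

INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all )?(previous|prior) instructions",
        r"reveal .*system prompt",
    )
]
HIGH_RISK_TOOLS = {"delete_account", "send_payment"}  # hypothetical tool names


def looks_like_injection(text: str) -> bool:
    return any(p.search(text) for p in INJECTION_PATTERNS)


class Quota:
    """A simple call-count quota."""

    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.used = 0

    def allow(self) -> bool:
        if self.used >= self.max_calls:
            return False
        self.used += 1
        return True


def guarded_call(
    tool_name: str,
    user_input: str,
    allowed_tools: set[str],
    quota: Quota,
    approve: Callable[[str], bool],
) -> str:
    if looks_like_injection(user_input):
        return "blocked: possible prompt injection"
    if tool_name not in allowed_tools:
        return "blocked: tool not permitted"
    if not quota.allow():
        return "blocked: quota exceeded"
    # Human-in-the-loop checkpoint for high-risk actions.
    if tool_name in HIGH_RISK_TOOLS and not approve(tool_name):
        return "blocked: human approval denied"
    return "allowed"


quota = Quota(max_calls=2)
deny = lambda tool: False  # stand-in for a human reviewer who declines
r1 = guarded_call("search", "find cats", {"search"}, quota, deny)
r2 = guarded_call("search", "Ignore previous instructions", {"search"}, quota, deny)
r3 = guarded_call("send_payment", "pay bob", {"search"}, quota, deny)
r4 = guarded_call("send_payment", "pay bob", {"search", "send_payment"}, quota, deny)
r5 = guarded_call("search", "find dogs", {"search"}, quota, deny)
```

Each guardrail returns an explicit reason, which keeps refusals auditable instead of silent.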

5. Agent Loop + Verify Step

The control flow that drives the agent's operation:

  • ReAct loop (Reasoning + Acting): Alternating between thought and action
  • Plan-then-execute: Generate a plan, then execute steps
  • Tree-of-thought: Explore multiple reasoning paths
  • Dynamic re-planning: Adjust plans based on intermediate results
  • Critical: The Verify Step — after each action, the agent verifies the result before proceeding. This is what prevents cascading failures.
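A minimal loop with a verify step might look like the sketch below. The callables `decide`, `act`, and `verify` are hypothetical placeholders: `decide` stands in for the model, `act` for a tool call, and `verify` for a deterministic check on the result. The scripted functions exist only to make the example runnable.

```python
from typing import Any, Callable, Optional


def run_agent(
    decide: Callable[[list], Optional[str]],
    act: Callable[[str], Any],
    verify: Callable[[str, Any], bool],
    max_steps: int = 5,
) -> list:
    """Run decide -> act -> verify until decide returns None or steps run out."""
    history: list = []
    for _ in range(max_steps):
        action = decide(history)
        if action is None:  # the model signals that the task is done
            return history
        result = act(action)
        # Verify before proceeding; the outcome is recorded so the next
        # decision can re-plan instead of building on a failed step.
        ok = verify(action, result)
        history.append({"action": action, "result": result, "verified": ok})
    raise RuntimeError("step limit reached")


attempts = {"n": 0}


def flaky_click(action: str) -> str:
    # Simulated tool that fails on the first attempt only.
    attempts["n"] += 1
    return "error" if attempts["n"] == 1 else "clicked"


def scripted_decide(history: list) -> Optional[str]:
    # Stand-in for the model: retry until one click has been verified.
    if any(step["verified"] for step in history):
        return None
    return "click_upvote"


trace = run_agent(scripted_decide, flaky_click, lambda a, r: r == "clicked")
```

Without the `verified` flag, the first failed click would look like success and later steps would build on it. The step limit bounds the damage if verification never passes.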

Two Types of Harnesses

Eval Harnesses

Frameworks for evaluating AI model performance systematically. Examples: OpenAI Evals, LangChain Benchmarks, Anthropic's evaluation frameworks. These measure accuracy, safety, and capability across standardized tasks.

Agent Harnesses

Frameworks for running AI agents in production. The focus is on reliability, safety, and repeatability. Examples: LangGraph, AutoGPT, Anthropic's reference harness.

Key Insight: Extract Determinism from the Model

The central engineering insight: move as much determinism as possible out of the stochastic model and into the deterministic harness.

  • Tool definitions should be static and versioned
  • Control flow should be explicit and auditable
  • State transitions should follow defined patterns
  • The harness itself should be unit-testable
  • All the stochasticity (creativity, variation) lives in the model, not the scaffolding

Design Principles for Building Great Harnesses

  1. Fail fast, fail gracefully — Explicit timeouts, immediate error surfacing, circuit breakers, fallback behaviors
  2. Deterministic scaffolding, stochastic agents — The harness is predictable, the model is creative
  3. Observability by default — Every tool call logged, reasoning traces stored, real-time dashboards
  4. Principle of least privilege — Granular permissions, scoped credentials, human gates for high-risk actions
  5. Idempotency where possible — Safe to retry, deduplication, execution ID tracking
  6. State is explicit, not implicit — Durable storage, versioned state, clean serialization
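Principle 5 can be illustrated with a small executor that derives an execution ID from the tool name and arguments and refuses to run the same ID twice. This is a sketch under assumptions: `IdempotentExecutor` is a hypothetical name, and the in-memory dict stands in for the durable storage that principle 6 calls for.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotentExecutor:
    def __init__(self) -> None:
        self._results: dict[str, Any] = {}  # stand-in for durable storage
        self.calls = 0                      # number of real executions

    @staticmethod
    def execution_id(tool: str, args: dict[str, Any]) -> str:
        # sort_keys makes the ID independent of argument order.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, tool: str, args: dict[str, Any], handler: Callable[..., Any]) -> Any:
        key = self.execution_id(tool, args)
        if key in self._results:
            return self._results[key]  # retry: return the stored result
        self.calls += 1
        result = handler(**args)
        self._results[key] = result
        return result


executor = IdempotentExecutor()
first = executor.run("charge", {"user": "u1", "cents": 500}, lambda user, cents: f"charged {user} {cents}")
retry = executor.run("charge", {"cents": 500, "user": "u1"}, lambda user, cents: f"charged {user} {cents}")
```

Deriving the ID from the arguments also deduplicates two intentionally identical calls. Where that is wrong, a caller-supplied execution ID is the usual alternative.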

Common Failure Modes

| Failure Mode | Description | Solution |
| --- | --- | --- |
| Infinite Loop Trap | Repeated identical tool calls | Loop detection, max step limits |
| Context Window Overload | History grows unbounded | Sliding window summarization |
| Tool Confusion | Wrong tool called | Clear descriptions, validation layers |
| Silent Failure | Tool fails, agent doesn't notice | Mandatory error acknowledgment |
| Prompt Injection | User input overwrites system prompt | Input sanitization, output validation |
| Frankenstein Agent | Patched beyond comprehension | Architectural discipline from day one |
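As an example of the first failure mode, loop detection can be as simple as counting identical tool calls. The sketch below is illustrative: `LoopDetector` and `max_repeats` are hypothetical names, and the threshold is arbitrary.

```python
import json
from collections import Counter
from typing import Any


class LoopDetector:
    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self.counts: Counter = Counter()

    def check(self, tool: str, args: dict[str, Any]) -> None:
        """Raise if the same tool is called with identical arguments too often."""
        key = (tool, json.dumps(args, sort_keys=True))
        self.counts[key] += 1
        if self.counts[key] > self.max_repeats:
            raise RuntimeError(
                f"loop detected: {tool} called {self.counts[key]} times "
                "with identical arguments"
            )


detector = LoopDetector(max_repeats=2)
detector.check("search", {"q": "hacker news"})
detector.check("search", {"q": "hacker news"})
detector.check("search", {"q": "something else"})  # different arguments are fine
try:
    detector.check("search", {"q": "hacker news"})
    loop_caught = False
except RuntimeError:
    loop_caught = True
```

Calling `check` before every tool dispatch turns a silent infinite loop into an explicit error the harness can handle.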

How Harness Design Affects Performance

  • Accuracy: Structured data access + validation steps reduce hallucination
  • Reliability: Edge case handling + failure recovery without human intervention
  • Efficiency: Caching + parallel tool calls + smart context management
  • Safety: Access controls + pattern detection + emergency stop
  • Scalability: Modular tool design + multi-instance support + monitoring integration

Key Takeaways

  1. The harness is infrastructure, not magic. Building a good harness requires software engineering discipline, not prompt engineering tricks.
  2. Start simple, then layer. Begin with the minimal harness that solves your problem.
  3. Test everything. Unit tests, integration tests, end-to-end tests. Treat it like production software.
  4. Monitor aggressively. You cannot improve what you cannot measure.
  5. Design for failure. Assume tools will fail, networks will be slow, and models will hallucinate.
  6. GPT-3.5 + strong harness > raw GPT-4. The model matters less than what you build around it.

References

  • Anthropic's Defending Code Reference Harness — Reference implementation demonstrating harness patterns
  • LangChain / LangGraph — Popular frameworks for building agent harnesses
  • OpenAI Function Calling / Tool Use — API-level infrastructure for harness building
  • "ReAct" (Yao et al., 2022) — The reasoning-acting loop foundation
  • "Toolformer" (Schick et al., 2023) — How models learn to use tools

These lecture notes summarize and expand upon Tejas Kumar's talk on harnesses in AI. The views expressed are those of the speaker and do not necessarily reflect the position of IBM.