Lecture Notes on Harnesses in AI: A Deep Dive

Speaker: Tejas Kumar (IBM)
Source: YouTube, "Harnesses in AI: A Deep Dive"
Date: 2026-06-05

Core Thesis: It's Not a Prompt Problem, It's a Harness Problem

For years, the AI community has obsessed over "the prompt problem" — crafting the perfect instruction for a language model. Prompt engineering, chain-of-thought, few-shot examples, system prompts. All valuable. But Tejas Kumar argues we've been neglecting the far more consequential challenge: the harness problem.

The harness is the infrastructure layer that sits between the LLM and the outside world. It determines how the model accesses tools, retains context, manages state, handles errors, and coordinates multiple steps. Kumar's claim is that, for building reliable agents, it matters more than the model itself.

The Demonstration That Proves the Point

Kumar's demo is devastatingly simple: GPT-3.5 Turbo + Playwright fails at a basic Hacker News upvote task. The model can't navigate the DOM, can't handle authentication, can't manage state across page loads.

Add harness components (tool definitions, state management, guardrails, an agent loop with verification) and the same model becomes reliable. The takeaway: a better harness beats a model upgrade. Kumar's claim is that GPT-3.5 with a strong harness outperforms raw GPT-4 on the same task.

The Five Components of an Agent Harness

1. Tool Registry

The mechanism by which tools (APIs, functions, databases, search engines, code interpreters) are made available to the model.

Tool definitions: JSON Schema or similar structured format
Tool discovery: How the model learns what tools exist and when to use them
Tool routing: Mapping natural language intent to specific tool calls
Authentication/authorization: Managing credentials and permissions per tool
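
The four responsibilities above can be sketched in a few lines of Python. This is a minimal illustration, not the harness from the talk: the names `Tool`, `ToolRegistry`, `required_scope`, and the `add` tool are all assumptions made for the example, and the argument check only looks at the schema's `required` list rather than doing full JSON Schema validation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str             # what the model reads to decide when to use the tool
    parameters: dict[str, Any]   # JSON Schema-style description of the arguments
    handler: Callable[..., Any]  # the function that actually runs
    required_scope: str          # hypothetical permission label the caller must hold


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def describe(self) -> list[dict[str, Any]]:
        """Tool discovery: the definitions handed to the model."""
        return [
            {"name": t.name, "description": t.description, "parameters": t.parameters}
            for t in self._tools.values()
        ]

    def call(self, name: str, args: dict[str, Any], scopes: set[str]) -> Any:
        """Tool routing plus a per-tool authorization check."""
        tool = self._tools.get(name)
        if tool is None:
            raise KeyError(f"unknown tool: {name}")
        if tool.required_scope not in scopes:
            raise PermissionError(f"missing scope: {tool.required_scope}")
        missing = [k for k in tool.parameters.get("required", []) if k not in args]
        if missing:
            raise ValueError(f"missing arguments: {missing}")
        return tool.handler(**args)


registry = ToolRegistry()
registry.register(Tool(
    name="add",
    description="Add two integers.",
    parameters={
        "type": "object",
        "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
        "required": ["a", "b"],
    },
    handler=lambda a, b: a + b,
    required_scope="math:read",
))
```

Calling `registry.call("add", {"a": 2, "b": 3}, {"math:read"})` runs the handler, while the same call with an empty scope set raises `PermissionError` before the tool is touched.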

2. AI Model

The LLM itself — but Kumar emphasizes this is just one component. The model should be swappable without changing the harness.
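
One way to keep the model swappable is to have the harness depend on a small interface rather than on a specific vendor client. The sketch below assumes a hypothetical `complete(prompt)` method; `EchoModel` and `UpperModel` are stand-ins, not real model clients.

```python
from typing import Protocol


class Model(Protocol):
    """Anything with a complete() method can be plugged into the harness."""
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in model that returns the prompt unchanged."""
    def complete(self, prompt: str) -> str:
        return prompt


class UpperModel:
    """A second stand-in, to show that swapping needs no harness changes."""
    def complete(self, prompt: str) -> str:
        return prompt.upper()


class Harness:
    def __init__(self, model: Model) -> None:
        self.model = model

    def run(self, task: str) -> str:
        return self.model.complete(f"Task: {task}")


first = Harness(EchoModel()).run("upvote")
second = Harness(UpperModel()).run("upvote")
```

The `Harness` class is identical in both runs; only the object passed to its constructor differs.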

3. Context Management

How the agent manages conversation history, state, and memory:

Short-term memory: Conversation context window (limited, expensive)
Long-term memory: Persistent storage (vector databases, key-value stores)
Episodic memory: Records of past actions and outcomes
Semantic memory: Facts, concepts, and learned patterns
Working memory: Current task state and intermediate results
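
As a rough sketch of how these memory types might sit side by side, the class below keeps a bounded window of recent turns and plain in-memory containers for everything else. `AgentContext` and its fields are assumptions for illustration; a real harness would back long-term memory with a database or vector store, and semantic facts are simply folded into `long_term` here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    max_turns: int = 4
    short_term: deque = field(default_factory=deque)  # recent turns sent to the model
    long_term: dict = field(default_factory=dict)     # stand-in for a persistent store
    episodes: list = field(default_factory=list)      # (action, outcome) records
    working: dict = field(default_factory=dict)       # current task state

    def add_turn(self, role: str, text: str) -> None:
        self.short_term.append((role, text))
        while len(self.short_term) > self.max_turns:
            self.short_term.popleft()                 # drop the oldest turn

    def remember(self, key: str, fact: str) -> None:
        self.long_term[key] = fact

    def record(self, action: str, outcome: str) -> None:
        self.episodes.append((action, outcome))

    def build_prompt(self) -> str:
        """Combine stored facts with the recent turns that still fit the window."""
        facts = "\n".join(f"- {k}: {v}" for k, v in sorted(self.long_term.items()))
        turns = "\n".join(f"{role}: {text}" for role, text in self.short_term)
        return f"Facts:\n{facts}\n\nRecent turns:\n{turns}"


ctx = AgentContext(max_turns=2)
ctx.remember("site", "Hacker News")
for i in range(3):
    ctx.add_turn("user", f"msg {i}")
```

After three turns with `max_turns=2`, only the last two remain in `short_term`, while the remembered fact still appears in every prompt.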

4. Guardrails

Safety boundaries that prevent the agent from going off-course:

Input sanitization and prompt injection detection
Output validation and content filtering
Rate limits and resource quotas
Permission boundaries per tool
Human-in-the-loop checkpoints for high-risk actions
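
A few of these checks can be sketched as a single gate the harness consults before acting. Everything named here (`Guardrails`, `Decision`, the tool names, the single regex) is an assumption for illustration. In particular, one pattern is nowhere near adequate prompt injection detection; it only marks where such a check would go.

```python
import re
from dataclasses import dataclass

# Illustrative pattern only; real injection detection needs far more than a regex.
INJECTION_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)]
HIGH_RISK_TOOLS = {"send_payment", "delete_account"}  # hypothetical tool names


@dataclass
class Decision:
    allowed: bool
    reason: str


class Guardrails:
    def __init__(self, allowed_tools: set[str], max_calls: int) -> None:
        self.allowed_tools = allowed_tools
        self.max_calls = max_calls
        self.calls = 0

    def check_input(self, text: str) -> Decision:
        for pattern in INJECTION_PATTERNS:
            if pattern.search(text):
                return Decision(False, "possible prompt injection")
        return Decision(True, "ok")

    def check_tool_call(self, tool: str, approved_by_human: bool = False) -> Decision:
        if tool not in self.allowed_tools:
            return Decision(False, "tool not permitted")       # permission boundary
        if self.calls >= self.max_calls:
            return Decision(False, "call quota exhausted")     # resource quota
        if tool in HIGH_RISK_TOOLS and not approved_by_human:
            return Decision(False, "human approval required")  # human-in-the-loop gate
        self.calls += 1                                        # count only allowed calls
        return Decision(True, "ok")


guard = Guardrails(allowed_tools={"search", "send_payment"}, max_calls=2)
```

Returning a `Decision` with a reason, rather than a bare boolean, gives the harness something to log and something to feed back to the model when a call is refused.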

5. Agent Loop + Verify Step

The control flow that drives the agent's operation:

ReAct loop (Reasoning + Acting): Alternating between thought and action
Plan-then-execute: Generate a plan, then execute steps
Tree-of-thought: Explore multiple reasoning paths
Dynamic re-planning: Adjust plans based on intermediate results
Critical: The Verify Step — after each action, the agent verifies the result before proceeding. This is what prevents cascading failures.
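
The loop with a verify step can be sketched as follows. This is a simplified ReAct-style loop under stated assumptions: `decide`, `act`, and `verify` are hypothetical callables supplied by the caller, and the toy `act` fails on its first attempt so that the verify step has something to catch.

```python
from typing import Any, Callable, Optional

Step = dict[str, Any]


def run_agent(
    decide: Callable[[list[Step]], Optional[dict]],
    act: Callable[[dict], dict],
    verify: Callable[[dict, dict], bool],
    max_steps: int = 10,
) -> list[Step]:
    history: list[Step] = []
    for _ in range(max_steps):
        action = decide(history)     # reason: pick the next action, or None when done
        if action is None:
            break
        result = act(action)         # act: execute the chosen action
        ok = verify(action, result)  # verify: check the result before moving on
        history.append({"action": action, "result": result, "verified": ok})
    return history


attempts = {"n": 0}


def decide(history: list[Step]) -> Optional[dict]:
    if history and history[-1]["verified"]:
        return None                       # last action checked out, so stop
    return {"tool": "upvote", "id": 42}   # first try, or retry after a failed check


def act(action: dict) -> dict:
    attempts["n"] += 1
    return {"status": "ok" if attempts["n"] > 1 else "error"}


def verify(action: dict, result: dict) -> bool:
    return result.get("status") == "ok"


trace = run_agent(decide, act, verify)
```

Because the verification result is written into `history`, the next `decide` call sees the failure and retries instead of building further steps on an action that never succeeded.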

Two Types of Harnesses

Eval Harnesses

Frameworks for evaluating AI model performance systematically. Examples: OpenAI Evals, LangChain Benchmarks, Anthropic's evaluation frameworks. These measure accuracy, safety, and capability across standardized tasks.
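
Stripped to its core, an eval harness runs a model over a fixed set of cases and reports a score. The sketch below is a toy illustration and does not reflect the API of any framework named above; `evaluate`, `CASES`, and `toy_model` are assumptions, and exact string match is only one of many possible scoring rules.

```python
from typing import Callable


def evaluate(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the model output matches the expected answer."""
    if not cases:
        return 0.0
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)


CASES = [("2+2", "4"), ("3+3", "6"), ("10-7", "3")]

# Stand-in model: a lookup table that knows two of the three answers.
KNOWN = {"2+2": "4", "3+3": "6"}


def toy_model(prompt: str) -> str:
    return KNOWN.get(prompt, "?")


score = evaluate(toy_model, CASES)
```

Here `toy_model` answers two of the three cases, so `score` comes out at two thirds.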

Agent Harnesses

Frameworks for running AI agents in production. The focus is on reliability, safety, and repeatability. Examples: LangGraph, AutoGPT, Anthropic's reference harness.

Key Insight: Extract Determinism from the Model

The central engineering insight: move as much determinism as possible out of the stochastic model and into the deterministic harness.

Tool definitions should be static and versioned
Control flow should be explicit and auditable
State transitions should follow defined patterns
The harness itself should be unit-testable
All the stochasticity (creativity, variation) lives in the model, not the scaffolding
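
One consequence of this split is that the harness can be unit-tested with no real model at all. The sketch below replaces the model with a scripted stand-in that replays canned outputs; `ScriptedModel`, `TOOLS`, and `step` are hypothetical names chosen for the example.

```python
import unittest


class ScriptedModel:
    """Deterministic stand-in that replays canned outputs in order."""
    def __init__(self, outputs: list[str]) -> None:
        self._outputs = iter(outputs)

    def complete(self, prompt: str) -> str:
        return next(self._outputs)


TOOLS = {"double": lambda x: 2 * x}  # static tool table


def step(model, state: int) -> int:
    """One harness step: ask the model for a tool name, then apply that tool."""
    name = model.complete(f"state={state}")
    if name not in TOOLS:
        raise ValueError(f"model requested unknown tool: {name!r}")
    return TOOLS[name](state)


class HarnessTest(unittest.TestCase):
    def test_known_tool_is_applied(self):
        self.assertEqual(step(ScriptedModel(["double"]), 3), 6)

    def test_unknown_tool_is_rejected(self):
        with self.assertRaises(ValueError):
            step(ScriptedModel(["rm -rf"]), 3)
```

The tests exercise only harness behavior: routing a valid tool name and rejecting an invalid one. Whether a live model picks the right tool is a separate question, and it belongs to evaluation rather than unit testing.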

Design Principles for Building Great Harnesses

Fail fast, fail gracefully — Explicit timeouts, immediate error surfacing, circuit breakers, fallback behaviors
Deterministic scaffolding, stochastic agents — The harness is predictable, the model is creative
Observability by default — Every tool call logged, reasoning traces stored, real-time dashboards
Principle of least privilege — Granular permissions, scoped credentials, human gates for high-risk actions
Idempotency where possible — Safe to retry, deduplication, execution ID tracking
State is explicit, not implicit — Durable storage, versioned state, clean serialization
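
Two of these principles, idempotency and failing fast, are small enough to sketch directly. The classes below are illustrative assumptions rather than a library API: `IdempotentExecutor` keeps results in memory (a real harness would use durable storage), and `CircuitBreaker` never resets on a timer, which a production version usually would.

```python
from typing import Any, Callable


class IdempotentExecutor:
    """Runs each execution ID at most once and replays the stored result on retry."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def run(self, execution_id: str, fn: Callable[..., Any], *args: Any) -> Any:
        if execution_id in self._results:
            return self._results[execution_id]  # duplicate: skip the side effect
        result = fn(*args)
        self._results[execution_id] = result
        return result


class CircuitBreaker:
    """Stops calling a tool after max_failures consecutive errors."""

    def __init__(self, max_failures: int) -> None:
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self.failures >= self.max_failures:
            raise RuntimeError("circuit open: tool disabled after repeated failures")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise                                # surface the error immediately
        self.failures = 0                        # any success resets the count
        return result


sent: list[str] = []


def send_email(to: str) -> str:
    """Hypothetical tool with a side effect we do not want to repeat."""
    sent.append(to)
    return f"sent to {to}"


executor = IdempotentExecutor()
executor.run("exec-001", send_email, "a@example.com")
executor.run("exec-001", send_email, "a@example.com")  # retry with the same ID
```

The second `run` call returns the stored result without calling `send_email` again, so `sent` holds a single entry.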

Common Failure Modes

| Failure Mode | Description | Solution |
| --- | --- | --- |
| Infinite Loop Trap | Repeated identical tool calls | Loop detection, max step limits |
| Context Window Overload | History grows unbounded | Sliding window summarization |
| Tool Confusion | Wrong tool called | Clear descriptions, validation layers |
| Silent Failure | Tool fails, agent doesn't notice | Mandatory error acknowledgment |
| Prompt Injection | Untrusted input overrides the system prompt's instructions | Input sanitization, output validation |
| Frankenstein Agent | Patched beyond comprehension | Architectural discipline from day one |
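
The first row, loop detection with a step limit, is simple enough to sketch. `LoopGuard` is a hypothetical helper; it assumes tool arguments are JSON-serializable and only catches consecutive identical calls, not longer repeating cycles.

```python
import json


class LoopGuard:
    def __init__(self, max_steps: int = 20, max_repeats: int = 3) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.repeats = 0
        self.last_key = None

    def check(self, tool: str, args: dict) -> None:
        """Raise if the run exceeds its step budget or repeats one call too often."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("step limit exceeded")
        # Canonical form of the call, so argument order does not matter.
        key = json.dumps([tool, args], sort_keys=True)
        self.repeats = self.repeats + 1 if key == self.last_key else 1
        self.last_key = key
        if self.repeats > self.max_repeats:
            raise RuntimeError(f"loop detected: {tool} repeated {self.repeats} times")


guard = LoopGuard(max_steps=10, max_repeats=2)
guard.check("search", {"q": "harness"})
guard.check("search", {"q": "harness"})  # second identical call is still allowed
```

A third identical `check` call would raise `RuntimeError`. The harness calls `check` before every tool invocation, so the run stops on the guard's terms rather than when the budget runs out.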

How Harness Design Affects Performance

Accuracy: Structured data access + validation steps reduce hallucination
Reliability: Edge case handling + failure recovery without human intervention
Efficiency: Caching + parallel tool calls + smart context management
Safety: Access controls + pattern detection + emergency stop
Scalability: Modular tool design + multi-instance support + monitoring integration
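
The efficiency point, caching plus parallel tool calls, can be illustrated with `asyncio`. This is a sketch under assumptions: `fetch` is a hypothetical slow tool that sleeps instead of doing network I/O, and the cache is a plain dictionary with no eviction and no protection against two concurrent requests for the same key.

```python
import asyncio

_cache: dict[str, str] = {}
fetch_count = 0


async def fetch(url: str) -> str:
    """Hypothetical slow tool; sleeps briefly instead of doing real I/O."""
    global fetch_count
    fetch_count += 1
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def cached_fetch(url: str) -> str:
    if url not in _cache:
        _cache[url] = await fetch(url)
    return _cache[url]


async def fetch_all(urls: list[str]) -> list[str]:
    # gather() runs the calls concurrently and returns results in input order.
    return await asyncio.gather(*(cached_fetch(u) for u in urls))


first = asyncio.run(fetch_all(["a", "b", "c"]))
second = asyncio.run(fetch_all(["a", "b", "c"]))  # served entirely from the cache
```

The first batch performs three fetches concurrently, and the second batch performs none, so `fetch_count` stays at three.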

Key Takeaways

The harness is infrastructure, not magic. Building a good harness requires software engineering discipline, not prompt engineering tricks.
Start simple, then layer. Begin with the minimal harness that solves your problem.
Test everything. Unit tests, integration tests, end-to-end tests. Treat it like production software.
Monitor aggressively. You cannot improve what you cannot measure.
Design for failure. Assume tools will fail, networks will be slow, and models will hallucinate.
GPT-3.5 + strong harness > raw GPT-4. The model matters less than what you build around it.

References

Anthropic's Defending Code Reference Harness — Reference implementation demonstrating harness patterns
LangChain / LangGraph — Popular frameworks for building agent harnesses
OpenAI Function Calling / Tool Use — API-level infrastructure for harness building
"ReAct" (Yao et al., 2022) — The reasoning-acting loop foundation
"Toolformer" (Schick et al., 2023) — How models learn to use tools

These lecture notes summarize and expand upon Tejas Kumar's talk on harnesses in AI. The views expressed are those of the speaker and do not necessarily reflect the position of IBM.