The agent harness — series

Traces are how agents get better

Yee Seng Chan — Sun, 10 May 2026 00:00:00 GMT

Part of a series

The agent harness

A user asked the docs Q&A agent: “What’s the data retention policy for trial accounts?”

The agent answered: “Trial accounts: data is retained for 30 days after closure, after which it is deleted.” The answer was confident, polished, and wrong. The corpus says trial-account data is retained for 90 days. Thirty days is the standard-account window.

The team finds out a day later, when the user files a correction. The logs show this:

2026-05-08 14:23:18  run_551  query received
2026-05-08 14:23:19  run_551  tool_call: retrieval, status: success
2026-05-08 14:23:21  run_551  response generated
2026-05-08 14:23:21  run_551  run completed

That log records events. It does not show which passages were retrieved, what the model saw, why it chose 30 days, which passage supported the claim, or which validation checks ran. The team can guess. It cannot debug.

Logs tell you what happened. Traces show why it happened.

The previous article covered gates: runtime checks that stop wrong actions before they happen. This article covers traces: structured records that make failures explainable after they happen.

What a useful trace records

A trace records one structured step for each meaningful action: a model decision, tool call, verification, phase transition, or gate firing. It should not store a wall of chat text or a dump of model reasoning.

Each step should record:

Versions: prompt, model, tool, index, schema, and validator versions.
State before: what the system knew before the step.
Context pack: what the model actually saw.
Action chosen: what the system decided to do, with a short reason.
Verification: what was checked after the action.
State after: what changed.

Here is one step from the failed retention run, captured at synthesis:

trace_step = {
    "step_kind": "synthesize",
    "versions": {
        "prompt": "syn_v6",
        "index": "ri_v3",
        "validator": "val_v4",
    },

    "state_before": {
        "phase": "synthesize",
        "query": "What is the data retention policy for trial accounts?",
        "retrieved_passages": [
            {"id": "psg_1102", "category": "trial_accounts"},
            {"id": "psg_0871", "category": "general"},
            {"id": "psg_0883", "category": "general"},
            {"id": "psg_0902", "category": "general"},
            {"id": "psg_0915", "category": "general"},
        ],
    },

    "context_pack": {
        "retrieved_refs": ["psg_1102", "psg_0871", "psg_0883", "psg_0902", "psg_0915"],
        "passage_metadata_included": False,
    },

    "action_chosen": {
        "type": "draft_grounded_answer",
        "reason": "5 passages retrieved; drafting answer from passage content",
    },

    "verification": {
        "all_claims_cited": True,
        "category_alignment_check": "not_run",
    },

    "state_after": {
        "phase": "validate",
        "claims": [
            {
                "text": "Trial accounts: data retained 30 days after closure",
                "evidence": ["psg_0871"],
            },
        ],
        "drafted_answer": "Trial accounts: data is retained for 30 days after closure...",
    },
}

That is enough to debug the run. General-account passages mentioning “30 days” entered the answer path, and no category_alignment_check ran to catch the category mismatch.

Rationale, not chain-of-thought

Store decision rationale, not full chain-of-thought. The rationale should be a short operational reason, captured in action_chosen.reason.

Examples:

Clarification: target was ambiguous, so the agent asked a clarifying question.
Synthesis: five passages were retrieved, so the agent drafted from passage content.

The team needs the reason for the action, not the model’s internal rumination.

Walking the retention-policy trace

The user asked about trial-account retention. The agent answered 30 days. The corpus says 90 days. The team opens run_551.

Step 1, receive_query, captured the user’s message and wrote it into state.

Step 2, retrieve, pulled five passages:

retrieved_passages = [
    {"id": "psg_1102", "category": "trial_accounts", "score": 0.93,
     "snippet": "Trial accounts: data retained 90 days after trial expiry."},
    {"id": "psg_0871", "category": "general", "score": 0.91,
     "snippet": "Standard accounts: data retained 30 days after closure..."},
    {"id": "psg_0883", "category": "general", "score": 0.89,
     "snippet": "Default retention windows for closed accounts are 30 days..."},
    {"id": "psg_0902", "category": "general", "score": 0.87,
     "snippet": "After account closure, customer data is purged within 30 days..."},
    {"id": "psg_0915", "category": "general", "score": 0.85,
     "snippet": "Standard data retention: 30 days post-closure..."},
]

Retrieval found the right passage and ranked it first. The system did not fail to retrieve the answer.

Step 3, synthesize, shows the first bad move. Synthesis received passage text without category tags. It saw four general passages saying 30 days and one trial-account passage saying 90 days. Without category metadata, the model had no signal that the 90-day passage was the relevant one.

Step 4, ground, mapped the drafted claim to psg_0871. Grounding succeeded narrowly because the claim had a citation. The missing check was category alignment. The trace shows category_alignment_check: not_run.

The failure is now specific: retrieval found the answer, context assembly dropped the metadata, synthesis chose the wrong passage, and validation missed the category mismatch.

A team reading only the output might edit the synthesis prompt. A team reading the trace fixes the right layer: include category metadata in the context pack, weight category-specific passages for category-specific queries, and validate category alignment before returning the answer.
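The category-alignment validation named above could look roughly like this. It is a minimal sketch, not the series' validator: the function name category_alignment_check and the assumption that the query's category is already known (passed in as query_category) are illustrative. The passage IDs and categories are taken from the run_551 trace.

```python
def category_alignment_check(claims, passages, query_category):
    """Return the claims whose cited evidence does not match the query category.

    Hypothetical helper: assumes each passage dict has "id" and "category",
    and each claim dict has "text" and "evidence" (a list of passage IDs).
    """
    category_by_id = {p["id"]: p["category"] for p in passages}
    misaligned = []
    for claim in claims:
        cited = [category_by_id.get(pid) for pid in claim["evidence"]]
        # A claim is aligned only if at least one cited passage is in the query's category.
        if query_category not in cited:
            misaligned.append({"text": claim["text"], "cited_categories": cited})
    return misaligned


passages = [
    {"id": "psg_1102", "category": "trial_accounts"},
    {"id": "psg_0871", "category": "general"},
]
claims = [{"text": "Trial accounts: data retained 30 days after closure",
           "evidence": ["psg_0871"]}]

failures = category_alignment_check(claims, passages, "trial_accounts")
print(failures)
```

On the failed run, the check flags the drafted claim because its only citation is a general passage. A claim citing psg_1102 would pass.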

From first bad move to durable fix

A failure should leave behind a trace, a fix, and a regression case.

First bad move: locate where the run first went wrong. The trace should show whether the failure began in retrieval, context assembly, action choice, tool arguments, state updates, validation, or workflow phase.
Durable fix: change the harness layer that failed. For the retention issue, that means category-aware retrieval, category metadata in the context pack, and category-alignment validation.
Regression case: encode the failure shape so it cannot return silently. A “retention policy for [category]” query should retrieve the category-specific passage at rank 1, and the validator should reject answers citing the wrong category.
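One way to encode that regression case is sketched below. The retrieve and validate callables are hypothetical stand-ins for the real retrieval and validation layers, and the stub implementations exist only so the example runs on its own.

```python
def run_regression_case(case, retrieve, validate):
    """Run one regression case; return a list of failure strings (empty means pass).

    `retrieve(query)` returns ranked passages; `validate(query_category,
    cited_category)` returns True if the validator would accept the answer.
    Both are assumed interfaces, not the article's actual API.
    """
    failures = []
    ranked = retrieve(case["query"])
    if not ranked or ranked[0]["category"] != case["expected_category"]:
        failures.append("rank_1_passage_has_wrong_category")
    # The validator must reject an answer that cites the wrong category.
    if validate(case["expected_category"], case["wrong_category"]):
        failures.append("validator_accepted_wrong_category")
    return failures


case = {
    "query": "What is the data retention policy for trial accounts?",
    "expected_category": "trial_accounts",
    "wrong_category": "general",
}


# Stub layers so the sketch is self-contained.
def fake_retrieve(query):
    return [{"id": "psg_1102", "category": "trial_accounts"},
            {"id": "psg_0871", "category": "general"}]


def fake_validate(query_category, cited_category):
    return query_category == cited_category


print(run_regression_case(case, fake_retrieve, fake_validate))
```

An empty list means the case passes. Each failure string names the layer that regressed, which keeps the regression report aligned with the first-bad-move vocabulary.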

Figure 1: The hardening loop. A failure produces a trace; the trace identifies the first bad move; the team ships a durable fix and a regression case. The cycle feeds back into the system. The harness is what changes, and what changes is what compounds.

Teams that run this loop on every incident are doing reliability engineering. Teams that stop at “patched the prompt, moved on” are managing symptoms. A system improves when the team stops wasting failure.

Minimum viable trace

A v1 agent does not need a perfect observability platform. It needs a minimum viable trace.

For every meaningful step, record:

Identity: run_id, step_id, and step kind.
Versions: prompt, model, tool, index, validator, and schema versions.
State: state_before and state_after.
Context: what entered the model context.
Action: action_chosen and a short reason.
Tool data: tool inputs and outputs, when a tool runs.
Verification: checks that ran and their results.
Metrics: latency, cost, token counts, and other operational fields.

Different agents need different trace fields. For instance, these fields are useful for the docs Q&A agent:

retrieval query, passage IDs, claims, claim-to-evidence mapping, validator result
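The minimum viable trace can be sketched as one record type per step. The class name TraceStep, the field name tool_io, and the JSON-lines serialization are assumptions for illustration. The field groups follow the list above, and the sample values come from the run_551 synthesis step.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TraceStep:
    # Identity
    run_id: str
    step_id: int
    step_kind: str
    # Versions of everything that could change behavior
    versions: dict
    # State and context
    state_before: dict
    context_pack: dict
    # Action and a short operational reason
    action_chosen: dict
    # Optional groups, filled when they apply to the step
    tool_io: dict = field(default_factory=dict)
    verification: dict = field(default_factory=dict)
    state_after: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

    def to_json_line(self) -> str:
        """Serialize the step as one JSON line for an append-only trace file."""
        return json.dumps(asdict(self), sort_keys=True)


step = TraceStep(
    run_id="run_551",
    step_id=3,
    step_kind="synthesize",
    versions={"prompt": "syn_v6", "index": "ri_v3", "validator": "val_v4"},
    state_before={"phase": "synthesize"},
    context_pack={"passage_metadata_included": False},
    action_chosen={"type": "draft_grounded_answer",
                   "reason": "5 passages retrieved; drafting answer from passage content"},
    verification={"all_claims_cited": True, "category_alignment_check": "not_run"},
)
line = step.to_json_line()
print(line)
```

Agent-specific fields, such as claim-to-evidence mappings for the docs Q&A agent, can live inside state_after or verification without changing the record shape.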

Closing the loop

Five articles in this series, one argument:

Why AI agent demos break in production: recurring production failures live in the system around the model.
The harness is the product: an agent is the model plus the harness around it.
State, not transcript, is agent memory: state is the memory layer the runtime can read and update.
Prompts guide. Gates enforce: gates turn state into runtime enforcement.
Traces are how agents get better: traces turn failures into durable fixes.

The harness has four jobs:

Shape the input: decide what the model sees.
Bound the action: decide what the system allows.
Verify the outcome: check what happened.
Preserve the lesson: keep enough evidence to improve the system.

State tells the system what it believes. Gates decide what it may do. Tools let it act. Verification checks the result. Traces preserve the path.

Prompts guide. Gates enforce

Yee Seng Chan — Fri, 01 May 2026 00:00:00 GMT

A prompt can guide behavior. It cannot enforce behavior.

The scheduling assistant’s prompt says, “Always confirm the target meeting before making any changes.” A user asks it to move “my Tuesday review with Priya” to Thursday afternoon. The model calls reschedule_event(description="Tuesday review with Priya", new_time="Thursday afternoon"); the API matches a different recurring meeting with Priya; the agent replies, “I’ve moved it, and I’ll keep an eye on it going forward.”

The failure is not the wording of the prompt. The runtime allowed an unsafe action. A gate should have blocked reschedule_event until the target event was uniquely confirmed, non-recurring, and represented by a specific event ID.

Prompts shape behavior. Gates enforce behavior.

The previous article established that state holds the system’s beliefs. This article is about the runtime checks that decide whether the system can act on those beliefs.

Why prompt-only control breaks

Prompt-only control works best for soft behavior: tone, formatting, response length, and style. It is weak for the following.

Fuzzy boundaries: “Do not commit to remediation timelines” sounds clear, but real language is messy.
Runtime facts: a prompt cannot verify that an event ID is unique, that a user has permission, that a write already happened after a timeout, or that every answer claim is supported by evidence.
Competing rules: production prompts accumulate rules for tone, safety, tools, escalation, formatting, and edge cases. Rules important enough to block behavior should not live only in text.

In each case, the system is asking the model to enforce constraints that the runtime should own.

The model proposes. The harness checks.

A gate checks a proposed action or output at runtime. It can allow the proposal, reject it, or route it to a safer path. Two kinds of gates cover most production needs.

Hard gates: check crisp facts

Hard gates check conditions that are computable from state. They do not need model judgment.

Is this tool allowed in the current workflow phase?
Is the target uniquely identified and confirmed?
Has approval been granted, and is it still unexpired?
Is the user authorized, and is the write budget still available?

If a check fails, the action is rejected before it reaches the tool.

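from dataclasses import dataclass

# Minimal result types, assumed here so the gate runs on its own.
@dataclass
class Allow:
    pass

@dataclass
class Reject:
    reason: str

def gate_reschedule(state, action):
    if state["phase"] != "execute":
        return Reject("write_not_allowed_in_phase")
    if state["proposed_change"] is None:
        return Reject("no_proposed_change_materialized")
    # No candidate, or an unconfirmed first candidate, means no confirmed target.
    if not state["candidates"] or state["candidates"][0]["match_confidence"] != "confirmed":
        return Reject("target_not_confirmed")
    if state["candidates"][0]["is_recurring"]:
        return Reject("recurring_series_requires_human_only")
    if action.get("idempotency_key") is None:
        return Reject("missing_idempotency_key")
    return Allow()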

Each line is a mechanical check. If a condition is crisp enough to enforce with code, enforce it with code.

Semantic gates: check meaning

Semantic gates check meaning rather than schema or permissions. They answer questions like:

Does this answer overstate the evidence?
Does this message imply an unauthorized commitment?
Does this response give advice outside the agent’s role?

These checks usually require model judgment. They are slower and more expensive than hard gates, so use them when the risk is semantic and code cannot capture it.

Use hard gates for crisp conditions. Use semantic gates for judgment calls.
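A semantic gate can share the shape of a hard gate while delegating the judgment. The sketch below assumes a judge callable that would wrap a model call in practice. The names GateDecision, semantic_gate, and keyword_judge are hypothetical, and the keyword stub is only a stand-in so the example runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateDecision:
    allowed: bool
    reason: str


def semantic_gate(draft: str, question: str,
                  judge: Callable[[str, str], bool]) -> GateDecision:
    """Ask a judge whether the draft violates the policy question.

    `judge(question, draft)` is an assumed interface that returns True
    when the answer to the question is "yes" (a violation).
    """
    if judge(question, draft):
        return GateDecision(False, f"semantic_violation: {question}")
    return GateDecision(True, "ok")


# Stub judge: a keyword match standing in for a model call.
def keyword_judge(question: str, draft: str) -> bool:
    return "we will fix" in draft.lower()


question = "Does this message imply an unauthorized commitment?"
blocked = semantic_gate("We will fix this by Friday.", question, keyword_judge)
passed = semantic_gate("I have passed this to the support team.", question, keyword_judge)
print(blocked)
print(passed)
```

Injecting the judge keeps the gate testable with a cheap stub and lets the team swap the underlying model without touching the gate logic.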

The tool executes

Gates check proposals before they become actions. Tool contracts narrow what the model can safely propose in the first place.

Tools are contracts, not functions

In a notebook, a tool can be a function with a docstring. In production, a tool is a contract between the model, the runtime, and the outside world. It should define required inputs, allowed use, side effects, retry behavior, and verification.

A weak tool:

def update_calendar(field: str, value: str) -> dict:
    """Update a field on a calendar event."""

A stronger tool:

def reschedule_event(
    event_id: str,          # confirmed unique target only
    new_start_iso: str,
    new_end_iso: str,
    idempotency_key: str,
) -> RescheduleResult:
    """
    Reschedule one confirmed, non-recurring event.
    Requires a specific event_id, not a free-text description.
    """

The second tool removes unsafe paths. It requires a specific event_id, which forces the workflow to identify the target before the call. It requires an idempotency key. It does not expose a broad field parameter that could update anything.

Tool contracts should also distinguish reads from writes. Reads can usually be retried after a timeout. Writes need more care. A write should carry an idempotency key, and an ambiguous timeout should route to verification before retry.
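The write policy described above might be sketched like this. The call_tool and read_back callables are hypothetical, and the status strings are illustrative. The point is the control flow: one write attempt, and an ambiguous timeout goes to verification, not to a retry.

```python
def execute_write(call_tool, read_back, args):
    """Attempt a write once; on an ambiguous timeout, verify instead of retrying.

    `call_tool(args)` performs the write. `read_back(args)` reads the source
    of truth and returns True if the write is visible there. Both are assumed
    interfaces for this sketch.
    """
    if not args.get("idempotency_key"):
        return {"status": "rejected", "reason": "missing_idempotency_key"}
    try:
        call_tool(args)
    except TimeoutError:
        # The write may or may not have landed. Check before doing anything else.
        if read_back(args):
            return {"status": "applied", "reason": "confirmed_after_timeout"}
        return {"status": "unknown", "reason": "timeout_not_visible_route_to_fallback"}
    # Even when the tool reports success, confirm against the source of truth.
    if read_back(args):
        return {"status": "applied", "reason": "confirmed"}
    return {"status": "unknown", "reason": "success_reported_but_not_visible"}


calendar = {}


def flaky_reschedule(args):
    calendar[args["event_id"]] = args["new_start_iso"]   # the write lands...
    raise TimeoutError("no response from calendar API")  # ...but the reply is lost


def calendar_shows(args):
    return calendar.get(args["event_id"]) == args["new_start_iso"]


args = {"event_id": "evt_8819", "new_start_iso": "2026-05-13T10:00:00-04:00",
        "idempotency_key": "run_204:evt_8819:reschedule"}
result = execute_write(flaky_reschedule, calendar_shows, args)
print(result)
```

A blind retry in the timeout branch would risk a duplicate write. Reading back first turns the ambiguity into a known state.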

Broad tools push hidden responsibility onto the model. Narrow tools move more of that responsibility into the harness.

The harness verifies and routes

Gates and tools do not finish the loop. The system also needs controlled fallbacks and safe human approval.

Gates need safe fallbacks

A blocked action should route to a safe next step.

If the scheduling target is ambiguous, route to a clarifying question. If the corpus does not support an answer, route to an honest “I don’t have enough evidence” response. If the intake user asks for a credit, route the request into the human handoff.

A gate should return a reason and a next action. A blocked action should become a controlled detour, not a dead end.
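A gate result that always carries a next action could be shaped as follows. The GateResult type, the FALLBACKS table, and the action names are assumptions chosen to mirror the three examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateResult:
    allowed: bool
    reason: str
    next_action: Optional[str] = None  # where to route when blocked


# Hypothetical mapping from rejection reason to a safe next step.
FALLBACKS = {
    "target_not_confirmed": "ask_clarifying_question",
    "no_supporting_evidence": "respond_insufficient_evidence",
    "credit_requested": "route_to_human_handoff",
}


def block(reason: str) -> GateResult:
    """Build a rejection that always carries a next action."""
    # Unknown reasons fall back to human review instead of a dead end.
    return GateResult(False, reason, FALLBACKS.get(reason, "escalate_to_human"))


known = block("target_not_confirmed")
unknown = block("unexpected_reason")
print(known)
print(unknown)
```

Because block always fills next_action, a caller never has to handle a rejection with nowhere to go.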

Approval packets: approve the exact action

Some actions need human judgment: rescheduling a meeting with eight attendees, sending a customer-facing summary on an enterprise account, or canceling an event from a recurring series.

The weak pattern asks the human for approval, then asks the model to perform the action. That gives the model a second chance to drift. The human approves one description, and the model may execute a slightly different action.

The safer pattern uses an approval packet. The packet is a fully materialized action object that the human reviews. After approval, the runtime executes that exact object: same tool name, same arguments, same idempotency key.

packet = {
    "status": "pending",
    "expires_at": "2026-05-09T17:32:18Z",
    "tool": "reschedule_event",
    "args": {
        "event_id": "evt_8819",
        "new_start": "2026-05-13T10:00:00-04:00",
        "idempotency_key": "run_204:evt_8819:reschedule",
    },
    "human_summary": "Move Tuesday 2pm with Priya to Wednesday 10am",
}

Before execution, the runtime checks that the approval has not expired and that the relevant state has not changed. If the target event, user request, or proposed action changed, the packet is stale and should not execute.
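The pre-execution check might be sketched as below. Two things here are assumptions that the packet above does not show: a state_fingerprint field captured when the packet was created, and a status value of "approved" set after human review.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(obj) -> str:
    """Stable hash of the state the approval depends on."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def can_execute(packet: dict, current_state: dict, now: datetime):
    """Return (ok, reason). Assumes a hypothetical packet["state_fingerprint"]."""
    if packet["status"] != "approved":
        return False, "not_approved"
    # Normalize the trailing "Z" so fromisoformat parses it as UTC.
    expires = datetime.fromisoformat(packet["expires_at"].replace("Z", "+00:00"))
    if now >= expires:
        return False, "approval_expired"
    if fingerprint(current_state) != packet["state_fingerprint"]:
        return False, "state_changed_since_approval"
    return True, "ok"


relevant_state = {"event_id": "evt_8819", "new_start": "2026-05-13T10:00:00-04:00"}
packet = {
    "status": "approved",
    "expires_at": "2026-05-09T17:32:18Z",
    "state_fingerprint": fingerprint(relevant_state),
}
now = datetime(2026, 5, 9, 17, 0, 0, tzinfo=timezone.utc)
print(can_execute(packet, relevant_state, now))
```

Which fields go into the fingerprint is a design choice. Including the target event, the user request, and the proposed arguments matches the staleness conditions listed above.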

Humans approve specific actions, not summaries of actions.

Gate where failure matters

Gate the actions where failure matters most. Do not gate everything.

Heavy gates belong around actions that change the outside world, affect another person, are hard to undo, depend on weak evidence, or may require human escalation. Keep the path light for harmless clarifying questions, read-only lookups, and low-stakes summaries.

The goal is appropriate gating, not maximum gating. A gate that fires on every action becomes a gate the team learns to ignore.

The shape worth keeping

The runtime loop is simple:

The model proposes.
The harness checks.
The tool executes.
The harness verifies.

Prompts inform the proposal. Gates check it. Tool contracts narrow what can be proposed. Safe fallbacks turn blocked actions into controlled detours. Approval packets keep humans and runtime aligned on the exact action.

A prompt alone cannot do those jobs. That is what gates are for.

The next article, Traces are how agents get better, shows what useful traces contain, how they reveal the first bad move in a failing run, and how one bad run becomes a permanent improvement to the harness.

State, not transcript, is agent memory

Yee Seng Chan — Tue, 28 Apr 2026 00:00:00 GMT

Conversation history looks like memory, so many agents use it that way. Each turn gets appended to the transcript, the transcript gets passed back to the model, and the model is expected to remember what matters. This works in short demos because the conversation is small and the stakes are low. It breaks when the agent has to make reliable decisions across many turns.

The intake agent makes the failure concrete. On turn 2, the user said, “We’re on the enterprise plan, and the dashboard is the only feature we use.” On turn 9, the agent asked, “Just to confirm, are you on the standard or enterprise plan?” On turn 12, the handoff record went out with plan_tier: unknown.

The system failed because it never stored the plan tier in state. The answer stayed buried in the transcript instead of becoming plan_tier = enterprise. Any fact that affects future behavior should become a field that later steps can read.

The previous article argued that the harness is the product. State is the first concrete part of that harness: the facts, uncertainties, workflow status, and pending actions that the runtime and model read before the next step.

Raw history is a record, not a decision layer

Raw history preserves the original material, including user messages and tool outputs. The transcript may contain the answer, but later steps need explicit fields to read.

For the intake agent, the handoff writer needed this field:

"plan_tier": {
    "value": "enterprise",
    "confidence": "confirmed",
    "source": "user_turn_2",
}

That field lets the agent skip a redundant plan-tier question, lets the handoff writer emit plan_tier: enterprise, and lets a required-field check decide whether the handoff is ready.

State also records reliability. If a user first says, “I think it might be the migration,” and later says, “Support confirmed it was the migration,” state should mark one claim as uncertain and the other as confirmed. The transcript preserves both sentences, but state tells the system how to use them.

The same issue appears in handoff readiness. If the handoff requires plan_tier, affected_feature, and confirmed_root_cause, state should show which fields are filled, which are uncertain, and which still need follow-up. The agent can then read state instead of reconstructing the situation from the transcript every turn.

State turns remembered facts into usable facts

After turn 2, the intake agent should have written this state:

state = {
    "facts": {
        "plan_tier": {
            "value": "enterprise",
            "confidence": "confirmed",
            "source": "user_turn_2",
        },
        "affected_feature": {
            "value": "dashboard",
            "confidence": "confirmed",
            "source": "user_turn_2",
        },
    },
    "open_questions": [],
    "workflow_phase": "discovery",
    "ready_for_handoff": False,
}

On turn 9, the agent checks state["facts"]["plan_tier"] before asking another plan-tier question. The field already says enterprise, with confidence = confirmed, so the agent moves on. On turn 12, the handoff writer reads the same field and emits plan_tier: enterprise instead of plan_tier: unknown.
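The turn-9 check can be sketched as a small helper. The function name needs_question is hypothetical, and the state shape follows the object above.

```python
def needs_question(state: dict, field_name: str) -> bool:
    """Return True if the agent still needs to ask about this field."""
    fact = state["facts"].get(field_name)
    # Ask when the field is missing or not yet confirmed.
    return fact is None or fact["confidence"] != "confirmed"


state = {
    "facts": {
        "plan_tier": {
            "value": "enterprise",
            "confidence": "confirmed",
            "source": "user_turn_2",
        },
    },
}

print(needs_question(state, "plan_tier"))
print(needs_question(state, "root_cause"))
```

With plan_tier confirmed, the helper returns False and the agent skips the redundant question. A field that is missing from state still triggers one.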

Store information the system needs later in named fields, and update those fields when new evidence arrives. The schema depends on the agent, but the rule stays the same: operational facts should not live only inside raw text.

State should stay focused. It does not need every utterance, retrieved passage, or intermediate model output. Those belong in raw history or trace. State should contain the information that changes future behavior: filled values, confidence, unresolved questions, workflow phase, proposed actions, and verification status.

State needs belief status

A weak state object stores only values:

"root_cause": "migration"

That field alone does not tell the system how safely it can rely on the value. A better state object stores belief status alongside the value:

"root_cause": {
    "value": "migration",
    "confidence": "uncertain",
    "source": "user_turn_5",
}

The confidence label tells the harness whether to use, verify, qualify, or block the field. A guessed root cause and a confirmed root cause should not drive the same behavior.

Five labels cover many production cases:

Label	Meaning	Harness behavior
`confirmed`	Safe to rely on	Use it, summarize it, or act on it if other gates pass
`uncertain`	Plausible, but not safe yet	Ask, verify, or avoid treating it as fact
`needs_verification`	Requires a specific check	Run a lookup, validator, or read-after-write step
`stale`	Was once true but may no longer be true	Refresh before relying on it
`contradicted`	Conflicting evidence exists	Preserve both sides and resolve before acting

Contradictions need to remain visible. If the user first says, “We do not have internal logs for this system,” and later says, “The application logs show the migration completed successfully,” the updater should preserve both statements, mark the relevant field as contradicted or needs_verification, and leave the next step with a clear ambiguity to resolve.
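An updater that keeps contradictions visible might look like this. It is a sketch under assumptions: the conflicting list, the has_internal_logs field name, and the turn numbers in the example are hypothetical.

```python
def update_fact(facts: dict, name: str, value, source: str, confidence: str) -> None:
    """Write a fact into state without silently overwriting conflicting evidence."""
    existing = facts.get(name)
    # New field, or the same value restated: store the latest belief status.
    if existing is None or existing["value"] == value:
        facts[name] = {"value": value, "confidence": confidence, "source": source}
        return
    # Conflict: keep every side visible and flag the field for resolution.
    if existing["confidence"] == "contradicted":
        history = existing["conflicting"]
    else:
        history = [{"value": existing["value"], "source": existing["source"]}]
    history.append({"value": value, "source": source})
    facts[name] = {"value": None, "confidence": "contradicted", "conflicting": history}


facts = {}
update_fact(facts, "has_internal_logs", False, "user_turn_3", "confirmed")
update_fact(facts, "has_internal_logs", True, "user_turn_7", "confirmed")
print(facts["has_internal_logs"])
```

Setting value to None while the field is contradicted prevents a later step from acting on either side by accident.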

State should preserve ambiguity in a form the next decision can see.

Raw history, state, and trace serve different jobs

Raw history, state, and trace overlap, but each one has a different job.

Artifact	Job	Intake example
Raw history	Preserves the original material	The user said, “We’re on the enterprise plan…”
State	Stores the current working memory	`plan_tier = enterprise`, `confidence = confirmed`
Trace	Records what happened during the run	Turn 2 updated `plan_tier`; turn 9 skipped a redundant question; turn 12 produced the handoff

Raw history preserves nuance, tone, and provenance. Trace explains how the system behaved. State guides the next decision. If a fact affects what the system asks, writes, summarizes, verifies, or hands off, it belongs in state.

Different agents need different state schemas

State should match the decisions the agent has to make.

Agent type	State needs to record
Intake agent	Confirmed facts, uncertain facts, open questions, handoff readiness
Scheduling assistant	Candidate events, selected target, proposed change, approval, verification status
Docs Q&A agent	Retrieved refs, grounded claims, evidence mapping, validation status

A scheduling assistant must know whether it has selected the right calendar event before it can reschedule anything. A docs Q&A agent must know whether each claim is supported by retrieved evidence before it can answer. An intake agent must know whether it has enough confirmed information to produce a useful handoff.

State holds the information the next decision needs.

State should answer four questions

A useful state object answers four questions at each step:

What do we currently believe?
How sure are we?
What remains unclear?
What stage of the workflow are we in?

These questions expose whether state is usable. Without those fields, the model has to reconstruct the situation from raw history. It may treat guesses as facts, skip required questions, or advance the workflow too early.

State gives the model and runtime stable fields to read. The model uses state to decide what to say next. The runtime uses state to decide what is allowed next.

State changes as the workflow runs

State is maintained throughout the workflow. Each meaningful step reads from it, updates it, and leaves the system in a clearer position than before.

A typical loop looks like this:

Understand. Read the current state and the new input. Update facts when the input is clear. Mark uncertainty when it is not.
Decide. Choose the next action: ask a clarifying question, call a tool, draft a summary, produce a handoff, or refuse.
Execute. Take the action. If a tool is called, capture its inputs and outputs. If a write happens, record the attempt.
Verify. Check whether the action did what it was supposed to do. Update state with what is now known.

Figure 1: The four-step loop operates on state. Every Understand step reads from state; every Verify step writes back to it. State is the system’s through-line across steps, whether the steps are conversation turns, workflow phases, or pipeline stages.

The loop appears in different forms across agents. A scheduling assistant may run it across target selection, approval, execution, and verification. The names change, but the pattern stays the same: read state, choose the next move, execute it, and update state based on what happened.

Verification keeps state honest. After a write, the system should check the external source of truth before treating state as updated. For example, after rescheduling a meeting, it should confirm that the calendar shows the new time.
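Read-after-write can be sketched as a state update driven by the calendar's readback, not by the tool's return value. The read_event callable and the target_event_id and verification_status field names are assumptions for this example.

```python
def record_write_result(state: dict, expected_start: str, read_event) -> dict:
    """Update state from the calendar's readback, not from the tool's return value.

    `read_event(event_id)` is an assumed interface returning the event as the
    calendar currently stores it, e.g. {"start": "..."}, or None if absent.
    """
    event = read_event(state["target_event_id"])
    if event is not None and event.get("start") == expected_start:
        state["verification_status"] = "confirmed"
    else:
        # Do not assume the write landed; flag it for a specific follow-up check.
        state["verification_status"] = "needs_verification"
    return state


state = {"target_event_id": "evt_8819", "verification_status": "pending"}
calendar = {"evt_8819": {"start": "2026-05-13T10:00:00-04:00"}}

record_write_result(state, "2026-05-13T10:00:00-04:00", calendar.get)
print(state["verification_status"])
```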

Gates read state

Gates enforce policies by inspecting state. A handoff gate can block missing fields, uncertain facts, or unresolved questions only if state records them explicitly.

if action == "produce_handoff":
    assert state["facts"]["plan_tier"]["confidence"] == "confirmed"
    assert state["facts"]["affected_feature"]["confidence"] == "confirmed"
    assert state["ready_for_handoff"] is True

Prompts suggest behavior. Runtime checks enforce it.
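The same gate can report why a handoff is blocked, so the agent knows what to ask next. The helper name handoff_blockers and the blocker string format are hypothetical. The required fields are the ones named earlier for handoff readiness.

```python
REQUIRED_FIELDS = ["plan_tier", "affected_feature", "confirmed_root_cause"]


def handoff_blockers(state: dict, required=REQUIRED_FIELDS) -> list:
    """List the reasons a handoff cannot be produced yet (empty list means ready)."""
    blockers = []
    for name in required:
        fact = state["facts"].get(name)
        if fact is None:
            blockers.append(f"missing:{name}")
        elif fact["confidence"] != "confirmed":
            blockers.append(f"{fact['confidence']}:{name}")
    # Any unresolved question also blocks the handoff.
    blockers.extend(f"open_question:{q}" for q in state.get("open_questions", []))
    return blockers


state = {
    "facts": {
        "plan_tier": {"value": "enterprise", "confidence": "confirmed",
                      "source": "user_turn_2"},
        "affected_feature": {"value": "dashboard", "confidence": "confirmed",
                             "source": "user_turn_2"},
    },
    "open_questions": [],
}

print(handoff_blockers(state))
```

For the turn-2 state, the only blocker is the missing root cause, which gives the agent a concrete next question.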

Common mistakes

State usually fails in predictable ways:

No operational state: raw history becomes state, and the model has to reconstruct the situation every step.
Too much state: every utterance, passage, draft, and tool output gets persisted, making state as noisy as the transcript.
No confidence labels: "root_cause": "migration" looks settled even if the user only guessed it.
Silent contradiction handling: conflicting evidence gets overwritten instead of staying visible until the system resolves it.
State drift: failed writes, stale retrieved passages, and user corrections do not update the stored belief.

Use a simple test when deciding what belongs in state: will the system behave worse later if this information only lives in raw history or trace? Put it in state only when the answer is yes.

State is the system’s memory

The transcript records what was said. State records what the system can rely on.

The intake agent needed to store two confirmed facts: the user was on the enterprise plan, and the dashboard was the affected feature. Once those facts became state, later steps could use them. The agent could avoid a redundant question, produce a better handoff, and expose unresolved fields before claiming the workflow was ready.

State gives gates, tools, verification steps, and traces concrete fields to read and update. Runtime control reads state before acting. Gates and tool contracts decide what the system is allowed to do with that memory.

The harness is the product

Yee Seng Chan — Thu, 23 Apr 2026 00:00:00 GMT

A stronger model does not automatically give you a reliable agent.

The previous article catalogued six ways agent demos break in production. The intake agent forgot a fact the user gave it on turn 2. The scheduling assistant double-booked a meeting after an ambiguous timeout. The docs Q&A agent gave a confident, well-formed, wrong answer.

The instinct after each failure is to reach for a stronger model. Sometimes that helps a little. More often, the failure returns in a slightly different shape because the system around the model has not changed. The useful reframe is simple:

Agent = model + harness

The model is the reasoning engine. The harness is the system the model lives inside. Models get swapped out, and they get cheaper and faster. The harness carries the team’s accumulated understanding of how to make the agent reliable. For production agents, the harness is the product.

Prompts guide. Harnesses constrain.

A prompt tells the model what behavior you want. A harness decides what behavior the system permits.

The difference is concrete:

Refunds: a prompt can say, “Do not issue refunds.” A harness can make the refund tool unavailable.
Scheduling: a prompt can say, “Ask for clarification if the target meeting is ambiguous.” A harness can reject reschedule_event unless state contains one confirmed event ID.
Docs Q&A: a prompt can say, “Use only the retrieved documents.” A harness can require every final claim to map to a cited passage.
Retries: a prompt can say, “Be careful with retries.” A harness can enforce one write attempt, require an idempotency key, and route ambiguous timeouts to verification.

The prompt shapes one model call. The harness controls the workflow around that call. It makes some failures impossible to commit and makes the rest easier to detect.
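Making a tool unavailable can be as simple as a phase-to-tool table that the runtime consults before exposing or executing a tool. The phase names other than execute and the search_events tool are hypothetical.

```python
# Hypothetical phase-to-tool table for the scheduling assistant.
TOOLS_BY_PHASE = {
    "identify_target": {"search_events"},
    "propose_change": {"search_events"},
    "execute": {"search_events", "reschedule_event"},
}


def tools_for(phase: str) -> set:
    """Return the tools the model may see in this phase; unknown phases get none."""
    return TOOLS_BY_PHASE.get(phase, set())


def is_call_permitted(phase: str, tool_name: str) -> bool:
    """Check a proposed tool call against the current phase."""
    return tool_name in tools_for(phase)


print(is_call_permitted("identify_target", "reschedule_event"))
print(is_call_permitted("execute", "reschedule_event"))
```

A refund tool that appears in no phase cannot be called, whatever the prompt says.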

That is the shift from prompt engineering to harness engineering.

Scope is the first harness decision

Before designing the harness, define the agent’s job. Scope answers three questions:

What exact job does this agent do?
What artifact or outcome should exist when it succeeds?
What is it explicitly not allowed to do?

For the intake agent, a scoped version reads:

A discovery-stage intake agent that conducts an initial conversation with a user reporting a support issue and produces a structured handoff for a human support agent. It may ask clarifying questions, retrieve approved account information, and summarize confirmed facts and open questions. It may not commit to remediation, issue credits or refunds, change account state, or attempt to resolve the issue itself.

The harness has four jobs

Figure 1: The harness has four jobs: decide what the model sees, limit what the model can do, check what actually happened, and keep a record.

Once scope is clear, the harness has four jobs:

Shape the input: decide what the model sees at each step. The model should receive the relevant state, retrieved evidence, available tools, and local instruction, not a pile of every transcript turn and document.
Bound the action: decide what the model is allowed to do. The model can propose an action, but workflow phases, tool contracts, gates, approval packets, and permissions decide whether it runs.
Verify the outcome: check what actually happened. A tool returning success does not prove the world is correct. A rescheduled meeting needs calendar readback. A grounded answer needs claim-to-evidence checks.
Preserve the evidence: record enough information to debug and improve the system. Useful traces include input, state, action, tool inputs and outputs, verification results, state after, and version information.

A reliable agent takes bounded actions, fails safely when it cannot proceed, leaves enough evidence to debug the run, and becomes harder to break after each incident.

Prompt, context, and harness engineering

Prompt, context, and harness engineering solve different problems.

Figure 2: Three nested levels of engineering. Each level encompasses the previous. A good harness contains good context engineering, which contains good prompts.

Prompt engineering: writes the instructions for one model call: wording, structure, examples, and output format.
Context engineering: decides what enters the model context at each step: system instructions, relevant state, retrieved passages, memory, available tools, and local task framing.
Harness engineering: controls the application around the model: when context is assembled, which tools are available, which actions are allowed, how writes are verified, how state persists, how failures recover, how traces are written, and how regression cases catch old failures.

Placing the failure at the right level avoids wasted iteration. A bad instruction may need a prompt change. Missing evidence may need context changes. Unsafe actions, duplicate writes, missing state, weak verification, and thin traces need harness changes.

This is not a call for overengineering

A harness matters, but teams can build too much harness too early. The minimum useful harness depends on the agent’s scope.

Intake agent: clear scope, structured state with confidence labels, a few workflow phases, no write tools, a basic handoff artifact, step traces, and a small regression set.
Scheduling assistant: target identification, narrow write tools with idempotency, one-attempt write policy, read-after-write verification, and traces of tool inputs and outputs.
Docs Q&A agent: retrieval provenance, claim-to-evidence mapping, a “no supported answer” behavior for thin evidence, auth-aware retrieval, and traceable citations.

Build the harness around the risks created by the job. An intake agent without write authority does not need the same side-effect controls as a scheduling assistant. A docs Q&A agent does need evidence mapping because its main risk is confident unsupported synthesis.

Which parts of the harness age?

Some harness pieces get lighter as models improve. Tool-selection scaffolding matters less when models choose tools reliably. Format validators matter less when structured output becomes dependable. Long prompt chains matter less when models handle more reasoning in one call. These pieces compensate for model limitations, so stronger models reduce their importance.

Other harness pieces remain necessary because they control product risk. A smarter model still cannot unsend a calendar invite, undo a payment, or reconstruct last month’s behavior without traces. It still needs runtime boundaries around tools, writes, permissions, and user-visible actions.

The pieces that age are scaffolding around weak model behavior. The pieces that stay are tied to reliability: state, tool contracts, gates, verification, recovery, traces, and tests.

Why AI agent demos break in production

Yee Seng Chan — Sat, 18 Apr 2026 00:00:00 GMT

Part of a series

The agent harness

Most AI agent demos fail in boring ways.

The agent forgets something the user said five turns ago. It treats a guess as a fact. It calls the wrong tool because the tool name sounded close enough. It returns a polished answer only loosely connected to the evidence. It retries a calendar write after a timeout and books the same meeting twice. Then, when someone asks what happened, the logs say:

tool_call: retrieval, status: success

Reliability becomes concrete after something breaks. The team needs to know why the agent believed a claim, why it called a tool, why it skipped a clarifying question, why it retried a write, and why the run cannot be reconstructed.

The usual instinct is to blame the model. A stronger model may help, but many agent failures come from the system around the model: state, workflow, tool contracts, gates, retry policy, retrieval design, traces, and evaluation.

The diagnostic question is simple: what failed, and where should control have lived?

That question starts harness engineering. This series is about building agent systems that survive real users, real tools, real ambiguity, and real debugging.

Three small agents

This series uses three running examples.

Customer support intake agent: holds a conversation with a user, gathers the issue, identifies what is known and uncertain, and produces a structured handoff for a human. It does not solve the issue or write to external systems.
Scheduling assistant: books, reschedules, and cancels meetings. It uses tools that change the world.
Internal docs Q&A agent: answers questions over company documentation. It retrieves passages and synthesizes a short response.

Six failure shapes show up across these agents.

Failure 1: The agent bluffs when it should pause

A user opens the intake conversation: “We’re having issues with the new dashboard, and I think it’s related to the migration we did last month, but I’m not totally sure.”

A weak agent starts triaging the migration. It asks about migration steps, surfaces likely root causes, and drafts a handoff that pins the issue to the migration.

The user never confirmed the migration as the cause. The agent treated a hedge as a fact.

A reliable agent should record the uncertainty and ask a clarifying question before building on the claim: “Has the migration been confirmed as the cause, or is that still a guess?”

The system needs a field for uncertainty:

state["root_cause"] = {
    "value": "migration",
    "confidence": "uncertain",
    "source": "user_hedged_statement",
}

It also needs behavior tied to that label. If the next step depends on a fact labeled uncertain, the workflow should route to clarification instead of action.
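One way to tie behavior to the label is a small routing check that runs before the next step. This is a sketch under assumptions: `next_step`, `required_facts`, and the route names are hypothetical, and the state shape follows the example above.

```python
def next_step(state: dict, required_facts: list[str]) -> dict:
    """Route to clarification when a required fact is missing or not confirmed."""
    for name in required_facts:
        fact = state.get(name)
        if fact is None or fact.get("confidence") != "confirmed":
            return {"route": "clarify", "about": name}
    return {"route": "proceed"}


state = {
    "root_cause": {
        "value": "migration",
        "confidence": "uncertain",
        "source": "user_hedged_statement",
    }
}

# Triage depends on the root cause, so the hedge forces a question first.
decision = next_step(state, required_facts=["root_cause"])
print(decision)  # {'route': 'clarify', 'about': 'root_cause'}
```

The routing lives in code, so the model cannot skip the clarifying question by sounding confident.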

Visible failure: the agent jumped to conclusions.
Deeper failure: the system had no mechanism for treating uncertainty differently from certainty.

Failure 2: The agent forgets what already happened

In the same intake conversation, the user says on turn 2: “We’re on the enterprise plan, and the dashboard is the only feature we use.” On turn 9, the agent asks: “Just to confirm, are you on the standard or enterprise plan?” On turn 12, the handoff record lists the plan as unknown.

The system failed because it never stored the plan tier in state. It stuffed the transcript back into context every turn and relied on the model to keep the right facts active.

The fix is to store confirmed facts once:

state["customer"] = {
    "plan_tier": {
        "value": "enterprise",
        "confidence": "confirmed",
        "source_turn": 2,
    },
    "primary_feature": {
        "value": "dashboard",
        "confidence": "confirmed",
        "source_turn": 2,
    },
}

Now the agent reads plan_tier = enterprise instead of rediscovering it from the transcript. The transcript records what the user said. State records what the system can rely on. The article “State, not transcript, is agent memory” goes deeper into that distinction.
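A minimal sketch of that read path, assuming two hypothetical helpers: `record_fact` writes a confirmed fact once, and `needs_question` lets the harness check state before the agent asks again. The `billing_email` field is invented for the example.

```python
def record_fact(state: dict, section: str, name: str, value: str, source_turn: int) -> None:
    """Store a confirmed fact once, with the turn it came from."""
    state.setdefault(section, {})[name] = {
        "value": value,
        "confidence": "confirmed",
        "source_turn": source_turn,
    }


def needs_question(state: dict, section: str, name: str) -> bool:
    """Ask only when the fact is missing or not confirmed."""
    fact = state.get(section, {}).get(name)
    return fact is None or fact["confidence"] != "confirmed"


state: dict = {}
record_fact(state, "customer", "plan_tier", "enterprise", source_turn=2)

# On turn 9 the harness consults state instead of the transcript.
print(needs_question(state, "customer", "plan_tier"))      # False
print(needs_question(state, "customer", "billing_email"))  # True
```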

Visible failure: the model forgot.
Deeper failure: the system treated the conversation transcript as memory.

Failure 3: The agent takes the wrong action

The scheduling assistant has these tools:

list_calendar_events()
find_event_by_description()
reschedule_event()
cancel_event()

The user says: “Move my Tuesday review with Priya to Thursday.”

A weak agent calls cancel_event and creates a fresh booking. Or it calls reschedule_event with a fuzzy description, and the API matches the wrong event.

The safe sequence is explicit: find the event, disambiguate if multiple events match, then reschedule a specific event ID. The model should not invent that sequence from scratch every time.

The harness should make the wrong action hard to take. It can require an identified event ID before reschedule_event, require user confirmation when multiple events match, and block destructive actions unless the workflow phase allows them.
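Those three rules could be expressed as a gate that runs before any tool call. The sketch below is illustrative only: `identify_event`, `check_tool_call`, the phase name `cancel_confirmed`, the reason strings, and the second event ID are assumptions, not part of any real calendar API.

```python
from typing import Optional

DESTRUCTIVE_TOOLS = {"cancel_event"}


def identify_event(matches: list[str]) -> Optional[str]:
    """Return an event ID only when exactly one event matches.

    With zero or several matches the harness has no identified event,
    so the workflow must ask the user before any write.
    """
    return matches[0] if len(matches) == 1 else None


def check_tool_call(
    tool: str, args: dict, phase: str, identified_event_id: Optional[str]
) -> dict:
    """Decide whether a proposed tool call may run."""
    if tool in DESTRUCTIVE_TOOLS and phase != "cancel_confirmed":
        return {"allow": False, "reason": "destructive_tool_blocked_in_phase"}
    if tool == "reschedule_event":
        if identified_event_id is None:
            return {"allow": False, "reason": "no_identified_event"}
        if args.get("event_id") != identified_event_id:
            return {"allow": False, "reason": "event_id_mismatch"}
    return {"allow": True, "reason": "ok"}


# Two Tuesday events match "review with Priya", so nothing is identified yet.
matches = ["evt_482", "evt_517"]
event_id = identify_event(matches)
print(check_tool_call("reschedule_event", {"event_id": "evt_482"}, "reschedule", event_id))
# {'allow': False, 'reason': 'no_identified_event'}
```

Once the user picks one event, the harness sets the identified ID and the same call passes the gate.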

Visible failure: the agent picked the wrong tool.
Deeper failure: the system relied on the model to invent a safe tool sequence instead of encoding the sequence in the harness.

Failure 4: The agent sounds grounded, but it is not

The docs Q&A agent receives this question: “What is our policy on customer data retention for trial accounts?”

It retrieves five passages and returns a confident answer saying 30 days. The corpus says 90 days. The right passage was in the retrieved set, but synthesis leaned on the wrong evidence.

A reliable retrieval agent should track supported claims explicitly. The system needs to know which claim came from which passage, and whether that passage addresses the question being asked.

supported_claims = [
    {
        "claim": "Trial account data is retained for 90 days after trial expiry.",
        "evidence_refs": ["policy_42:trial_accounts"],
        "confidence": "confirmed",
    }
]

Synthesis should write from supported claims, not from a raw pile of passages.
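A rough sketch of that constraint follows. In a real system a model would write the final prose; here `synthesize` is a stub that joins claims so the filtering logic is visible. The function names, the fallback message, and the `policy_42:standard_accounts` reference are hypothetical.

```python
NO_SUPPORTED_ANSWER = "I could not find a supported answer in the documentation."


def usable_claims(supported_claims: list[dict], retrieved_refs: set[str]) -> list[dict]:
    """Keep confirmed claims whose evidence all appears in the retrieved set."""
    return [
        claim
        for claim in supported_claims
        if claim["confidence"] == "confirmed"
        and claim["evidence_refs"]
        and all(ref in retrieved_refs for ref in claim["evidence_refs"])
    ]


def synthesize(supported_claims: list[dict], retrieved_refs: set[str]) -> str:
    """Stub synthesis: write only from usable claims, with their references."""
    claims = usable_claims(supported_claims, retrieved_refs)
    if not claims:
        return NO_SUPPORTED_ANSWER
    parts = []
    for claim in claims:
        refs = ", ".join(claim["evidence_refs"])
        parts.append(f"{claim['claim']} [{refs}]")
    return " ".join(parts)


supported_claims = [
    {
        "claim": "Trial account data is retained for 90 days after trial expiry.",
        "evidence_refs": ["policy_42:trial_accounts"],
        "confidence": "confirmed",
    }
]
retrieved_refs = {"policy_42:trial_accounts", "policy_42:standard_accounts"}

print(synthesize(supported_claims, retrieved_refs))
# Trial account data is retained for 90 days after trial expiry. [policy_42:trial_accounts]
print(synthesize([], retrieved_refs))
# I could not find a supported answer in the documentation.
```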

Visible failure: the agent gave a wrong answer.
Deeper failure: the system never tracked which claims were supported by which evidence.

Groundedness is a system property. Polished prose does not make an answer grounded.

Failure 5: The agent repeats a side effect

The scheduling assistant calls:

reschedule_event(
    event_id="evt_482",
    new_time="2026-05-12T14:00",
)

The API takes thirty seconds and returns a timeout. The agent cannot tell whether the write succeeded, so it retries. Now Priya receives two confusing calendar updates.

Writes need different retry semantics from reads. Every write should carry an idempotency key:

payload = {
    "event_id": "evt_482",
    "new_time": "2026-05-12T14:00",
    "idempotency_key": "run_847_step_3",
}

After an ambiguous timeout, the agent should check the calendar before retrying. Verify first, retry second.
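The verify-then-retry order can be sketched against an in-memory stand-in for the calendar. Everything here is hypothetical: `FakeCalendar` simulates an API that applies the write and then times out, and `reschedule_with_readback` and `get_event_time` are invented names, not real calendar calls.

```python
class FakeCalendar:
    """In-memory stand-in for a calendar API that times out after applying a write."""

    def __init__(self) -> None:
        self.events = {"evt_482": "2026-05-11T14:00"}
        self.write_calls = 0

    def reschedule_event(self, event_id: str, new_time: str, idempotency_key: str) -> None:
        self.write_calls += 1
        self.events[event_id] = new_time  # The write lands...
        raise TimeoutError("no response within deadline")  # ...but the caller never hears back.

    def get_event_time(self, event_id: str) -> str:
        return self.events[event_id]


def reschedule_with_readback(calendar: FakeCalendar, payload: dict, max_attempts: int = 2) -> dict:
    for attempt in range(1, max_attempts + 1):
        try:
            calendar.reschedule_event(**payload)
        except TimeoutError:
            pass  # Outcome unknown; do not retry blindly.
        # Verify first: read the calendar back, even after an apparent success.
        if calendar.get_event_time(payload["event_id"]) == payload["new_time"]:
            return {"status": "verified", "write_attempts": attempt}
        # Retry second: only if readback shows the write did not land,
        # and with the same idempotency key.
    return {"status": "failed", "write_attempts": max_attempts}


calendar = FakeCalendar()
payload = {
    "event_id": "evt_482",
    "new_time": "2026-05-12T14:00",
    "idempotency_key": "run_847_step_3",
}
result = reschedule_with_readback(calendar, payload)
print(result)                # {'status': 'verified', 'write_attempts': 1}
print(calendar.write_calls)  # 1
```

The timeout no longer triggers a second write, because readback shows the first one already landed.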

Visible failure: the meeting update happened twice.
Deeper failure: the system treated reads and writes as equivalent tool calls.

Failure 6: Nobody can debug it

A user reports that the docs Q&A agent gave a wrong answer about retention policy yesterday. The logs contain the user’s question, the agent’s response, and this line:

tool_call: retrieval, status: success

That log cannot localize the failure. It does not show which passages were retrieved, which were ranked highest, what prompt version was deployed, what state the system had, or which answer claims came from which passage.

A useful trace records enough to debug the run: input, relevant state, action chosen, tool inputs and outputs, verification result, state after, prompt version, tool version, and model version. Without that trace, the team cannot tell whether the failure came from retrieval, synthesis, prompting, model behavior, or a recent deployment.
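As a rough sketch, those fields could live in one record per step, serialized as a JSON line. The `TraceStep` class, its field names, and every value below are illustrative assumptions rather than the schema used in this series.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceStep:
    run_id: str
    step_kind: str
    input: dict
    state_before: dict
    action: dict
    tool_inputs: dict
    tool_outputs: dict
    verification: dict
    state_after: dict
    versions: dict = field(default_factory=dict)


step = TraceStep(
    run_id="run_example",
    step_kind="retrieve",
    input={"query": "data retention policy for trial accounts"},
    state_before={"phase": "retrieve"},
    action={"name": "retrieval", "reason": "question needs policy evidence"},
    tool_inputs={"query": "trial account data retention", "top_k": 5},
    tool_outputs={"passage_ids": ["policy_42:trial_accounts"]},
    verification={"check": "passages_returned", "passed": True},
    state_after={"phase": "synthesize"},
    versions={"prompt": "example_v1", "tool": "example_v1", "model": "example_model"},
)

# One JSON line per step keeps traces easy to append, search, and diff.
line = json.dumps(asdict(step), sort_keys=True)
print(line)
```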

Visible failure: the team cannot debug the answer.
Deeper failure: the system logged steps instead of recording decisions.

The article “Traces are how agents get better” goes deeper into useful traces.

The pattern behind the failures

Each failure asks the model to carry control that should live in the harness.

The harness decides what the model sees, what actions are allowed, what gets verified, and what is preserved for debugging and improvement. The model still matters, but reliability depends on the control surfaces around it.

Figure 1: Six failure shapes mapped to the control surfaces that should have caught them. The model is not the right place to fix any of these. The system around the model is.

The control surfaces are concrete:

State: stores facts, uncertainty, workflow phase, and pending actions.
Workflow logic: determines which step can happen next.
Tool contracts: define safe inputs, outputs, side effects, and retry behavior.
Gates and validators: block unsafe or unsupported actions.
Retrieval and evidence mapping: connect answers to supporting sources.
Traces: preserve enough information to debug and harden the system.
Regression tests: keep known failures from returning.

The next four articles go deeper on the most important parts:

The harness is the product: why an agent is the model plus the harness around it.
State, not transcript, is agent memory: how state gives the system memory it can read and update.
Prompts guide. Gates enforce: how gates and tools limit actions, execute them safely, and verify outcomes.
Traces are how agents get better: how traces support debugging, regression tests, and the hardening loop.

Common mistakes

Teams usually get agent reliability wrong in predictable ways:

Model-only diagnosis: when the agent fails, the team swaps in a stronger model instead of fixing the missing control surface.
Demo-as-evidence thinking: a demo shows that a path can work, not that the system survives messy users, ambiguous inputs, tool errors, or repeated runs.
Prompt-only boundaries: a system prompt that says “do not do XYZ” is guidance, not enforcement. Real boundaries live in gates, validators, permissions, and tool contracts.
Step logging: logs that say what happened do not explain why it happened. A useful trace records state, decisions, tool inputs, tool outputs, verification results, and version information.
Random-failure framing: most failures are recurring shapes. Once the category is visible, the team can engineer against it.

Reliable agents do not come from prompts alone. They come from moving the right responsibilities into the harness: state, workflow, gates, tools, verification, traces, and tests.