The Experimentation Gap

Piper Cyterski
April 14, 2026

There is a structural difference between evaluating a model and evaluating an agent, and most of the tooling teams rely on was built for the first case.

Evaluating a model is, at bottom, scoring an input-output mapping: given a prompt, was the response good? The unit of analysis is a single turn, and it can be assessed in isolation. Evaluating an agent is a different problem. An agent acts over many steps -- it calls tools, observes what comes back, and chooses its next action accordingly. Its behavior is not a property of any single output. It emerges from a sequence of interactions with an environment that responds to what the agent does.

That distinction carries a consequence that is easy to overlook: an agent's behavior is only defined relative to an environment. Ask what an agent does when a payment fails on the third attempt, and the question has no answer without a world that can fail on the third attempt. The behavior worth measuring exists only in the interaction.
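To make that concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen for illustration: `PaymentEnv`, its `charge` method, and the two toy agents are assumptions, not any real payment API or agent framework. The environment fails on exactly the third call, and two agents that would look identical on any single turn turn out to behave differently once that failure exists:

```python
from dataclasses import dataclass, field


@dataclass
class PaymentEnv:
    """Hypothetical environment: the call numbered `fail_on` fails, all others succeed."""
    fail_on: int = 3
    attempts: int = 0
    log: list = field(default_factory=list)

    def charge(self, amount: float) -> dict:
        # State persists across calls: the environment counts every attempt.
        self.attempts += 1
        ok = self.attempts != self.fail_on
        result = {"attempt": self.attempts, "amount": amount, "ok": ok}
        self.log.append(result)
        return result


def naive_agent(env, charges):
    """Illustrative agent: stops at the first failed charge."""
    for amount in charges:
        if not env.charge(amount)["ok"]:
            return "gave_up"
    return "done"


def retrying_agent(env, charges):
    """Illustrative agent: retries a failed charge once before giving up."""
    for amount in charges:
        if not env.charge(amount)["ok"]:
            if not env.charge(amount)["ok"]:
                return "gave_up"
    return "done"


charges = [10.0, 20.0, 30.0, 40.0]
naive_env, retry_env = PaymentEnv(), PaymentEnv()
naive_outcome = naive_agent(naive_env, charges)
retry_outcome = retrying_agent(retry_env, charges)
print(naive_outcome, naive_env.attempts)
print(retry_outcome, retry_env.attempts)
```

Neither agent's behavior at the third attempt can be read off its code alone or off any single output; it shows up only in `env.log`, the record of the interaction.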

What current tools measure

Two families of tools dominate, and each is well suited to the problem it was designed for.

The first is trace evaluation: instrument the deployed system, collect execution traces, and score them after the fact. As observational measurement this is sound -- it characterizes how the system behaves in the field. But it is post hoc by construction. It describes what has already happened; it cannot tell you how a change will behave before that change reaches users.

The second is offline evaluation against a fixed dataset: assemble representative inputs, run the agent on them before deploying, and score the results. The instinct is right, and the tooling has grown more capable -- some platforms now script multi-turn exchanges, and research benchmarks go further, running agents against simulated environments. But the gap is structural rather than incidental. A scripted exchange follows its script regardless of what the agent does, and where reactive environments do exist, they live in fixed benchmarks -- not as infrastructure a team can aim at its own agent, with the scenarios and failures it specifically cares about. The condition under which agentic behavior is defined -- a state that changes in response to the agent's actions -- is exactly what's missing from the workflow most teams actually have.

Neither family is deficient at its own task. The gap is that the behavior we most need to understand -- how an agent acts when the situation develops against it -- falls outside the range of either.

What evaluating behavior requires

If behavior is defined by interaction, then measuring it means reproducing the interaction under controlled conditions. A few requirements follow directly from that premise.

  • A stateful environment. The environment must carry state across the trajectory, so that the consequences of early actions are visible in later ones -- the tenth tool call should reflect the first nine.
  • Controlled perturbation. The failures of interest -- a timeout, an upstream error, a stale read, an inconsistent user -- should be introduced deliberately, at chosen points, so that recovery behavior can be observed rather than waited for.
  • Adequate sampling. A handful of scenarios is anecdote. Characterizing behavior across an operating range requires many scenarios, which is feasible only if the process runs end to end without manual intervention.
  • Controlled comparison. To attribute a change in behavior to a change in the agent, the environment must be held fixed across versions: the same scenarios, the same perturbations, the same grading. Otherwise version differences are confounded with scenario differences, and the comparison licenses no conclusion.
  • A graded trajectory. The object of evaluation is the full sequence of actions, scored on consistent criteria -- not the final answer in isolation. The trajectory is where both the failure and its diagnosis reside.
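
The five requirements above can be sketched together in a few dozen lines of Python. This is an illustration under stated assumptions, not a reference design: the inventory task, the names `Scenario`, `InventoryEnv`, `agent_v1`, `agent_v2`, and the grading fields are all hypothetical, and a timeout is assumed to mean the reservation was not applied.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Scenario:
    name: str
    target: int            # units the agent is asked to reserve
    timeout_at: frozenset  # 1-based call numbers that will time out


def make_scenarios(n, seed=0):
    """Adequate sampling: generate many scenarios; the seed makes them repeatable."""
    rng = random.Random(seed)
    scenarios = []
    for i in range(n):
        target = rng.randint(2, 5)
        k = rng.randint(0, 2)
        timeout_at = frozenset(rng.sample(range(1, target + 1), k))
        scenarios.append(Scenario(f"s{i}", target, timeout_at))
    return scenarios


@dataclass
class InventoryEnv:
    """Stateful environment: reservations accumulate across tool calls."""
    scenario: Scenario
    reserved: int = 0
    calls: int = 0
    trajectory: list = field(default_factory=list)

    def reserve(self, qty):
        self.calls += 1
        if self.calls in self.scenario.timeout_at:  # controlled perturbation
            obs = {"status": "timeout"}
        else:
            self.reserved += qty
            obs = {"status": "ok", "reserved": self.reserved}
        self.trajectory.append({"call": self.calls, "qty": qty, "status": obs["status"]})
        return obs


def agent_v1(env):
    """Toy agent: assumes every call succeeds."""
    for _ in range(env.scenario.target):
        env.reserve(1)


def agent_v2(env, max_calls=20):
    """Toy agent: counts only confirmed reservations and retries after a timeout."""
    confirmed = 0
    while confirmed < env.scenario.target and env.calls < max_calls:
        obs = env.reserve(1)
        if obs["status"] == "ok":
            confirmed = obs["reserved"]


def grade(env):
    """Graded trajectory: score the whole run, not just the final state."""
    return {
        "scenario": env.scenario.name,
        "goal_met": env.reserved == env.scenario.target,
        "calls": len(env.trajectory),
        "timeouts": sum(t["status"] == "timeout" for t in env.trajectory),
    }


def run_suite(agent, scenarios):
    grades = []
    for s in scenarios:
        env = InventoryEnv(s)  # fresh state per run, identical scenario per version
        agent(env)
        grades.append(grade(env))
    return grades


def pass_rate(grades):
    return sum(g["goal_met"] for g in grades) / len(grades)


# Controlled comparison: both versions face the same scenarios and the same grader.
scenarios = make_scenarios(200, seed=7)
rate_v1 = pass_rate(run_suite(agent_v1, scenarios))
rate_v2 = pass_rate(run_suite(agent_v2, scenarios))
print(f"v1 pass rate: {rate_v1:.2f}  v2 pass rate: {rate_v2:.2f}")
```

Because the scenarios come from a fixed seed and every run starts from a fresh environment, any difference between the two pass rates can be attributed to the agents rather than to the cases they happened to see. The per-call `trajectory` is what makes a failure diagnosable: for `agent_v1`, each missed goal lines up with a timeout it never noticed.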

Closing the gap

None of this is unfamiliar; it is close to how any experimental discipline isolates a variable -- hold the conditions constant, introduce one change, observe the effect. What was missing was not the method but the apparatus: until the environment itself could respond, there was no way to run the experiment on a system that makes its own decisions.

The teams beginning to close this gap study their agents' failure modes under controlled, repeatable conditions before deployment, rather than reconstructing them from production traces after the fact. The shift is the same one most engineering disciplines eventually make -- from observing outcomes to designing the experiments that explain them.