Monitoring Your Agent

Benchmarks

Benchmarks are reusable test sets for an agent. Each benchmark contains inputs and desired outputs so you can check whether behavior still matches expectations after changing prompts, tools, models, or workflows.

What benchmarks are for

Regression testing: Catch behavior changes before they reach users.
Prompt iteration: Compare changes against the same set of representative tasks.
Model comparison: See how different models handle the same inputs.
Edge cases: Preserve examples the agent has mishandled before.

Benchmark examples
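
As a rough illustration of the idea that a benchmark pairs inputs with desired outputs, here is a minimal Python sketch. It is not this product's API: `BenchmarkCase`, `run_benchmark`, and `toy_agent` are hypothetical names invented for the example, and the exact-match comparison is an assumption (a real setup might use looser or more specialized checks).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchmarkCase:
    """One benchmark entry: an input and the output we want for it (hypothetical type)."""
    name: str
    input_text: str
    desired_output: str


def run_benchmark(
    agent: Callable[[str], str], cases: list[BenchmarkCase]
) -> dict[str, bool]:
    """Run every case through the agent and record whether the output matched."""
    results: dict[str, bool] = {}
    for case in cases:
        actual = agent(case.input_text)
        # Assumption: exact match after trimming surrounding whitespace.
        results[case.name] = actual.strip() == case.desired_output.strip()
    return results


def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent: answers from a fixed lookup table."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")


CASES = [
    BenchmarkCase("arithmetic", "What is 2 + 2?", "4"),
    BenchmarkCase("geography", "Capital of France?", "Paris"),
    # An edge case the agent is known to mishandle, kept so a fix can be verified later.
    BenchmarkCase("edge_case_empty_input", "", "Please provide a question."),
]

results = run_benchmark(toy_agent, CASES)
failed = [name for name, ok in results.items() if not ok]
print(f"{len(CASES) - len(failed)}/{len(CASES)} passed; failing: {failed}")
# prints: 2/3 passed; failing: ['edge_case_empty_input']
```

Because the cases are plain data, the same `CASES` list can be rerun unchanged after a prompt, tool, or model change, which is what makes the before-and-after comparison meaningful.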

Best practices