Monitoring Your Agent
Benchmarks
Benchmarks are reusable test sets for an agent. Each benchmark contains inputs and desired outputs so you can check whether behavior still matches expectations after changing prompts, tools, models, or workflows.
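As a rough illustration of the idea, a benchmark can be modeled as a list of rows, each pairing an input with a desired output, plus a small runner that checks the agent against every row. This is a minimal sketch, not the product's actual format: `BenchmarkRow`, `run_benchmark`, and `toy_agent` are hypothetical names, and the exact-match comparison is an assumption (real checks are often looser).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchmarkRow:
    """One test case: an input and the output we want from the agent."""
    row_id: str
    input_text: str
    expected: str


def run_benchmark(agent: Callable[[str], str],
                  rows: list[BenchmarkRow]) -> dict[str, bool]:
    """Return pass/fail per row, comparing case-insensitively after trimming."""
    results = {}
    for row in rows:
        actual = agent(row.input_text)
        results[row.row_id] = (
            actual.strip().lower() == row.expected.strip().lower()
        )
    return results


def toy_agent(text: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    return "escalate" if "refund" in text.lower() else "resolve"


rows = [
    BenchmarkRow("r1", "Customer asks how to reset a password", "resolve"),
    BenchmarkRow("r2", "Customer demands a refund for a duplicate charge", "escalate"),
]
results = run_benchmark(toy_agent, rows)
```

Because the rows are plain data, the same list can be reused unchanged after any prompt, tool, model, or workflow change.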
What benchmarks are for
- Regression testing: Catch behavior changes before they reach users.
- Prompt iteration: Compare changes against the same set of representative tasks.
- Model comparison: See how different models handle the same inputs.
- Edge cases: Preserve examples the agent has mishandled before.
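The model-comparison use case above can be sketched as running several candidates over the same rows and reporting a pass rate for each. Everything here is illustrative: `pass_rate`, `compare_agents`, `model_a`, and `model_b` are hypothetical names, the two "models" are trivial stand-ins, and exact string equality is an assumed scoring rule.

```python
from typing import Callable

Agent = Callable[[str], str]
Row = tuple[str, str]  # (input, desired output)


def pass_rate(agent: Agent, rows: list[Row]) -> float:
    """Fraction of rows where the agent's output equals the desired output."""
    if not rows:
        raise ValueError("benchmark has no rows")
    passed = sum(1 for text, expected in rows if agent(text) == expected)
    return passed / len(rows)


def compare_agents(agents: dict[str, Agent], rows: list[Row]) -> dict[str, float]:
    """Score every candidate on the same rows so the numbers are comparable."""
    return {name: pass_rate(agent, rows) for name, agent in agents.items()}


def model_a(text: str) -> str:
    """Hypothetical candidate: qualifies a lead only if a budget is mentioned."""
    return "qualified" if "budget" in text else "unqualified"


def model_b(text: str) -> str:
    """Hypothetical candidate: qualifies every lead."""
    return "qualified"


rows = [
    ("has budget, wants demo next week", "qualified"),
    ("student researching for a class", "unqualified"),
]
scores = compare_agents({"model_a": model_a, "model_b": model_b}, rows)
```

Holding the rows fixed is what makes the scores comparable; changing the rows between candidates would mix model differences with test-set differences.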
Benchmark examples
- A support agent benchmark with common customer issues and expected resolutions.
- A sales agent benchmark with messy lead notes and desired qualification summaries.
- A review agent benchmark with edge cases that should be escalated to a human.
Best practices
- Keep rows realistic; synthetic examples often miss the messiness of real inputs.
- Add a row whenever a production task exposes a new failure mode.
- Include both easy cases and cases where the agent should pause, for example to escalate to a human.
- Run the same benchmark before and after major agent changes.
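One way to act on the last practice is to keep per-row pass/fail results from each run and diff them, so a regression shows up as a specific row rather than a shifted average. The sketch below assumes results are stored as a mapping from row ID to pass/fail; `find_changes` and the sample data are hypothetical.

```python
def find_changes(before: dict[str, bool],
                 after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-row results from two runs of the same benchmark."""
    # Only rows present in both runs can be compared fairly.
    shared = before.keys() & after.keys()
    return {
        "regressed": sorted(r for r in shared if before[r] and not after[r]),
        "fixed": sorted(r for r in shared if not before[r] and after[r]),
    }


# Illustrative results: True means the row passed.
before = {"r1": True, "r2": True, "r3": False}
after = {"r1": True, "r2": False, "r3": True}
changes = find_changes(before, after)
```

Here the overall pass count is unchanged (two of three rows in both runs), yet the diff shows that one row regressed and another was fixed, which an aggregate score alone would hide.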