Testing Your First Agent
Testing turns agent building from guesswork into a feedback loop. A good test setup helps you see what the agent understood, where it drifted, and whether each change made the agent more reliable.
- 01 Set up evaluations for the behaviors you want to measure.
- 02 Chat with the agent and create practice tasks from realistic examples.
- 03 Review completed tasks, compare evaluation results, and iterate.
Why testing matters
Agents can sound confident even when they miss context, skip a rule, or invent details. Evaluations give you a lightweight way to inspect every run and grade whether the agent handled it well.
Step 1: Set up your evaluations
Start with evaluations that grade the behaviors you care about. Each evaluation should check one thing, use a readable slug, and describe the standard clearly enough that a teammate would agree with the result.
| Type | Use it for | Example |
|---|---|---|
| Binary | A pass/fail check for a required behavior. | `asks-for-missing-info`: Pass when the agent asks a follow-up question instead of acting when required details are missing. |
| Classification | Sorting a task into one of several categories. | `request-type`: Classify the task as setup, integration, troubleshooting, or billing. |
| Rating (1-5) | Scoring quality on a scale when the answer is not simply right or wrong. | `answer-quality`: Rate how complete, grounded, and useful the final response is. |
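One way to think about the three types is by the shape of the result each one produces. The sketch below is illustrative only: the class names, field names, and the `is_valid` helper are assumptions made for this example, not part of any product API.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical result shapes, one per evaluation type in the table above.

@dataclass(frozen=True)
class BinaryResult:
    slug: str
    passed: bool  # pass/fail


@dataclass(frozen=True)
class ClassificationResult:
    slug: str
    label: str                # the category the grader picked
    allowed: Tuple[str, ...]  # the categories it may pick from


@dataclass(frozen=True)
class RatingResult:
    slug: str
    score: int  # expected to be 1 through 5


EvalResult = Union[BinaryResult, ClassificationResult, RatingResult]

# Categories taken from the `request-type` example in the table.
REQUEST_TYPES = ("setup", "integration", "troubleshooting", "billing")


def is_valid(result: EvalResult) -> bool:
    """Return True when a result is well-formed for its evaluation type."""
    if isinstance(result, BinaryResult):
        return isinstance(result.passed, bool)
    if isinstance(result, ClassificationResult):
        return result.label in result.allowed
    if isinstance(result, RatingResult):
        return 1 <= result.score <= 5
    return False
```

Keeping the types separate makes it harder to mix them up, for example by treating a rating of 3 as a pass or a fail.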
A small set is enough for the first test pass. Pick one or two behaviors that matter most, then add more evaluations after you see real task results.
| Evaluation | Slug | Criteria |
|---|---|---|
| Helpful answer | answer-is-helpful | The agent directly addresses the user's question and gives a useful next step. |
| No hallucination | no-hallucination | The agent does not invent product details, policies, URLs, or capabilities. |
| Asks when missing context | asks-for-missing-info | The agent asks a follow-up question when it lacks required information. |
| Uses the playbook | uses-playbook-guidance | The agent follows the rules, tone, and process details from the relevant playbook. |
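If you keep a copy of your evaluation set outside the product, for example in a script or a review doc, the table above could be written down as plain data. This is a minimal sketch: the dictionary keys, the `find_problems` helper, and the lowercase-hyphenated slug rule are assumptions inferred from the example slugs, not documented requirements.

```python
import re

# The starter set from the table above, as plain data.
STARTER_EVALUATIONS = [
    {
        "name": "Helpful answer",
        "slug": "answer-is-helpful",
        "criteria": "The agent directly addresses the user's question "
                    "and gives a useful next step.",
    },
    {
        "name": "No hallucination",
        "slug": "no-hallucination",
        "criteria": "The agent does not invent product details, policies, "
                    "URLs, or capabilities.",
    },
    {
        "name": "Asks when missing context",
        "slug": "asks-for-missing-info",
        "criteria": "The agent asks a follow-up question when it lacks "
                    "required information.",
    },
    {
        "name": "Uses the playbook",
        "slug": "uses-playbook-guidance",
        "criteria": "The agent follows the rules, tone, and process details "
                    "from the relevant playbook.",
    },
]

# Assumed convention: lowercase words or digits joined by single hyphens.
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def find_problems(evaluations):
    """Return a list of human-readable problems found in an evaluation set."""
    problems = []
    seen = set()
    for evaluation in evaluations:
        slug = evaluation["slug"]
        if not SLUG_PATTERN.fullmatch(slug):
            problems.append(f"{slug}: not a lowercase hyphenated slug")
        if slug in seen:
            problems.append(f"{slug}: duplicate slug")
        seen.add(slug)
        if not evaluation["criteria"].strip():
            problems.append(f"{slug}: missing criteria")
    return problems


print(find_problems(STARTER_EVALUATIONS))
```

A check like this only catches formatting slips. Whether the criteria are clear enough for a teammate to agree with the result still needs a human read.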
Step 2: Run practice tasks
Chat with the agent using realistic examples and create a few practice tasks. Mix straightforward prompts with messy ones: missing details, vague asks, edge cases, and questions the agent should escalate instead of answering from memory.
- Ask a common happy-path question the agent should answer confidently.
- Ask a vague question where the agent should request more context.
- Ask about a policy or product detail that only exists in the playbook.
- Ask a question outside the agent's scope and check whether it redirects or escalates.
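The four prompt types above can be tracked as a small checklist so that no category is skipped in a test pass. The sketch below is hypothetical throughout: the `PracticeTask` class, the kind names, and the `missing_kinds` helper are invented for illustration, and the prompts are placeholders to be replaced with real examples from your own users.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PracticeTask:
    kind: str               # which of the four categories this task covers
    prompt: str             # what you will type into the chat
    expected_behavior: str  # what a good run should do


# Hypothetical category names, one per bullet in the list above.
REQUIRED_KINDS = {"happy-path", "vague", "playbook-only", "out-of-scope"}

# Placeholder prompts: swap in realistic questions for your agent.
PRACTICE_TASKS = [
    PracticeTask("happy-path", "<a common question>", "answers confidently"),
    PracticeTask("vague", "<an underspecified request>", "asks for more context"),
    PracticeTask("playbook-only", "<a policy question>", "answers from the playbook"),
    PracticeTask("out-of-scope", "<an unrelated request>", "redirects or escalates"),
]


def missing_kinds(tasks):
    """Return the required categories that no task covers, sorted by name."""
    covered = {task.kind for task in tasks}
    return sorted(REQUIRED_KINDS - covered)


print(missing_kinds(PRACTICE_TASKS))
```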
Step 3: Review tasks and iterate
After each run, review the task timeline and evaluation results. Look for patterns: repeated failures usually mean the instructions are unclear, the playbook is missing an example, or the agent needs a tool to verify the answer.
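Spotting repeated failures is easier with a quick tally. The sketch below assumes you have copied binary evaluation results out by hand into one dictionary per task, mapping slug to pass (`True`) or fail (`False`). That data shape and the `repeated_failures` helper are assumptions for illustration, not an export format the product is known to provide.

```python
from collections import Counter


def repeated_failures(runs, min_count=2):
    """Count failures per evaluation slug and keep those seen at least min_count times.

    `runs` is a list of dicts mapping an evaluation slug to True (pass) or False (fail).
    """
    failures = Counter(
        slug
        for run in runs
        for slug, passed in run.items()
        if not passed
    )
    return {slug: count for slug, count in failures.items() if count >= min_count}


# Example results for three practice tasks (illustrative values only).
runs = [
    {"no-hallucination": True, "asks-for-missing-info": False},
    {"no-hallucination": False, "asks-for-missing-info": False},
    {"no-hallucination": True, "asks-for-missing-info": False},
]
print(repeated_failures(runs))
```

An evaluation that fails on most tasks usually points at the instructions or the playbook, while a one-off failure is more likely tied to a specific prompt.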
Keep iterating until every important case passes its evaluations. When a test fails, make the smallest change that could fix it, then rerun the same example before moving on.
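Rerunning the same example after a change is most useful when you compare the two sets of results side by side. A minimal sketch, assuming binary results recorded as slug-to-pass dictionaries for the same practice task before and after the change (the `compare_runs` helper and this data shape are hypothetical):

```python
def compare_runs(before, after):
    """Compare binary evaluation results for one example across two runs.

    Both arguments map an evaluation slug to True (pass) or False (fail).
    Only slugs present in both runs are compared.
    """
    shared = before.keys() & after.keys()
    return {
        "fixed": sorted(s for s in shared if not before[s] and after[s]),
        "regressed": sorted(s for s in shared if before[s] and not after[s]),
        "still_failing": sorted(s for s in shared if not before[s] and not after[s]),
    }


# Illustrative values: one failure fixed, one new regression introduced.
before = {"answer-is-helpful": True, "asks-for-missing-info": False, "no-hallucination": False}
after = {"answer-is-helpful": False, "asks-for-missing-info": True, "no-hallucination": False}
print(compare_runs(before, after))
```

Anything in the regressed list suggests the change was broader than intended, which is one reason to keep each fix small.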