Testing Your First Agent

Testing turns agent building from guesswork into a feedback loop. A good test setup helps you see what the agent understood, where it drifted, and whether each change made the agent more reliable.

How to test your first agent

1. Set up evaluations for the behaviors you want to measure.
2. Chat with the agent and create practice tasks from realistic examples.
3. Review completed tasks, compare evaluation results, and iterate.

Why testing matters

Agents can sound confident even when they miss context, skip a rule, or invent details. Evaluations give you a lightweight way to inspect every run and grade whether the agent handled it well.

Step 1: Set up your evaluations

Start with evaluations that grade the behaviors you care about. Each evaluation should check one thing, use a readable slug, and describe the standard clearly enough that a teammate would agree with the result.

Type	Use it for	Example
Binary	A pass/fail check for a required behavior.	`asks-for-missing-info`: Pass when the agent asks a follow-up question instead of acting when required details are missing.
Classification	Sorting a task into one of several categories.	`request-type`: Classify the task as setup, integration, troubleshooting, or billing.
Rating (1-5)	Scoring quality on a scale when the answer is not simply right or wrong.	`answer-quality`: Rate how complete, grounded, and useful the final response is.
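
To make the three types concrete, here is a minimal Python sketch of how an evaluation and its allowed results could be modeled. The `Evaluation` class, its field names, and `validate_result` are hypothetical names chosen for illustration. They are not the product's API, only one way to reason about what each type accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """Hypothetical model of one evaluation (not a real product type)."""
    slug: str
    kind: str                # "binary", "classification", or "rating"
    criteria: str
    categories: tuple = ()   # only meaningful for classification


def validate_result(evaluation: Evaluation, result) -> bool:
    """Return True when `result` is a legal value for the evaluation's type."""
    if evaluation.kind == "binary":
        # Binary evaluations are strictly pass/fail.
        return isinstance(result, bool)
    if evaluation.kind == "classification":
        # The result must be one of the declared categories.
        return result in evaluation.categories
    if evaluation.kind == "rating":
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(result, int) and not isinstance(result, bool)
        return is_int and 1 <= result <= 5
    raise ValueError(f"unknown evaluation kind: {evaluation.kind}")


request_type = Evaluation(
    slug="request-type",
    kind="classification",
    criteria="Classify the task as setup, integration, troubleshooting, or billing.",
    categories=("setup", "integration", "troubleshooting", "billing"),
)
```

Under these assumptions, `validate_result(request_type, "billing")` is accepted and `validate_result(request_type, "sales")` is rejected, which mirrors the rule that each evaluation checks exactly one thing.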

A small set is enough for the first test pass. Pick one or two behaviors that matter most, then add more evaluations after you see real task results.

Evaluation	Slug	Criteria
Helpful answer	answer-is-helpful	The agent directly addresses the user's question and gives a useful next step.
No hallucination	no-hallucination	The agent does not invent product details, policies, URLs, or capabilities.
Asks when missing context	asks-for-missing-info	The agent asks a follow-up question when it lacks required information.
Uses the playbook	uses-playbook-guidance	The agent follows the rules, tone, and process details from the relevant playbook.
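
If you keep the starter set in a file alongside your agent, a small check can catch unreadable or inconsistent slugs early. The sketch below assumes slugs follow the lowercase, hyphen-separated style used in the table above. That pattern is inferred from the examples rather than a documented rule, and `STARTER_EVALUATIONS` and `invalid_slugs` are hypothetical names.

```python
import re

# Assumed convention: lowercase words or digits joined by single hyphens.
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

# Slug -> criteria, copied from the starter table above.
STARTER_EVALUATIONS = {
    "answer-is-helpful": "The agent directly addresses the user's question "
                         "and gives a useful next step.",
    "no-hallucination": "The agent does not invent product details, policies, "
                        "URLs, or capabilities.",
    "asks-for-missing-info": "The agent asks a follow-up question when it "
                             "lacks required information.",
    "uses-playbook-guidance": "The agent follows the rules, tone, and process "
                              "details from the relevant playbook.",
}


def invalid_slugs(slugs):
    """Return the slugs that do not match the assumed naming convention."""
    return [slug for slug in slugs if not SLUG_PATTERN.fullmatch(slug)]
```

Calling `invalid_slugs(STARTER_EVALUATIONS)` returns an empty list for the set above, while a name such as `"No Hallucination"` would be flagged.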

Step 2: Run practice tasks

Chat with the agent using realistic examples and create a few practice tasks. Mix straightforward prompts with messy ones: missing details, vague asks, edge cases, and questions the agent should escalate instead of answering from memory.

Ask a common happy-path question the agent should answer confidently.
Ask a vague question where the agent should request more context.
Ask about a policy or product detail that only exists in the playbook.
Ask a question outside the agent's scope and check whether it redirects or escalates.
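
One way to keep the mix honest is to write the practice tasks down as data and check that every scenario type above is covered. This is a sketch only: the `PracticeTask` class, the scenario labels, and the prompts are placeholders invented for illustration, so replace them with examples from your own users.

```python
from dataclasses import dataclass

# Labels for the four scenario types listed above (hypothetical names).
SCENARIOS = ("happy-path", "vague", "playbook-only", "out-of-scope")


@dataclass(frozen=True)
class PracticeTask:
    prompt: str             # what you type into the chat
    scenario: str           # one of SCENARIOS
    expected_behavior: str  # what a good run should do


# Placeholder prompts; swap in realistic ones for your agent.
PRACTICE_TASKS = [
    PracticeTask("How do I reset my password?", "happy-path", "answer directly"),
    PracticeTask("It's not working, can you fix it?", "vague", "ask for context"),
    PracticeTask("What is the refund window?", "playbook-only", "answer from playbook"),
]


def missing_scenarios(tasks):
    """Return the scenario types that no practice task covers yet."""
    covered = {task.scenario for task in tasks}
    return [scenario for scenario in SCENARIOS if scenario not in covered]
```

With the three placeholder tasks above, `missing_scenarios(PRACTICE_TASKS)` reports that the out-of-scope case still needs a practice task.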

Step 3: Review tasks and iterate

After each run, review the task timeline and evaluation results. Look for patterns: repeated failures usually mean the instructions are unclear, the playbook is missing an example, or the agent needs a tool to verify the answer.

Keep iterating until every important case passes its evaluations. When a test fails, make the smallest change that could fix it, then rerun the same example before moving on.
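
The review loop above can be sketched as a small tally over exported results. The example assumes you can get each result as a `(task_id, slug, passed)` tuple. That shape, the function names, and the sample rows are hypothetical and exist only to show the pattern of spotting repeated failures.

```python
from collections import Counter


def repeated_failures(results, min_count=2):
    """Map each evaluation slug to its failure count when it failed
    at least `min_count` times. `results` holds (task_id, slug, passed)."""
    failures = Counter(slug for _, slug, passed in results if not passed)
    return {slug: count for slug, count in failures.items() if count >= min_count}


def all_green(results):
    """True when every recorded evaluation passed."""
    return all(passed for _, _, passed in results)


# Made-up sample rows for illustration only.
sample_results = [
    ("task-1", "answer-is-helpful", True),
    ("task-1", "no-hallucination", False),
    ("task-2", "no-hallucination", False),
    ("task-2", "asks-for-missing-info", True),
    ("task-3", "asks-for-missing-info", False),
]
```

Here `repeated_failures(sample_results)` singles out `no-hallucination`, which failed on two different tasks. A slug that fails repeatedly is the kind of pattern that points to unclear instructions, a missing playbook example, or a missing verification tool, while a single failure may just be one awkward prompt.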