Eval-driven receipt extraction

Maya Patel · Solutions Engineering · May 20, 2026 · 14 min read · Vision, Evals, Agents
[Image: receipt extraction control panel mock]

Receipt extraction looks like a vision problem until you actually try to ship it. The hard part isn't the model — it's deciding, in code, when an extraction is good enough to skip human review. This tutorial walks through building the agent, then layering an eval harness over it so the decision stops being a vibe and starts being a number.

What's an eval, again?
An eval is a tiny test for a model output. It takes the run's input plus output, decides whether the output is acceptable, and returns a score. Stack a few of these and you have a regression suite for a non-deterministic system.
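
As a rough illustration of that definition, here is a minimal sketch in plain Python. The eval names, the `run_evals` helper, and the sample output are made up for this example and are not part of any SDK; the only idea taken from the text is "input plus output in, score out".

```python
from typing import Callable

# An eval here is any function (run_input, output) -> score between 0.0 and 1.0.
EvalFn = Callable[[dict, dict], float]

def has_line_items(run_input: dict, output: dict) -> float:
    """Pass if the extraction returned at least one line item."""
    return 1.0 if output.get("line_items") else 0.0

def total_is_non_negative(run_input: dict, output: dict) -> float:
    """Pass if grand_total is a number and not negative."""
    total = output.get("grand_total")
    return 1.0 if isinstance(total, (int, float)) and total >= 0 else 0.0

def run_evals(evals: dict[str, EvalFn], run_input: dict, output: dict) -> dict[str, float]:
    """Score one run with every eval and return the scores by name."""
    return {name: fn(run_input, output) for name, fn in evals.items()}

# Hypothetical run: the input and output below are illustrative only.
evals = {"has_line_items": has_line_items, "total_is_non_negative": total_is_non_negative}
output = {"line_items": [{"description": "Coffee", "total": 3.5}], "grand_total": 3.5}
scores = run_evals(evals, {"image": "receipt-001.jpg"}, output)
```

Running the same dictionary of evals over every output in a dataset is what turns these small checks into a regression suite.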

What you'll build

  • A Brainbase agent that takes a receipt image and returns structured line items.
  • A set of evals that score each extraction on completeness, type-correctness, and total-line accuracy.
  • A small dashboard view that shows how the score moves as you change the prompt or swap the model.
  • A safe-rollout switch so the agent's output is only trusted when the eval score clears a threshold.

1. Shape the output

Pin the output shape before you touch the model. A flat list of line items is enough for most workflows: each item carries description, quantity, unit_price, and total, and the receipt as a whole carries merchant, purchased_at, subtotal, tax, and grand_total.

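schema.ts (typescript)
import { z } from "zod";

// Contract for one extracted receipt, shared by the agent and the evals.
// Fields are non-nullable here; if the agent is allowed to return null for
// unreadable fields (see the instructions in step 2), mark them .nullable().
export const ReceiptSchema = z.object({
  merchant: z.string(),
  purchased_at: z.string(),
  // One entry per printed line on the receipt.
  line_items: z.array(
    z.object({
      description: z.string(),
      quantity: z.number(),
      unit_price: z.number(),
      total: z.number(),
    })
  ),
  subtotal: z.number(),
  tax: z.number(),
  grand_total: z.number(),
});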

Use this schema for two things: as the agent's output_schema, and as the contract your evals score against. Drift between them is the most common source of "passes evals, breaks in production" surprises.

2. Write the agent

Create the agent through the API. The prompt stays short and pushes everything structured into the schema; tweak only the prompt as you iterate, never the schema.

bash
curl --request POST \
  --url https://api.brainbaselabs.com/v2/agents \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "title": "Receipt extractor",
    "instructions": "Extract every line item exactly as printed. Do not infer items, totals, or taxes that are not visible. Return null for fields you cannot see.",
    "runtime_kind": "vision",
    "output_schema_ref": "schemas/receipt.v1.json"
  }'

Keep the instructions narrow. Every additional rule you cram in here makes the eval score harder to reason about: when the number moves, was it the instruction, the model, or the image quality?
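
If the rest of your tooling is in Python, the curl call above can be prepared with the standard library alone. This is a sketch: the URL and payload fields are copied from the curl command, the `BRAINBASE_API_KEY` environment variable name and the `build_create_agent_request` helper are assumptions, and nothing is sent until you call `urlopen`.

```python
import json
import os
import urllib.request

# Endpoint copied from the curl command above.
AGENTS_URL = "https://api.brainbaselabs.com/v2/agents"

def build_create_agent_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates the agent."""
    payload = {
        "title": "Receipt extractor",
        "instructions": (
            "Extract every line item exactly as printed. Do not infer items, "
            "totals, or taxes that are not visible. Return null for fields "
            "you cannot see."
        ),
        "runtime_kind": "vision",
        "output_schema_ref": "schemas/receipt.v1.json",
    }
    return urllib.request.Request(
        AGENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# BRAINBASE_API_KEY is a hypothetical variable name; use whatever your setup provides.
request = build_create_agent_request(os.environ.get("BRAINBASE_API_KEY", "YOUR_API_KEY"))
# To actually create the agent: urllib.request.urlopen(request)
```

Keeping the instructions in one Python string also makes it easy to diff prompt changes between eval runs.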

3. Collect evals

You need ~30 receipts to start. Pick them deliberately: 10 boring ones, 10 with edge cases (multiple pages, faded ink, foreign currency), and 10 that previously broke in production. Annotate each one with the ground-truth JSON.
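
A quick sanity check on the dataset can catch a lopsided or half-annotated set before it skews the baseline. The sketch below assumes a hypothetical manifest format (a list of dicts with `image`, `bucket`, and `ground_truth` keys) and hypothetical bucket names; only the 10/10/10 split comes from the text.

```python
from collections import Counter

# Hypothetical bucket names for the three groups of receipts described above.
REQUIRED_PER_BUCKET = {"boring": 10, "edge_case": 10, "production_failure": 10}

def check_dataset(entries: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the dataset is usable."""
    problems = []
    counts = Counter(entry["bucket"] for entry in entries)
    for bucket, required in REQUIRED_PER_BUCKET.items():
        if counts[bucket] < required:
            problems.append(f"{bucket}: have {counts[bucket]}, want {required}")
    for entry in entries:
        if not entry.get("ground_truth"):
            problems.append(f"{entry['image']}: missing ground truth")
    return problems

# Placeholder manifest with 10 annotated receipts per bucket.
entries = [
    {"image": f"{bucket}-{i:02d}.jpg", "bucket": bucket, "ground_truth": {"grand_total": 1.0}}
    for bucket in REQUIRED_PER_BUCKET
    for i in range(10)
]
problems = check_dataset(entries)  # empty list for the placeholder manifest
```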

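python
from brainbase import Eval, Suite

def line_item_recall(run, ground_truth):
    """Fraction of ground-truth line items whose description was extracted."""
    # Exact match on description; duplicate descriptions collapse into one.
    # The instructions allow null fields, so treat a missing list as empty.
    predicted = {item["description"] for item in run.output.get("line_items") or []}
    actual = {item["description"] for item in ground_truth["line_items"]}
    # max(..., 1) avoids dividing by zero when the ground truth has no items.
    return len(predicted & actual) / max(len(actual), 1)

def total_within_one_cent(run, ground_truth):
    """1.0 if the extracted grand total is within a cent of the truth, else 0.0."""
    predicted_total = run.output.get("grand_total")
    if predicted_total is None:  # unreadable total: fail instead of raising
        return 0.0
    delta = abs(predicted_total - ground_truth["grand_total"])
    return 1.0 if delta < 0.01 else 0.0

suite = Suite(
    agent_id="ag_01HC2K8...",
    dataset="datasets/receipts.v1",
    evals=[
        Eval("line_item_recall", line_item_recall, threshold=0.95),
        Eval("total_within_one_cent", total_within_one_cent, threshold=0.98),
    ],
)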
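
The suite above covers completeness and the total, but not the type-correctness eval listed under "What you'll build". One possible shape for it is sketched below. It mirrors the `(run, ground_truth)` signature of the evals above and hard-codes the field types from the schema in step 1; the function name is an assumption, and treating a null field as a type failure is a design choice you may want to relax.

```python
from types import SimpleNamespace

NUMBER = (int, float)
# Expected Python types, mirroring the schema from step 1.
TOP_LEVEL_TYPES = {"merchant": str, "purchased_at": str,
                   "subtotal": NUMBER, "tax": NUMBER, "grand_total": NUMBER}
LINE_ITEM_TYPES = {"description": str, "quantity": NUMBER,
                   "unit_price": NUMBER, "total": NUMBER}

def _is_type(value, expected) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, expected) and not isinstance(value, bool)

def type_correctness(run, ground_truth=None):
    """Fraction of schema fields in run.output that hold the expected type."""
    output = run.output
    checks = [_is_type(output.get(name), expected)
              for name, expected in TOP_LEVEL_TYPES.items()]
    for item in output.get("line_items") or []:
        checks += [_is_type(item.get(name), expected)
                   for name, expected in LINE_ITEM_TYPES.items()]
    return sum(checks) / len(checks)

# Stand-in for a run object; only the .output attribute is used.
good_run = SimpleNamespace(output={
    "merchant": "Cafe", "purchased_at": "2026-05-20",
    "line_items": [{"description": "Coffee", "quantity": 1,
                    "unit_price": 3.5, "total": 3.5}],
    "subtotal": 3.5, "tax": 0.0, "grand_total": 3.5,
})
score = type_correctness(good_run)
```

Because the signature matches the other evals, it could be registered alongside them in the same way, with a threshold of your choosing.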

Run the suite once to set a baseline. Whatever it returns is your starting line — every subsequent change is measured against it.
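
Measuring against the baseline can be as simple as a dictionary diff. The sketch below assumes suite-level scores are available as a plain dict of eval name to score; the `compare_to_baseline` helper, the tolerance, and the numbers are illustrative, not real results.

```python
def compare_to_baseline(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, str]:
    """Label each baseline eval as improved, regressed, or unchanged."""
    report = {}
    for name, base_score in baseline.items():
        # An eval missing from the current run counts as a score of 0.0.
        delta = current.get(name, 0.0) - base_score
        if delta > tolerance:
            report[name] = "improved"
        elif delta < -tolerance:
            report[name] = "regressed"
        else:
            report[name] = "unchanged"
    return report

# Made-up scores for illustration only.
baseline = {"line_item_recall": 0.91, "total_within_one_cent": 0.87}
candidate = {"line_item_recall": 0.96, "total_within_one_cent": 0.83}
report = compare_to_baseline(baseline, candidate)
```

The tolerance keeps small run-to-run noise from a non-deterministic model from being read as a real change.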

4. Iterate with structured scores

Now you can change one variable at a time and read the score. Pick the eval that's worst and decide whether it's an instruction problem, a model-capability problem, or a dataset problem.

  1. Instruction problem: rewrite the relevant sentence in the agent's instructions. Re-run.
  2. Model problem: swap to a stronger vision model. Re-run. If the number jumps and cost is acceptable, ship the model.
  3. Dataset problem: the eval is wrong, or you're holding the model to a standard humans can't meet. Loosen the eval or fix the ground truth.

Lock the loop
Once you have evals running on every change, prompt-tuning stops being an art project. The agent either passes or it doesn't, and you can promote changes with confidence.
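
The first move in each iteration, picking the worst eval, can be sketched as a small helper. It assumes scores and thresholds are plain dicts keyed by eval name; `worst_eval` is a hypothetical name, and "worst" is taken to mean "furthest below its threshold".

```python
def worst_eval(scores: dict[str, float], thresholds: dict[str, float]):
    """Return (name, gap) for the eval furthest below its threshold, or None if all pass."""
    # A missing score is treated as 0.0 so it surfaces as the worst offender.
    gaps = {name: thresholds[name] - scores.get(name, 0.0) for name in thresholds}
    name = max(gaps, key=gaps.get)
    return (name, gaps[name]) if gaps[name] > 0 else None

# Thresholds from the suite in step 3; the scores are made up for illustration.
thresholds = {"line_item_recall": 0.95, "total_within_one_cent": 0.98}
target = worst_eval({"line_item_recall": 0.90, "total_within_one_cent": 0.97}, thresholds)
```

Whichever eval this returns is the one to triage as an instruction, model, or dataset problem before touching anything else.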

5. Ship it behind a flag

In production, route every run through the same eval gate before trusting the output. The agent stays the source of truth for "what's on the receipt"; the eval gate decides whether a human should still look.

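python
# suite is the Suite from step 3; client, run_id, mark_processed, and
# enqueue_for_review are assumed to be defined by your application.
run = client.runs.get(run_id)
scores = suite.score(run)

# Same thresholds as the Suite definition; anything below goes to a human.
if scores["line_item_recall"] >= 0.95 and scores["total_within_one_cent"] >= 0.98:
    mark_processed(run)
else:
    enqueue_for_review(run, reason=scores)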
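
The flag itself can live in the same decision. Below is a minimal sketch of the gate as a pure function, assuming scores arrive as a dict; the `should_auto_process` name and the `gate_enabled` flag are hypothetical, and the thresholds are the ones used above. It fails closed: a missing score or a disabled flag sends the run to review.

```python
# Thresholds copied from the suite definition in step 3.
THRESHOLDS = {"line_item_recall": 0.95, "total_within_one_cent": 0.98}

def should_auto_process(
    scores: dict,
    thresholds: dict = THRESHOLDS,
    gate_enabled: bool = True,
) -> bool:
    """True only if the flag is on and every eval score clears its threshold."""
    if not gate_enabled:
        return False  # flag off: every run goes to human review
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None or score < minimum:
            return False  # missing or low score fails closed
    return True

decision = should_auto_process({"line_item_recall": 0.97, "total_within_one_cent": 1.0})
```

Keeping the gate free of side effects makes it easy to unit-test and to flip off during an incident without redeploying the agent.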

Where to go next