
amodal eval

Run evaluation suites against your agent to measure quality, compare models, and track regressions.

amodal eval

Eval Files

Evals live in evals/ as YAML files:

name: triage-accuracy
description: Test alert triage quality
cases:
  - input: "Review recent security alerts"
    rubric:
      - "Correctly identifies critical alerts"
      - "Filters known false positives"
      - "Provides severity ranking"
    expected_tools:
      - request
      - query_store
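For illustration, the case above can be modeled in plain Python. The `EvalCase` class below is hypothetical (its field names simply mirror the YAML keys, it is not part of the amodal API):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical in-memory model of one eval case; mirrors the YAML keys above.
@dataclass
class EvalCase:
    input: str
    rubric: List[str]
    expected_tools: List[str] = field(default_factory=list)

case = EvalCase(
    input="Review recent security alerts",
    rubric=[
        "Correctly identifies critical alerts",
        "Filters known false positives",
        "Provides severity ranking",
    ],
    expected_tools=["request", "query_store"],
)
print(len(case.rubric))  # 3
```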

Evaluation Methods

Method         Description
LLM judge      An LLM evaluates the agent's response against the rubric
Tool usage     Verify that the expected tools were called
Cost tracking  Track token usage and cost per eval
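The tool-usage check can be sketched as a set comparison. The `missing_tools` helper below is illustrative only; the real runner's logic may differ:

```python
from typing import Iterable, List

def missing_tools(expected: Iterable[str], called: Iterable[str]) -> List[str]:
    """Return the expected tools that the agent never called."""
    called_set = set(called)
    return [t for t in expected if t not in called_set]

# An agent that called query_store but never request fails the check.
print(missing_tools(["request", "query_store"], ["query_store", "summarize"]))
# ['request']
```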

Experiments

Compare different configurations side-by-side:

amodal ops experiment

Experiments let you test:

  • Different LLM providers or models
  • Different skill configurations
  • Different prompt variations
  • Different knowledge documents

Results include cost comparison, quality scores, and latency metrics.
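A side-by-side comparison like this amounts to ranking runs on those metrics. The records below are made-up numbers, shown only to illustrate the kind of trade-off an experiment surfaces:

```python
# Hypothetical experiment results; amodal reports similar metrics
# (quality score, cost, latency), but these values are illustrative.
runs = [
    {"config": "model-large", "score": 0.92, "cost_usd": 0.41, "p50_ms": 1800},
    {"config": "model-small", "score": 0.88, "cost_usd": 0.09, "p50_ms": 700},
]

best_quality = max(runs, key=lambda r: r["score"])
cheapest = min(runs, key=lambda r: r["cost_usd"])
print(best_quality["config"], cheapest["config"])  # model-large model-small
```

A common outcome is exactly this split: the larger model scores higher while the smaller one is cheaper and faster, and the eval suite quantifies whether the quality gap justifies the cost.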

Multi-Model Comparison

Run the same eval suite against multiple providers to find the best model for your use case:

amodal eval --providers anthropic,openai,google
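Ranking providers from such a run reduces to aggregating per-case scores. The scores below are fabricated for illustration, not real benchmark results:

```python
from statistics import mean

# Hypothetical per-case judge scores keyed by provider; illustrative only.
scores = {
    "anthropic": [0.90, 0.95, 0.88],
    "openai": [0.85, 0.90, 0.87],
    "google": [0.80, 0.92, 0.84],
}

# Rank providers by mean score, highest first.
ranked = sorted(scores, key=lambda p: mean(scores[p]), reverse=True)
print(ranked[0])  # anthropic
```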