
# Rubrics

Rubrics are defined as `assertions` entries and support both binary checklist grading and analytic grading with score ranges.

The simplest form lists plain strings under `assertions`; each string becomes a required criterion:

```yaml
tests:
  - id: quicksort-explain
    criteria: Explain how quicksort works
    input: Explain quicksort algorithm
    assertions:
      - Mentions divide-and-conquer approach
      - Explains partition step
      - States time complexity
```

All plain strings are automatically collected into a single `rubrics` grader.

Use `type: rubrics` explicitly when you need weights, required flags, or score ranges:

```yaml
tests:
  - id: quicksort-explain
    criteria: Explain how quicksort works
    input: Explain quicksort algorithm
    assertions:
      - type: rubrics
        criteria:
          - Mentions divide-and-conquer approach
          - Explains partition step
          - States time complexity
```

For fine-grained control, use rubric objects with weights and requirements:

```yaml
assertions:
  - type: rubrics
    criteria:
      - id: core-concept
        outcome: Explains divide-and-conquer
        weight: 2.0
        required: true
      - id: partition
        outcome: Describes partition step
        weight: 1.5
      - id: complexity
        outcome: States O(n log n) average time
        weight: 1.0
```
| Field | Default | Description |
| --- | --- | --- |
| `id` | Auto-generated | Unique identifier for the criterion |
| `outcome` | — | Description of what to check |
| `weight` | `1.0` | Relative importance for scoring |
| `required` | `false` | If true, failing this criterion fails the entire eval |
| `min_score` | — | Minimum score (0–1) for this criterion to pass |
| `score_ranges` | — | Score range definitions (analytic mode) |
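As a sketch of how `min_score` fits alongside the other fields, the following fragment gates one criterion on a minimum normalized score (the `accuracy` criterion and the threshold value are invented for illustration; field names follow the table above):

```yaml
assertions:
  - type: rubrics
    criteria:
      - id: accuracy            # hypothetical criterion for illustration
        outcome: Provides correct answer
        weight: 2.0
        min_score: 0.6          # assumed: criterion must reach 0.6 (of 1) to pass
```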

For quality gradients instead of binary pass/fail, use score ranges:

```yaml
assertions:
  - type: rubrics
    criteria:
      - id: accuracy
        outcome: Provides correct answer
        weight: 2.0
        score_ranges:
          0: Completely wrong
          3: Partially correct with major errors
          5: Mostly correct with minor issues
          7: Correct with minor omissions
          10: Perfectly accurate and complete
```

Each criterion is scored 0–10 by the LLM grader with granular feedback.

Criterion results combine into a weighted average. In binary checklist mode:

```
score = sum(satisfied_weights) / sum(total_weights)
```

In analytic (score-range) mode:

```
score = sum(criterion_score / 10 * weight) / sum(total_weights)
```

| Verdict | Score |
| --- | --- |
| pass | ≥ 0.8 |
| fail | < 0.8 |
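The two aggregation formulas can be sketched in plain Python (the criteria, weights, and scores below are invented for illustration; only the arithmetic mirrors the formulas above):

```python
def checklist_score(results):
    """Binary mode: results is a list of (satisfied, weight) pairs."""
    total = sum(weight for _, weight in results)
    return sum(weight for satisfied, weight in results if satisfied) / total

def analytic_score(results):
    """Analytic mode: results is a list of (score_0_to_10, weight) pairs."""
    total = sum(weight for _, weight in results)
    return sum(score / 10 * weight for score, weight in results) / total

# Hypothetical criteria with weights 2.0, 1.5, 1.0:
binary = checklist_score([(True, 2.0), (True, 1.5), (False, 1.0)])
print(round(binary, 3))    # 0.778 -> below the 0.8 threshold, verdict: fail

analytic = analytic_score([(10, 2.0), (10, 1.5), (7, 1.0)])
print(round(analytic, 3))  # 0.933 -> at or above 0.8, verdict: pass
```

Note that a single heavy criterion failing can pull the whole score under the 0.8 pass threshold even when every other criterion is satisfied.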

Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic graders, and rubric or LLM-based grading, use the agentv-eval-writer skill. Let the criteria drive the grader choice rather than one fixed recipe.

Rubric assertions automatically receive the full evaluation context, not just the agent’s text answer. When present, the following are appended to the grader prompt:

- `file_changes` — unified diff of workspace file changes (when a workspace is configured)
- `tool_calls` — formatted summary of tool calls from agent execution (tool name plus key inputs)

This means rubric criteria can reason about what the agent did, not only what it said. For example, you can check whether an agent invoked a specific skill:

```yaml
assertions:
  - The agent invoked the acme-deploy skill
  - The agent used Read to inspect the config file before editing
```

This is a lightweight alternative to the skill-trigger evaluator when you want to check tool usage with natural-language criteria.

Rubrics work alongside code and LLM graders:

```yaml
tests:
  - id: code-quality
    criteria: Generates correct, clean Python code
    input: Write a fibonacci function
    assertions:
      - type: rubrics
        criteria:
          - Returns correct values for n=0,1,2,10
          - Uses meaningful variable names
          - Includes docstring
      - name: syntax_check
        type: code-grader
        command: [./validators/check_python.py]
```