Overview
Evaluators determine how well a solution performs. They parse execution output and return a score.
Available Evaluators
| Evaluator | Description | Requires LLM |
|---|---|---|
| no_score | Always returns 0 | No |
| regex_pattern | Extract score with regex | No |
| file_json | Read score from JSON file | No |
| multi_metric | Weighted combination of metrics | No |
| llm_judge | LLM-based evaluation | Yes |
| llm_comparison | Compare against expected output | Yes |
| composite | Combine multiple evaluators | Depends |
Usage
Via Kapso API
Direct Usage
no_score
Always returns 0. Use when you only care about successful execution.
regex_pattern
Extract score from output using a regex pattern.
Parameters
| Parameter | Default | Description |
|---|---|---|
| pattern | r'SCORE:\s*([-+]?\d*\.?\d+)' | Regex with capture group |
| default_score | 0.0 | Score if pattern not found |
Example Patterns
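The sketch below shows one way the default pattern and a few alternatives could be applied with Python's `re` module. The helper `extract_score` and the non-default patterns are hypothetical illustrations, not Kapso's implementation; only the default pattern and `default_score` come from the table above.

```python
import re

# Default pattern from the parameters table.
DEFAULT_PATTERN = r'SCORE:\s*([-+]?\d*\.?\d+)'

def extract_score(output: str, pattern: str = DEFAULT_PATTERN,
                  default_score: float = 0.0) -> float:
    """Return the first capture group of `pattern` as a float.

    Falls back to `default_score` when the pattern does not match or the
    captured text is not a number (hypothetical helper, for illustration).
    """
    match = re.search(pattern, output)
    if match is None:
        return default_score
    try:
        return float(match.group(1))
    except (ValueError, IndexError):
        return default_score

# Hypothetical alternative patterns; each has exactly one capture group.
EXAMPLE_PATTERNS = {
    "accuracy": r'Accuracy:\s*([-+]?\d*\.?\d+)',
    "f1": r'F1\s*=\s*([-+]?\d*\.?\d+)',
}

print(extract_score("epoch 3 done\nSCORE: 0.87"))                      # 0.87
print(extract_score("Accuracy: 0.91", EXAMPLE_PATTERNS["accuracy"]))   # 0.91
print(extract_score("no score printed"))                               # 0.0
```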
file_json
Read score from a JSON file created by the code.
Parameters
| Parameter | Default | Description |
|---|---|---|
| filename | "results.json" | JSON file to read |
| score_key | "score" | Key path (dot notation) |
| default_score | 0.0 | Score if file/key not found |
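The dot-notation key path can be read as a walk through nested JSON objects. A minimal sketch, assuming each dot-separated segment names a key in a nested dictionary; `read_score` is a hypothetical helper and the `metrics.accuracy` layout is made up for the demonstration:

```python
import json
import os
import tempfile
from pathlib import Path

def read_score(filename: str = "results.json", score_key: str = "score",
               default_score: float = 0.0) -> float:
    """Resolve a dot-separated key path in a JSON file to a float score."""
    try:
        node = json.loads(Path(filename).read_text())
    except (OSError, json.JSONDecodeError):
        return default_score  # missing or unreadable file
    for part in score_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default_score  # key path does not exist
        node = node[part]
    try:
        return float(node)
    except (TypeError, ValueError):
        return default_score  # value is not numeric

# Demonstration with a temporary file standing in for the solution's output.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.json")
    Path(path).write_text(json.dumps({"metrics": {"accuracy": 0.93}}))
    print(read_score(path, "metrics.accuracy"))  # 0.93
    print(read_score(path, "metrics.missing"))   # 0.0
```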
Example
Code writes:
multi_metric
Weighted combination of multiple regex metrics.
Parameters
| Parameter | Description |
|---|---|
| patterns | Dict of {name: (pattern, weight)} |
| invert | List of metric names where lower is better |
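How the metrics are combined is not spelled out here, so the following is only a sketch under stated assumptions: weights are normalized by their sum, metrics missing from the output are skipped, and an inverted metric `v` (assumed non-negative) is mapped to `1 / (1 + v)` so that lower raw values score higher. The function name `multi_metric_score` is hypothetical.

```python
import re

def multi_metric_score(output: str, patterns: dict, invert=()) -> float:
    """Combine several regex-extracted metrics into one weighted score.

    `patterns` maps name -> (regex with one capture group, weight).
    Assumptions (not confirmed by the source): weights are normalized,
    missing metrics are skipped, inverted metrics use 1 / (1 + value).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for name, (pattern, weight) in patterns.items():
        match = re.search(pattern, output)
        if match is None:
            continue  # metric not present in the output
        value = float(match.group(1))
        if name in invert:
            value = 1.0 / (1.0 + value)  # lower is better; assumes value >= 0
        weighted_sum += weight * value
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

NUMBER = r'([-+]?\d*\.?\d+)'
patterns = {
    "accuracy": (r'accuracy:\s*' + NUMBER, 0.5),
    "loss": (r'loss:\s*' + NUMBER, 0.5),
}
print(multi_metric_score("accuracy: 0.8\nloss: 1.0", patterns, invert=["loss"]))
```

With these inputs the loss of 1.0 is mapped to 0.5, so the result is the equally weighted mean of 0.8 and 0.5.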
llm_judge
LLM-based evaluation with custom criteria.
Parameters
| Parameter | Default | Description |
|---|---|---|
| criteria | "correctness and quality" | What to evaluate |
| model | "gpt-4.1-mini" | LLM model |
| scale | 10 | Score scale (normalized to 0-1) |
| include_code | False | Include source code in evaluation |
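The `scale` parameter implies that the judge's raw score is divided down to the 0-1 range. A minimal sketch of that normalization, assuming simple division by `scale` and clamping of out-of-range values; `normalize_judge_score` is a hypothetical name and the clamping is an assumption:

```python
def normalize_judge_score(raw_score: float, scale: float = 10) -> float:
    """Map a raw judge score on a 0..scale range to 0..1.

    Assumption: values outside the range are clamped rather than rejected.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return min(max(raw_score / scale, 0.0), 1.0)

print(normalize_judge_score(7))             # 0.7
print(normalize_judge_score(4, scale=5))    # 0.8
print(normalize_judge_score(12))            # 1.0 (clamped)
```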
Prompt Template
llm_comparison
Compare output against expected output.
Parameters
| Parameter | Default | Description |
|---|---|---|
| expected | "" | Expected output |
| model | "gpt-4.1-mini" | LLM model |
| strict | False | Require exact match |
composite
Combine multiple evaluators with weights.
Parameters
| Parameter | Default | Description |
|---|---|---|
| evaluators | - | List of (type, params, weight) |
| aggregation | "weighted_avg" | How to combine scores |
Aggregation Methods
- weighted_avg: Weighted average of scores
- min: Minimum score
- max: Maximum score
- product: Product of scores
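The four aggregation methods can be sketched in a few lines of Python. This is an illustration rather than Kapso's code: `aggregate` is a hypothetical helper, and it assumes that weights are normalized by their sum for `weighted_avg` and ignored by `min`, `max`, and `product`.

```python
import math

def aggregate(scores, weights, aggregation: str = "weighted_avg") -> float:
    """Combine per-evaluator scores using one of the four named methods."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal in length")
    if aggregation == "weighted_avg":
        total = sum(weights)
        if total == 0:
            raise ValueError("weights must not sum to zero")
        return sum(s * w for s, w in zip(scores, weights)) / total
    if aggregation == "min":
        return min(scores)      # weights ignored (assumption)
    if aggregation == "max":
        return max(scores)      # weights ignored (assumption)
    if aggregation == "product":
        return math.prod(scores)  # weights ignored (assumption)
    raise ValueError(f"unknown aggregation: {aggregation}")

scores, weights = [0.5, 1.0], [1, 1]
for method in ("weighted_avg", "min", "max", "product"):
    print(method, aggregate(scores, weights, method))
```

Under these assumptions, `min` and `product` act as strict gates (one failing evaluator pulls the whole score down), while `weighted_avg` lets a strong score compensate for a weak one.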