## Overview

Evaluators determine how well a solution performs. They parse the execution output (or files the solution produces) and return a score with optional feedback.
## Available Evaluators

| Evaluator | Description | Requires LLM |
|---|---|---|
| `no_score` | Always returns 0 | No |
| `regex_pattern` | Extract score with regex | No |
| `file_json` | Read score from JSON file | No |
| `multi_metric` | Weighted combination of metrics | No |
| `llm_judge` | LLM-based evaluation | Yes |
| `llm_comparison` | Compare against expected output | Yes |
| `composite` | Combine multiple evaluators | Depends |
## Usage

### Via Kapso API

```python
solution = kapso.evolve(
    goal="Build a classifier",
    evaluator="regex_pattern",
    evaluator_params={"pattern": r"Accuracy: ([\d.]+)"},
)
```
### Direct Usage

```python
from src.environment.evaluators import EvaluatorFactory

evaluator = EvaluatorFactory.create(
    "regex_pattern",
    pattern=r"Accuracy: ([\d.]+)",
)

result = evaluator.evaluate(
    output="Accuracy: 0.95",
    file_path="/path/to/code",
)
print(f"Score: {result.score}")
```
## no_score

Always returns 0. Use when you only care about successful execution.

```python
evaluator = EvaluatorFactory.create("no_score")
```
## regex_pattern

Extracts a score from the output using a regex pattern.

```python
evaluator = EvaluatorFactory.create(
    "regex_pattern",
    pattern=r"Accuracy: ([\d.]+)",
    default_score=0.0,
)
```

### Parameters

| Parameter | Default | Description |
|---|---|---|
| `pattern` | `r'SCORE:\s*([-+]?\d*\.?\d+)'` | Regex with a capture group for the score |
| `default_score` | `0.0` | Score if the pattern is not found |
### Example Patterns

```python
# Accuracy percentage
pattern=r"Accuracy: ([\d.]+)%"

# Loss value
pattern=r"Loss: ([\d.]+)"

# Custom score format
pattern=r"SCORE:\s*([\d.]+)"

# F1 score
pattern=r"F1: ([\d.]+)"
```
## file_json

Reads a score from a JSON file created by the code.

```python
evaluator = EvaluatorFactory.create(
    "file_json",
    filename="results.json",
    score_key="evaluation.f1_score",
    default_score=0.0,
)
```

### Parameters

| Parameter | Default | Description |
|---|---|---|
| `filename` | `"results.json"` | JSON file to read |
| `score_key` | `"score"` | Key path (dot notation) |
| `default_score` | `0.0` | Score if the file or key is not found |
### Example

The code writes:

```json
{
  "evaluation": {
    "f1_score": 0.87,
    "accuracy": 0.92
  }
}
```

Config:

```python
score_key="evaluation.f1_score"  # Returns 0.87
```
## multi_metric

Weighted combination of multiple regex metrics.

```python
evaluator = EvaluatorFactory.create(
    "multi_metric",
    patterns={
        "accuracy": (r"Accuracy: ([\d.]+)", 0.5),
        "f1": (r"F1: ([\d.]+)", 0.3),
        "speed": (r"Time: ([\d.]+)s", 0.2),
    },
    invert=["speed"],  # Lower time is better
)
```

### Parameters

| Parameter | Description |
|---|---|
| `patterns` | Dict of `{name: (pattern, weight)}` |
| `invert` | List of metric names where lower is better |
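The configuration above expects program output with one line per metric, for example:

```
Accuracy: 0.91
F1: 0.88
Time: 3.2s
```

Each captured value is weighted by its entry in `patterns`; because `speed` is listed in `invert`, a lower time contributes a higher score.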
## llm_judge

LLM-based evaluation with custom criteria.

```python
evaluator = EvaluatorFactory.create(
    "llm_judge",
    criteria="correctness and efficiency",
    model="gpt-4.1-mini",
    scale=10,
    include_code=False,
)
```

### Parameters

| Parameter | Default | Description |
|---|---|---|
| `criteria` | `"correctness and quality"` | What to evaluate |
| `model` | `"gpt-4.1-mini"` | LLM model |
| `scale` | `10` | Score scale (normalized to 0-1) |
| `include_code` | `False` | Include source code in the evaluation |
### Prompt Template

```
You are an expert code evaluator.

Evaluate the code execution output based on these criteria: {criteria}

Provide:
1. A score from 0 to {scale}
2. Brief feedback explaining the score

Format:
SCORE: <number>
FEEDBACK: <explanation>
```
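A judge response following this format might look like the lines below; with `scale=10`, a raw score of 8 is normalized to 0.8:

```
SCORE: 8
FEEDBACK: Output is correct, but the solution re-reads the dataset on every prediction.
```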
## llm_comparison

Compares the output against an expected output.

```python
evaluator = EvaluatorFactory.create(
    "llm_comparison",
    expected="The answer is 42",
    model="gpt-4.1-mini",
    strict=False,  # Allow semantic equivalence
)
```

### Parameters

| Parameter | Default | Description |
|---|---|---|
| `expected` | `""` | Expected output |
| `model` | `"gpt-4.1-mini"` | LLM model |
| `strict` | `False` | Require an exact match |
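A usage sketch: with `strict=False`, the judge can accept output that is semantically equivalent to the expected string rather than a verbatim match.

```python
result = evaluator.evaluate(
    output="After running the computation, the answer is 42.",
    file_path="/path/to/code",
)
print(result.score)  # expected to be high: the phrasing differs but the meaning matches
```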
## composite

Combines multiple evaluators with weights.

```python
evaluator = EvaluatorFactory.create(
    "composite",
    evaluators=[
        ("regex_pattern", {"pattern": r"Accuracy: ([\d.]+)"}, 0.6),
        ("llm_judge", {"criteria": "code quality"}, 0.4),
    ],
    aggregation="weighted_avg",
)
```

### Parameters

| Parameter | Default | Description |
|---|---|---|
| `evaluators` | - | List of `(type, params, weight)` tuples |
| `aggregation` | `"weighted_avg"` | How to combine scores |

### Aggregation Methods

- `weighted_avg`: Weighted average of scores (see the example below)
- `min`: Minimum score
- `max`: Maximum score
- `product`: Product of scores
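For the configuration above, `weighted_avg` combines the sub-scores by their weights; an illustrative calculation (assuming both sub-scores are already on a 0-1 scale):

```python
regex_score = 0.95  # regex_pattern sub-score, weight 0.6
judge_score = 0.80  # llm_judge sub-score, weight 0.4

combined = 0.6 * regex_score + 0.4 * judge_score
print(round(combined, 2))  # 0.89
```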
## EvaluationResult

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EvaluationResult:
    score: float                                 # 0.0 to 1.0 (or custom scale)
    feedback: str = ""                           # Human-readable feedback
    details: Dict = field(default_factory=dict)  # Additional data
    raw_output: str = ""                         # Original output
```
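Every evaluator returns one of these objects; `score` is the only required field, and the rest default to empty values:

```python
result = EvaluationResult(score=0.92, feedback="High validation accuracy")
print(result.score, result.details)  # 0.92 {}
```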
## Creating Custom Evaluators

```python
from src.environment.evaluators.base import Evaluator, EvaluationResult
from src.environment.evaluators.factory import register_evaluator


@register_evaluator("my_custom_evaluator")
class MyCustomEvaluator(Evaluator):
    description = "My custom evaluation logic"
    requires_llm = False

    def __init__(self, my_param: str = "default", **params):
        super().__init__(**params)
        self.my_param = my_param

    def evaluate(self, output: str, file_path: str, **context) -> EvaluationResult:
        # Custom scoring logic; compute_score stands in for your own function
        score = compute_score(output, self.my_param)
        return EvaluationResult(
            score=score,
            feedback=f"Evaluated with {self.my_param}",
            raw_output=output,
        )
```
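Once the module defining the class has been imported so that the decorator runs, the custom evaluator should be available through the factory like any built-in (a usage sketch; `my_param="strict"` is just an illustrative value):

```python
from src.environment.evaluators import EvaluatorFactory

evaluator = EvaluatorFactory.create("my_custom_evaluator", my_param="strict")
result = evaluator.evaluate(output="SCORE: 0.9", file_path="/path/to/code")
```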
## Configuration

```yaml
evaluator:
  type: "regex_pattern"
  params:
    pattern: 'SCORE:\s*([-+]?\d*\.?\d+)'
    default_score: 0.0
```
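The `params` block maps onto the evaluator's keyword arguments; a sketch of turning such a config into an evaluator (the loading code and the `config.yaml` filename are illustrative, not part of the documented API, and assume PyYAML is installed):

```python
import yaml

from src.environment.evaluators import EvaluatorFactory

with open("config.yaml") as f:  # hypothetical path to the config shown above
    cfg = yaml.safe_load(f)["evaluator"]

evaluator = EvaluatorFactory.create(cfg["type"], **cfg.get("params", {}))
```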
## Listing Evaluators

```bash
PYTHONPATH=. python -m src.cli --list-evaluators
```
## Best Practices

- `regex_pattern` is the fastest option and works well when the output format is predictable.
- `llm_judge` is flexible but adds latency and cost; use it for subjective criteria.
- Normalize scores to the 0-1 range for consistent stop-condition behavior (see the sketch below).
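For example, a percentage metric can be clamped into the 0-1 range before being returned from a custom evaluator (a hypothetical helper, shown only to illustrate the normalization advice):

```python
def normalize_percent(value: float) -> float:
    """Map a percentage such as 92.5 onto the 0-1 range, clamped at both ends."""
    return max(0.0, min(1.0, value / 100.0))

print(normalize_percent(92.5))  # 0.925
```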