Overview

Evaluators determine how well a solution performs. They parse execution output and return a score.

Available Evaluators

| Evaluator | Description | Requires LLM |
| --- | --- | --- |
| no_score | Always returns 0 | No |
| regex_pattern | Extract score with regex | No |
| file_json | Read score from JSON file | No |
| multi_metric | Weighted combination of metrics | No |
| llm_judge | LLM-based evaluation | Yes |
| llm_comparison | Compare against expected output | Yes |
| composite | Combine multiple evaluators | Depends |

Usage

Via Kapso API

solution = kapso.evolve(
    goal="Build a classifier",
    evaluator="regex_pattern",
    evaluator_params={"pattern": r"Accuracy: ([\d.]+)"},
)

Direct Usage

from src.environment.evaluators import EvaluatorFactory

evaluator = EvaluatorFactory.create(
    "regex_pattern",
    pattern=r"Accuracy: ([\d.]+)",
)

result = evaluator.evaluate(
    output="Accuracy: 0.95",
    file_path="/path/to/code",
)
print(f"Score: {result.score}")

no_score

Always returns 0. Use when you only care about successful execution.
evaluator = EvaluatorFactory.create("no_score")

regex_pattern

Extract score from output using a regex pattern.
evaluator = EvaluatorFactory.create(
    "regex_pattern",
    pattern=r"Accuracy: ([\d.]+)",
    default_score=0.0,
)

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| pattern | r'SCORE:\s*([-+]?\d*\.?\d+)' | Regex with capture group |
| default_score | 0.0 | Score if pattern not found |

Example Patterns

# Accuracy percentage
pattern=r"Accuracy: ([\d.]+)%"

# Loss value
pattern=r"Loss: ([\d.]+)"

# Custom score format
pattern=r"SCORE:\s*([\d.]+)"

# F1 score
pattern=r"F1: ([\d.]+)"

file_json

Read score from a JSON file created by the code.
evaluator = EvaluatorFactory.create(
    "file_json",
    filename="results.json",
    score_key="evaluation.f1_score",
    default_score=0.0,
)

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| filename | "results.json" | JSON file to read |
| score_key | "score" | Key path (dot notation) |
| default_score | 0.0 | Score if file/key not found |

Example

Code writes:
{
  "evaluation": {
    "f1_score": 0.87,
    "accuracy": 0.92
  }
}
Config:
score_key="evaluation.f1_score"  # Returns 0.87

multi_metric

Weighted combination of multiple regex metrics.
evaluator = EvaluatorFactory.create(
    "multi_metric",
    patterns={
        "accuracy": (r"Accuracy: ([\d.]+)", 0.5),
        "f1": (r"F1: ([\d.]+)", 0.3),
        "speed": (r"Time: ([\d.]+)s", 0.2),
    },
    invert=["speed"],  # Lower time is better
)

Parameters

| Parameter | Description |
| --- | --- |
| patterns | Dict of {name: (pattern, weight)} |
| invert | List of metric names where lower is better |
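
Conceptually, each pattern's captured value is weighted and combined, with inverted metrics flipped so that lower raw numbers score higher. The sketch below shows one plausible reading of that; the inversion formula (1 / (1 + value)) is an assumption, not necessarily what the evaluator does internally:

import re

# Illustrative weighted combination; the real evaluator may normalize differently.
patterns = {
    "accuracy": (r"Accuracy: ([\d.]+)", 0.5),
    "f1": (r"F1: ([\d.]+)", 0.3),
    "speed": (r"Time: ([\d.]+)s", 0.2),
}
invert = ["speed"]
output = "Accuracy: 0.91\nF1: 0.88\nTime: 2.4s"

score = 0.0
for name, (pattern, weight) in patterns.items():
    match = re.search(pattern, output)
    value = float(match.group(1)) if match else 0.0
    if name in invert:
        value = 1.0 / (1.0 + value)  # assumed inversion: lower time -> higher score
    score += weight * value
print(round(score, 3))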

llm_judge

LLM-based evaluation with custom criteria.
evaluator = EvaluatorFactory.create(
    "llm_judge",
    criteria="correctness and efficiency",
    model="gpt-4.1-mini",
    scale=10,
    include_code=False,
)

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| criteria | "correctness and quality" | What to evaluate |
| model | "gpt-4.1-mini" | LLM model |
| scale | 10 | Score scale (normalized to 0-1) |
| include_code | False | Include source code in evaluation |

Prompt Template

You are an expert code evaluator.
Evaluate the code execution output based on these criteria: {criteria}

Provide:
1. A score from 0 to {scale}
2. Brief feedback explaining the score

Format:
SCORE: <number>
FEEDBACK: <explanation>
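
The reply is parsed back out of that SCORE/FEEDBACK format and the score is normalized by the scale to land in 0-1. A hedged sketch of such parsing (illustrative, not the library's actual parser):

import re

scale = 10
reply = "SCORE: 8\nFEEDBACK: Correct output, but the main loop re-reads the input file on every iteration."

# Pull the score and feedback out of the judge's reply.
score_match = re.search(r"SCORE:\s*([-+]?\d*\.?\d+)", reply)
feedback_match = re.search(r"FEEDBACK:\s*(.+)", reply, re.DOTALL)

score = float(score_match.group(1)) / scale if score_match else 0.0  # 0.8
feedback = feedback_match.group(1).strip() if feedback_match else ""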

llm_comparison

Use an LLM to compare the actual output against an expected output.
evaluator = EvaluatorFactory.create(
    "llm_comparison",
    expected="The answer is 42",
    model="gpt-4.1-mini",
    strict=False,  # Allow semantic equivalence
)

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| expected | "" | Expected output |
| model | "gpt-4.1-mini" | LLM model |
| strict | False | Require exact match |

composite

Combine multiple evaluators with weights.
evaluator = EvaluatorFactory.create(
    "composite",
    evaluators=[
        ("regex_pattern", {"pattern": r"Accuracy: ([\d.]+)"}, 0.6),
        ("llm_judge", {"criteria": "code quality"}, 0.4),
    ],
    aggregation="weighted_avg",
)

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| evaluators | - | List of (type, params, weight) |
| aggregation | "weighted_avg" | How to combine scores |

Aggregation Methods

  • weighted_avg: Weighted average of scores
  • min: Minimum score
  • max: Maximum score
  • product: Product of scores
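
For concreteness, a sketch of what each aggregation computes over the per-evaluator scores and weights (illustrative arithmetic, not the composite evaluator's internals):

import math

# Scores from the individual evaluators, paired with their weights.
scores = [0.9, 0.6]
weights = [0.6, 0.4]

weighted_avg = sum(s * w for s, w in zip(scores, weights)) / sum(weights)  # 0.78
minimum = min(scores)        # 0.6
maximum = max(scores)        # 0.9
product = math.prod(scores)  # 0.54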

EvaluationResult

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class EvaluationResult:
    score: float                                 # 0.0 to 1.0 (or custom scale)
    feedback: str = ""                           # Human-readable feedback
    details: Dict = field(default_factory=dict)  # Additional data
    raw_output: str = ""                         # Original output

Creating Custom Evaluators

from src.environment.evaluators.base import Evaluator, EvaluationResult
from src.environment.evaluators.factory import register_evaluator

@register_evaluator("my_custom_evaluator")
class MyCustomEvaluator(Evaluator):
    description = "My custom evaluation logic"
    requires_llm = False

    def __init__(self, my_param: str = "default", **params):
        super().__init__(**params)
        self.my_param = my_param

    def evaluate(self, output: str, file_path: str, **context) -> EvaluationResult:
        # Custom scoring logic; a simple keyword check stands in here for
        # whatever scoring your task actually needs.
        score = 1.0 if self.my_param in output else 0.0

        return EvaluationResult(
            score=score,
            feedback=f"Evaluated with {self.my_param}",
            raw_output=output,
        )
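
Once the module defining the class has been imported (so the @register_evaluator decorator runs), the custom evaluator can be created through the factory like any built-in one:

from src.environment.evaluators import EvaluatorFactory

evaluator = EvaluatorFactory.create("my_custom_evaluator", my_param="Accuracy")
result = evaluator.evaluate(output="Accuracy: 0.95", file_path="/path/to/code")
print(result.score, result.feedback)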

Configuration

evaluator:
  type: "regex_pattern"
  params:
    pattern: 'SCORE:\s*([-+]?\d*\.?\d+)'
    default_score: 0.0

Listing Evaluators

PYTHONPATH=. python -m src.cli --list-evaluators

Best Practices

  • regex_pattern is fastest and works well when the output format is predictable.
  • llm_judge is flexible but adds latency and cost; use it for subjective criteria.
  • Normalize scores to 0-1 for consistent stop-condition behavior (see the sketch below).
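
As an example of the last point: if a metric is reported on a 0-100 scale, rescale it before it reaches the stop condition so it is comparable with other evaluators (a sketch of the arithmetic only, not built-in behavior):

# A metric captured as a percentage, e.g. from "Accuracy: 92.5%"
raw = 92.5
score = raw / 100.0  # 0.925, now on the same 0-1 scale as other evaluators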