The generic runner lets you solve any problem without benchmark-specific setup. Configure evaluators and stop conditions to match your needs.

Usage

# Inline problem
PYTHONPATH=. python -m src.runner -p "Create a script that calculates prime numbers"

# From file
PYTHONPATH=. python -m src.runner -f problem.txt

# With options
PYTHONPATH=. python -m src.runner \
    -f problem.txt \
    -i 10 \
    --evaluator regex_pattern \
    --stop-condition threshold

CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| -p, --problem | Problem description (inline) | - |
| -f, --problem-file | File with problem description | - |
| -i, --iterations | Max iterations | 10 |
| -m, --mode | Config mode | GENERIC |
| -d, --coding-agent | Coding agent | From config |
| --main-file | Entry point file | main.py |
| --language | Programming language | python |
| --timeout | Execution timeout (seconds) | 300 |
| --evaluator | Evaluator type | no_score |
| --stop-condition | Stop condition type | never |
| --context | Additional context/tips | - |
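
For example, to run an inline problem with a custom entry point and a longer timeout (the values here are illustrative):

PYTHONPATH=. python -m src.runner \
    -p "Create a script that calculates prime numbers" \
    --main-file solve.py \
    --timeout 600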

Evaluators

no_score (default): performs no scoring and always returns 0. Useful for code generation without evaluation.
--evaluator no_score
List all evaluators:
PYTHONPATH=. python -m src.runner --list-evaluators
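
The regex_pattern evaluator used later on this page parses a score from the program's output. As a rough sketch only, not the runner's actual implementation, a regex-based score extractor could look like this:

import re

# Illustrative sketch of a regex-based score extractor; the real
# regex_pattern evaluator may use a different pattern or interface.
SCORE_RE = re.compile(r"SCORE:\s*(-?\d+(?:\.\d+)?)")

def extract_score(stdout: str) -> float:
    # Return the first "SCORE: <number>" found in stdout, or 0.0 if absent.
    match = SCORE_RE.search(stdout)
    return float(match.group(1)) if match else 0.0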

Stop Conditions

never (default): no early stopping; all configured iterations run to completion.
--stop-condition never
List all stop conditions:
PYTHONPATH=. python -m src.runner --list-stop-conditions
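
Conceptually, a stop condition inspects the run state after each iteration and decides whether to halt early. A minimal sketch of a threshold-style condition, with hypothetical names rather than the runner's actual API:

class ThresholdStopCondition:
    """Stop once the best score so far reaches a target threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def should_stop(self, best_score: float) -> bool:
        # Called after each iteration with the best score observed so far.
        return best_score >= self.threshold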

Configuration Modes

GENERIC

Standard configuration for most problems:
search_strategy:
  type: "linear_search"
  params:
    code_debug_tries: 5
coding_agent:
  type: "aider"
  model: "gpt-4.1"
knowledge_search:
  enabled: false
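
Here, code_debug_tries plausibly bounds how many consecutive fix attempts the agent gets when a run fails. The retry pattern itself looks roughly like this (illustrative only; the real runner asks the coding agent to repair the code between attempts):

import subprocess

def run_with_debug_budget(path: str, code_debug_tries: int = 5) -> bool:
    # Retry a failing script up to code_debug_tries times.
    for attempt in range(1, code_debug_tries + 1):
        proc = subprocess.run(["python", path], capture_output=True, text=True)
        if proc.returncode == 0:
            return True
        # The real runner would hand proc.stderr to the coding agent
        # for a fix here; this sketch only reports and retries.
        print(f"attempt {attempt} failed: {proc.stderr.strip()[:200]}")
    return False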
For complex problems that require exploration, switch to a tree-search strategy:
search_strategy:
  type: "llm_tree_search"
  params:
    node_expansion_limit: 2
    exploration_budget_percent: 40
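
Modes are selected on the command line with -m:

PYTHONPATH=. python -m src.runner -f problem.txt -m GENERIC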

Example: Scored Problem

Create problem.txt:
Create a Python script that finds the longest common subsequence
of two strings. Print the result as: SCORE: <length>

Test with:
- String A: "ABCDGH"
- String B: "AEDFHR"
Run with scoring:
PYTHONPATH=. python -m src.runner \
    -f problem.txt \
    --evaluator regex_pattern \
    --stop-condition threshold \
    -i 10
The agent will iterate until the score threshold is reached or the iteration budget is exhausted.
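
For the sample strings, the longest common subsequence is "ADH", so a correct solution prints SCORE: 3. One minimal reference solution (of many the agent could produce):

def lcs_length(a: str, b: str) -> int:
    # Classic dynamic programming: dp[i][j] is the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(f"SCORE: {lcs_length('ABCDGH', 'AEDFHR')}")  # prints SCORE: 3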

Output Structure

tmp/experiment_workspace/{uuid}/
├── main.py                  # Generated solution
├── output_data_{branch}/    # Any output files
└── sessions/                # Experiment branches