The generic runner lets you solve any problem without benchmark-specific setup. Configure evaluators and stop conditions to match your needs.

Usage

# Inline problem
PYTHONPATH=. python -m src.runner -p "Create a script that calculates prime numbers"

# From file
PYTHONPATH=. python -m src.runner -f problem.txt

# With options
PYTHONPATH=. python -m src.runner \
    -f problem.txt \
    -i 10 \
    --evaluator regex_pattern \
    --stop-condition threshold

CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| -p, --problem | Problem description (inline) | - |
| -f, --problem-file | File with problem description | - |
| -i, --iterations | Max iterations | 10 |
| -m, --mode | Config mode | GENERIC |
| -d, --coding-agent | Coding agent | From config |
| --main-file | Entry point file | main.py |
| --language | Programming language | python |
| --timeout | Execution timeout (seconds) | 300 |
| --evaluator | Evaluator type | no_score |
| --stop-condition | Stop condition type | never |
| --context | Additional context/tips | - |
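
For example, to run an inline problem with a custom entry point and a longer timeout (the values here are illustrative):

PYTHONPATH=. python -m src.runner \
    -p "Create a script that calculates prime numbers" \
    --main-file solve.py \
    --timeout 600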

Evaluators

no_score (default): performs no scoring and always returns 0. Useful for code generation without evaluation.
--evaluator no_score
List all evaluators:
PYTHONPATH=. python -m src.runner --list-evaluators
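
The regex_pattern evaluator used later on this page parses a score from the program's output. As a rough sketch only, not the runner's actual implementation, a regex-based score extractor could look like this:

import re

# Illustrative sketch of a regex-based score extractor; the real
# regex_pattern evaluator may use a different pattern or interface.
SCORE_RE = re.compile(r"SCORE:\s*(-?\d+(?:\.\d+)?)")

def extract_score(stdout: str) -> float:
    # Return the first "SCORE: <number>" found in stdout, or 0.0 if absent.
    match = SCORE_RE.search(stdout)
    return float(match.group(1)) if match else 0.0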

Stop Conditions

never (default): no early stopping; all configured iterations run to completion.
--stop-condition never
List all stop conditions:
PYTHONPATH=. python -m src.runner --list-stop-conditions
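
Conceptually, a stop condition inspects the run state after each iteration and decides whether to halt early. A minimal sketch of a threshold-style condition, with hypothetical names rather than the runner's actual API:

class ThresholdStopCondition:
    """Stop once the best score so far reaches a target threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def should_stop(self, best_score: float) -> bool:
        # Called after each iteration with the best score observed so far.
        return best_score >= self.threshold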

Configuration Modes

GENERIC

Standard configuration for most problems:
search_strategy:
  type: "linear_search"
  params:
    code_debug_tries: 5
coding_agent:
  type: "aider"
  model: "gpt-4.1"
knowledge_search:
  enabled: false
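
Here, code_debug_tries plausibly bounds how many consecutive fix attempts the agent gets when a run fails. The retry pattern itself looks roughly like this (illustrative only; the real runner asks the coding agent to repair the code between attempts):

import subprocess

def run_with_debug_budget(path: str, code_debug_tries: int = 5) -> bool:
    # Retry a failing script up to code_debug_tries times.
    for attempt in range(1, code_debug_tries + 1):
        proc = subprocess.run(["python", path], capture_output=True, text=True)
        if proc.returncode == 0:
            return True
        # The real runner would hand proc.stderr to the coding agent
        # for a fix here; this sketch only reports and retries.
        print(f"attempt {attempt} failed: {proc.stderr.strip()[:200]}")
    return False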
For complex problems that require exploration, switch to a tree-search strategy:
search_strategy:
  type: "llm_tree_search"
  params:
    node_expansion_limit: 2
    exploration_budget_percent: 40
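
Modes are selected on the command line with -m:

PYTHONPATH=. python -m src.runner -f problem.txt -m GENERIC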

Example: Scored Problem

Create problem.txt:
Create a Python script that finds the longest common subsequence
of two strings. Print the result as: SCORE: <length>

Test with:
- String A: "ABCDGH"
- String B: "AEDFHR"
Run with scoring:
PYTHONPATH=. python -m src.runner \
    -f problem.txt \
    --evaluator regex_pattern \
    --stop-condition threshold \
    -i 10
The agent will iterate until the score threshold is reached or the iteration budget is exhausted.
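
For the sample strings, the longest common subsequence is "ADH", so a correct solution prints SCORE: 3. One minimal reference solution (of many the agent could produce):

def lcs_length(a: str, b: str) -> int:
    # Classic dynamic programming: dp[i][j] is the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(f"SCORE: {lcs_length('ABCDGH', 'AEDFHR')}")  # prints SCORE: 3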

Output Structure

tmp/experiment_workspace/{uuid}/
├── main.py                  # Generated solution
├── output_data_{branch}/    # Any output files
└── sessions/                # Experiment branches