MLE-Bench

MLE-Bench is a benchmark for evaluating ML engineering agents on Kaggle competitions. Kapso achieved #1 among open-source systems on this benchmark.

These results were officially submitted to MLE-Bench.

Usage

# List available competitions
PYTHONPATH=. python -m benchmarks.mle.runner --list

# List lite benchmark competitions
PYTHONPATH=. python -m benchmarks.mle.runner --lite

# Solve a competition
PYTHONPATH=. python -m benchmarks.mle.runner -c tabular-playground-series-dec-2021

# With options
PYTHONPATH=. python -m benchmarks.mle.runner \
    -c tabular-playground-series-dec-2021 \
    -i 20 \
    -m MLE_CONFIGS \
    -d aider
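
For scripted runs, the same command line can be assembled from Python. The following is a minimal sketch, not part of Kapso itself: the helper names `build_runner_command` and `run_competition` are hypothetical, and only the flags documented on this page are used.

```python
import os
import subprocess
import sys


def build_runner_command(competition, iterations=None, mode=None,
                         coding_agent=None, use_kg=True):
    """Assemble the argv list for `python -m benchmarks.mle.runner`.

    Hypothetical helper: it only mirrors the flags shown in the usage above.
    """
    cmd = [sys.executable, "-m", "benchmarks.mle.runner", "-c", competition]
    if iterations is not None:
        cmd += ["-i", str(iterations)]
    if mode is not None:
        cmd += ["-m", mode]
    if coding_agent is not None:
        cmd += ["-d", coding_agent]
    if not use_kg:
        cmd.append("--no-kg")
    return cmd


def run_competition(competition, **options):
    """Run one competition from the repository root and return the exit code."""
    # Equivalent to prefixing the shell command with PYTHONPATH=.
    env = dict(os.environ, PYTHONPATH=".")
    cmd = build_runner_command(competition, **options)
    return subprocess.run(cmd, env=env, check=False).returncode
```

Calling `run_competition("tabular-playground-series-dec-2021", iterations=20, mode="MLE_CONFIGS", coding_agent="aider")` would then correspond to the "With options" invocation.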

CLI Options

Option	Description	Default
`-c, --competition`	Competition ID	Required
`-i, --iterations`	Max experiment iterations	20
`-m, --mode`	Config mode	`MLE_CONFIGS`
`-d, --coding-agent`	Coding agent	From config
`--no-kg`	Disable knowledge graph	Knowledge graph enabled
`--list`	List all competitions	-
`--lite`	List lite competitions	-
`--list-agents`	List coding agents	-
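
To make the defaults in the table concrete, here is a sketch of an `argparse` parser that mirrors it. This illustrates the documented interface only; the actual runner may be implemented differently, and the rule that `-c` is needed only when no listing flag is given is an assumption.

```python
import argparse


def make_parser():
    """Build a parser mirroring the options table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="benchmarks.mle.runner")
    parser.add_argument("-c", "--competition", help="Competition ID")
    parser.add_argument("-i", "--iterations", type=int, default=20,
                        help="Max experiment iterations")
    parser.add_argument("-m", "--mode", default="MLE_CONFIGS",
                        help="Config mode")
    # None stands for "take the coding agent from the config".
    parser.add_argument("-d", "--coding-agent", default=None,
                        help="Coding agent")
    parser.add_argument("--no-kg", action="store_true",
                        help="Disable knowledge graph")
    parser.add_argument("--list", action="store_true",
                        help="List all competitions")
    parser.add_argument("--lite", action="store_true",
                        help="List lite competitions")
    parser.add_argument("--list-agents", action="store_true",
                        help="List coding agents")
    return parser


def parse_args(argv):
    """Parse argv; require -c unless a listing flag was passed (assumption)."""
    parser = make_parser()
    args = parser.parse_args(argv)
    listing = args.list or args.lite or args.list_agents
    if not listing and args.competition is None:
        parser.error("-c/--competition is required unless a listing flag is given")
    return args
```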

Configuration Modes

MLE-Bench uses the benchmark_tree_search strategy, which relies on the handler’s built-in evaluation via handler.run(). This differs from kapso.evolve(), which uses agent-built evaluation.

MLE_CONFIGS

Production configuration with full features.

search_strategy:
  type: "benchmark_tree_search"
  params:
    reasoning_effort: "high"
    code_debug_tries: 15
    node_expansion_limit: 2
coding_agent:
  type: "aider"
  model: "o3"
knowledge_search:
  enabled: true

MINIMAL

Fast configuration for testing.

search_strategy:
  type: "benchmark_tree_search"
  params:
    reasoning_effort: "medium"
    code_debug_tries: 2
coding_agent:
  type: "aider"
  model: "gpt-4.1-mini"
knowledge_search:
  enabled: false
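
To see at a glance what changes between the two modes, the configurations can be compared key by key. The sketch below transcribes the YAML shown above into Python dictionaries by hand (it does not load Kapso's real config files) and uses hypothetical helpers `flatten` and `diff_modes`.

```python
# Hand-transcribed copies of the two configurations shown above.
MLE_CONFIGS = {
    "search_strategy": {
        "type": "benchmark_tree_search",
        "params": {
            "reasoning_effort": "high",
            "code_debug_tries": 15,
            "node_expansion_limit": 2,
        },
    },
    "coding_agent": {"type": "aider", "model": "o3"},
    "knowledge_search": {"enabled": True},
}

MINIMAL = {
    "search_strategy": {
        "type": "benchmark_tree_search",
        "params": {"reasoning_effort": "medium", "code_debug_tries": 2},
    },
    "coding_agent": {"type": "aider", "model": "gpt-4.1-mini"},
    "knowledge_search": {"enabled": False},
}


def flatten(config, prefix=""):
    """Turn a nested dict into {'a.b.c': value} pairs."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_modes(left, right):
    """Return {path: (left_value, right_value)} for every differing key.

    A key missing on one side is reported with None for that side.
    """
    flat_left, flat_right = flatten(left), flatten(right)
    keys = sorted(set(flat_left) | set(flat_right))
    return {k: (flat_left.get(k), flat_right.get(k))
            for k in keys if flat_left.get(k) != flat_right.get(k)}
```

`diff_modes(MLE_CONFIGS, MINIMAL)` reports the model, reasoning effort, debug-try budget, knowledge search flag, and the `node_expansion_limit` key that only the production mode sets.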

Stages

The handler automatically adjusts strategy based on budget progress:

Stage	Budget	Behavior
MINI TRAINING	0-35%	Sample training data (for datasets >30GB)
FULL TRAINING	35-80%	Train on complete dataset
FINAL ENSEMBLING	80-100%	Ensemble best models from history
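
The stage boundaries in the table can be expressed as a small lookup. This is a sketch under assumptions: the function name `stage_for_progress` is hypothetical, progress is given as a fraction of the budget used, and each boundary value (35%, 80%) is assigned to the later stage, which the table does not specify.

```python
def stage_for_progress(progress):
    """Map budget progress in [0.0, 1.0] to the stage name from the table."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0.0 and 1.0")
    if progress < 0.35:
        return "MINI TRAINING"      # sample training data for very large datasets
    if progress < 0.80:
        return "FULL TRAINING"      # train on the complete dataset
    return "FINAL ENSEMBLING"       # ensemble the best models from history
```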

Output Structure

The agent generates:

experiment_workspace/{uuid}/
├── main.py                    # Entry point
├── output_data_{branch}/
│   ├── final_submission.csv   # Kaggle submission file
│   └── checkpoints/           # Model checkpoints
└── sessions/                  # Experiment branches
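
Given this layout, the submission files of a finished run can be collected with `pathlib`. The helper below is a hypothetical sketch that assumes every branch writes into its own `output_data_{branch}/` directory directly under the workspace, as the tree above suggests.

```python
from pathlib import Path


def find_submissions(workspace):
    """Return {branch_name: path_to_final_submission.csv} for one workspace.

    `workspace` is the experiment_workspace/{uuid} directory.
    """
    prefix = "output_data_"
    found = {}
    for csv_path in sorted(Path(workspace).glob(f"{prefix}*/final_submission.csv")):
        branch = csv_path.parent.name[len(prefix):]
        found[branch] = csv_path
    return found
```

Branches that have not yet produced a `final_submission.csv` are simply absent from the result.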

Code Requirements

Generated code must:

Support --debug flag for fast testing
Write final_submission.csv in the output directory
Print progress and metrics
Handle GPU efficiently (batch size, device selection)
Use early stopping and learning rate scheduling
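
As an illustration of these requirements, here is a stripped-down skeleton of what a generated `main.py` might look like. It is a sketch only: the `--output-dir` flag, the `id,target` columns, and the placeholder loss are assumptions, and GPU handling and learning rate scheduling are omitted because they depend on the ML framework in use.

```python
import argparse
import csv
from pathlib import Path


class EarlyStopping:
    """Stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def write_submission(output_dir, rows, header=("id", "target")):
    """Write final_submission.csv into the output directory."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "final_submission.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true",
                        help="Run a tiny configuration for fast testing")
    # Hypothetical flag; the real output location is chosen by the agent.
    parser.add_argument("--output-dir", default="output_data_main")
    args = parser.parse_args(argv)

    epochs = 2 if args.debug else 50
    stopper = EarlyStopping(patience=3)
    for epoch in range(epochs):
        val_loss = 1.0 / (epoch + 1)  # placeholder for a real validation step
        print(f"epoch={epoch} val_loss={val_loss:.4f}")  # progress and metrics
        if stopper.step(val_loss):
            break
    # Placeholder prediction row; a real script writes one row per test sample.
    return write_submission(args.output_dir, [(0, 0.5)])
```

Calling `main(["--debug"])` exercises the whole path in two epochs, which is the purpose of the `--debug` requirement.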

Competition Types

Type	Examples
Tabular	`tabular-playground-series-*`
Image	`dogs-vs-cats-*`, `plant-pathology-*`
Text	`spooky-author-identification`, `jigsaw-toxic-*`
Audio	`mlsp-2013-birds`

Environment Variables

Variable	Default	Description
`CUDA_DEVICE`	`0`	GPU device ID
`MLE_SEED`	`1`	Random seed
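
A script can read these variables with the documented defaults as shown below. This is a minimal sketch: `read_settings` and `seed_everything` are hypothetical helpers, and only Python's built-in random generator is seeded here; a real script would typically also seed its ML framework.

```python
import os
import random


def read_settings(environ=None):
    """Read CUDA_DEVICE and MLE_SEED, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return {
        "cuda_device": env.get("CUDA_DEVICE", "0"),  # GPU device ID, kept as a string
        "seed": int(env.get("MLE_SEED", "1")),       # random seed
    }


def seed_everything(seed):
    """Seed the standard-library RNG (framework seeding is left to the caller)."""
    random.seed(seed)
```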
