MLE-Bench provides Kaggle competition problems for evaluating ML agents. Tinkerer Agent generates Python solutions that produce submission files.

Usage

# List available competitions
PYTHONPATH=. python -m benchmarks.mle.runner --list

# List lite benchmark competitions
PYTHONPATH=. python -m benchmarks.mle.runner --lite

# Solve a competition
PYTHONPATH=. python -m benchmarks.mle.runner -c tabular-playground-series-dec-2021

# With options
PYTHONPATH=. python -m benchmarks.mle.runner \
    -c tabular-playground-series-dec-2021 \
    -i 20 \
    -m MLE_CONFIGS \
    -d aider

CLI Options

| Option | Description | Default |
| --- | --- | --- |
| -c, --competition | Competition ID | Required |
| -i, --iterations | Max experiment iterations | 20 |
| -m, --mode | Config mode | MLE_CONFIGS |
| -d, --coding-agent | Coding agent | From config |
| --no-kg | Disable knowledge graph | Enabled |
| --list | List all competitions | - |
| --lite | List lite competitions | - |
| --list-agents | List coding agents | - |

Configuration Modes

Production configuration with full features.
search_strategy:
  type: "llm_tree_search"
  params:
    reasoning_effort: "high"
    code_debug_tries: 15
    node_expansion_limit: 2
coding_agent:
  type: "aider"
  model: "o3"
knowledge_search:
  enabled: true

Stages

The handler automatically adjusts strategy based on budget progress:
| Stage | Budget | Behavior |
| --- | --- | --- |
| MINI TRAINING | 0-35% | Sample training data (for datasets >30GB) |
| FULL TRAINING | 35-80% | Train on complete dataset |
| FINAL ENSEMBLING | 80-100% | Ensemble best models from history |
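
The budget thresholds in the table can be read as a simple lookup. The sketch below is illustrative only: `stage_for_progress` is a hypothetical helper, not the handler's actual code, and it assumes progress is expressed as a fraction between 0 and 1 and that each boundary value (35%, 80%) belongs to the later stage, which the table does not specify.

```python
def stage_for_progress(progress: float) -> str:
    """Return the stage name for a budget progress fraction in [0, 1].

    Assumption: boundary values (0.35, 0.80) fall into the later stage.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    if progress < 0.35:
        return "MINI TRAINING"
    if progress < 0.80:
        return "FULL TRAINING"
    return "FINAL ENSEMBLING"


# Print the stage selected at a few sample points in the budget.
for p in (0.0, 0.34, 0.35, 0.79, 0.80, 1.0):
    print(f"{p:.0%} -> {stage_for_progress(p)}")
```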

Output Structure

The agent generates:
experiment_workspace/{uuid}/
├── main.py                    # Entry point
├── output_data_{branch}/
│   ├── final_submission.csv   # Kaggle submission file
│   └── checkpoints/           # Model checkpoints
└── sessions/                  # Experiment branches
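
To collect the submission files from a workspace laid out as above, something like the following could work. It is a sketch under assumptions: `find_submissions` is a hypothetical helper, and it assumes each branch writes into a directory named `output_data_` followed by the branch name, directly under the workspace root.

```python
import tempfile
from pathlib import Path


def find_submissions(workspace: Path) -> dict[str, Path]:
    """Map each branch name to its final_submission.csv, if the file exists."""
    prefix = "output_data_"
    found = {}
    for out_dir in sorted(workspace.glob(f"{prefix}*")):
        candidate = out_dir / "final_submission.csv"
        if out_dir.is_dir() and candidate.is_file():
            # Strip the directory prefix to recover the branch name.
            found[out_dir.name[len(prefix):]] = candidate
    return found


# Demo on a throwaway directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    branch_dir = workspace / "output_data_demo"
    branch_dir.mkdir()
    (branch_dir / "final_submission.csv").write_text("id,target\n0,0.5\n")
    for branch, path in find_submissions(workspace).items():
        print(branch, path.name)
```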

Code Requirements

Generated code must:
  • Support a --debug flag for fast testing
  • Write final_submission.csv to the output directory
  • Print progress and metrics
  • Use the GPU efficiently (batch size, device selection)
  • Use early stopping and learning rate scheduling
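
A minimal skeleton that meets the first three requirements, plus a toy form of early stopping, might look like the sketch below. It is not the code the agent generates: the `--output-dir` flag, the `id` and `target` column names, the epoch counts, and the simulated validation curve are all placeholders chosen for illustration. GPU handling and learning rate scheduling are omitted to keep the example free of third-party dependencies.

```python
import argparse
import csv
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical solution skeleton")
    parser.add_argument("--debug", action="store_true",
                        help="run a short smoke test instead of a full run")
    # --output-dir is an illustrative flag, not one the source defines.
    parser.add_argument("--output-dir", type=Path, default=Path("output_data_demo"))
    return parser.parse_args(argv)


def train(max_epochs, patience=2):
    """Toy loop: stop once the simulated loss fails to improve `patience` times."""
    best_loss, stale = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        # Placeholder curve that improves, then degrades after epoch 5.
        val_loss = 1.0 / epoch + 0.05 * max(0, epoch - 5)
        print(f"epoch {epoch}: val_loss={val_loss:.4f}")
        if val_loss < best_loss:
            best_loss, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                print(f"early stop at epoch {epoch}")
                break
    return best_loss


def write_submission(output_dir, rows):
    """Write final_submission.csv into output_dir and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "final_submission.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "target"])  # column names are placeholders
        writer.writerows(rows)
    return path


def main(argv=None):
    args = parse_args(argv)
    max_epochs = 2 if args.debug else 20  # debug mode keeps the run short
    best = train(max_epochs)
    path = write_submission(args.output_dir, [(0, 0.5), (1, 0.5)])
    print(f"best val_loss={best:.4f}; wrote {path}")
    return path


if __name__ == "__main__":
    main()
```

Saved as main.py, this could be smoke-tested with `python main.py --debug`, which runs two epochs and still writes the submission file.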

Competition Types

| Type | Examples |
| --- | --- |
| Tabular | tabular-playground-series-* |
| Image | dogs-vs-cats-*, plant-pathology-* |
| Text | spooky-author-identification, jigsaw-toxic-* |
| Audio | mlsp-2013-birds |

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| CUDA_DEVICE | 0 | GPU device ID |
| MLE_SEED | 1 | Random seed |
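
Generated code could pick these variables up as in the sketch below. `read_settings` is a hypothetical helper, and treating both values as integers is an assumption based on the defaults in the table. How the runner itself consumes the variables is not shown in this document.

```python
import os
import random


def read_settings(env=None):
    """Read CUDA_DEVICE and MLE_SEED, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "cuda_device": int(env.get("CUDA_DEVICE", "0")),
        "seed": int(env.get("MLE_SEED", "1")),
    }


# Pass an explicit mapping here so the demo does not depend on the real
# environment; call read_settings() with no argument to use os.environ.
settings = read_settings({"MLE_SEED": "42"})
random.seed(settings["seed"])  # seed the stdlib RNG for reproducibility
print(settings)
```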