Strategy Overview

The deployment system supports multiple strategies, each optimized for different use cases.
| Strategy | Provider | Interface | GPU | Best For |
|---|---|---|---|---|
| LOCAL | None | Function | ❌ | Development, testing |
| DOCKER | Docker | HTTP | ❌ | Containerized APIs |
| MODAL | Modal.com | Function | ✅ | Serverless GPU |
| BENTOML | BentoCloud | HTTP | ⚙️ | Production ML serving |
| LANGGRAPH | LangGraph Platform | LangGraph | ❌ | Stateful agents |

Lifecycle Management

All strategies support full lifecycle management with start() and stop() methods:
software = tinkerer.deploy(solution, strategy=DeployStrategy.DOCKER)

# Run
result = software.run(inputs)

# Stop (cleanup resources)
software.stop()

# Restart
software.start()

# Run again
result = software.run(inputs)
Each strategy handles lifecycle differently:
| Strategy | stop() Action | start() Action |
|---|---|---|
| LOCAL | Unload module from sys.modules | Reload module |
| DOCKER | Stop + remove container | Create + start new container |
| MODAL | Run modal app stop | Re-lookup Modal function |
| BENTOML | Run bentoml deployment terminate | Run bentoml deployment apply |
| LANGGRAPH | Delete thread + disconnect | Reconnect to platform |

LOCAL

Run directly as a Python process on the local machine.

When to Use

Best For:
  • Development and testing
  • Simple scripts and utilities
  • Quick prototyping
  • No infrastructure needed
  • CPU-only workloads
Not For:
  • Production deployments
  • GPU workloads (use Modal)
  • Scalable APIs (use Docker or BentoML)
  • Stateful agents (use LangGraph)

Configuration

# config.yaml
name: local
provider: null
interface: function
runner_class: LocalRunner

default_resources: {}

run_interface:
  type: function
  module: main
  callable: predict

How It Works

  1. Adapter ensures main.py has a predict() function
  2. LocalRunner imports the module using importlib
  3. Each run() call invokes predict(inputs) directly
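
For orientation, here is a minimal sketch of the kind of main.py the adapter expects; the function body is purely illustrative.

# main.py (illustrative sketch of the expected predict() contract)
def predict(inputs: dict) -> dict:
    # Real business logic would go here; echo the input as a placeholder
    text = inputs.get("text", "")
    return {"result": f"processed: {text}"}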

Usage

from src.deployment import DeploymentFactory, DeployStrategy, DeployConfig

software = DeploymentFactory.create(DeployStrategy.LOCAL, config)

# Direct function call
result = software.run({"text": "hello"})
print(result)
# {"status": "success", "output": {"result": "..."}}

Lifecycle

# Run
result = software.run(inputs)

# Stop unloads module
software.stop()

# Start reloads module (picks up code changes!)
software.start()
result = software.run(inputs)

Requirements

  • Python 3.8+
  • Solution dependencies installed (pip install -r requirements.txt)

Generated Files

| File | Description |
|---|---|
| main.py | Entry point with predict() function |

DOCKER

Run in an isolated Docker container with HTTP API.

When to Use

Best For:
  • Reproducible deployments
  • Isolated environments
  • HTTP-based APIs
  • Local testing of production setup
  • CPU-only workloads with network access
Not For:
  • Quick development iteration (use Local)
  • GPU workloads (use Modal)
  • Serverless auto-scaling (use Modal or BentoML)
  • Stateful agents (use LangGraph)

Configuration

# config.yaml
name: docker
provider: null
interface: http
runner_class: DockerRunner

default_resources: {}

run_interface:
  type: http
  endpoint: http://localhost:8000
  predict_path: /predict

How It Works

  1. Adapter creates Dockerfile and app.py (FastAPI)
  2. Agent builds and runs the Docker container
  3. DockerRunner makes HTTP POST requests to the container
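
As a rough sketch (the generated code may differ), the app.py wrapper could look like the following, assuming the response envelope shown in the Usage section below:

# app.py (illustrative FastAPI wrapper around main.predict)
from fastapi import FastAPI
from main import predict

app = FastAPI()

@app.post("/predict")
def predict_endpoint(inputs: dict) -> dict:
    # Delegate to the solution's predict() and wrap it in the standard envelope
    return {"status": "success", "output": predict(inputs)}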

Usage

software = DeploymentFactory.create(DeployStrategy.DOCKER, config)

# HTTP request under the hood
result = software.run({"text": "hello"})
print(result)
# {"status": "success", "output": {"result": "..."}}

# Get endpoint for manual testing
print(software.get_endpoint())
# http://localhost:8000

Lifecycle

# Run
result = software.run(inputs)

# Stop removes the container
software.stop()
# Logs: "Stopping container...", "Container removed"

# Start creates a new container
software.start()
# Logs: "Creating new container...", "Container started"

result = software.run(inputs)
The Docker runner uses the docker-py SDK for programmatic container management.
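
For illustration only, here is the kind of docker-py call sequence such a runner might make; the image name and port mapping are placeholders.

import docker

client = docker.from_env()

# start(): create and start a container, mapping the API port
container = client.containers.run(
    "my-solution",              # placeholder image name
    detach=True,
    ports={"8000/tcp": 8000},
)

# stop(): stop and remove the container
container.stop()
container.remove()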

Requirements

  • Docker installed and running
  • docker Python package (pip install docker)
  • Port 8000 available (configurable)

Generated Files

| File | Description |
|---|---|
| Dockerfile | Container definition |
| app.py | FastAPI application |
| main.py | Business logic with predict() |

Manual Deployment

# Build the image
docker build -t my-solution .

# Run the container
docker run -p 8000:8000 my-solution

# Test
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "hello"}'

MODAL

Serverless GPU deployment on Modal.com with auto-scaling.

When to Use

Best For:
  • GPU workloads (PyTorch, TensorFlow, CUDA)
  • ML model inference at scale
  • Serverless auto-scaling
  • Pay-per-use pricing
  • Fast cold starts for ML
Not For:
  • Simple local scripts (use Local)
  • Persistent HTTP servers (use Docker)
  • LangGraph/LangChain agents (use LangGraph)
  • On-premise deployment (use Docker)

Configuration

# config.yaml
name: modal
provider: modal
interface: function
runner_class: ModalRunner

default_resources:
  gpu: T4
  memory: 16Gi

run_interface:
  type: modal
  function_name: predict

Resource Options

| GPU | Memory Options | Use Case |
|---|---|---|
| T4 | 8Gi, 16Gi | Inference, light training |
| L4 | 16Gi, 32Gi | Medium models |
| A10G | 16Gi, 32Gi | Large models |
| A100 | 40Gi, 80Gi | Very large models |
| H100 | 80Gi | Maximum performance |

How It Works

  1. Adapter creates modal_app.py with Modal decorators
  2. Agent runs modal deploy modal_app.py
  3. ModalRunner calls modal.Function.remote() to invoke
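
A minimal sketch of this invocation path, using the modal.Function.lookup call referenced in the Lifecycle section; the app and function names are placeholders.

import modal

# Look up the deployed function, then execute it remotely on Modal
fn = modal.Function.lookup("my-solution", "predict")
result = fn.remote({"text": "hello"})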

Usage

software = DeploymentFactory.create(DeployStrategy.MODAL, config)

# Remote execution on GPU
result = software.run({"text": "Generate embeddings for this"})
print(result)
# {"status": "success", "output": {"embeddings": [...]}}

Lifecycle

# Run
result = software.run(inputs)

# Stop terminates the Modal app
software.stop()
# Runs: modal app stop {app_name}

# Start re-looks up the function
software.start()
# Calls: modal.Function.lookup(app_name, function_name)

result = software.run(inputs)
Note: After stop(), the Modal app is terminated and won’t consume resources. start() reconnects to an existing deployment (you may need to re-deploy if the app was fully stopped).

Requirements

  1. Install Modal: pip install modal
  2. Authenticate: modal token new
  3. Or set environment variables:
    export MODAL_TOKEN_ID=your_id
    export MODAL_TOKEN_SECRET=your_secret
    

Generated Files

| File | Description |
|---|---|
| modal_app.py | Modal application with @app.function decorator |
| main.py | Business logic with predict() |

Manual Deployment

# Deploy to Modal
modal deploy modal_app.py

# The deployment URL is printed after deploy
# Example: https://your-username--app-name.modal.run

Example modal_app.py

import modal

app = modal.App("my-solution")

# Container image with the solution's dependencies installed
image = modal.Image.debian_slim().pip_install_from_requirements("requirements.txt")

@app.function(image=image, gpu="T4", memory=16384)
def predict(inputs: dict) -> dict:
    # Delegate to the solution's business logic in main.py
    from main import predict as _predict
    return _predict(inputs)

BENTOML

Production ML service deployment on BentoCloud with batching and monitoring.

When to Use

Best For:
  • Production ML model serving
  • Need automatic request batching
  • Production monitoring and observability
  • Managed ML infrastructure
  • Model versioning and A/B testing
Not For:
  • Quick development (use Local)
  • Simple scripts (use Local)
  • LangGraph agents (use LangGraph)
  • GPU-heavy serverless (use Modal)

Configuration

# config.yaml
name: bentoml
provider: bentocloud
interface: http
runner_class: BentoMLRunner

default_resources:
  cpu: 2
  memory: 4Gi

run_interface:
  type: bentocloud
  predict_path: /predict

Resource Options

| CPU | Memory | GPU | Use Case |
|---|---|---|---|
| 1 | 2Gi | 0 | Light workloads |
| 2 | 4Gi | 0 | Standard workloads |
| 4 | 8Gi | 0 | Heavy CPU workloads |
| 2 | 8Gi | 1 | GPU inference |

How It Works

  1. Adapter creates service.py and bentofile.yaml
  2. Agent runs bentoml build and bentoml deploy
  3. BentoMLRunner makes HTTP requests to BentoCloud endpoint
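
A minimal sketch of what the generated service.py might look like, assuming the BentoML 1.2+ service API; the class name and resource values are illustrative.

import bentoml

from main import predict as _predict

@bentoml.service(resources={"cpu": "2", "memory": "4Gi"})
class SolutionService:
    @bentoml.api
    def predict(self, inputs: dict) -> dict:
        # Delegate to the business logic in main.py
        return _predict(inputs)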

Usage

software = DeploymentFactory.create(DeployStrategy.BENTOML, config)

# HTTP to BentoCloud
result = software.run({"question": "What is ML?", "context": "..."})
print(result)
# {"status": "success", "output": {"answer": "..."}}

# Get endpoint
print(software.get_endpoint())
# https://your-bento.bentoml.ai

Lifecycle

# Run
result = software.run(inputs)

# Stop terminates the deployment
software.stop()
# Runs: bentoml deployment terminate {deployment_name}

# Start re-deploys by running deploy.py
software.start()
# Runs: python {code_path}/deploy.py
# Then connects to the new endpoint

result = software.run(inputs)
Note: stop() actually terminates the BentoCloud deployment to avoid billing. start() re-deploys the service, which may take a minute.

Requirements

  1. Install BentoML: pip install bentoml
  2. For BentoCloud (optional):
    bentoml cloud login
    
  3. BENTO_CLOUD_API_KEY environment variable for API access

Generated Files

| File | Description |
|---|---|
| service.py | BentoML service class |
| bentofile.yaml | Build configuration |
| main.py | Business logic |

Manual Deployment

# Build the Bento
bentoml build

# Serve locally
bentoml serve service:MyService

# Deploy to BentoCloud
bentoml deploy .

LANGGRAPH

Deploy stateful AI agents to LangGraph Platform with memory and streaming.

When to Use

Best For:
  • LangGraph/LangChain agents
  • Stateful conversational AI
  • Multi-step agentic workflows
  • Conversation persistence (threads)
  • Streaming responses
  • Human-in-the-loop workflows
Not For:
  • Simple ML inference (use Modal or BentoML)
  • Non-agent code (use Local or Docker)
  • GPU-heavy workloads (use Modal)
  • Batch processing (use BentoML)

Configuration

# config.yaml
name: langgraph
provider: langgraph
interface: langgraph
runner_class: LangGraphRunner

default_resources: {}

run_interface:
  type: langgraph
  assistant_id: agent

How It Works

  1. Adapter creates langgraph.json and agent structure
  2. Agent runs langgraph deploy
  3. LangGraphRunner uses LangGraph SDK to invoke
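
Roughly, the SDK invocation might look like the following sketch; the deployment URL is a placeholder.

from langgraph_sdk import get_sync_client

client = get_sync_client(url="https://your-deployment.langgraph.app", api_key="...")

# Create a conversation thread, then run the assistant and wait for the result
thread = client.threads.create()
result = client.runs.wait(
    thread["thread_id"],
    "agent",  # assistant_id from run_interface
    input={"messages": [{"role": "user", "content": "Hello!"}]},
)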

Usage

software = DeploymentFactory.create(DeployStrategy.LANGGRAPH, config)

# Invoke the agent
result = software.run({
    "messages": [{"role": "user", "content": "Hello!"}]
})
print(result)
# {"status": "success", "output": {"messages": [...]}}

# With thread persistence
result = software.run({
    "messages": [{"role": "user", "content": "Follow up question"}],
    "thread_id": "conversation-123"
})

Lifecycle

# Run (creates a conversation thread)
result = software.run({"messages": [...]})

# Stop deletes the thread and disconnects
software.stop()
# Thread is deleted, client is cleared

# Start reconnects to LangGraph Platform
software.start()
# New conversation thread is created on next run()

result = software.run({"messages": [...]})
Note: LangGraph Platform manages the actual deployment. stop() only cleans up the local client and thread. The deployed agent remains available.

Requirements

  1. Install LangGraph: pip install langgraph langgraph-cli
  2. Set API key:
    export LANGSMITH_API_KEY=your_key
    

Generated Files

| File | Description |
|---|---|
| langgraph.json | LangGraph configuration |
| agent.py | Agent graph definition |
| main.py | Entry point |

Manual Deployment

# Deploy to LangGraph Platform
langgraph deploy

# Or test locally first
langgraph dev

AUTO Strategy

Let the system analyze your code and choose the best strategy.
software = DeploymentFactory.create(DeployStrategy.AUTO, config)
print(f"Auto-selected: {software.name}")

Selection Criteria

The SelectorAgent considers:
| Factor | Impact |
|---|---|
| Dependencies | torch, tensorflow → GPU needed → Modal |
| Existing files | Dockerfile exists → Docker |
| Goal description | "stateful agent" → LangGraph |
| Code patterns | LangGraph imports → LangGraph |
| Complexity | Simple script → Local |

Restricting Choices

# Only allow specific strategies
software = DeploymentFactory.create(
    DeployStrategy.AUTO, 
    config,
    strategies=["local", "docker"]  # Won't choose Modal, BentoML, etc.
)

Strategy Comparison

Performance

| Strategy | Cold Start | Scalability | Cost Model |
|---|---|---|---|
| Local | None | Single process | Free |
| Docker | Seconds | Manual | Self-hosted |
| Modal | ~1s (optimized) | Auto-scaling | Pay-per-use |
| BentoML | Seconds | Auto-scaling | Pay-per-use |
| LangGraph | Seconds | Auto-scaling | Pay-per-use |

Features

| Feature | Local | Docker | Modal | BentoML | LangGraph |
|---|---|---|---|---|---|
| GPU Support | ❌ | ❌ | ✅ | ⚙️ | ❌ |
| Auto-scaling | ❌ | ❌ | ✅ | ✅ | ✅ |
| Request Batching | ❌ | ❌ | ⚙️ | ✅ | ❌ |
| State Persistence | ❌ | ❌ | ❌ | ❌ | ✅ |
| Streaming | ❌ | ⚙️ | ⚙️ | ⚙️ | ✅ |
| Monitoring | ❌ | ❌ | ✅ | ✅ | ✅ |

Next Steps