## Strategy Overview

The deployment system supports multiple strategies, each optimized for different use cases.

| Strategy | Provider | Interface | GPU | Best For |
|---|---|---|---|---|
| LOCAL | None | Function | ❌ | Development, testing |
| DOCKER | Docker | HTTP | ❌ | Containerized APIs |
| MODAL | Modal.com | Function | ✅ | Serverless GPU |
| BENTOML | BentoCloud | HTTP | ⚙️ | Production ML serving |
| LANGGRAPH | LangGraph Platform | LangGraph | ❌ | Stateful agents |
## Lifecycle Management

All strategies support full lifecycle management with `start()` and `stop()` methods:
| Strategy | stop() Action | start() Action |
|---|---|---|
| LOCAL | Unload module from `sys.modules` | Reload module |
| DOCKER | Stop + remove container | Create + start new container |
| MODAL | Run `modal app stop` | Re-lookup Modal function |
| BENTOML | Run `bentoml deployment terminate` | Run `bentoml deployment apply` |
| LANGGRAPH | Delete thread + disconnect | Reconnect to platform |
## LOCAL

Run directly as a Python process on the local machine.

### When to Use

✅ **Best For:**

- Development and testing
- Simple scripts and utilities
- Quick prototyping
- No infrastructure needed
- CPU-only workloads

❌ **Not For:**

- Production deployments
- GPU workloads (use Modal)
- Scalable APIs (use Docker or BentoML)
- Stateful agents (use LangGraph)
### Configuration
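The configuration schema isn't shown in this document; as a minimal sketch, assuming a plain-dict config with hypothetical key names:

```python
# Hypothetical sketch -- the key names below are assumptions, not the
# system's documented configuration schema.
config = {
    "strategy": "LOCAL",
    "entrypoint": "main.py",  # must expose predict(); see How It Works
}
```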
### How It Works

1. Adapter ensures `main.py` has a `predict()` function
2. `LocalRunner` imports the module using `importlib`
3. Each `run()` call invokes `predict(inputs)` directly
### Usage
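A minimal sketch of invoking a LOCAL deployment by hand, mirroring what `LocalRunner` does internally (an `importlib` import followed by a direct `predict(inputs)` call); the input payload is illustrative:

```python
import importlib

# Import the generated entry point the same way LocalRunner does.
module = importlib.import_module("main")  # main.py must define predict()

# Each run() maps to a direct predict(inputs) call.
result = module.predict({"text": "hello"})  # input schema is an assumption
print(result)
```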
### Lifecycle
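Per the lifecycle table above, `stop()` unloads the module from `sys.modules` and `start()` reloads it; a sketch of that mechanism:

```python
import importlib
import sys

# stop(): drop the entry-point module so its state is discarded.
sys.modules.pop("main", None)

# start(): import the module fresh for the next run() call.
module = importlib.import_module("main")
```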
### Requirements

- Python 3.8+
- Solution dependencies installed (`pip install -r requirements.txt`)
### Generated Files

| File | Description |
|---|---|
| `main.py` | Entry point with `predict()` function |
## DOCKER

Run in an isolated Docker container with an HTTP API.

### When to Use

✅ **Best For:**

- Reproducible deployments
- Isolated environments
- HTTP-based APIs
- Local testing of production setup
- CPU-only workloads with network access

❌ **Not For:**

- Quick development iteration (use Local)
- GPU workloads (use Modal)
- Serverless auto-scaling (use Modal or BentoML)
- Stateful agents (use LangGraph)
### Configuration
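As with LOCAL, the exact schema isn't documented here; a hypothetical sketch, with only the default port grounded in the Requirements below:

```python
# Hypothetical sketch -- key names are assumptions; port 8000 is the
# default noted under Requirements (configurable).
config = {
    "strategy": "DOCKER",
    "port": 8000,  # host port for the container's HTTP API
}
```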
### How It Works

1. Adapter creates `Dockerfile` and `app.py` (FastAPI)
2. Agent builds and runs the Docker container
3. `DockerRunner` makes HTTP POST requests to the container
### Usage
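`DockerRunner` talks to the container over HTTP POST; a sketch of the same request by hand, assuming the generated FastAPI app exposes a `/predict` route on port 8000 (the route name and payload are assumptions):

```python
import requests

# POST inputs to the containerized FastAPI app, as DockerRunner does.
response = requests.post(
    "http://localhost:8000/predict",  # /predict route is an assumption
    json={"text": "hello"},
    timeout=30,
)
print(response.json())
```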
### Lifecycle

Lifecycle operations use the docker-py SDK for programmatic container management.
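A sketch of the stop/start actions from the lifecycle table using docker-py; the container and image names are illustrative:

```python
import docker

client = docker.from_env()

# stop(): stop and remove the running container.
container = client.containers.get("my-solution")  # name is illustrative
container.stop()
container.remove()

# start(): create and start a new container from the built image.
client.containers.run(
    "my-solution:latest",      # image tag is illustrative
    detach=True,
    name="my-solution",
    ports={"8000/tcp": 8000},  # expose the API port from Requirements
)
```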
### Requirements

- Docker installed and running
- `docker` Python package (`pip install docker`)
- Port 8000 available (configurable)
### Generated Files

| File | Description |
|---|---|
| `Dockerfile` | Container definition |
| `app.py` | FastAPI application |
| `main.py` | Business logic with `predict()` |
### Manual Deployment
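To build and run the generated files without the agent (the image tag is illustrative):

```bash
docker build -t my-solution .
docker run -d -p 8000:8000 my-solution
```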
## MODAL

Serverless GPU deployment on Modal.com with auto-scaling.

### When to Use

✅ **Best For:**

- GPU workloads (PyTorch, TensorFlow, CUDA)
- ML model inference at scale
- Serverless auto-scaling
- Pay-per-use pricing
- Fast cold starts for ML

❌ **Not For:**

- Simple local scripts (use Local)
- Persistent HTTP servers (use Docker)
- LangGraph/LangChain agents (use LangGraph)
- On-premise deployments (use Docker)
### Configuration
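A hypothetical sketch; the key names are assumptions, and the GPU and memory values come from the Resource Options table below:

```python
# Hypothetical sketch -- key names are assumptions; "T4" / "16Gi" match
# a row of the Resource Options table below.
config = {
    "strategy": "MODAL",
    "gpu": "T4",
    "memory": "16Gi",
}
```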
### Resource Options
| GPU | Memory Options | Use Case |
|---|---|---|
| T4 | 8Gi, 16Gi | Inference, light training |
| L4 | 16Gi, 32Gi | Medium models |
| A10G | 16Gi, 32Gi | Large models |
| A100 | 40Gi, 80Gi | Very large models |
| H100 | 80Gi | Maximum performance |
### How It Works

1. Adapter creates `modal_app.py` with Modal decorators
2. Agent runs `modal deploy modal_app.py`
3. `ModalRunner` calls `modal.Function.remote()` to invoke the deployed function
### Usage
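`ModalRunner` invokes the deployed function remotely; a sketch of the same call with the Modal client, where the app and function names are assumptions:

```python
import modal

# Look up the deployed function, then invoke it remotely -- the same
# modal.Function.remote() path that ModalRunner uses.
predict = modal.Function.from_name("my-solution", "predict")  # names assumed
result = predict.remote({"text": "hello"})  # input schema is an assumption
print(result)
```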
### Lifecycle

Note: After `stop()`, the Modal app is terminated and won't consume resources. `start()` reconnects to an existing deployment (you may need to re-deploy if the app was fully stopped).
### Requirements

- Install Modal: `pip install modal`
- Authenticate: `modal token new`
- Or set environment variables:
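Assuming Modal's standard token variables:

```bash
export MODAL_TOKEN_ID=...
export MODAL_TOKEN_SECRET=...
```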
### Generated Files

| File | Description |
|---|---|
| `modal_app.py` | Modal application with `@app.function` decorator |
| `main.py` | Business logic with `predict()` |
### Manual Deployment
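The same command the agent runs:

```bash
modal deploy modal_app.py
```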
### Example `modal_app.py`
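A sketch of what the generated file can look like, assuming the T4/16Gi row from Resource Options; the app name, image packages, and input schema are illustrative:

```python
import modal

app = modal.App("my-solution")  # app name is illustrative

# Image with the solution's dependencies (packages are examples).
image = modal.Image.debian_slim().pip_install("torch")

@app.function(image=image, gpu="T4", memory=16384)  # memory is in MiB
def predict(inputs: dict) -> dict:
    # Delegate to the business logic generated in main.py.
    from main import predict as run_predict
    return run_predict(inputs)
```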
## BENTOML

Production ML service deployment on BentoCloud with batching and monitoring.

### When to Use

✅ **Best For:**

- Production ML model serving
- Automatic request batching
- Production monitoring and observability
- Managed ML infrastructure
- Model versioning and A/B testing

❌ **Not For:**

- Quick development (use Local)
- Simple scripts (use Local)
- LangGraph agents (use LangGraph)
- GPU-heavy serverless (use Modal)
### Configuration
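A hypothetical sketch; the key names are assumptions, and the values match a row of the Resource Options table below:

```python
# Hypothetical sketch -- key names are assumptions; the values match the
# "Standard workloads" row of the Resource Options table below.
config = {
    "strategy": "BENTOML",
    "cpu": 2,
    "memory": "4Gi",
    "gpu": 0,
}
```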
### Resource Options
| CPU | Memory | GPU | Use Case |
|---|---|---|---|
| 1 | 2Gi | 0 | Light workloads |
| 2 | 4Gi | 0 | Standard workloads |
| 4 | 8Gi | 0 | Heavy CPU workloads |
| 2 | 8Gi | 1 | GPU inference |
### How It Works

1. Adapter creates `service.py` and `bentofile.yaml`
2. Agent runs `bentoml build` and `bentoml deploy`
3. `BentoMLRunner` makes HTTP requests to the BentoCloud endpoint
### Usage
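`BentoMLRunner` calls the deployed service over HTTP; a sketch of the same request, where the endpoint URL, route, and auth header format are all assumptions (BentoCloud shows the real URL after deployment):

```python
import os

import requests

endpoint = "https://my-solution.bentoml.ai"  # illustrative URL
response = requests.post(
    f"{endpoint}/predict",  # route name is an assumption
    json={"text": "hello"},
    # Auth header format is an assumption; the key comes from Requirements.
    headers={"Authorization": f"Bearer {os.environ['BENTO_CLOUD_API_KEY']}"},
    timeout=30,
)
print(response.json())
```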
### Lifecycle

Note: `stop()` actually terminates the BentoCloud deployment to avoid billing. `start()` re-deploys the service, which may take a minute.
### Requirements

- Install BentoML: `pip install bentoml`
- For BentoCloud (optional): `BENTO_CLOUD_API_KEY` environment variable for API access
### Generated Files

| File | Description |
|---|---|
| `service.py` | BentoML service class |
| `bentofile.yaml` | Build configuration |
| `main.py` | Business logic |
### Manual Deployment
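The same commands the agent runs:

```bash
bentoml build
bentoml deploy
```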
## LANGGRAPH

Deploy stateful AI agents to LangGraph Platform with memory and streaming.

### When to Use

✅ **Best For:**

- LangGraph/LangChain agents
- Stateful conversational AI
- Multi-step agentic workflows
- Conversation persistence (threads)
- Streaming responses
- Human-in-the-loop workflows

❌ **Not For:**

- Simple ML inference (use Modal or BentoML)
- Non-agent code (use Local or Docker)
- GPU-heavy workloads (use Modal)
- Batch processing (use BentoML)
### Configuration
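A sketch of a minimal `langgraph.json`, assuming the generated `agent.py` exports a compiled graph in a variable named `graph` (the variable name is an assumption):

```json
{
  "dependencies": ["."],
  "graphs": {
    "agent": "./agent.py:graph"
  },
  "env": ".env"
}
```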
### How It Works

1. Adapter creates `langgraph.json` and the agent structure
2. Agent runs `langgraph deploy`
3. `LangGraphRunner` uses the LangGraph SDK to invoke the deployed agent
### Usage
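`LangGraphRunner` goes through the LangGraph SDK; a sketch of the same flow, where the deployment URL is illustrative and `agent` is the graph name from `langgraph.json`:

```python
from langgraph_sdk import get_sync_client

# Connect to the deployed agent (URL is illustrative).
client = get_sync_client(url="https://my-agent.langgraph.app")

# Create a thread so conversation state persists across runs.
thread = client.threads.create()

# Invoke the "agent" graph and wait for the final output.
result = client.runs.wait(
    thread["thread_id"],
    "agent",  # graph name from langgraph.json
    input={"messages": [{"role": "user", "content": "hello"}]},
)
print(result)
```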
### Lifecycle

Note: LangGraph Platform manages the actual deployment. `stop()` only cleans up the local client and thread; the deployed agent remains available.
### Requirements

- Install LangGraph: `pip install langgraph langgraph-cli`
- Set API key:
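Assuming the standard LangSmith variable used by LangGraph Platform:

```bash
export LANGSMITH_API_KEY=...
```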
### Generated Files

| File | Description |
|---|---|
| `langgraph.json` | LangGraph configuration |
| `agent.py` | Agent graph definition |
| `main.py` | Entry point |
### Manual Deployment
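The same command the agent runs:

```bash
langgraph deploy
```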
## AUTO Strategy

Let the system analyze your code and choose the best strategy.

### Selection Criteria

The `SelectorAgent` considers:
| Factor | Impact |
|---|---|
| Dependencies | `torch`, `tensorflow` → GPU needed → Modal |
| Existing files | `Dockerfile` exists → Docker |
| Goal description | "stateful agent" → LangGraph |
| Code patterns | LangGraph imports → LangGraph |
| Complexity | Simple script → Local |
### Restricting Choices
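The API for constraining the selector isn't shown in this document; a hypothetical sketch of the idea, where `allowed_strategies` is an assumed parameter name: AUTO still analyzes the code, but only picks from the allow-list.

```python
# Hypothetical sketch -- "allowed_strategies" is an assumed name, not a
# documented API. The SelectorAgent would choose only from this list.
config = {
    "strategy": "AUTO",
    "allowed_strategies": ["DOCKER", "MODAL"],  # e.g., exclude LOCAL
}
```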
## Strategy Comparison

### Performance
| Strategy | Cold Start | Scalability | Cost Model |
|---|---|---|---|
| Local | None | Single process | Free |
| Docker | Seconds | Manual | Self-hosted |
| Modal | ~1s (optimized) | Auto-scaling | Pay-per-use |
| BentoML | Seconds | Auto-scaling | Pay-per-use |
| LangGraph | Seconds | Auto-scaling | Pay-per-use |
### Features
| Feature | Local | Docker | Modal | BentoML | LangGraph |
|---|---|---|---|---|---|
| GPU Support | ❌ | ❌ | ✅ | ⚙️ | ❌ |
| Auto-scaling | ❌ | ❌ | ✅ | ✅ | ✅ |
| Request Batching | ❌ | ❌ | ⚙️ | ✅ | ❌ |
| State Persistence | ❌ | ❌ | ❌ | ❌ | ✅ |
| Streaming | ❌ | ⚙️ | ⚙️ | ⚙️ | ✅ |
| Monitoring | ❌ | ❌ | ✅ | ✅ | ✅ |