Autoscaling
Configure horizontal autoscaling for your services
Autoscaling automatically adjusts the number of service replicas based on demand, optimizing both performance and cost.
Basic Autoscaling
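A minimal setup only needs to enable the feature; every other field falls back to the defaults listed in the reference below. A sketch, assuming the block lives under a top-level `autoscaling:` key in your service config:

```yaml
autoscaling:
  enabled: true   # inherits defaults: 1-10 replicas, 70% CPU target
```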
Configuration Reference
Complete Autoscaling Block
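A sketch showing every supported field, assuming a top-level `autoscaling:` key; the values are illustrative, not defaults:

```yaml
autoscaling:
  enabled: true
  min_replicas: 2
  max_replicas: 20
  target_cpu_percent: 70
  target_memory_percent: 80        # optional; omit to scale on CPU alone
  scale_up_cooldown_seconds: 45
  scale_down_cooldown_seconds: 300
```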
Field Reference
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| enabled | boolean | false | Enable/disable autoscaling |
| min_replicas | number | 1 | Minimum replicas (0 for scale-to-zero) |
| max_replicas | number | 10 | Maximum replicas (hard ceiling) |
| target_cpu_percent | number | 70 | Target CPU utilization (%) |
| target_memory_percent | number | none | Optional memory threshold |
| scale_up_cooldown_seconds | number | 45 | Wait time before next scale-up |
| scale_down_cooldown_seconds | number | 300 | Wait time before scale-down |
Scaling Strategies
Strategy 1: Conservative (High Availability)
Keep ample headroom for traffic spikes:
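One way to express this strategy, with illustrative values (high floor, low CPU target for headroom, slow scale-down):

```yaml
autoscaling:
  enabled: true
  min_replicas: 5                    # high floor absorbs sudden spikes
  max_replicas: 50
  target_cpu_percent: 50             # scale early, keep ~50% headroom
  scale_down_cooldown_seconds: 600   # shed capacity slowly
```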
Use Case: Production APIs, critical services
Cost: Higher baseline ($$$)
Strategy 2: Balanced (Standard)
Balance cost and performance:
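An illustrative balanced configuration, keeping the default CPU target:

```yaml
autoscaling:
  enabled: true
  min_replicas: 2
  max_replicas: 20
  target_cpu_percent: 70   # the default target
```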
Use Case: Most web applications
Cost: Moderate ($$)
Strategy 3: Aggressive (Cost-Optimized)
Minimize replicas, scale reactively:
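An illustrative cost-optimized configuration (high CPU target, fast scale-down):

```yaml
autoscaling:
  enabled: true
  min_replicas: 1
  max_replicas: 10
  target_cpu_percent: 85             # run hot; scale only when necessary
  scale_down_cooldown_seconds: 180   # drop extra replicas quickly
```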
Use Case: Dev/staging, batch jobs
Cost: Low ($)
Strategy 4: Scale-to-Zero (Event-Driven)
No baseline cost when idle:
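An illustrative scale-to-zero configuration; per the field reference above, `min_replicas: 0` enables this mode:

```yaml
autoscaling:
  enabled: true
  min_replicas: 0   # 0 enables scale-to-zero
  max_replicas: 5
```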
Use Case: Batch processing, webhooks, scheduled jobs
Cost: Minimal (pay only when running)
Note: Cold starts add 5-15 seconds of latency.
CPU-Based Scaling
How It Works
- Monitor average CPU across all replicas
- If CPU > target_cpu_percent: scale up
- If CPU < target_cpu_percent: scale down
Scaling Formula
Example:
- Current: 5 replicas at 90% CPU
- Target: 70% CPU
- Desired: 5 × (90 / 70) = 6.4 → 7 replicas
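The arithmetic above follows the proportional formula used by most horizontal autoscalers: desired = ceil(current × current_CPU / target_CPU), clamped to the configured replica bounds. A Python sketch; the function name and the clamping step are illustrative, not this platform's exact implementation:

```python
import math

def desired_replicas(current: int, cpu_percent: float, target_percent: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Proportional scaling: move the replica count toward the CPU target,
    then clamp to the configured min/max bounds."""
    desired = math.ceil(current * (cpu_percent / target_percent))
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(5, 90, 70))   # 5 × (90/70) = 6.4 → 7
```

Note that rounding up means the autoscaler prefers slight over-provisioning to running hot.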
CPU Target Recommendations
| Target | Behavior | Use Case |
|--------|----------|----------|
| 50% | Aggressive headroom | Critical services |
| 70% | Balanced (default) | Standard applications |
| 85% | Minimal headroom | Cost-sensitive |
Warning: Targets >85% risk saturation during scale-up lag.
Memory-Based Scaling
Optional secondary metric:
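A sketch combining both metrics, with illustrative values:

```yaml
autoscaling:
  enabled: true
  target_cpu_percent: 70
  target_memory_percent: 80   # scale up if either metric exceeds its target
```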
Behavior: Scale if either CPU or memory exceeds target.
When to Use Memory Scaling
✅ Use memory scaling when:
- Application is memory-intensive (caching, data processing)
- Memory leaks are possible
- OOM kills would be catastrophic
❌ Skip memory scaling when:
- Application memory is stable
- CPU is the bottleneck
- Simplicity is preferred
Cooldown Periods
Scale-Up Cooldown
Wait time before another scale-up:
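Configured via `scale_up_cooldown_seconds`; fragment shown in isolation for brevity:

```yaml
autoscaling:
  scale_up_cooldown_seconds: 45   # default; wait 45s between scale-up events
```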
Purpose: Prevent thrashing during metric spikes.
Recommendations:
- Fast APIs: 30-45s
- Slow startup: 60-120s
- Cold start: 180s+
Scale-Down Cooldown
Wait time before scale-down:
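Configured via `scale_down_cooldown_seconds`; fragment shown in isolation for brevity:

```yaml
autoscaling:
  scale_down_cooldown_seconds: 300   # default; wait 5 minutes before shrinking
```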
Purpose: Prevent premature scale-down during temporary dips.
Recommendations:
- Stable traffic: 180-300s
- Variable traffic: 600s
- Expensive startup: 900s+
Custom Metrics
Scale based on application-specific metrics:
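A sketch using the fields documented below; the `custom_metrics:` list key and the example query are assumptions:

```yaml
autoscaling:
  enabled: true
  custom_metrics:
    - query: sum(rate(http_requests_total[1m]))   # Prometheus query
      target: 100                                 # ideal requests/sec
      scale_up_threshold: 150                     # optional
      scale_down_threshold: 50
```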
Custom Metric Fields
| Field | Type | Description |
|-------|------|-------------|
| target | number | Ideal metric value |
| scale_down_threshold | number | Scale down below this |
| scale_up_threshold | number | Scale up above this (optional) |
| query | string | Prometheus query |
Custom Metric Flow
Real-World Examples
Example 1: Production API
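A configuration consistent with the numbers below (5-replica baseline, 30-replica peak); the exact targets are assumptions:

```yaml
autoscaling:
  enabled: true
  min_replicas: 5
  max_replicas: 30
  target_cpu_percent: 60
  scale_down_cooldown_seconds: 600
```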
Result:
- Baseline: 5 replicas ($50/month)
- Peak: 30 replicas during business hours
- Cost: ~$180/month (vs. $500/month for 50 fixed replicas)
Example 2: Batch Worker
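A scale-to-zero sketch matching the numbers below; the `custom_metrics:` key and the `queue_depth` metric name are hypothetical:

```yaml
autoscaling:
  enabled: true
  min_replicas: 0            # no replicas while the queue is empty
  max_replicas: 20
  custom_metrics:
    - query: queue_depth     # hypothetical queue-length metric
      target: 100            # one replica per 100 queued jobs
```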
Result:
- Idle: 0 replicas ($0/month)
- Active: Scales to match queue (1 replica per 100 jobs)
- Cost: Pay only for processing time
Example 3: Database Proxy
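A sketch matching the numbers below (2-8 proxies at ~200 connections each); the metric name is hypothetical:

```yaml
autoscaling:
  enabled: true
  min_replicas: 2
  max_replicas: 8
  custom_metrics:
    - query: sum(db_client_connections)   # hypothetical connection-count metric
      target: 200                         # ~200 connections per proxy
```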
Result:
- Baseline: 2 proxies (400 connections)
- Peak: 8 proxies (1,600 connections)
- Cost: ~$30/month
Example 4: ML Inference
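A sketch matching the numbers below (one warm replica, 10 at peak); the values are assumptions:

```yaml
autoscaling:
  enabled: true
  min_replicas: 1                  # keep one warm replica to avoid cold starts
  max_replicas: 10
  scale_up_cooldown_seconds: 120   # GPU containers start slowly
```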
Result:
- Warm: 1 GPU ready ($2/hour)
- Peak: 10 GPUs during batch inference
- Cost: ~$1,500/month (vs $14,000/month for 20 fixed GPUs)
Troubleshooting
Issue: Service Not Scaling Up
Symptom: CPU at 100%, but replicas not increasing
Causes:
- Hit max_replicas
- Scale-up cooldown active
- Provider capacity exhausted:
  - Add more providers
  - Check region availability
Solution:
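If the ceiling is the blocker, raise `max_replicas`; fragment shown in isolation, value illustrative:

```yaml
autoscaling:
  max_replicas: 50   # size the ceiling from observed peak demand
```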
Issue: Service Scaling Too Aggressively
Symptom: Constantly scaling up/down (yo-yo effect)
Cause: Cooldowns too short
Solution:
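Lengthen both cooldowns to damp the oscillation; the values are illustrative:

```yaml
autoscaling:
  scale_up_cooldown_seconds: 90
  scale_down_cooldown_seconds: 600
```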
Issue: Slow Scale-Up
Symptom: Latency spikes before scaling kicks in
Solutions:
- Lower CPU target (target_cpu_percent)
- Increase min_replicas
- Reduce cooldown (scale_up_cooldown_seconds)
Issue: Scale-to-Zero Not Working
Symptom: Service stays at min_replicas despite no traffic
Cause: min_replicas > 0
Solution:
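Per the field reference above, scale-to-zero requires a zero floor:

```yaml
autoscaling:
  min_replicas: 0   # 0 is required for scale-to-zero
```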
Note: Requires container to start in <15s for best experience.
Best Practices
1. Start Conservative
Monitor for 1-2 weeks, then optimize.
2. Set Realistic Maximums
Don't set max_replicas arbitrarily high:
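For example (illustrative value):

```yaml
autoscaling:
  max_replicas: 20   # sized from observed peak, not an arbitrary ceiling
```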
3. Longer Scale-Down Cooldowns
Scale down slowly to avoid instability:
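For example (illustrative value):

```yaml
autoscaling:
  scale_down_cooldown_seconds: 600   # shed capacity gradually
```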
4. Monitor Autoscaler Behavior
5. Test Under Load
Use load testing to validate: