Blazing

Autoscaling


Configure horizontal autoscaling for your services

Autoscaling automatically adjusts the number of service replicas based on demand, optimizing both performance and cost.

Basic Autoscaling

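A minimal sketch, assuming the autoscaling block nests under a service definition (service name illustrative; field names are from the reference below):

```yaml
services:
  api:
    autoscaling:
      enabled: true
      min_replicas: 2
      max_replicas: 10
```

With only a replica range set, the CPU target and cooldowns fall back to their defaults (70%, 45s, 300s).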

Configuration Reference

Complete Autoscaling Block

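A sketch showing every field from the reference below at its default value (the memory target has no default, so an example value is shown):

```yaml
services:
  api:
    autoscaling:
      enabled: true
      min_replicas: 1
      max_replicas: 10
      target_cpu_percent: 70
      target_memory_percent: 80   # optional; no default
      scale_up_cooldown_seconds: 45
      scale_down_cooldown_seconds: 300
```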

Field Reference

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| enabled | boolean | false | Enable/disable autoscaling |
| min_replicas | number | 1 | Minimum replicas (0 for scale-to-zero) |
| max_replicas | number | 10 | Maximum replicas (hard ceiling) |
| target_cpu_percent | number | 70 | Target CPU utilization (%) |
| target_memory_percent | number | none | Optional memory threshold |
| scale_up_cooldown_seconds | number | 45 | Wait time before next scale-up |
| scale_down_cooldown_seconds | number | 300 | Wait time before scale-down |

Scaling Strategies

Strategy 1: Conservative (High Availability)

Keep ample headroom for traffic spikes:

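One plausible conservative configuration (values illustrative; the service wrapper is assumed):

```yaml
autoscaling:
  enabled: true
  min_replicas: 5               # ample warm baseline
  max_replicas: 50
  target_cpu_percent: 50        # scale early, keep headroom
  scale_up_cooldown_seconds: 30
  scale_down_cooldown_seconds: 600
```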

Use Case: Production APIs, critical services

Cost: Higher baseline ($$$)

Strategy 2: Balanced (Standard)

Balance cost and performance:

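A balanced configuration sticks close to the defaults (values illustrative):

```yaml
autoscaling:
  enabled: true
  min_replicas: 2
  max_replicas: 20
  target_cpu_percent: 70        # default target
  scale_up_cooldown_seconds: 45
  scale_down_cooldown_seconds: 300
```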

Use Case: Most web applications

Cost: Moderate ($$)

Strategy 3: Aggressive (Cost-Optimized)

Minimize replicas, scale reactively:

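An aggressive configuration runs replicas hot and keeps the floor small (values illustrative):

```yaml
autoscaling:
  enabled: true
  min_replicas: 1               # smallest non-zero baseline
  max_replicas: 10
  target_cpu_percent: 85        # minimal headroom
  scale_down_cooldown_seconds: 180
```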

Use Case: Dev/staging, batch jobs

Cost: Low ($)

Strategy 4: Scale-to-Zero (Event-Driven)

No baseline cost when idle:

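Scale-to-zero is the aggressive strategy with the floor set to zero (values illustrative):

```yaml
autoscaling:
  enabled: true
  min_replicas: 0               # no replicas when idle
  max_replicas: 10
  target_cpu_percent: 70
```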

Use Case: Batch processing, webhooks, scheduled jobs

Cost: Minimal (pay only when running)

Note: Cold starts add 5–15 seconds of latency to the first request after an idle period.

CPU-Based Scaling

How It Works

  1. Monitor average CPU utilization across all replicas
  2. If average CPU rises above target_cpu_percent: scale up (subject to the scale-up cooldown)
  3. If average CPU stays below target_cpu_percent: scale down (subject to the scale-down cooldown)

Scaling Formula

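Consistent with the worked example below, the autoscaler computes the desired replica count proportionally and rounds up:

```
desired_replicas = ceil(current_replicas × (current_cpu_percent / target_cpu_percent))
```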

Example:

  • Current: 5 replicas at 90% CPU
  • Target: 70% CPU
  • Desired: ceil(5 × (90 / 70)) = ceil(6.43) → 7 replicas

CPU Target Recommendations

| Target | Behavior | Use Case |
|--------|----------|----------|
| 50% | Aggressive headroom | Critical services |
| 70% | Balanced (default) | Standard applications |
| 85% | Minimal headroom | Cost-sensitive |

Warning: Targets >85% risk saturation during scale-up lag.

Memory-Based Scaling

Optional secondary metric:

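Adding a memory target alongside the CPU target (values illustrative):

```yaml
autoscaling:
  enabled: true
  target_cpu_percent: 70
  target_memory_percent: 80   # scale on whichever threshold breaches first
```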

Behavior: Scale if either CPU or memory exceeds target.

When to Use Memory Scaling

Use memory scaling when:

  • Application is memory-intensive (caching, data processing)
  • Memory leaks are possible
  • OOM kills would be catastrophic

Skip memory scaling when:

  • Application memory is stable
  • CPU is the bottleneck
  • Simplicity is preferred

Cooldown Periods

Scale-Up Cooldown

Wait time before another scale-up:

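Set via a single field (the default is 45s):

```yaml
autoscaling:
  scale_up_cooldown_seconds: 45   # raise for slow-starting services
```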

Purpose: Prevent thrashing during metric spikes.

Recommendations:

  • Fast APIs: 30-45s
  • Slow startup: 60-120s
  • Cold start: 180s+

Scale-Down Cooldown

Wait time before scale-down:

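Set via a single field (the default is 300s):

```yaml
autoscaling:
  scale_down_cooldown_seconds: 300   # raise for spiky or variable traffic
```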

Purpose: Prevent premature scale-down during temporary dips.

Recommendations:

  • Stable traffic: 180-300s
  • Variable traffic: 600s
  • Expensive startup: 900s+

Custom Metrics

Scale based on application-specific metrics:

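A sketch of the shape such a block might take; the custom_metrics key, metric name, and Prometheus series here are assumptions, but the per-metric fields match the reference below:

```yaml
autoscaling:
  enabled: true
  min_replicas: 1
  max_replicas: 20
  custom_metrics:                     # key name assumed
    - name: queue_depth               # illustrative metric
      query: sum(queue_messages_total)
      target: 100
      scale_up_threshold: 150
      scale_down_threshold: 50
```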

Custom Metric Fields

| Field | Type | Description |
|-------|------|-------------|
| target | number | Ideal metric value |
| scale_down_threshold | number | Scale down below this |
| scale_up_threshold | number | Scale up above this (optional) |
| query | string | Prometheus query |

Custom Metric Flow

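The decision loop, sketched in plain text (the polling interval is an assumption):

```
every interval (e.g. 15s):
  value = evaluate Prometheus query
  if value > scale_up_threshold:     add replicas (respect cooldown and max_replicas)
  else if value < scale_down_threshold: remove replicas (respect cooldown and min_replicas)
  else:                              hold current replica count
```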

Real-World Examples

Example 1: Production API

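A configuration consistent with the numbers below (baseline 5, peak 30); the remaining values are illustrative:

```yaml
services:
  api:
    autoscaling:
      enabled: true
      min_replicas: 5
      max_replicas: 30
      target_cpu_percent: 60
      scale_up_cooldown_seconds: 30
      scale_down_cooldown_seconds: 600
```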

Result:

  • Baseline: 5 replicas ($50/month)
  • Peak: 30 replicas during business hours
  • Cost: ~$180/month (vs $500/month fixed 50 replicas)

Example 2: Batch Worker

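A sketch matching the result below (scale-to-zero, roughly one replica per 100 queued jobs); the custom-metric keys and query are assumptions:

```yaml
services:
  worker:
    autoscaling:
      enabled: true
      min_replicas: 0                 # no baseline cost when idle
      max_replicas: 20
      custom_metrics:
        - name: pending_jobs          # illustrative
          query: sum(jobs_pending_total)
          target: 100                 # ~1 replica per 100 jobs
```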

Result:

  • Idle: 0 replicas ($0/month)
  • Active: Scales to match queue (1 replica per 100 jobs)
  • Cost: Pay only for processing time

Example 3: Database Proxy

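A sketch consistent with 200 connections per proxy (2 proxies = 400, 8 = 1,600); the metric name and query are assumptions:

```yaml
services:
  db-proxy:
    autoscaling:
      enabled: true
      min_replicas: 2
      max_replicas: 8
      custom_metrics:
        - name: active_connections    # illustrative
          query: sum(proxy_active_connections)
          target: 200                 # connections per replica
```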

Result:

  • Baseline: 2 proxies (400 connections)
  • Peak: 8 proxies (1,600 connections)
  • Cost: ~$30/month

Example 4: ML Inference

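A sketch matching the warm-GPU baseline below; cooldowns are stretched on the assumption that model loading dominates startup time:

```yaml
services:
  inference:
    autoscaling:
      enabled: true
      min_replicas: 1                  # keep one GPU warm
      max_replicas: 10
      target_cpu_percent: 60
      scale_up_cooldown_seconds: 120   # allow for model load
      scale_down_cooldown_seconds: 900
```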

Result:

  • Warm: 1 GPU ready ($2/hour)
  • Peak: 10 GPUs during batch inference
  • Cost: ~$1,500/month (vs $14,000/month for 20 fixed GPUs)

Troubleshooting

Issue: Service Not Scaling Up

Symptom: CPU at 100%, but replicas not increasing

Causes:

  1. Hit max_replicas:

  2. Scale-up cooldown active:

  3. Provider capacity exhausted:

    • Add more providers
    • Check region availability
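For causes 1 and 2 above, the fix is a config change (values illustrative):

```yaml
autoscaling:
  max_replicas: 20               # raise the ceiling if it was hit
  scale_up_cooldown_seconds: 30  # shorten if the cooldown was blocking
```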

Solution:

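Assuming the platform ships a CLI, the check might look like this (command and subcommand names are hypothetical):

```bash
# Hypothetical CLI invocations — substitute your platform's actual commands
blazing autoscaler status api    # current vs. desired replicas, active cooldowns
blazing autoscaler events api    # recent scaling decisions and blockers
```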

Issue: Service Scaling Too Aggressively

Symptom: Constantly scaling up/down (yo-yo effect)

Cause: Cooldowns too short

Solution:

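Lengthening both cooldowns damps the oscillation (values illustrative):

```yaml
autoscaling:
  scale_up_cooldown_seconds: 60     # up from the 45s default
  scale_down_cooldown_seconds: 600  # up from the 300s default
```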

Issue: Slow Scale-Up

Symptom: Latency spikes before scaling kicks in

Solutions:

  1. Lower CPU target:

  2. Increase min_replicas:

  3. Reduce cooldown:

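The three fixes combined, with illustrative values:

```yaml
autoscaling:
  target_cpu_percent: 50          # 1. scale earlier
  min_replicas: 3                 # 2. larger warm baseline
  scale_up_cooldown_seconds: 30   # 3. react sooner
```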

Issue: Scale-to-Zero Not Working

Symptom: Service stays at min_replicas despite no traffic

Cause: min_replicas > 0

Solution:

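Setting the floor to zero enables scale-to-zero:

```yaml
autoscaling:
  min_replicas: 0
```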

Note: Requires container to start in <15s for best experience.

Best Practices

1. Start Conservative

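A conservative starting point (values illustrative):

```yaml
autoscaling:
  enabled: true
  min_replicas: 3
  max_replicas: 15
  target_cpu_percent: 60
```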

Monitor for 1-2 weeks, then optimize.

2. Set Realistic Maximums

Don't set max_replicas arbitrarily high:

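Size the ceiling to roughly 2× your observed peak rather than an arbitrary large number (value illustrative):

```yaml
autoscaling:
  max_replicas: 20   # ~2× observed peak, not 1000
```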

3. Longer Scale-Down Cooldowns

Scale down slowly to avoid instability:

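An asymmetric pair — quick up, slow down (values illustrative):

```yaml
autoscaling:
  scale_up_cooldown_seconds: 45     # scale up quickly
  scale_down_cooldown_seconds: 600  # scale down slowly
```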

4. Monitor Autoscaler Behavior

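Two conditions worth alerting on; the alerting keys here are hypothetical and depend on your monitoring stack:

```yaml
# Hypothetical alerting config — adapt to your monitoring stack
alerts:
  - name: autoscaler-at-ceiling
    condition: replicas >= max_replicas for 10m
  - name: autoscaler-flapping
    condition: more than 6 scale events per hour
```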

5. Test Under Load

Use load testing to validate:

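One option is an open-source load generator such as hey; ramp traffic and watch replica counts climb and settle (URL illustrative):

```bash
# 50 workers × 4 req/s each ≈ 200 req/s sustained for 10 minutes
hey -z 10m -c 50 -q 4 https://api.example.com/health
```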

Next Steps