bl-011 · Q3 2026 · Est. 9/30/2026 · In Progress

Blazing Inference Service

Run AI models at scale across the provider network — real-time inference for latency-sensitive APIs and batch inference for high-throughput pipelines. GPU scheduling, model versioning, and auto-scaling handled by the platform.

Tags: compute, ai

Blazing Inference Service makes model serving a first-class primitive on the platform, covering both latency-sensitive and throughput-optimised workloads.

**Real-time inference** serves requests with low latency over HTTP/gRPC. Models are loaded into GPU memory and kept warm across requests. The platform handles horizontal scaling based on request queue depth, GPU utilisation, and SLA targets. Routing is provider-aware — requests can be pinned to specific regions, provider tiers, or hardware types (e.g., H100 only).
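As a sketch of what provider-aware routing could look like from the caller's side, the snippet below assembles a request payload with optional region, provider-tier, and hardware constraints. The field names, model identifiers, and schema are all hypothetical; the source does not publish a request format.

```python
import json

def build_inference_request(model, prompt, *, region=None, tier=None, hardware=None):
    """Assemble a real-time inference request with optional provider-aware
    routing constraints. Field names here are illustrative, not a published
    schema for the Blazing platform."""
    request = {"model": model, "input": prompt}
    # Include the routing block only when at least one constraint is set,
    # so unconstrained requests stay eligible for any provider.
    constraints = {
        k: v
        for k, v in {"region": region, "provider_tier": tier, "hardware": hardware}.items()
        if v is not None
    }
    if constraints:
        request["routing"] = constraints
    return json.dumps(request)

# Pin a request to H100-class hardware in a single region (hypothetical values).
payload = build_inference_request(
    "example-llm", "Hello", region="us-east", hardware="h100"
)
```

Keeping constraints in a separate routing block, rather than baked into the model name, lets the same model deployment serve both pinned and unpinned traffic.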

**Batch inference** processes large datasets asynchronously. Jobs are submitted as tasks, queued, and dispatched across available GPU capacity on the provider network. Results are written to Blazing Storage or a caller-specified destination. Batch jobs are interruptible and resumable — they tolerate provider churn and spot-style preemption without data loss.
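The interruptible-and-resumable property above comes down to checkpointing progress as results are produced. A minimal sketch, assuming a simple cursor-plus-results checkpoint file (the platform's actual checkpoint format and granularity are not specified in the source):

```python
import json
import os
import tempfile

def run_batch(items, infer, checkpoint_path):
    """Process items in order, persisting a cursor after each result so a
    preempted job can resume without losing or recomputing completed work.
    Checkpoint shape is illustrative."""
    start, results = 0, []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        start, results = state["cursor"], state["results"]
    for i in range(start, len(items)):
        results.append(infer(items[i]))
        with open(checkpoint_path, "w") as f:
            json.dump({"cursor": i + 1, "results": results}, f)
    return results

# Demo: the first attempt is "preempted" after two items; a second attempt
# resumes from the checkpoint and only processes the remaining items.
ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
attempt = {"calls": 0}

def flaky_double(x):
    attempt["calls"] += 1
    if attempt["calls"] == 3:
        raise RuntimeError("preempted")  # simulated provider churn
    return x * 2

try:
    run_batch([1, 2, 3, 4], flaky_double, ckpt)
except RuntimeError:
    pass  # completed work survived on disk

results = run_batch([1, 2, 3, 4], flaky_double, ckpt)
```

In practice a platform would checkpoint in coarser shards and write results to durable storage rather than a local file, but the resume logic is the same idea.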

Both modes share a common model registry (built on Blazing Registry), versioning system, and observability layer. Swapping model versions, running A/B splits, and rolling back a bad deployment are handled through the same SDL config used for all other Blazing workloads.
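One way an A/B split between two registered model versions can be resolved at request time is deterministic hash-based bucketing: hash the request id onto the traffic weights so the same request always lands on the same version. This is a common technique, not a description of the platform's actual mechanism; the split shape and version names are hypothetical.

```python
import hashlib

def pick_version(splits, request_id):
    """Route a request to a model version according to traffic weights
    (e.g., a 90/10 canary). Hashing the request id keeps the assignment
    stable across retries. `splits` is a list of (version, weight) pairs."""
    total = sum(weight for _, weight in splits)
    # Map the request id onto [0, total) via a stable cryptographic hash.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % total
    for version, weight in splits:
        if bucket < weight:
            return version
        bucket -= weight

# Hypothetical canary: 90% of traffic on the stable version, 10% on a candidate.
splits = [("v2.1.0", 90), ("v2.2.0-rc1", 10)]
```

Rolling back is then just a weight change in config, which fits the shared-SDL model described above.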

Supported runtimes include vLLM, TensorRT-LLM, and ONNX Runtime. The platform is model-agnostic — any model that fits in a container and exposes a standard serving interface is deployable.
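To make "standard serving interface" concrete, here is a minimal sketch of the shape such a contract typically takes: load weights once at startup, then answer predict calls. The class and method names are illustrative assumptions; the platform's actual container contract is not specified in the source.

```python
class ModelServer:
    """Minimal shape of a container serving interface: one-time load,
    then repeated predict calls. Names are illustrative, not the
    platform's published contract."""

    def load(self):
        # In a real container this would load weights into GPU memory via
        # the chosen runtime (vLLM, TensorRT-LLM, ONNX Runtime, ...).
        # Here it is a stub that just marks the server ready.
        self.ready = True

    def predict(self, inputs):
        if not getattr(self, "ready", False):
            raise RuntimeError("model not loaded")
        # Stub inference: echo each input back as a structured result.
        return [{"input": x, "output": f"echo:{x}"} for x in inputs]

server = ModelServer()
server.load()
out = server.predict(["hi"])
```

Separating `load` from `predict` is what lets the platform keep models warm across requests: the expensive step runs once per replica, and scaling decisions only ever add or remove whole loaded replicas.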