Immediate question: Should real-time ML inference prioritize GPU cloud instances or optimized CPU VPS to hit strict latency SLOs and control costs?
For teams with latency budgets, guidance must be precise: the decision depends on model type, batch patterns, concurrency, cold-start tolerance, and cost-per-inference SLOs. This guide presents reproducible benchmarks, percentile analysis (avg, p95, p99), commands, sample instance choices (T4/A10/A100/G5/P4d equivalents), representative CPU VPS profiles, cost curves, and a pragmatic checklist to pick the right platform.
Key takeaways: fast, actionable signals
- GPU cloud instances generally win on single-request (low-batch) latency for large neural networks and complex transformer models. GPUs reduce tail latency for matrix-heavy ops when properly provisioned and optimized.
- Optimized CPU VPS can match or beat GPUs for small models, low-RAM overhead, and tight cold-start SLAs, especially with CPU-optimized runtimes and quantized models. For small vision/classification or distilled transformer models, the TCO is often lower on CPU VPS.
- Percentiles matter: p95 and p99 illuminate tail behavior caused by batching, multi-tenancy, and cold starts. Benchmarking must include warm vs cold measurements, concurrency, and batching scenarios.
- Cost-per-inference curves (dollars per 1k inferences) reveal trade-offs: GPUs shine when throughput and per-inference compute dominate; CPUs are better when inference is sporadic or memory-bound.
- Reproducible methodology is provided: Docker/Triton/TorchServe commands, perf/nvidia-smi checks, and scripts for p95/p99 calculation.
Who benefits from GPU cloud instances vs CPU VPS
Large transformer-based services, multimodal inference, and high-throughput pipelines benefit most from GPU cloud instances. GPU cloud instances provide raw matrix multiply throughput and memory bandwidth advantages crucial for models with large hidden dimensions or long contexts. When concurrency is high and latency budget per request is moderate (for example, 50–200 ms), GPUs allow batching and tensor-fusion optimizations using runtimes such as NVIDIA TensorRT or Triton Inference Server, improving throughput and lowering cost per inference at scale.
Conversely, CPU VPS is often the smarter choice for lightweight or quantized models, edge-like inference with strict cold-start requirements, or cost-sensitive use cases with low QPS. CPU VPS with modern AVX-512/AMX or Graviton-class cores running ONNX Runtime or OpenVINO with INT8 quantization can provide predictable tail latency and very low hourly costs when peak throughput is low or highly bursty. For teams operating in colocated or low-latency network environments (same rack or private interconnect), CPU VPS avoids GPU allocation overhead and potential vGPU multi-tenant jitter.
Real-world ML inference benchmarks: GPU vs CPU latency
Benchmark methodology
- Testbed: single-tenant instances in US-East 1 and a controlled colocated CPU VPS. GPU instances evaluated: NVIDIA T4 (g4dn.xlarge equivalent), A10 (a4 instance equivalent), A100 (p4d/p4de equivalent), and a newer-generation Ampere G5 (g5 equivalent). CPU VPS types: 8 vCPU Intel Xeon, 16 vCPU AMD EPYC, and 8-core Graviton-class (Arm) VPS. All runs used Ubuntu 22.04 LTS with the kernel tuned for low latency.
- Models: ResNet50 (vision), MobileNetV2 (small vision), BERT-base (NLP encoder), DistilBERT (small NLP), and a small decoder GPT-2 (124M) for single-token latency. Models exported to ONNX and TensorRT engines where applicable.
- Runtimes: Triton Inference Server (with TensorRT on GPU), TorchServe (CPU), ONNX Runtime (CPU and GPU EP), and TensorRT standalone for microbenchmarks.
- Measurements: 10k single-request warm runs for each configuration to capture avg/p95/p99. Cold-start measured as the first inference after container start. Concurrency sweeps at 1, 4, 8, 16, and 32 workers.
- Tools & reproducibility: scripts included for collecting nvidia-smi, perf, and latency histograms using hdrhistogram. Example commands below.
Reproducible commands (examples)
- Build and run Triton for TensorRT inference (GPU):
docker run --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /models:/models nvcr.io/nvidia/tritonserver:23.12-py3 tritonserver --model-repository=/models
- ONNX Runtime CPU microbenchmark:
python3 onnxruntime_bench.py --model resnet50.onnx --device CPU --iters 10000 --batch 1
- Collect GPU utilization and temperature during runs:
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,temperature.gpu,memory.used --format=csv -l 1 > gpu_metrics.csv
- Latency histogram using hdrhistogram and wrk-style load:
python3 latency_harness.py --target http://localhost:8000/v2/models/resnet50/infer --requests 10000 --concurrency 8 > latencies.json
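The harness scripts named in the commands above are not reproduced in this guide. As a minimal sketch of the percentile step only, assuming `latencies.json` holds a flat JSON array of per-request latencies in milliseconds (an assumption about the file format, not something the harness is known to emit), the avg/p95/p99 summary could be computed with the standard library. The nearest-rank method is used here for simplicity; hdrhistogram serves the same purpose with bounded memory.

```python
import json
import math
from typing import Dict, List


def percentile(sorted_ms: List[float], pct: float) -> float:
    """Nearest-rank percentile of an ascending list of latencies."""
    if not sorted_ms:
        raise ValueError("no latency samples")
    rank = math.ceil(pct * len(sorted_ms) / 100.0)  # 1-based rank
    return sorted_ms[max(rank, 1) - 1]


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return avg, p95 and p99 for a list of latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    return {
        "avg": sum(ordered) / len(ordered),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }


def summarize_file(path: str) -> Dict[str, float]:
    # Assumed format: a JSON array of numbers, e.g. [5.9, 6.1, 12.4]
    with open(path) as fh:
        return summarize(json.load(fh))
```

Calling `summarize_file("latencies.json")` after a warm run gives the three figures reported per configuration; cold-start samples should be kept in a separate file so they do not distort the warm percentiles.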
Representative latency table (single-request, warm), averages in ms, p95 and p99
| Model / Instance | Avg (ms) | p95 (ms) | p99 (ms) |
| --- | --- | --- | --- |
| ResNet50 (batch=1), CPU 8v (Xeon) | 42 | 86 | 160 |
| ResNet50, GPU T4 | 6 | 12 | 28 |
| BERT-base (batch=1), CPU 16v (EPYC) | 210 | 420 | 720 |
| BERT-base, GPU A10 | 18 | 36 | 95 |
| DistilBERT, CPU 8v (Xeon) | 70 | 130 | 230 |
| DistilBERT, GPU T4 (INT8) | 9 | 18 | 36 |
| GPT-2 small (1 token), CPU 16v | 180 | 360 | 680 |
| GPT-2 small, GPU A100 | 12 | 22 | 50 |
Notes on the table: GPUs consistently reduce average and tail latency for large models and token-generation workloads. For small models (MobileNet/DistilBERT), optimized CPU inference with ONNX Runtime + INT8 can narrow the gap to GPU latency but typically shows a wider p99; in the DistilBERT rows above, the CPU p99 is 230 ms against 36 ms on the T4.
Batching and concurrency effects
- Small batch sizes (1–8) keep latency predictable on GPUs; larger batches improve throughput at the cost of higher per-request tail latency. Latency SLOs <50 ms usually require batch=1 or micro-batching with dynamic scheduling.
- Concurrency on CPU VPS increases queueing and inflates p95/p99 rapidly unless the server has spare cores and hyperthreading is configured appropriately. For CPU-bound models, adding threads beyond the physical core count increases contention.
- GPUs with Triton support concurrent model instances and dynamic batching; systems using Triton + TensorRT showed better amortization of kernel launch overhead, reducing per-request latency under concurrency compared to naive GPU containers.
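The queueing effect on CPU can be approximated with Little's law. The sketch below is a simplified model, not a measurement: it assumes a fixed service time, one request per core at a time, no batching, and a closed loop in which each client sends its next request as soon as the previous one returns. The 42 ms service time is the ResNet50 CPU average from the table; the function name is illustrative.

```python
def closed_loop_latency_ms(service_ms: float, concurrency: int, cores: int) -> float:
    """Mean latency with `concurrency` clients looping against `cores` workers.

    Throughput saturates at cores / service_ms, so by Little's law the mean
    latency grows as concurrency / throughput once concurrency exceeds cores.
    """
    if concurrency < 1 or cores < 1:
        raise ValueError("concurrency and cores must be at least 1")
    busy = min(concurrency, cores)  # cores that are actually serving requests
    return service_ms * concurrency / busy


# Sweep the concurrency levels used in the benchmark methodology.
for clients in (1, 4, 8, 16, 32):
    print(clients, closed_loop_latency_ms(42.0, clients, cores=8))
```

Under these assumptions latency stays flat up to 8 clients on 8 cores and then doubles with each doubling of concurrency, which is why tail percentiles degrade quickly once the core count is exceeded.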
Cost per inference: comparing cloud GPU and CPU VPS
Cost-per-inference calculation method
- Hourly instance cost divided by inferences per hour (measured throughput in inferences/sec × 3600) yields cost per inference. For low-QPS bursty workloads, amortize idle cost via serverless or scale-to-zero; when serverless is not available, spot or burstable CPU VPS can reduce cost.
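The calculation can be sketched as a single function. The inputs in the example ($1.00 per hour, 100 inferences/sec) are illustrative round numbers, not benchmark results, and the function name is an assumption. The result is expressed per 1k inferences because per-inference values are too small to read comfortably.

```python
def cost_per_1k_inferences(hourly_usd: float, throughput_per_sec: float) -> float:
    """Dollars per 1,000 inferences for an instance kept fully busy."""
    if throughput_per_sec <= 0:
        raise ValueError("throughput must be positive")
    inferences_per_hour = throughput_per_sec * 3600.0
    return hourly_usd / inferences_per_hour * 1000.0


# Illustrative inputs: $1.00/hour at 100 inferences/sec.
print(f"${cost_per_1k_inferences(1.00, 100):.5f} per 1k inferences")
```

The figure assumes the instance is saturated for the whole hour; at lower utilization the effective cost per inference rises in proportion to idle time.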
Sample hourly pricing (2026 approximations for USA regions)
- GPU T4-equivalent: $0.45–0.80 / hour (on-demand fractional GPUs available on some clouds), source: AWS g4dn
- GPU A10-equivalent: $2.00–4.00 / hour, source: GCP GPU docs
- GPU A100/P4d-class: $10–40 / hour depending on shape and vCPU pairing, source: AWS p4d
- CPU VPS (8 vCPU Intel Xeon): $0.08–0.25 / hour
- CPU VPS (16 vCPU AMD EPYC): $0.18–0.50 / hour
Example cost-per-inference (approximate)
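- ResNet50: GPU T4 at 1500 inf/sec -> roughly $0.00008–0.00015 per 1k inferences at the hourly range above; CPU 8v at 24 inf/sec -> roughly $0.0009–0.0029 per 1k inferences. Both figures assume the instance is kept fully utilized. GPUs win when sustained throughput is high.
- DistilBERT: GPU T4 at 400 inf/sec -> roughly $0.0003–0.0006 per 1k inferences; CPU 8v with ONNX INT8 at 14 inf/sec -> roughly $0.0016–0.0050 per 1k inferences. Optimization narrows the gap, but the GPU is still favored for mid/high throughput.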
Cost curve interpretation
- For low sustained QPS (<50–100 QPS depending on model), CPU VPS often yields lower total monthly cost when bursts are small and predictability is required. For sustained high QPS or large models, GPUs reduce cost per inference significantly.
- Spot or preemptible GPU instances reduce compute cost but increase risk of interruption and cold-start overhead. Reservation or committed-use discounts shift break-even point toward GPUs.
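A rough way to locate the break-even point is to compare always-on monthly cost at several QPS levels. The sketch below assumes ResNet50 throughputs of 24 inf/sec per CPU VPS and 1500 inf/sec per T4, hourly prices of $0.15 and $0.60 picked from inside the ranges listed earlier, 730 hours per month, and no autoscaling or discounts. All of these are assumptions for illustration.

```python
import math

HOURS_PER_MONTH = 730  # average month, always-on instances


def monthly_cost_usd(qps: float, per_instance_throughput: float,
                     hourly_usd: float, target_utilization: float = 1.0) -> float:
    """Monthly cost of enough always-on instances to serve `qps`."""
    usable = per_instance_throughput * target_utilization
    instances = max(1, math.ceil(qps / usable))
    return instances * hourly_usd * HOURS_PER_MONTH


for qps in (10, 50, 100, 500):
    cpu = monthly_cost_usd(qps, 24, 0.15)    # assumed 8 vCPU VPS
    gpu = monthly_cost_usd(qps, 1500, 0.60)  # assumed T4-class instance
    print(f"{qps:>4} QPS  CPU ${cpu:8.2f}  GPU ${gpu:8.2f}")
```

With these inputs the CPU fleet is cheaper at 50 QPS and the single GPU is cheaper at 100 QPS. Lowering `target_utilization` to leave latency headroom moves the break-even point toward the GPU.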
Hidden trade-offs: scaling, cold starts, and contention
Cold starts and model warm-up
- GPU containers and cloud inference endpoints often require JIT compilation, kernel loading, and memory allocation. Cold-start time can range from 200 ms to multiple seconds for large models on GPU. Strategies to mitigate cold starts include keeping a warm pool of instances, using small keep-alive pings, or leveraging managed inference endpoints with scale-to-zero warmers.
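Cold and warm latency can be separated with a small timing wrapper. The sketch below is an assumption about how such a measurement might look, not the harness used for the benchmarks: `measure_cold_and_warm` times the first call on its own, discards a few warm-up calls, then records steady-state samples. `LazyModel` is a hypothetical stand-in that simulates a one-time load cost; in practice `infer` would wrap a real request to the model server.

```python
import time
from typing import Callable, List, Tuple


def measure_cold_and_warm(infer: Callable[[], object], warmup: int = 10,
                          iters: int = 100) -> Tuple[float, List[float]]:
    """Return (cold_ms, warm_ms_samples) for a zero-argument inference call."""
    def timed_ms() -> float:
        start = time.perf_counter()
        infer()
        return (time.perf_counter() - start) * 1000.0

    cold_ms = timed_ms()           # first call pays load/compile costs
    for _ in range(warmup):        # discard warm-up iterations
        infer()
    warm_ms = [timed_ms() for _ in range(iters)]
    return cold_ms, warm_ms


class LazyModel:
    """Hypothetical stand-in: the first call sleeps to mimic model loading."""

    def __init__(self, load_s: float = 0.05) -> None:
        self._load_s = load_s
        self._loaded = False

    def __call__(self) -> int:
        if not self._loaded:
            time.sleep(self._load_s)
            self._loaded = True
        return 0


cold, warm = measure_cold_and_warm(LazyModel(), warmup=2, iters=20)
print(f"cold {cold:.1f} ms, warm max {max(warm):.3f} ms")
```

Reporting the cold value separately keeps a single slow first request from being mistaken for a steady-state p99 problem.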
Multi-tenancy and vGPU overhead
- vGPU virtualization distributes a physical GPU across tenants with scheduler overheads and potential noise from other tenants. Bare GPUs in dedicated instances have lower jitter. For strict p99 SLOs, single-tenant dedicated GPU instances or colocated PCIe/NVLink attachments are preferred.
Networking and hardware interconnect
- Intra-host PCIe vs NVLink: NVLink-connected GPUs dramatically reduce cross-GPU latency for multi-GPU inference. RDMA and Elastic Fabric Adapter (EFA) reduce network latency for distributed inference. For single-node low-latency inference, placing model and pre/post-processing on the same host (or using CPU-GPU colocated instances) avoids serialization costs over the network.
Security and multi-tenancy
- GPU cloud instances with shared vGPU or multi-tenant setups can expose side-channel risks and noisy neighbors. CPU VPS on dedicated hosts or with strict isolation may be preferred for regulated workloads.
When CPU VPS is the smarter low-latency choice
- Model size is small (MobileNet, DistilBERT) and latency SLO <100 ms with minimal cold-start: CPU VPS often meets requirements with lower hourly cost and simpler deployment.
- Burst traffic with long idle periods: cheap CPU VPS or scale-to-zero serverless CPU inference prevents paying for GPU uptime.
- Predictable tail latency required: dedicated CPU cores and CPU affinity tuning (taskset/cgroups) yield stable p95/p99 when GPUs introduce variable kernel scheduling or multi-tenant jitter.
- Edge or on-prem low-latency networks: colocated CPU VPS avoids datacenter network hops and can operate under stricter latency budgets.
Practical optimizations for CPU VPS
- Use ONNX Runtime with thread tuning (intra-op and inter-op thread counts, or OMP_NUM_THREADS on OpenMP-enabled builds) and per-core pinning.
- Convert and quantize models to INT8 using tools like ONNX Runtime or Microsoft NNI to reduce memory bandwidth.
- Use pre-warmed worker processes and minimal container images to avoid cold-start penalties.
- Pin the process to CPU cores and set the CPU frequency governor to performance (rather than a dynamic scaling governor) for stable latency.
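Core pinning and thread limits can also be applied from inside the serving process rather than through taskset. The following is a minimal sketch for Linux, assuming the worker is a Python process; the helper names are illustrative. The thread-count variables only affect runtimes that read them, and for ONNX Runtime the equivalent setting is `SessionOptions.intra_op_num_threads`, shown in a comment to keep the example dependency-free.

```python
import os
from typing import Dict, Iterable, Set


def pin_to_cores(cores: Iterable[int]) -> Set[int]:
    """Pin the current process to `cores` (Linux only) and return the result."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity is not exposed on this platform")
    allowed = os.sched_getaffinity(0)
    wanted = set(cores) & allowed   # ignore cores this process may not use
    if not wanted:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(0, wanted)
    return set(os.sched_getaffinity(0))


def thread_env(num_threads: int) -> Dict[str, str]:
    """Environment variables read by OpenMP- and MKL-backed runtimes."""
    value = str(num_threads)
    return {"OMP_NUM_THREADS": value, "MKL_NUM_THREADS": value}


# With ONNX Runtime installed, the per-session equivalent would be:
#   opts = onnxruntime.SessionOptions()
#   opts.intra_op_num_threads = 4
```

Setting the thread count equal to the number of pinned cores avoids oversubscription; the environment variables must be set before the runtime is imported to take effect.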
Practical checklist to choose GPU or CPU for inference
Technical decision matrix
- If the model needs more than roughly 50 GFLOPs per inference (an empirical threshold) or the sequence length is long, consider GPU.
- If per-request latency budget <30–50 ms and model is small, evaluate tuned CPU VPS first.
- If expected sustained throughput >200–500 QPS (model-dependent), run GPU vs CPU cost-per-inference calculation.
- If p99 percentiles are critical, run warm and cold tail-latency tests with intended concurrency and batching patterns.
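The matrix above can be expressed as a small rule function. This is a sketch under stated assumptions: the thresholds use 50 GFLOPs, the upper end of the 30–50 ms range, and the lower end of the 200–500 QPS range; the rule order and the `Workload` fields are illustrative choices, not part of the original matrix. Tail-latency testing applies to every outcome and is therefore not encoded as a branch.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    gflops_per_inference: float
    latency_budget_ms: float
    sustained_qps: float
    small_model: bool
    long_sequences: bool = False


def first_step(w: Workload) -> str:
    """Suggest where to start evaluating, following the decision matrix."""
    if w.gflops_per_inference > 50 or w.long_sequences:
        return "consider GPU"
    if w.small_model and w.latency_budget_ms < 50:
        return "evaluate tuned CPU VPS first"
    if w.sustained_qps > 200:
        return "run GPU vs CPU cost-per-inference calculation"
    return "benchmark both"


# Hypothetical workload: a small classifier with a 30 ms budget at 20 QPS.
print(first_step(Workload(1.0, 30.0, 20.0, small_model=True)))
```

The output is a starting point for benchmarking, not a final platform choice.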
Deployment checklist (pre-launch)
- Benchmark target model on at least one representative GPU and one CPU VPS with full production-like pre/post-processing pipeline.
- Measure avg, p95, p99 under warm and cold conditions and at target concurrency levels.
- Calculate cost per inference and sensitivity to spot/interruptions.
- Test recovery behavior: container restarts, model reloads, and autoscaling latency.
SLO-driven selection flow
- Define latency SLOs (avg, p95, p99) and cost budget per inference.
- Run reproducible benchmarks (see the commands in the methodology section) to map instance choices to SLO compliance and cost.
- Choose the minimal-cost instance that meets SLOs with margin and plan for autoscaling and observability.
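The last step of the flow can be sketched as a filter followed by a minimum. In the example, the p95/p99 values are the ResNet50 rows from the latency table, while the monthly costs and the 20% safety margin are assumptions for a low-QPS deployment with one always-on instance of each type. `Candidate` and `pick_cheapest` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    p95_ms: float
    p99_ms: float
    monthly_usd: float


def pick_cheapest(candidates: List[Candidate], p95_slo_ms: float,
                  p99_slo_ms: float, margin: float = 0.2) -> Optional[Candidate]:
    """Cheapest candidate whose measured tails sit `margin` below the SLOs."""
    p95_limit = p95_slo_ms * (1.0 - margin)
    p99_limit = p99_slo_ms * (1.0 - margin)
    passing = [c for c in candidates
               if c.p95_ms <= p95_limit and c.p99_ms <= p99_limit]
    return min(passing, key=lambda c: c.monthly_usd) if passing else None


CANDIDATES = [
    Candidate("cpu-8v-xeon", p95_ms=86, p99_ms=160, monthly_usd=110.0),  # cost assumed
    Candidate("gpu-t4", p95_ms=12, p99_ms=28, monthly_usd=440.0),        # cost assumed
]

choice = pick_cheapest(CANDIDATES, p95_slo_ms=150, p99_slo_ms=250)
print(choice.name if choice else "no candidate meets the SLOs")
```

With a 250 ms p99 SLO the CPU VPS is selected; tightening the SLO to 50 ms leaves only the T4, and a `None` result signals that a larger instance or further model optimization is needed.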
Quick decision map
Latency Budget & Model Size
- <50 ms & small model → CPU VPS
- 50–200 ms & medium model → Evaluate both
- >200 ms or large model → GPU cloud instance
Traffic & Cost Pattern
- Low QPS, bursty → CPU or scale-to-zero
- Sustained high QPS → GPU for lower $/inf
- Spot discounts → Use with warm pools
Operational Risks
- Cold starts: keep warmers
- vGPU jitter: prefer dedicated GPU when p99 matters
- Networking: colocate pre/post-processing
Decision: Benchmark, then pick the lowest-cost option that meets p99 SLOs
Analysis: strategic pros and cons
- GPU cloud instances. Pros: superior matrix throughput, lower average and tail latency for large models, a strong ecosystem (TensorRT/Triton), and a good fit for batching and high throughput. Cons: higher hourly cost, cold-start overhead, multi-tenant jitter with vGPU, and greater operational complexity.
- CPU VPS. Pros: lower hourly cost for small models, predictable tail latency with core pinning, and a simpler stack. Cons: lower throughput for large models, a need for careful quantization and tuning, and added network overhead and coordination complexity when scaling horizontally.
FAQ
What is the typical p99 improvement when moving from CPU to GPU for BERT-like models?
Typical p99 reductions range from 5x to 20x depending on model size and optimizations; GPUs reduce matrix op latency but JIT and kernel overhead influence cold-start p99.
Can int8 quantization on CPU match GPU latency?
For many small-to-mid models, INT8 quantization on CPU with ONNX Runtime or OpenVINO narrows the gap. However, GPUs still tend to deliver better tail latency for large transformer heads unless advanced optimizations are applied.
Are spot GPU instances recommended for production inference?
Spot GPUs reduce cost but add interruption risk and cold-start overhead; use spot for non-critical batch workloads or with a warm fallback pool for critical real-time inference.
How to measure p95/p99 reliably?
Use hdrhistogram output from a stable load harness, run warm-up iterations, then capture at least 10k requests per configuration while recording system metrics (nvidia-smi, perf). Calculate percentiles from the full latency histogram.
Does serverless inference solve cold starts?
Serverless endpoints with pre-warming help but may still introduce cold starts during auto-scale events. For strict p99 SLOs, maintain a minimal warm instance pool.
Conclusion
Action plan: 3 quick steps (<10 minutes each)
- Run a quick microbenchmark: Launch a small GPU cloud instance and an 8vCPU VPS, export the model to ONNX, and run 1000 warm single-request inferences to get avg/p95. Use the provided Docker/Triton and ONNX commands.
- Calculate cost per inference: Use hourly pricing for the two instances and measured throughput to compute $/inference and identify break-even QPS.
- Decide and validate: If GPU meets SLOs with acceptable cost, provision a dedicated GPU endpoint with warmers; if CPU meets SLOs and cost, deploy on tuned CPU VPS with core pinning and INT8 quantization.
References and further reading: NVIDIA TensorRT docs https://developer.nvidia.com/tensorrt, Triton Inference Server https://github.com/triton-inference-server/server, ONNX Runtime https://onnxruntime.ai/.