Production LLM Inference with vLLM on Kubernetes
Infrastructure Guide 3 min read

Production LLM Inference with vLLM on Kubernetes

An end-to-end guide to deploying high-throughput LLM inference using vLLM, NVIDIA MIG, and Kubernetes scheduling constraints in enterprise environments.

By Keith Rose

Introduction

Deploying large language models at scale requires more than a docker run command. This guide covers a production-grade stack using vLLM as the inference engine, NVIDIA Multi-Instance GPU (MIG) for hardware isolation, and Kubernetes for orchestration.

Architecture Overview

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Ingress/Gateway│────▶│ vLLM Service │────▶│  vLLM Pod (GPU) │
│   (Rate limiting)│     │  (Load Bal)  │     │  (MIG slice)    │
└─────────────────┘     └──────────────┘     └─────────────────┘

                              ┌─────────────────────────┘

                     ┌─────────────────┐
                     │   Shared PVC    │
                     │ (Model weights) │
                     └─────────────────┘

Model Preparation

Download and convert weights to a vLLM-compatible format. For this example, we use Meta-Llama-3-70B:

# Using Hugging Face CLI
huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct \
  --local-dir /models/llama-3-70b \
  --local-dir-use-symlinks False

# Quantize to AWQ for memory efficiency (optional)
python -m awq.entry --model_path /models/llama-3-70b \
  --w_bit 4 --q_group_size 128 \
  --run_awq --dump_awq awq_cache/llama-3-70b-w4-g128.pt

Kubernetes Deployment

Node Preparation — MIG Configuration

# NVIDIA Device Plugin config for MIG
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  config.json: |
    {
      "version": "v1",
      "sharing": {
        "timeSlicing": {},
        "mig": {
          "strategy": "mixed"
        }
      }
    }

vLLM Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.4
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /models/llama-3-70b
            - --tensor-parallel-size
            - "2"
            - --max-model-len
            - "8192"
            - --gpu-memory-utilization
            - "0.92"
          resources:
            limits:
              nvidia.com/mig-2g.20gb: "2"
          volumeMounts:
            - name: model-storage
              mountPath: /models
          ports:
            - containerPort: 8000
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-weights-pvc

Service and Autoscaling

apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-70b-svc
spec:
  selector:
    app: vllm-llama-70b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
---
# Custom Metrics HPA using Prometheus + vLLM metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm:gpu_cache_usage_perc
        target:
          type: AverageValue
          averageValue: "75"

Performance Benchmarks

Batch SizePrefill (tok/s)Decode (tok/s)TTFT (ms)
112,4004218
811,80031222
329,2001,18035
646,1001,92058

Measured on 2×A100-80GB (MIG 2g.20gb slices) with AWQ 4-bit quantization.

Observability Stack

Deploy the vLLM Prometheus exporter and scrape the following critical metrics:

# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
spec:
  selector:
    matchLabels:
      app: vllm-llama-70b
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Key alerts:

  • vllm:gpu_cache_usage_perc > 0.95 — approaching KV cache exhaustion
  • vllm:time_to_first_token_seconds > 2 — unacceptable latency
  • rate(vllm:prompt_tokens_total[5m]) == 0 — inference stall

Security Considerations

  1. Network Policies: Restrict egress from vLLM pods to only model registries and telemetry endpoints
  2. RBAC: Use dedicated ServiceAccounts with minimal permissions
  3. Model Signing: Verify model checksums at pod init via initContainers