
Announcing KServe v0.17 - Production-Ready LLM Serving with LLMInferenceService

· 14 min read
Dan Sun
Co-Founder, KServe

Published on March 13, 2026

We are excited to announce the release of KServe v0.17, a landmark release that brings LLMInferenceService to production readiness with a GenAI-first architecture built on the llm-d framework. This release introduces KV-cache aware intelligent routing, disaggregated prefill-decode, distributed inference with tensor/data/expert parallelism, Envoy AI Gateway integration with token-based rate limiting, and a completely restructured modular Helm chart architecture.

🤖 LLMInferenceService: GenAI-First Architecture

KServe v0.17 elevates LLMInferenceService from an experimental feature to a production-ready CRD purpose-built for generative AI workloads. Built on the llm-d framework, LLMInferenceService provides a GenAI-first architecture that goes beyond traditional InferenceService to address the unique challenges of serving large language models at scale.

Unlike InferenceService, which is designed for predictive AI workloads, LLMInferenceService natively supports:

  • Distributed inference across multiple nodes and GPUs
  • KV-cache aware scheduling for intelligent request routing
  • Disaggregated prefill-decode for optimal resource utilization
  • Gateway Inference Extension (GIE) integration for advanced traffic management
  • Token-based rate limiting via Envoy AI Gateway

| Feature | InferenceService | LLMInferenceService |
|---|---|---|
| Primary Use Case | Predictive AI | Generative AI |
| Routing | Standard Gateway | KV-cache aware with EPP |
| Parallelism | Worker Spec | TP, DP, EP native support |
| Prefill-Decode | N/A | Disaggregated separation |
| Scaling | HPA/KPA | WVA + KEDA |

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-serving
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 3
  template:
    spec:
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
  scheduler:
    pool: {}
```

This creates a full serving stack including the Deployment, Service, Gateway, HTTPRoute, InferencePool, InferenceModel, and EPP (Endpoint Picker Pod), all managed by the LLMInferenceService controller.

🚀 Key LLMInferenceService Features in v0.17

🧠 KV-Cache Aware Scheduling with Gateway Inference Extension

LLMInferenceService integrates with Gateway Inference Extension (GIE) v1.3.0, a Kubernetes SIG project that extends the Gateway API with AI-specific routing capabilities. At the heart of this integration is the Endpoint Picker Pod (EPP) from the llm-d inference scheduler, an intelligent scheduler that routes requests based on real-time KV-cache state rather than simple round-robin or random load balancing.

Traditional load balancing treats all LLM inference requests equally, but in practice, requests with similar prompts benefit enormously from being routed to the same pod, because that pod already has the relevant KV cache blocks loaded. The EPP solves this by tracking real-time KV cache state across all vLLM instances via ZMQ events (BlockStored, BlockRemoved) and building an index mapping {ModelName, BlockHash} → {PodID, DeviceTier}.
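Conceptually, the EPP's index behaves like a map from (model, block hash) to the set of pods currently holding that block, updated as the ZMQ events arrive. A minimal sketch (class and method names are illustrative, not the llm-d scheduler's actual API):

```python
from collections import defaultdict

class KVCacheIndex:
    """Illustrative sketch of the EPP's cache index:
    (model_name, block_hash) -> set of pods holding that block."""

    def __init__(self):
        self.index = defaultdict(set)

    def on_block_stored(self, model, block_hash, pod):
        # Triggered by a vLLM BlockStored event
        self.index[(model, block_hash)].add(pod)

    def on_block_removed(self, model, block_hash, pod):
        # Triggered by a vLLM BlockRemoved event
        self.index[(model, block_hash)].discard(pod)

    def pods_with_prefix(self, model, block_hashes):
        """Count matching cached prefix blocks per pod for a request."""
        hits = defaultdict(int)
        for h in block_hashes:
            for pod in self.index[(model, h)]:
                hits[pod] += 1
        return dict(hits)
```

A pod with more matching prefix blocks can serve the request with less redundant prefill work, which is exactly what the prefix-cache scorer rewards.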

The scheduling behavior is configured through EndpointPickerConfig, which defines a plugin pipeline with weighted scorers:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: single-profile-handler
- type: prefix-cache-scorer
- type: load-aware-scorer
  parameters:
    threshold: 100
- type: max-score-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: prefix-cache-scorer
    weight: 2.0
  - pluginRef: load-aware-scorer
    weight: 1.0
  - pluginRef: max-score-picker
```

The pipeline uses three types of plugins (see llm-d scheduler architecture for details):

  • prefix-cache-scorer (weight: 2.0): Tracks the actual KV cache contents across all vLLM instances and scores pods based on how many cached prefix blocks match the incoming request's prompt. This reduces Time To First Token (TTFT) by avoiding redundant prefill computation for repeated or similar prompts, which is particularly beneficial for multi-turn conversations and RAG workloads.
  • load-aware-scorer (weight: 1.0): Scores candidate pods based on their current queue depth. Pods with empty queues score 0.5, while pods with growing queues score progressively lower toward 0. The threshold parameter controls the sensitivity: when queue depth exceeds the threshold, the pod scores near zero.
  • max-score-picker: After all scorers run, selects the pod with the highest weighted aggregate score.
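With this profile, endpoint selection reduces to a weighted sum of scorer outputs followed by an argmax. A simplified sketch using the same 2.0/1.0 weights (the score values below are made up for illustration; the real plugin pipeline lives in the llm-d scheduler):

```python
def pick_pod(pods):
    """pods: list of dicts with per-scorer scores in [0, 1].
    Weights mirror the EndpointPickerConfig above."""
    weights = {"prefix_cache": 2.0, "load": 1.0}

    def total(pod):
        # Weighted aggregate across all scorers
        return sum(weights[k] * pod["scores"][k] for k in weights)

    # max-score-picker: highest weighted aggregate wins
    return max(pods, key=total)["name"]

pods = [
    {"name": "pod-a", "scores": {"prefix_cache": 0.9, "load": 0.1}},
    {"name": "pod-b", "scores": {"prefix_cache": 0.2, "load": 0.5}},
]
pick_pod(pods)  # "pod-a": cache affinity outweighs its higher load
```

Doubling the prefix-cache weight biases routing toward cache hits even when the cached pod is somewhat busier, which is the intended trade-off for TTFT-sensitive workloads.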

The EndpointPickerConfig can be provided inline in the LLMInferenceService spec or referenced from a ConfigMap, giving platform teams the flexibility to standardize scheduling behavior across deployments:

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-with-scheduler
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 4
  template:
    spec:
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
  scheduler:
    config:
      ref:
        name: custom-endpoint-picker-config
        key: endpoint-picker-config.yaml
    pool: {}
```

The GIE CRDs (InferencePool and InferenceModel) are now bundled as part of the KServe installation, simplifying setup.

🔀 Disaggregated Prefill-Decode

LLMInferenceService natively supports disaggregated prefill-decode, which separates the compute-intensive prefill phase from the memory-intensive decode phase into independent workloads. This allows each phase to be scaled and optimized independently.

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-prefill-decode
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 2
  template:
    spec:
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "1"
  prefill:
    replicas: 2
    template:
      spec:
        containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
  scheduler:
    pool: {}
```

KV cache data is transferred between prefill and decode pods using NixlConnector with RDMA-based RoCE for high-throughput, low-latency block transfers.

๐Ÿ“ Distributed Inference: Tensor, Data, and Expert Parallelismโ€‹

LLMInferenceService introduces a comprehensive parallelism specification for distributed inference across multiple nodes and GPUs using LeaderWorkerSet:

  • Tensor Parallelism (TP): Splits model layers across GPUs within a node
  • Data Parallelism (DP): Runs multiple model replicas for higher throughput
  • Data-Local Parallelism: Controls GPUs per node for optimal NUMA affinity
  • Expert Parallelism (EP): Distributes Mixture-of-Experts (MoE) model experts across GPUs

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-multi-node
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-70B-Instruct
    name: meta-llama--Llama-3.1-70B-Instruct
  replicas: 8
  parallelism:
    tensor: 4
    data: 8
    dataLocal: 4
  template:
    spec:
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "4"
  worker:
    spec:
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "4"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
  scheduler:
    pool: {}
```

๐ŸŒ Envoy AI Gateway Integration with Token-Based Rate Limitingโ€‹

LLMInferenceService integrates with Envoy AI Gateway for AI-native traffic management. This enables token-based rate limiting, a capability critical for LLM serving, where request cost varies dramatically with input and output token counts rather than simple request counts.

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-route
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llama3-serving
  llmRequestCosts:
  - metadataKey: llm_input_token
    type: InputToken
  - metadataKey: llm_output_token
    type: OutputToken
  - metadataKey: llm_total_token
    type: TotalToken
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-rate-limit
spec:
  targetRefs:
  - group: aigateway.envoyproxy.io
    kind: AIGatewayRoute
    name: llm-route
  rateLimit:
    type: Global
    global:
      rules:
      - clientSelectors:
        - headers:
          - name: x-user-id
            type: Distinct
        limit:
          requests: 1000
          unit: Hour
        cost:
          request:
            from: Number
            number: 0
          response:
            from: Metadata
            key: llm_total_token
```
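In effect, each distinct x-user-id gets a budget of 1000 token units per hour: the request itself is charged 0, and the response is charged its llm_total_token value. A toy illustration of the accounting (apply_request is a hypothetical helper, not part of Envoy AI Gateway, which enforces this inside the rate-limit service):

```python
def apply_request(budget_remaining, total_tokens):
    """Illustrative accounting for the policy above: admission is
    decided against the remaining budget, then the response deducts
    its llm_total_token count (the request itself costs 0)."""
    allowed = budget_remaining > 0
    return allowed, budget_remaining - (total_tokens if allowed else 0)

allowed, remaining = apply_request(1000, 420)
# A 420-token response leaves 580 token units of the hourly budget
```

This is why token-based limits are fairer than per-request limits for LLM traffic: a user sending a handful of 10k-token completions consumes budget proportional to their actual cost.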

⚡ Autoscaling API with WVA Support

A new autoscaling API has been added to LLMInferenceService with support for the Workload Variant Autoscaler (WVA), a Kubernetes-based global autoscaler designed specifically for LLM inference workloads. Traditional CPU/memory-based autoscaling is inadequate for LLMs because inference cost is driven by token throughput, KV cache utilization, and queue depth rather than CPU or memory usage.

WVA continuously monitors inference server metrics via Prometheus (specifically KV cache utilization and queue depth) to determine when servers are approaching saturation. It then computes a wva_desired_replicas metric and emits it to Prometheus, where an actuator backend (HPA or KEDA) reads it to drive the actual scaling:

  • WVA + KEDA: Queries Prometheus directly for the wva_desired_replicas metric. Does not require Prometheus Adapter. Supports idle scale-to-zero via idleReplicaCount.
  • WVA + HPA: Reads the wva_desired_replicas metric via Kubernetes Metrics API. Requires Prometheus Adapter. Supports standard HPA scaling behaviors.
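The control loop can be pictured as: read saturation signals, compute a replica target, and publish it for KEDA or HPA to act on. The formula below is a deliberately simplified stand-in for WVA's actual optimization (target_util, max_queue, and the queue heuristic are assumptions for illustration):

```python
import math

def desired_replicas(current_replicas, kv_cache_util, queue_depth,
                     target_util=0.8, max_queue=50):
    """Toy version of a wva_desired_replicas computation: scale so
    per-replica KV-cache utilization returns to the target, with extra
    headroom as queues build up. Not WVA's real algorithm."""
    util_factor = kv_cache_util / target_util
    queue_factor = 1.0 + min(queue_depth / max_queue, 1.0)
    return max(1, math.ceil(current_replicas * util_factor * queue_factor))

desired_replicas(4, 0.96, 0)   # utilization above target -> scale up to 5
desired_replicas(2, 0.40, 0)   # well under target -> scale down to 1
```

Whatever the exact formula, the key design point is that the signal is token-throughput saturation (KV cache and queues), not CPU or memory, which barely move as an LLM server approaches overload.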

A key concept in WVA is the variant: a specific deployment configuration (hardware, runtime, parallelism strategy) for serving a model. The same base model might be served by multiple variants: for example, Llama-3 on A100 GPUs with TP=4 is one variant, while Llama-3 on H100 GPUs with TP=2 is another. The variantCost field specifies the relative cost per replica for each variant, enabling WVA to make cost-aware scaling decisions across variants, scaling up the cheaper variant first when demand increases and scaling down the most expensive variant first when demand decreases.

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-wva-autoscaling
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  scaling:
    minReplicas: 1
    maxReplicas: 10
    wva:
      variantCost: "15.0"
    keda:
      pollingInterval: 30
      cooldownPeriod: 300
      initialCooldownPeriod: 120
      idleReplicaCount: 0
      fallback:
        failureThreshold: 3
        replicas: 2
  template:
    spec:
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
  scheduler:
    pool: {}
```

In the example above, variantCost: "15.0" indicates the relative cost of running each replica of this variant. If another variant of the same model has variantCost: "5.0", WVA would prefer to add capacity on that cheaper variant before scaling up this one. The default value is "10.0" if not specified. When using the KEDA backend, the fallback field ensures the deployment maintains a minimum replica count (here, 2 replicas) even if the metrics pipeline fails, a critical safety net for production LLM deployments.
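The cost-aware policy can be sketched in a few lines (pick_variant_to_scale and the variant names are hypothetical; WVA's real decision also weighs capacity and saturation per variant):

```python
def pick_variant_to_scale(variants, scale_up):
    """variants: list of (name, variant_cost) pairs.
    Scale up the cheapest variant first; scale down the most
    expensive one first. Illustrative only."""
    cost = lambda v: v[1]
    chosen = min(variants, key=cost) if scale_up else max(variants, key=cost)
    return chosen[0]

variants = [("llama3-a100-tp4", 15.0), ("llama3-h100-tp2", 5.0)]
pick_variant_to_scale(variants, scale_up=True)   # cheaper variant grows first
pick_variant_to_scale(variants, scale_up=False)  # priciest variant shrinks first
```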

🔧 Scheduler High Availability

The LLMInferenceService scheduler (EPP) now supports scaling and high availability, allowing multiple EPP replicas for production deployments that require fault tolerance and higher routing throughput.

๐Ÿ›ก๏ธ CRD Webhook Validationโ€‹

LLMInferenceService now includes CRD webhook validation with comprehensive E2E tests, providing early feedback on invalid configurations before they reach the controller. This catches errors in parallelism settings, workload specifications, and router configurations at admission time.

📋 Configuration Composition with LLMInferenceServiceConfig

LLMInferenceService supports a configuration composition model through LLMInferenceServiceConfig, enabling reusable templates that can be shared across multiple LLMInferenceService resources. The merge order follows:

  1. Well-Known Configs → 2. Explicit BaseRefs → 3. LLMInferenceService Spec

This allows platform teams to define standardized vLLM worker templates, router/scheduler configurations, and resource defaults while giving application teams the ability to override specific settings.
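The layered precedence behaves like a recursive map merge where later layers win. A sketch of the idea (deep_merge and the sample configs are illustrative, not the controller's actual merge code):

```python
def deep_merge(base, override):
    """Recursively merge dicts; keys in `override` win.
    Nested dicts are merged rather than replaced wholesale."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

# Precedence: well-known configs < explicit baseRefs < the service's own spec
well_known = {"replicas": 1, "model": {"uri": "hf://default"}}
base_ref   = {"replicas": 2}
spec       = {"model": {"uri": "hf://meta-llama/Llama-3.1-8B-Instruct"}}
merged = deep_merge(deep_merge(well_known, base_ref), spec)
# merged == {"replicas": 2, "model": {"uri": "hf://meta-llama/Llama-3.1-8B-Instruct"}}
```

The platform team owns the lower layers (worker templates, router/scheduler defaults), while each application's spec only needs to state what differs.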

📦 Additional LLMInferenceService Improvements

  • Label and annotation propagation to downstream workload resources (#5009)
  • Prometheus annotation propagation to workloads for metrics collection (#5086)
  • Certificate management with DNS/IP SAN and automatic renewal for self-signed certs (#5099)
  • Improved CA bundle management for secure communication (#4803)
  • Optional storageInitializer: skip model download when using pre-loaded models (#4970)
  • InferencePool auto-migration for seamless upgrades (#5007)
  • Route-only completions through InferencePool for chat/completion endpoints (#5087)
  • Startup probes for vLLM containers for more reliable health monitoring (#5063)
  • vLLM arguments migrated to command field for cleaner configuration (#5049)
  • Versioned well-known config resolution for stable config management (#5096)
  • Scheduler config via ConfigMap or inline for flexible configuration (#4856)
  • Pod init container failure monitoring for better observability (#5034)
  • Preserve externally managed replicas during reconciliation (#4996)
  • Allow stopping LLMInferenceService gracefully (#4839)
  • Enhanced Gateway API URL discovery with listener hostname fallback (#5104, #5079)

๐Ÿ—๏ธ Modular Component Architectureโ€‹

KServe v0.17 introduces a fundamental architectural shift toward modular, component-based deployment. KServe now consists of three independent components:

  • kserve (core): Manages InferenceService, ServingRuntime, ClusterServingRuntime, InferenceGraph, and TrainedModel CRDs.
  • llmisvc: The LLMInferenceService controller for generative AI workloads, managing LLMInferenceService and LLMInferenceServiceConfig CRDs.
  • localmodel (optional): The LocalModel controller for efficient model caching with LocalModelCache, LocalModelNode, and LocalModelNodeGroup CRDs.

| Combination | Use Case | Components |
|---|---|---|
| KServe Only | Predictive AI | kserve |
| KServe + LLMIsvc | Predictive AI + Generative AI | kserve + llmisvc |
| Full Stack | Predictive AI + Generative AI + Model Caching | kserve + llmisvc + localmodel |

Helm Chart Restructuring

To support the new component architecture, the Helm charts have been completely restructured from a single chart into 10 independent Helm charts:

CRD Charts (6 charts with full and minimal variants):

  • kserve-crd / kserve-crd-minimal
  • kserve-llmisvc-crd / kserve-llmisvc-crd-minimal
  • kserve-localmodel-crd / kserve-localmodel-crd-minimal

Resource Charts (4 charts):

  • kserve-resources (renamed from kserve)
  • kserve-llmisvc-resources (new)
  • kserve-localmodel-resources (new)
  • kserve-runtime-configs (new; manages ClusterServingRuntimes and LLMIsvcConfigs)

> **Warning:** This is a breaking change. Users upgrading from v0.16 cannot use a simple `helm upgrade` command. Please follow the detailed upgrade guide for step-by-step migration instructions. We strongly recommend testing the upgrade in a non-production environment first.

For fresh installations, the new Kustomize component-based architecture also provides composable deployment options via standalone overlays, addon overlays, and all-in-one overlays. See the installation concepts for details.

🔧 InferenceService and Platform Improvements

Storage Performance

  • Parallelized blob downloads from Azure and S3 for faster model loading (#4709, #4714)
  • Faster parallel S3 downloads with configurable file selection (#5102, #5119)
  • Git repository support for downloading models directly from Git repos via HTTPS (#4966)

New Serving Runtimes

  • OpenVINO Model Server: Intel's optimized inference runtime for high-performance serving on Intel hardware (#4592)
  • PredictiveServer runtime with full build/publish infrastructure and E2E testing (#4954)

Gateway & Routing

  • Gateway API upgraded to v1.4.0 (#5038)
  • PathTemplate configuration for flexible inference service routing (#4817)

vLLM Backend

  • Upgraded to vLLM v0.15.1 with performance improvements (#5098)
  • Removed Python 3.9 support (#4851)

Additional Enhancements

  • CSV and Parquet marshallers for expanded data format support (#5115)
  • Event loop configuration with new --event_loop flag supporting auto, asyncio, and uvloop (#4971)
  • Annotation-based runtime defaults for MLServer (#5064)
  • INFERENCE_SERVICE_NAME environment variable exposed to serving containers (#5013)
  • Failure condition surfacing in InferenceService status (#5114)
  • Inference log batching with external marshalling support (#5061)

Infrastructure Updates

  • Kubernetes packages bumped to v0.34.0
  • Knative Serving updated to v1.21.1
  • Go updated to 1.25
  • Kubebuilder updated to 1.9.0
  • KEDA bumped from 2.16.1 to 2.17.3
  • MinIO replaced with SeaweedFS for testing infrastructure

🔒 Security Fixes

Multiple security vulnerabilities have been addressed:

  • CVE-2025-62727 (Starlette)
  • CVE-2025-22872, CVE-2025-47914, CVE-2025-58181
  • CVE-2024-43598 (LightGBM updated to 4.6.0)
  • CVE-2025-43859 (h11 HTTP parsing)
  • CVE-2025-66418 (decompression chain)
  • CVE-2025-68156 (expr-lang/expr)
  • CVE-2026-26007 (cryptography subgroup attack)
  • CVE-2026-24486 (python-multipart arbitrary file write)
  • Path traversal vulnerabilities in https.go and tar extraction

๐Ÿ” Release Notesโ€‹

For the complete list of all 167 merged pull requests, bug fixes, and known issues, visit the GitHub release pages.

๐Ÿ™ Acknowledgmentsโ€‹

We extend our gratitude to all 38+ contributors who made this release possible, including 21 first-time contributors. Your efforts continue to drive the advancement of KServe as a leading platform for serving machine learning models.

  • Core Contributors: The KServe maintainers and regular contributors
  • Community: Everyone who reported issues, provided feedback, and tested features
  • New Contributors: Welcome to all first-time contributors who helped shape this release

๐Ÿค Join the Communityโ€‹

We invite you to explore the new features in KServe v0.17 and contribute to the ongoing development of the project.

Happy serving!


The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!