# Announcing KServe v0.17 - Production-Ready LLM Serving with LLMInferenceService
Published on March 13, 2026
We are excited to announce the release of KServe v0.17, a landmark release that brings LLMInferenceService to production readiness with a GenAI-first architecture built on the llm-d framework. This release introduces KV-cache aware intelligent routing, disaggregated prefill-decode, distributed inference with tensor/data/expert parallelism, Envoy AI Gateway integration with token-based rate limiting, and a completely restructured modular Helm chart architecture.
## LLMInferenceService: GenAI-First Architecture
KServe v0.17 elevates LLMInferenceService from an experimental feature to a production-ready CRD purpose-built for generative AI workloads. Built on the llm-d framework, LLMInferenceService provides a GenAI-first architecture that goes beyond traditional InferenceService to address the unique challenges of serving large language models at scale.
Unlike InferenceService which is designed for predictive AI workloads, LLMInferenceService natively supports:
- Distributed inference across multiple nodes and GPUs
- KV-cache aware scheduling for intelligent request routing
- Disaggregated prefill-decode for optimal resource utilization
- Gateway Inference Extension (GIE) integration for advanced traffic management
- Token-based rate limiting via Envoy AI Gateway
| Feature | InferenceService | LLMInferenceService |
|---|---|---|
| Primary Use Case | Predictive AI | Generative AI |
| Routing | Standard Gateway | KV-cache aware with EPP |
| Parallelism | Worker Spec | TP, DP, EP native support |
| Prefill-Decode | N/A | Disaggregated separation |
| Scaling | HPA/KPA | WVA + KEDA |
```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-serving
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 3
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      pool: {}
```
This single resource creates the full serving stack, including the Deployment, Service, Gateway, HTTPRoute, InferencePool, InferenceModel, and EPP (Endpoint Picker Pod), all managed by the LLMInferenceService controller.
## Key LLMInferenceService Features in v0.17
### KV-Cache Aware Scheduling with Gateway Inference Extension
LLMInferenceService integrates with Gateway Inference Extension (GIE) v1.3.0, a Kubernetes SIG project that extends the Gateway API with AI-specific routing capabilities. At the heart of this integration is the Endpoint Picker Pod (EPP) from the llm-d inference scheduler, an intelligent scheduler that routes requests based on real-time KV-cache state rather than simple round-robin or random load balancing.
Traditional load balancing treats all LLM inference requests equally, but in practice, requests with similar prompts benefit enormously from being routed to the same pod, because that pod already has the relevant KV cache blocks loaded. The EPP solves this by tracking real-time KV cache state across all vLLM instances via ZMQ events (BlockStored, BlockRemoved) and building an index mapping {ModelName, BlockHash} → {PodID, DeviceTier}.
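As a rough sketch of that index, here is an illustrative Python version (assumptions for exposition only, not the llm-d implementation): block-store and block-remove events update a mapping from (model, block hash) to the set of pods holding that block, which the scheduler can then query for cache hits.

```python
# Illustrative sketch (not llm-d code) of the EPP's KV cache index:
# BlockStored/BlockRemoved events update which pods hold which blocks.
from collections import defaultdict


class KVCacheIndex:
    def __init__(self):
        # (model, block_hash) -> set of pod IDs holding that block
        self.index = defaultdict(set)

    def on_block_stored(self, model, block_hash, pod_id):
        self.index[(model, block_hash)].add(pod_id)

    def on_block_removed(self, model, block_hash, pod_id):
        self.index[(model, block_hash)].discard(pod_id)

    def cache_hits(self, model, prompt_block_hashes, pod_id):
        """Count how many of the request's prefix blocks are already on this pod."""
        return sum(pod_id in self.index[(model, h)] for h in prompt_block_hashes)


idx = KVCacheIndex()
idx.on_block_stored("llama3", "b1", "pod-a")
idx.on_block_stored("llama3", "b2", "pod-a")
idx.on_block_stored("llama3", "b1", "pod-b")
print(idx.cache_hits("llama3", ["b1", "b2"], "pod-a"))  # 2
```

A pod with more matching prefix blocks can skip that much prefill work, which is what drives the TTFT improvement described above.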
The scheduling behavior is configured through EndpointPickerConfig, which defines a plugin pipeline with weighted scorers:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: single-profile-handler
  - type: prefix-cache-scorer
  - type: load-aware-scorer
    parameters:
      threshold: 100
  - type: max-score-picker
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 2.0
      - pluginRef: load-aware-scorer
        weight: 1.0
      - pluginRef: max-score-picker
```
The pipeline uses three types of plugins (see llm-d scheduler architecture for details):
- prefix-cache-scorer (weight: 2.0): Tracks the actual KV cache contents across all vLLM instances and scores pods by how many cached prefix blocks match the incoming request's prompt. This reduces Time To First Token (TTFT) by avoiding redundant prefill computation for repeated or similar prompts, which is particularly beneficial for multi-turn conversations and RAG workloads.
- load-aware-scorer (weight: 1.0): Scores candidate pods based on their current queue depth. Pods with empty queues score 0.5, while pods with growing queues score progressively lower toward 0. The `threshold` parameter controls the sensitivity: when queue depth exceeds the threshold, the pod scores near zero.
- max-score-picker: After all scorers run, selects the pod with the highest weighted aggregate score.
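The weighted aggregation can be sketched in a few lines. The scoring formulas below are illustrative assumptions consistent with the descriptions above, not the actual llm-d scorer implementations:

```python
# Toy version of the weighted scorer pipeline: prefix-cache match (weight 2.0)
# plus a queue-depth score (weight 1.0), then max-score-picker takes the best pod.
def load_aware_score(queue_depth: int, threshold: int = 100) -> float:
    # Empty queue scores 0.5; the score decays toward 0 as the queue
    # approaches the configured threshold (assumed linear decay).
    return max(0.0, 0.5 * (1 - queue_depth / threshold))


def prefix_cache_score(matched_blocks: int, total_blocks: int) -> float:
    # Fraction of the request's prefix blocks already cached on the pod.
    return matched_blocks / total_blocks if total_blocks else 0.0


def pick_pod(pods: dict, total_blocks: int, threshold: int = 100) -> str:
    weighted = {
        name: 2.0 * prefix_cache_score(p["matched"], total_blocks)
        + 1.0 * load_aware_score(p["queue"], threshold)
        for name, p in pods.items()
    }
    # max-score-picker: highest weighted aggregate wins
    return max(weighted, key=weighted.get)


pods = {
    "pod-a": {"matched": 8, "queue": 40},  # warm cache, moderate load
    "pod-b": {"matched": 0, "queue": 0},   # cold cache, idle
}
print(pick_pod(pods, total_blocks=10))  # pod-a: cache affinity outweighs load
```

With the 2:1 weighting, pod-a scores 2.0 × 0.8 + 1.0 × 0.3 = 1.9 versus pod-b's 0.5, showing why cache affinity dominates unless a pod is heavily loaded.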
The EndpointPickerConfig can be provided inline in the LLMInferenceService spec or referenced from a ConfigMap, giving platform teams the flexibility to standardize scheduling behavior across deployments:
```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-with-scheduler
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 4
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      config:
        ref:
          name: custom-endpoint-picker-config
          key: endpoint-picker-config.yaml
      pool: {}
```
The GIE CRDs (InferencePool and InferenceModel) are now bundled as part of the KServe installation, simplifying setup.
### Disaggregated Prefill-Decode
LLMInferenceService natively supports disaggregated prefill-decode, which separates the compute-intensive prefill phase from the memory-intensive decode phase into independent workloads. This allows each phase to be scaled and optimized independently.
```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-prefill-decode
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 2
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
  prefill:
    replicas: 2
    template:
      spec:
        containers:
          - name: vllm
            resources:
              limits:
                nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      pool: {}
```
KV cache data is transferred between prefill and decode pods using NixlConnector over RDMA (RoCE) for high-throughput, low-latency block transfers.
### Distributed Inference: Tensor, Data, and Expert Parallelism
LLMInferenceService introduces a comprehensive parallelism specification for distributed inference across multiple nodes and GPUs using LeaderWorkerSet:
- Tensor Parallelism (TP): Splits model layers across GPUs within a node
- Data Parallelism (DP): Runs multiple model replicas for higher throughput
- Data-Local Parallelism: Controls GPUs per node for optimal NUMA affinity
- Expert Parallelism (EP): Distributes Mixture-of-Experts (MoE) model experts across GPUs
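For the example below (tensor: 4, data: 8, dataLocal: 4), a quick back-of-envelope check shows how these settings compose. This is one plausible reading of the fields based on the bullet list above; the function names are illustrative, not the CRD API:

```python
# Back-of-envelope check for tensor=4, data=8, dataLocal=4.
# Assumption: dataLocal is the number of GPUs placed on each node,
# as described in the "Data-Local Parallelism" bullet above.
def gpu_topology(tensor: int, data: int, data_local: int) -> dict:
    total_gpus = tensor * data        # every data-parallel rank runs a TP group
    nodes = total_gpus // data_local  # dataLocal GPUs scheduled per node
    return {"total_gpus": total_gpus, "nodes": nodes, "gpus_per_node": data_local}


topo = gpu_topology(tensor=4, data=8, data_local=4)
print(topo)  # {'total_gpus': 32, 'nodes': 8, 'gpus_per_node': 4}
```

Under that reading, 32 GPUs spread across 8 nodes at 4 GPUs each lines up with `replicas: 8` and `nvidia.com/gpu: "4"` in the manifest below.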
```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-multi-node
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-70B-Instruct
    name: meta-llama--Llama-3.1-70B-Instruct
  replicas: 8
  parallelism:
    tensor: 4
    data: 8
    dataLocal: 4
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "4"
  worker:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "4"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      pool: {}
```
### Envoy AI Gateway Integration with Token-Based Rate Limiting

LLMInferenceService integrates with Envoy AI Gateway for AI-native traffic management. This enables token-based rate limiting, a capability critical for LLM serving, where request cost varies dramatically with input and output token counts rather than simple request counts.
```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-route
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llama3-serving
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-rate-limit
spec:
  targetRefs:
    - group: aigateway.envoyproxy.io
      kind: AIGatewayRoute
      name: llm-route
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 1000
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              key: llm_total_token
```
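The accounting this policy implies can be sketched as follows. This is an assumed simplification for exposition, not Envoy's rate-limit implementation: requests cost 0 at admission time, and after each response the `llm_total_token` value is charged against a per-user hourly budget of 1000:

```python
# Sketch (assumption, not Envoy code) of token-based rate limiting:
# admission is free; response token counts are charged against the budget.
from collections import defaultdict


class TokenRateLimiter:
    def __init__(self, limit: int = 1000):
        self.limit = limit
        self.used = defaultdict(int)  # user -> tokens consumed this window

    def admit(self, user: str) -> bool:
        # Request cost is 0 (from: Number, number: 0), so admission only
        # checks whether the budget is already exhausted.
        return self.used[user] < self.limit

    def charge_response(self, user: str, total_tokens: int):
        # Response cost read from the llm_total_token metadata key.
        self.used[user] += total_tokens


rl = TokenRateLimiter(limit=1000)
print(rl.admit("alice"))         # True
rl.charge_response("alice", 990)
print(rl.admit("alice"))         # True (990 < 1000)
rl.charge_response("alice", 50)
print(rl.admit("alice"))         # False (budget exhausted)
```

This is why a user can burn through an hourly budget with a handful of long-context requests, something request-count limits cannot express.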
### Autoscaling API with WVA Support
A new autoscaling API has been added to LLMInferenceService with support for the Workload Variant Autoscaler (WVA), a Kubernetes-based global autoscaler designed specifically for LLM inference workloads. Traditional CPU/memory-based autoscaling is inadequate for LLMs because inference cost is driven by token throughput, KV cache utilization, and queue depth rather than CPU or memory usage.
WVA continuously monitors inference server metrics via Prometheus, specifically KV cache utilization and queue depth, to determine when servers are approaching saturation. It then computes a `wva_desired_replicas` metric and emits it to Prometheus, where an actuator backend (HPA or KEDA) reads it to drive the actual scaling:

- WVA + KEDA: Queries Prometheus directly for the `wva_desired_replicas` metric. Does not require Prometheus Adapter. Supports idle scale-to-zero via `idleReplicaCount`.
- WVA + HPA: Reads the `wva_desired_replicas` metric via the Kubernetes Metrics API. Requires Prometheus Adapter. Supports standard HPA scaling behaviors.
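To make the signal-to-replica step concrete, here is a deliberately simplified heuristic. WVA's actual algorithm is more sophisticated; this sketch only illustrates the idea of deriving a replica count from saturation pressure on KV cache and queues:

```python
# Hedged illustration (not WVA code): derive a desired replica count from
# KV cache utilization and queue depth scraped from Prometheus.
import math


def desired_replicas(current: int, kv_util: float, queue_depth: float,
                     kv_target: float = 0.8, queue_target: float = 10.0,
                     min_r: int = 1, max_r: int = 10) -> int:
    # "Pressure" is how far the worst signal exceeds its target level.
    pressure = max(kv_util / kv_target, queue_depth / queue_target)
    return min(max_r, max(min_r, math.ceil(current * pressure)))


# Servers near saturation: scale out before queues blow up.
print(desired_replicas(current=4, kv_util=0.9, queue_depth=14))  # 6
```

The resulting number would then be emitted as the `wva_desired_replicas` metric for KEDA or HPA to act on.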
A key concept in WVA is the variant: a specific deployment configuration (hardware, runtime, parallelism strategy) for serving a model. The same base model might be served by multiple variants: for example, Llama-3 on A100 GPUs with TP=4 is one variant, while Llama-3 on H100 GPUs with TP=2 is another. The `variantCost` field specifies the relative cost per replica for each variant, enabling WVA to make cost-aware scaling decisions across variants: it scales up the cheaper variant first when demand increases, and scales down the most expensive variant first when demand decreases.
```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-wva-autoscaling
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  scaling:
    minReplicas: 1
    maxReplicas: 10
    wva:
      variantCost: "15.0"
    keda:
      pollingInterval: 30
      cooldownPeriod: 300
      initialCooldownPeriod: 120
      idleReplicaCount: 0
      fallback:
        failureThreshold: 3
        replicas: 2
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      pool: {}
```
In the example above, `variantCost: "15.0"` indicates the relative cost of running each replica of this variant. If another variant of the same model has `variantCost: "5.0"`, WVA would prefer to add capacity on that cheaper variant before scaling up this one. The default value is "10.0" if not specified. When using the KEDA backend, the `fallback` field ensures the deployment maintains a minimum replica count (here, 2 replicas) even if the metrics pipeline fails, a critical safety net for production LLM deployments.
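The cost-aware selection rule described above can be sketched like this. The behavior is assumed from the description in this post (scale up the cheapest variant first, scale down the most expensive first); the function names and variant data are illustrative:

```python
# Illustrative sketch (not WVA source) of cost-aware variant selection.
def scale_up_target(variants: dict) -> str:
    # variants: name -> {"cost": float, "replicas": int, "max": int}
    candidates = {n: v for n, v in variants.items() if v["replicas"] < v["max"]}
    # Cheapest per-replica cost gets new capacity first.
    return min(candidates, key=lambda n: candidates[n]["cost"])


def scale_down_target(variants: dict) -> str:
    candidates = {n: v for n, v in variants.items() if v["replicas"] > 0}
    # Most expensive variant sheds capacity first.
    return max(candidates, key=lambda n: candidates[n]["cost"])


variants = {
    "llama3-a100-tp4": {"cost": 15.0, "replicas": 2, "max": 10},
    "llama3-h100-tp2": {"cost": 5.0, "replicas": 1, "max": 10},
}
print(scale_up_target(variants))    # llama3-h100-tp2 (cheaper per replica)
print(scale_down_target(variants))  # llama3-a100-tp4 (most expensive)
```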
### Scheduler High Availability
The LLMInferenceService scheduler (EPP) now supports scaling and high availability, allowing multiple EPP replicas for production deployments that require fault tolerance and higher routing throughput.
### CRD Webhook Validation
LLMInferenceService now includes CRD webhook validation with comprehensive E2E tests, providing early feedback on invalid configurations before they reach the controller. This catches errors in parallelism settings, workload specifications, and router configurations at admission time.
### Configuration Composition with LLMInferenceServiceConfig
LLMInferenceService supports a configuration composition model through LLMInferenceServiceConfig, enabling reusable templates that can be shared across multiple LLMInferenceService resources. The merge order follows:
1. Well-Known Configs
2. Explicit BaseRefs
3. LLMInferenceService Spec
This allows platform teams to define standardized vLLM worker templates, router/scheduler configurations, and resource defaults while giving application teams the ability to override specific settings.
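A minimal sketch of that layered merge, assuming the common semantics that later layers win on conflicts and nested maps merge recursively (an assumption for illustration; the controller's exact merge rules may differ, and the config keys below are made up):

```python
# Minimal sketch of three-layer config composition: well-known config,
# then explicit BaseRefs, then the LLMInferenceService spec itself.
def deep_merge(base: dict, override: dict) -> dict:
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)  # merge nested maps recursively
        else:
            out[k] = v                      # later layer wins on conflicts
    return out


well_known = {"template": {"image": "vllm/vllm-openai:v0.17", "gpu": "1"}}
base_ref = {"template": {"gpu": "4"}}                        # platform override
spec = {"template": {"image": "my-registry/vllm:custom"}}    # app-team override

merged = well_known
for layer in (base_ref, spec):
    merged = deep_merge(merged, layer)
print(merged)  # {'template': {'image': 'my-registry/vllm:custom', 'gpu': '4'}}
```

The platform default survives wherever the later layers stay silent, which is exactly the override model described above.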
### Additional LLMInferenceService Improvements
- Label and annotation propagation to downstream workload resources (#5009)
- Prometheus annotation propagation to workloads for metrics collection (#5086)
- Certificate management with DNS/IP SAN and automatic renewal for self-signed certs (#5099)
- Improved CA bundle management for secure communication (#4803)
- Optional storageInitializer: skip model download when using pre-loaded models (#4970)
- InferencePool auto-migration for seamless upgrades (#5007)
- Route-only completions through InferencePool for chat/completion endpoints (#5087)
- Startup probes for vLLM containers for more reliable health monitoring (#5063)
- vLLM arguments migrated to command field for cleaner configuration (#5049)
- Versioned well-known config resolution for stable config management (#5096)
- Scheduler config via ConfigMap or inline for flexible configuration (#4856)
- Pod init container failure monitoring for better observability (#5034)
- Preserve externally managed replicas during reconciliation (#4996)
- Allow stopping LLMInferenceService gracefully (#4839)
- Enhanced Gateway API URL discovery with listener hostname fallback (#5104, #5079)
## Modular Component Architecture
KServe v0.17 introduces a fundamental architectural shift toward modular, component-based deployment. KServe now consists of three independent components:
- kserve (core): Manages InferenceService, ServingRuntime, ClusterServingRuntime, InferenceGraph, and TrainedModel CRDs.
- llmisvc: The LLMInferenceService controller for generative AI workloads, managing LLMInferenceService and LLMInferenceServiceConfig CRDs.
- localmodel (optional): The LocalModel controller for efficient model caching with LocalModelCache, LocalModelNode, and LocalModelNodeGroup CRDs.
| Combination | Use Case | Components |
|---|---|---|
| KServe Only | Predictive AI | kserve |
| KServe + LLMIsvc | Predictive AI + Generative AI | kserve + llmisvc |
| Full Stack | Predictive AI + Generative AI + Model Caching | kserve + llmisvc + localmodel |
### Helm Chart Restructuring
To support the new component architecture, the Helm charts have been completely restructured from a single chart into 10 independent Helm charts:
CRD Charts (6 charts with full and minimal variants):
- `kserve-crd` / `kserve-crd-minimal`
- `kserve-llmisvc-crd` / `kserve-llmisvc-crd-minimal`
- `kserve-localmodel-crd` / `kserve-localmodel-crd-minimal`
Resource Charts (4 charts):
- `kserve-resources` (renamed from `kserve`)
- `kserve-llmisvc-resources` (new)
- `kserve-localmodel-resources` (new)
- `kserve-runtime-configs` (new; manages ClusterServingRuntimes and LLMIsvcConfigs)
This is a breaking change. Users upgrading from v0.16 cannot use a simple `helm upgrade` command. Please follow the detailed upgrade guide for step-by-step migration instructions. We strongly recommend testing the upgrade in a non-production environment first.
For fresh installations, the new Kustomize component-based architecture also provides composable deployment options via standalone overlays, addon overlays, and all-in-one overlays. See the installation concepts for details.
## InferenceService and Platform Improvements

### Storage Performance
- Parallelized blob downloads from Azure and S3 for faster model loading (#4709, #4714)
- Faster parallel S3 downloads with configurable file selection (#5102, #5119)
- Git repository support for downloading models directly from Git repos via HTTPS (#4966)
### New Serving Runtimes
- OpenVINO Model Server: Intel's optimized inference runtime for high-performance serving on Intel hardware (#4592)
- PredictiveServer runtime with full build/publish infrastructure and E2E testing (#4954)
### Gateway & Routing
- Gateway API upgraded to v1.4.0 (#5038)
- PathTemplate configuration for flexible inference service routing (#4817)
### vLLM Backend

### Additional Enhancements
- CSV and Parquet marshallers for expanded data format support (#5115)
- Event loop configuration with a new `--event_loop` flag supporting `auto`, `asyncio`, and `uvloop` (#4971)
- Annotation-based runtime defaults for MLServer (#5064)
- `INFERENCE_SERVICE_NAME` environment variable exposed to serving containers (#5013)
- Failure condition surfacing in InferenceService status (#5114)
- Inference log batching with external marshalling support (#5061)
### Infrastructure Updates
- Kubernetes packages bumped to v0.34.0
- Knative Serving updated to v1.21.1
- Go updated to 1.25
- Kubebuilder updated to 1.9.0
- KEDA bumped from 2.16.1 to 2.17.3
- MinIO replaced with SeaweedFS for testing infrastructure
## Security Fixes
Multiple security vulnerabilities have been addressed:
- CVE-2025-62727 (Starlette)
- CVE-2025-22872, CVE-2025-47914, CVE-2025-58181
- CVE-2024-43598 (LightGBM updated to 4.6.0)
- CVE-2025-43859 (h11 HTTP parsing)
- CVE-2025-66418 (decompression chain)
- CVE-2025-68156 (expr-lang/expr)
- CVE-2026-26007 (cryptography subgroup attack)
- CVE-2026-24486 (python-multipart arbitrary file write)
- Path traversal vulnerabilities in https.go and tar extraction
## Release Notes

For the complete list of all 167 merged pull requests, bug fixes, and known issues, visit the GitHub release page.
## Acknowledgments
We extend our gratitude to all 38+ contributors who made this release possible, including 21 first-time contributors. Your efforts continue to drive the advancement of KServe as a leading platform for serving machine learning models.
- Core Contributors: The KServe maintainers and regular contributors
- Community: Everyone who reported issues, provided feedback, and tested features
- New Contributors: Welcome to all first-time contributors who helped shape this release
## Join the Community
We invite you to explore the new features in KServe v0.17 and contribute to the ongoing development of the project:
- Visit our Website or GitHub
- Join the Slack (#kserve)
- Attend our community meeting by subscribing to the KServe calendar.
- View our community GitHub repository to learn how to contribute. We are excited to work with you to make KServe better and promote its adoption!
Happy serving!
The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!
