During your scheduled maintenance window, Azure Kubernetes Service (AKS) updates the underlying Virtual Machine Scale Set (VMSS) model (such as node image versions, extensions, or metadata).
When AKS later reconciles differences between the VMSS model and existing nodes, it performs node upgrades using a cordon → drain → reimage process. During this process, nodes are temporarily taken out of service and pods are evicted.
If your workloads don’t have enough ready replicas to absorb the temporary reduction in node capacity, this can appear as brief application interruptions.
To prevent outages during node image or platform updates, Microsoft recommends the following:
• Configure max surge on the node pool so AKS adds extra nodes before draining existing ones, maintaining capacity during upgrades (commonly ~33% of node pool size for production). The value can be an absolute node count or a percentage.
Example (surge by 5 nodes; a percentage such as 33% is also accepted):
az aks nodepool update --resource-group <rg> --cluster-name <aks> --name <pool> --max-surge 5
• If surge isn’t possible due to quota or capacity limits, configure maxUnavailable to limit how many nodes are drained at once. This is supported but carries more risk than surge.
• Define Pod Disruption Budgets (PDBs) for critical workloads to ensure a minimum number of replicas remain available during node upgrades and other voluntary disruptions.
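As a rough illustration of the PDB recommendation, here is a minimal sketch that writes a PodDisruptionBudget manifest to a file. The names web-pdb and app: web, the file name pdb.yaml, and the minAvailable value of 2 are placeholders I have assumed for the example; replace them with the labels and replica counts of your own workload.

```shell
#!/bin/sh
# Sketch only: every name and value below is a placeholder, not taken from your cluster.
# Writes a PodDisruptionBudget that keeps at least 2 matching pods running
# during voluntary disruptions such as node drains.
cat > pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF

# Once the selector matches your Deployment's pod labels, apply it with:
# kubectl apply -f pdb.yaml
```

Note that the workload needs more replicas than minAvailable (for example, 3 replicas with minAvailable: 2). If the replica count equals minAvailable, no pod can be evicted and the node drain will stall.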
These behaviors are expected during AKS upgrades, and following these reliability best practices helps ensure zero‑ or minimal‑downtime maintenance.
Deployment and cluster reliability best practices for Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/best-practices-app-cluster-reliability
If you have any further queries, let me know.
Regards
Himanshu