Unexpected outage during AKS update window

Question

Dean Ferley 45 Reputation points

During our scheduled update window, our web application running on AKS experiences regular but brief outages. In the Change Analysis section of Azure, we see VMSS and Disk upgrades happening around the time of each outage.

Our VMSS are managed by AKS and run in Uniform orchestration mode with the Automatic upgrade policy. After an update completes, the upgrade policy is set back to Manual.

We've heard that Max Surge is an important option, but it seems to be available only with the Rolling upgrade policy.

Do you have any recommendations on how we can avoid future outages during our update window?

Thanks,

Dean

Himanshu Shekhar 4,025 Reputation points Microsoft External Staff Moderator

2026-02-16T23:32:27.2+00:00

@Dean Ferley Just checking whether the response provided was helpful. Please let me know if you have any questions.
Dean Ferley 45 Reputation points

2026-02-17T22:13:45.24+00:00

Thanks, Himanshu. We will give this a try and I'll let you know if it works.


Answer 1

@Dean Ferley

During your scheduled maintenance window, Azure Kubernetes Service (AKS) updates the underlying Virtual Machine Scale Set (VMSS) model (such as node image versions, extensions, or metadata).

When AKS later reconciles differences between the VMSS model and existing nodes, it performs node upgrades using a cordon → drain → reimage process. During this process, nodes are temporarily taken out of service and pods are evicted.

If your workloads don’t have enough ready replicas to absorb the temporary reduction in node capacity, this can appear as brief application interruptions.

To prevent outages during node image or platform updates, Microsoft recommends:

• Configure node pool Max Surge so that AKS adds extra nodes before draining existing ones, maintaining capacity during upgrades (commonly ~33% of node pool size for production).

Example:

az aks nodepool update --resource-group <rg> --cluster-name <aks> --name <pool> --max-surge 5

• If surge isn’t possible due to quota or capacity limits, configure maxUnavailable to limit how many nodes are drained at once. This is supported but carries more risk than surge.

• Define Pod Disruption Budgets (PDBs) for critical workloads to ensure a minimum number of replicas remain available during node upgrades and other voluntary disruptions.
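As a rough illustration of the PDB recommendation above, the snippet below writes a minimal PodDisruptionBudget manifest to a file. The name `web-pdb`, the label `app: web`, and the `minAvailable: 2` value are placeholders I've assumed for the example, not values from your cluster, so adjust them to match your own deployment.

```shell
# Write a minimal PodDisruptionBudget manifest to pdb.yaml.
# Assumptions (hypothetical values): the workload's pods carry the label
# app=web, and at least 2 pods must stay available during node drains.
cat > pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF
```

You would then apply it with `kubectl apply -f pdb.yaml`. Note that the deployment needs more replicas than `minAvailable` (3 or more in this sketch), otherwise the drain cannot evict any pod and the node upgrade can stall.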

These behaviors are expected during AKS upgrades, and following these reliability best practices helps ensure zero‑ or minimal‑downtime maintenance.
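To get a feel for what a percentage-based Max Surge such as the ~33% mentioned above means in node terms, here is a small sketch. It assumes, as I understand the AKS behavior, that fractional surge nodes are rounded up; `surge_nodes` is a helper name made up for this example and is not part of the Azure CLI.

```shell
# surge_nodes NODE_COUNT PERCENT
# Prints how many extra nodes a percentage max surge implies for a pool of
# NODE_COUNT nodes, rounding any fraction up (assumed AKS behavior).
surge_nodes() {
  local nodes=$1 percent=$2
  # Integer ceiling of nodes * percent / 100.
  echo $(( (nodes * percent + 99) / 100 ))
}

surge_nodes 10 33   # prints 4
surge_nodes 3 33    # prints 1
```

This is worth checking against your regional vCPU quota before the update window, since the surge nodes are created in addition to the existing pool.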

Deployment and cluster reliability best practices for Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/best-practices-app-cluster-reliability

If you have any further queries, let me know.

Regards

Himanshu
