Unexpected outage during AKS update window

Dean Ferley 45 Reputation points
2026-02-13T21:55:22.42+00:00

During our scheduled update window, our web application running on AKS regularly experiences brief outages. In the Change Analysis section of the Azure portal, we can see VMSS and Disk upgrades happening around the time of each outage.

Our VMSS are managed by AKS and run in Uniform orchestration mode with the Automatic upgrade policy. Once an update completes, the upgrade policy is set back to Manual.

We've heard that Max Surge is an important option, but it seems to be available only when the upgrade policy is set to Rolling.

Do you have any recommendations on how we can avoid future outages during our update window?

Thanks,

Dean

Azure Virtual Machine Scale Sets

Azure compute resources that are used to create and manage groups of heterogeneous load-balanced virtual machines.

1 answer

  1. Himanshu Shekhar 4,025 Reputation points Microsoft External Staff Moderator
    2026-02-13T23:33:38.1433333+00:00

    @Dean Ferley

    During your scheduled maintenance window, Azure Kubernetes Service (AKS) updates the underlying Virtual Machine Scale Set (VMSS) model (such as node image versions, extensions, or metadata).

    When AKS later reconciles differences between the VMSS model and existing nodes, it performs node upgrades using a cordon → drain → reimage process. During this process, nodes are temporarily taken out of service and pods are evicted.

    If your workloads don’t have enough ready replicas to absorb the temporary reduction in node capacity, this can appear as brief application interruptions.

    To prevent outages during node image or platform updates, Microsoft recommends:

    • Configure node pool Max Surge so that AKS adds extra nodes before draining existing ones, maintaining capacity during upgrades (commonly ~33% of node pool size for production).

    Example:

    az aks nodepool update --resource-group <rg> --cluster-name <aks> --name <pool> --max-surge 5

    • If surge isn’t possible due to quota or capacity limits, configure maxUnavailable to limit how many nodes are drained at once. This is supported but carries more risk than surge.

    • Define Pod Disruption Budgets (PDBs) for critical workloads to ensure a minimum number of replicas remain available during node upgrades and other voluntary disruptions.
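    As a rough illustration of the PDB recommendation above, the sketch below prints a minimal PodDisruptionBudget manifest. The helper name `render_pdb`, the PDB name `web-pdb`, the `app: web` label, and the value `2` are all assumptions for the example; replace them with the labels and replica counts of your own workload.

    ```shell
    # Hypothetical helper: print a minimal PodDisruptionBudget manifest.
    # Arguments: PDB name, value of the pod's "app" label, minimum available pods.
    render_pdb() {
        pdb_name=$1
        app_label=$2
        min_available=$3
        printf '%s\n' \
            'apiVersion: policy/v1' \
            'kind: PodDisruptionBudget' \
            'metadata:' \
            "  name: ${pdb_name}" \
            'spec:' \
            "  minAvailable: ${min_available}" \
            '  selector:' \
            '    matchLabels:' \
            "      app: ${app_label}"
    }

    # Example values only (assumed): keep at least 2 pods labelled app=web running.
    render_pdb web-pdb web 2
    ```

    The output can be reviewed and then piped to `kubectl apply -f -`. Keep `minAvailable` below the workload's replica count. If the two are equal, no pod can be evicted and the node drain is likely to stall.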

    These behaviors are expected during AKS upgrades, and following these reliability best practices helps ensure zero‑ or minimal‑downtime maintenance.
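    To get a feel for the ~33% Max Surge guidance mentioned earlier, the sketch below estimates how many extra nodes a percentage value adds for a given pool size. It assumes fractional nodes are rounded up when a percentage is used, which is my understanding of the AKS behavior. The helper name `surge_nodes` and the pool sizes are illustrative only.

    ```shell
    # Hypothetical helper: estimate surge nodes for a percentage max-surge value.
    # Assumption: fractional nodes are rounded up to the next whole node.
    # Arguments: current node count, surge percentage (without the % sign).
    surge_nodes() {
        node_count=$1
        surge_percent=$2
        # Integer ceiling of node_count * surge_percent / 100
        echo $(( (node_count * surge_percent + 99) / 100 ))
    }

    surge_nodes 3 33    # 3-node pool at 33%  -> prints 1
    surge_nodes 10 33   # 10-node pool at 33% -> prints 4
    ```

    The result is the extra capacity the subscription's quota must allow during the upgrade, on top of the existing node count.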

    Deployment and cluster reliability best practices for Azure Kubernetes Service (AKS) - https://learn.microsoft.com/en-us/azure/aks/best-practices-app-cluster-reliability

    If you have any further queries, let me know.

    Regards

    Himanshu
