Edit

Share via


Control upgrade failure behavior

This guide describes the Azure Operator Service Manager (AOSM) upgrade failure behavior features for container network functions (CNFs). For faster retries, use pause on failure. To return to the starting point, use rollback on failure.

Pause on failure overview

Any upgrade using AOSM starts with a site network service (SNS) reput operation. The reput operation processes the network function applications (nfApps) found in the network function design version (NFDV). The reput operation implements the following default logic:

  • A user initiates an SNS reput operation with pause-on-failure enabled.
  • nfApps are processed following either updateDependsOn ordering, or in the sequential order they appear.
  • If an nfApp install or upgrade operation fails, the atomic rollback setting for that operation and nfApp is honored.
  • No prior completed NfApps are further operated upon.
  • The task terminates and leaves the SNS resource in a failed state.

With pause on failure, AOSM rolls back only the failed nfApp, via the testOptions, installOptions, or upgradeOptions operation parameters. No action is taken on any nfApps proceeding the failed nfApp. This method allows the end user to troubleshoot the failed nfApp and then restart the upgrade from that point forward. As the default behavior, this method is the most efficient method, but may cause network function (NF) inconsistencies while in a mixed version state.

Upgrade successful

An upgrade is considered successful if all nfApps reach the desired target state without generating helm install or helm upgrade failures. In such conditions, Azure Operator Service Manager returns the following operational status and message:

  - Upgrade Succeeded
    - Provisioning State: Succeeded
    - Message: <empty>

Upgrade unsuccessful

An upgrade is considered unsuccessful if any nfApp generates a helm install or helm upgrade failure. In such conditions, Azure Operator Service Manager returns the following operational status and message:

  - Upgrade Failed
    - Provisioning State: Succeeded
    - Message: Application(<ComponentName>) : <Failure Reason>

Rollback on failure overview

To address risk of mismatched nfApp versions, Azure Operator Service Manager supports NF level rollback on failure. With this option enabled, if an nfApp operation fails, both the failed nfApp, and all prior completed nfApps, can be rolled back to initial version state. This method minimizes, or eliminates, the amount of time the NF is exposed to nfApp version mismatches. The optional rollback on failure feature works as follows:

  • A user initiates an SNS reput operation with rollback on failure enabled.
  • nfApps are processed following either updateDependsOn ordering, or in the sequential order they appear.
  • Atomic state for all NfApps is forced to true, any operator provided values are ignored.
  • A snapshot of the current nfApp versions is captured and stored.
  • The snapshot is used to determine the individual nfApp actions taken to reverse actions that completed successfully.
    • helm install action on deleted components,
    • helm rollback action on upgraded components,
    • helm delete action on newly installed components
  • If an nfApp install or upgrade operation fails, an atomic rollback of the failed nfApp is executed first.
  • After the atomic rollback, the prior completed NfApps are restored to original snapshot version, with most recent actions reverted first.
  • The task terminates and leaves the SNS resource in a failed state.

Note

  • AOSM doesn't create a snapshot if a user doesn't enable rollback on failure.
  • A rollback on failure only applies to the successfully completed nfApps.
  • An error with the atomic rollback isn't treated as a rollback failure.

Upgrade successful

An upgrade is considered successful if all nfApps reach the desired target state without generating helm install or helm upgrade failures. In such conditions, Azure Operator Service Manager returns the following operational status and message:

  - Upgrade Succeeded
    - Provisioning State: Succeeded
    - Message: <empty>

Rollback successful

A rollback is considered successful if all prior completed NfApps reached the original snapshot state without generating a helm rollback failure. In such conditions, Azure Operator Service Manager returns the following operational status and message:

  - Upgrade Failed, Rollback Succeeded
    - Provisioning State: Failed
    - Message: Application(<ComponentName>) : <Failure Reason>; Rollback succeeded

Rollback unsuccessful

A rollback is considered unsuccessful if any prior completed nfApps fail to reach the original snapshot state, instead generating a helm rollback failure. In such conditions, Azure Operator Service Manager stops processing any further rollback-eligible nfApps and terminates with the following operational status and message:

  - Upgrade Failed, Rollback Failed
    - Provisioning State: Failed
    - Message: Application(<ComponentName>) : <Failure reason>; Rollback Failed (<RollbackComponentName>) : <Rollback Failure reason>

Configure rollback on failure

The most flexible method to control failure behavior is to extend a new configuration group schema (CGS) parameter, rollbackEnabled, to allow for configuration group value (CGV) control via roleOverrideValues in the NF payload. First, define the CGS parameter:

{
  "description": "NF configuration",
  "type": "object",
  "properties": {
    "nfConfiguration": {
      "type": "object",
      "properties": {
        "rollbackEnabled": {
          "type": "boolean"
        }
      },
      "required": [
        "rollbackEnabled"
      ]
    }
  }
}

Note

  • If the nfConfiguration isn't provided through the roleOverrideValues parameter, by default the rollback is disabled.

With the new rollbackEnable parameter defined, the Operator can now provide a run time value, under roleOverrideValues, as part of NF reput payload.

example:
{
  "location": "eastus",
  "properties": {
    // ...
    "roleOverrideValues": [
          "{\"nfConfiguration\":{\"rollbackEnabled\":true}}",
            "{\"name\":\"nfApp1\",\"deployParametersMappingRuleProfile\":{\"applicationEnablement\" : \"Disabled\"}}",
            "{\"name\":\"nfApp2\",\"deployParametersMappingRuleProfile\":{\"applicationEnablement\" : \"Disabled\"}}",
          //... other nfApps overrides
       ]
  }
}

Note

  • Each roleOverrideValues entry overrides the default behavior of the NfAapps.
  • If multiple entries of nfConfiguration are found in the roleOverrideValues, then the NF reput is returned as a bad request.

Manage nfApps that don't support rollback

Almost all publishers have some nfApps which aren't compatible with helm rollback. These nfApps maybe sourced from third-parties who don't commonly support strict resiliency requirements. These nfApps maybe database applications with complicated schema management requirements. Consider the following restrictions when onboarding services with nfApps that don't support rollback.

  • nfApps that don't support rollback can't be skipped.
  • nfApp rollback order can't change.
  • Incremental-NFDV approach must be used in these situations.

Selective rollback using incremental NFDVs

A network function’s composition may include nfAppa that don't support a helm rollback. Known examples are Elastic and VoltDb. An attmept to rollback one of these nfApps will break the nfApp. Pursuing publisher enhancements, to make these nfApps rollback complaint, is the best solution. A paramter to skip rollback is not supported as it introduces the risk of a deployed state not defined in a NFDV. This nondeterministic behavior increases the testing surface area significantly and undermines reliability guarantees of deployments. Instead, the incremental NFDV method enables selective rollback execution while ensuring deterministic deployment states.

Incremental NFDV approach

It's recommended that publishers use a combination of applicationEnablement, skipUpgrade and nfRollbackEnabled configurations in CGVs, along with multiple NFDVs, to logically segment nfApps into sets based on rollback compatibility. This incremental NFDV strategy allows operators to break deployments down into multiple operatons, bypassing rollback for select charts while preserving rollback for the rest. This approach effectively simulates per-chart rollback behavior using NFDV-level constructs. Consider the following example where a network function is composed of 20 nfApps with five nfApps that don't support rollback.

  • NFDV1
    • Performs initial verions 1 install.
      • Contains all 20 nfApps in an enabled state.
    • In CGV1: rollbackEnabled: true.
      • On the first install, a failure deletes charts and does not use rollback.
  • NFDV2:
    • Performs first step upgrade to version 2.
      • Contains all 20 nfApps but enable only the five nfApps without rollback support.
    • In CGV2:
      • Use skipUpgrade: true for the 15 nfApps with rollback supprt.
      • Set nfRollbackEnabled: false.
        • On success, only five nfApps are upgraded.
        • On failure, no rollback is performed.
          • Due to chart limitations, the workload is left in a nondeterministic state. No rollback is possible. To recover, there are two options:
            • Fix NFDV2 and try the upgrade again.
            • Downgrade to NFDV1 with skipUpgrade: false
  • NFDV3:
    • Performs second step upgrade to version 2
      • Contains all 20 nfApps but enable only the 15 nfApps with rollback support.
    • In CGV3:
      • Use skipUpgrade: true for the 5 nfApps previous upgraded via NFDV2.
      • Set nfRollbackEnabled: true.
        • On success, the remaining 15 nfApps are upgraded
        • On failure, a rollback occurs to restore the starting state.

Note

  • The five rollback-incompatible charts must not have runtime upgrade dependencies on charts in NFDV3.
  • AOSM's rollback design assumes that rollback restores the workload state to the previous NFDV state.

This approach providers cleaner separation and manageability of applications not supporting standard helm operations. Maintains the operation’s idempotency and state on the cluster reflected by the last operation. NFDV 2/3 can directly be used for install operations as well (installation of previous version not needed) with any difference in goal state. Overall upgrade time and deployment reliability remain the same.

Troubleshoot rollback on failure

Understand pod states

Understanding the different pod states is crucial for effective troubleshooting. The following are the most common pod states:

  • Pending: Pod scheduling is in progress by Kubernetes.
  • Running: All containers in the pod are running and healthy.
  • Failed: One or more containers in the pod are terminated with a nonzero exit code.
  • CrashLoopBackOff: A container within the pod is repeatedly crashing and Kubernetes is unable to restart it.
  • ContainerCreating: Container creation is in progress by the container runtime.

Check pod status and logs

First start by checking pod status and logs using a kubectl command:

$ kubectl get pods
$ kubectl logs <pod-name>

The get pods command lists all the pods in the current namespace, along with their current status. The logs command retrieves the logs for a specific pod, allowing you to inspect any errors or exceptions. To troubleshoot networking problems, use the following commands:

$ kubectl get services
$ kubectl describe service <service-name>

The get services command displays all the services in the current namespace. The command provides details about a specific service, including the associated endpoints, and any relevant error messages. If you're encountering issues with PVCs, you can use the following commands to debug them:

$ kubectl get persistentvolumeclaims
$ kubectl describe persistentvolumeclaims <pvc-name>

The "get persistentvolumeclaims" command lists all the PVCs in the current namespace. The describe command provides detailed information about a specific PVC, including the status, associated storage class, and any relevant events or errors.