This article provides best practices for monitoring and interpreting GPU signals on Azure Kubernetes Service (AKS). Instead of looking at NVIDIA GPU metrics in isolation, you correlate signals across utilization, memory, and workload context to improve long-term performance and node efficiency.
Important
AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:
- AKS support policies
- Azure support FAQ
Understand GPU utilization versus saturation
Don't treat the NVIDIA DCGM metric DCGM_FI_DEV_GPU_UTIL as a direct efficiency score. DCGM_FI_DEV_GPU_UTIL only indicates how often kernels are active, so it doesn't tell you whether the workload is compute-efficient. You get more accurate guidance by correlating utilization signals instead of reading them independently. Compare DCGM_FI_DEV_GPU_UTIL with DCGM_FI_PROF_SM_ACTIVE, and then compare DCGM_FI_PROF_SM_ACTIVE with DCGM_FI_PROF_DRAM_ACTIVE to identify whether your bottleneck is compute, memory, or launch and synchronization overhead.
High DCGM_FI_DEV_GPU_UTIL with low DCGM_FI_PROF_SM_ACTIVE often points to launch overhead, synchronization stalls, or memory contention. High DCGM_FI_PROF_SM_ACTIVE with low DCGM_FI_PROF_DRAM_ACTIVE is more consistent with compute-bound behavior. Higher DCGM_FI_PROF_DRAM_ACTIVE with lower DCGM_FI_PROF_SM_ACTIVE usually points to memory-bound execution.
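As a minimal sketch of how you might apply these heuristics programmatically, the following Python snippet correlates the three fields in one place. It assumes the DCGM exporter metrics are scraped into a Prometheus-compatible endpoint; the endpoint URL and the thresholds are illustrative assumptions to tune for your workloads, not AKS defaults.

```python
# Minimal sketch: classify a GPU fleet's likely bottleneck from DCGM metrics.
# Assumes dcgm-exporter metrics land in Prometheus; the endpoint URL and
# thresholds below are illustrative assumptions, not AKS defaults.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"  # assumption

def query_avg(metric: str, window: str = "5m") -> float:
    """Return the cluster-wide average of a DCGM metric over a window."""
    promql = f"avg(avg_over_time({metric}[{window}]))"
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

util = query_avg("DCGM_FI_DEV_GPU_UTIL") / 100.0    # percent -> ratio
sm_active = query_avg("DCGM_FI_PROF_SM_ACTIVE")     # already a 0..1 ratio
dram_active = query_avg("DCGM_FI_PROF_DRAM_ACTIVE") # already a 0..1 ratio

# Heuristic thresholds are assumptions; tune them for your workloads.
if util > 0.8 and sm_active < 0.4:
    print("Likely launch overhead, sync stalls, or memory contention.")
elif sm_active > 0.7 and dram_active < 0.3:
    print("Consistent with compute-bound behavior.")
elif dram_active > 0.5 and sm_active < 0.4:
    print("Likely memory-bound execution.")
else:
    print("No dominant bottleneck signature; inspect per-workload traces.")
```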
Note
DCGM_FI_PROF_SM_ACTIVE and DCGM_FI_PROF_DRAM_ACTIVE are DCGM profiling fields, so they might not be available by default on every NVIDIA GPU architecture offered in Azure Virtual Machine (VM) sizes.
This correlation-first approach helps you avoid scaling out when the root issue might be kernel efficiency or memory access patterns. For detailed metric semantics, see the NVIDIA DCGM user guide.
Use memory pressure as a primary scheduling signal
If memory repeatedly approaches out-of-memory thresholds, treat that pattern as an early indicator of instability. Kubernetes has no native GPU-memory pressure signal, so VRAM exhaustion typically surfaces only as container OOM kills and pod disruption, often well after DCGM telemetry shows the trend.
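As a starting point, the following sketch flags GPUs whose framebuffer usage trends toward exhaustion, using the DCGM fields DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE (reported in MiB). The Prometheus endpoint and the 90% threshold are assumptions to tune for your environment.

```python
# Minimal sketch: flag GPUs approaching framebuffer exhaustion before
# container OOM kills surface. DCGM_FI_DEV_FB_USED/FB_FREE report MiB;
# the Prometheus endpoint and 90% threshold are assumptions.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"  # assumption

def instant_vector(promql: str) -> list[dict]:
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Fraction of framebuffer in use, per GPU, averaged over 10 minutes so a
# sustained climb (not a transient spike) triggers the signal.
promql = (
    "avg_over_time(DCGM_FI_DEV_FB_USED[10m]) / "
    "(avg_over_time(DCGM_FI_DEV_FB_USED[10m]) + "
    "avg_over_time(DCGM_FI_DEV_FB_FREE[10m]))"
)

for series in instant_vector(promql):
    labels = series["metric"]
    used_fraction = float(series["value"][1])
    if used_fraction > 0.90:  # assumed threshold
        node = labels.get("Hostname", "unknown")
        gpu = labels.get("gpu", "unknown")
        print(f"GPU {gpu} on {node}: {used_fraction:.0%} VRAM used; "
              "reschedule or right-size before OOM kills occur.")
```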
Automate node lifecycle actions from GPU health signals
Use GPU health signals, such as DCGM health checks and XID error counts, to trigger automated cordon, drain, and repair actions instead of waiting for workloads to fail. This practice is especially important for long-lived AKS GPU node pools, where host aging can vary across nodes. A minimal sketch of this automation follows.
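The sketch below assumes dcgm-exporter publishes DCGM_FI_DEV_XID_ERRORS with a Hostname label matching the node name, and that the code runs in-cluster with RBAC permission to patch nodes. Draining and node repair are left to a separate process.

```python
# Minimal sketch: cordon a node when its GPUs report recent XID errors,
# so a repair pipeline or the autoscaler can replace it. Assumes
# dcgm-exporter publishes DCGM_FI_DEV_XID_ERRORS with a Hostname label
# and that this runs in-cluster with RBAC permission to patch nodes.
import requests
from kubernetes import client, config

PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"  # assumption

def nodes_with_xid_errors() -> set[str]:
    promql = "max by (Hostname) (DCGM_FI_DEV_XID_ERRORS)"
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return {
        s["metric"]["Hostname"]
        for s in resp.json()["data"]["result"]
        if float(s["value"][1]) > 0  # nonzero = most recent XID error code
    }

config.load_incluster_config()
v1 = client.CoreV1Api()
for node in nodes_with_xid_errors():
    # Mark the node unschedulable; drain/repair is handled elsewhere.
    v1.patch_node(node, {"spec": {"unschedulable": True}})
    print(f"Cordoned {node} after GPU XID errors.")
```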
Align observability signals with scaling decisions
For horizontal scaling, drive autoscaling from sustained workload-level GPU signals, for example with KEDA scalers backed by DCGM metrics, rather than from CPU-based defaults. For vertical scaling, create a new node pool on a different Azure GPU-enabled VM SKU and migrate workloads when power or thermal constraints cap throughput, for example when DCGM_FI_DEV_POWER_USAGE stays near its limit while DCGM_FI_PROF_SM_ACTIVE remains flat despite demand.
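A minimal sketch of that power-capped check follows. It assumes DCGM_FI_DEV_POWER_USAGE and DCGM_FI_DEV_POWER_MGMT_LIMIT are both enabled in the exporter's metric list (the limit field isn't always exported by default), and the 95% margin is an assumption.

```python
# Minimal sketch: detect the "power-capped but demand-flat" signature that
# suggests migrating to a different GPU VM SKU. Assumes both power fields
# are enabled in the dcgm-exporter metric list; thresholds are assumptions.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"  # assumption

def scalar(promql: str) -> float:
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

# Sustained power draw as a fraction of the enforced power limit.
power_fraction = scalar(
    "avg(avg_over_time(DCGM_FI_DEV_POWER_USAGE[30m])) / "
    "avg(avg_over_time(DCGM_FI_DEV_POWER_MGMT_LIMIT[30m]))"
)
# Change in SM activity over the same window: flat SM_ACTIVE while power
# sits at the cap means scaling out on this SKU won't add throughput.
sm_delta = scalar("avg(delta(DCGM_FI_PROF_SM_ACTIVE[30m]))")

if power_fraction > 0.95 and abs(sm_delta) < 0.05:
    print("Power-capped with flat SM activity: consider a node pool on a "
          "different GPU VM SKU instead of scaling out.")
```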
Separate MIG and non-MIG observability policies
When MIG is enabled, DCGM reports most metrics per GPU instance rather than per physical device, so the scope of each signal shifts: a single instance can be saturated while device-level aggregates still look idle. Define separate alerting and capacity rules for MIG and non-MIG node pools instead of reusing whole-GPU thresholds.
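As an illustration, assuming dcgm-exporter in MIG mode attaches GPU_I_ID and GPU_I_PROFILE labels to each series (label names can vary by exporter version), the following sketch reads SM activity at the correct scope:

```python
# Minimal sketch: aggregate utilization per MIG instance instead of per
# physical GPU. Assumes dcgm-exporter in MIG mode adds GPU_I_ID and
# GPU_I_PROFILE labels; label names can vary by exporter version.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"  # assumption

promql = (
    "avg by (Hostname, gpu, GPU_I_ID, GPU_I_PROFILE) "
    "(avg_over_time(DCGM_FI_PROF_SM_ACTIVE[5m]))"
)
resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    m = series["metric"]
    if "GPU_I_ID" in m:
        # MIG slice: judge saturation against the slice's own capacity.
        scope = f'GPU {m.get("gpu")} / MIG {m.get("GPU_I_ID")} ({m.get("GPU_I_PROFILE")})'
    else:
        # Non-MIG device: whole-GPU interpretation applies.
        scope = f'GPU {m.get("gpu")} (full device)'
    print(f'{m.get("Hostname")}: {scope} SM_ACTIVE={float(series["value"][1]):.2f}')
```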
Publish cost-aware GPU efficiency metrics
Optimize for cost visibility, not only performance. A high-value derived metric for AKS platform teams is GPU-seconds used versus GPU-seconds allocated. Use DCGM telemetry and Kubernetes context joins to publish this metric by namespace and workload class, then review it over time as a shared KPI for platform and finance teams. This approach defines a common source of truth for optimization decisions and helps prevent over-allocation from being hidden by aggregate utilization averages.
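The following sketch is one possible way to derive this KPI per namespace. It assumes dcgm-exporter is configured to attach Kubernetes namespace labels to GPU metrics and that kube-state-metrics exposes nvidia.com/gpu requests; the endpoint and the 24-hour window are placeholders.

```python
# Minimal sketch: publish "GPU-seconds used vs. allocated" per namespace.
# Assumes dcgm-exporter attaches namespace labels to GPU metrics and that
# kube-state-metrics exposes nvidia.com/gpu requests; the endpoint and
# 24h window are assumptions.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"  # assumption

def by_namespace(promql: str) -> dict[str, float]:
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return {
        s["metric"].get("namespace", "unknown"): float(s["value"][1])
        for s in resp.json()["data"]["result"]
    }

# Used: utilization ratio integrated over the window (GPU_UTIL is 0-100).
used = by_namespace(
    "sum by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]) / 100) "
    "* 86400"
)
# Allocated: requested GPUs integrated over the same window.
allocated = by_namespace(
    'sum by (namespace) (avg_over_time('
    'kube_pod_container_resource_requests{resource="nvidia_com_gpu"}[24h]))'
    " * 86400"
)

for ns, alloc_seconds in sorted(allocated.items()):
    efficiency = used.get(ns, 0.0) / alloc_seconds if alloc_seconds else 0.0
    print(f"{ns}: {efficiency:.1%} of allocated GPU-seconds used")
```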
Next steps
- Review GPU best practices for AKS.
- Get started with AKS managed GPU observability.
- Optimize allocation with multi-instance GPU (MIG) nodes.
- Scale based on GPU signals using KEDA and DCGM metrics.