Half the instances in the VMSS are failing because they cannot access the filesystem

Anibal Rivero 0 Reputation points
2026-02-10T16:07:07.2033333+00:00

As the title says, I have a workload that spawns 200 VMs using a VMSS to concurrently process the same set of files stored in a Storage Account, mounted using NFS. Exactly half of them (100) fail because they cannot find the files. This happens very early in the process. I guess we are hitting a usage limit in the storage account, but I cannot see any relevant error.

How could I find the root of the issue?

Thanks.

Azure Virtual Machine Scale Sets

Azure compute resources that are used to create and manage groups of heterogeneous load-balanced virtual machines.

1 answer

  1. Himanshu Shekhar 4,025 Reputation points Microsoft External Staff Moderator
    2026-02-10T17:19:49.2533333+00:00

    @Anibal Rivero

    Azure Files (NFS) has real limits and can throttle. Storage accounts have IOPS and throughput caps, and if combined usage across shares exceeds those limits, requests may be throttled at the account level.

    Azure Files also enforces file and share limits, including metadata IOPS and handle limits: for example, a maximum of 10,000 concurrent handles on the root directory and 2,000 handles per file or directory - https://learn.microsoft.com/en-us/azure/storage/files/storage-files-scale-targets

    For Azure Files, Microsoft's guidance for alerting on throttling uses the Azure Monitor Transactions metric, split by the Response type dimension.

    Create monitoring alerts for Azure Files - https://docs.azure.cn/en-us/storage/files/files-monitoring-alerts

    Microsoft has a dedicated document on improving Azure Files NFS performance, which explicitly calls out nconnect (and read-ahead tuning) as key levers for scaling and performance - https://learn.microsoft.com/en-us/azure/storage/files/nfs-performance

    The scenario shows an exact 50% failure rate (100 out of 200) occurring very early with “file not found” errors. A clean 50/50 split usually points to a configuration difference, not random throttling.
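One way to look for such a configuration difference is to group the instances by an attribute and compare failure rates per group. The following is a minimal sketch, not an Azure API: the record layout (`zone`, `subnet`, `failed`) and the function name are assumptions for illustration, and you would populate the records from your own instance inventory and job logs.

```python
from collections import defaultdict


def failure_rate_by(instances, attribute):
    """Return {attribute value: fraction of failed instances} for one attribute.

    `instances` is a list of dicts; each dict must contain `attribute`
    and a boolean "failed" key (hypothetical layout).
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [failed, total]
    for inst in instances:
        bucket = counts[inst[attribute]]
        bucket[1] += 1
        if inst["failed"]:
            bucket[0] += 1
    return {value: failed / total for value, (failed, total) in counts.items()}


# Hypothetical sample data: failures line up with the zone, not the subnet.
sample = [
    {"zone": "1", "subnet": "a", "failed": True},
    {"zone": "1", "subnet": "b", "failed": True},
    {"zone": "2", "subnet": "a", "failed": False},
    {"zone": "2", "subnet": "b", "failed": False},
]
print(failure_rate_by(sample, "zone"))
print(failure_rate_by(sample, "subnet"))
```

An attribute whose groups show rates near 1.0 and 0.0 separates the failing half from the healthy half and is the first place to look.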

    The two most common non-throttling causes of early “file not found” errors are:

    1. Mount never completed: the mount point exists, but it’s just an empty local directory.
    2. Networking/DNS/firewall split: some instances can reach the NFS endpoint while others cannot (for example, differences across subnets, zones, or private endpoint DNS).
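The first cause can be checked on a Linux instance by confirming that the mount point is backed by an NFS filesystem rather than a plain local directory. This is a minimal sketch that parses text in the `/proc/mounts` format; the function names and the `/mnt/data` path are assumptions for illustration.

```python
def mounted_fs_type(mount_point, mounts_text):
    """Return the filesystem type mounted at mount_point, or None if nothing is.

    mounts_text uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    target = mount_point.rstrip("/") or "/"
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == target:
            fs_type = fields[2]  # a later entry shadows an earlier one
    return fs_type


def is_nfs_mounted(mount_point, mounts_text):
    """True only if the mount point is currently backed by an NFS filesystem."""
    fs_type = mounted_fs_type(mount_point, mounts_text)
    return fs_type is not None and fs_type.startswith("nfs")


# Hypothetical /proc/mounts content for illustration.
sample_mounts = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "acct.file.core.windows.net:/acct/share /mnt/data nfs4 rw,vers=4.1 0 0\n"
)
print(is_nfs_mounted("/mnt/data", sample_mounts))
print(is_nfs_mounted("/mnt/other", sample_mounts))
```

On a real instance, pass the contents of `/proc/mounts` as `mounts_text` and run the check before the workload starts, so a failed mount surfaces as a clear error instead of "file not found".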

    Microsoft’s Azure Files troubleshooting guidance starts with validating DNS resolution and port connectivity, noting that misconfigured networking is the most common reason for mount or access failures.
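The DNS and port checks can be scripted so that every instance reports the same facts at startup. The sketch below uses only the Python standard library; the function name and result layout are assumptions, the host name in the comment is a placeholder, and port 2049 is the NFS port.

```python
import socket


def check_endpoint(host, port=2049, timeout=3.0):
    """Resolve host and attempt a TCP connection; return a small report dict."""
    report = {"host": host, "port": port, "resolved": [], "reachable": False, "error": None}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        report["error"] = f"DNS resolution failed: {exc}"
        return report
    # Record every address the name resolves to, so instances can be compared.
    report["resolved"] = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            report["reachable"] = True
    except OSError as exc:
        report["error"] = f"TCP connect failed: {exc}"
    return report


# Example call (placeholder host name, replace with your storage endpoint):
# print(check_endpoint("<account>.file.core.windows.net"))
```

Comparing the `resolved` addresses between a healthy and a failing instance shows whether they resolve the endpoint differently, for example a private endpoint address on one and a public address on the other.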

    Internal Azure Files NFS guidance also emphasizes that throttling typically presents as latency or slowness, not clean “file not found” errors, and stresses ensuring the VNet is properly allowed in the storage firewall - https://learn.microsoft.com/en-us/troubleshoot/azure/azure-storage/files/connectivity/files-troubleshoot?tabs=powershell

    The throughput numbers mentioned (for example, “25,600 MiB/s egress”) apply only to specific storage account SKUs and regions. They come from the Azure Files scale targets documentation and are not universal limits for all Blob or NFS setups.

    The key understanding is that storage account limits do exist and can cause throttling, but those exact numbers should only be used after confirming the storage account SKU and region itself.

    For Azure Files NFS, Microsoft explicitly lists TCP port 2049 as the required port.

    Blob NFS has different implementation details, so don’t conflate the two until you have confirmed which service is in use. There is no fixed SMB client connection limit (such as 100 clients) on Azure Files. The errors you are experiencing are most likely related to Azure Files scalability throttling, not a hard connection cap.

    When a large number of virtual machines access the same file share concurrently, especially the same directories or files, Azure Files can reach scalability limits at one of the following levels:

    • Storage account level

    • File share level

    • Individual file or directory path (hot-spotting)

     In high fan-out scenarios, metadata-heavy operations (such as directory listings and frequent open/close operations) typically hit limits before bandwidth does.

     When these limits are exceeded, Azure Storage may temporarily throttle requests and return backend responses such as:

    • 503 (ServerBusy)

    • 500 (OperationTimedOut)

     Depending on the application’s retry logic, these may appear as generic file access errors (for example, file not found or resource unavailable).

    To confirm whether throttling is occurring, I recommend reviewing Azure Monitor metrics during the exact timeframe of the failures:

    1. Navigate to the Storage Account
    2. Go to Monitoring > Metrics
    3. Select metric namespace: File
    4. Review:
      • Success Server Latency
      • Availability
      • Transactions
      • Throttled Requests
    5. If you observe increased latency, non-zero throttled requests, or reduced availability during scale-out events, this strongly indicates share-level or account-level throttling.
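Once the Transactions metric has been exported and split by response type, the throttled share of requests can be computed with a few lines. This is a minimal sketch over already-exported data, not a call to the Azure Monitor API; the input layout and function name are assumptions, and it treats any response type whose name contains "Throttling" as throttled.

```python
from collections import Counter


def throttling_summary(records):
    """Summarize (response_type, count) pairs exported from the Transactions metric."""
    totals = Counter()
    for response_type, count in records:
        totals[response_type] += count
    total = sum(totals.values())
    # Assumption: throttled response types contain the word "Throttling".
    throttled = sum(n for name, n in totals.items() if "throttling" in name.lower())
    fraction = throttled / total if total else 0.0
    return {"total": total, "throttled": throttled, "throttled_fraction": fraction}


# Hypothetical sample values for illustration.
sample_records = [
    ("Success", 900),
    ("SuccessWithThrottling", 50),
    ("ClientThrottlingError", 50),
]
print(throttling_summary(sample_records))
```

A throttled fraction of zero during the failure window makes throttling an unlikely root cause and points back to mount or networking differences.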

    Please validate the following:

    1. Storage account type (General Purpose v2 vs FileStorage)
    2. File share tier (Standard vs Premium)
    3. Provisioned share size (for Premium shares)
    4. Estimated IOPS per VM multiplied by total VM count

    Please note that Premium (FileStorage) shares scale performance based on provisioned size. If the share is undersized for the workload, throttling will occur under heavy concurrency.
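The "IOPS per VM multiplied by total VM count" estimate is simple arithmetic, sketched below. All numbers are hypothetical placeholders: take the per-VM figure from your own measurements and the share limit from the scale targets documentation for your account type and provisioned size.

```python
def estimate_demand(iops_per_vm, vm_count, share_iops_limit):
    """Compare aggregate IOPS demand across all VMs with the share's IOPS limit."""
    demand = iops_per_vm * vm_count
    return {
        "demand_iops": demand,
        "limit_iops": share_iops_limit,
        "utilization": demand / share_iops_limit,
        "over_limit": demand > share_iops_limit,
    }


# Hypothetical inputs: 50 IOPS per VM, 200 VMs, a share limited to 8,000 IOPS.
estimate = estimate_demand(iops_per_vm=50, vm_count=200, share_iops_limit=8000)
print(estimate)
```

With these placeholder inputs the aggregate demand is 10,000 IOPS, which exceeds the assumed 8,000 IOPS limit, so throttling would be expected under full concurrency.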

     For reference, the official documentation can be found here:

    1. Azure Files scalability targets https://learn.microsoft.com/azure/storage/files/storage-files-scale-targets
    2. Analyze Azure Files metrics https://learn.microsoft.com/azure/storage/files/analyze-files-metrics
    3. SMB performance guidance https://learn.microsoft.com/azure/storage/files/smb-performance