
Latency with Azure OpenAI compared to direct OpenAI model use

Aditya Purohit 5 Reputation points
2026-02-12T14:39:19.51+00:00

I have been working with Azure OpenAI gpt-4.1-mini for the last 8 months, and for the last couple of months I have observed a spike in time to first token (TTFT). It now ranges from 800 ms to 1.9 seconds, which is very high for agentic use; earlier it used to be under 500 ms.
I use chat completions with streaming on Azure. When I call the OpenAI API directly (US servers) I get a latency of 700-800 ms, even though that route adds roughly 300 ms of network-hop latency. My expectation with Azure was to remove that continental network hop by using its regional GPT deployments directly.
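For reference, TTFT can be measured client-side by timing the arrival of the first streamed chunk. A minimal sketch (the endpoint, deployment name, and API version in the commented usage are placeholders, not values from this thread):

```python
import time

def measure_ttft(stream):
    """Return (seconds to first chunk, total chunk count) for any chunk iterator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    return ttft, count

# With the openai SDK, the iterator returned by a streaming chat completion
# can be passed straight in (placeholder endpoint/deployment/key):
#
# from openai import AzureOpenAI
# client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
#                      api_key="<key>", api_version="2024-06-01")
# stream = client.chat.completions.create(
#     model="<deployment-name>",
#     messages=[{"role": "user", "content": "ping"}],
#     stream=True,
# )
# ttft, _ = measure_ttft(stream)
# print(f"TTFT: {ttft * 1000:.0f} ms")
```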

Azure AI services

A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.


2 answers

  1. Manas Mohanty 15,295 Reputation points Microsoft External Staff Moderator
    2026-02-25T22:01:26.3766667+00:00

    Hi Aditya Purohit,

    To summarize the case: you have recently seen slowness (800 ms to 1.9 s TTFT) on your GPT-4.1-mini deployment on the Azure OpenAI side, where it earlier used to stay under 500 ms.

    Occasional service degradation in certain zones happens when multiple customers spike demand at the same time; the product group normally load-balances and increases capacity in those regions.

    Other best practices we suggest to customers are to load-balance between multiple regions, increase the max TPM quota, and reduce prompt size.
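    The multi-region suggestion can be sketched as a simple client-side rotation over regional deployments. A minimal sketch; the endpoint names below are hypothetical, and a production setup would typically also track per-region health and fail over on errors:

```python
import itertools

# Hypothetical regional Azure OpenAI endpoints; substitute your own deployments.
ENDPOINTS = [
    "https://my-aoai-eastus.openai.azure.com",
    "https://my-aoai-swedencentral.openai.azure.com",
]

_rotation = itertools.cycle(ENDPOINTS)

def next_endpoint():
    """Round-robin across regions so no single deployment absorbs every spike."""
    return next(_rotation)
```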

    If you can share the details with us in a private message, we can investigate and escalate to the product group.

    Thank you.


  2. Yevhen Marynchak 0 Reputation points
    2026-02-13T13:11:10.44+00:00

    Try comparing the same model and region first, then work through the following:

    - Run simple tests with a very small prompt to measure baseline latency.
    - Check your network and DNS to rule out client-side delays.
    - Make sure you are calling the nearest Azure OpenAI region and not routing through a distant proxy.
    - Verify you are not hitting throttling or rate limits and that your requests are not being queued.
    - Inspect request and response sizes, because large prompts or long completions increase time.
    - Use the official SDKs or an HTTP client with connection reuse and keep-alive to avoid extra handshake overhead.
    - Enable retries with exponential backoff for transient spikes.
    - Look at p95/p99 telemetry rather than averages to find outliers.
    - Confirm the model is not running on a shared slow instance (consider dedicated capacity if available).
    - Check Azure service health for regional incidents.
    - If you still see unexplained high latency, collect timestamps, request IDs, and sample payloads, and open a support case so Microsoft can investigate.
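    The retry-with-exponential-backoff advice can be sketched as below; the attempt count and base delay are illustrative, and in practice you would retry only on transient errors (429/5xx) and honor any Retry-After header the service returns:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry fn() on exception with exponential backoff plus jitter.

    The delay doubles each attempt (base, 2*base, 4*base, ...) and a random
    jitter in [0, delay) is added so that many clients retrying at once do
    not synchronize into a thundering herd.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```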

