
Latency with Azure OpenAI compared to direct OpenAI model use

Aditya Purohit 5 Reputation points
2026-02-12T14:39:19.51+00:00

I have been working with Azure OpenAI gpt-4.1-mini for the last 8 months, and for the last couple of months I have observed a spike in time to first token (TTFT). It now ranges from 800 ms to 1.9 seconds, which is very high for agentic use; earlier it used to be under 500 ms.
I use chat completions with streaming on Azure. When I call the OpenAI API directly (US servers) I get a latency of 700-800 ms, even though that route adds roughly 300 ms of network-hop latency. My expectation with Azure was to remove that continental network hop by using its regional GPT deployments directly.
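For reference, TTFT can be measured client-side by timing the arrival of the first streamed chunk. A minimal sketch (the endpoint, deployment name, and API version in the commented usage are placeholders, not values from this thread):

```python
import time

def measure_ttft(stream):
    """Return (seconds to first chunk, total chunk count) for any chunk iterator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    return ttft, count

# With the openai SDK, the iterator returned by a streaming chat completion
# can be passed straight in (placeholder endpoint/deployment/key):
#
# from openai import AzureOpenAI
# client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
#                      api_key="<key>", api_version="2024-06-01")
# stream = client.chat.completions.create(
#     model="<deployment-name>",
#     messages=[{"role": "user", "content": "ping"}],
#     stream=True,
# )
# ttft, _ = measure_ttft(stream)
# print(f"TTFT: {ttft * 1000:.0f} ms")
```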

Azure AI services

A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.


2 answers

  1. Manas Mohanty 15,295 Reputation points Microsoft External Staff Moderator
    2026-02-25T22:01:26.3766667+00:00

    Hi Aditya Purohit,

    To summarize the case: you have recently seen slowness (800 ms to 1.9 s TTFT) on your GPT-4.1-mini deployment on the Azure OpenAI side, where it earlier used to stay under 500 ms.

    Occasional service degradation in certain zones happens when multiple customers spike demand at the same time; the product group normally load-balances and increases capacity in those regions.

    Other best practices we suggest to customers are to load-balance between multiple regions, increase the max TPM quota, and reduce prompt size.
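    The multi-region suggestion can be sketched as a simple client-side rotation over regional deployments. A minimal sketch; the endpoint names below are hypothetical, and a production setup would typically also track per-region health and fail over on errors:

```python
import itertools

# Hypothetical regional Azure OpenAI endpoints; substitute your own deployments.
ENDPOINTS = [
    "https://my-aoai-eastus.openai.azure.com",
    "https://my-aoai-swedencentral.openai.azure.com",
]

_rotation = itertools.cycle(ENDPOINTS)

def next_endpoint():
    """Round-robin across regions so no single deployment absorbs every spike."""
    return next(_rotation)
```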

    If you can share the details with us in a private message, we can investigate and escalate to the product group.

    Thank you.


  2. Yevhen Marynchak 0 Reputation points
    2026-02-13T13:11:10.44+00:00

    Try comparing the same model and region first, then work through the following:

    - Run simple tests with a very small prompt to measure baseline latency.
    - Check your network and DNS to rule out client-side delays.
    - Make sure you are calling the nearest Azure OpenAI region and not routing through a distant proxy.
    - Verify you are not hitting throttling or rate limits and that your requests are not being queued.
    - Inspect request and response sizes, because large prompts or long completions increase time.
    - Use the official SDKs or an HTTP client with connection reuse and keep-alive to avoid extra handshake overhead.
    - Enable retries with exponential backoff for transient spikes.
    - Look at p95/p99 telemetry rather than averages to find outliers.
    - Confirm the model is not running on a shared slow instance (consider dedicated capacity if available).
    - Check Azure service health for regional incidents.
    - If you still see unexplained high latency, collect timestamps, request IDs, and sample payloads, and open a support case so Microsoft can investigate.
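    The retry-with-exponential-backoff advice can be sketched as below; the attempt count and base delay are illustrative, and in practice you would retry only on transient errors (429/5xx) and honor any Retry-After header the service returns:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry fn() on exception with exponential backoff plus jitter.

    The delay doubles each attempt (base, 2*base, 4*base, ...) and a random
    jitter in [0, delay) is added so that many clients retrying at once do
    not synchronize into a thundering herd.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```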

