Thank you for reaching out to Microsoft Q&A.
There's an architectural difference between the Consumption and Flex Consumption plans. The Consumption plan uses a simpler, event‑driven routing model that often reuses the same warm worker when functions are invoked periodically, resulting in relatively stable latency.
In contrast, the Flex Consumption plan introduces concurrency‑based routing, per‑function isolation, and an additional front‑end routing layer to support faster scaling and advanced features. As a result, requests may be queued or routed to newly provisioned instances based on concurrency and capacity decisions. These delays frequently occur before the request reaches the function runtime, which explains why execution logs show minimal processing time despite long client‑side latency.
In summary, the Consumption plan appears more stable due to its simpler routing behavior, while the Flex Consumption plan prioritizes scalability and isolation, which can introduce occasional latency variability unless carefully tuned or replaced with a Premium plan for deterministic performance.
The inconsistent latency you are experiencing in your Python 3.12 Azure Function App (responses sometimes around 1 second and sometimes extending to tens or even ~100 seconds) is primarily caused by cold starts and scale-out behavior, which is expected in serverless environments—especially for Python-based machine learning workloads. When a function instance is idle or scaled down, Azure may deallocate the underlying worker. The next incoming request then triggers a fresh worker initialization, which includes provisioning compute resources, starting the Python worker process, loading dependencies, and initializing ML models into memory. For large ML models or heavy Python dependencies, this startup phase can significantly increase execution time, leading to highly variable latency. Additionally, during scale-out events, new workers may be created that are not pre-warmed, further contributing to unpredictable response times.
Refer to the points below to resolve or work around this issue:
1. Use Elastic Premium or Dedicated (App Service) Plan
If your Function App is running on the Consumption plan, cold starts are expected and unavoidable. Moving to an Elastic Premium plan (EP1/EP2/EP3) or a Dedicated App Service plan with “Always On” enabled ensures that at least one worker instance is always running. This prevents scale-to-zero behavior and significantly reduces cold-start latency, making response times far more consistent for latency-sensitive ML inference workloads.
2. Configure Pre-warmed (Always Ready) Instances
On the Elastic Premium plan, configure a minimum number of instances (for example, 1 or more) to remain pre-warmed. This guarantees that incoming requests are handled by an already-initialized Python worker, eliminating first-request penalties and improving reliability during traffic spikes.
3. Optimize Python Startup and Model Loading
Avoid loading large ML models and heavy dependencies at module import time. Instead, use lazy initialization so the model is loaded only once and reused across subsequent requests within the same worker. Also, review and minimize the contents of requirements.txt to reduce startup overhead caused by unnecessary libraries.
4. Tune Python Worker Process Count
Python Azure Functions are single-threaded per worker by default. For CPU-bound ML inference workloads, it is recommended to use multiple Python worker processes (for example, 2–3, via the FUNCTIONS_WORKER_PROCESS_COUNT application setting) rather than increasing thread counts. Over-provisioning workers can reduce the effectiveness of warm-up mechanisms, so keeping the worker count moderate helps maintain consistent performance.
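A toy, framework-free illustration of why extra threads help little here: pure-Python CPU-bound work holds the GIL, so threaded calls produce the same results without meaningful parallelism, which is why separate worker processes are the recommended lever (the `infer` function below is a hypothetical stand-in for your inference code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def infer(n):
    # Stand-in for CPU-bound inference: pure-Python arithmetic holds the GIL
    return sum(i * i for i in range(n))

N, CALLS = 200_000, 4

start = time.perf_counter()
serial = [infer(N) for _ in range(CALLS)]
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CALLS) as pool:
    threaded = list(pool.map(infer, [N] * CALLS))
threaded_time = time.perf_counter() - start

# Threads yield little speedup for this workload because of the GIL;
# multiple worker processes sidestep it at the cost of duplicated model memory.
print(f"serial: {serial_time:.2f}s, threaded: {threaded_time:.2f}s")
```

Note that each worker process loads its own copy of the model, which is another reason to keep the process count moderate on memory-constrained SKUs.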
5. Prefer Vertical Scaling for ML Workloads
ML inference is typically CPU- and memory-intensive. Instead of relying heavily on horizontal scaling, choose a larger Premium SKU with more CPU and memory. This reduces the need for frequent scale-out events and helps keep models resident in memory.
6. Ensure Regional Proximity of Dependent Services
Deploy your Function App in the same Azure region as dependent services such as storage accounts, model artifacts, and upstream callers. Cross-region calls can amplify perceived latency and make cold-start effects appear more random.