Thank you for reaching out to Microsoft Q&A.
There's an architectural difference between the Consumption and Flex Consumption plans. The Consumption plan uses a simpler, event‑driven routing model that often reuses the same warm worker when functions are invoked periodically, resulting in relatively stable latency.
In contrast, the Flex Consumption plan introduces concurrency‑based routing, per‑function isolation, and an additional front‑end routing layer to support faster scaling and advanced features. As a result, requests may be queued or routed to newly provisioned instances based on concurrency and capacity decisions. These delays frequently occur before the request reaches the function runtime, which explains why execution logs show minimal processing time despite long client‑side latency.
In summary, the Consumption plan appears more stable due to its simpler routing behavior, while the Flex Consumption plan prioritizes scalability and isolation, which can introduce occasional latency variability unless carefully tuned or replaced with a Premium plan for deterministic performance.
The inconsistent latency you are experiencing in your Python 3.12 Azure Function App (responses sometimes around 1 second and sometimes extending to tens or even ~100 seconds) is primarily caused by cold starts and scale-out behavior, which is expected in serverless environments—especially for Python-based machine learning workloads. When a function instance is idle or scaled down, Azure may deallocate the underlying worker. The next incoming request then triggers a fresh worker initialization, which includes provisioning compute resources, starting the Python worker process, loading dependencies, and initializing ML models into memory. For large ML models or heavy Python dependencies, this startup phase can significantly increase execution time, leading to highly variable latency. Additionally, during scale-out events, new workers may be created that are not pre-warmed, further contributing to unpredictable response times.
Refer to the points below to resolve or work around this issue:
1. Use Elastic Premium or Dedicated (App Service) Plan
If your Function App is running on the Consumption plan, cold starts are expected and unavoidable. Moving to an Elastic Premium plan (EP1/EP2/EP3) or a Dedicated App Service plan with “Always On” enabled ensures that at least one worker instance is always running. This prevents scale-to-zero behavior and significantly reduces cold-start latency, making response times far more consistent for latency-sensitive ML inference workloads.
2. Configure Pre-warmed (Always Ready) Instances
On the Elastic Premium plan, configure a minimum number of instances (for example, 1 or more) to remain pre-warmed. This guarantees that incoming requests are handled by an already-initialized Python worker, eliminating first-request penalties and improving reliability during traffic spikes.
3. Optimize Python Startup and Model Loading
Avoid loading large ML models and heavy dependencies at module import time. Instead, use lazy initialization so the model is loaded only once and reused across subsequent requests within the same worker. Also, review and minimize the contents of requirements.txt to reduce startup overhead caused by unnecessary libraries.
4. Tune Python Worker Process Count
Python Azure Functions are single-threaded per worker by default. For CPU-bound ML inference workloads, it is recommended to use multiple Python worker processes (for example, 2–3, via the FUNCTIONS_WORKER_PROCESS_COUNT application setting) rather than increasing thread counts. Over-provisioning workers can reduce the effectiveness of warm-up mechanisms, so keeping the worker count moderate helps maintain consistent performance.
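A toy, framework-free illustration of why extra threads help little here: pure-Python CPU-bound work holds the GIL, so threaded calls produce the same results without meaningful parallelism, which is why separate worker processes are the recommended lever (the `infer` function below is a hypothetical stand-in for your inference code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def infer(n):
    # Stand-in for CPU-bound inference: pure-Python arithmetic holds the GIL
    return sum(i * i for i in range(n))

N, CALLS = 200_000, 4

start = time.perf_counter()
serial = [infer(N) for _ in range(CALLS)]
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CALLS) as pool:
    threaded = list(pool.map(infer, [N] * CALLS))
threaded_time = time.perf_counter() - start

# Threads yield little speedup for this workload because of the GIL;
# multiple worker processes sidestep it at the cost of duplicated model memory.
print(f"serial: {serial_time:.2f}s, threaded: {threaded_time:.2f}s")
```

Note that each worker process loads its own copy of the model, which is another reason to keep the process count moderate on memory-constrained SKUs.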
5. Prefer Vertical Scaling for ML Workloads
ML inference is typically CPU- and memory-intensive. Instead of relying heavily on horizontal scaling, choose a larger Premium SKU with more CPU and memory. This reduces the need for frequent scale-out events and helps keep models resident in memory.
6. Ensure Regional Proximity of Dependent Services
Deploy your Function App in the same Azure region as dependent services such as storage accounts, model artifacts, and upstream callers. Cross-region calls can amplify perceived latency and make cold-start effects appear more random.