Hello yu.lili,
Welcome to Microsoft Q&A. Thank you for reaching out with a detailed description of the case.
I understand that you are working on reducing real-time speech-to-text transcription latency while using the gpt-4o-transcribe and gpt-4o-mini-transcribe models.
For speech-to-text transcription, there are two types of models available:
1. Batch/Transcription models, which process pre-recorded audio as a complete file and optimize for accuracy, so they typically analyze larger chunks before returning results.
2. Realtime/Streaming models, which are architected for low latency, processing audio incrementally and emitting partial tokens as soon as enough acoustic context is available.
The gpt-4o-transcribe and gpt-4o-mini-transcribe models fall under the batch/transcription category. To reduce the latency of the first token received, you need a realtime/streaming model.
To answer your question about improved methods or other model options with lower latency: yes, other options are available.
Switching from the transcribe models to a realtime streaming model is recommended to reduce latency. The following models are supported:
1. gpt-4o-realtime-preview
2. gpt-4o-mini-realtime-preview
3. gpt-realtime
4. gpt-realtime-mini
The above models are different from gpt-4o-transcribe / gpt-4o-mini-transcribe. They are accessed through the Realtime API using WebRTC, WebSockets, or SIP, and can stream audio continuously and return text as soon as it is available.
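To illustrate what "streaming audio continuously" looks like on the client side, here is a minimal sketch of how raw audio could be split into small chunks and wrapped as `input_audio_buffer.append` events for a WebSocket connection. It assumes 24 kHz mono PCM16 input and a 100 ms chunk size; the function name `audio_append_events` is hypothetical, and the connection itself (endpoint URL, authentication, session configuration) is intentionally left out, so please check the Realtime API reference for your deployment before relying on it.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000   # assumption: 24 kHz mono PCM16 input
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHUNK_MS = 100            # assumption: roughly 100 ms of audio per event


def audio_append_events(pcm16: bytes, chunk_ms: int = CHUNK_MS):
    """Yield one JSON-encoded `input_audio_buffer.append` event per audio chunk."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # The audio payload is sent as base64-encoded bytes.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


if __name__ == "__main__":
    # One second of silence stands in for microphone input.
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    events = list(audio_append_events(one_second_of_silence))
    print(len(events), "events ready to send")
```

Each yielded string would be sent as one WebSocket message as soon as the audio is captured, rather than waiting for the full recording. Smaller chunks let the service start working earlier, at the cost of more messages.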
Please go through the following resource for detailed information
Another alternative is the Speech-to-Text feature of Microsoft Azure Cognitive Speech Services, which is designed for very low-latency streaming (typically 100–300 ms in continuous mode).
Please refer to the links below for additional information:
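As a rough sketch of continuous mode with the Speech SDK for Python (`azure-cognitiveservices-speech`), the example below wires the `recognizing` (interim) and `recognized` (final) events to a small helper that records how long the first partial result takes to arrive. The `FirstPartialTimer` class and `run_continuous_recognition` function are hypothetical names for illustration, the key and region are placeholders you would supply, and the default microphone is assumed as the audio source.

```python
import time


class FirstPartialTimer:
    """Records how long the first partial (interim) result takes to arrive."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started = clock()
        self.first_partial_s = None
        self.final_texts = []

    def on_recognizing(self, evt):
        # Interim hypothesis: fires repeatedly while speech is still in progress.
        if self.first_partial_s is None and evt.result.text:
            self.first_partial_s = self._clock() - self._started

    def on_recognized(self, evt):
        # Final result for one utterance.
        if evt.result.text:
            self.final_texts.append(evt.result.text)


def run_continuous_recognition(key: str, region: str, seconds: float = 10.0):
    # Imported lazily so the timer above can be used without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    # No audio config is passed, so the default microphone is used.
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    timer = FirstPartialTimer()
    recognizer.recognizing.connect(timer.on_recognizing)
    recognizer.recognized.connect(timer.on_recognized)
    recognizer.start_continuous_recognition()
    time.sleep(seconds)  # keep listening for a fixed window
    recognizer.stop_continuous_recognition()
    return timer
```

Note that the timer starts when it is created, so the measured value includes connection setup and any silence before you start speaking. It is an upper bound on the service latency, not a precise measurement.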
· Speech-to-Text Documentation - Tutorials, API Reference - Foundry Tools | Microsoft Learn
· Speech to text quickstart - Azure AI services | Azure Docs
· Quotas and Limits for Azure Speech - Foundry Tools | Microsoft Learn
Please let me know if you have any questions.
Thank you!