
gpt-4o-transcribe for real-time speech-to-text transcription: slow speed

yu.lili 0 Reputation points
2026-02-24T03:40:45.7033333+00:00

When I try to use gpt-4o-transcribe for real-time speech-to-text transcription, it takes about 1.5-2 seconds from sending the request to receiving the first token for a 2 s mp3 file.

Are there improved methods or other model options?

I have also tried gpt-4o-mini-transcribe, but there was no significant improvement in speed.

from openai import AzureOpenAI
import time

api_key = '...xxx...'

transcribe_client = AzureOpenAI(
    azure_endpoint='https://kkk.openai.azure.com',
    api_key=api_key,
    api_version="2025-03-01-preview",
)

model = 'gpt-4o-transcribe'
audio = 'test_output.mp3'

start = time.perf_counter()
with open(audio, "rb") as f:
    response = transcribe_client.audio.transcriptions.create(
        model=model,
        file=f,  # or ('test_output.mp3', filebytes, 'audio/mpeg')
        response_format="text",
        stream=True,
        language='en',
    )
    # With stream=True the response is an event iterator; time the first delta.
    for event in response:
        if event.type == "transcript.text.delta":
            print(f"first token after {time.perf_counter() - start:.2f}s")
            break
Azure AI Speech

An Azure service that integrates speech processing into apps and services.


2 answers

  1. Karnam Venkata Rajeswari 300 Reputation points Microsoft External Staff Moderator
    2026-02-25T06:25:07.45+00:00

    Hello yu.lili ,

Welcome to Microsoft Q&A. Thank you for reaching out with a detailed description of the case.

I understand that you are working on reducing real-time speech-to-text transcription latency with the gpt-4o-transcribe and gpt-4o-mini-transcribe models.

For transcribing speech to text, there are two types of models available:

1. Batch/transcription models – process pre-recorded audio as a complete file and are optimized for accuracy, so they typically analyze larger chunks before returning results.

2. Realtime/streaming models – architected for low latency; they process audio incrementally and emit partial tokens as soon as enough acoustic context is available.

gpt-4o-transcribe and gpt-4o-mini-transcribe fall into the batch/transcription category, so a realtime/streaming model is needed to reduce the latency to the first token.

To your question about improved methods or other model options with reduced latency: yes, there are other options available.

Switching from the transcribe models to a real-time streaming option is recommended to reduce latency and support real-time streaming. The following models are supported:

1. gpt-4o-realtime-preview

2. gpt-4o-mini-realtime-preview

3. gpt-realtime

4. gpt-realtime-mini

The above models differ from gpt-4o-transcribe / gpt-4o-mini-transcribe: they are accessed through the Realtime API using WebRTC, WebSockets, or SIP, so they can stream audio continuously and return text as soon as it is available.

Please go through the following resource for detailed information:

    Use the GPT Realtime API for speech and audio with Azure OpenAI - Azure OpenAI in Microsoft Foundry Models | Microsoft Learn
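As a rough illustration of the streaming approach, here is a minimal sketch of connecting to the Realtime API over WebSockets with the `websockets` Python package. The deployment name, api-version, and session/event payloads below are assumptions for illustration; check the documentation linked above for the exact schema supported by your resource.

```python
# Sketch: streaming transcription over the Azure OpenAI Realtime API (WebSockets).
# The deployment name and api-version below are ASSUMPTIONS - substitute the
# values from your own Azure OpenAI resource.
import asyncio
import base64
import json

def realtime_url(resource: str, deployment: str, api_version: str) -> str:
    """Build the Realtime API WebSocket URL for an Azure OpenAI resource."""
    return (f"wss://{resource}.openai.azure.com/openai/realtime"
            f"?api-version={api_version}&deployment={deployment}")

async def transcribe(pcm16_audio: bytes, api_key: str) -> None:
    import websockets  # pip install websockets (>=14 for additional_headers)
    url = realtime_url("kkk", "gpt-4o-realtime-preview", "2024-10-01-preview")
    async with websockets.connect(url, additional_headers={"api-key": api_key}) as ws:
        # Ask the session to transcribe incoming audio.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"input_audio_transcription": {"model": "whisper-1"}},
        }))
        # Append audio to the input buffer incrementally instead of
        # uploading one finished file - this is where the latency win comes from.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        async for message in ws:
            event = json.loads(message)
            if "transcription" in event.get("type", ""):
                print(event)
```

In a real client you would send `input_audio_buffer.append` events continuously as audio is captured, rather than in one shot as above.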

Another alternative would be Azure AI Speech (speech to text), which is designed for very low-latency streaming (100–300 ms typical in continuous mode).

Please refer to the following for additional information:

    ·       Speech-to-Text Documentation - Tutorials, API Reference - Foundry Tools | Microsoft Learn

    ·       Speech to text quickstart - Azure AI services | Azure Docs

    ·       GitHub - Azure-Samples/aoai-realtime-audio-sdk: Azure OpenAI code resources for using gpt-4o-realtime capabilities.

    ·       Quotas and Limits for Azure Speech - Foundry Tools | Microsoft Learn
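For comparison, a minimal continuous-recognition sketch with the Azure AI Speech SDK might look like the following. The key and region are placeholders, and the sketch assumes 16 kHz 16-bit mono PCM input; it requires `pip install azure-cognitiveservices-speech`. Feeding audio through a push stream in small frames is what keeps latency low:

```python
# Sketch: low-latency streaming speech to text with the Azure AI Speech SDK.
# Key and region are PLACEHOLDERS; assumes 16 kHz, 16-bit mono PCM audio.

def chunk_pcm(pcm: bytes, frame_ms: int = 20, sample_rate: int = 16000) -> list[bytes]:
    """Split 16-bit mono PCM into frame_ms-sized chunks for a push stream."""
    frame_bytes = sample_rate * 2 * frame_ms // 1000  # 2 bytes per 16-bit sample
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]

def stream_transcribe(pcm: bytes, key: str, region: str) -> None:
    import azure.cognitiveservices.speech as speechsdk
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    push_stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    # Partial hypotheses arrive on `recognizing`, final text on `recognized`.
    recognizer.recognizing.connect(lambda evt: print("partial:", evt.result.text))
    recognizer.recognized.connect(lambda evt: print("final:", evt.result.text))
    recognizer.start_continuous_recognition()
    for frame in chunk_pcm(pcm):   # feed audio incrementally
        push_stream.write(frame)
    push_stream.close()            # signals end of audio
    recognizer.stop_continuous_recognition()
```

The `recognizing` callback is what delivers partial text within a few hundred milliseconds; a batch transcription API has no equivalent.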


    Please let me know if you have any questions.

    Thank you!



  2. Q&A Assist
    2026-02-24T03:40:58.58+00:00
AI generated content. This answer was created with AI from Q&A Assist.

    The delay you are experiencing with the gpt-4o-transcribe model for real-time speech-to-text transcription could be influenced by several factors, including network latency, the processing speed of the model, and the size of the audio file. Since you've also tried gpt-4o-mini-transcribe without significant improvement, here are a few suggestions to potentially enhance the speed:

    1. Audio Quality: Ensure that the audio quality is high, as lower quality can lead to longer processing times.
    2. Chunking Strategy: You might want to experiment with different chunking strategies if available. Adjusting parameters like silence_duration_ms and threshold can help the model detect speech more effectively, potentially reducing latency.
    3. Model Selection: While you've tried gpt-4o-mini-transcribe, consider testing other models such as whisper or gpt-4o-transcribe-diarize, which may offer different performance characteristics.
    4. Real-time Settings: If you're using real-time transcription, ensure that the settings for noise reduction and audio format are optimized for your specific use case.
    5. Network Conditions: Check your network conditions, as a slow connection can significantly impact the time it takes to send requests and receive responses.

    If the issue persists, it may be beneficial to reach out to Azure support for more tailored assistance regarding performance optimization for your specific use case.



