
gpt-4o-transcribe for real-time speech-to-text transcription: slow speed

yu.lili 0 Reputation points
2026-02-24T03:40:45.7033333+00:00

When I try to use gpt-4o-transcribe for real-time speech-to-text transcription, it takes about 1.5-2 seconds from sending the request to receiving the first token for a 2 s mp3 file.

Are there improved methods or other model options?

I have also tried gpt-4o-mini-transcribe, but there was no significant improvement in speed.

from openai import AzureOpenAI
import time

api_key = '...xxx...'

transcribe_client = AzureOpenAI(
    azure_endpoint='https://kkk.openai.azure.com',
    api_key=api_key,
    api_version="2025-03-01-preview",
)

model = 'gpt-4o-transcribe'
audio = 'test_output.mp3'

start = time.perf_counter()
with open(audio, "rb") as f:
    response = transcribe_client.audio.transcriptions.create(
        model=model,
        file=f,  # or ('test_output.mp3', filebytes, 'audio/mpeg')
        response_format="text",
        stream=True,
        language='en',
    )
    # With stream=True the response is an event iterator; time the first delta.
    for event in response:
        if event.type == "transcript.text.delta":
            print(f"first token after {time.perf_counter() - start:.2f}s")
            break
Azure AI Speech

An Azure service that integrates speech processing into apps and services.


2 answers

  1. Karnam Venkata Rajeswari 300 Reputation points Microsoft External Staff Moderator
    2026-02-25T06:25:07.45+00:00

    Hello yu.lili ,

Welcome to Microsoft Q&A. Thank you for reaching out with a detailed description of the case.

I understand that you are working on reducing real-time speech-to-text transcription latency with the gpt-4o-transcribe and gpt-4o-mini-transcribe models.

For transcribing speech to text, there are two types of models available:

1. Batch/transcription models – process pre-recorded audio as a complete file and are optimized for accuracy, so they typically analyze larger chunks before returning results.

2. Realtime/streaming models – architected for low latency; they process audio incrementally and emit partial tokens as soon as enough acoustic context is available.

gpt-4o-transcribe and gpt-4o-mini-transcribe fall into the batch/transcription category, so a realtime/streaming model is needed to reduce the latency to the first token.

To your question about improved methods or other model options with reduced latency: yes, there are other options available.

Switching from the transcribe models to a real-time streaming option is recommended to reduce latency and support real-time streaming. The following models are supported:

1. gpt-4o-realtime-preview

2. gpt-4o-mini-realtime-preview

3. gpt-realtime

4. gpt-realtime-mini

The above models differ from gpt-4o-transcribe / gpt-4o-mini-transcribe: they are accessed through the Realtime API using WebRTC, WebSockets, or SIP, so they can stream audio continuously and return text as soon as it is available.

Please go through the following resource for detailed information:

    Use the GPT Realtime API for speech and audio with Azure OpenAI - Azure OpenAI in Microsoft Foundry Models | Microsoft Learn
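As a rough illustration of the streaming approach, here is a minimal sketch of connecting to the Realtime API over WebSockets with the `websockets` Python package. The deployment name, api-version, and session/event payloads below are assumptions for illustration; check the documentation linked above for the exact schema supported by your resource.

```python
# Sketch: streaming transcription over the Azure OpenAI Realtime API (WebSockets).
# The deployment name and api-version below are ASSUMPTIONS - substitute the
# values from your own Azure OpenAI resource.
import asyncio
import base64
import json

def realtime_url(resource: str, deployment: str, api_version: str) -> str:
    """Build the Realtime API WebSocket URL for an Azure OpenAI resource."""
    return (f"wss://{resource}.openai.azure.com/openai/realtime"
            f"?api-version={api_version}&deployment={deployment}")

async def transcribe(pcm16_audio: bytes, api_key: str) -> None:
    import websockets  # pip install websockets (>=14 for additional_headers)
    url = realtime_url("kkk", "gpt-4o-realtime-preview", "2024-10-01-preview")
    async with websockets.connect(url, additional_headers={"api-key": api_key}) as ws:
        # Ask the session to transcribe incoming audio.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"input_audio_transcription": {"model": "whisper-1"}},
        }))
        # Append audio to the input buffer incrementally instead of
        # uploading one finished file - this is where the latency win comes from.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        async for message in ws:
            event = json.loads(message)
            if "transcription" in event.get("type", ""):
                print(event)
```

In a real client you would send `input_audio_buffer.append` events continuously as audio is captured, rather than in one shot as above.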

Another alternative would be Azure AI Speech (speech to text), which is designed for very low-latency streaming (100–300 ms typical in continuous mode).

Please refer to the following for additional information:

    ·       Speech-to-Text Documentation - Tutorials, API Reference - Foundry Tools | Microsoft Learn

    ·       Speech to text quickstart - Azure AI services | Azure Docs

    ·       GitHub - Azure-Samples/aoai-realtime-audio-sdk: Azure OpenAI code resources for using gpt-4o-realtime capabilities.

    ·       Quotas and Limits for Azure Speech - Foundry Tools | Microsoft Learn
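For comparison, a minimal continuous-recognition sketch with the Azure AI Speech SDK might look like the following. The key and region are placeholders, and the sketch assumes 16 kHz 16-bit mono PCM input; it requires `pip install azure-cognitiveservices-speech`. Feeding audio through a push stream in small frames is what keeps latency low:

```python
# Sketch: low-latency streaming speech to text with the Azure AI Speech SDK.
# Key and region are PLACEHOLDERS; assumes 16 kHz, 16-bit mono PCM audio.

def chunk_pcm(pcm: bytes, frame_ms: int = 20, sample_rate: int = 16000) -> list[bytes]:
    """Split 16-bit mono PCM into frame_ms-sized chunks for a push stream."""
    frame_bytes = sample_rate * 2 * frame_ms // 1000  # 2 bytes per 16-bit sample
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]

def stream_transcribe(pcm: bytes, key: str, region: str) -> None:
    import azure.cognitiveservices.speech as speechsdk
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    push_stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    # Partial hypotheses arrive on `recognizing`, final text on `recognized`.
    recognizer.recognizing.connect(lambda evt: print("partial:", evt.result.text))
    recognizer.recognized.connect(lambda evt: print("final:", evt.result.text))
    recognizer.start_continuous_recognition()
    for frame in chunk_pcm(pcm):   # feed audio incrementally
        push_stream.write(frame)
    push_stream.close()            # signals end of audio
    recognizer.stop_continuous_recognition()
```

The `recognizing` callback is what delivers partial text within a few hundred milliseconds; a batch transcription API has no equivalent.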


    Please let me know if you have any questions.

    Thank you!



  2. Q&A Assist
    2026-02-24T03:40:58.58+00:00
AI generated content. This answer was created with AI from Q&A Assist.

    The delay you are experiencing with the gpt-4o-transcribe model for real-time speech-to-text transcription could be influenced by several factors, including network latency, the processing speed of the model, and the size of the audio file. Since you've also tried gpt-4o-mini-transcribe without significant improvement, here are a few suggestions to potentially enhance the speed:

    1. Audio Quality: Ensure that the audio quality is high, as lower quality can lead to longer processing times.
    2. Chunking Strategy: You might want to experiment with different chunking strategies if available. Adjusting parameters like silence_duration_ms and threshold can help the model detect speech more effectively, potentially reducing latency.
    3. Model Selection: While you've tried gpt-4o-mini-transcribe, consider testing other models such as whisper or gpt-4o-transcribe-diarize, which may offer different performance characteristics.
    4. Real-time Settings: If you're using real-time transcription, ensure that the settings for noise reduction and audio format are optimized for your specific use case.
    5. Network Conditions: Check your network conditions, as a slow connection can significantly impact the time it takes to send requests and receive responses.

    If the issue persists, it may be beneficial to reach out to Azure support for more tailored assistance regarding performance optimization for your specific use case.



