Introduction
AI speech capabilities let us control systems with voice commands, get answers to spoken questions, generate captions from audio, and much more. Voice-based interfaces provide a more natural way to engage with AI software, and the ability to interact through spoken language can make applications and agents more accessible and inclusive.
To enable this kind of interaction, the AI system must support at least two capabilities:
- Speech recognition: the ability to detect and interpret spoken input
- Speech synthesis: the ability to generate spoken output
Examples of these capabilities include:
- Clinical dictation and note-taking in healthcare: Doctors can dictate patient notes during or after appointments. An AI speech app converts the audio into accurate medical text, reducing manual typing and saving time.
- Call transcription in customer support: Contact centers transcribe customer calls in real time, making it easier to review conversations, detect issues, and analyze sentiment.
- Automated captioning in media and entertainment: Video platforms generate captions for live and recorded shows and streams, improving accessibility and supporting multilingual audiences.
- Language learning and pronunciation feedback in education: Learning apps use AI speech capabilities to listen to students speak and provide pronunciation feedback, helping learners practice and improve their spoken language skills.
- Voice-enabled assistants in retail and e-commerce: Virtual shopping assistants use speech recognition to understand spoken customer requests and speech synthesis (text-to-speech) to respond with product information or order status.
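The voice-enabled assistant scenario pairs both capabilities in a single turn: recognize the spoken request, work out a reply, then synthesize that reply as speech. The following is a minimal sketch of that flow, not an Azure Speech example. The names `VoiceAssistant`, `fake_recognize`, and `fake_synthesize` are hypothetical, and the "audio" is simply UTF-8 text bytes so that the code runs without any speech service.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: a real app would call a speech service here.
Recognizer = Callable[[bytes], str]   # speech recognition: audio in, text out
Synthesizer = Callable[[str], bytes]  # speech synthesis: text in, audio out


@dataclass
class VoiceAssistant:
    recognize: Recognizer
    synthesize: Synthesizer
    order_status: dict[str, str]  # illustrative data: order ID -> status

    def handle(self, audio: bytes) -> bytes:
        """Process one turn: spoken request in, spoken reply out."""
        request = self.recognize(audio).lower()
        reply = "Sorry, I could not find that order."
        for order_id, status in self.order_status.items():
            if order_id in request:
                reply = f"Order {order_id} is {status}."
                break
        return self.synthesize(reply)


# Fake engines (assumption): treat the audio payload as UTF-8 text.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


assistant = VoiceAssistant(fake_recognize, fake_synthesize, {"1042": "out for delivery"})
spoken_reply = assistant.handle(b"Where is order 1042?")
print(spoken_reply.decode("utf-8"))
```

Because the recognizer and synthesizer are passed in as plain callables, the fake engines could later be swapped for calls to a real speech service without changing the assistant logic.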
Azure Speech in Microsoft Foundry Tools provides speech-to-text, text-to-speech, and speech translation capabilities through speech recognition and synthesis. You can use prebuilt and custom Speech service models for a variety of tasks, from transcribing audio to text with high accuracy to identifying speakers in conversations, creating custom voices, and more. Next, learn how to incorporate speech recognition into an application with Azure Speech.
Note
We recognize that different people like to learn in different ways. You can choose to complete this module in video-based format or you can read the content as text and images. The text contains greater detail than the videos, so in some cases you might want to refer to it as supplemental material to the video presentation.