Summary

This module introduces speech recognition (speech‑to‑text) as the foundation for voice‑enabled applications and agents. Learners explore how spoken audio is captured from a microphone or audio file and converted into written text using Azure Speech. The module explains where speech‑to‑text fits in an application—either in a client app or a backend service—and highlights common scenarios such as live transcription, captions, voicemail processing, and providing text input to AI agents.

The module then covers speech synthesis (text‑to‑speech), which enables applications to generate natural‑sounding spoken audio from text. Learners see how Azure Speech uses neural voices to produce that audio, how pronunciation, tone, speed, and pitch can be adjusted, and how synthesized audio can be played immediately or saved for later use. This section emphasizes how text‑to‑speech allows applications and agents to respond audibly, improving accessibility, hands‑free interaction, and overall user experience.
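
As a rough illustration of the speed and pitch control described above, the following sketch builds a Speech Synthesis Markup Language (SSML) string using only the Python standard library. The helper name `build_ssml`, the example voice `en-US-JennyNeural`, and the default values are assumptions chosen for illustration and are not taken from the module; the snippet does not call Azure Speech itself.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", rate="+0%", pitch="+0%", lang="en-US"):
    """Wrap plain text in SSML that selects a voice and adjusts speed and pitch.

    `voice`, `rate`, and `pitch` are illustrative defaults (assumptions),
    not values prescribed by the module.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"  # escape &, <, > so the markup stays well-formed
        "</prosody></voice></speak>"
    )


if __name__ == "__main__":
    # Speak slightly slower and slightly higher than the voice's default.
    print(build_ssml("Your order has shipped & will arrive Friday.", rate="-10%", pitch="+5%"))
```

A string like this could then be handed to a synthesizer, for example through the Speech SDK's `speak_ssml_async` method, assuming a configured Azure Speech resource; plain text without SSML falls back to the voice's default speed and pitch.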

Finally, the module brings these capabilities together with speech‑to‑speech using Voice Live. Learners discover how Voice Live combines speech‑to‑text, AI reasoning, and text‑to‑speech into a single, fully managed service for real‑time conversations. Instead of stitching together multiple components, developers can use Voice Live to build responsive, natural voice agents that can listen, think, and speak—making it easier to create production‑ready conversational experiences with Azure Speech and Microsoft Foundry.

Use the links below to learn more.

Azure Speech in the service documentation
Learn more about Azure Speech's Software Development Kits (SDKs)
Learn more about Azure Speech - Voice Live in the documentation
