Hello Soulaïman Marsou,
Welcome to Microsoft Q&A, and thank you for reaching out.
You’re correct in your diagnosis: the TTS pipeline is effectively tokenizing the HTML entity `&amp;` as two parts (`&` + `amp;`) when generating word-boundary metadata, while the speech engine normalizes it to “et” for audio. That mismatch causes:

- Broken tokenization
- `amp;` fragments in word-boundary events
- AudioOffset/Duration drift relative to the spoken output
The audio engine and the word-boundary generator are not operating on the exact same normalized text representation.
Recommended Approach
There isn’t currently a flag to tell the service “treat `&amp;` as a single lexical word,” but the most reliable SSML-based workaround is the `<sub>` (substitute) element.
Example:
```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="fr-FR">
  <voice name="fr-FR-DeniseNeural">
    Croissants <sub alias="et">&amp;</sub> café
  </voice>
</speak>
```
What this does:

- The SSML remains valid (the ampersand is still properly escaped as `&amp;`).
- The engine pronounces the alias "et".
- Word-boundary events align with "et" instead of splitting into `&` and `amp;`.

In many cases, this restores clean boundary alignment without manually replacing the symbol in your source text.
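If the source text arrives dynamically, the substitution can be automated when building the SSML. A minimal sketch (the helper name, voice, and locale below are illustrative choices, not part of the Speech SDK):

```python
import html

def ampersand_sub_ssml(text: str, locale: str = "fr-FR", alias: str = "et",
                       voice: str = "fr-FR-DeniseNeural") -> str:
    """Wrap every ampersand in a <sub> element so word-boundary
    events align on the alias instead of splitting the entity."""
    escaped = html.escape(text)  # "&" becomes "&amp;" (other XML chars too)
    escaped = escaped.replace("&amp;", f'<sub alias="{alias}">&amp;</sub>')
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{locale}"><voice name="{voice}">{escaped}</voice></speak>'
    )

print(ampersand_sub_ssml("Croissants & café"))
```

Escaping first, then replacing `&amp;`, keeps the output well-formed even when the input contains other XML-special characters.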
If <sub> Still Produces Misalignment
If you continue seeing boundary corruption even with <sub>, then this is very likely a limitation/bug in how HTML entities are tokenized during boundary generation (particularly in Batch synthesis).
In that case, the only fully deterministic workaround is to pre-normalize the text before sending it to TTS. Replace symbolic forms with their spoken equivalents before building the SSML:

- `&` → "et" (fr-FR)
- `&` → "and" (en-US)
This guarantees:

- Stable tokenization
- Correct AudioOffset alignment
- No entity artifacts
- Consistent boundary metadata
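As a minimal sketch of that pre-normalization step (the mapping table and function name are illustrative; extend the table per locale you support):

```python
import html

# Illustrative per-locale mapping (an assumption, not an SDK API).
SPOKEN_AMPERSAND = {"fr-FR": "et", "en-US": "and"}

def normalize_for_tts(text: str, locale: str) -> str:
    """Replace "&" with its spoken form, then escape the result for SSML."""
    spoken = SPOKEN_AMPERSAND.get(locale, "and")
    text = text.replace("&", f" {spoken} ")
    text = " ".join(text.split())   # collapse doubled spaces from the replacement
    return html.escape(text)        # now safe to embed in SSML

print(normalize_for_tts("Croissants & café", "fr-FR"))  # Croissants et café
```

Because the ampersand never reaches the synthesizer, tokenization and boundary metadata see only ordinary words.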
The replacement is language-dependent, but it is the most reliable solution when precise synchronization (karaoke highlighting, subtitles, animation timing) is required.
What Not to Rely On
- PlainText vs. SSML input (the behavior is the same)
- Escaping with `html.escape()` alone
- CDATA blocks
- Assuming entity decoding happens before boundary generation

You’ve already validated that these approaches don’t solve the offset drift.
Summary

- This is not expected behavior from a consumer standpoint; it stems from entity tokenization mismatching audio normalization.
- Best workaround: use `<sub alias="et">&amp;</sub>`.
- Most reliable fallback: pre-replace `&` with the spoken form before synthesis.
- If precise alignment is critical, normalization before TTS is the safest option.
Please refer to these resources:

- SSML document structure and special characters: https://learn.microsoft.com/azure/ai-services/speech-service/speech-synthesis-markup-structure#special-characters
- SSML `<sub>` element for pronunciation substitution: https://learn.microsoft.com/azure/ai-services/speech-service/speech-synthesis-markup?tabs=csharp#sub-element
I hope this helps. Do let me know if you have any further queries.
Thank you!