Special character ampersand (“&”) breaks word boundaries in Azure Text-to-Speech

Soulaïman Marsou 0 Reputation points
2026-02-26T10:51:25.2866667+00:00

Hello,

I’m encountering an issue with word boundary events in Azure Text-to-Speech when the input text contains the ampersand character (&).


Context

  • Locale: fr-FR
  • Neural French voice (e.g. fr-FR-Remy:DragonHDLatestNeural)
  • Batch synthesis API
  • wordBoundaryEnabled = true
  • In my use case, I rely on AudioOffset and Duration to synchronize text with audio.

When the text contains an ampersand (&), the voice correctly pronounces it as “et”, but the word boundary results become corrupted:

  • Tokens are split incorrectly
  • Escaped fragments like amp; appear in the token text
  • Offsets no longer match the spoken audio

This happens whether I use:

  • PlainText
  • PlainText with escaped text
  • SSML
  • SSML with escaped text

In all cases, the presence of & causes misaligned word boundary output.

Example:

Voice : fr-FR-Remy:DragonHDLatestNeural

Input text : "Donjons & Dragons, souvent abrégé en D&D, est une expérience unique."

Word boundaries :

[
  { "Text": "Donjons", "AudioOffset": 70, "Duration": 600 },
  { "Text": "&", "AudioOffset": 670, "Duration": 0 },
  { "Text": "amp; Dra", "AudioOffset": 670, "Duration": 600 },
  { "Text": "g", "AudioOffset": 1270, "Duration": 0 },
  { "Text": "ons, sou", "AudioOffset": 1470, "Duration": 559 },
  { "Text": "vent ab", "AudioOffset": 2029, "Duration": 400 },
  { "Text": "rég", "AudioOffset": 2429, "Duration": 81 },
  { "Text": "é en", "AudioOffset": 2510, "Duration": 640 },
  { "Text": " ", "AudioOffset": 3150, "Duration": 0 },
  { "Text": "D&am", "AudioOffset": 3510, "Duration": 160 },
  { "Text": "p;D,", "AudioOffset": 3670, "Duration": 159 }
]

Simplified request example

Here is a simplified version of the batch synthesis request I’m sending using Python:


import asyncio
import html
import json

import httpx

locale = "fr-FR"
voice = "fr-FR-YourVoiceName"
text = "Donjons & Dragons, souvent abrégé en D&D"
safe_text = html.escape(text)

ssml = f"""
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="{locale}">
  <voice name="{voice}">
    {safe_text}
  </voice>
</speak>
"""

payload = {
    "inputKind": "SSML",
    "inputs": [
        {"content": ssml}
    ],
    "properties": {
        "outputFormat": "audio-48khz-192kbitrate-mono-mp3",
        "wordBoundaryEnabled": True
    }
}

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.put(
            url="https://<region>.tts.speech.microsoft.com/texttospeech/batchsyntheses/<job_id>?api-version=2024-04-01",
            headers={
                "Content-Type": "application/json",
                "Ocp-Apim-Subscription-Key": "<your-key>",
            },
            content=json.dumps(payload),
        )
        response.raise_for_status()

asyncio.run(main())

I get the same issue using PlainText and/or calling the REST API directly:

curl -X PUT \
  "$AZURE_AI_SPEECHSERVICES_ENDPOINT/texttospeech/batchsyntheses/$JOB_ID?api-version=2024-04-01" \
  -H "Ocp-Apim-Subscription-Key: $AZURE_AI_SPEECHSERVICES_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "inputKind": "PlainText",
    "synthesisConfig": {
      "voice": "fr-FR-Remy:DragonHDLatestNeural"
    },
    "inputs": [
      {
        "content": "Donjons & Dragons, souvent abrégé en D&D, est une expérience unique."
      }
    ],
    "properties": {
      "outputFormat": "audio-48khz-192kbitrate-mono-mp3",
      "sentenceBoundaryEnabled": true,
      "wordBoundaryEnabled": true
    }
  }'

Expected behavior

Even if & is internally normalized to “et” (French) or “and” (English), I would expect:

  • Clean tokenization
  • Stable word boundaries
  • No escaped fragments like amp;
  • No offset shifting

Question

Is there a supported or recommended way to include the ampersand character (&) in Azure Text-to-Speech while preserving correct word boundary offsets, without manually replacing it with language-specific words like:

  • et (French)
  • and (English)

Is this:

  • A known limitation of word boundary generation?
  • A parsing issue related to HTML entity handling?
  • Something that requires a specific SSML construct?

Any guidance on best practices for handling special characters while maintaining reliable word boundary alignment would be greatly appreciated.

Thank you.

Azure AI Speech

An Azure service that integrates speech processing into apps and services.

1 answer

  1. SRILAKSHMI C 15,030 Reputation points Microsoft External Staff Moderator
    2026-03-03T00:44:38.1+00:00

    Hello Soulaïman Marsou,

    Welcome to Microsoft Q&A, and thank you for reaching out.

    You’re correct in your diagnosis: what’s happening is that the TTS pipeline is effectively tokenizing the HTML entity &amp; as two parts (& + amp;) for word-boundary metadata, while the speech engine normalizes it to “et” for audio. That mismatch causes:

      • Broken tokenization
      • amp; fragments in word boundaries
      • AudioOffset / Duration drift relative to spoken output

    The audio engine and the word-boundary generator are not operating on the exact same normalized text representation.

    Recommended Approach

    There isn’t currently a flag to tell the service “treat & as a single lexical word,” but the most reliable SSML-based workaround is to use the <sub> (substitute) tag.

    Example:

    <speak version="1.0"
           xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="fr-FR">
      <voice name="fr-FR-Remy:DragonHDLatestNeural">
        Donjons <sub alias="et">&amp;</sub> Dragons
      </voice>
    </speak>

    What this does:

      • The SSML remains valid (&amp; is still properly escaped).
      • The engine pronounces the alias "et".
      • Word-boundary events align with "et" instead of splitting into & and amp;.

    In many cases, this restores clean boundary alignment without manually replacing the symbol in your source text.
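    As a sketch, you could apply the substitution programmatically before building the SSML document. The helper name and the alias table below are illustrative (not part of any Azure SDK); the `<sub>` usage itself follows the SSML specification:

    ```python
    import html

    # Hypothetical locale-to-alias table; extend as needed.
    ALIAS = {"fr-FR": "et", "en-US": "and"}

    def ampersand_to_sub(text: str, locale: str) -> str:
        """Escape the text for SSML, then wrap each ampersand entity in a
        <sub> element so the word-boundary tokenizer sees one lexical word."""
        alias = ALIAS.get(locale, "and")
        escaped = html.escape(text)
        return escaped.replace("&amp;", f'<sub alias="{alias}">&amp;</sub>')

    body = ampersand_to_sub("Donjons & Dragons, souvent abrégé en D&D", "fr-FR")
    # body == 'Donjons <sub alias="et">&amp;</sub> Dragons, souvent abrégé
    #          en D<sub alias="et">&amp;</sub>D'
    ```

    The resulting `body` string can be dropped inside the `<voice>` element of the SSML payload exactly where the escaped text went before.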

    If <sub> Still Produces Misalignment

    If you continue seeing boundary corruption even with <sub>, then this is very likely a limitation/bug in how HTML entities are tokenized during boundary generation (particularly in Batch synthesis).

    In that case, the only fully deterministic workaround is:

    Pre-normalize the text before sending it to TTS. Replace symbolic forms with their spoken equivalents before building the SSML:

      • & → et (fr-FR)
      • & → and (en-US)
    This guarantees:

      • Stable tokenization
      • Correct AudioOffset alignment
      • No entity artifacts
      • Consistent boundary metadata
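    A minimal pre-normalization sketch (the function name and locale table are my own, not an Azure API):

    ```python
    import re

    # Hypothetical mapping of locale to the spoken form of "&".
    SPOKEN_AMPERSAND = {"fr-FR": "et", "en-US": "and"}

    def normalize_ampersand(text: str, locale: str) -> str:
        """Replace '&' with its spoken word before synthesis, so the
        boundary generator and the audio engine see the same tokens."""
        word = SPOKEN_AMPERSAND.get(locale, "and")
        # "D&D" -> "D et D"; "Donjons & Dragons" -> "Donjons et Dragons"
        text = re.sub(r"\s*&\s*", f" {word} ", text)
        # Collapse any doubled spaces introduced by the substitution.
        return re.sub(r"\s{2,}", " ", text).strip()
    ```

    Run this on the raw text, then escape and wrap it in SSML as usual; the word boundaries then refer to the literal word "et"/"and" that is actually spoken.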

    Yes, it’s language-aware, but it is the most reliable solution when precise synchronization (karaoke highlighting, subtitles, animation timing) is required.

    What Not to Rely On

      • PlainText vs SSML (behavior is the same)
      • Escaping with html.escape() alone
      • CDATA blocks
      • Assuming entity decoding happens before boundary generation

    You’ve already validated that those approaches don’t solve the offset drift.

    This is not expected behavior from a consumer standpoint.

    It stems from entity tokenization mismatching audio normalization.

      • Best workaround: use <sub alias="et">&amp;</sub>
      • Most reliable fallback: pre-replace & with the spoken form before synthesis.
      • If precise alignment is critical, normalization before TTS is safest.

    For more details, please refer to:

      • SSML document structure and special characters: https://learn.microsoft.com/azure/ai-services/speech-service/speech-synthesis-markup-structure#special-characters
      • SSML <sub> element for pronunciation substitution: https://learn.microsoft.com/azure/ai-services/speech-service/speech-synthesis-markup?tabs=csharp#sub-element

    I hope this helps. Do let me know if you have any further queries.

    Thank you!
