Special character ampersand (“&”) breaks word boundaries in Azure Text-to-Speech

Soulaïman Marsou 0 Reputation points
2026-02-26T10:51:25.2866667+00:00

Hello,

I’m encountering an issue with word boundary events in Azure Text-to-Speech when the input text contains the ampersand character (&).


Context

  • Locale: fr-FR
  • Neural French voice (e.g. fr-FR-Remy:DragonHDLatestNeural)
  • Batch synthesis API
  • wordBoundaryEnabled = true
  • In my use case, I rely on AudioOffset and Duration to synchronize text with audio.

When the text contains an ampersand (&), the voice correctly pronounces it as “et”, but the word boundary results become corrupted:

  • Tokens are split incorrectly
  • Escaped fragments like amp; appear in the token text
  • Offsets no longer match the spoken audio

This happens whether I use:

  • PlainText
  • PlainText with escaped text
  • SSML
  • SSML with escaped text

In all cases, the presence of & causes misaligned word boundary output.

Example:

Voice : fr-FR-Remy:DragonHDLatestNeural

Input text : "Donjons & Dragons, souvent abrégé en D&D, est une expérience unique."

Word boundaries :

[
  { "Text": "Donjons", "AudioOffset": 70, "Duration": 600 },
  { "Text": "&", "AudioOffset": 670, "Duration": 0 },
  { "Text": "amp; Dra", "AudioOffset": 670, "Duration": 600 },
  { "Text": "g", "AudioOffset": 1270, "Duration": 0 },
  { "Text": "ons, sou", "AudioOffset": 1470, "Duration": 559 },
  { "Text": "vent ab", "AudioOffset": 2029, "Duration": 400 },
  { "Text": "rég", "AudioOffset": 2429, "Duration": 81 },
  { "Text": "é en", "AudioOffset": 2510, "Duration": 640 },
  { "Text": " ", "AudioOffset": 3150, "Duration": 0 },
  { "Text": "D&am", "AudioOffset": 3510, "Duration": 160 },
  { "Text": "p;D,", "AudioOffset": 3670, "Duration": 159 }
]

Simplified request example

Here is a simplified version of the batch synthesis request I’m sending using Python:


import asyncio
import html
import json

import httpx

locale = "fr-FR"
voice = "fr-FR-YourVoiceName"
text = "Donjons & Dragons, souvent abrégé en D&D"
safe_text = html.escape(text)

ssml = f"""
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="{locale}">
  <voice name="{voice}">
    {safe_text}
  </voice>
</speak>
"""

payload = {
    "inputKind": "SSML",
    "inputs": [
        {"content": ssml}
    ],
    "properties": {
        "outputFormat": "audio-48khz-192kbitrate-mono-mp3",
        "wordBoundaryEnabled": True
    }
}

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.put(
            url="https://<region>.tts.speech.microsoft.com/texttospeech/batchsyntheses/<job_id>?api-version=2024-04-01",
            headers={
                "Content-Type": "application/json",
                "Ocp-Apim-Subscription-Key": "<your-key>",
            },
            content=json.dumps(payload),
        )
        response.raise_for_status()

asyncio.run(main())

I get the same issue using PlainText and/or calling the REST API directly:

curl -X PUT \
  "$AZURE_AI_SPEECHSERVICES_ENDPOINT/texttospeech/batchsyntheses/$JOB_ID?api-version=2024-04-01" \
  -H "Ocp-Apim-Subscription-Key: $AZURE_AI_SPEECHSERVICES_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "inputKind": "PlainText",
    "synthesisConfig": {
      "voice": "fr-FR-Remy:DragonHDLatestNeural"
    },
    "inputs": [
      {
        "content": "Donjons & Dragons, souvent abrégé en D&D, est une expérience unique."
      }
    ],
    "properties": {
      "outputFormat": "audio-48khz-192kbitrate-mono-mp3",
      "sentenceBoundaryEnabled": true,
      "wordBoundaryEnabled": true
    }
  }'

Expected behavior

Even if & is internally normalized to “et” (French) or “and” (English), I would expect:

  • Clean tokenization
  • Stable word boundaries
  • No escaped fragments like amp;
  • No offset shifting

Question

Is there a supported or recommended way to include the ampersand character (&) in Azure Text-to-Speech while preserving correct word boundary offsets, without manually replacing it with language-specific words like:

  • et (French)
  • and (English)

Is this:

  • A known limitation of word boundary generation?
  • A parsing issue related to HTML entity handling?
  • Something that requires a specific SSML construct?

Any guidance on best practices for handling special characters while maintaining reliable word boundary alignment would be greatly appreciated.

Thank you.

Azure AI Speech

An Azure service that integrates speech processing into apps and services.

1 answer

  1. SRILAKSHMI C 15,030 Reputation points Microsoft External Staff Moderator
    2026-03-03T00:44:38.1+00:00

    Hello Soulaïman Marsou,

    Welcome to Microsoft Q&A, and thank you for reaching out.

    You’re correct in your diagnosis: what’s happening is that the TTS pipeline is effectively tokenizing the HTML entity &amp; as two parts (& + amp;) for word-boundary metadata, while the speech engine normalizes it to “et” for audio. That mismatch causes:

      • Broken tokenization
      • amp; fragments in word boundaries
      • AudioOffset / Duration drift relative to spoken output

    The audio engine and the word-boundary generator are not operating on the exact same normalized text representation.

    Recommended Approach

    There isn’t currently a flag to tell the service “treat & as a single lexical word,” but the most reliable SSML-based workaround is to use the <sub> (substitute) tag.

    Example:

    <speak version="1.0"
           xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="fr-FR">
      <voice name="fr-FR-Remy:DragonHDLatestNeural">
        Donjons <sub alias="et">&amp;</sub> Dragons
      </voice>
    </speak>

    What this does:

      • The SSML remains valid (&amp; is still properly escaped).
      • The engine pronounces the alias "et".
      • Word-boundary events align with "et" instead of splitting into & and amp;.

    In many cases, this restores clean boundary alignment without manually replacing the symbol in your source text.
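    As a sketch, you could apply the substitution programmatically before building the SSML document. The helper name and the alias table below are illustrative (not part of any Azure SDK); the `<sub>` usage itself follows the SSML specification:

    ```python
    import html

    # Hypothetical locale-to-alias table; extend as needed.
    ALIAS = {"fr-FR": "et", "en-US": "and"}

    def ampersand_to_sub(text: str, locale: str) -> str:
        """Escape the text for SSML, then wrap each ampersand entity in a
        <sub> element so the word-boundary tokenizer sees one lexical word."""
        alias = ALIAS.get(locale, "and")
        escaped = html.escape(text)
        return escaped.replace("&amp;", f'<sub alias="{alias}">&amp;</sub>')

    body = ampersand_to_sub("Donjons & Dragons, souvent abrégé en D&D", "fr-FR")
    # body == 'Donjons <sub alias="et">&amp;</sub> Dragons, souvent abrégé
    #          en D<sub alias="et">&amp;</sub>D'
    ```

    The resulting `body` string can be dropped inside the `<voice>` element of the SSML payload exactly where the escaped text went before.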

    If <sub> Still Produces Misalignment

    If you continue seeing boundary corruption even with <sub>, then this is very likely a limitation/bug in how HTML entities are tokenized during boundary generation (particularly in Batch synthesis).

    In that case, the only fully deterministic workaround is:

    Pre-normalize the text before sending it to TTS. Replace symbolic forms with their spoken equivalents before building the SSML:

      • & → et (fr-FR)
      • & → and (en-US)
    This guarantees:

      • Stable tokenization
      • Correct AudioOffset alignment
      • No entity artifacts
      • Consistent boundary metadata
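    A minimal pre-normalization sketch (the function name and locale table are my own, not an Azure API):

    ```python
    import re

    # Hypothetical mapping of locale to the spoken form of "&".
    SPOKEN_AMPERSAND = {"fr-FR": "et", "en-US": "and"}

    def normalize_ampersand(text: str, locale: str) -> str:
        """Replace '&' with its spoken word before synthesis, so the
        boundary generator and the audio engine see the same tokens."""
        word = SPOKEN_AMPERSAND.get(locale, "and")
        # "D&D" -> "D et D"; "Donjons & Dragons" -> "Donjons et Dragons"
        text = re.sub(r"\s*&\s*", f" {word} ", text)
        # Collapse any doubled spaces introduced by the substitution.
        return re.sub(r"\s{2,}", " ", text).strip()
    ```

    Run this on the raw text, then escape and wrap it in SSML as usual; the word boundaries then refer to the literal word "et"/"and" that is actually spoken.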

    Yes, it’s language-aware, but it is the most reliable solution when precise synchronization (karaoke highlighting, subtitles, animation timing) is required.

    What Not to Rely On

      • PlainText vs SSML (behavior is the same)
      • Escaping with html.escape() alone
      • CDATA blocks
      • Assuming entity decoding happens before boundary generation

    You’ve already validated that those approaches don’t solve the offset drift.

    This is not expected behavior from a consumer standpoint.

    It stems from entity tokenization mismatching audio normalization.

      • Best workaround: use <sub alias="et">&amp;</sub>
      • Most reliable fallback: pre-replace & with the spoken form before synthesis.
      • If precise alignment is critical, normalization before TTS is safest.

    For more details, please refer to:

      • SSML document structure and special characters: https://learn.microsoft.com/azure/ai-services/speech-service/speech-synthesis-markup-structure#special-characters
      • SSML <sub> element for pronunciation substitution: https://learn.microsoft.com/azure/ai-services/speech-service/speech-synthesis-markup?tabs=csharp#sub-element

    I hope this helps. Do let me know if you have any further queries.

    Thank you!
