Text-to-speech for Python

Text-to-speech is a broad topic, but as far as Spokestack is concerned, there are two things your app has to handle: sending text, SSML, or Speech Markdown to be synthesized; and playing the resulting audio for your users. This guide will cover both.

Generating Audio

The best way to synthesize speech in Spokestack is to use the TextToSpeechManager class, which pairs a TextToSpeechClient with an audio output target. Keep in mind that it operates independently of the SpeechPipeline. If you haven’t already, you will need to create an account or sign in to get your API credentials.

TextToSpeechManager is initialized as follows:

from spokestack.tts.manager import TextToSpeechManager
from spokestack.tts.clients.spokestack import TextToSpeechClient
from spokestack.io.pyaudio import PyAudioOutput

manager = TextToSpeechManager(
    TextToSpeechClient("spokestack_id", "spokestack_secret"), PyAudioOutput()
)
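Hard-coding credentials as in the snippet above is fine for a quick test, but you may prefer to keep them out of source control. The following is a minimal sketch that reads them from environment variables; the helper `load_credentials` and the variable names `SPOKESTACK_ID` and `SPOKESTACK_SECRET` are assumptions made for illustration, not part of the Spokestack library.

```python
import os


def load_credentials(env=os.environ):
    """Return (key_id, key_secret) read from the environment.

    SPOKESTACK_ID and SPOKESTACK_SECRET are hypothetical variable names
    chosen for this sketch; use whatever names suit your deployment.
    """
    try:
        return env["SPOKESTACK_ID"], env["SPOKESTACK_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable: {missing}") from None


# Usage, assuming the imports from the snippet above:
# key_id, key_secret = load_credentials()
# manager = TextToSpeechManager(
#     TextToSpeechClient(key_id, key_secret), PyAudioOutput()
# )
```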

TTS supports three modes: text, ssml, and markdown. Each is covered briefly here; for a more detailed view, check out the TTS concept guide.

Text

The text mode is for plain text without any additional markup. To synthesize plain text you do the following:

manager.synthesize(utterance="welcome to spokestack", mode="text", voice="demo-male")

SSML

SSML is based on XML and gives you enhanced control over pronunciation. Check out the guide for more details. You can synthesize speech from SSML like this:

manager.synthesize(
    utterance="<speak>welcome to spokestack</speak>", mode="ssml", voice="demo-male"
)
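Because SSML is XML, characters such as & and < in user-supplied text need to be escaped before the text is wrapped in a speak element. The following is a minimal sketch using only the standard library; `to_ssml` is a hypothetical helper, not a Spokestack function.

```python
from xml.sax.saxutils import escape


def to_ssml(text):
    """Wrap plain text in a <speak> element, escaping XML special characters."""
    return f"<speak>{escape(text)}</speak>"


ssml = to_ssml("Q&A: is 3 < 5?")
# Usage, assuming the manager created earlier:
# manager.synthesize(utterance=ssml, mode="ssml", voice="demo-male")
```

Only apply a helper like this to plain text. Input that already contains SSML tags would have those tags escaped as well.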

Speech Markdown

Speech Markdown is a wrapper around SSML syntax that gives some additional features as explained in the guide. An example of Speech Markdown looks like this:

manager.synthesize(
    utterance="See all our products at (www)[characters] dot my company dot com.",
    mode="markdown",
    voice="demo-male",
)
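If your app receives utterances in mixed formats, it may help to pick the mode from the utterance itself. The sketch below is a rough heuristic, not part of Spokestack: `guess_mode` is a hypothetical helper that only recognizes a speak wrapper and the (text)[modifier] form shown above, so other Speech Markdown syntax would fall through to text.

```python
import re

# Matches Speech Markdown's (text)[modifier] form, e.g. (www)[characters].
_MARKDOWN_PATTERN = re.compile(r"\([^()]+\)\[[^\[\]]+\]")


def guess_mode(utterance):
    """Guess a synthesis mode ("text", "ssml", or "markdown") for an utterance."""
    stripped = utterance.strip()
    if stripped.startswith("<speak") and stripped.endswith("</speak>"):
        return "ssml"
    if _MARKDOWN_PATTERN.search(stripped):
        return "markdown"
    return "text"


# Usage, assuming the manager created earlier:
# manager.synthesize(utterance=text, mode=guess_mode(text), voice="demo-male")
```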

Additional Synthesis Options

If automatic playback is not what you are looking for, we offer another option. An instance of TextToSpeechClient can synthesize separately from the TextToSpeechManager and produce a URL that points to the audio file. This allows you to download the entire audio clip. This is especially useful in a Jupyter notebook where you may not have direct audio output access. Using the TextToSpeechClient to retrieve the audio URL is as simple as this:

from spokestack.tts.clients.spokestack import TextToSpeechClient

tts = TextToSpeechClient("spokestack_id", "spokestack_secret")

audio_location = tts.synthesize_url("welcome to spokestack")
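Once you have the URL, downloading the clip takes only the standard library. The following is a minimal sketch, assuming the URL can be fetched with a plain GET request; `download_audio` is a hypothetical helper, and the output file name and its extension are placeholders.

```python
import shutil
import urllib.request


def download_audio(url, path):
    """Stream the audio at `url` into a local file and return the file path."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path


# Usage, assuming audio_location from the snippet above
# (the file name here is a placeholder):
# download_audio(audio_location, "welcome.mp3")
```

Streaming with shutil.copyfileobj avoids holding the whole clip in memory, which matters little for short utterances but keeps the helper safe for longer ones.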