What is Text-to-Speech?

Text-to-Speech, or TTS, uses a machine learning model to synthesize the text you provide into an AI voice that reads it aloud. In short, it's the speech synthesis technology made popular by the “Siri voice,” Stephen Hawking's voice, and even E.T.'s Speak & Spell!

With Spokestack, TTS is no longer limited to a single device or available only after a ton of machine learning work. It's easy to create a TTS voice and use it in your software!

Why Should I Use TTS?

Unique Audio Branding Opportunity

Multimodal UI Not Limited to a Screen

Create an Artificial Persona

Personalized Speech Specific to Each Potential User

How Does Text-to-Speech Work?

TTS transforms text input into audio that mimics a human speaker reading it aloud. It's essentially the opposite of ASR.

Synthesizing speech might be the oldest field in voice technology, with early efforts potentially dating back to the Middle Ages. We've come a long way since then, and today neural networks can produce speech nearly indistinguishable from a human speaker, both in the reproduction of individual sounds and in the qualities that make speech sound natural (things like cadence, intonation, and stress, collectively known as prosody). Natural speech synthesis is still a computationally intensive task; the models that approach human performance require too many resources to run on a mobile device, but the field is advancing rapidly.

Cloud-Based TTS

Spokestack's current approach to TTS is cloud-based. You send us plain text, or text formatted with SSML or Speech Markdown if you need fine control over the result, and we send back a URL where you can stream the audio for the next 60 seconds. Our mobile libraries have convenience methods for automatically streaming the audio to your device. Our system works faster than real-time, so there's no waiting for your audio to be ready: by the time you can send a request to your streaming URL, the first chunks of audio should be ready, and playback won't get ahead of synthesis.
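As a rough illustration of the flow described above, the Python sketch below builds a small SSML document and models the 60-second lifetime of the returned URL. It makes no network calls and does not use Spokestack's API; `build_ssml`, `SynthesisResult`, and the example URL are hypothetical names invented for this sketch, and `<speak>`, `<s>`, and `<break>` are elements from the W3C SSML specification that the service may or may not accept.

```python
import time
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# From the text: the streaming URL can be used for 60 seconds.
URL_LIFETIME_SECONDS = 60


def build_ssml(sentences, pause_ms=300):
    """Wrap each sentence in <s> inside <speak>, with a <break> between sentences."""
    speak = ET.Element("speak")
    for i, sentence in enumerate(sentences):
        s = ET.SubElement(speak, "s")
        s.text = sentence
        if i < len(sentences) - 1:
            # Pause length is an illustrative default, not a recommended value.
            ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


@dataclass
class SynthesisResult:
    """Hypothetical container for the URL returned by a synthesis request."""
    url: str
    issued_at: float  # seconds since the epoch, as returned by time.time()

    def is_streamable(self, now=None):
        """Return True while the URL is still inside its 60-second window."""
        now = time.time() if now is None else now
        return (now - self.issued_at) < URL_LIFETIME_SECONDS


if __name__ == "__main__":
    print(build_ssml(["Hello.", "Welcome back."]))
    result = SynthesisResult("https://example.invalid/audio.mp3", issued_at=time.time())
    print(result.is_streamable())
```

A client built along these lines might check `is_streamable()` before starting playback and request a fresh URL if the old one has lapsed.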

Our TTS is currently limited to English, but we can produce custom voices for your brand, and we offer an affordable subscription tier that lets you train your own TTS voice with as little as 5 minutes of data. The quality of a voice trained on a very small data set won't be quite up to par with our custom voices, but it can be a great way to produce a proof of concept or power a hobby project.

Create a Custom TTS Model

Creating a personal text-to-speech model is straightforward using Spokestack Maker, a microphone, and a quiet room.

Spokestack's personal text-to-speech uses few-shot transfer learning to produce a speech model capable of synthesizing any sound in the English language in near real-time using a small amount of training data. The quality of the model is highly dependent on the quality and quantity of training data provided.

1 Create a TTS Model

First, head to the text-to-speech builder and click Create model in the top right. A section for a new model will appear. Change the model's name.

2 Record and Upload Samples

Then, look for the Data Collection section. Training a TTS model requires recordings of a single voice. The tool will provide the scripts; all you have to do is read them. Click Record to open a window that will let you record as many scripts as you like, review your recordings before upload, and move on to the next script.

At least 75 samples are required to train a model, but the more samples, the better your model will sound.

It may be tempting to give the scripts a bit of personality, but since we're training a model with relatively little data, it's best to keep both your pace and pitch at a natural, even level. You don't have to read in a monotone, because we do want to capture pauses and natural pitch contours, but don't put too much emotion into your read.
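If you keep local copies of your recordings, a quick check can tell you how far you are from the 75-sample minimum. The sketch below assumes the recordings are WAV files in a single folder, which the text above does not specify; the function names and folder layout are hypothetical and not part of any Spokestack tool.

```python
import wave
from pathlib import Path

# From the text: at least 75 samples are required to train a model.
MIN_SAMPLES = 75


def summarize_recordings(folder):
    """Count the WAV files in `folder` and total their duration in seconds."""
    count = 0
    total_seconds = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # Duration is the number of frames divided by frames per second.
            total_seconds += wav.getnframes() / wav.getframerate()
        count += 1
    return count, total_seconds


def samples_still_needed(count, minimum=MIN_SAMPLES):
    """Return how many more samples are needed to reach the minimum (never negative)."""
    return max(0, minimum - count)


if __name__ == "__main__":
    # "recordings" is a placeholder folder name.
    n, seconds = summarize_recordings("recordings")
    print(f"{n} samples, {seconds:.1f} s of audio, {samples_still_needed(n)} more needed")
```

Since more samples produce a better-sounding model, the minimum is a floor rather than a target.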

3 Train Your Model

When you’ve reached 75 scripts (or your personal tolerance level, whichever is higher), click Train. It takes longer to train a TTS model than a wake word or keyword model, so don't record all your samples right before you need the model; you'll probably have at least a couple of hours to wait.

How Do I Use a TTS Model?

For mobile apps, integrate Spokestack Tray, a drop-in UI widget that manages voice interactions and delivers actionable user commands with just a few lines of code.

Sample code and tutorials
Complete API and SDK documentation
Low-code integrations on popular platforms including:
Hugging Face
Google Assistant
iOS, Android, React Native, Node, Python
let tts = SpokestackBuilder()
    .setProperty("apiId", "YOUR_SPOKESTACK_API_ID")
    .setProperty("apiSecret", "YOUR_SPOKESTACK_API_SECRET")
let input = TextToSpeechInput(
  "Hello world!",
Full-Featured Platform SDK

Our native iOS library is written in Swift and includes a top-level class that handles wake word configuration, making setup a breeze.

Explore the docs

Become a Spokestack Maker and #OwnYourVoice

Access our hosted services for model import, natural language processing, text-to-speech, and wake word.