What is Text-to-Speech?
Text-to-Speech, or TTS, uses a machine learning model to synthesize the text you provide into an AI voice that reads the text aloud. In short, it's the speech synthesis technology made popular by the “Siri voice”, Stephen Hawking's synthesized voice, or even E.T.'s Speak & Spell!
With Spokestack, TTS is no longer limited to a single device or only available with a ton of machine learning work—it's easy to create a TTS voice and use it in your software!
Why Should I Use TTS?
Unique Audio Branding Opportunity
Multimodal UI Not Limited to a Screen
Create an Artificial Persona
Personalized Speech Specific to Each Potential User
How Does Text-to-Speech Work?
TTS transforms text input into audio that mimics a human speaker reading it aloud. It's essentially the opposite of automatic speech recognition (ASR), which turns spoken audio into text.
Synthesizing speech might be the oldest field in voice technology, with early efforts potentially dating back to the Middle Ages. We've come a long way since then, and today neural networks can produce speech nearly indistinguishable from a human speaker, both in the reproduction of individual sounds and in the qualities that make speech sound natural (things like cadence, intonation, and stress), collectively known as prosody. Natural speech synthesis is still a computationally intensive task; the models that approach human performance require too many resources to run on a mobile device, but the field is advancing rapidly.
Spokestack's current approach to TTS is cloud-based. You send us either plain text or text formatted with SSML or Speech Markdown if you need fine control over the result, and we'll send you a URL where you can stream your result for the next 60 seconds. Our mobile libraries have convenience methods for automatically streaming the audio to your local or web device. Our system works faster than real-time, so there's no waiting for your audio to be ready — by the time you can send a request to your streaming URL, the first chunks of audio should be ready, and playback won't get ahead of synthesis.
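Because the streaming URL is only good for 60 seconds, a client may want to keep track of how much of that window is left before it starts playback. The following is a minimal sketch of that bookkeeping, not part of the Spokestack SDK: the `SynthesisResult` type, its property names, and the placeholder URL are all hypothetical, and it assumes the window starts when the client receives the response.

```swift
import Foundation

// Hypothetical model of a synthesis response. These names are illustrative
// and are not taken from the Spokestack SDK.
struct SynthesisResult {
    let streamURL: URL
    let issuedAt: Date

    // The 60-second window comes from the description above.
    static let validity: TimeInterval = 60

    /// Seconds left before the streaming URL should be treated as stale.
    func secondsRemaining(at now: Date = Date()) -> TimeInterval {
        let elapsed = now.timeIntervalSince(issuedAt)
        return max(0, SynthesisResult.validity - elapsed)
    }

    /// True while the URL can still be opened for streaming.
    func isUsable(at now: Date = Date()) -> Bool {
        return secondsRemaining(at: now) > 0
    }
}

// Fixed timestamps keep the example deterministic.
let issued = Date(timeIntervalSince1970: 1_000)
let result = SynthesisResult(
    streamURL: URL(string: "https://example.com/audio/placeholder")!, // placeholder URL
    issuedAt: issued
)

let early = issued.addingTimeInterval(10)
let late = issued.addingTimeInterval(75)

print(result.isUsable(at: early))         // true
print(result.secondsRemaining(at: early)) // 50.0
print(result.isUsable(at: late))          // false
```

If `isUsable` returns false, the simplest recovery is to send the synthesis request again and stream from the new URL.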
Our TTS is currently limited to English, but we can produce custom voices for your brand, and we offer an affordable subscription tier that lets you train your own TTS voice with as little as 5 minutes of data. The quality of a voice trained on a very small data set won't be quite up to par with our custom voices, but it can be a great way to produce a proof of concept or power a hobby project.
Creating a personal text-to-speech model is straightforward: all you need is Spokestack Maker or Spokestack Pro, a microphone, and a quiet room.
1 Create a TTS Model
First, head to the text-to-speech builder and click Create model in the top right. A section for a new model will appear. Change the model's name.
2 Record and Upload Samples
Then, look for the Data Collection section. Training a TTS model requires recordings of a single voice. The tool will provide the scripts; all you have to do is read them. Click Record to open a window that will let you record as many scripts as you like, review your recordings before upload, and move on to the next script.
It may be tempting to give the scripts a bit of personality. Since we're training a model with relatively little data, it's best to keep both your pace and pitch at a natural, even level. Don't feel like you have to read in a monotone — we do want to capture pauses and natural pitch contours — but don't put too much emotion into your read.
3 Train Your Model
When you’ve reached 75 scripts (or your personal tolerance level, whichever is higher), click Train. It takes longer to train a TTS model than wake word or keyword models, so don't record all your samples right before you need to use it; you'll probably have at least a couple of hours to wait.
How Do I Use a TTS Model?
For mobile apps, integrate Spokestack Tray, a drop-in UI widget that manages voice interactions and delivers actionable user commands with just a few lines of code.
// Configure Spokestack with your API credentials.
let tts = SpokestackBuilder()
    .addDelegate(self)
    .setProperty("apiId", "YOUR_SPOKESTACK_API_ID")
    .setProperty("apiSecret", "YOUR_SPOKESTACK_API_SECRET")
    .build()

// Synthesize the text with the "demo-male" voice and play it.
let input = TextToSpeechInput("Hello world!", "demo-male")
tts.speak(input)
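The call above sends plain text. When you need finer control, such as a pause between sentences, the text can be formatted as SSML before it is sent. Here is a rough sketch of building that markup with the standard SSML `speak` and `break` elements. The helper names `escapeXML` and `ssml` are hypothetical and not part of the Spokestack SDK, and how you tell the library that the input is SSML is not shown, so check the library documentation for that step.

```swift
import Foundation

/// Escapes characters that have special meaning in XML so that
/// user-supplied text cannot break the surrounding markup.
func escapeXML(_ text: String) -> String {
    var out = ""
    for ch in text {
        switch ch {
        case "&": out += "&amp;"
        case "<": out += "&lt;"
        case ">": out += "&gt;"
        case "\"": out += "&quot;"
        case "'": out += "&apos;"
        default: out.append(ch)
        }
    }
    return out
}

/// Joins sentences into one SSML document with a fixed pause between them.
/// Hypothetical helper for illustration only.
func ssml(sentences: [String], pauseMilliseconds: Int) -> String {
    let pause = "<break time=\"\(pauseMilliseconds)ms\"/>"
    let body = sentences.map(escapeXML).joined(separator: " \(pause) ")
    return "<speak>\(body)</speak>"
}

let markup = ssml(
    sentences: ["Hello world!", "Tom & Jerry are here."],
    pauseMilliseconds: 500
)
print(markup)
// <speak>Hello world! <break time="500ms"/> Tom &amp; Jerry are here.</speak>
```

Escaping matters because an unescaped `&` or `<` in the text would make the markup invalid XML, and the request would likely be rejected or read incorrectly.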