As of version 9.0.0, Spokestack offers a single class that centralizes setup and configuration for all of its individual modules (ASR, NLU, TTS, etc.). This guide details the configuration options available when setting up that class as well as tips for runtime usage.
Note that you don’t have to use this class to use Spokestack. Each module still has its own builder and can be configured and used independently if that’s best for your app. More information about the individual modules can be found at the following links:
Each Spokestack module can be used independently and comes with its own builder interface for configuration. The other guides in this section detail those builders, and the Spokestack.Builder uses them internally to configure the modules. If you prefer to use the Spokestack class but configure each module individually, that can be done via the get***Builder methods. You’ll also need these builders to perform low-level customization (for example, changing ASR provider). If you don’t need anything quite that advanced and would rather configure Spokestack directly, read on.
The speech pipeline is the first piece of the puzzle in any voice interaction and is responsible for capturing user audio and translating it into text. Configuring it entails choices about whether or not to use a wake word to activate ASR, what kind of preprocessing to perform on audio before sending it to ASR, and which ASR service to use. These choices can all be made individually or through the use of configuration profiles, as mentioned in the pipeline guide linked above.
By default, the Spokestack class uses the TFWakewordAndroidASR profile, which expects paths to TensorFlow Lite model files to be added to the builder:
val spokestack = Spokestack.Builder()
    .setProperty("wake-detect-path", "path-to-detect.tflite")
    .setProperty("wake-encode-path", "path-to-encode.tflite")
    .setProperty("wake-filter-path", "path-to-filter.tflite")
    // ...
    .build()
Note: configurations using Android’s ASR must also provide an Android application context via
To change pipeline profiles without leaving the builder’s call chain, use withPipelineProfile() and pass it the full class name of a profile. This can even be a custom profile you’ve created; see the existing pipeline profiles for examples of how to create one.
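As a rough illustration of passing a full class name, the fragment below selects the push-to-talk profile mentioned later in this guide. The package prefix io.spokestack.spokestack.profile is an assumption about where the bundled profiles live, so check it against the profile sources before relying on it; applicationContext stands for whatever Android context your app supplies.

```kotlin
import io.spokestack.spokestack.Spokestack

val spokestack = Spokestack.Builder()
    // Fully qualified profile class name; the package shown here is an assumption.
    .withPipelineProfile("io.spokestack.spokestack.profile.PushToTalkAndroidASR")
    // Android ASR profiles also need an application context.
    .withAndroidContext(applicationContext)
    // ...
    .build()
```

A custom profile would be referenced the same way, using its own package and class name.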
If you would prefer to manually activate ASR, you can disable wake word using the builder’s withoutWakeword() method, which will activate the PushToTalkAndroidASR profile and remove the need to supply wake word model files, or you can select your own profile via
To disable wake word and ASR altogether, use the builder’s
Once a user’s speech has been transcribed, it’s useful to know what to do with it. That’s where Natural Language Understanding comes in. You’ll need an NLU model and supporting files to use this feature; see our model export guide for some easy ways to create your own. You’ll supply the Spokestack builder with paths to these files just like you would wake word files:
val spokestack = Spokestack.Builder()
    .setProperty("nlu-model-path", "path-to-nlu.tflite")
    .setProperty("nlu-metadata-path", "path-to-metadata.json")
    .setProperty("wordpiece-vocab-path", "path-to-vocab.txt")
    // ...
    .build()
Once configured, all ASR transcripts will automatically be sent through NLU; see Receiving events below for information on how to see the results.
For certain domains, though, relying completely on ASR can be problematic. Sometimes the most likely transcription of a given sound isn’t the most likely transcription for your app. We experienced this ourselves during development of our Bartender app, where ASR consistently misheard “gin” as “Jen”.
Errors like this can cause cascading problems in processing user requests, so Spokestack allows you to edit ASR transcripts before they’re sent to NLU. Just supply an instance of a class that implements TranscriptEditor at build time:
val spokestack = Spokestack.Builder()
    .withTranscriptEditor(myEditor)
    // ...
    .build()
With that in place, every final ASR transcript will be sent through myEditor.edit() before being sent to NLU. Any listeners receiving speech events can access the unedited transcript via RECOGNIZE events, but NLU results will contain the edited version in the
We recommend that you use this feature sparingly and only after testing interactions with a variety of voices.
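To make the transcript-editing idea concrete, here is a minimal, self-contained sketch of an editor that swaps known mishearings for domain vocabulary. The TranscriptEditor interface declared below is a local stand-in so the example runs on its own; its single edit() method mirrors the myEditor.edit() call described above, but the real interface's package and exact signature are assumptions to verify against the library. DomainVocabularyEditor is a hypothetical name.

```kotlin
// Local stand-in for Spokestack's TranscriptEditor so this sketch runs alone.
// The method name follows the guide's myEditor.edit(); treat it as an assumption.
fun interface TranscriptEditor {
    fun edit(transcript: String): String
}

// Hypothetical editor that replaces known mishearings with domain vocabulary.
class DomainVocabularyEditor(corrections: Map<String, String>) : TranscriptEditor {
    // Precompile one whole-word, case-insensitive pattern per correction.
    private val rules: List<Pair<Regex, String>> = corrections.map { (heard, meant) ->
        Regex("\\b${Regex.escape(heard)}\\b", RegexOption.IGNORE_CASE) to meant
    }

    // Apply every rule in turn; words that merely contain a mishearing are left alone.
    override fun edit(transcript: String): String =
        rules.fold(transcript) { text, (pattern, meant) ->
            pattern.replace(text, Regex.escapeReplacement(meant))
        }
}

fun main() {
    val myEditor = DomainVocabularyEditor(mapOf("Jen" to "gin"))
    println(myEditor.edit("I'd like a Jen and tonic")) // prints "I'd like a gin and tonic"
}
```

Matching on whole words keeps the correction narrow: a name such as "Jenny" passes through unchanged, which fits the advice to use this feature sparingly.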
As with the speech pipeline, NLU features can also be disabled with builder methods:
withoutAutoClassification(): Don’t automatically send ASR transcripts to the NLU. If models are supplied, the NLU will still be available at runtime via
withoutNlu(): Disable NLU entirely. This removes the need to supply NLU file paths to the builder.
Once your app has processed the user’s request, you’ll likely want to respond via the same input modality that request came from: audio. That’s where text-to-speech comes in. By default, Spokestack handles sending text responses to a cloud service for synthesis and playing the resulting audio. The only configuration properties necessary are your Spokestack credentials (client ID and secret key, available from the API credentials section of your account settings) and a couple of Android system components:
val spokestack = Spokestack.Builder()
    .setProperty("spokestack-id", "your-client-id")
    .setProperty("spokestack-secret", "your-secret-key")
    .withAndroidContext(applicationContext)
    // ...
    .build()
With TTS enabled, you can use the synthesize() method to respond to your users. Spokestack uses ExoPlayer to play back synthesized audio. If you’d rather manage playback yourself, call withoutAutoPlayback() on the builder and see the next section for information on handling TTS events.
To disable TTS entirely at build time, eliminating the need to add your Spokestack credentials to the builder (unless you’ve switched ASR providers to Spokestack ASR), call
So far we’ve talked mostly about how to get Spokestack running and turn some knobs to make it work just the way you’d like. We’ve touted all the things it does without any intervention from an app that’s using it … but at some point, you’re going to want to interact with your user’s requests. Once a transcript goes through NLU, you’ll need the results of that classification in order to actually respond to the user.
Events from all Spokestack modules are dispatched to a listener registered at build time. This listener must inherit from the SpokestackAdapter class, which has been designed to allow clients to handle only the events they’re interested in.
SpokestackAdapter implements listener interfaces for all individual Spokestack modules, so those listeners’ methods can be overridden and used as event handlers. Since these methods were originally designed for separate modules, though, their names don’t look great all together in the same class. For that reason, we’ve also (since version 9.1.0) provided convenience methods that specify the module where the event originated. The new methods are:
speechEvent(): Receives events from the speech pipeline (activation, deactivation, ASR results, etc.). Trace events are also sent to the trace() method (see below).
nluResult(): Receives classifications from the NLU module.
ttsEvent(): Receives events from the TTS subsystem. If you’re managing TTS playback manually, the AUDIO_AVAILABLE event will let you know when your TTS response is ready to play. If you’re using Spokestack TTS, be sure to download or play the audio within 60 seconds of receiving this event, or it will become unavailable.
trace(): Receives trace events from all modules, specifying the module that originated the event. These events are also sent to any handlers registered to individual modules, so it’s up to you where to handle them. Overriding trace(), just like any of the other methods, is optional.
trace(), this method receives errors from all modules, specifying the module that originated the error.
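The convenience methods listed above might be overridden along the following lines. This is a sketch rather than a reference: the method names come from this guide, but the parameter types, import paths, and accessors (transcript, intent, confidence, type) are assumptions based on the library's public classes and should be checked against the API documentation for your version. MyListener and TAG are hypothetical names.

```kotlin
import android.util.Log
import io.spokestack.spokestack.SpeechContext
import io.spokestack.spokestack.SpokestackAdapter
import io.spokestack.spokestack.SpokestackModule
import io.spokestack.spokestack.nlu.NLUResult
import io.spokestack.spokestack.tts.TTSEvent

private const val TAG = "MyListener"

// Overrides only the handlers this app cares about; everything else keeps
// the adapter's default behavior.
class MyListener : SpokestackAdapter() {

    // Speech pipeline events: log the transcript when ASR finishes.
    override fun speechEvent(event: SpeechContext.Event, context: SpeechContext) {
        if (event == SpeechContext.Event.RECOGNIZE) {
            Log.i(TAG, "Transcript: ${context.transcript}")
        }
    }

    // NLU classifications: this is where an app would choose its response.
    override fun nluResult(result: NLUResult) {
        Log.i(TAG, "Intent: ${result.intent} (confidence ${result.confidence})")
    }

    // TTS events: only relevant when managing playback manually.
    override fun ttsEvent(event: TTSEvent) {
        if (event.type == TTSEvent.Type.AUDIO_AVAILABLE) {
            // Download or play the audio promptly; it expires after 60 seconds.
            Log.i(TAG, "TTS audio is ready")
        }
    }

    // Trace messages from every module, tagged with their origin.
    override fun trace(module: SpokestackModule, message: String) {
        Log.d(TAG, "[$module] $message")
    }
}
```

An instance of this class would then be handed to the builder as the listener registered at build time.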