Wake Word Models

Spokestack provides pretrained TensorFlow Lite models that enable on-device wake word detection. These free models, however, only recognize the word “Spokestack”; in order to have your app respond to a different word or phrase, you’ll need new models. If machine learning is outside your wheelhouse and you’d like a customized wake word for your app, stop reading here and drop us a line. We’re happy to help.

If building a custom model sounds like fun, though, soldier on. We’ll describe the design of the models and their input/output shapes below; see the configuration guide for more information about hyperparameters. Spokestack uses three separate models; they operate continuously, each feeding output into the next, for both efficiency and accuracy. We’ll go over them in the order in which they’re used. See the list of references at the end for descriptions of any unfamiliar terminology, and let us know if we missed anything!



The filter model processes the linear amplitude Short-Time Fourier Transform (STFT), converting it into an audio feature representation. This representation may be the result of applying the mel filterbank or calculating MFCC features. Using a TensorFlow Lite model for filtering hides the details of the filter from Spokestack while optimizing the matrix operations involved.


The filter model takes as input a single linear STFT frame, which is computed by Spokestack as the magnitude of the FFT over a sliding window of the audio signal. This input is shaped [fft-window-size / 2 + 1]. The model outputs a feature vector shaped [mel-frame-width].
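As a rough illustration of the input described above, the following sketch computes one linear STFT frame with NumPy. The window size of 512 and the Hann window are assumptions made for the example, not values taken from Spokestack's configuration, and `stft_frame` is a hypothetical helper name.

```python
import numpy as np

FFT_WINDOW_SIZE = 512  # assumed value for fft-window-size


def stft_frame(samples: np.ndarray) -> np.ndarray:
    """Return the linear magnitude spectrum of one window of audio."""
    if samples.shape != (FFT_WINDOW_SIZE,):
        raise ValueError(f"expected {FFT_WINDOW_SIZE} samples, got {samples.shape}")
    # Taper the window before the FFT (Hann is an assumption here).
    windowed = samples * np.hanning(FFT_WINDOW_SIZE)
    # rfft keeps only the non-negative frequencies: N / 2 + 1 bins.
    return np.abs(np.fft.rfft(windowed)).astype(np.float32)


# Random noise stands in for one window of microphone audio.
audio = np.random.default_rng(0).standard_normal(FFT_WINDOW_SIZE)
frame = stft_frame(audio)
print(frame.shape)  # (257,), i.e. fft-window-size / 2 + 1
```

A frame like this is what the filter model would receive; the model itself then maps it to a vector of length mel-frame-width.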



The encoder model is the autoregressive component of the system. It processes a single frame (RNN) or a sliding window (CRNN) along with a previous state tensor.


The encoder model takes two inputs: a feature tensor shaped [mel-frame-length, mel-frame-width] (mel-frame-length is 1 when processing a single frame) and a state tensor shaped [wake-state-width]. It outputs an encoded representation of the input, shaped [wake-encode-width], along with an updated state tensor.
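The way state is threaded from one call to the next can be sketched as follows. `fake_encoder` is a hypothetical stand-in that only mimics the tensor shapes; it is not the real encoder. The dimension values, the zero initial state, and the assumption that the updated state keeps the shape [wake-state-width] are all illustrative.

```python
import numpy as np

# Assumed hyperparameter values, for illustration only.
MEL_FRAME_LENGTH = 1
MEL_FRAME_WIDTH = 40
WAKE_STATE_WIDTH = 128
WAKE_ENCODE_WIDTH = 128


def fake_encoder(frames: np.ndarray, state: np.ndarray):
    """Hypothetical stand-in for the encoder model; it only mimics shapes."""
    assert frames.shape == (MEL_FRAME_LENGTH, MEL_FRAME_WIDTH)
    assert state.shape == (WAKE_STATE_WIDTH,)
    # Toy recurrence: mix the previous state with the new input.
    new_state = np.tanh(0.5 * state + frames.mean())
    # Toy encoding: resize the state to the encoded width.
    encoded = np.resize(new_state, WAKE_ENCODE_WIDTH)
    return encoded, new_state


rng = np.random.default_rng(0)
state = np.zeros(WAKE_STATE_WIDTH, dtype=np.float32)  # assumed initial state
encoded_frames = []
for _ in range(5):
    # Random values stand in for the filter model's output.
    mel = rng.standard_normal((MEL_FRAME_LENGTH, MEL_FRAME_WIDTH)).astype(np.float32)
    # The state returned by one call is fed into the next call.
    encoded, state = fake_encoder(mel, state)
    encoded_frames.append(encoded)
```

The point of the sketch is the loop: each call consumes the state produced by the previous one, which is what makes the encoder autoregressive.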



The detection model is a binary classifier that outputs a posterior probability that the wake word was detected. The architecture of this model is opaque to the wake word trigger at runtime and may vary, but it must be compatible with TensorFlow Lite and Core ML and fast enough to run in soft real time on all supported devices.


The input to this model is a sliding window of encoder frames, each of which was produced by the encoder model described above. This input is shaped [wake-encode-length, wake-encode-width]. The classifier outputs a scalar probability value.
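One way to maintain the sliding window described above is a fixed-length buffer that drops the oldest encoder frame as each new one arrives. In this sketch, `fake_detector` is a hypothetical placeholder for the classifier, the dimension values are assumed, and padding the initial window with zeros is an assumption as well.

```python
from collections import deque

import numpy as np

# Assumed hyperparameter values, for illustration only.
WAKE_ENCODE_LENGTH = 50
WAKE_ENCODE_WIDTH = 128


def fake_detector(window: np.ndarray) -> float:
    """Hypothetical stand-in for the detection model; returns a value in [0, 1]."""
    assert window.shape == (WAKE_ENCODE_LENGTH, WAKE_ENCODE_WIDTH)
    logit = float(window.mean())
    return float(1.0 / (1.0 + np.exp(-logit)))  # sigmoid


# Start with a window of zeros (an assumption) so the shape is always valid.
window = deque(
    (np.zeros(WAKE_ENCODE_WIDTH, dtype=np.float32) for _ in range(WAKE_ENCODE_LENGTH)),
    maxlen=WAKE_ENCODE_LENGTH,
)

rng = np.random.default_rng(0)
posteriors = []
for _ in range(10):
    # Random values stand in for the encoder model's output.
    encoded = rng.standard_normal(WAKE_ENCODE_WIDTH).astype(np.float32)
    window.append(encoded)  # the oldest frame falls off the other end
    posteriors.append(fake_detector(np.stack(window)))
```

In a real pipeline the scalar would typically be compared against a threshold to decide whether the wake word fired; that threshold is not covered in this section.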

For More Information
