Wake Word Models

Spokestack provides pretrained TensorFlow Lite models that enable on-device wake word detection. These free models, however, only recognize the word “Spokestack”; in order to have your app respond to a different word or phrase, you’ll need new models. If machine learning is outside your wheelhouse and you’d like a customized wake word for your app, check out our Maker tier for a way to prototype a personal wake word model, no code required.

If building a custom model sounds like fun, though, soldier on. We’ll describe the design of the models and their input/output shapes below; see the configuration guide for more information about hyperparameters. Spokestack uses three separate models; they operate continuously, each feeding output into the next, for both efficiency and accuracy. We’ll go over them in the order in which they’re used. See the list of references at the end for descriptions of any unfamiliar terminology, and let us know if we missed anything!

Filter

Description

The filter model processes the linear amplitude Short-Time Fourier Transform (STFT), converting it into an audio feature representation. This representation may be the result of applying the mel filterbank or calculating MFCC features. Using a TF-Lite model for filtering hides the details of the filter from Spokestack itself while optimizing the matrix operations involved.

Input/Output

The filter model takes as input a single linear STFT frame, which is computed by Spokestack as the magnitude of the FFT over a sliding window of the audio signal. This input is shaped [fft-window-size / 2 + 1]. The model outputs a feature vector shaped [mel-frame-width].
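As a concrete illustration, here is a minimal sketch (in Python, using TensorFlow) of a filter model that applies a mel filterbank and is converted to TF-Lite. The specific sizes (512-sample FFT window, 40 mel bins, 16 kHz sample rate) and the log-compression step are assumptions for illustration, not values Spokestack requires:

```python
import tensorflow as tf

# Illustrative placeholder sizes; substitute your own hyperparameters.
FFT_WINDOW_SIZE = 512                    # fft-window-size
NUM_FFT_BINS = FFT_WINDOW_SIZE // 2 + 1  # input length: 257
MEL_WIDTH = 40                           # mel-frame-width

# Linear-to-mel projection matrix, shaped [NUM_FFT_BINS, MEL_WIDTH].
mel_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=MEL_WIDTH,
    num_spectrogram_bins=NUM_FFT_BINS,
    sample_rate=16000,
    lower_edge_hertz=20.0,
    upper_edge_hertz=7600.0)

class FilterModel(tf.Module):
    # Maps one magnitude STFT frame to one mel feature vector.
    @tf.function(input_signature=[tf.TensorSpec([NUM_FFT_BINS], tf.float32)])
    def __call__(self, stft_frame):
        mel = tf.matmul(stft_frame[tf.newaxis, :], mel_matrix)
        # Log compression is a common (but optional) final step.
        return tf.math.log(mel + 1e-6)[0]  # output shape: [MEL_WIDTH]

# Convert to TF-Lite for on-device use.
model = FilterModel()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [model.__call__.get_concrete_function()], model)
tflite_model = converter.convert()
```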

Encoder

Description

The encoder model is the autoregressive component of the system. It processes a single mel frame (in RNN architectures) or a sliding window of frames (in CRNN architectures) along with a previous state tensor.

Input/Output

The encoder model takes two input tensors: an audio feature tensor shaped [mel-frame-length, mel-frame-width] (if processing a single frame, mel-frame-length will be 1) and a state tensor shaped [wake-state-width]. It outputs an encoded representation of the frame, shaped [wake-encode-width], and an updated state tensor.
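One way to realize this interface is with a single GRU cell, as in the sketch below (again illustrative: the dimensions are placeholders, and a plain GRU makes wake-encode-width equal wake-state-width, which need not hold for other architectures):

```python
import tensorflow as tf

# Illustrative placeholder dimensions; substitute your own hyperparameters.
MEL_WIDTH = 40      # mel-frame-width
STATE_WIDTH = 128   # wake-state-width
ENCODE_WIDTH = 128  # wake-encode-width (equals STATE_WIDTH for a plain GRU)

class EncoderModel(tf.Module):
    def __init__(self):
        super().__init__()
        # A single GRU cell is one way to build the autoregressive encoder.
        self.cell = tf.keras.layers.GRUCell(STATE_WIDTH)
        self.cell.build(tf.TensorShape([1, MEL_WIDTH]))  # create weights eagerly

    # Consumes one mel frame plus the previous state; emits the encoded
    # frame and the updated state.
    @tf.function(input_signature=[
        tf.TensorSpec([1, MEL_WIDTH], tf.float32),  # [mel-frame-length, mel-frame-width]
        tf.TensorSpec([STATE_WIDTH], tf.float32)])  # [wake-state-width]
    def __call__(self, frame, state):
        output, new_states = self.cell(frame, [state[tf.newaxis, :]])
        return output[0], new_states[0][0]  # [wake-encode-width], [wake-state-width]

model = EncoderModel()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [model.__call__.get_concrete_function()], model)
tflite_model = converter.convert()
```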

Detector

Description

The detector model is a binary classifier that outputs a posterior probability that the wake word was detected. The architecture of this model is opaque to the wake word trigger at runtime and may vary, but it must be compatible with TensorFlow Lite and Core ML and fast enough to run in soft real time on all supported devices.

Input/Output

The input to this model is a sliding window of encoded frames produced by the encoder model described above. This input is shaped [wake-encode-length, wake-encode-width]. The classifier outputs a scalar probability value.
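A simple detector that satisfies this interface might flatten the encoded window and classify it with a small dense network, as sketched below. The layer sizes and architecture here are assumptions chosen for brevity; any classifier meeting the shape and runtime constraints above will do:

```python
import tensorflow as tf

# Illustrative placeholder dimensions; substitute your own hyperparameters.
ENCODE_LENGTH = 100  # wake-encode-length
ENCODE_WIDTH = 128   # wake-encode-width

# Flatten the window of encoded frames and classify with a small dense
# network ending in a sigmoid, which yields the scalar posterior.
detector = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(ENCODE_LENGTH, ENCODE_WIDTH)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(detector)
tflite_model = converter.convert()
```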

For More Information