Spokestack provides pretrained TensorFlow Lite models that enable on-device wakeword detection. These free models, however, only recognize the word “Spokestack”; in order to have your app respond to a different word or phrase, you’ll need new models. If machine learning is outside your wheelhouse and you’d like a customized wakeword for your app, stop reading here and drop us a line. We’re happy to help.
If building a custom model sounds like fun, though, soldier on. We’ll describe the design of the models and their input/output shapes below; see the configuration guide for more information about hyperparameters. Spokestack uses three separate models; they operate continuously, each feeding output into the next, for both efficiency and accuracy. We’ll go over them in the order in which they’re used. See the list of references at the end for descriptions of any unfamiliar terminology, and let us know if we missed anything!
The filter model processes the linear amplitude Short-Time Fourier Transform (STFT), converting it into an audio feature representation. This representation may be the result of applying the mel filterbank or calculating MFCC features. The use of a TF-Lite model for filtering hides the details of the filter from Spokestack while optimizing the matrix operations involved.
The filter model takes as input a single linear STFT frame, which is computed by Spokestack as the magnitude of the FFT over a sliding window of the audio signal. This input is shaped [fft-window-size / 2 + 1]. The model outputs a feature vector shaped [mel-frame-width].
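The shapes involved in the filter step can be sketched with NumPy. This is only an illustration under stated assumptions: the values chosen for fft-window-size and mel-frame-width, the Hann window, and the random projection matrix standing in for the filter model are all hypothetical, and the output width is assumed to match the width of the frames the encoder consumes. A real filter model would be a trained TF-Lite graph.

```python
import numpy as np

FFT_WINDOW_SIZE = 512   # assumed value for fft-window-size
MEL_FRAME_WIDTH = 40    # assumed value for mel-frame-width


def stft_frame(samples: np.ndarray) -> np.ndarray:
    """Magnitude of the FFT over one window of audio samples.

    The Hann window is an assumption; the source only says the frame is
    the magnitude of the FFT over a sliding window of the signal.
    """
    assert samples.shape == (FFT_WINDOW_SIZE,)
    windowed = samples * np.hanning(FFT_WINDOW_SIZE)
    # rfft of N real samples yields N / 2 + 1 frequency bins
    return np.abs(np.fft.rfft(windowed)).astype(np.float32)


def filter_model(frame: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the filter model: one matrix projection."""
    return frame @ weights


rng = np.random.default_rng(0)
audio = rng.standard_normal(FFT_WINDOW_SIZE).astype(np.float32)
# Random weights stand in for a mel filterbank; they are not trained values.
weights = rng.random((FFT_WINDOW_SIZE // 2 + 1, MEL_FRAME_WIDTH), dtype=np.float32)

frame = stft_frame(audio)
features = filter_model(frame, weights)
print(frame.shape, features.shape)
```

Keeping the filter as a single projection mirrors the point made above: the caller only needs to know the input and output shapes, not how the features are computed.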
The encoder model is the autoregressive component of the system. It processes a single frame (RNN) or a sliding window (CRNN) along with a previous state tensor.
The encoder model’s input tensor is shaped [mel-frame-length, mel-frame-width] (if processing a single frame, mel-frame-length will be 1), the state tensor is shaped [wake-state-width], and the output tensor is shaped [wake-encode-width]. It outputs an encoded representation of the frame and an updated state tensor.
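To make the autoregressive contract concrete, here is a minimal sketch of an encoder step that takes a frame window plus the previous state and returns an encoding plus the updated state. The plain tanh recurrence, the random weights, and the specific dimension values are assumptions made for illustration; they are not Spokestack's architecture, which would be a trained RNN or CRNN.

```python
import numpy as np

MEL_FRAME_LENGTH = 1     # single-frame (RNN) case
MEL_FRAME_WIDTH = 40     # assumed value for mel-frame-width
WAKE_STATE_WIDTH = 128   # assumed value for wake-state-width
WAKE_ENCODE_WIDTH = 128  # assumed value for wake-encode-width

rng = np.random.default_rng(0)
# Random, untrained weights for a hypothetical recurrent cell.
W_in = rng.standard_normal((MEL_FRAME_LENGTH * MEL_FRAME_WIDTH, WAKE_STATE_WIDTH)) * 0.1
W_state = rng.standard_normal((WAKE_STATE_WIDTH, WAKE_STATE_WIDTH)) * 0.1
W_out = rng.standard_normal((WAKE_STATE_WIDTH, WAKE_ENCODE_WIDTH)) * 0.1


def encode_step(frames: np.ndarray, state: np.ndarray):
    """Run one encoder step and return (encoded, new_state)."""
    assert frames.shape == (MEL_FRAME_LENGTH, MEL_FRAME_WIDTH)
    assert state.shape == (WAKE_STATE_WIDTH,)
    new_state = np.tanh(frames.reshape(-1) @ W_in + state @ W_state)
    encoded = new_state @ W_out
    return encoded, new_state


# The state starts at zero and is fed back in on every step.
state = np.zeros(WAKE_STATE_WIDTH)
encoded_frames = []
for _ in range(5):
    mel = rng.standard_normal((MEL_FRAME_LENGTH, MEL_FRAME_WIDTH))
    encoded, state = encode_step(mel, state)
    encoded_frames.append(encoded)
```

The caller owns the state tensor between steps, which is what lets the model run continuously over a stream of frames.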
The detection model is a binary classifier that outputs a posterior probability that the wakeword was detected. The architecture of this model is opaque to the wakeword trigger at runtime and may vary, but it must be compatible with TensorFlow Lite and Core ML and fast enough to run in soft real time on all supported devices.
The input to this model is a sliding window of encoder frames, each of which was produced by the encoder model described above. This input is shaped [wake-encode-length, wake-encode-width]. The classifier outputs a scalar probability value.
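The sliding window and scalar output can be sketched as follows. The logistic classifier over the flattened window, the random weights, the zero-filled initial window, and the dimension values are all assumptions chosen to keep the example short; the actual detection architecture may differ.

```python
from collections import deque

import numpy as np

WAKE_ENCODE_LENGTH = 50   # assumed value for wake-encode-length
WAKE_ENCODE_WIDTH = 128   # assumed value for wake-encode-width

rng = np.random.default_rng(0)
# Random, untrained weights for a hypothetical logistic classifier.
weights = rng.standard_normal(WAKE_ENCODE_LENGTH * WAKE_ENCODE_WIDTH) * 0.01
bias = 0.0

# A bounded deque drops the oldest frame whenever a new one is appended.
window = deque(
    [np.zeros(WAKE_ENCODE_WIDTH)] * WAKE_ENCODE_LENGTH,
    maxlen=WAKE_ENCODE_LENGTH,
)


def detect(window_frames: np.ndarray) -> float:
    """Return a posterior probability for one window of encoder frames."""
    assert window_frames.shape == (WAKE_ENCODE_LENGTH, WAKE_ENCODE_WIDTH)
    logit = window_frames.reshape(-1) @ weights + bias
    return float(1.0 / (1.0 + np.exp(-logit)))


posterior = 0.0
for _ in range(60):
    # Random vectors stand in for the encoder model's output frames.
    window.append(rng.standard_normal(WAKE_ENCODE_WIDTH))
    posterior = detect(np.stack(list(window)))
```

In an application, the posterior would typically be compared against a tuned threshold to decide whether the wakeword fired.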
For More Information
- speech processing for machine learning
- hmm keyword spotting
- wuw-sr wakeword detection
- snowboy dnn wakeword detection
- cnn wakeword detection
- crnn wakeword detection
- lstm wakeword detection
- attention for wakeword detection
- raw audio wakeword detection
- wakeword detection power consumption
- google speech commands dataset
- wakeword detector model compression
- wakeword detection on microcontrollers
- semi-supervised keyword spotting
- agc for wakeword detection