Spokestack Speech Pipeline

What is the Speech Pipeline?

The speech pipeline is an extensible audio processing pipeline that includes a variety of built-in speech processors for voice activity detection (VAD), wake word activation, keyword recognition, and automatic speech recognition (ASR).

How Does the Speech Pipeline Work?

The pipeline combines VAD-triggered wake word detection, which uses on-device machine learning models, with speech transcription using ASR. It runs as a soft real-time system, so its components must be as responsive as possible.

When Spokestack detects a wake word, the speech pipeline begins transcribing the user's speech until they stop talking for a pre-set amount of time, or a total activation time limit elapses. The technology for converting spoken words to text is known as ASR.
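The activation logic described above can be sketched as a small state tracker. This is a minimal illustration, not Spokestack's implementation: the class name `ActivationTracker`, its method names, and the timeout values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ActivationTracker:
    """Hypothetical helper that decides when transcription should stop."""

    silence_timeout_ms: int = 500   # assumed value, not a Spokestack default
    max_active_ms: int = 5000       # assumed value, not a Spokestack default
    frame_ms: int = 20              # duration of one audio frame
    active: bool = False
    _active_ms: int = 0             # time elapsed since the wake word fired
    _silent_ms: int = 0             # length of the current run of silence

    def on_wake_word(self) -> None:
        """Open an activation window when the wake word is detected."""
        self.active = True
        self._active_ms = 0
        self._silent_ms = 0

    def on_frame(self, is_speech: bool) -> bool:
        """Consume one frame; return True while ASR should keep transcribing."""
        if not self.active:
            return False
        self._active_ms += self.frame_ms
        # Any speech frame resets the silence counter.
        self._silent_ms = 0 if is_speech else self._silent_ms + self.frame_ms
        timed_out = self._silent_ms >= self.silence_timeout_ms
        limit_hit = self._active_ms >= self.max_active_ms
        if timed_out or limit_hit:
            self.active = False
        return self.active
```

Feeding the tracker one VAD decision per frame keeps the two stop conditions independent: a pause ends the activation early, and the total limit caps it even if the user never stops talking.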

The speech pipeline is the first piece of the puzzle in any voice interaction and is responsible for capturing user audio and translating it into text. Configuring it entails choices about whether or not to use a wake word to activate ASR, what kind of preprocessing to perform on audio before sending it to ASR, and which ASR service to use. These choices can all be made individually or through the use of configuration profiles.
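The relationship between profiles and individual settings can be pictured as a preset that is applied first and then selectively overridden. The sketch below is illustrative only: the profile names, setting keys, and the `build_config` function are hypothetical and do not correspond to Spokestack's actual configuration API.

```python
from typing import Dict

# Hypothetical profile names and settings, for illustration only.
PROFILES: Dict[str, Dict[str, object]] = {
    "wake_word_asr": {"use_wake_word": True, "preprocess": "none", "asr": "on_device"},
    "push_to_talk_asr": {"use_wake_word": False, "preprocess": "none", "asr": "on_device"},
}


def build_config(profile: str, **overrides: object) -> Dict[str, object]:
    """Start from a named profile, then apply individual overrides."""
    if profile not in PROFILES:
        raise KeyError(f"unknown profile: {profile}")
    config = dict(PROFILES[profile])  # copy so the preset is never mutated
    unknown = set(overrides) - set(config)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    config.update(overrides)
    return config


# Use the preset as-is, or change a single choice while keeping the rest.
config = build_config("wake_word_asr", asr="cloud")
```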

Why Should I Use Spokestack's Speech Pipeline?

Voice activity detection enables the pipeline to listen to small segments of audio and determine if speech is present. To keep computation usage low for edge devices, the rest of the pipeline does not proceed if the VAD does not detect speech.
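The gating idea can be sketched in a few lines. Here a simple energy threshold stands in for a real voice activity detector, which would be considerably more robust; the function names and the threshold value are assumptions chosen for the example.

```python
import math
from typing import Callable, Iterable, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def gate_frames(
    frames: Iterable[Sequence[float]],
    downstream: Callable[[Sequence[float]], None],
    threshold: float = 0.02,  # assumed value; tune for the microphone in use
) -> int:
    """Forward only frames that appear to contain speech; return how many passed."""
    passed = 0
    for frame in frames:
        if rms(frame) < threshold:
            continue  # no speech detected: skip the costlier stages
        downstream(frame)
        passed += 1
    return passed
```

Because silent frames never reach `downstream`, the wake word and ASR stages do no work while nobody is speaking.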

A wake word enables the pipeline to listen to speech audio and determine whether a name from a set of recognized names has been spoken. This fulfills two objectives: first, the pipeline listens in on a conversation only when the wake word is spoken; second, the higher-cost ASR component is reserved for transcribing only when the user wants to talk.

Keyword recognition enables your software to listen for multiple brief commands and support variations in phrasing for each of them, using a fast, lightweight model, without user audio leaving the device. Instead of having to recognize and respond to anything that can be said, as a voice assistant does, keyword recognition allows you to simplify processing and act only on the commands users expect.

Automatic speech recognition refers to the process of analyzing and transcribing a chunk of audio without human intervention, producing text that software can process further, either to perform a function or simply to record it. This feature is essential for any voice interaction.

Become a Spokestack Maker and #OwnYourVoice

Access our hosted services for model import, natural language processing, text-to-speech, and wake word.