Voice Activity Detection

Determine whether or not an audio snippet contains human speech.

What is Voice Activity Detection?

The first stage of a voice-activated speech pipeline handles the first problem implied by that name: determining whether a human voice is speaking.

Voice activity detection, or VAD, is the first gatekeeper in a speech detection pipeline.

It's responsible for making an initial determination of whether or not a snippet of audio contains human speech. Ignoring audio that's not detected as speech saves energy and processing power. The savings grow with each downstream processor you have in your speech pipeline.
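To make the savings concrete, here is a minimal sketch of a VAD acting as a gate in front of downstream stages. It is an illustration only, not Spokestack's implementation: `run_pipeline`, `peak_vad`, and `passthrough` are hypothetical names, and the peak-amplitude check stands in for a real detector.

```python
def run_pipeline(frames, vad, stages):
    """Run each frame through every stage only if vad(frame) is true.

    Returns the number of stage invocations, as a stand-in for work done.
    """
    calls = 0
    for frame in frames:
        if not vad(frame):
            continue  # rejected frames cost no downstream work
        for stage in stages:
            frame = stage(frame)
            calls += 1
    return calls


def peak_vad(frame, threshold=0.1):
    """Hypothetical stand-in detector: speech if any sample exceeds threshold."""
    return max(abs(s) for s in frame) > threshold


def passthrough(frame):
    """Placeholder for a real downstream processor (wakeword, ASR, ...)."""
    return frame


frames = [[0.0] * 4, [0.5] * 4, [0.0] * 4]  # silence, "speech", silence
stages = [passthrough, passthrough]

gated = run_pipeline(frames, peak_vad, stages)
ungated = run_pipeline(frames, lambda f: True, stages)
print(gated, ungated)  # prints: 2 6
```

With two stages, the gate avoids four of six stage calls in this toy input. Each additional stage widens that gap by one call per rejected frame.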

How Does VAD Work?

VADs range in complexity from simple frequency analyzers to heavier black-box neural models. The underlying implementation in Spokestack's libraries varies based on tools available for the various platforms, but we try to strike a balance between speed and accuracy. The speech pipeline is a soft real-time system and thus must be as responsive as possible.
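At the simple end of that range, a detector can be little more than a threshold on signal energy. The sketch below is a simpler technique than the frequency analyzers mentioned above and is not what Spokestack's libraries use. The function names, the 0.02 threshold, and the 20 ms frame at 16 kHz are all assumptions chosen for illustration.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame, threshold=0.02):
    """Hypothetical energy VAD: the frame counts as speech if RMS > threshold."""
    return frame_rms(frame) > threshold


SAMPLE_RATE = 16000               # assumed sample rate in Hz
FRAME_LEN = SAMPLE_RATE // 50     # 20 ms frame -> 320 samples

silence = [0.0] * FRAME_LEN
# A 200 Hz tone at half scale stands in for voiced audio.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
        for n in range(FRAME_LEN)]

print(is_speech(silence), is_speech(tone))  # prints: False True
```

An energy threshold is fast enough for a soft real-time loop, but it cannot tell speech from any other loud sound. That is the accuracy cost that frequency-based and neural detectors try to reduce.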

It's worse for a downstream component to miss user speech than to process too much. Our VADs tend to err on the side of producing false positives rather than rejecting actual speech.
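One common way to bias a detector toward false positives is a "hangover": once speech is seen, keep reporting speech for a few more frames so that quiet syllables and short pauses are not cut off. The following is a sketch of that general idea, not a description of Spokestack's VADs. `HangoverVad`, `hangover_frames`, and the peak-based frame check are hypothetical.

```python
class HangoverVad:
    """Wrap a per-frame decision so speech stays active for extra frames."""

    def __init__(self, frame_decision, hangover_frames=10):
        self.frame_decision = frame_decision    # callable: frame -> bool
        self.hangover_frames = hangover_frames  # frames to hold after speech
        self.remaining = 0

    def process(self, frame):
        if self.frame_decision(frame):
            # Fresh speech: reset the hold counter.
            self.remaining = self.hangover_frames
            return True
        if self.remaining > 0:
            # No speech in this frame, but still inside the hold window.
            self.remaining -= 1
            return True
        return False


def loud(frame, threshold=0.1):
    """Hypothetical raw decision: any sample above threshold."""
    return max(abs(s) for s in frame) > threshold


vad = HangoverVad(loud, hangover_frames=2)
frames = [[0.5], [0.0], [0.0], [0.0]]
decisions = [vad.process(f) for f in frames]
print(decisions)  # prints: [True, True, True, False]
```

The two quiet frames after the loud one are passed downstream even though the raw check rejects them. That costs a little extra processing in exchange for a lower chance of dropping real speech.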
