Keyword Recognition

Local on-device keyword spotting: recognize any sound, whether or not it's part of a language.

What is Keyword Recognition?

Instead of having to recognize and respond to anything that can be said, like a voice assistant, why not just act on what your users know your software can do? That’s the insight behind keyword recognition.

Your custom multilingual on-device model recognizes predefined keywords and sends your app a transcript of the trained command it heard. Each command is associated with one or more utterances.

Your software listens for multiple brief commands and supports variations in phrasing for each of them—using a fast, lightweight model—without user audio leaving the device.

Why Should I Use Keyword Recognition?

The main use cases for keyword models are in domains with limited vocabularies or apps that only wish to support specific words or phrases.

The main benefits of choosing a keyword model over traditional ASR are:

Accessible, safe, natural.

  • Only activating your software when it’s directly addressed processes audio as efficiently as possible.
  • Running fully on device (without an internet connection) is fast and consumes little power.
  • Train a model with our no-code AutoSpeech Maker and use it across all our platforms.
  • Rather than listening to all audio, the model only answers “Did I hear one of the keywords you trained me to listen for?” All other sounds are immediately forgotten.
  • Constraining your app’s vocabulary means a lightweight, customized recognition model.

If users are expected to interact using complete sentences or you want to support unanticipated phrasings, a speech recognition component paired with natural language understanding would be a better fit for your use case.

Use Case for Keyword Recognition

Imagine an app designed to control music while running. Classes could be named play and stop; we'll discuss just two for the sake of brevity.

Utterances (variations) for play could include:

play, start, go, music on

Utterances (variations) for stop could include:

stop, quit, pause, music off

If a user says any of the above utterances, your app receives a transcript in which the utterance is normalized to one of your two commands, play or stop, making it easy to map the command to the proper app feature.
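
The mapping in this example can be sketched in a few lines of Swift. This is only an illustration of the idea: the trained model performs the normalization itself, and every name here (keywordClasses, normalize, handle) is hypothetical rather than part of any Spokestack API.

```swift
// Hypothetical names throughout; this is not Spokestack's API.
// Each keyword class lists the utterances that should trigger it.
let keywordClasses: [String: [String]] = [
    "play": ["play", "start", "go", "music on"],
    "stop": ["stop", "quit", "pause", "music off"],
]

// Invert the table once so an utterance can be looked up directly.
let utteranceToKeyword: [String: String] = {
    var table: [String: String] = [:]
    for (keyword, utterances) in keywordClasses {
        for utterance in utterances {
            table[utterance] = keyword
        }
    }
    return table
}()

/// Returns the keyword class for an utterance, or nil if it is outside the vocabulary.
func normalize(_ utterance: String) -> String? {
    return utteranceToKeyword[utterance.lowercased()]
}

/// The app only ever branches on the normalized keyword, never on the raw utterance.
func handle(transcript: String) -> String {
    switch transcript {
    case "play": return "starting playback"
    case "stop": return "stopping playback"
    default: return "ignored"
    }
}
```

Because eight utterances collapse to two keywords, the app's handler needs only two cases no matter how many phrasings are added later.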

How Do Keyword Recognition Models Work?

A keyword recognizer or keyword spotter straddles the line between wake word detection and speech recognition, with the performance of the former and the results of the latter. A keyword model is trained to recognize multiple named classes, each associated with one or more utterances. When the model detects one of these utterances in user speech, it returns as a transcript the name of the keyword class associated with that utterance.

A keyword detector uses machine learning models (like those you create with no code using Spokestack Maker or Spokestack Pro) to constantly analyze input from a microphone for specific sounds. These models work in tandem with a voice activity detector to:

  • Detect human speech
  • Detect whether a preset keyword utterance is spoken
  • Send a transcript event to Spokestack's Speech Pipeline so you can respond

Detection happens entirely on the device the software is running on, without accessing a network or cloud services.
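
As a rough sketch of the three steps above, the flow can be pictured as two gates followed by an event. The types below (AudioFrame, PipelineEvent, process) are hypothetical stand-ins for illustration, and the Boolean and optional fields simply simulate what a voice activity detector and a keyword model would decide for each frame.

```swift
// Hypothetical types for illustration only; not Spokestack's API.
struct AudioFrame {
    let hasSpeech: Bool          // stand-in for the voice activity detector's decision
    let detectedKeyword: String? // stand-in for the keyword model's output on this frame
}

enum PipelineEvent: Equatable {
    case transcript(String)
}

/// Runs frames through the three steps: speech check, keyword check, event emission.
func process(_ frames: [AudioFrame]) -> [PipelineEvent] {
    var events: [PipelineEvent] = []
    for frame in frames {
        // 1. Skip frames without human speech.
        guard frame.hasSpeech else { continue }
        // 2. Skip speech that matches no trained utterance.
        guard let keyword = frame.detectedKeyword else { continue }
        // 3. Emit a transcript event for the app to respond to.
        events.append(.transcript(keyword))
    }
    return events
}
```

In this sketch, everything that fails either gate is dropped on the spot, so only the keyword name is ever passed on.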

The technical term for what a keyword recognition model does is multiclass classification. Each keyword is a class label, and the utterances associated with that class are its instances. During training, the model receives multiple instances of the keyword classes and multiple words and phrases that don't fit into any of the classes, and it learns to tell the difference.
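
To make the classification step concrete, here is a minimal sketch of picking a class from per-class scores. The page does not say how the model represents "none of the above" or whether it applies a confidence threshold, so the reserved index 0 and the threshold parameter are assumptions made for illustration.

```swift
/// Picks the most likely keyword class from per-class scores.
/// Assumption: index 0 is a catch-all class for audio that matches no keyword.
/// Assumption: a winning score below `threshold` is treated as no detection.
func classify(scores: [Double], labels: [String], threshold: Double) -> String? {
    precondition(!scores.isEmpty && scores.count == labels.count,
                 "scores and labels must be non-empty parallel arrays")
    // Find the index of the highest score (argmax).
    var best = 0
    for index in scores.indices where scores[index] > scores[best] {
        best = index
    }
    // Reject the catch-all class and low-confidence winners.
    if best == 0 || scores[best] < threshold {
        return nil
    }
    return labels[best]
}
```

With labels ["none", "play", "stop"], a score vector dominated by the second entry yields play, while one dominated by the first entry yields no transcript at all.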

This probably sounds similar to the training process for a wake word model, and that's because it is: Spokestack's wake word and keyword recognition models are very similar, with small differences at the very end to allow the keyword model to detect multiple classes and return the label of the class that was detected.

At runtime, they both consist of three separate models:
  1. One for filtering incoming audio to retain only certain frequency components
  2. One for encoding the filtered representation into a format conducive to classification
  3. One for detecting target words or phrases
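
The three-model chain above can be sketched as a simple composition, where each stage consumes the previous stage's output. The struct and the toy closures below are hypothetical placeholders that show only the data flow; they are not real models and make no claim about the tensor shapes Spokestack uses.

```swift
// Hypothetical stand-in for the three-model chain; the closures are toy
// placeholders, not real models.
struct KeywordRecognizer {
    let filter: ([Float]) -> [Float]  // audio samples -> retained frequency components
    let encode: ([Float]) -> [Float]  // filtered features -> encoded representation
    let detect: ([Float]) -> String?  // encoding -> keyword class, if any

    /// Runs the stages in order: filter, then encode, then detect.
    func recognize(_ samples: [Float]) -> String? {
        return detect(encode(filter(samples)))
    }
}

/// Builds a toy recognizer that reports "play" when the mean magnitude is high.
func makeToyRecognizer() -> KeywordRecognizer {
    return KeywordRecognizer(
        filter: { samples in samples.map { abs($0) } },
        encode: { features in [features.reduce(0, +) / Float(max(features.count, 1))] },
        detect: { encoding in encoding[0] > 0.5 ? "play" : nil }
    )
}
```

Keeping the stages separate is what lets wake word and keyword models share the same structure and differ only in the final detection step.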

Create a Custom Keyword Model

Creating a personal keyword model is straightforward using Spokestack Maker or Spokestack Pro, a microphone, and a quiet room.

Spokestack's personal keywords use few-shot transfer learning, which allows a small amount of data to produce a neural model with an accuracy level suitable for personal, hobby, or exploratory projects. Personal models will respond to the voice (or voices) used in the data you submit.

1 Create a Keyword Recognition Model

First, head to the keyword builder and click Create model in the top right. A section for a new model will appear. Change the model's name.

2 Add and Record Utterances

Then, look for the Keywords section. This is where you'll add the words or short phrases that make up your recognizer's vocabulary. Use Add Keyword to compose your list; for each keyword you add, follow this process:

  1. Set the keyword text.
  2. Click the arrow to the right of the keyword to view utterances.
  3. Use Add Utterance to add new utterances to the selected keyword.
  4. Click the arrow to the right of an utterance to view samples.
  5. Click Record at the bottom of the box to add new samples.

At least three samples per utterance are required to train a model, but the more samples, the better. If you want the model to respond to anyone other than you, collect samples using more than one voice (remember, this is a personal keyword model, not a universal one).
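
If you track your recordings yourself before training, the three-sample minimum is easy to check. The sketch below assumes a hypothetical dictionary of utterance text to recorded sample count; it is not part of the keyword builder.

```swift
// Minimum stated above: at least three samples per utterance.
let minimumSamplesPerUtterance = 3

/// Returns the utterances that still need more recordings, sorted alphabetically.
/// `sampleCounts` is a hypothetical map of utterance text to recorded sample count.
func utterancesNeedingSamples(_ sampleCounts: [String: Int]) -> [String] {
    return sampleCounts
        .filter { $0.value < minimumSamplesPerUtterance }
        .map { $0.key }
        .sorted()
}
```

An empty result means every utterance meets the minimum, although more samples will still help.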

Note the extra steps here compared to the process for creating a wake word model. This reflects the difference between the two types of model.

When you create a keyword recognizer, the keywords in your list are the only text your app will ever see. Each of those keywords, sometimes referred to as keyword classes in technical documentation, can be thought of as its own miniature wake word model, in that it can have different utterances that trigger it. This is why you have to add both a keyword and an utterance before you can begin recording samples: the keyword establishes the text you want returned to your app as a transcript, and the utterance represents the spoken text mapped to that keyword.

Each keyword can have one utterance that simply matches the keyword's name (or doesn't, if you want to change the formatting/spelling of some word before your app sees it), or several that should all be normalized to the same text before your app sees it. The keyword name itself has no correlation to the audio meant to trigger it.

3 Train Your Model

When you've added as many different keywords and utterances as you want and recorded all your samples, click Train. That's all there is to it! In a few minutes you'll be able to download and use your very own keyword model. You can retrain as many times as you like, adding or deleting keywords, utterances, and samples as necessary.

How Do I Use a Keyword Recognition Model?

For mobile apps, integrate Spokestack Tray, a drop-in UI widget that manages voice interactions and delivers actionable user commands with just a few lines of code.

  • Sample code and tutorials
  • Complete API and SDK documentation
  • Low-code integrations on popular platforms including Hugging Face and Google Assistant

Supported platforms: iOS, Android, React Native, Node, Python

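import Spokestack

// iOS (Swift) excerpt: only the keyword-related builder properties are shown.
// Point the pipeline at the three keyword model files (filter, encode, detect)
// and at the metadata file that accompanies them.
let pipeline = SpeechPipelineBuilder()
    .setProperty("tracing", Trace.Level.PERF)
    .setProperty("keywordDetectModelPath", "detect.tflite")
    .setProperty("keywordEncodeModelPath", "encode.tflite")
    .setProperty("keywordFilterModelPath", "filter.tflite")
    .setProperty("keywordMetadataPath", "metadata.json")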

Try a Keyword in Your Browser

Test a keyword model by pressing “Start test,” then saying any digit between zero and nine. Wait a few seconds for results. This browser tester is experimental.

Say any of the following utterances when testing:

zero, one, two, three, four, five, six, seven, eight, nine

Full-Featured Platform SDK

Our native iOS library is written in Swift and makes setup a breeze.

Become a Spokestack Maker and #OwnYourVoice

Access our hosted services for model import, natural language processing, text-to-speech, and wake word.