ASR configuration

The technology for converting spoken words to text is known as automatic speech recognition (ASR).

ASR refers to the process of analyzing and transcribing a chunk of audio without human intervention. This technology is ubiquitous, with a place in the stack of every major voice assistant on the market.

In voice software powered by Spokestack, ASR is part of the speech pipeline. To change ASR providers from the default, you’ll want to either set a pipeline profile at configuration time or directly configure your speech pipeline’s stages. See the “Getting Started” guide for your platform for information about pipeline profiles, or the speech configuration introduction for an explanation of pipeline stages.

Many different techniques have been used to accomplish this throughout ASR’s long history, but modern models use — what else? — neural networks. The size and performance characteristics of these models vary widely, based on where they’re designed to be deployed and their intended use cases. Technology has advanced to the point where models small enough to fit on a mobile device and run in almost real time are accurate enough to use for many tasks, but models that run in the cloud are still widely used for their speed and relatively higher accuracy.

Accuracy is often measured in Word Error Rate (WER): the percentage of words that differ between the ASR system’s transcript and a gold-standard reference. Getting an accurate WER measurement entails juggling many variables (accent, background noise, whether the audio is a single speaker or a multi-party conversation, etc.), and reported numbers are easy to spin. There are, however, various academic comparisons of major vendors. For perspective, human WER hovers somewhere between 4-11%, depending on variables like those mentioned above, and usually falls on the lower end of that range.
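To make the metric concrete (this is illustrative code, not part of any Spokestack library), WER can be computed as the word-level edit distance — substitutions, deletions, and insertions combined — between a reference transcript and the ASR hypothesis, divided by the number of words in the reference. A minimal sketch:

```java
public class Wer {
    // Word error rate: word-level Levenshtein distance between the
    // reference and hypothesis, divided by the reference word count.
    static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().split("\\s+");
        String[] hyp = hypothesis.trim().split("\\s+");
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i; // all deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j; // all insertions
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        // one substitution in a five-word reference
        System.out.println(wer("turn on the kitchen lights",
                               "turn on the kitchen light")); // prints 0.2
    }
}
```

Note that a single metric like this hides exactly the variables mentioned above; two systems with identical WER on one test set can behave very differently on accented or noisy audio.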

The Spokestack open-source native libraries provide a convenient API across multiple ASR providers such as Apple, Google, and Microsoft. Spokestack is designed to support multiple speech recognition providers so you can decide which is right for your use case. Support varies by mobile platform, however, so we decided to gather the information in one place to make the choice as easy as possible for your app.

Supported ASR Providers by Platform

Provider                  Android   iOS
Android ASR (on-device)      ✓        —
Apple ASR (on-device)        —        ✓
Spokestack Cloud ASR         ✓        ✓
Azure Speech Services        ✓        —
Google Cloud                 ✓        —

Configuration

ASR providers require various pieces of configuration, usually API keys, but sometimes runtime components as well. This configuration takes place when you first build a Spokestack SpeechPipeline (or, in newer versions, Spokestack object). Below is the configuration needed for each provider on each platform, along with some usage notes.

For Android, primitive configuration properties are set via a call to setProperty(propertyName, value) on the speech pipeline’s builder (or a SpeechConfig object supplied to it); in iOS, they’re set as fields of a SpeechConfiguration object.
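As a sketch of the Android side of that pattern (the builder and setProperty call follow the description above; treat the property name and value as placeholders rather than required settings):

```java
import io.spokestack.spokestack.SpeechPipeline;

// Primitive configuration properties are set on the pipeline builder
// before build(). "locale" is shown only as an example property.
SpeechPipeline pipeline = new SpeechPipeline.Builder()
        .setProperty("locale", "en-US")
        .build();
```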


Android ASR

Android

No API keys or configuration properties are required, but a Context (android.content.Context) object must be added to the SpeechPipeline’s builder via the setAndroidContext() method. See the javadoc for AndroidSpeechRecognizer for more information.
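Assuming the builder API described above, wiring in the Android context might look like the following sketch (called from inside an Activity or other Context-bearing class; this is not a complete pipeline setup):

```java
import io.spokestack.spokestack.SpeechPipeline;

// Android ASR needs a Context supplied via setAndroidContext();
// no API keys or other properties are required.
SpeechPipeline pipeline = new SpeechPipeline.Builder()
        .setAndroidContext(getApplicationContext())
        .build();
```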

Device compatibility

Android’s native ASR support is device-dependent. For production apps targeting broad compatibility, we recommend testing for its availability by calling SpeechRecognizer.isRecognitionAvailable() and having a fallback option in place in case it returns false.
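That availability check uses a static method on Android’s SpeechRecognizer class and takes a Context; a minimal sketch:

```java
import android.content.Context;
import android.speech.SpeechRecognizer;

// Check native ASR availability before relying on it; fall back to a
// cloud provider (or disable voice input) when the check fails.
void configureAsr(Context context) {
    if (SpeechRecognizer.isRecognitionAvailable(context)) {
        // safe to use Android ASR on this device
    } else {
        // configure a fallback ASR provider instead
    }
}
```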

This chart lists physical devices on which it has been tested by either the Spokestack team or our community. If you have a device that is not listed, please try it out and submit a PR with your results!

Device                    API Level   ASR working?
Moto X (2nd Gen)             22           ✗ *
Lenovo TB-X340F tablet       27           ✓
Pixel 1                      29           ✓
Pixel 3 XL                   29           ✓
Pixel 3a                     29           ✓
Pixel 4                      29           ✓

* ASR fails consistently with a SERVER_ERROR, which seems to indicate that the server used by the device manufacturer to handle these requests is no longer operational.

iOS

N/A


Apple ASR

Android

N/A

iOS

None required! 


Spokestack Cloud ASR

Spokestack’s Cloud ASR requires requests to be signed with a Spokestack client ID and API secret. Spokestack accounts are free, and cloud-based ASR currently is as well. If you don’t already have an account, you can sign up for one here; if you do, log in to get your credentials.

Android
  • spokestack-id (string): A Spokestack client ID, available in the account portal.
  • spokestack-secret (string): A Spokestack API secret, also available in the account portal.
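These two properties can be supplied through the builder’s setProperty calls described earlier; a minimal sketch, with placeholder credential strings standing in for the values from your account portal:

```java
import io.spokestack.spokestack.SpeechPipeline;

// Spokestack Cloud ASR credentials; replace the placeholders with
// your client ID and API secret from the account portal.
SpeechPipeline pipeline = new SpeechPipeline.Builder()
        .setProperty("spokestack-id", "YOUR_CLIENT_ID")
        .setProperty("spokestack-secret", "YOUR_API_SECRET")
        .build();
```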
iOS
  • spokestack-id (string): A Spokestack client ID, available in the account portal.
  • spokestack-secret (string): A Spokestack API secret, also available in the account portal.

Azure Speech Services

Android

You’ll need the following dependency in your app’s build.gradle:

  implementation 'com.microsoft.cognitiveservices.speech:client-sdk:1.9.0'

This will require you to add Microsoft’s Maven repository to your top-level build.gradle, which implies acceptance of their license terms:

repositories {
  // ...
  maven { url 'https://csspeechstorage.blob.core.windows.net/maven/' }
}
iOS

N/A (for now)


Google Cloud

Android
  • google-credentials (string): A JSON-serialized string containing Google account credentials. See Google’s documentation for more information.
  • locale (string): A BCP-47 language identifier to identify the language that should be used for speech recognition (example: “en-US”). See Google’s documentation for a list of supported codes.
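Both properties can be passed via the builder’s setProperty calls described earlier; a sketch, assuming the service-account JSON has already been read into a string (the variable name is a placeholder):

```java
import io.spokestack.spokestack.SpeechPipeline;

// Google Cloud ASR configuration: a JSON-serialized credentials
// string plus a BCP-47 locale for recognition.
String googleCredentialsJson = "..."; // contents of your service account key file
SpeechPipeline pipeline = new SpeechPipeline.Builder()
        .setProperty("google-credentials", googleCredentialsJson)
        .setProperty("locale", "en-US")
        .build();
```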

You’ll also need the following dependencies in your app’s build.gradle:

  implementation 'com.google.cloud:google-cloud-speech:1.22.2'
  implementation 'io.grpc:grpc-okhttp:1.28.0'
iOS

N/A (for now)
