Automatic Speech Recognition

Auphonic has built a layer on top of a few external Speech Recognition Services:
Our classifiers generate metadata during the analysis of an audio signal (music segments, silence, multiple speakers, etc.) to divide the audio file into small and meaningful segments, which are then processed by the speech recognition engine. The external speech services support multiple languages and return text results for all audio segments.
The results from all segments are then combined, and meaningful timestamps, simple punctuation and structuring are added to the resulting text.

This is especially interesting when used with our Multitrack Algorithms, where all speakers have been recorded on separate tracks:
In Multitrack Speech Recognition, all the tracks are processed by the speech recognition service individually, so there is no crosstalk or spill and the recognition is much more accurate, as it is only dealing with one speaker at a time.
This also means that we can show individual speaker names in the transcript output file and audio player because we know exactly who is saying what at any given time.

Automatic speech recognition is most useful for making audio searchable: although automatically generated transcripts won’t be perfect and might be difficult to read (spoken text is very different from written text), they are very valuable if you are trying to find a specific topic within a one-hour audio file or the exact time of a quote in an audio archive.


Our WebVTT-based audio player with search in speech recognition transcripts and exact speaker names of a Multitrack Production.

How to use Speech Recognition within Auphonic

To activate speech recognition within Auphonic, you first have to connect your Auphonic account to an external speech recognition service at the External Services page.
Once this is done, you will have the option to choose whichever service you signed up for in the section Speech Recognition when creating a new production:


The results are combined with our other Audio Algorithms and used to generate different output formats, depending on your selections.

Integrated Speech Recognition Services

At the moment we support two external speech recognition services: Wit.ai and the Google Cloud Speech API.
We will add additional services if and when we find services that offer improved cost benefits or better final results.

Wit.ai

  • Wit.ai, owned by Facebook, provides an online natural language processing platform, which also includes speech recognition.
  • Wit.ai is free, including for commercial use. See its FAQ and Terms.
  • It supports many languages, but you have to create a separate service for each language!

Google Cloud Speech API

  • Google Cloud Speech API is the speech-to-text engine developed by Google and supports over 80 languages.
  • 60 minutes of audio per month are free, for more see Pricing (about $1.5/h).
  • It is possible to add keywords to improve speech recognition accuracy for specific words and phrases or to add additional words to the vocabulary. For details see word and phrase hints.

The outstanding feature of the Google Speech API is the possibility to add keywords, which can improve recognition quality a lot! We automatically send words and phrases from your metadata (title, artist, chapters, track names, tags, etc.) to Google, and you can add additional keywords manually (see screenshot above).
This provides context for the recognizer and allows the recognition of non-famous names (e.g. the podcast host) or out-of-vocabulary words.
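As a sketch of how such keyword hints reach the recognizer, here is how a request body for the Google Cloud Speech API REST endpoint (`speech:recognize`) could be assembled in Python. The `speechContexts`/`phrases` fields are real fields of the API's `RecognitionConfig`; the helper name, the metadata fields, and the way phrases are collected are assumptions for illustration, not Auphonic's actual implementation:

```python
# Sketch: build a Google Cloud Speech "speech:recognize" request body that
# passes metadata-derived keywords as phrase hints. The helper name and the
# metadata fields below are illustrative assumptions; "speechContexts" and
# "phrases" are real fields of the v1 RecognitionConfig.

def build_recognize_request(audio_uri, language_code, metadata, extra_keywords=()):
    """Assemble a speech:recognize request body with phrase hints."""
    # Collect hint phrases from production metadata (title, chapters, tags, ...).
    phrases = []
    for field in ("title", "artist", "tags", "chapters", "track_names"):
        value = metadata.get(field, [])
        if isinstance(value, str):
            value = [value]
        phrases.extend(value)
    phrases.extend(extra_keywords)  # manually added keywords

    return {
        "config": {
            "languageCode": language_code,
            "speechContexts": [{"phrases": phrases}],
        },
        "audio": {"uri": audio_uri},
    }

request = build_recognize_request(
    "gs://my-bucket/episode42.flac",  # hypothetical audio location
    "en-US",
    {"title": "Auphonic Podcast", "tags": ["speech recognition", "WebVTT"]},
    extra_keywords=["Auphonic"],
)
```

Every phrase in the list nudges the recognizer toward that vocabulary, which is why sending titles, chapter names, and tags along with the audio helps with names it would otherwise never guess.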

Auphonic Output Formats

Auphonic produces three output formats from speech recognition results: an HTML transcript file (readable by humans), a JSON or XML file with all data (readable by machines), and a WebVTT subtitles/captions file as an exchange format between systems.

HTML Transcript File

Examples: EN Singletrack, EN Multitrack, DE Singletrack

The HTML output file contains the transcribed text with a timestamp for each new paragraph; hovering the mouse shows the time of each text segment, and speaker names are displayed for multitrack productions. Sections are automatically generated from chapter marks, and the HTML file includes the audio metadata as well.

The transcription text can be copied into WordPress or other content management systems, so you can search within the transcript and find the corresponding timestamps (if you don’t have an audio player which supports search in WebVTT/transcripts).

WebVTT File

Examples: EN Singletrack, EN Multitrack, DE Singletrack

WebVTT is the open specification for subtitles, captions, chapters, etc. A WebVTT file can be added as a track element within the HTML5 audio/video element; for an introduction see Getting started with the HTML5 track element.
It is supported by all major browsers, and many other systems use it already (screen readers, (web) audio players with WebVTT display and search, software libraries, etc.).
It is possible to add other time-based metadata as well: besides the transcription text, cues can carry speaker names, styling, or any other custom data such as GPS coordinates.
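As a sketch of what such a file looks like, here is a minimal WebVTT fragment that uses the spec’s voice spans (`<v Name>`) to carry speaker names per cue; the names, timings, and text are invented for illustration:

```text
WEBVTT

00:00:00.000 --> 00:00:04.000
<v Alice>Welcome to the show.

00:00:04.000 --> 00:00:09.000
<v Bob>Today we talk about speech recognition.
```

A file like this, referenced from a `<track>` element inside an HTML5 `<audio>` or `<video>` element, is all a browser or player needs to display the transcript in sync with playback.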

Search engines could parse WebVTT files referenced in audio/video tags, as the format is well defined; audio and video would then become searchable.
It is also possible to link to an external WebVTT file in an RSS feed; podcast players and other feed-based systems could then parse the transcript as well (for details see this discussion).

WebVTT is therefore a great exchange format between different systems: audio players, speech recognition systems, human transcriptions, feeds, search engines, CMS, etc.
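The search use case above can be sketched in a few lines: a minimal parser that scans a WebVTT file for a term and returns the start timestamps of matching cues. The cue layout assumed here is the basic one from the spec (a timing line followed by text lines); a robust parser would also have to handle NOTE blocks, cue settings, and styling:

```python
import re

# Minimal sketch: find the start timestamps of WebVTT cues whose text
# contains a search term. Assumes simple cues (timing line followed by
# text lines); a production parser should also handle NOTE blocks, cue
# settings, and styling as defined by the WebVTT spec.

TIMING = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")

def search_vtt(vtt_text, term):
    """Return the start times of cues containing `term` (case-insensitive)."""
    hits, start = [], None
    for line in vtt_text.splitlines():
        m = TIMING.match(line.strip())
        if m:
            start = m.group(1)   # remember the current cue's start time
        elif start and term.lower() in line.lower():
            hits.append(start)
            start = None         # one hit per cue is enough

    return hits

vtt = """WEBVTT

00:00:00.000 --> 00:00:04.000
<v Alice>Welcome to the show.

00:00:04.000 --> 00:00:09.000
<v Bob>Today we talk about speech recognition.
"""

print(search_vtt(vtt, "speech recognition"))  # → ['00:00:04.000']
```

Because the format is this easy to parse, the same few lines work whether the WebVTT file came from a speech recognition engine, a human transcriber, or a feed.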

JSON/XML Output File

Examples: EN Singletrack, EN Multitrack, DE Singletrack

This file contains all the speech recognition details in JSON or XML format: the text, timestamps, confidence values (how confident the engine is in each recognized segment) and paragraphs.
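One thing the machine-readable output makes easy is post-processing, e.g. flagging segments the engine was unsure about for manual review. The snippet below sketches this; the field names (`segments`, `start`, `text`, `confidence`) are illustrative assumptions, not Auphonic's documented schema, so check the real JSON output of your production for the exact structure:

```python
import json

# Sketch: list low-confidence segments from a speech recognition result
# file for manual review. The field names used here are illustrative
# assumptions, not Auphonic's documented JSON schema.

def low_confidence_segments(result_json, threshold=0.7):
    """Return (start, text) pairs whose confidence falls below `threshold`."""
    data = json.loads(result_json)
    return [
        (seg["start"], seg["text"])
        for seg in data["segments"]
        if seg["confidence"] < threshold
    ]

example = json.dumps({
    "segments": [
        {"start": 0.0, "text": "Welcome to the show.", "confidence": 0.95},
        {"start": 4.2, "text": "Auphonic", "confidence": 0.55},
    ]
})

print(low_confidence_segments(example))  # → [(4.2, 'Auphonic')]
```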

Tips to Improve Speech Recognition Accuracy

Audio quality is important
  • Reverberant audio is quite a problem; put the microphone as close to the speaker as possible.
  • Try to avoid background sounds and noises during recording.
  • Don’t use background music either (unless you use our multitrack version).
  • Use only fast and stable Skype/Hangout connections.
Speak clearly
  • Pronunciation and grammar are important.
  • Dialects are more difficult to understand, use the correct language variant if available (e.g. English-UK vs. English-US).
  • Don’t interrupt other speakers. This has a huge impact on the accuracy!
  • Don’t mix languages.
Use a lot of metadata
  • This is a big help for the Google Speech API.
  • Metadata and keywords make it much easier to recognize special names, terms and out-of-vocabulary words.
  • As always, accurate metadata is important!
Use our multitrack version
  • If you record a separate track for each speaker, use our multitrack speech recognition.
  • This will lead to better results and more accurate information on the timing of each speaker.
  • Background music/sounds should be put into a separate track so that they do not interfere with the speech recognition.

Speech Recognition Examples

Please see examples in English and German at:

And take a look at the blog post:
Make Podcasts Searchable

All features demonstrated in these examples also work in over 80 languages, although the recognition quality might vary.