Automatic Speech Recognition

Auphonic has built a layer on top of Automatic Speech Recognition Services:
Our classifiers generate metadata during the analysis of an audio signal (music segments, silence, multiple speakers, etc.) to divide the audio file into small and meaningful segments, which are then processed by the speech recognition engine. The speech recognition services support multiple languages and return text results for all audio segments.
The results from all segments are then combined, and meaningful timestamps, simple punctuation and structuring are added to the resulting text.

This is especially interesting when used with our Multitrack Algorithms, where all speakers have been recorded on separate tracks:
In Multitrack Speech Recognition, all the tracks are processed by the speech recognition service individually, so there is no crosstalk or spill and the recognition is much more accurate, as it is only dealing with one speaker at a time.
This also means that we can show individual speaker names in the transcript output file and audio player because we know exactly who is saying what at any given time.

Automatic Speech Recognition (ASR) is most useful for making audio searchable: although automatically generated transcripts won’t be perfect and might be difficult to read (spoken text is very different from written text), they are very valuable if you are trying to find a specific topic within a one-hour audio file or the exact time of a quote in an audio archive.

We also include a complete Transcript Editor directly in our HTML output file, which displays word confidence values to instantly see which sections should be checked manually, supports direct audio playback, HTML/PDF/WebVTT export and allows you to share the editor with someone else for further editing.

[Image: Our WebVTT-based audio player with search in speech recognition transcripts and exact speaker names of a Multitrack Production.]

How to use Speech Recognition within Auphonic

To activate speech recognition within Auphonic, you can either select our self-hosted Auphonic Whisper ASR service or one of the integrated external services in the section Speech Recognition when creating a new Production or Preset.
If you want to use an external service, you first have to connect your Auphonic account to an external speech recognition service on the External Services page. The connection process differs from provider to provider, so please see our step-by-step tutorials. Once that is done, you will have the option to choose whichever service you signed up for.
For our self-hosted service, you can skip the connection process and directly select Auphonic Whisper ASR as service:

[Image: The Speech Recognition section in the Auphonic production form.]


The speech recognition transcripts are combined with the results of our other Audio Algorithms to generate different Output Formats, depending on your selections.
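
If you create productions programmatically, speech recognition can also be enabled through the Auphonic API. The following is a minimal sketch, not an authoritative reference: the exact field names of the speech recognition settings (the "speech_recognition" block below) are assumptions, so please check the Auphonic API documentation before use.

    # Minimal sketch: create a production with speech recognition enabled via
    # the Auphonic REST API. The "speech_recognition" field names below are
    # assumptions -- check the Auphonic API documentation for the exact schema.
    import requests

    payload = {
        "metadata": {"title": "My Podcast Episode 1"},
        # hypothetical settings block: UUID of a connected external service
        # (from your External Services page) or of Auphonic Whisper ASR
        "speech_recognition": {"uuid": "YOUR_SERVICE_UUID", "language": "en"},
        "output_files": [{"format": "mp3"}],
    }

    response = requests.post(
        "https://auphonic.com/api/productions.json",
        json=payload,
        auth=("YOUR_USERNAME", "YOUR_PASSWORD"),  # HTTP basic auth
    )
    response.raise_for_status()
    print(response.json()["data"]["uuid"])  # UUID of the newly created production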

Auphonic Whisper ASR

Using OpenAI’s open-source model Whisper, we offer a self-hosted automatic speech recognition (ASR) service.
For an overview and comparison to our integrated external ASR services, please see our Services Comparison Table.
Most important facts about Auphonic Whisper ASR:

Price and Supported languages

Whisper supports transcriptions in about 100 languages, which you can integrate into your Auphonic audio post-production workflow without creating an external account and at no additional cost.
Whisper also provides reliable automatic language detection, except in the case of strong accents.

Timestamps, Confidence Values and Punctuation

Whisper returns timestamps and confidence values that allow you to locate specific phrases in your recording. Combined with our Transcript Editor, you can easily find sections that should be checked manually.

ASR Speed and Quality

By using Auphonic Whisper ASR, your data does not have to leave our Auphonic servers for speech recognition processing, which speeds up ASR processing. The quality of Whisper transcripts is fully comparable to the “best” services in our comparison table.

Besides the fact that Whisper is integrated into the Auphonic web service by default, so you need no external account and incur no extra costs, the outstanding feature of Whisper is its automatic language detection.
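
As Whisper is open source, you can also try the underlying model yourself, outside of Auphonic. Here is a small sketch using the openai-whisper Python package; it shows the raw model only, not our hosted pipeline with audio segmentation, punctuation, and formatting:

    # Sketch: the raw open-source Whisper model (pip install openai-whisper).
    # Auphonic's hosted pipeline adds segmentation, punctuation and formatting
    # on top of this.
    import whisper

    model = whisper.load_model("small")  # larger models: slower but more accurate

    # Omitting the "language" argument triggers automatic language detection.
    result = model.transcribe("episode.mp3")

    print("Detected language:", result["language"])
    for segment in result["segments"]:
        # each segment carries start/end timestamps in seconds plus the text
        print(f"[{segment['start']:7.2f}-{segment['end']:7.2f}] {segment['text']}")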

Integrated Speech Recognition Services

Besides our self-hosted Auphonic Whisper ASR (automatic speech recognition) service, we also support the following integrated external ASR services: Wit.ai, Google Cloud Speech API, Amazon Transcribe and Speechmatics.
For an overview and comparison, please see our Services Comparison Table.

Wit.ai

  • Wit.ai, owned by Facebook, provides an online natural language processing platform, which also includes speech recognition.

  • Wit is free, including for commercial use. See FAQ and Terms.

  • It supports many languages, but you have to create a separate Wit.ai service for every language you use!

Google Cloud Speech API

  • Google Cloud Speech API is the speech-to-text engine developed by Google and supports over 100 languages.

  • 60 minutes of audio per month and model are free; for more, see Pricing (about $1.50/h for the default model).

  • For English (en-US), we use the more expensive Enhanced Model (about $2.20/h), which gives much better results.

  • In your Google Cloud account, you can optionally give Google permission to apply Data Logging, which reduces your costs to ~$1/h for the default and ~$1.50/h for the enhanced model.

  • It is possible to add keywords to improve speech recognition accuracy for specific words and phrases or to add additional words to the vocabulary. For details see Word and Phrase Hints.

A great feature of the Google Speech API is the possibility to add keywords (also available in Amazon Transcribe and Speechmatics), which can improve the recognition quality a lot! We automatically send words and phrases from your metadata (title, artist, chapters, track names, tags, etc.) to Google, and you can add additional keywords manually (see screenshot above).
This provides context for the recognizer and allows the recognition of non-famous names (e.g. the podcast host) or out-of-vocabulary words.
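
For reference, this is roughly what phrase hints look like when calling the Google Cloud Speech API directly with its Python client library; within Auphonic you only enter keywords in the web form and we handle the rest (the bucket path and phrases below are placeholders):

    # Sketch: phrase hints ("speech contexts") in the Google Cloud Speech-to-Text
    # Python client (pip install google-cloud-speech). Placeholders throughout.
    from google.cloud import speech

    client = speech.SpeechClient()

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code="en-US",
        # bias recognition towards names and out-of-vocabulary words
        speech_contexts=[speech.SpeechContext(phrases=["Auphonic", "Jane Doe"])],
    )
    audio = speech.RecognitionAudio(uri="gs://my-bucket/episode.flac")

    # long-running recognition for audio files longer than one minute
    operation = client.long_running_recognize(config=config, audio=audio)
    for result in operation.result(timeout=3600).results:
        print(result.alternatives[0].transcript)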

Amazon Transcribe

Amazon Transcribe offers accurate transcriptions in many languages at low costs, including keywords, word confidence, timestamps, and punctuation.

Pricing

The free tier offers 60 minutes of free usage a month for 12 months. After that, it is billed monthly at a rate of $0.0004 per second ($1.44/h). More information is available at Amazon Transcribe Pricing.

Custom Vocabulary (Keywords) Support

Custom Vocabulary (called Keywords in Auphonic) gives you the ability to expand and customize the speech recognition vocabulary for your specific use case (e.g. product names, domain-specific terminology, or names of individuals).
The same feature is also available in the Google Cloud Speech API and in Speechmatics.
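
For illustration, this is how a custom vocabulary is created and used when talking to Amazon Transcribe directly via boto3; within Auphonic you just enter your keywords in the production form (the names and S3 paths below are placeholders):

    # Sketch: custom vocabulary with Amazon Transcribe via boto3
    # (pip install boto3). Names and S3 paths are placeholders.
    import boto3

    transcribe = boto3.client("transcribe", region_name="us-east-1")

    # 1. create a custom vocabulary with domain-specific terms
    transcribe.create_vocabulary(
        VocabularyName="my-podcast-vocabulary",
        LanguageCode="en-US",
        Phrases=["Auphonic", "Podlove", "WebVTT"],
    )

    # 2. reference the vocabulary when starting a transcription job
    transcribe.start_transcription_job(
        TranscriptionJobName="episode-001",
        Media={"MediaFileUri": "s3://my-bucket/episode.mp3"},
        MediaFormat="mp3",
        LanguageCode="en-US",
        Settings={"VocabularyName": "my-podcast-vocabulary"},
    )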

Timestamps, Word Confidence, and Punctuation

Amazon Transcribe returns a timestamp and confidence value for each word so that you can easily locate the audio in the original recording by searching for the text.
It also adds some punctuation, which is combined with our own punctuation and formatting automatically.

The high quality, especially in combination with keywords, and low costs of Amazon Transcribe make the service very attractive.
However, the processing time of Amazon Transcribe is much slower compared to all our other integrated services!

Speechmatics

Speechmatics offers accurate transcriptions in many languages including word confidence values, timestamps, punctuation and custom dictionary (called Keywords in Auphonic).

Languages and Keywords

Speechmatics supports all major European and American languages as well as some Asian languages, and takes a global language approach. That means various accents and dialects, like Indian English or Austrian German, are recognized very well, even though you cannot explicitly select a specific language region.
The Custom Dictionary feature (called Keywords in Auphonic), which further improves the results, is also available in the Speechmatics API.

Model Accuracy

Speechmatics offers two accuracy levels, which you choose when you create the service within Auphonic for the first time. The Standard Model works much faster than the Enhanced Model and costs less, while still being in the same accuracy range as the other services.
For the Enhanced Model you have to be more patient, as processing takes nearly as long as with Amazon Transcribe, but the quality of the results is correspondingly high. In terms of punctuation and small details of pronunciation, the Speechmatics Enhanced Model is exceptionally good.
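
In the Speechmatics batch API, the model level and the custom dictionary correspond to a job configuration roughly like the following sketch (Auphonic configures this for you; the vocabulary entries are placeholders):

    {
      "type": "transcription",
      "transcription_config": {
        "language": "en",
        "operating_point": "enhanced",
        "additional_vocab": [
          {"content": "Auphonic"},
          {"content": "Podlove", "sounds_like": ["pod love"]}
        ]
      }
    }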

Note

If you want to use both the Standard and the Enhanced Model of Speechmatics from time to time, you need to create two separate services (one for each model) in your Auphonic account!

Timestamps, Word Confidence, and Punctuation

Like Amazon Transcribe, Speechmatics creates timestamps, word confidence values, and punctuation for every single word.

Pricing

Speechmatics offers 4 hours of free speech recognition per month (2h Standard Model plus 2h Enhanced Model). Once you exceed these 4 hours, however, Speechmatics costs about the same as the Google Cloud Speech API (or less).
Pricing starts at $1.25 per hour of audio for the Standard Model, up to $1.90/h for the Enhanced Model. They offer significant discounts for users requiring higher volumes. If you process a lot of content, you should contact them directly at hello@speechmatics.com and say that you wish to use the service with Auphonic.
More information is available at Speechmatics Pricing.

Services Comparison Table

Auphonic Speech Recognition Services Comparison

Auphonic Whisper
  • Price: free, also for commercial use
  • ASR Quality English: best
  • ASR Quality German: best
  • Keyword Support: Yes
  • Timestamps and Confidence Values: Yes
  • Speed: fast
  • Supported Languages: about 100 languages

Wit.ai
  • Price: free, also for commercial use
  • ASR Quality English: basic
  • ASR Quality German: basic
  • Keyword Support: No
  • Timestamps and Confidence Values: Yes
  • Speed: slow
  • Supported Languages: over 100 languages

Google Speech API
  • Price: 1+1h free per month (Enhanced + Default Model), then ~$0.96-$2.16/h (depending on user settings)
  • ASR Quality English: good (Enhanced model)
  • ASR Quality German: basic (Default model)
  • Keyword Support: Yes
  • Timestamps and Confidence Values: No
  • Speed: fast
  • Supported Languages: over 100 languages

Amazon Transcribe
  • Price: 1h free per month (first 12 months), then ~$1.44/h; cheaper for high volumes
  • ASR Quality English: very good
  • ASR Quality German: very good
  • Keyword Support: Yes
  • Timestamps and Confidence Values: Yes
  • Speed: much slower
  • Supported Languages: about 40 languages

Speechmatics Standard
  • Price: 2h free per month, then ~$1.25/h; much cheaper for high volumes
  • ASR Quality English: very good
  • ASR Quality German: very good
  • Keyword Support: Yes
  • Timestamps and Confidence Values: Yes
  • Speed: medium
  • Supported Languages: about 50 languages

Speechmatics Enhanced
  • Price: 2h free per month, then ~$1.90/h; much cheaper for high volumes
  • ASR Quality English: best
  • ASR Quality German: best
  • Keyword Support: Yes
  • Timestamps and Confidence Values: Yes
  • Speed: slow
  • Supported Languages: about 50 languages

(Last Update: March 2023)

More Details about the comparison:

ASR Quality:

We tried to compare the relative speech recognition quality of all services in English and German (best means just the best one of our integrated services).
Please let us know if you get different results or if you compare services in other languages!

Keyword Support:

Support for keywords to expand the speech recognition vocabulary, e.g. to recognize out-of-vocabulary words.
This feature is called Word and Phrase Hints in the Google Cloud Speech API, Custom Vocabulary in Amazon Transcribe and Custom Dictionary in Speechmatics.

Timestamps and Confidence Value:

A timestamp and a confidence value are returned for each word or phrase.
This is relevant for our Transcript Editor, to play each word or phrase separately and to instantly see which sections should be checked manually (low confidence).

Speed:

The relative processing speed of all services. All services are faster than real-time, but Amazon Transcribe and Speechmatics enhanced model are significantly slower compared to all other services.

Supported Languages:

Links to pages with recent supported languages and variants.

We will add additional services if and when we find services that offer improved cost benefits or better final results and support at least two languages (supporting more than one language is an important step for a speech recognition company).

Auphonic Output Formats

Auphonic produces three output formats from speech recognition results: an HTML transcript file (readable by humans), a JSON or XML file with all data (readable by machines), and a WebVTT subtitles/captions file as an exchange format between systems.

HTML Transcript File

Examples: EN Singletrack, EN Multitrack, DE Singletrack

The HTML output file contains the transcribed text with timestamps for each new paragraph; hovering the mouse shows the time for each text segment, and speaker names are displayed for multitrack productions. Sections are automatically generated from chapter marks, and the HTML file includes the audio metadata as well.

The transcription text can be copied into WordPress or other content management systems, in order to search within the transcript and to find the corresponding timestamps (if you don’t have an audio player which supports search in WebVTT/transcripts).

Our HTML output file also includes the Auphonic Transcript Editor for easy-to-use transcription editing.

WebVTT File

Examples: EN Singletrack, EN Multitrack, DE Singletrack

WebVTT is the open specification for subtitles, captions, chapters, etc. The WebVTT file can be added as a track element within the HTML5 audio/video element. For an introduction, see Getting started with the HTML5 track element.
It is supported by all major browsers, and many other systems use it already (screen readers, (web) audio players with WebVTT display and search like the player from Podlove, software libraries, etc.).
It is possible to add other time-based metadata as well: not only the transcription text, but also speaker names, styling, or any other custom data like GPS coordinates.
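
For illustration, here is a minimal WebVTT file with speaker names, using the voice tags from the WebVTT specification (the cue text and times are made up):

    WEBVTT

    00:00:00.000 --> 00:00:04.200
    <v Alice>Welcome to the show, thanks for tuning in.

    00:00:04.200 --> 00:00:07.800
    <v Bob>Glad to be here, let's get started.

The file can then be referenced from a track element inside an HTML5 audio or video element:

    <audio controls src="episode.mp3">
      <track kind="captions" src="transcript.vtt" srclang="en" default>
    </audio>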

Search engines could parse WebVTT files in audio/video tags, as the format is well defined; then we would have searchable audio/video.
It is also possible to link to an external WebVTT file in an RSS feed, so that podcast players and other feed-based systems can parse the transcript as well (for details see this discussion).
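
As one example of this approach, the Podcasting 2.0 namespace defines a transcript tag for exactly this purpose (the URL below is a placeholder, and the feed must declare the podcast namespace on its rss element):

    <podcast:transcript url="https://example.com/episode1.vtt" type="text/vtt" />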

WebVTT is therefore a great exchange format between different systems: audio players, speech recognition systems, human transcriptions, feeds, search engines, CMS, etc.

JSON/XML Output File

Examples: EN Singletrack, EN Multitrack, DE Singletrack
Examples with word timestamps: EN Singletrack, EN Multitrack, DE Singletrack

This file contains all the speech recognition details in JSON or XML format. This includes the text, punctuation and paragraphs with timestamps and confidence values.
Word timestamps and confidence values are also available if you use Auphonic Whisper ASR, Speechmatics, or Amazon Transcribe.
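
As a sketch of how such a file could be processed: the field names below are illustrative assumptions, not the exact Auphonic schema, so please inspect one of the example files above for the real structure:

    # Illustrative sketch only: the JSON field names below are assumptions,
    # not the exact Auphonic schema -- inspect an example file for the real
    # structure before relying on it.
    import json

    with open("transcript.json") as f:
        data = json.load(f)

    # list all words below a confidence threshold, to check them manually
    for segment in data.get("segments", []):
        for word in segment.get("words", []):
            if word.get("confidence", 1.0) < 0.5:
                print(word.get("start"), word.get("text"))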

Tips to Improve Speech Recognition Accuracy

Audio quality is important
  • Reverberant audio is quite a problem; put the microphone as close to the speaker as possible.

  • Try to avoid background sounds and noises during recording.

  • Don’t use background music either (unless you use our multitrack version).

  • Use only fast and stable Skype/Hangouts connections.

Speak clearly
  • Pronunciation and grammar are important.

  • Dialects are more difficult to understand, use the correct language variant if available (e.g. English-UK vs. English-US).

  • Don’t interrupt other speakers. This has a huge impact on the accuracy!

  • Don’t mix languages.

Use a lot of metadata and keywords
  • This is a big help for the Google Speech API, Amazon Transcribe and Speechmatics.

  • Metadata and keywords contribute a lot to making the recognition of special names, terms, and out-of-vocabulary words easier.

  • As always, accurate metadata is important!

Use our multitrack version
  • If you record a separate track for each speaker, use our multitrack speech recognition.

  • This will lead to better results and more accurate information on the timing of each speaker.

  • Background music/sounds should be put into a separate track, so as to not interfere with the speech recognition.

Auphonic Transcript Editor

[Image: Screenshot of our transcript editor with word confidence highlighting and the edit bar.]

Our open source transcript editor, which is embedded directly in the HTML Transcript File, has been designed to make checking and editing transcripts as easy as possible. Try it yourself with our Transcript Editor Examples.

Main features of the Transcript Editor:

  • Edit the transcription directly in the HTML document.

  • Show/hide word or phrase confidence, to instantly see which sections should be checked manually (if you use Auphonic Whisper ASR, Amazon Transcribe or Speechmatics as speech recognition engine).

  • Listen to audio playback of specific words directly in the HTML editor.

  • Share the transcript editor with others: as the editor is embedded directly in the HTML file (no external dependencies), you can just send the HTML file to someone else to manually check the automatically generated transcription.

  • Export the edited transcript to HTML, PDF or WebVTT.

  • Completely usable on all mobile devices and desktop browsers.

Transcript Editing

By clicking the Edit Transcript button, a dashed box appears around the text. This indicates that the text is now freely editable on this page. Your changes can be saved by using one of the export options.
If you make a mistake whilst editing, you can simply use the undo/redo function of the browser to undo or redo your changes.

[Image: The transcript editor with editing mode enabled.]

When working with multitrack speech recognition, another helpful feature is the ability to change all speaker names at once throughout the whole transcript just by editing one speaker. Simply click on an instance of a speaker title and change it to the appropriate name; this name will then appear throughout the whole transcript.

Word/Phrase Confidence Highlighting

Word or phrase confidence values are shown visually in the transcript editor, highlighted in shades of red (see screenshot). The shade of red is dependent on the actual word confidence value: The darker the red, the lower the confidence value. This means you can instantly see which sections you should check/re-work manually to increase the accuracy.
Once you have edited the highlighted text, it will be set to white again, so it’s easy to see which sections still require editing.
Use the button Add/Remove Highlighting to disable/enable word confidence highlighting.

Note

Confidence values are only available in Auphonic Whisper ASR, Amazon Transcribe and Speechmatics. They are not supported if you use any of our other integrated speech recognition services!

Audio Playback

The button Activate/Stop Play-on-click allows you to hear the audio playback of the section you click on (by clicking directly on the word in the transcript editor). This is helpful in allowing you to check the accuracy of certain words by being able to listen to them directly whilst editing, without having to go back and try to find that section within your audio file.

If you use an External Service in your production to export the resulting audio file, we will automatically use the exported file in the transcript editor. Otherwise we will use the output file generated by Auphonic. Please note that this file is password protected for the current Auphonic user and will be deleted after 21 days.
If no audio file is available in the transcript editor, or it cannot be played because of the password protection, you will see the button Add Audio File to add a new audio file for playback.

Export Formats, Save/Share Transcript Editor

Click on the button Export… to see all export and saving/sharing options:

[Image: Export and sharing options in the transcript editor.]

Save/Share Editor

The Save Editor button stores the whole transcript editor with all its current changes into a new HTML file. Use this button to save your changes for further editing or if you want to share your transcript with someone else for manual corrections (as the editor is embedded directly in the HTML file without any external dependencies).

Export HTML / Export PDF / Export WebVTT

Use one of these buttons to export the edited transcript to HTML (for WordPress, Word, etc.), to PDF (via the browser print function), or to WebVTT (so that the edited transcript can be used as subtitles or imported into the web audio players of the Podlove Publisher or Podigee). Every export format is rendered directly in the browser; no server is needed.

Speech Recognition Examples

Please see examples in English and German at:
https://auphonic.com/features/speechrec

All features demonstrated in these examples also work in over 100 languages, although the recognition quality might vary.

Transcript Editor Examples

Here are two examples of the transcript editor, taken from our Speech Recognition Examples:

1. Singletrack Transcript Editor Example

Singletrack speech recognition example from the first 10 minutes of Common Sense 309 by Dan Carlin. Speechmatics was used as the speech recognition engine, without any keywords or further manual editing.

2. Multitrack Transcript Editor Example

A multitrack speech recognition transcript example from the first 20 minutes of TV Eye on Marvel - Luke Cage S1E1. Amazon Transcribe was used as the speech recognition engine, without any further manual editing.
As this is a multitrack production, the transcript includes exact speaker names as well (try to edit them!).