Automatic Speech Recognition¶
Auphonic has built a layer on top of a few external Speech Recognition Services:
Our classifiers generate metadata during the analysis of an audio signal (music segments, silence, multiple speakers, etc.) to divide the audio file into small and meaningful segments, which are then processed by the speech recognition engine. The external speech services support multiple languages and return text results for all audio segments.
The results from all segments are then combined, and meaningful timestamps, simple punctuation and structuring are added to the resulting text.
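The combining step can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not Auphonic's actual implementation: the `Segment` structure, its fields and the `combine_segments` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # segment start in seconds (hypothetical field)
    text: str     # text returned by the speech service for this segment


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS (fractions of a second are dropped)."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def combine_segments(segments: List[Segment]) -> str:
    """Sort segments by start time and join them into a timestamped transcript."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        text = seg.text.strip()
        if not text:
            continue  # skip segments without recognized speech (e.g. music)
        lines.append(f"[{format_timestamp(seg.start)}] {text}")
    return "\n".join(lines)
```

Each non-empty segment becomes one timestamped line; punctuation and structuring would be further steps on top of this.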
This is especially interesting when used with our Multitrack Algorithms, where all speakers have been recorded on separate tracks:
In Multitrack Speech Recognition, all the tracks are processed by the speech recognition service individually, so there is no crosstalk or spill and the recognition is much more accurate, as it is only dealing with one speaker at a time.
This also means that we can show individual speaker names in the transcript output file and audio player because we know exactly who is saying what at any given time.
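Because every track belongs to exactly one speaker, assembling a multitrack transcript is essentially a chronological interleave of the per-track results. The sketch below illustrates the idea only; the input layout and the `merge_tracks` helper are assumptions made for this example, not part of the Auphonic API.

```python
from typing import Dict, List, Tuple

# Hypothetical input: speaker name -> list of (start_seconds, text) phrases
TrackResults = Dict[str, List[Tuple[float, str]]]


def merge_tracks(tracks: TrackResults) -> List[Tuple[float, str, str]]:
    """Interleave the phrases of all tracks in chronological order.

    Returns (start_seconds, speaker, text) tuples.
    """
    merged = [
        (start, speaker, text)
        for speaker, phrases in tracks.items()
        for start, text in phrases
    ]
    merged.sort(key=lambda item: item[0])  # order by start time only
    return merged
```

The speaker name comes for free from the track a phrase was recognized on, so no speaker detection is needed.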
Automatic speech recognition is most useful for making audio searchable. Automatically generated transcripts won’t be perfect and might be difficult to read (spoken text is very different from written text), but they are very valuable if you want to find a specific topic within a one-hour audio file or the exact time of a quote in an audio archive.
We also include a complete Transcript Editor directly in our HTML output file. It displays word confidence values so you can instantly see which sections should be checked manually, supports direct audio playback and HTML/PDF/WebVTT export, and allows you to share the editor with someone else for further editing.
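As an illustration of this kind of search, here is a minimal sketch that looks up a term in a list of timestamped paragraphs. The `(start_seconds, text)` layout and the `find_mentions` helper are hypothetical and only serve the example.

```python
from typing import List, Tuple


def find_mentions(paragraphs: List[Tuple[float, str]], query: str) -> List[float]:
    """Return the start times (in seconds) of all paragraphs containing the query.

    The comparison is case-insensitive.
    """
    needle = query.casefold()
    return [start for start, text in paragraphs if needle in text.casefold()]
```

Even an imperfect transcript is usually good enough for this: a single correctly recognized mention is sufficient to jump to the right position in the audio.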
How to use Speech Recognition within Auphonic¶
To activate speech recognition within Auphonic, you first have to connect your Auphonic account to an external speech recognition service at the External Services page.
Once this is done, you can choose whichever service you signed up for in the Speech Recognition section when creating a new production.
Integrated Speech Recognition Services¶
At the moment we support the following external speech recognition services: Wit.ai, Google Cloud Speech API, Amazon Transcribe and Speechmatics.
For an overview and comparison, please see our Services Comparison Table.
Google Cloud Speech API¶
- Google Cloud Speech API is the speech to text engine developed by Google and supports over 80 languages.
- 60 minutes of audio per month are free; for more, see Pricing (about $1.50/h).
- It is possible to add keywords to improve speech recognition accuracy for specific words and phrases or to add additional words to the vocabulary. For details see Word and Phrase Hints.
The outstanding feature of the Google Speech API is the possibility to add keywords (also available in Amazon Transcribe), which can improve the recognition quality a lot!
We automatically send words and phrases from your metadata (title, artist, chapters, track names, tags, etc.) to Google and you can add additional keywords manually (see screenshot above).
This provides context for the recognizer and allows the recognition of less well-known names (e.g. the podcast host) or out-of-vocabulary words.
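Collecting keywords from metadata can be sketched as below. This is an illustration under assumptions, not the code Auphonic runs: the metadata dictionary layout and the `collect_keywords` helper are hypothetical, and the resulting list would still have to be passed to the speech service in whatever form it expects.

```python
from typing import Dict, Iterable, List, Union

MetadataValue = Union[str, Iterable[str]]


def collect_keywords(metadata: Dict[str, MetadataValue],
                     extra: Iterable[str] = ()) -> List[str]:
    """Collect unique, non-empty phrases from metadata fields and manual keywords."""
    values: List[str] = []
    for value in metadata.values():
        if isinstance(value, str):
            values.append(value)   # single-valued field, e.g. title
        else:
            values.extend(value)   # multi-valued field, e.g. tags
    values.extend(extra)           # manually added keywords come last

    seen = set()
    keywords = []
    for phrase in values:
        phrase = phrase.strip()
        key = phrase.lower()       # deduplicate case-insensitively
        if phrase and key not in seen:
            seen.add(key)
            keywords.append(phrase)
    return keywords
```

For example, a title, an artist name and a few tags would end up as one deduplicated phrase list, with manual keywords appended at the end.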
Amazon Transcribe¶
Amazon Transcribe offers accurate transcriptions in English, German, Spanish, French, Italian, Portuguese and Korean at low cost, including keywords, word confidence, timestamps, and punctuation.
- The free tier offers 60 minutes of free usage a month for 12 months. After that, it is billed monthly at a rate of $0.0004 per second ($1.44/h). More information is available at Amazon Transcribe Pricing.
- Custom Vocabulary (Keywords) Support
- Custom Vocabulary (called Keywords in Auphonic) gives you the ability to expand and customize the speech recognition vocabulary for your specific use case (e.g. product names, domain-specific terminology, or names of individuals).
The same feature is also available in the Google Cloud Speech API.
- Timestamps, Word Confidence, and Punctuation
- Amazon Transcribe returns a timestamp and confidence value for each word so that you can easily locate the audio in the original recording by searching for the text.
It also adds some punctuation, which is combined with our own punctuation and formatting automatically.
The high quality, especially in combination with keywords, and the low cost of Amazon Transcribe make the service very attractive.
However, Amazon Transcribe is much slower than all our other integrated services!
Speechmatics¶
Speechmatics offers accurate transcriptions in many languages, including word confidence values, timestamps, and punctuation.
- Many Languages
- Speechmatics’ clear advantage is the sheer number of languages it supports (all major European and some Asian languages).
It also has a Global English feature, which supports different English accents during transcription.
- Timestamps, Word Confidence, and Punctuation
- Like Amazon Transcribe, Speechmatics creates timestamps, word confidence values, and punctuation.
- Speechmatics is the most expensive speech recognition service at Auphonic.
Pricing starts at £0.06 per minute of audio, which can be purchased in blocks of £10 or £100. This equates to a starting rate of about $4.78/h. A reduced rate of £0.05 per minute ($3.98/h) is available when purchasing £1,000 blocks.
They offer significant discounts for users requiring higher volumes. At this further reduced price point it is a similar cost to the Google Cloud Speech API (or lower). If you process a lot of content, you should contact them directly at firstname.lastname@example.org and say that you wish to use it with Auphonic.
More information is available at Speechmatics Pricing.
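The hourly figures quoted for Amazon Transcribe and Speechmatics follow directly from the per-second and per-minute rates. The short calculation below reproduces them. Note that the exchange rate is an assumption: it is simply the rate implied by the quoted figures ($4.78/h for £3.60/h), not a current market rate.

```python
def per_second_to_hourly(rate_per_second: float) -> float:
    """Convert a per-second price into a per-hour price."""
    return rate_per_second * 3600


def per_minute_to_hourly(rate_per_minute: float) -> float:
    """Convert a per-minute price into a per-hour price."""
    return rate_per_minute * 60


amazon_usd_per_hour = per_second_to_hourly(0.0004)           # about 1.44 USD/h
speechmatics_gbp_per_hour = per_minute_to_hourly(0.06)       # about 3.60 GBP/h
speechmatics_bulk_gbp_per_hour = per_minute_to_hourly(0.05)  # about 3.00 GBP/h

# Assumption: exchange rate implied by the quoted figures, not a live rate.
USD_PER_GBP = 4.78 / 3.60
speechmatics_bulk_usd_per_hour = speechmatics_bulk_gbp_per_hour * USD_PER_GBP

print(f"Amazon Transcribe:   ${amazon_usd_per_hour:.2f}/h")
print(f"Speechmatics:        £{speechmatics_gbp_per_hour:.2f}/h")
print(f"Speechmatics (bulk): ${speechmatics_bulk_usd_per_hour:.2f}/h")
```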
Speechmatics offers high-quality transcripts in many languages. But these features do come at a price: it is the most expensive speech recognition service at Auphonic.
Unfortunately, their existing Custom Dictionary (keywords) feature, which would further improve the results, is not available in the Speechmatics API yet.
Services Comparison Table¶
| ||Wit.ai||Google Speech API||Amazon Transcribe||Speechmatics|
|Price||also for commercial||1h free per month,||1h free per month,||~$4.8/h to ~$4/h, much cheaper for high volumes|
|ASR Quality English||basic||medium||best||best|
|ASR Quality German||medium (Wit.ai uses Google in German)|
|Word Timestamps and Confidence||No||No||Yes||Yes|
|Supported Languages||sq, ar, az, bn, bs, bg, my, ca, zh, hr, cs, da, nl, en, et, fi, fr, ka, de, el, he, hi, hu, is, id, it, ja, ko, la, lt, mk, ms, nb, fa, pl, pt, ro, ru, sr, sk, sl, es, sw, sv, tl, ta, th, tr, uk, vi||ar-DZ, ar-BH, ar-EG, ar-IQ, ar-IL, ar-JO, ar-KW, ar-LB, ar-MA, ar-OM, ar-QA, ar-SA, af-ZA, ar-PS, ar-TN, ar-AE, eu, bg, ca, hr, cs, da, nl, en-AU, en-CA, en-IN, en-IE, en-NZ, en-PH, en-ZA, en-GB, en-US, fil, fi, fr, gl, de, el, he, hi, hu, is, id, it, ja, ko, lt, ms, nb, fa, pl, pt-BR, pt-PT, ro, ru, sr, sk, sl, es-AR, es-BO, es-CL, es-CO, es-CR, es-DO, es-EC, es-SV, es-GT, es-HN, es-MX, es-NI, es-PA, es-PY, es-PE, es-PR, es-ES, es-US, es-UY, es-VE, sv, th, tr, u, uk, vi, u, yue-Hant-HK, cmn-Hans-CN, cmn-Hans-HK, cmn-Hant-TW||en-AU, en-GB, en-US, de-DE, fr-CA, fr-FR, it-IT, pt-BR, es-ES, es-US, ko-KR||bg, ca, hr, cs, da, nl, en, en-AU, en-GB, en-US, fi, fr, de, el, hi, hu, it, ja, ko, lv, lt, pl, pt, ro, ru, sk, sl, es, sv|
(Last Update: April 2019)
More Details about the comparison:
- ASR Quality:
- We tried to compare the relative speech recognition quality of all services in English and German
(best means just the best one of our integrated services).
Wit.ai seems to use the Google Cloud Speech API in German (maybe also in other languages?) and therefore achieves better results than in English.
Please let us know if you get different results or if you compare services in other languages!
- Keyword Support:
- Support for keywords to expand the speech recognition vocabulary, to recognize out-of-vocabulary words.
This feature is called Word and Phrase Hints in the Google Cloud Speech API and Custom Vocabulary in Amazon Transcribe.
- Word Timestamps and Confidence:
- A timestamp and confidence value for each word is returned (and not just for a phrase).
This is relevant for our Transcript Editor, to play each word separately and to instantly see which words should be checked manually (low confidence).
- Speed:
- The relative processing speed of all services. All services are faster than real time, but Amazon Transcribe is significantly slower than all other services.
- Supported Languages:
- A list of all supported languages and variants.
We will add additional services if and when we find services that offer improved cost benefits or better final results and support at least two languages (that’s an important step for a speech recognition company).
Auphonic Output Formats¶
Auphonic produces three output formats from speech recognition results:
An HTML transcript file (readable by humans), a JSON or XML file with all data (readable by machines) and a WebVTT subtitles/captions file as an exchange format between systems.
HTML Transcript File¶
The HTML output file contains the transcribed text with timestamps for each new paragraph. Hovering the mouse over a text segment shows its time, and speaker names are displayed for multitrack productions. Sections are automatically generated from chapter marks, and the HTML file includes the audio metadata as well.
The transcription text can be copied into WordPress or other content management systems, in order to search within the transcript and to find the corresponding timestamps (if you don’t have an audio player which supports search in WebVTT/transcripts).
Our HTML output file also includes the Auphonic Transcript Editor for easy-to-use transcription editing.
WebVTT File¶
WebVTT is an open specification for subtitles, captions, chapters, etc. The WebVTT file can be added as a track element within the HTML5 audio/video element. For an introduction see Getting started with the HTML5 track element.
It is supported by all major browsers, and many other systems already use it (screen readers, (web) audio players with WebVTT display and search like the players from Podlove or Podigee, software libraries, etc.).
It is also possible to add other time-based metadata: not only the transcription text, but also speaker names, styling or any other custom data such as GPS coordinates.
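To make the format more concrete, the following sketch writes a small WebVTT document with speaker names. It relies only on standard WebVTT syntax (the `WEBVTT` header, `HH:MM:SS.mmm --> HH:MM:SS.mmm` cue timings and the `<v Name>` voice span). The cue tuple layout and the helper names are assumptions for this example, and the exact files produced by Auphonic may look different.

```python
from typing import List, Tuple


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    millis = int(round(seconds * 1000))
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def escape_cue_text(text: str) -> str:
    """Escape the characters that have a special meaning in WebVTT cue text."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def build_webvtt(cues: List[Tuple[float, float, str, str]]) -> str:
    """Build a WebVTT document from (start, end, speaker, text) cues."""
    blocks = ["WEBVTT"]
    for start, end, speaker, text in cues:
        timing = f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}"
        # <v Name> is the WebVTT voice span that marks the speaker of a cue
        blocks.append(f"{timing}\n<v {speaker}>{escape_cue_text(text)}")
    return "\n\n".join(blocks) + "\n"
```

Cues are separated by blank lines, which is what makes the format easy to produce and to parse in other systems.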
Because the format is well defined, search engines could parse WebVTT files referenced in audio/video tags, which would make audio and video searchable.
It is also possible to link to an external WebVTT file in an RSS feed, so that podcast players and other feed-based systems could parse the transcript as well (for details see this discussion).
WebVTT is therefore a great exchange format between different systems: audio players, speech recognition systems, human transcriptions, feeds, search engines, CMS, etc.
JSON/XML Output File¶
This file contains all the speech recognition details in JSON or XML format. This includes the text, punctuation and paragraphs with timestamps and confidence values.
Word timestamps and confidence values are also available if you use Speechmatics or Amazon Transcribe.
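Machine-readable output makes it easy to post-process transcripts, for example to list all words that should be checked manually. The sketch below assumes a purely hypothetical JSON layout (a list of paragraphs, each with a `words` list carrying `text`, `start` and `confidence`); the real structure of the Auphonic JSON file may differ, so the field names would need to be adapted.

```python
import json

# Hypothetical example data; the real Auphonic JSON layout may differ.
EXAMPLE = """
[
  {"start": 0.0, "speaker": "Alice", "words": [
    {"text": "Hello", "start": 0.0, "confidence": 0.98},
    {"text": "Auphonic", "start": 0.6, "confidence": 0.41}
  ]}
]
"""


def low_confidence_words(raw_json, threshold=0.6):
    """Return (start_seconds, word) for every word below the confidence threshold."""
    flagged = []
    for paragraph in json.loads(raw_json):
        for word in paragraph.get("words", []):
            if word["confidence"] < threshold:
                flagged.append((word["start"], word["text"]))
    return flagged
```

The threshold of 0.6 is an arbitrary starting point and would have to be tuned per service, since confidence values are not directly comparable between engines.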
Tips to Improve Speech Recognition Accuracy¶
- Audio quality is important
- Reverberant audio is quite a problem: put the microphone as close to the speaker as possible.
- Try to avoid background sounds and noises during recording.
- Don’t use background music either (unless you use our multitrack version).
- Only use fast and stable Skype/Hangouts connections.
- Speak clearly
- Pronunciation and grammar are important.
- Dialects are more difficult to understand. Use the correct language variant if available (e.g. English-UK vs. English-US).
- Don’t interrupt other speakers. This has a huge impact on the accuracy!
- Don’t mix languages.
- Use a lot of metadata
- This is a big help for the Google Cloud Speech API.
- Metadata and keywords make it much easier to recognize special names, terms and out-of-vocabulary words.
- As always, accurate metadata is important!
- Use our multitrack version
- If you record a separate track for each speaker, use our multitrack speech recognition.
- This will lead to better results and more accurate information on the timing of each speaker.
- Background music/sounds should be put into a separate track, so as to not interfere with the speech recognition.
Auphonic Transcript Editor¶
Our open source transcript editor, which is embedded directly in the HTML Transcript File, has been designed to make checking and editing transcripts as easy as possible. Try it yourself with our Transcript Editor Examples.
Main features of the Transcript Editor:
- Edit the transcription directly in the HTML document.
- Show/hide word confidence, to instantly see which sections should be checked manually (if you use Amazon Transcribe or Speechmatics as speech recognition engine).
- Listen to audio playback of specific words directly in the HTML editor.
- Share the transcript editor with others: as the editor is embedded directly in the HTML file (no external dependencies), you can just send the HTML file to someone else to manually check the automatically generated transcription.
- Export the edited transcript to HTML, PDF or WebVTT.
- Completely useable on all mobile devices and desktop browsers.
By clicking the Edit Transcript button, a dashed box appears around the text. This indicates that the text is now freely editable on this page. Your changes can be saved by using one of the export options.
If you make a mistake whilst editing, you can simply use the undo/redo function of the browser to undo or redo your changes.
When working with multitrack speech recognition, another helpful feature is the ability to change all speaker names at once throughout the whole transcript just by editing one speaker. Simply click on an instance of a speaker title and change it to the appropriate name; this name will then appear throughout the whole transcript.
Word Confidence Highlighting¶
Word confidence values are shown visually in the transcript editor, highlighted in shades of red (see screenshot).
The shade of red is dependent on the actual word confidence value: The darker the red, the lower the confidence value. This means you can instantly see which sections you should check/re-work manually to increase the accuracy.
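The mapping from confidence to color can be imagined along the following lines. This is only a sketch of the principle (lower confidence, stronger red); the actual color scale used by the transcript editor is not documented here, so the linear white-to-red ramp and the `confidence_to_color` helper are assumptions.

```python
def confidence_to_color(confidence: float) -> str:
    """Map a confidence in [0, 1] to a CSS hex color.

    1.0 gives white (no highlighting), 0.0 gives fully saturated red.
    """
    confidence = min(max(confidence, 0.0), 1.0)  # clamp out-of-range values
    # Lower confidence -> less green and blue -> stronger red
    level = round(255 * confidence)
    return f"#ff{level:02x}{level:02x}"
```

With this ramp, a word with a confidence of 1.0 stays white, while a word with a confidence close to 0 is shown in the strongest red.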
Once you have edited the highlighted text, it will be set to white again, so it’s easy to see which sections still require editing.
Use the button Add/Remove Highlighting to disable/enable word confidence highlighting.
The button Activate/Stop Play-on-click allows you to hear the audio playback of the section you click on (by clicking directly on the word in the transcript editor). This is helpful in allowing you to check the accuracy of certain words by being able to listen to them directly whilst editing, without having to go back and try to find that section within your audio file.
If you use an External Service in your production to export the resulting audio file, we will automatically use the exported file in the transcript editor.
Otherwise we will use the output file generated by Auphonic. Please note that this file is password protected for the current Auphonic user and will be deleted in 21 days.
If no audio file is available in the transcript editor, or if it cannot be played because of the password protection, you will see the button Add Audio File to add a new audio file for playback.
Speech Recognition Examples¶
Please see examples in English and German at:
All features demonstrated in these examples also work in over 80 languages, although the recognition quality might vary.
Transcript Editor Examples¶
Here are two examples of the transcript editor, taken from our Speech Recognition Examples:
- 1. Singletrack Transcript Editor Example
- Singletrack speech recognition example from the first 10 minutes of Common Sense 309 by Dan Carlin. Speechmatics was used as speech recognition engine without any keywords or further manual editing.
- 2. Multitrack Transcript Editor Example
- A multitrack speech recognition transcript example from the first 20 minutes of TV Eye on Marvel - Luke Cage S1E1. Amazon Transcribe was used as speech recognition engine without any further manual editing.
As this is a multitrack production, the transcript includes exact speaker names as well (try to edit them!).