Coqui STT - Extract Words from Audio Files

11/02/2022, Wed
Categories: #shell
Tags: #cli-tools

Speech to Text CLI Tool

You may be listening to a podcast and find that you cannot make out the spelling of a word you would like to understand. A 'live-caption' feature for the podcast would show you the vocabulary you are missing. To remedy this problem, you can use a Speech-to-Text (STT) tool to output a transcript of the audio.

You might also want to convert an audio file to text if you can read faster than you can listen.

The following shows how to use Coqui STT to transcribe an audio file to text in the terminal.

docker pull ghcr.io/coqui-ai/stt-train

The command above downloads the Docker image.

To start a container from the image using the host's network:

docker run -it --net="host" ghcr.io/coqui-ai/stt-train:latest

To make a host directory accessible inside the container, mount it with the -v option when starting the container.

docker run -v /folder/for/host:/folder/for/docker -it ghcr.io/coqui-ai/stt-train:latest
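The mount step can be made a little safer with a small helper. This is only a sketch: the function name mount_arg and the example paths are hypothetical, and it simply checks that the host folder exists and resolves it to an absolute path before handing it to docker's -v option.

```shell
# mount_arg HOST_DIR CONTAINER_DIR  (hypothetical helper, not part of Coqui STT)
# Prints "ABSOLUTE_HOST_DIR:CONTAINER_DIR" for use with docker's -v option,
# or fails with a message if HOST_DIR does not exist.
mount_arg() {
  if [ ! -d "$1" ]; then
    echo "host directory not found: $1" >&2
    return 1
  fi
  # Resolve to an absolute path in a subshell so the caller's
  # working directory is left unchanged.
  printf '%s:%s\n' "$(CDPATH= cd -- "$1" && pwd)" "$2"
}
```

For example, with an assumed host folder ./stt-work and an assumed mount point /data, the container could be started with: docker run -it -v "$(mount_arg ./stt-work /data)" ghcr.io/coqui-ai/stt-train:latest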

Download the pre-trained model checkpoint from

https://github.com/coqui-ai/STT/releases/tag/v1.4.0

On the host, navigate to the directory that is shared with the container and create a folder named 'coqui-stt-1.4.0-checkpoint' for the checkpoint files.

mkdir coqui-stt-1.4.0-checkpoint

Place the downloaded checkpoint files in this folder, then download the huge-vocabulary.scorer from https://coqui.ai/english/coqui/v1.0.0-huge-vocab into the shared directory.

Acquire an MP3 audio file for STT to process. You will most likely have to convert it first, because STT only accepts 16-bit WAV files as input.

The following ffmpeg command shows an example conversion to 16-bit PCM, mono, at 16 kHz, which is the format the released English models are generally trained on.

ffmpeg -i "Gettysburg Address.mp3" -acodec pcm_s16le -ac 1 -ar 16000 "Gettysburg Address.wav"
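If you have several files to convert, the conversion can be wrapped in a small function. The sketch below is an illustration only: the helper names wav_name and to_wav are hypothetical, the input is assumed to have a file extension, and the 16 kHz mono target is an assumption about what the model expects on top of the 16-bit requirement stated above.

```shell
# wav_name INPUT  (hypothetical helper)
# Prints INPUT with its extension replaced by .wav.
# Assumes the file name has an extension.
wav_name() {
  printf '%s.wav\n' "${1%.*}"
}

# to_wav INPUT  (hypothetical helper)
# Converts INPUT to 16-bit signed PCM, mono, 16 kHz (assumed target format).
to_wav() {
  ffmpeg -i "$1" -acodec pcm_s16le -ac 1 -ar 16000 "$(wav_name "$1")"
}
```

Calling to_wav "Gettysburg Address.mp3" would then write "Gettysburg Address.wav" next to the original file.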

Inside the container, run the 'Single file (aka one-shot) inference' command to transcribe the audio file to text.

python -m coqui_stt_training.training_graph_inference --checkpoint_dir coqui-stt-1.4.0-checkpoint --scorer_path huge-vocabulary.scorer --n_hidden 2048 --one_shot_infer 'Gettysburg Address.wav'

The output will be displayed on the console.
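Because the inference command depends on three inputs being in place, a quick check before running it can save a failed start. The following is a minimal sketch; check_inputs and transcribe are hypothetical helper names, and the inference flags are the same ones used in the command above.

```shell
# check_inputs CHECKPOINT_DIR SCORER WAV  (hypothetical helper)
# Returns non-zero and names the first input that is missing.
check_inputs() {
  if [ ! -d "$1" ]; then echo "missing checkpoint directory: $1" >&2; return 1; fi
  if [ ! -f "$2" ]; then echo "missing scorer file: $2" >&2; return 1; fi
  if [ ! -f "$3" ]; then echo "missing audio file: $3" >&2; return 1; fi
}

# transcribe CHECKPOINT_DIR SCORER WAV  (hypothetical helper)
# Runs the one-shot inference command only after the checks pass.
transcribe() {
  check_inputs "$1" "$2" "$3" || return 1
  python -m coqui_stt_training.training_graph_inference \
    --checkpoint_dir "$1" \
    --scorer_path "$2" \
    --n_hidden 2048 \
    --one_shot_infer "$3"
}
```

To keep a copy of whatever is printed to the console, the call can be piped through tee, for example: transcribe coqui-stt-1.4.0-checkpoint huge-vocabulary.scorer 'Gettysburg Address.wav' | tee transcript.txt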