Audio Transcription with a Hugging Face Pre-Trained Model

One time, somebody asked me if it was possible to transcribe call center conversations to text so they could later be analyzed for quality purposes. This is exactly what we will be trying to do here in the simplest way possible: audio transcription.

Fortunately, companies like Facebook and OpenAI have already invested millions of dollars in producing these technologies, and many of the resulting models are available on a website called "Hugging Face".

For this post, we will be using the pre-trained model "distil-whisper/distil-small.en," which is able to transcribe audio arrays into English text. The Distil-Whisper family includes several models that differ in their number of parameters. A model with more parameters might produce better transcriptions, but it also needs more computation for every second of audio, so the final transcription might take a lot of time, depending on how long the audio file is.

For our case, we will be using the smallest model, which has 166 million parameters, to transcribe a sample audio file that is 1 minute and 51 seconds long.

Download the Sample Audio File

Please download this sample audio file from this link. This is an mp3 file, but you are welcome to use your own file. Rename the audio file to "speech.mp3" and place it at the root of your Colab directory.

Nicely done! Now, it's time to code. Surprisingly, the code required for this is very simple but requires some tweaking. Let's get our hands dirty.

Import Libraries

To reproduce this code, you can use Google Colab. We will need to install the following dependencies:

  !pip install transformers
  !pip install soundfile
  !pip install librosa

Transformers: This is the main library that allows us to play with thousands of pre-trained models and download them directly from the Hugging Face Hub.
Soundfile: This library will help us load WAV, FLAC, or MP3 files.
Librosa: It's Librosa, not LeviOsA. This library will help us with the pre-processing the mp3 file needs, such as converting stereo audio to mono and resampling it to a different sampling rate.
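
Before moving on, it can be handy to confirm that the three packages actually got installed. The snippet below is a minimal sketch that uses only the Python standard library; the helper name installed_versions is made up for this post and is not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the status of the three dependencies used in this post.
for name, ver in installed_versions(["transformers", "soundfile", "librosa"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If any line says "not installed", rerun the matching pip command before continuing.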

Great! Now let's load all the libraries we need:

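from transformers.utils import logging
logging.set_verbosity_error()  # only show errors, hide warnings and info messages

import soundfile as sf   # reading the audio file
import numpy as np       # array manipulation
import librosa           # mono conversion and resampling

from transformers import pipeline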

Create the Transcription Pipeline

A pipeline is an object that loads a pre-trained model and sets up all the required processing so we don't have to do it ourselves. This is very simple: we just need to create the pipeline object and name the model we will be using for the transcription.

asr = pipeline("automatic-speech-recognition",
               model="distil-whisper/distil-small.en")

This single statement will download the model and store the pipeline in the asr variable. We are almost ready to use it. The only thing we need to do is make sure the audio file is compatible with the model, and for that, we will need to transform the audio file with Librosa.

Transforming the Audio File

The "distil-small.en" model was trained on audio with a specific sampling rate. Concretely, this model works with audio sampled at 16 kHz (16,000 samples per second). We can find this out by checking the sampling rate of the model's feature extractor:

print(asr.feature_extractor.sampling_rate)

output: 16000
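
To make that number concrete, here is a small sketch of the arithmetic behind it: a sampling rate of 16000 means every second of audio is stored as 16,000 numbers. The helper seconds_to_samples is a made-up name for illustration only.

```python
def seconds_to_samples(duration_s, sampling_rate):
    """Number of samples needed to store duration_s seconds of audio."""
    return int(round(duration_s * sampling_rate))


MODEL_SR = 16_000      # the value printed by asr.feature_extractor.sampling_rate
clip_s = 1 * 60 + 51   # the 1:51 sample clip, in seconds

print(seconds_to_samples(1, MODEL_SR))       # 16000   (one second)
print(seconds_to_samples(30, MODEL_SR))      # 480000  (a 30-second window)
print(seconds_to_samples(clip_s, MODEL_SR))  # 1776000 (the whole clip at 16 kHz)
```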

Now, it's time to check the sampling rate of the audio file "speech.mp3":

audio, sampling_rate = sf.read('speech.mp3')
print(sampling_rate)

output: 44100

No problem. We need to do two things with this audio file: 1) make sure the file is mono, as these models work with single-channel audio, and 2) downsample the 44.1 kHz audio to 16 kHz. Let's do exactly that in this single block of code:

audio_transposed = np.transpose(audio)
audio_mono = librosa.to_mono(audio_transposed)

audio_16KHz = librosa.resample(audio_mono,
                               orig_sr=sampling_rate,
                               target_sr=asr.feature_extractor.sampling_rate)

Soundfile returns stereo audio as an array of shape (frames, channels), while Librosa expects (channels, frames), so we first transpose the array. The transposed array is then converted to mono with Librosa's to_mono function, which averages the channels. That's it; now the audio is in a single channel.
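
To see what these two steps do to the array's shape, here is a NumPy-only sketch on a tiny made-up stereo clip. It mimics the channel averaging that to_mono performs rather than calling Librosa itself, and the sample values are invented for the example.

```python
import numpy as np

# Hypothetical 4-frame stereo clip in soundfile's layout: (frames, channels).
stereo = np.array([[0.50, 0.25],
                   [0.00, 1.00],
                   [-0.50, 0.50],
                   [0.25, 0.75]])

channels_first = stereo.T            # (channels, frames), the layout Librosa expects
mono = channels_first.mean(axis=0)   # average left and right, sample by sample

print(stereo.shape, channels_first.shape, mono.shape)
print(mono)
```

The first print shows the shape going from (4, 2) to (2, 4) to (4,): one value per frame instead of one per channel.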

The second part is also very simple. Librosa has a resample function that changes the sampling rate of an audio signal. The audio_16KHz array now contains the pre-processed audio, which is compatible with the Hugging Face model for audio transcription.
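
If you are curious what resampling means at the array level, the sketch below rebuilds a signal on a coarser time grid using plain linear interpolation. This is a deliberately naive stand-in and not the algorithm Librosa uses: it has no anti-aliasing filter, so please keep using librosa.resample for real recordings. The function name resample_linear and the synthetic 440 Hz tone are assumptions made for the example.

```python
import numpy as np


def resample_linear(y, orig_sr, target_sr):
    """Naive resampler: linear interpolation onto a new time grid (illustration only)."""
    n_out = (len(y) * target_sr + orig_sr - 1) // orig_sr  # ceiling division
    t_in = np.arange(len(y)) / orig_sr      # timestamps of the input samples
    t_out = np.arange(n_out) / target_sr    # timestamps of the output samples
    return np.interp(t_out, t_in, y)


# One second of a synthetic 440 Hz tone at 44.1 kHz.
orig_sr, target_sr = 44_100, 16_000
t = np.arange(orig_sr) / orig_sr
tone = np.sin(2 * np.pi * 440 * t)

tone_16k = resample_linear(tone, orig_sr, target_sr)
print(len(tone), "->", len(tone_16k))
```

The duration stays the same (one second); only the number of samples used to describe it changes.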

Audio Transcription

This is the simplest part. We will use the asr object to transcribe the audio into text. We will use some special parameters that allow us to process the audio in 30-second chunks. This way, we can process longer audio without error, as the model only processes up to 30 seconds of audio at a time.

asr(
    audio_16KHz,
    chunk_length_s=30, # 30 seconds
    batch_size=4,
    return_timestamps=True,
)["chunks"]
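
To build some intuition for what chunk_length_s does, here is a rough sketch of how a long recording can be cut into overlapping 30-second windows. It is not the pipeline's internal code: the helper chunk_bounds and the 5-second overlap_s value are assumptions made for illustration, and the pipeline handles its own overlap and merges the text for us.

```python
def chunk_bounds(n_samples, sampling_rate, chunk_length_s=30, overlap_s=5):
    """Return (start, end) sample indices of overlapping fixed-length windows."""
    chunk = int(chunk_length_s * sampling_rate)
    step = chunk - int(overlap_s * sampling_rate)  # how far each window advances
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:   # the last window reached the end of the audio
            break
        start += step
    return bounds


sr = 16_000
n = 111 * sr   # the 1:51 clip at 16 kHz
for start, end in chunk_bounds(n, sr):
    print(f"{start / sr:6.1f}s -> {end / sr:6.1f}s")
```

Each window is short enough for the model, and the overlap gives neighboring windows some shared context around the cut points.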

That's it! The call takes a little while to run, and it returns a Python list of dictionaries, each holding a timestamp and the recognized text. This is the output of this transcription:

[{'timestamp': (0.0, 7.0),
  'text': ' Community Update December 2019 posted on December 27th 2019 by Frederick Font.'},
 {'timestamp': (7.0, 11.0),
  'text': ' Hi everyone, welcome to a new community update.'},
 {'timestamp': (11.0, 15.0),
  'text': ' Yeah, we know we have not been updating you very regularly lately,'},
 {'timestamp': (15.0, 18.0),
  'text': ' but this does not mean we have not been working hard on free sound.'},
 {'timestamp': (18.0, 25.6),
  'text': ' As has been the case for the last year, we have not been very much concentrated on working on either under the hood improvements'},
 {'timestamp': (25.6, 30.56),
  'text': ' or research type of issues which do not have a clearly visible output in the Fresound'},
 {'timestamp': (30.56, 32.28), 'text': ' website yet.'},
 {'timestamp': (32.28, 37.12),
  'text': ' But we are indeed working on great things which will definitely end up in the platform.'},
 {'timestamp': (37.12, 41.6),
  'text': ' Here is a summary of our current main working threads.'},
 {'timestamp': (41.6, 44.44),
  'text': ' BUG Fixs, General Maintenance and Software updates.'},
 {'timestamp': (44.44, 45.44),
  'text': ' This is a big one as we are about to carry, and software updates.'},
 {'timestamp': (45.44, 49.72),
  'text': ' This is a big one as we are about to carry out necessary software updates.'},
 {'timestamp': (49.72, 55.76),
  'text': ' For nerds, Python, Jango updates, which affect all of our code base and are therefore quite'},
 {'timestamp': (55.76, 56.2), 'text': ' time-consuming.'},
 {'timestamp': (56.2, 57.2), 'text': ' New features.'},
 {'timestamp': (57.2, 63.44),
  'text': ' We are working on new features mostly related to the search page.'},
 {'timestamp': (63.44, 68.56),
  'text': " However, all these new features require a lot of previous research work. Don't forget we're a research"},
 {'timestamp': (68.56, 73.44),
  'text': " institution, so that's what we do best. And again, features need their time to"},
 {'timestamp': (73.44, 78.32),
  'text': " become a reality. The new search features we're working on will allow to cluster"},
 {'timestamp': (78.32, 86.64),
  'text': ' search results as well as adding new filtering options. New front end. Yes, we have not abandoned this one.'},
 {'timestamp': (86.64, 91.52),
  'text': ' It is going very slowly, much more than we thought, but it will eventually become a reality'},
 {'timestamp': (91.52, 93.68), 'text': ' and it is indeed in our roadmap.'},
 {'timestamp': (93.68, 100.16),
  'text': " Oh, and by the way, we've just published a tech-oriented post about Free Sound in the Creative Commons open-source"},
 {'timestamp': (100.16, 101.24), 'text': ' blog.'},
 {'timestamp': (101.24, 103.44), 'text': ' You might want to check that out.'},
 {'timestamp': (103.44, 105.6),
  'text': " And that's it for the short update. Thanks"},
 {'timestamp': (105.6, 109.92),
  'text': ' for a reading and stay tuned for the updates in the coming year. Big happy new'},
 {'timestamp': (109.92, 113.04), 'text': ' year to everyone.'}]
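
A list of dictionaries is convenient for code but not so pleasant to read. As a possible next step, the sketch below turns chunks like the ones above into readable lines and into one plain transcript. The helpers format_timestamp and chunks_to_lines are made up for this post, and the sample list simply copies two entries from the output above.

```python
def format_timestamp(seconds):
    """Render a number of seconds as MM:SS.ss."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def chunks_to_lines(chunks):
    """Turn pipeline-style chunks into '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} - {format_timestamp(end)}] {text}")
    return lines


# Two entries copied from the transcription output above.
sample = [
    {"timestamp": (7.0, 11.0), "text": " Hi everyone, welcome to a new community update."},
    {"timestamp": (101.24, 103.44), "text": " You might want to check that out."},
]

print("\n".join(chunks_to_lines(sample)))
print(" ".join(chunk["text"].strip() for chunk in sample))
```

With the real result, you would pass the list returned by the asr call instead of sample.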