How I transcribed 4.5 hours of audio with Python

To create the Soup Story ‘I made soup with a chess master’, I used my audio recorder to record the full conversation. I thought that was pretty clever, because it would let me focus on the conversation instead of on note-taking.

The only part I hadn’t really thought of was the transcription of the interview.

I sort of assumed there would be some free software available that could take care of it.

Due to some temporary cognitive impairment I had forgotten that ‘free’ does not exist on the internet. ‘Free’ on the internet means: 1 minute of free transcription, a maximum of 1 recording per month, and you have to share all of your personal data and sacrifice the heart of your firstborn.

As you can understand, I was pretty bummed out by that, because I didn’t have just 1 minute to transcribe. I had 4.5 hours to transcribe.

But then I realised that the internet isn’t just a place of dirty capitalist SaaS entrepreneurs with lousy freemium business models. It is also the place where many generous tutors gather, where people go to share ideas, ask questions, and help each other out. And coincidentally also the place where OpenAI had released their speech-to-text A.I. model Whisper to the public.

And so I ended up writing a Python script that utilises Whisper A.I. to transcribe the 4.5h recordings for my Soup Story. I was impressed with how easy it was to build. That’s why I thought it would be cool to share it with you all!

Code walk-through

Let’s set up this script. First we’ll need to import three modules. Of course we start off with whisper. This is the module that enables us to use OpenAI’s Whisper model. Then we’ll need datetime to be able to display timestamps in our transcript, and finally we’ll need pydub’s AudioSegment class to help us cut up our 4.5h long recording (this is because whisper only accepts audio files of maximum 25MB).

I’ll now quickly explain the essence of this script (the transcribing). In the snippet underneath we select the Whisper base model, after which we call the transcribe function with a set of parameters. Check out the Whisper documentation for all parameters. Technically you could run it with just the path to the audio file, but for better results I’ve added these other parameters.
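For reference, the basic call looks roughly like the sketch below. It assumes the openai-whisper package and ffmpeg are installed. The transcribe_file helper, the file name, and the language and fp16 options in the usage comment are illustrative choices, not necessarily the parameters I used for the Soup Story.

```python
def transcribe_file(audio_path, model_name="base", **options):
    """Transcribe one audio file with Whisper and return the result dict."""
    # Imported lazily so the function can be defined without whisper installed.
    import whisper  # pip install openai-whisper (also needs ffmpeg)

    # "base" is one of the smaller models: quick to load, less accurate.
    model = whisper.load_model(model_name)

    # transcribe() returns a dict with the keys "text", "segments" and "language".
    return model.transcribe(audio_path, **options)


# Example usage (hypothetical file name and options):
# result = transcribe_file("interview.mp3", language="en", fp16=False)
# print(result["text"])
```

Passing the language up front skips Whisper’s language detection, and fp16=False avoids a warning when running on a CPU.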

If you run this code you’ll get the transcription printed to your console. Very cool and simple.

But like I said before, the Whisper model only accepts audio files of up to 25MB. The 4.5h recording was 3GB in total. So in order to properly and completely transcribe this recording, we’ll have to perform a few other actions.

In the end we want the code to:

Cut up our 4.5h recording into 10-minute segments
Transcribe each segment separately
Let each sentence start on a new line
Add timestamps to each sentence
Add all segments into one big text file
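The steps above can be sketched roughly like this. It is a minimal sketch, not the full script: the helper names, the chunks folder and the mp3 export format are made up for illustration, and it treats every Whisper segment as one ‘sentence’, which is only approximately true.

```python
import datetime
import os


def chunk_ranges(total_ms, chunk_ms=10 * 60 * 1000):
    """Return (start_ms, end_ms) pairs that cover the whole recording."""
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


def format_timestamp(seconds):
    """Turn a number of seconds into H:MM:SS."""
    return str(datetime.timedelta(seconds=int(seconds)))


def format_segments(segments, offset_seconds=0):
    """One line per Whisper segment, stamped with its time in the full recording."""
    lines = []
    for seg in segments:
        stamp = format_timestamp(offset_seconds + seg["start"])
        lines.append(f"[{stamp}] {seg['text'].strip()}")
    return lines


def transcribe_recording(audio_path, out_path, model_name="base", chunk_dir="chunks"):
    """Cut the recording into chunks, transcribe each, write one text file."""
    # Third-party imports live here so the helpers above work without them.
    import whisper                  # pip install openai-whisper
    from pydub import AudioSegment  # pip install pydub (needs ffmpeg)

    model = whisper.load_model(model_name)
    audio = AudioSegment.from_file(audio_path)
    os.makedirs(chunk_dir, exist_ok=True)

    with open(out_path, "w", encoding="utf-8") as out:
        # len(audio) is the duration in milliseconds; pydub also slices in ms.
        for i, (start_ms, end_ms) in enumerate(chunk_ranges(len(audio))):
            chunk_path = os.path.join(chunk_dir, f"chunk_{i:03d}.mp3")
            audio[start_ms:end_ms].export(chunk_path, format="mp3")

            result = model.transcribe(chunk_path)
            # Segment times restart at 0 per chunk, so add the chunk's offset.
            lines = format_segments(result["segments"], start_ms / 1000)
            out.write("\n".join(lines) + "\n")


# Example usage (hypothetical file names):
# transcribe_recording("soup_story.wav", "transcript.txt")
```

The important detail is the offset: Whisper reports timestamps relative to the start of each chunk, so the chunk’s own start time has to be added to get the position in the full recording.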

Below you can find the full script that I wrote. My apologies to all of you professional Python purists who hate how I wrote this code: I am merely an autodidact hillbilly. I am completely unaware of any Python coding conventions; I just hack some code together to make it work. I hope my comments help to make a bit more sense of the code. Enjoy!
