I was casually playing around with WhisperX and wanted to quickly document how I summarized a meeting. This is a toy example just to demonstrate to myself how this works. There was no actual meeting: I pulled a publicly available, non-serious meeting recording from the internet. Basically you need an audio file (mine happened to be an .mp3); the exact format doesn't matter.
1. Transcribe the meeting audio
In a dedicated Python environment, I pip installed whisperx:
pip install whisperx
I had a urllib3-related problem when trying to run it, so I had to do this:
pip install "urllib3<2"
I already had the audio file saved as audio.mp3 in a working directory, so I ran the following to transcribe it:
whisperx audio.mp3 --compute_type int8
I had to supply the --compute_type int8 flag on my Mac. Running without it didn't work, and the documentation mentioned passing this flag in, so that's what I did.
I didn’t try diarization this time but this is something I want to experiment with in the near future.
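From what I can tell in the docs, diarization should mostly be a matter of adding a flag and a Hugging Face token for the speaker models, something like the command below. YOUR_HF_TOKEN is a placeholder, and I haven't actually run this myself:
whisperx audio.mp3 --compute_type int8 --diarize --hf_token YOUR_HF_TOKEN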
My audio file was about half an hour long, and it took a few minutes to process on my MacBook Pro (M3 Pro). A .json, .srt, .tsv, .txt, and .vtt file got generated and saved in that same directory. Among them, the .srt file seemed the most readable to me (a human). I don't know which one the machine prefers, but I decided to go with the .srt file for the next step.
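In case it's useful: an .srt file is just numbered blocks of timestamps and text. Mine looked roughly like this (the lines below are made up for illustration, not from the actual transcript):
1
00:00:01,000 --> 00:00:04,500
Okay, let's get started with the agenda.

2
00:00:04,800 --> 00:00:08,200
First up is the quarterly review.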
2. Summarize the transcript
I already had Ollama and a few models set up on my machine. Of the few models I tried, I liked gemma3:1b's response in this particular case. I wasn't trying to do anything serious, so its output was good enough.
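(If the model isn't already on your machine, I believe pulling it first is all that's needed: ollama pull gemma3:1b)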
Here’s the exact command I gave it:
ollama run gemma3:1b < audio.srt
It just straight-up spat out a summary; I didn't even have to give it a prompt.
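If I wanted more control over the summary, my understanding is that I could pass an explicit prompt and splice the transcript in with shell substitution, something along these lines (untested by me, and the prompt wording is just an example):
ollama run gemma3:1b "Summarize this meeting transcript in a few bullet points: $(cat audio.srt)"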
Conclusion
That was a fun exercise for me, and it took far less time than I thought it would. I'm satisfied with the output.
There are probably better ways to achieve what I did here. I was just playing around. I’m very new to all this.