Summarizing a meeting using WhisperX

I was casually playing around with WhisperX and wanted to quickly document how I summarized a meeting. This is a toy example just to demonstrate to myself how this works. There was no actual meeting; I pulled a publicly available, non-serious meeting recording from the internet. Basically you need an audio file; I used an .mp3, but the exact format does not matter.

1. Transcribe the meeting audio

In a dedicated Python environment, I pip installed WhisperX:

pip install whisperx
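
By “dedicated environment” I just mean a fresh virtual environment; if you’re setting one up from scratch, it’s something like this before the install above (the directory name .venv is just an example):

python3 -m venv .venv
source .venv/bin/activate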

I had a urllib3-related problem when trying to run it, so I had to pin an older version:

pip install "urllib3<2"

I already had the audio file saved as audio.mp3 in a directory, so I ran the following there to transcribe it:

whisperx audio.mp3 --compute_type int8

I had to supply --compute_type int8 on my Mac. Without it, the run failed; as far as I understand, the default compute type (float16) isn’t supported on CPU. The documentation mentions passing this flag in that case, so that’s what I did.

I didn’t try diarization this time, but it’s something I want to experiment with in the near future; the sketch below is where I’d start.
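
Based on the WhisperX documentation, diarization is enabled with the --diarize flag and needs a Hugging Face token for the pyannote models. I haven’t run this myself, so treat it as a sketch:

whisperx audio.mp3 --compute_type int8 --diarize --hf_token YOUR_HF_TOKEN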

My audio file was about half an hour long, and it took a few minutes on my MacBook Pro (M3 Pro) to process. A .json, .srt, .tsv, .txt, and .vtt file got generated and saved in that same directory. Among them, the .srt file seemed the most readable to me (a human). I don’t know which one the machine prefers, but I decided to go with the .srt file for the next step.
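
For context, .srt is a plain subtitle format: numbered cues, each with a timestamp range and the text spoken in it. The entries look roughly like this (made-up content, not from my actual transcript):

1
00:00:01,000 --> 00:00:04,200
Good morning everyone, let's get started.

2
00:00:04,500 --> 00:00:07,900
First up is the budget review.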

2. Summarize the transcript

I already had Ollama and a few models set up on my machine. Of the models I tried, I liked gemma3:1b’s response best in this particular case. I wasn’t trying to do anything serious, so its output was good enough.

Here’s the exact command I gave it:

ollama run gemma3:1b < audio.srt

It just straight-up spat out a summary; it didn’t even ask me for a prompt.
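
If you want to steer the output, ollama run also accepts a prompt as an argument; one way to combine that with the transcript is shell command substitution (I didn’t test this exact form, and a long transcript may not fit in the model’s context window):

ollama run gemma3:1b "Summarize this meeting transcript in a few bullet points: $(cat audio.srt)"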

Conclusion

That was a fun exercise for me, and it took far less time than I thought it would. I’m satisfied with the output.

There are probably better ways to achieve what I did here. I was just playing around. I’m very new to all this.