Here’s a short shell script, essentially one chained command, that turns any video yt-dlp can download into a concise text summary using locally run AI tools:
#!/bin/sh
# Pass the video URL as the first argument.
# Download the audio track, transcribe it, summarize the transcript, then clean up.
yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" "$1" && \
whisper "audio.mp3" --model medium --output_format txt --output_dir . && \
ollama run mistral "Summarize the following text, removing any fluff and focusing on key points: $(cat audio.txt)" > summary.txt && \
rm audio.mp3 audio.txt && cat summary.txt
How It Works
The pipeline combines three powerful tools:
- yt-dlp: A robust video downloader that handles YouTube, Vimeo, and many other platforms. It extracts just the audio track to minimize processing time.
- Whisper: OpenAI’s open-source speech recognition system. We use the ‘medium’ model which balances accuracy and speed. Whisper can handle multiple languages and is remarkably accurate at transcription.
- Mistral + Ollama: Mistral is a powerful open-source language model, run locally through Ollama. It handles the summarization step, distilling the transcribed text into key points.
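The summarization step boils down to placing the transcript after a fixed instruction and handing the result to the model. A minimal sketch of that prompt assembly follows; the helper name `build_prompt` is hypothetical and exists only for illustration, it is not part of Ollama or Whisper.

```shell
#!/bin/sh
# Build the summarization prompt from a transcript file.
# `build_prompt` is a hypothetical helper name used only in this sketch.
build_prompt() {
    transcript_file=$1
    # Instruction first, then a blank line, then the transcript text.
    printf '%s\n\n' "Summarize the following text, removing any fluff and focusing on key points:"
    cat "$transcript_file"
}
```

With this in place, the last stage of the pipeline could be written as `ollama run mistral "$(build_prompt audio.txt)" > summary.txt`, which keeps the instruction text in one place if you want to tweak it.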
Requirements
# Install dependencies
pip install yt-dlp
pip install -U openai-whisper
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull mistral
System requirements:
- Python 3.x
- FFmpeg (for audio processing)
- ~10GB free disk space for models
- GPU recommended but not required
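Before running the pipeline, it can help to confirm that every required command is actually on your PATH. The following is a rough sketch of such a preflight check; the function name `missing_tools` is made up for this example, and the list of commands is taken from the requirements above.

```shell
#!/bin/sh
# Preflight check: print the name of each required command that is not on PATH.
# `missing_tools` is a hypothetical helper name chosen for this sketch.
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to test whether a command resolves.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report anything missing on stderr; a real script would probably stop here.
missing=$(missing_tools yt-dlp whisper ollama ffmpeg)
if [ -n "$missing" ]; then
    printf 'Missing required tools:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, all four commands were found. The check only confirms the commands exist; it does not verify that the Whisper or Mistral models have been downloaded.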
Usage
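Save the script as vid2txt.sh (it is a shell script, not Python) and make it executable:
chmod +x vid2txt.sh
./vid2txt.sh [video_url]
The argument is passed straight to yt-dlp, so it should be a URL that yt-dlp supports.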
The script writes a concise summary to summary.txt and prints it, deleting the intermediate audio and transcript files once every step has succeeded.
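Because the steps are chained with `&&`, the final `rm` only runs when everything before it succeeded, so a failed transcription or summary leaves the intermediate files behind. One possible way around that, sketched below, is to run the work inside a throwaway directory that is always removed afterwards. The helper name `with_workdir` is hypothetical, and the sketch assumes `mktemp -d` is available on your system.

```shell
#!/bin/sh
# Run a command inside a temporary directory and delete that directory
# afterwards, whether the command succeeded or failed.
# `with_workdir` is a hypothetical helper name used only in this sketch.
# POSIX sh has no local variables, so `workdir` and `status` are globals.
with_workdir() {
    workdir=$(mktemp -d) || return 1
    # Use a subshell so the caller's current directory is left unchanged.
    ( cd "$workdir" && "$@" )
    status=$?
    rm -rf "$workdir"
    # Pass the command's exit status back to the caller.
    return "$status"
}
```

Anything you want to keep, such as the final summary, has to be written to a path outside the temporary directory, since everything inside it is deleted when the command returns.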
You could even feed the output directly into a note-taking app like Joplin. For example, you could have your laptop capture and store those fancy online recipes without spending time on manual notes.
Limitations
- Processing time depends on video length and your hardware
- Whisper works best with clear audio
- Large videos may require significant memory