Here’s a one-liner that turns any video yt-dlp can download into a concise text summary, using yt-dlp, Whisper, and a locally run Mistral model:

#!/bin/sh
# Download the audio, transcribe it with Whisper, summarize it with Mistral, then clean up.
yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" "$1" && \
whisper "audio.mp3" --model medium --output_format txt --output_dir . && \
ollama run mistral "Summarize the following text, removing any fluff and focusing on key points: $(cat audio.txt)" > summary.txt && \
rm audio.mp3 audio.txt && cat summary.txt

How It Works

The pipeline combines three powerful tools:

  1. yt-dlp: A robust video downloader that handles YouTube, Vimeo, and many other platforms. It extracts just the audio track to minimize processing time.

  2. Whisper: OpenAI’s open-source speech recognition system. We use the ‘medium’ model, which balances accuracy and speed. Whisper supports many languages and writes the transcript to audio.txt.

  3. Mistral + Ollama: Mistral is a powerful open-source language model, run locally through Ollama. It handles the summarization step, distilling the transcribed text into key points.
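
The three steps above can be sketched as separate shell functions, which makes it easier to see what each tool contributes and to rerun a single stage. This is a minimal sketch, not the original script: the function names (fetch_audio, transcribe, summarize, vid2txt) and the temporary work directory are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: the same pipeline, one function per stage.
# All function names and the work-directory layout are assumptions for illustration.

PROMPT="Summarize the following text, removing any fluff and focusing on key points:"

# Step 1: download the video and keep only the audio track as WORKDIR/audio.mp3.
fetch_audio() {    # usage: fetch_audio URL WORKDIR
    yt-dlp -x --audio-format mp3 -o "$2/audio.%(ext)s" "$1"
}

# Step 2: transcribe the audio; Whisper writes WORKDIR/audio.txt.
transcribe() {     # usage: transcribe WORKDIR
    whisper "$1/audio.mp3" --model medium --output_format txt --output_dir "$1"
}

# Step 3: pass the prompt plus the transcript to the local Mistral model.
summarize() {      # usage: summarize WORKDIR
    ollama run mistral "$PROMPT $(cat "$1/audio.txt")"
}

# Run all three stages in a temporary directory and remove it afterwards,
# so nothing in the current directory is overwritten.
vid2txt() {        # usage: vid2txt URL
    work=$(mktemp -d) || return 1
    fetch_audio "$1" "$work" && transcribe "$work" && summarize "$work"
    status=$?
    rm -rf "$work"
    return "$status"
}
```

With these functions defined, vid2txt "$1" > summary.txt would play the role of the one-liner. Because the summary goes to standard output, the caller decides where it is stored.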

Requirements

# Install dependencies
pip install yt-dlp
pip install -U openai-whisper
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull mistral

System requirements:

  • Python 3.x
  • FFmpeg (for audio processing)
  • ~10GB free disk space for models
  • GPU recommended but not required
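
Before the first run, it can help to confirm that every tool is actually on the PATH. The helper below is a small sketch; the name check_deps is an assumption and not part of any of the tools.

```shell
# check_deps: print each named command that is missing from PATH.
# Returns 0 if all commands were found, 1 otherwise.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing dependency: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call for this pipeline:
#   check_deps yt-dlp whisper ollama ffmpeg || exit 1
```

Reporting every missing command at once, rather than stopping at the first, avoids repeated install-and-retry rounds.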

Usage

Save as vid2txt.sh and make executable:

chmod +x vid2txt.sh
./vid2txt.sh <video_url>

The script writes the summary to summary.txt, prints it to the terminal, and deletes the intermediate files (audio.mp3 and audio.txt).

You could even feed the output directly into a note-taking app like Joplin, for example to capture those fancy online recipes without taking notes by hand.

Limitations

  • Processing time depends on video length and your hardware
  • Whisper works best with clear audio
  • Large videos may require significant memory
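
One possible way to cope with very long videos is to split the transcript into smaller pieces and summarize each piece separately, since a local model can only read a limited amount of text at once. The sketch below is an assumption layered on top of the script, not part of it: the function name chunk_transcript, the chunk_ file prefix, and the 8000-character budget in the example are all arbitrary choices.

```shell
# chunk_transcript: split FILE into PREFIX001.txt, PREFIX002.txt, ...
# Each piece holds whole lines and at most about MAX_CHARS characters
# (a single line longer than MAX_CHARS still gets a piece of its own).
chunk_transcript() {   # usage: chunk_transcript FILE MAX_CHARS PREFIX
    awk -v max="$2" -v prefix="$3" '
        BEGIN { n = 1; size = 0 }
        {
            len = length($0) + 1              # +1 for the newline
            if (size > 0 && size + len > max) {
                close(out)                    # finish the current piece
                n++
                size = 0
            }
            out = sprintf("%s%03d.txt", prefix, n)
            print > out
            size += len
        }
    ' "$1"
}

# Example (hypothetical budget of 8000 characters per piece):
#   chunk_transcript audio.txt 8000 chunk_
#   for part in chunk_*.txt; do
#       ollama run mistral "Summarize the following text: $(cat "$part")"
#   done > partial_summaries.txt
```

The partial summaries could then be summarized once more to produce a single result, at the cost of extra model calls.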