# Complete Guide: Video to Text Transcription with FFmpeg and Speech Recognition
## Introduction
Converting video content to text is a common need for content creators, researchers, journalists, and accessibility professionals. This comprehensive guide covers multiple approaches to extract audio from videos and transcribe it to text, with detailed explanations of each step and alternative methods to suit different needs and technical requirements.
## Prerequisites and System Requirements
### Hardware Requirements
- **Minimum**: 4GB RAM, 2GB free disk space
- **Recommended**: 8GB+ RAM, SSD storage for faster processing
- **GPU acceleration**: Optional but significantly speeds up processing for large files
### Software Dependencies
- FFmpeg (for audio extraction)
- Python 3.7+ (for speech recognition tools)
- Internet connection (for downloading models and cloud-based services)
### Supported Operating Systems
- Linux (Ubuntu/Debian, Fedora, Arch, etc.)
- macOS
- Windows (with some command variations)
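Before diving in, it's worth confirming the core dependencies are installed and on your `PATH`. A quick sanity check (Linux/macOS commands shown):
```bash
# Confirm FFmpeg is installed and report its version
ffmpeg -version | head -n 1

# Confirm Python 3.7+ and pip are available
python3 --version
python3 -m pip --version
```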
## Part 1: Audio Extraction with FFmpeg
### Understanding Audio Extraction Parameters
The FFmpeg command for audio extraction contains several important parameters that affect quality and compatibility:
```bash
ffmpeg -i input_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 output_audio.wav
```
**Parameter Breakdown:**
- `-i input_video.mp4`: Input file specification
- `-vn`: "no video" - excludes all video streams from the output
- `-acodec pcm_s16le`: Audio codec - 16-bit PCM little-endian (uncompressed)
- `-ar 16000`: Audio sample rate - 16kHz (sufficient for speech recognition)
- `-ac 1`: Audio channels - mono (single channel reduces file size)
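Note that `-acodec` is the older spelling of `-c:a`; FFmpeg accepts both, so the same extraction can equivalently be written as:
```bash
# Identical extraction using the modern -c:a option syntax
ffmpeg -i input_video.mp4 -vn -c:a pcm_s16le -ar 16000 -ac 1 output_audio.wav
```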
### Audio Extraction Variations
#### High-Quality Extraction (for better accuracy)
```bash
# 22kHz sample rate for better quality
ffmpeg -i input_video.mp4 -vn -acodec pcm_s16le -ar 22050 -ac 1 high_quality_audio.wav
# Preserve original quality
ffmpeg -i input_video.mp4 -vn -acodec pcm_s24le -ar 48000 -ac 2 original_quality.wav
```
#### Noise Reduction During Extraction
```bash
# Band-pass filter to cut low-frequency rumble and high-frequency hiss
ffmpeg -i input_video.mp4 -vn -af "highpass=f=200,lowpass=f=3000" -acodec pcm_s16le -ar 16000 -ac 1 clean_audio.wav
# Normalize audio levels
ffmpeg -i input_video.mp4 -vn -af "loudnorm=I=-16:TP=-1.5:LRA=11" -acodec pcm_s16le -ar 16000 -ac 1 normalized_audio.wav
```
#### Extract Specific Audio Segments
```bash
# Extract audio from specific time range
ffmpeg -ss 00:02:30 -t 00:05:00 -i input_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 segment_audio.wav
# Extract multiple segments
ffmpeg -i input_video.mp4 \
-ss 00:00:00 -t 00:01:00 -vn -acodec pcm_s16le -ar 16000 -ac 1 part1.wav \
-ss 00:01:30 -t 00:01:00 -vn -acodec pcm_s16le -ar 16000 -ac 1 part2.wav
```
#### Batch Audio Extraction
```bash
# Linux/macOS batch processing
for file in *.mp4; do
    ffmpeg -i "$file" -vn -acodec pcm_s16le -ar 16000 -ac 1 "${file%.mp4}_audio.wav"
done

# Windows PowerShell
Get-ChildItem *.mp4 | ForEach-Object {
    ffmpeg -i $_.Name -vn -acodec pcm_s16le -ar 16000 -ac 1 "$($_.BaseName)_audio.wav"
}
```
### Verifying Audio Quality
```bash
# Check audio file properties
ffprobe -v quiet -print_format json -show_format -show_streams output_audio.wav
# Listen to a sample (if audio output available)
ffplay -autoexit -t 10 output_audio.wav
```
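If `jq` is available, you can pull just the fields that matter for speech recognition out of the ffprobe JSON (this assumes the audio is the first stream, which holds for the WAV files produced above):
```bash
# Show codec, sample rate, and channel count of the extracted audio
ffprobe -v quiet -print_format json -show_streams output_audio.wav \
  | jq '.streams[0] | {codec_name, sample_rate, channels}'
```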
## Part 2: Speech Recognition Solutions
### Option 1: DeepSpeech (Mozilla) - Detailed Setup
#### Installation and Environment Setup
```bash
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install system dependencies
sudo apt install python3 python3-pip python3-venv git curl wget
# Create and activate virtual environment
python3 -m venv deepspeech-env
source deepspeech-env/bin/activate
# Upgrade pip and install DeepSpeech
pip install --upgrade pip
pip install deepspeech==0.9.3
```
#### Model Management
```bash
# Create models directory
mkdir -p ~/deepspeech-models
cd ~/deepspeech-models
# Download English models (v0.9.3)
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
# Verify downloads
ls -la *.pbmm *.scorer
```
#### Transcription with DeepSpeech
```bash
# Basic transcription
deepspeech --model ~/deepspeech-models/deepspeech-0.9.3-models.pbmm \
--scorer ~/deepspeech-models/deepspeech-0.9.3-models.scorer \
--audio output_audio.wav
# Save transcription to file
deepspeech --model ~/deepspeech-models/deepspeech-0.9.3-models.pbmm \
--scorer ~/deepspeech-models/deepspeech-0.9.3-models.scorer \
--audio output_audio.wav > transcription.txt
# Process with timestamps (requires JSON output)
deepspeech --model ~/deepspeech-models/deepspeech-0.9.3-models.pbmm \
--scorer ~/deepspeech-models/deepspeech-0.9.3-models.scorer \
--audio output_audio.wav \
--json > transcription_with_timestamps.json
```
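The JSON output can then be post-processed. As a sketch, assuming the DeepSpeech 0.9.x metadata layout (a top-level `transcripts` array whose entries hold `words` objects with `word` and `start_time` fields), `jq` can turn it into a timestamped word list:
```bash
# Print each word with its start time in seconds
# (assumes the 0.9.x JSON schema described above)
jq -r '.transcripts[0].words[] | "\(.start_time)\t\(.word)"' \
  transcription_with_timestamps.json
```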
### Option 2: Whisper (OpenAI) - Modern Alternative
Whisper often provides better accuracy than DeepSpeech, especially for diverse accents and languages.
```bash
# Install Whisper
pip install openai-whisper
# Basic transcription (auto-detects language)
whisper output_audio.wav
# Specify language for better performance
whisper output_audio.wav --language English
# Different model sizes (larger = more accurate, slower)
whisper output_audio.wav --model tiny # fastest, least accurate
whisper output_audio.wav --model base # good balance
whisper output_audio.wav --model small # better accuracy
whisper output_audio.wav --model medium # high accuracy
whisper output_audio.wav --model large # best accuracy, slowest
# Output formats
whisper output_audio.wav --output_format txt
whisper output_audio.wav --output_format srt # subtitles
whisper output_audio.wav --output_format vtt # web subtitles
whisper output_audio.wav --output_format json # detailed output
```
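Since Whisper uses FFmpeg internally for decoding, it can also be pointed directly at the video file, skipping the separate extraction step entirely:
```bash
# Transcribe straight from the video; Whisper decodes the audio itself
whisper input_video.mp4 --model base --language English --output_format txt
```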
### Option 3: SpeechRecognition Library with Multiple Engines
```python
#!/usr/bin/env python3
# transcribe.py - transcribe an audio file with the SpeechRecognition library
import speech_recognition as sr
import sys

def transcribe_audio(audio_file, engine='google'):
    """
    Transcribe audio file using various speech recognition engines
    """
    r = sr.Recognizer()

    # Load audio file
    with sr.AudioFile(audio_file) as source:
        audio = r.record(source)

    try:
        if engine == 'google':
            # Google Speech Recognition (requires internet)
            text = r.recognize_google(audio)
        elif engine == 'sphinx':
            # CMU Sphinx (offline)
            text = r.recognize_sphinx(audio)
        elif engine == 'wit':
            # Wit.ai (requires API key)
            text = r.recognize_wit(audio, key="YOUR_WIT_AI_KEY")
        else:
            raise ValueError(f"Unknown engine: {engine}")
        return text
    except sr.UnknownValueError:
        return "Could not understand audio"
    except sr.RequestError as e:
        return f"Error with speech recognition service: {e}"

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python transcribe.py audio_file.wav")
        sys.exit(1)
    audio_file = sys.argv[1]
    result = transcribe_audio(audio_file, engine='google')
    print(result)
```
```bash
# Install dependencies
pip install SpeechRecognition pydub
# Run transcription
python transcribe.py output_audio.wav
```
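Be aware that the free Google engine is designed for short utterances and tends to fail or time out on long recordings. One workaround is to split the WAV into roughly one-minute chunks with FFmpeg's segment muxer and transcribe each chunk in turn:
```bash
# Split into ~55-second chunks without re-encoding
ffmpeg -i output_audio.wav -f segment -segment_time 55 -c copy chunk_%03d.wav

# Transcribe each chunk in order
for chunk in chunk_*.wav; do
    python transcribe.py "$chunk"
done
```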
## Part 3: Advanced Workflows and Automation
### Complete Video-to-Text Pipeline Script
```bash
#!/bin/bash
# video_to_text.sh - Complete pipeline script

set -e  # Exit on any error

# Configuration
INPUT_VIDEO="$1"
OUTPUT_DIR="./transcription_output"
TEMP_DIR="./temp"
MODELS_DIR="$HOME/deepspeech-models"

# Validate input
if [ -z "$INPUT_VIDEO" ]; then
    echo "Usage: $0 <input_video.mp4>"
    exit 1
fi
if [ ! -f "$INPUT_VIDEO" ]; then
    echo "Error: Input video file not found: $INPUT_VIDEO"
    exit 1
fi

# Create directories
mkdir -p "$OUTPUT_DIR" "$TEMP_DIR"

# Extract base filename
BASENAME=$(basename "$INPUT_VIDEO" | sed 's/\.[^.]*$//')

echo "Processing: $INPUT_VIDEO"
echo "Output will be saved to: $OUTPUT_DIR"

# Step 1: Extract audio
echo "Step 1: Extracting audio..."
AUDIO_FILE="$TEMP_DIR/${BASENAME}_audio.wav"
ffmpeg -i "$INPUT_VIDEO" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$AUDIO_FILE" -y

# Step 2: Transcribe with multiple methods
echo "Step 2: Transcribing audio..."

# Method 1: DeepSpeech (if available)
if command -v deepspeech &> /dev/null; then
    echo "  Using DeepSpeech..."
    deepspeech --model "$MODELS_DIR/deepspeech-0.9.3-models.pbmm" \
        --scorer "$MODELS_DIR/deepspeech-0.9.3-models.scorer" \
        --audio "$AUDIO_FILE" > "$OUTPUT_DIR/${BASENAME}_deepspeech.txt"
fi

# Method 2: Whisper (if available)
if command -v whisper &> /dev/null; then
    echo "  Using Whisper..."
    whisper "$AUDIO_FILE" --output_dir "$OUTPUT_DIR" --output_format txt
    mv "$OUTPUT_DIR/${BASENAME}_audio.txt" "$OUTPUT_DIR/${BASENAME}_whisper.txt" 2>/dev/null || true
fi

# Step 3: Generate metadata
echo "Step 3: Generating metadata..."
{
    echo "Video Transcription Report"
    echo "========================="
    echo "Source: $INPUT_VIDEO"
    echo "Date: $(date)"
    echo "Audio duration: $(ffprobe -i "$AUDIO_FILE" -show_entries format=duration -v quiet -of csv="p=0" | cut -d. -f1) seconds"
    echo ""
} > "$OUTPUT_DIR/${BASENAME}_metadata.txt"

# Cleanup
rm -rf "$TEMP_DIR"

echo "Transcription complete! Check $OUTPUT_DIR for results."
```
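Save the script, make it executable, and run it against a video:
```bash
chmod +x video_to_text.sh
./video_to_text.sh input_video.mp4   # results land in ./transcription_output/
```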
### Batch Processing Multiple Videos
```python
#!/usr/bin/env python3
# batch_transcribe.py
import subprocess
import sys
from pathlib import Path
import concurrent.futures
import logging

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def process_video(video_path, output_dir):
    """Process a single video file"""
    try:
        video_name = Path(video_path).stem
        audio_path = output_dir / f"{video_name}_audio.wav"
        transcript_path = output_dir / f"{video_name}_transcript.txt"

        # Extract audio
        logger.info(f"Extracting audio from {video_path}")
        subprocess.run([
            'ffmpeg', '-i', str(video_path), '-vn', '-acodec', 'pcm_s16le',
            '-ar', '16000', '-ac', '1', str(audio_path), '-y'
        ], check=True, capture_output=True)

        # Transcribe with Whisper
        logger.info(f"Transcribing {audio_path}")
        subprocess.run([
            'whisper', str(audio_path), '--output_format', 'txt',
            '--output_dir', str(output_dir)
        ], check=True, capture_output=True, text=True)

        # Rename output file (Whisper names it after the audio file's stem)
        whisper_output = output_dir / f"{video_name}_audio.txt"
        if whisper_output.exists():
            whisper_output.rename(transcript_path)

        # Clean up audio file
        audio_path.unlink()

        logger.info(f"Completed processing {video_path}")
        return True
    except subprocess.CalledProcessError as e:
        logger.error(f"Error processing {video_path}: {e}")
        return False
    except Exception as e:
        logger.error(f"Unexpected error processing {video_path}: {e}")
        return False

def main():
    if len(sys.argv) != 3:
        print("Usage: python batch_transcribe.py <input_directory> <output_directory>")
        sys.exit(1)

    input_dir = Path(sys.argv[1])
    output_dir = Path(sys.argv[2])

    if not input_dir.exists():
        print(f"Input directory does not exist: {input_dir}")
        sys.exit(1)
    output_dir.mkdir(exist_ok=True)

    # Find all video files
    video_extensions = {'.mp4', '.avi', '.mov', '.mkv', '.flv', '.wmv'}
    video_files = [f for f in input_dir.rglob('*') if f.suffix.lower() in video_extensions]
    if not video_files:
        print(f"No video files found in {input_dir}")
        sys.exit(1)

    logger.info(f"Found {len(video_files)} video files to process")

    # Process videos in parallel
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(process_video, video_file, output_dir) for video_file in video_files]
        completed = 0
        for future in concurrent.futures.as_completed(futures):
            if future.result():
                completed += 1
            logger.info(f"Progress: {completed}/{len(video_files)} completed")

    logger.info(f"Batch processing complete. {completed}/{len(video_files)} files processed successfully.")

if __name__ == "__main__":
    main()
```
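Usage mirrors the single-file pipeline; point the script at a directory tree of videos and an output directory:
```bash
# Recursively find videos under ./videos and write transcripts to ./transcripts
python3 batch_transcribe.py ./videos ./transcripts
```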
## Part 4: Quality Improvement and Post-Processing
### Improving Transcription Accuracy
#### Pre-processing Audio for Better Results
```bash
# Enhance audio quality before transcription
ffmpeg -i input_video.mp4 \
-af "highpass=f=100,lowpass=f=8000,compand=0.3|0.3:1|1:-90/-60|-60/-40|-40/-30|-20/-20:6:0:-90:0.2" \
-acodec pcm_s16le -ar 16000 -ac 1 enhanced_audio.wav
```
#### Post-processing Transcriptions
```python
#!/usr/bin/env python3
# post_process_transcript.py
import re
import sys
from pathlib import Path

def clean_transcript(text):
    """Clean and format transcript text"""
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text.strip())

    # Fix common transcription errors
    replacements = {
        r'\bum\b': '',
        r'\buh\b': '',
        r'\blike\b(?=\s+like)': '',  # Remove repeated "like"
        r'\byou know\b': '',
        r'\.{2,}': '.',        # Multiple periods to single
        r'\s+\.': '.',         # Space before period
        r'\.(?=[A-Z])': '. ',  # Add space after period before capital
    }
    for pattern, replacement in replacements.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)

    # Capitalize sentences
    sentences = text.split('.')
    sentences = [s.strip().capitalize() for s in sentences if s.strip()]
    return '. '.join(sentences) + '.'

def add_paragraphs(text, sentences_per_paragraph=4):
    """Add paragraph breaks for better readability"""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    paragraphs = []
    for i in range(0, len(sentences), sentences_per_paragraph):
        paragraph = '. '.join(sentences[i:i + sentences_per_paragraph]) + '.'
        paragraphs.append(paragraph)
    return '\n\n'.join(paragraphs)

def main():
    if len(sys.argv) != 2:
        print("Usage: python post_process_transcript.py transcript.txt")
        sys.exit(1)

    input_file = Path(sys.argv[1])
    output_file = input_file.with_suffix('.cleaned.txt')

    # Read original transcript
    with open(input_file, 'r', encoding='utf-8') as f:
        original_text = f.read()

    # Clean and format
    cleaned_text = clean_transcript(original_text)
    formatted_text = add_paragraphs(cleaned_text)

    # Save cleaned version
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(formatted_text)

    print(f"Cleaned transcript saved to: {output_file}")
    print(f"Original length: {len(original_text)} characters")
    print(f"Cleaned length: {len(formatted_text)} characters")

if __name__ == "__main__":
    main()
```
## Part 5: Troubleshooting and Optimization
### Common Issues and Solutions
#### FFmpeg Issues
```bash
# Issue: "codec not supported"
# Solution: Check available codecs
ffmpeg -codecs | grep -i pcm
# Issue: Permission denied
# Solution: Check file permissions
chmod 644 input_video.mp4   # media files need read access, not execute
sudo chown $USER:$USER input_video.mp4
# Issue: Out of disk space
# Solution: Monitor disk usage and use temporary directories
df -h
```
#### DeepSpeech Issues
```bash
# Issue: Model download fails
# Solution: Manual download with verification
wget -c https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
sha256sum deepspeech-0.9.3-models.pbmm
# Issue: Poor transcription quality
# Solution: Try different audio preprocessing
ffmpeg -i input.mp4 -af "loudnorm=I=-16:TP=-1.5:LRA=11,highpass=f=80,lowpass=f=8000" -ar 16000 -ac 1 processed.wav
```
### Performance Optimization
#### System Resource Management
```bash
# Monitor system resources during processing
htop
iotop
nvidia-smi # For GPU usage
# Limit CPU usage for long-running jobs
nice -n 10 your_transcription_command
# Use GNU parallel for batch processing
find . -name "*.mp4" | parallel -j 4 'ffmpeg -i {} -vn -acodec pcm_s16le -ar 16000 -ac 1 {.}_audio.wav'
```
#### Storage Optimization
```bash
# Use compressed intermediate formats when possible
ffmpeg -i input.mp4 -vn -c:a libopus -b:a 32k temp_audio.opus
# Clean up temporary files automatically
trap 'rm -f temp_*.wav temp_*.opus' EXIT
```
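Keep in mind that the recognizers covered here other than Whisper (DeepSpeech, the SpeechRecognition library) expect 16kHz mono WAV input, so a compressed intermediate has to be decoded back before transcription:
```bash
# Decode the Opus intermediate back to recognizer-friendly WAV
ffmpeg -i temp_audio.opus -acodec pcm_s16le -ar 16000 -ac 1 audio_for_asr.wav
```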
## Part 6: Integration with Other Tools
### Creating Subtitles
```bash
# Generate SRT subtitles with Whisper
whisper input_video.mp4 --output_format srt
# Convert to other subtitle formats
ffmpeg -i subtitles.srt subtitles.vtt # WebVTT format
```
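If you want the subtitles burned into the video frames themselves (for players without subtitle support), FFmpeg's `subtitles` filter handles SRT directly, provided your FFmpeg build includes libass:
```bash
# Hard-burn subtitles into the picture (re-encodes the video stream)
ffmpeg -i input_video.mp4 -vf "subtitles=subtitles.srt" -c:a copy subtitled_video.mp4
```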
### Automated Video Processing Pipeline
```yaml
# docker-compose.yml for containerized transcription service
version: '3.8'

services:
  transcription:
    build: .
    volumes:
      - ./input:/app/input
      - ./output:/app/output
    environment:
      - WHISPER_MODEL=base
      - OUTPUT_FORMAT=txt
    command: python batch_transcribe.py /app/input /app/output
```
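The compose file assumes an image that bundles FFmpeg, Whisper, and the batch script; building that image is left as an exercise. With a suitable `Dockerfile` in place, a run is just:
```bash
# Build the image and process everything under ./input
docker compose up --build
```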
## Conclusion
This comprehensive guide provides multiple approaches to video-to-text transcription, from simple command-line tools to automated batch processing systems. The choice of method depends on your specific requirements:
- **DeepSpeech**: Good for offline processing, privacy-conscious applications
- **Whisper**: Superior accuracy, supports multiple languages, requires more resources
- **Cloud APIs**: Highest accuracy, but requires an internet connection and incurs per-use API costs
### Key Takeaways
1. **Audio Quality Matters**: Preprocessing audio can significantly improve transcription accuracy
2. **Choose the Right Tool**: Different speech recognition engines excel in different scenarios
3. **Automate When Possible**: Batch processing and scripting save time for large projects
4. **Post-process Results**: Cleaning and formatting improve final transcript quality
5. **Monitor Resources**: Large-scale transcription can be resource-intensive
### Next Steps
- Experiment with different speech recognition models and parameters
- Integrate transcription into larger content processing workflows
- Explore real-time transcription for live video streams
- Consider cloud-based solutions for production applications
The field of speech recognition is rapidly evolving, with new models and techniques regularly improving accuracy and reducing computational requirements. Stay updated with the latest developments to get the best results from your transcription workflows.