# Complete Guide: Video to Text Transcription with FFmpeg and Speech Recognition
## Introduction
Converting video content to text is a common need for content creators, researchers, journalists, and accessibility professionals. This comprehensive guide covers multiple approaches to extract audio from videos and transcribe it to text, with detailed explanations of each step and alternative methods to suit different needs and technical requirements.
## Prerequisites and System Requirements
### Hardware Requirements
- **Minimum**: 4GB RAM, 2GB free disk space
- **Recommended**: 8GB+ RAM, SSD storage for faster processing
- **GPU acceleration**: Optional but significantly speeds up processing for large files
### Software Dependencies
- FFmpeg (for audio extraction)
- Python 3.7+ (for speech recognition tools)
- Internet connection (for downloading models and cloud-based services)
### Supported Operating Systems
- Linux (Ubuntu/Debian, Fedora, Arch, etc.)
- macOS
- Windows (with some command variations)
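Before diving in, it's worth confirming the core dependencies are installed and on your `PATH`. A quick sanity check (Linux/macOS commands shown):
```bash
# Confirm FFmpeg is installed and report its version
ffmpeg -version | head -n 1

# Confirm Python 3.7+ and pip are available
python3 --version
python3 -m pip --version
```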
## Part 1: Audio Extraction with FFmpeg
### Understanding Audio Extraction Parameters
The FFmpeg command for audio extraction contains several important parameters that affect quality and compatibility:
```bash
ffmpeg -i input_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 output_audio.wav
```
**Parameter Breakdown:**
- `-i input_video.mp4`: Input file specification
- `-vn`: "no video" - excludes all video streams from the output
- `-acodec pcm_s16le`: Audio codec - 16-bit PCM little-endian (uncompressed)
- `-ar 16000`: Audio sample rate - 16kHz (sufficient for speech recognition)
- `-ac 1`: Audio channels - mono (single channel reduces file size)
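Note that `-acodec` is the older spelling of `-c:a`; FFmpeg accepts both, so the same extraction can equivalently be written as:
```bash
# Identical extraction using the modern -c:a option syntax
ffmpeg -i input_video.mp4 -vn -c:a pcm_s16le -ar 16000 -ac 1 output_audio.wav
```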
### Audio Extraction Variations
#### High-Quality Extraction (for better accuracy)
```bash
# 22kHz sample rate for better quality
ffmpeg -i input_video.mp4 -vn -acodec pcm_s16le -ar 22050 -ac 1 high_quality_audio.wav
# Preserve original quality
ffmpeg -i input_video.mp4 -vn -acodec pcm_s24le -ar 48000 -ac 2 original_quality.wav
```
#### Noise Reduction During Extraction
```bash
# Band-pass filter to cut low-frequency rumble and high-frequency hiss
ffmpeg -i input_video.mp4 -vn -af "highpass=f=200,lowpass=f=3000" -acodec pcm_s16le -ar 16000 -ac 1 clean_audio.wav
# Normalize audio levels
ffmpeg -i input_video.mp4 -vn -af "loudnorm=I=-16:TP=-1.5:LRA=11" -acodec pcm_s16le -ar 16000 -ac 1 normalized_audio.wav
```
#### Extract Specific Audio Segments
```bash
# Extract audio from specific time range
ffmpeg -ss 00:02:30 -t 00:05:00 -i input_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 segment_audio.wav
# Extract multiple segments
ffmpeg -i input_video.mp4 \
-ss 00:00:00 -t 00:01:00 -vn -acodec pcm_s16le -ar 16000 -ac 1 part1.wav \
-ss 00:01:30 -t 00:01:00 -vn -acodec pcm_s16le -ar 16000 -ac 1 part2.wav
```
#### Batch Audio Extraction
```bash
# Linux/macOS batch processing
for file in *.mp4; do
    ffmpeg -i "$file" -vn -acodec pcm_s16le -ar 16000 -ac 1 "${file%.mp4}_audio.wav"
done

# Windows PowerShell
Get-ChildItem *.mp4 | ForEach-Object {
    ffmpeg -i $_.Name -vn -acodec pcm_s16le -ar 16000 -ac 1 "$($_.BaseName)_audio.wav"
}
```
### Verifying Audio Quality
```bash
# Check audio file properties
ffprobe -v quiet -print_format json -show_format -show_streams output_audio.wav
# Listen to a sample (if audio output available)
ffplay -autoexit -t 10 output_audio.wav
```
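If `jq` is available, you can pull just the fields that matter for speech recognition out of the ffprobe JSON (this assumes the audio is the first stream, which holds for the WAV files produced above):
```bash
# Show codec, sample rate, and channel count of the extracted audio
ffprobe -v quiet -print_format json -show_streams output_audio.wav \
  | jq '.streams[0] | {codec_name, sample_rate, channels}'
```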
## Part 2: Speech Recognition Solutions
### Option 1: DeepSpeech (Mozilla) - Detailed Setup
#### Installation and Environment Setup
```bash
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install system dependencies
sudo apt install python3 python3-pip python3-venv git curl wget
# Create and activate virtual environment
python3 -m venv deepspeech-env
source deepspeech-env/bin/activate
# Upgrade pip and install DeepSpeech
pip install --upgrade pip
pip install deepspeech==0.9.3
```
#### Model Management
```bash
# Create models directory
mkdir -p ~/deepspeech-models
cd ~/deepspeech-models
# Download English models (v0.9.3)
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
# Verify downloads
ls -la *.pbmm *.scorer
```
#### Transcription with DeepSpeech
```bash
# Basic transcription
deepspeech --model ~/deepspeech-models/deepspeech-0.9.3-models.pbmm \
--scorer ~/deepspeech-models/deepspeech-0.9.3-models.scorer \
--audio output_audio.wav
# Save transcription to file
deepspeech --model ~/deepspeech-models/deepspeech-0.9.3-models.pbmm \
--scorer ~/deepspeech-models/deepspeech-0.9.3-models.scorer \
--audio output_audio.wav > transcription.txt
# Process with timestamps (requires JSON output)
deepspeech --model ~/deepspeech-models/deepspeech-0.9.3-models.pbmm \
--scorer ~/deepspeech-models/deepspeech-0.9.3-models.scorer \
--audio output_audio.wav \
--json > transcription_with_timestamps.json
```
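The JSON output can then be post-processed. As a sketch, assuming the DeepSpeech 0.9.x metadata layout (a top-level `transcripts` array whose entries hold `words` objects with `word` and `start_time` fields), `jq` can turn it into a timestamped word list:
```bash
# Print each word with its start time in seconds
# (assumes the 0.9.x JSON schema described above)
jq -r '.transcripts[0].words[] | "\(.start_time)\t\(.word)"' \
  transcription_with_timestamps.json
```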
### Option 2: Whisper (OpenAI) - Modern Alternative
Whisper often provides better accuracy than DeepSpeech, especially for diverse accents and languages.
```bash
# Install Whisper
pip install openai-whisper
# Basic transcription (auto-detects language)
whisper output_audio.wav
# Specify language for better performance
whisper output_audio.wav --language English
# Different model sizes (larger = more accurate, slower)
whisper output_audio.wav --model tiny # fastest, least accurate
whisper output_audio.wav --model base # good balance
whisper output_audio.wav --model small # better accuracy
whisper output_audio.wav --model medium # high accuracy
whisper output_audio.wav --model large # best accuracy, slowest
# Output formats
whisper output_audio.wav --output_format txt
whisper output_audio.wav --output_format srt # subtitles
whisper output_audio.wav --output_format vtt # web subtitles
whisper output_audio.wav --output_format json # detailed output
```
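Since Whisper uses FFmpeg internally for decoding, it can also be pointed directly at the video file, skipping the separate extraction step entirely:
```bash
# Transcribe straight from the video; Whisper decodes the audio itself
whisper input_video.mp4 --model base --language English --output_format txt
```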
### Option 3: SpeechRecognition Library with Multiple Engines
```python
#!/usr/bin/env python3
# transcribe.py - transcribe an audio file with the SpeechRecognition library
import speech_recognition as sr
import sys

def transcribe_audio(audio_file, engine='google'):
    """
    Transcribe audio file using various speech recognition engines
    """
    r = sr.Recognizer()

    # Load audio file
    with sr.AudioFile(audio_file) as source:
        audio = r.record(source)

    try:
        if engine == 'google':
            # Google Speech Recognition (requires internet)
            text = r.recognize_google(audio)
        elif engine == 'sphinx':
            # CMU Sphinx (offline)
            text = r.recognize_sphinx(audio)
        elif engine == 'wit':
            # Wit.ai (requires API key)
            text = r.recognize_wit(audio, key="YOUR_WIT_AI_KEY")
        else:
            raise ValueError(f"Unknown engine: {engine}")
        return text
    except sr.UnknownValueError:
        return "Could not understand audio"
    except sr.RequestError as e:
        return f"Error with speech recognition service: {e}"

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python transcribe.py audio_file.wav")
        sys.exit(1)
    audio_file = sys.argv[1]
    result = transcribe_audio(audio_file, engine='google')
    print(result)
```
```bash
# Install dependencies
pip install SpeechRecognition pydub
# Run transcription
python transcribe.py output_audio.wav
```
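Be aware that the free Google engine is designed for short utterances and tends to fail or time out on long recordings. One workaround is to split the WAV into roughly one-minute chunks with FFmpeg's segment muxer and transcribe each chunk in turn:
```bash
# Split into ~55-second chunks without re-encoding
ffmpeg -i output_audio.wav -f segment -segment_time 55 -c copy chunk_%03d.wav

# Transcribe each chunk in order
for chunk in chunk_*.wav; do
    python transcribe.py "$chunk"
done
```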
## Part 3: Advanced Workflows and Automation
### Complete Video-to-Text Pipeline Script
```bash
#!/bin/bash
# video_to_text.sh - Complete pipeline script

set -e  # Exit on any error

# Configuration
INPUT_VIDEO="$1"
OUTPUT_DIR="./transcription_output"
TEMP_DIR="./temp"
MODELS_DIR="$HOME/deepspeech-models"

# Validate input
if [ -z "$INPUT_VIDEO" ]; then
    echo "Usage: $0 <input_video.mp4>"
    exit 1
fi
if [ ! -f "$INPUT_VIDEO" ]; then
    echo "Error: Input video file not found: $INPUT_VIDEO"
    exit 1
fi

# Create directories
mkdir -p "$OUTPUT_DIR" "$TEMP_DIR"

# Extract base filename
BASENAME=$(basename "$INPUT_VIDEO" | sed 's/\.[^.]*$//')

echo "Processing: $INPUT_VIDEO"
echo "Output will be saved to: $OUTPUT_DIR"

# Step 1: Extract audio
echo "Step 1: Extracting audio..."
AUDIO_FILE="$TEMP_DIR/${BASENAME}_audio.wav"
ffmpeg -i "$INPUT_VIDEO" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$AUDIO_FILE" -y

# Step 2: Transcribe with multiple methods
echo "Step 2: Transcribing audio..."

# Method 1: DeepSpeech (if available)
if command -v deepspeech &> /dev/null; then
    echo "  Using DeepSpeech..."
    deepspeech --model "$MODELS_DIR/deepspeech-0.9.3-models.pbmm" \
        --scorer "$MODELS_DIR/deepspeech-0.9.3-models.scorer" \
        --audio "$AUDIO_FILE" > "$OUTPUT_DIR/${BASENAME}_deepspeech.txt"
fi

# Method 2: Whisper (if available)
if command -v whisper &> /dev/null; then
    echo "  Using Whisper..."
    whisper "$AUDIO_FILE" --output_dir "$OUTPUT_DIR" --output_format txt
    mv "$OUTPUT_DIR/${BASENAME}_audio.txt" "$OUTPUT_DIR/${BASENAME}_whisper.txt" 2>/dev/null || true
fi

# Step 3: Generate metadata
echo "Step 3: Generating metadata..."
{
    echo "Video Transcription Report"
    echo "========================="
    echo "Source: $INPUT_VIDEO"
    echo "Date: $(date)"
    echo "Audio duration: $(ffprobe -i "$AUDIO_FILE" -show_entries format=duration -v quiet -of csv="p=0" | cut -d. -f1) seconds"
    echo ""
} > "$OUTPUT_DIR/${BASENAME}_metadata.txt"

# Cleanup
rm -rf "$TEMP_DIR"

echo "Transcription complete! Check $OUTPUT_DIR for results."
```
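Save the script, make it executable, and run it against a video:
```bash
chmod +x video_to_text.sh
./video_to_text.sh input_video.mp4   # results land in ./transcription_output/
```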
### Batch Processing Multiple Videos
```python
#!/usr/bin/env python3
# batch_transcribe.py
import subprocess
import sys
from pathlib import Path
import concurrent.futures
import logging

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def process_video(video_path, output_dir):
    """Process a single video file"""
    try:
        video_name = Path(video_path).stem
        audio_path = output_dir / f"{video_name}_audio.wav"
        transcript_path = output_dir / f"{video_name}_transcript.txt"

        # Extract audio
        logger.info(f"Extracting audio from {video_path}")
        subprocess.run([
            'ffmpeg', '-i', str(video_path), '-vn', '-acodec', 'pcm_s16le',
            '-ar', '16000', '-ac', '1', str(audio_path), '-y'
        ], check=True, capture_output=True)

        # Transcribe with Whisper
        logger.info(f"Transcribing {audio_path}")
        subprocess.run([
            'whisper', str(audio_path), '--output_format', 'txt',
            '--output_dir', str(output_dir)
        ], check=True, capture_output=True, text=True)

        # Rename output file (Whisper names it after the audio file's stem)
        whisper_output = output_dir / f"{video_name}_audio.txt"
        if whisper_output.exists():
            whisper_output.rename(transcript_path)

        # Clean up audio file
        audio_path.unlink()

        logger.info(f"Completed processing {video_path}")
        return True
    except subprocess.CalledProcessError as e:
        logger.error(f"Error processing {video_path}: {e}")
        return False
    except Exception as e:
        logger.error(f"Unexpected error processing {video_path}: {e}")
        return False

def main():
    if len(sys.argv) != 3:
        print("Usage: python batch_transcribe.py <input_directory> <output_directory>")
        sys.exit(1)

    input_dir = Path(sys.argv[1])
    output_dir = Path(sys.argv[2])

    if not input_dir.exists():
        print(f"Input directory does not exist: {input_dir}")
        sys.exit(1)
    output_dir.mkdir(exist_ok=True)

    # Find all video files
    video_extensions = {'.mp4', '.avi', '.mov', '.mkv', '.flv', '.wmv'}
    video_files = [f for f in input_dir.rglob('*') if f.suffix.lower() in video_extensions]
    if not video_files:
        print(f"No video files found in {input_dir}")
        sys.exit(1)

    logger.info(f"Found {len(video_files)} video files to process")

    # Process videos in parallel
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(process_video, video_file, output_dir) for video_file in video_files]
        completed = 0
        for future in concurrent.futures.as_completed(futures):
            if future.result():
                completed += 1
            logger.info(f"Progress: {completed}/{len(video_files)} completed")

    logger.info(f"Batch processing complete. {completed}/{len(video_files)} files processed successfully.")

if __name__ == "__main__":
    main()
```
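Usage mirrors the single-file pipeline; point the script at a directory tree of videos and an output directory:
```bash
# Recursively find videos under ./videos and write transcripts to ./transcripts
python3 batch_transcribe.py ./videos ./transcripts
```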
## Part 4: Quality Improvement and Post-Processing
### Improving Transcription Accuracy
#### Pre-processing Audio for Better Results
```bash
# Enhance audio quality before transcription
ffmpeg -i input_video.mp4 \
-af "highpass=f=100,lowpass=f=8000,compand=0.3|0.3:1|1:-90/-60|-60/-40|-40/-30|-20/-20:6:0:-90:0.2" \
-acodec pcm_s16le -ar 16000 -ac 1 enhanced_audio.wav
```
#### Post-processing Transcriptions
```python
#!/usr/bin/env python3
# post_process_transcript.py
import re
import sys
from pathlib import Path

def clean_transcript(text):
    """Clean and format transcript text"""
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text.strip())

    # Fix common transcription errors
    replacements = {
        r'\bum\b': '',
        r'\buh\b': '',
        r'\blike\b(?=\s+like)': '',  # Remove repeated "like"
        r'\byou know\b': '',
        r'\.{2,}': '.',        # Multiple periods to single
        r'\s+\.': '.',         # Space before period
        r'\.(?=[A-Z])': '. ',  # Add space after period before capital
    }
    for pattern, replacement in replacements.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)

    # Capitalize sentences
    sentences = text.split('.')
    sentences = [s.strip().capitalize() for s in sentences if s.strip()]
    return '. '.join(sentences) + '.'

def add_paragraphs(text, sentences_per_paragraph=4):
    """Add paragraph breaks for better readability"""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    paragraphs = []
    for i in range(0, len(sentences), sentences_per_paragraph):
        paragraph = '. '.join(sentences[i:i + sentences_per_paragraph]) + '.'
        paragraphs.append(paragraph)
    return '\n\n'.join(paragraphs)

def main():
    if len(sys.argv) != 2:
        print("Usage: python post_process_transcript.py transcript.txt")
        sys.exit(1)

    input_file = Path(sys.argv[1])
    output_file = input_file.with_suffix('.cleaned.txt')

    # Read original transcript
    with open(input_file, 'r', encoding='utf-8') as f:
        original_text = f.read()

    # Clean and format
    cleaned_text = clean_transcript(original_text)
    formatted_text = add_paragraphs(cleaned_text)

    # Save cleaned version
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(formatted_text)

    print(f"Cleaned transcript saved to: {output_file}")
    print(f"Original length: {len(original_text)} characters")
    print(f"Cleaned length: {len(formatted_text)} characters")

if __name__ == "__main__":
    main()
```
## Part 5: Troubleshooting and Optimization
### Common Issues and Solutions
#### FFmpeg Issues
```bash
# Issue: "codec not supported"
# Solution: Check available codecs
ffmpeg -codecs | grep -i pcm
# Issue: Permission denied
# Solution: Check file permissions
chmod 644 input_video.mp4   # media files need read access, not execute
sudo chown $USER:$USER input_video.mp4
# Issue: Out of disk space
# Solution: Monitor disk usage and use temporary directories
df -h
```
#### DeepSpeech Issues
```bash
# Issue: Model download fails
# Solution: Manual download with verification
wget -c https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
sha256sum deepspeech-0.9.3-models.pbmm
# Issue: Poor transcription quality
# Solution: Try different audio preprocessing
ffmpeg -i input.mp4 -af "loudnorm=I=-16:TP=-1.5:LRA=11,highpass=f=80,lowpass=f=8000" -ar 16000 -ac 1 processed.wav
```
### Performance Optimization
#### System Resource Management
```bash
# Monitor system resources during processing
htop
iotop
nvidia-smi # For GPU usage
# Limit CPU usage for long-running jobs
nice -n 10 your_transcription_command
# Use GNU parallel for batch processing
find . -name "*.mp4" | parallel -j 4 'ffmpeg -i {} -vn -acodec pcm_s16le -ar 16000 -ac 1 {.}_audio.wav'
```
#### Storage Optimization
```bash
# Use compressed intermediate formats when possible
ffmpeg -i input.mp4 -vn -c:a libopus -b:a 32k temp_audio.opus
# Clean up temporary files automatically
trap 'rm -f temp_*.wav temp_*.opus' EXIT
```
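Keep in mind that the recognizers covered here other than Whisper (DeepSpeech, the SpeechRecognition library) expect 16kHz mono WAV input, so a compressed intermediate has to be decoded back before transcription:
```bash
# Decode the Opus intermediate back to recognizer-friendly WAV
ffmpeg -i temp_audio.opus -acodec pcm_s16le -ar 16000 -ac 1 audio_for_asr.wav
```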
## Part 6: Integration with Other Tools
### Creating Subtitles
```bash
# Generate SRT subtitles with Whisper
whisper input_video.mp4 --output_format srt
# Convert to other subtitle formats
ffmpeg -i subtitles.srt subtitles.vtt # WebVTT format
```
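If you want the subtitles burned into the video frames themselves (for players without subtitle support), FFmpeg's `subtitles` filter handles SRT directly, provided your FFmpeg build includes libass:
```bash
# Hard-burn subtitles into the picture (re-encodes the video stream)
ffmpeg -i input_video.mp4 -vf "subtitles=subtitles.srt" -c:a copy subtitled_video.mp4
```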
### Automated Video Processing Pipeline
```yaml
# docker-compose.yml for containerized transcription service
version: '3.8'

services:
  transcription:
    build: .
    volumes:
      - ./input:/app/input
      - ./output:/app/output
    environment:
      - WHISPER_MODEL=base
      - OUTPUT_FORMAT=txt
    command: python batch_transcribe.py /app/input /app/output
```
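The compose file assumes an image that bundles FFmpeg, Whisper, and the batch script; building that image is left as an exercise. With a suitable `Dockerfile` in place, a run is just:
```bash
# Build the image and process everything under ./input
docker compose up --build
```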
## Conclusion
This comprehensive guide provides multiple approaches to video-to-text transcription, from simple command-line tools to automated batch processing systems. The choice of method depends on your specific requirements:
- **DeepSpeech**: Good for offline processing, privacy-conscious applications
- **Whisper**: Superior accuracy, supports multiple languages, requires more resources
- **Cloud APIs**: Highest accuracy, but requires an internet connection and incurs per-use API costs
### Key Takeaways
1. **Audio Quality Matters**: Preprocessing audio can significantly improve transcription accuracy
2. **Choose the Right Tool**: Different speech recognition engines excel in different scenarios
3. **Automate When Possible**: Batch processing and scripting save time for large projects
4. **Post-process Results**: Cleaning and formatting improve final transcript quality
5. **Monitor Resources**: Large-scale transcription can be resource-intensive
### Next Steps
- Experiment with different speech recognition models and parameters
- Integrate transcription into larger content processing workflows
- Explore real-time transcription for live video streams
- Consider cloud-based solutions for production applications
The field of speech recognition is rapidly evolving, with new models and techniques regularly improving accuracy and reducing computational requirements. Stay updated with the latest developments to get the best results from your transcription workflows.