The Complete Rsync Guide: Mastering File Synchronization

Rsync (Remote Synchronization) is one of the most powerful and efficient tools for copying and synchronizing files locally or across networks. It's widely used for backups, mirroring, and deployments because it transfers only what has changed.

This guide covers:
  • Basic to advanced rsync usage
  • Trailing slash rules (critical!)
  • Local & remote sync (SSH)
  • Exclusions, deletions, and permissions
  • Performance optimization
  • Real-world examples & scripts


1. Installation & Basic Usage

Installation

  • Linux (Debian/Ubuntu):
    sudo apt install rsync
    
  • Linux (RHEL/CentOS):
    sudo yum install rsync
    
  • macOS:
    brew install rsync  # via Homebrew
    
  • Windows:
    • Use WSL (Windows Subsystem for Linux)
    • Or cwRsync (native Windows port)

Basic Command Structure

rsync [OPTIONS] SOURCE DESTINATION
  • SOURCE: The files/folders to copy.
  • DESTINATION: Where to copy them.

2. Critical: Trailing Slash Rules

The trailing slash (/) changes behavior drastically:

Command                    Effect
rsync -a /source /dest     Copies the entire /source folder into /dest/source
rsync -a /source/ /dest    Copies only the contents of /source/ into /dest/

Example:

rsync -a ~/photos/ /backup/  # Copies files inside ~/photos/ to /backup/
rsync -a ~/photos /backup/   # Creates /backup/photos/ with all files inside

⚠ Always test with -n (dry run) first!


3. Essential Rsync Options

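Option      Meaning
-a          Archive mode: recursive, and preserves symlinks, permissions, timestamps, owner, group, and device/special files (same as -rlptgoD)
-v          Verbose (list each file as it is transferred)
-z          Compress data during transfer
-h          Human-readable numbers
-P          Same as --partial --progress: show per-file progress and keep partial files so interrupted transfers can resume
--delete    Delete files in the destination that are not in the source
-n          Dry run (simulate without copying)
-e ssh      Use SSH as the remote shell (already the default in current rsync versions)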

4. Local & Remote File Syncing

Copy Locally

rsync -avh /source/folder/ /destination/

Copy to Remote Server (Push)

rsync -avzP -e ssh /local/path/ user@remote-server:/remote/path/

Copy from Remote Server (Pull)

rsync -avzP -e ssh user@remote-server:/remote/path/ /local/path/

5. Advanced Usage

Exclude Files/Folders

rsync -av --exclude='*.tmp' --exclude='cache/' /source/ /dest/

Or use an exclude file (exclude-list.txt):

rsync -av --exclude-from='exclude-list.txt' /source/ /dest/

Delete Extraneous Files (--delete)

rsync -av --delete /source/ /dest/  # Removes files in dest not in source

Limit Bandwidth (e.g., about 1 MB/s; --bwlimit takes KiB/s by default)

rsync -avz --bwlimit=1000 /source/ user@remote:/dest/

Partial Transfer Resume

rsync -avzP /source/ user@remote:/dest/  # -P keeps partial files; rerun the same command to resume

6. Real-World Examples

1. Backup Home Directory

rsync -avh --delete --exclude='Downloads/' ~/ /backup/home/

2. Mirror a Website (Excluding Cache)

rsync -avzP --delete --exclude='cache/' user@webserver:/var/www/ /local/backup/

3. Sync Large Files with Bandwidth Control

rsync -avzP --bwlimit=5000 /big-files/ user@remote:/backup/

7. Performance Tips

  • Use -z for compression over slow networks; on fast links or with already-compressed data it mostly costs CPU.
  • Use --partial to keep partially transferred files.
  • Avoid -a if not needed (e.g., -rlt for lightweight sync).
  • Run an rsync daemon (rsync --daemon) for frequent large transfers.

8. Common Mistakes & Fixes

Mistake                               Fix
Accidentally reversing source/dest    Always test with -n first!
Forgetting trailing slash             Check paths before running!
--delete removing needed files        Use --dry-run before --delete
Permission issues                     Use --chmod or sudo rsync

9. Scripting & Automation

Cron Job for Daily Backup

0 3 * * * rsync -avz --delete /important-files/ user@backup-server:/backup/

Logging Rsync Output

rsync -avzP /source/ /dest/ >> /var/log/rsync.log 2>&1

Final Thoughts

Rsync is incredibly powerful once mastered. Key takeaways:
  • Trailing slash (/) matters!
  • Use -a for backups, -z for slow networks.
  • Test with -n before --delete.
  • Automate with cron for scheduled syncs.

Want even deeper control? Explore rsync --daemon for server setups! 🚀


The multi-stream transfer technique (parallel rsync) runs several rsync processes at once. It pays off in specific high-performance scenarios where a single rsync process cannot saturate the network, the disks, or the CPU. Here are the key use cases where this shines:


1. Syncing Millions of Small Files

  • Problem: Rsync's single-threaded nature becomes a bottleneck with many small files (e.g., a directory with 500,000 tiny log files).
  • Solution: Parallel transfers reduce overhead by processing multiple files simultaneously.
  • Example:
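    cd /var/log && find . -type f -print0 | xargs -0 -P8 -I{} rsync -aR {} backup-server:/logs/
    
    (8 parallel processes for log files. -R keeps each file's relative path, so same-named files in different directories don't overwrite each other. Every file still starts its own rsync/SSH session, so measure against a plain single rsync run.)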

2. High-Latency Network Transfers

  • Problem: On high-latency connections (e.g., cross-continent), a single rsync stream is limited by one TCP connection's window and round-trip time, which can leave bandwidth unused.
  • Solution: Parallel streams saturate the pipe by keeping multiple TCP connections busy.
  • Example:
    cd /data && find . -type f -size +1M -print0 | xargs -0 -P4 -I{} rsync -azR {} user@remote:/backup/
    
    (Focuses on files larger than 1 MiB with 4 parallel streams; -R preserves relative paths)

3. Maximizing SSD/NVMe I/O

  • Problem: Modern storage (SSDs/NVMe) can handle thousands of IOPS, but single-threaded rsync can't utilize full I/O bandwidth.
  • Solution: Parallel processes exploit concurrent disk reads/writes.
  • Example:
    cd /src && find . -type f -print0 | xargs -0 -P16 -I{} rsync -aR {} /dest/
    
    (16 parallel processes for NVMe arrays; -R recreates the source subdirectories under /dest/)

4. Cloud Storage Sync (S3/Blob)

  • Problem: Cloud storage APIs often throttle single connections but allow higher aggregate throughput.
  • Solution: Parallel uploads bypass per-connection limits.
  • Example (with rclone or s3cmd in parallel):
    cd /data && find . -type f -printf '%P\0' | parallel -0 -j10 s3cmd put {} s3://bucket/{}
    

5. Large Scientific Datasets (HPC)

  • Problem: Syncing multi-TB datasets with complex directory trees.
  • Solution: Split workload across directory levels.
  • Example:
    # Sync top-level dirs in parallel (files directly under /dataset/ need a separate rsync pass)
    find /dataset/ -maxdepth 1 -mindepth 1 -type d | parallel -j4 rsync -av {} remote:/dataset/
    

Technical Considerations

  1. CPU vs. Network Tradeoff:

    • More parallel processes increase CPU usage but improve throughput.
    • Rule of thumb: process count ≈ (network bandwidth in Mbps) / (single-process speed in Mbps), capped by available CPU cores and disk throughput.
  2. Avoid File Collisions:

    # Preserve relative paths (-R) so same-named files in different directories don't collide
    cd /src && find . -type f | parallel -j8 rsync -aR {} remote:/dest/
    
  3. Progress Monitoring:
    Use pv -l to count files as they are handed to parallel:

    cd /src && find . -type f | pv -l | parallel -j8 rsync -aR {} /dest/
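
The process-count rule of thumb from item 1 can be worked through with plain shell arithmetic. Both input figures below are assumed values for illustration, not measurements; substitute numbers from your own single-process test.

```shell
#!/usr/bin/env bash
# Worked example of the process-count rule of thumb.
# Both figures are assumed values for illustration, not measurements.
set -euo pipefail

link_mbps=1000          # assumed usable network bandwidth
single_stream_mbps=250  # assumed throughput of one rsync process

# Integer ceiling division: round up so the link is not left underused.
procs=$(( (link_mbps + single_stream_mbps - 1) / single_stream_mbps ))
echo "Start testing with about $procs parallel processes"
```

With these inputs the script suggests 4 processes. Treat the result as a starting point only, since CPU cores and disk throughput can cap the useful count well below it.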
    

When NOT to Use Parallel Rsync

  • Small file counts (<1,000 files): overhead outweighs benefits
  • Spinning HDDs: random I/O thrashing may slow transfers
  • Rate-limited destinations (e.g., APIs with strict QPS limits)

Alternative Tools

For extreme cases, consider:

  • fpart + parallel (file partitioning)
  • gsutil -m (Google Cloud optimized)
  • aria2c (multi-protocol downloader)

Real-World Benchmark

# Single-threaded
time rsync -av /data/ remote:/backup/  # 45 MB/s

# Parallel (16 processes)
time find /data/ -type f -print0 | xargs -0 -P16 -I{} rsync -a {} remote:/backup/  # 220 MB/s

(Example from a 10Gbps link with NVMe storage)


Parallel rsync can unlock several-fold speedups in the right scenarios (roughly 5x in the example above). Test with varying xargs -P (or parallel -j) values to find your system's sweet spot! 🚀