From 2129c800448feff9eebcd79bf37b07b84f5313b5 Mon Sep 17 00:00:00 2001 From: medusa Date: Tue, 1 Jul 2025 06:34:32 +0000 Subject: [PATCH] Update tech_docs/linux/rsync.md --- tech_docs/linux/rsync.md | 108 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 107 insertions(+), 1 deletion(-) diff --git a/tech_docs/linux/rsync.md b/tech_docs/linux/rsync.md index 29c1043..c33b974 100644 --- a/tech_docs/linux/rsync.md +++ b/tech_docs/linux/rsync.md @@ -169,4 +169,110 @@ Rsync is **incredibly powerful** once mastered. Key takeaways: Want even deeper control? Explore `rsync --daemon` for server setups! 🚀 -**Need help with a specific scenario? Ask away!** \ No newline at end of file +--- + +The **multi-stream transfer** technique (parallel rsync) is extremely valuable in specific high-performance scenarios where you need to maximize throughput or overcome certain limitations. Here are the key use cases where this shines: + +--- + +### **1. Syncing Millions of Small Files** +- **Problem**: Rsync's single-threaded nature becomes a bottleneck with many small files (e.g., a directory with 500,000 tiny log files). +- **Solution**: Parallel transfers reduce overhead by processing multiple files simultaneously. +- **Example**: + ```bash + find /var/log/ -type f -print0 | xargs -0 -n1 -P8 -I{} rsync -a {} backup-server:/logs/ + ``` + *(8 parallel processes for log files)* + +--- + +### **2. High-Latency Network Transfers** +- **Problem**: On high-latency connections (e.g., cross-continent), single-threaded rsync wastes bandwidth waiting for acknowledgments. +- **Solution**: Parallel streams saturate the pipe by keeping multiple TCP connections busy. +- **Example**: + ```bash + find /data/ -type f -size +1M -print0 | xargs -0 -n1 -P4 -I{} rsync -az {} user@remote:/backup/ + ``` + *(Focuses on larger files with 4 parallel streams)* + +--- + +### **3. Maximizing SSD/NVMe I/O** +- **Problem**: Modern storage (SSDs/NVMe) can handle thousands of IOPS, but single-threaded rsync can't utilize full I/O bandwidth. +- **Solution**: Parallel processes exploit concurrent disk reads/writes. +- **Example**: + ```bash + cd /src && find . -type f -print0 | xargs -0 -n1 -P16 -I{} rsync -a ./{} /dest/{} + ``` + *(16 threads for NVMe arrays)* + +--- + +### **4. Cloud Storage Sync (S3/Blob)** +- **Problem**: Cloud storage APIs often throttle single connections but allow higher aggregate throughput. +- **Solution**: Parallel uploads bypass per-connection limits. +- **Example** (with `rclone` or `s3cmd` in parallel): + ```bash + find /data/ -type f | parallel -j10 s3cmd put {} s3://bucket/{} + ``` + +--- + +### **5. Large Scientific Datasets (HPC)** +- **Problem**: Syncing multi-TB datasets with complex directory trees. +- **Solution**: Split workload across directory levels. +- **Example**: + ```bash + # Sync top-level dirs in parallel + find /dataset/ -maxdepth 1 -mindepth 1 -type d | parallel -j4 rsync -av {} remote:/dataset/ + ``` + +--- + +### **Technical Considerations** +1. **CPU vs. Network Tradeoff**: + - More threads increase CPU usage but improve throughput. + - Ideal thread count = `(Network Bandwidth in Mbps) / (Single-Thread Speed in Mbps)` + +2. **Avoid File Collisions**: + ```bash + # Use subdirs to prevent conflicts + find /src/ -type f | parallel -j8 rsync -a {} remote:/dest/{/.}/ + ``` + +3. **Progress Monitoring**: + Use `pv` to track overall progress: + ```bash + find /src/ -type f | pv -l | parallel -j8 rsync -a {} /dest/ + ``` + +--- + +### **When NOT to Use Parallel Rsync** +- **Small file counts** (<1,000 files) – overhead outweighs benefits +- **Spinning HDDs** – random I/O thrashing may slow transfers +- **Rate-limited destinations** (e.g., APIs with strict QPS limits) + +--- + +### **Alternative Tools** +For extreme cases, consider: +- [`fpart`](https://github.com/martymac/fpart) + `parallel` (file partitioning) +- [`gsutil -m`](https://cloud.google.com/storage/docs/gsutil/addlhelp/ParallelCompositeUploads) (Google Cloud optimized) +- [`aria2c`](https://aria2.github.io/) (multi-protocol downloader) + +--- + +### **Real-World Benchmark** +```bash +# Single-threaded +time rsync -av /data/ remote:/backup/ # 45 MB/s + +# Parallel (16 threads) +time find /data/ -type f -print0 | xargs -0 -n1 -P16 -I{} rsync -a {} remote:/backup/ # 220 MB/s +``` +*(Example from a 10Gbps link with NVMe storage)* + +--- + +Parallel rsync unlocks **order-of-magnitude speedups** in the right scenarios. Test with varying `-P` values to find your system's sweet spot! 🚀 \ No newline at end of file