Update tech_docs/linux/rsync.md

2025-07-01 06:34:32 +00:00
parent 2e6f1139d6
commit 2129c80044
1 changed files with 107 additions and 1 deletions
--- a/tech_docs/linux/rsync.md
+++ b/tech_docs/linux/rsync.md
@@ -169,4 +169,110 @@ Rsync is **incredibly powerful** once mastered. Key takeaways:
 Want even deeper control? Explore `rsync --daemon` for server setups! 🚀  
-**Need help with a specific scenario? Ask away!**
+---
 The **multi-stream transfer** technique (running several rsync processes in parallel) pays off in specific high-performance scenarios where a single rsync process cannot saturate the network or the storage. The key use cases:
 ---
 ### **1. Syncing Millions of Small Files**
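 - **Problem**: A single rsync process works through files sequentially, which becomes a bottleneck with many small files (e.g., a directory with 500,000 tiny log files).
 - **Solution**: Parallel processes overlap the per-file overhead by working on multiple files simultaneously. Batch the file list so that each rsync invocation (and its SSH connection) handles many files rather than one.
 - **Example**:  
  ```bash
  # -R (--relative) keeps each file's path under /logs/ instead of flattening everything into one directory
  cd /var/log && find . -type f -print0 | xargs -0 -n 1000 -P8 sh -c 'rsync -aR "$@" backup-server:/logs/' sh
  ```
  *(8 parallel processes, each transferring batches of up to 1,000 log files)*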
 ---
 ### **2. High-Latency Network Transfers**
 - **Problem**: On high-latency connections (e.g., cross-continent), a single rsync stream spends much of its time waiting on round trips and leaves bandwidth unused.
 - **Solution**: Parallel streams keep multiple TCP connections busy and fill more of the pipe.
 - **Example**:  
  ```bash
  # -I{} already runs one rsync per file; -n1 is dropped because GNU xargs treats the two options as mutually exclusive
  cd /data && find . -type f -size +1M -print0 | xargs -0 -P4 -I{} rsync -azR {} user@remote:/backup/
  ```
  *(Focuses on files larger than 1 MiB, with 4 parallel streams)*
 ---
 ### **3. Maximizing SSD/NVMe I/O**
 - **Problem**: Modern storage (SSDs/NVMe) can handle thousands of IOPS, but a single rsync process can't use the full I/O bandwidth.
 - **Solution**: Parallel processes exploit concurrent disk reads/writes.
 - **Example**:  
  ```bash
  # -R recreates each file's directory under /dest/; a plain /dest/{} target fails when the parent directory is missing
  cd /src && find . -type f -print0 | xargs -0 -P16 -I{} rsync -aR {} /dest/
  ```
  *(16 processes for NVMe arrays)*
 ---
 ### **4. Cloud Storage Sync (S3/Blob)**
 - **Problem**: Cloud storage APIs often throttle single connections but allow higher aggregate throughput.
 - **Solution**: Parallel uploads bypass per-connection limits.
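 - **Example** (rsync cannot talk to object storage itself, so run a client such as `rclone` or `s3cmd` in parallel):  
  ```bash
  # %P prints each path relative to the starting point, which gives clean object keys (GNU find)
  cd /data && find . -type f -printf '%P\n' | parallel -j10 s3cmd put {} s3://bucket/{}
  ```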
 ---
 ### **5. Large Scientific Datasets (HPC)**
 - **Problem**: Syncing multi-TB datasets with complex directory trees.
 - **Solution**: Split workload across directory levels.
 - **Example**:  
  ```bash
  # Sync top-level dirs in parallel (files directly under /dataset/ need a separate pass)
  find /dataset/ -maxdepth 1 -mindepth 1 -type d -print0 | parallel -0 -j4 rsync -av {} remote:/dataset/
  ```
 ---
 ### **Technical Considerations**
 1. **CPU vs. Network Tradeoff**:  
    - More processes increase CPU usage and usually improve throughput until the network, disk, or CPU saturates.  
    - Rough starting point: process count ≈ `(Network Bandwidth in Mbps) / (Single-Process Speed in Mbps)`  
 2. **Avoid File Collisions**:  
    ```bash
    # Preserve relative paths so same-named files in different directories don't overwrite each other
    cd /src && find . -type f | parallel -j8 rsync -aR {} remote:/dest/
    ```
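 3. **Progress Monitoring**:  
    Use `pv -l` to count files as they are handed to `parallel` (an approximation of overall progress, since it counts queued rather than completed files):  
    ```bash
    cd /src && find . -type f | pv -l | parallel -j8 rsync -aR {} /dest/
    ```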
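
 The bandwidth ratio in item 1 can be worked through with a small helper. This is a minimal sketch: the function name `suggest_procs` and the sample figures are hypothetical, and the result ignores CPU, disk, and remote-side limits.

 ```bash
 # Hypothetical helper: estimate how many parallel rsync processes to start.
 # $1 = link bandwidth in Mbps, $2 = measured single-process speed in Mbps.
 suggest_procs() {
   local link_mbps=$1 single_mbps=$2
   # Integer ceiling division: (a + b - 1) / b
   echo $(( (link_mbps + single_mbps - 1) / single_mbps ))
 }

 # 10 Gbps link, 45 MB/s (360 Mbps) per process
 suggest_procs 10000 360
 ```

 With these sample figures the helper prints `28`. Treat that as an upper bound to start measuring from rather than a target, because other bottlenecks typically appear well before the link is full.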
 ---
 ### **When NOT to Use Parallel Rsync**
 - **Small file counts** (<1,000 files) – overhead outweighs benefits  
 - **Spinning HDDs** – random I/O thrashing may slow transfers  
 - **Rate-limited destinations** (e.g., APIs with strict QPS limits)  
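
 The file-count threshold in the first bullet can be checked before choosing a strategy. A minimal sketch, assuming the hypothetical helper name `worth_parallelizing` and a default threshold of 1,000 taken from the bullet above:

 ```bash
 # Hypothetical helper: succeed only if DIR holds at least MIN regular files.
 # Counts NUL terminators so a filename containing a newline is counted once.
 worth_parallelizing() {
   local dir=$1 min=${2:-1000}
   local count
   count=$(find "$dir" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')
   [ "$count" -ge "$min" ]
 }
 ```

 For example, `worth_parallelizing /data && echo parallel || echo single` picks a label based on the count. The threshold is only a heuristic; file sizes and link latency matter as well.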
 ---
 ### **Alternative Tools**
 For extreme cases, consider:  
 - [`fpart`](https://github.com/martymac/fpart) + `parallel` (file partitioning)  
 - [`gsutil -m`](https://cloud.google.com/storage/docs/gsutil/addlhelp/ParallelCompositeUploads) (Google Cloud optimized)  
 - [`aria2c`](https://aria2.github.io/) (multi-protocol downloader)  
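
 The file-partitioning idea can be approximated with standard tools when `fpart` is not installed. The sketch below is not `fpart`'s algorithm: it deals file paths out round-robin instead of balancing by size. The helper names `make_chunks` and `sync_chunks` are hypothetical, and the code assumes GNU `split` and filenames without embedded newlines.

 ```bash
 # Hypothetical helper: split the file list under SRC into at most N chunk files.
 make_chunks() {
   local src=$1 n=$2 outdir=$3
   # List paths relative to SRC, then deal whole lines round-robin into
   # chunk.00, chunk.01, ... (-e skips chunks that would be empty).
   ( cd "$src" && find . -type f ) | split -d -e -n "r/$n" - "$outdir/chunk."
 }

 # Hypothetical helper: run one rsync per chunk in the background.
 # --files-from reads paths relative to the source directory.
 sync_chunks() {
   local src=$1 dest=$2 outdir=$3
   local chunk
   for chunk in "$outdir"/chunk.*; do
     rsync -a --files-from="$chunk" "$src"/ "$dest" &
   done
   wait   # block until every background rsync has exited
 }
 ```

 A run might look like `make_chunks /src 8 /tmp/chunks && sync_chunks /src remote:/dest/ /tmp/chunks`, with `/tmp/chunks` created beforehand. Each rsync process then handles many files over one connection, which avoids paying connection setup per file. Note that `wait` without arguments does not report individual failures, so check rsync's output or logs.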
 ---
 ### **Real-World Benchmark**
 ```bash
 # Single process
 time rsync -av /data/ remote:/backup/  # 45 MB/s
 # Parallel (16 processes); -R keeps the directory layout identical to the single-process run
 time (cd /data && find . -type f -print0 | xargs -0 -P16 -I{} rsync -aR {} remote:/backup/)  # 220 MB/s
 ```
 *(Example from a 10 Gbps link with NVMe storage)*
 ---
 Parallel rsync can deliver **severalfold speedups** in the right scenarios (roughly 5x in the benchmark above). Test with varying `xargs -P` or `parallel -j` values to find your system's sweet spot! 🚀
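
 Finding that sweet spot can be scripted. The following is a rough sketch rather than a benchmark harness: `sweep_procs` and `sync_data` are hypothetical names, the paths and host are placeholders, and timing uses whole seconds from `date`.

 ```bash
 # Hypothetical helper: time a transfer command at several parallelism levels.
 # $1 = name of a function or command that takes the process count as its argument;
 # remaining arguments = process counts to try.
 sweep_procs() {
   local run=$1 p start end
   shift
   for p in "$@"; do
     start=$(date +%s)
     "$run" "$p"
     end=$(date +%s)
     echo "procs=$p seconds=$((end - start))"
   done
 }

 # Example transfer: relative paths (-R) keep the directory tree intact.
 sync_data() {
   ( cd /data && find . -type f -print0 |
       xargs -0 -P"$1" -I{} rsync -aR {} remote:/backup/ )
 }

 # sweep_procs sync_data 1 2 4 8 16
 ```

 Point each run at an empty destination (or clear it between runs); otherwise later runs skip files that are already up to date and look faster than they are.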