Update tech_docs/linux/rsync.md

This commit is contained in:
2025-07-01 06:34:32 +00:00
parent 2e6f1139d6
commit 2129c80044

View File

@@ -169,4 +169,110 @@ Rsync is **incredibly powerful** once mastered. Key takeaways:
Want even deeper control? Explore `rsync --daemon` for server setups! 🚀 Want even deeper control? Explore `rsync --daemon` for server setups! 🚀
**Need help with a specific scenario? Ask away!** ---
The **multi-stream transfer** technique (parallel rsync) is extremely valuable in specific high-performance scenarios where you need to maximize throughput or overcome certain limitations. Here are the key use cases where this shines:
---
### **1. Syncing Millions of Small Files**
- **Problem**: Rsync's single-threaded nature becomes a bottleneck with many small files (e.g., a directory with 500,000 tiny log files).
- **Solution**: Parallel transfers reduce overhead by processing multiple files simultaneously.
- **Example**:
```bash
find /var/log/ -type f -print0 | xargs -0 -n1 -P8 -I{} rsync -a {} backup-server:/logs/
```
*(8 parallel processes for log files)*
---
### **2. High-Latency Network Transfers**
- **Problem**: On high-latency connections (e.g., cross-continent), single-threaded rsync wastes bandwidth waiting for acknowledgments.
- **Solution**: Parallel streams saturate the pipe by keeping multiple TCP connections busy.
- **Example**:
```bash
find /data/ -type f -size +1M -print0 | xargs -0 -n1 -P4 -I{} rsync -az {} user@remote:/backup/
```
*(Focuses on larger files with 4 parallel streams)*
---
### **3. Maximizing SSD/NVMe I/O**
- **Problem**: Modern storage (SSDs/NVMe) can handle thousands of IOPS, but single-threaded rsync can't utilize full I/O bandwidth.
- **Solution**: Parallel processes exploit concurrent disk reads/writes.
- **Example**:
```bash
cd /src && find . -type f -print0 | xargs -0 -n1 -P16 -I{} rsync -a ./{} /dest/{}
```
*(16 threads for NVMe arrays)*
---
### **4. Cloud Storage Sync (S3/Blob)**
- **Problem**: Cloud storage APIs often throttle single connections but allow higher aggregate throughput.
- **Solution**: Parallel uploads bypass per-connection limits.
- **Example** (with `rclone` or `s3cmd` in parallel):
```bash
find /data/ -type f | parallel -j10 s3cmd put {} s3://bucket/{}
```
---
### **5. Large Scientific Datasets (HPC)**
- **Problem**: Syncing multi-TB datasets with complex directory trees.
- **Solution**: Split workload across directory levels.
- **Example**:
```bash
# Sync top-level dirs in parallel
find /dataset/ -maxdepth 1 -mindepth 1 -type d | parallel -j4 rsync -av {} remote:/dataset/
```
---
### **Technical Considerations**
1. **CPU vs. Network Tradeoff**:
- More threads increase CPU usage but improve throughput.
- Ideal thread count = `(Network Bandwidth in Mbps) / (Single-Thread Speed in Mbps)`
2. **Avoid File Collisions**:
```bash
# Use subdirs to prevent conflicts
find /src/ -type f | parallel -j8 rsync -a {} remote:/dest/{/.}/
```
3. **Progress Monitoring**:
Use `pv` to track overall progress:
```bash
find /src/ -type f | pv -l | parallel -j8 rsync -a {} /dest/
```
---
### **When NOT to Use Parallel Rsync**
- **Small file counts** (<1,000 files) overhead outweighs benefits
- **Spinning HDDs** random I/O thrashing may slow transfers
- **Rate-limited destinations** (e.g., APIs with strict QPS limits)
---
### **Alternative Tools**
For extreme cases, consider:
- [`fpart`](https://github.com/martymac/fpart) + `parallel` (file partitioning)
- [`gsutil -m`](https://cloud.google.com/storage/docs/gsutil/addlhelp/ParallelCompositeUploads) (Google Cloud optimized)
- [`aria2c`](https://aria2.github.io/) (multi-protocol downloader)
---
### **Real-World Benchmark**
```bash
# Single-threaded
time rsync -av /data/ remote:/backup/ # 45 MB/s
# Parallel (16 threads)
time find /data/ -type f -print0 | xargs -0 -n1 -P16 -I{} rsync -a {} remote:/backup/ # 220 MB/s
```
*(Example from a 10Gbps link with NVMe storage)*
---
Parallel rsync unlocks **order-of-magnitude speedups** in the right scenarios. Test with varying `-P` values to find your system's sweet spot! 🚀