Update tech_docs/linux/rsync.md
This commit is contained in:
@@ -169,4 +169,110 @@ Rsync is **incredibly powerful** once mastered. Key takeaways:
|
|||||||
|
|
||||||
Want even deeper control? Explore `rsync --daemon` for server setups! 🚀
|
Want even deeper control? Explore `rsync --daemon` for server setups! 🚀
|
||||||
|
|
||||||
**Need help with a specific scenario? Ask away!**
|
---
|
||||||
|
|
||||||
|
The **multi-stream transfer** technique (parallel rsync) is extremely valuable in specific high-performance scenarios where you need to maximize throughput or overcome certain limitations. Here are the key use cases where this shines:
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **1. Syncing Millions of Small Files**
|
||||||
|
- **Problem**: Rsync's single-threaded nature becomes a bottleneck with many small files (e.g., a directory with 500,000 tiny log files).
|
||||||
|
- **Solution**: Parallel transfers reduce overhead by processing multiple files simultaneously.
|
||||||
|
- **Example**:
|
||||||
|
```bash
|
||||||
|
find /var/log/ -type f -print0 | xargs -0 -n1 -P8 -I{} rsync -a {} backup-server:/logs/
|
||||||
|
```
|
||||||
|
*(8 parallel processes for log files)*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **2. High-Latency Network Transfers**
|
||||||
|
- **Problem**: On high-latency connections (e.g., cross-continent), single-threaded rsync wastes bandwidth waiting for acknowledgments.
|
||||||
|
- **Solution**: Parallel streams saturate the pipe by keeping multiple TCP connections busy.
|
||||||
|
- **Example**:
|
||||||
|
```bash
|
||||||
|
find /data/ -type f -size +1M -print0 | xargs -0 -n1 -P4 -I{} rsync -az {} user@remote:/backup/
|
||||||
|
```
|
||||||
|
*(Focuses on larger files with 4 parallel streams)*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **3. Maximizing SSD/NVMe I/O**
|
||||||
|
- **Problem**: Modern storage (SSDs/NVMe) can handle thousands of IOPS, but single-threaded rsync can't utilize full I/O bandwidth.
|
||||||
|
- **Solution**: Parallel processes exploit concurrent disk reads/writes.
|
||||||
|
- **Example**:
|
||||||
|
```bash
|
||||||
|
cd /src && find . -type f -print0 | xargs -0 -n1 -P16 -I{} rsync -a ./{} /dest/{}
|
||||||
|
```
|
||||||
|
*(16 threads for NVMe arrays)*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **4. Cloud Storage Sync (S3/Blob)**
|
||||||
|
- **Problem**: Cloud storage APIs often throttle single connections but allow higher aggregate throughput.
|
||||||
|
- **Solution**: Parallel uploads bypass per-connection limits.
|
||||||
|
- **Example** (with `rclone` or `s3cmd` in parallel):
|
||||||
|
```bash
|
||||||
|
find /data/ -type f | parallel -j10 s3cmd put {} s3://bucket/{}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **5. Large Scientific Datasets (HPC)**
|
||||||
|
- **Problem**: Syncing multi-TB datasets with complex directory trees.
|
||||||
|
- **Solution**: Split workload across directory levels.
|
||||||
|
- **Example**:
|
||||||
|
```bash
|
||||||
|
# Sync top-level dirs in parallel
|
||||||
|
find /dataset/ -maxdepth 1 -mindepth 1 -type d | parallel -j4 rsync -av {} remote:/dataset/
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **Technical Considerations**
|
||||||
|
1. **CPU vs. Network Tradeoff**:
|
||||||
|
- More threads increase CPU usage but improve throughput.
|
||||||
|
- Ideal thread count = `(Network Bandwidth in Mbps) / (Single-Thread Speed in Mbps)`
|
||||||
|
|
||||||
|
2. **Avoid File Collisions**:
|
||||||
|
```bash
|
||||||
|
# Use subdirs to prevent conflicts
|
||||||
|
find /src/ -type f | parallel -j8 rsync -a {} remote:/dest/{/.}/
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Progress Monitoring**:
|
||||||
|
Use `pv` to track overall progress:
|
||||||
|
```bash
|
||||||
|
find /src/ -type f | pv -l | parallel -j8 rsync -a {} /dest/
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **When NOT to Use Parallel Rsync**
|
||||||
|
- **Small file counts** (<1,000 files) – overhead outweighs benefits
|
||||||
|
- **Spinning HDDs** – random I/O thrashing may slow transfers
|
||||||
|
- **Rate-limited destinations** (e.g., APIs with strict QPS limits)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **Alternative Tools**
|
||||||
|
For extreme cases, consider:
|
||||||
|
- [`fpart`](https://github.com/martymac/fpart) + `parallel` (file partitioning)
|
||||||
|
- [`gsutil -m`](https://cloud.google.com/storage/docs/gsutil/addlhelp/ParallelCompositeUploads) (Google Cloud optimized)
|
||||||
|
- [`aria2c`](https://aria2.github.io/) (multi-protocol downloader)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### **Real-World Benchmark**
|
||||||
|
```bash
|
||||||
|
# Single-threaded
|
||||||
|
time rsync -av /data/ remote:/backup/ # 45 MB/s
|
||||||
|
|
||||||
|
# Parallel (16 threads)
|
||||||
|
time find /data/ -type f -print0 | xargs -0 -n1 -P16 -I{} rsync -a {} remote:/backup/ # 220 MB/s
|
||||||
|
```
|
||||||
|
*(Example from a 10Gbps link with NVMe storage)*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Parallel rsync unlocks **order-of-magnitude speedups** in the right scenarios. Test with varying `-P` values to find your system's sweet spot! 🚀
|
||||||
Reference in New Issue
Block a user