Add tech_docs/networking/linux_networking.md

2025-06-19 05:37:55 +00:00
parent 692c82bc4d
commit 78661c6f41

@@ -0,0 +1,177 @@
Here's a **focused Linux deep dive** covering **process management, networking, and resource triage**, the 20% that delivers 80% of the value for senior roles:
---
# **Linux for Senior Engineers: Processes, Networking & Triage**
**Goal:** Master the commands and concepts needed to **analyze, optimize, and troubleshoot** Linux systems at scale.
---
## **1. Process Management**
### **1.1 Viewing & Controlling Processes**
| **Command** | **Purpose** | **Key Flags** |
|---------------------------|---------------------------------------------|-----------------------------------|
| `ps`                      | Snapshot of running processes               | `-ef` (all, full format), `aux` (BSD style, adds %CPU/%MEM) |
| `top` / `htop`            | Real-time process monitor                   | `-p PID` (filter by PID)          |
| `kill`                    | Send a signal to a process (default: SIGTERM) | `-15` (SIGTERM), `-9` (SIGKILL) |
| `pkill` / `killall` | Kill processes by name | `-f` (match full command) |
| `nice` / `renice` | Adjust process priority (niceness) | `-n 19` (lowest priority) |
**Pro Tip:**
- Use `strace -p PID` to **debug hanging processes** by tracing the system calls they make (requires the same user or root).
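
The two `kill` signals in the table are usually combined: ask politely with SIGTERM, then escalate to SIGKILL only if the process ignores the request. A minimal sketch of that pattern follows; the function name `stop_gracefully` and the 5-second default timeout are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stop a process politely, escalating only if needed.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
stop_gracefully() {
    local pid=$1 timeout=${2:-5} waited=0

    # SIGTERM (15) lets the process run its cleanup handlers.
    kill -15 "$pid" 2>/dev/null || return 0   # no such process (or no permission)

    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -9 "$pid" 2>/dev/null        # SIGKILL cannot be caught or ignored
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

For example, `stop_gracefully 1234 10` would give PID 1234 ten seconds to exit before forcing it.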
---
### **1.2 Process Resource Usage**
| **Command** | **What It Shows** |
|-------------------|--------------------------------------------|
| `pidstat -u -r -d -p PID 1` | CPU (`-u`), memory (`-r`), and I/O (`-d`) stats for a process; plain `pidstat -p PID` reports CPU only |
| `vmstat 1` | System-wide CPU, memory, and I/O trends |
| `iotop` | Disk I/O by process (requires root) |
**Critical Metrics:**
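- **%CPU**: >80% for prolonged periods → Check for runaway processes. In `top`, 100% means one full core, so multithreaded processes can exceed 100%.
- **%MEM**: Steady growth over time suggests a leak (e.g., Java apps).
- **IOwait** (`wa` column in `vmstat`): High values mean CPUs sit idle waiting on I/O, which usually indicates a disk bottleneck.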
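
The IOwait figure that `vmstat` prints is derived from the cumulative counters on the first line of `/proc/stat`. As a rough sketch of that arithmetic, the hypothetical function `iowait_pct` below takes two samples of that line and reports the share of the elapsed CPU time that was spent in iowait.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user..steal
            t[NR] = total
            w[NR] = $6                                         # iowait
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (w[2] - w[1]) / dt
        }'
}

# Live usage: take two samples one second apart.
# a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
# iowait_pct "$a" "$b"
```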
---
## **2. Linux Networking**
### **2.1 Key Networking Commands**
| **Command** | **Purpose** | **Example** |
|---------------------------|---------------------------------------------|----------------------------------|
| `ip addr` / `ifconfig` | View interfaces and IPs | `ip addr show eth0` |
| `ss` / `netstat` | Socket statistics | `ss -tulnp` (all listening ports)|
| `tcpdump` | Packet capture | `tcpdump -i eth0 port 80` |
| `dig` / `nslookup` | DNS troubleshooting | `dig +short example.com` |
| `traceroute` / `mtr` | Network path analysis | `mtr google.com` |
**Pro Tip:**
- Use `nc` (netcat) for **quick port checks**:
```bash
nc -zv example.com 443 # Test if port 443 is open
```
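
On minimal hosts where `nc` is not installed, bash itself can attempt the connection through its `/dev/tcp` pseudo-path. The sketch below assumes bash was built with network redirections enabled (the common default) and that coreutils `timeout` is available; `port_open` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical fallback for hosts without nc: bash's built-in /dev/tcp.
# Usage: port_open HOST PORT [TIMEOUT_SECONDS]; exit status 0 means "open".
port_open() {
    local host=$1 port=$2 secs=${3:-3}
    # The redirection makes bash attempt a TCP connect; the child shell exits
    # right afterwards, which closes the socket. timeout bounds filtered ports
    # that would otherwise hang until the kernel gives up.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Usage mirrors the `nc` check: `port_open example.com 443 && echo open || echo closed`.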
---
### **2.2 Advanced Networking**
- **Network Namespaces**: Isolate network stacks (used by Docker/Kubernetes).
```bash
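ip netns list # List named namespaces (under /var/run/netns); container runtimes such as Docker typically do not register theirs here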
```
- **eBPF**: Trace network traffic at kernel level (e.g., `bpftrace`).
- **iptables/nftables**: Firewall rules; `nftables` is the modern replacement for legacy `iptables`.
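
To see namespace isolation without creating anything, you can compare the namespace identifiers the kernel exposes under `/proc`. This is only a sketch: `netns_of` and `same_netns` are hypothetical names, and reading the link for another user's process generally requires root.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: identify the network namespace a process lives in.
# /proc/PID/ns/net is a symlink such as "net:[4026531840]"; processes that
# share a namespace see the same inode number.
netns_of() {
    readlink "/proc/$1/ns/net"
}

# Succeeds when both PIDs are in the same network namespace.
same_netns() {
    local a b
    a=$(netns_of "$1") && b=$(netns_of "$2") && [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Comparing a container's main PID with PID 1 on the host (`same_netns 1 CONTAINER_PID`) should fail when the container has its own network stack.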
---
## **3. Resource Triage (Troubleshooting)**
### **3.1 The "Big Four" Resources**
1. **CPU**
- **Tool:** `mpstat -P ALL 1` (per-core usage).
- **Red Flag:** sustained `%idle` < 10% → CPU bottleneck. One saturated core with the rest idle points to a single-threaded workload.
2. **Memory**
- **Tools:** `free -h`, `cat /proc/meminfo`.
- **Key Metrics:**
- **Available** (not "free") memory.
- **Swap activity**: nonzero `si`/`so` in `vmstat 1` → Memory pressure. Swap merely being in use is not a problem by itself.
3. **Disk I/O**
- **Tools:** `iostat -xz 1`, `df -h`.
- **Red Flags:**
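- `%util` > 70% sustained (device busy; on SSD/NVMe/RAID, which serve requests in parallel, high `%util` alone does not prove saturation).
- `await` > 10ms (slow I/O; newer `iostat` versions report it as `r_await`/`w_await`).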
4. **Network**
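- **Tools:** `sar -n DEV 1` (throughput), `sar -n EDEV 1` (errors and drops), `ethtool -S eth0` (NIC counters).
- **Red Flags:**
- `RX/TX drops` increasing → Congestion or NIC issues. The counters are cumulative, so watch the rate of change.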
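
Two of the metrics above, available memory and interface drops, come straight from text files under `/proc`. The sketch below parses them from standard input so the same functions work on live files or on copies saved during an incident; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical parsers that read kernel text files on stdin.

# MemAvailable as a percentage of MemTotal (input: /proc/meminfo format).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", 100 * avail / total }'
}

# Per-interface drop counters (input: /proc/net/dev format).
# After "iface:", RX drop is the 4th counter and TX drop is the 12th.
net_drops() {
    awk 'NR > 2 {
             sub(/:/, " ")                 # "eth0:123" -> "eth0 123"
             print $1, "rx_drop=" $5, "tx_drop=" $13
         }'
}

# Live usage:
# mem_available_pct < /proc/meminfo
# net_drops < /proc/net/dev
```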
---
### **3.2 Triage Workflow**
1. **Identify the bottleneck**:
```bash
uptime # Load averages (1m, 5m, 15m)
dmesg -T | tail # Kernel logs (OOMs, hardware errors)
```
2. **Drill down**:
- High CPU? → `top`, then `pidstat -p PID 1`.
- High RAM? → `grep VmRSS /proc/PID/status`.
3. **Check dependencies**:
- Database slow? → `ss -t src :5432` on the Postgres host (connections to its port); from a client, use `ss -t dst :5432`.
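
The first two steps of this workflow can be scripted with nothing but `/proc`. The following is a sketch with hypothetical function names: `load_per_core` puts the load average in context by dividing it by the core count, and `rss_kb` extracts the resident memory of one process.

```shell
#!/usr/bin/env bash
# Hypothetical triage helpers built only on /proc.

# 1-minute load average divided by core count. Values persistently above 1.0
# mean more runnable (or I/O-blocked) tasks than cores.
# Usage: load_per_core "$(cat /proc/loadavg)" "$(nproc)"
load_per_core() {
    echo "$1" | awk -v cores="$2" '{ printf "%.2f\n", $1 / cores }'
}

# Resident set size of a process in kB, read from /proc/PID/status.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}
```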
---
## **4. Performance Optimization**
### **4.1 CPU**
- **Pin processes to cores**: `taskset -c 0,1 ./script.sh`.
- **Limit CPU usage**: `cpulimit -p PID -l 50` (max 50% of one core; `cpulimit` is a separate package on most distributions).
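
After pinning, it is worth confirming that the affinity took effect. A small sketch, assuming a hypothetical helper name `cpu_affinity`, reads the allowed-core list the kernel records for the process:

```shell
#!/usr/bin/env bash
# Hypothetical check: which cores may this PID run on?
# Cpus_allowed_list in /proc/PID/status reflects affinity set via taskset.
cpu_affinity() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$1/status"
}

# Example (assumes core 0 is available to you):
# taskset -c 0 sleep 60 &
# cpu_affinity $!      # should report only core 0
```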
### **4.2 Memory**
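- **Clear caches** (in emergencies or for benchmarking only; the kernel reclaims cache on its own, and dropping it slows subsequent reads):
```bash
sync                                        # Flush dirty pages so more cache can be dropped
echo 3 | sudo tee /proc/sys/vm/drop_caches  # Drops pagecache, dentries, inodes
```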
- **OOM Killer Tuning**: Adjust `/proc/PID/oom_score_adj` (range -1000 to 1000; -1000 exempts the process, higher values make it a likelier victim).
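
Before tuning, check where a process currently stands. The sketch below uses a hypothetical helper `oom_info` to print the kernel's computed score next to the adjustment value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "oom_score oom_score_adj" for a PID.
# oom_score is the kernel's current badness value; oom_score_adj is the
# user-controlled bias applied to it.
oom_info() {
    local dir="/proc/$1"
    [ -r "$dir/oom_score" ] || return 1
    echo "$(cat "$dir/oom_score") $(cat "$dir/oom_score_adj")"
}

# Raising the value for your own process needs no privileges;
# lowering it generally requires root:
# echo 500 > /proc/PID/oom_score_adj
```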
### **4.3 Disk**
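- **I/O Scheduler**: Switch to `deadline` for databases (named `mq-deadline` on current multi-queue kernels; requires root and does not persist across reboots):
```bash
cat /sys/block/sda/queue/scheduler                          # List schedulers; the active one is in brackets
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler  # Select it for sda
```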
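
The scheduler file lists every available scheduler and marks the active one with square brackets. When checking many disks, a small parser helps; this sketch assumes that bracketed format, and `active_scheduler` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the active scheduler from the sysfs format,
# where the current choice is bracketed, e.g. "none [mq-deadline] kyber bfq".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Live usage (the device name sda is an assumption):
# active_scheduler < /sys/block/sda/queue/scheduler
```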
---
## **5. Must-Know Files & Directories**
| **Path** | **Purpose** |
|---------------------------|---------------------------------------------|
| `/proc/PID/` | Process details (limits, stats, FD usage) |
| `/etc/sysctl.conf` | Kernel tuning (e.g., `vm.swappiness`) |
| `/var/log/` | System logs (auth, kernel, apps) |
| `/sys/class/net/`         | Per-interface attributes and counters (MTU, state, `statistics/`) |
---
## **6. Interview Questions**
1. **How do you find which process is listening on port 8080?**
```bash
ss -tulnp 'sport = :8080'   # Filter inside ss; "grep 8080" would also match 18080. Run as root to see other users' PIDs.
```
2. **What's the difference between `kill -9` and `kill -15`?**
- `-15` (SIGTERM): Graceful shutdown; the process can catch it and clean up.
- `-9` (SIGKILL): Forceful termination (last resort); it cannot be caught or ignored, so no cleanup runs.
3. **How do you debug high CPU usage?**
- `top` → `pidstat -p PID 1` → `strace -p PID`.
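
A common follow-up to question 1 is where `ss` gets its data. For IPv4 TCP, the kernel exposes the socket table as text in `/proc/net/tcp`, with addresses and ports in hexadecimal and state `0A` meaning LISTEN. The sketch below decodes the listening ports from that format; `listening_tcp_ports` is a hypothetical name, and IPv6 sockets live in `/proc/net/tcp6`, which is not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list listening IPv4 TCP ports from /proc/net/tcp-format
# text on stdin. Columns: sl local_address rem_address st ...
listening_tcp_ports() {
    local sl local_addr rem_addr state rest
    while read -r sl local_addr rem_addr state rest; do
        [ "$state" = "0A" ] || continue          # 0A = LISTEN; also skips the header
        printf '%d\n' "0x${local_addr##*:}"      # hex port after the colon -> decimal
    done | sort -nu
}

# Live usage:
# listening_tcp_ports < /proc/net/tcp
```

Unlike `ss -p`, this does not map sockets to processes; that mapping requires matching the socket inode against `/proc/PID/fd`.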
---
## **7. Lab Exercises**
1. **Simulate a CPU hog**:
```bash
while true; do :; done & # Background infinite loop
```
Then monitor it with `pidstat -p $! 1` and stop it with `kill $!` when done.
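2. **Trigger OOM Killer** (only in a disposable VM or a memory-limited container; other processes may be killed):
```bash
tail /dev/zero # tail buffers the newline-free stream until memory runs out
```
Check `dmesg -T | grep -i "out of memory"` for OOM events.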
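
For the CPU hog exercise, you can also measure consumption directly from `/proc/PID/stat` when `pidstat` is not installed. This is a sketch under the assumption that fields 14 and 15 (utime and stime, in clock ticks) are what you want to sum; `cpu_ticks` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical lab helper: total CPU time (user + system, in clock ticks)
# a process has consumed, from fields 14 and 15 of /proc/PID/stat.
cpu_ticks() {
    local stat
    stat=$(cat "/proc/$1/stat") || return 1
    stat=${stat##*\) }      # drop "pid (comm) "; comm itself may contain spaces
    set -- $stat            # $1 is now field 3 (state), so utime is ${12}
    echo $(( ${12} + ${13} ))
}

# Lab usage:
# ( while :; do :; done ) & hog=$!
# before=$(cpu_ticks "$hog"); sleep 5; after=$(cpu_ticks "$hog")
# echo "ticks used: $((after - before)); ticks per second: $(getconf CLK_TCK)"
# kill "$hog"
```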
---
## **Cheat Sheet**
```bash
# Quick system overview
dmesg | tail # Kernel errors
vmstat 1 # CPU/memory/IO
ss -tulnp # Listening ports
df -h # Disk space
```
**Next Steps:**
- Learn **eBPF** for advanced tracing (`bpftrace`).
- Practice **container networking** (Docker, Kubernetes).
- Master **log aggregation** (ELK, Loki).