As a Linux expert performing system analysis, I follow a structured approach to diagnose issues, optimize performance, and ensure system stability. Here’s my step-by-step methodology:

---

### **1. Define the Scope**

- Identify the problem: Is it related to performance, stability, security, or functionality?
- Gather symptoms: Error messages, logs, user reports, or observable behavior (e.g., slow response, crashes).

---

### **2. Gather System Overview**

- **Hardware**:
  ```bash
  lscpu       # CPU info
  free -h     # Memory usage
  lsblk       # Disk layout
  lspci       # PCI devices
  lsusb       # USB devices
  dmidecode   # Detailed hardware info (requires root)
  ```
- **OS/Kernel**:
  ```bash
  uname -a              # Kernel version
  cat /etc/os-release   # Distro info
  hostnamectl           # Hostname (static/transient), OS, and kernel summary
  ```

---

### **3. Check Real-Time System State**

- **Resource Monitoring**:
  ```bash
  top -c         # Interactive process view (CPU/MEM)
  htop           # Enhanced top (install if needed)
  vmstat 1       # System activity: memory, CPU, I/O
  iostat -xz 1   # Disk I/O stats
  dstat          # Combined stats (CPU, disk, network)
  nload          # Network traffic per interface
  iftop          # Network traffic per connection
  ```
- **Process Analysis**:
  ```bash
  ps auxf               # Process tree
  pstree                # Visual hierarchy
  pidstat -u -r -d 1    # Per-process CPU (-u), memory (-r), and disk (-d) stats
  ```

---

### **4. Inspect Logs**

- **System Logs**:
  ```bash
  journalctl -xe --no-pager -n 50   # Systemd logs (most recent)
  tail -f /var/log/syslog           # General logs (Debian)
  tail -f /var/log/messages         # General logs (RHEL)
  ```
- **Service-Specific Logs**: Check `/var/log/` for `nginx/`, `apache2/`, `mysql/`, etc.
- **Kernel/Errors**:
  ```bash
  dmesg -T | tail -50                      # Kernel ring buffer (time-stamped)
  grep -riI error /var/log/ 2>/dev/null    # Search text logs recursively for errors
  ```

---

### **5. Disk and Filesystem Analysis**

- **Space Usage**:
  ```bash
  df -hT                   # Filesystem space
  du -sh /* 2>/dev/null    # Directory sizes (root)
  ncdu                     # Interactive disk usage (installable)
  ```
- **I/O Performance**:
  ```bash
  iotop -o   # Disk I/O by process (active processes only)
  sar -d 1   # Live per-device stats; omit the interval to read today's history (sysstat package)
  ```
- **Filesystem Health**:
  ```bash
  fsck DEVICE            # Filesystem check (DEVICE is a placeholder; unmount first)
  smartctl -a /dev/sda   # SMART data for disks
  ```

---

### **6. Network Analysis**

- **Connections**:
  ```bash
  ss -tulnp       # Listening sockets with owning process (replaces netstat)
  netstat -tuln   # Legacy socket info
  lsof -i         # Open network connections
  ```
- **Performance**:
  ```bash
  ping HOST          # Latency check
  traceroute HOST    # Route analysis
  mtr HOST           # Route analysis with continuous per-hop statistics
  iperf3 -c SERVER   # Bandwidth test (needs `iperf3 -s` running on SERVER)
  ethtool IFACE      # NIC settings
  ```
- **Firewall/Routing**:
  ```bash
  iptables -L -n -v   # Firewall rules
  nft list ruleset    # For nftables
  ip route            # Routing table
  ```

---

### **7. Performance Profiling**

- **CPU**:
  ```bash
  perf top          # CPU profiling (install linux-tools)
  mpstat -P ALL 1   # Per-CPU stats
  ```
- **Memory**:
  ```bash
  cat /proc/meminfo   # Detailed memory stats
  slabtop             # Kernel slab cache usage
  ```
- **Bottlenecks**:
  ```bash
  strace -p PID      # System calls of a running process (PID is a placeholder)
  ltrace -p PID      # Library calls
  tcpdump -i IFACE   # Packet capture on one interface
  ```

---

### **8. Security Checks**

- **User/Sessions**:
  ```bash
  who -a    # Logged-in users
  last      # Login history
  sudo -l   # Current user’s sudo privileges
  ```
- **Audit**:
  ```bash
  systemctl status auditd   # Is the audit daemon running?
  aureport                  # Summary report of audit log events
  chkrootkit                # Rootkit scan
  rkhunter --check          # Rootkit scan
  ```
- **SUID/SGID Files**:
  ```bash
  find / -perm -4000 -type f 2>/dev/null   # SUID files
  find / -perm -2000 -type f 2>/dev/null   # SGID files
  ```

---

### **9. Configuration Review**

- **Critical Files**:
  ```bash
  cat /etc/sysctl.conf                        # Kernel parameters
  cat /etc/security/limits.conf               # User limits
  systemctl list-unit-files --state=enabled   # Enabled services
  ```
- **Cron/At Jobs**:
  ```bash
  crontab -l -u root   # Root’s cron jobs
  ls /etc/cron.*       # System cron
  atq                  # Pending at jobs
  ```

---

### **10. Reproduce and Test**

- **Stress Testing**:
  ```bash
  stress --cpu 4 --io 2 --vm 2 --vm-bytes 1G --timeout 60s   # Simulate load for a bounded time
  ```
- **Benchmark**:
  ```bash
  sysbench cpu run   # CPU benchmark (memory and fileio tests are also available)
  ```

---

### **11. Document and Resolve**

- Summarize findings (e.g., "High I/O wait due to MySQL").
- Propose fixes (e.g., optimize queries, add RAM, adjust swappiness).
- Implement changes **incrementally** and monitor impact.

---

### **Key Tools to Install if Missing**

```bash
# Debian/Ubuntu package names
apt install sysstat htop iotop nload iftop dstat ltrace strace
# perf is packaged separately: linux-perf on Debian, linux-tools-$(uname -r) on Ubuntu
```

This approach ensures thorough analysis while minimizing disruption to live systems. Adjust steps based on the specific issue (e.g., focus on `journalctl` for boot failures or `sar` for historical trends).
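
To make "implement changes incrementally and monitor impact" concrete, one option is to capture a read-only baseline before each change and compare it with a second capture afterwards. The following is a minimal sketch, not a definitive tool: the function name `collect_snapshot`, the output file layout, and the choice of commands are all assumptions made for illustration. Tools that are not installed are skipped rather than treated as errors.

```bash
#!/usr/bin/env bash
# Hypothetical helper (name and layout are illustrative assumptions):
# capture a read-only baseline of system state into a directory.

collect_snapshot() {
    # Target directory; defaults to a timestamped folder in the current directory
    local outdir="${1:-./snapshot-$(date +%Y%m%d-%H%M%S)}"
    mkdir -p "$outdir" || return 1

    # "name|command" pairs; every command here only reads state
    local checks=(
        "kernel|uname -a"
        "cpu|lscpu"
        "memory|free -h"
        "disks|df -hT"
        "sockets|ss -tuln"
        "vmstat|vmstat 1 2"
    )

    local entry name cmd tool
    for entry in "${checks[@]}"; do
        name="${entry%%|*}"   # text before the first '|'
        cmd="${entry#*|}"     # text after the first '|'
        tool="${cmd%% *}"     # first word of the command
        if command -v "$tool" >/dev/null 2>&1; then
            # Unquoted $cmd is intentional so the arguments are word-split
            $cmd >"$outdir/$name.txt" 2>&1 || echo "exit status $?" >>"$outdir/$name.txt"
        else
            echo "skipped: $tool not installed" >"$outdir/$name.txt"
        fi
    done

    # Print the directory so callers can capture it
    echo "$outdir"
}
```

A typical use would be `before=$(collect_snapshot /tmp/before)`, then applying one change, then `after=$(collect_snapshot /tmp/after)` and `diff -ru "$before" "$after"` to see what moved. Keeping the command list short and read-only is deliberate, so the capture itself adds little load to a live system.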