How I Troubleshoot Linux Servers in Production

March 3, 2026

Part 1 of 2

Production outages are rarely caused by one thing. They are usually a chain of small failures. My approach is simple - reduce uncertainty quickly, restore service, then prevent it from happening again.

1. Stabilize and Define Impact

Before touching anything, understand the scope.

  • What is broken?
  • Who is affected?
  • Is this partial or total?

Start with service status:

systemctl status nginx
journalctl -u nginx -n 50

This tells me if the service is running, when it last restarted, and what the recent logs say.

2. Check Host Resources

A surprising number of incidents are resource-related.

top
free -h
df -h

CPU
If one process is eating 100% CPU, everything else slows down. top shows this immediately: look for the process sitting at the top of the list, consuming entire cores.
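A non-interactive version of the same check is handy when you want to paste output into an incident channel. This sketch assumes procps `ps` (standard on most Linux distributions):

```shell
# Top 5 CPU consumers, sorted, without opening an interactive tool.
ps aux --sort=-%cpu | head -6
```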

Memory
When RAM is full, the kernel starts killing processes (OOM killer). I once saw a Java app getting OOM-killed every 20 minutes because it had no heap limit set. The app kept restarting, logs looked clean, but dmesg | grep -i oom revealed the kills.
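A sketch of the same memory check, reading /proc/meminfo directly (which is what free parses) and scanning the kernel log for OOM kills. Linux-only, and the journalctl step assumes a systemd host:

```shell
# Print available memory in MiB straight from the kernel's accounting.
awk '/^MemAvailable/ { printf "available: %d MiB\n", $2 / 1024 }' /proc/meminfo

# OOM kills land in the kernel log; journalctl -k is the systemd route,
# equivalent to the dmesg | grep -i oom check above.
journalctl -k --no-pager 2>/dev/null | grep -i 'out of memory' || true
```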

Disk
Services can't write logs or temp files when disk is full. This one is sneaky - the app might still be running but silently failing on every write operation.

If any of these are saturated, fix that first. Everything else is a symptom.
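The "fix saturation first" rule can be scripted. A minimal sketch that flags any filesystem at or above a usage threshold; the 80% default here is an illustrative value, not a universal limit:

```shell
#!/bin/sh
# Flag any filesystem at or above the given usage threshold (percent).
check_disk() {
    threshold=${1:-80}
    df -P | awk -v t="$threshold" '
        NR > 1 { use = $5; sub(/%/, "", use)
                 if (use + 0 >= t) print $6 " at " use "%" }'
}

check_disk 80
```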

3. Use Logs to Build a Timeline

Logs tell the story of what happened. I look at both system and app logs and match timestamps.

journalctl --since "30 min ago"
tail -100 /var/log/app/error.log

Three questions to answer:

  • What failed first?
  • What cascaded after that?
  • What changed recently?

The first failure in the timeline is usually the root cause. In one incident, I saw Nginx 502s flooding the logs. But scrolling back a few seconds, the first error was actually a PostgreSQL connection failure. The database went down, the app couldn't serve requests, and Nginx started returning 502s. Fixing the database fixed everything - Nginx was never the problem.
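Matching timestamps across files is trivial when both logs use sortable (ISO 8601) timestamps. A toy sketch with two fabricated log lines, just to show the mechanics of finding the first failure:

```shell
# Two fake logs; the timestamps and messages are invented for the demo.
cat > /tmp/nginx.log <<'EOF'
2026-03-03T10:00:05 nginx: 502 upstream error
EOF
cat > /tmp/app.log <<'EOF'
2026-03-03T10:00:01 app: could not connect to postgres
EOF

# Sorting the merged stream puts the first failure on top.
sort /tmp/nginx.log /tmp/app.log | head -1
```

The database error sorts first even though the 502s are louder, which is exactly the pattern from the incident above.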

4. Validate Network and Dependencies

Even healthy services fail when their dependencies are unavailable.

ss -tlnp
curl http://localhost:8000/health
nslookup db.example.com

ss -tlnp
Shows all listening ports. If your service isn't listening, it's not running properly.

curl health check
A 200 means alive. A 502 or timeout means something upstream is broken.

nslookup
If the hostname doesn't resolve, your app can't connect - even if the dependency itself is fine.
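The three checks above can be folded into one dependency sweep. A sketch using bash's /dev/tcp redirection (bash-specific; the hosts and ports are placeholders, not from the article):

```shell
#!/bin/bash
# Succeed if host:port accepts a TCP connection within 2 seconds.
check_port() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Placeholder dependency list: "host port" pairs.
for dep in "127.0.0.1 8000" "127.0.0.1 5432"; do
    set -- $dep
    if check_port "$1" "$2"; then
        echo "OK   $1:$2"
    else
        echo "FAIL $1:$2"
    fi
done
```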

5. Recover Safely, Then Harden

Apply the smallest safe fix first:

  • Restart the process
  • Free up disk space
  • Rollback a bad deployment
  • Revert a config change

Then add a permanent measure so it doesn't happen again.

6. Know When to Escalate

Not every incident should be debugged solo. Escalate when:

  • Data loss risk - Database corruption, storage failure, or backups not restoring
  • Security incident - Unauthorized access, exposed secrets, or suspicious activity
  • Prolonged outage - You've been debugging for 30+ minutes with no clear direction
  • Beyond your access - The fix requires infrastructure changes you can't make (cloud provider, DNS registrar, etc.)

Escalating early is not a failure. Sitting on a P1 for too long is.


Real Examples From Production

Nginx returning 502 - disk was full

Users reported 502 Bad Gateway. Looked like a networking issue. But the root cause was disk exhaustion.

App debug logs were growing unchecked. They filled the entire disk. Nginx couldn't write to its temp directory, so it started returning 502s.

Immediate fix:

du -h /var | sort -h | tail -20
truncate -s 0 /var/log/app/debug.log
systemctl restart nginx

Permanent fix:

  • Added logrotate to rotate logs daily and keep 7 days
  • Changed log level from DEBUG to INFO in production
  • Added disk usage alerts at 80% and 90%
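The logrotate piece might look roughly like this; the path, rotation count, and options are illustrative, not the exact config from the incident:

```
# /etc/logrotate.d/app  (illustrative)
/var/log/app/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
```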

Confirmed: df -h showed 40% disk usage after cleanup, and curl returned 200 on all endpoints.

Database connection failures from AKS pods

App pods suddenly couldn't connect to the PostgreSQL server. The error was "could not translate host name". The database was fine - the problem was DNS resolution inside the Kubernetes cluster.

Diagnosis:

kubectl exec -it <pod> -- nslookup db-server.postgres.database.azure.com
kubectl get pods -n kube-system

CoreDNS pods were in a crash loop. Restarting them fixed the resolution, and pods reconnected immediately.

Confirmed: kubectl logs showed successful database queries after running kubectl rollout restart deployment coredns -n kube-system, and health endpoints returned 200.

Gunicorn not picking up new deployment

Deployed a new build, but the app was still serving old code. The deployment script used systemctl reload gunicorn, which tells Gunicorn to gracefully reload its workers. But a graceful reload doesn't follow symlink changes - the master keeps serving from the path it resolved at startup.

Fix:

Changed all deployment scripts from reload to restart:

sudo systemctl restart gunicorn

Confirmed: Hit the health endpoint and saw the new build version in the response.

The lesson - reload and restart are not the same. Know the difference for each service.
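The symlink mechanics behind this are easy to demonstrate. This sketch only shows why a process that resolved the release path once keeps the stale target; it doesn't touch Gunicorn itself, and the /tmp paths are hypothetical:

```shell
# Simulate a release swap under /tmp.
mkdir -p /tmp/demo/releases/v1 /tmp/demo/releases/v2
ln -sfn /tmp/demo/releases/v1 /tmp/demo/current

old=$(readlink -f /tmp/demo/current)     # what a running process resolved

ln -sfn /tmp/demo/releases/v2 /tmp/demo/current   # the deploy swaps the link

new=$(readlink -f /tmp/demo/current)     # a fresh start resolves the new path
echo "old=$old new=$new"
```

The already-resolved path still points at v1; only a fresh resolution (a restart) picks up v2.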


My Go-To Commands During Incidents

systemctl status <service>
journalctl -u <service> -n 50
top
free -h
df -h
du -h /var | sort -h | tail -20
ss -tlnp
curl http://localhost:<port>/health
ps aux --sort=-%mem | head
uptime

Rules I Follow in Production

  • Change one thing at a time. If you change two things and it works, you don't know which fixed it.
  • Record every command and observation. You'll need this for the post-mortem.
  • Prefer reversible actions first. Restart before reinstall. Revert before rewrite.
  • Communicate status updates frequently. Your team needs to know what's happening.
  • Always close the loop. Write a root-cause summary and create follow-up tasks.

Key Takeaways

  • Stabilize first - Define impact, check service status
  • Check resources early - CPU, memory, disk solve most mysteries
  • Build a timeline from logs - Find the first failure, not the loudest one
  • Validate dependencies - Network, DNS, and upstream services matter
  • Recover then harden - Smallest safe fix first, permanent prevention second
  • Escalate when needed - Data loss, security, or 30+ minutes with no progress
  • Follow a repeatable process - Observe, narrow, test, recover, prevent

I've used this exact workflow on every production incident I've faced, and it has held up consistently.