The First 10 Minutes: Troubleshooting a Linux Server

or How I Learned to Stop Panicking and Love the Command Line


The Phone Call

It’s 2 AM. Your phone is ringing. Again. You know what this means.

“The server is down.”

No, it’s not just down. Users are screaming. Executives are panicking. Money is being lost by the second. And somehow, somehow, you’re the person who needs to figure out what’s wrong with a server you’ve never seen before, running applications you didn’t write, on infrastructure you didn’t design.

Welcome to operations, my friend. Let me tell you a story.

Back when our team was dealing with operations, optimization, and scalability, we had our fair share of… let’s call them “memorable experiences.” Picture this: tight deadlines, exotic technical stacks, and approximately zero documentation. The cause of an issue was rarely obvious. Actually, it was never obvious.

Here’s what we learned the hard way, now updated for 2024-2025 because—surprise—the industry has changed just a bit since 2015.


Before You SSH: The “Please Don’t Make This Worse” Phase

STOP. Put down that keyboard. I know you want to look like a hero and start typing commands at lightning speed. Don’t.

First, you need context. Running commands without context is like performing surgery in the dark—technically possible, but strongly discouraged.

The Questions That Might Save Your Job

Ask these before you touch anything:

  • What exactly are the symptoms? “It’s slow” is not a symptom. “API response times went from 200ms to 30 seconds” is a symptom.
  • When did it start? Was it exactly when Karen from Marketing deployed her “tiny little change”?
  • Can we reproduce it? Or is this one of those lovely heisenbug situations?
  • Any patterns? Does it happen every hour? Every Tuesday? During full moons?
  • What changed recently? (Someone always changed something. Someone always says “nothing changed.” They’re always wrong.)
  • Which users are affected? Everyone? Just mobile users? Just people in Europe? Just your CEO? (Please not just your CEO.)
  • Got any documentation? (Narrator: They did not have documentation.)

The Unicorn Questions (They Probably Don’t Exist, But You Can Dream)

  • Is there a monitoring platform?

    • Modern: Prometheus + Grafana, Datadog, CloudWatch, New Relic (Look at you, fancy!)
    • Legacy: Munin (outdated), Nagios (my condolences), or that one Perl script Dave wrote in 2003
    • Reality: A sticky note that says “if it breaks, call Steve”
  • Any centralized logs?

    • Modern: Loki, ELK Stack, Splunk (if your company has budget)
    • Legacy: grep-ing through /var/log on 47 different servers at 3 AM
    • Reality: “Logs? What logs?”
  • Distributed tracing?

    • Modern: Jaeger, Zipkin, Tempo
    • Reality: [laughs in microservices]
  • What’s the infrastructure?

    • Bare metal that’s older than some of your junior devs?
    • VMs running on that one physical server in the closet?
    • Docker containers orchestrated by… hope and systemd?
    • Kubernetes? (My condolences, different kind of pain)
    • Cloud? (AWS bills giving you anxiety dreams?)

If you have none of the above, make a mental note to fix this. Then make a written note. Then make a JIRA ticket that will sit in the backlog forever. Then cry a little and move on.


Minutes 1-3: The “OMG What Have I Gotten Myself Into” Phase

You’ve SSH’d in. Deep breath. Here’s your 30-second panic snapshot:

# The "How Bad Is It Really?" starter pack
uptime                           # Is the load average 0.5 or 500? Big difference.
w                                # Who else is logged in? (And are they making it worse?)
systemctl --failed              # What services have given up on life?
journalctl -xep err -n 50       # Recent errors (now with helpful explanations!)
dmesg -T | tail -50             # What's the kernel yelling about?

Modern systems (systemd-based):

systemctl status                 # The "how's everything doing?" command
systemctl list-units --state=failed --type=service  # The "oh no" list

Container environments (because it’s 2025 and everything is containers now):

docker ps -a                     # Are your containers running or just... existing?
kubectl get pods --all-namespaces  # The K8s version of "show me the carnage"
kubectl get events --sort-by='.lastTimestamp' | tail -20  # What just exploded?

Pro tip from someone who learned this the hard way:

If uptime shows a load average of 387.00 on a 4-core machine, you’re not troubleshooting anymore. You’re performing an autopsy. The server isn’t down—it’s dead. These are different situations requiring different emotional preparation.


Minutes 3-5: The “Who Did This and When?” Investigation

Check Who’s Been Messing Around

$ w                              # Who's logged in RIGHT NOW?
$ last                           # Who was here before everything went wrong?

There’s nothing quite like discovering that three other people are currently logged into the same production server, all frantically typing commands, none of them coordinating with each other. It’s like a bad heist movie, except instead of stealing diamonds, you’re all trying to steal uptime back.

One cook in the kitchen is enough. Too many cooks and you end up with someone running kill -9 on what they think is a stuck process but is actually the main application. True story. Not mine. Definitely not mine. Moving on.

Check What Crimes Were Committed Recently

$ history                        # The confession log

Hot tip: Set HISTTIMEFORMAT="%F %T " in your bashrc to timestamp your history. Future-you (and your manager during the post-mortem) will thank past-you. Nothing is more frustrating than seeing rm -rf /var/cache/* in the history and having no idea if it was run 5 minutes ago or 5 months ago.

Spoiler: It was 5 minutes ago. It’s always 5 minutes ago.
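
If you want that history tip spelled out, it’s two commands (bash assumed; zsh handles timestamps differently):

$ echo 'export HISTTIMEFORMAT="%F %T "' >> ~/.bashrc   # Persist it for future sessions
$ export HISTTIMEFORMAT="%F %T "                       # And turn it on in the current shell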


Minutes 5-7: The “What’s Actually Running Here?” Phase

Process Archaeology

$ pstree -a                      # The family tree of processes
$ ps aux                         # Everything. Just... everything.
$ ps aux --sort=-%mem | head -20  # Memory hogs (looking at you, Java)
$ ps aux --sort=-%cpu | head -20  # CPU hogs (still looking at you, Java)

Systemd services (because init.d is so 2010):

$ systemctl list-units --type=service --state=running
$ systemctl list-units --type=service --state=failed  # The walk of shame

pstree -a is beautiful. It shows you the ancestry of every process. You’ll find things like:

  • The web server that spawned 47 PHP-FPM workers
  • The Python script someone started in a screen session 3 years ago
  • Seven different Java processes, and nobody knows what any of them do
  • That cryptocurrency miner someone definitely didn’t install

Look for the unexpected. If you see a process called definitely_not_bitcoin_miner, that’s probably worth investigating.

What’s Listening? (And Should It Be?)

Modern approach (netstat is so deprecated it hurts):

$ ss -tulpn                      # TCP/UDP listening with process names
$ ss -s                          # Socket statistics (the summary)
$ ss -tanp                       # All TCP connections
$ ss -o state established        # Active connections with timer info

Legacy approach (for those still running CentOS 6):

$ netstat -ntlp                  # If you must...

You’re looking for:

  • Unexpected listening ports (port 4444? That’s not suspicious at all!)
  • Services that shouldn’t be running (why is there a Minecraft server on production?)
  • Multiple Java processes all listening on slightly different ports (enterprise!)
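
One quick way to surface the weird ones: filter out the ports you expect. The allow-list here is just an example; tune it to your stack.

$ ss -tulpn | grep -vE ':(22|80|443|3306|5432|6379)\s'   # Anything left over deserves a second look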

Container networking (the modern nightmare):

$ docker network ls              # What network are we even on?
$ kubectl get services           # Service discovery, or: "how does anything find anything?"
$ kubectl get ingress            # How does the outside world get in?

Minutes 7-9: The “Resources Check” (CPU, RAM, and Existential Dread)

$ free -h                        # Any free memory? Is swap in use? (It shouldn't be!)
$ uptime                         # Load average: acceptable, concerning, or apocalyptic?
$ top                            # Classic. Boring. Reliable.
$ htop                           # Top's cooler younger sibling

Key questions:

  • Any free RAM? If the answer is “no” and swap is being used heavily, congratulations! You’ve found your problem. Servers shouldn’t swap. Ever. Swapping is just “slow motion crashing.”
  • Is there CPU left? A load average of about 1.0 per core is usually fine (quick check after this list). A load average of 50 on a 4-core machine means something has gone spectacularly wrong.
  • What’s causing the load? One rogue process at 400% CPU? Or is everything just a little bit terrible?
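
A quick way to apply the per-core rule of thumb from the list above:

$ nproc                          # How many cores do we actually have?
$ cat /proc/loadavg              # 1-, 5-, and 15-minute load averages; compare them to the core count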

Modern memory pressure check:

$ cat /proc/pressure/memory      # PSI - how stressed is your memory? (Very.)
$ vmstat 1 5                     # Check for swapping (si/so columns)

If vmstat shows non-zero numbers in the si (swap in) and so (swap out) columns, your server is swapping. Your server should not be swapping. This is like your car’s “check engine” light, except it means “ENGINE IS CURRENTLY ON FIRE.”
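
While you’re staring at memory numbers, check whether the OOM killer has already started sacrificing processes. A quick sketch (adjust the time window to your incident):

$ journalctl -k --since "1 hour ago" | grep -iE 'out of memory|oom-killer'   # Recent OOM killer activity
$ dmesg -T | grep -i 'killed process'                                        # Same idea, works on older setups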

Container resources:

$ docker stats --no-stream       # Who's hogging what?
$ kubectl top nodes              # Node resource usage
$ kubectl top pods               # Pod resource usage

Pro tip: That one pod using 15GB of RAM? That’s probably your problem. Unless you have 64GB and everything else is fine. Then it’s probably fine. Maybe. Check the logs to be sure.


Minutes 9-10: The “Disk Space Blues”

$ df -h                          # Filesystem usage (please don't be at 100%)
$ df -i                          # Inode usage (the SURPRISE problem!)
$ du -sh */ | sort -h            # Where did all my space go?
$ du -hx --max-depth=1 / | grep -E '[0-9]G'  # Find the chunky directories

Things that will ruin your day:

  1. A partition at 100% capacity (the classic)
  2. Inodes at 100% but disk space free (the “wait, what?” problem; see the one-liner below)
  3. A deleted log file that’s still open and holding 50GB (the invisible problem)
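
For problem #2 (inodes gone, disk free), the culprit is usually one directory with millions of tiny files: session files, mail queues, cache shards. A rough way to hunt it down (it can be slow on big trees, and /var is just the usual suspect):

$ for d in /var/*/; do echo "$(find "$d" -xdev 2>/dev/null | wc -l) $d"; done | sort -rn | head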

The invisible disk space thief:

$ lsof +L1                       # Deleted files still held open

This command is magic. It finds files that have been deleted but are still being held open by a process. The disk space isn’t actually freed until the process closes the file or dies. Usually, it’s some log file that logrotate deleted, but the application is still writing to it like nothing happened.

Fix: Restart the application. Or send it SIGHUP if you’re fancy. Or if you’re brave, kill -9 it. (Don’t do that last one in production. Or do. I’m not your manager.)
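
If you genuinely can’t restart anything right now, there’s a somewhat hair-raising workaround: truncate the deleted file through /proc. The placeholders are yours to fill in, and be very sure which file descriptor you’re zeroing before you hit Enter.

$ ls -l /proc/<pid>/fd | grep deleted        # Find the fd still pointing at the deleted file
$ : > /proc/<pid>/fd/<fd-number>             # Truncate it in place; the space comes back immediately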

Container storage:

$ docker system df               # How much space are containers hogging?
$ kubectl get pv                 # Persistent volumes

Did you know Docker images are never automatically deleted? Hope you have disk space! (You don’t.)
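
The cleanup, when you get around to it, is one command; just read what it’s about to delete before you say yes.

$ docker system prune            # Removes stopped containers, dangling images, unused networks (asks first)
$ docker image prune -a          # More aggressive: removes ALL images not used by any container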


When 10 Minutes Isn’t Enough: Going Deeper

Okay, so you’ve made it through the first 10 minutes and haven’t found the smoking gun. Time to dig deeper.

Systemd Services (Because Everything is Systemd Now)

$ systemctl status <service>     # Detailed service status
$ journalctl -u <service>        # All logs for a service
$ journalctl -u <service> -f     # Follow logs real-time (the production anxiety experience)
$ journalctl -u <service> --since "1 hour ago"  # Just the recent stuff

Find what’s broken:

$ systemctl list-units --failed  # The "wall of shame"
$ systemctl list-timers          # Modern cron jobs

System Logs: Where Applications Confess Their Sins

Modern approach (journalctl):

$ journalctl -xe                 # Recent entries with helpful explanations
$ journalctl -b                  # Everything since last boot
$ journalctl -k                  # Kernel messages
$ journalctl -p err              # Just the errors, please
$ journalctl -f                  # Follow all logs (the firehose)

Traditional logs (for the old school):

$ tail -f /var/log/syslog        # Debian/Ubuntu
$ tail -f /var/log/messages      # RHEL/CentOS
$ grep -i error /var/log/syslog  # The grep-driven debugging lifestyle

Container logs:

$ docker logs -f <container>     # Watch a container's existential crisis in real-time
$ kubectl logs <pod>             # Pod logs
$ kubectl logs --previous <pod>  # Logs from BEFORE the crash (the good stuff)
$ stern <pod-pattern>            # Tail logs from multiple pods (chaos mode)

I/O Performance: Is the Disk the Problem?

$ iostat -xz 2 10                # Detailed I/O stats
$ vmstat 2 10                    # Virtual memory stats
$ iotop                          # Which process is murdering the disk?

Look for:

  • Disk utilization at 100% (bad)
  • High I/O wait (%iowait in top/iostat) (also bad)
  • One process doing 10,000 IOPS (probably your culprit)

eBPF: The Modern Magic (Kernel 4.x+)

eBPF is like having X-ray vision for your system, except it’s real and actually works.

Install bcc-tools first:

$ apt install bpfcc-tools  # Debian/Ubuntu (tools are suffixed -bpfcc, e.g. execsnoop-bpfcc)
$ dnf install bcc-tools    # RHEL/CentOS/Fedora (yum on older releases)

Then unleash the power:

$ biolatency                     # Block I/O latency distribution
$ execsnoop                      # See every process execution (the paranoia tool)
$ opensnoop                      # See every file open (more paranoia)
$ tcpconnect                     # Trace new TCP connections
$ tcplife                        # TCP connection lifespans

These tools have very low overhead and give you insight into what the kernel is doing at a level that would have seemed like science fiction 10 years ago.


The Application Layer: Where Code Goes to Confess

Web Servers

NGINX:

$ tail -f /var/log/nginx/access.log
$ tail -f /var/log/nginx/error.log
$ grep " 5[0-9][0-9] " /var/log/nginx/access.log  # Count your 500 errors and weep

Apache:

$ tail -f /var/log/apache2/error.log
$ grep "error" /var/log/apache2/error.log | tail -50

Look for:

  • 500/502/503/504 errors (the HTTP version of “I give up”)
  • Connection timeouts
  • Upstream failures (the backend is dead, Jim)
  • Memory exhaustion messages

Databases: Where Slow Queries Go to Die

PostgreSQL:

$ tail -f /var/log/postgresql/postgresql-*.log
$ psql -c "SELECT pid, now() - query_start as duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"

MySQL/MariaDB:

$ tail -f /var/log/mysql/error.log
$ tail -f /var/log/mysql/slow.log    # The "queries of shame" log
$ mysqlcheck --all-databases         # Check for corrupted tables

Redis:

$ redis-cli INFO                 # All the stats
$ redis-cli SLOWLOG GET 10       # The slowest commands

Common database problems:

  • Connection pool exhaustion (too many connections, not enough slots)
  • Lock contention (everybody’s waiting on everybody else)
  • Slow queries (that one query that’s been running for 45 minutes)
  • Out of connections (max_connections is a suggestion, apparently)
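
For the “out of connections” case on MySQL/MariaDB, comparing two numbers usually settles it (assuming you can log in as a user with enough privileges):

$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL VARIABLES LIKE 'max_connections';"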

Docker & Kubernetes: The Modern Pain

Docker

$ docker ps                      # Running containers
$ docker ps -a                   # ALL containers (including the dead ones)
$ docker stats                   # Real-time resource usage (watch the RAM skyrocket)
$ docker logs <container>        # The confession booth
$ docker inspect <container>     # ALL the details (JSON overload)
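
Instead of scrolling through the full inspect JSON, you can pull out just the interesting bits of the exit state with a Go template (these .State fields are standard, but double-check on exotic setups):

$ docker inspect <container> --format 'exit={{.State.ExitCode}} oom-killed={{.State.OOMKilled}} error={{.State.Error}}'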

Common Docker problems:

  • Container restart loops (CrashLoopBackOff is my favorite error message because it sounds like a techno band)
  • Out of disk space (those images add up fast)
  • Port conflicts (something’s already using 8080. Something’s always using 8080)
  • Network issues (can containers talk to each other? To the internet? To anyone?)

Kubernetes: Where Simple Problems Become Distributed Nightmares

$ kubectl get pods --all-namespaces  # The "what's running where" command
$ kubectl get events --sort-by='.lastTimestamp' | tail -50  # Recent chaos
$ kubectl describe pod <pod>     # Why won't this pod start? (Find out!)
$ kubectl logs <pod>             # What did it say before it died?
$ kubectl top nodes              # Node resources
$ kubectl top pods               # Pod resources

K8s pod states and what they mean:

  • Running - It’s working! (Probably. Check the logs to be sure.)
  • Pending - It’s… thinking about it? (Actually it can’t schedule. Probably resource constraints.)
  • CrashLoopBackOff - It starts, it crashes, it starts again, it crashes again. Einstein’s definition of insanity.
  • ImagePullBackOff - Can’t pull the image. Typo? Private registry auth? The image doesn’t exist?
  • OOMKilled - Out of memory. Set your limits higher. Or fix your memory leak. Probably both.
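
Speaking of OOMKilled: the reason a container last died is buried in the pod status, and jsonpath digs it out (first container only in this sketch):

$ kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
$ kubectl describe pod <pod> | grep -A5 "Last State"    # Same info, more human-readable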

Modern Observability: When You Actually Have Nice Things

If your company actually invested in observability (lucky you!), here’s what to check:

Prometheus + Grafana

Access Grafana dashboards and look for:

  • Sudden spikes in error rates
  • Latency increases
  • Resource saturation
  • Anything that looks like a cliff

PromQL queries that save lives:

rate(http_requests_total{status=~"5.."}[5m])  # 5xx error rate
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))  # p95 latency

Centralized Logging (ELK, Loki, Splunk)

Search for:

  • Error messages around the incident time
  • Exception stack traces
  • Connection failures
  • Timeout messages

Distributed Tracing (Jaeger, Zipkin)

Find the slow traces. Follow the request through your microservices. Watch in horror as you realize that your “fast” API calls 47 other services and takes 30 seconds.


Troubleshooting Methodologies: Smart People Have Already Figured This Out

The USE Method (Utilization, Saturation, Errors)

For every resource, check:

  • Utilization: How busy is it?
  • Saturation: Is work queuing up?
  • Errors: Is it breaking?

Apply to: CPU, memory, disk, network
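
A rough mapping of USE onto the tools we’ve already been using (not exhaustive, and a couple of cells are judgment calls):

# CPU:      utilization: top, mpstat 1         saturation: vmstat 1 ("r" run queue)                   errors: dmesg -T
# Memory:   utilization: free -h               saturation: vmstat 1 (si/so), /proc/pressure/memory    errors: journalctl -k | grep -i oom
# Disk:     utilization: iostat -xz 1 (%util)  saturation: iostat -xz 1 (await, aqu-sz)               errors: dmesg -T | grep -i 'i/o error'
# Network:  utilization: sar -n DEV 1 (sysstat)  saturation: nstat (retransmits)                      errors: ip -s link (drops, errors)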

Google’s Four Golden Signals

  1. Latency: How long do requests take?
  2. Traffic: How many requests?
  3. Errors: How many requests are failing?
  4. Saturation: How full is the service?

The RED Method (for microservices)

  • Rate: Requests per second
  • Errors: Number of failures
  • Duration: How long requests take (p50, p95, p99)

These frameworks sound fancy, but they’re basically asking: “Is it slow? Is it broken? Is it overloaded?” in a structured way.


Cloud Things: Because Nobody Runs Bare Metal Anymore

AWS

$ curl http://169.254.169.254/latest/meta-data/instance-id  # Who am I?
$ curl http://169.254.169.254/latest/meta-data/instance-type  # What am I?
$ aws ec2 describe-instances --instance-ids <id>
$ aws logs tail <log-group> --follow
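
On newer instances the metadata service often requires IMDSv2, which means grabbing a session token first:

$ TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
$ curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id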

GCP

$ curl "http://metadata.google.internal/computeMetadata/v1/?recursive=true" -H "Metadata-Flavor: Google"
$ gcloud compute instances describe <instance>

Azure

$ curl -H "Metadata:true" "http://169.254.169.254/metadata/instance?api-version=2021-02-01"
$ az vm show --resource-group <rg> --name <name>

The Commands You Run When You’re Desperate

strace: When You Need to Know Everything

$ strace -p <pid>                # Attach to a running process
$ strace -tt -T -p <pid>         # With timestamps and timing
$ strace -c <command>            # Count syscalls (usually reveals something dumb)

Warning: strace has high overhead. Using it on a production process is like doing surgery with the patient running a marathon. It works, but it’s not ideal.

Modern alternatives with less overhead:

$ perf trace -p <pid>            # Like strace but less painful
$ execsnoop                      # eBPF-based process tracing
$ opensnoop                      # eBPF-based file open tracing

After 10 Minutes: What Have We Learned?

If you’ve made it this far, you should know:

  1. What’s running: Bare metal? VMs? Containers? Kubernetes? Someone’s Raspberry Pi?
  2. Is it a resource problem? CPU pegged? Out of RAM? Disk full? Network saturated?
  3. Is it a configuration problem? Service failed to start? Wrong port? Bad permissions?
  4. Is it an application problem? Exceptions? Database deadlocks? Memory leaks?
  5. Is there a pattern? Happens every hour? Started after a deployment? Only affects mobile users?

You might have even found the root cause. If so, congratulations! You’re a hero. Go fix it and bask in the glory.

If not, at least you’re no longer operating blind. You’ve got context. You’ve got data. You know where to dig deeper.

And most importantly, you haven’t accidentally made it worse. Yet.


The Modern vs. Ancient World

Things that have changed since 2015:

  1. Systemd won. Love it or hate it, it’s everywhere. Learn journalctl. Accept your fate.
  2. eBPF is magic. Kernel-level observability with minimal overhead? Science fiction is now reality.
  3. Containers everywhere. “It works on my machine” is no longer an excuse. Now it’s “it works in my container.”
  4. Kubernetes happened. We made distributed systems so complex that we need a whole new skillset just to troubleshoot them.
  5. Observability > Monitoring. We graduated from “is it up?” to “exactly what is it doing right now across 500 microservices?”
  6. Cloud native architecture. Everything is a service. Nothing is simple anymore.
  7. Commands changed: netstat → ss, ifconfig → ip, iptables → nftables

Philosophy shift:

  • Reactive → Proactive
  • Single server → Distributed systems
  • “Did it crash?” → “What’s the p99 latency of requests to the authentication service from the payment gateway in us-east-1?”

Conclusion: You’ve Got This (Probably)

Troubleshooting production systems at 2 AM is never fun. But with a methodical approach, the right tools, and the ability to stay calm when everything is on fire, you can figure it out.

Remember:

  1. Get context first. Don’t run commands blindly.
  2. Start broad, then narrow. System-wide checks first, then drill down.
  3. Use modern tools. eBPF and systemd are your friends.
  4. Check the logs. The answer is usually there, buried in 10,000 lines of INFO messages.
  5. Don’t panic. Panic is for people who don’t have a checklist.

And hey, once you fix it, you’ll have a great story for the next time someone asks you why you deserve a raise.

Now go forth and troubleshoot. And for the love of all that is holy, please implement monitoring before the next incident.




Last updated: November 2025 - Because technology waits for no one, and your servers certainly won’t.

Written by someone who has seen things. Terrible things. At 3 AM. Multiple times.


Disclaimer: No servers were harmed in the making of this guide. Okay, maybe a few. But they were already broken.

Kevin Duane

Cloud architect and developer sharing practical solutions.