Kubernetes failures are rarely random. Most incidents repeat a small set of patterns - image pull issues, crash loops, pending pods, DNS failures, or network path problems.
I used to jump between logs, dashboards, and guesses. Over time, I adopted a repeatable sequence that consistently shortens time-to-recovery. This post shares that sequence and the scenario patterns I've documented while troubleshooting Kubernetes workloads.
My Baseline Troubleshooting Flow
Before I try anything else, I run through this exact sequence:
# 1) Find failing workloads fast
kubectl get pods -A
# 2) Inspect one failing pod deeply
kubectl describe pod <pod> -n <ns>
# 3) Check current and previous container logs
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
# 4) Confirm resource pressure and scheduling context
kubectl top nodes
kubectl top pods -n <ns>
# 5) Check related events chronologically
kubectl get events -n <ns> --sort-by=.metadata.creationTimestamp
I follow this order deliberately. In Kubernetes, the fastest clue is often not the application log. It is the event stream, the restart reason, or the scheduler telling me exactly why the pod never became healthy.
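To avoid retyping the sequence during an incident, I sometimes wrap it in a tiny helper that just prints the commands for a given pod and namespace. This is a sketch - the pod and namespace names are placeholders - and it only emits the commands, so you can review before piping to sh:

```shell
# Print the baseline triage sequence for one pod, instead of retyping it.
# Pod and namespace names below are placeholders.
triage() {
  pod="$1"; ns="$2"
  echo "kubectl describe pod $pod -n $ns"
  echo "kubectl logs $pod -n $ns"
  echo "kubectl logs $pod -n $ns --previous"
  echo "kubectl top pods -n $ns"
  echo "kubectl get events -n $ns --sort-by=.metadata.creationTimestamp"
}

# Review the commands first; run them with: triage api-6d4f prod | sh
triage api-6d4f prod
```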
How I Classify Failures
Once I see a failing pod, I classify it immediately. The classification decides everything - which commands I run next, who I involve, and what the fix path looks like.
kubectl describe pod <pod> -n <ns>
|
+--> ImagePullBackOff ----> Check image/tag/secret
|
+--> CrashLoopBackOff ---> Check --previous logs/probes/OOM
|
+--> Pending -----------> Check scheduler events/PVC/quota
|
+--> 502/503 -----------> Check ingress -> service -> endpoints
|
+--> DNS errors --------> Check DNS resolution and CoreDNS health
Each of these has a distinct diagnosis path. Treating them all as "pod is broken" wastes time.
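The branching above can be sketched as a small lookup - the step descriptions are my own shorthand for the diagnosis paths, not kubectl output:

```shell
# Map the reason shown by 'kubectl describe pod' to the next diagnostic step.
# The echoed hints are shorthand for the diagnosis paths described above.
next_step() {
  case "$1" in
    ImagePullBackOff) echo "check image tag, registry auth, pull secret" ;;
    CrashLoopBackOff) echo "check logs --previous, probes, OOM" ;;
    Pending)          echo "check scheduler events, PVC, quota" ;;
    *)                echo "read describe output and namespace events" ;;
  esac
}

next_step CrashLoopBackOff   # -> check logs --previous, probes, OOM
```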
Scenario 1: ImagePullBackOff
Symptoms
Pod stuck in ImagePullBackOff. Events show an auth error or image-not-found.
Diagnosis
kubectl describe pod <pod> -n <ns>
kubectl get secret -n <ns>
I look at the events section first. It usually says one of three things - the image tag doesn't exist, the registry credentials are wrong, or the pull secret is missing from the namespace entirely.
Common root causes
- Wrong image tag - someone pushed to latest but the deployment references a specific tag that was never built
- Registry credentials expired or were rotated without updating the cluster secret
- Pull secret exists in one namespace but wasn't copied to the namespace where the pod is running
Recovery
kubectl set image deployment/<deploy> <container>=<registry>/<image>:<tag> -n <ns>
kubectl rollout status deployment/<deploy> -n <ns>
Prevention
- Pin and validate image tags in CI - don't rely on latest
- Add pre-deploy registry access checks to your pipeline
Scenario 2: CrashLoopBackOff
Symptoms
Pod keeps restarting. Restart count climbs every few minutes.
Diagnosis
kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns>
The --previous flag is critical here. By the time you look at the pod, it might already be on its 10th restart. The current logs show a fresh boot, but the crash reason is in the previous container's output. I've seen people debug CrashLoopBackOff for 30 minutes without ever checking --previous.
Common root causes
- Missing environment variables or secrets that the app expects at startup
- Startup command mismatch - the command or args in the pod spec override the image's entrypoint in a way the image doesn't expect
- Liveness probe too aggressive - the app needs 30 seconds to boot but the probe starts checking at 10 seconds
- OOM-kill - the container hits its memory limit and gets killed by the kernel
kubectl describe pod <pod> -n <ns> | grep -E "OOMKilled|Reason|Exit Code"
If the exit code is 137, the container was killed with SIGKILL - in this context, almost always the OOM killer. Increase the memory limit or fix the memory leak.
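The 137 isn't arbitrary - a process killed by signal N exits with status 128 + N, and the OOM killer delivers SIGKILL (signal 9). You can reproduce it locally:

```shell
# 137 = 128 + 9: a process killed by signal N exits with 128 + N,
# and the kernel's OOM killer delivers SIGKILL (signal 9).
sh -c 'kill -KILL $$' || code=$?
echo "exit code: ${code}"   # exit code: 137
```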
Recovery
kubectl rollout restart deployment/<deploy> -n <ns>
But only after you've fixed the underlying issue. Restarting a CrashLooping pod without fixing the cause just resets the backoff timer.
Prevention
- Add startup probes in addition to liveness probes for slow-booting apps
- Keep probes environment-aware - staging and production might need different thresholds
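For a slow-booting app, a startup probe gates the liveness probe until the first successful check, which prevents restart loops during boot. A minimal sketch - the path, port, and thresholds are illustrative, not recommendations:

```yaml
# Illustrative probe settings for an app that needs ~30s to boot.
startupProbe:
  httpGet:
    path: /healthz
    port: 8000
  periodSeconds: 5
  failureThreshold: 12      # allows up to 12 * 5s = 60s of startup time
livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  periodSeconds: 10         # only runs once the startup probe has succeeded
```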
Scenario 3: Pod Stuck in Pending
Symptoms
Pod never gets scheduled. It stays in Pending with no container starts.
Diagnosis
kubectl describe pod <pod> -n <ns>
kubectl get nodes
kubectl get pvc -n <ns>
I don't treat Pending as one failure class. It is usually one of three things - scheduler capacity, placement policy, or storage readiness. That classification matters because each one belongs to a different owner and fix path.
Common root causes
- Not enough CPU or memory available on any node - the pod's resource requests exceed what the cluster can offer
- Node taints or affinity rules preventing scheduling - the pod has constraints that no current node satisfies
- PVC is stuck in Pending - the storage class doesn't exist, the volume can't be provisioned, or the PVC is bound to a zone where no node lives
Recovery
- Adjust resource requests and limits to match what the cluster can actually provide
- Scale the node pool or add nodes
- Fix the PVC - check storage class, provisioner, and zone alignment
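When adjusting requests, I aim for values near observed usage rather than theoretical peaks. A sketch of what that looks like in a pod spec - the numbers are illustrative:

```yaml
# Illustrative sizing: requests near observed p99 usage, limits with headroom.
resources:
  requests:
    cpu: 250m          # what the scheduler reserves on a node
    memory: 256Mi
  limits:
    memory: 512Mi      # hard ceiling; exceeding it means OOM-kill
```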
Prevention
- Capacity planning and quota reviews before traffic spikes
- Alert on unschedulable pods - don't wait for users to report it
Scenario 4: Ingress Returns 502 While Pods Look Fine
Symptoms
Pods are Running. The app works perfectly with kubectl port-forward. But the ingress or external gateway returns 502 or 503.
This is the most confusing failure I've debugged - everything looks healthy inside the cluster, but users can't reach the app.
Diagnosis
kubectl get svc -n <ns>
kubectl get endpoints <svc> -n <ns>
kubectl describe ingress <ingress> -n <ns>
kubectl logs -n <ingress-ns> deploy/<ingress-controller>
The key check is kubectl get endpoints. If the endpoints list is empty, the service selector doesn't match any running pod labels. The ingress is trying to route traffic to a service that has no backends.
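As an illustration, a selector mismatch like the one below produces exactly that empty endpoints list - the names and labels are hypothetical:

```yaml
# The service selector and pod labels must match exactly; here they don't,
# so the service has no backends and the ingress returns 502.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # looks for pods labeled app=web
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: v1
kind: Pod
metadata:
  name: web-1
  labels:
    app: web-v2       # mismatch: the service selector never matches this pod
```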
Common root causes
- Wrong service selector - the label in the service doesn't match the label on the pod
- Wrong target port - the service points to port 80 but the container listens on 8000
- Readiness probe failures - pods are running but not ready, so they don't appear in the endpoints list
- Ingress backend path or host mismatch - the ingress rule expects /api but the app serves at /
Recovery
- Fix the service selector or port mapping
- Correct the ingress backend rule
- Wait for endpoints to populate, then verify with curl
Prevention
- Add smoke tests that validate the full ingress → service → pod path after every deployment
- Keep labels and port mappings under manifest lint or policy checks
Scenario 5: Cluster DNS Failure
Symptoms
App logs show hostname resolution errors. Pods are running, but internal or external dependency calls fail with DNS errors.
Diagnosis
kubectl exec -it <pod> -n <ns> -- nslookup kubernetes.default.svc.cluster.local
kubectl get deploy -n kube-system coredns
kubectl get pods -n kube-system -l 'k8s-app in (kube-dns,coredns)'
kubectl logs -n kube-system deploy/coredns --tail=200
If the nslookup from inside the pod fails, the problem is CoreDNS, not the application. I've seen entire clusters go down because CoreDNS pods were OOM-killed or stuck in a crash loop - and every service-to-service call broke simultaneously.
Recovery
- Restart unhealthy CoreDNS pods
- Fix bad CoreDNS ConfigMap or upstream resolver issues
- In AKS, kubectl rollout restart deployment coredns -n kube-system has fixed this for me multiple times
Prevention
- Alert on CoreDNS restart spikes and resolution error rate
- Treat cluster DNS as a production dependency, not a background detail
Production Trade-offs
Every troubleshooting decision has trade-offs. Here are the ones I think about most:
Fast restarts vs preserving crash context
Restarting quickly recovers service, but if you restart before reading --previous logs and events, you lose the crash evidence. I always capture the context first, then restart.
Tight probes vs startup tolerance
Aggressive liveness probes detect failures fast, but they can create restart storms during cold starts or deployments. I use startup probes with generous initial delays for apps that take time to boot.
Higher resource requests vs cost efficiency
Requesting more CPU and memory improves scheduling stability, but increases cluster cost and reduces bin-packing efficiency. I size requests based on actual p99 usage, not peak theoretical load.
Common Mistakes
- Restarting deployments before reading events - The event stream often contains the exact failure reason. Read it first.
- Only checking current logs, not --previous - The current container just booted. The crash happened in the previous one.
- Ignoring probe configuration during triage - A misconfigured liveness probe looks like an app crash. Check probe settings before blaming the code.
- Using broad emergency permissions and forgetting rollback - Granting cluster-admin to debug is fine in an emergency. Not revoking it afterwards is a security debt.
- Assuming Running means healthy - A pod can be Running but fail every readiness check. Running means the container started. It doesn't mean it's serving traffic.
Key Takeaways
- Use a fixed sequence - get pods → describe → logs --previous → top → events. Every time.
- Classify the failure first - ImagePullBackOff, CrashLoopBackOff, Pending, 502, and DNS are different problems with different fix paths.
- Events tell you more than logs - The scheduler, kubelet, and controller manager write events before the app even starts.
- Keep recovery actions reversible - Restart before reinstall. Rollback before rewrite.
- Turn repeated incidents into platform safeguards - If the same failure happens twice, automate the detection or prevention.
This is the workflow I use for every Kubernetes incident. The scenarios change, but the sequence stays the same.
