Kubernetes failures are rarely random. Most incidents repeat a small set of patterns - image pull issues, crash loops, pending pods, DNS failures, or network path problems.
I used to jump between logs, dashboards, and guesses. Over time, I adopted a repeatable sequence that consistently shortens time-to-recovery. This post shares that sequence and the scenario patterns I've documented while troubleshooting Kubernetes workloads.
My Baseline Troubleshooting Flow
Before I try anything else, I run through this exact sequence:
# 1) Find failing workloads fast
kubectl get pods -A
# 2) Inspect one failing pod deeply
kubectl describe pod <pod> -n <ns>
# 3) Check current and previous container logs
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
# 4) Confirm resource pressure and scheduling context
kubectl top nodes
kubectl top pods -n <ns>
# 5) Check related events chronologically
kubectl get events -n <ns> --sort-by=.metadata.creationTimestamp
I follow this order deliberately. In Kubernetes, the fastest clue is often not the application log. It is the event stream, the restart reason, or the scheduler telling me exactly why the pod never became healthy.
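To avoid retyping the sequence during an incident, I sometimes wrap it in a tiny helper that just prints the commands for a given pod and namespace. This is a sketch - the pod and namespace names are placeholders - and it only emits the commands, so you can review before piping to sh:

```shell
# Print the baseline triage sequence for one pod, instead of retyping it.
# Pod and namespace names below are placeholders.
triage() {
  pod="$1"; ns="$2"
  echo "kubectl describe pod $pod -n $ns"
  echo "kubectl logs $pod -n $ns"
  echo "kubectl logs $pod -n $ns --previous"
  echo "kubectl top pods -n $ns"
  echo "kubectl get events -n $ns --sort-by=.metadata.creationTimestamp"
}

# Review the commands first; run them with: triage api-6d4f prod | sh
triage api-6d4f prod
```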
How I Classify Failures
Once I see a failing pod, I classify it immediately. The classification decides everything - which commands I run next, who I involve, and what the fix path looks like.
kubectl describe pod <pod> -n <ns>
|
+--> ImagePullBackOff ----> Check image/tag/secret
|
+--> CrashLoopBackOff ---> Check --previous logs/probes/OOM
|
+--> Pending -----------> Check scheduler events/PVC/quota
|
+--> 502/503 -----------> Check ingress -> service -> endpoints
|
+--> DNS errors --------> Check DNS resolution and CoreDNS health
Each of these has a distinct diagnosis path. Treating them all as "pod is broken" wastes time.
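The branching above can be sketched as a small lookup - the step descriptions are my own shorthand for the diagnosis paths, not kubectl output:

```shell
# Map the reason shown by 'kubectl describe pod' to the next diagnostic step.
# The echoed hints are shorthand for the diagnosis paths described above.
next_step() {
  case "$1" in
    ImagePullBackOff) echo "check image tag, registry auth, pull secret" ;;
    CrashLoopBackOff) echo "check logs --previous, probes, OOM" ;;
    Pending)          echo "check scheduler events, PVC, quota" ;;
    *)                echo "read describe output and namespace events" ;;
  esac
}

next_step CrashLoopBackOff   # -> check logs --previous, probes, OOM
```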
Scenario 1: ImagePullBackOff
Symptoms
Pod stuck in ImagePullBackOff. Events show an auth error or image-not-found.
Diagnosis
kubectl describe pod <pod> -n <ns>
kubectl get secret -n <ns>
I look at the events section first. It usually says one of three things - the image tag doesn't exist, the registry credentials are wrong, or the pull secret is missing from the namespace entirely.
Common root causes
- Wrong image tag - someone pushed to latest but the deployment references a specific tag that was never built
- Registry credentials expired or were rotated without updating the cluster secret
- Pull secret exists in one namespace but wasn't copied to the namespace where the pod is running
Recovery
kubectl set image deployment/<deploy> <container>=<registry>/<image>:<tag> -n <ns>
kubectl rollout status deployment/<deploy> -n <ns>
Prevention
- Pin and validate image tags in CI - don't rely on latest
- Add pre-deploy registry access checks to your pipeline
Scenario 2: CrashLoopBackOff
Symptoms
Pod keeps restarting. Restart count climbs every few minutes.
Diagnosis
kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns>
The --previous flag is critical here. By the time you look at the pod, it might already be on its 10th restart. The current logs show a fresh boot, but the crash reason is in the previous container's output. I've seen people debug CrashLoopBackOff for 30 minutes without ever checking --previous.
Common root causes
- Missing environment variables or secrets that the app expects at startup
- Startup command mismatch - the command or args in the pod spec override the image's entrypoint in a way the image doesn't expect
- Liveness probe too aggressive - the app needs 30 seconds to boot but the probe starts checking at 10 seconds
- OOM-kill - the container hits its memory limit and gets killed by the kernel
kubectl describe pod <pod> -n <ns> | grep -E "OOMKilled|Reason|Exit Code"
If the exit code is 137, the container was killed with SIGKILL - in this context, almost always the OOM killer. Increase the memory limit or fix the memory leak.
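The 137 isn't arbitrary - a process killed by signal N exits with status 128 + N, and the OOM killer delivers SIGKILL (signal 9). You can reproduce it locally:

```shell
# 137 = 128 + 9: a process killed by signal N exits with 128 + N,
# and the kernel's OOM killer delivers SIGKILL (signal 9).
sh -c 'kill -KILL $$' || code=$?
echo "exit code: ${code}"   # exit code: 137
```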
Recovery
kubectl rollout restart deployment/<deploy> -n <ns>
But only after you've fixed the underlying issue. Restarting a CrashLooping pod without fixing the cause just resets the backoff timer.
Prevention
- Add startup probes in addition to liveness probes for slow-booting apps
- Keep probes environment-aware - staging and production might need different thresholds
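For a slow-booting app, a startup probe gates the liveness probe until the first successful check, which prevents restart loops during boot. A minimal sketch - the path, port, and thresholds are illustrative, not recommendations:

```yaml
# Illustrative probe settings for an app that needs ~30s to boot.
startupProbe:
  httpGet:
    path: /healthz
    port: 8000
  periodSeconds: 5
  failureThreshold: 12      # allows up to 12 * 5s = 60s of startup time
livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  periodSeconds: 10         # only runs once the startup probe has succeeded
```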
Scenario 3: Pod Stuck in Pending
Symptoms
Pod never gets scheduled. It stays in Pending with no container starts.
Diagnosis
kubectl describe pod <pod> -n <ns>
kubectl get nodes
kubectl get pvc -n <ns>
I don't treat Pending as one failure class. It is usually one of three things - scheduler capacity, placement policy, or storage readiness. That classification matters because each one belongs to a different owner and fix path.
Common root causes
- Not enough CPU or memory available on any node - the pod's resource requests exceed what the cluster can offer
- Node taints or affinity rules preventing scheduling - the pod has constraints that no current node satisfies
- PVC is stuck in Pending - the storage class doesn't exist, the volume can't be provisioned, or the PVC is bound to a zone where no node lives
Recovery
- Adjust resource requests and limits to match what the cluster can actually provide
- Scale the node pool or add nodes
- Fix the PVC - check storage class, provisioner, and zone alignment
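When adjusting requests, I aim for values near observed usage rather than theoretical peaks. A sketch of what that looks like in a pod spec - the numbers are illustrative:

```yaml
# Illustrative sizing: requests near observed p99 usage, limits with headroom.
resources:
  requests:
    cpu: 250m          # what the scheduler reserves on a node
    memory: 256Mi
  limits:
    memory: 512Mi      # hard ceiling; exceeding it means OOM-kill
```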
Prevention
- Capacity planning and quota reviews before traffic spikes
- Alert on unschedulable pods - don't wait for users to report it
Scenario 4: Ingress Returns 502 While Pods Look Fine
Symptoms
Pods are Running. The app works perfectly with kubectl port-forward. But the ingress or external gateway returns 502 or 503.
This is the most confusing failure I've debugged - everything looks healthy inside the cluster, but users can't reach the app.
Diagnosis
kubectl get svc -n <ns>
kubectl get endpoints <svc> -n <ns>
kubectl describe ingress <ingress> -n <ns>
kubectl logs -n <ingress-ns> deploy/<ingress-controller>
The key check is kubectl get endpoints. If the endpoints list is empty, the service selector doesn't match any running pod labels. The ingress is trying to route traffic to a service that has no backends.
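As an illustration, a selector mismatch like the one below produces exactly that empty endpoints list - the names and labels are hypothetical:

```yaml
# The service selector and pod labels must match exactly; here they don't,
# so the service has no backends and the ingress returns 502.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # looks for pods labeled app=web
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: v1
kind: Pod
metadata:
  name: web-1
  labels:
    app: web-v2       # mismatch: the service selector never matches this pod
```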
Common root causes
- Wrong service selector - the label in the service doesn't match the label on the pod
- Wrong target port - the service points to port 80 but the container listens on 8000
- Readiness probe failures - pods are running but not ready, so they don't appear in the endpoints list
- Ingress backend path or host mismatch - the ingress rule expects /api but the app serves at /
Recovery
- Fix the service selector or port mapping
- Correct the ingress backend rule
- Wait for endpoints to populate, then verify with curl
Prevention
- Add smoke tests that validate the full ingress → service → pod path after every deployment
- Keep labels and port mappings under manifest lint or policy checks
Scenario 5: Cluster DNS Failure
Symptoms
App logs show hostname resolution errors. Pods are running, but internal or external dependency calls fail with DNS errors.
Diagnosis
kubectl exec -it <pod> -n <ns> -- nslookup kubernetes.default.svc.cluster.local
kubectl get deploy -n kube-system coredns
kubectl get pods -n kube-system -l 'k8s-app in (kube-dns,coredns)'
kubectl logs -n kube-system deploy/coredns --tail=200
If the nslookup from inside the pod fails, the problem is CoreDNS, not the application. I've seen entire clusters go down because CoreDNS pods were OOM-killed or stuck in a crash loop - and every service-to-service call broke simultaneously.
Recovery
- Restart unhealthy CoreDNS pods
- Fix bad CoreDNS ConfigMap or upstream resolver issues
- In AKS, kubectl rollout restart deployment coredns -n kube-system has fixed this for me multiple times
Prevention
- Alert on CoreDNS restart spikes and resolution error rate
- Treat cluster DNS as a production dependency, not a background detail
Production Trade-offs
Every troubleshooting decision has trade-offs. Here are the ones I think about most:
Fast restarts vs preserving crash context
Restarting quickly recovers service, but if you restart before reading --previous logs and events, you lose the crash evidence. I always capture the context first, then restart.
Tight probes vs startup tolerance
Aggressive liveness probes detect failures fast, but they can create restart storms during cold starts or deployments. I use startup probes with generous initial delays for apps that take time to boot.
Higher resource requests vs cost efficiency
Requesting more CPU and memory improves scheduling stability, but increases cluster cost and reduces bin-packing efficiency. I size requests based on actual p99 usage, not peak theoretical load.
Common Mistakes
- Restarting deployments before reading events - The event stream often contains the exact failure reason. Read it first.
- Only checking current logs, not --previous - The current container just booted. The crash happened in the previous one.
- Ignoring probe configuration during triage - A misconfigured liveness probe looks like an app crash. Check probe settings before blaming the code.
- Using broad emergency permissions and forgetting rollback - Granting cluster-admin to debug is fine in an emergency. Not revoking it afterwards is a security debt.
- Assuming Running means healthy - A pod can be Running but fail every readiness check. Running means the container started. It doesn't mean it's serving traffic.
Key Takeaways
- Use a fixed sequence - get pods → describe → logs --previous → top → events. Every time.
- Classify the failure first - ImagePullBackOff, CrashLoopBackOff, Pending, 502, and DNS are different problems with different fix paths.
- Events tell you more than logs - The scheduler, kubelet, and controller manager write events before the app even starts.
- Keep recovery actions reversible - Restart before reinstall. Rollback before rewrite.
- Turn repeated incidents into platform safeguards - If the same failure happens twice, automate the detection or prevention.
This is the workflow I use for every Kubernetes incident. The scenarios change, but the sequence stays the same.
