How I Design a Secure Kubernetes Cluster

April 19, 2026

Part 5 of 5

A secure Kubernetes platform is never the result of one setting. Most breaches or near-misses I've seen come from weak boundaries between workloads, identities, networks, and secrets - not from one missing checkbox.

This post is how I think about Kubernetes security as layers, not features. Each layer has a specific job, and when one fails, the others limit the blast radius. My implementation uses Azure (AKS, Key Vault, App Gateway), but the security principles - identity boundaries, network isolation, secret management, workload hardening - apply to any Kubernetes cluster regardless of provider.

The Architecture

Internet
   |
   v
Application Gateway (WAF)
   |
   v
AKS Ingress -> Service -> Pods (app namespace)
                     |        |
                     |        +-> ServiceAccount with workload identity
                     |
                     +-> SecretProviderClass -> Key Vault secret mount (CSI)

AKS node identity -> ACR pull
App pod -> PostgreSQL Flexible Server (private path)

Each layer has one clear security job:

  • Edge boundary - WAF and ingress policy shape internet exposure
  • Cluster boundary - namespace isolation and network policy reduce east-west blast radius
  • Identity boundary - workload identity avoids long-lived static credentials
  • Secret boundary - Key Vault is source-of-truth, pods consume secrets at runtime
  • Data boundary - database traffic stays private, no public endpoint

The key insight is that none of these layers alone is sufficient. A private cluster with no network policies still allows unrestricted lateral movement. Workload identity with overly broad RBAC still lets a compromised pod read secrets it shouldn't. Security works when the layers overlap.


1. Identity and Access

Identity is the strongest control in cloud-native systems because it governs who can call what before traffic reaches application logic.

  • Microsoft Entra-integrated AKS - all cluster access flows through Entra ID, not local Kubernetes credentials
  • Azure RBAC for platform access - who can manage the cluster, node pools, and infrastructure
  • Namespace-scoped Kubernetes RBAC for workload teams - developers get access to their namespace, not the entire cluster
  • Workload identity over service principals - pods authenticate to Azure resources (Key Vault, ACR, databases) using managed identities tied to Kubernetes service accounts. No client secrets stored anywhere.
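
In manifests, that workload identity binding comes down to one annotation on the ServiceAccount and one label on the pod. A minimal sketch using the aks-fastapi service account from the verification below, with a placeholder client ID:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aks-fastapi
  namespace: app
  annotations:
    # Client ID of the user-assigned managed identity (placeholder value)
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
```

Pods that should use the identity also carry the label azure.workload.identity/use: "true" in their template metadata; without it, the webhook never injects the federated token.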

The verification I run:

# Check AKS identity and Entra profile
az aks show \
  --resource-group <rg> \
  --name <cluster> \
  --query "{aadProfile:aadProfile,identity:identity}" \
  -o json

# Check namespace roles and bindings
kubectl get roles,rolebindings -n app

# Verify service account permissions
kubectl auth can-i get secrets \
  --as=system:serviceaccount:app:aks-fastapi \
  -n app

The last command is the one I care about most. If a service account can read secrets it doesn't need, the RBAC is too broad.
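
What "not too broad" looks like in practice: a Role that grants only what the workload actually needs, with nothing on secrets. A sketch with a hypothetical role name:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-readonly     # hypothetical name
  namespace: app
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  # Deliberately no rule for "secrets" - the can-i check above should answer "no"
```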


2. Network Isolation

Most Kubernetes security incidents I've seen are network shape problems - accidental public exposure, broad east-west trust, or unclear data paths.

  • Private cluster mode - the Kubernetes API server gets a private IP. No public control plane endpoint. Operators access through VPN or bastion.
  • Segmented subnets - separate subnets for system node pools, user node pools, and App Gateway. Each subnet has its own NSG.
  • Default-deny network policies - start by denying everything, then add narrow allow rules based on actual traffic patterns

The default-deny baseline I apply to every namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: app
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

After this, I add specific allow policies for ingress controller traffic, DNS (port 53 - pods need this), metrics scraping, and database connections. Nothing else gets through unless there's an explicit policy allowing it.

The mistake I've made - applying default-deny without allowing DNS egress on port 53. Every pod immediately lost name resolution and every service-to-service call broke. Always allow kube-dns egress before applying deny-all.
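
The DNS exception can live as its own policy, applied before the deny-all. A sketch that allows egress to kube-system on port 53 only:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: app
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```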


3. Secret Management

Secrets are where secure designs fail in practice. Teams centralize secrets in Key Vault but still leak copied values into manifests, CI variables, or environment configs.

  • Azure Key Vault as source of truth - secrets are defined once in Key Vault, not duplicated across manifests
  • Secrets Store CSI Driver - mounts secrets directly into the pod filesystem at runtime. The pod reads a file, not an environment variable
  • Workload identity for vault access - the pod's service account authenticates to Key Vault using workload identity. No client secret needed for the vault connection itself
  • File-based consumption - the app reads from /mnt/secrets-store/db-password, not from $DB_PASSWORD. If the mount fails, the readiness probe fails, and the pod never receives traffic

The chain that matters:

  1. Key Vault holds the secret
  2. SecretProviderClass maps Key Vault objects to file paths
  3. Pod mounts the CSI volume
  4. App reads the file at startup
  5. If the file is missing, readiness returns unhealthy

This is verifiable end-to-end. If any link breaks - wrong client ID, missing vault permission, wrong object name - the pod won't pass readiness. That's the behavior I want.
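
Steps 2 and 3 of the chain look roughly like this in a manifest - a sketch with placeholder IDs, reusing the azure-kv-secrets name and db-password object from above:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-kv-secrets
  namespace: app
spec:
  provider: azure
  parameters:
    clientID: "<workload-identity-client-id>"   # placeholder
    keyvaultName: "<vault-name>"                # placeholder
    tenantId: "<tenant-id>"                     # placeholder
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
```

The pod then mounts a CSI volume with driver secrets-store.csi.k8s.io and volumeAttributes.secretProviderClass: azure-kv-secrets, which surfaces the secret at /mnt/secrets-store/db-password.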


4. Workload Hardening

Cluster-level controls are necessary but not sufficient. A permissive pod spec can undo everything else.

The security context I apply to every production container:

securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

What each setting does:

  • runAsNonRoot - container must run as a non-root user. If the image defaults to root, the pod fails to start. This catches Dockerfile mistakes early.
  • allowPrivilegeEscalation: false - prevents a process from gaining more privileges than its parent. Blocks most container escape techniques.
  • readOnlyRootFilesystem - the container can't write to its own filesystem. An attacker who gets code execution can't modify binaries or drop payloads on disk. Only explicitly mounted volumes (like /tmp via emptyDir) are writable.
  • capabilities drop ALL - removes all Linux capabilities. No NET_RAW, no SYS_ADMIN, nothing. If a specific capability is needed, it has to be explicitly added and justified.
  • Resource limits - prevent one container from consuming all node resources. Without limits, a memory leak in one pod can OOM-kill everything on the node.
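
readOnlyRootFilesystem pairs with an explicit writable mount for anything the app genuinely must write. A sketch of the usual /tmp pattern:

```yaml
containers:
  - name: app
    volumeMounts:
      - name: tmp
        mountPath: /tmp     # the only writable path in the container
volumes:
  - name: tmp
    emptyDir: {}
```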

Quick verification across all pods:

kubectl get pod -A -o yaml | grep -E "runAsNonRoot|allowPrivilegeEscalation|readOnlyRootFilesystem"

If any pod is missing these, it's running with default permissive settings.


5. Ingress and Edge Security

The path from the internet to the pod needs explicit control at every hop.

  • App Gateway with WAF - terminates TLS at the edge, runs OWASP rule sets, and routes to AKS ingress
  • Explicit ingress rules - every exposed path is defined in an ingress manifest. No wildcard routes, no accidental exposure
  • Backend health probes - App Gateway continuously checks pod health. If the probe fails, traffic stops flowing to that backend
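
An explicit ingress rule in this spirit names the host, path, and backend and nothing more. A sketch using the aks-fastapi service, with a placeholder host:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: aks-fastapi
  namespace: app
spec:
  rules:
    - host: app.example.com     # placeholder
      http:
        paths:
          - path: /api
            pathType: Prefix    # no wildcard routes
            backend:
              service:
                name: aks-fastapi
                port:
                  number: 80
```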

The verification:

# Review all ingress exposure
kubectl get ingress -A

# Check endpoints behind services
kubectl get svc,endpoints -n app

I want the ingress inventory to be small and fully known. If I see an ingress I don't recognize, something is wrong.


Design Trade-offs

Every security decision has a cost. I make these trade-offs explicitly:

  • Private cluster - lowers control plane exposure but adds complexity for operator access. I mitigate this with a standardized VPN path and documented bastion access.
  • Workload identity - removes long-lived secrets but increases identity/RBAC setup effort. I template the identity bindings and validate them in CI.
  • Default-deny network policy - blocks accidental east-west sprawl but requires more policy maintenance. I start with deny-all, then add narrow allows based on observed traffic.
  • Separate namespaces per environment - better blast-radius control but more manifests to manage. I use Kustomize overlays to keep it manageable.
  • App Gateway + WAF - adds L7 protections but costs more and adds routing complexity. I keep a reusable ingress module and a probe checklist.
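
The Kustomize overlays mentioned above keep the per-environment manifest count down: one shared base, plus a thin overlay per environment. A hypothetical prod overlay:

```yaml
# overlays/prod/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: app-prod
resources:
  - ../../base
patches:
  - path: replicas.yaml     # prod-only replica count
```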

Failure Patterns I Watch For

ImagePullBackOff after identity or node changes

Symptom: pods stuck in ImagePullBackOff after rotating credentials, adding a node pool, or changing the ACR integration.

Usually means the AKS managed identity lost acrpull on the registry, or a new node pool's subnet isn't whitelisted in the ACR firewall.

kubectl describe pod <pod> -n app
# Fix: restore acrpull role or whitelist the new subnet
kubectl rollout status deploy/aks-fastapi -n app

Secret mount failure from Key Vault

Symptom: readiness probe fails, app logs show "secret file not found."

Usually means the workload identity client ID is wrong, the Key Vault object name doesn't match the SecretProviderClass, or the managed identity doesn't have Key Vault Secrets User role.

kubectl describe pod <pod> -n app
kubectl get secretproviderclass azure-kv-secrets -n app -o yaml
# Fix: correct the mapping, then restart
kubectl rollout restart deploy/aks-fastapi -n app

Ingress 502 while pods are healthy

Symptom: App Gateway returns 502, but kubectl port-forward works fine.

Almost always a backend health probe mismatch - the probe path, port, or expected status code doesn't match what the app serves.

kubectl describe ingress aks-fastapi -n app
kubectl get svc,endpoints -n app
# Fix: align the probe config with the app's actual health endpoint
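
The alignment usually comes down to the readiness probe and the App Gateway probe pointing at the same endpoint. A sketch with placeholder path and port:

```yaml
readinessProbe:
  httpGet:
    path: /healthz     # must match the App Gateway probe path (placeholder)
    port: 8080         # placeholder
  periodSeconds: 10
```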

Dev vs Prod - Keep the Gap Explicit

Many teams fail by assuming dev controls are production-ready. I keep these differences documented:

  • Identity - dev might use shared test identities temporarily. Prod requires dedicated least-privilege identities with clear ownership.
  • Secrets - dev has relaxed rotation cadence. Prod has scheduled rotation with an owner and verification workflow.
  • Networking - dev may have temporary exceptions. Prod is private-by-default with explicit ingress rules only.
  • Policy enforcement - dev runs security checks in warning mode. Prod blocks on critical violations.
  • Rollback - dev can tolerate manual rollback. Prod needs a documented and rehearsed rollback path.

Documenting this gap prevents accidental promotion of dev shortcuts into production.


Common Mistakes

  • Exposing public endpoints by default - everything should be private until explicitly opened
  • Putting secrets in plain Kubernetes Secrets without an external source - native K8s secrets are base64-encoded, not encrypted. Use Key Vault.
  • Skipping network policy because "internal traffic is trusted" - east-west traffic between pods is the most common lateral movement path
  • Running all workloads in one namespace - convenient for small projects, dangerous for anything with multiple teams or security boundaries
  • Treating CI security scans as advisory - if Trivy or policy checks don't block the pipeline, they'll be ignored under delivery pressure
  • Forgetting to test rollback paths - a security fix that you can't roll back is a deployment risk, not a security improvement

Key Takeaways

  • Secure Kubernetes is layered, not tool-based - identity, network, secrets, workload hardening, and ingress each have a specific job
  • Identity and network boundaries are the strongest controls - they limit blast radius before application logic is involved
  • Key Vault + CSI + workload identity removes common secret risks - no more credentials in manifests or CI variables
  • Workload hardening is not optional - non-root, read-only, drop-all capabilities. If a pod is missing these, it's running with unnecessary permissions
  • Verify continuously, not once - run control checks regularly. Security posture drifts the moment you stop checking.

This is how I design Kubernetes clusters. The layers change depending on the workload and the cloud provider, but the principle stays the same - defense in depth, with every layer doing one job well.