What I've Learned Running Production on Azure

April 11, 2026

Part 4 of 4

Cloud computing sounds simple until you're managing production workloads on it. The marketing says "deploy anywhere, scale anything, pay only for what you use." The reality is more nuanced - you pay for what you forget to turn off, identity misconfigurations cause outages, and networking in the cloud is a whole discipline on its own.

I've been running production workloads on Azure - primarily AKS, App Gateway, PostgreSQL, and Key Vault. This post is not a cloud certification guide. It's what I've actually learned from operating these services in production.

1. Cloud Fundamentals That Actually Matter

Service models - who owns what

The three service models aren't just exam material. They define who is responsible when something breaks.

  • IaaS (VMs, VNets, disks) - You manage the OS, runtime, app, and data. The provider manages the physical hardware and hypervisor. If the VM's disk fills up, that's your problem.
  • PaaS (App Service, Azure Functions, managed databases) - You manage the app and data. The provider manages the OS, patching, and scaling. If the database has a slow query, that's your problem. If the database server needs a kernel patch, that's theirs.
  • SaaS (Microsoft 365, Salesforce) - You manage your data and access policies. Everything else is the provider's responsibility.

The shared responsibility model isn't abstract. I've seen teams open a ticket with Microsoft because their AKS pods were OOM-killed - but OOM is the customer's responsibility. Understanding the boundary saves you hours during incidents.

Regions and availability

  • Regions - Physical locations where Azure has data centers. Choosing the right region affects latency, compliance, and cost. I pick the region closest to my users, then check if all the services I need are available there.
  • Availability Zones - Physically separate data centers within a region. Spreading workloads across zones protects against single-facility failures.
  • High Availability vs Disaster Recovery - HA keeps you running during component failures (zone goes down). DR gets you back after a major event (entire region goes down). They solve different problems and cost different amounts.

Cloud economics

The CapEx vs OpEx shift is real, but the nuance is this - OpEx can exceed CapEx very quickly if you're not watching.

In traditional infrastructure, you overprovision once and forget. In the cloud, you can overprovision continuously and get a growing bill every month. The pay-as-you-go model is a feature and a risk.


2. Identity and Access - Where Most Security Failures Start

Identity misconfiguration causes more production incidents than I expected. Not hacking - just wrong permissions, expired credentials, or overly broad access.

Authentication vs authorization

  • Authentication - Proving who you are (passwords, tokens, certificates)
  • Authorization - What you're allowed to do after proving identity (RBAC roles, policies)

These are separate concerns. A service principal can authenticate successfully but still get 403 Forbidden because it doesn't have the right role assignment. I've debugged this exact scenario with AKS pulling from ACR - authentication passed, but the acrpull role was missing.

Principle of least privilege

Give the minimum permissions required for the task. Not "Contributor on the subscription" because it's easier. Not "Owner" because the engineer asked for it.

In practice, this means:

  • AKS pods use managed identities scoped to specific resources
  • CI/CD pipelines use service principals with only the permissions they need
  • Human access uses JIT (Just-In-Time) elevation, not permanent assignments

Managed identities vs service principals

  • Service principals - App registrations with client secrets or certificates. You manage the credentials. They expire. You rotate them. If you forget, the app stops working at 3 AM.
  • Managed identities - Azure manages the credentials for you. No secrets to rotate, no expiry to track. System-assigned identities are tied to the lifecycle of the resource. User-assigned identities can be shared across resources.

I use managed identities wherever possible. Every service principal is a credential rotation problem waiting to happen.

RBAC in practice

Azure RBAC works at four levels: management group → subscription → resource group → resource. Permissions inherit downward.

The mistake I see most often - granting Contributor at the subscription level because it's quick. This gives write access to every resource in the subscription. Scope roles to the specific resource group or resource instead.
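
The inheritance rule is easy to model. Here's a minimal sketch (resource IDs are made up for illustration) showing why a subscription-level assignment reaches everything, while a resource-group-scoped one stays contained:

```python
# Sketch of Azure RBAC scope inheritance: a role assigned at a scope
# applies to that scope and everything underneath it. IDs are illustrative.

def applies_to(assignment_scope: str, resource_id: str) -> bool:
    """A role assignment covers a resource if the resource sits at or
    below the assignment's scope in the hierarchy."""
    return resource_id == assignment_scope or resource_id.startswith(assignment_scope + "/")

sub_scope = "/subscriptions/sub-1"
rg_scope = "/subscriptions/sub-1/resourceGroups/rg-app"
vm = "/subscriptions/sub-1/resourceGroups/rg-app/providers/Microsoft.Compute/virtualMachines/vm-1"
pg = "/subscriptions/sub-1/resourceGroups/rg-db/providers/Microsoft.DBforPostgreSQL/flexibleServers/pg-1"

print(applies_to(sub_scope, vm))  # True - subscription scope reaches everything
print(applies_to(sub_scope, pg))  # True - including the database you forgot about
print(applies_to(rg_scope, vm))   # True - scoped assignment covers its own RG
print(applies_to(rg_scope, pg))   # False - and nothing outside it
```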


3. Cloud Networking - The Part That Trips Everyone Up

Cloud networking feels familiar if you know traditional networking, but the abstraction layers add gotchas.

Virtual networks and subnets

A VNet is your private network in Azure. Subnets divide it into segments. The key rules:

  • Plan CIDR ranges before building. Overlapping ranges between VNets make peering impossible.
  • Use separate subnets for different workload types (AKS nodes, databases, App Gateway)
  • Each subnet can have its own NSG (Network Security Group) for traffic control
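
Checking for CIDR overlap before you peer is a one-liner with Python's standard library - the ranges below are made up, but this is the check I'd run during planning:

```python
# Overlap check before peering VNets, using the stdlib ipaddress module.
# Ranges are hypothetical.
import ipaddress

hub = ipaddress.ip_network("10.0.0.0/16")
spoke_ok = ipaddress.ip_network("10.1.0.0/16")
spoke_bad = ipaddress.ip_network("10.0.128.0/17")  # sits inside the hub range

print(hub.overlaps(spoke_ok))   # False - peering is possible
print(hub.overlaps(spoke_bad))  # True  - peering will be rejected
```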

Network Security Groups

NSGs are stateful firewalls attached to subnets or NICs. They evaluate rules by priority, and the default rules allow outbound internet access and deny inbound from the internet.

The mistake I hit early on - creating an NSG rule to allow traffic on port 443, but giving it a higher priority number than an existing deny rule. Rules are processed lowest-number-first (a lower number means higher priority), and the first match wins, so the deny fired before my allow rule was ever evaluated.
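
A toy model makes the evaluation order obvious. The rule set here is hypothetical, but the logic - sort by priority number, first match decides - is the part that bit me:

```python
# Toy model of NSG rule evaluation: sort by priority number ascending,
# the first matching rule decides. Rule set is hypothetical.

def evaluate(rules, port):
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule["port"] in (port, "*"):
            return rule["action"]
    return "Deny"  # implicit deny if nothing matches

rules = [
    {"priority": 4096, "port": 443, "action": "Allow"},  # my new allow rule
    {"priority": 200,  "port": "*", "action": "Deny"},   # existing deny - lower number
]
print(evaluate(rules, 443))  # Deny - the deny matched first, my rule never ran
```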

Public vs private endpoints

  • Public endpoints - The resource is reachable from the internet. It has a public IP.
  • Private endpoints - The resource gets a private IP in your VNet. Traffic stays on Azure's backbone network. No internet exposure.

I use private endpoints for databases, Key Vault, and ACR in production. There's no reason for a PostgreSQL server to be reachable from the internet when the only clients are AKS pods in the same VNet.

Load balancing

  • Azure Load Balancer (L4) - Routes TCP/UDP traffic. Doesn't understand HTTP. Fast and cheap.
  • Application Gateway (L7) - Routes HTTP/HTTPS traffic. Can do path-based routing, SSL termination, WAF. More expensive, more capable.
  • Traffic Manager - DNS-based routing across regions. Not a traditional load balancer.
  • Front Door - Global entry point with CDN, WAF, and intelligent routing.

The practical distinction - if you need to route by URL path or terminate TLS centrally, use App Gateway. If you just need to distribute TCP connections, use Azure Load Balancer.


4. The Azure Services I Actually Use

AKS (Azure Kubernetes Service)

This is where most of my workloads run. The key architectural decisions:

  • Node pools - System pool for cluster components (CoreDNS, metrics-server). User pool for application workloads. Separating them prevents app resource pressure from affecting cluster operations.
  • Networking modes - Kubenet gives each pod an IP from a separate address space (simpler CIDR planning). Azure CNI gives each pod a VNet IP directly (better VNet integration, more IP consumption). I use Azure CNI when pods need to talk to private endpoints directly.
  • Private clusters - The API server gets a private IP. No public Kubernetes API endpoint. Access through VPN or bastion. Adds operational friction but significantly reduces attack surface.
  • Managed identity integration - AKS can use managed identities for pulling images from ACR, accessing Key Vault secrets, and managing Azure resources. No more storing service principal credentials in the cluster.

App Gateway + WAF

App Gateway sits in front of AKS as the L7 entry point. It terminates TLS, routes by host/path, and runs WAF rules.

The classic production issue - App Gateway returning 502 while pods are healthy. This almost always means the backend health probe is failing. Check that the probe path, port, and expected status code match what the app actually serves. I've fixed this more than once by updating a single probe configuration.
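
The debugging checklist boils down to three comparisons. A minimal sketch, with illustrative config values, of the mismatch that causes the 502:

```python
# Sketch of the App Gateway 502 checklist: does the backend probe config
# match what the app actually serves? All values are illustrative.

def probe_healthy(probe: dict, app: dict) -> bool:
    return (
        probe["path"] == app["health_path"]
        and probe["port"] == app["port"]
        and app["status_code"] in probe["expected_codes"]
    )

probe = {"path": "/health", "port": 8080, "expected_codes": range(200, 300)}
app = {"health_path": "/healthz", "port": 8080, "status_code": 200}

print(probe_healthy(probe, app))  # False - path mismatch, pods look "unhealthy"
```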

Azure Key Vault

Secrets, certificates, and keys. Integrated with AKS through the CSI Secrets Store driver.

The access model matters - Key Vault supports both access policies (legacy) and Azure RBAC (recommended). I've debugged Access Denied errors where the managed identity had the right access policy but the Key Vault had switched to RBAC mode, making the access policy irrelevant.

Azure Container Registry (ACR)

Private Docker registry. Integrates with AKS through managed identity - no more imagePullSecret management.

The common failure - AKS pods returning ImagePullBackOff with ACR. Usually one of three things: the managed identity doesn't have acrpull role, the ACR is behind a private endpoint and the AKS VNet can't reach it, or the image tag simply doesn't exist.


5. Database Operations in Production

PostgreSQL on Azure

I manage Azure PostgreSQL Flexible Server. The operational patterns that matter:

  • Connection pooling - Applications opening direct connections to PostgreSQL don't scale. PgBouncer sits between the app and the database, reusing connections. Without it, you hit the server's connection limit ("too many connections" errors) at moderate load.
  • Performance tuning - EXPLAIN ANALYZE before optimizing. I've seen teams add indexes randomly without checking the query plan first. Start with pg_stat_statements to find the slowest queries, then optimize those specifically.
  • Backup and recovery - Azure handles automated backups with point-in-time recovery (PITR). But you should test the restore regularly. A backup you've never tested is not a backup.
  • Replication - Read replicas for read-heavy workloads. The replica has a lag - if your app reads from the replica immediately after writing to the primary, it might get stale data. Design for eventual consistency or route critical reads to the primary.
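
One way to handle replica lag is read-your-own-writes routing: send reads to the primary for a short window after a write, and to the replica otherwise. A minimal sketch - the lag budget and key names are assumptions, not measured values:

```python
# Sketch of read-your-own-writes routing around replica lag: reads of a
# recently written key go to the primary. Timings are illustrative.
import time

REPLICA_LAG_BUDGET = 2.0  # seconds we assume the replica may trail the primary
_last_write: dict = {}

def record_write(key: str) -> None:
    _last_write[key] = time.monotonic()

def pick_endpoint(key: str) -> str:
    """Route to the primary if this key was written within the lag budget."""
    written = _last_write.get(key, float("-inf"))
    if time.monotonic() - written < REPLICA_LAG_BUDGET:
        return "primary"
    return "replica"

record_write("user:42")
print(pick_endpoint("user:42"))  # primary - just written, replica may be stale
print(pick_endpoint("user:99"))  # replica - no recent write, safe to offload
```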

Schema migrations

  • Use a migration tool (Alembic for Python, Flyway for Java). Never run DDL manually in production.
  • Zero-downtime migrations require backward-compatible changes. Add the new column first, deploy the code that uses it, then remove the old column in a separate migration.
  • I've seen a production outage caused by a migration that locked a table for 20 minutes. Always check if your DDL acquires a table lock and plan accordingly.
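
For reference, here's what the "expand" step looks like as a hypothetical Alembic migration - the table and column names are invented, and the point is that the new column ships nullable, so code that doesn't know about it keeps working:

```python
# Hypothetical Alembic migration for the "expand" step of a zero-downtime
# change: add the new column as nullable so old code keeps working. The
# "contract" step (dropping the old column) is a separate, later migration.
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"       # placeholder revision IDs
down_revision = "f6e5d4c3b2a1"

def upgrade() -> None:
    # Nullable and no server-side default rewrite - avoids a long
    # table lock on large tables.
    op.add_column("users", sa.Column("email_normalized", sa.Text(), nullable=True))

def downgrade() -> None:
    op.drop_column("users", "email_normalized")
```

Running this requires an Alembic environment and a live database, so treat it as a shape to follow rather than a drop-in file.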

Database security

  • Encryption at rest - Azure handles this by default for managed databases.
  • Encryption in transit - Enforce TLS connections. Reject plaintext.
  • Access control - Use managed identities for app access. Avoid shared database passwords. Rotate credentials on a schedule, and make sure the rotation doesn't break the app.

6. FinOps - Because Cloud Bills Explode

Cloud cost management is not optional. I've seen monthly bills double because someone left a dev environment running over a weekend, or because a load test spun up VMs that were never deleted.

Why costs get out of control

  • Orphaned resources - Disks, public IPs, and load balancers that remain after VMs or services are deleted
  • Over-provisioned resources - VMs with 8 cores running a service that uses 0.5 cores
  • No auto-scaling - Running peak capacity 24/7 when traffic is only high for 4 hours
  • Dev/test environments running at production scale - Staging doesn't need 3 replicas and a Standard_D8s_v3 node pool

What I actually do

  • Tagging - Every resource gets environment, team, and service tags. Without tags, you can't answer "who is spending what" when the bill arrives.
  • Budgets and alerts - Azure Cost Management budgets with alerts at 80% and 100%. Getting notified before the budget is hit, not after.
  • Right-sizing - Check Azure Advisor recommendations monthly. If a VM has averaged 10% CPU for 30 days, it's over-provisioned.
  • Reserved Instances - For stable, predictable workloads (production databases, always-on node pools). 1-year RIs save ~30%, 3-year RIs save ~50%. Only reserve what you're confident will run for the full term.
  • Spot nodes for AKS - Non-critical workloads (batch jobs, dev/test) can run on spot nodes at 60-90% discount. But spot nodes can be evicted with 30 seconds' notice, so your workload must handle interruptions.
  • Storage tiers - Move infrequently accessed data from Hot to Cool or Archive. The access cost goes up but the storage cost drops significantly.
  • Scheduled shutdowns - Dev/test environments don't need to run from 8 PM to 8 AM, or on weekends.

Cost optimization for AKS specifically

  • Separate system and user node pools. System pool can be small (2 nodes, small SKU). User pool scales with workload.
  • Use the cluster autoscaler to add/remove nodes based on pending pods, not a fixed node count.
  • Set resource requests and limits properly. Over-requesting wastes node capacity. Under-requesting causes evictions and poor bin-packing.
  • Monitor with kubectl top nodes - if all nodes are at 20% CPU, you have too many or they're too large.

7. Real-World Scenarios I've Faced

App Gateway returning 502 - backend health probe mismatch

Users reported 502 Bad Gateway. Pods were healthy, kubectl port-forward worked fine. The App Gateway backend health showed "unhealthy."

The cause - the new deployment changed the health endpoint from /health to /healthz, but the App Gateway probe was still pointing to /health. One config change in the probe settings fixed it.

Pod can't pull image from ACR

ImagePullBackOff in a new namespace. The ACR was private, the cluster had the acrpull role, but the error persisted.

The issue - the cluster's managed identity had acrpull on the ACR, but the ACR had a firewall rule that only allowed the AKS VNet. The namespace's pods were running on a new node pool in a different subnet that wasn't whitelisted.

Key Vault secret access denied

App pods crashed with SecretNotFound from the CSI driver. The Key Vault had secrets, the managed identity existed, but access was denied.

The cause - the Key Vault had been migrated from access policy mode to RBAC mode. The old access policies were now ignored. The managed identity needed a Key Vault Secrets User RBAC role assignment, not an access policy.

Database connection failing from AKS

App logs showed "could not translate host name" for the PostgreSQL server. The database was running fine.

Two things were wrong: the PostgreSQL server used a private endpoint, and the AKS VNet's DNS wasn't configured to resolve the privatelink.postgres.database.azure.com zone. Adding the private DNS zone link to the AKS VNet fixed the resolution.

High Azure bill after load testing

Monthly bill jumped 40% after a performance test. The load test had scaled the AKS node pool to 20 nodes and the tester forgot to scale it back down.

Prevention - I now tag all load test resources with purpose=loadtest and have an Azure Policy that auto-deletes resources with that tag after 48 hours. I also set up cost alerts that fire when daily spend exceeds 150% of the baseline.
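
The daily-spend alert is simple enough to sketch. The numbers below are made up, but this is the shape of the check - compare today against a trailing baseline:

```python
# Sketch of the daily-spend anomaly check described above: alert when
# today's spend exceeds 150% of a trailing average. Numbers are made up.
from statistics import mean

def spend_alert(history, today, threshold=1.5):
    """True if today's spend exceeds threshold x the trailing average."""
    baseline = mean(history)
    return today > threshold * baseline

last_week = [120.0, 118.0, 125.0, 119.0, 122.0, 121.0, 117.0]  # daily spend, USD
print(spend_alert(last_week, 190.0))  # True  - the load test nodes are still up
print(spend_alert(last_week, 130.0))  # False - within normal variance
```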


Common Mistakes

  • Using service principals when managed identities work - Every service principal is a credential you have to rotate. Managed identities eliminate that burden.
  • Public endpoints for internal-only services - If only your AKS pods need to reach the database, use a private endpoint. No internet exposure needed.
  • Granting Contributor at the subscription level - Scope permissions to the specific resource group or resource. Broad access is a security and cost risk.
  • Not testing database restores - Azure takes automated backups, but if you've never tested a restore, you don't have a backup strategy. You have a hope strategy.
  • Ignoring cost management until the bill arrives - Set up budgets, alerts, and tagging from day one. Retroactive cost analysis is painful.
  • Skipping the shared responsibility model - Know what Azure manages and what you manage. Filing a support ticket for an OOM-killed pod wastes everyone's time.

Key Takeaways

  • Cloud is an operating model, not just infrastructure - It changes how you manage identity, networking, cost, and responsibility
  • Identity is the new perimeter - Managed identities, least privilege RBAC, and JIT access are not optional
  • Private endpoints for everything internal - If it doesn't need internet access, don't give it internet access
  • Cost management is a continuous practice - Tags, budgets, right-sizing, and regular reviews prevent bill shock
  • Test your failure scenarios - App Gateway 502s, ACR pull failures, Key Vault access denied, DNS resolution failures. Practice them before they happen in production
  • Database operations require discipline - Connection pooling, tested backups, zero-downtime migrations, and performance monitoring are baseline expectations

Everything in this post comes from running production workloads on Azure. The concepts are cloud-general, but the details are from my own environment.