What I've Learned Running Production on Azure

April 11, 2026

Part 4 of 4

Cloud computing sounds simple until you're managing production workloads on it. The marketing says "deploy anywhere, scale anything, pay only for what you use." The reality is more nuanced - you pay for what you forget to turn off, identity misconfigurations cause outages, and networking in the cloud is a whole discipline on its own.

I've been running production workloads on Azure - primarily AKS, App Gateway, PostgreSQL, and Key Vault. This post is not a cloud certification guide. It's what I've actually learned from operating these services in production.

1. Cloud Fundamentals That Actually Matter

Service models - who owns what

The three service models aren't just exam material. They define who is responsible when something breaks.

  • IaaS (VMs, VNets, disks) - You manage the OS, runtime, app, and data. The provider manages the physical hardware and hypervisor. If the VM's disk fills up, that's your problem.
  • PaaS (App Service, Azure Functions, managed databases) - You manage the app and data. The provider manages the OS, patching, and scaling. If the database has a slow query, that's your problem. If the database server needs a kernel patch, that's theirs.
  • SaaS (Microsoft 365, Salesforce) - You manage your data and access policies. Everything else is the provider's responsibility.

The shared responsibility model isn't abstract. I've seen teams open a ticket with Microsoft because their AKS pods were OOM-killed - but OOM is the customer's responsibility. Understanding the boundary saves you hours during incidents.

Regions and availability

  • Regions - Physical locations where Azure has data centers. Choosing the right region affects latency, compliance, and cost. I pick the region closest to my users, then check if all the services I need are available there.
  • Availability Zones - Physically separate data centers within a region. Spreading workloads across zones protects against single-facility failures.
  • High Availability vs Disaster Recovery - HA keeps you running during component failures (zone goes down). DR gets you back after a major event (entire region goes down). They solve different problems and cost different amounts.

Cloud economics

The CapEx vs OpEx shift is real, but the nuance is this - OpEx can exceed CapEx very quickly if you're not watching.

In traditional infrastructure, you overprovision once and forget. In the cloud, you can overprovision continuously and get a growing bill every month. The pay-as-you-go model is a feature and a risk.


2. Identity and Access - Where Most Security Failures Start

Identity misconfiguration causes more production incidents than I expected. Not hacking - just wrong permissions, expired credentials, or overly broad access.

Authentication vs authorization

  • Authentication - Proving who you are (passwords, tokens, certificates)
  • Authorization - What you're allowed to do after proving identity (RBAC roles, policies)

These are separate concerns. A service principal can authenticate successfully but still get 403 Forbidden because it doesn't have the right role assignment. I've debugged this exact scenario with AKS pulling from ACR - authentication passed, but the acrpull role was missing.

Principle of least privilege

Give the minimum permissions required for the task. Not "Contributor on the subscription" because it's easier. Not "Owner" because the engineer asked for it.

In practice, this means:

  • AKS pods use managed identities scoped to specific resources
  • CI/CD pipelines use service principals with only the permissions they need
  • Human access uses JIT (Just-In-Time) elevation, not permanent assignments

Managed identities vs service principals

  • Service principals - App registrations with client secrets or certificates. You manage the credentials. They expire. You rotate them. If you forget, the app stops working at 3 AM.
  • Managed identities - Azure manages the credentials for you. No secrets to rotate, no expiry to track. System-assigned identities are tied to the lifecycle of the resource. User-assigned identities can be shared across resources.

I use managed identities wherever possible. Every service principal is a credential rotation problem waiting to happen.

RBAC in practice

Azure RBAC works at four levels: management group → subscription → resource group → resource. Permissions inherit downward.

The mistake I see most often - granting Contributor at the subscription level because it's quick. This gives write access to every resource in the subscription. Scope roles to the specific resource group or resource instead.
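
The inheritance rule is easy to model. Here's a minimal sketch (resource IDs are made up for illustration) showing why a subscription-level assignment reaches everything, while a resource-group-scoped one stays contained:

```python
# Sketch of Azure RBAC scope inheritance: a role assigned at a scope
# applies to that scope and everything underneath it. IDs are illustrative.

def applies_to(assignment_scope: str, resource_id: str) -> bool:
    """A role assignment covers a resource if the resource sits at or
    below the assignment's scope in the hierarchy."""
    return resource_id == assignment_scope or resource_id.startswith(assignment_scope + "/")

sub_scope = "/subscriptions/sub-1"
rg_scope = "/subscriptions/sub-1/resourceGroups/rg-app"
vm = "/subscriptions/sub-1/resourceGroups/rg-app/providers/Microsoft.Compute/virtualMachines/vm-1"
pg = "/subscriptions/sub-1/resourceGroups/rg-db/providers/Microsoft.DBforPostgreSQL/flexibleServers/pg-1"

print(applies_to(sub_scope, vm))  # True - subscription scope reaches everything
print(applies_to(sub_scope, pg))  # True - including the database you forgot about
print(applies_to(rg_scope, vm))   # True - scoped assignment covers its own RG
print(applies_to(rg_scope, pg))   # False - and nothing outside it
```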


3. Cloud Networking - The Part That Trips Everyone Up

Cloud networking feels familiar if you know traditional networking, but the abstraction layers add gotchas.

Virtual networks and subnets

A VNet is your private network in Azure. Subnets divide it into segments. The key rules:

  • Plan CIDR ranges before building. Overlapping ranges between VNets make peering impossible.
  • Use separate subnets for different workload types (AKS nodes, databases, App Gateway)
  • Each subnet can have its own NSG (Network Security Group) for traffic control
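
Checking for CIDR overlap before you peer is a one-liner with Python's standard library - the ranges below are made up, but this is the check I'd run during planning:

```python
# Overlap check before peering VNets, using the stdlib ipaddress module.
# Ranges are hypothetical.
import ipaddress

hub = ipaddress.ip_network("10.0.0.0/16")
spoke_ok = ipaddress.ip_network("10.1.0.0/16")
spoke_bad = ipaddress.ip_network("10.0.128.0/17")  # sits inside the hub range

print(hub.overlaps(spoke_ok))   # False - peering is possible
print(hub.overlaps(spoke_bad))  # True  - peering will be rejected
```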

Network Security Groups

NSGs are stateful firewalls attached to subnets or NICs. They evaluate rules by priority, and the default rules allow outbound internet access and deny inbound from the internet.

The mistake I hit early on - creating an NSG rule to allow traffic on port 443, but giving it a higher priority number than an existing deny rule. Rules are processed lowest-number-first (a lower number means higher priority), and the first match wins, so the deny fired before my allow rule was ever evaluated.
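
A toy model makes the evaluation order obvious. The rule set here is hypothetical, but the logic - sort by priority number, first match decides - is the part that bit me:

```python
# Toy model of NSG rule evaluation: sort by priority number ascending,
# the first matching rule decides. Rule set is hypothetical.

def evaluate(rules, port):
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule["port"] in (port, "*"):
            return rule["action"]
    return "Deny"  # implicit deny if nothing matches

rules = [
    {"priority": 4096, "port": 443, "action": "Allow"},  # my new allow rule
    {"priority": 200,  "port": "*", "action": "Deny"},   # existing deny - lower number
]
print(evaluate(rules, 443))  # Deny - the deny matched first, my rule never ran
```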

Public vs private endpoints

  • Public endpoints - The resource is reachable from the internet. It has a public IP.
  • Private endpoints - The resource gets a private IP in your VNet. Traffic stays on Azure's backbone network. No internet exposure.

I use private endpoints for databases, Key Vault, and ACR in production. There's no reason for a PostgreSQL server to be reachable from the internet when the only clients are AKS pods in the same VNet.

Load balancing

  • Azure Load Balancer (L4) - Routes TCP/UDP traffic. Doesn't understand HTTP. Fast and cheap.
  • Application Gateway (L7) - Routes HTTP/HTTPS traffic. Can do path-based routing, SSL termination, WAF. More expensive, more capable.
  • Traffic Manager - DNS-based routing across regions. Not a traditional load balancer.
  • Front Door - Global entry point with CDN, WAF, and intelligent routing.

The practical distinction - if you need to route by URL path or terminate TLS centrally, use App Gateway. If you just need to distribute TCP connections, use Azure Load Balancer.


4. The Azure Services I Actually Use

AKS (Azure Kubernetes Service)

This is where most of my workloads run. The key architectural decisions:

  • Node pools - System pool for cluster components (CoreDNS, metrics-server). User pool for application workloads. Separating them prevents app resource pressure from affecting cluster operations.
  • Networking modes - Kubenet gives each pod an IP from a separate address space (simpler CIDR planning). Azure CNI gives each pod a VNet IP directly (better VNet integration, more IP consumption). I use Azure CNI when pods need to talk to private endpoints directly.
  • Private clusters - The API server gets a private IP. No public Kubernetes API endpoint. Access through VPN or bastion. Adds operational friction but significantly reduces attack surface.
  • Managed identity integration - AKS can use managed identities for pulling images from ACR, accessing Key Vault secrets, and managing Azure resources. No more storing service principal credentials in the cluster.

App Gateway + WAF

App Gateway sits in front of AKS as the L7 entry point. It terminates TLS, routes by host/path, and runs WAF rules.

The classic production issue - App Gateway returning 502 while pods are healthy. This almost always means the backend health probe is failing. Check that the probe path, port, and expected status code match what the app actually serves. I've fixed this more than once by updating a single probe configuration.
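
The debugging checklist boils down to three comparisons. A minimal sketch, with illustrative config values, of the mismatch that causes the 502:

```python
# Sketch of the App Gateway 502 checklist: does the backend probe config
# match what the app actually serves? All values are illustrative.

def probe_healthy(probe: dict, app: dict) -> bool:
    return (
        probe["path"] == app["health_path"]
        and probe["port"] == app["port"]
        and app["status_code"] in probe["expected_codes"]
    )

probe = {"path": "/health", "port": 8080, "expected_codes": range(200, 300)}
app = {"health_path": "/healthz", "port": 8080, "status_code": 200}

print(probe_healthy(probe, app))  # False - path mismatch, pods look "unhealthy"
```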

Azure Key Vault

Secrets, certificates, and keys. Integrated with AKS through the CSI Secrets Store driver.

The access model matters - Key Vault supports both access policies (legacy) and Azure RBAC (recommended). I've debugged Access Denied errors where the managed identity had the right access policy but the Key Vault had switched to RBAC mode, making the access policy irrelevant.

Azure Container Registry (ACR)

Private Docker registry. Integrates with AKS through managed identity - no more imagePullSecret management.

The common failure - AKS pods returning ImagePullBackOff with ACR. Usually one of three things: the managed identity doesn't have acrpull role, the ACR is behind a private endpoint and the AKS VNet can't reach it, or the image tag simply doesn't exist.


5. Database Operations in Production

PostgreSQL on Azure

I manage Azure PostgreSQL Flexible Server. The operational patterns that matter:

  • Connection pooling - Applications opening direct connections to PostgreSQL don't scale. PgBouncer sits between the app and the database, reusing connections. Without it, you hit the server's connection limit ("too many connections" errors) at moderate load.
  • Performance tuning - EXPLAIN ANALYZE before optimizing. I've seen teams add indexes randomly without checking the query plan first. Start with pg_stat_statements to find the slowest queries, then optimize those specifically.
  • Backup and recovery - Azure handles automated backups with point-in-time recovery (PITR). But you should test the restore regularly. A backup you've never tested is not a backup.
  • Replication - Read replicas for read-heavy workloads. The replica has a lag - if your app reads from the replica immediately after writing to the primary, it might get stale data. Design for eventual consistency or route critical reads to the primary.
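
One way to handle replica lag is read-your-own-writes routing: send reads to the primary for a short window after a write, and to the replica otherwise. A minimal sketch - the lag budget and key names are assumptions, not measured values:

```python
# Sketch of read-your-own-writes routing around replica lag: reads of a
# recently written key go to the primary. Timings are illustrative.
import time

REPLICA_LAG_BUDGET = 2.0  # seconds we assume the replica may trail the primary
_last_write: dict = {}

def record_write(key: str) -> None:
    _last_write[key] = time.monotonic()

def pick_endpoint(key: str) -> str:
    """Route to the primary if this key was written within the lag budget."""
    written = _last_write.get(key, float("-inf"))
    if time.monotonic() - written < REPLICA_LAG_BUDGET:
        return "primary"
    return "replica"

record_write("user:42")
print(pick_endpoint("user:42"))  # primary - just written, replica may be stale
print(pick_endpoint("user:99"))  # replica - no recent write, safe to offload
```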

Schema migrations

  • Use a migration tool (Alembic for Python, Flyway for Java). Never run DDL manually in production.
  • Zero-downtime migrations require backward-compatible changes. Add the new column first, deploy the code that uses it, then remove the old column in a separate migration.
  • I've seen a production outage caused by a migration that locked a table for 20 minutes. Always check if your DDL acquires a table lock and plan accordingly.
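
For reference, here's what the "expand" step looks like as a hypothetical Alembic migration - the table and column names are invented, and the point is that the new column ships nullable, so code that doesn't know about it keeps working:

```python
# Hypothetical Alembic migration for the "expand" step of a zero-downtime
# change: add the new column as nullable so old code keeps working. The
# "contract" step (dropping the old column) is a separate, later migration.
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"       # placeholder revision IDs
down_revision = "f6e5d4c3b2a1"

def upgrade() -> None:
    # Nullable and no server-side default rewrite - avoids a long
    # table lock on large tables.
    op.add_column("users", sa.Column("email_normalized", sa.Text(), nullable=True))

def downgrade() -> None:
    op.drop_column("users", "email_normalized")
```

Running this requires an Alembic environment and a live database, so treat it as a shape to follow rather than a drop-in file.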

Database security

  • Encryption at rest - Azure handles this by default for managed databases.
  • Encryption in transit - Enforce TLS connections. Reject plaintext.
  • Access control - Use managed identities for app access. Avoid shared database passwords. Rotate credentials on a schedule, and make sure the rotation doesn't break the app.

6. FinOps - Because Cloud Bills Explode

Cloud cost management is not optional. I've seen monthly bills double because someone left a dev environment running over a weekend, or because a load test spun up VMs that were never deleted.

Why costs get out of control

  • Orphaned resources - Disks, public IPs, and load balancers that remain after VMs or services are deleted
  • Over-provisioned resources - VMs with 8 cores running a service that uses 0.5 cores
  • No auto-scaling - Running peak capacity 24/7 when traffic is only high for 4 hours
  • Dev/test environments running at production scale - Staging doesn't need 3 replicas and a Standard_D8s_v3 node pool

What I actually do

  • Tagging - Every resource gets environment, team, and service tags. Without tags, you can't answer "who is spending what" when the bill arrives.
  • Budgets and alerts - Azure Cost Management budgets with alerts at 80% and 100%. Getting notified before the budget is hit, not after.
  • Right-sizing - Check Azure Advisor recommendations monthly. If a VM has averaged 10% CPU for 30 days, it's over-provisioned.
  • Reserved Instances - For stable, predictable workloads (production databases, always-on node pools). 1-year RIs save ~30%, 3-year RIs save ~50%. Only reserve what you're confident will run for the full term.
  • Spot nodes for AKS - Non-critical workloads (batch jobs, dev/test) can run on spot nodes at 60-90% discount. But spot nodes can be evicted with 30 seconds' notice, so your workload must handle interruptions.
  • Storage tiers - Move infrequently accessed data from Hot to Cool or Archive. The access cost goes up but the storage cost drops significantly.
  • Scheduled shutdowns - Dev/test environments don't need to run from 8 PM to 8 AM, or on weekends.

Cost optimization for AKS specifically

  • Separate system and user node pools. System pool can be small (2 nodes, small SKU). User pool scales with workload.
  • Use the cluster autoscaler to add/remove nodes based on pending pods, not a fixed node count.
  • Set resource requests and limits properly. Over-requesting wastes node capacity. Under-requesting causes evictions and poor bin-packing.
  • Monitor with kubectl top nodes - if all nodes are at 20% CPU, you have too many or they're too large.

7. Real-World Scenarios I've Faced

App Gateway returning 502 - backend health probe mismatch

Users reported 502 Bad Gateway. Pods were healthy, kubectl port-forward worked fine. The App Gateway backend health showed "unhealthy."

The cause - the new deployment changed the health endpoint from /health to /healthz, but the App Gateway probe was still pointing to /health. One config change in the probe settings fixed it.

Pod can't pull image from ACR

ImagePullBackOff in a new namespace. The ACR was private, the cluster had the acrpull role, but the error persisted.

The issue - the cluster's managed identity had acrpull on the ACR, but the ACR had a firewall rule that only allowed the AKS VNet. The namespace's pods were running on a new node pool in a different subnet that wasn't whitelisted.

Key Vault secret access denied

App pods crashed with SecretNotFound from the CSI driver. The Key Vault had secrets, the managed identity existed, but access was denied.

The cause - the Key Vault had been migrated from access policy mode to RBAC mode. The old access policies were now ignored. The managed identity needed a Key Vault Secrets User RBAC role assignment, not an access policy.

Database connection failing from AKS

App logs showed "could not translate host name" for the PostgreSQL server. The database was running fine.

Two things were wrong: the PostgreSQL server used a private endpoint, and the AKS VNet's DNS wasn't configured to resolve the privatelink.postgres.database.azure.com zone. Adding the private DNS zone link to the AKS VNet fixed the resolution.

High Azure bill after load testing

Monthly bill jumped 40% after a performance test. The load test had scaled the AKS node pool to 20 nodes and the tester forgot to scale it back down.

Prevention - I now tag all load test resources with purpose=loadtest and have an Azure Policy that auto-deletes resources with that tag after 48 hours. I also set up cost alerts that fire when daily spend exceeds 150% of the baseline.
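
The daily-spend alert is simple enough to sketch. The numbers below are made up, but this is the shape of the check - compare today against a trailing baseline:

```python
# Sketch of the daily-spend anomaly check described above: alert when
# today's spend exceeds 150% of a trailing average. Numbers are made up.
from statistics import mean

def spend_alert(history, today, threshold=1.5):
    """True if today's spend exceeds threshold x the trailing average."""
    baseline = mean(history)
    return today > threshold * baseline

last_week = [120.0, 118.0, 125.0, 119.0, 122.0, 121.0, 117.0]  # daily spend, USD
print(spend_alert(last_week, 190.0))  # True  - the load test nodes are still up
print(spend_alert(last_week, 130.0))  # False - within normal variance
```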


Common Mistakes

  • Using service principals when managed identities work - Every service principal is a credential you have to rotate. Managed identities eliminate that burden.
  • Public endpoints for internal-only services - If only your AKS pods need to reach the database, use a private endpoint. No internet exposure needed.
  • Granting Contributor at the subscription level - Scope permissions to the specific resource group or resource. Broad access is a security and cost risk.
  • Not testing database restores - Azure takes automated backups, but if you've never tested a restore, you don't have a backup strategy. You have a hope strategy.
  • Ignoring cost management until the bill arrives - Set up budgets, alerts, and tagging from day one. Retroactive cost analysis is painful.
  • Skipping the shared responsibility model - Know what Azure manages and what you manage. Filing a support ticket for an OOM-killed pod wastes everyone's time.

Key Takeaways

  • Cloud is an operating model, not just infrastructure - It changes how you manage identity, networking, cost, and responsibility
  • Identity is the new perimeter - Managed identities, least privilege RBAC, and JIT access are not optional
  • Private endpoints for everything internal - If it doesn't need internet access, don't give it internet access
  • Cost management is a continuous practice - Tags, budgets, right-sizing, and regular reviews prevent bill shock
  • Test your failure scenarios - App Gateway 502s, ACR pull failures, Key Vault access denied, DNS resolution failures. Practice them before they happen in production
  • Database operations require discipline - Connection pooling, tested backups, zero-downtime migrations, and performance monitoring are baseline expectations

Everything in this post comes from running production workloads on Azure. The concepts are cloud-general, but the details are from my own environment.