How Networking Actually Works in Production

March 8, 2026

When I started handling production incidents, I expected most failures to be application bugs. The reality was very different - DNS drift, blocked ports, routing gaps, TLS misconfiguration, and unhealthy load balancer probes caused more outages than broken code.

Your API can be perfectly written and still be unavailable because one route table entry is missing or one certificate chain is incomplete.

In this post, I'll walk through the core networking concepts, how they show up in production, and a troubleshooting workflow you can use during incidents.

1. OSI and TCP/IP - Use Layers to Debug Faster

I don't use the OSI model as theory - I use it as a debugging map. If DNS fails, it's not a database issue. If TCP doesn't establish, app logs may be misleading.

OSI Model (7 layers)
Physical → Data Link → Network → Transport → Session → Presentation → Application. You don't need to memorize all seven, but knowing the bottom four helps during debugging. Most production issues live in Network (IP routing), Transport (TCP/UDP), or Application (HTTP/TLS).

TCP/IP Model (4 layers)
Network Access → Internet → Transport → Application. This is the real protocol stack that runs on every machine. It's what you'll actually see in packet captures, ss output, and logs.

The key insight is this - when something breaks in production, figure out which layer is failing first. If ping works but curl doesn't, the problem is above the network layer. If ping fails, don't bother checking app logs yet.

2. TCP vs UDP - Reliability vs Speed

Most things you deploy use TCP. Some things use UDP. Knowing the difference helps you predict failure patterns.

TCP (Transmission Control Protocol)
Guarantees delivery, ordering, and retransmission. Used by APIs, databases, SSH, and HTTPS. When TCP fails, you see timeouts, retransmits, and handshake failures. These are loud - your monitoring will catch them.

UDP (User Datagram Protocol)
No delivery guarantee. Lower overhead, lower latency. Used by DNS queries, telemetry, streaming, and VoIP. When UDP fails, you see drops and silent loss - no error, just missing data. These are quiet - you might not notice until someone asks why the dashboard is missing metrics.

I once spent an hour debugging missing telemetry data from a service. The app was sending metrics over UDP, but a firewall rule was silently dropping the packets. No errors in the app logs, no timeouts - just empty graphs. Updating the firewall rule fixed it immediately.

TCP failures are loud. UDP failures are silent. Debug them differently.
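The loud-vs-silent difference is easy to see in a few lines of Python. This is a minimal sketch, assuming port 59999 on localhost has nothing listening on it:

```python
import socket

# TCP: connecting to a port nobody listens on fails loudly and immediately.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.settimeout(2)
try:
    tcp.connect(("127.0.0.1", 59999))  # assumed-unused port
    tcp_result = "connected"
except ConnectionRefusedError:
    tcp_result = "connection refused"  # the kernel tells you right away
finally:
    tcp.close()

# UDP: sending to the same dead port "succeeds" - no handshake, no error.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sent = udp.sendto(b"metric:1|c", ("127.0.0.1", 59999))
udp.close()

print(f"TCP: {tcp_result}")
print(f"UDP: sendto returned {sent} bytes, no error raised")
```

The UDP `sendto` reports success even though every datagram is going nowhere - exactly the empty-graphs failure mode from the telemetry incident above.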

3. DNS - The Hidden Root Cause in Many Outages

DNS failures are often partial and confusing. One resolver sees new records, another still caches old ones. The app works from one node and fails from another.

dig +short api.example.com
dig @8.8.8.8 api.example.com
dig @1.1.1.1 api.example.com

Always query multiple resolvers during an incident. Never trust one DNS query result.

Why DNS breaks in production

  • TTL caching - Old records linger after you update DNS. Resolvers cache answers for the TTL, and some clamp or extend it beyond what you set.
  • Split-horizon DNS - Internal and external resolvers return different answers. Your laptop resolves correctly, but the production pod doesn't.
  • CoreDNS issues in Kubernetes - If CoreDNS pods crash or overload, every service-to-service call inside the cluster breaks.

I've seen entire outages caused by DNS drift during a deployment. The new service was up and healthy, but half the cluster was still resolving to the old IP.

4. IP Addressing - CIDR, NAT, and Private vs Public

If you work in cloud or hybrid environments, CIDR planning is critical. Overlapping address ranges and exhausted NAT ports create intermittent, hard-to-reproduce failures.

10.0.0.0/24  -> 256 addresses
10.0.0.0/16  -> 65,536 addresses

Private IPs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
Used inside your VPC/VNet. Not routable on the internet. Free to use however you want.

Public IPs
Routable on the internet. Assigned by your cloud provider. Every public IP costs money and increases your attack surface.

NAT (Network Address Translation)
Translates private IPs to a shared public IP for outbound traffic. When NAT port exhaustion happens, random outbound connections start failing - and it looks like the external API is down when it's actually your NAT gateway.

Plan your CIDR ranges before you build. Larger ranges increase blast radius and overlap risk. Smaller, well-planned subnets save you during troubleshooting.
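Python's standard-library ipaddress module makes CIDR sizing and overlap checks trivial. A minimal sketch, using the example ranges from above plus a hypothetical second VPC range:

```python
import ipaddress

# Sizes match the examples above: /24 -> 256 addresses, /16 -> 65,536.
small = ipaddress.ip_network("10.0.0.0/24")
big = ipaddress.ip_network("10.0.0.0/16")
print(small.num_addresses, big.num_addresses)  # 256 65536

# Overlap check - the failure mode that bites when peering two VPCs.
vpc_a = ipaddress.ip_network("10.0.0.0/16")
vpc_b = ipaddress.ip_network("10.0.128.0/17")  # hypothetical second range
print(vpc_a.overlaps(vpc_b))  # True - peering these will break routing

# Is an address private (RFC 1918) or public?
print(ipaddress.ip_address("10.0.4.7").is_private)   # True
print(ipaddress.ip_address("52.31.2.9").is_private)  # False
```

Running an overlap check like this against every existing range before you allocate a new subnet is much cheaper than debugging asymmetric routing later.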

5. Ports That Matter Daily

You don't need to memorize all 65,535 ports. These are the ones I check most often:

  • 22 - SSH
  • 80 / 443 - HTTP / HTTPS
  • 3306 - MySQL
  • 5432 - PostgreSQL
  • 6379 - Redis
  • 8080 / 3000 / 5000 - Common app server ports
  • 53 - DNS

When a service isn't reachable, the first thing I check is whether it's actually listening on the expected port:

ss -tlnp | grep 5432

If nothing shows up, the service isn't running or it's listening on a different port than you expect. If it shows up but the local address is 127.0.0.1, you've found the bind-address problem described below.

Port conflicts

A common issue I've hit - you deploy a new service and it fails to start. The logs say "address already in use". Another process is already bound to that port. Use ss -tlnp to find what's occupying it, then decide which one should move.

Another gotcha - a service binds to 127.0.0.1:8000 instead of 0.0.0.0:8000. It works when you curl from the same machine, but nothing external can reach it. Always check the bind address, not just the port.
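The bind-address difference is visible directly in the socket API. A minimal Python sketch (port 0 asks the OS for any free port, so it runs anywhere):

```python
import socket

# Bind to loopback only - reachable from this machine, invisible externally.
loopback = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
loopback.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
loopback.listen()
loop_addr = loopback.getsockname()
print("loopback bind:", loop_addr)  # only local curl will ever work

# Bind to all interfaces - what you usually want for a service behind an LB.
anyif = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
anyif.bind(("0.0.0.0", 0))
anyif.listen()
any_addr = anyif.getsockname()
print("all-interface bind:", any_addr)

loopback.close()
anyif.close()
```

In `ss -tlnp` output these show up as `127.0.0.1:8000` versus `0.0.0.0:8000` - the column to read is the local address, not just the port.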

6. Load Balancers and Reverse Proxies

L4 Load Balancer (Transport Layer)
Distributes TCP/UDP connections based on IP and port. Doesn't look at HTTP headers or paths. Fast and simple. Azure Load Balancer and AWS NLB are L4.

L7 Load Balancer / Reverse Proxy (Application Layer)
Can route by hostname, URL path, headers, cookies. Can terminate TLS, add authentication, rate limit. Azure App Gateway, AWS ALB, Nginx, and HAProxy work at this layer.

The difference matters during debugging. If you're behind an L4 LB and TLS is failing, the LB isn't touching TLS at all - the problem is on your backend. If you're behind an L7 LB, TLS usually terminates at the LB - so the issue might be the LB's certificate config, not your app's.

Client -> Load Balancer -> App Service -> Database

The classic health check mismatch

The app returns 200 on /healthz, but the load balancer probes / and expects a 200 there. The service is healthy locally but marked unhealthy externally.

This creates the confusing outage where curl from inside the server works fine, but users can't reach the app.

I've seen this happen right after a deployment - the new version changed the health endpoint from /health to /healthz, but nobody updated the App Gateway probe config. The fix was one line in the LB settings, not in the code.

Always verify that the LB probe path, port, and expected status code match what your app actually serves.
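You can reproduce the mismatch end-to-end in a few lines. This sketch stands up a tiny local app that serves /healthz but not /, then probes both paths the way an LB would (the paths mirror the deployment story above; everything else is illustrative):

```python
import http.server
import threading
import urllib.error
import urllib.request

# Tiny app that answers /healthz but 404s everything else.
class App(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = http.server.HTTPServer(("127.0.0.1", 0), App)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def probe(path):
    """Return the HTTP status an LB health probe would see."""
    try:
        return urllib.request.urlopen(f"http://127.0.0.1:{port}{path}").status
    except urllib.error.HTTPError as e:
        return e.code

healthz = probe("/healthz")
root = probe("/")
print(healthz, root)  # 200 404 - healthy locally, "unhealthy" to a / probe
server.shutdown()
```

The app is perfectly healthy, yet any probe configured for / would mark it down and drain it from the pool - which is why the probe path belongs in your deployment checklist.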

7. TLS and HTTPS

TLS is not optional in production. Every connection between your users and your services should be encrypted. Service-to-service communication inside your cluster should use mTLS where possible.

When TLS breaks, it's usually one of three things

Certificate chain problem
The cert itself is valid, but the intermediate certificate is missing. Browsers sometimes fill in the gap, but curl and other services won't.

openssl s_client -connect api.example.com:443 -servername api.example.com

Hostname / SNI mismatch
The certificate was issued for api.example.com but the request hits api-internal.example.com. TLS rejects the connection.

Expired or time-drifted certificate
The cert expired, or the server clock is off and thinks a valid cert is not-yet-valid.

curl -vk https://api.example.com/health

The -v flag shows the TLS handshake details, and -k skips certificate verification so you can still inspect a broken chain. This is my first check when I see SSL/TLS errors.

8. Troubleshooting Workflow - Layer by Layer

This is the sequence I follow to keep MTTR low and avoid random command execution:

DNS -> Path -> Port -> Protocol -> Policy

DNS - Does name resolution return expected targets?

dig +short service.internal

Path - Can traffic reach the destination network?

traceroute service.internal

Port - Is the destination port reachable and listening?

nc -vz service.internal 443

Protocol - Does the HTTP/TLS handshake succeed?

curl -vk https://service.internal/health

Policy - Are firewall/NSG rules allowing the flow?

sudo iptables -L -n

The reason this sequence works is that it narrows fault domains quickly. If DNS is wrong, nothing above it matters yet. If the port is closed, the service is down - stop debugging the app and check the process.
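The same narrowing logic can be sketched in code. This is an illustrative Python version of the DNS → path → port steps (the hostnames are placeholders; port 59998 is assumed unused), stopping at the first failing layer the way the command sequence does:

```python
import socket

def diagnose(host, port, timeout=3):
    """Walk the fault domains in order and report the first failing layer."""
    # DNS - does the name resolve at all?
    try:
        ip = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4][0]
    except socket.gaierror:
        return "DNS failure - stop, fix resolution first"
    # Path + port - can we complete a TCP handshake?
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            pass
    except ConnectionRefusedError:
        return f"refused at {ip}:{port} - host reachable, nothing listening"
    except (socket.timeout, OSError):
        return f"timeout/unreachable at {ip}:{port} - check routing and firewalls"
    return f"TCP OK to {ip}:{port} - move up to protocol checks (curl -v)"

print(diagnose("no-such-host.invalid", 443))  # fails at the DNS layer
print(diagnose("localhost", 59998))           # resolves, then refused
```

Note that "refused" and "timeout" come back as different diagnoses - the same distinction the patterns in the next section rely on.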

9. Patterns I Keep Seeing in Production

"Connection timed out"
Packets aren't making the full round trip. Almost always a firewall, routing, or security group issue. Rarely fixed by restarting the app.

"Connection refused"
The host is reachable but nothing is listening on that port. Check if the service is running, if it's bound to the right address, or if there's a port mismatch.

# Timed out - check network path
traceroute db.internal
nc -vz db.internal 5432

# Refused - check the service
ss -tlnp | grep 5432
systemctl status postgresql

TLS errors after certificate renewal
The cert was renewed but the intermediate chain wasn't included. Browser trusts it (fills the gap), but server-to-server calls fail.

Load balancer marking healthy service as unhealthy
Probe path, port, or expected status code doesn't match what the app serves. Fix the probe config, not the app.

Cross-zone latency misdiagnosed as slow code
App works, but every DB call pays a cross-zone or cross-region RTT penalty. Easy to blame the code when the real issue is placement.

10. Common Mistakes

  • Debugging from only one host - The issue might be DNS, routing, or policy that only affects certain nodes. Always test from multiple places.
  • Ignoring DNS TTL during rollouts - You updated the DNS record, but half your servers still cache the old IP for another hour. Wait for TTL to expire or flush caches explicitly.
  • Treating timeout and refused as the same error - Timeout means packets aren't reaching the destination. Refused means they are, but nothing is listening. Completely different fault domains.
  • Increasing retries during a partial outage - If 50% of requests are failing and you add retries, you just doubled the load on an already struggling system. This causes retry storms that make partial outages total.
  • Assuming TLS errors are always expired certs - It could be a missing intermediate chain or a hostname mismatch. Check the full chain, not just the expiry date.
  • Not checking the bind address - The service is running and the port is correct, but it's bound to 127.0.0.1 instead of 0.0.0.0. Local curl works, nothing else can reach it.

Key Takeaways

  • Networking is a first-class skill - Not a "network team only" topic
  • Use layers to debug - DNS, path, port, protocol, policy - in that order
  • DNS causes more outages than you think - Always query multiple resolvers
  • Timeout and refused are different problems - They point to different fault domains
  • TLS breaks in three ways - Chain, hostname, or expiry
  • Health check mismatches are silent killers - Verify LB probes match your app
  • Plan your network before you build - CIDR, NAT, and subnets save you later

Most production outages I've debugged had a networking component. Learning to think in layers has consistently helped me find the root cause faster.