2 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

DNS Deep Dive

Real Incident: Dyn DDoS Attack (October 2016)

On October 21, 2016, a massive DDoS attack targeted Dyn, a major DNS provider. The Mirai botnet — 100,000+ compromised IoT devices — flooded Dyn with 1.2 Tbps of traffic. Result: Twitter, GitHub, Netflix, Reddit, Spotify, and dozens of major sites went completely unreachable for hours. The sites themselves were fine — their servers were running. But nobody could resolve their domain names to IP addresses. DNS is the single most critical piece of internet infrastructure, and a single provider failure took down half the internet.

Why This Comes Up in Interviews

Every system design begins with a user typing a URL. DNS is the entry point for all distributed systems. Interviewers want to hear:

How DNS resolution actually works (not just "it translates names to IPs")
How you design for DNS failover and redundancy
How GeoDNS enables global load balancing
TTL trade-offs between availability and freshness
How DNS becomes a bottleneck or single point of failure

DNS Hierarchy — The Tree Structure

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
graph TD
    ROOT["Root Servers (13 clusters, 1000+ instances)"]
    TLD_COM[".com TLD Server"]
    TLD_ORG[".org TLD Server"]
    TLD_IO[".io TLD Server"]
    AUTH_GOOGLE["ns1.google.com (Authoritative)"]
    AUTH_GITHUB["ns1.github.com (Authoritative)"]

    ROOT --> TLD_COM
    ROOT --> TLD_ORG
    ROOT --> TLD_IO
    TLD_COM --> AUTH_GOOGLE
    TLD_COM --> AUTH_GITHUB

    style ROOT fill:#e74c3c,color:#fff
    style TLD_COM fill:#f39c12,color:#fff
    style TLD_ORG fill:#f39c12,color:#fff
    style TLD_IO fill:#f39c12,color:#fff
    style AUTH_GOOGLE fill:#27ae60,color:#fff
    style AUTH_GITHUB fill:#27ae60,color:#fff

Level	What	Count	Example
Root	Starting point for all queries	13 logical servers (1000+ physical via anycast)	`a.root-servers.net`
TLD	Manages a top-level domain	~1500 TLDs	`.com`, `.org`, `.io`, `.dev`
Authoritative	Holds actual DNS records for a domain	Per-domain	`ns1.google.com`
Recursive Resolver	Caches and resolves on behalf of clients	ISPs, 8.8.8.8, 1.1.1.1	Google Public DNS

Resolution Process — Step by Step

Query: What is the IP address of api.stripe.com?

Step	Who	Action	Typical Latency
1	Browser	Check browser DNS cache	0ms
2	OS	Check OS DNS cache (`/etc/hosts`, system cache)	0ms
3	Recursive Resolver	Check resolver cache (ISP or 8.8.8.8)	1-5ms
4	Resolver → Root	"Who handles `.com`?" → Returns TLD server IPs	20-50ms
5	Resolver → TLD	"Who handles `stripe.com`?" → Returns authoritative NS	20-50ms
6	Resolver → Authoritative	"What is `api.stripe.com`?" → Returns A record	20-50ms
7	Resolver → Client	Caches answer (per TTL), returns to client	1ms

Total uncached resolution: ~100-200ms (3-4 network hops) Cached resolution: <5ms (resolver already knows the answer)

The math: With a 300s TTL and 1000 req/s to the same domain, only 1 in 300,000 requests actually triggers full resolution. Cache hit rate: 99.9997%.

DNS Record Types

Record	Purpose	Example	Interview Relevance
A	Domain → IPv4 address	`stripe.com → 104.21.5.77`	Basic resolution
AAAA	Domain → IPv6 address	`stripe.com → 2606:4700::6812`	IPv6 support
CNAME	Domain → another domain (alias)	`www.stripe.com → stripe.com`	CDN integration, cannot coexist with other records at zone apex
MX	Mail server for domain	`stripe.com → mail.stripe.com (priority 10)`	Email routing
NS	Authoritative nameservers	`stripe.com → ns1.stripe.com`	Delegation
TXT	Arbitrary text (verification, SPF)	`stripe.com → "v=spf1 include:..."`	Domain verification
SRV	Service discovery (host + port)	`_http._tcp.api.stripe.com → 443 api1.stripe.com`	Microservices
SOA	Zone authority metadata	Contains serial, refresh, retry, expire	Zone transfers

CNAME trap (common interview gotcha): You CANNOT have a CNAME at the zone apex (stripe.com). Why? CNAME must be the only record at a name, but apex needs SOA and NS records. Solution: use ALIAS/ANAME (provider-specific) or A records.

TTL — The Critical Trade-off

TTL Value	Pros	Cons	Use Case
30-60s	Fast failover, quick changes	High query load on authoritative servers	Active failover systems
300s (5min)	Good balance	5min stale window	Most web services
3600s (1hr)	Low query load, high cache hit	Slow failover (up to 1hr stale)	Stable services
86400s (1day)	Minimal DNS traffic	Cannot change quickly	Rarely-changing records

Pre-migration trick: Before a DNS migration, lower TTL to 60s 48 hours in advance. Old TTL (say 24hr) needs time to expire from all caches. Then migrate. Then raise TTL back.

DNS Load Balancing

Strategy	How	Pros	Cons
Round-robin	Return multiple A records, rotate order	Simple, no special infrastructure	No health awareness, uneven load
Weighted	Return records with weighted probability	Control traffic split (90/10 canary)	Limited granularity
GeoDNS	Return different IPs based on client location	Users hit nearest datacenter	Inaccurate for VPN users
Latency-based	Route to lowest-latency endpoint	Best user experience	Requires latency measurement
Failover	Health-check endpoints, remove unhealthy	High availability	TTL delay in failover

GeoDNS resolution example:

User Location	Query	Response	Datacenter
Tokyo, Japan	`api.company.com`	`103.21.x.x`	Tokyo DC
London, UK	`api.company.com`	`185.45.x.x`	London DC
New York, US	`api.company.com`	`104.16.x.x`	Virginia DC

DNS Failover Architecture

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
graph LR
    CLIENT[Client] --> RESOLVER[Recursive Resolver]
    RESOLVER --> GEODNS[GeoDNS Provider]
    GEODNS -->|Health OK| PRIMARY["Primary DC (Virginia)"]
    GEODNS -->|Health FAIL| SECONDARY["Secondary DC (Oregon)"]

    MONITOR[Health Monitor] -->|Check every 10s| PRIMARY
    MONITOR -->|Check every 10s| SECONDARY
    MONITOR -->|Update records| GEODNS

    style PRIMARY fill:#27ae60,color:#fff
    style SECONDARY fill:#f39c12,color:#fff
    style MONITOR fill:#3498db,color:#fff

Failover timing math:

Health check interval: 10s
Failure threshold: 3 consecutive failures
Detection time: 30s
DNS TTL: 60s
Worst-case failover: 90s (failure happens right after health check + TTL hasn't expired)

DNS Security

Threat	Attack	Mitigation
Cache poisoning	Attacker injects fake records into resolver cache	DNSSEC (cryptographic signing of records)
DDoS on DNS	Flood DNS servers (like Dyn attack)	Anycast, multiple providers, over-provisioning
DNS hijacking	Compromise registrar, change NS records	Registrar lock, DNSSEC, monitoring
DNS tunneling	Exfiltrate data encoded in DNS queries	Monitor query patterns, limit TXT record sizes

Interview Framework

When DNS comes up in system design:

Step 1: "Users resolve our domain through DNS. I'd use a managed DNS provider (Route53/Cloudflare) with GeoDNS to route users to the nearest datacenter for lowest latency."

Step 2: "For high availability, I'd configure health-check-based failover with a 60s TTL. If the primary datacenter goes down, DNS automatically routes to the secondary within 90 seconds worst-case."

Step 3: "For global load balancing, GeoDNS returns different IPs based on user geography — Tokyo users hit Asia-Pacific servers, EU users hit Frankfurt."

Step 4: "To avoid the Dyn-style single point of failure, I'd use multiple DNS providers (Route53 + Cloudflare) with the same zone data. If one provider goes down, the other still resolves."

Quick Recall

Question	Answer
How many root servers?	13 logical (a through m), 1000+ physical instances via anycast
Full resolution hops?	Client → Resolver → Root → TLD → Authoritative (3-4 hops)
Why can't CNAME be at apex?	CNAME must be sole record; apex requires SOA + NS
GeoDNS purpose?	Route users to nearest datacenter based on IP geolocation
TTL trade-off?	Low = fast failover but high DNS load; High = low load but slow changes
DNS failover worst-case?	Detection time + TTL (typically 60-90 seconds)
How to prevent Dyn-style failure?	Multi-provider DNS (Route53 + Cloudflare with same zone)
DNSSEC purpose?	Cryptographically sign records to prevent cache poisoning