WhatsApp is just another CRUD app with a chat interface. Send a message, store it in a database, retrieve it when someone opens the app. Maybe throw in some Redis for caching. Wrong.
WhatsApp is a routing system with transient queues, not a message archive. The server's job is to shuttle encrypted packets between persistent connections, hold messages briefly if someone is offline, then delete them after delivery. In 2014, the architecture that would eventually serve 2 billion users ran on a few hundred servers because the team made brutal trade-offs: vertical density over horizontal sprawl, runtime surgery over abstraction layers, graceful failure over compensating complexity. This is the opposite of how most modern apps are built.
Table of Contents
Architecture Philosophy
Simple Product, Hard Backend
Never Lose a Message
Vertical Density Over Horizontal Sprawl
Messaging Core
One Connection, One Process
Why Erlang Mattered
Routing, Not Archiving
342K Messages/Sec In, 712K Out
Data and Storage
Mnesia for Metadata
Write-Back Caching, 98% Hit Rate
Custom Runtime Patches
Meta-Clustering and Wandist
Calling Architecture
Signaling vs. Relay
End-to-End Encrypted Relays
Thousands of PoPs, Hundreds of Containers
Critical vs. Ephemeral State
Graceful Degradation Under Load
Privacy as Constraint
Signal Protocol, Routing Without Reading
Multi-Device Fan-Out
Business API Caveat
Operations and Monitoring
Manual Deploys, Hot Loading
Monitoring: mon.sh and Repeat Alarms
Bounded Failure Domains
Fix Fast Over Redundancy
Lessons for System Design
The Core Pattern
When to Choose This
Summary
Architecture Philosophy
Simple Product, Hard Backend
WhatsApp deliberately kept the user-facing product minimal. No stories for years. No algorithmic feed. Just messaging and calls. That simplicity was not laziness; it was a forcing function. Every bit of engineering effort went into making the backend brutally efficient instead of building features that would complicate the system.
The trade-off was intentional. A simple product means narrow state per user, which means you can pack more users onto fewer machines. More features mean more joins, more indices, more cache invalidation strategies, more operational runbooks. WhatsApp avoided that entire category of complexity by saying no at the product level.
The result: in 2014, 465 million users, 19 billion messages/day inbound, 40 billion outbound, on roughly 550 servers. About 150 of those were chat servers handling 147 million concurrent connections. That is close to 1 million connections per chat server. You do not get there by accident.
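To make that density concrete, here is a quick back-of-the-envelope calculation using only the 2014 figures quoted above. The chat server count is approximate ("about 150"), so treat the results as rough, and note that the per-second rates are daily averages, not peaks.

```python
# Figures quoted in the text for 2014.
CONCURRENT_CONNECTIONS = 147_000_000
CHAT_SERVERS = 150                 # "about 150", so results are approximate
INBOUND_PER_DAY = 19_000_000_000
OUTBOUND_PER_DAY = 40_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60

connections_per_server = CONCURRENT_CONNECTIONS / CHAT_SERVERS
avg_inbound_per_sec = INBOUND_PER_DAY / SECONDS_PER_DAY
avg_outbound_per_sec = OUTBOUND_PER_DAY / SECONDS_PER_DAY

print(f"connections per chat server: {connections_per_server:,.0f}")
print(f"average inbound msgs/sec:    {avg_inbound_per_sec:,.0f}")
print(f"average outbound msgs/sec:   {avg_outbound_per_sec:,.0f}")
```

That works out to 980,000 connections per chat server, with averages of roughly 220,000 inbound and 463,000 outbound messages per second.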
Never Lose a Message
This was the core design principle. Messages must be delivered, even if the recipient is offline for days. Even if servers fail. Even if the network is unstable. That constraint shaped every storage, queueing, and retry decision.
The architecture prioritized availability over features. If a subsystem went down, the rest of the system kept working. If a cluster was overloaded, it throttled gracefully rather than collapsing. Failure was expected, but losing messages was not acceptable.
This is why WhatsApp used store-and-forward semantics. The server holds the message until the recipient confirms receipt, then deletes it. That guarantee is expensive (it requires persistent queues, retry logic, ack tracking, and careful failure handling), but it is non-negotiable.
Vertical Density Over Horizontal Sprawl
Most modern architectures scale by adding more small nodes. WhatsApp did the opposite: fewer, larger machines with millions of processes per host. They ran FreeBSD on bare metal with dual CPUs, 128-768 GB of RAM, and tuned every layer from the allocator to the NIC driver.
This was a deliberate trade-off. Operational complexity scales with node count, not core count. A hundred servers are easier to monitor, deploy, and debug than ten thousand containers. The team was small (around 50 engineers at acquisition), and they needed an architecture they could actually operate.
The cost is that you have to tune the runtime yourself. WhatsApp patched Erlang's BEAM VM, rewrote parts of OTP, and modified kernel subsystems. That is not feasible for most teams, but for WhatsApp it was cheaper than managing a massive fleet.
Messaging Core

[Figure: one user, one process]
One Connection, One Process
WhatsApp clients maintain a persistent TCP connection to the backend. That connection stays open as long as the app is running. No polling. No repeated handshakes. When a message arrives, it is pushed immediately over that connection.
On the server side, each connection is handled by one Erlang process. That process owns the connection state: session info, encryption keys, message queue. Erlang processes are lightweight (millions can run on a single host), and each has its own memory and garbage collection context.
This model prevents GC pauses from cascading. If one user sends a huge photo and their process does heavy GC, other users are unaffected. Compare that to a shared-heap runtime such as the JVM or Go: any stop-the-world phase, however short, pauses every connection in the process at once.
Erlang also gives you fault isolation by default. If a process crashes, OTP supervision restarts it without destabilizing the rest of the system. That is critical when you have millions of connections per host: random crashes should not propagate.
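The restart-and-isolate idea can be loosely illustrated in Python. This is only an analogy: Python has no per-process heaps or OTP supervisors, the handlers run one after another rather than concurrently, and `supervise`, `make_flaky`, and the restart limit are hypothetical names and values chosen for the sketch.

```python
def supervise(handlers, max_restarts=2):
    """Run each connection handler in isolation, restarting it if it crashes.

    Returns {name: ("ok", result)} or {name: ("gave_up", last_error_text)}.
    """
    outcomes = {}
    for name, handler in handlers.items():
        for _attempt in range(max_restarts + 1):
            try:
                outcomes[name] = ("ok", handler())
                break  # handler finished, stop restarting it
            except Exception as exc:
                # The crash stays inside this handler; the others are untouched.
                outcomes[name] = ("gave_up", str(exc))
    return outcomes


def make_flaky(failures_before_success):
    """Build a hypothetical handler that crashes a fixed number of times."""
    state = {"calls": 0}

    def handler():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError(f"crash #{state['calls']}")
        return f"ok after {state['calls']} attempt(s)"

    return handler


outcomes = supervise({
    "alice": make_flaky(0),   # never crashes
    "bob": make_flaky(1),     # crashes once, recovers after one restart
    "carol": make_flaky(10),  # keeps crashing, supervisor gives up
})
for name, outcome in outcomes.items():
    print(name, outcome)
```

Bob's crash costs one restart, and Carol's repeated crashes exhaust the restart budget. Neither affects Alice's session.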
Why Erlang Mattered
Erlang was originally designed for telecom switches, which have similar requirements: massive concurrency, long-lived sessions, fault tolerance, minimal downtime. The BEAM VM was built for exactly this traffic pattern.
Key properties that mattered for WhatsApp:
Lightweight processes: spawn millions without worrying about thread overhead
Per-process GC: no global stop-the-world pauses
Message-passing concurrency: no shared-memory bugs, no lock contention
Supervision trees: automatic restart on crash, with bounded failure domains
Hot code loading: deploy new code without dropping connections
WhatsApp started with ejabberd, an Erlang-based XMPP server, and heavily modified it. Eventually they replaced so much of it that it was effectively a custom system, but the Erlang foundation stayed.
This was not a trendy choice. Erlang is niche, the tooling is sparse, and the community is small. But it was the right model for the problem: millions of mostly idle connections with occasional bursts of activity.
Routing, Not Archiving
This is the most misunderstood part of WhatsApp's architecture. The server does not store your chat history. It is not a database of messages. It is a router with transient queues.
Message flow works like this:
Sender's client encrypts the message and sends it over the persistent connection
Server looks up the recipient's connection
If the recipient is online, route the message immediately
If offline, write it to a temporary queue
When the recipient reconnects, drain the queue in order
After the client sends an ack, delete the message from the server
The server's retention window is short. Most messages are delivered within seconds. The offline queue exists to handle network hiccups and brief disconnections, not long-term storage.
This design made WhatsApp's backend much simpler. No complex indexing. No queries across message history. No schema migrations when the message format changed. The server just routes encrypted blobs and forgets them.
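The flow above can be sketched as a toy in-memory router. This is a minimal illustration under stated assumptions, not WhatsApp's implementation: the `Router` class and its method names are hypothetical, Python lists stand in for live connections, and nothing here is persisted.

```python
class Router:
    """Toy store-and-forward router: push if online, hold if offline, delete on ack."""

    def __init__(self):
        self.connections = {}  # user -> list standing in for a live connection
        self.pending = {}      # user -> {msg_id: blob}, kept in send order until acked

    def send(self, recipient, msg_id, blob):
        # The blob is opaque ciphertext; the router never inspects it.
        self.pending.setdefault(recipient, {})[msg_id] = blob
        conn = self.connections.get(recipient)
        if conn is not None:
            conn.append((msg_id, blob))  # recipient online: push immediately

    def connect(self, user):
        # On (re)connect, replay everything still unacked, oldest first.
        conn = list(self.pending.get(user, {}).items())
        self.connections[user] = conn
        return conn

    def disconnect(self, user):
        self.connections.pop(user, None)

    def ack(self, user, msg_id):
        # Delivery confirmed by the client: the server forgets the message.
        self.pending.get(user, {}).pop(msg_id, None)


router = Router()
router.send("bob", "m1", b"<ciphertext 1>")  # bob is offline, so this is queued
router.send("bob", "m2", b"<ciphertext 2>")
inbox = router.connect("bob")                # queue drains in order
for msg_id, _blob in inbox:
    router.ack("bob", msg_id)
print(inbox, router.pending["bob"])
```

Keeping a message in `pending` until its ack arrives is what makes a dropped connection safe in this sketch: anything unacknowledged is simply replayed on the next connect.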
342K Messages/Sec In, 712K Out
By 2014, WhatsApp was handling 342,000 inbound messages/second at peak and 712,000 outbound. That asymmetry is expected: group chats cause fan-out. One incoming message might generate dozens of outbound deliveries.
The system also handled 230,000 logins per second at peak. That is important because login involves session setup, key exchange, offline queue replay, and connection state initialization. It is heavier than a simple message send.
The key architectural point is that these numbers were achieved on around 150 chat servers. That is roughly 2,300 messages/sec inbound per server and 4,700 outbound. No single server is a bottleneck because each user's connection is independent. Erlang's process model makes that parallelism natural.
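The per-server numbers and the fan-out ratio follow directly from the peak figures above. A short check, again treating the count of 150 chat servers as approximate:

```python
PEAK_INBOUND_PER_SEC = 342_000
PEAK_OUTBOUND_PER_SEC = 712_000
CHAT_SERVERS = 150  # approximate, as stated in the text

inbound_per_server = PEAK_INBOUND_PER_SEC / CHAT_SERVERS
outbound_per_server = PEAK_OUTBOUND_PER_SEC / CHAT_SERVERS
fan_out = PEAK_OUTBOUND_PER_SEC / PEAK_INBOUND_PER_SEC

print(f"inbound per server:  {inbound_per_server:,.0f} msgs/sec")
print(f"outbound per server: {outbound_per_server:,.0f} msgs/sec")
print(f"average fan-out:     {fan_out:.2f} deliveries per inbound message")
```

The fan-out ratio comes out at about 2.08. That average includes 1:1 chats, which produce a single delivery, alongside group messages that produce many.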
Data and Storage
Mnesia for Metadata
WhatsApp used Mnesia, Erlang's built-in distributed database, for metadata: user accounts, device mappings, session state, group membership. Mnesia is not a general-purpose SQL database. It is a key-value store with optional transactions, designed to live inside an Erlang cluster.
The architecture around Mnesia was heavily customized:
Data was partitioned across 2-32 shards to avoid hotspots
Most operations used async_dirty instead of transactions to avoid coupling nodes
Records were hashed so that a given key always hit the same process path
Transactional coupling is a killer at scale. If every write requires a two-phase commit across nodes, latency explodes and failure domains grow. WhatsApp avoided that by using dirty operations and designing schemas so that most writes were isolated to a single partition.
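The partitioning idea can be sketched in a few lines. This is not Mnesia's API. It is a Python illustration of hashing a key to a fixed shard so that each write touches exactly one partition. `shard_for`, `ShardedStore`, and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard deterministically, so the same key always lands in one place."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Each shard is an independent dict; a write never coordinates across shards."""

    def __init__(self, num_shards: int):
        self.shards = [{} for _ in range(num_shards)]

    def write(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def read(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=32)
store.write("user:1001", {"devices": ["phone", "desktop"]})
print(shard_for("user:1001", 32), store.read("user:1001"))
```

Because the shard is a pure function of the key, there is no cross-shard commit to wait on, which is the property the dirty-operation approach depends on.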
Write-Back Caching, 98% Hit Rate
WhatsApp's offline message queue used a write-back cache with a reported 98% hit rate. That means 98% of messages were delivered from memory before they ever needed to be flushed to disk.
Why this worked: most messages are read quickly. If someone is offline, they usually reconnect within minutes or hours, not days. The cache absorbed the write load and batched disk flushes, which is much faster than syncing every message individually.
The cache was backed by persistent storage, but the hot path stayed in memory. This is a classic trade-off: optimize for the common case (immediate delivery) and handle the uncommon case (multi-day offline) with slower disk-backed recovery.
This is also why WhatsApp's storage layer was not a bottleneck. The working set fit in RAM, and disk I/O was batched and asynchronous.
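A simplified model of this hot path is sketched below. It assumes a queue where messages sit in memory first and only the ones still undelivered at flush time are written out in a batch. `WriteBackQueue` and its fields are hypothetical, a dict stands in for disk, and the demo is contrived so that 49 of 50 messages are delivered before the flush.

```python
class WriteBackQueue:
    """Toy offline queue: deliver from memory when possible, batch the rest to 'disk'."""

    def __init__(self):
        self.memory = {}  # msg_id -> blob, not yet flushed
        self.disk = {}    # stand-in for persistent storage
        self.hits = 0     # deliveries served from memory
        self.misses = 0   # deliveries that had to read from disk

    def enqueue(self, msg_id, blob):
        self.memory[msg_id] = blob

    def deliver(self, msg_id):
        if msg_id in self.memory:
            self.hits += 1
            return self.memory.pop(msg_id)
        self.misses += 1
        return self.disk.pop(msg_id)

    def flush(self):
        """Write everything still undelivered in one batch; return how many were written."""
        flushed = len(self.memory)
        self.disk.update(self.memory)
        self.memory.clear()
        return flushed

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


queue = WriteBackQueue()
for i in range(50):
    queue.enqueue(i, b"<ciphertext>")
for i in range(49):
    queue.deliver(i)           # recipients reconnect quickly: served from memory
written = queue.flush()        # only the one straggler reaches disk
queue.deliver(49)              # the slow recipient is served from disk
print(written, queue.hit_rate())
```

In this toy run a single message is written to disk and the hit rate is 0.98, mirroring the reported figure by construction.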
Custom Runtime Patches
At millions of connections per host, bottlenecks stop being in application code. They move to the runtime, the scheduler, the allocator, the I/O subsystem.
WhatsApp patched the Erlang VM itself to eliminate these bottlenecks. Examples from the 2014 Erlang Factory talk:
Multiple timer wheels: reduced contention on the global timer lock
GC throttling: prevented large mailboxes from triggering expensive GC at the wrong time
Round-robin async file I/O: eliminated head-of-line blocking in the file worker pool
Increased distribution buffer sizes: allowed more in-flight data between nodes
Improved check_io allocation: reduced memory fragmentation in the I/O subsystem
Multiple mnesia_tm async_dirty senders: parallelized Mnesia's internal message handling
These are not high-level design choices. This is runtime surgery. Most teams never touch this layer, but WhatsApp had to because the standard BEAM VM was not designed for this traffic density.
The lesson here: at extreme scale, the runtime becomes part of the architecture. You cannot treat it as a black box.
Meta-Clustering and Wandist

[Figure: why WhatsApp used meta-clustering]
Erlang clusters work well up to a certain size, but beyond that they become operationally fragile. Fully connected clusters generate O(n^2) monitoring traffic, and a network partition can split the cluster unpredictably.
WhatsApp solved this with meta-clustering and a custom transport layer called wandist. Instead of one giant cluster, they created a mesh of smaller clusters that communicated through a custom gen_tcp-based protocol.
The key idea: limit the size of any single cluster, but allow clusters to route messages to each other. A message could cross cluster boundaries with single-hop routing rather than full broadcast coupling.
This added complexity (now you have inter-cluster routing logic), but it kept failure domains bounded. A network issue in one cluster did not cascade to every node in the system.
This is a pattern worth understanding: when a distributed primitive does not scale, you do not always replace it. Sometimes you wrap it in a higher-level abstraction that controls the blast radius.
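A rough way to see why bounding cluster size helps is to count node-to-node links. The sketch below compares one fully connected cluster with a set of smaller fully connected clusters. The inter-cluster topology is an assumption made purely for illustration (one link per pair of clusters); the source does not describe how wandist actually connects clusters, and the node counts are made up.

```python
def full_mesh_links(nodes: int) -> int:
    """Links in a fully connected cluster: every node pairs with every other node."""
    return nodes * (nodes - 1) // 2


def meta_cluster_links(clusters: int, nodes_per_cluster: int) -> int:
    """Links when nodes are grouped into small fully connected clusters.

    Assumption for illustration only: exactly one link per pair of clusters.
    """
    intra = clusters * full_mesh_links(nodes_per_cluster)
    inter = full_mesh_links(clusters)
    return intra + inter


one_big_cluster = full_mesh_links(400)
twenty_small_clusters = meta_cluster_links(clusters=20, nodes_per_cluster=20)
print(one_big_cluster, twenty_small_clusters)
```

For 400 hypothetical nodes, the single mesh needs 79,800 links, while 20 clusters of 20 need 3,990 under this assumption. The quadratic term now applies to the cluster size rather than the whole fleet.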
Calling Architecture
Signaling vs. Relay
WhatsApp calls involve two separate systems:
Signaling service: handles call setup, ringing, and connection negotiation
Relay service: forwards encrypted audio/video packets during the call
The separation is important. Signaling is control-plane logic: who is calling whom, what devices are involved, what network paths are available. It is relatively low traffic but state-heavy.
The relay is data-plane logic: shuttle encrypted media packets between endpoints. It is high traffic but conceptually simple: just forward packets. No transcoding, no inspection, no processing.
This architectural split is common in real-time systems. Control and data have different scaling properties, so they get different implementations.
End-to-End Encrypted Relays
Because WhatsApp calls are end-to-end encrypted using the Signal Protocol, the relay server never sees plaintext media. It just forwards encrypted packets. That changes what the relay can and cannot do.
It cannot:
Transcode audio/video to save bandwidth
Inspect the stream to detect quality issues
Apply server-side echo cancellation or noise reduction
Selectively drop certain packets based on content
It can:
Route packets based on addresses and session metadata
Monitor bandwidth and latency at the transport level
Fail over to a different relay if the current one crashes
Apply rate limiting and throttling based on flow metadata
This constraint shaped the entire relay architecture. Since the relay cannot be smart about media, it has to be extremely efficient at dumb forwarding. That means optimizing for proximity, low latency, and horizontal scalability.
Thousands of PoPs, Hundreds of Containers
WhatsApp's relay infrastructure runs across thousands of Meta points-of-presence (PoPs), with hundreds of relay containers per PoP. That footprint is necessary because voice and video are latency-sensitive. If the relay is too far from the users, the call quality degrades.
The architecture uses:
Load balancers to distribute incoming connections across containers in a PoP
State servers to persist call topology and participant metadata
Separation between networking and decision-making layers so that packet forwarding is fast and cheap
The relay container itself is stateless in the sense that it does not persist call state locally. If a container crashes, the load balancer detects it, connections fail over to another container, and the new container restores state from the state server.
This is a cloud-native pattern, even though WhatsApp's messaging backend was not. Calls and messages have different requirements. Messaging optimized for connection density and fault isolation; calling optimized for geographic distribution and failover speed.
Critical vs. Ephemeral State
Not all call state is equal. WhatsApp splits it into two categories:
Critical state:
Group size and participant list
Device network addresses
Call topology (who is connected to whom)
Join/leave events
This state is written to the state server on every change. Without it, the call cannot be reconstructed after a failover.
Ephemeral state:
Bandwidth estimates
Active speaker detection
Real-time quality metrics
This state is not persisted. It can be recomputed or allowed to drift briefly after failover without breaking the call.
This distinction is architecturally important. If you persist everything, you serialize your write path and add latency. If you persist nothing, you cannot recover from failures. The right answer is selective durability: store only what you cannot reconstruct.
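Selective durability can be sketched as follows. The class names (`StateServer`, `RelayContainer`, `CriticalState`), the fields, and the addresses are all hypothetical. The only point is that critical state is written through on every change, while ephemeral state stays local and is rebuilt after failover.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CriticalState:
    participants: list = field(default_factory=list)
    addresses: dict = field(default_factory=dict)


class StateServer:
    """Stand-in for the durable store that survives a relay crash."""

    def __init__(self):
        self._calls = {}

    def save(self, call_id, state):
        self._calls[call_id] = copy.deepcopy(state)

    def load(self, call_id):
        return copy.deepcopy(self._calls[call_id])


class RelayContainer:
    def __init__(self, state_server):
        self.state_server = state_server
        self.critical = {}   # call_id -> CriticalState, mirrored to the state server
        self.ephemeral = {}  # call_id -> recomputable metrics, never persisted

    def join(self, call_id, user, address):
        state = self.critical.setdefault(call_id, CriticalState())
        state.participants.append(user)
        state.addresses[user] = address
        self.state_server.save(call_id, state)  # written on every change

    def observe_bandwidth(self, call_id, kbps):
        self.ephemeral.setdefault(call_id, {})["bandwidth_kbps"] = kbps

    def take_over(self, call_id):
        # Failover: restore what cannot be reconstructed, start metrics from scratch.
        self.critical[call_id] = self.state_server.load(call_id)
        self.ephemeral[call_id] = {}


store = StateServer()
relay_a = RelayContainer(store)
relay_a.join("call-1", "alice", "203.0.113.5:40000")
relay_a.join("call-1", "bob", "198.51.100.7:40002")
relay_a.observe_bandwidth("call-1", 850)

relay_b = RelayContainer(store)  # relay_a is assumed to have crashed here
relay_b.take_over("call-1")
print(relay_b.critical["call-1"], relay_b.ephemeral["call-1"])
```

After the takeover, the participant list and addresses are intact and the bandwidth estimate is gone. The new relay simply measures it again from live traffic.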
Graceful Degradation Under Load
When the relay system is overloaded, it does not fail randomly. It fails in a predetermined order:
Ongoing calls are prioritized over new calls. If capacity is tight, reject new call attempts but keep existing calls running.
1:1 calls are prioritized over group calls. A two-person call uses less relay capacity than a 32-person group call.
Throttling is applied gradually. Slow down new connections rather than dropping them instantly.
This is graceful degradation by design. The system knows which users to shed load from, and it does so in a way that minimizes disruption.
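One way to express that ordering is an admission function that returns the fraction of requests to accept at a given utilization. The thresholds below are invented for illustration, as is the function name `accept_fraction`. The source describes the priority order and the gradual throttling, not the actual numbers.

```python
def accept_fraction(utilization: float, is_ongoing: bool, is_group: bool) -> float:
    """Fraction of requests to admit at a relay utilization between 0.0 and 1.0.

    Hypothetical policy: ongoing calls are always kept, new group calls are
    throttled first, new 1:1 calls are throttled last. Each ramp is linear.
    """
    if is_ongoing:
        return 1.0
    # (start throttling, reject everything) thresholds are made-up values.
    start, stop = (0.70, 0.85) if is_group else (0.85, 0.95)
    if utilization <= start:
        return 1.0
    if utilization >= stop:
        return 0.0
    return (stop - utilization) / (stop - start)


for load in (0.60, 0.80, 0.90, 0.97):
    print(
        f"load={load:.2f}",
        f"ongoing={accept_fraction(load, True, True):.2f}",
        f"new 1:1={accept_fraction(load, False, False):.2f}",
        f"new group={accept_fraction(load, False, True):.2f}",
    )
```

Because the ramp is linear rather than a hard cutoff, new connections slow down before they are refused outright, which matches the gradual throttling described above.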
Additionally, because calls are latency-sensitive, spare capacity in a distant region cannot always help. If users in India are overloading the relay PoPs in Mumbai, spinning up more capacity in Virginia does not solve the problem. The architecture has to handle regional overload locally, potentially by relocating non-latency-sensitive traffic elsewhere to free up local resources.
Privacy as Constraint
Signal Protocol, Routing Without Reading
WhatsApp uses the Signal Protocol for end-to-end encryption. That means messages and calls are encrypted on the sender's device and only decrypted on the recipient's device. The server never sees plaintext.
This is not just a feature; it is an architectural constraint. The backend cannot:
Search message content
Moderate messages automatically
Apply server-side spam filters based on content
Store messages in a readable format for legal requests
What the backend still does:
Route encrypted blobs to the correct recipient
Map accounts to devices and sessions
Store encrypted messages temporarily if the recipient is offline
Handle delivery acknowledgments and retry logic
Sync metadata like group membership and device pairing
The server is a transport layer with queuing, not a content processor. That is a fundamental shift from most messaging apps, which do server-side indexing, search, and moderation.
Multi-Device Fan-Out
Modern WhatsApp supports multiple devices per account: a phone, a desktop client, a browser tab. Each device has its own encryption session. That means the sender has to encrypt the message multiple times, once per recipient device.
Architectural consequences:
The sender does more work (multiple encryptions)
The server routes multiple copies of the same message
Key management complexity increases (each device has its own keys)
Metadata overhead grows (account-to-device mapping, session state per device)
But the privacy guarantee holds: the server still does not see plaintext. It just has to route more encrypted blobs.
This is a client-side fan-out model. The sender's client encrypts and sends per-device copies. The server does not fan out plaintext to multiple devices-it fans out encrypted blobs. That distinction matters because it keeps the server from being a decryption point.
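The division of labor can be sketched as below. The cipher is a deliberately trivial XOR placeholder. It is not the Signal Protocol and is not secure, and it exists only so the example runs. `toy_encrypt`, `client_fan_out`, `server_route`, and the per-device keys are all hypothetical.

```python
import hashlib


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Placeholder XOR cipher. NOT the Signal Protocol and NOT secure."""
    stream = hashlib.sha256(key).digest()
    return bytes(b ^ stream[i % len(stream)] for i, b in enumerate(data))


toy_decrypt = toy_encrypt  # XOR with the same keystream is its own inverse


def client_fan_out(plaintext: bytes, device_keys: dict) -> dict:
    """Sender side: produce one ciphertext per recipient device."""
    return {device: toy_encrypt(key, plaintext) for device, key in device_keys.items()}


def server_route(envelopes: dict, mailboxes: dict) -> None:
    """Server side: forward opaque blobs. It holds no keys and never decrypts."""
    for device, blob in envelopes.items():
        mailboxes.setdefault(device, []).append(blob)


# Hypothetical per-device session keys known only to the sender and each device.
device_keys = {"bob-phone": b"session-key-1", "bob-desktop": b"session-key-2"}

envelopes = client_fan_out(b"see you at 6", device_keys)
mailboxes = {}
server_route(envelopes, mailboxes)

received = toy_decrypt(device_keys["bob-desktop"], mailboxes["bob-desktop"][0])
print(received)
```

Note that `server_route` never receives `device_keys`. The server's cost grows with the number of devices, but its knowledge of the content does not.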
Business API Caveat
WhatsApp's encryption model is not uniform across all product surfaces. The WhatsApp Business API has two modes:
On-premise deployment: The business runs the API endpoint on their own servers. Messages are end-to-end encrypted between the consumer and the business's infrastructure. Meta does not see plaintext.
Cloud API (hosted by Meta): The business uses Meta's hosted API. In this mode, messages are not considered end-to-end encrypted in the same sense, because Meta's infrastructure processes the message on behalf of the business.
This matters for architecture discussions because it shows that "WhatsApp" is really a family of communication paths with different trust boundaries. The core consumer messenger is E2E encrypted, but business integrations can break that property depending on deployment mode.
Operations and Monitoring
Manual Deploys, Hot Loading
WhatsApp's deployment culture was deliberately conservative. Deploys were manual, not automated. Changes were small. Hot code loading was used to deploy new Erlang modules without dropping connections.
Why manual? Because it forced engineers to think carefully about each change. Automated deploy pipelines are great for velocity, but they also make it easy to push changes without understanding their impact. WhatsApp preferred friction by design.
Hot loading is a rare capability. Most runtimes cannot replace running code without restarting the process. Erlang can. That meant WhatsApp could deploy a fix and have it live in seconds, without dropping 147 million concurrent connections.
The trade-off: hot loading is risky if the new code has different state expectations than the old code. But WhatsApp's codebase was small and tightly controlled, so the team could reason about compatibility.
Monitoring: mon.sh and Repeat Alarms
WhatsApp's monitoring was famously simple. Every server ran a single script called mon.sh that checked key metrics and triggered alerts. Alerts were broadcast to the entire team via WhatsApp itself. And they repeated until someone fixed the issue.
This is the opposite of modern observability stacks with dashboards, log aggregation, and on-call rotations. WhatsApp's approach was: everyone sees the alert, everyone is responsible, fix it fast.
The advantage: no alert fatigue from buried notifications. No complex escalation policies. If something broke, it was loud and immediate.
The disadvantage: this does not scale beyond a certain team size. But for a 50-person engineering team, it worked.
Key metrics monitored:
Message queue backlog per node (alert threshold: ~500k)
CPU and memory per server
Network partition detection
Mnesia sync status
Offline queue depth
The philosophy was: monitor simple things that directly correlate with user impact, not vanity metrics.
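The "simple and loud" approach can be sketched in a few lines. The contents of the real mon.sh are not given in the source, so this is a Python stand-in for the idea only: compare one metric against a threshold and re-send the alert on every check until it clears. The function names, node names, and sample values are hypothetical.

```python
BACKLOG_THRESHOLD = 500_000  # the "~500k" figure from the list above


def check(metrics: dict, threshold: int = BACKLOG_THRESHOLD) -> list:
    """Return one alert line per node whose queue backlog exceeds the threshold."""
    return [
        f"ALERT {node}: queue backlog {backlog:,}"
        for node, backlog in sorted(metrics.items())
        if backlog > threshold
    ]


def monitor(samples, notify) -> None:
    """Run one check per sample. Alerts repeat every round until the backlog clears."""
    for metrics in samples:
        for alert in check(metrics):
            notify(alert)  # no deduplication and no escalation policy, by design


samples = [
    {"chat-01": 620_000, "chat-02": 1_200},  # chat-01 over threshold
    {"chat-01": 540_000, "chat-02": 900},    # still over: the alert repeats
    {"chat-01": 3_000, "chat-02": 1_100},    # fixed: silence
]
sent = []
monitor(samples, sent.append)
print(sent)
```

Here `sent.append` stands in for broadcasting to the team. The same alert fires twice and then stops once the backlog drops, with no acknowledgement step in between.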
Bounded Failure Domains
WhatsApp's architecture isolated failures by design. A single process crash did not cascade. A single server failure did not take down an entire cluster. A network partition in one region did not destabilize others.
This was achieved through:
Small clusters instead of one monolithic cluster
Supervision trees that restart crashed processes locally
Partitioned data so that a failure in one shard does not affect others
Meta-clustering to prevent failures from crossing cluster boundaries
The key principle: expect failures, fail visibly, recover fast. Do not build layers of compensating abstraction that hide failures-they just make debugging harder.
Fix Fast Over Redundancy
WhatsApp's reliability culture emphasized Fix Fast over deep redundancy. Instead of building three layers of failover and retry logic, they built systems that failed in predictable ways and could be fixed quickly.
This was possible because:
The team was small and tightly coordinated
The system was simple enough to reason about
Monitoring was immediate and visible
Deploys were fast (hot loading)
This is not a universal strategy. Larger teams, more complex systems, and higher regulatory scrutiny often require deeper redundancy. But for WhatsApp, it was the right trade-off.
Lessons for System Design
The Core Pattern
WhatsApp's architecture can be summarized as: simple product, narrow state, tuned runtime, graceful failure.
Simple product: Few features, clear semantics, no algorithmic complexity
Narrow state: Minimize what you store, optimize for transient data, delete after delivery
Tuned runtime: Patch the VM, tune the OS, eliminate low-level bottlenecks
Graceful failure: Fail in predetermined ways, isolate blast radius, recover visibly
Most systems add features first and optimize later. WhatsApp did the opposite: it optimized the core, then resisted adding features that would break the model.
When to Choose This
This architecture makes sense when:
High concurrency, low per-user state: millions of connections, minimal data per connection
Latency-sensitive: real-time messaging and calling where milliseconds matter
Small, expert team: you have engineers who can patch runtimes and tune kernels
Deep control over the stack: bare metal or at least full VM control, not serverless
It does not make sense when:
Rich queries or analytics: if you need to query message history, this model breaks
Complex transactions: if your data has intricate relationships and consistency requirements
Polyglot teams: Erlang is niche, hiring and onboarding are hard
Rapid feature iteration: if you need to ship new services constantly, runtime surgery is too slow
Summary
Think of WhatsApp's architecture in layers:
Messaging layer:
Persistent connections map to Erlang processes
Servers route encrypted blobs, do not store them long-term
Offline queues are transient, optimized for the hot path
Delivery is ack-driven, messages are deleted after confirmation
Calling layer:
Signaling handles control logic, relays handle media forwarding
Relays are geographically distributed for proximity
State is split into critical (persisted) and ephemeral (reconstructed)
Overload leads to graceful throttling, not collapse
Privacy layer:
End-to-end encryption means the server is a transport, not a processor
Multi-device requires client-side fan-out and per-device encryption
The backend handles routing, metadata, and delivery, but never sees plaintext
Operational layer:
Failures are expected and isolated
Monitoring is simple and loud
Deploys are manual and conservative
Fix fast instead of over-engineering redundancy
This is not the only way to build a messaging system, but it is the way that let a team of around 50 engineers run a service on its way to 2 billion users.
References
Rick Reed, "That's 'Billion' with a B: Scaling to the Next Level at WhatsApp", Erlang Factory SF, 2014. http://www.erlang-factory.com/static/upload/media/1394350183453526efsf2014whatsappscaling.pdf
Jamshid Mahdavi, "An Erlang-Based Philosophy for Service Reliability", Erlang Factory, 2016. http://www.erlang-factory.com/static/upload/media/1457739243350841jamshidmahdavianerlangbasedphilosophyforservicereliability.pdf
WhatsApp, Security Features, Safety Tools & Tips. https://www.whatsapp.com/security/
WhatsApp / Meta, WhatsApp Encryption Overview (technical white paper). https://5.imimg.com/data5/SELLER/Doc/2024/12/471160889/BJ/LO/AQ/34065080/tally-prime-with-whatsapp.pdf