WhatsApp is just another CRUD app with a chat interface. Send a message, store it in a database, retrieve it when someone opens the app. Maybe throw in some Redis for caching. Wrong.

WhatsApp is a routing system with transient queues, not a message archive. The server's job is to shuttle encrypted packets between persistent connections, hold messages briefly if someone is offline, then delete them after delivery. In 2014 this architecture served hundreds of millions of users on a few hundred servers, and it later scaled to 2 billion users, because the team made brutal trade-offs: vertical density over horizontal sprawl, runtime surgery over abstraction layers, graceful failure over compensating complexity. This is the opposite of how most modern apps are built.

Table of Contents

  1. Architecture Philosophy

    • Simple Product, Hard Backend

    • Never Lose a Message

    • Vertical Density Over Horizontal Sprawl

  2. Messaging Core

    • One Connection, One Process

    • Why Erlang Mattered

    • Routing, Not Archiving

    • 342K Messages/Sec In, 712K Out

  3. Data and Storage

    • Mnesia for Metadata

    • Write-Back Caching, 98% Hit Rate

    • Custom Runtime Patches

    • Meta-Clustering and Wandist

  4. Calling Architecture

    • Signaling vs. Relay

    • End-to-End Encrypted Relays

    • Thousands of PoPs, Hundreds of Containers

    • Critical vs. Ephemeral State

    • Graceful Degradation Under Load

  5. Privacy as Constraint

    • Signal Protocol, Routing Without Reading

    • Multi-Device Fan-Out

    • Business API Caveat

  6. Operations and Monitoring

    • Manual Deploys, Hot Loading

    • Monitoring: mon.sh and Repeat Alarms

    • Bounded Failure Domains

    • Fix Fast Over Redundancy

  7. Lessons for System Design

    • The Core Pattern

    • When to Choose This

    • Summary

Architecture Philosophy

Simple Product, Hard Backend

WhatsApp deliberately kept the user-facing product minimal. No stories for years. No algorithmic feed. Just messaging and calls. That simplicity was not laziness; it was a forcing function. Every bit of engineering effort went into making the backend brutally efficient instead of building features that would complicate the system.

The trade-off was intentional. A simple product means narrow state per user, which means you can pack more users onto fewer machines. More features mean more joins, more indices, more cache invalidation strategies, more operational runbooks. WhatsApp avoided that entire category of complexity by saying no at the product level.

The result: in 2014, 465 million users, 19 billion messages/day inbound, 40 billion outbound, on roughly 550 servers. About 150 of those were chat servers handling 147 million concurrent connections. That is close to 1 million connections per chat server. You do not get there by accident.

Never Lose a Message

This was the core design principle. Messages must be delivered, even if the recipient is offline for days. Even if servers fail. Even if the network is unstable. That constraint shaped every storage, queueing, and retry decision.

The architecture prioritized availability over features. If a subsystem went down, the rest of the system kept working. If a cluster was overloaded, it throttled gracefully rather than collapsing. Failure was expected, but losing messages was not acceptable.

This is why WhatsApp used store-and-forward semantics. The server holds the message until the recipient confirms receipt, then deletes it. That guarantee is expensive (it requires persistent queues, retry logic, ack tracking, and careful failure handling), but it is non-negotiable.

Vertical Density Over Horizontal Sprawl

Most modern architectures scale by adding more small nodes. WhatsApp did the opposite: fewer, larger machines with millions of processes per host. They ran FreeBSD on bare metal with dual CPUs, 128-768 GB of RAM, and tuned every layer from the allocator to the NIC driver.

This was a deliberate trade-off. Operational complexity scales with node count, not core count. A hundred servers are easier to monitor, deploy, and debug than ten thousand containers. The team was small (around 50 engineers at acquisition), and they needed an architecture they could actually operate.

The cost is that you have to tune the runtime yourself. WhatsApp patched Erlang's BEAM VM, rewrote parts of OTP, and modified kernel subsystems. That is not feasible for most teams, but for WhatsApp it was cheaper than managing a massive fleet.

Messaging Core

One Connection, One Process

WhatsApp clients maintain a persistent TCP connection to the backend. That connection stays open as long as the app is running. No polling. No repeated handshakes. When a message arrives, it is pushed immediately over that connection.

On the server side, each connection is handled by one Erlang process. That process owns the connection state: session info, encryption keys, message queue. Erlang processes are lightweight: millions can run on a single host, and each has its own memory and garbage collection context.

This model keeps GC pauses from cascading. If one user sends a huge photo and their process does heavy GC, other users are unaffected. Compare that to a runtime with a single shared heap, where a long stop-the-world pause can stall every connection on the host at once.

Erlang also gives you fault isolation by default. If a process crashes, OTP supervision restarts it without destabilizing the rest of the system. That is critical when you have millions of connections per host: random crashes should not propagate.
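To make the restart-on-crash idea concrete, here is a minimal sketch in Python with asyncio. It is an analogy, not how OTP works internally: `supervise`, `flaky_connection`, and `healthy_connection` are hypothetical names, and Python tasks share one heap and one garbage collector, so this shows only the supervision behavior, not per-process GC.

```python
import asyncio


async def supervise(make_worker, max_restarts=3):
    """Run a worker coroutine; restart it if it crashes, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return await make_worker()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up and let the failure surface to the caller


async def main():
    attempts = {"flaky": 0}

    async def flaky_connection():
        # Hypothetical handler that crashes twice before succeeding.
        attempts["flaky"] += 1
        if attempts["flaky"] < 3:
            raise ConnectionError("simulated crash")
        return "flaky ok"

    async def healthy_connection():
        await asyncio.sleep(0)
        return "healthy ok"

    # Each connection gets its own supervised task; one crashing does not stop the other.
    results = await asyncio.gather(
        supervise(flaky_connection),
        supervise(healthy_connection),
    )
    return results, attempts["flaky"]


results, flaky_attempts = asyncio.run(main())
```

The flaky handler is restarted in place while the healthy one finishes untouched, which is the property that matters at a million connections per host.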

Why Erlang Mattered

Erlang was originally designed for telecom switches, which have similar requirements: massive concurrency, long-lived sessions, fault tolerance, minimal downtime. The BEAM VM was built for exactly this traffic pattern.

Key properties that mattered for WhatsApp:

  • Lightweight processes: spawn millions without worrying about thread overhead

  • Per-process GC: no global stop-the-world pauses

  • Message-passing concurrency: no shared-memory bugs and no lock contention in application code

  • Supervision trees: automatic restart on crash, with bounded failure domains

  • Hot code loading: deploy new code without dropping connections

WhatsApp started with ejabberd, an Erlang-based XMPP server, and heavily modified it. Eventually they replaced so much of it that it was effectively a custom system, but the Erlang foundation stayed.

This was not a trendy choice. Erlang is niche, the tooling is sparse, and the community is small. But it was the right model for the problem: millions of mostly idle connections with occasional bursts of activity.

Routing, Not Archiving

This is the most misunderstood part of WhatsApp's architecture. The server does not store your chat history. It is not a database of messages. It is a router with transient queues.

Message flow works like this:

  1. Sender's client encrypts the message and sends it over the persistent connection

  2. Server looks up the recipient's connection

  3. If the recipient is online, route the message immediately

  4. If offline, write it to a temporary queue

  5. When the recipient reconnects, drain the queue in order

  6. After the client sends an ack, delete the message from the server
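The six steps above can be sketched as a toy in-memory router. This is an illustration under simplifying assumptions, not WhatsApp's implementation: the `Router` class, its method names, and the use of a plain list as a "connection" are all hypothetical, and persistence and retries are left out.

```python
from collections import defaultdict, deque


class Router:
    """Toy store-and-forward router: route if online, queue if offline, delete on ack."""

    def __init__(self):
        self.online = {}                    # user_id -> list standing in for a live connection
        self.offline = defaultdict(deque)   # user_id -> queued (msg_id, blob), oldest first
        self.unacked = {}                   # msg_id -> (user_id, blob), held until acked

    def send(self, recipient, msg_id, blob):
        self.unacked[msg_id] = (recipient, blob)
        if recipient in self.online:
            self.online[recipient].append((msg_id, blob))   # push immediately
        else:
            self.offline[recipient].append((msg_id, blob))  # hold in the transient queue

    def connect(self, user_id):
        conn = self.online[user_id] = []
        queue = self.offline.pop(user_id, deque())
        while queue:
            conn.append(queue.popleft())    # drain in arrival order
        return conn

    def ack(self, msg_id):
        self.unacked.pop(msg_id, None)      # the server forgets the message after the ack


router = Router()
router.send("bob", 1, b"<ciphertext-1>")    # bob is offline, so both messages are queued
router.send("bob", 2, b"<ciphertext-2>")
inbox = router.connect("bob")               # reconnect drains the queue in order
for msg_id, _blob in inbox:
    router.ack(msg_id)                      # acks clear the server-side copies
```

Note that the router never inspects `blob`. It only needs the recipient and a message id, which is why the payload can stay encrypted end to end.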

The server's retention window is short. Most messages are delivered within seconds. The offline queue exists to bridge disconnections, whether a network hiccup or a few days offline, not to act as long-term storage.

This design made WhatsApp's backend much simpler. No complex indexing. No queries across message history. No schema migrations when the message format changed. The server just routes encrypted blobs and forgets them.

342K Messages/Sec In, 712K Out

By 2014, WhatsApp was handling 342,000 inbound messages/second at peak and 712,000 outbound. That asymmetry is expected: group chats cause fan-out. One incoming message might generate dozens of outbound deliveries.

The system also handled 230,000 logins per second at peak. That is important because login involves session setup, key exchange, offline queue replay, and connection state initialization. It is heavier than a simple message send.

The key architectural point is that these numbers were achieved on around 150 chat servers. That is roughly 2,300 messages/sec inbound per server and 4,700 outbound. No single server is a bottleneck because each user's connection is independent. Erlang's process model makes that parallelism natural.
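The per-server figures are plain division. A quick check using only the numbers quoted above (the variable names are illustrative):

```python
# Peak figures quoted in the text (2014).
inbound_per_sec = 342_000
outbound_per_sec = 712_000
chat_servers = 150  # approximate count from the text

inbound_per_server = inbound_per_sec / chat_servers
outbound_per_server = outbound_per_sec / chat_servers
fan_out_ratio = outbound_per_sec / inbound_per_sec

print(f"inbound per server:  {inbound_per_server:.0f} msg/s")
print(f"outbound per server: {outbound_per_server:.0f} msg/s")
print(f"average fan-out:     {fan_out_ratio:.2f} deliveries per inbound message")
```

Averaged over all traffic, the ratio works out to roughly 2.1 outbound deliveries per inbound message.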

Data and Storage

Mnesia for Metadata

WhatsApp used Mnesia, Erlang's built-in distributed database, for metadata: user accounts, device mappings, session state, group membership. Mnesia is not a general-purpose SQL database. It is a key-value store with optional transactions, designed to live inside an Erlang cluster.

The architecture around Mnesia was heavily customized:

  • Data was partitioned across 2-32 shards to avoid hotspots

  • Most operations used async_dirty instead of transactions to avoid coupling nodes

  • Records were hashed so that a given key always hit the same process path

Transactional coupling is a killer at scale. If every write requires a two-phase commit across nodes, latency explodes and failure domains grow. WhatsApp avoided that by using dirty operations and designing schemas so that most writes were isolated to a single partition.
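A minimal sketch of hash partitioning, assuming a fixed partition count and a generic hash. Mnesia's own fragmentation works differently in detail; `partition_for`, `PartitionedStore`, and the count of 16 are hypothetical choices made for illustration.

```python
import hashlib

NUM_PARTITIONS = 16  # hypothetical; the text gives a range of 2-32


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a stable partition index using a hash of its bytes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedStore:
    """Each partition is an independent dict; a write touches exactly one of them."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS):
        self.partitions = [dict() for _ in range(num_partitions)]

    def write(self, key, value):
        # No cross-partition coordination: the key alone decides where the write lands.
        self.partitions[partition_for(key, len(self.partitions))][key] = value

    def read(self, key):
        return self.partitions[partition_for(key, len(self.partitions))].get(key)


store = PartitionedStore()
store.write("user:123", {"devices": ["phone"]})
```

Because the same key always maps to the same partition, a write never needs agreement from any other partition, which is the property that makes dirty, non-transactional operations tolerable.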

Write-Back Caching, 98% Hit Rate

WhatsApp's offline message queue used a write-back cache with a reported 98% hit rate. In other words, about 98% of reads from the offline queue were served from memory rather than from disk.

Why this worked: most messages are read quickly. If someone is offline, they usually reconnect within minutes or hours, not days. The cache absorbed the write load and batched disk flushes, which is much faster than syncing every message individually.

The cache was backed by persistent storage, but the hot path stayed in memory. This is a classic trade-off: optimize for the common case (immediate delivery) and handle the uncommon case (multi-day offline) with slower disk-backed recovery.

This is also why WhatsApp's storage layer was not a bottleneck. The working set fit in RAM, and disk I/O was batched and asynchronous.
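Here is a rough sketch of the write-back idea: writes land in memory, dirty entries are flushed in batches, and a message delivered before the next flush never touches disk. The `WriteBackQueue` class and its flush threshold are assumptions for illustration; a real cache would also flush on a timer and evict cold entries, which this toy omits.

```python
class WriteBackQueue:
    """Toy write-back cache: writes land in memory; dirty entries are flushed in batches."""

    def __init__(self, flush_threshold=3):
        self.memory = {}      # msg_id -> blob (hot path)
        self.dirty = set()    # msg_ids not yet written to the backing store
        self.disk = {}        # dict standing in for persistent storage
        self.flush_threshold = flush_threshold

    def put(self, msg_id, blob):
        self.memory[msg_id] = blob
        self.dirty.add(msg_id)
        if len(self.dirty) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One batched write instead of one sync per message.
        for msg_id in self.dirty:
            self.disk[msg_id] = self.memory[msg_id]
        self.dirty.clear()

    def deliver(self, msg_id):
        """Hand the blob to the recipient and drop every copy the server holds."""
        blob = self.memory.pop(msg_id)
        self.dirty.discard(msg_id)    # delivered before flush: never written to disk
        self.disk.pop(msg_id, None)   # delivered after flush: remove the persisted copy
        return blob


q = WriteBackQueue(flush_threshold=3)
q.put(1, b"a")
first = q.deliver(1)          # served from memory, nothing was flushed
for i in (2, 3, 4):
    q.put(i, b"later")        # the third dirty entry triggers one batched flush
```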

Custom Runtime Patches

At millions of connections per host, bottlenecks stop being in application code. They move to the runtime, the scheduler, the allocator, the I/O subsystem.

WhatsApp patched the Erlang VM itself to eliminate these bottlenecks. Examples from the 2014 Erlang Factory talk:

  • Multiple timer wheels: reduced contention on the global timer lock

  • GC throttling: prevented large mailboxes from triggering expensive GC at the wrong time

  • Round-robin async file I/O: eliminated head-of-line blocking in the file worker pool

  • Increased distribution buffer sizes: allowed more in-flight data between nodes

  • Improved check_io allocation: reduced memory fragmentation in the I/O subsystem

  • Multiple mnesia_tm async_dirty senders: parallelized Mnesia's internal message handling

These are not high-level design choices. This is runtime surgery. Most teams never touch this layer, but WhatsApp had to because the standard BEAM VM was not designed for this traffic density.

The lesson here: at extreme scale, the runtime becomes part of the architecture. You cannot treat it as a black box.

Meta-Clustering and Wandist

Erlang clusters work well up to a certain size, but beyond that they become operationally fragile. Fully connected clusters generate O(n^2) monitoring traffic, and a network partition can split the cluster unpredictably.

WhatsApp solved this with meta-clustering and a custom transport layer called wandist. Instead of one giant cluster, they created a mesh of smaller clusters that communicated through a custom gen_tcp-based protocol.

The key idea: limit the size of any single cluster, but allow clusters to route messages to each other. A message could cross cluster boundaries with single-hop routing rather than full broadcast coupling.

This added complexity (now you have inter-cluster routing logic), but it kept failure domains bounded. A network issue in one cluster did not cascade to every node in the system.

This is a pattern worth understanding: when a distributed primitive does not scale, you do not always replace it. Sometimes you wrap it in a higher-level abstraction that controls the blast radius.
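The O(n^2) growth is easy to see with a little arithmetic. The sketch below counts pairwise links for one big full mesh versus several small meshes joined together. The node counts are hypothetical, and treating each pair of clusters as a single link is a simplification, not a description of how wandist is wired.

```python
def full_mesh_links(nodes: int) -> int:
    """Number of pairwise connections in a fully connected cluster."""
    return nodes * (nodes - 1) // 2


def meta_cluster_links(clusters: int, nodes_per_cluster: int) -> int:
    """Full mesh inside each cluster, plus one link between every pair of clusters.

    Assumes a single link per cluster pair, which is a simplification.
    """
    inside = clusters * full_mesh_links(nodes_per_cluster)
    between = full_mesh_links(clusters)
    return inside + between


# Hypothetical sizes, chosen only to show the shape of the growth.
one_big_cluster = full_mesh_links(500)                                       # 124750 links
ten_small_clusters = meta_cluster_links(clusters=10, nodes_per_cluster=50)  # 12295 links
```

Same 500 nodes, roughly a tenth of the links, and a partition inside one small mesh only disturbs that mesh.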

Calling Architecture

Signaling vs. Relay

WhatsApp calls involve two separate systems:

  • Signaling service: handles call setup, ringing, and connection negotiation

  • Relay service: forwards encrypted audio/video packets during the call

The separation is important. Signaling is control-plane logic: who is calling whom, what devices are involved, what network paths are available. It is relatively low traffic but state-heavy.

The relay is data-plane logic: shuttle encrypted media packets between endpoints. It is high traffic but conceptually simple: just forward packets. No transcoding, no inspection, no processing.

This architectural split is common in real-time systems. Control and data have different scaling properties, so they get different implementations.

End-to-End Encrypted Relays

Because WhatsApp calls are end-to-end encrypted using the Signal Protocol, the relay server never sees plaintext media. It just forwards encrypted packets. That changes what the relay can and cannot do.

It cannot:

  • Transcode audio/video to save bandwidth

  • Inspect the stream to detect quality issues

  • Apply server-side echo cancellation or noise reduction

  • Selectively drop certain packets based on content

It can:

  • Route packets based on addresses and session metadata

  • Monitor bandwidth and latency at the transport level

  • Fail over to a different relay if the current one crashes

  • Apply rate limiting and throttling based on flow metadata

This constraint shaped the entire relay architecture. Since the relay cannot be smart about media, it has to be extremely efficient at dumb forwarding. That means optimizing for proximity, low latency, and horizontal scalability.

Thousands of PoPs, Hundreds of Containers

WhatsApp's relay infrastructure runs across thousands of Meta points-of-presence (PoPs), with hundreds of relay containers per PoP. That footprint is necessary because voice and video are latency-sensitive. If the relay is too far from the users, the call quality degrades.

The architecture uses:

  • Load balancers to distribute incoming connections across containers in a PoP

  • State servers to persist call topology and participant metadata

  • Separation between networking and decision-making layers so that packet forwarding is fast and cheap

The relay container itself is stateless in the sense that it does not persist call state locally. If a container crashes, the load balancer detects it, connections fail over to another container, and the new container restores state from the state server.

This is a cloud-native pattern, even though WhatsApp's messaging backend was not. Calls and messages have different requirements. Messaging optimized for connection density and fault isolation; calling optimized for geographic distribution and failover speed.

Critical vs. Ephemeral State

Not all call state is equal. WhatsApp splits it into two categories:

Critical state:

  • Group size and participant list

  • Device network addresses

  • Call topology (who is connected to whom)

  • Join/leave events

This state is written to the state server on every change. Without it, the call cannot be reconstructed after a failover.

Ephemeral state:

  • Bandwidth estimates

  • Active speaker detection

  • Real-time quality metrics

This state is not persisted. It can be recomputed or allowed to drift briefly after failover without breaking the call.

This distinction is architecturally important. If you persist everything, you serialize your write path and add latency. If you persist nothing, you cannot recover from failures. The right answer is selective durability: store only what you cannot reconstruct.
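Selective durability can be sketched in a few lines. In this toy, a plain dict stands in for the state server, and `Relay`, `CallState`, and the method names are hypothetical. Only the participant list is written through; bandwidth estimates stay local and are lost on failover.

```python
from dataclasses import dataclass, field


@dataclass
class CallState:
    # Critical: needed to rebuild the call after a failover.
    participants: set = field(default_factory=set)
    # Ephemeral: recomputed on the fly, never persisted.
    bandwidth_estimates: dict = field(default_factory=dict)


class Relay:
    def __init__(self, state_server: dict):
        self.state_server = state_server   # dict standing in for the external state server
        self.calls = {}

    def join(self, call_id, device):
        state = self.calls.setdefault(call_id, CallState())
        state.participants.add(device)
        # Critical change: write through on every update.
        self.state_server[call_id] = sorted(state.participants)

    def observe_bandwidth(self, call_id, device, kbps):
        # Ephemeral change: local memory only.
        self.calls[call_id].bandwidth_estimates[device] = kbps

    def restore(self, call_id):
        """Rebuild a call on a fresh relay from the persisted critical state."""
        self.calls[call_id] = CallState(participants=set(self.state_server[call_id]))
        return self.calls[call_id]


state_server = {}
relay_a = Relay(state_server)
relay_a.join("call-1", "alice-phone")
relay_a.join("call-1", "bob-phone")
relay_a.observe_bandwidth("call-1", "alice-phone", 800)

relay_b = Relay(state_server)            # relay_a "crashes"; a fresh relay takes over
restored = relay_b.restore("call-1")
```

After the failover the call still knows who is in it, and the bandwidth estimate simply starts over.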

Graceful Degradation Under Load

When the relay system is overloaded, it does not fail randomly. It fails in a predetermined order:

  1. Ongoing calls are prioritized over new calls. If capacity is tight, reject new call attempts but keep existing calls running.

  2. 1:1 calls are prioritized over group calls. A two-person call uses less relay capacity than a 32-person group call.

  3. Throttling is applied gradually. Slow down new connections rather than dropping them instantly.

This is graceful degradation by design. The system knows which users to shed load from, and it does so in a way that minimizes disruption.
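A predetermined shedding order can be as simple as one threshold per request class. The sketch below is an assumption-heavy toy: the `admit` function and the threshold values are made up, and only the ordering (new group calls shed first, ongoing calls shed last) reflects the text. Gradual throttling is left out.

```python
def admit(request_kind: str, load: float) -> bool:
    """Decide whether to accept work at a given load (0.0 = idle, 1.0 = full).

    Thresholds are hypothetical; only their ordering reflects the text.
    """
    limits = {
        "ongoing_call": 1.00,    # shed last: keep existing calls alive
        "new_1to1_call": 0.90,   # shed before ongoing calls
        "new_group_call": 0.80,  # shed first: most relay capacity per call
    }
    return load <= limits[request_kind]


# At 85% load, new group calls are refused while everything else proceeds.
decisions_at_85 = {kind: admit(kind, 0.85)
                   for kind in ("ongoing_call", "new_1to1_call", "new_group_call")}
```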

Additionally, because calls are latency-sensitive, spare capacity in a distant region cannot always help. If users in India are overloading the relay PoPs in Mumbai, spinning up more capacity in Virginia does not solve the problem. The architecture has to handle regional overload locally, potentially by relocating non-latency-sensitive traffic elsewhere to free up local resources.

Privacy as Constraint

Signal Protocol, Routing Without Reading

WhatsApp uses the Signal Protocol for end-to-end encryption. That means messages and calls are encrypted on the sender's device and only decrypted on the recipient's device. The server never sees plaintext.

This is not just a feature; it is an architectural constraint. The backend cannot:

  • Search message content

  • Moderate messages automatically

  • Apply server-side spam filters based on content

  • Store messages in a readable format for legal requests

What the backend still does:

  • Route encrypted blobs to the correct recipient

  • Map accounts to devices and sessions

  • Store encrypted messages temporarily if the recipient is offline

  • Handle delivery acknowledgments and retry logic

  • Sync metadata like group membership and device pairing

The server is a transport layer with queuing, not a content processor. That is a fundamental shift from most messaging apps, which do server-side indexing, search, and moderation.

Multi-Device Fan-Out

Modern WhatsApp supports multiple devices per account: a phone, a desktop client, a browser tab. Each device has its own encryption session. That means the sender has to encrypt the message multiple times, once per recipient device.

Architectural consequences:

  • The sender does more work (multiple encryptions)

  • The server routes multiple copies of the same message

  • Key management complexity increases (each device has its own keys)

  • Metadata overhead grows (account-to-device mapping, session state per device)

But the privacy guarantee holds: the server still does not see plaintext. It just has to route more encrypted blobs.

This is a client-side fan-out model. The sender's client encrypts and sends per-device copies. The server does not fan out plaintext to multiple devices; it fans out encrypted blobs. That distinction matters because it keeps the server from being a decryption point.
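The shape of client-side fan-out, as a rough sketch. The `toy_encrypt` function is a placeholder XOR and is NOT real cryptography or anything like the Signal Protocol; it exists only so the example runs. The device names and keys are hypothetical.

```python
def toy_encrypt(session_key: bytes, plaintext: bytes) -> bytes:
    """Placeholder, NOT real cryptography: XOR with a repeating key, for illustration only."""
    return bytes(b ^ session_key[i % len(session_key)] for i, b in enumerate(plaintext))


def client_fan_out(plaintext: bytes, device_sessions: dict) -> dict:
    """Sender side: produce one ciphertext per recipient device."""
    return {device: toy_encrypt(key, plaintext) for device, key in device_sessions.items()}


def server_route(envelopes: dict, inboxes: dict) -> None:
    """Server side: forward each opaque blob to its device; no keys, no plaintext."""
    for device, blob in envelopes.items():
        inboxes.setdefault(device, []).append(blob)


# Hypothetical per-device session keys held only by the sender and each device.
sessions = {"bob-phone": b"k1", "bob-desktop": b"k2-longer"}
envelopes = client_fan_out(b"hi", sessions)   # encryption happens on the sender's client
inboxes = {}
server_route(envelopes, inboxes)              # the server only moves blobs around
```

The server function never receives `sessions`, so there is nothing it could decrypt even if it wanted to.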

Business API Caveat

WhatsApp's encryption model is not uniform across all product surfaces. The WhatsApp Business API has two modes:

  1. On-premise deployment: The business runs the API endpoint on their own servers. Messages are end-to-end encrypted between the consumer and the business's infrastructure. Meta does not see plaintext.

  2. Cloud API (hosted by Meta): The business uses Meta's hosted API. In this mode, messages are not considered end-to-end encrypted in the same sense, because Meta's infrastructure processes the message on behalf of the business.

This matters for architecture discussions because it shows that "WhatsApp" is really a family of communication paths with different trust boundaries. The core consumer messenger is E2E encrypted, but business integrations can break that property depending on deployment mode.

Operations and Monitoring

Manual Deploys, Hot Loading

WhatsApp's deployment culture was deliberately conservative. Deploys were manual, not automated. Changes were small. Hot code loading was used to deploy new Erlang modules without dropping connections.

Why manual? Because it forced engineers to think carefully about each change. Automated deploy pipelines are great for velocity, but they also make it easy to push changes without understanding their impact. WhatsApp preferred friction by design.

Hot loading is a rare capability. Most runtimes cannot replace running code without restarting the process. Erlang can. That meant WhatsApp could deploy a fix and have it live in seconds, without dropping 147 million concurrent connections.

The trade-off: hot loading is risky if the new code has different state expectations than the old code. But WhatsApp's codebase was small and tightly controlled, so the team could reason about compatibility.

Monitoring: mon.sh and Repeat Alarms

WhatsApp's monitoring was famously simple. Every server ran a single script called mon.sh that checked key metrics and triggered alerts. Alerts were broadcast to the entire team via WhatsApp itself. And they repeated until someone fixed the issue.

This is the opposite of modern observability stacks with dashboards, log aggregation, and on-call rotations. WhatsApp's approach was: everyone sees the alert, everyone is responsible, fix it fast.

The advantage: no alert fatigue from buried notifications. No complex escalation policies. If something broke, it was loud and immediate.

The disadvantage: this does not scale beyond a certain team size. But for a 50-person engineering team, it worked.

Key metrics monitored:

  • Message queue backlog per node (alert threshold: ~500k)

  • CPU and memory per server

  • Network partition detection

  • Mnesia sync status

  • Offline queue depth

The philosophy was: monitor simple things that directly correlate with user impact, not vanity metrics.
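The "repeat until fixed" behavior fits in a few lines. This is a hedged sketch, not a reconstruction of mon.sh: the `check` and `run_monitor` functions, the metric name `queue_backlog`, and the `notify` callback are all hypothetical. Only the 500k threshold comes from the text.

```python
BACKLOG_THRESHOLD = 500_000  # from the text: alert at roughly 500k queued messages


def check(metrics: dict) -> list:
    """Return a list of alert strings for the current metrics snapshot."""
    alerts = []
    if metrics.get("queue_backlog", 0) > BACKLOG_THRESHOLD:
        alerts.append(f"queue backlog {metrics['queue_backlog']} > {BACKLOG_THRESHOLD}")
    return alerts


def run_monitor(snapshots, notify):
    """Check every snapshot and re-send the alert on each pass until it clears."""
    for metrics in snapshots:
        for alert in check(metrics):
            notify(alert)   # no dedup on purpose: the alarm repeats while the problem lasts


sent = []
run_monitor(
    [{"queue_backlog": 600_000}, {"queue_backlog": 700_000}, {"queue_backlog": 10}],
    sent.append,
)
```

Two bad snapshots produce two alerts; once the backlog drains, the noise stops on its own.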

Bounded Failure Domains

WhatsApp's architecture isolated failures by design. A single process crash did not cascade. A single server failure did not take down an entire cluster. A network partition in one region did not destabilize others.

This was achieved through:

  • Small clusters instead of one monolithic cluster

  • Supervision trees that restart crashed processes locally

  • Partitioned data so that a failure in one shard does not affect others

  • Meta-clustering to prevent failures from crossing cluster boundaries

The key principle: expect failures, fail visibly, recover fast. Do not build layers of compensating abstraction that hide failures-they just make debugging harder.

Fix Fast Over Redundancy

WhatsApp's reliability culture emphasized Fix Fast over deep redundancy. Instead of building three layers of failover and retry logic, they built systems that failed in predictable ways and could be fixed quickly.

This was possible because:

  • The team was small and tightly coordinated

  • The system was simple enough to reason about

  • Monitoring was immediate and visible

  • Deploys were fast (hot loading)

This is not a universal strategy. Larger teams, more complex systems, and higher regulatory scrutiny often require deeper redundancy. But for WhatsApp, it was the right trade-off.

Lessons for System Design

The Core Pattern

WhatsApp's architecture can be summarized as: simple product, narrow state, tuned runtime, graceful failure.

  • Simple product: Few features, clear semantics, no algorithmic complexity

  • Narrow state: Minimize what you store, optimize for transient data, delete after delivery

  • Tuned runtime: Patch the VM, tune the OS, eliminate low-level bottlenecks

  • Graceful failure: Fail in predetermined ways, isolate blast radius, recover visibly

Most systems add features first and optimize later. WhatsApp did the opposite: it optimized the core, then resisted adding features that would break the model.

When to Choose This

This architecture makes sense when:

  • High concurrency, low per-user state: millions of connections, minimal data per connection

  • Latency-sensitive: real-time messaging and calling where milliseconds matter

  • Small, expert team: you have engineers who can patch runtimes and tune kernels

  • Deep control over the stack: bare metal or at least full VM control, not serverless

It does not make sense when:

  • Rich queries or analytics: if you need to query message history, this model breaks

  • Complex transactions: if your data has intricate relationships and consistency requirements

  • Polyglot teams: Erlang is niche, hiring and onboarding are hard

  • Rapid feature iteration: if you need to ship new services constantly, runtime surgery is too slow

Summary

Think of WhatsApp's architecture in layers:

Messaging layer:

  • Persistent connections map to Erlang processes

  • Servers route encrypted blobs, do not store them long-term

  • Offline queues are transient, optimized for the hot path

  • Delivery is ack-driven, messages are deleted after confirmation

Calling layer:

  • Signaling handles control logic, relays handle media forwarding

  • Relays are geographically distributed for proximity

  • State is split into critical (persisted) and ephemeral (reconstructed)

  • Overload leads to graceful throttling, not collapse

Privacy layer:

  • End-to-end encryption means the server is a transport, not a processor

  • Multi-device requires client-side fan-out and per-device encryption

  • The backend handles routing, metadata, and delivery, but never sees plaintext

Operational layer:

  • Failures are expected and isolated

  • Monitoring is simple and loud

  • Deploys are manual and conservative

  • Fix fast instead of over-engineering redundancy

This is not the only way to build a messaging system, but it is the way that let a team of roughly 50 engineers support hundreds of millions of users, on an architecture that later scaled to 2 billion.

References

  1. Rick Reed, "That's 'Billion' with a B: Scaling to the Next Level at WhatsApp", Erlang Factory SF, 2014. http://www.erlang-factory.com/static/upload/media/1394350183453526efsf2014whatsappscaling.pdf

  2. Jamshid Mahdavi, "An Erlang-Based Philosophy for Service Reliability", Erlang Factory, 2016. http://www.erlang-factory.com/static/upload/media/1457739243350841jamshidmahdavianerlangbasedphilosophyforservicereliability.pdf

  3. WhatsApp, Security Features, Safety Tools & Tips. https://www.whatsapp.com/security/

  4. WhatsApp / Meta, WhatsApp Encryption Overview (technical white paper). https://5.imimg.com/data5/SELLER/Doc/2024/12/471160889/BJ/LO/AQ/34065080/tally-prime-with-whatsapp.pdf

Thanks for supporting this newsletter. Y’all are the best!
Until next time!
