The Uptime Engineer
👋 Hi, I am Yoshik Karnawat
You’ll read why “pick any two” is misleading, how real systems lean CP or AP in different features, and how to talk about CAP in interviews without sounding like a textbook. By the end, you’ll think in terms of “wrong data vs no data” instead of buzzwords and design systems that degrade on your terms, not the network’s.
CAP in the Real World
CP systems favor correctness over uptime. Common in payments, inventory, and security-sensitive features.
AP systems favor uptime over instant correctness. Common in feeds, counters, notifications, and analytics.
Caches and read replicas are everyday AP trade-offs: you accept stale data in exchange for speed and availability.
Strong cross-region consistency usually increases write latency; eventual consistency keeps apps snappy at the cost of short-term mismatch.
If you can clearly answer “what’s worse here: wrong data or no data?”, you already understand CAP better than most engineers.
CAP theorem sounds like exam material.
In reality, it decides how your system fails when the network behaves like the real world: messy, slow, and unreliable.
You’ve probably seen the textbook version:
Consistency, Availability, Partition tolerance.
“Pick any two.”
That framing is… not helpful.
In real systems, partitions are guaranteed.
Links flap, regions drop, routes get weird.
You always need to tolerate partitions.
So the real question becomes:
When a partition happens,
do you want wrong data or no data?
That’s CAP in operator language.
CAP in human terms
Forget the math. Use these definitions:
Consistency (C)
After a successful write, everyone sees the same latest value.
Example: user changes password → every login checks the new password, no stale version anywhere.
Availability (A)
The system responds to requests instead of hanging or erroring out.
Example: even when some nodes are unreachable, you get some answer, not a spinner forever.
Partition tolerance (P)
The system keeps operating even when parts of the network can’t talk to each other.
Example: one region can’t reach another, but both are still receiving traffic.
In practice:
Partitions will happen.
During a partition, you can’t have both strong consistency (C) and full availability (A).
You must choose which pain you prefer.
That’s the entire game.
The real decision: wrong data or no data?
During a network split, you answer one question:
For this feature, is it worse to show wrong data or no data?
If wrong data is worse → you lean CP (Consistency + Partition tolerance).
If no data is worse → you lean AP (Availability + Partition tolerance).
Everything else is just an implementation detail.
CP systems: “I’d rather fail than be wrong”
CP behavior:
The system would rather block or reject operations than risk inconsistent data.
How it feels in production:
Writes get rejected if correctness can’t be guaranteed.
Reads get blocked instead of serving stale/conflicting values.
Where CP thinking is natural:
Money: balances, wallets, ledgers.
Inventory: airline seats, tickets, limited stock.
Security: permissions, revocations, auth rules.
Here, “wrong but available” is catastrophic:
Double-charging a user.
Selling 110 tickets for 100 seats.
Letting a revoked token still access data.
So you accept:
“If the network is broken, some actions fail or block.
But we never silently break correctness.”
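Here is a minimal sketch of that CP behavior, assuming a toy in-memory cluster. The names `Replica`, `cp_write`, and `QuorumUnavailable` are made up for illustration, and a real system would use a consensus protocol rather than this simple majority check.

```python
from dataclasses import dataclass, field


class QuorumUnavailable(Exception):
    """Raised when too few replicas are reachable to guarantee correctness."""


@dataclass
class Replica:
    name: str
    reachable: bool = True
    data: dict = field(default_factory=dict)


def cp_write(replicas, key, value):
    """Apply the write only if a strict majority of replicas is reachable."""
    quorum = len(replicas) // 2 + 1
    reachable = [r for r in replicas if r.reachable]
    if len(reachable) < quorum:
        # CP choice: reject the write instead of accepting one we cannot confirm.
        raise QuorumUnavailable(f"need {quorum} replicas, reached {len(reachable)}")
    for r in reachable:
        r.data[key] = value
    return len(reachable)
```

With three replicas, losing one still leaves a majority, so writes succeed. Losing two makes `cp_write` raise, and the old value stays untouched.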
AP systems: “I’d rather be slightly wrong than down”
AP behavior:
The system prefers to keep responding, even if some data is temporarily inconsistent.
How it feels in production:
Users see slightly stale views.
Different nodes return different values for a while.
Reconciliation/conflict resolution happens later.
Where AP thinking fits:
Social features: likes, views, comment counters.
Activity feeds, recommendations.
Analytics, metrics ingestion, logging.
Here, “down” is worse than “slightly wrong”:
Nobody cares if likes show 99 instead of 100 for a few seconds.
A slightly old feed is fine if the app stays snappy.
Analytics can be eventually correct, not perfect in real time.
So you accept:
“If the network is broken, we stay up.
Data might be eventually, not instantly, correct.”
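One way to picture "reconcile later" for a like counter is a grow-only counter (a G-Counter, one of the simplest CRDTs). This is a sketch under assumptions, not how any particular product does it. The class name and the region names are hypothetical.

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Taking the per-node maximum makes merges safe to repeat and reorder.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two regions keep accepting likes while they cannot reach each other.
us = GCounter("us-east")
eu = GCounter("eu-west")
for _ in range(60):
    us.increment()
for _ in range(40):
    eu.increment()

during_partition = (us.value(), eu.value())  # (60, 40): each side is "slightly wrong"

# The partition heals and the regions exchange state.
us.merge(eu)
eu.merge(us)
after_heal = (us.value(), eu.value())  # (100, 100): both sides converge
```

Both regions stayed up the whole time. The price was a few moments where each showed a different number.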
One product, many trade-offs
Beginners say: “My system is CP.” Or: “We chose AP.”
Reality: different parts of the same product lean CP or AP based on risk.
Examples:
Login / auth
Password changes, token revocation → CP-leaning.
Better to block than let old tokens live forever.
Order placement / payments
CP-leaning.
You care about no double-charges, no overselling, no ghost orders.
Notifications / badges / counters
AP-leaning.
“2 unread” instead of “3 unread” for a few seconds is fine.
Activity feed
AP-leaning.
Slightly stale is okay if the app stays fast.
The real skill:
Knowing where you can tolerate AP behavior, and where you must enforce CP-like behavior.
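You can write that per-feature decision down instead of keeping it in people’s heads. A minimal sketch, assuming hypothetical feature names taken from the examples above. Defaulting unknown features to CP is my assumption, not a rule.

```python
from enum import Enum


class Mode(Enum):
    CP = "reject_or_block"  # wrong data is worse than no data
    AP = "serve_stale"      # no data is worse than wrong data


# Hypothetical feature names, mirroring the examples in this section.
FEATURE_MODE = {
    "auth": Mode.CP,
    "payments": Mode.CP,
    "notifications": Mode.AP,
    "activity_feed": Mode.AP,
}


def on_partition(feature):
    """Return what a feature should do while replicas cannot reach each other."""
    # Assumption: unknown features default to CP, so they fail loudly
    # instead of quietly serving wrong data.
    mode = FEATURE_MODE.get(feature, Mode.CP)
    return mode.value
```

A table like this is also a good review artifact: every new feature has to pick a row.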
CAP + latency: the trade-off
Strong consistency across regions usually means:
Wait for writes to be acknowledged by multiple nodes/regions.
Higher latency, especially for write-heavy paths.
To keep things fast, systems often:
Accept local writes first.
Sync to other replicas later (eventual consistency).
From the user’s chair:
CP-ish: “The app is slower, but always correct.”
AP-ish: “The app is snappy, but sometimes a bit off before it catches up.”
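The latency cost is easy to estimate on paper. The sketch below assumes acknowledgements arrive in parallel, so a synchronous write waits for the k-th fastest replica. The function names are made up, and any numbers you feed in are yours to measure.

```python
def sync_write_latency_ms(local_ms, replica_rtts_ms, acks_required):
    """Latency when the write waits for `acks_required` remote acknowledgements."""
    if not 0 <= acks_required <= len(replica_rtts_ms):
        raise ValueError("acks_required must be between 0 and the number of replicas")
    if acks_required == 0:
        return local_ms
    # Acks come back in parallel, so we wait for the k-th fastest round trip.
    return local_ms + sorted(replica_rtts_ms)[acks_required - 1]


def async_write_latency_ms(local_ms):
    """Latency when the write is acknowledged locally and replicated later."""
    return local_ms
```

With illustrative numbers (5 ms local commit, remote round trips of 70 ms and 150 ms), waiting for one remote ack costs 75 ms and waiting for both costs 155 ms. The local-only write returns in 5 ms.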
Your product team may never say “we’re choosing CP or AP.”
But when they pick user experience, they’re also picking a CAP trade-off.
As a DevOps engineer or SRE, you’re the one who should name it.
Using CAP in interviews (without sounding like a textbook)
When they ask:
“How would you design payments / messaging / a feed?”
Don’t start with tools. Start with invariants and tolerances.
Use this structure:
1. Define the invariant
“We must never double-charge or lose confirmed orders.”
“We must never show someone else’s private data.”
2. Define acceptable inconsistency
“Notification counts can be off for a few seconds.”
“Feed can be slightly stale.”
3. Connect it to CAP
“For payments, I’d favor CP: if the network is split, I’d rather block new charges than risk ledger inconsistency.”
“For the feed, I’d favor AP: even if some regions lag, users should see something instead of a blank screen.”
4. Talk about degradation
“If the DB or region is partitioned, we disable risky writes but keep safe reads and non-critical features up.”
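If the interviewer asks what that degradation looks like in code, a small sketch is enough. The operation names and the `partitioned` flag are hypothetical. In a real service the flag would come from health checks or a feature-flag system.

```python
class FeatureDisabled(Exception):
    """Raised when an operation is switched off during a partition."""


# Hypothetical operation names; a real service would list its own.
RISKY_WRITES = {"charge_card", "place_order"}


def handle(operation, partitioned):
    """Run an operation, refusing risky writes while the network is split."""
    if partitioned and operation in RISKY_WRITES:
        # Fail loudly so the caller can show "try again later" instead of guessing.
        raise FeatureDisabled(f"{operation} is paused until the partition heals")
    return f"{operation}: ok"
```

Reads and non-critical features pass straight through. Only the operations that could break an invariant get paused.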
This makes you sound like someone who designs systems, not someone who just memorized words.
A simple CAP checklist
Next time you design or review a system, ask:
What’s worse here: wrong data or no data?
During a network split, what should this feature actually do?
Where can we tolerate eventual consistency, and where do we need strong guarantees?
What does the user see when things go bad?
If you can answer those, you’re already ahead of most engineers.
Helpful Resources
CAP Theorem in Practice: Trade-offs in Databases
https://www.packtpub.com/en-us/learning/how-to-tutorials/the-cap-theorem-in-practice-the-consistency-vs-availability-trade-off-in-distributed-databases
Perspectives on the CAP Theorem
CAP Strategies for Distributed Systems
https://www.splunk.com/en_us/blog/learn/cap-theorem.html
