📬 In Case You Missed This Week's Uptime Sync

Every week I curate the best DevOps, SRE & Cloud content so you don't have to.

This week's edition featured:

Why engineers still live in the terminal
Why PostgreSQL is enough for most applications
Zero-downtime migration from NGINX Ingress to Envoy Gateway
4 open-source tools worth trying this week

🧠 Interview Question of the Week

Why do Kubernetes Pods get recreated instead of restarted?

Think before reading the answer.

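A Pod is designed to be disposable. When a container crashes or fails its liveness probe, the kubelet restarts that container inside the same Pod, following the Pod's restartPolicy. The Pod itself is never repaired or moved: if it is deleted or evicted, or its node fails, a controller such as a Deployment's ReplicaSet creates a brand-new Pod with a new name and IP instead. This keeps workloads predictable and allows Kubernetes to maintain the desired state automatically.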
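
The "desired state" idea can be sketched as a toy control loop. This is a simplified illustration, not how Kubernetes controllers are actually implemented; `Pod`, `new_pod`, and `reconcile` are hypothetical names made up for this example, and the sketch only handles topping up, not scaling down.

```python
from dataclasses import dataclass
from itertools import count

_pod_ids = count(1)  # hypothetical source of unique Pod names


@dataclass
class Pod:
    name: str
    gone: bool = False  # True once the Pod is lost for good


def new_pod() -> Pod:
    """Create a fresh Pod with a new identity; nothing is carried over."""
    return Pod(name=f"web-{next(_pod_ids)}")


def reconcile(desired_replicas: int, pods: list[Pod]) -> list[Pod]:
    """One pass of the loop: forget lost Pods, then top up with new ones."""
    surviving = [p for p in pods if not p.gone]
    missing = desired_replicas - len(surviving)
    return surviving + [new_pod() for _ in range(max(missing, 0))]


pods = reconcile(3, [])      # nothing running yet, so three Pods are created
pods[1].gone = True          # one Pod is lost
pods = reconcile(3, pods)    # the loop adds a replacement; it never repairs
```

The loop never tries to fix the lost Pod. It only compares what exists with what should exist and creates the difference, which is why the replacement shows up under a new name.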

1. Fix the Problem First. Find the Root Cause Later.

When production is down, users don't care why it happened - they care that it's working again. Roll back the deployment, disable the feature, or reroute traffic first. Once things are stable, investigate what actually went wrong.
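
"Disable the feature" is fastest when the feature sits behind a flag you can flip without a deploy. Here is a minimal sketch of that kill-switch pattern; the `FLAGS` dictionary, the `new_checkout` flag, and the `checkout` function are all hypothetical, and a real setup would read flags from a config service rather than a module-level dict.

```python
# Hypothetical in-memory flag store, used only for illustration.
FLAGS = {"new_checkout": True}


def is_enabled(flag: str) -> bool:
    """Unknown flags default to off, so a typo fails safe."""
    return FLAGS.get(flag, False)


def checkout(cart_total: float) -> str:
    """Route to the new code path only while its flag is on."""
    if is_enabled("new_checkout"):
        return f"new flow: charged {cart_total:.2f}"
    return f"old flow: charged {cart_total:.2f}"
```

During an incident, setting `FLAGS["new_checkout"] = False` sends every request back through the old path right away. The root-cause investigation can wait until users are unblocked.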

2. Read Before You Type.

The fastest way to make an outage worse is changing something you don't understand. Spend a few minutes checking logs, metrics, and recent deployments before running commands. A little patience usually saves hours of recovery.

3. Small Changes Beat Big Fixes.

Whenever possible, change one thing at a time. Small deployments are easier to test, easier to roll back, and much easier to debug when something breaks.

4. Don't Trust Your First Assumption.

The first thing you suspect is often not the real cause. Ask yourself, "What evidence do I have?" and verify your assumptions before acting.

5. If You're Guessing, You're Missing Data.

Guessing is a sign that your monitoring isn't telling you enough. Before trying random fixes, collect more information. Good decisions come from good visibility.

6. Automate Anything You Repeat.

If you've done the same task three or four times, automate it. Every manual step is another chance for mistakes, especially during stressful incidents.

7. Write Things Down.

You'll forget today's solution in six months. Future teammates won't know it either. Good documentation saves far more time than it takes to write.

8. Learn One Tool Really Well.

Don't try to master ten monitoring tools at once. Become excellent with the one your team uses every day. Deep knowledge is far more valuable than surface-level familiarity with everything.

9. Every Incident Should Teach You Something.

After every production issue, ask yourself one question: "What will I do differently next time?" Small lessons compound into experience.

10. Build for Future You.

If something feels annoying today, it'll be unbearable six months from now. Leave systems cleaner than you found them. Your future self will thank you.

11. Not Every Alert Needs to Wake Someone Up.

If an alert doesn't require immediate action, it shouldn't page anyone. Too many noisy alerts train engineers to ignore the important ones.
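
One way to find the noisy ones is to look at past pages and check how often anyone actually did something. The sketch below assumes you can export a log of `(alert_name, action_taken)` pairs from your paging tool; that log format, the function name `noisy_alerts`, and both thresholds are assumptions for illustration, not a standard.

```python
from collections import Counter


def noisy_alerts(page_log, min_pages=5, max_action_rate=0.2):
    """Return alert names that paged often but rarely led to any action.

    page_log: iterable of (alert_name, action_taken) pairs.
    min_pages and max_action_rate are arbitrary example thresholds.
    """
    pages = Counter()
    acted = Counter()
    for name, action_taken in page_log:
        pages[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(
        name
        for name, total in pages.items()
        if total >= min_pages and acted[name] / total <= max_action_rate
    )


log = (
    [("disk_70pct", False)] * 6    # paged six times, nobody acted
    + [("api_5xx", True)] * 5      # paged five times, acted every time
    + [("cert_expiry", False)] * 2 # too few pages to judge
)
candidates = noisy_alerts(log)     # ["disk_70pct"]
```

Anything this returns is a candidate for demotion from a page to a ticket or a dashboard, not an automatic decision.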

12. Rollbacks Aren't Failures.

Rolling back isn't admitting defeat - it's protecting users. You can always investigate later, but you can't recover lost customer trust.

13. Communicate More Than Feels Necessary.

Silence makes people assume the worst. During incidents, keep sharing updates, even if the update is simply "We're still investigating."

14. Tell People Before They Ask.

If your work might affect someone else, let them know early. Proactive communication builds trust and prevents surprises.

15. Ask Questions Early.

Nobody expects juniors to know everything. Asking a five-minute question is much better than spending two hours fixing a mistake that could have been avoided.

16. Ownership Doesn't End After the Fix.

Fixing production is only half the job. Follow through with documentation, cleanup, and preventing the same issue from happening again.

17. Your Reputation Is Built During Incidents.

People won't remember every feature you shipped. They'll remember whether you stayed calm, communicated clearly, and helped when things were breaking.

18. Learn Systems, Not Just Tools.

Tools come and go. Understanding networking, Linux, debugging, and distributed systems will stay valuable throughout your career.

19. Communication Is a Technical Skill.

The best engineers explain problems clearly, document decisions, and keep everyone informed. Strong communication multiplies your technical ability.

20. Simple Systems Are Easier to Operate.

Every extra dependency adds another way for things to fail. Choose the simplest solution that solves the problem well.
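
A quick back-of-the-envelope calculation shows why. It assumes every dependency is required for a request to succeed and that dependencies fail independently, which real systems only approximate; the function names and the 99.9% figure are illustrative.

```python
import math

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def serial_availability(availabilities):
    """Availability of a request path that needs every dependency up."""
    return math.prod(availabilities)


def downtime_minutes(availability):
    """Expected unavailable minutes over a 30-day month."""
    return (1 - availability) * MINUTES_PER_30_DAYS


one = serial_availability([0.999])       # a single 99.9% dependency
five = serial_availability([0.999] * 5)  # five of them in a chain
```

Under these assumptions, one 99.9% dependency costs about 43 minutes of downtime per month, while five chained together cost about 216 minutes, roughly 3.6 hours. Each dependency looks reliable on its own; the chain is what hurts.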

21. You Don't Need to Know Everything.

Great engineers aren't the ones with every answer. They're the ones who know how to investigate, ask good questions, and keep learning.

Join 1,000+ engineers becoming better DevOps & SRE professionals.

Every week, I share:

How I'd approach problems differently (real projects, real mistakes)
Career moves that actually work (not LinkedIn motivational posts)
Technical deep-dives that change how you think about infrastructure

No fluff. No roadmaps. Just what works when you're building real systems.

Thank you for supporting this newsletter.
Y’all are the best.

21 Lessons I Learned in My First 3 Years as an SRE