INCIDENT POST-MORTEM

When Security Fixes
Break the Internet 😬

On December 5, 2025, Cloudflare went down for the second time in three weeks. This time? An urgent security fix for a critical React vulnerability went wrong. LinkedIn, Zoom, and sites across Cloudflare's network, which serves roughly 20% of the internet, became unreachable, not because of an attack, but because the cure was worse than the disease. Let's unpack what happened! 🔥

📅 December 5, 2025 ⏱️ ~25 minutes impact 🌐 28% of Cloudflare traffic affected 📖 24 min read
01

Context: Two Weeks After Disaster

Before we dive into December 5th, we need context. Just 17 days earlier, on November 18, 2025, Cloudflare suffered what they called their "worst outage since 2019." 😱

⚠️ The November 18 Outage Recap

Cause: A missing WHERE clause in a SQL query caused a Bot Management feature file to balloon from 60 to 120+ features, crashing the FL2 proxy.

Duration: ~6 hours of major impact

Effect: ChatGPT, X (Twitter), Spotify, Discord, and millions of sites went down

See our full deep dive here for details!

Cloudflare had just published their November 18 post-mortem and promised major changes to prevent similar issues, outlining specific improvements to how configuration changes are validated and rolled out.

These changes would have helped prevent December 5th. Unfortunately, they weren't finished yet. 😬

The Cruel Irony

Cloudflare knew exactly what needed to be fixed after November 18. But while they were still implementing those safety measures, December 5th happened. This is the harsh reality of infrastructure at scale: you can't pause the world while you make improvements.

The Pressure Was On

After November 18, Cloudflare was under intense scrutiny from customers, investors, and the press.

And then, on December 3, a critical security vulnerability was disclosed that affected React and Next.js, frameworks used by millions of websites. Cloudflare had to act fast. 🏃‍♂️

02

React2Shell: The Critical Vulnerability

On December 3, 2025, Meta and Vercel publicly disclosed CVE-2025-55182, nicknamed "React2Shell" by security researchers. This was a critical, unauthenticated remote code execution (RCE) vulnerability in React Server Components. 💥

10.0 CVSS Score (Maximum!)
82% Developers Use React
39% Cloud Environments Vulnerable
Hours Until China State Groups Exploited It

🤔 What Are React Server Components?

React Server Components (RSC) let developers run React code on the server instead of just in the browser. This makes websites faster because the server can handle heavy computations and database queries while sending optimized content to users.

🔧 Simple Analogy

Imagine a restaurant: Traditional React is like giving customers raw ingredients and a recipe (they cook everything in their browser). React Server Components is like having the kitchen (server) pre-cook complex dishes, then just delivering the finished meal to the table (browser). Much faster!
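To make that concrete, here's a minimal sketch of what a server component can look like in a Next.js-style App Router setup. The getProducts helper is a hypothetical stand-in for a real database query, not code from any affected site:

// ProductList.tsx: a React Server Component sketch (illustrative only)
// The async component runs on the server; only the rendered HTML reaches the browser.
async function getProducts(): Promise<{ name: string; price: number }[]> {
  // Stand-in for a real server-side database query
  return [
    { name: "Coffee", price: 4 },
    { name: "Bagel", price: 3 },
  ];
}

export default async function ProductList() {
  const products = await getProducts(); // heavy work happens in the "kitchen" (server)
  return (
    <ul>
      {products.map((p) => (
        <li key={p.name}>
          {p.name}: ${p.price}
        </li>
      ))}
    </ul>
  );
}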

The Vulnerability Explained

React Server Components use a special protocol called "Flight" to communicate between the server and browser. CVE-2025-55182 was an unsafe deserialization bug in how the server processed Flight requests.

💥 What Could Attackers Do?

An attacker could send a specially crafted HTTP request to any server running vulnerable React/Next.js code. That single request could:

  • Execute arbitrary code on the server
  • Steal database credentials
  • Install backdoors
  • Access sensitive customer data

No authentication required. Just one malicious HTTP request. 😱
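The exact Flight internals aren't reproduced here, but the general bug class is easy to illustrate. In the hedged TypeScript sketch below (serverFunctions, handleRequestUnsafe, and handleRequestSafe are invented names, not React APIs), the danger is letting attacker-controlled input decide which server-side code runs:

// Illustration of unsafe deserialization as a class of bug (NOT the actual React code)
type Action = { fn: string; arg: string };

const serverFunctions: Record<string, (arg: string) => string> = {
  renderGreeting: (name) => `Hello, ${name}!`,
};

// Vulnerable pattern: parse the request body, then dispatch with no validation.
function handleRequestUnsafe(body: string): string {
  const action = JSON.parse(body) as Action;
  return serverFunctions[action.fn](action.arg); // attacker picks fn and arg
}

// Safer pattern: validate against an explicit allow-list before dispatching.
function handleRequestSafe(body: string): string {
  const action = JSON.parse(body) as Action;
  if (!Object.hasOwn(serverFunctions, action.fn)) {
    throw new Error("unknown action"); // reject anything not explicitly allowed
  }
  return serverFunctions[action.fn](action.arg);
}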

Who Was Affected?

This wasn't some obscure framework. The vulnerable versions included:

  • React 19: versions 19.0, 19.1.0, 19.1.1, and 19.2.0 (tens of millions of sites)
  • Next.js: 15.x and 16.x with the App Router (used by Airbnb, Netflix, Hulu)
  • React Router: RSC preview versions (popular routing library)
  • Waku, Parcel, and Vite: builds with RSC plugins (modern build tools)

⏰ The Timeline of Doom

Nov 29, 2025
πŸ” Discovery
Researcher Lachlan Davidson discovers the vulnerability and responsibly discloses it to Meta/Vercel.
Dec 3, 2025
⚠️ Public Disclosure
Meta and Vercel publish CVE-2025-55182. Security researchers call it "React2Shell" (echoing Log4Shell). Patches released for React and Next.js.
Dec 3-4, 2025
💥 Active Exploitation
Within hours, China state-nexus threat groups (Earth Lamia, Jackpot Panda) begin exploiting the vulnerability. AWS detects attacks in their honeypots. Attackers spend hours debugging their exploits in real-time against live targets.
Dec 4-5, 2025
🚨 Emergency Response
Major cloud providers (Cloudflare, AWS, Google Cloud, Vercel) rush to deploy WAF rules to protect customers. This is where Cloudflare's story begins.
Dec 5, 2025
🔴 Added to CISA KEV
The US cybersecurity agency CISA adds CVE-2025-55182 to its Known Exploited Vulnerabilities (KEV) list, confirming active in-the-wild exploitation.
Why This Was So Urgent

Unlike typical vulnerabilities that might take weeks to be exploited, React2Shell was being actively exploited within hours by nation-state actors. Major cloud providers had a tiny window to protect millions of websites before mass compromise. Cloudflare was racing against the clock. ⏰

03

What Actually Happened (Timeline)

Now let's walk through exactly what happened on December 5, 2025. This is where Cloudflare's good intentions collided with reality. 😬

Dec 3-4, 2025
🔒 Planning Protection
Cloudflare engineers begin working on WAF (Web Application Firewall) rules to detect and block React2Shell exploits. They need to update how their WAF parses HTTP request bodies to catch malicious payloads.
Early Dec 5, ~08:00 UTC
📈 First Change: Increase Buffer Size
Cloudflare starts rolling out an increase to their WAF buffer size from default to 1MB (matching Next.js defaults) to ensure maximum protection. This change uses their gradual deployment system (the safe way!).
~08:30 UTC
⚠️ Discovered Issue
During rollout, engineers notice their internal WAF testing tool doesn't support the increased buffer size. This tool tests WAF rules but doesn't affect customer traffic. Decision: turn it off temporarily.
08:47 UTC
💥 The Fatal Deployment
Engineers deploy a change to disable the WAF testing tool using the global configuration system (not gradual rollout!). This system propagates changes within seconds to every server globally. BIG MISTAKE.
08:47-08:56 UTC
🔥 Cascade Failure Begins
The change triggers a bug in the FL1 proxy (older version). Under certain circumstances, disabling the testing tool causes the proxy to enter an error state. HTTP 500 errors start flooding across Cloudflare's network. Websites start showing "Internal Server Error."
08:56 UTC
🚨 Detection & Investigation
Cloudflare's monitoring alerts fire. Status page updated: "Investigating issues with Cloudflare Dashboard and related APIs." Engineers scramble to understand what's happening.
09:12 UTC
✅ Fix Deployed
Root cause identified: the WAF testing tool disable caused the error state. Engineers revert the change globally. Services begin recovering immediately.
09:19-09:20 UTC
🎉 Resolution
Incident marked as resolved. Total impact: ~25 minutes affecting approximately 28% of Cloudflare's HTTP traffic.
25min Total Impact Duration
28% Traffic Affected
~16min Detection → Fix
Seconds To Propagate Bad Config

Services That Went Down 📉

LinkedIn, Zoom, and countless other Cloudflare-fronted sites returned errors during the window, along with parts of Cloudflare's own Dashboard and APIs.

The Silver Lining: Unlike November 18 (6 hours), December 5 was "only" 25 minutes. Detection was fast, root cause was identified quickly, and the fix was deployed immediately. Still, 25 minutes is an eternity when you're serving 20% of the internet! 😅
04

The Technical Root Cause

Let's get technical. What exactly broke, and why? 🤔

The Two Configuration Systems

Cloudflare has two ways to deploy configuration changes:

Gradual Deployment
  • Speed: slow, phased rollout
  • Safety: ✅ health checks at each phase
  • Rollback: can stop mid-rollout
  • Use case: feature changes, code updates

Global Configuration
  • Speed: ⚡ seconds to the entire fleet
  • Safety: ❌ no validation
  • Rollback: must push another global change
  • Use case: emergency fixes, kill switches

The first change (buffer size increase) used gradual deployment. ✅ Safe!

The second change (disabling testing tool) used global configuration. ❌ Not safe!
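To see why the difference matters, here's a hedged sketch of a phased rollout loop. The names (applyConfig, healthCheck, rollback) are illustrative, not Cloudflare's actual deployment API:

// A phased rollout: smallest server group first, health check after every phase.
type Config = { internal_testing_tool: "enabled" | "disabled" };

async function gradualRollout(
  config: Config,
  phases: string[][], // server groups, smallest first (canary -> region -> world)
  applyConfig: (servers: string[], c: Config) => Promise<void>,
  healthCheck: (servers: string[]) => Promise<boolean>,
  rollback: (servers: string[]) => Promise<void>,
): Promise<boolean> {
  const touched: string[] = [];
  for (const phase of phases) {
    await applyConfig(phase, config);
    touched.push(...phase);
    if (!(await healthCheck(phase))) {
      await rollback(touched); // only the canary slice ever saw the bad config
      return false;
    }
  }
  return true; // reached the whole fleet, with a checkpoint at every step
}

// A global config push is the degenerate case: one phase, no health check, no rollback.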

Why Global Config Was Used

The engineers made a judgment call: the testing tool was internal-only, the change was tiny, and speed mattered while attackers were actively exploiting React2Shell.

Sounds reasonable, right? 🤷‍♂️

💥 The Fatal Assumption

Engineers assumed that because the testing tool was "internal-only," turning it off couldn't possibly affect customer traffic. They were wrong. In the FL1 proxy (their older version), the change to disable the testing tool interacted with the request parsing logic in an unexpected way, causing an error state.

The FL1 vs FL2 Difference

Remember from the November 18 outage that Cloudflare is migrating customers from FL1 (the old proxy) to FL2 (the new, Rust-based proxy)? The two versions behaved very differently on December 5:

Impact by Proxy Version
FL1 (Old Proxy) – Affected Customers
When the WAF testing tool was disabled, FL1 entered an error state during request processing. The exact bug isn't public yet, but it caused the proxy to return HTTP 500 errors for legitimate customer requests.

Impact: Sites down, users see "Internal Server Error"

vs
FL2 (New Proxy) – Mostly Unaffected
The Rust-based FL2 proxy handled the configuration change correctly. Customers on FL2 were largely unaffected.

Impact: Minimal or none

This is actually encouraging! It means Cloudflare's FL2 migration is making the system more resilient. The older FL1 code had a latent bug that only surfaced under specific conditions (disabling the testing tool). 🐛

The Configuration Change (Simplified)

🔧 What Was Changed
// Pseudo-code representation of the change

// Original config:
waf_config = {
    body_parser_buffer_size: 512KB,  // ✅ First change: increasing this to 1MB
    internal_testing_tool: enabled,
    body_parsing_mode: "strict"
};

// After first change (gradual rollout):
waf_config.body_parser_buffer_size = 1MB;  // Safe ✅

// After second change (GLOBAL CONFIG ⚑):
waf_config.internal_testing_tool = disabled;  // 💥 TRIGGERED BUG IN FL1

// In FL1, this caused:
if (!internal_testing_tool && body_parsing_mode == "strict") {
    throw_error();  // ⬅️ BUG! Unintended error state
}
The Core Problem

The bug itself was in FL1. But the real failure was in the deployment process. Using the global configuration system meant the bug hit all FL1 servers simultaneously with no warning, no gradual rollout, no health checks. If they had used gradual deployment, the first health check would have caught the issue before it affected more than a tiny fraction of traffic.

05

Why Cloudflare Rushed the Fix

Here's where we need to have empathy for the engineers involved. They weren't being reckless; they were responding to a genuine emergency. 🚨

The Pressure Cooker 🔥

Hours Until Active Exploitation
10.0 CVSS Score (Max Severity)
Millions Sites at Risk
Nation-State Attackers Already Active

The Security Dilemma

Security engineers face an impossible choice when a critical vulnerability drops:

🔧 The Security Triage Dilemma

Imagine you're a surgeon and a patient comes in with internal bleeding (the vulnerability). You know that operating immediately carries risks (the deployment might break things), but not operating means the patient will definitely die (customers will definitely be hacked). What do you do?

The answer: You operate as safely as possible, as quickly as possible. That's exactly what Cloudflare tried to do.

Why the Global Config System?

From Cloudflare's perspective, using the global configuration system made sense: the testing tool was internal-only, the change seemed trivially safe, and every minute of delay left customers exposed to React2Shell.

The problem? They didn't know about the latent bug in FL1. 🐛

⚠️ The Hidden Coupling

The WAF testing tool appeared independent but was actually coupled to the request processing logic in FL1. This kind of hidden coupling is the bane of complex systems. Engineers can't always predict every interaction when they're racing against nation-state attackers.

What About Testing?

You might ask: "Couldn't they have tested this first?" 🤔

The answer is complicated: there was no realistic way to test every interaction across both proxy versions in the hours available, and the change looked like it couldn't touch customer traffic at all.

The Speed vs Safety Paradox

This is the fundamental tension in security operations: Moving too slowly means customers get hacked. Moving too fast means you might break things. There's no perfect answer. Cloudflare chose speed to protect customers from React2Shell, but that speed caused a different kind of outage.

06

The Full Impact

Let's look at the real-world consequences of those 25 minutes. 📊

By the Numbers

25 minutes of total impact. 28% of Cloudflare's HTTP traffic affected. Second major outage in 17 days.

Cyber Monday Timing 💸

The outage happened on December 5, during the Cyber Monday shopping period. For e-commerce companies using Cloudflare (like Shopify stores), this was catastrophic timing: every minute of downtime meant abandoned carts and lost orders at the peak of the season.

The Compounding Effect

Coming just 17 days after November 18, this outage had amplified consequences:

Reliability Perception Over Time

Before Nov 18: ✅ Trusted infrastructure → Nov 18, 2025: 💥 6-hour outage → After Nov 18: ⚠️ Promises of improvement → Dec 5, 2025 (17 days later): 💥 Another outage → Current perception: ❌ Pattern of instability

Stock Market Reaction 📉

Cloudflare's stock dropped as news spread of a second outage in three weeks.

Customer Sentiment

Social media wasn't kind. Typical reactions:

  • "Cloudflare again? Two outages in three weeks? 😑"
  • "Maybe we need to rethink our single-CDN strategy..."
  • "At least DownDetector was working this time! πŸ˜‚"
  • "They said they fixed it after Nov 18. What happened?"
Reputation is Fragile

One outage can be forgiven. Two in three weeks starts to look like a pattern. Cloudflare didn't just lose 25 minutes of uptime; they lost customer confidence. That's much harder to rebuild. 😔

07

How They Fixed It (Fast!)

Credit where due: Cloudflare's detection and response were excellent. Let's break down how they recovered so quickly. ⚡

Detection: 9 Minutes ⏱️

08:47 UTC
Change Deployed
Global config change propagates to all servers
08:47-08:56 UTC
Monitoring Alerts Fire
  • HTTP 5xx error rate spikes across network
  • Latency increases detected
  • Automated health checks fail
  • Customer reports flooding in
08:56 UTC
🚨 Incident Declared
Status page updated, incident response team activated

Within 9 minutes of the change, Cloudflare knew they had a problem. That's fast. 👍
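What does that kind of detection look like in practice? Here's a hedged sketch of a simple 5xx error-rate alert; the thresholds and names are illustrative, not Cloudflare's actual monitoring stack:

// Page the on-call engineer when the 5xx share of recent traffic spikes above baseline.
interface WindowStats {
  total: number;   // requests in the last monitoring window
  http5xx: number; // server errors in the same window
}

function shouldPage(current: WindowStats, baselineRate: number, factor = 5): boolean {
  if (current.total === 0) return false;
  const rate = current.http5xx / current.total;
  // Alert when errors are several times the normal baseline AND above an absolute floor.
  return rate > baselineRate * factor && rate > 0.01;
}

// Normal day: ~0.1% errors. During an incident like this: double-digit error rates -> page.
console.log(shouldPage({ total: 100_000, http5xx: 12_000 }, 0.001)); // true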

Root Cause Analysis: 16 Minutes 🔍

How did they figure out what went wrong so quickly?

  1. Correlation: Error spike started at exactly 08:47 UTC
  2. Recent Changes: Only one change was deployed at that time (the global config push)
  3. Error Patterns: Errors only on FL1 servers, not FL2
  4. Configuration Diff: Compared before/after states
  5. Hypothesis: Disabling WAF testing tool triggered the bug
πŸ” Diagnostic Process (Simplified)
// Engineers' investigation flow:

1. check_error_logs()
   // HTTP 500 errors in FL1 proxy
   
2. check_recent_deployments()
   // 08:47 UTC: Global config change
   
3. diff_configuration("before", "after")
   // internal_testing_tool: enabled → disabled
   
4. test_hypothesis()
   // Enable testing tool on test server → errors stop
   // Disable again → errors return
   
5. confirm_root_cause()
   // ✅ Found it! Revert the change.

The Fix: Instant ⚡

Once they knew the cause, the fix was simple:

🔧 The Fix
// Revert the global configuration change

waf_config.internal_testing_tool = enabled;  // Re-enable it

// Push globally (same system that caused the problem!)
deploy_global_config(waf_config);

// Within seconds, all FL1 servers recover

The fix propagated just as fast as the bug: within seconds. By 09:12 UTC, services were recovering. ✅

Post-Recovery Actions

After the immediate fix, Cloudflare began investigating why FL1 reacted to the change the way it did and reviewing how the deployment process let it reach every server at once.

What Cloudflare Did Right

Despite the outage, Cloudflare's response was textbook:

  • ✅ Fast detection (9 minutes)
  • ✅ Systematic root cause analysis (16 minutes)
  • ✅ Immediate fix deployment
  • ✅ Transparent communication
  • ✅ Detailed public post-mortem

Many companies would have taken hours to recover. Cloudflare did it in 25 minutes. That's impressive incident response! 👍

08

Lessons: Speed vs Safety

This incident is a masterclass in the impossible trade-offs of running infrastructure at scale. Let's extract the lessons. 📚

🎯 Lesson 1: Global Config Systems Are Dangerous

Having a "bypass all safety checks" button is terrifying but necessary:

The Trade-off:
  • Why it exists: Emergency situations demand speed
  • The danger: No validation, no gradual rollout, no rollback
  • The lesson: Use it ONLY when the alternative is definitively worse

Best Practice: Even "emergency" config systems should have:

πŸ” Lesson 2: Assume All Changes Are Dangerous

The WAF testing tool seemed innocent: it was "internal-only" and "doesn't affect customer traffic." That assumption was wrong. 🐛

🔧 The Swiss Cheese Model

Think of your system as layers of Swiss cheese. Each layer has holes (bugs, edge cases). Normally the holes don't align, so problems are caught. But when multiple holes line up, like disabling the testing tool (hole 1) on FL1 servers (hole 2) during body parsing (hole 3), you get a catastrophic failure. 🧀
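A toy calculation, with made-up probabilities, shows why those layers are worth keeping:

// Independent safety layers multiply: three layers that each catch ~90% of bad changes
// let only ~0.1% through. Remove one layer (like gradual rollout) and that jumps to ~1%.
const layers = [0.9, 0.9, 0.9];
const missAll = layers.reduce((slipThrough, catchRate) => slipThrough * (1 - catchRate), 1);
console.log(missAll); // ~0.001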

Best Practice: Treat every change as potentially customer-facing, and stage it accordingly.

⏰ Lesson 3: Security Urgency vs Operational Safety

This is the hardest lesson and has no clean answer:

For the React2Shell fix, the two options stacked up like this:

Deploy Fast (Risk Outage)
  • ❌ Caused a 25-minute outage
  • ✅ Protected millions from RCE

Deploy Slow (Risk Breach)
  • ✅ No outage
  • ❌ Customers hacked during the delay

Which is worse? Honestly, both suck. 😬

Cloudflare's calculation: a short, recoverable outage was a better outcome than leaving millions of sites exposed to active RCE exploitation.

This math probably makes sense. But it's still a brutal choice. 💔

🔄 Lesson 4: Technical Debt Has Consequences

The bug was in FL1 (the old proxy). Cloudflare is migrating to FL2, but migrations take time.

The Reality: You can't pause production to finish migrations. Customers depend on you RIGHT NOW. So you operate dual systems, old and new, and that complexity creates risk. This is why paying down technical debt is so important, but also why it's so hard. 😓

📊 Lesson 5: Observability Saved the Day

Cloudflare's monitoring and alerting were excellent: error-rate alerts fired within minutes of the change, and the error pattern (FL1 only, starting at exactly 08:47 UTC) pointed straight at the root cause.

Without great observability, this could have been hours instead of minutes. 👍

πŸ—£οΈ Lesson 6: Transparency Builds Trust

Cloudflare's public post-mortems are industry-leading: detailed timelines, named root causes, and concrete follow-up actions.

This transparency is why we can write this article. Many companies would just say "we had a brief service disruption" and move on. Cloudflare teaches the entire industry by being honest about failures. 🙏

The Meta-Lesson

There's no such thing as perfect infrastructure. You will have outages. What matters is:

  • How fast you detect them
  • How quickly you recover
  • How honest you are about what happened
  • What you do to prevent recurrence

By that measure, Cloudflare is doing most things right, even when things go wrong. 💪

09

What Cloudflare Is Doing Now

After two major outages in three weeks, Cloudflare is taking this very seriously. Here's their action plan. 🛠️

Immediate Actions (In Progress)

  1. Enhanced Rollouts & Versioning
    • ALL configuration changes (not just code) will use gradual rollouts
    • Health validation at each rollout stage
    • Automatic rollback on failures
    • Even "global config" system will have safety checks
  2. FL1 Bug Investigation
    • Understanding exactly why disabling the testing tool caused errors
    • Fixing the root cause, not just the symptom
    • Auditing FL1 for similar latent bugs
  3. Accelerated FL2 Migration
    • Prioritizing moving more customers to FL2 (which handled the change correctly)
    • The new proxy has proven more resilient in practice
    • Written in Rust for memory safety
  4. Configuration Dependency Mapping
    • Understanding ALL interactions between config settings
    • Building a dependency graph
    • Preventing hidden coupling issues
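Here's a hedged sketch of what that kind of configuration dependency mapping could look like; the setting and subsystem names are illustrative, not Cloudflare's real configuration schema:

// Record which subsystems read each setting, so a "harmless" flag flip surfaces
// everything downstream before the change ships.
const readers: Record<string, string[]> = {
  internal_testing_tool: ["waf_rule_tester", "fl1_request_parser"], // FL1's parser reads it too!
  body_parser_buffer_size: ["fl1_request_parser", "fl2_request_parser"],
};

function blastRadius(setting: string): string[] {
  // Every subsystem that must be re-validated when this setting changes
  return readers[setting] ?? [];
}

console.log(blastRadius("internal_testing_tool"));
// ["waf_rule_tester", "fl1_request_parser"] -> not "internal-only" after all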

Medium-Term Improvements

Cultural Changes

Beyond technical fixes, Cloudflare is working on cultural shifts:

New Principles:
  • "Speed is important, but safety is mandatory"
  • "Internal changes can break external systems"
  • "Gradual rollout is not optional"
  • "Global config is a nuclear option"

Will It Work?

Honestly? Probably, but not perfectly. 🤷‍♂️

Here's the reality: Cloudflare will have more outages. Every complex system fails eventually.

That's not pessimism; it's realism. Every major infrastructure provider has outages. Google, AWS, Microsoft, GitHub, they all go down sometimes. What matters is the trend: are outages getting less frequent and less severe? 📈

The Real Question

The real question isn't "Will Cloudflare ever have another outage?" (they will). It's "Are they systematically reducing the frequency and impact of outages over time?" Based on their response to these incidents, the answer appears to be yes. 💪

10

Key Takeaways for Developers

Whether you're running infrastructure for 10 users or 10 million, here's what you can learn from this incident. 🎓

1. Gradual Rollouts Are Not Optional

Even for "simple" changes. Even for "internal-only" systems. Even during emergencies. Always roll out changes gradually with health checks at each stage. The few minutes you save by deploying globally are not worth the hours of outage.

2. Hidden Coupling Is Everywhere

Systems are more interconnected than you think. That "internal-only" tool? It might interact with customer-facing code in unexpected ways. Document dependencies, test interactions, and assume every change can break something.

3. Security vs Availability Is a False Dichotomy

You don't have to choose between security and availabilityβ€”you need both. Build systems that can apply security fixes safely. If your only option is "deploy dangerously or leave vulnerable," you've already failed at system design.

4. Observability Is Your Lifeline

You can't debug what you can't see. Invest heavily in monitoring, logging, and tracing. The difference between a 25-minute outage and a 6-hour outage is often just how quickly you can find the root cause.

5. Technical Debt Compounds

Cloudflare is still dealing with bugs in FL1 (their old proxy) because migrations take time. Every day you delay paying down technical debt is another day you're operating with elevated risk. Schedule the migration. Do the refactor. It's not sexy, but it's necessary.

6. Transparency Builds Trust

When things break, be honest about it. Cloudflare's detailed post-mortems are why the community still trusts them despite two major outages. Customers respect honesty and learning from mistakes.

7. Incident Response Matters More Than Prevention

You will have incidents. You can't prevent all failures. What matters is how fast you detect, diagnose, and fix them. Practice your incident response. Run game days. Build muscle memory for crisis situations.


✨

Final Thoughts

So let's recap: On December 5, 2025, Cloudflare deployed an urgent security fix to protect millions of websites from React2Shell, a critical RCE vulnerability. In their rush to protect customers, they used a global configuration system that bypassed safety checks. A latent bug in their older proxy (FL1) was triggered, causing 25 minutes of outage affecting 28% of their traffic. 😬

This was Cloudflare's second major outage in three weeks. The timing was brutal. The optics were bad. The stock dropped. Customers questioned their reliability. 📉

But here's the thing: Cloudflare made the right choice. 🤔

Let me explain: The alternative was leaving millions of websites vulnerable to nation-state attackers actively exploiting React2Shell. A 25-minute outage sucks. But widespread customer compromises would have been catastrophically worse. Cloudflare chose the lesser evil.

The real failure wasn't the decision to deploy fast; it was that they didn't have the infrastructure to deploy fast safely. That's what they're working on now. 🛠️

🎯 The Bigger Picture

Cloudflare is doing something most companies won't: publishing detailed, honest post-mortems. They're not hiding behind PR spin. They're showing their work, admitting mistakes, and explaining exactly what they're doing to improve.

That transparency is valuable. Every engineer who reads their post-mortems learns from Cloudflare's mistakes. Every company can improve their own systems. The entire industry gets better.

So while two outages in three weeks is definitely not good, Cloudflare's response to those outages is excellent. 👍

As you build and operate systems, remember: failures are inevitable. What matters is how you respond. Do you hide them? Or do you learn from them publicly?

Be like Cloudflare: when you break things, fix them fast, explain what happened, and build systems that make that class of failure impossible next time. That's how we all get better. 🔥

Stay resilient, keep learning, and remember: the internet is held together with duct tape and hope. We're all just trying our best! 😅

Thanks for reading! 😉

