What is Cloudflare and Why Does It Matter?
Before we dive into what went wrong, let's understand what Cloudflare actually does. If you're building web applications, this is essential knowledge. Trust me, after reading this, you'll never look at the internet the same way! 🤯
🌐 The Problem Cloudflare Solves
Imagine you build an amazing e-commerce website. It's hosted on a server in Mumbai. Now, when someone from New York tries to access your site, their request has to travel thousands of kilometers across the ocean. That's slow. And if suddenly 10,000 people try to access your site at once? Your single server might crash. Not a great experience, right? 😣
Think of Cloudflare as a security guard + receptionist sitting between your users and your server. Every request goes through them first. They check if the visitor is legitimate, block the bad ones, and for common requests (like your homepage), they keep a copy ready so they can respond instantly without bothering your server. Now imagine having this guard in 330+ cities worldwide — so users always talk to someone nearby instead of waiting for a response from far away.
What Cloudflare Actually Does
Cloudflare provides several critical services:
- CDN (Content Delivery Network) — Caches and serves your content from servers close to users
- DDoS Protection — Blocks attacks where hackers flood your site with fake traffic
- DNS Services — Translates domain names (like google.com) to IP addresses
- Bot Management — Identifies and blocks malicious bots (this is the culprit in our story! 🎯)
- SSL/TLS — Handles encryption for secure connections
- Web Application Firewall — Protects against common attacks like SQL injection
Because Cloudflare sits between users and websites for such a large portion of the internet, when Cloudflare fails, it doesn't matter if your actual servers are running perfectly — users simply can't reach them. Your application is fine, but nobody can access it.
Services Affected by This Outage
This wasn't a minor hiccup. These major services went down or became partially unavailable — and it was chaos! 😱
- X (Twitter) — ~700 million users 🐦
- ChatGPT — couldn't log in 🤖
- Spotify, Discord, Canva, Figma 🎵💬🎨
- Claude AI, 1Password, Trello, Medium, Postman
- League of Legends, Valorant (couldn't connect to servers) 🎮
- Even DownDetector (the site people use to check if sites are down!) was down 😂
How Cloudflare's Architecture Works
To understand what broke, you need to understand how a request flows through Cloudflare's system. Don't worry, I'll make this simple! 🤓
🔄 The Request Journey
When you type twitter.com in your browser, your request doesn't go straight to Twitter's servers. It first lands on the nearest Cloudflare data center, where it passes through Cloudflare's core proxy, known as FL.
Inside that proxy layer, several things happen (this is where it gets interesting! 🎯):
- WAF rules are applied (blocking SQL injection, XSS, etc.)
- DDoS protection kicks in
- Bot Management runs here — generating bot scores for every request
- Customer-specific configurations are applied
- Traffic is routed to the appropriate service
This is where things broke. The FL Proxy crashed when loading a corrupted configuration file.
The FL Proxy processes every single request that goes through Cloudflare. There's no way to bypass it. When it fails, everything fails. This is why understanding the architecture matters — one broken component in the critical path can take down everything downstream.
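To make the "critical path" idea concrete, here's a minimal Rust sketch (my own illustration, not Cloudflare's code): a request pipeline where every stage must succeed, so a single failing stage fails the whole request even though the origin server is perfectly healthy.

```rust
// Illustrative request pipeline: every stage sits on the critical path.
#[derive(Debug)]
enum StageError {
    BotManagementDown,
}

fn ddos_check(_req: &str) -> Result<(), StageError> {
    Ok(()) // this stage is healthy
}

fn bot_management(_req: &str) -> Result<(), StageError> {
    Err(StageError::BotManagementDown) // the one broken component...
}

fn route_to_origin(req: &str) -> Result<String, StageError> {
    Ok(format!("200 OK for {req}")) // ...means we never get here
}

fn handle_request(req: &str) -> Result<String, StageError> {
    ddos_check(req)?; // `?` short-circuits on the first error
    bot_management(req)?;
    route_to_origin(req)
}

fn main() {
    // Prints Err(BotManagementDown); the origin was never even contacted.
    println!("{:?}", handle_request("GET /homepage"));
}
```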
FL vs FL2: The Two Proxy Versions
Here's where it gets tricky! 🤔 Cloudflare was in the process of migrating customers from their old proxy (FL) to a new, improved one (FL2). During this outage, both versions were affected — but differently:
| Aspect | FL2 (New Proxy) | FL (Old Proxy) |
|---|---|---|
| Written in | Rust | Older codebase |
| What happened | Completely crashed with HTTP 5xx errors | Continued running but returned incorrect bot scores (always 0) |
| User experience | Error pages, couldn't access sites | Could access sites, but bot rules misfired (false positives) |
The FL2 proxy was stricter about input validation (a good thing normally!) and crashed when it received invalid data. The older FL proxy was more lenient but produced incorrect results. Ironic, right? 🤷♂️
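Here's a hedged sketch of that difference in behavior. This isn't Cloudflare's code; it just contrasts "reject bad input loudly" with "never fail, but quietly produce a wrong answer":

```rust
// Illustrative only: two ways of handling a malformed bot score.
fn parse_score_strict(raw: &str) -> Result<u8, String> {
    let n: u8 = raw.parse().map_err(|e| format!("not a number: {e}"))?;
    if (1..=99).contains(&n) {
        Ok(n)
    } else {
        Err(format!("score {n} out of range")) // FL2-style: refuse bad data
    }
}

fn parse_score_lenient(raw: &str) -> u8 {
    raw.parse().unwrap_or(0) // FL-style: never fail, but 0 is the wrong answer
}

fn main() {
    println!("{:?}", parse_score_strict("250"));    // Err("score 250 out of range")
    println!("{}", parse_score_lenient("garbage")); // 0
}
```

If the strict parser's error is then .unwrap()-ed by a caller, you get exactly the FL2 failure mode: a crash instead of a wrong score.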
The Bot Management System
The root cause of this outage was in Cloudflare's Bot Management system. Now this is where it gets really interesting! 🎯 Let's understand what it does and why a simple "feature file" brought down the internet.
🤖 What is Bot Management?
Ever wondered how websites know if you're a real person or an automated bot? 🤔
Not all traffic to websites is from real humans. A significant portion comes from bots — automated programs that access websites. Some bots are good (like Google's crawler that indexes your site for search), and some are bad (like scrapers stealing your content or attackers trying to brute-force passwords).
Bot Management uses machine learning to analyze every request and assign a "bot score" — a number that indicates how likely the request is from a bot vs. a human.
It's like a spam filter for websites. Just like Gmail looks at email patterns to decide "spam or not spam", Bot Management looks at request patterns to decide "bot or human". It checks things like: How fast are requests coming? Does this browser fingerprint look real? Is this IP known for suspicious activity? All these signals get combined into a single score.
How Bot Scores Work
- Score 1-29: Likely a bot
- Score 30-70: Uncertain, might be either
- Score 71-99: Likely a human
Website owners can then create rules like: "If bot score < 30, show a CAPTCHA" or "If bot score < 10, block the request."
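As a rough sketch of how such rules get applied (thresholds taken from the examples above; this is not Cloudflare's actual rule engine):

```rust
// Illustrative rule engine on top of a bot score (1-99).
#[derive(Debug)]
enum Action {
    Block,
    Challenge, // e.g. show a CAPTCHA
    Allow,
}

fn decide(bot_score: u8) -> Action {
    match bot_score {
        0..=9 => Action::Block,
        10..=29 => Action::Challenge,
        _ => Action::Allow,
    }
}

fn main() {
    for score in [0, 5, 25, 80] {
        println!("score {score} -> {:?}", decide(score));
    }
}
```

Note that a score of 0 lands in the block bucket, which is exactly why the old FL proxy suddenly scoring every request as 0 caused a wave of false positives.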
📄 What is a Feature File?
Now we get to the critical piece — pay attention here! 👀 The machine learning model needs to know which characteristics (features) to look at when analyzing a request. These are defined in a feature configuration file.
A feature file contains a list of "features" — individual traits the ML model uses. For example:
- `user_agent_entropy` — How random/unique is the User-Agent string?
- `request_rate` — How many requests per second from this IP?
- `header_order` — In what order are HTTP headers sent?
- `tls_fingerprint` — What does the TLS handshake look like?
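To make that concrete, here's a hypothetical sketch of the in-memory shape a proxy might deserialize such a file into. The field names are my own illustration, not Cloudflare's actual schema:

```rust
// Hypothetical shape of a feature configuration (illustrative only).
struct Feature {
    name: String, // e.g. "user_agent_entropy"
    kind: String, // e.g. "Float64"
}

struct FeatureConfig {
    version: u64,           // bumped on every regeneration (every few minutes)
    features: Vec<Feature>, // normally ~60 entries, hard-capped at 200
}

fn main() {
    let config = FeatureConfig {
        version: 42,
        features: vec![
            Feature { name: "user_agent_entropy".into(), kind: "Float64".into() },
            Feature { name: "request_rate".into(), kind: "Float64".into() },
        ],
    };
    println!(
        "config v{} with {} features, first: {} ({})",
        config.version,
        config.features.len(),
        config.features[0].name,
        config.features[0].kind
    );
}
```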
Why Feature Files Need Frequent Updates
Bad actors constantly evolve. It's like a cat and mouse game! 🐱🐭 When attackers figure out that Cloudflare is looking at Feature X, they'll modify their bots to appear normal on Feature X. So Cloudflare needs to constantly update the features — adding new ones, tweaking existing ones, removing obsolete ones.
This is why the feature file is regenerated every few minutes and pushed to every server globally. Sounds harmless, right? 🤷♂️
For performance reasons, Cloudflare's proxy pre-allocates memory for the feature file. They set a limit of 200 features — well above their actual use of ~60 features. When the corrupted file contained over 200 features due to duplicates, it exceeded this limit and crashed the system.
The Code That Crashed
Here's the actual Rust code that caused the crash (simplified):
// This code checks if the number of features is within the limit
// MAX_FEATURES is set to 200
fn load_feature_config(features: Vec<Feature>) -> Result<Config, Error> {
    if features.len() > MAX_FEATURES {
        return Err(Error::TooManyFeatures);
    }
    // Pre-allocate memory for exactly this many features
    let config = Config::with_capacity(features.len());
    // ... rest of loading logic
    Ok(config)
}
// Somewhere in the calling code:
let config = load_feature_config(features).unwrap(); // 💥 CRASH HERE!
The problem was that .unwrap(). In Rust, .unwrap() says "I expect this to succeed, and if it doesn't, crash the program." When the feature count exceeded 200, the function returned an error, .unwrap() was called on that error, and the entire proxy crashed.
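If you haven't written Rust before, here's the smallest possible demonstration of that behavior; run it and the process aborts with a panic on the second unwrap:

```rust
fn main() {
    let good: Result<i32, &str> = Ok(7);
    println!("{}", good.unwrap()); // prints 7

    let bad: Result<i32, &str> = Err("TooManyFeatures");
    println!("{}", bad.unwrap()); // panics: called `Result::unwrap()` on an `Err` value: "TooManyFeatures"
}
```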
Cloudflare's engineers assumed their internally-generated feature file would always be valid. They didn't apply the same defensive programming they would for user-provided input. This is a common mistake — we often trust "internal" data more than we should.
Understanding ClickHouse Distributed Databases
The bug originated in Cloudflare's ClickHouse database. If you're getting into large-scale systems, understanding distributed databases is essential. Let's break it down.
📊 What is ClickHouse?
ClickHouse is an open-source column-oriented database designed for analytics at massive scale. It's used by companies like Uber, Cloudflare, eBay, and Yandex to analyze billions of rows of data in real-time.
Row-oriented databases (MySQL, PostgreSQL): Store data like a book — one complete row after another. Great for looking up a specific user's full profile.
Column-oriented databases (ClickHouse): Store data like a spreadsheet where each column is a separate file. Great for analytics queries like "What's the average of column X across 1 billion rows?" because you only read that one column.
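A toy illustration of the difference in Rust (this says nothing about ClickHouse's actual internals): row storage keeps whole records together, column storage keeps each field in its own array, so an aggregate over one field only has to touch that one array.

```rust
// Row-oriented: one complete record after another (MySQL/PostgreSQL style).
struct RequestRow {
    ip: String,
    bot_score: u8,
    bytes_sent: u64,
}

// Column-oriented: each column lives in its own array (ClickHouse style).
struct RequestColumns {
    ips: Vec<String>,
    bot_scores: Vec<u8>,
    bytes_sent: Vec<u64>,
}

fn main() {
    let row = RequestRow { ip: "1.2.3.4".into(), bot_score: 12, bytes_sent: 1024 };
    println!("full row: ip={} score={} bytes={}", row.ip, row.bot_score, row.bytes_sent);

    let cols = RequestColumns {
        ips: vec!["1.2.3.4".into(), "5.6.7.8".into()],
        bot_scores: vec![12, 95],
        bytes_sent: vec![1024, 2048],
    };
    // "Average bot score across all rows" reads only the bot_scores array.
    let avg = cols.bot_scores.iter().map(|&s| s as f64).sum::<f64>()
        / cols.bot_scores.len() as f64;
    println!("{} rows, average bot score {avg}", cols.ips.len());
    let total: u64 = cols.bytes_sent.iter().sum(); // another single-column scan
    println!("total bytes: {total}");
}
```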
How Distributed Databases Work
When you have billions of rows of data, one server isn't enough. You shard the data across many servers:
Here's the key concept that caused the bug:
The Two-Database Structure
Cloudflare's ClickHouse has two logical databases:
| Database | Purpose | What it contains |
|---|---|---|
| `default` | Query entry point | Distributed tables — virtual tables that fan out queries to all shards |
| `r0` | Actual storage | Underlying tables — where the actual data lives on each shard |
When you query default.http_requests_features, the Distributed engine automatically queries r0.http_requests_features on every shard and combines the results.
🔐 The Permission Change That Started It All
Here's where things went wrong. Cloudflare was improving their database security by making permissions more explicit.
Before the Change (11:04 UTC)
When users queried metadata (like "what columns exist in this table?"), they could only see the default database:
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features';
-- Result:
-- name | type
-- --------------+--------
-- user_agent | String
-- ip_address | String
-- request_rate | Float64
-- ... (~60 features)
After the Change (11:05 UTC)
The permission change made the r0 database visible too. Now the same query returned duplicates:
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features';
-- Result (PROBLEM!):
-- name | type | database
-- --------------+----------+----------
-- user_agent | String | default ← Original
-- ip_address | String | default
-- request_rate | Float64 | default
-- user_agent | String | r0 ← DUPLICATE!
-- ip_address | String | r0 ← DUPLICATE!
-- request_rate | Float64 | r0 ← DUPLICATE!
-- ... (now ~120+ rows!)
The query that generates the feature file didn't filter by database name; it assumed every result came from the default database. Once the r0 tables became visible, every column appeared more than once, and the generated feature file ballooned past the 200-feature limit the proxy had pre-allocated for.
The Problematic Query
-- This query was used to generate the feature file
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
-- ❌ MISSING: AND database = 'default'
One missing filter on the database name. That's all it took. Just one line of code. 🤯
Why The Outage Was Intermittent (At First)
Here's where it gets even more confusing! 😵 The outage didn't hit all at once. It fluctuated. Why?
Cloudflare was gradually rolling out the permission change to their ClickHouse cluster. The feature file is regenerated every 5 minutes, and each regeneration randomly picks a node in the cluster to run the query on.
- Query hits updated node: Bad feature file generated → Outage
- Query hits non-updated node: Good feature file generated → Recovery
This made debugging incredibly confusing because the system would recover on its own, then fail again minutes later. Imagine trying to fix something that keeps "fixing itself" and breaking again! 😤
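A tiny back-of-the-envelope sketch (numbers invented) of why a gradual rollout plus "run the query on a random node every five minutes" produces exactly this flapping pattern:

```rust
// Illustrative: the more ClickHouse nodes have the new permissions,
// the more likely each 5-minute regeneration picks a "bad" node.
fn main() {
    let total_nodes = 10.0_f64;
    for updated in [1.0, 3.0, 5.0, 8.0, 10.0] {
        let p_bad = updated / total_nodes;
        println!(
            "{updated}/{total_nodes} nodes updated -> {:.0}% chance the next feature file is bad",
            p_bad * 100.0
        );
    }
}
```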
What Actually Happened (Step by Step)
Now that you understand all the components, let's trace exactly what happened, minute by minute.
- 11:05 UTC: A permission change is rolled out on the ClickHouse cluster, making the underlying r0 database visible alongside default. It looked like a good, routine security improvement.
- ~11:20 UTC: A feature-file regeneration runs its query on an updated node. Duplicate columns come back, the file exceeds the 200-feature limit, .unwrap() is called on the resulting error, the FL2 proxy panics, and HTTP 5xx errors flood the network.
- 11:31 UTC: Automated health checks fire and incident response begins.
- 11:31-13:37 UTC: The team chases the wrong leads:
  - Initial suspicion: a DDoS attack (Cloudflare had recently defended against massive attacks)
  - The status page going down (an unrelated coincidence!) reinforced the attack theory
  - The intermittent nature made it seem like attackers were probing
  - Focus shifts to Workers KV, then Access, then other services
- 13:37 UTC: An engineer spots the TooManyFeatures panic in the FL2 logs, and the investigation zeroes in on Bot Management.
- 14:24 UTC: The recovery plan kicks in:
  - Stop automatic regeneration of new feature files
  - Retrieve the last known good feature file from before 11:20 and push it globally
- 14:30-17:06 UTC: Core traffic recovers first; downstream services (dashboard logins, Turnstile) take until 17:06 UTC to fully stabilize.
How They Detected the Problem
This section is crucial for anyone building systems at scale. Detection and observability are your lifeline when things go wrong! 🚨
🚨 The Monitoring That Worked
Cloudflare has extensive monitoring. Here's what fired:
- Automated Health Checks: Synthetic tests that continuously make requests to Cloudflare services detected issues at 11:31 UTC — just 11 minutes after impact started. Pretty fast! ⚡
- 5xx Error Rate Monitors: Dashboards immediately showed the spike in error responses.
- Latency Metrics: Response times spiked because the proxy was spending CPU cycles on error handling and debugging.
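As a sketch of the second item on that list, a 5xx-rate monitor really boils down to something like this (the window and threshold are made up):

```rust
// Illustrative 5xx-rate alert over a sliding window of recent responses.
fn error_rate(status_codes: &[u16]) -> f64 {
    let errors = status_codes.iter().filter(|&&s| s >= 500).count();
    errors as f64 / status_codes.len() as f64
}

fn main() {
    let window = [200, 200, 500, 503, 200, 500, 502, 500]; // last N responses seen
    let rate = error_rate(&window);
    let threshold = 0.05; // alert if more than 5% of responses are 5xx
    if rate > threshold {
        println!("ALERT: 5xx rate {:.0}% exceeds {:.0}%", rate * 100.0, threshold * 100.0);
    }
}
```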
🔍 The Investigation Challenges
Even with good monitoring, finding the root cause was hard. Here's why (this is where it gets interesting!):
1. Symptoms Didn't Point to the Cause
The visible symptoms were:
- Workers KV returning errors
- Access authentication failing
- Dashboard login broken
- General HTTP 5xx errors
None of these immediately screamed "Bot Management feature file!" The actual cause was several layers below the symptoms. Talk about a needle in a haystack! 🔍
2. The Intermittent Nature
The system would recover, then fail again. This pattern matched what you'd expect from:
- A sophisticated DDoS attack (probing before the main assault)
- A race condition in code
- Network issues that come and go
It didn't match what you'd expect from a configuration problem (which usually causes persistent failures).
3. The Status Page Coincidence
Cloudflare's status page (hosted completely separately, not on Cloudflare) went down at the same time. This was a complete coincidence, but it made the team think an attacker was targeting both their infrastructure AND their communication channel. Can you imagine the panic? 😱
When you're in incident response mode, your brain looks for patterns. Unrelated events can seem connected. Always verify assumptions. In this case, the status page issue was completely unrelated but wasted valuable investigation time.
4. Wrong Initial Hypothesis
The team initially suspected a DDoS attack because:
- Cloudflare had recently defended against record-breaking attacks
- The intermittent nature matched attack patterns
- The status page going down reinforced this theory
How They Finally Found It
Around 13:37 UTC, an engineer looking at the FL2 proxy logs noticed the panic message:
thread fl2_worker_thread panicked: called Result::unwrap()
on an Err value: TooManyFeatures
This was the key. "TooManyFeatures" pointed directly to the Bot Management module. From there:
- They examined the current feature file
- They saw the duplicate entries
- They traced the feature file generation to the ClickHouse query
- They found the query didn't filter by database
- They checked recent ClickHouse changes — found the permission change at 11:05
The error message contained the answer: TooManyFeatures. Good error messages are invaluable. When writing code, invest in descriptive, specific error messages. Future you (or your on-call colleague at 3 AM) will thank you.
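In Rust terms, that advice can be as simple as putting the numbers that matter into the error type itself (a hypothetical error, not Cloudflare's), so the log line already contains the diagnosis:

```rust
use std::fmt;

// Hypothetical error that carries the values an on-call engineer needs at 3 AM.
#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, max: usize },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::TooManyFeatures { got, max } => {
                write!(f, "feature file has {got} features, limit is {max}")
            }
        }
    }
}

fn main() {
    // Hypothetical numbers: a duplicated file that blew past the 200-feature limit.
    let err = ConfigError::TooManyFeatures { got: 212, max: 200 };
    println!("failed to load feature config: {err}");
}
```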
How They Fixed It
Once the root cause was identified, the fix was conceptually simple but operationally challenging. Here's the playbook they followed.
🛑 Step 1: Stop the Bleeding (14:24 UTC)
First priority: stop making things worse. (The commands below are illustrative pseudo-commands, not Cloudflare's actual tooling.)
# 1. Stop automatic feature file generation
# This prevents new bad files from being created
$ kill-feature-file-job
# 2. Block propagation of feature files
# Even if a file is generated, don't push it
$ block-feature-file-distribution
This stabilized the situation — no new bad files would be created or distributed.
📦 Step 2: Restore Known Good State (14:24-14:30 UTC)
With the bleeding stopped, they needed to restore service:
- Find the last good file: Look at feature file history, find the last one generated before 11:20 UTC (before the permission change took effect)
- Validate the file: Check it has ~60 features, not 120+
- Manually inject into distribution: Push this file into the distribution queue
- Force restart: Restart the FL2 proxy across all servers to pick up the new file
# Find last good feature file (paths and commands are illustrative)
$ ls -la /var/cloudflare/feature-files/
feature-file-2025-11-18-11-15.json # ← Before the change, should be good
feature-file-2025-11-18-11-20.json # ← Bad, has duplicates
feature-file-2025-11-18-11-25.json # ← Might be good (hit old node)
feature-file-2025-11-18-11-30.json # ← Bad again
# Verify the good file
$ cat feature-file-2025-11-18-11-15.json | jq '.features | length'
62 # Good! Under 200 ✅
# Inject into distribution
$ inject-feature-file feature-file-2025-11-18-11-15.json --force --global
# Force proxy restart globally
$ restart-fl2-proxy --all-regions
🔧 Step 3: Handle the Cascade (14:30 - 17:06 UTC)
Restoring the core proxy wasn't enough. Other services had entered bad states:
The Dashboard Login Storm
While the system was down, millions of users kept trying to log in. When services recovered, all those retry attempts hit at once — a "thundering herd" problem. This is a classic distributed systems nightmare! 😣
Imagine a popular website's cache expires, and suddenly 10,000 users who were waiting all hit "refresh" at the exact same moment. Your database gets slammed with 10,000 identical requests instead of just one. That's the thundering herd problem! The fix? Add random delays to retries (so not everyone retries at once), or use a "circuit breaker" that temporarily rejects requests to prevent overload.
Solution: Scale up dashboard and login services, add rate limiting, gradually let traffic through.
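Here's a minimal sketch of the "random delay" idea using only the standard library (a crude time-based jitter stands in for a real RNG; in production you'd reach for a proper retry/backoff library):

```rust
use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Crude jitter source so the example stays dependency-free; use a real RNG in practice.
fn jitter_ms(max_ms: u64) -> u64 {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    nanos % (max_ms + 1)
}

// Exponential backoff plus jitter: clients spread out instead of retrying in lockstep.
fn retry_with_jitter<T, E>(mut attempt: impl FnMut() -> Result<T, E>, max_tries: u32) -> Result<T, E> {
    let mut delay_ms = 100;
    let mut tries = 0;
    loop {
        tries += 1;
        match attempt() {
            Ok(v) => return Ok(v),
            Err(e) if tries >= max_tries => return Err(e),
            Err(_) => {
                let wait = delay_ms + jitter_ms(delay_ms);
                println!("try {tries} failed, retrying in {wait} ms");
                sleep(Duration::from_millis(wait));
                delay_ms *= 2; // 100ms, 200ms, 400ms, ...
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    let result = retry_with_jitter(
        || {
            calls += 1;
            if calls < 3 { Err("503 Service Unavailable") } else { Ok("logged in") }
        },
        5,
    );
    println!("{result:?}"); // Ok("logged in") after two backed-off retries
}
```

A circuit breaker is the complementary tool: instead of delaying retries on the client, the server (or a proxy in front of it) temporarily rejects requests outright until the backend has had time to recover.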
Turnstile (CAPTCHA) Recovery
Cloudflare Turnstile (their CAPTCHA alternative) was down, which meant new logins to the dashboard were impossible. Even after the proxy recovered, Turnstile needed separate attention.
CPU Exhaustion from Error Logging
Here's an interesting side effect: Cloudflare's debugging systems automatically enhance errors with extra context. With millions of errors happening, this consumed massive CPU resources, further slowing recovery. Ironic, right? The very thing meant to help debug was making things worse! 🤦♂️
Heavy error logging, stack trace collection, and debugging information are great for diagnosing issues. But during a major outage, they can consume resources you desperately need for recovery. Consider having "emergency mode" logging that's more minimal.
🔒 Step 4: Permanent Fix
After immediate recovery, they needed to fix the underlying bugs:
Fix 1: The SQL Query
-- Before (the bug):
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

-- After (the fix):
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
AND database = 'default' -- ← Added this filter! That's it! 🎉
ORDER BY name;
One line of code. That's the difference between the internet working and the internet breaking. 🤯
Fix 2: Better Error Handling in FL2
// Before: crash on any error
let config = load_feature_config(features).unwrap(); // Crash on error

// After: handle the error and keep serving traffic
let config = match load_feature_config(features) {
    Ok(c) => c,
    Err(e) => {
        // Log the error
        error!("Failed to load feature config: {:?}", e);
        // Fall back to previous good config
        get_previous_config()
    }
};
The Full Impact
Let's look at the full scope of what was affected during this incident. Spoiler: it was massive! 💥
Services Directly Impacted
| Service | Impact |
|---|---|
| Core CDN & Security | HTTP 5xx errors for customer sites |
| Turnstile (CAPTCHA) | Failed to load entirely |
| Workers KV | Elevated 5xx errors |
| Dashboard | Users couldn't log in |
| Cloudflare Access | Authentication failures |
| Email Security | Reduced spam detection, some Auto Move failures |
Major Websites/Apps Affected
Beyond the headline names listed earlier (X, ChatGPT, Spotify, Discord), the outage also affected: Canva, Figma, Claude AI, 1Password, Trello, Medium, Postman, League of Legends, Valorant, various crypto platforms, and ironically... DownDetector (the site people use to check if other sites are down).
Financial Impact
- Cloudflare Stock (NET): Dropped 3.5% in pre-market trading
- Customer Revenue Loss: Potentially millions across all affected sites (e-commerce transactions failed, ads didn't load, subscriptions couldn't be processed)
- Crypto Markets: Multiple exchanges and DeFi platforms went offline, potentially affecting trades
Why Some Services Were Fine
Interestingly, not everything went down:
- OpenAI API: Continued working (different infrastructure path than ChatGPT login)
- Many mobile apps: Native mobile apps often bypass the CDN layer entirely
- Sites with multi-CDN: Companies using multiple CDN providers could fail over to alternatives
This outage showed why multi-CDN architecture is increasingly important. Companies that had a backup CDN configured (like Fastly, Akamai, or AWS CloudFront) could switch traffic and minimize impact. Single points of failure are dangerous at internet scale.
Lessons for Building Systems at Scale
This incident is a goldmine of lessons for anyone building or operating large-scale systems. Let's extract the wisdom.
🔐 Lesson 1: Never Trust Internal Data
Cloudflare's code trusted its own configuration files completely. The assumption: "We generate this file ourselves, so it will always be valid." Makes sense, right? 🤔
Reality: Internal systems can produce invalid data due to bugs, race conditions, database issues, or (as in this case) unexpected side effects of other changes. Never assume!
Validate ALL inputs, even those from internal systems. Apply the same defensive programming to internal data that you would to user input.
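A sketch of what that can look like for a feature file (the checks are hypothetical, not Cloudflare's actual validation):

```rust
use std::collections::HashSet;

// Hypothetical validation applied even though the file is generated internally.
fn validate_features(features: &[String], max: usize) -> Result<(), String> {
    if features.is_empty() {
        return Err("feature list is empty".into());
    }
    if features.len() > max {
        return Err(format!("{} features exceeds the limit of {max}", features.len()));
    }
    let mut seen = HashSet::new();
    for f in features {
        if !seen.insert(f) {
            // A duplicate check like this would have flagged the doubled columns.
            return Err(format!("duplicate feature: {f}"));
        }
    }
    Ok(())
}

fn main() {
    let features = vec!["user_agent_entropy".to_string(), "user_agent_entropy".to_string()];
    println!("{:?}", validate_features(&features, 200)); // Err("duplicate feature: ...")
}
```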
📉 Lesson 2: Graceful Degradation Over Hard Crashes
When the feature file was invalid, the FL2 proxy crashed with a panic. A better approach:
- Log the error for investigation
- Fall back to the previous known-good configuration
- Alert operators while continuing to serve traffic
- Rate-limit the fallback to prevent cascading issues
Serving traffic with a slightly stale configuration is almost always better than not serving traffic at all.
🔄 Lesson 3: Database Changes Are Infrastructure Changes
The permission change seemed like a small, safe improvement. But it had unexpected downstream effects. Database changes can affect:
- Query results (as seen here)
- Query performance (indexes, query plans)
- Application behavior (assumptions about data format)
Use the same rigor: staged rollouts, feature flags, monitoring, and the ability to quickly roll back. Test in production-like environments with realistic queries.
🎭 Lesson 4: Intermittent Failures Are The Hardest
If the system had stayed down, they might have found the cause faster. The intermittent nature led the team to wrong conclusions (DDoS attack, race condition, network issues). This is the worst kind of bug to debug! 😵
Strategy for intermittent issues:
- Focus on what's DIFFERENT between success and failure cases
- Check for gradual rollouts of any kind (feature flags, database changes, code deploys)
- Look at timing — does it correlate with scheduled jobs?
- Check if different servers/regions behave differently
🚨 Lesson 5: Design Kill Switches
Cloudflare is now implementing more "global kill switches" — ways to instantly disable features that might be causing problems. When building systems:
- Every feature should be independently disableable
- Kill switches should work even when the main system is failing
- Practice using them (chaos engineering)
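At its simplest, a kill switch is a flag checked on every request that can be flipped from outside the failing code path. A minimal sketch (the atomic boolean stands in for whatever dynamic config store you actually use):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Global kill switch for a (hypothetical) bot-scoring feature.
static BOT_SCORING_ENABLED: AtomicBool = AtomicBool::new(true);

fn handle_request(path: &str) -> String {
    if BOT_SCORING_ENABLED.load(Ordering::Relaxed) {
        format!("{path}: scored by bot management")
    } else {
        // Fail open: skip the feature rather than failing the whole request.
        format!("{path}: served without bot scoring")
    }
}

fn main() {
    println!("{}", handle_request("/checkout"));
    // An operator (or automation) flips the switch when the feature misbehaves.
    BOT_SCORING_ENABLED.store(false, Ordering::Relaxed);
    println!("{}", handle_request("/checkout"));
}
```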
📊 Lesson 6: Your Observability Can Hurt You
During the incident, Cloudflare's debugging systems (collecting stack traces, enhancing errors with context) consumed significant CPU. In a crisis, resources are precious.
Consider:
- Emergency logging modes with reduced verbosity
- Sampling during high-error-rate periods
- Async logging that doesn't block the main process
- Resource limits on debugging/tracing systems
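A sketch of the sampling idea (all numbers invented): log every error while things are quiet, but only one in N once errors start piling up, so your logging pipeline doesn't compete with recovery for CPU:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static ERROR_COUNT: AtomicU64 = AtomicU64::new(0);

// Emergency-mode logging: past a threshold, keep only a sample of error logs.
fn log_error(msg: &str) {
    let n = ERROR_COUNT.fetch_add(1, Ordering::Relaxed);
    let emergency = n > 10;       // made-up threshold for "we're in an incident"
    if !emergency || n % 5 == 0 { // in emergency mode, keep 1 in 5 errors
        eprintln!("error #{n}: {msg}");
    }
}

fn main() {
    for i in 0..30 {
        log_error(&format!("feature config failed (request {i})"));
    }
}
```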
🔗 Lesson 7: Understand Your Dependencies
Many developers debugging their apps during the outage wasted time because they didn't immediately recognize it as a Cloudflare issue. Understanding your dependency chain (browser → DNS → CDN/proxy → your servers → the third-party APIs you call) is crucial.
When something breaks, know which layer to investigate first.
🌐 Lesson 8: The Internet Is More Fragile Than It Looks
The internet appears decentralized, but in reality, a few key providers handle massive portions of traffic:
- CDNs: Cloudflare, Akamai, Fastly, AWS CloudFront
- Cloud Providers: AWS, Azure, GCP
- DNS: Cloudflare, AWS Route 53, Google
A failure in any of these can cascade globally. As a system designer, always consider: what if this dependency fails?
What Cloudflare Is Doing to Prevent This
To their credit, Cloudflare has been transparent about this incident and committed to specific improvements. Here's their action plan:
Immediate Actions
- Hardening Configuration File Validation: Treating internally-generated files with the same validation rigor as user input
- More Global Kill Switches: Ability to instantly disable any feature that might be causing issues
- Resource Limits on Error Handling: Preventing debugging systems from consuming excessive resources during incidents
- Review of All Error Paths: Auditing every module in the core proxy for similar issues
Longer-Term Improvements
- Graceful Degradation by Default: Modules should fail open (continue working with reduced functionality) rather than fail closed (crash)
- Better Testing of Permission Changes: Simulating downstream effects of database changes before production rollout
- Enhanced Canary Deployments: Testing changes on a small subset of traffic before global rollout
- Improved Incident Detection: Faster correlation between symptoms and root causes
Cloudflare's worst outage since 2019 was caused by a one-line bug in a SQL query that had been working correctly for years. The conditions for failure were created by an unrelated change (the permission improvement). Complex systems fail in complex ways. You can't prevent all failures, but you can build systems that fail gracefully, detect issues quickly, and recover fast.
What You Can Do In Your Systems
Regardless of your scale, apply these principles:
- Validate everything: Don't trust any input, even from your own systems
- Fail gracefully: When something goes wrong, degrade rather than crash
- Build kill switches: Have the ability to disable any feature instantly
- Monitor dependencies: Know when external services you depend on are having issues
- Test failure modes: Don't just test the happy path; test what happens when things break
- Have a runbook: Document how to diagnose and recover from common failures
- Practice incident response: Run game days where you simulate failures
- Consider multi-provider: For critical dependencies, have a backup
Final Thoughts
So let's recap what happened on November 18, 2025: A routine database permission change exposed a missing WHERE clause in a SQL query. That query generated a file that was too large. That file crashed a proxy. That proxy was handling 20% of the internet's traffic. Wild, right? 🤯
The chain reaction took about 15 minutes to start and nearly 6 hours to fully resolve. Billions of users were affected. Stock prices dropped. Engineers around the world wasted hours debugging their own systems thinking they were at fault. 😤
And yet, in some ways, the system worked! Automated monitoring detected the issue within 11 minutes. Engineers responded quickly. The post-mortem was thorough and public. Improvements are being made.
As you build and operate systems, remember: complexity is the enemy of reliability. Every dependency is a potential failure point. Every assumption is a potential bug. The goal isn't to never fail — it's to fail well, recover fast, and learn continuously.
Welcome to the world of systems at scale. It's messy, it's humbling, and it's endlessly fascinating. 🔥
This incident is a case study in distributed systems, database management, incident response, and engineering culture. Study other post-mortems (Google, AWS, GitHub all publish them). Build things. Break things. Learn from every failure. That's how we all get better! 💪
References & Sources
This deep dive was compiled from the following sources:
🔴 Official Cloudflare Communications
- Cloudflare Blog — Official Post-Mortem: "18 November 2025 Outage" (blog.cloudflare.com/18-november-2025-outage)
- Cloudflare Status Page — Incident Timeline (cloudflarestatus.com)
📰 News Coverage
- Ars Technica — "Cloudflare outage takes down Discord, ChatGPT, Notion, and many more" (arstechnica.com)
- TechCrunch — "Major Cloudflare outage impacts Discord, Notion, and ChatGPT" (techcrunch.com)
- BleepingComputer — "Cloudflare outage causes major Internet disruption" (bleepingcomputer.com)
- Forbes — "Cloudflare Outage Takes Out Major Websites" (forbes.com/sites/technology)
- The Verge — "Cloudflare outage takes down Spotify, Discord, and more" (theverge.com)
📊 Status & Monitoring
- DownDetector — Real-time outage reports (downdetector.com)
- Discord Status — Service status page (discordstatus.com)
- OpenAI Status — ChatGPT service status (status.openai.com)
🔧 Technical Background
- Cloudflare — "How Cloudflare's Bot Management Works" (cloudflare.com/products/bot-management)
- ClickHouse Documentation — Distributed queries and database schemas (clickhouse.com/docs)
- Rust Documentation — Error handling with Result and unwrap() (doc.rust-lang.org)
📈 Market Impact
- Yahoo Finance — Cloudflare (NET) stock movement (finance.yahoo.com/quote/NET)
- MarketWatch — Pre-market trading data, November 18, 2025 (marketwatch.com)
🎓 Related Learning Resources
- Google SRE Book — Site Reliability Engineering (sre.google/books)
- AWS Post-Mortems — Amazon Web Services incident reports (aws.amazon.com/premiumsupport/technology/pes)
- GitHub Engineering Blog — Incident analyses (github.blog/category/engineering)
- Cloudflare Engineering Blog — Technical deep dives (blog.cloudflare.com/tag/engineering)
🌐 Internet Infrastructure Context
- W3Techs — Web technology usage statistics (w3techs.com)
- Netcraft — Web server surveys (netcraft.com)