Cloudflare Outage Due to Phishing URL Block Error

In a significant operational mishap, Cloudflare’s R2 object storage platform suffered a major outage yesterday, disrupting multiple dependent services for nearly an hour. The incident underscores how heavily large-scale platforms depend on robust operational safeguards and operator training.

What Happened?

Cloudflare R2, an object storage service akin to Amazon S3, is designed for scalable, durable, and cost-effective data storage. It boasts features such as cost-free data retrieval, S3 compatibility, and seamless integration with Cloudflare services. However, the platform faced a severe disruption when an employee mistakenly disabled the entire R2 Gateway service while addressing an abuse report concerning a phishing URL.
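For readers unfamiliar with what “S3 compatibility” means in practice, the minimal sketch below shows R2 being accessed through a standard S3 client (boto3). The account ID, bucket name, and credentials are placeholders, and the endpoint form shown is the one Cloudflare documents for R2’s S3-compatible API.

```python
# Minimal sketch: reading and writing an object on R2 through its
# S3-compatible API with boto3. Account ID, bucket name, and keys
# below are placeholders, not real values.
import boto3

ACCOUNT_ID = "<your-account-id>"        # placeholder
ACCESS_KEY_ID = "<r2-access-key-id>"    # placeholder
SECRET_ACCESS_KEY = "<r2-secret-key>"   # placeholder

s3 = boto3.client(
    "s3",
    endpoint_url=f"https://{ACCOUNT_ID}.r2.cloudflarestorage.com",
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
    region_name="auto",  # R2 uses a single "auto" region
)

# Upload and retrieve an object exactly as you would against Amazon S3.
s3.put_object(Bucket="example-bucket", Key="hello.txt", Body=b"hello from R2")
obj = s3.get_object(Bucket="example-bucket", Key="hello.txt")
print(obj["Body"].read())
```

Because the calls are standard S3 operations, existing S3 tooling can usually be pointed at R2 simply by swapping the endpoint and credentials.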


According to Cloudflare’s post-mortem analysis, the error occurred during routine abuse remediation. Instead of blocking the specific phishing endpoint, the employee inadvertently turned off the entire R2 Gateway service. “This was a failure of multiple system-level controls and operator training,” Cloudflare stated.

Impact of the Outage

The outage lasted for 59 minutes, from 08:10 to 09:09 UTC, and had a cascading effect on various services:

  • Stream: 100% failure in video uploads and streaming delivery.
  • Images: 100% failure in image uploads and downloads.
  • Cache Reserve: 100% failure in operations, leading to increased origin requests.
  • Vectorize: 75% failure in queries, with 100% failure in insert, upsert, and delete operations.
  • Log Delivery: Significant delays and data loss, with up to 13.6% loss for R2-related logs and 4.5% for non-R2 jobs.
  • Key Transparency Auditor: 100% failure in signature publishing and read operations.

Additionally, several indirectly affected services experienced partial failures. For instance, Durable Objects saw a 0.09% increase in error rates due to reconnections, while Cache Purge faced a 1.8% rise in errors and a tenfold latency spike. Workers & Pages also reported a 0.002% failure rate in deployments, impacting projects linked to R2.

Root Causes and Immediate Fixes

Cloudflare identified both human error and the lack of safeguards, such as validation checks for high-impact actions, as key contributors to the incident. In response, the company has implemented immediate corrective measures, including:

  • Removing the ability to disable systems from the abuse review interface.
  • Restricting access in the Admin API to prevent service shutdowns in internal accounts.
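Cloudflare has not published the internals of its abuse-review tooling, but the kind of safeguard that was missing can be illustrated with a hypothetical validation check that refuses any remediation action broader than a single URL. Every name in this sketch (BlockRequest, validate_block, the scope values) is invented for illustration and does not describe Cloudflare’s actual API.

```python
# Hypothetical illustration of a validation check for high-impact actions;
# none of these names correspond to Cloudflare's internal tooling.
from dataclasses import dataclass

@dataclass
class BlockRequest:
    target: str   # e.g. a specific phishing URL
    scope: str    # "url" | "bucket" | "service"

# Abuse remediation may only block individual URLs in this sketch.
ALLOWED_SCOPES = {"url"}

def validate_block(request: BlockRequest) -> None:
    """Reject any action broader than a single URL before it is applied."""
    if request.scope not in ALLOWED_SCOPES:
        raise PermissionError(
            f"Scope '{request.scope}' exceeds what the abuse review "
            "interface is allowed to do; escalate to a service owner."
        )

# A request to disable an entire service is rejected outright:
try:
    validate_block(BlockRequest(target="r2-gateway", scope="service"))
except PermissionError as err:
    print(f"Blocked: {err}")
```

The point of such a check is that the blast radius of a mistake is capped by the tool itself, rather than by the operator’s attention.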

Future Safeguards

To prevent similar incidents in the future, Cloudflare plans to enhance its operational protocols with additional measures, including:

  • Improved account provisioning processes.
  • Stricter access controls.
  • A two-party approval system for high-risk actions (illustrated in the sketch below).
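As a rough illustration of the last item, the hypothetical sketch below enforces that a high-risk action cannot proceed on the say-so of the person who requested it; a second operator must sign off first. The class and method names are invented for this example and do not describe Cloudflare’s internal systems.

```python
# Hypothetical two-party approval gate for high-risk actions; the names
# here are illustrative only.
from dataclasses import dataclass, field

@dataclass
class HighRiskAction:
    description: str
    requested_by: str
    approvals: set = field(default_factory=set)  # names of approving operators

    def approve(self, reviewer: str) -> None:
        if reviewer == self.requested_by:
            raise PermissionError("Requester cannot approve their own action.")
        self.approvals.add(reviewer)

    def execute(self) -> None:
        if not self.approvals:
            raise PermissionError("A second operator must approve first.")
        print(f"Executing: {self.description}")

action = HighRiskAction("Disable R2 Gateway for account X", requested_by="alice")
# action.execute()      # would raise: no second approver yet
action.approve("bob")   # a different operator signs off
action.execute()        # now allowed to proceed
```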
