Cloudflare Explains Plan for Resilience After Multiple Outages

Following multiple recent outages, Cloudflare has outlined its plan to improve the resilience of its network, titled “Code Orange: Fail Small.”

Here, “Code Orange” is an internal designation meaning the work takes priority over all other work. Cloudflare has only declared a Code Orange once before; the name is borrowed from Google, which reportedly declares a “Code Yellow” or “Code Red” for an existential threat to its business.

The first outage occurred on November 18, and the second followed just a few weeks later on December 5. Cloudflare is used by about 20% of all websites, according to w3techs.com, so these outages made a huge portion of the web inaccessible.

Cloudflare attributes both incidents to making global changes instantaneously:

Both outages followed a similar pattern. In the moments leading up to each incident we instantaneously deployed a configuration change in our data centers in hundreds of cities around the world.

They plan to focus on three main areas:

  • Requiring “controlled rollouts” for any configuration change that they plan to propagate across the whole network. This would reduce the chances of a huge portion of the network getting taken down by a single bad change.
  • Reviewing the failure modes of network infrastructure so that components fail in a predictable and expected way. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests (see the sketch after this list). Some services will likely give the customer the option to fail open or closed in certain scenarios. Drift-prevention capabilities will ensure this behavior stays enforced continuously.
  • Avoiding circular dependencies in “break glass” procedures so that customers and Cloudflare can act quickly in the face of failures.
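The “default to a known-good state” behavior in the second point can be pictured with a small sketch in Go. The types, field names, and feature cap below are hypothetical illustrations, not Cloudflare’s actual code: a loader validates an incoming configuration and, on any error, logs it and keeps the last configuration that passed validation active, so traffic keeps flowing.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// BotConfig is a hypothetical configuration blob; the field and the cap
// stand in for real feature limits.
type BotConfig struct {
	Features []string
}

const maxFeatures = 200 // illustrative feature cap

func validate(c *BotConfig) error {
	if c == nil {
		return errors.New("config is nil or failed to parse")
	}
	if len(c.Features) > maxFeatures {
		return fmt.Errorf("feature count %d exceeds cap %d", len(c.Features), maxFeatures)
	}
	return nil
}

// applyConfig never hard-fails the service: a bad config is logged and the
// last known-good config stays active.
func applyConfig(current, incoming *BotConfig) *BotConfig {
	if err := validate(incoming); err != nil {
		log.Printf("rejecting config update, keeping known-good config: %v", err)
		return current
	}
	return incoming
}

func main() {
	knownGood := &BotConfig{Features: []string{"f1", "f2"}}

	// A corrupt or oversized update is rejected instead of crashing the proxy.
	tooBig := &BotConfig{Features: make([]string, maxFeatures+1)}
	active := applyConfig(knownGood, tooBig)
	fmt.Println("active feature count:", len(active.Features)) // still 2
}
```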

Cloudflare had already been following a controlled rollout strategy for its software releases, so this change falls in line with existing practice. The company plans to introduce a new tool, the Health Mediated Deployment (HMD) system: every team responsible for a service will define what indicates success or failure in a rollout, along with a rollback procedure that triggers automatically if the rollout cannot proceed safely.
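Cloudflare hasn’t published HMD’s internals, but the behavior described above maps onto a staged rollout loop along these lines; the names, stages, and health signal here are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"log"
)

// ServiceRollout is a hypothetical stand-in for what a team would register
// with a health-mediated deployment system: a health signal and a rollback.
type ServiceRollout struct {
	Name     string
	Stages   []int              // percentage of the network per stage
	Healthy  func(pct int) bool // team-defined success criterion
	Rollback func()             // runs automatically on failure
}

// Run deploys stage by stage and aborts (rolling back) at the first
// unhealthy stage, so a bad change never reaches the whole network.
func (r *ServiceRollout) Run() error {
	for _, pct := range r.Stages {
		log.Printf("%s: deploying to %d%% of the network", r.Name, pct)
		if !r.Healthy(pct) {
			log.Printf("%s: health check failed at %d%%, rolling back", r.Name, pct)
			r.Rollback()
			return fmt.Errorf("rollout of %s aborted at %d%%", r.Name, pct)
		}
	}
	return nil
}

func main() {
	r := &ServiceRollout{
		Name:   "bot-management-config",
		Stages: []int{1, 5, 25, 100},
		// Pretend the change breaks once it goes past a small canary.
		Healthy:  func(pct int) bool { return pct <= 5 },
		Rollback: func() { log.Println("restored previous release") },
	}
	if err := r.Run(); err != nil {
		fmt.Println(err)
	}
}
```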

To improve failure modes, Cloudflare is reviewing every interface between critical components, assuming that failures will occur across them, and handling each failure in the “most reasonable way possible.” This prevents an unaccounted-for error at one of these interfaces from spreading and causing other infrastructure to fail in ways that are hard to diagnose. They are also moving toward a “fail-open” strategy in which the system won’t hard-fail on configuration issues, instead logging the error and reverting to a known-good state.
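A minimal sketch of that “assume the interface can fail” posture, with hypothetical names: the caller gives a dependency a bounded time to answer and, if it errors or times out, logs the failure and degrades to a neutral default instead of propagating a hard failure downstream.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// scoreRequest is a hypothetical call across a component boundary, e.g. a
// proxy asking a scoring service for a bot score.
func scoreRequest(ctx context.Context) (int, error) {
	select {
	case <-time.After(200 * time.Millisecond): // dependency is too slow
		return 0, errors.New("scoring service unavailable")
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

// handle treats any failure of the scoring interface as a degraded but
// survivable condition: log it and serve the request without a score.
func handle() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	score, err := scoreRequest(ctx)
	if err != nil {
		log.Printf("scoring failed, passing traffic unscored: %v", err)
		score = -1 // sentinel meaning "no score available"
	}
	fmt.Println("request served, bot score:", score)
}

func main() {
	handle()
}
```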

Finally, they will be eliminating circular dependencies from their systems. As an example, Cloudflare’s own dashboard login uses Turnstile, so while visitors couldn’t reach customers’ websites because Turnstile was broken, site owners also couldn’t log in to the dashboard to make critical changes.
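One way such loops could be kept out of “break glass” paths (a sketch, not Cloudflare’s actual tooling) is to declare service dependencies as data and flag any service that can ultimately reach itself:

```go
package main

import "fmt"

// deps is a hypothetical declared dependency graph: the dashboard login
// depends on Turnstile, and Turnstile in turn depends on a dashboard API,
// forming the kind of loop that blocked recovery.
var deps = map[string][]string{
	"dashboard-login": {"turnstile"},
	"turnstile":       {"dashboard-api"},
	"dashboard-api":   {"dashboard-login"},
}

// reachable reports whether target can be reached from start by following
// declared dependencies (simple depth-first search).
func reachable(start, target string, seen map[string]bool) bool {
	if seen[start] {
		return false
	}
	seen[start] = true
	for _, d := range deps[start] {
		if d == target || reachable(d, target, seen) {
			return true
		}
	}
	return false
}

func main() {
	for svc := range deps {
		if reachable(svc, svc, map[string]bool{}) {
			fmt.Printf("circular dependency: %s ultimately depends on itself\n", svc)
		}
	}
}
```

Run at build or review time, a check like this would fail before a recovery path quietly comes to depend on the very service it is meant to fix.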

All of these are great changes; however, the outages show how devastating it can be to put so much of the web behind a centralized single point of failure. With self-hosted options like Anubis available, it might be worth moving away from services like Cloudflare.
