Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

On Cloudflare’s 8th birthday in 2017, we announced free unmetered DDoS Protection as part of all of our plans, regardless of whether you’re an independent blogger using WordPress on Cloudflare's Free plan or part of a large enterprise operating global network infrastructure. Our DDoS protection covers attack vectors on Layers 3-7, whether highly distributed and volumetric (rate-intensive) or small and sneaky. We protect over 26 million Internet properties, and at this scale, identifying small and sneaky DDoS attacks can be challenging, especially at L7. In this post, we discuss this challenge along with trends that we’ve seen, interesting DDoS attacks, and how we’ve responded to them so that you don’t have to worry.

Let’s Talk Trends

When analyzing attacks on the Cloudflare network, we’ve seen a steady decline in the proportion of L3/L4 DDoS attacks that exceed a rate of 30 Gbps in recent months. From September 2019 to March 2020, attacks peaking over 30 Gbps decreased by 82%, and in March 2020, more than 95% of all network-layer DDoS attacks peaked below 30 Gbps. Over the same time period, the average size of a DDoS attack has also steadily decreased by 53%, to just 11.88 Gbps. Yet very large attacks have not disappeared: we’re still seeing attacks with intensive rates peaking at 330 Gbps on average and up to 400 million packets per second. Some of our customers are being targeted with as many as 890 DDoS attacks in a single day and 1,750 DDoS attacks in a month.

As the average rate of these L3/L4 attacks has decreased, they have become more localized and less geographically distributed. Increasingly, we’re seeing attacks hit just one or two of our data centers. This means that these hyper-localized attacks were launched from within the catchment of those data centers; otherwise, our Anycast network would have spread the attack traffic across our global fleet of data centers. Counterintuitively, these hyper-localized floods can be more difficult to detect on a global scale, because the attack samples get diluted when they are aggregated from all of our data centers in the core. We’ve therefore had to change our tactics and systems to roll with the change in attacker behavior.
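To make the dilution effect concrete, here is a minimal sketch with made-up numbers. The data center count, the baseline packet rate, the size of the flood, and the simple threshold check are all assumptions chosen for illustration; they are not Cloudflare's actual values or detection logic. The sketch applies the same check once to globally aggregated traffic and once per data center.

```python
# Hypothetical figures, chosen only to illustrate the dilution effect.
BASELINE_PPS = 1_000_000     # assumed normal packets per second per data center
NUM_DATA_CENTERS = 200       # assumed fleet size
THRESHOLD_FACTOR = 1.5       # flag traffic above 1.5x the expected baseline


def is_anomalous(observed_pps: float, expected_pps: float) -> bool:
    """Return True when observed traffic exceeds the threshold over baseline."""
    return observed_pps > THRESHOLD_FACTOR * expected_pps


# A localized flood adds 2,000,000 pps to a single data center.
traffic = [BASELINE_PPS] * NUM_DATA_CENTERS
traffic[0] += 2_000_000

# Global view: 202M pps observed against 200M pps expected, a 1% increase.
global_flag = is_anomalous(sum(traffic), BASELINE_PPS * NUM_DATA_CENTERS)

# Local view: each data center is compared against its own baseline.
local_flags = [is_anomalous(pps, BASELINE_PPS) for pps in traffic]

print("flagged globally:", global_flag)
print("data centers flagged locally:", sum(local_flags))
```

In this toy setup the flood triples the traffic at one data center but moves the global total by only 1%, so the aggregated check stays quiet while the per-data-center check fires.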

Keeping things interesting on the penthouse floor of the OSI Model, over the same time period we’ve also observed some of the most rate-intensive and highly distributed L7 HTTP DDoS attacks we’ve ever seen. These attacks have pushed our engineering teams to invent even more efficient and intelligent ways to defend our network and our customers at scale. Let’s take a look at some of these trends and attacks.

Changing L3/L4 DDoS Attack Trends

Centrally Analyzed, Edge Enforced DDoS Mitigations

Before we released dosd late last year, the primary automated system responsible for protecting Cloudflare and our customers against distributed rate-intensive attacks was Gatebot. Gatebot works by ingesting samples of flow data from routers and samples of HTTP requests from servers. It then analyzes these samples for anomalies, and when attacks are detected, pushes mitigation instructions automatically to the edge.

Gatebot requires a lot of computational power to analyze these samples, and correlate them across all the data centers, so it runs centrally in our “core” data centers, rather than at the edge. It does a terrific job at mitigating large attacks, and on average stops over 4,000 L3/L4 DDoS attacks every month.

Edge Analyzed, Edge Enforced Mitigations

The persistent increase we’ve observed in smaller, more localized attacks was one of the main factors that drove us to develop a new, complementary system to Gatebot. We call this new system our denial of service daemon, or “dosd”, and this past month alone it mitigated 281,746 L3/4 DDoS attacks. This figure is roughly 55 times greater than what Gatebot dropped over the same period, thanks to dosd’s ability to detect smaller network attacks that would previously have flown under the radar (or taken longer to mitigate).

To complement the computationally heavy, centralized deployments of Gatebot, dosd was architected as a decentralized system that runs on every single server in every one of our data centers. Each instance detects and mitigates attacks independently of the other instances and of any centralized data center. As a result, the system is much faster than Gatebot, and can detect and mitigate attacks within 0-3 seconds (and less than 10 seconds on average). The speed of dosd enables it to generate real-time rules to quickly protect our customers at the data center. Then Gatebot, which samples traffic globally, can determine a mitigation that applies to all data centers if needed. In such a case, Gatebot will push rules to the data centers, and these rules take priority over dosd’s rules.
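The precedence between the two systems can be sketched as follows. This is only an illustration of "Gatebot's rules take priority over dosd's rules"; the `Rule` type, its fields, and the `effective_rule` helper are hypothetical names and do not reflect how Cloudflare actually represents or evaluates mitigation rules.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    source: str   # which system generated the rule: "gatebot" or "dosd"
    target: str   # what the rule protects, e.g. a destination IP (assumed field)
    action: str   # what to do with matching traffic, e.g. "drop" (assumed field)


# Lower number wins. Gatebot's global rules outrank dosd's local ones.
PRIORITY = {"gatebot": 0, "dosd": 1}


def effective_rule(rules: Iterable[Rule], target: str) -> Optional[Rule]:
    """Pick the rule that applies to a target, preferring Gatebot over dosd."""
    candidates = [rule for rule in rules if rule.target == target]
    if not candidates:
        return None
    return min(candidates, key=lambda rule: PRIORITY[rule.source])


rules = [
    Rule(source="dosd", target="198.51.100.10", action="drop"),
    Rule(source="gatebot", target="198.51.100.10", action="rate-limit"),
    Rule(source="dosd", target="198.51.100.20", action="drop"),
]

print(effective_rule(rules, "198.51.100.10"))  # the Gatebot rule
print(effective_rule(rules, "198.51.100.20"))  # only dosd has a rule here
```

The local rule is in place first and keeps protecting the target until a global rule arrives; once it does, the global rule is the one that applies.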

dosd is also a leaner piece of software, consumes less memory and CPU, and significantly improves the resiliency of our network by removing the need to communicate with our core data centers to mitigate attacks. dosd detects and mitigates attacks using a similar logic to Gatebot’s methods, but in the scope of a single server, across a subset of servers in the same data center, or even across the entire data center.

Changing L7 Trends

Our automated Gatebot system is also tasked with mitigating L7 HTTP floods using request attributes as anomaly indicators. Mitigations can come in the form of actions such as JavaScript challenges, CAPTCHAs, Rate Limits (429), or Blocks (403) which are served back to the client as an error or challenge page. This form of mitigation at L7 allows the request to pass through TCP and TLS to the HTTP web server. During very rate-intensive attacks our servers can waste a lot of CPU and bandwidth as seen in the attack examples below.

Example #1 - Highly Distributed DDoS Attack Targeting A Customer Website

In July 2019, Cloudflare mitigated an HTTP DDoS attack that peaked at 1.4M requests per second. While this isn’t the most rate-intensive attack that we’ve seen, what is interesting is that the attack originated from almost 1.1M unique IP addresses. These were actual clients with the ability to complete a TCP and HTTPS handshake, not spoofed IP addresses. As it turns out, responding (rather than dropping at the network level) to over a million clients at a max rate of 1.4M requests per second can be quite costly.

Example #2 - Rate-Intensive DDoS Attack Targeting A Customer Website

The second attack took place in September 2019. We mitigated an HTTP DDoS attack that peaked and persisted just below 5M requests per second for a little over an hour. What’s interesting is the sustained capability of the attacker to reach those rates from only 371K unique IPs (also not spoofed).

These attacks showed us what needed to be optimized, and consequently drove us to improve our L7 mitigations further and significantly reduce the cost of mitigating an attack.

Using IP Jails to Reduce the Cost of Mitigation

With the goal of reducing the computational cost to Cloudflare of mitigating rate-intensive attacks, we recently rolled out a new Gatebot capability called IP Jails. IP Jails excels at efficiently mitigating extremely rate-intensive and distributed HTTP DDoS attacks. It is triggered when an attack exceeds a certain request rate and then pushes the mitigation from the application layer (L7 in the OSI model) to the transport layer (L4). Therefore, instead of responding with an error or challenge page from the proxy, we simply drop the connection for that IP. Mitigating at L4 is more computationally efficient: it reduces our CPU and memory consumption in addition to saving bandwidth. It allows us to keep mitigating the largest of attacks without sacrificing performance.
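The idea can be sketched roughly as below. This is a simplified illustration, not Cloudflare's implementation: the `IPJail` class, its method names, the trigger value, and in particular the fixed jail duration are assumptions made for the example. The point is that once an attack exceeds the trigger rate, the offending IPs are recorded, and later connections from them are refused before any HTTP processing takes place.

```python
from typing import Dict, Iterable


class IPJail:
    """Toy model of moving an L7 mitigation down to a connection-level drop."""

    def __init__(self, trigger_rps: int, jail_seconds: float) -> None:
        self.trigger_rps = trigger_rps      # assumed activation threshold
        self.jail_seconds = jail_seconds    # assumed jail duration
        self._jailed: Dict[str, float] = {}  # ip -> time at which the jail expires

    def report_attack(self, attack_rps: int, offending_ips: Iterable[str],
                      now: float) -> None:
        """Jail the offending IPs, but only for sufficiently rate-intensive attacks."""
        if attack_rps <= self.trigger_rps:
            return  # below the trigger: keep mitigating at L7 as usual
        for ip in offending_ips:
            self._jailed[ip] = now + self.jail_seconds

    def accept_connection(self, ip: str, now: float) -> bool:
        """Decide at connection time, before TLS or HTTP work is done."""
        expiry = self._jailed.get(ip)
        if expiry is None:
            return True
        if now >= expiry:
            del self._jailed[ip]  # jail term is over, let the client back in
            return True
        return False  # drop the connection without sending any response


jail = IPJail(trigger_rps=1_000_000, jail_seconds=60)
jail.report_attack(attack_rps=5_000_000, offending_ips=["203.0.113.7"], now=0)

print(jail.accept_connection("203.0.113.7", now=10))   # jailed client
print(jail.accept_connection("203.0.113.99", now=10))  # unrelated client
```

Because a jailed client never reaches the proxy, no error or challenge page is generated for it, which is where the CPU, memory, and bandwidth savings come from.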

IP Jails in action

In the first graph below, you can see an HTTP flood peaking just below 8M rps before the IPs are ‘jailed’ for misbehaving. In the second graph, you can see that same attack being dropped as packets at L4.

The flood requests generated over 130 Gbps in responses. IP Jails slashed it by a factor of 10.

Similarly, you can see a spike in the attack mitigation CPU usage which then drops back to normal after IP Jails kicks in.

Using Origin Errors to Catch Low-Rate Attacks

We see one or two of these rate-intensive attacks every month. But the vast majority of attacks we observe have a lower request rate and try to sneak under the radar. To tackle these low-rate attacks better, last month we completed the rollout of a new capability that synchronizes Gatebot’s detection sensitivity with our customers’ origin server health. Gatebot uses the origin’s error response codes as an additional adaptive feedback signal.

When we take a step back and think about what a DDoS attack actually is, we usually think of a malicious actor that directs traffic at a specific website or IP address with the intent to degrade performance or cause an outage. However, malicious attackers are not the only threat to your application's availability.

As the migration of functionality to the edge increases, the cloud becomes smarter and more powerful, which often allows administrators to scale down their origin servers and infrastructure leaving the origin server weaker and under-configured. Evidently, there are many cases where an origin was taken down by small floods of traffic that were neither malicious nor generated with bad intentions. These floods may be generated by an overly excited good bot or even faulty client applications calling home too frequently. Fixing a home-sick client application or strengthening a server can be lengthy and costly processes during which the origin remains susceptible. Consequently, if a website is taken offline, no matter the reason, the end-users still experience it as if it were an attack.

This new capability therefore not only protects our customers against DDoS attacks, but also protects the origin against all kinds of unwanted floods. It is designed to protect every one of our customers, big or small, and it's available on all of our plans, including the Free plan.

When an origin responds to Cloudflare with an increasing rate of errors from the 500 range (Internal Server Error), Gatebot initiates automatically and analyzes traffic to reduce or eliminate the impact on the origin even faster than before. The current error rate is also compared to the average error rate to minimize false positives. Once an attack is detected, dynamically generated, ephemeral mitigation rules are propagated to Cloudflare’s edge data centers to mitigate the flood. Mitigation rules may use a block action (403), rate-limit (429), or even a challenge based on the fingerprint logic and confidence.
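A minimal sketch of this kind of trigger is shown below. It is not Gatebot's actual logic: the function names, the minimum error rate, and the multiplier over the baseline are assumptions picked for illustration. It only demonstrates the two ideas from the paragraph above, namely reacting to a rising share of 5xx responses from the origin and comparing that share to the origin's usual error rate to avoid false positives.

```python
from typing import Sequence


def error_rate(status_codes: Sequence[int]) -> float:
    """Fraction of origin responses in the 5xx range."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_analyze(current_codes: Sequence[int], baseline_rate: float,
                   min_rate: float = 0.05, factor: float = 3.0) -> bool:
    """Decide whether the origin looks unhealthy enough to start analyzing traffic.

    min_rate and factor are hypothetical tuning knobs: the current error rate
    must be non-trivial and clearly above the origin's average error rate.
    """
    current = error_rate(current_codes)
    return current >= min_rate and current > factor * baseline_rate


# Origin normally returns about 1% errors, then starts failing 30% of requests.
spike = [200] * 70 + [503] * 30
print(should_analyze(spike, baseline_rate=0.01))

# An origin that always returns about 25% errors is not flagged for 30%.
print(should_analyze(spike, baseline_rate=0.25))
```

Comparing against the baseline means an origin that is chronically error-prone does not keep triggering analysis, while a normally healthy origin that suddenly starts failing does.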

In March 2020, we mitigated 812 HTTP DDoS attacks on average every day, and approximately 20,000 HTTP DDoS attacks in total.

Don’t Take Our Word For It, See For Yourself

Whether it's Gatebot or dosd that mitigated L3/4 DDoS attacks, you can see both types of attack events for yourself in our new Network Analytics dashboard.

Today this dashboard provides Magic Transit & BYOIP customers real-time visibility into L3/4 traffic and DDoS attacks, and in the future we plan to expand access to customers of our other products.

Visibility into L7 DDoS attacks is available to our WAF/CDN customers that have access to the Firewall Analytics dashboard.

Unmetered DDoS Protection For All

Whether you’re part of a large global enterprise, or use Cloudflare for your personal site on the Free plan, we want to make sure that you’re protected and also have the visibility that you need.

DDoS Protection is included as part of every Cloudflare service, from Magic Transit at L3, through Spectrum at L4, to the WAF/CDN service at L7. Our mission is to help build a better Internet – and this means a safer, faster, and more reliable Internet. For everyone.

If you’re a Cloudflare customer on any plan (Free, Pro, Business or Enterprise), these new protections are now enabled by default at no additional charge.

The Cloudflare Blog