Backoff Strategies
Last week, GCP had an outage that brought down a very large portion of the internet (including Cloudflare). They released their RCA, and the part that stood out to me was this:
Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure.
This seems like something that Google would handle better. A few thoughts on different ways to handle this effectively.
Thundering Herd
Think of this as a herd of buffalo in Dances with Wolves. In systems, the term describes the case where a large number of operations all try to happen at the same time.
Imagine that your API server is down for an hour. During that hour, a lot of your customers want to make requests, but can’t. As soon as your web server is back up, every one of your customers starts sending their last hour of traffic immediately to you, which often is far more than you can really handle.
This is the thundering herd.
The worst part of this is when it is all internal and you are shooting yourself in the foot.
The problem is that a thundering herd often ends up producing a worse overall experience for everyone (all requests would be served faster if they arrived in an orderly fashion).
Mitigation Strategies
There’s a decent number of ways to handle this. Or, at least try to handle it. It’s a hard problem because there’s just a lot of work to be done, and everyone wants to do it now.
In my opinion, there’s also a couple of different ways to look at it.
As the client
- You are the one doing the calling. You can decide how to make your requests in a way that, hopefully, works out for both you and the server
As the server
- You don’t always have control over the people calling your service. How can you best handle customers who may or may not want to deal with the situation?
As I talk through this I’ll mostly be thinking about things like web requests. But the same ideas should apply at different levels (say, calls to your database).
At the Client Level
This is a great place to start. We hopefully want to be a good and respectful caller. The goal here is to maximize everyone’s experience.
I’ll cut to the chase here – the answer for you is going to be Exponential Backoff.
Exponential Backoff
Others have covered this better than I will, but, in general you’ll want to sleep between retries, and the sleep should increase exponentially.
We use Sidekiq at work, which does this by default on retries. Out of the box it allows 25 retries which, if exhausted, will take 21 DAYS.
The idea here is you give the thing that you are calling a chance to recover. After a couple of tries, you are sleeping for a large amount of time in between each request, so your hope is that you are giving the server a chance to fix whatever the problem was.
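To make this concrete, here’s a minimal sketch of the idea. This is not Sidekiq’s actual formula; `backoff_delay` and `with_backoff` are names I made up for illustration.

```ruby
def backoff_delay(attempt, base: 2, cap: 60)
  # Exponential growth, capped so we never sleep absurdly long.
  [base**attempt, cap].min
end

def with_backoff(max_attempts: 5, base: 2, cap: 60)
  attempt = 0
  begin
    yield
  rescue StandardError
    attempt += 1
    raise if attempt >= max_attempts # out of retries, give up and re-raise
    sleep(backoff_delay(attempt, base: base, cap: cap))
    retry
  end
end
```

With those defaults the delays go 2, 4, 8, 16 seconds between attempts: each failure buys the server twice as much breathing room as the last one.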
One problem that can arise with this is that your calls may still be lined up. If you make 10,000 calls in a second and they all fail, you might sleep for 10 seconds… and then fire all 10,000 calls again at the same time.
One solution to this is simply to add some jitter to your backoff.
What’s jitter? It’s simply a random extra delay added when you are backing off.
Let’s take the previous example of 10,000 calls. Now you decide to add a random 0-3 second sleep on the retries.
10 seconds later, you should have:
- 2500 calls on second 10
- 2500 calls on second 11
- 2500 calls on second 12
- 2500 calls on second 13
You can probably see how this makes life easier on the server. Better yet, the effect compounds: on the next retry, the calls spread out even further.
This is a great way to maximize the chance of both your success and the server’s success.
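A jittered version can be as small as one extra line. One common variant is “full jitter”: instead of sleeping exactly the exponential delay, sleep a random amount between zero and that delay. Again, the function name here is mine, a sketch rather than any particular library’s API.

```ruby
def jittered_delay(attempt, base: 2, cap: 60, rng: Random.new)
  # Full jitter: pick uniformly between 0 and the capped exponential delay.
  max_delay = [base**attempt, cap].min
  rng.rand * max_delay
end
```

Passing in the random source (`rng`) is just there to make the function easy to test deterministically.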
At the Server Level
In some ways, handling this at the server level is harder. You are probably dealing with clients who don’t care much about any of this, and have written their code as simply and easily as possible.
There are a couple of related ways you can go about this.
Rate Limiting
Of course, the easiest thing to do here is rate limit. I won’t go into too much detail on the various ways to rate limit (there is a wealth of information about that), but the general idea is the same:
You may call us “X” times every “Y” seconds. If you call us more than that, you won’t succeed.
The concept is pretty simple; the implementation can get detailed, but the idea stays straightforward.
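Here’s a toy fixed-window limiter to show the shape of it. This keeps state in process memory purely for illustration; in production this state usually lives somewhere shared, like Redis.

```ruby
# Allow `limit` calls per `window` seconds, per key (e.g. per IP).
class RateLimiter
  def initialize(limit:, window:)
    @limit = limit
    @window = window
    @counts = Hash.new { |h, k| h[k] = [0, 0] } # key => [window number, count]
  end

  def allow?(key, now: Time.now.to_f)
    window_start, count = @counts[key]
    current_window = (now / @window).floor
    if current_window != window_start
      # New window: reset the count for this key.
      @counts[key] = [current_window, 1]
      true
    elsif count < @limit
      @counts[key] = [window_start, count + 1]
      true
    else
      false # over the limit for this window
    end
  end
end
```

Fixed windows have a known wrinkle (a burst straddling a window boundary can briefly see double the limit), which is why sliding-window and token-bucket variants exist, but the core “X per Y seconds” contract is the same.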
One thing to consider is how it’s implemented. If you are doing the work on your web server, this is still work the web server has to do. It’s likely much less work than actually fulfilling the request, but it’s not zero. For a web server that’s already overloaded, that might matter.
You may be better off with a layer above your web server handling this. Whether it’s a proxy, load balancer, or a WAF, you can offload a lot of this complexity to something external to your web server, and let your web server focus on just the important bits.
One improvement to this might be Adaptive Rate Limiting.
A lot of places and products will select a number (say, 1000 requests/minute/IP), and apply it. This applies in good times and bad.
However, in this scenario you are most certainly not in good times. It’s best to have some way to reduce the limit when you need to. There are plenty of times when your site is under stress and you still want to serve some traffic, just less of it. Maybe you want to crank the limit down to 100, or 10. Having a method to do that is great.
Even better would be if you had some way to automatically do that. “If request time is over xxx, start rate limiting more severely”.
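That rule could be as simple as a function from observed latency to a limit. The function, the threshold, and the halving policy below are all made-up illustrations, not a real algorithm from any particular system.

```ruby
# Shrink the rate limit as observed p95 latency climbs past a threshold:
# halve the limit for each multiple of the threshold we're over, floor of 1.
def adaptive_limit(base_limit, p95_latency_ms, threshold_ms: 500)
  return base_limit if p95_latency_ms <= threshold_ms

  factor = 2**(p95_latency_ms / threshold_ms) # integer division on purpose
  [base_limit / factor, 1].max
end
```

The point isn’t the specific curve: it’s that the limit becomes a function of server health rather than a constant someone picked during good times.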
Convey Information!
Some of this will depend on if we are talking about internal or external traffic, but always convey information where you can.
The best example of this is returning a 429 if you are rate limiting. There’s, of course, no guarantee that anyone will respect it, but you have given the caller information that they can use to be more successful.
Taking it a step further, you can also return a message (maybe human readable, maybe defined by your API specification) to give even more information.
- “Rate Limited for another 3:40”
- “Rate Limited until 4:38PM”
- “Currently experiencing high loads, rate limit reduced to 10/minute”
Obviously, those would be different if they were part of the API spec, but you get the idea. It’s yet another way to both explain what’s happening to the caller, but also to help them be successful.
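As a sketch of what that looks like on the wire, here’s a Rack-style response tuple for a rate-limited caller. The helper name is mine; the real parts are the 429 status code and the standard `Retry-After` header, which tells a well-behaved client exactly how long to back off.

```ruby
# Build a Rack-style [status, headers, body] response for a rate-limited caller.
def rate_limited_response(retry_after_seconds)
  [
    429, # Too Many Requests
    {
      "content-type" => "text/plain",
      "retry-after"  => retry_after_seconds.to_s # seconds until they may retry
    },
    ["Rate limited. Try again in #{retry_after_seconds} seconds.\n"]
  ]
end
```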
This is a form of backpressure. Pushing back against the rush of calls. It’s a good way to convey “Hey, I’m struggling here, lay off for a bit, thanks”
Never expect an external caller to respect any of this. It’s great if they do, and you’ve given them the ability to handle it, but my experience is that there are a lot of folks that… simply won’t.
Further Reading
I wrote this post pretty quickly, and while it’s something I’m quite interested in, I think other folks have probably covered it sufficiently. Here’s a few that I think you should check out:
- Marc Brooker has a great post about this. He’s worked at AWS for a long time and done some great things.
- He wrote about it on his personal blog as well
- A great story about exponential backoffs and why they can be a problem
- Another good post about it