Retry Policies with Polly

I originally set out to write about Polly. Then I started reading their documentation, and quickly realised that I couldn’t write anything as detailed and accurate as what they already have available. So instead, I’m going to give a high-level overview, with links to relevant reading.

You can find their documentation on GitHub at github.com/App-vNext/Polly, as well as plenty more resources at www.thepollyproject.org/.

The elevator pitch is this:

Polly is a .NET resilience and transient-fault-handling library that allows developers to express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback in a fluent and thread-safe manner.

Basically, it handles the how of handling failure scenarios, so you can focus on the what. And they have enough guidance to help you with the what, too.

From version 6, Polly targets .NET Standard, and so runs anywhere .NET Standard is supported (so anywhere, really). The package is split between a .NET Standard 1.1 target for maximum compatibility and a more modern .NET Standard 2.0+ target that has all the bells and whistles a modern standard allows. Both ship in one single shiny nupkg.

Why

The internet is flaky. There are layers upon layers of fault-tolerant protocols on top of faulty protocols. Even so, fault scenarios still surface to our code (timeouts, DNS resolution errors, 500 Internal Server Error, 429 Too Many Requests), and that puts the demand on us to handle these cases.

This has never been truer than it is when hosting in the cloud. Providers guarantee that they will restart servers on you, recreate resources, and migrate for Disaster Recovery (DR) failover automatically. Nothing is fixed, everything is fluid. Add on top of that your own deployment model to achieve blue-green deployment, zero downtime, and so on. Flakiness is a when-not-if occurrence.

Handling these cases is always app-specific, but knowing what to handle and why isn’t always clear or obvious. With Polly, its supporting libraries, and its detailed guidance, you are already halfway to providing better apps and better service.

Standard Polly-cies

There are several primary use cases where Polly is designed to help. I’m going to list them in the order that I think (at least, today I think) should be considered and implemented.

Fallback

If anything goes wrong to the point of failure, you need to have decided what to do. If “Crash the entire app” or “500 the HTTP request” is a valid decision, you may start from there.

But more likely there will be an “Unavailable” message, a fallback image, a hard-coded result, an alternative display component, or a 429 Too Many Requests HTTP response. There will be something you can use in place of a hard failure. Thinking about this and deciding it up front makes other decisions easier, especially around retries and timeouts. You will hardly ever be writing business software where an unhandled exception is a valid use case.

The Polly policy for fallback is documented here.
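To make that concrete, here is a minimal sketch of a fallback, assuming the Polly v7-style API. The `FallbackSketch` class, the `GetStatusMessageAsync` helper, and the "Unavailable" text are my own placeholders, not anything from the Polly docs.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

public static class FallbackSketch
{
    // Hypothetical helper: runs `fetch` and substitutes a fixed message
    // when it throws HttpRequestException.
    public static Task<string> GetStatusMessageAsync(Func<Task<string>> fetch)
    {
        var fallback = Policy<string>
            .Handle<HttpRequestException>()
            .FallbackAsync("Unavailable"); // the value used in place of a hard failure

        return fallback.ExecuteAsync(fetch);
    }
}
```

Any other exception type still propagates, which is usually what you want: the fallback covers the failures you have thought about, not every bug.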

Timeout

You want to be fast. You want to be responsive. You don’t want to wait forever. There is usually a point past which, if you haven’t got an answer yet, waiting another 5 minutes won’t help. Work out what you can tolerate and what a reasonable response time is, and configure a proper timeout.

You can do a lot with Polly here. It could be GETs and POSTs have different timeouts. It could be lower for auxiliary data, and higher for primary data sources. Polly is pretty flexible if you need to get really custom. But think about timeouts early and monitor your dependencies for adjustment as necessary.

Revisit your fallback, and see if it makes sense on timeout, and if not, either alter it or create an alternative fallback for timeouts.

The Polly policy for timeout is documented here.
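As a rough illustration, a timeout with its own fallback value might look like the sketch below. It assumes the Polly v7-style API; `TimeoutSketch`, `FetchWithinAsync`, and the "Unavailable" text are hypothetical names of mine. Polly's default timeout strategy is co-operative, so the delegate has to honour the cancellation token it is given.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public static class TimeoutSketch
{
    // Hypothetical helper: gives `fetch` a time budget and returns a
    // placeholder when the budget is exceeded.
    public static async Task<string> FetchWithinAsync(
        TimeSpan budget,
        Func<CancellationToken, Task<string>> fetch)
    {
        var timeout = Policy.TimeoutAsync(budget);

        try
        {
            // Polly cancels `ct` when the budget elapses; `fetch` must observe it.
            return await timeout.ExecuteAsync(ct => fetch(ct), CancellationToken.None);
        }
        catch (TimeoutRejectedException)
        {
            return "Unavailable";
        }
    }
}
```

Passing the budget in as a parameter is one way to give auxiliary data a shorter timeout than a primary data source.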

Retry

Retry doesn’t mean you will eventually get an answer, but it does mean if you wait, you might do. If we already handle Fallback and Timeout, we can be confident in what happens when our final retry fails.

We know some errors are transient, and if we try again it might actually work the next time. So we use a retry policy to try again. We want to have some sort of delay, and we want to think about using a back-off strategy, too. We don’t want to be the cause of a DDoS or make any service exhaustion issues worse.

This is probably the most discussed of all the error-handling strategies so I will defer to others, and point you at the Polly retry documentation here.
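For a flavour of what that looks like, here is a small sketch of a retry with exponential back-off, assuming the Polly v7-style API. The retry count, the 200 ms base delay, and the `RetrySketch` names are arbitrary choices of mine.

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Retry;

public static class RetrySketch
{
    // Delay before each retry; `attempt` is 1-based.
    // Gives 200 ms, 400 ms, 800 ms for attempts 1, 2, 3.
    public static TimeSpan BackoffDelay(int attempt) =>
        TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt - 1));

    // Retries up to three times on HttpRequestException, then rethrows.
    public static AsyncRetryPolicy Create() =>
        Policy
            .Handle<HttpRequestException>()
            .WaitAndRetryAsync(3, attempt => BackoffDelay(attempt));
}
```

Usage is `await RetrySketch.Create().ExecuteAsync(() => CallTheServiceAsync())`, where `CallTheServiceAsync` stands in for whatever call you are protecting. If the last retry also fails, the exception propagates, which is where your fallback takes over.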

Circuit Breaker

I briefly mentioned back-off and service exhaustion in the Retry section. CircuitBreaker is another approach that helps here.

The most common analogy is the fuse box in your house. If an appliance is faulty, it blows the fuse and breaks the circuit. This stops the bad electricity problem from continuing. When you have isolated the problem, you can reset the fuse, and try again. If you haven’t isolated the problem, the fuse will break the circuit again, until it is made right again.

This is a basic circuit breaker pattern, and you can imagine how this applies to software, service failures or outages.

The Polly Circuit Breaker allows you to “blow the fuse” after a number of consecutive failures, on the assumption that the remote service has issues. This broken fuse stops other parts of the application from making the same request: they hit the same broken circuit and fail fast. The main benefits of the circuit breaker pattern are that your application can return a failure without even attempting the call, which keeps the application fast, and that it avoids contributing to exhaustion issues.

Polly Circuit Breaker docs are here.
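Here is a minimal sketch of that fuse, assuming the Polly v7-style API. The thresholds (two consecutive failures, a 30 second break) and the `CircuitBreakerSketch` names are my own illustrative picks.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public static class CircuitBreakerSketch
{
    // Opens the circuit after 2 consecutive HttpRequestExceptions
    // and keeps it open for 30 seconds.
    public static AsyncCircuitBreakerPolicy Create() =>
        Policy
            .Handle<HttpRequestException>()
            .CircuitBreakerAsync(2, TimeSpan.FromSeconds(30));

    // Hypothetical helper showing the two kinds of failure the caller sees.
    public static async Task<string> CallAsync(
        AsyncCircuitBreakerPolicy breaker,
        Func<Task<string>> fetch)
    {
        try
        {
            return await breaker.ExecuteAsync(() => fetch());
        }
        catch (BrokenCircuitException)
        {
            // The circuit is open: `fetch` was never invoked.
            return "Unavailable (circuit open)";
        }
        catch (HttpRequestException)
        {
            // The call was attempted and failed; the breaker has counted it.
            return "Unavailable (call failed)";
        }
    }
}
```

The breaker instance holds the circuit state, so it needs to be shared between callers (for example as a singleton) rather than created per request.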

Bulkhead

Their description of the Bulkhead pattern is much better than I could come up with:

A bulkhead is a wall within a ship which separates one compartment from another, such that damage to one compartment does not cause the whole ship to sink.

Similarly, a bulkhead isolation policy assigns operations to constrained resource pools, such that one faulting channel of actions cannot swamp all resource (threads/CPU/whatever) in a system, bringing down other operations with it. The impact of a faulting system is isolated to the resource-pool to which it is limited; other threads/pools/capacity remain to continue serving other calls.

We basically limit the number of requests to a particular resource, so that if too many requests are issued, further requests are turned away fast to avoid resource exhaustion. Again, when this happens we need to consider our retries, circuit breakers, and fallbacks for how our application behaves and responds to this exhaustion (429?).

Polly talks about this more here.
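A bulkhead can be sketched along these lines, again assuming the Polly v7-style API. The `BulkheadSketch` names and the 429-style message are placeholders of mine.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Bulkhead;

public static class BulkheadSketch
{
    // At most `maxParallel` calls run at once; up to `maxQueued` more wait
    // for a slot. Anything beyond that is rejected immediately.
    public static AsyncBulkheadPolicy Create(int maxParallel, int maxQueued) =>
        Policy.BulkheadAsync(maxParallel, maxQueued);

    // Hypothetical helper: turns a rejection into a 429-style message.
    public static async Task<string> CallAsync(
        AsyncBulkheadPolicy bulkhead,
        Func<Task<string>> fetch)
    {
        try
        {
            return await bulkhead.ExecuteAsync(() => fetch());
        }
        catch (BulkheadRejectedException)
        {
            return "429 Too Many Requests";
        }
    }
}
```

As with the circuit breaker, the policy instance carries the state, so every call to the protected resource should go through the same instance.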

Combinations/Pipeline

You can chain these policies together, too. For instance, you may wrap a Timeout in a CircuitBreaker, and wrap that in a Retry, so a request goes into Retry, then CircuitBreaker, then Timeout.

Each attempt is limited to a short timeout. The outcome updates the state of the circuit and is then passed back to Retry. Our Retry might wait a short space of time (milliseconds) and try again. That might be enough, and the timeout might not be hit this time. If the call times out again, we may trip our CircuitBreaker. Our Retry triggers again and we wait longer. If we haven’t waited long enough, we hit the open CircuitBreaker, fail fast, and fall through to another retry. If we wait long enough on a later Retry, the circuit will be restored, and we try again and hopefully get past the timeout this time. Too many tries and we give up and call it a failure. At that point we may resort to showing our fallback.

A much better explanation of this (with diagrams) is here, as well as example code showing how this hooks together.
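To give a feel for the Retry, CircuitBreaker, Timeout chain described above, here is a rough sketch assuming the Polly v7-style API. Every number in it is an arbitrary choice of mine, and `PipelineSketch` is a hypothetical name; the policies are passed to `Policy.WrapAsync` outermost first.

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;
using Polly.Timeout;
using Polly.Wrap;

public static class PipelineSketch
{
    public static AsyncPolicyWrap Create(TimeSpan baseDelay)
    {
        // Innermost: each individual attempt gets a short budget.
        var timeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(2));

        // Middle: failures and timeouts both count towards breaking the circuit.
        var breaker = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .CircuitBreakerAsync(4, TimeSpan.FromSeconds(5));

        // Outermost: retry on failures, timeouts, and an open circuit,
        // doubling the wait each time (baseDelay, 2x, 4x, ...).
        var retry = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .Or<BrokenCircuitException>()
            .WaitAndRetryAsync(5, attempt =>
                TimeSpan.FromTicks(baseDelay.Ticks * (1L << (attempt - 1))));

        return Policy.WrapAsync(retry, breaker, timeout);
    }
}
```

Once the retries are exhausted the last exception propagates to the caller, so a fallback policy would sit outside this wrap.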

.NET Core 2.1

If you’ve been following the .NET Core 2.1 IHttpClientFactory changes, you will be happy to hear that Polly has this usage in mind already.

A large part of this boils down to avoiding some of the pitfalls associated with managing HttpClient yourself (the disposing-it-too-often-can-cause-socket-exhaustion but also only-using-a-singleton-can-miss-DNS-updates aspects). (Text as taken from the Polly docs.)

You can see their documentation, Polly and HttpClientFactory, as well as a mention of Polly specifically on the official Initiate HTTP requests page (about three-quarters of the way down).

Plugging in Polly Policies at the HttpMessageHandler level can keep your calling code just the way you are used to while providing the benefits of these Policies. It could also mean you can provide policies for third-party libraries that expose the HttpClient enough to inject your own Polly Message Handlers.
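As a sketch of what plugging in at the handler level can look like, the following assumes the Microsoft.Extensions.Http.Polly package alongside Polly v7. The client name "catalog", the example.com address, and the `AddCatalogClient` extension method are all made up for illustration.

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly;

public static class HttpClientFactorySketch
{
    // Hypothetical registration helper for a named client called "catalog".
    public static IServiceCollection AddCatalogClient(this IServiceCollection services)
    {
        services
            .AddHttpClient("catalog", client =>
                client.BaseAddress = new Uri("https://example.com/"))
            // Outer handler: retry on HttpRequestException, 5xx and 408 responses.
            .AddTransientHttpErrorPolicy(builder =>
                builder.WaitAndRetryAsync(3, attempt =>
                    TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt - 1))))
            // Inner handler: a 2 second budget for each individual attempt.
            .AddPolicyHandler(
                Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(2)));

        return services;
    }
}
```

Calling code then asks IHttpClientFactory for the "catalog" client and uses it like any other HttpClient; the policies run inside the message handler pipeline.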

Roadmap

Polly even has a roadmap set out so you can see what is being worked on for future releases.

Like I said, the docs are thorough, so refer to them as you start building out with Polly and make your applications more robust, and more useful when (not if) things go wrong.