#129: Handling Failure

Failure in our software systems is inevitable - be it a failing hard drive, broken network cable, power outage, virus, or simply a bug in the code.

"Hope is not a strategy" - thus we need to think about how we handle that failure.

Why you might be interested in this episode:

  • The differences between how failures impact our traditional monolith applications and the more modern distributed application
  • To gain an understanding of terms like Graceful Degradation, Cascading Failure, The Retry software pattern, The Circuit Breaker software pattern, and Deadline Propagation
  • And advice on how to find opportunities to use them

Or listen at:

Published: Wed, 13 Apr 2022 16:11:29 GMT


Hello and welcome back to the Better ROI from Software Development Podcast.

Failure in our software systems is inevitable.

Be it a failing hard drive, broken network cable, power outage, virus, or simply a bug in the code - failure will happen.

And as a well known SRE saying goes: "Hope is not a strategy".

So what's in this episode that may be of interest to you?

Firstly, I'll look at the differences between how failure impacts our traditional monolith applications and the more modern distributed application.

I'll give you an understanding of the terms like:

  • Graceful Degradation
  • Cascading Failure
  • The Retry software pattern
  • The Circuit Breaker software pattern
  • And Deadline Propagation

And advice on how to find opportunities to use them.

But let's start with why you should be interested in failure in the first place;

As I set out in the intro, failure is inevitable - failure will happen - it's what you do about it that affects your customers and your business outcomes.

You cannot put your head in the sand and pretend it will not happen.

It will.

And in those situations, you often want the "least worst" outcome to occur.

In a failure situation, the "best case scenario" has already failed - thus, you have to look at what alternatives are available - and these are not going to be perfect. Otherwise, they would have been your "best case scenario" in the first place.

Thus you need to be choosing from a number of bad options.

Often you'll need to make a subjective judgement on what is the "next best", or more importantly, the "least worst" option.

And these options can get more complicated with modern systems.

Our traditional, large, monolithic application was arguably simpler. It was one system - if it was down, it was all down. Much easier to reason about. But a failure, of course, had a massive impact because everything was down.

With the more modern distributed, component-based nature of our applications, such as microservices, it's possible for a failure to affect only part of the overall system.

I've previously talked about thinking of these components as bulkheads to protect the rest of the ship. This gives us the benefit that, if done right, the impact can be minimised. But it is much more complex to reason about.

At this point, I'd like to use a metaphor. Now, no metaphor is perfect, so bear with me.

Let's say you want to get a parcel to your customer.

If we think of our traditional monolithic application - everything in one - then you may think of this as a "man in a van" courier. You hand them the parcel, and they are then responsible for the entire journey of that parcel into your customer's hands. It's easy to reason about - it's a single person hand-delivering that parcel.

The downside is if something goes wrong, if the van breaks, or the courier becomes ill, then your parcel is unlikely to make it.

It can give you problems with resilience.

You also have a problem with scaling. If you wanted to send 100 parcels to the same customer, then it's probably okay. The courier just needs a bigger van, although the same risks still apply.

However, if it's 100 parcels to 100 different customers around the country, that isn't going to work.

Compare this to something more like the Postal Service, a network of people, offices and vehicles to get your parcel to where it needs to be. This is a network of components, and by having a network, we have more options in terms of resilience and reliability.

If a single driver is ill, we have additional drivers. If a van breaks down, a colleague comes out, swaps the parcels over and carries on the delivery.

And this is much more akin to our modern distributed systems. We have options for resilience and handling failure.

At this point, I'd like to talk about the term Cascading Failures. Wikipedia describes it as:

"A cascading failure is a process in a system of interconnected parts in which the failure of one or few parts can trigger the failure of other parts and so on. Such a failure may happen in many types of systems, including power transmission, computer networking, finance, transportation systems, organisms, the human body, and ecosystems.

Cascading failures may occur when one part of the system fails. When this happens, other parts must then compensate for the failed component. This in turn overloads these nodes, causing them to fail as well, prompting additional nodes to fail one after another."

And this is something we aim to avoid in our modern distributed systems.

In our legacy monolith, it's common that a single failure makes the whole of it unusable. We need to think about how we can avoid replicating this outcome in our modern distributed systems.

If we think about our Postal Service metaphor, we do not expect a single driver illness to cause Cascading Failure to the entire network.

What we would rather expect is "Graceful Degradation". With Graceful Degradation, we get as much of the system as possible to continue to operate, even though part of it has failed.

In our postal example, if one driver is ill, then potentially that route isn't delivered that day - but all other routes and deliveries are made. This is Graceful Degradation.

Yes, the overall system, the Postal Service, is not operating at peak performance, but the overall impact of the failure is minimised.

Modern software development also provides us with approaches to handle these sorts of failures. They provide us patterns for how we can best handle them. Two of them I want to talk about are the Retry and the Circuit Breaker pattern.

With the Retry pattern, as the name suggests, the failed action is retried if the failure is likely to be transient - short-lived - in nature.

Take, for example, our ill post driver; if they recover the next day, the parcel delivery can be retried.

This is also a common pattern: if the parcel needs a signature and you're not in, the Postal Service will retry delivery until you're in and can sign for receipt of the parcel.

But the Postal Service will not retry ad infinitum - they will only try so many times - retrying forever would tie up all of their drivers.

Rather they will stop attempting delivery after a certain point. And this is very similar to the second pattern, the Circuit Breaker.

Much like the electrical Circuit Breaker in your home, it is there to prevent further damage once a failure occurs.

In the case of our delivery, it may kick in after three failed attempts. After this point, the Postal Service will not attempt further delivery. Effectively, the breaker is thrown and the parcel is not re-attempted - saving on wasted trips by the driver.

The breaker is only reset when you contact the Postal Service to reconfirm your availability.

Once this outside action has occurred, then the Postal Service can successfully deliver the parcel to you.

And our modern distributed software systems work in the same way. If a component of the system fails, we can use the Retry pattern to re-attempt - this can work brilliantly for transient errors.
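In code, the Retry pattern can be sketched in just a few lines. This is a minimal, illustrative example - the function name, the number of attempts, and the backoff delays are all assumptions, not a prescription:

```python
import time

def retry(action, attempts=3, base_delay=0.1):
    """Retry a failing action a limited number of times.

    Waits a little longer between each attempt (exponential backoff),
    which suits transient errors such as a brief network blip.
    """
    for attempt in range(attempts):
        try:
            return action()  # success: return immediately
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts - let the failure propagate
            time.sleep(base_delay * (2 ** attempt))  # back off, then retry
```

Note the cap on attempts - just like the Postal Service, we don't retry forever, and the backoff gives a struggling service a moment to recover between attempts.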

If, however, the retries continue to fail, or it isn't a transient error, then the Circuit Breaker pattern allows us to avoid making those failed requests - which can often make the problem worse if the failure is related to the component being overloaded with the work.
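A Circuit Breaker can also be sketched simply. Again this is a minimal, illustrative version under assumed thresholds - production libraries add half-open states, per-error policies and metrics:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after too many consecutive failures,
    stop calling the failing service for a cool-off period so it can
    recover, instead of piling on more doomed requests."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # time the breaker tripped, or None if closed

    def call(self, action):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open - not attempting call")
            # cool-off over: allow a trial call through
            self.opened_at = None
            self.failures = 0
        try:
            result = action()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the count
        return result
```

The key behaviour: once open, the breaker fails fast without touching the struggling service at all - that fast failure is what stops an overload turning into a cascade.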

By having that controlled failure, we have the opportunities to provide Graceful Degradation rather than Cascading Failure.

One other idea that I find interesting - not so much a pattern as a technique - is "Deadline Propagation", from the Site Reliability Engineering practices.

I've talked about Site Reliability Engineering over the last few episodes. It's a set of practices and principles that have emerged from Google as a direct response to having to handle massive scale.

One of the ideas from this is Deadline Propagation, which I think aligns well with this topic.

Deadline Propagation is the idea that a specific action must be completed within a specific time period to be considered successful. The deadline is set at the start of the action and every component that works on the action reviews that deadline to ensure that it hasn't expired.

Say, for example, our parcel must be delivered within 24 hours. As such, we place a label on the parcel with the explicit date and time by which the parcel should be delivered. With this label, every part of the postal network is firstly aware of the deadline, and secondly can take actions based on it.

If the deadline is fast approaching, potentially, greater priority is put on the parcel - or if the deadline has passed, any attempt to deliver ceases.

If we are delivering fresh seafood, there is no point in the delivery being completed if it has gone past the safe consumption period. Better that the parcel never arrives, than arrive late and cause our customer food poisoning.

This technique not only has advantages when one of the components fails, but also when it's just slower than it should be. For example, if our driver is slower than normal due to a leg injury, there is little point giving them more parcels and making the problem worse.
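The shape of Deadline Propagation can be sketched as a deadline set once at the start and checked by every subsequent step. A minimal, illustrative sketch - the function names and the idea of passing each step its remaining time are assumptions for the example:

```python
import time

def handle_request(work_steps, timeout=1.0):
    """Set a deadline when the action starts, then propagate it so
    each step can refuse work that has already expired."""
    deadline = time.monotonic() + timeout
    results = []
    for step in work_steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # past the deadline: stop, rather than do pointless work
            raise TimeoutError("deadline exceeded - abandoning the action")
        results.append(step(remaining))  # each step sees the time left
    return results
```

Because each component knows the time remaining, it can also prioritise urgent work or shed expired work - exactly the seafood-parcel behaviour described above.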

So you may be wondering how you know when to use any of these approaches;

As I say, these will likely be "least worst" options following any failure. So there will likely be trade-offs as to what approach is best for your organisation and your customer.

And as I said at the start, this matters to you because of the impact that that "least worst" option has - and ultimately, it will depend on your organisation and your customer.

However, there are good times to look for opportunities to use these approaches to decide what is the "least worst" option if a particular component fails.

Post-mortems are a great opportunity for learning following some form of customer-affecting outage. Carry out a blameless post-mortem: review why it happened and what mitigations, if any, you had in place. Were those mitigations good enough? If not, what approach would have been "less worst"?

Carrying out this sort of review as part of a post-mortem is a useful exercise because the failure is still fresh in people's minds, and there is the memory of any pain that went with it.

An alternative is Gameday workshops, in which we run "what-if" scenarios - we role-play various failures - hopefully highlighting the problems on paper before they happen for real.

In this episode, I've talked about:

The differences between how failure impacts our traditional monolith applications and the more modern distributed application. In our monoliths, a failure would often be all or nothing - easier to think about, but with massive impact.

And if done right, our modern distributed applications allow for partial failure, "Graceful Degradation", limiting the impact and providing greater resilience. But that benefit can come at the price of greater complexity. If we don't get it right, we get full failure, a "Cascading Failure", as we did with the monolith.

The Retry software pattern helps to handle transient failures by automatically retrying the failed action.

The Circuit Breaker pattern helps to handle non-transient failure, often helping by removing pressure on the failing service to allow it to recover.

The Deadline Propagation technique from the Site Reliability Engineering practices helps us to set an explicit time by which an action must be completed - allowing for priority processing, or reducing load by not processing expired and thus pointless work.

And finally, I talked about using blameless post-mortems and gamedays as a way of finding opportunities to use these approaches.

In next week's episode, I want to look at how the Site Reliability Engineering practices look at checklists.

Thank you for taking the time to listen to this episode and I look forward to speaking to you again next week.