#128: Error Budgets

In this episode, I take a look at "Error Budgets"

Much of this episode is inspired by the Site Reliability Engineering practices that have come out of Google

Why you might be interested in this episode:

  • You want to understand what "Error Budgets" means
  • You're struggling to prioritise effectively between feature development, defects, risk and debt
  • You want to see how "Error Budgets" can help you with that prioritisation

Or listen at:

Published: Wed, 06 Apr 2022 16:09:32 GMT

Links

Transcript

Hello, and welcome back to the Better ROI from Software Development podcast.

This week I want to take a look at Error Budgets and how they can help us prioritise between different types of work.

Much of this episode is inspired by the Site Reliability Engineering practises that have come out of Google.

Why you might be interested in this episode:

  • Maybe you want to understand what the term "Error Budget" means
  • Maybe you're struggling to prioritise effectively between feature development, defects, risk and debt work
  • Or how Error Budgets can help you with that prioritisation

But let's start with a brief recap.

I've previously introduced the set of practises and principles that Google employs to run at the scale that it does: Site Reliability Engineering (SRE).

Wikipedia describes Site Reliability Engineering as:

"Site reliability engineering is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps"

Site Reliability Engineering originated at Google as a direct response to the massive scale they had to handle.

In the last episode, I talked about how Site Reliability Engineering looked at measuring availability and how they used:

  • Service Level Indicators - the metric to be observed
  • Service Level Objectives - the expected level of that metric
  • And Service Level Agreements - the outcome if that expected level is breached

Also, in that episode, I talked about how you don't want to aim for 100% in your Service Level Objective - it's going to be prohibitively expensive to even attempt it. Rather, your Service Level Objective (or SLO) should be a balance that allows for new features and change, but provides the right level of stability - and that balance will be dependent on your own organisation.

Say, for example, you have a Service Level Objective of 99.9% successful requests - that equates to a potential of around 43 minutes in every 30-day period in which you can have unsuccessful requests - in which you can breach your SLO target.

That is your Error Budget.
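To make that arithmetic concrete, here's a minimal sketch of the calculation, assuming the same illustrative 99.9% SLO and 30-day window used above:

```python
# A minimal, illustrative sketch of the error budget arithmetic.
# The SLO and window are the same example figures used above; adjust for your own targets.

slo = 0.999                              # 99.9% of requests should succeed
window_days = 30                         # the measurement window

window_minutes = window_days * 24 * 60   # 43,200 minutes in a 30-day period
error_budget_minutes = (1 - slo) * window_minutes

print(f"Error budget: {error_budget_minutes:.1f} minutes per {window_days} days")
# Prints: Error budget: 43.2 minutes per 30 days
```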

Google equates this to something like a household budget - you can spend within your means, but you shouldn't be overspending.

So how could we use this budget?

At this point, I'd like to reference back to the book "Project to Product" by Mik Kersten that I talked about in episode 100; the book sets out a framework that describes four types of work:

  • Features, which deliver new business value
  • Defects, which help deliver quality
  • Risk, which delivers security, governance and compliance
  • And Debt, which delivers removal of impediments to future delivery

The Error Budget can help to inform us and our decision making on how we split our prioritisation between those categories.

So let's see how that would work.

First of all, we need to look at our Error Budget and how much we're actually spending. If we're not spending anything - effectively our SLO has become one hundred percent - then we are likely able to focus heavily on features. We can effectively be more aggressive in our experimentation and product development because we're currently not spending any of our Error Budget.
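As a rough sketch of that first check, the comparison might look something like this (the downtime figure and the thresholds are assumptions, purely for illustration):

```python
# A hypothetical sketch of the decision described above: compare the budget we've
# spent so far against the full budget, and let that steer prioritisation.

error_budget_minutes = 43.2        # from the 99.9% / 30-day example above
spent_minutes = 5.0                # measured unsuccessful-request time this period (assumed)

spend_ratio = spent_minutes / error_budget_minutes

if spend_ratio == 0:
    print("No budget spent - effectively running at 100%; push hard on features and experiments")
elif spend_ratio < 1.0:
    print("Spending within budget - keep shipping, but watch where the spend is going")
else:
    print("Budget overspent - establish why, and shift effort towards defects, risk and debt")
```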

If, however, we are regularly eating into that Error Budget, or overspending, then we need to establish why.

The Site Reliability Engineering principles advise analysing where our Error Budget is being spent (a simple sketch of this kind of breakdown follows the list below):

  • Is it during releases?
  • Or is it a combination of many intermittent failures?
  • Or is it a major application failure?
  • Or is it something outside of our direct control?
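A hypothetical sketch of that kind of breakdown - with invented incident data - might look like:

```python
# A hypothetical sketch of breaking down where the Error Budget is being spent.
# The incident data below is invented purely for illustration.

from collections import defaultdict

# (cause, minutes of budget spent) - assumed sample incidents for one 30-day period
incidents = [
    ("release", 12.0),
    ("intermittent failure", 3.5),
    ("release", 8.0),
    ("major application failure", 15.0),
    ("outside our control", 2.0),
]

spend_by_cause = defaultdict(float)
for cause, minutes in incidents:
    spend_by_cause[cause] += minutes

# The largest category points at where to invest next: for example, heavy
# "release" spend suggests Debt work such as CI/CD improvements.
for cause, minutes in sorted(spend_by_cause.items(), key=lambda item: -item[1]):
    print(f"{cause}: {minutes:.1f} minutes")
```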

So, for example, if we find that most of our Error Budget is spent during release, then we should really consider investing less in Features and more on Debt reduction. Say, for example, looking at implementing or improving our Continuous Integration, Continuous Delivery and Continuous Deployment practises.

A word of caution though; I've talked previously about how release failures can drive well-meaning, but dysfunctional outcomes. We think that releases are difficult and onerous. Thus, we try to do less of them and, in doing so, we actually make the release problem bigger and more impactful each time - because we start to batch work. Ultimately, this creates more risk and poor ROI.

Using DevOps practises such as Continuous Integration, Continuous Delivery and Continuous Deployment encourages us to do the reverse.

It encourages us to release more often, but in smaller chunks.

Like any exercise, the more we do it, the better we get.

Or maybe our analysis shows that we're spending a lot of our Error Budget on major application failures. If so, we should consider spending more time on defects.

And we should be repeating the analysis on a periodic basis: have we made enough corrections that the Error Budget is back in control? If so, maybe we can reprioritise features.

Or maybe the Error Budget remains stubbornly out of control. If so, we may need to devote exclusive effort to resolving it, or at least getting it into a controllable state.

As an aside, fixing may not always be the answer. If our Error Budget is being spent, this is an opportunity to review the outage - to check that it's actually causing customers pain. Is our Service Level Objective correctly aligned to what the customer needs - and thus what the business needs?

It's definitely worth checking to make sure you're looking at the right thing. While you may be breaching SLO and overspending your Error Budget, does this actually impact the customer or business outcomes?

Potentially, the metrics may need to be reviewed rather than wasting effort on something that is not going to move the needle.

The fix may actually be to adjust the SLO.

But regardless, by being able to track the Service Level Objective and analyse the use of the Error Budget, we have empirical data that allows us to make educated choices on how to invest our development time.

In this episode, I've talked about:

What an Error Budget is - the difference between the Service Level Objective and 100% - a budget that you should be using to your advantage.

I've talked about how the analysis of current Error Budget spend can help you prioritise development efforts across the four types of work - as described in the book "Project to Product" by Mik Kersten - be that:

  • Features - which deliver new business value
  • Defects - which help deliver quality
  • Risk - which helps deliver security, governance and compliance
  • Or Debt - which delivers removal of impediments to future delivery

I'll include links in the show notes to two articles by Google on this specific subject.

In next week's episode, I want to talk about some ideas for handling the inevitable failure in our software systems.

Thank you for taking the time to listen to this episode, and I look forward to speaking to you again next week.