#126: State of DevOps 2021 - What it says about Site Reliability Engineering

The State of DevOps report provides excellent insight through rigorous analysis of its wide-reaching survey.

The research provides evidence-based guidance to help focus on the capabilities that drive performance.

One of those is Site Reliability Engineering, a set of practices that came out of Google's efforts to handle massive scale.

Why you might be interested in this episode:

  • What is Site Reliability Engineering
  • How does it relate to DevOps
  • What correlation the report found in Site Reliability Engineering use

Published: Wed, 23 Mar 2022 17:04:00 GMT

Transcript

Hello, and welcome back to the Better ROI from Software Development podcast.

Over the last few episodes, I've been talking about the State of DevOps report and what it says about practises that help you achieve better results.

I summarised the State of DevOps Report 2021 back in episode 120.

Then in 121, I talked about what the report said about Cloud Computing.

122, what the report said about Documentation.

123, what it said about DevOps Technical Practises.

124, what it said about Security.

And 125, the last episode, what it said about Culture.

This time, I want to wrap up the series on the State of DevOps report, and I want to look at the last practise they talked about: Site Reliability Engineering, a practise that came out of Google's effort to handle massive scale.

So why might this episode be of interest to you?

Firstly, you might be wondering what Site Reliability Engineering is.

You might be wondering how does it relate to DevOps?

And what correlation the report found in Site Reliability Engineering use?

But first, a quick recap. If you've not listened to the previous episodes, here is a brief summary of DevOps and the State of DevOps report.

For DevOps, I like the Microsoft definition:

"A compound of development (Dev) and operations (Ops), DevOps is the union of people, process, and technology to continually provide value to customers."

It's a marriage of traditionally opposing forces: innovation and change from the development team, stability and limiting change from the operations team. DevOps, however, focuses on business outcomes through a mix of the two.

The State of DevOps report is now in its seventh year, reporting on over 32,000 professionals worldwide. Produced by the DORA Team (DevOps Research and Assessment), it's the longest running, academically rigorous research investigation of its kind.

And for me, it provides clear evidence on the benefits of DevOps and its practises. But many of these practises are universal, so even if you're not officially doing DevOps, I still think they provide benefit to you and your teams.

And one of the practises that the report took a specific look at was Site Reliability Engineering.

Wikipedia describes site reliability engineering as:

"Site reliability engineering is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps"

Site Reliability Engineering originated out of Google and was a direct response to having to handle massive scale.

One of the problems that Google faced was the ability to scale their systems without scaling the workforce at the same rate. They needed to achieve a much higher level of productivity from their operational staff.

In a normal organisation you may have a ratio of tens of servers per member of the operational team. At a certain point, most team members will only be able to handle a certain number of machines - given the work that they need to do to keep them updated, maintained and working.

Google, given their scale, needed to get that ratio into the thousands of machines per operational team member - and Google achieved this through Site Reliability Engineering.

And the key here is the word Engineering.

They wanted their staff to think about the work they did like engineers rather than just workers. As such, they're expected to be at most 50% loaded with the manual tasks we might normally associate with operational staff - keeping machines updated, maintained and working.

The rest of their time was spent on engineering much of that work away.

Much of this was through automation and developing more intelligent processes and systems. So much of this mindset has then fed back into everything Google does. Everything is an opportunity to learn and improve.

Be it daily, repetitive tasks - how do we automate them away?

Be it any customer-affecting outage - post-mortems are performed. Google then openly celebrate and distribute that post-mortem work, while many organisations would rather brush their problems under the carpet for fear of embarrassment, or even look at disciplinary action.

Google sees the value of sharing, learning and improving those Site Reliability Engineering practises across the organisation. And by doing this, they have been able to scale their systems without having to employ every IT person on the planet.

There is a lot of crossover with the DevOps principles and practises. And the report considers them complementary practises - with their research demonstrating clear alignment.

Specifically, the report assessed the degree to which respondents followed these five SRE practises:

  • Define reliability in terms of user facing behaviour.
  • Employ the service level indicator and objective metrics framework to prioritise work according to error budgets.
  • Use automation to reduce manual work and disruptive alerts.
  • Define protocols and preparedness drills for incident response.
  • And incorporate reliability principles throughout the software delivery lifecycle - shift left on reliability.
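
To make the second item in that list a little more concrete, here is a minimal sketch of how a service level indicator, a service level objective and an error budget relate to each other. It is an illustration only, not something taken from the report or from Google: the class name ErrorBudget, the function next_priority, the request-based way of measuring the indicator and the simple "budget spent means reliability work comes first" policy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one measurement window."""

    slo_target: float     # objective, e.g. 0.999 = 99.9% of requests succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that counted as failures for the user

    def __post_init__(self) -> None:
        # A target of exactly 100% would leave no budget at all, so this
        # sketch only accepts targets strictly between 0 and 1.
        if not 0 < self.slo_target < 1:
            raise ValueError("slo_target must be between 0 and 1 (exclusive)")

    @property
    def sli(self) -> float:
        """Measured indicator: the fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0
        good = self.total_requests - self.failed_requests
        return good / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """How many failures the objective tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the budget still unspent; negative means overspent."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.allowed_failures


def next_priority(budget: ErrorBudget) -> str:
    """Assumed policy: once the budget is spent, reliability work wins."""
    if budget.budget_remaining <= 0:
        return "reliability work"
    return "feature work"


if __name__ == "__main__":
    window = ErrorBudget(slo_target=0.999, total_requests=1_000_000,
                         failed_requests=250)
    print(f"SLI: {window.sli:.5f}")
    print(f"Budget remaining: {window.budget_remaining:.0%}")
    print(f"Next priority: {next_priority(window)}")
```

The idea the sketch tries to capture is that the objective is set below 100%, the gap is treated as a budget that can be spent, and the amount left over is what drives the prioritisation decision.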

In their analysis, they found that the majority of teams in the study, 52% of respondents, reported using SRE practises to some extent. Yet only 10% indicated they had implemented every practise that the report investigated.

They did find, however, that those teams that excelled at these modern operational practises were 1.4 times more likely to report greater software delivery and operational performance, and 1.8 times more likely to report better business outcomes.

Site Reliability Engineering is a large topic. Take, for example, the canonical book on the subject, "Site Reliability Engineering", published by O'Reilly - if you listen to that on Audible, it's over 20 hours. That's the same length as Harry Potter and the Philosopher's Stone and Harry Potter and the Chamber of Secrets combined.

It's a very technical subject and one that I would not recommend entering into lightly. And in most cases, organisations are simply too small to adopt Site Reliability engineering to its fullest.

However, there are a number of practises that I believe are useful to consider in isolation. In the coming weeks, I'd like to talk about a number of them:

  • The differences between service level indicators, objectives and agreements
  • Error budgets
  • Handling failures
  • And checklists - should we use them, or shouldn't we?

In this episode, I've given you a brief recap of DevOps and the State of DevOps report, and I've given you a summary of Site Reliability Engineering and how it's evolved out of Google as a direct response to them having to handle massive scale.

I've talked about how the report found that, while 52% of respondents said that they used some part of the SRE practises, only 10% used all of those examined.

And I talked about how the report found that teams using these practises were 1.4 times more likely to report greater software delivery and operational performance, and 1.8 times more likely to report better business outcomes.

This episode concludes my look at the 2021 State of DevOps report.

In episode 120, I summarised the report.

In 121, I talked about what the report said about Cloud Computing.

122, what it said about Documentation.

123, what it said about DevOps Technical Practises.

124, what it said about Security.

125, what it said about Culture.

And in this episode, 126, what it said about Site Reliability Engineering.

I would like to reiterate that I find the State of DevOps report to be an incredibly useful artefact. For me, it provides clear evidence of the benefits of DevOps and its practises.

Regardless of how mature your software development process is, I'd really consider taking a good look at the State of DevOps report.

If you're early days, you can get a clear idea of what might be your biggest wins, especially if you're just starting to look at the practises of DevOps.

If you've reached a level of maturity, the report also helps you look at what to do next - what are similar organisations doing? What could be your next way of improving what you're doing?

And it's all backed up by that incredibly detailed scientific analysis.

As with all the episodes talking about the State of DevOps report, I'll provide a link in the show notes. And I would certainly recommend giving it a read.

In the next episode, I want to talk about:

  • How Google uses Service Level Indicators, Objectives and Agreements
  • Why 100% uptime might not be the correct target
  • And why "uptime" might not be the best indicator in the first place

Thank you for taking the time to listen to this podcast. I look forward to speaking to you again next week.