Why Error Budgets?
CoinGecko offers API services to our customers. We provide two types of APIs: the Public API and the Pro API. For the Pro API, we are bound by tight service-level agreements (SLAs) with our customers. These SLAs are important for ensuring customer satisfaction and trust in the platform.
We visualized our risk metrics to group risks into severity categories that could endanger our SLAs. Instead of settling for arbitrary availability goals like 99.9% or 99.95%, we rely on tangible information to ensure that our goals remain realistic.
In this article, we will discuss the process behind measuring and managing a reliable uptime SLA. How do we track, analyze and understand risks before settling on our SLA?
For ease of understanding, let’s first define SLAs and SLOs:
SLA
Service Level Agreements – agreements with our customers about the reliability of our services.
SLO
Service Level Objectives – thresholds that catch an issue before it breaches our SLAs.
SLA and SLO, where it stands - Courtesy of Google
In other words, the threshold for our SLO is stricter than for our SLA. We need to catch any issues before they reach the customer. In terms of uptime, internally we only allow a shorter duration of downtime, x, than the duration we commit to externally, y. As a formula: x < y.
The SLA for our Pro API is 99.9%. That means our SLO must have a stricter threshold, e.g., 99.95% or 99.99%.
How do we know how much headroom we have before we breach our SLO?
An uptime SLA of 99.9% is equivalent to 43.2 minutes of downtime in a month. A corresponding SLO of 99.95% is equivalent to 21.6 minutes of downtime.
These minutes of allowable downtime under the SLO are what we call the Error Budget. The error budget gives us room for maintenance, deployments and improvements to our application. Engineers only have 21.6 minutes a month to maneuver when they face problems that cause downtime.
The Error Budget is the inverse of the SLO: if our SLO is 99.9% availability, our Error Budget is the remaining 0.1% of unavailability.
Availability Table - Courtesy of Google
From the table above, we now understand the unavailability, i.e., the Error Budget, in terms of the time we can afford.
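To make this concrete, below is a minimal Python sketch of the conversion from an availability target to an error budget. It assumes a 30-day month and a 365.25-day year (matching the figures quoted in this article); the function name is ours, for illustration.

```python
# Minimal sketch: convert an availability target into an error budget.
# Assumes a 30-day month and a 365.25-day year, matching the figures
# quoted elsewhere in this article.
MINUTES_PER_MONTH = 30 * 24 * 60        # 43,200 minutes
MINUTES_PER_YEAR = 365.25 * 24 * 60     # 525,960 minutes

def error_budget_minutes(availability_pct: float, period_minutes: float) -> float:
    """Allowed downtime (error budget) for a given availability target."""
    return (1 - availability_pct / 100) * period_minutes

print(round(error_budget_minutes(99.9, MINUTES_PER_MONTH), 2))   # 43.2  (SLA, per month)
print(round(error_budget_minutes(99.95, MINUTES_PER_MONTH), 2))  # 21.6  (SLO, per month)
print(round(error_budget_minutes(99.9, MINUTES_PER_YEAR), 2))    # 525.96 (per year)
```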
Let’s take a look at the two diagrams below to understand the burn rate of our Error Budget from Day 1 to Day 28. Both start with our Error Budget of 21.6 minutes (expressed as 100%) at the beginning of the month.
The first diagram shows a positive Error Budget by the 28th day of the month.
Monthly error budget nearing the budget - Courtesy of Google
Meanwhile, the following diagram shows a breached Error Budget with negative percentage remaining.
Monthly error budget breaching the budget - Courtesy of Google
The diagram above provides a visual representation of the burn rate as a percentage, regardless of how many minutes of Error Budget we have chosen.
Error budget burn rate can be monitored throughout the month to revise the frequency, priority and type of deployments scheduled.
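As a rough sketch of how such a burn-down can be computed, the remaining budget percentage is simply the monthly budget minus the downtime burned so far. The 21.6-minute budget below corresponds to a 99.95% monthly SLO; the per-day downtime values are made up for illustration.

```python
# Sketch of tracking error budget burn through a 28-day month.
# The 21.6-minute budget corresponds to a 99.95% monthly SLO; the
# downtime values below are made up for illustration.
MONTHLY_BUDGET_MIN = 21.6

def remaining_budget_pct(downtime_minutes_per_day: list[float]) -> float:
    """Percentage of the monthly error budget still unspent (negative = breached)."""
    burned = sum(downtime_minutes_per_day)
    return (MONTHLY_BUDGET_MIN - burned) / MONTHLY_BUDGET_MIN * 100

downtime = [0.0] * 28
downtime[2], downtime[16] = 5.0, 12.0   # 5 min burned on day 3, 12 min on day 17
print(f"{remaining_budget_pct(downtime):.1f}% of the budget remains")  # 21.3%
```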
Analyzing past incidents to categorize our failure points
No application or system is perfect, especially in its early stages. The key is to learn from these experiences by recording, documenting, and categorizing each incident for future reference. By investigating these issues, we gain a deeper understanding of how to prioritize and address them, helping us craft realistic SLOs. Analyzing past incidents and anticipating future ones allows us to take proactive measures to prevent SLA breaches and ensure system reliability.
First things first: we have to understand our failure points. We categorize each incident that has occurred or may occur in the future; each of these is what we call a risk. This gives us a high-level view of which categories cause us the most headaches.
Where do we obtain information about these risks? Historical data, industry best practices, brainstorming and so on.
For illustration, these are some of the categories we identified that can cause downtime to our application:
- Disaster recovery drill
- Updating major code version
- Code deployment misconfiguration
- Unoptimized database queries
- Software defects in the code
- Breakdown of caching service
- Outage in an Availability Zone
- Unintended data loss or corruption
- Malicious security breach/attack
- High volume of traffic
- Breakdown in the message queue system
- Disk failure
- Third-party dependency failure
Next, from each of the incidents, we calculate:
ETTD - Estimated Time To Detection – how long it would take to detect and notify a human (or robot) that the incident has occurred; aka MTTD (Mean Time to Detect).
ETTR - Estimated Time To Resolution – how long it would take to fix the incident once the human (or robot) has been notified; aka MTTR (Mean Time to Repair).
ETTF - Estimated Time To Failure – the estimated time between instances of this incident; aka MTBF (Mean Time Between Failures).
% of Users Affected – the percentage of users affected by the failure.
Above terms visualized - Courtesy of Google
This helps us understand the frequency and duration of incidents, and how swiftly we respond to them.
We want to understand how much downtime (bad minutes) per year is caused by each category. With this valuable information, we open up a spreadsheet, fill in all of our data, and calculate the risk level for each category. This is what we call the Risk Catalog.
We enter our list of risks in the blue cells, together with the ETTD, ETTR, percentage of users impacted and ETTF. Based on these inputs, the spreadsheet formulas in the grey cells produce the number of incidents per year and the bad minutes per year.
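The spreadsheet’s exact formulas aren’t reproduced here, but the arithmetic behind the grey cells can be sketched roughly as follows. We assume, for illustration, that ETTD and ETTR are entered in minutes, ETTF in days, and that bad minutes are weighted by the percentage of users affected.

```python
from dataclasses import dataclass

DAYS_PER_YEAR = 365.25

@dataclass
class Risk:
    """One row of the Risk Catalog (the blue cells as inputs)."""
    name: str
    ettd_min: float    # Estimated Time To Detection, in minutes
    ettr_min: float    # Estimated Time To Resolution, in minutes
    pct_users: float   # % of users affected (0-100)
    ettf_days: float   # Estimated Time To Failure, in days between occurrences

    @property
    def incidents_per_year(self) -> float:
        return DAYS_PER_YEAR / self.ettf_days

    @property
    def bad_minutes_per_year(self) -> float:
        outage_minutes = self.ettd_min + self.ettr_min
        return self.incidents_per_year * outage_minutes * (self.pct_users / 100)

# Hypothetical entry, for illustration only:
r = Risk("Third-party dependency failure", ettd_min=5, ettr_min=55, pct_users=100, ettf_days=90)
print(round(r.incidents_per_year, 2), round(r.bad_minutes_per_year, 1))  # 4.06 243.5
```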
Computed Stack Rank of Risks
We took the information above and rearranged the risks by severity in a new spreadsheet called the Risk Stack Rank. This gives us data-driven context on where we stand today versus our currently defined SLO.
Let’s have a look at the computed stack rank of risks below:
In the sheet above, our risks are populated and sorted by bad minutes per year. The risk with the most bad minutes per year is considered the highest risk.
The Risk Stack Rank has several components to look at:
Target Availability
The desired availability in percentage.
Budget (m/yr)
The total error budget available, measured in minutes per year (m/yr), which represents the maximum allowable downtime while still meeting the target availability.
Accepted (m/yr)
The amount of downtime already allocated for various known risks in minutes per year.
Unallocated Budget (m/yr)
The portion of the error budget that remains uncommitted after accounting for known and accepted risks.
Threshold of unacceptability for an individual risk (% of error budget)
A limit that defines how much of the total error budget a single risk can consume.
Too Big Threshold (m/yr) – for a single risk
The absolute upper limit for the amount of downtime a single risk can be responsible for. If the expected impact of a risk exceeds this threshold, the risk is deemed "too big" and must be mitigated, as it could jeopardize the ability to meet the SLO.
As for the colored cells, here is what each color means:
Red – this risk is unacceptable, as it falls above the acceptable error budget for a single risk.
Amber – this risk should not be accepted, as it is a major consumer of our error budget and therefore needs to be addressed.
Green – this is an acceptable risk. It's not a major consumer of our error budget, and in aggregate, does not cause our application to exceed the error budget.
Blue – this risk has been accepted to fit within our error budget. Accepting a risk means planning not to fix it and taking the outage and corresponding hit on the error budget.
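To make these components and color rules concrete, here is a rough sketch that reuses the Risk class from the earlier snippet. The red and blue rules follow the descriptions above; the amber rule used here (a risk larger than the remaining unallocated budget) is our own illustrative assumption, since the sheet’s exact cutoff isn’t shown.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60   # 525,960 minutes

def stack_rank(risks, accepted_names, target_availability=99.9,
               single_risk_threshold_pct=25):
    """Compute the Risk Stack Rank bookkeeping and shade each risk."""
    budget = (1 - target_availability / 100) * MINUTES_PER_YEAR       # Budget (m/yr)
    too_big = budget * single_risk_threshold_pct / 100                # Too Big Threshold (m/yr)
    accepted = sum(r.bad_minutes_per_year for r in risks if r.name in accepted_names)
    unallocated = budget - accepted                                   # Unallocated Budget (m/yr)

    shading = {}
    for r in sorted(risks, key=lambda r: r.bad_minutes_per_year, reverse=True):
        if r.name in accepted_names:
            shading[r.name] = "blue"      # accepted: planned to burn budget
        elif r.bad_minutes_per_year > too_big:
            shading[r.name] = "red"       # a single risk too big to accept
        elif r.bad_minutes_per_year > unallocated:
            shading[r.name] = "amber"     # would not fit in what's left (assumed rule)
        else:
            shading[r.name] = "green"     # acceptable
    return budget, accepted, unallocated, shading
```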
Understanding Risk Stack Rank in Practice
Remember the risks we entered in the Risk Catalog, together with their metrics? The Risk Stack Rank takes those risks and ranks them by bad mins/year.
In this subsection, assume that we have a three-nines availability target (99.9%); we have two red-shaded (unacceptable) risks, and the others are green-shaded (acceptable).
Let's walk through the scenarios below to see this in action.
Accepting a Risk that is Red or Amber-shaded
Say that our threshold of unacceptability for an individual risk is 25% of the error budget.
We can see from above that accepting “Third-party dependency failure” causes some green-shaded risks to turn amber (should not be accepted). This happens because the accepted risk already consumes a sizable share of the error budget, leaving the remaining risks to endanger what is left of it.
Say we accept more of the risks that will consume our error budget.
The diagram above shows that more risks are now amber-shaded. This means we have to act on these risks to bring down their bad mins/year. We’ll discuss this in the Improving our Risk Stack Rank section.
Ideal Situation
We can start by accepting the green risks and see how they consume our error budget in this sheet.
From the figure above, we can see that we have accepted 519.62 out of the 525.96 minutes in our error budget.
Unaccepted Risks
When we accepted these risks (marked "y"), we agreed to take them on without requiring any mitigation actions. They are now risks that will burn our error budget.
But how about unaccepted risks that are in the red or amber-shaded? What do we do with them?
If we do accept them, the sheet will show that we have breached our error budget.
We can see that our Unallocated Budget section has reached a negative value.
These red and amber-shaded risks are the ones that require mitigation; they must be acted upon so that they do not endanger our Error Budget.
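Continuing the sketch above with made-up numbers (not our real catalog), accepting a single red or amber risk is enough to drive the Unallocated Budget negative:

```python
# Illustration only: hypothetical risks and the effect of accepting a big one.
risks = [
    Risk("Outage in an Availability Zone", ettd_min=5, ettr_min=115, pct_users=100, ettf_days=60),
    Risk("Code deployment misconfiguration", ettd_min=5, ettr_min=15, pct_users=50, ettf_days=14),
    Risk("Disk failure", ettd_min=5, ettr_min=25, pct_users=20, ettf_days=30),
]
budget, accepted, unallocated, shading = stack_rank(
    risks, accepted_names={"Outage in an Availability Zone"})
print(round(budget, 2), round(accepted, 1), round(unallocated, 1))
# 525.96 730.5 -204.5  -> the accepted risk alone breaches the yearly budget
```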
How to implement Error Budgets in practice
Now that we have a Service Level Objective (SLO) and an Error Budget, let's enforce them! It is important that everyone in the organization is aware of this policy, especially the engineering and product teams.
This serves as a baseline for deciding whether we can release new features or deploy a hotfix when the Error Budget is nearing its limit or has been breached.
To simplify things in this article, we present three severity tiers – Tier 1, Tier 2 and Tier 3 – each with a Call to Action (CTA).
How do we know which is which? Again, quantifying this is crucial in understanding the criticality of an issue.
Tier 1
Description: There is a depletion of Error Budget within X days (e.g. 14 days) and the Error Budget percentage is still within acceptable status.
CTA: Acknowledgement is required and the SRE team will notify the application team.
Tier 2
Description: The Error Budget has depleted to Y% (e.g. 50%) within 28 days, and the Error Budget percentage is in warning status.
CTA: Halt releases and P0 issues or security fixes until the SLO recovers; set up a dedicated team to investigate, and have the SRE team highlight the issue to the application team.
Tier 3
Description: There is a major depletion of Error Budget within X days (e.g. 2 days) OR the Error
Author: Hakim Zulkhibri