100% uptime is an unrealistic target for any service: each marginal improvement in reliability becomes exponentially more expensive than the last. This talk highlights how Google Cloud uses SLOs and Error Budgets to provide uptime guarantees for our external customers while still launching new products and features aggressively and remaining cost competitive at scale.
Google SRE uses the notion of a Service Level Objective (SLO) to define how fast, reliable, or available a service needs to be. Attainment is measured using Service Level Indicators (SLIs), for example p99 latency. SLOs should be ambitious but achievable, e.g. p99 latency under 250 ms for 99% of the time over a trailing 30-day window. The complement of a product's SLO (100% minus the target) is the amount of acceptable errors, which Google SRE refers to as an “Error Budget”.
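To make the arithmetic concrete, here is a minimal sketch of how an SLO target translates into an Error Budget. The function names and the choice to express the budget in minutes are assumptions made for illustration, not Google's tooling.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Return the fraction of the window allowed to miss the SLO (1 - target)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return how many minutes in the trailing window may violate the SLO."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return error_budget_fraction(slo_target) * window_minutes


if __name__ == "__main__":
    # A 99% target over 30 days leaves roughly 432 minutes of budget;
    # tightening the target to 99.9% shrinks that to roughly 43 minutes.
    for target in (0.99, 0.999):
        print(f"{target:.3%} target -> {allowed_bad_minutes(target):.1f} bad minutes")
```

Each additional "nine" cuts the budget by a factor of ten, which is one way to see why the last increments of reliability are so costly.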
Error Budgets give SRE the freedom to take risks and “spend” their budget on moving fast: rolling out new software, releasing features, absorbing inevitable hardware and network failures, and redesigning or refactoring code. Once an Error Budget is exhausted, feature releases are “frozen”. During a freeze, developers and SRE focus on reliability improvements until the Error Budget has been refilled: standardizing infrastructure, improving testing frameworks, and building frameworks for safe releases and rollbacks.
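The spend-then-freeze policy can be sketched as a simple check over a measurement window. This is an illustrative model only: the `SloWindow` class, its fields, and the rule "freeze when no budget remains" are assumptions, and a real policy would likely involve more nuance.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Event counts for one service over the trailing SLO window (hypothetical)."""
    slo_target: float   # e.g. 0.99 means 99% of events must be good
    total_events: int   # all events observed in the window
    bad_events: int     # events that violated the SLI threshold

    def budget_events(self) -> float:
        """Number of bad events the window can absorb before the SLO is missed."""
        return (1.0 - self.slo_target) * self.total_events

    def budget_remaining(self) -> float:
        """Fraction of the budget still unspent; negative means overspent."""
        budget = self.budget_events()
        if budget == 0:
            # No traffic or a 100% target: any bad event exhausts the budget.
            return 1.0 if self.bad_events == 0 else 0.0
        return 1.0 - self.bad_events / budget

    def releases_frozen(self) -> bool:
        """Feature releases freeze once the budget is fully spent."""
        return self.budget_remaining() <= 0.0


if __name__ == "__main__":
    healthy = SloWindow(slo_target=0.99, total_events=1_000_000, bad_events=4_000)
    overspent = SloWindow(slo_target=0.99, total_events=1_000_000, bad_events=12_000)
    print(healthy.budget_remaining(), healthy.releases_frozen())
    print(overspent.budget_remaining(), overspent.releases_frozen())
```

As old bad events age out of the trailing window, the remaining budget rises again, which corresponds to the budget being "refilled".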
SLOs should be product specific. Some products are more sensitive to downtime, e.g. the ad serving stack, while others can tolerate plenty of errors or latency. Breaking apart monolithic services allows for more granular SLOs. Freeze only what needs to be frozen, and let all other releases flow!
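A rough sketch of what granular, per-product SLOs buy: with the same number of bad events, a strict service exhausts its budget while a tolerant one does not, so only the former is frozen. The service names, targets, and counts below are invented for illustration.

```python
from typing import Dict, List, NamedTuple


class ServiceStats(NamedTuple):
    """Per-service SLO target and event counts for the trailing window."""
    slo_target: float
    total_events: int
    bad_events: int


def frozen_services(stats: Dict[str, ServiceStats]) -> List[str]:
    """Return the names of services whose bad events exceed their own budget."""
    frozen = []
    for name, s in stats.items():
        budget = (1.0 - s.slo_target) * s.total_events
        if s.bad_events > budget:
            frozen.append(name)
    return sorted(frozen)


if __name__ == "__main__":
    # Hypothetical services: identical traffic and error counts, different targets.
    stats = {
        "ad-serving": ServiceStats(slo_target=0.9999, total_events=1_000_000, bad_events=150),
        "batch-reports": ServiceStats(slo_target=0.99, total_events=1_000_000, bad_events=150),
    }
    # Only the stricter service is over budget; the other keeps releasing.
    print(frozen_services(stats))
```

Had both services shared one monolithic SLO, a single overspent budget would have frozen releases for everything behind it.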