In 2016 we announced a new discipline at Google, Customer Reliability Engineering, an offshoot of Site Reliability Engineering (SRE). Our goal with CRE was (and still is) to create a shared operational fate between Google and our Google Cloud customers, to give you more control over the critical applications you’re entrusting to us. Since then, here on the Google Cloud blog, we’ve published a wealth of resources to help you take the best practices we’ve learned from SRE teams at Google and apply them in your own environments.
Below, in one convenient location, is the complete list of CRE life lessons posts we’ve published over the past five years.
- Know thy enemy: How to prioritize and communicate risks
- How to avoid a self-inflicted DDoS Attack
- Using load shedding to survive a success disaster
- Available . . . or not? That is the question
- SLOs, SLIs, SLAs, oh my
- Building good SLOs
- Consequences of SLO violations
- An example escalation policy
- Applying the escalation policy
- Defining SLOs for services with dependencies
- Tune up your SLI metrics
- Learning—and teaching—the art of service-level objectives
- Using deemed SLIs to measure customer reliability
- Why should your app get SRE support?
- How SREs find the landmines in a service
- Making the most of an SRE service takeover
- Shrinking the impact of production incidents using SRE principles
- Shrinking the time to mitigate production incidents
By: The Google Cloud editorial team
Source: Google Cloud Blog