Lessons learned from two decades of Google SRE

Updated on 2024-01-28

Foreword

A lot can happen in twenty years, especially when you're busy growing.

Twenty years ago, Google had a pair of small data centers, each with a few thousand servers, connected to each other in a ring by a pair of 2.4G network links. We ran our private cloud (although we didn't call it that at the time) with Python scripts such as "Assigner", "Autoreplacer", and "Babysitter", which operated on configuration files full of individual server names. We ran a small machine database (MDB) that helped organize and store information about individual servers. Our small team of engineers used scripts and configuration files to automate common fixes and to reduce the manual labor required to manage our fleet of servers.

As time passed, Google's users came for Search and stayed for the free GB of Gmail, and our fleet and network grew with them. Today, in terms of compute capacity, we are more than 1,000 times larger than we were 20 years ago; in terms of network, we are more than 10,000 times larger, and we spend far less effort per server than we used to, while our service stack is far more reliable. Our tools have evolved from a collection of Python scripts, to ecosystems of services, to a unified platform that offers reliability by default. Our understanding of the problems and failure modes of distributed systems has also evolved as we encountered new types of outages. We created the Wheel of Misfortune [1], we wrote the Service Best Practices Guide [2], we published Google's greatest hits, and today we are pleased to present:

Benjamin Treynor Sloss, founder of Google SRE

Lessons learned from two decades of reliability engineering

Let's start in 2016, when YouTube was serving your favorite videos such as Adele's "Carpool Karaoke" and the ever-catchy "Pen-Pineapple-Apple-Pen". YouTube experienced a 15-minute global outage due to a bug in its distributed memory caching system, which disrupted YouTube's ability to serve videos. Here are the lessons we learned from this incident.

1. The riskiness of a mitigation should scale with the severity of the outage

There's a joke about a guy who posts a picture of a spider found in his home on the Internet, with the caption "Time to move to a new house!" The joke is that the mitigation (abandoning your current home and moving to a new one) is far more severe than the event that triggered it (seeing a scary spider). In SRE we've also had some interesting experiences with choosing a mitigation that is riskier than the outage it is meant to resolve. During the YouTube outage mentioned above, a risky load-shedding process didn't resolve the outage... instead, it caused a cascading failure.

We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate to that severity. In the best case, a risky mitigation resolves the outage. In the worst case, the risky mitigation misfires and the outage is prolonged by the very thing that was supposed to fix it. Additionally, if everything is broken, you can make an informed decision to bypass standard procedures.
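
One way to make this concrete is a playbook that tags each mitigation with its own risk level and only surfaces options whose risk is justified by the current severity. The levels and entries below are hypothetical, purely for illustration:

```python
from dataclasses import dataclass

# Illustrative only: shared scale for outage severity and mitigation risk, lowest to highest.
LEVELS = ["low", "medium", "high"]

@dataclass
class Mitigation:
    name: str
    risk: str  # one of LEVELS

# Hypothetical playbook entries; real ones would come from your runbooks.
PLAYBOOK = [
    Mitigation("roll back the last config push", risk="low"),
    Mitigation("fail over to a secondary region", risk="medium"),
    Mitigation("shed 50% of incoming load", risk="high"),
]

def acceptable_mitigations(outage_severity: str) -> list[Mitigation]:
    """Only allow mitigations whose own risk does not exceed the outage severity."""
    budget = LEVELS.index(outage_severity)
    return [m for m in PLAYBOOK if LEVELS.index(m.risk) <= budget]

if __name__ == "__main__":
    # For a low-severity outage, only the low-risk option is on the table.
    for m in acceptable_mitigations("low"):
        print("candidate:", m.name)
```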

2. Recovery mechanisms should be fully tested before an emergency

An emergency fire evacuation in a tall city building is a terrible time to use a ladder for the first time. Similarly, an outage is a terrible time to try a risky load-shedding process for the first time. To stay calm in high-risk, high-stress situations, it is important to practice recovery mechanisms and mitigations beforehand and verify that:

They do what you need them to do.

You know how to perform them.

Testing recovery mechanisms also has the useful side effect of reducing the risk of performing these actions in the first place. Since that messy outage, we've doubled down on testing.
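
One way to keep recovery mechanisms rehearsed is to run them on a schedule (for example against a staging environment) and alert when they stop working, so that the first real use is never during an outage. A minimal sketch, with the actual procedure left as a placeholder:

```python
import datetime

def run_recovery_procedure() -> bool:
    """Placeholder for the real procedure, e.g. draining traffic from a staging cluster."""
    # In a real drill this would execute the same steps you'd use in an emergency.
    return True

def run_recovery_drill() -> bool:
    """Exercise the recovery procedure end to end and report whether it still works."""
    started = datetime.datetime.now(datetime.timezone.utc)
    ok = run_recovery_procedure()
    print(f"{started.isoformat()} recovery drill {'passed' if ok else 'FAILED'}")
    if not ok:
        # Alert a human: the procedure you were counting on no longer works.
        pass
    return ok

if __name__ == "__main__":
    run_recovery_drill()
```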

3. Canary all changes

At one point, we wanted to push out a cache configuration change. We were pretty sure it wouldn't lead to anything bad. But "pretty sure" is not 100% sure. As it turned out, caching was a fairly critical feature for YouTube, and the configuration change had some unintended consequences that brought the service down completely for 13 minutes. Had we adopted a progressive rollout strategy and canaried those global changes [3], this failure would have been contained before it had global impact. Read more about canary strategies here [4], and learn more in [5].
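
As a rough illustration of a progressive rollout, the sketch below pushes a change to increasing slices of a hypothetical fleet and stops if a health signal degrades. The stage sizes, the error-rate threshold, and the helper functions are all assumptions made for the example:

```python
import time

# Hypothetical rollout stages: percentage of the fleet receiving the new config.
STAGES = [1, 5, 25, 100]
ERROR_RATE_THRESHOLD = 0.01  # assumed acceptable error rate for this change

def check_error_rate(percent_rolled_out: int) -> float:
    """Placeholder for a real metrics query (e.g. errors / requests over a window)."""
    return 0.002

def apply_config(percent: int) -> None:
    """Placeholder for whatever actually pushes the config to `percent` of tasks."""
    print(f"config applied to {percent}% of the fleet")

def rollback() -> None:
    print("rolling back to the previous config")

def canary_rollout() -> bool:
    for percent in STAGES:
        apply_config(percent)
        time.sleep(1)  # in practice: long enough for the signal to be meaningful
        if check_error_rate(percent) > ERROR_RATE_THRESHOLD:
            rollback()
            return False  # contained before global impact
    return True

if __name__ == "__main__":
    print("rollout completed" if canary_rollout() else "rollout aborted at canary stage")
```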

Around the same time, YouTube's slightly younger sibling, Google Calendar, also experienced an outage, which is the backdrop for the next two lessons.

4. Have a "big red button"

A "big red button" is a unique but highly practical safety feature: it should kick off a simple, easy-to-trigger action that reverts whatever triggered the undesirable state and (ideally) shuts down whatever is happening. "Big red buttons" come in many shapes and sizes, and it's important to identify what those big red buttons might be before you commit a potentially risky action. We once came very close to a major outage, but luckily the engineer who submitted the potentially triggering change unplugged their desktop computer before the change could propagate. So when planning a major rollout, ask yourself: what is my "big red button"? Make sure every service dependency has a "big red button" to use in an emergency. For more, see "Generic mitigations" [6]!
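
One common shape for a "big red button" is a kill switch or feature flag that anyone on call can flip to revert to known-good behavior. The sketch below is a hypothetical illustration (the flag name and lookup functions are invented, not from the article):

```python
# Minimal sketch of a kill switch guarding a risky code path. The flag store is an
# in-memory dict here; in a real system it would live in an independent control plane
# so the button still works when the new code is misbehaving.
FLAGS = {"new_cache_config_enabled": True}

def big_red_button() -> None:
    """One simple, easy-to-trigger action that reverts to the known-good behavior."""
    FLAGS["new_cache_config_enabled"] = False
    print("big red button pressed: new cache config disabled everywhere")

def new_cache_lookup(key: str) -> str:
    return f"new:{key}"   # the risky new path

def old_cache_lookup(key: str) -> str:
    return f"old:{key}"   # the boring, known-good path

def lookup(key: str) -> str:
    if FLAGS["new_cache_config_enabled"]:
        return new_cache_lookup(key)
    return old_cache_lookup(key)

if __name__ == "__main__":
    print(lookup("video-123"))
    big_red_button()
    print(lookup("video-123"))
```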

5. Unit tests alone are not enough - integration testing is also needed

Ah... unit tests. They verify that individual components perform the way we need them to. Unit tests deliberately limit the scope of what they test, which makes them useful, but it also means they don't fully replicate the runtime environment and possible production demands. That's why we're big advocates of integration testing! We use integration tests to verify that jobs and tasks can perform a cold start. Will things work the way we want them to? Will components work together the way we want them to? Will these components successfully create the system we want? We learned this lesson during a Calendar outage, where our testing didn't follow the same path as real usage, resulting in plenty of testing... that didn't help us assess how a change would perform in the real world.
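
To make the distinction concrete, here is a toy example (invented components, not YouTube's or Calendar's code): the unit test checks a single component in isolation, while the integration test wires the components together and verifies a cold start.

```python
import unittest

class Cache:
    """Toy cache component."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def put(self, key, value):
        self.store[key] = value

class VideoService:
    """Toy service that depends on the cache."""
    def __init__(self, cache: Cache):
        self.cache = cache
    def serve(self, video_id: str) -> str:
        cached = self.cache.get(video_id)
        if cached is None:
            cached = f"bytes-for-{video_id}"   # pretend origin fetch
            self.cache.put(video_id, cached)
        return cached

class UnitTests(unittest.TestCase):
    def test_cache_roundtrip(self):
        # Narrow scope: one component, no runtime environment involved.
        cache = Cache()
        cache.put("a", "1")
        self.assertEqual(cache.get("a"), "1")

class IntegrationTests(unittest.TestCase):
    def test_cold_start_serves_video(self):
        # Wire the components together from scratch, as production would,
        # and verify the system works with an empty (cold) cache.
        service = VideoService(Cache())
        self.assertEqual(service.serve("v1"), "bytes-for-v1")

if __name__ == "__main__":
    unittest.main()
```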

Moving on to an incident that occurred in February 2017, we find the next two lessons.

First, unavailable OAuth tokens caused millions of users to be logged out of devices and services, and 32,000 OnHub and Google WiFi devices to perform factory resets. Manual account recovery requests increased tenfold because of failed logins. It took Google about 12 hours to fully recover from the outage.

6. Communication channels! And backup channels!! And backups for those backup channels!!!

Yes, it was a bad time. Do you want to know what made it worse? Teams expected to be able to use Google Hangouts and Google Meet to manage the incident. But with 350 million users logged out of their devices and services... relying on these Google services was, in retrospect, a poor choice. Make sure you have backup communication channels that don't share those dependencies and that you have tested them.

Then, the same 2017 incident gave us a better appreciation for graceful degradation [7].

7. Intentionally degrade performance modes

It's easy to think of availability as either "fully up" or "fully down"... but being able to keep offering a minimum level of functionality through a degraded performance mode helps provide a more consistent user experience. So we've built degraded performance modes carefully and deliberately, so that during periods of instability users may not even notice (it might be happening right now!). Services should degrade gracefully and continue to function under exceptional circumstances.
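
A minimal sketch of what a degraded mode can look like in code, assuming a hypothetical recommendations backend and a generic fallback list: when the dependency is unavailable, the handler answers with something reduced but useful rather than an error.

```python
# Generic, pre-computed fallback content used only in degraded mode.
STALE_RECOMMENDATIONS = ["popular-video-1", "popular-video-2"]

class BackendUnavailable(Exception):
    pass

def fetch_personalized_recommendations(user_id: str) -> list[str]:
    # Pretend the recommendation backend is overloaded right now.
    raise BackendUnavailable()

def recommendations_handler(user_id: str) -> dict:
    try:
        items = fetch_personalized_recommendations(user_id)
        return {"items": items, "degraded": False}
    except BackendUnavailable:
        # Degraded mode: serve a generic, cached list rather than an error page.
        # Most users never notice; the service keeps functioning.
        return {"items": STALE_RECOMMENDATIONS, "degraded": True}

if __name__ == "__main__":
    print(recommendations_handler("user-42"))
```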

The next lesson is a recommendation to ensure that your last line of defense works as expected in extreme scenarios, such as a natural disaster or cyberattack that causes a loss of productivity or service availability.

8. Test for disaster resilience

In addition to unit and integration tests, there are other kinds of very important tests: disaster resilience and recovery tests. Disaster resilience tests verify that your service or system can continue to operate in the face of faults, latency, or outages, while recovery tests verify that your service can return to a healthy state after a complete shutdown. Both, or "surviving the unexpected" [8], should be a key part of your business continuity strategy. A useful activity can also be to sit down with your team and work through how some of these scenarios could theoretically play out, tabletop-game style. It can also be a fun opportunity to explore those terrifying "what ifs", for example: "What if part of your network connectivity gets shut down unexpectedly?"
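
Beyond tabletop exercises, some of this can be checked in code: inject the failure and assert the service keeps working, then shut it down completely and assert it recovers. A toy sketch with invented components:

```python
import unittest

class Service:
    """Toy service with an injectable dependency, for illustration only."""
    def __init__(self):
        self.running = False
        self.dependency_up = True
    def start(self):
        self.running = True
    def stop(self):
        self.running = False
    def handle(self) -> str:
        if not self.running:
            raise RuntimeError("service is down")
        # Resilience: still answer (degraded) when the dependency is lost.
        return "full response" if self.dependency_up else "degraded response"

class DisasterTests(unittest.TestCase):
    def test_resilience_during_dependency_outage(self):
        svc = Service()
        svc.start()
        svc.dependency_up = False          # inject the disaster
        self.assertEqual(svc.handle(), "degraded response")

    def test_recovery_after_full_shutdown(self):
        svc = Service()
        svc.start()
        svc.stop()                         # complete shutdown
        svc.start()                        # recovery procedure
        self.assertEqual(svc.handle(), "full response")

if __name__ == "__main__":
    unittest.main()
```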

9. Automate your mitigations

In March 2023, multiple network devices in several data centers failed almost simultaneously, resulting in widespread packet loss. During this 6-day outage, an estimated 70% of the services were affected to varying degrees based on location, service load, and configuration at the time of the network failure.

In cases like this, you can reduce your mean time to resolution (MTTR) by taking mitigating actions automatically. If there is a clear signal that a particular failure is occurring, why can't the mitigation be kicked off in an automated way? Sometimes it is better to apply an automated mitigation first and save the root-causing for after user impact has been avoided.
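
A minimal sketch of the idea, with a placeholder metric source, threshold, and drain action (none of them from the article): watch one unambiguous signal and trigger the mitigation automatically, leaving root-causing for afterwards.

```python
import time

PACKET_LOSS_THRESHOLD = 0.05   # assumed "clear signal" threshold for this sketch

def read_packet_loss(region: str) -> float:
    """Placeholder for a real monitoring query."""
    return 0.09

def drain_region(region: str) -> None:
    """Placeholder mitigation: shift traffic away from the unhealthy region."""
    print(f"automatically draining traffic from {region}")

def watch_and_mitigate(region: str, interval_s: float = 60.0, once: bool = True) -> None:
    while True:
        loss = read_packet_loss(region)
        if loss > PACKET_LOSS_THRESHOLD:
            # Mitigate first to cut user impact; humans root-cause afterwards.
            drain_region(region)
        if once:
            break
        time.sleep(interval_s)

if __name__ == "__main__":
    watch_and_mitigate("region-a")
```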

10. Reduce the time between rollouts, to decrease the likelihood of the rollout going wrong

In March 2022, a widespread failure in the payments system prevented customers from completing transactions and led to the postponement of the Pokémon GO Community Day. The cause was the removal of a single database field, which should have been safe since all uses of that field had been removed from the code beforehand. Unfortunately, a slow rollout cadence in one part of the system meant that the field was still being used by live systems.

With long delays between rollouts, especially in complex, multi-component systems, it becomes difficult to reason about the safety of a particular change. Frequent releases [9], combined with proper testing, lead to fewer surprises of this kind.

11. A single global hardware version is a single point of failure

Performing critical functions with only one specific model of equipment simplifies operation and maintenance. However, this means that if there is a problem with the model, that critical function is no longer performed.

This happened in March 2020, when a network device with an undiscovered zero-day vulnerability was hit by a change in traffic patterns that triggered the vulnerability. Because the same model and version of the device was used throughout the network, a substantial regional outage followed. What prevented a total outage was the presence of multiple network backbones, which allowed high-priority traffic to be routed through alternative devices that were still working.

Potential vulnerabilities in critical infrastructure can lurk undetected until a seemingly innocuous event triggers them. Maintaining a diverse infrastructure, while it has costs of its own, can mean the difference between a troublesome outage and a total one.

And that's it! Eleven lessons learned from two decades of reliability engineering at Google. Why eleven lessons? Well, Google Site Reliability Engineering, with its long history, is still in its prime.

References

[1] The Wheel of Misfortune
[2] Service Best Practices Guide
[3] Canaried those global changes
[4] Read more about canary strategies here
[5] Learn more in **
[6] "Generic mitigations"
[7] Graceful degradation
[8] "Surviving the unexpected"
[9] Frequent releases

Author: Adrienne Walcer | Translation: Dongfeng Weiming Technology Blog (ID: ewhisperCN)
