How to Use Fault Root Cause Analysis to Quickly Locate the Cause of a Fault?

Mondo Cars Updated on 2024-02-01

Background

Change is widely recognized as a key source of instability in online environments, and studies have shown that 70% of online failures are triggered by some kind of change. So when an environment raises a Down alert, the administrator's first instinct is to ask whether anything changed recently. Answering that question usually means manually looking up the change history and checking which changes are planned next, which is a cumbersome and inefficient process.

Another common cause of failures is the load and saturation of the infrastructure hosting the service, which affects the service's capacity and performance.

We want to be able to analyze the environment and determine whether an alarm stems from a change or from load on the system. The results should be presented as an intuitive topology that shows the relationships between services, the middleware and infrastructure they depend on, and where changes or exceptions have occurred, as shown in the figure below:

In addition, it can intelligently link the service call chains around the alarming service and analyze the possible causes of the abnormality.

This capability is the EasyOps platform's fault root cause analysis. Let's take a look at how to configure and use it, and what the diagram represents.

Practice

First, define the SLI of the service. We choose the detect code as the SLI for service availability: if the detect code is not 0, the service is considered unavailable. When that happens, the alarm system triggers a high-severity alarm, which is sent to the administrator.
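The rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the EasyOps implementation: the `ProbeResult` and `Alarm` types, the `evaluate_sli` function, and the level label are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label for the alarm level; the platform's real value may differ.
ALARM_LEVEL = "severity"


@dataclass
class ProbeResult:
    """One dial test result for a service (hypothetical structure)."""
    service: str
    detect_code: int  # 0 means the probe succeeded


@dataclass
class Alarm:
    """An alarm raised for an unavailable service (hypothetical structure)."""
    service: str
    level: str
    detect_code: int


def evaluate_sli(result: ProbeResult) -> Optional[Alarm]:
    """Return an Alarm when the detect code is non-zero, otherwise None."""
    if result.detect_code == 0:
        return None
    return Alarm(result.service, ALARM_LEVEL, result.detect_code)
```

Calling `evaluate_sli(ProbeResult("http-service", 0))` returns `None`, while any non-zero detect code yields an `Alarm` for that service.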

This SLI is already built into the platform and requires no additional configuration. All we need to do is define the dial test collection policy and the alarm rules. For example:

Note: The selected alarm resource type must be a model under the Service model, in this case the HTTP service. By design, the platform performs root cause analysis only for service resources.

With just a simple two-step configuration, you can make root cause analysis possible!

Effect Interpretation

Once an HTTP service sends an alarm, we can jump to the root cause analysis by clicking [Fault Analysis].

Take the diagram at the beginning as an example:

In the figure above, the service marked in red is the alarming service. Below it are the middleware and called services around it, along with the relationships between these services. The lowest layer of the topology is the infrastructure, that is, the hosts.

From this topology, we can see that the probable cause of the failure is a change made on the two operating system hosts. Combined with the propagation diagram on the right, we can further pin down when the change and the fault occurred.
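To make the idea of reading suspects off the topology concrete, here is a minimal sketch that walks the dependency graph from the alarming service and collects every dependency flagged with a change or exception. It is not how EasyOps computes its result; the node names, the `topology` and `events` dictionaries, and the `find_suspects` function are assumptions made up for illustration.

```python
from collections import deque


def find_suspects(topology, events, root):
    """Breadth-first walk from the alarming service, collecting flagged nodes.

    topology maps a node to the nodes it depends on;
    events maps a node to a flag such as "change" or "exception".
    """
    seen = {root}
    queue = deque([root])
    suspects = []
    while queue:
        node = queue.popleft()
        # The root is the alarming service itself, so only report its dependencies.
        if node != root and events.get(node):
            suspects.append((node, events[node]))
        for dep in topology.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return suspects


# Hypothetical topology: service -> middleware / called service -> hosts.
topology = {
    "http-service": ["middleware-a", "called-service-b"],
    "middleware-a": ["host-1"],
    "called-service-b": ["host-2"],
}
events = {"host-1": "change", "host-2": "change"}

suspects = find_suspects(topology, events, "http-service")
```

With this sample data, `suspects` contains the two hosts flagged with a change, mirroring the case in the article where both hosts at the bottom layer were changed.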

As can be seen from the figure above, the change occurred at 22:03:30 on January 18 and the fault occurred at 22:04:09 on the same day, 39 seconds later, so the change is the most likely cause of the fault. In this case, a defective ** package was indeed released to the production environment at the time of the change, which made the service unavailable.
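The timing argument can be checked with a short sketch. It assumes a simple heuristic: a change is a plausible trigger if it happened shortly before the fault. The `change_precedes_fault` function and the 10-minute window are hypothetical, and the year 2024 is a placeholder, since the article gives only the month, day, and time.

```python
from datetime import datetime, timedelta


def change_precedes_fault(change_at, fault_at, window=timedelta(minutes=10)):
    """Return True if the change happened before the fault and within the window."""
    delta = fault_at - change_at
    return timedelta(0) <= delta <= window


# Times from the example; the year is an assumed placeholder.
change_at = datetime(2024, 1, 18, 22, 3, 30)
fault_at = datetime(2024, 1, 18, 22, 4, 9)

gap = fault_at - change_at
is_suspect = change_precedes_fault(change_at, fault_at)
```

Here `gap` is 39 seconds and the change falls inside the window, so it is flagged as a suspect. Proximity in time suggests a cause rather than proving one, which is why the article confirms it against the actual release.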

Once the cause of the failure is clear, the administrator can quickly decide on the next steps, such as rolling back promptly to shorten the repair time and improve MTTR.
