Fault Review in Event Management for Digital Operations Service Management (ITSM), Part 3

Mondo Technology Updated on 2024-01-30

Author: Qin Honglin, Zilingyun, CGO and head of SaaS.

The previous article, "Fault Review (2) in Event Management in Digital Operations Service Management (ITSM): How to Organize Fault Review Efficiently", focused on how to organize a fault review efficiently, so that no failure is a wasted opportunity to learn. Effective organization includes setting the rules and templates for the review, securing active participation from the team, walking through the whole failure along its timeline, analyzing causes and solutions in depth, assigning responsibility, looking for improvement opportunities from every angle, reaching consensus, deciding on improvement measures, and following up on continuous improvement.

In this article, we focus on common pitfalls in fault review and provide a checklist, to improve the quality and effectiveness of fault reviews along several dimensions.

The following common problems may arise during a fault review:

1.Focus on the solution, not the cause

The main purpose of a fault review is to analyze why the failure happened, not just to find a fix. The review should pay more attention to the root cause of the problem and make it the starting point of the analysis.

The specific problem may be only a surface symptom. Employees are mostly executors within a larger system, so when execution falls short, there is usually an imperfection or loophole in how the system was designed. Only by digging down to the root cause can both the symptoms and the underlying problem be cured.

This requires analyzing the root cause of the failure layer by layer. The classic technique is the 5 Whys method, also known as Toyota's "Five Whys": ask "why" five times in succession, and the essence of the problem and its solution become clear.

2.The analysis of the process is neglected

During troubleshooting, the team may rush to respond and locate the problem quickly, yet fail to report or escalate in time. That is also a problem, and the review should examine the handling process and improve it.

3.Inadequacy or confusion of participants

Involving enough team members in the review and letting them collaborate helps produce a comprehensive analysis of the problem and a sound solution. Participants should be able to discuss in depth, without fear of blame and without distractions. If a manager takes part, the manager should listen more and speak less in the early stage, join as an ordinary participant, and encourage everyone to speak.

At the same time, the review must not turn into an exercise in accountability or punishment, which is a heavy blow to team atmosphere and employee motivation.

4.The locality of the solution

A practical fix may be available, but be aware that while it solves the current problem, it may not address the root cause of that class of problem.

5.Implement solutions without evaluating the costs and benefits

Often, the most direct and fastest solution brings higher costs and limitations. Before developing and implementing a solution, consider everything else it affects, and assess the cost and scope of the measures to be taken.

6.Failure to identify key success factors and performance indicators

Identify the key factors that make success measurable, such as customer satisfaction, response time, and ticket closure speed, then monitor these metrics continuously and adjust them as needed.
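As an illustration of monitoring such metrics, the following minimal sketch computes the mean response time and mean closure time from a list of tickets. The ticket fields (`opened`, `first_response`, `closed`) and the sample data are assumptions made for this example and do not come from any particular ITSM tool.

```python
from datetime import datetime
from statistics import mean

# Hypothetical ticket records; the field names are illustrative only.
tickets = [
    {
        "opened": datetime(2024, 1, 1, 9, 0),
        "first_response": datetime(2024, 1, 1, 9, 10),
        "closed": datetime(2024, 1, 1, 11, 0),
    },
    {
        "opened": datetime(2024, 1, 2, 14, 0),
        "first_response": datetime(2024, 1, 2, 14, 30),
        "closed": datetime(2024, 1, 2, 18, 0),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


mean_response = mean_minutes(tickets, "opened", "first_response")
mean_closure = mean_minutes(tickets, "opened", "closed")

print(f"Mean response time: {mean_response:.1f} min")
print(f"Mean closure time:  {mean_closure:.1f} min")
```

Tracking the same figures before and after an improvement measure is one simple way to check whether the measure had the intended effect.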

7.Insufficient guidance and service support

Teams that lack guidance and support tend to underperform, struggle to stand out from other teams, and have difficulty coping with the problem at hand. A meaningful fault review needs guidance that clearly explains the process, scope, and reporting procedures, so that team members understand the purpose of the review and can offer high-quality recommendations.

8.Improvement measures are not in place

Some improvement measures are never fully implemented because they do not follow the SMART principles, lack timely follow-up, receive too little attention from the parties involved, or run into organizational obstacles (some improvement measures involve third parties outside of O&M).

Improvement measures should comply with the SMART principles. On top of SMART, they can also be supplemented by the 5W1H principle:

Who is responsible for the improvement item? Several people may be involved, but there can be only one person in charge, and that person is fully responsible for implementing the improvement item.

What is the status of the follow-up improvement? Is it in preparation, in progress, or completed?

In addition to proposing improvements, it is also important to have closed-loop management of improvement measures, including the use of the PDCA cycle and the RACI matrix tool.
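The follow-up described above can be sketched as a small tracking structure. This is only an illustration under stated assumptions: the `ImprovementItem` class, its fields, the sample items, and the owner names are hypothetical. The three status values come from the text.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    IN_PREPARATION = "in preparation"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"


@dataclass
class ImprovementItem:
    title: str
    owner: str  # exactly one person in charge
    deadline: date
    status: Status = Status.IN_PREPARATION


def overdue(items, today):
    """Return items that are past their deadline and not yet completed."""
    return [
        item
        for item in items
        if item.status is not Status.COMPLETED and item.deadline < today
    ]


# Hypothetical improvement items from a review.
items = [
    ImprovementItem("Add alert for queue backlog", "Alice", date(2024, 2, 15), Status.COMPLETED),
    ImprovementItem("Document escalation path", "Bob", date(2024, 2, 20), Status.IN_PROGRESS),
    ImprovementItem("Run failover drill", "Carol", date(2024, 4, 1)),
]

late = overdue(items, today=date(2024, 3, 1))
for item in late:
    print(f"OVERDUE: {item.title} (owner: {item.owner}, status: {item.status.value})")
```

Running such a check at every regular review keeps unfinished items visible until they are closed, which is the "check" step of a PDCA cycle.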

* The story of Toyota's "Five Whys":

Taiichi Ohno, then vice president of Toyota Motor Corporation, once found a production line where the machines kept stopping because a fuse had blown. The fuse was replaced promptly every time, but it soon blew again, which seriously hurt the efficiency of the whole production line. He felt that replacing the fuse did not solve the underlying problem, so he had a question-and-answer conversation with the workers.

First question: "Why did the machine stop?" Answer: "The fuse blew because of an overload."

Second question: "Why was it overloaded?" Answer: "Because the bearings were not lubricated enough."

Third question: "Why was the lubrication not enough?" Answer: "Because the lubrication pump could not draw up oil."

Fourth question: "Why could it not draw up oil?" Answer: "Because the oil pump shaft was worn and loose."

Fifth question: "Why was it worn?" Answer: "Because there was no filter, so impurities such as iron filings got in."

After five consecutive "why" questions, the real cause of the problem was found, and the solution was to install a filter on the oil pump.
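The question-and-answer chain in this story can be recorded as a simple ordered list, which makes the path from symptom to root cause explicit in a review record. The sketch below is illustrative only. The helper name `root_cause` is hypothetical, and treating the last answer as the root cause is an assumption that holds only if the questioning really went deep enough.

```python
# Each entry pairs a "why" question with its answer, in the order asked.
five_whys = [
    ("Why did the machine stop?", "The fuse blew because of an overload."),
    ("Why was it overloaded?", "The bearings were not lubricated enough."),
    ("Why was the lubrication not enough?", "The lubrication pump could not draw up oil."),
    ("Why could it not draw up oil?", "The oil pump shaft was worn and loose."),
    ("Why was it worn?", "There was no filter, so iron filings got in."),
]


def root_cause(chain):
    """Treat the answer to the last 'why' as the root cause (an assumption)."""
    if not chain:
        raise ValueError("the chain of whys is empty")
    return chain[-1][1]


for depth, (question, answer) in enumerate(five_whys, start=1):
    print(f"{depth}. {question} -> {answer}")

print("Root cause:", root_cause(five_whys))
```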

SMART Principles:

S - Specific: the improvement item must be concrete and actionable.

State exactly which items and indicators need to be improved and optimized. For example, "optimize system design" is too general, while "redesign system A's dependency on system B so that it covers exceptions" is specific.

M - Measurable: the improvement item can be measured and evaluated.

State the acceptance criteria that have been established. For example, a failure drill is used to verify that the dependency handling works.

A - Attainable: the improvement is feasible and achievable in the current technical environment.

Avoid empty, unattainable improvements, and do not write down things that are too far in the future to achieve.

R - Relevant: the improvement should be related to the other improvements.

For example, it should tie in with the other improvements arising from this fault, to avoid isolated improvements.

T - Time-bound: the improvement has a clear deadline.

State the deadline for the improvement item. The period should preferably not exceed three months, so that the improvement does not become a mere formality, and the item should be checked and accepted when the deadline arrives.
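Some of these checks can be partly automated when improvement items are recorded. The following is a minimal sketch under stated assumptions: the `Improvement` class, its fields, and the `smart_gaps` helper are hypothetical, and "three months" is approximated as 90 days. Whether an item is truly specific, attainable, and relevant still needs human judgment, so the code only flags missing fields and overlong deadlines.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

MAX_PERIOD = timedelta(days=90)  # assumption: "three months" taken as 90 days


@dataclass
class Improvement:
    description: str          # S: what exactly will be changed
    acceptance_criteria: str  # M: how completion will be verified
    owner: Optional[str]      # the single person in charge
    created: date
    deadline: Optional[date]  # T: when it must be done


def smart_gaps(item: Improvement) -> List[str]:
    """Return the formal gaps found; an empty list means none were detected."""
    gaps = []
    if not item.description.strip():
        gaps.append("S: no specific description")
    if not item.acceptance_criteria.strip():
        gaps.append("M: no acceptance criteria")
    if not item.owner:
        gaps.append("owner: no single person in charge")
    if item.deadline is None:
        gaps.append("T: no deadline")
    elif item.deadline - item.created > MAX_PERIOD:
        gaps.append("T: deadline more than three months away")
    return gaps


good = Improvement(
    description="Redesign system A's dependency on system B so that it covers exceptions",
    acceptance_criteria="A failure drill verifies the dependency handling",
    owner="Alice",
    created=date(2024, 1, 30),
    deadline=date(2024, 3, 31),
)
vague = Improvement("Optimize system design", "", None, date(2024, 1, 30), date(2024, 9, 1))

print(smart_gaps(good))
print(smart_gaps(vague))
```

Note that the vague description "Optimize system design" passes the formal check for S, which is why a reviewer still has to read each item.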

This document provides a checklist for fault review for reference:

Finally, the author also gives a real fault review record file as an example, which can also be used as a reference for the fault review template (of course, there are also parts that need to be improved):

Peter Senge once said, "In essence, human beings can only learn through trial and error", but there is no point in repeating trial and error without reflection. Only by learning to review the experience of trial and error can we grow and enter an upward spiral of success. The same is true of fault review: you must know where you went wrong, what caused the error, and what measures can be taken to improve. Only then can you avoid falling in the same place twice.

Fault review is an important task in fault management. It is an important means of improving operation and maintenance work, raising the reliability of IT services, reducing the losses and costs caused by service unavailability or degradation, lowering business risk, improving customer satisfaction, and reducing repeated and major events. It is also an important means of meeting SLAs and is of great significance for improving event management. For teams and individuals, it is an opportunity to learn and to improve individual skills and performance.

Therefore, it is necessary to continue to do a good job of fault review and regular review according to the overall requirements of fault review.
