2023 Deterministic Operations White Paper Make operations an accelerator for the transformation of t

Mondo Technology Updated on 2024-01-19

Today's featured O&M industry research report is "Deterministic O&M: Stable and Reliable Chapter, Make O&M an Accelerator for the Transformation of an Intelligent World" (report produced by HUAWEI CLOUD).

A summary of the research report follows.

The main challenges to "stability and reliability"

As services iterate rapidly under agile development, traditional O&M is challenged by fast software rollout. The boundary between O&M and R&D blurs, and the conflict between rollout speed and live-network stability becomes hard to reconcile. HUAWEI CLOUD summarizes these challenges as MATE: Messy complex (decoupled, messy, and complex systems), Active iteration (rapid iteration), Trustworthy operation (production safety), and Evolution full stack.

Messy complex (decoupling): The object of operation and maintenance is not a mature product that can be delivered in batches, but a large number of components and nodes under a microservice architecture.

Active iteration (rapid iteration): Shorter release cycles come at a cost: each version is not fully validated on the live network.

Trustworthy operation (production safety): Many people operate on the live network. Both R&D and O&M staff can access it, and personnel turn over to some degree. The blast radius is also large: automation amplifies the reach of each operation, so a single O&M action can cause wide-ranging faults.

Evolution full stack: The overall availability of the system depends on the availability of the full stack, so O&M personnel need full-stack O&M skills.
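The dependence of overall availability on the full stack can be illustrated with a small calculation. The sketch below assumes a serial dependency model in which every layer must be up for the service to be up and layers fail independently. Neither that model nor the layer names and figures come from the report; they are hypothetical values chosen for illustration.

```python
from math import prod


def serial_availability(layer_availabilities):
    """Availability of a stack in which every layer must be up at the same time.

    Assumes layers fail independently, so the overall availability is the
    product of the per-layer availabilities (each a fraction in [0, 1]).
    """
    values = list(layer_availabilities)
    for a in values:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability must be in [0, 1], got {a}")
    return prod(values)


# Hypothetical layers and figures, for illustration only.
layers = {
    "infrastructure": 0.9999,
    "platform": 0.9995,
    "middleware": 0.9995,
    "application": 0.999,
}

overall = serial_availability(layers.values())
print(f"overall availability: {overall:.4%}")
print(f"weakest single layer: {min(layers.values()):.4%}")
```

Under these assumptions the overall figure is always lower than the weakest single layer, which is one way to read the report's point that O&M cannot focus on one layer in isolation.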

Stable and reliable 1+N capability system

In the stable and reliable "1+N" system, "1" refers to standardized O&M and "N" refers to the special capabilities for stability and reliability. Standardized O&M is built on ITIL standards: establish a three-line O&M support team, define process specifications covering key O&M activities, and build a unified O&M platform. SRE transformation is then carried out on top of standardized O&M to build stability and reliability capabilities. Organized by life cycle, stability and reliability comprises six major capability domains, each containing multiple special capabilities.

Description of the stable and reliable capability maturity model

Each capability has its own maturity levels, which are described in detail separately. The comprehensive maturity model here summarizes those levels and evaluates them along three dimensions (organization, process, and tools), as shown in the following figure:

1. Basic O&M: There is no process and no tool to carry a process. O&M relies mainly on individual experts, and results are not guaranteed. O&M personnel are reactive and exhausted, changes introduce major incidents, the proportion of human-factor incidents is high, the mean time to recovery (MTTR) for major incidents is unpredictable, and production safety is highly uncertain.

2. Standardized O&M: ITIL standardized processes are introduced, but O&M personnel still respond reactively. Major incidents introduced by changes are reduced, the proportion of human-factor incidents falls, and MTTR begins to improve.

3. SRE transformation: The O&M organization adopts SRE across the board and uses software engineering methods to solve O&M problems. It builds automated operation capabilities, infrastructure high-availability capabilities, comprehensive dial-testing capabilities, emergency drill capabilities, and retrospective improvement capabilities, and it establishes a culture of quality and reliability awareness for O&M services together with a culture of post-incident review and improvement.

4. Preliminary certainty: Generalized SRE is carried out, extending O&M into the R&D organization so that both jointly protect SLO indicators. The organization designs an SLO/SLI system, uses chaos engineering to verify reliability, builds deterministic recovery and fault self-healing capabilities, participates deeply in product design and launch activities, and builds the ability to detect faults before customers do.

5. High degree of certainty: The organization builds business-oriented dynamic risk management capabilities, AIOps intelligent fault demarcation and localization capabilities, and fault self-healing capabilities. It dares to target availability higher than 99.99% and a human-factor incident rate better than the 6-sigma level.
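For a sense of scale, an availability target and a recovery-time metric such as MTTR can both be turned into concrete numbers. The following is a minimal sketch, not a method taken from the report: the function names, the 365-day period, and the sample recovery times are all assumptions made for illustration, and 99.99% ("four nines") is used as the example target.

```python
def allowed_downtime_minutes(availability_target, period_days=365):
    """Downtime budget, in minutes, implied by an availability target.

    availability_target is a fraction, e.g. 0.9999 for 99.99%.
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return (1.0 - availability_target) * period_days * 24 * 60


def mean_time_to_recovery(recovery_minutes):
    """MTTR as the arithmetic mean of per-incident recovery times."""
    if not recovery_minutes:
        raise ValueError("at least one incident is required")
    return sum(recovery_minutes) / len(recovery_minutes)


# 99.99% over a 365-day year leaves roughly 52.6 minutes of downtime.
budget = allowed_downtime_minutes(0.9999)
print(f"yearly downtime budget at 99.99%: {budget:.1f} minutes")

# Hypothetical recovery times (minutes) for three incidents.
incidents = [12.0, 30.0, 18.0]
mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr:.1f} minutes")
print(f"incidents of that length the budget can absorb: {int(budget // mttr)}")
```

Read this way, a four-nines target leaves room for only a couple of incidents per year at a 20-minute MTTR, which suggests why the higher maturity levels emphasize early detection and self-healing over manual recovery.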


The following is an excerpt from the original report:

High-availability design

High-availability design must consider multiple architectural layers: business architecture, application architecture, technical architecture, and data architecture. These layers are interrelated and together determine the availability and stability of the system. The following focuses on the technical architecture and explains how to build availability, fault tolerance, and recoverability into applications, so that the system keeps providing stable and reliable services and the impact of outages and failures is minimized.

Technical architecture: Technical architecture covers the system's technology selection, infrastructure, and tools. In a high-availability design, the technology stack and components need to be evaluated and selected so that they support the system's high availability, scalability, and fault tolerance requirements.
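One common way to evaluate a component choice against a fault tolerance requirement is to estimate how redundancy changes availability. The sketch below is an illustration rather than anything prescribed by the report. It assumes replicas fail independently and that the service stays up as long as at least one replica is up. The function names and figures are hypothetical.

```python
def redundant_availability(replica_availability, replicas):
    """Availability of N independent replicas where one surviving replica suffices.

    The group is down only if every replica is down at once: 1 - (1 - a) ** n.
    """
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("replica_availability must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - replica_availability) ** replicas


def replicas_needed(replica_availability, target, max_replicas=16):
    """Smallest replica count that meets the target, or None if the cap is hit."""
    for n in range(1, max_replicas + 1):
        if redundant_availability(replica_availability, n) >= target:
            return n
    return None


# Hypothetical component that is 99% available on its own.
for n in (1, 2, 3):
    print(f"{n} replica(s): {redundant_availability(0.99, n):.6%}")

print("replicas needed for 99.9%:", replicas_needed(0.99, 0.999))
```

The independence assumption is optimistic: replicas that share a power domain, a network path, or a faulty release tend to fail together, so real designs usually spread replicas across failure domains before relying on figures like these.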
