Today's sharing is the report [2023 Deterministic O&M: Stable and Reliable]. Report producer: HUAWEI CLOUD.
Featured Reports** Public Title: A global repository of industry reports
In the digital era, enterprises have moved to the cloud far faster than expected. How to make good use of the cloud, innovate on it efficiently and stably, and increase value is a concern shared by all enterprises. To get the full value of the cloud, obtain resources flexibly, and benefit from more cloud services, applications are moving from traditional IT architecture to cloud-native architecture. Digital transformation has entered the stage of "deep cloudification": applications must not only support business innovation and user experience, but also attend to security and trustworthiness, stability and reliability, resource efficiency, and business agility.
HUAWEI CLOUD's rapid development over the past few years has brought a thousand-fold increase in business volume. As the business moved from "slow" to "fast", O&M changed to meet business requirements. On this basis, HUAWEI CLOUD developed the "Deterministic O&M and Reliable O&M" system, which is an example of O&M reform. This change turns the O&M team from "firefighters" into "constructors". Through its various "deterministic" capabilities, it supports the business team in developing the business both "quickly" and "steadily". In practice, it transforms the operations team from a cost department into a productivity department, making the O&M transformation an accelerator for digital transformation.
Capability system upgrade
Tens of thousands of customers on the cloud operate and maintain different objects, yet they face many common challenges. During rapid business growth, digital transformation, or in-depth cloud transformation, enterprises run into problems such as availability management, division of responsibilities, capacity management, cloud resource allocation, safe production, efficiency improvement, and building intelligent O&M capabilities. HUAWEI CLOUD SRE combines its "stable and reliable" practices with cloud application maintenance practices to sort out the following "stable and reliable" system for cloud services, which differs from traditional O&M systems in the following ways:
In the stable and reliable system, the O&M team not only pays attention to maintainability, but also participates more in the architecture design and implementation of the product ("Product High Availability Architecture").
In the traditional development model, version delivery is subject to long-cycle quality management and infrequent changes (it tends toward steady state). Most enterprises now run "Continuous Delivery" processes (which tend toward agile), so ensuring business stability requires an emphasis on automating changes to reduce risk.
When a traditional business is small, the pressure of O&M compliance is low. As volume grows, more teams are involved and deliveries become more and more frequent, so both the pressure on safe production and the capabilities it demands are great.
Description of the stable and reliable capability maturity model
1. Basic O&M
There is no process and no tool to carry one. Operation and maintenance rely mainly on individual experts, and results are not guaranteed. O&M personnel respond passively and are exhausted, changes introduce major incidents, and a high proportion of incidents have human causes. The mean time to recovery (MTTR) for major incidents is uncertain, and safe production carries great uncertainty.
2. Standardized O&M
ITIL standardized processes are introduced, but O&M personnel are still reactive and tied up in trivial matters. Major incidents introduced by changes are alleviated, and the proportion of incidents caused by human factors is reduced. MTTR shows initial improvement.
3. SRE Transformation
The O&M organization adopts SRE across the board and uses software engineering methods to solve O&M problems. It builds automated operation capabilities, infrastructure high-availability capabilities, all-round dial testing capabilities, emergency drill capabilities, and negative improvement capabilities. It establishes a culture of quality awareness and reliability in the O&M business, together with a culture of retrospective review and improvement.
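MTTR is the one quantitative measure the maturity levels above refer to, so a small worked example may help make it concrete. The following is a minimal sketch, assuming each major incident is recorded as a pair of timestamps (when it started, when service was recovered). The function name and the sample incidents are hypothetical and do not come from the report.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return the average recovery duration over (started, recovered) pairs.

    Assumption: every incident is a tuple of two datetimes and
    recovered >= started. This record layout is illustrative only.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum((recovered - started for started, recovered in incidents),
                timedelta())
    return total / len(incidents)


# Hypothetical incident log: recovery took 30, 90, and 60 minutes.
incidents = [
    (datetime(2023, 1, 5, 10, 0), datetime(2023, 1, 5, 10, 30)),
    (datetime(2023, 2, 11, 2, 15), datetime(2023, 2, 11, 3, 45)),
    (datetime(2023, 3, 20, 18, 0), datetime(2023, 3, 20, 19, 0)),
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:00:00
```

Tracking this value over time is one simple way to check whether a team is actually moving from the "uncertain" MTTR of basic O&M toward the improvement described for the later levels.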
Business Availability Metrics (SLO SLI) Design Definition
1. SLI (service level indicator): A specific quantitative indicator of the quality of a service. Not every monitoring metric is an SLI. Only metrics that directly reflect the service's ability to serve its users qualify, and a service's SLIs should be as few and as critical as possible. Metrics such as request success rate, request latency, traffic, and load are the industry's best-practice choices for SLIs.
2. SLO (service level objective): The service availability target, typically determined during the design phase. It describes the target requirement for service availability and is typically measured by SLIs.
3. SLA (service level agreement): A service level agreement between the service and the user that describes the consequences (e.g., compensation or refund) of the SLO being met or missed. The SLA is a business concept, and its value is generally lower than the SLO value.
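The relationship between the three terms above can be sketched in a few lines of code. This is a minimal illustration, assuming the SLI is a request success rate and that the SLO and SLA are expressed as target ratios. The function names, the status labels, and the numbers (an SLO of 99.9% and an SLA of 99.5%) are hypothetical and are not taken from the report.

```python
def success_rate_sli(good_requests, total_requests):
    """SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def evaluate(sli, slo, sla):
    """Compare a measured SLI against the internal SLO and the contractual SLA.

    Returns one of three illustrative labels:
      "ok"           - the SLO is met
      "slo_missed"   - below the SLO, but the SLA still holds
      "sla_breached" - below the SLA, so contractual consequences may apply
    """
    if sli >= slo:
        return "ok"
    if sli >= sla:
        return "slo_missed"
    return "sla_breached"


# Hypothetical targets: the SLA is set lower than the SLO, as described above.
SLO = 0.999
SLA = 0.995

sli = success_rate_sli(good_requests=998_000, total_requests=1_000_000)
status = evaluate(sli, SLO, SLA)
print(sli, status)  # 0.998 slo_missed
```

Because the SLA sits below the SLO, the gap between them works as an early warning: the team sees "slo_missed" and can react before any compensation clause is triggered.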
This article is for informational purposes only and does not represent any investment advice from us. To use the information, please refer to the original report.