With outages growing more frequent, operational resilience is needed more than ever

Mondo Social Updated on 2024-01-28

The frequent service disruptions in the cloud computing era highlight the huge operational risks faced by enterprises. It's imperative to build operational resilience.

Translated from What is Operational Resilience? by Robert Kimani.

Robert Kimani is a systems engineer and open source advocate who loves to share knowledge. He believes in helping others and giving back to the community with compassion. When he's not immersed in Linux, he enjoys hiking, mountain biking, and exploring.

The digital realm was once considered a bastion of reliability, and businesses and organizations trusted cloud service providers to keep them running continuously. However, this narrative is changing. A series of recent events has highlighted the vulnerabilities in these systems and the far-reaching impact of major outages.

In April, Google Cloud Platform's europe-west9 region suffered a total outage that lasted a whole day. The outage was caused by a fire in a Paris data center, followed by flooding, and it had a ripple effect across Google Cloud Platform, with regions and services recovering gradually over a few days.

In June, Amazon Web Services suffered a major disruption in its US-East-1 region. An internal domain name system and a monitoring system failed under excessive traffic, which was caused by an error in an auto-scaling operation. A series of connection errors and retries followed. The impact was immediate, affecting millions of users and enterprise customers who rely on this critical region.

The scope of these events should not be underestimated. Businesses, schools, hospitals, institutions, and countless others suddenly find themselves in operational chaos, raising a fundamental question: How do we ensure business continuity in an environment where cloud service outages are becoming more common?

The answer lies in the concept of operational resilience, a strategy that allows organizations to adapt and respond in the event of disruption while maintaining continuous operations, ensuring that customers are minimally affected, or not affected at all, even as the world around them is in turmoil.

As cloud service provider outages continue to rise, operational resilience has never been more important. Here is a look at what operational resilience is, why it matters, and how to implement it.

Operational resilience revolves around the principle of continuity: the business and its core functions carry on despite challenges. It's a promise to customers that no matter what disruptions exist behind the scenes, their experience won't be interrupted.

It also ensures the security of customer (and organizational) data, a concern that grows more important with each passing day. The importance of operational resilience goes far beyond "keeping the system running": it means delivering products and services reliably even in the toughest of environments.

Operational resilience faces many challenges, ranging from the ordinary to the extraordinary, each of which can cause disruption. Risks include:

Technical glitches: These may include hardware failures, software errors, or infrastructure issues. Such failures can have a knock-on effect on the ability to provide continuous service.

Cyberattacks: Cyber threats are becoming more sophisticated. Attacks such as distributed denial-of-service (DDoS) or data breaches can compromise the integrity, accessibility, and reliability of a service.

Natural disasters: These can disrupt data centers and infrastructure, resulting in prolonged service outages.

Supply chain disruptions: Organizations rely on complex supply chains. Any disruption in these, whether due to geopolitical events or logistical issues, can result in service disruptions and financial losses.

Operational resilience is especially important for financial institutions because their operations are closely tied to the global economy. A disruption on a regional scale could have a catastrophic impact on financial stability.

If a major cloud provider fails and takes large banks offline for an extended period of time, millions of transactions could be halted, impacting consumers and businesses. The economic impact of such an event could be far-reaching, highlighting the urgent need for operational resilience in the financial sector and beyond.

Operational resiliency and business continuity are closely related concepts, but they are not the same. To illustrate the difference, consider a common analogy: games.

Operational Resiliency: A seamless gaming experience

Let's say you're playing a video game and you're in the middle of a fierce boss fight. Suddenly, the game crashes. In an operationally resilient setup, the game has been designed to handle this disruption seamlessly. You press a button and you are back where you were before, as if nothing had happened.

In this case, the player represents the end user of your service. Even if they run into problems, there is little to no disruption to their use of your organization's services.

Business Continuity: Load from savepoint

Now consider business continuity. In game terms, the focus is on making sure you can pick up where you left off after an outage. When the game crashes, you need to load from your last save, which may cost you some progress.

Essentially, operational resiliency is designed to keep end users from experiencing an outage during an unforeseen challenge, so that from the user's perspective it feels as if nothing went wrong.

Business continuity, on the other hand, accepts that service may be disrupted, and focuses on minimizing downtime and ensuring rapid recovery of critical functions. Both concepts are important in their own right, and both help organizations deal effectively with adversity in the digital age.

Operational resilience is about much more than customer satisfaction; it extends to areas such as economic stability and global impact. It is a key part of what keeps the complex machinery of modern society connected.

Major cloud service provider outages may be rare, but they are inevitable, and they are likely to become more frequent due to climate change and other factors. Even the most reliable providers are not immune to disruptions.

As a result, operational resilience requires sound planning and proactive measures to ensure that organizations can weather the storm when problems arise. Waiting for such a rare event to occur is not a workable strategy; preparedness is key to minimizing its impact.

Operational resilience can't be achieved through words alone; it must be embedded in the architecture of the organization's applications. This means that businesses must make it a fundamental part of their design and strategy.

To truly ensure operational resilience, it's important to recognize the limitations of relying on a single cloud provider, as well as the difficulty of switching providers.

Integrated resiliency: Operational resiliency should be integrated into the architecture of each application. Systems must be designed with resilience as a core principle. It's too late to wait for the outage to happen; proactive preparation is key.

Limitations of a single cloud provider: Many organizations have traditionally relied on a single cloud provider. This approach is popular because of its simplicity and cost-effectiveness. The downside, however, is that it inherently lacks the robustness necessary for operational resilience. A single cloud provider can't provide the redundancy and failover that come with multi-cloud or cloud-agnostic strategies.

The challenge of switching providers: Migrating from one cloud provider to another isn't as straightforward as it seems. It is misleading to assume that applications can easily switch between providers. Different cloud providers have proprietary interfaces and architectures, making the migration process complex and time-consuming.

In the face of these challenges, cloud-agnostic application architecture emerges as a compelling solution. It means ensuring that every component of the application is platform-agnostic.

Cloud-agnostic architecture provides the triple benefits of scalability, flexibility, and operational resiliency. This design facilitates easy scaling based on business needs, allowing for dynamic allocation of resources. Its inherent flexibility allows for the addition or replacement of a variety of services and platforms without the need for significant rewrites.

Perhaps most critically, cloud-agnostic architecture inherently enhances operational resiliency by ensuring interoperability across multiple cloud service providers. Each component of the application is rendered platform-agnostic and can run seamlessly between different providers.
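As a rough illustration of what a platform-agnostic component can look like, here is a minimal Python sketch. Every name in it (ObjectStore, InMemoryStore, FailoverStore, ProviderUnavailable) is hypothetical and invented for this example; the in-memory class merely stands in for a real provider's storage service. The application talks only to a neutral interface, and a thin wrapper fails over between backends when one of them is unavailable.

```python
from typing import Protocol


class ProviderUnavailable(Exception):
    """Raised by a backend when its (simulated) provider is down."""


class ObjectStore(Protocol):
    """Provider-neutral interface that application code depends on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for one cloud provider's storage service (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True  # flip to False to simulate an outage
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ProviderUnavailable(self.name)
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ProviderUnavailable(self.name)
        return self._objects[key]


class FailoverStore:
    """Writes to every reachable backend; reads from the first that answers."""

    def __init__(self, backends: list[ObjectStore]) -> None:
        self._backends = backends

    def put(self, key: str, data: bytes) -> None:
        written = 0
        for backend in self._backends:
            try:
                backend.put(key, data)
                written += 1
            except ProviderUnavailable:
                continue  # skip the provider that is down
        if written == 0:
            raise ProviderUnavailable("all providers are down")

    def get(self, key: str) -> bytes:
        for backend in self._backends:
            try:
                return backend.get(key)
            except ProviderUnavailable:
                continue  # try the next provider
        raise ProviderUnavailable("all providers are down")


if __name__ == "__main__":
    provider_a = InMemoryStore("provider-a")
    provider_b = InMemoryStore("provider-b")
    store = FailoverStore([provider_a, provider_b])

    store.put("order-42", b"paid")
    provider_a.available = False      # simulate a regional outage
    print(store.get("order-42"))      # still served, now by provider-b
```

The sketch leaves out much of what a production system would need, such as re-synchronizing a backend after it recovers, consistency between copies, and cost. Its only purpose is to show that the application code never names a specific provider.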

This approach not only alleviates concerns about vendor lock-in but also aligns with anticipated regulatory requirements, which is essential as the rules around operational resilience continue to evolve. In a world where resilience is an essential asset, the transition to a cloud-agnostic architecture has gone beyond a strategic choice: it has become a necessity.

As the world becomes more connected and the need for operational resilience grows, governments across the globe are responding by introducing regulations to ensure that critical services, especially in the financial industry, can withstand disruption.

These regulations are designed to provide a safety net that protects the economy and essential services from major service failures.

Some recent examples:

United Kingdom: The UK is at the forefront of operational resiliency regulation. Through the Operational Resilience Framework, which was launched in 2022, UK authorities have instructed financial firms to meet specific operational resilience requirements by 31 March 2025. These measures layer regulation on top of each organization's own internal strategy. As long as they meet the minimum operational resiliency criteria, CIOs have the flexibility to choose the strategy that best suits their organization's needs, such as operating a hybrid cloud infrastructure or running on multiple cloud provider platforms.

European Union: The Digital Operational Resilience Act (DORA) aims to ensure that all digital service providers, including cloud service providers, search engines, e-commerce platforms and marketplaces, have effective strategies and capabilities to manage operational resilience, regardless of whether they are within or outside the EU. DORA entered into force in January 2023; financial entities are expected to be compliant by early 2025.

United States: In March 2021, the Federal Reserve System and others issued guidance on operational resilience. In May of the same year, the Biden administration issued a cybersecurity executive order containing rules related to operational resilience.

Regulations on operational resilience are expanding beyond financial services firms. The push for new regulations reflects a growing awareness of the interconnectedness of modern services.

The scope of regulation may soon be extended to sectors such as utilities, transportation, and healthcare, as they play a critical role in everyday life and are considered essential services. Regulators recognize that the resilience of these services is critical to the public welfare.
