API call stability is generally regarded as the most important metric for a data service. Many factors influence this metric. The Kangaroo Cloud data service platform DataAPI has repeatedly stress-tested and optimized the performance and stability of API calls, and it also exposes a variety of configuration items so that customers can tune the platform themselves. Even so, unexpectedly large traffic or other sudden events can still cause API calls to fail.
When traffic keeps growing and reaches or exceeds the capacity of the service itself, the service needs a self-protection mechanism. DataAPI combines the concepts of API invocation and microservice traffic control and introduces a circuit breaker and degradation feature to protect the stability of API calls and the availability of the system as far as possible.
This article uses plain explanations and concrete examples to walk you through what circuit breaking and degradation are.
When traffic protection for microservice systems is discussed, three techniques are usually mentioned: rate limiting, circuit breaking, and degradation. They are all important design patterns for system fault tolerance.
Rate limiting. Rate limiting restricts the frequency of requests and the execution of certain internal functions so that a sudden traffic surge cannot make the whole system unavailable. It is primarily a defensive measure: traffic is controlled at its source, before problems occur.
Circuit breaking. The circuit breaker mechanism is an automated response: when traffic is too heavy or a downstream service misbehaves, the interaction with that downstream service is automatically cut off to stop the fault from spreading. A circuit breaker can also probe whether the downstream errors have been corrected, or whether upstream traffic has returned to a normal level, and recover on its own.
A circuit breaker is therefore more of an automated remediation mechanism: it kicks in when a service cannot handle a large number of requests or when another service fails, rejects requests while the problem lasts, and periodically attempts to recover.
Service degradation. Degradation mainly targets non-core business functions; if the core business flow exceeds the estimated peak, rate limiting is still required. Degradation usually considers the distributed system as a whole and cuts traffic off at the source. It is more of a planned measure: with an expected traffic peak in mind, the service experience is reduced in advance through configuration, or secondary functions are suspended, so that the main flow of the system keeps responding smoothly.
Rate limiting and circuit breaking can also be regarded as forms of service degradation. Under a microservice architecture, inter-service calls are usually reasoned about in terms of traffic, and developers have to guarantee microservice stability across several dimensions: traffic routing, flow control, traffic shaping, circuit breaking and degradation, adaptive overload protection, and hotspot traffic protection. Among these, circuit breaking and degradation focus on keeping the key links or key services in a microservice call chain stable.
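To make the contrast with circuit breaking concrete, the following is a minimal sketch of a rate-limiting rule using the open-source Sentinel framework (which, as described later in this article, DataAPI builds on). The resource name and threshold are purely illustrative assumptions, not DataAPI's actual configuration.

```java
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

import java.util.Collections;

public class RateLimitExample {
    public static void main(String[] args) {
        // Illustrative only: limit the resource "demo-api" to 100 requests per second.
        FlowRule rule = new FlowRule("demo-api");
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        rule.setCount(100);
        FlowRuleManager.loadRules(Collections.singletonList(rule));
    }
}
```

Whereas a flow rule like this rejects excess traffic at the entry point, the circuit breaker rules discussed below react to errors and slow calls that have already happened downstream.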
As shown in the figure below, when Service D becomes abnormally unavailable, Services A, B, G, and F are all affected; left uncontrolled, the whole microservice system may eventually be paralyzed. What a circuit breaker does is stop calling Service D once a certain error threshold is reached; if degradation is also configured, the developer-defined degraded content is returned instead, so that at least the overall availability of the call chain is preserved.
The Kangaroo Cloud data service platform DataAPI currently provides three circuit breaker policies. Each API can be associated with exactly one circuit breaker policy. The policy types are:
Slow request ratio
If slow call ratio is selected as the threshold type, you need to set the allowed slow-call RT (the maximum response time); any request whose response time exceeds this value is counted as a slow call. If, within a statistics window (statIntervalMs), the number of requests exceeds the minimum request count and the proportion of slow calls exceeds the threshold, requests are automatically circuit-broken for the configured circuit breaker duration.
After the circuit breaker duration elapses, the circuit breaker enters the probing recovery state (half-open). If the response time of the next request is below the configured slow-call RT, the circuit is closed again; if it exceeds the slow-call RT, the circuit breaker opens once more.
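Since DataAPI's circuit breaking is built on Sentinel (covered later in this article), a slow-call-ratio policy corresponds roughly to a Sentinel DegradeRule such as the sketch below. The resource name and thresholds are illustrative assumptions, not DataAPI's real values.

```java
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
import com.alibaba.csp.sentinel.slots.block.degrade.circuitbreaker.CircuitBreakerStrategy;

import java.util.Collections;

public class SlowRatioRuleExample {
    public static void main(String[] args) {
        // Illustrative values: break the circuit for "api-1001" when, within a 10s window
        // containing at least 5 requests, more than 50% of calls take longer than 100 ms.
        DegradeRule rule = new DegradeRule("api-1001")
                .setGrade(CircuitBreakerStrategy.SLOW_REQUEST_RATIO.getType())
                .setCount(100)                 // slow-call RT threshold in milliseconds
                .setSlowRatioThreshold(0.5)    // proportion of slow calls that triggers the break
                .setMinRequestAmount(5)        // minimum number of requests in the window
                .setStatIntervalMs(10_000)     // statistics window length
                .setTimeWindow(10);            // circuit breaker duration in seconds
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }
}
```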
Error ratio
If, within the statIntervalMs statistics window, the number of requests exceeds the minimum request count and the proportion of exceptions exceeds the threshold, requests are automatically circuit-broken for the configured circuit breaker duration.
After the circuit breaker duration elapses, the circuit breaker enters the probing recovery state (half-open). If the next request completes successfully (without error), the circuit is closed; otherwise it opens again. The threshold range for the error ratio is [0.0, 1.0], that is, 0%–100%.
Error count
When the number of exceptions within a statistics window exceeds the threshold, the circuit breaker opens automatically. After the circuit breaker duration elapses, the circuit breaker enters the probing recovery state (half-open); if the next request completes successfully (without error), the circuit is closed, otherwise it opens again.
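For comparison, the error-ratio and error-count policies map to Sentinel's other two circuit breaker strategies. Again, the resource names and thresholds below are assumptions made for illustration.

```java
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
import com.alibaba.csp.sentinel.slots.block.degrade.circuitbreaker.CircuitBreakerStrategy;

import java.util.Arrays;

public class ErrorRuleExample {
    public static void main(String[] args) {
        // Illustrative: open the circuit when more than 40% of requests fail in a 10s window.
        DegradeRule errorRatio = new DegradeRule("api-1002")
                .setGrade(CircuitBreakerStrategy.ERROR_RATIO.getType())
                .setCount(0.4)                 // error ratio threshold, range [0.0, 1.0]
                .setMinRequestAmount(5)
                .setStatIntervalMs(10_000)
                .setTimeWindow(10);

        // Illustrative: open the circuit when more than 20 exceptions occur in a 10s window.
        DegradeRule errorCount = new DegradeRule("api-1003")
                .setGrade(CircuitBreakerStrategy.ERROR_COUNT.getType())
                .setCount(20)                  // absolute exception count threshold
                .setMinRequestAmount(5)
                .setStatIntervalMs(10_000)
                .setTimeWindow(10);

        DegradeRuleManager.loadRules(Arrays.asList(errorRatio, errorCount));
    }
}
```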
Next, we use examples to introduce how circuit breaking and degradation are applied in DataAPI, the Kangaroo Cloud data service platform.
The circuit breaking and degradation of the data service is implemented on top of the Sentinel framework. Sentinel defines resources mostly at the service level, but it also allows resources to be defined for specific requests or content. DataAPI therefore uses the API ID as the unique resource name when determining circuit breaker thresholds and executing the actual circuit breaking action. At the same time, by controlling how resource names are generated, test APIs and production APIs are isolated from each other.
During development and implementation, the biggest difficulty was that Sentinel natively supports cluster-wide control for rate-limiting rules but not for circuit breaker rules. DataAPI's approach is to use a master node to achieve cluster-wide control, which mainly involves the following two points.
How circuit breaker rules are loaded only on the master node.
First, circuit breaker policies are loaded through Sentinel's DegradeRuleManager, which keeps rules in memory. Because the rules are memory-based, they must also be persisted, otherwise they would be lost whenever the process restarts. In this case, a circuit breaker rule table is created in MySQL to store and update the rule details.
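A minimal sketch of that loading step might look like the following. The table name, column names, and the RuleLoader class are hypothetical; only DegradeRuleManager.loadRules is Sentinel's real API.

```java
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class RuleLoader {
    private final DataSource dataSource;

    public RuleLoader(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Reads all rules from a hypothetical "api_degrade_rule" table and loads them into memory. */
    public void reload() throws SQLException {
        List<DegradeRule> rules = new ArrayList<>();
        String sql = "SELECT api_id, grade, count_threshold, time_window, "
                   + "min_request_amount, stat_interval_ms, slow_ratio FROM api_degrade_rule";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                rules.add(new DegradeRule(rs.getString("api_id"))
                        .setGrade(rs.getInt("grade"))
                        .setCount(rs.getDouble("count_threshold"))
                        .setTimeWindow(rs.getInt("time_window"))
                        .setMinRequestAmount(rs.getInt("min_request_amount"))
                        .setStatIntervalMs(rs.getInt("stat_interval_ms"))
                        .setSlowRatioThreshold(rs.getDouble("slow_ratio")));
            }
        }
        // Replaces Sentinel's in-memory rule set; invoked only on the master node.
        DegradeRuleManager.loadRules(rules);
    }
}
```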
Second, on startup the gateway obtains the list of gateway instances from Nacos and elects a master node: it calls Nacos' NamingService.getAllInstances method to get all gateway instances, picks the first healthy gateway instance as the master node, and stores the master node's IP as a key-value pair in Redis. The Redis key is updated whenever the master node is re-elected.
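A simplified sketch of that election, assuming the Nacos Java client and Spring's StringRedisTemplate; the service name and Redis key are made up for illustration.

```java
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;
import org.springframework.data.redis.core.StringRedisTemplate;

import java.util.List;

public class MasterElector {
    private static final String GATEWAY_SERVICE = "dataapi-gateway";    // assumed service name
    private static final String MASTER_KEY = "dataapi:gateway:master";  // assumed Redis key

    private final NamingService namingService;
    private final StringRedisTemplate redisTemplate;

    public MasterElector(NamingService namingService, StringRedisTemplate redisTemplate) {
        this.namingService = namingService;
        this.redisTemplate = redisTemplate;
    }

    /** Picks the first healthy gateway instance as master and records its IP in Redis. */
    public void electMaster() throws Exception {
        List<Instance> instances = namingService.getAllInstances(GATEWAY_SERVICE);
        for (Instance instance : instances) {
            if (instance.isHealthy()) {
                redisTemplate.opsForValue().set(MASTER_KEY, instance.getIp());
                return;
            }
        }
    }
}
```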
Even if the current master node goes down, a timer that runs once per minute fetches the surviving instances and re-elects a master, which ensures high availability. When the master node changes, a Redis notification is sent to all nodes, and the new master node fetches the latest circuit breaker policies from MySQL and loads them into memory. Note, however, that this action clears the previous traffic statistics and resets the time window.
Finally, how are policy modifications synchronized to the master node? The data service uses Redis channel notifications: every gateway listens for channel messages, but only the gateway that determines it is the current master node actually loads the rules into memory.
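A rough sketch of that listener, assuming Spring Data Redis; the Redis key and the RuleLoader helper refer to the hypothetical code above.

```java
import org.springframework.data.redis.connection.Message;
import org.springframework.data.redis.connection.MessageListener;
import org.springframework.data.redis.core.StringRedisTemplate;

public class RuleChangeListener implements MessageListener {
    private static final String MASTER_KEY = "dataapi:gateway:master";  // assumed Redis key

    private final StringRedisTemplate redisTemplate;
    private final RuleLoader ruleLoader;  // hypothetical helper from the sketch above
    private final String localIp;

    public RuleChangeListener(StringRedisTemplate redisTemplate, RuleLoader ruleLoader, String localIp) {
        this.redisTemplate = redisTemplate;
        this.ruleLoader = ruleLoader;
        this.localIp = localIp;
    }

    @Override
    public void onMessage(Message message, byte[] pattern) {
        // Every gateway receives the notification, but only the master reloads the rules.
        String masterIp = redisTemplate.opsForValue().get(MASTER_KEY);
        if (localIp.equals(masterIp)) {
            try {
                ruleLoader.reload();
            } catch (Exception e) {
                // Log and keep the previous in-memory rule set.
            }
        }
    }
}
```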
Cluster threshold determination.
All instances in the cluster execute API requests normally; only the threshold judgment is delegated, as an HTTP request, to the master node, which returns whether the request is allowed to pass. The overall workflow is shown in the diagram below:
There are two modes of threshold determination:
Error mode, which can be further broken down into error count and error ratio.
Slow-call mode, which refers to the proportion of slow-call requests.
Because the master node (say, instance B) performs the threshold judgment while the errors actually occur on instances A and C, the exception or slow call has to be reproduced manually on instance B. When a node reports a request exception and the master node receives the exception flag, it manually throws an exception so that Sentinel can observe it and increment the exception count by one. Slow calls are handled by adjusting the request's completion time so that the counter registers it as a slow call.
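A simplified sketch of how the master node might feed a remotely reported error into Sentinel's statistics is shown below. The resource name and the errorReported flag are illustrative; SphU, Tracer, and BlockException are Sentinel's real API.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.Tracer;
import com.alibaba.csp.sentinel.slots.block.BlockException;

public class MasterThresholdJudge {

    /**
     * Called on the master node for every threshold-judgment request.
     * Returns true if the call may pass, false if the circuit breaker is open.
     */
    public boolean judge(String apiId, boolean errorReported) {
        Entry entry = null;
        try {
            entry = SphU.entry(apiId);
            if (errorReported) {
                // The real error happened on another instance; record it locally so that
                // Sentinel counts it against this resource and increments the error count.
                Tracer.traceEntry(new RuntimeException("error reported by peer instance"), entry);
            }
            return true;
        } catch (BlockException ex) {
            // The circuit breaker for this API is open: reject the request.
            return false;
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```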
Degraded content is configured on the API edit page. The precondition is that a circuit breaker policy has been configured first; when the circuit breaker is open, the gateway returns the fully user-defined degraded content, which must be in JSON format.
Degradation mainly resolves the conflict between limited resources and growing traffic, allowing the system to handle high concurrency and large numbers of requests with the resources it has. In particular, when an API configuration cannot carry a large amount of traffic, some functions are restricted rather than made completely unavailable: the system returns a preset value so that the whole system keeps running smoothly.
The implementation of returning degraded content is relatively simple: once the circuit breaker condition takes effect, it is enough to rewrite the response of the ServerWebExchange object.
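A minimal sketch of that response rewrite in a Spring Cloud Gateway / WebFlux filter, assuming the degraded JSON for the API has already been looked up; the class and method names are illustrative.

```java
import org.springframework.core.io.buffer.DataBuffer;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.http.server.reactive.ServerHttpResponse;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

import java.nio.charset.StandardCharsets;

public class DegradeResponseWriter {

    /** Writes the user-defined degraded JSON back to the caller when the circuit is open. */
    public Mono<Void> writeDegradedContent(ServerWebExchange exchange, String degradedJson) {
        ServerHttpResponse response = exchange.getResponse();
        response.setStatusCode(HttpStatus.OK);
        response.getHeaders().setContentType(MediaType.APPLICATION_JSON);
        DataBuffer buffer = response.bufferFactory()
                .wrap(degradedJson.getBytes(StandardCharsets.UTF_8));
        return response.writeWith(Mono.just(buffer));
    }
}
```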