Authors: Zhang Minghui, Pan Shengwei.
With the Spring Festival promotion approaching, most e-commerce companies conduct multiple rounds of stress testing on their key business applications to ensure that online business runs efficiently and stably. By simulating high online traffic, they can observe how their services actually perform under load. Take the stress test report of one enterprise as an example:
Figure 1: The stress test report shows a very low overall success rate across all interfaces.
As you can see from the report:
When the traffic an application is subjected to reaches a certain critical point, the request success rate drops dramatically, resulting in a dismal average success rate of 9.89% over the entire test cycle, together with a high response time (RT). In-depth analysis found that this high failure rate was present on all interfaces, and the system showed no sign of returning to a steady state during the entire stress test.
The intuitive inference from this kind of stress test result is that the application has reached its performance bottleneck. The CPU usage captured by Prometheus monitoring further supports this assumption: CPU usage of the application instance is almost saturated, and the application cannot handle such a high TPS (transactions per second) in the current scenario. On the one hand, using tracing data and CPU flame graphs, the company gradually located the performance bottleneck at the code level and carried out targeted optimization; on the other hand, it integrated open-source flow control and circuit-breaker/degradation components into the application to cope with the uncertainty of online traffic, and scaled out business-critical applications to further improve availability.
Figure 2: CPU metrics of an application instance (pod) during stress testing.
Similar problems are not uncommon in real-world business scenarios and can be summarized into the following two challenges:
How do you locate performance bottlenecks in a complex business system? How do you deal with the uncertainty of traffic and protect your services? The company in the example provides its own answers to these questions: simulate online traffic to run performance tests, use tracing data plus CPU flame graphs as the basis for locating problems, and protect services with rate limiting and circuit-breaker degradation. These approaches all have relatively mature open-source implementations:
Use OpenTelemetry's SDK or agent to collect tracing data.
Use the Async Profiler tool to generate CPU flame graphs.
Use Sentinel for traffic governance.
However, whether it means modifying the business code or building an OpenTelemetry server to receive the data, each of these options brings a certain cost, and R&D can no longer focus purely on business development and maintenance. So, is there a non-intrusive, automated way to solve the problem? The new 3.x version of the Java probe brings new answers to these questions.
The Java probe (also known as javaagent) can enhance the bytecode of an application at runtime, so the business application gains additional capabilities without any code changes [4]. Applications deployed in Kubernetes can also have the probe injected automatically via an init-container, further reducing the access cost. As mentioned above, when locating and optimizing slow calls, in addition to observing the key metrics of the application, there are two indispensable tools: tracing and the CPU flame graph, which help locate the time-consuming and CPU-intensive segments in business calls so they can be fixed accordingly. However, both of these tools have their own inherent limitations:
Tracing data often has blind spots
Tracing data generally relies on automatic or manual instrumentation provided by an agent or SDK for collection; the data is then reported to a tracing collector for storage and is finally correlated and displayed on a dashboard. Obviously, how fine-grained a trace is depends on the granularity of the instrumentation. In practice, collecting tracing data itself brings some overhead, so the granularity cannot be subdivided indefinitely; only the key methods of common frameworks are instrumented. For high-overhead logic inside the business code, instrumentation is often missing, so the time spent in that business logic cannot be accurately determined. Blind spots like the one in the following figure are common in call chains.
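To illustrate why trace granularity depends on instrumentation, below is a minimal sketch of manually creating a span with the OpenTelemetry Java API around a business method. The class, tracer name, and business method here are hypothetical illustration values: any business logic that is not wrapped in a span like this (and is not covered by the agent's automatic instrumentation) simply leaves a gap in the trace.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ManualTracingExample {
    // Tracer scope name is an arbitrary illustration value
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("com.example.demo");

    public void handleRequest() {
        // Only code wrapped in a span shows up in the trace;
        // anything outside this block remains a "blind spot".
        Span span = tracer.spanBuilder("parseAndAggregate").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            parseAndAggregate(); // hypothetical expensive business method
        } finally {
            span.end();
        }
    }

    private void parseAndAggregate() {
        // expensive business logic that would otherwise be invisible in the trace
    }
}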
Figure 3: Tracing data is prone to blind spots caused by logic that is not covered by instrumentation.
CPU flame graphs struggle to help locate online issues
A CPU flame graph visually shows where CPU time is concentrated during the execution of a business application. The width of a method stack in the flame graph represents the method's execution time; optimizing the "wide base" and "big flat head" methods reduces time consumption and improves performance. However, many hidden performance issues are often not exposed during the testing phase and only surface in the production environment. Generating a flame graph requires a period of collection, and when the priority is to stop the bleeding of the online business, the problem site is rarely preserved; after all, it is not feasible to go back and profile for five minutes after the incident has occurred. All of this complicates problem localization.
Figure 4 "Big Flat Head" vs. "Wide Base" in a common CPU flame diagram
So, is there a method that does not require dense manual instrumentation, can observe the blind spots in tracing data, automatically identifies slow traces, and filters and correlates the CPU flame graph of the relevant stacks to the corresponding trace? The Code Hotspot feature of the 3.x Java probe of Application Real-Time Monitoring Service (ARMS) gives the answer.
Next, let's take a scenario that parses and traverses JSON data and then calls a downstream HTTP interface as an example:
public class HotSpotAction extends AbsAction {

    // HTTP call to the downstream interface
    private void invokeApi() {
        // ... issue the downstream HTTP request
    }

    // Read file data and parse it as JSON
    private double readFile() {
        // ... e.g. movieList = gson.fromJson(json, new TypeToken<List<Movie>>() {}.getType());
        double totalCount = 0;
        for (int i = 0; i < movieList.size(); i++) {
            // ... accumulate statistics from each record
        }
        return totalCount;
    }
}
As shown in the following figure, a slow call chain for this interface was found in the tracing system. The total time of this call chain is 2649 ms, but there is a blind gap of more than 2 s between the first span and the last two spans (this gap corresponds to the JSON parsing logic above). Relying on the tracing system alone, there is no way to tell which piece of code the missing 2 s was spent in.
Figure 5: Tracing data of the business system; there is an observation blind spot before the second span.
For this kind of problem, after installing the latest version of the ARMS probe and enabling Code Hotspot, locating the root cause becomes straightforward. Simply click the Code Hotspot tab in the call chain details: in the flame graph on the right you can see that, in addition to the HTTP-related method stack on the left side (corresponding to the HTTP call in the trace), it also contains com.alibaba.cloud.pressure.memory.HotSpotAction.readFile(), which accounts for 1.91 s of execution time:
Figure 6: The Code Hotspot tab drills down to the exact time-consuming method stack.
On the left side of the figure is a list of the time consumed by all the methods involved in this call; on the right is a flame graph drawn from the stack information of those methods.
1)"self"The column shows the time or resources consumed by the method itself.
This metric represents the time or resources consumed by a method in its own call stack, excluding the time or resources consumed by its child method calls. Help identify which methods are spending a significant amount of time or resources in-house. 2)"total"The column shows the total amount of time or resources consumed by the method and all of its sub-method calls.
This metric includes the time or resources consumed by the method itself, as well as by all of its child method calls. Helps you understand which methods are contributing the most time or resources across the entire method call stack. For how to use the flame map generated by the ** hotspot function to locate the root cause of slow calling, you can focus on the self column or directly look at the wider flame at the bottom of the flame chart on the right to locate the time-consuming business method, which is the root cause of the high time-consuming of the upper layer, which is generally the bottleneck of system performance.
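As a quick illustration of the self/total distinction (a made-up example, not taken from the report above): in the sketch below, handle() would show a large "total" time but a small "self" time, because almost all of its time is actually spent inside parse(), which is where the "self" time accumulates and which is therefore the frame to optimize.

public class SelfVsTotalDemo {

    // "total" time of handle() ~= its own work + time spent in parse()
    // "self" time of handle()  ~= only its own work, which is tiny here
    void handle() {
        parse(); // delegates the expensive work
    }

    // parse() has both a large "total" and a large "self" time
    void parse() {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++) {
            sum += i; // stands in for expensive parsing work
        }
        if (sum == 42) System.out.println("unreachable"); // keep the loop from being optimized away
    }
}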
As the figure above shows, the root cause of the slow call in this example is that LinkedList does not support random access, which is very expensive when elements are looked up by index frequently; refactoring it to an indexed list implementation such as ArrayList solves the problem. For more details on how to use Code Hotspot, see the user documentation for this feature, as sketched below.
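The following sketch (an illustrative example, not the original business code) shows why index-based access over a LinkedList is slow: each get(i) walks the list from the head, so the loop is O(n^2) overall, while the same loop over an ArrayList is O(n). Switching to ArrayList, or iterating with an enhanced for loop, avoids the problem.

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class RandomAccessDemo {
    public static void main(String[] args) {
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < 100_000; i++) {
            linked.add(i);
        }

        // Slow: each get(i) traverses the LinkedList from the head -> O(n^2) overall
        long sum = 0;
        for (int i = 0; i < linked.size(); i++) {
            sum += linked.get(i);
        }

        // Fast: ArrayList supports O(1) random access -> O(n) overall
        List<Integer> indexed = new ArrayList<>(linked);
        long sum2 = 0;
        for (int i = 0; i < indexed.size(); i++) {
            sum2 += indexed.get(i);
        }

        System.out.println(sum + " " + sum2);
    }
}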
Back to the stress test: if elastic scaling based on CPU and traffic metrics was already configured, why did the request success rate keep deteriorating? Analysis shows that there is a time interval between the moment a metric reaches its threshold and the moment a new service instance is ready (Java applications start slowly, with startup times measured in seconds or more), and under bursty traffic the old service instances are overloaded before the new instances become ready. The business success rate drops significantly, partly because the old instances are overloaded; in addition, since the overloaded instances can process fewer requests, a large amount of traffic pours onto the newly launched instances, which are in turn overloaded by the excess traffic, and the system falls into a worst-case loop of continuous restarts and continuous scale-outs. When the system faces a sudden traffic surge, is there a way to effectively protect it so that it always stays in a steady state?
The answer is the latest feature of Microservices Engine (MSE): adaptive overload protection.
Figure 7: The MSE adaptive overload protection page.
Normally, in the face of a sudden traffic spike (which elasticity cannot scale for in time), CPU usage spikes (that is, the system load rises), the performance of all interfaces deteriorates significantly, RT keeps climbing, and the success rate drops sharply.
Figure 8: Stress test with a sudden traffic surge and overload protection enabled.
Alibaba Cloud javaagent 3.x provides adaptive overload protection to effectively protect the system. When CPU usage reaches the configured threshold, adaptive overload protection adjusts the throttling policy and rejects a certain proportion of requests, keeping the system load at a stable level so that all interfaces maintain a healthy RT and success rate.
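MSE's adaptive overload protection is enabled from the console without code changes. For readers who want to experiment with the underlying idea in open source, the Sentinel component mentioned earlier exposes a similar CPU-based system protection rule. The sketch below is illustrative only (the 80% CPU threshold and the "httpEntry" resource name are assumed values): it loads a system rule, and inbound traffic guarded with EntryType.IN is rejected once the CPU limit is exceeded.

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.EntryType;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.system.SystemRule;
import com.alibaba.csp.sentinel.slots.system.SystemRuleManager;
import java.util.Collections;

public class OverloadProtectionDemo {
    public static void main(String[] args) {
        // System protection rule: start rejecting inbound traffic
        // once process CPU usage exceeds 80% (assumed threshold).
        SystemRule rule = new SystemRule();
        rule.setHighestCpuUsage(0.8);
        SystemRuleManager.loadRules(Collections.singletonList(rule));

        // Guard an inbound entry point; shed requests throw BlockException.
        Entry entry = null;
        try {
            entry = SphU.entry("httpEntry", EntryType.IN);
            // handle the request normally
        } catch (BlockException e) {
            // request shed by overload protection; return a fallback / HTTP 429
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}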
Figure 9: Interface performance with overload protection enabled.
The overall success rate is 50.99%. After the sudden traffic increase on the XX interfaces, the success rate of all interfaces began to decline and RT soared; once adaptive overload protection took effect, the success rate gradually recovered and RT quickly returned to normal levels. Note that requests rejected by throttling are also counted as failures in the stress test, which is why the overall success rate is only 50.99% (after excluding throttling exceptions, the actual request success rate is about 80%, which can also be seen from the RT performance in the second half of the test).
The Code Hotspot and adaptive overload protection features described above are both based on the new 3.x Java probe, which provides a complete, non-intrusive access solution for your Java applications and deeply integrates observability and microservice governance components and solutions such as OpenTelemetry, Arthas, and Sentinel. It helps you quickly and imperceptibly access important functions of cloud products such as Application Real-Time Monitoring Service (ARMS), Microservices Engine (MSE), Enterprise Distributed Application Service (EDAS), and Serverless App Engine (SAE).
How to access the 3.x Java probe
The Alibaba Cloud Java probe provides a variety of convenient access methods, and you can enable observability and service governance capabilities as needed. For more information, see Overview of ARMS Application Monitoring Access and MSE Service Governance Application Access.
For applications running on Alibaba Cloud Container Service for Kubernetes (ACK), the simplest access method is automatic probe injection and configuration based on the pilot mode: there is no need to modify the image, and you only need to add a few labels to the application YAML to enable observability or service governance for a Java application.
Start by installing ack-onepilot in the ACK cluster, as shown in the figure:
Figure 10: Installing ack-onepilot.
After the installation is complete, create a Java application in the ACK cluster and add labels under the spec.template.metadata.labels field of the pod template; you can also edit the Deployment YAML directly and add the labels under spec.template.metadata.labels.
To connect the application to ARMS application monitoring, add the following two labels:
armsPilotAutoEnable: "on"
armsPilotCreateAppName: "${your-app-name}"   # placeholder: replace with your application name
To connect the application to MSE Microservices Governance, add the following three labels:
msePilotAutoEnable: "on"
msePilotCreateAppName: "${your-app-name}"   # placeholder: replace with your application name
mseNamespace: "${your-namespace}"           # placeholder: replace with your MSE namespace
After the application is deployed successfully, you can see the corresponding Deployment on the ACK console; click ARMS Console to jump to the ARMS console and view the application's monitoring information.
Figure 11: Jumping from the ACK console to the ARMS console.
If your application is connected to MSE Microservices Governance, you can click More > Microservice Governance on the ACK console to go to the MSE service governance console.
Figure 12: Jumping from the ACK console to the MSE console.
List of version changes
In addition to the Code Hotspot and adaptive overload protection functions, the 3.x release brings other new features to the Java probe:
Bytecode enhancement for Java 21 applications is supported; a single probe can provide both non-intrusive observable data collection and microservice governance for Java 8-21 applications.
The ARMS data reporting architecture has been fully upgraded; based on short connections and report compression, the data reporting success rate has been increased from 99% to 99.99%, providing a more stable and available data collection service.
ARMS has launched a slow call diagnosis tool, the call-chain-based Code Hotspot feature, which automatically observes the code execution of slow interface calls and helps analyze performance bottlenecks.
ARMS performance has been optimized, making application access lighter and less noticeable. In a 2C4G dual-replica container scenario, the additional CPU overhead caused by mounting the probe is reduced by 50% compared with the previous version, CPU usage increases by only 10% in near-limit TPS scenarios, and CPU overhead in scenarios using asynchronous frameworks is reduced by up to 65%, resulting in better performance.
Startup time has been greatly optimized: the startup overhead added by mounting the probe is reduced to less than 5 seconds, the init-container startup time for container-based access is reduced to less than 6 seconds, and the overall probe startup time is reduced by more than 10 seconds. For details, see the ARMS Java probe performance stress test report.
ARMS supports complete time-consuming statistics for asynchronous frameworks such as Vert.x and reactor-netty, adds automatic instrumentation for components such as OceanBase and XXL-JOB, and optimizes the instrumentation of existing components such as PostgreSQL and Kafka, providing more accurate and richer metrics and span data.
MSE traffic protection supports custom RPC invocation behaviors; see Configuring MSE Traffic Protection Behaviors for details.
MSE traffic protection supports adaptive overload protection, which automatically protects the system from being overwhelmed by excessive traffic based on CPU metrics and adaptive algorithms.
Through its bytecode enhancement capabilities, javaagent technology can enhance Java microservices without intrusion, bringing users a new product experience. As this article shows, the Alibaba Cloud 3.x probe not only introduces exciting new features but also takes a major leap forward in user experience.
In terms of performance: the startup overhead of mounting the probe is reduced to less than 5 seconds, and the overall probe startup time is reduced by more than 10 seconds; the additional CPU overhead of mounting the probe is 50% lower than in previous versions; in near-limit TPS scenarios, CPU usage increases by only 10%. In terms of features: it brings new capabilities in diagnosis and governance, such as adaptive overload protection, slow call diagnosis, and support for Java 21. Of course, the 3.x series of Java probes is not the end; we have only just set sail toward the goal of providing the best microservices experience on Alibaba Cloud. The latest 4.x version is fully compatible with the open-source OpenTracing protocol, adds monitoring for custom thread pools, and provides broader automatic instrumentation support and more reliable asynchronous tracing capabilities.
If you are particularly interested in the Code Hotspot feature described in this article, you are welcome to join the ARMS Continuous Profiling product capability exchange DingTalk group (group number: 2256001967) for discussion.
Related links:
[1] OpenTelemetry official website
[2] Async Profiler tool
[3] Use Sentinel for traffic governance
[4] What is a Java agent
[5] Use Code Hotspot to diagnose slow call chains
[6] Alibaba Cloud javaagent performance stress test report
[7] Java components and frameworks supported by ARMS application monitoring
[8] Configure MSE traffic protection behaviors
[9] Overview of ARMS application monitoring access
[10] MSE service governance application access
[11] Container Service for Kubernetes (ACK)
[12] Configure system protection