Author: Rao Zihao, Yang Long.
With the continuous evolution of software technology, many enterprise systems have gradually moved from monolithic applications to cloud-native microservice architectures. On the one hand, this brings high concurrency, easy scalability, and greater development agility; on the other hand, it makes application call chains longer and longer, increases reliance on external components, and makes some online problems difficult to troubleshoot.
Although observability technology for distributed systems has evolved rapidly over more than a decade and has solved many of these problems to a certain extent, some issues remain hard to pin down, for example:
Figure 1 CPU usage peaks persistently.
Figure 2 Heap memory usage remains high.
Figure 3 The trace call chain cannot locate the root cause of the time consumption.
How can we locate the root causes of these problems?
Readers with more troubleshooting experience who have used a variety of diagnostic tools may think of the following approaches:
1. For CPU spikes, use a CPU hotspot flame graph tool to investigate;
2. For memory problems, take a memory snapshot to diagnose memory usage;
3. For slow call chains whose time consumption cannot be fully attributed, use the trace command provided by Arthas to measure method time consumption.
These approaches can indeed solve some problems, but anyone who has used them also knows that each has its own learning curve and limitations, for example:
1. For online problems that are difficult to reproduce in a test environment, a one-shot CPU hotspot flame graph tool is of little help;
2. Memory snapshots may affect the stability of the online application, and diagnosing problems from them requires considerable experience with the relevant tools;
3. The Arthas trace command becomes hard to use when the slow call chain appears only sporadically and is difficult to catch, and it is even harder to apply to requests that span multiple applications and multiple machines.
Is there a relatively simple, efficient, and powerful diagnostic technology that can help us solve these problems? The answer is the continuous profiling technique introduced in the rest of this article.
What is Continuous Profiling?
Continuous profiling dynamically captures, in real time, the stack information behind an application's CPU, memory, and other resource usage, helping to monitor and locate application performance bottlenecks. If that definition still feels abstract, think of jstack, the JDK tool for printing thread stacks to show what an application is currently doing; many readers will have used it at some point when troubleshooting application problems:
Figure 4 The jstack tool.
Continuous profiling works in a similar way to jstack: it captures, at a certain frequency or when a threshold is reached, the stack information of the application threads consuming CPU, memory, and other resources, and then presents that information through visualization techniques so that we can gain a more intuitive view of how the application uses those resources. At this point, readers who use performance analysis tools frequently will probably think of the flame graph:
Figure 5 The flame graph tool.
One-shot performance diagnostic tools such as Arthas's CPU hotspot flame graph generator, which are usually switched on and off manually during stress testing, are in fact instant profiling: both the way data is collected and the way it is presented are basically the same as in the continuous profiling described below. The core difference is that they are one-off rather than continuous.
Readers who have used flame graphs may recall that CPU hotspot flame graph tools are generally used in stress-test scenarios: a tool captures the application's flame graph for a period of time during the test for performance analysis. Continuous profiling is not only about observing performance during stress tests; more importantly, through a number of technical optimizations it continuously analyzes the application's resource usage at low cost throughout the application's entire life cycle, and then presents the results through flame graphs or other visualizations, giving deeper insight than conventional observability technologies.
How Continuous Profiling Works
Having covered the basic concept of continuous profiling, you are probably curious about how it is implemented, so here is a brief overview. Tracing restores information such as parameters, return values, exceptions, and time consumption by instrumenting (burying points in) methods on the critical execution path. However, a business application can never instrument every method exhaustively, and too many instrumentation points incur a large overhead, so tracing struggles to cover the monitoring blind spot shown in Figure 3. Continuous profiling instead instruments a few key resource-related positions at a lower level, such as inside the JDK libraries, or relies on specific operating-system events to collect information; this keeps the overhead low while providing much deeper insight into the collected data.
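As a rough illustration of what such a "buried point" looks like, here is a minimal, hypothetical sketch of manually instrumenting a single method to record its time consumption and exceptions (the class and method names are made up, and this is not ARMS's actual instrumentation); applying this to every method on every execution path is clearly impractical:

```java
// Minimal sketch of a tracing "buried point" on one method, with hypothetical
// names. Useful on a critical path, but impractical to apply exhaustively.
public class TracedRepository {

    public String queryOrder(String orderId) {
        long start = System.nanoTime();
        boolean failed = false;
        try {
            return doQuery(orderId);                 // the real business logic
        } catch (RuntimeException e) {
            failed = true;                           // record the exception on the span
            throw e;
        } finally {
            long costMs = (System.nanoTime() - start) / 1_000_000;
            // A real tracer would report this to its backend instead of printing it.
            System.out.println("queryOrder cost=" + costMs + "ms id=" + orderId
                    + " failed=" + failed);
        }
    }

    private String doQuery(String orderId) {
        return "order-" + orderId;                   // placeholder for a database call
    }
}
```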
Take CPU hotspot profiling as an example. The general idea is to use low-level operating-system calls to find out which threads are currently executing on the CPU, and then collect the method stack of each such thread at a fixed frequency (for example every 10 ms), which yields about 100 thread stacks per second, similar to Figure 6 below. These stacks are then aggregated and rendered with a visualization technique such as a flame graph, and the result is the CPU hotspot analysis. This is of course only a simplified description; different profiling engines, and different profiling targets, differ slightly in their technical implementation.
Figure 6 How continuous profiling collects data.
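The following is a minimal, self-contained sketch of this sampling idea in plain Java, for illustration only; real profiling engines such as async-profiler use much lower-level, lower-overhead mechanisms than Thread.getAllStackTraces():

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Naive sampling profiler: counts identical runnable-thread stacks every 10 ms. */
public class NaiveStackSampler {
    private final Map<String, Long> stackCounts = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // 10 ms sampling interval => up to ~100 samples per thread per second.
        scheduler.scheduleAtFixedRate(this::sampleOnce, 0, 10, TimeUnit.MILLISECONDS);
    }

    private void sampleOnce() {
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] frames = e.getValue();
            if (e.getKey().getState() != Thread.State.RUNNABLE || frames.length == 0) {
                continue; // only threads likely to be on the CPU contribute to CPU hotspots
            }
            StringBuilder key = new StringBuilder();
            for (int i = frames.length - 1; i >= 0; i--) {
                key.append(frames[i]).append(';'); // bottom-of-stack frame first
            }
            stackCounts.merge(key.toString(), 1L, Long::sum);
        }
    }

    /** Snapshot of stack -> sample count. */
    public Map<String, Long> snapshot() {
        return Map.copyOf(stackCounts);
    }

    public void stop() {
        scheduler.shutdown();
    }
}
```

The aggregated counts of identical stacks are exactly what a flame graph generator turns into rectangle widths.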
In addition to the familiar CPU hotspot flame graph, continuous profiling can produce corresponding profiling results for the various other system resources a program uses, helping to analyze how those resources are consumed (note that the implementation techniques may differ for different profiling targets).
Visualization Techniques for Continuous Profiling
One of the most widely used techniques for visualizing continuous profiling data is the flame graph. So what is the secret behind flame graphs?
What is a Flame Graph?
A flame graph is a visual profiling tool that helps developers trace and display where a program's function calls spend their time. Its core idea is to turn the program's function call stacks into an image of stacked rectangles: the width of each rectangle represents the share of resources consumed by that function, and the vertical position represents the call depth. By comparing flame graphs from different points in time, you can quickly diagnose the program's performance bottlenecks and optimize accordingly.
Broadly speaking, there are two kinds of flame graphs: the standard flame graph, in which the bottom of the call stack is drawn at the bottom and the top of the stack at the top, as in the left figure below; and the icicle-style flame graph, in which the bottom of the stack is at the top and the top of the stack is at the bottom, as in the right figure below.
Figure 7 Different types of flame graphs.
How to Use a Flame Graph?
As a visualization technique for performance analysis, a flame graph is only useful if you know how to read it. For a CPU hotspot flame graph, for example, the most common advice is to look for a wide "stack top" in the graph. What is the reasoning behind this?
The reason is that a flame graph draws the method stacks being executed by the program. A function call's context is maintained on a stack, a data structure in which elements are last in, first out: the bottom of the stack is the initial calling function, and above it are the sub-functions it calls, layer by layer. Only when the sub-function at the top of the stack finishes executing can the frames be popped from top to bottom. A wide stack top therefore means a sub-function has been executing for a long time, and every parent function below it is also kept on the stack for a long time, because it cannot return until its child does.
Figure 8 The stack data structure.
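To make this concrete, here is a tiny illustrative Java example (the method names are hypothetical and unrelated to any real library): while the leaf method is busy on the CPU, its callers cannot return, so in the flame graph a caller's rectangle is always at least as wide as its children's.

```java
/**
 * Illustrative call chain only. While node() runs, get() and readFile()
 * below it stay on the stack, so every CPU sample that hits node() is also
 * attributed to them; their rectangles in the flame graph are therefore at
 * least as wide as node()'s.
 */
public class WideStackTopDemo {
    static long readFile() {          // bottom-most business frame (wide)
        return get(2_000_000);
    }

    static long get(int n) {          // intermediate frame, same width as its child here
        return node(n);
    }

    static long node(int n) {         // top of the stack: the wide, CPU-hot leaf
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i % 7;             // CPU-bound loop that the sampler keeps catching
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 200; i++) {
            total += readFile();
        }
        System.out.println(total);    // keep the result so the work is not optimized away
    }
}
```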
The steps for analyzing a flame graph are therefore:
1. Determine which kind of flame graph it is and find which direction the stack top faces;
2. If the total resource usage shown by the flame graph is high, check whether the graph has a wide stack top;
3. If there is a wide stack top, search from the top of the stack toward the bottom, find the first method defined by the analyzed application itself, and then examine whether that method has room for optimization.
The following is a flame graph with high resource usage; the steps to find its performance bottleneck are as follows:
1. From its shape, the figure below is an icicle-style flame graph with the stack bottom at the top and the stack top at the bottom, so it should be analyzed from the bottom up.
2. Looking at the stack tops at the bottom, the wider one on the right corresponds to the method java.util.LinkedList.node(int).
3. Since this wider stack top is a JDK library function rather than a business method, we search upward from java.util.LinkedList.node(int), passing java.util.LinkedList.get(int) and arriving at com.alibaba.cloud.pressure.memory.HotspotAction.readFile(), which is a business method belonging to the analyzed application, i.e. the first method defined by the application itself along this path. It takes 389s, accounting for 76.06% of the entire flame graph, so it is the biggest resource bottleneck during the collection period. The logic of this business method can then be reviewed against its name to see whether there is room for optimization. Applying the same analysis to the lower-left part of the figure, the java.net.SocketInputStream-related methods trace up to the first application-defined parent method, com.alibaba.cloud.pressure.memory.HotspotAction.invokeApi, which accounts for about 23% of the total.
Figure 9 Flame graph analysis process.
After this introduction, you should have a basic understanding of the concept of continuous profiling, its data collection principles, and its visualization techniques. Next, let's look at the continuous profiling capabilities provided by ARMS, their overhead, and how they help troubleshoot and locate various online problems.
ARMS provides one-stop continuous profiling capabilities, and continuous data collection and monitoring has already been enabled on nearly 10,000 application instances.
Figure 10 ARMS continuous profiling product capabilities.
The left diagram is an overview of the current ARMS continuous profiling capabilities, consisting, from top to bottom, of data collection, data processing, and data visualization. At the functional level, ARMS provides targeted solutions for the scenarios users need most urgently: the CPU hotspot and memory hotspot functions for CPU and heap memory analysis, and the hotspot function for diagnosing slow call chains. Compared with general profiling solutions, the latter offers low overhead, fine granularity, and complete method stacks.
Best practices for the corresponding sub-features are already available in the ARMS product documentation:
For diagnosing high CPU usage, see "Use CPU Hotspots to Diagnose High CPU Usage".
For diagnosing high heap memory usage, see "Use Memory Hotspots to Diagnose High Heap Memory Usage".
For root-cause diagnosis of call chain time consumption, see "Use Hotspots to Diagnose Slow Call Chains".
Customer Stories
Since the release of these features, they have helped users diagnose a number of stubborn problems that had troubled them online for a long time, and have been well received. Some examples:
1. User A found that the first few requests right after an application service started were very slow, and tracing hit a monitoring blind spot when diagnosing where the time was spent. Using the ARMS hotspot feature, they finally identified the root cause of the slow call chains: the initialization time of the sharding-jdbc framework. This explained a phenomenon that had puzzled them for a long time.
Figure 11 User Problem Diagnosis Case 1
2. During stress testing, User B found that among all instances of an application, a few nodes always responded much more slowly than the others, and tracing could not reveal the root cause. Guided by the profiling information, they checked the resource usage of the log collection component in the application's environment and found that it consumed a large amount of CPU during the stress test, so the application instances contended for resources when writing logs and requests were processed slowly.
Figure 12 User Problem Diagnosis Case 2
3. While an online application was running, User C found that its heap memory usage was always very high. With the memory hotspot feature, they quickly discovered that the version of the microservice framework used by the application consumed a large amount of heap memory while subscribing to upstream service information. After consulting the framework vendor, they learned that upgrading the framework version would solve the problem.
Figure 13 User Problem Diagnosis Case 3
Overhead
Finally, you may be concerned about the overhead of ARMS continuous profiling, so we designed the following stress test scenario to measure it: the stress test client sends requests to the business entry application, which queries the database and returns the result.
Figure 14 Schematic diagram of the stress test.
The test environment has all continuous profiling features enabled and uses a Kubernetes container runtime to simulate a typical enterprise application environment. The pod limit is 4 CPU cores and 8 GB of memory, the young generation is set to half of the 4 GB heap, and the maximum throughput under full load is about 6,000 TPS. The overhead measured at 500 TPS and at 4,800 TPS (about 80% of the maximum) is shown in the table below. As the table shows, with all features enabled the CPU overhead is around 5%, the in-heap memory overhead is negligible, and the off-heap memory usage is about 50 MB; with lower traffic, or with only some of the continuous profiling features enabled, the overhead is even lower.
Figure 15 Stress test results.
In practice, the CPU and memory utilization of many enterprise applications is relatively low while they run, so trading a small amount of resource consumption for a new observation perspective on the application, one that provides detailed root-cause data when the application behaves abnormally, is well worth it.
If you are interested in the continuous profiling features of ARMS, join the ARMS continuous profiling DingTalk group for discussion (group number: 22560019672).
Live Broadcast Recommendation:
Master ARMS Continuous Profiling - Easily Gain Insight into Application Performance Bottlenecks:
Related Links:
[1] stacks.
[2] Use CPU hotspots to diagnose high CPU consumption.
[3] Use memory hotspots to diagnose high heap memory usage.
[4] Use hotspots to diagnose slow call chains.