Recently, I noticed on the Grafana monitoring dashboard that the old generation memory of the two nodes serving one of our services was rising steadily, yet no major GC was being triggered.
I then asked an O&M colleague to help export a heap dump, choosing pod 1 to run the command jmap -dump:live,format=b,file=heap001.
After pod 1 executed the dump command, its old gen memory trend looked as shown in the following figure.
As you can see, the dump itself triggered a major GC (the GC cause is reported as "heap dump initiated"), and old gen memory was released quickly. From this we can infer that the old generation of this instance had been holding a large number of garbage objects: because no major GC was being triggered, those objects kept occupying memory and were never cleaned up.
So what objects were continuously being promoted into the old generation?
Pod 2 has the same resource configuration as pod 1 and a very similar old gen memory trend, so I asked the O&M colleague to run the dump command on pod 2 as well: jmap -dump:format=b,file=heap002. Besides the different output file name, you may have noticed that this command omits the optional live parameter, so no full GC is triggered before the heap snapshot is generated, which lets us observe the garbage objects directly.
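As a side note, the same "live vs. non-live" choice is also exposed programmatically through the JDK's HotSpotDiagnosticMXBean. The sketch below is not the author's setup, only an illustration of what the live flag means: live = true dumps only reachable objects (forcing a full GC first, like the pod 1 command), while live = false keeps unreachable garbage in the snapshot (like the pod 2 command).

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class HeapDumper {
        public static void dump(String file, boolean live) throws Exception {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(file, live);   // live=true behaves like jmap -dump:live
        }

        public static void main(String[] args) throws Exception {
            dump("heap001.hprof", true);   // dump only live objects (full GC first)
            dump("heap002.hprof", false);  // dump everything, including garbage
        }
    }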
Use the Profiler plugin in IntelliJ IDEA to analyze the heap dump file.
On the Classes tab, sort by shallow size in descending order (shallow size is the heap memory occupied by the object itself; retained size is the total memory that would be released once the object is collected). Two classes directly related to the business **, applicationbank and regiontree, clearly show nearly 140 MB of garbage instances that are no longer in use. From there you can locate the relevant classes precisely and find the key business **:
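To make the shallow vs. retained distinction concrete, here is a purely hypothetical node class (the real regiontree code is not shown in this article): the shallow size of a node is just its object header plus its two reference slots, while its retained size also counts every child object that is reachable only through this node and would be freed together with it.

    import java.util.ArrayList;
    import java.util.List;

    public class RegionTreeNode {
        private final String regionCode;                                  // one reference slot
        private final List<RegionTreeNode> children = new ArrayList<>();  // one reference slot

        public RegionTreeNode(String regionCode) {
            this.regionCode = regionCode;
        }

        public void addChild(RegionTreeNode child) {
            children.add(child);   // the whole subtree is retained by this node
        }
    }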
The regiontree also has a corresponding cache. At this point the conclusion is clear: the localcache is refreshed every 30 minutes, so these cache instances are long-lived objects. It is perfectly normal for them to be promoted into the old generation, whether by reaching the tenuring threshold (15 by default), through the dynamic age calculation, or via the space allocation guarantee mechanism.
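Below is a minimal sketch of the pattern described above, assuming a local cache that is rebuilt on a fixed schedule (the class and field names are illustrative, not the service's real code). Each refresh replaces the old map with a brand-new one, so the previous instances live long enough to be promoted and then sit in the old gen as garbage until the next major GC.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class LocalCacheRefresher {
        // The currently active cache; readers always see a fully built map.
        private volatile Map<String, Object> cache = new ConcurrentHashMap<>();

        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void start() {
            // Rebuild the cache every 30 minutes; the old map and its entries
            // become unreachable and wait in the old gen for a major GC.
            scheduler.scheduleAtFixedRate(this::refresh, 0, 30, TimeUnit.MINUTES);
        }

        private void refresh() {
            Map<String, Object> fresh = new ConcurrentHashMap<>();
            // ... load applicationbank / regiontree data from the source of truth ...
            cache = fresh;   // swap the reference; the previous map is now garbage
        }

        public Object get(String key) {
            return cache.get(key);
        }
    }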
At this point, the reason for the slow rise in old gen memory is clear: the JVM heap contains some long-lived cache objects, and new instances of these caches are regenerated periodically.
So does this affect the application? Judging from its current state, memory grows slowly without triggering a major GC, which means the total size of these cached objects is not large. Even if old gen usage eventually reaches its limit after a long time, a major GC will simply run and clean up this garbage, which is normal behavior. Right now the application's major GC count is very low, so its memory situation is healthy.
To assess an application's memory health, look at how much heap the application uses, whether GC can reclaim that heap, and the frequency and duration of GCs. When you do find a memory problem, use the jmap command to take a JVM heap snapshot and analyze it in depth with a dump analysis tool (such as the IDEA profiler or YourKit (paid)) to locate the root cause.
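As a minimal sketch of the first step (heap usage, GC frequency, GC time), the standard java.lang.management API exposes all three without external tooling. This is just one way to collect the same numbers that Grafana was charting, not the service's actual monitoring setup:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class GcStats {
        public static void main(String[] args) {
            // Heap usage: how much of the heap the application currently occupies.
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("heap used: %d MB / committed: %d MB%n",
                    heap.getUsed() >> 20, heap.getCommitted() >> 20);

            // GC frequency and time: cumulative counts and milliseconds per collector
            // (typically one bean for the young collector, one for the old/major collector).
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: count=%d, time=%d ms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }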