In the past, troubleshooting was difficult when data from CMDB automatic discovery was not reported, because the data reporting link was not visible.
Breakpoints in monitoring metrics were hard to locate, and Flink was a black box for on-site staff.
When APM data was not updated in time, it was unclear whether the cause was a processing exception or a storage failure.
The console now integrates data link monitoring. The data reporting link is fully transparent and problem nodes are visible at a glance, which greatly reduces the difficulty of locating problems.
Full-link monitoring
Full-link monitoring covers the raw metric link, alarm link, trace raw link, trace aggregation link, resource discovery link, and metric storage success rate. The commonly used data links of the platform are now managed here, so you can clearly judge whether a data link is abnormal. Link governance displays the entire data link and helps analyze the overall situation, but for an individual task you still need to trace and troubleshoot the exceptions reported by that task.
.Raw metric link
Raw metrics are reported along the following path: agent (easy process sampler) --> receiver --> kafka --> raw metric process --> easy tsdb
When most dashboards on the monitoring platform show breakpoints or no data, check the data link first and look at the incoming and outgoing packets of each node. If data is accumulating at a certain node, the next node is abnormal and is not consuming. Click More in the upper right corner to jump to the detail page of that component. The wavy pattern in the figure is normal, because the raw metric process consumes from Kafka in batches and then writes to storage in batches. As long as the backlog does not grow in proportion to time, the data link is normal.
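The two checks described above can be sketched in Python. This is only an illustration under stated assumptions, not the console's own logic: the function names, the sample values, and the trough comparison used to tell a wavy backlog from a growing one are all hypothetical, and the sampling window is assumed to span several batch cycles.

```python
from typing import Dict, Optional, Sequence

# Node order follows the raw metric link described above.
RAW_METRIC_LINK = ["agent", "receiver", "kafka", "raw metric process", "easy tsdb"]


def backlog_is_growing(backlog: Sequence[float],
                       tolerance: float = 0.0,
                       min_samples: int = 4) -> bool:
    """Return True if the troughs of the backlog rise over the window.

    Batch consumption makes the backlog wavy, so peaks are ignored and only
    the lowest point of each half of the window is compared (an assumption).
    """
    if len(backlog) < min_samples:
        return False
    mid = len(backlog) // 2
    return min(backlog[mid:]) > min(backlog[:mid]) + tolerance


def locate_suspect(link: Sequence[str],
                   backlogs: Dict[str, Sequence[float]]) -> Optional[str]:
    """Return the node after the first accumulating node, or None."""
    for i, node in enumerate(link[:-1]):
        if backlog_is_growing(backlogs.get(node, [])):
            # Data piles up here, so the next node is not consuming.
            return link[i + 1]
    return None


# Hypothetical samples: a sawtooth is healthy, a steady climb is not.
healthy = {"kafka": [0, 50, 100, 0, 50, 100, 0, 50]}
stalled = {"kafka": [10, 20, 30, 40, 50, 60]}
print(locate_suspect(RAW_METRIC_LINK, healthy))
print(locate_suspect(RAW_METRIC_LINK, stalled))
```

Comparing troughs rather than raw values keeps the normal batch-driven waves from being flagged, at the cost of needing a window long enough to contain at least one trough in each half.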
.Alarm link
The monitoring and alarm link is kafka --> stream --> kafka --> alert channel go --> notify
After the metrics are written to Kafka, stream processing matches them against the alarm rules. If the alarm conditions are met, the alerts are written to Kafka. The alert channel go component then consumes the Kafka alarm queue, merges the alarm messages, and sends them to the user through notify.
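The flow of this link can be illustrated with a small in-memory sketch. Everything here is hypothetical: plain Python lists stand in for the Kafka topics, and the `Rule` fields, threshold comparison, and message format are assumptions made for illustration rather than the platform's actual rule model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    metric: str       # metric name the rule applies to
    threshold: float  # alert when the value exceeds this (assumed semantics)
    receiver: str     # who should be notified


def match_rules(samples: List[Tuple[str, str, float]],
                rules: List[Rule]) -> List[dict]:
    """Stream step: emit an alert for every sample that breaks a rule."""
    alerts = []
    for host, metric, value in samples:
        for rule in rules:
            if rule.metric == metric and value > rule.threshold:
                alerts.append({"host": host, "metric": metric,
                               "value": value, "receiver": rule.receiver})
    return alerts


def merge_alerts(alerts: List[dict]) -> Dict[str, str]:
    """Alert channel step: merge alerts into one message per receiver."""
    merged = defaultdict(list)
    for a in alerts:
        merged[a["receiver"]].append(f'{a["host"]} {a["metric"]}={a["value"]}')
    return {receiver: "; ".join(lines) for receiver, lines in merged.items()}


# Hypothetical metric samples read from the first queue.
samples = [("h1", "cpu", 95.0), ("h2", "cpu", 50.0), ("h1", "mem", 99.0)]
rules = [Rule("cpu", 90.0, "ops"), Rule("mem", 90.0, "ops")]

alert_queue = match_rules(samples, rules)   # written to the alarm queue
for receiver, message in merge_alerts(alert_queue).items():
    print(receiver, "<-", message)          # notify step
```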
.Trace raw link
This link handles the storage of raw APM data: agent (easy trace sampler) --> otelcol --> kafka --> span loader --> clickhouse. When APM data is not updated in real time, you can check the processing status of this data link.
.Trace aggregation link
The trace aggregation link computes statistics on the overall success rate, failure rate, and latency of APM data.
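As a rough illustration of what such statistics look like, the sketch below aggregates a batch of spans. The span fields (`error`, `duration_ms`) and the choice of mean latency are assumptions for the example and do not describe the platform's real schema or aggregation method.

```python
from statistics import mean
from typing import Dict, List, Optional


def aggregate_spans(spans: List[dict]) -> Dict[str, Optional[float]]:
    """Compute success rate, failure rate, and mean latency for spans."""
    total = len(spans)
    if total == 0:
        # Nothing to aggregate; report no value instead of dividing by zero.
        return {"success_rate": None, "failure_rate": None, "avg_latency_ms": None}
    failures = sum(1 for s in spans if s["error"])
    return {
        "success_rate": (total - failures) / total,
        "failure_rate": failures / total,
        "avg_latency_ms": mean(s["duration_ms"] for s in spans),
    }


# Hypothetical spans: three succeed, one fails.
spans = [
    {"error": False, "duration_ms": 10.0},
    {"error": False, "duration_ms": 20.0},
    {"error": True, "duration_ms": 70.0},
    {"error": False, "duration_ms": 20.0},
]
print(aggregate_spans(spans))
```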
.Resource discovery link
The reporting link for data automatically discovered by CMDB is as follows: agent --> receiver --> kafka --> data loader --> easy core
.Metric storage success rate
If the success rate stays low, you need to investigate the data link in detail.
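One way to make "stays low" concrete is to require several consecutive low samples before investigating, so that a single dip does not trigger a false alarm. The following is a minimal sketch: the 0.99 threshold and the count of three samples are assumed values, not platform defaults.

```python
from typing import Sequence


def success_rate_stays_low(rates: Sequence[float],
                           threshold: float = 0.99,
                           consecutive: int = 3) -> bool:
    """Return True if the most recent `consecutive` samples are all low."""
    if len(rates) < consecutive:
        return False
    return all(rate < threshold for rate in rates[-consecutive:])


# Hypothetical samples, oldest first.
print(success_rate_stays_low([1.0, 0.95, 1.0, 1.0]))    # one dip only
print(success_rate_stays_low([1.0, 0.97, 0.96, 0.90]))  # sustained drop
```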