Hello everyone, I'm Xiaofei. The previous article covered the generator.yml and prometheus.yml configuration files, including the file-based service discovery mechanism, and how to use the generator shipped with SNMP exporter to produce the snmp.yml configuration file from generator.yml. Today I'll walk through the production-environment setup, including deploying the node_exporter data collector on virtual machine nodes.
About storage: Prometheus ships with local storage in the form of a time-series database (TSDB). Since version 2.0, its data compression has improved greatly (each sample takes only about 1-2 bytes), and a single node can satisfy most users' needs. However, local storage stands in the way of clustering Prometheus, so in a cluster you should use another time-series database instead, such as InfluxDB.
Prometheus can be divided into three parts: scraping data, storing data, and querying data.
Scraping is done by the various exporters, storage is the time-series database, and querying can be understood as data visualization.
In my production environment the metric volume is modest, so I rely on local storage rather than an external database. Local storage writes a block every 2 hours. Each block is a directory containing: one or more chunk files (which hold the time-series data, with a default chunk file size of 512 MB), a metadata file, and an index file (which maps metric names and labels to the location of the time series in the chunk files). As shown below:
./data
├── 01bk**7jbm69t2g1bgbgm6kb12
│   └── meta.json
├── 01bkgtzq1sy**tr4pb43c8pd98
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── 01bkgtzq1hhwhv8fbjxw1y3w0k
│   └── meta.json
├── 01bk**7jc0ry8a6macw02a2pjd
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── chunks_head
│   └── 000001
└── wal
    ├── 000000002
    └── checkpoint.00000001
        └── 00000000
Disk size is calculated as follows:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

That is: disk size = retention time * samples ingested per second * bytes per sample.
If retention_time_seconds and bytes_per_sample are fixed, the only way to reduce local disk requirements is to reduce ingested_samples_per_second.
There are two levers for that: reduce the number of time series, or increase the sample collection interval.
Since Prometheus compresses each time series, reducing the number of time series has the more noticeable effect.
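As a back-of-the-envelope sketch, the formula above can be evaluated directly in the shell. The 90-day retention matches this setup; 1,000 samples/s and 2 bytes/sample are illustrative assumptions, not measured values:

```shell
# Rough disk-space estimate from the formula above.
# 90d retention matches this setup; the sample rate and
# bytes-per-sample figures are illustrative assumptions.
retention_time_seconds=$((90 * 24 * 3600))
ingested_samples_per_second=1000
bytes_per_sample=2

needed_disk_space=$((retention_time_seconds * ingested_samples_per_second * bytes_per_sample))
echo "${needed_disk_space} bytes (~$((needed_disk_space / 1024 / 1024 / 1024)) GiB)"
# prints: 15552000000 bytes (~14 GiB)
```

Plugging in your own scrape volume (visible as the `prometheus_tsdb_head_samples_appended_total` rate) gives a usable capacity-planning number.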
Why is data flushed every two hours by default and written as chunk block files?
Saving sample data in time-window blocks significantly improves Prometheus's query efficiency: to fetch all samples within some period, it only needs to read the blocks that fall inside that range.
The most recent 2 hours of data are kept in memory.
This newest data lives in the in-memory head block and is written to disk after 2 hours, so estimating how much data 2 hours of scraping produces tells you roughly how much memory is needed.
As the scale grows, the CPU and memory Prometheus needs grow with it, and memory usually hits the bottleneck first. Let's start with the memory issues of standalone Prometheus. Its memory consumption comes mainly from the fact that a block of data is flushed to disk every 2 hours, and until then all of that data sits in memory, so consumption scales with the ingestion volume. Loading historical data also moves it from disk into memory, so the larger the query range, the more memory is used; there is some room for optimization here. Unreasonable query conditions, such as heavy group operations or rate over a very large range, can also inflate memory. To prevent data loss when the program crashes, Prometheus uses a WAL (write-ahead log) mechanism: data is written to the WAL first and replayed from it at startup.
This is what is stored while the data has not yet been flushed to disk; note the WAL files:
./data/01bk**7jbm69t2g1bgbgm6kb12
./data/01bk**7jbm69t2g1bgbgm6kb12/meta.json
./data/01bk**7jbm69t2g1bgbgm6kb12/wal/000002
./data/01bk**7jbm69t2g1bgbgm6kb12/wal/000001
How the data is stored.
After the block's contents are flushed, the WAL files are deleted, and index and tombstones files (records of deleted data) are generated along with the chunk data file 000001:
./data/01bk**7jc0ry8a6macw02a2pjd
./data/01bk**7jc0ry8a6macw02a2pjd/meta.json
./data/01bk**7jc0ry8a6macw02a2pjd/index
./data/01bk**7jc0ry8a6macw02a2pjd/chunks
./data/01bk**7jc0ry8a6macw02a2pjd/chunks/000001
./data/01bk**7jc0ry8a6macw02a2pjd/tombstones
These 2-hour blocks will be compressed into larger blocks in the background, and the data will be compressed and merged into higher-level block files and the lower-level block files will be deleted. This is consistent with the idea of LSM trees such as leveldb and rocksdb.
Data expires after 15 days by default.
The storage directory defaults to data; you can point it at another location if you want to mount external storage.
How to delete data
When data is deleted, the deletion entries are recorded in a separate tombstones file rather than being removed from the chunk files immediately.
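A minimal sketch of triggering deletion by hand through the TSDB admin API (only available when Prometheus runs with --web.enable-admin-api, as in this setup). The localhost URL and the node_load1 selector are illustrative assumptions; delete_series only writes tombstones, and clean_tombstones then reclaims the space:

```shell
# Sketch: delete a series via the TSDB admin API (requires
# --web.enable-admin-api). URL and selector are illustrative.
PROM_URL="http://localhost:9090"

# Build the delete_series URL for a given series selector.
delete_url() {
  printf '%s/api/v1/admin/tsdb/delete_series?match[]=%s' "$PROM_URL" "$1"
}

# Step 1: record tombstones for matching samples (frees no space yet):
#   curl -X POST "$(delete_url 'node_load1')"
# Step 2: compact away the tombstoned data and reclaim disk space:
#   curl -X POST "${PROM_URL}/api/v1/admin/tsdb/clean_tombstones"
delete_url 'node_load1'; echo
```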
The first issue is data persistence: data is kept for 15 days by default, and the native TSDB is not very friendly to storing and querying large data volumes, so it is not suitable for keeping large amounts of data long-term. In addition, its reliability is comparatively weak, it is prone to failures such as data corruption in use, and it cannot support a cluster architecture.
As for remote storage, I won't cover it here: I don't plan to use it in my production environment, since local storage meets my needs.
Data is kept for 15 days by default; here retention is changed to 90 days, with the admin API enabled and dynamic (hot) reloading of the configuration turned on. Scrape compression reduces network bandwidth pressure, and index compression speeds up flushing to disk and avoids large-memory problems.
For more information about setting up the production environment, please refer to the previous article:
Prometheus + Grafana: Building IT Monitoring and Alerting Best Practices (1). Here are some additional details and refinements on top of it:
Prometheus Server, Grafana Server, and Alertmanager are deployed here from precompiled binaries rather than via a Docker installation.
In production: older versions of Docker are called docker or docker-engine; if these are installed, uninstall them along with their dependencies:

sudo yum remove docker docker-client docker-client-latest docker-common docker-latest docker-latest-logrotate docker-logrotate docker-engine

Install Docker using the repository:

sudo yum install -y yum-utils
sudo yum-config-manager --add-repo

Install the latest version of Docker Engine, containerd, and Docker Compose:

sudo yum install docker-ce docker-ce-cli containerd.io docker-compose-plugin

(To install a specific version of Docker, check the official website.) Verify the installation:

[root@it-prometheus ] docker version
Client: Docker Engine - Community
 Version: 20.10.20
 API version: 1.41
 Go version: go1.18.7
 Git commit: 9fdeb9c
 Built: Tue Oct 18 18:22:47 2022
 OS/Arch: linux/amd64
 Context: default
 Experimental: true
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Docker is installed at this point, but the daemon is not running yet, so start it:

sudo systemctl start docker
[root@it-prometheus ] systemctl start docker
[root@it-prometheus ] systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-10-26 00:15:11 CST; 9s ago
     Docs:
 Main PID: 19146 (dockerd)
    Tasks: 13
   Memory: 36.2M
   CGroup: /system.slice/docker.service
           └─19146 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Set Docker to start on boot:

sudo systemctl enable docker
[root@it-prometheus ] systemctl enable docker
Created symlink from /etc/systemd/system/multi-user.target.wants/docker.service to /usr/lib/systemd/system/docker.service.
Deploy Grafana:

wget
yum install grafana-enterprise-9.2.2-1.x86_64.rpm

Start it:

systemctl enable grafana-server
systemctl start grafana-server
systemctl status grafana-server
netstat -anplut | grep grafana

Access Grafana in the browser at the host's IP; the default username and password are admin / admin.

Deploy Prometheus:

wget
tar -zxvf prometheus-2.39.1.linux-amd64.tar.gz

Create the unit file:

vim /usr/lib/systemd/system/prometheus.service

[Unit]
Description=Prometheus Server
Wants=network-online.target
After=network.target

[Service]
Type=***
User=root
ExecStart=/root/monitor/prometheus/current/prometheus --config.file=/root/monitor/prometheus/conf/prometheus.yml --web.listen-address=:9090 --storage.tsdb.path=/root/monitor/prometheus/data/ --storage.tsdb.retention=90d --web.enable-lifecycle --web.enable-admin-api
Restart=on-failure

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl start prometheus
systemctl status prometheus
systemctl enable prometheus

Access Prometheus.

Deploy Alertmanager:

wget
tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz

vim /usr/lib/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager

[Service]
ExecStart=/root/monitor/alertmanager/alertmanager --config.file=/root/monitor/alertmanager/conf/alertmanager.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl start alertmanager.service
systemctl status alertmanager.service
systemctl enable alertmanager.service

After startup, check whether the configuration file has any formatting errors:
./amtool check-config /root/monitor/alertmanager/conf/alertmanager.yml

Access it, then install prometheus-webhook-dingtalk:

wget
tar -zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

vim /usr/lib/systemd/system/prometheus-webhook.service

[Unit]
Description=Prometheus DingTalk webhook

[Service]
ExecStart=/root/monitor/prometheus-webhook-dingtalk/current/prometheus-webhook-dingtalk --config.file=/root/monitor/prometheus-webhook-dingtalk/conf/config.yml --web.enable-ui --web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl start prometheus-webhook.service
systemctl status prometheus-webhook.service
systemctl enable prometheus-webhook.service

Access it.
At this point, you have completed the construction of Prometheus Server, Grafana Server, AlertManager, and DingTalk alarm plug-in.
Set up the node exporter here
curl -LO
tar -zxvf node_exporter-1.4.0.linux-amd64.tar.gz

vim /usr/lib/systemd/system/node_exporter.service

[Unit]
Description=The node_exporter server
Wants=network-online.target
After=network.target

[Service]
ExecStart=/opt/node_exporter/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=15s
SyslogIdentifier=node_exporter

[Install]
WantedBy=multi-user.target

Start it:

systemctl daemon-reload
systemctl start node_exporter
systemctl status node_exporter
systemctl enable node_exporter
1. Host monitoring: I use the file-based service discovery mechanism here, rather than registration-based discovery with Consul.
Add a job to the prometheus.yml configuration file to collect host information from the internal VMware cluster:

  - job_name: "vmware-host"
    metrics_path: /metrics
    scheme: http
    scrape_interval: 5s
    file_sd_configs:
      - files:
          - /data/monitor/prometheus/targets/node-*.yml
        refresh_interval: 2m

Then create the node-it.yml file:

- labels:
    service: it-monitor
    brand: dell
  targets:
    - 172.17.40.51:9100
    - 172.17.40.54:9100
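Since file_sd targets are plain YAML files, they are easy to produce from a script. A sketch that writes the node-it.yml shown above and counts its targets; the output path is the current directory for illustration only, whereas in this setup the file would live under /data/monitor/prometheus/targets/:

```shell
# Sketch: generate a file_sd target file like node-it.yml above.
# Writing to the current directory for illustration; the real path
# in this setup is /data/monitor/prometheus/targets/node-it.yml.
target_file="node-it.yml"
cat > "$target_file" <<'EOF'
- labels:
    service: it-monitor
    brand: dell
  targets:
    - 172.17.40.51:9100
    - 172.17.40.54:9100
EOF

# Prometheus re-reads this file on its own (refresh_interval: 2m),
# so adding a host is just appending another "- <ip>:9100" target line.
grep -c ':9100' "$target_file"
```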
Dynamically load configuration files:
curl -X POST localhost:9090/-/reload
2. Switch monitoring: use the file service discovery mechanism.
Add the switch job:

  - job_name: "snmp"
    file_sd_configs:
      - files:
          - /data/monitor/prometheus/targets/network-*.yml
        refresh_interval: 2m
    scrape_interval: 5s     # overrides the global 15s for SNMP collection
    metrics_path: /snmp
    params:
      module:
        - if_mib
      community:            # takes effect when snmp.yml in SNMP exporter does not specify a community; the default is generally public
        - xxxx
    relabel_configs:
      - source_labels: ["__address__"]
        target_label: __param_target
      - source_labels: ["__param_target"]
        target_label: instance          # set the instance label to the target's IP address
      - target_label: __address__
        replacement: 172.17.40.54:9116  # IP address of the SNMP exporter service
      - source_labels: ["mib"]          # take the MIB module name from the custom target label
        target_label: __param_module

Create the network-switch.yml file. If there are multiple switches, append more targets to the end of the file:

- labels:
    mib: huawei
    brand: huawei
    hostname: hzzb-b2l-ag-master
    model: s5720-36c-ei-ac
  targets:
    - 172.18.48.2
- labels:
    mib: huawei
    brand: huawei
    hostname: hzzb-b2l-access-master
    model: s5720s-52p-li-ac
  targets:
    - 172.18.48.5
- labels:
    mib: huawei
    brand: huawei
    hostname: hzzb-b2l-poe-master
    model: s5720s-28p-pwr-li-ac
  targets:
    - 172.18.48.6
- labels:
    mib: huawei
    brand: huawei
    hostname: hzzb-bljc-poe-master
    model: s5720s-28p-pwr-li-ac
  targets:
    - 172.17.14.13
- labels:
    mib: huawei
    brand: huawei
    hostname: hzzb-bljc-access-master
    model: s5720s-52p-li-ac
  targets:
    - 172.17.14.14
Now that the monitoring target has been added, we will explain how to deploy the collector in batches and add the target.
Prerequisites: Deploy alertmanager and prometheus-webhook
The idea: in the alertmanager.yml configuration file, configure the mail server, template path, routes, groups, and receivers (define the receiving channels such as email, WeChat, DingTalk); in the prometheus-webhook configuration file config.yml, configure the DingTalk robot's key and URL (add the robot in advance) and reference the template file (the template Alertmanager uses); configure Prometheus to communicate with Alertmanager; create alerting rules in Prometheus; then restart the services and test.
# alertmanager.yml configuration template
global:                                  # global configuration
  resolve_timeout: 5m                    # if no new alerts arrive within this window, mark the alert as resolved
  # mailbox configuration
  smtp_smarthost: 'localhost:25'         # SMTP server address
  smtp_from: '[email protected]'     # message sender
  smtp_auth_username: 'alertmanager'     # sender user
  smtp_auth_password: 'password'         # sender password
  smtp_hello: '@example.org'             # identity tag
  smtp_require_tls: false                # disable TLS
templates:                               # alert template path
  - '/root/monitor/prometheus-webhook-dingtalk/template/*.tmpl'
route:                                   # how each alert event is dispatched
  group_by: ['alertname']                # which label to group alerts by
  group_wait: 30s                        # wait time for the first alert in a group; alerts arriving within it are merged into one notification
  group_interval: 5m                     # interval between notifications for new alerts in an existing group
  repeat_interval: 30m                   # when an alert has not been handled, resend it after this long
  receiver: 'dingtalk_webhook'           # defines who gets notified
receivers:
  - name: 'ops'
    email_configs:
      - to: '[email protected]'
        html: '}'
        headers:
        send_resolved: true
  - name: 'dingtalk_webhook'
    webhook_configs:
      - url: 'dingtalk/webhook1/send'    # the webhook1 URL of prometheus-webhook
        send_resolved: true              # whether to send a recovery message to the recipient once resolved
Official sample file:
alertmanager/***yml at main · prometheus/alertmanager · GitHub: github.com/prometheus/alertmanager/blob/main/doc/examples/***yml
Configure alerting rules in Prometheus.
Point Prometheus at Alertmanager and at the rule files. The rule files, scanned and loaded according to the settings below, hold the custom alerting rules; alert media and routing are handled by Alertmanager:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 172.17.40.51:9093

rule_files:
  - "rules/*.yml"
  # - "second_rules.yml"
Write a rules file.
Here are two rule configuration files I found online:

groups:
- name: servers status        # group name
  rules:
  - alert: CPU load 1-minute alarm
    expr: node_load1 / count(count(node_cpu_seconds_total) without (mode)) by (instance, job) > 2.5
    for: 1m
    labels:
      level: warning
    annotations:
      summary: "} CPU load alarm"
      description: "} 1-minute CPU load (current value: }"
  - alert: CPU usage alert
    expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle"}[30m])) by (instance) > 0.85
    for: 1m
    labels:
      level: warning
    annotations:
      summary: "} CPU usage alarm"
      description: "CPU usage is over 85% (current value: }"
  - alert: CPU usage alert
    expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle"}[30m])) by (instance) > 0.9
    for: 1m
    labels:
      level: warning
    annotations:
      summary: "} CPU load alarm"
      description: "CPU usage is over 90% (current value: }"
  - alert: memory usage alarm
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
    labels:
      level: critical
    annotations:
      summary: "} Out of available memory alarm"
      description: "} Memory usage has reached 90% (current value: }"
  - alert: disk usage alarm
    expr: 100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
    labels:
      level: warning
    annotations:
      summary: "} Disk usage alarm"
      description: "} Disk usage has exceeded 85% (current value: }"
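To sanity-check a threshold like the memory rule above, you can plug in sample numbers. A sketch with assumed values (2 GiB available out of 32 GiB total; integer shell arithmetic only, so the percentage is truncated):

```shell
# Sketch: evaluate (1 - MemAvailable/MemTotal) * 100 with sample values.
# 2 GiB available out of 32 GiB total are illustrative assumptions.
mem_available=$((2 * 1024 * 1024 * 1024))
mem_total=$((32 * 1024 * 1024 * 1024))

# Rearranged to (total - available) * 100 / total to stay in integers.
used_pct=$(( (mem_total - mem_available) * 100 / mem_total ))
echo "memory used: ${used_pct}%"
# 93% > 90, so the "memory usage alarm" rule above would fire
```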
Host survival check.
groups:
- name: servers survival
  rules:
  - alert: node survival --IT environment--Prometheus    # alert rule name
    expr: up == 0
    for: 1m                  # how long the condition must hold before firing
    labels:                  # custom labels; the level tag marks this rule's severity (critical / warning)
      level: critical
    annotations:             # additional information (message header text)
      summary: "The machine } is down"
      description: "Server } is down (current value: })"
  - alert: node survival --IT environment--other servers
    expr: up == 0
    for: 1m
    labels:
      level: critical
    annotations:
      summary: "The machine } is down"
      description: "} is down (current value: })"
  - alert: node survival --IT environment--production ES servers
    expr: up == 0
    for: 1m
    labels:
      level: critical
    annotations:
      summary: "The machine } is down"
      # description: "} is down (current value: })"
The configuration of prometheus-webhook is as follows:
## request timeout
# timeout: 5s

## uncomment following line in order to write template from scratch (be careful!)
# no_builtin_template: true

## customizable templates path
templates:
  # - current/contrib/templates/legacy/template.tmpl
  - /root/monitor/prometheus-webhook-dingtalk/template/dingding.tmpl

## you can also override default template using `default_message`
## the following example uses the 'legacy' template from v0.3.0
# default_message:
#   title: '}'
#   text: '}'

## targets, previously known as "profiles"
targets:
  webhook1:
    url:
    # secret for signature: the DingTalk bot's signing secret
    secret: secxxxxxxxxxxxx
  webhook2:
    url:
  legacy:
    url:
    # customize template content: use the legacy template
    message:
      title: '}'
      text: '}'
  #webhook_mention_all:
  #  url:
  #  mention:
  #    all: true
  #webhook_mention_users:
  #  url:
  #  mention:
  #    mobiles: ['156xxxx8827', '189xxxx8325']
Now that you have completed the basic configuration of alarms, the next article describes how to deploy node exporters in batches, optimize alarm rule files, and optimize DingTalk alarms.