Prometheus monitoring targets, metrics, and alert rules

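Contents

Linux Docker monitoring

Linux process monitoring

Linux OS monitoring

Windows OS monitoring

Configuration files and alert rules

Prometheus configuration file

node_alert.rules

docker_container.rules

mysql_alert.rules

vmware.rules

Alertmanager alert rules

Registering services with Consul

Dashboard JSON files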



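Linux Docker monitoring

cAdvisor collects the same statistics that the docker stats command reports and displays them in a web page.

cadvisor.tar

Upload the cadvisor.tar package, load it, retag the image, and run the container: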

docker load -i cadvisor.tar
docker tag gcr.io/cadvisor/cadvisor:latest google/cadvisor:latest
docker run -d --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 --name=cadvisor google/cadvisor:latest

Once the container is running, cAdvisor is available at http://ip:8080

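Linux process monitoring

process-exporter reports the status of selected processes, which can be matched by regular expression, absolute path, name, and so on.

process-exporter-0.7.5.linux-amd64.tar.gz

For details, see my other article:

Prometheus监控主机进程 (Monitoring Host Processes with Prometheus), CSDN blog

Default port: 9256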

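Linux OS monitoring

node_exporter collects the host's OS resource metrics: CPU, memory, disk, and so on.

After placing node_exporter in the target path, create a systemd unit for it: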

cat /etc/systemd/system/node-exporter.service

[Unit]
Description=Prometheus Node exporter
After=network.target

[Service]
ExecStart=/opt/monitoring/node_exporter

[Install]
WantedBy=multi-user.target

Default port: 9100
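Exporters such as node_exporter serve their metrics as plain text over HTTP at /metrics, one sample per line. As a rough illustration of what Prometheus reads from that endpoint, the sketch below parses a few such lines. The sample text and the helper name parse_samples are assumptions made for this example, not output captured from a real host, and the parser assumes samples carry no trailing timestamp.

```python
# Minimal sketch: parse Prometheus text-format samples into (name, labels, value).
# SAMPLE is made up for illustration; a real node_exporter serves far more
# series at http://<host>:9100/metrics.

SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemTotal_bytes 8.0e+09
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 2.5e+10
"""


def parse_samples(text):
    """Return a list of (metric_name, label_string, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is everything after the last space (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        name, brace, labels = series.partition("{")
        samples.append((name, labels.rstrip("}") if brace else "", float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_samples(SAMPLE):
        print(name, labels, value)
```

The alert expressions later in this article operate on series of exactly this shape: a metric name, an optional label set, and a numeric value.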

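Windows OS monitoring

windows_exporter collects the host's OS resource metrics: CPU, memory, disk, and so on.

windows_exporter-0.26.0-amd64.msi

1. Turn off the firewall.

2. Run the installer as administrator (double-click it).

3. In services.msc, check that the windows-exporter service is set to start automatically.

Default port: 9182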

Configuration files and alert rules

The files below are kept under /opt/monitor/prometheus.

Prometheus configuration file
cat /opt/monitor/prometheus/prometheus.yml 
# my global config
global:
  scrape_interval: 10s      # Scrape targets every 10 seconds.
  scrape_timeout: 5s        # Give up on a scrape after 5 seconds.
  evaluation_interval: 10s  # Evaluate rules every 10 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'zqa_monitor'

# Load and evaluate rules in these files every 'evaluation_interval' seconds.
rule_files:
  - 'node_alert.rules'
  - 'mysql_alert.rules'
  - 'docker_container.rules'
  # - "first.rules"
  # - "second.rules"

# alert
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "alertmanager:9093"

# Scrape configurations. The first job scrapes Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  #- job_name: 'cadvisor'
  #  # Override the global default and scrape targets from this job every 5 seconds.
  #  scrape_interval: 5s
  #  dns_sd_configs:
  #  - names:
  #    - 'tasks.cadvisor'
  #    type: 'A'
  #    port: 8080
  #  static_configs:
  #    - targets: ['10.33.70.218:8080']

  - job_name: 'node-exporter'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['10.100.10.100:9182']
    consul_sd_configs:
      - server: '10.33.70.203:8500'
        services: ['node-exporter-dev']

  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.33.70.218:9104', '10.33.70.166:9104', '10.33.70.224:9104']

  - job_name: 'postgres-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['123.57.190.129:9187']

  - job_name: 'vsphere-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.33.70.22:9272']

  - job_name: 'es-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['123.57.216.51:9114']

  - job_name: 'pushgateway'
    scrape_interval: 30s
    static_configs:
      - targets: ['39.104.94.83:19091']
        labels:
          instance: pushgateway
    honor_labels: true

  - job_name: "cadvisor"
    scrape_interval: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets: ["47.93.21.11:8080"]

  #- job_name: 'kafka-exporter'
  #  scrape_interval: 5s
  #  static_configs:
  #    - targets: ['10.100.7.1:9308']

  #- job_name: 'pushgateway'
  #  scrape_interval: 10s
  #  dns_sd_configs:
  #  - names:
  #    - 'tasks.pushgateway'
  #    type: 'A'
  #    port: 9091

  #  static_configs:
  #    - targets: ['node-exporter:9100']

node_alert.rules
groups:
- name: zqaalert
  rules:
  - alert: 机器宕机  # host down
    expr: up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."

  - alert: 负载率  # load average
    expr: node_load1 > 8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} under high load"
      description: "{{ $labels.instance }} of job {{ $labels.job }} is under high load."

  - alert: 可用内存小于5%  # available memory below 5%
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory alert (< 5% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: 磁盘使用率  # disk usage
    expr: (100 - ((node_filesystem_avail_bytes{device!~'rootfs'} * 100) / node_filesystem_size_bytes{device!~'rootfs'}) > 90)
    for: 5m
    labels:
      severity: High
    annotations:
      summary: "{{$labels.instance}}: High Disk usage detected"
      description: "{{$labels.instance}}: disk usage above 90% (current value: {{ $value }})"

  - alert: Cpu使用率  # CPU usage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 95
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: High Cpu usage detected"
      description: "{{$labels.instance}}: CPU usage above 95% (current value: {{ $value }})"

  # - alert: 进程恢复  # process restarted
  #   expr: ceil(time() - max by(instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
  #   for: 0s
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: "Process restarted"
  #     description: "Process {{ $labels.groupname }} restarted {{ $value }} seconds ago"

  - alert: 进程退出告警  # process exited
    # expr: max by(instance, groupname) (rate(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"}[5m])) < 0
    expr: namedprocess_namegroup_num_procs{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"} == 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Process exited"
      description: "Process {{ $labels.groupname }} has exited"

  # - alert: 进程退出告警  # process exited (alternative rule)
  #   expr: max_over_time(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"}[1d]) < (time() - 10*60)
  #   for: 1s
  #   labels:
  #     severity: warning
  #   annotations:
  #     description: A process in group {{ $labels.groupname }} exited within the last 10 minutes
  #     summary: Process exited

  # - alert: 机器硬盘读取速率  # host disk read rate
  #   expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 200
  #   for: 5m
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: Host unusual disk read rate (instance {{ $labels.instance }})
  #     description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  # - alert: 机器硬盘写入速率  # host disk write rate
  #   expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 120
  #   for: 2m
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: Host unusual disk write rate (instance {{ $labels.instance }})
  #     description: "Disk is probably writing too much data VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: Esxi主机连接丢失  # ESXi host connection lost
    expr: vmware_host_power_state != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ESXi host {{ $labels.host_name }} lost connection"
      description: "VMware host {{ $labels.host_name }} is not connected to the virtualization platform."
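The groupname matchers in the process rules are regular expressions, and Prometheus anchors =~ patterns at both ends. A trailing * and a trailing .* therefore select different sets of group names. The sketch below uses Python's re.fullmatch to mimic that anchoring; the group names are hypothetical, and Python's regex dialect is only a stand-in for the RE2 syntax Prometheus uses (the two agree for patterns this simple).

```python
import re

# Prometheus label matchers (=~) must match the whole label value,
# which re.fullmatch mimics. The group names tested below are hypothetical.
LOOSE = r"^lizhu_monitor*"    # '*' repeats only the final 'r'
PREFIX = r"^lizhu_monitor.*"  # '.*' accepts any suffix


def matches(pattern, groupname):
    """Return True if the whole group name matches the pattern."""
    return re.fullmatch(pattern, groupname) is not None


if __name__ == "__main__":
    for name in ("lizhu_monitor", "lizhu_monitor_v2"):
        print(name, matches(LOOSE, name), matches(PREFIX, name))
```

With the first pattern a group called lizhu_monitor_v2 would never be selected, so its exit would go unnoticed; the second pattern covers every group whose name starts with the prefix.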

      

docker_container.rules
groups:
- name: zqaalert
  rules:
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "No containers, instance: {{ $labels.instance }}"
      description: "No container has been seen for 5 minutes, current value: {{ $value }}"

  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 300
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage alert, instance: {{ $labels.instance }}"
      description: "Container CPU usage is above 300%, current value: {{ $value }}"

  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container memory usage alert, instance: {{ $labels.instance }}"
      description: "Container memory usage is above 80%, current value: {{ $value }}"

  - alert: ContainerVolumeIOUsage
    expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container storage IO usage alert, instance: {{ $labels.instance }}"
      description: "Container storage IO usage is above 80%, current value: {{ $value }}"

  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container throttling alert, instance: {{ $labels.instance }}"
      description: "Container is being throttled, current value: {{ $value }}"
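The CPU threshold of 300% can look odd at first: the CPU counter is measured in CPU-seconds, so a container using several cores exceeds 100%. The following sketch reproduces the arithmetic behind ContainerCpuUsage and ContainerMemoryUsage under simplifying assumptions. The function names and sample numbers are invented for illustration, and the real rate() function also handles counter resets and extrapolation, which this does not.

```python
# Minimal sketch of the arithmetic behind ContainerCpuUsage and
# ContainerMemoryUsage. All sample numbers are invented for illustration.


def cpu_percent(cpu_seconds_start, cpu_seconds_end, window_seconds):
    """Approximate rate(container_cpu_usage_seconds_total[window]) * 100."""
    return (cpu_seconds_end - cpu_seconds_start) / window_seconds * 100


def memory_percent(working_set_bytes, limit_bytes):
    """Working set / memory limit * 100, or None when no limit is set."""
    if limit_bytes <= 0:
        # Mirrors the `container_spec_memory_limit_bytes > 0` filter in the rule:
        # containers without a limit are left out of the comparison.
        return None
    return working_set_bytes / limit_bytes * 100


if __name__ == "__main__":
    # 630 CPU-seconds consumed over a 180 s window is 3.5 cores, i.e. 350%.
    print(cpu_percent(1000.0, 1630.0, 180))
    # 900 MiB used out of a 1 GiB limit.
    print(memory_percent(900 * 1024**2, 1024**3))
```

In this reading, the 300% threshold fires once a container keeps more than three cores busy over the 3-minute window.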
mysql_alert.rules
groups:
- name: zqaalert
  rules:
  - alert: Mysql 宕机  # MySQL down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: MySQL down (instance {{ $labels.instance }})
      description: "MySQL instance is down on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: MysqlTooManyConnections(>80%)
    expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }})
      description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: MysqlHighThreadsRunning
    expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL high threads running (instance {{ $labels.instance }})
      description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: Mysql慢查询  # MySQL slow queries
    expr: increase(mysql_global_status_slow_queries[1m]) > 0
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: MySQL slow queries (instance {{ $labels.instance }})
      description: "MySQL server has some new slow queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
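vmware.rules
groups:
- name: VMware Host Connection State
  rules:
  - alert: HostDisconnected
    # PromQL cannot compare a metric with a string; the power-state gauge is
    # compared with 1 here, as in the ESXi rule in node_alert.rules.
    expr: vmware_host_power_state != 1
    for: 5m  # the condition must hold for 5 minutes before the alert fires
    labels:
      severity: warning
    annotations:
      summary: "VMware host {{ $labels.instance }} disconnected"
      description: "VMware host {{ $labels.instance }} is not connected to the virtualization platform."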

Alertmanager alert rules

Alerts are grouped so that machines in the same group are handled together.

cat /opt/monitor/alertmanager/config.yml

global:
  resolve_timeout: 5m
  smtp_from: 'ops@xxx.com'
  smtp_smarthost: 'smtp.feishu.cn:465'
  smtp_auth_username: 'ops@xxx.com'
  smtp_auth_password: 'ydWhsFDk3pF50TZg'
  smtp_require_tls: false
  smtp_hello: 'ZQA监控告警'

route:
  group_by: ['zqaalert']  # label names used to group alerts
  group_wait: 60s         # after the first alert of a new group fires, wait this long so other alerts of the same group join the first notification
  group_interval: 10m     # wait this long before notifying about new alerts added to a group that has already been notified
  repeat_interval: 60m    # wait this long before re-sending a notification for alerts that are still firing (here: one hour)
  receiver: 'web.hook'

receivers:
#- name: 'web.hook.prometheusalert'
- name: 'web.hook'
  webhook_configs:
  - url: 'http://10.33.70.22:9094/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/7fe7f42d-242b-42eb-837c-028cfc84adb8'
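The group_by setting takes label names, and alerts that share the same values for those labels are batched into one notification. The sketch below illustrates that partitioning on a few hypothetical alerts; the helper name group_alerts and the alert data are assumptions made for this example. A label that an alert does not carry is treated as an empty value.

```python
from collections import defaultdict

# Minimal sketch of how route.group_by partitions alerts: alerts with the same
# values for the listed label names land in the same group.
# The alerts below are hypothetical.


def group_alerts(alerts, group_by):
    """Group alert names by the tuple of their group_by label values."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels["alertname"])
    return dict(groups)


ALERTS = [
    {"alertname": "HostOomKillDetected", "instance": "10.33.70.218:9100", "severity": "warning"},
    {"alertname": "ContainerCpuUsage", "instance": "10.33.70.218:8080", "severity": "warning"},
    {"alertname": "HostOomKillDetected", "instance": "10.33.70.166:9100", "severity": "warning"},
]

if __name__ == "__main__":
    # None of the alerts carries a zqaalert label, so they share one group.
    print(group_alerts(ALERTS, ["zqaalert"]))
    # Grouping by alertname yields one group per alert name.
    print(group_alerts(ALERTS, ["alertname"]))
```

If per-host or per-alert notifications are wanted, label names such as alertname or instance are the usual candidates for group_by.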

Registering services with Consul

* */1 * * * ip addr | awk '/^[0-9]+: / {}; /inet.*global/ {print gensub(/(.*)\/(.*)/, "\\1", "g", $2)}' |grep "10.33"|head -1|xargs -i curl -X PUT -d  '{"id": "node-exporter-{}","name": "node-exporter-dev","address": "{}","port": 9100,"tags": ["env-dev"],"checks": [{"http": "http://{}:9100/metrics", "interval": "5s"}]}'  http://consul.intra.xxx.net/v1/agent/service/register

A ready-made Consul container image is available; just run it.
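The cron entry above finds the host's 10.33.x.x address and PUTs a service definition to Consul's /v1/agent/service/register endpoint, which is where the node-exporter job's consul_sd_configs picks it up. As a sketch of what that request body contains, the function below builds the same JSON document. The helper name registration_payload and the example address are assumptions, and nothing is sent over the network.

```python
import json

# Minimal sketch: build the service definition that the cron entry sends to
# Consul. The field values follow the curl command in the crontab line;
# the helper name and the example address are illustrative.


def registration_payload(address, port=9100, name="node-exporter-dev", env_tag="env-dev"):
    """Return the JSON body for registering one node_exporter instance."""
    return {
        "id": f"node-exporter-{address}",
        "name": name,
        "address": address,
        "port": port,
        "tags": [env_tag],
        # Consul polls this URL every 5 seconds as a health check.
        "checks": [{"http": f"http://{address}:{port}/metrics", "interval": "5s"}],
    }


if __name__ == "__main__":
    print(json.dumps(registration_payload("10.33.70.218"), indent=2))
```

Because the id embeds the address, re-running the registration for the same host updates the existing entry instead of creating a duplicate.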

Dashboard JSON files

The Grafana dashboards I have found most useful can be downloaded from:

Grafana dashboards | Grafana Labs

    

   

    

