Prometheus Cluster Monitoring - Hardware Monitoring

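After weighing the options for a long time, I decided to use Prometheus for monitoring. It is built on a time-series data model with key-value labels, which makes it well suited to trend analysis, and queries are fast. It collects data over HTTP in both pull and push modes (push goes through the Pushgateway), so it is highly extensible. The community is large, and there are many high-quality exporters, both official and community-maintained.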

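Initial single-node installation of the Prometheus server
  • Download the latest release from the official website.
  • Copy it to a directory and extract it; it runs as is, so I won't go into detail here.
  • Edit the configuration file, prometheus.yml (to list only its active settings, run egrep -v '^$|#' prometheus.yml):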
# my global config
global:
  scrape_interval:     15s # how often metrics are scraped
  evaluation_interval: 15s # how often rules are evaluated
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration 
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs: # scrape configuration
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus' # job name
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs: # monitoring targets
    - targets: ['192.168.1.250:9090']
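Bulk deployment of the collector: node_exporter for basic server metrics (very comprehensive)
  • I have 28 machines, so I first downloaded node_exporter to my own machine, put it on an FTP server, and used Ansible to install it in bulk.
  • Once installed, it listens on port 9100 by default.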
- name: node_exporter_install
  hosts: all
  tasks:

      - name: Create group prometheus
        group:
          name: prometheus
          state: present

      - name: Create user prometheus
        user:
          name: prometheus
          shell: /sbin/nologin  # service account, no interactive login
          groups: prometheus

      - name: Extract node_exporter archive into /usr/local
        unarchive:
          src: ftp://192.168.1.254/software/node_exporter.tar.gz
          dest: /usr/local
          remote_src: yes

      - name: Install node_exporter systemd unit
        copy:
          src: /srv/ftp/software/systemctl_file/node_exporter.service
          dest: /usr/lib/systemd/system/node_exporter.service

      - name: Start server and enable the server
        systemd:
          state: started
          name: node_exporter
          enabled: yes
          daemon_reload: yes  # make systemd pick up the newly copied unit file
  • Verify that the exporter responds: curl 192.168.1.2:9100/metrics
  • Then edit the server's configuration file to add the monitoring targets:
static_configs:
    - targets: ['192.168.1.250:9090','192.168.1.2:9100','192.168.1.3:9100','192.168.1.10:9100','192.168.1.11:9100','192.168.1.12:9100','192.168.1.13:9100','192.168.1.160:9100','192.168.1.161:9100','192.168.1.162:9100','192.168.1.167:9100','192.168.1.168:9100','192.168.1.155:9100','192.168.1.156:9100','192.168.1.180:9100','192.168.1.181:9100','192.168.1.199:9100','192.168.1.200:9100','192.168.1.201:9100','192.168.1.202:9100','192.168.1.203:9100','192.168.1.204:9100','192.168.1.210:9100','192.168.1.211:9100','192.168.1.212:9100','192.168.1.217:9100','192.168.1.218:9100','192.168.1.219:9100']
  • Restart the service, then open the Targets page in the Prometheus web UI. All 28 targets now show as up.
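
Typing a long targets list by hand is error-prone. As a minimal sketch (not part of the original setup), the line could be generated from a plain list of host IPs. The function name render_targets and the short host list below are illustrative assumptions; the output format simply mirrors the static_configs entry shown above.

```python
def render_targets(hosts, port=9100, extra=()):
    """Return a static_configs 'targets' line for the given hosts.

    hosts: iterable of IP addresses or hostnames running node_exporter
    port:  exporter port (9100 is the node_exporter default)
    extra: ready-made 'host:port' targets to put first, e.g. Prometheus itself
    """
    targets = list(extra) + [f"{host}:{port}" for host in hosts]
    quoted = ",".join(f"'{t}'" for t in targets)
    return f"- targets: [{quoted}]"


if __name__ == "__main__":
    # Hypothetical short host list; replace with the real inventory.
    hosts = ["192.168.1.2", "192.168.1.3", "192.168.1.10"]
    print(render_targets(hosts, extra=["192.168.1.250:9090"]))
```

The printed line can be pasted under static_configs in prometheus.yml.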

  • Memory usage (%): ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_Slab_bytes)/node_memory_MemTotal_bytes) * 100

  • Disk usage (%): (node_filesystem_size_bytes{fstype="xfs"} - node_filesystem_free_bytes{fstype="xfs"})/node_filesystem_size_bytes{fstype="xfs"} * 100

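  • Inbound traffic on scheduler 1, in MiB/s: irate(node_network_receive_bytes_total{instance="192.168.1.2:9100", device="eth0"}[30s]) /1024 /1024 > 0

  • Outbound traffic on scheduler 1, in MiB/s: irate(node_network_transmit_bytes_total{instance="192.168.1.2:9100", device="eth0"}[30s]) /1024 /1024 > 0

  • For scheduler 2, use the same queries with its own IP and port in the instance label.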
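
Since the per-host traffic queries differ only in the instance label, they can be built from a template. The following is a rough sketch under stated assumptions: the helper names traffic_queries and instant_query are made up for illustration, 192.168.1.3:9100 is only a placeholder for the second scheduler, and the "> 0" filter is left out. instant_query calls the standard Prometheus HTTP API endpoint /api/v1/query and needs a reachable server, so the demo at the bottom only prints the expressions.

```python
import json
import urllib.parse
import urllib.request


def traffic_queries(instance, device="eth0", window="30s"):
    """Build inbound/outbound MiB/s expressions for one node_exporter instance."""
    selector = f'{{instance="{instance}", device="{device}"}}'
    return {
        "inbound": f"irate(node_network_receive_bytes_total{selector}[{window}]) / 1024 / 1024",
        "outbound": f"irate(node_network_transmit_bytes_total{selector}[{window}]) / 1024 / 1024",
    }


def instant_query(base_url, expr, timeout=5):
    """Run one instant query via the Prometheus HTTP API and return the result list."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


if __name__ == "__main__":
    # Placeholder address for scheduler 2; substitute the real one.
    for direction, expr in traffic_queries("192.168.1.3:9100").items():
        print(direction, "->", expr)
    # With a live server, for example:
    # print(instant_query("http://192.168.1.250:9090", expr))
```

The same expressions can be pasted into the Prometheus expression browser or into a Grafana panel.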

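Setting up dashboards with Grafana
  • Installation is simple enough that I won't cover it here. Default port: 3000.

  • After installation, the default username and password are both admin; change the password right away.

  • Configuration -> Data Sources -> Prometheus -> Name (project name) -> HTTP URL (the Prometheus address) -> Save & Test.

  • Create -> Dashboard -> Choose Visualization, then design the panels around your own metrics.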