
A Practical Guide to the Prometheus Monitoring System

廖万里 · 8 hours ago

Prometheus Overview

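Prometheus is an open-source systems monitoring and alerting toolkit. It collects metrics with a pull model (the server scrapes HTTP endpoints exposed by its targets), stores them in a multidimensional data model where each time series is identified by a metric name plus labels, and offers a powerful query language, PromQL.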
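To make the pull model and the label-based data model concrete, here is a minimal standard-library sketch. It is not how the official client libraries are implemented; the sample values, the `render_metrics` helper, and the `scrape_once` helper are all hypothetical names introduced for illustration. The target only renders its current values as text, and the "server" side fetches them over HTTP.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sample data: one metric name, two series told apart by labels.
SAMPLES = {
    (("method", "GET"), ("status", "200")): 1234,
    (("method", "POST"), ("status", "500")): 7,
}

def render_metrics(name, samples):
    """Render samples in the Prometheus text exposition format."""
    lines = [f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    """The target side: it never pushes, it only answers GET /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics("http_requests_total", SAMPLES).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

def scrape_once():
    """The server side: start a target on a free port, pull /metrics once."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/metrics"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode()
    finally:
        server.shutdown()
        server.server_close()
```

Calling `print(scrape_once())` shows the text a scrape would receive. In a real deployment the scraping side is Prometheus itself, repeating the request every `scrape_interval`.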

Core Components

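  • Prometheus Server: the core service; scrapes targets and stores the time series data
  • Exporters: expose metrics endpoints for systems that do not export Prometheus metrics natively
  • Pushgateway: accepts pushed metrics from short-lived jobs
  • Alertmanager: handles alerts (deduplication, grouping, routing to receivers)
  • Grafana: visualization and dashboards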

Deployment and Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production'

# Alerting rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # Custom application
  - job_name: 'myapp'
    metrics_path: /metrics
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
  
  # Service discovery (Kubernetes)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
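The `keep` action in the last job drops every discovered pod whose annotation label does not match the regex. A rough sketch of that filtering logic follows, assuming a single source label and a hypothetical `relabel_keep` helper. Prometheus anchors relabeling regexes at both ends, which `re.fullmatch` imitates here; the pod addresses are made up.

```python
import re

def relabel_keep(targets, source_label, regex):
    """Keep only targets whose source label fully matches the regex.

    A missing label is treated as an empty string, so it fails to match "true".
    """
    pattern = re.compile(regex)
    return [
        labels for labels in targets
        if pattern.fullmatch(labels.get(source_label, ""))
    ]

ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"

# Hypothetical discovered pods, each represented by its label set.
pods = [
    {"__address__": "10.0.0.1:8080", ANNOTATION: "true"},
    {"__address__": "10.0.0.2:8080", ANNOTATION: "false"},
    {"__address__": "10.0.0.3:8080"},  # no annotation at all
]

kept = relabel_keep(pods, ANNOTATION, "true")
```

Only the first pod survives, so pods opt in to scraping by carrying the annotation `prometheus.io/scrape: "true"`.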

Metric Types

The Four Core Metric Types

# Counter - monotonically increasing value (resets to zero only on restart)
http_requests_total{method="GET", status="200"} 1234

# Gauge - value that can go up or down
memory_usage_bytes{host="server1"} 1048576
temperature_celsius{location="room1"} 23.5

# Histogram - cumulative buckets plus a sum and a count
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 250
http_request_duration_seconds_bucket{le="1"} 300
http_request_duration_seconds_bucket{le="+Inf"} 350
http_request_duration_seconds_sum 150.5
http_request_duration_seconds_count 350

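# Summary - quantiles precomputed on the client
# (the metric name is reused here for illustration only; a real metric is either a histogram or a summary)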
http_request_duration_seconds{quantile="0.5"} 0.3
http_request_duration_seconds{quantile="0.9"} 0.8
http_request_duration_seconds{quantile="0.99"} 1.2
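Unlike a summary, a histogram leaves the quantile calculation to query time. The sketch below approximates how a quantile can be estimated from cumulative buckets by linear interpolation, using the bucket values from the example above. The function is a simplified stand-in, not the PromQL implementation, and it assumes 0 < q <= 1 and a first bucket that starts at 0.

```python
import math

# Cumulative buckets from the example above: (upper bound "le", count).
BUCKETS = [(0.1, 100), (0.5, 250), (1.0, 300), (math.inf, 350)]

def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan

median = histogram_quantile(0.5, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
```

With these buckets the median is interpolated inside the 0.1 to 0.5 bucket, while the 95th percentile lands in the `+Inf` bucket and is capped at 1. This is why bucket boundaries should bracket the latencies you care about.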

Application Integration

Python Client

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

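# Using decorators: a metric declared with labels needs .labels(...) before
# .time() or .count_exceptions() can be used.
@REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time()
@REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='500').count_exceptions()
def process_request():
    time.sleep(random.uniform(0.1, 0.5))
    return {'status': 'ok'}

# Manual recording
def handle_request():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data', status='200').inc()
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').time():
        # Handle the request
        pass

# Start the metrics endpoint. Port 8080 matches the 'myapp' scrape targets;
# 9090 is the default port of the Prometheus server itself.
if __name__ == '__main__':
    start_http_server(8080)
    while True:  # keep the process alive; the HTTP server runs in a daemon thread
        handle_request()
        time.sleep(1)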

Go Client

package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "net/http"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )
    
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: []float64{.1, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "path"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

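func main() {
    http.Handle("/metrics", promhttp.Handler())
    // ListenAndServe only returns on failure (for example, the port is already in use).
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}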

PromQL Queries

Common Query Examples

# Instant vector selector
http_requests_total

# Filter by label
http_requests_total{method="GET"}
http_requests_total{method=~"GET|POST"}  # regex match

# Range vector selector (the last 5 minutes of samples)
http_requests_total[5m]

# Per-second rate
rate(http_requests_total[5m])

# Quantile (95th percentile from histogram buckets)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Aggregation
sum(rate(http_requests_total[5m])) by (method)
avg(node_memory_MemAvailable_bytes) by (instance)

# Arithmetic (memory usage as a percentage)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# Prediction (available memory 4 hours from now, based on the last hour)
predict_linear(node_memory_MemAvailable_bytes[1h], 4*3600)
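To give a feel for what `rate()` and `predict_linear()` compute, here is a simplified standard-library sketch. Both functions are approximations written for this article: the real `rate()` also extrapolates to the window boundaries, and the real `predict_linear()` predicts relative to the evaluation time rather than the last sample. The sample data is made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Samples are ordered oldest first. A drop in value is treated as a
    counter reset (for example, a process restart).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    duration = samples[-1][0] - samples[0][0]
    return increase / duration

def simple_predict_linear(samples, seconds_ahead):
    """Least-squares line through the samples, extended past the last one."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)

# A counter scraped every 15 s that restarts between the 2nd and 3rd sample.
counter = [(0, 100), (15, 130), (30, 10), (45, 40)]
# A gauge that loses 100 units per minute.
gauge = [(0, 1000), (60, 900), (120, 800)]

requests_per_second = simple_rate(counter)
value_in_two_minutes = simple_predict_linear(gauge, 120)
```

The reset handling is the reason `rate()` should be applied to counters only: on a gauge, every decrease would be misread as a restart.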

Alerting Rules

groups:
  - name: node_alerts
    rules:
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) 
              / node_memory_MemTotal_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for 5 minutes"
      
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
      
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
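The `for` clause in these rules means an alert does not fire the moment its expression becomes true. It first sits in a pending state and fires only if the condition holds across consecutive evaluations for the whole duration. A minimal sketch of that state machine follows, with a hypothetical `alert_states` helper and made-up evaluation results:

```python
def alert_states(condition_results, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (True = expr returned a result).
    interval_s: seconds between evaluations (evaluation_interval).
    for_s: the rule's `for` duration in seconds.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_results):
        now = i * interval_s
        if not is_true:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states

# Evaluations every 30 s against a rule with `for: 1m`.
states = alert_states([False, True, True, True, False, True], 30, 60)
```

Here the alert is pending for two evaluations, fires on the third, and drops back to inactive (then pending again) after a single false result. This is how `for` filters out short spikes.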

Alertmanager Configuration

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical-team'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
  
  - name: 'critical-team'
    email_configs:
      - to: 'oncall@example.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
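The routing tree above can be read as: try the child routes in order, and fall back to the parent's receiver if none match. Alerts that share the same `group_by` label values are batched into one notification. The sketch below models this with plain dictionaries. The `ROUTE` structure, the `match` key, and both helpers are illustrative assumptions; they support equality matching only and do not reflect Alertmanager's internal representation.

```python
# Hypothetical in-memory version of the route section above.
ROUTE = {
    "receiver": "default",
    "group_by": ["alertname", "severity"],
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-team"},
    ],
}

def pick_receiver(labels, route):
    """Walk the tree: the first matching child wins, else the parent's receiver."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return pick_receiver(labels, child)
    return route["receiver"]

def group_key(labels, route):
    """Alerts with the same key are sent together in one notification."""
    return tuple(labels.get(name, "") for name in route["group_by"])

down_a = {"alertname": "InstanceDown", "severity": "critical", "instance": "node1:9100"}
down_b = {"alertname": "InstanceDown", "severity": "critical", "instance": "node2:9100"}
slow = {"alertname": "HighRequestLatency", "severity": "warning"}
```

With the rules defined earlier, both `InstanceDown` alerts go to `critical-team` in a single grouped notification, while `HighRequestLatency` falls through to `default`.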

Best Practices

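  1. Naming: use snake_case and include the unit (for example, http_request_duration_seconds)
  2. Label design: avoid high-cardinality labels
  3. Alert severity levels: critical/warning/info
  4. Data retention: adjust the retention period to your needs (for example, with --storage.tsdb.retention.time)
  5. Monitor the monitoring: monitor Prometheus itself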
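The advice on high-cardinality labels follows from simple arithmetic: in the worst case, one metric produces a time series for every combination of label values. A rough estimate, with made-up label values and a hypothetical `estimated_series` helper:

```python
from math import prod

def estimated_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())

# Bounded labels: a handful of values each.
bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "400", "404", "500", "503"],
}

# The same metric with a per-user label added (10,000 hypothetical users).
unbounded = dict(bounded, user_id=[str(i) for i in range(10_000)])

series_bounded = estimated_series(bounded)
series_unbounded = estimated_series(unbounded)
```

Adding the single `user_id` label multiplies the series count by 10,000. Identifiers such as user IDs, request IDs, or raw URLs therefore usually belong in logs or traces rather than in labels.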

Prometheus is the de facto standard for cloud-native monitoring. Configured sensibly, it helps you detect and locate problems promptly.

[Figure: Prometheus monitoring architecture (Prometheus, Node Exporter, App Metrics, Alertmanager, Grafana)]

