Prometheus Monitoring System: A Practical Guide

廖万里 · 3 months ago (03-16) · Uncategorized

Prometheus Overview

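Prometheus is an open-source systems monitoring and alerting toolkit. It collects metrics with a pull model, scraping HTTP endpoints exposed by its targets. Metrics are stored in a multi-dimensional data model (a metric name plus key/value labels) and queried with PromQL, its own query language.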
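
To make the multi-dimensional data model concrete: a time series is identified by a metric name plus a set of key/value labels. The sketch below renders one sample in the text style used by the metric examples later in this article. The helper name format_sample is hypothetical, and label-value escaping is simplified to backslash, double quote, and newline.

```python
def format_sample(name, labels, value):
    """Render one sample as: name{key="value",...} value (illustrative helper)."""
    def escape(text):
        # Escape backslash, double quote, and newline inside label values.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if not labels:
        return f"{name} {value}"
    pairs = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


print(format_sample("http_requests_total", {"method": "GET", "status": "200"}, 1234))
# http_requests_total{method="GET",status="200"} 1234
```

Each distinct combination of label values is a separate time series, which is why label choice matters so much for storage and query cost.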

Core Components

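Prometheus Server: the core service, which scrapes and stores time-series data
Exporters: expose metrics endpoints for systems that do not provide Prometheus metrics themselves
Pushgateway: accepts pushed metrics from short-lived jobs
Alertmanager: deduplicates, groups, and routes alerts
Grafana: dashboards and visualization (a separate project that is commonly paired with Prometheus)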

Deployment and Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production'

# Alerting rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # Custom application
  - job_name: 'myapp'
    metrics_path: /metrics
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
  
  # Service discovery (Kubernetes)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
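
The keep rule in the kubernetes-pods job can be pictured as a filter over discovered targets. The sketch below assumes the documented relabeling behavior: the values of source_labels are joined with ";" and the regex must match the whole joined string. The pod addresses are made up, and Python's re module stands in for the RE2 engine that Prometheus uses.

```python
import re


def keep_target(labels, source_labels, regex, separator=";"):
    """Return True if a target survives a relabel rule with action 'keep'."""
    # Missing labels count as empty strings; values are joined by the separator.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # The regex is anchored, so it must match the whole joined value.
    return re.fullmatch(regex, joined) is not None


SCRAPE_ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"

# Hypothetical discovered pods.
pods = [
    {"__address__": "10.0.0.1:8080", SCRAPE_ANNOTATION: "true"},
    {"__address__": "10.0.0.2:8080", SCRAPE_ANNOTATION: "false"},
    {"__address__": "10.0.0.3:8080"},  # no annotation at all
]

kept = [pod["__address__"] for pod in pods
        if keep_target(pod, [SCRAPE_ANNOTATION], "true")]
print(kept)  # ['10.0.0.1:8080']
```

Only pods annotated with prometheus.io/scrape: "true" remain as scrape targets; everything else is dropped before any scrape happens.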

Metric Types

The four core metric types

# Counter - a value that only increases (or resets to zero on restart)
http_requests_total{method="GET", status="200"} 1234

# Gauge - a value that can go up or down
memory_usage_bytes{host="server1"} 1048576
temperature_celsius{location="room1"} 23.5

# Histogram - observations counted in cumulative buckets
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 250
http_request_duration_seconds_bucket{le="1"} 300
http_request_duration_seconds_bucket{le="+Inf"} 350
http_request_duration_seconds_sum 150.5
http_request_duration_seconds_count 350

# Summary - quantiles computed on the client side
http_request_duration_seconds{quantile="0.5"} 0.3
http_request_duration_seconds{quantile="0.9"} 0.8
http_request_duration_seconds{quantile="0.99"} 1.2
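
A histogram does not store quantiles; they are estimated from the cumulative bucket counts at query time. The sketch below is a simplified illustration of that idea using linear interpolation inside the bucket that contains the target rank, applied to the sample buckets above. It assumes 0 < q < 1, a non-empty histogram with at least one finite bucket, and a lower bound of 0 for the first bucket. The function name bucket_quantile is hypothetical, and this is not the Prometheus implementation itself.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: return the highest finite bound.
                return buckets[index - 1][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# The bucket samples from the histogram example above.
buckets = [(0.1, 100), (0.5, 250), (1.0, 300), (math.inf, 350)]
print(bucket_quantile(0.5, buckets))   # approximately 0.3
print(bucket_quantile(0.95, buckets))  # 1.0
```

The 0.95 case shows why bucket boundaries matter: 50 of the 350 observations are slower than 1 second, so the estimate is capped at the highest finite bound and says nothing about how slow they really were.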

Application Integration

Python client

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# Define the metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

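# Decorator usage: bind the label values first, because a labeled metric
# cannot be observed without them
@REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time()
def process_request():
    time.sleep(random.uniform(0.1, 0.5))
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()
    return {'status': 'ok'}

# Manual recording
def handle_request():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data', status='200').inc()
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').time():
        # Handle the request here
        pass

# Start the metrics endpoint on port 8080, the port the 'myapp' scrape job
# targets (9090 is the default port of the Prometheus server itself)
if __name__ == '__main__':
    start_http_server(8080)
    # start_http_server runs in a background thread, so keep the process alive
    while True:
        process_request()
        handle_request()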
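
The .time() helper used above can be understood as a context manager that measures elapsed time and passes it to observe(). The sketch below illustrates that mechanism with the standard library only. TinyHistogram is a hypothetical stand-in that keeps just a sum and a count; it is not how prometheus_client is implemented.

```python
import time
from contextlib import contextmanager


class TinyHistogram:
    """Stand-in for a histogram child: it only keeps a sum and a count."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1

    @contextmanager
    def time(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped block raises.
            self.observe(time.perf_counter() - start)


latency = TinyHistogram()
with latency.time():
    time.sleep(0.01)  # simulated work

print(latency.count, latency.sum)
```

The sum and count pair is what makes an average latency computable later, as sum divided by count.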

Go client

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter partitioned by HTTP method, path, and status code.
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    // Histogram of request durations in seconds, with explicit bucket bounds.
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: []float64{.1, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "path"},
    )
)

func init() {
    // Register both collectors with the default registry.
    prometheus.MustRegister(httpRequestsTotal, httpRequestDuration)
}

func main() {
    // Expose the default registry at /metrics.
    http.Handle("/metrics", promhttp.Handler())
    // ListenAndServe only returns on failure, so log the error and exit.
    log.Fatal(http.ListenAndServe(":8080", nil))
}
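
To show what the Buckets option means, the sketch below places a few observations into the same upper bounds as the Go example and prints the cumulative counts that a histogram would expose: each le bucket counts every observation less than or equal to its bound. It is written in Python with the standard library only; the function name cumulative_buckets and the observation values are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

BOUNDS = [0.1, 0.5, 1, 2.5, 5, 10]  # same upper bounds as the Go example


def cumulative_buckets(observations, bounds=BOUNDS):
    """Return (le, cumulative_count) pairs, ending with the +Inf bucket."""
    per_bucket = [0] * (len(bounds) + 1)  # the last slot is the +Inf bucket
    for value in observations:
        # bisect_left finds the first bound that is >= value (le means "<=").
        per_bucket[bisect_left(bounds, value)] += 1
    labels = [str(bound) for bound in bounds] + ["+Inf"]
    return list(zip(labels, accumulate(per_bucket)))


for le, count in cumulative_buckets([0.05, 0.3, 0.7, 12]):
    print(f'le="{le}" {count}')
```

The 12-second observation lands only in the +Inf bucket, so bounds should be chosen to cover the latencies the service is actually expected to produce.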

PromQL Queries

Common query examples

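# Instant vector selector
http_requests_total

# Filter by label
http_requests_total{method="GET"}
http_requests_total{method=~"GET|POST"}  # regular-expression match

# Range vector selector
http_requests_total[5m]

# Per-second rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# 95th-percentile latency estimated from histogram buckets
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Aggregation
sum(rate(http_requests_total[5m])) by (method)
avg(node_memory_MemAvailable_bytes) by (instance)

# Arithmetic: memory usage as a percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# Prediction: available memory 4 hours from now, based on the last hour
predict_linear(node_memory_MemAvailable_bytes[1h], 4*3600)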
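
The idea behind rate() can be sketched as follows: sum the increases between consecutive samples in the window, treat any drop as a counter reset, and divide by the elapsed time. This is a simplified illustration only. The real rate() also extrapolates to the window boundaries, which this sketch omits, and the sample values are made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return None  # at least two samples are needed
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # The counter went backwards, so treat it as a reset to zero.
            increase += current
        else:
            increase += current - previous
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes every 15 seconds, with a reset between t=15 and t=30.
samples = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(simple_rate(samples))  # (30 + 10 + 30) / 45 seconds
```

Because resets are compensated, a process restart does not show up as a large negative rate, which is the main reason counters are queried through rate() rather than read directly.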

Alerting Rules

groups:
  - name: node_alerts
    rules:
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
            / node_memory_MemTotal_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for 5 minutes"
      
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
      
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
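
The for clause means the expression must stay true for that long before the alert fires; until then the alert is pending, and a single false evaluation resets it. The sketch below walks through that state machine for the InstanceDown rule (for: 1m) at the 15s evaluation interval configured earlier. The function name alert_states and the evaluation sequence are hypothetical.

```python
def alert_states(condition_results, eval_interval=15, for_seconds=60):
    """Map a series of rule evaluation results to alert states."""
    states = []
    active_since = None  # time at which the condition first became true
    for step, is_true in enumerate(condition_results):
        now = step * eval_interval
        if not is_true:
            active_since = None
            states.append("inactive")
        else:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
    return states


# up == 0 holds for three evaluations, recovers, then holds for six evaluations.
results = [True] * 3 + [False] + [True] * 6
print(alert_states(results))
```

The first outage is shorter than one minute and never fires, which is the point of the for clause: brief blips do not page anyone.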

Alertmanager Configuration

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical-team'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
  
  - name: 'critical-team'
    email_configs:
      - to: 'oncall@example.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
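
The routing tree above can be read as: try the child routes in order, use the first one that matches, and otherwise fall back to the root receiver; alerts sharing the group_by label values are batched into one notification. The sketch below models only that much. It assumes equality matching and ignores nested routes, the continue option, and the timing settings; the function names and sample alerts are hypothetical.

```python
ROUTES = [({"severity": "critical"}, "critical-team")]  # child routes, in order
DEFAULT_RECEIVER = "default"
GROUP_BY = ("alertname", "severity")


def pick_receiver(labels):
    """Return the receiver of the first child route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(name) == value for name, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER  # nothing matched, so the root route handles it


def group_key(labels):
    """Alerts that share these label values are batched into one notification."""
    return tuple(labels.get(name, "") for name in GROUP_BY)


alerts = [
    {"alertname": "InstanceDown", "severity": "critical", "instance": "node1:9100"},
    {"alertname": "InstanceDown", "severity": "critical", "instance": "node2:9100"},
    {"alertname": "HighMemoryUsage", "severity": "warning", "instance": "node1:9100"},
]
for alert in alerts:
    print(pick_receiver(alert), group_key(alert))
```

The two InstanceDown alerts share a group key, so they would arrive as a single notification to critical-team rather than as two separate ones.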

Best Practices

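Naming: use snake_case and include the unit in the metric name (for example, http_request_duration_seconds)
Label design: avoid high-cardinality labels such as user IDs or request IDs
Alert severity: grade alerts as critical, warning, or info
Data retention: adjust the retention period to your needs
Monitor the monitor: monitor Prometheus itself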
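
The advice on high-cardinality labels comes down to arithmetic: in the worst case, the number of time series for one metric is the product of the number of distinct values of each label. A rough sketch, with made-up cardinalities and a hypothetical helper name:

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for http_requests_total.
bounded = {"method": 5, "endpoint": 20, "status": 10}
unbounded = dict(bounded, user_id=100_000)

print(max_series(bounded))    # 1000
print(max_series(unbounded))  # 100000000
```

A single unbounded label turns a thousand series into a hundred million, which is why identifiers such as user IDs belong in logs or traces rather than in metric labels.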

Prometheus is a standard choice for cloud-native monitoring. Configured sensibly, it helps you detect and pinpoint problems promptly.

Article link: https://www.kkkliao.cn/?id=754 (reproduction requires authorization)

Tags: Prometheus, monitoring and alerting, time-series database, DevOps

© 2022-2026 Technical support provided by 天桥区万策云网络工作室, 东莞市东城万策智联网络工作室, and 济南高新区万策网络工作室
鲁公网安备 37010502001945号
鲁ICP备2026009861号-1

Powered By Z-BlogPHP. Theme by TOYEAN.
