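A Practical Guide to the Prometheus Monitoring System
Prometheus Overview
Prometheus is an open-source systems monitoring and alerting toolkit. It collects metrics by pulling (scraping) them from targets over HTTP, stores them in a multi-dimensional data model in which each time series is identified by a metric name and a set of labels, and makes them queryable through the PromQL query language.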
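Core Components
- Prometheus Server: the core service; scrapes targets and stores the samples as time series
- Exporters: expose metrics from third-party systems on an HTTP endpoint
- Pushgateway: accepts pushed metrics from short-lived jobs so that Prometheus can scrape them
- Alertmanager: deduplicates, groups, and routes alerts to receivers
- Grafana: visualization and dashboards (a separate project, commonly paired with Prometheus)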
Deployment and Configuration
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production'
# Alerting rule files
rule_files:
  - /etc/prometheus/rules/*.yml
# Alertmanager endpoints
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
    relabel_configs:
      # Copy the scrape address into the instance label
      # (this is also what Prometheus does by default)
      - source_labels: [__address__]
        target_label: instance
  # Custom application
  - job_name: 'myapp'
    metrics_path: /metrics
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
  # Service discovery (Kubernetes)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
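The `keep` action in the last job can be hard to picture from the YAML alone. The following is a minimal Python sketch of its documented behaviour, not Prometheus's own code: a target survives only if the value of its source label matches the regex in full, and a missing label is treated as an empty string. The function name `keep_targets` and the pod addresses are made up for illustration, and Python's `re` module stands in for the RE2 syntax that Prometheus actually uses.

```python
import re


def keep_targets(targets, source_label, regex):
    """Return only the targets whose source label value fully matches regex.

    Relabeling regexes are anchored at both ends, so fullmatch() is used.
    A target without the label is treated as having an empty value.
    """
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"

# Hypothetical discovered pods, shaped like Kubernetes SD metadata labels.
pods = [
    {"__address__": "10.0.0.1:8080", SCRAPE_LABEL: "true"},
    {"__address__": "10.0.0.2:8080", SCRAPE_LABEL: "false"},
    {"__address__": "10.0.0.3:8080"},  # no annotation at all
]

kept = keep_targets(pods, SCRAPE_LABEL, "true")
print([t["__address__"] for t in kept])  # ['10.0.0.1:8080']
```

Only the first pod is scraped: the second has the annotation set to "false", and the third has no annotation, so its empty value does not match.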
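Metric Types
The Four Core Metric Types
# Counter: a cumulative value that only increases (or resets to zero on restart)
http_requests_total{method="GET", status="200"} 1234
# Gauge: a value that can go up or down
memory_usage_bytes{host="server1"} 1048576
temperature_celsius{location="room1"} 23.5
# Histogram: observations counted in cumulative buckets, plus a sum and a count
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 250
http_request_duration_seconds_bucket{le="1"} 300
http_request_duration_seconds_bucket{le="+Inf"} 350
http_request_duration_seconds_sum 150.5
http_request_duration_seconds_count 350
# Summary: quantiles computed on the client side (also exposes _sum and _count)
# Note: a metric name has exactly one type, so a real application would not
# expose the same name as both a histogram and a summary.
http_request_duration_seconds{quantile="0.5"} 0.3
http_request_duration_seconds{quantile="0.9"} 0.8
http_request_duration_seconds{quantile="0.99"} 1.2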
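Unlike a summary, a histogram does not store quantiles; they are estimated from the cumulative bucket counts at query time. The sketch below shows the idea behind that estimate using the bucket values from the histogram example above. It is a simplified illustration that assumes observations are spread evenly within a bucket and that the lowest bucket starts at 0; the function name `estimate_quantile` is made up for this example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound, end with +Inf, and contain at
    least one finite bucket.
    """
    total = buckets[-1][1]
    if not 0 < q <= 1 or total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The rank falls into the +Inf bucket: the best available
                # answer is the highest finite upper bound.
                return buckets[i - 1][0]
            # Linear interpolation inside the bucket.
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return float("nan")


# Bucket values taken from the histogram example above.
buckets = [(0.1, 100), (0.5, 250), (1.0, 300), (math.inf, 350)]

print(round(estimate_quantile(0.5, buckets), 3))  # 0.3
print(estimate_quantile(0.95, buckets))           # 1.0
```

The 0.95 result shows why bucket boundaries matter: 50 of the 350 observations were slower than 1 second, so the estimate is capped at the highest finite bound and says nothing more precise than "at least 1 second".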
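Application Integration
Python Client
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# Define the metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)
ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

# Using decorators. A metric with labels must be bound with .labels()
# before .time() or .count_exceptions() can be called.
@REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time()
# count_exceptions() increments only when the call raises, so it is
# recorded under status 500 rather than 200.
@REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='500').count_exceptions()
def process_request():
    time.sleep(random.uniform(0.1, 0.5))
    return {'status': 'ok'}

# Recording manually
def handle_request():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data', status='200').inc()
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').time():
        # Handle the request
        pass

if __name__ == '__main__':
    # Serve /metrics on port 8080, the port scraped by the 'myapp' job above
    # (9090 is the Prometheus server's own default port).
    start_http_server(8080)
    # start_http_server() runs in a daemon thread, so keep the main thread alive.
    while True:
        process_request()
        handle_request()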
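To make the role of `.labels()` more concrete, here is a toy, standard-library-only counter that keeps one value per label combination and renders them in the text exposition format. It is a sketch of the concept only, not how `prometheus_client` is implemented; the class name `LabeledCounter` is hypothetical, and label-value escaping is left out for brevity.

```python
import threading


class LabeledCounter:
    """A toy counter that keeps one value per label combination."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, amount=1.0, **labels):
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[n] for n in self.label_names)
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        """Render all samples as lines in the text exposition format."""
        lines = [f"# TYPE {self.name} counter"]
        with self._lock:
            for key, value in sorted(self._values.items()):
                pairs = ",".join(
                    f'{n}="{v}"' for n, v in zip(self.label_names, key))
                lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines)


requests = LabeledCounter("http_requests_total", ["method", "status"])
requests.inc(method="GET", status="200")
requests.inc(method="GET", status="200")
requests.inc(method="POST", status="500")
print(requests.render())
```

Each distinct combination of label values becomes its own line, and therefore its own time series once Prometheus scrapes it.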
Go Client
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter partitioned by method, path, and status code
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "path", "status"},
	)
	// Histogram of request durations with explicit bucket boundaries (seconds)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: []float64{.1, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "path"},
	)
)

func init() {
	// Register both metrics with the default registry
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

func main() {
	// Expose the default registry on /metrics
	http.Handle("/metrics", promhttp.Handler())
	// ListenAndServe only returns on failure, so log the error and exit
	log.Fatal(http.ListenAndServe(":8080", nil))
}
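PromQL Queries
Common Query Examples
# Instant vector selector
http_requests_total
# Filter by label
http_requests_total{method="GET"}
http_requests_total{method=~"GET|POST"}  # regular-expression match
# Range vector selector (the last 5 minutes of samples)
http_requests_total[5m]
# Per-second rate of increase
rate(http_requests_total[5m])
# Quantile (95th percentile estimated from histogram buckets)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Aggregation
sum(rate(http_requests_total[5m])) by (method)
avg(node_memory_MemAvailable_bytes) by (instance)
# Arithmetic (memory usage as a percentage)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100
# Prediction (available memory 4 hours from now, extrapolated from the last hour)
predict_linear(node_memory_MemAvailable_bytes[1h], 4*3600)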
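Because `rate()` appears in most of the queries above, it helps to see roughly what it computes. The sketch below is a simplified illustration: it sums the increases between consecutive samples, treats any drop in value as a counter reset, and divides by the elapsed time. The real `rate()` additionally extrapolates to the edges of the range window, which is omitted here. The function name `simple_rate` and the sample values are made up for this example.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Samples must be ordered by strictly increasing timestamp. A drop in
    value is treated as a counter reset (for example, a process restart),
    so the value seen after the reset counts as new increase.
    """
    if len(samples) < 2:
        return None  # not enough samples to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical samples scraped every 15 seconds; the counter resets
# between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(round(simple_rate(samples), 3))  # 1.667
```

The reset handling is the reason `rate()` should be applied to counters only: on a gauge, every legitimate decrease would be misread as a restart.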
Alerting Rules
groups:
  - name: node_alerts
    rules:
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for 5 minutes"
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
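The `for` field is what separates a brief spike from a real alert: a rule whose expression starts returning results is first pending, and it fires only once the expression has kept returning results for the full duration. The sketch below models that state machine for a single alert, assuming evaluations happen at a fixed interval. The function name `alert_states` and the sequence of outcomes are illustrative only.

```python
def alert_states(condition_by_eval, eval_interval, for_duration):
    """Return the alert state after each rule evaluation.

    condition_by_eval: one boolean per evaluation, True when the rule
        expression returned a result for this alert.
    eval_interval, for_duration: durations in seconds.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval
        if not is_true:
            # Any evaluation without a result resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_duration else "pending")
    return states


# InstanceDown uses `for: 1m`; assume rules are evaluated every 15 seconds.
outcomes = [False, True, True, True, True, True, False]
print(alert_states(outcomes, eval_interval=15, for_duration=60))
# ['inactive', 'pending', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

With a 15-second evaluation interval, the instance has to be seen as down on five consecutive evaluations before the alert fires, and a single successful scrape in between starts the count over.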
Alertmanager Configuration
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'critical-team'
receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
  - name: 'critical-team'
    email_configs:
      - to: 'oncall@example.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
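The routing tree above can be read as: try the child routes in order, use the first one whose matchers all match, and otherwise fall back to the parent's receiver. Alerts that end up with the same `group_by` label values are then batched into one notification. The following sketch models only those two ideas with plain dictionaries. It assumes every child route names its own receiver and ignores options such as `continue` and the timing settings; the function names are hypothetical.

```python
# A dictionary mirror of the `route` section above.
ROUTE = {
    "receiver": "default",
    "group_by": ["alertname", "severity"],
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-team"},
    ],
}


def pick_receiver(route, labels):
    """Walk the tree: the first matching child wins, else the parent receiver."""
    for child in route.get("routes", []):
        match = child.get("match", {})
        if all(labels.get(k) == v for k, v in match.items()):
            return pick_receiver(child, labels)
    return route["receiver"]


def group_key(route, labels):
    """Alerts sharing this key are sent together in one notification."""
    return tuple(labels.get(name, "") for name in route["group_by"])


down = {"alertname": "InstanceDown", "severity": "critical", "instance": "node1:9100"}
memory = {"alertname": "HighMemoryUsage", "severity": "warning", "instance": "node2:9100"}

print(pick_receiver(ROUTE, down))    # critical-team
print(pick_receiver(ROUTE, memory))  # default
print(group_key(ROUTE, memory))      # ('HighMemoryUsage', 'warning')
```

Since `instance` is not part of `group_by`, two nodes that go down at the same time fall under the same group key and produce a single notification rather than two.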
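Best Practices
- Naming: use snake_case and include the unit in the metric name (for example, http_request_duration_seconds)
- Label design: avoid high-cardinality labels such as user IDs, because every distinct combination of label values creates a separate time series
- Alert severity: classify alerts as critical, warning, or info
- Data retention: adjust the retention period to your needs
- Monitor the monitor: monitor Prometheus itself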
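The advice on label design is easier to appreciate with numbers. In the worst case, the number of time series for one metric is the product of the number of distinct values of each label. The figures below are invented purely for illustration.

```python
from math import prod


def series_upper_bound(label_value_counts):
    """Worst-case series count for one metric: product of values per label."""
    return prod(label_value_counts.values())


# Hypothetical label cardinalities for http_requests_total.
bounded = {"method": 5, "endpoint": 20, "status": 10}
unbounded = {**bounded, "user_id": 100_000}

print(series_upper_bound(bounded))    # 1000
print(series_upper_bound(unbounded))  # 100000000
```

Adding one unbounded label such as a user ID multiplies the worst case by the number of users, which is why such values are better kept in logs or traces than in metric labels.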
Prometheus is the de facto standard for cloud-native monitoring; configured well, it helps you detect and locate problems promptly.
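Permalink: https://www.kkkliao.cn/?id=754 (reproduction requires authorization)
Copyright notice: this article was published by 廖万里's blog; please credit the source when reposting.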



