📘 Day 22: The Prometheus Monitoring Stack

🎯 Today's Goals

  • Deploy kube-prometheus-stack with Helm
  • Understand the Prometheus data collection pipeline
  • Inspect core cluster metrics (CPU/Memory/Network)
  • Write custom alerting rules

🧠 Theory (30 minutes)

Prometheus Stack Architecture

┌──────────────┐    ┌───────────────────┐    ┌─────────────────┐
│  Prometheus  │←───│  ServiceMonitor   │    │     Grafana     │
│(scrape+store)│    │ (dynamic targets) │    │ (visualization) │
└──────┬───────┘    └───────────────────┘    └─────────────────┘
       │
       ├──→ Alertmanager (alert routing)
       │
       └──→ node_exporter (node metrics)
            kube-state-metrics (Kubernetes object metrics)
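
The Prometheus Operator connects these components through custom resources such as ServiceMonitor and PrometheusRule. As a rough sketch, assuming kubectl points at a cluster where the stack is already installed (the function name is made up for illustration), the CRDs in the monitoring.coreos.com API group can be listed like this:

```shell
# Hypothetical helper: list the CRDs in the Prometheus Operator API group.
list_monitoring_crds() {
  # "-o name" prints one resource per line, for example
  # customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com
  # The sed call strips everything up to the last slash.
  kubectl get crd -o name | grep 'monitoring\.coreos\.com' | sed 's|.*/||'
}

# Usage (needs a running cluster):
#   list_monitoring_crds
```

If servicemonitors.monitoring.coreos.com and prometheusrules.monitoring.coreos.com are missing from the output, the exercises later in this lesson cannot work yet.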

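Core Metrics Quick Reference

Metric                                  Meaning
container_cpu_usage_seconds_total       Cumulative CPU time consumed by a container (seconds)
container_memory_working_set_bytes      Container working-set memory (bytes)
kube_pod_status_phase                   Pod phase (Pending, Running, Succeeded, Failed, Unknown)
kube_deployment_status_replicas_ready   Number of ready replicas of a Deployment
node_filesystem_avail_bytes             Available filesystem space on a node (bytes)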
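
The raw metrics above are usually wrapped in PromQL expressions before they are useful. The sketch below sends a few such expressions to the Prometheus HTTP API with curl. The address in PROM_URL, the helper name, and the variable names are assumptions for illustration; node_filesystem_size_bytes is another node_exporter metric that is not in the table.

```shell
# Assumed address; matches a port-forward to the Prometheus service.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Hypothetical helper: run an instant PromQL query and print the raw JSON response.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example expressions built from the metrics in the table:
CPU_BY_NS='sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))'
MEM_BY_POD='sum by (namespace, pod) (container_memory_working_set_bytes)'
NOT_RUNNING='sum by (phase) (kube_pod_status_phase{phase!="Running"})'
DISK_FREE_RATIO='node_filesystem_avail_bytes / node_filesystem_size_bytes'

# Usage (needs a reachable Prometheus):
#   prom_query "$CPU_BY_NS"
```

Because container_cpu_usage_seconds_total is a counter, it is wrapped in rate() to get CPU cores in use; the memory and filesystem metrics are gauges and can be read directly.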

🔧 Hands-On Practice (120 minutes)

Exercise 22.1: Install kube-prometheus-stack

# 1. Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# 2. Install (expose Grafana, Prometheus, and Alertmanager via NodePort)
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.service.type=NodePort \
  --set grafana.service.nodePort=30300 \
  --set prometheus.service.type=NodePort \
  --set prometheus.service.nodePort=30900 \
  --set alertmanager.service.type=NodePort \
  --set alertmanager.service.nodePort=30903

# 3. Wait until all Pods are ready (Ctrl+C to stop watching)
kubectl get pods -n monitoring -w
# Expected Pods: prometheus-xxx, grafana-xxx, alertmanager-xxx, operator-xxx, node-exporter-xxx, kube-state-metrics-xxx

# 4. List the Services
kubectl get svc -n monitoring
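
A long chain of --set flags is easy to mistype. As an alternative sketch, the same six settings can be kept in a values file and passed with -f. The file name and helper name below are hypothetical; the keys are exactly the ones used in the install command above.

```shell
# Hypothetical file name; any path works with "helm install -f".
VALUES_FILE="${VALUES_FILE:-monitoring-values.yaml}"

# Write the NodePort settings from the --set flags into a values file.
write_values_file() {
  cat > "$VALUES_FILE" <<'EOF'
grafana:
  service:
    type: NodePort
    nodePort: 30300
prometheus:
  service:
    type: NodePort
    nodePort: 30900
alertmanager:
  service:
    type: NodePort
    nodePort: 30903
EOF
}

# Usage (needs Helm and a cluster):
#   write_values_file
#   helm install monitoring prometheus-community/kube-prometheus-stack \
#     --namespace monitoring --create-namespace -f "$VALUES_FILE"
```

A values file can also be kept in version control, which makes it easier to reproduce the same installation later.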

Exercise 22.2: Access Prometheus and Grafana

# 1. Get the Grafana admin password
kubectl get secret -n monitoring monitoring-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d
echo

# 2. Open Grafana (NodePort 30300)
echo "Grafana: http://<any-node-IP>:30300"
echo "User: admin"
echo "Password: <the password from step 1>"

# 3. Open Prometheus (NodePort 30900)
echo "Prometheus: http://<any-node-IP>:30900"

# 4. Try a few queries in the Prometheus UI:
# - up (health of every scrape target)
# - kube_node_info (node information)
# - container_memory_usage_bytes (container memory)
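
The base64 -d step in the password command is needed because Kubernetes stores every value under a Secret's data field base64-encoded. A minimal illustration, using a made-up helper name and a sample encoded string rather than a real Secret:

```shell
# Hypothetical helper: decode one base64-encoded Secret value.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# "cHJvbS1vcGVyYXRvcg==" is the base64 encoding of the sample string "prom-operator".
decode_secret_value 'cHJvbS1vcGVyYXRvcg=='
echo   # the decoded value has no trailing newline, so add one
```

Base64 is an encoding, not encryption: anyone who can read the Secret can recover the password.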

Exercise 22.3: ServiceMonitor Example

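# 1. Deploy an application that exposes metrics
kubectl create ns app-metrics

# Note: the stock nginx:alpine configuration does not enable stub_status,
# so /nginx_status has to be configured (for example through a mounted
# config file) before the exporter can read it.
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-app
  namespace: app-metrics
  labels:
    app: metrics-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: metrics-app
  template:
    metadata:
      labels:
        app: metrics-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9113"
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
        ports:
        - containerPort: 80
      - name: exporter
        image: nginx/nginx-prometheus-exporter:0.11
        args:
        - -nginx.scrape-uri=http://localhost/nginx_status
        ports:
        - containerPort: 9113
          name: metrics
EOF

# Expose the exporter. The Service port must be named "metrics" because the
# ServiceMonitor below refers to the port by name; a Service created with
# "kubectl expose --port=9113" would leave the port unnamed.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: metrics-app-svc
  namespace: app-metrics
  labels:
    app: metrics-app
spec:
  selector:
    app: metrics-app
  ports:
  - name: metrics
    port: 9113
    targetPort: metrics
EOF

# 2. Create the ServiceMonitor
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: metrics-app-monitor
  namespace: monitoring
  labels:
    release: monitoring   # with default chart values, Prometheus selects ServiceMonitors by this label
spec:
  selector:
    matchLabels:
      app: metrics-app
  namespaceSelector:
    matchNames:
    - app-metrics
  endpoints:
  - port: metrics
    interval: 30s
EOF

# 3. Verify that the new target shows up in Prometheus Targets
# (for a release named "monitoring" the Service is normally called
# monitoring-kube-prometheus-prometheus; confirm with: kubectl get svc -n monitoring)
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090 &
# Open http://localhost:9090/targets in a browser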
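
A ServiceMonitor finds its targets through two pieces of Service metadata: selector.matchLabels is compared with the Service's labels, and endpoints[].port refers to the name of a Service port (not its number). The sketch below prints both for a given Service; the helper name is made up for illustration, and it assumes the Service from this exercise exists.

```shell
# Hypothetical helper: print what a ServiceMonitor has to match on a Service.
show_service_match_info() {
  ns="$1"
  svc="$2"
  echo "labels:     $(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.metadata.labels}')"
  echo "port names: $(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.ports[*].name}')"
}

# Usage (needs a running cluster):
#   show_service_match_info app-metrics metrics-app-svc
```

An empty "port names" line means the Service port is unnamed, so an endpoint entry such as port: metrics has nothing to refer to and no target will appear.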

Exercise 22.4: Custom Alerting Rules

# Create a PrometheusRule
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    release: monitoring   # with default chart values, Prometheus selects rules by this label
spec:
  groups:
  - name: pod-alerts
    rules:
    # rate() is per second: 0.05/s is about 45 restarts in 15 minutes
    - alert: HighPodRestarts
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ \$labels.pod }} has high restart rate"
        description: "Pod {{ \$labels.pod }} in {{ \$labels.namespace }} is restarting at {{ \$value }} restarts per second (15-minute average)"

    - alert: PodCrashLooping
      expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ \$labels.pod }} is crash looping"
EOF

# Verify the rule
kubectl get PrometheusRule -n monitoring
kubectl describe PrometheusRule custom-alerts -n monitoring
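
PromQL's rate() returns a per-second average over the range window, so a threshold such as 0.05 in the HighPodRestarts expression is not a restart count. The small sketch below (the helper name is made up) converts a per-second threshold into the number of events it implies over a window:

```shell
# Hypothetical helper: convert a per-second rate threshold into an event count
# over a window given in minutes.
rate_to_count() {
  awk -v rate="$1" -v minutes="$2" 'BEGIN { printf "%.0f\n", rate * minutes * 60 }'
}

rate_to_count 0.05 15   # prints 45
```

So the rule fires only above roughly 45 restarts in 15 minutes. When the intent is a count, for example "more than 3 restarts in 15 minutes", increase(kube_pod_container_status_restarts_total[15m]) > 3 expresses it more directly.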

🐛 Troubleshooting Practice (30 minutes)

Scenario: Prometheus cannot scrape metrics

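# 1. Check that the ServiceMonitor exists
kubectl get servicemonitor -A

# 2. Check that Prometheus has loaded the configuration (run in a separate terminal;
#    confirm the Service name with: kubectl get svc -n monitoring)
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/config and inspect scrape_configs

# 3. Check the target status
# http://localhost:9090/targets

# 4. Check that the labels match
kubectl get servicemonitor <name> -n <namespace> -o yaml | grep -A10 selector
kubectl get svc <name> -n <namespace> -o yaml | grep -A5 labels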
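
One more cause is worth checking: Prometheus itself selects ServiceMonitors by label. With kube-prometheus-stack's default values this is typically a release label equal to the Helm release name, but treat that as an assumption to verify against your chart version. The sketch below (the helper name is made up) prints the selector next to the labels on every ServiceMonitor so the two can be compared:

```shell
# Hypothetical helper: compare the selector Prometheus uses with the labels
# on every ServiceMonitor.
show_servicemonitor_selection() {
  ns="${1:-monitoring}"
  echo "Prometheus serviceMonitorSelector:"
  kubectl get prometheus -n "$ns" -o jsonpath='{.items[*].spec.serviceMonitorSelector}'
  echo
  echo "ServiceMonitor labels:"
  kubectl get servicemonitor -A --show-labels
}

# Usage (needs a running cluster):
#   show_servicemonitor_selection monitoring
```

A ServiceMonitor whose labels do not satisfy the printed selector is ignored, and nothing for it appears under scrape_configs.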

🏆 Mock Competition Task (40 minutes)

⚠️ Strict 40-minute time limit

Task: Deploy and Configure the Monitoring Stack

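[Requirements]

1. Use Helm to deploy kube-prometheus-stack into the monitoring namespace
   - Grafana NodePort 30300
   - Prometheus NodePort 30900

2. Deploy a sample application:
   - Deployment demo-app (nginx:alpine, 2 replicas)
   - Expose port 80 and the metrics port

3. Configure a ServiceMonitor to scrape demo-app's metrics

4. Write a custom PrometheusRule:
   - Rule 1: more than 3 Pod restarts within 15 minutes
   - Rule 2: Deployment ready replicas below the desired count for more than 5 minutes

5. In Grafana:
   - Import the Node Exporter Full dashboard (ID: 1860)
   - Review cluster CPU, memory, and disk usage
   - Save a screenshot

6. Verify:
   - Prometheus Targets includes demo-app
   - The custom alerting rules are loaded and active

[Scoring]
- Prometheus Stack deployed successfully (25 points)
- ServiceMonitor configured correctly (20 points)
- PrometheusRule correct (20 points)
- Grafana dashboard usable (20 points)
- Overall verification (15 points)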