How to Monitor Large-Scale Kubernetes Clusters with Native Prometheus


[Editor's note] Prometheus's capabilities are beyond doubt, but after running it for a while you hit plenty of performance problems: memory usage, large-scale scraping, large-scale storage, and so on. How do you scrape large volumes of basic monitoring data from Kubernetes clusters with plain, upstream Prometheus? This article gives our answer.

Architecture

1.png

The figure above shows the architecture of our current monitoring platform. As it illustrates, the platform combines several mature open-source components to cover data collection, metrics, and visualization for our clusters.

We currently monitor several Kubernetes clusters serving different functions and business lines, covering business metrics, infrastructure metrics, and alerting.

Monitoring the Kubernetes clusters

We use one of the two common monitoring architectures:
  1. Prometheus Operator
  2. Standalone Prometheus configuration (the one we chose)


Tip: Prometheus Operator is certainly easy to deploy, and its simple ServiceMonitors save a lot of effort. For our mix of private clusters, however, the maintenance cost is a bit high. We chose the second option mainly to skip maintaining ServiceMonitor objects and to lean more directly on service discovery and service registration.

Data scraping

We made some adjustments to scraping in order to cope with the heavy load that large numbers of nodes and metrics put on the apiserver, and with Prometheus OOMs caused by large-scale scraping.
  • Use Kubernetes only for service discovery and have Prometheus scrape the targets directly instead of going through the apiserver proxy, which reduces the load on the apiserver
  • Use hashmod-based sharding so scraping is distributed across multiple Prometheus replicas, easing memory pressure (a sketch follows below)
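To build intuition for how the hashmod split works, here is a rough Python sketch (illustrative only, not Prometheus's exact implementation; the node addresses and the modulus of 10 are placeholders): each replica hashes every discovered target and keeps only those whose hash modulo the shard count equals its own shard number (the ID_NUM placeholder used in the configuration later in this article).

import hashlib

def shard_of(target: str, modulus: int = 10) -> int:
    # Hash the source label value and take the remainder,
    # mimicking a relabel_configs entry with action: hashmod.
    digest = hashlib.md5(target.encode()).digest()
    return int.from_bytes(digest[-8:], "big") % modulus

for node in ["bja-athena-etcd-001:9100", "bja-athena-etcd-002:9100"]:
    print(node, "-> kept by Prometheus shard", shard_of(node))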


RBAC changes:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics          # added so the kubelet /metrics path can be scraped directly
  - nodes/metrics/cadvisor # added so the kubelet /metrics/cadvisor path can be scraped directly
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring

The scrape permissions for the node /metrics and /metrics/cadvisor paths have to be added.
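Once the updated ClusterRole and binding are applied, a quick sanity check of the new permission (using the service account defined above) can look like this:

kubectl auth can-i get nodes/metrics \
  --as=system:serviceaccount:monitoring:prometheus

It should print "yes"; if it prints "no", the ClusterRoleBinding has not taken effect yet.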

The complete scrape configuration below illustrates:
  • writing Thanos data to Alibaba Cloud OSS as the object store
  • scraping node_exporter; in production everything outside Kubernetes uses Consul for registration and discovery
  • custom business metrics discovered and scraped via Kubernetes service discovery


Host naming convention

datacenter-businessline-role-sequence (for example: bja-athena-etcd-001)

Sample Consul auto-registration script

#!/bin/bash

#ip=$(ip addr show eth0|grep inet | awk '{ print $2; }' | sed 's/\/.*$//')
# Pick the first private IP of the host (adjust the CIDR prefixes to your own networks)
ip=$(ip addr | egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | egrep "^192\.168|^172\.21|^10\.101|^10\.100" | egrep -v "\.255$" | awk -F. '{print $1"."$2"."$3"."$4}' | head -n 1)
# The hostname follows the naming convention above: idc-app-group-sequence
ahost=$HOSTNAME
idc=$(echo $ahost|awk -F "-" '{print $1}')
app=$(echo $ahost|awk -F "-" '{print $2}')
group=$(echo $ahost|awk -F "-" '{print $3}')

if [ "$app" != "test" ]
then
echo "success"
curl -X PUT -d "{\"ID\": \"${ahost}_${ip}_node\", \"Name\": \"node_exporter\", \"Address\": \"${ip}\", \"tags\": [\"idc=${idc}\",\"group=${group}\",\"app=${app}\",\"server=${ahost}\"], \"Port\": 9100,\"checks\": [{\"tcp\":\"${ip}:9100\",\"interval\": \"60s\"}]}" http://consul_server:8500/v1/agent/service/register
fi
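When a host is decommissioned, the matching Consul agent API call removes the service again so Prometheus stops discovering it (the service ID must match the one built in the registration script above):

curl -X PUT "http://consul_server:8500/v1/agent/service/deregister/${ahost}_${ip}_node"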

Full configuration example

apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
bucket.yaml: |
type: S3
config:
  bucket: "gcl-download"
  endpoint: "gcl-download.oss-cn-beijing.aliyuncs.com"
  access_key: "xxxxxxxxxxxxxx"
  insecure: false
  signature_version2: false
  secret_key: "xxxxxxxxxxxxxxxxxx"
  http_config:
    idle_conn_timeout: 0s

prometheus.yml: |
global:
  scrape_interval:     15s
  evaluation_interval: 15s

  external_labels:
     monitor: 'k8s-sh-prod'
     service: 'k8s-all'
     ID: 'ID_NUM'

remote_write:
  - url: "http://vmstorage:8400/insert/0/prometheus/"
remote_read:
  - url: "http://vmstorage:8401/select/0/prometheus"

scrape_configs:
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https 

- job_name: 'kubernetes-cadvisor'
  kubernetes_sd_configs:
  - role: node
  scheme: https
  tls_config:
    #ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  #bearer_token: monitoring
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_address_InternalIP]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10250
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /metrics/cadvisor
  - source_labels: [__meta_kubernetes_node_name]
    modulus:       10
    target_label:  __tmp_hash
    action:        hashmod
  - source_labels: [__tmp_hash]
    regex:         ID_NUM
    action:        keep
  metric_relabel_configs:
  - source_labels: [container]
    regex: (.+)
    target_label: container_name
    replacement: $1
    action: replace
  - source_labels: [pod]
    regex: (.+)
    target_label: pod_name
    replacement: $1
    action: replace

- job_name: 'kubernetes-nodes'
  kubernetes_sd_configs:
  - role: node
  scheme: https
  tls_config:
    #ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  #bearer_token: monitoring
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_address_InternalIP]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10250
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /metrics
  - source_labels: [__meta_kubernetes_node_name]
    modulus:       10
    target_label:  __tmp_hash
    action:        hashmod
  - source_labels: [__tmp_hash]
    regex:         ID_NUM
    action:        keep
  metric_relabel_configs:
  - source_labels: [container]
    regex: (.+)
    target_label: container_name
    replacement: $1
    action: replace
  - source_labels: [pod]
    regex: (.+)
    target_label: pod_name
    replacement: $1
    action: replace

- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name

- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - default
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name

- job_name: 'ingress-nginx-endpoints'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - nginx-ingress
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2

- job_name: 'node_exporter'
  consul_sd_configs:
  - server: 'consul_server:8500'
  relabel_configs:
  - source_labels: [__address__]
    modulus:       10
    target_label:  __tmp_hash
    action:        hashmod
  - source_labels: [__tmp_hash]
    regex:         ID_NUM
    action:        keep
  - source_labels: [__tmp_hash]
    regex:       '(.*)'
    replacement: '${1}'
    target_label: hash_num
  - source_labels: [__meta_consul_tags]
    regex: .*test.*
    action: drop
  - source_labels: [__meta_consul_tags]
    regex: ',(?:[^,]+,){0}([^=]+)=([^,]+),.*'
    replacement: '${2}'
    target_label: '${1}'
  - source_labels: [__meta_consul_tags]
    regex: ',(?:[^,]+,){1}([^=]+)=([^,]+),.*'
    replacement: '${2}'
    target_label: '${1}'
  - source_labels: [__meta_consul_tags]
    regex: ',(?:[^,]+,){2}([^=]+)=([^,]+),.*'
    replacement: '${2}'
    target_label: '${1}'
  - source_labels: [__meta_consul_tags]
    regex: ',(?:[^,]+,){3}([^=]+)=([^,]+),.*'
    replacement: '${2}'
    target_label: '${1}'
  - source_labels: [__meta_consul_tags]
    regex: ',(?:[^,]+,){4}([^=]+)=([^,]+),.*'
    replacement: '${2}'
    target_label: '${1}'
  - source_labels: [__meta_consul_tags]
    regex: ',(?:[^,]+,){5}([^=]+)=([^,]+),.*'
    replacement: '${2}'
    target_label: '${1}'
  - source_labels: [__meta_consul_tags]
    regex: ',(?:[^,]+,){6}([^=]+)=([^,]+),.*'
    replacement: '${2}'
    target_label: '${1}'
  - source_labels: [__meta_consul_tags]
    regex: ',(?:[^,]+,){7}([^=]+)=([^,]+),.*'
    replacement: '${2}'
    target_label: '${1}'

- job_name: 'custom-business-metrics'
  proxy_url: http://127.0.0.1:8888   # depends on the service
  scrape_interval: 5s
  metrics_path: '/'  # path exposed by the service
  params:   ## optional, depends on the service
    method: ['get']
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_name_label]
    action: keep
    regex: monitor  # custom label defined by the service
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_pod_name]
    action: keep
    regex: (.*)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace

Custom business scrape annotations (can be integrated into CI/CD)

template:
  metadata:
    annotations:
      prometheus.io/port: "port"      # the port the service exposes metrics on
      prometheus.io/scrape: "true"
      prometheus.name/label: monitor  # custom label
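Once a workload carries these annotations it should show up as a target on the shard responsible for it. One quick way to check (assuming access to a Prometheus shard on port 9090 and jq installed):

curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {scrapeUrl, health}'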

Hashmod configuration

1. Extend the official image so each replica gets its own hashmod shard ID

Dockerfile:
FROM prom/prometheus:v2.20.0
MAINTAINER name gecailong

COPY ./entrypoint.sh /bin

ENTRYPOINT ["/bin/entrypoint.sh"] 

entrypoint.sh:
#!/bin/sh

# Derive the shard ID from the StatefulSet pod ordinal (e.g. prometheus-sts-3 -> 3)
ID=${POD_NAME##*-}

cp /etc/prometheus/prometheus.yml /prometheus/prometheus-hash.yml

# Substitute the ID_NUM placeholder so this replica keeps only its own hashmod shard
sed -i "s/ID_NUM/$ID/g" /prometheus/prometheus-hash.yml

/bin/prometheus --config.file=/prometheus/prometheus-hash.yml --query.max-concurrency=20 --storage.tsdb.path=/prometheus --storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h  --storage.tsdb.retention=2h --web.listen-address=:9090 --web.enable-lifecycle --web.enable-admin-api

ID_NUM: this placeholder is what the configuration below keys on; entrypoint.sh replaces it with the pod's ordinal at startup.
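For example, with the StatefulSet shown below the pods are named prometheus-sts-0 through prometheus-sts-9, so the parameter expansion in entrypoint.sh yields the shard ID:

POD_NAME=prometheus-sts-3
echo ${POD_NAME##*-}    # prints 3, which replaces ID_NUM in prometheus-hash.yml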

2. Deploying Prometheus
Prometheus configuration:
prometheus.yml: |
  external_labels:
     monitor: 'k8s-sh-prod'
     service: 'k8s-all'
     ID: 'ID_NUM'
     ...

This ID label lets us tell the shards apart at query time, and it is also the value each shard keeps in its hashmod relabel rules.
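With this label in place, a query such as the following (run against Thanos query or the remote storage) shows how scrape targets are spread across the shards:

count by (ID, monitor) (up)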

Deployment manifest:
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app: prometheus
name: prometheus-sts
namespace: monitoring
spec:
serviceName: "prometheus"
replicas: 10 # total number of hashmod shards
selector:
matchLabels:
  app: prometheus
template:
metadata:
  labels:
    app: prometheus
spec:
  containers:
  - image: gecailong/prometheus-hash:0.0.1
    name: prometheus
    securityContext:
       runAsUser: 0
    command:
    - "/bin/entrypoint.sh"
    env:
    - name: POD_NAME  # the StatefulSet pod name is injected so the shard ID can be derived from its ordinal
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    ports:
    - name: http
      containerPort: 9090
      protocol: TCP
    volumeMounts:
    - mountPath: "/etc/prometheus"
      name: config-volume
    - mountPath: "/prometheus"
      name: data
    resources:
      requests:
        cpu: 500m
        memory: 1000Mi
      limits:
        memory: 2000Mi
  - image: gecailong/prometheus-thanos:v0.17.1
    name: sidecar
    imagePullPolicy: IfNotPresent
    args:
    - "sidecar"
    - "--grpc-address=0.0.0.0:10901"
    - "--grpc-grace-period=1s"
    - "--http-address=0.0.0.0:10902"
    - "--http-grace-period=1s"
    - "--prometheus.url=http://127.0.0.1:9090"
    - "--tsdb.path=/prometheus"
    - "--log.level=info"
    - "--objstore.config-file=/etc/prometheus/bucket.yaml"
    ports:
    - name: http-sidecar
      containerPort: 10902
    - name: grpc-sidecar
      containerPort: 10901
    volumeMounts:
    - mountPath: "/etc/prometheus"
      name: config-volume
    - mountPath: "/prometheus"
      name: data
  serviceAccountName: prometheus
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  imagePullSecrets: 
    - name: regsecret
  volumes:
  - name: config-volume
    configMap:
      name: prometheus-config
  - name: data
    hostPath:
      path: /data/prometheus

Data aggregation

We have been using Thanos since 2018. Early versions had plenty of bugs that caused us quite a bit of trouble, and we filed a number of issues along the way. As it stabilized we stayed on v0.2.1 in production for a long time; the latest releases have dropped the old gRPC cluster-based service discovery and ship a much richer UI, so we took the opportunity to rebuild our monitoring platform around it.

We use Thanos to aggregate query results; VictoriaMetrics, the storage component discussed later, can also aggregate data. Of Thanos we mainly use the query, sidecar, and rule components; we do not use compact, store, bucket, and the rest because our workload does not need them.

Our Thanos + Prometheus architecture was shown at the beginning of the article, so below we only give the deployments and the points to watch out for.

Deploying the Thanos components:

sidecar (we run it in the same Pod as Prometheus):
- image: gecailong/prometheus-thanos:v0.17.1
    name: thanos
    imagePullPolicy: IfNotPresent
    args:
    - "sidecar"
    - "--grpc-address=0.0.0.0:10901"
    - "--grpc-grace-period=1s"
    - "--http-address=0.0.0.0:10902"
    - "--http-grace-period=1s"
    - "--prometheus.url=http://127.0.0.1:9090"
    - "--tsdb.path=/prometheus"
    - "--log.level=info"
    - "--objstore.config-file=/etc/prometheus/bucket.yaml"
    ports:
    - name: http-sidecar
      containerPort: 10902
    - name: grpc-sidecar
      containerPort: 10901
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    volumeMounts:
    - mountPath: "/etc/prometheus"
      name: config-volume
    - mountPath: "/prometheus"
      name: data

query deployment:
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app: query
name: thanos-query
namespace: monitoring
spec:
replicas: 3
serviceName: "sidecar-query"
selector:
matchLabels:
  app: query
template:
metadata:
  labels:
    app: query
spec:
  containers:
  - image: gecailong/prometheus-thanos:v0.17.1
    name: query
    imagePullPolicy: IfNotPresent
    args:
    - "query"
    - "--http-address=0.0.0.0:19090"
    - "--grpc-address=0.0.0.0:10903"
    - "--store=dnssrv+_grpc._tcp.prometheus-sidecar-svc.monitoring.svc.cluster.local"
    - "--store=dnssrv+_grpc._tcp.sidecar-query.monitoring.svc.cluster.local"
    - "--store=dnssrv+_grpc._tcp.sidecar-rule.monitoring.svc.cluster.local"
    ports:
    - name: http-query
      containerPort: 19090
    - name: grpc-query
      containerPort: 10903
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet

rule deployment:
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app: rule
name: thanos-rule
namespace: monitoring
spec:
replicas: 2
serviceName: "sidecar-rule"
selector:
matchLabels:
  app: rule
template:
metadata:
  labels:
    app: rule
spec:
  containers:
  - image: gecailong/prometheus-thanos:v0.17.1
    name: rule
    imagePullPolicy: IfNotPresent
    args:
    - "rule"
    - "--http-address=0.0.0.0:10902"
    - "--grpc-address=0.0.0.0:10901"
    - "--data-dir=/data"
    - "--rule-file=/prometheus-rules/*.yaml"
    - "--alert.query-url=http://sidecar-query:19090"
    - "--alertmanagers.url=http://alertmanager:9093"
    - "--query=http://sidecar-query:19090"
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    volumeMounts:
    - mountPath: "/prometheus-rules"
      name: config-volume
    - mountPath: "/data"
      name: data
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        memory: 1500Mi
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  volumes:
  - name: config-volume
    configMap:
      name: prometheus-rule
  - name: data
    hostPath:
      path: /data/prometheus

Common alerting rules and configuration for the rule component:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rule
namespace: monitoring
data:
k8s_cluster_rule.yaml: |+
groups:
- name: pod_etcd_monitor
  rules:
  - alert: pod_etcd_num_is_changing
    expr: sum(kube_pod_info{pod=~"etcd.*"})by(monitor) < 3
    for: 1m
    labels:
      level: high
      service: etcd
    annotations:
      summary: "集群:{{ $labels.monitor }},etcd集群pod低于正常总数"
      description: "总数为3,当前值是{{ $value}}"
- name: pod_scheduler_monitor
  rules:
  - alert: pod_scheduler_num_is_changing
    expr: sum(kube_pod_info{pod=~"kube-scheduler.*"})by(monitor) < 3
    for: 1m
    labels:
      level: high
      service: scheduler
    annotations:
      summary: "集群:{{ $labels.monitor }},scheduler集群pod低于正常总数"
      description: "总数为3,当前值是{{ $value}}"
- name: pod_controller_monitor
  rules:
  - alert: pod_controller_num_is_changing
    expr: sum(kube_pod_info{pod=~"kube-controller-manager.*"})by(monitor) < 3
    for: 1m
    labels:
      level: high
      service: controller
    annotations:
      summary: "集群:{{ $labels.monitor }},controller集群pod低于正常总数"
      description: "总数为3,当前值是{{ $value}}"
- name: pod_apiserver_monitor
  rules:
  - alert: pod_apiserver_num_is_changing
    expr: sum(kube_pod_info{pod=~"kube-apiserver.*"})by(monitor) < 3
    for: 1m
    labels:
      level: high
      service: controller
    annotations:
      summary: "集群:{{ $labels.monitor }},apiserver集群pod低于正常总数"
      description: "总数为3,当前值是{{ $value}}"

k8s_master_resource_rules.yaml: |+
groups:
- name: node_cpu_resource_monitor
  rules:
  - alert: 节点CPU使用量
    expr:  sum(kube_pod_container_resource_requests_cpu_cores{node=~".*"})by(node)/sum(kube_node_status_capacity_cpu_cores{node=~".*"})by(node)>0.7
    for: 1m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "集群NODE节点总的CPU使用核数已经超过了70%"
      description: "集群:{{ $labels.monitor }},节点:{{ $labels.node }}当前值为{{ $value }}!"
- name: node_memory_resource_monitor
  rules:
  - alert: 节点内存使用量
    expr:  sum(kube_pod_container_resource_limits_memory_bytes{node=~".*"})by(node)/sum(kube_node_status_capacity_memory_bytes{node=~".*"})by(node)>0.7
    for: 1m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "集群NODE节点总的memory使用核数已经超过了70%"
      description: "集群:{{ $labels.monitor }},节点:{{ $labels.node }}当前值为{{ $value }}!"
- name: 节点POD使用率
  rules:
  - alert: 节点pod使用率
    expr: sum by(node,monitor) (kube_pod_info{node=~".*"}) / sum by(node,monitor) (kube_node_status_capacity_pods{node=~".*"})> 0.9
    for: 1m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "集群NODE节点总的POD使用数量已经超过了90%"
      description: "集群:{{ $labels.monitor }},节点:{{ $labels.node }}当前值为{{ $value }}!"      
- name: master_cpu_used
  rules:
  - alert: 主节点CPU使用率
    expr:  sum(kube_pod_container_resource_limits_cpu_cores{node=~'master.*'})by(node)/sum(kube_node_status_capacity_cpu_cores{node=~'master.*'})by(node)>0.7
    for: 1m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "集群Master节点总的CPU申请核数已经超过了0.7,当前值为{{ $value }}!"
      description: "集群:{{ $labels.monitor }},节点:{{ $labels.node }}当前值为{{ $value }}!" 
- name: master_memory_resource_monitor
  rules:
  - alert: 主节点内存使用率
    expr:  sum(kube_pod_container_resource_limits_memory_bytes{node=~'master.*'})by(node)/sum(kube_node_status_capacity_memory_bytes{node=~'master.*'})by(node)>0.7
    for: 1m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "集群Master节点总的内存使用量已经超过了70%"
      description: "集群:{{ $labels.monitor }},节点:{{ $labels.node }}当前值为{{ $value }}!"
- name: master_pod_resource_monitor
  rules:
  - alert: 主节点POD使用率
    expr: sum(kube_pod_info{node=~"master.*"}) by (node) / sum(kube_node_status_capacity_pods{node=~"master.*"}) by (node)>0.7
    for: 1m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "集群Master节点总的POD使用数量已经超过了70%"
      description: "集群:{{ $labels.monitor }},节点:{{ $labels.node }}当前值为{{ $value }}!"     
k8s_node_rule.yaml: |+
groups:
- name: K8sNodeMonitor
  rules:
  - alert: 集群节点资源监控
    expr: kube_node_status_condition{condition=~"OutOfDisk|MemoryPressure|DiskPressure",status!="false"} ==1
    for: 1m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "集群节点内存或磁盘资源短缺"
      description: "节点:{{ $labels.node }},集群:{{ $labels.monitor }},原因:{{ $labels.condition }}"
  - alert: 集群节点状态监控
    expr: sum(kube_node_status_condition{condition="Ready",status!="true"})by(node)  == 1
    for: 2m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "集群节点状态出现错误"
      description: "节点:{{ $labels.node }},集群:{{ $labels.monitor }}"
  - alert: 集群POD状态监控
    expr: sum (kube_pod_container_status_terminated_reason{reason!~"Completed|Error"})  by (pod,reason) ==1
    for: 1m
    labels:
      level: high
      service: pod
    annotations:
      summary: "集群pod状态出现错误"
      description: "集群:{{ $labels.monitor }},名称:{{ $labels.pod }},原因:{{ $labels.reason}}"
  - alert: 集群节点CPU使用监控
    expr:  sum(node_load1) BY (instance) / sum(rate(node_cpu_seconds_total[1m])) BY (instance) > 2
    for: 5m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "机器出现cpu平均负载过高"
      description: "节点: {{ $labels.instance }}平均每核大于2"
  - alert: NodeMemoryOver80Percent
    expr:  (1 - avg by (instance)(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))* 100 >85
    for: 1m
    labels:
      level: disaster
      service: node
    annotations:
      summary: "机器出现内存使用超过85%"
      description: "节点: {{ $labels.instance }}"
k8s_pod_rule.yaml: |+
groups:
  - name: pod_status_monitor
    rules:
    - alert: pod错误状态监控
      expr: changes(kube_pod_status_phase{phase=~"Failed"}[5m]) >0
      for: 1m
      labels:
        level: high
        service: pod-failed
      annotations:
        summary: "集群:{{ $labels.monitor }}存在pod状态异常"
        description: "pod:{{$labels.pod}},状态:{{$labels.phase}}"
    - alert: pod异常状态监控
      expr: sum(kube_pod_status_phase{phase="Pending"})by(namespace,pod,phase)>0
      for: 3m
      labels:
        level: high
        service: pod-pending
      annotations:
        summary: "集群:{{ $labels.monitor }}存在pod状态pening异常超10分钟"
        description: "pod:{{$labels.pod}},状态:{{$labels.phase}}"
    - alert: pod等待状态监控
      expr: sum(kube_pod_container_status_waiting_reason{reason!="ContainerCreating"})by(namespace,pod,reason)>0
      for: 1m
      labels:
        level: high
        service: pod-wait
      annotations:
        summary: "集群:{{ $labels.monitor }}存在pod状态Wait异常超5分钟"
        description: "pod:{{$labels.pod}},状态:{{$labels.reason}}"
    - alert: pod非正常状态监控
      expr: sum(kube_pod_container_status_terminated_reason)by(namespace,pod,reason)>0
      for: 1m
      labels:
        level: high
        service: pod-nocom
      annotations:
        summary: "集群:{{ $labels.monitor }}存在pod状态Terminated异常超5分钟"
        description: "pod:{{$labels.pod}},状态:{{$labels.reason}}"
    - alert: pod重启监控
      expr: changes(kube_pod_container_status_restarts_total[20m])>3
      for: 3m
      labels:
        level: high
        service: pod-restart
      annotations:
        summary: "集群:{{ $labels.monitor }}存在pod半小时之内重启次数超过3次!"
        description: "pod:{{$labels.pod}}"
  - name: deployment_replicas_monitor
    rules:
    - alert: deployment监控
      expr: sum(kube_deployment_status_replicas_unavailable)by(namespace,deployment) >2
      for: 3m
      labels:
        level: high
        service: deployment-replicas
      annotations:
        summary: "集群:{{ $labels.monitor }},deployment:{{$labels.deployment}} 副本数未达到期望值! "
        description: "空间:{{$labels.namespace}},当前不可用副本:{{$value}},请检查"
  - name: daemonset_replicas_monitor
    rules:
    - alert: Daemonset监控
      expr: sum(kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled)by(daemonset,namespace) >2
      for: 3m
      labels:
        level: high
        service: daemonset
      annotations:
        summary: "集群:{{ $labels.monitor }},daemonset:{{$labels.daemonset}} 守护进程数未达到期望值!"
        description: "空间:{{$labels.namespace}},当前不可用副本:{{$value}},请检查"
  - name: statefulset_replicas_monitor
    rules:
    - alert: Statefulset监控
      expr: (kube_statefulset_replicas - kube_statefulset_status_replicas_ready) >2
      for: 3m
      labels:
        level: high
        service: statefulset
      annotations:
        summary: "集群:{{ $labels.monitor }},statefulset:{{$labels.statefulset}} 副本数未达到期望值!"
        description: "空间:{{$labels.namespace}},当前不可用副本:{{$value}},请检查"
  - name: pvc_replicas_monitor
    rules:
    - alert: PVC监控
      expr: kube_persistentvolumeclaim_status_phase{phase!="Bound"} == 1
      for: 5m
      labels:
        level: high
        service: pvc
      annotations:
        summary: "集群:{{ $labels.monitor }},statefulset:{{$labels.persistentvolumeclaim}} 异常未bound成功!"
        description: "pvc出现异常"
  - name: K8sClusterJob
    rules:    
    - alert: 集群JOB状态监控
      expr: sum(kube_job_status_failed{job="kubernetes-service-endpoints",k8s_app="kube-state-metrics"})by(job_name) ==1
      for: 1m
      labels:
        level: disaster
        service: job
      annotations:
        summary: "集群存在执行失败的Job"
        description: "集群:{{ $labels.monitor }},名称:{{ $labels.job_name }}"
  - name: pod_container_cpu_resource_monitor
    rules:
    - alert: 容器内cpu占用监控
      expr: namespace:container_cpu_usage_seconds_total:sum_rate / sum(kube_pod_container_resource_limits_cpu_cores) by (monitor,namespace,pod_name)> 0.8
      for: 1m
      labels:
        level: high
        service: container_cpu
      annotations:
        summary: "集群:{{ $labels.monitor }} 出现Pod CPU使用率已经超过申请量的80%,"
        description: "namespace:{{$labels.namespace}}的pod:{{$labels.pod}},当前值为{{ $value }}"
    - alert: 容器内mem占用监控
      expr: namespace:container_memory_usage_bytes:sum/ sum(kube_pod_container_resource_limits_memory_bytes)by(monitor,namespace,pod_name) > 0.8
      for: 2m
      labels:
        level: high
        service: container_mem
      annotations:
        summary: "集群:{{ $labels.monitor }} 出现Pod memory使用率已经超过申请量的90%"
        description: "namespace:{{$labels.namespace}}的pod:{{$labels.pod}},当前值为{{ $value }}"

redis_rules.yaml: |+
groups:
- name: k8s_container_rule
  rules:
  - expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (monitor,namespace,pod_name)
    record: namespace:container_cpu_usage_seconds_total:sum_rate
  - expr: sum(container_memory_usage_bytes{container_name="POD"}) by (monitor,namespace,pod_name)
    record: namespace:container_memory_usage_bytes:sum

Note: since all components run in the same cluster, we use DNS SRV records to discover the other components. DNS SRV is very convenient inside the cluster: just create the headless Services you need by setting clusterIP: None.
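You can check from inside the cluster that the SRV records resolve as expected, for example from a debug pod that has dig available:

dig +short SRV _grpc._tcp.prometheus-sidecar-svc.monitoring.svc.cluster.local
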
thanos-query-svc:
apiVersion: v1
kind: Service
metadata:
labels:
app: query
name: sidecar-query
spec:
ports:
- name: web
port: 19090
protocol: TCP
targetPort: 19090
selector:
app: query

thanos-rule-svc:
apiVersion: v1
kind: Service
metadata:
labels:
app: rule
name: sidecar-rule
spec:
clusterIP: None
ports:
- name: web
port: 10902
protocol: TCP
targetPort: 10902
- name: grpc
port: 10901
protocol: TCP
targetPort: 10901
selector:
app: rule

Prometheus+sidecar:
apiVersion: v1
kind: Service
metadata:
labels:
app: prometheus
name: prometheus-sidecar-svc
spec:
clusterIP: None
ports:
- name: web
port: 9090
protocol: TCP
targetPort: 9090
- name: grpc
port: 10901
protocol: TCP
targetPort: 10901
selector:
app: prometheus

Screenshots:

Multi-cluster Pod metrics example:
2.png

Alerting rule example:
3.png

Thanos home page:
4.png

Data storage

We also took quite a few detours with Prometheus data storage.

We started with InfluxDB but gave it up because of the clustering situation. We then rewrote prometheus-adapter to feed OpenTSDB, and abandoned that too because some wildcard queries were hard to maintain (in truth mostly because of tcollector's collection problems). We also tried pushing Thanos store data to Ceph over S3, which was too expensive because of the replica count, and to Alibaba Cloud OSS, which stored plenty but made reading the data back a problem. Then came VictoriaMetrics, which solves most of our main problems.

Architecture:
5.png

VictoriaMetrics is a time-series database in its own right: it serves as remote storage for Prometheus and can also be queried directly as a Prometheus-compatible data source.

Advantages:
  1. High compression ratio and high performance
  2. Can serve the same data-source role as Prometheus for dashboards
  3. Supports MetricsQL and aggregates identical metrics at query time
  4. The cluster version is open source (a huge plus)


In a simple comparison against plain Prometheus on the same data, VictoriaMetrics used roughly 50% less memory, over 40% less CPU, and about 40% less disk. Splitting the write and read paths this way also avoids the large memory footprint and OOMs caused by old and new data coexisting in memory, and gives us an affordable option for long-term storage.

Deploying VictoriaMetrics:

vminsert deployment:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: monitor-vminsert
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
  vminsert: online
template:
metadata:
  labels:
    vminsert: online
spec:
  containers:
  - args:
    - -storageNode=vmstorage:8400
    image: victoriametrics/vminsert:v1.39.4-cluster
    imagePullPolicy: IfNotPresent
    name: vminsert
    ports:
    - containerPort: 8480
      name: vminsert
      protocol: TCP
  dnsPolicy: ClusterFirst
  hostNetwork: true
  nodeSelector:
    vminsert: online
  restartPolicy: Always
updateStrategy:
rollingUpdate:
  maxUnavailable: 1
type: RollingUpdate

vmselect deployment:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: monitor-vmselect
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
  vmselect: online
template:
metadata:
  labels:
    vmselect: online
spec:
  containers:
  - args:
    - -storageNode=vmstorage:8401
    image: victoriametrics/vmselect:v1.39.4-cluster
    imagePullPolicy: IfNotPresent
    name: vmselect
    ports:
    - containerPort: 8481
      name: vmselect
      protocol: TCP
  dnsPolicy: ClusterFirst
  hostNetwork: true
  nodeSelector:
    vmselect: online
  restartPolicy: Always
updateStrategy:
rollingUpdate:
  maxUnavailable: 1
type: RollingUpdate

vmstorage deployment:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: monitor-vmstorage
spec:
replicas: 10
serviceName: vmstorage
revisionHistoryLimit: 10
selector:
matchLabels:
  vmstorage: online
template:
metadata:
  labels:
    vmstorage: online
spec:
  containers:
  - args:
    - --retentionPeriod=1
    - --storageDataPath=/storage
    image: victoriametrics/vmstorage:v1.39.4-cluster
    imagePullPolicy: IfNotPresent
    name: vmstorage
    ports:
    - containerPort: 8482
      name: http
      protocol: TCP
    - containerPort: 8400
      name: vminsert
      protocol: TCP
    - containerPort: 8401
      name: vmselect
      protocol: TCP
    volumeMounts:
    - mountPath: /storage
      name: data
  hostNetwork: true
  nodeSelector:
    vmstorage: online
  restartPolicy: Always
  volumes:
  - hostPath:
      path: /data/vmstorage
      type: ""
    name: data

vmstorage-svc (exposes the insert and select ports):
apiVersion: v1
kind: Service
metadata:
labels:
vmstorage: online
name: vmstorage
spec:
ports:
- name: http
port: 8482
protocol: TCP
targetPort: http
- name: vmselect
port: 8401
protocol: TCP
targetPort: vmselect
- name: vminsert
port: 8400
protocol: TCP
targetPort: vminsert
selector:
vmstorage: online
type: NodePort

vminsert-svc:
apiVersion: v1
kind: Service
metadata:
labels:
vminsert: online
name: monitor-vminsert
spec:
ports:
- name: vminsert
port: 8480
protocol: TCP
targetPort: vminsert
selector:
vminsert: online
type: NodePort

vmselect-svc:
apiVersion: v1
kind: Service
metadata:
labels:
vmselect: online
name: monitor-vmselect
spec:
ports:
- name: vmselect
port: 8481
protocol: TCP
targetPort: vmselect
selector:
vmselect: online
type: NodePort

Once deployment is complete, update the Prometheus configuration to enable remote write and read:
remote_write:
  - url: "http://vmstorage:8400/insert/0/prometheus/"
remote_read:
  - url: "http://vmstorage:8401/select/0/prometheus"

Grafana data source configuration:
Data source type: Prometheus, with the URL
http://vmstorage:8401/select/0/prometheus
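To confirm that data is flowing, the select path can also be queried directly. With the cluster version, queries normally go through vmselect, for example (service name taken from the manifest above; adjust host and port to your environment):

curl -s 'http://monitor-vmselect:8481/select/0/prometheus/api/v1/query?query=up'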

Screenshot:
6.png

Alerting

All alerting rules are evaluated by Thanos rule and pushed to Alertmanager.

Alertmanager handles the alerts themselves, and our in-house alerting platform takes care of routing and distribution.

In the configuration we group by alertname and monitor, so all alerts sharing an alert name within one cluster are aggregated into a single group before being sent. Because production runs a very large number of Pods, a mass Pod incident would make the aggregated payload huge, so Pod alerts are additionally split into their own groups. The effect is shown in the screenshots at the end.

Silencing: every alert carries a severity label (warning, high, disaster). Alerts of the lowest level do not go through the alerting platform by default, and silences are applied according to severity and alert rule.

For example (see the amtool command after this list):
  1. Within one monitor (cluster), silence a given alertname per instance
  2. For floods of Pod alerts, silence by Pod alert type
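A silence of the first kind can be created with amtool; the alert name, cluster, instance, and duration below are placeholders:

amtool silence add alertname="pod_etcd_num_is_changing" monitor="k8s-sh-prod" instance="bja-athena-etcd-001:9100" \
  --comment="maintenance" --duration=2h --alertmanager.url=http://alertmanager:9093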


When an alert group fires for the first time, all alerts in the group are pushed together according to the grouping information.

Alertmanager configuration:
global:
  smtp_smarthost: 'mail.xxxxxxx.com:25'
  smtp_from: 'xxxxxxx@xxxxxxx.com'
  smtp_auth_username: 'xxxxxxx@xxxxxxx.com'
  smtp_auth_password: 'xxxxxxx'
  smtp_require_tls: false

route:
  group_by: ['alertname','pod','monitor']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 6h
  receiver: 'webhook'
  routes:
  - receiver: 'mail'
    match:
      level: warning

receivers:
- name: 'mail'
  email_configs:
  - to: 'amend@xxxxx.com,amend2@xxxxx.com'
    send_resolved: true
- name: 'webhook'
  webhook_configs:
  - url: 'http://alert.xxx.com/alert/prometheus'
    send_resolved: true

inhibit_rules:
- source_match:
    level: 'disaster'
  target_match_re:
    level: 'high|disaster'
  equal: ['alertname','instance','monitor']
- source_match:
    level: 'high'
  target_match_re:
    level: 'high'
  equal: ['alertname','instance','monitor']
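For reference, the webhook receiver configured above gets a JSON payload from Alertmanager roughly like the following (abridged; label values are illustrative), which is what the Python handler below parses:

{
  "version": "4",
  "status": "firing",
  "receiver": "webhook",
  "groupLabels": { "alertname": "pod_etcd_num_is_changing", "monitor": "k8s-sh-prod" },
  "commonLabels": { "level": "high", "service": "etcd" },
  "commonAnnotations": {},
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "pod_etcd_num_is_changing", "monitor": "k8s-sh-prod", "namespace": "kube-system" },
      "annotations": { "summary": "...", "description": "..." },
      "startsAt": "2020-09-01T08:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}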

Alert aggregation code example (Python):
import json
from tornado import web

class AlertWebhookHandler(web.RequestHandler):   # handler class name assumed; only the body below comes from the original snippet
    def post(self):
        try:
            payload = json.loads(self.request.body)   # the body is JSON, so parse it instead of using eval()
        except json.decoder.JSONDecodeError:
            raise web.HTTPError(400)
        alert_row = payload['alerts']
        description = ''
        try:
            if len(alert_row) < 2:
                description = alert_row[0]['annotations']['description']
                summary = alert_row[0]['annotations']['summary']
            else:
                for alert in alert_row:
                    description += alert['annotations']['description'] + '\n'
                summary = '[聚合告警] ' + alert_row[0]['annotations']['summary']
        except (KeyError, IndexError):
            pass
        try:
            namespace = alert_row[0]['labels']['namespace']
        except (KeyError, IndexError):
            pass

Results:

Pod monitoring:
7.png

Instance-level alerts:
8.png

Business-level alerts:
9.png

Source code and templates: https://github.com/gecailong/K8sMonitor

Original article (Chinese): https://www.noalert.cn/post/ru ... -qun/
