SONiC Telemetry Deployment

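Back when I worked as a TME at my previous employer, I wrote up IOS XR telemetry, for example "Telemetry Solution Demo with IOS XR", and I also looked into "Telemetry Receiver by UDP+KV-GPB". I never got around to SONiC switches, though, for two reasons: SONiC is built on a Redis database and its OpenConfig support is weak, and the Cisco open-source pipeline I used back then has not been updated for a long time and appears to be abandoned. In 2024 I put together SONiC + Telemetry with Telegraf as the middleware, and for flexibility I ran Telegraf as a container on SONiC itself. That turned out to be cumbersome in practice; deploying Telegraf directly on a server is simpler. I was busy at the time and never wrote it up, and the details have since faded.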

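This time I had a visualization requirement again, so I reworked the architecture. Instead of Telegraf + InfluxData, it now uses gnmic + Prometheus. gnmic is hosted under the OpenConfig project and is updated frequently on GitHub. The code is at: https://github.com/yongpro/sonic-telemetry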

I have also added Server Telemetry, which extends the existing architecture; see the write-up "Server Telemetry Deployment".

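Note: the whole solution keeps being iterated on and optimized, so I may not revise the original text of this post every time something changes; it is mainly a summary for myself. If you want to use it as a reference, read the whole document through. Also, deploy the telemetry platform on the local LAN so that transmission delay does not skew the data.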

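Architecture Overview

Features

The system monitors key network metrics of SONiC switches in an RDMA environment. It supports:

  • Port Traffic: per-port traffic statistics (IN/OUT)
  • PFC: Priority Flow Control statistics
  • Queue Watermark (OQ): egress queue watermark
  • PG Watermark (SQ): ingress priority group watermark
  • Queue Drop: per-queue drop statistics
  • ACL Counter: ACL rule match counters (CNP, NAK, etc.)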

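System Architecture

As far as I could confirm, SONiC currently offers three interfaces for retrieving data:

  1. SONiC's native OIDs. Everything in the DBs is reachable this way, but the VID and RID can change, so this is not a good gNMI interface for production (with tens of thousands of devices that all differ, the subscription set becomes too large to maintain). Its advantage is that it does not care whether the upper layer (A-OS, B-OS, C-OS) is the same; the same architecture can get the data out of any of them, which makes it a very good fit for a LAB environment;
  2. OpenConfig YANG. Stock SONiC lacks support for many OpenConfig models, so you have to fill in this abstraction layer yourself;
  3. SONiC YANG, SONiC's native YANG. Support is not great here either, and much of it has to be extended by the user.

Note how gnmic subscriptions behave: if any single OID in a subscription's path list does not exist, the whole subscription fails. I did not dig into whether this can be solved properly; I simply worked around it. Using the first interface above, an independent subscription config is generated for each switch so that every OID in it is valid. If a reboot or a config change causes the OIDs to change, the OIDs have to be refreshed.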
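To make the per-switch idea concrete, here is a minimal Python sketch of building one set of subscription entries per switch from the OIDs collected on that switch. The function name `build_subscriptions` and the input shape are assumptions for illustration; the field names mirror the gnmic.yaml shown later in this post, and the real auto-refresh.sh may do this differently.

```python
def build_subscriptions(switch, oids_by_module, sample_interval="1s"):
    """Return gnmic-style subscription entries for ONE switch.

    oids_by_module maps a module name (port, queue, pg, acl) to the list of
    OIDs that were actually found on this switch, so no path can be missing.
    """
    subs = {}
    for module, oids in sorted(oids_by_module.items()):
        if not oids:
            continue  # skip empty modules instead of emitting an empty path list
        subs[f"{module}_counters_{switch}"] = {
            "paths": [f"COUNTERS:{oid}" for oid in oids],
            "target": "COUNTERS_DB",
            "mode": "stream",
            "stream-mode": "sample",
            "sample-interval": sample_interval,
        }
    return subs


subs = build_subscriptions(
    "switch1",
    {"port": ["oid:0x1000000000001", "oid:0x1000000000002"], "acl": []},
)
print(sorted(subs))
```

Because each switch only ever subscribes to OIDs read from its own NAME_MAP, a stale OID on one switch cannot break the subscriptions of another.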

The architecture currently supports both port naming forms, Ethernet1 and Ethernet1_1.

┌────────────────────────────────────────────────────────────────────┐
│                         SONiC Switches                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       ┌──────────┐       │
│  │ Switch1  │  │ Switch2  │  │ Switch3  │  ...  │ SwitchN  │       │
│  │(SONiC A) │  │(SONiC B) │  │(SONiC A) │       │(SONiC C) │       │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘       └────┬─────┘       │
│       │gNMI        │gNMI        │gNMI              │gNMI           │
└───────┼────────────┼────────────┼──────────────────┼───────────────┘
        │            │            │                  │
        ▼            ▼            ▼                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                           gnmic                                    │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐       │
│  │ port_switch1    │ │ port_switch2    │ │ port_switch3    │ ...   │
│  │ queue_switch1   │ │ queue_switch2   │ │ queue_switch3   │       │
│  │ pg_switch1      │ │ pg_switch2      │ │ pg_switch3      │       │
│  │ acl_switch1     │ │ acl_switch2     │ │ acl_switch3     │       │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘       │
│              Each switch subscribes independently                  │
└──────────────────────────────┬─────────────────────────────────────┘
                               │ :9804/metrics
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│                         Prometheus                                 │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    metric_relabel_configs                    │  │
│  │  sonic_COUNTERS_oid_0x10xxx → sonic_port_xxx + port tag      │  │
│  │  sonic_COUNTERS_oid_0x15xxx → sonic_queue_xxx + queue tag    │  │
│  │  sonic_COUNTERS_oid_0x1axxx → sonic_pg_xxx + pg tag          │  │
│  │  sonic_COUNTERS_oid_0x9xxxx → sonic_acl_xxx + acl_rule tag   │  │
│  └──────────────────────────────────────────────────────────────┘  │
└──────────────────────────────┬─────────────────────────────────────┘
                               │ :9090
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│                          Grafana                                   │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐       │
│  │ Port    │ │  PFC    │ │   OQ    │ │   SQ    │ │  ACL    │       │
│  │ Traffic │ │  Stats  │ │Watermark│ │Watermark│ │ Counter │       │
│  └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘       │
└────────────────────────────────────────────────────────────────────┘

Directory Layout

/data/sonic-telemetry/
├── config/                          # configuration directory
│   ├── switches.conf                # switch list (the main file you maintain)
│   ├── settings.conf                # global settings (credentials, ports, etc.)
│   └── modules/                     # counter module configs
│       ├── port.conf                # Port Counter config
│       ├── queue.conf               # Queue Counter config
│       ├── pg.conf                  # PG Counter config
│       └── acl.conf                 # ACL Counter config
├── scripts/
│   └── auto-refresh.sh              # automatic config generation script
├── cache/                           # OID cache (auto-generated)
│   ├── switch1/
│   │   ├── port_oids_old.txt
│   │   ├── queue_oids_old.txt
│   │   ├── pg_oids_old.txt
│   │   └── acl_oids_old.txt
│   ├── switch2/
│   │   └── ...
│   ├── all_port_map_old.txt         # merged OID mappings
│   ├── all_queue_map_old.txt
│   ├── all_pg_map_old.txt
│   ├── all_acl_map_old.txt
│   └── switches_old.txt
├── backup/                          # config backups (auto-generated)
│   └── 20251230_131716/
│       ├── gnmic.yaml
│       └── prometheus.yml
├── logs/                            # logs (auto-generated)
│   └── refresh_20251230_131716.log
├── gnmic/
│   └── gnmic.yaml                   # gnmic config (auto-generated)
├── prometheus/
│   └── prometheus.yml               # Prometheus config (auto-generated)
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
├── docker-compose.yml
└── README.md

Configuration Files

switches.conf – switch list

# SONiC Telemetry - switch list
#
# Format 1 (global credentials): name:IP
# Format 2 (per-switch credentials): name:IP:username:password
#
# The name becomes the Prometheus "source" label

# Global credentials (GNMI_USER/GNMI_PASS from settings.conf)
switch1:10.22.1.241
switch2:10.26.4.102

# Per-switch credentials
switch3:10.26.4.103:admin:password123
switch4:10.26.4.104:operator:secret456
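The two line formats can be parsed in a few lines. The following Python sketch is only an illustration of the format described above (the actual script is a shell script and may parse it differently); `Switch` and `parse_switches` are hypothetical names, and IPv4 addresses are assumed, since an IPv6 address would itself contain colons.

```python
from typing import NamedTuple, Optional


class Switch(NamedTuple):
    name: str
    address: str
    username: Optional[str]  # None means "use the global credentials"
    password: Optional[str]


def parse_switches(text):
    """Parse switches.conf content: name:IP or name:IP:username:password."""
    switches = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # maxsplit=3 keeps any ':' inside the password intact
        parts = line.split(":", 3)
        if len(parts) == 2:
            switches.append(Switch(parts[0], parts[1], None, None))
        elif len(parts) == 4:
            switches.append(Switch(*parts))
        else:
            raise ValueError(f"unrecognised line: {raw!r}")
    return switches


sample = """# comment
switch1:10.22.1.241

switch3:10.26.4.103:admin:password123
"""
parsed = parse_switches(sample)
print(parsed)
```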

settings.conf – global settings

# SONiC Telemetry - global settings

# gNMI connection
GNMI_USER="xxxx"               # default username
GNMI_PASS="xxxx"               # default password
GNMI_PORT="8080"               # gNMI port
GNMI_TIMEOUT="10"              # connection timeout (seconds)

# Sampling
SAMPLE_INTERVAL="1s"           # data collection interval

# Prometheus
PROMETHEUS_PORT="9090"
GNMIC_METRICS_PORT="9804"

# Grafana
GRAFANA_PORT="3000"

# Days of logs to keep
LOG_RETENTION_DAYS="30"

# Number of backups to keep
BACKUP_RETENTION_COUNT="10"

Module configuration files

port.conf – port traffic

# Port Counter module config

# Whether this module is enabled
enabled=true

# NAME_MAP path in SONiC (fixed)
name_map_path="COUNTERS_PORT_NAME_MAP"

# Prometheus metric prefix
prometheus_prefix="sonic_port"

# Name of the generated label
label_name="port"

# Name filter (regular expression)
# Keep only Ethernet ports, exclude the CPU port
# Matches every form, e.g. Ethernet56 as well as Ethernet10_1, Ethernet10_2
name_filter="^Ethernet[0-9]+(_[0-9]+)?$"

# Sub-labels to extract from the name (none for ports)
extract_labels=""

queue.conf – queue statistics

# Queue Counter module config

# Whether this module is enabled
enabled=true

# NAME_MAP path in SONiC (fixed)
name_map_path="COUNTERS_QUEUE_NAME_MAP"

# Prometheus metric prefix
prometheus_prefix="sonic_queue"

# Name of the generated label
label_name="queue"

# Name filter (regular expression)
# Keep only UC queues (0-7); exclude MC queues (8-15) and CPU queues
# Matches every form, e.g. Ethernet56:3 and Ethernet10_1:3
name_filter="^Ethernet[0-9]+(_[0-9]+)?:[0-7]$"

# Sub-labels to extract from the name
# Format: label:regex:replacement (separate multiple entries with semicolons)
# From "Ethernet10:3" extract queue_port=Ethernet10, queue_num=3
# $1 is the port and $3 is the queue number ($2 is the optional _1 part)
extract_labels="queue_port:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$1;queue_num:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$3"
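The `label:regex:replacement` format is slightly tricky because the regex itself contains colons. One way to read it, sketched below in Python, is to take the label before the first colon and the replacement after the last colon. This is my reading of the format, not necessarily how auto-refresh.sh parses it; `parse_extract_labels` and `apply_rules` are hypothetical names. Matching is fully anchored, as it is in Prometheus relabeling.

```python
import re


def parse_extract_labels(spec):
    """Split 'label:regex:replacement;...' into (label, compiled regex, replacement)."""
    rules = []
    for item in filter(None, spec.split(";")):
        label, rest = item.split(":", 1)           # label ends at the first ':'
        regex, replacement = rest.rsplit(":", 1)    # replacement starts after the last ':'
        rules.append((label, re.compile(regex), replacement))
    return rules


def apply_rules(rules, name):
    """Return the sub-labels extracted from one NAME_MAP key."""
    labels = {}
    for label, regex, replacement in rules:
        m = regex.fullmatch(name)  # anchored match, like Prometheus
        if m:
            # translate $N into Python's \g<N> group reference
            template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
            labels[label] = m.expand(template)
    return labels


spec = ("queue_port:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$1;"
        "queue_num:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$3")
rules = parse_extract_labels(spec)
print(apply_rules(rules, "Ethernet10_1:3"))
print(apply_rules(rules, "Ethernet10:3"))
```

The second capture group exists only to make the `_N` suffix optional, which is why the queue number ends up in `$3` rather than `$2`.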

pg.conf – priority groups

# Priority Group (PG/SQ) Counter module config

# Whether this module is enabled
enabled=true

# NAME_MAP path in SONiC (fixed)
name_map_path="COUNTERS_PG_NAME_MAP"

# Prometheus metric prefix
prometheus_prefix="sonic_pg"

# Name of the generated label
label_name="pg"

# Name filter (regular expression)
# Keep only PG 0-7
# Matches every form, e.g. Ethernet56:3 and Ethernet10_1:3
name_filter="^Ethernet[0-9]+(_[0-9]+)?:[0-7]$"

# Sub-labels to extract from the name
# From "Ethernet10:3" extract pg_port=Ethernet10, pg_num=3
# $1 is the port and $3 is the PG number ($2 is the optional _1 part)
extract_labels="pg_port:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$1;pg_num:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$3"

acl.conf – ACL counters

# ACL Counter module config

# Whether this module is enabled
enabled=true

# NAME_MAP path in SONiC (fixed)
name_map_path="ACL_COUNTER_RULE_MAP"

# Prometheus metric prefix
prometheus_prefix="sonic_acl"

# Name of the generated label
label_name="acl_rule"

# Name filter (regular expression; empty means no filtering)
name_filter=""

# Sub-labels to extract from the name
# From "Ethernet10:RULE_10" extract acl_port=Ethernet10
# Uses a non-capturing group for the optional _N part
extract_labels="acl_port:(Ethernet[0-9]+(?:_[0-9]+)?):(.+):$1"

# ============================================================
# ACL rule name mapping (maps RULE_XX to a readable name)
# Format: original_rule_name=display_name
# ============================================================
[rule_mapping]
RULE_10=CNP
RULE_20=NAK
# Add more mappings:
# RULE_30=RDMA_READ
# RULE_40=RDMA_WRITE
# DROP_RULE=DROP

The Automation Script in Detail

auto-refresh.sh workflow

The workflow is shown below:

┌─────────────────────────────────────────────────────────────────┐
│                    auto-refresh.sh                              │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ 1. Load       │    │ switches.conf │    │ settings.conf │
│    Config     │◄───│ modules/*.conf│◄───│               │
└───────┬───────┘    └───────────────┘    └───────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ 2. Check Switch Connectivity                                  │
│    - Test gNMI port connection                                │
│    - Mark online/offline status                               │
└───────┬───────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ 3. Collect OIDs (per switch)                                  │
│    - Get NAME_MAP from each online switch                     │
│    - Apply name filters (e.g. exclude CPU ports)              │
│    - Save to cache/{switch_name}/{module}_oids_new.txt        │
│    - Merge mappings to cache/all_{module}_map_new.txt         │
└───────┬───────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ 4. Compare Differences                                        │
│    - Compare old/new OID counts                               │
│    - Compare old/new switch lists                             │
│    - Display change summary                                   │
└───────┬───────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ 5. Generate Config (if changed)                               │
│    - Backup old config to backup/                             │
│    - Generate gnmic/gnmic.yaml                                │
│    - Generate prometheus/prometheus.yml                       │
│    - Update cache (_new.txt → _old.txt)                       │
└───────────────────────────────────────────────────────────────┘
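Step 4 of the workflow compares old and new OID data. As a minimal sketch of what such a comparison can look like, the Python below diffs two name-to-OID maps entry by entry. Note that this is a stricter check than the count comparison named in the diagram: it also catches the case where the number of OIDs is unchanged but the OIDs themselves were renumbered. The `name=0x...` line format is assumed from the cache excerpt shown later in this post, and the function names are hypothetical.

```python
def parse_map(text):
    """Parse 'name=oid' lines (one mapping per line) into a dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            name, _, oid = line.rpartition("=")
            mapping[name] = oid
    return mapping


def diff_maps(old, new):
    """Report names that were added, removed, or now point at a different OID."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }


old = parse_map("Ethernet1:0=0x15000000000001\nEthernet1:1=0x15000000000002\n")
new = parse_map(
    "Ethernet1:0=0x15000000000001\n"
    "Ethernet1:1=0x15000000000009\n"
    "Ethernet2:0=0x1500000000000a\n"
)
result = diff_maps(old, new)
print(result)
```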

Command-line options

./scripts/auto-refresh.sh [options]

Options:
  -h, --help      show help
  -v, --verbose   verbose output
  -f, --force     force the update (no confirmation prompt)

Sample output (the script prints its messages in Chinese; translated here)

╔════════════════════════════════════════════════════════════╗
║      SONiC Telemetry Config Refresh Tool v2                ║
║      (per-switch independent subscription mode)            ║
╚════════════════════════════════════════════════════════════╝

  Time: 2025-12-30 13:26:41
  Log: /data/sonic-telemetry/logs/refresh_20251230_132641.log

════════════════════════════════════════════════════════════
  1/4 Check switch connectivity
════════════════════════════════════════════════════════════

✓ switch1 (10.22.1.241) - online
✓ switch2 (10.26.4.102) - online

  Online: 2, offline: 0

════════════════════════════════════════════════════════════
  2/4 Collect OIDs
════════════════════════════════════════════════════════════

  Enabled modules: acl pg port queue

  [switch1]
    acl: 42 OIDs
    pg: 448 OIDs
    port: 56 OIDs
    queue: 448 OIDs
  [switch2]
    acl: 42 OIDs
    pg: 448 OIDs
    port: 56 OIDs
    queue: 448 OIDs

════════════════════════════════════════════════════════════
  3/4 Compare differences
════════════════════════════════════════════════════════════

╔══════════════════════════════════════════════════╗
║ Change summary                                   ║
╠══════════════════════════════════════════════════╣
║ Switches: 0 → 2 (+2)                             ║
║ acl OID: 0 → 84 (+84)                            ║
║ pg OID: 0 → 896 (+896)                           ║
║ port OID: 0 → 112 (+112)                         ║
║ queue OID: 0 → 896 (+896)                        ║
╚══════════════════════════════════════════════════╝

Config changes detected

════════════════════════════════════════════════════════════
  4/4 Generate config
════════════════════════════════════════════════════════════

✓ gnmic config generated
✓ prometheus config generated

════════════════════════════════════════════════════════════
  Config update complete!
════════════════════════════════════════════════════════════

  Run the following command to apply the config:

    cd /data/sonic-telemetry && docker compose restart gnmic prometheus

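Generated Configuration Files

Structure of gnmic.yaml

# SONiC Telemetry gnmic config
# Architecture: independent subscriptions per switch, per-switch credentials supported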

insecure: true
encoding: json_ietf
log: true

targets:
  switch1:
    address: 10.22.1.241:8080
    username: username
    password: password
    subscriptions:
      - port_counters_switch1
      - queue_counters_switch1
      - pg_counters_switch1
      - acl_counters_switch1
  switch2:
    address: 10.26.4.102:8080
    username: admin              # per-switch credentials
    password: password123
    subscriptions:
      - port_counters_switch2
      - queue_counters_switch2
      - pg_counters_switch2
      - acl_counters_switch2

subscriptions:
  port_counters_switch1:
    paths:
      - "COUNTERS:oid:0x1000000000001"
      - "COUNTERS:oid:0x1000000000002"
      # ... all port OIDs of switch1
    target: COUNTERS_DB
    mode: stream
    stream-mode: sample
    sample-interval: 1s

  port_counters_switch2:
    paths:
      - "COUNTERS:oid:0x1000000000001"
      - "COUNTERS:oid:0x1000000000002"
      # ... all port OIDs of switch2 (may differ from switch1)
    target: COUNTERS_DB
    mode: stream
    stream-mode: sample
    sample-interval: 1s

  # ... subscriptions for the other modules

outputs:
  prometheus:
    type: prometheus
    listen: :9804
    path: /metrics
    metric-prefix: sonic
    append-subscription-name: false
    export-timestamps: true

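Pay attention to the "export-timestamps: true" setting. By default, when a scraped sample carries no timestamp, Prometheus stamps it with the scrape time. With export-timestamps: true, Prometheus has to honor the timestamp that gnmic supplies. If an OID in gnmic's cache is updated infrequently, or the exported timestamp is older than Prometheus's lookback delta (5 minutes by default), Prometheus treats the sample as stale and queries return no results. For a quick test you can simply remove this line.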
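The effect of the lookback delta can be illustrated with a small calculation. The Python below is only a simplified model of the rule described above (an instant query sees a sample only if it is no older than the lookback delta); it is not Prometheus code, the exact boundary handling in Prometheus may differ, and the function name is made up for the example.

```python
from datetime import datetime, timedelta, timezone

LOOKBACK_DELTA = timedelta(minutes=5)  # Prometheus default


def visible_to_instant_query(sample_ts, query_ts, lookback=LOOKBACK_DELTA):
    """True if a sample stamped sample_ts is still within the lookback window."""
    age = query_ts - sample_ts
    return timedelta(0) <= age <= lookback


now = datetime(2025, 12, 30, 13, 30, 0, tzinfo=timezone.utc)
fresh = now - timedelta(seconds=30)   # counter updated 30 seconds ago
stale = now - timedelta(minutes=10)   # counter last updated 10 minutes ago

print(visible_to_instant_query(fresh, now))
print(visible_to_instant_query(stale, now))
```

With export-timestamps enabled, a counter whose value has not been refreshed for ten minutes falls into the second case even though gnmic is still exporting it on every scrape.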

Relabel rules in prometheus.yml

scrape_configs:
  - job_name: 'gnmic'
    static_configs:
      - targets: ['gnmic:9804']
    metric_relabel_configs:
      # ============================================================
      # port (OID prefix: 0x10)
      # ============================================================
      # Extract the OID and rename the metric
      - source_labels: [__name__]
        regex: 'sonic_COUNTERS_(oid_0x10[0-9a-f]+)_(.+)'
        replacement: '${1}'
        target_label: port_oid
      - source_labels: [__name__]
        regex: 'sonic_COUNTERS_(oid_0x10[0-9a-f]+)_(.+)'
        replacement: 'sonic_port_${2}'
        target_label: __name__
      
      # OID-to-name mapping
      - source_labels: [port_oid]
        regex: 'oid_0x1000000000001'
        replacement: 'Ethernet1'
        target_label: port
      - source_labels: [port_oid]
        regex: 'oid_0x1000000000002'
        replacement: 'Ethernet2'
        target_label: port
      # ... more mappings

      # ============================================================
      # queue (OID prefix: 0x15)
      # ============================================================
      - source_labels: [__name__]
        regex: 'sonic_COUNTERS_(oid_0x15[0-9a-f]+)_(.+)'
        replacement: '${1}'
        target_label: queue_oid
      - source_labels: [__name__]
        regex: 'sonic_COUNTERS_(oid_0x15[0-9a-f]+)_(.+)'
        replacement: 'sonic_queue_${2}'
        target_label: __name__
      
      # OID-to-name mapping
      - source_labels: [queue_oid]
        regex: 'oid_0x1500000000006d'
        replacement: 'Ethernet1:0'
        target_label: queue
      # ... more mappings
      
      # Extract sub-labels (the optional _N part covers names like Ethernet10_1)
      - source_labels: [queue]
        regex: '(Ethernet[0-9]+(_[0-9]+)?):([0-9]+)'
        replacement: '$1'
        target_label: queue_port
      - source_labels: [queue]
        regex: '(Ethernet[0-9]+(_[0-9]+)?):([0-9]+)'
        replacement: '$3'
        target_label: queue_num

      # ============================================================
      # PFC extraction
      # ============================================================
      - source_labels: [__name__]
        regex: 'sonic_port_SAI_PORT_STAT_PFC_([0-7])_(RX|TX)_PKTS'
        replacement: '${1}'
        target_label: pfc_priority
      - source_labels: [__name__]
        regex: 'sonic_port_SAI_PORT_STAT_PFC_([0-7])_(RX|TX)_PKTS'
        replacement: '${2}'
        target_label: pfc_direction
      - source_labels: [__name__]
        regex: 'sonic_port_SAI_PORT_STAT_PFC_([0-7])_(RX|TX)_PKTS'
        replacement: 'sonic_port_PFC_PKTS'
        target_label: __name__

      # ============================================================
      # ACL rule type mapping
      # ============================================================
      - source_labels: [acl_rule]
        regex: '.+:RULE_10'
        replacement: 'CNP'
        target_label: acl_type
      - source_labels: [acl_rule]
        regex: '.+:RULE_20'
        replacement: 'NAK'
        target_label: acl_type
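To see what the port and PFC rules above do to a single metric name, here is a small Python emulation. It is a sketch for understanding only: Prometheus applies the rules itself, and `relabel` is a hypothetical helper. The regexes are copied from the rules above, and `fullmatch` is used because Prometheus anchors relabel regexes at both ends.

```python
import re

PORT_RE = re.compile(r"sonic_COUNTERS_(oid_0x10[0-9a-f]+)_(.+)")
PFC_RE = re.compile(r"sonic_port_SAI_PORT_STAT_PFC_([0-7])_(RX|TX)_PKTS")


def relabel(metric_name, oid_to_port):
    """Return (new metric name, extra labels) after the port and PFC rules."""
    labels = {}
    m = PORT_RE.fullmatch(metric_name)
    if m:
        oid, suffix = m.group(1), m.group(2)
        labels["port_oid"] = oid
        if oid in oid_to_port:           # the generated OID-to-name mapping
            labels["port"] = oid_to_port[oid]
        metric_name = f"sonic_port_{suffix}"
    m = PFC_RE.fullmatch(metric_name)
    if m:
        labels["pfc_priority"] = m.group(1)
        labels["pfc_direction"] = m.group(2)
        metric_name = "sonic_port_PFC_PKTS"
    return metric_name, labels


mapping = {"oid_0x1000000000001": "Ethernet1"}
print(relabel("sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_IF_IN_OCTETS", mapping))
print(relabel("sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_PFC_3_RX_PKTS", mapping))
```

The order matters: the PFC rules match on the already renamed `sonic_port_...` name, so they have to come after the port rename, just as in the YAML.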

Prometheus Metrics

Metric naming rules

Original metric | Converted metric | Labels
sonic_COUNTERS_oid_0x10xxx_SAI_PORT_STAT_IF_IN_OCTETS | sonic_port_SAI_PORT_STAT_IF_IN_OCTETS | port=Ethernet1
sonic_COUNTERS_oid_0x15xxx_SAI_QUEUE_STAT_BYTES | sonic_queue_SAI_QUEUE_STAT_BYTES | queue=Ethernet1:0, queue_port=Ethernet1, queue_num=0
sonic_COUNTERS_oid_0x1axxx_SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES | sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES | pg=Ethernet1:0, pg_port=Ethernet1, pg_num=0
sonic_COUNTERS_oid_0x9xxx_SAI_ACL_COUNTER_ATTR_PACKETS | sonic_acl_SAI_ACL_COUNTER_ATTR_PACKETS | acl_rule=Ethernet1:RULE_10, acl_port=Ethernet1, acl_type=CNP
sonic_port_SAI_PORT_STAT_PFC_3_RX_PKTS | sonic_port_PFC_PKTS | pfc_priority=3, pfc_direction=RX

Key metrics

Port metrics

Metric | Description
sonic_port_SAI_PORT_STAT_IF_IN_OCTETS | ingress bytes
sonic_port_SAI_PORT_STAT_IF_OUT_OCTETS | egress bytes
sonic_port_SAI_PORT_STAT_IF_IN_UCAST_PKTS | ingress unicast packets
sonic_port_SAI_PORT_STAT_IF_OUT_UCAST_PKTS | egress unicast packets
sonic_port_PFC_PKTS | PFC frames (with pfc_priority and pfc_direction labels)

Queue metrics

Metric | Description
sonic_queue_SAI_QUEUE_STAT_BYTES | queue bytes
sonic_queue_SAI_QUEUE_STAT_PACKETS | queue packets
sonic_queue_SAI_QUEUE_STAT_DROPPED_PACKETS | queue dropped packets
sonic_queue_SAI_QUEUE_STAT_WATERMARK_BYTES | queue watermark (peak occupancy)

PG metrics

Metric | Description
sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES | PG watermark
sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_DROPPED_PACKETS | PG dropped packets
sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES | shared buffer watermark

ACL metrics

Metric | Description
sonic_acl_SAI_ACL_COUNTER_ATTR_PACKETS | packets matched by the ACL
sonic_acl_SAI_ACL_COUNTER_ATTR_BYTES | bytes matched by the ACL

Grafana Dashboard

Recommended panel configuration

Panel 1: Port Traffic IN/OUT

# IN Traffic (bps)
rate(sonic_port_SAI_PORT_STAT_IF_IN_OCTETS{port=~"$port"}[$__rate_interval]) * 8

# OUT Traffic (bps)
rate(sonic_port_SAI_PORT_STAT_IF_OUT_OCTETS{port=~"$port"}[$__rate_interval]) * 8
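As a sanity check on the `rate(...) * 8` expressions above: the counters are in octets (bytes), `rate()` gives bytes per second, and multiplying by 8 converts that to bits per second. The Python below works one example by hand from two counter readings. It is a simplification of what `rate()` does (no extrapolation, only two samples), and the function name is made up for the example.

```python
def bits_per_second(octets_start, octets_end, seconds):
    """Average bit rate between two readings of an octet counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = octets_end - octets_start
    if delta < 0:
        # The counter went backwards, i.e. it was reset; assume it restarted
        # from zero, which is also how Prometheus treats counter resets.
        delta = octets_end
    return delta / seconds * 8


# 1,250,000 bytes in 10 seconds = 125,000 bytes/s = 1,000,000 bit/s (1 Mbps)
print(bits_per_second(1_000_000, 2_250_000, 10))
```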

Panel 2: PFC (pps)

rate(sonic_port_PFC_PKTS{port=~"$port", pfc_priority=~"$priority"}[$__rate_interval])

Panel 3: Queue Watermark (OQ)

sonic_queue_SAI_QUEUE_STAT_WATERMARK_BYTES{queue_port=~"$port", queue_num=~"$queue_num"}

Panel 4: PG Watermark (SQ)

sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES{pg_port=~"$port", pg_num=~"$pg_num"}

Panel 5: Queue Drop (pps)

rate(sonic_queue_SAI_QUEUE_STAT_DROPPED_PACKETS{queue_port=~"$port"}[$__rate_interval])

Panel 6: ACL Counter (pps)

rate(sonic_acl_SAI_ACL_COUNTER_ATTR_PACKETS{acl_port=~"$port"}[$__rate_interval])

# Legend: {{source}} - {{acl_port}} - {{acl_type}}

Dashboard variables

$source: label_values(sonic_port_SAI_PORT_STAT_IF_IN_OCTETS, source)
$port: label_values(sonic_port_SAI_PORT_STAT_IF_IN_OCTETS{source="$source"}, port)
$priority: 0,1,2,3,4,5,6,7
$queue_num: 0,1,2,3,4,5,6,7

Deployment Guide

Requirements

Docker & Docker Compose

# rm -f /etc/apt/keyrings/docker.gpg
# mkdir -p /etc/apt/keyrings
# curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
# echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# apt-get update
# apt-get install docker-compose-plugin


# more /etc/docker/daemon.json 
{
  "registry-mirrors": [
    "https://registry-1.xxxx.com"
  ]
}

# dockerd --validate --config-file /etc/docker/daemon.json
# sudo systemctl daemon-reload
# sudo systemctl restart docker

gnmic CLI (used for OID collection)

curl -LO https://github.com/openconfig/gnmic/releases/download/v0.42.1/gnmic_0.42.1_Linux_x86_64.tar.gz
tar -xzf gnmic_0.42.1_Linux_x86_64.tar.gz
sudo mv gnmic /usr/local/bin/
gnmic version

jq (JSON parsing)

# apt install jq

Network

The server must be able to reach the gNMI port (8080 by default) of every SONiC switch.
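A quick way to check this requirement is a plain TCP connect to the gNMI port. The Python sketch below does only that: it confirms the port accepts connections, not that gNMI authentication or subscriptions work (the `gnmic ... capabilities` command later in this post tests that). The function name is hypothetical.

```python
import socket


def gnmi_port_open(host, port=8080, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and unreachable hosts
        return False
```

For example, `gnmi_port_open("10.22.1.241")` should return True from the telemetry server before the switch is added to switches.conf.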

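Initial deployment

Download and extract

cd /data
tar -xzf sonic-telemetry.tar.gz
cd sonic-telemetry

Edit the switch list

vi config/switches.conf

Edit the global settings (credentials)

vi config/settings.conf

Run the config generation script

./scripts/auto-refresh.sh

Start the services

docker compose up -d

Verify the services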

curl -s http://localhost:9804/metrics | head -10
curl -s "http://localhost:9090/api/v1/query?query=up" | jq .

Open Grafana

# http://<server-ip>:3000 (admin/admin)

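Adding a new switch

Edit the switch list

vi config/switches.conf
# add: switch3:10.26.4.103

Run the script

./scripts/auto-refresh.sh

Restart the services

docker compose restart gnmic prometheus

OIDs changed after a switch reboot

Delete the cache. The cached data is only used for comparing old and new OIDs.

rm -rf /data/sonic-telemetry/cache/*

The script detects the OID changes automatically

./scripts/auto-refresh.sh

If it reports changes, confirm and then restart the services

docker compose restart gnmic prometheus

Adding a new ACL rule mapping

Edit the ACL config

vi config/modules/acl.conf
# add under [rule_mapping]:
# RULE_30=RDMA_READ

Clear the cache and regenerate

rm -rf cache/*
./scripts/auto-refresh.sh -f

Restart the services

docker compose restart gnmic prometheus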

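Operations Guide

Routine maintenance commands

# Service status
docker compose ps

# gnmic logs
docker compose logs gnmic --tail 50

# Prometheus logs
docker compose logs prometheus --tail 50

# Number of lines exported by gnmic
curl -s http://localhost:9804/metrics | wc -l

# Prometheus target status
curl -s "http://localhost:9090/api/v1/targets" | jq '.data.activeTargets[] | {job: .job, health: .health}'

# Script logs
ls -lt logs/ | head -5
cat logs/refresh_*.log | tail -50

Performance tuning

gnmic memory

If the number of OIDs is very large you can raise the sample interval, although I don't recommend it. The default is 1s; if you raise it, the window Grafana uses for its rate calculations must be widened accordingly, otherwise the graphs show obvious spikes.

SAMPLE_INTERVAL="5s"  # changed from 1s to 5s

Prometheus storage

# docker-compose.yml
command:
  - '--storage.tsdb.retention.time=15d'  # shorter retention time
  - '--storage.tsdb.retention.size=10GB'  # cap the storage size

If the data volume gets too large, you can also simply delete the database:

rm -rf /data/sonic-telemetry/data/prometheus/*

Backup and restore

Each time ./scripts/auto-refresh.sh runs and you confirm the config update, it first backs up the old config. The automatic backups live under backup/ and contain only gnmic.yaml and prometheus.yml.

# List backups
ls backup/

# Manual backup
tar -czvf sonic-telemetry-backup-$(date +%Y%m%d).tar.gz \
  config/ cache/ gnmic/ prometheus/ grafana/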
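settings.conf has a BACKUP_RETENTION_COUNT setting (10 by default), which suggests old backup directories get pruned. I have not shown how the script does that, so the Python below is only a sketch of one reasonable approach under that assumption: backup directory names of the form YYYYMMDD_HHMMSS sort chronologically as plain strings, so keeping the newest N is a sort plus a slice. The function name is hypothetical.

```python
import re

STAMP_RE = re.compile(r"\d{8}_\d{6}")  # e.g. 20251230_131716


def backups_to_delete(names, keep=10):
    """Return the backup directory names that fall outside the newest `keep`."""
    stamped = sorted(n for n in names if STAMP_RE.fullmatch(n))
    if keep <= 0:
        return stamped
    # everything except the last `keep` entries (empty if there are fewer)
    return stamped[:-keep]


names = ["20251230_131716", "20251229_090000", "20251231_000000", "notes"]
print(backups_to_delete(names, keep=2))
```

Entries that do not look like a timestamped backup (such as `notes` above) are ignored, so the function never proposes deleting something it does not recognise.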

Troubleshooting

Common problems

Problem 1: gnmic subscription fails

# Symptom
docker compose logs gnmic | grep "error"
# subscription xxx rcv error: NotFound

# Cause: one of the OIDs does not exist

# Fix: clear the cache and collect the OIDs again
rm -rf cache/*
./scripts/auto-refresh.sh -f
docker compose restart gnmic

Problem 2: Prometheus has no data

# Check whether gnmic has data
curl -s http://localhost:9804/metrics | head -10

# Check the Prometheus targets
curl -s "http://localhost:9090/api/v1/targets" | jq .

# Check the relabel rules
grep -A 5 "port_oid" prometheus/prometheus.yml

Problem 3: Grafana shows "No Data"

# Check that the metric exists
curl -s "http://localhost:9090/api/v1/label/__name__/values" | jq '.data[]' | grep sonic

# Check that the labels are correct
curl -s "http://localhost:9090/api/v1/query?query=sonic_port_SAI_PORT_STAT_IF_IN_OCTETS" | jq '.data.result[0].metric'

Problem 4: the script exits with an error

# Run in debug mode
bash -x ./scripts/auto-refresh.sh 2>&1 | head -100

# Check the log
cat logs/refresh_*.log | tail -50

Diagnostic commands

Get the OID mapping on the switch:

$ sonic-db-cli COUNTERS_DB hgetall COUNTERS_PORT_NAME_MAP | xargs -n2 | sort
{CPU: oid:0x1000000000050,
Ethernet10_1: oid:0x1000000000015,
Ethernet10_2: oid:0x1000000000016,
Ethernet11: oid:0x1000000000017,
Ethernet1_1: oid:0x1000000000ae5,
Ethernet12: oid:0x1000000000018,

Troubleshooting commands on the server

# Test the gNMI connection to a single switch
gnmic -a 10.22.1.241:8080 --insecure -u username -p password capabilities

# Get the PORT NAME MAP
gnmic -a 10.22.1.241:8080 --insecure -u username -p password \
  get --path "COUNTERS_PORT_NAME_MAP" --target COUNTERS_DB

# Get the data of a single OID
gnmic -a 10.22.1.241:8080 --insecure -u username -p password \
  get --path "COUNTERS:oid:0x1000000000001" --target COUNTERS_DB

# Check the metric in Prometheus
curl -s "http://localhost:9090/api/v1/query?query=sonic_port_SAI_PORT_STAT_IF_IN_OCTETS" | jq '.data.result | length'

# Check the data per switch
curl -s "http://localhost:9090/api/v1/query?query=sonic_port_SAI_PORT_STAT_IF_IN_OCTETS" | jq '.data.result[].metric.source' | sort | uniq -c

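Extension Examples and Summary

So far the dashboard has only queue drop, with no per-queue traffic counter. The steps below add one. gnmic has in fact been collecting this data all along; it just was not being used: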

root@ubuntu20:/home/unix# docker ps -a
CONTAINER ID   IMAGE                             COMMAND                  CREATED      STATUS        PORTS                                       NAMES
e30834b9804b   grafana/grafana:latest            "/run.sh"                4 days ago   Up 4 days     0.0.0.0:3000->3000/tcp, :::3000->3000/tcp   grafana
57158ba7e4f7   prom/prometheus:latest            "/bin/prometheus --c…"   4 days ago   Up 34 hours   0.0.0.0:9090->9090/tcp, :::9090->9090/tcp   prometheus
d5909eb60755   ghcr.io/openconfig/gnmic:latest   "/app/gnmic subscrib…"   4 days ago   Up 34 hours   0.0.0.0:9804->9804/tcp, :::9804->9804/tcp   gnmic

Confirm what gNMI collected

root@ubuntu20:/home/unix# more /data/sonic-telemetry/cache/switch3/queue_map_old.txt 
Ethernet10_1:0=0x15000000000b75
Ethernet10_1:1=0x15000000000b76
Ethernet10_1:2=0x15000000000b77
Ethernet10_1:3=0x15000000000b78
Ethernet10_1:4=0x15000000000b79
Ethernet10_1:5=0x15000000000b7a
Ethernet10_1:6=0x15000000000b7b
Ethernet10_1:7=0x15000000000b7c
Ethernet10_2:0=0x15000000000b96
Ethernet10_2:1=0x15000000000b97

Confirm that gnmic has collected the data. The raw metric names exported by gnmic have the form sonic_COUNTERS_oid_0x15...

root@ubuntu20:/home/unix# curl -s http://localhost:9804/metrics | grep "0x15" | head -5
# HELP sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_SIZE gNMIc generated metric
# TYPE sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_SIZE untyped
sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_SIZE{source="switch3",subscription_name="queue_counters_switch3",subscription_target="COUNTERS_DB",target="COUNTERS_DB"} 5.5658496e+07
# HELP sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_STAT_CURR_OCCUPANCY_BYTES gNMIc generated metric
# TYPE sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_STAT_CURR_OCCUPANCY_BYTES untyped

Confirm what Prometheus collected

After the Prometheus relabel step, the names start with sonic_queue_

root@ubuntu20:/home/unix# curl -s "http://localhost:9090/api/v1/query?query=sonic_queue_SAI_QUEUE_STAT_BYTES" | jq '.data.result | length'
600
root@ubuntu20:/home/unix# curl -s "http://localhost:9090/api/v1/label/__name__/values" | jq '.data[]' | grep sonic_queue
"sonic_queue_SAI_BUFFER_POOL_SIZE"
"sonic_queue_SAI_BUFFER_POOL_STAT_CURR_OCCUPANCY_BYTES"
"sonic_queue_SAI_BUFFER_PROFILE_DYNAMIC_PARA"
"sonic_queue_SAI_BUFFER_PROFILE_MIN_GUARANTEE"
"sonic_queue_SAI_QUEUE_STAT_BYTES"
"sonic_queue_SAI_QUEUE_STAT_CURR_OCCUPANCY_BYTES"
"sonic_queue_SAI_QUEUE_STAT_CURR_OCCUPANCY_BYTES_T"
"sonic_queue_SAI_QUEUE_STAT_DROPPED_BYTES"
"sonic_queue_SAI_QUEUE_STAT_DROPPED_PACKETS"
"sonic_queue_SAI_QUEUE_STAT_PACKETS"
"sonic_queue_SAI_QUEUE_STAT_PACKETS_T"
"sonic_queue_SAI_QUEUE_STAT_SHARED_CURR_OCCUPANCY_BYTES"
"sonic_queue_SAI_QUEUE_STAT_SHARED_WATERMARK_BYTES"
"sonic_queue_SAI_QUEUE_STAT_TIMESTAMP"
"sonic_queue_SAI_QUEUE_STAT_WATERMARK_BYTES"
"sonic_queue_SAI_QUEUE_STAT_WRED_ECN_MARKED_PACKETS"
"sonic_queue_timestamp"

Update the Grafana dashboard

Add the following two panels after id: 2 (Port Traffic – OUT):

{
      "id": 8,
      "title": "Per Queue Traffic (bps)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
      "fieldConfig": {
        "defaults": {
          "unit": "bps",
          "custom": {"drawStyle": "line", "lineWidth": 1, "fillOpacity": 10, "showPoints": "never"}
        }
      },
      "options": {
        "legend": {"displayMode": "table", "placement": "right", "showLegend": true, "sortBy": "Max", "sortDesc": true, "calcs": ["min", "max", "lastNotNull"]},
        "tooltip": {"mode": "multi", "sort": "desc"}
      },
      "targets": [{"expr": "rate(sonic_queue_SAI_QUEUE_STAT_BYTES{source=~\"$switch\", queue_port=~\"$port\", queue_num=~\"$queue\"}[30s]) * 8", "legendFormat": "{{source}} - {{queue_port}}:{{queue_num}}"}],
      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"}
    },
    {
      "id": 9,
      "title": "Per Queue Packets (pps)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
      "fieldConfig": {
        "defaults": {
          "unit": "pps",
          "custom": {"drawStyle": "line", "lineWidth": 1, "fillOpacity": 10, "showPoints": "never"}
        }
      },
      "options": {
        "legend": {"displayMode": "table", "placement": "right", "showLegend": true, "sortBy": "Max", "sortDesc": true, "calcs": ["min", "max", "lastNotNull"]},
        "tooltip": {"mode": "multi", "sort": "desc"}
      },
      "targets": [{"expr": "rate(sonic_queue_SAI_QUEUE_STAT_PACKETS{source=~\"$switch\", queue_port=~\"$port\", queue_num=~\"$queue\"}[30s])", "legendFormat": "{{source}} - {{queue_port}}:{{queue_num}}"}],
      "datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"}
    },

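The y coordinate of every panel that follows must also be moved down by 16, because two panels with h=8 were added.

[Screenshot: the finished dashboard]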
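If the dashboard JSON is edited by hand, shifting the y coordinates is easy to get wrong. A minimal Python sketch of the shift is below; it assumes the dashboard's panels are a list of dicts that each carry a `gridPos` with a `y` key, as in the two panels above, and `shift_panels_down` is a hypothetical helper. Run it on the existing panels before the two new ones are inserted.

```python
def shift_panels_down(panels, from_y, offset):
    """Move every panel whose gridPos.y is at or below from_y down by offset rows."""
    for panel in panels:
        pos = panel.get("gridPos")
        if pos is not None and pos["y"] >= from_y:
            pos["y"] += offset
    return panels


existing = [
    {"id": 1, "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0}},
    {"id": 2, "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}},
    {"id": 3, "gridPos": {"h": 8, "w": 24, "x": 0, "y": 16}},
]
# The new panels occupy y=16 and y=24, so everything from y=16 moves down by 16.
shift_panels_down(existing, from_y=16, offset=16)
print([p["gridPos"]["y"] for p in existing])
```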

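20260619 – Updated Grafana variables

With no variable filtering, Grafana shows everything it reads from Prometheus. Say there are 56 ports but I only want to watch the few under test. The simplest approach is to pick those ports in the filter and save, but I found that the saved selection sometimes gets lost and the view reverts to all ports. Reselecting is tedious, and the port order is not stable either. Instead, edit the dashboard and add the following regex to the corresponding variable so that only the test ports are offered by default: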

/^(?:Ethernet9|Ethernet10|Ethernet19|Ethernet21|Ethernet22|Ethernet25|Ethernet26|Ethernet27|Ethernet28|Ethernet29|Ethernet30|Ethernet31|Ethernet32|Ethernet34|Ethernet35|Ethernet36)$/
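Before pasting a long alternation like this into Grafana, it can be worth checking that it matches exactly the intended ports. The Python below tests the same pattern (without the surrounding `/.../` delimiters). Grafana evaluates the regex in JavaScript, but for a plain anchored alternation like this one the two regex dialects behave the same; `visible_ports` is a hypothetical helper name.

```python
import re

PORT_FILTER = re.compile(
    r"^(?:Ethernet9|Ethernet10|Ethernet19|Ethernet21|Ethernet22|Ethernet25|"
    r"Ethernet26|Ethernet27|Ethernet28|Ethernet29|Ethernet30|Ethernet31|"
    r"Ethernet32|Ethernet34|Ethernet35|Ethernet36)$"
)


def visible_ports(ports):
    """Return the ports that the variable regex lets through, in input order."""
    return [p for p in ports if PORT_FILTER.search(p)]


print(visible_ports(["Ethernet1", "Ethernet9", "Ethernet90", "Ethernet10", "Ethernet10_1"]))
```

The `^` and `$` anchors are what keep Ethernet90 or Ethernet10_1 from slipping through as partial matches of Ethernet9 or Ethernet10.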

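20260620 – Added Headroom Watermark

First, on the device, look up the OID of a given PG on a given port. Headroom lives in the PG counters, and PG is already being collected, so the data is already there; we just confirm that it has values: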

root@sonic-1:/home# sonic-db-cli COUNTERS_DB HGET COUNTERS_PG_NAME_MAP "Ethernet28:1"
oid:0x1a000000000591

Then list everything stored under that OID:

root@sonic-1:/home# sonic-db-cli COUNTERS_DB HGETALL "COUNTERS:oid:0x1a000000000591"
{'SAI_INGRESS_PRIORITY_GROUP_HEADROOM_SIZE': '258048', 'SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_CURR_OCCUPANCY_BYTES': '330752', 'SAI_INGRESS_PRIORITY_GROUP_STAT_CURR_OCCUPANCY_BYTES': '3842816', 'SAI_INGRESS_PRIORITY_GROUP_STAT_DROPPED_PACKETS': '0', 'SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_CURR_OCCUPANCY_BYTES': '3507712', 'SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_WATERMARK_BYTES': '331008', 'SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES': '3607808', 'SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES': '3943168'}

Then, in Grafana, duplicate the SQ Watermark panel and change only its query:

sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_WATERMARK_BYTES{source=~"$switch", pg_port=~"$port", pg_num=~"$queue"}

20260620 – Added Buffer Pool

Buffer pools are listed in COUNTERS_BUFFER_POOL_NAME_MAP:

root@sonic-1:/home# sonic-db-cli COUNTERS_DB HGETALL COUNTERS_BUFFER_POOL_NAME_MAP
{'egress': 'oid:0x180000000007b0', 'ingress': 'oid:0x180000000007b1'}

sonic-db-cli COUNTERS_DB HGETALL "COUNTERS:oid:0x180000000007b0" | grep WATERMARK
root@sonic-1:/home# sonic-db-cli COUNTERS_DB HGETALL "COUNTERS:oid:0x180000000007b0"
{'SAI_BUFFER_POOL_STAT_CURR_OCCUPANCY_BYTES': '99319808', 'SAI_BUFFER_POOL_STAT_WATERMARK_BYTES': '102473728', 'SAI_BUFFER_POOL_STAT_XOFF_ROOM_CURR_OCCUPANCY_BYTES': '0', 'SAI_BUFFER_POOL_STAT_XOFF_ROOM_WATERMARK_BYTES': '0'}
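The `sonic-db-cli ... HGETALL` output shown above is printed as a Python-style dict literal, which makes it easy to parse safely with `ast.literal_eval`. The sketch below assumes the output always has that shape (other SONiC versions may print it differently); `parse_hgetall` and `pool_oids` are hypothetical names.

```python
import ast


def parse_hgetall(output):
    """Parse the dict-style text printed by `sonic-db-cli <DB> HGETALL <key>`."""
    data = ast.literal_eval(output.strip())
    if not isinstance(data, dict):
        raise ValueError("expected a dict literal")
    return data


def pool_oids(name_map_output):
    """Map pool name to its bare OID (without the 'oid:' prefix)."""
    result = {}
    for name, oid in parse_hgetall(name_map_output).items():
        result[name] = oid[4:] if oid.startswith("oid:") else oid
    return result


name_map = "{'egress': 'oid:0x180000000007b0', 'ingress': 'oid:0x180000000007b1'}"
counters = ("{'SAI_BUFFER_POOL_STAT_CURR_OCCUPANCY_BYTES': '99319808', "
            "'SAI_BUFFER_POOL_STAT_WATERMARK_BYTES': '102473728'}")

print(pool_oids(name_map))
print(int(parse_hgetall(counters)["SAI_BUFFER_POOL_STAT_WATERMARK_BYTES"]))
```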

由于脚本已经自动适配不同模块的Counter,因此只要在module 增加 Buffer Pool 的 config 即可:

root@tme91:/opt/sonic-telemetry/config/modules# more buffer_pool.conf 
# Buffer Pool Counter module config

enabled=true

# The name map is fixed
name_map_path="COUNTERS_BUFFER_POOL_NAME_MAP"

# Metric prefix
prometheus_prefix="sonic_buffer_pool"

# Label name (the name map keys are ingress/egress and are used directly as the label value)
label_name="pool"

# Only ingress/egress exist, so no filtering
name_filter=""

# The key itself is the pool name; nothing to extract from Ethernet:x
extract_labels=""

# Sample interval: aligned with the POLL_INTERVAL of BUFFER_POOL_WATERMARK (see step 2 below)
sample_interval="10s"
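
For readers who want to inspect a module file outside the deployment scripts, here is a minimal sketch of a reader for this `key="value"` format. It assumes the file contains only comments, blank lines, and simple assignments like the ones above; the actual scripts in the repository may load these files differently (for example by sourcing them as shell), and `parse_module_conf` and `interval_seconds` are hypothetical names.

```python
import shlex


def parse_module_conf(text: str) -> dict:
    """Read simple key=value lines, ignoring comments and blank lines."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # shlex.split removes the surrounding quotes; an empty "" yields [''].
        parts = shlex.split(value)
        conf[key.strip()] = parts[0] if parts else ""
    return conf


def interval_seconds(value: str) -> int:
    """Convert an interval such as '10s' to seconds (seconds suffix only)."""
    if not value.endswith("s"):
        raise ValueError("only intervals like '10s' are handled in this sketch")
    return int(value[:-1])


CONF_TEXT = '''
# Buffer Pool Counter module config
enabled=true
name_map_path="COUNTERS_BUFFER_POOL_NAME_MAP"
prometheus_prefix="sonic_buffer_pool"
label_name="pool"
name_filter=""
sample_interval="10s"
'''

conf = parse_module_conf(CONF_TEXT)
print(conf["prometheus_prefix"], interval_seconds(conf["sample_interval"]))
```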

Then use it directly in Grafana, with the Legend set to {{source}} – {{pool}}:

sonic_buffer_pool_SAI_BUFFER_POOL_STAT_WATERMARK_BYTES{source=~"$switch"}

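2026-01-31 update: Reworked ACL logic

The previous logic did not generalize well: it depended too heavily on the scenario, and maintaining the mapping had its limits. It was reworked as follows: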

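  • Use gnmic to fetch the real attributes (stage, ports) of each ACL table from CONFIG_DB, instead of relying on table naming conventions
  • Parse the stage and ports of every table
  • Use the CONFIG_DB data when generating the Prometheus relabel rules
  • [rule_mapping] and extract_labels in acl.conf are no longer needed and can be deleted
  • Grafana changes:
    • The existing $port variable can be used directly to filter ACLs
    • $stage: select ingress or egress
    • $acl_table: select a specific ACL table
    • Rule names are no longer mapped up front; Grafana Transformations convert them on the existing data instead
  • On every run the script:
    • Fetches the latest ACL table information from CONFIG_DB
    • Compares it with the cache (whether stage or ports changed)
    • Regenerates prometheus.yml if anything changed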

Database      Table                   Content retrieved
COUNTERS_DB   ACL_COUNTER_RULE_MAP    {table}:{rule} → OID
CONFIG_DB     ACL_TABLE               stage, ports@
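
The "compare with the cache, regenerate only on change" step can be sketched as follows. This is a minimal illustration, not the repository's implementation: the data shape (`{table: {"stage": ..., "ports": [...]}}`) and the names `acl_fingerprint` and `needs_regeneration` are assumptions made for the example.

```python
import json
from typing import Optional


def acl_fingerprint(acl_tables: dict) -> str:
    """Stable JSON built only from the fields that affect the relabel rules."""
    reduced = {
        name: {
            "stage": attrs.get("stage", ""),
            # Sort ports so that a reordered list is not seen as a change.
            "ports": sorted(attrs.get("ports", [])),
        }
        for name, attrs in acl_tables.items()
    }
    return json.dumps(reduced, sort_keys=True)


def needs_regeneration(current: dict, cached: Optional[dict]) -> bool:
    """True when prometheus.yml should be regenerated."""
    if cached is None:  # first run, nothing cached yet
        return True
    return acl_fingerprint(current) != acl_fingerprint(cached)


cached = {"Ethernet33ECN": {"stage": "egress", "ports": ["Ethernet33"]}}
same = {"Ethernet33ECN": {"ports": ["Ethernet33"], "stage": "egress"}}
moved = {"Ethernet33ECN": {"stage": "egress", "ports": ["Ethernet34"]}}

print(needs_regeneration(same, cached), needs_regeneration(moved, cached))
```

Comparing a reduced fingerprint rather than the raw CONFIG_DB entry keeps unrelated attribute changes from triggering a rewrite of prometheus.yml.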

Final label format:

sonic_acl_SAI_ACL_COUNTER_ATTR_PACKETS{
  source="switch1",
  acl_table="Ethernet33ECN",    # ACL table name
  acl_stage="egress",           # from CONFIG_DB
  acl_port="Ethernet33",        # from CONFIG_DB
  acl_seq="10"                  # extracted from the rule name
}

Prometheus relabel rules:

# 1. Extract acl_table
- source_labels: [acl_rule]
  regex: '(.+):RULE_[0-9]+'
  replacement: '${1}'
  target_label: acl_table

# 2. Extract acl_seq
- source_labels: [acl_rule]
  regex: '.+:RULE_([0-9]+)'
  replacement: '${1}'
  target_label: acl_seq

# 3. ACL table → stage mapping (generated from CONFIG_DB)
- source_labels: [acl_table]
  regex: 'Ethernet33'
  replacement: 'ingress'
  target_label: acl_stage
- source_labels: [acl_table]
  regex: 'Ethernet33ECN'
  replacement: 'egress'
  target_label: acl_stage
# ... other tables

# 4. ACL table → port mapping (generated from CONFIG_DB)
- source_labels: [acl_table]
  regex: 'Ethernet33'
  replacement: 'Ethernet33'
  target_label: acl_port
- source_labels: [acl_table]
  regex: 'Ethernet33ECN'
  replacement: 'Ethernet33'
  target_label: acl_port
# ... other tables
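
Rules 3 and 4 above are mechanical, so they lend themselves to generation. Below is a minimal sketch of how such rules could be produced from ACL table attributes; it is not the repository's generator. The input shape, the function names `build_acl_relabel_rules` and `render_rules`, and joining multiple ports with a comma are all assumptions for illustration (the example above only shows one port per table).

```python
import re


def build_acl_relabel_rules(acl_tables: dict) -> list:
    """Return relabel rules in the same order as the listing above."""
    rules = [
        {"source_labels": ["acl_rule"], "regex": "(.+):RULE_[0-9]+",
         "replacement": "${1}", "target_label": "acl_table"},
        {"source_labels": ["acl_rule"], "regex": ".+:RULE_([0-9]+)",
         "replacement": "${1}", "target_label": "acl_seq"},
    ]
    names = sorted(acl_tables)
    for name in names:  # table -> stage
        rules.append({"source_labels": ["acl_table"], "regex": re.escape(name),
                      "replacement": acl_tables[name]["stage"],
                      "target_label": "acl_stage"})
    for name in names:  # table -> port(s); comma-joining is an assumption
        rules.append({"source_labels": ["acl_table"], "regex": re.escape(name),
                      "replacement": ",".join(acl_tables[name]["ports"]),
                      "target_label": "acl_port"})
    return rules


def render_rules(rules: list) -> str:
    """Render rules as YAML text; assumes values contain no single quotes."""
    lines = []
    for rule in rules:
        lines.append("- source_labels: [%s]" % ", ".join(rule["source_labels"]))
        lines.append("  regex: '%s'" % rule["regex"])
        lines.append("  replacement: '%s'" % rule["replacement"])
        lines.append("  target_label: %s" % rule["target_label"])
    return "\n".join(lines)


tables = {
    "Ethernet33": {"stage": "ingress", "ports": ["Ethernet33"]},
    "Ethernet33ECN": {"stage": "egress", "ports": ["Ethernet33"]},
}
rules = build_acl_relabel_rules(tables)
text = render_rules(rules)
print(text)
```

Prometheus anchors relabel regexes at both ends, so the `Ethernet33` rule does not also match `Ethernet33ECN`; escaping the table name only guards against names that contain regex metacharacters.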

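2026-03-13 update: Fixing OID pollution

In this setup gnmic did not honor subscriptions bound at the target level, so every subscription was applied to every target, and identical OIDs on different devices polluted each other's data. In the output below, data with source dut1 should come only from dut1's subscription, yet it also carries dut2's: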

curl -s http://localhost:9804/metrics | grep 'source=' | head -3
......
sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_PFC_1_RX_PKTS
  {source="dut1", subscription_name="port_counters_dut1"} 0
sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_PFC_1_RX_PKTS
  {source="dut1", subscription_name="port_counters_dut2"} 0   ← polluted
sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_PFC_1_RX_PKTS
  {source="dut2", subscription_name="port_counters_dut1"} 0   ← polluted

Because the gnmic data is polluted, Prometheus also ends up with cross-device series, so Grafana gets duplicates when it queries the database:

curl -G 'http://localhost:9090/api/v1/series'   --data-urlencode 'match[]={source="dut1"}' | python3 -m json.tool | grep -A 30 "Ethernet19"
......
        {
            "__name__": "sonic_port_SAI_PORT_STAT_IF_IN_OCTETS",
            "instance": "gnmic:9804",
            "job": "gnmic",
            "port": "Ethernet19",
            "port_oid": "oid_0x1000000000013",
            "source": "dut1",
            "subscription_name": "port_counters_dut1",
            "subscription_target": "COUNTERS_DB",
            "target": "COUNTERS_DB"
        },
        {
            "__name__": "sonic_port_SAI_PORT_STAT_IF_IN_OCTETS",
            "instance": "gnmic:9804",
            "job": "gnmic",
            "port": "Ethernet19",
            "port_oid": "oid_0x1000000000013",
            "source": "dut1",
            "subscription_name": "port_counters_dut2",
            "subscription_target": "COUNTERS_DB",
            "target": "COUNTERS_DB"
        },
        {
            "__name__": "sonic_port_SAI_PORT_STAT_IF_IN_PKTS_BYTE",
            "instance": "gnmic:9804",
            "job": "gnmic",
            "port": "Ethernet19",
            "port_oid": "oid_0x1000000000013",
            "source": "dut1",
            "subscription_name": "port_counters_dut1",
            "subscription_target": "COUNTERS_DB",
            "target": "COUNTERS_DB"
        },
        {
            "__name__": "sonic_port_SAI_PORT_STAT_IF_IN_PKTS_BYTE",
            "instance": "gnmic:9804",
            "job": "gnmic",
            "port": "Ethernet19",
            "port_oid": "oid_0x1000000000013",
            "source": "dut1",
            "subscription_name": "port_counters_dut2",
            "subscription_target": "COUNTERS_DB",
            "target": "COUNTERS_DB"
        }
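
Cross-device series like these can be spotted programmatically. The sketch below is a minimal illustration that assumes subscription names follow the `<module>_<source>` pattern seen in the output above (compared case-insensitively); `is_polluted` and `split_series` are hypothetical names, and the input is the `data` list returned by the `/api/v1/series` call.

```python
def is_polluted(series: dict) -> bool:
    """A series is polluted when its subscription does not belong to its source."""
    source = series.get("source", "").lower()
    subscription = series.get("subscription_name", "").lower()
    return not subscription.endswith("_" + source)


def split_series(series_list: list) -> tuple:
    """Return (clean, polluted) lists, preserving the input order."""
    clean = [s for s in series_list if not is_polluted(s)]
    polluted = [s for s in series_list if is_polluted(s)]
    return clean, polluted


# Two entries taken from the output above (labels trimmed for brevity).
data = [
    {"__name__": "sonic_port_SAI_PORT_STAT_IF_IN_OCTETS", "port": "Ethernet19",
     "source": "dut1", "subscription_name": "port_counters_dut1"},
    {"__name__": "sonic_port_SAI_PORT_STAT_IF_IN_OCTETS", "port": "Ethernet19",
     "source": "dut1", "subscription_name": "port_counters_dut2"},
]
clean, polluted = split_series(data)
print(len(clean), len(polluted))
```

A non-empty `polluted` list is a quick signal that subscriptions are still leaking across targets.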

The gnmic target config does bind subscriptions per DUT, but so far that does not appear to take effect:

targets:
  DUT1:
    subscriptions:
      - port_counters_DUT1
  DUT2:
    subscriptions:
      - port_counters_DUT2

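After discussion, since this is only a testbed, the final decision was to run a separate gnmic instance per device, which rules out the pollution entirely.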

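2026-04-21 update: Garbled telemetry data

Symptoms

The RDMA test script had to be replaced. When the new script drives traffic for a long time, the data becomes garbled and telemetry collection gets very unstable. After the test was stopped and the switch traffic was confirmed to be back at 0, the per queue traffic graph still showed a long tail of traffic, even though several earlier 10-minute tests had been fine.

Once the problem had appeared, another 10-minute traffic run showed something interesting: per queue traffic only started to accumulate after the traffic had stopped, as if the per-queue data were being delayed.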

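Root cause

Investigation showed that gnmic had sample-interval set to 1s for 448 queue OIDs, while the switch polls QUEUE_STAT every 5s. The gNMI server had to produce a subscribe response for 448 OIDs every second, could not keep up under heavy load, and the data backed up. Once the traffic stopped and the pressure dropped, the backlog drained and the data finally appeared. Aligning the gnmic sample interval with the switch polling interval resolved the problem.

The polling intervals of the metrics used in the current tests are listed below. The gnmic sample interval should be ≥ the SONiC polling interval; otherwise gnmic triggers pointless duplicate pushes and adds load on the gNMI server: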

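Counter type      SONiC polling interval   gnmic sample interval   Notes
PORT_STAT         1s                       1s                      Few OIDs (~56); 1s is no problem
ACL               1s                       1s                      Few OIDs (depends on the config; few in the test scenario); 1s is no problem
QUEUE_STAT        5s                       5s                      Many OIDs (~448, 56*8); the intervals must match, or use a longer interval, or give the server more resources
QUEUE_WATERMARK   5s                       5s                      Same as above
QUEUE_DROP        5s                       5s                      Same as above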
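
The effect of the mismatch can be estimated with simple arithmetic. The sketch below is a rough model only: it counts per-OID samples as a proxy for gNMI server load and ignores how many fields each OID carries. `CounterGroup`, `samples_per_second`, and `redundant_fraction` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CounterGroup:
    name: str
    oid_count: int         # number of subscribed OIDs
    sonic_poll_s: float    # SONiC counter polling interval, in seconds
    gnmic_sample_s: float  # gnmic sample-interval, in seconds


def samples_per_second(group: CounterGroup) -> float:
    """Per-OID samples the gNMI server must answer each second."""
    return group.oid_count / group.gnmic_sample_s


def redundant_fraction(group: CounterGroup) -> float:
    """Share of samples that cannot contain new data."""
    if group.gnmic_sample_s >= group.sonic_poll_s:
        return 0.0
    return 1.0 - group.gnmic_sample_s / group.sonic_poll_s


before = CounterGroup("QUEUE_STAT", 448, sonic_poll_s=5, gnmic_sample_s=1)
after = CounterGroup("QUEUE_STAT", 448, sonic_poll_s=5, gnmic_sample_s=5)

for group in (before, after):
    print(group.gnmic_sample_s, samples_per_second(group), redundant_fraction(group))
```

With the original 1s setting the server answers 448 per-OID samples per second, of which about 80% repeat an unchanged value; at 5s this drops to 89.6 per second with no redundant samples.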

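Troubleshooting procedure

While at it, here is the troubleshooting approach for the whole telemetry solution. The telemetry data path has five layers, and the problem can be in any of them. Start from the data source and confirm layer by layer, using real data to locate the fault: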

┌───────────────────────────────────────────────┐
|Switch ASIC → COUNTERS_DB (redis) → gNMI Server| → gnmic → Prometheus → Grafana
|     ↑              ↑                   ↑      |     ↑          ↑          ↑
|  counter       counter DB          telemetry  | container    scrape     query/
|  interval    polling-writing       container  | collection   (pull)     relabel
└───────────────────────────────────────────────┘

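Layer 1: Switch CLI. Confirm the data source, including whether the port traffic status is updating and how often the counters refresh. If the CLI shows traffic, the data source is fine; move to the next layer. If the CLI shows no traffic, the problem is in the network or the traffic itself, not in telemetry.

Layer 2: COUNTERS_DB (Redis). Confirm that the counters are updating. If the Redis values change, COUNTERS_DB is fine and the problem is in gNMI or higher up. If they do not change, the counter polling or syncd has a problem: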

# Note: the COUNTERS_DB Redis port differs between SONiC versions; confirm it first
# Look up the port OID mapping
redis-cli -n 2 -p xxxx HGETALL COUNTERS_PORT_NAME_MAP | grep -A1 "Ethernet25"

# Look up the queue OID mapping
redis-cli -n 2 -p xxxx HGETALL COUNTERS_QUEUE_NAME_MAP | grep -A1 "Ethernet25:1"

# Check whether the port counter is updating (two reads, 2 seconds apart)
redis-cli -n 2 -p xxxx hget "COUNTERS:oid:<OID>" SAI_PORT_STAT_IF_IN_OCTETS
sleep 2
redis-cli -n 2 -p xxxx hget "COUNTERS:oid:<OID>" SAI_PORT_STAT_IF_IN_OCTETS

# Check whether the queue counter is updating (5-second spacing, matching the QUEUE_STAT polling interval)
for i in 1 2 3 4 5 6; do
  echo "=== $i ($(date +%H:%M:%S)) ==="
  redis-cli -n 2 -p xxxx hget "COUNTERS:oid:<QUEUE_OID>" SAI_QUEUE_STAT_BYTES
  sleep 5
done

# Check queue drops
redis-cli -n 2 -p xxxx hget "COUNTERS:oid:<QUEUE_OID>" SAI_QUEUE_STAT_DROPPED_PACKETS

Layer 3: gNMI Server. Check the telemetry container:

# Check telemetry container status and resources
docker ps -a | grep telemetry
docker stats --no-stream $(docker ps -q -f name=telemetry)
docker inspect --format='{{.RestartCount}} restarts' $(docker ps -q -f name=telemetry)

# Check telemetry container logs
docker logs --tail 50 $(docker ps -q -f name=telemetry)

# Check active gNMI connections
docker exec $(docker ps -q -f name=telemetry) ss -tnp | grep 8080

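Layer 4: gnmic. Check the collector side. If the gnmic process is alive and the port is listening but metrics returns 0 entries, the prometheus output has died; restart the gnmic container or rerun the config script: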

# Confirm the processes in the gnmic container
docker exec gnmic-dut1 ps aux

# Check gnmic logs (nothing goes to stdout by default; the --log flag is required)
docker logs gnmic-dut1 2>&1 | tail -50

# If the log is empty, run it manually with logging (binding the port will fail, but the connection info is visible)
docker exec gnmic-dut1 /app/gnmic --config /app/gnmic.yaml subscribe --log

# Confirm whether the metrics endpoint has data
curl -s http://localhost:9804/metrics | wc -l
curl -s http://localhost:9804/metrics | grep -v "^#" | wc -l

# Check whether the counter of a specific OID is changing (run while traffic is flowing)
for i in 1 2 3 4 5 6; do
  echo "=== $i ($(date +%H:%M:%S)) ==="
  curl -s http://localhost:9804/metrics | grep "oid_<OID>_SAI_PORT_STAT_IF_IN_OCTETS" | grep -v "^#"
  sleep 5
done

# List the OID prefixes present in metrics
curl -s http://localhost:9804/metrics | grep -oP "oid_0x[0-9a-f]+" | sort -u | head -20

# Check whether there is any IN_OCTETS data
curl -s http://localhost:9804/metrics | grep "IN_OCTETS" | head -10
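
Instead of eyeballing repeated curl output, two snapshots of the metrics endpoint can be diffed. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text format; the parser is deliberately simplified (it splits at the last `}` and ignores an optional trailing timestamp), the function names are hypothetical, and the sample values are made up for illustration.

```python
def parse_metrics(text: str) -> dict:
    """Map 'name{labels}' to its float value, skipping comments and blanks."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        end = line.rfind("}")
        if end != -1:
            series, rest = line[: end + 1], line[end + 1:].split()
        else:  # metric without labels
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if not rest:
            continue
        try:
            samples[series] = float(rest[0])  # rest[1], if present, is a timestamp
        except ValueError:
            continue
    return samples


def changed_series(before: dict, after: dict) -> list:
    """Series present in both snapshots whose value differs."""
    return sorted(k for k, v in after.items() if k in before and before[k] != v)


# Hypothetical snapshots taken a few seconds apart while traffic is running.
snap1 = '''# TYPE example untyped
sonic_COUNTERS_oid_0x1000000000019_SAI_PORT_STAT_IF_IN_OCTETS{source="dut1"} 100
sonic_COUNTERS_oid_0x1000000000019_SAI_PORT_STAT_IF_IN_DISCARDS{source="dut1"} 0
'''
snap2 = '''sonic_COUNTERS_oid_0x1000000000019_SAI_PORT_STAT_IF_IN_OCTETS{source="dut1"} 250 1700000000000
sonic_COUNTERS_oid_0x1000000000019_SAI_PORT_STAT_IF_IN_DISCARDS{source="dut1"} 0
'''
changed = changed_series(parse_metrics(snap1), parse_metrics(snap2))
print(changed)
```

If traffic is flowing and `changed` stays empty across several snapshots, the stall is at gnmic or below rather than in Prometheus.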

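Layer 5: Prometheus. Confirm the scrape. If the raw metric has values but the relabeled one does not, the relabel rules in prometheus.yml are at fault. If the raw metric is missing too, the problem is in the scrape config or on the gnmic side:

# Check target status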
curl -s 'http://localhost:9090/api/v1/targets' | python3 -c "
import json,sys
d=json.load(sys.stdin)
for t in d['data']['activeTargets']:
    if 'gnmic' in t.get('scrapeUrl',''):
        print(t['scrapeUrl'], t['health'], t['lastError'])
"

# Query the raw metric (before relabeling)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sonic_COUNTERS_oid_0x1000000000019_SAI_PORT_STAT_IF_IN_OCTETS' \
  | python3 -m json.tool | head -20

# Query the relabeled metric (the one Grafana uses)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sonic_port_SAI_PORT_STAT_IF_IN_OCTETS{port_oid="oid_0x1000000000019"}' \
  | python3 -m json.tool

# List all sonic metric names in Prometheus
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | python3 -c "
import json,sys
for n in json.load(sys.stdin)['data']:
    if 'sonic' in n.lower(): print(n)
" | head -20
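
The layer-by-layer order described in this section can be condensed into a small triage helper. This is only a sketch of the decision flow under the assumption that each layer's check has already been reduced to a pass/fail result by hand; `first_failing_layer` and `diagnose_prometheus` are hypothetical names.

```python
LAYERS = ["Switch CLI", "COUNTERS_DB", "gNMI server", "gnmic", "Prometheus"]


def first_failing_layer(results: dict):
    """Walk from the data source upward and return the first layer that failed.

    A layer missing from `results` is treated as failed, so unchecked layers
    are never silently skipped. Returns None when every layer passed.
    """
    for layer in LAYERS:
        if not results.get(layer, False):
            return layer
    return None


def diagnose_prometheus(raw_present: bool, relabeled_present: bool) -> str:
    """Apply the layer-5 rule: raw vs relabeled metric presence."""
    if raw_present and not relabeled_present:
        return "relabel rules in prometheus.yml"
    if not raw_present:
        return "scrape config or gnmic side"
    return "ok"


checks = {"Switch CLI": True, "COUNTERS_DB": True, "gNMI server": True,
          "gnmic": False, "Prometheus": False}
print(first_failing_layer(checks))
print(diagnose_prometheus(raw_present=True, relabeled_present=False))
```

When every layer passes and the dashboard is still wrong, the remaining suspect is the Grafana query itself.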
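
This article is from Frank's Blog.

Article: Sonic Telemetry Deployment
Copyright notice: this is an original article and reflects personal views only. Copyright belongs to Frank Zhao; when reposting, please credit the source and include a link to the article.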