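SONiC Telemetry Deployment
When I was a TME at my previous employer, I wrote up IOS XR telemetry (for example, Telemetry Solution Demo with IOS XR) and also looked into Telemetry Receiver by UDP+KV-GPB. I had never done the same for SONiC switches, for two reasons: SONiC is built on a Redis database and its OpenConfig support is poor, and the open-source Cisco pipeline I used before has not been updated for a long time and appears to be abandoned. In 2024 I put together SONiC + Telemetry with Telegraf as the middleware, and for flexibility I ran Telegraf as a container on SONiC itself. That approach turned out to be cumbersome in practice; deploying Telegraf directly on a server would have been simpler. I was busy at the time and never wrote it up, and the details have since faded.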

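This time I needed visualization again, so I reworked the architecture. Instead of Telegraf + InfluxData it uses gnmic + Prometheus. gnmic is maintained under the OpenConfig project and is updated frequently on GitHub. The code is at: https://github.com/yongpro/sonic-telemetry
I have also added Server Telemetry, which extends this architecture; see the write-up Server Telemetry Deployment.
Note: this solution keeps being iterated on and optimized, so the original text may not be corrected to follow every change; it is mainly a summary for myself. If you use it as a reference, read the whole document first. I also recommend deploying the telemetry platform on the local LAN, so that transport latency does not distort the data.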
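Architecture Overview
Features
This system monitors key network metrics of SONiC switches in an RDMA environment. It supports:
- Port Traffic: per-port traffic statistics (IN/OUT)
- PFC: Priority Flow Control statistics
- Queue Watermark (OQ): egress queue watermark
- PG Watermark (SQ): ingress priority group watermark
- Queue Drop: per-queue drop statistics
- ACL Counter: ACL rule match counters (CNP, NAK, etc.)
System Architecture
As far as I have confirmed, SONiC currently exposes three interfaces for retrieving data:
- Native SONiC OIDs. Everything in the DBs is reachable, but VIDs and RIDs can change, so this is not suitable as the gNMI interface in production (with ten thousand or more devices that all differ, the subscription data becomes too large to maintain). Its advantage is that it does not care whether the upper-layer OS (A-OS, B-OS, C-OS) is the same; the data can be fetched the same way, which makes it a good fit for LAB environments;
- OpenConfig YANG. Stock SONiC lacks support for many OC models, so you have to fill in this abstraction layer yourself;
- SONiC YANG, the native SONiC YANG models. Support is also incomplete and much of it has to be extended by the user.
Note gnmic's subscription behavior: if any OID in a subscription's path list does not exist, the whole subscription fails. I did not dig into whether this can be solved and simply worked around it: using the first interface above, a separate subscription configuration is generated for each switch so that every OID is valid. If a device reboots or a configuration change alters the OIDs, the OIDs have to be refreshed.
The architecture currently supports both port-name forms, Ethernet1 and Ethernet1_1.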
┌────────────────────────────────────────────────────────────────────┐
│ SONiC Switches │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Switch1 │ │ Switch2 │ │ Switch3 │ ... │ SwitchN │ │
│ │(SONiC A) │ │(SONiC B) │ │(SONiC A) │ │(SONiC C) │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │gNMI │gNMI │gNMI │gNMI │
└───────┼────────────┼────────────┼──────────────────┼───────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌────────────────────────────────────────────────────────────────────┐
│ gnmic │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ port_switch1 │ │ port_switch2 │ │ port_switch3 │ ... │
│ │ queue_switch1 │ │ queue_switch2 │ │ queue_switch3 │ │
│ │ pg_switch1 │ │ pg_switch2 │ │ pg_switch3 │ │
│ │ acl_switch1 │ │ acl_switch2 │ │ acl_switch3 │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ Each switch subscribes independently │
└──────────────────────────────┬─────────────────────────────────────┘
│ :9804/metrics
▼
┌────────────────────────────────────────────────────────────────────┐
│ Prometheus │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ metric_relabel_configs │ │
│ │ sonic_COUNTERS_oid_0x10xxx → sonic_port_xxx + port tag │ │
│ │ sonic_COUNTERS_oid_0x15xxx → sonic_queue_xxx + queue tag │ │
│ │ sonic_COUNTERS_oid_0x1axxx → sonic_pg_xxx + pg tag │ │
│ │ sonic_COUNTERS_oid_0x9xxxx → sonic_acl_xxx + acl_rule tag │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────┬─────────────────────────────────────┘
│ :9090
▼
┌────────────────────────────────────────────────────────────────────┐
│ Grafana │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Port │ │ PFC │ │ OQ │ │ SQ │ │ ACL │ │
│ │ Traffic │ │ Stats │ │Watermark│ │Watermark│ │ Counter │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└────────────────────────────────────────────────────────────────────┘
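The per-switch subscription idea can be sketched in a few lines of Python. This is only an illustration of the approach, not the code in auto-refresh.sh: the function name build_subscription and the sample name map are assumptions, while the subscription fields mirror the generated gnmic.yaml shown later.

```python
import re

def build_subscription(switch, module, name_map, name_filter, interval="1s"):
    """Build one gnmic subscription entry for a single switch from its NAME_MAP."""
    pattern = re.compile(name_filter) if name_filter else None
    paths = [
        f"COUNTERS:{oid}"
        for name, oid in sorted(name_map.items())
        if pattern is None or pattern.search(name)
    ]
    return {
        f"{module}_counters_{switch}": {
            "paths": paths,
            "target": "COUNTERS_DB",
            "mode": "stream",
            "stream-mode": "sample",
            "sample-interval": interval,
        }
    }

# Hypothetical NAME_MAP content, shaped like COUNTERS_PORT_NAME_MAP.
name_map = {
    "CPU": "oid:0x1000000000050",
    "Ethernet10_1": "oid:0x1000000000015",
    "Ethernet11": "oid:0x1000000000017",
}
sub = build_subscription("switch1", "port", name_map, r"^Ethernet[0-9]+(_[0-9]+)?$")
print(sub["port_counters_switch1"]["paths"])
```

Because every switch gets its own entry built only from OIDs it actually reported, a missing OID on one switch cannot break the subscription of another.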
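Directory Structure
/data/sonic-telemetry/
├── config/ # configuration directory
│ ├── switches.conf # switch list (the main file to maintain)
│ ├── settings.conf # global settings (credentials, ports, etc.)
│ └── modules/ # counter module configuration
│ ├── port.conf # Port counter configuration
│ ├── queue.conf # Queue counter configuration
│ ├── pg.conf # PG counter configuration
│ └── acl.conf # ACL counter configuration
├── scripts/
│ └── auto-refresh.sh # automatic config generation script
├── cache/ # OID cache (auto-generated)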
│ ├── switch1/
│ │ ├── port_oids_old.txt
│ │ ├── queue_oids_old.txt
│ │ ├── pg_oids_old.txt
│ │ └── acl_oids_old.txt
│ ├── switch2/
│ │ └── ...
│ ├── all_port_map_old.txt # merged OID mappings
│ ├── all_queue_map_old.txt
│ ├── all_pg_map_old.txt
│ ├── all_acl_map_old.txt
│ └── switches_old.txt
├── backup/ # config backups (auto-generated)
│ └── 20251230_131716/
│ ├── gnmic.yaml
│ └── prometheus.yml
├── logs/ # logs (auto-generated)
│ └── refresh_20251230_131716.log
├── gnmic/
│ └── gnmic.yaml # gnmic config (auto-generated)
├── prometheus/
│ └── prometheus.yml # Prometheus config (auto-generated)
├── grafana/
│ └── provisioning/
│ ├── datasources/
│ │ └── prometheus.yml
│ └── dashboards/
├── docker-compose.yml
└── README.md
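Configuration Files
switches.conf – Switch list
# SONiC Telemetry - switch list
#
# Format 1 (global credentials): name:IP
# Format 2 (per-switch credentials): name:IP:username:password
#
# The name is used as the Prometheus source label
# Using global credentials (GNMI_USER/GNMI_PASS in settings.conf)
switch1:10.22.1.241
switch2:10.26.4.102
# Using per-switch credentials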
switch3:10.26.4.103:admin:password123
switch4:10.26.4.104:operator:secret456
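A minimal sketch of how the two line formats could be parsed, assuming IPv4 addresses (an IPv6 address would collide with the colon separator). The function name parse_switch_line is hypothetical and this is not the parser used by auto-refresh.sh.

```python
def parse_switch_line(line, default_user, default_pass):
    """Parse 'name:ip' or 'name:ip:user:password'; return None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(":", 3)  # at most 4 fields, so the password may contain ':'
    if len(parts) == 2:
        name, ip = parts
        user, password = default_user, default_pass
    elif len(parts) == 4:
        name, ip, user, password = parts
    else:
        raise ValueError(f"unrecognized switch entry: {line!r}")
    return {"name": name, "ip": ip, "user": user, "password": password}

print(parse_switch_line("switch1:10.22.1.241", "gnmi", "secret"))
print(parse_switch_line("switch3:10.26.4.103:admin:password123", "gnmi", "secret"))
```

Entries with exactly three fields are rejected on purpose, since a username without a password is most likely a typo.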
settings.conf – Global settings
# SONiC Telemetry - global settings
# gNMI connection settings
GNMI_USER="xxxx" # default username
GNMI_PASS="xxxx" # default password
GNMI_PORT="8080" # gNMI port
GNMI_TIMEOUT="10" # connection timeout (seconds)
# Sampling interval
SAMPLE_INTERVAL="1s" # data collection interval
# Prometheus settings
PROMETHEUS_PORT="9090"
GNMIC_METRICS_PORT="9804"
# Grafana settings
GRAFANA_PORT="3000"
# Log retention in days
LOG_RETENTION_DAYS="30"
# Number of backups to keep
BACKUP_RETENTION_COUNT="10"
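Module Configuration Files
port.conf – Port traffic
# Port counter module configuration
# Whether this module is enabled
enabled=true
# NAME_MAP path in SONiC (fixed)
name_map_path="COUNTERS_PORT_NAME_MAP"
# Prometheus metric prefix
prometheus_prefix="sonic_port"
# Name of the generated label
label_name="port"
# Name filter (regular expression)
# Keep only Ethernet ports; exclude the CPU port
# Matches every form, e.g. Ethernet56 and Ethernet10_1, Ethernet10_2
name_filter="^Ethernet[0-9]+(_[0-9]+)?$"
# Sub-labels to extract from the name (none for ports)
extract_labels=""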
queue.conf – Queue statistics
# Queue counter module configuration
# Whether this module is enabled
enabled=true
# NAME_MAP path in SONiC (fixed)
name_map_path="COUNTERS_QUEUE_NAME_MAP"
# Prometheus metric prefix
prometheus_prefix="sonic_queue"
# Name of the generated label
label_name="queue"
# Name filter (regular expression)
# Keep only UC queues (0-7); exclude MC queues (8-15) and CPU queues
# Matches every form, e.g. Ethernet56:3 and Ethernet10_1:3
name_filter="^Ethernet[0-9]+(_[0-9]+)?:[0-7]$"
# Sub-labels to extract from the name
# Format: label:regex:replacement (separate multiple entries with semicolons)
# From "Ethernet10:3", extract queue_port=Ethernet10, queue_num=3
# $1 is the port and $3 is the queue number ($2 is the _1 part)
extract_labels="queue_port:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$1;queue_num:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$3"
pg.conf – Priority groups
# Priority Group (PG/SQ) counter module configuration
# Whether this module is enabled
enabled=true
# NAME_MAP path in SONiC (fixed)
name_map_path="COUNTERS_PG_NAME_MAP"
# Prometheus metric prefix
prometheus_prefix="sonic_pg"
# Name of the generated label
label_name="pg"
# Name filter (regular expression)
# Keep only PG 0-7
# Matches every form, e.g. Ethernet56:3 and Ethernet10_1:3
name_filter="^Ethernet[0-9]+(_[0-9]+)?:[0-7]$"
# Sub-labels to extract from the name
# From "Ethernet10:3", extract pg_port=Ethernet10, pg_num=3
# $1 is the port and $3 is the PG number ($2 is the _1 part)
extract_labels="pg_port:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$1;pg_num:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$3"
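The label:regex:replacement format is slightly tricky because the regex itself contains a colon. The sketch below shows one way to read it, splitting the label at the first colon and the replacement at the last; it assumes each replacement is a single group reference such as $1 or $3, and the helper names are hypothetical rather than taken from the script.

```python
import re

def parse_extract_labels(spec):
    """Split 'label:regex:replacement;...' into (label, compiled regex, replacement)."""
    rules = []
    for item in filter(None, spec.split(";")):
        label, rest = item.split(":", 1)           # label ends at the first ':'
        regex, replacement = rest.rsplit(":", 1)   # replacement starts after the last ':'
        rules.append((label, re.compile(regex), replacement))
    return rules

def apply_rules(rules, name):
    """Return the sub-labels extracted from one NAME_MAP key."""
    labels = {}
    for label, regex, replacement in rules:
        match = regex.fullmatch(name)
        if match:
            # "$3" in the config refers to capture group 3 of the match.
            labels[label] = match.group(int(replacement.lstrip("$")))
    return labels

spec = ("pg_port:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$1;"
        "pg_num:(Ethernet[0-9]+(_[0-9]+)?):([0-9]+):$3")
rules = parse_extract_labels(spec)
print(apply_rules(rules, "Ethernet10_1:3"))
print(apply_rules(rules, "Ethernet56:3"))
```

Both name forms yield the port in group 1 and the number in group 3, which is why the config uses $3 rather than $2.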
acl.conf – ACL counters
# ACL counter module configuration
# Whether this module is enabled
enabled=true
# NAME_MAP path in SONiC (fixed)
name_map_path="ACL_COUNTER_RULE_MAP"
# Prometheus metric prefix
prometheus_prefix="sonic_acl"
# Name of the generated label
label_name="acl_rule"
# Name filter (regular expression; empty means no filtering)
name_filter=""
# Sub-labels to extract from the name
# From "Ethernet10:RULE_10", extract acl_port=Ethernet10
# Uses a non-capturing group
extract_labels="acl_port:(Ethernet[0-9]+(?:_[0-9]+)?):(.+):$1"
# ============================================================
# ACL rule name mapping (maps RULE_XX to a readable name)
# Format: original_rule_name=display_name
# ============================================================
[rule_mapping]
RULE_10=CNP
RULE_20=NAK
# Add more mappings:
# RULE_30=RDMA_READ
# RULE_40=RDMA_WRITE
# DROP_RULE=DROP
Automation Script in Detail
auto-refresh.sh Workflow
The workflow diagram is kept in English, because mixing Chinese and English breaks the alignment:
┌─────────────────────────────────────────────────────────────────┐
│ auto-refresh.sh │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ 1. Load │ │ switches.conf │ │ settings.conf │
│ Config │◄───│ modules/*.conf│◄───│ │
└───────┬───────┘ └───────────────┘ └───────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ 2. Check Switch Connectivity │
│ - Test gNMI port connection │
│ - Mark online/offline status │
└───────┬───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ 3. Collect OIDs (per switch) │
│ - Get NAME_MAP from each online switch │
│ - Apply name filters (e.g. exclude CPU ports) │
│ - Save to cache/{switch_name}/{module}_oids_new.txt │
│ - Merge mappings to cache/all_{module}_map_new.txt │
└───────┬───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ 4. Compare Differences │
│ - Compare old/new OID counts │
│ - Compare old/new switch lists │
│ - Display change summary │
└───────┬───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ 5. Generate Config (if changed) │
│ - Backup old config to backup/ │
│ - Generate gnmic/gnmic.yaml │
│ - Generate prometheus/prometheus.yml │
│ - Update cache (_new.txt → _old.txt) │
└───────────────────────────────────────────────────────────────┘
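Step 4 of the workflow can be illustrated with a small Python sketch. Note one deliberate difference, stated as an assumption: the diagram says the script compares OID counts, whereas this sketch compares OID sets, which also catches the case where the count is unchanged but the OIDs themselves differ (for example after a reboot). The function name summarize_changes is hypothetical.

```python
def summarize_changes(old, new):
    """Compare cached and freshly collected OIDs; both map module name -> set of OIDs."""
    summary = {}
    for module in sorted(set(old) | set(new)):
        before, after = old.get(module, set()), new.get(module, set())
        summary[module] = {
            "old": len(before),
            "new": len(after),
            "added": len(after - before),
            "removed": len(before - after),
        }
    changed = any(s["added"] or s["removed"] for s in summary.values())
    return summary, changed

old = {"port": {"0x1000000000001", "0x1000000000002"}}
new = {"port": {"0x1000000000001", "0x1000000000ae5"}, "acl": {"0x9000000000a10"}}
summary, changed = summarize_changes(old, new)
for module, stats in summary.items():
    print(module, stats)
print("regenerate config:", changed)
```

Only when something changed would the configs be backed up and regenerated, matching step 5.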
Command-Line Options
./scripts/auto-refresh.sh [options]
Options:
-h, --help show help
-v, --verbose verbose output
-f, --force force the update (no confirmation prompt)
Sample Output
╔════════════════════════════════════════════════════════════╗
║ SONiC Telemetry Config Refresh Tool v2 ║
║ (per-switch independent subscription mode) ║
╚════════════════════════════════════════════════════════════╝
Time: 2025-12-30 13:26:41
Log: /data/sonic-telemetry/logs/refresh_20251230_132641.log
════════════════════════════════════════════════════════════
1/4 Checking switch connectivity
════════════════════════════════════════════════════════════
✓ switch1 (10.22.1.241) - online
✓ switch2 (10.26.4.102) - online
Online: 2, offline: 0
════════════════════════════════════════════════════════════
2/4 Collecting OIDs
════════════════════════════════════════════════════════════
Enabled modules: acl pg port queue
[switch1]
acl: 42 OIDs
pg: 448 OIDs
port: 56 OIDs
queue: 448 OIDs
[switch2]
acl: 42 OIDs
pg: 448 OIDs
port: 56 OIDs
queue: 448 OIDs
════════════════════════════════════════════════════════════
3/4 Comparing differences
════════════════════════════════════════════════════════════
╔══════════════════════════════════════════════════╗
║ Configuration change summary ║
╠══════════════════════════════════════════════════╣
║ Switches: 0 → 2 (+2) ║
║ acl OID: 0 → 84 (+84) ║
║ pg OID: 0 → 896 (+896) ║
║ port OID: 0 → 112 (+112) ║
║ queue OID: 0 → 896 (+896) ║
╚══════════════════════════════════════════════════╝
Configuration changes detected
════════════════════════════════════════════════════════════
4/4 Generating configuration
════════════════════════════════════════════════════════════
✓ gnmic config generated
✓ prometheus config generated
════════════════════════════════════════════════════════════
Configuration update complete!
════════════════════════════════════════════════════════════
Run the following command to apply the configuration:
cd /data/sonic-telemetry && docker compose restart gnmic prometheus
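Generated Configuration Files
gnmic.yaml structure
# SONiC Telemetry gnmic configuration
# Architecture: independent subscriptions per switch, with optional per-switch credentials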
insecure: true
encoding: json_ietf
log: true
targets:
switch1:
address: 10.22.1.241:8080
username: username
password: password
subscriptions:
- port_counters_switch1
- queue_counters_switch1
- pg_counters_switch1
- acl_counters_switch1
switch2:
address: 10.26.4.102:8080
username: admin # per-switch credentials
password: password123
subscriptions:
- port_counters_switch2
- queue_counters_switch2
- pg_counters_switch2
- acl_counters_switch2
subscriptions:
port_counters_switch1:
paths:
- "COUNTERS:oid:0x1000000000001"
- "COUNTERS:oid:0x1000000000002"
# ... all port OIDs of switch1
target: COUNTERS_DB
mode: stream
stream-mode: sample
sample-interval: 1s
port_counters_switch2:
paths:
- "COUNTERS:oid:0x1000000000001"
- "COUNTERS:oid:0x1000000000002"
# ... all port OIDs of switch2 (may differ from switch1)
target: COUNTERS_DB
mode: stream
stream-mode: sample
sample-interval: 1s
# ... subscriptions for the other modules
outputs:
prometheus:
type: prometheus
listen: :9804
path: /metrics
metric-prefix: sonic
append-subscription-name: false
export-timestamps: true
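Note the "export-timestamps: true" setting. By default, when a scraped sample carries no timestamp, Prometheus stamps it with the time of the scrape. With export-timestamps: true, gnmic attaches its own timestamp and Prometheus honors it. If the OID data cached in gnmic is updated infrequently, or the timestamp gnmic emits is already older than Prometheus's lookback delta (5 minutes by default), Prometheus treats the sample as stale and queries return 0 results. If you are only testing, you can simply remove this line.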
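The effect of the lookback delta can be shown with a simplified model. This sketch only checks whether an exported timestamp still falls inside the lookback window of an instant query; it ignores staleness markers, and the exact boundary handling differs between Prometheus versions. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Default value of Prometheus's --query.lookback-delta flag.
LOOKBACK_DELTA = timedelta(minutes=5)

def visible_in_instant_query(sample_ts, query_ts, lookback=LOOKBACK_DELTA):
    """Simplified check: is a sample with this exported timestamp still returned?"""
    age = query_ts - sample_ts
    return timedelta(0) <= age <= lookback

now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(visible_in_instant_query(now - timedelta(seconds=10), now))  # fresh sample
print(visible_in_instant_query(now - timedelta(minutes=6), now))   # older than 5 minutes
```

Without export-timestamps the sample age is always close to zero at scrape time, which is why removing the setting makes the data reappear.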
prometheus.yml relabel rules
scrape_configs:
- job_name: 'gnmic'
static_configs:
- targets: ['gnmic:9804']
metric_relabel_configs:
# ============================================================
# port (OID prefix: 0x10)
# ============================================================
# Extract the OID and rename the metric
- source_labels: [__name__]
regex: 'sonic_COUNTERS_(oid_0x10[0-9a-f]+)_(.+)'
replacement: '${1}'
target_label: port_oid
- source_labels: [__name__]
regex: 'sonic_COUNTERS_(oid_0x10[0-9a-f]+)_(.+)'
replacement: 'sonic_port_${2}'
target_label: __name__
# OID-to-name mapping
- source_labels: [port_oid]
regex: 'oid_0x1000000000001'
replacement: 'Ethernet1'
target_label: port
- source_labels: [port_oid]
regex: 'oid_0x1000000000002'
replacement: 'Ethernet2'
target_label: port
# ... more mappings
# ============================================================
# queue (OID prefix: 0x15)
# ============================================================
- source_labels: [__name__]
regex: 'sonic_COUNTERS_(oid_0x15[0-9a-f]+)_(.+)'
replacement: '${1}'
target_label: queue_oid
- source_labels: [__name__]
regex: 'sonic_COUNTERS_(oid_0x15[0-9a-f]+)_(.+)'
replacement: 'sonic_queue_${2}'
target_label: __name__
# OID-to-name mapping
- source_labels: [queue_oid]
regex: 'oid_0x1500000000006d'
replacement: 'Ethernet1:0'
target_label: queue
# ... more mappings
# Extract sub-labels (the optional _N suffix covers names such as Ethernet10_1)
- source_labels: [queue]
regex: '(Ethernet[0-9]+(?:_[0-9]+)?):([0-9]+)'
replacement: '$1'
target_label: queue_port
- source_labels: [queue]
regex: '(Ethernet[0-9]+(?:_[0-9]+)?):([0-9]+)'
replacement: '$2'
target_label: queue_num
# ============================================================
# PFC extraction
# ============================================================
- source_labels: [__name__]
regex: 'sonic_port_SAI_PORT_STAT_PFC_([0-7])_(RX|TX)_PKTS'
replacement: '${1}'
target_label: pfc_priority
- source_labels: [__name__]
regex: 'sonic_port_SAI_PORT_STAT_PFC_([0-7])_(RX|TX)_PKTS'
replacement: '${2}'
target_label: pfc_direction
- source_labels: [__name__]
regex: 'sonic_port_SAI_PORT_STAT_PFC_([0-7])_(RX|TX)_PKTS'
replacement: 'sonic_port_PFC_PKTS'
target_label: __name__
# ============================================================
# ACL rule type mapping
# ============================================================
- source_labels: [acl_rule]
regex: '.+:RULE_10'
replacement: 'CNP'
target_label: acl_type
- source_labels: [acl_rule]
regex: '.+:RULE_20'
replacement: 'NAK'
target_label: acl_type
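To see what the port and PFC rules above do to a single metric, here is a small Python simulation. It is not how Prometheus is implemented; it only applies the same regular expressions in the same order (Prometheus anchors relabel regexes, hence fullmatch). The function name relabel is hypothetical.

```python
import re

def relabel(metric_name):
    """Apply the port rename and the PFC extraction to one metric name."""
    labels = {}
    m = re.fullmatch(r"sonic_COUNTERS_(oid_0x10[0-9a-f]+)_(.+)", metric_name)
    if m:
        labels["port_oid"] = m.group(1)
        metric_name = f"sonic_port_{m.group(2)}"
    m = re.fullmatch(r"sonic_port_SAI_PORT_STAT_PFC_([0-7])_(RX|TX)_PKTS", metric_name)
    if m:
        labels["pfc_priority"] = m.group(1)
        labels["pfc_direction"] = m.group(2)
        metric_name = "sonic_port_PFC_PKTS"
    return metric_name, labels

print(relabel("sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_IF_IN_OCTETS"))
print(relabel("sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_PFC_3_RX_PKTS"))
```

The order matters: the PFC rules match on the already renamed sonic_port_ metric, so they must come after the port rename.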
Prometheus Metrics
Metric naming rules
| Original metric | Renamed metric | Labels |
| sonic_COUNTERS_oid_0x10xxx_SAI_PORT_STAT_IF_IN_OCTETS | sonic_port_SAI_PORT_STAT_IF_IN_OCTETS | port=Ethernet1 |
| sonic_COUNTERS_oid_0x15xxx_SAI_QUEUE_STAT_BYTES | sonic_queue_SAI_QUEUE_STAT_BYTES | queue=Ethernet1:0, queue_port=Ethernet1, queue_num=0 |
| sonic_COUNTERS_oid_0x1axxx_SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES | sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES | pg=Ethernet1:0, pg_port=Ethernet1, pg_num=0 |
| sonic_COUNTERS_oid_0x9xxx_SAI_ACL_COUNTER_ATTR_PACKETS | sonic_acl_SAI_ACL_COUNTER_ATTR_PACKETS | acl_rule=Ethernet1:RULE_10, acl_port=Ethernet1, acl_type=CNP |
| sonic_port_SAI_PORT_STAT_PFC_3_RX_PKTS | sonic_port_PFC_PKTS | pfc_priority=3, pfc_direction=RX |
Key metrics
Port metrics
| Metric | Description |
| sonic_port_SAI_PORT_STAT_IF_IN_OCTETS | Ingress bytes |
| sonic_port_SAI_PORT_STAT_IF_OUT_OCTETS | Egress bytes |
| sonic_port_SAI_PORT_STAT_IF_IN_UCAST_PKTS | Ingress unicast packets |
| sonic_port_SAI_PORT_STAT_IF_OUT_UCAST_PKTS | Egress unicast packets |
| sonic_port_PFC_PKTS | PFC frames (with pfc_priority and pfc_direction labels) |
Queue metrics
| Metric | Description |
| sonic_queue_SAI_QUEUE_STAT_BYTES | Queue bytes |
| sonic_queue_SAI_QUEUE_STAT_PACKETS | Queue packets |
| sonic_queue_SAI_QUEUE_STAT_DROPPED_PACKETS | Queue dropped packets |
| sonic_queue_SAI_QUEUE_STAT_WATERMARK_BYTES | Queue watermark (peak occupancy) |
PG metrics
| Metric | Description |
| sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES | PG watermark |
| sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_DROPPED_PACKETS | PG dropped packets |
| sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES | Shared buffer watermark |
ACL metrics
| Metric | Description |
| sonic_acl_SAI_ACL_COUNTER_ATTR_PACKETS | Packets matched by the ACL |
| sonic_acl_SAI_ACL_COUNTER_ATTR_BYTES | Bytes matched by the ACL |
Grafana Dashboard
Recommended panel configuration
Panel 1: Port Traffic IN/OUT
# IN Traffic (bps)
rate(sonic_port_SAI_PORT_STAT_IF_IN_OCTETS{port=~"$port"}[$__rate_interval]) * 8
# OUT Traffic (bps)
rate(sonic_port_SAI_PORT_STAT_IF_OUT_OCTETS{port=~"$port"}[$__rate_interval]) * 8
Panel 2: PFC (pps)
rate(sonic_port_PFC_PKTS{port=~"$port", pfc_priority=~"$priority"}[$__rate_interval])
Panel 3: Queue Watermark (OQ)
sonic_queue_SAI_QUEUE_STAT_WATERMARK_BYTES{queue_port=~"$port", queue_num=~"$queue_num"}
Panel 4: PG Watermark (SQ)
sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES{pg_port=~"$port", pg_num=~"$pg_num"}
Panel 5: Queue Drop (pps)
rate(sonic_queue_SAI_QUEUE_STAT_DROPPED_PACKETS{queue_port=~"$port"}[$__rate_interval])
Panel 6: ACL Counter (pps)
rate(sonic_acl_SAI_ACL_COUNTER_ATTR_PACKETS{acl_port=~"$port"}[$__rate_interval])
# Legend: {{source}} - {{acl_port}} - {{acl_type}}
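As a worked example of what the rate(...) * 8 panels compute, the sketch below derives bits per second from two samples of an octet counter. It is a simplification: PromQL's rate() also handles counter resets and extrapolates to the window boundaries, which this helper (a hypothetical name) does not.

```python
def rate_bps(samples):
    """samples: list of (unix_seconds, octet_counter) pairs ordered by time."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    bytes_per_second = (v1 - v0) / (t1 - t0)
    return bytes_per_second * 8  # octets -> bits

# 1.25 GB transferred in 10 seconds is 1 Gbit/s.
print(rate_bps([(0, 0), (10, 1_250_000_000)]))
```

The window must contain at least two samples, which is why the rate window has to grow when the gnmic sample interval grows.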
Dashboard variables
$source: label_values(sonic_port_SAI_PORT_STAT_IF_IN_OCTETS, source)
$port: label_values(sonic_port_SAI_PORT_STAT_IF_IN_OCTETS{source="$source"}, port)
$priority: 0,1,2,3,4,5,6,7
$queue_num: 0,1,2,3,4,5,6,7
Deployment Guide
Requirements
Docker & Docker Compose
# rm -f /etc/apt/keyrings/docker.gpg
# mkdir -p /etc/apt/keyrings
# curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
# echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# apt-get update
# apt-get install docker-compose-plugin
# more /etc/docker/daemon.json
{
"registry-mirrors": [
"https://registry-1.xxxx.com"
]
}
# dockerd --validate --config-file /etc/docker/daemon.json
# sudo systemctl daemon-reload
# sudo systemctl restart docker
gnmic CLI (used for OID collection)
curl -LO https://github.com/openconfig/gnmic/releases/download/v0.42.1/gnmic_0.42.1_Linux_x86_64.tar.gz
tar -xzf gnmic_0.42.1_Linux_x86_64.tar.gz
sudo mv gnmic /usr/local/bin/
gnmic version
jq (JSON parsing)
# apt install jq
Network
The server must be able to reach the gNMI port (8080 by default) of every SONiC switch.
Initial Deployment
Download and extract
cd /data
tar -xzf sonic-telemetry.tar.gz
cd sonic-telemetry
Edit the switch list
vi config/switches.conf
Edit the global settings (credentials)
vi config/settings.conf
Run the config generation script
./scripts/auto-refresh.sh
Start the services
docker compose up -d
Verify the services
curl -s http://localhost:9804/metrics | head -10
curl -s "http://localhost:9090/api/v1/query?query=up" | jq .
Open Grafana
# http://<server-ip>:3000 (admin/admin)
Adding a New Switch
Edit the switch list
vi config/switches.conf
# add: switch3:10.26.4.103
Run the script
./scripts/auto-refresh.sh
Restart the services
docker compose restart gnmic prometheus
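OID Changes After a Switch Reboot
Delete the cached data (it is only used for comparison)
rm -rf /data/sonic-telemetry/cache/*
The script detects OID changes automatically
./scripts/auto-refresh.sh
If changes are reported, confirm and then restart the services
docker compose restart gnmic prometheus
Adding a New ACL Rule Mapping
Edit the ACL configuration
# Under [rule_mapping], add:
# RULE_30=RDMA_READ
Clear the cache and regenerate
rm -rf cache/*
./scripts/auto-refresh.sh -f
Restart the services
docker compose restart gnmic prometheus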
Operations Guide
Routine maintenance commands
# Check service status
docker compose ps
# View gnmic logs
docker compose logs gnmic --tail 50
# View Prometheus logs
docker compose logs prometheus --tail 50
# Count the lines of gnmic metrics output
curl -s http://localhost:9804/metrics | wc -l
# Check Prometheus target status
curl -s "http://localhost:9090/api/v1/targets" | jq '.data.activeTargets[] | {job: .job, health: .health}'
# View script logs
ls -lt logs/ | head -5
cat logs/refresh_*.log | tail -50
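Performance Tuning
gnmic memory
If the number of OIDs is very large you can increase the sample interval, though I do not recommend it. The default is 1s; if you raise it, the rate window used in Grafana must be widened accordingly, otherwise the graphs show obvious spikes.
SAMPLE_INTERVAL="5s" # changed from 1s to 5s
Prometheus storage
# docker-compose.yml
command:
- '--storage.tsdb.retention.time=15d' # shorter retention
- '--storage.tsdb.retention.size=10GB' # cap the storage size
If the data grows too large, you can also delete the database outright:
rm -rf /data/sonic-telemetry/data/prometheus/*
Backup and Restore
Each time ./scripts/auto-refresh.sh runs and you confirm the update, the old configuration is backed up first. The automatic backups are in backup/ and contain only gnmic.yaml and prometheus.yml.
# List backups
ls backup/
# Manual backup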
tar -czvf sonic-telemetry-backup-$(date +%Y%m%d).tar.gz \
config/ cache/ gnmic/ prometheus/ grafana/
Troubleshooting
Common problems
Problem 1: gnmic subscription fails
# Symptom
docker compose logs gnmic | grep "error"
# subscription xxx rcv error: NotFound
# Cause: one of the OIDs does not exist
# Fix: clear the cache and collect the OIDs again
rm -rf cache/*
./scripts/auto-refresh.sh -f
docker compose restart gnmic
Problem 2: Prometheus has no data
# Check whether gnmic has data
curl -s http://localhost:9804/metrics | head -10
# Check the Prometheus targets
curl -s "http://localhost:9090/api/v1/targets" | jq .
# Check the relabel rules
grep -A 5 "port_oid" prometheus/prometheus.yml
Problem 3: Grafana shows No Data
# Check that the metric exists
curl -s "http://localhost:9090/api/v1/label/__name__/values" | jq '.data[]' | grep sonic
# Check that the labels are correct
curl -s "http://localhost:9090/api/v1/query?query=sonic_port_SAI_PORT_STAT_IF_IN_OCTETS" | jq '.data.result[0].metric'
Problem 4: the script exits with an error
# Run in debug mode
bash -x ./scripts/auto-refresh.sh 2>&1 | head -100
# View the logs
cat logs/refresh_*.log | tail -50
Diagnostic commands
Get the OID mapping on the switch:
$ sonic-db-cli COUNTERS_DB hgetall COUNTERS_PORT_NAME_MAP | xargs -n2 | sort
{CPU: oid:0x1000000000050,
Ethernet10_1: oid:0x1000000000015,
Ethernet10_2: oid:0x1000000000016,
Ethernet11: oid:0x1000000000017,
Ethernet1_1: oid:0x1000000000ae5,
Ethernet12: oid:0x1000000000018,
Troubleshooting commands on the server
# Test the gNMI connection to a single switch
gnmic -a 10.22.1.241:8080 --insecure -u username -p password capabilities
# Get the PORT NAME MAP
gnmic -a 10.22.1.241:8080 --insecure -u username -p password \
get --path "COUNTERS_PORT_NAME_MAP" --target COUNTERS_DB
# Get the data of a single OID
gnmic -a 10.22.1.241:8080 --insecure -u username -p password \
get --path "COUNTERS:oid:0x1000000000001" --target COUNTERS_DB
# Check the metric in Prometheus
curl -s "http://localhost:9090/api/v1/query?query=sonic_port_SAI_PORT_STAT_IF_IN_OCTETS" | jq '.data.result | length'
# Check the data per switch
curl -s "http://localhost:9090/api/v1/query?query=sonic_port_SAI_PORT_STAT_IF_IN_OCTETS" | jq '.data.result[].metric.source' | sort | uniq -c
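Extension Examples and Summary
So far the dashboard only has queue drop, not queue counters. The steps below add queue counters; gnmic was in fact already collecting this data, it just was not being used: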
root@ubuntu20:/home/unix# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e30834b9804b grafana/grafana:latest "/run.sh" 4 days ago Up 4 days 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp grafana
57158ba7e4f7 prom/prometheus:latest "/bin/prometheus --c…" 4 days ago Up 34 hours 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp prometheus
d5909eb60755 ghcr.io/openconfig/gnmic:latest "/app/gnmic subscrib…" 4 days ago Up 34 hours 0.0.0.0:9804->9804/tcp, :::9804->9804/tcp gnmic
Confirm what gNMI collects
root@ubuntu20:/home/unix# more /data/sonic-telemetry/cache/switch3/queue_map_old.txt
Ethernet10_1:0=0x15000000000b75
Ethernet10_1:1=0x15000000000b76
Ethernet10_1:2=0x15000000000b77
Ethernet10_1:3=0x15000000000b78
Ethernet10_1:4=0x15000000000b79
Ethernet10_1:5=0x15000000000b7a
Ethernet10_1:6=0x15000000000b7b
Ethernet10_1:7=0x15000000000b7c
Ethernet10_2:0=0x15000000000b96
Ethernet10_2:1=0x15000000000b97
Confirm that gnmic is collecting the data. The raw gnmic output format is sonic_COUNTERS_oid_0x15...:
root@ubuntu20:/home/unix# curl -s http://localhost:9804/metrics | grep "0x15" | head -5
# HELP sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_SIZE gNMIc generated metric
# TYPE sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_SIZE untyped
sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_SIZE{source="switch3",subscription_name="queue_counters_switch3",subscription_target="COUNTERS_DB",target="COUNTERS_DB"} 5.5658496e+07
# HELP sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_STAT_CURR_OCCUPANCY_BYTES gNMIc generated metric
# TYPE sonic_COUNTERS_oid_0x1500000000009d_SAI_BUFFER_POOL_STAT_CURR_OCCUPANCY_BYTES untyped
Confirm what Prometheus collects
After the Prometheus relabel, the names become sonic_queue_:
root@ubuntu20:/home/unix# curl -s "http://localhost:9090/api/v1/query?query=sonic_queue_SAI_QUEUE_STAT_BYTES" | jq '.data.result | length'
600
root@ubuntu20:/home/unix# curl -s "http://localhost:9090/api/v1/label/__name__/values" | jq '.data[]' | grep sonic_queue
"sonic_queue_SAI_BUFFER_POOL_SIZE"
"sonic_queue_SAI_BUFFER_POOL_STAT_CURR_OCCUPANCY_BYTES"
"sonic_queue_SAI_BUFFER_PROFILE_DYNAMIC_PARA"
"sonic_queue_SAI_BUFFER_PROFILE_MIN_GUARANTEE"
"sonic_queue_SAI_QUEUE_STAT_BYTES"
"sonic_queue_SAI_QUEUE_STAT_CURR_OCCUPANCY_BYTES"
"sonic_queue_SAI_QUEUE_STAT_CURR_OCCUPANCY_BYTES_T"
"sonic_queue_SAI_QUEUE_STAT_DROPPED_BYTES"
"sonic_queue_SAI_QUEUE_STAT_DROPPED_PACKETS"
"sonic_queue_SAI_QUEUE_STAT_PACKETS"
"sonic_queue_SAI_QUEUE_STAT_PACKETS_T"
"sonic_queue_SAI_QUEUE_STAT_SHARED_CURR_OCCUPANCY_BYTES"
"sonic_queue_SAI_QUEUE_STAT_SHARED_WATERMARK_BYTES"
"sonic_queue_SAI_QUEUE_STAT_TIMESTAMP"
"sonic_queue_SAI_QUEUE_STAT_WATERMARK_BYTES"
"sonic_queue_SAI_QUEUE_STAT_WRED_ECN_MARKED_PACKETS"
"sonic_queue_timestamp"
Update the Grafana Dashboard
Add the following two panels after id: 2 (Port Traffic – OUT):
{
"id": 8,
"title": "Per Queue Traffic (bps)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
"fieldConfig": {
"defaults": {
"unit": "bps",
"custom": {"drawStyle": "line", "lineWidth": 1, "fillOpacity": 10, "showPoints": "never"}
}
},
"options": {
"legend": {"displayMode": "table", "placement": "right", "showLegend": true, "sortBy": "Max", "sortDesc": true, "calcs": ["min", "max", "lastNotNull"]},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"targets": [{"expr": "rate(sonic_queue_SAI_QUEUE_STAT_BYTES{source=~\"$switch\", queue_port=~\"$port\", queue_num=~\"$queue\"}[30s]) * 8", "legendFormat": "{{source}} - {{queue_port}}:{{queue_num}}"}],
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"}
},
{
"id": 9,
"title": "Per Queue Packets (pps)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 24},
"fieldConfig": {
"defaults": {
"unit": "pps",
"custom": {"drawStyle": "line", "lineWidth": 1, "fillOpacity": 10, "showPoints": "never"}
}
},
"options": {
"legend": {"displayMode": "table", "placement": "right", "showLegend": true, "sortBy": "Max", "sortDesc": true, "calcs": ["min", "max", "lastNotNull"]},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"targets": [{"expr": "rate(sonic_queue_SAI_QUEUE_STAT_PACKETS{source=~\"$switch\", queue_port=~\"$port\", queue_num=~\"$queue\"}[30s])", "legendFormat": "{{source}} - {{queue_port}}:{{queue_num}}"}],
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"}
},
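The y coordinates of all the panels that follow must also be shifted down by 16, because two panels with h=8 were added.

20260619 – Update Grafana variables
Without any variable constraints, Grafana shows everything it reads from Prometheus. Say there are 56 ports but I only want to watch the few under test. The simplest way is to pick those ports in the filter and save, but I found the saved selection can get lost and revert to all ports; re-selecting is tedious and the port order is not fixed. Instead, edit the variable and add the following regex so that it filters to the test ports by default:
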
/^(?:Ethernet9|Ethernet10|Ethernet19|Ethernet21|Ethernet22|Ethernet25|Ethernet26|Ethernet27|Ethernet28|Ethernet29|Ethernet30|Ethernet31|Ethernet32|Ethernet34|Ethernet35|Ethernet36)$/
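Typing such a regex by hand is error-prone, so it can be generated from a port list. A minimal sketch; the helper name grafana_port_filter is hypothetical and the output follows the /^(?:...)$/ form used above.

```python
import re

def grafana_port_filter(ports):
    """Build a Grafana variable regex that keeps exactly the given port names."""
    alternatives = "|".join(re.escape(port) for port in ports)
    return f"/^(?:{alternatives})$/"

print(grafana_port_filter(["Ethernet9", "Ethernet10", "Ethernet19"]))
```

The ^ and $ anchors matter here: without them, Ethernet1 would also match Ethernet10 through Ethernet19.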
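20260620 – Add Headroom Watermark
First, on the device, confirm the OID for a specific port and queue (PG). Headroom lives in the PG counters, which were already being collected, so the data is already there; first check that it has values: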
root@sonic-1:/home# sonic-db-cli COUNTERS_DB HGET COUNTERS_PG_NAME_MAP "Ethernet28:1"
oid:0x1a000000000591
Use this OID to check the fields attached to it:
root@sonic-1:/home# sonic-db-cli COUNTERS_DB HGETALL "COUNTERS:oid:0x1a000000000591"
{'SAI_INGRESS_PRIORITY_GROUP_HEADROOM_SIZE': '258048', 'SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_CURR_OCCUPANCY_BYTES': '330752', 'SAI_INGRESS_PRIORITY_GROUP_STAT_CURR_OCCUPANCY_BYTES': '3842816', 'SAI_INGRESS_PRIORITY_GROUP_STAT_DROPPED_PACKETS': '0', 'SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_CURR_OCCUPANCY_BYTES': '3507712', 'SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_WATERMARK_BYTES': '331008', 'SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES': '3607808', 'SAI_INGRESS_PRIORITY_GROUP_STAT_WATERMARK_BYTES': '3943168'}
Then, in Grafana, duplicate the SQ Watermark panel and change its query:
sonic_pg_SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_WATERMARK_BYTES{source=~"$switch", pg_port=~"$port", pg_num=~"$queue"}
20260620 – Add Buffer Pool
Buffer Pool turns out to live in COUNTERS_BUFFER_POOL_NAME_MAP:
root@sonic-1:/home# sonic-db-cli COUNTERS_DB HGETALL COUNTERS_BUFFER_POOL_NAME_MAP
{'egress': 'oid:0x180000000007b0', 'ingress': 'oid:0x180000000007b1'}
sonic-db-cli COUNTERS_DB HGETALL "COUNTERS:oid:0x180000000007b0" | grep WATERMARK
root@sonic-1:/home# sonic-db-cli COUNTERS_DB HGETALL "COUNTERS:oid:0x180000000007b0"
{'SAI_BUFFER_POOL_STAT_CURR_OCCUPANCY_BYTES': '99319808', 'SAI_BUFFER_POOL_STAT_WATERMARK_BYTES': '102473728', 'SAI_BUFFER_POOL_STAT_XOFF_ROOM_CURR_OCCUPANCY_BYTES': '0', 'SAI_BUFFER_POOL_STAT_XOFF_ROOM_WATERMARK_BYTES': '0'}
Because the script already adapts to counters from different modules, it is enough to add a Buffer Pool config under modules:
root@tme91:/opt/sonic-telemetry/config/modules# more buffer_pool.conf
# Buffer Pool counter module configuration
enabled=true
# NAME_MAP path (fixed)
name_map_path="COUNTERS_BUFFER_POOL_NAME_MAP"
# Metric prefix
prometheus_prefix="sonic_buffer_pool"
# Label name (the name map keys are ingress/egress and are used as the label directly)
label_name="pool"
# Only ingress/egress exist, so no filtering
name_filter=""
# The key itself is the pool name; nothing to extract from Ethernet:x
extract_labels=""
# Sample interval: aligned with the POLL_INTERVAL of BUFFER_POOL_WATERMARK (see step 2 below)
sample_interval="10s"
Then use it directly in Grafana, with the legend {{source}} – {{pool}}:
sonic_buffer_pool_SAI_BUFFER_POOL_STAT_WATERMARK_BYTES{source=~"$switch"}
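2026-01-31 update: reworked ACL logic
The previous logic did not generalize well: it depended too much on the scenario, and maintaining the mapping was limiting. The following changes were made:
- Use gnmic to read the real attributes (stage, ports) of each ACL table from CONFIG_DB, instead of relying on a table naming convention
- Parse the stage and ports of every table
- Use the CONFIG_DB data when generating the Prometheus relabel rules
- [rule_mapping] and extract_labels in acl.conf are no longer needed and can be deleted
- Grafana changes:
  - The existing $port variable can be used directly to filter ACLs
  - $stage: select ingress or egress
  - $acl_table: select a specific ACL table
  - Rule names are no longer mapped up front; Grafana Transformations convert them on the existing data
- On every run the script:
  - Fetches the latest ACL table information from CONFIG_DB
  - Compares it with the cache (whether stage or port changed)
  - Regenerates prometheus.yml if anything changed

| Database | Table | Content retrieved |
| COUNTERS_DB | ACL_COUNTER_RULE_MAP | {table}:{rule} → OID |
| CONFIG_DB | ACL_TABLE | stage, ports@ |
Final label format: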
sonic_acl_SAI_ACL_COUNTER_ATTR_PACKETS{
source="switch1",
acl_table="Ethernet33ECN", # ACL table name
acl_stage="egress", # from CONFIG_DB
acl_port="Ethernet33", # from CONFIG_DB
acl_seq="10" # extracted from the rule name
}
Prometheus relabel rules:
# 1. Extract acl_table
- source_labels: [acl_rule]
regex: '(.+):RULE_[0-9]+'
replacement: '${1}'
target_label: acl_table
# 2. Extract acl_seq
- source_labels: [acl_rule]
regex: '.+:RULE_([0-9]+)'
replacement: '${1}'
target_label: acl_seq
# 3. ACL table → stage mapping (generated from CONFIG_DB)
- source_labels: [acl_table]
regex: 'Ethernet33'
replacement: 'ingress'
target_label: acl_stage
- source_labels: [acl_table]
regex: 'Ethernet33ECN'
replacement: 'egress'
target_label: acl_stage
# ... other tables
# 4. ACL table → port mapping (generated from CONFIG_DB)
- source_labels: [acl_table]
regex: 'Ethernet33'
replacement: 'Ethernet33'
target_label: acl_port
- source_labels: [acl_table]
regex: 'Ethernet33ECN'
replacement: 'Ethernet33'
target_label: acl_port
# ... other tables
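Rules 3 and 4 are generated rather than hand-written. The sketch below shows how that generation might look, under two assumptions that are mine and not taken from the script: the ACL_TABLE data has already been parsed into a dict of stage and ports per table, and multiple ports are joined with a comma. The function name acl_relabel_rules is hypothetical.

```python
import re

def acl_relabel_rules(acl_tables):
    """acl_tables: {table: {"stage": str, "ports": [str, ...]}} parsed from CONFIG_DB."""
    def rule(table, label, value):
        return {
            "source_labels": ["acl_table"],
            "regex": re.escape(table),  # match the table name literally
            "replacement": value,
            "target_label": label,
        }
    tables = sorted(acl_tables.items())
    stage_rules = [rule(t, "acl_stage", a["stage"]) for t, a in tables]
    port_rules = [rule(t, "acl_port", ",".join(a["ports"])) for t, a in tables]
    return stage_rules + port_rules

rules = acl_relabel_rules({
    "Ethernet33": {"stage": "ingress", "ports": ["Ethernet33"]},
    "Ethernet33ECN": {"stage": "egress", "ports": ["Ethernet33"]},
})
for r in rules:
    print(r)
```

Because Prometheus anchors relabel regexes, the rule for Ethernet33 does not also match Ethernet33ECN.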
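2026-03-13 update: fixing OID pollution
In this deployment gnmic did not honor the subscriptions bound at the target level, so every subscription was applied to every target. When two switches share the same OID, the data gets polluted. In the output below, the source should be dut1 only, yet the data also contains series from dut2's subscription: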
curl -s http://localhost:9804/metrics | grep 'source=' | head -3
......
sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_PFC_1_RX_PKTS
{source="dut1", subscription_name="port_counters_dut1"} 0
sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_PFC_1_RX_PKTS
{source="dut1", subscription_name="port_counters_dut2"} 0 ← polluted
sonic_COUNTERS_oid_0x1000000000001_SAI_PORT_STAT_PFC_1_RX_PKTS
{source="dut2", subscription_name="port_counters_dut1"} 0 ← polluted
Because the gnmic data is polluted, Prometheus also contains the crossed series, so Grafana gets duplicates when it queries the database:
curl -G 'http://localhost:9090/api/v1/series' --data-urlencode 'match[]={source="dut1"}' | python3 -m json.tool | grep -A 30 "Ethernet19"
......
{
"__name__": "sonic_port_SAI_PORT_STAT_IF_IN_OCTETS",
"instance": "gnmic:9804",
"job": "gnmic",
"port": "Ethernet19",
"port_oid": "oid_0x1000000000013",
"source": "dut1",
"subscription_name": "port_counters_dut1",
"subscription_target": "COUNTERS_DB",
"target": "COUNTERS_DB"
},
{
"__name__": "sonic_port_SAI_PORT_STAT_IF_IN_OCTETS",
"instance": "gnmic:9804",
"job": "gnmic",
"port": "Ethernet19",
"port_oid": "oid_0x1000000000013",
"source": "dut1",
"subscription_name": "port_counters_dut2",
"subscription_target": "COUNTERS_DB",
"target": "COUNTERS_DB"
},
{
"__name__": "sonic_port_SAI_PORT_STAT_IF_IN_PKTS_BYTE",
"instance": "gnmic:9804",
"job": "gnmic",
"port": "Ethernet19",
"port_oid": "oid_0x1000000000013",
"source": "dut1",
"subscription_name": "port_counters_dut1",
"subscription_target": "COUNTERS_DB",
"target": "COUNTERS_DB"
},
{
"__name__": "sonic_port_SAI_PORT_STAT_IF_IN_PKTS_BYTE",
"instance": "gnmic:9804",
"job": "gnmic",
"port": "Ethernet19",
"port_oid": "oid_0x1000000000013",
"source": "dut1",
"subscription_name": "port_counters_dut2",
"subscription_target": "COUNTERS_DB",
"target": "COUNTERS_DB"
}
The gnmic target configuration does bind each DUT to its own subscription, but so far this does not appear to take effect:
targets:
  DUT1:
    subscriptions:
      - port_counters_DUT1
  DUT2:
    subscriptions:
      - port_counters_DUT2
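After discussion, and since this is only a testbed, we ended up running a separate gnmic instance per device, which rules out the pollution problem.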
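Splitting gnmic per device amounts to generating one self-contained configuration per DUT, so that no instance ever sees another device's subscription. The sketch below is an assumption-laden illustration, not the repository's generator: the addresses and path strings in `DUTS` are placeholders, the container names follow the `gnmic-dut1` pattern used later in this post, and the exact gnmic key names should be checked against the gnmic documentation.

```python
import json


def build_gnmic_config(dut, address, paths, listen=":9804"):
    """Build a config that contains exactly one target and one subscription."""
    subscription = f"port_counters_{dut}"
    return {
        "targets": {dut: {"address": address, "subscriptions": [subscription]}},
        "subscriptions": {
            subscription: {
                "target": "COUNTERS_DB",
                "paths": list(paths),
                "stream-mode": "sample",
                "sample-interval": "1s",
            }
        },
        "outputs": {"prom": {"type": "prometheus", "listen": listen}},
    }


def build_all(duts):
    """Map a container name (gnmic-<dut>) to that device's own config."""
    return {
        f"gnmic-{dut}": build_gnmic_config(dut, address, paths)
        for dut, (address, paths) in sorted(duts.items())
    }


# Hypothetical inventory: placeholder addresses and placeholder OID paths.
DUTS = {
    "dut1": ("192.0.2.11:8080", ["<dut1 OID path>"]),
    "dut2": ("192.0.2.12:8080", ["<dut2 OID path>"]),
}

if __name__ == "__main__":
    for container, config in build_all(DUTS).items():
        print(container)
        print(json.dumps(config, indent=2))
```

Each instance can keep the same metrics port because it lives in its own container. Prometheus then needs one scrape target per container, which is the price of this approach.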

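2026-04-21 update: scrambled Telemetry data
Symptoms
Out of necessity I switched to a different RDMA test script. Running long-duration traffic with this script scrambles the data, and Telemetry collection becomes very unstable. After the test stopped and I confirmed that the switch traffic had dropped to 0, the per queue traffics panel still showed a long tail of traffic, even though several earlier 10-minute tests had looked fine.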

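After the problem appeared, I ran another 10 minutes of traffic and saw something interesting: per queue traffics only started to accumulate once the traffic had stopped, as if the per-queue data were being delayed.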

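Root cause
Investigation showed that gnmic sampled 448 queue OIDs with sample-interval set to 1s, while the switch's QUEUE_STAT polling interval is 5s. The gNMI server therefore had to send a subscribe response for all 448 OIDs every second; under heavy load it could not keep up and the data backed up. Once the traffic stopped, the load dropped, and the data only showed up after the backlog had drained. Aligning gnmic's sample interval with the switch's polling interval solved the problem.
Below are the polling intervals of the counters used in the current tests. The gnmic sample interval should be ≥ the SONiC polling interval; otherwise gnmic only triggers meaningless repeated pushes and adds load on the gNMI server: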
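| Counter type | SONiC polling interval | gnmic sample interval | Notes |
|---|---|---|---|
| PORT_STAT | 1s | 1s | Few OIDs (~56); 1s is no problem |
| ACL | 1s | 1s | Few OIDs (depends on the configuration; few in this test scenario); 1s is no problem |
| QUEUE_STAT | 5s | 5s | Many OIDs (~448, 56*8); match the intervals, lengthen the interval, or give the server more resources |
| QUEUE_WATERMARK | 5s | 5s | Same as above |
| QUEUE_DROP | 5s | 5s | Same as above |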
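The rule "gnmic sample interval ≥ SONiC polling interval" and the load it implies can be checked with simple arithmetic. A minimal sketch using the OID counts and intervals from the table; `check_intervals` and the `COUNTERS` layout are illustrative, and the "updates per second" figure counts OIDs only, not the individual fields inside each OID:

```python
def parse_seconds(value):
    """Parse an interval such as '1s' or '5s' into seconds."""
    if not value.endswith("s") or value.endswith("ms"):
        raise ValueError(f"unsupported interval: {value!r}")
    return float(value[:-1])


def check_intervals(counters):
    """For each counter type, report the push rate and whether it is sane."""
    report = {}
    for name, (oids, poll, sample) in counters.items():
        poll_s, sample_s = parse_seconds(poll), parse_seconds(sample)
        report[name] = {
            # OID updates the gNMI server must push per second.
            "updates_per_s": oids / sample_s,
            # How many pushes repeat a value that has not been re-polled yet.
            "redundancy": max(poll_s / sample_s, 1.0),
            "ok": sample_s >= poll_s,
        }
    return report


# (OID count, SONiC polling interval, gnmic sample interval)
COUNTERS = {
    "PORT_STAT": (56, "1s", "1s"),
    "QUEUE_STAT before fix": (448, "5s", "1s"),
    "QUEUE_STAT after fix": (448, "5s", "5s"),
}

if __name__ == "__main__":
    for name, row in check_intervals(COUNTERS).items():
        print(f"{name}: {row['updates_per_s']:.1f} OID updates/s, "
              f"x{row['redundancy']:.0f} redundancy, ok={row['ok']}")
```

With the original 1s sampling, the server pushes 448 queue OIDs per second, and four out of every five pushes repeat a value that COUNTERS_DB has not refreshed. At 5s the rate falls to 89.6 per second with no redundant pushes.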
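Troubleshooting workflow
This is a good place to summarize the troubleshooting approach for the whole Telemetry solution. The Telemetry data chain has 5 layers, and the problem can sit in any of them. Always start at the data source and confirm layer by layer, locating the fault with real data: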
┌───────────────────────────────────────────────┐
|Switch ASIC → COUNTERS_DB (redis) → gNMI Server| → gnmic → Prometheus → Grafana
| ↑ ↑ ↑ | ↑ ↑ ↑
| counter counter DB telemetry | container scrape query/
| interval polling-writing container | collection pull relabel
└───────────────────────────────────────────────┘
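Layer 1: Switch CLI. Confirm the data source, including whether the port traffic state is updating and how often the counters refresh. Traffic visible in the CLI → the data source is fine, check the next layer. No traffic in the CLI → the problem is in the network or the traffic itself, not in telemetry;
Layer 2: COUNTERS_DB (Redis). Confirm that the counters are updating. Redis values changing → COUNTERS_DB is fine and the problem is in gNMI or a layer above. Redis values not changing → the counter polling or syncd has a problem: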
# Note: the COUNTERS_DB port differs between SONiC versions, so confirm it first
# Look up the port OID mapping
redis-cli -n 2 -p xxxx HGETALL COUNTERS_PORT_NAME_MAP | grep -A1 "Ethernet25"
# Look up the queue OID mapping
redis-cli -n 2 -p xxxx HGETALL COUNTERS_QUEUE_NAME_MAP | grep -A1 "Ethernet25:1"
# Verify that the port counter is updating (2 reads, 2 seconds apart)
redis-cli -n 2 -p xxxx hget "COUNTERS:oid:<OID>" SAI_PORT_STAT_IF_IN_OCTETS
sleep 2
redis-cli -n 2 -p xxxx hget "COUNTERS:oid:<OID>" SAI_PORT_STAT_IF_IN_OCTETS
# Verify that the queue counter is updating (5 seconds apart, matching the QUEUE_STAT polling interval)
for i in 1 2 3 4 5 6; do
  echo "=== $i ($(date +%H:%M:%S)) ==="
  redis-cli -n 2 -p xxxx hget "COUNTERS:oid:<QUEUE_OID>" SAI_QUEUE_STAT_BYTES
  sleep 5
done
# Check queue drops
redis-cli -n 2 -p xxxx hget "COUNTERS:oid:<QUEUE_OID>" SAI_QUEUE_STAT_DROPPED_PACKETS
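The "is the Redis value changing?" check above can also be scripted so the verdict does not depend on eyeballing numbers. A minimal sketch: `redis_reader` shells out to the same `redis-cli ... hget` command shown above, while `sample_counter` and `classify` are hypothetical helpers and the three verdict labels are my own naming.

```python
import subprocess
import time


def redis_reader(port, oid, field, db=2):
    """Return a zero-argument function that reads one counter via redis-cli."""
    cmd = ["redis-cli", "-n", str(db), "-p", str(port),
           "hget", f"COUNTERS:{oid}", field]

    def read():
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return done.stdout.strip()

    return read


def sample_counter(read, samples=3, interval=5.0, sleep=time.sleep):
    """Read the counter `samples` times, `interval` seconds apart."""
    values = []
    for i in range(samples):
        values.append(int(read()))
        if i < samples - 1:
            sleep(interval)
    return values


def classify(values):
    """Label the readings as 'stale', 'advancing', or 'reset-or-wrap'."""
    if len(set(values)) == 1:
        return "stale"
    if all(later >= earlier for earlier, later in zip(values, values[1:])):
        return "advancing"
    return "reset-or-wrap"
```

On the switch this would be called as `classify(sample_counter(redis_reader(port, "oid:<QUEUE_OID>", "SAI_QUEUE_STAT_BYTES")))`, with `port` set to the COUNTERS_DB port of that SONiC version. The total sampling window should exceed the counter's polling interval; otherwise a healthy 5s counter can look stale.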
Layer 3: gNMI Server. Check the telemetry container:
# Check the telemetry container status and resource usage
docker ps -a | grep telemetry
docker stats --no-stream $(docker ps -q -f name=telemetry)
docker inspect --format='{{.RestartCount}} restarts' $(docker ps -q -f name=telemetry)
# Check the telemetry container logs
docker logs --tail 50 $(docker ps -q -f name=telemetry)
# Check active gNMI connections
docker exec $(docker ps -q -f name=telemetry) ss -tnp | grep 8080
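Layer 4: gnmic. Check the collector side. If the gnmic process is alive and the port is listening but metrics returns 0 samples → the prometheus output has died, and the gnmic container must be restarted or the configuration script rerun:
# Check the processes inside the gnmic container
docker exec gnmic-dut1 ps aux
# Check the gnmic logs (not written to stdout by default; the --log flag is required)
docker logs gnmic-dut1 2>&1 | tail -50
# If the log is empty, run gnmic by hand with logging (binding the port will fail, but the connection info is visible)
docker exec gnmic-dut1 /app/gnmic --config /app/gnmic.yaml subscribe --log
# Check whether the metrics endpoint has data
curl -s http://localhost:9804/metrics | wc -l
curl -s http://localhost:9804/metrics | grep -v "^#" | wc -l
# Verify that the counter of a specific OID is changing (run while traffic is flowing)
for i in 1 2 3 4 5 6; do
  echo "=== $i ($(date +%H:%M:%S)) ==="
  curl -s http://localhost:9804/metrics | grep "oid_<OID>_SAI_PORT_STAT_IF_IN_OCTETS" | grep -v "^#"
  sleep 5
done
# List the OID prefixes present in the metrics
curl -s http://localhost:9804/metrics | grep -oP "oid_0x[0-9a-f]+" | sort -u | head -20
# Check whether there is any IN_OCTETS data
curl -s http://localhost:9804/metrics | grep "IN_OCTETS" | head -10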
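The endpoint checks above (sample count, OID prefixes, IN_OCTETS presence) can be folded into one summary. A minimal Python sketch, assuming the metrics endpoint at `http://localhost:9804/metrics` used in this post; `summarize` and `verdict` are illustrative helpers, and the verdict only encodes the "0 samples → prometheus output is down" rule stated for this layer.

```python
import re
import urllib.request

OID_RE = re.compile(r"oid_0x[0-9a-f]+")


def fetch(url="http://localhost:9804/metrics"):
    """Fetch the exposition text served by the gnmic prometheus output."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8", errors="replace")


def summarize(text):
    """Count sample lines, distinct OIDs, and IN_OCTETS samples."""
    samples = [line for line in text.splitlines()
               if line.strip() and not line.startswith("#")]
    oids = sorted({m.group(0) for line in samples for m in OID_RE.finditer(line)})
    return {
        "samples": len(samples),
        "oids": oids,
        "in_octets": sum("IN_OCTETS" in line for line in samples),
    }


def verdict(summary):
    """Apply the layer-4 rule: an endpoint with zero samples is broken."""
    if summary["samples"] == 0:
        return "no samples: the prometheus output is probably down, restart gnmic"
    return "ok"


if __name__ == "__main__":
    print(verdict(summarize(fetch())))
```

Running `summarize(fetch())` twice while traffic is flowing and comparing the raw lines for one OID gives the same information as the `for` loop above.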
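Layer 5: Prometheus. Check the scrape. Raw metric has values but the relabeled one does not → the relabel rules in prometheus.yml are at fault. Raw metric is missing as well → the problem is in the scrape configuration or on the gnmic side: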
# Check target status
curl -s 'http://localhost:9090/api/v1/targets' | python3 -c "
import json,sys
d=json.load(sys.stdin)
for t in d['data']['activeTargets']:
    if 'gnmic' in t.get('scrapeUrl',''):
        print(t['scrapeUrl'], t['health'], t['lastError'])
"
# Query the raw metric (before relabeling)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sonic_COUNTERS_oid_0x1000000000019_SAI_PORT_STAT_IF_IN_OCTETS' \
  | python3 -m json.tool | head -20
# Query the relabeled metric (the one Grafana uses)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sonic_port_SAI_PORT_STAT_IF_IN_OCTETS{port_oid="oid_0x1000000000019"}' \
  | python3 -m json.tool
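The raw-versus-relabeled comparison from the two queries above reduces to a small decision function. A minimal sketch that takes the two parsed `/api/v1/query` JSON responses; `has_samples` and `diagnose` are hypothetical helper names, and the verdict strings simply restate the layer-5 rule:

```python
def has_samples(response):
    """True when a /api/v1/query response carries at least one series."""
    if response.get("status") != "success":
        return False
    return bool(response.get("data", {}).get("result"))


def diagnose(raw_response, relabeled_response):
    """Point at the likely faulty piece from the two query results."""
    if has_samples(relabeled_response):
        return "ok: relabeled metric present"
    if has_samples(raw_response):
        return "check the relabel rules in prometheus.yml"
    return "check the scrape configuration or the gnmic side"


if __name__ == "__main__":
    # Hand-written stand-ins for the JSON printed by the two curl commands.
    raw = {"status": "success",
           "data": {"resultType": "vector",
                    "result": [{"metric": {}, "value": [0, "1"]}]}}
    empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
    print(diagnose(raw, empty))
```

An empty `result` list with `status` set to `success` is what Prometheus returns when no series matches, so an HTTP 200 alone does not prove that data exists.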
# List all sonic metric names in Prometheus
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | python3 -c "
import json,sys
for n in json.load(sys.stdin)['data']:
    if 'sonic' in n.lower(): print(n)
" | head -20