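First Steps with Soft-RoCE
Earlier I set up two VMs and tested Soft-RoCE, but I forgot to capture packets at the time. When I wanted to bring the setup back up and send more traffic, I found I had forgotten the detailed steps. So this time I am writing the steps down and will keep refining them, so that I don't forget again.
Topology
Note: to keep things simple I used a fairly recent Ubuntu 22.04.2, which already ships iproute2, perftest, and the packages Soft-RoCE needs.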
┌────────────┐             ┌────────────┐
│  ubuntu1   ├─────────────┤  ubuntu2   │
│   rdma1    │             │   rdma2    │
└────────────┘             └────────────┘
192.168.123.129            192.168.123.130
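Deploying Soft-RoCE
For details, see this article: https://zhuanlan.zhihu.com/p/361740115
Only the configuration of rdma1 is shown below; rdma2 is configured the same way and is omitted here:
root@rdma1:~# modprobe rdma_rxe # load the module; rxe is the virtual NIC that implements Soft-RoCE
root@rdma1:~# rdma link add rxe_0 type rxe netdev ens160 # create the virtual NIC and bind it to the real NIC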
root@rdma1:~# rdma link
link rxe_0/1 state ACTIVE physical_state LINK_UP netdev ens160
root@rdma1:~# ibv_devices
device node GUID
------ ----------------
rxe_0 020c29fffe86da5b
root@rdma1:~# ibv_devinfo -d rxe_0 # shows the details of the virtual NIC
hca_id: rxe_0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 020c:29ff:fe86:da5b
sys_image_guid: 020c:29ff:fe86:da5b
vendor_id: 0xffffff
vendor_part_id: 0
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
root@rdma1:~# ip add # rxe_0 does not show up here
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:0c:29:86:da:5b brd ff:ff:ff:ff:ff:ff
altname enp2s0
inet 192.168.123.129/24 metric 100 brd 192.168.123.255 scope global dynamic ens160
valid_lft 1386sec preferred_lft 1386sec
inet6 fe80::20c:29ff:fe86:da5b/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:d9:a6:c8:88 brd ff:ff:ff:ff:ff:ff
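The node GUID reported by ibv_devices (020c29fffe86da5b) appears to be derived from the MAC address of ens160 (00:0c:29:86:da:5b) with the modified EUI-64 rule, the same rule that yields the fe80::20c:29ff:fe86:da5b link-local address in the ip output. The following is a minimal Python sketch of that mapping. The helper name mac_to_eui64 is made up for illustration and is not part of any RDMA tool.

```python
def mac_to_eui64(mac: str) -> str:
    """Return the modified EUI-64 identifier of a 48-bit MAC as 16 hex digits."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}")
    octets[0] ^= 0x02  # flip the universal/local bit of the first octet
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    return "".join(f"{o:02x}" for o in eui64)


print(mac_to_eui64("00:0c:29:86:da:5b"))  # 020c29fffe86da5b
```

This is handy for checking which netdev an rxe device sits on when a host has several interfaces.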
Testing Soft-RoCE
Now use perftest to send RDMA traffic between the two hosts.
Single run
Server side:
root@rdma1:~# ib_send_bw -d rxe_0 # once started it waits for the client to connect, similar to TRex
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : rxe_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
RX depth : 512
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0011 PSN 0x9465cf
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
remote address: LID 0000 QPN 0x0011 PSN 0x97db69
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 0.00 71.19 0.001139
---------------------------------------------------------------------------------------
Client side:
root@rdma2:~# ib_send_bw -d rxe_0 192.168.123.129
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : rxe_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0011 PSN 0x97db69
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
remote address: LID 0000 QPN 0x0011 PSN 0x9465cf
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 71.10 70.91 0.001135
---------------------------------------------------------------------------------------
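The GID lines in the output above look like the IPv4-mapped IPv6 form of each host address (::ffff:a.b.c.d), printed byte by byte in decimal. A small Python sketch that reproduces the string, assuming that interpretation is right; the function names are illustrative only:

```python
import ipaddress


def ipv4_to_roce_gid(ip: str) -> bytes:
    """Return the 16-byte IPv4-mapped IPv6 address (::ffff:a.b.c.d) for ip."""
    v4 = ipaddress.IPv4Address(ip)
    return bytes(10) + b"\xff\xff" + v4.packed


def perftest_style(gid: bytes) -> str:
    """Format a GID the way it appears in the logs above: decimal bytes joined by ':'."""
    return ":".join(f"{b:02d}" for b in gid)


print(perftest_style(ipv4_to_roce_gid("192.168.123.129")))
# 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
```

Comparing the computed string with the "local address" and "remote address" GIDs is a quick way to confirm that both ends picked the IPv4-based GID entry.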
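Continuous test
perftest has many tunable parameters. In the example below, -s 1024 sets the message size in bytes, -q 4 uses four QPs, -t 512 sets the TX depth, and --run_infinitely together with -D 1 keeps the test running and prints a result line every second:
Server side: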
root@rdma1:~# ib_write_bw -c RC -d rxe_0 -F -s 1024 -q 4 --run_infinitely -t 512 -D 1
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : rxe_0
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0017 PSN 0x30a422 RKey 0x00020c VAddr 0x00aaaaf3a04000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
local address: LID 0000 QPN 0x0018 PSN 0xc24b9b RKey 0x00020c VAddr 0x00aaaaf3a05000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
local address: LID 0000 QPN 0x0019 PSN 0x868113 RKey 0x00020c VAddr 0x00aaaaf3a06000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
local address: LID 0000 QPN 0x001a PSN 0x70332f RKey 0x00020c VAddr 0x00aaaaf3a07000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
remote address: LID 0000 QPN 0x0017 PSN 0xd55abe RKey 0x0002dc VAddr 0x00aaab22375000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
remote address: LID 0000 QPN 0x0018 PSN 0x4dc0d0 RKey 0x0002dc VAddr 0x00aaab22376000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
remote address: LID 0000 QPN 0x0019 PSN 0x8779da RKey 0x0002dc VAddr 0x00aaab22377000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
remote address: LID 0000 QPN 0x001a PSN 0x475d21 RKey 0x0002dc VAddr 0x00aaab22378000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
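Client side:
The output is similar to the server side, so it is not annotated again. The difference is that the results are printed on the client side.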
root@rdma2:~# ib_write_bw 192.168.123.129 -c RC -d rxe_0 -F -s 1024 -q 4 --run_infinitely -t 512 -D 1
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : rxe_0
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
TX depth : 512
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0017 PSN 0xd55abe RKey 0x0002dc VAddr 0x00aaab22375000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
local address: LID 0000 QPN 0x0018 PSN 0x4dc0d0 RKey 0x0002dc VAddr 0x00aaab22376000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
local address: LID 0000 QPN 0x0019 PSN 0x8779da RKey 0x0002dc VAddr 0x00aaab22377000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
local address: LID 0000 QPN 0x001a PSN 0x475d21 RKey 0x0002dc VAddr 0x00aaab22378000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
remote address: LID 0000 QPN 0x0017 PSN 0x30a422 RKey 0x00020c VAddr 0x00aaaaf3a04000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
remote address: LID 0000 QPN 0x0018 PSN 0xc24b9b RKey 0x00020c VAddr 0x00aaaaf3a05000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
remote address: LID 0000 QPN 0x0019 PSN 0x868113 RKey 0x00020c VAddr 0x00aaaaf3a06000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
remote address: LID 0000 QPN 0x001a PSN 0x70332f RKey 0x00020c VAddr 0x00aaaaf3a07000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
1024 33800 0.00 33.01 0.033803
1024 33600 0.00 32.81 0.033602
1024 33700 0.00 32.91 0.033701
1024 33500 0.00 32.71 0.033495
1024 33300 0.00 32.52 0.033302
1024 33600 0.00 32.81 0.033599
1024 34000 0.00 33.19 0.033986
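The bandwidth and message-rate columns can be cross-checked against each other. The numbers in these logs are consistent with "MB" meaning 2^20 bytes, which the sketch below assumes; the function name is hypothetical.

```python
def bw_mib_per_sec(msg_size_bytes: int, msg_rate_mpps: float) -> float:
    """Bandwidth in MiB/s from message size and message rate (millions of messages/s)."""
    bytes_per_sec = msg_size_bytes * msg_rate_mpps * 1_000_000
    return bytes_per_sec / (1024 * 1024)


# Server row of the single run: 65536-byte messages at 0.001139 Mpps
print(round(bw_mib_per_sec(65536, 0.001139), 2))  # 71.19
# First row of the continuous run: 1024-byte messages at 0.033803 Mpps
print(round(bw_mib_per_sec(1024, 0.033803), 2))   # 33.01
```

Both results match the BW average column, so a row whose bandwidth and message rate disagree would point to a copy error rather than a measurement effect.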
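CM connection setup test
RoCEv2 connections can be set up in two ways:
- Socket setup, also called TCP setup. All of my tests above used TCP setup.
- CM setup, enabled by adding "-z" to the server and client commands above. In the output, "Data ex. method" then changes from Ethernet to rdma_cm.
Server side: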
root@rdma1:~# ib_write_bw -c RC -d rxe_0 -F -s 1024 -q 4 --run_infinitely -t 512 -D 1 -z
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : rxe_0
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0021 PSN 0x917bae
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
local address: LID 0000 QPN 0x0022 PSN 0xb5da37
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
local address: LID 0000 QPN 0x0023 PSN 0x2a617f
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
local address: LID 0000 QPN 0x0024 PSN 0x67c92b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
remote address: LID 0000 QPN 0x0021 PSN 0x251718
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
remote address: LID 0000 QPN 0x0022 PSN 0x71cf22
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
remote address: LID 0000 QPN 0x0023 PSN 0x573ec4
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
remote address: LID 0000 QPN 0x0024 PSN 0x6070c3
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Client side:
root@rdma2:~# ib_write_bw 192.168.123.129 -c RC -d rxe_0 -F -s 1024 -q 4 --run_infinitely -t 512 -D 1 -z
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : rxe_0
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
TX depth : 512
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0021 PSN 0x251718
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
local address: LID 0000 QPN 0x0022 PSN 0x71cf22
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
local address: LID 0000 QPN 0x0023 PSN 0x573ec4
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
local address: LID 0000 QPN 0x0024 PSN 0x6070c3
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
remote address: LID 0000 QPN 0x0021 PSN 0x917bae
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
remote address: LID 0000 QPN 0x0022 PSN 0xb5da37
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
remote address: LID 0000 QPN 0x0023 PSN 0x2a617f
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
remote address: LID 0000 QPN 0x0024 PSN 0x67c92b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
1024 39600 0.00 38.68 0.039606
1024 39500 0.00 38.57 0.039498
1024 40000 0.00 39.04 0.039979
1024 38200 0.00 37.31 0.038201
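Since the whole point of repeating this test was to capture packets, here is a small Python sketch that builds a tcpdump/Wireshark capture filter for both setup modes. It assumes the RoCEv2 UDP destination port 4791 and the perftest default TCP port 18515 for the socket exchange (a different port chosen with -p would need to be passed in). The helper name roce_capture_filter is made up.

```python
ROCEV2_UDP_PORT = 4791         # IANA-assigned UDP port for RoCEv2
PERFTEST_TCP_PORT = 18515      # assumed perftest default for the socket exchange


def roce_capture_filter(server_ip: str, tcp_port: int = PERFTEST_TCP_PORT) -> str:
    """Build a BPF filter matching RoCEv2 traffic and the TCP setup exchange."""
    return (
        f"host {server_ip} and "
        f"(udp port {ROCEV2_UDP_PORT} or tcp port {tcp_port})"
    )


print(roce_capture_filter("192.168.123.129"))
# host 192.168.123.129 and (udp port 4791 or tcp port 18515)
```

The resulting string can be passed to tcpdump on ens160, for example tcpdump -i ens160 -w roce.pcap followed by the filter in quotes. With "-z" the setup exchange is expected to show up on the UDP side only.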
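Packet examples
RoCEv2 Socket/TCP setup exchange:
RoCEv2 CM setup exchange:
TCP ECN packet (note that ECN is the low 2 bits of the TOS byte in the IP header, and the value 3, binary 11, means Congestion Experienced; so even for RoCEv2 packets the ECN marking lives in the IP layer):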
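To make the bit layout concrete, the sketch below splits a TOS byte into its DSCP and ECN parts using the codepoint names from RFC 3168. The function name split_tos is illustrative.

```python
ECN_NAMES = {
    0b00: "Not-ECT",  # sender does not support ECN
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced
}


def split_tos(tos: int) -> tuple:
    """Split the IPv4 TOS (or IPv6 Traffic Class) byte into (DSCP, ECN name)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    dscp = tos >> 2    # upper 6 bits
    ecn = tos & 0b11   # lower 2 bits
    return dscp, ECN_NAMES[ecn]


print(split_tos(0x03))  # (0, 'CE')
print(split_tos(0x6A))  # (26, 'ECT(0)')
```

A switch that marks congestion rewrites only the two ECN bits to 11 and leaves the DSCP value untouched.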
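RoCEv2 CNP packet (note that I assembled this packet by hand following this article; for how to build it, see my earlier post "Wireshark Tips, Part 3: Converting Hex Text to pcap with text2pcap". CNP is also described in Broadcom's documentation; see this article):
Ethernet Pause frame, which acts on the entire link; this capture comes from Wireshark's sample captures:
PFC frame, which reuses the 802.1p priorities so that it can act on an individual queue; the packet was generated with this open-source script: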
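The difference between the two frame types is easiest to see in the bytes. The sketch below is not the script mentioned above. It is a separate, minimal Python illustration that assumes the standard MAC Control layout: destination 01:80:c2:00:00:01, EtherType 0x8808, opcode 0x0001 for Pause and 0x0101 for PFC. All function names are made up.

```python
import struct

MAC_CONTROL_DST = bytes.fromhex("0180c2000001")  # reserved multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
OPCODE_PAUSE = 0x0001
OPCODE_PFC = 0x0101
MIN_FRAME_NO_FCS = 60  # minimum Ethernet frame size without the FCS


def _frame(src_mac: bytes, opcode: int, payload: bytes) -> bytes:
    """Assemble a MAC Control frame and zero-pad it to the minimum size."""
    header = MAC_CONTROL_DST + src_mac + struct.pack("!HH", MAC_CONTROL_ETHERTYPE, opcode)
    return (header + payload).ljust(MIN_FRAME_NO_FCS, b"\x00")


def pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Pause frame: one timer that stops the whole link."""
    return _frame(src_mac, OPCODE_PAUSE, struct.pack("!H", quanta))


def pfc_frame(src_mac: bytes, quanta_by_priority: dict) -> bytes:
    """PFC frame: an enable bitmap plus one timer per 802.1p priority (0-7)."""
    enable_vector = 0
    timers = [0] * 8
    for prio, quanta in quanta_by_priority.items():
        if not 0 <= prio <= 7:
            raise ValueError("priority must be 0..7")
        enable_vector |= 1 << prio
        timers[prio] = quanta
    return _frame(src_mac, OPCODE_PFC, struct.pack("!H8H", enable_vector, *timers))


src = bytes.fromhex("000c2986da5b")  # MAC of ens160 on rdma1
print(pause_frame(src, 0xFFFF)[:18].hex())
print(pfc_frame(src, {3: 0xFFFF})[:34].hex())
```

The Pause frame carries a single timer, while the PFC frame carries eight, of which only the ones flagged in the enable vector take effect. That is what lets PFC pause one priority while the others keep flowing.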