A First Look at Soft-RoCE

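A while ago I set up two VMs and tested Soft-RoCE, but forgot to capture packets at the time. When I went back to generate more traffic, I found I had already forgotten the exact steps. So this time I am writing them down, and I will keep refining these notes so I do not forget again.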

Topology

Note: to keep things simple I used a fairly recent Ubuntu 22.04.2, which already provides iproute2, perftest, and the packages Soft-RoCE needs.

                     192.168.123.130
┌────────────┐             ┌────────────┐
│ ubuntu1    ├─────────────┤ ubuntu2    │
│ rdma1      │             │ rdma2      │
└────────────┘             └────────────┘
    192.168.123.129

Deploying Soft-RoCE

For details, see this article: https://zhuanlan.zhihu.com/p/361740115

Only the configuration of rdma1 is shown below. rdma2 is configured the same way, so it is omitted:

root@rdma1:~# modprobe rdma_rxe  # load the kernel module; rxe is the software RDMA device that implements Soft-RoCE
root@rdma1:~# rdma link add rxe_0 type rxe netdev ens160  # create the rxe device and bind it to the real NIC
root@rdma1:~# rdma link
link rxe_0/1 state ACTIVE physical_state LINK_UP netdev ens160 
root@rdma1:~# ibv_devices
    device                 node GUID
    ------              ----------------
    rxe_0               020c29fffe86da5b

root@rdma1:~# ibv_devinfo -d rxe_0  # shows the details of the rxe device
hca_id: rxe_0
        transport:                      InfiniBand (0)
        fw_ver:                         0.0.0
        node_guid:                      020c:29ff:fe86:da5b
        sys_image_guid:                 020c:29ff:fe86:da5b
        vendor_id:                      0xffffff
        vendor_part_id:                 0
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
root@rdma1:~# ip add   # rxe_0 is not listed here: it is an RDMA device, not a network interface
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:86:da:5b brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 192.168.123.129/24 metric 100 brd 192.168.123.255 scope global dynamic ens160
       valid_lft 1386sec preferred_lft 1386sec
    inet6 fe80::20c:29ff:fe86:da5b/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:d9:a6:c8:88 brd ff:ff:ff:ff:ff:ff
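
One detail worth noticing in the output above: the node GUID of rxe_0 (020c:29ff:fe86:da5b) looks like the MAC of ens160 (00:0c:29:86:da:5b) expanded with the modified EUI-64 scheme, the same expansion that yields the fe80:: link-local address. The sketch below only reproduces that arithmetic; the function name is made up for illustration and this is not code taken from the rxe driver.

```python
def mac_to_node_guid(mac: str) -> str:
    """Expand a 48-bit MAC into a modified EUI-64, formatted like ibv_devinfo."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}")
    octets[0] ^= 0x02  # flip the universal/local bit of the first octet
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    groups = [f"{eui64[i]:02x}{eui64[i + 1]:02x}" for i in range(0, 8, 2)]
    return ":".join(groups)


print(mac_to_node_guid("00:0c:29:86:da:5b"))  # 020c:29ff:fe86:da5b
```

This is handy when several rxe devices exist and you want to tell from the GUID alone which NIC each one sits on.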

Testing Soft-RoCE

Now use perftest to send RDMA traffic between the two hosts.

Single-run test

Server side:

root@rdma1:~# ib_send_bw -d rxe_0  # starts and waits for the client to connect, similar to TRex

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : rxe_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 1
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0011 PSN 0x9465cf
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 remote address: LID 0000 QPN 0x0011 PSN 0x97db69
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             0.00               71.19              0.001139
---------------------------------------------------------------------------------------

Client side:

root@rdma2:~# ib_send_bw -d rxe_0 192.168.123.129
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : rxe_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 1
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0011 PSN 0x97db69
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 remote address: LID 0000 QPN 0x0011 PSN 0x9465cf
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             71.10              70.91              0.001135
---------------------------------------------------------------------------------------
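
The GID lines in the perftest output are easier to read once you see that perftest prints the 16 GID bytes in decimal, and that with GID index 1 here they form an IPv4-mapped IPv6 address (::ffff:a.b.c.d). A minimal sketch of the conversion, with a helper name of my own choosing:

```python
import ipaddress


def perftest_gid_to_ip(gid: str):
    """Turn a perftest GID string (16 decimal bytes) into an IP address object."""
    octets = [int(part) for part in gid.split(":")]
    if len(octets) != 16 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError("a GID must consist of 16 byte values")
    addr = ipaddress.IPv6Address(bytes(octets))
    # RoCEv2 GIDs derived from IPv4 addresses are IPv4-mapped IPv6 addresses;
    # fall back to the plain IPv6 address for any other GID.
    return addr.ipv4_mapped or addr


gid = "00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129"
print(perftest_gid_to_ip(gid))  # 192.168.123.129
```

So the "local address" and "remote address" GIDs above are simply the two VM addresses, 192.168.123.129 and 192.168.123.130.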

Continuous test

perftest has many tunable parameters, for example:

Server side:

root@rdma1:~# ib_write_bw  -c RC -d rxe_0 -F -s 1024 -q 4 --run_infinitely  -t 512  -D 1

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : rxe_0
 Number of qps   : 4            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 1
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0017 PSN 0x30a422 RKey 0x00020c VAddr 0x00aaaaf3a04000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 local address: LID 0000 QPN 0x0018 PSN 0xc24b9b RKey 0x00020c VAddr 0x00aaaaf3a05000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 local address: LID 0000 QPN 0x0019 PSN 0x868113 RKey 0x00020c VAddr 0x00aaaaf3a06000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 local address: LID 0000 QPN 0x001a PSN 0x70332f RKey 0x00020c VAddr 0x00aaaaf3a07000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 remote address: LID 0000 QPN 0x0017 PSN 0xd55abe RKey 0x0002dc VAddr 0x00aaab22375000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 remote address: LID 0000 QPN 0x0018 PSN 0x4dc0d0 RKey 0x0002dc VAddr 0x00aaab22376000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 remote address: LID 0000 QPN 0x0019 PSN 0x8779da RKey 0x0002dc VAddr 0x00aaab22377000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 remote address: LID 0000 QPN 0x001a PSN 0x475d21 RKey 0x0002dc VAddr 0x00aaab22378000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]

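Client side:

The output is similar to the server side, so I won't annotate it. The difference is that the bandwidth figures are shown on the client side.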

root@rdma2:~# ib_write_bw 192.168.123.129 -c RC -d rxe_0 -F -s 1024 -q 4 --run_infinitely  -t 512  -D 1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : rxe_0
 Number of qps   : 4            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 TX depth        : 512
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 1
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0017 PSN 0xd55abe RKey 0x0002dc VAddr 0x00aaab22375000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 local address: LID 0000 QPN 0x0018 PSN 0x4dc0d0 RKey 0x0002dc VAddr 0x00aaab22376000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 local address: LID 0000 QPN 0x0019 PSN 0x8779da RKey 0x0002dc VAddr 0x00aaab22377000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 local address: LID 0000 QPN 0x001a PSN 0x475d21 RKey 0x0002dc VAddr 0x00aaab22378000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 remote address: LID 0000 QPN 0x0017 PSN 0x30a422 RKey 0x00020c VAddr 0x00aaaaf3a04000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 remote address: LID 0000 QPN 0x0018 PSN 0xc24b9b RKey 0x00020c VAddr 0x00aaaaf3a05000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 remote address: LID 0000 QPN 0x0019 PSN 0x868113 RKey 0x00020c VAddr 0x00aaaaf3a06000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 remote address: LID 0000 QPN 0x001a PSN 0x70332f RKey 0x00020c VAddr 0x00aaaaf3a07000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 1024       33800            0.00               33.01              0.033803
 1024       33600            0.00               32.81              0.033602
 1024       33700            0.00               32.91              0.033701
 1024       33500            0.00               32.71              0.033495
 1024       33300            0.00               32.52              0.033302
 1024       33600            0.00               32.81              0.033599
 1024       34000            0.00               33.19              0.033986
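
As a sanity check on the result columns, the message rate can be recomputed from the average bandwidth and the message size. The sketch below assumes that "MB" in the perftest output means 2^20 bytes; that is my assumption, but it matches the numbers reported above to within rounding.

```python
MIB = 2 ** 20  # assumption: perftest's "MB" is 2^20 bytes


def msg_rate_mpps(bw_mb_per_sec: float, msg_size: int) -> float:
    """Messages per second (in millions) implied by a bandwidth and message size."""
    return bw_mb_per_sec * MIB / msg_size / 1e6


def gbit_per_sec(bw_mb_per_sec: float) -> float:
    """The same bandwidth expressed in Gbit/s."""
    return bw_mb_per_sec * MIB * 8 / 1e9


# First row of the continuous test: 1024-byte messages at 33.01 MB/sec
print(round(msg_rate_mpps(33.01, 1024), 6))
# Single-run test: 65536-byte messages at 70.91 MB/sec
print(round(msg_rate_mpps(70.91, 65536), 6))
print(round(gbit_per_sec(33.01), 3))
```

With -D 1 each row covers about one second, which is why the iteration count of each row (33800, 33600, ...) tracks the message rate so closely.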

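CM connection setup test

RoCEv2 connections can be set up in two ways:

  • Socket-based setup, also called TCP-based setup; all of my tests above used it;
  • CM-based setup, enabled by adding "-z" to the server/client commands above;

Server side: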

root@rdma1:~# ib_write_bw  -c RC -d rxe_0 -F -s 1024 -q 4 --run_infinitely  -t 512  -D 1 -z

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : rxe_0
 Number of qps   : 4            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 1
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0021 PSN 0x917bae
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 local address: LID 0000 QPN 0x0022 PSN 0xb5da37
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 local address: LID 0000 QPN 0x0023 PSN 0x2a617f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 local address: LID 0000 QPN 0x0024 PSN 0x67c92b
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 remote address: LID 0000 QPN 0x0021 PSN 0x251718
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 remote address: LID 0000 QPN 0x0022 PSN 0x71cf22
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 remote address: LID 0000 QPN 0x0023 PSN 0x573ec4
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 remote address: LID 0000 QPN 0x0024 PSN 0x6070c3
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]

Client side:

root@rdma2:~# ib_write_bw 192.168.123.129 -c RC -d rxe_0 -F -s 1024 -q 4 --run_infinitely  -t 512  -D 1 -z
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : rxe_0
 Number of qps   : 4            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 TX depth        : 512
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 1
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0021 PSN 0x251718
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 local address: LID 0000 QPN 0x0022 PSN 0x71cf22
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 local address: LID 0000 QPN 0x0023 PSN 0x573ec4
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 local address: LID 0000 QPN 0x0024 PSN 0x6070c3
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:130
 remote address: LID 0000 QPN 0x0021 PSN 0x917bae
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 remote address: LID 0000 QPN 0x0022 PSN 0xb5da37
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 remote address: LID 0000 QPN 0x0023 PSN 0x2a617f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
 remote address: LID 0000 QPN 0x0024 PSN 0x67c92b
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:123:129
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 1024       39600            0.00               38.68              0.039606
 1024       39500            0.00               38.57              0.039498
 1024       40000            0.00               39.04              0.039979
 1024       38200            0.00               37.31              0.038201

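Packet examples

RoCEv2 socket/TCP connection setup exchange:

RoCEv2 CM connection setup exchange:

TCP ECN packet (note that ECN is the last 2 bits of the TOS byte in the IP header, and a value of 3 means Congestion Experienced; so even for RoCEv2 packets the marking is carried at the IP layer):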
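
To make the bit layout concrete, here is a small sketch that splits a TOS byte into its DSCP part (upper 6 bits) and ECN part (lower 2 bits). The codepoint names follow RFC 3168; the function name is just for illustration.

```python
ECN_NAMES = {
    0b00: "Not-ECT",  # sender does not support ECN
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced, set by a congested switch
}


def split_tos(tos: int):
    """Return (dscp, ecn_name) for an IPv4 TOS / IPv6 traffic class byte."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    return tos >> 2, ECN_NAMES[tos & 0b11]


print(split_tos(0x03))  # (0, 'CE')
print(split_tos(0x02))  # (0, 'ECT(0)')
```

In Wireshark the same field can be matched with the display filter ip.dsfield.ecn == 3.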

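RoCEv2 CNP packet (note that I assembled this packet by hand following this article; for the method, see my earlier post "Wireshark Tips, Part 3: Converting Hex Text to pcap with text2pcap". CNP is also covered in Broadcom's documentation; see this article):

Ethernet Pause frame, which applies to the whole link; this packet comes from the Wireshark sample captures:

PFC frame, which reuses the 802.1p priorities so that it can act on an individual queue; the packet was generated with this open-source script: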
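
To show the difference between the two frame types in bytes, here is a minimal sketch that builds both by hand. It is not the open-source script mentioned above. The function names are hypothetical, the source MAC is simply the ens160 address from earlier, and the FCS is left out because the NIC normally appends it.

```python
import struct

MAC_CONTROL_DST = bytes.fromhex("0180c2000001")  # reserved MAC Control address
MAC_CONTROL_ETHERTYPE = 0x8808
OPCODE_PAUSE = 0x0001  # classic 802.3x Pause: one timer for the whole link
OPCODE_PFC = 0x0101    # Priority Flow Control: one timer per 802.1p priority
MIN_FRAME_LEN = 60     # minimum Ethernet frame length without the FCS


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Pause frame: a single pause time that applies to all traffic on the link."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    body = struct.pack("!HHH", MAC_CONTROL_ETHERTYPE, OPCODE_PAUSE, quanta)
    return (MAC_CONTROL_DST + src_mac + body).ljust(MIN_FRAME_LEN, b"\x00")


def build_pfc_frame(src_mac: bytes, quanta_by_priority: dict) -> bytes:
    """PFC frame: an enable bitmap plus eight pause timers, one per priority."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    enable_vector = 0
    timers = [0] * 8
    for priority, quanta in quanta_by_priority.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be in 0..7")
        enable_vector |= 1 << priority  # mark this priority's timer as valid
        timers[priority] = quanta
    body = struct.pack("!HHH8H", MAC_CONTROL_ETHERTYPE, OPCODE_PFC,
                       enable_vector, *timers)
    return (MAC_CONTROL_DST + src_mac + body).ljust(MIN_FRAME_LEN, b"\x00")


src = bytes.fromhex("000c2986da5b")
print(build_pause_frame(src, 0xFFFF).hex())
print(build_pfc_frame(src, {3: 0xFFFF}).hex())  # pause only priority 3
```

The hex output can be turned into a pcap for inspection in Wireshark after converting it to the offset-plus-bytes layout that text2pcap expects.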

This article is from Frank's Blog

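Link to this article: A First Look at Soft-RoCE
Copyright notice: this is an original article that reflects only my personal views. Copyright belongs to Frank Zhao. When reposting, please credit the source and include a link to the article.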