HSRP Unicast Flapping

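1. The 3750s and the 7206s are interconnected at Layer 3 with OSPF.
2. All links among the 2950s and 3750s are Layer 2 trunks.
3. 2950-1 through 2950-5 all belong to VLAN 2, and the spanning-tree blocked port is between the 2950s.
4. HSRP is configured between the 3750s; 3750-1 is the active router for VLAN 2.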

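Problem:

1. No port mirroring is configured on the 2950s, yet on 2950-5 we can capture IP unicast packets sent from servers to clients, and their destination addresses all differ. As long as the ports on 2950-5 and 2950-2 are up, they receive this traffic whether or not they carry any business traffic, and it is spread quite evenly.

2. The switches mainly affected are 2950-2 and 2950-5, both of which connect to 3750-2. The problem appears when the blocked port is between the 2950s; if the blocked port is moved to an uplink port, the problem goes away.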


Solution:

The problem was a result of non-symmetric traffic patterns through the L3-switched network.

The forward traffic path is:
PC --> 2950-1 --> 3750-1 --> 7206-1 --> Intranet Server

But the return traffic path may be:
Intranet Server --> 7206-2 --> 3750-2 --> 3750-1 --> 2950-1 --> PC

Because no traffic flows from the PC to 3750-2, the PC's MAC entry on 3750-2 ages out after 5 minutes. 3750-2 then floods the return traffic as unknown unicast, and both 2950-2 and 2950-5 are affected. When 3750-1 receives the flooded unicast, it does have the MAC address of the PC behind 2950-1, but 3750-1 does not announce that MAC address to 3750-2.

In addition, the ARP timeout is 4 hours, so 3750-2 sends a new ARP request only once every 4 hours. The PC replies, and 3750-2 learns the PC's MAC address, but that entry lives for only 5 minutes. After 5 minutes the MAC entry is cleared, 3750-2 floods again, and 2950-2 and 2950-5 are affected.
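To put a rough number on the timer mismatch described above, here is a minimal Python sketch. It assumes a simplified model that is not taken from any measurement: the only event that refreshes the PC's MAC entry on 3750-2 is the PC's ARP reply, once per ARP timeout, and the entry then lives for exactly the MAC aging time. The function name `flooding_fraction` is made up for this illustration.

```python
def flooding_fraction(arp_timeout_s: float, mac_aging_s: float) -> float:
    """Fraction of each ARP refresh cycle during which 3750-2 has no MAC
    entry for an otherwise silent host, and therefore floods.

    Simplified model (an assumption): the MAC entry is learned only from
    the host's ARP reply and then ages out after mac_aging_s seconds.
    """
    if arp_timeout_s <= 0 or mac_aging_s < 0:
        raise ValueError("arp_timeout_s must be > 0 and mac_aging_s >= 0")
    # Time in each cycle that is not covered by a live MAC entry.
    uncovered_s = max(0.0, arp_timeout_s - mac_aging_s)
    return uncovered_s / arp_timeout_s


if __name__ == "__main__":
    # Default timers: ARP timeout 4 hours, MAC aging 5 minutes.
    print(f"defaults:           {flooding_fraction(14400, 300):.1%}")
    # ARP timeout lowered below the MAC aging time.
    print(f"ARP timeout 240 s:  {flooding_fraction(240, 300):.1%}")
    # MAC aging raised to match the ARP timeout.
    print(f"MAC aging 14400 s:  {flooding_fraction(14400, 14400):.1%}")
```

Under this model the default timers leave the host without a MAC entry on 3750-2 for 14100 of every 14400 seconds, while bringing the two timers together in either direction closes the gap.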

There are three workarounds:
——————————————————————————————
1 Decrease the default ARP timeout from 4 hours to below 5 minutes on VLAN 2 of 3750-2.

This results in the L3 switch issuing an ARP request to the host at least every 5 minutes.
That refreshes the MAC table on the switch before the entry ages out, thus preventing flooding.

Caution: after the ARP timeout is changed, 3750-2 will send far more ARP requests, which may affect its performance.

2 Increase the default MAC address-table aging time from 5 minutes to 4 hours or more on 3750-2.

This results in the L3 switch issuing the ARP request to the host before the MAC entry ages out.
The ARP refresh then happens once every 4 hours.

Caution: with a longer MAC aging time, the number of PC clients that can be supported is limited, because the CAM table size is limited.
In addition, if a PC client is moved to a different 2950 port, it may be unable to get online, because the old MAC entry can persist for up to 4 hours.

3 Adjust the routing between the 3750s and the 7206s.

Change the non-symmetric traffic patterns into symmetric traffic patterns.

Problems during the customer's fix:

The customer's cutover failed, and the customer complained that the workaround did not work.

There are some reasons why the cutover failed when the aging time was changed to 14400.
——————————————————————————————
1 The problem already existed at the time of the cutover, so 3750-2 had no MAC address entries for some of the clients on 2950-1.

2 The MAC aging time is 4 hours after the change, but the ARP timeout is also 4 hours. So 3750-2 learns those MAC addresses only after the next ARP refresh, which can be up to 4 hours later.

You need to clear the ARP table after changing the MAC aging time; the network will then be OK.
——————————————————————————————
I recreated the customer environment in my lab. The following commands change the MAC aging time.

(config)#mac-address-table aging-time 14400
#sh mac-address-table aging-time
Global Aging Time: 14400
Vlan    Aging Time
----    ----------
   1     14400
   2     14400
   3     14400
 999     14400
#clear arp

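One week after changing the MAC aging time...

The customer reported that the problem was not solved and the same issue persisted. After arranging a packet capture with the customer and analyzing it, we confirmed that Unicast Flapping was indeed still occurring. However, the MAC address tables on 3750-1 and 3750-2 were complete, and the MAC aging time had already been changed to 4 hours. So why did the problem still occur?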

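Solving the problem again:

1. Why were the MAC address tables on the switches complete?

Because the packet capture and the show mac output were not taken at the same moment, checking the captured data against the show output is not meaningful. Also, the entries for different hosts never age out at the same time, so whenever a MAC entry ages out there is still some Unicast Flapping in the interval between the ARP request being sent and answered. That interval is very short, around 1 s. So when reviewing a capture, if several packets share the same source and destination addresses and the run lasts about 1 s, that is normal.

2. Analysis of the captured data showed that 2950-2 was still receiving runs of packets with identical source and destination addresses that lasted longer than 1 s!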
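The 1 s rule of thumb above can be sketched as a small check over an exported capture. This is only an illustration: the input format (a list of timestamp, source, destination tuples), the function name `long_bursts`, and the 1 s gap used to split runs are all assumptions made for this sketch, not something taken from the customer's capture tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (timestamp in seconds, source IP, destination IP); assumed export format.
Packet = Tuple[float, str, str]


def long_bursts(
    packets: Iterable[Packet],
    max_normal_s: float = 1.0,
    max_gap_s: float = 1.0,
) -> List[Tuple[str, str, float]]:
    """Return (src, dst, duration) for every run of packets with the same
    source and destination that lasts longer than max_normal_s.

    A run ends when the same flow is silent for more than max_gap_s.
    """
    times: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for ts, src, dst in packets:
        times[(src, dst)].append(ts)

    suspicious: List[Tuple[str, str, float]] = []
    for (src, dst), stamps in times.items():
        stamps.sort()
        start = prev = stamps[0]
        for ts in stamps[1:]:
            if ts - prev > max_gap_s:
                # The gap closes the current run; report it if it was long.
                if prev - start > max_normal_s:
                    suspicious.append((src, dst, prev - start))
                start = ts
            prev = ts
        # Report the final run of this flow if it was long.
        if prev - start > max_normal_s:
            suspicious.append((src, dst, prev - start))
    return suspicious
```

Runs of about 1 s or less are treated as the normal flooding window between MAC aging and the ARP reply; anything longer is worth a closer look.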

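Why does this happen? Did the earlier aging-time change have no effect at all?
After consulting the relevant documentation, I found the cause:

The problem is caused by TCNs.

What is a TCN? In spanning tree, when the topology changes, a switch generates a TCN (Topology Change Notification) and sends it toward the root bridge, and the root bridge then propagates the information to the entire broadcast domain. Once the root bridge has received the TCN, it sets the Topology Change flag in its BPDUs, and the MAC aging time is shortened to 15 s.

So changing the MAC aging time and then clearing ARP does solve the problem temporarily. But when the customer's staff start or finish work, PCs are powered on or off, which generates TCNs. The configured 4-hour MAC aging time is then overridden and shortened to 15 s, and 15 s later all the MAC entries have been flushed.

The symptom then reverts to the original problem, asymmetric routing. Only after 4 hours can the MAC addresses be relearned through ARP.

Below is the email sent to the customer, including the workaround: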

After the MAC aging time was changed to 14400, the flapping did not disappear. The reasons are as follows.
1 The switch sends a "TCN BPDU" toward the root bridge when a PC powers on or powers down.
2 After the root bridge receives the "TCN BPDU", it sets the Topology Change Acknowledgment flag (to acknowledge the TCN BPDU sent by the previous bridge) and the Topology Change flag in the next Configuration BPDU that it sends out.
3 The root bridge continues to set the Topology Change flag in all Configuration BPDUs that it sends out for a total of Forward Delay + Max Age seconds (default = 35 seconds). This flag instructs all bridges to shorten their bridge table aging time from the default value of 300 seconds to 15 seconds.
4 After 15 s, the issue returns to its original state: non-symmetric traffic patterns.
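The timing in the steps above can be written out as a short Python sketch. It uses the default spanning-tree timers quoted in the email (Forward Delay 15 s, Max Age 20 s); the function names and the way time is passed in are assumptions made for illustration only.

```python
FORWARD_DELAY_S = 15  # default Forward Delay
MAX_AGE_S = 20        # default Max Age


def tc_flag_duration(forward_delay_s: int = FORWARD_DELAY_S,
                     max_age_s: int = MAX_AGE_S) -> int:
    """Seconds the root bridge keeps the Topology Change flag set."""
    return forward_delay_s + max_age_s


def effective_mac_aging(configured_aging_s: int,
                        seconds_since_tcn: float,
                        forward_delay_s: int = FORWARD_DELAY_S,
                        max_age_s: int = MAX_AGE_S) -> int:
    """MAC aging time a bridge applies at a given moment after a TCN.

    While the Topology Change flag is set, aging is shortened to the
    Forward Delay; otherwise the configured value applies.
    """
    tc_active = 0 <= seconds_since_tcn < tc_flag_duration(forward_delay_s, max_age_s)
    if tc_active:
        # Shorten only: a configured value below Forward Delay is left alone
        # (an assumption of this sketch).
        return min(configured_aging_s, forward_delay_s)
    return configured_aging_s


if __name__ == "__main__":
    print(tc_flag_duration())              # 35
    print(effective_mac_aging(14400, 10))  # 15
    print(effective_mac_aging(14400, 60))  # 14400
```

With the aging time configured as 14400 s, any entry that sees no traffic for 15 s while the flag is set is removed, which is why the 4-hour setting did not help once PCs began powering on and off.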

Workaround
————————————
Configure the PC-facing ports as "portfast". The switch will then not send a "TCN BPDU" toward the root bridge when a PC powers on or powers down, so the MAC aging time will not be changed.
Please note: do not connect hubs, concentrators, switches, bridges, and so on to a portfast port. If you do, a spanning-tree loop can form and seriously impact the business.
(config-if)#span portfast
%Warning: portfast should only be enabled on ports connected to a single
host. Connecting hubs, concentrators, switches, bridges, etc… to this
interface when portfast is enabled, can cause temporary bridging loops.
Use with CAUTION
%Portfast has been configured on GigabitEthernet1/0/44 but will only
have effect when the interface is in a non-trunking mode.
(config-if)#end
————————————

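Summary:

This design is very common in China. Wherever HSRP leads to asymmetric routing and the upper layer has no policy to correct it, this problem exists. So why is it usually not very noticeable? Probably because bandwidth is plentiful and few clients are attached downstream, so nobody notices. Moreover, MAC Flapping does not degrade device performance; it only consumes a certain amount of bandwidth.

This case involves HSRP, MAC Flapping, the ARP table, and MAC and ARP aging, so it is very helpful for understanding how packets are forwarded.

Based on this case, the following questions are worth thinking about:
1. What is the detailed Layer 2 process from the moment a frame leaves a broadcast domain and arrives at the HSRP active router's port until it is forwarded out?
2. Why does Layer 3 forwarding not involve the MAC address table? Is looking at the ARP table alone enough?
3. At the boundary between Layer 3 and Layer 2, how does the lookup get from ARP to MAC? In some newer IOS releases the ARP table includes the MAC address and the port, so is the MAC address table no longer needed?

Here is a standard document on MAC address problems caused by asymmetric routing with HSRP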

This article is from Frank's Blog

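Copyright notice: this is an original article that represents only the author's personal views. Copyright belongs to Frank Zhao; when reposting, please credit the source and link to the article.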