Take an action with EEM + TCL after a log occurs X times on LC/RSP Y on the ASR9k
Problem:
EEM + TCL lets us automate many actions on Cisco routers, with several kinds of trigger: syslog pattern, OID, CPU threshold, and so on. On IOS platforms this works without any issue. On the XR platform, however, each LC/RSP raises its own alarms, so we may have a special requirement, for example:
Some alarms occur frequently, and I want to restart the process (based on its PID) if the alarm occurs 3 times within 5 min on a given LC. How can this be done?
0/3/cpu0: alarm report "C", Pid = zzz
0/1/cpu0: alarm report "A", Pid = xxx
0/2/cpu0: alarm report "B", pid = yyy
0/3/cpu0: alarm report "C", pid = zzz
0/1/cpu0: alarm report "A", pid = xxx
0/1/cpu0: alarm report "A", pid = xxx
Solution:
We can write an interactive script using TCL file I/O: create a file on the harddisk or a disk (the example uses disk0:) that holds the syslog history and count for each LC. The script reads this file whenever the syslog is observed and, based on the number of occurrences recorded, takes the required action.
The steps are listed below; see the attachment and the script flow chart for the detailed script. In my example I only dump the arp process for testing, so please change the script to suit your requirement. To test the script, you can add debug messages to it, e.g. action_syslog priority info msg "a":
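Before anything can be counted per LC, the script has to pull the node ID and the PID out of the syslog text. The following is a minimal sketch of that step in plain Tcl, runnable in tclsh off-box. The proc name parse_alarm and the sample message are assumptions made for illustration; the real syslog wording on your release may differ, so adjust the pattern to match it.

```tcl
# Hypothetical message text, used only to exercise the pattern.
set msg {LC/0/1/CPU0:Jan 28 15:06:43.000 : dumper[55]: %OS-DUMPER-4-CORE_INFO : Core for pid = 516231 (pkg/bin/arp) requested}

# Return {node pid} for a message, or an empty list when nothing matches.
proc parse_alarm {msg} {
    # First capture: a node ID such as 0/1/CPU0 or 0/RSP0/CPU0.
    # Second capture: the decimal PID that follows "pid = ".
    if {[regexp -nocase {(\d+/\w+/CPU\d+).*pid = (\d+)} $msg -> node pid]} {
        return [list $node $pid]
    }
    return {}
}

puts [parse_alarm $msg]
```

Returning an empty list on a failed match lets the caller skip messages it does not understand instead of writing a record with an empty LC or PID.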
- Create a file on the harddisk/disk that contains the syslog count and the LC where the syslog was seen
- Run the EEM script whenever the event happens
- Check the file on the harddisk/disk for the number of times the issue has been seen
- Take the required action in case the count reaches X times on LC/RSP Y
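The counting decision in the steps above can be sketched as a small pure function, separate from the file and CLI handling. This is an illustrative sketch in plain Tcl rather than the script itself: the proc name next_record and the timestamps in the usage lines are made up, and it assumes each LC record stores the time of the first alarm in the current window plus a count.

```tcl
# Decide the next state of one LC record when a new alarm arrives.
#   prev_t, prev_count : values currently stored for this LC
#   now                : time of the new alarm (seconds)
#   interval, times    : fire when `times` alarms fall within `interval` seconds
# Returns {new_t new_count fire}; fire is 1 when the action should run.
proc next_record {prev_t prev_count now interval times} {
    set within [expr {$now - $prev_t <= $interval}]
    if {$within && $prev_count + 1 >= $times} {
        # Threshold reached inside the window: act, then start a new window.
        return [list $now 1 1]
    } elseif {$within} {
        # Still inside the window: keep its start time, bump the count.
        return [list $prev_t [expr {$prev_count + 1}] 0]
    }
    # Window expired: this alarm becomes the first of a new window.
    return [list $now 1 0]
}

# Made-up alarm times for one LC, 600 s window, threshold 3.
set rec {1000 1}
foreach now {1100 1200 2500} {
    set rec [next_record [lindex $rec 0] [lindex $rec 1] $now 600 3]
    puts "t=$now -> T=[lindex $rec 0] FLAG=[lindex $rec 1] fire=[lindex $rec 2]"
}
```

Keeping the window start time unchanged while the count grows is what makes this a fixed window rather than a sliding one, which matches the behavior seen in the tests below.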
Script flow chart:
Script Test
Test1: Dump happened only once on each LC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:05:09.295 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/0/cpu0
Tue Jan 28 15:06:41.570 UTC
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/4/cpu0
Tue Jan 28 15:06:55.280 UTC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:07:06.257 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
LC=0/0/CPU0 T=1390921603 FLAG=1 PID=516231
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331
Test2: Dump happened again on LC 0/0
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/0/cpu0
Tue Jan 28 15:09:27.878 UTC
RP/0/RSP0/CPU0:ASR9010-1#
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:09:39.310 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
LC=0/0/CPU0 T=1390921603 FLAG=2 PID=516231 <<< flag changes to 2, time does not change
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331
Test3: Dump happened 3 times on LC 0/0 within 10 min
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/0/cpu0
Tue Jan 28 15:12:36.086 UTC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:12:49.300 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
LC=0/0/CPU0 T=1390921957 FLAG=1 PID=516231 << both flag and time are re-initialized
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331
You will also find the action log; you can change the action to whatever you need:
RP/0/RSP0/CPU0:Jan 28 15:12:38.659 : tclsh[65872]: %HA-HA_EEM-6-ACTION_SYSLOG_LOG_INFO : test1.tcl: show process location
Test4: Dump happened again after 10 min on 0/RSP0/CPU0
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp
Tue Jan 28 15:56:37.982 UTC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:56:43.942 UTC
LC=0/RSP0/CPU0 T=1390924599 FLAG=1 PID=573646 << time has been re-initialized
LC=0/0/CPU0 T=1390921957 FLAG=1 PID=516231
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331
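The T values in test.txt are epoch seconds, so the window check in the tests can be reproduced with simple arithmetic. The sketch below is plain Tcl for tclsh and uses only T values copied from the outputs above; the variable names are illustrative, and the 600 s window is the 10 min interval used in these tests.

```tcl
set interval 600

# {label old_T new_T}, copied from the test.txt outputs above.
set cases {
    {0/0/CPU0    1390921603 1390921957}
    {0/RSP0/CPU0 1390921477 1390924599}
}

set report {}
foreach case $cases {
    foreach {label old new} $case break
    set elapsed [expr {$new - $old}]
    set where   [expr {$elapsed <= $interval ? "inside" : "outside"}]
    # Decode the stored timestamp in UTC to compare it with the CLI clock.
    set started [clock format $old -gmt 1 -format %H:%M:%S]
    lappend report "$label: $elapsed s after $started UTC, $where the $interval s window"
}
puts [join $report \n]
# 0/0/CPU0: 354 s after 15:06:43 UTC, inside the 600 s window
# 0/RSP0/CPU0: 3122 s after 15:04:37 UTC, outside the 600 s window
```

The first case is why Test3 fired on the third dump, and the second is why Test4 only re-initialized the time for 0/RSP0/CPU0.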
Script:
# After copying the script to disk0:, apply the following configuration.
# Attention: if you change any variable or the script itself, you need to
# re-configure "event manager policy test_syslog.tcl username cisco".
#
# aaa authorization eventmanager default local
# event manager directory user policy disk0:
# event manager policy test_syslog.tcl username cisco persist-time 3600 type user

::cisco::eem::event_register_syslog pattern "OS-DUMPER-4-CORE_INFO : Core for pid" maxrun_sec 600

namespace import ::cisco::eem::*
namespace import ::cisco::lib::*

# interval = window length in seconds, times = occurrences that trigger the action
set interval "600"
set times "3"
set f_lc ""
set f_t ""
set f_pid ""
set f_flag ""
set lc ""
set t [clock seconds]
set pid ""
set flag "1"
set if_check "0"

# Extract the LC and the PID from the syslog message
array set syslog_info [event_reqinfo]
set messages $syslog_info(msg)
regexp {^.*([0-9]/.*[0-9]/CPU[0-9]).*pid = ([0-9]+).*} $messages all lc pid
set line "LC=$lc T=$t FLAG=$flag PID=$pid"

# Build the command list only after lc and pid are known;
# otherwise the command is sent with both fields empty
set command_list [list \
    "show process $pid location $lc" \
]

# Query the event info and check for errors; you should not change this
array set arr_einfo [event_reqinfo]
if {$_cerrno != 0} {
    set result [format "component=%s; subsys err=%s; posix err=%s;\n%s" \
        $_cerr_sub_num $_cerr_sub_err $_cerr_posix_err $_cerr_str]
    error $result
}

# Open a cli connection
if [catch {cli_open} result] {
    error $result $errorInfo
} else {
    array set cli1 $result
}

# set timestamp [clock format [clock seconds] -format {%Y%m%d%H%M%S}]
set newfile "/disk0:/test.txt"
set firstcheck "/disk0:/test.txt"
set filename "/disk0:/test.txt"
set temp $filename.new

# On the first run, create an empty file; "a+" does not clear existing content
set new [open $newfile a+]
close $new

# Handle used to check whether the file is still empty
set check [open $firstcheck r]

# Open both files; "w" = write only, any existing content is cleared first
set in [open $filename r]
set out [open $temp w]

# gets returns -1 at end of file, which here means the file is empty
# line1 is only used for this check; each gets call reads the next line
if {[gets $check line1] < 0} {
    puts $out $line
} else {
    while {[gets $in line2] > -1} {
        regexp {LC=([0-9]/.*[0-9]/CPU0).*T=([0-9]+).*FLAG=([0-9]+).*PID=([0-9]+).*} $line2 all f_lc f_t f_flag f_pid
        if {$lc == $f_lc} {
            set if_check "1"
            if {($t - $f_t) <= $interval && ($f_flag + 1) == $times} {
                # Threshold reached inside the window: run the action and reset
                set line "LC=$f_lc T=$t FLAG=$flag PID=$f_pid"
                # Loop through the command list
                foreach comm_temp $command_list {
                    if [catch {cli_exec $cli1(fd) $comm_temp} cli_show] {
                        error $cli_show $errorInfo
                    }
                    action_syslog priority info msg $cli_show
                }
                puts $out $line
            } elseif {($t - $f_t) <= $interval && ($f_flag + 1) < $times} {
                # Inside the window, below the threshold: keep the time, bump the flag
                set line "LC=$f_lc T=$f_t FLAG=[expr {$f_flag + 1}] PID=$f_pid"
                puts $out $line
            } else {
                # Window expired: re-initialize both time and flag
                set line "LC=$f_lc T=$t FLAG=$flag PID=$f_pid"
                puts $out $line
            }
        } else {
            puts $out $line2
        }
    }
    # No record for this LC yet: append a new one
    if { $if_check == 0 } {
        puts $out $line
    }
}
close $check
close $in
close $out

# rename commands, <source> to <target>
file rename -force $temp $filename

# Close the cli connection
if [catch {cli_close $cli1(fd) $cli1(tty_id)} result] {
    error $result $errorInfo
}
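The read, rewrite and rename portion of the script can be exercised off-box before it is loaded as a policy. The following is a simplified sketch in plain Tcl for tclsh, not a replacement for the policy: the proc name update_record and the demo file name are assumptions, and it only swaps or appends one record without any of the threshold logic.

```tcl
# Replace the record for one LC in a state file, or append it if missing.
# The new content is written to a temporary copy that is then renamed over
# the original, so a reader never sees a half-written file.
proc update_record {path lc new_line} {
    set tmp $path.new
    set found 0
    set out [open $tmp w]
    if {[file exists $path]} {
        set in [open $path r]
        while {[gets $in line] >= 0} {
            if {[regexp {^LC=(\S+) } $line -> f_lc] && [string equal $f_lc $lc]} {
                puts $out $new_line
                set found 1
            } else {
                puts $out $line
            }
        }
        close $in
    }
    if {!$found} { puts $out $new_line }
    close $out
    file rename -force $tmp $path
}

# Demo file in the current directory (hypothetical name).
set path [file join [pwd] eem_state_demo.txt]
file delete -force $path
update_record $path 0/0/CPU0 "LC=0/0/CPU0 T=1390921603 FLAG=1 PID=516231"
update_record $path 0/4/CPU0 "LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331"
update_record $path 0/0/CPU0 "LC=0/0/CPU0 T=1390921603 FLAG=2 PID=516231"

set fh [open $path r]
puts -nonewline [read $fh]
close $fh
```

After the three calls the file holds two lines, with the 0/0/CPU0 record updated in place and the 0/4/CPU0 record untouched, which is the same shape as the test.txt content shown in Test2.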
Copyright notice: This is an original article and represents only the author's personal views. Copyright belongs to Frank Zhao. When reposting, please cite the source and include a link to this article.