Taking an action with EEM + Tcl after a log occurs X times on LC/RSP Y on the ASR9k

Problem:

EEM combined with Tcl lets us automate actions on Cisco routers, and it offers many trigger types: syslog pattern, OID, CPU threshold, and so on. On the IOS platform this works without any issue. On the XR platform, however, each LC/RSP raises its own alarms, so we may have special requirements, for example:

Some alarms occur frequently. I want to restart the process (based on its PID) if the alarm occurs 3 times within 5 minutes on the same LC. How can this be done?

0/3/cpu0: alarm report "C", pid = zzz
0/1/cpu0: alarm report "A", Pid = xxx
0/2/cpu0: alarm report "B", pid = yyy
0/3/cpu0: alarm report "C", pid = zzz
0/1/cpu0: alarm report "A", pid = xxx
0/1/cpu0: alarm report "A", pid = xxx

Solution:

We can write a script that uses Tcl file I/O: it creates a file on harddisk: or a disk (e.g. disk0:) that holds the history and count of the syslog per LC. The script reads this file each time the syslog is observed and, based on the number of occurrences recorded, takes the required action.
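The file handling can be sketched on its own, outside EEM. The snippet below is only an illustration that runs in a plain tclsh: the procs rewrite_file and bump are hypothetical names, not part of the policy. It streams the file through a temporary copy and then renames the copy over the original, so a half-written file never replaces the good one.

```tcl
# Hypothetical helper: rewrite a state file by streaming it through a
# temporary copy, then renaming the copy over the original.
# "update" is the name of a proc that takes one line and returns the new line.
proc rewrite_file {path update} {
    # Create the file if it does not exist yet, without truncating it
    close [open $path a+]

    set in  [open $path r]
    set out [open $path.new w]
    while {[gets $in line] >= 0} {
        puts $out [$update $line]
    }
    close $in
    close $out

    # Swap the updated copy into place
    file rename -force $path.new $path
}

# Example transform: bump FLAG=1 to FLAG=2 on the 0/0/CPU0 record only
proc bump {line} {
    if {[string match "LC=0/0/CPU0 *" $line]} {
        regsub {FLAG=1} $line {FLAG=2} line
    }
    return $line
}
```

Calling rewrite_file with a file path and bump leaves every other record untouched and changes only the 0/0/CPU0 line.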

The steps are listed below; see the attachment and the script flow chart for the detailed script. In my example I only dump the arp process for testing, so please change the script to suit your requirement. To test the script, you can add a marker to it, e.g. action_syslog priority info msg "a":

  • Create a file on harddisk: or a disk (e.g. disk0:) that records the syslog count and the LC where the syslog was seen
  • Run the EEM script whenever the event happens
  • Check the file for the number of times the issue has been seen
  • Take the required action in case the count reaches X times on LC/RSP Y
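The counting rule behind these steps can be written as one small function. This is a minimal sketch under my reading of the script, not an extra piece of the policy: the proc name next_state and its return format are made up for illustration. Given the stored window start and count for one LC, it returns the new window start, the new count, and whether the action should fire.

```tcl
# Hypothetical helper: decide the next state of one LC record.
#   f_t, f_flag : window start time (s) and count stored in the file
#   now         : time of the current event (s)
#   interval    : window length (s); times: count that triggers the action
# Returns a list {new_t new_flag fire}.
proc next_state {f_t f_flag now interval times} {
    set in_window [expr {($now - $f_t) <= $interval}]
    if {$in_window && ($f_flag + 1) == $times} {
        # Threshold reached inside the window: fire and start over
        return [list $now 1 1]
    } elseif {$in_window && ($f_flag + 1) < $times} {
        # Inside the window, below the threshold: keep start time, count up
        return [list $f_t [expr {$f_flag + 1}] 0]
    }
    # Window expired: start a new window at the current time
    return [list $now 1 0]
}

# Third event on 0/0/CPU0, using the timestamps from the tests (600 s, 3 times)
puts [next_state 1390921603 2 1390921957 600 3]
```

With a 600 second window and a threshold of 3, the third event inside the window resets the record and reports that the action should run, which is the behaviour seen in Test3.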

Script flow chart:

[Flow chart image: match-2-condition]

Script Test

Test1: Dump happens only once on each LC

RP/0/RSP0/CPU0:ASR9010-1#more test.txt
Tue Jan 28 15:05:09.295 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/0/cpu0
Tue Jan 28 15:06:41.570 UTC
RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/4/cpu0
Tue Jan 28 15:06:55.280 UTC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt                         
Tue Jan 28 15:07:06.257 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
LC=0/0/CPU0 T=1390921603 FLAG=1 PID=516231
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331

Test2: Dump happens again on LC 0/0

RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/0/cpu0
Tue Jan 28 15:09:27.878 UTC
RP/0/RSP0/CPU0:ASR9010-1#
RP/0/RSP0/CPU0:ASR9010-1#more test.txt                         
Tue Jan 28 15:09:39.310 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
LC=0/0/CPU0 T=1390921603 FLAG=2 PID=516231 <<< flag changes to 2, time unchanged
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331

Test3: Dump happens 3 times on LC 0/0 within 10 min

RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp location 0/0/cpu0
Tue Jan 28 15:12:36.086 UTC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt                         
Tue Jan 28 15:12:49.300 UTC
LC=0/RSP0/CPU0 T=1390921477 FLAG=1 PID=573646
LC=0/0/CPU0 T=1390921957 FLAG=1 PID=516231 << both flag and time are reset
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331

You will also find the action log below; you can change the action to whatever you need.
RP/0/RSP0/CPU0:Jan 28 15:12:38.659 : tclsh[65872]: %HA-HA_EEM-6-ACTION_SYSLOG_LOG_INFO : test1.tcl: show process  location 

Test4: Dump happens again after 10 min on 0/RSP0/CPU0

RP/0/RSP0/CPU0:ASR9010-1#dumpcore running arp                  
Tue Jan 28 15:56:37.982 UTC
RP/0/RSP0/CPU0:ASR9010-1#more test.txt       
Tue Jan 28 15:56:43.942 UTC
LC=0/RSP0/CPU0 T=1390924599 FLAG=1 PID=573646 << time is reset
LC=0/0/CPU0 T=1390921957 FLAG=1 PID=516231
LC=0/4/CPU0 T=1390921616 FLAG=1 PID=520331
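Each record in test.txt has the form LC=... T=... FLAG=... PID=..., where T is the start of the current window in epoch seconds and FLAG is the count so far. As a small sketch for reading such a record in a plain tclsh (the proc name parse_record is hypothetical and not used by the policy itself):

```tcl
# Hypothetical helper: split one state-file record into its four fields.
# Returns {lc t flag pid}, or an empty list when the line does not match.
proc parse_record {line} {
    set re {^LC=(\S+) T=([0-9]+) FLAG=([0-9]+) PID=([0-9]+)}
    if {[regexp $re $line all lc t flag pid]} {
        return [list $lc $t $flag $pid]
    }
    return [list]
}

# A record taken from the Test2 output
set rec [parse_record "LC=0/0/CPU0 T=1390921603 FLAG=2 PID=516231"]
puts "location=[lindex $rec 0] count=[lindex $rec 2]"
```

The pattern is anchored only at the start of the line, so trailing text after the PID is ignored.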

Script:

# After copying the script to disk0:, apply the following configuration.
# Note: if you change any variable or the script itself, you must re-apply
# "event manager policy test_syslog.tcl username cisco".
#
# aaa authorization eventmanager default local
# event manager directory user policy disk0:
# event manager policy test_syslog.tcl username cisco persist-time 3600 type user

::cisco::eem::event_register_syslog pattern "OS-DUMPER-4-CORE_INFO : Core for pid" maxrun_sec 600

namespace import ::cisco::eem::*
namespace import ::cisco::lib::*

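# Detection window in seconds, and the number of occurrences that triggers the action
set interval 600
set times 3

# Fields read back from the state file
set f_lc ""
set f_t ""
set f_pid ""
set f_flag ""

# Fields for the current event
set lc ""
set t [clock seconds]
set pid ""
set flag 1
set if_check 0

# Query the event info and check for errors; you should not need to change this
array set syslog_info [event_reqinfo]

if {$_cerrno != 0} {
    set result [format "component=%s; subsys err=%s; posix err=%s;\n%s" \
        $_cerr_sub_num $_cerr_sub_err $_cerr_posix_err $_cerr_str]
    error $result
}

set messages $syslog_info(msg)

# Extract the location (e.g. 0/0/CPU0) and the PID from the syslog message
regexp {^.*([0-9]/.*[0-9]/CPU[0-9]).*pid = ([0-9]+).*} $messages all lc pid
set line "LC=$lc T=$t FLAG=$flag PID=$pid"

# Build the command list only now: before the regexp runs, $pid and $lc are
# still empty and the command would be sent without a PID or location
set command_list [list \
    "show process $pid location $lc" \
]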

# Open a cli connection
if [catch {cli_open} result] {
        error $result $errorInfo
} else {
        array set cli1 $result
}

# set timestamp [clock format [clock seconds] -format {%Y%m%d%H%M%S}]

set newfile "/disk0:/test.txt"
set firstcheck "/disk0:/test.txt"
set filename "/disk0:/test.txt"
set temp  $filename.new

# On the first run, create an empty file; "a+" does not clear existing content
set new [open $newfile a+]
close $new

# Handle used to check whether the file is still empty (first match)
set check [open $firstcheck r]

# Open the state file for reading and the temporary copy for writing;
# "w" clears any stale copy left over from an interrupted run
set in [open $filename r]
set out [open $temp w]

# gets returns -1 at end of file, so a negative result on the first read
# means the file is empty
set is_empty [expr {[gets $check line1] < 0}]
close $check

if {$is_empty} {
    puts $out $line
} else {
    while {[gets $in line2] > -1} {
        # Copy through any line that is not a valid record
        if {![regexp {LC=([0-9]/.*[0-9]/CPU0).*T=([0-9]+).*FLAG=([0-9]+).*PID=([0-9]+).*} $line2 all f_lc f_t f_flag f_pid]} {
            puts $out $line2
            continue
        }
        if {$lc == $f_lc} {
            set if_check 1
            set in_window [expr {($t - $f_t) <= $interval}]
            if {$in_window && ($f_flag + 1) == $times} {
                # Threshold reached inside the window: run the action, then reset the record
                set line "LC=$f_lc T=$t FLAG=$flag PID=$f_pid"
                foreach comm_temp $command_list {
                    if [catch {cli_exec $cli1(fd) $comm_temp} cli_show] {
                        error $cli_show $errorInfo
                    }
                    action_syslog priority info msg $cli_show
                }
            } elseif {$in_window && ($f_flag + 1) < $times} {
                # Still inside the window: keep the start time, increment the count
                set line "LC=$f_lc T=$f_t FLAG=[expr {$f_flag + 1}] PID=$f_pid"
            } else {
                # Window expired: start a new window at the current time
                set line "LC=$f_lc T=$t FLAG=$flag PID=$f_pid"
            }
            puts $out $line
        } else {
            puts $out $line2
        }
    }
    # No record for this LC yet: append one
    if {$if_check == 0} {
        puts $out $line
    }
}

close $in
close $out

# Replace the original file with the updated temporary copy
file rename -force $temp $filename

# Close the CLI connection
if [catch {cli_close $cli1(fd) $cli1(tty_id)} result] {
    error $result $errorInfo
}
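
The regexp that pulls the location and PID out of the syslog message is the part most likely to need adjusting for other alarms. Below is a minimal standalone sketch of that step that can be tried in a plain tclsh. The proc name extract_lc_pid is hypothetical, and the sample message text is made up for illustration; real dumper output will differ, so check the pattern against your own logs.

```tcl
# Hypothetical helper: pull the node location and PID out of a syslog message.
# Returns {lc pid}, or an empty list when the message does not match.
proc extract_lc_pid {msg} {
    set re {([0-9]+/[A-Za-z0-9]+/CPU[0-9]+).*pid = ([0-9]+)}
    if {[regexp $re $msg all lc pid]} {
        return [list $lc $pid]
    }
    return [list]
}

# Made-up message text, only the "Core for pid = N" part follows the policy's pattern
set sample "LC/0/0/CPU0:dumper: %OS-DUMPER-4-CORE_INFO : Core for pid = 516231"
puts [extract_lc_pid $sample]
```

Checking the return value for an empty list before writing the record avoids storing a line with an empty LC or PID when a message does not match.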