Configuring Dude’s SyslogPosted: November 10, 2010
Polling for information is similar to asking a little kid who is sitting in the back seat of your vehicle “Are you hungry?” every 5 seconds. Your mental session will process their response to your poll for information and react accordingly based on their response similar to the way The Dude will respond when a probe condition is true or not.
Unsolicited information is similar to a kid sitting in the back seat of your vehicle saying (not asking), “Are we there yet?” every 5 seconds. Your mental session will process this information and act accordingly. Most of the time, the most appropriate response is a non-response; you just log it. But occasionally, the kid will say something interesting like, “I feel sick. I think I’m gonna throw up.” You might want to take action based on these unsolicited pieces of information. Why this distinction is important may be the difference between a quick pit stop and cleaning vomit from the backseat of your car.
By now you are using the Dude or some other network management system to poll your network devices for information using SNMP. Now I’m asking about your syslog facility. Why is syslog important if you are able to poll for stats real-time? Isn’t polling good enough? Well, no – not really. Like the kid in your back seat, there is a big difference between the two: polling implies solicitation of information and syslog is unsolicited information.
You work in the electric utility industry so it’s probably easier for you to think of it this way: you are using DNP3 to poll a substation. The intervals are set to update every 30 seconds. You have a binary state set on your SCADA display that tells you what state a substation breaker is currently operating in. Knowing in advance that substation breakers operate in less than 30 seconds, it is conceivable that you could miss events, right? Depending on the curve your relay is using, an operation can occur in 1/60th of a second. If you aren’t polling at this split second interval, the event could be missed. These cases need to be known immediately. They must preempt standard polling intervals and let you know before the next poll that something just happened. This is why unsolicited messaging is so important.
SNMP does offer a way to send unsolicited messages (traps). Unfortunately, The Dude does not have a facility to receive them and make alerts out of them. But it offers a really cool alternative. Syslog can be setup on The Dude to notify you of anything interesting that you might want to be notified about.
Setting up Syslog in Dude
Syslog is a facility that directs information across your network to The Dude. This facility is actually a service on the device that you configure on each piece of your equipment. By default, the Syslog server is already enabled on your Dude installation. Click on the Settings button and select the Syslog tab. By default, everything that is pointed to the Dude’s IP will be accepted and logged in the Syslog section of the interface. Now you will record everything. Nice.
What if you want to be notified if something interesting is logged? There’s got to be a better way than looking through all of the daily logs – just like ignoring that kid in your backseat saying, “Are we there yet?”. Well, there is a way to filter it. The Dude lets you create regular expressions to filter each message for certain words. You can specify a regexp based on each source IP address or device that you are monitoring. I usually apply the regexp for all messages because most of my equipment is similar and I expect the same information to come in from all of them.
Using Regexp to Log and Notify
Regexp is the most useful method, in my opinion, for processing unsolicited events sent to syslog. The scope of this is to describe how to notify and log interesting syslog messages using regex and the Dude. Describing what regex is and how it works is out of my scope or expertise to describe. There is a good crash course on regular-expressions.info if you don’t know at all what it is. In the Dude, we use it for searching the logs for interesting messages. For example, if our regexp is “crash”, we will match all lines that contain “crash”; if our regexp is crash$, it will match only strings that end with the word “crash”
To setup a regexp rule, click on the red “+”. Insert your regex that you want to be notified if it appears. Some that I use are “CORE DUMP” for when a router takes a dump, “ENABLE OK” for when someone successfully enables a router, and “attempt FAILED” for when someone is trying to log in to a router using an incorrect password.
The check-box next to the Regexp: field is a ! (a.k.a NOT). This will match everything that does NOT equal what you want to match. Let me know if you find a use for this check box. I’m really curious why anyone would want to take an action based on something NOT matching. My guess is that it would be useful if you also filter by IP in addition to regex.
So, now you can setup your syslog and get notified when your equipment starts spewing interesting messages. Have fun!
There were several instances where we found leaky memory in a router located in a remote office which was resulting in several seconds of downtime. While the router was rebooting, the remote office would always complain about dropped calls or sessions. It was very rare that our 30 second ping interval would fail during these outages and even if it did, it had to miss three pings to be considered down anyway. It wasn’t until we configured syslog on each piece of network equipment to report unsolicited messages that we obtained enough information to troubleshoot the problem. The ones outlined below are some of the more “interesting” cases. The point of this is to prove to you that this type of information is important to set up and record. It really helps in troubleshooting issues when they arise… and they will.
Router Firmware Problem
The following syslog was obtained from an Adtran router. No one complained about this 18 second outage which happened every week since the day the router was put into service around 2006. The other downstream routers reported link issues about the same time. Notice the timestamps: the first time is when The Dude received the message and the second time is the date/time stamp of what’s on the equipment when it sent the message. There are deviations in time because of the delay of the network or, in this case, the clock is not set properly on the router. This is good to know when troubleshooting issues with network equipment. A functioning device may report a link failure while the broken device is silent. This is why it is also important to keep your clocks synched up!
2010.04.28-05:22:14 <private.ip.address.here>: <3>Jan 02 10:14:13 Porter_NEC_WAN OPERATING_SYSTEM:CORE DUMP 17.01.02.00\source\PacketCore\AdUtil\AdAlloc.cpp#439: AdFatal(OUT OF MEMORY - Called from 0x001AE0C0 Stack Trc: 002FF038 002FF838 002FFAB8 002FFF1C 002FFC38 001AE0C0 00A8CB8C 00A8D9F0 00195310 00197854 001979EC 00 197ADC 001A7340 00C101FC ....Regs: r0=0x002ff810 r1=0x045e3b54 r2=0x019a0b30 r3=0x00000026r4=0x045e3ad4 r5= 0x045e3aa1 r6=0x045e3aa6 r7=0x062bf964 r8=0x00000002 r9=0x00000000 r10=0x00000000 r11=0x045e3adc r12=0x0000 0000 r13=0x04008000 r14=0x00000004 r15=0x00000005 r16=0x00000006 r17=0x00000007 r18=0x00000008 r19=0x000000 09 r20=0x0000000a r21=0x0000000b r22=0x0000000c r23=0x0000000d r24=0x00000001 r25=0x00000001 r26=0x00000000 r27=0x0ff01d90 r28=0x00000008 r29=0x001ae0c0 r30=0x0ff0a000 r31=0x00008000 LR=0x002ff810 CTR=0x0140d5c4.)
STP network loop abatement
While the polling of SNMP was successful, these logs show how the Spanning Tree Protocol process is running from inside of my precious HP network switch that I purchased off of eBay. If you loop a network cable and do not have STP turned on, a broadcast storm get your network. This shows when network loops are being blocked on my network gear. This is good to look into because it is important to know how or why this is occurring. For example: did someone install a hub on their desk and loop a cable?
2010.11.10-07:41:00 : <12> Nov 10 07:41:05 10.3.12.200 00564 ports:port B18 PD Invalid Signature indication. 2010.11.10-07:41:08 : <14> Nov 10 07:41:14 10.3.12.200 00435 ports:port B18 is Blocked by STP 2010.11.10-07:41:11 : <14> Nov 10 07:41:17 10.3.12.200 00076 ports:port B18 is now on-line 2010.11.10-07:41:28 : <12> Nov 10 07:41:33 10.3.12.200 00331 FFI:port B18-High collision or drop rate. See help. 2010.11.10-07:43:20 : <14> Nov 10 07:43:25 10.3.12.200 00077 ports:port B18 is now off-line 2010.11.10-07:43:25 : <14> Nov 10 07:43:30 10.3.12.200 00435 ports:port B18 is Blocked by STP 2010.11.10-07:43:27 : <14> Nov 10 07:43:33 10.3.12.200 00076 ports:port B18 is now on-line 2010.11.10-07:44:17 : <12> Nov 10 07:44:22 10.3.12.200 00331 FFI:port B18-High collision or drop rate. See help. 2010.11.10-07:46:51 : <14> Nov 10 07:46:56 10.3.12.200 00077 ports:port B18 is now off-line 2010.11.10-07:46:56 : <14> Nov 10 07:47:01 10.3.12.200 00435 ports:port B18 is Blocked by STP 2010.11.10-07:46:58 : <14> Nov 10 07:47:04 10.3.12.200 00076 ports:port B18 is now on-line 2010.11.10-08:12:46 : <14> Nov 10 08:12:51 10.3.12.200 00077 ports:port B18 is now off-line 2010.11.10-13:51:01 : <12> Nov 10 13:51:05 10.3.12.200 00564 ports:port B18 PD Invalid Signature indication.
Random switch reboots
Stay away from D-Link switches. They’re junk. This manufacturer has made it on my “crap” list. Here are some syslog’s that we get from them:
(blank – yes, you read right: “blank” as in nothing…)
Strange T-1 circuit problems
PCV stands for Path Code Violation and is common with cabling problems. (We still haven’t figured this one out)
2010.11.10-10:14:45 : <4>Nov 10 10:14:53 LAKESIDE_NEC_WAN T1:t1 3/4 PCV 15 min threshold exceeded