T1 Line Testing with an Adtran Netvanta
Posted: December 8, 2010
In my prior post, Configuring Dude’s Syslog, I mentioned a couple of strange or interesting cases where syslog helped troubleshoot an issue I was experiencing. The last log entry was a PCV (path code violation) 24 hour threshold alarm. I mentioned that it had something to do with cabling or a bad CSU and that the issue had not been resolved. After looking at the alarm for several months, it was time to take action. I decided to run another T1 alongside the existing one through our microwave network; essentially, to replace the circuit and move the production traffic to the new line. I did ask Paul and Jeremy to keep the existing path in place so that we could test it and find the cause of the PCV errors.
Before the circuit was replaced, I took a closer look at the issue. After watching the alarms on syslog for a while, I decided to set up a regex for ‘PCV’ to page my cell as soon as the event took place. I suspected that syslog wasn’t telling me the entire story. After receiving the first page, I quickly ssh’d into the router and looked at the troubled interface:
LAKESIDE_NEC_WAN>sho int t1 3/4
t1 3/4 is UP
  Description: ### T1 To Frisco ###
  Receiver has no alarms
  T1 coding is B8ZS, framing is ESF
  Clock source is internal, FDL type is ANSI
  Line build-out is 0dB
  No remote loopbacks, No network loopbacks
  Acceptance of remote loopback requests enabled
  Tx Alarm Enable: rai
  Last clearing of counters 2w 5d 00:39:24
    loss of frame : 0
    loss of signal : 0
    AIS alarm : 0
    Remote alarm : 0
  DS0 Status: 123456789012345678901234
              NNNNNNNNNNNNNNNNNNNNNNNN
  Status Legend: '-' = DS0 is unallocated
                 'N' = DS0 is dedicated (nailed)
  Line Status: -- No Alarms --
  5 minute input rate 592 bits/sec, 1 packets/sec
  5 minute output rate 6264 bits/sec, 7 packets/sec
  Current Performance Statistics:
    1 Errored Seconds, 0 Bursty Errored Seconds
    1 Severely Errored Seconds, 0 Severely Errored Frame Seconds
    0 Unavailable Seconds, 2048 Path Code Violations
    0 Line Code Violations, 0 Controlled Slip Seconds
    0 Line Errored Seconds, 0 Degraded Minutes
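If you want to watch these counters without eyeballing the whole screen every time, the statistics block is regular enough to parse. Here is a minimal Python sketch, assuming the output format shown above; the function name parse_counters is my own, and nothing here is an Adtran tool.

```python
import re

# The statistics block copied from the show output above.
SAMPLE = """\
Current Performance Statistics:
1 Errored Seconds, 0 Bursty Errored Seconds
1 Severely Errored Seconds, 0 Severely Errored Frame Seconds
0 Unavailable Seconds, 2048 Path Code Violations
0 Line Code Violations, 0 Controlled Slip Seconds
0 Line Errored Seconds, 0 Degraded Minutes
"""

# A number followed by one or more capitalized words, e.g. "2048 Path Code Violations".
COUNTER_RE = re.compile(r"(\d+) ([A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)")


def parse_counters(show_output):
    """Return {counter name: value} for everything after the statistics header."""
    _, _, stats = show_output.partition("Current Performance Statistics:")
    return {name: int(value) for value, name in COUNTER_RE.findall(stats)}


counters = parse_counters(SAMPLE)
# Report only the counters that are non-zero.
for name, value in counters.items():
    if value:
        print(f"{name}: {value}")
```

Run against the sample, only the three non-zero counters (Errored Seconds, Severely Errored Seconds and Path Code Violations) are printed.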
Syslog wasn’t giving me the entire picture. The PCV threshold alarm it gave me was a clue. The PCVs were caused by the errored second (ES) and the severely errored second (SES) that followed it. The ES/SES counters would clear within the next five minutes. These never made it to syslog because a single ES didn’t violate any system thresholds, whereas the PCV errors did.
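The paging rule itself is nothing fancy: a pattern match on each syslog line as it arrives. A rough Python sketch of the idea follows. The first sample line is hypothetical wording (I am not quoting the real alarm text), and should_page is a made-up name; hooking the result up to an actual pager is left out.

```python
import re

# Match PCV as a whole word so it does not fire on unrelated tokens.
PCV_RE = re.compile(r"\bPCV\b")


def should_page(syslog_line):
    """True when a syslog line mentions PCV and is worth a page."""
    return PCV_RE.search(syslog_line) is not None


sample_lines = [
    "T1:t1 3/4 PCV 24 hour threshold exceeded",     # hypothetical wording
    "INTERFACE_STATUS:t1 1/2 changed state to up",  # ordinary status message
]

for line in sample_lines:
    print(should_page(line), line)
```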
This confirmed it. ES and SES are mostly physical problems with cable or a MUX. When the snow melted, Paul and Jeremy drove up to each mountain and patched in a new path. The path for both of the links went from HQ to Porter Mountain to Greens to South Mountain to Frisco. When they went to Frisco, I had them replace the single port Adtran 1g 3205 with a 3g 3205 with two T1 interfaces and the latest firmware. The changes to the AOS were interesting. Because we use bridging on this network (I know, friends don’t let friends bridge networks; be gentle if you have comments about this configuration), I had to configure what is called a BVI (bridge virtual interface) to tie an IP address to the router. This took me a while to figure out. In Cisco land, I would use a loopback interface. This will not work for Adtran routers. If you want to bridge and have in-band access to your router, you have to configure the BVI.
I had them terminate both of the circuits into the new replacement router and configured the newly provisioned T1 path back to HQ. Then I started to test the old link.
I had a sneaking suspicion that the router at the far end could be the culprit. I had replaced that router once before and still had issues with the link. Since we had just replaced the router again, I ran an in-place test on the old path with the new router, alongside the new production link. Sure enough, the old path gave PCV errors. The new path was solid. It wasn’t the router on the Frisco end.
Because we had five physical hops from end to end, the problem could be any one of the patch cables at each site. I decided to run a test simultaneously from both ends of the link. With the radio guys looping back both ends at Greens, each router could send test patterns toward its own loop. The topology after the loopbacks looked like this:
- Router — HQ — Porter Mountain — Greens (loop)
- Router — Frisco — South Mountain — Greens (loop)
After entering enable mode on the router, I chose the interface that I wanted to test and issued the following commands:
FRISCO_NEC_WAN#conf t
FRISCO_NEC_WAN(config)#int t1 1/2
FRISCO_NEC_WAN(config-t1 1/2)# test-pattern p215
FRISCO_NEC_WAN(config-t1 1/2)# do sh int t1 1/2
t1 1/2 is IN TEST
  Description: ### suspect T-1 to Lakeside ###
  Receiver has no alarms
  T1 coding is B8ZS, framing is ESF
  Clock source is through t1 1/1, FDL type is ANSI
  Line build-out is 0dB
  No remote loopbacks, No network loopbacks
  Acceptance of remote loopback requests enabled
  In Test: Sending 2^15-1 pattern
  Tx Alarm Enable: rai
  Last clearing of counters 01:54:29
    loss of frame : 0
    loss of signal : 0
    AIS alarm : 0
    Remote alarm : 0
  DS0 Status: 123456789012345678901234
              ------------------------
  Status Legend: '-' = DS0 is unallocated
                 'N' = DS0 is dedicated (nailed)
  Line Status: -- No Alarms --
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 0 bits/sec, 0 packets/sec
  Current Performance Statistics:
    0 Errored Seconds, 0 Bursty Errored Seconds
    0 Severely Errored Seconds, 0 Severely Errored Frame Seconds
    0 Unavailable Seconds, 0 Path Code Violations
    0 Line Code Violations, 0 Controlled Slip Seconds
    0 Line Errored Seconds, 0 Degraded Minutes
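For anyone curious what a 2^15-1 pattern actually is: it is a pseudo-random bit sequence that repeats every 32,767 bits, so the receiver can predict every bit and count the ones that come back wrong through the loop. The Python sketch below assumes the conventional PRBS-15 generator polynomial x^15 + x^14 + 1 from ITU-T O.151. How AOS seeds, inverts, or frames the pattern is not something I know, so treat this as an illustration of the principle rather than what the router does internally.

```python
from itertools import islice


def prbs15_bits(seed=0x7FFF):
    """Yield bits of a 2^15-1 sequence (assumed polynomial x^15 + x^14 + 1)."""
    state = seed & 0x7FFF
    if state == 0:
        raise ValueError("seed must be non-zero or the register stays stuck at 0")
    while True:
        # XOR the two oldest bits in the 15-bit shift register.
        bit = ((state >> 14) ^ (state >> 13)) & 1
        state = ((state << 1) | bit) & 0x7FFF
        yield bit


def count_bit_errors(received, seed=0x7FFF):
    """Compare received bits with the expected pattern and count mismatches."""
    expected = prbs15_bits(seed)
    return sum(1 for r, e in zip(received, expected) if r != e)


# One full period sent through a clean loop, then the same with one bit flipped.
sent = list(islice(prbs15_bits(), 32767))
looped = sent.copy()
looped[1000] ^= 1

print(count_bit_errors(sent))    # clean loop
print(count_bit_errors(looped))  # one corrupted bit
```

A clean loop gives zero mismatches, and every bit that gets corrupted along the path shows up as exactly one error.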
Before doing anything with my production routers (or any remote router for that matter), I usually schedule a reload. Have you ever screwed up a running configuration and had to drive to a site to reload it? Well, this has happened to me more than once. Next time, schedule a reload and your downtime will be a max of 15 minutes should you screw something up. Just be sure not to save the running config while you are testing. I run the following commands whenever I’m working on a router remotely:
FRISCO_NEC_WAN#rel in 15
Save System Configuration?[y/n]y
Reload scheduled in 15 minutes
You are about to reboot the system. Continue?[y/n]y
2010.12.07 09:57:28 OPERATING_SYSTEM System reboot scheduled in 15 minutes!
After this, I set a stopwatch or alarm clock on my cell phone to sound off after 12 minutes (T minus 3 minutes). This gives me a warning before the reload takes place. I can either extend the reload timer another 15 minutes if I need more time (by running the commands above again) or cancel the reload:
FRISCO_NEC_WAN#rel can
******RELOAD CANCELLED******
2010.12.07 09:57:34 OPERATING_SYSTEM Scheduled system reboot cancelled.
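The timing arithmetic is simple, but it is worth writing down so the warning and the reload never get confused. A small Python sketch, using the timestamp from the log above; reload_schedule is a hypothetical helper name, and the 15 and 3 minute values are just the ones I use.

```python
from datetime import datetime, timedelta


def reload_schedule(scheduled_at, reload_minutes=15, warning_minutes=3):
    """Return (warning time, reload time) for a reload scheduled at scheduled_at."""
    reload_at = scheduled_at + timedelta(minutes=reload_minutes)
    warn_at = reload_at - timedelta(minutes=warning_minutes)
    return warn_at, reload_at


# Timestamp taken from the "System reboot scheduled" log line.
start = datetime.strptime("2010.12.07 09:57:28", "%Y.%m.%d %H:%M:%S")
warn_at, reload_at = reload_schedule(start)

print("alarm at ", warn_at.time())
print("reload at", reload_at.time())
```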
After running the test for 2 hours, I got my first error on the HQ end router. Nothing showed up on the Frisco end. The logical next step was to ask the radio guys to loop both ends at Porter. My idea was that if the error moved from the HQ side to the Frisco side, the problem was between Greens and Porter. If not, then the problem was between HQ and Porter or in the cable between the router and HQ. After taking the loop off Greens and extending it to Porter, I got this interesting series of syslogs:
T1:t1 1/2 Tx Yellow, Red
INTERFACE_STATUS:t1 1/2 changed state to down
T1:t1 1/2 LIU eq bumped
T1:t1 1/2 No Alarms
INTERFACE_STATUS:t1 1/2 changed state to up
It makes sense that the link would bounce. The most interesting part was the LIU eq bumped informational log message. LIU stands for Line Interface Unit. The LIU is part of the T1 framer in the Adtran NIM (network interface module). The framer locates the frame and multiframe boundaries and monitors the data stream for alarms. The cool thing about this alarm (I think, because I couldn’t find it anywhere in the documentation) is that when the link was extended to Porter, which added about 100 miles to the loop, it calculated on the fly that the distance on the link had changed and ‘bumped’ the frame boundaries. This calculation uses the speed of light as a constant. I don’t know why, but I thought this event was really cool to see; it’s like watching theory in action.
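To put a number on that, here is a back-of-the-envelope calculation in Python. It only shows the scale of the change and says nothing about how the LIU actually reacts to it. The assumptions are mine: the 100 miles is treated as the total path length added, and the signal is taken to travel at the vacuum speed of light (velocity_factor of 1.0, close enough for microwave hops, while cable sections would be slower).

```python
MILES_TO_KM = 1.609344
C_KM_PER_S = 299_792.458   # speed of light in vacuum
T1_BPS = 1_544_000         # T1 line rate in bits per second
BITS_PER_FRAME = 193       # 24 DS0s x 8 bits + 1 framing bit


def added_delay_seconds(extra_miles, velocity_factor=1.0):
    """Extra one-way propagation delay for the added path length."""
    return extra_miles * MILES_TO_KM / (C_KM_PER_S * velocity_factor)


delay = added_delay_seconds(100)
bits_in_flight = delay * T1_BPS
frames_in_flight = bits_in_flight / BITS_PER_FRAME

print(f"{delay * 1e6:.0f} microseconds")
print(f"{bits_in_flight:.0f} bit times")
print(f"{frames_in_flight:.1f} T1 frames")
```

That works out to roughly half a millisecond, or a little over four T1 frames' worth of bits, so the frame boundaries the receiver had locked onto would land somewhere else once the loop moved.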
At this point, the most logical place to check next was Porter. Jeremy went up to check the cabling. He called me and said that he was going to shake the cables around and asked me to check for errors. Nothing. Jeremy looped up Porter and I got an alarm during lunch.
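The reasoning behind moving the loop from site to site can be written down as a simple narrowing procedure. The Python sketch below assumes a single fault somewhere along the path and treats each router and its patch cable as the first or last segment; the function names are mine, invented for illustration.

```python
# Sites in order from the HQ router to the Frisco router.
PATH = ["HQ router", "HQ", "Porter Mountain", "Greens",
        "South Mountain", "Frisco", "Frisco router"]


def split_at_loop(path, loop_site):
    """Return (segments tested from HQ, segments tested from Frisco)."""
    i = path.index(loop_site)
    segments = list(zip(path, path[1:]))
    return segments[:i], segments[i:]


def narrow(suspects, path, loop_site, errored_side):
    """Keep only the suspects covered by the side that reported errors."""
    hq_side, frisco_side = split_at_loop(path, loop_site)
    tested = hq_side if errored_side == "HQ" else frisco_side
    return [segment for segment in suspects if segment in tested]


all_segments = list(zip(PATH, PATH[1:]))

# Loop at Greens, errors on the HQ side only.
after_greens = narrow(all_segments, PATH, "Greens", "HQ")

# Loop moved to Porter: the two possible outcomes.
if_moved = narrow(after_greens, PATH, "Porter Mountain", "Frisco")
if_stayed = narrow(after_greens, PATH, "Porter Mountain", "HQ")

print("error moves to Frisco side:", if_moved)
print("error stays on HQ side:   ", if_stayed)
```

If the error follows the loop to the Frisco side, only the Porter to Greens segment is left. If it stays on the HQ side, the suspects are the HQ to Porter hop and the router end at HQ.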
At this point I decided to loop up the interface at the HQ router and move the test to another port. The PCV value was always 2048, which I found significant. If it were a cabling problem, the error counts would vary. So, I put a hard loop on an RJ-45 and kept the tests running after clearing the counters. Sure enough, port t1 3/4 reported an error.
All this time it was a faulty t1 port in my HQ router!
We’ll see if Adtran is going to replace the network module. This was one heck of a troubleshooting day!