
When to Replace an ISP

Internet Connection Monitoring

Last night, we upgraded one of our Internet connections.   For several years we have been using a T-1 from Cellular One and a cable Internet connection from Cable One.  The Cable One Internet connection has been extremely reliable.  The Cellular One T-1 has not.  After gathering metrics for outages and being woken up with nightly text messages about a sick link, I decided to replace the Internet circuit with one from our ILEC, Frontier.

Coming to this conclusion required me to compile some reports on Internet link performance. Before I installed a network management system, the first clue I had that there was a problem was that my Internet radio kept timing out. When my radio went silent, it was a very “loud” alarm for me. How can I work without my Sirius Internet radio? Here’s a tip: if you don’t have a network management station set up, use an Internet radio station to monitor your uptime. If you see “Buffering…” a lot, it is an indicator that something might be wrong with your network or service provider.

I noticed that even though my Internet radio was timing out, there was no issue between my gateway and the outside port of the firewall. I assumed that the trouble was “down-line” from our local connection. So, I decided to map out the rest of the Cellular One network and monitor it. Doing a trace, I noticed that traffic always took the same path from where I am in Lakeside to Phoenix. The path went from our office to Porter Mountain, then Winslow, and finally to Qwest’s peer in Phoenix. I decided to monitor each of these connections with a ping test.
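
For anyone without a monitoring package, here is a minimal sketch of that kind of per-hop ping test in Python. It is not the tool I used. The hop addresses come from the trace later in this post; the short labels, the 2-second timeout, and the function names are assumptions for illustration, and the log line simply imitates the format of the logs further down. The ping flags are the ones for Windows and for Linux; other systems may differ.

```python
import datetime
import platform
import subprocess

# Hop addresses taken from the trace in this post; the labels are shorthand.
HOPS = {
    "porter-edge": "63.239.78.49",
    "winslow": "63.239.78.89",
}


def ping_once(address, timeout_s=2):
    """Send one ICMP echo with the system ping; True if the command succeeded."""
    if platform.system() == "Windows":
        # -n count, -w timeout in milliseconds
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), address]
    else:
        # Linux ping: -c count, -W timeout in seconds
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), address]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def check_hops(hops, last_state, probe=ping_once, now=None):
    """Probe every hop once and return log lines for state changes only.

    last_state is a dict that is updated in place, so repeated calls only
    report a hop when it flips between ok and timeout.
    """
    now = now or datetime.datetime.now()
    lines = []
    for name, address in hops.items():
        state = "ok" if probe(address) else "timeout"
        # A hop that has never been probed is assumed to have been ok.
        if last_state.get(name, "ok") != state:
            stamp = now.strftime("%Y.%m.%d-%H:%M:%S")
            lines.append(f"{stamp} <{address}>: Service ping on {name} ({state})")
        last_state[name] = state
    return lines
```

Calling check_hops(HOPS, state) every few seconds with the same state dict, and appending whatever it returns to a file, gives a timeout/ok log per hop. The probe argument exists so the logic can be exercised without touching the network.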

Cellular One Map

I was using a Sonicwall with Internet connection load balancing configured to use the “spill-over” method. With spill-over, I specify the point at which the firewall starts sending traffic through the secondary Internet connection. I use this because I do not want outbound traffic sent across the secondary connection unless the primary connection is overloaded. Because I used this method, it was important to set up static routes to control the source interface for the pings. The ping must originate from a specified interface in order to monitor the network devices down-line. For example: if I don’t have these static routes set up and the primary link fails, the ping will be redirected (spilled over) to the secondary link. The latency on the ping may be higher, but the device will still respond. This is not good – it defeats the purpose of monitoring.
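
To make the problem concrete, here is a toy model in Python of why an unpinned ping hides a dead primary link. It is only a sketch: the function names, the 80% threshold, and the decision rule are assumptions for illustration and are not how the Sonicwall is actually implemented. It also assumes the secondary link is healthy and can reach the monitored device.

```python
def pick_interface(primary_up, primary_load_pct, spill_threshold_pct=80, pinned=None):
    """Return the WAN interface a new flow would leave through (toy model)."""
    if pinned is not None:
        # A static route pins the flow to one interface, whatever its state.
        return pinned
    if not primary_up or primary_load_pct >= spill_threshold_pct:
        # Primary is down or overloaded: traffic spills to the secondary.
        return "secondary"
    return "primary"


def monitor_ping(primary_up, pinned=None):
    """Result of pinging a device that sits down-line on the primary provider."""
    egress = pick_interface(primary_up, primary_load_pct=10, pinned=pinned)
    # Assumption: the secondary link is up and can reach the device.
    link_up = {"primary": primary_up, "secondary": True}
    return "ok" if link_up[egress] else "timeout"


# Primary link is down in both cases.
print(monitor_ping(primary_up=False))                    # unpinned ping
print(monitor_ping(primary_up=False, pinned="primary"))  # pinned by a static route
```

With the primary link down, the unpinned ping still reports ok because it left through the secondary link, while the pinned ping reports the timeout I actually want to see.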

Communicate

I thought I’d let the ISP know that I was collecting stats on their connection.  After the first month, they said that they replaced some of their equipment:

From: Ian Fleming

Sent: Tue 12/22/2009 7:28 AM

To: [Tech]

Subject: RE: Link Issues

[Tech],

Last night I logged an outage that appears to begin with the winslow-core—porter-edge link at 1:12:30 AM today.  The outage lasted until 2:16:13 AM.  I’m still logging a flapping link, however.  Since this time, the Winslow-core link was unavailable for >2 seconds over 130 times.  The delays are getting worse – some up to 15 seconds of downtime.

Thanks,

-ian

Ian,

The outage last night was us we replaced some equipment.

[Tech]

Communicate More “Harder”

After collecting stats for eight months, I found that the persistent trouble was between the sites “porter-edge” and “winslow-core”.  I sent several e-mails to the ISP to address the problem (names changed to protect the guilty):

On 2010-04-28, at 7:33 AM, “Ian Fleming” <email@here.com> wrote:

[Tech],

I’m getting a lot of outages again from the porter-edge router hop .  There are several that affected our operation yesterday evening.  The major one was from 20:44 – 21:14.

Our backup provider’s circuit charge is 22% of this Cell One circuit fee and provides 10Mb of bandwidth.  Their gateways are mapped and monitored and we’ve found that they are infinitely more reliable than the Cell One circuit.  While I enjoy having access to two Internet connections, it is difficult for me to continue rationalizing a pricy and unreliable link.

Are there any chances of this being fixed permanently?  At this point I would like to be informed on what action Cell One is taking to fix this problem.

Thanks,

-ian

From: Ian Fleming

Sent: Thursday, August 05, 2010 10:01 AM

To: [Tech]

Cc: Ben Esparza

Subject: Winslow link

[Tech],

This morning I was paged due to an e-mail connectivity issue.  I found that the Winslow link was down again for 5 minutes from 5:25 to 5:30.  Are there any plans to fix this link?

Attached are my logs.

The alerts are being sent from outside (Google Postini) if one or all of our internet links are down.  It is a sideband alert mechanism (e.g. sent from another provider should our network be completely disconnected from the Internet).  I have no control of these pages because it is letting me know from the cloud if there are any troubles with those critical links (CableOne and CellOne).

Regarding the route taken:

For all Cellone IP addresses (and CableOne), I have static routes built to ensure that the correct physical path is being taken for monitoring purposes.  If one link is unavailable from CellOne, the ICMP will only be routed out the CellOne interface and not a backup link (CableOne).

Here is a trace:

C:\>tracert 63.239.78.89
Tracing route to winslow-core--porter-edge.ip.cellularoneonline.com [63.239.78.89] over a maximum of 30 hops:
1     2 ms     1 ms     1 ms  63.239.79.81
2     3 ms     3 ms     3 ms  porter-edge--ip-nec.cellularoneonline.com [63.239.78.49]
3    10 ms    11 ms    10 ms  winslow-core--porter-edge.ip.cellularoneonline.com [63.239.78.89]
Trace complete

Make the Decision

After all of the e-mails I sent, I continued to see long outages and slow reply times. Keep in mind that the other link from Cable One was being monitored just as closely. I only had one issue with the Cable One connection. For three weeks straight, the connection failed for 5 full seconds (!) during the week. When I called them out, I gave them my charts and logs. The tech mentioned something about a failed HVAC unit and (get this:) a power outage whose duration exceeded the backup battery’s runtime that day. “If you keep the power on, we can help keep that Internet connection stable for ya!” So, just about all the outages that occurred on Cable One were our fault in one way or another.

Here are the month’s logs which prove persistent link suckiness from Cellular One:

2010.11.03-05:07:52 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.03-05:08:10 <63.239.78.89>: Service ping on winslow (ok)
2010.11.04-04:40:04 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.04-04:40:13 <63.239.78.89>: Service ping on winslow (ok)
2010.11.04-04:41:34 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.04-04:41:43 <63.239.78.89>: Service ping on winslow (ok)
2010.11.04-04:43:13 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.04-04:43:31 <63.239.78.89>: Service ping on winslow (ok)
2010.11.05-05:32:35 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.05-05:32:53 <63.239.78.89>: Service ping on winslow (ok)
2010.11.05-06:20:53 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.05-06:21:02 <63.239.78.89>: Service ping on winslow (ok)
2010.11.05-10:44:14 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.05-10:44:32 <63.239.78.89>: Service ping on winslow (ok)
2010.11.06-03:30:46 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.06-03:30:55 <63.239.78.89>: Service ping on winslow (ok)
2010.11.07-07:06:04 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.07-07:06:13 <63.239.78.89>: Service ping on winslow (ok)
2010.11.07-10:47:16 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.07-10:47:22 <63.239.78.89>: Service ping on winslow (ok)
2010.11.08-00:52:06 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.08-00:52:15 <63.239.78.89>: Service ping on winslow (ok)
2010.11.08-06:03:24 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.08-06:03:42 <63.239.78.89>: Service ping on winslow (ok)
2010.11.08-07:37:46 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.08-07:37:55 <63.239.78.89>: Service ping on winslow (ok)
2010.11.08-18:51:20 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.08-18:51:29 <63.239.78.89>: Service ping on winslow (ok)
2010.11.10-11:08:43 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.10-11:09:01 <63.239.78.89>: Service ping on winslow (ok)
2010.11.13-18:50:51 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.13-18:51:00 <63.239.78.89>: Service ping on winslow (ok)
2010.11.13-19:47:21 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.13-19:47:39 <63.239.78.89>: Service ping on winslow (ok)
2010.11.14-03:18:01 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.14-03:18:40 <63.239.78.89>: Service ping on winslow (ok)
2010.11.14-07:40:43 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.14-07:41:01 <63.239.78.89>: Service ping on winslow (ok)
2010.11.14-11:01:31 <63.239.78.89>: Service ping on winslow (timeout)
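
Logs in this format are easy to turn into outage durations. The Python sketch below pairs each timeout with the next ok for the same service. The regular expression and function name are my own assumptions based on the lines above, and the sample is a handful of those lines copied verbatim. A trailing timeout with no matching ok is left out because its duration is unknown.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2010.11.03-05:07:52 <63.239.78.89>: Service ping on winslow (timeout)
LINE = re.compile(r"^(\S+) <([\d.]+)>: Service ping on (\S+) \((timeout|ok)\)$")


def outages(log_lines):
    """Return a list of (service, start_time, seconds_down) tuples."""
    open_since = {}
    found = []
    for raw in log_lines:
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip anything that is not a ping log line
        stamp = datetime.strptime(match.group(1), "%Y.%m.%d-%H:%M:%S")
        service, state = match.group(3), match.group(4)
        if state == "timeout":
            # Keep the first timeout if several arrive before an ok.
            open_since.setdefault(service, stamp)
        elif service in open_since:
            start = open_since.pop(service)
            found.append((service, start, (stamp - start).total_seconds()))
    return found


SAMPLE = """\
2010.11.03-05:07:52 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.03-05:08:10 <63.239.78.89>: Service ping on winslow (ok)
2010.11.14-03:18:01 <63.239.78.89>: Service ping on winslow (timeout)
2010.11.14-03:18:40 <63.239.78.89>: Service ping on winslow (ok)
2010.11.14-11:01:31 <63.239.78.89>: Service ping on winslow (timeout)
""".splitlines()

for service, start, seconds in outages(SAMPLE):
    print(f"{service} down at {start} for {seconds:.0f} s")
```

For the sample this reports outages of 18 and 39 seconds. The durations are only as fine-grained as the polling interval of whatever produced the log.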

I was upset to let them go; however, the replacement cost 60% less and is a lot more reliable.  Also, we only have 3 hops to our branch office VPNs.  Less latency and more reliability!  We are happy.

Staging the Cut-Over

At first, I thought that it would be something simple that I could do during the workday – but I was wrong. I usually put together a checklist before every cut-over. This checklist should list everything needed to perform the cut, in a specific order. The checklist includes checkpoints and tests. I also mark a “point of no return” for some cut-overs that require modification of sources out of my control (for example: a PRI cut that requires the telephone company to move a circuit from one set of wires to another).

The replacement Internet connection was coming in from another building and was on a VLAN over our switch gear.  Thinking that I would be able to peel off that VLAN on a sub-interface on the Sonicwall, my initial checklist mentioned enabling the sub-interface as a WAN connection and setting up the spill-over load balancing. After compiling the first checklist, I found out that Sonicwall does not support load-balancing over virtual interfaces.  The interface must be physical.

I re-compiled the checklist with the following steps:

◊ Save current firewall settings

◊ Move Cable One interface (X3 to X1)

◊ Label old X1 cable (CellOne) and set aside

◊ Configure Frontier interface (X3) (see configuration sheet #1)

◊ Check ping Frontier default gateway from LAN (X0)

◊ Remove CellularOne static routes from Sonicwall

◊ Change CableOne Gateway static route from interface X3 to X1 (see configuration sheet #2)

◊ Add Frontier static routes (see configuration sheet #3)

◊ Ping test all VPN’s

◊ Change public DNS A Records for ebill and secondary e-mail (see configuration sheet #4)

◊ Enable load balance/spillover (see configuration sheet #5)

◊ Configure fail-over probes (see configuration sheet #6)

◊ Pull cable X1 and test for fail-over

◊ Insert cable X1 and test for fail-back

◊ Pull cable X3 and test for fail-over

◊ Insert cable X3 and test for fail-back

◊ Configure NAT for new IP (see configuration sheet #7)

◊ Configure firewall for new NAT (see configuration sheet #8)

◊ Test syslog for new production routers

◊ Test SNMP for new production routers

◊ Test e-mail secondary fail-over
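
The same idea (ordered steps, tests as checkpoints, and a marked point of no return) can be sketched in Python. This is an illustration only: the Step class, run_checklist, and the placeholder actions are hypothetical, and marking the DNS change as the point of no return is an assumption made for the example, not something from my actual checklist.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True when the step or its test passed
    point_of_no_return: bool = False


def run_checklist(steps: List[Step]) -> Tuple[List[str], Optional[str], bool]:
    """Run steps in order and stop at the first failure.

    Returns (completed step names, failed step name or None, can_roll_back).
    can_roll_back turns False as soon as a point-of-no-return step is started.
    """
    done: List[str] = []
    can_roll_back = True
    for step in steps:
        if step.point_of_no_return:
            can_roll_back = False
        if not step.action():
            return done, step.name, can_roll_back
        done.append(step.name)
    return done, None, can_roll_back


# Placeholder actions; real ones would run the change or the test.
steps = [
    Step("Save current firewall settings", lambda: True),
    Step("Check ping Frontier default gateway from LAN (X0)", lambda: True),
    Step("Change public DNS A Records", lambda: True, point_of_no_return=True),
    Step("Pull cable X1 and test for fail-over", lambda: False),
]

done, failed, can_roll_back = run_checklist(steps)
print("completed:", done)
print("failed at:", failed, "| roll back still possible:", can_roll_back)
```

Stopping at the first failed checkpoint, and knowing whether the point of no return has been crossed, tells me whether to back out or push forward.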

Ready for Production

I went through the cut-over checklist and everything went as planned. There were some silly things that happened when I enabled the spill-over/load balance, but after rebooting the Sonicwall, everything worked as expected. There are a lot of things about the Sonicwall’s front-end that don’t make sense to me, but overall it is a pretty decent system.
