Measuring NISC’s iostat

“Ian, I know I sound crazy because the last Administrator hated it when I called in…  The NISC system is running slow and you really need to do something about it!”

“Let me look into it.  I’m looking at ‘top’ which shows me all processes that may be out of control.  Everything appears normal – about 4% CPU utilization.  Let me look at the history of this system on our network monitor… nope.  Everything looks normal.  Can you tell me when you noticed the slow down?”

“About 20 minutes ago.  I’m really surprised that you don’t see anything there.  Are you sure you are looking at the right system?”

“Yep.  We only have one NISC machine and it’s right here.  Let me go into the server room and look at some blinking lights… I’ll give you a call back if I find something.”

As I walked into the server room to look at the SAS JBOD enclosure, I was reminded of the ILOVEYOU worm in 2000.  I was a systems analyst for a SOC unit in Stuttgart when the worm hit.  It was what is known as a “zero day” attack; no antivirus vendor had a fix for it because no one knew it existed until it was in the wild.  I can’t forget that helpless feeling as my Microsoft Outlook inbox filled with messages containing the subject line “ILOVEYOU” from everyone I knew. Each message carried a little attachment that, at first glance, looked like an innocuous Notepad icon.  When people clicked on it expecting to read a love letter from a friend, the worm activated:  first it overwrote all of the Visual Basic scripts on your computer with its own malignant code, which enhanced the worm’s propagation.  Then it went through your global address list and mailed itself to everyone you knew.  One of the systems administrators clicked on the worm first, releasing it into the wild on my LAN.  “Ian!  Do something!” the guilty NCOIC yelled.

Walking into the server room, I could feel the heat from the dozens of 10k RPM drives, all with lights a’flashin. In Stuttgart, I simply pulled the network cable from the server to isolate the spread of the worm, which also calmed the drives.  Here, I wasn’t sure what the problem was, and I wasn’t about to pull the network cable.  I could already tell it wasn’t memory or a runaway process.  This was an I/O bottleneck.  But from what process?

Back in the days of early computing, disks were managed directly by the operating system; in fact, the “D” in DOS stands for Disk Operating System.  Because RAID and other methods of data redundancy and resiliency are now common, disk subsystem controller cards have come into widespread use.  These little cards handle all of the disk I/O on your server, virtualizing that logical “C:” or “D:” drive (as Windows users may refer to it) by spreading the data across two or more physical disks using one of the various versions of RAID that the card supports.  Since this abstract hardware layer was introduced, many operating systems are unaware (e.g. the “D” has been removed from “OS”) of what is actually happening with the physical hardware they are using. Unless a systems administrator loads gobs of proprietary software which enhances OS communications with the disk subsystem, Duh-OS will run in the blind, requiring you to glance at blinking amber lights every day to be notified of problems should they arise.  In my experience, this software is rarely loaded, and being in a new environment, I had to rely on native OS tools to find the root cause of this slowdown.

Because the NISC application is running sweet, sweet, flexible, and fresh Linux, I PuTTY’d in and ran ‘iostat’ to look at the amount of time it was taking for disk operations to complete per logical spindle.  This command and its clones are disk subsystem independent: the OS must always know how long the controller or disks are taking to actually perform a read or write operation.  This is a necessary evil no matter how fancy-expensive your SAN or DAS disk system is…
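If you haven’t read iostat output before, the columns that matter most for spotting a bottleneck are ‘await’ (average milliseconds each I/O spends queued and serviced) and ‘%util’ (how busy the device is). Here is a minimal sketch of pulling the %util column out of an extended-stats line with awk; the sample line is made up for illustration, not real output from this system:

```shell
# Illustrative iostat -xkd device line (fields: device, rrqm/s ... %util).
# The values here are fabricated; %util is the last field.
line="sdb 0.00 12.40 1.20 88.60 4.80 404.00 9.10 4.52 50.30 10.90 97.90"
util=$(echo "$line" | awk '{print $NF}')   # grab the last field: %util
echo "$util"
```

A device sitting near 100 %util with climbing await times, while everything else on the box looks idle, is exactly the signature described above.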

Huh… Unacceptable iostat on /dev/sdb.  Better call NISC support.

After listening to the contradictory “Your call is important to us. Please continue to hold and your call will be answered in the order it was received…” for 18 minutes, the I/O calmed down on the system.  Once the NISC tech was on the phone, it became a matter of proving that we actually had a problem. Having no data or evidence, I could not prove anything, nor could I figure out why this slowdown happened or whether it would happen again.  I sounded like a 1990s street-show idiot with signs hung front-and-back preaching that an economic slowdown was coming if we didn’t change our investing ways.  With no proof, frustrated and somewhat embarrassed, I hung up the call.

At this point, the lesson that I should have learned was to not disturb the NISC support clergy by questioning why the slowdown occurred.  Should a user call about this in the future, I will now just ignore them and let it work itself out… just like it did here.
Alas! I didn’t learn my lesson.  Instead, I now wanted to graph the iostat on The Dude.  Next time this happens, I’ll have empirical proof of the slowdown.

Take that – life lessons of complacency!

Setting up SNMP on NISC

First, I had to configure the SNMP daemon on the NISC machine.  This was pretty simple, and there are several how-tos on the Internet about setting it up.  I just ran the configure script to set up a basic /etc/snmpd.conf file. If the config script does not run, call NISC and ask them to install SNMP on your system.  After setting up SNMP, I had to modify the snmpd.conf file to present the iostat output via SNMP when the object is queried or walked by The Dude.  This required three things to be configured on my NISC box:

  1. A script that runs ‘iostat’ and spits the output to a file in /tmp
  2. A cron job that runs the ‘iostat’ shell script
  3. The installation (plagiarizing) of Mark Round’s Perl script, which formats the file so snmpd can present the iostat data under an experimental OID

It sounds complex but I will make it very simple for you by providing the code and step-by-step directions. (Note: pretty much all of this information was taken directly from Mark Round’s Blog)

Install Mark’s Scripts

Login to NISC’s server and type the following commands:

cd ~
tar -xzf cacti-iostat-1.6.tar.gz
cp ./cacti-iostat-1.6/scripts/* /usr/local/bin/

Setup the Cron Job

This will create the iostat cache file once every minute, and will sample for 30 seconds.

cat > /etc/cron.d/iostat
*/1 * * * * root cd /tmp && iostat -xkd 30 2 | sed 's/,/\./g' > io.tmp && mv io.tmp iostat.cache
(press CTRL-D)
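The sed in that one-liner is worth a note: on a SuSE box with a European locale, iostat can print decimal commas, which the Perl script downstream won’t parse, so the commas are normalized to dots before the cache file is written. A quick illustration (the input line is fabricated, not real iostat output):

```shell
# Normalize locale decimal commas to dots, exactly as the cron job does.
echo "sdb 0,00 12,40 97,90" | sed 's/,/\./g'
```

If your iostat already prints dots, the sed is a harmless no-op.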

Modify snmpd.conf

Use your favorite Linux text editor to add the following line to the end of your snmpd.conf file:

pass . /usr/bin/perl /usr/local/bin/

Edit Mark’s Script

I found it necessary to modify Mark’s Perl script to work with NISC and The Dude.  NISC’s (SuSE’s) iostat contains some extra counters that Mark’s script doesn’t map out properly.  And because the values go out over SNMP as integer gauges, I had to get rid of the decimal point for The Dude to chart them precisely.  So, I multiply each counter by 100 here and divide by 100 when setting up the probes on The Dude.  Open Mark’s script in your favorite text editor and modify the Linux section to look like this:

if ($ostype eq 'linux') {

 $stats{"$base_oid.1.$devices"}  = $devices;  # index
 $stats{"$base_oid.2.$devices"}  = $1;                # device name (a string, so no scaling)
 $stats{"$base_oid.3.$devices"}  = $2*100;            # rrqm/s
 $stats{"$base_oid.4.$devices"}  = $3*100;            # wrqm/s
 $stats{"$base_oid.5.$devices"}  = $4*100;            # r/s
 $stats{"$base_oid.6.$devices"}  = $5*100;            # w/s
 $stats{"$base_oid.7.$devices"}  = $6*100;            # rsec/s
 $stats{"$base_oid.8.$devices"}  = $7*100;            # wsec/s
 $stats{"$base_oid.9.$devices"}  = $8*100;            # rkB/s
 $stats{"$base_oid.10.$devices"} = $9*100;            # wkB/s
 $stats{"$base_oid.11.$devices"} = $10*100;           # avgrq-sz
 $stats{"$base_oid.12.$devices"} = $11*100;           # avgqu-sz
 $stats{"$base_oid.13.$devices"} = $12*100;           # await
 $stats{"$base_oid.14.$devices"} = $13*100;           # svctm
 $stats{"$base_oid.15.$devices"} = $14*100;           # %util
}
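To make the multiply-by-100 trick concrete: an await of 3.25 ms leaves the box as the integer 325, and The Dude recovers 3.25 by dividing by 100 in the data source. A one-line sketch of the export side:

```shell
# A counter value of 3.25 (ms) is exported as the integer 325;
# The Dude's data source divides by 100 to recover 3.25.
awk 'BEGIN { printf "%d\n", 3.25 * 100 }'
```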

The Dude cannot graph strings, and Mark’s script presents the values as strings, so we need to change the sub getoid section to return them as gauges:

sub getoid {
 my $oid = shift(@_);
 print "Fetching oid : $oid\n" if (debug);
 if ( $oid =~ /^$base_oid\.(\d+)\.(\d+).*/ && exists( $stats{$oid} ) ) {
    print $oid . "\n";
    if ( $1 == 1 ) {
       print "integer\n";
    }
    else {
       print "gauge\n";
    }
    print $stats{$oid} . "\n";
 }
}

Setup The Dude Data Source

Now, all we have to do is configure a couple of data sources on The Dude to chart the results of the iostat command. Double-click the Charts button in the tree in The Dude, click the “Data Sources” tab on the right pane, and press “+” to add a new data source.  Be sure you divide by a scale of 100.000 to undo the multiply-by-100 in the Perl script.  I created a data source for every one of the counters presented by iostat; however, I only graph the %util for each device.  I’ll leave it to you to decide which counters matter to you.

Setup a Chart on The Dude

Setting up a chart is self-explanatory.  When you are done, you should have something that looks like this:

NISC Hourly iostat Chart


Gloat or Cry

Now you have some data to show NISC.  Next time something like this happens, you can say:

Hey!  What’s wrong with this picture?

Really bad iowait times on /dev/sdb

To which they would reply:

Hi Ian,

The performance issue you’re having with iVUE currently is that they’re running the conversion for iVUE 2.x.  I’ve spoken with the conversion folks and they anticipate about another 30 minutes and it should be complete. Thanks for sending the disk utilization jpegs, and here’s an explanation for some of it.  Your daily backup kicks off at 11:00 pm and finishes at about 2:50 am.   A database build for mobile is scheduled at 4:00 am along with a DQ.  Both of those only take a few minutes each.  There’s also BillCalcs running, typically around 7 am, and a couple about 11:20 am this month.

Let me know if you can hold off a little while longer until the conversion process completes.  If it’s not acceptable let me know and I can have the process killed.

To which you would reply:


Ah!  I forgot!

