Search

Monitoring CPU Runaway Processes

Summary

My event log recorded cpu_runaway processes from SNMP on our virtualized Exchange server, Leka, this week.  Usually we correlate a cpu_runaway with a scheduled task such as a backup or system maintenance process.  The cause of these was elusive.

The cpu_runaway probe is a Dude function that looks for overall utilization over 25% for 30 seconds.  I chose 25% as the threshold because our VM’s have quad processors.  One-quarter overall utilization means a single process could be in a runaway state.  I researched the times that the CPU runaway’s occurred and couldn’t find anything abnormal.  I started to setup the performance monitor to record utilization to a log file but I really wanted better resolution when the issue was actually happening.

the probe

The event shows CPU Runaway. What now?

The Problem

I wasn’t getting enough detail from the logs when researching why a cpu_runaway probe was asserted. I needed to record detail “on demand” of what is going on with the CPU when the probe asserts.

My Solution

Having scripts.log going to syslog, I figured I should leverage what I already had configured with the alerts.  The steps as I saw them:

  1. Setup a trigger on the Notification menu of The Dude
  2. When the cpu_runaway probe is asserted, Notification will tell the ProcMon script to execute on the troubled server and record information for a certain amount of time
  3. Setup the script as a scheduled task on the machine to be monitored
  4. The Dude will execute this task remotely using schtasks.exe
  5. Script execution is logged to the dude (see: Script Monitoring for more information on how I did this)
  6. A CSV file is written recording what happened
  7. Script completion is logged to The Dude with the file name of the CSV file written

The end result is a detailed process log file on the server that provides more detail of the processes running which are causing the cpu_runaway condition.

The Implementation

First, I needed to find a way to log Windows process information to a file from the command line. Then I would compile a script. I looked up tasklist.exe but this utility did not show CPU time per process, unfortunately.  Then I found typeperf.  This would do exactly what I wanted.  I compiled the following script and placed it in c:\scripts\ProcMon.bat:

@echo off
rem This script gives 5 minutes of CPU time, IO, memory, paging, etc. of all processes at 10 second intervals.
rem When completed, the redulting CSV file will be located in c:\scripts\ProcMon{Random}.csv
set Proc=%random%
echo ProcMon.bat triggered %date% %Proc% >> c:\scripts\scripts.log
typeperf -qx "\Process" -o c:\scripts\ProcMonconfig.txt
typeperf -cf c:\scripts\ProcMonconfig.txt -o perf.csv -f CSV -y -si 10 -sc 60
ren perf.csv ProcMon%Proc%.csv
del Procmonconfig.txt
echo ProcMon.bat complete. See ProcMon%Proc%.csv for detail >> c:\scripts\scripts.log
set Proc=

Following writing this script, I setup a scheduled task called ProcMon on my Exchange server pointing to it.  Upon initial setup, I set it to execute once on a date in the past:

This script would be triggered by The Dude when the cpu_runaway probe asserts.  The method used was with the schtasks.exe command line application.   So, I setup a notification on The Dude:

Note: The Dude must have permissions to execute a task on the remote server. Also, it is implied that the 'Device.Name' is the server's actual name.

 

On the Leka Device Configuration options on The Dude, I selected the probe cpu_runaway, and clicked on Notifications.  The only notification that I had setup before was the log to events.  I added a check box next to my new notification “ProcMon”

 

Test My Implementation

I wanted to do a “real” test.  I needed a way to make the CPU run in circles really fast so that it would exceed the 25% threshold, log to events, and start the ProcMon script.  So, I opened two cmd prompts, ran the following command from c:\ (root) and minimized the windows.  Minimizing the cmd windows was the trick to increase cpu temporarily:

For /L %a in (1,1,10) do @(dir /s)

Essentially, this runs a system directory listing in a loop 10 times and will increase CPU and IO on the system which will trigger the script.

I see that the log file recorded is ProcMon15070.csv  I opened it with Excel, created a pivot chart and was able to identify the two “dir /s” as hog processes.

Conclusion

The cool thing is that stats for all running processes are being recorded on a 10 second interval stat.  This is useful should I really want to really dig into a problem. These 5 minute log files are about 1.2MB for my mail server; however, it will vary depending on the number of processes running on each system.

Now I will have more information to look at next time this happens. Cool!

Advertisements


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s