Mysterious Disk Consumption

Posted: November 9, 2010
Way out west there was this system… this system I wanna tell ya about. A System Application that’s usually called a Network Management Application. At least that was the handle his loving users would give it, but it never had much use for that name. The Network Management Application, it was called “The Dude”. Now, “Dude” – that’s a name no one would self-apply where I come from. But then there was a lot about The Dude that didn’t make a whole lot of sense. And a lot about what it does and where it lives, likewise. They call VMWare the “Kingdom Of Virtualization.” I didn’t find it to be that, exactly. But then again, maybe that’s why I found the place so darned interestin’. Course I can’t say I’ve seen London, and I ain’t never been to France. And I ain’t never seen no queen in her damned undies, so the feller says. But I’ll tell you what – this here story I’m about to unfold, well, I guess I seen somethin’ every bit as stupefyin’ as you’d see in any of them other places. And in English, too. So I can die with a smile on my face, without feelin’ like the good Lord gypped me. Now this here story I’m about to unfold took place back in 2010. I only mention it because sometimes there’s a application… I won’t say a hero, ’cause, what’s a hero? But sometimes, there’s a system monitoring application – and I’m talkin’ about The Dude here. Sometimes, there’s an application, well, there’s an application for its time and place. It fits right in there. And that’s the Dude, in VMWare. And even if it’s a lazy application – and the Dude was most certainly that. Quite possibly the laziest of all VMWare guests, which would place it high in the runnin’ for laziest systemwide. But sometimes there’s a application, sometimes, there’s a application. Aw… I lost my train of thought here. But… aw, hell. I’ve done introduced it enough…
Several times, The Dude has saved my butt. This is a story of one such time.
One time, we were working on moving a bunch of data off a system that was having trouble with its hard drives. We used an iSCSI application called Openfiler. Openfiler is another piece of software that deserves its own post. You take old server hardware (something you would normally scrap), insert a bunch of hard drives, and present them as an iSCSI target to Windows or ESXi. Again, if you haven’t noticed, I’m into accomplishing just as much as the next guy with as little equipment or money as possible. Leave it up to the people that spend money like it grows on the power lines to tell you what works and what doesn’t. For those of us who prefer to learn from our own mistakes, just make sure a mistake doesn’t set you back or put your cooperative in a bind.
Anyway, I was using an Openfiler application on CentOS to host an iSCSI target to an old Windows 2003 server. I wanted to monitor it. Reasons for monitoring a host that is maintaining your storage should be self-explanatory. While these machines are relatively bulletproof, occasional issues can occur. A hard drive failing is an all-too-common problem that I worry about. It seems like the only time you find out about this is when you go into the server room and find amber lights. Dell servers (yuck) are the ones that are terrible about this lately. Their PERC RAID controller cards are known to fail (I have a new horror story about a hardware failure). The point is that it would be nice to get alerts beyond visual ones that will let you know when your RAID is operating in a degraded state.
In comes The Dude. I wanted to use SNMP to monitor my Openfiler box, hosted on a Dell PowerEdge 2650. I used conary (Openfiler’s package manager) to install the SNMP agent on my Openfiler distribution and configured it to be polled by my Dude virtual machine. This was a pretty simple task, and it is outlined here if you care to know exactly how I did it. Really, I wanted to be sure that the SNMP agent could monitor the PERC disk subsystem, and I set up a probe on the Dude to notify me when a hard drive or other disk-subsystem problem arises. When I installed the agent software, the Dude automagically mapped out CPU time, virtual memory, network traffic, and other interesting things, all in a nice and pretty little chart. I noticed that if I hovered my mouse over a device, it would show the chart along with all of the notes I had put on each device. This was a cool feature that I didn’t even have to configure, and it will eventually be the reason that I’m writing this long and drawn-out post about how important it is to monitor your systems…
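For what it’s worth, the polling side of that setup boils down to a couple of lines in snmpd.conf. This is only a sketch: the community string, the Dude VM’s address, and the contact details below are placeholders, not what I actually used.

```
# /etc/snmp/snmpd.conf -- minimal read-only polling setup (sketch).
# "public" and the 192.168.10.50 address are placeholders for your
# own community string and the Dude VM's IP.
rocommunity public 192.168.10.50
syslocation ServerRoom
syscontact  admin@example.com
```

After restarting snmpd, the Dude just needs the same community string configured on its side to start polling.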
The Openfiler iSCSI appliance ran for several months without any issues. Upon reviewing the trend data offered by the chart, I noticed that the disk usage was increasing at a steady, linear rate. This was not an immediate cause for alarm. I figured that a cron job would eventually kick off and knock the disk usage back down after three months or so. Nope, it didn’t happen. Apparently, this was not going to be a benign issue. After four months of watching free disk space disappear, I reached an eyebrow-raising threshold of disk space consumption on the OS: 81%. To me, anything over 75% is maximum for a production system. This forced me to take a look into the guts of the system.
I ran du -h > ~/6-24.log to capture the current disk usage and spit it out to a file. I also made a note of this in the Dude system monitor to remind me to follow up the next week. Thinking that this process would grow like cancer and eventually destroy my Openfiler application, I thought, “Let’s just reboot it.” Ben, my systems administrator, talked me out of it. “Rebooting is a cop-out,” he suggested. Let’s find out what the real cause of the problem is while the cancer is being monitored and studied. Besides, there are always people on that disk; if we took it offline for even a minute, we would get complaints.
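The snapshot-and-compare routine I settled into can be sketched like this. The paths and the dd step are purely illustrative (a demo directory standing in for the real filesystem); in practice you would point du at / or whatever filesystem is growing.

```shell
# Sketch of the weekly du snapshot-and-compare routine.
# /tmp/du-demo stands in for the real filesystem being watched.
rm -rf /tmp/du-demo
mkdir -p /tmp/du-demo/data

du -sk /tmp/du-demo > /tmp/du-week1.log          # snapshot, week 1

# Simulate a week of growth (64 KB of new data).
dd if=/dev/zero of=/tmp/du-demo/data/grow bs=1024 count=64 2>/dev/null

du -sk /tmp/du-demo > /tmp/du-week2.log          # snapshot, week 2

# Any output here means something on disk actually grew.
diff /tmp/du-week1.log /tmp/du-week2.log || true
```

In my case the second diff came back empty, which is exactly what made the problem interesting: the disk was filling, but no file was growing.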
I waited another week and found no change in the disk-consumption pattern. I ran the ‘du’ utility again, logged it to another file, and compared the numbers against what I had taken the week prior. To my surprise, nothing in the file system had changed. At this point, I called my brother in Germany.
My brother has lived in Germany and Russia for most of his adult life. He speaks six languages and graduated with a music and German major. He became a fairly successful musician. Some of his work was on the MTV show “The Real World”. Later, he decided to manage his own music label. He was searching for software that would help US-based bands manage their European tours (something he would set up for many bands in the future). He couldn’t find a package that accomplished everything he wanted… So, he picked up a couple of PHP, MySQL and Perl books and started programming. He designed a whole system in pursuit of his hobby and business as a musician. He is now a very, very good programmer who develops on Linux platforms and does freelance work for European companies.
“Seth, I got this problem with CentOS. It is eating up disk space. I did a du last week, compared the results and can’t find the files that are growing.”
“Wow. That’s really weird. Which part of the file system do you think it’s in? It could be /proc; if it is, it might be virtual memory that you’re counting over SNMP, and that would show up as disk space. It sounds like a leaky daemon or something. Did you try a ps -ef?”
That command gives the administrator a snapshot of every process running on the system at that point in time. He pointed me in the right direction. I found that the SNMP daemon itself was the leaky application – that’s right, the application that allows me to visualize, monitor, and chart system performance and point out potential issues was telling on itself. snmpd was holding a large amount of memory. I restarted the process and watched the result in The Dude – disk space usage went down from 81% to 57%!
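The hunt Seth suggested really comes down to sorting processes by resident memory. With a procps-style ps (standard on CentOS and most Linux distributions), that looks like this; a leaky daemon such as snmpd floats to the top of the list over time:

```shell
# List all processes sorted by resident memory (RSS, in kilobytes),
# largest first. The --sort option is GNU/procps-specific.
ps -eo pid,rss,comm --sort=-rss | head -n 10
```

Plain ps -ef works too, as Seth said, but sorting by RSS makes the memory hog jump out immediately instead of requiring you to eyeball the whole list.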
Huh… well, let’s fix this so I don’t have to do it again. I went into the crontab and simply made the SNMP daemon restart every week. I could have dug further into why snmpd was hogging memory and not cleaning up after itself, but I didn’t feel that it was necessary. I don’t like messing with systems like these unless I have time to perform regression tests on the software updates. In this case, it wasn’t worth the trouble. It isn’t like a missed poll is going to hurt anything. So far, it has been running like this for about a year without complaint. In fact, I got some interesting saw-tooth charts to look at, and I can now tell an interesting story about how to keep memory cancer from getting out of control.
#!/bin/sh
# Restart the leaky SNMP daemon; scheduled from cron to run weekly.
service snmpd restart
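And the crontab entry that drives it might look like the line below. The schedule and the script path are assumptions on my part (the post doesn’t say where the script lives or exactly when it fires), but the shape is standard cron: minute, hour, day-of-month, month, day-of-week, command.

```
# Hypothetical crontab entry: run the restart script at 03:00 every Sunday.
# Path and timing are illustrative, not taken from the original system.
0 3 * * 0 /root/scripts/restart-snmpd.sh
```

An early-morning weekend slot keeps the one missed SNMP poll in a window where nobody is watching the charts anyway.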