
HP P400 Controller – RAID 6 Array Trouble

Yesterday we had a scare.  A drive in one of our Elster-branded AMI (Advanced Metering Infrastructure) servers was flashing an amber light.  The server platform was an HP DL380 G5.  I found it interesting that the server panel did not go into alarm and the OS was not reporting a failure; only the drive showed any indication of a problem.

Upon looking into the problem, I guessed that it wasn’t actually a failure.  It was a predicted failure.  The SMART (Self-Monitoring, Analysis, and Reporting Technology) system built into each SAS drive monitors various indicators of reliability so that drive failures can be anticipated and reported.  This drive was a member of a RAID 6 volume on an HP P400 controller.  The cool thing about RAID 6 is that it can withstand two drive failures in an array before losing data.  This drive had only a predicted failure, so the array was not operating in a degraded state.
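To make the redundancy arithmetic concrete, here is a minimal Python sketch. It is an illustration only, not anything the P400 or the ACU exposes; the function name `array_state` and the state labels are made up for this example. It assumes the usual fault tolerance of one drive for RAID 1 (two-drive mirror) and RAID 5, and two drives for RAID 6.

```python
# Illustrative model only: names and labels are hypothetical, not HP terminology.
# Number of simultaneous drive failures each level survives without data loss.
FAULT_TOLERANCE = {"RAID 1": 1, "RAID 5": 1, "RAID 6": 2}


def array_state(level: str, failed_drives: int) -> str:
    """Describe an array's state given how many member drives have hard-failed.

    A SMART *predicted* failure does not count as a failed drive here,
    because the drive is still serving data.
    """
    if failed_drives < 0:
        raise ValueError("failed_drives cannot be negative")
    tolerance = FAULT_TOLERANCE[level]
    if failed_drives == 0:
        return "healthy"
    if failed_drives < tolerance:
        return "degraded, still redundant"
    if failed_drives == tolerance:
        return "degraded, no redundancy left"
    return "failed, data lost"


if __name__ == "__main__":
    for failed in range(4):
        print(f"RAID 6 with {failed} failed drive(s): {array_state('RAID 6', failed)}")
```

With a predicted failure only, the RAID 6 array counts zero failed drives, so the model reports it as healthy; one failed drive leaves it degraded but still able to survive another failure.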

I reviewed the firmware releases for the DL380 and found that the server was quite far behind.  The release notes suggested that there was an issue with the SMART system on one of the drives and that it required updating.  Hoping that the firmware update would clear the alarm, I loaded the freshly downloaded hp.com/go/foundation USB stick and installed the driver and firmware updates from within the Windows OS.  The utility notified me that a reboot was required and, without giving it a second thought, I notified the users of downtime and rebooted.

I watched the firmware installation as it updated the system BIOS, then the iLO, the NIC, the P400 controller, and finally the drives.  When it reached the drive in Bay 3 (the one with the predicted failure), the update froze.  There was a warning message not to interrupt the update process or data loss would occur.  It appeared that all of the drives in the enclosure were going through some sort of verification process.  The activity lights on the OS mirror array (RAID 1) were flashing back and forth between the two drives, while those on the RAID 6 array were flashing in succession, as one would expect if the data were being verified.  The troubling thing about this process was that there was no feedback from the P400 controller as to what was actually going on.  This left me to assume that the firmware update on the drive in Bay 3 had failed, and that the controller had decided something was wrong and initiated a verify on all of the arrays.  Thinking that this was going to take a long time, I called the users and informed them that downtime would exceed initial estimates.

The last time I ran a verify on a P400 controller, it took about 4 hours.  Sure enough, whatever the controller was doing took about that long.  I then heard the BIOS beep and Windows began loading.  Once Windows was up, I opened the HP Array Configuration Utility (ACU) to take a look at the array.

Predicted failure. Queued for rebuild without a rebuild option.

The ACU confirmed that there was a predicted failure on the drive in Bay 3.  It also indicated that the array was queued for rebuilding.  Thinking that all I had to do was pop in another drive, I inserted spares into Bays 7 and 8.  All this did was mark them as Unassigned Drives, as noted above.  The array did not start a rebuild or replacement…  Odd…

At this point I thought that I would have to add a “global spare” or assign a drive as a spare to rebuild to for each array.  I navigated to the P400 Controller and selected “Manage Spare Drives”.  The menu options that I was given only allowed me to assign the spare drives to Array A – the RAID 1 (mirror).  The option to select Array B was grayed out.

Attempt to add spares to RAID 6 array

Puzzled, I got on a chat session with HP tech support:

[Tuesday, February 01, 2011 3:08 PM] — Balaji Makams C says:

Ian, if the online spare drive is configured than it will automatically start rebuilding if there is a drive failure in the array.

[Tuesday, February 01, 2011 3:09 PM] — Ian Fleming says:

I cannot configure an online spare for the array. It will only let me setup a spare on the RAID 1 array

[Tuesday, February 01, 2011 3:12 PM] — Ian Fleming says:

The drive in bay 8 shows up in the ACU as an “Unassigned Drive”. There is no drive activity on the array; however, all of the drives in the array with the predicted failure have blue lights on them.

[Tuesday, February 01, 2011 3:14 PM] — Balaji Makams C says:

Ian, probably this drive might have not be assigned as spare.

[Tuesday, February 01, 2011 3:16 PM] — Ian Fleming says:

yes. When I click on the controller –> manage spare drives, it gives the option to assign the spare to an array. The only array that is available is Array A (which is the OS mirror array). It does not give the option to assign the spare to Array B (the RAID 6 with a drive in predicted failure). Is there a way to do this via the CLI?

[Tuesday, February 01, 2011 3:18 PM] — Balaji Makams C says:

Ian, you might be not getting option at assign as spare drive, a drive in the array has failed.

[Tuesday, February 01, 2011 3:20 PM] — Ian Fleming says:

But I can assign the spare drive to another array. Just not the [RAID 6] array with the predicted failure. I can access the logical drive with predicted failure via the operating system. I do not believe there is a failure on any drive – only a prediction.

[Tuesday, February 01, 2011 3:21 PM] — Balaji Makams C says:

Ian, please let me know the LED status on these drives.

[Tuesday, February 01, 2011 3:21 PM] — Ian Fleming says:

Bay 1&2 – Array A = mirror

Bays 3-6 – Array B – RAID 6 (bay 3 is blue – no green)

Bay 8 – Unassigned

No lights on bay 8

[Tuesday, February 01, 2011 3:21 PM] — Balaji Makams C says:

Okay.

[Tuesday, February 01, 2011 3:22 PM] — Ian Fleming says:

[Bay 7&8] shows up in the ACU as “Unassigned”

[Tuesday, February 01, 2011 3:27 PM] — Balaji Makams C says:

Okay.

[Tuesday, February 01, 2011 3:28 PM] — Balaji Makams C says:

Ian, please stay online for few minutes while I will check on this issue.

[Tuesday, February 01, 2011 3:32 PM] — Balaji Makams C says:

Ian, I am afraid it is not possible to assign one spare drive to RAID 6, as the the RAID 6 has fault tolerance of two drives.

[Tuesday, February 01, 2011 3:32 PM] — Ian Fleming says:

So, if a drive is in predictive failure, how do I replace it?

[Tuesday, February 01, 2011 3:33 PM] — Balaji Makams C says:

Ian, it might require two spare drives to assign two spare drives.

[Tuesday, February 01, 2011 3:33 PM] — Balaji Makams C says:

However, you can try assign it as a global spare.

[Tuesday, February 01, 2011 3:34 PM] — Ian Fleming says:

Where do I assign a drive as a global spare?

[Tuesday, February 01, 2011 3:35 PM] — Ian Fleming says:

Ok, I am about to insert another spare drive…  [There are two spare drives inserted]

[Tuesday, February 01, 2011 3:35 PM] — Balaji Makams C says:

You would need to assign in the ACU.

[Tuesday, February 01, 2011 3:35 PM] — Balaji Makams C says:

Please try it and let me know your observations.

[Tuesday, February 01, 2011 3:37 PM] — Ian Fleming says:

The ACU does not appear to have an option for a global spare. When inserting two drives, two of them show up under ACU as “Unassigned”. There are no additional options for adding spares to the RAID 6 array.

At this point, Balaji went through all of the things we spoke about above.  I ran a file backup of the array just in case it crashed as I prepared to create a hard failure by removing the drive in Bay 3.  This is our remote control chat session:

Hard fault shown after pulling drive from Bay 3

Rebuild process executing

Rebuild complete; parity initialization queued

ian fleming: This is not intuitive.. I had to make a hard failure to replace a predicted one…

ian fleming: can we get a new drive shipped to us?

balaji: sure I will send you the replacement drive

balaji: ian now try to assign spare drive and check

balaji: if get the option

ian fleming: Again, I cannot assign a spare drive to array b

balaji: okay

balaji: ideally you should be able to assign it as global spare. I am not sure why we are not getting this option.

ian fleming: yes, but I don’t see that option …  I think because it’s raid 6 and no real loss of redundancy occurred

balaji: yes – could be

ian fleming: Even if I pull one drive, everything is still redundant.  We are now operating degraded

ian fleming: The thing I don’t like is that we didn’t have to run degraded because the rebuild could have happened by replacing the drive with a predictive failure.  There is no option to replace a drive?

balaji: ian, once the rebuild is completed you it will change the degraded state

ian fleming: Yes.  But my point being – if we could have the option to replace a drive instead of make a hard failure (by removing a drive) there would be no degradation.

balaji: Okay

[Tuesday, February 01, 2011 4:06 PM] — Balaji Makams C says:

Ian, this is a strange case.  Thank you for letting us know.  I am notifying the engineers about this issue.

Conclusion

This morning, the replacement drive was delivered.  One thing I will hand to HP is that they have decent support and use overnight shipping.

Lesson learned:

On a P400 controller, I could not assign a spare to a RAID 6 array.  The only way I found to replace a drive with a predicted failure was to create a hard fault by removing it.  In retrospect, the array stays redundant while you do this, because RAID 6 tolerates two drive failures; with one drive pulled, it can still survive one more.  The thing I don’t like about this is that there is no option in the ACU software or on the P400 controller to replace a member disk in place.  Maybe HP will fix this in a future release.
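The chat above asked whether the CLI could help. The following is a hedged Python sketch, not a fix: it only gathers read-only status from HP's `hpacucli` command-line tool (the kind of output worth saving before pulling a drive or calling support). It assumes `hpacucli` is installed and on the PATH, and the controller slot number is a placeholder you would need to confirm on your own server. It does not attempt to assign a spare to the RAID 6 array, and I make no claim that the CLI can do so.

```python
import shutil
import subprocess


def status_commands(slot: int) -> list[list[str]]:
    """Build read-only hpacucli status commands for the controller in `slot`.

    `slot` is an assumption; check the "show config" output for the real value.
    """
    ctrl = ["hpacucli", "ctrl", f"slot={slot}"]
    return [
        ["hpacucli", "ctrl", "all", "show", "config"],  # controllers, arrays, drives
        ctrl + ["ld", "all", "show", "status"],         # logical drive status
        ctrl + ["pd", "all", "show", "status"],         # physical drive status
    ]


def run_status(slot: int) -> None:
    """Run each status command and print its output, if hpacucli is available."""
    if shutil.which("hpacucli") is None:
        print("hpacucli not found on PATH; nothing was run")
        return
    for cmd in status_commands(slot):
        print("$", " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)


if __name__ == "__main__":
    run_status(slot=0)  # slot 0 is a guess, not taken from the server in this post
```

Because every command here is a "show" command, running it should not change the array configuration.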

So completing this task on an HP P400 controller is not as easy as replacing a member disk, as it is with a PERC controller:

Replacing a member disk in an array with a Dell PERC controller

You will have to create a hard fault by pulling a drive!


2 Comments on “HP P400 Controller – RAID 6 Array Trouble”

  1. Khuzaima says:

    Hello Ian,
    It was nice reading this article. I am in a similar situation. One of our clients has an HP DL 380 G5 server with 8 HD bays. Two of the drives are in a mirror and the other six are in RAID 5. One of the HDs (bay 3) in the RAID 5 was blinking red, so it was replaced offline with a new one. After the installation and boot up, the RAID controller asked me whether to rebuild, to which I said yes. The new HD started blinking green, but after 45 minutes it started blinking red. Within another half an hour the other two hard disks, in bays 6 and 8, also started blinking red. I restarted the server the next day, but these three disks still blink red.
    It's been 2 weeks now, and the server and data seem to be working and accessible so far, but today the drive in bay 3 (which is a new one) started blinking red and blue.
    I am hopefully going to update the firmware tomorrow, but what do you think could be wrong?
    Please do reply with suggestions.
    I appreciate all the help!

    • itcoop says:

      Red blinking usually means that the issue is a SMART predicted failure. This is not critical, but the drive will need to be replaced. Blue lights mean that you have that drive selected in the ACU. Call HP support, send them the ACU report and they will send you a new drive. Pull the old one and replace it. It should take an hour or so to rebuild. Oh, have you checked the ACU?
      Now, in the issue I describe in this entry, a predicted failure was already asserted when I ran the upgrade. I would not recommend upgrading the firmware unless you move all the data off the drive. Here’s what I would do in your situation:
      1) File backup the RAID 5 partition
      2) Delete the array
      3) Upgrade the firmware NOTE: use the foundations USB key and DON’T do automatic; select manual and reboot 3 times with the USB upgrade in place or until it can’t find any more updates to perform.
      4) Re-create the array
      5) Restore the files from backup

      Good luck!

