February 20, 2012

Fun with LSI RAID

If you deal with “enterprise” hardware at all, you're probably familiar with LSI's RAID controllers. Their stuff seems to generally work rather well, though the user interface could be improved. In any case, one of the machines I run with 12 disks on an LSI controller recently started exhibiting some unusual performance characteristics. For the most part things were working fine, but maybe once a week the system would slow to a crawl until it was rebooted or left alone for a few hours. I eventually tracked down the problem to a failing drive which the controller never recognized as failed. Hopefully the troubleshooting technique can be useful to others having similar problems (or at least as notes for next time).

Before we get started, it's worth mentioning that it's very easy to lose data when doing this sort of thing. Standard disclaimers apply: you're responsible for your actions (not me, not my webhost, etc).

Since we're not using an OS on which the LSI GUI tools run, I'm stuck using the (rather cryptic) LSI CLI tools. First we ask for the status of each drive and look for anomalies:

MegaCli64 -PDList -aALL | grep "Count" | sort -u 
MegaCli64 -PDList -aALL | grep "state" | sort -u 
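The `sort -u` output shows whether any counter is nonzero, but not which drive it belongs to. A minimal sketch of a per-drive summary follows. The function name `summarize_pd` is made up for illustration, and the field labels it matches (`Enclosure Device ID`, `Slot Number`, the three `Count` lines, `Firmware state`) are assumptions about the `-PDList` output format, so check them against what your controller actually prints.

```shell
# Summarize MegaCli64 -PDList output, one line per physical drive.
# Reads the listing on stdin. The field labels below are assumed; adjust
# them if your MegaCli version words them differently.
summarize_pd() {
    awk -F': *' '
        /^Enclosure Device ID/      { enc = $2 }
        /^Slot Number/              { slot = $2 }
        /^Media Error Count/        { media = $2 }
        /^Other Error Count/        { other = $2 }
        /^Predictive Failure Count/ { pred = $2 }
        # "Firmware state" is assumed to come after the counters for each
        # drive, so it is used as the trigger to print the collected values.
        /^Firmware state/ {
            printf "[%s:%s] media=%s other=%s predictive=%s state=%s\n",
                   enc, slot, media, other, pred, $2
        }
    '
}
```

It would be used as `MegaCli64 -PDList -aALL | summarize_pd`.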

Assuming that returns nothing interesting, you can always look at the output without any filtering:

MegaCli64 -PDList -aALL | less

but chances are, you'll have to look at the event log:

MegaCli64 -AdpEventLog -GetEvents -f log.txt -aALL
less log.txt

This is where problems started to show up on my system, in the form of sense errors.

  Code: 0x00000071 Class: 0 Locale: 0x02 Event Description: Unexpected sense: PD 12(e0x08/s11) Path 5003048000779f4f, CDB: 28 00 00 e8 ba 01 00 00 02 00, Sense: 3/11/00 
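If the Sense field is the usual SCSI key/ASC/ASCQ triple, then 3/11/00 is a medium error (unrecovered read error), and a CDB starting with 28 is a READ(10), so this looks like a failed read rather than a cabling glitch. To see whether such events cluster on one disk, something like the following can tally them per drive. This is a rough sketch: `count_sense_errors` is a made-up name, and the pattern assumes the event text looks like the line above.

```shell
# Count "Unexpected sense" events per physical drive in a saved event log.
# Reads the log on stdin and prints "count PD-identifier", busiest drive first.
count_sense_errors() {
    # Pull out just the "Unexpected sense: PD nn(e.../s..)" part of each event,
    # then drop the prefix so identical drives collapse together in uniq.
    grep -o 'Unexpected sense: PD [0-9a-fA-F]*([^)]*)' |
        sed 's/^Unexpected sense: //' |
        sort | uniq -c | sort -rn
}
```

With the log saved earlier, that would be `count_sense_errors < log.txt`.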

I'm told that it is normal to see a few such communication errors on a loaded system with this series of controller, but something was obviously wrong and the problems were generally occurring on the same disk, so I decided to try replacing it with my cold spare. You can prepare a drive to be removed as follows (be very careful to get the various ids correct – LSI uses different addresses for different things)

 MegaCli64 -CfgDsply -a0 # figure out the enclosure id
 MegaCli64 -PDInfo -PhysDrv \[8:11\] -a0 # make sure this is the drive
 MegaCli64 -PDOffline -PhysDrv \[8:11\] -a0 # mark offline enclosure 8, disk 11 ("PD12" in the event log in my case)
 MegaCli64 -PDMarkMissing -PhysDrv \[8:11\] -a0 # mark missing
 MegaCli64 -PdPrpRmv -PhysDrv \[8:11\] -a0 # prepare for removal
 MegaCli64 -PdLocate -start -PhysDrv \[8:11\] -a0 # turn on the error LED
 MegaCli64 -PDInfo -PhysDrv \[8:11\] -a0 # verify and swap the disk
 MegaCli64 -PDOnline -PhysDrv \[8:11\] -a0 # mark the new disk online if it's not already
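Since a typo in the enclosure or slot number can take the wrong disk offline, one option is to generate the destructive part of the sequence from a single set of ids and read it over before running anything. The helper below is a hypothetical sketch (`removal_plan` is not an LSI tool); it only prints commands and never executes them.

```shell
# Print (do not run) the removal commands for one drive so the ids can be
# reviewed first. Usage: removal_plan ENCLOSURE SLOT ADAPTER
removal_plan() {
    enc=$1
    slot=$2
    adp=$3
    # Brackets are printed backslash-escaped so the lines can be pasted into
    # a shell without any risk of glob expansion.
    pd="-PhysDrv \\[${enc}:${slot}\\] -a${adp}"
    for step in -PDOffline -PDMarkMissing -PdPrpRmv "-PdLocate -start"; do
        printf '%s\n' "MegaCli64 ${step} ${pd}"
    done
}
```

For the drive in this post that would be `removal_plan 8 11 0`.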

You may also want to monitor the progress of the rebuild:

MegaCli64 -PDRbld -ShowProg -PhysDrv \[8:11\] -a0 
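Rebuilds can take hours, so rather than rerunning that by hand, a small polling loop can log the progress with timestamps. This is only a sketch: `poll_rebuild` and the `MEGACLI` override variable are names invented here, and the loop does not try to parse the progress text, it just repeats the same command a fixed number of times.

```shell
# Poll rebuild progress COUNT times, INTERVAL seconds apart, with timestamps.
# Usage: poll_rebuild ENCLOSURE SLOT ADAPTER [COUNT] [INTERVAL]
# MEGACLI may be set to override the binary (hypothetical knob for dry runs).
poll_rebuild() {
    enc=$1
    slot=$2
    adp=$3
    count=${4:-10}
    interval=${5:-60}
    i=0
    while [ "$i" -lt "$count" ]; do
        date '+%Y-%m-%d %H:%M:%S'
        "${MEGACLI:-MegaCli64}" -PDRbld -ShowProg -PhysDrv "[${enc}:${slot}]" "-a${adp}"
        i=$((i + 1))
        # Sleep between polls, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `poll_rebuild 8 11 0 120 60 | tee rebuild.log` would check once a minute for two hours.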

Note that my example uses “PD12”, which is marked “11” on the Supermicro chassis and which is enclosure 8, disk 11 as detected by the rest of the controller (confused yet?). “-a0” refers to the zeroth adapter. You'll need to change the numbers for your situation. After all this, I was able to do some testing and see *much* better performance:

 # Write a big file
 dd if=/dev/zero of=/test.img bs=1024M count=4096
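Two things are worth knowing before copying that line. As written it asks for 4096 blocks of 1 GiB, which is 4 TiB, so scale `count` to the space you actually have. Also, plain `dd` can report its throughput before the data has left the page cache. A variant that flushes first might look like the sketch below; `write_test` is a made-up wrapper, and `conv=fdatasync` is a GNU dd option, so this assumes GNU coreutils.

```shell
# Write SIZE_MIB mebibytes of zeros to FILE and flush them to disk before dd
# prints its throughput figure. Assumes GNU dd (conv=fdatasync, bs=1M).
# Usage: write_test FILE SIZE_MIB
write_test() {
    file=$1
    size_mib=$2
    dd if=/dev/zero of="$file" bs=1M count="$size_mib" conv=fdatasync
}
```

For instance, `write_test /test.img 16384` writes 16 GiB; remember to delete the file afterwards.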
 
