Yesterday one of our primary servers had a disk failure on its 3ware RAID controller. Fortunately, we had bought a suitable “cold standby” replacement disk in advance about two years ago, so we went to the datacenter where the server is located to replace the disk, expecting no major problems…
Odyssey 1: Locksmith for Server
Once we stood in front of the server, we had to realize that we would be unable to replace the disk: the server’s “security” front panel was locked, and the key was no longer available to unlock it. As this “security” front panel is more or less useless in this situation (the rack is already locked and not shared with other customers), the last time we visited the server physically we had stored the key in the datacenter by simply attaching it to the right side of the server’s front panel. But in the meantime the datacenter operating company had cleaned up the cabling and other things in the rack, and apparently had removed our key, too. Unfortunately, nobody in the datacenter was able to find it again.
Totally annoyed, we went back to our office hoping to find a copy of the key. Unfortunately, there was none. We found everything related to the server, including handbooks, disks, a spare/replacement power supply, etc., but no second key. So, what to do?
Well, if you get locked out of your apartment, you usually call a locksmith. So, why not also call a locksmith to unlock a server in a datacenter? To make a long story short: we called a locksmith and went back to the datacenter with the locksmith in tow. You can imagine that the datacenter operators were rather surprised to see a customer having a locksmith unlock their own server equipment ;-)
The locksmith needed only a few minutes, and then we were rid of the security front panel. I worked in a datacenter environment for 6 years, but this situation was one of the strangest ones…
Odyssey 2: Bad RAID Command Line Interface
Now we were able to replace the disk. For this we removed the failed disk (port 2) from the 3ware RAID controller (controller 0) in the CLI (“tw_cli”) with the command “maint remove c0 p2” (which the online help system told us about). Then we physically detached the failed disk and attached the new one. But now, how do we tell the RAID controller that this disk is the replacement and should be rebuilt?
The controller did not recognize (and rebuild) the replacement disk automatically, as for instance the great HP SmartArray controllers do. We were also unable to determine from the help system what the opposite of “maint remove” is. No “maint add” or anything similar exists. The only other command dealing with controllers, units and ports is “/c0/u0 migrate type=raid5 disk=1:2:3”, but this looked so dangerous and was documented so unclearly that we were afraid to use it without outside confirmation.
So, we went back to our office and read 3ware’s 200-page PDF describing their CLI. Sorry, but after this we still had no clue which command is really required. We only found “maint rescan c0”, but this created a new JBOD unit out of our replacement disk on the fly instead of adding it to the already existing RAID5 unit. Rather frustrated, we called the 3ware/AMCC hotline in Germany. The result: there is no longer any technical person there who could help us, but we were told that 3ware/AMCC in the UK should be able to. Unfortunately, even after calling the 3ware/AMCC hotline in the UK we still had no answer. They really seem to have no detailed knowledge of their own product and were unable to tell us how to add the replacement disk back into the RAID5 unit through their CLI. All we could determine was that, instead of the CLI, there was hope that the web interface (which we do not have installed) or the controller BIOS could help out.
Totally frustrated with this CLI, we were finally forced to reboot this server, although it is a fully productive one, in order to reach the 3ware controller BIOS. In this interface we were finally able to attach the available replacement disk to the existing degraded RAID5 unit and start a background rebuild.
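For the record, and purely as an assumption on our side, since we could not verify it on the box itself: 3ware’s CLI guide for the older 7000/8000-series controllers does list a “maint rebuild” command, so if it works as documented, the whole replacement should have been possible from the CLI along these lines:

```shell
# Assumption: old-style tw_cli syntax as documented for 3ware 7000/8000-series
# controllers; c0 = controller 0, u0 = the degraded RAID5 unit, p2 = failed port.

tw_cli maint remove c0 p2      # detach the failed disk from the controller
#   ...physically swap the disk here...
tw_cli maint rescan c0         # let the controller detect the new disk
#   if the rescan auto-created a JBOD unit (e.g. u1) out of the new disk,
#   it presumably has to be deleted first:
tw_cli maint deleteunit c0 u1
tw_cli maint rebuild c0 u0 p2  # attach the disk to unit u0 and start the rebuild
tw_cli info c0                 # check unit status and rebuild progress
```

The newer 9000-series firmware replaced this with an object-style syntax (e.g. “/c0/u0 start rebuild disk=2”), which may be part of why neither the online help nor the hotline could point us to the right “maint” variant.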
Sorry, 3ware/AMCC, but a CLI exists in addition to a BIOS precisely so that one can perform maintenance tasks on a production server without rebooting it. It is not acceptable that one has to reboot a server just because the CLI is not intuitive, is badly documented, is not really known even to the vendor, and above all does not allow one to perform a simple and obvious task easily. And sorry, but the most obvious maintenance task of a RAID controller CLI is to reattach a replacement disk to an existing RAID unit.
I guess “tw_cli” really has the necessary functionality, but it is worthless if one cannot find it within a reasonably short amount of time while standing in front of a production server with a degraded RAID to repair…
Lessons learned:

1. Never hand out any keys to a datacenter operator.
2. If you have to hand out a key to a datacenter operator (for instance, to allow them to perform maintenance tasks for you), make sure a clone of the key is kept.
3. There exist RAID controllers which do not allow one to perform easy tasks easily.
4. There exist RAID controllers which force one to reboot a production server in order to recover from a simple disk failure, just because the CLI is not intuitive and even the vendor hotline is unable to tell one how to use it properly.