Locksmith for Server & Bad RAID Command Line Interface

September 26th, 2007 by Ralf S. Engelschall

Yesterday one of our primary servers had a disk failure on its 3ware RAID controller. Fortunately, we had bought a suitable “cold standby” replacement disk about 2 years ago, so we went to the datacenter where the server is located to replace the disk, expecting no major problems…

Odyssey 1: Locksmith for Server

Once we stood in front of the server we had to realize that we would be unable to replace the disk: the server's “security” front panel was locked, and the key was no longer available. As this “security” front panel is more or less useless in this situation (the rack is already locked and not used by other customers), we had stored the key in the datacenter on our last physical visit by simply attaching it to the right side of the server front panel. In the meantime, however, the datacenter operating company had cleaned up the cabling and other things in the rack and apparently removed our key, too. Unfortunately, nobody in the datacenter was able to find it again.

Thoroughly annoyed, we went back to our office in the hope of finding a copy of the key. Unfortunately, there was none. We found everything else related to the server, including handbooks, disks, a spare power supply, etc. But no second key. So, what to do?

Well, if you get locked out of your apartment you usually call the locksmith. So why not also call the locksmith to unlock a server in a datacenter? To make a long story short: we called the locksmith and went back to the datacenter with him in tow. You can imagine that the datacenter operators were rather surprised to see a customer have a locksmith unlock their own server equipment ;-)

The locksmith needed a few minutes, and then we were finally rid of the security front panel. I worked in a datacenter environment for 6 years, but this was one of the strangest situations…

Odyssey 2: Bad RAID Command Line Interface

Now we were able to replace the disk. To do so, we removed the failed disk (port 2) on the 3ware RAID controller (controller 0) in its CLI (tw_cli) with the command “maint remove c0 p2” (as the online help system told us). Then we physically detached the failed disk and attached the new one. But now, how do we tell the RAID controller that this disk is the replacement and that the unit should be rebuilt onto it?

The controller did not recognize (and rebuild onto) the replacement disk automatically, as for instance the great HP SmartArray controllers do. We were also unable to determine from the help system what the opposite command to “maint remove” is. No “maint add” or anything like it exists. The only command dealing with controllers, units and ports is “/c0/u0 migrate type=raid5 disk=1:2:3”, but this looked so dangerous and was documented so unclearly that we were afraid to use it without some outside confirmation.

So we went back to our office and read 3ware's 200-page PDF describing their CLI. Sorry, but after this we still had no clue which command is really required. We only found “maint rescan c0”, but this created a new JBOD unit on the fly out of our replacement disk instead of adding it to the existing RAID5 unit. Rather frustrated, we called the 3ware/AMCC hotline in Germany. The result: there is no longer any technical person there who could help us, but we were told that 3ware/AMCC in the UK should be able to. Unfortunately, even after calling the UK hotline we still had no answer. They really seem to have no detailed knowledge of their own product and could not tell us how to add the replacement disk back into the RAID5 unit through their CLI. All we learned was that, instead of the CLI, the web interface (which we do not have installed) or the controller BIOS might help.

Totally frustrated with this CLI, we were finally forced to reboot the server, although it is a live production machine, in order to reach the 3ware controller BIOS. In this interface we were at last able to attach the replacement disk to the existing degraded RAID5 unit and start a background rebuild.

Sorry, 3ware/AMCC, but a CLI exists in addition to a BIOS precisely so that maintenance tasks can be performed on a running production server. It is not acceptable that one has to reboot a server just because the CLI is not intuitive, badly documented, not really known to the vendor and, above all, does not allow one to perform a simple and obvious task easily. And sorry, the most obvious maintenance task for a RAID controller CLI is to attach a replacement disk to an existing RAID unit.

I guess “tw_cli” really does have the necessary functionality, but that is worth nothing if one cannot find it within a reasonably short amount of time while standing in front of a production server to repair a degraded RAID…

Lessons Learned

  1. Never hand out any keys to a datacenter operator.
  2. If you have to hand out a key to a datacenter operator (for instance to allow him to perform maintenance tasks for you), make sure a copy of the key is kept.
  3. There exist RAID controllers which do not allow one to perform easy tasks easily.
  4. There exist RAID controllers which force one to reboot a production server in order to recover from a simple disk failure, just because the CLI is not intuitive and even the vendor hotline is unable to tell one how to use it properly.

5 Responses to “Locksmith for Server & Bad RAID Command Line Interface”

  1. cs says:

    In addition to the lessons learned: never buy a 3ware controller again. Their support is just a piece of crap. In the past I had several servers in production equipped with 3ware 6000 series controllers. The only way to monitor the health of the RAID volumes was the supplied binary-only 3dm utility. Unfortunately, this tool used an ioctl() which was flagged as deprecated in the Linux kernel. Several kernel releases later the ioctl() was actually removed. I asked them several times to supply a new version of the 3dm utility, but as you might already guess, this never happened.

    Furthermore, they refused to give me any kind of specs or documentation to implement new ways of monitoring the state of the RAID volumes. The solution offered by 3ware was the advice to buy some new 3ware RAID controllers of the 9000 series. While I was not really considering buying products again from a company with so little service-oriented support, I asked them about the EoL of the 9000 series, just to avoid buying new controllers only to have the vendor drop support the very next week. I never received an answer about EoL dates of 3ware products. So my solution is quite clear: never buy 3ware products again.

  2. jason says:

    i agree about the crappy documentation. there are 3 commands that i’ve found that work on 9500S controllers and tw_cli version 2.00.02.009.

    to remove a disk (i don’t know if this is necessary if the disk is dead):

    tw_cli /cx/px export

    to find a newly inserted disk:

    tw_cli /cx rescan

    to make a new disk a spare:

    tw_cli /cx add type=spare disk=

    after the disk is added as a spare it will automatically start rebuilding after about 30-60 seconds.

    it definitely should be more obvious. maybe this will help someone else who stumbles across this post like i did.
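
As a rough sketch, the three commands listed in this comment could be assembled for a given controller and port as below. This only builds the command strings and never runs tw_cli. The helper name replacement_commands and its parameters are made up for illustration. The comment leaves the value after "disk=" blank, so the sketch takes it as an explicit argument instead of guessing what belongs there.

```python
from typing import List


def replacement_commands(controller: int, port: int, spare_disk: str) -> List[str]:
    """Build the command lines from the comment above, in the order given there.

    controller and port fill in the x placeholders of /cx and /px.
    spare_disk is whatever value belongs after "disk="; the comment leaves
    it blank, so the caller has to supply it (hypothetical parameter).
    """
    if controller < 0 or port < 0:
        raise ValueError("controller and port must be non-negative")
    ctl = f"/c{controller}"
    return [
        f"tw_cli {ctl}/p{port} export",                    # remove the old disk
        f"tw_cli {ctl} rescan",                            # find the new disk
        f"tw_cli {ctl} add type=spare disk={spare_disk}",  # make it a spare
    ]


# Example: controller 0, port 2; passing "2" as the disk value is an assumption.
for line in replacement_commands(0, 2, "2"):
    print(line)
```

Printing the commands first and typing them by hand keeps a human in the loop, which seems sensible on a degraded production array.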

  3. Nishanth says:

    I had the same problem with maint rescan creating an entirely new unit with the newly replaced drive (c0/u10 as opposed to c0/u0). This was fixed by deleting the unit, and then rebuilding.

    maint deleteunit c0 u10
    maint rebuild c0 u0 p10

    These are the major ones. I may have left something out since i did not document the commands i executed.
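
A minimal sketch of the same two steps with the unit and port numbers as parameters, assuming the commands are used exactly as quoted in this comment. It only produces the text to type into tw_cli and refuses the one obviously dangerous mix-up, deleting the unit that is supposed to be rebuilt. The function name and parameter names are hypothetical.

```python
from typing import List


def delete_stray_and_rebuild(controller: int, stray_unit: int,
                             target_unit: int, port: int) -> List[str]:
    """Return the two maint commands from the comment above as strings.

    stray_unit is the unit that "maint rescan" created by mistake,
    target_unit is the degraded unit that should be rebuilt onto port.
    """
    if stray_unit == target_unit:
        # Deleting the degraded RAID unit itself would destroy the array.
        raise ValueError("stray unit and rebuild target must differ")
    return [
        f"maint deleteunit c{controller} u{stray_unit}",
        f"maint rebuild c{controller} u{target_unit} p{port}",
    ]


# The values from the comment: stray unit u10, degraded unit u0, new disk on p10.
for line in delete_stray_and_rebuild(0, 10, 0, 10):
    print(line)
```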

  4. Locksmiths in Wakefield says:

    Nice blog. Really interesting

  5. Ron says:

    That is quite a humorous story about the locked security front panel. Then bringing in the locksmith to help was a very creative and ironic turn of events, haha. Your advice of making a copy of the key can’t be said enough.
