Tuesday, April 19, 2011

Good job, kid! Don't get cocky!

Having been in IT for as long as I have, you'd think I'd have known better than to mention hubris, good jobs, and things turning out OK in public.  The gods do love a proud man...the better to grind him down.  First the good news: at least my attendance at the storage forum will be covered by the company putting it on.  With that, I'll go even if I have to cover the rest on my own dime.  After all, it's only a 13-hour drive!

And then the other shoe fell.

The company I work for is like any for-profit organization...if money can be saved by cutting corners, then of course you cut them.  When we virtualized our three remote clinics, we set each one up with a single ESX host and a local SAN.  The SANs we installed were designed to be fully redundant: dual controllers, dual power supplies, etc.  OK, we got the dual power supplies but went with only one controller.  After all, we were only half-filling the drive bay, so one should be fine, right?

I'm sure those of you in the know are cringing by now, because of course one of the remote clinics' SANs lost its controller last Thursday night. Now, normally you'd think that if you could get a replacement (I had a spare installed at one of the remote sites), and you could move the flash memory card to said replacement, you should be able to power everything back up and be right back in business. At least, that's what we were thinking when I headed out Friday morning to get the spare and install it in the failed SAN.

OK, so the spare is installed and the SAN restarted.  Oops, no LAN links on the controller.  So, let's find a serial cable and hope the serial port on the non-ESX server in the rack works.  The cable was easy; the serial port, though, decided that whatever program had control of it wasn't willing to share.  So, grab a PC and get connected via PuTTY.  By now, I'm with a level 2 tech, who gets in and drops bombshell number one of the day (at least from tech support)...any time a single-controller SAN fails, it REQUIRES L2 support to get it back online.  Then, bombshell number two...the read/write cache is unrecoverable, so any writes that hadn't made it to the disks when the crash happened were lost.  Telling me that the only way to get the controller online was to dump the cache, he then requested that I OK the dump.  Kinda like the commander of the firing squad requesting that you give the order to fire, eh?
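(For the curious: if you ever find yourself in the same PuTTY-less boat, a few lines of Python with pyserial will get you a crude serial console.  This is just a sketch — the port name and the 115200/8N1 settings are my assumptions, not anything specific to this SAN, so check your own hardware's docs.)

```python
# Crude serial-console fallback -- a sketch, not gospel.
# Port name and 115200/8N1 settings are assumptions; match them to your hardware.
import serial  # pip install pyserial

con = serial.Serial(
    port="COM3",        # or "/dev/ttyUSB0" on Linux, whatever your adapter shows up as
    baudrate=115200,    # many controllers use 115200 or 9600; check the docs
    bytesize=serial.EIGHTBITS,
    parity=serial.PARITY_NONE,
    stopbits=serial.STOPBITS_ONE,
    timeout=2,
)

con.write(b"\r\n")                              # poke the console to get a prompt
print(con.read(4096).decode(errors="replace"))  # dump whatever the controller says back
con.close()
```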

Never one to hesitate when confronted with recalcitrant hardware, I had 'em pull the trigger and dump the cache, and glory be, we were back online, LAN connections and all. The VMs mounted up, and everything was just peachy.  Oh, except for the corrupt half-terabyte of storage on our X-ray image store VM.  Thus began the tech support shuffle, with the SAN techs suggesting one thing, and the PACS (that's X-ray software, to you non-healthcare techies) techs wanting to go the chkdsk route.

No joy on any of the suggestions from the SAN techs, and chkdsk failed miserably. For those of you wondering whyinhell we weren't backing up this data: the original design was to replicate every image to an archive server, from which we'd been assured we could restore a site without too much trouble.  So, let's get the X-ray machines aimed at a different server (we have 100 Mbps connections between our offices...gotta love a metro fibre ring!), and then restore the data that got trashed.  Now, this little redirect chore requires that the company that supports the X-ray equipment make the networking change, on site and in person, using a hardware security dongle, and they assured us that there would be a tech on-site first thing Monday morning.

Monday morning, and no tech.  A quick call to the X-ray company, and we got forwarded to the tech, who was just leaving St. Louis with an expected ETA of mid-afternoon. So, the boss borrowed a security dongle from the IT director at another orthopaedic clinic in town that uses the same company (they paid large green for the training and the dongle), and a coworker went down and redirected the machines to send their images to our main office, which got our doctors at that location back in business.  Hurrah for us!

For about 45 minutes, that is.  That's how long a reprieve we had before getting a call that another site's PACS software had stopped working altogether.  So, all afternoon was spent troubleshooting that issue (me working on the hardware and interfacing with the PACS techs).  We had docs at that site calling the CEO, wanting to know what IT was doing, and why wasn't it fixed yet, dammit!

I found zero wrong with the server/SAN/network/workstations.  That's what the PACS techs said about their software.  So, they fell back to the (usually right) IT solution...reboot everything.  The server, the ESXi host, the SAN. I groused, but agreed, and planned on doing it at 9 PM last night.  Finally, I left the day from hell behind me and headed for home, for a brief respite before the massive reboot.

At about 5:30, I got an email from work.  I called to get details, and it turns out I didn't have to reboot a damn thing.  Since we still had the PACS security dongle, the co-worker who had redirected the images from the SAN-failure site decided to do the same at the site of the new failure.  Once there, he discovered that during the full upgrade on that system a month or so ago, the company that did the upgrade had ignored the IP addresses they'd been given for that site and had instead directed the images to the only other site that hadn't had any major problems during all of this Charlie-Fox.  He pointed the images at the correct server, and strangely enough, everything started working. As for the afternoon outage: unrelated work by our PACS company had turned off the failed site's ability to pull images from their server.  That's what triggered it.

So, how was YOUR Monday?

Still wondering about the large data recovery?  Tune in tomorrow....
