Status Update, 12/07/2007
Friday, December 7th, 2007 at 7:39 pm
(mt) Media Temple indicated earlier this week that it would follow up with customers when additional information was available regarding the failed storage upgrade and continued performance inconsistency. We hope this information is useful to you.
Why did the first upgrade fail?
We are still waiting on a formal RCA (Root Cause Analysis) from BlueArc. However, our current understanding is that the firmware in the ‘disk controllers’ (the hardware that connects all of the individual hard drives to the main system) is what failed: when the controllers underwent a consistency check, they were unable to properly verify that the drives were “compatible”. For data safety, the system is programmed to “lock the drives out” in that situation. This only happened to a small group of the disks, but it prevented the storage system from mounting required data, which is why Cluster.2 was down. There was a conflict with the specific firmware revision on that batch of disks. BlueArc confirms that in lab testing, in the field, and even in the majority of our disk controllers, this is a “never seen before” issue.
The final result of this upgrade is that we are now running the correct version of firmware on ALL controllers in our system. At this time, we do not anticipate any need to update the controller firmware again while we are running the 4.3.x branch of the Titan codebase.
Current Status
24 hours after the upgrade, (mt) and BlueArc engineers re-balanced the utilization of the existing Titan storage cluster. This provided a major, much-needed performance improvement over the course of this week.
Future Remedies
BlueArc has been working non-stop since this incident and has demonstrated serious commitment to supporting (mt). Multiple director- and management-level BlueArc employees were flown in this week and have been spending time in our data center and corporate offices since the incident. Their relentless focus on solving the outstanding problems has resulted in a new plan, which was finalized this morning. Additionally, BlueArc has been simulating and testing its strategy in a separate Titan lab to make sure we do not have rollback issues like those in the previous upgrade.
Maintenance will be performed this Saturday evening, beginning at 10:30PM PST, to complete this well-tested update. The details will be tracked in a weblog entry. You and your account contacts should have received an email announcing this work.
We thank you for your continued patience.
Best Regards,
(mt) Media Temple
Hosting Operations