Weblog

 

Electrical Systems Maintenance Notice - Nov. 30th  

Incident Tracker status:  RESOLVED

Status Update, 12/07/2007

Friday, December 7th, 2007 at 7:39 pm

(mt) Media Temple indicated earlier this week that it would follow up with customers when additional information was available regarding the failed storage upgrade and continued performance inconsistency. We hope this information is useful to you.

Why did the 1st upgrade fail?
We are still waiting on a formal RCA (Root Cause Analysis) from BlueArc. However, our current understanding is that the failure was in the firmware of the ‘disk controllers’ (the hardware that connects all of the individual hard drives into the main system). When the controllers underwent a consistency check, they were unable to properly verify that the drives were “compatible”. For data safety, the system is programmed to “lock the drives out” in that case. This only happened to a small group of the disks, but it prevented the storage system from mounting required data, which is why Cluster.2 was down. There was a conflict with the specific firmware revision on that batch of disks. BlueArc confirms that in lab testing, in the field, and even in the majority of our own disk controllers, this is a “never seen before” issue.

The final result of this upgrade is that we are now running the correct version of firmware on ALL controllers in our system. At this time, we do not anticipate needing to update the controller firmware again while we are running the 4.3.x branch of the Titan codebase.

Current Status
24 hours after the upgrade, (mt) and BlueArc engineers re-balanced the utilization of the existing Titan storage cluster. This provided a major, much-needed performance improvement over the course of this week.

Future Remedies
BlueArc has been working non-stop since this incident and has demonstrated serious commitment to supporting (mt). Multiple director- and management-level BlueArc employees were flown in this week and have been spending time in our data center and corporate offices since the incident. Their relentless focus on solving the outstanding problems has resulted in a new plan, which was finalized this morning. Additionally, BlueArc has been simulating and testing their strategy in a separate Titan lab to make sure we do not have rollback issues like those in the previous upgrade.

Maintenance will be performed this Saturday evening, beginning at 10:30PM PST, to complete this well-tested update. The details will be tracked in a weblog entry. You and your account contacts should have received an email announcing this work.

We thank you for your continued patience.

Best Regards,

(mt) Media Temple
Hosting Operations

We apologize

Tuesday, December 4th, 2007 at 9:05 am

(mt) Media Temple would like to apologize to our (gs) Grid-Service customers for the series of issues relating to the (gs) system in the past few months. In appreciation of your patience, we have applied two months of free credit to your account. This credit has been issued automatically and is reflected in the billing area of your AccountCenter. We appreciate your continued business.

Notably, during our scheduled system upgrade on November 30th, the (gs) Grid-Service was offline longer than expected due to a failed upgrade to the storage firmware in Cluster.2. Although our company has a demonstrated 10-year track record of successful system maintenance actions, this past weekend’s event was an unfortunate exception. The majority of the scheduled items were completed and upgraded according to plan. However, one of BlueArc’s Titan disk systems, which provides a portion of the storage to our (gs) Grid, did not upgrade successfully, nor did it roll back correctly when errors were discovered. Consequently, this portion of the system maintenance missed its allotted time window by 7 hours. All other facets of the system maintenance were completed ahead of schedule, and we encourage you to review the original scheduled maintenance announcement for additional reference: http://weblog.mediatemple.net/weblog/2007/11/21/electrical-systems-maintenance-notice-nov-30th/

The situation with the storage upgrade is particularly frustrating because the vendor-supplied update was intended to fix issues - not create new ones. Even after the prolonged upgrade, the system is unfortunately still exhibiting some problems. We are waiting on a full analysis from the vendor regarding the reasons for the failed upgrade and continued instability. When these findings are available, they will be communicated immediately.

It is well known that the vast majority of the performance and stability issues that have affected (gs) Grid-Service since its launch relate to storage. Consequently, (mt) Media Temple engineers, along with senior management, have been working on a redesign of the storage architecture along with several other radically improved features in the platform. Most notably, a new storage solution is being developed internally, with substantially reduced commercial vendor dependence and an architecture that will bring a high level of reliability back into our systems. This will result in a longer-term solution that will be named the (cs) Cluster-Server, currently scheduled to go into beta in January. The beta testing program will give many customers the opportunity to experience some of the changes and improvements which we have made. Check http://www.mediatemple.net/labs/cs/ for more information when you have a chance.

(mt) Media Temple will continue to work vigorously on the (gs) Grid-Service platform until it is stable again. Our company of 75+ staff is fully committed to making the product as reliable as possible well before any new platforms are released. This evening, engineers will be implementing some newly found tuning parameters on the system which are believed to correct some of the performance issues witnessed this morning. While we anxiously await the full root cause analysis from BlueArc concerning the failed upgrade and continued stability problems, we encourage customers to continue watching incident #306 for up-to-date system status.

Thank you again for your patience.

Best Regards,

Demian Sellfors
CEO
(mt) Media Temple, Inc

Emergency Systems Maintenance Monday night.

Monday, December 3rd, 2007 at 9:59 pm

Due to the performance issues we have seen since the firmware upgrade, our systems engineers will be performing an emergency systems maintenance tonight from approximately 10:30 PM (PST) to 11:30 PM.  This maintenance is an effort to reduce the load on (gs) Grid.Cluster.2 by balancing the filesystem load across multiple storage segments and bringing additional hardware online.
The maintenance will be performed in 2 stages.
Stage 1 will consist of moving a segment of the storage system to (gs) Grid.Cluster.1. We expect approximately 20-30 minutes of unavailability tonight.
Stage 2, which will be announced shortly, will consist of bringing new hardware online for (gs) Grid.Cluster.2 and balancing the filesystems across all of the storage segments, as well as rolling out any additional firmware updates.  The current firmware contains a bug that causes random storage failures on (gs) Grid.Cluster.1, and we anticipate that a new firmware will be available before the implementation of stage 2.
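
For readers curious what "balancing the filesystems across storage segments" can look like in principle, here is a minimal sketch. It is not the procedure (mt) or BlueArc used: the greedy "largest first, onto the least-loaded segment" heuristic, the function name rebalance, and all filesystem names, segment names, and load figures are hypothetical and chosen only for illustration.

```python
import heapq


def rebalance(filesystem_loads, segment_names):
    """Greedily assign filesystems to storage segments.

    filesystem_loads: dict mapping a (hypothetical) filesystem name to its load.
    segment_names: list of (hypothetical) storage segment names.
    Returns a dict mapping each segment name to the filesystems placed on it.
    """
    # Min-heap of (current total load, segment name): the least-loaded
    # segment is always at the top.
    heap = [(0, name) for name in segment_names]
    heapq.heapify(heap)
    placement = {name: [] for name in segment_names}

    # Place the heaviest filesystems first so the small ones can even things out.
    by_load = sorted(filesystem_loads.items(), key=lambda kv: kv[1], reverse=True)
    for fs_name, load in by_load:
        total, segment = heapq.heappop(heap)
        placement[segment].append(fs_name)
        heapq.heappush(heap, (total + load, segment))
    return placement


if __name__ == "__main__":
    # Illustrative numbers only; not measurements from the (gs) Grid.
    loads = {"fs-a": 8, "fs-b": 7, "fs-c": 5, "fs-d": 4, "fs-e": 2}
    for segment, filesystems in rebalance(loads, ["seg-1", "seg-2"]).items():
        print(segment, filesystems, sum(loads[f] for f in filesystems))
```

A real migration also has to account for capacity limits and the cost of moving data, which this sketch ignores.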

Bug in new firmware

Monday, December 3rd, 2007 at 2:35 pm

We have discovered a bug in the new firmware upgrade, provided to us by our storage vendor, that is causing storage-related failures on (gs) Grid.Cluster.1.  BlueArc is currently investigating this and we are waiting for their update to resolve this issue.  We are also aware of the latency issues with web and email on (gs) Grid.Cluster.2 and are working with BlueArc to resolve this issue as well.

We anticipated that the storage upgrade would resolve a lot of the issues we have faced in the past few months but it seems to have created some new ones.

We will have further updates on this issue as progress is made.

Thank you very much for your continued patience and understanding.

Maintenance complete and issues resolved.

Saturday, December 1st, 2007 at 10:13 am

Given a failed upgrade from the vendor on (gs) Grid.Cluster.2’s storage segment, and then a failed rollback attempt, we either had to work to repair the systems so customers would have their “live” data, or recover from backup, potentially taking several days to get fully back online and rolling back some customers to their “last backed up” date. Though neither option was “pleasant”, we had good confidence that the “up to date” data was safe and accessible with the appropriate vendor involvement, so repair was seen as the best overall customer outcome.  After several hours of troubleshooting, we have managed to repair the systems, avoiding a restore situation.

We are continuing to validate the auxiliary systems of the (gs) (database, database container, containers) at this time. All basic services such as web, email, and FTP have been restored.

We understand how frustrating situations like these can be and we sincerely appreciate your continued patience and understanding.

Engineers are still working hard

Saturday, December 1st, 2007 at 7:13 am

The BlueArc engineers are continuing to work feverishly on this issue and have already escalated it to the senior engineering staff of their component vendor (storage shelf).  We should be able to report further updates soon with more information on the progress.

Maintenance update

Saturday, December 1st, 2007 at 4:49 am

We’ve been busy since the last update. The majority of our maintenance actions were completed successfully and more quickly than expected. A few items did not go according to plan, for reasons which are still being investigated. At this time most of the random MySQL and (gs) Grid Container issues are resolved, leaving one final issue, which is BlueArc.
BlueArc engineers have been in our data center for the last 8 hours carrying out their project to upgrade three of our Titan disk systems, which combined power the storage for Cluster.1 and Cluster.2 of the (gs) Grid. During the post-upgrade reboot, one of BlueArc’s components (a storage shelf) failed, which kept Cluster.2 from rebooting completely. For the last several hours BlueArc has been trying to remedy the failed upgrade. We are waiting on engineering results from their team at this time.
Several engineers are currently in our El-IDC data center working on all of the outstanding issues. More information should be available soon regarding the status of the failed BlueArc storage component in Cluster.2.

(gs) Grid.Cluster.2 upgrade

Saturday, December 1st, 2007 at 2:29 am

Unfortunately the upgrade to (gs) Grid.Cluster.2 is taking longer than anticipated.  Our systems engineers expect all services to be restored by 4:00AM PST.

Thank you for your patience and understanding.

What will happen to emails during this maintenance?

Thursday, November 29th, 2007 at 4:05 pm

During the maintenance, although email functionality will be unavailable, all emails being sent to your email accounts will be remotely queued and delivered after the maintenance is complete.  No emails should be lost during this maintenance.

Electrical Systems Maintenance Notice - Nov. 30th

Wednesday, November 21st, 2007 at 2:42 pm

On Friday, November 30th, our data center Electrical Engineers will facilitate proactive replacement of certain electrical systems in one of our Facility Power Segments at our EL-DC3 data center. In addition, an upgrade will be taking place to core components of the (gs), including the storage subsystem. This vendor-recommended upgrade required additional time and thus has been grouped with the data center activity in order to reduce customer impact.

This larger-than-normal maintenance period is a proactive measure to prevent power failure incidents such as those experienced by various other data centers mentioned in recent news. We would like to remind all customers that scheduled infrastructure maintenance and security-related updates are a necessary and vital aspect of web hosting that ensures the long-term uptime and reliability of your services.

The window for this action is:

Friday, Nov 30th 2007 9:30PM - Saturday, Dec 1st 3:00AM PST

To see when this maintenance window will occur in a different timezone please visit:

http://mediatemple.net/go/date/0711302130
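
If the link above is unavailable, the conversion is easy to do yourself. The short sketch below takes the window stated in this notice (Friday, Nov 30th 2007 9:30PM to Saturday, Dec 1st 3:00AM PST) and prints it in other timezones. The fixed UTC-8 offset for PST, the two target zones, and the helper name window_in are our own assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST is modeled as a fixed UTC-8 offset (no daylight saving in late November).
PST = timezone(timedelta(hours=-8), "PST")

# Maintenance window as announced in this notice.
WINDOW_START = datetime(2007, 11, 30, 21, 30, tzinfo=PST)
WINDOW_END = datetime(2007, 12, 1, 3, 0, tzinfo=PST)


def window_in(tz):
    """Return the (start, end) of the maintenance window expressed in timezone tz."""
    return WINDOW_START.astimezone(tz), WINDOW_END.astimezone(tz)


if __name__ == "__main__":
    # Example targets only: UTC and Central European Time as a fixed UTC+1 offset.
    targets = (timezone.utc, timezone(timedelta(hours=1), "CET"))
    for tz in targets:
        start, end = window_in(tz)
        print(f"{start:%a %b %d %H:%M %Z} - {end:%a %b %d %H:%M %Z}")
```

Substitute your own UTC offset in the targets tuple to see the window in your local time.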

This maintenance action will require downtime of the following services:

(gs) Grid-Services
(ss) Shared Services
(dp)(dpv)(nitro) some Dedicated Physical Services

As a courtesy we are notifying all of our affected customers of this upcoming maintenance.

We sincerely apologize for any inconvenience this maintenance causes. This is a necessary upgrade to our data center infrastructure and will prevent the possibility of future issues pertaining to power.

We would like to remind all customers that scheduled infrastructure maintenance and security related updates are a necessary and vital aspect of web hosting that ensures the long term uptime and reliability of your server. Should you have any questions regarding this scheduled maintenance please open a support request inside the (mt) AccountCenter.

Thank you in advance