(gs) Grid-Service - intermittent service availability  

Incident Tracker status: RESOLVED

Update: Incident now resolved and closed.

Wednesday, May 28th, 2008 at 12:11 pm

On May 14th our Engineers made substantial changes to the storage architecture of our (gs) Grid-Service platform to address the issues stated on April 29th (read below). We have been actively monitoring all aspects of performance and stability since then, and it appears that we have successfully addressed the service interruptions our customers have been seeing over the past several months.

This maintenance successfully redistributed massive amounts of data across all our storage segments. It also gave us the opportunity to upgrade existing software to address the bug directly causing 5-30 minute periods of unavailability on each of the affected clusters.

We are fully aware of the impact this particular issue has had on your service and are taking the necessary steps to improve the performance and stability of this platform moving forward. The knowledge and insight gained from this incident have helped us greatly in designing an improved system that will continue to grow.

Thank you for your patience leading up to the closing of this incident. If you have any additional inquiries related to your (gs) Grid-Service, please open a new Support Request in the AccountCenter.

Update: May 14, 2008 @ 5:30PM Pacific Time.

Wednesday, May 14th, 2008 at 4:35 pm

A new emergency system maintenance has been scheduled for tonight. This maintenance is to address the issues covered in this System Incident. Please see this page to learn more.

Update: May 14, 2008 @ 12:00PM Pacific Time.

Wednesday, May 14th, 2008 at 12:00 pm

As a further update to Incident #393 regarding recent issues affecting the (gs) Grid-Service Cluster.2:

Engineers continue to make progress in defining this morning’s issues more accurately. The root cause of the issue relates to problems with (mt) Media Temple’s use of our current storage system and the open-source software that system uses.

At approximately 9:15AM our system experienced an increased load on a particular segment of the cluster, which we did not anticipate. As a safety measure we decided to take this segment off-line to prevent other sites from being impacted. This resulted in the Apache webserver displaying a generic “403 Forbidden” server error code for the sites associated with this problematic segment.

“403 Forbidden” is one of the many simple error codes Apache uses to tell users that a request could not be served; in this case the cause was an internal system problem. Here the term “Forbidden” simply means that the webserver couldn’t communicate as expected. The resulting error page is a generic document chosen by the original software developers of Apache. It does not mean that any intentional access has been denied, nor does it mean that any data has been compromised or lost. (mt) Media Temple has initiated plans to create a more informative error message in the future if this condition occurs again.

More information about the “403 Forbidden” error message can be found on Apache’s official website at http://httpd.apache.org. Users can also find more information by searching for “403 Forbidden” at http://www.google.com/

We apologize if this particular error message caused any unnecessary confusion for you and your website visitors during this time.

Update: May 13, 2008

Tuesday, May 13th, 2008 at 9:27 am

DOWN: 10:15AM Pacific Time

UP: 10:25AM Pacific Time

Cluster Affected: Cluster 1

Update: May 9, 2008

Friday, May 9th, 2008 at 9:37 am

DOWN: 10:22AM Pacific Time

UP: 10:36AM Pacific Time

Cluster Affected:  Cluster 2

Update: May 7, 2008

Wednesday, May 7th, 2008 at 12:01 pm

(mt) Media Temple would like to apologize for the 15 minutes of unavailability that occurred at 12:15PM Pacific Time this afternoon. Our engineers brought all services back online and are continuing to closely monitor all activity on Cluster.2 at this time. We appreciate your patience as we further explore the resolutions to this incident mentioned in our previous responses. Thank you.

Update: May 5, 2008

Monday, May 5th, 2008 at 12:44 pm

(mt) Media Temple would like to apologize for the 15 minutes of unavailability that occurred at 1:05PM Pacific Time this afternoon.  Our engineers brought all services back online and are continuing to closely monitor all activity on Cluster.1 at this time.  We appreciate your patience as we further explore the resolutions to this incident mentioned in our previous responses.  Thank you.

Update: May 4, 2008

Sunday, May 4th, 2008 at 11:27 am

(mt) Media Temple would like to apologize for the 30 minutes of unavailability that occurred at 11:35AM Pacific Time this morning.  Our engineers brought all services back online and are continuing to closely monitor all activity on Cluster.2 at this time.  We appreciate your patience as we further explore the resolutions to this incident mentioned in our previous responses.  Thank you.

Update: April 29, 2008 - Further details.

Tuesday, April 29th, 2008 at 2:02 pm

(mt) Media Temple would like to bring you up to date on some of the progress we have made thus far with our (gs) Grid-Service platform relating to this incident.

There have been several known issues we have been dealing with up until now that have been addressed in this incident. These have all been related to our storage subsystem in slightly different ways. We felt it best to simplify some of these issues to better communicate with our customers as a whole.

In a nutshell, our storage has been suffering mostly from two distinct problems, each resulting in a different type of unavailability. Here is a breakdown of where we stand with each one.

ISSUE 1:

We have identified a bug that we have been tracking for some time. The problem is isolated to a given cluster when it occurs, but can potentially happen on any cluster. All services become unavailable (web, ftp, email included) for a period of 5-30 minutes during this time. We are working closely with leading experts to get to the bottom of this problem. We’re still investigating the matter, so more definitive information will be available in the days ahead.

STATUS: ACTIVE

ISSUE 2:

This particular bug results in intermittent load spikes on our servers, and might have actually gone unnoticed by some of our customers. These temporary spikes accounted for inconsistent webpage-loading times. Even many popular website monitoring services interpreted these periods as “down”. Although we definitely considered the performance sub-par, your service, in almost all cases, was actually still “up” or available.

We discovered that these spikes were directly caused by our storage architecture. Under this configuration, our clustered servers were not properly isolated from each other. This resulted in both clusters fighting over resources that were not needed. Our hardware vendors came to the same conclusion and a plan was put in motion:

On April 17th we met with our vendor at our Data Center to perform that “unscheduled maintenance” we announced at the last minute (sorry about that) and… so far so good! For almost two weeks now, we have been unable to detect any of the “dropouts” that we were previously suffering from.

We’re sorry that it took so long to correctly diagnose and solve this bug, but it is our conclusion that these “micro-outages” have all but disappeared.

STATUS: SOLVED
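
To illustrate the distinction made under ISSUE 2 between a site that is slow and a site that is actually “down”, here is a minimal Python sketch of how a monitoring check could report the two cases separately. It is not how any particular monitoring service or (mt) Media Temple system works. The function name and both thresholds are hypothetical and chosen only for illustration.

```python
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
SLOW_AFTER_SECONDS = 3.0    # slower than this counts as "degraded"
TIMEOUT_SECONDS = 30.0      # no answer within this time counts as "down"


def classify_check(response_seconds: Optional[float]) -> str:
    """Classify a single monitoring probe.

    response_seconds is the measured page-load time, or None when the
    probe received no response at all.
    """
    if response_seconds is None or response_seconds >= TIMEOUT_SECONDS:
        return "down"
    if response_seconds >= SLOW_AFTER_SECONDS:
        return "degraded"
    return "up"


if __name__ == "__main__":
    # A fast response, a slow response during a load spike, and no response.
    for sample in (0.4, 8.2, None):
        print(sample, classify_check(sample))
```

A monitor that only knows “up” and “down” and uses a short timeout would report the slow response above as an outage, which matches the behavior described for the load spikes.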

We understand the very real impact all of this has on your service and that there is much more to be done. We have made the current state of the (gs) platform our top priority and promise to report any additional findings when available. Until such time, this is the best way to notify you of any availability issues.

Update: April 29, 2008

Tuesday, April 29th, 2008 at 8:48 am

(mt) Media Temple would like to apologize for the 30 minutes of unavailability that occurred at 9:15AM Pacific Time this morning.  Our engineers brought all services back online and are continuing to closely monitor all activity on Cluster.2 at this time.  We appreciate your patience as we further explore the resolutions to this incident mentioned in our previous responses.  Thank you.