(mt) Media Temple would like to bring you up to date on the progress we have made so far on this incident affecting our (gs) Grid-Service platform.
We have been dealing with several known issues, all of which are covered by this incident. They all relate to our storage subsystem, in slightly different ways. To communicate more clearly with our customers as a whole, we have simplified these issues below.
In a nutshell, our storage has mostly been suffering from two distinct problems, each resulting in a different type of unavailability. Here is a breakdown of where we stand with each one.
ISSUE 1:
We have identified a bug that we have been tracking for some time. When it occurs, the problem is isolated to a single cluster, but it can potentially happen on any cluster. During an occurrence, all services on that cluster (web, FTP, and email included) become unavailable for 5 to 30 minutes. We are working closely with leading experts to get to the bottom of this problem. We’re still investigating, so more definitive information will be available in the days ahead.
STATUS: ACTIVE
ISSUE 2:
This particular bug resulted in intermittent load spikes on our servers, and may have gone unnoticed by some of our customers. These temporary spikes caused inconsistent webpage-loading times. Many popular website monitoring services even interpreted these periods as “down”. Although we definitely considered the performance sub-par, your service was, in almost all cases, still “up” and available.
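To illustrate why a slow page can be reported as “down”, here is a minimal sketch in Python. It is not how any particular monitoring service works; the function names and the thresholds (`timeout_seconds`, `slow_seconds`) are hypothetical and chosen only for illustration. The first check is binary and treats any late response as an outage, while the second separates “slow” from “down”.

```python
from typing import Optional


def naive_status(response_seconds: Optional[float],
                 timeout_seconds: float = 10.0) -> str:
    """Binary check: anything slower than the timeout is reported as down."""
    if response_seconds is None or response_seconds > timeout_seconds:
        return "down"
    return "up"


def detailed_status(response_seconds: Optional[float],
                    slow_seconds: float = 3.0) -> str:
    """Three-way check: a late response counts as slow, not down."""
    if response_seconds is None:
        return "down"  # no response was received at all
    if response_seconds > slow_seconds:
        return "slow"  # the site answered, just later than we would like
    return "up"
```

With these assumed thresholds, a page that takes 12 seconds to load during a load spike is “down” according to `naive_status(12.0)` but “slow” according to `detailed_status(12.0)`.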
We discovered that the root cause of these spikes was our storage architecture. Under the configuration in place at the time, our clustered servers were not properly isolated from each other. This resulted in both clusters fighting over resources they did not need. Our hardware vendors came to the same conclusion and a plan was put in motion:
On April 17th we met with our vendor at our Data Center to perform that “unscheduled maintenance” we announced at the last minute (sorry about that) and… so far so good! For almost two weeks now, we have not detected any of the “dropouts” that we were previously suffering from.
We’re sorry that it took so long to correctly diagnose and solve this bug, but it is our conclusion that these “micro-outages” have all but disappeared.
STATUS: SOLVED
We understand the very real impact all of this has on your service, and that there is much more to be done. We have made the current state of the (gs) platform our top priority and promise to report any additional findings as they become available. Until then, this remains the best way for us to notify you of any availability issues.