#1061 Incident Review
December 16th, 2009 at 4:39 pm
This post is a summary of Incident #1061, relating to a period of service unavailability for customers on (gs) Grid Service Cluster 04.
Details:
- Date/Time: The issue started at approximately 12:35 PM on Wednesday, December 16, 2009 and was resolved by 2:44 PM, Pacific Time. Service impact was contained to a window of just over 2 hours.
- Symptoms: Customers on Cluster 04 were unable to reach any hosting functions, including web and email services. The AccountCenter was also unavailable during this time.
- Impact: Only Cluster 04 was affected. All other (gs) Grid Service clusters were available during this incident.
- Root Cause: The root cause of today’s issue was an outstanding bug in the firmware of our servers supporting the (gs) Grid-Service, which caused a host machine for the segment of (gs) Grid Cluster 04 to switch to a slower network speed. This is a known bug, and we have been working with our vendors to get it addressed as soon as possible. The slow network interface made the cluster appear to be operating (and booting) extremely slowly. Although checking for this bug was an early troubleshooting step, the check was run improperly, so this possibility was initially ruled out. The link that failed in this way was part of a highly available system. However, because it failed to a “slow connection” instead of a “down connection”, failover was not triggered.
- Takeaways: Moving forward, we will be automating the detection and self-correction of this issue while we wait for updated drivers or firmware to permanently resolve the problem.
The odds of a recurrence are minimal.
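To illustrate the kind of check described above, here is a minimal sketch in Python of a link health test that treats a "slow connection" the same as a "down connection". It is not our production tooling. The function names, the 1000 Mb/s expected speed, and the use of the Linux sysfs files `operstate` and `speed` are all assumptions made for the example.

```python
from pathlib import Path
from typing import Tuple

# Assumed nominal link speed in Mb/s; the incident report does not state it.
EXPECTED_MBPS = 1000


def classify_link(operstate: str, speed_mbps: int,
                  expected_mbps: int = EXPECTED_MBPS) -> str:
    """Return 'down', 'degraded', or 'healthy' for one network link."""
    if operstate != "up" or speed_mbps <= 0:
        return "down"
    if speed_mbps < expected_mbps:
        # The link is up but running below the expected speed: the case
        # that a plain up/down check does not catch.
        return "degraded"
    return "healthy"


def should_fail_over(status: str) -> bool:
    """Treat a degraded link like a down link when deciding on failover."""
    return status in ("down", "degraded")


def read_link(iface: str, sysfs_root: str = "/sys/class/net") -> Tuple[str, int]:
    """Read operstate and speed (Mb/s) for an interface from Linux sysfs."""
    base = Path(sysfs_root) / iface
    operstate = (base / "operstate").read_text().strip()
    try:
        speed = int((base / "speed").read_text().strip())
    except (OSError, ValueError):
        speed = -1  # speed may be unreadable or unknown when the link is down
    return operstate, speed
```

A monitoring job could call `read_link("eth0")` on a schedule (the interface name is hypothetical), pass the result to `classify_link`, and trigger failover or an alert whenever `should_fail_over` returns true.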
This now concludes this System Incident. If you feel that you are still experiencing the symptoms outlined in this post, please open a support request from the (mt) AccountCenter.