#1418 - Incident Review
July 27th, 2010 at 4:25 pm
This post is a summary of Incident #1418, a period of excessive load and service interruption that affected Storage Segment 03 on Cluster.03 of the (gs) Grid-Service.
Details:
Date/Time: The issue started at approximately 12:12 PM on Tuesday, July 27 and was resolved by 1:15 PM, Pacific Time. Service impact was contained to a window of a little more than an hour.
Symptoms: Access to all services was interrupted. This included:
- HTTP
- FTP/SFTP/SSH
- Email and webmail
During the period of website unavailability, affected sites would have produced a “403 Forbidden” or a “500 Internal Server Error” message.
Impact: All customers on (gs) Grid-Service Cluster.03, Storage Segment 03 were affected by this system incident. The rest of the (gs) Grid-Service, all (dv) Dedicated-Virtual Servers, and all (ve) Servers remained unaffected.
Root Cause: Our engineers have determined that the high load was caused by a very high file lock count on Storage Segment 03. The immediate fix was a reboot of the storage segment, which led to the service interruption noted above. Once the storage segment stabilized, we directly notified the customers whose file lock counts were higher than normal, and temporarily took some of their services offline to protect other customers on the same storage segment.
Takeaways: We are actively monitoring the entire cluster for high load and for users with abnormally high file lock counts. If we find any unusual usage, we will notify customers individually and work diligently to prevent any further service interruption.
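For readers who want to watch for the same symptom on their own Linux systems, here is a minimal sketch of a lock-count check. It is an illustration only, not the tooling used on the (gs) Grid-Service. It assumes the lock table is available as text in the layout of the Linux `/proc/locks` file, and it groups locks by process ID rather than by customer. The function names, the sample data, and the threshold are all hypothetical.

```python
from collections import Counter

def count_locks_by_pid(proc_locks_text):
    """Count held locks per PID from text in the /proc/locks layout (assumed input)."""
    counts = Counter()
    for line in proc_locks_text.splitlines():
        fields = line.split()
        # Skip blank or short lines, and "->" entries, which describe
        # processes waiting on a lock rather than holding one.
        if len(fields) < 8 or fields[1] == "->":
            continue
        counts[int(fields[4])] += 1  # the fifth field is the owning PID
    return counts

def flag_heavy_lockers(counts, threshold):
    """Return the PIDs holding more than `threshold` locks, in ascending order."""
    return sorted(pid for pid, held in counts.items() if held > threshold)

# Made-up sample data for illustration; these are not figures from the incident.
SAMPLE = """\
1: POSIX  ADVISORY  WRITE 101 08:01:1001 0 EOF
2: FLOCK  ADVISORY  WRITE 101 08:01:1002 0 EOF
2: -> FLOCK  ADVISORY  WRITE 303 08:01:1002 0 EOF
3: POSIX  ADVISORY  READ 202 08:01:1003 0 EOF
4: POSIX  ADVISORY  WRITE 101 08:01:1004 0 EOF
"""

counts = count_locks_by_pid(SAMPLE)
heavy = flag_heavy_lockers(counts, threshold=2)
```

On a live Linux host, the same function could be given the contents of `/proc/locks` instead of `SAMPLE`. A sensible threshold depends on the workload and would need to be chosen from normal usage on that system.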
This now concludes this System Incident. If you feel that you are still experiencing the symptoms outlined in this post, please open a support request from the (mt) AccountCenter.