#1420 - Incident Review
July 27th, 2010 at 10:58 pm
This post is a summary of Incident #1420, relating to a period of authentication issues with the (gs) Grid-Service.
Earlier today, the AccountCenter became unavailable for approximately 15 minutes due to a MySQL replication issue. Soon after, we began receiving reports of failed email and FTP authentication from customers on various Clusters. After some investigation, it was determined that a portion of the account authentication servers used by each (gs) Cluster were out of sync. These servers are replicated database slaves, which are normally self-healing. Replication is the process by which all new password changes are stored and synced across our multi-node, clustered (gs) Grid-Service platform.
(mt) Engineers identified the source of this issue and made the appropriate corrections to restore functionality to these servers.
- Date/Time: The issue started at approximately 3:15 PM on Tuesday, July 27, 2010 and was resolved by 6:30 PM. Service impact was variable across the (gs) Grid-Service during this time.
- Symptoms: Customers creating or modifying email addresses or updating FTP/SSH passwords may have experienced authentication failures.
- Impact: All (gs) Grid-Service Clusters were affected. mail was lost during this time.
- Root Cause and Takeaways: Although our investigation is ongoing, we have identified a point where the binary logs that are required for replication were corrupted. Going forward, we are looking into system changes that would help prevent this issue from recurring. We will also be looking into increasing the efficiency of our replication repair utilities, which would allow us to repair replication services for all Clusters simultaneously.
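For readers curious what checking replication across many Clusters at once might look like, here is a minimal Python sketch. It is not the (mt) repair tooling. It assumes each replica's status is available as a dictionary keyed by the field names MySQL reports in `SHOW SLAVE STATUS` (`Slave_IO_Running`, `Slave_SQL_Running`, `Seconds_Behind_Master`). The function names, the `fetch_status` callable, the cluster names, and the 60-second lag threshold are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def replica_is_healthy(status, max_lag_seconds=60):
    """Return True if a SHOW SLAVE STATUS row (as a dict) looks healthy.

    max_lag_seconds is an illustrative threshold, not a recommended value.
    """
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master")
    # MySQL reports NULL (None here) when the lag cannot be determined.
    return lag is not None and lag <= max_lag_seconds


def find_broken_replicas(cluster_names, fetch_status, max_workers=8):
    """Check all clusters concurrently and return the unhealthy ones, sorted.

    fetch_status is a hypothetical callable that takes a cluster name and
    returns that cluster's replica status as a dict.
    """
    names = list(cluster_names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(fetch_status, names))
    return sorted(n for n, s in zip(names, statuses) if not replica_is_healthy(s))


# Made-up status data standing in for real replicas.
FAKE_STATUS = {
    "cluster-a": {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
                  "Seconds_Behind_Master": 0},
    "cluster-b": {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No",
                  "Seconds_Behind_Master": None},
}

if __name__ == "__main__":
    print(find_broken_replicas(FAKE_STATUS, FAKE_STATUS.__getitem__))
```

Checking every Cluster in parallel rather than one at a time is the same idea as the simultaneous repair mentioned above: the total time is bounded by the slowest replica instead of the sum of all of them.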
This now concludes this System Incident. If you feel that you are still experiencing the symptoms outlined in this post, please open a support request from the (mt) AccountCenter.