Authentication issues on (gs) Grid.Cluster.2  

Incident Tracker status: RESOLVED

Authentication server replication issues

Wednesday, December 12th, 2007 at 3:47 pm

After the authentication server was brought back online, the other servers needed to replicate all of the new data to the server that was re-introduced to the system.  The replication process increased the load on this server, which caused the symptoms we saw earlier this morning to recur.  The replication has now completed and all services should be working properly at this time.

As mentioned in a previous update, work will be done on the application-level load balancing to ensure that this type of failure is handled more elegantly in the future.

Maintenance completed and issues resolved.

Wednesday, December 12th, 2007 at 1:01 pm

Our systems engineers have completed the maintenance and restored the authentication server successfully.  The memory upgrade was seamless and we do not anticipate any further disruptions in service, but we will be monitoring all of the services closely for the next 24 hours.

Mail delays due to this issue

Wednesday, December 12th, 2007 at 11:48 am

As a result of the issues this morning, customers may be experiencing some delays with both inbound and outbound email.  We do not anticipate any loss of email and expect all messages to be delivered shortly.

Systems engineers have also completed the maintenance on our authentication server and are in the process of bringing this server back online.

Issue has returned

Wednesday, December 12th, 2007 at 11:33 am

It seems that our temporary solution to this issue held for a much shorter time than anticipated.  We had a brief recurrence of this issue and have decided to remove the affected member from the redundant authentication subsystem so that we can further investigate the root cause and perform the memory upgrade ahead of schedule.  There should be no customer impact as a result of this removal due to the redundant architecture of the system.  We will provide another update after the maintenance has been completed.

Authentication issues resolved

Wednesday, December 12th, 2007 at 10:54 am

It appears that the root cause of the issues today was a performance degradation of one of the members of the authentication cluster. Though the system was equipped with what has historically been sufficient physical memory, it exceeded that capacity and began to swap, causing authentication failures for applications that need this service in order to connect.
Though this system has internal redundancies, this “degraded state” was handled less cleanly than if a “complete failure” of that node had actually occurred.
We have temporarily resolved the issue by restarting the authentication services, which has cleared up the aberrant over-utilization of physical memory.
Remedies:
• A memory upgrade on the hardware to ensure that there is sufficient memory to prevent these issues. We do not anticipate any downtime associated with the hardware upgrade and expect a seamless operation.
• Monitoring for this exceptional case will also be added to the system. The application-level load balancing code is being reviewed to see if this case could have been detected and avoided more cleanly.
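
The degraded-state case described above can be sketched roughly as follows. This is an illustrative Python sketch only, not the actual monitoring or load-balancing code: the names `swap_used_kb`, `NodeStatus`, `is_healthy` and `eligible_nodes`, as well as the 64 MB swap threshold, are assumptions made for the example. It assumes each authentication node reports memory figures in the Linux /proc/meminfo format, and it treats a node that is reachable but swapping heavily the same way as a node that is down.

```python
from dataclasses import dataclass


def swap_used_kb(meminfo_text: str) -> int:
    """Return swap in use (kB) from text in Linux /proc/meminfo format."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


@dataclass
class NodeStatus:
    """Hypothetical status record for one authentication node."""
    name: str
    reachable: bool
    swap_used_kb: int


def is_healthy(node: NodeStatus, max_swap_kb: int = 65536) -> bool:
    # A node that answers but is swapping heavily counts as unhealthy,
    # so a "degraded state" is handled like a "complete failure".
    return node.reachable and node.swap_used_kb <= max_swap_kb


def eligible_nodes(nodes, max_swap_kb: int = 65536):
    """Names of nodes that should receive authentication requests."""
    healthy = [n.name for n in nodes if is_healthy(n, max_swap_kb)]
    if healthy:
        return healthy
    # If every node looks degraded, fall back to any reachable node
    # rather than refusing all authentication requests.
    return [n.name for n in nodes if n.reachable]
```

With a check along these lines, a swapping node would be taken out of rotation until its swap usage drops back under the threshold, instead of continuing to receive requests it cannot answer in time.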

Authentication issues on (gs) Grid.Cluster.2

Wednesday, December 12th, 2007 at 9:22 am

Customers on (gs) Grid.Cluster.2 may be experiencing issues connecting to email, FTP, SSH and any other services that require authentication. Our systems engineers are currently investigating the issue and will provide further updates shortly.
Thank you for your patience and understanding.