PRELIMINARY RCA
Incidents on April 17th 9:05 PM UTC – 9:15 PM UTC, April 19th 9:14 PM UTC – 9:20 PM UTC, and April 20th 12:59 PM UTC – 1:35 PM UTC resulted in intermittent Authentication and Administration Service degradation for customers in our NA region. Customers may have experienced these incidents as delays or intermittent authentication failures.
The cause of these incidents was that some of the nodes within our load balancer cluster were intermittently losing their connections to remotely mounted drives that are essential to proper operation. Traffic handled by the impacted load balancer nodes slowed down and eventually failed, while traffic handled by the other nodes was processed without delay. Because the degradation was partial, we did not reach an overall failure rate threshold that would have triggered our disaster recovery failover procedure.
As a result of the incidents on April 17th and April 19th, we were working closely with our vendors to identify a full RCA and determine appropriate mitigations. Because of the intermittent pattern of failure noted above, RSA and our vendors initially reached an incomplete RCA and mitigation plan. Additional evidence provided by our enhanced monitoring led us to realize that additional mitigations were necessary. These mitigations have now been put in place in all of our production environments.
RECOVERY
RSA is continuously taking steps to improve the RSA SecurID Access service and our processes to help ensure such incidents do not occur in the future. In this case, steps include (but are not limited to):