Incidents on April 17th from 9:05 PM to 9:15 PM UTC, April 19th from 9:14 PM to 9:20 PM UTC, and April 20th from 12:59 PM to 1:35 PM UTC resulted in intermittent Authentication and Administration Service degradation for customers in our NA region. Customers may have experienced these incidents as delays or intermittent authentication failures.
These incidents occurred because some nodes within our load balancer cluster were intermittently losing connections to remote-mounted drives that are essential to proper operation. Traffic handled by the impacted load balancer nodes slowed and eventually failed, while traffic routed through healthy nodes was served without delay. Because the degradation was partial, the overall failure rate never crossed the threshold that would have triggered our disaster recovery failover procedure, as the sketch below illustrates.
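To make the threshold behavior concrete, here is a minimal sketch of why a severe failure on one node can stay under a cluster-wide failover threshold. The node names, request counts, and threshold values are illustrative assumptions, not RSA's actual configuration:

```python
# Sketch: aggregate vs. per-node failure-rate checks.
# All node names, counts, and thresholds are illustrative assumptions.

FAILOVER_THRESHOLD = 0.25  # hypothetical cluster-wide rate that triggers DR failover
NODE_THRESHOLD = 0.10      # hypothetical per-node rate worth alerting on

# Requests and failures observed per load balancer node in one window.
nodes = {
    "lb-1": {"requests": 10_000, "failures": 9_500},  # node losing its remote mount
    "lb-2": {"requests": 10_000, "failures": 20},
    "lb-3": {"requests": 10_000, "failures": 15},
    "lb-4": {"requests": 10_000, "failures": 25},
}

total_requests = sum(n["requests"] for n in nodes.values())
total_failures = sum(n["failures"] for n in nodes.values())
cluster_rate = total_failures / total_requests

# Aggregate view: ~23.8% failure rate, just under the failover threshold,
# so no disaster recovery failover fires even though lb-1 is nearly dead.
print(f"cluster failure rate: {cluster_rate:.1%} "
      f"(failover: {cluster_rate >= FAILOVER_THRESHOLD})")

# Per-node view: the degraded node stands out immediately.
for name, n in nodes.items():
    rate = n["failures"] / n["requests"]
    if rate >= NODE_THRESHOLD:
        print(f"ALERT {name}: per-node failure rate {rate:.1%}")
```

In this illustration, per-node monitoring flags the degraded node even though the cluster-wide rate alone never trips failover.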
Following the incidents on April 17th and April 19th, we worked closely with our vendors to complete a full root cause analysis (RCA) and determine appropriate mitigations. Because of the intermittent failure pattern described above, RSA and our vendors initially arrived at an incomplete RCA and mitigation plan. Additional evidence from our enhanced monitoring showed that further mitigations were necessary; these mitigations have now been put in place in all of our production environments.
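As one example of the kind of per-node check that enhanced monitoring can add, the sketch below probes whether a remote mount is actually writable within a deadline. The mount path, timeout, and reporting are hypothetical assumptions for illustration, not RSA's actual tooling:

```python
# Sketch: per-node liveness probe for a remote-mounted drive.
# Mount path, timeout, and reporting are illustrative assumptions.

import os
import tempfile
import threading

MOUNT_POINT = "/mnt/shared-config"  # hypothetical remote mount the node depends on
TIMEOUT_SECONDS = 5.0

def probe_mount(path: str) -> bool:
    """Write, sync, and delete a small temp file to confirm the mount is usable."""
    with tempfile.NamedTemporaryFile(dir=path, prefix=".healthcheck-") as f:
        f.write(b"ok")
        f.flush()
        os.fsync(f.fileno())
    return True

def mount_is_healthy(path: str, timeout: float) -> bool:
    # A wedged remote mount tends to hang rather than raise an error, so run
    # the probe on a daemon thread and treat a missed deadline as unhealthy.
    result = {}

    def worker():
        try:
            result["ok"] = probe_mount(path)
        except OSError:
            result["ok"] = False

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    return result.get("ok", False)

if __name__ == "__main__":
    if mount_is_healthy(MOUNT_POINT, TIMEOUT_SECONDS):
        print(f"OK: {MOUNT_POINT} is writable")
    else:
        # In a real deployment this would page on-call and could drain the
        # node from the load balancer pool instead of just printing.
        print(f"UNHEALTHY: {MOUNT_POINT} not writable within {TIMEOUT_SECONDS}s")
```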
RSA continually takes steps to improve the RSA SecurID Access service and our processes to help prevent such incidents from recurring. In this case, those steps include (but are not limited to):