RSA SecurID Access Service Incident (NA Region)
Incident Report for RSA ID Plus
Postmortem

On September 21, 2021, an authentication outage occurred that lasted 1 hour and 12 minutes. The outage occurred during an upgrade of our na2.access authentication server from the July 2021 release to the August 2021 release. During the upgrade process, the front-end CPUs hit maximum load and recovered only after the servers were rolled back to the July release.

The primary root cause of the issue was a change to the configuration of our virtual machines. The change was put in place after an outage on September 12 but did not provide sufficient performance.

During this outage, nodes experiencing very high CPU utilization changed status from Healthy, to Degraded, to Down, and back to Degraded after a single successful connection was made. These frequent status changes prevented identity router traffic from remaining on a single node, causing additional load on the node CPUs as traffic moved from one node to another.
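
The flapping behavior described above can be illustrated with a small sketch. It is not the SecurID health-check implementation; the `NodeHealth` class, its thresholds, and the probe sequence in the usage note are all hypothetical. It only shows why promoting a node after a single successful connection produces many status changes under load, while requiring several consecutive results damps them.

```python
from dataclasses import dataclass

HEALTHY, DEGRADED, DOWN = "Healthy", "Degraded", "Down"


@dataclass
class NodeHealth:
    """Hypothetical per-node health tracker driven by probe results."""
    promote_after: int = 1  # consecutive successes needed to move up one level
    demote_after: int = 1   # consecutive failures needed to move down one level
    status: str = HEALTHY
    _successes: int = 0
    _failures: int = 0

    def record(self, ok: bool) -> str:
        """Record one probe result and return the resulting status."""
        if ok:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.promote_after:
                self._successes = 0
                # Move one step toward Healthy; Healthy stays Healthy.
                self.status = {DOWN: DEGRADED, DEGRADED: HEALTHY}.get(
                    self.status, self.status
                )
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.demote_after:
                self._failures = 0
                # Move one step toward Down; Down stays Down.
                self.status = {HEALTHY: DEGRADED, DEGRADED: DOWN}.get(
                    self.status, self.status
                )
        return self.status


def count_transitions(tracker: NodeHealth, probes: list[bool]) -> int:
    """Count how many times the status changes over a probe sequence."""
    changes = 0
    for ok in probes:
        before = tracker.status
        if tracker.record(ok) != before:
            changes += 1
    return changes
```

For an overloaded node whose probes alternate between failure and occasional success, a tracker with both thresholds set to 1 changes status on nearly every probe, while one created as `NodeHealth(promote_after=3, demote_after=2)` settles on Degraded and stays there, so traffic is not repeatedly moved between nodes.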

Secondary contributing factors in the outage:

  • System improvements that helped increase the speed of monthly upgrades caused a surge of activity on the nodes hosting the new release.
  • Recent internal changes removed regular and recurring load spikes but raised the average traffic load between server nodes by a significant percentage. 

The following mitigations are being implemented for our next release: 

  • Increase capacity of our virtual machine configuration.
  • Improve internal network handling to avoid health status changes in nodes that are still operational but operating extremely slowly due to high CPU consumption.
  • Throttle the speed of system upgrades.
  • Add performance tests constructed specifically around the cause of this outage.
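
As one illustration of what throttling upgrade speed could look like, the sketch below upgrades nodes in small batches with a pause between batches, so the new release does not absorb a surge of activity all at once. The function names, the batch size, and the pause length are assumptions made for this example and do not describe the actual SecurID upgrade tooling.

```python
import time
from typing import Callable, Iterator, Sequence


def batches(nodes: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size nodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]


def throttled_upgrade(
    nodes: Sequence[str],
    upgrade_node: Callable[[str], None],
    batch_size: int = 1,
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Upgrade nodes batch by batch, pausing between batches.

    upgrade_node is a hypothetical callback that upgrades a single node.
    sleep is injectable so the pause can be replaced in tests.
    """
    for index, batch in enumerate(batches(nodes, batch_size)):
        if index > 0:
            # Give the already-upgraded nodes time to absorb the moved load.
            sleep(pause_seconds)
        for node in batch:
            upgrade_node(node)
```

A fixed pause is the simplest choice; a real rollout would more likely wait on a load or health signal from the upgraded nodes before starting the next batch.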

Additional mitigations are also being investigated: 

  • Use more efficient SSL libraries for SSL connections. 
  • Limit the number of connections to prevent excessive CPU utilization.
  • Update our internal connections to TLS 1.3, which needs fewer round trips than earlier TLS versions when establishing a connection.
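
Two of the ideas above, capping concurrent connections and requiring TLS 1.3, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the approach the SecurID team is evaluating: `ConnectionLimiter` and `tls13_client_context` are hypothetical names, and rejecting excess connections immediately is only one possible policy.

```python
import ssl
import threading
from contextlib import contextmanager


class ConnectionLimiter:
    """Hypothetical cap on the number of connections handled at once."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def slot(self):
        """Hold one connection slot, or refuse if none is free."""
        if not self._slots.acquire(blocking=False):
            # Reject rather than queue, so CPU is not spent on excess handshakes.
            raise ConnectionRefusedError("connection limit reached")
        try:
            yield
        finally:
            self._slots.release()


def tls13_client_context() -> ssl.SSLContext:
    """Build a client context that refuses anything older than TLS 1.3."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context
```

A server loop would wrap each accepted connection in `with limiter.slot():` and treat `ConnectionRefusedError` as a signal to close the socket early. Setting the minimum version to TLS 1.3 requires that the Python build is linked against an OpenSSL version that supports it, which `ssl.HAS_TLSv1_3` reports.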

The SecurID team would like to apologize for the inconvenience caused by this outage. We understand the disruption that incidents like this cause and are taking the necessary steps to help avoid similar incidents in the future.

Thank you,
The SecurID Team

Posted Oct 01, 2021 - 22:30 UTC

Resolved
Authentication services have been restored for customers impacted by today’s outage on na2.access.securid.com. Our SaaS Operations Team has completed the rollback of impacted customers to our July 2021 release.

We continue to investigate the cause of today’s issue. Customers on na2.access.securid.com will be upgraded to the August 2021 release when we have completed our investigation and have a fix in place.

We apologize for any inconvenience this has caused and will post a root cause analysis as soon as it is available.

Thank you,
The SecurID Team
Posted Sep 21, 2021 - 21:17 UTC
Identified
The issue is currently being monitored.
Posted Sep 21, 2021 - 20:42 UTC
Investigating
RSA has detected an issue affecting RSA SecurID Access.
RSA SaaS Operations is investigating the issue and will post updates as they become available.
Posted Sep 21, 2021 - 20:21 UTC
This incident affected: NA (na2.access Authentication Service).