On September 21, 2021, an authentication outage lasting 1 hour and 12 minutes occurred during an upgrade of our na2.access authentication server from the July 2021 release to the August 2021 release. During the upgrade, the front-end CPUs hit maximum load and recovered only once the July release was restored.
The primary root cause was a change to the configuration of our virtual machines. The change was put in place after an outage on September 12, but it did not deliver sufficient performance.
During this outage, nodes experiencing very high CPU utilization changed status from Healthy to Degraded to Down, and then back to Degraded after a single successful connection. These frequent status changes prevented identity router traffic from staying on a single node, and the movement of traffic from node to node put additional load on the node CPUs.
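The status changes described above can be illustrated with a small simulation. This is a simplified sketch, not the SecurID implementation: the `Status` names come from this report, but `track_status`, `count_changes`, and the `successes_to_recover` parameter are hypothetical, and the rule that each failed connection moves a node one step toward Down is an assumption made for illustration. A value of 1 for `successes_to_recover` mirrors the behavior seen in the outage, where a single successful connection moved a node from Down back to Degraded. Larger values are shown only for comparison and do not describe a planned mitigation.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    DOWN = "Down"


def track_status(results, successes_to_recover=1):
    """Return the sequence of statuses a node passes through.

    results: iterable of booleans, True for a successful connection.
    successes_to_recover: consecutive successes needed to leave Down.
    The value 1 mirrors the behavior in this report; other values
    are hypothetical.
    """
    status = Status.HEALTHY
    history = [status]
    streak = 0  # consecutive successful connections seen so far
    for ok in results:
        streak = streak + 1 if ok else 0
        if not ok:
            # Assumption: each failure moves the node one step toward Down.
            new = Status.DEGRADED if status is Status.HEALTHY else Status.DOWN
        elif status is Status.DOWN and streak >= successes_to_recover:
            new = Status.DEGRADED
        else:
            new = status
        if new is not status:
            status = new
            history.append(status)
    return history


def count_changes(results, successes_to_recover=1):
    """Count how many times the node's status changed."""
    return len(track_status(results, successes_to_recover)) - 1


if __name__ == "__main__":
    # An overloaded node: mostly failures, with occasional successes.
    probes = [False, False, True, False, True, False, True]
    print(count_changes(probes, successes_to_recover=1))  # 7
    print(count_changes(probes, successes_to_recover=3))  # 2
```

With a single success sufficient to leave Down, the sample sequence produces seven status changes. In this model, each change is a point at which traffic could be redirected to or from the node.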
Secondary contributing factors in the outage:
The following mitigations are being implemented for our next release:
Additional mitigations are also being investigated:
The SecurID team would like to apologize for the inconvenience caused by this outage. We understand the disruption that incidents like this cause and are taking the necessary steps to help avoid similar incidents in the future.
The SecurID Team