An administration outage on September 23, 2021 was triggered by a storage outage at our service provider, Microsoft Azure. The outage impacted all servers in our NA region for 33 minutes.
This outage is one of a series of outages that have occurred since August 25, 2021. The SecurID Engineering team has been working non-stop to determine the root cause and has concluded that, although each outage was triggered by a different event, all of them were ultimately caused by the same defect in our retry logic, which is exposed under certain load and failure conditions.
On September 23, the storage outage at our service provider exposed the defect in our retry logic, which caused the node CPUs to reach 100% utilization. When the storage outage ended, however, the nodes did not recover. The retry logic defect resulted in frequent SSL connection attempts to our backend nodes, which continued even after the nodes were stopped. Stopping all affected nodes simultaneously for a short period gave network traffic time to subside and node resources time to recover.
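For readers unfamiliar with this failure mode, the following is a minimal, generic sketch of retry logic that bounds the number of attempts and spaces them out with capped exponential backoff and random jitter. It is an illustration only, not the SecurID implementation or the actual fix; the names `call_with_retry` and `backoff_delays` and all parameter values are hypothetical.

```python
import random
import time


def backoff_delays(max_attempts=5, base=0.5, cap=30.0, rng=random.random):
    """Yield the wait (in seconds) before each retry.

    Each wait is drawn uniformly from a window that doubles per retry
    and is capped, so clients do not reconnect in lockstep.
    """
    for attempt in range(max_attempts - 1):
        window = min(cap, base * (2 ** attempt))
        yield rng() * window


def call_with_retry(operation, max_attempts=5, base=0.5, cap=30.0,
                    sleep=time.sleep, rng=random.random):
    """Run operation(), retrying on ConnectionError.

    At most max_attempts calls are made in total (hypothetical default).
    """
    delays = backoff_delays(max_attempts, base, cap, rng)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                # Retry budget exhausted: surface the error instead of
                # continuing to open connections against a struggling backend.
                raise
            sleep(delay)
```

Without an upper bound on attempts and a growing delay between them, every failed connection can immediately produce another one, which is consistent with the sustained connection load described above.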
Additionally, the outage exposed an issue where our feature toggle component is unnecessarily dependent on one type of node. This issue caused authentications to time out and fail on na.access.
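One common way to loosen this kind of dependency, shown here purely as a hypothetical sketch and not as a description of our component, is to let the toggle reader fall back to a last known or default value when its source is unreachable. The class `ToggleReader`, its `fetch` callable, and the toggle names are assumptions made for illustration.

```python
class ToggleReader:
    """Reads feature toggles, tolerating an unavailable toggle source."""

    def __init__(self, fetch, defaults):
        self._fetch = fetch              # hypothetical callable: name -> bool
        self._defaults = dict(defaults)  # safe values used before any successful read
        self._last_known = {}            # most recent successfully fetched values

    def is_enabled(self, name):
        try:
            value = bool(self._fetch(name))
        except (TimeoutError, ConnectionError):
            # Source unreachable: answer from the last known value, then the
            # configured default, rather than failing the calling request.
            return self._last_known.get(name, self._defaults.get(name, False))
        self._last_known[name] = value
        return value
```

With a reader of this shape, a request that consults a toggle can still complete while the toggle source is down, at the cost of possibly acting on a slightly stale value.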
The following mitigations will be put in place in our next release to help prevent these issues in the future:
The above mitigations are part of a broader set of mitigations that we are implementing to help avoid additional outages like those occurring since August 25. Mitigations will be rolled out to all customers by October 23.
The SecurID team apologizes for this incident and acknowledges the disruption that outages like this can cause. We are making every effort to avoid outages like this in the future.
Thank you,
The SecurID Team