The authentication outage on October 4, 2021 is related to a series of outages that have occurred since August 25, 2021. The SecurID Engineering team has been working non-stop to determine the root cause of these outages and has concluded that each outage was ultimately caused a defect in our retry logic under certain load and failure conditions. While each of the outages was triggered by a different event, each event ultimately exposed the same defect in our retry logic, causing outages in each case.
On October 4, the outage was triggered by an audit log purge that took three minutes. Due to the retry logic bug, this purge resulted in frequent SSL connection attempts to our backend nodes. These frequent SSL connection attempts continued even after the nodes were stopped. Simultaneously stopping the affected nodes for a short period of time allowed enough time for network traffic to stop and node resources to recover.
The following mitigations will be put in place in our next release to help prevent these issues in the future:
The above mitigations are part of a broader set of mitigations that we are implementing to help avoid additional outages like those occurring since August 25. Mitigations will be rolled out to all customers by October 23.
The SecurID team apologizes for this incident and acknowledges the disruption that outages like this can cause. We are making every effort to avoid outages like this in the future.
The SecurID Team