RSA SecurID Access Service Incident (NA Region)
Incident Report for RSA ID Plus
Postmortem

The authentication outage on October 4, 2021 is related to a series of outages that have occurred since August 25, 2021. The SecurID Engineering team has been working non-stop to determine the root cause of these outages and has concluded that each one was ultimately caused by a defect in our retry logic that appears under certain load and failure conditions. Each outage was triggered by a different event, but every one of those events exposed the same defect.

On October 4, the outage was triggered by an audit log purge that took three minutes. Because of the retry logic defect, this purge resulted in frequent SSL connection attempts to our backend nodes, and these attempts continued even after the nodes were stopped. Stopping all of the affected nodes simultaneously for a short period gave network traffic time to stop and node resources time to recover.
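
This report does not describe the defect itself or how it will be fixed. As a general illustration only, the sketch below shows one common way to keep retries from turning into a sustained stream of reconnection attempts: a bounded number of attempts with capped exponential backoff and random jitter. The function name `call_with_retry` and every parameter value are assumptions made for this example, not part of the SecurID implementation.

```python
import random
import time


def call_with_retry(operation, max_attempts=5, base=0.5, cap=30.0,
                    sleep=time.sleep, rng=random.random):
    """Call operation() and retry on ConnectionError, at most max_attempts calls.

    Hypothetical helper for illustration. The delay before each retry is
    drawn at random from [0, min(cap, base * 2**attempt)) so that many
    clients failing together do not reconnect in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying indefinitely
            sleep(rng() * min(cap, base * 2 ** attempt))
```

The two properties that matter here are the hard limit on attempts and the growing, randomized wait between them. Together they bound how many new connections a failing client can open in a given period.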

The following mitigations will be put in place in our next release to help prevent these issues in the future:

  • Resolve the defect in our retry logic.
  • Adjust the Connection Keep-Alive default to reduce the number of HTTPS connections that must be re-established.
  • Purge audit log records in small batches.
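
To illustrate the last mitigation, the sketch below deletes old records in small batches and commits after each one, so that no single long-running delete holds database resources for minutes at a time. It is a generic example using Python's built-in `sqlite3` module. The table name `audit_log`, the column `created_at`, and the function `purge_in_batches` are hypothetical, and the report does not say which database or schema the service uses.

```python
import sqlite3


def purge_in_batches(conn, cutoff, batch_size=1000):
    """Delete audit_log rows older than cutoff, batch_size rows at a time.

    Hypothetical schema: audit_log(created_at, ...). Returns the total
    number of rows deleted.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM audit_log WHERE rowid IN ("
            "SELECT rowid FROM audit_log WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # end the transaction after every batch
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

A smaller `batch_size` means more round trips but shorter individual transactions. The right value depends on the workload and is not something this report specifies.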

The above mitigations are part of a broader set of mitigations that we are implementing to help avoid additional outages like those occurring since August 25. Mitigations will be rolled out to all customers by October 23.

The SecurID team apologizes for this incident and acknowledges the disruption that outages like this can cause. We are making every effort to avoid outages like this in the future.

Thank you,
The SecurID Team

Posted Oct 18, 2021 - 21:43 UTC

Resolved
The authentication outage on na3.access has been resolved. We will continue to monitor the service, and will publish a root cause for this issue as soon as it is available.

Thank you.
Posted Oct 04, 2021 - 16:41 UTC
Monitoring
Authentication is recovering on our na3.access server. We continue to monitor this issue and will provide additional updates within 30 minutes.
Posted Oct 04, 2021 - 16:29 UTC
Update
We continue to investigate an outage on our na3.access authentication server. Authentications on this server are slow or are failing. We will provide an update within the next 30 minutes.
Posted Oct 04, 2021 - 15:53 UTC
Investigating
RSA has detected an issue affecting RSA SecurID Access.
RSA SaaS Operations is investigating the issue and will post updates as they become available.
Posted Oct 04, 2021 - 15:19 UTC
This incident affected: NA (na3.access Authentication Service).