Degraded authentication and administration services

Incident Report for RSA ID Plus

Postmortem

An administration outage on September 23, 2021, was triggered by a storage outage at our service provider, Microsoft Azure. The outage impacted all servers in our NA region for 33 minutes.

This outage is related to a series of outages that have occurred since August 25, 2021. The SecurID Engineering team has been working non-stop to determine the root cause of these outages and has concluded that, although each outage was triggered by a different event, every one of those events exposed the same defect in our retry logic under certain load and failure conditions.

On September 23, the storage outage at our service provider exposed the defect in our retry logic, which drove node CPU utilization to 100%. The nodes did not recover when the storage outage ended: the retry logic defect caused frequent SSL connection attempts to our backend nodes, and those attempts continued even after the nodes were stopped. Stopping all affected nodes simultaneously for a short period gave network traffic time to subside and node resources time to recover.
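
This report does not describe the defect itself or the planned fix. Purely as a general illustration of how retry logic can avoid sustaining connection pressure on a struggling backend, the following is a minimal sketch of capped exponential backoff with full jitter and a bounded number of retries. The names call_with_retries and backoff_delays and all parameter values are hypothetical and are not taken from the SecurID codebase.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, cap: float, attempts: int,
                   rng: random.Random) -> List[float]:
    """Return one randomized delay per retry.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so clients that fail together do not all reconnect at the same moment.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


def call_with_retries(operation: Callable[[], T],
                      max_retries: int = 4,
                      base: float = 0.1,
                      cap: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Run operation, retrying on ConnectionError a bounded number of times.

    Hypothetical helper: the retry count is capped so that a prolonged
    backend outage cannot turn into an unbounded stream of reconnects.
    """
    rng = rng or random.Random()
    delays = backoff_delays(base, cap, max_retries, rng)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            sleep(delays[attempt])
    raise RuntimeError("unreachable")  # loop always returns or raises
```

With these assumed defaults, a caller gives up after five attempts in total and waits progressively longer, at random intervals, between them.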

Additionally, the outage exposed an issue where our feature toggle component is unnecessarily dependent on one type of node. This issue caused authentications to time out and fail on na.access.

The following mitigations will be put in place in our next release to help prevent these issues in the future:

Resolve the defect in our retry logic.
Adjust the Connection Keep-Alive default to reduce the number of HTTPS connections that must be re-established.
Correct the feature toggle component so that it properly relies on a second-level cache. This second-level cache will help to insulate us from future storage outages at our service provider.
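
As a rough illustration of the second-level cache idea in the last item above, the sketch below shows a feature toggle reader that answers from a local cache while entries are fresh and falls back to the last known value when the backing store is unavailable. This is an assumption about how such a cache might behave, not a description of the SecurID implementation. ToggleStore, fetch, and the 30-second TTL are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class ToggleStore:
    """Feature toggle reader with a local second-level cache.

    fetch is a hypothetical callable that reads a toggle from the primary
    backing store and raises an exception when that store is unavailable.
    """

    def __init__(self, fetch: Callable[[str], bool],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # toggle name -> (value, time the value was last fetched)
        self._cache: Dict[str, Tuple[bool, float]] = {}

    def is_enabled(self, name: str, default: bool = False) -> bool:
        cached = self._cache.get(name)
        now = self._clock()
        # Fresh entry: answer locally without touching the backing store.
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]
        try:
            value = self._fetch(name)
        except Exception:
            # Backing store unavailable: serve the stale value if one exists,
            # otherwise the caller's default, instead of failing the request.
            return cached[0] if cached is not None else default
        self._cache[name] = (value, now)
        return value
```

The design choice in this sketch is that a storage failure degrades to slightly stale toggle values rather than to a timeout on the authentication path.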

The above mitigations are part of a broader set of mitigations that we are implementing to help avoid additional outages like those occurring since August 25. Mitigations will be rolled out to all customers by October 23.

The SecurID team apologizes for this incident and acknowledges the disruption that outages like this can cause. We are making every effort to avoid outages like this in the future.

Thank you,
The SecurID Team

Posted Oct 18, 2021 - 21:34 UTC

Resolved

We have confirmed that the outage reported by our hosting provider has been resolved and that all SecurID authentication and administration services are functional.
We apologize for the disruption caused by this incident and will work with our hosting provider to establish the root cause and to determine what mitigations they are taking as a result. We will also review the incident for any mitigations we may be able to put in place to prevent similar issues in the future.
Thank you.

Posted Sep 23, 2021 - 20:56 UTC

Monitoring

SecurID authentication and administration services have recovered and are now available.

We will continue to monitor our services and the storage outage impacting our hosting provider. We will provide updates as they become available.

Thank you

Posted Sep 23, 2021 - 18:46 UTC

Update

We continue to investigate issues affecting our authentication and administration services. These issues are related to a storage outage impacting our hosting provider. We are working with our provider to resolve this issue as quickly as possible.

We will update you again within 30 minutes.

Posted Sep 23, 2021 - 18:10 UTC

Update

We continue to investigate a potential issue with the identity router health check service and Administration Console service.
We are also investigating degraded authentication services. We will provide an additional update within 30 minutes.

Posted Sep 23, 2021 - 17:40 UTC

Update

We are continuing to investigate this issue.

Posted Sep 23, 2021 - 17:13 UTC

Investigating

We are currently investigating a potential issue with the identity router health check service and Administration Console service.
Authentication services are not impacted. We will provide an additional update within 30 minutes.

Thank you,
The SecurID Team

Posted Sep 23, 2021 - 17:12 UTC

This incident affected: NA (na.access Administration Console, na.access Authentication Service, na2.access Administration Console, na2.access Authentication Service, na3.access Administration Console, na3.access Authentication Service).