SecurID Service Incident (NA Region) - Degraded Authentications

Incident Report for RSA ID Plus

Postmortem

On May 16, 2022, authentication services hosted in the North American region were degraded for 2 hours and 45 minutes (from 8:00 AM ET to 10:45 AM ET). Customers who relied on our web-based authentication UI were impacted, while other authentication methods, such as RADIUS, API, or agent-initiated authentication, continued to function. The incident was caused by a defect in our April release, which went live to customers in North America on May 14. In the April release, we made updates to the infrastructure that handles authentication session data. A defect in the way connections to these components are managed caused the session management infrastructure to become overwhelmed and degraded, which in turn caused excessive delays in web-based authentications.

During the incident, SecurID SaaS Operations rolled customers back to the March release.

Secondary contributing factors in the outage:

During this incident, our session management infrastructure experienced multiple high resource utilization events.
SecurID monitors, both external and internal, did not detect this event. SecurID external health checks for web-based authentications were not correctly validating the session management infrastructure. As a result of this gap in health check coverage, SecurID’s SLA and status pages did not register the event correctly.
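
To illustrate the health check gap described above, here is a minimal sketch of a check that exercises the session store directly, so that a degraded store fails the check instead of passing unnoticed. It is not the SecurID implementation: `InMemorySessionStore`, `check_web_auth_health`, and the `max_latency_s` threshold are hypothetical names and values chosen for this example.

```python
import time
import uuid


class InMemorySessionStore:
    """Hypothetical stand-in for the session management infrastructure."""

    def __init__(self, healthy=True):
        self._data = {}
        self._healthy = healthy

    def put(self, key, value):
        if not self._healthy:
            raise ConnectionError("session store unavailable")
        self._data[key] = value

    def get(self, key):
        if not self._healthy:
            raise ConnectionError("session store unavailable")
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


def check_web_auth_health(store, max_latency_s=1.0):
    """Return (ok, reason) after a write/read round trip through the store.

    The probe fails if the store raises, returns the wrong value, or takes
    longer than max_latency_s (an assumed threshold, not a documented one).
    """
    probe_key = f"healthcheck-{uuid.uuid4()}"
    started = time.monotonic()
    try:
        store.put(probe_key, "probe")
        value = store.get(probe_key)
        store.delete(probe_key)
    except Exception as exc:
        return False, f"session store error: {exc}"
    elapsed = time.monotonic() - started
    if value != "probe":
        return False, "session store returned unexpected data"
    if elapsed > max_latency_s:
        return False, f"session round trip too slow: {elapsed:.2f}s"
    return True, "ok"
```

A monitor built this way would report a failure when called as `check_web_auth_health(InMemorySessionStore(healthy=False))`, whereas a check that only loads the login page could still report success.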

The following mitigations have already been implemented:

Additional resources (CPU, memory) have been allocated to our session management infrastructure to address any potential future high utilization events.
Additional internal health checks have been added to monitor for conditions that would impact web-based authentications.
SecurID has made a network update to our session management infrastructure, reducing connection paths and removing connection limits to the components it serves.
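
The report does not describe the connection defect in detail, so the following is only a general sketch of how a hard limit on connections can turn a burst of requests into delays and timeouts. `BoundedPool`, `simulate`, and all parameter values are hypothetical and do not reflect the actual SecurID components.

```python
import queue
import threading
import time


class BoundedPool:
    """Hypothetical connection pool with a hard cap on connections."""

    def __init__(self, max_connections):
        self._slots = queue.Queue()
        for i in range(max_connections):
            self._slots.put(f"conn-{i}")

    def acquire(self, timeout_s):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        return self._slots.get(timeout=timeout_s)

    def release(self, conn):
        self._slots.put(conn)


def simulate(pool, requests, work_s, timeout_s):
    """Run `requests` concurrent callers and return (completed, timed_out)."""
    results = {"completed": 0, "timed_out": 0}
    lock = threading.Lock()

    def caller():
        try:
            conn = pool.acquire(timeout_s)
        except queue.Empty:
            with lock:
                results["timed_out"] += 1
            return
        try:
            time.sleep(work_s)  # stands in for a session read or write
        finally:
            pool.release(conn)
        with lock:
            results["completed"] += 1

    threads = [threading.Thread(target=caller) for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["completed"], results["timed_out"]
```

Calling `simulate(BoundedPool(2), requests=8, work_s=0.2, timeout_s=0.05)` leaves most callers waiting until they time out, while the same load against `BoundedPool(8)` completes without timeouts.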

The following mitigations are being implemented for our next release:

Updating the connection pool configuration between our Cloud-based systems and the session management infrastructure.
Moving SLA data from http://sla.securid.com to https://status.securid.com.
- The tooling behind our primary SLA page does not allow us to manage incidents that are not triggered from our health monitors. This may cause inaccuracies in SLA reporting under certain circumstances.
- This functionality is being moved to our primary status page to provide better visibility for our customers for these types of events.
Adding performance tests constructed specifically to target the cause of this outage.
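
As a rough idea of what a targeted performance test could look like, the sketch below drives an operation from many concurrent workers and reports the 95th percentile latency. It is an assumption-based example rather than the test suite referred to above: `run_load_test`, `p95`, and `fake_session_lookup` are hypothetical, and a real test would call the actual session store instead of sleeping.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(operation, concurrency, total_requests):
    """Call `operation` from `concurrency` workers; return latencies in seconds."""

    def timed_call(_):
        started = time.monotonic()
        operation()
        return time.monotonic() - started

    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        return list(executor.map(timed_call, range(total_requests)))


def p95(latencies):
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(latencies, n=20)[-1]


def fake_session_lookup():
    """Hypothetical stand-in for a session store round trip."""
    time.sleep(0.01)
```

A test could then assert that `p95(run_load_test(fake_session_lookup, concurrency=50, total_requests=500))` stays under an agreed latency budget, and fail the release if it does not.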

Additional mitigations are also being investigated:

An updated public status health check that accounts for all services involved in web-based authentications.
Internal health checks for additional Azure managed services.

The SecurID team would like to apologize for the inconvenience caused by this outage. We understand the disruption that incidents like this cause and are taking the necessary steps to help avoid similar incidents in the future.

Thank you, The SecurID Team

Posted May 26, 2022 - 20:57 UTC

Resolved

After monitoring the fix, SaaS Operations has determined that the incident affecting SecurID has been resolved.

We will post a root cause analysis as soon as it is available.

Posted May 16, 2022 - 15:26 UTC

Monitoring

The issue affecting SecurID has been corrected. The SaaS Operations team is monitoring the fix.

We will post a root cause analysis as soon as it is available.

Posted May 16, 2022 - 14:52 UTC

Identified

SaaS Operations has identified the cause of the issue and is working to implement a fix.

Posted May 16, 2022 - 12:00 UTC

This incident affected: NA (na.access Authentication Service, na2.access Authentication Service, na3.access Authentication Service, na4.access Authentication Service).