SecurID Service Incident (NA Region)
Incident Report for RSA ID Plus
Postmortem

PRELIMINARY RCA 

On September 5th, 2023, between 13:47 and 14:09 UTC, customers in our NA region experienced a disruption of the Authentication and Administration services. This was followed by a period, lasting until 14:41 UTC, during which customers may have experienced degraded service, depending on their DNS caching configurations.

This incident was triggered by failures in some nodes within our Web Application Firewall (WAF) cluster, which led to performance degradation and resource exhaustion. Traffic on the impacted nodes slowed down and eventually failed. As part of our restoration process, we temporarily reverted to a known-good configuration, which caused some customers to briefly encounter an expired SSL certificate. Subsequently, the cluster was fully restored to a healthy state.

To minimize downtime, we initiated a failover to our secondary site at 14:09 UTC, restoring Authentication and Administration services there. Meanwhile, our Operations team continued to mitigate the incident at the primary site. By 14:41 UTC, the mitigation was complete and traffic was restored to our primary site.

In response to this incident, RSA is actively enhancing the ID Plus service and related processes. Our steps include:

  • Ongoing evaluation of best-in-class technology for third-party components. We are already in the process of replacing our WAF solution (currently targeted for completion in January 2024 or earlier).
  • Implementing additional WAF performance and stability improvements in September (planned prior to the incident).
  • Collaborating with vendors to conduct a comprehensive Root Cause Analysis (RCA) of the WAF failure and implementing additional mitigations.
  • Encouraging customers to ensure both primary and secondary regions are reachable from on-premises infrastructure, with an enhanced validation feature already available in the August Identity Router (IDR) release.
  • Enhancing failover capabilities in the next Identity Router release, enabling more rapid switchover regardless of DNS caching configurations. 
  • Continuing to review ID Plus service logs and customer logs for potential additional mitigations to be included in the final RCA.
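
For customers who want to verify the reachability recommendation above from their own network, a minimal sketch of a TCP connectivity check follows. It is an illustration only, not the validation feature shipped with the Identity Router. The function names and the hostnames in the usage note are hypothetical placeholders; substitute the primary and secondary endpoints that apply to your deployment.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Endpoint = Tuple[str, int]


def tcp_connect(host: str, port: int, timeout: float = 5.0) -> None:
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def check_reachability(
    endpoints: Iterable[Endpoint],
    connect: Callable[[str, int], None] = tcp_connect,
) -> Dict[Endpoint, bool]:
    """Return a mapping of (host, port) -> True if a connection succeeded.

    DNS resolution failures, timeouts, and refused connections are all
    subclasses of OSError, so each one is reported as unreachable.
    """
    results: Dict[Endpoint, bool] = {}
    for host, port in endpoints:
        try:
            connect(host, port)
            results[(host, port)] = True
        except OSError:
            results[(host, port)] = False
    return results
```

As a usage example, calling `check_reachability([("primary.example.com", 443), ("secondary.example.com", 443)])` from an on-premises host reports which of the two placeholder endpoints accepted a connection. A successful TCP connection shows only that the network path is open; it does not confirm that the service behind it is healthy.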
Posted Sep 08, 2023 - 20:20 UTC

Resolved
After monitoring the fix, SaaS Operations has determined that the incident affecting SecurID has been resolved.

We will post a root cause analysis as soon as it is available.
Posted Sep 05, 2023 - 15:53 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 05, 2023 - 14:56 UTC
Monitoring
The issue affecting SecurID has been corrected. The SaaS Operations team is monitoring the fix.

We will post a root cause analysis as soon as it is available.
Posted Sep 05, 2023 - 14:56 UTC
Investigating
We have detected an issue affecting SecurID.
SaaS Operations is investigating the issue and will post updates as they become available.
Posted Sep 05, 2023 - 13:41 UTC
This incident affected: NA (na.access Administration Console, na.access Authentication Service, na2.access Administration Console, na2.access Authentication Service, na3.access Administration Console, na3.access Authentication Service, na4.access Administration Console, na4.access Authentication Service).