PRELIMINARY RCA
On September 5, 2023, between 13:47 and 14:09 UTC, customers in our NA region experienced an Authentication and Administration Service disruption. This was followed by a period, lasting until 14:41 UTC, during which customers may have experienced degraded service, depending on their DNS caching configurations.
This incident was triggered by failures in some nodes within our Web Application Firewall (WAF) cluster, which led to performance degradation and resource exhaustion. Traffic on the impacted nodes slowed and eventually failed. As part of our restoration process, we temporarily reverted to a known-good configuration, which caused some customers to briefly encounter an expired SSL certificate. The cluster was subsequently restored to a fully healthy state.
To minimize downtime, we initiated a failover to our secondary site at 14:09 UTC, restoring Authentication and Administration services there. Meanwhile, our Operations team continued to mitigate the incident at the primary site. By 14:41 UTC, the mitigation was complete and traffic was restored to our primary site.
In response to this incident, RSA is actively enhancing the ID Plus service and related processes. Our steps include: