PRELIMINARY RCA
Summary:
On May 7, 2024, between approximately 12:45 and 14:20 UTC, a service degradation impacted a subset of our customers in the North American region hosted on our NA2 authentication components. The degradation primarily affected browser-based authentication workflows; workflows that did not rely on the hosted Authentication UI experienced minimal disruption.
Root Cause Analysis:
The incident stemmed from a failure in our front-end service tier, which caused requests to be incorrectly directed to a degraded node. The issue was compounded by inconsistent results from internal health checks, which returned the node to service prematurely, before it had fully recovered. As a result, the node processed incoming requests too slowly, causing authentication service timeouts.
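To illustrate the premature-return failure mode described above, the following is a minimal sketch of a health gate with hysteresis: a node is removed after a few consecutive failed checks but is returned to service only after a longer run of consecutive passes, so a single passing check amid inconsistent results does not restore it. This is not the actual ID Plus implementation; the class name, method name, and thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthGate:
    """Tracks one node's in-service state with hysteresis.

    Hypothetical sketch: names and thresholds are assumptions,
    not taken from the production system.
    """
    failures_to_eject: int = 2      # consecutive failed checks before removal
    successes_to_restore: int = 5   # consecutive passed checks before return
    in_service: bool = True
    _fail_streak: int = 0
    _ok_streak: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health-check result and return the in-service state."""
        if check_passed:
            self._ok_streak += 1
            self._fail_streak = 0
            # Only restore after an unbroken run of passing checks.
            if not self.in_service and self._ok_streak >= self.successes_to_restore:
                self.in_service = True
        else:
            self._fail_streak += 1
            self._ok_streak = 0
            if self.in_service and self._fail_streak >= self.failures_to_eject:
                self.in_service = False
        return self.in_service


if __name__ == "__main__":
    gate = HealthGate()
    # Two failures eject the node; an isolated pass followed by a failure
    # does not restore it; five consecutive passes do.
    checks = [False, False, True, False, True, True, True, True, True]
    print([gate.record(c) for c in checks])
```

The asymmetry between the two thresholds is the design choice of interest: ejecting quickly limits exposure to a slow node, while restoring slowly guards against flapping health results.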
Mitigations:
In response to this incident, RSA is actively enhancing the ID Plus service and related processes with the following measures:
Upgrade of SSL Library: Addressing an edge case performance flaw in a specific SSL library, which was identified as the ultimate root cause of the node failure.
Enhanced Monitoring and Alerting: Implementation of advanced monitoring and alerting systems to promptly detect and mitigate performance degradation in front-end clustering.
Incident Response Enhancement: Revision and enhancement of incident response procedures to incorporate specific protocols for managing failures in front-end clustering and traffic misrouting.
These proactive steps aim to fortify our systems and processes, ensuring improved resilience and reliability in service delivery.
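As one illustration of the monitoring measure listed above, the sketch below flags a node as degraded when the 95th-percentile latency over a sliding window of recent requests exceeds a threshold. It is a simplified example under stated assumptions, not a description of the monitoring actually deployed; the class name, window size, sample minimum, and 500 ms threshold are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class LatencyMonitor:
    """Sliding-window p95 latency check for a single node (hypothetical sketch)."""

    def __init__(self, window: int = 100, threshold_ms: float = 500.0,
                 min_samples: int = 20) -> None:
        # deque(maxlen=...) discards the oldest sample once the window is full.
        self.samples: Deque[float] = deque(maxlen=window)
        self.threshold_ms = threshold_ms
        self.min_samples = min_samples

    def observe(self, latency_ms: float) -> None:
        """Record the latency of one completed request."""
        self.samples.append(latency_ms)

    def p95(self) -> Optional[float]:
        """Return the nearest-rank 95th percentile, or None if too few samples."""
        n = len(self.samples)
        if n < self.min_samples:
            return None
        ordered = sorted(self.samples)
        # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
        rank = (95 * n + 99) // 100
        return ordered[rank - 1]

    def is_degraded(self) -> bool:
        """True when enough samples exist and p95 exceeds the threshold."""
        value = self.p95()
        return value is not None and value > self.threshold_ms


if __name__ == "__main__":
    monitor = LatencyMonitor(window=20, min_samples=20)
    for _ in range(20):
        monitor.observe(100.0)
    print(monitor.is_degraded())  # healthy window
    for _ in range(2):
        monitor.observe(900.0)
    print(monitor.is_degraded())  # two slow requests now sit in the tail
```

A percentile is used rather than a mean so that a node which is slow for only a fraction of its requests is still detected, which matches the timeout symptoms described in this report.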