Summary
On 04/01/2026, RSA ID Plus experienced a service outage in the ANZ region, impacting authentication workflows and related services.
The cause was a non-optimal query execution pattern within one of the service data tiers. It degraded performance at the datastore layer, driving elevated resource utilization and increased latency, which resulted in an authentication service outage.
Service was restored through a combination of capacity adjustments and regional failover, and stability has been maintained since recovery.
Preliminary Root Cause
The incident was attributed to a query optimizer behavior that resulted in a suboptimal execution plan within the service data tier under specific runtime conditions.
Under these conditions, the optimizer selected an inefficient query execution strategy, which led to elevated resource utilization and contention within the data tier.
These conditions drove increased latency and service instability within authentication workflows.
The behavior is consistent with a query plan regression scenario, where the optimizer generates a plan that is not optimal for the current data distribution or workload characteristics.
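To illustrate what a query plan regression looks like from a monitoring point of view, here is a minimal sketch that compares recent execution times for one query shape against a historical baseline. It is not the detection logic used by RSA ID Plus or its data service provider. The function name, the 3x ratio threshold, and the minimum sample count are all assumptions chosen for illustration.

```python
from statistics import median


def is_plan_regression(baseline_ms, recent_ms, ratio_threshold=3.0, min_samples=5):
    """Flag a likely plan regression for a single query shape.

    baseline_ms and recent_ms are lists of execution times in milliseconds.
    The threshold and sample count are hypothetical defaults, not values
    taken from the incident.
    """
    # Too few samples on either side: do not raise a signal.
    if len(baseline_ms) < min_samples or len(recent_ms) < min_samples:
        return False

    baseline_median = median(baseline_ms)
    if baseline_median <= 0:
        return False

    # A sustained jump in median latency for the same query, with no change
    # to the query itself, is the typical signature of a plan regression.
    return median(recent_ms) / baseline_median >= ratio_threshold
```

Medians are used rather than means so that a single slow execution does not trigger the check. A production safeguard would also need to account for changes in data volume and workload mix, which this sketch ignores.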
Recovery
During the incident, the team followed a controlled recovery approach aligned with the platform’s resilience design.
RSA ID Plus is architected with strong in-region resilience, including redundancy and scaling capabilities across service layers, as well as a warm secondary region available for failover. In line with this design, the initial response focused on stabilizing the primary region by addressing resource contention within the data tier.
Failover to the secondary region was available as a recovery option. Based on real-time impact assessment and recovery progress, it was determined that in-region remediation would not restore service within acceptable thresholds. A controlled failover to the secondary region was then initiated and completed successfully.
Following stabilization, corrective changes were implemented within the primary region to address the underlying query optimization behavior. Once these changes were validated, traffic was safely transitioned back to the primary region in a controlled manner.
All systems have remained stable under continued monitoring since recovery.
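The failover decision described in this section, in-region remediation first and failover once it is clear that remediation will not restore service within acceptable thresholds, can be sketched as a simple rule. This is an illustrative sketch only. The `RecoveryAssessment` type, the `should_fail_over` function, and the 30-minute threshold are hypothetical and do not reflect RSA's actual runbook or thresholds.

```python
from dataclasses import dataclass


@dataclass
class RecoveryAssessment:
    """Point-in-time view of the incident (all fields are hypothetical)."""
    elapsed_minutes: float              # time since impact began
    estimated_in_region_minutes: float  # projected time to fix in the primary region
    secondary_healthy: bool             # whether the warm secondary can take traffic


def should_fail_over(assessment, max_outage_minutes=30.0):
    """Return True when failover is preferable to continued in-region remediation."""
    # Never fail over to a secondary region that is not ready to serve traffic.
    if not assessment.secondary_healthy:
        return False

    # Fail over once the projected total outage exceeds the acceptable threshold.
    projected_total = assessment.elapsed_minutes + assessment.estimated_in_region_minutes
    return projected_total > max_outage_minutes
```

Re-evaluating a rule like this as the assessment changes keeps the decision tied to measured recovery progress rather than to a fixed timer.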
Mitigation and Resolution
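Service was restored through capacity adjustments in the primary region's data tier and a controlled failover to the secondary region. Corrective changes addressing the query optimization behavior were then applied and validated in the primary region before traffic was transitioned back.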
Preventive Actions
Immediate Actions
Vendor Engagement and Safeguard Review
As part of the ongoing investigation and prevention efforts, RSA has engaged its data service provider to further evaluate the query optimization behavior observed during the incident.
This includes a focused review of existing safeguards and protective mechanisms designed to detect and mitigate suboptimal query execution patterns. While these controls are in place, this event identified conditions under which they did not intervene as expected.
This review is actively in progress, and findings will be incorporated into follow-up corrective actions.