SecurID Service Incident (NA Region)

Incident Report for RSA ID Plus

Postmortem

ROOT CAUSE

The na4.access Cloud Authentication Service was affected by high CPU load and latency on its database cluster. As a result, customers experienced failed authentications between approximately 1:05 UTC and 1:25 UTC. RSA Engineering has determined that during this period the database optimizer was using an inefficient query plan for an essential authentication workflow, which caused the high CPU and resource load.

RSA SaaS Operations has dramatically increased the base processing power of the impacted database cluster to mitigate this risk. We have monitored this environment continuously since then and have seen no further signs of excessive database usage.

RECOVERY

RSA is continuously taking steps to improve the RSA SecurID Access service and our processes to help ensure such incidents do not occur in the future. In this case, steps include (but are not limited to):

Modifying the database query to make it less likely that the database optimizer selects a sub-optimal query plan.
Dramatically increasing the base processing power of the database to mitigate this risk.
Adding resource health monitoring to allow earlier detection of excess database load conditions.
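For illustration only, the earlier-detection idea in the last step can be sketched as a check that fires when database CPU stays above a threshold for several consecutive samples. This report does not describe RSA's actual monitoring; the class name, the 80% threshold, the three-sample window, and the sample readings below are all hypothetical.

```python
from collections import deque


class SustainedLoadDetector:
    """Flags when every sample in a sliding window exceeds a threshold.

    Hypothetical sketch: the threshold and window size are assumptions,
    not values taken from the incident report.
    """

    def __init__(self, threshold_pct: float = 80.0, window: int = 3):
        self.threshold_pct = threshold_pct
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def observe(self, cpu_pct: float) -> bool:
        """Record one CPU sample; return True if load is sustained."""
        self.samples.append(cpu_pct)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold_pct for s in self.samples)


detector = SustainedLoadDetector(threshold_pct=80.0, window=3)
readings = [40.0, 85.0, 90.0, 95.0, 60.0]  # made-up CPU percentages
alerts = [detector.observe(r) for r in readings]
print(alerts)
```

Requiring a full window of high samples, rather than alerting on a single spike, trades a short detection delay for fewer false alarms.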

Posted Jan 13, 2023 - 21:41 UTC

Resolved

An incident on January 5th, 2023, degraded authentication services for a subset of customers hosted in the NA region from 13:10 UTC to 13:36 UTC. Most impacted customers experienced increased latency during authentications, intermittent authentication failures, or both.

We will post a root cause analysis as soon as it is available.

Posted Jan 05, 2023 - 13:00 UTC