ROOT CAUSE
The na4.access Cloud Authentication Service was affected by high CPU load and latency on its database cluster. As a result, customers experienced failed authentications from approximately 1:05 UTC to 1:25 UTC. RSA Engineering has determined that during this period the database optimizer was using an inefficient query plan for an essential authentication workflow, which caused the high CPU and resource load.
RSA SaaS Operations has dramatically increased the base processing power of the impacted database cluster to mitigate this risk. We have monitored the environment continuously since the change, and there have been no further signs of excessive database usage.
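As a general illustration of what it means for an optimizer to choose one query plan over another, the sketch below asks a database which plan it would use for the same lookup before and after an index is available. It is only an analogy: it uses SQLite from the Python standard library, and the table, column, and index names are hypothetical. The database engine and the query involved in this incident are not named in this report.

```python
import sqlite3

def plan_for(conn, sql, params=()):
    """Return the optimizer's chosen plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); `detail` describes one plan step.
    return " | ".join(row[3] for row in rows)

# Hypothetical schema standing in for an authentication lookup table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auth_sessions (id INTEGER PRIMARY KEY, user_id TEXT, token TEXT)"
)
conn.executemany(
    "INSERT INTO auth_sessions (user_id, token) VALUES (?, ?)",
    [(f"user{i}", f"tok{i}") for i in range(1000)],
)

query = "SELECT token FROM auth_sessions WHERE user_id = ?"

# Without a usable index the optimizer has to scan the whole table.
plan_before = plan_for(conn, query, ("user42",))

# With an index on user_id the optimizer can switch to an index search.
conn.execute("CREATE INDEX idx_auth_sessions_user ON auth_sessions (user_id)")
plan_after = plan_for(conn, query, ("user42",))

print("before:", plan_before)
print("after: ", plan_after)
```

A full scan on a frequently executed query costs far more CPU per call than an index search, which is the general mechanism by which an inefficient plan can saturate a database cluster.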
RECOVERY
RSA is continuously taking steps to improve the RSA SecurID Access service and our processes to help ensure such incidents do not occur in the future. In this case, steps include (but are not limited to):
Modifying the database query to make it less likely that the database optimizer selects a sub-optimal query plan.
Dramatically increasing the base processing power of the database to mitigate this risk.
Adding resource health monitoring to allow earlier detection of excess database load conditions.
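The monitoring step above can be pictured with a minimal sketch of a sustained-load check. It is not RSA's monitoring implementation: the class name, the threshold, and the window size are all assumptions chosen for illustration. The idea is to alert only when CPU utilization stays above a threshold for several consecutive samples, so that a brief spike does not raise an alarm but a sustained overload is caught early.

```python
from collections import deque

class SustainedLoadDetector:
    """Hypothetical detector: flags load that stays high for a full window."""

    def __init__(self, threshold_pct=85.0, window=5):
        self.threshold_pct = threshold_pct  # assumed alert threshold, in percent
        self.window = window                # assumed number of consecutive samples
        self.samples = deque(maxlen=window) # keeps only the most recent samples

    def observe(self, cpu_pct):
        """Record one CPU sample; return True if an alert should fire."""
        self.samples.append(cpu_pct)
        window_full = len(self.samples) == self.window
        return window_full and all(s >= self.threshold_pct for s in self.samples)

# Example: threshold 80%, alert after 3 consecutive high samples.
detector = SustainedLoadDetector(threshold_pct=80.0, window=3)
readings = [50.0, 90.0, 95.0, 92.0, 60.0]
alerts = [detector.observe(r) for r in readings]
print(alerts)
```

With these example readings, the alert fires only on the fourth sample, once three consecutive readings are at or above the threshold, and clears again when the load drops.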