An incident on December 5th, 2022, caused degraded authentication services for a subset of customers hosted in the NA region from 12:55 UTC to 16:33 UTC. Most impacted customers experienced increased latency during authentications and/or intermittent authentication failures. The issue was caused by a product defect that increased processing times on one of our backend nodes. Once the issue was identified, it was remediated by replacing the impacted node with a healthy one.
This issue was caused by a flaw in our High Availability system and impacted only a single backend node on the na3.access Authentication Service. The node gradually experienced increasing processing times, resulting in intermittent failures and timeouts. On the impacted node, an error condition caused a fallback to a temporary file-based data storage system. When the node attempted to return to normal operations, a flaw in the recovery logic caused it to unexpectedly continue queuing data in the file-based system beyond normal operating limits, which eventually degraded processing times on the node. Authentication attempts serviced by other backend nodes were processed in a timely manner during the same period.
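To illustrate how a recovery flaw of this shape can arise, here is a minimal sketch in Python. All names and logic are hypothetical (the actual replication system is not public); the sketch only shows the failure mode described above: once a fallback queue is in use, a flawed recovery path marks the primary healthy but new writes keep landing in the queue, so it grows without bound.

```python
from collections import deque

class ReplicationBuffer:
    """Hypothetical sketch of primary replication with a file-style fallback queue."""

    def __init__(self):
        self.primary_healthy = True
        self.fallback_queue = deque()  # stands in for the temporary file-based store
        self.replicated = []

    def write(self, record):
        # Flawed logic: once anything sits in the fallback queue, new records
        # continue to be appended there even after the primary has recovered,
        # so the queue keeps growing beyond normal operating limits.
        if not self.primary_healthy or self.fallback_queue:
            self.fallback_queue.append(record)
        else:
            self.replicated.append(record)

    def recover(self):
        # The recovery flaw: the primary is marked healthy, but the
        # fallback queue is never drained back into the primary store.
        self.primary_healthy = True

buf = ReplicationBuffer()
buf.write("a")                 # normal operation: goes to the primary
buf.primary_healthy = False    # error condition triggers the fallback
buf.write("b")                 # queued in the file-based store
buf.recover()                  # "recovery" runs...
buf.write("c")                 # ...but new writes still land in the queue
```

A correct recovery path would drain `fallback_queue` into the primary before (or atomically with) flipping `primary_healthy`, so the queue length returns to zero after recovery.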
Because this was only a modest overall slowdown and increase in error rates for the na3.access Authentication Service, it did not meet our criteria for a disaster recovery procedure or for real-time reporting on our status page. Further, the problematic node appeared to our monitoring systems to be repeatedly recovering back to a healthy state, since many authentications were still being processed successfully. As a result, the monitoring system did not trigger our automated procedure to take the node out of service.
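The monitoring blind spot described above is a common one: automated removal typically requires a run of consecutive failed health checks, and a degraded-but-flapping node keeps resetting that counter. A minimal sketch (the threshold and check logic are assumptions for illustration, not the actual monitoring configuration):

```python
# Hypothetical sketch: a node is removed only after N consecutive failed
# health checks. A node that intermittently succeeds never crosses the
# threshold, so it is never taken out of service.
REMOVAL_THRESHOLD = 3  # assumed value for illustration

def should_remove(check_results, threshold=REMOVAL_THRESHOLD):
    """Return True if the node ever fails `threshold` checks in a row."""
    consecutive_failures = 0
    for healthy in check_results:
        if healthy:
            consecutive_failures = 0  # a single success resets the counter
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return True
    return False

# A degraded node that still serves many requests: 6 of 9 checks fail,
# but never 3 in a row, so the removal logic never fires.
flapping = [False, False, True, False, False, True, False, False, True]
```

Mitigations such as a sliding-window failure rate (e.g. "more than X% failures over the last N checks") detect this pattern where a consecutive-failure counter cannot.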
SecurID is continuously taking steps to improve the SecurID Access service and processes to help ensure such incidents do not occur in the future. This includes, but is not limited to:
- Addressing the defect in our replication logic and our replication error-handling logic.
- Further enhancing error handling so that data that repeatedly fails to replicate is directed to a separate queue for manual intervention.
- Implementing more sensitive detection of node-level degradation, so that any node experiencing a problem is taken out of service rapidly.
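The second remediation above is commonly known as a dead-letter queue: after a bounded number of retries, a failing record is set aside for manual review instead of being retried forever on the hot path. A minimal sketch, with all names and the retry count chosen for illustration rather than taken from the actual system:

```python
def replicate_with_dlq(records, replicate, max_attempts=3):
    """Hypothetical sketch: try each record up to `max_attempts` times;
    records that still fail go to a dead-letter queue for manual
    intervention rather than blocking or growing the main queue."""
    dead_letter = []
    for record in records:
        for _attempt in range(max_attempts):
            try:
                replicate(record)
                break  # replicated successfully; move on
            except Exception:
                continue  # transient failure: retry
        else:
            dead_letter.append(record)  # retries exhausted
    return dead_letter

# Example: a replicate function that always rejects one poison record.
def fake_replicate(record):
    if record == "bad":
        raise RuntimeError("replication rejected")

leftover = replicate_with_dlq(["ok-1", "bad", "ok-2"], fake_replicate)
```

The key property is that one persistently failing record can no longer cause unbounded queue growth: it is moved aside after `max_attempts`, and healthy records continue to flow.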