RSA SecurID Access Service Incident (NA Region)
Incident Report for RSA ID Plus
Postmortem

On September 12, 2021, a misconfiguration in our Logstash log collection and parsing service resulted in a sustained period of unexpectedly high CPU usage on our na2.access authentication servers after routine OS patching. This issue, combined with the expected short-term load surge of establishing new SSL connections as traffic was rerouted from servers being taken offline for OS patching, overloaded the remaining servers. Overall system performance was severely degraded and only a small number of authentications were being processed. Only customers on na2.access were impacted by this issue.

The SecurID team stabilized the system by rolling back affected customers to our July release. The outage was not caused by code in our August release, but rolling back to our July release moved these affected customers to servers that were not impacted by the heavy utilization issues that resulted from the misconfiguration in Logstash.

We sincerely apologize for any inconvenience caused by this issue and are investigating the following possible mitigations to avoid similar situations in the future:

· Increase the number of servers. This will distribute traffic across more servers, so short-term spikes from rerouting traffic will be smaller and shared among more servers during operations that take a server offline.
· Reconfigure Logstash to prevent position data corruption.
· Lower the priority of the Logstash processes.
· Improve internal network handling to avoid health status changes in nodes that are still operational but operating extremely slowly due to high CPU consumption.
· Use more efficient SSL handshake technology to put less load on the CPUs.
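
The effect of the first mitigation can be illustrated with simple arithmetic. The following is a minimal sketch, assuming traffic is balanced perfectly evenly across servers; the function name and the fleet sizes are hypothetical and do not reflect the actual na2.access deployment. When some servers are taken offline, each remaining server carries total / (total - offline) times its usual share, so a larger fleet sees a smaller spike.

```python
def load_multiplier(total_servers: int, offline: int) -> float:
    """Factor by which each remaining server's load grows when `offline`
    of `total_servers` are removed, assuming perfectly even balancing."""
    if not 0 <= offline < total_servers:
        raise ValueError("offline must be >= 0 and less than total_servers")
    return total_servers / (total_servers - offline)


# Hypothetical fleet sizes, each with one server offline for patching.
for fleet_size in (4, 8, 16):
    print(fleet_size, round(load_multiplier(fleet_size, 1), 3))
# prints:
# 4 1.333
# 8 1.143
# 16 1.067
```

This models only the steady-state share of traffic. The additional short-term cost of establishing new SSL connections on the remaining servers is not included.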

Updates for this and all other incidents and maintenance windows are available on status.securid.com.

Thank you,
The SecurID Team

Posted Sep 17, 2021 - 19:06 UTC

Resolved
Authentication services have been restored for customers impacted by today’s outage on na2.access.securid.com. Our SaaS Operations Team has completed the rollback of impacted customers to our July 2021 release.

We continue to investigate the cause of today’s issue. Customers on na2.access.securid.com will be upgraded to the August 2021 release when we have completed our investigation and have a fix in place.

We apologize for any inconvenience this has caused and will post a root cause analysis as soon as it is available.

Thank you,
The SecurID Team
Posted Sep 12, 2021 - 16:32 UTC
Update
We continue to investigate authentication issues on na2.access.securid.com. The SaaS Operations Team is rolling back our August 2021 release and restoring customers on na2.access.securid.com to our July 2021 release.

As a result of this rollback, your administrators will notice that recent user interface enhancements and identity router status enhancements are removed from the system. These enhancements will be restored once we have completed our investigation and have a fix in place.

We will provide updates as more information becomes available.

Thank you,
The SecurID Team
Posted Sep 12, 2021 - 15:10 UTC
Update
The SecurID SaaS Operations Team continues to investigate this issue. We will provide an update as soon as more information is available.

Thank you,
The SecurID Team
Posted Sep 12, 2021 - 14:08 UTC
Investigating
RSA has detected an issue affecting RSA SecurID Access.
RSA SaaS Operations is investigating the issue and will post updates as they become available.
Posted Sep 12, 2021 - 13:22 UTC
This incident affected: NA (na2.access Authentication Service).