On September 12, 2021, a misconfiguration in our Logstash log collection and parsing service caused a sustained period of unexpectedly high CPU usage on our na2.access authentication servers after routine OS patching. This issue, combined with the expected short-term load surge from establishing new SSL connections as traffic was rerouted away from servers being taken offline for patching, overloaded the remaining servers. Overall system performance was severely degraded, and only a small number of authentications were processed. Only customers on na2.access were impacted by this issue.
The SecurID team stabilized the system by rolling back affected customers to our July release. The outage was not caused by code in our August release; rolling back to the July release moved the affected customers to servers that were not experiencing the high CPU utilization caused by the Logstash misconfiguration.
We sincerely apologize for any inconvenience caused by this issue and are investigating the following possible mitigations to avoid similar situations in the future:
· Increase the number of servers. With traffic distributed across more servers, the short-term spike from rerouting traffic when a server is taken offline is spread more thinly, so each remaining server absorbs a smaller share of it.
· Reconfigure Logstash to prevent position data corruption.
· Lower the priority of the Logstash processes.
· Improve internal network handling to avoid health status changes in nodes that are still operational but responding extremely slowly due to high CPU consumption.
· Use more efficient SSL handshake technology to put less load on the CPUs.
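As a rough illustration of why the first mitigation helps, the sketch below computes how much extra load each remaining server absorbs when servers are taken offline. It assumes traffic is balanced evenly across the remaining servers, and the fleet sizes are hypothetical examples, not our actual server counts.

```python
def load_multiplier(total_servers: int, offline: int) -> float:
    """Factor by which per-server load rises when `offline` servers are drained.

    Assumes traffic is spread evenly across the servers that remain online.
    """
    if not 0 <= offline < total_servers:
        raise ValueError("offline must be between 0 and total_servers - 1")
    return total_servers / (total_servers - offline)


# Hypothetical fleet sizes, each with one server offline for patching.
for total in (4, 8, 16):
    extra_percent = (load_multiplier(total, 1) - 1) * 100
    print(f"{total} servers: +{extra_percent:.1f}% load per remaining server")
```

Under these assumptions, taking one of four servers offline gives each remaining server about a third more traffic, and the increase shrinks as the fleet grows.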
Updates for this and all other incidents and maintenance windows are available on status.securid.com.
Thank you,
The SecurID Team