RSA SecurID Access Service Incident (NA, EMEA and ANZ Regions)
Incident Report for RSA ID Plus
Postmortem

Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in the Azure DNS service that reduced the efficiency of its DNS Edge caches. As the Azure DNS service became overloaded, DNS clients began retrying their requests frequently, which further increased the workload on the service. Because client retries are considered legitimate DNS traffic, Azure’s volumetric spike mitigation systems did not drop this traffic. The increased traffic reduced the availability of the Azure DNS service. As a result, RSA SecurID Access transactions that depend on these services timed out.
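
The retry amplification described above is commonly countered on the client side with capped exponential backoff and random jitter, so that clients hitting an overloaded service do not all retry at the same moment. The Python sketch below is illustrative only. It is not taken from the Microsoft RCA and does not describe how Azure DNS, its clients, or RSA SecurID Access behave. The names `backoff_delays` and `resolve_with_backoff`, the `query` callable, and the default timing values are all hypothetical.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=8.0, rng=None):
    """Return one wait time (in seconds) per retry.

    Uses capped exponential backoff with full jitter: the upper bound
    doubles on every retry up to `cap`, and the actual wait is drawn
    uniformly between 0 and that bound.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


def resolve_with_backoff(query, max_attempts=4, sleep=time.sleep,
                         base=0.5, cap=8.0, rng=None):
    """Call `query` until it succeeds or `max_attempts` is reached.

    `query` is a hypothetical zero-argument callable that raises
    TimeoutError when the lookup times out.
    """
    delays = backoff_delays(max_attempts - 1, base=base, cap=cap, rng=rng)
    for attempt in range(max_attempts):
        try:
            return query()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the timeout to the caller
            sleep(delays[attempt])
```

Spreading retries over a randomized, growing interval reduces the chance that many clients add load to a struggling service in lockstep, at the cost of slower recovery for any single request.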

Azure Mitigation: The decrease in service availability triggered our monitoring systems and engaged our engineers. Our DNS services recovered automatically by 22:00 UTC. This recovery time exceeded our design goal, and our engineers prepared additional serving capacity and the ability to answer DNS queries from the volumetric spike mitigation system in case further mitigation steps were needed. The majority of services were fully recovered by 22:30 UTC. Immediately after the incident, we updated the logic on the volumetric spike mitigation system to protect the DNS service from excessive retries.

Recovery / Preventative Steps:

  • Microsoft has published a public RCA that includes (but is not limited to) the following action items ([https://status.azure.com/en-us/status/history/](https://status.azure.com/en-us/status/history)):

    • Repair the code defect so that all requests can be efficiently handled in cache.
    • Improve the automatic detection and mitigation of anomalous traffic patterns.
  • RSA is actively engaging Microsoft and driving incremental improvements through regular technical-level sync meetings.

  • Following this and other recent incidents, RSA has established an executive-level channel with its Microsoft counterparts to regularly review and track progress on the improvement plan above.

  • In parallel, RSA technical staff are working closely and proactively with their technical peers at Microsoft to review improvements and recommendations and to drive the plan to resolution.

RSA would like to apologize for any inconvenience this may have caused. If you have any questions, please do not hesitate to contact us.

RSA SaaS Operations

Posted Apr 20, 2021 - 03:00 UTC

Resolved
After monitoring the fix, RSA SaaS Operations has determined that the incident affecting RSA SecurID Access has been resolved.

RSA will post a root cause analysis as soon as it is available.
Posted Apr 02, 2021 - 00:01 UTC
Monitoring
The RSA SaaS Operations team is monitoring a fix for the issue affecting RSA SecurID Access. We are currently seeing improvement in service availability.
Posted Apr 01, 2021 - 22:49 UTC
Identified
RSA has detected an issue affecting RSA SecurID Access.
RSA SaaS Operations is investigating the issue and will post updates as they become available.
Posted Apr 01, 2021 - 21:46 UTC
This incident affected: EMEA (access-eu Administration Console, access-eu Authentication Service, eu2.access-eu Administration Console, eu2.access-eu Authentication Service), ANZ (access-anz Administration Console, access-anz Authentication Service), and NA (na.access Administration Console, na.access Authentication Service, na2.access Administration Console, na2.access Authentication Service, na3.access Administration Console, na3.access Authentication Service).