Service Incident Notification – RSA SecurID Access - US Region
Incident Report for RSA SecurID Access
Postmortem

Published July 22, 2020

ROOT CAUSE

Decommissioning a legacy DNS solution inadvertently caused some data streams for the Azure DNS recursive resolver service to fall out of sync with the resolver state. A sync pipeline detected the mismatch and triggered a reset of the resolver instances to recover from the stale state. Because the reset was not staggered, multiple resolver instances rebooted at the same time, which degraded the service and caused DNS resolution failures for queries originating from virtual networks.

The impact spanned multiple Azure regions. Some instances of the service saw no impact, but the US region that hosts RSA SecurID Access was affected and took 30 minutes to recover. Across all Azure regions, the DNS resolution issues were fully auto-mitigated within 54 minutes.

RECOVERY

Although only the US region of RSA SecurID Access was impacted, the underlying event was a global Azure outage. We are working with Microsoft to review and track their proposed mitigations:

  • Fix the orchestration logic in the sync pipeline to help ensure that resolver instances are reset in a staggered, partitioned fashion.
  • Improve the resolver startup sequence to help ensure that a resolver instance can be up and running within 10 minutes after a reset.
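
For illustration only, the staggered, partitioned reset named in the first mitigation might look roughly like the sketch below. This is not Microsoft's orchestration logic. The names `partition`, `staggered_reset`, `reset`, and `is_healthy` are hypothetical, and a real pipeline would presumably poll instance health with a timeout rather than check it once.

```python
from typing import Callable, Iterator, List, Sequence


def partition(instances: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size instances."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(instances), batch_size):
        yield list(instances[start:start + batch_size])


def staggered_reset(
    instances: Sequence[str],
    batch_size: int,
    reset: Callable[[str], None],       # hypothetical: restarts one instance
    is_healthy: Callable[[str], bool],  # hypothetical: True once it serves again
) -> List[str]:
    """Reset instances one batch at a time and return those that were reset.

    The rollout halts as soon as a batch fails to recover, so the
    remaining instances keep serving queries.
    """
    done: List[str] = []
    for batch in partition(instances, batch_size):
        for name in batch:
            reset(name)
        done.extend(batch)
        # Simplification: a single health check instead of polling with a timeout.
        if not all(is_healthy(name) for name in batch):
            break
    return done
```

With five instances and a batch size of two, a failure in the second batch leaves the fifth instance untouched, so at least part of the fleet stays available while the stale state is investigated.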
Posted Oct 02, 2020 - 16:21 UTC

Resolved
Between 07:50 and 08:45 UTC (approx.) on July 18, 2020, Azure DNS experienced a transient resolution issue that impacted connectivity for other Azure services.
Posted Jul 18, 2020 - 07:50 UTC