Published July 22, 2020
ROOT CAUSE
Decommissioning a legacy DNS solution inadvertently caused some data streams for the Azure DNS recursive resolver service to become out of sync with the resolver state. This was detected by a sync pipeline, which triggered a reset of the resolver instances to recover from the stale state. This reset was not done in a staggered fashion and led to multiple resolver instances rebooting at the same time. This led to degradation of the service and caused DNS resolution failures for the queries originating from virtual networks.
Impacts were observed across multiple Azure regions. While some instances of the service saw no impact, the US region that hosts RSA SecurID Access was impacted and took 30 minutes to recover. The DNS resolution issues were fully auto-mitigated across all Azure regions within 54 minutes.
RECOVERY: While this impacted only the RSA SecurID Access US region, this was a global Azure outage. We are working with Microsoft to review and track their proposed mitigations: