Netwrix - Azure Issues

Incident Report for Netwrix

Postmortem

What happened?

Between 15:45 UTC on 29 October and 00:05 UTC on 30 October 2025, customers and Microsoft services leveraging Azure Front Door (AFD) may have experienced latencies, timeouts, and errors.

Affected Azure services include, but are not limited to: App Service, Azure Active Directory B2C, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Portal, Azure SQL Database, Azure Virtual Desktop, Container Registry, Media Services, Microsoft Copilot for Security, Microsoft Defender External Attack Surface Management, Microsoft Entra ID (Mobility Management Policy Service, Identity & Access Management, and User Management UX), Microsoft Purview, Microsoft Sentinel (Threat Intelligence), and Video Indexer.

Customer configuration changes to AFD remain temporarily blocked. We will notify customers once this block has been lifted. While error rates and latency are back to pre-incident levels, a small number of customers may still be seeing issues and we are still working to mitigate this long tail. Updates will be provided directly via Azure Service Health.

What went wrong and why?

An inadvertent tenant configuration change within Azure Front Door (AFD) triggered a widespread service disruption affecting both Microsoft services and customer applications dependent on AFD for global content delivery. The change introduced an invalid or inconsistent configuration state that caused a significant number of AFD nodes to fail to load properly, leading to increased latencies, timeouts, and connection errors for downstream services.

As unhealthy nodes dropped out of the global pool, traffic distribution across healthy nodes became imbalanced, amplifying the impact and causing intermittent availability even for regions that were partially healthy. We immediately blocked all further configuration changes to prevent additional propagation of the faulty state and began deploying a ‘last known good’ configuration across the global fleet. Recovery required reloading configurations across a large number of nodes and rebalancing traffic gradually to avoid overload conditions as nodes returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.
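
As an illustration only, the gradual rebalancing described above can be sketched as a weighted ramp, where recovering nodes receive a growing fraction of a healthy node's traffic at each step. This is not AFD's actual mechanism: the function name `traffic_shares`, the linear ramp, and the step counts are all assumptions made for the example.

```python
def traffic_shares(healthy: int, recovering: int, step: int, total_steps: int) -> dict:
    """Per-node share of traffic at a given ramp step (hypothetical model).

    Healthy nodes carry weight 1.0. Recovering nodes carry a weight that
    grows linearly from 0.0 (step 0) to 1.0 (step == total_steps).
    """
    if healthy < 1:
        raise ValueError("at least one healthy node is required")
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")

    # Clamp the ramp so out-of-range steps cannot over- or under-weight nodes.
    ramp = min(max(step / total_steps, 0.0), 1.0)
    total_weight = healthy * 1.0 + recovering * ramp

    return {
        "per_healthy_node": 1.0 / total_weight,
        "per_recovering_node": ramp / total_weight,
    }


if __name__ == "__main__":
    # Two healthy nodes, two recovering nodes, ramped over four steps.
    for step in range(5):
        print(step, traffic_shares(healthy=2, recovering=2, step=step, total_steps=4))
```

In this model the healthy nodes start by carrying all traffic, and the load on each of them falls smoothly as recovering nodes rejoin, rather than the recovering nodes taking a full share the moment they come back.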

The trigger was traced to a faulty tenant configuration deployment process. Our protection mechanisms, which are meant to validate and block erroneous deployments, failed because of a software defect that allowed the deployment to bypass safety validations. Safeguards have since been reviewed, and additional validation and rollback controls have been implemented to prevent similar issues in the future.
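
For readers building their own deployment pipelines, the general pattern of a validation gate combined with a 'last known good' rollback can be sketched as follows. This is a minimal, hypothetical example and does not represent the AFD implementation: `ConfigStore`, `validate`, `deploy`, `rollback`, and the `routes`/`origin` config shape are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigStore:
    """Hypothetical store tracking the active and last-known-good configs."""
    active: dict
    last_known_good: dict = field(default_factory=dict)

    def __post_init__(self):
        # Treat the initial config as known good until something replaces it.
        if not self.last_known_good:
            self.last_known_good = dict(self.active)


def validate(config: dict) -> list:
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    if not config.get("routes"):
        problems.append("config has no routes")
    for route in config.get("routes", []):
        if not route.get("origin"):
            problems.append(f"route {route.get('path', '?')} has no origin")
    return problems


def deploy(store: ConfigStore, candidate: dict) -> bool:
    """Apply the candidate only if it validates; otherwise change nothing."""
    if validate(candidate):
        return False  # blocked: the active config is left untouched
    store.last_known_good = dict(store.active)
    store.active = dict(candidate)
    return True


def rollback(store: ConfigStore) -> None:
    """Restore the last config that was active before the latest deploy."""
    store.active = dict(store.last_known_good)
```

The key property is that validation runs on every path into `deploy`, so there is no route by which a candidate can become active without passing the gate, and the previous config is always retained for rollback.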

How did we respond?

  • 15:45 UTC on 29 October 2025 – Customer impact began.
  • 16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
  • 16:15 UTC on 29 October 2025 – We started to examine configuration changes within AFD.
  • 16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.
  • 16:20 UTC on 29 October 2025 – Targeted communications to impacted customers sent to Azure Service Health.
  • 17:26 UTC on 29 October 2025 – Azure portal failed away from Azure Front Door.
  • 17:30 UTC on 29 October 2025 – We blocked all new customer configuration changes to prevent further impact.
  • 17:40 UTC on 29 October 2025 – We initiated the deployment of our ‘last known good’ configuration.
  • 18:30 UTC on 29 October 2025 – We started to push the fixed configuration globally.
  • 18:45 UTC on 29 October 2025 – Manual recovery of nodes commenced while gradual routing of traffic to healthy nodes began after the fixed configuration was pushed globally.
  • 23:15 UTC on 29 October 2025 – PowerApps mitigated its dependency, and customers confirmed mitigation.
  • 00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers.
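
The intervals implied by this timeline can be worked out directly from the timestamps listed above. The short sketch below does that arithmetic; the variable names are ours, and the timestamps are copied from the list.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Timestamps copied from the timeline above.
impact_began = datetime(2025, 10, 29, 15, 45, tzinfo=UTC)
investigation_began = datetime(2025, 10, 29, 16, 4, tzinfo=UTC)
changes_blocked = datetime(2025, 10, 29, 17, 30, tzinfo=UTC)
mitigated = datetime(2025, 10, 30, 0, 5, tzinfo=UTC)

time_to_investigate = investigation_began - impact_began
time_to_block = changes_blocked - impact_began
total_duration = mitigated - impact_began

print("Impact start to investigation:", time_to_investigate)   # 0:19:00
print("Impact start to change block: ", time_to_block)         # 1:45:00
print("Total incident duration:      ", total_duration)        # 8:20:00
```

These figures cover the full incident window, so they are an upper bound rather than the impact seen by any individual customer.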

What happens next?

Our team will complete an internal retrospective to understand the incident in more detail. Once it is complete, generally within 14 days, we will publish a final Post Incident Review (PIR) to all impacted customers.

To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts

For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs

The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources. For guidance on implementing monitoring to understand granular impact, refer to https://aka.ms/AzPIR/Monitoring

Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/YKYN-BWZ

Posted Oct 30, 2025 - 09:18 UTC

Resolved

Microsoft has identified and resolved the issues affecting Azure and Front Door services. Access to our systems should now be fully restored. We are continuing to monitor to ensure stability.
Posted Oct 29, 2025 - 23:59 UTC

Update

Currently, access to Netwrix 1Secure may be intermittent.

Update from Microsoft

Current status: We initiated the deployment of our ‘last known good’ configuration, which has now successfully completed. We are currently recovering nodes and re-routing traffic through healthy nodes.

As recovery progresses, some requests may still land on unhealthy nodes, resulting in intermittent failures or reduced availability until more nodes are fully restored. This recovery effort involves reloading configurations and rebalancing traffic across a large volume of nodes to restore full operational scale. The process is gradual by design, ensuring stability and preventing overload as dependent services recover. We expect continued improvement across affected regions, with recovery expected by 23:20 UTC on 29 October 2025.
Posted Oct 29, 2025 - 21:58 UTC

Monitoring

Microsoft has confirmed a widespread outage impacting Azure and Azure Front Door. This may cause latency or service interruptions for users accessing our platform.
We are continuing to monitor the situation and will update once mitigation steps are in place.

https://azure.status.microsoft/en-gb/status
Posted Oct 29, 2025 - 16:33 UTC

Identified

We are currently investigating an issue affecting connectivity and performance due to an ongoing outage with Microsoft Azure and Azure Front Door. This may result in degraded performance or intermittent errors for customers accessing our services.

We are monitoring Microsoft’s status updates and will provide additional information as it becomes available.
Posted Oct 29, 2025 - 16:27 UTC
This incident affected: 1Secure - Europe (Portal, Reporting, Activity, Alerts, Risk, Agent, Azure AD, Exchange Online, SharePoint Online) and 1Secure - USA (Portal, Reporting, Activity, Alerts, Risk, Agent, Azure AD, Exchange Online, SharePoint Online).