Network issues in EU
Incident Report for acast

Postmortem: Network issues in EU

First off: we're really sorry that this happened! The team will work on preventing it from happening again.

Summary

A bug sneaked into Microsoft Traffic Manager, the service that Microsoft Azure uses for routing load between geographical regions and that we depend on. As a result, users were not able to access the sites in Europe (as far as we could tell). This affected just about everything: the apps, the web player, the RSS feeds and the Create tool.

A little more detail

We've received a Root Cause Analysis (RCA) from Microsoft, which we can't find on the web, so we'll paste the details here:

On April 30 from 1:25 to 11:23 AM UTC, some customers using Azure Traffic Manager in the US and Europe regions may have experienced intermittent DNS resolution failures for Traffic Manager (*.trafficmanager.net) endpoints. This was due to a software bug in an update to the Traffic Manager service that was being deployed to those regions. This bug resulted in some Traffic Manager servers not having a complete set of Traffic Manager profiles, which caused lookup requests for endpoints for those missing profiles to fail while requests for other profiles succeeded normally. Azure platform monitoring alerts were received by the Azure engineering team, which initiated an investigation and diagnosed the cause of the partial service interruption. The Traffic Manager update deployment was stopped and reverted, which restored services for all customers.

Customer Impact
Some customers using Azure Traffic Manager in the US and Europe regions may have experienced intermittent DNS resolution failures for Traffic Manager (*.trafficmanager.net) endpoints.

Root Cause
This was due to a software bug in an update to the Traffic Manager service that was being deployed to those regions.

Next Steps
We are continuously taking steps to improve the Microsoft Azure Platform and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
· Fix the software bug that resulted in some Traffic Manager servers not having a complete set of Traffic Manager profiles.

· Further enhance Azure platform monitoring to detect similar partial service interruptions more quickly to lessen potential impact to customers.

Next steps

We trust Microsoft will fix this issue, and we'll also look into spreading our systems over multiple cloud providers.
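
To illustrate what spreading over multiple providers could look like from a client's point of view, here is a minimal sketch, not a description of our actual setup. It tries a list of hostnames in order and returns the first one that resolves in DNS. The hostnames in the usage note and the `first_resolvable` helper are hypothetical and only serve the example.

```python
import socket
from typing import Callable, Iterable, Optional


def _default_resolve(hostname: str) -> object:
    # Standard-library DNS lookup; raises socket.gaierror when resolution fails.
    return socket.getaddrinfo(hostname, 443)


def first_resolvable(
    hostnames: Iterable[str],
    resolve: Callable[[str], object] = _default_resolve,
) -> Optional[str]:
    """Return the first hostname that resolves, or None if none of them do.

    `resolve` is injectable so the logic can be exercised without network access.
    """
    for hostname in hostnames:
        try:
            resolve(hostname)
            return hostname
        except OSError:
            # socket.gaierror is a subclass of OSError: treat it as
            # "this endpoint is unreachable by name" and try the next one.
            continue
    return None
```

A client could call `first_resolvable(["example-eu.trafficmanager.net", "example-backup.example.net"])` (both names are made up) and fall back to the second host when the first fails to resolve. This only covers DNS resolution failures like the ones in the RCA above; it does not check that the service behind the name is healthy.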

Posted over 2 years ago. May 05, 2015 - 14:09 CEST

Resolved
Looks like everything is back on track and working as it should.
Posted over 2 years ago. Apr 30, 2015 - 13:07 CEST
Update
Most things seem to have cleared up. We are still experiencing some issues with Search on www.acast.com and the phone apps.
Posted over 2 years ago. Apr 30, 2015 - 11:55 CEST
Monitoring
The fix from Microsoft seems to have resolved the issue. We will keep monitoring the situation to see if we're good.
Posted over 2 years ago. Apr 30, 2015 - 11:20 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted over 2 years ago. Apr 30, 2015 - 10:45 CEST
Update
Microsoft is currently rolling out a fix for the DNS issues and we see some services coming online again. We're monitoring the situation.
Posted over 2 years ago. Apr 30, 2015 - 10:45 CEST
Update
DNS resolution issues still persist. We're looking into a workaround.
Posted over 2 years ago. Apr 30, 2015 - 09:59 CEST
Investigating
Our cloud vendor had network issues in the EU region (http://azure.microsoft.com/en-us/status/). We're currently routing our services to the US region to get things running again.
Posted over 2 years ago. Apr 30, 2015 - 08:28 CEST