Yes, Microsoft confirmed it had finally resolved an issue with Outlook.com that had been affecting some users for up to three days. The company apologized multiple times to those affected and explained both what happened and the steps it has taken to prevent future issues.
Microsoft blamed this particular incident on a failure of the caching service that interfaces with devices using Exchange ActiveSync, including most smartphones. The failure caused these devices to receive an error and continuously try to connect to Microsoft’s service, which resulted in a flood of traffic that the company’s servers did not handle properly.
As a result, some users could not access their accounts, and Microsoft was forced to temporarily block access via Exchange ActiveSync. That allowed the company to restore access to Outlook.com via the Web and restore the sharing features of SkyDrive, which were fully stabilized “within a few hours of the initial incident.”
Unfortunately, Microsoft still had a “significant” backlog of Exchange ActiveSync requests to work through, which it had to do slowly in order to prevent the issue from resurfacing, meaning “some customers remained impacted for a longer period of time.” The company says the backlog is now clear and the service has been restored for all.
Here is Microsoft's statement in full:
We want to apologize to our customers who were affected by the outage on Outlook.com this week. We have restored access to all accounts and have made changes so that the service will be more resilient in the future. We realize that we have a responsibility to the customers who use our services to communicate and share with the people they care most about, and we apologize for letting those customers down this week.
Our first priority is to the health of the services, and we will learn from this incident and work to improve the experience of all our customers. As part of that, we would also like to provide more detail about what happened.
This incident was a result of a failure in a caching service that interfaces with devices using Exchange ActiveSync, including most smart phones. The failure caused these devices to receive an error and continuously try to connect to our service. This resulted in a flood of traffic that our services did not handle properly, with the effect that some customers were unable to access their Outlook.com email and unable to share their SkyDrive files via email.
In order to stabilize the overall email service, we temporarily blocked access via Exchange ActiveSync. This allowed us to restore access to Outlook.com via the web and restore the sharing features of SkyDrive. These parts of the service were fully stabilized within a few hours of the initial incident. A significant backlog of Exchange ActiveSync requests accumulated as we worked to stabilize access. To avoid another flood of traffic, we needed to restore access to Exchange ActiveSync slowly, which meant that some customers remained impacted for a longer period of time.
We have learned from this incident, and have made two key changes to harden our systems against future failure – one that involved increasing network bandwidth in the affected part of the system, and one that involved changing the way error handling is done for devices using Exchange ActiveSync. We will continue to monitor the system and make additional changes as needed to keep the service healthy.
We are now fully through the backlog and have restored service so all customers should have normal access from all of their devices. We want to apologize to everyone who was affected by the outage, and we appreciate the patience you have shown us as we worked through the issues.
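The statement does not say how error handling was changed for devices using Exchange ActiveSync. One common way to keep clients from flooding a failing service, shown here only as an illustration and not as Microsoft's actual fix, is for each client to wait a randomized, growing interval between retries instead of reconnecting continuously. The names `backoff_delays`, `sync_with_retry` and `sync_once` are hypothetical, as are the default timings.

```python
import random
import time


def backoff_delays(max_attempts, base=1.0, cap=300.0, rng=None):
    """Yield one randomized wait, in seconds, per retry attempt."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        # The ceiling doubles on every attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Pick uniformly in [0, ceiling] so that many clients failing at the
        # same moment do not all retry at the same moment.
        yield rng.uniform(0.0, ceiling)


def sync_with_retry(sync_once, max_attempts=8, sleep=time.sleep, rng=None):
    """Call `sync_once` until it succeeds, backing off after each failure.

    `sync_once` is a placeholder for whatever the device does to sync;
    it is assumed to raise ConnectionError when the server returns an error.
    """
    for delay in backoff_delays(max_attempts, rng=rng):
        try:
            return sync_once()
        except ConnectionError:
            # Wait before the next attempt instead of retrying immediately.
            sleep(delay)
    raise ConnectionError("giving up after %d attempts" % max_attempts)
```

With this pattern, a device that keeps receiving errors slows down on its own, so the total retry traffic shrinks while the service is unhealthy rather than growing into a flood.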
Unfortunately, this week’s Outlook.com problems (Wednesday 21st, 2013) are not completely resolved. In fact, they are now affecting desktops, creating a big headache for all the companies that, since migrating to the cloud, rely heavily on their cloud infrastructure.
If we remember what happened to BlackBerry's business e-mail services, which were the first ones to be based on the cloud, we should weigh the solutions this platform provides against the risks that come with relying on it completely…