On Monday 3 February 2020 at 8.45am EST, all GSX Gizmo Robot Users started failing to log in to MS Teams. It was clear that something was going wrong on the Microsoft side. We could even see that the average login time across all locations increased just before the incident.
Convinced that there was a global outage, company IT staff could alert all users about the incident to reduce the impact on operations. When you rely on the Teams service for calls, instant messaging, file sharing and screen sharing, the impact of an outage is huge. When all your end users are also trying to resolve the issue themselves by rebooting and reaching out to IT, the productivity impact is even greater!
Teams Login graph averaged across all Gizmo Robot Users
A few minutes later, various companies were complaining about the service on social networks. At 9.13am EST, Microsoft published the incident in the Office 365 portal, confirming almost 30 minutes later what we had discovered right away with our Robots.
By 10.40am, the expired certificate that caused the issue was renewed and the service came back into operation. Following the outage, there was a surge of users reconnecting from both the EMEA and NOAM regions, which caused the service to respond with high latency. (As you can see, our Robots throughout the organization needed an average of 38 seconds to connect to Teams on the first successful attempt.)
Microsoft engineers increased the throttle limit to account for these extra connections, and the scale-up was completed by 11.20am EST. Some users were still impacted after that, but the issue was fully sorted out by the afternoon (EST).
Post-Incident Report from Microsoft on the Teams incident TM202916
What did we learn here? Above all, monitoring by performing synthetic transactions was the only way to detect this issue properly, and to detect it before end users started reporting it! It provided clear visibility of the impact in real time, and the incident could be tracked throughout its entire course until the complete restoration of service. Each minute is valuable; knowing what is going on saves your company money and reduces the frustration of end users who want to use the service.
Historical data for Monday, showing the outage in red as well as the high latency before and after the outage for 4 of the deployed Robots (UTC times).
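To make the idea of a synthetic transaction more concrete, here is a minimal Python sketch. It is not how the GSX Robots are implemented; the names `run_probe` and `classify`, the `login` callable and the 3x threshold are all assumptions made for illustration. The probe times one scripted login attempt, and the result is labelled as an outage when the login fails or as degraded when it is much slower than recent successful attempts.

```python
import time
from statistics import mean
from typing import Callable, List, Optional


def run_probe(login: Callable[[], None]) -> Optional[float]:
    """Time one synthetic login; return seconds taken, or None if it failed.

    `login` is a hypothetical callable that performs the scripted sign-in
    and raises an exception when the sign-in does not succeed.
    """
    start = time.perf_counter()
    try:
        login()
    except Exception:
        return None
    return time.perf_counter() - start


def classify(sample: Optional[float], baseline: List[float], factor: float = 3.0) -> str:
    """Label a probe result against recent successful login times.

    `factor` is an assumed threshold: a login slower than `factor` times
    the baseline average is reported as degraded.
    """
    if sample is None:
        return "outage"
    if baseline and sample > factor * mean(baseline):
        return "degraded"
    return "ok"
```

Run on a schedule from several locations, a probe like this would report failed logins as an outage and a 38-second login against a baseline of a few seconds as degraded, which are the two conditions visible in the graphs above.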
The other aspect is that if it happened to Microsoft… it could happen to you, too! So you had best start monitoring all the certificates that you have for web applications, and GSX can help you do just that!
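As a rough illustration of what a certificate check involves, the Python sketch below uses only the standard library. It is a generic example and not a description of the GSX product; the function names and the idea of passing in a host name are assumptions. It reads the `notAfter` date of the certificate a server presents and converts it into the number of days remaining.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date (negative if expired).

    `not_after` uses the format returned by ssl.getpeercert(),
    for example "Feb 13 00:00:00 2020 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate presented by host:port.

    The default context verifies the certificate, so an already expired
    certificate makes the handshake raise ssl.SSLCertVerificationError,
    which is itself a signal worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `days_until_expiry(fetch_not_after(host))` for each of your web applications on a daily schedule, and alerting when the result drops below a threshold of your choosing, gives you warning before a certificate lapses.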
Keep reading about MS Teams:
Check out this article about the Microsoft Teams SLA written by Microsoft MVP Tom Arbuthnot. It gives an interesting breakdown of the key elements.
Read this article about the 10 key metrics to assess your MS Teams user experience.