As I write this, the issue that affected Office 365 users worldwide appears to have been diagnosed and resolved. Users were unable to log in because of a DNS issue that propagated. It is now behind us, as you can read here.
The various GSX Robot Users identified this issue straight away, as you’ll see from the dashboards shown below. Like many of our clients, GSX received warnings a good fifteen minutes before Microsoft officially announced the start of the outage, and we saw it hit OneDrive in Europe more than in any other test. Below are the metrics we collected for Skype Online:
Office 365 outages of this size are visible partly because they are global, but mainly because they are very rare.
We work with hundreds of large international companies that use GSX Gizmo to monitor the health of their Office 365 deployments. Here are the main take-aways we gathered from them:
1. Outages like this one are very rare. This means that the service delivered by Microsoft is of high quality. Issues like these are highly visible, but consider the following comparison: airline accidents make headline news because they are spectacular, yet if you look at the ratio of casualties to miles travelled, flying is the safest means of transportation.
2. Microsoft still beats its SLA every month, yet issues that fall under the customer's responsibility are 5.3x more common. If you’re worried about this outage, what about the five outages or performance degradations on your end that you missed?
What affects users is usually performance degradation, which occurs frequently. Our analysis shows that what affects delivery is rarely Microsoft; far more often it is something under the responsibility of the company using Office 365. All “full cloud” deployments are actually “full hybrid”. Using Office 365 does not mean that the usual mechanisms for managing delivery no longer have to be set up.
3. Facts, not emotions.
When it comes to monitoring in the new IT world, you must collect your own metrics by using the service exactly as an end user would, from the locations where users perform their actions, and through the same delivery path. It is the only way to gain end-to-end visibility, manage user expectations, and close tickets should any arise. It is also the best way to ensure your work has appropriate visibility with senior management.
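To make the idea of measuring the service as an end user would more concrete, here is a minimal Python sketch of a synthetic probe. It is an illustration under stated assumptions, not a description of how any GSX product works: `ProbeResult`, `run_probe`, and the `action` callable are hypothetical names, and the action itself (signing in, opening a mailbox, uploading a file) is left for you to supply.

```python
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One measurement of one user-style action (hypothetical structure)."""
    action: str
    location: str
    ok: bool
    elapsed_ms: float
    error: str


def run_probe(action_name, location, action):
    """Run a callable that performs one end-user action and time it.

    `action` is any zero-argument callable supplied by the caller. It should
    raise an exception when the action fails from the user's point of view.
    """
    start = time.perf_counter()
    try:
        action()
        ok, error = True, ""
    except Exception as exc:  # any failure counts as a failed user action
        ok, error = False, f"{type(exc).__name__}: {exc}"
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return ProbeResult(action_name, location, ok, elapsed_ms, error)
```

Running the same probe on a schedule from each office location gives you your own record of availability and response time, independent of what the provider reports.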
Our solutions are here to achieve two goals:
- Helping you deliver optimal performance by proactively measuring service levels.
- Ensuring that you can measure the ROI and the effort invested in delivering the best use of all active workloads. Investments are significant and the returns from productivity gains are huge. Those returns only happen when good service is delivered, and that service has to be measured objectively.
So, all in all, yes, there was an outage of the Office 365 service, and it affected performance and the end-user experience. But we believe that what really matters falls under your responsibility. You need to go further and carefully examine the complete service delivery path between the Microsoft datacenter and your users.
GSX helps organizations understand what their users are truly experiencing, and spots everything that affects the end-user experience:
- Performing real end-user actions on Office 365: Exchange, SharePoint, OneDrive, Teams, Skype for Business online.
- Performing real end-user actions on Microsoft hybrid deployment: Exchange, SharePoint, Skype for Business on-premises.
- Performing network health & performance checks: checking traceroute with hop number & latency, port connectivity, round trip time, DNS resolution availability as well as packet loss analysis.
- Performing hybrid identities health checks:
  - For Azure AD Connect, GSX checks the synchronization of the identities and the health & performance of the specific SQL database and, of course, retrieves system performance information.
  - For ADFS, GSX performs true authentication, collects ADFS usage metrics (number of tokens of different forms), and checks the overall health of the server.
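Two of the network checks listed above, DNS resolution and port connectivity, can be sketched with the Python standard library alone. This is a minimal illustration, not GSX's implementation: the function names `check_dns` and `check_tcp_port` are hypothetical, and the TCP connect time is only a rough stand-in for round trip time.

```python
import socket
import time


def check_dns(hostname):
    """Resolve a hostname; return (ok, elapsed_ms, addresses)."""
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Resolution failed: report the time spent and no addresses.
        return False, (time.perf_counter() - start) * 1000.0, []
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Each entry's sockaddr starts with the IP address; deduplicate and sort.
    addresses = sorted({info[4][0] for info in infos})
    return True, elapsed_ms, addresses


def check_tcp_port(host, port, timeout=3.0):
    """Try to open a TCP connection; return (ok, connect_time_ms)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        ok = False
    return ok, (time.perf_counter() - start) * 1000.0
```

For example, calling `check_dns` on the hostname your users sign in to, followed by `check_tcp_port` on port 443 of that host, from each user location shows whether a failure sits in name resolution or in connectivity. Traceroute and packet loss analysis need raw sockets or external tools and are not covered by this sketch.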