


How to detect a tenant-wide Exchange Online issue

Distinguishing a tenant-wide issue from a local issue is one of the main challenges our customers face every day.

This question matters not only to IT and service level management but also to Microsoft. A massive number of tickets are opened with Microsoft support because of a lack of knowledge about what is really going on in the environment. In the end, this costs both the customer and Microsoft.

There is a simple way to detect a tenant-wide issue when you have Robot Users. Here is how to do it.

For more details about how a Robot User is measuring end-user performance data, please read this article >> 

For this first test, we ran 10 Robot Users from different locations (Azure, Bangalore, Boston, Nice, Philadelphia) with different configurations (with or without a proxy, connecting to the headquarters in Europe or not, etc.). These are the ones selected in the chart below.

This gives us a view of how Office 365 connects with different tenants across a variety of configurations.

Let’s look at end-user experience statistics and trends to see what we can find. 

First set of Results 

Exchange Online Open Mailbox
Here is the data for all of the Robot Users. The purpose is to see whether the connection trends are consistent.
At the top, a line chart compares the performance of every Robot User over time.
At the Open Mailbox level, apart from several spikes, the trend looks normal, with the best performance on weekends, when there is less activity.
On the bottom left is the list of all the Robot Users (active or not). The bottom middle shows a few statistics calculated to measure the severity of connection issues.
The square chart on the bottom right compares the average time taken by each Robot User to perform the selected action (in this case, open mailbox). You will see this same dashboard structure in many of our tests.
Looking at the statistics, there is not a single moment when all Robot Users had an issue at the same time. We can conclude that nothing impacted our service at the tenant level; the issues were only local.
Another important element is that the standard deviation has about the same value as the median, which shows a very stable and healthy spectrum of connections.
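As an illustration of the kind of statistics mentioned here, the short sketch below computes the median and standard deviation for one Robot User's measurements. It is not the calculation used in the dashboard: the function name `summarize`, the choice of the population standard deviation, and the sample durations are all assumptions made for the example.

```python
from statistics import median, pstdev

def summarize(durations):
    """Summarize one Robot User's action durations (in seconds)."""
    return {
        "median": median(durations),
        "stdev": pstdev(durations),  # population standard deviation (assumption)
        "max": max(durations),
    }

# Hypothetical Open Mailbox durations for a single Robot User
stats = summarize([1.0, 2.0, 3.0])
print(stats["median"], round(stats["stdev"], 3), stats["max"])
```

Comparing these values across Robot Users is one way to spot a location whose measurements are far more dispersed than the others.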

To learn more about how to interpret the statistics we display in PowerBI dashboards please read this article >> 

The finding here is that if we analyze only the “open mailbox” action, everything looks fine. However, we might see different results if we look at more actions.
So let’s dig into these other actions to see whether “open mailbox” is a sufficient measure to test tenant-level connections.

Why it is necessary to analyze real end-user actions 
The results here with a “create a meeting” action are a bit different. 
We can see some specific peaks when every Robot User was impacted. Even the Robot User running in Microsoft Azure was impacted at those moments.

This is strong evidence that, at these particular moments, the entire tenant was performing below par.
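The reasoning can be sketched in a few lines of code. This is a minimal illustration, not the way the GSX dashboards are built: it assumes each Robot User reports one duration (in seconds) per time slot, and it treats a Robot User as degraded when a duration exceeds a multiple of that robot's own median. The names `degraded_slots` and `tenant_wide_slots`, the factor of 3, and the sample data are hypothetical.

```python
from statistics import median

def degraded_slots(durations, factor=3.0):
    """Slot indexes where a robot is slower than factor x its own median."""
    baseline = median(durations)
    return {i for i, d in enumerate(durations) if d > factor * baseline}

def tenant_wide_slots(samples, factor=3.0):
    """Sorted slot indexes where every robot is degraded at the same time."""
    per_robot = [degraded_slots(d, factor) for d in samples.values()]
    if not per_robot:
        return []
    return sorted(set.intersection(*per_robot))

# Hypothetical "create a meeting" durations, one list per Robot User
samples = {
    "Azure":  [1.0, 1.1, 0.9, 6.0, 1.0],
    "Boston": [2.0, 2.2, 9.0, 8.5, 2.1],
    "Nice":   [1.5, 1.4, 1.6, 7.0, 1.5],
}
print(tenant_wide_slots(samples))  # [3]
```

In this sample, Boston alone is slow in slot 2, which points to a local issue, while slot 3 is slow for every robot, including the one in Azure, which points to the tenant.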

These first tests show that you cannot rely on just checking Open Mailbox on your tenant, as it does not detect every legitimate problem.
Each user action has its own characteristics and performance, and each of them can be a source of complaints, tickets and eventually costs.

Let’s continue to dig into other actions. 


Free / Busy lookups 

The Free / Busy lookup is a very important feature of Microsoft Outlook. Everybody uses it to schedule meetings with other people and make sure everybody is available for those meetings. To perform this operation, Microsoft retrieves the status, free or busy, of another person’s calendar for the time that is selected.
If the Free / Busy lookup is slow, scheduling a meeting with multiple attendees can become an arduous process for end users. If we look at what happened between June 16 and 17, we can spot the issues.
The main question being: how do we know if it is the tenant rather than a single location? 

Once again, you can see that multiple robots are experiencing latency at the same time with the same workload. And once again, even the Robot User running in Microsoft Azure had issues performing the Free / Busy lookup.

It was a short time window so the shift was not too dramatic, but it shows real facts. 

We will see in other articles how other network configurations can affect end-user performance, for better or worse. But when 6 or 7 different sites, including Microsoft Azure, are in trouble at the same time, it is certainly time to open a ticket with Microsoft and ask for answers.
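One way to turn this rule of thumb into a check is sketched below. It is an illustration under assumptions, not a GSX feature: the input maps each site to a flag saying whether it was in trouble during the window, and the `min_share` of 0.6, the `reference_site` name, and the site labels are hypothetical values chosen to mirror "6 or 7 sites including Microsoft Azure" out of 10 Robot Users.

```python
def should_open_ticket(impacted, reference_site="Azure", min_share=0.6):
    """Suggest a ticket when the reference site and at least
    min_share of all sites are impacted in the same window."""
    if not impacted:
        return False
    share = sum(impacted.values()) / len(impacted)
    return impacted.get(reference_site, False) and share >= min_share

# Hypothetical window: 7 of 10 Robot Users impacted, Azure among them
window = {f"site{i}": i < 6 for i in range(9)}  # site0..site5 impacted
window["Azure"] = True
print(should_open_ticket(window))  # True
```

Requiring the Azure robot to be among the impacted sites reflects the argument made here: if even a robot running inside Microsoft's own cloud is slow, the customer's local network is an unlikely cause.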

And this time you’ll be able to approach Microsoft with evidence, including the exact dates and times of the connection issues.
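The exact dates for such a ticket can be derived from the flagged measurements. The sketch below assumes a list of timestamps at which the tenant-wide condition held and a fixed polling interval. The function name `incident_windows`, the 5-minute interval, and the sample timestamps (including the arbitrary year) are hypothetical.

```python
from datetime import datetime, timedelta

def incident_windows(timestamps, interval=timedelta(minutes=5)):
    """Merge flagged timestamps that are at most one interval apart
    into (start, end) windows."""
    windows = []
    for ts in sorted(timestamps):
        if windows and ts - windows[-1][1] <= interval:
            windows[-1] = (windows[-1][0], ts)  # extend the current window
        else:
            windows.append((ts, ts))            # start a new window
    return windows

# Hypothetical flagged measurements; the year is a placeholder
flagged = [
    datetime(2000, 6, 16, 22, 0),
    datetime(2000, 6, 16, 22, 5),
    datetime(2000, 6, 16, 22, 10),
    datetime(2000, 6, 17, 1, 30),
]
for start, end in incident_windows(flagged):
    print(start.isoformat(), "to", end.isoformat())
```

Each resulting window gives a start and end time that can be quoted directly in the ticket.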

GSX Solutions helps you describe the situation more precisely when submitting a ticket to Microsoft, which helps reduce the cost of ticketing. It is better to be ready with the facts when you complain to your cloud provider.

We provide an optimal solution for interacting with Microsoft, because the questions raised in your tickets will be backed by sound reasoning and solid facts.

We’ve seen how easy it is to spot a tenant-wide issue and how clearly such issues show up in the data.