Troubleshooting Exchange Online End-User Experience Complaints



Introduction
 

img1.jpg
At GSX we work with a lot of customers who have moved, are moving, or plan to move to Office 365.

We help them assess their Office 365 environment before, during, and after the migration, and we help them understand the end-user experience they really deliver to their end-users.

In one of our “business stories” we mentioned how migrating to the cloud can directly impact Outlook performance.

In this graph, we compare the response times of end-user actions performed against on-premises Exchange servers and against an Office 365 tenant.
This example, done for a large bank in North America, shows that simply monitoring an “open mailbox” is not sufficient. Each action matters and has its own “user threshold”; once it is exceeded, you will have to deal with an influx of complaints and open tickets.

Office 365 was built to enforce strong security and high availability, and on those fronts it has been a huge success.

The downside, of course, is that these two priorities impact performance.
Even if the cloud provides a service that is a bit slower than the former on-premises servers, it would be a mistake to believe that the cloud services themselves are responsible for the rise in end-user complaints.

After dozens of projects with our customers, we’ve realized that 9 times out of 10, end-user experience issues were related to the hybrid network and server components necessary to run a cloud service.

The main issue was that, before the deployment of GSX, there was nothing to measure the end-user experience and correlate it with independent network and hybrid component metrics.

So, in order to provide data and feedback, we decided to run a few experiments in our distributed environment and share our findings with you.

We deployed a few GSX Robot Users at critical locations to mimic customer activity and started to measure the standard end-user experience, action by action, along with critical network metrics, to get an idea of their impact on the end-user experience.

Today we are focusing our experiment on Office 365 Exchange Online and the main actions a normal user performs in Outlook. We plan to discuss Skype for Business and SharePoint Online in future articles.

The purpose of this RoboTech content is to show you how network equipment and configuration can really impact end-user experience and how, with GSX, you can easily understand and troubleshoot certain network issues.

From a tenant-wide issue to network misconfiguration to local network problems, measuring the true end-user experience is your basis for efficiently troubleshooting hybrid infrastructure and optimizing your support costs.


Our experiment

We operated with two tenants: one in the USA and one in Europe.
We deployed Robot Users in the following locations:

► Three in Philadelphia (USA)
► One in Nice (France)
► One in Bangalore (India)
► One in Boston (USA)
► One in Microsoft Azure Cloud

On each of them, we ran the same tests but applied multiple configurations:

► With or without a proxy
► With or without packet loss
► With a flapping router
► With or without ExpressRoute
► With connection to Office 365 through headquarters or directly through the Internet
► With regional DNS resolution or out-of-country DNS resolution

There was one caveat we looked into before jumping to any conclusions: locations may have different Internet connection speeds. But since the purpose is to understand how network changes can affect the end-user experience, these varying speeds don’t affect the outcome of the tests.

Let’s now check how you can easily spot a tenant-wide issue, a problem caused by a bad network configuration, or a local IT issue.
We will also address ExpressRoute to understand its impact on end-user performance.

Detecting Tenant-Wide issues

Experiment

For this first test, we ran 10 Robot Users from the different locations (Azure, Bangalore, Boston, Nice, Philadelphia) with different configurations (with or without proxy, connecting through headquarters in Europe or not, etc.) and excluded the two Robot Users that were deliberately run behind problematic network equipment (flapping router and packet loss).

This gives us a view of how the different locations connect to Office 365 tenants across a variety of configurations.
Let’s look at end-user experience statistics and trends to see what we can find.

Results

Why checking "open mailbox" alone is not enough

img2-1.jpg

Here is the data from all of the Robot Users. The purpose is to see whether the connection trends are consistent.

On the top we see a line chart that compares the performance over time of every Robot User.
At the open mailbox level, apart from several spikes, the trend seems normal, with the best performance during weekends, when there is less activity.

On the bottom left, we have the list of all the Robot Users (active or not). The bottom middle shows a few statistics calculated to measure the severity of connection issues.

The square chart on the bottom right compares the average time taken by each Robot User to perform the selected action (in this case, open mailbox).
You will see this same dashboard structure for many of our tests.

Looking at the statistics, there is not a single moment when all Robot Users had issues at the same time. We can conclude that nothing impacted the service at the tenant level.

Another important element is that the standard deviation has about the same value as the median, which indicates a very stable and healthy spread of connections.
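
If you log raw per-action timings yourself, these two indicators are easy to compute. Here is a minimal sketch in Python; the sample values are hypothetical:

    import statistics

    def connection_health(samples_ms):
        """Summarize per-action latency the way the dashboard does:
        median for the typical experience, standard deviation for stability."""
        median = statistics.median(samples_ms)
        stdev = statistics.stdev(samples_ms)
        # Rule of thumb used in this article: a standard deviation at or
        # below the median suggests a stable, healthy spread of connections.
        return median, stdev, stdev <= median

    # Hypothetical "open mailbox" timings (ms) from one Robot User
    samples = [310, 295, 330, 402, 288, 315, 350, 299, 305, 320]
    median, stdev, stable = connection_health(samples)
    print(f"median={median:.0f} ms, stdev={stdev:.0f} ms, stable={stable}")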

The finding here is that if we just analyze an “open mailbox,” everything looks fine. However, we might see different results if we look at more actions.

So let’s dig into these other actions to see whether or not “open mailbox” is a sufficient measure to test tenant-level connections.

Why it is necessary to analyze real end-user actions

img3.jpg

The results here are a bit different.

We can see some specific peaks where every Robot User was impacted. Even the Robot User in Azure was affected at those moments.

This is strong evidence that, at these particular moments, the entire tenant was operating subpar.

These are the situations where the problem resides beyond the user and is legitimate enough to open a ticket with Microsoft.

The first test actions show that you cannot rely on just checking an open mailbox on your tenant, as it misses legitimate problems.

Each user action has its own characteristics and performance profile, and each of them can be a source of complaints, tickets, and eventually costs.

Let’s continue to dig into other actions.

Free / Busy lookups

img4.jpg

The Free / Busy lookup is a very important feature of Microsoft Outlook. Everybody uses it to schedule meetings and make sure every attendee is available. To perform this operation, Outlook retrieves the status, free or busy, of another person’s calendar for the selected time.

If the Free / Busy lookup is slow, scheduling a meeting with multiple attendees can become an arduous process for end-users. If we look at what happened between June 16 and 17, we can spot the issues.

The main question being: how do we know if it is the tenant rather than a single location?

Once again, you can see multiple robots experiencing latency at the same time with the same workload. And once again, even the Robot User running out of Microsoft Azure had issues performing the free / busy lookup.

It was a short time window, so the impact was not too dramatic, but the evidence is real.

We will later see how other network configurations can affect end-user performance for better or worse. But with 6 or 7 different sites, including Azure, in trouble at the same time, it is certainly time to open a ticket with Microsoft and ask for answers.

And this time you’ll be able to approach Microsoft with evidence: the exact dates and times of the connection issues.

GSX Solutions helps you better qualify the situation when submitting a ticket to Microsoft, which mitigates the cost of ticketing. It is better to be ready with the facts when you complain to your cloud provider.

We provide an optimal way to interact with Microsoft, because the questions raised in your tickets will be backed by solid reasons and solid facts.

We’ve seen how easy it is to spot a tenant-wide issue and watch it rise in front of us.

Now let’s focus on how network components and configuration can impact the end-user performance.


Network configuration impacts

The first important point to recognize in your Office 365 deployment is your entry point. The way you access your tenant, and where you access it from, is extremely important from a configuration standpoint.

You probably know that if you just ping Office 365 from whatever region you are in, you’ll see the latency of your specific local region. It is one test, but it clearly doesn’t help much when it comes to understanding what goes on from an end-user perspective.
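
A closer proxy for what an Outlook client actually pays on each new connection is the full TCP connect plus TLS handshake to the service endpoint, which you can time in a few lines. A minimal sketch (one sample; in practice you would repeat this and keep the distribution):

    import socket
    import ssl
    import time

    def handshake_ms(host, port=443):
        """Time a TCP connect plus TLS handshake to a service endpoint."""
        ctx = ssl.create_default_context()
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=10) as raw:
            # wrap_socket performs the TLS handshake before returning
            with ctx.wrap_socket(raw, server_hostname=host):
                pass
        return (time.perf_counter() - start) * 1000

    print(f"outlook.office365.com: {handshake_ms('outlook.office365.com'):.0f} ms")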

Let’s see how you can really detect issues within the configuration with real end-user experience metrics.

Configuring the best entry point

img5.jpg

Let’s come back to our create meeting test.

We can clearly see that the “Average of Access Time by RU Name” square comparison graph provides great insight into geo-location issues.

It becomes clear that the performance from Bangalore to the European tenant (green & blue) is not as good as from the USA to the European tenant (red & yellow). The best performer is the Robot User located in Europe accessing the European tenant (black).

For now, let’s focus on the Boston location.

img6.jpg

From Boston, we performed 2 types of tests using dedicated mailboxes. 

One Robot User is trying to access a European tenant, and the second one is accessing the US-based tenant.

We measured 450 ms of latency between Boston and the European tenant, compared to 390 ms when Boston reaches the US tenant.

So we pick up roughly 15% extra latency just because the Robot User entered the Microsoft network at a US entry point and then traversed it across to Europe, where that mailbox is located.

You should definitely consider these parameters when organizing your tenant and entry point worldwide. Limiting the distance between users and their mailbox is always a good idea to increase end-user performance.

Let’s now observe a typical scenario that happens all the time with enterprise-grade companies.

Connection through a headquarter

Experiments

Like most companies worldwide operate, we organized our tests to mirror offices in different locations. In our example, we performed two different tests from the same location: both in Philadelphia, and both connecting to a tenant in Europe.

So you can think of a company with global headquarters in Europe, where branch offices around the world connect to that location and then break out to the Internet from there.

img7.jpg

We saw this situation with a lot of the customers we assessed who had performance issues: haul the Internet traffic back to corporate, then send it out from corporate.

To show you the impact, we reproduced the situation with one Robot User connecting directly to the Internet in Philadelphia to reach the Microsoft network in the USA; its traffic then traveled across the ocean on Microsoft’s own network.

We had the second Robot User connect to headquarters first and then break out to the Internet to access the European tenant.

You can see on the dashboard that we also consider the hops, with the popular notion that the fewer hops there are, the better the connection is.

Here, the Robot User in Philadelphia that connects to the Microsoft endpoint in the USA and travels on the Microsoft network actually has 26 hops end to end.

The one that connects to headquarters in Europe and then breaks out to the Internet has 15 hops. This could lead us to think that the latter is faster than the direct Robot User (RU2) with its 26 hops.

But that was not necessarily the case.

Let’s take a look at the data.

Data

img8.jpg

In green we see Robot User 1, configured to haul its connection back to headquarters in Europe before breaking out to the Internet.

In black is Robot User 2, configured to access the Internet as soon as it can in Philadelphia.

The line graphs in the middle show the results, action by action. In the square graph, each square represents the average performance of one action for one Robot User; this type of graph makes visual comparison of Robot User performance easy.

At the bottom, two bar graphs show the number of times a given action exceeded the limit of acceptable performance.

We set the acceptable limit at three seconds for the free / busy lookup and the create meeting actions. This three-second figure came from observing customers and linking end-user performance data with opened tickets.

Working with customers and their environments, it appears that if users repeatedly have to wait three seconds for a simple free / busy lookup while creating a meeting, they lose patience and start to open tickets.
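
Once per-action timings are logged, turning them into a count of likely ticket generators is trivial. A sketch with hypothetical free / busy timings:

    # Hypothetical free / busy lookup timings (seconds) over a test window
    freebusy_s = [0.8, 1.1, 3.4, 0.9, 5.2, 1.0, 3.1, 0.7, 1.2, 4.8]

    THRESHOLD_S = 3.0  # patience limit observed for free / busy and create meeting

    breaches = sum(t > THRESHOLD_S for t in freebusy_s)
    print(f"{breaches} of {len(freebusy_s)} lookups exceeded {THRESHOLD_S}s "
          f"and are likely ticket generators")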

Data Analysis

First, we looked at the open mailbox. But it doesn’t matter as much, because complaints and support tickets come mostly from the actual actions that users perform, such as looking at free / busy statuses or creating meetings.

The free / busy lookup, again, really exposes performance issues to the end-user, because they can watch the waiting bars fill in each time they try to create a meeting; this is especially true with multiple attendees.

Going from Philadelphia through headquarters with the 15 hops gives an average of 1 second per free / busy lookup. Over two days, the free / busy lookup exceeded the acceptable end-user performance limit 16 times.

Going from Philadelphia directly to the Internet, using a Microsoft entry point in the USA, gives an average latency of about 0.6 seconds, roughly 40% faster! The performance fell outside our acceptable range only about 8 times.

This impacted the “create meeting” function as well, where we saw a difference of 50 to 60% in the end-user experience.

This data is really important when looking at what you can do to improve performance. This scenario shows how easy it is to improve end-user performance with a simple correction to the network configuration.

Microsoft will always have the better network, and choosing one network configuration over another can make a tremendous difference in the end-user experience.

Verifying our findings

In order to make sure that these results were not skewed by a bad local network at the headquarters, we added another Robot User that operates from headquarters in Europe and connects to the tenant.

img9.jpg

You can see the new results in RED:

It is clearly the fastest, which confirms that there is no issue with the headquarters network. But we also see that it is not that far off from the Robot User in Philadelphia connecting to the Internet in the USA (the headquarters Robot is just about 15% faster).

The only action where it makes a big difference is the open mailbox, but as we said, that is not the test that really counts in terms of user performance.

So the conclusion in this case is that going onto the Microsoft network as quickly as you can is the best thing you can possibly do.

Our main advice: break out to the Internet close to your users, use a tenant located near you geographically, and hand your packets off to Microsoft as quickly as you possibly can.

The entry point configuration is definitely important when it comes to Office 365 end-user performance, but it is not the only factor. DNS configuration can also heavily impact performance.

DNS configuration to avoid

We worked with a customer that had trouble accessing Office 365 from its European domain.

After comparing performance across multiple locations and acknowledging that the issue was only happening in Europe, we started to dig into their network configuration.

Where was the data going? Where was it leaving the network? How was it processed on the Microsoft backbone?

We started to realize that the issue was not the European access point, nor their local network, but that the data was crossing the US before entering the Microsoft backbone.

Specifically, it was going to the northwest of the United States.

We continued digging and found out that the customer had an internal DNS pointing to the Google DNS; you can imagine that, from Europe, this resolution path would not provide very good performance.

When European clients were routed through the US west coast to retrieve their email, their connection performance took a nosedive.
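
One way to check for this yourself is to compare which addresses different resolvers return for the same Office 365 hostname, since the service relies on location-aware DNS to pick your entry point. A minimal sketch using the dnspython package (the internal resolver address is a hypothetical placeholder):

    # pip install dnspython
    import dns.resolver

    HOST = "outlook.office365.com"
    RESOLVERS = {
        "internal corporate DNS (hypothetical)": "10.0.0.53",
        "Google public DNS": "8.8.8.8",
    }

    for label, ip in RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        try:
            answers = resolver.resolve(HOST, "A")
            print(label, "->", sorted(a.address for a in answers))
        except Exception as exc:
            print(label, "failed:", exc)

Tracerouting to the returned addresses then quickly shows whether your users are being sent to an entry point on the wrong continent.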

Experiment

In order to show how this degrades performance, we configured the same situation and checked the end-user experience with our Robot Users.

We configured the Robot User to force IP address resolution in the US, then send the traffic back to EMEA, from where it would be sent back to the US again before finally reaching Microsoft.

We added that Robot User to the previous headquarters entry point test to see how badly this would impact the results.

Let’s take a look at the data.

Findings

You can see the results here with the yellow Robot User.

img10.jpg

The average time for every action is much worse than for the others; the number of spikes that lead to potential tickets is a lot higher than for any of the three other Robot Users we configured earlier.

The more distance you put between your users and the Microsoft tenant, the more severe the performance issues you will experience.

This is an extreme case, but it shows how the network configuration can truly affect the end-user experience.

In the same way, a VPN can also impact end-user performance.

VPN configuration

Connecting with a VPN can greatly impact performance, depending on the protocol, its settings, and especially the gateway settings.

Once you’re connected to the VPN, even connecting from Europe to a US tenant can have a great impact, because your traffic actually bounces through a different location due to VPN masking.

Once again, the configuration of your entry point is what really matters here. You need to get onto the Microsoft network as fast as you can if you want to deliver the best end-user experience.

The case of the authenticated proxy

In the same way, here is another network configuration that we have faced multiple times with customers who had connectivity issues because of it.

Many of our customers had issues after installing a proxy to connect to Office 365.

What seemed like a good idea quickly turned into a source of additional costs.

Impact of the authenticated proxy on end-user experience

We reproduced the situation to provide data on this proxy-related issue.

End-User experience comparison with authenticated proxy

img11.jpg

The dashboard shows our Free / Busy tests run on multiple Robot Users. It is interesting to understand who is doing what here. Let’s focus on the square graph, which compares the time taken to complete the Free / Busy action.

The first square (grey) represents the results of the Robot User from Philadelphia connecting to the European tenant. The second one (black) is a Bangalore Robot User that had bandwidth issues and is connecting to the European tenant.

What is interesting here is the third one (bottom left, green). This is a Robot User sitting in Philadelphia connecting to a US tenant, but configured to redirect every outgoing connection to an authenticated proxy first.

The authenticated proxy server simply requires clients to authenticate before going to Office 365.

What is noticeable here is that this US Robot User connecting to a US tenant performs worse than the top red square representing a Robot User sitting in Bangalore connecting to the US tenant.

This is a very good example of how a simple piece of network equipment between your users and Office 365 can drastically degrade end-user performance.

Just adding simple equipment like a proxy can really affect the performance delivered to end-users, to the point where it costs a lot in terms of support and overall productivity.
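
The overhead is easy to measure yourself by timing the same request directly and through the proxy. A sketch using the requests library (the proxy address and credentials are placeholders):

    # pip install requests
    import time
    import requests

    URL = "https://outlook.office365.com"
    # Placeholder authenticated proxy; substitute your own
    PROXIES = {"https": "http://user:password@proxy.corp.example:8080"}

    def timed_get(url, proxies=None):
        start = time.perf_counter()
        requests.get(url, proxies=proxies, timeout=30)
        return (time.perf_counter() - start) * 1000

    print(f"direct:    {timed_get(URL):.0f} ms")
    print(f"via proxy: {timed_get(URL, PROXIES):.0f} ms")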

Conclusion on network configuration

After this set of tests, we concluded that several fundamentals are needed to manage the end-user experience properly.

First, you need to know your own network configuration and measure it constantly; you need those metrics to analyze the data and understand what is going on.

Second, you need multiple points of connectivity for comparison, meaning multiple Robot Users, in order to understand whether issues are affecting one location, multiple locations, or your entire tenant.

Then, you need to measure the end-user experience at the action level. We’ve clearly shown that normal behavior of an open mailbox means little, unlike other actions such as the free / busy lookup or the create meeting process.

Finally, you need to understand how the network configuration is executed, where the data comes from, and how it gets to Microsoft. The route of the data and the way equipment is configured can dramatically affect the end-user experience and drastically increase your support costs.

That is why it is critical to monitor this end-user experience before, during and after your network changes to get facts and results based on those network changes.

 

Troubleshooting local issues

Context

Spotting a tenant-wide issue or a configuration problem can sometimes be difficult.

Understanding how the performance of local equipment or the local network impacts the end-user experience can be even trickier.

Here are two examples describing situations we’ve noticed with customers.

Packet loss & flapping router

Two Robot Users were configured to experience network issues. The blue one has a consistent 30% packet loss but a healthy router.

The second Robot User has a router that drops the connection for one second out of every eight. This is what we call a “flapping router.”

On top of that, this router also delivers a persistent 8% packet loss.

Generally, when people see 30% packet loss, they get very worried. On the contrary, 8% packet loss is often considered a minor problem not to worry about.

Let’s take a look at the data we collected with these 2 Robot Users.

 

First results

img12.jpg
Not surprisingly, the blue Robot User from the Nice location performs slightly worse.

But if you think about it, going from 30% packet loss down to 8%, the gap should be much wider than it is.

So packet loss has an impact on performance, but not as dramatic as you would think.

 

 

Focus on the Robot User with 30% packet loss results

img13.jpg

The threshold after which performance is considered unacceptable was set at 5 seconds this time (if people have to wait 5 or more seconds for an availability call to come back, they are usually not happy).

We realized that the standard deviation for this data set is huge, and that within just 4 days these actions generated around 102 availability calls longer than 5 seconds, each one eligible for a support ticket. And that is for just one location!

So what are the results with just 8% packet loss, but with an unstable router?

The 8% packet loss, unstable router configuration

img14.jpg

The network team was, at first, not that worried about the 8% packet loss. But if you look at the results, this configuration generated even more tickets than the 30% packet loss one (139 compared to 102).

When the router drops connections like this, it sometimes does not look like there is a problem, especially from a packet loss perspective, but the results are clearly evident in terms of ticket generation.

And if you look at the trend, you can clearly see that it only gets worse. It is easy to identify here that something is terribly wrong with the end-user experience.

Gaining visibility into the end-user experience gives you a tremendous advantage when investigating local equipment.

Let’s continue with checking our second action: downloading an attachment.

Impact of local network issues on the download attachment process

img15.jpg

We tested the time it takes for a user to download a standard attachment from an email.

The first results here show how the normal Robot Users performed.

We can see a relatively stable trend and a standard deviation not too far from the median.

The “non-acceptable” end-user experience threshold was set at 10 seconds. Only seven potential tickets were generated over 6 days.

Below are the results of the two Robot Users with network issues.

img16.jpg

We can recognize the same patterns observed with the Free / Busy action. The difference is that the results are even worse here. Between them, the two Robots exceeded the threshold 743 times.

Looking Robot against Robot, the first Robot with 30% packet loss generated 325 tickets with an average latency of 10.38 seconds, while the second one, even with a better average latency (5.46 seconds), generated 418 tickets.

A flapping router really is the worst thing that can happen to a support team!

The interesting thing is that a network team would probably not even notice this problem right away, which is a nightmare for the help desk and messaging teams, as end-users are going to submit massive amounts of complaints.

Without a tool that monitors each action for each location, it is impossible to address this issue in a timely fashion.
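
Why would a link that dies only one second in eight beat 30% packet loss at generating tickets? A toy Monte Carlo model makes the mechanism plausible: random loss adds independent retransmission waits, while a dead link forces the client to rebuild its connection, and that cost lands on whole calls. All constants below are assumptions for illustration, not measurements from our tests:

    import random

    random.seed(1)
    RTT = 0.45        # healthy round trip (s); assumption for illustration
    ROUND_TRIPS = 4   # round trips per free / busy lookup; assumption
    RETRY = 1.0       # retransmission timeout after a lost packet; assumption
    RECONNECT = 4.0   # cost of rebuilding a dropped connection; assumption

    def lookup_with_loss(p):
        """Duration of one lookup when each packet is lost with probability p."""
        t = 0.0
        for _ in range(ROUND_TRIPS):
            while random.random() < p:
                t += RETRY        # lost packet: wait out the timeout, resend
            t += RTT
        return t

    def lookup_with_flapping(p=0.08, period=8.0, down=1.0):
        """Same, plus a router whose link dies for `down`s out of every `period`s."""
        t = lookup_with_loss(p)
        # chance that the outage window lands somewhere inside this call;
        # if it does, the dropped connection must be rebuilt from scratch
        if random.random() < min(1.0, (t + down) / period):
            t += RECONNECT
        return t

    for label, fn in (("30% steady loss", lambda: lookup_with_loss(0.30)),
                      ("8% loss + flapping router", lookup_with_flapping)):
        runs = [fn() for _ in range(10_000)]
        breaches = sum(t > 5.0 for t in runs)
        print(f"{label}: mean {sum(runs) / len(runs):.1f}s, "
              f"over the 5s threshold {breaches} times in 10,000 calls")

Under these assumptions, the flapping link crosses the five-second threshold far more often than the steady 30% loss does, despite its much lower loss rate, which is the same shape as the ticket counts above.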

We can validate, again, this general behavior with a last action: Create meeting, demonstrated in the graph below.

Impact on the create meeting

img17.jpg

When we focus on Boston and look at the standard deviation, for example, we see that it is below the median access time. We can see a nice pattern with good structure within the data.

It peaks during weekdays, which is not surprising.

But that standard deviation is completely skewed once we turn our network-impaired Robot User in Nice back on.

It is incredible to see how much worse it is.

Create meeting comparison

img18.jpg

We are talking about tremendous variance in the time the same action can take (up to 69 seconds of difference!) and about how unreliable the median and the mean become.

Problems on one or a few locations can badly affect your service delivery.

 

 

Conclusion on troubleshooting local issues

The problem with local issues is that they are, well, local. From a central point of view, it is very difficult to diagnose whether or not unhealthy network equipment is really causing issues. On top of that, without end-user experience metrics at the action level, it is almost impossible to understand their real impact on users. Packet loss is always something to monitor but, as we’ve just seen, its apparent extent can be a misleading guide to its real effect on connectivity.

Once again, only end-user-experience data can help you understand the situation.

Let’s now check whether we can improve performance for our users with ExpressRoute.

How about ExpressRoute?

Rumors around ExpressRoute

We’ve heard contradictory rumors about ExpressRoute. For example, it has been said that ExpressRoute makes mailbox moves significantly faster, or that ExpressRoute makes Exchange Online slower because it was designed to improve security.

And the last rumor, one people tend to believe pretty strongly, is that ExpressRoute will significantly improve Office 365 performance. So what really is ExpressRoute?

The reality of ExpressRoute

The shortest way to explain ExpressRoute is that it is a logical connection between a peering provider (which provides you with ExpressRoute) and a Microsoft peering location.

So the first thing to really understand is that ExpressRoute is not designed to be a performance solution and it is not designed to be a security solution either.

How do we test it?

There are three different ExpressRoute technologies. The one we studied is the one that connects to Office 365: it is called Microsoft peering.

We tested access to Exchange Online with and without ExpressRoute. For that, as usual, we tested access to the mailbox plus a few other actions.

Schema of our tests

 Image13.png

Results

There is no need to show any report here, because the results, as expected, did not provide any evidence of a performance benefit for end-users.

Depending on the time of day and the action, sometimes using ExpressRoute gave the Robot Users better performance and sometimes it did not. In other words, it had no consistent impact on the end-user experience for any of the actions we tested (open mailbox, create meeting, create task, resolve a user, free / busy lookup, download attachment).

And again, these were the results that we expected.

When you look at what ExpressRoute is, it is a connection to a designated Microsoft peering point. Many other parameters will impact the performance you get over ExpressRoute, such as the peering point location.

However, there is no guarantee that your entry point with ExpressRoute is better than the one you already have with your Internet provider. There is no guarantee that it is worse, either.

ExpressRoute gives you more predictable performance, which can mean predictably better or predictably worse; it is not designed to improve performance.

Conclusion on ExpressRoute

The first thing you need to evaluate if you’re going to use ExpressRoute to improve performance is distance. That is, how you’re going to get to the Microsoft network in the shortest amount of time.

You need to understand what city you’re in, what the ExpressRoute entry point in your region is, and where your ISP can get you onto the Microsoft network. Without these details, it is impossible to properly build the latency calculation you need to understand your connection.
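
A back-of-envelope floor for that calculation is easy to build: light travels through fiber at roughly 200 km per millisecond, so path length alone sets a minimum round-trip time no matter how clean the route is. A sketch with hypothetical path lengths:

    # Light propagates through fiber at roughly 200 km per millisecond,
    # so path length alone sets a hard floor on round-trip time.
    def min_rtt_ms(path_km):
        return 2 * path_km / 200.0

    # Hypothetical path lengths from one office to the same tenant
    paths_km = {
        "direct ISP handoff to a nearby Microsoft entry point": 6000,
        "detour via a distant ExpressRoute peering location": 7500,
    }
    for label, km in paths_km.items():
        print(f"{label}: at least {min_rtt_ms(km):.0f} ms RTT")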

The diagram below is a good way to understand ExpressRoute. You can have a nice, clean, easy path, but it can be significantly longer than the original one.

Image14.png

 

Tips to improve end-user experience

The biggest thing that we’ve seen is that, before considering ExpressRoute, you need to consider all the other options.

You need to investigate and invest in your network, look for the real root cause of your issues, and gather facts and metrics on the locations and actions of each connection.

Improving local bandwidth

If you want to improve end-user performance and you have money to spend, between bigger bandwidth and ExpressRoute, you should first look at how to improve your bandwidth.

Image15.png

Increasing the bandwidth locally is a much better decision. You will end up with a much faster direct connection instead of taking the longer path with ExpressRoute.

Calculating your port speed and knowing your network

If you know your message flow rate and you understand your use case, you can run a Microsoft tool to calculate the bandwidth you need. It will help you account for the peaks and busiest periods of your traffic.

You can take that data and compare it with what you have available on your current internet connection.

Calculating how much bandwidth you need for Office 365 is the easier part, as Microsoft provides you with this calculation. It is also important to understand the components of your network and the speed at which they operate.
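
The tool does the detailed work, but the shape of the arithmetic is simple. A back-of-envelope sketch, where every input is an assumption to replace with your own measured message flow:

    # Back-of-envelope busy-hour mail bandwidth estimate (all inputs assumed)
    users = 500
    messages_per_user_per_day = 120   # sent + received
    avg_message_kb = 75
    peak_factor = 4                   # busy hour vs. a flat 24-hour average

    daily_bytes = users * messages_per_user_per_day * avg_message_kb * 1024
    busy_hour_bytes = daily_bytes / 24 * peak_factor
    mbps = busy_hour_bytes * 8 / 3600 / 1_000_000
    print(f"Estimated busy-hour mail traffic: {mbps:.1f} Mbit/s")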

Where is your connection point? Can your ISP get you onto the Microsoft network faster? Plenty of factors will impact the end-user experience. And again, if you don’t measure it, you can’t work to improve it.


Conclusion

Multiple things can impact the Office 365 end-user experience. From tenant-wide issues to bad global network configuration to unnecessary equipment, everything can have an impact.

And we’ve seen that even deploying ExpressRoute, which sounds like it is designed for performance although it is not, will not guarantee better performance for your end-users.

On top of that, Microsoft identity servers can also have an impact, and Office 365 hybrid deployments (including Microsoft Exchange, SharePoint, or Skype for Business servers) add complexity that can affect the end-user experience too.

This is a topic we plan to cover in a future RoboTech content paper.

At the heart of any decent troubleshooting are metrics of what users really experience when they use Office 365 services.

This is what GSX Solutions provides you with our Robot Users.

This is how you stay ahead of every performance issue: by spotting it before the end-user decides to complain.

And this is how you can finally understand why end-users complain, whether it reflects a true issue or not, and what the root cause of that issue is.

We’ll continue this discussion through our webinars, blogs, and RoboTech content.

Join the GSX Solutions community to make the Cloud better… for its users.