Understanding how the performance of local equipment or the network can impact the end-user experience can be difficult.
Here are two examples describing situations we’ve experienced with customers, and how they can affect the overall Office 365 service delivery.
Packet loss & flapping router
Two Robot Users were configured to experience network issues. The blue one has a consistent 30% packet loss but a healthy router.
The second Robot User has a router that drops the connection for one out of every eight seconds. This is what we call a “flapping router.”
For more information about how the Robot Users are providing you with end-user experience metrics, please read this article >>
On top of that, this router also delivers a persistent 8% packet loss.
Generally, when people see 30% packet loss, they get very worried. By contrast, 8% packet loss is often considered a minor problem not worth worrying about (especially if you use Exchange Online only).
Let’s take a look at the data we collected with these two Robot Users.
The Robot Users tested multiple end-user scenarios. We can see here the results for the “Free/Busy” action.
Not surprisingly, the blue Robot User (30% packet loss, healthy router) from the Nice location is slightly slower. But if you think about it, going from 8% to 30% packet loss, the gap should be much larger than it is.
So we can see that packet loss has an impact on performance, but not as dramatic an impact as you would think for Exchange Online (results would be different for Skype for Business Online).
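One way to see why the slowdown is sub-linear is a back-of-the-envelope model (a sketch with assumed numbers, not how the Robot Users measure anything): if every lost packet costs one retransmission timeout, the retries follow a geometric distribution, so the expected penalty grows as p / (1 - p) rather than linearly with the loss rate p.

```python
def expected_latency_ms(loss, base_ms=200, timeout_ms=1000):
    """Expected request time when every lost packet costs one fixed
    retransmission timeout. Retries are geometric, so the expected
    number of retries is loss / (1 - loss)."""
    return base_ms + timeout_ms * loss / (1 - loss)

for p in (0.08, 0.30):
    print(f"{p:.0%} loss -> ~{expected_latency_ms(p):.0f} ms expected")
# Going from 8% to 30% loss (3.75x more loss) only roughly doubles
# the expected latency in this model, rather than multiplying it
# by 3.75.
```

The base latency and timeout values are invented for illustration; the shape of the curve, not the absolute numbers, is the point.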
To learn more about how to interpret the statistics we display in Power BI dashboards, please read this article:
How GSX displays and analyzes the Office 365 end-user experience >>
Focus on the Robot User with 30% packet loss results
The threshold after which performance is considered unacceptable was set at 5 seconds this time (if people have to wait 5 seconds or more for an availability call to come back, they are usually not happy).
We realized that the standard deviation for this data set is huge, and that within just 4 days these actions generated around 102 availability calls that took longer than 5 seconds, each of them eligible for a support ticket. And that is for just one location!
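The counting itself is straightforward. Here is a minimal sketch (with made-up sample timings, not our actual data) of how a set of response-time samples turns into a breach count and the dispersion statistics discussed above:

```python
import statistics

def ticket_stats(samples_s, threshold_s=5.0):
    """Summarize response-time samples: median, standard deviation,
    and how many measurements crossed the ticket-eligible threshold."""
    return {
        "median_s": statistics.median(samples_s),
        "stdev_s": round(statistics.stdev(samples_s), 2),
        "potential_tickets": sum(s >= threshold_s for s in samples_s),
    }

# Hypothetical Free/Busy timings in seconds, for illustration only.
samples = [1.2, 0.9, 6.5, 2.1, 7.8, 1.4, 5.0, 0.8]
print(ticket_stats(samples))  # 3 of these 8 calls breach the 5 s threshold
```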
So what are the results with just 8% packet loss, but with an unstable router?
We tested the time it takes for a user to download a standard attachment from an email.
The chart below shows how the normal Robot Users performed.
We can see a relatively stable trend and a standard deviation not too far from the median.
The “non-acceptable” end-user experience threshold was set at
Only seven potential tickets were generated over 6 days.
Below are the results of the two Robot Users with network issues.
We can recognize the same patterns observed with the Free/Busy action. The difference is that the results are even worse here: both Robots combined went out of threshold 743 times.
Comparing Robot to Robot, the first one, with 30% packet loss, generated 325 tickets with an average latency of 10.38 seconds, while the second one, despite a better average latency (5.46 seconds), generated 418 tickets.
A flapping router really is the worst thing that can happen to a support team!
The interesting thing is that a network team would probably not even notice this problem right away, which is a nightmare for the help desk and messaging teams, as end users are going to submit massive numbers of complaints.
Without a tool that monitors each action for each location, it is impossible to address this issue in a timely fashion.
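A toy simulation can make the contrast concrete. This is purely an illustrative model with assumed penalties (a 1-second retransmission timeout, and a 15-second recovery when a transfer happens to hit the outage window), not a reproduction of our measurements; its point is that a flapping router can keep the median looking healthy while generating a large tail of ticket-eligible requests.

```python
import random
import statistics

def toy_download_s(loss, flapping, base_s=2.0, timeout_s=1.0,
                   cycle_s=8.0, down_s=1.0, recovery_s=15.0):
    """Rough model: each lost packet adds a retransmission timeout;
    a flapping router is down 1 s out of every 8 s, and a transfer
    that lands in the outage pays an assumed recovery penalty."""
    t = base_s
    while random.random() < loss:              # lost packet -> retransmit
        t += timeout_s
    if flapping and random.uniform(0, cycle_s) < down_s:
        t += recovery_s                        # landed in the outage
    return t

random.seed(1)
for label, loss, flapping in (("30% loss, stable router", 0.30, False),
                              ("8% loss, flapping router", 0.08, True)):
    runs = [toy_download_s(loss, flapping) for _ in range(10_000)]
    tickets = sum(r >= 5.0 for r in runs)
    print(f"{label}: median {statistics.median(runs):.1f} s, "
          f"{tickets} runs over the 5 s threshold")
```

Even in this crude model, the flapping configuration produces several times more threshold breaches than the constant 30% loss, while its median stays low enough to look fine at a glance.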
We can validate this general behavior once more with a last action, Create Meeting, shown in the graph below.
Third set of results
Before looking at the results of the two “damaged” configurations, let’s see what a healthy one looks like.
When we focus on Boston and look at the standard deviation, for example, we see that it is below the median access time. We can see a nice pattern with good structure in the data.
Again if you want more information about the meaning of the median, variance, standard deviation, please read this article >>
We can also see that it peaks during weekdays, which is not surprising.
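The health signal described above can be reduced to a one-line rule of thumb. Here is a sketch with invented access times (the thresholding rule is ours for illustration, not a GSX feature):

```python
import statistics

def looks_healthy(access_times_s):
    """Rule of thumb from the text: a location looks healthy when the
    standard deviation of its access times stays below the median."""
    return statistics.stdev(access_times_s) < statistics.median(access_times_s)

# Invented access times in seconds, for illustration only.
stable_location   = [2.0, 2.2, 1.9, 2.4, 2.1, 2.3]
unstable_location = [2.0, 9.5, 1.8, 30.0, 2.2, 14.0]
print(looks_healthy(stable_location))    # stdev well below the median
print(looks_healthy(unstable_location))  # stdev dwarfs the median
```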
Now let’s add our other Robot Users on the dashboard.
Create meeting comparison between all Robot Users
Here the two Robot Users running on damaged configurations are Gray and Black.
If we turn back to our Robot User from Nice that has network issues, the standard deviation is completely skewed.
The difference is very clear.
We are talking about a tremendous variance in the time it can take to perform the same action (a difference of up to 69 seconds!) and about how unreliable the median and mean can become.
Problems on one or a few locations can badly affect your service delivery.
Conclusion on troubleshooting local issues
The problem with local issues is that they are, well, local. From a central point of view, it is very difficult to diagnose whether or not unhealthy network equipment is really causing issues. On top of that, without end-user experience metrics at the action level, it is almost impossible to understand their real impact on customers. Packet loss is always something to monitor but, as we have just seen, the relationship between the amount of packet loss and its actual impact on connectivity can be misleading.
Once again, only end-user-experience data can help you understand the situation.