Packet loss & flapping router consequences on Office 365 performance



Understanding how the performance of local equipment or the network can impact the end-user experience can be difficult. 

Here are two examples of situations we've experienced with customers and how they can affect overall Office 365 service delivery. 

 

Packet loss & flapping router 

Two Robot Users were configured to experience network issues. The blue one has a consistent 30% packet loss but a healthy router. 

The second Robot User has a router that drops its connection for one second out of every eight. This is what we call a "flapping router." 

For more information about how the Robot Users provide you with end-user experience metrics, please read this article >>  

On top of that, this router also delivers a persistent 8% packet loss. 

Generally, when people see 30% packet loss, they get very worried. In contrast, 8% packet loss is often considered a minor problem not worth worrying about (especially if you use Exchange Online only). 
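To make the two damaged profiles concrete, here is a minimal Python sketch of how they degrade a single request. This is a toy model with assumed values (the round-trip time, the retransmission timeout, and the packet count per request are all illustrative), not the actual Robot User implementation.

```python
import random

# Toy model of the two damaged network profiles (illustrative only,
# not the actual Robot User implementation). Assumed values:
BASE_RTT = 0.05  # healthy round trip per packet, in seconds (assumption)
RTO = 1.0        # cost of one retransmission timeout, in seconds (assumption)

def request_latency(packets, loss_rate, flapping=False):
    """Estimate the latency of one request that needs `packets` round trips."""
    latency = 0.0
    for _ in range(packets):
        latency += BASE_RTT
        # Every lost packet costs a retransmission timeout, and the
        # retransmission itself can be lost again.
        while random.random() < loss_rate:
            latency += RTO
    # A flapping router is down one second out of every eight, so a
    # request has roughly a 1-in-8 chance of also waiting out an outage.
    if flapping and random.random() < 1 / 8:
        latency += 1.0
    return latency

steady_loss = [request_latency(20, 0.30) for _ in range(1000)]
flapping_router = [request_latency(20, 0.08, flapping=True) for _ in range(1000)]

print(f"30% loss, healthy router: mean {sum(steady_loss) / 1000:.2f}s")
print(f"8% loss, flapping router: mean {sum(flapping_router) / 1000:.2f}s")
```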

Let's take a look at the data we collected with these two Robot Users. 

 

First results 

[Figure: Free/Busy action results for the two Robot Users (img12.jpg)]

The Robot Users tested multiple end-user scenarios. Here we can see the results for the "Free/Busy" action.  

Not surprisingly, the blue Robot User (30% packet loss, healthy router) from the Nice location is slightly slower. But if you think about it, the gap between 8% and 30% packet loss should be much wider than it is. 

Therefore, we can see that packet loss has an impact on performance, but not as dramatic as you would think for Exchange Online (the results would be different for Skype for Business Online). 

To learn more about how to interpret the statistics we display in Power BI dashboards, please read this article:  
How GSX displays and analyzes the Office 365 end-user experience >>

 

Focus on the results of the Robot User with 30% packet loss 

[Figure: Free/Busy results for the Robot User with 30% packet loss (img13.jpg)]

The threshold after which the performance is considered unacceptable was set at 5 seconds this time. (If people have to wait 5 seconds or more for an availability call to come back, they are usually not happy.) 

We realized that the standard deviation for this data set is huge and that, within just 4 days, these actions generated around 102 availability calls that took longer than 5 seconds and were eligible for support tickets! And that is for just one location! 
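As a rough illustration of how such a count works, here is a short Python sketch that flags every latency sample above the 5-second threshold and reports the median and standard deviation alongside it. The sample values are hypothetical placeholders, not the measured data.

```python
import statistics

THRESHOLD = 5.0  # "unacceptable" Free/Busy threshold from this test, in seconds

# Hypothetical latency samples in seconds; substitute your own measurements.
samples = [2.1, 3.4, 7.8, 1.9, 12.3, 4.4, 6.1, 2.7, 9.5, 3.0]

median = statistics.median(samples)
stdev = statistics.stdev(samples)
potential_tickets = sum(1 for s in samples if s >= THRESHOLD)

print(f"median={median:.2f}s  stdev={stdev:.2f}s  "
      f"potential tickets={potential_tickets}")

# Red flag: a standard deviation in the same range as the median means the
# experience is wildly inconsistent even when the typical call looks fine.
```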

So what are the results with just 8% packet loss but an unstable router? 

 
 
 
 
The 8% packet loss, unstable router configuration  
 
[Figure: Free/Busy results for the Robot User with the flapping router (img14.jpg)]

The network team was, at first, not that worried about the 8% packet loss. But if you look at the results, this configuration generated even more tickets (139 compared to 102). 

When the router drops connections like this, it sometimes does not look like there is a problem, especially from a packet loss perspective, but the impact is clearly visible in terms of ticket generation. 

And if you look at the trend, you can clearly see that it only gets worse. It is easy to identify here that something is terribly wrong with the end-user experience. 

Looking into local equipment gives a tremendous advantage: visibility into the end-user experience. 

Let's continue by checking our second action: downloading an attachment. 

 

Second set of results 

We tested the time it takes for a user to download a standard attachment from an email. 

The chart below shows how the normal Robot Users performed. 

[Figure: Attachment download times for the healthy Robot Users (img15.jpg)]

We can see a relatively stable trend and a standard deviation not too far from the median. 

The "non-acceptable" end-user experience threshold was set at 10 seconds. Only seven potential tickets were generated over 6 days. 

 

Below are the results of the two Robot Users with network issues. 

[Figure: Attachment download times for the two Robot Users with network issues (img16.jpg)]

We can recognize the same patterns observed with the Free/Busy action. The difference is that the results are even worse here: between them, the two Robots went over the threshold 743 times. 

When you compare Robot to Robot, the first Robot (30% packet loss) generated 325 tickets with an average latency of 10.38 seconds, while the second one, even with a better average latency (5.46 seconds), generated 418 tickets. 

A flapping router really is the worst thing that can happen to a support team! 
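This counterintuitive result (a lower average, but more tickets) is purely a variance effect. Here is a small Python sketch showing how a spiky latency profile can cross a 10-second threshold more often than a consistently slow one. The two distributions are made up for illustration, not fitted to the measured data.

```python
import random

random.seed(7)
THRESHOLD = 10.0  # attachment download threshold from this test, in seconds

# Made-up distributions (illustrative only, not the measured data):
# the steady-loss Robot is slow but consistent, while the flapping Robot
# is fast on average with occasional outage-driven spikes.
steady = [random.gauss(8.5, 1.2) for _ in range(1000)]
flapping = [random.gauss(3.0, 0.8) + (15.0 if random.random() < 0.2 else 0.0)
            for _ in range(1000)]

for name, data in (("steady 30% loss", steady), ("flapping router", flapping)):
    mean = sum(data) / len(data)
    tickets = sum(1 for s in data if s >= THRESHOLD)
    print(f"{name}: mean={mean:.2f}s  potential tickets={tickets}/1000")
```

Run it and the flapping profile reports the lower mean but the higher ticket count, which is exactly the pattern above.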

The interesting thing is that a network team would probably not even have noticed this problem right away, which is a nightmare for the help desk and messaging teams, as end-users are going to submit massive amounts of complaints. 

Without a tool that monitors each action for each location, it is impossible to address this issue in a timely fashion. 

We can validate this general behavior once more with a last action, Create meeting, demonstrated in the graph below. 

 

Third set of results 

[Figure: Create meeting results for a healthy configuration (img17.jpg)]

Before looking at the results of the two "damaged" configurations, let's see what a healthy one looks like.  

When we focus on Boston and look at the standard deviation, for example, we see that it is below the median access time. We can see a nice pattern with good structure in the data. 

Again, if you want more information about the meaning of the median, variance, and standard deviation, please read this article >> 

We can see it peaks during weekdays, which is not surprising. 

Now let’s add our other Robot Users on the dashboard. 

Create meeting comparison between all Robot Users 

[Figure: Create meeting comparison between all Robot Users (img18.jpg)]

Here, the two Robot Users running on damaged configurations are the gray and black ones.  

If we turn back to our Robot User from Nice with network issues, the standard deviation is completely out of proportion. 

The difference is very clear. 

We are talking about tremendous variance in the time it can take to do the same action (up to a 69-second difference!) and about how unreliable the median and mean can be. 
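To see why those summary statistics become unreliable, consider this short Python sketch with hypothetical access times, in which a few outage-driven outliers leave the median looking healthy while the mean and standard deviation explode:

```python
import statistics

# Hypothetical Create meeting access times in seconds (illustrative only):
# the flapping profile is mostly fast, with occasional outage-driven spikes.
healthy = [3.1, 2.8, 3.3, 3.0, 2.9, 3.2, 3.1, 3.0]
flapping = [3.1, 2.8, 3.3, 72.0, 2.9, 3.2, 61.5, 3.0]  # spikes up to ~69s apart

for name, data in (("healthy", healthy), ("flapping", flapping)):
    print(f"{name}: median={statistics.median(data):.1f}s  "
          f"mean={statistics.mean(data):.1f}s  "
          f"stdev={statistics.stdev(data):.1f}s")

# The median alone would hide the problem, and the mean alone would
# exaggerate the typical case; you need the full distribution per action.
```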

Problems in one or a few locations can badly affect your service delivery. 

 

Conclusion on troubleshooting local issues 

The problem with local issues is that they are, well, local. This means that from a central point of view, it is very difficult to diagnose whether or not unhealthy network equipment is really causing an issue. On top of that, without end-user experience metrics at the action level, it is almost impossible to understand the real impact on customers. Packet loss is always something to monitor but, as we've just seen, the raw packet loss percentage can be a misleading indicator of its real impact on connectivity. 

Once again, only end-user-experience data can help you understand the situation.