Gartner's Recommendations for useful service dashboards
According to Gartner, a service dashboard needs to show the end-user experience, availability and performance conditions, to support root cause analysis, and to allow trending and planning based on past statistics while providing insight into what might happen in the future.
The purpose of a service delivery dashboard is to break down the silos that exist across IT departments by providing a common source of reliable statistics that display and explain the end-user performance and the health of the infrastructure components that affect it.
► Gartner’s name for the service delivery dashboard is the top-level business and end-user experience dashboard.
► Gartner’s name for the service incident dashboard is the triage dashboard, which is used to quickly analyze the main root causes of issues.
► They also mention a dependency-mapping dashboard that helps identify the hybrid-cloud components at stake. We will see that you can do that with a platform-oriented dashboard.
As we can see, Gartner’s service management dashboards are completely in line with Microsoft’s recommendations and ITSM best practices when it comes to Cloud Service Delivery Management.
Let’s now see an example of how GSX for Office 365 enables you to manage your Office 365 service delivery.
Using GSX Solutions to build an Office 365 Service Management practice
As a first example, we will focus on the Exchange Online service.
Here is a summary of what we will focus on.
So let’s look first at our real-time top-level dashboards, which are displayed in our Gizmo real-time UI.
What we see here are 3 Robot Users operating from three different locations (Boston, Pennsylvania and Azure), all performing the same monitoring of Exchange Online.
Here you can see that the Boston location is not in its best state, but thanks to our Robot User we were alerted even before users started to realize that something was not working properly.
Let’s take a deeper look. We can see that in the same time frame, neither the Robot User in Azure nor the one in Philadelphia had any issue.
And if we want to correlate the data with the Service Level Dashboard, we can also see that there was no issue from an Office 365 perspective at that time.
From the data that we gather, it is safe to say that Exchange Online is meeting its SLA. Let’s go deeper into the Boston location to see what is going on.
Here you can see that Boston is clearly having an issue with the network and the end-user experience.
Going deeper in the network statistics:
We can see that there is clearly excessive round-trip time and packet loss from this location. It is now the perfect moment to contact the network team in Boston with this information.
With our GSX Robot User, you can easily get real-time, unbiased Exchange Online performance data from your most critical locations.
We saw that Boston’s user experience was clearly subpar compared to the overall Exchange user experience and that it clearly was not a tenant issue or a problem in multiple locations, but something specific to the Boston network environment.
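The reasoning we just followed (one location degraded, the other locations and the tenant healthy) can be written down as a small rule. The sketch below is only an illustration of that logic: the function name, the labels it returns, and the input format are our own assumptions, not part of the GSX product.

```python
def classify_scope(location_healthy, tenant_healthy):
    """Return a rough label for where an issue most likely sits.

    location_healthy: dict mapping a location name to True (healthy) or False.
    tenant_healthy:   True if the Office 365 service itself reports no issue.
    Both inputs are hypothetical; feed them from whatever monitoring you use.
    """
    degraded = sorted(loc for loc, ok in location_healthy.items() if not ok)
    if not degraded:
        return "no issue detected"
    if not tenant_healthy or len(degraded) == len(location_healthy):
        return "possible tenant-wide issue"
    if len(degraded) == 1:
        return f"local issue at {degraded[0]}"
    return "multi-location issue: " + ", ".join(degraded)


status = {"Boston": False, "Philadelphia": True, "Azure": True}
print(classify_scope(status, tenant_healthy=True))  # local issue at Boston
```

A rule like this only narrows down where to look first. The network statistics for the degraded location are still needed to confirm the cause.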
The second dashboard we mentioned earlier is the triage dashboard.
Let’s take a look at one of our real-time triage dashboards.
Our Office 365 real-time dashboard can contain everything that can impact the Office 365 end-user experience. For example, here: ADFS Proxy, ADFS, Azure AD Connect, mail routing, ActiveSync, SharePoint Online, etc.
If we take a look at Azure AD Connect, for example, we can see that the synchronization service is clearly down, preventing any sync from taking place. So again, you can quickly contact the identity management team to resolve the issue without contacting Microsoft.
As you can see, these dashboards allow you to be proactive instead of reactive because you can be alerted on these issues before end-users even realize what is going on.
We have seen how real-time top-level dashboards and triage dashboards can help you understand what is going on and fix issues without involving Microsoft, before your users are impacted.
Now let’s examine the service level delivery of Office 365 at the location level over time, and how to see your achievements in terms of Service Level Targets for your users.
For that we will use PowerBI.
► If you want more information on how to read PowerBI dashboards, please read this article >>
You can see here an example of our top-level dashboard for Exchange Online services delivered to 3 different locations.
You can see several gauges at the top that show the % of achievement of the Service Level Target we defined.
► For more information about how to define Service Level Target, please read the corresponding Robotech article >>
To sum up, each action that a user can perform with Exchange can be considered a service that you provide. That service is based on a hybrid infrastructure encompassing Office 365, your ISP, your network and any server and application you maintain that can impact the end-user experience.
The purpose of an SLT is to measure the % of achievement of service quality for your users. So, you have to decide what the “happiness threshold” is for each action.
What performance should you deliver to your users for them to consider the service healthy? For example, 200ms to open a mailbox, 500ms to download an attachment, etc.
► Once you’ve defined that, you want to know how often you deliver a good service.
► And that means calculating the % of time you deliver the service within the threshold. For example, you want to make sure that 98% of the time, the Exchange Online feature can be used by your users with the performance that you have defined.
And that is your Service Level Target.
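As a minimal sketch of that calculation, assume each Robot User measurement is a response time in milliseconds, with None standing for a failed action. The function name and the sample data are hypothetical and only illustrate the idea; this is not how GSX computes it internally.

```python
def slt_achievement(response_times_ms, threshold_ms):
    """Percentage of measurements that completed within the threshold.

    A failed measurement is recorded as None and counts as a miss.
    """
    if not response_times_ms:
        raise ValueError("no measurements")
    good = sum(
        1 for t in response_times_ms if t is not None and t <= threshold_ms
    )
    return 100.0 * good / len(response_times_ms)


# Hypothetical "open mailbox" measurements against a 200ms happiness threshold.
samples = [150, 180, 210, 190, None, 170, 160, 450, 175, 185]
achieved = slt_achievement(samples, threshold_ms=200)
print(f"{achieved:.1f}% achieved, target met: {achieved >= 98.0}")
```

Here 7 of the 10 measurements are within the threshold, so the achievement is 70%, well short of a 98% target.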
You can see at the top right of the dashboard that the SLT here is 98%, and you can see what it means in terms of minutes.
► Basically, a 98% SLT allows the service to be down or degraded for about 29 minutes per day, which adds up to roughly 14.4 hours over a 30-day month.
► It allows you to communicate on something real, something that your users understand, and it gives a very good sense of the quality of the service that you provide to your locations and to your users.
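The conversion from a percentage to minutes depends on the measurement window, so it is worth checking the arithmetic. The sketch below assumes round-the-clock measurement; the function name and the window lengths are our own choices for illustration. Note that the 29-minute figure corresponds to a 24-hour window.

```python
def allowed_minutes(slt_percent, window_minutes):
    """Minutes the service may be down or degraded while still meeting the SLT."""
    return window_minutes * (100.0 - slt_percent) / 100.0


MINUTES_PER_DAY = 24 * 60

per_day = allowed_minutes(98.0, MINUTES_PER_DAY)
per_30_days = allowed_minutes(98.0, 30 * MINUTES_PER_DAY)

print(f"98% SLT: {per_day:.1f} minutes per day")
print(f"98% SLT: {per_30_days:.0f} minutes per 30-day month")
```

If you only measure during business hours, pass the number of business minutes in the period as the window instead.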
The top two gauges are a consolidation of all the services / actions for all the locations. The top left represents their pure availability, while the top right shows the overall achievement of the Service Level Target.
So right here, it looks good. But good overall performance does not mean that every location performs equally well. And that is why it is important to check what is really happening location by location.
As we can see on the “top critical locations” chart, Boston seems to have far more problems than the 2 other ones.
So let’s take a look at the Boston statistics alone to get a better idea of what is going on there.
Now that we have isolated Boston from the other locations, you can see that Boston is not necessarily meeting our service level targets.
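To illustrate how a consolidated figure can hide a struggling location, here is a small sketch with made-up numbers (1,000 measurements per location). The function name and the data are hypothetical.

```python
from collections import defaultdict


def achievement_by_location(samples):
    """samples: iterable of (location, within_threshold) pairs.

    Returns the % of measurements within the threshold for each location.
    """
    totals = defaultdict(lambda: [0, 0])  # location -> [good, total]
    for location, ok in samples:
        totals[location][0] += int(ok)
        totals[location][1] += 1
    return {loc: 100.0 * good / total for loc, (good, total) in totals.items()}


# Made-up data: two healthy locations and one that misses 5% of the time.
samples = (
    [("Azure", True)] * 1000
    + [("Philadelphia", True)] * 1000
    + [("Boston", True)] * 950
    + [("Boston", False)] * 50
)

overall = 100.0 * sum(ok for _, ok in samples) / len(samples)
per_location = achievement_by_location(samples)
below_target = [loc for loc, pct in per_location.items() if pct < 98.0]

print(f"overall: {overall:.2f}%")      # above the 98% target
print(f"below target: {below_target}")  # only Boston misses it
```

With these numbers the overall achievement is about 98.3%, so the consolidated gauge looks fine even though Boston sits at 95%.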
You can see that several actions, corresponding to services that you deliver to your users, experienced more issues than they should. Free/busy, Search through mailbox, and downloading attachments do not provide the desired quality of service in Boston.
So now we want to know what happened and try to quickly understand who/what is responsible for these issues.
That is why we are going to take a look at our PowerBI triage dashboard.
We are here focusing again on the Boston statistics in order to quickly understand how to improve the situation in Boston.
As you can see, the % network performance uptime shows that almost 50% of the time the network between Boston and Office 365 is below our performance threshold.
You can see below that this location often has excessive round-trip time and packet loss, as well as high DNS resolution times.
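One way to read a "% network performance uptime" figure is as the share of samples in which every network metric stays within its threshold. The sketch below assumes exactly that; the class, the threshold values and the sample data are all hypothetical and are not the thresholds used in the dashboard.

```python
from dataclasses import dataclass


@dataclass
class NetworkSample:
    rtt_ms: float
    packet_loss_pct: float
    dns_ms: float


# Hypothetical thresholds; set these to whatever your own targets are.
MAX_RTT_MS = 100.0
MAX_PACKET_LOSS_PCT = 1.0
MAX_DNS_MS = 50.0


def within_thresholds(sample):
    """True only if all three metrics are within their thresholds."""
    return (
        sample.rtt_ms <= MAX_RTT_MS
        and sample.packet_loss_pct <= MAX_PACKET_LOSS_PCT
        and sample.dns_ms <= MAX_DNS_MS
    )


def network_performance_uptime(samples):
    """Percentage of samples in which the network met every threshold."""
    if not samples:
        raise ValueError("no samples")
    return 100.0 * sum(within_thresholds(s) for s in samples) / len(samples)


boston = [
    NetworkSample(rtt_ms=40, packet_loss_pct=0.0, dns_ms=20),
    NetworkSample(rtt_ms=250, packet_loss_pct=0.5, dns_ms=25),  # RTT too high
    NetworkSample(rtt_ms=60, packet_loss_pct=4.0, dns_ms=30),   # packet loss
    NetworkSample(rtt_ms=55, packet_loss_pct=0.2, dns_ms=22),
]
uptime = network_performance_uptime(boston)
print(f"network performance uptime: {uptime:.0f}%")
```

Requiring all metrics to pass at once is a deliberately strict choice. You could also track each metric separately to see which one is responsible for most of the misses.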
Right here already, you clearly have enough information to ask your Boston network team to investigate the issue and fix it.
As problems rarely come alone, we can also see here that the ADFS Proxy there experienced some problems.
We also see below that the federation request time rose dramatically, which of course impacted the end-user experience.
Again, you can directly contact the ADFS team to have the problem fixed.
But you can also provide more information by going into the platform level dashboard of ADFS Proxy.
Here we can see that our federation request performance was 50% below our defined performance threshold.
But we can also see in the graph below that the ADFS Proxy server experienced excessively high CPU, RAM and disk time.
So again, instead of contacting Microsoft because of performance issues that you don’t understand, you now have more than enough information to check with the identity management team so they can resolve the ADFS Proxy issues that we see coming out of Boston.
To sum up that part, we can see here how the triage dashboard can break down the silos between your IT departments, avoiding the blame game and going straight to the root cause of the issue.
And finally, we have seen with our top-level dashboard that even if the performance of the services looks good overall, it is important to track it at the location level as well.