IBM Connections provides applications that ease collaboration between end-users. That makes it important for productivity and for essential business processes, so you need to be sure it is functioning as it should.
When we talk about verifying application performance and uptime, we’re ultimately concerned with the quality of the end-user experience. If the end-users can’t use the application as intended or experience major latencies, the flow of productivity comes to a halt. This is why it’s important to monitor and measure performance where it matters — at the end-user level.
Obviously, the first key question you would ask at the end-user level is whether or not it is possible to connect to the application. But in the case of IBM Connections, ensuring quality of service becomes more complex. In order to monitor the entire infrastructure, you’ll need to cover the entire breadth of the network between your users and the servers. You’ll need to verify the health of the WebSphere nodes and JVMs, and that your databases and database servers are up and running. Staying on top of all of these components will enable you to be proactive about your IT environment, and work to avoid major incidents before they even happen. This is absolutely essential to enhancing the end-user experience, and ensuring user satisfaction.
At GSX Solutions, we make the comprehensive, real-time monitoring of critical IT components our number one priority. We’ve worked with hundreds of companies to provide invaluable insights into the end-user experience so that they can work on providing a higher quality of service to their employees.
Let’s take a look at three specific use cases from our customers, and at how each had to be resolved before the customer implemented our real-time monitoring solutions.
Use Case: The wiki isn’t available
The first example is one that can happen quite frequently — a crucial application isn’t working. In this case, it happens to be the wiki. A user reports the issue to the help desk, and it is escalated to an IT administrator for investigation.
What does the admin do? First, he or she isolates the issue to understand if it’s affecting all users or just the one by attempting to access the wiki. It’s definitely down. So then the admin checks the network to see if a DNS issue is the root cause, but everything looks fine. The admin moves on to check the WebSphere server, deployment manager, WebSphere node, and JVM running the wiki application — but all of these look fine.
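The first triage steps described above (can the wiki be reached, and is DNS answering quickly?) can be scripted. What follows is a minimal Python sketch under stated assumptions, not a description of how any particular monitoring product works: the function names and the two-second DNS threshold are hypothetical and would need tuning for a real environment.

```python
import socket
import time
import urllib.error
import urllib.request


def time_dns_lookup(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname and return (succeeded, seconds_taken)."""
    start = time.perf_counter()
    try:
        resolver(hostname, 443)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return ok, time.perf_counter() - start


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # nothing answered: network, DNS or server problem


def triage(dns_ok, dns_seconds, status, slow_dns_threshold=2.0):
    """Turn the raw measurements into a first diagnosis (threshold is hypothetical)."""
    if not dns_ok:
        return "DNS resolution failed"
    if dns_seconds > slow_dns_threshold:
        return "DNS resolution is slow"
    if status is None:
        return "application unreachable"
    if status >= 500:
        return "application reachable but failing"
    return "application reachable"
```

In the wiki scenario, a healthy DNS lookup combined with a missing or failing HTTP response would point the investigation away from the network and toward the WebSphere and JVM layers, which is the path the admin follows next.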
After several more tests and manual scripts, the admin realizes that the heap size for the wiki has reached a critical point. The heap is the memory that Java uses for a particular application, and the system increases its size when the amount of memory isn’t sufficient. Problems can arise when too many nodes are increasing their heap size.
In this case, the application has run out of memory and failed. This means that other applications are on the verge of failing, unless steps are taken to rectify the situation. So the admin looks to pinpoint the cause of the growing heap size, and finds that the garbage collector has stopped working properly. The garbage collector is responsible for reclaiming memory that is no longer needed, and for managing the memory used by all applications. Its malfunction allowed the heap to keep growing until the application ran out of memory.
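The two warning signs in this story, a heap close to its limit and a garbage collector that no longer frees memory, lend themselves to a simple automated check. The Python sketch below assumes the heap figures have already been collected from the JVM by some other means. The `HeapSample` class, its field names and the 90 percent threshold are hypothetical illustrations, not part of any WebSphere or GSX interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HeapSample:
    used_mb: float  # heap in use right after a garbage collection cycle
    max_mb: float   # maximum heap size available to the JVM


def heap_utilisation(sample: HeapSample) -> float:
    """Percentage of the maximum heap that is in use."""
    return 100.0 * sample.used_mb / sample.max_mb


def post_gc_heap_keeps_growing(samples: List[HeapSample]) -> bool:
    """True if every sample uses more heap than the one before it.

    A healthy collector brings usage back down from time to time, so a
    strictly rising series suggests memory is not being reclaimed.
    """
    if len(samples) < 2:
        return False
    return all(later.used_mb > earlier.used_mb
               for earlier, later in zip(samples, samples[1:]))


def heap_alerts(samples: List[HeapSample], critical_percent: float = 90.0) -> List[str]:
    """Return human-readable alerts for a series of heap samples."""
    alerts = []
    if samples and heap_utilisation(samples[-1]) >= critical_percent:
        alerts.append("heap near maximum")
    if post_gc_heap_keeps_growing(samples):
        alerts.append("garbage collection is not reclaiming memory")
    return alerts
```

Run against a regular series of samples, a check like this would raise both alerts well before the application fails, rather than after a user calls the help desk.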
Without real-time monitoring, these are the kinds of manual steps an IT administrator would need to perform to locate the source of a problem, and they might have taken hours. Unable to identify the root cause quickly and definitively, this IT administrator faced massive amounts of downtime, since all of the applications rely on the heap and the garbage collector to function properly.
Use Case: My profile page is slow
A user complains that his profile page is incredibly slow to load. Soon the help desk receives similar complaints from other users, and the problem becomes widespread. Once again, the admin first checks the network for any latencies, internal or external, as well as the DNS resolution time. He or she checks the WebSphere nodes, JVMs, and all of the memory statistics. This time, the garbage collector is running smoothly.
So the admin goes deeper to analyze the health of the DB2 database, and attempts an ODBC connection. Bingo: the connection is very slow. Now the question is why. After running multiple manual scripts to collect the statistics, the admin realizes that the I/O wait time and lock times are abnormally high, and that the buffer pool hit ratio (BP Hit) is really low by IBM standards. These trends indicate that the DB2 server cache is in trouble: the BP Hit shows how often the cache is used instead of disk access, and represents a total hit ratio for the local buffer pool. When the cache has an issue, the DB2 servers constantly go to disk, and this slows down everything.
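The buffer pool hit ratio is commonly computed as the share of logical page reads that did not require a physical read from disk. The Python sketch below assumes the four read counters have already been pulled from DB2's monitoring data. The parameter names follow the DB2 monitor elements `pool_data_l_reads`, `pool_data_p_reads`, `pool_index_l_reads` and `pool_index_p_reads`, while the function itself is only an illustration.

```python
def buffer_pool_hit_ratio(pool_data_l_reads, pool_data_p_reads,
                          pool_index_l_reads, pool_index_p_reads):
    """Return the buffer pool hit ratio as a percentage.

    Logical reads are page requests made to the buffer pool; physical reads
    are the requests that had to go to disk. Returns None when there has been
    no read activity, because the ratio is undefined in that case.
    """
    logical = pool_data_l_reads + pool_index_l_reads
    physical = pool_data_p_reads + pool_index_p_reads
    if logical == 0:
        return None
    return 100.0 * (logical - physical) / logical


# Sample counters (made up for illustration): 10,000 logical reads,
# of which 1,000 had to be served from disk.
ratio = buffer_pool_hit_ratio(9000, 500, 1000, 500)
print(f"Buffer pool hit ratio: {ratio:.1f}%")
```

A ratio that falls over time means a growing share of requests is being served from disk instead of memory, which matches the slow connection the admin observed.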
Once more, you can see what a lengthy process it was to check every single component to locate the source of this issue. Real-time monitoring of these critical statistics would have alerted the admin as soon as any abnormality existed, even before users began reporting issues to the help desk.
Use Case: My business applications aren’t working
In our third scenario, a member of the sales staff calls support. As these things go, he is in the midst of his biggest sale of the year, and his pricing tool (a Portal application) stops working. It seems that several applications, all hosted on IBM Portal, are experiencing serious issues. The volume of help desk calls has risen dramatically, so the admin must investigate this as a high-priority issue.
After checking the usual first components, the admin checks the Portal server, which has crashed. The admin performs an emergency restart of the system, and tries to understand why this happened. The WebSphere nodes all look fine. Upon checking the DB2 databases and servers using multiple, tedious scripts, the admin discovers that table space usage has reached a dangerous level. Essentially, too many applications are sharing the table space in entities that are too small, which forced the IBM Portal server to run out of memory and crash. The admin rectifies this by reorganizing the table space to prevent future incidents.
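A table space that is filling up can be caught early by comparing used pages with usable pages for each table space. The Python sketch below assumes those page counts have already been gathered from DB2's table space monitoring data. The function names, the sample table space names and the 85 percent warning threshold are hypothetical.

```python
def tablespace_utilisation(used_pages, usable_pages):
    """Percentage of a table space's usable pages that are in use."""
    if usable_pages <= 0:
        raise ValueError("usable_pages must be positive")
    return 100.0 * used_pages / usable_pages


def tablespaces_over_threshold(page_counts, warning_percent=85.0):
    """Return (name, percent) pairs at or above the threshold, fullest first.

    page_counts maps a table space name to a (used_pages, usable_pages) tuple.
    """
    utilisation = [
        (name, tablespace_utilisation(used, usable))
        for name, (used, usable) in page_counts.items()
    ]
    flagged = [(name, pct) for name, pct in utilisation if pct >= warning_percent]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Sample figures, made up for illustration.
sample_counts = {
    "USERSPACE1": (9500, 10000),
    "TEMPSPACE1": (100, 10000),
    "PORTALDATA": (8800, 10000),
}
for name, pct in tablespaces_over_threshold(sample_counts):
    print(f"{name}: {pct:.1f}% used")
```

Checking these figures on a schedule gives the admin time to reorganize or extend a table space before it reaches the point where the Portal server is affected.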
Once again, proper monitoring of every key component of the IBM Portal infrastructure is critical to avoiding downtime. An IBM Portal crash can have an enormous impact on important business lines, and if you want to avoid these types of incidents you need to stay aware and be alerted to the early signs.
GSX Solutions for IBM Connections, Portal & DB2 keep your users happy.
GSX Monitor is the perfect tool for the proactive management of your IT environment, so that you can avoid the use cases we’ve described. GSX Monitor provides a unified dashboard that displays the health and performance of your infrastructure. It tests the availability and performance of the applications themselves, and can do so from multiple locations. And beyond availability, it measures network latency between your users and the application server itself. This means you can rapidly troubleshoot any internal or external network issue that might impact your users.
In addition, we specifically monitor the critical components of the WebSphere server, checking the health of every node and JVM running on it. We also test the health of the backend environment, whether it’s DB2 or SQL. All these capabilities enable you to monitor not only the health of your datacenter, but also the quality of the service delivered to your end-users.
Watch this podcast to see how you can gain actionable insight into your mission-critical applications and solve business problems with GSX.