In a recent industry study by Gleanster, IT departments reported being overwhelmed by the number of monitoring tools at their disposal. By using vRealize Operations as a data aggregator you can quickly ingest endpoint data while taking advantage of the core value propositions of vROps. A common problem in the industry is that symptoms are often mistaken for root causes, leading to what we call band-aid fixes. These may provide additional resources and performance gains, but they haven't addressed the actual cause, or multiple causes, of an issue. Today we will look at root cause analysis by drilling down through the stack and using artificially created KPI thresholds to determine the issue.
Figure 1 – Custom Relationship KPI Dashboard
Figure 1 shows a comprehensive view of an Oracle DB supported by VMware virtual infrastructure and FlexPod hardware technology. As we select the database instance, we automatically discover each component tied to the stack to create true dependencies. As these are selected, the dashboard refreshes the KPI list to the right. The KPI list has been custom made, and for today's purposes we have created artificial thresholds to tell a story. Based on these thresholds, the scoreboard to the right highlights each KPI with a green, yellow, orange, or red silhouette. With red marking an offender of interest, we will likely want to begin performance troubleshooting with the KPIs tied to red silhouettes.
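The green/yellow/orange/red banding described above can be sketched as a simple threshold classifier. This is an illustrative sketch only; the function name and band boundaries are hypothetical examples, not vROps defaults:

```python
# Illustrative sketch of severity banding for a KPI value.
# Band boundaries are hypothetical, not vROps defaults.

def classify_kpi(value, yellow, orange, red):
    """Map a KPI value to a severity color using ascending thresholds."""
    if value >= red:
        return "red"
    if value >= orange:
        return "orange"
    if value >= yellow:
        return "yellow"
    return "green"

# Example: a DB Wait Time Ratio of 85% against hypothetical bands 40/60/80
print(classify_kpi(85, yellow=40, orange=60, red=80))  # red
```

In vROps itself, this kind of banding is configured through symptom and alert definitions rather than code; the sketch just makes the drill-down logic explicit.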
Figure 2 – Oracle DB List and KPIs
Drilling down into the OEM12c Oracle DB, we notice that we have created an issue with Executions, I/O, Write Time, and DB Wait Time Ratio. These thresholds were predetermined and can be tuned to meet an environment's expectations. From this dashboard we can take away that, based on our predefined standards, we are running into some type of write and I/O issue.
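The tunable, per-environment thresholds mentioned above can be thought of as a small configuration per KPI. The metric names below come from the dashboard; the units and numeric values are invented for illustration:

```python
# Hypothetical per-environment thresholds for the Oracle DB KPIs named
# above. Values and units are artificial; tune them to your environment.
DB_THRESHOLDS = {
    "Executions (per sec)":   {"yellow": 500, "orange": 750, "red": 1000},
    "I/O (MB/s)":             {"yellow": 100, "orange": 150, "red": 200},
    "Write Time (ms)":        {"yellow": 10,  "orange": 20,  "red": 40},
    "DB Wait Time Ratio (%)": {"yellow": 40,  "orange": 60,  "red": 80},
}

def breached(kpi, value):
    """Return the highest severity band the value breaches, or 'green'."""
    bands = DB_THRESHOLDS[kpi]
    for color in ("red", "orange", "yellow"):
        if value >= bands[color]:
            return color
    return "green"

print(breached("Write Time (ms)", 45))  # red
```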
Figure 3 – VM and Host List and KPIs
With the issue in hand, we can cross-reference the problems established at the Oracle DB layer against the virtual layer. As shown in Figure 3, we can quickly rule out CPU, Memory, Latency, and IOPs issues at the virtual layer. One point of interest is that we are beginning to see excess write throughput at the Host layer (ucs-1.bluemedora.localnet).
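The layer-by-layer rule-out process can be sketched as filtering each layer's KPI severities for red offenders. The layer and KPI names mirror this walkthrough; the severities are invented to match the scenario:

```python
# Sketch of cross-referencing KPI severities across stack layers to rule
# layers in or out. Severities are invented to mirror this scenario.
layers = {
    "Oracle DB": {"Write Time": "red", "DB Wait Time Ratio": "red"},
    "VM":        {"CPU": "green", "Memory": "green", "Latency": "green"},
    "UCS Host":  {"Write Throughput": "yellow"},
    "Nexus":     {"Throughput": "red", "Packets Received": "red"},
}

def offenders(layers, severity="red"):
    """Return layers with at least one KPI at the given severity."""
    return [name for name, kpis in layers.items()
            if severity in kpis.values()]

print(offenders(layers))  # ['Oracle DB', 'Nexus']
```

This mirrors the troubleshooting flow in the figures: layers whose KPIs all sit in the green band are ruled out, and attention moves to the layers that still show red.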
Figure 4 – UCS Compute and Fabric Interconnect List and KPIs
Moving from the virtual infrastructure to the physical compute, we can cross-reference our write issue with key components of our Cisco UCS environment. As shown in Figure 4, we can quickly rule out an issue at the blade layer, and the Fabric Interconnects appear to be working well within the thresholds we have created.
Figure 5 – Cisco Nexus Networking List and KPIs
Drilling down into the Cisco Nexus environment, we begin to see red flags based on the thresholds we have tuned for our environment. We are seeing unacceptable amounts of Throughput, Packets Received, and Packets Sent. Based on this scenario and the thresholds we have created, we can begin troubleshooting our network layer, and more specifically this Cisco switch.
Today we used vROps to support an advanced use case: troubleshooting performance issues using predefined and custom KPIs. In doing so, we were able to isolate a symptom at the Oracle database and tentatively tie it to a network issue, as the high traffic we observed at the network layer is a plausible cause of the increased write time at the database layer. These dashboards and KPIs can be finely tuned to any environment and provide deep insight into its overall performance, building a thorough understanding of dependencies as you use vROps as a point of data aggregation to ingest data from multiple endpoints.
For more information or a free trial of vROps Management Packs, visit the product page.