Over the last decade there has been a major shift toward virtual platforms that provide highly available environments while offering better utilization, performance, and functionality. A key component of virtualized environments is the sharing and distribution of physical hardware among virtual hosts and virtual machines. Although virtualizing a data center is often very advantageous, it does not come without some risk to both performance and health. Today we will look at the impact of a “noisy neighbor,” commonly defined as a Virtual Machine that monopolizes resources on a Virtual Host, causing resource constraints for other Virtual Machines on the same host.
Figure 1 – vSphere Host Overview Dashboard
Using the vSphere Host Overview Dashboard provided by VMware, as shown in Figure 1, we can identify CPU Contention via heatmaps. When we select the first yellow heatmap item, we see that it is tied to ESXi Host bellona.bluemedora.localnet. The popup window shows that the host is running at greater than 80% CPU Demand, which is also confirmed in the Top-N widget just below the heatmap.
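The check that the heatmap and Top-N widget perform visually can be sketched in a few lines of Python. This is only an illustration of the 80% CPU Demand threshold: the demand figures, the two `example.localnet` host names, and the `hosts_over_threshold` helper are hypothetical and are not read from vROps.

```python
# Hypothetical sample data: host name -> CPU demand as a percentage.
# The values are illustrative only; they are not taken from the dashboard.
host_cpu_demand = {
    "bellona.bluemedora.localnet": 86.0,
    "host-b.example.localnet": 42.5,
    "host-c.example.localnet": 67.0,
}

CPU_DEMAND_THRESHOLD = 80.0  # percent, matching the threshold cited in the text


def hosts_over_threshold(demand_by_host, threshold=CPU_DEMAND_THRESHOLD):
    """Return (host, demand) pairs above the threshold, highest demand first."""
    flagged = [(host, demand) for host, demand in demand_by_host.items()
               if demand > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


for host, demand in hosts_over_threshold(host_cpu_demand):
    print(f"{host}: {demand:.1f}% CPU demand")
```

Sorting the flagged hosts by demand mirrors what a Top-N list does: the most constrained host appears first.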
Figure 2 – Full Stack View – Custom Dashboard
Now that we have determined that we are seeing CPU Contention on bellona.bluemedora.localnet, we can drill further down into the resource. In the Summary tab, the upper left corner confirms that we are looking at the ESXi Host specifically. Under the health badge we see that the Top Health Alert is “Host in a cluster has CPU contention caused by less than half of its virtual machines”.
Figure 3 – Summary Tab
In addition, we can see all of the top health and risk alerts for the descendants. The descendants of this particular host are listed on the left in the inventory tree. Specifically, we can see that 76 VMs, 17 Datastores, 11 LUNs, and an Oracle DB environment are associated with this host. Because we are seeing resource contention on the host, it is very likely that the other VMs and the Oracle environment are experiencing performance issues.
Figure 4 – Summary Alerts Tab
Drilling into the smart alert, we can see the recommendation to use vMotion to better balance the workload. To the right, the alert information shows that this is a performance-led alert impacting health, that it is still active, and that it was last updated at 10:14 am. From there we expand the alert to see what is causing the issue. What we find is that the VM vrops621-sup is causing the CPU contention issue on the Host.
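The idea behind an alert such as "caused by less than half of its virtual machines" can be approximated by ranking VMs by CPU demand and finding the smallest group that accounts for most of the load. The sketch below is not how vROps computes the alert; the demand figures in MHz, the `vm-a` through `vm-c` names, and the `top_cpu_consumers` helper are assumptions made for illustration.

```python
def top_cpu_consumers(vm_demand_mhz, share=0.5):
    """Return the smallest set of VMs that together account for at least
    `share` of the total CPU demand, largest consumer first."""
    total = sum(vm_demand_mhz.values())
    if total <= 0:
        return []
    ranked = sorted(vm_demand_mhz.items(), key=lambda kv: kv[1], reverse=True)
    picked, running = [], 0.0
    for name, demand in ranked:
        picked.append(name)
        running += demand
        if running / total >= share:
            break
    return picked


# Hypothetical per-VM CPU demand on one host, in MHz (illustrative values).
vm_demand = {
    "vrops621-sup": 9000,
    "vm-a": 1500,
    "vm-b": 1200,
    "vm-c": 800,
}

print(top_cpu_consumers(vm_demand))
```

With these made-up numbers a single VM accounts for more than half of the demand, which is the pattern a noisy neighbor produces.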
Figure 5 – Summary Alerts Tab
Expanding the host, we can see that the contention issue has been fairly steady across the last 24-hour period and should be addressed, as it is not an unusual spike in performance.
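One rough way to separate a steady problem from a one-off spike is to ask what fraction of the samples in the window exceed a threshold. The following sketch assumes 24 hourly contention samples; the 5% threshold, the 80% cutoff, and the `is_sustained` helper are hypothetical choices, not vROps defaults.

```python
def is_sustained(samples, threshold, min_fraction=0.8):
    """Return True when at least `min_fraction` of the samples exceed
    `threshold`, which suggests a steady condition rather than a spike."""
    if not samples:
        return False
    above = sum(1 for value in samples if value > threshold)
    return above / len(samples) >= min_fraction


# Hypothetical hourly CPU contention percentages over 24 hours.
steady_contention = [12.0] * 24          # elevated for the whole window
single_spike = [1.0] * 23 + [30.0]       # one bad hour, otherwise healthy

print(is_sustained(steady_contention, threshold=5.0))
print(is_sustained(single_spike, threshold=5.0))
```

A fraction-based test is deliberately simple; it ignores how far above the threshold each sample is, so it should be read as a first-pass filter only.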
Figure 6 – Summary Alerts Tab
Now that we have identified that there is an issue, understood the impact to the business, and determined a fix based on the root cause analysis and recommendation, we can begin to remediate the issue. In this specific case we could select the Actions drop-down menu and open the host within the vSphere Web Client. Once in the web client, we could turn on DRS or vMotion vrops621-sup to another host with enough available resources.
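Before a manual vMotion, it helps to confirm that the destination host really has enough headroom. The sketch below picks the candidate that would be least utilized after the move; the capacity and usage figures, the `example.localnet` host names, the 80% ceiling, and the `pick_target_host` helper are all assumptions for illustration, and this is not the placement logic that DRS uses.

```python
def pick_target_host(hosts, vm_demand_mhz, max_utilization=0.8):
    """Return the name of the host that would be least utilized after
    receiving the VM, or None if every candidate would exceed the ceiling.

    `hosts` maps host name -> (capacity_mhz, used_mhz).
    """
    best_name, best_util = None, None
    for name, (capacity_mhz, used_mhz) in hosts.items():
        projected = (used_mhz + vm_demand_mhz) / capacity_mhz
        if projected <= max_utilization and (best_util is None
                                             or projected < best_util):
            best_name, best_util = name, projected
    return best_name


# Hypothetical candidate hosts: (CPU capacity in MHz, CPU currently used in MHz).
candidate_hosts = {
    "host-b.example.localnet": (40000, 17000),
    "host-c.example.localnet": (40000, 26800),
}

print(pick_target_host(candidate_hosts, vm_demand_mhz=9000))
```

Returning `None` when no host fits makes the "no safe destination" case explicit instead of silently moving the contention to another host.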
Today we used vROps to support an advanced use case: troubleshooting performance-led issues created by CPU Contention on the Host. We were able to identify a problem, determine the cause, and set a path to remediate the issue. If the issue had originated at the Oracle Database, Cisco Nexus, or another layer, similar steps would be taken using vROps Management Packs by Blue Medora.