Over the years we have watched the datacenter evolve in an effort to commoditize servers, largely through the standardization of compute resources. Yet even as standardization becomes common practice, datacenters still face complexity at the compute layer. Today we follow up on our previous blogs (Network and Storage), where we investigated network and storage troubleshooting in a 3-2-1 hardware stack.
As a reminder from my previous blogs, the term “3-2-1” refers to a redundant architecture of 3 servers, 2 switches, and 1 storage array, or some derivative thereof (2-2-1, etc.). Relying on the robust user interface that VMware vRealize Operations (vROps) provides, we will use a single pane of glass to explore health alerts at both the physical and virtual layers and correlate them with key performance metrics and health statuses of the compute layer.
To begin troubleshooting, we will look at the relationships within the 3-2-1 stack to determine which server(s) are of interest. We will then investigate triggered alerts and associate them with key performance metrics to fully understand issues in our compute stack.
Figure 1 – Custom 3-2-1 (Derivative) Stack Dashboard
We can use various third-party management packs to aggregate data in vROps and gain a better understanding of the dependencies in our stack. To illustrate the complexities of the 3-2-1 architecture, we have built the full stack view in Figure 1.
It’s easy to see that even standardized hardware can yield complex dependencies between related objects such as chassis, blades, power supplies, fans, and interconnects. Each of these objects is a potential point of failure in the stack; an issue at any one of them can impact the entire stack.
Features like vMotion and the Distributed Resource Scheduler (DRS) can also add complexity when mapping dependencies at the compute layer. Using vROps, we can define and re-define relationships between the physical and virtual layers, quickly remediating issues that stem from virtual machines and hosts.
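The fault-domain idea above — a failed physical component impacting everything that depends on it — can be sketched as a simple dependency walk. This is illustrative only, not vROps API code; the object names and parent/child relationships below are invented for the example.

```python
from collections import deque

# Hypothetical parent -> children relationships in a small 3-2-1 stack
STACK = {
    "chassis-1": ["blade-1", "blade-2", "blade-3", "psu-1", "fan-1"],
    "blade-1": ["esxi-host-1"],
    "blade-2": ["esxi-host-2"],
    "blade-3": ["esxi-host-3"],
    "esxi-host-1": ["vm-app-01", "vm-db-01"],
    "esxi-host-2": ["vm-web-01"],
    "esxi-host-3": ["vm-app-02"],
}

def impacted_objects(fault_source):
    """Breadth-first walk of the stack to find everything downstream of a fault."""
    impacted, queue = set(), deque([fault_source])
    while queue:
        for child in STACK.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted
```

For example, a fault on `blade-1` impacts its ESXi host and both VMs running on it, while a chassis fault fans out to every blade, host, and VM beneath it — exactly the kind of blast radius the relationship view in vROps makes visible.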
Figure 2 – Dell Compute Health Investigation
One method of troubleshooting is to start with a specific object and dig into its alerts. Using the Dell Compute Health Investigation dashboard (Figure 2) from the Management Pack for Dell PowerEdge, we can quickly identify active or inactive alerts, associate them with a criticality, and pull in related alerts from the virtual layer.
By selecting a specific server in the heatmap on the left (Server Health Selectable), the dashboard populates the alert widget with alerts for that host and the virtual machine(s) tied to it.
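Conceptually, the dashboard is scoping alerts to the selected server plus its related virtual objects. A minimal sketch of that filtering logic, with made-up alert records and a hypothetical host-to-VM mapping:

```python
# Invented sample data; in practice these records come from vROps.
ALERTS = [
    {"object": "esxi-host-1", "severity": "critical", "msg": "Fan speed below threshold"},
    {"object": "vm-db-01",    "severity": "warning",  "msg": "CPU contention detected"},
    {"object": "esxi-host-2", "severity": "info",     "msg": "Firmware update available"},
]

HOST_VMS = {
    "esxi-host-1": ["vm-app-01", "vm-db-01"],
    "esxi-host-2": ["vm-web-01"],
}

def alerts_for_server(host):
    """Collect alerts for the selected host plus the VMs tied to it."""
    scope = {host, *HOST_VMS.get(host, [])}
    return [a for a in ALERTS if a["object"] in scope]
```

Selecting `esxi-host-1` here surfaces both the host's fan alert and the warning on its VM — the same host-plus-VMs scoping the heatmap selection performs.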
Figure 3 – UCS Alert Summary
Any alert on the dashboard can be clicked to view more details. In Figure 3 we see the alert’s summary, criticality, time, and impact. We can follow the out-of-the-box recommendation(s) to try to resolve the alert. This particular management pack cross-references the alert with the vendor’s faults guide to supply the vendor’s suggested remediation steps.
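The fault-guide cross-reference amounts to a lookup from a fault identifier to vendor guidance. A minimal sketch — the fault codes and remediation text below are invented for illustration, not taken from any vendor's actual guide:

```python
# Hypothetical fault-code -> remediation mapping (invented for the example).
FAULT_GUIDE = {
    "F-FAN-01": "Reseat the fan module; replace it if the fault persists.",
    "F-PSU-02": "Check power cabling; replace the power supply if needed.",
}

def recommendation(fault_code):
    """Return vendor-suggested steps, with a fallback for unknown codes."""
    return FAULT_GUIDE.get(fault_code, "No vendor guidance found; escalate to support.")
```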
Figure 4 – UCS VMs to Related UCS Server Details Dashboard
Another approach to troubleshooting works from the virtual layer down into the hardware stack. As shown in Figure 4, we can select a specific virtual machine and map it directly to its associated compute resources using the UCS VM to Related UCS Server Details dashboard. Once the virtual machine is selected, we can examine key performance metrics of the physical server that could be impacting the virtual machine’s performance. In this example, we view throughput metrics of the physical server to quickly determine the impact it may be having on the environment.
Using the methods explored in this blog, you can streamline troubleshooting and gain deep insight into the health and status of your compute layer. By adding management packs for server, storage, and networking, you can use vROps to aggregate key endpoints and better understand the health of your entire 3-2-1 stack.
This blog post first appeared on the VMware Cloud Management Blog.