Compute Troubleshooting in a 3-2-1 Hardware Stack

by bluemedora_editor on June 8, 2016

Over the years we have watched the datacenter evolve in an effort to commoditize servers. One way this has been accomplished is by standardizing compute resources. Even as standardization becomes common practice, datacenters still face complexity at the compute layer. Today we follow up on our previous blogs (Network and Storage), which investigated network and storage troubleshooting in a 3-2-1 hardware stack.

What is 3-2-1?

As a reminder from my previous blogs, the term “3-2-1” refers to a redundant architecture of 3 servers, 2 switches, and 1 storage array, or some derivative of it (2-2-1, etc.). Relying on the user interface that VMware vRealize Operations (vROps) provides, we will use a single pane to explore health alerts at both the physical and virtual layers and correlate them with key performance metrics and health statuses of the compute layer.

Compute Troubleshooting: Relationships

To begin the troubleshooting, we will look at the relationships of the 3-2-1 stack to determine which server(s) are of interest. We will then investigate triggered alerts and associate them with key performance metrics to fully understand issues in our compute stack.

Figure 1 – Custom 3-2-1 (Derivative) Stack Dashboard

We can use various third-party management packs to aggregate data in vROps and gain a better understanding of the dependencies in our stack. To illustrate the complexity of the 3-2-1 architecture, we have built the full stack view shown in Figure 1.

It’s easy to see that even standardized hardware can yield complex dependencies between related objects such as chassis, blades, power supplies, fans, and interconnects. Each of these objects is a potential failure point in the stack; an issue at any one of them can impact everything that depends on it.
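To make the dependency idea concrete, here is a minimal Python sketch of how a fault in one object propagates to the objects that depend on it. It does not use any vROps API; the object names (`psu-1`, `blade-1`, and so on) and the `DEPENDENTS` map are hypothetical stand-ins for the relationships shown in the dashboard.

```python
from collections import deque

# Hypothetical stack: each object maps to the objects that depend on it.
# These names are illustrative only; they are not taken from a real environment.
DEPENDENTS = {
    "psu-1": ["chassis-1"],
    "chassis-1": ["blade-1", "blade-2"],
    "blade-1": ["esxi-host-1"],
    "blade-2": ["esxi-host-2"],
    "esxi-host-1": ["vm-app-1", "vm-db-1"],
    "esxi-host-2": ["vm-web-1"],
}


def impacted_by(failed, dependents=DEPENDENTS):
    """Return every object downstream of `failed`, in breadth-first order."""
    impacted = []
    visited = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in visited:
                visited.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


if __name__ == "__main__":
    # A failed power supply reaches every blade, host, and VM in this toy stack.
    print(impacted_by("psu-1"))
    # A failed blade only reaches its own host and that host's VMs.
    print(impacted_by("blade-1"))
```

In this toy model a fault low in the stack touches far more objects than one near the top, which is why mapping relationships first helps narrow down which server is of interest.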

Features like vMotion and Distributed Resource Scheduler (DRS) can make dependencies at the compute layer harder to follow, because virtual machines move between hosts. Using vROps we can define and redefine relationships between the physical and virtual layers, which helps us quickly remediate issues that stem from virtual machines and hosts.

Figure 2 – Dell Compute Health Investigation

Compute Troubleshooting: Object Alerts

One method of troubleshooting is to start with a specific object and dig into that object’s alerts. Using the Dell Compute Health Investigation dashboard (Figure 2) from the Management Pack for Dell PowerEdge, we can quickly identify any active or inactive alerts, see the criticality of each, and pull in related alerts from the virtual layer.

When we select a specific server in the heatmap on the left (Server Health Selectable), the dashboard populates the alert widget with alerts for that host and for the virtual machine(s) tied to that server.
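The selection behavior described above amounts to filtering alerts by a server and the virtual machines related to it. The following Python sketch shows that logic on in-memory data, under stated assumptions: the `Alert` class, the `VMS_ON_SERVER` map, the sample alerts, and the criticality ordering are all hypothetical and do not represent the management pack's implementation or any vROps API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    resource: str      # object the alert fired on (server or VM)
    criticality: str   # assumed levels: critical, immediate, warning, info
    active: bool
    summary: str


# Hypothetical relationship data: which VMs run on which physical server.
VMS_ON_SERVER = {
    "server-a": ["vm-1", "vm-2"],
    "server-b": ["vm-3"],
}

# Hypothetical sample alerts, invented for illustration.
ALERTS = [
    Alert("server-a", "warning", True, "Fan speed below threshold"),
    Alert("vm-1", "critical", True, "CPU contention"),
    Alert("vm-3", "info", True, "Snapshot is old"),
    Alert("vm-2", "critical", False, "Disk latency (cleared)"),
]

# Lower number sorts first; this ordering is an assumption for the sketch.
SEVERITY_ORDER = {"critical": 0, "immediate": 1, "warning": 2, "info": 3}


def alerts_for_server(server, alerts=ALERTS, vms_on_server=VMS_ON_SERVER):
    """Return alerts for a server and its VMs, active and most severe first."""
    scope = {server, *vms_on_server.get(server, [])}
    selected = [a for a in alerts if a.resource in scope]
    return sorted(
        selected,
        key=lambda a: (not a.active, SEVERITY_ORDER[a.criticality]),
    )


if __name__ == "__main__":
    for alert in alerts_for_server("server-a"):
        state = "active" if alert.active else "inactive"
        print(f"{alert.criticality:9} {state:8} {alert.resource}: {alert.summary}")
```

Sorting active alerts ahead of inactive ones, and then by criticality, mirrors the triage order an operator would likely follow, though the dashboard itself may order alerts differently.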

Figure 3 – UCS Alert Summary

Any alert on the dashboard can be clicked to view more details. In Figure 3 we see the alert’s summary, criticality, time, and impact. We can follow the out-of-the-box recommendation(s) to try to resolve the alert. This particular Management Pack cross-references the alert with the vendor’s faults guide to supply the specific remediation steps suggested by the vendor.

Figure 4 – UCS VMs to Related UCS Server Details Dashboard

Compute Troubleshooting: Virtual Layer

Another approach to troubleshooting involves working from the virtual layer into the hardware stack. As shown in Figure 4, we can select a specific virtual machine and map it directly to its associated compute resources using the UCS VMs to Related UCS Server Details dashboard. Once the virtual machine is selected, we can see the key performance metrics of the physical server that may be affecting the virtual machine’s performance. In this example we view throughput metrics for the physical server to quickly determine the impact it may be having on the environment.
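The VM-to-server mapping described above can be sketched as a simple lookup followed by a threshold check. This Python example is only an illustration: the `VM_TO_HOST` map, the metric names, the sample values, and the thresholds are hypothetical and are not taken from the dashboard or from any management pack.

```python
# Hypothetical mapping of virtual machines to the physical servers they run on.
VM_TO_HOST = {
    "vm-app-1": "ucs-blade-1",
    "vm-db-1": "ucs-blade-1",
    "vm-web-1": "ucs-blade-2",
}

# Hypothetical latest metric samples per physical server.
HOST_METRICS = {
    "ucs-blade-1": {
        "cpu_usage_pct": 92.0,
        "memory_usage_pct": 71.0,
        "network_throughput_mbps": 850.0,
    },
    "ucs-blade-2": {
        "cpu_usage_pct": 35.0,
        "memory_usage_pct": 48.0,
        "network_throughput_mbps": 120.0,
    },
}

# Hypothetical thresholds; real values depend on the hardware and workload.
THRESHOLDS = {
    "cpu_usage_pct": 85.0,
    "memory_usage_pct": 90.0,
    "network_throughput_mbps": 900.0,
}


def host_hotspots(vm):
    """Return the VM's host and any host metrics above their threshold."""
    host = VM_TO_HOST[vm]
    metrics = HOST_METRICS[host]
    over = {
        name: value
        for name, value in metrics.items()
        if value > THRESHOLDS.get(name, float("inf"))
    }
    return host, over


if __name__ == "__main__":
    host, over = host_hotspots("vm-app-1")
    print(f"vm-app-1 runs on {host}; metrics over threshold: {over}")
```

Starting from the virtual machine and resolving its host first keeps the investigation focused on the one physical server that can actually be affecting that workload.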

Using the methods explored in this blog, you can streamline troubleshooting and gain deeper insight into the health and status of your compute layer. By adding management packs for servers, storage, and networking to the mix, you can use vROps to aggregate key endpoints and better understand the health of your entire 3-2-1 stack.

For more information or a free trial of vROps management packs by Blue Medora, visit the product page on Blue Medora’s website.

This blog post first appeared on VMware Cloud Management Blog. To read the full article, click here.
