Network Troubleshooting in a 3-2-1 Hardware Stack

by bluemedora_editor on March 17, 2016

Even as information technology trends work toward simplifying architecture, complexities within the hardware stack are still evident. IT teams are often faced with the daunting task of determining where the root cause of an issue resides and, more specifically, what is causing the issue.

Today we will use vRealize Operations as a data aggregator for network troubleshooting in a 3-2-1 hardware stack. For reference, the term “3-2-1” refers to a redundant architecture of 3 servers, 2 switches, and 1 storage array, or some derivative of it (2-2-1, etc.). To begin troubleshooting, we will map the relationships of the 3-2-1 setup to determine which switch(es) and ports are associated with the stack.

After we’ve mapped the relationships, we will investigate performance issues associated with packet drops, network congestion, and over-utilization at the network layer.

Figure 1 – Custom 3-1-1 Dashboard

Let’s start by understanding the 3-2-1 architecture in its simplest form. The example shown in Figure 1 uses 3 Dell servers, 1 Nexus switch, and 1 NetApp storage array. As we can see, the idea of 3-2-1 is simple, but when trying to determine the root cause of performance issues we find that additional components not listed above are also part of this type of infrastructure.

Figure 2 – Custom 3-1-1 Dashboard (complex)

To show how quickly a 3-2-1 architecture can become complex, we’ve built out the custom dashboard shown in Figure 2. Inside the stack, PSUs, fans, ports, disks, and other components can all become the underlying cause of an issue. In today’s example we’ve used vRealize Operations to map the relationships through the infrastructure, determining which servers, switches, storage arrays, and components are related and how each impacts the others.
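To make the idea of relationship mapping concrete, here is a minimal sketch in Python. It is not how vRealize Operations stores or queries relationships; the resource names and the `RELATIONSHIPS` structure are hypothetical and only illustrate walking from a switch or port down to the components that depend on it.

```python
from collections import deque

# Hypothetical parent -> children relationships for a 3-1-1 stack.
# All names are illustrative, not taken from a real environment.
RELATIONSHIPS = {
    "nexus-switch": ["port-1/1", "port-1/2", "port-1/3", "port-1/4",
                     "switch-psu", "switch-fan"],
    "port-1/1": ["dell-server-1"],
    "port-1/2": ["dell-server-2"],
    "port-1/3": ["dell-server-3"],
    "port-1/4": ["netapp-array"],
    "netapp-array": ["disk-1", "disk-2"],
}


def impacted_by(resource, relationships):
    """Return every resource downstream of `resource`, breadth-first."""
    seen = []
    queue = deque(relationships.get(resource, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(relationships.get(node, []))
    return seen


def ports_for(target, relationships):
    """Return the switch ports that connect directly to `target`."""
    return sorted(
        parent
        for parent, children in relationships.items()
        if parent.startswith("port-") and target in children
    )


if __name__ == "__main__":
    print(impacted_by("port-1/4", RELATIONSHIPS))
    print(ports_for("dell-server-2", RELATIONSHIPS))
```

With a map like this, a fault on one port can be traced to the server or storage array behind it, which is the same question the dashboard relationships answer visually.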

Figure 3 – Cisco Nexus Switch Overview Dashboard (top portion)

After we’ve identified all of the components in the stack, we begin high-level troubleshooting at the dashboard level. The Cisco Nexus Switch Overview Dashboard lets us map the Nexus switch to its ports. The right column of Figure 3 lists all of the alerts associated with the Nexus switch and its ports, so we can quickly determine whether any issues need attention.

Figure 4 – Cisco Nexus Switch Overview Dashboard (bottom portion)

In the bottom portion of the Cisco Nexus Switch Overview Dashboard, shown in Figure 4, we can dive into specific metrics to determine the performance status of each port. Selecting a specific port on the switch shows its received and transmitted statistics, such as traffic and packet discards.
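As a rough illustration of what the per-port statistics tell us, the sketch below computes receive and transmit discard percentages from packet counters and flags ports above a cutoff. The counter names, sample values, and the 1% cutoff are assumptions made for this example, not values from the management pack.

```python
# Hypothetical per-port counters for one collection interval.
PORT_STATS = {
    "port-1/1": {"rx_packets": 100_000, "rx_discards": 0,
                 "tx_packets": 80_000, "tx_discards": 0},
    "port-1/2": {"rx_packets": 50_000, "rx_discards": 1_000,
                 "tx_packets": 40_000, "tx_discards": 0},
    "port-1/4": {"rx_packets": 200_000, "rx_discards": 0,
                 "tx_packets": 200_000, "tx_discards": 10_000},
}


def discard_pct(discards, packets):
    """Discards as a percentage of packets; 0.0 when no packets were seen."""
    return 100.0 * discards / packets if packets else 0.0


def ports_with_discards(stats, threshold_pct=1.0):
    """Return ports whose rx or tx discard percentage meets the cutoff."""
    flagged = {}
    for port, counters in stats.items():
        rx = discard_pct(counters["rx_discards"], counters["rx_packets"])
        tx = discard_pct(counters["tx_discards"], counters["tx_packets"])
        if max(rx, tx) >= threshold_pct:
            flagged[port] = {"rx_pct": rx, "tx_pct": tx}
    return flagged


if __name__ == "__main__":
    for port, pct in ports_with_discards(PORT_STATS).items():
        print(port, pct)
```

Keeping receive and transmit separate matters: discards on the receive side and on the transmit side of a port usually point to different causes.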

Figure 5 – All Metrics List of Nexus Switch

To drill further into what is causing performance issues at the network layer, we turn to the All Metrics tab. To investigate switch congestion, we can pull up packet errors and correlate them with aggregated port traffic. Figure 5 shows how packet drop errors relate to traffic throughput at the aggregated port level.
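The correlation described here is done visually by overlaying the two metrics. For readers who want a number, one common option is the Pearson correlation coefficient. The sketch below is only an illustration of that statistic; the sample series are invented, and nothing here reflects how vRealize Operations computes or displays the charts.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical samples taken at the same timestamps.
    traffic_mbps = [200, 400, 600, 800, 950]
    packet_drops = [0, 0, 5, 40, 120]
    print(round(pearson(traffic_mbps, packet_drops), 3))
```

A value close to 1 suggests drops rise with throughput, which is consistent with congestion; a value near 0 suggests the drops have another cause, such as a faulty port or cable.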

Figure 6 – Nexus Capacity Remaining

Taking the network troubleshooting a step further, we can pull up the capacity remaining for this switch to ensure that memory and CPU usage have not exceeded the available resources. Expanding the Capacity Remaining tab shows the trend of each resource. At a glance, we can see where we were, where we are, and where we are projected to be. Using the capacity badges, we can easily determine whether overconsumption is the cause of network latency. Furthermore, the trends show when we will overrun the memory and CPU of the switch.
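To show the idea behind a trend-based projection, the sketch below fits a straight line to usage samples and estimates how many days remain until the line reaches 100%. This is a plain least-squares fit used as a stand-in; it is not the capacity analytics vRealize Operations actually applies, and the sample data is hypothetical.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_exhausted(days, usage_pct, limit_pct=100.0):
    """Days after the last sample until the fitted trend hits the limit.

    Returns None when usage is flat or falling, since the trend line
    never reaches the limit in that case.
    """
    slope, intercept = linear_fit(days, usage_pct)
    if slope <= 0:
        return None
    crossing_day = (limit_pct - intercept) / slope
    return max(0.0, crossing_day - days[-1])


if __name__ == "__main__":
    # Hypothetical daily memory usage samples for the switch.
    days = [0, 1, 2, 3, 4]
    memory_pct = [50, 55, 60, 65, 70]
    print(days_until_exhausted(days, memory_pct))
```

A straight line is the simplest possible model; real usage is often cyclical, so treat a projection like this as a rough early warning rather than a forecast.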

Network troubleshooting in a 3-2-1 stack can be simplified by determining the relationships through the stack and by identifying specific thresholds for key performance indicators (KPIs).
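As a small illustration of KPI thresholds, the sketch below classifies a set of sampled values as ok, warning, or critical. The KPI names and threshold values are assumptions chosen for the example; appropriate limits depend on the hardware and workload.

```python
# Hypothetical KPI thresholds as (warning, critical) pairs.
THRESHOLDS = {
    "cpu_pct": (75.0, 90.0),
    "memory_pct": (80.0, 95.0),
    "discard_pct": (0.5, 2.0),
}


def classify(kpi, value, thresholds=THRESHOLDS):
    """Map a KPI value to 'ok', 'warning', or 'critical'."""
    warning, critical = thresholds[kpi]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def evaluate(sample, thresholds=THRESHOLDS):
    """Classify every KPI in `sample` that has a defined threshold."""
    return {
        kpi: classify(kpi, value, thresholds)
        for kpi, value in sample.items()
        if kpi in thresholds
    }


if __name__ == "__main__":
    switch_sample = {"cpu_pct": 92.0, "memory_pct": 81.0, "discard_pct": 0.1}
    print(evaluate(switch_sample))
```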

In today’s blog post, we showed that using vRealize Operations as a management and analytical engine, in conjunction with third-party management packs, streamlines troubleshooting within a 3-2-1 infrastructure stack. For more information or a free trial of the Management Pack for Cisco Nexus, visit the Product Page on Blue Medora’s website.

 

This blog post was originally posted on VMware Cloud Management Blog. Read the full post here.
