Tech Field Day at VMworld: Hybrid Platform Operations for VMware vRealize Demo

Tech Field Day at VMworld: Hybrid Platform Operations for VMware vRealize Video and Transcript

Hello, everybody. My name’s Craig Lee. I’m the chief architect at Blue Medora. We’re going to dive right in. This is kind of the fun part, where we take a lot of this data and actually show it being functional and solving some problems.

What we’re looking at here… on the left-hand side, you’ll see what you really get out of the box with vRealize Operations. On the right, you see the whole system. As you click through that, you’ll see the relationships change. These virtual machines are running on that host, and then ultimately down to the datastore.

What comes out of the box with vRealize Operations is pretty useful, but I’m really going to show you what Blue Medora adds, and show you what you’re missing. This is with Blue Medora. In our example, we’re talking about Pivotal Ready Architecture, so PCF, Kubernetes, EMC clusters, things like that. We can actually pull in the hardware information from the VxRail cluster itself. As I’m clicking through, you can see the rack server and, later, the power supply. We’re also adding in relationships. This is part of the Dimensional Data stream, as Mike had mentioned, to the host system and down to these virtual machines. But along with that, we really want to look at the application workload.

I can see, as I’m clicking through here, there’s a Diego Cell in PCF. Ignore the fact that it says Diego Cell; this could be your SQL Server, HANA, or anything else we monitor. As I’m clicking through, you see the relationships change to show what virtual machine it’s related to, so we’re linking back to the virtual layer itself. It’s the same thing with the PCF BOSH jobs, the Doppler server, and the Diego router. As you can see, as I’m clicking through here, there are a lot of different layers. Really, what this is, is the realization of all that Dimensional Data: the metadata around relationships, plus an understanding of our target’s architecture.
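If you want to walk those same relationships outside the UI, the vROps Suite REST API exposes them. The sketch below is a rough, unofficial example and not Blue Medora’s code: the endpoint paths and response field names are from the vROps 6.x Suite API (verify them against your version), and the host, credentials, and resource UUID are placeholders.

```python
# Rough, unofficial sketch: walking object relationships via the vROps Suite REST API.
# Endpoint paths and field names are from the vROps 6.x Suite API; verify against your
# version. HOST, the credentials, and the resource UUID are placeholders.
import requests

HOST = "https://vrops.example.com"                     # hypothetical vROps instance
AUTH = {"username": "admin", "password": "secret"}     # placeholder credentials

def acquire_token(session):
    """Acquire a token for the vRealizeOpsToken Authorization header."""
    resp = session.post(f"{HOST}/suite-api/api/auth/token/acquire",
                        json=AUTH, headers={"Accept": "application/json"})
    resp.raise_for_status()
    return resp.json()["token"]

def related_resources(session, token, resource_id):
    """List the name and kind of every object related to one vROps resource."""
    resp = session.get(f"{HOST}/suite-api/api/resources/{resource_id}/relationships",
                       headers={"Accept": "application/json",
                                "Authorization": f"vRealizeOpsToken {token}"})
    resp.raise_for_status()
    return [(r["resourceKey"]["name"], r["resourceKey"]["resourceKindKey"])
            for r in resp.json().get("resourceList", [])]

if __name__ == "__main__":
    with requests.Session() as s:
        s.verify = False                               # demo only; use real certs
        token = acquire_token(s)
        # e.g. the UUID of a Diego Cell VM discovered by a management pack
        for name, kind in related_resources(s, token, "00000000-0000-0000-0000-000000000000"):
            print(f"{kind}: {name}")
```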

We also have Kubernetes here. You saw a lot of announcements at VMworld around PKS, and we definitely want to address that. So we have Kubernetes mixed in as well, with relationships through vSAN and all the way down the stack. What you get on the right-hand side, with just a few of our offerings, is a select stack. The left side is really what you get out of the box.

So now I’m going to take it through and actually show that in action, and what it really means to our customers. Starting off with our Pivotal example, the PCF example, you have two different availability zones, different clouds side by side. On the left-hand side, you have PCF running on AWS. You can see the relationship between the foundation, the availability zone, and the jobs. These are the actual application workloads running on the AWS side, and we relate that back to Amazon Web Services EC2. There are a lot of alerts, so perhaps you want to do some investigation. With that visibility, we were able to really see that.

I know most folks here are VMware experts and understand this already, but to clarify: out of the box, VMware is doing the piece that you see here. The virtual machines, the ESXi hosts, and the datastores are all things that vRealize Operations monitors on its own. Everything else that you see there is our integrations. We’re adding the integrations for PCF, for Dell, for AWS. All of those have been added in, and they look native, right?

We’re using their APIs, you know, to keep a really consistent look and feel throughout the product. Those have all been added by Blue Medora.

Q. Is there anything stopping the management packs from pushing events to Log Insight, or do you just look to vROps to do that, at least with the management pack structure today within the vROps platform?

A. It’s almost going the other way. VMware has designed it so that Log Insight events come into the objects in vRealize Operations, and vROps is the central aggregation point. But that’s an excellent segue into what I’m going to show around alerting.

It’s actually this next piece here. We’re looking at the VxRail cluster itself. We can look at the health of all the different aspects: cooling units, power supplies, things like that. You saw the relationships earlier, so I won’t dive into much depth, but you can see the blade servers we’re monitoring and how they relate to the VMware host systems. Where this gets really interesting, because we have the relationships, and going back to your question about alerts, is that now we have alerts. We’re pulling not just metrics, but also relationships and the alerts themselves.

So now we’re looking at a PCF Diego Brain. We’re looking at an alert coming from a particular service within PCF itself. Because of the relationships, we can see what the real impact is and where the real issues are. This is the ultimate understanding we’re trying to get to. This is an alert coming out of PCF, and we can see that the Diego Brain, which is a service, has a critical failure. As Mike had mentioned, we have several partnerships, and we try to have best practices and great relationships with those we collect data from. For PCF, we actually use their best practices to write these recommendations, so right here we’re giving some action items on how to follow up on this particular alert. I was able to trace us through from the VxRail hardware layer, because of the relationships, down to the app side. I hope that gives you a bit of context on the Log Insight question.
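The alerts themselves are reachable over the same Suite API, which is one programmatic way to see what those dashboards show. This is a rough sketch under the same assumptions as the earlier snippet; the /suite-api/api/alerts endpoint and its response field names are from the vROps 6.x API and may differ in your version.

```python
# Rough sketch, same assumptions as the earlier snippet: pulling the alerts attached
# to one object through the Suite API. The /suite-api/api/alerts endpoint and the
# response field names are from the vROps 6.x API and may differ in your version.
def alerts_for_resource(session, token, host, resource_id):
    """Return (severity, alert definition name) pairs for one vROps resource."""
    resp = session.get(f"{host}/suite-api/api/alerts",
                       params={"resourceId": resource_id},
                       headers={"Accept": "application/json",
                                "Authorization": f"vRealizeOpsToken {token}"})
    resp.raise_for_status()
    return [(a.get("alertLevel"), a.get("alertDefinitionName"))
            for a in resp.json().get("alerts", [])]
```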

Q. That means we don’t always blame the network: “It’s the network! It’s the storage!” When you have a tool like this, it really identifies what the real issue is, right?

A. It gets to a great point. A lot of customers that are buying this are trying to reduce finger-pointing, and this is one of the ways you do that: you have a single source of truth.

Actually, that’s an excellent segue into our single-pane-of-glass view. This is how to really troubleshoot quickly. We’re looking at a Microsoft SQL environment. It’s a cluster. We can see several KPIs there, and we’ll dive into that in a minute. As I’m clicking through these, you see the relationships through the stack. It’s a traditional architecture: we have SQL running on VMs on an ESXi host, and we’re monitoring the actual PowerEdge server, the networking layer with the Cisco switch itself, the datastore, and then on to the actual SAN. What’s really cool about this is that the Blue Medora integrations aren’t only high level and relational. We’re not only pulling in metrics; we’re also going deep.

I’m going to look at this particular SQL Server with a very high wait time. I want to explore further, so I navigate to the Microsoft SQL Server query plan. From there, I can look at several SQL queries. This is actually highlighting our highly utilized SQL Server. We can go to the query with the highest average execution time, get some info on it, and then ultimately pull in the actual text of that particular query. Already this is a pretty good level, right? You can have conversations between your DBA team and perhaps your storage and virtualization admin teams to really pinpoint where that root cause is coming from.
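The numbers behind that drill-down come from SQL Server itself. As a rough illustration of the kind of data being surfaced, not Blue Medora’s implementation, the standard SQL Server DMVs can already give you the statements with the highest average elapsed time along with their text; the connection details below are placeholders.

```python
# Rough illustration, not Blue Medora's implementation: SQL Server's standard DMVs
# (sys.dm_exec_query_stats / sys.dm_exec_sql_text) expose the statements with the
# highest average elapsed time and their text. Connection details are placeholders.
import pyodbc

TOP_QUERIES = """
SELECT TOP (10)
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
          ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_us DESC;
"""

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=sql.example.com;DATABASE=master;"
                      "UID=monitor;PWD=secret")          # placeholder connection string
for avg_us, count, text in conn.execute(TOP_QUERIES):
    # total_elapsed_time is reported in microseconds
    print(f"{avg_us / 1000:.1f} ms avg over {count} runs: {text[:80]}")
```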

So I have this alongside everything else, which is pretty interesting, but we go even further. We break the SQL query down to the individual functions within that query. It’s a bit hard to read, but this hash match is taking up 62% of the operator cost. So this particular function within the query is putting the load all the way through the virtualization environment. We’re able to trace it down to that level.
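For context on where an operator-level percentage like that comes from: a SQL Server execution plan is showplan XML in which every RelOp node carries an EstimatedTotalSubtreeCost, and an operator’s own cost is its subtree cost minus its children’s. The sketch below is a hypothetical illustration of that calculation, not the product’s code; plan_xml would come from something like sys.dm_exec_query_plan for the query found above.

```python
# Hypothetical illustration of the operator-cost math, not the product's code: in
# showplan XML each RelOp node has an EstimatedTotalSubtreeCost, and an operator's
# own cost is its subtree cost minus the subtree costs of its immediate child RelOps.
# Dividing by the root's subtree cost gives percentages like "Hash Match: 62%".
import xml.etree.ElementTree as ET

SP = "{http://schemas.microsoft.com/sqlserver/2004/07/showplan}"

def operator_cost_breakdown(plan_xml):
    """Return (physical operator, estimated own cost) pairs for a showplan XML string."""
    root = ET.fromstring(plan_xml)
    parent = {child: p for p in root.iter() for child in p}

    def nearest_relop_ancestor(node):
        node = parent.get(node)
        while node is not None and node.tag != SP + "RelOp":
            node = parent.get(node)
        return node

    relops = list(root.iter(SP + "RelOp"))
    child_cost = {op: 0.0 for op in relops}
    for op in relops:
        anc = nearest_relop_ancestor(op)
        if anc is not None:
            child_cost[anc] += float(op.get("EstimatedTotalSubtreeCost"))

    return [(op.get("PhysicalOp"),
             float(op.get("EstimatedTotalSubtreeCost")) - child_cost[op])
            for op in relops]
```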

Q. This is starting to look like actual application tracing. Can you maybe explain the difference between actual tracing and what you’re doing here?

A. Well, if by tracing you mean transaction tracking, then no, that’s not really what we’re doing here. We’re not doing transaction tracking, application instrumentation, things along those lines. What we’re doing is gathering a wide variety of metrics, getting all the relational information, building those relationships and the metadata, and pulling it together. The way I’ve sometimes described it is that we start where application tracing ends. In an application trace, you’re instrumenting your app. You can usually see where it hits the database, but then that’s it. So we pick up from there: you see where you hit the database, and now let’s look at everything else inside your environment.

I know time’s running really short, so I have one last thing to show you. Tracing this through, we went deep on the application side and into the query; now let’s look at the impact on the infrastructure. Perhaps that’s what’s causing it, or it’s a cascading effect. Through here we can see there’s a high average wait time on the VM and very high latency on the host. Because we’re making that relationship to the hardware itself, we can see the hardware looks fine. The network area looks okay; there are no KPI alerts. From the datastore side, there’s high total latency.

We’re taking what’s in vROps itself and making relationships beyond it. We’re looking at the underlying SAN volume itself. There’s a lot going on here: a lot of alerts around IO and latency. Using that same methodology of linking the dashboards and objects within vROps, we can go to our dashboard called troubleshoot app or array. This happens to be a Pure Storage array within this environment, but it could be NetApp or Dell EMC, or any of a dozen other targets, especially around storage. We can go step by step, really looking at this particular array and whether the latency is high, and we offer recommendations on what to do.

We’re looking at the latency overall. This is an example of what Mike mentioned: we’re really trying to supercharge all the different monitoring platforms we hook up to. We’re actually powering the analytics engine to do forecasting analysis on the numbers that we’re bringing in. Following this through, we also see the distribution of latency, so the amount of time spent at each latency level. We’re looking at the current IO being high, and then ultimately down to the actual queue depth. These are basically requests that are not yet being serviced by the storage array itself.
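As a toy illustration of what trend forecasting on a latency series looks like, a simple least-squares line projected forward already tells you whether latency is drifting toward a threshold. The real analytics are vROps’ own models, not this, and the samples and threshold below are made up.

```python
# Toy illustration of trend forecasting on a latency series; the real analytics are
# vROps' own models, not this. The samples and the threshold below are made up.
import numpy as np

def forecast_latency(samples_ms, steps_ahead):
    """Fit a least-squares line to evenly spaced samples and project it forward."""
    t = np.arange(len(samples_ms))
    slope, intercept = np.polyfit(t, samples_ms, 1)
    future = np.arange(len(samples_ms), len(samples_ms) + steps_ahead)
    return slope * future + intercept

latency_ms = [2.1, 2.4, 2.2, 2.9, 3.4, 3.8, 4.5, 5.1]   # hypothetical 5-minute samples
projected = forecast_latency(latency_ms, steps_ahead=6)
print("projected next samples (ms):", np.round(projected, 1))
if projected[-1] > 5.0:                                   # arbitrary example threshold
    print("latency is trending toward the alerting threshold")
```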
