Technical Contributions from Blue Medora Chief Systems Architect, Craig Lee.
This is part one of a three-part blog series on Log Management for Google Stackdriver with BindPlane.
Proper Log Management
Whether you’re a multi-billion-dollar tech firm or a small startup, sorting your data in a usable and logical manner can prove to be a major challenge. No matter the size of your company, you are going to have individuals within your organization who need access only to specific data and information. Providing them with anything extra slows down workflows and creates digital clutter. When you first begin ingesting data with Google Stackdriver Logging, the amount of data can be overwhelming, and you’ll almost immediately find a need for log management.
Unfiltered logs can greatly slow down the workflow when you are ingesting log data from multiple data centers, web apps, databases, and so on, especially when each team only requires data from specific sources. For example, without the ability to sort and filter your data by source, your database team is left sifting through web-app and other irrelevant data to locate what they are searching for, and vice versa. Even if you are pulling data from a single source, it can be difficult to find the specific log or log set that you need.
Users may want to keep an eye on different severity levels, specific log types, or even a single specific log. All of this can be extremely difficult and time-consuming to sort out without proper log management. 71% of users reported that current tools give hints but rarely find the root cause. They end up losing days of productivity in this fruitless search. This is where log management and tagging come into play.
Why you need Log Tagging
Log tagging will prove to be one of the most useful tools in your log management arsenal. Once you have implemented tagging, most of your filtering challenges within Google Stackdriver become far easier to handle. Google Stackdriver, out of the box, offers some basic tagging features. These allow users to sort logs by message severity, the namespace, the application that the log is sent from, and the respective Google data center. These basic tags are a great tool to help teams filter out the majority of the noise in their log ingestion, but they still do not solve the problem of sorting out individual log instances within these applications and data centers.
Really drilling down into tagging and custom log management requires some custom Fluentd work. Custom work can prove to be difficult without the assistance of a service like BindPlane. We’ll dig more into that in part 2 of the series. Below you will see an example of a Kubernetes log message with custom log tagging applied.
Example Kubernetes Log Message:
Namespace, node_name, container_name, and other tags are in the JSON Payload of the log message
Log Management on a Global Scale
Tagging seems like such a simple concept, and you’re probably thinking it would be common sense to have it implemented. As we said earlier, creating custom tags and log management can be very difficult and time-consuming without help. This was made evident when one of our clients came to us with the challenge of scaling their log monitoring on a global level.
Our client was running seven application services across two data centers, US East-1 and Europe West-2, ingesting logs from six data sources on 50 different servers for six different teams. That was complicated enough to type out, let alone to sort the data to the correct teams. Now you can see why tagging and log management are not as simple as they seem on paper. Currently, our client’s database admins administer their MySQL databases in US East-1 and have no need to see irrelevant processes coming from order procurement in Europe West-2 or from any of the other teams working within the organization. Without proper sorting, critical signals will be lost as these teams manually sort through thousands of log messages.
Custom Tagging with BindPlane
Basic tagging in Google Stackdriver Logging is a good place to start for log management, but in a customer use case like this one, it gets complicated to manage at scale. Implementing BindPlane to help monitor your logs allows you to customize your log tagging at any scale. With BindPlane, you can customize each individual source and create templates to apply across many sources. These templates save you from having to manually recreate and enter each tagging option for every single data source.
Once these customizations have been applied to your different sources, users on different teams will save time and effort when trying to find exactly what they need. Rather than the order processing team in South Carolina manually sorting through the production team’s logs from Frankfurt, Germany, they can filter by location (US East-1), by the team the logs relate to (the order processing team), or by the specific application they want log data from, such as MongoDB.
Example Log Message with Tags:
Note the bindplane_app, function, and location tags.
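To make the benefit concrete, here is a minimal sketch of filtering log entries by the tags in their JSON payload. Only the tag names bindplane_app, function, and location come from the message above. The entry layout, the tag values, and the filter_by_tags helper are assumptions invented for this illustration, not BindPlane or Stackdriver APIs.

```python
from typing import Any

# Made-up entries for illustration; real log entries carry many more fields.
ENTRIES: list[dict[str, Any]] = [
    {"severity": "ERROR",
     "jsonPayload": {"bindplane_app": "mongodb", "function": "order-processing",
                     "location": "us-east", "message": "replica set member unreachable"}},
    {"severity": "INFO",
     "jsonPayload": {"bindplane_app": "nginx", "function": "production",
                     "location": "europe-west", "message": "GET /health 200"}},
    {"severity": "WARNING",
     "jsonPayload": {"bindplane_app": "mysql", "function": "database-admin",
                     "location": "us-east", "message": "slow query"}},
]


def filter_by_tags(entries: list[dict[str, Any]], **tags: str) -> list[dict[str, Any]]:
    """Return the entries whose jsonPayload matches every requested tag exactly."""
    return [
        entry for entry in entries
        if all(entry.get("jsonPayload", {}).get(key) == value
               for key, value in tags.items())
    ]


# Everything from one location, then one application for one team.
us_east = filter_by_tags(ENTRIES, location="us-east")
mongo_orders = filter_by_tags(ENTRIES, bindplane_app="mongodb",
                              function="order-processing")
```

In practice the same idea is expressed as a filter in the logging service itself. The point of the sketch is only that a consistent set of tags turns "find my team’s logs" into a simple equality match.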
Effectively managing log data can be challenging if your organization does not have a way to easily tag and identify the logs that are important to it. Join us in part two of the series to learn how to customize your log tagging in BindPlane for Stackdriver Logging and how it can increase efficiency and change your workflow.
What is MongoDB Replication Lag?
MongoDB is no different from other databases in that it relies on data replication, and even if we had quantum computers at our disposal, there would always be at least a small amount of lag when replicating operations from the primary to a secondary node. MongoDB replication lag is the interval between the time an operation is applied on the primary node and the time that same operation is applied from the oplog on the secondary node.
Why are You Experiencing MongoDB Replication Lag?
MongoDB replication lag occurs when the secondary node cannot replicate data fast enough to keep up with the rate at which data is being written to the primary node. This can occur for several reasons, so it can be hard to pinpoint exactly why you are experiencing replication lag. Some of the main culprits include network latency, disk throughput, concurrency, and large volumes of writes to MongoDB. Your replication lag could be caused by something as simple as network latency, packet loss within your network, or a routing issue, any of which could be slowing down replication from your primary node to your secondary.
One of the leading causes of replication lag in multi-tenant systems is slow disk throughput. If the secondary cannot write data to its disks as fast as the primary does, it will struggle to keep up. The secondary’s host may also run short of memory, I/O capacity, or CPU, which keeps data from being written to its disks and lets it fall further behind the primary.
Concurrency strikes again! As we mentioned in our last blog on MongoDB Lock Percentage, concurrency (while fully supported and well handled by MongoDB’s lock granularity) can sometimes cause unintended consequences within your system. In this case, large and long-running write operations will lock up the system and block replication to secondaries until they complete, increasing replication lag. Similarly, when you run frequent and large write operations, the secondary will be unable to read and apply the oplog as fast as the primary is being written to, and it will fall behind on replication.
When and Why Should you be Concerned?
As stated above, even with the most powerful computers and databases at your fingertips, you will see some replication lag, so the question is how much lag is too much. Ideally, in a healthy system, MongoDB replication lag should remain as close to zero as possible, but that’s not always going to be achievable. Sometimes the secondary nodes lag behind the primary and then catch up without any intervention, and that’s perfectly normal. However, if MongoDB replication lag stays persistently high and continues to rise, you will need to step in and remedy the situation before the quality of your database begins to degrade and you have even more problems on your hands.
You need to stay on top of this for a number of reasons. As you probably know, the main reason you have secondary nodes is so that one of them can take over if your primary node is no longer part of the active majority and steps down, or if it fails outright. You do not want an out-of-date secondary node taking over for your primary node. If it is so far behind that your database won’t function correctly, you may even have to take down your entire database until the new primary node is updated or the old one is recovered. Likewise, if your secondary node falls too far behind and your primary then fails and can’t be recovered, a large amount of manual reconciliation will be needed, which takes a lot of time and resources and creates headaches for everyone involved.
How to Monitor and Minimize MongoDB Replication Lag
Make sure to use all of the tools at your disposal when minimizing MongoDB replication lag. Since the oplog has limited space, you won’t want to let a secondary fall too far behind, or else it can no longer catch up with the primary from the oplog and you will need to run a full sync. It’s important to avoid a full sync wherever you can, since it can be extremely expensive. There are a few different methods to make sure you’re on top of MongoDB replication lag and keep it from getting out of hand. First, you will want to frequently check where the replication lag is sitting. To check the current replication lag, run this command in a mongo shell that is connected to the primary node: rs.printSlaveReplicationInfo() (renamed rs.printSecondaryReplicationInfo() in newer shell versions). This will return the ‘syncedTo’ value for each member, which shows when the most recent oplog entry was written to that secondary.
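The same check can be sketched in code. The snippet below computes each secondary’s lag as the difference between the primary’s and the secondary’s last applied operation time, using the members, stateStr, name, and optimeDate fields that the replSetGetStatus command reports. The host names and timestamps are made up for illustration, and the status document is hard-coded so the example runs on its own. With PyMongo you would typically obtain it with client.admin.command("replSetGetStatus").

```python
from datetime import datetime, timedelta


def replication_lag(status: dict) -> dict[str, timedelta]:
    """Map each secondary's name to how far its last applied op trails the primary's."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no primary in replica set status")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hypothetical, trimmed-down status document (host names are invented).
SAMPLE_STATUS = {
    "members": [
        {"name": "db1.example.net:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2019, 12, 3, 12, 0, 10)},
        {"name": "db2.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2019, 12, 3, 12, 0, 6)},
        {"name": "db3.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2019, 12, 3, 11, 45, 10)},
    ]
}

lag = replication_lag(SAMPLE_STATUS)
for name, delta in lag.items():
    print(f"{name}: {delta.total_seconds():.0f}s behind the primary")
```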
If you don’t want to manually check MongoDB replication lag every day, consider monitoring it with a service such as Google Cloud’s Stackdriver with BindPlane. BindPlane works in tandem with the leading data monitoring services, letting you monitor metrics such as “Slave Delay” and “replication count” within them and create intelligent alerts. Replication lag is, in effect, the amount of time it takes the secondary to read the oplog and replicate the data from the primary node, so if the secondary takes 15 minutes to do that, there is a 15-minute replication lag. That would obviously be too much, so you can set alerts to let you know when the lag exceeds your ideal limit, 2-5 minutes for example. Along with monitoring replication lag directly, you can also monitor the resource metrics of the secondary node, such as network I/O, CPU, and disk space, to make sure it keeps up with the primary. Alerts can be set up for these as well to help you stop excessive replication lag before it happens.
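As a rough sketch of what such an alert rule might look like, the function below fires only when the lag has stayed above the limit for several consecutive samples, so a brief spike that corrects itself does not page anyone. The function name, the 120-second default (the low end of the 2-5 minute range above), and the three-sample window are all assumptions for illustration. A monitoring service would express the same rule in its own alerting policy.

```python
def should_alert(lag_samples_s: list[float],
                 threshold_s: float = 120.0,
                 sustained: int = 3) -> bool:
    """True when the most recent `sustained` samples all exceed the threshold."""
    if len(lag_samples_s) < sustained:
        return False  # not enough history to call the lag persistent
    return all(sample > threshold_s for sample in lag_samples_s[-sustained:])


# A single spike that recovers on its own versus lag that keeps climbing.
spike_then_recovery = [10.0, 300.0, 15.0, 8.0]
persistent_growth = [30.0, 150.0, 200.0, 400.0]

print(should_alert(spike_then_recovery))
print(should_alert(persistent_growth))
```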
We are pleased to announce that BindPlane Logs for Stackdriver is now out of beta and is generally available to all Google Stackdriver customers. Stackdriver is Google Cloud’s monitoring and logging service that allows you to easily see how your systems are performing across platforms and providers, using metric and log data provided by BindPlane. This new integration, which was released in beta to customers in October, pulls hybrid cloud sources and on-premises workloads into Stackdriver Logging, allowing users to gain critical insights into their environments and get ahead of potential problems.
Together, Stackdriver and BindPlane bring an in-depth hybrid-cloud, multi-cloud and on-premises view into a single dashboard, connecting health and performance signals from a wide variety of Google and non-GCP sources. Once logs are ingested in Stackdriver, you can easily view and search through the raw log data and create log-based metrics. These metrics can then be used within Stackdriver to create monitoring charts and alerting policies, providing the ability to view logs and metrics side by side.
How BindPlane Logs for Stackdriver Works:
The BindPlane Logs Agent is deployed via a single-line install command for Windows and Linux and a YAML file for Kubernetes, which significantly shortens the time it takes to get data streaming into Google Stackdriver Logging. BindPlane’s centralized administration offers simplified updates and configuration changes. Because customers can create and push out all of their changes right from the BindPlane UI, they spend less time configuring JSON files. Combined with 24/7 customer support from Blue Medora, these factors significantly decrease the time it takes Stackdriver customers to get started with log monitoring and give them a more streamlined experience with Stackdriver Logging.
How Stackdriver Customers Benefit from BindPlane Logs:
Customers of the Alpha and Beta versions of BindPlane Logs gained insight into previously invisible systems by bringing logs from on-prem systems like Kubernetes, databases, and security systems into Stackdriver. This allowed them to leverage the same alerts, policies and SRE methods that they use with GKE. Some Stackdriver customers have even replaced their six-figure SIEM solutions with BindPlane logging, saving both time and money. As a reminder, all BindPlane metric and log collection functionality continues to be available at no additional cost to Stackdriver users.
Anyone who has ever monitored logs in software such as Active Directory knows that the number of logs can be overwhelming to comb through in order to find anything of importance. To help with this, Stackdriver Logging comes with the capability to create alerts that notify you when a certain event is triggered. For example, alerts can be set up within your environment to notify you if any of the constraints or limits for the objects in your schema have been violated. BindPlane automatically parses the log data it streams and includes a JSON payload that gives more context on what each log entry contains, such as the log type, the severity level and other insights depending on the event. This helps users read and interpret their logs for better understanding and quick analysis of their data.
Stackdriver Logging also comes with the ability to create log-based metrics, which are useful for locating trends within your system data. With log-based metrics, you can create metric charts that provide a visual representation of logs within a system. By comparing these charts, you can gain better insights into your network and systems and spot correlations between different log events. For example, Stackdriver Logging customers can track all of the log-on attempts that occur on their Active Directory and compare them to the number of log-on failures being returned. These graphs allow users to filter by time, helping them dive deeper into the data to gain stronger insights into any issues they might be having.
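The log-on example boils down to counting matching log entries per time window. Here is a minimal sketch of that idea, assuming each parsed entry exposes a timestamp and an outcome field. Both field names and the sample entries are hypothetical, and a real log-based metric would be defined with a filter in Stackdriver Logging rather than in application code.

```python
from collections import Counter
from datetime import datetime


def logon_counts_per_hour(entries: list[dict]) -> dict[datetime, tuple[int, int]]:
    """Return {hour: (attempts, failures)} so the two series can be charted together."""
    attempts: Counter = Counter()
    failures: Counter = Counter()
    for entry in entries:
        # Truncate the timestamp to the start of its hour to form the bucket key.
        hour = entry["timestamp"].replace(minute=0, second=0, microsecond=0)
        attempts[hour] += 1
        if entry["outcome"] == "failure":
            failures[hour] += 1
    return {hour: (attempts[hour], failures[hour]) for hour in sorted(attempts)}


# Made-up log-on events for illustration.
EVENTS = [
    {"timestamp": datetime(2019, 12, 3, 9, 5), "outcome": "success"},
    {"timestamp": datetime(2019, 12, 3, 9, 17), "outcome": "failure"},
    {"timestamp": datetime(2019, 12, 3, 9, 48), "outcome": "success"},
    {"timestamp": datetime(2019, 12, 3, 10, 2), "outcome": "failure"},
    {"timestamp": datetime(2019, 12, 3, 10, 3), "outcome": "failure"},
]

counts = logon_counts_per_hour(EVENTS)
for hour, (total, failed) in counts.items():
    print(f"{hour:%Y-%m-%d %H:00}  attempts={total}  failures={failed}")
```

An hour in which failures make up most of the attempts stands out immediately once the two series sit side by side.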
BindPlane Logs for Stackdriver Sources:
BindPlane for Stackdriver today supports 50+ Log sources, including, but not limited to:
Application web servers
These are just a small sample and the beginning of the preconfigured sources that BindPlane can integrate into Stackdriver Logging. By the end of the year, BindPlane will expand support to over 100 log types. Visit the BindPlane documentation page to see if your favorite sources are supported.
Ready to get started with BindPlane Logs for Google Stackdriver?
Visit our website to learn more about how BindPlane Logs for Stackdriver supports Stackdriver customers and register for free.
If you already have a BindPlane account created, start importing logs into Stackdriver today by logging in to your account.
If you don’t have a BindPlane account yet, get started today by signing up through the Google Cloud Platform Marketplace.
Stackdriver log integration expands functionality across on-prem and multi-cloud Kubernetes, databases and applications with 50+ log sources
GRAND RAPIDS, MICH. – Dec 3, 2019 – IT monitoring integration innovator Blue Medora today announced general availability and open access of 50+ log source integrations through its BindPlane log streaming platform, for all Google’s Stackdriver customers.
This follows the October beta release and announcement from Google Cloud bringing BindPlane’s log services to market at no additional licensing cost to Stackdriver customers. Stackdriver and BindPlane together bring an in-depth hybrid-cloud, multi-cloud and on-premises view into a single dashboard, connecting health and performance signals from a wide variety of Google and non-Google Cloud Platform (GCP) sources. The managed log streaming capability simplifies extending Stackdriver’s observability to enterprise customer data centers and other public clouds. The addition of log source ingestion complements BindPlane’s existing metrics data pipelining capabilities.
The 50+ preconfigured log source integrations include Kubernetes, Amazon EKS and Azure AKS, along with support for key workloads including Windows applications, Microsoft SQL Server, Oracle, Elasticsearch, Kafka, NGINX and more. They allow Stackdriver customers to unify event analysis, even when running multiple Kubernetes orchestrated services across Google Cloud, New Relic, Amazon Web Services, Microsoft Azure and private data centers. These capabilities also enable diagnosing production issues for application stacks running on Google Cloud VMs. With new product functionality including fully customizable log parsing and formatting, Stackdriver customers can now search on a specific log tag within groups of logs to identify issues faster and quickly match sources to log records.
BindPlane automates the collection and enhancement of diverse IT operations data and the metadata that exposes IT relationships. Designed to address the complexity of managing operations data in hybrid and multi-cloud environments, this data stream improves upon the analytics of popular monitoring platforms including Stackdriver, Azure Monitor, New Relic, VMware, Datadog and others. With the possibility of configuring multiple systems sending log data to Stackdriver, BindPlane now allows for remote agent updates from a central location, saving administrators time by not needing to access each individual system to update their log agents.
“The BindPlane for Stackdriver solution has saved us significant money and effort otherwise spent on managing additional SIEM logging tools and open source solutions,” said Timothy Wright, Founder at Eastbound Technology.
“As BindPlane continues to grow, introducing log monitoring was the next step for providing single pane health and performance monitoring to our customers,” said Mike Kelly, Chief Technology Officer, Blue Medora. “Our greatest value is continuing to offer a wide range of integrations and types of monitoring as possible. We’ll continue to build these integrations within our customers’ preferred monitoring tools, like Stackdriver, to make sure we offer the most extensive monitoring and observability capabilities.”
Stackdriver brings together application performance tracing, logs, and infrastructure monitoring functions into a single tool. “Our partnership with BindPlane has extended Stackdriver’s capabilities for monitoring logs and metrics to help support our customers’ dynamic and hybrid environments,” said Rami Shalom, Product Manager, Google Cloud. “The integration of BindPlane unlocks a real-time, dimensional data stream that allows our customers to monitor non-GCP public cloud resources that are vital to understanding their full picture health, all within Stackdriver.”
BindPlane currently supports as many as 150 operations data sources that support monitoring of non-GCP public cloud resources including Amazon AWS, Microsoft Azure, Alibaba Cloud, and IBM Cloud. It also monitors critical workloads and databases on GCP (or data center) virtual machines and on-premises infrastructure.
Details on how to deploy Stackdriver with BindPlane can be found on Google Cloud’s website.
About Blue Medora
Blue Medora’s pioneering IT monitoring integration as a service addresses today’s IT challenges by easily connecting system health and performance data–no matter its source–with the world’s leading monitoring and analytics platforms. Blue Medora helps customers unlock dimensional data across their IT stack, otherwise hidden by traditional approaches to metrics collection.
MongoDB Multi-granular Locking System
Are you experiencing high queues or slow response times in your MongoDB database? Then you may have a deeper issue within your database: a high MongoDB lock percentage. MongoDB can be a large and complicated database to manage and maintain, which means you will probably have multiple team members accessing and working on data at the same time. MongoDB employs a multi-granular locking system that allows users to work in the database concurrently and view the same data without making conflicting modifications. This locking system stops users from writing while someone else is reading or writing, and vice versa. However, it does not lock the database when two or more users are only reading simultaneously. MongoDB tracks how often your database is locked with a “Lock Percentage” metric. This percentage is calculated on two separate levels, “database” and “global”. On the “database” level, MongoDB will keep two concurrent users from writing to the same database, but will not lock out two users on separate databases unless there is a global rule in place to lock one system when another related one is being used.
What is an Average MongoDB Lock Percentage?
If your database is a read-heavy system, where most users won’t be modifying data on a regular basis, then you should expect a low MongoDB lock percentage, somewhere around 10%. A write-heavy database may have an average lock percentage of around 20%, and it can even exceed 60%. Even in a write-heavy environment where you often have multiple users working on the data concurrently, you will still want to keep your lock percentage as low as is feasible.
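For a sense of the arithmetic behind such a figure, a lock percentage over an interval is the time spent locked divided by the elapsed time, times 100. The sketch below assumes you have two samples of a pair of cumulative counters, time held under lock and total time, both in microseconds. The counter names and sample values are hypothetical and are not tied to any specific MongoDB server status field.

```python
from typing import NamedTuple


class LockSample(NamedTuple):
    """Hypothetical cumulative counters, both in microseconds."""
    lock_time_us: int
    total_time_us: int


def lock_percentage(previous: LockSample, current: LockSample) -> float:
    """Percentage of the interval between two samples that was spent locked."""
    elapsed = current.total_time_us - previous.total_time_us
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    locked = current.lock_time_us - previous.lock_time_us
    return 100.0 * locked / elapsed


# Locked for 0.6 s of a 3 s interval.
earlier = LockSample(lock_time_us=1_000_000, total_time_us=10_000_000)
later = LockSample(lock_time_us=1_600_000, total_time_us=13_000_000)

pct = lock_percentage(earlier, later)
print(f"lock percentage over the interval: {pct:.1f}%")
```

Working from the difference between two samples, rather than from the raw cumulative counters, shows what the database is doing now instead of its average since startup.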
Find the Root Cause
A high lock percentage may be the root cause of other issues that you are experiencing in your MongoDB database. When you have a high lock percentage, you may experience slower response and application call times, high queues, increased replication lag, high CPU usage, and even failures. To avoid these issues and keep your MongoDB lock percentage low, we recommend that you set up a monitoring system, such as Google Stackdriver or New Relic, to make sure the lock percentage does not exceed your set limits. You may also want to set up alerts on other key performance indicators, with metrics such as available read and write tickets, intent exclusive lock times, and many more. Monitoring these metrics should help you stay on top of everything and understand why your MongoDB lock percentage may be so high.
Stream MongoDB metrics with BindPlane
BindPlane can stream MongoDB metrics and logs to your data monitoring service of choice, and if you already use Google Stackdriver, you can use BindPlane at no extra cost. Read this blog to learn how to configure and deploy MongoDB within BindPlane so you can begin monitoring your MongoDB lock percentage and other KPIs.