Update: Google Stackdriver is now Google Cloud Logging and Google Cloud Monitoring. BindPlane will continue to integrate and support both of these products.
MongoDB is no different from other databases in that it relies on data replication, and even with quantum computers at our disposal, there would always be at least a small amount of lag when replicating operations from the primary node to the secondaries. MongoDB replication lag is specifically the interval between an operation running on the primary node and that same operation, read from the oplog, being applied on the secondary node.
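In other words, replication lag is simply the gap between the timestamp of the newest operation applied on the primary and the newest operation the secondary has applied. A minimal sketch in Python (the timestamps here are made-up values, purely for illustration):

```python
from datetime import datetime, timedelta

def replication_lag(primary_last_op: datetime, secondary_last_op: datetime) -> timedelta:
    """Lag is the gap between the newest oplog entry on the primary
    and the newest entry the secondary has applied."""
    return primary_last_op - secondary_last_op

# Hypothetical operation times for illustration:
primary = datetime(2020, 5, 1, 12, 0, 30)
secondary = datetime(2020, 5, 1, 12, 0, 10)
print(replication_lag(primary, secondary))  # 0:00:20
```

A lag of a few seconds like this is normal; the rest of this post is about what to do when that interval keeps growing.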
MongoDB replication lag occurs when the secondary node cannot replicate data fast enough to keep up with the rate at which data is being written to the primary node. This can happen for a few reasons, so it can be hard to pinpoint exactly why you are experiencing replication lag. Some of the main culprits include network latency, disk throughput, concurrency, and heavy write loads. Your replication lag could be caused by something as simple as network latency, packet loss within your network, or a routing issue, any of which could slow replication from your primary node to your secondaries.
One of the leading causes of replication lag in multi-tenant systems is slow disk throughput. If the filesystem on the secondary can't write data to disk as fast as the primary, the secondary will struggle to keep up. Secondary nodes may also run short on memory, disk I/O, or CPU, preventing data from being written to their disks and letting them fall further behind the primary.
Concurrency strikes again! As we mentioned in our last blog on MongoDB Lock Percentage, concurrency (while fully supported and well handled by MongoDB's granular locking) can sometimes cause unintended consequences within your system. In this case, large, long-running write operations can hold locks that block replication to the secondaries until they complete, increasing replication lag. Relatedly, under frequent, large write operations the secondary will be unable to read the oplog as fast as the primary writes to it, and will fall behind on replication.
As stated above, even with the most powerful computers and databases at your fingertips, you will see some replication lag, so the question is: how much lag is too much? Ideally, in a healthy system, MongoDB replication lag should remain as close to zero as possible, but that's not always achievable. Sometimes the secondary nodes will lag behind the primary and then catch up on their own without any intervention, and that's perfectly normal. However, if replication lag stays persistently high and continues to rise, you will need to step in and remedy the situation before the health of your database degrades and you have even bigger problems on your hands.
You need to stay on top of this for a number of reasons. As you probably know, the main reason you have secondary nodes is so one can take over if your primary node steps down because it is no longer part of the active majority, or if it flat-out fails. You do not want an out-of-date secondary taking over as primary, and if it is so far behind that your database won't function correctly, you may even have to take the entire database down until the new primary catches up or the old one is recovered. Along the same lines, if your secondary falls too far behind the primary and operations stop being replicated, then a primary failure that can't be recovered will leave you with a large amount of manual reconciliation, costing time and resources and creating headaches for everyone involved.
Make sure to use all of the tools at your disposal to minimize MongoDB replication lag. Since the oplog has limited space, you won't want the secondary to fall too far behind, or it won't be able to catch up with the primary; if that happens, you will need to run a full resync. It's important to avoid a full resync at all costs, since it can be extremely expensive. There are a few methods to make sure you stay on top of MongoDB replication lag and keep it from getting out of hand. First, you will want to check frequently where the replication lag interval is sitting. To check the current replication state, run this command in a mongo shell connected to the primary node: rs.printSlaveReplicationInfo() (renamed rs.printSecondaryReplicationInfo() in newer MongoDB versions). This returns the 'syncedTo' value for each member, showing when the most recent oplog entry was applied on that secondary.
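The same per-member lag figures can also be derived from the optimeDate fields that rs.status() reports for each replica set member. Here is a hedged Python sketch of that calculation; the status document below is mocked for illustration (a real deployment would fetch it from the server, e.g. via a driver's replSetGetStatus command):

```python
from datetime import datetime

def member_lag(status: dict) -> dict:
    """Compute each secondary's lag in seconds behind the primary,
    using the optimeDate fields from an rs.status()-style document."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }

# Mocked rs.status()-style document; member names and times are made up:
status = {
    "members": [
        {"name": "mongo-0:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2020, 5, 1, 12, 0, 30)},
        {"name": "mongo-1:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2020, 5, 1, 12, 0, 25)},
    ],
}
print(member_lag(status))  # {'mongo-1:27017': 5.0}
```

Checking this number on a schedule, rather than by hand, is exactly what the monitoring approach below automates.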
If you don't want to manually check MongoDB replication lag every day, consider monitoring it with a service such as Google Cloud's Stackdriver with BindPlane. BindPlane works in tandem with the leading monitoring services, letting you track metrics such as "Slave Delay" and "replication count" and create intelligent alerts. By definition, replication lag is the time it takes the secondary to read the oplog and replicate the data from the primary node, so if it takes 15 minutes for the secondary to read the oplog, there is a 15-minute replication lag. That would obviously be too much, so to mitigate it you can set alerts to notify you when the replication interval exceeds your ideal window, 2-5 minutes for example. Along with monitoring replication lag directly, you can also monitor the disk-related metrics of the secondary node to make sure it keeps up with the primary, for example Network I/O, CPU, and disk space. Alerts can be set up for these as well, to keep you on your toes and help you stop excessive replication lag before it happens.
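The alerting logic itself is simple: flag any secondary whose lag exceeds your chosen window. A minimal sketch, assuming a 5-minute threshold (the upper end of the 2-5 minute window mentioned above) and hypothetical member names:

```python
def lag_alerts(lag_seconds: dict, threshold: float = 300.0) -> list:
    """Return the names of secondaries whose replication lag (in seconds)
    exceeds the alert threshold, defaulting to 5 minutes."""
    return [name for name, lag in lag_seconds.items() if lag > threshold]

# Hypothetical per-member lag readings, e.g. produced by a monitoring agent:
readings = {"mongo-1:27017": 120.0, "mongo-2:27017": 900.0}
print(lag_alerts(readings))  # ['mongo-2:27017']
```

A hosted service like BindPlane runs this kind of threshold check for you continuously and handles the notification delivery, which is why it beats a daily manual check.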