If you use Elasticsearch's query functionality, mainly for front-facing client search, there are three important metrics to monitor for performance.
Elasticsearch Query Load
Your cluster can be handling any number of queries at a time. The volume of queries over time roughly tracks the load those requests place on the cluster. Unexpected peaks and valleys in a time series of query load could be signs of a problem or of an optimization opportunity.
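As a minimal sketch of how a query-load time series could be built, the snippet below turns periodic samples of a cumulative query counter into queries per second. It assumes you are already collecting `(timestamp, query_total)` pairs, for example from the `indices.search.query_total` field of the node stats API. The function name and the sample values are hypothetical.

```python
def query_rates(samples):
    """Convert (unix_seconds, query_total) samples, in time order,
    into a list of queries-per-second values, one per interval."""
    rates = []
    for (t0, q0), (t1, q1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        # Skip intervals with no elapsed time or a counter reset (e.g. node restart).
        if elapsed <= 0 or q1 < q0:
            continue
        rates.append((q1 - q0) / elapsed)
    return rates


# Hypothetical samples taken 10 seconds apart.
samples = [(0, 1000), (10, 1500), (20, 1600)]
rates = query_rates(samples)
```

Plotting these rates over time is one way to spot the peaks and valleys described above.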
Elasticsearch Query Latency
The average query latency, calculated from the total query count and the total query time sampled at regular intervals, tells you how your available resources are performing under your current conditions. Establish a ceiling for query latency: if latency breaches that maximum, there could be resource strain or an opportunity for optimization.
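The calculation can be sketched as follows. This assumes two samples of the `indices.search` section of the Elasticsearch node stats, which exposes the cumulative counters `query_total` and `query_time_in_millis`. The function name, the sample values, and the 20 ms ceiling are hypothetical, chosen only for illustration.

```python
def average_latency_ms(prev, curr, kind="query"):
    """Average latency per operation (in ms) between two counter samples.

    `prev` and `curr` are dicts holding `<kind>_total` and
    `<kind>_time_in_millis`. Returns None if no operations ran
    in the interval or the counters were reset.
    """
    ops = curr[f"{kind}_total"] - prev[f"{kind}_total"]
    millis = curr[f"{kind}_time_in_millis"] - prev[f"{kind}_time_in_millis"]
    if ops <= 0 or millis < 0:
        return None
    return millis / ops


# Hypothetical samples taken one interval apart.
prev = {"query_total": 1000, "query_time_in_millis": 5000}
curr = {"query_total": 1200, "query_time_in_millis": 6000}

latency = average_latency_ms(prev, curr)

CEILING_MS = 20  # hypothetical ceiling; pick one that fits your workload
breached = latency is not None and latency > CEILING_MS
```

Passing `kind="fetch"` applies the same arithmetic to the `fetch_total` and `fetch_time_in_millis` counters.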
Elasticsearch Fetch Latency
As the second part of Elasticsearch’s search process, fetch follows the query step to deliver the requested data. Fetch latency should be considerably lower than your query latency. Normal behavior is indicated by a level, constant fetch latency. Should fetch latency begin to rise, there are likely issues developing within your resources.
Full Elasticsearch monitoring
The machine that runs your instance of Elasticsearch shows the vital signs of its performance. Keeping an eye on CPU, memory usage, and disk I/O helps you maintain good Elasticsearch node performance in production.
Elasticsearch node | CPU Performance
You may notice that your Elasticsearch instance can easily eat up CPU. CPU peaks are expected, but underlying issues could be lurking behind them. Whether or not there is a clear performance issue, there is usually an opportunity for performance optimization. The Java Virtual Machine (JVM) indicators will likely coincide with the CPU spikes you see in your Elasticsearch node performance. Match the spikes in JVM metrics with the node's CPU spikes to uncover the underlying cause.
Elasticsearch node | memory usage
It is perfectly normal to see no free memory on the machine running your Elasticsearch instance. This is no reason to panic, because you want your machine to use all of the available memory. However, cached memory is something to keep your eye on. If you see that cached memory is running low, you can expect available RAM to be running low as well.
Elasticsearch node | disk I/O rate
When Elasticsearch is deployed as a search engine, you can expect disk I/O to be put to the test. A reduction in disk I/O on the machine is a sign of underlying problems. Let this be a catalyst to troubleshoot what the culprit may be.
The ratio between read and write operations will vary based on the particular usage of Elasticsearch you have deployed. Depending on the ratio within the node, indexing and query performance could be sources of optimization.
Last week, Google Cloud Platform (GCP) announced that it has rebranded Stackdriver, the monitoring and logging platform that Google acquired in 2014, as part of its new Google Cloud Operations platform. This rebrand included renaming Google Stackdriver Monitoring to Google Cloud Monitoring and Google Stackdriver Logs to Google Cloud Logging. So what does this mean for Stackdriver customers?
While I for one am excited to see Google pulling all of its operations products together, I also want to be clear that, other than a few new feature releases, these products are in fact still Stackdriver! We are looking at this rebrand as essentially being Google Stackdriver 2.0. It allows Google to say goodbye to the Stackdriver brand as it fully embraces its Google-esque naming conventions to make it clear what Stackdriver delivers. The new Google Cloud Operations SKU enables Google to take the monitoring and logging functionality that Stackdriver customers know and love and promote it to the "Googleverse" so that other GCP customers can also benefit.
This change in direction can be seen in the recent merging of the Stackdriver Metrics UI into the Google Cloud Console, a change that will make for a more unified experience in the Google ecosystem.
Google Monitoring is now available in the same console as all the other services.
BindPlane Logs and Metrics will continue to integrate with Google Cloud Logging and Google Cloud Monitoring to extend Google Cloud’s monitoring capabilities to on-prem, hybrid cloud and multi-cloud environments. This allows GCP users to manage more than 150 of the most common non-GCP technology sources, and improves observability, all within Google Cloud. One of the most exciting parts of this new release is that Google added a few feature updates that BindPlane customers have been asking for! Some of these features include:
- Dashboard API to create and share dashboards across projects
- Log storage for up to 10 years
- Metrics retention for up to 24 months
- Increased granularity of metric writes, up to 10 seconds
As Google continues its momentum with Google Cloud Operations, one thing is for sure: whether we call it Stackdriver or Google Cloud Logging and Monitoring, BindPlane will continue to help GCP customers extend their visibility to on-prem and hybrid clouds so they can accurately troubleshoot, monitor, report, and alert in real time on their full stack, all within GCP!
Get Started with BindPlane for Google Cloud Logging & Monitoring:
Registering for BindPlane to support Google Cloud Operations is still the same process. You can sign up for your BindPlane account here.
What are Page Faults in MongoDB?
In general, page faults occur when the database fails to read data from RAM and is forced to read it from the physical disk. MongoDB page faults occur when the database fails to read data from virtual memory and must read the data from the physical disk. Most databases cache data in RAM as often as possible to avoid reading from physical disks, since that is slow and costs you valuable time. We all wish that all of our data could be stored in RAM, but that’s extremely expensive and usually infeasible, so the database will inevitably need to read from disk.
Depending on the version of MongoDB that you’re running, the storage engine you are using is either MMAPv1 or WiredTiger. MMAPv1 was MongoDB’s original storage engine and remained the default until 3.2, when WiredTiger replaced it as the default storage engine. MMAPv1 was officially deprecated in 4.0 and removed as of the release of 4.2. This blog mainly focuses on MMAPv1, since it is the older, deprecated technology and page faults are not tracked as a relevant statistic in WiredTiger.
Older versions of MongoDB store documents and indexes in data files, which MMAPv1 maps into virtual memory. MMAPv1 reads data from these memory-mapped files through the mmap() syscall. Since MMAPv1 relies heavily on virtual memory for its processes, it is prone to page faults, as there is only a limited amount of space in which to cache data in virtual memory.
Find The Root Cause
As mentioned in our previous MongoDB troubleshooting blogs, MongoDB page faults, just like database locking and replication lag, are a common occurrence in a MongoDB database running MMAPv1. If they begin to happen consistently, or at a higher than normal volume, you may need to take action. MongoDB page faults are usually indicative of a deeper problem within your system. Because it is an older technology, MMAPv1 sees page faults more often. Its constraints include a lack of data compression options, its use of all free memory as cache, and its inability to scale, all of which come back to not having enough available RAM to read from. MongoDB page faults may also happen due to unindexed queries. This may be the root cause if you notice a high ratio of page faults to operations, usually 1:1 or higher.
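One way to check the ratio of page faults to operations is sketched below. It assumes two snapshots of MongoDB's `serverStatus` output, which reports cumulative `extra_info.page_faults` and per-type `opcounters`. The function name and sample values are hypothetical, and the 1:1 threshold is the rule of thumb mentioned above, not a hard limit.

```python
OPCOUNTER_KEYS = ("insert", "query", "update", "delete", "getmore", "command")


def page_fault_ratio(prev, curr):
    """Page faults per operation between two serverStatus-style snapshots.

    Returns None if no operations ran in the interval or the
    counters went backwards (e.g. after a restart).
    """
    faults = curr["extra_info"]["page_faults"] - prev["extra_info"]["page_faults"]
    ops = sum(
        curr["opcounters"].get(key, 0) - prev["opcounters"].get(key, 0)
        for key in OPCOUNTER_KEYS
    )
    if ops <= 0 or faults < 0:
        return None
    return faults / ops


# Hypothetical snapshots, trimmed to the fields used here.
prev = {"extra_info": {"page_faults": 100},
        "opcounters": {"insert": 50, "query": 1000}}
curr = {"extra_info": {"page_faults": 400},
        "opcounters": {"insert": 150, "query": 1200}}

ratio = page_fault_ratio(prev, curr)
suspect_unindexed_queries = ratio is not None and ratio >= 1.0
```

A ratio at or above 1.0 here would be a prompt to review query plans and indexes, not proof of a missing index.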
Minimize and Prevent Future Page Faults
Now that you know what causes page faults and what the underlying problem could be, it’s time to stop them from happening too often!
In a perfect world, you would be able to completely stop MongoDB page faults from ever happening again. Sadly, we don’t live in a perfect world, so your best hope is to minimize their occurrence. Since there are a few causes of MongoDB page faults, there are also a few different methods of preventing them. MMAPv1’s performance relies heavily on the amount of RAM available and on caching data in virtual memory, so the first preventative measure to take is to monitor the amount of RAM that your systems have available for use. You can use mongostat, which returns stats from MongoDB, including page faults. mongostat is a pretty basic monitoring tool that won’t give you much insight into why problems are arising. You may want to consider setting up a more comprehensive monitoring system for your entire environment with services like Google Cloud Monitoring and Logging, or New Relic monitoring.
A more comprehensive monitoring solution lets you stay on top of the problem and be proactive instead of reactive. BindPlane can be implemented with these monitoring services and lets you monitor and set up alerts for metrics relevant to page faults, including file size, index size, the number of indexes, memory usage (mapped, resident, virtual) and a lot more. You can find the rest in our MongoDB for Stackdriver docs.
Along with monitoring the metrics relevant to MongoDB page faults, you can also make sure your data is organized into working sets that fit into memory and won’t use more RAM than required. You should also make sure your data is indexed correctly in MongoDB. Indexing is very important when it comes to executing queries efficiently. When your data isn’t indexed, it takes more RAM to access it, which can cause page faults if there isn’t enough RAM available in your environment. Visit the MongoDB docs on indexing to learn more about properly indexing your data.
Unless there are extenuating circumstances, if you are using MMAPv1, you might want to consider upgrading to the newest version of MongoDB and moving to WiredTiger. It could be difficult to migrate all of your data to a new engine, but in the long run it is worth the upgrade now that MMAPv1 has been deprecated and is no longer supported by MongoDB.
WiredTiger Storage Engine:
MongoDB’s storage engine that was released in version 3.0 and has been the default engine since MongoDB version 3.2.
MMAPv1 Storage Engine:
MongoDB’s original storage engine, which was deprecated in MongoDB 4.0.
WiredTiger vs MMAPv1 Data compression
WiredTiger Data compression: With its own write cache and a filesystem cache, as well as support for snappy and zlib compression, WiredTiger takes up much less space than MMAPv1.
MMAPv1 Data compression: MMAPv1 is based on memory-mapped files and does not support data compression. It does, however, handle high-volume operations well.
WiredTiger vs MMAPv1 Journaling
WiredTiger Journaling: Checkpoints are at the core, while the journal records all data changes made between checkpoints. To recover from a crash, the journal entries written since the last checkpoint are replayed.
MMAPv1 Journaling: In the event of a crash, MongoDB with MMAPv1 will use the journal files to restore a consistent state.
WiredTiger vs MMAPv1 Locks and Concurrency
WiredTiger Locks and Concurrency: Employs document-level locking, with intent locks used only at the global, database, and collection levels.
MMAPv1 3.0 Locks and Concurrency: Uses collection-level locking.
MMAPv1 2.6: Allows concurrent reads of a database, but a single write operation gets exclusive access to it.
WiredTiger vs MMAPv1 Memory
WiredTiger Memory: MongoDB with WiredTiger uses both a filesystem cache and an internal cache. All free memory is used by MongoDB. By default, starting in MongoDB 3.4, the WiredTiger internal cache uses either 50% of (RAM minus 1 GB) or 256 MB, whichever is larger.
MMAPv1 Memory: MongoDB with MMAPv1 will use as much of the available memory as it can. However, MongoDB will yield cached memory when another process requires at least half of the server’s RAM.
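To make the WiredTiger cache sizing concrete, here is a small sketch of the arithmetic. It assumes the default documented for MongoDB 3.4 and later, where the internal cache is the larger of 50% of (RAM minus 1 GB) and 256 MB. The function name is hypothetical, and an explicitly configured cache size would override this default.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def default_wiredtiger_cache_bytes(total_ram_bytes):
    """Default WiredTiger internal cache size in bytes, assuming the
    MongoDB 3.4+ rule: max(50% of (RAM - 1 GiB), 256 MiB)."""
    half_of_ram_minus_1gib = (total_ram_bytes - GIB) // 2
    return max(half_of_ram_minus_1gib, 256 * MIB)


# A 16 GiB server gets a 7.5 GiB internal cache by default.
cache_16gib = default_wiredtiger_cache_bytes(16 * GIB)

# A 1 GiB server falls back to the 256 MiB floor.
cache_1gib = default_wiredtiger_cache_bytes(1 * GIB)
```

The rest of the machine's free memory remains available to the filesystem cache.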
WiredTiger vs MMAPv1 Comparison Table
| Feature | WiredTiger | MMAPv1 |
|---|---|---|
| Updates | Documents are forced to rewrite. In-place updates not supported. | High volume & in-place updates. |
| CPU Performance | Multi-core system performance | More CPU cores do not mean better performance |
| Transaction | Multi-document transactions | Atomic operations on a single document |
| Encryption | Encryption at rest | Not possible |
| Memory | Internal & filesystem cache | Uses all free memory as cache |
| Tuning | More variables available for tuning | Fewer opportunities to tune |
| Locks & Concurrency | Document-level locking | Collection-level locking |
| Journaling | Checkpoints used & journals used between checkpoints | Uses journal files for a consistent state |
| Data Compression | Snappy & zlib compression | Not supported |