This article originally appeared on InfoWorld
Too many alerts and alarms can make it tough for IT to catch the critical ones
We’ve all experienced it: being bombarded by monitoring alerts and alarms that ultimately make us ignore them. Let’s look at how to streamline them to get meaningful data and a good night’s sleep.
One of the most insidious problems in IT monitoring is dealing with false alarms and false positives. Given how detrimental they can be to a system's mean time to repair (MTTR), it's surprising how pervasive they have become. Every company I've spoken to has acknowledged that it has at least one alerting subsystem that is entirely ignored, and frequently more than one. Alarm overload and alarm fatigue have taken over most companies.
This isn’t a problem unique to IT. The medical community has struggled with it in hospitals for ages: the frequent alarms of medical equipment lead to clinicians who are desensitized to alarms and more likely to miss a true critical issue. This is regularly cited as a top patient safety concern.
Unfortunately, neither the medical community nor the IT industry has fully solved this problem, but we can offer some observations to help identify when the issue begins to creep into the lives of technology companies.
The O’Reilly book Site Reliability Engineering: How Google Runs Production Systems has some thoughtful advice on developing monitoring solutions. It provides a brief description of the ideal (and only) outputs of a monitoring system: alerts, which require a human to act immediately; tickets, which require a human to act, but not right away; and logging, which nobody needs to look at unless something else prompts them to.
You could call this simplistic, but teams frequently lose sight of the basics. Alerts become a catchall for any information someone found useful at any point in that software’s history. Teams always need to deliberately come back to the idea that alerts should only fire if action needs to be taken immediately.
Assuming you have a working notification system (meaning when an alert is fired, someone on the IT team sees it), an easy way to tell if an alert was meaningful is to see if any manual action was taken that resolved it. If an alert is fired and it doesn’t result in action, it should never have been fired in the first place. That is an alert that should be deleted. Do this for a week and you should have reduced your alerts to a meaningful set of information.
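The week-long review described above can be sketched as a simple script. This is a minimal illustration, not a real monitoring integration; the `AlertRecord` structure and the alert names are hypothetical stand-ins for whatever your alerting system exports.

```python
from dataclasses import dataclass

@dataclass
class AlertRecord:
    """One alert definition, with a week's worth of tallies (hypothetical schema)."""
    name: str
    fired_count: int = 0
    actioned_count: int = 0  # times a human took manual action to resolve it

def review_alerts(records):
    """Partition alerts: keep those that drove manual action, flag the rest for deletion."""
    keep, delete = [], []
    for r in records:
        (keep if r.actioned_count > 0 else delete).append(r.name)
    return keep, delete

# Example week of data (made-up numbers):
week = [
    AlertRecord("db-replica-lag", fired_count=3, actioned_count=3),
    AlertRecord("disk-70-percent", fired_count=42, actioned_count=0),
]
keep, delete = review_alerts(week)
# keep   -> ["db-replica-lag"]
# delete -> ["disk-70-percent"]  (fired 42 times, never acted on: remove it)
```

Run against a real week of alert history, the `delete` list is your starting point for the conversation in the next section.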
Time for an alert intervention
But removing alerts is challenging. Not technically—it’s easy to remove or silence an alert. But the arguments you and your team members will make to justify keeping it are the sticking point. No one wants to lose visibility, so even if you’re not taking action, isn’t it still valuable just to know? Or maybe it’s simply complacency. Alerts will creep up on you slowly, but before long you can have an unmanageable environment.
Several years ago, I talked with a company that had taken this to heart and was ruthless about managing its monitoring environment. It had recently gone through an incredibly difficult site rollout, and its site reliability team was a mess. The team was short-staffed and primarily based in a single time zone, which meant long work days followed by on-call hours for the whole team. Nearly every day, someone on the team woke up multiple times per night to review an alert. Sometimes it was real and something needed fixing, but usually it wasn’t. After a few weeks of this, the team members were understandably at their wits’ end. In response, they began reviewing every single alert that their system produced, every morning. They asked the questions: Was manual intervention needed? Could it have been automated? Was it needed immediately, or could it have waited?
If it wasn’t urgent, they turned it into a ticket that was reviewed during business hours—and didn’t wake anyone up. If it was a consistent issue that had an automated solution, they automated it (the only thing waking up is a process). If it wasn’t a valid alert to begin with, they deleted it.
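That morning review amounts to a small decision procedure. Here is one way to sketch it; the function name and outcome labels are my own, not the company's.

```python
def triage(needs_manual_action: bool, can_automate: bool,
           needed_immediately: bool) -> str:
    """Map the three morning-review questions to an outcome for one alert."""
    if not needs_manual_action:
        return "delete"    # not a valid alert to begin with
    if can_automate:
        return "automate"  # the only thing waking up should be a process
    if needed_immediately:
        return "page"      # a genuine alert: keep it, and keep waking someone
    return "ticket"        # real work, but it can wait for business hours

# A noisy disk-space warning that a cleanup job could handle:
triage(needs_manual_action=True, can_automate=True, needed_immediately=False)
# -> "automate"
```

The point of writing it down this way is that only one of the four outcomes, "page", is allowed to interrupt a human at 2 a.m.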
Breaking up is hard to do
Ultimately, the last step is the hardest. It can be incredibly difficult to delete an alert that seems to contain meaningful information, even when there’s nothing to do with it. Don’t you want to know about a spike in service latency, or a health warning coming from a server?
In the early stages of a project, you may want these details. These are certainly anomalies, and it feels like you should look into things. But as a system increases in size and complexity, these anomalies become the norm. While you shouldn’t bury them, they no longer warrant the 2 a.m. phone call. They move from the Alert category to the Log category and should be reviewed regularly, but never alerted unless they correlate directly to service interruption.
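One way to enforce "log it unless it correlates with service interruption" is to gate the page on a second, user-facing signal. This is a hedged sketch: the thresholds and the idea of using error rate as the interruption proxy are assumptions for illustration, not a recommendation for your specific system.

```python
def should_page(latency_p99_ms: float, error_rate: float,
                latency_threshold_ms: float = 500.0,
                error_threshold: float = 0.01) -> bool:
    """Page only when a latency spike coincides with user-visible errors.

    A latency spike alone is an anomaly worth logging and reviewing later;
    paired with an elevated error rate, it's a service interruption.
    """
    return latency_p99_ms > latency_threshold_ms and error_rate > error_threshold

should_page(800.0, error_rate=0.002)  # -> False: log it, review in the morning
should_page(800.0, error_rate=0.05)   # -> True: users are affected, wake someone
```

The same pattern applies to the server health warning: by itself it goes to the log for regular review, and only a correlated user-facing symptom escalates it to a page.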
Fortunately, this approach of daily alert reviews is becoming more widely adopted by the IT community, although it typically happens out of desperation, when teams are no longer able to manage the load.
So it’s better to get ahead of it, before your team is lost in alarm and alert overload.