Monitoring systems is a complex problem that looks deceptively simple from afar.

Monitoring has three important blocks:

  • Event aggregation
  • Alarms
  • Notification trigger

Alarms

Alarms broadly fall into two categories:

  • Alarms that should wake you up, or make you stop working and look at them immediately. These should go to SMS, phone and other high-attention channels.
  • Alarms that are meant as an FYI. These should go into some kind of inbox where they can be summarized and understood: email or other channels that are easy to review.
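
The split above can be sketched as a small routing table. This is only an illustration: the `Severity` enum, the channel names and `route_alarm` are hypothetical and not tied to any particular notification tool.

```python
from enum import Enum


class Severity(Enum):
    PAGE = "page"  # wake someone up, act immediately
    FYI = "fyi"    # review later in a digest


# Hypothetical channel names; substitute whatever your notifier supports.
CHANNELS = {
    Severity.PAGE: ["sms", "phone"],
    Severity.FYI: ["email"],
}


def route_alarm(severity: Severity) -> list[str]:
    """Return the channels an alarm of this severity should be sent to."""
    return list(CHANNELS[severity])


print(route_alarm(Severity.PAGE))  # ['sms', 'phone']
print(route_alarm(Severity.FYI))   # ['email']
```

Keeping the mapping in one place makes it easy to audit which alarms are allowed to interrupt someone.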

Alert fatigue is real: people stop trusting monitoring when there are too many unactionable alerts.

Types of alarms

  • Domain alarms: user sign-ups < 1, transaction failures > 1, etc.
  • Infrastructure alarms: CPU > 90%, memory > 95%
  • Application alarms: 4XX / 2XX > 5%, etc.
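
As a rough sketch, the three types above could be written as named predicates over a snapshot of metrics. The metric names, the sample values and the `ratio` helper are assumptions made for illustration; the thresholds are the ones from the list.

```python
# Hypothetical metric snapshot for one evaluation window.
metrics = {
    "user_signups": 0,
    "transaction_failures": 3,
    "cpu_percent": 42.0,
    "memory_percent": 97.5,
    "http_4xx": 12,
    "http_2xx": 400,
}


def ratio(numerator: float, denominator: float) -> float:
    """Ratio that is 0.0 instead of an error when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Each rule maps a name to a predicate over the snapshot.
RULES = {
    "domain: no sign-ups": lambda m: m["user_signups"] < 1,
    "domain: transaction failures": lambda m: m["transaction_failures"] > 1,
    "infra: cpu": lambda m: m["cpu_percent"] > 90,
    "infra: memory": lambda m: m["memory_percent"] > 95,
    "app: 4XX to 2XX ratio": lambda m: ratio(m["http_4xx"], m["http_2xx"]) > 0.05,
}

firing = [name for name, check in RULES.items() if check(metrics)]
print(firing)
```

With this sample snapshot, the sign-up, transaction failure and memory rules fire, while the CPU and 4XX ratio rules stay quiet.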

Lessons learnt while crafting alerts

Minimize False Positives

More alerts cause attention fatigue, which in turn causes us to lose signal in the noise. Work towards a false positive rate that is near zero.

Use a Combination of Data Rather Than Individual Data Points

Transaction failures only make sense as a percentage of transactions initiated. An alarm on failures > 2 does not make sense if a million transactions were initiated. A better alarm might be failure percentage > 1.
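
A minimal sketch of the difference, assuming the two counts are available for the same time window. The function names and the 1% default are illustrative only.

```python
def failure_percentage(failed: int, initiated: int) -> float:
    """Failures as a percentage of transactions initiated (0.0 if none)."""
    if initiated == 0:
        return 0.0
    return 100.0 * failed / initiated


def failure_rate_alarm(failed: int, initiated: int, threshold_pct: float = 1.0) -> bool:
    """Fire only when the failure percentage exceeds the threshold."""
    return failure_percentage(failed, initiated) > threshold_pct


# 3 failures in a million transactions: an absolute "failures > 2" rule
# fires, the percentage rule does not.
print(3 > 2)                                  # True
print(failure_rate_alarm(3, 1_000_000))       # False
print(failure_rate_alarm(20_000, 1_000_000))  # True (2%)
```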

Cap to Ensure Very Low Volumes Don’t Cause Alerts

On the flip side, if you have two transactions and one fails, the failure rate is 50%. The failure percentage only makes sense if the volume is decent. Requiring a minimum volume, say 30 transactions, as a precondition would make sure that one failure out of two does not raise an alarm.
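
One way to sketch this guard is a rule that carries both a percentage threshold and a minimum volume. `RateRule` and its defaults (1% and 30 transactions) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateRule:
    threshold_pct: float = 1.0  # alarm when failure percentage exceeds this
    min_volume: int = 30        # ignore windows with fewer transactions

    def should_alarm(self, failed: int, initiated: int) -> bool:
        if initiated == 0 or initiated < self.min_volume:
            return False  # too little traffic for the percentage to mean much
        return 100.0 * failed / initiated > self.threshold_pct


rule = RateRule()
print(rule.should_alarm(failed=1, initiated=2))    # False: 50%, but only 2 transactions
print(rule.should_alarm(failed=5, initiated=100))  # True: 5% over 100 transactions
```

The right minimum depends on your traffic; 30 here is only a starting point.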

An alarm for no user sign-ups in 15 minutes makes sense in the daytime, but the same alarm would not work at midnight, when sign-ups slow to a trickle. No sign-ups in an hour may be a valid alarm in that situation. Either go the anomaly detection route for trend recognition or have multiple alarms for different time ranges.
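
The "multiple alarms for different time ranges" option could look roughly like this. The busy-hours boundaries (08:00 to 22:00) are assumptions; the 15 minute and 1 hour windows come from the example above.

```python
from datetime import datetime, time, timedelta

DAY_START = time(8, 0)   # assumed start of busy hours
DAY_END = time(22, 0)    # assumed end of busy hours


def quiet_window(now: datetime) -> timedelta:
    """How long without sign-ups is tolerated at this time of day."""
    if DAY_START <= now.time() < DAY_END:
        return timedelta(minutes=15)
    return timedelta(hours=1)


def no_signup_alarm(last_signup: datetime, now: datetime) -> bool:
    """Fire when the gap since the last sign-up exceeds the tolerated window."""
    return now - last_signup > quiet_window(now)


noon = datetime(2024, 1, 1, 12, 0)
midnight = datetime(2024, 1, 2, 0, 0)
print(no_signup_alarm(noon - timedelta(minutes=20), noon))          # True
print(no_signup_alarm(midnight - timedelta(minutes=20), midnight))  # False
print(no_signup_alarm(midnight - timedelta(minutes=90), midnight))  # True
```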

Tackle Data Absence with Care

For transaction failures, no events is a good sign. For user sign-ups, no events is a bad sign. Handle missing data explicitly when creating alarms.
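
One way to make this explicit is to give every alarm a policy for the "no data" case. The `OnMissing` enum and `evaluate` function below are hypothetical names, and `None` stands in for a window with no events.

```python
from enum import Enum
from typing import Callable, Optional


class OnMissing(Enum):
    OK = "ok"                # no data means nothing went wrong
    BREACHING = "breaching"  # no data is itself the problem


def evaluate(
    value: Optional[float],
    breaches: Callable[[float], bool],
    on_missing: OnMissing,
) -> bool:
    """Return True if the alarm should fire; `value` is None when no events arrived."""
    if value is None:
        return on_missing is OnMissing.BREACHING
    return breaches(value)


# Transaction failures: silence is good news.
print(evaluate(None, lambda v: v > 1, OnMissing.OK))         # False
# User sign-ups: silence is the failure mode.
print(evaluate(None, lambda v: v < 1, OnMissing.BREACHING))  # True
print(evaluate(5, lambda v: v < 1, OnMissing.BREACHING))     # False
```

Forcing the choice at alarm creation time avoids relying on whatever the monitoring tool does by default.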

How to find good thresholds for alarms

TODO

References