Looking up events in the Riemann index
One of the classic problems of monitoring alerts is that they are often very cryptic. Coupled with the challenge of alert fatigue [1], this makes working out what to do next when you receive an alert quite tricky. Additionally, alerts often happen when we’re not at the top of our game: an alert at 4am on a Sunday morning is not likely to foster an exemplary response.
The quintessential example of a cryptic, unhelpful alert is the Nagios disk space alert.
What does this alert mean? We can see that filesystem /data has 678912 MB of disk space left, or 9%. Should we worry? How fast is it filling up? Is this likely to happen RSN or sometime in the future? What’s on that filesystem? Do I care if it fills up? I already have five questions from a single alert and I haven’t even started to diagnose WHY things might be wrong. Meh, I am going back to sleep.
Thankfully, in the middle of last year the estimable Ryan Frantz released Nagios Herald. Nagios Herald is a decorator for Nagios alerts. It allows you to add context or further information to alerts generated by Nagios.
For example, here is a decorated Nagios disk alert.
Much more useful. Nice big stack bar. Helpful graph. Output from the df command. With this information I’m feeling a lot more comfortable about fixing the issue. (You can find a bunch of other example alerts here too.)
So helpful to all using Nagios. Not so helpful to others. (Although I think there is support for user-supplied attributes in Sensu and Uchiwa, and probably some other tools, but nothing quite so well integrated and helpful yet.)
So in the spirit of recent Riemann posts I thought about what I could do quickly and simply to provide some context for alerts, specifically email alerts. Riemann does have one useful store of information: the index. Every event you index is stored in there until its TTL expires and the expiration reaper runs. So if you’re collecting useful events then some of those might help to color your alerts with helpful context.
In my environment Riemann receives events from collectd and does most of its alerting based on the values of collectd metrics. One of collectd’s plugins, df, emits metrics that measure the size of your filesystems. It emits a metric like so:
We can use this event, through the :service field, for example :service df-root/percent_bytes-used, to identify when specific filesystems have exceeded a threshold.
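To make the :service matching concrete, here is a small Clojure sketch. The event map is invented for illustration (the host name and metric value are hypothetical, not output from a real collectd instance); only the service naming pattern comes from the text above.

```clojure
; Illustrative event map; the values are invented for this sketch.
(def example-event
  {:host    "host.example.com"
   :service "df-root/percent_bytes-used"
   :metric  93.5})

; re-matches returns the whole match followed by the captured group,
; or nil when the service name does not match.
(re-matches #"df-(.*)/percent_bytes-used" (:service example-event))
; => ["df-root/percent_bytes-used" "root"]
```

The captured group is the filesystem part of the service name, which is what lets a single pattern cover every monitored filesystem.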
We can create a configuration like so to do this:
This uses the where filter stream to select all df-generated metrics matching df-(.*)/percent_bytes-used. This should find the percent bytes used for every filesystem we’re monitoring; for example, for the / filesystem the metric would be df-root/percent_bytes-used. Our where filter also matches on the metric when the percentage is greater than or equal to 90%. If both conditions match, it sends an email to james@example.com using the email function.
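A minimal sketch of a configuration along those lines. It assumes it lives in a Riemann config file and that email has already been defined with mailer earlier in that file; the layout is an illustration rather than the exact configuration from the example.

```clojure
; Sketch of a Riemann config fragment.
; Assumes `email` was defined earlier in the file with (mailer ...).
(let [index (index)]
  (streams
    ; Index every event so it can be looked up later for context.
    index

    ; Select df percent-used services at or above 90% and email them.
    (where (and (service #"^df-(.*)/percent_bytes-used")
                (>= metric 90))
      (email "james@example.com"))))
```

Putting the service check first means the metric comparison only runs for df events.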
It’s inside our email alerting that we’re going to add the additional context. Inside our email variable we’re going to redefine how Riemann creates the emails it sends. We do this by adding the :body option to the mailer plugin, which we’ve defined inside our email variable.
The :body option takes a function of a single events argument. The events argument contains one or more events in a sequence that our function, here format-body, will then parse and format.
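A minimal sketch of what that definition might look like, assuming Riemann’s mailer function and a format-body function defined elsewhere in the same config. The :from address is a placeholder, not a value from the example.

```clojure
; Forward declaration so this form compiles before format-body is defined.
(declare format-body)

; The :from address is a hypothetical placeholder.
; :body receives the sequence of events and must return a string.
(def email
  (mailer {:from "riemann@example.com"
           :body (fn [events] (format-body events))}))
```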
Our new format-body function will look pretty similar to the default Riemann email formatting.
We take the events argument and loop through the sequence of events inside it to produce a notification. Where the function starts to differ is when we begin to populate our additional insights. The insight is generated by looking up events in the Riemann index. To do this we use a third function called print-context. It is fed by the search function, which takes a host, here the host of the current event from the :host field, and returns all of the other events from that host from the index.
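A rough sketch of a format-body along these lines. The field labels and layout are assumptions made for illustration, and search and print-context are assumed to be defined elsewhere in the same config, with print-context returning a string.

```clojure
(require '[clojure.string :as string])

; Forward declarations: both helpers are assumed to be defined elsewhere
; in the same config file.
(declare search print-context)

(defn format-body
  "Builds a plain-text email body from a sequence of events, appending
  context from the index for each event's host."
  [events]
  (string/join "\n\n"
    (map (fn [event]
           (str "Host: "    (:host event) "\n"
                "Service: " (:service event) "\n"
                "State: "   (:state event) "\n"
                "Metric: "  (:metric event) "\n"
                "\nContext:\n"
                ; Look up everything the index holds for this host.
                (print-context (search (:host event)))))
         events)))
```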
The search function uses the riemann.index/search function to query the index. It constructs a query using the host argument, then uses that query to retrieve all matching events for that host from the index. The index it queries is the one belonging to the currently running core. Any matching events in the index will be returned as a sequence of standard Riemann events.
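One way to sketch such a search function. This version is an assumption on my part: it builds the query by parsing a query string with riemann.query/ast, which may differ from how the example code constructs its query, and it does not escape quotes in the host name.

```clojure
(defn search
  "Returns the events currently in the index for the given host.
  Assumes this runs inside a Riemann config, where `core` is the atom
  holding the running core."
  [host]
  (riemann.index/search
    (:index @core)
    ; Parse a query string into the AST the index expects.
    ; Assumption: host names contain no double quotes.
    (riemann.query/ast (str "host = \"" host "\""))))
```

Because the core is dereferenced when the function is called, not when the config is loaded, the lookup happens against whichever core is running when the alert fires.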
We then pass this sequence to the print-context function as an argument. The print-context function iterates through the sequence and prints out a list of services and associated metrics.
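A minimal sketch of a print-context in this spirit. It assumes the function returns a string (so it can be spliced into the email body) and uses a made-up "service: metric" line format; rounding is left out to keep the sketch self-contained.

```clojure
(require '[clojure.string :as string])

(defn print-context
  "Formats each event as one 'service: metric' line."
  [events]
  (string/join "\n"
    (map (fn [event]
           (str (:service event) ": " (:metric event)))
         events)))

; Usage with hand-written events:
(print-context [{:service "cpu" :metric 0.5}
                {:service "load" :metric 1}])
; => "cpu: 0.5\nload: 1"
```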
The contextual example is a little silly because you probably don’t want all of these services and their metrics but you could easily select something more elegant. (In the example code we’ve also included a lookup function which uses the other index querying function: riemann.index/lookup. The lookup function uses a host/service pair to look up specific events inside the index.)
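A sketch of what such a lookup helper might look like, again assuming it is defined inside a Riemann config where core refers to the running core. The host and service in the usage comment are hypothetical.

```clojure
(defn lookup
  "Returns the indexed event for one host/service pair."
  [host service]
  (riemann.index/lookup (:index @core) host service))

; Hypothetical usage: fetch only the root filesystem metric for a host.
; (lookup "host.example.com" "df-root/percent_bytes-used")
```

This is the more targeted option when you already know which service provides the context you want.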
We also run our events through the round function, which uses cl-format from clojure.pprint to round any numbers to 2 decimal places.
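A small sketch of a rounding helper of this kind. The format directive and the decision to pass non-numeric values through unchanged are my assumptions, not details taken from the example code.

```clojure
(require '[clojure.pprint :refer [cl-format]])

(defn round
  "Formats a number as a string with two decimal places.
  Non-numeric values (such as a nil metric) are returned unchanged."
  [n]
  (if (number? n)
    ; ~,2F is the fixed-point directive with two digits after the point.
    (cl-format nil "~,2F" (double n))
    n))

(round 3.14159)
; => "3.14"
```

Guarding on number? matters because some events in the index carry no metric at all.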
Phew! That’s a lot of background. So what actually happens when this alert triggers? In this case you will generate an email much like:
You could easily modify this to select only specific, relevant events. You could also use any of Riemann’s stream functions or Clojure’s functions to manipulate those events.
You could also extend this example beyond the index to retrieve external information. For example, you could retrieve further information from the host, construct a graph, or link to an existing Graphite graph or data source. This could even be further extended to take some action on the host itself in addition to the notification. The possibilities are broad and exciting!
P.S. You can find a fully-functioning Riemann configuration for this example here.
[1] Becoming desensitized to alerts because you get so many.