Looking up events in the Riemann index

June 14, 2015 in Blog, Book, Riemann, DevOps, Monitoring

Forthcoming book - The Art of Monitoring

One of the classic problems of monitoring alerts is that they are often very cryptic. Coupled with the challenge of alert fatigue¹, this makes working out what to do next when you receive an alert quite tricky. Additionally, alerts often happen when we’re not at the top of our game: an alert at 4am on a Sunday morning is not likely to foster an exemplary response.

The quintessential example of a cryptic, unhelpful alert is the Nagios disk space alert.

PROBLEM Host: server.example.com
Service: Disk Space

State is now: WARNING for 0d 0h 2m 4s (was: WARNING) after 3/3 checks

Notification sent at: Thu Aug 7th 03:36:42 UTC 2015 (notification number 1)

Additional info:
DISK WARNING - free space: /data 678912 MB (9% inode=99%)

What does this alert mean? We can see that filesystem /data has 678912 MB of disk space left, or 9%. Should we worry? How fast is it filling up? Is it likely to fill up RSN or sometime in the future? What’s on that filesystem? Do I care if it fills up? I already have five questions from a single alert and I haven’t even started to diagnose WHY things might be wrong. Meh, I am going back to sleep.

Thankfully, in the middle of last year the estimable Ryan Frantz released Nagios Herald. Nagios Herald is a decorator for Nagios alerts. It allows you to add context or further information to alerts generated by Nagios.

For example, here is a decorated Nagios disk alert.

Decorated Nagios disk alert

Much more useful. Nice big stack bar. Helpful graph. Output from the df command. With this information I’m feeling a lot more comfortable about fixing the issue. (You can find a bunch of other example alerts here too.)

So helpful to all using Nagios. Not so helpful to others. (Although I think there is support for user-supplied attributes in Sensu and Uchiwa, and probably in some other tools, but nothing quite so well integrated and helpful yet.)

So in the spirit of recent Riemann posts I thought about what I could do quickly and simply to provide some context for alerts, specifically email alerts. Riemann does have one useful store of information: the index. Every event you index is stored in there until its TTL expires and the expiration reaper runs. So if you’re collecting useful events then some of those might help to color your alerts with helpful context.

In my environment Riemann receives events from collectd and does most of its alerting based on the values of collectd metrics. One of collectd’s plugins, df, emits metrics that measure the usage of your filesystems. It emits a metric like so:

{:host host.example.com, :service df-root/percent_bytes-used, :state nil, :description nil, :metric 90.334929260253906, :tags [collectd], :time 1433706333, :ttl 20.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance used, :type percent_bytes, :plugin_instance root, :plugin df}

We can use this event, through the :service field, for example :service df-root/percent_bytes-used, to identify when a specific filesystem has exceeded a threshold.

We can create a configuration like so to do this:

(let [index (index)]
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      ; Email when any filesystem is at or above 90% used.
      (where (and (service #"^df-(.*)/percent_bytes-used") (>= metric 90.0))
        (email "james@example.com")))))

This uses the where filter stream to select all df-generated metrics matching df-(.*)/percent_bytes-used. This should find the percent bytes used for every filesystem we’re monitoring. For example, for the / filesystem the metric would be df-root/percent_bytes-used. Our where filter also matches on the metric, selecting events only when the percentage is greater than or equal to 90%. If an event matches, it sends an email to james@example.com using the email function.
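To see what the where stream is testing, the same check can be sketched as a plain Clojure predicate. This is only an illustration and not how Riemann implements where internally: the name disk-alert? is made up for this sketch, and it assumes events are maps with :service and :metric keys like the collectd event shown earlier.

```clojure
;; Hypothetical helper, for illustration only: mirrors the where clause
;; as an ordinary predicate over an event map.
(defn disk-alert?
  [event]
  (boolean
    (and (string? (:service event))
         ;; Same pattern as the where clause: any df filesystem's percent used.
         (re-find #"^df-(.*)/percent_bytes-used" (:service event))
         (number? (:metric event))
         (>= (:metric event) 90.0))))

(disk-alert? {:service "df-root/percent_bytes-used" :metric 90.33}) ; true
(disk-alert? {:service "df-root/percent_bytes-used" :metric 42.0})  ; false
(disk-alert? {:service "cpu-0/cpu-system" :metric 99.0})            ; false
```

A predicate like this is handy for trying out a pattern and threshold at a REPL before putting them into a live configuration.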

It’s inside our email alerting that we’re going to add the additional context. We’re going to redefine how Riemann creates the emails it sends by adding the :body option to the mailer, which we’ve defined as our email variable.

(def email (mailer {:from "riemann@example.com"
                    ; format-body must already be defined at this point in the config.
                    :body (fn [events] (format-body events))}))

The :body option takes a function that receives a single argument, events. The events argument contains one or more events in a sequence that our function, here format-body, will then parse and format.
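As far as I know the mailer accepts a :subject option that works the same way, taking a function of the events sequence and returning a string, but check this against your Riemann version. Here is a minimal sketch of such a function. The name format-subject and the subject wording are my own invention for this example.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper: summarizes the hosts and services in a batch of events.
(defn format-subject
  [events]
  (str "Riemann alert: "
       (string/join ", "
                    (distinct (map #(str (:host %) " " (:service %)) events)))))

(format-subject [{:host "app2-api" :service "df-root/percent_bytes-used"}])
;; => "Riemann alert: app2-api df-root/percent_bytes-used"

;; In the mailer options map this would sit next to :body, roughly as:
;;   :subject (fn [events] (format-subject events))
```

Because it is just a function from a sequence of event maps to a string, you can test it at a REPL without sending any email.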

Our new format-body function will look pretty similar to the default Riemann email formatting.

(defn format-body
  "Format the email body"
  [events]
  (clojure.string/join "\n\n\n"
    (map
      (fn [event]
        (str
          "Time: " (riemann.common/time-at (:time event)) "\n"
          "Host: " (:host event) "\n"
          "Service: " (:service event) "\n"
          ; Ratios print as fractions such as 1/3, so convert them to doubles.
          "Metric: " (if (ratio? (:metric event))
                       (double (:metric event))
                       (:metric event)) "\n"
          "\n"
          "Additional context for host: " (:host event) "\n\n"
          ; Look up every indexed event for this host and list it.
          (print-context (search (:host event)))
          "\n\n"))
      events)))

We take the events argument and loop through the sequence of events inside it to produce a notification. Where the function starts to differ is when we begin to populate our additional insights. The insight is generated by looking up events in the Riemann index. To do this we use two more functions: search and print-context. The search function takes a host, here the host of the current event from the :host field, and returns all of the other events from that host from the index.

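(defn search
  "Search events in the index"
  [host]
  ; Build a query such as: host = "app2-api", parse it, and run it
  ; against the index of the currently running core.
  (->> (riemann.query/ast (str "host = " (pr-str host)))
       (riemann.index/search (:index @riemann.config/core))))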

The search function uses the riemann.index/search function to query the index. It constructs a query using the host argument and then uses that query to retrieve all matching events for that host from the index. The index itself is taken from the currently running core. Any matching events in the index will be returned as a sequence of standard Riemann events.
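The same approach works for narrower queries. As a sketch, and assuming Riemann's query language (host = "..." for equality, service =~ "df-%" with % as a wildcard, joined with and), a helper could build a query string for only the filesystem metrics of a host. The helper name host-df-query is hypothetical, and the string it returns would still need to be parsed with riemann.query/ast before being handed to riemann.index/search.

```clojure
;; Hypothetical helper: builds a Riemann query string that matches only the
;; df (filesystem) services for one host. pr-str adds the surrounding quotes.
(defn host-df-query
  [host]
  (str "host = " (pr-str host) " and service =~ \"df-%\""))

(host-df-query "app2-api")
;; => "host = \"app2-api\" and service =~ \"df-%\""

;; Inside a Riemann config this could then be used roughly as:
;; (riemann.index/search (:index @riemann.config/core)
;;                       (riemann.query/ast (host-df-query "app2-api")))
```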

We then pass this sequence to the print-context function as an argument. The print-context function iterates through the sequence and prints out a list of services and associated metrics.

(defn print-context
  "Print the event content"
  [events]
  (clojure.string/join "\n"
    (map
      (fn [event]
        (str
          "Service: " (:service event) " with metric: " (round (:metric event))))
      events)))

The contextual example is a little silly because you probably don’t want all of these services and their metrics, but you could easily select something more relevant. (In the example code we’ve also included a lookup function which uses the other index query function: riemann.index/lookup. The lookup function uses a host/service pair to look up specific events inside the index.)
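A lookup helper along those lines might look something like the sketch below. It only works inside a running Riemann configuration, where riemann.config/core holds the current core, and the host and service names in the usage comment are just examples.

```clojure
;; A sketch of a lookup helper, assuming it is defined in the Riemann config.
(defn lookup
  "Look up a single event in the index by host and service"
  [host service]
  (riemann.index/lookup (:index @riemann.config/core) host service))

;; Example: fetch the short-term load event for the alerting host, if indexed.
;; (lookup "app2-api" "load/load/shortterm")
```

Where search returns a sequence of events, lookup targets a single host/service pair, so it suits the case where you know exactly which metric should accompany an alert.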

We also run our metrics through the round function, which uses cl-format from clojure.pprint to round any numbers to 2 decimal places.

(defn round
  "Round numbers to 2 decimal places"
  [metric]
  (clojure.pprint/cl-format nil "~,2f" metric))
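Not every event in the index necessarily carries a numeric metric (state-only events can have a nil :metric), and the float directive is not meant for non-numbers. A more defensive variant, sketched here under that assumption, falls back to a placeholder. The name round-or-na and the "n/a" text are hypothetical choices.

```clojure
(require '[clojure.pprint :as pprint])

;; Hypothetical variant of round: formats numbers to 2 decimal places and
;; falls back to a placeholder when an event carries no numeric metric.
(defn round-or-na
  [metric]
  (if (number? metric)
    (pprint/cl-format nil "~,2f" (double metric))
    "n/a"))

(round-or-na 90.334929260253906) ; "90.33"
(round-or-na 0.4)                ; "0.40"
(round-or-na nil)                ; "n/a"
```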

Phew! That’s a lot of background. So what actually happens when this alert triggers? In this case you will generate an email much like:

Time: Sun Jun 14 15:22:19 UTC 2015
Host: app2-api
Service: df-root/percent_bytes-used
Metric: 90.33

Additional context for host: app2-api

Service: cpu-0/cpu-system with metric: 0.40
Service: processes-rsyslogd/ps_disk_octets/read with metric: 0.00
Service: processes-collectd/ps_cputime/syst with metric: 3002.70
Service: cpu-0/cpu-wait with metric: 0.00
Service: interface-lo/if_errors/rx with metric: 0.00
Service: swap/swap_io-out with metric: 0.00
Service: interface-docker0/if_errors/rx with metric: 0.00
Service: elasticsearch-productiona/counter-indices.refresh.total with metric: 0.59
Service: interface-eth0/if_octets/tx with metric: 10192.04
Service: processes-collectd/ps_disk_ops/read with metric: 81.07
Service: processes-collectd/ps_data with metric: 621551616.00
Service: processes-rsyslogd/ps_pagefaults/minflt with metric: 0.00
Service: processes/ps_state-paging with metric: 0.00
Service: processes-rsyslogd/ps_count/processes with metric: 1.00
Service: interface-eth0/if_packets/rx with metric: 117.10
Service: interface-lo/if_packets/tx with metric: 0.00
Service: load/load/shortterm with metric: 0.14
. . .

You could easily modify this to only select specific, relevant events. You could also use any of Riemann’s stream functions or Clojure’s functions to manipulate those events.
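As one way of doing that, here is a small sketch that keeps only the context events whose service name starts with a given prefix. The function name relevant-events and the choice of prefixes are assumptions for this example, not part of the original configuration.

```clojure
;; Hypothetical helper: keep only context events whose service name starts
;; with one of the given prefixes.
(defn relevant-events
  [prefixes events]
  (filter (fn [event]
            (let [^String service (str (:service event))]
              (some (fn [^String prefix] (.startsWith service prefix))
                    prefixes)))
          events))

(map :service
     (relevant-events ["df-" "load/"]
                      [{:service "df-data/percent_bytes-used" :metric 45.2}
                       {:service "cpu-0/cpu-wait" :metric 0.0}
                       {:service "load/load/shortterm" :metric 0.14}]))
;; => ("df-data/percent_bytes-used" "load/load/shortterm")
```

In format-body the result of search could be passed through a filter like this before it reaches print-context, so a disk alert only lists the other filesystems and the load on that host.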

You could also extend this example beyond the index to retrieve external information. For example, you could retrieve further information from the host, construct a graph, or link to an existing Graphite graph or data source. This could even be further extended to take some action on the host itself in addition to the notification. The possibilities are broad and exciting!
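For the Graphite idea, a rough sketch might build a link to Graphite's render endpoint and drop it into the email body. Everything specific here is an assumption: the graphite.example.com base URL, the one hour window, and the metric path, which depends entirely on how your metrics are named in Graphite.

```clojure
;; Hypothetical helper: builds a link to Graphite's render endpoint for one
;; metric. The base URL and the metric path are assumptions; adjust both to
;; match your own Graphite installation.
(defn graphite-link
  [base-url target]
  (str base-url "/render?target="
       (java.net.URLEncoder/encode target "UTF-8")
       "&from=-1h"))

(graphite-link "http://graphite.example.com"
               "app2-api.df-root.percent_bytes-used")
;; => "http://graphite.example.com/render?target=app2-api.df-root.percent_bytes-used&from=-1h"
```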

P.S. You can find a fully-functioning Riemann configuration for this example here.

Footnotes

¹ Becoming desensitized to alerts because you get so many.
