Using Riemann for Fault Detection
In the last post I introduced you to Riemann. I mentioned streams in that post and how they are at the heart of Riemann's power. However, I only provided a vague teaser of streams and left you to fish for the details yourself.
In this post I’m going to build on our example Riemann configuration. I’ll show you how to do simple service management with streams and introduce you to Riemann’s state table: the index. We’ll see:
- How the index works.
- How we can alert on services and hosts using events.
- How we can send those alerts via email and PagerDuty.
Configuring Streams
Streams are specified in Riemann's Clojure-based configuration file. On our example Ubuntu host we can find that file at `/etc/riemann/riemann.config`. We edited that configuration in the last post to bind Riemann to all interfaces and to add some more logging. Let's look at it again now.
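Here's roughly what ours looks like at this point: a sketch of the stock Ubuntu configuration with the binding and logging tweaks from the last post, so your file may differ slightly.

```clojure
(logging/init {:file "/var/log/riemann/riemann.log"})

; Bind Riemann to all interfaces.
(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server  {:host host}))

; Sweep expired events out of the index every 5 seconds.
(periodically-expire 5)

(let [index (index)]
  (streams
    ; Apply a default TTL of 60 seconds to any event without one.
    (default :ttl 60
      ; Index all events.
      index

      ; Log events to the Riemann log file.
      #(info %))))
```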
In our configuration we can see a section called `(streams`. Inside this section is where we configure Riemann's streams. The first entry in this section specifies a default time to live for events. More on this shortly. The second entry tells Riemann to index all events.
The Riemann Index
The index is a table of the current state of all services being tracked by Riemann. In the last post, when we introduced events, we discovered that each Riemann event is a struct that can contain a number of optional fields, including: a host, a service, a state, a time, a description, a metric value, and a time to live. Each event you tell Riemann to index is added to the index and mapped by its host and service fields. The index retains the most recent event for each host and service pair. You can think of the index as Riemann's worldview. The Riemann dashboard, which we also saw in the last post, uses the index as its source of truth.
Each indexed event has a Time To Live, or TTL. The TTL can be set in the event's `ttl` field or via a default. In our configuration we've set the default TTL to 60 seconds with the `default` stream function. This is the TTL applied to any event which doesn't already have one.
After an event's TTL expires it is dropped from the index and fed back into the stream with a `state` of `expired`. This seems pretty innocuous, right? Nope! This is where the change in monitoring methodology that Riemann facilitates starts to become clear (and exciting).
Detecting down services
In the last post I talked a bit about pull/polling models versus push models for monitoring. In the monitoring “pull model” we actively poll services, for example using an active check like a Nagios plugin. If any of those services failed to respond or returned a malformed response our monitoring system would alert us to that. This active monitoring generally results in a centralized, monolithic and vertically scaled solution. That’s not an ideal architecture.
In an event-driven push model we don't do any active monitoring. Our services generate events. Those events are pushed to Riemann. Each event has a TTL, and the last event received is stored in the index. When the TTL expires Riemann will expire the event and feed it back into the stream. In that stream I can then monitor for events with a `state` of `expired` and alert on those. A much simpler, more scalable and, IMHO, more elegant solution.
So let's see how this might work for a service. In the last post we looked at some of the Riemann tools for service checking. Let's use the `riemann-varnish` tool again for our testing.
On our Varnish host we need to install `riemann-tools` via RubyGems.
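Assuming Ruby and RubyGems are already present on the host:

```
$ sudo gem install riemann-tools
```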
We can then use `riemann-varnish` to send our events.
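Something like this, pointing it at our Riemann server and setting the 10-second TTL we'll rely on shortly (the `--host`, `--interval` and `--ttl` flags are common to the riemann-tools commands):

```
$ riemann-varnish --host riemann.example.com --interval 5 --ttl 10
```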
The `riemann-varnish` command wraps the `varnishstat` command and converts Varnish statistics into Riemann events. For example, the client connections accepted metric generates an event like so:
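Here's a sketch of such an event (the exact service name and values will vary):

```clojure
{:host "varnish.example.com",
 :service "varnish client_conn",
 :state "ok",
 :description "Client connections accepted",
 :metric 1447.0,
 :tags nil,
 :time 1423609421,
 :ttl 10.0}
```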
We can see that the event has a `host` and a `service`, the combination of which Riemann will use to track state in the index. The event also has a `state` field of `ok`, plus other useful information like the actual client connections accepted metric.
We're going to use this data, plus the TTL, to do basic service monitoring with Riemann. Let's update our configuration:
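Here's a sketch of the updated configuration, building on the stock layout above (the destination address is a placeholder, swap in your own):

```clojure
; Send email via local sendmail from riemann@example.com.
(def email (mailer {:from "riemann@example.com"}))

(let [index (index)]
  (streams
    (default :ttl 60
      index

      ; Email any event whose state changes, assuming events start as "ok".
      (changed-state {:init "ok"}
        (email "ops@example.com")))))
```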
The first thing we've added is a function called `email` that configures the emailing of events. Under the covers Riemann uses Postal to send email for you. This basic configuration uses local sendmail to send emails, with a from address of `riemann@example.com`. You could also configure sending via SMTP. To send emails you'll need to ensure you have local mail configured on your host. To do this I usually install the `mailutils` package.
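On our Ubuntu host that's:

```
$ sudo apt-get install mailutils
```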
If you don't install a suitable local mail server then you'll receive a somewhat cryptic error in your Riemann log along the lines of:

```
riemann.email$mailer$make_stream threw java.lang.NullPointerException
```
Next we've used a helper shortcut called `changed-state` to monitor for events whose state has changed. The `:init` option specifies the base assumption about an event's state, here `ok`. This is because Riemann doesn't know the previous state of events when it starts; it tells Riemann to assume prior events were all okay. Now the `changed-state` shortcut will match any event whose state changes and pass it to the `email` function we defined earlier.
Let's see this in action. First, we need to restart or HUP Riemann. Next, whilst I've been explaining this, the `riemann-varnish` tool has been sending events to Riemann. Those events come from my Varnish host, `varnish.example.com`, and an event is generated for each Varnish metric. Each event has a state of `ok` and a TTL of 10 seconds.
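In the Riemann log those events look roughly like this (log line prefix trimmed, values illustrative):

```
#riemann.codec.Event{:host "varnish.example.com", :service "varnish client_conn", :state "ok", :description "Client connections accepted", :metric 1447.0, :tags nil, :time 1423609421, :ttl 10.0}
```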
If Varnish fails, or I stop the `riemann-varnish` tool, then the flow of events will cease. When the TTL expires, 10 seconds later, this should trigger events with a state of `expired` and email notifications telling us that the Varnish services have changed state.
If we check our Riemann log file we should see the following event.
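Something like this (timestamp illustrative):

```
#riemann.codec.Event{:host "varnish.example.com", :service "varnish client_conn", :state "expired", :description nil, :metric nil, :tags nil, :time 1423609431, :ttl nil}
```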
We'll also see additional events for each Varnish metric that has expired. If we check our inbox we should also see email notifications for each service that has stopped reporting.
If the service starts working again you’ll receive another set of notifications that things are back to normal.
Preventing spikes and flapping
Like most monitoring systems we also have to be conscious of the potential for state spikes and flapping. Riemann provides a useful stream function to help us here called `stable`. It allows us to specify a time period and an event field, like the `state` (or, usefully for certain types of monitoring, the `metric`), and it filters out spikey or flapping behavior. Let's add `stable` to our example.
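One way to wire it in, building on the configuration above:

```clojure
(let [index (index)]
  (streams
    (default :ttl 60
      index

      ; Only pass on events whose state has held steady for 60 seconds.
      (stable 60 :state
        (changed-state {:init "ok"}
          ; Placeholder address, as before.
          (email "ops@example.com"))))))
```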
Here we've wrapped our notifications in the `stable` stream with a time period of 60 seconds, watching the `state` of events. This means Riemann will only pass on events whose state has remained the same for at least 60 seconds, hopefully avoiding alerts for flapping services. (Also potentially interesting here is the ability to roll up and throttle event streams.)
Sending events to PagerDuty
We aren't limited to email for alerting either. Riemann comes with some additional options, most notably PagerDuty.
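A sketch, with the PagerDuty service key as a placeholder:

```clojure
; Create a connection to PagerDuty using a service key.
(def pd (pagerduty "my-pagerduty-service-key"))

(let [index (index)]
  (streams
    (default :ttl 60
      index

      (changed-state {:init "ok"}
        ; Trigger an incident when a service expires from the index...
        (where (state "expired")
          (:trigger pd))
        ; ...and resolve it when the service reports ok again.
        (where (state "ok")
          (:resolve pd))))))
```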
Here we've defined a function called `pd` that creates a connection to PagerDuty. We've specified a service key we previously created in PagerDuty. We've updated our state monitoring to trigger in two cases:
- When an event has a state of `expired` we send an alert trigger to PagerDuty.
- When an event has a state of `ok` we send a resolution to PagerDuty.
This ensures we can both trigger and resolve issues created from Riemann.
Let's trigger some PagerDuty alerts. First, we need to restart or HUP Riemann to update our configuration. Next, we can generate some alerts by stopping our `riemann-varnish` tool again. The `expired` events should trigger PagerDuty alerts.
Summary
Pretty cool stuff eh? Well this post just scratches the surface of things you can do with Riemann streams. There are a bunch of other ideas and examples in the Riemann HOWTO section that you can explore. Also look out for my next post on Riemann where I’ll be looking at streams again, this time with a focus on metrics and Graphite.