Connecting Riemann and Zookeeper
One of my pet hates is having to maintain configuration inside monitoring tools: not just large pieces like host definitions but also smaller pieces like service and component definitions. Using a configuration management tool makes this much easier, but it still generally requires some convergence to update your monitoring configuration when a host is added or removed or a service changes.
An example might be HAProxy. I have an HAProxy instance running with multiple back-end nodes. I want to know about issues if the node count drops below a threshold, or potentially if it drops at all. With auto-scaling, or just adding and removing nodes, I need to keep this count up to date in my monitoring system to ensure I am correctly alerted when something goes wrong and to avoid false positives. I could do that with configuration management and converge the configuration when I deploy, using Puppet’s exported resources for example. But in a dynamic and fast-moving environment I’d really prefer not to wait for any convergence.
(Note: This is a somewhat artificial and very pets vs. cattle example. I don’t overly care if individual nodes die because they are disposable and easily replaced. I could apply the same logic to any host or service threshold that I wanted to query.)
Instead I want my monitoring system to be able to look up my threshold in some source of truth about the state of my infrastructure. That source of truth could be something like Apache Zookeeper, Consul, or a configuration management store like PuppetDB.
In this post I’m going to combine Zookeeper with my Riemann monitoring stack. Let’s start with some code to connect to Zookeeper. It makes use of the zookeeper-clj Clojure client.
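A minimal sketch of such a namespace, assuming a local Zookeeper on the default port and that the node stores a number as a string:

```clojure
;; zookeeper.clj -- a sketch of the helper namespace; the connect
;; string and the numeric conversion are assumptions.
(ns zookeep
  (:require [zookeeper :as zk]
            [zookeeper.data :as data]))

;; Connection to a local Zookeeper server; a remote host:port
;; string would work just as well.
(def client (zk/connect "127.0.0.1:2181"))

(defn get_data
  "Retrieve the contents of the Zookeeper node at the given path,
   returning the stored value as a number."
  [node]
  (Long/parseLong (data/to-string (:data (zk/data client node)))))
```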
The first part of our code loads the zookeeper-clj client. We then define a namespace called zookeep and require the client (as zk) and the Zookeeper client’s data function (as data). We’ve defined a var called client that is a connection to a local Zookeeper server. We could easily specify a remote server instead. We’ve also created a very simple function named get_data that retrieves the contents of a specific Zookeeper node specified by the node argument.
Let’s now create a riemann.config file to make use of our Zookeeper functions.
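A sketch of such a riemann.config — the log file path, the included file name, and the email addresses are placeholder assumptions:

```clojure
;; riemann.config -- a sketch; paths and addresses are placeholders.
(include "zookeeper.clj")

(logging/init {:file "riemann.log"})

;; Bind Riemann to all interfaces on the host.
(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host}))

;; Configure the email plug-in so we can send notifications.
(def email (mailer {:from "riemann@example.com"}))

(streams
  (where (and (service "haproxy-backend.web-backend/gauge-active_servers")
              (tagged "app1")
              (< metric (zookeep/get_data "/app1/haproxy/nodes")))
    (email "ops@example.com")))
```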
In our configuration we’ve included our Zookeeper functions using the include function and bound Riemann to all the interfaces on our host. We’ve also configured the email plug-in to allow us to send emails from Riemann. Next we’ve defined some streams, including a where filter on an event generated from collectd called haproxy-backend.web-backend/gauge-active_servers. This is the active back-end server count from the HAProxy stats output.
Our where filter matches this service if it is tagged with app1 and if the value of the metric field is less than the value returned by the (zookeep/get_data "/app1/haproxy/nodes") function. This function, zookeep/get_data, takes the node name /app1/haproxy/nodes and looks it up in Zookeeper.
Inside Zookeeper we’ve created this node and populated it with the count of HAProxy back-end nodes running for this specific application. Populating or updating the node would normally take place during deployment.
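As a sketch, a deployment-time step to create and populate the node might look like this (the connect string and the count value are placeholder assumptions):

```clojure
;; A hypothetical deployment step that records the back-end count.
(require '[zookeeper :as zk])

(let [client (zk/connect "127.0.0.1:2181")
      node   "/app1/haproxy/nodes"]
  ;; Create the node (and any missing parents) if it doesn't exist.
  (when-not (zk/exists client node)
    (zk/create-all client node :persistent? true))
  ;; Write the current back-end count as the node's data.
  (zk/set-data client node (.getBytes "3" "UTF-8")
               (:version (zk/exists client node))))
```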
Now when the metric arrives in Riemann, the lookup is triggered and Riemann compares the value of the metric field with the value from the Zookeeper node. If the metric value is less than the node value then Riemann sends an email containing the specific event. Our monitoring system no longer needs any changes when our HAProxy configuration changes, and we eliminate the need to wait for our deployment changes to converge in our monitoring environment, which means less risk of missing an alert or of generating a false positive.
(interlude….
This approach is somewhat of a hack, and lookups like this could cause latency issues in Riemann: every matching event blocks on a network round trip to Zookeeper. A better approach, suggested by Pyr, makes use of atoms.
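A sketch of that idea (names here are illustrative, not necessarily Pyr’s original code): cache the threshold in an atom and refresh it from a Zookeeper watcher, so the stream only dereferences local state instead of blocking on a lookup per event.

```clojure
;; Cached threshold; refreshed by the watcher below, read by streams.
(def active-servers (atom 0))

(defn watch-node!
  "Read the node, cache its value in the atom, and re-register the
   one-shot watcher so future changes refresh the cache."
  [client node]
  (let [result (zk/data client node
                        :watcher (fn [_] (watch-node! client node)))]
    (reset! active-servers
            (Long/parseLong (data/to-string (:data result))))))

(watch-node! client "/app1/haproxy/nodes")

;; The where filter in riemann.config then dereferences the atom:
;;   (< metric @zookeep/active-servers)
```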
… end interlude)