Connecting Riemann and Zookeeper
One of my pet hates is having to maintain configuration inside monitoring tools: not just large pieces like host definitions but also smaller pieces like service and component definitions. Using a configuration management tool makes this much easier, but it still generally requires some convergence to update your monitoring configuration when a host is added or removed or a service changes.
An example might be HAProxy. I have an HAProxy instance running with multiple back-end nodes. I want to know about issues if the node count drops below a threshold, or potentially if it drops at all. With auto-scaling, or just adding and removing nodes, I need to keep this count up to date in my monitoring system to ensure I am correctly alerted when something goes wrong and to avoid false positives. I could do that with configuration management and converge the configuration when I deploy, using Puppet’s exported resources for example. But in a dynamic and fast-moving environment I’d really prefer not to wait for any convergence.
(Note: This is a somewhat artificial and very pets vs. cattle example. I don’t overly care if individual nodes die because they are disposable and easily replaced. I could apply the same logic to any host or service threshold that I wanted to query.)
Instead I want my monitoring system to be able to look up my threshold in some source of truth about the state of my infrastructure. That source of truth could be something like Apache Zookeeper, Consul, or a configuration management store like PuppetDB.
In this post I’m going to combine Zookeeper with my Riemann monitoring stack. Let’s start with some code to connect to Zookeeper. It makes use of the zookeeper-clj Clojure client.
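A minimal sketch of such a namespace, assuming a local Zookeeper on the default port and that the node stores a number as a string:

```clojure
;; zookeeper.clj -- a sketch of the helper namespace; the connect
;; string and the numeric conversion are assumptions.
(ns zookeep
  (:require [zookeeper :as zk]
            [zookeeper.data :as data]))

;; Connection to a local Zookeeper server; a remote host:port
;; string would work just as well.
(def client (zk/connect "127.0.0.1:2181"))

(defn get_data
  "Retrieve the contents of the Zookeeper node at the given path,
   returning the stored value as a number."
  [node]
  (Long/parseLong (data/to-string (:data (zk/data client node)))))
```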
The first part of our code loads the zookeeper-clj client. We then define a namespace called zookeep and require the client (as zk) and the Zookeeper client’s data function (as data). We’ve defined a var called client that is a connection to a local Zookeeper server. We could easily specify a remote server instead. We’ve also created a very simple function named get_data that retrieves the contents of a specific Zookeeper node specified by the node argument.
Let’s now create a riemann.config file to make use of our Zookeeper functions.
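A sketch of such a riemann.config — the log file path, the included file name, and the email addresses are placeholder assumptions:

```clojure
;; riemann.config -- a sketch; paths and addresses are placeholders.
(include "zookeeper.clj")

(logging/init {:file "riemann.log"})

;; Bind Riemann to all interfaces on the host.
(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host}))

;; Configure the email plug-in so we can send notifications.
(def email (mailer {:from "riemann@example.com"}))

(streams
  (where (and (service "haproxy-backend.web-backend/gauge-active_servers")
              (tagged "app1")
              (< metric (zookeep/get_data "/app1/haproxy/nodes")))
    (email "ops@example.com")))
```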
In our configuration we’ve included our Zookeeper functions using the include function and bound Riemann to all the interfaces on our host. We’ve also configured the email plug-in to allow us to send emails from Riemann. Next we’ve defined some streams, including a where filter on an event generated from collectd called haproxy-backend.web-backend/gauge-active_servers. This is the active back-end server count from the HAProxy stats output.
Our where filter matches this service if it is tagged with app1 and if the value of the metric field is less than the value returned by the (zookeep/get_data "/app1/haproxy/nodes") function. This function, zookeep/get_data, takes the node name /app1/haproxy/nodes and looks it up in Zookeeper.
Inside Zookeeper we’ve created this node and populated it with the count of HAProxy back-end nodes running for this specific application. Populating or updating the node would normally take place during deployment.
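As a sketch, a deployment-time step to create and populate the node might look like this (the connect string and the count value are placeholder assumptions):

```clojure
;; A hypothetical deployment step that records the back-end count.
(require '[zookeeper :as zk])

(let [client (zk/connect "127.0.0.1:2181")
      node   "/app1/haproxy/nodes"]
  ;; Create the node (and any missing parents) if it doesn't exist.
  (when-not (zk/exists client node)
    (zk/create-all client node :persistent? true))
  ;; Write the current back-end count as the node's data.
  (zk/set-data client node (.getBytes "3" "UTF-8")
               (:version (zk/exists client node))))
```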
Now when the metric arrives in Riemann, the lookup is triggered and Riemann compares the value of the metric field with the value from the Zookeeper node. If the metric value is less than the node value then Riemann sends an email containing the specific event. Our monitoring system no longer needs any changes when our HAProxy configuration changes, and we eliminate the need to wait for our deployment changes to converge in our monitoring environment, which means less risk of missing an alert or of generating a false positive.
(interlude….
This approach is somewhat of a hack, and lookups like this could cause latency issues in Riemann: every matching event blocks on a network round trip to Zookeeper. A better approach, suggested by Pyr, makes use of atoms.
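A sketch of that idea (names here are illustrative, not necessarily Pyr’s original code): cache the threshold in an atom and refresh it from a Zookeeper watcher, so the stream only dereferences local state instead of blocking on a lookup per event.

```clojure
;; Cached threshold; refreshed by the watcher below, read by streams.
(def active-servers (atom 0))

(defn watch-node!
  "Read the node, cache its value in the atom, and re-register the
   one-shot watcher so future changes refresh the cache."
  [client node]
  (let [result (zk/data client node
                        :watcher (fn [_] (watch-node! client node)))]
    (reset! active-servers
            (Long/parseLong (data/to-string (:data result))))))

(watch-node! client "/app1/haproxy/nodes")

;; The where filter in riemann.config then dereferences the atom:
;;   (< metric @zookeep/active-servers)
```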
… end interlude)