The first pillar: We alert on what we draw

The first pillar of monitoring that I propose is that we should alert on what we draw. What I mean by this is that the data we use to determine whether a system is behaving anomalously should be the performance data that we are already collecting for trending purposes. What is the thinking behind this pillar? Well, if we cast our minds back to yesterday’s introduction, we agreed that one of the purposes of monitoring systems was to provide data to allow us, as engineers, to detect patterns, and predict issues before they become production impacting. In order to do this, we need to be capturing data and storing it somewhere which allows us to analyse it. If we care about it - if the data could provide the kind of engineering insight which helps us to understand our systems and give early warning - we should be capturing it. If we have that data, why not use it as the basis for our alerting? To understand why this would be anything other than blindingly obvious, we need to consider the mainstream approach to this.

The most common open source tool used for alerting is Nagios. A typical Nagios setup operates by carrying out a check for a certain metric, and comparing it to a predetermined threshold. An example might be disk space or CPU load. Although sometimes carried out over ssh, the most frequently used way of running these checks is via NRPE - the Nagios Remote Plugin Executor. A central monitoring server runs a check which speaks over the network to a remote machine, which carries out a check in real time and returns a status, and sometimes some values associated with that check. There are a couple of weaknesses in this approach. Assuming we’ve agreed that if we care about the metric enough to want to alert on it then we should be gathering that data for analysis, and graphing it, then we already have the data upon which to base our check, and a second, real-time check simply duplicates the work. Furthermore, this data is not on the machine we’re monitoring, so checks made against it don’t in any way add further stress to that machine. Running a remote check might not sound like a very resource-intensive task, but on a machine that is already struggling it could be the proverbial last straw.

Another reason to eschew this approach is security. It’s a fundamental principle that we should run and expose as few network services as possible. Given that we don’t need to run our monitoring over NRPE if we’re already capturing the data elsewhere, an Ockhamist attitude would suggest we should stop using it. Notwithstanding this general principle, in order to enable user customisation of alerts, NRPE is often configured to allow command arguments to be passed through, which widens the attack surface further. The risk can be mitigated by ensuring NRPE runs over SSL, as the nagios user, with strict sudoers configuration, but if it can be shown that this isn’t actually needed because we can simply alert on our performance data, the whole question is moot.

From the other side of the argument, there are good positive reasons to alert on your performance data too. The obvious one is that it allows for complex event processing, because we have access to more data in the same place. It might be very valuable to know, for example, that not just one machine but all machines in a load balancer pool have an abnormally high load. Another great advantage is that you are never in any doubt as to the veracity of your data. You receive an alert, and can look at the graphs, and know that you’re looking at exactly the same data at exactly the same time. This pleases me, as an engineer.
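To make the load balancer example concrete, here is a minimal sketch of a pool-wide check written in Python. It assumes the recent load samples for each host have already been pulled from whatever store holds the trending data; the function name, the host names and the `min_fraction` parameter are all hypothetical and only there for illustration.

```python
from statistics import mean


def pool_is_overloaded(recent_load, threshold, min_fraction=1.0):
    """Return True when enough hosts in a pool are above a load threshold.

    recent_load maps a host name to a list of recent load samples.
    min_fraction is the share of hosts that must be "hot" before we alert;
    the default of 1.0 means every host in the pool.
    """
    if not recent_load:
        # No hosts, no data: nothing to alert on.
        return False
    hot_hosts = [
        host
        for host, samples in recent_load.items()
        if samples and mean(samples) > threshold
    ]
    return len(hot_hosts) / len(recent_load) >= min_fraction
```

Called with something like `pool_is_overloaded({"web1": [9, 10], "web2": [8, 9]}, threshold=4.0)`, this only fires when the whole pool is busy, which is a question a per-host NRPE check cannot answer on its own.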

OK - so there’s a strong philosophical and practical case for doing our alerting on our performance metrics. How can we go about this?

Well, firstly, if you’re already using Zabbix or Zenoss, you’re already using a homogeneous alerting and monitoring system - there’s no practical difference between the data you graph and the data you use for your alerting. If you use Nagios, however, you’re probably gathering trending data somewhere else. Popular options are Munin, Cacti, Graphite, Collectd or Ganglia. What we need to do is to be able to query this data with a Nagios plugin and alert based on the thresholds we’ve agreed.
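Whatever the data source, the Nagios side of such a plugin looks much the same. Nagios plugins report their result through the exit status (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and a line of output, optionally followed by performance data after a pipe character. The sketch below shows that part only; the function names are my own, and it assumes a metric where higher values are worse.

```python
# Exit statuses defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a metric value onto a Nagios status (higher is worse)."""
    if value is None:
        return UNKNOWN  # the trending store had no data for us
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_result(name, value, warn, crit):
    """Return (status, output line) for a metric pulled from the trending store."""
    status = evaluate(value, warn, crit)
    if value is None:
        return status, f"{name} UNKNOWN - no data"
    # Text after the pipe is performance data: label=value;warn;crit
    line = f"{name} {LABELS[status]} - {value:.2f} | {name}={value:.2f};{warn};{crit}"
    return status, line
```

A real plugin would print the line and then call `sys.exit(status)`; the only part that changes from tool to tool is where `value` comes from.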

The simplest and most obvious way to do this is to pull the metrics to the Nagios server and process them. This is the approach taken by Vladimir Vuksan and Michael Conigliaro to enable Nagios to alert on metrics collected in Ganglia - see https://github.com/mconigliaro/check_ganglia_metric.

Another approach is to use Nagios to query your RRD data stores. This is the approach used by the check_munin_rrd tool - see http://nagios-munin.googlecode.com/svn/trunk/check_munin_rrd.pl. Naturally, this logic could be adapted to collectd or Cacti. The Nagios Exchange has a number of plugins for speaking to RRD databases that all look to be worthy of exploration.
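As a rough illustration of the RRD route, the sketch below parses the text printed by the `rrdtool fetch` command and returns the most recent known value. It is not how check_munin_rrd works (that tool is written in Perl); it is just one way the idea could be expressed. I am assuming the usual output layout of a header naming the data sources followed by `timestamp: value ...` rows, with `nan` for unknown samples, and the function names are hypothetical.

```python
import math


def parse_rrd_fetch(text):
    """Parse `rrdtool fetch` style output into (timestamp, [values]) rows.

    Assumed layout: a header line of data source names, a blank line,
    then rows such as " 920805000: 4.0000000000e-02".
    """
    rows = []
    for line in text.splitlines():
        stamp, sep, rest = line.partition(":")
        if not sep or not stamp.strip().isdigit():
            continue  # header or blank line
        rows.append((int(stamp), [float(token) for token in rest.split()]))
    return rows


def latest_known(rows, column=0):
    """Return the newest non-NaN value in the given column, or None."""
    for _stamp, values in reversed(rows):
        if column < len(values) and not math.isnan(values[column]):
            return values[column]
    return None
```

The text itself could come from running something like `rrdtool fetch load.rrd AVERAGE -s -300` on the machine that holds the RRD files, where `load.rrd` is a placeholder file name. The newest rows are often still unknown, which is why the sketch walks backwards until it finds a real number.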

One of the most important tools to gain attention and advocates in the last 18 months is Graphite. Graphite was written in 2006 as a real time graphing system. It has three components - a processing engine, called carbon, a datastore, called whisper, and a Django app which presents the graphs. The API is written in such a way as to allow the raw data that would be used to create a graph to be requested. This makes it possible to write a Nagios plugin that uses data from Graphite for alerting purposes. An example plugin can be found at https://github.com/recoset/check_graphite.
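To give a flavour of what such a plugin does, here is a minimal Python sketch, not the code of the plugin linked above. It relies on Graphite's render endpoint returning JSON when asked with `format=json`, as a list of series each carrying `[value, timestamp]` datapoints. The base URL and metric name used in the example are placeholders, and the helper names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base_url, target, window="-5min"):
    """Build a render URL asking Graphite for raw JSON rather than a graph."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def average_from_render(payload):
    """Average the non-null datapoints in a Graphite JSON response.

    Returns None when there is no usable data, so the caller can
    report UNKNOWN rather than a misleading OK.
    """
    series = json.loads(payload)
    values = [
        value
        for item in series
        for value, _timestamp in item["datapoints"]
        if value is not None
    ]
    return sum(values) / len(values) if values else None


def fetch_average(base_url, target, window="-5min", timeout=10):
    """Fetch the recent datapoints for one target and average them."""
    with urlopen(render_url(base_url, target, window), timeout=timeout) as response:
        return average_from_render(response.read().decode("utf-8"))
```

A plugin would call something like `fetch_average("http://graphite.example.com", "servers.web1.load")` and compare the result with its warning and critical thresholds. Because the numbers come from the same store that draws the graph, the alert and the graph cannot disagree.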

The final option I’d like to mention is Reconnoiter. Reconnoiter is a ground-up implementation of enterprise monitoring emanating from the engineering team at OmniTI. Reconnoiter’s architecture is based around two daemons - noitd and stratcond. Noitd is the agent in the field - it ships with various modules, and pulls in metric data from remote nodes, very rapidly. Typically one noitd is deployed per datacentre. This data is then fed over a secure connection to the stratcond. Stratcond is responsible for aggregating information from various noitd instances in the field, and inserting it into a PostgreSQL database. Fault detection is possible via hooks into the Esper complex event processing system. The last time I looked into this, earlier in the summer, it was possible to write alerts in Esper’s EPL language to run against the live data, and push the results back over a message queue; however, there was no easy way to consume these results. At Atalanta Systems we make considerable use of Circonus (http://circonus.com/) - the monitoring-as-a-service offering from OmniTI that has Reconnoiter at its heart. This provides the alert capability and interfaces nicely with PagerDuty to make an excellent end-to-end solution. In R&D time we have backlog stories to look into getting alerting working with Reconnoiter, and have also invested time in Flapjack, the promising but dormant project from Lindsay Holmwood.

So, to summarise, with current open source tools, the options to implement the first pillar are as follows:

  • Build your monitoring system around Zenoss or Zabbix - both tools collect performance data and alert based on that
  • If you use Nagios, look into one of the plugins that can speak to the tool you use for performance metric gathering, and think seriously about deprecating your NRPE checks
  • If you have development and R&D time available, look into Reconnoiter or Flapjack

If you have the budget or appetite to outsource your monitoring infrastructure to an as-a-service provider, Circonus is an excellent option.

In the next article I’ll be discussing the second pillar - correlation is king.