The Six Pillars of Monitoring
In the last week I’ve had conversations with three different clients in which the subject of monitoring and performance trending has come up. In each case my role has been to assess the state of the incumbent monitoring solution, give a perspective on its fitness for purpose, and put forward ideas for how to improve it. In each case I’ve found myself saying the same thing, so in the spirit of DRY, I’m putting forward my framework for thinking about monitoring.
What is monitoring?
One way to answer this is with a thought experiment. What happens if we put a website live with no monitoring? Well, the truth is that this isn’t actually possible. Think about it - imagine you run a website that sells air rifles (I’ve recently been researching this with my son), and something happens to make the website unavailable. If your website has any degree of popularity, you’re going to find out about it soon enough. You might start to get emails from disgruntled customers. You might notice people tweeting, or you might even get phone calls from friends or associates. You do have monitoring - it’s just very late in the game, unautomated, uncontrolled and, frankly, unprofessional.
Now, naturally, few people would be so foolish as to have no monitoring at all. So let’s take a step back. What did the human monitoring do? It complained when stuff broke, where broke indicates a difference between the customer’s expectation and reality. If we automate this process, we get our warnings more quickly - perhaps quickly enough to fix the problem before our customers start to complain. This then is the primary definition of monitoring - a mechanism that alerts us when anomalous behaviour is detected so as to minimise service interruption.
But is this enough? If all we ever did was alert, albeit quickly, when things were broken, we would leave ourselves open to at least two problems. Firstly, sometimes the problems we experienced might not be simple to fix - we may still experience service interruption. And secondly, we’re not capturing or recording any historical data. That means we’re not learning anything from our mistakes. How could we improve this? We need to broaden our understanding of anomalous behaviour to be more than just “the site is down”. In order to do this we need to introduce some intelligence. We need to keep historical data in a format that we can access and analyse, and we need to be able to use our experience as engineers to identify the kinds of warning signs that indicate there could be a problem. We should also think of the data we capture as being part of a time series. By comparing data now with data in the past, we can gain valuable insight into the behaviour of our systems, and perhaps illuminate our understanding of the atypical behaviour we’re observing now. If we’ve reached this level of maturity in our thinking and praxis, our monitoring has moved beyond reactive to proactive - we’re looking for exceptional behaviour before it exhibits itself as a production problem.
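To make the idea of comparing data now with data in the past a little more concrete, here is a minimal sketch in Python. It is only one possible approach, not a recommendation of a particular tool: the function name `is_anomalous`, the threshold of three standard deviations, and the response-time figures are all assumptions made up for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` standard
    deviations away from the mean of `history`.

    `history` is a list of past readings for the same metric
    (an assumed, hypothetical data source).
    """
    if len(history) < 2:
        # Not enough historical data to form a baseline.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past reading was identical; any change is unusual.
        return current != baseline
    return abs(current - baseline) / spread > threshold


# Hypothetical response times (ms) recorded at the same hour on previous days.
history = [120, 118, 125, 122, 119, 121, 123]

print(is_anomalous(history, 124))  # a reading close to the baseline
print(is_anomalous(history, 310))  # a reading far outside the baseline
```

The point is not the statistics, which are deliberately crude, but that the check is only possible because the historical readings were kept in a form we can query.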
Why do we monitor?
This would seem to be an outwardly obvious question. We monitor our systems so we can minimise service interruption. But without over-pressing the point, why do we do that? Again, the obvious answer is because service interruption is bad - it’s bad for the customer and/or end user, and it’s bad for our reputation. Well, why do we care about that? This is the crux - we care about this stuff because it’s about money. This comes back to my fundamental contention that an effective technical manager’s role is to ensure that operations delivers value to the business. This might seem incredibly obvious - of course we monitor our systems because downtime costs money. But if we really think like this, why are so many of our systems so woefully monitored? And why are we not monitoring business metrics alongside technical metrics? The fact is that at the highest level, the purpose of a monitoring system is two-fold. Firstly, to provide engineering insight into our infrastructure and application, informing the creation of intelligent metrics and alerts, and secondly to provide business insight into the relationship between the way in which our infrastructure and application perform and our commercial or strategic objectives.
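As a rough illustration of what monitoring business metrics alongside technical metrics might look like, the sketch below computes a Pearson correlation coefficient between two series sampled over the same intervals. The metric names and all of the numbers are invented for the example, and a real system would need far more data and care before drawing conclusions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical samples taken over the same five intervals.
response_ms = [200, 250, 400, 800, 1200]   # technical metric
orders_per_min = [50, 48, 40, 22, 10]      # business metric

r = pearson(response_ms, orders_per_min)
print(f"correlation between response time and orders: {r:.2f}")
```

A strongly negative value here would suggest that slower pages coincide with fewer orders, which is exactly the kind of relationship between technical performance and commercial objectives that is worth surfacing.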
Naturally there will always be a reactive aspect to monitoring - alerting us that a fault condition has arisen - and this cannot be overlooked. But by investing time and engineering effort in measurement, correlation and analysis, we can improve our ability to anticipate problems before they become critical. In order best to serve this purpose, I believe we need to build and implement monitoring systems which adhere to the following principles, which I call the six pillars of monitoring.
- Pillar 1: We alert on what we draw
- Pillar 2: Correlation is king
- Pillar 3: We never throw data away
- Pillar 4: Our monitoring should be real time
- Pillar 5: Our monitoring should be API-driven
- Pillar 6: Our monitoring should be intelligent
In this series I will explore each pillar, before evaluating the current options for achieving monitoring Nirvana.