Puppet and policy - violator or enforcer?
A common challenge for an organisation running Puppet is how to balance the desire for a fully automated and standardised environment with the risk that automated Puppet runs may introduce bugs or revert hot fixes. This concern was apparent at Puppetcamp this morning, when Rafael Brito from the New York Stock Exchange gave an informative presentation about his experience of using Puppet to build machines for their live platform. What particularly struck me was that although his team puts a lot of effort into creating a standard environment, the current culture is that operations teams on the ground can and should make live changes to boxes, and that these changes may never make it back into Puppet.
I asked Rafael how frequently Puppet runs on the live machines, to ensure the state of each machine is kept the same, and according to standards. He told me ‘once a quarter’. I think it’s fair to say that in a context such as this, Puppet is not really being used as a config management tool - it’s being used as part of the build process to produce a standard image, which is then being managed in the traditional way.
I fully understand the motivation behind this approach. This is a very high profile application, and there’s a worry that mistakes in the Puppet manifest could accidentally be rolled out to the live site and cause a massive problem. Their situation is also complicated by having a large, multi-tiered operations team, across several countries, many of whom don’t know how to use Puppet. The approach they have settled on is to allow engineers to make changes to the live site, but to be aware that these machines will effectively be refreshed every quarter, and so there’s a risk that these changes may be lost. This places the burden of maintaining the standard on the team writing and maintaining the Puppet manifests, who must ensure that changes made by the operations team are folded in.
The trouble with this approach is that the de facto standard is always the current state of the machines, as modified by the operations team. If the Puppet run undoes some fixes applied by the operations team, Puppet is placed in the position of standards violator - that’s not a great place to be.
One issue we often come across with clients who have started to use Puppet occurs when a change rolled out by Puppet breaks the system. In this situation Puppet advocates are in a weak negotiating position - we can argue that the changes should have been made in Puppet, but when the site is down, and money is being lost, somehow that argument doesn’t win much support. The fact is that when a mistake is made, Puppet gets blamed - it broke the site. Sadly this can even result in pressure to stop using this unstable, unreliable tool.
I’d like to turn this on its head. We all agree that we need a standard or set of standards to which the live site must adhere. Let’s make Puppet the enforcer of this standard, and never the violator. This standard can be designed, tested, approved and signed off. This is the standard - we don’t diverge from it. Now we can set up a mechanism for testing the site against the standard, so we know if the standard has ever been broken.
A great way to do this is simply to run Puppet in noop mode, so it doesn’t make the changes, but simply reports what changes it would make if it were to run in live mode. If our standard is being adhered to, Puppet should usually report that it wouldn’t make a change. If Puppet reports that it would make a change, this should only ever be because that change has been approved by, for example, a change advisory board. This mechanism, therefore, will alert us as to whether the machine is out of sync with the standard, what changed, and how Puppet intends to revert the system to the agreed standard. Running this process with reasonable frequency will give us a pretty granular report of when changes were made, and could even be tied into system logs to identify the most likely source of the change. The output of the process could be parsed and monitored, alerts raised to senior stakeholders, and email reports sent out, detailing the change that has occurred.
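As a rough illustration of parsing the noop output, here is a minimal Python sketch. It rests on a few assumptions that are mine rather than anything from the talk: that the agent is invoked as `puppet agent --test --noop` (older installations may use a different command name), and that each change Puppet would have made is logged on a line ending in `(noop)`. The function names `run_noop`, `find_drift` and `report` are made up for the example.

```python
import re
import subprocess

# Strip ANSI colour codes, which Puppet may add to its log output.
ANSI_COLOUR = re.compile(r"\x1b\[[0-9;]*m")


def run_noop(command=("puppet", "agent", "--test", "--noop")):
    """Run Puppet in noop mode and return whatever it printed.

    Assumption: the command above is correct for your installation.
    """
    result = subprocess.run(list(command), capture_output=True, text=True)
    return result.stdout


def find_drift(noop_output):
    """Return the log lines describing changes Puppet would have made.

    Assumption: such lines end with the marker "(noop)".
    """
    drift = []
    for raw_line in noop_output.splitlines():
        line = ANSI_COLOUR.sub("", raw_line).strip()
        if line.endswith("(noop)"):
            drift.append(line)
    return drift


def report(hostname, drift):
    """Build a plain-text summary suitable for an email or an alert."""
    if not drift:
        return f"{hostname}: in sync with the standard"
    lines = [f"{hostname}: {len(drift)} pending change(s) against the standard"]
    lines.extend(f"  {entry}" for entry in drift)
    return "\n".join(lines)
```

On a machine with Puppet installed, something like `print(report("web01", find_drift(run_noop())))` could be run from a scheduler and its output fed into whatever monitoring is already in place; `web01` is only a placeholder hostname.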
This way we get to play the role of enforcer - we can say: Hey look - this change has happened - we can change it back again, and we should, but we need to find who made the change, why they made it, make it in Puppet, then back it out and apply it properly. We then need to identify and educate the policy breakers, and find out what happened.
This approach, I think, walks the line between the kind of careful conservatism that a production site needs, and the desire to make use of the power of Puppet to guarantee a consistent environment.
Of course this approach will also catch the other risk - the risk that someone has committed a change to Puppet which may get rolled out to live machines when not wanted. Again, there needs to be a policy to protect against this. Puppet changes in a live environment of this nature should not be made unless tested. This means that Puppet changes should be made in a testing branch, confirmed against a test environment, and only merged into the production repository when the testing has been completed to everyone’s satisfaction, and, in some environments, only rolled out following the appropriate change control mechanism. An hourly noop run, monitored, would immediately alert if someone had managed to get a change into the live Puppet manifest without following the correct procedure.
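One way the monitoring side of this might look, sketched in Python: compare the changes a noop run reports against a list of changes that have been signed off, and flag anything else. This is purely illustrative. The set of approved resources is a hypothetical stand-in for whatever your change control process records, the names `resource_of` and `unapproved_changes` are invented, and the regular expression assumes noop log lines of the form `.../Type[title]/property: ... (noop)`.

```python
import re

# Assumption: a noop log line names the resource as Type[title], followed by
# the property that would change, e.g. ".../File[/etc/ntp.conf]/content: ...".
RESOURCE_REF = re.compile(r"([A-Z][\w:]*\[[^\]]+\])/\w+: ")


def resource_of(noop_line):
    """Extract the resource reference from a noop log line, if there is one."""
    match = RESOURCE_REF.search(noop_line)
    return match.group(1) if match else None


def unapproved_changes(noop_lines, approved):
    """Return the pending changes that are not on the approved list.

    Lines whose resource cannot be identified are flagged as well, on the
    principle that anything unexplained deserves a look.
    """
    flagged = []
    for line in noop_lines:
        reference = resource_of(line)
        if reference is None or reference not in approved:
            flagged.append(line)
    return flagged
```

If the hourly run reports a pending change to `Service[sshd]` while the approved list contains only `File[/etc/ntp.conf]`, the `Service[sshd]` line is returned and can be escalated.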
Of course not running the Puppet daemon automatically brings with it a different set of management challenges - such as ensuring all machines are up to date, and minimising the time taken to bring the machines into sync. My answer to this is to orchestrate your Puppet clients from a central location, rather than to run them in daemon mode. I’ll cover this in a future article.