<?xml version='1.0' encoding='utf-8' ?>
<feed xmlns='http://www.w3.org/2005/Atom'>
  <title type='text'>Agile Sysadmin</title>
  <generator uri='http://effectif.com/nesta'>Nesta</generator>
  <id>tag:www.agilesysadmin.net,2009:/</id>
  <link href='http://www.agilesysadmin.net/articles.xml' rel='self' />
  <link href='http://www.agilesysadmin.net' rel='alternate' />
  <subtitle type='text'>Delivering value through web operations</subtitle>
  <updated>2011-09-26T00:00:00+00:00</updated>
  <entry>
    <title>Building a Devops team, Part 2</title>
    <link href='http://www.agilesysadmin.net/buiding-a-devops-team-part-2' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2011-09-26:/buiding-a-devops-team-part-2</id>
    <content type='html'>
      &lt;p&gt;&lt;em&gt;This is a guest post by Brian Henerey, from Sony Computer Entertainment Europe&lt;/em&gt;&lt;/p&gt;
      
      &lt;p&gt;A few months ago I wrote about &lt;a href='http://agilesysadmin.net/building-a-devops-team'&gt;Building a Devops team&lt;/a&gt; here at Sony. I really enjoy leading a team, and have tremendous pride in those I&amp;#8217;ve brought into Sony. When I was given additional head count recently, I was excited and went straight to work. When your screening process makes the front page of &lt;a href='http://news.ycombinator.com/item?id=2674841'&gt;HackerNews&lt;/a&gt; however, you know you&amp;#8217;re going to have to mix things up a bit. Here&amp;#8217;s what was new:&lt;/p&gt;
      
      &lt;h2 id='success_criteria'&gt;Success criteria&lt;/h2&gt;
      
      &lt;p&gt;Throughout this round of interviewing, I found myself asking a few questions about each candidate that became the ultimate success criteria:&lt;/p&gt;
      
      &lt;ol&gt;
      &lt;li&gt;Can they do the job?&lt;/li&gt;
      
      &lt;li&gt;Will they make this team better?&lt;/li&gt;
      
      &lt;li&gt;Will we learn from them?&lt;/li&gt;
      
      &lt;li&gt;Can they pick up new technologies and languages quickly?&lt;/li&gt;
      
      &lt;li&gt;What support will they need from me and the team?&lt;/li&gt;
      &lt;/ol&gt;
      
      &lt;h2 id='selfevaluations'&gt;Self-evaluations&lt;/h2&gt;
      
      &lt;p&gt;I like to measure things and be as objective as possible in my work. This is enormously hard to achieve in recruitment because it&amp;#8217;s about people, not machines, bandwidth, latency, etc. After the first couple of candidates we found, I started asking people to complete self-evaluations. I ended up comparing, weighting and graphing people&amp;#8217;s scores to see if it might be useful. I stole this evaluation form from Google&amp;#8217;s interview process and added a few categories to tailor it to my team&amp;#8217;s needs:&lt;/p&gt;
      
      &lt;p&gt;&lt;strong&gt;Self Evaluation Guide:&lt;/strong&gt;&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;10 - Wrote the book on it (there must be a book)&lt;/li&gt;
      
      &lt;li&gt;9 - Could have written the book, but didn&amp;#8217;t.&lt;/li&gt;
      
      &lt;li&gt;8 - Deep understanding of corner cases and esoteric features.&lt;/li&gt;
      
      &lt;li&gt;7 - Understanding and (appropriate) usage of most lesser known features.&lt;/li&gt;
      
      &lt;li&gt;6 - Can develop large programs and deploy new systems from scratch.&lt;/li&gt;
      
      &lt;li&gt;5 - Can develop/deploy larger programs/systems using all basic (w/o book) and more esoteric features (some w/ book, some without)&lt;/li&gt;
      
      &lt;li&gt;4 - Can develop/deploy medium programs/systems using all basic (w/o book) and a few esoteric features (w/ book). Understands enough about internals to do nontrivial troubleshooting.&lt;/li&gt;
      
      &lt;li&gt;3 - Can utilize basic features without much help, manage a small installation competently.&lt;/li&gt;
      
      &lt;li&gt;2 - Can write hello world without looking at a book, kind of figure out how a system works, if necessary.&lt;/li&gt;
      
      &lt;li&gt;1 - Can read programs, make small changes to existing programs, or make adjustments to already installed systems, w/book handy.&lt;/li&gt;
      
      &lt;li&gt;0 - No experience.&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ TCP/IP Networking&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Unix/Linux internals&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Unix/Linux Systems administration&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Algorithms &amp;amp; Data Structures&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ C&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ C++&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Java&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Python&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Perl&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Ruby&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Shell Scripting&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ SQL and/or Database Admin&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Scripting language of your choice, not already mentioned: _______________&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Build systems and/or Continuous Integration&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Performance testing and/or Capacity Planning&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Project Management and/or ITIL&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Configuration Management (Puppet, Chef, Cfengine, etc)&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Support of Production/Highly Available environments&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;blockquote&gt;
      &lt;p&gt;__ Amazon Web Services (or other &amp;#8216;cloud&amp;#8217; technologies/services)&lt;/p&gt;
      &lt;/blockquote&gt;
      
      &lt;h2 id='comparing_the_selfevaluations'&gt;Comparing the self-evaluations&lt;/h2&gt;
      &lt;img class='screenshot' src='/attachments/candidate-compare.png' /&gt;
      &lt;p&gt;This chart was really useful in getting me thinking about what I wanted. Candidate-8 certainly scored a lot higher than Candidate-1, but does that actually mean I want that candidate more? For example, before I weighted the scores, someone who was good at Ruby, Python, Perl, and Bash could end up with an unusually high total. I don&amp;#8217;t particularly need someone good at all of those, as they&amp;#8217;re so learnable. I ended up combining those languages into a single score where I gave an extra point or two for broad knowledge, but masked the original high totals.&lt;/p&gt;
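      
      &lt;p&gt;As a rough illustration of that kind of weighting, here is a minimal Ruby sketch. The rule it uses (best single score, plus one bonus point per extra language rated 5 or above, capped at two) and the sample scores are assumptions made up for the example, not the actual formula behind the chart.&lt;/p&gt;

```ruby
# Hypothetical weighting: collapse several scripting-language scores into one.
SCRIPTING = %w[Ruby Python Perl Shell].freeze

# Take the best single language score, then add one bonus point for each
# additional language rated 5 or above, capped at two bonus points.
# The result never exceeds 10, the top of the self-evaluation scale.
def combined_scripting_score(scores)
  rated = SCRIPTING.map { |lang| scores.fetch(lang, 0) }
  best = rated.max
  breadth = rated.count { |s| s >= 5 } - 1
  bonus = [[breadth, 0].max, 2].min
  [best + bonus, 10].min
end

# Made-up candidate scores; C is ignored because it is not a scripting language.
candidate = { "Ruby" => 7, "Python" => 6, "Perl" => 5, "Shell" => 6, "C" => 3 }
puts combined_scripting_score(candidate)
```

      &lt;p&gt;The four separate scores, which would otherwise add 24 points to the total, collapse into a single 9, so breadth is rewarded without dominating the comparison.&lt;/p&gt;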
      
      &lt;p&gt;The chart also kept me from being overly excited about a single skillset of a candidate. I&amp;#8217;m hiring generalist polyglot programmers after all. A couple of times we found that if a candidate had gone too far down a specialist path it seemed to inhibit creative thinking.&lt;/p&gt;
      
      &lt;p&gt;It is important, however, to keep two things in mind with this chart: some people overrate themselves, and others underrate themselves. I had one candidate tell me &amp;#8220;If I don&amp;#8217;t love myself, who will?&amp;#8221;. This was pretty funny, but it also meant I was going to critique his skills against the high bar he set himself.&lt;/p&gt;
      
      &lt;h2 id='remote_technical_test'&gt;Remote technical test&lt;/h2&gt;
      
      &lt;p&gt;Last time I blogged about hiring I gave away most of the details of the test, so we had to rethink how we were going to do this. In the end, we asked people to install WordPress on an EC2 instance again, but we made the task considerably harder. There were about 5 goose-eggs in this test that people had to get through:&lt;/p&gt;
      
      &lt;ol&gt;
      &lt;li&gt;Iptables was filtering incoming traffic to port 80 on the external NIC.&lt;/li&gt;
      
      &lt;li&gt;A process was already running on port 80 and it needed to be killed.&lt;/li&gt;
      
      &lt;li&gt;This process was run inside an infinite loop so anytime it was killed, a new process would spawn.&lt;/li&gt;
      
      &lt;li&gt;Apache configuration files were present which set the port to 81.&lt;/li&gt;
      
      &lt;li&gt;No MySQL root password was provided.&lt;/li&gt;
      &lt;/ol&gt;
      
      &lt;p&gt;Ultimately we wanted to see familiarity with Linux and basic systems administration skills. Tools such as netstat, telnet, pstree, and kill are pretty basic, but I only gave people 1 hour, so there wasn&amp;#8217;t much time for googling.&lt;/p&gt;
      
      &lt;p&gt;After 2 candidates I decided that anyone whose CV was good enough would have to complete the remote technical test before I&amp;#8217;d do a phone interview. While I spent a few hours writing/testing the Chef cookbook that created the test instances, this test was a VERY effective screening tool. Candidate-3 wasn&amp;#8217;t able to complete any of the problems above even though he rated himself a 4 in Linux Systems Administration, which should mean &amp;#8220;Understands enough about internals to do nontrivial troubleshooting.&amp;#8221;&lt;/p&gt;
      
      &lt;p&gt;Not many people passed the remote technical test, which isn&amp;#8217;t surprising, as I was mostly drawing people from Development backgrounds who may not have had much Systems Administration experience.&lt;/p&gt;
      
      &lt;h2 id='in_person_interview_rounds'&gt;In-person interview rounds&lt;/h2&gt;
      
      &lt;ol&gt;
      &lt;li&gt;
      &lt;p&gt;A member of my team would spend 30-60 minutes pair programming with a candidate. We would do simple things, usually in a language that the candidate didn&amp;#8217;t know. If they didn&amp;#8217;t know Ruby, we&amp;#8217;d do the Ruby Koans with them. This was mostly to see if this was a person we could spend a lot of close time with as a member of the team.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;We had a couple of fairly standard interviews as well, going over the person&amp;#8217;s CV and talking to them about the role. This also included white board sessions of programming and algorithm questions.&lt;/p&gt;
      &lt;/li&gt;
      &lt;/ol&gt;
      
      &lt;h2 id='programming_assignment'&gt;Programming assignment&lt;/h2&gt;
      
      &lt;p&gt;When I hired the two previous members of my team, they both went and taught themselves Ruby before they even had an interview because they knew it was important to the role. One of them created a demo application that calculates the number of stops between Tube stations. He included this with his application and I was suitably impressed. We decided we wanted to see that the candidates were interested in learning a new programming language and also see how quickly they could learn it. I think this is a LOT to ask a candidate since there is no guarantee we&amp;#8217;re going to hire them. In the end I made this optional, and used it as a deciding factor if I was on the fence. If someone had open source projects I would have accepted them instead.&lt;/p&gt;
      
      &lt;h2 id='the_end_is_the_beginning'&gt;The end is the beginning&lt;/h2&gt;
      
      &lt;p&gt;I&amp;#8217;ve filled my head count and now my real work begins: supporting them and getting the best out of the team.&lt;/p&gt;
    </content>
    <published>2011-09-26T00:00:00+00:00</published>
    <updated>2011-09-26T00:00:00+00:00</updated>
  </entry>
  <entry>
    <title>The first pillar: We alert on what we draw</title>
    <link href='http://www.agilesysadmin.net/pillar-one' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2011-09-25:/pillar-one</id>
    <content type='html'>
      &lt;p&gt;The first pillar of monitoring that I propose is that we should alert on what we draw. What I mean by this is that the data we use to determine whether a system is behaving anomalously should be the performance data that we are already collecting for trending purposes. What is the thinking behind this pillar? Well, if we cast our minds back to yesterday&amp;#8217;s introduction, we agreed that one of the purposes of monitoring systems was to provide data to allow us, as engineers, to detect patterns, and predict issues before they become production impacting. In order to do this, we need to be capturing data and storing it somewhere which allows us to analyse it. If we care about it - if the data could provide the kind of engineering insight which helps us to understand our systems and give early warning - we should be capturing it. If we have that data, why not use it as the basis for our alerting? To understand why this would be anything other than blindingly obvious, we need to consider the mainstream approach to this.&lt;/p&gt;
      
      &lt;p&gt;The most common open source tool used for alerting is &lt;a href='http://www.nagios.org'&gt;Nagios&lt;/a&gt;. A typical Nagios setup operates by carrying out a check for a certain metric, and comparing it to a predetermined threshold. An example might be disk space or CPU load. Although sometimes carried out over ssh, the most frequently used way of running these checks is via NRPE - the Nagios Remote Plugin Executor. A central monitoring server runs a check which speaks over the network to a remote machine, which carries out a check in real time, which returns a status, and sometimes some values associated with that check. There are a couple of weaknesses in this approach. Firstly, it duplicates effort: if we&amp;#8217;ve agreed that any metric we care about enough to alert on should also be gathered for analysis and graphed, then we already have the data upon which to base our check. Secondly, that data is not on the machine we&amp;#8217;re monitoring, so checks run against it don&amp;#8217;t add any further stress to that machine. Running a remote check might not sound like a very resource-intensive task, but on a struggling machine it could be the proverbial last straw.&lt;/p&gt;
      
      &lt;p&gt;Another reason to eschew this approach is security. It&amp;#8217;s a fundamental principle that we should run and expose as few network services as possible. Given that we don&amp;#8217;t need to run our monitoring over NRPE if we&amp;#8217;re already capturing the data elsewhere, an Ockhamist attitude would suggest we should stop using it. Beyond this general principle, in order to enable user customisation of alerts, NRPE is often configured to allow command arguments to be passed through, which gives a remote caller some influence over what is executed on the monitored machine. The risk can be mitigated by ensuring NRPE runs over SSL, as the nagios user, with a strict sudoers configuration, but if it can be shown that this isn&amp;#8217;t actually needed because we can simply alert on our performance data, the whole question is moot.&lt;/p&gt;
      
      &lt;p&gt;From the other side of the argument, there are good positive reasons to alert on your performance data too. The obvious one is that it allows for complex event processing - we have access to more data in the same place - it might be very valuable to know, for example, that not just one machine but all machines in a load balancer pool have an abnormally high load. Another great advantage is that you are never in any doubt as to the veracity of your data. You receive an alert, and can look at the graphs, and know that you&amp;#8217;re looking at &lt;em&gt;exactly&lt;/em&gt; the same data at &lt;em&gt;exactly&lt;/em&gt; the same time. This pleases me, as an engineer.&lt;/p&gt;
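      
      &lt;p&gt;To make the load balancer example concrete, here is a minimal Ruby sketch of a pool-wide check. The host names, load values and threshold are invented for illustration, and the hash simply stands in for whatever your metrics store returns.&lt;/p&gt;

```ruby
# Hypothetical latest load averages per host, as a metrics store might return them.
POOL_LOAD = { "web1" => 9.2, "web2" => 8.7, "web3" => 10.1 }.freeze

# True only when every host in the pool exceeds the threshold.
# An empty pool is treated as not overloaded rather than vacuously true.
def pool_overloaded?(loads, threshold)
  return false if loads.empty?
  loads.values.all? { |load| load > threshold }
end

puts pool_overloaded?(POOL_LOAD, 8.0)
```

      &lt;p&gt;A check like this is only practical when the data for every host is already in one place, which is exactly what alerting on your performance data gives you.&lt;/p&gt;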
      
      &lt;p&gt;OK - so there&amp;#8217;s a strong philosophical and practical case for doing our alerting on our performance metrics. How can we go about this?&lt;/p&gt;
      
      &lt;p&gt;Well, firstly, if you&amp;#8217;re already using &lt;a href='http://www.zabbix.com'&gt;Zabbix&lt;/a&gt; or &lt;a href='http://www.zenoss.com'&gt;Zenoss&lt;/a&gt;, you&amp;#8217;re already using a homogeneous alerting and monitoring system - there&amp;#8217;s no practical difference between the data you graph and the data you use for your alerting. If you use Nagios, however, you&amp;#8217;re probably gathering trending data somewhere else. Popular options are &lt;a href='http://munin-monitoring.org/'&gt;Munin&lt;/a&gt;, &lt;a href='http://www.cacti.net/'&gt;Cacti&lt;/a&gt;, &lt;a href='http://graphite.wikidot.com/'&gt;Graphite&lt;/a&gt;, &lt;a href='http://collectd.org/'&gt;Collectd&lt;/a&gt; or &lt;a href='http://ganglia.sourceforge.net/'&gt;Ganglia&lt;/a&gt;. What we need to do is to be able to query this data with a Nagios plugin and alert based on the thresholds we&amp;#8217;ve agreed.&lt;/p&gt;
      
      &lt;p&gt;The simplest and most obvious way to do this is to pull the metrics to the Nagios server and process them. This is the approach taken by Vladimir Vuksan and Michael Conigliaro to enable Nagios to alert on metrics collected in Ganglia - see https://github.com/mconigliaro/check_ganglia_metric.&lt;/p&gt;
      
      &lt;p&gt;Another approach is to use Nagios to query your RRD data stores. This is the approach used by the check_munin_rrd tool - see http://nagios-munin.googlecode.com/svn/trunk/check_munin_rrd.pl. Naturally, this logic could be adapted to Collectd or Cacti. The Nagios Exchange has a number of plugins for speaking to RRD databases that all look to be worthy of exploration.&lt;/p&gt;
      
      &lt;p&gt;One of the most important tools to gain attention and advocates in the last 18 months is Graphite. Graphite was written in 2006 as a real time graphing system. It has three components - a processing engine, called carbon, a datastore, called whisper, and a Django app which presents the graphs. The API is written in such a way as to allow the raw data that would be used to create a graph to be requested. This makes it possible to write a Nagios plugin that uses data from Graphite for alerting purposes. An example plugin can be found at https://github.com/recoset/check_graphite.&lt;/p&gt;
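      
      &lt;p&gt;The core of such a plugin can be sketched in a few lines of Ruby. This is not the check_graphite code: it is a minimal illustration that assumes a response body in the JSON shape Graphite returns for a render request with format=json (a list of series, each with a target and a list of value and timestamp pairs). The metric name, sample values and thresholds are invented, and fetching the body over HTTP is left out so the example runs offline.&lt;/p&gt;

```ruby
require "json"

# Standard Nagios plugin exit codes.
OK = 0
WARNING = 1
CRITICAL = 2
UNKNOWN = 3

# Average the non-null datapoints of the first series and map the result
# onto a Nagios status. Thresholds are illustrative.
def graphite_status(json_body, warn:, crit:)
  series = JSON.parse(json_body)
  return UNKNOWN if series.empty?
  # Each datapoint is a [value, timestamp] pair; value may be null.
  values = series.first["datapoints"].map { |point| point[0] }.compact
  return UNKNOWN if values.empty?
  average = values.sum / values.size.to_f
  return CRITICAL if average >= crit
  return WARNING if average >= warn
  OK
end

# Sample body in the shape of a render request with format=json.
sample = '[{"target": "servers.web1.load", "datapoints": [[4.0, 1316900000], [null, 1316900060], [6.0, 1316900120]]}]'
puts graphite_status(sample, warn: 3.0, crit: 8.0)
```

      &lt;p&gt;A real plugin would print a one-line summary and call exit with the returned status, since Nagios reads the state from the process exit code.&lt;/p&gt;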
      
      &lt;p&gt;The final option I&amp;#8217;d like to mention is &lt;a href='https://labs.omniti.com/labs/reconnoiter'&gt;Reconnoiter&lt;/a&gt;. Reconnoiter is a ground-up implementation of enterprise monitoring emanating from the engineering team at Omniti. Reconnoiter&amp;#8217;s architecture is based around two daemons - noitd and stratcond. Noitd is the agent in the field - it ships with various modules, and pulls in metric data from remote nodes, very rapidly. Typically one noitd is deployed per datacentre. This data is then fed over a secure connection to the stratcond. Stratcond is responsible for aggregating information from various noitd instances in the field, and inserting it into a PostgreSQL database. Fault detection is possible via hooks into the &lt;a href='http://esper.codehaus.org/'&gt;Esper&lt;/a&gt; complex event processing system. The last time I looked into this, earlier in the summer, it was possible to write alerts in Esper&amp;#8217;s EPL language to run against the live data, and push the results back over a message queue; however, there was no easy way to consume these results. At Atalanta Systems we make considerable use of &lt;a href='http://circonus.com/'&gt;Circonus&lt;/a&gt; - the monitoring-as-a-service offering from &lt;a href='http://omniti.com/'&gt;Omniti&lt;/a&gt; that has Reconnoiter at its heart. This provides the alert capability and interfaces nicely with PagerDuty to make an excellent end-to-end solution. In R&amp;amp;D time we have backlog stories to look into getting alerting working with Reconnoiter, and have also invested time in &lt;a href='http://flapjack-project.com/'&gt;Flapjack&lt;/a&gt;, the promising but dormant project from &lt;a href='https://github.com/auxesis'&gt;Lindsay Holmwood&lt;/a&gt;.&lt;/p&gt;
      
      &lt;p&gt;So, to summarise, with current open source tools, the options to implement the first pillar are as follows:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;Build your monitoring system around Zenoss or Zabbix - both tools collect performance data and alert based on that&lt;/li&gt;
      
      &lt;li&gt;If you use Nagios, look into one of the plugins that can speak to the tool you use for performance metric gathering, and think seriously about deprecating your NRPE checks&lt;/li&gt;
      
      &lt;li&gt;If you have development and R&amp;amp;D time available, look into Reconnoiter or Flapjack&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;If you have the budget or appetite to outsource your monitoring infrastructure to an as-a-service provider, Circonus is an excellent option.&lt;/p&gt;
      
      &lt;p&gt;In the next article I&amp;#8217;ll be discussing the second pillar - correlation is king.&lt;/p&gt;
    </content>
    <published>2011-09-25T22:35:44+00:00</published>
    <updated>2011-09-25T22:35:44+00:00</updated>
    <category term='devops'></category>
    <category term='monitoring'></category>
  </entry>
  <entry>
    <title>The Six Pillars of Monitoring</title>
    <link href='http://www.agilesysadmin.net/six-pillars' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2011-09-24:/six-pillars</id>
    <content type='html'>
      &lt;p&gt;In the last week I&amp;#8217;ve had conversations with three different clients in which the subject of monitoring and performance trending has come up. In each case my role has been to assess the state of the incumbent monitoring solution, give a perspective on its fitness for purpose, and put forward ideas for how to improve it. In each case I&amp;#8217;ve found myself saying the same thing, so in the spirit of DRY, I&amp;#8217;m putting forward my framework for thinking about monitoring.&lt;/p&gt;
      
      &lt;h2 id='what_is_monitoring'&gt;What is monitoring?&lt;/h2&gt;
      
      &lt;p&gt;One way to answer this is with a thought experiment. What happens if we put a website live with no monitoring? Well, the truth is that this isn&amp;#8217;t actually possible. Think about it - imagine you run a website that sells air rifles (I&amp;#8217;ve recently been researching this with my son), and something happens to make the website become unavailable. If your website has any degree of popularity, you&amp;#8217;re going to find out about it some time soon. You might start to get emails from disgruntled customers. You might start noticing people tweeting, or you might even get phone calls from friends or associates. You do have monitoring - it&amp;#8217;s just very late in the game, unautomated, uncontrolled and, frankly, unprofessional.&lt;/p&gt;
      
      &lt;p&gt;Now, naturally, few people would be so foolish as to have no monitoring at all. So let&amp;#8217;s take a step back. What did the human monitoring do? It complained when stuff broke, where broke indicates a difference between the customer&amp;#8217;s expectation and reality. If we automate this process, we get our warnings more quickly - perhaps quickly enough to fix the problem before our customers start to complain. This then is the primary definition of monitoring - a mechanism that alerts us when anomalous behaviour is detected so as to minimise service interruption.&lt;/p&gt;
      
      &lt;p&gt;But is this enough? If all we ever did was alert, albeit quickly, when things were broken, we would leave ourselves open to at least two problems. Firstly, sometimes the problems we experienced might not be simple to fix - we may still experience service interruption. And secondly, we&amp;#8217;re not capturing or recording any historical data. That means we&amp;#8217;re not learning anything from our mistakes. How could we improve this? We need to broaden our understanding of anomalous behaviour to be more than just &amp;#8220;the site is down&amp;#8221;. In order to do this we need to introduce some intelligence. We need to keep historical data in a format that we can access and analyse, and we need to be able to use our experience as engineers to identify the kinds of warning signs that indicate there could be a problem. We should also think of the data we capture as being part of a time series. By comparing data now with data in the past, we could gain valuable insight into the behaviour of our systems, and perhaps illuminate our understanding of the atypical behaviour we&amp;#8217;re observing now. If we&amp;#8217;ve reached this level of maturity in our thinking and praxis, our monitoring has moved beyond reactive to proactive - we&amp;#8217;re looking for exceptional behaviour before it exhibits itself as a production problem.&lt;/p&gt;
      
      &lt;h2 id='why_do_we_monitor'&gt;Why do we monitor?&lt;/h2&gt;
      
      &lt;p&gt;This would seem to be an outwardly obvious question. We monitor our systems so we can minimise service interruption. But without over-pressing the point, why do we do that? Again, the obvious answer is because service interruption is bad - it&amp;#8217;s bad for the customer and/or end user, and it&amp;#8217;s bad for our reputation. Well, why do we care about that? This is the crux - we care about this stuff because it&amp;#8217;s about money. This comes back to my fundamental contention that an effective technical manager&amp;#8217;s role is to ensure that operations delivers value to the business. This might seem incredibly obvious - of course we monitor our systems because downtime costs money. But if we really think like this, why are so many of our systems so woefully monitored? And why are we not monitoring business metrics alongside technical metrics? The fact is that at the highest level, the purpose of a monitoring system is two-fold. Firstly, to provide engineering insight into our infrastructure and application, informing the creation of intelligent metrics and alerts, and secondly to provide business insight into the relationship between the way in which our infrastructure and application performs and our commercial or strategic objectives.&lt;/p&gt;
      
      &lt;p&gt;Naturally there will always be a reactive aspect to monitoring - alerting us that a fault condition has arisen - and this cannot be overlooked, but by investing time and engineering effort in measurement, correlation and analysis we can improve our ability to anticipate problems before they become critical. In order best to serve this purpose, I believe we need to build and implement monitoring systems which adhere to the following principles, which I call the six pillars of monitoring.&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;Pillar 1: We alert on what we draw&lt;/li&gt;
      
      &lt;li&gt;Pillar 2: Correlation is king&lt;/li&gt;
      
      &lt;li&gt;Pillar 3: We never throw data away&lt;/li&gt;
      
      &lt;li&gt;Pillar 4: Our monitoring should be real time&lt;/li&gt;
      
      &lt;li&gt;Pillar 5: Our monitoring should be API-driven&lt;/li&gt;
      
      &lt;li&gt;Pillar 6: Our monitoring should be intelligent&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;In this series I will explore each pillar, before evaluating the current options for achieving monitoring Nirvana.&lt;/p&gt;
    </content>
    <published>2011-09-24T12:35:44+00:00</published>
    <updated>2011-09-24T12:35:44+00:00</updated>
    <category term='devops'></category>
    <category term='monitoring'></category>
  </entry>
  <entry>
    <title>Command-line cookbook dependency solving with knife exec</title>
    <link href='http://www.agilesysadmin.net/chef-dependencies' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2011-06-23:/chef-dependencies</id>
    <content type='html'>
      &lt;p&gt;Imagine you have a fairly complicated infrastructure with a large number of nodes and roles. Suppose you have a requirement to take one of the nodes and rebuild it in an entirely new network, perhaps even for a completely different organization. This should be easy, right? We have our infrastructure in the form of code. However, our current infrastructure has hundreds of uploaded cookbooks - how do we know the minimum ones to download and move over? We need to find out from a node exactly what cookbooks are needed for that node to be built.&lt;/p&gt;
      
      &lt;p&gt;The obvious place to start is with the node itself:&lt;/p&gt;
      
      &lt;pre&gt;&lt;code&gt;$ knife node show controller&amp;#x000A;Node Name:   controller&amp;#x000A;Environment: _default&amp;#x000A;FQDN:        controller&amp;#x000A;IP:          182.13.194.41&amp;#x000A;Run List:    role[base], recipe[apt::cacher], role[pxe_server]&amp;#x000A;Roles:       pxe_server, base&amp;#x000A;Recipes      apt::cacher, pxe_dust::server, dhcp, dhcp::config&amp;#x000A;Platform:    ubuntu 10.04&lt;/code&gt;&lt;/pre&gt;
      
      &lt;p&gt;OK, this tells us we need the apt, pxe_dust and dhcp cookbooks. But what about them - do they have any dependencies? How could we find out? Well, dependencies are specified in two places - in the cookbook metadata, and in the individual recipes. Here&amp;#8217;s a primitive way to illustrate this:&lt;/p&gt;
      
      &lt;pre&gt;&lt;code&gt;bash-3.2$ for c in apt pxe_dust dhcp&amp;#x000A;&amp;gt; do&amp;#x000A;&amp;gt; grep -iER &amp;#39;include_recipe|^depends&amp;#39; $c/* | cut -d &amp;#39;&amp;quot;&amp;#39; -f 2 | sort | uniq&amp;#x000A;&amp;gt; done&amp;#x000A;apt::cacher-client&amp;#x000A;apache2&amp;#x000A;pxe_dust::server&amp;#x000A;tftp&amp;#x000A;tftp::server&amp;#x000A;utils&lt;/code&gt;&lt;/pre&gt;
      
      &lt;p&gt;As I said - primitive. However, the problem doesn&amp;#8217;t end here. In order to be sure, we now need to repeat this for each dependency, recursively. And of course it would be nice to present them more attractively. Thinking about it, it would be rather useful to know what cookbook versions are in use too. This is definitely not a job for a shell one-liner - is there a better way?&lt;/p&gt;
      
      &lt;p&gt;As it happens, there is. Think about it - the Chef server already needs to solve these dependencies to know what cookbooks to push to API clients. Can we access this logic? Of course we can - clients carry out all their interactions with the Chef server via the API. This means we can let the server solve the dependencies and query it via the API ourselves.&lt;/p&gt;
      
      &lt;p&gt;Chef provides two powerful ways to access the API without having to write a RESTful client. The first, Shef, is an interactive REPL based on IRB, which when launched gives access to the Chef server. This isn&amp;#8217;t trivial to use. The second, much simpler way is the knife exec subcommand. This allows you to write Ruby scripts or simple one-liners that are executed in the context of a fully configured Chef API Client using the knife configuration file.&lt;/p&gt;
      
      &lt;pre&gt;&lt;code&gt;knife exec -E &amp;#39;(api.get &amp;quot;nodes/controller/cookbooks&amp;quot;).each { |cb| pp cb[0] =&amp;gt; cb[1].version }&amp;#39;&lt;/code&gt;&lt;/pre&gt;
      
      &lt;p&gt;The /nodes/NODE_NAME/cookbooks endpoint returns the cookbook attributes, definitions, libraries and recipes that are required for this node. The response is a hash of cookbook name and Chef::CookbookVersion object. We simply iterate over each one, and pretty print the cookbook name and the version.&lt;/p&gt;
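      
      &lt;p&gt;The same iteration can be tried without a Chef server. The sketch below is plain Ruby: CookbookStub is a made-up stand-in for Chef::CookbookVersion that only models the version attribute, and the response hash is invented sample data rather than real API output.&lt;/p&gt;

```ruby
# Stand-in for Chef::CookbookVersion; only the version attribute is modelled.
CookbookStub = Struct.new(:version)

# Hypothetical response: cookbook name mapped to a version-bearing object.
response = {
  "apt" => CookbookStub.new("1.1.1"),
  "tftp" => CookbookStub.new("0.1.0"),
  "dhcp" => CookbookStub.new("0.1.0")
}

# Build a name => version hash, sorted by cookbook name.
def cookbook_versions(cookbooks)
  cookbooks.map { |name, cookbook| [name, cookbook.version] }.sort.to_h
end

cookbook_versions(response).each { |name, version| puts "#{name} #{version}" }
```

      &lt;p&gt;Sorting by name is a small step towards the more attractive presentation mentioned earlier; the one-liner itself prints cookbooks in whatever order the server returns them.&lt;/p&gt;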
      
      &lt;p&gt;Let&amp;#8217;s give it a try:&lt;/p&gt;
      
      &lt;pre&gt;&lt;code&gt;$ knife exec -E &amp;#39;(api.get &amp;quot;nodes/controller/cookbooks&amp;quot;).each { |cb| pp cb[0] =&amp;gt; cb[1].version }&amp;#39;&amp;#x000A;{&amp;quot;apt&amp;quot;=&amp;gt;&amp;quot;1.1.1&amp;quot;}&amp;#x000A;{&amp;quot;tftp&amp;quot;=&amp;gt;&amp;quot;0.1.0&amp;quot;}&amp;#x000A;{&amp;quot;apache2&amp;quot;=&amp;gt;&amp;quot;0.99.3&amp;quot;}&amp;#x000A;{&amp;quot;dhcp&amp;quot;=&amp;gt;&amp;quot;0.1.0&amp;quot;}&amp;#x000A;{&amp;quot;utils&amp;quot;=&amp;gt;&amp;quot;0.9.5&amp;quot;}&amp;#x000A;{&amp;quot;pxe_dust&amp;quot;=&amp;gt;&amp;quot;1.1.0&amp;quot;}&lt;/code&gt;&lt;/pre&gt;
      
      &lt;p&gt;Nifty! :)&lt;/p&gt;
    </content>
    <published>2011-06-23T23:03:44+00:00</published>
    <updated>2011-06-23T23:03:44+00:00</updated>
    <category term='chef'></category>
    <category term='linux'></category>
    <category term='system-administration'></category>
  </entry>
  <entry>
    <title>Building a Devops team</title>
    <link href='http://www.agilesysadmin.net/building-a-devops-team' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2011-05-25:/building-a-devops-team</id>
    <content type='html'>
      &lt;p&gt;This is a guest post by Brian Henerey, from Sony Computer Entertainment Europe.&lt;/p&gt;
      
      &lt;h2 id='background'&gt;Background&lt;/h2&gt;
      
      &lt;p&gt;I&amp;#8217;ve had 3 roles at Sony since joining in August 2008. Nearly a year ago I took over the management of the original engineering team I joined. This was a failing team by any definition, but I was excited about the opportunity to reshape it. I knew the remaining team was deeply unhappy and likely to quit at any moment, so I had a few immediate goals:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;Hire!&lt;/li&gt;
      
      &lt;li&gt;Keep people from quitting.&lt;/li&gt;
      
      &lt;li&gt;Hire!&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;&lt;em&gt;Side story&lt;/em&gt;: I stumbled on one important objective I didn&amp;#8217;t list, however: keep customers happy. It doesn&amp;#8217;t matter how awesome you think your team &lt;em&gt;can&lt;/em&gt; be if no one wants to work with you based on past experiences. I didn&amp;#8217;t appreciate how much a demotivated employee could jeopardise customer relationships by virtue of not caring. It has taken me &lt;em&gt;months&lt;/em&gt; to restore trust with one customer. I&amp;#8217;ve heard a story about a manager offering employees £500 to quit on a regular basis. I think that probably has some practical problems, but it&amp;#8217;s a tempting idea to cull the unmotivated.&lt;/p&gt;
      
      &lt;p&gt;I come from a long background of small/medium size enterprises. It has been a challenge adapting to a large corporation, but I don&amp;#8217;t think there&amp;#8217;s much unique to Sony about the anti-Devops patterns I&amp;#8217;ve encountered. I know several people in small companies who say they&amp;#8217;ve been practicing Devops since before there was such a word, and I completely agree. The troubles of silos, bureaucracy, organizational boundaries, politics, etc, seem pretty common in larger businesses though. I can&amp;#8217;t speak to how to create a Devops culture across a large organisation from the top down, but I&amp;#8217;ve been working really hard to create one from the inside.&lt;/p&gt;
      
      &lt;h2 id='the_beginning'&gt;The beginning&lt;/h2&gt;
      
      &lt;p&gt;A year ago I&amp;#8217;d never heard of the term Devops. If you&amp;#8217;re in the same boat, it is easy to find a great deal to read about what Devops is:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;&lt;a href='http://www.jedi.be/blog/2010/02/12/what-is-this-devops-thing-anyway/'&gt;what is this devops thing anyway?&lt;/a&gt;&lt;/li&gt;
      
      &lt;li&gt;&lt;a href='http://www.kartar.net/2010/02/what-devops-means-to-me/'&gt;what devops means to me&lt;/a&gt;&lt;/li&gt;
      
      &lt;li&gt;&lt;a href='http://agileoperations.net/index.php?/archives/35-DevOps-and-Agile-Operations.html'&gt;Devops and Agile Operations&lt;/a&gt;&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;And what it is not:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;&lt;a href='http://www.agileweboperations.com/what-devops-is-not'&gt;What devops is not&lt;/a&gt;&lt;/li&gt;
      
      &lt;li&gt;&lt;a href='http://dev2ops.org/blog/2010/11/7/devops-is-not-a-technology-problem-devops-is-a-business-prob.html'&gt;devops is not a technology problem&lt;/a&gt;&lt;/li&gt;
      
      &lt;li&gt;&lt;a href='http://www.krisbuytaert.be/blog/apparently-devops-not-jobtitle'&gt;devops not jobtitle&lt;/a&gt;&lt;/li&gt;
      
      &lt;li&gt;&lt;a href='http://streamstep.com/index.php/blogs/armblog/devops_is_not_just_about_automation/'&gt;devops is not just about automation&lt;/a&gt;&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;However, I suspect some people will have trouble finding the read-worthy gems amongst all the chatter. Here&amp;#8217;s a good place to get started: &lt;a href='http://dev2ops.org/blog/2011/5/12/getting-started-with-devops.html'&gt;getting started with devops&lt;/a&gt;. The gigantic list of Devops related bookmarks compiled by Patrick Debois shows why you may not want to try and read &lt;em&gt;everything&lt;/em&gt;: &lt;a href='http://jedi.be/bookmarks/'&gt;devops bookmarks&lt;/a&gt;&lt;/p&gt;
      
      &lt;p&gt;If you&amp;#8217;re in the know already and Devops resonates with you, and you want to build a team around the concept, here&amp;#8217;s how I went about it.&lt;/p&gt;
      
      &lt;h2 id='networking'&gt;Networking&lt;/h2&gt;
      
      &lt;p&gt;The term Devops didn&amp;#8217;t really take shape &lt;em&gt;for me&lt;/em&gt; until I started to talk about it with others. Fortunately, London has a really active Devops community so I&amp;#8217;ve had ample opportunity. The tireless Gareth Rushgrove organises many events, and The Guardian is a frequent host. I&amp;#8217;ve been to sessions discussing Continuous Integration, Deployments, Google App Engine, Load Balancers, Chef, CloudFoundry, etc. I&amp;#8217;ve found people to be incredibly open about technology, processes, culture, difficulties and successes they&amp;#8217;ve had.&lt;/p&gt;
      
      &lt;p&gt;While Devops is of course about more than technology and tools, I personally have found Devops to be an excellent banner under which to have really interesting conversations. Having a forum which brings people from diverse backgrounds together has helped me shape my own internal understanding of what Devops &lt;em&gt;should&lt;/em&gt; be about.&lt;/p&gt;
      
      &lt;p&gt;I felt a bit of an imposter going to the initial London Devops meetups because I was so keen on recruiting. However, the quality of the discussions has been so good I eagerly anticipate each upcoming meetup even though I&amp;#8217;m no longer hiring. I&amp;#8217;ve also discovered that half the attendees are also hiring. It&amp;#8217;s a Devopsee&amp;#8217;s market.&lt;/p&gt;
      
      &lt;p&gt;&lt;strong&gt;Result!:&lt;/strong&gt; I met and subsequently hired Stephen Nelson-Smith from Atalanta-Systems. (He&amp;#8217;s @Lordcope on Twitter, and the author of &lt;a href='http://agilesysadmin.net/'&gt;agilesysadmin.net&lt;/a&gt;.)&lt;/p&gt;
      
      &lt;h2 id='working_definition_of_devops'&gt;Working definition of Devops&lt;/h2&gt;
      
      &lt;p&gt;If you&amp;#8217;re going to hire people with Devops in mind, it&amp;#8217;s good to have a working definition. I like the pillars of Devops (CAMS) put forth by John Willis: &lt;a href='http://www.opscode.com/blog/2010/07/16/what-devops-means-to-me/'&gt;what devops means to me&lt;/a&gt;&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;Culture&lt;/li&gt;
      
      &lt;li&gt;Automation&lt;/li&gt;
      
      &lt;li&gt;Measurement&lt;/li&gt;
      
      &lt;li&gt;Sharing&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;SMAC might have been a better acronym, but I&amp;#8217;ll go with CAMS.&lt;/p&gt;
      
      &lt;h2 id='a_devops_job_spec'&gt;A Devops job spec&lt;/h2&gt;
      
      &lt;p&gt;I don&amp;#8217;t think Devops is a role, though I&amp;#8217;ve seen job postings for such a thing. I only mentioned that I was looking for someone &amp;#8216;Devops-savvy&amp;#8217;, and later changed it to &amp;#8216;Devops-minded&amp;#8217; or something similar. The job posting expired and I&amp;#8217;d have to dig it out, but R.I.Pinearr described it on Twitter as the &amp;#8216;perfect devops job posting&amp;#8217;. I&amp;#8217;m pretty keen on revising a job spec until the requirements are only things I actually require and can measure against. That said, how to write a job spec is way outside the scope of this post. To summarize, I was looking for:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;problem solving skills&lt;/li&gt;
      
      &lt;li&gt;&amp;#8216;can do&amp;#8217; attitude&lt;/li&gt;
      
      &lt;li&gt;good team fit (really hard to quantify)&lt;/li&gt;
      
      &lt;li&gt;a broad set of skills (LAMP, Java, C++, Ruby, Python, Oracle, Scaling/Capacity, High-Availability, etc, etc)&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;My team works on a ton of different technology stacks, and the landscape is constantly changing. It&amp;#8217;s a techie&amp;#8217;s dream job, but the interpersonal skills are the most important.&lt;/p&gt;
      
      &lt;h2 id='recruiters'&gt;Recruiters&lt;/h2&gt;
      
      &lt;p&gt;I strongly believe in giving recruiters a fair bit of my time. I&amp;#8217;ve seen many people be rude to recruiters, ignore them, etc, and then wonder why they don&amp;#8217;t get good candidates through. I&amp;#8217;m quite keen on engaging the recruiters, explaining the role I&amp;#8217;m trying to fill thoroughly, and having the occasional coffee or beer with them. Feedback is of course vital to candidates, and I try to give it honestly and quickly, letting the recruiter worry about sugar coating things.&lt;/p&gt;
      
      &lt;h2 id='cv_selection'&gt;CV selection&lt;/h2&gt;
      
      &lt;p&gt;This is tough. I regularly get CV blindness where everyone starts to look the same. And generally ill-suited. I try to remember there are human beings on the other end and force myself to have concrete reasons why I&amp;#8217;m rejecting someone. Talking to a recruiter about this helps me be concrete.&lt;/p&gt;
      
      &lt;h2 id='first_interview__remote_technical_test'&gt;First interview - remote technical test&lt;/h2&gt;
      
      &lt;p&gt;This is where things get interesting! I don&amp;#8217;t know if this is unique to London, but I&amp;#8217;ve had a LOT of candidates from other countries apply to join this team. For candidates with a good CV whose English language skills the recruiter vouches for, I developed a screening test that can be conducted remotely. This saves a trip to London + hotel, and I can end it promptly if things aren&amp;#8217;t going well. Here&amp;#8217;s how it works:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;I email the candidate/recruiter a url to an ec2 instance that I spin up on the day about 20 minutes before the interview.&lt;/li&gt;
      
      &lt;li&gt;The instance is running a web server which serves the instructions for the test. Initially these only state that the candidate will need a terminal such as Putty if they&amp;#8217;re on Windows.&lt;/li&gt;
      
      &lt;li&gt;At the arranged time I phone the candidate. I explain that there will be two tests. The first is a sys admin task which will be time bound to 20 minutes. The second is a programming task which they can use the remainder of the time to complete. The call will end after 1 hour.&lt;/li&gt;
      
      &lt;li&gt;I explain the rules: They are to perform all of their work on the ec2 instance. They have a test account/password, and sudo root access. They can use any resources they want to solve the problems. Google, man pages, libraries are not only fair game, but fully expected.&lt;/li&gt;
      
      &lt;li&gt;I explain what I want from them: They need to talk to me, tell me what they are thinking, and walk me through the problem solving process. I&amp;#8217;m far more interested in that dialogue than whether they solve either problem I give them.&lt;/li&gt;
      
      &lt;li&gt;I also add that we&amp;#8217;re using Screen, and I can see everything they type.&lt;/li&gt;
      
      &lt;li&gt;I swap the index.html with the complete instructions in place, make note of the time, and let them begin.&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;&lt;strong&gt;The problems&lt;/strong&gt;&lt;/p&gt;
      
      &lt;p&gt;1) It&amp;#8217;s really quite simple: install Wordpress and configure it to work properly. The catch is that we install mysql first, break it, and then watch as candidates wonder what the heck is going on. For an experienced sysadmin this is child&amp;#8217;s play. I tended to interview people with a stronger development background and less familiarity with installing applications. I could tell almost immediately how well someone knew their way around a Linux system. It was interesting to see what kinds of assumptions people made about the system itself (I never mentioned the OS that was running. Several just assumed Ubuntu.) Some people read instructions, some don&amp;#8217;t. I give people the mysqladmin password, but some people search on how to reset a lost password because they didn&amp;#8217;t read what I gave them. I had one guy spend 10 minutes trying to ssh to http://ec2&amp;#8230;&amp;#8230;. I gave him a pass on nerves, but he continued to suck and I ended it soon thereafter. He blamed the language barrier (Eastern European), and said if only I had been more clear to him. If I can&amp;#8217;t communicate with him, I think that&amp;#8217;s a pretty big problem and it doesn&amp;#8217;t really matter whose fault it is.&lt;/p&gt;
      
      &lt;p&gt;2) We provide sanitized Production Tomcat logs for a real application we support and ask the candidate to write a log parsing script in a language of their choice. We want the output of the script to show method calls, call counts, frequencies, and average and 90th percentile latencies. Our preference is Ruby, but they can do it however they&amp;#8217;d like. I had one candidate choose to implement this in Bash, writing some serious regex-fu; I had no idea how it worked. He got stuck however, and as he claimed to be a Ruby developer I couldn&amp;#8217;t help but ask why he didn&amp;#8217;t do it in Ruby, which was my stated preference. He started over in Ruby and did okay. Depending on how much time was spent on problem 1, this part of the interview is really boring for me. I stay on the phone in case they have questions, I ask them to explain their approach before they begin coding, but then I just start checking email/etc. After 60 minutes total is up, I explain to the candidate that they can continue working on the coding task as long as they need and to send me an email when they&amp;#8217;ve finished. I get off the phone however, stating that we&amp;#8217;ll give them feedback as soon as we&amp;#8217;ve reviewed the code they submit and explain the next steps.&lt;/p&gt;
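
      &lt;p&gt;For a flavour of what the coding task involves, here is a minimal Ruby sketch. It is an illustration under stated assumptions, not the real solution: the log line format (timestamp, method name, latency in milliseconds) and the names SAMPLE_LINES, LINE_PATTERN, percentile and summarise are all hypothetical, and the 90% latency is treated as a nearest-rank 90th percentile. The real sanitized logs look different.&lt;/p&gt;

```ruby
# Hypothetical log lines: "TIMESTAMP METHOD LATENCY_MS". The real format differs.
SAMPLE_LINES = [
  '2011-05-01T10:00:00 getProfile 120',
  '2011-05-01T10:00:01 getProfile 80',
  '2011-05-01T10:00:02 login 200',
  '2011-05-01T10:00:03 getProfile 100',
  'garbage line that should be skipped'
].freeze

# Captures the method name and the latency; the timestamp is ignored here.
LINE_PATTERN = /\A\S+ (\w+) (\d+)\z/

# Nearest-rank percentile of an already sorted array.
def percentile(sorted, fraction)
  rank = (fraction * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Returns a hash of method name to call count, average and 90th percentile latency.
def summarise(lines)
  latencies = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    match = LINE_PATTERN.match(line)
    next unless match

    latencies[match[1]].push(match[2].to_i)
  end
  latencies.map do |name, values|
    sorted = values.sort
    stats = {
      count: sorted.size,
      average: sorted.sum.to_f / sorted.size,
      p90: percentile(sorted, 0.9)
    }
    [name, stats]
  end.to_h
end

summarise(SAMPLE_LINES).each do |name, stats|
  puts format('%-12s calls=%d avg=%.1fms p90=%dms', name, stats[:count], stats[:average], stats[:p90])
end
```

      &lt;p&gt;Lines that do not match the pattern are skipped silently, which is the kind of design choice I would expect a candidate to talk through out loud.&lt;/p&gt;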
      
      &lt;p&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;/p&gt;
      
      &lt;p&gt;I put several candidates through this process. When I first created this test, I&amp;#8217;d have a couple of members of my team on the call as well, but we found this too time consuming and a bit intimidating to certain candidates. Timeboxing problem 1 was a HUGE improvement, and once Stephen Nelson-Smith was on board I had someone better than me at evaluating the Ruby code. We all felt this test process was extremely revealing of candidates&amp;#8217; skillsets and I highly recommend it.&lt;/p&gt;
      
      &lt;p&gt;One of my favourite candidates conducted this interview on a laptop in the shared wifi area of a crowded and noisy London hostel. In the background were screaming people and overbearing Christmas music. He was able to tune out the distractions and nailed both problems with ease, and got major bonus points for doing so.&lt;/p&gt;
      
      &lt;h2 id='round_2__face_to_face_interview'&gt;Round 2 - Face to face interview&lt;/h2&gt;
      
      &lt;p&gt;Round 2 actually has a few parts:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;Coffee/lunch/dinner informal chat up to 1 hour in length. I explain what I&amp;#8217;m looking for; they can talk about themselves; we can find out if we have a good match.&lt;/li&gt;
      
      &lt;li&gt;Hypothetical whiteboard problem solving exercise: You receive a call saying customer goes to http://yoursite.com and gets a blank page. What do you do next? We can improvise a bit here on what the actual problem is, but we&amp;#8217;re hoping to learn two things: How does this person approach problem solving? What level of architectural complexity have they been exposed to?&lt;/li&gt;
      
      &lt;li&gt;2 hours of pair programming with a member of my team. This is usually a real bit of work that needs doing. It could be writing a chef cookbook, or a cucumber test, etc. We want to learn what it&amp;#8217;s like to work closely with this person. My team pair programs often. Do we want to pair with this person day in / day out?&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;h2 id='round_3__my_boss__any_member_of_my_team_who_hasnt_met_the_candidate_yet'&gt;Round 3 - my boss + any member of my team who hasn&amp;#8217;t met the candidate yet.&lt;/h2&gt;
      
      &lt;ul&gt;
      &lt;li&gt;This is generally very open, though my boss has her own techniques for evaluating people.&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;It&amp;#8217;s very important to me that everyone on my team has a voice. I was quite keen on one candidate, but when one of my team members voiced vague concerns about the person&amp;#8217;s team-fit, we all stopped and took it on board. We rejected the candidate in the end because once the first doubts were out in the open, other people&amp;#8217;s concerns started to be raised as well. I recognised that I was a bit too keen to hire someone to fill a pressing need and am glad of how things worked out.&lt;/p&gt;
      
      &lt;h2 id='a_great_candidatehire'&gt;A GREAT candidate/hire&lt;/h2&gt;
      
      &lt;p&gt;One of my favourite hires not only knows C, Java, and Linux, but also wrote a sample Ruby application because he knew we were looking to hire Ruby skills within the team. His app worked out the shortest path between tube stations, though only in terms of number of stops, not time travelled. This initiative told me a lot about him, and it&amp;#8217;s been 100% the same since he joined the team. Eager to learn and try new things. Any problem/task put in front of him is &amp;#8216;easy&amp;#8217;. My only trouble is he tends to consider problems solved when he&amp;#8217;s worked out in his head how he will solve them. This is a bit of a joke really. I accused him the other day of declaring checkmate on a task because he was so confident it would be completed in his next seven steps.&lt;/p&gt;
      
      &lt;h2 id='beyond_hiring'&gt;Beyond hiring&lt;/h2&gt;
      
      &lt;p&gt;Now what? Well, hiring the right people is HUGE. We celebrated each hire, as opposed to the typical &amp;#8216;leaving drinks&amp;#8217; when people move on. How I manage the team will be a future blog post (I hope), but I&amp;#8217;ll add one quick comment. Hiring people according to the vision I had means that I am held accountable as well. Whenever I find myself explaining that the reason for a decision I&amp;#8217;m making is &amp;#8216;politics&amp;#8217;, I know I have to change.&lt;/p&gt;
      
      &lt;h2 id='about_the_author'&gt;About the author&lt;/h2&gt;
      
      &lt;p&gt;&lt;a href='bhenerey.jpg'&gt;&lt;img src='/attachments/bhenerey_sm.jpg' alt='Image' /&gt;&lt;/a&gt;&lt;/p&gt;
      
      &lt;p&gt;Brian Henerey heads up Operations Engineering in the Online Technology Group at Sony Computer Entertainment Europe. His passions include Devops, Tool-chains, Web Operations, Continuous Delivery and Lean thinking. He&amp;#8217;s currently building automated infrastructure pipelines with Ruby, Chef, and AWS, enabling self-service, just-in-time development and test environments for Sony&amp;#8217;s Worldwide Studios.&lt;/p&gt;
      
      &lt;p&gt;&lt;a href='http://uk.linkedin.com/in/brianhenerey' title='Brian Henerey on Linkedin'&gt;&lt;img src='/attachments/linkedin.png' alt='Image' /&gt;&lt;/a&gt; &lt;a href='http://twitter.com/bhenerey' title='@bhenerey on twitter'&gt;&lt;img src='/attachments/twitter.png' alt='Image' /&gt;&lt;/a&gt;&lt;/p&gt;
    </content>
    <published>2011-05-25T20:03:44+00:00</published>
    <updated>2011-05-25T20:03:44+00:00</updated>
    <category term='agile'></category>
    <category term='devops'></category>
    <category term='system-administration'></category>
  </entry>
  <entry>
    <title>Kanban for Sysadmin</title>
    <link href='http://www.agilesysadmin.net/kanban_sysadmin' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2011-04-27:/kanban_sysadmin</id>
    <content type='html'>
      &lt;p&gt;&lt;em&gt;This article was originally published in December 2009, in Jordan Sissel&amp;#8217;s SysAdvent&lt;/em&gt;&lt;/p&gt;
      
      &lt;p&gt;Unless you&amp;#8217;ve been living in a remote cave for the last year, you&amp;#8217;ve probably noticed that the world is changing. With the maturing of automation technologies like Puppet, the popular uptake of Cloud Computing, and the rise of Software as a Service, the walls between developers and sysadmins are beginning to be broken down. Increasingly we&amp;#8217;re beginning to hear phrases like &amp;#8216;Infrastructure is code&amp;#8217;, and terms like &amp;#8216;Devops&amp;#8217;. This is all exciting. It also has an interesting knock-on effect. Most development environments these days are at least strongly influenced by, if not run entirely according to &amp;#8216;Agile&amp;#8217; principles. Scrum in particular has experienced tremendous success, and adoption by non-development teams has been seen in many cases. On the whole the headline objectives of the Agile movement are to be embraced, but the thorny question of how to apply them to operations work has yet to be answered satisfactorily.&lt;/p&gt;
      
      &lt;p&gt;I&amp;#8217;ve been managing systems teams in an Agile environment for a number of years, and after thought and experimentation, I can recommend using an approach borrowed from Lean systems management, called Kanban.&lt;/p&gt;
      
      &lt;h2 id='operations_teams_need_to_deliver_business_value'&gt;Operations teams need to deliver business value&lt;/h2&gt;
      
      &lt;p&gt;As a technical manager, my top priority is to ensure that my teams deliver business value. This is especially important for Web 2.0 companies - the infrastructure is the platform &amp;#8211; is the product &amp;#8211; is the revenue. Especially in tough economic times it&amp;#8217;s vital to make sure that as sysadmins we are adding value to the business.&lt;/p&gt;
      
      &lt;p&gt;In practice, this means improving throughput - we need to be fixing problems more quickly, delivering improvements in security, performance and reliability, and removing obstacles to enable us to ship product more quickly. It also means building trust with the business - improving the predictability and reliability of delivery times. And, of course, it means improving quality - the quality of the service we provide, the quality of the staff we train, and the quality of life that we all enjoy - remember - happy people make money.&lt;/p&gt;
      
      &lt;p&gt;The development side of the business has understood this for a long time. Aided by Agile principles (and implemented using such approaches as Extreme Programming or Scrum) developers organise their work into iterations, at the end of which they will deliver a minimum marketable feature, which will add value to the business.&lt;/p&gt;
      
      &lt;p&gt;The approach may be summarised as moving from the historic model of software development as a large team taking a long time to build a large system, towards small teams, spending a small amount of time, building the smallest thing that will add value to the business, but integrating frequently to see the big picture.&lt;/p&gt;
      
      &lt;p&gt;Systems teams starting to work alongside such development teams are often tempted to try the same approach.&lt;/p&gt;
      
      &lt;p&gt;The trouble is, for a systems team, committing to a two week plan, and setting aside time for planning and retrospective meetings, prioritisation and estimation sessions just doesn&amp;#8217;t fit. Sysadmin work is frequently interrupt-driven, demands on time are uneven, frequently specialised and require concentrated focus. Radical shifts in prioritisation are normal. It&amp;#8217;s not even possible to commit to much shorter sprints of a day, as sysadmin work also includes project and investigation activities that couldn&amp;#8217;t be delivered in such a short space of time.&lt;/p&gt;
      
      &lt;p&gt;Dan Ackerman recently carried out a survey in which he asked sysadmins their opinions and experience of using agile approaches in systems work&lt;span&gt;1&lt;/span&gt;. The general feeling was that it helped encourage organisation, focus and coordination, but that it didn&amp;#8217;t seem to handle the reactive nature of systems work, and the prescription of regular meetings interrupted the flow of work. My own experience of sysadmins trying to work in iterations is that they frequently fail their iterations, because the world changed (sometimes several times) and the iteration no longer captured the most important things. A strict, iteration-based approach just doesn&amp;#8217;t work well for operations - we&amp;#8217;re solving different problems. When we contrast a highly interdependent systems team with a development team who work together for a focussed time, answering to themselves, it&amp;#8217;s clear that the same tools won&amp;#8217;t necessarily be appropriate.&lt;/p&gt;
      
      &lt;h2 id='what_is_kanban_and_how_might_it_help'&gt;What is Kanban, and how might it help?&lt;/h2&gt;
      
      &lt;p&gt;Let&amp;#8217;s keep this really really simple. You might read other explanations making it much more complicated than necessary. A Kanban system is simply a system with two specific characteristics. Firstly, it is a pull-based system. Work is only ever pulled into the system, on the basis of some kind of signal. It is never pushed; it is accepted, when the time is right, and when there is capacity to do the work. Secondly, work in progress (WIP) is limited. At any given time there is a limit to the amount of work flowing through the system - once that limit is reached, no more work is pulled into the system. Once some of that work is complete, space becomes available and more work is pulled into the system.&lt;/p&gt;
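
      &lt;p&gt;To make those two characteristics concrete, here is a toy Ruby model of a pull-based system with a WIP limit. It is only a sketch: the KanbanBoard class, its method names and the example cards are invented for illustration, and a real board is of course a physical thing rather than code.&lt;/p&gt;

```ruby
# A toy pull-based board: work is only pulled when a slot is free.
class KanbanBoard
  attr_reader :backlog, :in_progress, :done

  def initialize(wip_limit)
    @wip_limit = wip_limit
    @backlog = []
    @in_progress = []
    @done = []
  end

  # New requests wait in the backlog; they are never pushed into progress.
  def request(card)
    @backlog.push(card)
  end

  # True while the number of cards in progress is below the WIP limit.
  def capacity?
    (@wip_limit - @in_progress.size).positive?
  end

  # Pull the next card only if the WIP limit allows it. Returns the card or nil.
  def pull
    return nil if @backlog.empty?
    return nil unless capacity?

    card = @backlog.shift
    @in_progress.push(card)
    card
  end

  # Completing a card frees a slot, which is the signal to pull again.
  def complete(card)
    return nil unless @in_progress.delete(card)

    @done.push(card)
    pull
  end
end

board = KanbanBoard.new(2)
['patch web servers', 'rotate logs', 'upgrade mysql'].each { |card| board.request(card) }
3.times { board.pull }
puts "In progress: #{board.in_progress.inspect}"
board.complete('rotate logs')
puts "In progress: #{board.in_progress.inspect}"
```

      &lt;p&gt;The third pull is refused because the limit of two has been reached; the third card only enters the system once another card is completed.&lt;/p&gt;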
      
      &lt;p&gt;Kanban as a system is all about managing flow - getting a constant and predictable stream of work through, whilst improving efficiency and quality. This maps perfectly onto systems work - rather than viewing our work as a series of projects, with annoying interruptions, we view our work as a constant stream of work of varying kinds.&lt;/p&gt;
      
      &lt;p&gt;As sysadmins we are not generally delivering product, in the sense that a development team are. We&amp;#8217;re supporting those who do, addressing technical debt in the systems, and looking for opportunities to improve resilience, reliability and performance.&lt;/p&gt;
      
      &lt;h3 id='supporting_tools'&gt;Supporting tools&lt;/h3&gt;
      
      &lt;p&gt;Kanban is usually associated with some tools to make it easy to implement the basic philosophy. Again, keeping it simple, all we need is a stack of index cards and a board.&lt;/p&gt;
      
      &lt;p&gt;The word Kanban itself means &amp;#8216;Signal Card&amp;#8217; - and is a token which represents a piece of work which needs to be done. This maps conveniently onto the agile &amp;#8216;story card&amp;#8217;. The board is a planning tool, and an information radiator. Typically it is organised into the various stages on the journey that a piece of work goes through. This could be as simple as to-do, in-progress, and done, or could feature more intermediate steps.&lt;/p&gt;
      
      &lt;p&gt;The WIP limit controls the amount of work (or cards) that can be on any particular part of the board. The board makes visible exactly who is working on what, and how much capacity the team has. It provides information to the team, and to managers and other people, about the progress and priorities of the team.&lt;/p&gt;
      
      &lt;p&gt;Kanban teams abandon the concept of iterations altogether. As Andrew Clay Shafer once said to me: &amp;#8220;We will just work on the highest priority &amp;#8216;stuff&amp;#8217;, and kick-ass!&amp;#8221;&lt;/p&gt;
      
      &lt;p&gt;&lt;img src='http://2.bp.blogspot.com/_u-5lMShiO40/Sy4FyzIyriI/AAAAAAAAADI/QDIyzQBv5nU/s1600/kanban_board.jpg' alt='Kanban board' /&gt;&lt;/p&gt;
      
      &lt;h3 id='how_does_kanban_help'&gt;How does Kanban help?&lt;/h3&gt;
      
      &lt;p&gt;Kanban brings value to the business in three ways - it improves trust, it improves quality and it improves efficiency.&lt;/p&gt;
      
      &lt;p&gt;Trust is improved because very rapidly the team starts being able to deliver quickly on the highest priority work. There&amp;#8217;s no iteration overhead, it is absolutely transparent what the team is working on, and, because the responsibility for prioritising the work to be done lies outside the technical team, the business soon begins to feel that the team really is working &lt;em&gt;for them&lt;/em&gt;.&lt;/p&gt;
      
      &lt;p&gt;Quality is improved because the WIP limit makes problems visible very quickly. Let&amp;#8217;s consider two examples - suppose we have a team of four sysadmins:&lt;/p&gt;
      
      &lt;p&gt;The team decides to set a WIP limit of one. This means that the team as a whole will only ever work on one piece of work at a time. While that work is being done, everything else has to wait. The effect of this will be that all four sysadmins will need to work on the same issue simultaneously. This will result in very high quality work, and the tasks themselves should get done fairly quickly, but it will also be wasteful. Work will start queueing up ahead of the &amp;#8216;in progress&amp;#8217; section of the board, and the flow of work will be too slow. Also it won&amp;#8217;t always be possible for all four people to work on the same thing, so for some of the time the other sysadmins will be doing nothing. This will be very obvious to anyone looking at the board. Fairly soon it will become apparent that the WIP limit of one is too low.&lt;/p&gt;
      
      &lt;p&gt;Suppose we now decide to increase the WIP limit to ten. The sysadmins go their own ways, each starting work on one card. The progress on each card will be slower, because there&amp;#8217;s only one person working on it, and the quality may not be as good, as individuals are more likely to make mistakes than pairs. The individual sysadmins also don&amp;#8217;t concentrate as well on their own, but work is still flowing through the system. However fairly soon, something will come up which makes progress difficult. At this stage a sysadmin will pick another card and work on that. Eventually two or three cards will be &amp;#8216;stuck&amp;#8217; on the board, with no progress, while work flows around them owing to the large WIP limit. Eventually we might hit a big problem, system wide, that halts progress on all work, and perhaps even impacts other teams. It turns out that this problem was the reason why work stopped on the tasks earlier on. The problem gets fixed, but the impact on the team&amp;#8217;s productivity is significant, and the business has been impacted too. Had the WIP limit been lower, the team would have been forced to react sooner.&lt;/p&gt;
      
      &lt;p&gt;The board also makes it very clear to the team, and to anyone following the team, what kind of work patterns are building up. As an example, if the team&amp;#8217;s working cadence seems to be characterised by a large number of interrupts, especially for repeatable work, or to put out fires, that&amp;#8217;s a sign that the team is paying interest on technical debt. The team can then make a strong case for tackling that debt, and the WIP limit protects the team as they do so.&lt;/p&gt;
      
      &lt;p&gt;Efficiency is improved simply because this method of working has been shown to be the best way to get a lot of work through a system. Kanban has its origins in Toyota&amp;#8217;s lean processes, and has been explored and used in dozens of different kinds of work environment. Again, the effects of the WIP limit, and the visibility of their impact on the board, make it very easy to optimise the system and reduce the cycle time - that is, the time it takes to complete a piece of work once it enters the system.&lt;/p&gt;
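      
      &lt;p&gt;The link between WIP and cycle time can be made concrete with Little&amp;#8217;s Law, a general queueing result that is not named in the text above: average cycle time equals average WIP divided by average throughput. The Python sketch below is only an illustration. The function name and the figure of two cards per day are assumptions, and it assumes throughput stays constant as WIP changes, which the scenarios above suggest is not always true.&lt;/p&gt;
      
```python
def average_cycle_time(wip: float, throughput_per_day: float) -> float:
    """Little's Law: average cycle time = average WIP / average throughput."""
    if not throughput_per_day > 0:
        raise ValueError("throughput must be positive")
    return wip / throughput_per_day


# Assumed figure for illustration: the team completes two cards per day.
for wip_limit in (10, 4):
    days = average_cycle_time(wip_limit, throughput_per_day=2)
    print(f"WIP {wip_limit}: about {days:.1f} days per card")
```
      
      &lt;p&gt;With these assumed numbers, cutting WIP from ten to four shortens the average time a card spends in the system from five days to two.&lt;/p&gt;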
      
      &lt;p&gt;Another benefit of Kanban boards is that they encourage self-management. At any time any team member can look at the board and see at once what is being worked on, what should be worked on next and, with a little experience, can see where the problems are. If there&amp;#8217;s one thing sysadmins hate, it&amp;#8217;s being micro-managed. As long as there is commitment to respect the board, a sysops team will self-organise very well around it. Happy teams produce better quality work, at a faster pace.&lt;/p&gt;
      
      &lt;h2 id='how_do_i_get_started'&gt;How do I get started?&lt;/h2&gt;
      
      &lt;p&gt;If you think this sounds interesting, here are some suggestions for getting started.&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;
      &lt;p&gt;Have a chat to the business - your manager and any internal stakeholders. Explain to them that you want to introduce some work practices that will improve quality and efficiency, but which will mean that you will be limiting the amount of work you do - i.e. you will have to start saying no. Try the puppy dog close: &amp;#8220;Let&amp;#8217;s try this for a month - if you don&amp;#8217;t feel it&amp;#8217;s working out, we&amp;#8217;ll go back to the way we work now&amp;#8221;.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Get the team together, buy them pizza and beer, and try playing some Kanban games. There are a number of ways of doing this, but basically you need to come up with a scenario in which the team has to produce things, but the work is going to be limited and only accepted when there is capacity. Speak to me if you want some more detailed ideas - there are a few decent resources out there.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Get the team together for a white-board session. Try to get a sense of the kinds of phases your work goes through. How much emergency support work is there? How much general user support? How much project work? Draw up a first cut of a Kanban board, and imagine some scenarios. The key thing is to be creative. You can make work flow left to right, or top to bottom. You can use coloured cards or plain cards - it doesn&amp;#8217;t matter. The point of the board is to show what work is being done, by whom, and to make explicit what the WIP limits are.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Set up your Kanban board somewhere highly visible and easy to get to. You could use a whiteboard and magnets, a cork board and pins, or just stick cards to a wall with Blu Tack. You can draw lines with a ruler, or you can use insulating tape to give bold, straight dividers between sections. Make it big, and clear.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Agree your WIP limit amongst yourselves - it doesn&amp;#8217;t matter what it is - just pick a sensible number, and be prepared to tweak it based on experience.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Gather your current work backlog together and put each piece of work on a card. If you can, sit with the various stakeholders for whom the work is being done, so you can get a good idea of what the acceptance criteria are, and their relative importance. You&amp;#8217;ll end up with a huge stack of cards - I keep them in a card box, next to the board.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Get your manager, and any stakeholders together, and have a prioritisation session. Explain that there&amp;#8217;s a work in progress limit, but that work will get done quickly. Your team will work on whatever is agreed is the highest priority. Then stick the highest priority cards to the left of (or above) the board. I like to have a &amp;#8216;Next Please&amp;#8217; section on the board, with a WIP limit. Cards can be added or removed by anyone from this board, and the team will pull from this section when capacity becomes available.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Write up a team charter - decide on the rules. You might agree not to work on other people&amp;#8217;s cards without asking first. You might agree times of the day you&amp;#8217;ll work. I suggest two very important rules - once a card goes onto the in progress section of the board, it never comes off again, until it&amp;#8217;s done. And nobody works on anything that isn&amp;#8217;t on the board. Write the charter up, and get the team to sign it.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Have a daily standup meeting at the start of the day. At this meeting, unlike a traditional scrum or XP standup, we don&amp;#8217;t need to ask who is working on what, or what they&amp;#8217;re going to work on next - that&amp;#8217;s already on the board. Instead, talk about how much more is needed to complete the work, and discuss any problems or impediments that have come up. This is a good time for the team to write up cards for work they feel needs to be done to make their systems more reliable, or to make their lives easier. I recommend trying to get agreement from the business to always ensure one such card is in the &amp;#8216;Next Please&amp;#8217; section.&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Set up a ticketing system. I&amp;#8217;ve used RT and Eventum. The idea is to reduce the number of interrupts, and to make it easy to track whatever work is being carried out. We have a rule of thumb that everything needs a ticket. Work that can be carried out within about ten minutes can just be done, at the discretion of the sysadmin. Anything that&amp;#8217;s going to take longer needs to go on the board. We have a dedicated &amp;#8216;Support&amp;#8217; section on our board, with a WIP limit. If there are more support requests than slots on the board, it&amp;#8217;s up to the requestors to agree amongst themselves which has the greatest business value (or cost).&lt;/p&gt;
      &lt;/li&gt;
      
      &lt;li&gt;
      &lt;p&gt;Have a regular retrospective. I find fortnightly is enough. Set aside an hour or so, buy the team lunch, and talk about how the previous fortnight has been. Try to identify areas for improvement. I recommend using &amp;#8216;SWOT&amp;#8217; (strengths, weaknesses, opportunities, threats) as a template for discussion. Also try to get into the habit of asking &amp;#8216;Five Whys&amp;#8217; - keep asking why until you really get to the root cause. Also try to ensure you fix things &amp;#8216;Three ways&amp;#8217;. These habits are part of a practice called &amp;#8216;Kaizen&amp;#8217; - continuous improvement. They feed into your Kanban process, and make everyone&amp;#8217;s life easier, and improve the quality of the systems you&amp;#8217;re supporting.&lt;/p&gt;
      &lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;The use of Kanban in development and operations teams is an exciting new development, but one which people are finding fits very well with a devops kind of approach to systems and development work. If you want to find out more, I recommend the following resources:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;http://limitedwipsociety.org - the home of Kanban for software development; a central place where ideas, resources and experiences are shared&lt;/li&gt;
      
      &lt;li&gt;http://finance.groups.yahoo.com/group/kanbandev - the mailing list for people deploying Kanban in a software environment - full of very bright and experienced people&lt;/li&gt;
      
      &lt;li&gt;http://www.agileweboperations.com - excellent blog covering all aspects of agile operations from a devops perspective&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;&lt;span&gt;1&lt;/span&gt;http://www.agileweboperations.com/what-do-sysadmins-really-think-about-agile/&lt;/p&gt;
    </content>
    <published>2011-04-27T06:03:44+00:00</published>
    <updated>2011-04-27T06:03:44+00:00</updated>
    <category term='agile'></category>
    <category term='devops'></category>
    <category term='system-administration'></category>
  </entry>
  <entry>
    <title>Today's EC2 / EBS Outage: Lessons learned</title>
    <link href='http://www.agilesysadmin.net/ec2-outage-lessons' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2011-04-21:/ec2-outage-lessons</id>
    <content type='html'>
      &lt;p&gt;Today Britain woke to the news that Amazon Web Services had suffered a major outage in its US East facility. This affected Heroku, Reddit, Foursquare, Quora and many more well-known internet services hosted on EC2. The cause of the outage appears to have been a case of so-called &amp;#8216;auto-immune disease&amp;#8217;. Amazon&amp;#8217;s automated processes began remirroring a large number of EBS volumes, which had the knock-on effect of significantly degrading EBS (and thus RDS) performance and availability across multiple availability zones. Naturally the nay-sayers were out in force, decrying cloud-based architectures as doomed to failure from the very start. As the dust starts to settle, we attempt to distill some lessons from the outage.&lt;/p&gt;
      
      &lt;h2 id='expect_downtime'&gt;Expect downtime&lt;/h2&gt;
      
      &lt;p&gt;The first and most obvious point to make is that downtime is inevitable. Clouds fail. Datacenters fail. Disasters happen. The people trying to make some causal relationship between deploying to the cloud and general failure are missing the point.&lt;/p&gt;
      
      &lt;p&gt;What matters is how you respond to downtime. At Atalanta Systems we challenge our clients to switch off machines at random. If their architecture isn&amp;#8217;t built to withstand failure, we&amp;#8217;ve failed in helping them. Incidentally we&amp;#8217;ve been doing this for years, long before anyone ever mentioned &amp;#8216;chaos monkeys&amp;#8217;.&lt;/p&gt;
      
      &lt;p&gt;Especially in a cloudy world, expect failure - EC2 instances can and will randomly crash. Expect this, and you won&amp;#8217;t be disappointed. From day one, expect hardware problems, expect network problems, expect your availability zone to break.&lt;/p&gt;
      
      &lt;p&gt;Now, of course, there&amp;#8217;s a big difference between switching off a few machines or pulling a few cables and losing a whole datacenter. However, we have to expect downtime, and we have to be ready for it. Here are a few suggestions:&lt;/p&gt;
      
      &lt;p&gt;&lt;strong&gt;Use Amazon&amp;#8217;s built-in availability mechanisms&lt;/strong&gt;&lt;/p&gt;
      
      &lt;p&gt;Don&amp;#8217;t treat AWS like a traditional datacenter. Amazon provides up to four availability zones per region, and a range of free and paid-for tools for using them. Techniques for taking advantage of these features range from as simple as using elastic IP addresses and remapping manually to a different zone, to using multi-availability zone RDS instances to replicate database updates across zones.&lt;/p&gt;
      
      &lt;p&gt;Make use of autoscaling groups, and deploy in more than two availability zones. Latency between zones is minimal, and autoscaling groups can span availability zones, and can be configured to trigger based on utilisation. People maintaining that it costs twice as much to run a highly available infrastructure in AWS simply haven&amp;#8217;t read the documentation. Take care to avoid the classic fallacy of having three web servers at 60% utilisation, and one failing, resulting in two failing immediately afterwards.&lt;/p&gt;
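      
      &lt;p&gt;The arithmetic behind that fallacy is worth spelling out. The Python sketch below is an illustration only: the function names are invented here, and it assumes that the load of a failed server is spread evenly over the survivors. Whether a survivor actually falls over at a given utilisation depends on the application.&lt;/p&gt;
      
```python
def utilisation_after_failure(servers: int, utilisation: float, failed: int) -> float:
    """Per-server utilisation once the failed servers' load is spread evenly over the survivors."""
    survivors = servers - failed
    if not survivors > 0:
        raise ValueError("at least one server must survive")
    return servers * utilisation / survivors


def max_safe_utilisation(servers: int, failed: int, ceiling: float = 1.0) -> float:
    """Highest steady-state utilisation that keeps each survivor at or below `ceiling`."""
    return ceiling * (servers - failed) / servers


# Three web servers at 60%, one fails: the remaining two must absorb its load.
print(f"after failure: {utilisation_after_failure(3, 0.60, 1):.0%} each")

# Assumed ceiling of 80% per server, chosen purely for illustration.
print(f"run at or below: {max_safe_utilisation(3, 1, ceiling=0.80):.0%}")
```
      
      &lt;p&gt;In this example the two survivors jump to roughly 90% each, which leaves almost no headroom for a load spike. To keep them under the assumed 80% ceiling, the three servers would need to run at about 53% or less in normal operation.&lt;/p&gt;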
      
      &lt;p&gt;Size your infrastructure to include headroom for load spikes, and to be able to sustain a complete AZ failure. For any business for whom downtime can be measured in tens of pounds per minute (which accounts for even small startups), it&amp;#8217;s cheaper to build in the availability than to suffer the outage.&lt;/p&gt;
      
      &lt;p&gt;The problem with today&amp;#8217;s outage is that it appears to have impacted multiple availability zones. The full explanation for this has not yet been forthcoming, but it does serve to highlight that if availability really matters to you, you really need to consider using multiple regions. Amazon has points of presence on the East coast, the West coast, Western Europe, and two in South East Asia. Backing up to S3 from one region enables restore into another. CloudWatch triggers can be used to launch new instances in a different region, or even a full stack via CloudFormation. We have clients doing this on the East and West coast, without spending outrageous amounts of money.&lt;/p&gt;
      
      &lt;p&gt;The bottom line is that one of the key benefits of using AWS is the geographic spread it enables, together with its monitoring and scaling and balancing capabilities. Look into using these - if you&amp;#8217;re not at least exploring these areas, you&amp;#8217;re doing the equivalent of buying an iPhone and only ever using it for text messages.&lt;/p&gt;
      
      &lt;p&gt;&lt;strong&gt;Think about your use of EBS&lt;/strong&gt;&lt;/p&gt;
      
      &lt;p&gt;It&amp;#8217;s not the first time there have been problems with EBS - only last month, Reddit was down for most of the day because of EBS-related issues. Here are a few things to consider when thinking about using EBS in your setup:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;EBS is not a SAN&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;EBS is network accessible, block storage. It&amp;#8217;s more like a NetApp than a fibre-based storage array. Treat it as such. Don&amp;#8217;t expect to be able to use EBS effectively if your network is saturated. Also be aware that EBS (and the whole of AWS) is built on commodity hardware, and as such is not going to behave in the same way as a NetApp. You&amp;#8217;re going to struggle to get the kind of performance you&amp;#8217;d get from a commercial SAN or NAS, with battery-backed cache, but EBS is considerably cheaper.&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;EBS is multi-tenant&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;Remember that you&amp;#8217;re sharing disk space and IO with other people. Design with this in mind. Deploy large volumes, even if you don&amp;#8217;t need the space, to minimise contention. Consider using lots of volumes and building up your own RAID 10 or RAID 6 from EBS volumes. Think of it as a way to get as many spindles as you can, spread across as many disk-providers as possible. Avoid wherever possible using a single EBS volume - as Reddit found to their cost last month, this is not the right way to use EBS.&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;Don&amp;#8217;t use EBS snapshots as a backup&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;EBS snapshots are a very handy feature, but they are not backups. Although they are available to different availability zones in a given region, you can&amp;#8217;t move them between regions. If you want backups of your EBS-backed volumes, by all means use a snapshot as part of your backup strategy, but then actually do a backup - either to S3 (we use duplicity) or to another machine in a different region (we back up to EBS-backed volumes in US-EAST). Don&amp;#8217;t be afraid of bandwidth charges - run the calculation on the AWS simple calculator - it&amp;#8217;s not as terrifying as you might have feared.&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;Consider not using EBS at all&lt;/li&gt;
      &lt;/ul&gt;
      
      &lt;p&gt;In many cases, EBS volumes are not needed. Instance storage scales to 1.7TB, and although ephemeral, doesn&amp;#8217;t seem to have the kinds of problems many have been experiencing with EBS. If this fits your architecture, give it some thought.&lt;/p&gt;
      
      &lt;p&gt;&lt;strong&gt;Consider building towards a vendor-neutral architecture&lt;/strong&gt;&lt;/p&gt;
      
      &lt;p&gt;We&amp;#8217;re big fans of AWS. But today raises questions about the wisdom of tying your infrastructure to one cloud provider. Heroku is an interesting example. Heroku&amp;#8217;s infrastructure piggy-backs on top of AWS, which meant that many applications were unavailable. Worse, access to the Heroku API was affected, and so users were stuck.&lt;/p&gt;
      
      &lt;p&gt;Architecting across multiple vendors is difficult, but not impossible. Cloud abstraction tools like Fog, and configuration management frameworks such as Chef make the task easier.&lt;/p&gt;
      
      &lt;p&gt;Patterns within the application architecture can also be used. If a decision has been made to make use of an AWS-specific tool or API, consider writing a lightweight wrapper around the AWS service, and try to build in and test an alternative provider&amp;#8217;s API, or your own implementation, or at least provide the capability of plugging one in. This prevents lock-in, and makes it much easier to deploy your systems to a different cloud should the requirement arise.&lt;/p&gt;
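      
      &lt;p&gt;As a rough illustration of the wrapper idea, the Python sketch below has the application code depend on a small storage interface rather than on any provider&amp;#8217;s API. Every name in it (BlobStore, InMemoryBlobStore, archive_report) is hypothetical, and the in-memory backend is only a stand-in: a backend for S3 or for another provider would implement the same two methods.&lt;/p&gt;
      
```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Hypothetical provider-neutral interface that the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store `data` under `key`."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the bytes stored under `key`."""


class InMemoryBlobStore(BlobStore):
    """Stand-in backend, useful for tests and for proving the interface is pluggable."""

    def __init__(self) -> None:
        self._objects = {}  # maps key (str) to stored bytes

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_report(store: BlobStore, name: str, body: bytes) -> str:
    """Application code: knows only the interface, not which provider sits behind it."""
    key = f"reports/{name}"
    store.put(key, body)
    return key


store = InMemoryBlobStore()
key = archive_report(store, "outage-review.txt", b"lessons learned")
print(key, store.get(key))
```
      
      &lt;p&gt;Swapping providers then means writing one new backend class and changing the line that constructs the store, while the application code stays untouched.&lt;/p&gt;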
      
      &lt;p&gt;This said, I happen to hold to the view that for a smaller investment, if a client is already committed to using AWS, they can probably make use of Amazon&amp;#8217;s five regions, and design their systems around the ability to move between regions in the very rare case where multiple availability zones are impacted.&lt;/p&gt;
      
      &lt;h2 id='have_a_dr_plan_and_practice_it'&gt;Have a DR plan, and practice it&lt;/h2&gt;
      
      &lt;p&gt;Part of planning for failure is to know what to do when disaster strikes. When you&amp;#8217;ve been paged at 3am and told that the whole site is down, and your hosting provider has no estimated time to recovery, the last thing you want to do is think. You should be on autopilot - everyone knows what to do, it&amp;#8217;s written down, it&amp;#8217;s been rehearsed, as much of it is automated as possible.&lt;/p&gt;
      
      &lt;p&gt;I encourage my engineers to write the plan down, somewhere accessible (and not only on the wiki that just went down). Have fire drills - pick a day, and run through the process of bringing up the DR systems, and recovering from backup. Follow the process - and improve it if you can.&lt;/p&gt;
      
      &lt;p&gt;Testing restores is the critical part of the process. Know how long it takes to restore your systems. If you have vast datasets that take hours to import, at least you know this in advance, and when and if you need to put the recovery plan into action, you can set expectations. Remember, though, your backups mean nothing if you haven&amp;#8217;t verified you can restore them. Make it a habit. When you need to do it for real, you&amp;#8217;ll be grateful you drilled yourself and your team.&lt;/p&gt;
      
      &lt;h2 id='infrastructure_as_code_is_hugely_relevant'&gt;Infrastructure as code is hugely relevant&lt;/h2&gt;
      
      &lt;p&gt;One of the great enablers of the infrastructure as code paradigm is the ability to rebuild the business from nothing more than a source code repository, some new compute resource (virtual or physical) and an application data backup. In the case of multi-region failover, you might find that your strategy is to keep a database running, but deploy a stack, provisioned with your configuration management tool, on demand. We&amp;#8217;ve tested this with CloudFormation and Chef and can bring up a simple site in five or ten minutes, and a multi-tier architecture with dozens of nodes within 30 minutes. The bottleneck is almost always the data restore - so work out ways to reduce the time taken to do this, and practice, practice, practice.&lt;/p&gt;
      
      &lt;p&gt;Many people reading this will be in a position where they already have an infrastructure in place that either isn&amp;#8217;t managed with a framework such as Chef, or is only partially built. If you take nothing else from today&amp;#8217;s issues, take an action to prioritise getting to the stage where you can rebuild your whole infrastructure from a git repo and a backup. The cloud is great for this - you can practice spinning your systems up in a different region, or a different zone, as many times as you like, until you&amp;#8217;re happy with it.&lt;/p&gt;
      
      &lt;h2 id='the_cloud_and_aws_is_still_great'&gt;The cloud (and AWS) is still great&lt;/h2&gt;
      
      &lt;p&gt;Sadly today has brought out the worst kinds of smugness and schadenfreude from people using other cloud providers, or traditional infrastructures. These people have very short memories. Joyent, Rackspace, Savvis, all these providers have had large and public outages. As we&amp;#8217;ve already said, outages are part of life - get used to it.&lt;/p&gt;
      
      &lt;p&gt;Some commentators have suggested that AWS has inherent weaknesses by offering platform services beyond the basic resource provision that a simpler provider such as Linode offers. Linode is a great provider, and we&amp;#8217;ve used them for years. However, I&amp;#8217;m not sure it&amp;#8217;s as simple as that. If you&amp;#8217;ve decided to deploy your application in the cloud, and you need flexible, scalable, persistent storage, or a highly available relational database, or an API-driven SMTP service, you have a choice. You can spend your time, and your developers&amp;#8217; time, building your own, and making it enterprise ready, or you can trust some of the best architects in the world to build one for you. Sometimes making your own is a better choice, but you don&amp;#8217;t get it for free. You&amp;#8217;ll be paying more for the extra machines to support it, and the staff to administer it. Personally, I&amp;#8217;m unconvinced that trying to build and manage these ancillary systems delivers value for the organisation.&lt;/p&gt;
      
      &lt;p&gt;Yes, today&amp;#8217;s outage is hugely visible. Yes, it&amp;#8217;s had a massive impact on some businesses. That doesn&amp;#8217;t make the cloud bad, or dangerous. Quora made a great point by serving a maintenance page with a cute YouTube video and the following error message: “We’d point fingers, but we wouldn’t be where we are today without EC2.”&lt;/p&gt;
      
      &lt;p&gt;Using the cloud as part of your IT strategy is about much more than reliability. Not that EC2&amp;#8217;s reliability is bad - EC2 offers a 99.95% SLA. That&amp;#8217;s equivalent to the best managed hosting providers. The US East region that suffered so much today had a 100% record between 2009 and 2010. It should, of course, be noted that, strictly speaking, today&amp;#8217;s issues were with EBS, which doesn&amp;#8217;t attract an SLA. Be wary of SLAs and figures - they can be misleading.&lt;/p&gt;
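      
      &lt;p&gt;To put a percentage like 99.95% in perspective, it helps to convert it into permitted downtime. The Python sketch below does only that arithmetic. The function name is invented here, and the measurement periods (a 365-day year and a 30-day month) are assumptions - check the period the SLA itself is measured over.&lt;/p&gt;
      
```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted over `period_days` at the given availability percentage."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - sla_percent) / 100


# Assumed measurement periods, for illustration only.
per_year = allowed_downtime_minutes(99.95, period_days=365)
per_month = allowed_downtime_minutes(99.95, period_days=30)

print(f"per year:  {per_year:.1f} minutes ({per_year / 60:.2f} hours)")
print(f"per month: {per_month:.1f} minutes")
```
      
      &lt;p&gt;At 99.95% that works out at a little under four and a half hours a year, or a little over twenty minutes in a 30-day month. A single bad day can therefore use up several years of allowance, which is one reason such figures can mislead.&lt;/p&gt;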
      
      &lt;p&gt;Making use of the cloud is about flexibility and control and scalability. It&amp;#8217;s about a different way of thinking about provisioning infrastructure that encourages better business agility, and caters for unpredictable business growth. Yes you might get better availability from traditional hardware in a managed hosting facility, but even then outages happen, and more often than not these outages can take many hours to recover from.&lt;/p&gt;
      
      &lt;p&gt;The cloud is about being able to spin up complete systems in minutes. The cloud is about being able to triple the size of your infrastructure in days, when your product turns out to be much more popular than you imagined. Similarly, it&amp;#8217;s about being able to shrink to something tiny, and still survive, if you misjudge the market. The cloud is about the ability to change how your infrastructure works, quickly, without worrying about sunk cost in switches or routers that you thought you might need. The cloud is about the ease with which we can provide a development environment that mirrors production, within 30 minutes, and then throw it away again. The cloud is about being able to add capacity for a big launch, and then take it away again with a mere API call. I could go on&amp;#8230;&lt;/p&gt;
      
      &lt;p&gt;One, albeit major, outage in one region of one cloud vendor doesn&amp;#8217;t mean the cloud was a big con, a waste of time, a marketing person&amp;#8217;s wet dream. The emperor isn&amp;#8217;t naked, and the nay-sayers are simply enjoying their day of &amp;#8216;I told you so&amp;#8217;. The cloud is here to stay, and brings with it huge benefits to the IT industry. However, it does require a different approach to building systems. The cloud is not dead - it&amp;#8217;s still great.&lt;/p&gt;
      
      &lt;h2 id='summary'&gt;Summary&lt;/h2&gt;
      
      &lt;p&gt;Today has been a tough day for businesses affected by the EC2 outage. We can take the following high level lessons away from today:&lt;/p&gt;
      
      &lt;ul&gt;
      &lt;li&gt;Expect, and design for downtime&lt;/li&gt;
      
      &lt;li&gt;Have a DR plan, and practice it until it&amp;#8217;s second nature&lt;/li&gt;
      
      &lt;li&gt;Make it your priority to build your infrastructure as code, and to be able to rebuild it from scratch, from nothing more than a source code repository and a backup&lt;/li&gt;
      
      &lt;li&gt;The cloud is still great&lt;/li&gt;
      &lt;/ul&gt;
    </content>
    <published>2011-04-21T20:03:44+00:00</published>
    <updated>2011-04-21T20:03:44+00:00</updated>
    <category term='devops'></category>
  </entry>
  <entry>
    <title>The Impact of Amazon's new CloudFormation service</title>
    <link href='http://www.agilesysadmin.net/cloudformation' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2011-02-26:/cloudformation</id>
    <content type='html'>
      &lt;p&gt;Let me put to rest the worst of the FUD. This was never a master plan by Amazon to wipe out Chef and Puppet in a hostile takeover of the configuration management territory. Opscode were part of the CloudFormation Beta, and deeper integration with Chef is very much part of the future roadmap. So don&amp;#8217;t worry - this is not an apocalyptic disaster - it&amp;#8217;s an overwhelmingly good and exciting development that promises to make the task of complex orchestration a little bit easier.&lt;/p&gt;
      
      &lt;p&gt;CloudFormation is a service that simplifies the process of firing up a complete AWS stack. Instead of making individual API calls to set up EC2 instances, elastic load balancers, scaling groups and other offerings, we simply make one call. This is great - because previously making these calls was a bit of a pain. Your options ranged from using the AWS console, which is pretty unpleasant, through using tools such as the Java-based EC2 command line tools, through to scripting a series of calls with a library such as Fog or Boto.&lt;/p&gt;
      
      &lt;p&gt;Does that sound a lot like Chef or Puppet to you? No. Sure, knife has EC2 management capabilities because it wraps Fog, but that&amp;#8217;s peripheral, and is really just recognition of the fact that Amazon hadn&amp;#8217;t produced a fully featured and consistent way to drive their API.&lt;/p&gt;
      
      &lt;p&gt;The main point of confusion here is that people are equating provisioning and configuration management. Provisioning is going to the shop and buying a server. Racking it and cabling it. Putting it in the right VLAN. Giving it a port and an IP address and sticking an operating system on it. Outside of the cloud this is a pretty major undertaking, but the cloud makes all this very easy. Configuration management is policy driven. It&amp;#8217;s deciding what software goes onto the machine, how it&amp;#8217;s configured, how it should behave in certain circumstances, and enforcing that. You need both - CloudFormation provides the former.&lt;/p&gt;
      
      &lt;p&gt;Let&amp;#8217;s be clear - I&amp;#8217;m not downplaying the significance or awesomeness of the service. What Amazon have done with CloudFormation is make it much much easier to do this at a stack level rather than for each individual component of an AWS infrastructure. Together with Elastic Beanstalk, Amazon are doing some important and innovative stuff in this space.&lt;/p&gt;
      
      &lt;p&gt;For me the area which is of most interest is the mechanism for creating these stacks. CloudFormation uses JSON templates to specify the infrastructure components and interdependencies. Amazon have provided some sample templates for provisioning popular opensource stacks such as Drupal, Wordpress and Redmine. I think this is what has caused all the excitement. However, it&amp;#8217;s important to remember that this is purely image-based - there&amp;#8217;s no ongoing management of the essential configuration of these machines.&lt;/p&gt;
      
      &lt;p&gt;What excites me about all this is that it&amp;#8217;s&amp;#8230; JSON. We like JSON - JSON is used throughout Chef, and CloudFormation opens up lots of possibilities for creative interplay. Far from competing with or replacing Chef, CloudFormation plays directly to its strengths. Chef metadata can be passed from a JSON template, including role information, validation key and Chef server URL. The end result is a fully configured and managed AWS infrastructure, from scratch, with one call.&lt;/p&gt;
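      
      &lt;p&gt;To give a feel for what such a template might look like, the Python sketch below builds a minimal one as a dictionary and prints it as JSON. It is an illustration under assumptions rather than a tested stack: the resource name, parameter names, AMI ID and instance size are placeholders, and the idea that a bootstrap script on the image reads the user data and hands it to chef-client is assumed, not something CloudFormation does for you. The validation key is deliberately left out of the sketch.&lt;/p&gt;
      
```python
import json

# Minimal CloudFormation template expressed as a Python dict.
# "Ref", "Fn::Join" and "Fn::Base64" are CloudFormation intrinsic functions.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative single node that receives Chef metadata via user data",
    "Parameters": {
        "ChefServerUrl": {"Type": "String"},
        "ChefRole": {"Type": "String", "Default": "webserver"},
    },
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-00000000",  # placeholder, not a real image
                "InstanceType": "m1.small",  # assumed size
                # Joined at stack creation into a small JSON document that an
                # (assumed) bootstrap script on the image passes to chef-client.
                "UserData": {
                    "Fn::Base64": {
                        "Fn::Join": ["", [
                            '{"chef_server_url": "', {"Ref": "ChefServerUrl"}, '", ',
                            '"run_list": ["role[', {"Ref": "ChefRole"}, ']"]}',
                        ]]
                    }
                },
            },
        }
    },
}

print(json.dumps(template, indent=2))
```
      
      &lt;p&gt;Because the template is plain JSON, the printed output could equally be kept in version control or in a data bag alongside the rest of the Chef configuration.&lt;/p&gt;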
      
      &lt;p&gt;The other exciting thing is that this JSON can just be stored in a databag. This suddenly makes it really rather easy to manage and control some of the more complicated and powerful AWS services, such as the queueing service or CloudWatch alarms, from the very heart of your configuration management tool.&lt;/p&gt;
      
      &lt;p&gt;So: is CloudFormation awesome? Yes. Exciting? Absolutely. Powerful? You bet! A replacement? A threat? Absolutely not - what we have here is the next generation in server automation and provisioning, in a form which slots in perfectly with next generation system integration and configuration management. Bring it on.&lt;/p&gt;
    </content>
    <published>2011-02-26T07:03:44+00:00</published>
    <updated>2011-02-26T07:03:44+00:00</updated>
    <category term='chef'></category>
  </entry>
  <entry>
    <title>Opscode Chef Fundamentals Training 2011</title>
    <link href='http://www.agilesysadmin.net/chef-fundamentals' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2011-01-06:/chef-fundamentals</id>
    <content type='html'>
      &lt;p&gt;&lt;img src='http://www.jedi.be/events/opscode-chef-fundamentals-2010//opscode_logo.png' alt='Opscode Chef' /&gt;&lt;/p&gt;
      
      &lt;p&gt;It&amp;#8217;s configuration management season in Europe! Prior to the eagerly anticipated &lt;a href='http://www.fosdem.org/2011/news/accepted-devrooms'&gt;Fosdem Config Management Dev Room&lt;/a&gt;, Opscode&amp;#8217;s technical evangelist Joshua Timberman will be in London on the 31st of January and the 1st and 2nd of February, to give his highly regarded Chef Fundamentals training course.&lt;/p&gt;
      
      &lt;p&gt;For those of you who want the full menu, there is the opportunity to follow up this course with &lt;a href='http://www.jedi.be/events/opscode-chef-advanced-2011/'&gt;advanced Chef training&lt;/a&gt;, hosted by Patrick Debois, in Gent on the 3rd and 4th.&lt;/p&gt;
      
      &lt;h3 id='the_course'&gt;The Course&lt;/h3&gt;
      
      &lt;p&gt;Chef Fundamentals is a 3-day comprehensive class covering the basic architecture of Chef and all of the underlying components. We will be covering installation basics of Chef Client and Chef Solo. Other topics will include: creating Chef repositories, creating cookbooks and advanced use of the command line utility, Knife. This class will include lecture, labs and some comprehensive case studies.&lt;/p&gt;
      
      &lt;h3 id='pricing'&gt;Pricing&lt;/h3&gt;
      
      &lt;p&gt;&lt;a href='http://atalanta-systems.com'&gt;Atalanta Systems&lt;/a&gt; is able to offer a significant reduction on the usual price of £500 per day: a special community rate of £500 + VAT, all inclusive, for the whole three-day course.&lt;/p&gt;
      
      &lt;p&gt;To register, contact &lt;a href='mailto:dee.strutt@atalanta-systems.com'&gt;Dee Strutt&lt;/a&gt; at Atalanta Systems. Places are limited, and last autumn&amp;#8217;s class filled up quickly, so book soon!&lt;/p&gt;
      
      &lt;h3 id='location'&gt;Location&lt;/h3&gt;
      
      &lt;p&gt;&lt;img src='http://cache.carlsonhotels.com/rad/images/hotels/GBHAMPSH/HampshireExterior_450.jpg' alt='The Radisson Edwardian' /&gt;&lt;/p&gt;
      
      &lt;p&gt;Training will take place at the prestigious &lt;a href='http://www.radissonedwardian.com/london-hotel-gb-wc2h-7lh/gbhampsh'&gt;Radisson Edwardian Hampshire Hotel&lt;/a&gt; in Leicester Square, right in the heart of London&amp;#8217;s theatre land.&lt;/p&gt;
      
      &lt;h3 id='trainer'&gt;Trainer&lt;/h3&gt;
      
      &lt;p&gt;&lt;img src='https://secure.gravatar.com/avatar/55851a288d9c7f9c03f4877e618f65d2?s=140&amp;amp;d=https%3A%2F%2Fgithub.com%2Fimages%2Fgravatars%2Fgravatar-140.png' alt='Joshua Timberman' /&gt;&lt;/p&gt;
      
      &lt;p&gt;Joshua Timberman is a technologist, focused on automation and continual improvement of software processes. As such, he has become an Agile practitioner. With over 10 years&amp;#8217; experience in Linux and Unix system administration, Joshua has worked for companies ranging from 5-person startups to the largest IT company in the world. His background includes deploying highly available enterprise application environments and providing internal infrastructure services and team-based training. Joshua currently works for Opscode, where he is an infrastructure cooking expert with Chef. He speaks at local user group meetings and has a passion for teaching people how to make the most out of automation.&lt;/p&gt;
    </content>
    <published>2011-01-06T13:03:44+00:00</published>
    <updated>2011-01-06T13:03:44+00:00</updated>
    <category term='chef'></category>
  </entry>
  <entry>
    <title>Puppet and policy - violator or enforcer?</title>
    <link href='http://www.agilesysadmin.net/violator-or-enforcer' rel='alternate' type='text/html' />
    <id>tag:www.agilesysadmin.net,2010-05-27:/violator-or-enforcer</id>
    <content type='html'>
      &lt;p&gt;A common challenge for an organisation running Puppet is how to balance the desire for a fully automated and standardised environment with the risk that automated Puppet runs may introduce bugs or revert hot fixes. This concern was apparent at &lt;a href='http://puppetcamp.org/europe-2010-ghent/'&gt;Puppetcamp&lt;/a&gt; this morning, when Rafael Brito from the New York Stock Exchange gave an informative presentation about his experience of using Puppet to build machines for their live platform. What particularly struck me was that although his team puts a lot of effort into creating a standard environment, the current culture is that operations teams on the ground can and should make live changes to boxes, and that these changes may never make it back into Puppet.&lt;/p&gt;
      
      &lt;p&gt;I asked Rafael how frequently Puppet runs on the live machines, to ensure the state of each machine is kept the same, and according to standards. He told me &amp;#8216;once a quarter&amp;#8217;. I think it&amp;#8217;s fair to say that in a context such as this, Puppet is not really being used as a config management tool - it&amp;#8217;s being used as part of the build process to produce a standard image, which is then being managed in the traditional way.&lt;/p&gt;
      
      &lt;p&gt;I fully understand the motivation behind this approach. This is a very high profile application, and there&amp;#8217;s a worry that mistakes in the Puppet manifest could accidentally be rolled out to the live site and cause a massive problem. Their situation is also complicated by having a large, multi-tiered operations team, across several countries, many of whom don&amp;#8217;t know how to use Puppet. The approach they have settled on is to allow engineers to make changes to the live site, but to be aware that these machines will effectively be refreshed every quarter, and so there&amp;#8217;s a risk that these changes may be lost. This places the burden of maintaining the standard on the team writing and maintaining the Puppet manifests, who must ensure that changes made by the operations team are folded in.&lt;/p&gt;
      
      &lt;p&gt;The trouble with this approach is that it means that the de facto standard is always the current state of the machines, as modified by the operations team. If the Puppet run undoes some fixes applied by the operations team, Puppet is placed in the position of standards violator - that&amp;#8217;s not a great place to be.&lt;/p&gt;
      
      &lt;p&gt;One issue we often come across with clients who have started to use Puppet occurs when a change rolled out by Puppet breaks the system. In this situation Puppet advocates are in a weak negotiating position - we can argue that the changes should have been made in Puppet, but when the site is down, and money is being lost, somehow that argument doesn&amp;#8217;t win much support. The fact is that when a mistake is made, Puppet gets blamed - it broke the site. Sadly this can even result in pressure to stop using this unstable, unreliable tool.&lt;/p&gt;
      
      &lt;p&gt;I&amp;#8217;d like to turn this on its head. We all agree that we need a standard or set of standards to which the live site must adhere. Let&amp;#8217;s make Puppet the &lt;em&gt;enforcer&lt;/em&gt; of this standard, and never the violator. This standard can be designed, tested, approved and signed off. This is the standard - we don&amp;#8217;t diverge from it. Now we can set up a mechanism for testing the site against the standard, so we know if the standard has ever been broken.&lt;/p&gt;
      
      &lt;p&gt;A great way to do this is simply to run Puppet in noop mode, so it doesn&amp;#8217;t make the changes, but simply reports what changes it would make if it were to run in live mode. If our standard is being adhered to, Puppet should usually report that it wouldn&amp;#8217;t make a change. If Puppet reports that it would make a change, this should only ever be because that change has been approved by, for example, a change advisory board. This mechanism, therefore, will alert us as to whether the machine is out of sync with the standard, what changed, and how Puppet intends to revert the system to the agreed standard. Running this process with reasonable frequency will give us a pretty granular report of when changes were made, and could even be tied into system logs to identify the most likely source of the change. The output of the process could be parsed and monitored, alerts raised to senior stakeholders, and email reports sent out, detailing the change that has occurred.&lt;/p&gt;
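      
      &lt;p&gt;As a rough illustration of the parsing step, here is a minimal Python sketch. It assumes the noop run has already been captured as plain text, and that each pending change appears as a notice line naming a resource path and ending in (noop). That line shape varies between Puppet versions, so treat the pattern as an assumption to check against your own logs. The function names pending_changes and drift_report, and the sample hostname, are hypothetical.&lt;/p&gt;

```python
import re

# Assumed shape of a noop log line (check against your Puppet version), e.g.
#   notice: /Stage[main]/Ntp/File[/etc/ntp.conf]/content: current_value x, should be y (noop)
NOOP_LINE = re.compile(r"^\s*notice:\s+(/.*?):\s+(.*?)\s+\(noop\)\s*$", re.IGNORECASE)


def pending_changes(noop_output):
    """Return (resource, description) pairs for each change Puppet says it would make."""
    changes = []
    for line in noop_output.splitlines():
        match = NOOP_LINE.match(line)
        if match:
            changes.append((match.group(1), match.group(2)))
    return changes


def drift_report(hostname, noop_output):
    """Build a plain-text alert body, or return None when no pending changes are found."""
    changes = pending_changes(noop_output)
    if not changes:
        return None
    lines = ["%s is out of sync with the standard (%d pending changes):" % (hostname, len(changes))]
    for resource, description in changes:
        lines.append("  %s: %s" % (resource, description))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical captured output from a single noop run.
    sample = (
        "notice: /Stage[main]/Ntp/File[/etc/ntp.conf]/content: "
        "current_value {md5}aaa, should be {md5}bbb (noop)\n"
        "notice: Finished catalog run in 0.52 seconds\n"
    )
    print(drift_report("web01.example.com", sample))
```

      &lt;p&gt;The report text could then be handed to whatever monitoring or mail system is already in place. Returning None for a clean run keeps the common case silent.&lt;/p&gt;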
      
      &lt;p&gt;This way we get to play the role of enforcer - we can say: Hey look - this change has happened - we can change it back again, and we should, but we need to find who made the change, why they made it, make it in Puppet, then back it out and apply it properly. We then need to identify and educate the policy breakers, and find out what happened.&lt;/p&gt;
      
      &lt;p&gt;This approach, I think, walks the line between the kind of careful conservatism that a production site needs, and the desire to make use of the power of Puppet to guarantee a consistent environment.&lt;/p&gt;
      
      &lt;p&gt;Of course this approach will also catch the other risk - the risk that someone has committed a change to Puppet which may get rolled out to live machines when not wanted. Again, there needs to be a policy to protect against this. Puppet changes in a live environment of this nature should not be made unless tested. This means that Puppet changes should be made in a testing branch, and confirmed against a test environment, and only merged into the production repository when the testing has been completed to everyone&amp;#8217;s satisfaction, and, in some environments, only rolled out following the appropriate change control mechanism. An hourly noop run, monitored, would immediately alert if someone had managed to get a change into the live Puppet manifest without following the correct procedure.&lt;/p&gt;
      
      &lt;p&gt;Of course not running the Puppet daemon automatically brings with it a different set of management challenges - such as ensuring all machines are up to date, and minimising the time taken to bring the machines into sync. My answer to this is to orchestrate your Puppet clients from a central location, rather than to run them in daemon mode. I&amp;#8217;ll cover this in a future article.&lt;/p&gt;
    </content>
    <published>2010-05-27T23:03:44+00:00</published>
    <updated>2010-05-27T23:03:44+00:00</updated>
    <category term='puppet'></category>
  </entry>
</feed>

