Technical Leadership Summary #5

Another recap of a week’s worth of links, news, and discussion around technical leadership and technology; as usual, follow me on LinkedIn if you want to receive a notification when I share a new link.

What I have been thinking this week

Despite our best efforts at minimizing bugs, releasing frequent updates, producing good logs, and setting up monitoring and alerting for every component of our infrastructure, there are times when it’s more practical to rebuild a platform or subsystem from scratch than to try to fix the problem right away: rebuild the platform, switch the DNS, and you’re back online. This has the added benefit that the faulty component is still alive, giving you more time to find out why the problem happened in the first place.

If you accept this point of view (and you should), Continuous Delivery isn’t just a way to release features to production quickly: together with Environment Automation, it’s a fundamental piece of insurance for your business. Being able to recreate an environment and deploy all the applications you need in a matter of minutes gives you the ultimate peace of mind: if things go horribly wrong (and statistically, sooner or later they will, if your business is successful), you’ll get away with a few hours of downtime in the absolute worst case.

So, does that mean a few scripts to automate environment creation and push-button releases are enough to keep us safe? It turns out it’s not that simple in practice.

Environments are built over time: we start with a simple template and add more services and components as we grow. The same goes for deployments; we rarely build and deploy a component “in a vacuum”, more often it depends on existing resources or services. The end result is that the automation, while it might serve us well in day-to-day operations, falls short in those critical moments when we need to rebuild everything from scratch, and to do it in minutes. The scripts won’t work as expected, the dependencies between them (which one to run first, and in which order) may not be clear, and the whole procedure to switch traffic between the old and the new sub-system may be obscure.
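One way to attack the “which one runs first” problem is to declare the dependencies between deployment jobs explicitly and derive the run order from them, instead of keeping it in someone’s head. Here is a minimal sketch in Python; the job names are made up for illustration, and your own jobs and tooling will obviously differ:

```python
from graphlib import TopologicalSorter

# Hypothetical deployment jobs; each job lists the jobs
# that must have run before it can start.
jobs = {
    "network": [],
    "shared-config": ["network"],
    "database": ["network"],
    "auth-service": ["shared-config", "database"],
    "api-gateway": ["auth-service"],
    "web-frontend": ["api-gateway"],
}

def deployment_order(graph):
    """Return a run order that respects every declared dependency."""
    return list(TopologicalSorter(graph).static_order())

if __name__ == "__main__":
    # Prints e.g.: network -> shared-config -> database -> auth-service -> ...
    print(" -> ".join(deployment_order(jobs)))
```

Keeping the dependency graph in one place means the run-order question has an answer you can test, rather than a person you have to ask.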

As usual, I approached this issue by trying to identify the principles to follow to make sure we are prepared for these circumstances. Here’s my list so far:

  • first and foremost, make sure the full recreation of every sub-system is performed regularly; once a month should not be unachievable, and you should go all the way: switch off the old copy and replace it with the new one
  • absolutely no tolerance for any manual step in environment or application deployments
  • prefer immutable infrastructure (when possible) over configuring machines at deployment time; while both can be automated, deploying immutable images is considerably faster than running Ansible/Chef/Puppet at deployment time to set up machines
  • remove as many dependencies between services as possible (or at least make them explicit); identify the common, cross-service configurations the environment needs and create a separate job to deploy them (to be run before deploying the applications)
  • besides the common configurations mentioned above, rigorously attribute configurations to components/services; every component must have a well-defined set of configurations, and configurations and software must be deployed at the same time
  • make application/configuration deployments environment-agnostic; ideally, you’d have a parameter in your deployment jobs specifying the environment, and that should be the only change you need (see the sketch after this list)
  • ruthless versioning: every automated operation on the platform must be performed starting from a precise, specified version of code; that includes environment automation scripts, deployment automation scripts, configurations, software components, virtual machine images, …
  • track every single version (see the item above) of the code used to build the platform and running on it in some tool; the tool must allow you to see what the situation was at any moment in the past
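To make the last few points concrete, here is a minimal sketch of what an environment-agnostic, manifest-driven deployment could look like. Everything in it is hypothetical (the deploy-service command, the component names, the versions); the point is that every component is pinned to an exact version and the target environment is the only parameter:

```python
import json
import subprocess
import sys

# Hypothetical, pinned version manifest: everything needed to rebuild
# the platform is referenced by an exact version; nothing is "latest".
MANIFEST = {
    "base_image": "platform-base-2024.03.1",
    "env_scripts": "env-automation@4.2.0",
    "shared_config": "shared-config@1.9.3",
    "services": {
        "auth-service": "2.14.0",
        "api-gateway": "5.1.2",
        "web-frontend": "3.0.7",
    },
}

def deploy(environment: str) -> None:
    """Deploy every pinned component into the given environment.

    The environment name is the only input that changes between
    staging, production, or a rebuilt-from-scratch copy.
    """
    for service, version in MANIFEST["services"].items():
        # 'deploy-service' is a placeholder for whatever deployment
        # mechanism you already use (CI job, CLI, API call, ...).
        subprocess.run(
            ["deploy-service", service,
             "--version", version,
             "--environment", environment],
            check=True,
        )

if __name__ == "__main__":
    deploy(sys.argv[1])  # e.g. python deploy.py production
    # Print the manifest so the exact versions deployed end up in the job log.
    print(json.dumps(MANIFEST, indent=2))
```

In practice the manifest would live in version control and be updated by the release pipeline, which is what makes the “ruthless versioning” and “track every version” points above enforceable rather than aspirational.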

What do you think? What should be added to the list?

Topics I talked about this week

I started the week with another Adrian Colyer research paper review, Don’t trust the locals, on the security problems introduced by new browser features like local storage. The point of the article isn’t just that local storage is insecure (some even suggest not using it at all), but that a web application should treat all the data stored there as untrusted and validate it before use.

Magnitude of Exploration was one of my favorite articles this week; it’s about the freedom to experiment with new tools and techniques versus the need for standardization. I’ve worked in places that were all about adopting the latest tools (fun, but it was very hard to build reliable software), and in places that were all about standardization (boring, and people were just looking to leave). It’s all about finding the right balance, I suppose.

Another article to read and ponder is Fault Injection in Production. We read a lot about Chaos Engineering nowadays, but fault injection is less about tools and more about principles: knowing what the critical points of your infrastructure are, knowing how to fix problems in them (or rebuild them), and exercising regularly with Game Days (there are plenty of suggestions on how to run them).
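To give a flavour of what “principles over tools” can mean, fault injection can start as small as the illustrative Python sketch below (all names are made up), which makes a call to a dependency fail at a chosen rate so you can rehearse how the rest of the system reacts during a Game Day:

```python
import functools
import random

def inject_fault(failure_rate: float, exception: type = ConnectionError):
    """Decorator that makes the wrapped call fail at the given rate.

    Enable it deliberately (e.g. during a Game Day), never by accident.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exception(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault(failure_rate=0.1)  # fail roughly 1 call in 10
def fetch_user_profile(user_id: str) -> dict:
    # Placeholder for a real call to a downstream service.
    return {"id": user_id, "name": "example"}
```

The value isn’t in the decorator itself; it’s in forcing the team to decide, in advance, what should happen when that call fails, and then checking whether it actually does.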

Explicit Data Contracts Using Protocol Buffers shows how you can use Protocol Buffers for Contract Driven Development. This is one of the topics I’m most interested in these days, and it’s unfortunate that we don’t have a solution/tool that is applicable to a wide array of use cases and protocols.
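As a tiny illustration of the idea (not taken from the article; the message and field names are invented), the contract lives in a .proto file shared by producer and consumer, and both sides only ever talk to each other through the classes protoc generates from it:

```python
# Hypothetical contract, kept in a shared repository and compiled with protoc:
#
#   syntax = "proto3";
#   message Order {
#     string order_id     = 1;
#     int64  amount_cents = 2;
#   }
#
# protoc generates order_pb2.py; producer and consumer both depend on it.
from order_pb2 import Order

def publish_order() -> bytes:
    # The producer can only set fields that exist in the contract.
    order = Order(order_id="o-123", amount_cents=4990)
    return order.SerializeToString()

def consume_order(payload: bytes) -> None:
    # The consumer parses against the same contract; malformed payloads
    # fail loudly instead of silently drifting.
    order = Order()
    order.ParseFromString(payload)
    print(order.order_id, order.amount_cents)

if __name__ == "__main__":
    consume_order(publish_order())
```

Changing the contract then becomes an explicit, versioned event rather than something a consumer discovers at runtime.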

I need to go back to Introduction to Infrastructure Patterns and explore all the links there. It’s the companion website of the highly successful book Infrastructure as Code.

A couple of good articles on AWS Security. Part 1 is particularly interesting for its focus on automation and policies as code, while Part 2 tries to establish an incident response process.

Four links to finish: