Back in 2010, Netflix announced Chaos Monkey, a tool that helped the company scale successfully on AWS:
Chaos Monkey randomly terminates virtual machine instances and containers that run inside your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.
The concept is simple enough: Netflix does not trust its infrastructure and services to always be there. Its engineers design for resilience and expect failure, but rather than assume the design holds, they set out to prove that it works.
That’s great for Netflix, but what about the rest of us who aspire to engineer systems as resilient as theirs?
Sure, we could run Chaos Monkey, but a passing run only tells us that our production infrastructure survived that one test, in the state it was in at that moment. What if we can’t yet deploy our software changes consistently and automatically? If the production environment is not consistent from one release to the next, the result tells us little about the release that follows.
Take a step back from system design, infrastructure, or even test data design, and look at release process design.
Make sure that all of your build and release steps are documented. Then script them, so that a release runs with little or no manual interaction. This is what makes your release repeatable.
Automation is the friend of consistency. Without repeatable release processes all the way through your deployment chain, you won’t get consistent results from your releases.
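As a rough illustration of what "documented and scripted" can look like, here is a minimal Python sketch of a release runner. Everything in it is an assumption made for the example: the `Step` type, the `run_release` function, and the three placeholder steps are hypothetical names, and the commands only invoke the Python interpreter so that the sketch runs anywhere. The point is that the ordered list of steps is both the documentation and the automation, and that the run stops at the first failure rather than deploying a broken build.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One release step (hypothetical type for this sketch)."""
    name: str           # human-readable label; doubles as documentation
    command: List[str]  # argv list, run without a shell


def run_release(steps):
    """Run each step in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for step in steps:
        print(f"==> {step.name}")
        result = subprocess.run(step.command)
        if result.returncode != 0:
            print(f"FAILED: {step.name} (exit code {result.returncode})")
            break
        completed.append(step.name)
    return completed


# Placeholder commands: each one just starts the Python interpreter.
# Replace them with your real build, test, and deploy commands.
PY = sys.executable
RELEASE_STEPS = [
    Step("build", [PY, "-c", "print('building')"]),
    Step("test", [PY, "-c", "print('testing')"]),
    Step("deploy", [PY, "-c", "print('deploying')"]),
]
```

Calling `run_release(RELEASE_STEPS)` from a CI job or a command-line entry point runs the same steps in the same order every time. If the returned list is shorter than `RELEASE_STEPS`, a step failed and everything after it was skipped.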
Design a release system which can support your software and your environments not as an afterthought, but as something that is essential to the health and success of your product.