I’m a Linux/Unix subject matter expert who has spent the last 12 years professionally designing, implementing, and maintaining infrastructure and code deployments for some of the world’s leading businesses. I have a deep understanding of what it takes to ensure reliability for enterprise applications at global scale, from the physical datacenter to on-prem environments and all leading cloud providers.
Along the way I’ve learned that, to eliminate toil and reduce human-induced error, it’s best to adopt software development practices and treat the system as a software component. This allows the operations team to reduce day-to-day tasks to a self-documenting solution with reproducible code that solves a problem once and implements the solution through an automated process.
Reliability through visibility and chaos experiments
Visibility is the foundational element of reliability, and every other piece is built out from it. This is why observability is critical to establishing steady state and to ensuring that you have measurable indicators tracking code changes and system performance at the global level.
To ensure a system can withstand turbulence, and to establish its steady state, you must have visibility into the components of the system, whether those components are a simple pod running in a Kubernetes cluster or cloud instances scattered across the globe.
Chaos engineering and embracing risk
Chaos engineering is a software testing method with a focus on discovering problems and finding emergent properties before the user experiences issues.
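As a rough illustration, one controlled experiment can be sketched in a few lines of Python. Everything here is hypothetical and invented for the example: the `run_experiment` function, the `ExperimentResult` type, and the toy `system` dictionary stand in for real tooling and a real service. The sketch assumes the usual experiment shape: confirm steady state, inject a fault, observe, and always roll back.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool  # was the system healthy before the fault?
    steady_during: bool  # did it stay healthy while the fault was active?
    rolled_back: bool    # was the fault removed afterwards?


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one controlled experiment (hypothetical helper, not a real library)."""
    # Never inject a fault into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(False, False, False)

    inject_fault()
    try:
        steady_during = steady_state()
    finally:
        # Roll back even if the steady-state probe itself raises.
        rollback()
    return ExperimentResult(True, steady_during, True)


# Toy stand-in for a real system: three replicas, two needed to serve traffic.
system = {"replicas": 3}

result = run_experiment(
    steady_state=lambda: system["replicas"] >= 2,
    inject_fault=lambda: system.update(replicas=system["replicas"] - 1),
    rollback=lambda: system.update(replicas=3),
)
print(result)
```

In this toy run the system loses one replica and still meets its steady-state condition, which is the kind of evidence an experiment is meant to produce before users ever see a failure.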
As an operations team, you need to learn how to embrace risk and how to manage it.
Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the number of features a team can afford to offer.
Through this methodology you can measure risk, define a steady state, and define service level indicators that reinforce realistic service level objectives. The data gathered from these controlled experiments allows teams to measure risk and cost.
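To make the relationship between an indicator and an objective concrete, here is a minimal sketch that assumes an availability-style indicator (good requests divided by total requests) and a 99.9% objective. The function names and the request counts are made up for illustration and do not come from any real service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Service level indicator: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, slo: float) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is breached."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic, so none of the budget has been spent
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


# Illustrative numbers only: one million requests, 500 of them failed.
total, good, slo = 1_000_000, 999_500, 0.999

sli = availability_sli(good, total)
budget = error_budget_remaining(good, total, slo)
print(f"SLI: {sli:.4%}, error budget remaining: {budget:.1%}")
```

With these numbers the objective allows about 1,000 failed requests and 500 have occurred, so roughly half of the budget is left. That remaining budget is one way to express how much risk a team can still afford to take on releases and experiments.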
Designed with destruction in mind
Everything that’s architected, built, and deployed should be built on a philosophy that ensures it can withstand turbulence and disaster.
Over the past years, this proven model has been successfully deployed in global enterprise environments. It ensures the architectural design rests on a solid foundation in which all components are captured in code and are reproducible.
As products and systems grow, so do teams and the demands on the individual engineer. At scale, automation is the key to success. We’ve all heard the phrase, “You are going to automate yourself out of a job.” This has proven to be untrue; automation has quite the opposite effect.
The more automation exists, the more engaged and capable a single engineer becomes, while the overhead and negative impact of manual processes go down.
If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.
Carla Geisser, Google SRE