Spin the wheel
The AWS operational review is a weekly meeting open to the entire company. Every meeting, a “wheel of fortune” is spun to select a random AWS service from hundreds for live review. The team under review has to answer pointed questions from experienced operational leaders about their dashboards and metrics. The meeting is attended by hundreds of employees, dozens of directors and several VPs.
This incentivizes every team to have a baseline level of operational competence. Even if the probability of an individual team getting selected is low (at AWS, less than 1%), as a manager or tech lead on the team, you really don’t want to appear clueless in front of half the company the day your luck runs out.
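To see why a low per-meeting probability still carries weight, here is a rough back-of-the-envelope sketch. It assumes spins are independent and uniform and that there are 52 meetings a year. Both are simplifying assumptions, the probabilities passed in are illustrative values consistent with "less than 1%", and the function name is hypothetical.

```python
def chance_of_review(p_per_meeting: float, meetings: int) -> float:
    """Probability of being picked at least once across independent spins.

    Assumes every spin is independent and the per-meeting probability is
    constant, which is a simplification of how a real wheel might work.
    """
    if not 0.0 <= p_per_meeting <= 1.0:
        raise ValueError("p_per_meeting must be between 0 and 1")
    if meetings < 0:
        raise ValueError("meetings must be non-negative")
    # Complement rule: 1 - P(never picked in any meeting).
    return 1.0 - (1.0 - p_per_meeting) ** meetings


if __name__ == "__main__":
    # Illustrative per-meeting probabilities, not figures from AWS.
    for p in (0.002, 0.005, 0.01):
        yearly = chance_of_review(p, meetings=52)
        print(f"per-meeting {p:.1%} -> at least once in a year: {yearly:.1%}")
```

Under these assumptions, even a per-meeting chance close to 1% adds up to a substantial chance of being reviewed at some point during the year.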
It is important that you regularly review your reliability metrics. Leaders who take an active interest in operational health set that tone for the entire organization. Spin the wheel is just one tool to accomplish this.
But what do you do in these operational reviews? This brings us to the next point.
Define measurable reliability goals
You would like to have ‘high uptime’ or ‘five nines’, but what does that really mean for your customers? Live interactions (chat) tolerate far less latency than asynchronous workloads (training a machine learning model, uploading a video). Your goals should reflect what your customers care about.
When you review a team’s metrics, ask them to describe measurable reliability goals. Make sure you understand — and they understand — why those goals were chosen. Then, have them use dashboards to prove that those goals are being met. Having measurable goals will help you prioritize reliability work in a data-driven manner.
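One way to make a goal like ‘five nines’ concrete is to translate it into the downtime it permits over a review window. The sketch below assumes the goal is expressed as an availability percentage over a fixed period. That is only one possible kind of goal (latency targets would need a different calculation), and the function names are hypothetical.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Downtime permitted by an availability goal over the given period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the period is the downtime budget.
    return period * (1.0 - availability_pct / 100.0)


def goal_met(availability_pct: float, period: timedelta,
             observed_downtime: timedelta) -> bool:
    """True if the observed downtime stays within the budget."""
    return observed_downtime <= allowed_downtime(availability_pct, period)


if __name__ == "__main__":
    year = timedelta(days=365)
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime(target, year)
        print(f"{target}% over a year allows about "
              f"{budget.total_seconds() / 60:.1f} minutes of downtime")
```

A team could compare a budget like this against the downtime shown on its dashboards during the review, which turns "are we reliable enough?" into a yes-or-no question.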
It is a good idea to focus on the detection of issues. If you see an anomaly in their dashboards, ask them to explain the issue, but also ask them whether their on-call was notified of the issue. Ideally, you should realize something is wrong before your customers do.
Embrace chaos
One of the most revolutionary mindset shifts in cloud resiliency is the concept of injecting failure into production. Netflix formalized this concept as “chaos engineering,” and the idea is as cool as the name suggests.
Netflix wanted to incentivize its engineers to build fault-tolerant systems without resorting to micromanagement. They reasoned that if systemic failure is made the norm rather than the exception, engineers have no choice but to build fault-tolerant systems. It took time to get there, but at Netflix, anything from an individual server to an entire availability zone is routinely knocked out in production. Every service is expected to automatically absorb such failures with no impact to service availability.
This strategy is expensive and complex. But if you’re shipping a product where a high uptime is an absolute necessity, then failure injection in production is a very effective way to get something resembling a ‘correctness proof’. If your product needs this, introduce it as early as possible. It will never be easier or cheaper than it is today.
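As a toy illustration of the idea, the sketch below wraps a dependency call so that it fails at a configurable rate, and shows a caller that absorbs the failure by falling back. It is not Netflix's tooling: every name here is hypothetical, and real failure injection operates on servers and availability zones rather than on a single function call.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFailure(RuntimeError):
    """Raised on purpose by the fault injector, never by real code paths."""


def with_fault_injection(call: Callable[[], T], failure_rate: float,
                         rng: random.Random) -> Callable[[], T]:
    """Wrap `call` so it fails with probability `failure_rate`."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapped() -> T:
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never fails.
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency outage")
        return call()

    return wrapped


def fetch_with_fallback(primary: Callable[[], T],
                        fallback: Callable[[], T]) -> T:
    """Absorb a failure of the primary path by using the fallback."""
    try:
        return primary()
    except Exception:
        return fallback()


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the run is repeatable
    flaky = with_fault_injection(lambda: "live data", 0.3, rng)
    results = [fetch_with_fallback(flaky, lambda: "cached data")
               for _ in range(10)]
    print(results)
```

The point of the exercise is that callers written against the wrapped dependency have to handle failure from day one, because failure is part of normal operation.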
If chaos engineering seems like overkill, you should at least require your teams to do ‘game days’ (simulated outage practice runs) once or twice a year, or leading up to any major feature launch. A game day has three designated roles: the first simulates the outage, the second fixes it without knowing beforehand what was broken, and the third observes and takes detailed notes. Afterward, the whole team should get together and do a post-mortem on the simulated incident (see below). The game day will reveal gaps not only in how your systems handle outages, but also in how your engineers handle them.
Have a rigorous post-mortem process
A company’s post-mortem process reveals a great deal about its culture. Each of the top tech companies requires teams to write post-mortems for significant outages. The report should describe the incident, explore its root causes and identify preventative actions. The post-mortem should be rigorous and held to a high standard, but the process should never single out individuals to blame. Post-mortem writing is a corrective exercise, not a punitive one. If an engineer made a mistake, there are underlying issues that allowed that mistake to happen. Perhaps you need better testing, or better guardrails around your critical systems. Drill down to those systemic gaps and fix them.
Designing a robust post-mortem process could be the subject of its own article, but it’s safe to say that having one will go a long way toward preventing the next outage.
Reward reliability work
If engineers have a perception that only new features lead to raises and promotions, reliability work will take a back seat. Most engineers should be contributing to operational excellence, regardless of seniority. Reward reliability improvements in your performance reviews. Hold your senior-most engineers accountable for the stability of the systems they oversee.
While this recommendation may seem obvious, it is surprisingly easy to miss.
Conclusion
In this article, we explored some fundamental tools that embed reliability into your company culture. Startups and early-stage companies usually do not make reliability a priority. This is understandable — your fledgling company must be obsessively focused on proving product-market fit to ensure survival. However, once you have a returning customer base, the future of your company depends on retaining trust. Humans earn trust by being reliable. The same is true of internet services.
Aditya Visweswaran is a senior software engineer on Google Cloud’s security platform team.
Credit: venturebeat.com