SRE Wisdom From the Trenches
Key takeaways from Ryan Kitchens' talk "How Did Things Go Right? Learning More from Incidents", based on his experience at Netflix:
Failure is ever-present in modern software systems.
Success isn’t necessarily the absence of failure, and having 99.999% uptime is practically meaningless if the users are unable to use the system as they intend.
Safety, great performance, and sources of resilience do not come from the absence of failure but rather the presence of adaptive capacity.
Moving from asking "Why did things go wrong?" to asking "How did things go right?" is a challenging but valuable exercise. Find out:
- What's going on when it seems like nothing is happening?
- When failure does occur, what's going to keep it from being worse?
- How do teams adapt successfully when preventative techniques fail?
- How should we prioritize the effort to develop systems that help us safely manage the consequences of failure?
Recovery is better than prevention.
An incident occurs when there is a “perfect storm” of events - there is no root cause.
Incidents are not made up of causes. We do not “find” causes; instead, we construct them, and we develop our understanding by creating a narrative. Learning from the last incident will not allow you to predict the next one, because complex systems are not deterministic.
Within a complex distributed system, such as the one at Netflix, failure is the normal state, and the Netflix SRE team has evolved its practices and processes so that although failure is important, it is “no longer interesting”.
The most important thing that can be learnt is how to build capacity into the system in order to encounter failure successfully; the ability to recover effectively is much more valuable than preventing an incident.
There are often difficulties in handling an incident due to “islands of knowledge”, and these should be identified in follow-ups and documented within artifacts.
Creating a timeline for an incident can also help.
Takeaways from Denise Yu's "SRE for Cats":
Site Reliability Engineering is a set of practices to help teams scale distributed software systems and keep them online. SRE is becoming more interdisciplinary.
Cats can teach us a lot about designing for failure in distributed systems...just like trying to achieve 100% uptime, it is futile to expect cats to follow rules. Instead...
1. Set realistic, sustainable Service Level Objectives (SLOs) - Effective SLOs should enable teams to learn and experiment, not constrain them with unrealistic goals
2. Eliminate toilsome work so we can solve higher order problems
3. The bigger the change, the bigger the risk - Prefer smaller, more frequent changes
4. Shit’s just gonna fail. When it does, we should optimize for learning above all else.
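One way to make the first point concrete is an error budget: the amount of failure an SLO tolerates over a window, which a team can spend on changes and experiments. The sketch below is only an illustration and is not taken from either talk. The function name `error_budget`, its parameters, and the request-based way of counting failures are all assumptions made for this example.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for a request-based SLO (hypothetical helper).

    slo_target is the fraction of requests that should succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining_failures = allowed_failures - failed_requests

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining_failures,
        "budget_used": failed_requests / allowed_failures,
        # Assumption: any remaining budget leaves room for riskier changes.
        "can_experiment": remaining_failures > 0,
    }


if __name__ == "__main__":
    # Illustrative numbers only: a 99.9% target over one million requests.
    print(error_budget(0.999, 1_000_000, 400))
```

With a 99.9% target over one million requests, about 1,000 failures are tolerated, so 400 failures use roughly 40% of the budget. A target of 100% would leave no budget at all, which is why unrealistic goals constrain teams instead of helping them learn.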