Surprises happen during operations! Sometimes, when those surprises have a significant impact on the business, we label them "incidents" or "outages". We might even spend time investigating some of our bigger outages to better understand what happened.
It turns out that what can be learned from investigating an outage is not proportional to how big the impact was! In fact, it can be easier to learn from incidents with less impact because there's less pressure from the organization to get closure and move on.
In this talk, I will present the OOPS project, an effort inside Netflix to encourage engineers to report and write up operational surprises they were involved in, even when there was no customer or business impact.
I'll talk about what we hope to learn as an organization from OOPS writeups, what sorts of questions an investigator should ask to maximize learning, and how to write up the results of an OOPS investigation as a story so that readers can absorb the lessons more easily.