Anticipating Fallibility

Engineers are fallible, no ‘if’s, ‘and’s, or ‘but’s about it. The question is never if an engineer will break something, but when: at some point or another, a mistake is inevitable. How will your organization deal with it?

There are two approaches to dealing with these kinds of mistakes. The first is to try to prevent them. This is perhaps the most intuitive approach, and a young organization might be tempted to pursue this path. But take a step back, and it becomes clear that preventing mistakes from ever occurring is a daunting, time-consuming, expensive, and ultimately impossible task: how do you train somebody for all eventualities? How can you anticipate every use case? How do you squash the really tricky bugs, the ones that stay hidden in the deepest, darkest crevices of the codebase until every last condition is right (or, more likely, wrong)?

The dirty secret is simple: you don’t. This preventative approach, while intuitive, doesn’t last long. Remember: nobody is infallible. Mistakes will happen.

So what’s the second approach? How does a smarter organization cope with this? Simple: rather than preventing mistakes, anticipate them, and introduce measures to catch and mitigate them. Don’t take the error out of the engineer. Take the error out of the process.

At Squarespace, we use a system called Correction of Errors (COE). Whenever Squarespace breaks, a COE is filed and assigned to the engineer most closely associated with the problem. Note that this is not an assignment of fault; it’s an assignment of responsibility. COEs are fault-agnostic. A mistake that causes harm may have been made by an individual, but the fact that the mistake was allowed to cause a service disruption is the fault of the system.

Once a COE has been opened and assigned to an individual, it can only be closed after some kind of process is implemented to keep that specific error from ever occurring again. This could be a matter of beefing up testing, defining better communication, introducing technical redundancy, or any number of other things. COEs don’t prescribe a solution; they simply require one to exist.
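To make the closing rule concrete, here is a minimal sketch of what a COE record could look like. This is not our actual tooling: the COE class, its field names, and the close method are all hypothetical, chosen only to show that a COE is assigned to an owner and cannot be closed until a remediation is written down.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class COE:
    """Hypothetical COE record; every field name here is illustrative."""
    summary: str
    component: str                      # part of the system that was impacted
    owner: str                          # responsibility, not fault
    opened_on: date
    remediation: Optional[str] = None   # process change meant to prevent a repeat
    closed_on: Optional[date] = None

    @property
    def is_open(self) -> bool:
        return self.closed_on is None

    def close(self, remediation: str, on: date) -> None:
        # The rule from the text: no remediation, no closing.
        # The record does not care what the fix is, only that one exists.
        if not remediation.strip():
            raise ValueError("a COE needs a remediation before it can be closed")
        self.remediation = remediation
        self.closed_on = on
```

Calling close with an empty remediation raises a ValueError and leaves the COE open, which mirrors the idea that the system requires a solution without prescribing one.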

It’s actually a pretty simple system, but it’s remarkable how well it’s worked for us. Two things stand out to me as important:

First, COEs are recorded. Every COE is logged along with the part of the system that it impacted. We can graph them monthly or quarterly to see how we’re doing: are we becoming more error-prone, or less? Where is our system most fragile?

Second, COEs are addressed. Every month the engineering team highlights not only how many COEs were filed, but also how many are still open. Having a COE assigned to you is fine, but having an outstanding open COE assigned to you is a bad thing: there’s cultural pressure to find a solution, implement it, and close it.
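The two reports described above boil down to a few counts. Here is a rough sketch of how they might be computed, assuming each COE is stored as a simple tuple of opened date, impacted component, and closed date. The sample records, component names, and function names are made up for illustration and are not real Squarespace data.

```python
from collections import Counter
from datetime import date

# Made-up sample records: (opened_on, component, closed_on or None if still open)
coes = [
    (date(2024, 1, 9), "billing", date(2024, 1, 20)),
    (date(2024, 1, 23), "editor", None),
    (date(2024, 2, 2), "billing", date(2024, 2, 10)),
]


def filed_per_month(records):
    """Count COEs by the month they were opened: are we getting more error-prone?"""
    return Counter(opened.strftime("%Y-%m") for opened, _, _ in records)


def filed_per_component(records):
    """Count COEs by impacted component: where is the system most fragile?"""
    return Counter(component for _, component, _ in records)


def still_open(records):
    """Return the COEs that have not been closed yet."""
    return [record for record in records if record[2] is None]
```

Feeding filed_per_month into any plotting tool gives the monthly graph, and the length of still_open is the number the team would highlight each month.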

Since we started recording (and addressing!) COEs, we’ve definitely seen an improvement in Squarespace’s uptime and service. And, while we haven’t seen a monotonic decrease in the number of COEs filed each quarter, the downward trend is certainly present, and it’s strong. Perhaps the most interesting graph compares COEs to the number of deployments we have each month: by reactively being proactive (I know, it sounds dumb, but it’s the raw truth!) about catching and mitigating mistakes, we’ve been able to ship more and break less.

And I call that a win.

Timothy Miller