No error left behind

A systematic approach for surfacing each error to the right stakeholder
[Figure: cartoon in the style of webcomicname.com. First panel: the stakeholder says “Make it do X” and the developer answers “OK”. Second panel: the developer works on the machine, thinking “X X X X X”. Third panel: a user operates the machine and says “Y”; the machine explodes and the developer says “oh no”.]

Which design guidelines should software follow in order to run reliably? Which guidelines do you use in your day-to-day work?

Error handling is a significant source of complexity in real-world projects, but programming textbooks often give very little advice on how to think about it at a higher level. This article tries to fill that gap:

To handle an error case in your software, ask yourself which human stakeholder is able to address it, and then find a suitable mechanism to escalate the error towards that stakeholder.

I’ve cross-checked this approach with multiple people from various software backgrounds, but I’m still happy to receive your feedback.

What is an error?

Definition: An error is a condition in which the program operates outside the intended path of execution.

Running programs within the intended path of execution is a goal by definition. To track and reach this goal, we need to make it observable whether the program is running healthily:

Rule: All errors should be observable from the outside.

Observable in this context means that it’s possible to bring the error to a human’s attention (e.g. end user, system administrator, developer) in an automated fashion.

The outside is defined by the bounds of your program. The error reports need to be made accessible to the surrounding environment, so that stakeholders can access them.

Error stakeholders

The stakeholders for an error are the people involved in the software lifecycle, such as end users, developers, system administrators, and network administrators.

Rule: For each error, there is a corresponding stakeholder.

Different errors have different stakeholders who can address the issue.

If the cause of the error lies within the program itself, the developer is the stakeholder.

If the cause lies in the user input or interaction (e.g. invalid input), the user is the stakeholder.

If the cause lies in the execution environment, the operator of that environment is the stakeholder (e.g. sysadmin, network admin, SRE).
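The three cases above can be sketched as a small classification function. This is only a minimal illustration under my own assumptions: `Stakeholder`, `UserInputError` and `stakeholder_for` are hypothetical names, and treating every `OSError` as an environment problem is a simplification (a missing file named by the user would arguably belong to the user instead).

```python
from enum import Enum


class Stakeholder(Enum):
    DEVELOPER = "developer"
    USER = "user"
    OPERATOR = "operator"


class UserInputError(Exception):
    """Hypothetical error type raised when user-supplied input is invalid."""


def stakeholder_for(error: BaseException) -> Stakeholder:
    """Classify an error by where its cause lies (simplified assumption)."""
    if isinstance(error, UserInputError):
        # Cause is in the user input or interaction.
        return Stakeholder.USER
    if isinstance(error, OSError):
        # Cause is assumed to be in the execution environment
        # (disk, network, permissions, ...).
        return Stakeholder.OPERATOR
    # Everything else is treated as a defect in the program itself.
    return Stakeholder.DEVELOPER
```

In a real program the mapping would likely be finer-grained, but the point stands: each error type should have an answer to the question "who can fix this?".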

Mechanisms

Routing errors to the stakeholders

Routing error indicators to the right stakeholders is done partly in code and partly within the deployment. In particular, displaying errors to end users is usually done in the program code itself. Metrics collection set-ups, on the other hand, can be monitored by developers and administrators alike.

There are two main strategies to make errors observable: propagating the error upwards, and propagating it to the side.

At least one of these should be used for any error.

A function can propagate errors upwards or to the side; at least one of these should be used.

Propagating an error upwards

Propagating “upwards” moves the responsibility of error handling to a higher level. In general, higher-level code has more context for the failing operation, so it is often in a better position to route the error in the right direction.

Propagating errors upwards can take many forms.

Note: Remember to convert the error into a representation which fits the level of abstraction that the caller expects. Many languages support nesting (wrapping) errors so that the full context is retained.
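As a minimal sketch of the note above, here is how wrapping could look in Python, where `raise ... from ...` keeps the original error attached as `__cause__`. The names `ConfigError` and `load_config` are hypothetical and only serve the illustration.

```python
class ConfigError(Exception):
    """Hypothetical error type at the abstraction level the caller expects."""


def load_config(path: str) -> str:
    """Read a configuration file, reporting failures as ConfigError."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError as exc:
        # Convert the low-level I/O error into a configuration-level error.
        # "from exc" nests the original error, so the full context is kept.
        raise ConfigError(f"cannot load configuration from {path!r}") from exc
```

The caller only needs to know about `ConfigError`, while a developer reading the traceback still sees the underlying `OSError`.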

Propagating an error to the side

Sometimes passing the error upwards is not an option or not sufficient, for example because you want to recover transparently from a subsystem failure or hide it from the user.

Propagating errors to the side can be one of the following:

Make sure that all the relevant error propagation sinks are monitored by the right stakeholders during the program’s operation.

Systems like Prometheus are built with monitoring and alerting in mind. It’s often a good idea to alert on symptoms close to your business needs and then use collected metrics on other errors for further analysis [1].
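To make the sideways idea concrete, here is a rough sketch that hides a failing non-essential subsystem from the user while still counting the failure. The in-process `Counter` is only a stand-in for a real metrics sink (which would have to be exported to a monitoring system to be useful), and `fetch_recommendations` and `render_page` are hypothetical names.

```python
from collections import Counter

# Stand-in for a metrics sink. An in-process counter alone is not observable
# from the outside; a real set-up would export it to a monitoring system.
ERROR_COUNTS: Counter = Counter()


def fetch_recommendations(user_id: int) -> list[str]:
    """Hypothetical non-essential subsystem that is currently failing."""
    raise ConnectionError("recommendation backend unreachable")


def render_page(user_id: int) -> dict:
    """Build the page, omitting recommendations if their backend fails."""
    try:
        recommendations = fetch_recommendations(user_id)
    except ConnectionError:
        # Sideways: record the failure where monitoring can pick it up ...
        ERROR_COUNTS["recommendations_backend_errors"] += 1
        # ... and recover by leaving out the non-essential part.
        recommendations = []
    return {"user_id": user_id, "recommendations": recommendations}
```

The user still gets a page, and the operator of the backend can be alerted once the counter starts to climb.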

Warning: Textual logging is not an error handling strategy. Logs are not meant for machine consumption, so it’s hard to have automated monitoring based on them.
Warning: Ignoring errors is not an error handling strategy either.

What if my case is not a fit?

If your case is not a fit, it’s possible that the condition at hand is not an error to begin with, but only a “corner case” which may happen in normal operation (e.g. a lookup key was not found).

Consider switching the way that the condition is propagated. The following options are alternatives where otherwise errors may be used.
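As one illustration of such a switch (my own assumption of a typical case, using the lookup example from above), a function can signal "not found" through an explicitly optional return type instead of raising an error. `PHONE_BOOK` and `find_number` are hypothetical names.

```python
from typing import Optional

PHONE_BOOK = {"alice": "555-0100"}


def find_number(name: str) -> Optional[str]:
    """Return the number, or None when the name is simply not listed.

    A missing entry is part of normal operation here, so it is expressed
    in the return type rather than as an exception.
    """
    return PHONE_BOOK.get(name)
```

Because the return type is declared as `Optional[str]`, a type checker can remind callers to handle the "not listed" case.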

If none of these worked for you, I’d love to hear from you, so I can correct my understanding.

Sometimes the most elegant solution is to change the API to make the error impossible [3]. For example, replace a dynamic check with one that’s statically guaranteed by the type system, or change to idempotent operation semantics, so that multiple invocations do not conflict with each other.
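A small sketch of the idempotent variant, using the standard library's `os.makedirs` with `exist_ok=True`: instead of failing with "directory already exists" on the second call, the operation is defined as "make sure the directory exists". The wrapper name `ensure_directory` is hypothetical.

```python
import os
import tempfile


def ensure_directory(path: str) -> None:
    """Idempotent: succeeds whether or not the directory already exists.

    Note that os.makedirs still raises if the path exists but is not a
    directory, so only the "already exists" error is defined away.
    """
    os.makedirs(path, exist_ok=True)


# Usage: calling it twice does not conflict.
base = tempfile.mkdtemp()
target = os.path.join(base, "cache", "v1")
ensure_directory(target)
ensure_directory(target)
```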

Examples

These are all written from the perspective of a piece of code detecting an error.

Recovery through redundancy

Example

A disk in a software RAID is failing.

Stakeholder

Mechanism

Similar cases

A TCP network packet is lost. (Difference: Retry has less control over the network routing)

Recovery through omission

Example

A web server uses multiple database backends. One of the non-essential backends starts to return errors.

Stakeholder

Mechanism

Delegating the decision to the calling procedure

Example

The Unix open() function is asked to open a given filename for reading, but the file doesn’t exist.

Stakeholder

Mechanism

Similar cases

Mistakes in passed input values are best handled by the code which passed them, and which can judge why they happened.
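Here is a rough sketch of this delegation in Python. Unlike the Unix `open()` call, which returns -1 and sets `errno` to `ENOENT`, Python's built-in `open()` raises `FileNotFoundError`, but the principle is the same: the low-level function does not decide what a missing file means. The three function names are hypothetical.

```python
from typing import Optional


def read_text(path: str) -> str:
    """Low-level helper: no handling here, errors propagate to the caller."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def read_optional_notes(path: str) -> Optional[str]:
    """Caller A: a missing notes file is a normal condition, not an error."""
    try:
        return read_text(path)
    except FileNotFoundError:
        return None


def read_required_template(path: str) -> str:
    """Caller B: a missing template is a real problem, so let it escalate."""
    return read_text(path)
```

Both callers use the same helper, yet they come to different conclusions about the same failure, because only they know why the file name was passed in.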

Reporting upwards and sideways at the same time

Example

A web server’s servlets are returning errors in the form of HTTP status codes.

Stakeholders

Mechanisms


  1. A good background read is the Site Reliability Engineering book, section “Symptoms vs. Causes”.

  2. Wikipedia: Tony Hoare, section “Apologies and retractions”.

  3. The book A Philosophy of Software Design by John Ousterhout has an entire chapter on the idea of defining errors away. The author has also given a tech talk.
