No error left behind
The topic of error handling is often neglected in programming textbooks, even though dealing with errors is a significant part of the complexity in real-world projects. While textbooks usually cover the technical aspects well, they often give little advice on policy questions.
This document describes a mental model which has helped me think about error handling, and for which I haven't found any counterexamples.
We start with some definitions and basic assumptions, then talk about the role of human stakeholders in that process, and continue to discuss common mechanisms used for error propagation and handling. Finally, we will look at some examples for how to apply this.
This approach seems right to me, but it's not perfect, and I'm happy to receive your feedback.
What is an error?
Running programs within the intended path of execution is a goal by definition. To track and reach this goal, we need to make it observable whether the program is running healthily:
- Observable in this context means that it's possible to bring the error to a human's attention (e.g. an end user, system administrator, or developer) in an automated fashion.
- The outside is defined by the bounds of your program. The error reports need to be made accessible to the surrounding environment, so that stakeholders can access them.
The stakeholders for an error are the people involved in the software lifecycle: end users, developers, system administrators, network administrators, etc.
- If the error cause is within the program itself, then the developer is the stakeholder.
- If the error cause is in the user input or interaction (e.g. invalid input), then the user is the stakeholder.
- If the error cause is in the execution environment, then the operator of that environment is the stakeholder (e.g. sysadmin, network admin, SRE).
Routing errors to the stakeholders
Routing error indicators to the right stakeholders is done partially in code and partially within the deployment. In particular, displaying errors to end users is usually done in the program code itself, whereas metrics collection setups can be monitored by developers and administrators alike.
There are two main strategies to make errors observable:
- Propagate the error upwards towards the initiator of the erroring operation.
- Propagate the error to the side (monitoring and logging).
At least one of these should be used for any error.
Propagating an error upwards
Propagating “upwards” moves the responsibility for error handling to a higher level. In general, higher-level software has more context for the failing operation, so it is often in a better position to route the error in the right direction.
Propagating errors upwards can take many forms.
- Function level: Propagate to the calling procedure
- Return an error code
- Raise an exception
- Propagate a lower-level exception
- Process level: Crash the process
- Network level: Return the error in a network response
- User level: Show the error to the user
- e.g. in a UI dialog
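Function-level upward propagation can be sketched in Python (all names here are hypothetical, for illustration only): a lower-level exception is wrapped into a domain-level one, so the caller receives the error at the right abstraction while the original cause stays attached.

```python
class ConfigError(Exception):
    """Domain-level error propagated upwards to the caller (hypothetical)."""

def load_config(path):
    """Read a configuration file, translating low-level I/O failures."""
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        # Propagate upwards: wrap the low-level exception in a
        # domain-level one, keeping the original cause chained.
        raise ConfigError(f"cannot load config {path!r}") from e
```

A caller can then decide how to route the error further, e.g. show it to the user or crash the process.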
Propagating an error to the side
Sometimes passing the error upwards is not an option or not sufficient, for example because you want to transparently recover or hide a subsystem failure from the user.
Propagating errors to the side can be one of the following:
- Increment a counter which can be monitored from the outside. (e.g. using Prometheus or Linux kernel stats counters)
- Write the error to an error collection service.
- For server side software: Record stack traces and notify developers about them.
- For client side software: Ask the user to send bug reports when errors are encountered.
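A minimal sketch of the counter variant, assuming an in-process counter for illustration (a real deployment would export such counters through a metrics library such as prometheus_client instead):

```python
from collections import Counter

# In-process error counters; in production these would be exported
# to a monitoring system so the right stakeholders can alert on them.
ERRORS = Counter()

def fetch_thumbnail(image_id, cache):
    """Return a thumbnail, hiding cache failures from the caller."""
    try:
        return cache[image_id]
    except KeyError:
        # Propagate to the side: count the failure for monitoring,
        # then recover transparently with a placeholder.
        ERRORS["thumbnail_cache_miss"] += 1
        return b"placeholder"
```

The caller never sees the failure, but the operator can still observe its frequency.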
Make sure that all of the relevant error propagation sinks are monitored by the right stakeholders during the program’s operation.
Systems like Prometheus are built with monitoring and alerting in mind. It’s often a good idea to alert on symptoms close to your business needs and then use collected metrics on other errors for further analysis [1].
What if my case is not a fit?
If your case is not a fit, it’s possible that the condition at hand might not be an error to begin with, but is maybe only a “corner case” which may happen in normal operation (e.g. a lookup key was not found).
Consider switching the way that the condition is propagated. The following options are alternatives where otherwise errors may be used.
- Null/empty object: Returning empty lists, sets, Optional&lt;T&gt; types, or other Null Objects. Note that this is not the same as returning null.
- Separate channel: Return the non-error conditions through a separate channel (e.g. don’t use an exception, but return specially-built objects for these conditions). Example: A linter tool’s output is produced as part of the expected operation and is therefore not an error within that context.
- Null: Returning nil or similar is idiomatic in some languages too, but may lead to follow-up errors [2] when the value is dereferenced.
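The three alternatives can be contrasted in a short Python sketch (the lookup functions and the phone-book data are hypothetical):

```python
from typing import List, Optional, Tuple

PHONE_BOOK = {"alice": "555-0100"}

def lookup_null(name) -> Optional[str]:
    # Null: idiomatic in many languages, but dereferencing the
    # result without checking may cause a follow-up error later.
    return PHONE_BOOK.get(name)

def lookup_empty(name) -> List[str]:
    # Null/empty object: absence is just an empty collection,
    # so callers can iterate without a special case.
    return [PHONE_BOOK[name]] if name in PHONE_BOOK else []

def lookup_result(name) -> Tuple[bool, Optional[str]]:
    # Separate channel: the "not found" condition travels through
    # a dedicated return value instead of an exception.
    if name in PHONE_BOOK:
        return (True, PHONE_BOOK[name])
    return (False, None)
```

Which variant is best depends on how callers consume the result; the empty-object form composes especially well with loops.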
Sometimes the most elegant solution is to change the API to make the error impossible [3]. For example, replace a dynamic check with one that’s statically guaranteed by the type system, or change to idempotent operation semantics, so that multiple invocations do not conflict with each other.
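As one sketch of idempotent semantics in Python: creating a directory that already exists is simply defined not to be an error, so repeated invocations cannot conflict.

```python
import os
import tempfile

def ensure_directory(path):
    """Idempotent: an already-existing directory is not an error."""
    # exist_ok=True defines the "already exists" error away.
    os.makedirs(path, exist_ok=True)
```

Calling ensure_directory twice with the same path succeeds both times, where a naive os.mkdir would raise FileExistsError on the second call.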
The following examples are all written from the perspective of a piece of code detecting an error.
Recovery through redundancy
A disk in a software RAID is failing.
- The system administrator, as soon as the percentage of recovered executions exceeds a threshold. They can then investigate the cause, exchange hard drives, look into network issues, or similar.
- Recover by retrying the operation on a different disk.
- Increment an error counter indicating the failure cause. When incremented in a monitorable system, this counter can trigger an alert.
A TCP network packet is lost. (Difference: Retry has less control over the network routing)
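Recovery through redundancy can be sketched as follows (disks are modeled as hypothetical callables; a real RAID implementation lives at a much lower level):

```python
from collections import Counter

# Counter propagated to the side; the sysadmin alerts on it
# once the failure rate exceeds a threshold.
READ_ERRORS = Counter()

def read_block(block, disks):
    """Read a block, retrying on each redundant disk in turn."""
    last_error = None
    for disk in disks:
        try:
            return disk(block)
        except IOError as e:
            # Propagate to the side: record the failure cause.
            READ_ERRORS["disk_read_failed"] += 1
            last_error = e
    # All replicas failed: recovery is impossible, propagate upwards.
    raise last_error
```

The caller only sees an error when every replica has failed; individual disk failures are visible solely through the counter.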
Recovery through omission
A web server uses multiple database backends. One of the non-essential backends starts to return errors.
- The operator of the failing subsystem.
- Omit the backend’s response. (Treat it as if it didn’t exist.)
- Increment an error counter indicating the failure cause, so that the subsystem operator can be alerted.
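Recovery through omission might look like this sketch (backend names and fetch functions are hypothetical):

```python
from collections import Counter

# Per-backend error counters, monitored by each subsystem's operator.
BACKEND_ERRORS = Counter()

def gather_page_sections(backends):
    """Collect page sections, omitting any non-essential backend that fails."""
    sections = []
    for name, fetch in backends.items():
        try:
            sections.append(fetch())
        except Exception:
            # Omit the failing backend's response (treat it as if it
            # didn't exist) and count the error for its operator.
            BACKEND_ERRORS[name] += 1
    return sections
```

The user still gets a page, just without the failing backend's content, while the operator is alerted through the counter.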
Delegating the decision to the calling procedure
An open() function is asked to open a given filename for reading, but the file doesn’t exist.
- We don’t know who is responsible for passing the wrong filename, but somewhere in the call chain, someone is going to know where that filename comes from.
- Report the error to the caller through the documented error reporting mechanism (e.g. an error code or exception).
Mistakes in passed input values are best handled by the code which passed them, and which can judge why they happened.
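A sketch of this in Python: open() reports the missing file via FileNotFoundError, and the caller, which knows the filename came from an optional user config, decides that falling back to defaults is the right handling (the function name and default content are hypothetical).

```python
DEFAULT_CONFIG = "# built-in defaults\n"

def read_user_config(path):
    """The caller knows where the filename comes from, so it decides:
    a missing user config file is fine, we fall back to defaults."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        # Not an error at this level: the user simply has no config.
        return DEFAULT_CONFIG
```

A different caller, e.g. one passing a filename typed by the user, might instead surface the same FileNotFoundError in a dialog.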
Reporting upwards and sideways at the same time
A web server’s servlets are returning errors in the form of HTTP status codes.
- The requester is a stakeholder because they care about the request.
- The web server’s operator is a stakeholder because they care about keeping overall error fractions within reasonable bounds.
- Propagate status via HTTP, upwards to the caller who has more context.
- Count HTTP statuses keyed by relevant criteria (such as servlet), e.g. in Prometheus, for the web server operator.
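Combining both directions can be sketched like this (the dispatcher, servlet names, and in-process counter are hypothetical; a real server would export the counter to a system like Prometheus):

```python
from collections import Counter

# Sideways channel: statuses keyed by (servlet, status) for the operator.
HTTP_STATUSES = Counter()

def handle_request(servlet, handler, request):
    """Dispatch a request, reporting errors upwards and sideways."""
    try:
        status, body = handler(request)
    except Exception:
        status, body = 500, "internal error"
    # Sideways: count the status for monitoring and alerting.
    HTTP_STATUSES[(servlet, status)] += 1
    # Upwards: the status travels back to the requester via HTTP.
    return status, body
```

The requester gets the status code with full request context, while the operator watches the aggregate error fraction per servlet.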
[1] A good background read is the Site Reliability Engineering Book, section “Symptoms vs. Causes”.
[2] https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions
[3] The book Philosophy of Software Design by John Ousterhout has an entire chapter on the idea of defining errors away. The author has also given a tech talk.