![Cartoon in the style of webcomicname.com - first panel: Stakeholder: 'Make it do X' Developer: 'OK'; second panel: developer works on machine, thinks: 'X X X X X'; third panel: user uses the machine saying 'Y'. Machine: explodes. Developer: 'oh no'](/images/reliability.svg)

Which design guidelines should software follow in order to run
reliably?  Which guidelines do you use in your day-to-day work?

Error handling accounts for a significant share of the complexity in
real-world projects, but programming textbooks often give very little
advice on how to think about it at a higher level.  This article tries
to fill that gap:

 * We will define what an error is.
 * We will find out that each error has a human stakeholder.
 * We will discuss technical mechanisms to bring errors to human attention.
 * The final section contains various examples of how to apply this.

To handle an error case in your software, ask yourself which human
stakeholder is able to address it, and then find a suitable mechanism
to escalate the error towards that stakeholder.

I've cross-checked this approach with multiple people from various
software backgrounds, but I'm still happy to receive your feedback.

# What is an error?

> **Definition:**
> *An error* occurs when the program operates outside its intended path
> of execution.
{.def}

Running programs within the intended path of execution is a goal by
definition.  To track and reach this goal, we need to make it
observable whether the program is running healthily:

> **Rule:**
> All errors should be *observable* from *the outside*.
{.rule}

*Observable* in this context means that it's possible to bring the
error to a human's attention (e.g. end user, system administrator,
developer) in an automated fashion.

*The outside* is defined by the bounds of your program.  The error
reports need to be made accessible to the surrounding environment,
so that stakeholders can access them.

## Error stakeholders

The stakeholders for an error are the people involved in the software
lifecycle, such as end users, developers, system administrators and
network administrators.

> **Rule:**
> For each error, there is a corresponding stakeholder.
{.rule}

![](/images/stakeholders.svg)
Different errors have different stakeholders who can address the issue.

If the cause of the error is **within the program itself**, then the
**developer** is the stakeholder.

If the cause of the error is **in the user input or interaction**
(e.g. invalid input), then **the user** is the stakeholder.

If the cause of the error is **in the execution environment**, then
**the operator of that environment** is the stakeholder (e.g. sysadmin,
network admin, SRE).
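
The mapping from cause to stakeholder can be sketched in code.  This is only an illustration: the `Stakeholder` enum, the two exception classes and `stakeholder_for` are hypothetical names invented for this example, and a real program would have to decide for itself which of its error types belong to which category.

```python
import enum


class Stakeholder(enum.Enum):
    DEVELOPER = "developer"
    USER = "user"
    OPERATOR = "operator"


class UserInputError(Exception):
    """Hypothetical: the cause lies in the user input or interaction."""


class EnvironmentFailure(Exception):
    """Hypothetical: the cause lies in the execution environment."""


def stakeholder_for(err):
    """Return the stakeholder who is able to address the given error."""
    if isinstance(err, UserInputError):
        return Stakeholder.USER
    if isinstance(err, EnvironmentFailure):
        return Stakeholder.OPERATOR
    # Assumption of this sketch: anything unclassified is treated as a
    # problem within the program itself.
    return Stakeholder.DEVELOPER
```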

# Mechanisms

## Routing errors to the stakeholders

Routing error indicators to the right stakeholders is done partly in
code and partly in the deployment.  In particular, displaying errors
to end users is usually done in the program code itself, whereas a
metrics collection set-up lives in the deployment and can be monitored
by developers and administrators alike.

There are two main strategies to make errors observable:

* Propagate the error upwards towards the initiator of the erroring
  operation.
* Propagate the error to the side (monitoring and logging).

At least one of these should be used for any error.

![](/images/errors.svg)
A function can propagate errors upwards or to the side;<br>at least one of these should be used.

## Propagating an error upwards

Propagating "upwards" moves the responsibility of error handling to a
higher level.  In general, higher-level software has more context for
the failing operation, so it is often in a better position to route
the error in the right direction.

Propagating errors upwards can take many forms.

* Function level: **Propagate to the calling procedure**
  * Return an error code
  * Raise an exception
  * Propagate a lower-level exception
* Process level: **Crash the process**
  * Exit the program with an error status (e.g. Unix [`exit()`](http://man7.org/linux/man-pages/man3/exit.3.html))
  * Abort the program (e.g. Unix [`abort()`](http://man7.org/linux/man-pages/man3/abort.3.html))
* Network level: **Return the error in a network response**
* User level: **Show the error to the user**
  * e.g. in a UI dialog

> **Note:**
> Remember to convert the error into a representation which fits the
> level of abstraction that the caller expects.  Many languages
> support nesting (wrapping) errors so that the full context is
> retained.
{.info}
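
As a minimal sketch of such wrapping in Python: the `ConfigError` type and the `load_config` function are hypothetical names made up for this example.  The low-level `OSError` is converted into an error at the caller's level of abstraction, while `raise ... from` retains the original as `__cause__`.

```python
class ConfigError(Exception):
    """Hypothetical error type at the 'configuration' level of abstraction."""


def load_config(path):
    """Read a configuration file, converting low-level I/O failures."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError as err:
        # Wrap the low-level error.  `from err` keeps it as __cause__,
        # so the full context stays available to whoever handles it.
        raise ConfigError(f"cannot load configuration from {path!r}") from err
```

A caller only needs to know about `ConfigError`, but a traceback or an error collection service still sees the underlying cause.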

## Propagating an error to the side

Sometimes passing the error upwards is not an option or not
sufficient, for example because you want to transparently recover or
hide a subsystem failure from the user.

Propagating errors to the side can be one of the following:

* Increment a counter which can be monitored from the outside
  (e.g. using Prometheus or Linux kernel stats counters).
* Write the error to an error collection service.
  * For server side software: Record stack traces and notify
    developers about them.
  * For client side software: Ask the user to send bug reports when
    errors are encountered.

Make sure that all the relevant error propagation sinks are
monitored by the right stakeholders during the program's operation.
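
The counter variant could look roughly like the following.  This sketch uses a plain in-process counter from the standard library as a stand-in; `error_counts`, `count_error` and `snapshot` are hypothetical names, and a real deployment would export the values through whatever monitoring system it uses.

```python
import collections
import threading

# Hypothetical in-process registry standing in for a real metrics system.
_lock = threading.Lock()
error_counts = collections.Counter()


def count_error(cause):
    """Propagate an error to the side by incrementing a named counter."""
    with _lock:
        error_counts[cause] += 1


def snapshot():
    """Return a copy of all counters, e.g. for a monitoring endpoint."""
    with _lock:
        return dict(error_counts)
```

The counter only becomes an error handling mechanism once something outside the program reads `snapshot()` (or its real-world equivalent) and alerts a stakeholder.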

Systems like Prometheus are built with monitoring and alerting in
mind.  It's often a good idea to alert on symptoms close to your
business needs and then use collected metrics on other errors for
further analysis[^srebook].

> **Warning:**
> Textual logging is not an error handling strategy. Logs are not
> meant for machine consumption, so it's hard to have automated
> monitoring based on them.
{.warning}

> **Warning:**
> Ignoring errors is not an error handling strategy either.
{.warning}

## What if my case is not a fit?

If your case is not a fit, the condition at hand might not be an error
to begin with, but only a "corner case" which may happen in normal
operation (e.g. a lookup key was not found).

Consider changing the way that the condition is propagated.  The
following options are alternatives for cases where an error might
otherwise be used.

*  **Null/empty object:** Returning empty lists, sets, `Optional<T>`
   types, or other [Null
   Objects](https://en.wikipedia.org/wiki/Null_object_pattern). Note
   that this is not the same as `null`.
*  **Separate channel:** Return the non-error conditions through a
   separate channel (e.g. don't use an exception, but return
   specially-built objects for these conditions).  **Example:** A
   linter tool's output is produced as part of the expected operation
   and is therefore not an error within that context.
*  **Null:** Returning `null`, `nil` or similar is idiomatic in some languages
   too, but may lead to follow-up
   errors[^hoareapologies]
   when the value is dereferenced.
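
To illustrate the "empty object" and "separate channel" options together, here is a sketch of a toy linter.  The `Finding` type, the `lint` function and the 79-character limit are assumptions made for this example only.  Findings are returned as ordinary data, and "nothing found" is an empty list rather than `None` or an exception.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical linter result: part of normal output, not an error."""
    line: int
    message: str


def lint(source):
    """Return findings for over-long lines; an empty list means 'clean'."""
    findings = []
    for number, text in enumerate(source.splitlines(), start=1):
        if len(text) > 79:
            findings.append(Finding(number, "line too long"))
    return findings
```

Callers can iterate over the result without any special-casing, and exceptions remain reserved for situations where the linter itself cannot do its job.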

> If none of these worked for you, I'd love to hear from you,
> so I can correct my understanding.
{.info}

Sometimes the most elegant solution is to **change the API to make the
error impossible**[^psd]. For example, replace a dynamic check with
one that's statically guaranteed by the type system, or change to
idempotent operation semantics, so that multiple invocations do not
conflict with each other.
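
A small sketch of the idempotent variant, using a hypothetical `Subscriptions` class: instead of raising "already subscribed" or "not subscribed" errors, both operations are defined in terms of the desired end state, so repeated invocations cannot fail.

```python
class Subscriptions:
    """Hypothetical example type with idempotent operations."""

    def __init__(self):
        self._topics = set()

    def subscribe(self, topic):
        """Afterwards the topic is subscribed, however often this is called."""
        self._topics.add(topic)

    def unsubscribe(self, topic):
        """Afterwards the topic is not subscribed; discard() never raises."""
        self._topics.discard(topic)

    def topics(self):
        return frozenset(self._topics)
```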

# Examples

These are all written from the perspective of a piece of code
detecting an error.

## Recovery through redundancy

### Example

A disk in a software RAID is failing.

### Stakeholder

* **The system administrator**, as soon as the percentage of recovered
  executions exceeds a threshold.  They can then investigate the cause
  of network issues, exchange hard drives or similar.

### Mechanism

* **Recover** by retrying the operation on a different disk.
* **Increment an error counter** indicating the failure cause.  When
  incremented in a monitorable system, this counter can trigger an alert.
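
A rough sketch of this mechanism, assuming each disk is represented by a hypothetical `(name, read_fn)` pair and that `failed_reads` stands in for a counter exported to a monitoring system:

```python
import collections

# Hypothetical counter keyed by disk name, standing in for a real metric.
failed_reads = collections.Counter()


def read_block(disks, block_id):
    """Try each (name, read_fn) mirror in turn and count every failure."""
    last_error = None
    for name, read_fn in disks:
        try:
            return read_fn(block_id)
        except OSError as err:
            failed_reads[name] += 1  # to the side: visible to monitoring
            last_error = err
    # No redundancy left: the only remaining option is to propagate upwards.
    raise OSError(f"block {block_id} unreadable on all disks") from last_error
```

As long as one mirror works, the caller never notices the failure; the administrator learns about it through the counter.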

### Similar cases

A TCP network packet is lost. (Difference: Retry has less control over
the network routing)

## Recovery through omission

### Example

A web server uses multiple database backends.  One of the
non-essential backends starts to return errors.

### Stakeholder

 * The **operator** of the failing subsystem.

### Mechanism

* **Omit** the backend's response.  (Treat it as if it didn't exist.)
* **Increment an error counter** indicating the failure cause, so that
  the subsystem operator can be alerted.
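
Sketched in code, with `render_page` and `backend_errors` as hypothetical names and each optional backend modelled as a zero-argument callable:

```python
import collections

# Hypothetical counter keyed by (backend name, error type).
backend_errors = collections.Counter()


def render_page(article, optional_backends):
    """Build a page; sections from failing optional backends are omitted."""
    page = {"article": article}
    for name, fetch in optional_backends.items():
        try:
            page[name] = fetch()
        except Exception as err:
            # Omit this section, but leave a trace for the backend's operator.
            backend_errors[(name, type(err).__name__)] += 1
    return page
```

Catching the broad `Exception` is a deliberate choice in this sketch, since the backend is non-essential; it is only acceptable because every failure is counted.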

## Delegating the decision to the calling procedure

### Example

The Unix `open()` function is asked to open a given filename for
reading, but the file doesn't exist.

### Stakeholder

* **We don't know** who is responsible for passing the wrong filename,
  but somewhere in the call chain, someone is going to know where that
  filename comes from.

### Mechanism

* **Report the error to the caller** with the documented `ENOENT` error code.
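
From the caller's side, this might look like the following Python sketch.  `read_optional_notes` is a hypothetical function; Python's `os.open` reports the `ENOENT` condition as an `OSError` whose `errno` attribute carries the code.

```python
import errno
import os


def read_optional_notes(path):
    """For this caller, a missing notes file simply means 'no notes'."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as err:
        if err.errno == errno.ENOENT:
            return ""
        raise  # other failures (e.g. EACCES) keep propagating upwards
    with os.fdopen(fd, "r", encoding="utf-8") as f:
        return f.read()
```

A different caller, such as one opening a file the user named explicitly, would instead show the same `ENOENT` to the user.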

### Similar cases

Mistakes in passed input values are best handled by the code which
passed them, and which can judge why they happened.

## Reporting upwards and sideways at the same time

### Example

A web server's servlets are returning errors in the form of HTTP
status codes.

### Stakeholders

* The **requester** is a stakeholder because they care about the
  request.
* The web server's **operator** is a stakeholder because they care
  about keeping overall error fractions within reasonable bounds.

### Mechanisms

* **Propagate status via HTTP**, upwards to the caller who has more
  context.
* **Count HTTP statuses** keyed by relevant criteria (such as
  servlet), e.g. in Prometheus, for the web server operator.
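
A minimal sketch of doing both at once, assuming handlers are plain functions returning a `(status, body)` tuple; `counted` and `status_counts` are hypothetical names, with the counter standing in for a real metric:

```python
import collections

# Hypothetical metric keyed by (handler name, HTTP status).
status_counts = collections.Counter()


def counted(name, handler):
    """Wrap a handler so that every response status is also counted."""
    def wrapper(request):
        try:
            status, body = handler(request)
        except Exception:
            # An unexpected exception becomes a 500 for the requester.
            status, body = 500, "internal error"
        status_counts[(name, status)] += 1  # sideways, for the operator
        return status, body                 # upwards, for the requester
    return wrapper
```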

[^hoareapologies]: [Wikipedia: Tony Hoare, section "Apologies and retractions"](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)
[^psd]: The book [Philosophy of Software Design](https://www.goodreads.com/book/show/39996759-a-philosophy-of-software-design) by John Ousterhout has an entire chapter on the idea of defining errors away.  The author has also given a [tech talk](https://www.youtube.com/watch?v=bmSAYlu0NcYt).
[^srebook]: A good background read is the [Site Reliability Engineering Book](https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html#symptoms-versus-causes-g0sEi4), section "Symptoms vs. Causes"
