---
title: "The feasibility of pledge() on Linux"
date: 2022-07-16T10:36:20+02:00
description: "Or: Why my attempt to implement pledge() on Linux failed"
comments: pledge-on-linux
tags: ["Sandbox", "Linux"]
---

So, there was a [post by Justine Tunney](https://justine.lol/pledge/)
about her port of [OpenBSD's
`pledge()`](https://man.openbsd.org/pledge.2) to her own libc, the
[Cosmopolitan libc](https://justine.lol/cosmopolitan/).

She also calls out that previous attempts at this were flawed:

> There's been a few devs in the past who've tried this. I'm not going to name names, because most of these projects were never completed. [...] The projects that got further along also had oversights like allowing the changing of setuid/setgid/sticky bits. So none of the current alternatives should be used.

My own [seccomp-scopes](https://github.com/gnoack/seccomp-scopes)
project which I worked on from 2016 onwards is one of these attempts,
so I feel I should explain the reasons *why* I stopped pursuing this
approach of unprivileged sandboxing, and what I think needs to get
done to do it right.

At a high level, the main problem is that seccomp-bpf does its
filtering at the level of system calls, and **software libraries
generally do not give guarantees about which system calls they
use under the hood**. Not even `libc` implementations give such
guarantees.

## You can't predict the syscalls a program will make

Here are some ways in which glibc makes it hard to predict which
system calls it will make:

* glibc replaces existing uses of system calls with newer variants. A
  call to the `open()` libc function now uses the `openat(2)` syscall
  under the hood, and that is just one of many examples. Which syscall
  gets used can change between glibc versions.

* glibc initializes parts of the library on demand when they are first
  used, and that may involve system calls that you would rather
  forbid. So this initialization ideally needs to happen before the
  seccomp filter is enforced.

* glibc loads shared libraries at runtime for commonly used
  functionality such as name lookups
  ([`nsswitch.conf`](https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html)).
  Administrators can flexibly install additional ways of doing name
  lookups, so any attempt at reasoning about the resulting system
  calls needs to take these shared libraries into account as well.

For example: If a program calls `gethostbyname()` for the first time, the following things happen:

* It looks up `/etc/nsswitch.conf` to find the shared libraries that implement hostname lookup (system calls: various file accesses)
* It loads these shared libraries (system calls: various file accesses, various address space manipulation syscalls)
* It calls these shared libraries to do name lookup (system calls: you can't tell anymore)

There is additionally the problem that at the system call layer, DNS
lookups are indistinguishable from other UDP socket operations, so
allow-listing DNS will probably allow other UDP traffic as well.

So, to summarize: Attempting to implement a `pledge()`-like call with
seccomp-bpf and independent of a specific libc is an inherently
brittle approach, which **involves keeping up-to-date lists of system
calls on different kernel versions and architectures and their use by
different libcs**. The complexity and feature-richness of glibc
(particularly libnss) makes this particularly difficult. Any
libc-independent `pledge()` library would need to get updated in sync
with glibc updates, or it would run the risk that glibc starts using a
syscall that it doesn't allow-list, breaking the programs that use it.
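
To make the failure mode concrete, here is a toy model in Go. It does not install a real seccomp filter, and the `allowList` type, the `firstDenied` helper and the two syscall traces are made up for illustration. It only shows how an allow-list written against one libc version stops covering the same source code after the libc switches to a newer syscall.

```go
package main

import "fmt"

// allowList is a hypothetical stand-in for the set of syscalls that a
// seccomp-bpf based "read files" promise might permit.
type allowList map[string]bool

// firstDenied returns the first syscall in trace that the allow-list
// does not cover, or "" if the whole trace is permitted.
func firstDenied(allowed allowList, trace []string) string {
	for _, sc := range trace {
		if !allowed[sc] {
			return sc
		}
	}
	return ""
}

func main() {
	// Allow-list written against a libc that implements open() with open(2).
	readFiles := allowList{"open": true, "read": true, "close": true}

	// Illustrative traces for the same source code: open(), read(), close().
	oldLibc := []string{"open", "read", "close"}
	newLibc := []string{"openat", "read", "close"}

	fmt.Printf("old libc: denied=%q\n", firstDenied(readFiles, oldLibc))
	fmt.Printf("new libc: denied=%q\n", firstDenied(readFiles, newLibc))
}
```

The program did not change between the two traces, only the libc underneath it did, and yet the second trace is rejected at `openat`. With a real filter, that rejection would kill the process or make the call fail.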

Justine Tunney's `pledge()` implementation works around these problems
by (a) only supporting her own, simpler, libc implementation, and (b)
only supporting the x86-64 architecture. I'm really happy to see that
this works well together, but I'm afraid it's a mistake to think this
implementation can be "ported" to glibc, which is used by the bulk of
Linux distributions.

## Restricting by file path

In OpenBSD, `pledge()` used to be path-aware, until that part was
moved into the separate `unveil()` call.

Seccomp-bpf can only filter syscalls by their direct arguments, so the
filter can see the value of the pointer to the path name, but not the
path name itself in the memory referenced by that pointer.

There are more advanced techniques to inspect pointer memory, but
using these safely involves separate supervisor processes or more
complicated constrained ways to control what processes do -- you need
to take security very seriously to pull that off, and it doesn't map
to a call to a single C function like `pledge()` anymore.

## Landlock promises to fix this in the future

Unprivileged sandboxing continues to be difficult on Linux, for the
moment, and it's no surprise that the main users of seccomp-bpf are
either dedicated sandboxing or containerization tools, or projects
where security is a major focus, like web browsers, OpenSSH or Tor.
But we should not give up yet. :)

The [Landlock LSM](https://landlock.io/) offers a better approach for
unprivileged sandboxing, although it can't yet restrict as many
operations as seccomp-bpf can. Landlock can solve the above problems,
because:

* Landlock filters security-sensitive operations at the point when
  these operations are done in the kernel, not at the system call
  layer. This makes it architecture independent and removes the
  need to keep up-to-date lists of system calls.
* Landlock can easily filter on file paths and other relevant
  in-memory properties that cannot be observed by seccomp-bpf at the
  system call interface.

If you want to try it out, Landlock is already enabled on some Linux
distributions (e.g. Arch Linux). A simple call to Landlock (using the
Go library) is:

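```go
// Allow read-only access below /usr and /bin, read-write below /tmp.
err := landlock.V1.BestEffort().RestrictPaths(
    landlock.RODirs("/usr", "/bin"),
    landlock.RWDirs("/tmp"),
)
if err != nil {
    log.Fatalf("could not enable Landlock: %v", err)
}
```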

Some further links:

* https://landlock.io
* [documentation for Go-Landlock](https://pkg.go.dev/github.com/landlock-lsm/go-landlock/landlock)

## Summary

As shown above, seccomp-bpf makes it more difficult than necessary to
sandbox processes. It is available on a wide range of Linux
distributions, but it's currently not practical to use for the bulk of
software linked to glibc, and it's not possible to restrict operations
by file path in BPF.

Landlock is not rolled out to all Linux distributions yet, and it
still has some known gaps in its current version, but it has a
significantly simpler API and a much simpler implementation in the
kernel than what would be required in userspace to work around the
problems of seccomp-bpf.

And simplicity is a great property for security features to have. I
can wholeheartedly recommend having a look at it.
