The feasibility of pledge() on Linux

Or: Why my attempt to implement pledge() on Linux failed

July 16, 2022

So, there was a post by Justine Tunney about her port of OpenBSD’s pledge() to her own libc, the Cosmopolitan libc.

She is also calling out that previous attempts at this were flawed:

There’s been a few devs in the past who’ve tried this. I’m not going to name names, because most of these projects were never completed. […] The projects that got further along also had oversights like allowing the changing of setuid/setgid/sticky bits. So none of the current alternatives should be used.

My own seccomp-scopes project, which I worked on from 2016 onwards, is one of these attempts, so I feel I should explain why I stopped pursuing this approach to unprivileged sandboxing, and what I think needs to be done to do it right.

At a high level, the main problem is that seccomp-bpf does its filtering at the level of system calls, and software libraries generally do not give guarantees about which system calls they use under the hood. Not even libc implementations give such guarantees.

You can’t predict the syscalls a program will do

Here are some ways in which glibc makes it hard to predict which system calls it will make:

glibc replaces existing uses of system calls with newer variants. A call to the open() libc function now uses the openat(2) syscall under the hood, and that is just one of many examples. Which syscall is used changes between glibc versions.
glibc initializes parts of the library on demand when they are first used, and that may involve system calls that should otherwise be forbidden. Ideally, this initialization needs to happen before seccomp is enforced.
glibc loads shared libraries for some commonly used functionality (configured in nsswitch.conf). Administrators can flexibly install additional ways of doing name lookups, so any attempt at reasoning about the system calls involved needs to take these shared libraries into account as well.
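To make the first point concrete, here is a small sketch of how an allow-list written against one libc version can break with the next. The allow-list and the two syscall traces are made up for illustration; they are not measured output from any particular glibc release.

```go
package main

import (
	"fmt"
	"sort"
)

// allowed is a hypothetical allow-list, written at a time when open()
// was assumed to be implemented with the open syscall.
var allowed = map[string]bool{
	"open": true, "read": true, "write": true, "close": true, "exit_group": true,
}

// denied returns the distinct syscalls in trace that the allow-list rejects.
func denied(trace []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, sc := range trace {
		if !allowed[sc] && !seen[sc] {
			seen[sc] = true
			out = append(out, sc)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative traces of the same open()/read()/close() sequence.
	older := []string{"open", "read", "close", "exit_group"}
	newer := []string{"openat", "read", "close", "exit_group"}
	fmt.Println("older libc, denied:", denied(older))
	fmt.Println("newer libc, denied:", denied(newer))
}
```

The program itself did not change between the two traces, only the library underneath it, and yet the second trace trips the filter.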

For example: If a program calls gethostbyname() for the first time, the following things happen:

It looks up /etc/nsswitch.conf to find the shared libraries that implement hostname lookup (system calls: various file accesses)
It loads these shared libraries (system calls: various file accesses, various address space manipulation syscalls)
It calls these shared libraries to do name lookup (system calls: you can’t tell anymore)
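The first two steps can be sketched in a few lines. The function below is a hypothetical helper, not glibc's parser: it reads nsswitch.conf content and lists the library names that follow the libnss_<service>.so.2 naming convention. Whether each one is really loaded from disk depends on the glibc version and on what is installed, so treat the result as an approximation.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// nssModules returns the NSS library names configured for one database
// (for example "hosts") in the given nsswitch.conf content.
func nssModules(conf, database string) []string {
	var libs []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i] // strip comments
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 || strings.TrimSpace(parts[0]) != database {
			continue
		}
		inAction := false
		for _, f := range strings.Fields(parts[1]) {
			// Skip action items such as [NOTFOUND=return], which may
			// span several whitespace-separated tokens.
			if strings.HasPrefix(f, "[") {
				inAction = true
			}
			if inAction {
				if strings.HasSuffix(f, "]") {
					inAction = false
				}
				continue
			}
			libs = append(libs, "libnss_"+f+".so.2")
		}
	}
	return libs
}

func main() {
	// Sample configuration for illustration only.
	conf := "passwd: files\nhosts: files mdns4_minimal [NOTFOUND=return] dns\n"
	for _, lib := range nssModules(conf, "hosts") {
		fmt.Println(lib)
	}
}
```

Each of these libraries is free to make whatever system calls it likes, which is why the third step cannot be predicted from the program's source alone.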

There is additionally the problem that at the system call layer, DNS lookups are indistinguishable from other UDP socket operations, so allow-listing DNS will probably allow other UDP traffic as well.

So, to summarize: Attempting to implement a pledge()-like call with seccomp-bpf, independent of a specific libc, is an inherently brittle approach. It involves keeping up-to-date lists of system calls for different kernel versions and architectures, and tracking how different libcs use them. The complexity and feature-richness of glibc (particularly libnss) makes this especially difficult. Any libc-independent pledge() library would need to be updated in sync with glibc updates, or it would run the risk that glibc starts using a syscall that the library doesn’t allow-list, breaking the programs that use it.
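The architecture part of this is easy to underestimate. As a small illustration, here are the numbers for just two syscalls on three architectures. The table is a hand-written excerpt, and the lookup function is a hypothetical helper; a real filter would need a table like this for every syscall it mentions.

```go
package main

import "fmt"

// syscallNumbers is a hand-written excerpt of syscall numbers.
// A missing entry means the syscall does not exist on that architecture.
var syscallNumbers = map[string]map[string]int{
	"x86_64":  {"open": 2, "openat": 257},
	"i386":    {"open": 5, "openat": 295},
	"aarch64": {"openat": 56}, // aarch64 has no open syscall at all
}

// number looks up a syscall number and reports whether the syscall exists.
func number(arch, name string) (int, bool) {
	n, ok := syscallNumbers[arch][name]
	return n, ok
}

func main() {
	for _, arch := range []string{"x86_64", "i386", "aarch64"} {
		for _, name := range []string{"open", "openat"} {
			if n, ok := number(arch, name); ok {
				fmt.Printf("%-8s %-7s %d\n", arch, name, n)
			} else {
				fmt.Printf("%-8s %-7s (no such syscall)\n", arch, name)
			}
		}
	}
}
```

A filter that compares the syscall number against 2 means "open" on x86-64 and something else entirely on the other two, so the filter has to check the architecture first and carry one table per architecture.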

Justine Tunney’s pledge() implementation works around these problems by (a) only supporting her own, simpler, libc implementation, and (b) only supporting the x86-64 architecture. I’m really happy to see that this works well together, but I’m afraid it’s a mistake to think this implementation can be “ported” to glibc, which is used by the bulk of Linux distributions.

Restricting by file path

In OpenBSD, pledge() was always path-aware, until they moved that part into the separate unveil() call.

Seccomp-bpf can only filter syscalls by their direct arguments, so the filter can see the value of the pointer to the path name, but not the path name itself in the memory referenced by that pointer.
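To illustrate, the sketch below mirrors the kernel's struct seccomp_data, which is all the input a seccomp-bpf filter gets. The filterInput helper is hypothetical and only simulates what a filter would see for an openat() call on x86-64; the directory descriptor value is arbitrary.

```go
package main

import (
	"fmt"
	"unsafe"
)

// seccompData mirrors the kernel's struct seccomp_data.
type seccompData struct {
	Nr                 int32     // syscall number
	Arch               uint32    // AUDIT_ARCH_* value
	InstructionPointer uint64    // where the syscall was made
	Args               [6]uint64 // raw register values, nothing more
}

const (
	sysOpenat      = 257        // openat on x86-64
	auditArchX8664 = 0xC000003E // AUDIT_ARCH_X86_64
)

// filterInput simulates the filter's view of openat(dirfd, path, 0).
// The path shows up as an address only; the bytes behind it are out of reach.
// The address is for illustration and is not stable in a Go program.
func filterInput(dirfd uint64, path *byte) seccompData {
	return seccompData{
		Nr:   sysOpenat,
		Arch: auditArchX8664,
		Args: [6]uint64{dirfd, uint64(uintptr(unsafe.Pointer(path))), 0},
	}
}

// dataSize returns the size of the filter input in bytes.
func dataSize() int {
	return int(unsafe.Sizeof(seccompData{}))
}

func main() {
	a := []byte("/etc/passwd\x00")
	b := []byte("/tmp/scratch\x00")
	fmt.Printf("%+v\n", filterInput(3, &a[0]))
	fmt.Printf("%+v\n", filterInput(3, &b[0]))
	fmt.Println("size of filter input in bytes:", dataSize())
}
```

The two calls differ only in one opaque number in Args[1], so a filter has no basis for allowing one path and denying the other.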

There are more advanced techniques to inspect pointer memory, but using these safely involves separate supervisor processes or more complicated, constrained ways to control what processes do. You need to take security very seriously to pull that off, and it doesn’t map to a call to a single C function like pledge() anymore.

Landlock promises to fix this in the future

Unprivileged sandboxing continues to be difficult on Linux, for the moment, and it’s no surprise that the main users of seccomp-bpf are either dedicated sandboxing or containerization tools, or projects where security is a major focus, like web browsers, OpenSSH or Tor. But we should not give up yet. :)

The Landlock LSM offers a better approach for unprivileged sandboxing, although it can’t yet restrict as many operations as seccomp-bpf can. Landlock can solve the above problems, because:

Landlock filters security-sensitive operations at the point where these operations are carried out in the kernel, not at the system call layer. This makes it architecture-independent and removes the need to keep up-to-date lists of system calls.
Landlock can easily filter on file paths and other relevant in-memory properties that cannot be observed by seccomp-bpf at the system call interface.

If you want to try it out, Landlock is already enabled on some Linux distributions (e.g. Arch Linux). A simple call to Landlock (using the Go library) is:

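// import "github.com/landlock-lsm/go-landlock/landlock"
err := landlock.V1.BestEffort().RestrictPaths(
    landlock.RODirs("/usr", "/bin"),
    landlock.RWDirs("/tmp"),
)
if err != nil {
    log.Fatalf("could not enable Landlock: %v", err)
}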
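If you are unsure whether your kernel has Landlock enabled, you can ask it directly. The sketch below uses only the Go standard library and calls landlock_create_ruleset(2) with the version flag, which returns the supported Landlock ABI version. The syscall number 444 is hard-coded here and is what x86-64 and arm64 use; this is Linux-only, and the function name is my own.

```go
package main

import (
	"fmt"
	"syscall"
)

const (
	sysLandlockCreateRuleset     = 444    // landlock_create_ruleset on x86-64 and arm64
	landlockCreateRulesetVersion = 1 << 0 // LANDLOCK_CREATE_RULESET_VERSION
)

// landlockABI asks the running kernel for its Landlock ABI version.
// Typical errors are ENOSYS (kernel built without Landlock) and
// EOPNOTSUPP (Landlock built in but not enabled at boot).
func landlockABI() (int, error) {
	v, _, errno := syscall.Syscall(sysLandlockCreateRuleset, 0, 0, landlockCreateRulesetVersion)
	if errno != 0 {
		return 0, errno
	}
	return int(v), nil
}

func main() {
	v, err := landlockABI()
	if err != nil {
		fmt.Println("Landlock not available:", err)
		return
	}
	fmt.Println("Landlock ABI version:", v)
}
```

This query does not create a ruleset or restrict the process; it only reports what the kernel supports.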


Summary

As shown above, seccomp-bpf makes it more difficult than necessary to sandbox processes. It is available on a wide range of Linux distributions, but it’s currently not practical to use for the bulk of software linked to glibc, and it’s not possible to restrict operations by file path in BPF.

Landlock is not rolled out to all Linux distributions yet, and it still has some known gaps in its current version, but it has a significantly simpler API and a much simpler implementation in the kernel than what would be required in userspace to work around the problems of seccomp-bpf.

And simplicity is a great property for security features to have. I can wholeheartedly recommend having a look at it.
