The feasibility of pledge() on Linux

Or: Why my attempt to implement pledge() on Linux failed

So, there was a post by Justine Tunney about her port of OpenBSD’s pledge() to her own libc, the Cosmopolitan libc.

She also calls out previous attempts at this as flawed:

There’s been a few devs in the past who’ve tried this. I’m not going to name names, because most of these projects were never completed. […] The projects that got further along also had oversights like allowing the changing of setuid/setgid/sticky bits. So none of the current alternatives should be used.

My own seccomp-scopes project which I worked on from 2016 onwards is one of these attempts, so I feel I should explain the reasons why I stopped pursuing this approach of unprivileged sandboxing, and what I think needs to get done to do it right.

At a high level, the main problem is that seccomp-bpf filters at the level of system calls, and software libraries generally give no guarantees about which system calls they use under the hood. Not even libc implementations give such guarantees.

You can’t predict the syscalls a program will do

Here are some ways in which glibc makes it hard to predict which system calls it will do:

For example: If a program calls gethostbyname() for the first time, the following things happen:

There is additionally the problem that at the system call layer, DNS lookups are indistinguishable from other UDP socket operations, so allow-listing DNS will probably allow other UDP traffic as well.
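To make the UDP problem concrete, here is a minimal sketch in Go that models what a seccomp-bpf filter gets to see. The seccompData type mirrors the kernel's struct seccomp_data; the constants are the x86-64 Linux values; allowUDPSocket is a hypothetical rule I made up for illustration, not part of any real library. The sketch is simplified: a real resolver may also OR extra flags into the socket type.

```go
package main

import "fmt"

// Values for x86-64 Linux; other architectures number syscalls differently.
const (
	sysSocket = 41 // socket(2)
	afInet    = 2  // AF_INET
	sockDgram = 2  // SOCK_DGRAM
)

// seccompData mirrors the kernel's struct seccomp_data, which is all the
// information a seccomp-bpf filter can base its decision on.
type seccompData struct {
	Nr                 int32
	Arch               uint32
	InstructionPointer uint64
	Args               [6]uint64
}

// allowUDPSocket is a hypothetical rule intended to permit DNS lookups.
func allowUDPSocket(d seccompData) bool {
	return d.Nr == sysSocket && d.Args[0] == afInet && d.Args[1] == sockDgram
}

func main() {
	// socket(AF_INET, SOCK_DGRAM, 0) as issued for a DNS lookup ...
	dnsLookup := seccompData{Nr: sysSocket, Args: [6]uint64{afInet, sockDgram, 0}}
	// ... and the same call issued for any other UDP traffic.
	otherUDP := seccompData{Nr: sysSocket, Args: [6]uint64{afInet, sockDgram, 0}}

	fmt.Println("identical from the filter's view:", dnsLookup == otherUDP)
	fmt.Println("both allowed:", allowUDPSocket(dnsLookup) && allowUDPSocket(otherUDP))
}
```

The destination port does not help either: connect() and sendto() receive the address as a pointer to a sockaddr structure, so the filter only sees the pointer value, not port 53.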

So, to summarize: Attempting to implement a pledge()-like call with seccomp-bpf, independent of a specific libc, is an inherently brittle approach. It involves keeping up-to-date lists of system calls for different kernel versions and architectures, and tracking how different libcs use them. The complexity and feature-richness of glibc (particularly libnss) makes this especially difficult. Any libc-independent pledge() library would need to be updated in sync with glibc, or it would run the risk that glibc starts using a syscall that is not on the allow-list, breaking the programs that use it.
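The brittleness can be sketched as a simple set comparison. In the Go snippet below, the allow-list and the two per-release syscall lists are hypothetical and only there for illustration; missing is a helper name I made up. The point is that a list which was complete for one libc release can silently become incomplete for the next.

```go
package main

import (
	"fmt"
	"sort"
)

// missing returns the syscalls in used that are absent from allowed, sorted.
func missing(allowed map[string]bool, used []string) []string {
	var out []string
	for _, name := range used {
		if !allowed[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical allow-list shipped by a libc-independent pledge() library.
	allowed := map[string]bool{"open": true, "read": true, "write": true, "close": true}

	// Hypothetical syscalls issued by the same libc function in two releases.
	libcOld := []string{"open", "read", "close"}
	libcNew := []string{"openat", "read", "close"}

	fmt.Println(missing(allowed, libcOld)) // []
	fmt.Println(missing(allowed, libcNew)) // [openat]
}
```

With a real seccomp-bpf filter, the second case would not print a warning: depending on the filter's default action, the program would be killed or see an unexpected error at the first openat call.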

Justine Tunney’s pledge() implementation works around these problems by (a) only supporting her own, simpler, libc implementation, and (b) only supporting the x86-64 architecture. I’m really happy to see that this works well together, but I’m afraid it’s a mistake to think this implementation can be “ported” to glibc, which is used by the bulk of Linux distributions.

Restricting by file path

In OpenBSD, pledge() was always path-aware, until they moved that part into the separate unveil() call.

Seccomp-bpf can only filter syscalls by their direct arguments, so the filter can see the value of the pointer to the path name, but not the path name itself in the memory referenced by that pointer.
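As a rough illustration, the following Go sketch models two openat calls the way a filter would see them. The seccompData type mirrors the kernel's struct seccomp_data and 257 is the openat syscall number on x86-64; the addresses, the fakeMemory map and the filterAllows function are made up for this example.

```go
package main

import "fmt"

const sysOpenat = 257 // openat(2) on x86-64 Linux

// seccompData mirrors the kernel's struct seccomp_data.
type seccompData struct {
	Nr                 int32
	Arch               uint32
	InstructionPointer uint64
	Args               [6]uint64
}

// fakeMemory stands in for the calling process's address space, which a
// seccomp-bpf program cannot read. The addresses are made up.
var fakeMemory = map[uint64]string{
	0x7ffd1000: "/etc/passwd",
	0x7ffd2000: "/tmp/scratch",
}

// filterAllows is a hypothetical filter. It only receives d, so the best it
// can do is decide on the syscall number and the raw argument values.
func filterAllows(d seccompData) bool {
	return d.Nr == sysOpenat
}

func main() {
	// Args[1] is the pathname pointer; dirfd and flags are left at zero for brevity.
	calls := []seccompData{
		{Nr: sysOpenat, Args: [6]uint64{0, 0x7ffd1000}},
		{Nr: sysOpenat, Args: [6]uint64{0, 0x7ffd2000}},
	}
	for _, c := range calls {
		fmt.Printf("path=%q pointer=%#x allowed=%v\n",
			fakeMemory[c.Args[1]], c.Args[1], filterAllows(c))
	}
}
```

Both calls get the same verdict although they open very different files. Matching on the pointer value would not help, because the same path can live at any address.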

There are more advanced techniques to inspect pointer memory, but using them safely involves separate supervisor processes or more complicated, constrained ways of controlling what processes do. You need to take security very seriously to pull that off, and it no longer maps to a call to a single C function like pledge().

Landlock promises to fix this in the future

Unprivileged sandboxing continues to be difficult on Linux, for the moment, and it’s no surprise that the main users of seccomp-bpf are either dedicated sandboxing or containerization tools, or projects where security is a major focus, like web browsers, OpenSSH or Tor. But we should not give up yet. :)

The Landlock LSM offers a better approach for unprivileged sandboxing, although it can’t yet restrict as many operations as seccomp-bpf can. Landlock can solve the above problems, because:

If you want to try it out, Landlock is already enabled on some Linux distributions (e.g. Arch Linux). A simple call to Landlock (using the Go library) is:

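// Uses the package github.com/landlock-lsm/go-landlock/landlock.
// BestEffort() falls back to whatever the running kernel supports.
err := landlock.V1.BestEffort().RestrictPaths(
    landlock.RODirs("/usr", "/bin"), // read-only access
    landlock.RWDirs("/tmp"),         // read-write access
)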

Some further links:

Summary

As shown above, seccomp-bpf makes it more difficult than necessary to sandbox processes. It is available on a wide range of Linux distributions, but it’s currently not practical to use for the bulk of software linked to glibc, and it’s not possible to restrict operations by file path in BPF.

Landlock is not rolled out to all Linux distributions yet, and it still has some known gaps in its current version, but it has a significantly simpler API and a much simpler implementation in the kernel than what would be required in userspace to work around the problems of seccomp-bpf.

And simplicity is a great property for security features to have. I can wholeheartedly recommend having a look at it.
