## Introduction

Unix processes have *file descriptors* which point to *file descriptions* (`struct file` in Linux).  Multiple file descriptors can point to the same file description, for instance by duplicating them with [*dup*(2)], or by passing them across process boundaries using [*fork*(2)] or UNIX Domain Sockets ([*unix*(7)]).

```pikchr
boxht = 0.5*boxht
fill = PapayaWhip

down
P10: box "0"
box "1"
box "2"
P13: box "3"

P20: box with nw at 5cm right of first box.ne "0"
box "1"
box "2"
P23: box "3"

text "Process 1:" italic ljust with s at P10.nw
text "Process 2:" italic ljust with s at P20.nw

F: box at 1/2<P13,P23> "file" "description" ht boxht*1.5
   box "f_pos"

arrow from P13.e to F.w dotted
arrow from P23.w to F.e dotted
```

For a long time, I was under the impression that the same thing happened behind the scenes when opening `/dev/fd/${FD}` (a.k.a. `/proc/${PID}/fd/${FD}`) on Linux.  I thought I would get a new file descriptor pointing to the same file descrip*tion*, just as if I had called `dup(fd)`.  **This is wrong!**

### This feature is mis-documented

The misunderstanding is even documented in my earlier edition of
"[The Linux Programming Interface](https://man7.org/tlpi/)" (section 5.11)
([but it has been fixed in newer editions](https://man7.org/tlpi/errata/index.html#p_107),
as Michael Kerrisk points out in the comments below):

> Opening one of the files in the /dev/fd directory is equivalent to duplicating the corresponding file descriptor.  Thus, the following statements are equivalent:
>
> ```bad
> fd = open("/dev/fd/1", O_WRONLY);
> fd = dup(1);
> ```

This is a reasonably simple explanation which is close enough to
reality for many practical use cases, and which is true on other
Unixes, but it is not fully accurate on Linux.  (The book is very
comprehensive and useful nevertheless.)

[This Red Hat bug from
2000](https://bugzilla.redhat.com/show_bug.cgi?id=10417) discusses how
that behavior was apparently changed in Linux 1.3.34.  The
aforementioned equivalence between the [*open*(2)] and [*dup*(2)] calls is
called the "Plan9 semantics" there.

[*proc_pid_fd*(5)] gives usage examples, but does not go into a lot of
detail on the exact semantics in the case of [*open*(2)].

### `/dev/fd/*` behaves differently on other Unixes

On top of that, other Unixes implement `/dev/fd/*` differently.

From a FreeBSD 14 box:

```
$ ./dup -dup > out; cat out; echo
1d
$ ./dup -proc > out; cat out; echo
1d
```

On FreeBSD, the result of `open("/dev/fd/1", O_WRONLY);` *does* share the same file descrip*tion* with the original file descriptor, as if we had called `dup(1)`.

## Part 1: An experiment!

It turns out, opening `/dev/fd/*`, `/proc/${PID}/fd/*` or `/proc/self/fd/*` ([*proc_pid_fd*(5)]) results in a *separate* file descrip*tion* (`struct file`) being allocated for you, but it refers to the same underlying file on disk.

You can try it out with the following program:

```
$ cat dup.c
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int usage(const char *name) {
  fprintf(stderr, "Usage: %s [-dup|-proc]\n", name);
  return 1;  // report the usage error through the exit status
}

int main(int argc, char *argv[]) {
  int fd;

  if (argc != 2) {
    return usage(argv[0]);
  }

  if (!strcmp(argv[1], "-dup")) {
    fd = dup(1);  // stdout
    if (fd < 0) {
      err(1, "dup");
    }
  } else if (!strcmp(argv[1], "-proc")) {
    fd = open("/dev/fd/1", O_WRONLY);
    if (fd < 0) {
      err(1, "open /dev/fd/1");
    }
  } else {
    return usage(argv[0]);
  }

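  write(1, "1", 1);   // advances the file position behind stdout

  write(fd, "d", 1);  // lands after the "1" only if fd shares that position
  close(fd);
  return 0;
}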
```

When we build and run this program, we can see that the behavior of [*dup*(2)] and [*open*(2)] is actually different!

### Duplicating the file descriptor using [*dup*(2)]

```
$ make dup
cc -Wall -static    dup.c   -o dup
$ ./dup -dup > out; cat out; echo
1d
$
```

In the [*dup*(2)] case, the `struct file` is actually shared -- both file descriptors refer to the exact same file descrip*tion*.  The first [*write*(2)] updates the file description's file position (`f_pos`).  The second [*write*(2)] uses the exact same file description, so it sees the updated file position, and the byte gets written *after* the one that was written before.

```pikchr
boxht = 0.5*boxht
fill = PapayaWhip

down
P10: box "0"
P11: box "1"
box "2"
P13: box "3"

text "'dup' process:" italic ljust with s at P10.nw

F: box with n at 2.5cm right of P11.e "file" "description" ht boxht*1.5
   box "f_pos"

arrow from P11.e to F.w dotted
arrow from P13.e to 1/2<F.w,F.sw> dotted color red "dup(2)" aligned below
```

### Duplicating the file descriptor through `/proc`

```
$ ./dup -proc > out; cat out; echo
d
$
```

In the [*proc_pid_fd*(5)] case, the output file ends up containing only one byte.
There are *two* `struct file`s involved here (the one behind stdout and the one created by [*open*(2)]),
and they use independent positions `f_pos` in the file, which are both set to 0 initially.

* The first [*write*(2)] through stdout (fd 1) updates that description's file position from 0 to 1.
* The second [*write*(2)] uses a separate file description, whose position is still 0,
  and *overwrites* the byte that was previously written.

That's why we can only see "`d`" in the output.

```pikchr
boxht = 0.5*boxht
fill = PapayaWhip

down
P10: box "0"
P11: box "1"
box "2"
P13: box "3"

text "'dup' process:" italic ljust with s at P10.nw

F1: box at P11.e +(2.5cm, 0.5cm) "file" "description" ht boxht*1.5
    box "f_pos"

F2: box at 2.5cm right of P13.e "file" "description" ht boxht*1.5
    box "f_pos"

arrow from P11.e to F1.w dotted
arrow from P13.e to F2.w dotted color red "open(2)" aligned below
```

### Other file types

So far, this is a bit confusing.  It's definitely inconsistent with
the theory that opening `/dev/fd/*` does the same as [*dup*(2)].  But
what happens for file types other than regular files?

### TCP Sockets: Cannot be reopened through `/proc`

You can try this out by redirecting stdout to a socket, using the
obscure `/dev/tcp` extension in bash[^1]:

```
$ nc -l 9999 &
[1] 4166
$ ./dup -proc >/dev/tcp/localhost/9999
dup: open /dev/fd/1: No such device or address
[1]+  Done                    nc -l 9999
$ 
```

The error here is `ENXIO: No such device or address`.

For sockets, the `/proc/self/fd/*` entry is a symlink to a name like `socket:[16902]`.

```
lrwx------ 1 gnoack gnoack 64 Feb 17 23:12 1 -> 'socket:[16902]'
```
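
The same error can be reproduced without `nc` and bash. The sketch below makes one simplifying assumption: it uses an unconnected `AF_UNIX` socket instead of a TCP connection, which avoids needing a listener. The helper name `reopen_socket_errno` is made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Tries to reopen a socket through /proc/self/fd.
 * Returns the errno of the failed open(), 0 if the open unexpectedly
 * succeeded, or -1 if the socket could not be created.
 * (Hypothetical helper, for illustration only.) */
int reopen_socket_errno(void) {
  char link[64];
  int sock = socket(AF_UNIX, SOCK_STREAM, 0);  /* assumption: not TCP */
  if (sock < 0) {
    return -1;
  }
  snprintf(link, sizeof(link), "/proc/self/fd/%d", sock);

  int fd = open(link, O_WRONLY);
  int result = (fd < 0) ? errno : 0;  /* save errno before close() */
  if (fd >= 0) {
    close(fd);
  }
  close(sock);
  return result;
}
```

On Linux, `assert(reopen_socket_errno() == ENXIO);` should hold, matching the "No such device or address" message printed by `err()` above.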

### Pipes: *Can* be reopened through `/proc`

However, a pipe **can** be reopened through `/dev/fd/1`, for example like this:

```
$ ./dup -proc | cat ; echo
1d
```

...and this works even though the pipe's symlink looks like this:

```
l-wx------ 1 gnoack gnoack 64 Feb 17 23:10 1 -> 'pipe:[15895]'
```
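
To see this from within a single process, the write end of an anonymous pipe can be reopened and used. This is a minimal sketch assuming Linux with `/proc` mounted; the helper name `byte_through_reopened_pipe` is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Reopens the write end of an anonymous pipe through /proc/self/fd,
 * writes one byte through the reopened descriptor, and returns the byte
 * that arrives at the read end.  Returns -1 on error.
 * (Hypothetical helper, for illustration only.) */
int byte_through_reopened_pipe(char c) {
  int p[2];
  char link[64];
  char got;
  int result = -1;

  if (pipe(p) < 0) {
    return -1;
  }
  snprintf(link, sizeof(link), "/proc/self/fd/%d", p[1]);

  int w = open(link, O_WRONLY);  /* succeeds for pipes, unlike sockets */
  if (w >= 0) {
    if (write(w, &c, 1) == 1 && read(p[0], &got, 1) == 1) {
      result = (unsigned char)got;
    }
    close(w);
  }

  close(p[0]);
  close(p[1]);
  return result;
}
```

Here `assert(byte_through_reopened_pipe('d') == 'd');` should hold on Linux: the reopened descriptor is a new file description, but it feeds the same pipe.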

## Part 2: What is really happening

First, let's recall the in-kernel VFS structure:

```pikchr
boxht = 0.5*boxht

box "Process" fill lightblue
arrow "fd" above
FO: box "file object" rad 10px fill PapayaWhip
arrow "f_path" above
Path: box "path" fill LemonChiffon
arrow "dentry" above
box "dentry" fill powderblue
down
arrow from last box.s "d_inode" aligned above
Inode: box "inode object" rad 10px fill PapayaWhip
left
arrow from last box.w "i_sb" above
box "Superblock" "object" fill LightSalmon ht boxht*1.5
arrow dotted
arrow from Inode.s \
  down boxht \
  then left until even with last arrow.end \
  dotted
cylinder "Disk" "file" with e at 1/2<last arrow.end, 2nd last arrow.end> ht boxht*3 fill PaleVioletRed

move to Path.n
up
arrow "mnt" above aligned
box "struct" "vfsmount" fill gold ht boxht*1.5

up
move to FO.n
text "This is the" "file description" italic
```

The following things happen in sequence:

* A user space process calls `open("/proc/self/fd/1")`
* System call handler:
  * parses flags
  * does the path walk, which eventually invokes `proc_pid_get_link()`:
    * `fs/proc/base.c:proc_pid_get_link()`:
       * invokes `proc_fd_link()` through a callback
         * `fs/proc/fd.c:proc_fd_link()`: looks up the original `struct file*` from the target task and **returns the `->f_path`** that existed on that `struct file` (through an output pointer argument).
       * invokes `nd_jump_link()`, which **sets the result of the path walk in `nameidata` to the previously set path!**
  * eventually calls `path_openat()`.
    * `fs/namei.c:path_openat()`: **Always** allocates a new `struct file` with `alloc_empty_file()`
    * `fs/namei.c:do_open()`: calls `vfs_open()`, which in turn calls `do_dentry_open()`
    * `fs/open.c:do_dentry_open()`:
      * first initializes the file ops from the inode: `f->f_op = fops_get(inode->i_fop)`
      * then calls the "open" file operation: `f->f_op->open`
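
One user-visible consequence of the sequence above is that the `/proc/self/fd/*` entry is not resolved by name like an ordinary symlink: the path walk jumps straight to the `f_path` of the existing `struct file`. A rough way to observe this from user space is to reopen a file whose name has already been unlinked, and compare the inodes with `fstat()`. This sketch assumes Linux with `/proc` mounted; the helper name `reopen_unlinked_same_inode` and the scratch file template are made up.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Unlinks a freshly created file, then reopens it through /proc/self/fd.
 * Returns 1 if the reopen succeeds and both descriptors refer to the
 * same inode, 0 if not, and -1 on setup errors.
 * (Hypothetical helper, for illustration only.) */
int reopen_unlinked_same_inode(void) {
  char tmpl[] = "/tmp/jump-demo-XXXXXX";  /* assumed scratch location */
  char link[64];
  struct stat a, b;

  int fd = mkstemp(tmpl);
  if (fd < 0) {
    return -1;
  }
  if (unlink(tmpl) < 0) {  /* the name is gone, the inode is not */
    close(fd);
    return -1;
  }
  snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);

  int same = 0;
  int reopened = open(link, O_RDONLY);  /* no usable path name exists */
  if (reopened >= 0) {
    if (fstat(fd, &a) == 0 && fstat(reopened, &b) == 0) {
      same = (a.st_dev == b.st_dev && a.st_ino == b.st_ino);
    }
    close(reopened);
  }

  close(fd);
  return same;
}
```

On Linux, `assert(reopen_unlinked_same_inode() == 1);` is expected to hold: the second descriptor belongs to a new file description, but it refers to the same inode as the first one.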

### Where did the `no_open` pointer come from?

For the TCP socket above, `f->f_op->open` is set to the `no_open` function, which unconditionally returns `ENXIO`.  So that socket can't be reopened through `/proc`.

![](https://blog.gnoack.org/images/perf-no_open.png)

The decision about which `f_op->open` is used for each file is made in `inode.c:init_special_inode`, for sockets and pipes.

## Summary

* Every call to [*open*(2)] results in a new `struct file*` being allocated.
* The resulting `struct file*` refers to an existing inode, even for special files like pipes.
* Not all of the special files support this kind of re-opening.

[^1]: These `/dev/tcp/...` files do not actually exist: bash treats these
    paths specially and really just calls the BSD socket API
    itself... but we can use it here to write directly into a socket.

[*proc_pid_fd*(5)]: https://man.gnoack.org/5/proc_pid_fd
[*open*(2)]: https://man.gnoack.org/2/open
[*write*(2)]: https://man.gnoack.org/2/write
[*dup*(2)]: https://man.gnoack.org/2/dup
[*fork*(2)]: https://man.gnoack.org/2/fork
[*unix*(7)]: https://man.gnoack.org/7/unix
