C – Alfonso Sánchez-Beato's blog

Filter and Modify System Calls with seccomp and ptrace

In the conclusions to my last post, “Modifying System Call Arguments With ptrace”, I mentioned that one of the main drawbacks of the explained approach for modifying system call arguments was that there is a process switch for each system call performed by the tracee. I also suggested a possible approach to overcome that issue using ptrace jointly with seccomp, with the later making sure the tracer gets only the system calls we are interested in. In this post I develop this idea further and show how this can be achieved.

For this, I have created a little example that can be found in github, along the example used in the previous post. The main idea is to use seccomp with a Berkeley Packet Filter (BPF) that will specify the conditions under which the tracer gets interrupted.

Now we will go through the source code, with emphasis on the parts that differ from the original example. Skipping the include directives and the forward declarations we get to main():

int main(int argc, char **argv)
{
    pid_t pid;
    int status;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <prog> <arg1> ... <argN>\n", argv[0]);
        return 1;
    }

    if ((pid = fork()) == 0) {
        /* If open syscall, trace */
        struct sock_filter filter[] = {
            BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_open, 0, 1),
            BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRACE),
            BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .filter = filter,
            .len = (unsigned short) (sizeof(filter)/sizeof(filter[0])),
        };
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        /* To avoid the need for CAP_SYS_ADMIN */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) {
            perror("prctl(PR_SET_NO_NEW_PRIVS)");
            return 1;
        }
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == -1) {
            perror("when setting seccomp filter");
            return 1;
        }
        kill(getpid(), SIGSTOP);
        return execvp(argv[1], argv + 1);
    } else {
        waitpid(pid, &status, 0);
        ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACESECCOMP);
        process_signals(pid);
        return 0;
    }
}

The main change here when compared to the original code is the set-up of a BPF in the tracee, right after performing the call to fork(). BPFs have an intimidating syntax at first glance, but once you grasp the basic concepts behind they are actually quite easy to read. BPFs are defined as a sort of virtual machine (VM) which has one data register or accumulator, one index register, and an implicit program counter (PC). Its “assembly” instructions are defined as a structure with format:

struct sock_filter {
    u_short code;
    u_char  jt;
    u_char  jf;
    u_long k;
};

There are codes (opcodes) for loading into the accumulator, jumping, and so on. jt and jf are increments on the program counter that are used in jump instructions, while k is an auxiliary value which usage depends on the code number.

BPFs have an addressable space with data that is in the networking case a packet datagram, and for seccomp the following structure:

struct seccomp_data {
    int   nr;                   /* System call number */
    __u32 arch;                 /* AUDIT_ARCH_* value
                                   (see <linux/audit.h>) */
    __u64 instruction_pointer;  /* CPU instruction pointer */
    __u64 args[6];              /* Up to 6 system call arguments */
};

So basically what BPFs do in seccomp is to operate on this data, and return a value that tells the kernel what to do next: allow the process to perform the call (SECCOMP_RET_ALLOW), kill it (SECCOMP_RET_KILL), or other options as specified in the seccomp man page.

As can be seen, struct seccomp_data contains more than enough information for our purposes: we can filter based on the system call number and on the arguments.

With all this information we can look now at the filter definition. BPFs filters are defined as an array of sock_filter structures, where each entry is a BPF instruction. In our case we have

BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_open, 0, 1),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRACE),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),

BPF_STMT and BPF_JUMP are a couple of simple macros that fill the sock_filter structure. They differ in the arguments, which include jumping offsets in BPF_JUMP. The first argument is in both cases the “opcode”, which is built with macros as a mnemonics help: for instance the first one is for loading into the accumulator (BPF_LD) a word (BPF_W) using absolute addressing (BPF_ABS). More about this can be read here, for instance.

Analysing now in more detail the filter, the first instruction is asking the VM to load the call number, nr, to the accumulator. The second one compares that to the number for the open syscall, and asks the VM to not modify the counter if they are equal (PC+0), so the third instruction is run, or jump to PC+1 otherwise, which would be the 4th instruction (when executing this instruction the PC points already to the 3rd instruction). So if this is an open syscall we return SECCOMP_RET_TRACE, which will invoke the tracer, otherwise we return SECCOMP_RET_ALLOW, which will let the tracee run the syscall without further impediment.

Moving forward, the first call to prctl sets PR_SET_NO_NEW_PRIVS, which impedes child processes to have more privileges than those of the parent. This is needed to make the following call to prctl, which sets the seccomp filter using the PR_SET_SECCOMP option, succeed even when not being root. After that, we call execvp() as in the ptrace-only example.

Switching to what the parent does, we see that changes are very few. In main(), we set the PTRACE_O_TRACESECCOMP option, that makes the tracee stop when a filter returns SECCOMP_RET_TRACE and signals the event to the tracer. The other change in this function is that we do not need to set anymore PTRACE_O_TRACESYSGOOD, as we are being interrupted by seccomp, not because of system calls.

Moving now to the next function,

static void process_signals(pid_t child)
{
    const char *file_to_redirect = "ONE.txt";
    const char *file_to_avoid = "TWO.txt";

    while(1) {
        char orig_file[PATH_MAX];

        /* Wait for open syscall start */
        if (wait_for_open(child) != 0) break;

        /* Find out file and re-direct if it is the target */

        read_file(child, orig_file);
        printf("[Opening %s]\n", orig_file);

        if (strcmp(file_to_avoid, orig_file) == 0)
            redirect_file(child, file_to_redirect);
    }
}

we see here that now we invoke wait_for_open() only once. Differently to when we are tracing each syscall, which interrupted the tracer before and after the execution of the syscall, seccomp will interrupt us only before the call is processed. We also add here a trace for demonstration purposes.

After that, we have

static int wait_for_open(pid_t child)
{
    int status;

    while (1) {
        ptrace(PTRACE_CONT, child, 0, 0);
        waitpid(child, &status, 0);
        printf("[waitpid status: 0x%08x]\n", status);
        /* Is it our filter for the open syscall? */
        if (status >> 8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8)) &&
            ptrace(PTRACE_PEEKUSER, child,
                   sizeof(long)*ORIG_RAX, 0) == __NR_open)
            return 0;
        if (WIFEXITED(status))
            return 1;
    }
}

Here we use PTRACE_CONT instead of PTRACE_SYSCALL. We get interrupted every time there is a match in the BPF as we have set the PTRACE_O_TRACESECCOMP option, and we let the tracer run until that happens. The other change here, besides a trace, is how we check if we have received the event we are interested in, as obviously the status word is different. The details can be seen in ptrace’s man page. Note also that we could actually avoid the test for __NR_open as the BPF will interrupt us only for open syscalls.

The rest of the code, which is the part that actually changes the argument to the open syscall is exactly the same. Now, let’s check if this works as advertised:

$ git clone https://github.com/alfonsosanchezbeato/ptrace-redirect.git
$ cd ptrace-redirect/
$ cat ONE.txt 
This is ONE.txt
$ cat TWO.txt 
This is TWO.txt
$ gcc redir_filter.c -o redir_filter
$ ./redir_filter cat TWO.txt 
[waitpid status: 0x0000057f]
[waitpid status: 0x0007057f]
[Opening /etc/ld.so.cache]
[waitpid status: 0x0007057f]
[Opening /lib/x86_64-linux-gnu/libc.so.6]
[waitpid status: 0x0007057f]
[Opening /usr/lib/locale/locale-archive]
[waitpid status: 0x0007057f]
[Opening TWO.txt]
This is ONE.txt
[waitpid status: 0x00000000]

It does indeed! Note that traces show that the tracer gets interrupted only by the open syscall (besides an initial trap and when the child exits). If we added the same traces to the ptrace-only program we would see many more calls.

Finally, a word of caution regarding call numbers: in this post and in the previous one we are assuming an x86-64 architecture, so the programs would need to be adapted if we want to use it in different archs. There is also an important catch here: we are implicitly assuming that the child process that gets run by the execvp() call is also x86-64, as we are filtering by using the syscall number for that arch. This implies that this will not work in the case that the child program is compiled for i386. To make this example work properly also in that case, we must check the architecture in the BPF, by looking at “arch” in seccomp_data, and use the appropriate syscall number in each case. We would also need to check the arch before looking at the tracee registers, see an example on how to do this here (alternatively we could make the BPF return this data in the SECCOMP_RET_DATA bits of its return value, which can be retrieved by the tracer via PTRACE_GETEVENTMSG). Needless to say, for arm64/32 we would have similar issues.

Modifying System Call Arguments With ptrace

As part of the ongoing effort we are doing in Canonical to snappify the world, we are trying to make available more and more software as easily-installable, secure, snaps. One of the things snapd does to isolate applications is to install snaps in separate folders in /snap/<mysnap>, and also creating $HOME/snap/<mysnap> for storing the snap data. Unfortunately, this changes where many applications expects data files to be as per the usual Unix conventions.

Solving this usually implies maintaining patches that change the file paths. If the $SNAP environment variable is set, the patches use it to find files in the right locations. However, this is cumbersome as upstreams are not always willing to accept the patches, at least until snap package format gets more popular. Also, in some cases we might be interested in snappifying proprietary software, where patches are not an option.

To solve these problems, Michael Terry created a library that uses the LD_PRELOAD trick to intercept calls to glibc. Although the library works really well, it has the disadvantage of not working in all cases, like on programs that perform syscalls directly without using the C library, or on statically compiled binaries.

In this post, I have explored an alternative to using LD_PRELOAD that would solve these issues by going to a lower level and intercepting syscalls using the ptrace syscall.

ptrace is used by programs like gdb or strace for debugging purposes. It is a swiss knife tool that lets us access a process’ memory and registers (the tracee) from another process (the tracer). There are many good tutorials around that explain how to use it, like this or this (from which I have borrowed parts of the code), so here I will focus on how it can be used to modify arbitrary syscall arguments. The concrete problem I had at hand was how to change a call to, say, open() so it ends up opening a file in a different path to the one originally specified by the tracee.

I have developed a proof of concept for this that can be found in github. The code is specific to x86_64, although the concepts behind are applicable to other architectures. As a word of caution, I have preferred to not do as many checks as I should to have cleaner code, so please do not consider it production-ready. I will go now through the code, function by function. Skipping the include directives and the forward declarations we get to main():

int main(int argc, char **argv)
{
    pid_t pid;
    int status;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <prog> <arg1> ... <argN>\n", argv[0]);
        return 1;
    }

    if ((pid = fork()) == 0) {
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        kill(getpid(), SIGSTOP);
        return execvp(argv[1], argv + 1);
    } else {
        waitpid(pid, &amp;status, 0);
        ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACESYSGOOD);
        process_signals(pid);
        return 0;
    }
}

Here we do all the usual stuff needed when we want to trace a process from the start: we call fork(), then the child executes ptrace() with request PTRACE_TRACEME to indicate that it is willing to be traced. After that, it sends itself a SIGSTOP signal, which makes it stop (the execve call ¹ will be performed later). At that point, the parent process, which was waiting for a signal from the child, re-starts. The first thing it does is setting ptrace option PTRACE_O_TRACESYSGOOD, which makes the kernel set bit 7 in the signal numbers so we can easily distinguish between system call traps and normal traps. After that, it calls process_signal(), which is defined as

static void process_signals(pid_t child)
{
    const char *file_to_redirect = "ONE.txt";
    const char *file_to_avoid = "TWO.txt";
 
    while(1) {
        char orig_file[PATH_MAX];
 
        /* Wait for open syscall start */
        if (wait_for_open(child) != 0) break;
 
        /* Find out file and re-direct if it is the target */
 
        read_file(child, orig_file);
 
        if (strcmp(file_to_avoid, orig_file) == 0)
            redirect_file(child, file_to_redirect);
 
        /* Wait for open syscall exit */
        if (wait_for_open(child) != 0) break;
    }
}

This function is the main loop of the tracer. It waits for the open() syscall from the child to be started, and for it to exit (ptrace signals us both events) by calling wait_for_open(). When an open() call is detected, we read the file that the child wants to open in read_file(), and then we compare with string “TWO.txt”. If there is a match, we change the syscall arguments so “ONE.txt” is opened instead. Next we analyze the different functions that perform the low level stuff, and that contain architecture specific parts:

static int wait_for_open(pid_t child)
{
    int status;
 
    while (1) {
        ptrace(PTRACE_SYSCALL, child, 0, 0);
        waitpid(child, &status, 0);
        /* Is it the open syscall (sycall number 2 in x86_64)? */
        if (WIFSTOPPED(status) && WSTOPSIG(status) & 0x80 &&
            ptrace(PTRACE_PEEKUSER, child, sizeof(long)*ORIG_RAX, 0) == 2)
            return 0;
        if (WIFEXITED(status))
            return 1;
    }
}

wait_for_open() is executed until an open() system call is detected. By calling ptrace with argument PTRACE_SYSCALL, we let the child continue until the next signal or syscall enter/exit. The first time this happens the child, which was stopped after sending itself SIGSTOP, continues its execution and calls execve(). The parent then waits for signals from the child. If the child has stopped due to the signal, the signal number has the 7th bit set (should happen if the signal was triggered due to a syscall as we have set PTRACE_O_TRACESYSGOOD option), and it is the open() syscall (system call number 2 for x86_64), then we return with status 0. If the child has actually exited, the return value is 1. If nothing of this happens, we wait for the next signal. Here we are using PTRACE_PEEKUSER request, which lets us access the tracee user area. This area contains an array with the general purpose registers, and we use offsets defined in <sys/reg.h> to access them. When performing a syscall, RAX register contains the syscall number. However, we use ORIG_RAX offset to grab that number instead of the also existing RAX offset. We do this because RAX is also used to store the return value for the syscall, so the kernel stores the syscall number with offset ORIG_RAX and the return value with offset RAX. This needs to be taking into account especially when processing the exit of the system call. More information can be found here.

Next function is

static void read_file(pid_t child, char *file)
{
    char *child_addr;
    int i;
 
    child_addr = (char *) ptrace(PTRACE_PEEKUSER, child, sizeof(long)*RDI, 0);
 
    do {
        long val;
        char *p;
 
        val = ptrace(PTRACE_PEEKTEXT, child, child_addr, NULL);
        if (val == -1) {
            fprintf(stderr, "PTRACE_PEEKTEXT error: %s", strerror(errno));
            exit(1);
        }
        child_addr += sizeof (long);
 
        p = (char *) &val;
        for (i = 0; i < sizeof (long); ++i, ++file) {
            *file = *p++;
            if (*file == '\0') break;
        }
    } while (i == sizeof (long));
}

The read_file() function uses PTRACE_PEEKUSER as in the previous function to retrieve the first argument to the call function, which is the address of the string with the file name. This parameter is stored in the RDI register. Then it uses PTRACE_PEEKTEXT ptrace request, which lets us copy over data from the traced process’ memory. This is performed by words, so we do this in a loop until we find a null byte that indicates the end of the string.

The last function is

static void redirect_file(pid_t child, const char *file)
{
    char *stack_addr, *file_addr;
 
    stack_addr = (char *) ptrace(PTRACE_PEEKUSER, child, sizeof(long)*RSP, 0);
    /* Move further of red zone and make sure we have space for the file name */
    stack_addr -= 128 + PATH_MAX;
    file_addr = stack_addr;
 
    /* Write new file in lower part of the stack */
    do {
        int i;
        char val[sizeof (long)];
 
        for (i = 0; i < sizeof (long); ++i, ++file) {
            val[i] = *file;
            if (*file == '\0') break;
        }
 
        ptrace(PTRACE_POKETEXT, child, stack_addr, *(long *) val);
        stack_addr += sizeof (long);
    } while (*file);
 
    /* Change argument to open */
    ptrace(PTRACE_POKEUSER, child, sizeof(long)*RDI, file_addr);
}

The redirect_file() function is the most interesting one of this program. It modifies the argument to the open system call, forcing the child to open a different file to the one it originally specified. The main problem with this is that we need to modify the child’s memory space so the new file name is used by the kernel. We can change the tracee’s memory easily by using PTRACE_POKETEXT, the issue is, where can we store it?

A possible option is to save the original string using PTRACE_PEEKTEXT, then overwrite it by using PTRACE_POKETEXT. When we get called after the sycall exits, we copy back the original data. This can work fine in some cases, but it can be problematic if the new file name is longer than the original. We could be overwriting data that is used by other threads, which are not necessarily stopped so they could access that data while the kernel is processing the call. Or that data we are overwriting could be part of another parameter to the syscall, which would not happen for open(), but it is possible for other syscalls like link(). Finally, there is also the possibility that the string we are trying to modify is in a read only segment. Therefore, this is not a safe option.

After noticing this, I considered the option of adding a read-write segment to the binary under study, or to resize an existing one. However, I found there are not that many tools to do this, and those that apparently could do the job like ERESI, were not very intuitive ². Also, we would need to find out where the new segment gets loaded to know where to write, which would complicate the code. Furthermore, I wanted to avoid modifying the binary if possible.

Finally, I concluded that the stack was exactly what I needed: it is of course RW, and a reference address can be found by simply looking at the RSP register. What we have to do is make sure we write in a safe part of the stack. This can be performed by writing to addresses lower than the RSP (that is, the free part of the stack). To achieve this, we “reserve” stack memory so we can write a string of up to PATH_MAX length, and we add up 128 bytes for the red zone (this size is specified by the x86_64 ABI). Note also that syscalls do not write on the process stack: one of the very first things that is done by the Linux kernel syscalls entry point is to switch RSP to a kernel stack. This approach has also the advantage of being automatically thread-friendly, as each one has its own stack. On the other hand, there is the possibility of writing outside the stack. However that risk is quite small nowadays, as stacks of user space programs tend to be big and typically auto-grow on page fault. Another advantage is that we do not need to save and recover memory areas at enter/exit of the syscall, as the tracee should not write anything in the used memory area.

Once it is decided where to write, the implementation is straightforward: first we use PTRACE_PEEKUSER to get the RSP value of the tracee. Then, we write the new file name to a pointer lower than the RSP calculated as explained in the previous paragraph. The data is written by using PTRACE_POKETEXT, word by word. Finally, we change the child’s RDI register so it points to the new address.

Now we can give the program a try. I created a couple of files with content:

$ cat ONE.txt 
This is ONE.txt
$ cat TWO.txt 
This is TWO.txt

Executing the same cat command using redirect we have:

$ gcc redirect.c -o redirect
$ ./redirect cat ONE.txt 
This is ONE.txt
$ ./redirect cat TWO.txt 
This is ONE.txt

Things work as publicized: we modify the file opened by the cat in case it tries to show the content of “TWO.txt”.

Conclusions

As has been seen, the code to make this is remarkably small, which shows the power of the ptrace call. There are indeed parts of this that are very architecture specific, but that is mostly the name of the registers and maybe the red zone size, so it should be relatively straightforward to make it multi-architecture by adding some macros.

Another appreciation is that the example is for the open() syscall, but this technique can be applied to arbitrary arguments which are passed to any syscall as pointers to data in the traced process.

To finish, the main drawback for this solution is performance, as we have to stop (twice) for each syscall invoked by the child, with all the context switches that implies. A possible solution would be to use ptrace in combination with seccomp and seccomp Berkeley Packet Filters, which apparently make possible for the tracer to specify the syscalls that would provoke a trap. That would be, however, matter for another post.