Returning values from shell functions and dynamic scoping

When writing shell scripts, one of the things I used to find more problematic was how to return values from a function. If writing a function, I try to use local variables as far as possible, so I do not like the idea of using global variables to return values. There are a few well known options that can help with this, but none of them completely convinced me:

  • You can use the return code (f; $?) of the function, but you are limited to integers and to an 8-bit range for them. Also, using it conflicts with the philosophy of function return codes, which are expected to be zero for success and otherwise an error code. As an example of how this can be a problem, if you run your shell with sh -e, the shell will exit when a function returns a non-zero code.
  • You can take the function output to stdout as return value, by using command substitution (your_var=$(f)). The disadvantages are that (1) you need to make sure that the output from the function does not include anything undesired, so you end up redirecting output of the programs you call to /dev/null if you are not interested on it, and (2) that a sub-shell is created, which is costly and can be problematic too (for instance, if you want some side effect to still happen in the calling process).
  • You can pass the names of the output variables to the function and then write the return values in there using eval. An advantage of this approach compared to the other methods is that you can easily return more than one value to the caller. But, you are actually writing in variables unknown to the function and that are then of global scope, which is what I wanted to avoid… or maybe not?

I actually liked more the third approach, but the problem with some variables still being global did not satisfy me. Concretely, I was worried with the case of a function calling another function, like

#!/bin/dash -e

return_two() {
    local ret_var_1=$1
    local ret_var_2=$2
    eval "$ret_var_1"=one
    eval "$ret_var_2"=two
}

calls_return_two() {
    return_two ret1 ret2
    echo "$ret1" "$ret2"
}

ret1=oldvalue1
ret2=oldvalue2
calls_return_two
echo "$ret1" "$ret2"

Here return_two() indeed overwrites $ret1 and $ret2, so the output of the script is

one two
one two

But, it turns out there is a way to avoid this problem. I had unconsciously assumed that shell interpreters do, as most languages these days, static (also called lexical) scoping. So, when the interpreter tried to find $ret1 and $ret2, it would not find them in the local variables for return_two(), and it would overwrite/create them in the global scope. But that is not necessarily the case. This script:

#!/bin/dash -e

return_two() {
    local ret_var_1=$1
    local ret_var_2=$2
    eval "$ret_var_1"=one
    eval "$ret_var_2"=two
}

calls_return_two() {
    local ret1 ret2
    return_two ret1 ret2
    echo "$ret1" "$ret2"
}

ret1=oldvalue1
ret2=oldvalue2
calls_return_two
echo "$ret1" "$ret2"

Has as output:

one two
oldvalue1 oldvalue2

The $ret1 and $ret2 variables are not being overwritten by return_two(), because shells have dynamic scoping! That means that when the interpreter does not find a variable, it goes up in the stack until it finds a parent that owns a variable with that name. As the variable names that calls_return_two() provides to return_two() are local, those are the variables actually modified and not the global ones.

So, in the end, I found what I wanted: a way to return multiple variables from a shell function without polluting the global namespace. The solution is quite obvious once you find how dynamic scoping behaves, but I had the static scoping principles so deeply burned into my brain that it was a bit of a surprise to find this.

It is interesting how in this case dynamic scoping comes to the rescue, taking into account that is quite controversial, and for good reasons – it can make the code less readable as you need to take into account the dynamic behavior of the program to understand which variables are available at a given execution path. Think for instance on something like

#!/bin/dash -e

const=2

mult_by_my_const() {
    echo $((const * $1))
}

call_mult_by_my_const() {
    local const=4
    mult_by_my_const 2
}

call_mult_by_my_const

which ‘overrides’ with a local variable a global constant! (the output is indeed 8 and not 4). The eval command is also something that needs to be handled with care, there are a few reasons for that too. So, it is a bit ironic that I had to use them to be able to return values in a clean way. But, I also think that this idiom is relatively safe to use, or that at least it is better than not using it:

  • eval is used only to set the output variables for the function
  • Even though dynamic binding is used, in this case it is a sort of passing by reference method and should make the program easier to understand, not harder
  • All the information about the variable names is kept encapsulated in the caller function

Something to note is that I am assuming that the shell interpreter supports local variables. This is unfortunately not part of the POSIX standard, but it turns out that most of the interpreters around these days support it, including dash and bash. However, be aware of the different syntax in each case.

Anyway, I hope you enjoyed this small shell scripting pill!

Filter and Modify System Calls with seccomp and ptrace

In the conclusions to my last post, “Modifying System Call Arguments With ptrace”, I mentioned that one of the main drawbacks of the explained approach for modifying system call arguments was that there is a process switch for each system call performed by the tracee. I also suggested a possible approach to overcome that issue using ptrace jointly with seccomp, with the later making sure the tracer gets only the system calls we are interested in. In this post I develop this idea further and show how this can be achieved.

For this, I have created a little example that can be found in github, along the example used in the previous post. The main idea is to use seccomp with a Berkeley Packet Filter (BPF) that will specify the conditions under which the tracer gets interrupted.

Now we will go through the source code, with emphasis on the parts that differ from the original example. Skipping the include directives and the forward declarations we get to main():

int main(int argc, char **argv)
{
    pid_t pid;
    int status;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <prog> <arg1> ... <argN>\n", argv[0]);
        return 1;
    }

    if ((pid = fork()) == 0) {
        /* If open syscall, trace */
        struct sock_filter filter[] = {
            BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_open, 0, 1),
            BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRACE),
            BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .filter = filter,
            .len = (unsigned short) (sizeof(filter)/sizeof(filter[0])),
        };
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        /* To avoid the need for CAP_SYS_ADMIN */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) {
            perror("prctl(PR_SET_NO_NEW_PRIVS)");
            return 1;
        }
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == -1) {
            perror("when setting seccomp filter");
            return 1;
        }
        kill(getpid(), SIGSTOP);
        return execvp(argv[1], argv + 1);
    } else {
        waitpid(pid, &status, 0);
        ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACESECCOMP);
        process_signals(pid);
        return 0;
    }
}

The main change here when compared to the original code is the set-up of a BPF in the tracee, right after performing the call to fork(). BPFs have an intimidating syntax at first glance, but once you grasp the basic concepts behind they are actually quite easy to read. BPFs are defined as a sort of virtual machine (VM) which has one data register or accumulator, one index register, and an implicit program counter (PC). Its “assembly” instructions are defined as a structure with format:

struct sock_filter {
    u_short code;
    u_char  jt;
    u_char  jf;
    u_long k;
};

There are codes (opcodes) for loading into the accumulator, jumping, and so on. jt and jf are increments on the program counter that are used in jump instructions, while k is an auxiliary value which usage depends on the code number.

BPFs have an addressable space with data that is in the networking case a packet datagram, and for seccomp the following structure:

struct seccomp_data {
    int   nr;                   /* System call number */
    __u32 arch;                 /* AUDIT_ARCH_* value
                                   (see <linux/audit.h>) */
    __u64 instruction_pointer;  /* CPU instruction pointer */
    __u64 args[6];              /* Up to 6 system call arguments */
};

So basically what BPFs do in seccomp is to operate on this data, and return a value that tells the kernel what to do next: allow the process to perform the call (SECCOMP_RET_ALLOW), kill it (SECCOMP_RET_KILL), or other options as specified in the seccomp man page.

As can be seen, struct seccomp_data contains more than enough information for our purposes: we can filter based on the system call number and on the arguments.

With all this information we can look now at the filter definition. BPFs filters are defined as an array of sock_filter structures, where each entry is a BPF instruction. In our case we have

BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_open, 0, 1),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRACE),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),

BPF_STMT and BPF_JUMP are a couple of simple macros that fill the sock_filter structure. They differ in the arguments, which include jumping offsets in BPF_JUMP. The first argument is in both cases the “opcode”, which is built with macros as a mnemonics help: for instance the first one is for loading into the accumulator (BPF_LD) a word (BPF_W) using absolute addressing (BPF_ABS). More about this can be read here, for instance.

Analysing now in more detail the filter, the first instruction is asking the VM to load the call number, nr, to the accumulator. The second one compares that to the number for the open syscall, and asks the VM to not modify the counter if they are equal (PC+0), so the third instruction is run, or jump to PC+1 otherwise, which would be the 4th instruction (when executing this instruction the PC points already to the 3rd instruction). So if this is an open syscall we return SECCOMP_RET_TRACE, which will invoke the tracer, otherwise we return SECCOMP_RET_ALLOW, which will let the tracee run the syscall without further impediment.

Moving forward, the first call to prctl sets PR_SET_NO_NEW_PRIVS, which impedes child processes to have more privileges than those of the parent. This is needed to make the following call to prctl, which sets the seccomp filter using the PR_SET_SECCOMP option, succeed even when not being root. After that, we call execvp() as in the ptrace-only example.

Switching to what the parent does, we see that changes are very few. In main(), we set the PTRACE_O_TRACESECCOMP option, that makes the tracee stop when a filter returns SECCOMP_RET_TRACE and signals the event to the tracer. The other change in this function is that we do not need to set anymore PTRACE_O_TRACESYSGOOD, as we are being interrupted by seccomp, not because of system calls.

Moving now to the next function,

static void process_signals(pid_t child)
{
    const char *file_to_redirect = "ONE.txt";
    const char *file_to_avoid = "TWO.txt";

    while(1) {
        char orig_file[PATH_MAX];

        /* Wait for open syscall start */
        if (wait_for_open(child) != 0) break;

        /* Find out file and re-direct if it is the target */

        read_file(child, orig_file);
        printf("[Opening %s]\n", orig_file);

        if (strcmp(file_to_avoid, orig_file) == 0)
            redirect_file(child, file_to_redirect);
    }
}

we see here that now we invoke wait_for_open() only once. Differently to when we are tracing each syscall, which interrupted the tracer before and after the execution of the syscall, seccomp will interrupt us only before the call is processed. We also add here a trace for demonstration purposes.

After that, we have

static int wait_for_open(pid_t child)
{
    int status;

    while (1) {
        ptrace(PTRACE_CONT, child, 0, 0);
        waitpid(child, &status, 0);
        printf("[waitpid status: 0x%08x]\n", status);
        /* Is it our filter for the open syscall? */
        if (status >> 8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8)) &&
            ptrace(PTRACE_PEEKUSER, child,
                   sizeof(long)*ORIG_RAX, 0) == __NR_open)
            return 0;
        if (WIFEXITED(status))
            return 1;
    }
}

Here we use PTRACE_CONT instead of PTRACE_SYSCALL. We get interrupted every time there is a match in the BPF as we have set the PTRACE_O_TRACESECCOMP option, and we let the tracer run until that happens. The other change here, besides a trace, is how we check if we have received the event we are interested in, as obviously the status word is different. The details can be seen in ptrace’s man page. Note also that we could actually avoid the test for __NR_open as the BPF will interrupt us only for open syscalls.

The rest of the code, which is the part that actually changes the argument to the open syscall is exactly the same. Now, let’s check if this works as advertised:

$ git clone https://github.com/alfonsosanchezbeato/ptrace-redirect.git
$ cd ptrace-redirect/
$ cat ONE.txt 
This is ONE.txt
$ cat TWO.txt 
This is TWO.txt
$ gcc redir_filter.c -o redir_filter
$ ./redir_filter cat TWO.txt 
[waitpid status: 0x0000057f]
[waitpid status: 0x0007057f]
[Opening /etc/ld.so.cache]
[waitpid status: 0x0007057f]
[Opening /lib/x86_64-linux-gnu/libc.so.6]
[waitpid status: 0x0007057f]
[Opening /usr/lib/locale/locale-archive]
[waitpid status: 0x0007057f]
[Opening TWO.txt]
This is ONE.txt
[waitpid status: 0x00000000]

It does indeed! Note that traces show that the tracer gets interrupted only by the open syscall (besides an initial trap and when the child exits). If we added the same traces to the ptrace-only program we would see many more calls.

Finally, a word of caution regarding call numbers: in this post and in the previous one we are assuming an x86-64 architecture, so the programs would need to be adapted if we want to use it in different archs. There is also an important catch here: we are implicitly assuming that the child process that gets run by the execvp() call is also x86-64, as we are filtering by using the syscall number for that arch. This implies that this will not work in the case that the child program is compiled for i386. To make this example work properly also in that case, we must check the architecture in the BPF, by looking at “arch” in seccomp_data, and use the appropriate syscall number in each case. We would also need to check the arch before looking at the tracee registers, see an example on how to do this here (alternatively we could make the BPF return this data in the SECCOMP_RET_DATA bits of its return value, which can be retrieved by the tracer via PTRACE_GETEVENTMSG). Needless to say, for arm64/32 we would have similar issues.