Porting Ubuntu Core 18 to nvidia Jetson TX1 Developer Kit

Ubuntu Core (UC) is Canonical’s take in the IoT space. There are pre-built images for officially supported devices, like Raspberry Pi or Intel NUCs, but if we have something else and there is no community port, we need to create the UC image ourselves. High level instructions on how to do this are found in the official docs. The process is straightforward once we have two critical components: the kernel and the gadget snap.

Creating these snaps is not necessarily complex, but there can be bumps in the road if you are new to the task. In this post I explain how I created them for the Jetson TX1 developer kit board, and how they were used to create a UC image for said device, hoping this will provide new tricks to hackers working on ports for other devices. All the sources for the snaps and the build scripts are available in github:
https://github.com/alfonsosanchezbeato/jetson-kernel-snap
https://github.com/alfonsosanchezbeato/jetson-gadget-snap
https://github.com/alfonsosanchezbeato/jetson-ubuntu-core

So, let’s start with…

The kernel snap

The Linux kernel that we will use needs some kernel configuration options to be activated, and it is also especially important that it has a modern version of apparmor so snaps can be properly confined. The official Jetson kernel is the 4.4 release, which is quite old, but fortunately Canonical has a reference 4.4 kernel with all the needed patches for snaps backported. Knowing this, we are a git format-patch command away to obtain the patches we will use on top of the nvidia kernel. The patches include also files with the configuration options that we need for snaps, plus some changes so the snap could be successfully compiled on Ubuntu 18.04 desktop.

Once we have the sources, we need, of course, to create a snapcraft.yaml file that will describe how to build the kernel snap. We will walk through it, highlighting the parts more specific to the Jetson device.

Starting with the kernel part, it turns out that we cannot use easily the kernel plugin, due to the special way in which the kernel needs to be built: nvidia distributes part of the needed drivers as separate repositories to the one used by the main kernel tree. Therefore, I resorted to using the nil plugin so I could hand-write the commands to do the build.

The pull stage that resulted is

override-pull: |
  snapcraftctl pull
  # Get kernel sources, which are distributed across different repos
  ./source_sync.sh -k tegra-l4t-r28.2.1
  # Apply canonical patches - apparmor stuff essentially
  cd sources/kernel/display
  git am ../../../patch-display/*
  cd -
  cd sources/kernel/kernel-4.4
  git am ../../../patch/*

which runs a script to retrieve the sources (I pulled this script from nvidia Linux for Tegra -L4T- distribution) and applies Canonical patches.

The build stage is a few more lines, so I decided to use an external script to implement it. We will analyze now parts of it. For the kernel configuration we add all the necessary Ubuntu bits:

make "$JETSON_KERNEL_CONFIG" \
    snappy/containers.config \
    snappy/generic.config \
    snappy/security.config \
    snappy/snappy.config \
    snappy/systemd.config

Then, to do the build we run

make -j"$num_cpu" Image modules dtbs

An interesting catch here is that zImage files are not supported due to lack of a decompressor implementation in the arm64 kernel. So we have to build an uncompressed Image instead.

After some code that stages the built files so they are included in the snap later, we retrieve the initramfs from the core snap. This step is usually hidden from us by the kernel plugin, but this time we have to code it ourselves:

# Get initramfs from core snap, which we need to download
core_url=$(curl -s -H "X-Ubuntu-Series: 16" -H "X-Ubuntu-Architecture: arm64" \
                "https://search.apps.ubuntu.com/api/v1/snaps/details/core?channel=stable" \
               | jq -r ".anon_download_url")
curl -L "$core_url" > core.snap
# Glob so we get both link and regular file
unsquashfs core.snap "boot/initrd.img-core*"
cp squashfs-root/boot/initrd.img-core "$SNAPCRAFT_PART_INSTALL"/initrd.img
ln "$SNAPCRAFT_PART_INSTALL"/initrd.img "$SNAPCRAFT_PART_INSTALL"/initrd-"$KERNEL_RELEASE".img

Moving back to the snapcraft recipe we also have an initramfs part, which takes care of doing some changes to the default initramfs shipped by UC:

initramfs:
  after: [ kernel ]
  plugin: nil
  source: ../initramfs
  override-build: |
    find . | cpio --quiet -o -H newc | lzma >> "$SNAPCRAFT_STAGE"/initrd.img

Here we are taking advantage of the fact that the initramfs can be built as a concatenation of compressed cpio archives. When the kernel decompresses it, the files included in the later archives overwrite the files from the first ones, which allows us to modify easily files in the initramfs without having to change the one shipped with core. The change that we are doing here is a modification to the resize script that allows UC to get all the free space in the disk on first boot. The modification makes sure this happens in the case when the partition is already taken all available space but the filesystem does not. We could remove this modification when these changes reach the core snap, thing that will happen eventually.

The last part of this snap is the firmware part:

firmware:
  plugin: nil
  override-build: |
    set -xe
    wget https://developer.nvidia.com/embedded/dlc/l4t-jetson-tx1-driver-package-28-2-ga -O Tegra210_Linux_R28.2.0_aarch64.tbz2
    tar xf Tegra210_Linux_R28.2.0_aarch64.tbz2 Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2
    tar xf Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2 lib/firmware/
    cd lib; cp -r firmware/ "$SNAPCRAFT_PART_INSTALL"
    mkdir -p "$SNAPCRAFT_PART_INSTALL"/firmware/gm20b
    cd "$SNAPCRAFT_PART_INSTALL"/firmware/gm20b
    ln -sf "../tegra21x/acr_ucode.bin" "acr_ucode.bin"
    ln -sf "../tegra21x/gpmu_ucode.bin" "gpmu_ucode.bin"
    ln -sf "../tegra21x/gpmu_ucode_desc.bin" "gpmu_ucode_desc.bin"
    ln -sf "../tegra21x/gpmu_ucode_image.bin" "gpmu_ucode_image.bin"
    ln -sf "../tegra21x/gpu2cde.bin" "gpu2cde.bin"
    ln -sf "../tegra21x/NETB_img.bin" "NETB_img.bin"
    ln -sf "../tegra21x/fecs_sig.bin" "fecs_sig.bin"
    ln -sf "../tegra21x/pmu_sig.bin" "pmu_sig.bin"
    ln -sf "../tegra21x/pmu_bl.bin" "pmu_bl.bin"
    ln -sf "../tegra21x/fecs.bin" "fecs.bin"
    ln -sf "../tegra21x/gpccs.bin" "gpccs.bin"

Here we download some files so we can add firmware blobs to the snap. These files come separate from nvidia kernel sources.

So this is it for the kernel snap, now you just need to follow the instructions to get it built.

The gadget snap

Time now to take a look at the gadget snap. First, I recommend to start by reading great ogra’s post on gadget snaps for devices with u-boot bootloader before going through this section. Now, same as for the kernel one, we will go through the different parts that are defined in the snapcraft.yaml file. The first one builds the u-boot binary:

uboot:
  plugin: nil
  source: git://nv-tegra.nvidia.com/3rdparty/u-boot.git
  source-type: git
  source-tag: tegra-l4t-r28.2
  override-pull: |
    snapcraftctl pull
    # Apply UC patches + bug fixes
    git am ../../../uboot-patch/*.patch
  override-build: |
    export ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
    make p2371-2180_defconfig
    nice make -j$(nproc)
    cp "$SNAPCRAFT_PART_BUILD"/u-boot.bin $SNAPCRAFT_PART_INSTALL"/

We decided again to use the nil plugin as we need to do some special quirks. The sources are pulled from nvidia’s u-boot repository, but we apply some patches on top. These patches, along with the uboot environment, provide

  • Support for loading the UC kernel and initramfs from disk
  • Support for the revert functionality in case a core or kernel snap installation goes wrong
  • Bug fixes for u-boot’s ext4 subsystem – required because the just mentioned revert functionality needs to call u-boot’s command saveenv, which happened to be broken for ext4 filesystems in tegra’s u-boot

More information on the specifics of u-boot patches for UC can be found in this great blog post.

The only other part that the snap has is uboot-env:

uboot-env:
  plugin: nil
  source: uboot-env
  override-build: |
    mkenvimage -r -s 131072 -o uboot.env uboot.env.in
    cp "$SNAPCRAFT_PART_BUILD"/uboot.env "$SNAPCRAFT_PART_INSTALL"/
    # Link needed for ubuntu-image to work properly
    cd "$SNAPCRAFT_PART_INSTALL"/; ln -s uboot.env uboot.conf
  build-packages:
    - u-boot-tools

This simply encodes the uboot.env.in file into a format that is readable by u-boot. The resulting file, uboot.env, is included in the snap.

This environment is where most of the support for UC is encoded. I will not delve too much into the details, but just want to mention that the variables that need to be edited usually for new devices are

  • devnum, partition, and devtype to set the system boot partition, from which we load the kernel and initramfs
  • fdtfile, fdt_addr_r, and fdt_high to determine the name of the device tree and where in memory it should be loaded
  • ramdisk_addr_r and initrd_high to set the loading location for the initramfs
  • kernel_addr_r to set where the kernel needs to be loaded
  • args contains kernel arguments and needs to be adapted to the device specifics
  • Finally, for this device, snappy_boot was changed so it used booti instead of bootz, as we could not use a compressed kernel as explained above

Besides the snapcraft recipe, the other mandatory file when defining a gadget snap is the gadget.yaml file. This file defines, among other things, the image partitioning layout. There is more to it, but in this case, partitioning is the only thing we have defined:

volumes:
  jetson:
    bootloader: u-boot
    schema: gpt
    structure:
      - name: system-boot
        role: system-boot
        type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
        filesystem: ext4
        filesystem-label: system-boot
        offset: 17408
        size: 67108864
      - name: TBC
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: EBT
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: BPF
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: WB0
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: RP1
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: TOS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: EKS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: FX
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: BMP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 134217728
      - name: SOS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 20971520
      - name: EXI
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 67108864
      - name: LNX
        type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
        size: 67108864
        content:
          - image: u-boot.bin
      - name: DTB
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: NXT
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: MXB
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: MXP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: USP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
size: 2097152

The Jetson TX1 has a complex partitioning layout, with many partitions being allocated for the first stage bootloader, and many others that are undocumented. So, to minimize the risk of touching a critical partition, I preferred to keep most of them untouched and do just the minor amount of changes to fit UC into the device. Therefore, the gadget.yaml volumes entry mainly describes the TX1 defaults, with the main differences comparing to the original being:

  1. The APP partition is renamed to system-boot and reduced to only 64MB. It will contain the uboot environment file plus the kernel and initramfs, as usual in UC systems with u-boot bootloader.
  2. The LNX partition will contain our u-boot binary
  3. If a partition with role: system-data is not defined explicitly (which is the case here), a partition which such role and with label “writable” is implicitly defined at the end of the volume. This will take all the available space left aside by the reduction of the APP partition, and will contain the UC root filesystem. This will replace the UDA partition that is the last in nvidia partitioning scheme.

Now, it is time to build the gadget snap by following the repository instructions.

Building & flashing the image

Now that we have the snaps, it is time to build the image. There is not much to it, you just need an Ubuntu One account and to follow the instructions to create a key to be able to sign a model assertion. With that just follow the README.md file in the jetson-ubuntu-core repository. You can also download the latest tarball from the repository if you prefer.

The build script will generate not only a full image file, but also a tarball that will contain separate files for each partition that needs to be flashed in the device. This is needed because unfortunately there is no way we can fully flash the Jetson device with a GPT image, instead we can flash only individual partitions with the tools nvidia provides.

Once the build finishes, we can take the resulting tarball and follow the instructions to get the necessary partitions flashed. As can be read there, we have to download the nvidia L4T package. Also, note that to be able to change the partition sizes and files to flash, a couple of patches have to be applied on top of the L4T scripts.

Summary

After this, you should have a working Ubuntu Core 18 device. You can use the serial port or an external monitor to configure it with your launchpad account so you can ssh into it. Enjoy!

Modifying System Call Arguments With ptrace

As part of the ongoing effort we are doing in Canonical to snappify the world, we are trying to make available more and more software as easily-installable, secure, snaps. One of the things snapd does to isolate applications is to install snaps in separate folders in /snap/<mysnap>, and also creating $HOME/snap/<mysnap> for storing the snap data. Unfortunately, this changes where many applications expects data files to be as per the usual Unix conventions.

Solving this usually implies maintaining patches that change the file paths. If the $SNAP environment variable is set, the patches use it to find files in the right locations. However, this is cumbersome as upstreams are not always willing to accept the patches, at least until snap package format gets more popular. Also, in some cases we might be interested in snappifying proprietary software, where patches are not an option.

To solve these problems, Michael Terry created a library that uses the LD_PRELOAD trick to intercept calls to glibc. Although the library works really well, it has the disadvantage of not working in all cases, like on programs that perform syscalls directly without using the C library, or on statically compiled binaries.

In this post, I have explored an alternative to using LD_PRELOAD that would solve these issues by going to a lower level and intercepting syscalls using the ptrace syscall.

ptrace is used by programs like gdb or strace for debugging purposes. It is a swiss knife tool that lets us access a process’ memory and registers (the tracee) from another process (the tracer). There are many good tutorials around that explain how to use it, like this or this (from which I have borrowed parts of the code), so here I will focus on how it can be used to modify arbitrary syscall arguments. The concrete problem I had at hand was how to change a call to, say, open() so it ends up opening a file in a different path to the one originally specified by the tracee.

I have developed a proof of concept for this that can be found in github. The code is specific to x86_64, although the concepts behind are applicable to other architectures. As a word of caution, I have preferred to not do as many checks as I should to have cleaner code, so please do not consider it production-ready. I will go now through the code, function by function. Skipping the include directives and the forward declarations we get to main():

int main(int argc, char **argv)
{
    pid_t pid;
    int status;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <prog> <arg1> ... <argN>\n", argv[0]);
        return 1;
    }

    if ((pid = fork()) == 0) {
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        kill(getpid(), SIGSTOP);
        return execvp(argv[1], argv + 1);
    } else {
        waitpid(pid, &amp;status, 0);
        ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACESYSGOOD);
        process_signals(pid);
        return 0;
    }
}

Here we do all the usual stuff needed when we want to trace a process from the start: we call fork(), then the child executes ptrace() with request PTRACE_TRACEME to indicate that it is willing to be traced. After that, it sends itself a SIGSTOP signal, which makes it stop (the execve call 1 will be performed later). At that point, the parent process, which was waiting for a signal from the child, re-starts. The first thing it does is setting ptrace option PTRACE_O_TRACESYSGOOD, which makes the kernel set bit 7 in the signal numbers so we can easily distinguish between system call traps and normal traps. After that, it calls process_signal(), which is defined as

static void process_signals(pid_t child)
{
    const char *file_to_redirect = "ONE.txt";
    const char *file_to_avoid = "TWO.txt";
 
    while(1) {
        char orig_file[PATH_MAX];
 
        /* Wait for open syscall start */
        if (wait_for_open(child) != 0) break;
 
        /* Find out file and re-direct if it is the target */
 
        read_file(child, orig_file);
 
        if (strcmp(file_to_avoid, orig_file) == 0)
            redirect_file(child, file_to_redirect);
 
        /* Wait for open syscall exit */
        if (wait_for_open(child) != 0) break;
    }
}

This function is the main loop of the tracer. It waits for the open() syscall from the child to be started, and for it to exit (ptrace signals us both events) by calling wait_for_open(). When an open() call is detected, we read the file that the child wants to open in read_file(), and then we compare with string “TWO.txt”. If there is a match, we change the syscall arguments so “ONE.txt” is opened instead. Next we analyze the different functions that perform the low level stuff, and that contain architecture specific parts:

static int wait_for_open(pid_t child)
{
    int status;
 
    while (1) {
        ptrace(PTRACE_SYSCALL, child, 0, 0);
        waitpid(child, &status, 0);
        /* Is it the open syscall (sycall number 2 in x86_64)? */
        if (WIFSTOPPED(status) && WSTOPSIG(status) & 0x80 &&
            ptrace(PTRACE_PEEKUSER, child, sizeof(long)*ORIG_RAX, 0) == 2)
            return 0;
        if (WIFEXITED(status))
            return 1;
    }
}

wait_for_open() is executed until an open() system call is detected. By calling ptrace with argument PTRACE_SYSCALL, we let the child continue until the next signal or syscall enter/exit. The first time this happens the child, which was stopped after sending itself SIGSTOP, continues its execution and calls execve(). The parent then waits for signals from the child. If the child has stopped due to the signal, the signal number has the 7th bit set (should happen if the signal was triggered due to a syscall as we have set PTRACE_O_TRACESYSGOOD option), and it is the open() syscall (system call number 2 for x86_64), then we return with status 0. If the child has actually exited, the return value is 1. If nothing of this happens, we wait for the next signal. Here we are using PTRACE_PEEKUSER request, which lets us access the tracee user area. This area contains an array with the general purpose registers, and we use offsets defined in <sys/reg.h> to access them. When performing a syscall, RAX register contains the syscall number. However, we use ORIG_RAX offset to grab that number instead of the also existing RAX offset. We do this because RAX is also used to store the return value for the syscall, so the kernel stores the syscall number with offset ORIG_RAX and the return value with offset RAX. This needs to be taking into account especially when processing the exit of the system call. More information can be found here.

Next function is

static void read_file(pid_t child, char *file)
{
    char *child_addr;
    int i;
 
    child_addr = (char *) ptrace(PTRACE_PEEKUSER, child, sizeof(long)*RDI, 0);
 
    do {
        long val;
        char *p;
 
        val = ptrace(PTRACE_PEEKTEXT, child, child_addr, NULL);
        if (val == -1) {
            fprintf(stderr, "PTRACE_PEEKTEXT error: %s", strerror(errno));
            exit(1);
        }
        child_addr += sizeof (long);
 
        p = (char *) &val;
        for (i = 0; i < sizeof (long); ++i, ++file) {
            *file = *p++;
            if (*file == '\0') break;
        }
    } while (i == sizeof (long));
}

The read_file() function uses PTRACE_PEEKUSER as in the previous function to retrieve the first argument to the call function, which is the address of the string with the file name. This parameter is stored in the RDI register. Then it uses PTRACE_PEEKTEXT ptrace request, which lets us copy over data from the traced process’ memory. This is performed by words, so we do this in a loop until we find a null byte that indicates the end of the string.

The last function is

static void redirect_file(pid_t child, const char *file)
{
    char *stack_addr, *file_addr;
 
    stack_addr = (char *) ptrace(PTRACE_PEEKUSER, child, sizeof(long)*RSP, 0);
    /* Move further of red zone and make sure we have space for the file name */
    stack_addr -= 128 + PATH_MAX;
    file_addr = stack_addr;
 
    /* Write new file in lower part of the stack */
    do {
        int i;
        char val[sizeof (long)];
 
        for (i = 0; i < sizeof (long); ++i, ++file) {
            val[i] = *file;
            if (*file == '\0') break;
        }
 
        ptrace(PTRACE_POKETEXT, child, stack_addr, *(long *) val);
        stack_addr += sizeof (long);
    } while (*file);
 
    /* Change argument to open */
    ptrace(PTRACE_POKEUSER, child, sizeof(long)*RDI, file_addr);
}

The redirect_file() function is the most interesting one of this program. It modifies the argument to the open system call, forcing the child to open a different file to the one it originally specified. The main problem with this is that we need to modify the child’s memory space so the new file name is used by the kernel. We can change the tracee’s memory easily by using PTRACE_POKETEXT, the issue is, where can we store it?

A possible option is to save the original string using PTRACE_PEEKTEXT, then overwrite it by using PTRACE_POKETEXT. When we get called after the sycall exits, we copy back the original data. This can work fine in some cases, but it can be problematic if the new file name is longer than the original. We could be overwriting data that is used by other threads, which are not necessarily stopped so they could access that data while the kernel is processing the call. Or that data we are overwriting could be part of another parameter to the syscall, which would not happen for open(), but it is possible for other syscalls like link(). Finally, there is also the possibility that the string we are trying to modify is in a read only segment. Therefore, this is not a safe option.

After noticing this, I considered the option of adding a read-write segment to the binary under study, or to resize an existing one. However, I found there are not that many tools to do this, and those that apparently could do the job like ERESI, were not very intuitive 2. Also, we would need to find out where the new segment gets loaded to know where to write, which would complicate the code. Furthermore, I wanted to avoid modifying the binary if possible.

Finally, I concluded that the stack was exactly what I needed: it is of course RW, and a reference address can be found by simply looking at the RSP register. What we have to do is make sure we write in a safe part of the stack. This can be performed by writing to addresses lower than the RSP (that is, the free part of the stack). To achieve this, we “reserve” stack memory so we can write a string of up to PATH_MAX length, and we add up 128 bytes for the red zone (this size is specified by the x86_64 ABI). Note also that syscalls do not write on the process stack: one of the very first things that is done by the Linux kernel syscalls entry point is to switch RSP to a kernel stack. This approach has also the advantage of being automatically thread-friendly, as each one has its own stack. On the other hand, there is the possibility of writing outside the stack. However that risk is quite small nowadays, as stacks of user space programs tend to be big and typically auto-grow on page fault. Another advantage is that we do not need to save and recover memory areas at enter/exit of the syscall, as the tracee should not write anything in the used memory area.

Once it is decided where to write, the implementation is straightforward: first we use PTRACE_PEEKUSER to get the RSP value of the tracee. Then, we write the new file name to a pointer lower than the RSP calculated as explained in the previous paragraph. The data is written by using PTRACE_POKETEXT, word by word. Finally, we change the child’s RDI register so it points to the new address.

Now we can give the program a try. I created a couple of files with content:

$ cat ONE.txt 
This is ONE.txt
$ cat TWO.txt 
This is TWO.txt

Executing the same cat command using redirect we have:

$ gcc redirect.c -o redirect
$ ./redirect cat ONE.txt 
This is ONE.txt
$ ./redirect cat TWO.txt 
This is ONE.txt

Things work as publicized: we modify the file opened by the cat in case it tries to show the content of “TWO.txt”.

Conclusions

As has been seen, the code to make this is remarkably small, which shows the power of the ptrace call. There are indeed parts of this that are very architecture specific, but that is mostly the name of the registers and maybe the red zone size, so it should be relatively straightforward to make it multi-architecture by adding some macros.

Another appreciation is that the example is for the open() syscall, but this technique can be applied to arbitrary arguments which are passed to any syscall as pointers to data in the traced process.

To finish, the main drawback for this solution is performance, as we have to stop (twice) for each syscall invoked by the child, with all the context switches that implies. A possible solution would be to use ptrace in combination with seccomp and seccomp Berkeley Packet Filters, which apparently make possible for the tracer to specify the syscalls that would provoke a trap. That would be, however, matter for another post.