arm – Alfonso Sánchez-Beato's blog

Porting Ubuntu Core 18 to nvidia Jetson TX1 Developer Kit

Ubuntu Core (UC) is Canonical’s take in the IoT space. There are pre-built images for officially supported devices, like Raspberry Pi or Intel NUCs, but if we have something else and there is no community port, we need to create the UC image ourselves. High level instructions on how to do this are found in the official docs. The process is straightforward once we have two critical components: the kernel and the gadget snap.

Creating these snaps is not necessarily complex, but there can be bumps in the road if you are new to the task. In this post I explain how I created them for the Jetson TX1 developer kit board, and how they were used to create a UC image for said device, hoping this will provide new tricks to hackers working on ports for other devices. All the sources for the snaps and the build scripts are available in github:
https://github.com/alfonsosanchezbeato/jetson-kernel-snap
https://github.com/alfonsosanchezbeato/jetson-gadget-snap
https://github.com/alfonsosanchezbeato/jetson-ubuntu-core

So, let’s start with…

The kernel snap

The Linux kernel that we will use needs some kernel configuration options to be activated, and it is also especially important that it has a modern version of apparmor so snaps can be properly confined. The official Jetson kernel is the 4.4 release, which is quite old, but fortunately Canonical has a reference 4.4 kernel with all the needed patches for snaps backported. Knowing this, we are a git format-patch command away to obtain the patches we will use on top of the nvidia kernel. The patches include also files with the configuration options that we need for snaps, plus some changes so the snap could be successfully compiled on Ubuntu 18.04 desktop.

Once we have the sources, we need, of course, to create a snapcraft.yaml file that will describe how to build the kernel snap. We will walk through it, highlighting the parts more specific to the Jetson device.

Starting with the kernel part, it turns out that we cannot use easily the kernel plugin, due to the special way in which the kernel needs to be built: nvidia distributes part of the needed drivers as separate repositories to the one used by the main kernel tree. Therefore, I resorted to using the nil plugin so I could hand-write the commands to do the build.

The pull stage that resulted is

override-pull: |
  snapcraftctl pull
  # Get kernel sources, which are distributed across different repos
  ./source_sync.sh -k tegra-l4t-r28.2.1
  # Apply canonical patches - apparmor stuff essentially
  cd sources/kernel/display
  git am ../../../patch-display/*
  cd -
  cd sources/kernel/kernel-4.4
  git am ../../../patch/*

which runs a script to retrieve the sources (I pulled this script from nvidia Linux for Tegra -L4T- distribution) and applies Canonical patches.

The build stage is a few more lines, so I decided to use an external script to implement it. We will analyze now parts of it. For the kernel configuration we add all the necessary Ubuntu bits:

make "$JETSON_KERNEL_CONFIG" \
    snappy/containers.config \
    snappy/generic.config \
    snappy/security.config \
    snappy/snappy.config \
    snappy/systemd.config

Then, to do the build we run

make -j"$num_cpu" Image modules dtbs

An interesting catch here is that zImage files are not supported due to lack of a decompressor implementation in the arm64 kernel. So we have to build an uncompressed Image instead.

After some code that stages the built files so they are included in the snap later, we retrieve the initramfs from the core snap. This step is usually hidden from us by the kernel plugin, but this time we have to code it ourselves:

# Get initramfs from core snap, which we need to download
core_url=$(curl -s -H "X-Ubuntu-Series: 16" -H "X-Ubuntu-Architecture: arm64" \
                "https://search.apps.ubuntu.com/api/v1/snaps/details/core?channel=stable" \
               | jq -r ".anon_download_url")
curl -L "$core_url" > core.snap
# Glob so we get both link and regular file
unsquashfs core.snap "boot/initrd.img-core*"
cp squashfs-root/boot/initrd.img-core "$SNAPCRAFT_PART_INSTALL"/initrd.img
ln "$SNAPCRAFT_PART_INSTALL"/initrd.img "$SNAPCRAFT_PART_INSTALL"/initrd-"$KERNEL_RELEASE".img

Moving back to the snapcraft recipe we also have an initramfs part, which takes care of doing some changes to the default initramfs shipped by UC:

initramfs:
  after: [ kernel ]
  plugin: nil
  source: ../initramfs
  override-build: |
    find . | cpio --quiet -o -H newc | lzma >> "$SNAPCRAFT_STAGE"/initrd.img

Here we are taking advantage of the fact that the initramfs can be built as a concatenation of compressed cpio archives. When the kernel decompresses it, the files included in the later archives overwrite the files from the first ones, which allows us to modify easily files in the initramfs without having to change the one shipped with core. The change that we are doing here is a modification to the resize script that allows UC to get all the free space in the disk on first boot. The modification makes sure this happens in the case when the partition is already taken all available space but the filesystem does not. We could remove this modification when these changes reach the core snap, thing that will happen eventually.

The last part of this snap is the firmware part:

firmware:
  plugin: nil
  override-build: |
    set -xe
    wget https://developer.nvidia.com/embedded/dlc/l4t-jetson-tx1-driver-package-28-2-ga -O Tegra210_Linux_R28.2.0_aarch64.tbz2
    tar xf Tegra210_Linux_R28.2.0_aarch64.tbz2 Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2
    tar xf Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2 lib/firmware/
    cd lib; cp -r firmware/ "$SNAPCRAFT_PART_INSTALL"
    mkdir -p "$SNAPCRAFT_PART_INSTALL"/firmware/gm20b
    cd "$SNAPCRAFT_PART_INSTALL"/firmware/gm20b
    ln -sf "../tegra21x/acr_ucode.bin" "acr_ucode.bin"
    ln -sf "../tegra21x/gpmu_ucode.bin" "gpmu_ucode.bin"
    ln -sf "../tegra21x/gpmu_ucode_desc.bin" "gpmu_ucode_desc.bin"
    ln -sf "../tegra21x/gpmu_ucode_image.bin" "gpmu_ucode_image.bin"
    ln -sf "../tegra21x/gpu2cde.bin" "gpu2cde.bin"
    ln -sf "../tegra21x/NETB_img.bin" "NETB_img.bin"
    ln -sf "../tegra21x/fecs_sig.bin" "fecs_sig.bin"
    ln -sf "../tegra21x/pmu_sig.bin" "pmu_sig.bin"
    ln -sf "../tegra21x/pmu_bl.bin" "pmu_bl.bin"
    ln -sf "../tegra21x/fecs.bin" "fecs.bin"
    ln -sf "../tegra21x/gpccs.bin" "gpccs.bin"

Here we download some files so we can add firmware blobs to the snap. These files come separate from nvidia kernel sources.

So this is it for the kernel snap, now you just need to follow the instructions to get it built.

The gadget snap

Time now to take a look at the gadget snap. First, I recommend to start by reading great ogra’s post on gadget snaps for devices with u-boot bootloader before going through this section. Now, same as for the kernel one, we will go through the different parts that are defined in the snapcraft.yaml file. The first one builds the u-boot binary:

uboot:
  plugin: nil
  source: git://nv-tegra.nvidia.com/3rdparty/u-boot.git
  source-type: git
  source-tag: tegra-l4t-r28.2
  override-pull: |
    snapcraftctl pull
    # Apply UC patches + bug fixes
    git am ../../../uboot-patch/*.patch
  override-build: |
    export ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
    make p2371-2180_defconfig
    nice make -j$(nproc)
    cp "$SNAPCRAFT_PART_BUILD"/u-boot.bin $SNAPCRAFT_PART_INSTALL"/

We decided again to use the nil plugin as we need to do some special quirks. The sources are pulled from nvidia’s u-boot repository, but we apply some patches on top. These patches, along with the uboot environment, provide

Support for loading the UC kernel and initramfs from disk
Support for the revert functionality in case a core or kernel snap installation goes wrong
Bug fixes for u-boot’s ext4 subsystem – required because the just mentioned revert functionality needs to call u-boot’s command saveenv, which happened to be broken for ext4 filesystems in tegra’s u-boot

More information on the specifics of u-boot patches for UC can be found in this great blog post.

The only other part that the snap has is uboot-env:

uboot-env:
  plugin: nil
  source: uboot-env
  override-build: |
    mkenvimage -r -s 131072 -o uboot.env uboot.env.in
    cp "$SNAPCRAFT_PART_BUILD"/uboot.env "$SNAPCRAFT_PART_INSTALL"/
    # Link needed for ubuntu-image to work properly
    cd "$SNAPCRAFT_PART_INSTALL"/; ln -s uboot.env uboot.conf
  build-packages:
    - u-boot-tools

This simply encodes the uboot.env.in file into a format that is readable by u-boot. The resulting file, uboot.env, is included in the snap.

This environment is where most of the support for UC is encoded. I will not delve too much into the details, but just want to mention that the variables that need to be edited usually for new devices are

devnum, partition, and devtype to set the system boot partition, from which we load the kernel and initramfs
fdtfile, fdt_addr_r, and fdt_high to determine the name of the device tree and where in memory it should be loaded
ramdisk_addr_r and initrd_high to set the loading location for the initramfs
kernel_addr_r to set where the kernel needs to be loaded
args contains kernel arguments and needs to be adapted to the device specifics
Finally, for this device, snappy_boot was changed so it used booti instead of bootz, as we could not use a compressed kernel as explained above

Besides the snapcraft recipe, the other mandatory file when defining a gadget snap is the gadget.yaml file. This file defines, among other things, the image partitioning layout. There is more to it, but in this case, partitioning is the only thing we have defined:

volumes:
  jetson:
    bootloader: u-boot
    schema: gpt
    structure:
      - name: system-boot
        role: system-boot
        type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
        filesystem: ext4
        filesystem-label: system-boot
        offset: 17408
        size: 67108864
      - name: TBC
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: EBT
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: BPF
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: WB0
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: RP1
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: TOS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: EKS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: FX
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: BMP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 134217728
      - name: SOS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 20971520
      - name: EXI
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 67108864
      - name: LNX
        type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
        size: 67108864
        content:
          - image: u-boot.bin
      - name: DTB
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: NXT
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: MXB
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: MXP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: USP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
size: 2097152

The Jetson TX1 has a complex partitioning layout, with many partitions being allocated for the first stage bootloader, and many others that are undocumented. So, to minimize the risk of touching a critical partition, I preferred to keep most of them untouched and do just the minor amount of changes to fit UC into the device. Therefore, the gadget.yaml volumes entry mainly describes the TX1 defaults, with the main differences comparing to the original being:

The APP partition is renamed to system-boot and reduced to only 64MB. It will contain the uboot environment file plus the kernel and initramfs, as usual in UC systems with u-boot bootloader.
The LNX partition will contain our u-boot binary
If a partition with role: system-data is not defined explicitly (which is the case here), a partition which such role and with label “writable” is implicitly defined at the end of the volume. This will take all the available space left aside by the reduction of the APP partition, and will contain the UC root filesystem. This will replace the UDA partition that is the last in nvidia partitioning scheme.

Now, it is time to build the gadget snap by following the repository instructions.

Building & flashing the image

Now that we have the snaps, it is time to build the image. There is not much to it, you just need an Ubuntu One account and to follow the instructions to create a key to be able to sign a model assertion. With that just follow the README.md file in the jetson-ubuntu-core repository. You can also download the latest tarball from the repository if you prefer.

The build script will generate not only a full image file, but also a tarball that will contain separate files for each partition that needs to be flashed in the device. This is needed because unfortunately there is no way we can fully flash the Jetson device with a GPT image, instead we can flash only individual partitions with the tools nvidia provides.

Once the build finishes, we can take the resulting tarball and follow the instructions to get the necessary partitions flashed. As can be read there, we have to download the nvidia L4T package. Also, note that to be able to change the partition sizes and files to flash, a couple of patches have to be applied on top of the L4T scripts.

Summary

After this, you should have a working Ubuntu Core 18 device. You can use the serial port or an external monitor to configure it with your launchpad account so you can ssh into it. Enjoy!

How to Access Safely Unaligned Data

Occasionally I find myself processing input data which arrives as a stream, like data from files or from a socket, but that has a known structure that can be modeled with C types. For instance, let’s say we are receiving from a socket a parcel that consists on a header of one byte, and a payload that is an integer. A naive way to handle this is the following (simplified for readability) code snippet:

int main(void)
{
    int fd;
    char *buff;
    struct sockaddr_in addr;
    int vint;
    char vchar;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    buff = malloc(BUFF_SIZE);
    /* Init socket address */
    ...
    connect(fd, (struct sockaddr *) &addr, sizeof(addr));

    read(fd, buff, BUFF_SIZE);

    vchar = buff[0];
    vint  = *(int *) &buff[1];
    /* Do something with extracted data, free resources */
    ...
    return 0;
}

Here we get the raw data with a read() call, we read the first byte, then we read an integer by taking a pointer to the second read byte and casting it to a pointer to an integer. (for this example we are assuming that the integer inserted in the stream has the same size and endianness as the CPU ones).

There is a big issue with this: the cast to int *, which is undefined behavior according to the C standard ¹. And it is because things can go wrong in at least two ways, first due to pointer aliasing rules, second due to type alignment.

Strict pointer aliasing tells the compiler that it can assume that pointers to different types point to different places in memory. This allows some optimizations, like reordering. Therefore, we could be in trouble if, say, we take &buff[1] into a char * and use it to write a byte in that location, as reordering could hit us. So just do not do that. Let’s also hope that we have a compiler that is not completely insane and does not move our reading by int pointer before the read() system call. We could also disable strict aliasing if we are using GCC with option -fno-strict-aliasing, which by the way is something that the Linux kernel does. At any rate, this is a complex subject and I will not dig into it this time.

We will concentrate in this article on how to solve the other problem, that is, how to access safely types that are not stored in memory in their natural alignment.

The C Standard-Compliant Solution

Before moving further, keep in mind that it is always possible to be strictly compliant with the standard and access safely memory without breaking language rules or using compiler or machine specific tricks. In the example, we could retrieve vint by doing

    vint  =   buff[1] + (buff[2] << 8)
            + (buff[3] << 16) + (buff[4] << 24);

(supposing stored data is little endian).

The issue here is performance: we are implicitly transforming four bytes to integers, then we have to bit-shift three of them, and finally we have to add them up. Note however that this is what we want if data and CPU have different endianness.

Doing Unaligned Memory Accesses

In all machine architectures there is a natural alignment for the different data types. This alignment is usually the size of the types, for instance in 32 bits architectures the alignment for integers is 4, for doubles it is 8, etc. If instances of these types are not stored in memory positions that are multiple of their alignment, we are talking about unaligned access. If we try to access unaligned data either of these can happen:

The hardware let’s us access it – but always at a performance penalty.
An exception is triggered by the CPU. This type of exception is called bus error ².

We might be willing to accept the performance penalty ³, which is mitigated by CPU caches and not that noticeable in certain architectures like x86-64 , but we certainly do not want our program to crash. How possible is this? To be honest it is not something I have seen that often. Therefore, as a first analysis step, I checked how easy it was to get bus errors. To do so, I created the following C++ program, access1.cpp (I could not resist to use templates here to reduce the code size):

#include <iostream>
#include <typeinfo>
#include <cstring>

using namespace std;

template <typename T>
void print_unaligned(char *ptr)
{
    T *val = reinterpret_cast<T *>(ptr);

    cout << "Type is \"" << typeid(T).name()
         << "\" with size " << sizeof(T) << endl;
    cout << val << " *val: " << *val << endl;
}

int main(void)
{
    char *mem = new char[128];

    memset(mem, 0, 128);

    print_unaligned<int>(mem);
    print_unaligned<int>(mem + 1);
    print_unaligned<long long>(mem);
    print_unaligned<long long>(mem + 1);
    print_unaligned<long double>(mem);
    print_unaligned<long double>(mem + 1);

    delete[] mem;
    return 0;
}

The program allocates memory using new char[], which as malloc() in C is guaranteed to allocate memory with the same alignment as the strictest fundamental type. After zeroing the memory, we access mem and mem + 1 by casting to different pointer types, knowing that the second address is odd, and therefore unaligned except for char * access.

I compiled the file with g++ on my laptop, ran it, and got

$ g++ access1.cpp -o access1
$ file access1
access1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=09d0fb19340a10941eef4c3dc4d6eb29383e717d, not stripped
$ ./access1
Type is "i" with size 4
0x16c3c20 *val: 0
Type is "i" with size 4
0x16c3c21 *val: 0
Type is "x" with size 8
0x16c3c20 *val: 0
Type is "x" with size 8
0x16c3c21 *val: 0
Type is "e" with size 16
0x16c3c20 *val: 0
Type is "e" with size 16
0x16c3c21 *val: 0

No error for x86-64. This was expected as Intel architecture is known to support unaligned access by hardware, at a performance penalty (which is apparently quite small these days, see ⁴).

The second try was with an ARM CPU, compiling for arm-32:

$ g++ access1.cpp -o access1
$ file access1
access1: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=8c3c3e7d77fddd5f95d18dbffe37d67edc716a1c, not stripped
$ ./access1
Type is "i" with size 4
0x47b008 *val: 0
Type is "i" with size 4
0x47b009 *val: 0
Type is "x" with size 8
0x47b008 *val: 0
Type is "x" with size 8
Bus error (core dumped)

Now we get what we were searching for, a legitimate bus error, in this case when accessing a long long from an unaligned address. Commenting out the offending line and letting the program run further showed the error also when accessing a long double from mem + 1.

Fixing Unaligned Memory Accesses

After proving that this could be a real problem, at least for some architectures, I tried to find a solution that would let me do unaligned memory accesses in the most generic way. I could not find anything safe that was strictly following the C standard. However, all C/C++ compilers have ways to define packed structures, and that came to the rescue.

Packed structures are intended to minimize the padding that is introduced by alignment needed by the structure members. They are used when minimizing storage is a big concern. But what is interesting for us is that its members can be unaligned due to the packing, so dereferencing them must take that into account. Therefore, if we are accessing a type in a CPU that does not support unaligned access for that type the compiler must synthesize code that handles this transparently from the point of view of the C program.

To test that this worked as expected, I wrote access2.cpp, which uses GCC attribute __packed__ to define a packed structure:

#include <iostream>
#include <typeinfo>
#include <cstring>

using namespace std;

template <typename T>
struct __attribute__((__packed__)) struct_safe
{
    T val;
};

template <typename T>
void print_unaligned(char *ptr)
{
    struct_safe<T> *safe = reinterpret_cast<struct_safe<T> *>(ptr);

    cout << "Type is \"" << typeid(T).name()
         << "\" with size " << sizeof(T) << endl;
    cout << safe << " safe->val: " << safe->val << endl;
}

int main(void)
{
    char *mem = new char[128];

    memset(mem, 0, 128);

    print_unaligned<int>(mem);
    print_unaligned<int>(mem + 1);
    print_unaligned<long long>(mem);
    print_unaligned<long long>(mem + 1);
    print_unaligned<long double>(mem);
    print_unaligned<long double>(mem + 1);

    delete[] mem;
    return 0;
}

In this case, instead of directly casting to the type, I cast to a pointer to the packed struct and access the type through it.

Compiling and running for x86-64 got the expected result: no error, all worked as before. Then I compiled and ran it in an ARM device:

$ g++ access2.cpp -o access2
$ file access2
access2: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=9a1ee8c2fcd97393a4b53fe12563676d9f2327a3, not stripped
$ ./access2
Type is "i" with size 4
0x391008 safe->val: 0
Type is "i" with size 4
0x391009 safe->val: 0
Type is "x" with size 8
0x391008 safe->val: 0
Type is "x" with size 8
0x391009 safe->val: 0
Type is "e" with size 8
0x391008 safe->val: 0
Type is "e" with size 8
0x391009 safe->val: 0

No bus errors anymore! It worked as expected. To gain some understanding of what was happening behind the curtains, I disassembled the generated ARM binaries. For both access1 and access2, the same instruction was being used when I was getting a value after casting to int: LDR, which unsurprisingly loads a 32-bit word into a register. But for the long long, I found that access1 was using LDRD, which loads double words (8 bytes) from memory, while access2 was using two LDR instructions instead.

This all made a lot of sense, as ARM states that LDR supports access to unaligned data, while LDRD does not ⁵. Indeed the later is faster, but has this restriction. It was also good to check that there was no penalty for using the packed structure for integers: GCC does a good job to discriminate when the CPU really needs to handle differently unaligned accesses.

GCC cast-align Warning

GCC has a warning that can help to identify points in the code when we might be accessing unaligned data, which is activated with -Wcast-align. It is not part of the warnings that are activated by options -Wall or -Wextra, so we will have to add it explicitly if we want it. The warning is only triggered when compiling for architectures that do not support unaligned access for all types, so you will not see it if compiling only for x86.

When triggered, you will see something like

file.c:28:23: warning: cast increases required alignment of target type [-Wcast-align]
   int *my_int_ptr = (int *) &buf[i];
                     ^

Conclusion

The moral of this post is that you need to be very careful when casting pointers to a type different to the original one ⁶. When you need to do that, think about alignment issues first, and also think on your target architectures. There are programs that we want to run on more than one CPU type and too many times we only test in our reference.

Unfortunately the C standard does not give us a standard way of doing efficient access to unaligned data, but most if not all compilers seem to provide ways to do this. If we are using GCC, __attribute__((__packed__)) can help us when we might be doing unaligned accesses. The ARM compiler has a __packed attribute for pointers ⁷, and I am sure other compilers provide similar machinery. I also recommend to activate -Wcast-align if using GCC, which makes easier to spot alignment issues.

Finally, a word of caution: in most cases you should not do this type of casts. Some times you can define structures and read directly data onto them, some times you can use unions. Bear in mind always the strict pointer aliasing rules, which can hit back. To summarize, think twice before using the sort of trick showed in the post, and use them only when really needed.

Debugging Crashes in Proprietary Binaries – Case Study for an ARM Library

So there I was. I did have to use a proprietary library, for which I had no sources and no real hope of support from the creators. I built my program against it, I ran it, and I got a segmentation fault. An exception that seemed to happen inside that insidious library, which was of course stripped of all debugging information. I scratched my head, changed my code, checked traces, tried valgrind, strace, and other debugging tools, but found no obvious error. Finally, I assumed that I had to dig deeper and do some serious debugging of the library’s assembly code with gdb. The rest of the post is dedicated to the steps I followed to find out what was happening inside the wily proprietary library that we will call libProprietary. Prerequisites for this article are some knowledge of gdb and ARM architecture.

Some background on the task I was doing: I am a Canonical employee that works as developer for Ubuntu for Phones. In most, if not all, phones, the BSP code is not 100% open and we have to use proprietary libraries built for Android. Therefore, these libraries use bionic, Android’s libc implementation. As we want to call them inside binaries compiled with glibc, we resort to libhybris, an ingenious library that is able to load and call libraries compiled against bionic while the rest of the process uses glibc. This will turn out to be critical in this debugging. Note also that we are debugging ARM 32-bits binaries here.

The Debugging Session

To start, I made sure I had installed glibc and other libraries symbols and started to debug by using gdb in the usual way:

$ gdb myprogram
GNU gdb (Ubuntu 7.9-1ubuntu1) 7.9
...
Starting program: myprogram
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
[New Thread 0xf49de460 (LWP 7101)]
[New Thread 0xf31de460 (LWP 7104)]
[New Thread 0xf39de460 (LWP 7103)]
[New Thread 0xf41de460 (LWP 7102)]
[New Thread 0xf51de460 (LWP 7100)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf49de460 (LWP 7101)]
0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xf520bd06 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info proc mappings
process 7097
Mapped address spaces:

	Start Addr   End Addr       Size     Offset objfile
	   0x10000    0x17000     0x7000        0x0 /usr/bin/myprogram
	...
	0xf41e0000 0xf49df000   0x7ff000        0x0 [stack:7101]
	...
	0xf51f6000 0xf5221000    0x2b000        0x0 /android/system/lib/libProprietary.so
	0xf5221000 0xf5222000     0x1000        0x0 
	0xf5222000 0xf5224000     0x2000    0x2b000 /android/system/lib/libProprietary.so
	0xf5224000 0xf5225000     0x1000    0x2d000 /android/system/lib/libProprietary.so
	...
(gdb)

We can see here that we get the promised crash. I execute a couple of gdb commands after that to see the backtrace and part of the process address space that will be of interest in the following discussion. The backtrace shows that a segment violation happened when the CPU tried to execute instructions in address zero, and we can see by checking the process mappings that the previous frame lives inside the text segment of libProprietary.so. There is no backtrace beyond that point, but that should come as no surprise as there is no DWARF information in libProprietary, and also noting that usage of frame pointer is optimized away quite commonly these days.

After this I tried to get a bit more information on the CPU state when the crash happened:

(gdb) info reg
r0             0x0	0
r1             0x0	0
r2             0x0	0
r3             0x9	9
r4             0x0	0
r5             0x0	0
r6             0x0	0
r7             0x0	0
r8             0x0	0
r9             0x0	0
r10            0x0	0
r11            0x0	0
r12            0xffffffff	4294967295
sp             0xf49dde70	0xf49dde70
lr             0xf520bd07	-182403833
pc             0x0	0x0
cpsr           0x60000010	1610612752
(gdb) disassemble 0xf520bd02,+10
Dump of assembler code from 0xf520bd02 to 0xf520bd0c:
   0xf520bd02:	b	0xf49c9cd6
   0xf520bd06:	movwpl	pc, #18628	; 0x48c4	<UNPREDICTABLE>
   0xf520bd0a:	andlt	r4, r11, r8, lsr #12
End of assembler dump.
(gdb)

Hmm, we are starting to see weird things here. First, in 0xf520bd02 (which probably has been executed little before the crash) we get an unconditional branch to some point in the thread stack (see mappings in previous figure). Second, the instruction in 0xf520bd06 (which should be executed after returning from the procedure that provokes the crash) would load into the pc (program counter) an address that is not mapped: we saw that the first mapped address is 0x10000 in the previous figure. The movw instruction has also a “pl” suffix that makes the instruction execute only when the operand is positive or zero… which is obviously unnecessary as 0x48c4 is encoded in the instruction.

I resorted to doing objdump -d libProprietary.so to disassemble the library and compare with gdb output. objdump shows, in that part of the file (subtracting the library load address gives us the offset inside the file: 0xf520bd02-0xf51f6000=0x15d02):

   15d02:	f7f3 eade 	blx	92c0 <__android_log_print@plt>;
   15d06:	f8c4 5304 	str.w	r5, [r4, #772]	; 0x304
   15d0a:	4628      	mov	r0, r5
   15d0c:	b00b      	add	sp, #44	; 0x2c
   15d0e:	e8bd 8ff0 	ldmia.w	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}

which is completely different from what gdb shows! What is happening here? Taking a look at addresses for both code chunks, we see that instructions are always 4 bytes in gdb output, while they are 2 or 4 in objdump‘s. Well, you have guessed, don’t you? We are seeing “normal” ARM instructions in gdb, while objdump is decoding THUMB-2 instructions. Certainly objdump seems to be right here as the output is more sensible: we have a call to an executable part of the process space in 0x15d02 (it is resolved to a known function, __android_log_print), and the following instructions seems like a normal function epilogue in ARM: a return value is stored in r0, the sp (stack pointer) is incremented (we are freeing space in the stack), and we restore registers.

If we get back to the register values, we see that cpsr (current program status register [1]) does not have the T bit set, so gdb thinks we are using ARM instructions. We can change this by doing

(gdb) set $cpsr=0x60000030
(gdb) disass 0xf520bd02,+15
Dump of assembler code from 0xf520bd02 to 0xf520bd11:
   0xf520bd02:	blx	0xf51ff2c0
   0xf520bd06:	str.w	r5, [r4, #772]	; 0x304
   0xf520bd0a:	mov	r0, r5
   0xf520bd0c:	add	sp, #44	; 0x2c
   0xf520bd0e:	ldmia.w	sp!, {r4, r5, r6, r7, r8, r9, r10, r11, pc}
End of assembler dump.

Ok, much better now [2]. The thumb bit in cpsr is determined by last bx/blx call: if the address is odd, the procedure to which we are calling contains THUMB instructions, otherwise they are ARM (a good reference for these instructions is [3]). In this case, after an exception the CPU moves to arm mode, and gdb is unable to know which is the right mode when disassembling. We can search for hints on which parts of the code are arm/thumb by looking at the values in registers used by bx/blx, or by looking at the lr (link register): we can see above that the value after the crash was 0xf520bd07, which is odd and indicates that 0xf520bd06 contains a thumb instruction. However, for some reason gdb is not able to take advantage of this information.

Of course this problem does not happen if we have debugging information: in that case we have special symbols that let gdb know if the section where the code is contains thumb instructions or not [4]. As those are not found, gdb uses the cpsr value. Here objdump seems to have better heuristics though.

After solving this issue with instruction decoding, I started to debug __android_log_print to check what was happening there, as it looked like the crash was happening in that call. I spent quite a lot of time there, but found nothing. All looked fine, and I started to despair. Until I inserted a breakpoint in address 0xf520bd06, right after the call to __android_log_print, run the program… and it stopped at that address, no crash happened. I started to execute the program instruction by instruction after that:

(gdb) b *0xf520bd06
(gdb) run
...
Breakpoint 1, 0xf520bd06 in ?? ()
(gdb) si
0xf520bd0a in ?? ()
(gdb) si
0xf520bd0c in ?? ()
(gdb) si
0xf520bd0e in ?? ()
Warning:
Cannot insert breakpoint 0.
Cannot access memory at address 0x0

Something was apparently wrong with instruction ldmia, which restores registers, including the pc, from the stack. I took a look at the stack in that moment (taking into account that ldmia had already modified the sp after restoring 9 registers == 36 bytes):

(gdb) x/16xw $sp-36
0xf49dde4c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde5c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde6c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde7c:	0x00000000	0x00000000	0x00000000	0x00000000

All zeros! At this point it is clear that this is the real point where the crash is happening, as we are loading 0 into the pc. This looked clearly like a stack corruption issue.

But, before moving forward, why are we getting a wrong backtrace from gdb? Well, gdb is seeing a corrupted stack, so it is not able to unwind it. It would not be able to unwind it even if having full debug information. The only hint it has is the lr. This register contains the return address after execution of a bl/blx instruction [3]. If the called procedure is non-leaf, it is saved in the prologue, and restored in the epilogue, because it gets overwritten when branching to other procedures. In this case, it is restored on the pc and sometimes it is also saved back in the lr, depending on whether we have arm-thumb interworking built in the procedure or not [5]. It is not overwritten if we have a leaf procedure (as there are no procedure calls inside these).

As gdb has no additional information, it uses the lr to build the backtrace, assuming we are in a leaf procedure. However this is not true and the backtrace turns out to be wrong. Nonetheless, this information was not completely useless: lr was pointing to the instruction right after the last bl/blx instruction that was executed, which was not that far away from the real point where the program was crashing. This happened because fortunately __android_log_print has interworking code and restores the lr, otherwise the value of lr could have been from a point much far away from the point where the real crash happens. Believe or not, but it could have been even worse!

Having now a clear idea of where and why the crash was happening, things accelerated. The procedure where the crash happened, as disassembled by objdump, was (I include here only the more relevant parts of the code)

00015b1c <ProprietaryProcedure@@Base>:
   15b1c:	e92d 4ff0 	stmdb	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
   15b20:	b08b      	sub	sp, #44	; 0x2c
   15b22:	497c      	ldr	r1, [pc, #496]	; (15d14 <ProprietaryProcedure@@Base+0x1f8>)
   15b24:	2500      	movs	r5, #0
   15b26:	9500      	str	r5, [sp, #0]
   15b28:	4604      	mov	r4, r0
   15b2a:	4479      	add	r1, pc
   15b2c:	462b      	mov	r3, r5
   15b2e:	f8df 81e8 	ldr.w	r8, [pc, #488]	; 15d18 <ProprietaryProcedure@@Base+0x1fc>
   15b32:	462a      	mov	r2, r5
   15b34:	f8df 91e4 	ldr.w	r9, [pc, #484]	; 15d1c <ProprietaryProcedure@@Base+0x200>
   15b38:	ae06      	add	r6, sp, #24
   15b3a:	f8df a1e4 	ldr.w	sl, [pc, #484]	; 15d20 <ProprietaryProcedure@@Base+0x204>
   15b3e:	200f      	movs	r0, #15
   15b40:	f8df b1e0 	ldr.w	fp, [pc, #480]	; 15d24 <ProprietaryProcedure@@Base+0x208>
   15b44:	f7f3 ef76 	blx	9a34 <prctl@plt>
   15b48:	44f8      	add	r8, pc
   15b4a:	4629      	mov	r1, r5
   15b4c:	44f9      	add	r9, pc
   15b4e:	2210      	movs	r2, #16
   15b50:	44fa      	add	sl, pc
   15b52:	4630      	mov	r0, r6
   15b54:	44fb      	add	fp, pc
   15b56:	f7f3 ea40 	blx	8fd8 <memset@plt>
   15b5a:	a807      	add	r0, sp, #28
   15b5c:	f7f3 ef70 	blx	9a40 <sigemptyset@plt>
   15b60:	4b71      	ldr	r3, [pc, #452]	; (15d28 <ProprietaryProcedure@@Base+0x20c>)
   15b62:	462a      	mov	r2, r5
   15b64:	9508      	str	r5, [sp, #32]
   15b66:	4631      	mov	r1, r6
   15b68:	447b      	add	r3, pc
   15b6a:	681b      	ldr	r3, [r3, #0]
   15b6c:	200a      	movs	r0, #10
   15b6e:	9306      	str	r3, [sp, #24]
   15b70:	f7f3 ef6c 	blx	9a4c <sigaction@plt>
   ...
   15d02:	f7f3 eade 	blx	92c0 <__android_log_print@plt>
   15d06:	f8c4 5304 	str.w	r5, [r4, #772]	; 0x304
   15d0a:	4628      	mov	r0, r5
   15d0c:	b00b      	add	sp, #44	; 0x2c
   15d0e:	e8bd 8ff0 	ldmia.w	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}

The addresses where this code is loaded can be easily computed by adding 0xf51f6000 to the file offsets shown in the first column. We see that a few calls to different external functions [6] are performed by ProprietaryProcedure, which is itself an exported symbol.

I restarted the debug session, added a breakpoint at the start of ProprietaryProcedure, right after stmdb saves the state, and checked the stack values:

(gdb) b *0xf520bb20
Breakpoint 1 at 0xf520bb20
(gdb) cont
...
Breakpoint 1, 0xf520bb20 in ?? ()
(gdb) p $sp
$1 = (void *) 0xf49dde4c
(gdb) x/16xw $sp
0xf49dde4c:	0xf49de460	0x0007df00	0x00000000	0xf49dde70
0xf49dde5c:	0xf49de694	0x00000000	0xf77e9000	0x00000000
0xf49dde6c:	0xf75b4491	0x00000000	0xf49de460	0x00000000
0xf49dde7c:	0x00000000	0xfd5b4eba	0xfe9dd4a3	0xf49de460

We can see that the stack contains something, including a return address that looks valid (0xf75b4491). Note also that the procedure must never touch this part of the stack, as it belongs to the caller of ProprietaryProcedure.

Now it is a simply a matter of bisecting the code between the beginning and the end of ProprietaryProcedure to find out where we are clobbering the stack. I will save you of developing here this tedious process. Instead, I will just show, that, in the end, it turned out that the call to sigemptyset() is the culprit [7]:

(gdb) b *0xf520bb5c
Breakpoint 1 at 0xf520bb5c
(gdb) b *0xf520bb60
Breakpoint 2 at 0xf520bb60
(gdb) run
Breakpoint 1, 0xf520bb5c in ?? ()
(gdb) x/16xw 0xf49dde4c
0xf49dde4c:	0xf49de460	0x0007df00	0x00000000	0xf49dde70
0xf49dde5c:	0xf49de694	0x00000000	0xf77e9000	0x00000000
0xf49dde6c:	0xf75b4491	0x00000000	0xf49de460	0x00000000
0xf49dde7c:	0x00000000	0xfd5b4eba	0xfe9dd4a3	0xf49de460
(gdb) cont
Continuing.
Breakpoint 2, 0xf520bb60 in ?? ()
(gdb) x/16xw 0xf49dde4c
0xf49dde4c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde5c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde6c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde7c:	0x00000000	0x00000000	0x00000000	0x00000000

Note here that I am printing the part of the stack not reserved by the function (0xf49dde4c is the value of the sp before execution of the line at offset 0x15b20, see the code).

What is going wrong here? Now, remember that at the beginning of the article I mentioned that we were using libhybris. libProprietary assumes a bionic environment, and the libc functions it calls are from bionic’s libc. However, libhybris has hooks for some bionic functions: for them bionic is not called, instead the hook is invoked. libhybris does this to avoid conflicts between bionic and glibc: for instance having two allocators fighting for process address space is a recipe for disaster, so malloc() and related functions are hooked and the hooks call in the end the glibc implementation. Signals related functions were hooked too, including sigemptyset(), and in this case the hook simply called glibc implementation.

I looked at glibc and bionic implementations, in both cases sigemptyset() is a very simple utility function that clears with memset() a sigset_t variable. All pointed to different definitions of sigset_t depending on the library. Definition turned out to be a bit messy when looking at the code as it depended on build time definitions, so I resorted to gdb to print the type. For a executable compiled for glibc, I saw

(gdb) ptype sigset_t
type = struct {
    unsigned long __val[32];
}

and for one using bionic

(gdb) ptype sigset_t
type = unsigned long

This finally confirms where the bug is, and explains it: we are overwriting the stack because libProprietary reserves in the stack memory for bionic’s sigset_t, while we are using glibc’s sigemptyset(), which uses a different definition for it. As this definition is much bigger, the stack gets overwritten after the call to memset(). And we get the crash later when trying to restore registers when the function returns.

After knowing this, the solution was simple: I removed the libhybris hooks for signal functions, recompiled it, and… all worked just fine, no crashes anymore!

However, this is not the final solution: as signals are shared resources, it makes sense to hook them in libhybris. But, to do it properly, the hooks have to translate types between bionic in glibc, thing that we were not doing (we were simply calling glibc implementation). That, however, is “just work”.

Of course I wondered why the heck a library that is kind of generic needs to mess around with signals, but hey, that is not my fault ;-).

Conclusions

I can say I learned several things while debugging this:

Not having the sources is terrible for debugging (well, I already knew this). Unfortunately not open sourcing the code is still a standard practice in part of the industry.
The most interesting technical bit here is IMHO that we need to be very cautious with the backtrace that debuggers shows after a crash. If you start to see things that do not make sense it is possible that registers or stack have been messed up and the real crash happens elsewhere. Bear in mind that the very first thing to do when a program crashes is to make sure that we know the exact point where that happens.
We have to be careful in ARM when disassembling, because if there is no debug information we could be seeing the wrong instruction set. We can check evenness of addresses used by bx/blx and of the lr to make sure we are in the right mode.
Some times taking a look at assembly code can help us when debugging, even when we have the sources. Note that if I had had the C sources I would have seen the crash happening right when returning from a function, and it might not have been that immediate to find out that the stack was messed up. The assembly clearly pointed to an overwritten stack.
Finally, I personally learned some bits of ARM architecture that I did not know, which was great.

Well, this is it. I hope you enjoyed the (lengthy, I know) article. Thanks for your reading!

[1] http://www.heyrick.co.uk/armwiki/The_Status_register
[2] We can get the same result by executing in gdb set arm fallback-mode thumb, but changing the register seemed more pedagogical here.
[3] http://infocenter.arm.com/help/topic/com.arm.doc.dui0068b/DUI0068.pdf
[4] http://reverseengineering.stackexchange.com/questions/6080/how-to-detect-thumb-mode-in-arm-disassembly
[5] http://www.mcternan.me.uk/ArmStackUnwinding/
[6] In fact the calls are to the PLT section, which is inside the library. The PLT calls in turn, by using addresses in the GOT data section, either directly the function or the dynamic loader, as we are doing lazy loading. See https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html, for instance.
[7] I had to use two breakpoints between consecutive instructions because the “ni” gdb command was not working well here.