Thursday, May 23, 2024

Investigate Obscure Kernel Symbols

Introduction

In the world of Linux kernel development, one often encounters intriguing anomalies that spark curiosity and investigation. My journey into exploring such peculiarities began with a previous deep dive into duplicate symbols within the Linux kernel. That exploration revealed how certain symbol names appear multiple times at different addresses. It was fun to discover that, among the many addresses sharing a name, there were also actual duplicates of the same function (same name and same body), even though the majority of the symbols sharing a name were actually different objects.

Building on that foundation, my current investigation delves into another set of mysterious symbols: those that appear to be aliases for a given address in the kernel (multiple names for the same address), but whose origins are not immediately obvious. Their presence had significant consequences for my latest effort. I'm currently adding a new feature to ks-nav, a nifty tool that generates diagrams from the kernel binary image. The goal is to provide kernel analysts with valuable insights into the kernel code, because who doesn't love a good kernel investigation? The tool already produces call tree diagrams and visualizes the subsystem interactions triggered by specific functions. My latest endeavor? To add functionality that reveals how global variables are used and shared among functions.

The topic of this blog post springs from analyzing the output of this tool. Here's an image produced by investigating the global symbols shared starting from the function hugetlb_vma_lock_alloc.

The Problem of Macro Expansion and Symbol Aliasing

Unlike the previous investigation, where the symbols were straightforward duplicates, the issue at hand involves a more complex phenomenon stemming from macro expansion. Macro expansion in the kernel can result in multiple symbols being generated with the same name, even though each of them is actually a different variable in memory. The same phenomenon can also originate from other compiler transformations of the code, such as inlining. Whenever it happens, the compiler must transform these names so that it can tell the same-named symbols apart. In practical terms, this just means that the compiler appends a number to the identifier to produce a new unique name. A simple example can clarify this:
$ cat h.c
#include <stdio.h>

int pippo(int i){
        static int paperino;
        if (i>=0) paperino=i;
        return paperino;
}
int pluto(int i){
        static int paperino;
        if (i>=0) paperino=i;
        return paperino;
}

int main(){
        printf("paperino= %d\n", pippo(55) );
        printf("paperino= %d\n", pippo(-1) );
        printf("paperino= %d\n", pluto(99) );
        printf("paperino= %d\n", pluto(-1) );
}
$ gcc -g h.c -o h
$ ./h
paperino= 55
paperino= 55
paperino= 99
paperino= 99
$ nm -n h
                 w __cxa_finalize@@GLIBC_2.2.5
                 w __gmon_start__
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
                 U __libc_start_main@@GLIBC_2.2.5
                 U printf@@GLIBC_2.2.5
0000000000001000 t _init
0000000000001060 T _start
0000000000001090 t deregister_tm_clones
00000000000010c0 t register_tm_clones
0000000000001100 t __do_global_dtors_aux
0000000000001140 t frame_dummy
0000000000001149 T pippo
000000000000116b T pluto
000000000000118d T main
0000000000001210 T __libc_csu_init
0000000000001280 T __libc_csu_fini
0000000000001288 T _fini
0000000000002000 R _IO_stdin_used
0000000000002014 r __GNU_EH_FRAME_HDR
00000000000021ac r __FRAME_END__
0000000000003db8 d __frame_dummy_init_array_entry
0000000000003db8 d __init_array_start
0000000000003dc0 d __do_global_dtors_aux_fini_array_entry
0000000000003dc0 d __init_array_end
0000000000003dc8 d _DYNAMIC
0000000000003fb8 d _GLOBAL_OFFSET_TABLE_
0000000000004000 D __data_start
0000000000004000 W data_start
0000000000004008 D __dso_handle
0000000000004010 B __bss_start
0000000000004010 b completed.8061
0000000000004010 D _edata
0000000000004010 D __TMC_END__
0000000000004014 b paperino.2316
0000000000004018 b paperino.2320
0000000000004020 B _end
$

This example shows how the conflict generated by two variables with the same name, paperino, forced the compiler to differentiate them by appending a number. It is less well known, but static local variables defined in functions are stored just like global variables: they live in the data or BSS section for the whole life of the program. Within each function's scope they do not conflict, but in the compilation unit's symbol table they do, and this is why the compiler mangles their names like that in the binary.

Back to the problem identified by the new ks-nav feature: in the diagram there are two global data symbols that are evidently mangled by the compiler, __key.11 and __already_done.1. Let's start with the simpler one, just to get familiar with the phenomenon: the __already_done family of symbols. The analysis showed that it comes from pr_warn_once. This is a macro that ensures the warning message is printed only once, and it tracks each warning instance separately using a dedicated variable. To illustrate how this works, let's track down how the pr_warn_once macro is expanded.

step 1

  #define pr_warn_once(fmt, ...)                                  \
        printk_once(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
  

step 2

  #define printk_once(fmt, ...)                                   \
        DO_ONCE_LITE(printk, fmt, ##__VA_ARGS__)
  

step 3

  #define DO_ONCE_LITE(func, ...)                                         \
        DO_ONCE_LITE_IF(true, func, ##__VA_ARGS__)
  

step 4

  #define DO_ONCE_LITE_IF(condition, func, ...)                           \
        ({                                                              \
                bool __ret_do_once = !!(condition);                     \
                                                                        \
                if (__ONCE_LITE_IF(__ret_do_once))                      \
                        func(__VA_ARGS__);                              \
                                                                        \
                unlikely(__ret_do_once);                                \
        })
  

step 5

  #define __ONCE_LITE_IF(condition)                                       \
        ({                                                              \
                static bool __section(".data.once") __already_done;     \
                bool __ret_cond = !!(condition);                        \
                bool __ret_once = false;                                \
                                                                        \
                if (unlikely(__ret_cond && !__already_done)) {          \
                        __already_done = true;                          \
                        __ret_once = true;                              \
                }                                                       \
                unlikely(__ret_once);                                   \
        })
  

The last expansion step finally shows where the symbol __already_done.1 comes from. It is easy to see that if more than one pr_warn_once is present in the same compilation unit, the compiler ends up with several __already_done instances, each referring to a different memory area, hence it is forced to change these names. This is how the __already_done.[0-9]+ symbol family is generated.

But if the compiler is so careful with names and addresses, how are the aliases I mentioned at the beginning even possible?

The Curious Case of __key Symbols

The __key family of symbols presents a different kind of anomaly. These symbols are closely tied to spin_lock_init and behave differently from the __already_done family. The crux of the issue lies in how the compiler handles structures with no members in C. In the Linux kernel, when the lockdep feature is disabled, the lock_class_key structure becomes an empty struct (when lockdep is enabled it has actual members). This means that when the compiler allocates such a variable in the data or BSS section, it effectively allocates a zero-sized object. As a result, the next object allocated immediately afterward ends up sharing the same address as the zero-sized object. This is the cause of these alias-like symbols. They are not meant to be aliases; they just happen to be.

The __key symbols thus become aliases purely because lock_class_key is zero-sized when lockdep is disabled. This behavior is both unintended and inconsistent: enabling lockdep gives the __key symbols a non-zero size, which prevents them from aliasing with other symbols.

Here is an example of zero-sized __key objects, compared with the same objects when lockdep is enabled:

as it appears when lockdep is disabled

$ cat System.map| grep  ffffffff83534360
ffffffff83534360 b __key.11
ffffffff83534360 b __key.12
ffffffff83534360 b static_call_initialized
$ readelf -Wa vmlinux |grep __key.1[12]
 11513: ffffffff83534360     0 OBJECT  LOCAL  DEFAULT   35 __key.12
 11514: ffffffff83534360     0 OBJECT  LOCAL  DEFAULT   35 __key.11
 19420: ffffffff83541710     0 OBJECT  LOCAL  DEFAULT   35 __key.12
 19421: ffffffff83541710     0 OBJECT  LOCAL  DEFAULT   35 __key.11
 45259: ffffffff835690b8     0 OBJECT  LOCAL  DEFAULT   35 __key.11
 47597: ffffffff83569b38     0 OBJECT  LOCAL  DEFAULT   35 __key.12
 47598: ffffffff83569b38     0 OBJECT  LOCAL  DEFAULT   35 __key.11
 51424: ffffffff8356dac0     0 OBJECT  LOCAL  DEFAULT   35 __key.12
  

readelf shows zero-sized objects, and the kernel's System.map shows the collision between symbols

as it appears when lockdep is enabled

$ readelf -Wa vmlinux |grep __key.1[12]
  6080: ffffffff837ae610    16 OBJECT  LOCAL  DEFAULT   35 __key.12
  6081: ffffffff837ae600    16 OBJECT  LOCAL  DEFAULT   35 __key.11
  8402: ffffffff842624d0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
  8693: ffffffff842626b0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
  8703: ffffffff842626c0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
  8975: ffffffff84262790    16 OBJECT  LOCAL  DEFAULT   35 __key.12
  8976: ffffffff84262780    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 10437: ffffffff84265030    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 12666: ffffffff8426ba60    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 12916: ffffffff8426bc20    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 20464: ffffffff8427b900    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 21593: ffffffff8427bb50    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 21594: ffffffff8427bb40    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 23931: ffffffff8427d240    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 23933: ffffffff8427d230    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 27527: ffffffff8428cf50    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 27902: ffffffff8428d050    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 27904: ffffffff8428d040    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 28675: ffffffff8428e1b0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 32713: ffffffff842a0b10    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 32714: ffffffff842a0b00    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 33307: ffffffff842a2d10    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 42165: ffffffff842adb60    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 42167: ffffffff842adb50    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 44247: ffffffff842ae950    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 44865: ffffffff842aee00    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 44887: ffffffff842aedf0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 45016: ffffffff842aeed0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 45017: ffffffff842aeec0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 48389: ffffffff842b0760    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 48390: ffffffff842b0750    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 49274: ffffffff842b1500    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 51779: ffffffff842b2820    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 51780: ffffffff842b2810    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 52060: ffffffff842b2cb0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 52061: ffffffff842b2ca0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 55853: ffffffff842b95c0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 62007: ffffffff842cf910    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 62009: ffffffff842cf900    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 63425: ffffffff842d6580    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 63426: ffffffff842d6570    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 64498: ffffffff842d7230    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 64499: ffffffff842d7220    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 66813: ffffffff842d8710    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 66814: ffffffff842d8700    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 69350: ffffffff842d88c0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 69351: ffffffff842d88b0    16 OBJECT  LOCAL  DEFAULT   35 __key.11

$ cat System.map| grep  static_call_initialized
ffffffff8426ba80 b static_call_initialized
$ cat System.map| grep  ffffffff8426ba80
ffffffff8426ba80 b static_call_initialized
  

as a consequence of the lockdep structures no longer being zero-sized, the address conflict disappears

Conclusion

The phenomena described above highlight how these lesser-known mechanisms induced a bug in the current implementation of the new ks-nav feature. It turns out ks-nav now needs a mechanism to detect zero-sized objects and exclude them from evaluation. There's still work to do, but at least now I know what to blame for the hiccup. Time to teach ks-nav a new trick!

Saturday, May 4, 2024

Navigating the Syzkaller Experience: A Bug Chasing Adventure

Heisenbug hunting

Assigned the task of chasing down a bug, a Heisenbug at that, I found myself delving into the intricate world of Syzkaller, the tool that had been used to report it. Prompted by a report from a quality engineer and armed with only a kernel splat and a tarball containing the documentation generated by Syzkaller, which was supposed to reproduce the bug, I embarked on a journey of discovery.

Syzkaller, for the uninitiated, is a powerful tool designed for system call fuzzing, with the aim of discovering new bugs in the Linux kernel. It uses a domain-specific language to describe syscalls, which would deserve a shout-out, but I'm not yet good enough to describe it in detail.

Back to the original task: unpacking the provided tarball revealed a sparse landscape, with only a file named corpus.db standing out from the others. Unclear on its significance, I initially assumed it to be a list of syscalls triggering the bug, only to learn that it holds a minimal set of syscall inputs that maximizes code coverage. Syzkaller is driven by code coverage to direct the fuzzing, and the file is the current state of the coverage it found.

Undeterred, I resolved to set up a Syzkaller instance to unravel its mysteries and anticipate the bug-hunting process. Building Syzkaller from source was the first step, a process that required a fair amount of time.

In order to have the system ready to start the test you need:

  • syzkaller binaries for the target architecture (host and test machine)
  • qemu for the test machine architecture
  • Kernel image prepared for the test machine architecture
  • userspace system image for the test machine architecture

Building syzkaller is a quite straightforward task:

git clone https://github.com/google/syzkaller.git
cd syzkaller
make HOSTOS=linux HOSTARCH=amd64 TARGETOS=linux TARGETARCH=arm64 -j$(nproc)

For the test, however, syzkaller also needs an ssh identity:

ssh-keygen -f ./id_rsa

and provide a configuration:

{
  "name": "QEMU-aarch64",
  "target": "linux/arm64",
  "http": ":56700",
  "workdir": "/home/alessandro/go/src/syzkaller/2test/corpus",
  "kernel_obj": "/home/alessandro/src/linux-6.8.9",
  "syzkaller": "/home/alessandro/go/src/syzkaller/",
  "enable_syscalls": ["seccomp", "geteuid", "getresuid", "getegid", "getgid", "getgroups", "getresgid"],
  "sshkey": "/home/alessandro/go/src/syzkaller/2test/id_rsa",
  "image": "/home/alessandro/src/buildroot-2024.02.1/output/images/rootfs.ext2",
  "procs": 8,
  "type": "qemu",
  "vm": {
    "count": 1,
    "qemu": "/usr/local/bin/qemu-system-aarch64",
    "kernel": "/home/alessandro/src/linux-6.8.9/arch/arm64/boot/Image",
    "cpu": 2,
    "mem": 2048
  }
}

With Syzkaller primed for action, I turned my attention to preparing a suitable testing environment. Opting for a qemu instance as the target and a kernel syscall as the quarry, I embarked on a test aimed at gaining insight into Syzkaller. In other words, I sought to observe Syzkaller's behavior when encountering a bug, without investing my entire life in searching for a new one. I chose the seccomp syscall due to its relatively low frequency in common workloads.

Seccomp, a mechanism for filtering system calls, served as the perfect candidate for my bug-hunting expedition. Armed with kernel code modifications, I prepared the groundwork for testing.

To expedite the bug-finding process, I intentionally created one. The following patch generates a crash in the seccomp syscall for 16 out of every 256 PIDs.

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index aca7b437882e..a0da35780eb8 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2071,6 +2071,7 @@ static long do_seccomp(unsigned int op, unsigned int flags,
 SYSCALL_DEFINE3(seccomp, unsigned int, op, unsigned int, flags,
 			 void __user *, uargs)
 {
+	if ((current->pid & 0xff)<0x10) BUG();
 	return do_seccomp(op, flags, uargs);
 }

Next came the task of creating the userspace, for which I turned to Buildroot, a nice tool for generating custom Linux userspaces. With it, I generated cpio and ext2 images to complement the kernel image.

After generating the userspace with Buildroot, my next objective was to create a kernel image that included the cpio as the initramfs. Since the scenario didn't require an elaborate userspace, my strategy was to merge the userspace directly into the kernel image. However, it seems I counted my chickens before they hatched, as the 'image' argument in the configuration is mandatory. This meant that embedding the initramfs into the kernel did not spare me the separate image, as I had hoped. Now, I'm considering proposing a patch for syzkaller to make the 'image' argument optional. For the record, if you're looking to embed the initramfs into the kernel, CONFIG_INITRAMFS_SOURCE is the kernel config option you'll need. Testing the image, however, presented a new challenge: incorporating the id_rsa.pub key to give Syzkaller access to the Linux image.

In tackling this obstacle, I explored two approaches: creating a new package or employing a post-build hook. Opting for the latter, I utilized the BR2_ROOTFS_POST_BUILD_SCRIPT symbol to integrate the required SSH key into the root filesystem.

The next test I ran presented a new hurdle: my setup made syzkaller panic. Debugging revealed that Syzkaller expects debugfs to be mounted at its customary location; if it is not there, it simply crashes. I then updated the post-build script to ensure debugfs was properly mounted.

Note for syzkaller users who want to use Buildroot to create the userspace: make sure debugfs is properly mounted!

Now that I had a working system at last, I delved into experimenting with Syzkaller, an impressive piece of software. However, upon examining the results, it became evident that the "bug reproduction" feature fell short in reproducing the bug I had intentionally inserted into the system. It seems that Syzkaller only considers the bug's dependency on syscall inputs, neglecting the kernel's internal state. The rather straightforward bug I introduced, where the bug's behavior depends on the PID value, renders the Syzkaller bug reproduction feature ineffective.

This is what you get when clicking the "reproducing" link:

Syzkaller hit 'kernel BUG in sys_seccomp' bug. The bug is not reproducible.