Thursday, June 20, 2024

Unidentified Kernel symbols: Syscall macro expansion

When navigating kernel symbols, it is not uncommon to encounter symbols that do not appear to be declared in the source code. This is often the case with symbols related to syscalls. We know that symbols are created during preprocessing (see my previous blog posts), but syscall declarations seem to be more complex. Let's look at an example:

#include "kernel.h"

SYSCALL_DEFINE4(test, unsigned long, first, unsigned long, second, unsigned long, third, unsigned long, fourth)
{
        printk("hello");
}

The nice function above, after being preprocessed, spawns a few other functions:

struct pt_regs;
static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
{
        return 0;
}
asmlinkage long __arm64_sys_test(const struct pt_regs *regs);
ALLOW_ERROR_INJECTION(__arm64_sys_test, ERRNO);
static long __se_sys_test(__MAP(4,__SC_LONG,unsigned long, first, unsigned long, second, unsigned long, third, unsigned long, fourth));
static inline long __do_sys_test(__MAP(4,__SC_DECL,unsigned long, first, unsigned long, second, unsigned long, third, unsigned long, fourth));
asmlinkage long __arm64_sys_test(const struct pt_regs *regs)
{
        return __se_sys_test(__MAP(4,__SC_ARGS ,,regs->regs[0],,regs->regs[1],,regs->regs[2] ,,regs->regs[3],,regs->regs[4],,regs->regs[5]));
}
static long __se_sys_test(__MAP(4,__SC_LONG,unsigned long, first, unsigned long, second, unsigned long, third, unsigned long, fourth))
{
        long ret = __do_sys_test(__MAP(4,__SC_CAST,unsigned long, first, unsigned long, second, unsigned long, third, unsigned long, fourth));
        __MAP(4,__SC_TEST,unsigned long, first, unsigned long, second, unsigned long, third, unsigned long, fourth);
        __PROTECT(4, ret,__MAP(4,__SC_ARGS,unsigned long, first, unsigned long, second, unsigned long, third, unsigned long, fourth));
        return ret;
}
static inline long __do_sys_test(__MAP(4,__SC_DECL,unsigned long, first, unsigned long, second, unsigned long, third, unsigned long, fourth))
{
        printk("hello");
}

This example is for the aarch64 architecture, but other architectures undergo the same processing. The main function called when a syscall is invoked is __arm64_sys_test, which in turn calls __se_sys_test, and then __do_sys_test. Please note that the user code is part of this last function. As we know, compilers perform complex optimizations when building code and do not always honor the inline specifier. This is why, when looking at symbols (for example, in kallsyms), you may or may not see __do_sys_* functions. This has a practical consequence when reading backtraces:

If in a kernel splat backtrace you happen to see:
__do_sys_set_mempolicy_home_node+0xdc/0x1e4
__arm64_sys_set_mempolicy_home_node+0x20/0x2c
and in another build of the same kernel, you only see:
__arm64_sys_set_mempolicy_home_node+0x1d0/0x360
The error might have actually occurred at the same line of the source code.

Thursday, May 23, 2024

Investigate Obscure Kernel Symbols

Introduction

In the world of Linux kernel development, one often encounters intriguing anomalies that spark curiosity and investigation. My journey into such peculiarities began with a previous deep dive into duplicate symbols within the Linux kernel. That exploration revealed how certain symbol names appear multiple times at different addresses. It was fun to discover that, among the many addresses sharing a name, some were actual duplicates of the same function (same name and body), even though the majority of those same-named symbols were different objects.

Building on that foundation, my current investigation delves into another set of mysterious symbols: those that appear to be aliases for a given address in the kernel (multiple names for the same address), but whose origins are not immediately obvious. Their presence had significant consequences for my latest effort. I'm currently adding a new feature to ks-nav, a nifty tool that generates diagrams from the kernel binary image. The goal is to provide kernel analysts with valuable insights into the kernel code, because who doesn't love a good kernel investigation? The tool already produces call tree diagrams and visualizes the subsystem interactions triggered by specific functions. My latest endeavor? To add functionality that reveals how global variables are used and shared among functions. The topic of this blog post springs from analyzing the output of this tool. Here's an image produced by investigating the global symbols shared starting from the function hugetlb_vma_lock_alloc.

The Problem of Macro Expansion and Symbol Aliasing

Unlike the previous investigation, where symbols were straightforward duplicates, the issue at hand involves a more complex phenomenon stemming from macro expansion. Macro expansion in the kernel can result in multiple symbols being generated with the same name, even though each of them is a different variable in memory. The same phenomenon can originate from other transformations of the code, such as inlining. When it happens, the compiler must rename these symbols so that it can tell them apart. In practical terms, the compiler appends a number to the identifier to produce a new unique name. A simple example can clarify this:
$ cat h.c
#include <stdio.h>

int pippo(int i){
        static int paperino;
        if (i>=0) paperino=i;
        return paperino;
}
int pluto(int i){
        static int paperino;
        if (i>=0) paperino=i;
        return paperino;
}

int main(){
        printf("paperino= %d\n", pippo(55) );
        printf("paperino= %d\n", pippo(-1) );
        printf("paperino= %d\n", pluto(99) );
        printf("paperino= %d\n", pluto(-1) );
}
$ gcc -g h.c -o h
$ ./h
paperino= 55
paperino= 55
paperino= 99
paperino= 99
$ nm -n h
                 w __cxa_finalize@@GLIBC_2.2.5
                 w __gmon_start__
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
                 U __libc_start_main@@GLIBC_2.2.5
                 U printf@@GLIBC_2.2.5
0000000000001000 t _init
0000000000001060 T _start
0000000000001090 t deregister_tm_clones
00000000000010c0 t register_tm_clones
0000000000001100 t __do_global_dtors_aux
0000000000001140 t frame_dummy
0000000000001149 T pippo
000000000000116b T pluto
000000000000118d T main
0000000000001210 T __libc_csu_init
0000000000001280 T __libc_csu_fini
0000000000001288 T _fini
0000000000002000 R _IO_stdin_used
0000000000002014 r __GNU_EH_FRAME_HDR
00000000000021ac r __FRAME_END__
0000000000003db8 d __frame_dummy_init_array_entry
0000000000003db8 d __init_array_start
0000000000003dc0 d __do_global_dtors_aux_fini_array_entry
0000000000003dc0 d __init_array_end
0000000000003dc8 d _DYNAMIC
0000000000003fb8 d _GLOBAL_OFFSET_TABLE_
0000000000004000 D __data_start
0000000000004000 W data_start
0000000000004008 D __dso_handle
0000000000004010 B __bss_start
0000000000004010 b completed.8061
0000000000004010 D _edata
0000000000004010 D __TMC_END__
0000000000004014 b paperino.2316
0000000000004018 b paperino.2320
0000000000004020 B _end
$

This example shows how the conflict between two variables with the same name, paperino, forced the compiler to differentiate them by appending a number. It is less well known, but static local variables defined in functions are stored like global variables: they have static storage duration and live in the data or BSS sections. Within each function's scope they do not conflict, but in the compilation unit's symbol namespace they do, which is why the compiler mangles their names in the binary.

Back to the problem identified by the new ks-nav feature: in the diagram, there are two global data symbols that are evidently mangled by the compiler, __key.11 and __already_done.1. Let's start with the simpler one, just to get familiar with the phenomenon: the __already_done family of symbols. The analysis showed that it comes from pr_warn_once, a macro that ensures the warning message is printed only once. Each warning site is tracked separately using a dedicated variable. To illustrate how this works, let's track down how the pr_warn_once macro is expanded.

step 1

  #define pr_warn_once(fmt, ...)                                  \
        printk_once(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
  

step 2

  #define printk_once(fmt, ...)                                   \
        DO_ONCE_LITE(printk, fmt, ##__VA_ARGS__)
  

step 3

  #define DO_ONCE_LITE(func, ...)                                         \
        DO_ONCE_LITE_IF(true, func, ##__VA_ARGS__)
  

step 4

  #define DO_ONCE_LITE_IF(condition, func, ...)                           \
        ({                                                              \
                bool __ret_do_once = !!(condition);                     \
                                                                        \
                if (__ONCE_LITE_IF(__ret_do_once))                      \
                        func(__VA_ARGS__);                              \
                                                                        \
                unlikely(__ret_do_once);                                \
        })
  

step 5

  #define __ONCE_LITE_IF(condition)                                       \
        ({                                                              \
                static bool __section(".data.once") __already_done;     \
                bool __ret_cond = !!(condition);                        \
                bool __ret_once = false;                                \
                                                                        \
                if (unlikely(__ret_cond && !__already_done)) {          \
                        __already_done = true;                          \
                        __ret_once = true;                              \
                }                                                       \
                unlikely(__ret_once);                                   \
        })
  

The last expansion step finally shows where the symbol __already_done.1 comes from. If more than one pr_warn_once is present in the same compilation unit, the compiler ends up with several __already_done instances, each referring to a different memory area, so it is forced to rename them. This is how the __already_done.[0-9]+ symbol family is generated.

But if the compiler is so careful with names and addresses, how are the aliases I mentioned at the beginning even possible?

The Curious Case of __key Symbols

The __key family of symbols presents a different kind of anomaly. These symbols are closely tied to the spin_lock_init macro and behave differently from the __already_done family. The crux of the issue lies in how the compiler handles structures with no members, which GCC accepts as an extension to C. In the Linux kernel, when the lockdep feature is disabled, the lock_class_key structure becomes an empty struct (it has members only when lockdep is enabled). This means that when the compiler allocates such a variable in the data or BSS sections, it effectively allocates a zero-sized object. As a result, the next object allocated immediately afterward ends up sharing the same address as the zero-sized object. This is the cause of these alias-like symbols. They are not meant to be aliases; they just happen to be.

The __key symbols thus become aliases, purely due to the lock_class_key zero-sized nature when lockdep is disabled. This behavior is both unintended and inconsistent, as enabling lockdep causes the __key symbols to have a non-zero size, thereby preventing them from aliasing with other symbols.

Here is an example of zero-sized __key objects, compared with the same objects when lockdep is enabled:

as it appears when lockdep is disabled

$ cat System.map| grep  ffffffff83534360
ffffffff83534360 b __key.11
ffffffff83534360 b __key.12
ffffffff83534360 b static_call_initialized
$ readelf -Wa vmlinux |grep __key.1[12]
 11513: ffffffff83534360     0 OBJECT  LOCAL  DEFAULT   35 __key.12
 11514: ffffffff83534360     0 OBJECT  LOCAL  DEFAULT   35 __key.11
 19420: ffffffff83541710     0 OBJECT  LOCAL  DEFAULT   35 __key.12
 19421: ffffffff83541710     0 OBJECT  LOCAL  DEFAULT   35 __key.11
 45259: ffffffff835690b8     0 OBJECT  LOCAL  DEFAULT   35 __key.11
 47597: ffffffff83569b38     0 OBJECT  LOCAL  DEFAULT   35 __key.12
 47598: ffffffff83569b38     0 OBJECT  LOCAL  DEFAULT   35 __key.11
 51424: ffffffff8356dac0     0 OBJECT  LOCAL  DEFAULT   35 __key.12
  

readelf shows zero-sized objects, and the kernel's System.map shows the collision between symbols

as it appears when lockdep is enabled

$ readelf -Wa vmlinux |grep __key.1[12]
  6080: ffffffff837ae610    16 OBJECT  LOCAL  DEFAULT   35 __key.12
  6081: ffffffff837ae600    16 OBJECT  LOCAL  DEFAULT   35 __key.11
  8402: ffffffff842624d0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
  8693: ffffffff842626b0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
  8703: ffffffff842626c0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
  8975: ffffffff84262790    16 OBJECT  LOCAL  DEFAULT   35 __key.12
  8976: ffffffff84262780    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 10437: ffffffff84265030    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 12666: ffffffff8426ba60    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 12916: ffffffff8426bc20    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 20464: ffffffff8427b900    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 21593: ffffffff8427bb50    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 21594: ffffffff8427bb40    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 23931: ffffffff8427d240    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 23933: ffffffff8427d230    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 27527: ffffffff8428cf50    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 27902: ffffffff8428d050    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 27904: ffffffff8428d040    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 28675: ffffffff8428e1b0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 32713: ffffffff842a0b10    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 32714: ffffffff842a0b00    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 33307: ffffffff842a2d10    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 42165: ffffffff842adb60    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 42167: ffffffff842adb50    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 44247: ffffffff842ae950    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 44865: ffffffff842aee00    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 44887: ffffffff842aedf0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 45016: ffffffff842aeed0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 45017: ffffffff842aeec0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 48389: ffffffff842b0760    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 48390: ffffffff842b0750    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 49274: ffffffff842b1500    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 51779: ffffffff842b2820    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 51780: ffffffff842b2810    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 52060: ffffffff842b2cb0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 52061: ffffffff842b2ca0    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 55853: ffffffff842b95c0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 62007: ffffffff842cf910    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 62009: ffffffff842cf900    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 63425: ffffffff842d6580    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 63426: ffffffff842d6570    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 64498: ffffffff842d7230    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 64499: ffffffff842d7220    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 66813: ffffffff842d8710    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 66814: ffffffff842d8700    16 OBJECT  LOCAL  DEFAULT   35 __key.11
 69350: ffffffff842d88c0    16 OBJECT  LOCAL  DEFAULT   35 __key.12
 69351: ffffffff842d88b0    16 OBJECT  LOCAL  DEFAULT   35 __key.11

$ cat System.map| grep  static_call_initialized
ffffffff8426ba80 b static_call_initialized
$ cat System.map| grep  ffffffff8426ba80
ffffffff8426ba80 b static_call_initialized
  

as a consequence of the lockdep structures no longer being zero-sized, the address conflict disappears

Conclusion

The phenomena described above highlight how these lesser-known mechanisms induced a bug in the current implementation of the new ks-nav feature. It turns out ks-nav now needs a mechanism to detect zero-sized objects and skip them from evaluation. There's still work to do, but at least now I know what to blame for the hiccup. Time to teach ks-nav a new trick!

Saturday, May 4, 2024

Navigating the Syzkaller Experience: A Bug Chasing Adventure

Heisenbug hunting

Assigned the task of chasing down a bug, a Heisenbug, I found myself delving into the intricate world of Syzkaller, the tool that had been used to report it. Prompted by a report from a quality engineer and armed with only a kernel splat and a tarball containing the documentation generated by Syzkaller that was supposed to reproduce the bug, I embarked on a journey of discovery.

Syzkaller, for the uninitiated, is a powerful tool designed for system call fuzzing, with the aim of discovering new bugs in the Linux kernel. It uses a domain-specific language to describe syscalls, which would deserve a shout-out, but I'm not yet good enough to describe it in detail.
Back to the original task: unpacking the provided tarball revealed a sparse landscape, with only a file named corpus.db standing out from the others. Unclear on its significance, I initially assumed it to be a list of syscalls triggering the bug, only to learn that it holds a minimal set of syscall inputs that maximizes code coverage. Syzkaller uses code coverage to direct the fuzzing, and the file is the current state of the coverage it found.

Undeterred, I resolved to set up a Syzkaller instance to unravel its mysteries and anticipate the bug-hunting process. Building Syzkaller from source was the first step, a process that required a fair amount of time.

In order to have the system ready to start the test you need:

  • syzkaller binaries for the target architecture (host and test machine)
  • qemu for the test machine architecture
  • Kernel image prepared for the test machine architecture
  • userspace system image for the test machine architecture

Building syzkaller is a fairly straightforward task:

git clone https://github.com/google/syzkaller.git
cd syzkaller
make HOSTOS=linux HOSTARCH=amd64 TARGETOS=linux TARGETARCH=arm64 -j$(nproc)

For the test, however, an SSH identity for syzkaller is also needed:

ssh-keygen -f ./id_rsa

and provide a configuration:

{
    "name": "QEMU-aarch64",
    "target": "linux/arm64",
    "http": ":56700",
    "workdir": "/home/alessandro/go/src/syzkaller/2test/corpus",
    "kernel_obj": "/home/alessandro/src/linux-6.8.9",
    "syzkaller": "/home/alessandro/go/src/syzkaller/",
    "enable_syscalls": ["seccomp", "geteuid", "getresuid", "getegid", "getgid", "getgroups", "getresgid"],
    "sshkey": "/home/alessandro/go/src/syzkaller/2test/id_rsa",
    "image": "/home/alessandro/src/buildroot-2024.02.1/output/images/rootfs.ext2",
    "procs": 8,
    "type": "qemu",
    "vm": {
        "count": 1,
        "qemu": "/usr/local/bin/qemu-system-aarch64",
        "kernel": "/home/alessandro/src/linux-6.8.9/arch/arm64/boot/Image",
        "cpu": 2,
        "mem": 2048
    }
}

With Syzkaller primed for action, I turned my attention to preparing a suitable testing environment. Opting for a qemu instance as the target and a kernel syscall as the quarry, I embarked on a test aimed at gaining insight into Syzkaller. In other words, I sought to observe Syzkaller's behavior when encountering a bug, without investing my entire life in searching for a new one. I chose the seccomp syscall due to its relatively low frequency in common workloads.

Seccomp, a mechanism for filtering system calls, served as the perfect candidate for my bug-hunting expedition. Armed with kernel code modifications, I prepared the groundwork for testing.

To expedite the bug-finding process, I intentionally created one. The following patch makes the seccomp syscall crash for 16 out of every 256 PIDs.

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index aca7b437882e..a0da35780eb8 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2071,6 +2071,7 @@ static long do_seccomp(unsigned int op, unsigned int flags,
 SYSCALL_DEFINE3(seccomp, unsigned int, op, unsigned int, flags,
                          void __user *, uargs)
 {
+       if ((current->pid & 0xff)<0x10) BUG();
        return do_seccomp(op, flags, uargs);
 }

Next came the task of creating the userspace, for which I turned to Buildroot, a nice tool for generating custom Linux userspaces. With it, I generated cpio and ext2 images to complement the kernel image.

After generating the userspace with Buildroot, my next objective was to create a kernel image that included the cpio as the initramfs. Since the scenario didn't require an elaborate userspace, my strategy was to merge the userspace directly into the kernel image. However, it seemed I counted my chickens before they hatched, as the 'image' argument in the configuration is mandatory. This meant that embedding the initramfs into the kernel did not help as I had hoped. Now, I'm considering proposing a patch for syzkaller to make the 'image' argument optional. For the record, if you're looking to embed the initramfs into the kernel, CONFIG_INITRAMFS_SOURCE is the kernel config you'll need. Testing the image, however, presented a new challenge: incorporating the id_rsa.pub key to give Syzkaller access to the Linux image.

In tackling this obstacle, I explored two approaches: creating a new package or employing a post-build hook. Opting for the latter, I utilized the BR2_ROOTFS_POST_BUILD_SCRIPT symbol to integrate the required SSH key into the root filesystem.

The next test I ran presented a new hurdle: my setup made syzkaller panic. Debugging revealed that Syzkaller expects debugfs to be mounted at its customary location; if it is not there, it simply crashes. I then updated the post-build script to ensure debugfs was properly mounted.

Note for syzkaller users who want to use buildroot to create the userspace: watch for debugfs to be properly mounted!

Now that I had a working system at last, I delved into experimenting with Syzkaller, an impressive piece of software. However, upon examining the results, it became evident that the "bug reproduction" feature fell short in reproducing the bug I had intentionally inserted into the system. It seems that Syzkaller only considers the bug's dependency on syscall inputs, neglecting the kernel's internal state. The rather straightforward bug I introduced, where the bug's behavior depends on the PID value, renders the Syzkaller bug reproduction feature ineffective.

This is what you get when clicking the "reproducing" link:

Syzkaller hit 'kernel BUG in sys_seccomp' bug. The bug is not reproducible.

Sunday, March 3, 2024

How noexecstack became a Stack of Confusion

intro

Sometimes, what we perceive as a constant in our programming environments can undergo unexpected shifts, challenging our assumptions. In my journey to understand the mechanics behind stack overflow exploits, I encountered such a shift when grappling with the intricacies of the stack. Initially, as I delved into these techniques using machines devoid of MMUs, namely, plain m68k and x86 real mode, I paid little heed to memory flags. In those days, hackers could seamlessly inject binary payloads onto the stack, redirect program flow to the designated stack address housing their payloads, and execute their exploits with ease.

However, after setting aside these experiments for a time and revisiting them on early Linux machines, I encountered a surprising obstacle around 2005: the once-reliable technique suddenly ceased to function. Upon investigation, I came to the realization that assuming the executability of the stack was no longer tenable. Henceforth, I found myself grappling with the repercussions of this change, as the default behavior of compilers had shifted to render the stack non-executable. Or so I believed, until a recent inquiry from a client prompted me to revisit this assumption, revealing a truth starkly different from my prior expectations.

chapter 1 - What it seems like

So, what do we have here? Since 2005, something peculiar has emerged. When compiling a simple, trivial program using the C compiler, we observe the following:

$ echo -e "#include <stdio.h>\nint main(){printf(\"hello\\\n\");}"| gcc -x c -o hello - ;readelf -l hello
Elf file type is EXEC (Executable file)
Entry point 0x4004a0
There are 9 program headers, starting at offset 64
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001f8 0x00000000000001f8  R      0x8
  INTERP         0x0000000000000238 0x0000000000400238 0x0000000000400238
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000768 0x0000000000000768  R E    0x200000
  LOAD           0x0000000000000e00 0x0000000000600e00 0x0000000000600e00
                 0x0000000000000224 0x0000000000000228  RW     0x200000
  DYNAMIC        0x0000000000000e10 0x0000000000600e10 0x0000000000600e10
                 0x00000000000001d0 0x00000000000001d0  RW     0x8
  NOTE           0x0000000000000254 0x0000000000400254 0x0000000000400254
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_EH_FRAME   0x0000000000000640 0x0000000000400640 0x0000000000400640
                 0x000000000000003c 0x000000000000003c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000000e00 0x0000000000600e00 0x0000000000600e00
                 0x0000000000000200 0x0000000000000200  R      0x1
 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame
   03     .init_array .fini_array .dynamic .got .got.plt .data .bss
   04     .dynamic
   05     .note.ABI-tag .note.gnu.build-id
   06     .eh_frame_hdr
   07
   08     .init_array .fini_array .dynamic .got

Not much needs to be said; the stack lacks the executable flag: the GNU_STACK program header shows RW. Any attempt to execute code from this space inevitably results in a graceful crash, marked by the familiar segmentation fault (SIGSEGV).

Conversely, if our intention is to create an executable stack, we must explicitly instruct the compiler to do so.

$ echo -e "#include <stdio.h>\nint main(){printf(\"hello\\\n\");}"| gcc -x c -z execstack -o hello - ;readelf -l hello
Elf file type is EXEC (Executable file)
Entry point 0x4004a0
There are 9 program headers, starting at offset 64
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001f8 0x00000000000001f8  R      0x8
  INTERP         0x0000000000000238 0x0000000000400238 0x0000000000400238
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000768 0x0000000000000768  R E    0x200000
  LOAD           0x0000000000000e00 0x0000000000600e00 0x0000000000600e00
                 0x0000000000000224 0x0000000000000228  RW     0x200000
  DYNAMIC        0x0000000000000e10 0x0000000000600e10 0x0000000000600e10
                 0x00000000000001d0 0x00000000000001d0  RW     0x8
  NOTE           0x0000000000000254 0x0000000000400254 0x0000000000400254
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_EH_FRAME   0x0000000000000640 0x0000000000400640 0x0000000000400640
                 0x000000000000003c 0x000000000000003c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RWE    0x10
  GNU_RELRO      0x0000000000000e00 0x0000000000600e00 0x0000000000600e00
                 0x0000000000000200 0x0000000000000200  R      0x1
 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame
   03     .init_array .fini_array .dynamic .got .got.plt .data .bss
   04     .dynamic
   05     .note.ABI-tag .note.gnu.build-id
   06     .eh_frame_hdr
   07
   08     .init_array .fini_array .dynamic .got

Upon inspection, the RWE flags in the GNU_STACK program header confirm the presence of an executable stack.

In summary, the probability of encountering a new binary with an executable stack in contemporary settings is close to zero. Such instances may occur only if someone utilizes an outdated compiler or requires an executable stack for specific reasons. So, why would anyone desire an executable stack?

Perhaps solely to revisit the methods employed in old-fashioned stack overflow exploits!

Chapter 2 - Things are never as easy as they seem

Recently, a customer posed what initially appeared to be a trivial question: “What flag should I use to ensure that the stack remains non-executable?” I brushed it off as a simple matter, assuming that no action was needed since it was the default behavior.

However, the response from a knowledgeable individual surprised me: simply use -z noexecstack. This prompted me to question why such a flag even existed. After all, if the default behavior is to have a non-executable stack, what purpose does this flag serve?

Reflecting on past encounters with this flag, I had rationalized its existence by speculating, “Perhaps it’s necessary for exotic architectures where the default is to have an executable stack”.

However, I soon realized that the reality is far more complex than it initially seemed.

Let’s begin by clarifying: compilers do not manipulate stack flags; this task falls under the responsibility of the linker. The final executable is created by linking together all the object files generated by the compiler.

During the creation of ELF sections, the linker scans the input files for a specific section named .note.GNU-stack. This section conveys whether an executable stack is required or not.

According to the linker’s manual page, if an input file lacks a .note.GNU-stack section, then the default behavior is architecture-specific.

As I couldn’t find where this default behavior is specified, let’s conduct a couple of tests. You can find a collection of tests I’ve prepared in this repository.

Consider the gcc/asm_function executable file, which is a simple C executable that includes a basic function from an assembly file. Below is the relevant portion of the Makefile used to build it:

gcc/asm_function.o: src/asm_function.S
        gcc -g -c -o gcc/asm_function.o src/asm_function.S
gcc/test_asm.o: src/test_asm.c
        gcc -g -c -o gcc/test_asm.o src/test_asm.c
gcc/test_asm: gcc/test_asm.o gcc/asm_function.o
        gcc -g gcc/test_asm.o gcc/asm_function.o -o gcc/asm_function

Upon examining the generated object file, you’ll notice the absence of the .note.GNU-stack section. However, upon inspecting the resultant executable, you’ll observe that the stack is indeed marked as executable.

$ readelf -S gcc/asm_function.o
There are 15 section headers, starting at offset 0x3b8:
Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       0000000000000006  0000000000000000  AX       0     0     1
  [ 2] .data             PROGBITS         0000000000000000  00000046
       0000000000000000  0000000000000000  WA       0     0     1
  [ 3] .bss              NOBITS           0000000000000000  00000046
       0000000000000000  0000000000000000  WA       0     0     1
  [ 4] .debug_line       PROGBITS         0000000000000000  00000046
       0000000000000045  0000000000000000           0     0     1
  [ 5] .rela.debug_line  RELA             0000000000000000  00000248
       0000000000000018  0000000000000018   I      12     4     8
  [ 6] .debug_info       PROGBITS         0000000000000000  0000008b
       000000000000002e  0000000000000000           0     0     1
  [ 7] .rela.debug_info  RELA             0000000000000000  00000260
       00000000000000a8  0000000000000018   I      12     6     8
  [ 8] .debug_abbrev     PROGBITS         0000000000000000  000000b9
       0000000000000014  0000000000000000           0     0     1
  [ 9] .debug_aranges    PROGBITS         0000000000000000  000000d0
       0000000000000030  0000000000000000           0     0     16
  [10] .rela.debug_arang RELA             0000000000000000  00000308
       0000000000000030  0000000000000018   I      12     9     8
  [11] .debug_str        PROGBITS         0000000000000000  00000100
       0000000000000045  0000000000000001  MS       0     0     1
  [12] .symtab           SYMTAB           0000000000000000  00000148
       00000000000000f0  0000000000000018          13     9     8
  [13] .strtab           STRTAB           0000000000000000  00000238
       0000000000000009  0000000000000000           0     0     1
  [14] .shstrtab         STRTAB           0000000000000000  00000338
       000000000000007b  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)
$ readelf -l gcc/asm_function
Elf file type is DYN (Shared object file)
Entry point 0x1040
There are 11 program headers, starting at offset 64
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x0000000000000268 0x0000000000000268  R      0x8
  INTERP         0x00000000000002a8 0x00000000000002a8 0x00000000000002a8
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000530 0x0000000000000530  R      0x1000
  LOAD           0x0000000000001000 0x0000000000001000 0x0000000000001000
                 0x00000000000001d5 0x00000000000001d5  R E    0x1000
  LOAD           0x0000000000002000 0x0000000000002000 0x0000000000002000
                 0x0000000000000130 0x0000000000000130  R      0x1000
  LOAD           0x0000000000002df0 0x0000000000003df0 0x0000000000003df0
                 0x0000000000000220 0x0000000000000228  RW     0x1000
  DYNAMIC        0x0000000000002e00 0x0000000000003e00 0x0000000000003e00
                 0x00000000000001c0 0x00000000000001c0  RW     0x8
  NOTE           0x00000000000002c4 0x00000000000002c4 0x00000000000002c4
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_EH_FRAME   0x0000000000002004 0x0000000000002004 0x0000000000002004
                 0x000000000000003c 0x000000000000003c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RWE    0x10
  GNU_RELRO      0x0000000000002df0 0x0000000000003df0 0x0000000000003df0
                 0x0000000000000210 0x0000000000000210  R      0x1
 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn
   03     .init .plt .plt.got .text .fini
   04     .rodata .eh_frame_hdr .eh_frame
   05     .init_array .fini_array .dynamic .got .data .bss
   06     .dynamic
   07     .note.gnu.build-id .note.ABI-tag
   08     .eh_frame_hdr
   09
   10     .init_array .fini_array .dynamic .got

This suggests that, on x86_64, the stack defaults to executable when the .note.GNU-stack section is missing. Doing the same for aarch64 produces the following:

$ readelf -S gcc/asm_function.aarch64.o
There are 15 section headers, starting at offset 0x400:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       0000000000000010  0000000000000000  AX       0     0     8
  [ 2] .data             PROGBITS         0000000000000000  00000050
       0000000000000000  0000000000000000  WA       0     0     1
  [ 3] .bss              NOBITS           0000000000000000  00000050
       0000000000000000  0000000000000000  WA       0     0     1
  [ 4] .debug_line       PROGBITS         0000000000000000  00000050
       000000000000004c  0000000000000000           0     0     1
  [ 5] .rela.debug_line  RELA             0000000000000000  00000290
       0000000000000018  0000000000000018   I      12     4     8
  [ 6] .debug_info       PROGBITS         0000000000000000  0000009c
       000000000000002e  0000000000000000           0     0     1
  [ 7] .rela.debug_info  RELA             0000000000000000  000002a8
       00000000000000a8  0000000000000018   I      12     6     8
  [ 8] .debug_abbrev     PROGBITS         0000000000000000  000000ca
       0000000000000014  0000000000000000           0     0     1
  [ 9] .debug_aranges    PROGBITS         0000000000000000  000000e0
       0000000000000030  0000000000000000           0     0     16
  [10] .rela.debug_arang RELA             0000000000000000  00000350
       0000000000000030  0000000000000018   I      12     9     8
  [11] .debug_str        PROGBITS         0000000000000000  00000110
       000000000000004d  0000000000000001  MS       0     0     1
  [12] .symtab           SYMTAB           0000000000000000  00000160
       0000000000000120  0000000000000018          13    11     8
  [13] .strtab           STRTAB           0000000000000000  00000280
       000000000000000f  0000000000000000           0     0     1
  [14] .shstrtab         STRTAB           0000000000000000  00000380
       000000000000007b  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  p (processor specific)

$ readelf -l gcc/asm_function.aarch64

Elf file type is DYN (Shared object file)
Entry point 0x610
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000001f8 0x00000000000001f8  R      0x8
  INTERP         0x0000000000000238 0x0000000000000238 0x0000000000000238
                 0x000000000000001b 0x000000000000001b  R      0x1
      [Requesting program interpreter: /lib/ld-linux-aarch64.so.1]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000000090c 0x000000000000090c  R E    0x10000
  LOAD           0x0000000000000d88 0x0000000000010d88 0x0000000000010d88
                 0x0000000000000288 0x0000000000000290  RW     0x10000
  DYNAMIC        0x0000000000000d98 0x0000000000010d98 0x0000000000010d98
                 0x00000000000001f0 0x00000000000001f0  RW     0x8
  NOTE           0x0000000000000254 0x0000000000000254 0x0000000000000254
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_EH_FRAME   0x00000000000007e0 0x00000000000007e0 0x00000000000007e0
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000000d88 0x0000000000010d88 0x0000000000010d88
                 0x0000000000000278 0x0000000000000278  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame
   03     .init_array .fini_array .dynamic .got .data .bss
   04     .dynamic
   05     .note.gnu.build-id .note.ABI-tag
   06     .eh_frame_hdr
   07
   08     .init_array .fini_array .dynamic .got

Here the result is the opposite: GNU_STACK is flagged RW, without the E, suggesting that on this architecture the stack defaults to non-executable.
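The check done by eye on the readelf -l output can also be scripted. Below is a minimal sketch, assuming a 64-bit ELF image already loaded in memory and a host with the same byte order as the file; gnu_stack_flags and stack_is_executable are hypothetical helper names of mine, not part of any tool mentioned in this post.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the p_flags of the PT_GNU_STACK program header of a 64-bit ELF
 * image held in memory, or -1 if the image is not a 64-bit ELF or has no
 * such header. Assumption: host and file share the same endianness.
 */
long gnu_stack_flags(const unsigned char *image, size_t len)
{
    Elf64_Ehdr eh;

    if (len < sizeof eh)
        return -1;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;

    for (size_t i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        uint64_t off = eh.e_phoff + (uint64_t)i * eh.e_phentsize;

        if (off > len || len - off < sizeof ph)
            return -1;              /* truncated program header table */
        memcpy(&ph, image + off, sizeof ph);
        if (ph.p_type == PT_GNU_STACK)
            return (long)ph.p_flags;
    }
    return -1;                      /* no GNU_STACK entry at all */
}

/* 1 if the stack is requested executable, 0 if not, -1 if unknown. */
int stack_is_executable(const unsigned char *image, size_t len)
{
    long flags = gnu_stack_flags(image, len);

    return flags < 0 ? -1 : (flags & PF_X) != 0;
}
```

For the two executables above, I would expect this to report 1 for the x86_64 build (RWE) and 0 for the aarch64 one (RW), matching what readelf prints.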

chapter 3 - is it all?

Returning to the original topic, based on the previous chapter's observations, it is possible to deduce that the two architectures, x86_64 and aarch64, have different defaults regarding executable stacks. But is this the extent of the matter?

Apparently not. There are instances where the compiler needs to generate code and execute it, often utilizing the stack for this purpose. In such cases, the resulting executable file will indeed have an executable stack.

There might be other cases out there, but after a thorough search, I couldn’t find anything except for the GNU C extension “nested functions.” It’s possible that not many people are aware of this feature - I certainly wasn’t until recently. Nested functions can be written in C, but only when compiling with GCC; clang does not support them.

Nested functions are functions defined within the body of another function. These inner functions have access to the variables and parameters of the enclosing function and can only be invoked within its scope. GCC allows them to exist, but for them to work, the stack needs to be executable, at least when they are called indirectly from another function.

Let’s consider an example:

int nested_carrier(int a, int b, int n)
{
    int loc_var = n;

    int multiply2(int z)
    {
        return z + z + loc_var;
    }

    return sum_func(multiply2, a, b);
}

In this function, multiply2 is passed to be executed by the external function sum_func. Now, let’s examine the assembly implementation of nested_carrier.

┌ 151: dbg.nested_carrier (int64_t arg1, int64_t arg2, int64_t arg3, int64_t arg_10h);
│           ; arg int64_t arg1 @ rdi
│           ; arg int64_t arg2 @ rsi
│           ; arg int64_t arg3 @ rdx
│           ; arg int64_t arg_10h @ rbp+0x10
│           ; var int z @ rbp-0x4
│           ; var int64_t canary @ rbp-0x8
│           ; var int64_t var_10h @ rbp-0x10
│           ; var int loc_var @ rbp-0x30
│           ; var int a @ rbp-0x34
│           ; var int b @ rbp-0x38
│           ; var int n @ rbp-0x3c
│           0x00001187      f30f1efa       endbr64                         ; nested_local.c:5 int nested_carrier (int a, int b, int n) {
│                                                                          ; int nested_carrier(int a,int b,int n);
│           0x0000118b      55             push rbp
│           0x0000118c      4889e5         mov rbp, rsp
│           0x0000118f      4883ec40       sub rsp, 0x40
│           0x00001193      897dcc         mov dword [a], edi              ; arg1
│           0x00001196      8975c8         mov dword [b], esi              ; arg2
│           0x00001199      8955c4         mov dword [n], edx              ; arg3
│           0x0000119c      64488b0425..   mov rax, qword fs:[0x28]
│           0x000011a5      488945f8       mov qword [canary], rax         ; Just bought my self a new canary
│           0x000011a9      31c0           xor eax, eax
│           0x000011ab      488d4510       lea rax, [arg_10h]
│           0x000011af      488945f0       mov qword [var_10h], rax
│           0x000011b3      488d45d0       lea rax, [loc_var]
│           0x000011b7      4883c004       add rax, 4
│           0x000011bb      488d55d0       lea rdx, [loc_var]
│           0x000011bf      c700f30f1efa   mov dword [rax], 0xfa1e0ff3     ; Here it is writing the trampoline, note the endbr64 opcode
│           0x000011c5      66c7400449bb   mov word [rax + 4], 0xbb49      ; it stores in the stack
│           0x000011cb      488d0d97ff..   lea rcx, [dbg.multiply2]        ; as the multiply2 address
│           0x000011d2      48894806       mov qword [rax + 6], rcx
│           0x000011d6      66c7400e49ba   mov word [rax + 0xe], 0xba49    ; another opcode
│           0x000011dc      48895010       mov qword [rax + 0x10], rdx     ; this is the base address to locate parent local vars
│           0x000011e0      c7401849ff..   mov dword [rax + 0x18], 0x90e3ff49 ; more opcodes
│           0x000011e7      8b45c4         mov eax, dword [n]              ; nested_local.c:6 int loc_var = n;
│           0x000011ea      8945d0         mov dword [loc_var], eax
│           0x000011ed      488d45d0       lea rax, [loc_var]              ; nested_local.c:8 return sum_func (multiply2, a, b);
│           0x000011f1      4883c004       add rax, 4
│           0x000011f5      4889c1         mov rcx, rax                    ; save trampoline address
│           0x000011f8      8b55c8         mov edx, dword [b]              ; int64_t arg3 = b
│           0x000011fb      8b45cc         mov eax, dword [a]
│           0x000011fe      89c6           mov esi, eax                    ; int64_t arg2 = a
│           0x00001200      4889cf         mov rdi, rcx                    ; int64_t arg1 = trampoline address!
│           0x00001203      e847000000     call dbg.sum_func
│           0x00001208      488b75f8       mov rsi, qword [canary]         ; Hey canary, are you there?!
│           0x0000120c      6448333425..   xor rsi, qword fs:[0x28]        ; are still you!?
│       ┌─< 0x00001215      7405           je 0x121c                       ; stack overflow check
│       │   0x00001217      e844feffff     call sym.imp.__stack_chk_fail   ; crash if canary is failing
│       └─> 0x0000121c      c9             leave
└           0x0000121d      c3             ret

Examining this code, we notice some “alien code” added by our trusty compiler friend. Let’s set aside the stack check with canary for now; our current focus is on the trampoline it’s constructing to facilitate the external call. Within the function body, we can clearly see the trampoline being constructed, followed by the point at which the trampoline address is utilized for the external function call.

(gdb) x/10i $pc
=> 0x7fffffffdcc0:      endbr64
   0x7fffffffdcc4:      movabs $0x555555555169,%r11
   0x7fffffffdcce:      movabs $0x7fffffffdcc0,%r10
   0x7fffffffdcd8:      rex.WB jmpq *%r11

Let’s delve into how the trampoline is constructed using our buddy GDB. We’ll break it down into four instructions:

  1. endbr64: This instruction was introduced as part of the Intel Control-flow Enforcement Technology (CET) extension. Don’t confuse it with Cache Allocation Technology (CAT), another CPU feature. Phew, the acronyms are piling up! Anyway, this instruction isn’t pertinent to our analysis; it’s included because the machine executing this code expects it to be present. The endbr64 instruction marks a valid landing point for indirect calls and jumps (Indirect Branch Tracking) and helps mitigate call- and jump-oriented programming (COP/JOP) attacks.
  2. movabs $0x555555555169,%r11: This instruction loads our target function address, multiply2, into register r11.
  3. movabs $0x7fffffffdcc0,%r10: Let’s recall the x86_64 ABI: integer parameters to functions are passed in the registers rdi, rsi, rdx, rcx, r8, r9, and additional values are passed on the stack in reverse order. r10 is not one of the argument registers: the System V ABI sets it aside for the static chain pointer, and the trampoline uses it to pass the base address for the parent’s local variables.
  4. rex.WB jmpq *%r11: This is a straightforward indirect jump (not a call) that we know will lead us to address 0x555555555169, corresponding to the multiply2 function. Because it is a jump, multiply2 returns directly to whoever called the trampoline.
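To make the byte layout concrete, here is a small sketch that rebuilds the same 28 bytes the compiler stores in nested_carrier (offsets 0, 4, 6, 0xe, 0x10 and 0x18 in the disassembly above). It only fills a buffer and never executes it; build_trampoline and put_le64 are names I made up for the illustration, not GCC internals.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRAMPOLINE_SIZE 28

/* Store a 64-bit value in little-endian byte order. */
static void put_le64(uint8_t *dst, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Fill out[] with the trampoline seen in the listing:
 *   endbr64
 *   movabs $target, %r11     ; address of the nested function
 *   movabs $chain,  %r10     ; static chain: base of the parent's locals
 *   jmp    *%r11
 *   nop
 */
size_t build_trampoline(uint8_t out[TRAMPOLINE_SIZE],
                        uint64_t target, uint64_t chain)
{
    static const uint8_t endbr64[]     = { 0xf3, 0x0f, 0x1e, 0xfa };
    static const uint8_t jmp_r11_nop[] = { 0x49, 0xff, 0xe3, 0x90 };

    memcpy(out, endbr64, 4);            /* offset 0x00 */
    out[4] = 0x49;                      /* offset 0x04: REX.WB      */
    out[5] = 0xbb;                      /*              mov r11, imm64 */
    put_le64(out + 6, target);          /* offset 0x06 */
    out[14] = 0x49;                     /* offset 0x0e: REX.WB      */
    out[15] = 0xba;                     /*              mov r10, imm64 */
    put_le64(out + 16, chain);          /* offset 0x10 */
    memcpy(out + 24, jmp_r11_nop, 4);   /* offset 0x18 */
    return TRAMPOLINE_SIZE;
}
```

Calling build_trampoline(buf, 0x555555555169, 0x7fffffffdc30) should give the same bytes GDB disassembles in the crash dump further down.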

Now that we are aware of at least one other scenario where the compiler may necessitate an executable stack, let’s explore how this is reflected in the executable:

$ readelf -l gcc/nested_local

Elf file type is DYN (Shared object file)
Entry point 0x1080
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000658 0x0000000000000658  R      0x1000
  LOAD           0x0000000000001000 0x0000000000001000 0x0000000000001000
                 0x0000000000000315 0x0000000000000315  R E    0x1000
  LOAD           0x0000000000002000 0x0000000000002000 0x0000000000002000
                 0x00000000000001e0 0x00000000000001e0  R      0x1000
  LOAD           0x0000000000002db0 0x0000000000003db0 0x0000000000003db0
                 0x0000000000000260 0x0000000000000268  RW     0x1000
  DYNAMIC        0x0000000000002dc0 0x0000000000003dc0 0x0000000000003dc0
                 0x00000000000001f0 0x00000000000001f0  RW     0x8
  NOTE           0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000020 0x0000000000000020  R      0x8
  NOTE           0x0000000000000358 0x0000000000000358 0x0000000000000358
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_PROPERTY   0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000020 0x0000000000000020  R      0x8
  GNU_EH_FRAME   0x000000000000201c 0x000000000000201c 0x000000000000201c
                 0x000000000000005c 0x000000000000005c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RWE    0x10
  GNU_RELRO      0x0000000000002db0 0x0000000000003db0 0x0000000000003db0
                 0x0000000000000250 0x0000000000000250  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
   03     .init .plt .plt.got .plt.sec .text .fini
   04     .rodata .eh_frame_hdr .eh_frame
   05     .init_array .fini_array .dynamic .got .data .bss
   06     .dynamic
   07     .note.gnu.property
   08     .note.gnu.build-id .note.ABI-tag
   09     .note.gnu.property
   10     .eh_frame_hdr
   11
   12     .init_array .fini_array .dynamic .got

$ readelf -l gcc/nested_local.ne

Elf file type is DYN (Shared object file)
Entry point 0x1080
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000658 0x0000000000000658  R      0x1000
  LOAD           0x0000000000001000 0x0000000000001000 0x0000000000001000
                 0x0000000000000315 0x0000000000000315  R E    0x1000
  LOAD           0x0000000000002000 0x0000000000002000 0x0000000000002000
                 0x00000000000001e0 0x00000000000001e0  R      0x1000
  LOAD           0x0000000000002db0 0x0000000000003db0 0x0000000000003db0
                 0x0000000000000260 0x0000000000000268  RW     0x1000
  DYNAMIC        0x0000000000002dc0 0x0000000000003dc0 0x0000000000003dc0
                 0x00000000000001f0 0x00000000000001f0  RW     0x8
  NOTE           0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000020 0x0000000000000020  R      0x8
  NOTE           0x0000000000000358 0x0000000000000358 0x0000000000000358
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_PROPERTY   0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000020 0x0000000000000020  R      0x8
  GNU_EH_FRAME   0x000000000000201c 0x000000000000201c 0x000000000000201c
                 0x000000000000005c 0x000000000000005c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000002db0 0x0000000000003db0 0x0000000000003db0
                 0x0000000000000250 0x0000000000000250  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
   03     .init .plt .plt.got .plt.sec .text .fini
   04     .rodata .eh_frame_hdr .eh_frame
   05     .init_array .fini_array .dynamic .got .data .bss
   06     .dynamic
   07     .note.gnu.property
   08     .note.gnu.build-id .note.ABI-tag
   09     .note.gnu.property
   10     .eh_frame_hdr
   11
   12     .init_array .fini_array .dynamic .got

The listings above show the ELF program header tables of two executable files, both generated from the same source file, src/nested_local.c, in the repository. They differ because, for the second one, I added -z noexecstack to enforce a non-executable stack. This is what happens when they are executed:

$ ./gcc/nested_local; echo
Fancy calculation (34)
$ ./gcc/nested_local.ne; echo
Segmentation fault (core dumped)

Since the trampoline lives on the stack, the second executable crashes as soon as it tries to execute it. Here’s the proof that this is what causes the crash:

$ gdb ./gcc/nested_local.ne
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./gcc/nested_local.ne...
(gdb) r
Starting program: /home/alessandro/tmp/stack/nested_prt/gcc/nested_local.ne

Program received signal SIGSEGV, Segmentation fault.
0x00007fffffffdc34 in ?? ()
(gdb) x/10i $pc
=> 0x7fffffffdc34:      endbr64
   0x7fffffffdc38:      movabs $0x555555555169,%r11
   0x7fffffffdc42:      movabs $0x7fffffffdc30,%r10
   0x7fffffffdc4c:      rex.WB jmpq *%r11
   0x7fffffffdc4f:      nop
   0x7fffffffdc50:      jo     0x7fffffffdc2e
   0x7fffffffdc52:      (bad)
   0x7fffffffdc53:      (bad)
   0x7fffffffdc54:      (bad)
   0x7fffffffdc55:      jg     0x7fffffffdc57
(gdb)

Finally, let’s consider that to further complicate matters, GCC employs different conventions across architectures. Please do not expect this to be straightforward, as it certainly isn’t!

In x86_64, executable ELF files always contain an entry in the program header GNU_STACK, which reflects the actual permissions over the stack. When the linker combines objects to create the executable, it looks at .note.GNU-stack and its contents to set the stack accordingly. If .note.GNU-stack is missing, the stack defaults to executable.

Similarly, in aarch64, executable ELF files always include an entry in the program header GNU_STACK, with flags reflecting the stack’s permissions. The linker examines .note.GNU-stack during the executable creation process to determine the stack’s permissions. If .note.GNU-stack is absent, the stack defaults to non-executable.

On PPC64, executable ELF files only include an entry in the program header GNU_STACK if it needs to be executable; otherwise, it defaults to non-executable.

Conversely, in MIPS32, executable ELF files only have an entry in the program header GNU_STACK if it needs to be non-executable; otherwise, it defaults to executable.

As a final note for this extensive and perhaps tedious discussion on executable stacks, allow me to share what I discovered while verifying this information on a MIPS system.

Look at how some MIPS SoCs do not enforce the stack permissions

epilogue

If you want to make sure your stack is not executable, add -z noexecstack to your link flags (GCC forwards -z options to the linker).

Friday, February 23, 2024

Linux Syscall Numbers: A Journey through Binary Analysis


Introduction:

Reverse engineering binaries is a crucial activity impacting both the realms of security and safety. When examining a program without the context syscall numbers provide, interpreting the code behavior can be daunting.

Consider this snippet of code devoid of syscall numbers:

0x0000000000000000:  48 C7 C0 01 00 00 00    mov   rax, 1
0x0000000000000007:  48 C7 C7 01 00 00 00    mov   rdi, 1
0x000000000000000e:  48 C7 C6 2e 00 00 00    mov   rsi, 0x2e
0x0000000000000015:  48 C7 C2 0D 00 00 00    mov   rdx, 0xd
0x000000000000001c:  0F 05                   syscall
0x000000000000001e:  48 C7 C0 3C 00 00 00    mov   rax, 0x3c
0x0000000000000025:  48 C7 C7 00 00 00 00    mov   rdi, 0
0x000000000000002c:  0F 05                   syscall
0x000000000000002e:  48 65 6C                insb  byte ptr [rdi], dx
0x0000000000000031:  6C                      insb  byte ptr [rdi], dx
0x0000000000000032:  6F                      outsd dx, dword ptr [rsi]
0x0000000000000033:  20 57 6F                and   byte ptr [rdi + 0x6f], dl
0x0000000000000036:  72 6C                   jb    0xa4
0x0000000000000038:  64 21 00                and   dword ptr fs:[rax], eax

Without knowledge of syscall numbers, understanding the program’s functionality, particularly its interactions with the operating system, becomes challenging. However, simply providing the context of syscall numbers can dramatically simplify comprehension. What if I tell you that the first syscall is write and the second is exit? Instantly, it becomes trivial to understand!

0x0000000000000000:  48 C7 C0 01 00 00 00    mov   rax, 1
0x0000000000000007:  48 C7 C7 01 00 00 00    mov   rdi, 1
0x000000000000000e:  48 C7 C6 2e 00 00 00    mov   rsi, 0x2e
0x0000000000000015:  48 C7 C2 0D 00 00 00    mov   rdx, 0xd
0x000000000000001c:  0F 05                   syscall                          ; write(1, 0x2e, 13)
0x000000000000001e:  48 C7 C0 3C 00 00 00    mov   rax, 0x3c
0x0000000000000025:  48 C7 C7 00 00 00 00    mov   rdi, 0
0x000000000000002c:  0F 05                   syscall                          ; exit(0)
0x000000000000002e:  48 65 6C                insb  byte ptr [rdi], dx         ; H e l
0x0000000000000031:  6C                      insb  byte ptr [rdi], dx         ; l
0x0000000000000032:  6F                      outsd dx, dword ptr [rsi]        ; o
0x0000000000000033:  20 57 6F                and   byte ptr [rdi + 0x6f], dl  ; ' ' W o
0x0000000000000036:  72 6C                   jb    0xa4                       ; r l
0x0000000000000038:  64 21 00                and   dword ptr fs:[rax], eax    ; d ! \0

It becomes easier to spot that the bytes after the exit can’t be code, and that if write references them while writing to file descriptor 1, stdout, they must be text!

In the snippets I just used as an example, spotting the syscall number is trivial: mov rax, 1 for the first syscall and mov rax, 0x3c for the second. So the example only hints at the problem. In complex code, the mov rax in question can be far away from the actual usage and, in some convoluted cases, the value can be moved from one register to another.
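For the trivial case only, the lookup can be sketched as a toy linear sweep. The decoder below is deliberately naive and entirely my own illustration: it understands just the two encodings that appear in the snippet, mov r64, imm32 (48 C7 C0+r) and syscall (0F 05), and gives up at the first byte sequence it does not recognise. find_syscalls is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy linear sweep over x86_64 code.
 * Records the value last loaded into rax at every syscall instruction
 * (-1 when rax is unknown). Returns how many syscalls were found.
 */
size_t find_syscalls(const uint8_t *code, size_t len,
                     int64_t *nums, size_t max)
{
    int64_t rax = -1;               /* -1: rax not known yet */
    size_t found = 0, i = 0;

    while (i < len && found < max) {
        if (len - i >= 7 && code[i] == 0x48 && code[i + 1] == 0xc7 &&
            (code[i + 2] & 0xf8) == 0xc0) {
            /* mov r64, imm32: little-endian immediate, sign-extended */
            uint32_t raw = (uint32_t)code[i + 3] |
                           (uint32_t)code[i + 4] << 8 |
                           (uint32_t)code[i + 5] << 16 |
                           (uint32_t)code[i + 6] << 24;

            if ((code[i + 2] & 7) == 0)     /* destination is rax */
                rax = (int64_t)(int32_t)raw;
            i += 7;
        } else if (len - i >= 2 && code[i] == 0x0f && code[i + 1] == 0x05) {
            nums[found++] = rax;    /* syscall number is whatever rax holds */
            rax = -1;               /* rax now carries the return value */
            i += 2;
        } else {
            break;                  /* unknown encoding: give up */
        }
    }
    return found;
}
```

On the bytes of the example it finds two syscalls, numbered 1 and 0x3c, and stops at the "Hello World!" data. It breaks down as soon as the number is loaded any other way, which is exactly the limitation the rest of this post is about.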

Automating the identification of syscall numbers not only aids reverse engineering for security but can also enhance code safety. By automating this process, the amount of code that requires manual verification can be restricted. This reduction in manual effort streamlines the verification process, contributing to improved code safety and reliability.


Differences Between Architectures RISC vs CISC:

In the realm of computer architecture, the dynamic between CISC and RISC is akin to a timeless dance. CISC represents the old school, while RISC stands as the new kid on the block — although, truth be told, RISC has been strutting its stuff for at least 30 years now, which is quite a few lifetimes in architectural terms.

RISC architectures boast uniform instruction sizes, treating every instruction with equal importance. This uniformity simplifies the compiler’s task, ensuring impartial treatment of instructions. For instance, in aarch64, a 16-bit literal fits within the encoding of a single instruction, so any value below the 64k limit, a typical boundary for syscall numbers, is loaded in exactly the same way.

Conversely, CISC architectures march to a different beat. Loading an immediate value can entail performance and memory footprint costs. Even with minimal optimizations, varied instructions are often used to achieve the same outcome of loading a register with a specific value.

opcodes                 mnemonic
C8 1a 80 D2             MOV X8, #0xd6
01 00 00 D4             SVC 0
RISC: aarch64

opcodes                 mnemonic
93 08 e0 05             li a7, 94
73 00 00 00             ecall
RISC: RiscV

opcodes                 mnemonic
48 c7 c0 0c 00 00 00    mov rax,0x0C
0f 05                   syscall
CISC: x86_64

opcodes                 mnemonic
A7 19 01 04             lghi r1, 0x0104
0a 00                   svc 0
CISC: s390x

This architectural dichotomy sheds light on the nuances of identifying syscall numbers. RISC architectures, with their uniform instruction structure, facilitate easier identification, while CISC architectures, with their diverse and sometimes costly instructions, present a more formidable challenge.
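To show why the fixed-width case is the easy one, here is a sketch that pulls the syscall number straight out of the aarch64 MOV X8, #0xd6 encoding from the table above. It handles only the 64-bit MOVZ form (the instruction behind that MOV alias); decode_movz64 and load_le32 are hypothetical helper names.

```c
#include <stdint.h>

/* Little-endian load of one 32-bit aarch64 instruction word. */
uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Decode a 64-bit MOVZ (sf=1, opc=10, fixed bits 100101).
 * Field layout: hw in bits 22:21, imm16 in bits 20:5, Rd in bits 4:0.
 * Returns 1 and fills *rd and *value on success, 0 if insn is not MOVZ.
 */
int decode_movz64(uint32_t insn, unsigned *rd, uint64_t *value)
{
    unsigned hw;

    if ((insn & 0xff800000u) != 0xd2800000u)
        return 0;
    hw = (insn >> 21) & 3;                      /* shift amount / 16 */
    *rd = insn & 0x1f;
    *value = (uint64_t)((insn >> 5) & 0xffff) << (16 * hw);
    return 1;
}
```

Feeding it the bytes C8 1a 80 D2 yields register 8 and the value 0xd6: on Linux aarch64 the syscall number travels in x8, so one fixed-size word is enough to read it.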

Guess syscall numbers: How can it be done?

In the early days of computing, CPU designers anticipated the need for system call functionality directly embedded within instruction sets. For instance, the int instruction in 16-bit x86 architectures and traps in m68k systems exemplify this approach. However, as early as the 1980s, operating system developers recognized the limitations of such implementations, as the 256 functions allowed by these instructions proved insufficient. Legacy x86 systems, DOS for instance, differentiated between BIOS and operating system functions using the int number, with the OS typically consolidating its functions under a single int number, such as int 0x21. Today, the impracticality of embedding syscall numbers directly into instruction opcodes is evident, leading newer architectures like RISC-V and x86_64 to eschew this practice.

By hand: How do they do it?

Manual identification of syscall numbers is a cumbersome process, typically undertaken in two steps:
  • Locating a syscall instruction within the assembly code.
  • Tracing backward from the syscall instruction to identify registers, particularly the one containing the syscall number.

This meticulous process is essential for determining the specific syscall invoked by the program, but it highlights the need for automated tools and techniques in modern reverse engineering practices.

    Radare ESIL: How does it do it?

    Radare 2 tackles the challenge of identifying syscall numbers by leveraging its ESIL framework. ESIL serves as a semantic representation of assembly instructions, enabling Radare 2 to evaluate individual instructions and emulate their behavior. Similar to LLVM-IR, ESIL provides a high-level abstraction of CPU opcode semantics, facilitating analysis and interpretation of assembly code.

    ESIL operates through a stack-based interpreter, akin to calculators, where values and operators are pushed and popped from a stack to perform operations. This post-fix notation allows for the expression of various CPU operations, including binary arithmetic, and memory operations. For example, Radare 2 can use ESIL to translate assembly instructions into understandable operations, shedding light on a program’s behavior, even for obscure architectures where traditional debugging tools may not be available. By harnessing ESIL’s capabilities, Radare 2 empowers analysts to efficiently identify syscall numbers and understand program execution flow.

    Ghidra: How does it do it?

    Ghidra does not provide an official way to solve the problem of guessing syscall numbers. Still, there’s a very well-known third-party script that uses the Ghidra framework to provide the functionality. The Ghidra script employs symbolic propagation, a technique in reverse engineering that tracks symbolic values of program variables instead of executing the program. Specifically, the script analyzes x86 and x64 Linux binaries to resolve system calls. It identifies system call instructions, symbolically determines the values of syscall registers, and creates function definitions for each syscall. Symbolic propagation enables reasoning about register contents without actually executing the program.

    My Proposition: Static Analysis + Emulation.

    The initial concept is to emulate the code and examine the contents of registers when a syscall instruction is executed. However, a significant challenge arises: there's no assurance that the flow of code execution will necessarily reach the point of interest—the syscall.

    To address this challenge, I employ a technique I refer to as guided execution. Essentially, this method involves guiding the execution to ensure it passes over the specific instruction I need to examine.

    In the general case, this approach wouldn't provide guarantees about the accuracy of register contents. However, when it comes to syscalls, I am confident that the path I traverse to reach a syscall cannot determine the syscall number. This confidence stems from the fact that syscalls have diverse interfaces, implying that if multiple paths exist from the function start to a syscall, they will ultimately lead to the same syscall number. Subsequently, I outline the detailed procedure I employ to achieve this.

    The process commences with the user-provided function binary image, a representation of compiled machine code specific to a given function. This binary image undergoes disassembly using the libcapstone disassembler, converting the machine code into a list of assembly instructions.

    Following disassembly, an elaborate procedure ensues to generate the CFG for the function in question. The CFG serves as a graphical depiction of all feasible execution paths within the function. Each node within the graph corresponds to an instruction block, with edges denoting potential transitions between these blocks.

    An illustrative example of this intermediate step pertains to the glibc function __pthread_mutex_lock_full.

    Within this resulting CFG image, purple blocks denote instruction blocks housing syscalls, while cyan bubbles signify blocks containing a ret instruction.

    Subsequently, the task shifts to identifying a path from the function’s entry point to the instruction block housing the syscall.
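The path search itself can be sketched as a breadth-first search over the block graph. This is an illustration under my own assumptions, not the plugin's code: struct cfg, MAX_BLOCKS and cfg_find_path are hypothetical, blocks are plain integer ids, and each block has at most two successors (fall-through and branch target).

```c
#include <stddef.h>

#define MAX_BLOCKS 64
#define MAX_SUCCS  2

struct cfg {
    size_t nblocks;
    int succ[MAX_BLOCKS][MAX_SUCCS];    /* successor block ids, -1 = none */
};

/*
 * Breadth-first search from entry to target.
 * On success, path[] holds the block ids from entry to target (inclusive)
 * and the path length is returned; 0 means no path or path[] too small.
 */
size_t cfg_find_path(const struct cfg *g, int entry, int target,
                     int *path, size_t max)
{
    int prev[MAX_BLOCKS], queue[MAX_BLOCKS];
    size_t head = 0, tail = 0, n = 0, i;
    int b;

    if (g->nblocks > MAX_BLOCKS || entry < 0 || target < 0 ||
        (size_t)entry >= g->nblocks || (size_t)target >= g->nblocks)
        return 0;

    for (i = 0; i < g->nblocks; i++)
        prev[i] = -2;                   /* -2: not visited yet */
    prev[entry] = -1;                   /* -1: start of the path */
    queue[tail++] = entry;

    while (head < tail && prev[target] == -2) {
        b = queue[head++];
        for (int k = 0; k < MAX_SUCCS; k++) {
            int s = g->succ[b][k];

            if (s >= 0 && (size_t)s < g->nblocks && prev[s] == -2) {
                prev[s] = b;            /* remember how we got here */
                queue[tail++] = s;
            }
        }
    }
    if (prev[target] == -2)
        return 0;                       /* syscall block unreachable */

    for (b = target; b != -1; b = prev[b])
        n++;                            /* count blocks on the path */
    if (n > max)
        return 0;
    i = n;
    for (b = target; b != -1; b = prev[b])
        path[--i] = b;                  /* write the path front to back */
    return n;
}
```

Breadth-first search returns a shortest path, which keeps the amount of code to emulate small; any path would do, given the assumption above that the syscall number does not depend on the route taken.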

    Once the path is discerned, a further operation ensues, scanning the code to locate any call instructions along the path. These call instructions are then substituted with nop instructions, ensuring guided execution remains focused solely on the path leading to the syscall instruction block and doesn’t veer off due to calls to other functions absent from the binary image.

    With call instructions replaced, guided execution is executed using libunicorn. During this process, each instruction block along the path is executed sequentially, with the context (CPU state) of each block utilized to initiate execution of the subsequent block.

    Upon reaching the syscall instruction, register values are inspected to ascertain the syscall number.


    Addressing concerns and outlining why it’s effective in this use case.

    The PoC

    To apply this concept in a practical context, I aimed to create a Proof of Concept (PoC), opting to develop it as a Radare2 plugin. Below, I present the implementation for both versions of Radare:

  • Radare2.
  • Rizin.

    Patching the Code

    As I delve into the intricacies of Linux binaries, the necessity of patching becomes evident in my quest to uncover the elusive syscall numbers. Patching serves as a crucial tool to remove distractions and ensure a clear path to my destination.

    However, the task of patching is notably more complex in CISC architectures compared to RISC. In CISC, such as x86, branch instructions vary in length, adding layers of intricacy to the process. This complexity is compounded by the need to decipher the true meaning behind seemingly innocuous instructions like the nop instruction.

    In the realm of x86 architecture, the nop instruction is a facade for the xchg (e)ax, (e)ax instruction. This revelation sheds light on the true nature of nop, enabling me to craft multibyte instructions that effectively clear my path of obstructions.

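    To achieve this, I employ the Gauss formula, a stroke of genius that allows me to create a function capable of generating multibyte nop instructions of any desired size. The nops are stored back to back in a single string, the 1-byte one first, then the 2-byte one, and so on, so the n-byte nop starts at offset 1 + 2 + ... + (n-1) = n(n-1)/2. Armed with this solution, I navigate the intricate landscape of x86 architecture with newfound confidence, inching closer to my ultimate goal with each patch applied.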

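    /* n-byte nops stored back to back: the n-byte entry starts at n*(n-1)/2 */
    const char multibyte_nop_x86[] =
        "\x90"
        "\x66\x90"
        "\x0f\x1f\x00"
        "\x0f\x1f\x40\x00"
        "\x66\x66\x66\x66\x90"
        "\x66\x0f\x1f\x44\x00\x00"
        "\x0f\x1f\x80\x00\x00\x00\x00"
        "\x0f\x1f\x84\x00\x00\x00\x00\x00"
        "\x66\x0f\x1f\x84\x00\x00\x00\x00\x00";

    /* same layout, but every entry starts with a ret (0xc3) */
    const char multibyte_ret_nop_x86[] =
        "\xc3"
        "\xc3\x90"
        "\xc3\x66\x90"
        "\xc3\x0f\x1f\x00"
        "\xc3\x0f\x1f\x40\x00"
        "\xc3\x66\x66\x66\x66\x90"
        "\xc3\x66\x0f\x1f\x44\x00\x00"
        "\xc3\x0f\x1f\x80\x00\x00\x00\x00"
        "\xc3\x0f\x1f\x84\x00\x00\x00\x00\x00";

    /* arguments are parenthesized so that expressions such as MBNOP(len + 1, t) expand safely;
       NOP_ONLY is not shown in this excerpt */
    #define MBNOP(n, t) ((t) == NOP_ONLY \
        ? multibyte_nop_x86 + ((((n) - 1) * (n)) / 2) \
        : multibyte_ret_nop_x86 + ((((n) - 1) * (n)) / 2))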
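As a usage sketch, this is how a patch site of arbitrary length could be filled using the same triangular-number lookup. It is a self-contained restatement for illustration, not the plugin's code: nop_table repeats the first table above as a byte array, while nop_offset and nop_fill are hypothetical helpers; sites longer than 9 bytes are covered with several nops.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NOP 9

/* 1-byte nop, then 2-byte nop, ... up to the 9-byte nop, back to back. */
static const uint8_t nop_table[] = {
    0x90,
    0x66, 0x90,
    0x0f, 0x1f, 0x00,
    0x0f, 0x1f, 0x40, 0x00,
    0x66, 0x66, 0x66, 0x66, 0x90,
    0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00,
    0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00,
    0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* Offset of the n-byte nop: 1 + 2 + ... + (n-1) = n(n-1)/2. */
size_t nop_offset(size_t n)
{
    return n * (n - 1) / 2;
}

/* Overwrite len bytes at dst with the fewest multibyte nops possible. */
void nop_fill(uint8_t *dst, size_t len)
{
    while (len > 0) {
        size_t n = len < MAX_NOP ? len : MAX_NOP;

        memcpy(dst, nop_table + nop_offset(n), n);
        dst += n;
        len -= n;
    }
}
```

A 5-byte call rel32 (E8 followed by a 32-bit displacement) would be patched with nop_fill(site, 5), which copies the 5-byte entry found at offset 10 of the table.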

    No Return Functions.

    The primary method employed by this plugin is emulation, where it does not have access to a context, meaning that function arguments are undetermined. Instead of emulating the entire function, it selectively executes the sequence of blocks of code leading from the function’s entry point to the point where a syscall is invoked, which is the guided execution I presented earlier.

    To achieve this, the plugin performs an analysis of the target function and creates a CFG. The CFG serves as a navigational tool to identify the path of blocks leading to the block that contains the syscall.
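    As an illustration of how a CFG can yield such a path, the sketch below runs a breadth-first search from the entry block to the first block flagged as containing a syscall. The struct layout, the two-successor limit and the function name are assumptions made for this example, not the plugin's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 64
#define MAX_SUCCS  2            /* fall-through and branch target */

struct block {
    int succ[MAX_SUCCS];        /* successor block indices, -1 if unused */
    int has_syscall;            /* non-zero if the block contains a syscall */
};

/*
 * Breadth-first search from `entry`. On success, stores the block indices
 * from the entry to the syscall block in path[] and returns the path length.
 * Returns 0 when no syscall block is reachable.
 */
static size_t path_to_syscall(const struct block *cfg, size_t nblocks,
                              int entry, int *path)
{
    int prev[MAX_BLOCKS];       /* -2: not visited, -1: entry, else predecessor */
    int queue[MAX_BLOCKS];
    size_t head = 0, tail = 0;

    assert(nblocks <= MAX_BLOCKS && entry >= 0 && (size_t)entry < nblocks);
    for (size_t i = 0; i < nblocks; i++)
        prev[i] = -2;
    prev[entry] = -1;
    queue[tail++] = entry;

    while (head < tail) {
        int b = queue[head++];
        if (cfg[b].has_syscall) {
            size_t len = 0;
            for (int p = b; p != -1; p = prev[p])
                len++;
            size_t i = len;
            for (int p = b; p != -1; p = prev[p])
                path[--i] = p;  /* fill the path back to front */
            return len;
        }
        for (int s = 0; s < MAX_SUCCS; s++) {
            int n = cfg[b].succ[s];
            if (n >= 0 && (size_t)n < nblocks && prev[n] == -2) {
                prev[n] = b;
                queue[tail++] = n;
            }
        }
    }
    return 0;
}
```

    Only the blocks on the returned path would then be emulated, which is what keeps the execution "guided".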

    In this context, “blocks,” or instruction blocks, are defined as sequences of instructions without any branching, typically terminating at a branch instruction. CALL instructions are typically treated as NOP instructions and are patched out of the binary.

    However, a unique challenge arises when dealing with the glibc library, specifically, when encountering the __libc_fatal function. Unlike other functions, this troublemaker does not conform to the assumed behavior of returning, rendering the standard treatment of call instructions inappropriate.

    I decided to give __libc_fatal special treatment: an ad hoc solution that transforms the call to it into a RET instruction, which lets the plugin work effectively in such cases.
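    A rough sketch of that decision, under my own assumptions: the call being patched is a 5-byte "call rel32" (opcode 0xE8), and the list of non-returning callees is a hypothetical hard-coded table. The replacement bytes are the 5-byte entries of the two tables shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of callees known not to return. */
static const char *const noreturn_callees[] = { "__libc_fatal" };

static int is_noreturn(const char *callee)
{
    size_t n = sizeof noreturn_callees / sizeof noreturn_callees[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(callee, noreturn_callees[i]) == 0)
            return 1;
    return 0;
}

/* Overwrite a 5-byte "call rel32" in place, according to the callee. */
static void patch_call(unsigned char insn[5], const char *callee)
{
    /* 5-byte NOP, for calls that are assumed to return. */
    static const unsigned char nop5[5] = { 0x66, 0x66, 0x66, 0x66, 0x90 };
    /* RET followed by a 4-byte NOP, for calls that never return. */
    static const unsigned char ret_nop[5] = { 0xc3, 0x0f, 0x1f, 0x40, 0x00 };

    assert(insn[0] == 0xe8);
    memcpy(insn, is_noreturn(callee) ? ret_nop : nop5, 5);
}
```

    With this, a call to __libc_fatal ends the emulated path at that point instead of falling through into code that would never run on a real system.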

    Jump Tables: Tackling complexities introduced by jump tables.

    The current plugin implementation effectively follows the CFG of a given function, providing accurate results in most scenarios. However, there is an edge case that the current algorithm cannot handle correctly: a switch statement implemented with a jump table. The current approach treats a jmp instruction as having no forward (fall-through) path and relies solely on its branch target. With a jump table, however, there can be many branch targets, and they can be located far away from the code under examination.

    A temporary solution focusing specifically on the syscall number can be summarized as follows:

  • Collection of Syscall Numbers: During the examination of the function’s code, the plugin should collect information about syscall numbers and their corresponding positions within the code.
  • Handling Non-Reached CFG Blocks: If the CFG analysis encounters non-reached blocks containing a syscall, the plugin could execute them to determine if the syscall number can be deduced.
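    The second step could start from something as simple as the sketch below, which lists the blocks that contain a syscall but were never visited. The block_info structure and the function name are hypothetical; how each candidate block is then executed is left out.

```c
#include <assert.h>
#include <stddef.h>

struct block_info {
    int reached;        /* visited by the guided execution */
    int has_syscall;    /* contains a syscall instruction */
};

/*
 * Store in out[] the indices of blocks that hold a syscall but were never
 * reached, and return how many were found. out[] must have room for n entries.
 */
static size_t unreached_syscall_blocks(const struct block_info *blocks,
                                       size_t n, size_t *out)
{
    size_t count = 0;

    assert(blocks && out);
    for (size_t i = 0; i < n; i++)
        if (blocks[i].has_syscall && !blocks[i].reached)
            out[count++] = i;
    return count;
}
```

    Each returned block would be a candidate for isolated execution, to see whether the syscall number is set within the block itself.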

    Epilogue

    This work is a PoC for a concept, and this blog post is a place for others to voice their thoughts on it. I find the idea quite promising, although it requires refinement on multiple fronts. I hope it encourages further exploration and research in binary analysis.

    Monday, January 29, 2024

    Navigating the Labyrinth: Battling Duplicate Symbols in the Kernel

    Introduction

    Hey folks! Ever wondered what it’s like to venture into the wild world of kernel development? Well, let me tell you, it’s not a walk in the park. In this post, I want to tell you about when I stumbled upon a head-scratcher – the conundrum of duplicate symbols in the Linux kernel.

    Why Duplicate Symbols Exist in the Kernel and How is it Possible?

    Kernel development is a complex process that involves compiling and linking numerous C source files to create the final executable. The kernel comprises multiple C files, each translated into an object file during compilation. The linking step that follows brings these object files together to form intermediate objects (or modules) and, at last, the kernel binary image. The linker relies on symbols, identifiers that represent variables or functions, to establish connections between different parts of the code. Importantly, the symbols the linker needs to resolve must be unique within a linking session for the process to succeed.

    In any given object file, there are two types of symbols: exported and non-exported. Exported symbols, meant to be used externally, must be unique within the set of objects that need to be linked together, or the link fails. Non-exported symbols, often associated with functions declared as static, are not subject to the same uniqueness constraint, so the objects linked together may end up containing non-exported symbol names that are not unique within the set.

    As said, exported symbols must be unique for the linker to do its job, but no strict rule is imposed on non-exported symbols. Duplicates do not have to occur in general, but in a codebase as vast as the Linux kernel they are inevitable, and in fact not uncommon.

    Header files add to the complexity. Some symbols, particularly small inline functions, are defined in header files. This is common practice, but it introduces the possibility of duplicate symbols. When this happens it is not the result of chance: it is structural and produces duplicates systematically, because a header file can be included in different compilation units, so the same function may appear multiple times within the kernel executable.

    A notable contributor to duplicated symbols is the ELF compatibility binary format, where a hack in the compat_binfmt_elf.c file introduces duplicate symbols as a consequence of including a C file.

    In essence, the kernel’s symbols are not guaranteed to be unique in any way, and in fact there are duplicated symbols in the kernel.

    ~ # cat /proc/kallsyms | grep " name_show"
    ffffcaa2bb4f01c8 t name_show
    ffffcaa2bb9c1a30 t name_show
    ffffcaa2bbac4754 t name_show
    ffffcaa2bbba4900 t name_show
    ffffcaa2bbec4038 t name_show
    ffffcaa2bbecc920 t name_show
    ffffcaa2bbed3840 t name_show
    ffffcaa2bbef7210 t name_show
    ffffcaa2bbf03328 t name_show
    ffffcaa2bbff6f3c t name_show
    ffffcaa2bbff8d78 t name_show
    ffffcaa2bbfff7a4 t name_show
    ffffcaa2bc001f60 t name_show
    ffffcaa2bc009890 t name_show
    ffffcaa2bc01212c t name_show
    ffffcaa2bc025e2c t name_show
    ffffcaa2a052102c t name_show [hello]
    ffffcaa2a051955c t name_show [rpmsg_char]
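    The same check can be done programmatically. The sketch below parses lines in the /proc/kallsyms format ("address type name [module]") and counts how many of them define a given name. The function names and the 128-byte name buffer are assumptions for illustration. Reading the actual file is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SYM_NAME_MAX 128    /* assumed buffer size for this sketch */

/* Parse one kallsyms-style line. Returns 1 on success, 0 otherwise. */
static int parse_kallsyms_line(const char *line, unsigned long long *addr,
                               char *type, char *name)
{
    return sscanf(line, "%llx %c %127s", addr, type, name) == 3;
}

/* Count how many of the given lines define the symbol `wanted`. */
static size_t count_symbol(const char *const *lines, size_t nlines,
                           const char *wanted)
{
    size_t count = 0;

    assert(lines && wanted);
    for (size_t i = 0; i < nlines; i++) {
        unsigned long long addr;
        char type, name[SYM_NAME_MAX];

        if (parse_kallsyms_line(lines[i], &addr, &type, name) &&
            strcmp(name, wanted) == 0)
            count++;
    }
    return count;
}
```

    Any name for which count_symbol returns more than 1 is a duplicate in the sense discussed here.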

    Ok, there are duplicate symbols, but does this really matter?

    It is a fair question, and I can tell you up front that you can live a long life full of joy without ever knowing about this problem. You can even be a kernel developer at some level and ignore it. Some people doing kernel maintenance cannot ignore the issue, though, because sooner or later they have to face it. Anyone who has done kernel tracing at some level has probably stumbled on it, and the few people who do live patching have had to deal with it at a harder level. To understand how duplicated kernel symbol names can affect anyone, we must start from the fundamentals of the tracing system: the kernel symbol table, aka kallsyms. Kallsyms is a giant table where every kernel symbol is recorded together with its address. This table allows the tracing subsystem to locate a function within the kernel address space with a simple lookup. All this works particularly well (from a name, you get an address) as long as symbols are unique.

    Tracing a duplicate symbol, on the other hand, is a pain, since you cannot know whether the address kallsyms provides is the one you intended to watch. In the tracing use case you can at least try again; things become more serious if you need to patch the function.

    Consider the scenario of tracing a specific function for debugging or performance analysis. With duplicate symbols, the tracer picks one of the matching symbols (typically the first), but nobody knows whether it is the one you intended, leading to inaccurate results and potentially misinforming the trace engineer. The challenge grows when it comes to live patching, a process that involves modifying the kernel code at runtime. In such a delicate operation, precision is paramount, and the presence of duplicate symbols introduces an element of uncertainty. It must be said that the livepatch subsystem has dealt with this uncertainty to some extent by using kallsyms_on_each_match_symbol, but it is still a suboptimal solution.

    In essence, the existence of duplicate symbols adds an extra layer of complexity for those involved in kernel tracing and live patching.

    Does this Really Happen, or is it Just a Potential Problem?

    Well, the first time I stumbled upon this issue was when I began developing a tool to generate diagrams illustrating the dependencies between functions within the Linux kernel. Enter ks-nav, a tool designed to create graphs for any Linux kernel function, showcasing potential relations between various kernel subsystems. The software analyzes the kernel, exporting information into a database used to produce insightful diagrams. During this endeavor, I encountered the duplicate symbols issue for the first time. For my use case, I simply needed to disambiguate the duplicated names by renaming them with a leading sequence number, a seemingly manageable task at the time.

    However, my subsequent encounter with duplicate symbols proved to be more challenging, and I instinctively steered away from confronting it directly. I was assigned the task of investigating a bug in a system that experienced a complete freeze and became unusable when utilizing ftrace functions. Without delving into the intricate details of the bug, the crux of the matter was that a peripheral generated an excessive number of interrupts, overwhelming the system to the point where the ftrace overhead left no CPU time for other operations.

    Having identified the root cause, my intention was to create a demonstrator that would vividly illustrate the source of the problem. I aimed to hard-code a filter into the interrupt controller code to exclude the troublesome interrupt from being serviced. My strategy involved live-patching a function in the interrupt controller code, replacing it with the same function but with the filter hard-coded. It was a nice strategy, except that live patch does not support aarch64, and the interrupt controller symbol I wanted to replace existed in the two interrupt controller drivers built into the target kernel: the GIC and the GICv3.

    Here are the links to the relevant kernel source files:

    I managed to find a workaround for the issue, albeit in a somewhat hacky manner. However, this experience prompted me to recognize the need for a more generalized solution to address the challenges posed by duplicate symbols in the kernel.

    In Any Case, What Havoc Do These Duplicate Symbols Wreak?

    Let’s assume that the kernel itself doesn’t exhibit any substantial behavioral defects, and this issue predominantly affects the tracing system. Having a name that potentially does not precisely identify a function raises three significant issues with the current implementation. Let’s delve into the details:

    1. Uncertainty in Symbol Address: The function responsible for looking up a name in the Linux kernel is kallsyms_lookup_name. It returns the address of the first symbol it encounters with that name. This introduces a problem because, given a symbol name, you can’t be certain that the address you receive corresponds to the specific function you’re interested in.
    2. Identification Challenge: With multiple matching symbols, it’s not straightforward to determine which one is the intended symbol with a given name. The lack of a clear distinction complicates the selection process.
    3. Difficulty in Addressing Specific Symbols: Assuming you’re aware that the first matching symbol with a given name is not the one you seek, it becomes challenging to address any other symbol. The process of singling out the correct symbol becomes nontrivial.
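    The three issues can be reproduced with a small user-space model. This is not kernel code: lookup_first and for_each_match are hypothetical stand-ins that only mimic a first-match lookup and a per-match callback over a toy symbol table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sym {
    unsigned long long addr;
    const char *name;
};

/* First-match lookup: address of the first entry called `name`, or 0. */
static unsigned long long lookup_first(const struct sym *tab, size_t n,
                                       const char *name)
{
    assert(tab && name);
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return tab[i].addr;
    return 0;
}

typedef int (*match_fn)(void *data, unsigned long long addr);

/* Call fn on every entry called `name`; stop early if fn returns non-zero. */
static int for_each_match(const struct sym *tab, size_t n, const char *name,
                          match_fn fn, void *data)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].name, name) != 0)
            continue;
        int ret = fn(data, tab[i].addr);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callback: counts the matches it is shown. */
static int count_match(void *data, unsigned long long addr)
{
    (void)addr;
    ++*(size_t *)data;
    return 0;
}
```

    With three entries named name_show, lookup_first always returns the first address, while for_each_match visits all three. The callback sees only addresses, though, so telling the entries apart still requires outside knowledge.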

    To address these challenges, the kernel provides kallsyms_lookup_names, which returns all symbols with the same name. While this can help refine the search, it doesn’t entirely resolve the issue. The live patch subsystem uses kallsyms_lookup_names in conjunction with kallsyms_on_each_match_symbol to apply a function to each entry and select the appropriate one. However, this approach has limitations, and its application is primarily confined to the live patch subsystem. If the filter function could be supplied as a BPF program, it would be somewhat more manageable.

    The underlying problem is that, even with these mechanisms, kernel developers can only select the desired symbol in a convoluted way. From a user-space perspective, a trace engineer is left with addresses that are not particularly user-friendly. Moreover, determining the nature of a symbol is solely reliant on the address context, requiring the trace engineer to make educated guesses based on the available information. This complexity adds an extra layer of difficulty for those working with kernel tracing, underscoring the impact of duplicate symbols on practical kernel development scenarios.

    So, is there any solution to fix this?

    Before delving into solutions, it’s essential to acknowledge additional requirements that any potential resolution should meet:

    1. Preserve the current state to not impact the ecosystem: Any solution addressing duplicated symbols in the kernel should strive to maintain the current state, ensuring minimal disruption to the ecosystem or any existing solutions implemented by trace software.

    Now, let’s explore the possible solutions. A straightforward approach involves renaming all duplicate symbols to make each symbol unique. However, this is no trivial task, given the significant changes required. Moreover, there’s currently no preventive mechanism in place to avoid the recurrence of this issue. Any attempt to pursue this path should include a strategy to prevent the kernel from encountering the same situation in the future. It’s worth noting that renaming several symbols simultaneously could have unforeseeable impacts on the ecosystem.

    So, what other options are available to address this issue? As of my investigation in July 2023, there were only two attempts to tackle the “Identification Challenge.”

    My proposal to address this problem involves adding aliases to duplicate symbols.

    Adding aliases to duplicate symbols: can it really solve the problem?

    My approach to resolving this problem is to introduce aliases for duplicate symbols. I believe it can provide a valuable solution for the problem, and here’s how aliases can address the identified issues:

    1. Uncertainty in Symbol Address: Aliases need to be unique. Tracing a unique alias allows users to ensure that the provided address corresponds to the desired function.
    2. Identification Challenge: The alias embodies information useful for correctly understanding the symbol’s nature or source.
    3. Difficulty in Addressing Specific Symbols: The uniqueness of the alias allows the symbol to be selected simply by using the alias.
    4. Preserve the Current State to Not Impact the Ecosystem: By adding aliases without removing the previous symbols, any ecosystem software that dealt with the problem in its way is preserved. This approach allows for a gradual migration to using aliases.

    Indeed, special attention must be given to how aliases are generated. After consulting with the community, I opted to use a string as the alias, comprising the file path and the line number where the symbol is defined. This ensures the alias is not only unique but also carries crucial information for identifying the symbol.
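    A minimal sketch of how such an alias string could be built. The function name is hypothetical, and the rule of replacing every non-alphanumeric character of the path with an underscore is my reading of the aliases shown in the sample output further down (for example name_show@kernel_irq_irqdesc_c_264), not necessarily the patch's exact rule.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<name>@<sanitised path>_<line>" into out.
 * Returns 0 on success, -1 if out is too small.
 */
static int make_alias(char *out, size_t outsz, const char *name,
                      const char *path, unsigned int line)
{
    assert(out && name && path);

    int n = snprintf(out, outsz, "%s@%s_%u", name, path, line);
    if (n < 0 || (size_t)n >= outsz)
        return -1;

    /* Sanitise everything after '@': '/', '.', '-' and so on become '_'. */
    char *p = strchr(out, '@');     /* always found: the format contains '@' */
    for (p++; *p; p++)
        if (!isalnum((unsigned char)*p))
            *p = '_';
    return 0;
}
```

    Calling make_alias with "name_show", "kernel/irq/irqdesc.c" and 264 produces name_show@kernel_irq_irqdesc_c_264.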

    Is your proposal something concrete, or is it just words in the air?

    This proposal is tangible and has undergone several iterations, currently standing at its 7th version as of January 2024. Detailed implementation discussions are more suited for the mailing list or the GitHub repository, where the development is actively taking place.

    In its current state, the patch can:

    • Automatically detect symbols requiring aliases during the kernel build.
    • Add aliases to the core kernel image.
    • Add aliases to modules.
    • Export a file with symbol name statistics.
    • Add aliases to out-of-tree kernel modules using symbol statistics.

    This is how the symbols should look with the patch applied:

    ~ # cat /proc/kallsyms | grep " name_show"
    ffffcaa2bb4f01c8 t name_show
    ffffcaa2bb4f01c8 t name_show@kernel_irq_irqdesc_c_264
    ffffcaa2bb9c1a30 t name_show
    ffffcaa2bb9c1a30 t name_show@drivers_pnp_card_c_186
    ffffcaa2bbac4754 t name_show
    ffffcaa2bbac4754 t name_show@drivers_regulator_core_c_678
    ffffcaa2bbba4900 t name_show
    ffffcaa2bbba4900 t name_show@drivers_base_power_wakeup_stats_c_93
    ffffcaa2bbec4038 t name_show
    ffffcaa2bbec4038 t name_show@drivers_rtc_sysfs_c_26
    ffffcaa2bbecc920 t name_show
    ffffcaa2bbecc920 t name_show@drivers_i2c_i2c_core_base_c_660
    ffffcaa2bbed3840 t name_show
    ffffcaa2bbed3840 t name_show@drivers_i2c_i2c_dev_c_100
    ffffcaa2bbef7210 t name_show
    ffffcaa2bbef7210 t name_show@drivers_pps_sysfs_c_66
    ffffcaa2bbf03328 t name_show
    ffffcaa2bbf03328 t name_show@drivers_hwmon_hwmon_c_72
    ffffcaa2bbff6f3c t name_show
    ffffcaa2bbff6f3c t name_show@drivers_remoteproc_remoteproc_sysfs_c_215
    ffffcaa2bbff8d78 t name_show
    ffffcaa2bbff8d78 t name_show@drivers_rpmsg_rpmsg_core_c_455
    ffffcaa2bbfff7a4 t name_show
    ffffcaa2bbfff7a4 t name_show@drivers_devfreq_devfreq_c_1395
    ffffcaa2bc001f60 t name_show
    ffffcaa2bc001f60 t name_show@drivers_extcon_extcon_c_389
    ffffcaa2bc009890 t name_show
    ffffcaa2bc009890 t name_show@drivers_iio_industrialio_core_c_1396
    ffffcaa2bc01212c t name_show
    ffffcaa2bc01212c t name_show@drivers_iio_industrialio_trigger_c_51
    ffffcaa2bc025e2c t name_show
    ffffcaa2bc025e2c t name_show@drivers_fpga_fpga_mgr_c_618
    ffffcaa2a052102c t name_show [hello]
    ffffcaa2a052102c t name_show@hello_hello_c_8 [hello]
    ffffcaa2a051955c t name_show [rpmsg_char]
    ffffcaa2a051955c t name_show@drivers_rpmsg_rpmsg_char_c_365 [rpmsg_char]

    However, refinement and alignment with community standards are still ongoing, requiring further iterations. The ultimate acceptance of this work into the kernel remains uncertain, but this article serves as a testament to the effort invested.