Wednesday, March 26, 2025

The Case of the Disappearing PID: A Debugging Mystery

Every developer, at some point, encounters a situation so baffling it makes them question their own sanity. This is the story of one such mystery: a heavily multithreaded Golang application, a kernel module, and a PID that vanished into the abyss without a trace.

Spoiler alert: It wasn’t aliens.

The Setup: A Debugging Nightmare

The problem was simple: I was tracking the lifecycle of processes spawned by a third-party binary. To do this, I wrote a Linux Kernel Module that hooked into _do_fork() and do_exit(), logging every process birth and death. And yes, you read that right... _do_fork(). You know, the function that, for decades, went by ‘do_fork’ (and later ‘_do_fork’), even though ‘fork’ was actually just a special case of ‘clone’. Then, suddenly, in 5.10, someone in kernel land had a 'Wait a second!' moment and decided the name was too misleading. So, they renamed it to ‘kernel_clone()’, like, surprise! No more confusion, just 20 years of tradition down the drain. But hey, at least we now know what’s really going on... I think.
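
For the curious: nowadays you can approximate the same fork/exit bookkeeping without writing a module, using dynamic kprobes through tracefs. This is only a sketch under a few assumptions (tracefs mounted at /sys/kernel/tracing, a 5.10+ kernel where the symbol is kernel_clone, and root privileges; on older kernels substitute _do_fork):

```shell
#!/bin/sh
# Sketch: log process births and deaths with dynamic kprobes (needs root).
TRACEFS=/sys/kernel/tracing
if [ -w "$TRACEFS/kprobe_events" ]; then
    # Hook process creation (kernel_clone on >= 5.10) and termination.
    echo 'p:fork_probe kernel_clone' >> "$TRACEFS/kprobe_events"
    echo 'p:exit_probe do_exit'      >> "$TRACEFS/kprobe_events"
    echo 1 > "$TRACEFS/events/kprobes/enable"
    msg="probes armed; read $TRACEFS/trace_pipe for the event stream"
else
    msg="skipped: tracefs not writable (run as root)"
fi
echo "$msg"
```

When you are done, `echo > /sys/kernel/tracing/kprobe_events` (as root, with the probes disabled first) clears them again.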

Back to the story, at first, everything seemed fine. Threads were born, threads died, logs were generated, and the universe remained in harmony. But then, something unholy happened: some PIDs vanished without ever triggering do_exit().

I know what you're thinking... But NO, this was not a case of printk() lag, nor was it tracing inaccuracies. I double-checked using ftrace, netconsole, and even sacrificed a few coffee mugs to the pagan god of debugging... The logs were clear: the PID appeared, then POOF! Gone. No exit call, no final goodbye, no proper burial.

Step One: Denial (And the Stack Overflow Void)

Could a Linux process terminate without passing through do_exit()?

My first instinct was: Absolutely not.

If that were true, the very fabric of Linux process management would collapse. Chaos would reign. Cats and dogs would live together. And yet, my logs insisted otherwise.

So, like any good developer, I turned to Stack Overflow. Surely, someone must have encountered this before. I searched. No ready-made answer. Fine.

I did what any desperate soul would do: I asked the question myself.

Days passed. The responses trickled in, but nothing convinced me. The usual suspects (race conditions, tracing inaccuracies) were suggested, but I had already ruled them out. Stack Overflow had failed me.

I realized I wasn’t going to find the answer just by asking. I had to go hunting.

Step Two: Anger (aka Kernel Grep Hell)

I dug deep. Real deep. Into the Linux kernel source, into mailing lists from 2005, into the depths of Stack Overflow where unsolved mysteries go to die.

And then, I found it. The smoking gun.

Deep in fs/exec.c, hiding like a bug under the rug, was this delightful nugget (from the 4.19 kernel):

/*
 * Become a process group leader with the old leader's pid.
 * The old leader becomes a thread of this thread group.
 * Note: The old leader also uses this pid until release_task
 * is called.  Odd but simple and correct.
 */
tsk->pid = leader->pid;

I read it. I read it again. I re-read it while crying. And then it hit me.

Step Three: Bargaining (Can Two Processes Have the Same PID?)

If you had asked me before this, I’d have said no, absolutely not: two processes cannot share the same PID. That’s like realizing your passport was cloned, and now there's another ‘you’ vacationing in the Bahamas while you’re stuck debugging kernel code. That’s not how things work!

Except, sometimes, it is.

Here’s what happens (in 4.19):

  1. A multithreaded process decides it wants a fresh start and calls execve().
  2. The kernel, being the neat freak it is, has to clean up the old thread group.
  3. But, in doing so, it needs to shuffle some PIDs around.
  4. The newly exec’d thread gets the old leader’s PID, while the old leader, now a zombie, keeps using the same PID until it’s fully reaped.
  5. If you were monitoring the old leader, you’d see its PID go through do_exit() twice. First, when the actual old leader dies. Then, when its "impostor", the thread that inherited its PID, finally meets its own end. So, from an external observer’s perspective, it looks like one process vanished without a trace, while another somehow managed to die twice. Linux: where even PIDs get second lives.
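
The multithreaded dance above needs a C program to reproduce, but the underlying invariant, namely that execve() never changes the PID an outside observer sees, can be demonstrated from plain shell (a minimal sketch; only sh and sed are assumed):

```shell
# A shell replaces itself via exec (execve() under the hood);
# the PID printed before and after the exec is the same process.
out=$(sh -c 'echo $$; exec sh -c "echo \$\$"')
pid_before=$(printf '%s\n' "$out" | sed -n 1p)
pid_after=$(printf '%s\n' "$out" | sed -n 2p)
echo "PID before exec: $pid_before, after exec: $pid_after"
```

In the multithreaded case the kernel goes one step further: the thread that survives the exec additionally takes over the dead leader's PID, which is exactly the trap described above.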

Now, fast-forward to kernel 6.14, and the behavior has been slightly refined:

/*
 * Become a process group leader with the old leader's pid.
 * The old leader becomes a thread of this thread group.
 */
exchange_tids(tsk, leader);

The mechanism has changed, but it still involves shuffling PIDs in a similar way. With exchange_tids(), the process restructuring appears to follow the same logic, likely leading to the same observable effect: one PID seeming to vanish without an obvious do_exit(), while another might appear to exit twice. However, a deeper investigation would be needed to confirm the exact behavior in modern kernels.

This, ladies and gentlemen, was my bug. My missing do_exit() wasn’t missing. It was just… misdirected.

Step Four: Acceptance (And Trolling Future Debuggers)

Armed with this knowledge, I could now definitively answer some existential Linux questions:

  1. Can a Linux process/thread terminate without passing through do_exit()?
    No. Every process must pass through do_exit(), even if it’s via a sneaky backdoor.
  2. Can two processes share the same PID?
    Normally, no. The rule of unique PIDs is sacred... or so we’d like to believe. But every now and then, the kernel bends the rules in the name of sneaky process management. And while modern kernels seem to have repented on this particular trick, well... Where there’s one skeleton in the closet, there’s bound to be more.
  3. Can a Linux process change its PID?
    Yes, in at least one rare case: when de_thread() decides to reassign it.

Final Thoughts (or, How to Break a Debugger’s Mind)

If you ever find yourself debugging a disappearing PID, remember:

  • The kernel is a twisted, brilliant piece of engineering.
  • Process lifecycle tracking is a house of mirrors.
  • Never trust a PID: it might not be who you think it is.
  • Stack Overflow won’t always save you. Sometimes, you have to dig into the source code yourself.
  • And, most importantly: always suspect execve().

In the end, Linux remains a beautifully chaotic system. But at least now, when PIDs disappear into the void, I know exactly which corner of the kernel is laughing at me.

Happy debugging!

Monday, March 10, 2025

Kernel Testing for Not-So-Common Architectures

When developing kernel patches, thorough testing is crucial to ensure stability and correctness. While testing is straightforward for common architectures like x86 or ARM, thanks to abundant tools, binary distributions, and community support, the landscape changes drastically when dealing with less common or emerging architectures.

The Challenge of Less Common Architectures

Emerging architectures, such as RISC-V, are gaining momentum but still face limitations in tooling and ecosystem maturity. Even more challenging are esoteric architectures like loongarch64, which may have minimal community support, scarce documentation, or lack readily available toolchains. Testing kernel patches for these platforms introduces unique hurdles:

  1. Toolchain Availability: Compilers and essential tools might be missing or outdated.

  2. Userspace Construction: Creating a minimal userspace to boot and test the kernel can be complex, especially if standard frameworks don’t support the target architecture.

The Role of Buildroot

In many scenarios, buildroot has been an invaluable resource. It simplifies the process of building both the toolchain and a minimal userspace. Its automated and modular approach makes setting up environments for a wide range of architectures relatively straightforward. However, buildroot has its limitations and doesn’t support every architecture recognized by the Linux kernel (apparently old architectures like parisc32 are still supported by the kernel).

Understanding Userspace Construction

Userspace setup is a critical part of kernel testing. Traditionally, userspace resides on block devices, which introduces a series of complications:

  • Block Device Requirements: The device itself must be available and correctly configured.
  • Kernel Driver Support: The kernel must include the necessary drivers for the block device. If these are modules and not built-in, early boot stages can fail.

An effective alternative is using initramfs. This is a root filesystem packaged in a cpio archive and loaded directly into memory at boot. It simplifies boot processes by eliminating dependencies on block devices.

Building an Initramfs

Building an initramfs introduces its own challenges. Tools like Dracut can automate this process and work well in native build environments. However, in cross-build scenarios, Dracut’s complexity increases. It may struggle with cross-compilation environments, environment configurations, and dependency resolution.

Alternatively, frameworks like Buildroot and Yocto offer comprehensive solutions to build both toolchains and userspaces, including initramfs. These tools can handle cross-compilation but have their drawbacks:

  • Performance: Both tools can be slow.
  • Architecture Support: Not all architectures supported by the Linux kernel are covered.

When the Buildroot-like Approach Falls Short

Encountering an unsupported architecture can be a major roadblock. Without Buildroot, developers need to find alternative strategies to build the necessary toolchain and create a functional userspace for kernel testing.

An Alternative Approach: Crosstool-NG and BusyBox

One effective solution is leveraging Crosstool-NG to build the cross-compilation toolchain and using BusyBox to create a minimal userspace. This approach offers flexibility and control, ensuring that even esoteric architectures can be targeted. Here’s a detailed overview of this method:

  1. Build the Toolchain with Crosstool-NG:

    • Build and Install Crosstool-NG
    • Configure the desired toolchain with ct-ng menuconfig.
    • Select the target architecture and customize the build parameters.
    • For esoteric architectures, enable the EXPERIMENTAL flag in the configuration menu. Some architectures are still considered experimental, and this flag is required to unlock their toolchain support.
    • Proceed with building the toolchain using ct-ng build.
    • Address any architecture-specific quirks or requirements during configuration and compilation.
  2. Create a Minimal Userspace with BusyBox:

    • Export the cross-compiler by setting the environment variable: export CROSS_COMPILE=<path-to-toolchain>/bin/<arch>-linux-.
    • Configure and build BusyBox for a static build to avoid library dependencies: make CONFIG_STATIC=y.
    • A static BusyBox build simplifies root filesystem creation, as it removes the need for organizing the /lib directory for shared libraries.
    • Design the init system using BusyBox’s init with a simple SystemV style inittab:
    • ::sysinit:/bin/mount -t proc proc /proc
      ::sysinit:/bin/mount -o remount,rw /
      ::respawn:/bin/sh
    • The rest of the filesystem can be minimal, with the /bin directory containing BusyBox and symlinks for the core tools.
    • Make sure to have a /dev directory populated with at least the console and tty0 devices; otherwise you won't see any messages, and your init may crash.
    • # mknod -m 622 console c 5 1
      # mknod -m 622 tty0 c 4 0
  3. Sample implementation of this concept is here.
  4. Pack Userspace into an Initramfs:

    • Assemble the userspace into a cpio archive with: find . -print0 | cpio --null -o --format=newc > ../initramfs.cpio.
    • Ensure the kernel configuration is set to load the initramfs at boot.
  5. Build and Test the Kernel:

    • Compile the kernel using the cross-compiled toolchain:
    • make ARCH=<arch> CROSS_COMPILE=<path-to-toolchain>/bin/<arch>-linux-
    • Be aware that excessively long CROSS_COMPILE strings can cause issues, leading the build system to fall back to the native toolchain.
    • Use the kernel configuration symbol CONFIG_INITRAMFS_SOURCE to specify the initramfs for embedding directly into the kernel image, enabling quick validation with QEMU or similar tools.

This method demands more manual configuration than Buildroot but offers a path forward when conventional tools fall short.

Conclusion

Kernel development for less common architectures is a complex but rewarding challenge. When standard tools like Buildroot can’t cover the gap, combining Crosstool-NG and BusyBox provides a reliable and adaptable solution.

Saturday, March 1, 2025

The Tale of the Stubborn Cipher: A Debugging Saga

I’ve been a Red Hat guy for a while now, but lurking in my lab was a traitor: an old Ubuntu 20.04 machine still doing all my heavy lifting, mostly building kernels. Why? Because back in the day, before I saw the light of Red Hat, I was under the false impression that Fedora wasn’t great for cross-compilation. Turns out, I was dead wrong.

Fast-forward to today, and I need to work on LoongArch. Guess what? Ubuntu didn’t have the cross-compilation bits I needed… but Fedora did. Finally, I had a valid excuse to invest time, nuke that relic and bring it into the Fedora family.

But of course, nothing is ever that simple. See, I had an old encrypted disk on that setup, storing past projects, ancient treasures, and probably a few embarrassing bash scripts. No worries! I’ll just re-run the cryptsetup command on Fedora, and boom, I’m in…

Right?

Oh, how naive I was.

The Beginning: A Mysterious Failure

It all started with a simple goal: to mount an encrypted partition using AES-ESSIV:SHA256 on Fedora. The same setup had worked perfectly on the old Ubuntu machine for years, but now Fedora refused to cooperate.

The process seemed straightforward: cryptsetup creates the mapping, but...

$ sudo cryptsetup create --cipher aes-cbc-essiv --key-size 256 --hash sha256 pippo /dev/sdb1
Enter passphrase for /dev/sdb1:
device-mapper: reload ioctl on pippo (253:2) failed: Invalid argument
$

A classic error message, vague and infuriating. This called for serious debugging.

The First Hypothesis: A Kernel Mishap?

I’m a kernel developer, and you know what they say: when all you have is a hammer, everything looks like a kernel bug. So, naturally, my first suspicion landed straight on the Linux kernel. Because, let’s be honest, if something’s broken, it’s probably the kernel’s fault… right? Since ESSIV wasn’t appearing in /proc/crypto, the first suspicion was a kernel issue. Fedora is known for enforcing modern cryptographic policies, and legacy algorithms are often disabled by default.

To investigate, I examined the essiv.ko module source. It turns out that a template registered via crypto_register_template(&essiv_tmpl); does not immediately appear in /proc/crypto. Instead, /proc/crypto reflects the state of crypto_alg_list, which is only updated after the first use of an algorithm.

So while I was staring at /proc/crypto, expecting to see ESSIV magically appear, I was actually just looking at a list of algorithms that had already been used, not the registered templates. The kernel wasn’t necessarily broken: just playing hard to get.

I needed to be sure the upstream kernel code I was looking at was exactly what was running on my machine. Fedora typically does not modify the upstream code, but I needed confirmation. Rather than hunting down Fedora’s kernel source repository, I decided to compare the binary modules from Fedora and upstream Linux, but…

Ah, binary reproducibility… A dream everyone chases but few actually catch. The idea of building a kernel module and getting the exact same binary sounds simple, but in reality, it’s like trying to bake the same cake twice without measuring anything.

What can make binaries different? The obvious culprit is the code itself, but that’s just the start. Data embedded in the binary can also change things. Compiler versions and plugins play a role… If the same source code gets translated differently, you’ll end up with different binaries, no matter how pure your intentions. Then come the non-code factors. A kernel module is an ELF container, and ELF files carry metadata: timestamps, cryptographic signatures, and other bits that make your module unique (and sometimes annoying to compare). Even the flags that mark a module as Out-Of-Tree can introduce differences.

So, when doing a binary comparison, it’s not just a matter of checking if the bytes match... you have to strip out the noise and focus on the meaningful differences. Here’s what I did:

$ objcopy -O binary --only-section=.text essiv.us.ko essiv.us.bin
$ objcopy -O binary --only-section=.text essiv.fedora.ko essiv.fedora.bin
$ cmp essiv.us.bin essiv.fedora.bin

And I was lucky: the result was bitwise identical. No funny business in the kernel. Time to look elsewhere.

The Second Hypothesis: A Cryptsetup Mismatch?

To check whether Fedora’s cryptsetup was behaving differently, I ran the same encryption command on both the old machine and Fedora:

sudo cryptsetup -v create pippo --cipher aes-cbc-essiv:sha256 --key-size 256 /dev/sdb1
  • On the old machine, this worked fine, and the partition mounted successfully.
  • On Fedora, it created the mapping but refused to mount.

The Real Culprit: The Command Line Argument Order

At this point, I scrutinized every possible difference between the Ubuntu and Fedora commands.

And then, the discovery:

cryptsetup create --cipher aes-cbc-essiv:sha256 --key-size 256 --hash sha256 pippo /dev/sdb1

vs.

cryptsetup create pippo --cipher aes-cbc-essiv:sha256 --key-size 256 --hash sha256 /dev/sdb1

The first one fails mysteriously (without any syntax error). The second one works without throwing any error.

$ sudo cryptsetup create --cipher aes-cbc-essiv --key-size 256 --hash sha256 pippo /dev/sdb1
Enter passphrase for /dev/sdb1:
device-mapper: reload ioctl on pippo (253:2) failed: Invalid argument

Why? Because cryptsetup’s argument parser behaves differently depending on argument order. The correct order is:

cryptsetup create <name> <options> <device>

When the name (pippo) is placed before the options, everything just works. But if options come first, something breaks silently.

The Final Barrier: Key Derivation Algorithm Mismatch

With the argument order fixed, one final verification was done: the command no longer failed, but the filesystem still would not mount. Looking at the visible crypto parameters, everything seemed fine, but it was not.

$ sudo dmsetup table pippo

On Fedora, it returned:

0 3907026944 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 8:17 0

On old machine, it returned:

0 3907026944 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 8:97 0

Identical, apart from the backing device's minor number! This meant that the same crypto algorithm was being used, and I was providing the same passphrase. So, in theory, everything should have been correct.

And yet… mounting still failed.

The log only confirmed that the same encryption algorithm was in play; it didn’t prove that the same key was actually being used, since the key is derived from the passphrase, the hashing algorithm, and other parameters.

A final comparison of the cryptsetup debug logs revealed the culprit:

Even though both systems used the same cipher specification (aes-cbc-essiv:sha256), they used different passphrase-to-key derivation methods internally. Fedora’s version of cryptsetup was not deriving the same encryption key.
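
The effect is easy to reproduce in miniature: in plain (non-LUKS) mode the key is essentially a digest of the passphrase, so two different hash choices yield two unrelated keys. A sketch using coreutils digests (sha256sum and sha1sum stand in for the sha256/ripemd160 pair, since coreutils ships no ripemd160 tool):

```shell
#!/bin/sh
# Same passphrase, two hash algorithms: the derived key material differs,
# so a dm-crypt mapping built with the wrong hash decrypts to garbage.
passphrase="correct horse battery staple"
key_a=$(printf '%s' "$passphrase" | sha256sum | cut -d' ' -f1)
key_b=$(printf '%s' "$passphrase" | sha1sum   | cut -d' ' -f1)
echo "sha256-derived key: $key_a"
echo "sha1-derived key:   $key_b"
```

This is why dmsetup table looked identical: the table records the cipher, not the digest of the passphrase that produced the key.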

The Fix: Explicitly Specifying the Hash Algorithm (RIPEMD-160) and Mode

The final working command had to ensure that Fedora derived the key exactly like the old machine:

$ sudo cryptsetup create pippo --cipher aes-cbc-essiv:sha256 --key-size 256 --hash ripemd160 --type plain /dev/sda1

And finally:

$ sudo mount /dev/mapper/pippo /mnt/0
$ ls /mnt/0

Success! The partition mounted perfectly.

The Conclusion: Lessons Learned

  1. Look at things carefully before blaming the kernel.
  2. Cryptographic defaults change across cryptsetup versions: be explicit!
  3. The order of command-line arguments in cryptsetup matters.
  4. Comparing dmsetup table outputs alone is not enough.
  5. Key derivation methods can differ, and it is not evident!

After all the deep dives into kernel modules, crypto policies, and hashing algorithms, the entire issue boiled down to two things:

  1. Wrong argument order in cryptsetup
  2. Key derivation differences between cryptsetup versions.

A truly fitting end to a classic Linux troubleshooting adventure.

Thursday, January 9, 2025

Security implications with printk

Introduction

Kernel debugging is inherently a complex task due to the intricate and low-level nature of kernel operations. Surprisingly, one of the most proficient and useful tools for tackling this challenge is the printk function. While it may seem like a simple utility for printing messages, printk is a cornerstone of kernel debugging, offering critical insights into kernel behavior.

Despite appearing trivial at first glance, printk is one of the most intricate and critical components of the kernel. Its complexity arises from the requirement to function reliably in all possible kernel contexts, including interrupt handlers, non-preemptive sections, and even kernel panics. This complexity has made printk a major obstacle to the integration of the preempt_rt (real-time preemption) patch into the mainline kernel, as achieving deterministic behavior and low-latency logging in real-time systems poses significant challenges.

Kernel debugging often involves analyzing log messages to diagnose issues or understand system behavior. Among the formats used for printing data in the kernel, %pK and %pS serve specific purposes when dealing with pointers. However, their combined usage in the same message can introduce unintended information leaks, potentially undermining Kernel Address Space Layout Randomization (KASLR) security measures.

This blog post explores the problem of combining %pK and %pS in a single message. We’ll start with an introduction to the problem, delve into how these formats work, and discuss specific scenarios, such as those involving kmemleak and module loading, where these issues can arise.

Potential Information Leak from Combining %pK and %pS

The kernel uses %pK to mask sensitive pointer addresses in logs based on the privilege level of the user reading the logs. This is particularly critical for preserving KASLR offsets, which are integral to modern system security. On the other hand, %pS resolves pointers to symbols, printing the function name and offset, or falling back to the raw address if the symbol cannot be resolved. When %pK and %pS are used together, the masking provided by %pK can be voided if %pS prints the same address as a raw pointer. This creates a potential vector for leaking sensitive information, especially when kallsyms fails to resolve the symbol and %pS defaults to showing the raw address.


Kernel Print Formats

To better understand this issue, it’s essential to look at the various print formats available in the kernel. The Documentation/core-api/printk-formats.rst provides an in-depth guide to these formats.

Pointer Type Formats

The printk function offers a variety of powerful format specifiers for handling pointers, enabling developers to extract and display detailed information about kernel symbols, memory addresses, and resource ranges. Depending on the specifier, pointers passed to printk can be printed as raw addresses (%px), symbolic names with or without offsets (%pS, %ps), kernel or user memory strings (%pks, %pus), physical or DMA addresses (%pa[p], %pad), or even complex structures like resources (%pr) or ranges (%pra). Each of these formats is designed to provide flexibility and precision in debugging and introspection, often requiring integration with kernel features such as kallsyms or security mechanisms.

%pS: Symbolic Representation of Function Pointers

The %pS specifier is used to print the symbolic name of a function pointer, including the offset: for a given pointer it outputs a string of the form function_name+offset/size. This feature relies on kallsyms, a kernel mechanism for resolving symbols, which must be enabled at build time. If kallsyms is disabled, %pS falls back to printing the raw address, as symbolic resolution is unavailable. This makes %pS an invaluable tool for debugging, providing human-readable insights into function pointers, especially in backtraces or dynamic kernel environments.

%pK: Security-Conscious Printing of Kernel Pointers

The %pK specifier addresses the security implications of exposing kernel pointers. By default, it prints masked or hashed values (e.g., 00000000) unless the kptr_restrict sysctl parameter allows unrestricted access. This behavior is essential for protecting kernel memory layout information, particularly against exploits like kernel address space layout randomization (KASLR) bypasses. The interaction with the Linux Security Module (LSM) subsystem, such as SELinux, adds another dimension of control. When SELinux is active, additional access checks might apply, ensuring that %pK outputs are aligned with the system's security policy. For instance, even privileged users may encounter restricted pointer output if SELinux policies enforce strict controls.

Complexity Behind a Simple printk

While printk appears to be a simple logging tool, passing a pointer to it can invoke deeply integrated kernel features. Printing with %pS may involve symbol resolution and handling optional features like kallsyms, while %pK necessitates checks against security configurations and LSM policies. This intricate interplay between debugging utility and security subsystem demonstrates how printk transcends its apparent simplicity to become a critical component of kernel functionality and protection.

Real-World Scenarios: kmemleak and Module Loading

There are practical cases where the combined usage of %pK and %pS manifests. One such example is in kmemleak debugging messages. Kmemleak is a kernel memory leak detector that maintains a log of unreferenced memory allocations. A concrete example of this issue can be seen in kmemleak debugging messages when kptr_restrict is set to 1. In this configuration, %pK effectively masks the kernel addresses to prevent leaking sensitive information. However, if %pK and %pS are used together, the masking becomes ineffective. For instance:
unreferenced object 0xffff465a8eb90000 (size 2048):
  comm "insmod", pid 129, jiffies 4294953078
  hex dump (first 32 bytes):
    80 c0 5e 8e 5a 46 ff ff 01 00 00 00 62 00 3c 04  ..^.ZF......b.<.
    00 00 00 00 00 00 00 00 1c 02 b9 8e 5a 46 ff ff  ............ZF..
  backtrace (crc 2f5e480d):
    [<0000000000000000>] kmemleak_alloc+0xb4/0xc4
    [<0000000000000000>] __kmem_cache_alloc_node+0x23c/0x270
    [<0000000000000000>] kmalloc_trace+0x3c/0x90
    [<0000000000000000>] 0xffffac0d743b204c
    [<0000000000000000>] do_one_initcall+0x178/0xc90
    [<0000000000000000>] do_init_module+0x1d8/0x63c
    [<0000000000000000>] load_module+0x10a0/0x1670
    [<0000000000000000>] init_module_from_file+0xdc/0x130
    [<0000000000000000>] idempotent_init_module+0x2d8/0x534
    [<0000000000000000>] __arm64_sys_finit_module+0xb4/0x130
    [<0000000000000000>] invoke_syscall.constprop.0+0xd8/0x1d4
    [<0000000000000000>] do_el0_svc+0x158/0x1dc
    [<0000000000000000>] el0_svc+0x54/0x130
    [<0000000000000000>] el0t_64_sync_handler+0x134/0x150
    [<0000000000000000>] el0t_64_sync+0x17c/0x180
In this example, even leaving aside the first line (where the pointer is printed using %08lx), %pK masks the addresses in the backtrace, but %pS exposes one in the fourth backtrace entry, where the symbol cannot be resolved. The redundancy of %pK and %pS in the same line can undermine the intended security provided by %pK.

Is this case rare or what?

The line [<0000000000000000>] 0xffffac0d743b204c appears in the log when %pS is unable to resolve an address into a symbol, falling back to printing the raw address instead. This situation is not uncommon; in this case it occurs because the address corresponds to the module's initialization function that allocated the memory, which is marked with the __init attribute. Functions marked as __init are discarded once their execution completes, freeing up memory. As a result, kallsyms cannot resolve the symbol, since it no longer exists in the kernel's symbol table, leading to the fallback output of the raw address.

Conclusions

Combining %pK and %pS in kernel messages might seem like a harmless redundancy at first glance. However, this practice can introduce vulnerabilities by inadvertently exposing sensitive kernel information. Understanding the nuances of kernel print formats and their appropriate usage is essential for developers to maintain both system security and effective debugging capabilities.

Tuesday, October 8, 2024

Symbol suffixes compilers use to confuse developers

Ever taken a peek at the symbol table of an object or executable? Or, if you're feeling particularly adventurous, have you snooped around the Linux kernel symbol table (kallsyms, for those on a first-name basis)? If so, you've probably been baffled by the bizarre names the compiler uses to mangle your precious symbols. Well, you’re not alone. I’ve spent some quality time collecting bits and pieces of information from various corners of the internet, carefully cutting and pasting (with great skill, I might add) to create this handy page where everything is nicely grouped. If you're anything like me, you’ll appreciate having all this info in one convenient place. Enjoy the madness!

Suffix Description Note
.constprop.<n> Constant propagation. Indicates that the function has been optimized using constant propagation, where constant values were propagated through the code. The <n> is usually a version number and can increment if the function is optimized multiple times.
.isra.<n> Interprocedural Scalar Replacement of Aggregates (ISRA). This optimization involves breaking down aggregates (like structs or arrays) passed to functions into individual scalar values. It helps improve register usage and reduces memory accesses. The <n> indicates the version or stage of the transformation.
.clone.<n> Function cloning. When the compiler creates a cloned version of a function to optimize it for specific use cases (e.g., with certain constant arguments, different calling conventions, or to assist in function inlining), it adds the .clone.<n> suffix. This is useful for tailoring the function to certain conditions, such as a specific set of input parameters.
.part.<n> Function partitioning. The .part.<n> suffix appears when the compiler splits a large function into smaller parts. This often happens due to the function being too complex for certain optimizations or for inlining purposes. The <n> refers to the specific part number.
.cold Cold code. This suffix is added to functions or parts of functions that are considered "cold" by the compiler. Cold code refers to parts of the program that are rarely executed, such as error handling code or unlikely branches in conditionals. The compiler may optimize these functions for size rather than speed, or move them to separate sections of memory to improve cache performance for "hot" code (code that is frequently executed).
.hot Hot code. Similar to .cold, this suffix indicates "hot" code, which is frequently executed. The compiler might apply aggressive optimizations focused on improving execution speed, such as loop unrolling, function inlining, or improved branch prediction.
.likely/.unlikely Likely or unlikely branches. These suffixes indicate that the compiler has predicted whether a particular branch of code is likely or unlikely to be executed, usually based on profiling data or heuristics. The likely branch is optimized for speed, while the unlikely branch might be optimized for size or minimized in terms of performance impact.
.lto_priv.<hash> Link-Time Optimization (LTO) private function. This suffix appears during link-time optimization (LTO), where functions are optimized across translation units (source files). The .lto_priv. suffix is added to private (non-exported) functions that have been internalized and optimized during the LTO phase. The <hash> is typically a unique identifier.
.omp_fn.<n> OpenMP function. Functions generated as part of OpenMP parallelization are often given this suffix. It indicates that the function was created or modified to handle a parallel region of code as specified by OpenMP directives. The <n> is the version or index of the OpenMP function.
.split.<n> Split functions. This suffix appears when a large function is split into smaller pieces, either for optimization reasons or due to certain compiler strategies (like function outlining). The <n> indicates the part or section number of the split function.
.inline Inlined function. Functions marked with this suffix have been aggressively inlined by the compiler. Sometimes, a specialized inlined version of the function is created, while the original remains intact for non-inlined calls.
.to/.from Conversion functions. These suffixes are used when functions are involved in some sort of conversion process, such as casting or transforming data structures from one form to another. .to typically refers to converting to a certain form, and .from refers to converting from a form.
.gcda Profiling data generation (related to GCov). This suffix is associated with functions that produce profiling data used by GCov (GNU's code coverage tool). These functions track execution counts and other statistics to generate code coverage information.
.llvm.<hash> Local linkage promoted to external linkage. With ThinLTO, you might see mangled names carrying this suffix. This happens because functions inlined across units need their local references made global, causing name changes.

Constant Propagation: Overview

Constant Propagation is an important optimization technique used by compilers to improve the performance of generated code. It involves analyzing the code to identify constant values that are known at compile-time and propagating these constants throughout the code. By substituting variables with their constant values, the compiler can simplify expressions and potentially remove unnecessary calculations, improving both the runtime performance and memory usage of the program.

How Constant Propagation Works:

  1. Identify Constants: During the compilation process, the compiler looks for variables that are assigned constant values. For example:

    int x = 5; 
    int y = 2 * x;
    Here, x is a constant because it is assigned a known value 5.
  2. Propagation: Once the compiler identifies a constant, it replaces the variable with its constant value wherever it appears in subsequent code. Continuing the above example:

    int y = 2 * 5;
  3. Simplification: After propagation, the compiler can further simplify expressions that involve constants:

    int y = 10;    
  4. Dead Code Elimination: Sometimes, constant propagation leads to opportunities for other optimizations, such as dead code elimination. For instance, after propagating constants, conditional branches that always evaluate to true or false can be simplified, allowing the compiler to remove unnecessary branches:

    if (5 < 10) { 
        // This block is always executed, so the condition can be removed. 
    }

Benefits of Constant Propagation:

  • Improved Performance: Constant propagation can eliminate runtime calculations, reducing the overall number of operations in the code. This leads to faster execution times.
  • Reduced Code Size: Simplifying expressions and removing redundant code can reduce the size of the compiled binary.
  • Better Memory Usage: By eliminating unnecessary variables and operations, constant propagation can reduce memory consumption.

Example of Constant Propagation:

Consider the following simple C program:

void example() {
    int a = 10;
    int b = a + 5;
    int c = b * 2;
    printf("%d\n", c);
}

Without constant propagation, this program might be compiled as-is, performing the operations b = a + 5 and c = b * 2 at runtime. However, with constant propagation, the compiler could optimize the program as follows:

void example() {
    int c = 30;
    printf("%d\n", c);
}

Constant Propagation vs. Constant Folding:

  • Constant Propagation focuses on replacing variables that hold constant values with those constants wherever possible in the code.
  • Constant Folding is another optimization technique that involves evaluating constant expressions at compile-time rather than runtime. For example:

    int x = 5 + 3;

    Here, constant folding would replace 5 + 3 with 8 at compile-time, eliminating the need for the addition operation at runtime.

Both techniques often work together, with constant propagation creating opportunities for constant folding and vice versa.

Types of Constant Propagation:

  1. Intra-Procedural Constant Propagation: This type of propagation occurs within a single function or block of code. The compiler tracks constants within the boundaries of the function or block, but it does not propagate values across different functions.
  2. Inter-Procedural Constant Propagation: This is a more advanced form of propagation where the compiler tracks and propagates constants across function boundaries. It requires more complex analysis but can result in better optimization, especially in large programs with function calls.

What .constprop.0 Means

  • Suffix Meaning: The .constprop.0 suffix is added by the compiler (usually by GCC or Clang) to signify that the function has been subjected to constant propagation optimization. The number (0 in .constprop.0) is just a version indicator and can be incremented if the function undergoes further stages of optimization.
  • Constant Propagation at the Function Level: When a compiler identifies that certain arguments to a function are constants, it can create a specialized version of the function where those constants are hardcoded. This allows the function to be optimized more aggressively. The suffix is attached to the optimized function to distinguish it from the original unoptimized version. For example, consider the following function:

    int add(int x, int y) {
        return x + y;
    } 

    If, during optimization, the compiler finds that this function is frequently called with constant values, say add(3, 5), it might create a specialized version of the function where the constants are propagated:

    int add.constprop.0() {
        return 8;  // 3 + 5 has been precomputed
    }

    In the compiled code, this new function might be named something like add.constprop.0 to reflect that it has been optimized based on constant propagation.

When Does Constant Propagation Trigger This?

The compiler performs constant propagation across function boundaries when it can determine that certain function parameters are constant in all or some of the calls to that function. This optimization is often triggered in conjunction with inlining, constant folding, and function specialization. Here’s how it works:

  1. Function Inlining: When the compiler decides to inline a function (replace the call to the function with the actual function code), it can propagate constant arguments into the inlined function. This can lead to opportunities for further simplification. If the function isn’t fully inlined for all calls, the compiler might create a specialized version with constant propagation applied for those constant cases.
  2. Function Specialization: If a function is called multiple times with certain arguments that are constant, the compiler might generate a specialized version of the function for those constant values. The .constprop.0 function is such a specialized version where constant propagation and potentially other optimizations (like dead code elimination or loop unrolling) have been applied.
  3. Rewriting Calls: After creating the specialized version of the function, the compiler rewrites calls to the original function with constant arguments to point to the optimized .constprop.0 version. This way, the optimized version is used where possible, but the original version remains available for cases where the arguments aren’t constant.

Benefits of .constprop.0 Functions

The creation of these specialized functions with constant propagation offers several benefits:

  • Performance Gains: The compiler can optimize away redundant computations and simplify the function, leading to faster execution. For example, expressions that depend on constants might be precomputed, conditional branches might be eliminated, and loops might be unrolled.
  • Reduced Code Size: In some cases, specialized functions with constant propagation can actually reduce the code size, as the compiler might remove code paths that are no longer needed (for example, dead branches or unnecessary variable assignments).
  • Better Cache Usage: Specialized versions of functions can be smaller and more cache-friendly since they focus only on the specific case where certain inputs are constant.

Example in Practice

Consider this C code:

int compute(int a, int b) {
    return a * 2 + b;
}

int main() {
    return compute(4, 5);
}

Without optimization, the compute function would be called at runtime with the arguments 4 and 5. However, during constant propagation, the compiler detects that 4 and 5 are constants and creates a specialized version of compute:

int compute.constprop.0() {
    return 13;  // Precomputed: 4 * 2 + 5
}

The call in main() would be replaced by a direct call to compute.constprop.0(), and no runtime multiplication or addition would be required.

Why Does the Original Function Stay?

The original, non-specialized version of the function typically stays in the binary if there are calls to it with non-constant arguments or if it cannot be fully optimized in all cases. The .constprop.0 function is just an optimized variant for cases where constants are known, so the compiler keeps both versions to handle different calling scenarios.

Possible Reasons for .inline Suffix Existence:

  1. Partial Inlining:
    • What Happens: Sometimes, the compiler may choose to inline a function only in certain places (e.g., hot paths where performance is critical) while retaining the original non-inlined version for other calls. This can happen when the function is small enough to be inlined in performance-critical paths but also used in non-critical paths or in situations where inlining might increase code size too much.
    • Result: In this case, an inlined version may be created, but the original function with an .inline suffix might still be retained for non-inlined calls. This allows the compiler to balance performance and code size.
  2. Inlining Across Translation Units (LTO):
    • What Happens: During Link-Time Optimization (LTO), functions might be inlined across different translation units (source files). However, the function might still need to be retained in its original form for other purposes (such as if it’s part of a shared library or called from another compilation unit that was not optimized in the same way).
    • Result: A version of the function with the .inline suffix could be preserved as an internal symbol, ensuring that the compiler can still generate callable code if needed, while simultaneously allowing aggressive inlining across units.
  3. Multiple Optimization Levels:
    • What Happens: The compiler might generate different versions of a function to optimize for specific use cases. For instance, it could create an inlined version for certain contexts and a standalone version for others, especially if different parts of the code are compiled with different optimization flags or under different constraints (e.g., space vs. speed optimizations).
    • Result: The .inline suffix would then be used to distinguish the inlined version from the original, non-inlined function, even though the function is still present as a callable entity.
  4. Debugging and Profiling:
    • What Happens: Compilers sometimes retain inlined function symbols in the binary even though the code has been inlined, for the purpose of debugging and profiling. Tools like gdb or performance profilers may need to refer to the original function for accurate stack traces, debugging information, or code coverage data.
    • Result: The compiler might keep a symbol with the .inline suffix so that debugging information remains available, even if the function body no longer exists in its original form.
  5. Function Attributes:
    • What Happens: Certain function attributes or calling conventions may require that a function symbol still exists in the binary, even if the function has been inlined elsewhere. For instance, a function might be declared inline but also weak (meaning it can be overridden) or have other attributes that necessitate keeping a symbol for linking purposes.
    • Result: The compiler may generate both an inlined version and retain a separate version of the function marked with .inline, to fulfill these attributes or constraints.

### Scenario 1: The Function Is Declared inline

When a function is explicitly declared as inline in the source code:

  • Expectation: The programmer indicates that they would prefer the function to be inlined to avoid the overhead of a function call. This, however, is a hint, not a guarantee. The compiler can still choose not to inline the function, especially if inlining it would increase code size excessively or if the function is too complex.
  • Linkage and Visibility: Typically, inline functions are defined in headers or in multiple translation units because they should be available to multiple parts of the program. If you declare a function as inline, but it has external linkage, this means the function is visible across multiple translation units, and the linker might still need to ensure that only one definition is used. As a result, the function may still need a symbol in the binary.
  • Compilers can generate a symbol for such inline functions, especially if they are not inlined in all cases. The symbol might be suffixed with .inline if the compiler creates a specialized version after attempting partial inlining.
  • Why retain a symbol?: Even though the function is marked inline, the compiler might not inline it everywhere. It might still create a regular function for some call sites while inlining others. The symbol could remain to provide an externally accessible version in case the inlining isn’t performed universally.
  • In Public Libraries or Interfaces: Despite being marked inline, such functions might still need an externally visible symbol so that other binaries and translation units can link against them.

Interprocedural Scalar Replacement of Aggregates (ISRA): Overview

Interprocedural Scalar Replacement of Aggregates (ISRA) is a compiler optimization technique aimed at improving performance by breaking down large data structures (such as arrays, structs, or classes, collectively called aggregates) into their individual scalar components (like integers or floating-point values). This allows the compiler to perform more efficient optimizations on those individual parts rather than working with the entire structure as a whole.
Interprocedural means that this optimization can take place across function boundaries, not just within a single function.

Let’s explore ISRA in detail:

Key Concepts in ISRA

  1. Aggregate Data Structures:
    • Aggregate types refer to complex data structures such as structs, arrays, or classes, which group together multiple individual variables into a single entity.
    • For example, in C, a struct might look like this:

      struct Point {
          int x;
          int y;
      };
    • The Point structure holds two integers, x and y, as part of one entity. Passing and manipulating this entire structure at once can be inefficient, especially when only some of its fields are used in a function.
  2. Scalar Replacement:
    • Scalar replacement is the process of breaking down an aggregate into its individual scalar components, such as integers, floats, or pointers. This allows the compiler to work with these smaller, more manageable parts instead of the entire structure.
    • For example, the compiler could split struct Point into two scalar variables, int x and int y, allowing it to perform optimizations on x and y independently.

How ISRA Works

In the context of interprocedural optimization, ISRA looks at the data being passed between functions (i.e., across function boundaries) and determines whether the entire aggregate needs to be passed, or if the individual fields of the aggregate can be passed as independent scalars. Consider this simple example:

struct Point {
    int x;
    int y;
};

int computeDistance(struct Point p) {
    return p.x * p.x + p.y * p.y;
}

Without ISRA, the computeDistance function would take a struct Point argument by value, which means that both x and y are passed as part of the struct Point object. This may involve unnecessary memory loads, stores, and passing the entire structure on the stack.

What Happens During ISRA

ISRA optimizes this process by performing the following steps:

  1. Function Argument Decomposition: Instead of passing the entire struct Point as a single argument to computeDistance, ISRA breaks it down into its components. This means that instead of passing the structure p, the compiler will generate a version of the function that takes two int arguments, x and y:

    int computeDistance(int x, int y) {
        return x * x + y * y;
    }
  2. Across Function Boundaries: The key part of ISRA is that it works interprocedurally, meaning it doesn’t just happen within one function but across function calls. If a function calls computeDistance, the compiler can transform both the calling function and computeDistance so that they pass and work on the individual scalar values (x and y), instead of the entire struct Point. For example:

    void process() {
        struct Point p = {3, 4};
        int d = computeDistance(p);
    }

    ISRA would convert this into:

    void process() {
        int x = 3;
        int y = 4;
        int d = computeDistance(x, y);
    }
  3. Improved Register Utilization: By breaking down aggregates into their scalar components, the compiler can store and manipulate those values directly in CPU registers, which are much faster than accessing memory. In the example above, the x and y values can be kept in registers instead of being stored and loaded from memory, reducing the overhead of memory access.
  4. Dead Code Elimination: If only part of the structure is used, ISRA can also help eliminate unused fields. For instance, if a function only needs p.x but not p.y, the compiler can avoid passing or loading p.y entirely. This further reduces unnecessary computation and memory access.

Example of ISRA Optimization

Before ISRA:

struct Point {
    int x;
    int y;
};

int computeDistance(struct Point p) {
    return p.x * p.x + p.y * p.y;
}

void process() {
    struct Point p = {3, 4};
    int d = computeDistance(p);
}

After ISRA:

int computeDistance(int x, int y) {
    return x * x + y * y;
}

void process() {
    int x = 3;
    int y = 4;
    int d = computeDistance(x, y);
}

Benefits of ISRA

  1. Reduced Memory Traffic:
    • Since scalar values (like integers and floats) can often be passed in registers, ISRA reduces the need to read from or write to memory when working with aggregate data. This leads to faster execution because memory access is generally slower than register access.
  2. Smaller Code Size:
    • By eliminating the need to pass entire aggregates (especially if they are large), the generated code becomes smaller and more efficient, as the overhead of copying entire data structures is avoided.
  3. Better Cache Usage:
    • ISRA reduces memory accesses, which improves cache performance. By avoiding unnecessary loads and stores of the entire structure, it minimizes cache pollution, which can result in better overall performance.
  4. Improved Optimizations:
    • Once aggregates are replaced by scalars, the compiler can apply additional optimizations, such as constant propagation, dead code elimination, and register allocation, to individual fields, which can result in more efficient code.

Challenges and Limitations of ISRA

  • Large Structures: For very large structures, ISRA may not always be beneficial because breaking them down into many scalar values can lead to high register pressure. This is especially true on architectures with limited registers, where using too many registers for scalar values can degrade performance.
  • ABI Constraints: Some Application Binary Interfaces (ABI) dictate how functions should pass arguments (whether in registers or on the stack). ISRA optimizations must respect these rules, which may limit the extent to which aggregate structures can be scalarized.
  • Complex Structures: ISRA is easier to apply to simple aggregates (like structs with only a few fields), but it can be more complex or impractical for deeply nested or very large structures, especially if pointers are involved.

Monday, September 30, 2024

Injecting Code on the Fly: Overcoming Challenges to Produce Self-Contained, Data-Stuffed Binary Blobs

Sometimes you run a long-lasting process on a remote machine… For example, to compress a large file… When suddenly you have an emergency: your wife is itching to shop. At that point, you typically have a couple of options:

  1. Stop the job and restart it at a more convenient time.
  2. Keep your computer connected and go out shopping.

Surely, if you had known beforehand, you could have started the job in a screen or tmux session, but usually things don’t go that way, and now you have to decide what to do.

If you have enough time, you can use your trusty gdb to sort this problem out. You can attach to the program, close stderr and stdout, and then create new files to replace them. You can use sigaction to disable SIGHUP, but that isn’t something you can manage when shopping is calling… You simply don’t have the time.

To address this problem, I was trying to code a simple tool to automate the gdb process.

While I was able to produce a PoC using C and assembly, when trying to achieve the same in pure C I ran into a challenge that I’d like to discuss briefly and gather suggestions on, if any.

The issue is that the tool needs to inject code into the running program to replace file descriptors and disable signals. However, the code that needs to be injected might require some data.

The obvious solution would be to write a small binary in assembly that can be executed in that context, but I wanted to write it in C. The problem is: can I embed data into a function in the .text section? The assembly equivalent would be something trivial like:

    jmp code
data:
    .byte [...]
code:
    ... body of the function ...

Doing something similar in C that is both functional and portable is far from trivial. Here’s my current solution, which still has a few issues, and I’d like to collect suggestions:

void injected_function() {
    volatile int a = 0;
    if (a) {
str:
        asm volatile (
            ".byte 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x66, 0x72, 0x6f, 0x6d, 0x20,"
            "0x69, 0x6e, 0x6a, 0x65, 0x63, 0x74, 0x65, 0x64, 0x20, 0x66, 0x75, 0x6e,"
            "0x63, 0x21, 0x0a, 0x00, 0x00"
        );
    }
str_end:
    write(1, (void *) &&str, (&&str_end - &&str));
}

This function is supposed to be injected into the address space of a running program and write: “Hello from injected func!” However, there are a few quirks (maybe more, but I haven’t stumbled into them yet):

  • In x86_64, when -fcf-protection=full is enabled, the asm volatile statements are considered valid jump targets, resulting in an endbr64 being inserted. Solutions include skipping 2 bytes to avoid printing the opcodes of the endbr64 or disabling CFI using -fcf-protection=none. I don’t like either solution, but I couldn’t find another workaround.
  • When compiling this for aarch64, if the message length is not a multiple of four, the resulting label of the code after the string becomes misaligned. My solution was to add a few \x00 bytes to ensure proper alignment, but I’m not satisfied with this approach.

I’m looking for a solution that is architecture-independent, but I haven’t been able to find one. Does anyone have any suggestions?

Monday, July 22, 2024

Using BTF to Build Out-of-Tree Kernel Modules with Private Struct Definitions

Introduction

Out-of-tree (OoT) kernel modules often face challenges when they need access to struct definitions that live in private headers and are not available in public ones. Traditional methods to access these private headers can lead to complications and maintenance challenges. This blog post presents a PoC that demonstrates a method to write OoT kernel modules using BTF to leverage private struct definitions. This approach aims to simplify the build process and improve maintainability.

What is BTF and Why is it in the Linux Kernel?

BPF is a technology used for network packet filtering, tracing, and monitoring within the Linux kernel. It allows users to run sandboxed programs in the kernel space, enabling powerful debugging and performance analysis capabilities. Producing BPF machine code is straightforward with a compiler that targets BPF, but writing a BPF program is more complicated due to the need to access kernel data during execution. For example, if you want to check if the IP of a given packet is your target, you need to access the structure representing the packet in your BPF program. You must navigate to the correct field by moving from the structure's address by a specific offset and interpret it correctly.

This is where BTF comes into play. BTF, or BPF Type Format, is a slim and compact way to represent the structures used in the kernel, accounting for structure randomization. It provides rich type information for BPF programs, essential for accessing and manipulating kernel data structures accurately. BTF enhances the BPF ecosystem by enabling programs to understand and work with kernel data without needing explicit header files.

To support BPF program development, an ecosystem has emerged, with libbpf being the key library that facilitates this. BPF programs need to be loaded into the kernel, and there is a BPF syscall for this operation. libbpf allows creating a loader program in native assembly that not only loads the program into the kernel but also links it (similar to the compiler's link process) to adapt it to the specific kernel, using BTF. Historically, Clang was the first C compiler to support the BPF target. GCC also supports the BPF target, but Clang remains the more commonly used compiler for this task.

BTF focuses solely on describing data structures, which is why it is much more compact than other debugging formats like DWARF. The BTF section included in a production kernel is around 10-20MB, while DWARF info would be around 250-500MB.

Using BTF to Ease OoT Module Build and Maintenance

The PoC demonstrates how to build an OoT kernel module that requires private struct definitions by utilizing BTF. Here’s a step-by-step overview of the process:

  • Search for Structure to Define: Identify the private structures and unions needed for the OoT module from the Linux headers.
  • Collect All Structures and Unions: Gather all relevant structures and unions from the Linux headers.
  • Extract vmlinux from Bootable Image: Extract the vmlinux file from a bootable kernel image, which contains the BTF information.
  • Extract Structures from BTF: Use BTF to extract the required structures from the vmlinux file.
  • Filter BTF Extracted Structures: Filter out the structures that are already declared in public headers to avoid duplication.
  • Produce Header File: Generate a header file containing the necessary structures and unions.
  • Build Kernel Module: Use a customized Makefile and scripts to build the kernel module with the generated header file.

The PoC includes:

  • A customized Makefile that runs scripts to prepare the environment.
  • Module source code that marks structures with //BTF_INCLUDE to indicate they need to be imported.
  • Scripts to ensure consistency with existing structure declarations in public headers.
  • Scripts to handle dependencies and recursively extract related structures without redeclaring existing ones.

Conclusion

This PoC showcases a functional solution for using BTF to build OoT kernel modules with private struct definitions. It demonstrates how BTF can be used to retrieve structure information about non-public definitions. While this PoC is not intended to promote the use of non-public structures in OoT modules, it acknowledges that sometimes this is unavoidable. Using BTF for this purpose can significantly increase the maintainability of the OoT kernel module across different kernel versions.