Wednesday, March 26, 2025

The Case of the Disappearing PID: A Debugging Mystery

Every developer, at some point, encounters a situation so baffling it makes them question their own sanity. This is the story of one such weirdness: a heavily multithreaded Golang application, a kernel module, and a PID that vanished into the abyss without a trace.

Spoiler alert: It wasn’t aliens.

The Setup: A Debugging Nightmare

The problem was simple: I was tracking the lifecycle of processes spawned by a third-party binary. To do this, I wrote a Linux Kernel Module that hooked into _do_fork() and do_exit(), logging every process birth and death. And yes, you read that right... _do_fork(). You know, that function that, for over 20 years, was called ‘_do_fork’, even though ‘fork’ was really just a special case of ‘clone’. Then, suddenly, in kernel 5.10, someone in kernel land had a 'Wait a second!' moment and decided the name was too misleading, so they renamed it to ‘kernel_clone()’. Surprise! No more confusion, just 20 years of tradition down the drain. But hey, at least we now know what’s really going on... I think.
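
For the curious, here is roughly what such a module can look like. This is a minimal sketch, not the module I actually used: it assumes a kprobes-enabled kernel, hooks kernel_clone() with a kretprobe (use "_do_fork" as the symbol name on pre-5.10 kernels) and do_exit() with a plain kprobe, and the file name is made up.

/* fork_exit_trace.c - minimal sketch of a fork/exit tracker (not the
 * original module; file name and symbol choices are assumptions). */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/sched.h>
#include <linux/ptrace.h>

/* kernel_clone() returns the new child's PID, so a kretprobe lets us log it. */
static int clone_ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
        long child = (long)regs_return_value(regs);

        if (child > 0)
                pr_info("fork: %d created %ld\n", current->pid, child);
        return 0;
}

static struct kretprobe clone_krp = {
        .kp.symbol_name = "kernel_clone",   /* "_do_fork" before 5.10 */
        .handler        = clone_ret_handler,
};

/* Every dying task is expected to pass through do_exit(). */
static int exit_pre_handler(struct kprobe *p, struct pt_regs *regs)
{
        pr_info("exit: pid %d (tgid %d)\n", current->pid, current->tgid);
        return 0;
}

static struct kprobe exit_kp = {
        .symbol_name = "do_exit",
        .pre_handler = exit_pre_handler,
};

static int __init trace_init(void)
{
        int ret = register_kretprobe(&clone_krp);

        if (ret)
                return ret;
        ret = register_kprobe(&exit_kp);
        if (ret)
                unregister_kretprobe(&clone_krp);
        return ret;
}

static void __exit trace_exit(void)
{
        unregister_kprobe(&exit_kp);
        unregister_kretprobe(&clone_krp);
}

module_init(trace_init);
module_exit(trace_exit);
MODULE_LICENSE("GPL");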

Back to the story, at first, everything seemed fine. Threads were born, threads died, logs were generated, and the universe remained in harmony. But then, something unholy happened: some PIDs vanished without ever triggering do_exit().

I know what you're thinking... But NO, this was not a case of printk() lag, nor was it a tracing inaccuracy. I double-checked using ftrace and netconsole, and even sacrificed a few coffee mugs to the pagan god of debugging... The logs were clear: the PID appeared, then POOF! Gone. No exit call, no final goodbye, no proper burial.

Step One: Denial (And the Stack Overflow Void)

Could a Linux process terminate without passing through do_exit()?

My first instinct was: Absolutely not.

If that were true, the very fabric of Linux process management would collapse. Chaos would reign. Cats and dogs would live together. And yet, my logs insisted otherwise.

So, like any good developer, I turned to Stack Overflow. Surely, someone must have encountered this before. I searched. No ready-made answer. Fine.

I did what any desperate soul would do: I asked the question myself.

Days passed. The responses trickled in, but nothing convinced me. The usual suspects (race conditions, tracing inaccuracies) were suggested, but I had already ruled them out. Stack Overflow had failed me.

I realized I wasn’t going to find the answer just by asking. I had to go hunting.

Step Two: Anger (aka Kernel Grep Hell)

I dug deep. Real deep. Into the Linux kernel source, into mailing lists from 2005, into the depths of Stack Overflow where unsolved mysteries go to die.

And then, I found it. The smoking gun.

Deep in fs/exec.c, hiding like a bug under the rug, was this delightful nugget (from the 4.19 kernel):

/* Become a process group leader with the old leader's pid.
 * The old leader becomes a thread of this thread group.
 * Note: The old leader also uses this pid until release_task
 *       is called.  Odd but simple and correct.
 */
tsk->pid = leader->pid;

I read it. I read it again. I re-read it while crying. And then it hit me.

Step Three: Bargaining (Can Two Processes Have the Same PID?)

If you had asked me before this, I’d have said no, absolutely not: two processes cannot share the same PID. That’s like realizing your passport was cloned, and now there's another ‘you’ vacationing in the Bahamas while you’re stuck debugging kernel code. That’s not how things work!

Except, sometimes, it is.

Here’s what happens (in 4.19); a small userspace reproducer sketch follows the list:

  1. A multithreaded process decides it wants a fresh start and calls execve().
  2. The kernel, being the neat freak it is, has to clean up the old thread group.
  3. But, in doing so, it needs to shuffle some PIDs around.
  4. The newly exec’d thread gets the old leader’s PID, while the old leader, now a zombie, keeps using the same PID until it’s fully reaped.
  5. If you were monitoring the old leader, you’d see its PID go through do_exit() twice. First, when the actual old leader dies. Then, when its "impostor", the thread that inherited its PID, finally meets its own end. So, from an external observer’s perspective, it looks like one process vanished without a trace, while another somehow managed to die twice. Linux: where even PIDs get second lives.
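
To make this concrete, here is a tiny userspace reproducer, a sketch of my own rather than anything from the original investigation (file name and build line are made up): a secondary thread calls execv(), and the freshly exec’d program reports the old leader’s PID, not the TID of the thread that actually called exec.

/* pid_swap_demo.c - hypothetical reproducer.
 * Build: gcc -pthread pid_swap_demo.c -o pid_swap_demo */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Runs in a secondary thread: its TID differs from the process PID, yet
 * the program it execs will report the old leader's PID as its own. */
static void *thread_fn(void *arg)
{
        printf("thread:  pid=%d tid=%ld, calling execv()\n",
               getpid(), (long)syscall(SYS_gettid));
        char *const argv[] = { "sh", "-c", "echo \"exec'd:  pid=$$\"", NULL };
        execv("/bin/sh", argv);
        return NULL;            /* only reached if execv() fails */
}

int main(void)
{
        pthread_t t;

        printf("leader:  pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
        pthread_create(&t, NULL, thread_fn, NULL);
        pause();                /* the leader never exits on its own; exec reaps it */
        return 0;
}

All three printed pid= values should match, even though the execv() happened in a thread whose TID was different: exactly the shuffle described above.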

Now, fast-forward to kernel 6.14, and the behavior has been slightly refined:

/* Become a process group leader with the old leader's pid.
 * The old leader becomes a thread of this thread group.
 */
exchange_tids(tsk, leader);

The mechanism has changed, but it still involves shuffling PIDs in a similar way. With exchange_tids(), the process restructuring appears to follow the same logic, likely leading to the same observable effect: one PID seeming to vanish without an obvious do_exit(), while another might appear to exit twice. However, a deeper investigation would be needed to confirm the exact behavior in modern kernels.

This, ladies and gentlemen, was my bug. My missing do_exit() wasn’t missing. It was just… misdirected.

Step Four: Acceptance (And Trolling Future Debuggers)

Armed with this knowledge, I could now definitively answer some existential Linux questions:

  1. Can a Linux process/thread terminate without passing through do_exit()?
    No. Every task still passes through do_exit(); it just might not be wearing the PID you first saw it with.
  2. Can two processes share the same PID?
    Normally, no. The rule of unique PIDs is sacred... or so we’d like to believe. But every now and then, the kernel bends the rules in the name of sneaky process management. And while modern kernels seem to have repented on this particular trick, well... Where there’s one skeleton in the closet, there’s bound to be more.
  3. Can a Linux process change its PID?
    Yes, in at least one rare case: when de_thread() decides to reassign it.

Final Thoughts (or, How to Break a Debugger’s Mind)

If you ever find yourself debugging a disappearing PID, remember:

  • The kernel is a twisted, brilliant piece of engineering.
  • Process lifecycle tracking is a house of mirrors.
  • Never trust a PID: it might not be who you think it is.
  • Stack Overflow won’t always save you. Sometimes, you have to dig into the source code yourself.
  • And, most importantly: always suspect execve().

In the end, Linux remains a beautifully chaotic system. But at least now, when PIDs disappear into the void, I know exactly which corner of the kernel is laughing at me.

Happy debugging!

Monday, March 10, 2025

Kernel Testing for Not-So-Common Architectures

When developing kernel patches, thorough testing is crucial to ensure stability and correctness. While testing is straightforward for common architectures like x86 or ARM, thanks to abundant tools, binary distributions, and community support, the landscape changes drastically when dealing with less common or emerging architectures.

The Challenge of Less Common Architectures

Emerging architectures, such as RISC-V, are gaining momentum but still face limitations in tooling and ecosystem maturity. Even more challenging are esoteric architectures like loongarch64, which may have minimal community support, scarce documentation, or lack readily available toolchains. Testing kernel patches for these platforms introduces unique hurdles:

  1. Toolchain Availability: Compilers and essential tools might be missing or outdated.

  2. Userspace Construction: Creating a minimal userspace to boot and test the kernel can be complex, especially if standard frameworks don’t support the target architecture.

The Role of Buildroot

In many scenarios, buildroot has been an invaluable resource. It simplifies the process of building both the toolchain and a minimal userspace. Its automated and modular approach makes setting up environments for a wide range of architectures relatively straightforward. However, buildroot has its limitations and doesn’t support every architecture recognized by the Linux kernel (apparently, old architectures like parisc32 are still supported by the kernel).

Understanding Userspace Construction

Userspace setup is a critical part of kernel testing. Traditionally, userspace resides on block devices, which introduces a series of complications:

  • Block Device Requirements: The device itself must be available and correctly configured.
  • Kernel Driver Support: The kernel must include the necessary drivers for the block device. If these are modules and not built-in, early boot stages can fail.

An effective alternative is using initramfs. This is a root filesystem packaged in a cpio archive and loaded directly into memory at boot. It simplifies boot processes by eliminating dependencies on block devices.

Building an Initramfs

Building an initramfs introduces its own challenges. Tools like Dracut can automate this process and work well in native build environments. However, in cross-build scenarios, Dracut’s complexity increases. It may struggle with cross-compilation environments, environment configurations, and dependency resolution.

Alternatively, frameworks like Buildroot and Yocto offer comprehensive solutions to build both toolchains and userspaces, including initramfs. These tools can handle cross-compilation but have their drawbacks:

  • Performance: Both tools can be slow.
  • Architecture Support: Not all architectures supported by the Linux kernel are covered.

When a Buildroot-like Approach Falls Short

Encountering an unsupported architecture can be a major roadblock. Without Buildroot, developers need to find alternative strategies to build the necessary toolchain and create a functional userspace for kernel testing.

An Alternative Approach: Crosstool-NG and BusyBox

One effective solution is leveraging Crosstool-NG to build the cross-compilation toolchain and using BusyBox to create a minimal userspace. This approach offers flexibility and control, ensuring that even esoteric architectures can be targeted. Here’s a detailed overview of this method:

  1. Build the Toolchain with Crosstool-NG:

    • Build and Install Crosstool-NG
    • Initialize the desired toolchain configuration with ct-ng menuconfig.
    • Select the target architecture and customize the build parameters.
    • For esoteric architectures, enable the EXPERIMENTAL flag in the configuration menu. Some architectures are still considered experimental, and this flag is required to unlock their toolchain support.
    • Proceed with building the toolchain using ct-ng build.
    • Address any architecture-specific quirks or requirements during configuration and compilation.
  2. Create a Minimal Userspace with BusyBox:

    • Export the cross-compiler by setting the environment variable: export CROSS_COMPILE=<path-to-toolchain>/bin/<arch>-linux-.
    • Configure and build BusyBox for a static build to avoid library dependencies: make CONFIG_STATIC=y.
    • A static BusyBox build simplifies root filesystem creation, as it removes the need for organizing the /lib directory for shared libraries.
    • Design the init system using BusyBox’s init with a simple SystemV style inittab (a minimal handwritten C init, sketched after this list, is one alternative):
      ::sysinit:/bin/mount -t proc proc /proc
      ::sysinit:/bin/mount -o remount,rw /
      ::respawn:/bin/sh
    • The rest of the filesystem can be minimal, with the /bin directory containing BusyBox and symlinks for the core tools.
    • Make sure to have a /dev directory populated with at least the console and tty0 device nodes, otherwise you won't see any messages and your init may even crash:
      # mknod -m 622 console c 5 1
      # mknod -m 622 tty0 c 4 0
  3. Sample implementation of this concept is here.
  4. Pack Userspace into an Initramfs:

    • Assemble the userspace into a cpio archive with: find . -print0 | cpio --null -o --format=newc > ../initramfs.cpio.
    • Ensure the kernel configuration is set to load the initramfs at boot.
  5. Build and Test the Kernel:

    • Compile the kernel using the cross-compiled toolchain:
    • make ARCH=<arch> CROSS_COMPILE=<path-to-toolchain>/bin/<arch>-linux-
    • Be aware that excessively long CROSS_COMPILE strings can cause issues, leading the build system to fall back to the native toolchain.
    • Use the kernel configuration symbol CONFIG_INITRAMFS_SOURCE to specify the initramfs for embedding directly into the kernel image, enabling quick validation with QEMU or similar tools.
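
As a side note to step 2: if BusyBox’s init feels like one moving part too many, the same inittab steps can be performed by a tiny handwritten init. The sketch below is my own assumption-laden alternative (file name, build line, and the choice to drop straight into a shell are mine), not part of the setup described above; compile it statically and place it as /init in the initramfs.

/* init.c - hypothetical minimal /init; build it statically, e.g.:
 *   <arch>-linux-gcc -static -o init init.c */
#include <stdio.h>
#include <unistd.h>
#include <sys/mount.h>

int main(void)
{
        /* Same steps as the inittab above: mount /proc, remount / read-write. */
        if (mount("proc", "/proc", "proc", 0, NULL) < 0)
                perror("mount /proc");
        if (mount("none", "/", NULL, MS_REMOUNT, NULL) < 0)
                perror("remount /");

        /* Hand control over to an interactive shell (BusyBox's sh here). */
        execl("/bin/sh", "sh", (char *)NULL);
        perror("execl /bin/sh");
        return 1;
}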

This method demands more manual configuration than Buildroot but offers a path forward when conventional tools fall short.

Conclusion

Kernel development for less common architectures is a complex but rewarding challenge. When standard tools like Buildroot can’t cover the gap, combining Crosstool-NG and BusyBox provides a reliable and adaptable solution.

Saturday, March 1, 2025

The Tale of the Stubborn Cipher: A Debugging Saga

I’ve been a Red Hat guy for a while now, but lurking in my lab was a traitor: an old Ubuntu 20.04 machine still doing all my heavy lifting, mostly building kernels. Why? Because back in the day, before I saw the light of Red Hat, I was under the false impression that Fedora wasn’t great for cross-compilation. Turns out, I was dead wrong.

Fast-forward to today, and I need to work on LoongArch. Guess what? Ubuntu didn’t have the cross-compilation bits I needed… but Fedora did. Finally, I had a valid excuse to invest time, nuke that relic and bring it into the Fedora family.

But of course, nothing is ever that simple. See, I had an old encrypted disk on that setup, storing past projects, ancient treasures, and probably a few embarrassing bash scripts. No worries! I’ll just re-run the cryptsetup command on Fedora, and boom, I’m in…

Right?

Oh, how naive I was.

The Beginning: A Mysterious Failure

It all started with a simple goal: to mount an encrypted partition using AES-ESSIV:SHA256 on Fedora. The same setup had worked perfectly on the old Ubuntu machine for years, but now Fedora refused to cooperate.

The process seemed straightforward: just let cryptsetup create the mapping, but...

$ sudo cryptsetup create --cipher aes-cbc-essiv --key-size 256 --hash sha256 pippo /dev/sdb1
Enter passphrase for /dev/sdb1:
device-mapper: reload ioctl on pippo (253:2) failed: Invalid argument
$

A classic error message, vague and infuriating. This called for serious debugging.

The First Hypothesis: A Kernel Mishap?

I’m a kernel developer, and you know what they say: when all you have is a hammer, everything looks like a kernel bug. So, naturally, my first suspicion landed straight on the Linux kernel. Because, let’s be honest, if something’s broken, it’s probably the kernel’s fault… right? And indeed, ESSIV wasn’t appearing in /proc/crypto, which pointed straight at a kernel issue: Fedora is known for enforcing modern cryptographic policies, and legacy algorithms are often disabled by default.

To investigate, I examined the essiv.ko module source. It turns out that a template registered with crypto_register_template(&essiv_tmpl) does not immediately show up in /proc/crypto. Instead, /proc/crypto reflects the state of crypto_alg_list, which is only updated after the first use of an algorithm.

So while I was staring at /proc/crypto, expecting to see ESSIV magically appear, I was actually just looking at a list of algorithms that had already been used, not the registered templates. The kernel wasn’t necessarily broken: just playing hard to get.
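
One way to force that “first use” from userspace, without creating a dm-crypt mapping at all, is the kernel’s AF_ALG socket interface. The snippet below is my own illustration rather than part of the original debugging session; it assumes CONFIG_CRYPTO_USER_API_SKCIPHER is enabled and that the template behind aes-cbc-essiv:sha256 is spelled "essiv(cbc(aes),sha256)".

/* poke_essiv.c - hypothetical helper: binding an AF_ALG socket instantiates
 * the algorithm, which then shows up in /proc/crypto. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

int main(void)
{
        struct sockaddr_alg sa = {
                .salg_family = AF_ALG,
                .salg_type   = "skcipher",
                .salg_name   = "essiv(cbc(aes),sha256)",   /* assumed spelling */
        };
        int fd = socket(AF_ALG, SOCK_SEQPACKET, 0);

        if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
                perror("AF_ALG");
                return 1;
        }
        puts("essiv instantiated; now grep essiv /proc/crypto");
        close(fd);
        return 0;
}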

I needed to be sure the upstream kernel code I was looking at was exactly what was running on my machine. Fedora typically does not modify the upstream code, but I needed confirmation. Rather than hunting down Fedora’s kernel source repository, I decided to compare the binary modules from Fedora and upstream Linux, but…

Ah, binary reproducibility… A dream everyone chases but few actually catch. The idea of building a kernel module and getting the exact same binary sounds simple, but in reality, it’s like trying to bake the same cake twice without measuring anything.

What can make binaries different? The obvious culprit is the code itself, but that’s just the start. Data embedded in the binary can also change things. Compiler versions and plugins play a role… If the same source code gets translated differently, you’ll end up with different binaries, no matter how pure your intentions. Then come the non-code factors. A kernel module is an ELF container, and ELF files carry metadata: timestamps, cryptographic signatures, and other bits that make your module unique (and sometimes annoying to compare). Even the flags that mark a module as Out-Of-Tree can introduce differences.

So, when doing a binary comparison, it’s not just a matter of checking if the bytes match... you have to strip out the noise and focus on the meaningful differences. Here’s what I did:

$ objcopy -O binary --only-section=.text essiv.us.ko essiv.us.bin
$ objcopy -O binary --only-section=.text essiv.fedora.ko essiv.fedora.bin
$ cmp essiv.us.bin essiv.fedora.bin

And I was lucky: the result was bitwise identical. No funny business in the kernel. Time to look elsewhere.

The Second Hypothesis: A Cryptsetup Mismatch?

To check if Fedora’s cryptsetup was behaving differently, the same encryption command was run on both the old machine and Fedora:

sudo cryptsetup -v create pippo --cipher aes-cbc-essiv:sha256 --key-size 256 /dev/sdb1
  • On the old machine, this worked fine, and the partition mounted successfully.
  • On Fedora, it created the mapping but refused to mount.

The Real Culprit: The Command Line Argument Order

At this point, every possible difference between the Ubuntu and Fedora commands was scrutinized.

And then, the discovery:

cryptsetup create --cipher aes-cbc-essiv:sha256 --key-size 256 --hash sha256 pippo /dev/sdb1

vs.

cryptsetup create pippo --cipher aes-cbc-essiv:sha256 --key-size 256 --hash sha256 /dev/sdb1

The first one fails mysteriously (without any syntax error). The second one just works, without throwing any error.

$ sudo cryptsetup create --cipher aes-cbc-essiv --key-size 256 --hash sha256 pippo /dev/sdb1
Enter passphrase for /dev/sdb1:
device-mapper: reload ioctl on pippo (253:2) failed: Invalid argument

Why? Because cryptsetup’s argument parser behaves differently depending on argument order. The correct order is:

cryptsetup create <name> <options> <device>

When the name (pippo) is placed before the options, everything just works. But if options come first, something breaks silently.

The Final Barrier: Key Derivation Algorithm Mismatch

With the argument order fixed, one final verification was done: the command no longer failed, but the filesystem still would not mount. Looking at the visible crypto parameters, everything seemed fine, but it was not.

$ sudo dmsetup table pippo

On Fedora, it returned:

0 3907026944 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 8:17 0

On old machine, it returned:

0 3907026944 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 8:97 0

Identical, apart from the backing device numbers! This meant that the same crypto algorithm was being used, and I was providing the same passphrase. So, in theory, everything should have been correct.

And yet… mounting still failed.

The log only confirmed that the same encryption algorithm was in play; it didn’t prove that the same key was actually being used, since the key is derived from the passphrase, the hashing algorithm, and other parameters.

A final comparison of the cryptsetup debug logs revealed the culprit:

Even though both systems used the same cipher specification (aes-cbc-essiv:sha256), they used different passphrase-to-key derivation methods internally. Fedora’s version of cryptsetup was not deriving the same encryption key.

The Fix: Explicitly Specifying the Hash Algorithm (RIPEMD-160) and Mode

The final working command had to ensure that Fedora derived the key exactly like the old machine:

$ sudo cryptsetup create pippo --cipher aes-cbc-essiv:sha256 --key-size 256 --hash ripemd160 --type plain /dev/sda1

And finally:

$ sudo mount /dev/mapper/pippo /mnt/0
$ ls /mnt/0

Success! The partition mounted perfectly.

The Conclusion: Lessons Learned

  1. Look at things carefully before blaming the kernel.
  2. Cryptographic defaults change across cryptsetup versions: be explicit!
  3. The order of command-line arguments in cryptsetup matters.
  4. Comparing dmsetup table outputs alone is not enough.
  5. Key derivation methods can differ, and the difference is not evident!

After all the deep dives into kernel modules, crypto policies, and hashing algorithms, the entire issue boiled down to two things:

  1. Wrong argument order in cryptsetup
  2. Key derivation differences between cryptsetup versions.

A truly fitting end to a classic Linux troubleshooting adventure.