Every developer, at some point, encounters a situation so baffling it makes them question their own sanity. This is the story of one such weirdness: a heavily multithreaded Golang application, a kernel module, and a PID that vanished into the abyss without a trace.
Spoiler alert: It wasn’t aliens.
The Setup: A Debugging Nightmare
The problem was simple: I was tracking the lifecycle of processes spawned by a third-party binary. To do this, I wrote a Linux Kernel Module that hooked into _do_fork() and do_exit(), logging every process birth and death. And yes, you read that right... _do_fork(). You know, that function that, for over 20 years, carried ‘fork’ in its name (first do_fork(), later _do_fork()), even though ‘fork’ was actually just a special case of ‘clone’. Then, suddenly, in 5.10, someone in kernel land had a 'Wait a second!' moment and decided the name was too misleading. So, they renamed it to ‘kernel_clone()’, like, surprise! No more confusion, just 20 years of tradition down the drain. But hey, at least we now know what’s really going on... I think.
Back to the story. At first, everything seemed fine. Threads were born, threads died, logs were generated, and the universe remained in harmony. But then, something unholy happened: some PIDs vanished without ever triggering do_exit().
I know what you're thinking... But NO, this was not a case of printk() lag, nor was it a tracing inaccuracy. I double-checked using ftrace, netconsole, and even sacrificed a few coffee mugs to the pagan god of debugging... The logs were clear: the PID appeared, then POOF! Gone. No exit call, no final goodbye, no proper burial.
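The cross-check behind "the PID appeared, then POOF" boils down to matching births against deaths. Here is a minimal sketch of that reconciliation in Python, assuming the module's log has already been parsed into (kind, pid) tuples. The tuple format, the function name and the sample PIDs are all hypothetical, not the actual log format used here.

```python
from collections import Counter

def reconcile(events):
    """Match births against deaths.

    events: list of (kind, pid) tuples, where kind is "fork" or "exit"
    (a hypothetical, already-parsed form of the module's log).
    Returns (never_exited, extra_exits) as sorted lists of PIDs.
    """
    births = Counter(pid for kind, pid in events if kind == "fork")
    deaths = Counter(pid for kind, pid in events if kind == "exit")
    # PIDs that were born but never seen in an exit record
    never_exited = sorted(pid for pid in births if deaths[pid] == 0)
    # PIDs with more exit records than birth records
    extra_exits = sorted(pid for pid, n in deaths.items() if n > births[pid])
    return never_exited, extra_exits

if __name__ == "__main__":
    sample = [("fork", 100), ("fork", 101), ("fork", 102),
              ("exit", 102), ("exit", 100)]
    print(reconcile(sample))
```

Reporting surplus exits alongside missing ones costs nothing and, in hindsight, is the half of the picture that is easiest to overlook.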
Step One: Denial (And the Stack Overflow Void)
Could a Linux process terminate without passing through do_exit()?
My first instinct was: Absolutely not.
If that were true, the very fabric of Linux process management would collapse. Chaos would reign. Cats and dogs would live together. And yet, my logs insisted otherwise.
So, like any good developer, I turned to Stack Overflow. Surely, someone must have encountered this before. I searched. No ready-made answer. Fine.
I did what any desperate soul would do: I asked the question myself.
Days passed. The responses trickled in, but nothing convinced me. The usual suspects, race conditions, tracing inaccuracies, were suggested, but I had already ruled them out. Stack Overflow had failed me.
I realized I wasn’t going to find the answer just by asking. I had to go hunting.
Step Two: Anger (aka Kernel Grep Hell)
I dug deep. Real deep. Into the Linux kernel source, into mailing lists from 2005, into the depths of Stack Overflow where unsolved mysteries go to die.
And then, I found it. The smoking gun.
Deep in fs/exec.c, inside de_thread(), hiding like a bug under the rug, was this delightful nugget (from the 4.19 kernel):
I read it. I read it again. I re-read it while crying. And then it hit me.
Step Three: Bargaining (Can Two Processes Have the Same PID?)
If you had asked me before this, I’d have said no, absolutely not: two processes cannot share the same PID. That’s like realizing your passport was cloned, and now there's another ‘you’ vacationing in the Bahamas while you’re stuck debugging kernel code. That’s not how things work!
Except, sometimes, it is.
Here’s what happens (in 4.19):
- A multithreaded process decides it wants a fresh start and calls execve() from a thread that is not the thread group leader.
- The kernel, being the neat freak it is, has to clean up the old thread group.
- But, in doing so, it needs to shuffle some PIDs around.
- The newly exec’d thread gets the old leader’s PID, while the old leader, now a zombie, keeps using the same PID until it’s fully reaped.
- If you were monitoring the old leader, you’d see its PID go through do_exit() twice. First, when the actual old leader dies. Then, when its "impostor", the thread that inherited its PID, finally meets its own end. Meanwhile, the PID the exec’ing thread was born with never shows up in do_exit() at all. So, from an external observer’s perspective, it looks like one process vanished without a trace, while another somehow managed to die twice. Linux: where even PIDs get second lives.
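The steps above can be sketched as a toy model. This is plain Python bookkeeping, not kernel code: the Task class, the function name and the PIDs 100 and 101 are invented for illustration, and the model only mirrors the order of events described in the list.

```python
class Task:
    """Stand-in for a kernel task; only the PID matters here."""
    def __init__(self, pid):
        self.pid = pid

def simulate_exec_from_thread(leader_pid, thread_pid):
    """Return the PIDs an exit hook would log, in order (toy model)."""
    exit_log = []
    leader, thread = Task(leader_pid), Task(thread_pid)

    # The non-leader thread calls execve(): the old leader is killed and
    # passes through the exit path under its own PID.
    exit_log.append(leader.pid)

    # The exec'ing thread takes over the old leader's PID.
    thread.pid = leader.pid

    # Later, the exec'd program terminates; its exit is logged under the
    # inherited PID, not under the PID the thread was created with.
    exit_log.append(thread.pid)
    return exit_log

if __name__ == "__main__":
    print(simulate_exec_from_thread(100, 101))  # prints [100, 100]
```

The thread's original PID (101 in this run) never appears in the exit log, which is the "vanished" PID a birth/death tracker would report.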
Now, fast-forward to kernel 6.14, and the behavior has been slightly refined:
The mechanism has changed, but it still involves shuffling PIDs in a similar way. With exchange_tids(), the process restructuring appears to follow the same logic, likely leading to the same observable effect: one PID seeming to vanish without an obvious do_exit(), while another might appear to exit twice. However, a deeper investigation would be needed to confirm the exact behavior in modern kernels.
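The user-visible half of this can at least be poked at from user space on whatever kernel is at hand. The sketch below assumes Linux and Python 3.8+ (for threading.get_native_id()). It starts a child interpreter in which a non-leader thread prints its PID and thread ID, calls exec, and lets the new program print the same pair again. It says nothing about how often do_exit() runs; it only shows the exec'ing thread coming back under the leader's ID. The names CHILD, REPORT and exec_from_thread are made up for this example.

```python
import subprocess
import sys

# Program run in a child interpreter: a non-leader thread reports its IDs,
# then calls exec; the freshly exec'd program reports its IDs again.
CHILD = r"""
import os, sys, threading

REPORT = "import os, threading; print(os.getpid(), threading.get_native_id())"

def worker():
    print(os.getpid(), threading.get_native_id(), flush=True)
    os.execv(sys.executable, [sys.executable, "-c", REPORT])

t = threading.Thread(target=worker)
t.start()
t.join()
"""

def exec_from_thread():
    """Return ((pid, tid) before exec, (pid, tid) after exec)."""
    out = subprocess.run([sys.executable, "-c", CHILD],
                         capture_output=True, text=True, check=True).stdout
    before, after = (tuple(map(int, line.split())) for line in out.splitlines())
    return before, after

if __name__ == "__main__":
    print(exec_from_thread())
```

On Linux, the expectation is that the thread ID before exec differs from the PID, while after exec the two are equal and the PID itself has not changed.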
This, ladies and gentlemen, was my bug. My missing do_exit() wasn’t missing. It was just… misdirected.
Step Four: Acceptance (And Trolling Future Debuggers)
Armed with this knowledge, I could now definitively answer some existential Linux questions:
- Can a Linux process/thread terminate without passing through do_exit()?
No. Every process must pass through do_exit(), even if it’s via a sneaky backdoor.
- Can two processes share the same PID?
Normally, no. The rule of unique PIDs is sacred... or so we’d like to believe. But every now and then, the kernel bends the rules in the name of sneaky process management. And while modern kernels seem to have repented of this particular trick, well... Where there’s one skeleton in the closet, there’s bound to be more.
- Can a Linux process change its PID?
Yes, in at least one rare case: when de_thread() decides to reassign it.
Final Thoughts (or, How to Break a Debugger’s Mind)
If you ever find yourself debugging a disappearing PID, remember:
- The kernel is a twisted, brilliant piece of engineering.
- Process lifecycle tracking is a house of mirrors.
- Never trust a PID: it might not be who you think it is.
- Stack Overflow won’t always save you. Sometimes, you have to dig into the source code yourself.
- And, most importantly: always suspect execve().
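One possible way to act on "always suspect execve()" is to feed exec events into the tracker as well, so that a task's identity follows it across the PID change. The sketch below is an assumption-laden illustration, not the module described in this post: it assumes an event source that reports both the old and the new PID at exec time (the kernel's sched_process_exec tracepoint exposes old_pid and pid fields, for instance) and that the old leader's exit is logged before the exec event. The event tuples and the function name are hypothetical.

```python
def track(events):
    """Follow task identities across exec.

    events: list of ("fork", pid), ("exit", pid) or ("exec", old_pid, pid)
    tuples (a hypothetical event format).
    Returns (live, unmatched_exits): PIDs still believed alive, and exits
    that did not match any task known to be alive.
    """
    live = set()
    unmatched_exits = []
    for kind, *ids in events:
        if kind == "fork":
            live.add(ids[0])
        elif kind == "exec":
            old_pid, pid = ids
            # After exec the task is known as `pid`; forget its old ID.
            live.discard(old_pid)
            live.add(pid)
        elif kind == "exit":
            if ids[0] in live:
                live.remove(ids[0])
            else:
                unmatched_exits.append(ids[0])
    return sorted(live), unmatched_exits

if __name__ == "__main__":
    with_exec = [("fork", 100), ("fork", 101), ("exit", 100),
                 ("exec", 101, 100), ("exit", 100)]
    print(track(with_exec))  # prints ([], [])
```

Dropping the exec event from the same sequence leaves PID 101 alive forever and one exit of PID 100 unmatched, which is exactly the puzzle this post started with.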
In the end, Linux remains a beautifully chaotic system. But at least now, when PIDs disappear into the void, I know exactly which corner of the kernel is laughing at me.
Happy debugging!