Sunday, September 14, 2025

Schrödinger’s test: The /dev/mem case

Why I Went Down This Rabbit Hole

Back in 1993, when Linux 0.99.14 was released, /dev/mem made perfect sense. Computers were simpler, physical memory was measured in megabytes, and security basically boiled down to: “Don’t run untrusted programs.”

Fast-forward to today. We have gigabytes (or terabytes!) of RAM, multi-layered virtualization, and strict security requirements… And /dev/mem is still here, quietly sitting in the kernel, practically unchanged… A fossil from a different era. It’s incredibly powerful, terrifyingly dangerous, and absolutely fascinating.

My work on /dev/mem is part of a bigger effort by the ELISA Architecture working group, whose mission is to improve Linux kernel documentation and testing. This project is a small pilot in a broader campaign: build tests for old, fundamental pieces of the kernel that everyone depends on but few dare to touch.

In a previous blog post, “When kernel comments get weird”, I dug into the /dev/mem source code and traced its history, uncovering quirky comments and code paths that date back decades. That post was about exploration. This one is about action: turning that historical understanding into concrete tests to verify that /dev/mem behaves correctly… Without crashing the very systems those tests run on.

What /dev/mem Is and Why It Matters

/dev/mem is a character device that exposes physical memory directly to userspace. Open it like a file, and you can read or write raw physical addresses: no page tables, no virtual memory abstractions, just the real thing.

Why is this powerful? Because it lets you:

  • Peek at firmware data structures,
  • Poke device registers directly,
  • Explore memory layouts normally hidden from userspace.

It’s like being handed the keys to the kingdom… and also a grenade, with the pin halfway pulled.

A single careless write to /dev/mem can:

  • Crash the kernel,
  • Corrupt hardware state,
  • Or make your computer behave like a very expensive paperweight.

For me, that danger is exactly why this project matters. Testing /dev/mem itself is tricky: the tests must prove the driver works, without accidentally nuking the machine they run on.

STRICT_DEVMEM and Real-Mode Legacy

One of the first landmines you encounter with /dev/mem is the kernel configuration option STRICT_DEVMEM.

Think of it as a global policy switch:

  • If disabled, /dev/mem lets privileged userspace access almost any physical address: kernel RAM, device registers, firmware areas, you name it.
  • If enabled, the kernel filters which physical ranges are accessible through /dev/mem. Typically, it only permits access to low legacy regions, like the first megabyte of memory where real-mode BIOS and firmware tables traditionally live, while blocking everything else.

Why does this matter? Some very old software, like emulators for DOS or BIOS tools, still expects to peek and poke those legacy addresses as if running on bare metal. STRICT_DEVMEM exists so those programs can still work, but without handing them carte blanche access to all memory.

So when you’re testing /dev/mem, the presence (or absence) of STRICT_DEVMEM completely changes what your test can do. With it disabled, /dev/mem is a wild west. With it enabled, only a small, carefully whitelisted subset of memory is exposed.

A Quick Note on Architecture Differences

While /dev/mem always exposes what the kernel considers physical memory, the definition of physical itself can differ across architectures. For example, on x86, physical addresses are the real hardware addresses. On aarch64 with virtualization or secure firmware, EL1 may only see a subset of memory through a translated view, controlled by EL2 or EL3.

In practice, the filtering that STRICT_DEVMEM enforces is architecture-specific: each architecture supplies its own devmem_is_allowed() logic, which decides whether a given physical range may legitimately be accessed from userspace via /dev/mem, preventing unsafe or insecure memory accesses.

32-Bit Systems and the Mystery of High Memory

On most systems, the kernel needs a direct way to access physical memory. To make that fast, it keeps a linear mapping: a simple, one-to-one correspondence between physical addresses and a range of kernel virtual addresses. If the kernel wants to read physical address 0x00100000, it just uses a fixed offset, like PAGE_OFFSET + 0x00100000. Easy and efficient.

But there’s a catch on 32-bit kernels: The kernel’s entire virtual address space is only 4 GB, and it has to share that with userspace. By convention, 3 GB is given to userspace, and 1 GB is reserved for the kernel, which includes its linear mapping.

Now here comes the tricky part: Physical RAM can easily exceed 1 GB. The kernel can’t linearly map all of it: there just isn’t enough virtual address space.

The extra memory beyond the first gigabyte is called highmem (short for high memory). Unlike the low 1 GB, which is always mapped, highmem pages are mapped temporarily, on demand, whenever the kernel needs them.

Why this matters for /dev/mem: /dev/mem depends on the permanent linear mapping to expose physical addresses. Highmem pages aren’t permanently mapped, so /dev/mem simply cannot see them. If you try to read those addresses, you’ll get zeros or an error, not because /dev/mem is broken, but because that part of memory is literally invisible to it.

For testing, this introduces extra complexity:

  • Some reads may succeed on lowmem addresses but fail on highmem.
  • Behavior on a 32-bit machine with highmem is fundamentally different from a 64-bit system, where all RAM is flat-mapped and visible.

Highmem is a deep topic that deserves its own article, but even this quick overview is enough to understand why it complicates /dev/mem testing.

How Reads and Writes Actually Happen

A common misconception is that a single userspace read() or write() call maps to one atomic access to the underlying device. In reality, the VFS layer and the device driver may split your request into multiple chunks, depending on its alignment and on page boundaries.

Why does this happen?

  • Many devices can only handle fixed-size or aligned operations.
  • For physical memory, the natural unit is a page (commonly 4 KB).

When your request crosses a page boundary, the kernel internally slices it into:

  1. A first piece up to the page boundary,
  2. Several full pages,
  3. A trailing partial page.

For /dev/mem, this is a crucial detail: A single read or write might look seamless from userspace, but under the hood it’s actually several smaller operations, each with its own state. If the driver mishandles even one of them, you could see skipped bytes, duplicated data, or mysterious corruption.

Understanding this behavior is key to writing meaningful tests.

Safely Reading and Writing Physical Memory

At this point, we know what /dev/mem is and why it’s both powerful and terrifying. Now we’ll move to the practical side: how to interact with it safely, without accidentally corrupting your machine or testing in meaningless ways.

My very first test implementation kept things simple:

  • Only small reads or writes,
  • Always staying within a single physical page,
  • Never crossing dangerous boundaries.

Even with these restrictions, /dev/mem testing turned out to be more like defusing a bomb than flipping a switch.

Why “success” doesn’t mean success (in this very specific case)

Normally, when you call a syscall like read() or write(), you can safely assume the kernel did exactly what you asked. If read() returns a positive number, you trust that the data in your buffer matches the file’s contents. That’s the contract between userspace and the kernel, and it works beautifully in everyday programming.

But here’s the catch: We’re not just using /dev/mem; we’re testing whether /dev/mem itself works correctly.

This changes everything.

If my test reads from /dev/mem and fills a buffer with data, I can’t assume that data is correct:

  • Maybe the driver returned garbage,
  • Maybe it skipped a region or duplicated bytes,
  • Maybe it silently failed in the middle but still updated the counters.

The same goes for writes: A return code of “success” doesn’t guarantee the write went where it was supposed to, only that the driver finished running without errors.

So in this very specific context, “success” doesn’t mean success. I need independent ways to verify the result, because the thing I’m testing is the thing that would normally be trusted.

Finding safe places to test: /proc/iomem

Before even thinking about reading or writing physical memory, I need to answer one critical question:

“Which parts of physical memory are safe to touch?”

If I just pick a random address and start writing, I could:

  • Overwrite the kernel’s own code,
  • Corrupt a driver’s I/O-mapped memory,
  • Trash ACPI tables that the system kernel depends on,
  • Or bring the whole machine down in spectacular fashion.

This is where /proc/iomem comes to the rescue. It’s a text file that maps out how the physical address space is currently being used. Each line describes a range of physical addresses and what they’re assigned to.

Here’s a small example:

00000000-00000fff : Reserved
00001000-0009ffff : System RAM
000a0000-000fffff : Reserved
  000a0000-000dffff : PCI Bus 0000:00
    000c0000-000ce5ff : Video ROM
  000f0000-000fffff : System ROM
00100000-09c3efff : System RAM
09c3f000-09ffffff : Reserved
0a000000-0a1fffff : System RAM
0a200000-0a20efff : ACPI Non-volatile Storage
0a20f000-0affffff : System RAM
0b000000-0b01ffff : Reserved
0b020000-b696efff : System RAM
b696f000-b696ffff : Reserved
b6970000-b88acfff : System RAM
b88ad000-b9ff0fff : Reserved
  b9fd0000-b9fd3fff : MSFT0101:00
    b9fd0000-b9fd3fff : MSFT0101:00
  b9fd4000-b9fd7fff : MSFT0101:00
    b9fd4000-b9fd7fff : MSFT0101:00
b9ff1000-ba150fff : ACPI Tables
ba151000-bbc0afff : ACPI Non-volatile Storage
bbc0b000-bcbfefff : Reserved
bcbff000-bdffffff : System RAM
be000000-bfffffff : Reserved

By parsing /proc/iomem, my test program can:

  1. Identify which physical regions are safe to work with (like RAM already allocated to my process),
  2. Avoid regions that are reserved for hardware or critical firmware,
  3. Adapt dynamically to different machines and configurations.

This is especially important for multi-architecture support. While examples here often look like x86 (because /dev/mem has a long history there), the concept of mapping I/O regions isn’t x86-specific. On ARM, RISC-V, or others, you’ll see different labels… But the principle remains exactly the same.

In short: /proc/iomem is your treasure map, and the first rule of treasure hunting is “don’t blow up the ship while digging for gold.”

The Problem of Contiguous Physical Pages

Up to this point, my work focused on single-page operations. I wasn’t hand-picking physical addresses or trying to be clever about where memory came from. Instead, the process was simple and safe:

  1. Allocate a buffer in userspace, using mmap() so it’s page-aligned,
  2. Touch the page to make sure the kernel really backs it with physical memory,
  3. Walk /proc/self/pagemap to trace which physical pages back the virtual address in the buffer.

This gives me full visibility into how my userspace memory maps to physical memory. Since the buffer was created through normal allocation, it’s mine to play with; there’s no risk of trampling over the kernel or other userspace processes.

This worked beautifully for basic tests:

  • Pick a single page in the buffer,
  • Run a tiny read/write cycle through /dev/mem,
  • Verify the result,
  • Nothing explodes.

But then came the next challenge: What if a read or write crosses a physical page boundary?

Why boundaries matter

The Linux VFS layer doesn’t treat a read or write syscall as one giant, indivisible action. Instead, it splits large operations into chunks, moving through pages one at a time.

For example:

  • I request 10 KB from /dev/mem,
  • The first 4 KB comes from physical page A,
  • The next 4 KB comes from physical page B,
  • The last 2 KB comes from physical page C.

If the driver mishandles the transition between pages, I’d never notice unless my test forces it to cross that boundary. It’s like testing a car by only driving in a straight line: Everything looks fine… Until you try to turn the wheel.

To properly test /dev/mem, I need a buffer backed by at least two physically contiguous pages. That way, a single read or write naturally crosses from one physical page into the next… exactly the kind of situation where subtle bugs might hide.

And that’s when the real nightmare began.

Why this is so difficult

At first, this seemed easy. I thought:

“How hard can it be? Just allocate a buffer big enough, like 128 KB, and somewhere inside it, there must be two contiguous physical pages.”

Ah, the sweet summer child optimism. The harsh truth: modern kernels actively work against this happening by accident. It’s not because the kernel hates me personally (though it sure felt like it). It’s because of its duty to prevent memory fragmentation.

When you call brk() or mmap(), the kernel:

  1. Uses a buddy allocator to manage blocks of physical pages,
  2. Actively spreads allocations apart to keep them tidy,
  3. Reserves contiguous ranges for things like hugepages or DMA.

From the kernel’s point of view:

  • This keeps the system stable,
  • Prevents large allocations from failing later,
  • And generally makes life good for everyone.

From my point of view? It’s like trying to find two matching socks in a dryer while it’s still spinning.

Playing the allocation lottery

My first approach was simple: keep trying until luck strikes.

  1. Allocate a 128 KB buffer,
  2. Walk /proc/self/pagemap to see where all pages landed physically,
  3. If no two contiguous pages are found, free it and try again.

Statistically, this should work eventually. In reality? After thousands of iterations, I’d still end up empty-handed. It felt like buying lottery tickets and never even winning a free one.

The kernel’s buddy allocator is very good at avoiding fragmentation. Two consecutive physical pages are far rarer than you’d think, and that’s by design.

Trying to confuse the allocator

Naturally, my next thought was:

“If the allocator is too clever, let’s mess with it!”

So I wrote a perturbation routine:

  • Allocate a pile of small blocks,
  • Touch them so they’re actually backed by physical pages,
  • Free them in random order to create “holes.”

The hope was to trick the allocator into giving me contiguous pages next time. The result? It sometimes worked, but unpredictably: 4,000 attempts gave me better than 80% success. Not reliable enough for a test suite, where a failure must mean a broken driver, not a grumpy kernel allocator.

The options I didn’t want

There are sure-fire ways to get contiguous pages:

  • Writing a kernel module and calling alloc_pages().
  • Using hugepages.
  • Configuring CMA regions at boot.

But all of these require special setup or kernel cooperation. My goal was a pure userspace test, so they were off the table.

A new perspective: software MMU

Finally, I relaxed my original requirement. Instead of demanding two pages that are both physically and virtually contiguous, I only needed them to be physically contiguous somewhere in the buffer.

From there, I could build a tiny software MMU:

  • Find a contiguous physical pair using /proc/self/pagemap,
  • Expose them through a simple linear interface,
  • Run the test as if they were virtually contiguous.

This doesn’t eliminate the challenge, but it makes it practical. No kernel hacks, no special boot setup, just a bit of clever user-space logic.

From Theory to Test Code

All this theory eventually turned into a real test tool, because staring at /proc/self/pagemap is fun… but only for a while. The test lives here:

github.com/alessandrocarminati/devmem_test

It’s currently packaged as a Buildroot module, which makes it easy to run on different kernels and architectures without messing up your main system. The long-term goal is to integrate it into the kernel’s selftests framework, so these checks can run as part of the regular Linux testing pipeline. For now, it’s a standalone sandbox where you can:

  • Experiment with /dev/mem safely (on a test machine!),
  • Play with /proc/self/pagemap and see how virtual pages map to physical memory,
  • Try out the software MMU idea without needing kernel modifications.

And expect it to still be a work in progress.

Thursday, September 11, 2025

When Kernel Comments Get Weird: The Tale of read_mem

As part of the ELISA community, we spend a good chunk of our time spelunking through the Linux kernel codebase. It’s like code archeology: you don’t always find treasure, but you do find lots of comments left behind by developers from the ’90s that make you go, “Wait… really?”

One of the ideas we’ve been chasing is to make kernel comments a bit smarter: not only human-readable, but also machine-readable. Imagine comments that could be turned into tests, so they’re always checked against reality. Less “code poetry from 1993”, more “living documentation”.

Speaking of code poetry, here’s one gem we stumbled across in mem.c:

/* The memory devices use the full 32/64 bits of the offset,
 * and so we cannot check against negative addresses: they are ok.
 * The return value is weird, though, in that case (0).
 */

This beauty has been hanging around since Linux 0.99.14… back when Bill Clinton was in his first year in the White House, “Mosaic” was the hot new browser, and PDP-11s were still being produced and sold.

Back then, it made sense, and reflected exactly what the code did.

Fast-forward thirty years, and the comment still kind of applies… but mostly in obscure corners of the architecture zoo. On the CPUs people actually use every day?

$ cat lseek.asm
BITS 64

%define SYS_read   0
%define SYS_write  1
%define SYS_open   2
%define SYS_lseek  8
%define SYS_exit   60

; flags
%define O_RDONLY 0
%define SEEK_SET 0

section .data
path: db "/dev/mem",0

section .bss
align 8
buf: resq 1

section .text
global _start

_start:
    mov rax, SYS_open
    lea rdi, [rel path]
    xor esi, esi
    xor edx, edx
    syscall
    mov r12, rax              ; save fd in r12

    mov rax, SYS_lseek
    mov rdi, r12
    mov rsi, 0x8000000000000001
    xor edx, edx
    syscall
    mov [rel buf], rax

    mov rax, SYS_write
    mov edi, 1
    lea rsi, [rel buf]
    mov edx, 8
    syscall

    mov rax, SYS_exit
    xor edi, edi
    syscall
$ nasm -f elf64 lseek.asm -o lseek.o
$ ld lseek.o -o lseek
$ sudo ./lseek | hexdump -C
00000000  01 00 00 00 00 00 00 80                           |........|
00000008
$ # this is not what I expect, let's double check
$ sudo gdb ./lseek
GNU gdb (Fedora Linux) 16.3-1.fc42
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./lseek...
(No debugging symbols found in ./lseek)
(gdb) b _start
Breakpoint 1 at 0x4000b0
(gdb) r
Starting program: /tmp/lseek

Breakpoint 1, 0x00000000004000b0 in _start ()
(gdb) x/30i $pc
=> 0x4000b0 <_start>:      mov    $0x2,%eax
   0x4000b5 <_start+5>:    lea    0xf44(%rip),%rdi        # 0x401000
   0x4000bc <_start+12>:   xor    %esi,%esi
   0x4000be <_start+14>:   xor    %edx,%edx
   0x4000c0 <_start+16>:   syscall
   0x4000c2 <_start+18>:   mov    %rax,%r12
   0x4000c5 <_start+21>:   mov    $0x8,%eax
   0x4000ca <_start+26>:   mov    %r12,%rdi
   0x4000cd <_start+29>:   movabs $0x8000000000000001,%rsi
   0x4000d7 <_start+39>:   xor    %edx,%edx
   0x4000d9 <_start+41>:   syscall
   0x4000db <_start+43>:   mov    %rax,0xf2e(%rip)        # 0x401010
   0x4000e2 <_start+50>:   mov    $0x1,%eax
   0x4000e7 <_start+55>:   mov    $0x1,%edi
   0x4000ec <_start+60>:   lea    0xf1d(%rip),%rsi        # 0x401010
   0x4000f3 <_start+67>:   mov    $0x8,%edx
   0x4000f8 <_start+72>:   syscall
   0x4000fa <_start+74>:   mov    $0x3c,%eax
   0x4000ff <_start+79>:   xor    %edi,%edi
   0x400101 <_start+81>:   syscall
   0x400103:               add    %al,(%rax)
   0x400105:               add    %al,(%rax)
   0x400107:               add    %al,(%rax)
   0x400109:               add    %al,(%rax)
   0x40010b:               add    %al,(%rax)
   0x40010d:               add    %al,(%rax)
   0x40010f:               add    %al,(%rax)
   0x400111:               add    %al,(%rax)
   0x400113:               add    %al,(%rax)
   0x400115:               add    %al,(%rax)
(gdb) b *0x4000c2
Breakpoint 2 at 0x4000c2
(gdb) b *0x4000db
Breakpoint 3 at 0x4000db
(gdb) c
Continuing.

Breakpoint 2, 0x00000000004000c2 in _start ()
(gdb) i r
rax            0x3                 3
rbx            0x0                 0
rcx            0x4000c2            4194498
rdx            0x0                 0
rsi            0x0                 0
rdi            0x401000            4198400
rbp            0x0                 0x0
rsp            0x7fffffffe3a0      0x7fffffffe3a0
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x246               582
r12            0x0                 0
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x4000c2            0x4000c2 <_start+18>
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
fs_base        0x0                 0
gs_base        0x0                 0
(gdb) # fd is just fine rax=3 as expected.
(gdb) c
Continuing.

Breakpoint 3, 0x00000000004000db in _start ()
(gdb) i r
rax            0x8000000000000001  -9223372036854775807
rbx            0x0                 0
rcx            0x4000db            4194523
rdx            0x0                 0
rsi            0x8000000000000001  -9223372036854775807
rdi            0x3                 3
rbp            0x0                 0x0
rsp            0x7fffffffe3a0      0x7fffffffe3a0
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x246               582
r12            0x3                 3
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x4000db            0x4000db <_start+43>
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
fs_base        0x0                 0
gs_base        0x0                 0
(gdb) # According to that comment, rax should have been 0, but it is not.
(gdb) c
Continuing.
�[Inferior 1 (process 186746) exited normally]
(gdb)

Not so much. Seeking to 0x8000000000000001 returns 0x8000000000000001, not 0 as the comment anticipates. We’re basically facing the kernel version of that “Under Construction” GIF on websites from the 90s: still there, but mostly just nostalgic decoration now.

The Mysterious Line in read_mem

Let’s zoom in on one particular bit of code in read_mem:

phys_addr_t p = *ppos;

/* ... other code ... */

if (p != *ppos)
	return 0;

At first glance, this looks like a no-op; why would p be different from *ppos when you just copied it? It’s like testing if gravity still works by dropping your phone… spoiler: it does.

But as usual with kernel code, the weirdness has a reason.

The Problem: Truncation on 32-bit Systems

Here’s what’s going on:

  • *ppos is a loff_t, which is a 64-bit signed integer.
  • p is a phys_addr_t, which holds a physical address.

On a 64-bit system, both are 64 bits wide. The assignment is clean and the comparison can never be true, so compilers just toss the check out.

But on a 32-bit system, phys_addr_t is only 32 bits. Assign a big 64-bit offset to it, and boom, the top half vanishes. Truncated, like your favorite TV series canceled after season 1.

That if (p != *ppos) check is the safety net. It spots when truncation happens and bails out early, instead of letting some unlucky app read from la-la land.

Assembly Time: 64-bit vs. 32-bit

On 64-bit builds (say, AArch64), the compiler optimizes away the check.

┌ 736: sym.read_mem (int64_t arg2, int64_t arg3, int64_t arg4);
│ `- args(x1, x2, x3) vars(13:sp[0x8..0x70])
│ 0x08000b10      1f2003d5       nop
│ 0x08000b14      1f2003d5       nop
│ 0x08000b18      3f2303d5       paciasp
│ 0x08000b1c      fd7bb9a9       stp x29, x30, [sp, -0x70]!
│ 0x08000b20      fd030091       mov x29, sp
│ 0x08000b24      f35301a9       stp x19, x20, [var_10h]
│ 0x08000b28      f40301aa       mov x20, x1
│ 0x08000b2c      f55b02a9       stp x21, x22, [var_20h]
│ 0x08000b30      f30302aa       mov x19, x2
│ 0x08000b34      750040f9       ldr x21, [x3]
│ 0x08000b38      e10302aa       mov x1, x2
│ 0x08000b3c      e33700f9       str x3, [var_68h]          ; phys_addr_t p = *ppos;
│ 0x08000b40      e00315aa       mov x0, x21
│ 0x08000b44      00000094       bl valid_phys_addr_range
│ ┌─< 0x08000b48  40150034       cbz w0, 0x8000df0          ; if (!valid_phys_addr_range(p, count))
│ │ 0x08000b4c    00000090       adrp x0, segment.ehdr
│ │ 0x08000b50    020082d2       mov x2, 0x1000
│ │ 0x08000b54    000040f9       ldr x0, [x0]
│ │ 0x08000b58    01988152       mov w1, 0xcc0
│ │ 0x08000b5c    f76303a9       stp x23, x24, [var_30h]
[...]

Nothing to see here, move along. But on 32-bit builds (like old-school i386), the check shows up loud and proud in the assembly.

┌ 392: sym.read_mem (int32_t arg_8h);
│ `- args(sp[0x4..0x4]) vars(5:sp[0x14..0x24])
│ 0x080003e0       55             push ebp
│ 0x080003e1       89e5           mov ebp, esp
│ 0x080003e3       57             push edi
│ 0x080003e4       56             push esi
│ 0x080003e5       53             push ebx
│ 0x080003e6       83ec14         sub esp, 0x14
│ 0x080003e9       8955f0         mov dword [var_10h], edx
│ 0x080003ec       8b5d08         mov ebx, dword [arg_8h]
│ 0x080003ef       c745ec0000..   mov dword [var_14h], 0
│ 0x080003f6       8b4304         mov eax, dword [ebx + 4]
│ 0x080003f9       8b33           mov esi, dword [ebx]      ; phys_addr_t p = *ppos;
│ 0x080003fb       85c0           test eax, eax
│ ┌─< 0x080003fd   7411           je 0x8000410              ; if (!valid_phys_addr_range(p, count))
│ ┌┌──> 0x080003ff 8b45ec         mov eax, dword [var_14h]
│ ╎╎│ 0x08000402   83c414         add esp, 0x14
│ ╎╎│ 0x08000405   5b             pop ebx
│ ╎╎│ 0x08000406   5e             pop esi
│ ╎╎│ 0x08000407   5f             pop edi
│ ╎╎│ 0x08000408   5d             pop ebp
│ ╎╎│ 0x08000409   c3             ret
[...]

The CPU literally does a compare-and-jump to enforce it. So yes, this is a real guard, not some leftover fluff.

Return Value Oddities

Now, here’s where things get even funnier. If the check fails in read_mem, the function returns 0. That’s “no bytes read”, which in file I/O land is totally fine.

But in the twin function write_mem, the same situation returns -EFAULT. That’s kernel-speak for “Nope, invalid address, stop poking me”.

So, reading from a bad address? You get a polite shrug. Writing to it? You get a slap on the wrist. Fair enough, writing garbage into memory is way more dangerous than failing to read it. Still, it smells like something that needs fixing up… But does it really?

Why does read_mem return 0 instead of an error?

This behavior comes straight from Unix I/O tradition.

In user space, tools like dd expect a read() call to return 0 to mean “end of file”. They loop until that happens and then exit cleanly.

Returning an error code instead would break that pattern and confuse programs that treat /dev/mem like a regular file. In other words, read_mem is playing nice with existing utilities: 0 here doesn’t mean “nothing went wrong”, it means “nothing left to read.”

Wrapping It Up

This little dive shows how a single “weird” line of code carries decades of context, architecture quirks, type definitions, and evolving assumptions. It also shows why comments like the one from 0.99.14 are dangerous: they freeze a moment in time, but reality keeps moving.

Our mission in the ELISA Architecture WG is to bring comments back to life: keep them up-to-date, tie them to tests, and make sure they still tell the truth. Because otherwise, thirty years later, we’re all squinting at a line saying “the return value is weird though” and wondering if the developer was talking about code… or just their day.

And now, a brief word from our sponsors (a.k.a. me in a different hat): When I’m not digging up ancient kernel comments with the Architecture WG, I’m also leading the Linux Features for Safety-Critical Systems (LFSCS) WG. We’re cooking up some pretty exciting stuff there too.

So if you enjoy the kind of archaeology/renovation work we’re doing there, come check out LFSCS as well: same Linux, different adventure.

Sunday, August 31, 2025

Confessions of a Nano User: Tabs, Spaces, and the Forbidden Love of OSC 52

“Hi, my name is Alessandro, and… I use nano.”

There. I said it. After years of quietly pressing Ctrl+X, answering “Yes” to save, and living in fear of the VIM and EMACS inquisitors, I’ve finally come out. While the big kids fight eternal holy wars over modal editing and Lisp extensibility, some of us took the small editor from GNU and just… got work done. Don’t judge.

But even within our tiny community of nano users, we are not free of pain. Our cross to bear is called… tabs.

The mystery: why my tabs turned into spaces

The story begins innocently: I opened a file in nano, full of perfectly fine tab characters. But then, the moment I dared to use my mouse to copy some text from the terminal window… BAM! My tabs were gone. Replaced by spaces.

It didn’t matter if I used KDE Konsole or GNOME Terminal, the effect was the same: mouse copy -> spaces. I was betrayed.

Meanwhile, if I ran cat file.txt and selected text with the mouse, the tabs survived. It was as if the gods of whitespace personally mocked me.

First hypothesis: nano must be guilty!

Naturally, my first instinct was to point fingers at nano itself. After all, nano has options like tabsize and tabstospaces. Maybe nano secretly converts tabs into spaces when rendering text? Maybe I’d been living a lie, editing “tabs” that were never really tabs?

I started investigating, even hex-dumping what nano sends to the terminal. I made a file containing only a tab and a space. What I found was not the expected 09 (tab) and 20 (space) bytes at all, but ANSI escape sequences like:

09 20                                                      # what you'd expect for <TAB><SPACE>

vs

1b 5b 33 38 3b 35 3b 32 30 38 6d 20 20 20 20 20 20 20 20  # The <TAB>
1b 5b 33 39 6d 1b 28 42 1b 5b 6d 20                       # The <SPACE>

That, dear reader, is ncurses at work.

The real culprit: ncurses, the decorator

Nano is innocent, it loves tabs just as much as I do. The real problem is ncurses, the library nano uses to paint text on the screen.

ncurses doesn’t just pass \t straight to the terminal. Instead, it calculates how many spaces are needed to reach the next tab stop and paints that many literal spaces, usually wrapped in shiny SGR sequences (color codes).

So when your terminal emulator builds its screen buffer, all it sees are decorated blanks. And when you drag your mouse to copy… guess what you get? Spaces. Always spaces.

Meanwhile, cat writes a literal \t to the terminal, and some emulators preserve that information for mouse copy (notably, VTE-based ones like GNOME Terminal used to, though mileage varies). That’s why cat behaves “correctly” and nano doesn’t.

So yes: the real villain in this love story is not nano, but ncurses… The overzealous decorator.

Escape plan: bypass the screen, go straight to clipboard

If the terminal screen can’t be trusted, we need another path. Enter: OSC 52.

OSC 52 is an ANSI escape sequence that lets a program say:

“Hey terminal, please put this base64-encoded text directly into the system clipboard.”

Example:

printf '\033]52;c;%s\a' "$(printf 'Hello\tWorld' | base64 -w0)"

Paste somewhere else -> boom, you get Hello<TAB>World.

This bypasses ncurses, bypasses the screen, bypasses mouse selection entirely. The text, tabs and all, travels straight into your clipboard.

Limitations: it’s not all sunshine and rainbows

  • Terminal support: Only terminals that implement OSC 52 can do this. xterm, iTerm2, Alacritty, recent Konsole are good. VTE-based terminals (GNOME Terminal, Tilix, etc.)… nope, they deliberately don’t support OSC 52 (for “security reasons”).
  • Buffer size: Many implementations cap OSC 52 payloads at ~100 KB. Big selections won’t copy entirely.
  • Security paranoia: Some distros disable it, since malicious programs could silently overwrite your clipboard. (But honestly, what’s worse: malware, or spaces where you wanted tabs?)

My dream: nano with native OSC 52 support

Right now, the only workarounds are… well, kind of clumsy:

  • Write the buffer (or a marked region) out to a pipe using ^O | osc52-copy.
  • Or just step outside nano entirely and run cat file | osc52-copy.

But there’s no way in nano today to say “when I press ^K, also shove this into the clipboard”. nano simply doesn’t have a hook for that.

That’s why my dream is to add a proper set osc52 option. With it enabled, nano would take whatever you cut (or marked) and send it straight to the terminal clipboard using OSC 52. Ideally, it would be optional: nobody wants to suddenly discover nano has hijacked their clipboard without asking. And then there’s the multitude of clipboards Linux users have to play with: system, primary, application-specific…

Epilogue

So here I stand, a proud but slightly broken nano user, with tabs that keep turning into spaces when I least expect it. I’ve learned the truth: it’s not nano’s fault, but ncurses. I’ve found salvation in OSC 52, though only if my terminal plays along.

And who knows, maybe one day there’ll be a tiny patch upstream, and nano will finally get to shout “COPY WITH TABS!” directly into our clipboards. Until then… I’ll keep refining my proposal to bring this OSC 52 goal closer.

It still lacks a way to make the feature optional, but for the time being it demonstrates at least one possible approach.

Stay tuned

diff --git a/src/cut.c b/src/cut.c
index a2d4aecf..c9f12d86 100644
--- a/src/cut.c
+++ b/src/cut.c
@@ -24,6 +24,7 @@
 #include <string.h>
 
+#define MAX_OSC52_BUF 65536
 /* Delete the character at the current position, and
  * add or update an undo item for the given action. */
 void expunge(undo_type action)
@@ -249,6 +250,54 @@ void chop_next_word(void)
 }
 #endif /* !NANO_TINY */
 
+void osc52(void) {
+	static const char b64[] =
+		"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
+
+	size_t cap = MAX_OSC52_BUF;
+	unsigned char *buf = malloc(cap);
+	linestruct *current = cutbuffer;
+	size_t pos = 0;
+
+	if (!buf)
+		return;
+
+	while (current != NULL) {
+		size_t l = strlen(current->data);
+
+		/* OSC 52 has a recommended length < 100k;
+		 * it feels appropriate to restrict it to 64k. */
+		if (pos + l + 1 > cap)
+			break;
+		memcpy(buf + pos, current->data, l);
+		pos += l;
+		if (current->next)
+			buf[pos++] = '\n';
+		current = current->next;
+	}
+
+	printf("\033]52;c;");
+
+	for (size_t i = 0; i < pos; i += 3) {
+		unsigned int octet_a = buf[i];
+		unsigned int octet_b = (i + 1) < pos ? buf[i + 1] : 0;
+		unsigned int octet_c = (i + 2) < pos ? buf[i + 2] : 0;
+		unsigned int triple = (octet_a << 16) | (octet_b << 8) | octet_c;
+
+		putchar(b64[(triple >> 18) & 0x3F]);
+		putchar(b64[(triple >> 12) & 0x3F]);
+		putchar((i + 1) < pos ? b64[(triple >> 6) & 0x3F] : '=');
+		putchar((i + 2) < pos ? b64[triple & 0x3F] : '=');
+	}
+
+	putchar('\a');
+	fflush(stdout);
+
+	free(buf);
+}
+
 /* Excise the text between the given two points and add it to the cutbuffer. */
 void extract_segment(linestruct *top, size_t top_x, linestruct *bot, size_t bot_x)
 {
@@ -365,6 +414,7 @@ void extract_segment(linestruct *top, size_t top_x, linestruct *bot, size_t bot_
 	/* If the text doesn't end with a newline, and it should, add one. */
 	if (!ISSET(NO_NEWLINES) && openfile->filebot->data[0] != '\0')
 		new_magicline();
+	osc52();
 }
 
 /* Meld the buffer that starts at topline into the current file buffer

UPDATE1: On Sep 1, 2025, I sent a regular patch to the nano maintainers; let's see what happens.

UPDATE2: After I sent the patch, Benno (nano’s maintainer) gently reminded me that I hadn’t done my homework well. Turns out I wrongly accused poor ncurses of tab treachery, but the real culprit is none other than nano itself. Yes, my beloved editor is the true tab betrayer! (See src/winioc.c:1872). Still, love is blind: I don’t care, I’m sticking with nano. At the end of the day, nano or ncurses, it doesn’t really matter: if you try to select and copy text from nano, you won’t get your tabs, so I still have reasons for submitting. And since my patch hasn’t been rejected yet, there’s still a chance OSC 52 can hit nano’s codebase.

Thursday, July 24, 2025

I Just Wanted the File List... Now I’m Neck-Deep in Makefiles

So You Want to Know All the Files Used in a Linux Kernel Build?

At some point, we all ask ourselves this question; it’s kind of a rite of passage:

“Hey, I just want to list all the files that go into building this particular Linux kernel configuration. How hard can it be?”

Ha. Hahaha.

You sweet summer child.

(credits: Game of Thrones Stark’s Old Nan to Bran S01E03)

Let me take you through my journey of trying to do exactly that: not just for a single object, but for the entire kernel build, based on a specific configuration. Spoiler: it sounds simple, but the rabbit hole goes deep.

My Original Genius Plan (That Didn’t Survive Reality)

I thought: “It’s just a matter of combining a few logical steps.”

  1. Use .config to know which options are enabled.
  2. Parse the Makefiles (Kconfig isn’t needed; the config is supposed to be static) to find out which .c files will actually be compiled.
  3. Recursively parse #include directives to find all needed headers.
  4. Done!

Except not.

That third point hides dragons! That’s where dreams go to die. Because parsing includes in the kernel is a nightmare wrapped in a macro inside a conditional.

Just look at this abomination:

#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)

What is TRACE_INCLUDE_FILE?

It’s a macro.

Where is it defined?

Somewhere else... Not in this file... Possibly not even in this compilation unit... Possibly on the moon.

Static parsing? Good luck.

You Need a Preprocessor, Not Just a Text Scanner

If you really want to know what files a .c file pulls in, you need to preprocess it with the compiler. Like, for real. Not pretend. You need the full gcc -E or -MMD treatment. And actually, here’s the twist, the Linux kernel already does that! The kernel’s build system uses gcc -MMD by default, which generates a .d file alongside each .o or sometimes .s file. This .d file lists all the headers used for that .c file during preprocessing.

Awesome, right?

Except…

The Case of the Disappearing .d Files

I excitedly looked for these .d files, expecting a gold mine, and found… nothing. They’re gone. Vanished. Turns out, the kernel build system generates them, uses them, and then deletes them. Like a ninja with commitment issues.

Why?

Because they’re only meant to feed into a utility called…

Enter fixdep and the .cmd Files

Once upon a time, the kernel used .d files directly. But those were replaced by .cmd files, which are more versatile and kernel-specific. Here’s how it works now:

  1. The compiler creates a .d file (thanks to -MMD).
  2. The kernel build system runs fixdep on that .d file.
  3. fixdep reads the dependencies, filters out junk (like system headers), adds references to CONFIG_* flags, and writes everything into a .cmd file.
  4. Then it deletes the .d file like it never existed.

The resulting .cmd file (e.g., net/core/.sock.o.cmd) contains:

  • The full compile command
  • The source file path
  • A list of directly included headers
  • Wildcards for include/config/CONFIG_... dependencies

And this is what make (via Kbuild) uses to decide when to rebuild something.

So Are .cmd Better Than .d Files?

Sort of. .cmd files are more than .d files, but also less.

Yes:

  • They’re smarter.
  • They track CONFIG_ conditionals.
  • They integrate cleanly with the kernel’s Makefile logic.

But:

  • They’re not recursive.
  • They don’t include deeply nested headers.
  • They only list direct includes.

So what happens when you change a deeply nested header?

Nothing. Absolutely nothing.

Unless the header was directly listed in the .cmd file, your object won’t rebuild.

Ask Me How I Know

I can’t count how many times I’ve changed a header, rebuilt, and then… nothing changed. No errors. No recompiled object. Just my confused face staring into the void.

Eventually I learned: just delete the .o file manually and try again.

And now, finally, I understand why: the change wasn’t tracked because the .cmd file didn’t know about the header I changed. Because fixdep didn’t know. Because it’s not recursive. Because… reasons.

What Are Your Options Then?

If your goal is to get a full list of files used in a Linux kernel build, here are your tools:

Method            Pros                                  Cons
.d files          Accurate, compiler-generated          Deleted after use
.cmd files        Tracked by Kbuild, includes CONFIG_   Incomplete, no nested headers
strace the build  Very complete                         Noisy, includes false positives
Static parsing    Fun in theory                         Hell in practice (TRACE_INCLUDE etc.)

So, Can I Use Dependency Files to List All Files Used in the Build?

Yes… But you have to be a bit clever about it.

You might think: “Hey, the kernel already uses -MMD, so .d files are created… I’ll just grab those!” Well… yes, they are created. But they’re also brutally deleted right after they’re used.

That rm command? It’s not a side effect or a compiler flag; it’s explicitly baked into the kernel’s build logic. I traced it to scripts/Kbuild.include, the logic used by the kernel’s makefiles. It’s part of this pattern:

cmd_and_fixdep =                                                             \
	$(cmd);                                                              \
	scripts/basic/fixdep $(depfile) $@ '$(make-cmd)' > $(dot-target).cmd;\
	rm -f $(depfile)

So it’s not that you forgot to save the .d files, it’s that the kernel build system deliberately throws them away like yesterday’s logs. Why? Because it only needs them briefly, to feed into the fixdep tool, which extracts top-level config header dependencies (like include/config/FOO) and embeds them into the .cmd files.

What Can Be Actually Done

What can be done, rather than rerun the entire kernel build with strace, or try to reverse-engineer header includes statically (good luck parsing around TRACE_INCLUDE)...

“Wait… The .cmd files have the full compiler commands!”

Exactly.

Here’s what to do:

  1. Search for all *.cmd files in the build output directory.
  2. Extract each compile command (it’s right there in the file).
  3. Manually rerun that compile, with -MMD still in place…
  4. but skip running fixdep and don’t delete the .d file.

This gives you full .d files, unmolested by cleanup. Now you have the real, compiler-generated dependency lists, with every nested header, accurate and complete, not the trimmed-down config-focused ones baked into .cmd.

Yes, it takes a bit of scripting and time. But it’s deterministic, reproducible, and lets you trace exactly what went into a particular kernel build, without hacking the build system or playing syscall whack-a-mole with strace.
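Here’s a sketch of that idea. The saved command sits in a `cmd_<target> := …` line inside each .cmd file (newer trees use a `savedcmd_` prefix instead, so the sed pattern accepts both; adapt as needed), and simply rerunning it regenerates the .d file, which nothing then deletes:

```shell
#!/bin/sh
# Rerun every saved compile command from the .cmd files, keeping the .d files.
# Intended to be run from the top of an already-built kernel tree.
find . -name '.*.cmd' | while read -r cmdfile; do
	# Extract the stored command: "cmd_foo.o := gcc ..." (or "savedcmd_...").
	cmd=$(sed -n 's/^\(saved\)\{0,1\}cmd_[^:]*:= //p' "$cmdfile" | head -n1)
	[ -n "$cmd" ] || continue
	eval "$cmd"	# -MMD is still in the command, so the .d file reappears...
			# ...and this time nobody runs fixdep or 'rm -f' on it.
done
```

After a pass like this, the tree is littered with pristine, compiler-generated .d files, ready to be harvested.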

And this is what I’m going to implement, after an unfair fight with makefiles and friends!

Friday, June 27, 2025

Logging Shell Commands in BusyBox? Yes, You Can Now

In an earlier post, I showed how to log every shell command to a remote server using PROMPT_COMMAND in bash. It’s a neat trick, especially if you’re interested in session tracking, auditing, or just being a responsible sysadmin. But if you’re using BusyBox as your shell, and a lot of embedded systems and network devices do, you’ll quickly find out that this trick doesn’t work at all. So, what gives?

Bash Has PROMPT_COMMAND, BusyBox Doesn’t

If you’re used to bash, you know PROMPT_COMMAND is a handy little feature: before each command runs, the shell will execute whatever you put in that variable. Perfect spot for a logging hook.

But BusyBox’s shell applet, ash, doesn’t have PROMPT_COMMAND. Not even a secret version of it hiding somewhere. So if you try to replicate this kind of behavior in BusyBox, the shell just stares blankly at you and does… nothing.

This isn’t a bug or a missing package. BusyBox ash is designed to be small, fast, and simple; which means it skips the bells and whistles you’d find in a full shell like bash.

But Wait... Is This Really Useful?

Yes, actually. This isn’t just a case of “because I can.”

Think about it: a lot of network gear these days runs Linux under the hood: firewalls, routers, switches, access points, you name it. Many of them use BusyBox because it’s lightweight and gets the job done.

But in those environments, auditing is not optional. People managing these devices often come from networking backgrounds where things like TACACS+ command logging have been standard for decades. They expect that every command typed into a shell is logged somewhere, preferably off the device.

So this isn’t just a fun weekend hack. It’s a small but important feature that brings BusyBox a little closer to what these environments need.

A Peek Under the Hood: Why getenv() Doesn’t Work

Let’s say you want to control this feature using environment variables, like this:

export LOG_RHOST=10.0.0.1
export LOG_RPORT=5555
export SESSIONID_=abc123

Then, in your shell code, you try to do something like:

const char *host = getenv("LOG_RHOST");

And you get… NULL.

Why? Well, here’s the fun part: BusyBox ash keeps its own private stash of environment variables, separate from the system-wide environment you get with getenv(). It only syncs them up when launching child processes... not inside the shell itself. At least, that’s what I read somewhere, in the code or maybe in a commit message... To be fair, I haven’t verified this part myself.

So the variable is set, exported, and even shows up when you run set, but your getenv() call is talking to the wrong guy.

To fix this, you need to use BusyBox’s internal API:

const char *host = lookupvar("LOG_RHOST");

This asks the shell directly, and it gives you the right answer.

Lesson learned: if you’re inside BusyBox ash, use the tools it gives you, not the standard C library ones.
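From the outside, you can at least watch the child-process half of that syncing with any POSIX shell (I’m using plain sh here, not busybox ash, so this only illustrates the general export-to-children behavior):

```shell
# Children only see variables the shell has exported into their environ;
# a bare assignment stays internal to the shell.
sh -c 'FOO=bar; env | grep -c "^FOO="'          # prints 0: set, not exported
sh -c 'export FOO=bar; env | grep -c "^FOO="'   # prints 1: child sees it
```

The shell clearly knows about FOO in both cases (try `set` instead of `env`); what differs is what gets copied into the child’s environment.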

The Trick: Dependency Injection (Just Like the Cool Parts of BusyBox)

Here’s what I did:
Define a function pointer type in libbb:

typedef const char* (*injected_var_lookup_t)(const char *name);

Declare a static variable to hold the injected function:

static injected_var_lookup_t injected_lookup_var;

Add a public setter function to libbb:

void injected_set_var_lookup(injected_var_lookup_t func)
{
	injected_lookup_var = func;
}

In ash.c, during initialization, pass in the actual lookupvar() function. Now, anywhere in libbb, I can safely call:

const char *val = injected_lookup_var("LOG_RHOST");

And it’ll work because, at runtime, it’s calling the real lookupvar() function from ash. This pattern keeps the code clean, avoids nasty cross-module hacks, and is totally in line with how BusyBox handles similar internal wiring elsewhere. I thought I was a genius for inventing this, until I realized BusyBox already did the same thing in math.c. So… maybe I’m just a very good copy-paste engineer.

What This Patch Adds

So I wrote a patch for BusyBox that brings this remote command logging feature to life. Here’s what it does:

  • Watches each command entered in the shell.
  • If the logging environment variables are set (LOG_RHOST, LOG_RPORT, SESSIONID_), it sends the command over TCP to the remote server.
  • It includes the session ID in the log line, so you can trace which session ran what.

That’s it. Lightweight, optional, and it doesn’t get in the way if you’re not using it.

The Patch

Here’s the code that makes it happen: it applies nicely to busybox-1.37.0

From 5b663cd894f6418673686290248fe8776af2434d Mon Sep 17 00:00:00 2001
From: Alessandro Carminati <alessandro.carminati@gmail.com>
Date: Fri, 27 Jun 2025 14:05:21 +0200
Subject: [PATCH] ash: add support for logging executed commands to a remote server
Content-type: text/plain

This commit adds functionality to the ash shell that sends each executed
command to a remote logging server over TCP, enabling remote auditing and
session tracking.

The design is inspired by the tacacs2 approach used in network devices.
This is particularly useful in embedded Linux environments replacing
traditional routers, where audit trails are essential.

Unlike bash, ash does not support PROMPT_COMMAND. This implementation
fills that gap using internal hooks in the shell.

The feature is controlled via three environment variables:
- SESSIONID_ : unique identifier for the shell session
- LOG_RHOST  : remote log server hostname or IP address
- LOG_RPORT  : remote log server TCP port

When these variables are set, each command entered is sent to the
specified logging server, prepended with the session ID.

This enhancement is lightweight and optional, and does not impact users
who do not configure the environment variables.

Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
---
 include/libbb.h       |   7 +++
 libbb/Config.src      |  10 ++++
 libbb/Kbuild.src      |   1 +
 libbb/lineedit.c      |   3 ++
 libbb/loggers_utils.c | 114 ++++++++++++++++++++++++++++++++++++++++++
 shell/ash.c           |   3 ++
 6 files changed, 138 insertions(+)
 create mode 100644 libbb/loggers_utils.c

diff --git a/include/libbb.h b/include/libbb.h
index 01cdb1b..870b9f5 100644
--- a/include/libbb.h
+++ b/include/libbb.h
@@ -2003,6 +2003,9 @@ void free_line_input_t(line_input_t *n) FAST_FUNC;
 #else
 # define free_line_input_t(n) free(n)
 #endif
+# if ENABLE_FEATURE_SEND_COMMAND_REMOTE
+void loggers_utils_set_var_lookup(void *func);
+# endif
 
 /*
  * maxsize must be >= 2.
  * Returns:
@@ -2133,6 +2136,10 @@ enum {
 	PSSCAN_RUIDGID = (1 << 21) * ENABLE_FEATURE_PS_ADDITIONAL_COLUMNS,
 	PSSCAN_TASKS = (1 << 22) * ENABLE_FEATURE_SHOW_THREADS,
 };
+# if ENABLE_FEATURE_SEND_COMMAND_REMOTE
+int rlog_this(const char *history_itm);
+# endif
+
 //procps_status_t* alloc_procps_scan(void) FAST_FUNC;
 void free_procps_scan(procps_status_t* sp) FAST_FUNC;
 procps_status_t* procps_scan(procps_status_t* sp, int flags) FAST_FUNC;
diff --git a/libbb/Config.src b/libbb/Config.src
index b980f19..a6f5882 100644
--- a/libbb/Config.src
+++ b/libbb/Config.src
@@ -202,6 +202,16 @@ config FEATURE_EDITING_SAVE_ON_EXIT
 	help
 	Save history on shell exit, not after every command.
 
+config FEATURE_SEND_COMMAND_REMOTE
+	bool "Send last command to remote logger for audit"
+	default n
+	depends on FEATURE_EDITING_SAVEHISTORY
+	help
+	Send last command to remote logger for audit.
+	It is mandatory that LOG_RHOST and LOG_RPORT environment variables
+	are defined to specify the remote ip and port where to send logs.
+	It also needs the environment SESSIONID_ to be defined as sessionid.
+
 config FEATURE_REVERSE_SEARCH
 	bool "Reverse history search"
 	default y
diff --git a/libbb/Kbuild.src b/libbb/Kbuild.src
index cb8d2c2..096a9f3 100644
--- a/libbb/Kbuild.src
+++ b/libbb/Kbuild.src
@@ -208,3 +208,4 @@ lib-$(CONFIG_FEATURE_CUT_REGEX) += xregcomp.o
 # Add the experimental logging functionality, only used by zcip
 lib-$(CONFIG_ZCIP) += logenv.o
+lib-$(CONFIG_FEATURE_SEND_COMMAND_REMOTE) += loggers_utils.o
diff --git a/libbb/lineedit.c b/libbb/lineedit.c
index 543a3f1..8140f00 100644
--- a/libbb/lineedit.c
+++ b/libbb/lineedit.c
@@ -1685,6 +1685,9 @@ static void remember_in_history(char *str)
 	/* i <= state->max_history-1 */
 	state->history[i++] = xstrdup(str);
 	/* i <= state->max_history */
+# if ENABLE_FEATURE_SEND_COMMAND_REMOTE
+	rlog_this(state->history[i-1]);
+# endif
 	state->cur_history = i;
 	state->cnt_history = i;
 # if ENABLE_FEATURE_EDITING_SAVEHISTORY && !ENABLE_FEATURE_EDITING_SAVE_ON_EXIT
diff --git a/libbb/loggers_utils.c b/libbb/loggers_utils.c
new file mode 100644
index 0000000..d1266e8
--- /dev/null
+++ b/libbb/loggers_utils.c
@@ -0,0 +1,114 @@
+/*
+ * This code allows remote logging of the commands.
+ *
+ * Copyright (c) 2025 Alessandro Carminati <acarmina@redhat.com>
+ *
+ * Licensed under GPLv2 or later, see file LICENSE in this source tree.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <netdb.h>
+#include <sys/socket.h>
+
+#define SESSION_ID_ENV "SESSIONID_"
+#define SESSION_LEN 9
+#define SESSION_RHOST "LOG_RHOST"
+#define SESSION_RPORT "LOG_RPORT"
+
+typedef const char* (*loggers_utils_var_lookup_t)(const char *name);
+
+void get_timestamp(char *, size_t);
+int send_log(const char *, const char *, const char *);
+int rlog_this(const char *);
+void loggers_utils_set_var_lookup(void *func);
+
+static loggers_utils_var_lookup_t loggers_utils_lookup_var;
+
+void loggers_utils_set_var_lookup(void *func) {
+	loggers_utils_lookup_var = (loggers_utils_var_lookup_t) func;
+}
+
+void get_timestamp(char *buf, size_t len) {
+	time_t now = time(NULL);
+	struct tm *tm_info = localtime(&now);
+	strftime(buf, len, "%Y%m%d.%H%M%S", tm_info);
+}
+
+int send_log(const char *line, const char *host, const char *port_str) {
+	int sockfd;
+	struct addrinfo hints, *res, *p;
+
+	memset(&hints, 0, sizeof(hints));
+	hints.ai_family = AF_UNSPEC;
+	hints.ai_socktype = SOCK_STREAM;
+
+	if (getaddrinfo(host, port_str, &hints, &res) != 0) {
+		fprintf(stderr, "send_log: can't resolve host in %s\n",
+			SESSION_RHOST);
+		return -1;
+	}
+
+	for (p = res; p != NULL; p = p->ai_next) {
+		sockfd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
+		if (sockfd < 0) continue;
+		if (connect(sockfd, p->ai_addr, p->ai_addrlen) == 0) break;
+		close(sockfd);
+	}
+
+	if (p == NULL) {
+		fprintf(stderr, "send_log: Unable to connect to %s:%s\n",
+			host, port_str);
+		freeaddrinfo(res);
+		return -1;
+	}
+
+	ssize_t len = strlen(line);
+	if (send(sockfd, line, len, 0) != len) {
+		fprintf(stderr, "send_log: Unable to send data\n");
+		close(sockfd);
+		freeaddrinfo(res);
+		return -1;
+	}
+
+	close(sockfd);
+	freeaddrinfo(res);
+	return 0;
+}
+
+int rlog_this(const char *history_itm) {
+	char timestamp[32], hostname[64];
+	const char *sess_id, *r_ip, *r_port;
+	char logline[1500];
+
+	if (!loggers_utils_lookup_var) return -1;
+
+	sess_id = loggers_utils_lookup_var(SESSION_ID_ENV);
+	if (!sess_id || (strlen(sess_id) > 9)) return 1;
+
+	r_ip = loggers_utils_lookup_var(SESSION_RHOST);
+	if (!r_ip) return -1;
+
+	r_port = loggers_utils_lookup_var(SESSION_RPORT);
+	if (!r_port) return -1;
+
+	if (!atoi(r_port)) return -1;
+
+	get_timestamp(timestamp, sizeof(timestamp));
+	gethostname(hostname, sizeof(hostname));
+
+	snprintf(logline, sizeof(logline), "%s - %s - %s > %s\n",
+		timestamp, sess_id, hostname, history_itm);
+
+	if (send_log(logline, r_ip, r_port) != 0) {
+		fprintf(stderr, "rlog_this: can't send log to remote.\n");
+		return -2;
+	}
+
+	return 0;
+}
diff --git a/shell/ash.c b/shell/ash.c
index bbd7307..e021def 100644
--- a/shell/ash.c
+++ b/shell/ash.c
@@ -9780,6 +9780,9 @@ setinteractive(int on)
 		did_banner = 1;
 	}
 #endif
+# if ENABLE_FEATURE_SEND_COMMAND_REMOTE
+	loggers_utils_set_var_lookup(&lookupvar);
+# endif
 #if ENABLE_FEATURE_EDITING
 	if (!line_input_state) {
 		line_input_state = new_line_input_t(FOR_SHELL | WITH_PATH_LOOKUP);
--
2.25.1

BusyBox contribution

Once you’ve developed a new feature for BusyBox, contributing your patch is a straightforward process, but there are a few nuances compared to projects like the Linux kernel. To submit your code, you must first subscribe to the BusyBox mailing list, as only subscribers can post. The list is publicly archived, so you can review past discussions for context, but posting requires a subscription. Unlike the Linux kernel, BusyBox doesn’t maintain a comprehensive MAINTAINERS file; instead, Denys Vlasenko is the main project maintainer, and individual applet maintainers (if any) are usually listed in the relevant source files. When sending your patch, address it to the mailing list and, if your change affects a specific applet, CC the maintainer named in the file’s header.

For new features, it’s best practice to first float your idea on the mailing list for feedback before investing significant development effort, unless you really want the feature. This informal discussion helps gauge interest and avoid duplicating work. Once your patch is ready, ensure it’s well-tested, follows the project’s coding style, and focuses on a single logical change. If your submission doesn’t get a response, a polite follow-up or using the BusyBox Bug and Patch Tracking System is recommended.
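For completeness, the mechanical part of the submission looks like this (the list address is the real one; subject and file names are illustrative):

```shell
# Turn your topmost commit into a mail-ready patch file...
git format-patch -1
# ...and send it to the BusyBox mailing list
# (requires a configured git send-email, and being subscribed to post).
git send-email --to=busybox@busybox.net 0001-*.patch
```

If your change touches a specific applet, add a --cc for any maintainer named in that file’s header.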

Sunday, May 18, 2025

Inline Trouble: Convincing the Compiler to Leave My Functions Alone

If you’ve been following my eternal battle with compilers, especially in the Linux kernel, you already know my hobbies: debugging things that work, fighting compiler optimizations, and getting emotional about function addresses. This post is no different… Except this time, the compiler and I are arguing about inlining.

Now, inlining itself isn’t evil. It’s actually a performance gift. A function call costs a few CPU cycles, so compilers try to be helpful and say, “Hey, what if I just paste the code right here instead of calling it?” And for most of the world, that’s great. Less overhead, faster code, happy life.

But when you’re working on a giant codebase like the Linux kernel, and you want to know where exactly something happened at runtime, inlining quickly turns from optimization to obfuscation.


The Problem: Who Called Me?

Imagine you’re adding a feature to the kernel to suppress certain error messages… Selectively. Maybe a BUG() or WARN() is firing, but you don’t want to see the report if it comes from that function. You want something like:

dont_complain_if_warn_is_called_from("driver_init_fn");

Clean and elegant, right?

The kernel offers kallsyms, which can map an instruction pointer back to a function name. Perfect! So, all I need is to grab the address of the code where WARN() is called, pass it to kallsyms, and boom … I know the function name.

But here’s the catch: the original function where BUG() and WARN() were called might be inlined into another, and now kallsyms for that address return the parent’s name. The obvious solution would be to mark all the functions where BUG() and WARN() exist.

Now, BUG() and WARN() are macros. They can be used anywhere: from deep in scheduler code to a random USB driver. While you technically can go and modifying every function that ever calls them, and just to slap a __attribute__ ((noinline)) on top, this is anything but practical to maintain. In a constantly changing codebase like the Linux kernel, requiring everyone who adds a WARN() to ensure the function is not inlined is like asking C to be friendly with undefined behavior... you're just asking for trouble.


The Compiler is Too Smart

Compilers like GCC and Clang use heuristics to decide whether to inline a function.

Some of those include:

  • How big is the function?
  • Is it called only once?
  • Is it marked inline? (Spoiler: that’s just a suggestion.)

They’re allowed to ignore your hints. Even if you write inline, the compiler might say, “Nah, I see this function better as standalone”. In a certain sense, this also means there’s nothing preventing it from inlining functions where no preference is expressed at all.

Which, in my case, breaks everything… Because now the call to WARN() has been absorbed into another function, and its address points to that function (former parent), not the one I was trying to filter.

Because of that, it is not reasonable to mark every caller as noinline. A cleverer way is needed.


My First Attempt: Macros and Label Addresses

The idea was simple: create a local label using &&label (a GNU extension that gives the address of a label), make it static so it sticks, and pass that pointer to a dummy function. Here’s what it looked like:

#define THISLABEL(i)     THISLABEL1(i, __LINE__)
#define THISLABEL1(i, l) THISLABEL2(i, l)
#define THISLABEL2(i, l) THISLABEL_##i##_##l

#define BLOCK_INLINE() \
	do { \
		static void *volatile p = &&THISLABEL(0); \
THISLABEL(0): \
		use_pointer(p); \
	} while (0)

#define use_pointer(x) ((void)(x))

I thought: surely this will keep the compiler from inlining. And indeed, GCC seemed to take the hint.

But I noticed that just placing the label after use_pointer(p) could change the results. In GCC, it still sort of worked (because GCC loves a good shrug, I guess), but Clang had different plans. Things broke subtly… and one function in my test bed of two functions got inlined.

Placing the label before the pointer usage and wrapping everything in a macro with a static volatile pointer, as shown in the code snippet, I got consistent results across both compilers… But…

But even then, I felt uneasy. This solution felt fragile. I was relying on undefined quirks and hoping compilers didn’t change their mind in a slightly different scenario.


Sneaky Anti-Inlining Techniques

I did some digging (read: asking smart people), and found a list of things that scare GCC and Clang enough to not inline a function.

Here’s a list, probably not exhaustive:

  • Use alloca() (dynamic stack allocation)
  • Call setjmp() or similar
  • Use va_arg/va_end (variadic function tricks)
  • Take the address of a label (&&label) — a GNU extension
  • Use computed gotos (goto *ptr)
  • Declare a variable whose value the compiler can’t predict

Any of these can make the compiler decide, “This is too weird. I’m not touching it.”

Enter: The Function-Preserving Incantation

The same smart person suggested a compact GNU statement expression that uses computed gotos, labels, and inline assembly to form what I now think of as an anti-inlining trick:

#define BLOCK_INLINE() ({ \
	__label__ lab; \
	volatile int never_true = 0; \
lab:; \
	static void *p = &&lab; \
	__asm volatile ("" : "+m" (p)); \
	if (never_true) goto *p; \
})

Let’s unpack this masterpiece:

  • __label__ lab; lab: declares a local label inside a block.
  • &&lab gets the address of that label, a GCC extension.
  • The address is stored in a static pointer p, marked volatile, so the compiler can’t just delete it.
  • __asm__ volatile ("" : "+m" (p)); is an empty inline assembly trick that tells the compiler, “This memory might be touched, don’t optimize too much.”
  • Finally, the conditional goto *p is wrapped in a fake if (never_true): enough to keep things technically live, without actually running anything.

This construct scared both GCC and Clang enough to not inline the function it appeared in, without needing any attributes, annotations, or source rewrites.

({ ... }) – GNU Statement Expression - Something I didn’t know about

  • GNU extension, not standard C/C++
  • Treats a block of code as an expression, and it returns a value
  • Only works with GCC and Clang (not MSVC… I think I can live with this)

Example:

int x = ({ int a = 3; int b = 4; a + b; }); // x = 7
  • The block runs, and the value of the last expression (a + b) is returned.
  • This allows you to write macros or inlined logic that act like expressions, not just code blocks.

And if you are wondering, as I did, if it is equivalent to do { ... } while(0), it is not. Here’s a quick comparison cheat sheet:

Feature              ({ ... })               do { ... } while (0)
Standard C           No (GNU extension)      Yes
Returns a value      Yes (last expression)   No
Used in expressions  Yes                     No (statement-only)
Scope control        Local block             Local block
Portability          GCC/Clang only          Universal
Common in macros     Yes (GNU/Linux kernel)  Yes (portable projects)

({ ... }) is neat... and something I wasn’t aware of before. The original suggestion includes it, but since I don’t need the return value functionality in my case, I think I’ll replace it with the more portable do { ... } while (0).

#define BLOCK_INLINE() do { \
	__label__ lab; \
	volatile int never_true = 0; \
lab:; \
	static void *p = &&lab; \
	__asm volatile ("" : "+m" (p)); \
	if (never_true) goto *p; \
} while (0)

Conclusion: Please Don’t Try This at Home (Unless You Must)

Inlining is great… Until it is not…. If you simply need a function not to be inlined, the right thing to do is usually to use the tools your compiler provides: attributes. That’s clean, documented, and supported.

However, there are situations, like mine, where touching the function definition isn’t practical or even possible. In the Linux kernel, when WARN() or BUG() can appear virtually anywhere, and you need to reason about the call site without rewriting the world, you’re left with fewer options.

In those edge cases, using obscure tricks like computed gotos or label addresses might just be the only way to preserve the structure you need… Even if it means making the compiler a little nervous.

Just be aware: this kind of code is fragile, deeply non-portable, and should come with a warning label. It’s not a pattern… it’s a workaround.

Because let’s face it: compilers are smart, but weird code is forever.

Monday, April 28, 2025

When Wi-Fi Says "No" and Your Serial Says "Maybe"

Ever been stuck with an embedded board that looks like a spaceship but behaves like a potato?
No drivers, no network, no SSH, no hope. Just you and a lonely serial cable whispering bits of despair.
That’s the moment you wish you could throw a file at it... maybe a fresh build of busybox that finally has that utility you desperately need but isn’t available on the current system.

Enter: the send_console.
A gloriously simple hack to push a file over the same serial line you’re using to yell at your misbehaving device... because sometimes, network connectivity simply isn’t an option.

Let’s dive into what it is, why it matters, and why you’ll love it (or at least curse it slightly less than the alternative).


The Art of Smuggling Files Over Serial (Or: “How to Befriend Your Terminal”)

Let’s start with some very real pain points:

  • You’re enabling a brand-new platform. It just finished booting for the first time, half the drivers are still missing, and network connectivity is but a distant dream.
  • You’re working with a device controlled by a sidekick board: you get in through the back door, and the device’s network isn’t supposed to be reachable from outside.
  • You’re hacking an embedded device. You finally found the serial pins hidden on the board and connected to them, but you have no tool to upload a file to it.

In all these cases, you really need to get a file onto that machine to move forward, but there’s no standard way to do it.

In theory, serial lines were made for this... In practice? Not anymore.

In the good old days, we had XMODEM and ZMODEM.
Cute protocols that handled file transfers over serial lines... Today?
You can’t count on them being installed. Often, you’re lucky if you even have a working cat.

This brings us to the real star of the show: how terminals actually work.


Terminal Magic: Why You Can’t Just “Send the Bytes, Bro”

Imagine your terminal is a grumpy customs officer. You hand it a packet (a byte), but before it lets it through, it:

  • Checks for forbidden characters,
  • Buffers a bunch of data,
  • Sometimes “fixes” what it thinks you meant to send.

Terminals are not dumb pipes: they enforce line discipline, apply buffering, and often modify the data you send.

Enter stty, our magic wand.
It tells the terminal to behave properly: disable echo, set raw mode, turn off processing.

Why stty?

  • It’s incredibly portable.
  • It’s available almost everywhere: full Linux systems, BusyBox minimal environments, you name it. Hint: stty is a mandatory utility in POSIX.

The other companion is stdbuf, which deals with buffering:
Serial interfaces typically buffer input and output for efficiency, but this can ruin a clean file transfer if not managed.
stdbuf comes to the rescue, forcing applications like cat to flush data as it’s processed.
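As a hedged sketch of the host-side preparation (not the tool’s actual code): a regular file stands in for the serial device below, so the snippet runs anywhere without hardware.

```shell
# DEV would be something like /dev/ttyUSB0 in real life; a plain file
# stands in here so the sketch can be run without a serial port.
DEV=./fake-serial

# On a real tty you would first disable all terminal processing:
#   stty -F "$DEV" 115200 raw -echo
# (raw mode: no line discipline, no echo, no character translation)

# Then push the bytes with output buffering disabled, so each chunk
# leaves immediately instead of sitting in a userspace buffer.
printf 'hello target\n' | stdbuf -o0 cat > "$DEV"

cat "$DEV"
```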


send_console-ng: “Now With 50% More Reliability!”

Before send_console-ng, there was… well, just send_console.
A brave little utility with a lot of heart... and a lot of problems.

The first version tried to be smart:
It sent commands and checked their echoes to validate that everything arrived correctly.
Sounds reasonable, right?

Except…
In a real embedded system, kernel messages can appear at any moment.
Random printk noise: “driver XYZ initialized”, “thermal event detected”, “welcome to dmesg hell”, bursting onto your serial line like popcorn on a campfire.

This meant that even if your command was properly echoed, the echo might get interrupted mid-flight by a random kernel log.

In a heroic attempt to fight this chaos, I even crafted a function designed to validate echoes, despite random insertions:


func findFragmented(input, target string) [][2]int {
	inputRunes := []rune(input)
	targetRunes := []rune(target)
	if len(target) == 0 {
		return [][2]int{}
	}
	startPositions := []int{}
	for i, r := range inputRunes {
		if r == targetRunes[0] {
			startPositions = append(startPositions, i)
		}
	}
	newlineMap := map[int]struct{}{}
	for i, r := range inputRunes {
		if r == '\n' {
			newlineMap[i] = struct{}{}
		}
	}
	targetMap := make(map[rune][]int)
	for i, r := range inputRunes {
		targetMap[r] = append(targetMap[r], i)
	}
	var results [][2]int
	for _, start := range startPositions {
		if ok, path := matchFragmentedRecursive(
			inputRunes, targetRunes, targetMap, newlineMap,
			1, start, []int{start}); ok {
			end := path[len(path)-1] + 1
			results = append(results, [2]int{start, end})
		}
	}
	return results
}

func matchFragmentedRecursive(inputRunes, targetRunes []rune,
	targetMap map[rune][]int, newlineMap map[int]struct{},
	tIdx, currPos int, path []int,
) (bool, []int) {
	if tIdx == len(targetRunes) {
		return true, path
	}
	need := targetRunes[tIdx]
	validNext := map[int]struct{}{
		currPos + 1: {},
	}
	for nl := range newlineMap {
		if nl > currPos && nl+1 < len(inputRunes) {
			validNext[nl+1] = struct{}{}
		}
	}
	for _, pos := range targetMap[need] {
		if pos <= currPos {
			continue
		}
		if _, ok := validNext[pos]; !ok {
			continue
		}
		if ok, matched := matchFragmentedRecursive(
			inputRunes, targetRunes, targetMap, newlineMap,
			tIdx+1, pos, append(path, pos),
		); ok {
			return true, matched
		}
	}
	return false, nil
}

The function was smart. It could piece together broken echoes.
It battled bravely.
But in the end, I realized I was a modern Don Quixote, fighting not windmills but the terminal’s text-wrapping feature, dmesg notifications, ANSI escape codes, and all their friends.

send_console-ng takes a much saner approach:
Instead of dancing with echoes, it takes control of the terminal settings on the remote side.
It flips the terminal into raw mode with stty, disables buffering shenanigans, and then simply blasts the file through cleanly.

On the host side, the command is beautifully simple:

send_console-ng -b 115200 -f busybox -d /dev/ttyUSB1

  • -b sets the baudrate,
  • -f specifies the file to send,
  • -d picks the serial device.

Behind the scenes, send_console-ng:

  • Compresses the file with gzip,
  • Encodes it safely with base64,
  • Manages terminal settings with stty,
  • Forces unbuffered transmission with stdbuf,
  • Handles the decoding on the remote side.
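Conceptually, the whole round trip can be reproduced with plain shell. This is a hypothetical sketch of the idea, not the tool’s actual implementation, using a regular file in place of the serial line:

```shell
# Create a payload containing bytes that would never survive a cooked tty.
printf 'binary\000payload\n' > original.bin

# Host side: compress, then ASCII-armor, so only "safe" characters
# ever travel over the line.
gzip -c original.bin | base64 > wire.txt

# Target side (what the remote shell would be told to run):
base64 -d wire.txt | gunzip > received.bin

cmp -s original.bin received.bin && echo "round-trip OK"
```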

Disclaimer

send_console-ng is still immature. I'm not sure it will ever mature, but it's certainly not there yet. For instance, when it reaches a terminal, it starts by testing the commands... However, it's currently incapable of determining whether the terminal it's talking to is the right one: if it reaches a login prompt, or worse, a gdb session, it will most likely end in an error... The worst case?!? I don't even want to think about it.

So, if I've written my regexes correctly, it might just throw an error, or maybe not...


Some Realistic Limits (And Why I’m Not Crying About It Yet)

Reality check!
This tool depends on the presence of:

  • stty
  • gzip
  • base64
  • cat
  • stdbuf

If any of these are missing on the host or target, the tool simply won’t work.

Could it be made to work without them? Perhaps. Here are some ideas on how it could survive more hostile environments.

For stty: if it’s missing, the needed functionality boils down to just two ioctl calls (TCGETS and TCSETS).
A small, architecture-dependent assembly utility could easily fill that gap.

Here's an example that could work on x86_64 machines:

.section .bss
.lcomm termios_buf, 64

.section .text
.global _start
_start:
    mov $16, %rax                    # SYS_IOCTL
    mov $0, %rdi                     # stdin (fd=0)
    mov $0x5401, %rsi                # TCGETS
    lea termios_buf(%rip), %rdx
    syscall

    lea termios_buf(%rip), %rbx
    add $12, %rbx                    # offset of c_lflag in struct termios
    mov (%rbx), %eax
    andl $~(0x0002 | 0x0008), %eax   # clear ICANON and ECHO
    mov %eax, (%rbx)

    mov $16, %rax                    # SYS_IOCTL
    mov $0, %rdi
    mov $0x5402, %rsi                # TCSETS
    lea termios_buf(%rip), %rdx
    syscall

    mov $60, %rax                    # SYS_exit
    xor %rdi, %rdi
    syscall

For base64:
If base64 is missing, an alternative is possible:
Send the file as a sequence of:

printf "\xXX\xYY\xZZ..."

However, this will significantly increase transfer time.
Base64 emits 4 output characters for every 3 input bytes (roughly 11 bits per byte), while the \xXX form needs 4 characters per byte (32 bits per byte)... almost tripling the amount of data.
The upside?
It’s completely dependency-free.
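A minimal sketch of that fallback, assuming only POSIX od, tr, and sed on the host, plus a printf that understands \xHH escapes on the target (bash’s builtin does):

```shell
# Host side: render the file as a string of printf-able \xHH escapes.
# od is a POSIX utility, so nothing exotic is required here.
printf 'tiny\n' > small.bin
escaped=$(od -An -v -tx1 small.bin | tr -d ' \n' | sed 's/../\\x&/g')

# Target side: a single printf rebuilds the original bytes.
printf "$escaped" > rebuilt.bin
cmp -s small.bin rebuilt.bin && echo "rebuilt OK"
```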

For gzip:
Compression is a huge time saver but not strictly necessary.
Skipping gzip would work... At the cost of slower and bigger transfers.


Conclusion

Next time you find yourself staring at a device that’s technically alive but practically unreachable, remember:
You don’t need miracles. You just need a bit of clever system abuse and send_console-ng.

Because in embedded work, persistence always beats perfection.