Sunday, September 14, 2025

Schrödinger’s test: The /dev/mem case

Why I Went Down This Rabbit Hole

Back in 1993, when Linux 0.99.14 was released, /dev/mem made perfect sense. Computers were simpler, physical memory was measured in megabytes, and security basically boiled down to: “Don’t run untrusted programs.”

Fast-forward to today. We have gigabytes (or terabytes!) of RAM, multi-layered virtualization, and strict security requirements… And /dev/mem is still here, quietly sitting in the kernel, practically unchanged… A fossil from a different era. It’s incredibly powerful, terrifyingly dangerous, and absolutely fascinating.

My work on /dev/mem is part of a bigger effort by the ELISA Architecture working group, whose mission is to improve Linux kernel documentation and testing. This project is a small pilot in a broader campaign: build tests for old, fundamental pieces of the kernel that everyone depends on but few dare to touch.

In a previous blog post, “When kernel comments get weird”, I dug into the /dev/mem source code and traced its history, uncovering quirky comments and code paths that date back decades. That post was about exploration. This one is about action: turning that historical understanding into concrete tests to verify that /dev/mem behaves correctly… Without crashing the very systems those tests run on.

What /dev/mem Is and Why It Matters

/dev/mem is a character device that exposes physical memory directly to userspace. Open it like a file, and you can read or write raw physical addresses: no page tables, no virtual memory abstractions, just the real thing.
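To make that concrete, here is a minimal sketch of the "open it like a file" idea. The address used is just a placeholder (the legacy System ROM area around 0xf0000 on a PC); picking an address that is actually safe is what the rest of this post is about, and of course this needs root and a machine you don't mind breaking:

/* Minimal sketch: read 16 bytes of physical memory through /dev/mem.
 * PHYS_ADDR is a placeholder -- choosing it safely is the hard part. */
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

#define PHYS_ADDR 0x000f0000UL   /* hypothetical: legacy System ROM area */

int main(void)
{
    uint8_t buf[16];
    int fd = open("/dev/mem", O_RDONLY);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    if (pread(fd, buf, sizeof(buf), PHYS_ADDR) != sizeof(buf)) {
        perror("pread");
        return 1;
    }
    for (size_t i = 0; i < sizeof(buf); i++)
        printf("%02x ", buf[i]);
    printf("\n");
    close(fd);
    return 0;
}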

Why is this powerful? Because it lets you:

  • Peek at firmware data structures,
  • Poke device registers directly,
  • Explore memory layouts normally hidden from userspace.

It’s like being handed the keys to the kingdom… and also a grenade, with the pin halfway pulled.

A single careless write to /dev/mem can:

  • Crash the kernel,
  • Corrupt hardware state,
  • Or make your computer behave like a very expensive paperweight.

For me, that danger is exactly why this project matters. Testing /dev/mem itself is tricky: the tests must prove the driver works, without accidentally nuking the machine they run on.

STRICT_DEVMEM and Real-Mode Legacy

One of the first landmines you encounter with /dev/mem is the kernel configuration option STRICT_DEVMEM.

Think of it as a global policy switch:

  • If disabled, /dev/mem lets privileged userspace access almost any physical address: kernel RAM, device registers, firmware areas, you name it.
  • If enabled, the kernel filters which physical ranges are accessible through /dev/mem. Typically, it only permits access to low legacy regions, like the first megabyte of memory where real-mode BIOS and firmware tables traditionally live, while blocking everything else.

Why does this matter? Some very old software, like emulators for DOS or BIOS tools, still expects to peek and poke those legacy addresses as if running on bare metal. STRICT_DEVMEM exists so those programs can still work, but without giving them carte blanche access to all memory.

So when you’re testing /dev/mem, the presence (or absence) of STRICT_DEVMEM completely changes what your test can do. With it disabled, /dev/mem is a wild west. With it enabled, only a small, carefully whitelisted subset of memory is exposed.
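One practical consequence is that a test can roughly probe which world it is running in. Here is a hedged sketch, assuming the hypothetical TEST_ADDR points at System RAM well above the legacy 1 MB region (in a real test it should come from /proc/iomem, discussed later):

/* Rough probe, not a definitive detector: with STRICT_DEVMEM enabled,
 * reading System RAM above 1 MB typically fails with EPERM; with it
 * disabled, the read usually succeeds. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

#define TEST_ADDR 0x10000000UL   /* hypothetical System RAM address above 1 MB */

int main(void)
{
    uint8_t b;
    int fd = open("/dev/mem", O_RDONLY);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    if (pread(fd, &b, 1, TEST_ADDR) == 1)
        printf("read succeeded: STRICT_DEVMEM is probably disabled\n");
    else if (errno == EPERM)
        printf("EPERM: STRICT_DEVMEM is probably enabled\n");
    else
        perror("pread");
    close(fd);
    return 0;
}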

A Quick Note on Architecture Differences

While /dev/mem always exposes what the kernel considers physical memory, the definition of physical itself can differ across architectures. For example, on x86, physical addresses are the real hardware addresses. On aarch64 with virtualization or secure firmware, EL1 may only see a subset of memory through a translated view, controlled by EL2 or EL3.

STRICT_DEVMEM is part of this picture too: the filtering it applies is not one global rule but a set of architecture-specific checks, so each architecture decides which physical address ranges can legitimately be accessed from userspace through /dev/mem and which accesses are considered unsafe.

32-Bit Systems and the Mystery of High Memory

On most systems, the kernel needs a direct way to access physical memory. To make that fast, it keeps a linear mapping: a simple, one-to-one correspondence between physical addresses and a range of kernel virtual addresses. If the kernel wants to read physical address 0x00100000, it just uses a fixed offset, like PAGE_OFFSET + 0x00100000. Easy and efficient.

But there’s a catch on 32-bit kernels: The kernel’s entire virtual address space is only 4 GB, and it has to share that with userspace. By convention, 3 GB is given to userspace, and 1 GB is reserved for the kernel, which includes its linear mapping.

Now here comes the tricky part: Physical RAM can easily exceed 1 GB. The kernel can’t linearly map all of it: there just isn’t enough virtual address space.

The extra memory beyond what the kernel can permanently map (roughly the first 896 MB on a default 3G/1G split) is called highmem (short for high memory). Unlike lowmem, which is always mapped, highmem pages are mapped temporarily, on demand, whenever the kernel needs them.

Why this matters for /dev/mem: /dev/mem depends on the permanent linear mapping to expose physical addresses. Highmem pages aren’t permanently mapped, so /dev/mem simply cannot see them. If you try to read those addresses, you’ll get zeros or an error, not because /dev/mem is broken, but because that part of memory is literally invisible to it.

For testing, this introduces extra complexity:

  • Some reads may succeed on lowmem addresses but fail on highmem.
  • Behavior on a 32-bit machine with highmem is fundamentally different from a 64-bit system, where all RAM is flat-mapped and visible.

Highmem is a deep topic that deserves its own article, but even this quick overview is enough to understand why it complicates /dev/mem testing.

How Reads and Writes Actually Happen

A common misconception is that a single userspace read() or write() call maps to one atomic access to the underlying device. In reality, the VFS layer and the device driver may split your request into multiple chunks, depending on alignment and page boundaries.

Why does this happen?

  • Many devices can only handle fixed-size or aligned operations.
  • For physical memory, the natural unit is a page (commonly 4 KB).

When your request crosses a page boundary, the kernel internally slices it into:

  1. A first piece up to the page boundary,
  2. Several full pages,
  3. A trailing partial page.

For /dev/mem, this is a crucial detail: A single read or write might look seamless from userspace, but under the hood it’s actually several smaller operations, each with its own state. If the driver mishandles even one of them, you could see skipped bytes, duplicated data, or mysterious corruption.
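To get a feel for the slicing, here is a tiny userspace sketch that reproduces the arithmetic (the kernel does the real splitting internally; show_chunks() is just an illustration of the three-piece pattern described above):

/* Sketch of how a (offset, count) request gets sliced at page boundaries. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

static void show_chunks(uint64_t offset, uint64_t count)
{
    while (count) {
        uint64_t in_page = PAGE_SIZE - (offset % PAGE_SIZE); /* room left in this page */
        uint64_t chunk = count < in_page ? count : in_page;
        printf("copy %llu bytes at 0x%llx\n",
               (unsigned long long)chunk, (unsigned long long)offset);
        offset += chunk;
        count -= chunk;
    }
}

int main(void)
{
    show_chunks(0x1800, 10 * 1024);  /* a 10 KB request that is not page aligned */
    return 0;
}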

Understanding this behavior is key to writing meaningful tests.

Safely Reading and Writing Physical Memory

At this point, we know what /dev/mem is and why it’s both powerful and terrifying. Now we’ll move to the practical side: how to interact with it safely, without accidentally corrupting your machine or testing in meaningless ways.

My very first test implementation kept things simple:

  • Only small reads or writes,
  • Always staying within a single physical page,
  • Never crossing dangerous boundaries.

Even with these restrictions, /dev/mem testing turned out to be more like defusing a bomb than flipping a switch.

Why “success” doesn’t mean success (in this very specific case)

Normally, when you call a syscall like read() or write(), you can safely assume the kernel did exactly what you asked. If read() returns a positive number, you trust that the data in your buffer matches the file’s contents. That’s the contract between userspace and the kernel, and it works beautifully in everyday programming.

But here’s the catch: We’re not just using /dev/mem; we’re testing whether /dev/mem itself works correctly.

This changes everything.

If my test reads from /dev/mem and fills a buffer with data, I can’t assume that data is correct:

  • Maybe the driver returned garbage,
  • Maybe it skipped a region or duplicated bytes,
  • Maybe it silently failed in the middle but still updated the counters.

The same goes for writes: A return code of “success” doesn’t guarantee the write went where it was supposed to, only that the driver finished running without errors.

So in this very specific context, “success” doesn’t mean success. I need independent ways to verify the result, because the thing I’m testing is the thing that would normally be trusted.

Finding safe places to test: /proc/iomem

Before even thinking about reading or writing physical memory, I need to answer one critical question:

“Which parts of physical memory are safe to touch?”

If I just pick a random address and start writing, I could:

  • Overwrite the kernel’s own code,
  • Corrupt a driver’s I/O-mapped memory,
  • Trash ACPI tables that the system kernel depends on,
  • Or bring the whole machine down in spectacular fashion.

This is where /proc/iomem comes to the rescue. It’s a text file that maps out how the physical address space is currently being used. Each line describes a range of physical addresses and what they’re assigned to.

Here’s a small example:

00000000-00000fff : Reserved
00001000-0009ffff : System RAM
000a0000-000fffff : Reserved
  000a0000-000dffff : PCI Bus 0000:00
  000c0000-000ce5ff : Video ROM
  000f0000-000fffff : System ROM
00100000-09c3efff : System RAM
09c3f000-09ffffff : Reserved
0a000000-0a1fffff : System RAM
0a200000-0a20efff : ACPI Non-volatile Storage
0a20f000-0affffff : System RAM
0b000000-0b01ffff : Reserved
0b020000-b696efff : System RAM
b696f000-b696ffff : Reserved
b6970000-b88acfff : System RAM
b88ad000-b9ff0fff : Reserved
  b9fd0000-b9fd3fff : MSFT0101:00
    b9fd0000-b9fd3fff : MSFT0101:00
  b9fd4000-b9fd7fff : MSFT0101:00
    b9fd4000-b9fd7fff : MSFT0101:00
b9ff1000-ba150fff : ACPI Tables
ba151000-bbc0afff : ACPI Non-volatile Storage
bbc0b000-bcbfefff : Reserved
bcbff000-bdffffff : System RAM
be000000-bfffffff : Reserved

By parsing /proc/iomem, my test program can:

  1. Identify which physical regions are safe to work with (like RAM already allocated to my process),
  2. Avoid regions that are reserved for hardware or critical firmware,
  3. Adapt dynamically to different machines and configurations.

This is especially important for multi-architecture support. While examples here often look like x86 (because /dev/mem has a long history there), the concept of mapping I/O regions isn’t x86-specific. On ARM, RISC-V, or others, you’ll see different labels… But the principle remains exactly the same.

In short: /proc/iomem is your treasure map, and the first rule of treasure hunting is “don’t blow up the ship while digging for gold.”
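As a starting point, here is a minimal sketch of how a test might pull the "System RAM" ranges out of /proc/iomem (run it as root: unprivileged readers see the addresses zeroed out):

/* Sketch: list the "System RAM" ranges from /proc/iomem. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/iomem", "r");
    if (!f) { perror("/proc/iomem"); return 1; }

    while (fgets(line, sizeof(line), f)) {
        unsigned long long start, end;
        if (strstr(line, "System RAM") &&
            sscanf(line, "%llx-%llx", &start, &end) == 2)
            printf("RAM: %#llx-%#llx (%llu KB)\n",
                   start, end, (end - start + 1) / 1024);
    }
    fclose(f);
    return 0;
}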

The Problem of Contiguous Physical Pages

Up to this point, my work focused on single-page operations. I wasn’t hand-picking physical addresses or trying to be clever about where memory came from. Instead, the process was simple and safe:

  1. Allocate a buffer in userspace, using mmap() so it’s page-aligned,
  2. Touch the page to make sure the kernel really backs it with physical memory,
  3. Walk /proc/self/pagemap to trace which physical pages back the virtual address in the buffer.

This gives me full visibility into how my userspace memory maps to physical memory. Since the buffer was created through normal allocation, it’s mine to play with, there’s no risk of trampling over the kernel or other userspace processes.
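Here is what that pagemap walk looks like in practice, as a sketch (virt_to_phys() is my own helper name, not a kernel API; bits 0-54 of each pagemap entry hold the PFN, bit 63 says whether the page is present, and you need root to see real PFNs):

/* Sketch: virtual-to-physical lookup via /proc/self/pagemap. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

uint64_t virt_to_phys(void *vaddr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    off_t idx = (uintptr_t)vaddr / page_size;
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return 0;
    if (pread(fd, &entry, sizeof(entry), idx * sizeof(entry)) != sizeof(entry))
        entry = 0;
    close(fd);

    if (!(entry & (1ULL << 63)))                       /* page not present */
        return 0;
    return (entry & ((1ULL << 55) - 1)) * page_size    /* PFN -> physical base */
           + ((uintptr_t)vaddr % page_size);           /* plus offset in page */
}

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    /* 1. page-aligned buffer, 2. touch it so it is really backed by RAM */
    uint8_t *buf = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    buf[0] = 0xaa;

    /* 3. ask pagemap which physical page backs it */
    printf("virt %p -> phys %#llx\n", (void *)buf,
           (unsigned long long)virt_to_phys(buf));
    return 0;
}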

This worked beautifully for basic tests:

  • Pick a single page in the buffer,
  • Run a tiny read/write cycle through /dev/mem,
  • Verify the result,
  • Nothing explodes.
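That cycle, boiled down to a sketch (check_single_page() is an illustrative name of mine, and it assumes virt_to_phys() from the pagemap sketch above is available in the same file, with "page" pointing into a buffer we own):

/* Sketch: single-page round trip through /dev/mem, verified independently. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

uint64_t virt_to_phys(void *vaddr);       /* from the pagemap sketch above */

static int check_single_page(int devmem_fd, uint8_t *page)
{
    uint8_t check[64];
    uint64_t phys = virt_to_phys(page);

    if (!phys)
        return -1;                        /* no physical address, nothing to compare */

    memset(page, 0x5a, sizeof(check));    /* known pattern, via the normal mapping */

    if (pread(devmem_fd, check, sizeof(check), phys) != (ssize_t)sizeof(check))
        return -1;                        /* the /dev/mem read itself failed */

    return memcmp(page, check, sizeof(check)) ? -1 : 0;   /* 0: /dev/mem agrees */
}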

But then came the next challenge: What if a read or write crosses a physical page boundary?

Why boundaries matter

The Linux VFS layer doesn’t treat a read or write syscall as one giant, indivisible action. Instead, it splits large operations into chunks, moving through pages one at a time.

For example:

  • I request 10 KB from /dev/mem,
  • The first 4 KB comes from physical page A,
  • The next 4 KB comes from physical page B,
  • The last 2 KB comes from physical page C.

If the driver mishandles the transition between pages, I’d never notice unless my test forces it to cross that boundary. It’s like testing a car by only driving in a straight line: Everything looks fine… Until you try to turn the wheel.

To properly test /dev/mem, I need a buffer backed by at least two physically contiguous pages. That way, a single read or write naturally crosses from one physical page into the next… exactly the kind of situation where subtle bugs might hide.

And that’s when the real nightmare began.

Why this is so difficult

At first, this seemed easy. I thought:

“How hard can it be? Just allocate a buffer big enough, like 128 KB, and somewhere inside it, there must be two contiguous physical pages.”

Ah, the sweet summer child optimism. The harsh truth: modern kernels actively work against this happening by accident. It’s not because the kernel hates me personally (though it sure felt like it). It’s because of its duty to prevent memory fragmentation.

When you call brk() or mmap(), the kernel:

  1. Uses a buddy allocator to manage blocks of physical pages,
  2. Actively spreads allocations apart to keep them tidy,
  3. Reserves contiguous ranges for things like hugepages or DMA.

From the kernel’s point of view:

  • This keeps the system stable,
  • Prevents large allocations from failing later,
  • And generally makes life good for everyone.

From my point of view? It’s like trying to find two matching socks in a dryer while it is drying them.

Playing the allocation lottery

My first approach was simple: keep trying until luck strikes.

  1. Allocate a 128 KB buffer,
  2. Walk /proc/self/pagemap to see where all pages landed physically,
  3. If no two contiguous pages are found, free it and try again.

Statistically, this should work eventually. In reality? After thousands of iterations, I’d still end up empty-handed. It felt like buying lottery tickets and never even winning a free one.

The kernel’s buddy allocator is very good at avoiding fragmentation. Two consecutive physical pages are far rarer than you’d think, and that’s by design.
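For reference, the lottery loop itself is short; here is a sketch of it (find_contiguous_pair() is an illustrative name, and virt_to_phys() is again the pagemap helper from earlier):

/* Sketch: allocate 128 KB, fault it in, look for two adjacent PFNs, retry. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE   (128 * 1024)
#define MAX_TRIES  1000

uint64_t virt_to_phys(void *vaddr);       /* the pagemap helper sketched earlier */

static uint8_t *find_contiguous_pair(void)
{
    long ps = sysconf(_SC_PAGESIZE);

    for (int t = 0; t < MAX_TRIES; t++) {
        uint8_t *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        for (size_t i = 0; i < BUF_SIZE; i += ps)
            buf[i] = 1;                            /* fault every page in */
        for (size_t i = 0; i + ps < BUF_SIZE; i += ps) {
            uint64_t a = virt_to_phys(buf + i);
            uint64_t b = virt_to_phys(buf + i + ps);
            if (a && b == a + ps)
                return buf + i;                    /* physically contiguous pair */
        }
        munmap(buf, BUF_SIZE);                     /* no luck, play again */
    }
    return NULL;
}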

Trying to confuse the allocator

Naturally, my next thought was:

“If the allocator is too clever, let’s mess with it!”

So I wrote a perturbation routine:

  • Allocate a pile of small blocks,
  • Touch them so they’re actually backed by physical pages,
  • Free them in random order to create “holes.”

The hope was to trick the allocator into giving me contiguous pages next time. The result? It often worked (around 4k attempts gave me >80% success), but never predictably. Not reliable enough for a test suite where failures must mean a broken driver, not a grumpy kernel allocator.
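For the curious, the perturbation routine boils down to something like this sketch (a rough variation: it releases a random subset of small mappings rather than freeing them in a shuffled order, but the intent, poking holes in the allocator's tidy layout, is the same):

/* Sketch: fragment the allocator a little before retrying the lottery. */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define NBLOCKS 512

static void perturb_allocator(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    void *blk[NBLOCKS];

    for (int i = 0; i < NBLOCKS; i++) {
        blk[i] = mmap(NULL, ps, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (blk[i] != MAP_FAILED)
            *(char *)blk[i] = 1;          /* make sure it is really backed */
    }
    for (int i = 0; i < NBLOCKS; i++)     /* release a (pseudo) random subset */
        if (blk[i] != MAP_FAILED && rand() % 2)
            munmap(blk[i], ps);
    /* the leftover blocks are intentionally kept around as "holes" */
}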

The options I didn’t want

There are sure-fire ways to get contiguous pages:

  • Writing a kernel module and calling alloc_pages().
  • Using hugepages.
  • Configuring CMA regions at boot.

But all of these require special setup or kernel cooperation. My goal was a pure userspace test, so they were off the table.

A new perspective: software MMU

Finally, I relaxed my original requirement. Instead of demanding two pages that are both physically and virtually contiguous, I only needed them to be physically contiguous somewhere in the buffer.

From there, I could build a tiny software MMU:

  • Find a contiguous physical pair using /proc/self/pagemap,
  • Expose them through a simple linear interface,
  • Run the test as if they were virtually contiguous.

This doesn’t eliminate the challenge, but it makes it practical. No kernel hacks, no special boot setup, just a bit of clever user-space logic.
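Here is roughly what that software MMU looks like as a data structure (the soft_mmu / smmu_* names are mine, not from the actual test tool): two virtual pages, known to be physically adjacent, presented as one linear two-page window.

/* Sketch: a two-page "software MMU" for boundary-crossing tests. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096UL

struct soft_mmu {
    uint8_t *vpage[2];     /* virtual pages; vpage[1] is physically right after vpage[0] */
    uint64_t phys_base;    /* physical address of vpage[0], from pagemap */
};

/* Translate a linear offset in [0, 2*PAGE_SZ) to the right virtual byte. */
static uint8_t *smmu_ptr(struct soft_mmu *m, size_t off)
{
    return m->vpage[off / PAGE_SZ] + (off % PAGE_SZ);
}

static void smmu_fill(struct soft_mmu *m, size_t off, size_t len, uint8_t seed)
{
    for (size_t i = 0; i < len; i++)
        *smmu_ptr(m, off + i) = (uint8_t)(seed + i);   /* pattern across the boundary */
}

static int smmu_verify(struct soft_mmu *m, const uint8_t *got, size_t off, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (*smmu_ptr(m, off + i) != got[i])
            return -1;     /* /dev/mem gave us something other than what we wrote */
    return 0;
}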

From Theory to Test Code

All this theory eventually turned into a real test tool, because staring at /proc/self/pagemap is fun… but only for a while. The test lives here:

github.com/alessandrocarminati/devmem_test

It’s currently packaged as a Buildroot module, which makes it easy to run on different kernels and architectures without messing up your main system. The long-term goal is to integrate it into the kernel’s selftests framework, so these checks can run as part of the regular Linux testing pipeline. For now, it’s a standalone sandbox where you can:

  • Experiment with /dev/mem safely (on a test machine!),
  • Play with /proc/self/pagemap and see how virtual pages map to physical memory,
  • Try out the software MMU idea without needing kernel modifications.

And expect it to still be a work in progress.

Thursday, September 11, 2025

When Kernel Comments Get Weird: The Tale of read_mem

When Kernel Comments Get Weird: The Tale of drivers/char/mem.c

As part of the Elisa community, we spend a good chunk of our time spelunking through the Linux kernel codebase. It’s like code archeology: you don’t always find treasure, but you do find lots of comments left behind by developers from the ’90s that make you go, “Wait… really?”

One of the ideas we’ve been chasing is to make kernel comments a bit smarter: not only human-readable, but also machine-readable. Imagine comments that could be turned into tests, so they’re always checked against reality. Less “code poetry from 1993”, more “living documentation”.

Speaking of code poetry, here’s one gem we stumbled across in mem.c:

/*
 * The memory devices use the full 32/64 bits of the offset,
 * and so we cannot check against negative addresses: they are ok.
 * The return value is weird, though, in that case (0).
 */

This beauty has been hanging around since Linux 0.99.14… back when Bill Clinton was in his first year in the White House, “Mosaic” was the hot new browser, and the PDP-11 was still being produced and sold.

Back then, it made sense, and reflected exactly what the code did.

Fast-forward thirty years, and the comment still kind of applies… but mostly in obscure corners of the architecture zoo. On the CPUs people actually use every day?

$ cat lseek.asm
BITS 64

%define SYS_read  0
%define SYS_write 1
%define SYS_open  2
%define SYS_lseek 8
%define SYS_exit  60

; flags
%define O_RDONLY 0
%define SEEK_SET 0

section .data
path: db "/dev/mem",0

section .bss
align 8
buf: resq 1

section .text
global _start
_start:
    mov rax, SYS_open
    lea rdi, [rel path]
    xor esi, esi
    xor edx, edx
    syscall
    mov r12, rax              ; save fd in r12

    mov rax, SYS_lseek
    mov rdi, r12
    mov rsi, 0x8000000000000001
    xor edx, edx
    syscall
    mov [rel buf], rax

    mov rax, SYS_write
    mov edi, 1
    lea rsi, [rel buf]
    mov edx, 8
    syscall

    mov rax, SYS_exit
    xor edi, edi
    syscall
$ nasm -f elf64 lseek.asm -o lseek.o
$ ld lseek.o -o lseek
$ sudo ./lseek| hexdump -C
00000000  01 00 00 00 00 00 00 80                           |........|
00000008
$ # this is not what I expect, let's double check
$ sudo gdb ./lseek
GNU gdb (Fedora Linux) 16.3-1.fc42
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./lseek...
(No debugging symbols found in ./lseek)
(gdb) b _start
Breakpoint 1 at 0x4000b0
(gdb) r
Starting program: /tmp/lseek

Breakpoint 1, 0x00000000004000b0 in _start ()
(gdb) x/30i $pc
=> 0x4000b0 <_start>:           mov    $0x2,%eax
   0x4000b5 <_start+5>:         lea    0xf44(%rip),%rdi        # 0x401000
   0x4000bc <_start+12>:        xor    %esi,%esi
   0x4000be <_start+14>:        xor    %edx,%edx
   0x4000c0 <_start+16>:        syscall
   0x4000c2 <_start+18>:        mov    %rax,%r12
   0x4000c5 <_start+21>:        mov    $0x8,%eax
   0x4000ca <_start+26>:        mov    %r12,%rdi
   0x4000cd <_start+29>:        movabs $0x8000000000000001,%rsi
   0x4000d7 <_start+39>:        xor    %edx,%edx
   0x4000d9 <_start+41>:        syscall
   0x4000db <_start+43>:        mov    %rax,0xf2e(%rip)        # 0x401010
   0x4000e2 <_start+50>:        mov    $0x1,%eax
   0x4000e7 <_start+55>:        mov    $0x1,%edi
   0x4000ec <_start+60>:        lea    0xf1d(%rip),%rsi        # 0x401010
   0x4000f3 <_start+67>:        mov    $0x8,%edx
   0x4000f8 <_start+72>:        syscall
   0x4000fa <_start+74>:        mov    $0x3c,%eax
   0x4000ff <_start+79>:        xor    %edi,%edi
   0x400101 <_start+81>:        syscall
   0x400103:    add    %al,(%rax)
   0x400105:    add    %al,(%rax)
   0x400107:    add    %al,(%rax)
   0x400109:    add    %al,(%rax)
   0x40010b:    add    %al,(%rax)
   0x40010d:    add    %al,(%rax)
   0x40010f:    add    %al,(%rax)
   0x400111:    add    %al,(%rax)
   0x400113:    add    %al,(%rax)
   0x400115:    add    %al,(%rax)
(gdb) b *0x4000c2
Breakpoint 2 at 0x4000c2
(gdb) b *0x4000db
Breakpoint 3 at 0x4000db
(gdb) c
Continuing.

Breakpoint 2, 0x00000000004000c2 in _start ()
(gdb) i r
rax            0x3                 3
rbx            0x0                 0
rcx            0x4000c2            4194498
rdx            0x0                 0
rsi            0x0                 0
rdi            0x401000            4198400
rbp            0x0                 0x0
rsp            0x7fffffffe3a0      0x7fffffffe3a0
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x246               582
r12            0x0                 0
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x4000c2            0x4000c2 <_start+18>
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
fs_base        0x0                 0
gs_base        0x0                 0
(gdb) # fd is just fine rax=3 as expected.
(gdb) c
Continuing.

Breakpoint 3, 0x00000000004000db in _start ()
(gdb) i r
rax            0x8000000000000001  -9223372036854775807
rbx            0x0                 0
rcx            0x4000db            4194523
rdx            0x0                 0
rsi            0x8000000000000001  -9223372036854775807
rdi            0x3                 3
rbp            0x0                 0x0
rsp            0x7fffffffe3a0      0x7fffffffe3a0
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x246               582
r12            0x3                 3
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x4000db            0x4000db <_start+43>
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
fs_base        0x0                 0
gs_base        0x0                 0
(gdb) # According to that comment, rax should have been 0, but it is not.
(gdb) c
Continuing.
[Inferior 1 (process 186746) exited normally]
(gdb)

Not so much. Seeking to 0x8000000000000001… returns 0x8000000000000001, not 0 as the comment anticipates. We’re basically facing the kernel version of that “Under Construction” GIF on websites from the 90s: still there, but mostly just nostalgic decoration now.

The Mysterious Line in read_mem

Let’s zoom in on one particular bit of code in read_mem:

phys_addr_t p = *ppos;

/* ... other code ... */

if (p != *ppos)
	return 0;

At first glance, this looks like a no-op; why would p be different from *ppos when you just copied it? It’s like testing if gravity still works by dropping your phone… spoiler: it does.

But as usual with kernel code, the weirdness has a reason.

The Problem: Truncation on 32-bit Systems

Here’s what’s going on:

  • *ppos is a loff_t, which is a 64-bit signed integer.
  • p is a phys_addr_t, which holds a physical address.

On a 64-bit system, both are 64 bits wide. Assignment is clean, the condition can never be true, and compilers just toss the check out.

But on a 32-bit system, phys_addr_t is only 32 bits. Assign a big 64-bit offset to it, and boom, the top half vanishes. Truncated, like your favorite TV series canceled after season 1.

That if (p != *ppos) check is the safety net. It spots when truncation happens and bails out early, instead of letting some unlucky app read from la-la land.
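You can see the same effect with plain userspace types; here is a tiny sketch of what the truncation-plus-compare buys you:

/* Sketch: a 64-bit offset squeezed into a 32-bit "address" loses its top half,
 * and only the compare notices. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t  ppos = 0x100000000LL;   /* 4 GB: perfectly fine as a file offset */
    uint32_t p    = (uint32_t)ppos;  /* what a 32-bit phys_addr_t would hold */

    if ((int64_t)p != ppos)
        printf("truncated: 0x%llx became 0x%x\n",
               (unsigned long long)ppos, (unsigned)p);
    return 0;
}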

Assembly Time: 64-bit vs. 32-bit

On 64-bit builds (say, AArch64), the compiler optimizes away the check.

┌ 736: sym.read_mem (int64_t arg2, int64_t arg3, int64_t arg4);
│ `- args(x1, x2, x3) vars(13:sp[0x8..0x70])
│           0x08000b10      1f2003d5       nop
│           0x08000b14      1f2003d5       nop
│           0x08000b18      3f2303d5       paciasp
│           0x08000b1c      fd7bb9a9       stp x29, x30, [sp, -0x70]!
│           0x08000b20      fd030091       mov x29, sp
│           0x08000b24      f35301a9       stp x19, x20, [var_10h]
│           0x08000b28      f40301aa       mov x20, x1
│           0x08000b2c      f55b02a9       stp x21, x22, [var_20h]
│           0x08000b30      f30302aa       mov x19, x2
│           0x08000b34      750040f9       ldr x21, [x3]
│           0x08000b38      e10302aa       mov x1, x2
│           0x08000b3c      e33700f9       str x3, [var_68h]          ; phys_addr_t p = *ppos;
│           0x08000b40      e00315aa       mov x0, x21
│           0x08000b44      00000094       bl valid_phys_addr_range
│       ┌─< 0x08000b48      40150034       cbz w0, 0x8000df0          ; if (!valid_phys_addr_range(p, count))
│       │   0x08000b4c      00000090       adrp x0, segment.ehdr
│       │   0x08000b50      020082d2       mov x2, 0x1000
│       │   0x08000b54      000040f9       ldr x0, [x0]
│       │   0x08000b58      01988152       mov w1, 0xcc0
│       │   0x08000b5c      f76303a9       stp x23, x24, [var_30h]
[...]

Nothing to see here, move along. But on 32-bit builds (like old-school i386), the check shows up loud and proud in the assembly.

┌ 392: sym.read_mem (int32_t arg_8h);
│ `- args(sp[0x4..0x4]) vars(5:sp[0x14..0x24])
│            0x080003e0      55             push ebp
│            0x080003e1      89e5           mov ebp, esp
│            0x080003e3      57             push edi
│            0x080003e4      56             push esi
│            0x080003e5      53             push ebx
│            0x080003e6      83ec14         sub esp, 0x14
│            0x080003e9      8955f0         mov dword [var_10h], edx
│            0x080003ec      8b5d08         mov ebx, dword [arg_8h]
│            0x080003ef      c745ec0000..   mov dword [var_14h], 0
│            0x080003f6      8b4304         mov eax, dword [ebx + 4]
│            0x080003f9      8b33           mov esi, dword [ebx]       ; phys_addr_t p = *ppos;
│            0x080003fb      85c0           test eax, eax
│        ┌─< 0x080003fd      7411           je 0x8000410               ; if (!valid_phys_addr_range(p, count))
│       ┌┌──> 0x080003ff     8b45ec         mov eax, dword [var_14h]
│       ╎╎│   0x08000402     83c414         add esp, 0x14
│       ╎╎│   0x08000405     5b             pop ebx
│       ╎╎│   0x08000406     5e             pop esi
│       ╎╎│   0x08000407     5f             pop edi
│       ╎╎│   0x08000408     5d             pop ebp
│       ╎╎│   0x08000409     c3             ret
[...]

The CPU literally does a compare-and-jump to enforce it. So yes, this is a real guard, not some leftover fluff.

Return Value Oddities

Now, here’s where things get even funnier. If the check fails in read_mem, the function returns 0. That’s “no bytes read”, which in file I/O land is totally fine.

But in the twin function write_mem, the same situation returns -EFAULT. That’s kernel-speak for “Nope, invalid address, stop poking me”.

So, reading from a bad address? You get a polite shrug. Writing to it? You get a slap on the wrist. Fair enough, writing garbage into memory is way more dangerous than failing to read it. Still, it looks like something we ought to fix up… Or do we, really?

Why does read_mem return 0 instead of an error?

This behavior comes straight from Unix I/O tradition.

In user space, tools like dd expect a read() call to return 0 to mean “end of file”. They loop until that happens and then exit cleanly.

Returning an error code instead would break that pattern and confuse programs that treat /dev/mem like a regular file. In other words, read_mem is playing nice with existing utilities: 0 here doesn’t mean “nothing went wrong”, it means “nothing left to read.”
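For context, this is the loop shape read_mem is being polite to, as a minimal sketch (drain() is an illustrative name):

/* Sketch: the classic Unix read loop that treats 0 as end-of-file. */
#include <unistd.h>

static long drain(int fd, char *buf, long len)
{
    long total = 0;
    ssize_t n;

    while (total < len && (n = read(fd, buf + total, len - total)) > 0)
        total += n;        /* n == 0 means "nothing left to read": stop quietly */
    return total;          /* a short count, not a failure */
}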

Wrapping It Up

This little dive shows how a single “weird” line of code carries decades of context, architecture quirks, type definitions, and evolving assumptions. It also shows why comments like the one from 0.99.14 are dangerous: they freeze a moment in time, but reality keeps moving.

Our mission in Elisa Architecture WG is to bring comments back to life: keep them up-to-date, tie them to tests, and make sure they still tell the truth. Because otherwise, thirty years later, we’re all squinting at a line saying “the return value is weird though” and wondering if the developer was talking about code… or just their day.

And now, a brief word from our sponsors (a.k.a. me in a different hat): When I’m not digging up ancient kernel comments with the Architecture WG, I’m also leading the Linux Features for Safety-Critical Systems (LFSCS) WG. We’re cooking up some pretty exciting stuff there too.

So if you enjoy the kind of archaeology/renovation work we’re doing there, come check out LFSCS as well: same Linux, different adventure.