As part of the Elisa community, we
spend a good chunk of our time spelunking through the Linux kernel
codebase. It’s like code archeology: you don’t always find treasure, but
you do find lots of comments left behind by developers from the
’90s that make you go, “Wait… really?”
One of the ideas we’ve been chasing is to make kernel comments a bit
smarter: not only human-readable, but also machine-readable. Imagine
comments that could be turned into tests, so they’re always checked
against reality. Less “code poetry from 1993”, more “living
documentation”.
Speaking of code poetry, here’s
one gem we stumbled across in mem.c
:
/* The memory devices use the full 32/64 bits of the offset,
* and so we cannot check against negative addresses: they are ok.
* The return value is weird, though, in that case (0).
*/
This
beauty has been hanging around since Linux 0.99.14…
back when Bill
Clinton was still president-elect, “Mosaic” was the hot
new browser, and PDP-11 was still
produced and sold.
Back then, it made sense, and reflected exactley what the code did.
Fast-forward thirty years, and the comment still kind of applies…
but mostly in obscure corners of the architecture zoo. On the CPUs
people actually use every day?
$ cat lseek.asm
BITS 64
%define SYS_read 0
%define SYS_write 1
%define SYS_open 2
%define SYS_lseek 8
%define SYS_exit 60
; flags
%define O_RDONLY 0
%define SEEK_SET 0
section .data
path: db "/dev/mem",0
section .bss
align 8
buf: resq 1
section .text
global _start
_start:
mov rax, SYS_open
lea rdi, [rel path]
xor esi, esi
xor edx, edx
syscall
mov r12, rax ; save fd in r12
mov rax, SYS_lseek
mov rdi, r12
mov rsi, 0x8000000000000001
xor edx, edx
syscall
mov [rel buf], rax
mov rax, SYS_write
mov edi, 1
lea rsi, [rel buf]
mov edx, 8
syscall
mov rax, SYS_exit
xor edi, edi
syscall
$ nasm -f elf64 lseek.asm -o lseek.o
$ ld lseek.o -o lseek
$ sudo ./lseek| hexdump -C
00000000 01 00 00 00 00 00 00 80 |........|
00000008
$ # this is not what I expect, let's double check
$ sudo gdb ./lseek
GNU gdb (Fedora Linux) 16.3-1.fc42
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./lseek...
(No debugging symbols found in ./lseek)
(gdb) b _start
Breakpoint 1 at 0x4000b0
(gdb) r
Starting program: /tmp/lseek
Breakpoint 1, 0x00000000004000b0 in _start ()
(gdb) x/30i $pc
=> 0x4000b0 <_start>: mov $0x2,%eax
0x4000b5 <_start+5>: lea 0xf44(%rip),%rdi # 0x401000
0x4000bc <_start+12>: xor %esi,%esi
0x4000be <_start+14>: xor %edx,%edx
0x4000c0 <_start+16>: syscall
0x4000c2 <_start+18>: mov %rax,%r12
0x4000c5 <_start+21>: mov $0x8,%eax
0x4000ca <_start+26>: mov %r12,%rdi
0x4000cd <_start+29>: movabs $0x8000000000000001,%rsi
0x4000d7 <_start+39>: xor %edx,%edx
0x4000d9 <_start+41>: syscall
0x4000db <_start+43>: mov %rax,0xf2e(%rip) # 0x401010
0x4000e2 <_start+50>: mov $0x1,%eax
0x4000e7 <_start+55>: mov $0x1,%edi
0x4000ec <_start+60>: lea 0xf1d(%rip),%rsi # 0x401010
0x4000f3 <_start+67>: mov $0x8,%edx
0x4000f8 <_start+72>: syscall
0x4000fa <_start+74>: mov $0x3c,%eax
0x4000ff <_start+79>: xor %edi,%edi
0x400101 <_start+81>: syscall
0x400103: add %al,(%rax)
0x400105: add %al,(%rax)
0x400107: add %al,(%rax)
0x400109: add %al,(%rax)
0x40010b: add %al,(%rax)
0x40010d: add %al,(%rax)
0x40010f: add %al,(%rax)
0x400111: add %al,(%rax)
0x400113: add %al,(%rax)
0x400115: add %al,(%rax)
(gdb) b *0x4000c2
Breakpoint 2 at 0x4000c2
(gdb) b *0x4000db
Breakpoint 3 at 0x4000db
(gdb) c
Continuing.
Breakpoint 2, 0x00000000004000c2 in _start ()
(gdb) i r
rax 0x3 3
rbx 0x0 0
rcx 0x4000c2 4194498
rdx 0x0 0
rsi 0x0 0
rdi 0x401000 4198400
rbp 0x0 0x0
rsp 0x7fffffffe3a0 0x7fffffffe3a0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x246 582
r12 0x0 0
r13 0x0 0
r14 0x0 0
r15 0x0 0
rip 0x4000c2 0x4000c2 <_start+18>
eflags 0x246 [ PF ZF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
fs_base 0x0 0
gs_base 0x0 0
(gdb) # fd is just fine rax=3 as expected.
(gdb) c
Continuing.
Breakpoint 3, 0x00000000004000db in _start ()
(gdb) i r
rax 0x8000000000000001 -9223372036854775807
rbx 0x0 0
rcx 0x4000db 4194523
rdx 0x0 0
rsi 0x8000000000000001 -9223372036854775807
rdi 0x3 3
rbp 0x0 0x0
rsp 0x7fffffffe3a0 0x7fffffffe3a0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x246 582
r12 0x3 3
r13 0x0 0
r14 0x0 0
r15 0x0 0
rip 0x4000db 0x4000db <_start+43>
eflags 0x246 [ PF ZF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
fs_base 0x0 0
gs_base 0x0 0
(gdb) # According to that comment, rax should have been 0, but it is not.
(gdb) c
Continuing.
�[Inferior 1 (process 186746) exited normally]
(gdb)
Not so much. Seeking at 0x8000000000000001
… Returns
0x8000000000000001
not 0
as anticipated in the
comment. We’re basically facing the kernel version of that “Under
Construction” GIF on websites from the 90s, still there, but mostly just
nostalgic decoration now.
The Mysterious Line in
read_mem
Let’s zoom in on one particular bit of code in read_mem
:
phys_addr_t p = *ppos;
/* ... other code ... */
if (p != *ppos) return 0;
At first glance, this looks like a no-op; why would p
be
different from *ppos
when you just copied it? It’s like
testing if gravity still works by dropping your phone… spoiler:
it does.
But as usual with kernel code, the weirdness has a reason.
The Problem:
Truncation on 32-bit Systems
Here’s what’s going on:
*ppos
is a loff_t
, which is a 64-bit
signed integer.
p
is a phys_addr_t
, which holds a physical
address.
On a 64-bit system, both are 64 bits wide. Assignment is clean, the
check always fails (and compilers just toss it out).
But on a 32-bit system, phys_addr_t
is only 32 bits.
Assign a big 64-bit offset to it, and boom, the top
half vanishes. Truncated, like your favorite TV series canceled after
season 1.
That if (p != *ppos)
check is the safety net. It spots
when truncation happens and bails out early, instead of letting some
unlucky app read from la-la land.
Assembly Time: 64-bit
vs. 32-bit
On 64-bit builds (say, AArch64), the compiler optimizes away the
check.
┌ 736: sym.read_mem (int64_t arg2, int64_t arg3, int64_t arg4);
│ `- args(x1, x2, x3) vars(13:sp[0x8..0x70])
│ 0x08000b10 1f2003d5 nop
│ 0x08000b14 1f2003d5 nop
│ 0x08000b18 3f2303d5 paciasp
│ 0x08000b1c fd7bb9a9 stp x29, x30, [sp, -0x70]!
│ 0x08000b20 fd030091 mov x29, sp
│ 0x08000b24 f35301a9 stp x19, x20, [var_10h]
│ 0x08000b28 f40301aa mov x20, x1
│ 0x08000b2c f55b02a9 stp x21, x22, [var_20h]
│ 0x08000b30 f30302aa mov x19, x2
│ 0x08000b34 750040f9 ldr x21, [x3]
│ 0x08000b38 e10302aa mov x1, x2
│ 0x08000b3c e33700f9 str x3, [var_68h] ; phys_addr_t p = *ppos;
│ 0x08000b40 e00315aa mov x0, x21
│ 0x08000b44 00000094 bl valid_phys_addr_range
│ ┌─< 0x08000b48 40150034 cbz w0, 0x8000df0 ;if (!valid_phys_addr_range(p, count))
│ │ 0x08000b4c 00000090 adrp x0, segment.ehdr
│ │ 0x08000b50 020082d2 mov x2, 0x1000
│ │ 0x08000b54 000040f9 ldr x0, [x0]
│ │ 0x08000b58 01988152 mov w1, 0xcc0
│ │ 0x08000b5c f76303a9 stp x23, x24, [var_30h]
[...]
Nothing to see here, move along. But on 32-bit builds (like
old-school i386), the check shows up loud and proud in the assembly.
┌ 392: sym.read_mem (int32_t arg_8h);
│ `- args(sp[0x4..0x4]) vars(5:sp[0x14..0x24])
│ 0x080003e0 55 push ebp
│ 0x080003e1 89e5 mov ebp, esp
│ 0x080003e3 57 push edi
│ 0x080003e4 56 push esi
│ 0x080003e5 53 push ebx
│ 0x080003e6 83ec14 sub esp, 0x14
│ 0x080003e9 8955f0 mov dword [var_10h], edx
│ 0x080003ec 8b5d08 mov ebx, dword [arg_8h]
│ 0x080003ef c745ec0000.. mov dword [var_14h], 0
│ 0x080003f6 8b4304 mov eax, dword [ebx + 4]
│ 0x080003f9 8b33 mov esi, dword [ebx] ; phys_addr_t p = *ppos;
│ 0x080003fb 85c0 test eax, eax
│ ┌─< 0x080003fd 7411 je 0x8000410 ; if (!valid_phys_addr_range(p, count))
│ ┌┌──> 0x080003ff 8b45ec mov eax, dword [var_14h]
│ ╎╎│ 0x08000402 83c414 add esp, 0x14
│ ╎╎│ 0x08000405 5b pop ebx
│ ╎╎│ 0x08000406 5e pop esi
│ ╎╎│ 0x08000407 5f pop edi
│ ╎╎│ 0x08000408 5d pop ebp
│ ╎╎│ 0x08000409 c3 ret
[...]
The CPU literally does a compare-and-jump to enforce it. So yes, this
is a real guard, not some leftover fluff.
Return Value Oddities
Now, here’s where things get even funnier. If the check fails in
read_mem
, the function returns 0
. That’s “no
bytes read”, which in file I/O land is totally fine.
But in the twin function write_mem
, the same situation
returns -EFAULT
. That’s kernel-speak for “Nope, invalid
address, stop poking me”.
So, reading from a bad address? You get a polite shrug. Writing to
it? You get a slap on the wrist. Fair enough, writing garbage into
memory is way more dangerous than failing to read it. Come on, probably
here we need to fix things up… Do we really?
Why does
read_mem
return 0
instead of an error?
This behavior comes straight from Unix I/O tradition.
In user space, tools like dd
expect a
read()
call to return 0
to mean “end of file”.
They loop until that happens and then exit cleanly.
Returning an error code instead would break that pattern and confuse
programs that treat /dev/mem
like a regular file. In other
words, read_mem
is playing nice with existing utilities:
0
here doesn’t mean “nothing went wrong”, it means “nothing
left to read.”
Wrapping It Up
This little dive shows how a single “weird” line of code carries
decades of context, architecture quirks, type definitions, and evolving
assumptions. It also shows why comments like the one from 0.99.14 are
dangerous: they freeze a moment in time, but reality keeps moving.
Our mission in Elisa Architecture WG is to bring comments back to
life: keep them up-to-date, tie them to tests, and make sure they still
tell the truth. Because otherwise, thirty years later, we’re all
squinting at a line saying “the return value is weird though” and
wondering if the developer was talking about code… or just their
day.
And now, a brief word from our sponsors (a.k.a. me in a
different hat): When I’m not digging up ancient kernel comments with the
Architecture WG, I’m also leading the Linux Features for Safety-Critical
Systems (LFSCS) WG. We’re cooking up some pretty exciting stuff there
too.
So if you enjoy the kind of archaeology/renovation work we’re doing
there, come check out LFSCS as well: same
Linux, different adventure.
The same article is also published by the Elisa community blog at https://elisa.tech/ambassadors/2025/09/08/when-kernel-comments-get-weird-the-tale-of-drivers-char-mem-c/
ReplyDelete