Today, I'm going to figure out how execve() manages to keep track of open fds while replacing the executable image inside a task.

Something needs to keep track of which fds are open and what Mach ports they map to. Something needs to close CLOEXEC fds. And this all should happen without racing too much.
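
For reference, here is what CLOEXEC looks like from the program's side. This is a minimal sketch in plain POSIX, nothing Hurd-specific; the helper names `set_cloexec` and `is_cloexec` are mine, not from glibc.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Mark fd as close-on-exec. Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Returns 1 if fd is marked close-on-exec, 0 if not, -1 on error. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

The flag is per-descriptor state, so whatever performs the exec has to consult it for every open fd.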

It might be simple or it might be complicated. Let's see.

I'm looking at _hurd_exec_paths(), which execs the given executable file in the given task.

Yes, in a true capabilities manner it's possible to do an exec() in *any* task you have a port to, not just your own task! This is useful for spawning children (and this is how glibc implements posix_spawn) without actually copying your memory into the child task (fork). As a nice bonus, in this scheme it's the parent who gets the detailed error info if spawning goes wrong.
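
From the caller's point of view, that looks roughly like this. It is a portable sketch that uses only the standard posix_spawn() interface; `spawn_and_wait` is a hypothetical helper name. Note that POSIX allows an implementation to report a failed exec either as the return value in the parent or as exit status 127 in the child, so portable code should handle both.

```c
#include <assert.h>
#include <errno.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Spawn `path` with `argv` and wait for it.
   Returns 0 and stores the exit status in *status on success;
   otherwise returns the error number, which is seen by the parent. */
int spawn_and_wait(const char *path, char *const argv[], int *status)
{
    pid_t pid;
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (err != 0)
        return err;             /* the parent gets the error directly */

    int wstatus;
    while (waitpid(pid, &wstatus, 0) == -1) {
        if (errno != EINTR)
            return errno;
    }
    /* -1 here means the child did not exit normally (e.g. was killed). */
    *status = WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
    return 0;
}
```

Calling it with `"/bin/sh"` and the arguments `sh -c "exit 7"` should return 0 and store 7 in `*status`.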


Whether the same task or not, the new program inherits the state from the program calling _hurd_exec_paths(), such as open file descriptors, cwd/root, umask, which signals are blocked. And in addition to that Unix state, it inherits Hurd-specific state such as which auth server to use.

All this info is packed into arrays of ints and ports, and sent to the exec server in exec_exec_paths() along with the file to execute and the port names to deallocate/destroy. This is how CLOEXEC happens.
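
To make the packing step concrete, here is a simplified model of it, not the glibc code. Ports are plain integers, and every name (`model_port_t`, `model_fd`, `model_pack_fds`) is made up for illustration. The assumption is that a CLOEXEC descriptor is left out of the table passed to the new program, and its port goes on the list of ports to deallocate instead.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int model_port_t;   /* stand-in for a Mach port name */
#define MODEL_PORT_NULL 0u

struct model_fd {
    model_port_t port;   /* MODEL_PORT_NULL if the slot is unused */
    int cloexec;         /* nonzero if marked close-on-exec */
};

/* Fill dtable_out[0..n) with the ports the new program should inherit and
   dealloc_out with the ports that should be deallocated on exec.
   Both output arrays must have room for n entries.
   Returns the number of entries written to dealloc_out. */
size_t model_pack_fds(const struct model_fd *fds, size_t n,
                      model_port_t *dtable_out, model_port_t *dealloc_out)
{
    size_t ndealloc = 0;
    for (size_t i = 0; i < n; i++) {
        if (fds[i].port == MODEL_PORT_NULL) {
            dtable_out[i] = MODEL_PORT_NULL;         /* unused slot */
        } else if (fds[i].cloexec) {
            dtable_out[i] = MODEL_PORT_NULL;         /* not inherited */
            dealloc_out[ndealloc++] = fds[i].port;   /* dropped on exec */
        } else {
            dtable_out[i] = fds[i].port;             /* inherited as-is */
        }
    }
    return ndealloc;
}
```

The fd numbers survive as array indices, which is why unused and CLOEXEC slots are kept as null entries rather than being compacted away.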

Now we're in the exec server. The exec server, despite implementing a seemingly simple feature (that should actually be implementable in-task without a separate server), is in fact one of the most essential Hurd servers.

As I've previously mentioned, it's one of the two servers (along with the root filesystem) that are started directly by the GNU Mach kernel on bootup. It has to be, because without an exec server nothing else could be launched, and without the root filesystem there would be nothing to exec.

The exec server will actually replace the given task with a fresh one if requested explicitly or if the EXEC_SECURE flag is set. This makes sure anyone who has a port to the old task cannot control the new program.

Additionally, EXEC_SECURE will cause the exec server to replace some of the provided ports (namely, ports to auth, proc, and root filesystem servers) with pristine versions.

How EXEC_SECURE gets set deserves its own digression:

In order to make setuid execs possible, _hurd_exec_paths() doesn't directly call into the exec server. Instead it asks the filesystem implementing the file-to-be-exec'ed to do that. The filesystem forwards the arguments to the exec server, but it can alter the provided auth and add EXEC_SECURE if it believes the executable is setuid.
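
The flag logic described here can be modeled in a few lines. This is a sketch, not the actual filesystem or exec server code: the `MODEL_` names and bit values are invented for illustration, and it only looks at the setuid bit.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Illustrative bit values, chosen for this sketch only. */
#define MODEL_EXEC_NEWTASK 0x1
#define MODEL_EXEC_SECURE  0x2

/* Filesystem side: forward the caller's flags, adding the secure flag
   if the file being executed is setuid. */
int model_fs_exec_flags(mode_t file_mode, int caller_flags)
{
    if (file_mode & S_ISUID)
        return caller_flags | MODEL_EXEC_SECURE;
    return caller_flags;
}

/* Exec server side: a fresh task is needed if it was requested
   explicitly or if the exec is a secure one. */
int model_needs_fresh_task(int flags)
{
    return (flags & (MODEL_EXEC_NEWTASK | MODEL_EXEC_SECURE)) != 0;
}
```

The point of routing through the filesystem is that the caller never gets to decide whether the secure flag is set; only whoever owns the file does.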

After the exec server is done replacing the task's virtual memory by loading the new executable image into it, it replaces the task's bootstrap port with a fresh port to itself.

If you need a refresher, the bootstrap port is one of the "special" ports that Mach stores for each task. It's generally used to provide the new task with some way to bootstrap other connections. On Darwin, it's used to connect tasks to launchd, aka the bootstrap server, which gives out ports to other servers.

On the Hurd, the bootstrap port, as seen inside main(), is used when starting translators (filesystems). ("Must be started as a translator" is the error message they typically print if they find out their bootstrap port is null.)

But it turns out each task *actually* starts up with the bootstrap port provided by the exec server. glibc initialization code calls exec_startup_get_info() on it, to which the exec server replies with all that data sent by whoever's started this exec in the first place.

This data includes the "real" bootstrap port — the one the task had before getting exec'ed, and the one main() expects to see — which glibc sets back as this task's bootstrap port.

This is also where glibc unpacks fds, essential server ports, and other info. So this is how fds and other state are preserved across exec: manually, by packing all the relevant info, sending it to the filesystem, then to the exec server, then back to the task, then unpacking it back into place.
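
Here is a toy model of that bootstrap port dance. It is not the real RPC interface: ports are integers, the reply carries only a tiny subset of the state, and all the `model_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int model_port_t;   /* stand-in for a Mach port name */
#define MODEL_MAX_FDS 8

/* What the exec server hands back at startup (hypothetical layout). */
struct model_startup_info {
    model_port_t real_bootstrap;          /* the port main() expects */
    model_port_t dtable[MODEL_MAX_FDS];   /* inherited fd ports */
    size_t dtable_len;
};

struct model_task {
    model_port_t bootstrap;
    model_port_t fds[MODEL_MAX_FDS];
    size_t nfds;
};

/* Exec server side: once the new image is loaded, the task starts with
   no unpacked state and a bootstrap port pointing at the exec server. */
void model_exec_finish(struct model_task *task, model_port_t exec_server_port)
{
    task->nfds = 0;
    task->bootstrap = exec_server_port;
}

/* Task side: stand-in for the startup code that fetches the info,
   unpacks the fds, and restores the real bootstrap port. */
void model_task_startup(struct model_task *task,
                        const struct model_startup_info *reply)
{
    size_t n = reply->dtable_len < MODEL_MAX_FDS ? reply->dtable_len
                                                 : MODEL_MAX_FDS;
    for (size_t i = 0; i < n; i++)
        task->fds[i] = reply->dtable[i];
    task->nfds = n;
    task->bootstrap = reply->real_bootstrap;
}
```

By the time main() runs, the temporary exec server port is gone and the task sees the same bootstrap port it had before the exec.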

P.S. but what about the exec server itself? what bootstrap port does it get?

It turns out that it gets a port to the root filesystem, the other task started on bootup, as its bootstrap port. (The root filesystem gets the exec server port, which is normal.) So when the exec server itself starts up, it expects to have been just exec'ed by the root filesystem, and as any task it calls exec_startup_get_info() on that bootstrap port.

The root filesystem knows how to handle that by sending back a special flag (EXEC_STACK_ARGS) that tells the exec server to look for args on its stack, which is where the kernel loader places them, unlike the exec server, which sends them over in reply to exec_startup_get_info().
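
In model form, the choice the startup code makes is just a flag check. The bit value and the names `MODEL_EXEC_STACK_ARGS` and `model_pick_args` are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit value, chosen for this sketch only. */
#define MODEL_EXEC_STACK_ARGS 0x10

/* Pick where the startup code should read its arguments from:
   the stack (as set up by the kernel loader) or the startup reply. */
const char *const *model_pick_args(int reply_flags,
                                   const char *const *stack_args,
                                   const char *const *reply_args)
{
    return (reply_flags & MODEL_EXEC_STACK_ARGS) ? stack_args : reply_args;
}
```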


@bugaevc I feel like this is creeping into FUSE territory...

@seven I mean, this is literally what the Hurd is about, so...

@bugaevc I'm still lost I guess in what makes it "more flexible" tbh...

Maybe it's past me, my kernel dev days are half a decade ago at least, probably more like a decade (don't judge me age) so I'm unclear.

@bugaevc In fact I'm not clear on the additional layer exactly, so is the advantage being an active translator? I'm not clear what the advantage is, perhaps some speed at the expense of processor time I suppose?

:blobcatgooglyshrug:​ I think I'm too dumb to understand...

@seven to name a few:
* run ext2fs under gdb or rpctrace!
* lots of cool translators (FUSE is how filesystems are natively done in the Hurd)
* flexible (and "natural") subhurds
* generally, lots of things are (safely) doable without being root
* actual capability-based security underneath
* you can have multiple uids and gids at a time, and add new ones dynamically to a running process

@bugaevc The root problem seems exclusive to BSD, unless I am missing something, FUSE is userspace, beyond implementation, root is not needed (if implementation is done correctly) now some of the modules underneath might need root, is that what we are talking about?

The proc uid/gid statement confuses me, point me (please) at some errata to read on the subject cause I'm not clear...

I honestly dunno what subhurds are, I think I need to read more about this... I mean, I'm not entirely clueless here, but I'm still seeing BSD implementation problems, which FUSE wasn't made for, but you have sparked my curiosity so I shall invest some research hours to try and understand... ;)

@bugaevc I see, nahhh I'm good Microkernels are the devil... I mean, speaking as someone who's actually had to deal with minix... This concept was attempted long long ago, and rejected (rightfully so) our arch doesn't do it correctly... Maybe if RISC becomes fashionable again...
