Thoughts on Unix filesystems

A correspondent recently pointed me at a piece by David Wheeler about problems he sees with Unix pathnames and how he proposes to fix them.

I think he's wrong.

He's identified a constellation of problems—the impedance mismatch between how the filesystem does pathnames and how various pathname-using software does pathnames. But he proposes to fix them by complicating pathname software infrastructure: the kernel, the on-disk format, library routines. "Unix does not prevent you from doing stupid things because that would also prevent you from doing clever things." He should take this to heart.

Yes, the Unix filesystem interface has some botches. But complicating things is not the right answer. Unix variants have, repeatedly, tried complicating the system so as to present interfaces that humans find simple to use. These attempts have uniformly either fallen totally flat or lost the clean orthogonal design that makes Unix Unix.

The right answer is not to complicate but to simplify.

If shell programs don't handle arbitrary filenames correctly because the shell is a programming language in which it's difficult to handle arbitrary octet strings, the right fix is not to eliminate one possible source (out of many) of arbitrary and possibly malicious strings. It's either to stop using the language for tasks it's not appropriate for, or to make it suitable: to make the shell programming languages capable of working effectively with arbitrary octet strings.
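To make "a language that works effectively with arbitrary octet strings" concrete, here is a minimal sketch in Python, which accepts filenames as byte strings. The names `total_size` and `demo`, and the particular awkward filenames, are made up for illustration. Nothing here is word-split, glob-expanded, or parsed as an option, so names that trip up naive shell scripts are just data.

```python
import os
import tempfile


def total_size(dirpath: bytes) -> int:
    """Sum the sizes of the regular files directly inside dirpath.

    dirpath is a byte string, so os.scandir() yields entries whose
    names are byte strings too; no decoding is ever attempted.
    """
    total = 0
    with os.scandir(dirpath) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total


def demo() -> int:
    """Create files with shell-hostile names and measure them."""
    # Hypothetical names: a leading dash, a space, a newline, a glob character.
    awkward = [b"-rf", b"two words", b"line\nbreak", b"*"]
    with tempfile.TemporaryDirectory() as tmp:
        root = os.fsencode(tmp)
        for size, name in enumerate(awkward, start=1):
            with open(os.path.join(root, name), "wb") as f:
                f.write(b"x" * size)
        return total_size(root)


if __name__ == "__main__":
    print(demo())
```

This assumes a Unix-like system where those names are legal. The point is only that the difficulty lives in the language, not in the names.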

If the prohibition of 0x00 and 0x2f octets in pathnames is a problem, the right fix is to provide filesystem interfaces that don't forbid them. This is not particularly difficult to do, certainly no more difficult than his proposed pervasive changes to practically everything that deals with pathnames.
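As a rough illustration of why this need not be hard, consider a toy in-memory lookup that takes a path as a sequence of components rather than as one delimited string. This is purely an assumption about what such an interface might look like, not an existing syscall or anyone's actual proposal; `lookup` and `tree` are hypothetical names. Once components are passed separately, no octet has to be reserved as a separator or terminator.

```python
from typing import Sequence


def lookup(root: dict, components: Sequence[bytes]):
    """Walk a nested dict "directory tree" one component at a time.

    Directories are dicts keyed by byte-string names; anything else
    is treated as file contents. Each component is an opaque octet
    string, so 0x00 and 0x2f are as ordinary as any other octet.
    """
    node = root
    for name in components:
        if not isinstance(node, dict):
            raise NotADirectoryError(repr(name))
        if name not in node:
            raise FileNotFoundError(repr(name))
        node = node[name]
    return node


# A toy tree with names no current Unix pathname could express.
tree = {
    b"etc": {
        b"a/b": b"slash in a name",
        b"nul\x00here": b"NUL in a name",
    }
}

if __name__ == "__main__":
    print(lookup(tree, [b"etc", b"a/b"]))
    print(lookup(tree, [b"etc", b"nul\x00here"]))
```

A real interface would have to address much more (relative lookups, permissions, compatibility with the existing string-based calls); the sketch only shows that the restriction comes from the path encoding, not from anything inherent in naming files.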

Note I spoke of 0x00 and 0x2f octets, not characters. Unix filenames have never been character sequences, always octet sequences, though humans (especially humans working in ASCII or mostly-ASCII environments) tend to blur the distinction. (In passing, this is true of a lot of other Unixy things, too, and is the reason why, for example, ssh as standardized is actually unimplementable on many (most?) Unices; that it works as much as it does is a testament to how many sites stick to ASCII for the relevant things.) This whole "let's change the filesystem and/or its interfaces" idea is predicated on confusing filesystem names with character strings.
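The octet/character distinction is easy to see once you leave ASCII. The sketch below (the function name `octet_forms` is just for illustration) shows one human-visible name turning into three different octet sequences depending on encoding and Unicode normalization. A filesystem that stores octets treats these as three unrelated names.

```python
import unicodedata


def octet_forms(name: str) -> dict:
    """Return several octet sequences that all display as `name`.

    Note: the Latin-1 entry raises UnicodeEncodeError for names
    containing characters outside Latin-1.
    """
    nfc = unicodedata.normalize("NFC", name)  # precomposed characters
    nfd = unicodedata.normalize("NFD", name)  # base letter + combining mark
    return {
        "utf-8 (NFC)": nfc.encode("utf-8"),
        "utf-8 (NFD)": nfd.encode("utf-8"),
        "latin-1": nfc.encode("latin-1"),
    }


if __name__ == "__main__":
    for label, octets in octet_forms("café").items():
        print(f"{label:12} {octets!r}")
```

Note that the Latin-1 form is not even valid UTF-8, yet it is a perfectly good filename as far as the kernel is concerned.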

"It seems perfection is achieved not when there is nothing more to add, but when there is nothing more to remove."[1] Near as I can tell, Saint-Exupéry was writing about aircraft design, but it's perhaps even more true of software. Don't accrete more special cases. Get rid of the ones already there.

Treating filesystem names as character strings (rather than octet strings) is a human-layer issue. Deal with it at the human-interface layer. Don't push human-layer special cases and other inconsistencies down into software-to-software interface layers such as syscall arguments or on-disk filesystem formats.
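Here is a sketch of what handling this at the human-interface layer might look like: the program keeps the raw octets for every operation and converts to characters only when showing a name to a person. The function `display_name` and its choice of escapes are my own illustration, and UTF-8 as the display encoding is an assumption.

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Render a filename for human eyes without altering the name itself.

    Octets that don't decode are shown as \\xNN escapes, and
    non-printable characters (newlines, control codes) are escaped
    too. The result is for display only: it is not reversible, since
    a literal backslash in the name is left as-is.
    """
    text = raw.decode(encoding, errors="backslashreplace")
    return "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )


if __name__ == "__main__":
    for raw in (b"notes.txt", "café".encode("utf-8"), b"caf\xe9.txt", b"a\nb"):
        # The raw bytes are what you pass to open(); the string is what you print.
        print(display_name(raw))
```

The syscall layer never sees any of this. Whatever policy the user interface picks for odd names, the names themselves stay plain octet strings.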

[1] My translation. "Il semble que la perfection soit atteinte non quand il n'y a plus rien à ajouter, mais quand il n'y a plus rien à retrancher."
