The curious case of shell commands, or how "this bug is required by POSIX" -- Volution Notes

Some context

(For those interested only in the glibc and POSIX issues, skip to the end.)

I use Linux almost everywhere, from laptops and desktops to servers and routers, and over the course of many years I've written quite a few bash scripts that ease my interaction with all of these.

The usual "stack" that ties all of these together is composed of:

bash, obviously for scripting;
ssh, for remote execution;
i3, the window manager for desktop environments;
screen, for terminal multiplexing;
tmux, for terminal multiplexing;
dtach, for launching processes in background while still being able to access them if needed;
(among many other tools like coreutils and friends, grep, sed, sort, find, xargs, rsync, plus other smaller C, Go or Rust based tools; but none of these are the subject of this article;)

What do all of the former have in common? They all provide some sort of functionality, but by themselves they don't actually do anything; they are meant to delegate the actual work to other tools.

So what is the problem, you ask? Nothing really... Everything is fine... No planes are falling from the sky... Nobody is running around screaming while on fire... Everything is business as usual, unless you want to write some wrapper scripts that take arbitrary user input and delegate it to one of these (and many other) broken tools...

And just for completeness I'll throw in some other tools that fit the bill:

make (and any other variant or derivative), for build systems;
ninja, my favorite build system, and lightweight DAG job scheduler and executor;

Root of all evil

As many of you have guessed, because we live in a UNIX world, many of these tools delegate the actual execution to the glibc system(3) function (or an equivalent), which ends up calling sh(1p) (most likely bash) with the -c argument, instead of doing the correct thing and just calling execve(2)...

BTW, this is not something Linux-specific. Unfortunately it is a trait inherited from the UNIX ancestry by almost all operating systems, including all BSD variants plus OSX; and, for some tools, this behavior is even seen when running on Windows. But I digress...

Excerpts from Linux `system(3)` man page

Let's quote from the Linux system(3) man page:

The system(3) library function uses fork(2) to create a child process that executes the shell command specified in command using execl(3) as follows:
execl("/bin/sh", "sh", "-c", command, (char *) NULL);
[...]
Any user input that is employed as part of command should be carefully sanitized, to ensure that unexpected shell commands or command options are not executed. Such risks are especially grave when using system() from a privileged program.

So far the warning doesn't sound very comforting... But hey, we are all responsible developers...
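To make the excerpt concrete, here is a minimal Python sketch of the mechanism it describes: fork(2), then execl("/bin/sh", "sh", "-c", command) in the child, then wait. It is a simplified illustration, not glibc's code: the real system(3) also blocks SIGCHLD and ignores SIGINT/SIGQUIT while waiting, and returns the raw wait status rather than an exit code. my_system is a hypothetical name.

```python
import os


def my_system(command: str) -> int:
    """Hypothetical re-creation of system(3): the whole string goes to sh -c."""
    pid = os.fork()
    if pid == 0:
        # Child: the same argv the man page shows, i.e. ["sh", "-c", command].
        try:
            os.execl("/bin/sh", "sh", "-c", command)
        finally:
            # Only reached if execl() itself failed; mimic the shell's 127.
            os._exit(127)
    # Parent: wait for the child and decode its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # negative value: killed by that signal
```

Note that my_system("true; false") returns 1: the argument is not a program name but shell syntax, so the ; starts a second command whose status wins.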

Excerpts from POSIX `system(3p)` man page

Let's quote from the POSIX system(3p) man page:

Hmm... Strange... There is nothing to quote from this manual about warnings, issues or sanitization... (Except perhaps the warning for set-UID or set-GID processes against using this function, due to potential privilege escalation issues.)

Excerpts from POSIX `sh(1p)` man page

How about quoting from the POSIX sh(1p) man page regarding the -c flag:

-c -- Read commands from the command_string operand. Set the value of special parameter 0 (see Section 2.5.2, Special Parameters) from the value of the command_name operand and the positional parameters ($1, $2, and so on) in sequence from the remaining argument operands. No commands shall be read from the standard input.

OK... Still nothing about warnings, issues or sanitization...
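The part of that paragraph about $0 and the positional parameters is worth seeing in action, because it is the one way sh -c receives data without re-parsing it: everything after the command string travels as separate argv entries. A small Python sketch, assuming a POSIX /bin/sh (run_sh is a hypothetical helper):

```python
import subprocess
from typing import List


def run_sh(script: str, name: str, *args: str) -> str:
    """Run `sh -c SCRIPT NAME ARG...` and return its standard output."""
    argv: List[str] = ["/bin/sh", "-c", script, name, *args]
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


# $0 comes from NAME and "$@" from the remaining operands, each kept intact.
output = run_sh('printf "<%s>\\n" "$0" "$@"', "my-name", "a b", "; echo injected")
print(output, end="")
```

The script itself is a fixed string we wrote; the untrusted values never become part of it. The catch is that this only helps when we control the sh invocation, which tools like ssh do not let us do.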

Excerpts from `bash(1)` man page

Perhaps we'll have more luck quoting from the bash(1) man page:

-c -- If the -c option is present, then commands are read from the first non-option argument command_string. If there are arguments after the command_string, the first argument is assigned to $0 and any remaining arguments are assigned to the positional parameters. The assignment to $0 sets the name of the shell, which is used in warning and error messages.

Hmm... Perhaps sanitization isn't that big of an issue?

Partial conclusion

For now let's remember that initial remark:

Any user input [...] part of command should be [...] sanitized [...].

Aside from that, perhaps this "shell command" business I'm crying about isn't such a big issue after all? Otherwise the POSIX manuals for system(3p) (~2K words) and sh(1p) (~8K words) -- that I assume were written by legions of committee members -- would have certainly spared at least a few words about this issue... Right?

Unintended consequences

So why do we care about system(3) and sh -c?

Because, as said at the beginning, many tools accept commands as arguments, which in the end are passed to sh -c via the system(3) function.

Therefore, when we write our scripts and tools, we need to be aware of this situation and be prepared to escape and quote our commands and arguments accordingly, otherwise we'll be subject to shell injections... (For more scary stuff one can read about shellshock.)

A harmless example...

For example, say we want to write a script that takes exactly two arguments, a command and a single argument (which we store in the _command and _argument variables):

if test "${#}" -ne 2 ; then
    printf -- '[ee] invalid arguments!\n' >&2
    exit 1
fi
_command="${1}"
_argument="${2}"
shift -- 2

And after some initial setup we want to pass our command to a tool such as watch, hyperfine, ssh and many others, as in tool "command argument".

Thus one would quickly write the following:

tool "${_command} ${_argument}"

(Granted, watch does have the -x flag that solves this issue; however it is not the default, and ssh, hyperfine and many other tools don't have such an option.)

... with interesting failure modes

If one has written the delegation as above, one would be in for a surprise when trying to use a command or argument that:

contains a space -- would yield the wrong command (if the space is in the command) or multiple arguments (in any case);
contains a quote -- would most likely result in invalid syntax due to mismatched quotes (while a backslash would silently escape the character that follows it);
contains a special character like $, *, ?, etc. -- parameter and glob expansion would take place, yielding wrong arguments;
contains a special character or token like ;, {, }, (, ), ||, &&, &, etc. -- would basically result in the equivalent of an SQL injection;
contains a backquote -- command substitution would take place, yet another equivalent of an SQL injection;
basically almost anything that is not a number or letter will most likely break something...
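These failure modes are easy to reproduce. The Python sketch below contrasts a hypothetical naive_delegate, which builds one string exactly like tool "${_command} ${_argument}" and hands it to /bin/sh -c, with a hypothetical direct_delegate, which passes the argument as its own argv entry. echo stands in for the delegated command; the exact error text for the unmatched quote depends on which shell /bin/sh is.

```python
import subprocess


def naive_delegate(command: str, argument: str) -> subprocess.CompletedProcess:
    # Equivalent of: tool "${_command} ${_argument}"  ->  sh -c "<command> <argument>"
    return subprocess.run(["/bin/sh", "-c", f"{command} {argument}"],
                          capture_output=True, text=True)


def direct_delegate(command: str, argument: str) -> subprocess.CompletedProcess:
    # What the wrapper actually meant: exactly one argument, no shell in between.
    return subprocess.run([command, argument], capture_output=True, text=True)


samples = ["two  spaces", "it's", "$((6*7))", "a; echo injected", "`echo sub`"]
for sample in samples:
    naive = naive_delegate("echo", sample)
    direct = direct_delegate("echo", sample)
    print(f"{sample!r:22} naive={naive.stdout!r:24} rc={naive.returncode} "
          f"direct={direct.stdout!r}")
```

Every sample comes back intact from direct_delegate, while naive_delegate collapses the double space, fails on the lone quote, evaluates the arithmetic, runs the injected echo, and performs the command substitution.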

Thus one asks oneself: what is the right solution?

Proper solution

The proper solution would be to drop that broken tool immediately, securely erase it from your hard drive, then run around screaming that tool's name out loud in shame... (Something akin to Game of Thrones' walk of atonement...)

I'm not kidding... These kinds of broken tools are the cause of many stupid bugs, ranging from the funny oops-rm-with-spaces (e.g. rm -Rf / some folder with spaces /some-file) to serious security issues like the aforementioned shellshock...

So, you say someone holds you at gunpoint, thus you must use that tool? Check whether the broken tool has a flag that disables calling sh -c and instead executes the given command and arguments directly via execve(2). (For example, watch has the -x flag, as mentioned.)

Alternatively, given that the tool in question is most likely an open-source project written by someone in their spare time, perhaps open a feature request describing the issue, and if possible contribute a patch that solves it.

Still no luck? Make some popcorn and prepare for the latest blockbuster "convoluted solutions for simple problems in UNIX town"...

Convoluted solutions

However if we are forced to use that broken tool, we'll need to apply the following countermeasures...

Yes, I've written "countermeasures", because now we have to think like an attacker and see how we can break our own script, then with each attempt we'll build layers upon layers of countermeasures. And if our eyes start hurting, it might be because of the onion we've created with all these convoluted layers...

Please note that none of the following are the correct solution. Instead they are just steps in building towards the correct solution.

Also, please note that most likely I've gotten half of it wrong!

Convoluted step 1 -- quoting the command and arguments

A first step is to use the bash-specific @Q expansion modifier, which quotes a string according to the sh rules so that it can be safely reused as shell input:

broken-tool "${_command@Q} ${_argument@Q}"

In case we have multiple arguments, as in _arguments=( "${@}" ), then one could use the following:

broken-tool "${_command@Q} ${_arguments[*]@Q}"

In case we don't need the extra variables, we could simply just write:

broken-tool "${*@Q}"
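Outside of bash the same idea is available in most languages; in Python, for instance, shlex.quote quotes a single word for a POSIX shell. The sketch below is a rough counterpart of "${*@Q}" (quote_all is a hypothetical helper). Its quoting style differs from bash's @Q (for example, @Q renders it's as 'it'\''s' and always adds quotes), but any correct quoting must round-trip: parsing the result with shell rules must give back the original words.

```python
import shlex
from typing import Iterable


def quote_all(words: Iterable[str]) -> str:
    """Rough counterpart of bash's "${*@Q}": quote each word, join with spaces."""
    return " ".join(shlex.quote(word) for word in words)


words = ["my command", "it's", "$HOME", "a;b", "*", ""]
line = quote_all(words)
print(line)

# shlex.split() applies POSIX-style word splitting and quote removal,
# so it should hand back exactly the words we started with.
assert shlex.split(line) == words
```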

Convoluted step 2 -- using `exec` before the command

What would happen if our command happens to be named just like a bash built-in or keyword, for example kill or time, whose behavior is not compatible with the arguments meant for the executable? Did you know that there are executables, in the default distribution, that are named if, echo, printf and even test or [? (Just try it out with type -P if.)

In order to make sure that we actually execute the command as found via the $PATH environment variable, we should prefix it with exec:

broken-tool "exec ${_command@Q} ${_arguments[*]@Q}"
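A quick way to see what exec buys us: in the sketch below a shell function named echo stands in for a built-in that shadows the real executable (a function is used only because its output is predictable; both functions and regular built-ins win over the $PATH lookup, and exec bypasses both). Without exec the shell runs its own definition; with exec it replaces itself with the program found in $PATH. It assumes /bin/sh is a POSIX shell and that an echo executable is on $PATH, as on typical Linux systems.

```python
import subprocess


def sh_stdout(script: str) -> str:
    return subprocess.run(["/bin/sh", "-c", script],
                          capture_output=True, text=True).stdout


# A function named like the command we want to run shadows it...
shadow = "echo() { printf 'shadowed\\n'; }; "

without_exec = sh_stdout(shadow + "echo hi")    # runs the shell function
with_exec = sh_stdout(shadow + "exec echo hi")  # runs the echo found in $PATH
print(repr(without_exec), repr(with_exec))
```

Because exec replaces the shell process, it has to be the last thing the command string does, which is exactly the situation when delegating a single command.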

Convoluted step 3 -- using `--` after `exec`

What would happen if our command happens to start with a hyphen, like for example -strange-command?

In order to make sure that the command isn't mistakenly interpreted as a flag to exec itself, we should also add -- between exec and the command:

broken-tool "exec -- ${_command@Q} ${_arguments[*]@Q}"

Please note that not all sh implementations actually support the exec -- variant.

Final convoluted solution

broken-tool "exec -- ${_command@Q} ${_arguments[*]@Q}"

In fact, many tools themselves accept the -- argument to denote the end of all "options" and the beginning of "arguments", thus my favorite final variant would be:

broken-tool -- "exec -- ${_command@Q} ${_arguments[*]@Q}"
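For tools driven from a program rather than from bash, the same final recipe can be assembled with shlex.quote. A minimal sketch, where final_command_string and broken_tool_argv are hypothetical helpers and "broken-tool" is the article's placeholder name; it assumes both the tool and the shell on the receiving end accept --, which, as noted above, not every sh does.

```python
import shlex
from typing import List, Sequence


def final_command_string(argv: Sequence[str]) -> str:
    """Analogue of: "exec -- ${_command@Q} ${_arguments[*]@Q}"."""
    if not argv:
        raise ValueError("argv must contain at least the command")
    return "exec -- " + " ".join(shlex.quote(word) for word in argv)


def broken_tool_argv(tool: str, argv: Sequence[str]) -> List[str]:
    """Analogue of: broken-tool -- "exec -- ..." (the tool's own -- first)."""
    return [tool, "--", final_command_string(argv)]


argv = ["-strange-command", "a b", "$HOME", "x; y"]
print(broken_tool_argv("broken-tool", argv))
```

The resulting list is meant for an execve(2)-style call (for example subprocess.run without shell=True), so no additional shell layer sits between us and the tool.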

The road paved with good intentions

So now you ask yourself: which tools should be treated with such great care?

Case study -- `watch(1)` man page

Let's look at the watch(1) manual:

at the beginning we encounter watch [options] command, which hints that the command might be passed to sh -c;
however, later it gives examples such as watch -n 60 from and watch -d ls -l, which suggest it might not be the case?
our first real hint is another example, watch -d 'ls -l | fgrep joe', which not only suggests that it uses system(3) to execute our command, but in fact encourages us to pass full-sized sh scripts;
finally the -x (or --exec) flag gives the definitive answer:

Pass command to exec(2) instead of sh -c which reduces the need to use extra quoting to get the desired effect.

Case study -- `ssh(1)` man page

Let's look at the ssh(1) manual:

at the beginning we encounter ssh [...] destination [command], which hints that the command might be passed to sh -c;
however, strangely enough for such a complex tool, we have no examples regarding its invocation;
thus the manual is "silent" about any such detail...

However, judging by various experiments, we might guess that it isn't the case?

ssh user@remote echo a b c;
ssh user@remote cat /etc/passwd;

Spoiler: ssh is one of those tools that not only should be handled with great care, but also lacks a flag that would disable this dangerous behavior...

Case study -- `i3` and `i3-msg`

Let's look at the i3 and i3-msg manuals; from i3-msg we have:

i3-msg [-q] [-v] [-h] [-s socket] [-t type] [message]
command -- The payload of the message is a command for i3 (like the commands you can bind to keys in the configuration file) and will be executed directly after receiving it.

From i3 we have:

exec [--no-startup-id] <command> (as syntax)
exec --no-startup-id xdotool key --clearmodifiers ctrl+v (as example)
exec --no-startup-id import /tmp/latest-screenshot.png (as example)
finally in a later section the manual states:

exec [--no-startup-id] <command>
The exec command starts an application by passing the command you specify to a shell.

The common thread...

All of these tools behave in a misleading way:

they usually expect a single "command" that is in fact an sh script;
however they accept multiple "arguments" that are joined together by spaces to form the final command;
their manuals are quite terse when it comes to the "command" syntax and handling;

Some experiments...

Let's experiment, for example, with printf (not the bash builtin, but the executable, hence the full path): /usr/bin/printf .%q. argument-1 argument-2 prints each of its arguments quoted according to sh syntax and wrapped in dots, as in:

> /usr/bin/printf .%q. '1 2' '; a' 'b ;' 'c ; d'
.'1 2'..'; a'..'b ;'..'c ; d'.

> /usr/bin/printf .%q. "'1 2' '; a' 'b ;' 'c ; d'"
.''\''1 2'\'' '\''; a'\'' '\''b ;'\'' '\''c ; d'\'''.

> /usr/bin/printf .%q. '"1 2 ' ' 3 4"'
.'"1 2 '..' 3 4"'.

Now let's try that through ssh:

quoting the entire command yields the correct outcome:

> ssh user@remote "/usr/bin/printf .%q. '1 2' '; a' 'b ;' 'c ; d'"
.'1 2'..'; a'..'b ;'..'c ; d'.

quoting only the printf arguments (as a single string) yields the same outcome as above, although the arguments are different (see the previous printf experiment):

> ssh user@remote /usr/bin/printf .%q. "'1 2' '; a' 'b ;' 'c ; d'"
.'1 2'..'; a'..'b ;'..'c ; d'.

getting more creative with quotes, and now the arguments are merged into a single one:

> ssh user@remote /usr/bin/printf .%q. '"1 2 ' ' 3 4"'
.'1 2   3 4'.

forgetting the quotes, and now disaster strikes:

> ssh user@remote /usr/bin/printf .%q. '1 2' '; a' 'b ;' 'c ; d'
.1..2.
bash: a: command not found
bash: c: command not found
bash: d: command not found

basically ssh just takes your arguments, joins them with spaces (regardless of what they contain), and shoves them into sh -c on the remote server;
thus the last example, the one everybody would expect to be correct, and which can be found in many examples (sans the spaces), is actually the wrong one...
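The joining behavior can be modeled locally, without a remote host. In this sketch ssh_style_command is a hypothetical helper that does what the experiments above show ssh doing on the client side (join the words with single spaces), and a local /bin/sh -c stands in for the remote end. In practice OpenSSH runs the command through the remote user's shell, so the quoting fix assumes that shell is sh-compatible.

```python
import shlex
import subprocess
from typing import Sequence


def ssh_style_command(words: Sequence[str]) -> str:
    # Client side: the words are joined with single spaces, whatever they contain.
    return " ".join(words)


def run_like_remote_shell(command: str) -> str:
    # Remote side stand-in: the joined string is parsed again by `sh -c`.
    return subprocess.run(["/bin/sh", "-c", command],
                          capture_output=True, text=True).stdout


words = ["echo", "1 2", "; echo a"]
unquoted = run_like_remote_shell(ssh_style_command(words))
quoted = run_like_remote_shell(ssh_style_command([shlex.quote(w) for w in words]))
print(repr(unquoted))  # '1 2\na\n'  -> a second command ran
print(repr(quoted))    # '1 2 ; echo a\n'  -> arguments arrived intact
```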

Finally let's try that through i3-msg:

> i3-msg -t command -- exec urxvt -hold -e /usr/bin/printf .%q. '1 2' '; a' 'b ;' 'c ; d'
ERROR: Your command: exec urxvt -hold -e /usr/bin/printf .%q. 1 2 ; a b ; c ; d
ERROR:                                                              ^^^^^^^^^^^

> i3-msg -t command -- exec "urxvt -hold -e /usr/bin/printf .%q. '1 2' '; a' 'b ;' 'c ; d'"
ERROR: Your command: exec urxvt -hold -e /usr/bin/printf .%q. '1 2' '; a' 'b ;' 'c ; d'
ERROR:                                                                 ^^^^^^^^^^^^^^^^

> i3-msg -t command -- "exec \"urxvt -hold -e /usr/bin/printf .%q. '1 2' '; a' 'b ;' 'c ; d'\""
# the expected outcome

> i3-msg -t command -- "exec \"exec -- urxvt -hold -e /usr/bin/printf .%q. '1 2' '; a' 'b ;' 'c ; d'\""
# the expected outcome

basically there are two "layers of interpreters";
first there is the i3 command parser, which fails in the first two examples;
then there is the sh -c command parser, which handles the last two examples;
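Two layers of parsers mean two layers of quoting, applied innermost first. The sketch below rebuilds the last i3-msg invocation above from a plain argument list: first shlex.quote plus exec -- for the sh -c layer, then a double-quoted string for the i3 command parser. The i3 rule used here (wrap in double quotes, backslash-escape embedded double quotes) is an assumption inferred from the article's own working example, not taken from the i3 documentation, so the sketch refuses payloads containing backslashes; sh_layer and i3_exec_payload are hypothetical helpers.

```python
import shlex
from typing import List, Sequence


def sh_layer(argv: Sequence[str]) -> str:
    # Inner layer: the string that `sh -c` will parse.
    return "exec -- " + " ".join(shlex.quote(word) for word in argv)


def i3_exec_payload(argv: Sequence[str]) -> str:
    # Outer layer: the string the i3 command parser will parse (assumed rules).
    inner = sh_layer(argv)
    if "\\" in inner:
        raise ValueError("backslashes: i3 escaping rules not modeled here")
    return 'exec "' + inner.replace('"', '\\"') + '"'


argv = ["urxvt", "-hold", "-e", "/usr/bin/printf", ".%q.", "1 2", "; a", "b ;", "c ; d"]
i3_msg_argv: List[str] = ["i3-msg", "-t", "command", "--", i3_exec_payload(argv)]
print(i3_msg_argv[-1])
```

Because the payload is passed to i3-msg as a single argv entry, the shell that would otherwise run i3-msg adds no third layer here.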

Lessons learned

First of all, we should stop using any tool or library that uses system(3), sh -c, or an equivalent.

If we can't replace those tools or libraries, then we should be careful and:

always properly quote and escape the executable and arguments;
always prefix our commands with exec --;
always try to provide a full or relative path to the executable (for example, resolved beforehand from the $PATH environment variable);
always try to "fuzz" our scripts and tools to see if we've covered all the corner cases;
always ask ourselves if there aren't multiple layers of parsers (like in the case of i3, tmux, and other tools);
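As for fuzzing, even a tiny randomized round-trip test catches most quoting mistakes. The sketch below (hypothetical helpers, assuming a POSIX /bin/sh) generates words full of shell metacharacters, quotes them with shlex.quote, lets a real /bin/sh parse the result and print one argument per line, and checks that the words come back unchanged. It runs inside a temporary directory so that a quoting bug cannot, for example, create files in the current directory through a stray >.

```python
import random
import shlex
import string
import subprocess
import tempfile
from typing import List, Optional

# Letters plus characters the shell treats specially (no NUL, newline or CR).
ALPHABET = string.ascii_letters + " \t'\"\\$`;&|*?()[]{}<>#~!=%-"


def random_word(rng: random.Random, max_len: int = 8) -> str:
    return "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, max_len)))


def words_seen_by_sh(words: List[str], cwd: str) -> List[str]:
    # Quote each word, let /bin/sh parse the line, print one argument per line.
    line = "printf '%s\\n' " + " ".join(shlex.quote(w) for w in words)
    out = subprocess.run(["/bin/sh", "-c", line], cwd=cwd, capture_output=True,
                         text=True, check=True, timeout=10).stdout
    return out.split("\n")[:-1]


def fuzz_quoting(rounds: int = 100, seed: int = 1) -> Optional[List[str]]:
    """Return the first word list that does not round-trip, or None."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as scratch:
        for _ in range(rounds):
            words = [random_word(rng) for _ in range(rng.randint(1, 4))]
            try:
                seen = words_seen_by_sh(words, scratch)
            except subprocess.CalledProcessError:
                return words  # the shell rejected the line outright
            if seen != words:
                return words
    return None


print(fuzz_quoting())
```

Swapping shlex.quote for a naive join is the quickest way to watch such a harness fail, but do that only inside a throwaway directory.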

Second, we should create a "wall of shame" for those important tools that provide no alternative. Among these, at the top of the list in large-bold-red font I would definitely include OpenSSH, which, given its high profile, is perhaps responsible for countless hidden bugs that have their root cause in the way commands are handled.

In the end I acknowledge that the "technical" fault is not with these tools but with their users. Still, their developers should know better, and try to make them as safe as possible...

Wall of shame

OpenSSH -- it takes any number of "command arguments", just joins them with spaces, and shoves the outcome in sh -c on the remote node;
Ruby's backtick feature -- provides easy access to system(3);
Ruby's system function -- provides easy access to system(3) when one provides a single string as "command"; (what if I have no arguments? how can I disable the "shell" behavior?)
Python's os.system and os.popen functions -- simple wrappers for system(3);
Node's child_process.exec function -- simple wrapper for system(3) which unfortunately has a misleading name, tricking one into thinking it is an execve(2) replacement;
any other programming language module or function that exposes system(3) without safety measures;
to-be-continued;

Wall of fame

Go's os/exec package -- which clearly states at the beginning:

Unlike the "system" library call from C and other languages, the os/exec package intentionally does not invoke the system shell and does not expand any glob patterns or handle other expansions, pipelines, or redirections typically done by shells. The package behaves more like C's "exec" family of functions.

Rust's std::process module -- which, like Go's, doesn't even provide the facility of calling system(3);
Python's subprocess module -- which provides safe defaults:

args is required for all calls and should be a string, or a sequence of program arguments. Providing a sequence of arguments is generally preferred, as it allows the module to take care of any required escaping and quoting of arguments (e.g. to permit spaces in file names). If passing a single string, either shell must be True (see below) or else the string must simply name the program to be executed without specifying any arguments.

to-be-continued;
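The Python subprocess rule quoted above is easy to check. In the sketch below the list form passes the argument through untouched; a single string without shell=True is treated as the name of one program (so "echo hi" is looked up as a program literally named echo hi and, on a typical system, not found); and shell=True brings back exactly the sh -c behavior this article complains about. It assumes a Unix-like system with echo on $PATH.

```python
import subprocess

payload = "a; echo injected"

# 1. Sequence of arguments: no shell, the payload is a single argv entry.
listed = subprocess.run(["echo", payload], capture_output=True, text=True).stdout

# 2. Single string, no shell: the whole string is taken as the program name.
try:
    subprocess.run("echo hi", capture_output=True)
    string_form = "ran"
except FileNotFoundError:
    string_form = "no program named 'echo hi'"

# 3. shell=True: the string goes to /bin/sh -c, with all that implies.
shelled = subprocess.run("echo " + payload, shell=True,
                         capture_output=True, text=True).stdout

print(repr(listed), string_form, repr(shelled), sep="\n")
```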

The 1000 bonus points bug

Prologue for the 1000 bonus points bug

So I don't know how to say this... But I think I'm going mad... Although I'm a careful bash "programmer" and I've used bash -c 'some-script' countless times, I would never have guessed the following situation...

Let's look more closely once more at the Linux system(3) man page:

execl("/bin/sh", "sh", "-c", command, (char *) NULL);

Then let's look more closely once more at the bash(1) man page:

-c -- If the -c option is present, then commands are read from the first non-option argument command_string. [...]

Now let's suppose one has a tool named -x.

Why is it called -x? Because it is not a forbidden file name by the file-system, and for that matter by any existing standards. Moreover attackers are not "nice people" and if they can, most likely they will create such an executable...

Now let's suppose we want to call that tool from Python without any arguments:

import os
os.system("-x")

Also let's suppose we want to call that tool with two arguments:

import os
os.system("-x a b")

Surprise for the 1000 bonus points bug

Care to guess the outcome of those previous two snippets?

No, let's first strace that:

> strace -e execve -f -- python2 -c 'import os; os.system("-x")'
[...]
execve("/bin/sh", ["sh", "-c", "-x"], ...) = 0
[...]

> strace -e execve -f -- python2 -c 'import os; os.system("-x a b")'
[...]
execve("/bin/sh", ["sh", "-c", "-x a b"], ...) = 0
[...]

Now can you guess the outcome?

No, let's just try it out (I've put both the Python and the plain sh -c invocations):

> python2 -c 'import os; os.system("-x")'
> sh -c -x
sh: -c: option requires an argument

> python2 -c 'import os; os.system("-x a b")'
> sh -c '-x a b'
sh: - : invalid option
[...]

(!!! Lots of censored swear words... !!!)

Reasons for the 1000 bonus points bug

What (more censored swear words) happened?

Well, -c doesn't take the immediately following argument as the command string, but instead the first non-option argument. In this case there is no non-option argument, because our -x or -x a b looks like an option argument...

What would be the proper solution?

Well the system(3) implementation should actually be:

execl("/bin/sh", "sh", "-c", "--", command, (char *) NULL);
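The difference between the two argument vectors can be checked without creating any executable named -x: with plain -c the string is parsed as more options and the shell fails before running anything, while with -c -- it is parsed as a command, so the shell goes looking for a program named -x and, unless one exists, reports it as not found with status 127. A sketch, assuming a /bin/sh such as bash or dash that accepts -- after -c:

```python
import subprocess

command = "-x a b"

# What system(3) does today: ["sh", "-c", command]
as_specified = subprocess.run(["/bin/sh", "-c", command],
                              capture_output=True, text=True)

# What the article proposes: ["sh", "-c", "--", command]
as_proposed = subprocess.run(["/bin/sh", "-c", "--", command],
                             capture_output=True, text=True)

print(as_specified.returncode, as_specified.stderr.strip())
print(as_proposed.returncode, as_proposed.stderr.strip())
```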

Epilogue for the 1000 bonus points bug

What have I done about the problem?

As a good open-source citizen, I've opened the following bug reports (one led to another):

at glibc, about the system(3) implementation -- bug report #27143;
at Linux man pages, about a warning to the system(3) manual -- bug report #211029;
at POSIX "Austin Group", about "fixing" the specification -- bug report #1440 -- because it seems that the system(3p) specification explicitly prescribes how sh(1) must be called;

So in the end, let me quote one of the glibc developers (the emphasis is mine):

Unfortunately, this bug is required by POSIX, which requires passing the string as an argument to the -c option of the shell.

I'm tired... I'm going to sleep... I need a saner job...