Parallel processing in Bash

The command line has been the primary interface for Unixes for over 40 years. Its design is reminiscent of its terminal roots, and it has remained much the same in form and function for decades.

The terminal, or shell (as we will call it), is batch oriented, i.e. it runs one command at a time. But that’s just one part of it. The shell also supports parallel processing: the ability to run multiple commands asynchronously from a single script. Parallel processing is an old problem and is supported in various ways. In this article I will explore some solutions in Bash, the ubiquitous shell on Unixes, and talk about the pros and cons of each.

In the beginning

Bash has supported job control for decades. The concept is simple: any command followed by an ‘&’ is pushed to the ‘background’. You can see the list of background jobs in your current session using jobs -l.
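For example, after backgrounding a couple of sleep commands, jobs -l will print something like this (job numbers and PIDs will vary on your machine):

» sleep 100 &
[1] 3401
» sleep 200 &
[2] 3402
» jobs -l
[1]  - 3401 running    sleep 100
[2]  + 3402 running    sleep 200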

Technically speaking, a background job is one that is not connected to the controlling terminal. When a job is pushed into the background, it disconnects from the controlling terminal and continues processing until it is complete (its STDOUT still points to the controlling terminal, so it can print at any time, pasting text even while you’re typing in the terminal!). Inside a bash script, when a job is pushed into the background, the script continues running from the next line.
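If the stray output bothers you, redirect the job’s STDOUT and STDERR to a file when you background it. A minimal sketch (job.log is just a placeholder name):

» ./my_command > job.log 2>&1 &
[1] 3488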

Background jobs can’t take input from the terminal, so if they try to, the kernel suspends them (technically speaking, Linux will put to sleep any process that is waiting on a resource, until that resource becomes available). This limits their usefulness to commands that run to completion with their input parameters alone. It turns out this is not a bottleneck in practice: the majority of shell commands on a typical Unix work this way.

Here’s how to run a command in the background.

> ./my_command &
[1] 3487

This command will run and exit so long as no input is required. Any output will be dumped onto the current shell. The shell displays the “job number” in brackets ‘[]’, followed by the PID. The job number is how the shell keeps track of background jobs; it is not related to the process ID.
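You can refer to a job by its number using the % prefix. For example, kill %1 terminates job number 1, whatever its PID happens to be:

» kill %1
[1] + 3487 terminated  ./my_command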

If you need to run a set of commands in the background, you can background an entire subshell. Just put your commands in parentheses, like so :

» ( sleep 5; echo "I'm awake!"; ) &

After 5 seconds, you will see the following output in your shell.

» ( sleep 5; echo "I'm awake!"; ) &
[1] 7270
» I'm awake!
[1] + 7270 done ( sleep 5; echo "I'm awake!"; )

What happens if you try to read from STDIN? Say :

( read name; echo "Hi $name"; ) &

You will see that the job is suspended.

» ( read name; echo "Hi $name"; ) &
[1] 7521
[1] + 7521 suspended (tty input) ( read name; echo "Hi $name"; )

To continue the command you need to bring it to the “foreground”, using the fg command followed by the job number. Here, I fg the command and provide input :

» fg %1
[1] + 7521 continued ( read name; echo "Hi $name"; )
Adil
Hi Adil

Wait for it ..

Jobs are great if you want to set-it-and-forget-it. But what if you want to start a number of background jobs, wait for them to finish, and then continue processing? Enter wait, a shell built-in that blocks when called and continues only when the jobs are finished. Without any arguments, it waits for all background jobs to finish before continuing.

Let’s try this with an example. The following script starts 3 background jobs, then prints “All done” only when all 3 are complete.

bash -c '
 ( sleep 5; echo "slept for 5 secs" ) &
 ( sleep 2; echo "slept for 2 secs" ) &
 ( sleep 3; echo "slept for 3 secs" ) &
wait
echo "All done"'

slept for 2 secs
slept for 3 secs
slept for 5 secs
All done

We can enable ‘--verbose’ mode in bash (the -v flag below) to see exactly how the script behaves. From the manual, verbose mode :

Print shell input lines as they’re read.

Cool! Note how the script stops executing at wait, then picks up as soon as all the background jobs are done.

» bash -c -v '
 ( sleep 5; echo "slept for 5 secs" ) &
 ( sleep 2; echo "slept for 2 secs" ) &
 ( sleep 3; echo "slept for 3 secs" ) &
 wait
 echo "All done"'
 
 ( sleep 5; echo "slept for 5 secs" ) &
 ( sleep 2; echo "slept for 2 secs" ) &
 ( sleep 3; echo "slept for 3 secs" ) &
 wait
 slept for 2 secs
 slept for 3 secs
 slept for 5 secs
 echo "All done"
 All done

What about errors?

Some of your jobs may fail. How do you know which ones? Unfortunately, wait without arguments does not tell you which jobs have failed. And if you write your Bash scripts with set -e, failed background jobs WILL NOT terminate the script: wait without arguments always returns 0, so it will never fail!
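You can see this for yourself. In the sketch below, the background job fails, yet set -e never triggers, because wait itself returns 0:

» bash -c '
set -e
( sleep 1; exit 1 ) &
wait
echo "Exit status of wait = $?"'

Exit status of wait = 0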

That’s where wait with arguments comes in. If you know the job spec (e.g. %2) or the PID, calling wait with it will return the exit code of that command. You can leverage this property in your bash script.

» bash -c '
( sleep 2; exit 0 ) & 
( sleep 3; exit 1 ) &
wait %2
echo Exit status of subshell 2 = $?'

Exit status of subshell 2 = 1
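As an aside, wait also accepts PIDs. Bash stores the PID of the most recently backgrounded job in the special variable $!, so you can collect PIDs as you launch jobs and wait on each one individually:

» bash -c '
( sleep 2; exit 0 ) & PID1=$!
( sleep 3; exit 1 ) & PID2=$!
wait $PID1; echo "Exit status of job 1 = $?"
wait $PID2; echo "Exit status of job 2 = $?"'

Exit status of job 1 = 0
Exit status of job 2 = 1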

A wait queue?

wait and background jobs can be combined to make a parallel work queue. Let’s say you want to run a command several (hundred, thousand, million!) times, each time with a different parameter. But you don’t want to overload the system by backgrounding them all at once. What can you do? Here’s a trick : queue them in batches with wait.

The following script demonstrates this. It starts one million jobs in the background, but only 4 at a time. Every time a batch finishes, a new set of jobs is started. Even though there are a lot of jobs, your system will stay nice and stable.

for i in {1..1000000}
do
 let r="$RANDOM % 10 + 1"; # sleep between 1 and 10 seconds
 sleep $r &

 let div="$i % 4";
 if [[ $div -eq 0 ]]; then # This ensures I'm only processing 4 jobs at a time.
     echo "Waiting .. "; 
     wait; 
     echo "Done"; 
 fi
done

The output will look something like this :

[2] 9495
[3] 9496
[4] 9497
[5] 9498
Waiting .. 
[3]    9496 done       sleep $r
[4]  - 9497 done       sleep $r
[5]  + 9498 done       sleep $r
[2]  - 9495 done       sleep $r
Done
[1] 9499
[2] 9500
[3] 9501
[4] 9502
Waiting .. 
[3]  - 9501 done       sleep $r
[4]  + 9502 done       sleep $r
[1]    9499 done       sleep $r
[2]  - 9500 done       sleep $r
Done
[1] 9503
[2] 9504
[3] 9505
[4] 9506
Waiting ..

Be sure to kill the script with Ctrl + C!

What’s the problem?

The above looks like a simple, no-frills way to handle a lot of jobs on your multicore machine. And it’s right there in bash! So what are the downsides? These :

  1. It’s not a “true” parallel work queue. wait pauses until ALL background jobs are complete before starting the next round. This means that even if just one of the jobs hangs or takes too long, your whole queue is stalled.
  2. It doesn’t indicate if any of the jobs have errored out. This may be important for tracking the rate of success. I’ll admit, this is a minor limitation; it is certainly possible to code around it in a few lines. Get the PIDs of the running jobs using jobs -p, pass each one to wait, capture wait’s exit code, and increment your FAIL counter. It works (see the sketch below), but it is clunky.
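Here’s a minimal sketch of that clunky approach. jobs -p prints the PIDs of the current jobs, and waiting on each PID individually surfaces its exit code:

FAIL=0
for i in {1..4}; do
    ( sleep 1; exit $((i % 2)) ) &  # jobs 1 and 3 fail, jobs 2 and 4 succeed
done
for PID in $(jobs -p); do
    wait $PID || let FAIL+=1        # wait returns the exit code of that PID
done
echo "$FAIL job(s) failed"          # prints: 2 job(s) failed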

There’s a better way.

xargs

Most Unixes come with the xargs utility. It’s a great tool if you want to run thousands of similar commands in parallel, using all the cores of your server without breaking out of bash. And if you want to know whether any of your commands failed, xargs will tell you.

Let’s rewrite the work queue above with xargs. I’m going to run 20 commands in a queue, 8 at a time. They’re all identical, except that each command sleeps randomly between 1 and 5 seconds and errors out on exit. Here’s the script :

IDS=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20)
for ID in "${IDS[@]}"; do echo "$ID"; done | xargs -I{} --max-procs 8 bash -c '
echo "Processing ID {} .."; sleep $((RANDOM % 5 + 1)); echo "Done {}"; exit 1;'
echo "Exit code for xargs = $?"

Let’s break the script down. The first two lines pipe IDs to xargs

IDS=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20)
for ID in "${IDS[@]}"; do echo "$ID"; done |

Then xargs takes this input and runs 8 commands at a time, controlled by the --max-procs option (8 was chosen to match the 8 cores on my machine). -I{} replaces ‘{}’ with one of the IDs passed via STDIN. Finally, bash -c '...' is the actual script, run once for every input sent to xargs. An input is simply a string followed by a newline.
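As an aside, the ID array is only there to feed xargs one ID per line. If your xargs supports the short -P flag (the GNU and BSD versions do), the same queue can be written more compactly with seq:

seq 1 20 | xargs -I{} -P 8 bash -c '
echo "Processing ID {} .."; sleep $((RANDOM % 5 + 1)); echo "Done {}"; exit 1;'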

Now, let’s look at the output :

Processing ID 2 ..
Processing ID 3 ..
Processing ID 1 ..
Processing ID 4 ..
Processing ID 7 ..
Processing ID 5 ..
Processing ID 8 ..
Processing ID 6 ..
Done 3
Done 2
Processing ID 9 ..
Processing ID 10 ..
Done 4
Done 7
Done 1
Processing ID 12 ..
Processing ID 11 ..
Processing ID 13 ..
Done 8
Done 5
Done 6
Processing ID 15 ..
Processing ID 14 ..
Processing ID 16 ..
Done 9
Done 10
Processing ID 17 ..
Processing ID 18 ..
Done 12
Done 13
Done 11
Processing ID 19 ..
Processing ID 20 ..
Done 15
Done 14
Done 16
Done 17
Done 18
Done 19
Done 20
Exit code for xargs = 123

Notice that the starts and ends of each batch are uneven. Each time one job finishes, xargs automatically picks the next ID while the other commands are still running. It doesn’t stall on slower processes. Also notice the exit code, 123. From the xargs man page :

xargs exits with the following status:
0 if it succeeds
123 if any invocation of the command exited with status 1-125

Be sure to read the man pages for some nifty examples.

Conclusion

There you have it: 3 ways to do parallel processing in bash. While this article has been about Bash, background jobs and wait have been a part of most popular Unix shells, including tcsh and zsh. The xargs utility is also ubiquitous on modern Unixes, including Linux and Mac OS X. Try them out on your machine.

The next time you need to run a loop of commands, consider parallelizing them using the above techniques. They’re simple and robust, and will make much better use of your multi-core machines. If you want to go further, try GNU Parallel. From their site :

GNU parallel is a shell tool for executing jobs in parallel using one or more computers.

Parallel is designed to mimic xargs’ arguments, so it should feel right at home once you’ve gotten the hang of xargs. As a plus, parallel can split jobs across machines, so it’s definitely more useful in a clustered environment. One downside: it doesn’t come preinstalled on most Linux distributions (even as of Ubuntu 14.04).
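For example, here’s a sketch of the xargs queue from earlier, rewritten for parallel (assuming GNU parallel is installed):

seq 1 20 | parallel -j 8 'echo "Processing ID {} .."; sleep $((RANDOM % 5 + 1)); echo "Done {}"'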

Leave your comments!
