## Jobs on Legion

Legion is a shared facility, so work needs to be scheduled via a batch system.

Jobs are queued and prioritised based on requested resources.

On the login nodes resources are shared, so if someone runs something resource-intensive it slows things down for everyone else. The scheduler gives you exclusive access to the resources you request, and manages which jobs run when and where.

### A Simple Serial Job

In this example we want to run the program calculate_pi:

calculate_pi - Calculates π by numerically integrating a curve.

\[\int_{0}^{1}\frac{4}{1+x^2}\text{d}x=\pi\]

Image illustrating area under curve to be calculated
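The integrand has a simple antiderivative, which is why the exact answer is π:

\[\int_{0}^{1}\frac{4}{1+x^2}\,\text{d}x=\Big[4\arctan x\Big]_{0}^{1}=4\cdot\frac{\pi}{4}-0=\pi\]

The program approximates this area numerically rather than evaluating it analytically.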

We will first make a copy of the calculate_pi example, build it with make, and run it:

cd ~/Scratch
cp -r /shared/ucl/apps/examples/calculate_pi_dir ./
cd calculate_pi_dir
make
./calculate_pi

For Legion to know what we want it to do, we need to create a job script.

Job scripts start with #!/bin/bash -l; the -l switch makes the created shell behave as a login shell, giving you the same user environment as you have when you log in. More importantly, if this is omitted the shell will not recognise modules.

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -cwd
./calculate_pi

This tells the system that we want a maximum runtime (h_rt) of 10 minutes, that the job should run in the directory we submitted it from (-cwd), and which file to execute.

For job scripts the following defaults are applied, unless explicitly stated otherwise:

#$ -l h_rt=0:15:00
#$ -l memory=1G
#$ -l tmpfs=10G
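You can override any of these defaults with your own #$ -l requests; for example (the values here are purely illustrative):

#$ -l h_rt=2:00:00
#$ -l memory=2G
#$ -l tmpfs=20G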

### The Job Queue and Submitting Jobs

In order for our job to be scheduled and ultimately run, we need to submit it to the job queue.

There are a number of queue commands to help us:

qsub         submit a job
qstat        view queue status and job information
qdel         stop and delete a job
qrsh         start an interactive session (see the example below)
qexplain     show the full error message for a job in Eqw status
joblist      display your jobs from the previous 24 hours
nodesforjob  show information about the nodes allocated to a given job
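qrsh accepts the same resource requests as qsub, so a half-hour interactive session might be requested like this (a sketch; the exact options accepted can vary between clusters):

qrsh -l h_rt=0:30:00,memory=1G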

To submit our job we use the ‘qsub’ command:

$ qsub submit.sh
Your job 3521045 ("submit.sh") has been submitted

$ qsub -terse submit.sh
3521045
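The -terse form prints only the job ID, which makes qsub convenient to use from scripts; for example:

JOB=$(qsub -terse submit.sh)   # capture just the numeric job ID
qstat -j $JOB                  # then query that job directly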

The special #$ comments in the job script are options for qsub; check man qsub for the full list. Every cluster is a little different, so the available options may vary.

We can also view the queue and the status of our submitted job(s):

$ qstat

job-ID  prior   name       user         state submit/start at    
-----------------------------------------------------------------
3521045 0.00000 submit.sh  ccaaxxx      qw    01/14/2014 14:51:54

Job states:

Letter  Status
q       queued
w       waiting
r       running
E       error
t       transferring
h       held

These letters often appear in combination; for example, Eqw means the job hit an error and is back in the queued-and-waiting state.

More detail can be obtained by using the -j option:

qstat -j 3521045

#### Errors

The most common problems show up as jobs stuck in the Eqw (error) state. Note that qstat -j cuts off the end of the error message; try e.g. qexplain 53893 to see the full error message.

#### Removing Jobs

Once jobs have been submitted, we can remove them from the queue:

$ qdel 3521045
ccaaxxx has deleted job 3521045

This command removes the job from the queue, stopping it first if it is already running.
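qdel also accepts several job IDs at once (the IDs here are illustrative):

qdel 3521045 3521046 3521047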

### Multi-threaded Jobs

Image of multi-threading

OpenMP:

Image of OpenMP threading behaviour

In our Scratch directory we are going to make a copy of the /shared/ucl/apps/examples/openmp_pi_dir directory, then build the program using make and try running it:

cd ~/Scratch
cp -r /shared/ucl/apps/examples/openmp_pi_dir ./
cd openmp_pi_dir
make
./openmp_pi

#### Requesting Threads

We need to let the system know how many cores and how many threads are required for our job:

#$ -pe smp 4

This tells the scheduler to find 4 cores and allocate them to your job.

Set OMP_NUM_THREADS=4 to tell OpenMP to use only those 4 cores rather than every core on the node.

Our job script will now look like:

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -pe smp 4
#$ -cwd
export OMP_NUM_THREADS=4   # match the core request above
./openmp_pi
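Rather than hard-coding the thread count, you can tie it to the core request: SGE sets $NSLOTS to the number of slots allocated, so a common pattern (a sketch) is:

#$ -pe smp 4
export OMP_NUM_THREADS=$NSLOTS   # thread count always matches the -pe request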

### Multi-node Jobs

#### MPI

Diagram of message passing behaviour using MPI

As before, we will make a copy of the amended pi calculation program, build it, and try running it:

cd ~/Scratch
cp -r /shared/ucl/apps/examples/mpi_pi_dir ./
cd mpi_pi_dir
make
./mpi_pi
# Running MPI programs directly like this won't always work on clusters

We now need to request multiple nodes:

#$ -pe mpi 36

This reserves space for a multi-node job, and creates the environment variables and machines file that MPI needs.

Note that memory is requested per core: each requested core gets the amount of memory specified.
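For example, a script containing both of the following lines would reserve 36 cores × 2G = 72G in total (values illustrative):

#$ -pe mpi 36
#$ -l memory=2G   # per core, so 72G across the whole job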

The job script will now be:

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -pe mpi 4
#$ -cwd

gerun ./mpi_pi   # gerun: UCL's wrapper around mpirun

### Requesting an Array Job


#$ -t 3       # runs a single task, number 3
#$ -t 1-3     # runs tasks 1, 2 and 3
#$ -t 1-7:2   # runs tasks 1, 3, 5 and 7 (step of 2)

This queues an array of jobs which differ only in how the $SGE_TASK_ID variable is set.

Once again our job script needs to be different:

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -t 1-4
#$ -cwd

./calculate_pi ${SGE_TASK_ID}0   # appends a 0: the tasks pass 10, 20, 30 and 40
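The task ID can also be used indirectly. A common pattern (a sketch, with params.txt as a hypothetical input file) is to use it to select one line of a parameter file:

# pick line number $SGE_TASK_ID from a hypothetical params.txt
PARAMS=$(sed -n "${SGE_TASK_ID}p" params.txt)
./calculate_pi $PARAMS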

### Job and File Performance

The Lustre parallel filesystem performs worst when creating and writing to lots of little files.

Arrays of jobs often create files like this.

To help performance, run this type of job using the local storage on the node, and copy the files over when the job is complete.

#### Local Storage: $TMPDIR

This requires the following amendments to the job script:

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -t 1-40000
#$ -cwd

cd $TMPDIR
$HOME/my_programs/make_lots_of_files \
  --some-option=$SGE_TASK_ID

Then either:

cp * $SGE_O_WORKDIR

or

cp -r $TMPDIR $SGE_O_WORKDIR

Or, better for lots of files:

cd $SGE_O_WORKDIR

# with tar:
tar -czf $JOB_ID.$SGE_TASK_ID.tar.gz $TMPDIR

# or with zip:
zip -r $JOB_ID.$SGE_TASK_ID.zip $TMPDIR
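Archiving $TMPDIR by its absolute path records the directory path inside the archive; a variant (a sketch) that stores just the files, with no cd required:

tar -czf $SGE_O_WORKDIR/$JOB_ID.$SGE_TASK_ID.tar.gz -C $TMPDIR .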

### Using Modules

We can specify the modules we want to use in our job scripts:

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -cwd
module unload compilers mpi
module load r/recommended
R --no-save --slave <<EOF >r.output.$JOB_ID
runif(50,0,1)
EOF

(This generates 50 uniform random numbers between 0 and 1 and writes them to r.output.$JOB_ID.)
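To find out which module names are available in the first place, the usual module commands apply:

module avail r    # list available modules whose names begin with "r"
module list       # show currently loaded modules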

### Other Schedulers

Other systems (e.g. Archer) may use a different scheduler, so job scripts can look slightly different; consult the relevant documentation. Some SGE options and their rough PBS equivalents:

SGE                     PBS
#$ -pe mpi 24           #PBS -l nodes=2:ppn=12
#$ -pe smp 12           #PBS -l nodes=1:ppn=12
#$ -l h_rt=1:00:00      #PBS -l walltime=1:00:00
#$ -l memory=4G         #PBS -l mem=4gb
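As a sketch of the translation, the serial job script from earlier might look like this under a PBS-style scheduler (details vary between sites):

#!/bin/bash -l
#PBS -l walltime=0:10:00
#PBS -l mem=1gb
# PBS jobs typically start in your home directory; there is no direct -cwd equivalent
cd $PBS_O_WORKDIR
./calculate_pi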