Legion is a shared facility, so work needs to be scheduled via a batch system.
Jobs are queued and prioritised based on the resources they request.
On the login nodes resources are shared, so if someone runs something resource-intensive it slows things down for everyone else. The scheduler gives you exclusive access to the resources you request and manages which jobs run when and where.
In this example we want to run the program calculate_pi, which calculates π by numerically integrating a curve:
\[\int_{0}^{1}\frac{4}{1+x^2}\text{d}x=\pi\]
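The exact source of calculate_pi is not shown here, but the same midpoint-rule integration can be sketched in a single line of awk (purely illustrative, not the program's actual code):

awk 'BEGIN { n = 100000; for (i = 0; i < n; i++) { x = (i + 0.5) / n; s += 4 / (1 + x * x) }; print s / n }'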
We will first make a copy of ‘calculate_pi’, build it, and run it:
cd ~/Scratch
cp -r /shared/ucl/apps/examples/calculate_pi_dir ./
cd calculate_pi_dir
make
./calculate_pi
For Legion to know what we want it to do, we need to create a job script.
Job scripts start with #!/bin/bash -l; the -l switch makes the shell behave as a login shell, giving you the same user environment as when you log in. More importantly, if it is omitted the shell will not recognise modules.
#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -cwd
./calculate_pi
This lets the system know the maximum runtime we want (h_rt, here 10 minutes), where we want the job to run (-cwd, the current working directory) and the file to execute.
For job scripts the following defaults are applied, unless explicitly stated otherwise:
#$ -l h_rt=0:15:00
#$ -l memory=1M
#$ -l tmpfs=10G
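If you need more than these defaults, request the resources explicitly with the corresponding #$ -l lines. A minimal sketch (the particular values here are only illustrative):

#!/bin/bash -l
#$ -l h_rt=2:00:00
#$ -l memory=2G
#$ -l tmpfs=20G
#$ -cwd
./calculate_pi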
In order for our job to be scheduled and ultimately run, we need to submit it to the job queue.
There are a number of queue commands to help us:
| Command | Description |
|---|---|
| qsub | submit a job |
| qstat | view queue status and job info |
| qdel | stop and delete a job |
| qrsh | start an interactive session |
| qexplain | show the full error for the specified job if it is in Eqw status |
| joblist | display your job lists for the previous 24 hours |
| nodesforjob | show information about the nodes allocated to a specified job |
To submit our job we use the ‘qsub’ command:
$ qsub submit.sh
Your job 3521045 ("submit.sh") has been submitted
$ qsub -terse submit.sh
3521045
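The -terse form is useful when submitting from another script, because the bare job ID can be captured into a variable (a small sketch using the submit.sh from above):

JOBID=$(qsub -terse submit.sh)
echo "Submitted job $JOBID"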
The special #$ comments in the job script are options for qsub; check man qsub for the full list. Every cluster is a little different.
We can also view the queue and the status of our submitted job(s):
$ qstat
job-ID prior name user state submit/start at
-----------------------------------------------------------------
3521045 0.00000 submit.sh ccaaxxx qw 01/14/2014 14:51:54
Job states:

| Letter | Status |
|---|---|
| q | queued |
| w | waiting |
| r | running |
| E | error |
| t | transferring |
| h | held |
More detail can be obtained by using the -j option:
qstat -j 3521045
The most common problem is running the job from a directory that is not in ~/Scratch (and thus not writable). Note that qstat -j cuts off the end of the error message - try e.g. qexplain 53893 to see the full error message.
#### Removing Jobs
Once jobs have been submitted, we can remove them from the queue:
$ qdel 3521045
ccaaxxx has deleted job 3521045
This command removes the job from the queue if it is still waiting, or stops and deletes it if it is already running.
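Grid Engine's qdel can also take a user rather than a job ID, which is handy for clearing out everything you have queued (see man qdel on your cluster to confirm the exact options available):

qdel -u $USER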
### Multi-threaded Jobs
OpenMP:
In our Scratch directory we are going to make a copy of the /shared/ucl/apps/examples/openmp_pi_dir directory. Then build the program using make, and try running it:
cd ~/Scratch
cp -r /shared/ucl/apps/examples/openmp_pi_dir ./
cd openmp_pi_dir
make
./openmp_pi
We need to let the system know how many cores and how many threads are required for our job:
#$ -pe smp 4
This tells the scheduler to find 4 cores and allocate them to your job.
Set OMP_NUM_THREADS=4 to tell OpenMP to use only those 4 cores rather than every core it can see.
Our job script will now look like:
#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -pe smp 4
#$ -cwd
export OMP_NUM_THREADS=4
./openmp_pi
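If you later change the -pe smp request, the thread count needs to change with it. Grid Engine exports the number of allocated slots as $NSLOTS, so a common pattern (a sketch, not required on every system) is:

export OMP_NUM_THREADS=$NSLOTS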
As before, we will make a copy of the MPI version of the pi calculation program, build it, and try running it:
cd ~/Scratch
cp -r /shared/ucl/apps/examples/mpi_pi_dir ./
cd mpi_pi_dir
make
./mpi_pi
# This won't always work on clusters
We now need to request multiple nodes:
#$ -pe mpi 36
This makes space for a multi-node job and creates the environment variables and machines file that the MPI launcher needs.
Note that each requested core gets the amount of memory requested, so for example -pe mpi 36 with -l memory=2G reserves 72G across the whole job.
The job script will now be:
#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -pe mpi 4
#$ -cwd
gerun ./mpi_pi
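gerun is UCL's wrapper that works out the right mpirun invocation for the allocated nodes. On systems without such a wrapper, something along these lines is often used instead (a sketch; whether plain mpirun picks up the node allocation automatically depends on how your MPI library integrates with the scheduler):

mpirun -np $NSLOTS ./mpi_pi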
To run many similar tasks as a single submission, we can use a job array, requested with the -t option:
#$ -t 3 <- (only runs one task, task 3)
#$ -t 1-3 <- (runs tasks 1, 2 and 3)
#$ -t 1-7:2 <- (runs tasks 1, 3, 5 and 7)
This queues an array of jobs which only differ in how the $SGE_TASK_ID
variable is set.
Once again our job script needs to be different:
#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -t 1-4
#$ -cwd
./calculate_pi ${SGE_TASK_ID}0
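A common pattern is to use $SGE_TASK_ID to select a different input for each task, for example by reading one line per task from a parameter file (a sketch; params.txt is a hypothetical file with one set of arguments per line):

PARAMS=$(sed -n "${SGE_TASK_ID}p" params.txt)
./calculate_pi $PARAMS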
The Lustre parallel filesystem performs badly when creating and writing to lots of small files, and array jobs often create files like this.
To help performance, run this type of job using the local storage on the node, and copy the files back when the job is complete.
Local Storage: $TMPDIR
This requires the following amendments to the job script:
#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -t 1-40000
#$ -cwd
cd $TMPDIR
$HOME/my_programs/make_lots_of_files \
--some-option=$SGE_TASK_ID
Then either:
cp * $SGE_O_WORKDIR
or
cp -r $TMPDIR $SGE_O_WORKDIR
Or, better for lots of files:
cd $SGE_O_WORKDIR
tar -czf $JOB_ID.$SGE_TASK_ID.tar.gz $TMPDIR
zip -r $JOB_ID.$SGE_TASK_ID.zip $TMPDIR
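Putting this together, the copy-back step belongs at the end of the job script itself, once the program has finished. A sketch of the tar variant, reusing the hypothetical make_lots_of_files program from above:

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -t 1-40000
#$ -cwd
cd $TMPDIR
$HOME/my_programs/make_lots_of_files --some-option=$SGE_TASK_ID
tar -czf $SGE_O_WORKDIR/$JOB_ID.$SGE_TASK_ID.tar.gz .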
We can specify the modules we want to use in our job scripts:
#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -cwd
module unload compilers mpi
module load r/recommended
R --no-save --slave <<EOF >r.output.$JOB_ID
runif(50,0,1)
EOF
(generates 50 uniform random numbers between 0 and 1, written to r.output.$JOB_ID)
Other systems (e.g. Archer) may use a slightly different scheduler, so the scripts can be slightly different – consult the relevant documentation. For example:
| SGE (Legion) | PBS (e.g. Archer) |
|---|---|
| #$ -pe mpi 24 | #PBS -l nodes=2:ppn=12 |
| #$ -pe smp 12 | #PBS -l nodes=1:ppn=12 |
| #$ -l h_rt=1:00:00 | #PBS -l walltime=1:00:00 |
| #$ -l memory=4G | #PBS -l mem=4gb |
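As a rough illustration, a PBS version of the earlier serial job script might look like the sketch below ($PBS_O_WORKDIR is PBS's counterpart to submitting with -cwd; check the target system's documentation for the exact resource syntax it expects):

#!/bin/bash -l
#PBS -l walltime=0:10:00
cd $PBS_O_WORKDIR
./calculate_pi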