Using resources effectively
Overview
Teaching: 15 min
Exercises: 10 min
Questions
- How do we monitor our jobs?
- How can I get my jobs scheduled more easily?
Objectives
- Understand how to look up job statistics and profile code.
- Understand job size implications.
We now know virtually everything we need to know about getting stuff on a cluster. We can log on, submit different types of jobs, use pre-installed software, and install and use software of our own. What we need to do now is use the systems effectively.
Estimating required resources using the scheduler
Although we covered requesting resources from the scheduler earlier, how do we know how much and what type of resources we will need in the first place?
Answer: we don’t. Not until we’ve tried it ourselves at least once. We’ll need to benchmark our job and experiment with it before we know how much it needs in the way of resources.
The most effective way of figuring out how many resources a job needs is to submit a test job, and then ask the scheduler how many resources it used.
A good rule of thumb is to ask the scheduler for more time and memory than you expect your job to need. This ensures that minor fluctuations in run time or memory use will not result in your job being cancelled by the scheduler. Recommendations for how much extra to ask for vary but 10% is probably the minimum, with 20-30% being more typical. Keep in mind that if you ask for too much, your job may not run even though enough resources are available, because the scheduler will be waiting to match what you asked for.
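As a rough illustration of the padding rule above, the sketch below pads a measured run time by a chosen margin and prints it in HH:MM:SS form. The helper names pad_request and to_hms, and the 480-second measurement, are made up for this example; they are not part of any scheduler.

```shell
#!/bin/bash
# Hypothetical helper: pad a measured value by a percentage, rounding up.
pad_request() {
    local measured=$1   # measured value, e.g. run time in seconds
    local percent=$2    # safety margin in percent, e.g. 25
    # Integer ceiling of measured * (100 + percent) / 100
    echo $(( (measured * (100 + percent) + 99) / 100 ))
}

# Hypothetical helper: format a number of seconds as HH:MM:SS.
to_hms() {
    local s=$1
    printf '%02d:%02d:%02d\n' $(( s / 3600 )) $(( (s % 3600) / 60 )) $(( s % 60 ))
}

# Example: a test run that took 8 minutes (480 s), padded by 25%.
padded=$(pad_request 480 25)
to_hms "$padded"
```

With these numbers the padded request comes out at 600 seconds, that is, a ten-minute time limit.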
Create a job that runs the following command in the same directory as the .fastq files:
[yourUsername@login12 ~]$ fastqc name_of_fastq_file
The fastqc command is provided by the fastqc module. You’ll need to figure out a good amount of resources to allocate for this first “test run”. You might also want to have the scheduler email you to tell you when the job is done.
Hint: The job only needs 1 CPU and not too much memory or time. The trick is figuring out just how much you’ll need!
First, write the SGE script to run fastqc on the file supplied at the command-line.
[yourUsername@login12 ~]$ cat fastqc-job.sh
#!/bin/bash -l
#$ -l h_rt=00:10:00
fastqc "$1"
Now, create and run a script to launch a job for each .fastq file.
[yourUsername@login12 ~]$ cat fastqc-launcher.sh
for f in *.fastq
do
    qsub fastqc-job.sh "$f"
done
[yourUsername@login12 ~]$ chmod +x fastqc-launcher.sh
[yourUsername@login12 ~]$ ./fastqc-launcher.sh
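Before submitting many jobs at once, it can help to preview the commands that would be run. The following is a minimal sketch of the launcher loop with a preview mode. The function name launch_fastqc_jobs and the DRY_RUN variable are hypothetical names chosen for this example and are not part of SGE. The sketch also skips the case where no .fastq files match the pattern.

```shell
#!/bin/bash
# Print (preview) or submit one job per .fastq file in the given directory.
# DRY_RUN is a hypothetical switch: 1 (the default here) only prints the commands.
launch_fastqc_jobs() {
    local dir=${1:-.}
    local f
    for f in "$dir"/*.fastq; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "qsub fastqc-job.sh $f"
        else
            qsub fastqc-job.sh "$f"
        fi
    done
}

# Preview the submissions for the current directory:
#   launch_fastqc_jobs .
# Submit for real once the preview looks right:
#   DRY_RUN=0 launch_fastqc_jobs .
```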
Once the job completes (note that it takes much less time than expected), we can query the scheduler to see how long our job took and what resources were used. We will use jobhist to get statistics about our job.
[yourUsername@login12 ~]$ jobhist
FSTIME              | FETIME              | HOSTNAME      | OWNER    | JOB NUMBER | TASK NUMBER | EXIT STATUS | JOB NAME
--------------------+---------------------+---------------+----------+------------+-------------+-------------+-----------
2020-07-02 15:37:56 | 2020-07-02 15:37:58 | node-f00a-001 | YourUser | 1965       | 0           | 0           | Serial_Job
This shows all the jobs we ran recently (note that there are multiple entries per job). To get info about a specific job, we change the command slightly.
[yourUsername@login12 ~]$ jobhist -j 1965
It will show a lot of info: in fact, every single piece of info collected on your job by the scheduler. It may be useful to pipe this information to less to make it easier to view (use the left and right arrow keys to scroll through fields).
[yourUsername@login12 ~]$ jobhist -j 1965 | less
Some interesting fields include the following:
- Hostname: Where did your job run?
- MaxRSS: What was the maximum amount of memory used?
- Elapsed: How long did the job take?
- State: What is the job currently doing/what happened to it?
- MaxDiskRead: Amount of data read from disk.
- MaxDiskWrite: Amount of data written to disk.
Measuring the statistics of currently running tasks
Connecting to Nodes
Typically, clusters allow users to connect directly to compute nodes from the head node. This is useful to check on a running job and see how it’s doing, but is not a recommended practice in general, because it bypasses the resource manager.
If you need to do this, check where a job is running with qstat, then run ssh nodename.
Give it a try!
[yourUsername@login12 ~]$ ssh node-d00a-001
We can also check on stuff running on the login node right now the same way (so it’s not necessary to ssh to a node for this example).
Monitor system processes with top
The most reliable way to check current system stats is with top. Some sample output might look like the following (press q or Ctrl + c to exit):
[yourUsername@login12 ~]$ top
top - 16:28:49 up 47 days,  5:33, 96 users,  load average: 53.87, 55.82, 50.47
Tasks: 1226 total,  31 running, 1181 sleeping,  10 stopped,   4 zombie
%Cpu(s): 66.8 us, 33.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 19754995+total, 13150139+free, 21139988 used, 44908560 buff/cache
KiB Swap: 21242220+total, 20060854+free, 11813660 used. 17565382+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
145836 richard   20   0 5230196   3.8g   1204 R  2446  2.0 683:15.69 bowtie2-align-s
 71877 agape     20   0   46372   4100    932 R  81.9  0.0   9:37.41 rsync
205211 logos     20   0 1072236 524576   6552 R  79.9  0.3   0:07.12 python
205224 peter     20   0 1067448 520076   6612 R  77.3  0.3   0:07.06 python
205212 paul      20   0  993228 445776   6556 R  55.3  0.2   0:06.04 python
 74051 paul      20   0   48816   2708    496 S  35.6  0.0   8:42.12 rsync
 58157 hezekia   20   0  129612   2848   1140 S   2.3  0.0 975:04.49 htop
124495 samuel    20   0  136188   3396   1152 S   2.3  0.0   1078:34 htop
 91884 lydia     20   0  933260 241984   9040 S   1.7  0.1   4:32.68 ipython
  2628 root      20   0       0      0      0 S   1.3  0.0  92:11.14 ptlrpcd_00_0
Overview of the most important fields:
PID: What is the numerical id of each process?
USER: Who started the process?
RES: What is the amount of memory currently being used by a process (in KiB by default; a suffix such as g marks a larger unit)?
%CPU: How much of a CPU is each process using? Values higher than 100 percent indicate that a process is running in parallel.
%MEM: What percent of system memory is a process using?
TIME+: How much CPU time has a process used so far? Processes using 2 CPUs accumulate time at twice the normal rate.
COMMAND: What command was used to launch a process?
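To make the %CPU and RES columns easier to read, the sketch below converts process lines in the column order shown above into an approximate core count and resident memory in MiB. It assumes the default procps layout, where RES is in KiB unless it carries an m or g suffix. The function name summarise_top_lines is made up for this example, and the two input lines are copied from the sample output.

```shell
#!/bin/bash
# Summarise top-style process lines: approximate cores in use and resident memory.
# Assumed column order: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
summarise_top_lines() {
    awk '{
        res = $6
        # A bare number is KiB; a trailing m or g marks MiB or GiB.
        # Other suffixes are not handled in this sketch.
        if (res ~ /g$/)      mib = (res + 0) * 1024
        else if (res ~ /m$/) mib = res + 0
        else                 mib = res / 1024
        # %CPU of 100 corresponds to one fully used core.
        printf "%s %s cores=%.1f res=%.0fMiB\n", $1, $12, $9 / 100, mib
    }'
}

# Two lines taken from the sample output above.
summarise_top_lines <<'EOF'
145836 richard 20 0 5230196 3.8g 1204 R 2446 2.0 683:15.69 bowtie2-align-s
205211 logos 20 0 1072236 524576 6552 R 79.9 0.3 0:07.12 python
EOF
```

For the bowtie2-align-s line, a %CPU of 2446 works out to roughly 24.5 cores, which is why values above 100 indicate parallel execution.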
htop is a curses-based alternative to top, producing a better-organized and “prettier” dashboard in your terminal. Unfortunately, it is not always available. If this is the case, politely ask your system administrators to install it for you.
Check memory load with free
Another useful tool is the free -h command. This will show the currently used/free amount of memory.
[yourUsername@login12 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           188G        109G         54G        528K         24G         78G
Swap:          202G         11G        191G
The key fields here are total, used, and available - which represent the amount of memory that the machine has in total, how much is currently being used, and how much is still available. When a computer runs out of memory it will attempt to use “swap” space on your hard drive instead. Swap space is very slow to access - a computer may appear to “freeze” if it runs out of memory and begins using swap. However, compute nodes on HPC systems usually have swap space disabled so when they run out of memory you usually get an “Out Of Memory (OOM)” error instead.
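The total and available fields can be combined into a single "how much is left" figure. The sketch below does this for output in the column layout shown above, assuming the available value is the seventh field of the Mem: row, as in recent procps versions of free. The function name mem_available_percent is hypothetical, and the sample numbers are the ones from the output above, written without unit suffixes.

```shell
#!/bin/bash
# Print available memory as a whole-number percentage of total memory.
# Reads free-style output on standard input; assumes plain numbers
# (for example from free -m), not the suffixed values printed by free -h.
mem_available_percent() {
    awk '$1 == "Mem:" { printf "%d\n", $7 * 100 / $2 }'
}

# Sample values from the output above (in GiB).
mem_available_percent <<'EOF'
              total        used        free      shared  buff/cache   available
Mem:            188         109          54           0          24          78
Swap:           202          11         191
EOF

# On a live system, something like this should work:
#   free -m | mem_available_percent
```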
To show all processes from your current session, type ps.
[yourUsername@login12 ~]$ ps
  PID TTY          TIME CMD
15113 pts/5    00:00:00 bash
15218 pts/5    00:00:00 ps
Note that this will only show processes from our current session. To show all processes you own (regardless of whether they are part of your current session or not), you can use ps ux.
[yourUsername@login12 ~]$ ps ux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
auser    67780  0.0  0.0 149140  1724 pts/81   R+   13:51   0:00 ps ux
auser    73083  0.0  0.0 142392  2136 ?        S    12:50   0:00 sshd: auser@pts/81
auser    73087  0.0  0.0 114636  3312 pts/81   Ss   12:50   0:00 -bash
This is useful for identifying which processes are doing what.
To kill all of a certain type of process, you can run killall commandName. For example,
[yourUsername@login12 ~]$ killall rsession
would kill all rsession processes created by RStudio. Note that you can only kill your own processes.
You can also kill processes by their PIDs. For example, your ssh connection to the server is listed above with PID 73083. If you wish to close that connection forcibly, you could run kill 73083.
Sometimes, killing a process does not work instantly. To kill the process in the most aggressive manner possible, use the -9 flag, i.e., kill -9 73083. It’s recommended to use kill without -9 first: this sends the process a “terminate” signal (SIGTERM), giving it the chance to clean up child processes and exit cleanly. However, if a process just isn’t responding, use -9 to terminate it instantly (SIGKILL).
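The "polite first, forceful second" approach can be sketched as a small function. This is one possible arrangement, not a standard tool: the name terminate_politely and the default five-second grace period are assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical helper: ask a process to exit, then force it if it has not gone away.
terminate_politely() {
    local pid=$1
    local grace=${2:-5}     # seconds to wait before escalating
    local waited=0
    # Send SIGTERM first; if that fails, the process is already gone or not ours.
    kill -TERM "$pid" 2>/dev/null || return 0
    # kill -0 sends no signal; it only checks whether the process still exists.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$(( waited + 1 ))
    done
    if kill -0 "$pid" 2>/dev/null; then
        # Still there after the grace period: send SIGKILL (the same as kill -9).
        kill -KILL "$pid"
        echo "forced"
    else
        echo "exited"
    fi
}

# Example usage with the PID from the text above:
#   terminate_politely 73083 10
```

Waiting between the two signals is the design choice that matters here: it gives a well-behaved process time to clean up, while still guaranteeing that a stuck one is removed.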
The smaller your job, the faster it will schedule.