Using SLURM
The HPC uses a workload manager called SLURM to control the primary workflow.
SLURM (Simple Linux Utility for Resource Management) is a free and open-source job scheduler for Linux. It is used by the HPC and many of the world's supercomputers and clusters. It provides three key functions:
- First, it allocates exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work.
- Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes.
- Third, it arbitrates contention for resources by managing a queue of pending jobs.
SLURM takes your batch job submission and executes it across the compute nodes of the HPC. How it is processed depends on a number of factors, including the queue it is submitted to and the jobs already waiting in that queue.
SLURM
The command for submitting jobs is 'sbatch'.
This command can take many options to control how a job runs before it is submitted. Typing 'man sbatch' at the terminal will display them. These options can be given on the command line, for example:
sbatch --ntasks 28 myjob.sh
For ease and repeatability it is much simpler to build these options into the batch script itself (e.g. mybatchjob.sh) using an editor such as vi or emacs.
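As a sketch, a minimal batch script might look like the following. The job name, time limit, and output filename are illustrative assumptions, not site defaults; substitute values appropriate to your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=myjob        # job name shown by squeue (example value)
#SBATCH --ntasks=28             # number of tasks, matching the command-line example above
#SBATCH --time=01:00:00         # wall-clock time limit (assumed value)
#SBATCH --output=myjob_%j.out   # output file; %j expands to the job ID

# Launch the program across the allocated tasks
srun ./myprogram
```

Submit it with 'sbatch mybatchjob.sh'. Options given on the command line override the '#SBATCH' directives in the script.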
Job Commands
| Command | Description |
|---|---|
| srun | Run a parallel job on a cluster managed by SLURM. If necessary, srun will first create a resource allocation in which to run the parallel job. |
| sbatch | Submits a batch script to SLURM. The batch script may be given to sbatch through a file name on the command line, or if no file name is specified, sbatch will read in a script from standard input. |
| squeue | Used to view job and job step information for jobs managed by SLURM. |
| scancel | Used to signal or cancel jobs, job arrays or job steps. |
| scontrol | Used to view or modify SLURM configuration, including: job, job step, node, partition, reservation, and overall system configuration. Most of the commands can only be executed by user root. |
| salloc | Used to obtain a SLURM job allocation, which is a set of resources (nodes), possibly with some set of constraints (e.g. number of processors per node). When salloc successfully obtains the requested allocation, it runs the command specified by the user. Finally, when the user-specified command is complete, salloc relinquishes the job allocation. |
| sacct | Displays accounting information for jobs invoked with SLURM, which is either logged in the job accounting log file or saved to the SLURM database. |
| sinfo | Used to view partition and node information for a system running SLURM. |
| sattach | Attaches to a running SLURM job step. By attaching, it makes available the I/O streams of all of the tasks of a running SLURM job step. It is also suitable for use with a parallel debugger such as TotalView. |
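The commands above fit together in a typical workflow. The following sketch shows the common sequence; the script name, username, and job ID (12345) are illustrative assumptions.

```shell
# Submit a batch script; sbatch prints the ID of the new job
sbatch mybatchjob.sh

# View your jobs in the queue (replace 'myuser' with your username)
squeue -u myuser

# Show detailed information about a specific job (12345 is an example job ID)
scontrol show job 12345

# Cancel the job if it is no longer needed
scancel 12345

# After the job finishes, review its accounting record
sacct -j 12345
```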
At the terminal you can also type 'man <command>', e.g. man sbatch.
Additional documentation of SLURM can be found at http://www.ceci-hpc.be/slurm_tutorial.html and https://slurm.schedmd.com/quickstart.html
squeue