Raven User Guide

System Overview

The final expansion stage of the Raven HPC system was put into operation in June 2021. The system comprises 1592 compute nodes with Intel Xeon IceLake-SP processors (Platinum 8360Y), 72 cores and 256 GB RAM per node. A subset of 64 nodes is equipped with 512 GB RAM and 4 nodes with 2048 GB RAM. In addition, Raven provides 192 GPU-accelerated compute nodes, each with 4 Nvidia A100 GPUs (4 × 40 GB HBM2 memory per node and NVLink). The nodes are interconnected with a Mellanox HDR InfiniBand network (100 Gbit/s) using a pruned fat-tree topology with four non-blocking islands: 720 CPU nodes with 256 GB RAM; 660 CPU nodes with 256 GB RAM; 192 GPU nodes plus 64 CPU nodes with 512 GB RAM and 4 CPU nodes with 2 TB RAM; 144 CPU nodes with 256 GB RAM. The GPU nodes are interconnected with at least 200 Gbit/s.

In addition there are 2 login nodes and an I/O subsystem that serves 7 PB of disk storage with direct HSM access.

Summary: 1592 CPU compute nodes, 114624 CPU cores, 421 TB DDR RAM, 8.8 PFlop/s theoretical peak performance (FP64), plus 192 GPU-accelerated compute nodes with 768 GPUs, 30 TB HBM2, and 14.6 PFlop/s theoretical peak performance (FP64).

MPCDF Raven Deployment

Access

Login

For security reasons, direct login to the HPC system Raven is allowed only from within some MPG networks. Users from other locations have to log in to one of our gateway systems first:

ssh <user>@gate.mpcdf.mpg.de

Use ssh to connect to Raven:

ssh <user>@raven.mpcdf.mpg.de

You will be directed to one of the Raven login nodes (raven01i, raven02i). You have to provide your (Kerberos) password and an OTP on the Raven login nodes. SSH keys are not allowed.

Secure copy (scp) can be used to transfer data to/from raven.mpcdf.mpg.de and raven-i.mpcdf.mpg.de.
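
For example, a tar file can be copied from your local machine to your /ptmp directory on Raven as follows (the file name and the target path /ptmp/<user> are only placeholders; adapt them to your own setup):

scp mydata.tar.gz <user>@raven.mpcdf.mpg.de:/ptmp/<user>/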

The SSH key fingerprints (SHA256) of all Raven login and interactive nodes are:

MrZnFLM64Zz+rZrNRxXdoTfN8lgppZnFdWo2XpRsQts (RSA)
SRtUsiak+twYo1Ok9rd5AZATZT4Z5+9MqJHrxwss78g (ED25519)

Resource limits

The login nodes raven.mpcdf.mpg.de (raven[01-02]i) are intended only for editing, compiling and submitting your parallel programs. Running parallel programs interactively on the login nodes is NOT allowed. CPU resources are restricted to the equivalent of two physical CPU cores per user.

The login nodes raven-i.mpcdf.mpg.de (raven[03-06]i) are also primarily intended for editing, compiling and submitting your parallel programs, but the CPU resources are restricted to the equivalent of six CPU cores per user.

Jobs have to be submitted to the Slurm batch system which reserves and allocates the resources (e.g. compute nodes) required for your job. Further information on the batch system is provided below.

Interactive (debug) runs

To test or debug your code, you may run it interactively using the Slurm partition “interactive” (2 hours at most) with the command:

srun -n N_TASKS -p interactive --time=TIME_LESS_THAN_2HOURS --mem=MEMORY_LESS_THAN_32G ./EXECUTABLE

Users need to take care that the machine does not become overloaded. It is not allowed to use more than 8 cores in total or to request more than 32 GB of main memory. Ignoring these limits may cause a system crash or hangup!

To test and optimize your GPU codes, you can use the “gpudev” partition by specifying

#SBATCH --partition=gpudev

in your submit script. Only one node with four A100 GPUs is available in the “gpudev” partition. The maximum execution time is 15 minutes. Between one and four GPUs can be requested, as for the usual GPU jobs (see the sketch below).
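
As a minimal sketch, a short test job on the “gpudev” partition could be submitted with a script like the following; the resource values mirror the single-GPU example further below and are only an illustration, and cuda_executable is a placeholder for your own program:

#!/bin/bash -l
#SBATCH -o ./gpudev_test.out.%j
#SBATCH -e ./gpudev_test.err.%j
#SBATCH -D ./
#SBATCH -J gpudev_test
#SBATCH --partition=gpudev
#SBATCH --ntasks=1
#SBATCH --gres=gpu:a100:1     # request 1 of the 4 A100 GPUs of the gpudev node
#SBATCH --cpus-per-task=18
#SBATCH --mem=125000
#SBATCH --time=00:15:00       # the maximum execution time on gpudev is 15 minutes

module purge
module load intel/21.2.0 impi/2021.2 cuda/11.2

srun ./cuda_executable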

Internet access

Connections to the Internet are only permitted from the login nodes in outgoing direction; Internet access from within batch jobs is not possible. To download source code or other data, command line tools such as wget, curl, rsync, scp, pip, git, or similar may be used interactively on the login nodes. In case the transfer is expected to take a long time it is useful to run it inside a screen or tmux session.
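
For example, a long-running download can be started inside a tmux session on a login node roughly as follows (the URL is just a placeholder):

tmux new -s download                       # open a new tmux session named "download"
wget https://example.org/dataset.tar.gz    # start the transfer inside the session
# detach with Ctrl-b d; later re-attach to check the progress with:
tmux attach -t download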

Hardware configuration

Compute nodes

CPU nodes:

  • 1592 compute nodes

  • Processor type: Intel Xeon IceLake Platinum 8360Y

  • Processor base frequency: 2.4 GHz

  • Cores per node: 72 (each with 2 hyperthreads, thus 144 logical CPUs per node)

  • Main memory (RAM) per node: 256 GB (1524 nodes), 512 GB (64 nodes), 2048 GB (4 nodes)

  • Theoretical peak performance per node (FP64, “double precision”): 2.4 GHz * 32 DP Flops/cycle * 72 = 5530 GFlop/s

  • 2 NUMA domains with 36 physical cores each

GPU nodes:

  • 192 GPU-accelerated nodes (each hosting 4 Nvidia A100 GPUs, interlinked with NVLink 3)

  • GPU type: Nvidia A100 NVLink 40 GB HBM2, CUDA compute capability 8.0 / Ampere

  • CPU host: Intel Xeon IceLake Platinum 8360Y with 72 CPU cores and 512 GB RAM per node

Login and interactive nodes

  • 2 nodes for login and code compilation (Hostname raven.mpcdf.mpg.de)

  • 4 nodes for interactive program development and testing (Hostname raven-i.mpcdf.mpg.de)

  • Processor type: Intel Xeon IceLake Platinum 8360Y

  • Cores per node: 72 (144 logical CPUs)

  • Main memory (RAM) per node: login nodes: 512 GB, interactive nodes: 256 GB

Interconnect

  • Mellanox InfiniBand HDR network connecting all the nodes using a pruned fat-tree topology with four non-blocking islands

    • CPU nodes: 100 Gb/s (HDR100)

    • GPU nodes: 200 Gb/s (HDR), 400 Gb/s (2x HDR) for a subset of 32 nodes

I/O subsystem

  • 7 PB of online disk space

Additional hardware details

Additional details on the Raven hardware are given on a separate documentation page.

File systems

$HOME

Your home directory is in the GPFS file system /u (see below).

AFS

AFS is only available on the login nodes raven.mpcdf.mpg.de and on the interactive nodes raven-i.mpcdf.mpg.de in order to access software that is distributed via AFS. If you do not automatically get an AFS token during login, you can obtain one with the command /usr/bin/klog.krb5. Note that there is no AFS on the compute nodes, so you have to avoid any dependencies on AFS in your jobs.
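
For example, after login you can refresh and inspect your token as follows (the tokens command is part of the standard OpenAFS client tools and is assumed to be available):

/usr/bin/klog.krb5   # obtain a fresh AFS token from your Kerberos credentials
tokens               # list your current AFS tokens and their expiry times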

GPFS

There are two global, parallel file systems of type GPFS (/u and /ptmp), symmetrically accessible from all Raven cluster nodes, plus the migrating file system /r interfacing to the HPSS archive system.

File system /u

The file system /u (a symbolic link to /raven/u) is designed for permanent user data (source files, config files, etc.). The size of /u is 0.9 PB (mirrored). Your home directory is in /u. The default disk quota in /u is 2.5 TB, the file quota is 2 million files. You can check your disk quota in /u with the command:

/usr/lpp/mmfs/bin/mmlsquota raven_u

File system /ptmp

The file system /ptmp (a symbolic link to /raven/ptmp) is designed for batch job I/O (12 PB, no system backups). Files in /ptmp that have not been accessed for more than 12 weeks will be removed automatically. The period of 12 weeks may be reduced if necessary (with prior notice).

As a current policy, no quotas are applied on /ptmp. This gives users the freedom to manage their data according to their actual needs without administrative overhead. This liberal policy presumes fair usage of the common file space, so please do regular housekeeping of your data and archive or remove files that are no longer actually used.
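
As a sketch of such housekeeping, files that have not been accessed for more than 12 weeks (84 days) can be listed with a command like the following; the path /ptmp/$USER is an assumption, adapt it to your actual directory:

# list files under your /ptmp directory that have not been accessed for more than 84 days
find /ptmp/$USER -type f -atime +84 -ls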

Archiving data from the GPFS file systems to tape can be done using the migrating file system /r (see below).

File system /r

The /r file system (a symbolic link to /ghi/r) stages archive data. It is available only on the login nodes raven.mpcdf.mpg.de and on the interactive nodes raven-i.mpcdf.mpg.de.

Each user has a subdirectory /r/<initial>/<userid> to store data. For efficiency, files should be packed into tar files (with a size of about 1 GB to 1 TB) before archiving them in /r, i.e., please avoid archiving small files. When the file system /r fills up beyond a certain threshold, files are transferred from disk to tape, beginning with the largest files that have not been used for the longest time.
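
A possible workflow, sketched here with purely illustrative file and directory names, is to pack the data into a single tar file first and then copy it to your archive directory on a login or interactive node (where /r is mounted):

# pack a results directory into one tar file (aim for roughly 1 GB to 1 TB)
tar cf /ptmp/<userid>/results_2021.tar ./results/
# copy the tar file into your archive directory
cp /ptmp/<userid>/results_2021.tar /r/<initial>/<userid>/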

For documentation on how to use the MPCDF archive system, please see the backup and archive section.

/tmp

Please don’t use the file system /tmp or $TMPDIR for scratch data. Instead, use /ptmp, which is accessible from all Raven cluster nodes. In cases where an application really depends on node-local storage, you can use the directories given by the environment variables JOB_TMPDIR and JOB_SHMTMPDIR, which are set individually for each job.
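
As a minimal sketch (the program name and the /ptmp target path are placeholders), a batch job can stage its scratch I/O through the node-local directory like this:

# change into the node-local scratch directory provided for this job
cd $JOB_TMPDIR
# run a program that writes its scratch and result files to the current directory
srun ./myprog
# copy results you want to keep back to /ptmp, since the node-local data
# is not accessible from the login nodes after the job has finished
cp results.dat /ptmp/<userid>/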

Software

Access to software via environment modules

Environment modules are used at MPCDF to provide software packages and enable switching between different software versions.

Use the command

module avail

to list the available software packages on the HPC system. Note that you can search for a certain module by using the find-module tool (see below).

Use the command

module load package_name/version

to actually load a software package at a specific version.

Further information on the environment modules on Raven and their hierarchical organization is given below.

Information on the software packages provided by the MPCDF is available here.

Hierarchical module environment

To manage the plethora of software packages resulting from all the relevant combinations of compilers and MPI libraries, the environment module system for accessing these packages is organized in a natural hierarchical manner: compilers (gcc, intel) are located on the uppermost level, dependent libraries (e.g., MPI) on the second level, and further dependent libraries on a third level. This means that not all modules are visible initially: only after loading a compiler module do the modules depending on it become available, and, similarly, loading an MPI module in addition makes the modules that depend on the MPI library available.

Starting with the HPC system Raven, no defaults are defined for the compiler and MPI modules, and no modules are loaded automatically at login. Users therefore have to specify explicit versions of these modules during compilation and in their batch scripts, which ensures that the same MPI library is used at compile time and at run time. It also means that users decide themselves when to switch to newer compiler and MPI versions for their code, which avoids the compatibility problems that can arise when defaults are changed centrally.

For example, the FFTW library compiled with the Intel compiler and the Intel MPI library can be loaded as follows:

First, load the Intel compiler module using the command

module load intel/21.2.0

second, the Intel MPI module with

module load impi/2021.2

and, finally, the FFTW module fitting exactly to the compiler and MPI library via

module load fftw-mpi

You may check by using the command

module avail

that after the first and second steps the dependent environment modules become visible, in the present example impi and fftw-mpi. Moreover, note that the environment modules can be loaded via a single ‘module load’ statement as long as the order given by the hierarchy is respected, e.g.,

module load intel/21.2.0 impi/2021.2 fftw-mpi

It is important to point out that a large fraction of the available software is not affected by the hierarchy: certain HPC applications, tools such as git or cmake, mathematical software (maple, matlab, mathematica), and visualization software (visit, paraview, idl) are visible at the uppermost level. Note that a hierarchy also exists for dependent Python packages via the ‘anaconda’ module files on the top level.

Because of the hierarchy, some modules only appear after other modules (such as compiler and MPI) have been loaded. One can search all available combinations of a certain software (e.g. fftw-mpi) by using

find-module fftw-mpi

Further information on using environment modules is given here.

Slurm batch system

The batch system on the HPC cluster Raven is the open-source workload manager Slurm (Simple Linux Utility for Resource Management). To run test or production jobs, submit a job script (see below) to Slurm, which will find and allocate the resources required for your job (e.g. the compute nodes to run your job on).

By default, the job run limit on Raven is 8 and the job submit limit is 300. If your batch jobs cannot run independently of each other, please use job steps (see the sketch below).
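
As a minimal sketch (the executable names are placeholders), dependent runs can be executed as consecutive job steps within a single allocation; each srun call below is one job step, and the second step starts only after the first has finished:

#!/bin/bash -l
#SBATCH -o ./job_steps.out.%j
#SBATCH -e ./job_steps.err.%j
#SBATCH -D ./
#SBATCH -J test_steps
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=72
#SBATCH --time=12:00:00

module purge
module load intel/21.2.0 impi/2021.2

# job step 1: placeholder pre-processing run
srun ./preprocess > step1.out
# job step 2: placeholder production run, starts after step 1 has completed
srun ./myprog > step2.out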

There are mainly two types of batch jobs:

  • Exclusive, where all resources on the nodes are allocated to the job

  • Shared, where several jobs share the resources of one node. In this case it is necessary that the number of CPUs and the amount of memory are specified for each job.

The Intel processors on Raven support hyperthreading, which might increase the performance of your application by up to 20%. To use hyperthreading, you have to increase the product of the number of MPI tasks per node and the number of threads per MPI task from 72 to 144 in your job script. Please be aware that with 144 MPI tasks per node, each process by default gets only half of the memory compared to a non-hyperthreading job. If you need more memory, you have to specify it in your job script (see the example batch scripts).

Overview of the available per-job resources on Raven:

    Job type          Max. CPUs            Number of GPUs   Max. Memory      Number     Max. Run
                      per Node               per node        per Node       of Nodes      Time
   =============================================================================================
    shared    cpu     36 / 72  in HT mode                     120 GB          < 1       24:00:00
    ............................................................................................
                      18 / 36  in HT mode        1            125 GB          < 1       24:00:00
    shared    gpu     36 / 72  in HT mode        2            250 GB          < 1       24:00:00
                      54 / 108 in HT mode        3            375 GB          < 1       24:00:00
   ---------------------------------------------------------------------------------------------
                                                              240 GB         1-360      24:00:00
    exclusive cpu     72 / 144 in HT mode                     500 GB         1-64       24:00:00
                                                             2048 GB         1-4        24:00:00
    ............................................................................................
    exclusive gpu     72 / 144 in HT mode        4            500 GB         1-80       24:00:00
    exclusive gpu bw  72 / 144 in HT mode        4            500 GB         1-16       24:00:00
   ---------------------------------------------------------------------------------------------

If an application needs more than 240 GB per node, the required amount of memory has to be specified in the Slurm submit script, e.g. with the following options:

#SBATCH --mem=500000               # to request up to 500 GB
   or
#SBATCH --mem=2048000              # to request up to 2 TB

A job submit filter will automatically choose the right partition and job parameters from the resource specification.

Interactive testing and debugging is possible on the nodes raven-i.mpcdf.mpg.de (raven[03-06]i) by using the command:

srun -n N_TASKS -p interactive ./EXECUTABLE

Interactive jobs are limited to 8 cores, 256000M memory and 2 hours runtime.

For detailed information about the Slurm batch system, please see Slurm Workload Manager.

The most important Slurm commands are

  • sbatch <job_script_name> Submit a job script for execution

  • squeue Check the status of your job(s)

  • scancel <job_id> Cancel a job

  • sinfo List the available batch queues (partitions).

Do not run Slurm client commands from loops in shell scripts or other programs. Ensure that programs limit calls to these commands to the minimum necessary for the information you are trying to gather.

Sample Batch job scripts can be found below.

Notes on job scripts:

  • The directive

    #SBATCH --nodes=<number of nodes>
    

    in your job script specifies the number of compute nodes that your program will use.

  • The directive

    #SBATCH  --ntasks-per-node=<number of cpus>
    

    specifies the number of MPI processes per node. Without hyperthreading, ntasks-per-node cannot be greater than 72; in hyperthreading mode it can be at most 144, since one compute node on Raven has 72 cores with 2 hyperthreads each, i.e., 144 logical CPUs.

  • The directive

    #SBATCH --cpus-per-task=<number of OMP threads per MPI task>
    

    specifies the number of threads per MPI process if you are using OpenMP.

  • The expression

    ntasks-per-node * cpus-per-task
    

    may not exceed 144.

  • The expression

    nodes * ntasks-per-node * cpus-per-task
    

    gives the total number of CPUs that your job will use.

  • To select either GPU nodes with standard (200 GBit/s) or with high-bandwidth (400 GBit/s) network interconnect, specify a job constraint as follows:

    #SBATCH --constraint="gpu"    # for gpu nodes with 200 GBit/s network connection
    # or
    #SBATCH --constraint="gpu-bw" # for gpu nodes with 400 GBit/s network connection
    
  • For multi-process GPU jobs, NVIDIA MPS can be launched using the command line flag --nvmps for sbatch. See the example scripts below for details.

  • Jobs that need less than half a compute node have to specify a reasonable memory limit so that they can share a node!

  • A job submit filter will automatically choose the right partition/queue from the resource specification.

  • Please note that setting the environment variable ‘SLURM_HINT’ in job scripts is not necessary and is discouraged on Raven.

Slurm example batch scripts

MPI and MPI/OpenMP batch scripts

MPI batch job without hyperthreading

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=72
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2

# Run the program:
srun ./myprog > prog.out

Hybrid MPI/OpenMP batch job without hyperthreading

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job_hybrid.out.%j
#SBATCH -e ./job_hybrid.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=4
# for OpenMP:
#SBATCH --cpus-per-task=18
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# For pinning threads correctly:
export OMP_PLACES=cores

# Run the program:
srun ./myprog > prog.out

Hybrid MPI/OpenMP batch job in hyperthreading mode

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job_hybrid.out.%j
#SBATCH -e ./job_hybrid.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=4
# Enable Hyperthreading:
#SBATCH --ntasks-per-core=2
# for OpenMP:
#SBATCH --cpus-per-task=36
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock Limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# For pinning threads correctly:
export OMP_PLACES=threads

# Run the program:
srun ./myprog > prog.out

Small MPI batch job on a shared node

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of MPI Tasks, e.g. 8:
#SBATCH --ntasks=8
# The memory [MB] required by the job must be specified, e.g. 3000 MB per task:
#SBATCH --mem=24000
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2

# Run the program:
srun ./myprog > prog.out

Batch jobs using GPUs

Note that computing time on GPU-accelerated nodes is accounted using a weighting factor of 4 relative to CPU-only jobs, corresponding to the additional computing power provided by the GPUs. Users are advised to check the performance reports of their jobs in order to monitor adequate utilization of the resources.

GPU job using 1, 2, or 4 GPUs on a single node

The following example job script launches a (potentially multithreaded) CUDA program that uses one (or more) GPUs on a single node. In case more than one GPU is requested, the user code must be able to utilize the additional GPUs properly.

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J test_gpu
#
#SBATCH --ntasks=1
#SBATCH --constraint="gpu"
#
# --- default case: use a single GPU on a shared node ---
#SBATCH --gres=gpu:a100:1
#SBATCH --cpus-per-task=18
#SBATCH --mem=125000
#
# --- uncomment to use 2 GPUs on a shared node ---
# #SBATCH --gres=gpu:a100:2
# #SBATCH --cpus-per-task=36
# #SBATCH --mem=250000
#
# --- uncomment to use 4 GPUs on a full node ---
# #SBATCH --gres=gpu:a100:4
# #SBATCH --cpus-per-task=72
# #SBATCH --mem=500000
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#SBATCH --time=12:00:00

module purge
module load intel/21.2.0 impi/2021.2 cuda/11.2

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./cuda_executable

Hybrid MPI/OpenMP job using one or more nodes with 4 GPUs each

The following example job script launches a hybrid MPI/OpenMP-CUDA code on one (or more) nodes, running one task per GPU. Note that the user code needs to assign its tasks to the different GPUs based on some code-internal logic. In case more than one MPI task accesses a GPU, it is necessary to enable NVIDIA MPS using the flag #SBATCH --nvmps, as shown in the plain MPI-CUDA example below. The flag #SBATCH --constraint="gpu-bw" may be used to request nodes with high-bandwidth network interconnect.

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J test_gpu
#
#SBATCH --nodes=1            # Request 1 or more full nodes
#SBATCH --constraint="gpu"   #   providing GPUs.
#SBATCH --gres=gpu:a100:4    # Request 4 GPUs per node.
#SBATCH --ntasks-per-node=4  # Run one task per GPU
#SBATCH --cpus-per-task=18   #   using 18 cores each.
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#SBATCH --time=12:00:00

module purge
module load intel/21.2.0 impi/2021.2 cuda/11.2

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./mpi_openmp_cuda_executable

Plain MPI job using GPUs

The following example job script launches an MPI-CUDA code on one (or more) nodes with one MPI task per CPU core. Note that the user code needs to distribute its tasks across the different GPUs based on some code-internal logic. Moreover, note that it is necessary to launch NVIDIA MPS via the flag #SBATCH --nvmps so that the MPI tasks can access the GPUs concurrently in an efficient manner. The flag #SBATCH --constraint="gpu-bw" may be used to request nodes with high-bandwidth network interconnect.

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J test_slurm
#
#SBATCH --nodes=1             # Request 1 (or more) node(s)
#SBATCH --constraint="gpu"    #    providing GPUs.
#SBATCH --ntasks-per-node=72  # Launch 72 tasks per node
#SBATCH --gres=gpu:a100:4     # Request all 4 GPUs of each node
#SBATCH --nvmps               # Launch NVIDIA MPS to enable concurrent access to the GPUs from multiple processes efficiently
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#SBATCH --time=12:00:00

module purge
module load intel/21.2.0 impi/2021.2 cuda/11.2

srun ./mpi_cuda_executable

Batch jobs with dependencies

The following script generates a sequence of jobs, each running the given job script. Each individual job starts only after its dependency is satisfied; possible values for the --dependency flag are, e.g.:

  • afterany:job_id This job starts after the previous job has terminated

  • afterok:job_id This job starts after the previous job has executed successfully

#!/bin/bash
# Submit a sequence of batch jobs with dependencies
#
# Number of jobs to submit:
NR_OF_JOBS=6
# Batch job script:
JOB_SCRIPT=./my_batch_script
echo "Submitting job chain of ${NR_OF_JOBS} jobs for batch script ${JOB_SCRIPT}:"
JOBID=$(sbatch ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
echo "  " ${JOBID}
I=1
while [ ${I} -lt ${NR_OF_JOBS} ]; do
  JOBID=$(sbatch --dependency=afterany:${JOBID} ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
  echo "  " ${JOBID}
  let I=${I}+1
done

Batch job using a job array

#!/bin/bash -l
# Specify the indexes (max. index 30000) of the job array elements; the number of elements is limited by the job submit limit (300 per user by default)
#SBATCH --array=1-20
# Standard output and error:
#SBATCH -o job_%A_%a.out        # Standard output, %A = job ID, %a = job array index
#SBATCH -e job_%A_%a.err        # Standard error, %A = job ID, %a = job array index
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_array
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=72
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2

# Run the program:
#  the environment variable $SLURM_ARRAY_TASK_ID holds the index of the job array and
#  can be used to discriminate between individual elements of the job array

srun ./myprog > prog_${SLURM_ARRAY_TASK_ID}.out

Single-node example job scripts for sequential programs, plain-OpenMP cases, Python, Julia, Matlab

In the following, example job scripts are given for jobs that use at maximum one full node. Use cases are sequential programs, threaded programs using OpenMP or similar models, and programs written in languages such as Python, Julia, Matlab, etc.

The Python example programs referred to below are available for download.

Single-core job

#!/bin/bash -l
#
# Single-core example job script for MPCDF Raven.
# In addition to the Python example shown here, the script
# is valid for any single-threaded program, including
# sequential Matlab, Mathematica, Julia, and similar cases.
#
#SBATCH -J PYTHON_SEQ
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH --ntasks=1         # launch job on a single core
#SBATCH --cpus-per-task=1  #   on a shared node
#SBATCH --mem=2000MB       # memory limit for the job
#SBATCH --time=0:10:00

module purge
module load gcc/10 impi/2021.2
module load anaconda/3/2021.05

# Set number of OMP threads to fit the number of available cpus, if applicable.
export OMP_NUM_THREADS=1

# Run single-core program
srun python3 ./python_sequential.py

Small job with multithreading, applicable to Python, Julia and Matlab, plain OpenMP, or any threaded application

#!/bin/bash -l
#
# Multithreading example job script for MPCDF Raven.
# In addition to the Python example shown here, the script
# is valid for any multi-threaded program, including
# Matlab, Mathematica, Julia, and similar cases.
#
#SBATCH -J PYTHON_MT
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH --ntasks=1         # launch job on
#SBATCH --cpus-per-task=8  #   8 cores on a shared node
#SBATCH --mem=16000MB      # memory limit for the job
#SBATCH --time=0:10:00

module purge
module load gcc/10 impi/2021.2
module load anaconda/3/2021.05

# Set number of OMP threads to fit the number of available cpus, if applicable.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun python3 ./python_multithreading.py

Python/NumPy multithreading, applicable to Julia and Matlab, plain-OpenMP, or any threaded application

#!/bin/bash -l
#
# Multithreading example job script for MPCDF Raven.
# In addition to the Python example shown here, the script
# is valid for any multi-threaded program, including
# parallel Matlab, Julia, and similar cases.
#
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH -J PY_MULTITHREADING
#SBATCH --nodes=1             # request a full node
#SBATCH --ntasks-per-node=1   # only start 1 task via srun because Python multiprocessing starts more tasks internally
#SBATCH --cpus-per-task=72    # assign all the cores to that first task to make room for multithreading
#SBATCH --time=00:10:00

module purge
module load gcc/10 impi/2021.2
module load anaconda/3/2021.05

# set number of OMP threads *per process*
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun python3 ./python_multithreading.py

Python multiprocessing

#!/bin/bash -l
#
# Python multiprocessing example job script for MPCDF Raven.
#
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH -J PYTHON_MP
#SBATCH --nodes=1             # request a full node
#SBATCH --ntasks-per-node=1   # only start 1 task via srun because Python multiprocessing starts more tasks internally
#SBATCH --cpus-per-task=72    # assign all the cores to that first task to make room for Python's multiprocessing tasks
#SBATCH --time=00:10:00

module purge
module load gcc/10 impi/2021.2
module load anaconda/3/2021.05

# Important:
# Set the number of OMP threads *per process* to avoid overloading of the node!
export OMP_NUM_THREADS=1

# Use the environment variable SLURM_CPUS_PER_TASK to have multiprocessing
# spawn exactly as many processes as you have CPUs available.
srun python3 ./python_multiprocessing.py $SLURM_CPUS_PER_TASK

Python mpi4py

#!/bin/bash -l
#
# Python MPI4PY example job script for MPCDF Raven.
# May use more than one node.
#
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH -J MPI4PY
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=72
#SBATCH --time=00:10:00

module purge
module load gcc/10 impi/2021.2
module load anaconda/3/2021.05
module load mpi4py/3.0.3

# Important:
# Set the number of OMP threads *per process* to avoid overloading of the node!
export OMP_NUM_THREADS=1

srun python3 ./python_mpi4py.py