Biochemistry HPCL8


Name of the cluster:

HPCL8

Institution:

Max Planck Institute of Biochemistry

Login nodes:

  • hpcl8001.bc.rzg.mpg.de

  • hpcl8002.bc.rzg.mpg.de

  • hpcl8003.bc.rzg.mpg.de

  • hpcl8004.bc.rzg.mpg.de

  • hpcl9301.bc.rzg.mpg.de

  • hpcl8061.bc.rzg.mpg.de

  • hpcl8062.bc.rzg.mpg.de

  • hpcl8063.bc.rzg.mpg.de

  • hpcl9001.bc.rzg.mpg.de

  • hpcl9002.bc.rzg.mpg.de

Login nodes hpcl[8061-8063] are available for selected users only (Dept. Conti).
Login nodes hpcl[9001-9002] are available for selected users only (Dept. Briggs).

Hardware Configuration:

7 login nodes hpcl[8001-8004] & hpcl[8061-8063]
  • 2 x Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
  • 24 cores per node
  • hyper-threading disabled - 1 thread per core
  • 377 GB RAM
  • 2 x RTX 5000 GPUs
  • node interconnect: based on 25 Gb/s ethernet

56 execution nodes hpcl[8005-8060] for parallel CPU/GPU computing
  • 1344 CPU cores in total
  • 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
  • 24 cores per node
  • hyper-threading disabled - 1 thread per core
  • 377 GB RAM
  • 2 x RTX 5000 GPUs
  • node interconnect: based on 25 Gb/s ethernet

2 login nodes hpcl[9001-9002]
  • 2 x Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
  • 72 cores per node
  • hyper-threading disabled - 1 thread per core
  • 1 TB RAM
  • 4 x NVIDIA A40 GPUs
  • node interconnect: based on 50 Gb/s ethernet

9 execution nodes hpcl[9003-9011] for parallel CPU/GPU computing
  • 648 CPU cores in total
  • 2 x Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
  • 72 cores per node
  • hyper-threading disabled - 1 thread per core
  • 1 TB RAM
  • 4 x NVIDIA A40 GPUs
  • node interconnect: based on 50 Gb/s ethernet

4 execution nodes hpcl[9101-9104] for parallel CPU/GPU computing
  • 304 CPU cores in total
  • 2 x Intel(R) Xeon(R) Platinum 8368 CPU @ 2.40GHz
  • 76 cores per node
  • hyper-threading enabled - 2 threads per core
  • 1 TB RAM
  • 4 x NVIDIA H100 GPUs
  • node interconnect: based on 50 Gb/s ethernet

3 execution nodes hpcl[9201-9203] for parallel CPU computing
  • 192 CPU cores in total
  • 2 x AMD EPYC 9374F 32-Core CPU @ 3.80GHz
  • 64 cores per node
  • hyper-threading enabled - 2 threads per core
  • 512 GB RAM
  • node interconnect: based on 50 Gb/s ethernet

1 login node hpcl9301
  • 2 x AMD EPYC 9534 64-Core CPU @ 3.7GHz
  • 128 cores per node
  • hyper-threading enabled - 2 threads per core
  • 755 GB RAM
  • 4 x NVIDIA L40s GPUs
  • node interconnect: based on 50 Gb/s ethernet

19 execution nodes hpcl[9302-9320] for parallel CPU/GPU computing
  • 2432 CPU cores in total
  • 2 x AMD EPYC 9534 64-Core CPU @ 3.7GHz
  • 128 cores per node
  • hyper-threading enabled - 2 threads per core
  • 755 GB RAM
  • 4 x NVIDIA L40s GPUs
  • node interconnect: based on 50 Gb/s ethernet

Compilers and Libraries:

The “module” subsystem is implemented on HPCL8. Please use ‘module available’ to see all available modules. A short compile sketch follows the list below.

  • Intel compilers (-> ‘module load intel’): icc, icpc, ifort

  • GNU compilers (-> ‘module load gcc’): gcc, g++, gfortran

  • Intel MKL (-> ‘module load mkl’): $MKL_HOME defined; libraries found in $MKL_HOME/lib/intel64

  • Intel MPI (-> ‘module load impi’): mpicc, mpigcc, mpiicc, mpiifort, mpiexec, …

  • OpenMPI (-> ‘module load openmpi’): mpicc, mpicxx, mpif77, mpif90, mpirun, mpiexec

  • Python (-> ‘module load anaconda’): python
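
A minimal compile sketch, assuming the Intel toolchain from the modules listed above is used; the source file and binary names (my_prog.f90, my_prog.x) are placeholders:

  # load the Intel compiler, MKL and Intel MPI environments
  module purge
  module load intel mkl impi

  # build an MPI Fortran code and link against MKL (placeholder file names)
  mpiifort -O2 -o my_prog.x my_prog.f90 -L$MKL_HOME/lib/intel64 -lmkl_rt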

Batch system based on Slurm:

The batch system on HPCL8 is the Slurm Workload Manager. A brief introduction to the basic commands (srun, sbatch, squeue, scancel, …) can be found on the Raven home page. For more detailed information, see the Slurm handbook. See also the sample batch scripts, which must be modified for the HPCL8 cluster (the partition must be changed); a minimal sketch is shown below.
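
A minimal sketch of such a batch script adapted to HPCL8; the partition is the cluster-specific part, while the job name, resources and executable are placeholders that need to be adjusted to your job:

  #!/bin/bash -l
  # minimal Slurm batch script for HPCL8 (placeholder values)
  #SBATCH -J my_job                  # job name (placeholder)
  #SBATCH -o ./job.out.%j            # standard output file
  #SBATCH -e ./job.err.%j            # standard error file
  #SBATCH -D ./                      # working directory
  #SBATCH --partition=p.hpcl8        # default HPCL8 partition
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=24       # all 24 cores of a p.hpcl8 node
  #SBATCH --time=24:00:00            # run time limit (hh:mm:ss)

  module purge
  module load intel impi             # modules as listed above

  srun ./my_program                  # placeholder executable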

Current Slurm configuration on HPCL8:

  • default run time: 24 hours

  • current max. run time (wallclock): 21 days

  • five partitions: p.hpcl8 (default), p.hpcl9 (Dept. Briggs only), p.hpcl91 (b_borgwardt group only), p.hpcl92 (b_mann & g_rz groups only) & p.hpcl93

  • nodes in the p.hpcl8 & p.hpcl9 partitions are allocated exclusively; multiple jobs may run on the same node only if they belong to the same user

  • nodes in p.hpcl91, p.hpcl92 & p.hpcl93 can be shared by jobs

  • default memory size per job per node: 380000 MB (p.hpcl8 partition), 1000000 MB (p.hpcl9 partition), 40000 MB (p.hpcl91 partition), 32000 MB (p.hpcl92 partition) & 38000 MB (p.hpcl93 partition)

  • max submitted jobs per user: 2000

  • max running jobs per user at one time: 200
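
To inspect the partitions and your own jobs, the standard Slurm query commands can be used; a small sketch (the sinfo format string is just one possible choice):

  # list the HPCL8 partitions with their time limits, node counts and states
  sinfo -o "%P %l %D %t"

  # show your own pending and running jobs
  squeue -u $USER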

Useful tips:

The default run time limit for jobs that do not specify a value is 24 hours. Use the --time option of sbatch/srun to set a limit on the total run time of the job allocation, but not longer than 504 hours.
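
For example, in an sbatch script (the 12-hour value is arbitrary):

  #SBATCH --time=12:00:00      # wallclock limit hh:mm:ss, at most 504:00:00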

Memory is a consumable resource. To run several jobs on one node, use the --mem=<size[units]> or --mem-per-cpu=<size[units]> options of sbatch/srun, where size should be less than the default per node (380000 MB).
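
For example (the requested sizes are arbitrary and must stay below the node default):

  #SBATCH --mem=90000M         # request 90000 MB of a node's memory
  # or, alternatively, per allocated CPU:
  #SBATCH --mem-per-cpu=4000M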

OpenMP codes require the environment variable OMP_NUM_THREADS to be set. Its value can be obtained from the Slurm environment variable $SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script (a sketch is shown below).
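
A sketch of the corresponding lines for a pure OpenMP job on one p.hpcl8 node (the executable name is a placeholder):

  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --cpus-per-task=24               # OpenMP threads per task

  # pass the Slurm allocation on to the OpenMP runtime
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  srun ./my_openmp_program                 # placeholder executable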

To use GPUs, add the --gres option to your Slurm scripts and choose how many GPUs to allocate: #SBATCH --gres=gpu:1 or #SBATCH --gres=gpu:2

Valid gres options are: gpu[[:type]:count]
where
  • type is the type of GPU (rtx5000, a40, h100 or l40s)
  • count is the number of GPUs (1 or 2 in the p.hpcl8 partition and 1-4 in the p.hpcl9, p.hpcl91 & p.hpcl93 partitions)
GPU cards run in the default compute mode.
GPU cards on the hpcl[9101-9103] nodes are partitioned with MIG (Multi-Instance GPU).
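
For example, to request all four L40s GPUs of a p.hpcl93 node (partition, GPU type and count have to match the hardware listed above):

  #SBATCH --partition=p.hpcl93
  #SBATCH --gres=gpu:l40s:4      # four L40s GPUs on one node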

Support:

For support please create a trouble ticket at the MPCDF helpdesk.