Biochemistry HPCL8


Name of the cluster:

HPCL8

Institution:

Max Planck Institute of Biochemistry

Login nodes:

  • hpcl8001.bc.rzg.mpg.de

  • hpcl8002.bc.rzg.mpg.de

  • hpcl8003.bc.rzg.mpg.de

  • hpcl8004.bc.rzg.mpg.de

  • hpcl9301.bc.rzg.mpg.de

  • hpcl8061.bc.rzg.mpg.de

  • hpcl8062.bc.rzg.mpg.de

  • hpcl8063.bc.rzg.mpg.de

  • hpcl9001.bc.rzg.mpg.de

  • hpcl9002.bc.rzg.mpg.de

Login nodes hpcl[8061-8063] are available for selected users only (Dept. Conti).
Login nodes hpcl[9001-9002] are available for selected users only (Dept. Briggs).

Hardware Configuration:

7 login nodes hpcl[8001-8004] & hpcl[8061-8063]
  • 2 x Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
  • 24 cores per node
  • hyper-threading disabled - 1 thread per core
  • 377 GB RAM
  • 2 x RTX 5000 GPUs
  • node interconnect: based on 25 Gb/s ethernet

56 execution nodes hpcl[8005-8060] for parallel CPU/GPU computing
  • 1344 CPU cores in total
  • 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
  • 24 cores per node
  • hyper-threading disabled - 1 thread per core
  • 377 GB RAM
  • 2 x RTX 5000 GPUs
  • node interconnect: based on 25 Gb/s ethernet

2 login nodes hpcl[9001-9002]
  • 2 x Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
  • 72 cores per node
  • hyper-threading disabled - 1 thread per core
  • 1 TB RAM
  • 4 x NVIDIA A40 GPUs
  • node interconnect: based on 50 Gb/s ethernet

9 execution nodes hpcl[9003-9011] for parallel CPU/GPU computing
  • 648 CPU cores in total
  • 2 x Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
  • 72 cores per node
  • hyper-threading disabled - 1 thread per core
  • 1 TB RAM
  • 4 x NVIDIA A40 GPUs
  • node interconnect: based on 50 Gb/s ethernet

4 execution nodes hpcl[9101-9104] for parallel CPU/GPU computing
  • 304 CPU cores in total
  • 2 x Intel(R) Xeon(R) Platinum 8368 CPU @ 2.40GHz
  • 76 cores per node
  • hyper-threading enabled - 2 threads per core
  • 1 TB RAM
  • 4 x NVIDIA H100 GPUs
  • node interconnect: based on 50 Gb/s ethernet

3 execution nodes hpcl[9201-9203] for parallel CPU computing
  • 192 CPU cores in total
  • 2 x AMD EPYC 9374F 32-Core CPU @ 3.80GHz
  • 64 cores per node
  • hyper-threading enabled - 2 threads per core
  • 512 GB RAM
  • node interconnect: based on 50 Gb/s ethernet

1 login node hpcl9301
  • 2 x AMD EPYC 9534 64-Core CPU @ 3.7GHz
  • 128 cores per node
  • hyper-threading enabled - 2 threads per core
  • 755 GB RAM
  • 4 x NVIDIA L40s GPUs
  • node interconnect: based on 50 Gb/s ethernet

19 execution nodes hpcl[9302-9320] for parallel CPU/GPU computing
  • 2432 CPU cores in total
  • 2 x AMD EPYC 9534 64-Core CPU @ 3.7GHz
  • 128 cores per node
  • hyper-threading enabled - 2 threads per core
  • 755 GB RAM
  • 4 x NVIDIA L40s GPUs
  • node interconnect: based on 50 Gb/s ethernet

Compilers and Libraries:

The “module” subsystem is implemented on HPCL8. Please use ‘module available’ to see all available modules. A short compile sketch follows the list below.

  • Intel compilers (-> ‘module load intel’): icc, icpc, ifort

  • GNU compilers (-> ‘module load gcc’): gcc, g++, gfortran

  • Intel MKL (-> ‘module load mkl’): $MKL_HOME defined; libraries found in $MKL_HOME/lib/intel64

  • Intel MPI (-> ‘module load impi’): mpicc, mpigcc, mpiicc, mpiifort, mpiexec, …

  • OpenMPI (-> ‘module load openmpi’): mpicc, mpicxx, mpif77, mpif90, mpirun, mpiexec

  • Python (-> ‘module load anaconda’): python
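
A minimal compile sketch, assuming the Intel toolchain from the modules listed above is used; the source file and binary names (my_prog.f90, my_prog.x) are placeholders:

  # load the Intel compiler, MKL and Intel MPI environments
  module purge
  module load intel mkl impi

  # build an MPI Fortran code and link against MKL (placeholder file names)
  mpiifort -O2 -o my_prog.x my_prog.f90 -L$MKL_HOME/lib/intel64 -lmkl_rt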

Batch system based on Slurm:

The batch system on HPCL8 is the Slurm Workload Manager. A brief introduction to the basic commands (srun, sbatch, squeue, scancel, …) can be found on the Raven home page. For more detailed information, see the Slurm handbook. See also the sample batch scripts, which must be modified for the HPCL8 cluster (the partition must be changed); a minimal sketch is shown below.
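
A minimal sketch of such a batch script adapted to HPCL8; the partition is the cluster-specific part, while the job name, resources and executable are placeholders that need to be adjusted to your job:

  #!/bin/bash -l
  # minimal Slurm batch script for HPCL8 (placeholder values)
  #SBATCH -J my_job                  # job name (placeholder)
  #SBATCH -o ./job.out.%j            # standard output file
  #SBATCH -e ./job.err.%j            # standard error file
  #SBATCH -D ./                      # working directory
  #SBATCH --partition=p.hpcl8        # default HPCL8 partition
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=24       # all 24 cores of a p.hpcl8 node
  #SBATCH --time=24:00:00            # run time limit (hh:mm:ss)

  module purge
  module load intel impi             # modules as listed above

  srun ./my_program                  # placeholder executable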

Current Slurm configuration on HPCL8:

  • default run time: 24 hours

  • current max. run time (wallclock): 21 days

  • five partitions: p.hpcl8 (default), p.hpcl9 (Dept. Briggs only), p.hpcl91 (b_borgwardt group only), p.hpcl92 (b_mann & g_rz groups only) & p.hpcl93

  • nodes in the p.hpcl8 & p.hpcl9 partitions are allocated exclusively; multiple jobs may run on the same node only if they belong to the same user

  • nodes in p.hpcl91, p.hpcl92 & p.hpcl93 can be shared by jobs

  • default memory size per job per node: 380000 MB (p.hpcl8 partition), 1000000 MB (p.hpcl9 partition), 40000 MB (p.hpcl91 partition), 32000 MB (p.hpcl92 partition) & 38000 MB (p.hpcl93 partition)

  • max submitted jobs per user: 2000

  • max running jobs per user at one time: 200
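
To inspect the partitions and your own jobs, the standard Slurm query commands can be used; a small sketch (the sinfo format string is just one possible choice):

  # list the HPCL8 partitions with their time limits, node counts and states
  sinfo -o "%P %l %D %t"

  # show your own pending and running jobs
  squeue -u $USER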

Useful tips:

The default run time limit for jobs that do not specify a value is 24 hours. Use the --time option of sbatch/srun to set a limit on the total run time of the job allocation, but not longer than 504 hours.
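
For example, in an sbatch script (the 12-hour value is arbitrary):

  #SBATCH --time=12:00:00      # wallclock limit hh:mm:ss, at most 504:00:00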

Memory is a consumable resource. To run several jobs on one node, use the --mem=<size[units]> or --mem-per-cpu=<size[units]> options of sbatch/srun, where size should be less than the default per node (380000 MB).
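
For example (the requested sizes are arbitrary and must stay below the node default):

  #SBATCH --mem=90000M         # request 90000 MB of a node's memory
  # or, alternatively, per allocated CPU:
  #SBATCH --mem-per-cpu=4000M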

OpenMP codes require the environment variable OMP_NUM_THREADS to be set. Its value can be obtained from the Slurm environment variable $SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script (a sketch is shown below).
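
A sketch of the corresponding lines for a pure OpenMP job on one p.hpcl8 node (the executable name is a placeholder):

  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --cpus-per-task=24               # OpenMP threads per task

  # pass the Slurm allocation on to the OpenMP runtime
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  srun ./my_openmp_program                 # placeholder executable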

To use GPUs, add the --gres option to your Slurm scripts and choose how many GPUs to allocate: #SBATCH --gres=gpu:1 or #SBATCH --gres=gpu:2

Valid gres options are: gpu[[:type]:count]
where
  • type is the type of GPU (rtx5000, a40, h100 or l40s)
  • count is the number of GPUs (1 or 2 in the p.hpcl8 partition and 1-4 in the p.hpcl9, p.hpcl91 & p.hpcl93 partitions)
GPU cards run in the default compute mode.
GPU cards on the hpcl[9101-9103] nodes are partitioned with MIG (Multi-Instance GPU).
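
For example, to request all four L40s GPUs of a p.hpcl93 node (partition, GPU type and count have to match the hardware listed above):

  #SBATCH --partition=p.hpcl93
  #SBATCH --gres=gpu:l40s:4      # four L40s GPUs on one node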

Support:

For support please create a trouble ticket at the MPCDF helpdesk.