Radioastronomy

Name of the Linux cluster: HERCULES
Institution: Max Planck Institute for Radio Astronomy

Hardware Configuration:

2 login nodes hercules[11-12]:
- 2 x Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
- 24 cores per node
- hyper-threading disabled (1 thread per core)
- 188 GB RAM
32 execution nodes hc[201-232] for parallel computing (1536 CPU cores in total):
- 2 x Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
- 48 cores per node
- hyper-threading disabled (1 thread per core)
- 377 GB RAM
54 execution nodes hcg[001-054] for parallel GPU computing (2592 CPU cores in total):
- 2 x Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
- 48 cores per node
- hyper-threading disabled (1 thread per core)
- 377 GB RAM
- 3 x Quadro RTX 6000 GPUs per node
node interconnect: 25 Gb/s Ethernet

Filesystems:

/u: shared home filesystem; quota of 1 TB of data and 600K files; the quota can be checked with ‘/usr/lpp/mmfs/bin/mmlsquota’
/mkfs: dedicated project area for selected users - NO BACKUPS!
/hercules: dedicated project area - NO BACKUPS!
/scratch: dedicated scratch area for all users - NO BACKUPS!
/mandap: incoming data from Bonn (only available on login nodes) - NO BACKUPS!

Software Configuration:

The “module” subsystem is available on the HERCULES cluster. Use ‘module available’ to list all available modules.

Intel compilers (-> ‘module load intel/19.1.3’): icc, icpc, ifort
GNU compilers (-> ‘module load gcc’): gcc, g++, gfortran
Intel MKL (‘module load mkl’): $MKL_HOME defined; libraries found in $MKL_HOME/lib/intel64
Intel MPI 2019.9 (‘module load impi/2019.9’): mpicc, mpigcc, mpiicc, mpiifort, mpiexec, …

Similar to the HPC systems, this module tree is hierarchical.

To find a module, its available versions, and the dependencies that must be loaded first, use the ‘find-module’ command.
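
As an illustration of the hierarchy, the following minimal sketch loads the compiler first and then the MPI library and MKL. The module versions are the ones listed above; the load order and the guard around the module command are assumptions, so check ‘find-module’ for the actual dependencies.

```shell
# Sketch only: assumes the module versions listed above are installed.
if type module >/dev/null 2>&1; then
    module purge                    # start from a clean environment
    module load intel/19.1.3        # compiler first (top of the hierarchy)
    module load impi/2019.9 mkl     # then libraries built on top of the compiler
    module list                     # show what is loaded
    MODULES_AVAILABLE=yes
else
    # Not on a system with the module subsystem (e.g. a local workstation)
    MODULES_AVAILABLE=no
    echo "module command not found; run this on a HERCULES login node"
fi
```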

Batch system based on Slurm:

sbatch, srun, squeue, sinfo, scancel, scontrol, s*
seven partitions:
- short.q (default), long.q, gpu.q for serial jobs on shared nodes
- parallel.q for multi-node parallel hybrid MPI/OpenMP jobs; nodes are allocated exclusively
- gpu42cores.q only for serial CPU jobs, with up to 42 cores and 50% of the RAM per node
- gpu6cores.q only for serial GPU jobs, with up to 6 cores and 50% of the RAM per node
- interactive.q for debugging serial/parallel jobs (currently disabled)
sample batch scripts can be found on the Cobra home page (they must be adapted for HERCULES)
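
A minimal sketch of a hybrid MPI/OpenMP job for parallel.q is shown below. The job name, node count, task layout and program name (./my_hybrid_program) are hypothetical; 4 MPI tasks with 12 threads each is just one way to fill the 48 cores of an exclusively allocated node.

```shell
#!/bin/bash -l
#SBATCH -J hybrid_test            # hypothetical job name
#SBATCH -p parallel.q             # nodes are allocated exclusively
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4       # MPI tasks per node
#SBATCH --cpus-per-task=12        # OpenMP threads per MPI task
#SBATCH --time=12:00:00

# Outside Slurm these variables are unset, so fall back to the values above.
TASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-4}
CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-12}
CORES_USED=$((TASKS_PER_NODE * CPUS_PER_TASK))
echo "Using $CORES_USED of 48 cores per node"

export OMP_NUM_THREADS=$CPUS_PER_TASK

# Launch the (hypothetical) program once the required modules are loaded:
# srun ./my_hybrid_program
```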

Slurm partition	short.q (default)	long.q	gpu.q	parallel.q	gpu42cores.q	gpu6cores.q	interactive.q
number of nodes	86	32	54	32	54	54	4
hostnames	hc[201-232] hcg[001-054]	hc[201-232]	hcg[001-054]	hc[201-232]	hcg[001-054]	hcg[001-054]	hc,hcg
default run time limit	4 hours	24 hours	24 hours	48 hours	24 hours	24 hours	2 hours
maximum run time limit	4 hours	240 hours	240 hours	240 hours	240 hours	240 hours	12 hours
default memory per node	8000 MB	8000 MB	120000 MB	370000 MB	4000 MB	60000 MB	8000 MB
maximum memory per node	370000 MB	370000 MB	370000 MB	370000 MB	185000 MB	185000 MB	370000 MB
maximum nodes per job	1	1	1	32	1	1	2
maximum cpus per node	48	48	48	48	42	6	48
execute more than 1 job at a time on each node	Yes	Yes	Yes, max. 3 jobs	No	Yes	Yes, max. 3 jobs	Yes
gpus per node	--	--	3	--	--	3	0(hc),3(hcg)

Useful tips:

The default run time limit for jobs that specify neither a time limit nor a partition is 4 hours. Use the --time option of sbatch/srun to set a limit on the total run time of the job allocation; it must not exceed 10 days on the long.q, gpu.q and parallel.q partitions.

The default memory per job in the serial partitions is 8000M. To give a job access to all of the memory on each node, use the --mem=0 option of sbatch/srun.
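
For illustration, both options can be given on the sbatch command line. The sketch below converts a run time in hours into Slurm's days-hours:minutes:seconds format and prints the resulting command instead of submitting it; the 100-hour value and the script name job.sh are hypothetical.

```shell
# Hypothetical example: request 100 hours and all memory of a node on long.q.
HOURS=100                          # must not exceed 240 (10 days)
TIME_LIMIT=$(printf '%d-%02d:00:00' $((HOURS / 24)) $((HOURS % 24)))

# Print the command instead of submitting, so it can be checked first.
echo "sbatch -p long.q --time=$TIME_LIMIT --mem=0 job.sh"
```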

OpenMP codes require the variable OMP_NUM_THREADS to be set. Its value can be taken from the Slurm environment variable $SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script (an example is on the help information page). Exporting OMP_PLACES=cores can also be useful.
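
A minimal sketch of these settings in an sbatch script follows. The job name, thread count and program name are hypothetical, and the fallback to one thread is an assumption added so that the script also behaves sensibly outside a Slurm allocation.

```shell
#!/bin/bash -l
#SBATCH -J omp_test               # hypothetical job name
#SBATCH -p short.q
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8         # number of OpenMP threads
#SBATCH --time=01:00:00

# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given;
# fall back to a single thread otherwise.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OMP_PLACES=cores           # one place per physical core

echo "OpenMP threads: $OMP_NUM_THREADS"

# Launch the (hypothetical) program:
# srun ./my_openmp_program
```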

To run GPU codes, add the options -p gpu.q and --gres=gpu:N, where N is the number of GPUs (minimum 1, maximum 3), to your batch scripts:
#SBATCH -p gpu.q
#SBATCH --gres=gpu:1
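
Put together, a complete GPU job script might look like the following sketch. The job name, time limit and program name are hypothetical. It also assumes that Slurm exports CUDA_VISIBLE_DEVICES for the GPUs granted via --gres, which is the usual behaviour but is not stated on this page.

```shell
#!/bin/bash -l
#SBATCH -J gpu_test               # hypothetical job name
#SBATCH -p gpu.q
#SBATCH --gres=gpu:1              # 1 to 3 GPUs per node
#SBATCH --time=02:00:00

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES inside a GPU allocation.
GPU_LIST=${CUDA_VISIBLE_DEVICES:-none}
echo "GPUs visible to this job: $GPU_LIST"

# Launch the (hypothetical) program:
# srun ./my_gpu_program
```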

Support:

For support, please create a trouble ticket at the MPCDF helpdesk.