Astrophysics FREYA
- Name of the cluster:
FREYA
- Institution:
Max Planck Institute for Astrophysics
Login nodes:
freya[01-04]
Hardware Configuration:
login nodes freya[01-04]: 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 40 cores per node; 384 GB RAM
100 execution nodes freya[073-104,109-176] for parallel computing with a total of 6880 CPU cores; 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 192 GB RAM
4 execution nodes freya[105-108] for parallel computing with a total of 160 CPU cores; 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 384 GB RAM
8 execution nodes freyag[01-08] for parallel GPU computing with a total of 320 CPU cores; 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 384 GB RAM; 2 x Nvidia Tesla P100-PCIE-16GB GPUs per node
4 execution nodes freyag[09-12] for parallel GPU computing with a total of 160 CPU cores; 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 384 GB RAM; 2 x Nvidia Tesla V100-PCIE-32GB GPUs per node
11 execution nodes freyag[201-211] for parallel GPU computing with a total of 480 CPU cores; 2 x Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz; 384 GB RAM; 4 x Nvidia A100-PCIE-40GB GPUs per node
node interconnect is based on Intel Omni-Path Fabric (speed: 100 Gb/s)
Filesystems:
- /u
shared home filesystem; GPFS-based; user quotas (currently 900 GB, 1M files) enforced; quotas can be checked with ‘/usr/lpp/mmfs/bin/mmlsquota’ (see the example after this list).
- /freya/ptmp
shared scratch filesystem (1.7 PB); GPFS-based; no quotas enforced. NO BACKUPS!
- /virgotng
shared scratch filesystem (8.0 PB); GPFS-based; no quotas enforced. NO BACKUPS!
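A minimal example of checking the home quota with the GPFS tool named above (the exact output columns depend on the installed GPFS version):
  # report block and file quotas for the current user on all mounted GPFS filesystems
  /usr/lpp/mmfs/bin/mmlsquota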
Compilers and Libraries:
The “module” subsystem is implemented on FREYA. Please use ‘module avail’ to see all available modules. A minimal compile sketch follows this list.
Intel compilers (-> ‘module load intel’): icc, icpc, ifort
GNU compilers (-> ‘module load gcc’): gcc, g++, gfortran
Intel MKL (‘module load mkl’): $MKL_HOME defined; libraries found in $MKL_HOME/lib/intel64
Intel MPI 2017.4 (‘module load impi’): mpicc, mpigcc, mpiicc, mpiifort, mpiexec, …
GPGPU computing (‘module load cuda’): nvcc, …
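A minimal compile sketch using the modules listed above; the source files (hello_mpi.c, saxpy.cu) are placeholders and the exact module versions may differ:
  # Intel compiler + Intel MPI + MKL (linked via the single dynamic MKL library)
  module load intel impi mkl
  mpiicc -O2 -o hello_mpi hello_mpi.c -L$MKL_HOME/lib/intel64 -lmkl_rt
  # CUDA toolchain for the GPU nodes
  module load cuda
  nvcc -O2 -o saxpy saxpy.cu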
Batch system based on Slurm:
sbatch, srun, squeue, sinfo, scancel, scontrol, s*
current max. turnaround time (wallclock): 24 hours
max. node limit per user: 92
four partitions: p.24h (default), p.test, p.gpu & p.gpu.ampere
p.test partition: 4 nodes with 2 Nvidia Pascal GPUs each; max. run time 30 minutes
sample batch scripts can be found on the Cobra home page (they must be modified for FREYA); a minimal sketch follows below
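A minimal sketch of a CPU batch script for the default partition; the job name, modules, and executable are placeholders:
  #!/bin/bash -l
  #SBATCH -J my_job                # placeholder job name
  #SBATCH -p p.24h                 # default partition, 24 h wallclock limit
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=40     # one MPI task per core on the 40-core nodes
  #SBATCH --time=24:00:00
  module load intel impi
  srun ./my_program                # placeholder executable
Submit the script with sbatch, monitor it with ‘squeue -u $USER’, and cancel it with ‘scancel <jobid>’.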
Useful tips:
Nodes in the p.test partition are in shared mode; the default memory per job is set to 9500 MB. To allocate the necessary amount of memory, use the --mem parameter, as in the sketch below.
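A minimal sketch of a short test job in p.test; the task count, memory request, and executable are placeholders:
  #!/bin/bash -l
  #SBATCH -p p.test
  #SBATCH --time=00:30:00          # p.test allows at most 30 minutes
  #SBATCH --ntasks=4
  #SBATCH --mem=20000              # placeholder: request 20000 MB instead of the 9500 MB default
  srun ./my_test_program           # placeholder executable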
Nvidia Pascal and Volta GPUs are available in the p.gpu partition. To use them, add #SBATCH -p p.gpu to your Slurm script and choose how many GPUs to allocate with #SBATCH --gres=gpu:1 or #SBATCH --gres=gpu:2. To request Pascal or Volta GPUs specifically, add the GPU type to the --gres parameter: --gres=gpu:p100:1 or --gres=gpu:v100:2 (see the sketch below).
Nodes in the p.gpu partition are in exclusive mode, i.e. jobs allocate entire nodes.
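A minimal sketch of a GPU job in p.gpu, assuming one node with two Volta cards; the module version and executable are placeholders:
  #!/bin/bash -l
  #SBATCH -p p.gpu
  #SBATCH --nodes=1                # exclusive mode: the whole node is allocated
  #SBATCH --gres=gpu:v100:2        # both Volta cards; use gpu:p100:N for the Pascal nodes
  #SBATCH --time=24:00:00
  module load cuda
  srun ./my_gpu_program            # placeholder executable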
Nvidia Ampere GPUs are available in the p.gpu.ampere partition. The GPU type must be set explicitly, i.e. --gres=gpu:a100:X, where X is between 1 and 4.
Nodes in the p.gpu.ampere partition are in shared mode, i.e. jobs allocate only the requested resources. The default memory per job is 95000 MB. Use the --mem parameter to set the necessary amount of RAM for the job (see the sketch below).
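A minimal sketch of a shared-mode job on the Ampere nodes; the GPU count, memory request, and executable are placeholders:
  #!/bin/bash -l
  #SBATCH -p p.gpu.ampere
  #SBATCH --gres=gpu:a100:2        # the GPU type must be given explicitly on this partition
  #SBATCH --mem=120000             # placeholder: overrides the 95000 MB per-job default
  #SBATCH --time=24:00:00
  module load cuda
  srun ./my_gpu_program            # placeholder executable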
GPU cards are in default compute mode.
To run code on nodes with different memory capacities (mem192G; mem384G), use the --constraint option in the sbatch script: #SBATCH --constraint=mem192G or #SBATCH --constraint=mem384G
To check node features, generic resources, and scheduling weights of the nodes, use: sinfo -O nodelist,features:25,gres,weight
Support:
For support, please create a trouble ticket at the MPCDF helpdesk.