Astrophysics FREYA
- Name of the cluster:
FREYA
- Institution:
Max Planck Institute for Astrophysics
Login nodes:
freya[01-04]
Hardware Configuration:
login nodes freya[01-04]: 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 40 cores per node; 384 GB RAM
100 execution nodes freya[073-104,109-176] for parallel computing with a total of 6880 CPU cores; 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 192 GB RAM
4 execution nodes freya[105-108] for parallel computing with a total of 160 CPU cores; 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 384 GB RAM
8 execution nodes freyag[01-08] for parallel GPU computing with a total of 320 CPU cores; 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 384 GB RAM; 2 x Nvidia Tesla P100-PCIE-16GB GPUs per node
4 execution nodes freyag[09-12] for parallel GPU computing with a total of 160 CPU cores; 2 x Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 384 GB RAM; 2 x Nvidia Tesla V100-PCIE-32GB GPUs per node
11 execution nodes freyag[201-211] for parallel GPU computing with a total of 480 CPU cores; 2 x Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz; 384 GB RAM; 4 x Nvidia A100-PCIE-40GB GPUs per node
node interconnect is based on Intel Omni-Path Fabric (speed: 100 Gb/s)
Filesystems:
- /u
shared home filesystem; GPFS-based; user quotas (currently 900 GB, 1M files) enforced; quotas can be checked with ‘/usr/lpp/mmfs/bin/mmlsquota’ (see the example after this list).
- /freya/ptmp
shared scratch filesystem (1.7 PB); GPFS-based; no quotas enforced. NO BACKUPS!
- /virgotng
shared scratch filesystem (8.0 PB); GPFS-based; no quotas enforced. NO BACKUPS!
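A minimal example of checking the home quota with the GPFS tool named above (the exact output columns depend on the installed GPFS version):
  # report block and file quotas for the current user on all mounted GPFS filesystems
  /usr/lpp/mmfs/bin/mmlsquota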
Compilers and Libraries:
The “module” subsystem is implemented on FREYA. Please use ‘module avail’ to see all available modules. A minimal compile sketch follows this list.
Intel compilers (-> ‘module load intel’): icc, icpc, ifort
GNU compilers (-> ‘module load gcc’): gcc, g++, gfortran
Intel MKL (‘module load mkl’): $MKL_HOME defined; libraries found in $MKL_HOME/lib/intel64
Intel MPI 2017.4 (‘module load impi’): mpicc, mpigcc, mpiicc, mpiifort, mpiexec, …
GPGPU computing (‘module load cuda’): nvcc, …
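A minimal compile sketch using the modules listed above; the source files (hello_mpi.c, saxpy.cu) are placeholders and the exact module versions may differ:
  # Intel compiler + Intel MPI + MKL (linked via the single dynamic MKL library)
  module load intel impi mkl
  mpiicc -O2 -o hello_mpi hello_mpi.c -L$MKL_HOME/lib/intel64 -lmkl_rt
  # CUDA toolchain for the GPU nodes
  module load cuda
  nvcc -O2 -o saxpy saxpy.cu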
Batch system based on Slurm:
sbatch, srun, squeue, sinfo, scancel, scontrol, s*
current max. turnaround time (wallclock): 24 hours
max. node limit per user: 92
four partitions: p.24h (default), p.test, p.gpu & p.gpu.ampere
p.test partition: 4 nodes with 2 Nvidia Pascal GPUs each; max. run time 30 minutes
sample batch scripts can be found on the Cobra home page (they must be modified for FREYA); a minimal sketch follows below
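A minimal sketch of a CPU batch script for the default partition; the job name, modules, and executable are placeholders:
  #!/bin/bash -l
  #SBATCH -J my_job                # placeholder job name
  #SBATCH -p p.24h                 # default partition, 24 h wallclock limit
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=40     # one MPI task per core on the 40-core nodes
  #SBATCH --time=24:00:00
  module load intel impi
  srun ./my_program                # placeholder executable
Submit the script with sbatch, monitor it with ‘squeue -u $USER’, and cancel it with ‘scancel <jobid>’.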
Useful tips:
Nodes in the p.test partition are in shared mode; the default memory per job is set to 9500 MB. To allocate the necessary amount of memory, use the --mem parameter, as in the sketch below.
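A minimal sketch of a short test job in p.test; the task count, memory request, and executable are placeholders:
  #!/bin/bash -l
  #SBATCH -p p.test
  #SBATCH --time=00:30:00          # p.test allows at most 30 minutes
  #SBATCH --ntasks=4
  #SBATCH --mem=20000              # placeholder: request 20000 MB instead of the 9500 MB default
  srun ./my_test_program           # placeholder executable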
Nvidia Pascal and Volta GPUs are available in the p.gpu partition. To use them, add #SBATCH -p p.gpu to your Slurm script and choose how many GPUs to allocate with #SBATCH --gres=gpu:1 or #SBATCH --gres=gpu:2. To request Pascal or Volta GPUs specifically, add the GPU type to the --gres parameter: --gres=gpu:p100:1 or --gres=gpu:v100:2 (see the sketch below).
Nodes in the p.gpu partition are in exclusive mode, i.e. jobs allocate entire nodes.
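A minimal sketch of a GPU job in p.gpu, assuming one node with two Volta cards; the module version and executable are placeholders:
  #!/bin/bash -l
  #SBATCH -p p.gpu
  #SBATCH --nodes=1                # exclusive mode: the whole node is allocated
  #SBATCH --gres=gpu:v100:2        # both Volta cards; use gpu:p100:N for the Pascal nodes
  #SBATCH --time=24:00:00
  module load cuda
  srun ./my_gpu_program            # placeholder executable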
Nvidia Ampere GPUs are available in the p.gpu.ampere partition. The GPU type must be set explicitly, i.e. --gres=gpu:a100:X, where X is between 1 and 4.
Nodes in the p.gpu.ampere partition are in shared mode, i.e. jobs allocate only the requested resources. The default memory per job is 95000 MB. Use the --mem parameter to set the necessary amount of RAM for the job (see the sketch below).
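A minimal sketch of a shared-mode job on the Ampere nodes; the GPU count, memory request, and executable are placeholders:
  #!/bin/bash -l
  #SBATCH -p p.gpu.ampere
  #SBATCH --gres=gpu:a100:2        # the GPU type must be given explicitly on this partition
  #SBATCH --mem=120000             # placeholder: overrides the 95000 MB per-job default
  #SBATCH --time=24:00:00
  module load cuda
  srun ./my_gpu_program            # placeholder executable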
GPU cards are in default compute mode.
To run code on nodes with different memory capacities (mem192G; mem384G), use the --constraint option in the sbatch script: #SBATCH --constraint=mem192G or #SBATCH --constraint=mem384G
To check node features, generic resources, and scheduling weights of the nodes, use: sinfo -O nodelist,features:25,gres,weight
Support:
For support, please create a trouble ticket at the MPCDF helpdesk.