Psychiatry PIROL


Name of the cluster:

PIROL

Institution:

Max Planck Institute of Psychiatry

Login nodes:

  • pirol01.hpccloud.mpcdf.mpg.de

Hardware-Configuration:

Login node pirol01:

  • CPU Model: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz

  • 1 socket

  • 18 cores per socket

  • no hyper-threading (1 thread per core)

  • 120 GB RAM

6 CPU execution nodes pirolc[001-006]:

  • CPU Model: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz

  • 1 socket

  • 12 cores per socket

  • no hyper-threading (1 thread per core)

  • 80 GB RAM

6 GPU execution nodes pirolg[001-006]:

  • CPU Model: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz

  • 1 socket

  • 10 cores per socket

  • 400 GB RAM

  • 1 x Nvidia A40

The node interconnect is based on 10 Gb/s Ethernet.

Filesystems:

/u

shared home filesystem with the user home directory in /u/<username>; user quotas (currently 200 GB, 250k files) are enforced. Quotas can be checked using the quota command (e.g. quota --show-mntpoint --hide-device -f /pirol/u).

/nexus/posix0/MPI-psych

shared scratch filesystem with user directory in /nexus/posix0/MPI-psych/<username>

Compilers and Libraries:

Hierarchical environment modules are used at MPCDF to provide software packages and to enable switching between different software versions. There are no modules preloaded on PIROL. Users have to specify the required modules with explicit versions at login and during the startup of a batch job. Not all software modules are displayed immediately by the module avail command; for some, a compiler and/or MPI module has to be loaded first. You can search the full hierarchy of the installed software modules with the find-module command.
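
For example, loading a compiler and MPI module first exposes the dependent software in the module hierarchy. The module names and versions below are only placeholders; use find-module to see what is actually installed on PIROL:

  module purge                        # start from a clean environment
  module load gcc/12 openmpi/4        # example compiler and MPI modules; adjust names/versions
  module avail                        # now also lists software built for this compiler/MPI stack
  find-module hdf5                    # search the full module hierarchy for a package (example name)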

Batch system based on Slurm:

A brief introduction to the basic commands (srun, sbatch, squeue, scancel, sinfo, …) can be found on the Raven home page or in the Slurm handbook.
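
For a quick start, the most common commands are sketched below (the script name and job id are placeholders):

  sinfo                               # show partitions and node states
  sbatch job.sh                       # submit the batch script job.sh
  squeue -u $USER                     # list your own pending and running jobs
  scancel <jobid>                     # cancel a job by its job id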

Current Slurm configuration on PIROL:

  • two partitions: c.pirol (default), g.pirol (for gpu jobs and high memory cpu jobs)

  • maximum run time (wallclock): 11 days; default run time: 24 hours

  • default memory per node for jobs: c.pirol (10000 MB), g.pirol (39600 MB)

  • c.pirol, g.pirol: resources on the nodes may be shared between jobs

  • g.pirol partition: to access GPU resources, the --gres parameter must be set explicitly for jobs

  • sample batch scripts can be found on the Raven home page (they must be modified for PIROL); a minimal sketch adapted to PIROL is shown below
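
A minimal sketch of a serial CPU batch script for PIROL; the job name, module name/version, and executable are placeholders and need to be adjusted:

  #!/bin/bash -l
  #SBATCH -J my_job                   # job name (placeholder)
  #SBATCH -o ./out.%j                 # standard output file
  #SBATCH -e ./err.%j                 # standard error file
  #SBATCH --partition=c.pirol         # default CPU partition
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --mem=10000M                # matches the c.pirol default
  #SBATCH --time=24:00:00             # wallclock limit (max. 11 days)

  module purge
  module load gcc/12                  # example module; adjust to your software

  srun ./my_program                   # placeholder executable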

Useful tips:

The default run time limit for jobs that do not specify a value is 24 hours. Use the --time option of sbatch/srun to set a limit on the total run time of the job allocation, up to a maximum of 11 days.
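
For example (the time values and the executable are placeholders):

  #SBATCH --time=2-00:00:00           # 2 days wallclock in a batch script; must not exceed 11 days
  srun --time=04:00:00 ./my_program   # or directly on the srun command line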

The default memory per node for jobs is 10000 MB on c.pirol and 39600 MB on g.pirol. To grant the job access to all of the memory on each node, use the --mem=0 option for sbatch/srun.
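
For example, in a batch script:

  #SBATCH --mem=0                     # request all memory available on each allocated node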

OpenMP codes require the environment variable OMP_NUM_THREADS to be set. Its value can be taken from the Slurm environment variable SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script.
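
A sketch of the relevant lines in an sbatch script (the thread count and the executable are placeholders):

  #SBATCH --cpus-per-task=8           # example thread count

  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  srun ./my_openmp_program            # placeholder executable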

To run code with memory limits other than the defaults, choose the appropriate partition with the --partition option (c.pirol with a maximum of 72000 MB per node, g.pirol with a maximum of 396000 MB per node) and request the required memory with --mem in an sbatch script: #SBATCH --partition=c.pirol or #SBATCH --partition=g.pirol
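
For example, a job needing more memory than the c.pirol maximum could request (the memory value is only an illustration):

  #SBATCH --partition=g.pirol         # high-memory partition (max. 396000 MB per node)
  #SBATCH --mem=200000M               # example request above the c.pirol maximum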

To use GPUs, add the --gres option to your Slurm scripts and choose how many GPUs and which model to request, e.g. #SBATCH --gres=gpu:a40:1
Valid gres options are: gpu[[:type]:count]
where
type is the type of GPU (a40)
count is the number of GPUs per node (1)
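
A minimal sketch of a GPU batch script; the CUDA module version, time limit, and executable are placeholders:

  #SBATCH --partition=g.pirol         # GPU nodes are in the g.pirol partition
  #SBATCH --gres=gpu:a40:1            # request one Nvidia A40
  #SBATCH --cpus-per-task=10          # example: all CPU cores of a GPU node
  #SBATCH --time=08:00:00             # example wallclock limit

  module purge
  module load cuda/11.6               # example module version; check find-module cuda
  srun ./my_gpu_program               # placeholder executable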

GPU cards are in default compute mode.

Support:

For support, please create a trouble ticket at the MPCDF helpdesk.