Biophysics PHYS


Name of the cluster:

PHYS

Institution:

Max Planck Institute of Biophysics

Login nodes:

  • phys11.bc.mpcdf.mpg.de

  • phys12.bc.mpcdf.mpg.de

Hardware Configuration:

2 login nodes phys[11-12]:

  • 2 x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz
  • 56 cores per node
  • hyper-threading enabled (2 threads per core)
  • 186 GB RAM
  • 3 x Quadro RTX 6000 GPUs per node

238 execution nodes physg[201-438]:

  • 2 x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz
  • 56 cores per node
  • hyper-threading enabled (2 threads per core)
  • 186 GB RAM
  • 3 x Quadro RTX 6000 GPUs per node

Node interconnect:

  • Mellanox Technologies InfiniBand fabric (speed: 100 Gb/s)

Filesystems:

GPFS-based, with a total size of 820 TB and an independent inode space for each of the following filesets:

/u

shared home filesystem; GPFS-based; user quotas (3 TB data, 500k files) enforced; the quota can be checked with ‘/usr/lpp/mmfs/bin/mmlsquota’ (see the example after the fileset list).

/phys/ptmp

shared filesystem for temporary files; GPFS-based; no quotas enforced. NO BACKUPS!

/phys/scratch

shared scratch filesystem; GPFS-based; no quotas enforced. NO BACKUPS!
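
For example, the quota on the home fileset can be checked from a login node; the ‘--block-size auto’ option, which prints human-readable sizes, is assumed to be available in the installed GPFS release:

  # show the current usage and quota limits for your user
  /usr/lpp/mmfs/bin/mmlsquota --block-size auto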

Compilers and Libraries:

The “module” subsystem is implemented on PHYS. Please use ‘module available’ to see all available modules.

  • Intel compilers (-> ‘module load intel/19.1.3’): icc, icpc, ifort

  • GNU compilers (-> ‘module load gcc/10’): gcc, g++, gfortran

  • Intel MKL (-> ‘module load mkl’): $MKL_HOME defined; libraries found in $MKL_HOME/lib/intel64

  • Intel MPI 2019.9 (-> ‘module load impi/2019.9’): mpicc, mpigcc, mpiicc, mpiifort, mpiexec, …

  • CUDA (-> ‘module load cuda’)

  • Python (-> ‘module load anaconda’): python

To find a module, information about its available versions, or which dependencies need to be loaded first, use the ‘find-module’ command.
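
As an illustration, an MPI code could be built with the Intel tool chain roughly as follows (module versions as listed above; the source file hello_mpi.c and the MKL link line are placeholders):

  # load compiler, MPI and MKL environments
  module purge
  module load intel/19.1.3 impi/2019.9 mkl

  # compile an MPI C program with the Intel compiler wrapper
  mpiicc -O2 -o hello_mpi hello_mpi.c

  # if MKL is needed, link against the libraries in $MKL_HOME/lib/intel64, e.g.
  # mpiicc -O2 -o hello_mpi hello_mpi.c -L$MKL_HOME/lib/intel64 -lmkl_rt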

Batch system based on Slurm

The batch system on PHYS is the Slurm Workload Manager. A brief introduction to the basic commands (srun, sbatch, squeue, scancel, …) can be found on the Cobra home page. For more detailed information, see the Slurm handbook. See also the sample batch scripts, which must be adapted for the PHYS cluster (a minimal sketch is shown after the configuration list below).

Current Slurm configuration on PHYS:

  • default turnaround time: 2 hours

  • current max. turnaround time (wallclock): 24 hours

  • the p.phys partition includes all batch nodes in exclusive usage and is the default

  • the s.phys partition can be used for serial jobs and can be shared

  • the l.phys partition is shared and intended for long-running (up to 5 days) serial jobs (<= 5 cores per job; <= 224 cores in total)
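
A minimal sketch of a batch script for the exclusive p.phys partition, assuming an MPI code (job name, output files and the executable my_mpi_program are placeholders; adjust node and task counts to your application):

  #!/bin/bash -l
  #SBATCH -J my_job                  # job name (placeholder)
  #SBATCH -o ./job.out.%j            # standard output file
  #SBATCH -e ./job.err.%j            # standard error file
  #SBATCH --partition=p.phys         # exclusive default partition
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=56       # one MPI rank per physical core
  #SBATCH --time=12:00:00            # wallclock limit, max. 24 hours

  module purge
  module load intel/19.1.3 impi/2019.9

  srun ./my_mpi_program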

Useful tips:

The default run time limit for jobs that do not specify a value is 2 hours. Use the --time option of sbatch/srun to set a limit on the total run time of the job allocation, up to a maximum of 24 hours.
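
For example (the 8-hour value is only illustrative):

  #SBATCH --time=08:00:00    # wallclock limit for this job, must not exceed 24:00:00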

The default memory per node in the shared partition is 62000 MB. To grant the job access to all of the memory on each node, use the --mem=0 option of sbatch/srun.
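
For example, in a job script for the shared partition:

  #SBATCH --partition=s.phys
  #SBATCH --mem=0            # grant the job all of the memory on the node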

OpenMP codes require the environment variable OMP_NUM_THREADS to be set. Its value can be obtained from the Slurm environment variable $SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script (an example can also be found on the help information page).
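
A minimal sketch for an OpenMP job in the shared partition (partition, thread count and executable name are illustrative):

  #SBATCH --partition=s.phys
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8          # number of OpenMP threads

  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  srun ./my_openmp_program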

To use GPUs, add the --gres option to your Slurm scripts and choose how many GPUs (and which model) to request: #SBATCH --gres=gpu:rtx6000:X, where X is the number of GPUs (1, 2 or 3).
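
For example, to request two of the Quadro RTX 6000 GPUs of a node (the executable name is a placeholder):

  #SBATCH --gres=gpu:rtx6000:2       # request 2 GPUs on the node

  module load cuda
  srun ./my_gpu_program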

GPU cards are in default compute mode.

Support:

For support, please create a trouble ticket at the MPCDF helpdesk.