Biophysics LEO


Name of the cluster:

LEO

Institution:

Max Planck Institute of Biophysics

Login nodes:

  • leo01.bc.mpcdf.mpg.de

  • leo02.bc.mpcdf.mpg.de

Hardware Configuration:

2 login nodes leo[01-02]
2 x AMD EPYC 9454 48-Core Processor @ 2.75 GHz
96 cores per node
hyper-threading enabled - 2 threads per core
755 GB RAM
11 execution nodes leo[001-011]
2 x AMD EPYC 9454 48-Core Processor @ 2.75 GHz
96 cores per node
hyper-threading enabled - 2 threads per core
755 GB RAM
8 execution nodes leog[101-108]
2 x AMD EPYC 9554 64-Core Processor @ 3.10 GHz
128 cores per node
hyper-threading enabled - 2 threads per core
755 GB RAM
8 x NVIDIA L40S GPUs per node
3 execution nodes leog[201-203]
2 x AMD EPYC 9554 64-Core Processor @ 3.10 GHz
64 cores per node
hyper-threading disabled - 1 thread per core
1.5 TB RAM
4 x NVIDIA H100 GPUs per node
node interconnect

based on Mellanox Technologies InfiniBand fabric (speed: 200 Gb/s)

Filesystems:

GPFS-based with a total size of 7.5 PB and independent inode space for the following filesets:

/u

shared home filesystem; GPFS-based; user quotas (250 GB data, 512k files) enforced; quota can be checked with ‘/usr/lpp/mmfs/bin/mmlsquota’. NO BACKUPS YET
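
For example, running the command on a login node reports your current usage and limits on the home filesystem (the exact output format depends on the installed GPFS version):

    /usr/lpp/mmfs/bin/mmlsquota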

/leo/work

shared scratch filesystem; GPFS-based; no quotas enforced. NO BACKUPS !

/leo/data

shared scratch filesystem; GPFS-based; no quotas enforced. NO BACKUPS !

/cryo/*

available only on the login nodes leo[01-02]; mounted read-only

Compilers and Libraries:

Hierarchical environment modules are used at MPCDF to provide software packages and enable switching between different software versions. Users have to specify the needed modules with explicit versions at login and during the startup of a batch job. Not all software modules are displayed immediately by the module avail command; for some, you first need to load a compiler and/or MPI module. You can search the full hierarchy of the installed software modules with the find-module command.
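
A minimal sketch of a typical module workflow (the package names and versions below are placeholders, not necessarily what is installed on LEO):

    # search the full module hierarchy for a package
    find-module gromacs

    # load a compiler and an MPI module first so that dependent modules become visible
    module load gcc/13 openmpi/4.1
    module avail

    # then load the application module built for this compiler/MPI combination
    module load gromacs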

Batch system based on Slurm

The batch system on LEO is the Slurm Workload Manager. A brief introduction to the basic commands (srun, sbatch, squeue, scancel, …) can be found on the Raven home page. For more detailed information, see the Slurm handbook. See also the sample batch scripts, which must be adapted for the LEO cluster.
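
For quick reference, a typical command sequence looks like this (script name and job ID are placeholders):

    sbatch my_job.sh          # submit a batch script; Slurm prints the job ID
    squeue -u $USER           # list your pending and running jobs
    scancel 123456            # cancel the job with the given ID
    srun --pty --partition=s.leo --time=01:00:00 bash   # interactive shell on a shared node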

Current Slurm configuration on LEO:

  • default turnaround time: 2 hours

  • current max. turnaround time (wallclock): 24 hours

  • the p.leo partition includes all batch nodes in exclusive usage and is the default

  • the s.leo partition can be used for serial jobs and its nodes can be shared (see the example after this list)
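
For example, a partition can be selected in a batch script as follows (a sketch only; the remaining resource options still need to be set):

    #SBATCH --partition=p.leo    # exclusive batch nodes (default)
    # or, for a small serial job on the shared partition:
    #SBATCH --partition=s.leo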

Useful tips:

By default, the run time limit for jobs that don’t specify a value is 2 hours. Use the --time option of sbatch/srun to set a limit on the total run time of the job allocation, but not longer than 24 hours.
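
For example, to request 8 hours of wallclock time (any value up to the 24-hour limit is accepted):

    #SBATCH --time=08:00:00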

The default memory per node in the shared partition is 47000 MB. To grant the job access to all of the memory on each node, use the --mem=0 option for sbatch/srun.
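
For example:

    #SBATCH --mem=0    # grant the job all of the memory on each allocated node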

OpenMP codes require the environment variable OMP_NUM_THREADS to be set. Its value can be obtained from the Slurm environment variable $SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script (an example is given on the help information page).
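
A minimal sketch of such an sbatch script (job name, partition, time, and program name are placeholders; 96 matches the core count of the CPU-only execution nodes):

    #!/bin/bash -l
    #SBATCH --job-name=omp_job
    #SBATCH --partition=p.leo
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=96
    #SBATCH --time=02:00:00

    # pass the Slurm CPU allocation on to the OpenMP runtime
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./my_openmp_program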

To use GPUs, add the --gres option to your Slurm scripts and choose how many GPUs and/or which model to request: #SBATCH --gres=gpu:type:X, where type is either l40s or h100 and X is the number of GPUs (1-8 for l40s and 1-4 for h100).
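
For example (one GPU node, counts within the limits given above):

    #SBATCH --gres=gpu:l40s:2    # two L40S GPUs on an leog[101-108] node
    # or, for the H100 nodes:
    #SBATCH --gres=gpu:h100:4    # all four H100 GPUs on an leog[201-203] node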

GPU cards are in default compute mode.
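
The compute mode of the allocated GPUs can be verified from within a job, e.g. with:

    nvidia-smi -q -d COMPUTE    # reports "Compute Mode : Default"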

Support:

For support, please create a trouble ticket at the MPCDF helpdesk.