Science of Light


Name of the cluster:

ZEROPOINT

Institution:

Max Planck Institute for the Science of Light

Login nodes:

  • zp11.bc.rzg.mpg.de

  • zp12.bc.rzg.mpg.de

  • zp13.bc.rzg.mpg.de

  • zp14.bc.rzg.mpg.de

Hardware configuration and Slurm partitions:

(phase 1):

  partition                       HighMem             HighFreq           DGX
  # nodes                         4                   8                  1
  hostnames                       zp[01-04]           zp[001-008]        zpx
  Slurm partition                 highmem             highfreq           dgx
  CPU model                       Xeon Gold 6130      Xeon Gold 6144     Xeon E5-2698 v4
  CPU architecture                x86_64              x86_64             x86_64
  CPU producer                    Intel               Intel              Intel
  CPU microarchitecture           Skylake-SP          Skylake-SP         Broadwell-EP
  CPUs per node                   2                   2                  2
  cores per CPU                   16                  8                  20
  threads per core                1                   1                  1
  clock rate (base/boost), GHz    2.1 / 3.7           3.5 / 4.2          2.2 / 3.6
  cache size (L3)                 22 MB               24.75 MB           50 MB
  SIMD instruction set            AVX-512             AVX-512            AVX2
  RAM size                        1 TiB (zp[01-03]),  96 GiB             500 GiB
                                  960 GiB (zp04)
  GPU model                       -                   -                  Tesla V100
  GPU producer                    -                   -                  Nvidia
  GPU architecture                -                   -                  Volta
  GPUs per node                   -                   -                  8
  Node interconnect               1 Gb/s Ethernet (all nodes)

(phase 2):

  partition                       Standard (default)   GPU
  # nodes                         68                   32
  hostnames                       zp[101-168]          zpg[001-032]
  Slurm partition                 standard             gpu
  CPU model                       Xeon Gold 6130       Xeon Gold 6130
  CPU architecture                x86_64               x86_64
  CPU producer                    Intel                Intel
  CPU microarchitecture           Skylake-SP           Skylake-SP
  CPUs per node                   2                    2
  cores per CPU                   16                   16
  threads per core                1                    1
  clock rate (base/boost), GHz    2.1 / 3.7            2.1 / 3.7
  cache size (L3)                 22 MB                22 MB
  SIMD instruction set            AVX-512              AVX-512
  RAM size                        187 GiB              187 GiB
  GPU model                       -                    Quadro RTX 6000
  GPU producer                    -                    Nvidia
  GPU architecture                -                    Turing
  GPUs per node                   -                    2
  Node interconnect               1 Gb/s Ethernet (all nodes)

Filesystems:

GPFS-based with total size of 27 TB:

/u
  shared home filesystem with user home directories in /u/<username>;
  GPFS-based; user quotas (currently 600 GB, 1M files) are enforced; the
  quota can be checked with ‘/usr/lpp/mmfs/bin/mmlsquota’ (see the example
  after this list).

/ptmp
  shared scratch filesystem with user directories in /ptmp/<username>;
  GPFS-based; no quotas enforced.
  NO BACKUPS!
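
For illustration, a minimal sketch of a quota check run on a login node is shown below; the exact layout of the report depends on the GPFS configuration.

  # show current usage and quota limits for your user on the GPFS filesystems
  /usr/lpp/mmfs/bin/mmlsquota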

Compilers and Libraries:

The hierarchical “module” subsystem is used on ZEROPOINT to provide compilers and libraries. Please use ‘module available’ to see all available modules (a short example of a typical module workflow is shown below).
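
The following is only a sketch of a typical workflow; the module names are placeholders, and the names and versions actually installed on ZEROPOINT may differ, so please check ‘module available’ first.

  # list all currently available modules
  module avail

  # load a compiler first; in a hierarchical module tree this makes the
  # matching MPI and library modules visible (module names are examples only)
  module load intel
  module load impi
  module load mkl

  # show which modules are currently loaded
  module list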

Batch system based on Slurm:

The batch system on ZEROPOINT is the Slurm Workload Manager.

Current Slurm configuration on ZEROPOINT (a minimal example batch script follows this list):

  • default turnaround time: 3 days

  • current max. turnaround time (wallclock): 7 days

  • default partition: standard
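
As an illustration, a minimal sketch of a batch script for the default partition is shown below; the job name, resource numbers, and the executable ‘./my_prog’ are placeholders to be adapted to your application.

  #!/bin/bash -l
  # minimal example job for the default (standard) partition
  #SBATCH -J my_job             # job name (placeholder)
  #SBATCH -p standard           # Slurm partition
  #SBATCH --nodes=1             # number of nodes
  #SBATCH --ntasks-per-node=32  # 2 x 16 cores per standard node
  #SBATCH --time=24:00:00       # wallclock limit (default 3 days, max. 7 days)

  srun ./my_prog                # placeholder executable

Submit the script with ‘sbatch my_job.sh’ and monitor it with ‘squeue -u $USER’.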

Useful tips:

To run GPU codes in the gpu partition, add the option --gres=gpu:N, where N is the number of GPUs (max. 2), to your batch scripts, e.g.: #SBATCH -p gpu --gres=gpu:1 (see the example script below).
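
A minimal sketch of a corresponding GPU batch script is shown below; the job name and the executable ‘./my_gpu_prog’ are placeholders.

  #!/bin/bash -l
  # minimal example job for the gpu partition (Quadro RTX 6000 nodes)
  #SBATCH -J my_gpu_job         # job name (placeholder)
  #SBATCH -p gpu                # Slurm partition
  #SBATCH --gres=gpu:1          # number of GPUs per node (max. 2)
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --time=12:00:00       # wallclock limit

  srun ./my_gpu_prog            # placeholder executable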

To run GPU codes on zpx (the dgx partition), add the option --gres=gpu:N, where N is the number of GPUs (max. 8), e.g. for an interactive session: srun -p dgx --gres=gpu:1 --pty bash -l

For parallel COMSOL runs on the cluster, please refer to the sample batch scripts; a minimal sketch is shown below.
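
The following is only a sketch of a shared-memory COMSOL batch job, assuming a ‘comsol’ module is available and using the standard ‘comsol batch’ command-line options; the module name, file names, and core count are placeholders, and the official sample batch scripts should be preferred.

  #!/bin/bash -l
  # sketch of a COMSOL batch job on one standard node (placeholders throughout)
  #SBATCH -J comsol_job
  #SBATCH -p standard
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --cpus-per-task=32    # use all 32 cores of the node for shared-memory parallelism
  #SBATCH --time=24:00:00

  module load comsol            # module name is an assumption; check 'module available'

  comsol batch -np $SLURM_CPUS_PER_TASK \
               -inputfile model.mph -outputfile model_out.mph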

To use a large number of subkernels (threads) in Mathematica and to circumvent the timeout issue when loading the kernels locally from the software server, one can launch the subkernels sequentially in batches before the parallel region. For instance, to use 32 subkernels, one would have the following commands in a Mathematica script file:

LaunchKernels[10];  (* launch the subkernels in batches of at most 10 *)
LaunchKernels[10];
LaunchKernels[10];
LaunchKernels[2];   (* 10 + 10 + 10 + 2 = 32 subkernels in total *)

(* ... parallel region, e.g. ParallelTable[...] or ParallelMap[...] ... *)

Here 10 subkernels are launched at a time, but this value needs to be adapted depending on the network performance and the timeout value. A sketch of a matching batch script is shown below.
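
As an illustration, a minimal sketch of a batch script running such a Mathematica script on one standard node is shown below; the module name, the script name ‘my_script.m’, and the ‘math -script’ invocation are assumptions and may need to be adapted to the Mathematica installation on ZEROPOINT.

  #!/bin/bash -l
  # sketch of a Mathematica job using 32 subkernels on one standard node
  #SBATCH -J math_job
  #SBATCH -p standard
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --cpus-per-task=32    # one CPU core per subkernel
  #SBATCH --time=24:00:00

  module load mathematica       # module name is an assumption; check 'module available'

  math -script my_script.m      # 'my_script.m' contains the LaunchKernels commands above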

Support:

For support please create a trouble ticket at the MPCDF helpdesk.