
Name of the cluster:

ZEROPOINT

Institution:

Max Planck Institute for the Science of Light

Login nodes:

  • zp11.bc.rzg.mpg.de

  • zp12.bc.rzg.mpg.de

  • zp13.bc.rzg.mpg.de

  • zp14.bc.rzg.mpg.de

Hardware configuration and Slurm partitions:

(phase 1):

partition                      HighMem             HighFreq           DGX
# nodes                        4                   8                  1
hostnames                      zp[01-04]           zp[001-008]        zpx
Slurm partition                highmem             highfreq           dgx
CPU model                      Xeon Gold 6130      Xeon Gold 6144     Xeon E5-2698 v4
CPU architecture               x86_64              x86_64             x86_64
CPU producer                   Intel               Intel              Intel
microarchitecture              Skylake-SP          Skylake-SP         Broadwell-EP
CPUs per node                  2                   2                  2
cores per CPU                  16                  8                  20
threads per core               1                   1                  1
clock rate (base/boost), GHz   2.1 / 3.7           3.5 / 4.2          2.2 / 3.6
cache size (L3)                22 MB               24.75 MB           50 MB
SIMD instruction set           AVX-512             AVX-512            AVX2
RAM size                       1 TiB (zp[01-03]),  96 GiB             500 GiB
                               960 GiB (zp04)
GPU model                      -                   -                  Tesla V100
GPU producer                   -                   -                  Nvidia
GPU architecture               -                   -                  Volta
GPUs per node                  -                   -                  8
Node interconnect              1 Gb/s ethernet     1 Gb/s ethernet    1 Gb/s ethernet

(phase 2):

partition                      Standard (default)  GPU
# nodes                        68                  32
hostnames                      zp[101-168]         zpg[001-032]
Slurm partition                standard            gpu
CPU model                      Xeon Gold 6130      Xeon Gold 6130
CPU architecture               x86_64              x86_64
CPU producer                   Intel               Intel
microarchitecture              Skylake-SP          Skylake-SP
CPUs per node                  2                   2
cores per CPU                  16                  16
threads per core               1                   1
clock rate (base/boost), GHz   2.1 / 3.7           2.1 / 3.7
cache size (L3)                22 MB               22 MB
SIMD instruction set           AVX-512             AVX-512
RAM size                       187 GiB             187 GiB
GPU model                      -                   Quadro RTX 6000
GPU producer                   -                   Nvidia
GPU architecture               -                   Turing
GPUs per node                  -                   2
Node interconnect              1 Gb/s ethernet     1 Gb/s ethernet

Filesystems:

The filesystems are GPFS-based with a total size of 27 TB:

/u
shared home filesystem with user home directories in /u/<username>; GPFS-based; user quotas (currently 600 GB, 1M files) enforced; the quota can be checked with ‘/usr/lpp/mmfs/bin/mmlsquota’.

/ptmp
shared scratch filesystem with user directories in /ptmp/<username>; GPFS-based; no quotas enforced.
NO BACKUPS!

Compilers and Libraries:

A hierarchical “module” subsystem is provided on ZEROPOINT. Please use ‘module available’ to list all available modules.
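A typical workflow with a hierarchical module system looks as follows; the module names below (gcc, openmpi) are examples, not a guaranteed inventory, so check ‘module available’ for what is actually installed:

```shell
module available          # list modules visible at the current hierarchy level
module load gcc           # loading a compiler exposes the libraries built with it
module available          # now additionally shows compiler-dependent modules
module load openmpi       # example: load an MPI implementation for that compiler
module list               # show the currently loaded modules
```

In a hierarchical setup, MPI and library modules only become visible after the corresponding compiler module has been loaded.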

Batch system based on Slurm:

The batch system on ZEROPOINT is the Slurm Workload Manager.

Current Slurm configuration on ZEROPOINT:

  • default turnaround time: 3 days

  • current max. turnaround time (wallclock): 7 days

  • default partition: standard
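As an illustration, a minimal batch script for the default standard partition might look like the following sketch; the job name, resource values, and executable are placeholders to adapt:

```shell
#!/bin/bash -l
# Minimal Slurm batch script sketch for the ZEROPOINT "standard" partition.
#SBATCH -J my_job                 # job name (placeholder)
#SBATCH -p standard               # default partition
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks-per-node=32      # 2 CPUs x 16 cores per standard node
#SBATCH --time=24:00:00           # wallclock limit (max. 7 days)
#SBATCH -o job.out.%j             # stdout file, %j expands to the job ID
#SBATCH -e job.err.%j             # stderr file

srun ./my_program                 # replace with your executable
```

Submit the script with ‘sbatch my_job.sh’ and monitor it with ‘squeue -u $USER’.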

Useful tips:

To run GPU codes, use the gpu partition and add the option --gres=gpu:N, where N is the number of GPUs (maximum 2), to your batch scripts:

#SBATCH -p gpu --gres=gpu:1
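A complete GPU batch script could then look like this sketch (job name, time limit, and executable are placeholders):

```shell
#!/bin/bash -l
# Sketch of a GPU job on the "gpu" partition (max. 2 GPUs per node).
#SBATCH -J gpu_job                # job name (placeholder)
#SBATCH -p gpu                    # GPU partition (phase 2)
#SBATCH --gres=gpu:1              # request 1 of the 2 Quadro RTX 6000 GPUs
#SBATCH --nodes=1
#SBATCH --time=12:00:00
#SBATCH -o job.out.%j

srun ./my_gpu_program             # replace with your GPU executable
```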

To run GPU codes on zpx, add the option --gres=gpu:N, where N is the number of GPUs (maximum 8). For example, to start an interactive session: srun -p dgx --gres=gpu:1 --pty bash -l

For parallel COMSOL runs on the cluster, please refer to the sample batch scripts.

To use a large number of threads (subkernels) in Mathematica and circumvent the timeout issue when loading the kernels from the software server, load the subkernels sequentially before the parallel region. For instance, to use 32 subkernels, a Mathematica script file would contain the following commands:

LaunchKernels[10];
LaunchKernels[10];
LaunchKernels[10];
LaunchKernels[2];

(* parallel region *)

Here 10 subkernels are launched at a time; this value needs to be adapted depending on the network’s performance and the timeout value.
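A Mathematica script prepared in this way can be submitted as a batch job; the sketch below assumes the kernel is available as ‘math’ after loading a Mathematica module (the module name is an assumption, check ‘module available’ for the actual one):

```shell
#!/bin/bash -l
# Sketch: running a Mathematica script with sequentially launched subkernels.
#SBATCH -J mma_job                # job name (placeholder)
#SBATCH -p standard
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32      # one task per planned subkernel
#SBATCH --time=24:00:00
#SBATCH -o job.out.%j

module load mathematica           # assumed module name
math -script my_script.m          # script containing the LaunchKernels sequence
```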

Support:

For support please create a trouble ticket at the MPCDF helpdesk.