MPSD / PKS


Name of the cluster:

ADA

Institution:

Max Planck Institute for the Structure and Dynamics of Matter

Max Planck Institute for the Physics of Complex Systems

Login nodes:

  • ada01.bc.rzg.mpg.de

  • ada02.bc.rzg.mpg.de

Their SHA256 ssh host key fingerprint is:

Roiw24V2Yhw5a9MwghRWJYTyq9bPs2jYNKqWPJiaxuE (ED25519)
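
For example, a login with ssh looks like this (userid is a placeholder for your MPCDF user name):

ssh userid@ada01.bc.rzg.mpg.de

On the first connection, ssh displays the host key fingerprint, which should match the value above.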

Hardware Configuration:

  • ADA is built on Intel Xeon Platinum 8360Y CPUs (36 cores at 2.40 GHz); each node is equipped with two 8360Y CPUs (72 cores per node)

  • Like the HPC cluster Raven, ADA is operated with Hyper-Threading enabled

  • login nodes ada[01-02] (500 GB RAM each)

  • 72 execution nodes adag[001-072] (1 TB RAM each and 4 Nvidia A100-80GB GPUs each)

  • 2 execution nodes ada[001-002] (2 TB RAM each)

  • node interconnect is based on Mellanox/Nvidia Infiniband HDR-100 technology (Speed: 100 Gb/s)

Filesystems:

/u
  • shared home filesystem

  • user quotas (1 TB of data; 400k files/directories) enforced

  • quota can be checked with ‘/usr/lpp/mmfs/bin/mmlsquota’.
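
For example, a minimal quota check on a login node (without further arguments, mmlsquota reports the quotas of the calling user):

/usr/lpp/mmfs/bin/mmlsquota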

/ada/ptmp/{mpsd|pks}
  • shared scratch filesystem (3.5 PB)

  • NO BACKUPS!

Compilers and Libraries:

Hierarchical environment modules are used at MPCDF to provide software packages and to enable switching between different software versions. No modules are preloaded on ADA; users have to specify the needed modules with explicit versions at login and during the startup of a batch job. Not all software modules are displayed immediately by the module avail command; for some, a compiler and/or MPI module has to be loaded first. You can search the full hierarchy of the installed software modules with the find-module command.
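
For example, a typical sequence could look like this (fftw is used only as an illustrative search term; replace the modules with the ones you actually need):

find-module fftw            # search the full module hierarchy for a package
module load intel/21.5.0    # load a compiler first ...
module load impi/2021.5     # ... and, if needed, an MPI module
module avail                # modules depending on this compiler/MPI are now listed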

Batch system based on Slurm:

  • a brief introduction to the basic commands (srun, sbatch, squeue, scancel, sinfo, s*…) can be found on the Raven home page or in the Slurm handbook; a short overview is also sketched after this list

  • two partitions: p.ada (default), p.large

  • current max. run time (wallclock): p.ada (1 day), p.large (1 day)

  • maximum memory per node for jobs: p.ada (1024000 MB), p.large (2048000 MB)

  • p.ada partition: nodes are exclusively allocated to users

  • p.large partition: resources on the nodes may be shared between jobs

  • p.ada partition: to access GPU resources, the --gres parameter must be set explicitly for jobs
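
The most common Slurm commands in brief (script name and job ID are placeholders):

sbatch my_job_script.sh    # submit a batch script
squeue -u $USER            # list your queued and running jobs
scancel 12345              # cancel the job with the given job ID
sinfo -p p.ada,p.large     # show the state of the two ADA partitions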

Sample batch scripts

You can find a set of sample batch scripts on the Raven home page, but they must be adapted for Ada. Below are a few examples that are already adapted to Ada, so you can copy and paste them.

Hybrid MPI/OpenMP job using one or more nodes with 4 GPUs each with CUDA-aware MPI

The following example job script launches a hybrid MPI/OpenMP CUDA code on one (or more) nodes, running one task per GPU and using CUDA-aware MPI. The same modules should also be loaded for compiling your code. Note that the user code needs to attach its tasks to the different GPUs based on some code-internal logic.

Important

For MPSD users: if you want to run octopus, please use this batch script!

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J test_gpu
#
#SBATCH --nodes=1            # Request 1 or more full nodes
#SBATCH --partition=p.ada    # in the GPU partition
#SBATCH --gres=gpu:a100:4    # Request 4 GPUs per node.
#SBATCH --ntasks-per-node=4  # Run one task per GPU
#SBATCH --cpus-per-task=18   #   using 18 cores each.
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#SBATCH --time=24:00:00

module purge
module load gcc/11 cuda/11.4 openmpi_gpu/4

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./mpi_openmp_cuda_executable

Hybrid MPI/OpenMP job using one or more nodes with 4 GPUs each

The following example job script launches a hybrid MPI/OpenMP CUDA code on one (or more) nodes, running one task per GPU. You should load the same modules in the Slurm script as you did for compiling your code. Note that the user code needs to attach its tasks to the different GPUs based on some code-internal logic.

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J test_gpu
#
#SBATCH --nodes=1            # Request 1 or more full nodes
#SBATCH --partition=p.ada    # in the GPU partition
#SBATCH --gres=gpu:a100:4    # Request 4 GPUs per node.
#SBATCH --ntasks-per-node=4  # Run one task per GPU
#SBATCH --cpus-per-task=18   #   using 18 cores each.
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#SBATCH --time=24:00:00

module purge
module load intel/21.5.0 impi/2021.5 cuda/11.4

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./mpi_openmp_cuda_executable

MPI job using one of the large-memory nodes

The following example job script launches an MPI code on one large-memory node that has 2 TB of memory.

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J test_largemem
#
#SBATCH --nodes=1            # Request 1 or more full nodes
#SBATCH --partition=p.large  # in the large-mem partition
#SBATCH --ntasks-per-node=72 # Run 72 tasks
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#SBATCH --time=24:00:00

module purge
module load intel/21.5.0 impi/2021.5

srun ./mpi_executable

Useful tips

Use the --time option for sbatch/srun to set a limit on the total run time of the job allocation.

OpenMP codes require the environment variable OMP_NUM_THREADS to be set. Its value can be obtained from the Slurm environment variable $SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script.

Nvidia Ampere GPUs are available in the p.ada partition. The GPU type must be set explicitly, i.e. --gres=gpu:a100:X, where X is between 1 and 4.

GPU cards are in default compute mode.

Nodes in p.large are in shared mode, i.e. jobs allocate only the requested resources. By default, however, a job allocates all memory of a node. This means that to share a node between several jobs, the --mem parameter is required.
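
For example, a job on p.large that should leave room for other jobs on the same node could request its memory explicitly (the value is only an illustration):

#SBATCH --partition=p.large
#SBATCH --mem=512000        # request 512000 MB instead of the full 2048000 MB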

For debugging, you can use the Quality-of-Service feature: by adding --qos=debug, your job gets a higher priority so that it starts as soon as possible. The time limit for such jobs is 15 minutes.
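
For example (the script name is a placeholder):

sbatch --qos=debug my_job_script.sh

or, equivalently, as a directive in the batch script:

#SBATCH --qos=debug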

Profiling on GPUs

For profiling codes on GPUs, we provide packages for Nsight Systems (nsight_systems) and Nsight Compute (nsight_compute) from Nvidia.

Nsight Systems is great for profiling the overall behavior of a code: it produces a timeline that shows which parts are already executed on the GPU, where there are still gaps, and where in the code data is transferred between CPU and GPU or directly between GPUs. To use it, you can run in your batch script:

module load nsight_systems

nsys profile -t cuda,nvtx,mpi srun my_binary

This will create a profile that you can open in nsys-ui. Be aware that this only works for single-node runs!

Nsight Compute is a great tool to analyze the behavior of individual kernels in order to optimize them. To profile a certain kernel, you would run:

module load nsight_compute

nv-nsight-cu-cli --kernel-id ::kernel_name:2 -o output ./my_binary

This will profile the second invocation of the kernel named kernel_name and will write a profile with a name starting with output. You can open that profile with nv-nsight-cu. For this, it is enough to run the code on one GPU.

Tips for MPSD users

The Octopus code is provided via the module system on the Ada cluster. You can load the most recent modules without needing to load a compiler or MPI module first. We offer two kinds of Octopus installations on Ada:

  • octopus: the default module, supports all available libraries on our systems, except for pfft

  • octopus-gpu: supports running on GPUs

You can see the available versions using module avail. The versioning scheme includes the minor version number (e.g. octopus/12.0). We also offer the version octopus/main which provides a build of the current main branch that is updated twice per month.
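
For example, to inspect and load the installed versions (the version shown is only an illustration):

module avail octopus        # lists the octopus and octopus-gpu modules with their versions
module load octopus/12.0    # load a specific release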

To run on GPUs, the following code block is recommended:

module purge
module load octopus-gpu/main

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun octopus

This will let you use the main version of Octopus compiled with support for CUDA-aware MPI, which is especially important for domain-parallel runs.

You can also compile Octopus yourself using this build script.

You can also find more information on Octopus at the Octopus web page.

Support:

For support, please create a trouble ticket at the MPCDF helpdesk.