Dais User Guide


Note

The filesystem details, including the quota values, are not yet settled and are thus subject to change.

Name of the cluster:

  • DAIS

Institution:

  • Selected MPG Departments

How to get access permissions

Access can only be granted to members of the institutes that procured the system.

If you do not already have an account at MPCDF, fill out the registration form. If you already have an account at MPCDF but cannot access DAIS, please request access via our ticket system.

Note that access to DAIS can only be granted after successfully passing export control, which is performed by the respective export control officer of your institute. The export control officer is notified automatically as soon as you request an account for DAIS.

Access

Login

For security reasons, direct login to the HPC system DAIS is allowed only from within some MPG networks. Users from other locations have to login to one of our gateway systems first.

Login nodes

  • dais11.mpcdf.mpg.de, dais12.mpcdf.mpg.de

The SSH key fingerprints (SHA256) of DAIS are:

ijGSRMd1K3bq14gUaKnI0rODsx5hgCVvtAzQoHC/sy0 (RSA)
Ke44kG2tm/IRqYFg9iUGapSFCLQKIiSUERez5eSsT9Y (ED25519)
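For logins from outside the permitted MPG networks, hopping through a gateway can be automated with an SSH ProxyJump entry. A minimal sketch of an ~/.ssh/config entry; the gateway host name and the user name below are assumptions and must be adjusted to your MPCDF account:

```
# Sketch only: gateway host (gate1.mpcdf.mpg.de) and user name are
# placeholders -- substitute the gateway and account you actually use.
Host dais
    HostName dais11.mpcdf.mpg.de
    User your_mpcdf_user
    ProxyJump your_mpcdf_user@gate1.mpcdf.mpg.de
```

With such an entry in place, a plain "ssh dais" transparently tunnels through the gateway.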

Hardware Configuration

2 login nodes dais[11-12]:

  • 2 x INTEL(R) XEON(R) PLATINUM 8568Y+ 48-Core Processor @ 2.3 GHz

  • 96 cores per node

  • hyper-threading enabled - 2 threads per core

  • 500 GB RAM

17 execution nodes daisg[101-117]:

  • 2 x INTEL(R) XEON(R) PLATINUM 8568Y+ 48-Core Processor @ 2.3 GHz

  • 96 cores per node

  • hyper-threading enabled - 2 threads per core

  • 2.0 TB RAM

  • 8 x NVIDIA H200 GPUs per node

Node interconnect:

  • based on a Mellanox Technologies InfiniBand fabric (speed: 8 x 200 Gb/s per GPU node)

Filesystems

Filesystem /u

  • shared home filesystem

  • quota of 500k files and 1 TB of data

Filesystem /dais/fs/scratch

  • shared scratch filesystem, 200 TB

  • quota of 8M files and 8 TB of data

  • NO BACKUPS

Filesystem /nexus/posix0

Additional storage space can be rented.

Compilers and Libraries

Hierarchical environment modules are used at MPCDF to provide software packages and to enable switching between different software versions. Users have to specify the required modules with explicit versions at login and during the startup of a batch job. Not all software modules are displayed immediately by the module avail command; for some, users first need to load a compiler and/or MPI module. You can search the full hierarchy of the installed software modules with the find-module command.
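The hierarchical workflow described above can be sketched as a shell session. The compiler and MPI module names and versions below are placeholders for illustration, not the actual DAIS defaults; check module avail and find-module on the system for the real ones:

```shell
# Hypothetical module names/versions -- adjust to what DAIS provides.
module purge                  # start from a clean environment
module load gcc/13            # load a compiler first ...
module load openmpi/4         # ... which unlocks the matching MPI stack
module avail                  # now also lists packages built for this stack
find-module fftw              # search the full hierarchy for a package
```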

Batch system based on Slurm

The batch system on DAIS is the Slurm Workload Manager. A brief introduction to the basic commands (srun, sbatch, squeue, scancel, …) can be found on the Raven home page. For more detailed information, see the Slurm handbook. See also the sample batch scripts, which must be adapted for the DAIS cluster.

Current Slurm configuration on DAIS

  • default turnaround time: 2 hours

  • current max. turnaround time (wallclock): 24 hours

  • gpu partition: exclusive usage of compute nodes (with 8 GPUs each); default

  • gpu1 partition: shared usage of compute nodes; for jobs with up to 4 GPUs

Useful tips

Jobs that do not specify a run time limit get the default of 2 hours. Use the --time option of sbatch/srun to set a limit on the total run time of the job allocation, up to the maximum of 24 hours.

The default memory per node in the shared partition is 250000 MB; the maximum per allocated node per job is 2000000 MB. To grant the job access to all of the memory on each node, use the --mem=0 option of sbatch/srun.

OpenMP codes require the environment variable OMP_NUM_THREADS to be set. Its value can be obtained from the Slurm environment variable SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script (an example can be found on the help information page).
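A minimal sketch of this inside a batch script; the fallback to 1 thread is an addition for robustness, not something Slurm requires:

```shell
# Propagate the Slurm CPU allocation to OpenMP; fall back to 1 thread
# when the variable is unset (e.g. --cpus-per-task was not given).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "$OMP_NUM_THREADS"
```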

To use GPUs, add the --gres option to your Slurm scripts and specify how many GPUs (and, optionally, which model) to request: #SBATCH --gres=gpu:h200:X, where X is the number of GPUs, from 1 up to 8.

GPU cards are in default compute mode.

Slurm example batch scripts

GPU job using 1, 2, or 4 GPUs on a shared node

This example launches a Python program on one (or more) GPU(s) on a single shared node, designed for a distributed PyTorch scenario. For other frameworks and containerized setups, refer to the ai_containers repository.

#!/bin/bash -l
#
# Initial working directory:
#SBATCH -D ./
#
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
#
# Job name
#SBATCH -J test_gpu
#
#SBATCH --time=0-00:10:00 # wall-clock time limit D-HH:MM:SS (here: 10 minutes)
#
#SBATCH --nodes=1  # run on a single node
#SBATCH --partition="gpu1" # request a shared node
#
#SBATCH --cpus-per-task=12 # request 1/8 of the available CPUs per task; this remains unchanged for the PyTorch example.
#SBATCH --threads-per-core=1
#
# --- default case: use a single GPU on a shared node ---
#SBATCH --gres=gpu:h200:1 # use 1 GPU on a shared node
#SBATCH --ntasks-per-node=1 # request 1 task on that node
#SBATCH --mem=250000 # grant the job access to 1/8 of the memory on the node
#
# --- uncomment to use 2 GPUs on a shared node ---
# #SBATCH --gres=gpu:h200:2 # use 2 GPUs on a shared node
# #SBATCH --ntasks-per-node=2 # request 2 tasks on that node
# #SBATCH --mem=500000 # grant the job access to 2/8 of the memory on the node
#
# --- uncomment to use 4 GPUs on a shared node ---
# #SBATCH --gres=gpu:h200:4 # use 4 GPUs on a shared node
# #SBATCH --ntasks-per-node=4 # request 4 tasks on that node
# #SBATCH --mem=1000000 # grant the job access to 4/8 of the memory on the node
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de

###### Environment ######
module purge
module load apptainer/1.4.1
CONTAINER="YOUR_PYTORCH_CONTAINER"

###### PyTorch distributed variables ######
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export APPTAINERENV_MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

###### Run the program:
srun apptainer exec --nv $CONTAINER \
  bash -c "RANK=\${SLURM_PROCID} python3 ./your_python_executable"
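The MASTER_PORT line in the script above derives a job-specific port from the Slurm job id: it takes the last four characters of the id and adds 10000, yielding a port between 10000 and 19999 that differs between concurrent jobs. A standalone sketch with a made-up job id (in the real batch script, SLURM_JOBID is set by Slurm):

```shell
# Made-up job id for illustration only.
SLURM_JOBID=4711042
MASTER_PORT=$(expr 10000 + $(echo -n "$SLURM_JOBID" | tail -c 4))
echo "$MASTER_PORT"   # 10000 + 1042 = 11042
```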

GPU job using 8 GPUs on multiple nodes

This example launches a Python program across all 8 GPUs on two (or more) nodes, designed for a distributed PyTorch scenario. For other frameworks and containerized setups, refer to the ai_containers repository.

#!/bin/bash -l
#
# Initial working directory:
#SBATCH -D ./
#
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
#
# Job name
#SBATCH -J test_gpu
#
#SBATCH --time=0-00:10:00 # wall-clock time limit D-HH:MM:SS (here: 10 minutes)
#
# Number of nodes, GPUs and MPI tasks per node:
#SBATCH --nodes=2  # request 2 or more full nodes
#SBATCH --gres=gpu:h200:8 # use 8 GPUs on each full node
#SBATCH --ntasks-per-node=8 # request 8 tasks on each node, e.g. one task per requested GPU
#
#SBATCH --cpus-per-task=12 # request 1/8 of available CPUs per task 
#SBATCH --threads-per-core=1
#
#SBATCH --partition="gpu" # request an exclusive node
#
#SBATCH --mem=0 # grant the job access to all of the memory on each node 
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de

###### Environment ######
module purge
module load apptainer/1.4.1
CONTAINER="YOUR_PYTORCH_CONTAINER"

###### PyTorch distributed variables ######
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export APPTAINERENV_MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

###### Run the program:
srun apptainer exec --nv $CONTAINER \
  bash -c "RANK=\${SLURM_PROCID} python3 ./your_python_executable"

Support

For support, please create a trouble ticket at the MPCDF helpdesk.