Dais User Guide
Note
The filesystem details, including the quota values, are not yet finalized and are subject to change.
Name of the cluster:
DAIS
Institution:
Selected MPG Departments
How to get Access permissions
Access can only be granted to members of institutes that procured the system.
If you do not already have an account at MPCDF, fill out the registration form. If you already have an account at MPCDF but cannot access DAIS, please request access via our ticket system.
Note that access to DAIS can only be granted after successfully passing export control, which is carried out by the respective export control officer of your institute. The export control officer is notified automatically as soon as you request an account for DAIS.
Access
Login
For security reasons, direct login to the HPC system DAIS is allowed only from within some MPG networks. Users from other locations have to login to one of our gateway systems first.
Login nodes
dais11.mpcdf.mpg.de, dais12.mpcdf.mpg.de
DAIS’ SSH key fingerprints (SHA256) are:
ijGSRMd1K3bq14gUaKnI0rODsx5hgCVvtAzQoHC/sy0 (RSA)
Ke44kG2tm/IRqYFg9iUGapSFCLQKIiSUERez5eSsT9Y (ED25519)
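For logins from outside the allowed MPG networks, a persistent SSH configuration can route the connection through a gateway automatically. A sketch for ~/.ssh/config (the gateway host gate1.mpcdf.mpg.de and the user name are assumptions; use the gateway systems documented by MPCDF):

```
# ~/.ssh/config sketch; host names below are assumptions, adjust as needed
Host dais
    HostName dais11.mpcdf.mpg.de
    User YOUR_MPCDF_USERNAME
    ProxyJump YOUR_MPCDF_USERNAME@gate1.mpcdf.mpg.de
```

With this stanza in place, `ssh dais` connects through the gateway in a single step.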
Hardware Configuration
2 login nodes dais[11-12]:
2 x INTEL(R) XEON(R) PLATINUM 8568Y+ 48-Core Processor @ 2.3 GHz
96 cores per node
hyper-threading enabled - 2 threads per core
500 GB RAM
17 execution nodes daisg[101-117]:
2 x INTEL(R) XEON(R) PLATINUM 8568Y+ 48-Core Processor @ 2.3 GHz
96 cores per node
hyper-threading enabled - 2 threads per core
2.0 TB RAM
8 x NVIDIA H200 GPUs (with 141GB HBM each) per node
10 execution nodes daisg[201-210]:
2 x INTEL(R) XEON(R) PLATINUM 8568Y+ 48-Core Processor @ 2.3 GHz
96 cores per node
hyper-threading enabled - 2 threads per core
2.0 TB RAM
8 x NVIDIA B200 GPUs (with 180GB HBM each) per node
2 execution nodes daisg[301-302]:
2 x AMD EPYC 9555 64-Core Processor @ 3.2 GHz
128 cores per node
hyper-threading enabled - 2 threads per core
1.5 TB RAM
4 x NVIDIA RTX PRO 6000 GPUs (with 96GB GDDR7 each) per node
Node interconnect:
based on Mellanox Technologies InfiniBand fabric (speed: 8 x 200 Gb/s per GPU node)
Filesystems
Filesystem /u
shared home filesystem
quota of 500k files and 1 TB of data
Filesystem /dais/fs/scratch
shared scratch filesystem, 200 TB
quota of 8M files and 8 TB of data
NO BACKUPS
Filesystem /viper/ptmp2
The file system /viper/ptmp2 is designed for batch job I/O (12 PB, no system backups). Files in /viper/ptmp2 that have not been accessed for more than 12 weeks will be removed automatically; this period may be reduced if necessary (with prior notice). As a current policy, no quotas are applied on /viper/ptmp2. This gives users the freedom to manage their data according to their actual needs without administrative overhead. This liberal policy presumes fair usage of the common file space, so please do regular housekeeping of your data and archive or remove files that are no longer actively used.
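A simple way to spot candidates for housekeeping is to list files that fall outside the 12-week access window. A sketch (DATA_DIR is a placeholder; on DAIS you would point it at your own directory, e.g. under /viper/ptmp2, and the exact path layout is an assumption — here a temporary directory keeps the snippet self-contained):

```shell
# DATA_DIR is a placeholder: on the cluster, set it to your own data
# directory instead of this temporary one.
DATA_DIR=$(mktemp -d)
touch "$DATA_DIR/recent_file"

# List regular files not accessed for more than 84 days (~12 weeks);
# the freshly created file is not matched, so nothing is found here.
OLD_FILES=$(find "$DATA_DIR" -type f -atime +84)
echo "found: ${OLD_FILES:-none}"
```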
Filesystem /nexus/posix0
Additional storage space can be rented
see also Nexus Posix
Compilers and Libraries
Hierarchical environment modules are used at MPCDF to provide software packages and to enable switching between different software versions. Users have to specify the needed modules with explicit versions at login and during the startup of a batch job. Not all software modules are displayed immediately by the module avail command; for some, the user first needs to load a compiler and/or MPI module.
You can search the full hierarchy of the installed software modules with the find-module command.
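A typical interactive session might look like the sketch below; the module names and versions are hypothetical, so use module avail and find-module to discover what is actually installed:

```shell
module purge              # start from a clean environment
module load gcc/13        # load a compiler first (version is an assumption)
module avail              # compiler-dependent modules are now visible
find-module pytorch       # search the full module hierarchy by name
```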
Batch system based on Slurm
The batch system on DAIS is the Slurm Workload Manager. A brief introduction into the basic commands (srun, sbatch, squeue, scancel, …) can be found on the Raven home page. For more detailed information, see the Slurm handbook. For example batch scripts see below.
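A typical submit-and-monitor cycle with these commands looks like this (the script name and job ID are placeholders):

```shell
sbatch my_job.sh     # submit the batch script; prints the assigned job ID
squeue --me          # list your own pending and running jobs
scancel 123456       # cancel a job by its ID (hypothetical ID)
```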
Current Slurm configuration on DAIS
default turnaround time: 2 hours
current max. turnaround time (wallclock): 24 hours
gpu partition: exclusive usage of compute nodes (with 8 GPUs each); default
gpu1 partition: shared usage of compute nodes; for jobs with fewer than 4 GPUs
Useful tips
The default run time limit for jobs that do not specify a value is 2
hours. Use the --time option for sbatch/srun to set a limit on the
total run time of the job allocation, up to a maximum of 24 hours.
The default memory per node in the shared partition is 250000 MB; the maximum per allocated node per job is 2000000 MB.
To grant the job access to all of the memory on each node use --mem=0
option for sbatch/srun.
On nodes with RTX PRO 6000 GPUs the default memory per job is 375000 MB, and these nodes cannot be used exclusively.
OpenMP codes require the environment variable OMP_NUM_THREADS to be set.
Its value can be obtained from the Slurm environment variable
$SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is
specified in an sbatch script (an example is given on the help information page).
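The tip above can be sketched as follows; SLURM_CPUS_PER_TASK is set by hand here only to make the snippet self-contained (inside a real job, Slurm sets it for you):

```shell
# Simulate what Slurm exports when --cpus-per-task=12 is requested.
SLURM_CPUS_PER_TASK=12

# Hand the per-task CPU count to the OpenMP runtime.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "$OMP_NUM_THREADS"
```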
To use GPUs, add the --gres option to your Slurm scripts and specify how
many GPUs and/or which model to use: #SBATCH --gres=gpu:GPUTYPE:X,
where X is the number of GPUs (from 1 up to 8) and GPUTYPE is one of h200, b200, rtx_pro_6000. If no GPU type is specified, the job will be scheduled on arbitrary nodes.
Only one rtx_pro_6000 GPU is allowed per job.
GPU cards are in default compute mode.
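Putting the tips above together, a minimal header for a shared-partition job with two H200 GPUs might look like the following sketch (resource values are illustrative; adjust them to your application):

```shell
#!/bin/bash -l
#SBATCH -J small_gpu_test
#SBATCH --partition=gpu1       # shared partition for jobs with fewer than 4 GPUs
#SBATCH --gres=gpu:h200:2      # request 2 H200 GPUs
#SBATCH --ntasks-per-node=2    # one task per GPU
#SBATCH --cpus-per-task=12     # 1/8 of the node's cores per task
#SBATCH --time=01:00:00        # within the 24-hour limit
```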
Slurm example batch scripts
The following sections present some general examples of submission scripts for the DAIS system. For more examples with specific frameworks and containerized setups, or for comparisons with other HPC systems, refer to the ai_containers repository.
Multi-node job
The script below runs a Python program on 16 GPUs spread over two nodes, illustrating a distributed PyTorch workflow (additional examples are available in the ai_containers repository).
#!/bin/bash -l
#
# Initial working directory:
#SBATCH -D ./
#
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
#
# Job name
#SBATCH -J test_gpu
#
# Time limit
#SBATCH --time=0-00:10:00 # wall-clock D-HH:MM:SS (here: 10 minutes)
#
# Number of nodes, GPUs and MPI tasks per node:
#SBATCH --nodes=2 # request 2 or more full nodes
#SBATCH --partition="gpu" # request an exclusive node
#
#SBATCH --gres=gpu:h200:8 # use 8 H200 on each node.
# #SBATCH --gres=gpu:b200:8 # use 8 B200 on each node.
#SBATCH --ntasks-per-node=8 # request 8 tasks on each node (1 per gpu).
#SBATCH --cpus-per-task=12 # request 1/8 of available CPUs per task
#SBATCH --mem=0 # grant the job access to all of the memory on each node
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
###### Environment ######
module purge
module load apptainer/1.4.3
CONTAINER="YOUR_PYTORCH_CONTAINER"
###### PyTorch distributed variables ######
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export APPTAINERENV_MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
###### Run the program:
srun apptainer exec --nv $CONTAINER \
bash -c "RANK=\${SLURM_PROCID} python3 ./your_python_executable"
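The APPTAINERENV_MASTER_PORT line in the script derives a quasi-unique rendezvous port from the job ID: 10000 plus the last four digits of $SLURM_JOBID. Sketched outside of Slurm with a hand-set job ID:

```shell
# Simulate the job ID (inside a job, Slurm sets SLURM_JOBID itself).
SLURM_JOBID=1234567

# Last four digits of the job ID, offset by 10000 so the port stays
# in the 10000-19999 range.
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "$MASTER_PORT"
```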
Support
For support please create a trouble ticket at the MPCDF helpdesk.