DAIS User Guide
Important
DAIS is operated in user-risk mode and will be redeployed from scratch in the 2nd half of September. This includes redeploying the filesystems.
-> All data stored on DAIS will be LOST then.
Name of the cluster:
DAIS
Institution:
Selected MPG Departments
Login nodes:
dais11.mpcdf.mpg.de, dais12.mpcdf.mpg.de
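Access to the login nodes is via SSH, e.g. (userid is a placeholder for your MPCDF user name):
ssh userid@dais11.mpcdf.mpg.de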
Hardware Configuration
2 login nodes dais[11-12]:
2 x INTEL(R) XEON(R) PLATINUM 8568Y+ 48-Core Processor @ 2.3 GHz
96 cores per node
hyper-threading enabled - 2 threads per core
500 GB RAM
17 execution nodes daisg[101-117]:
2 x INTEL(R) XEON(R) PLATINUM 8568Y+ 48-Core Processor @ 2.3 GHz
96 cores per node
hyper-threading enabled - 2 threads per core
2.0 TB RAM
8 x NVIDIA H200 GPUs per node
Node interconnect:
based on Mellanox Technologies InfiniBand fabric (speed: 8 x 200 Gb/s per GPU node)
Filesystems
GPFS-based, with a total size of 1.0 PB and independent inode space for the following filesets:
File system /u
shared home filesystem
NO BACKUPS
WILL BE DELETED and REDEPLOYED in the 2nd half of September!
Compilers and Libraries
Hierarchical environment modules are used at MPCDF to provide software packages and to enable switching between different software versions. Users have to specify the needed modules with explicit versions at login and during the startup of a batch job. Not all software modules are displayed immediately by the module avail command; for some, the user first needs to load a compiler and/or MPI module. You can search the full hierarchy of the installed software modules with the find-module command.
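As an illustration, a typical module workflow might look as follows; the module names and versions (gcc/13, openmpi/4.1, pytorch) are placeholders, please check what is actually installed on DAIS:
# search the full module hierarchy for a package (name is a placeholder)
find-module pytorch
# load a compiler and an MPI module first to make dependent modules visible
module load gcc/13 openmpi/4.1
module avail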
Batch system based on Slurm
The batch system on DAIS is the Slurm Workload Manager. A brief introduction to the basic commands (srun, sbatch, squeue, scancel, …) can be found on the Raven home page. For more detailed information, see the Slurm handbook. See also the sample batch scripts, which must be adapted for the DAIS cluster.
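As a quick reference, the basic Slurm workflow looks like this (the script name and job ID are placeholders):
sbatch ./my_job.sh     # submit a batch script; prints the job ID
squeue -u $USER        # list your pending and running jobs
scancel 123456         # cancel a job by its ID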
Current Slurm configuration on DAIS
default turnaround time: 2 hours
current max. turnaround time (wallclock): 48 hours
gpu partition: exclusive usage of compute nodes (with 8 GPUs each); this is the default partition
gpu1 partition: shared usage of compute nodes; for jobs with fewer than 4 GPUs
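For example, the partition choice in a batch script looks like this (the GPU counts are illustrative):
# exclusive node in the default gpu partition, using all 8 GPUs:
#SBATCH --partition=gpu
#SBATCH --gres=gpu:h200:8
# shared node in the gpu1 partition, e.g. for a 2-GPU job:
#SBATCH --partition=gpu1
#SBATCH --gres=gpu:h200:2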
Useful tips
The default run time limit for jobs that do not specify a value is 2 hours. Use the --time option of sbatch/srun to set a limit on the total run time of the job allocation, up to at most 48 hours.
The default memory per node in the shared partition is 250000 MB; the maximum per allocated node per job is 2000000 MB. To grant the job access to all of the memory on each node, use the --mem=0 option of sbatch/srun.
OpenMP codes require the environment variable OMP_NUM_THREADS to be set. Its value can be obtained from the Slurm environment variable SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script (see the minimal sketch after this list).
To use GPUs, add the --gres option to your Slurm scripts and choose how many GPUs (and which GPU model) to request: #SBATCH --gres=gpu:h200:X, where X is the number of GPUs, from 1 up to 8.
GPU cards are in default compute mode.
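Putting these tips together, a minimal sketch of a single-GPU job in the shared gpu1 partition could look as follows; the executable name and the resource sizes are placeholders, not a recommended setup:
#!/bin/bash -l
#SBATCH -D ./
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
#SBATCH -J tips_demo
#SBATCH --partition=gpu1        # shared partition for jobs with fewer than 4 GPUs
#SBATCH --gres=gpu:h200:1       # request one H200 GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12      # sets SLURM_CPUS_PER_TASK
#SBATCH --time=0-02:00:00       # explicit run time limit (max. 48 hours)

# pass the allocated CPUs per task on to OpenMP
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./my_openmp_gpu_app        # placeholder executable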
Slurm example batch scripts
GPU job using 8 GPUs on multiple nodes
This example launches a Python program across all 8 GPUs on two (or more) nodes, designed for a distributed PyTorch scenario. For other frameworks and containerized setups, refer to the ai_containers repository.
#!/bin/bash -l
#
# Initial working directory:
#SBATCH -D ./
#
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
#
# Job name
#SBATCH -J test_gpu
#
#SBATCH --time=0-00:10:00 # wall-clock time limit D-HH:MM:SS (here: 10 minutes)
#
# Number of nodes, GPUs and MPI tasks per node:
#SBATCH --nodes=2 # request 2 or more full nodes
#SBATCH --gres=gpu:h200:8 # use 8 GPUs on each full node
#SBATCH --ntasks-per-node=8 # request 8 tasks on each node, e.g. one task per requested GPU
#
#SBATCH --cpus-per-task=12 # request 1/8 of available CPUs per task
#SBATCH --threads-per-core=1
#
#SBATCH --partition="gpu" # request an exclusive node
#
#SBATCH --mem=0 # grant the job access to all of the memory on each node
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
###### Environment ######
module purge
module load apptainer/1.4.1
CONTAINER="YOUR_PYTORCH_CONTAINER"
###### PyTorch distributed variables ######
# rendezvous address: the first node of the allocation acts as the master
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# derive a job-specific port from the last four digits of the job ID; the
# APPTAINERENV_ prefix passes MASTER_PORT into the container environment
export APPTAINERENV_MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
# total number of tasks (processes) across all nodes
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
###### Run the program ######
# each Slurm task derives its distributed rank from SLURM_PROCID
srun apptainer exec --nv $CONTAINER \
     bash -c "RANK=\${SLURM_PROCID} python3 ./your_python_executable"
Support
For support, please create a trouble ticket at the MPCDF helpdesk.