HPC Systems and Services
Raven
What are the recommended compiler optimization flags for Raven?
The HPC system Raven comprises Intel Xeon IceLake-SP processors (Platinum 8360Y). The recommended compiler optimization flags for these CPUs are:
# Intel compilers (icc, icpc, ifort, mpiicc, mpiicpc, mpiifort)
icc -O3 -xICELAKE-SERVER -qopt-zmm-usage=high ...
# GNU compilers (gcc, g++, gfortran, mpigcc, mpigxx, mpigfortran)
gcc -O3 -march=icelake-server ...
# NVIDIA HPC SDK compilers (nvc, nvc++, nvfortran)
nvc -O3 -tp=skylake ...
Moreover, Raven contains NVIDIA A100 GPUs for which the recommended compiler optimization flags are given below.
# NVIDIA CUDA (nvcc)
nvcc -O3 -arch sm_80 ...
# NVIDIA HPC SDK compilers (nvc, nvc++, nvfortran) for OpenACC
nvc -O3 -tp=skylake -acc=gpu -gpu=cc80 ...
Please consult the compilers’ --help output, the man pages, and the official compiler documentation for more details and options.
Slurm batch system
How do I submit jobs to the Slurm batch system?
First, a Slurm job script (e.g. a text file named script.sh) needs to be written that essentially specifies the resources the job requires and the commands the job shall execute (similar to a shell script). Example scripts are provided with the documentation of each HPC system.
Second, use the command sbatch script.sh to actually submit the script as a job to the batch system. After submission a job id is reported by sbatch that will serve as a unique handle for the job on the system.
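For illustration, a submission could look as follows; the job id in the output is just a placeholder:
sbatch script.sh
# prints e.g.: Submitted batch job 1234567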
Can I submit jobs longer than 24 hours? Why are job run times limited to 24 hours?
Batch jobs on the HPC systems and clusters are in general limited to run times of 24 hours, with very few exceptions on dedicated clusters in agreement with their owners. If your runs need more time, your code must support checkpoint-restart functionality, i.e. it must be able to write its current state to disk (create a checkpoint) and resume from that checkpoint at the beginning of a subsequent run. The 24h limit is necessary to enable reasonable job turnaround times for the users and to maximize the utilization of the system.
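As a sketch of how this can be organized with Slurm (the checkpointing itself must be implemented by the application; the script name and dependency type are examples), consecutive runs can be chained with job dependencies so that each job resumes from the checkpoint written by its predecessor:
# submit the first job and capture its job id
jobid=$(sbatch --parsable script.sh)
# submit a follow-up job that starts only after the first one has ended
sbatch --dependency=afterany:$jobid script.sh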
How do I launch an MPI code?
In your Slurm job script, start your program with the command
srun application
which will launch the application using the resources specified in the Slurm job script.
What is the correct order of executable commands and ‘#SBATCH’ directives in a Slurm job script?
A Slurm job script must stick to the scheme as outlined in the following:
#!/bin/bash -l
# --- SBATCH directives come first ---
#SBATCH ...
#SBATCH ...
# --- executable commands follow after the SBATCH directives ---
module load ...
srun my_executable ...
All ‘#SBATCH’ directives must be put into the script before the first executable command is called. In particular, any ‘#SBATCH’ directive after the first non-comment non-whitespace line is ignored.
Can I work interactively for debugging and development work?
Production runs and large tests must be run via batch scripts. From a login node, small test runs can be started interactively, e.g. as follows:
srun --time=00:05:00 --mem=1G --ntasks=2 ./mpi_application
How can I query the estimated start time of my job?
The command
squeue --start -j <jobid>
will print the current estimated start time for a job.
How can I get information about my job at runtime?
The command
scontrol show jobid -dd <jobid>
will print detailed information on a job while it is running.
How can I get information about my job after it has finished?
After a job has finished, the sacct command can be used to obtain information, e.g.
sacct -j <jobid>
sacct -u $USER --format=JobID,JobName,MaxRSS,Elapsed
What happens when a running job is hit by a hardware failure?
Rarely but unavoidably, hardware failures may occur. In such situations, an affected run is interrupted by Slurm, and an error message similar to ‘srun: error: Node failure on coXXXX’ is written to ‘stderr’. On all HPC systems and most dedicated clusters, the job will be put back into the queue and relaunched on a different set of nodes, excluding the failed one(s). If this default behaviour is undesired or unsupported by the application, please submit with the --no-requeue flag for sbatch.
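For example, the flag can be passed on the command line or added as a directive to the job script:
sbatch --no-requeue script.sh
# or, inside the job script:
#SBATCH --no-requeue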
How do I do pinning correctly?
On MPCDF systems, pinning is done via Slurm. To ensure it works correctly, please use the example job scripts we provide for pure MPI jobs, hybrid MPI+OpenMP jobs, and jobs involving hyperthreading, and start your code with srun.
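As a rough sketch only (please prefer the system-specific example scripts), a hybrid MPI+OpenMP job script might combine the Slurm resource requests with the OpenMP environment as follows; the node, task, and thread counts below are placeholders:
#!/bin/bash -l
#SBATCH --nodes=2              # placeholder values, adapt to the target system
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=9
#SBATCH --time=12:00:00

module load ...

# give each MPI task exactly the cores assigned by Slurm
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores

srun ./my_hybrid_executable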
Parallel File Systems (GPFS)
Which file systems are available and how should I use them?
Each HPC system at the MPCDF typically operates two GPFS high performance file systems.
The first file system contains your home directory at /u/$USER and is intended to store files of moderate number and size, such as program source codes and builds, locally installed software, or results like processed data, plots, or videos.
Do not configure your simulation codes to perform heavy IO in your home directory.
The second file system provides temporary disk space for large IO operations, where you have a directory at /ptmp/$USER. It is optimized for large streaming IO, e.g. for reading and writing of simulation snapshots or restart files. Simulation codes must use the ptmp file system to perform heavy IO.
Please note that a cleanup policy is active on the ptmp file systems of the HPC systems that purges unused files after 12 weeks. That period may be reduced, if necessary, with prior announcement.
More details on the file systems (concerning backups, quotas, and automated cleanup) are given on the documentation pages of the individual systems. Please keep in mind that parallel file systems are resources shared across the whole system, and wrong utilization may affect other users and jobs on the full system.
I do not get the IO performance I would expect. Am I doing something wrong?
The parallel filesystems deliver the best performance with sequential access patterns on large files. Avoid random access patterns if possible, and avoid the creation and use of large numbers of small files, in particular large numbers of files in a single directory (>1000). Being a shared resource, the performance of the filesystem may vary and depends on the load caused by all users on the system.
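One possible way to mitigate problems with many small files, sketched here with placeholder paths, is to bundle them into a single archive with tar before staging them on the parallel file system:
# pack many small files into one archive on ptmp
tar -cf /ptmp/$USER/results.tar ./many_small_files/
# later, unpack them again where they are needed
tar -xf /ptmp/$USER/results.tar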
How can I grant other users access to my files? How do I use ACLs?
The GPFS filesystem offers the possibility of sharing files and directories
with colleagues, even if they belong to different groups.
Use the commands setfacl and getfacl to manipulate access control lists (ACLs).
Granting read access
These two commands make a whole subtree readable for the user “john”:
setfacl -R -m user:john:rx /u/my/own/directory
setfacl -m user:john:rx /u/my/own /u/my
The first command applies recursively (-R) to every file and directory below /u/my/own/directory. The second command augments the access rights for the upper level directories, which by default are only accessible to the owner. The “x” (execute) bit is necessary in order to traverse directories, for plain files the “r” bit is sufficient. You can also define access rights for individual files. When you list such files and directories with ls -l, they will have a “+” sign appended to their mode bits.
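For illustration, the effect can be checked as follows (output shortened, user and group names are placeholders):
ls -ld /u/my/own/directory
# drwxr-x---+ 2 myuser mygroup 4096 Jan  1 12:00 /u/my/own/directory
getfacl /u/my/own/directory
# the output contains a line such as: user:john:r-x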
You can also grant access to a whole group of users. For example, to share your files with the MPCDF support team, you can grant the “rzg” group access to your folder:
setfacl -R -m group:rzg:rx /u/my/own/directory
setfacl -m group:rzg:rx /u/my/own /u/my
Revoking access
The action can be reversed with the command
setfacl -R -x user:john /u/my/own/directory
You can verify this with, for example,
getfacl -R /u/my/own/directory
Cleaning up
To purge all ACL entries (and return to the standard UNIX mode bits), use the command
setfacl -b /u/my/own/directory
How can I transfer files to and from the HPC systems?
For various use cases the following options are available on the login nodes:
- MPCDF DataShare, using the ds command line client (module load datashare, ds put, ds get), for small to medium sized files (e.g. source tarballs, plots)
- scp and rsync for transfers of files and directory trees; rsync may resume a transfer after it got interrupted (an example invocation is sketched after this list)
- curl and wget to download files from the www
- bbcp for parallel transfers of large amounts of data
- Globus Online for transfers of large amounts of data
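As a rough example of such a transfer, a directory tree can be copied to the ptmp file system with rsync; the user name, host name, and paths below are placeholders that need to be adapted to your account and the login node of your system:
rsync -avP ./my_local_data/ username@login-node.mpcdf.mpg.de:/ptmp/username/my_data/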
Performance Monitoring
How can I find out about the performance of my runs?
Each job on the HPC systems and on selected clusters is monitored by the HPC performance monitoring system. Users can access the data as downloadable PDF reports at https://hpc-reports.mpcdf.mpg.de/. Such reports often give hints at performance issues that must then be investigated more closely by the users using profilers or similar tools.
How can I stop the background performance monitoring?
The performance monitoring system uses perf to query performance counters on the CPUs, which might interfere with profiling tools such as VTune, likwid, and others. To suspend the instances of hpcmd that monitor each compute node during the runtime of a batch job, we provide the wrapper hpcmd_suspend. Simply put it in between srun and the executable you want to run as follows:
srun hpcmd_suspend ./my_executable
After the batch job has ended, hpcmd will re-enable itself automatically. Please do not suspend hpcmd unless you intend to perform your own measurements.
GPU Computing
How can I launch the NVIDIA Multi-Process Service (MPS)?
NVIDIA MPS enables multiple concurrent processes (e.g. MPI tasks) to use the same GPU in an efficient way. Please find more technical information in the NVIDIA MPS documentation.
MPS can be beneficial especially for codes that use more MPI ranks per node than GPUs available on that node, e.g., because some parts are executed on the CPU and thus the full CPU node is needed. In this case, MPS can improve the utilization of the GPUs by executing kernels at the same time that were launched from different processes.
MPS must be launched once per node before the MPI tasks are launched.
To this end, the command line flag ‘--nvmps’ is supported for sbatch on Raven.
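For example, MPS can be requested by adding the flag as a directive to the job script (the remaining directives are omitted here):
#SBATCH --nvmps
# ... further #SBATCH directives, module loads, etc. ...
srun ./my_gpu_application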
How can I profile my GPU code?
You can use the Nsight tools to profile your GPU code on NVIDIA GPUs.
With Nsight Systems you can generate timelines of the execution of your program to analyze the general workflow. You can get details about data transfers (e.g., speeds, direction) and kernel launches that can help you find bottlenecks and give hints for optimization steps.
For example, an application can be profiled by using:
module load cuda/XX.Y nsight_systems/ZZZZ
nsys profile -t cuda,cudnn srun ./application
which generates a file named report1.nsys-rep that can be opened in the GUI. For more profiling options, have a look at the documentation of Nsight Systems.
To start the GUI, connect to the MPCDF system forwarding X11 displays and run
module load nsight_systems/ZZZZ
nsys-ui
With Nsight Compute, you can analyze specific kernels in great detail. It provides different performance metrics and includes, e.g., a roofline analysis, a memory workload analysis, and heat maps correlated with the source code. If you run your code with Nsight Systems first, you can select kernels to analyze and get the command line for running Nsight Compute.
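As a rough sketch of a command-line invocation of Nsight Compute (the module name nsight_compute/ZZZZ and the kernel name my_kernel are assumptions to be adapted):
module load cuda/XX.Y nsight_compute/ZZZZ
ncu --set full -k my_kernel -o ncu_report ./application
# writes ncu_report.ncu-rep, which can be opened in the Nsight Compute GUI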
Are there dedicated GPU resources available for interactive GPU development?
A GPU development partition ‘gpudev’ with a short turnaround time is available on a subset of the V100 nodes of Cobra to allow users to test and optimize their GPU codes in an easy and time-saving manner. To access the resource, specify
#SBATCH --partition=gpudev
#SBATCH --gres=gpu:v100:2
in your Slurm batch script. Note that the time limit is 15 minutes. Interactive use is possible e.g. as follows:
srun --time=00:05:00 --nodes=1 --tasks-per-node=1 --cpus-per-task=40 --partition=gpudev --gres=gpu:v100:2 ./my_gpu_application
Containers
Which container solutions are supported? Can I run Docker containers?
Singularity and Charliecloud containers are supported on some HPC systems. For security reasons Docker is not directly supported, however, Docker containers can typically be converted and run with Singularity or Charliecloud. Please see the container documentation for more details.
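As a sketch, a Docker image from a public registry can typically be converted to a Singularity image and run as follows; the module name and the image used here are only examples:
module load singularity
singularity build myimage.sif docker://ubuntu:22.04
singularity exec myimage.sif cat /etc/os-release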
Remote Visualization
How can I run visualization tools or any software that uses OpenGL on MPCDF systems?
Visualization tools such as VisIt, ParaView, and similar use OpenGL to
perform hardware-accelerated rendering. Access to GPUs and OpenGL is provided
via the VNC-based remote visualization service. To run OpenGL-based software,
launch a remote visualization session, open a terminal, and prefix your
command with vglrun, for example using VisIt:
module load visit
vglrun visit
How can I get access to MPCDF remote visualization services?
Browser-based access to the remote visualization services is provided via the web server https://rvs.mpcdf.mpg.de/. No special client software such as VNC viewers is necessary.