HPC Systems and Services

Raven

Slurm batch system

How do I submit jobs to the Slurm batch system?

First, write a Slurm job script (e.g. a text file named script.sh) that specifies the resources the job requires and the commands the job executes, similar to a shell script. Example scripts are provided with the documentation of each HPC system. Second, use the command sbatch script.sh to submit the script as a job to the batch system. After submission, sbatch reports a job ID that serves as a unique handle for the job on the system.
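As an illustration, a minimal job script could look like the following sketch. The job name, output file names, resource values and the program name ./my_program are assumptions chosen for this example. For real runs, please start from the example scripts provided for the respective HPC system.

```shell
#!/bin/bash -l
# Minimal Slurm job script sketch, saved e.g. as script.sh.
# All values below are illustrative assumptions, not recommended settings.
#
# Standard output and error go to files tagged with the job ID (%j).
#SBATCH --job-name=my_test
#SBATCH --output=job.out.%j
#SBATCH --error=job.err.%j
#
# Requested resources: one node, four tasks, ten minutes of wall clock time.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00

# Launch the (hypothetical) program with the resources requested above.
srun ./my_program
```

Submitting this file with sbatch script.sh prints a line of the form "Submitted batch job <jobid>".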

Can I submit jobs longer than 24 hours? Why are job run times limited to 24 hours?

Batch jobs on the HPC systems and clusters are in general limited to run times of 24 hours, with very few exceptions on dedicated clusters in agreement with their owners. If your runs need more time, your code must support checkpoint-restart functionality, i.e. it must be able to write its current state to disk (create a checkpoint) and resume from that checkpoint at the beginning of a subsequent run. The 24h limit is necessary to enable reasonable job turnaround times for all users and to maximize the utilization of the system.
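The checkpoint-restart idea can be sketched in plain bash as follows. This is a toy model only: the file name checkpoint.txt, the step counts and the function run_segment are hypothetical, and the counting loop stands in for the real work of a simulation code.

```shell
#!/bin/bash
# Toy sketch of checkpoint-restart logic; all names and numbers are hypothetical.

CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.txt}"
TOTAL_STEPS=10      # total amount of work of the whole simulation
STEPS_PER_RUN=4     # amount of work that fits into one job's time limit

run_segment() {
    local step=0
    # Resume from the last checkpoint if one exists.
    if [ -f "$CHECKPOINT_FILE" ]; then
        step=$(cat "$CHECKPOINT_FILE")
    fi
    # Do at most STEPS_PER_RUN steps, but never exceed TOTAL_STEPS.
    local stop=$(( step + STEPS_PER_RUN ))
    if [ "$stop" -gt "$TOTAL_STEPS" ]; then
        stop=$TOTAL_STEPS
    fi
    while [ "$step" -lt "$stop" ]; do
        step=$(( step + 1 ))    # stand-in for one unit of real work
    done
    # Write the new state to disk so that the next run can pick it up.
    echo "$step" > "$CHECKPOINT_FILE"
    echo "$step"
}
```

Each call to run_segment corresponds to one batch job. With the values above, three consecutive calls finish at steps 4, 8 and 10. Follow-up jobs can be queued in advance with the sbatch option --dependency=afterany:<jobid>, so that each one starts only after its predecessor has ended.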

How do I launch an MPI code?

In your Slurm job script, start your program with the command

srun application

which will launch the application using the resources specified in the Slurm job script.

What is the correct order of executable commands and ‘#SBATCH’ directives in a Slurm job script?

A Slurm job script must stick to the scheme as outlined in the following:

#!/bin/bash -l

# --- SBATCH directives come first ---
#SBATCH ...
#SBATCH ...

# --- executable commands follow after the SBATCH directives ---
module load ...
srun my_executable ...

All ‘#SBATCH’ directives must be put into the script before the first executable command is called. In particular, any ‘#SBATCH’ directive after the first non-comment non-whitespace line is ignored.

Can I work interactively for debugging and development work?

Production runs and large tests must be run via batch scripts. From a login node, small test runs can be started interactively, e.g. as follows:

srun --time=00:05:00 --mem=1G --ntasks=2 ./mpi_application

How can I query the estimated start time of my job?

The command

squeue --start -j <jobid>

will print the current estimated start time for a job.

How can I get information about my job at runtime?

The command

scontrol show jobid -dd <jobid>

will print detailed information on a job while it is running.

How can I get information about my job after it has finished?

After a job has finished, the sacct command can be used to obtain information, e.g.

sacct -j <jobid>
sacct -u $USER --format=JobID,JobName,MaxRSS,Elapsed

What happens when a running job is hit by a hardware failure?

Hardware failures are rare but unavoidable. In such situations, an affected run is interrupted by Slurm, and an error message similar to ‘srun: error: Node failure on coXXXX’ is written to ‘stderr’. On all HPC systems and most dedicated clusters, the job will be put back into the queue and relaunched on a different set of nodes, excluding the failed one(s). If this default behaviour is undesired or unsupported by the application, please submit with the --no-requeue flag for sbatch.
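If the application can tolerate a relaunch, a job script may want to know whether it is running for the first time or after a requeue. The following sketch relies on the environment variable SLURM_RESTART_COUNT, which Slurm sets for restarted jobs. The function name launch_kind is hypothetical.

```shell
#!/bin/bash
# Sketch: detect inside a job script whether Slurm has requeued this job.
# SLURM_RESTART_COUNT holds the number of restarts; it is unset on the first launch.

launch_kind() {
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        echo "requeued"
    else
        echo "first"
    fi
}

# Example use: report the launch kind, e.g. to decide whether to resume from a checkpoint.
echo "Launch kind: $(launch_kind)"
```

To disable the automatic relaunch instead, the flag can also be given as a directive line #SBATCH --no-requeue in the job script.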

How do I do pinning correctly?

On MPCDF systems, pinning is done using Slurm. To ensure it works correctly, please use the example job scripts we provide for pure MPI jobs, hybrid MPI+OpenMP jobs, and jobs involving hyperthreading, and start your code with srun.

Parallel File Systems (GPFS)

Which file systems are available and how should I use them?

Each HPC system at the MPCDF typically operates two GPFS high performance file systems.

The first file system contains your home directory at /u/$USER and is intended to store files of moderate number and size such as program source codes and builds, locally installed software, or results like processed data, plots, or videos. Do not configure your simulation codes to perform heavy IO in your home directory.

The second file system provides temporary disk space for large IO operations, where you have a directory at /ptmp/$USER. It is optimized for large streaming IO, e.g. for reading and writing of simulation snapshots or restart files. Simulation codes must use the ptmp file system to perform heavy IO. Please note that a cleanup policy is active on the ptmp file systems of the HPC systems that purges unused files after 12 weeks. That time may be reduced, if necessary, on prior announcement.

More details on the file systems (concerning backups, quotas, and automated cleanup) are given on the documentation pages of the individual systems. Please keep in mind that parallel file systems are resources shared across the whole system, and wrong utilization may affect other users and jobs on the full system.

I do not get the IO performance I would expect. Am I doing something wrong?

The parallel filesystems deliver the best performance with sequential access patterns on large files. Avoid random access patterns if possible, and avoid the creation and use of large numbers of small files, in particular large numbers of files in a single directory (>1000). Being a shared resource, the performance of the filesystem may vary and depends on the load caused by all users on the system.
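One common way to reduce the number of small files is to bundle them into a single archive once they are no longer modified. The following sketch uses tar for this. The function names and the directory layout are hypothetical.

```shell
#!/bin/bash
# Sketch: bundle a directory of many small files into one archive file.
# Function names and paths are hypothetical.

pack_small_files() {
    local src_dir="$1"    # directory containing many small files
    local archive="$2"    # single tar file to create
    # -C changes into the parent directory so that the archive stores relative paths.
    tar -cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

count_archive_files() {
    # List the archive members and count those that are not directories
    # (directory entries end with "/").
    tar -tf "$1" | grep -vc '/$'
}
```

A call could look like pack_small_files /ptmp/$USER/run/snapshots /ptmp/$USER/run/snapshots.tar. After verifying the archive, e.g. with count_archive_files, the original small files can be deleted.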

How can I grant other users access to my files? How do I use ACLs?

The GPFS filesystem offers the possibility of sharing files and directories with colleagues, even if they belong to different groups. Use the commands setfacl and getfacl to manipulate access control lists (ACLs).

Granting read access

These two commands make a whole subtree readable for the user “john”:

setfacl -R -m user:john:rx /u/my/own/directory
setfacl -m user:john:rx /u/my/own /u/my

The first command applies recursively (-R) to every file and directory below /u/my/own/directory. The second command augments the access rights for the upper level directories, which by default are only accessible to the owner. The “x” (execute) bit is necessary in order to traverse directories; for plain files the “r” bit is sufficient. You can also define access rights for individual files. When you list such files and directories with ls -l, they will have a “+” sign appended to their mode bits.

You can also grant access to a whole group of users. For example, if you want to share your files with the MPCDF support team, you can share your folder with the “rzg” group in the same way as above:

setfacl -R -m group:rzg:rx /u/my/own/directory
setfacl -m group:rzg:rx /u/my/own /u/my

Revoking access

The action can be reversed with the command

setfacl -R -x user:john /u/my/own/directory

You can verify this with, for example,

getfacl -R /u/my/own/directory

Cleaning up

To purge all ACL entries (and return to the standard UNIX mode bits), use the command

setfacl -b /u/my/own/directory

How can I transfer files to and from the HPC systems?

For various use cases the following options are available on the login nodes:

  • MPCDF DataShare, using the ds command line client (module load datashare, ds put, ds get) for small to medium sized files (e.g. source tarballs, plots)

  • scp and rsync for transfers of files and directory trees; rsync may resume a transfer after it got interrupted

  • curl and wget to download files from the web

  • bbcp for parallel transfers of large amounts of data

  • Globus online for transfers of large amounts of data

Performance Monitoring

How can I find out about the performance of my runs?

Each job on the HPC systems and on selected clusters is monitored by the HPC performance monitoring system. Users can access the data as downloadable PDF reports at https://hpc-reports.mpcdf.mpg.de/. Such reports often give hints at performance issues, which users must then investigate more closely using profilers or similar tools.

How can I stop the background performance monitoring?

The performance monitoring system uses perf to query performance counters on the CPUs, which might interfere with profiling tools such as VTune, likwid, and others. To suspend the instances of hpcmd that monitor each compute node during the runtime of a batch job, we provide the wrapper hpcmd_suspend. Simply put it in between srun and the executable you want to run as follows:

srun hpcmd_suspend ./my_executable

After the batch job has ended, hpcmd will re-enable itself automatically. Please do not suspend hpcmd unless you intend to perform your own measurements.

GPU Computing

How can I launch the NVIDIA Multi-Process Service (MPS)?

NVIDIA MPS enables multiple concurrent processes (e.g. MPI tasks) to use the same GPU in an efficient way. Please find more technical information in the NVIDIA MPS documentation.

MPS can be beneficial especially for codes that use more MPI ranks per node than GPUs available on that node, e.g., because some parts are executed on the CPU and thus the full CPU node is needed. In this case, MPS can improve the utilization of the GPUs by executing kernels at the same time that were launched from different processes.

MPS must be launched once per node before the MPI tasks are launched. To this end, the command line flag ‘--nvmps’ is supported for sbatch on Raven.
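A job script fragment requesting MPS might look like the following sketch. It assumes that the flag can be given as an #SBATCH directive like other sbatch options. The resource values, the generic GPU request and the program name ./my_gpu_program are illustrative assumptions; please take the exact GPU resource specification from the example scripts for Raven.

```shell
#!/bin/bash -l
# Job script fragment; resource values and program name are illustrative assumptions.
#
# More MPI ranks per node than GPUs, which is the case where MPS can help.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:4
#
# Ask for NVIDIA MPS to be started once per node (flag specific to Raven).
#SBATCH --nvmps
#SBATCH --time=00:10:00

srun ./my_gpu_program
```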

How can I profile my GPU code?

You can use the Nsight tools to profile your GPU code on NVIDIA GPUs.

With Nsight Systems you can generate timelines of the execution of your program to analyze the general workflow. You can get details about data transfers (e.g., speeds, direction) and kernel launches that can help you find bottlenecks and give hints for optimization steps.

For example, an application can be profiled by using:

module load cuda/XX.Y nsight_systems/ZZZZ
nsys profile -t cuda,cudnn srun ./application

which generates a file named report1.nsys-rep that can be opened in the GUI. For more options to profile, have a look into the documentation of Nsight Systems.

To start the GUI, connect to the MPCDF system forwarding X11 displays and run

module load nsight_systems/ZZZZ
nsys-ui

With Nsight Compute, you can analyze specific kernels in great detail. It provides different performance metrics and includes, e.g., a roofline analysis, a memory workload analysis, and heat maps correlated with the source code. If you run your code with Nsight Systems first, you can select kernels to analyze and get the command line for running Nsight Compute.

Are there dedicated GPU resources available for interactive GPU development?

A GPU development partition ‘gpudev’ with a short turnaround time is available on a subset of the V100 nodes of Cobra to allow users to test and optimize their GPU codes in an easy and time-saving manner. To access the resource, specify

#SBATCH --partition=gpudev
#SBATCH --gres=gpu:v100:2

in your Slurm batch script. Note that the time limit is 15 minutes. Interactive use is possible e.g. as follows:

srun --time=00:05:00 --nodes=1 --tasks-per-node=1 --cpus-per-task=40 --partition=gpudev --gres=gpu:v100:2 ./my_gpu_application

Containers

Which container solutions are supported? Can I run Docker containers?

Singularity and Charliecloud containers are supported on some HPC systems. For security reasons Docker is not directly supported, however, Docker containers can typically be converted and run with Singularity or Charliecloud. Please see the container documentation for more details.

Remote Visualization

How can I run visualization tools or any software that uses OpenGL on MPCDF systems?

Visualization tools such as VisIt, ParaView, and similar use OpenGL to perform hardware-accelerated rendering. Access to GPUs and OpenGL is provided via the VNC-based remote visualization service. To run OpenGL-based software, launch a remote visualization session, open a terminal, and prefix your command with vglrun, for example using VisIt:

module load visit
vglrun visit

How can I get access to MPCDF remote visualization services?

Browser-based access to the remote visualization services is provided via the web server https://rvs.mpcdf.mpg.de/. No special client software such as VNC viewers is necessary.