Viper-GPU User Guide

Important

Viper-GPU is currently operated in user-risk mode which means that jobs may be canceled at any time if urgent maintenance work has to be done. Please check the message of the day on the login node for latest announcements, e.g. on software configuration changes.

Contents

System overview 

The HPC system Viper, deployed in the course of 2024/2025, consists of two parts, Viper-GPU and Viper-CPU which are operated as two logically separate clusters.

Viper-GPU is operational since February 2025 and in its final configuration provides 300 GPU compute nodes, with 2 AMD Instinct MI300A APUs and 128 GB of high-bandwidth memory (HBM3) per APU. The nodes are interconnected with a NVIDIA/Mellanox NDR InfiniBand network using a fat-tree topology (NDR, 400 Gb/s).

There is 1 login node and an I/O subsystem that serves approx. 12 PB of disk-based storage with direct HSM access, plus approx. 500 TB of NVMe-based storage.

MPCDF Viper-GPU Deployment

Access 

Login 

For security reasons, direct login to the HPC system Viper-GPU is allowed only from within some MPG networks. Users from other locations have to login to one of our gateway systems first.

ssh <user>@gate.mpcdf.mpg.de

Use ssh to connect to Viper-GPU:

ssh <user>@viper12i.mpcdf.mpg.de

To login you have to provide your (Kerberos) password and an OTP on the Viper-GPU login node. SSH keys are not allowed.

Secure copy (scp) can be used to transfer data to or from viper12i.mpcdf.mpg.de.

Viper-GPU’s (all login/interactive nodes) SSH key fingerprints (SHA256) are:

SHA256:0DhkRt6Qom1GA0SmvEjnlKWpIg1+kMPDjOUSqJ8ceyQ (ED25519)
SHA256:pM1fS+4YXGuIaLk0bLSt1sOlS1TpLrZYmFMRY9UjBKo (RSA)

Resource limits 

The login node viper12i.mpcdf.mpg.de is intended for editing, compiling and submitting your parallel programs, only. Running parallel programs interactively on the login node is NOT allowed. CPU resources are restricted to an equivalent of six physical cpu cores, per user. A hard memory limit is also enforced, which is 50% of the available memory.

Jobs have to be submitted to the Slurm batch system which reserves and allocates the resources (e.g. compute nodes) required for your job. Further information on the batch system is provided below.

Interactive (debug) runs 

To test and optimize GPU codes one can use the “apudev” partition by specifying

#SBATCH -p apudev

in the submit script. One node with two MI300A APUs is available in the “apudev” partition. The maximum wall clock time is 15 minutes. One or two APUs can be requested.

Internet access 

Connections to the Internet are only permitted from the login nodes in outgoing direction; Internet access from within batch jobs is not possible.

Data transfers 

To download source code or other data, command line tools such as wget, curl, rsync, scp, pip, git, or similar may be used interactively on the login nodes.

In case the transfer is expected to take a long time it is recommended to run it inside a screen or tmux session.

Hardware configuration 

Compute nodes 

GPU nodes

228 GPU nodes, 456 APUs
Processor type: AMD Instinct MI300A APU
Main memory (HBM3) per APU: 128 GB
24 CPU cores per APU
228 GPU compute units per APU
Theoretical peak performance per APU (FP64, “double precision”): 61 TFlop/s
Theoretical memory bandwidth per APU: 5.3 TB/s

Interconnect 

Viper-GPU uses a Mellanox InfiniBand NDR network with a non-blocking fat-tree topology with a per-node bandwidth of 400 Gb/s (NDR).

I/O subsystem 

Approx. 12 PB of online disk space are available.

Additional hardware details 

Additional details on the hardware are given on a separate documentation page.

File systems 

Important: The HPC systems Viper-CPU and Viper-GPU have separate file systems, i.e., each of both HPC systems has its own local file systems at /u and /ptmp. To access data from Viper-CPU from Viper-GPU and vice versa, the data has to be copied.

$HOME 

Your home directory is located in the GPFS file system /u (see below).

GPFS 

There are two global, parallel file systems of type GPFS (/u and /ptmp), symmetrically accessible from all Viper-GPU cluster nodes, plus the migrating file system /r interfacing to the HPSS archive system.

File system `/u`

The file system /u (a symbolic link to /viper/u2) is designed for permanent user data (source files, config files, etc.). The size of /u is 1.2 PB. Note that no system backups are performed. Your home directory is in /u. The default disk quota in /u is 1.0 TB, the file quota is 256K files. You can check your disk quota in /u with the command:

/usr/lpp/mmfs/bin/mmlsquota viper_u2

File system `/ptmp`

The file system /ptmp (a symbolic link to /viper/ptmp2) is designed for batch job I/O (12 PB, no system backups). Files in /ptmp that have not been accessed for more than 12 weeks will be removed automatically. The period of 12 weeks may be reduced if necessary (with prior notice).

As a current policy, no quotas are applied on /ptmp. This gives users the freedom to manage their data according to their actual needs without administrative overhead. This liberal policy presumes a fair usage of the common file space. So, please do a regular housekeeping of your data and archive/remove files that are not used actually.

Archiving data from the GPFS file systems to tape can be done using the migrating file system /r (see below).

File system `/r`

The /r file system (a symbolic link to /ghi/r) stages archive data. It is available only on the login node viper12i.mpcdf.mpg.de.

Each user has a subdirectory /r/<initial>/<userid> to store data. For efficiency, files should be packed to tar files (with a size of about 1 GB to 1 TB) before archiving them in /r, i.e., please avoid archiving small files. When the file system /r gets filled above a certain value, files will be transferred from disk to tape, beginning with the largest files which have not been used for the longest time.

For documentation on how to use the MPCDF archive system, please see the backup and archive section.

/tmp and node-local storage

Please don’t use the file system /tmp or $TMPDIR for scratch data. Instead, use /ptmp which is accessible from all Viper-GPU cluster nodes.

In cases where an application really depends on node-local storage, please use the directories from the environment variables JOB_TMPDIR and JOB_SHMTMPDIR, which are set individually for each Slurm job and cleaned afterwards.

Software 

Access to software via environment modules 

Environment modules are used at MPCDF to provide software packages and enable easy switching between different software versions.

Use the command

module avail

to list the available software packages on the HPC system. Note that you can search for a certain module by using the find-module tool (see below).

Use the command

module load package_name/version

to actually load a software package at a specific version.

Further information on the environment modules on Viper-GPU and their hierarchical organization is given below.

Information on the software packages provided by the MPCDF is available here.

Recommended compiler and MPI software stack on Viper-GPU 

As explained below, no defaults are defined for the compiler and MPI modules, and no modules are loaded automatically at login.

We currently recommend to use the following versions on Viper-GPU:

module load gcc/14 rocm/6.3 openmpi/5.0

Specific optimizing compiler flags for the AMD EPYC CPU are given further below.

If you want to use GPU-aware MPI, we recommend:

module load gcc/14 rocm/6.3 openmpi_gpu/5.0

Hierarchical module environment 

To manage the plethora of software packages resulting from all the relevant combinations of compilers and MPI libraries, we organize the environment module system for accessing these packages in a natural hierarchical manner. Compilers (gcc, intel) are located on the uppermost level, depending libraries (e.g., MPI) on the second level, more depending libraries on a third level. This means that not all the modules are visible initially: Only after loading a compiler module, the modules depending on this will become available. And similarly, loading an MPI module in addition will make the modules depending on the MPI library available.

No defaults are defined for the compiler and MPI modules, and no modules are loaded automatically at login. This forces users to specify explicit versions for those modules during compilation and in the batch scripts to ensure that the same MPI library is loaded. This also means that users can decide themselves when they use newer compiler and MPI versions for their code which avoids compatibility problems when changing defaults centrally.

For example, the FFTW library compiled with the GCC compiler and the OpenMPI library can be loaded as follows:

First, load the compiler module using the command

module load gcc/14

second, the MPI module with

module load openmpi/5.0

and, finally, the FFTW module fitting exactly to the compiler and MPI library via

module load fftw-mpi

You may check by using the command

module avail

that after the first and second steps the depending environment modules become visible, in the present example openmpi and fftw-mpi. Moreover, note that the environment modules can be loaded via a single ‘module load’ statement as long as the order given by the hierarchy is correct, e.g.,

module load gcc/14 openmpi/5.0 fftw-mpi

It is important to point out that a large fraction of the available software is not affected by the hierarchy, e.g., certain HPC applications, tools such as git or cmake, mathematical software (maple, matlab, mathematica), visualization software (visit, paraview, idl) are visible at the uppermost hierarchy. Note that a hierarchy exists for depending Python modules via the ‘anaconda’ module files on the top level.

Because of the hierarchy, some modules only appear after other modules (such as compiler and MPI) have been loaded. One can search all available combinations of a certain software (e.g. fftw-mpi) by using

find-module fftw-mpi

Further information on using environment modules is given here.

Slurm batch system 

The batch system used on the HPC cluster Viper-GPU is the open-source workload manager Slurm (Simple Linux Utility for Resource management). To run test or production jobs, submit a job script (see below) to Slurm, which will allocate the resources required for your job (e.g. the compute nodes to run your job on).

By default, the job run limit is set to 8 on Viper-GPU, the default job submit limit is 300. If your batch jobs can’t run independently from each other, please use job steps.

There are mainly two types of batch jobs:

Exclusive, where all resources on the nodes are allocated to the job
Shared, where two jobs may share the resources of one node. In this case it is necessary that the number of CPUs and the amount of memory are specified for each job.

The cpu cores on Viper-GPU support simultaneous multithreading (SMT) which potentially increases the performance of an application by up to 20%. To use SMT, you have to increase the product of the number of MPI tasks per node and the number of threads per MPI task from 48 to 96 in your job script. Please be aware that when doubling the number of MPI tasks per node each task only gets half of the memory compared to the non-SMT job.

Overview of the available per-job resources on Viper-GPU:

    Job type          Max. CPUs            Number of GPUs   Max. Memory      Number     Max. Run
                      per Node               per node      per Node [MB]     of Nodes      Time
   =============================================================================================
    shared    apu      24 / 48  in HT mode        1           110000          < 1       24:00:00
   ---------------------------------------------------------------------------------------------
    exclusive apu      48 / 96  in HT mode        2           220000        1-128       24:00:00
   ---------------------------------------------------------------------------------------------

A job submit filter will automatically choose the right partition and job parameters from the resource specification.

For detailed information about the Slurm batch system, please see Slurm Workload Manager.

The most important Slurm commands are

sbatch <job_script_name> Submit a job script for execution
squeue Check the status of your job(s)
scancel <job_id> Cancel a job
sinfo List the available batch queues (partitions).

Do not run Slurm client commands from loops in shell scripts or other programs. Ensure that programs limit calls to these commands to the minimum necessary for the information you are trying to gather.

Sample Batch job scripts can be found below.

Notes on job scripts:

The directive
```
#SBATCH --nodes=<number of nodes>
```
in your job script specifies the number of compute nodes that your program will use.
The directive
```
#SBATCH  --ntasks-per-node=<number of cpus>
```
specifies the number of MPI processes for the job. The parameter ntasks-per-node cannot be greater than 48 because one apu compute node on Viper-GPU has 48 physical cores (with 2 threads each and thus 96 logical CPUs in SMT mode).
The directive
```
#SBATCH --cpus-per-task=<number of OMP threads per MPI task>
```
specifies the number of threads per MPI process if you are using OpenMP.
The expression
```
ntasks-per-node * cpus-per-task
```
may not exceed 96.
The expression
```
nodes * ntasks-per-node * cpus-per-task
```
gives the total number of CPUs that your job will use.
To select GPU nodes specify a job constraint as follows:
```
#SBATCH --constraint="apu"
```
Jobs that need less than a half compute node have to specify a reasonable memory limit so that they can share a node!
A job submit filter will automatically choose the right partition/queue from the resource specification.

Slurm example batch scripts 

Batch jobs using APUs 

Note that computing time on APU-accelerated nodes is accounted using a weighting factor of 2 relative to CPU-only jobs, corresponding to the additional computing power provided by the GPUs. Users are advised to check the performance reports of their jobs in order to monitor adequate utilization of the resources.

GPU job using 1 or 2 APUs on a single node

The following example job script launches a (potentially multithreaded) program to use one (or two) APU(s) on a single node. In case more than one GPUs are requested the user code must be able to utilize these additional APUs properly.

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J test_apu
#
#SBATCH --ntasks=1
#SBATCH --constraint="apu"
#
# --- default case: use a single APU on a shared node ---
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=24
#SBATCH --mem=110000
#
# --- uncomment to use 2 APUs on a full node ---
# #SBATCH --gres=gpu:2
# #SBATCH --cpus-per-task=48
# #SBATCH --mem=220000
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#SBATCH --time=12:00:00

module purge
module load gcc/14 openmpi/5.0 rocm/6.3

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./apu_executable

Hybrid MPI/OpenMP job using one or more nodes with 2 APUs each

The following example job script launches a hybrid MPI/OpenMP-code on one (or more) nodes running one task per APU. Note that the user code needs to attach its tasks to the different APUs based on some code-internal logic.

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J test_gpu
#
#SBATCH --nodes=1            # Request 1 or more full nodes
#SBATCH --constraint="apu"   #   providing APUs.
#SBATCH --gres=gpu:2         # Request 2 APUs per node.
#SBATCH --ntasks-per-node=2  # Run one task per APU
#SBATCH --cpus-per-task=24   #   using 24 cores each.
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#SBATCH --time=12:00:00

module purge
module load gcc/14 openmpi/5.0 rocm/6.3

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./mpi_openmp_apu_executable

Batch jobs with dependencies 

The following script generates a sequence of jobs, each job running the given job script. The start of each individual job depends on its dependency, where possible values for the --dependency flag are, e.g.

afterany:job_id This job starts after the previous job has terminated
afterok:job_id This job starts after previous job has successfully executed

#!/bin/bash
# Submit a sequence of batch jobs with dependencies
#
# Number of jobs to submit:
NR_OF_JOBS=6
# Batch job script:
JOB_SCRIPT=./my_batch_script
echo "Submitting job chain of ${NR_OF_JOBS} jobs for batch script ${JOB_SCRIPT}:"
JOBID=$(sbatch ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
echo "  " ${JOBID}
I=1
while [ ${I} -lt ${NR_OF_JOBS} ]; do
  JOBID=$(sbatch --dependency=afterany:${JOBID} ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
  echo "  " ${JOBID}
  let I=${I}+1
done

Batch job using a job array 

#!/bin/bash -l
# specify the indexes (max. 30000) of the job array elements (max. 300 - the default job submit limit per user)
#SBATCH --array=1-20
# Standard output and error:
#SBATCH -o job_%A_%a.out        # Standard output, %A = job ID, %a = job array index
#SBATCH -e job_%A_%a.err        # Standard error, %A = job ID, %a = job array index
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_array
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=1
#SBATCH --constraint="apu"   #   providing APUs.
#SBATCH --gres=gpu:2    # Request 2 APUs per node.
#SBATCH --ntasks-per-node=48
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load gcc/14 openmpi/5.0 rocm/6.3

# Run the program:
#  the environment variable $SLURM_ARRAY_TASK_ID holds the index of the job array and
#  can be used to discriminate between individual elements of the job array

srun ./myprog > prog.out

Migration guide for users coming from Intel- and NVIDIA-based HPC systems 

Comprehensive general information on tools, techniques, and best practices necessary to target the AMD APUs is available on the slidedecks from past training events:

Below, crucial information to get started specifically on the Viper-GPU system at the MPCDF is given in condensed form.

Application performance 

On a single AMD MI300A APU, sufficiently optimized GPU application codes should expect a performance improvement by at least a factor of two compared to a single Nvidia A100 GPU. When comparing per-node performances, this corresponds to an application performance per Viper-GPU node (with 2 MI300A APUs) being at least on par with (or better than) a Raven-GPU node (with 4 Nvidia A100 GPUs).

Users who need help with improving performance or with porting their codes to Viper-GPU are encouraged to contact the MPCDF helpdesk for support.

Placement of GPU-accelerated MPI tasks 

A Viper-GPU node has two distinct APU sockets with 24 CPU cores and one GPU, each. To avoid overhead due to inter-socket communication, MPI tasks (or processes) that use GPU acceleration must be placed physically (topologically) close to the actual GPU resource.

For example, a job that runs two GPU-enabled MPI tasks per node should be launched with the following Slurm parameters:

#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:2

This setup will place task 0 on core 0 of socket 0 (close to GPU 0), and task 1 on core 0 (24 in absolute numbering) of socket 1 (close to GPU 1).

Similarly, for a hybrid OpenMP/MPI application that uses GPUs and multiple threads, the Slurm parameters

#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:2

can be used to place task 0 on the 24 cores of socket 0 (close to GPU 0), and task 1 on the 24 cores of socket 1 (close to GPU 1).

Please be aware that each MPI process has access to all GPUs locally available to the Slurm job, i.e. the application needs to handle the setting of the correct GPU affinity by itself.

Moreover, please be aware that oversubscribing GPU resources (i.e. assigning a single GPU to multiple MPI tasks simultaneously) can significantly degrade performance on Viper-GPU. In general, one or at most two MPI tasks per GPU are recommended. This contrasts with NVIDIA GPUs (such as the A100 GPUs on Raven), where NVIDIA MPS is available to efficiently multiplex multiple tasks on a single GPU. It is recommended to perform performance measurements to identify the best performing configuration.

Software 

In the following, information relevant to target the accelerators of Viper-GPU is provided. Moreover, the hints to compile and optimize the code parts that run on the CPU cores given for Viper-CPU apply.

Compilers

Depending on the usage model of the GPUs/APUs, the compilation is done on different ways.

HIP/ROCm

Using this model allows you to choose from several host compilers: Gcc/gfortran, clang/flang, AMD or Intel compilers are possible. For your HIP (C++ Heterogeneous-Compute Interface for Portability) code, you have to use the hipcc compiler from the rocm module. We recommend to load the host compiler module first and then the rocm module. The architecture flags are set automatically for the MI300A on Viper-GPU (see below for more details and options). A typical compilation line reads

module load gcc/14 rocm/6.3
hipcc -x hip --offload-arch=gfx942 -c -o my_gpu_code.o my_gpu_code.cxx

If your HIP code also contains MPI calls, you have to add the include path of the MPI library:

module load gcc/14 rocm/6.3 openmpi/5.0
hipcc -x hip --offload-arch=gfx942 -c -o my_gpu_code.o -I${OPENMPI_HOME}/include my_gpu_code.cxx

You can build your CPU code with any of the aforementioned compilers, and finally link all object files with the hipcc command, where you then have to add the paths to the libraries needed (e.g. MPI, Fortran).

OpenMP TARGET directives

If your code contains OpenMP TARGET directives to target the GPU, you have to use the amdclang, amdclang++ or amdflang compiler from the amd-llvm module. It can be combined with hipcc from the rocm module, in this case you should load the rocm module first. The amd-llvm module always contains the latest versions of the compilers, which are currently under heavy development (especially the amdflang compiler). The rocm module contains the compilers bundled together with ROCm, which is updated less often. By first loading the rocm module and then the amd-llvm module, you can use the latest available LLVM compilers and ROCm version.

A typical compilation line for an OpenMP TARGET code reads:

module load amd-llvm/5.3
amdclang++ -fopenmp --offload-arch=gfx942 -c my_target_code.cxx -o my_target_code.o
amdflang -fopenmp --offload-arch=gfx942 -c my_target_fortran_code.F90 -o my_target_fortran_code.o

For linking, you have to add the libomptarget library:

module load amd-llvm/5.3
amdclang++ -fopenmp --offload-arch=gfx942 -o executable my_target_code.o -L${AMDLLVM_HOME}/lib/llvm/lib -lomptarget

Note that the amd-llvm module contains upstream LLVM with AMD GPU-specific additions (HIP compiler, OpenMP and debugging improvements etc.). This is not the same compiler as AMD Optimizing C/C++ and Fortran Compilers (AOCC), which adds AMD EPYC CPU-specific optimizations.

No defaults for the `--offload-arch` flag for MI300A

Please be aware that no --offload-arch defaults are set for the HIP and AMD LLVM compilers. Previously set defaults had to be abandoned entirely due to several side effects. Please ensure that you specify at least --offload-arch=gfx942 to generate proper code for the MI300A. Note that unified shared memory requires in addition setting the environment variable HSA_XNACK=1 at runtime.

Math libraries

ROCm libraries

Similarly to CUDA, there are vendor-optimized ROCm (Radeon Open Compute Platform) implementations of commonly used math operations provided. After loading the rocm module, the headers are available in ${ROCM_HOME}/include and the libraries can be used from ${ROCM_HOME}/lib. Be aware that the directory structure of ROCm has changed in the releases starting with major version 6. They are now organized in subdirectories rocblas, rocfft, rocrand, … The same holds for HIP, e.g., in your code you have to use the headers from e.g. #include <hipblas/hipblas.h> or #include <rocfft/rocfft.h>. Most of the HIP libraries are drop-in replacements of the respective CUDA libraries:

HIP	CUDA	ROCm
hipblas	cublas	rocblas
hipfft	cufft	rocfft
hipsparse	cusparse	rocsparse
hiprand	curand	rocrand
hipsolver	cusolver	rocsolver

Third-party libraries

ginkgo (sparse linear solvers)
magma (dense linear algebra: BLAS, LAPACK)
heFFTe (distributed FFTs)

Performance profilers

GPU kernel performance

AMD’s ROCm Compute Profiler rocprof-compute, previously called Omniperf, is part of the rocm module starting from the ROCm 6.3.0 release
ROCProfiler (command rocprof, is part of the ROCm installation in the rocm module)

Overall application performance

AMD’s ROCm Systems Profiler rocprof-sys-*, previously called Omnitrace, is part of the rocm module starting from the ROCm 6.3.0 release
Linaro MAP (load linaro_ddt module)