Viper User Guide
Important
Viper is currently operated in user-risk mode, which means that jobs may be canceled at any time if urgent maintenance work has to be done.
System overview
The HPC system Viper has been operational since June 2024 and comprises 768 compute nodes with AMD EPYC Genoa 9554 CPUs, with 128 cores and at least 512 GB RAM per node. A subset of 609 nodes is equipped with 512 GB RAM (16 memory channels), 90 nodes with 768 GB RAM (24 memory channels), 66 nodes with 1024 GB RAM (16 memory channels), and 3 nodes with 2304 GB RAM (24 memory channels). In addition, Viper will provide 228 GPU compute nodes, each with 2 AMD Instinct MI300A APUs and 256 GB of high-bandwidth memory (HBM3). The nodes are interconnected by an NVIDIA/Mellanox NDR InfiniBand network using a fat-tree topology with two non-blocking islands, one for the CPU nodes (NDR200, 200 Gb/s) and one for the GPU nodes (NDR, 400 Gb/s).
Moreover, there are 6 login nodes, and I/O subsystems that serve approx. 20 PB of disk-based storage with direct HSM access, plus approx. 500 TB of NVMe-based storage.
Summary: 768 CPU compute nodes, 98304 CPU cores, 432 TB RAM (DDR5), 4.9 PFlop/s theoretical peak performance (FP64), 228 GPU nodes comprising 456 APUs (to be deployed in the course of 2024).
Access
Login
For security reasons, direct login to the HPC system Viper is allowed only from within some MPG networks. Users from other locations have to log in to one of our gateway systems first:
ssh <user>@gate.mpcdf.mpg.de
Use ssh to connect to Viper:
ssh <user>@viper.mpcdf.mpg.de
You will be directed to one of the Viper login nodes (viper01i, viper02i). You have to provide your (Kerberos) password and an OTP on the Viper login nodes. SSH keys are not allowed.
Secure copy (scp) can be used to transfer data to or from viper.mpcdf.mpg.de.
Viper’s (all login/interactive nodes) SSH key fingerprints (SHA256) are:
SHA256:0DhkRt6Qom1GA0SmvEjnlKWpIg1+kMPDjOUSqJ8ceyQ (ED25519)
SHA256:pM1fS+4YXGuIaLk0bLSt1sOlS1TpLrZYmFMRY9UjBKo (RSA)
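If you connect from outside the MPG networks frequently, the two-step login can be combined into a single command using OpenSSH's ProxyJump option (-J); this is a convenience sketch, not an official recommendation:
# Jump through the gateway and log in to Viper in one step (requires OpenSSH >= 7.3 on your client)
ssh -J <user>@gate.mpcdf.mpg.de <user>@viper.mpcdf.mpg.de
You will then be asked for the gateway and Viper credentials in sequence.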
Resource limits
The login nodes viper.mpcdf.mpg.de (viper[01-02]i) are intended for editing, compiling, and submitting your parallel programs only. Running parallel programs interactively on the login nodes is NOT allowed. CPU resources are restricted to the equivalent of two physical CPU cores per user.
The login nodes viper-i.mpcdf.mpg.de (viper[03-04]i) are also primarily intended for editing, compiling, and submitting your parallel programs, but CPU resources are restricted only to the equivalent of six CPU cores per user.
Jobs have to be submitted to the Slurm batch system which reserves and allocates the resources (e.g. compute nodes) required for your job. Further information on the batch system is provided below.
Interactive (debug) runs
To test or debug your code, you may run it interactively by using the Slurm partition “interactive” (2 hours at most) with the command:
srun -n N_TASKS -p interactive --time=TIME_LESS_THAN_2HOURS --mem=MEMORY_LESS_THAN_32000M ./EXECUTABLE
It is not allowed to use more than 8 cores in total or to request more than 32 GB of main memory.
Internet access
Connections to the Internet are permitted only from the login nodes and only in the outgoing direction; Internet access from within batch jobs is not possible.
Data transfers
To download source code or other data, command line tools such as wget, curl, rsync, scp, pip, git, or similar may be used interactively on the login nodes. In case the transfer is expected to take a long time, it is recommended to run it as a Slurm job in the Slurm partition “datatransfer”. Datatransfer jobs can occupy up to 8 cores and can take up to 6 hours.
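For illustration, a long-running transfer could be wrapped into a batch script for the “datatransfer” partition roughly as follows (a sketch; the rsync source/target paths and the modest memory request are placeholders to be adapted):
#!/bin/bash -l
#SBATCH -J transfer
#SBATCH -o ./transfer.out.%j
#SBATCH -e ./transfer.err.%j
#SBATCH -D ./
#SBATCH -p datatransfer       # dedicated partition for data transfers
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1     # at most 8 cores may be used
#SBATCH --mem=2000M           # shared node, so specify a memory limit
#SBATCH --time=06:00:00       # at most 6 hours
# Replace source and destination with your own paths
rsync -av <remote_host>:<remote_path>/ /ptmp/$USER/<target_dir>/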
Hardware configuration
Compute nodes
CPU nodes
768 compute nodes, 1536 CPUs
Processor type: AMD EPYC Genoa 9554
Processor base frequency: 3.1 GHz
Cores per node: 128 (each with 2 “hyper-threads” (SMT), i.e. 256 logical CPUs per node)
Main memory (DDR5 RAM) per node: 512 GB (609 nodes), 768 GB (90 nodes), 1024 GB (66 nodes), 2304 GB (3 nodes)
Theoretical peak performance per node (FP64, “double precision”): 3.1 GHz * 16 FP64 Flop/cycle * 128 cores ≈ 6350 GFlop/s
Theoretical memory bandwidth per node: 920 GB/s (768 GB nodes, 2304 GB nodes), 610 GB/s (512 GB nodes, 1024 GB nodes)
2 x 8 NUMA domains per node (8 per socket), each with 8 physical cores
GPU nodes (preview, to be deployed in the course of 2024)
228 GPU nodes, 456 APUs
Processor type: AMD Instinct MI300A APU
Main memory (HBM3) per APU: 128 GB
Theoretical peak performance per APU (FP64, “double precision”): 61 TFlop/s
Theoretical memory bandwidth per APU: 5.3 TB/s
Login and interactive nodes
2 nodes for login and code compilation (DNS name viper.mpcdf.mpg.de)
2 nodes for interactive program development and testing (DNS name viper-i.mpcdf.mpg.de)
Processor type: AMD EPYC Genoa 9554
Cores per node: 128 physical cores (256 logical CPUs)
Main memory (RAM) per node: 512 GB (login and interactive nodes)
Interconnect
Viper uses a Mellanox InfiniBand NDR network with a non-blocking fat-tree topology for each of the CPU and the GPU partitions:
CPU nodes: 200 Gb/s (NDR200)
GPU nodes: 400 Gb/s (NDR)
I/O subsystem
Approx. 20 PB of online disk space are available.
Additional hardware details
Additional details on the Viper hardware are given on a separate documentation page.
File systems
GPFS
There are two global, parallel file systems of type GPFS (/u and /ptmp), symmetrically accessible from all Viper cluster nodes, plus the migrating file system /r interfacing to the HPSS archive system.
File system /u
The file system /u (a symbolic link to /viper/u) is designed for permanent user data (source files, config files, etc.). The size of /u is 1.2 PB. Note that no system backups are performed. Your home directory is in /u. The default disk quota in /u is 1.0 TB, the file quota is 256K files. You can check your disk quota in /u with the command:
/usr/lpp/mmfs/bin/mmlsquota viper_u
File system /ptmp
The file system /ptmp (a symbolic link to /viper/ptmp) is designed for batch job I/O (12 PB, no system backups). Files in /ptmp that have not been accessed for more than 12 weeks will be removed automatically. The period of 12 weeks may be reduced if necessary (with prior notice).
As a current policy, no quotas are applied on /ptmp. This gives users the freedom to manage their data according to their actual needs without administrative overhead. This liberal policy presumes fair usage of the common file space. So, please do regular housekeeping of your data and archive or remove files that are not actually used.
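To support this housekeeping, you can, for instance, list files in your /ptmp directory that have not been accessed for more than 12 weeks; a sketch, assuming your data resides under /ptmp/$USER:
# List files not accessed within the last 84 days (12 weeks)
find /ptmp/$USER -type f -atime +84 -ls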
Archiving data from the GPFS file systems to tape can be done using the migrating file system /r (see below).
File system /r
The /r file system (a symbolic link to /ghi/r) stages archive data. It is available only on the login nodes viper.mpcdf.mpg.de and on the interactive nodes viper-i.mpcdf.mpg.de.
Each user has a subdirectory /r/<initial>/<userid> to store data. For efficiency, files should be packed into tar files (with a size of about 1 GB to 1 TB) before archiving them in /r, i.e., please avoid archiving small files. When the file system /r fills above a certain level, files will be transferred from disk to tape, beginning with the largest files that have not been used for the longest time.
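For example, a results directory residing in /ptmp could be packed into a single tar file directly in the archive as follows (a sketch; the directory and file names are placeholders):
# Pack one directory into a single tar file of suitable size (about 1 GB to 1 TB)
tar cf /r/<initial>/<userid>/results_2024.tar -C /ptmp/<userid> results_2024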
For documentation on how to use the MPCDF archive system, please see the backup and archive section.
/tmp and node-local storage
Please don’t use the file system /tmp or $TMPDIR for scratch data. Instead, use /ptmp, which is accessible from all Viper cluster nodes. In cases where an application really depends on node-local storage, please use the directories from the environment variables JOB_TMPDIR and JOB_SHMTMPDIR, which are set individually for each Slurm job.
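Within a batch script, the node-local directory can then be used, for instance, as follows (a sketch; the program and file names are placeholders):
# Stage input to node-local storage, run with node-local I/O, and copy results back to /ptmp
cp input.dat ${JOB_TMPDIR}/
./myprog ${JOB_TMPDIR}/input.dat
cp ${JOB_TMPDIR}/output.dat /ptmp/$USER/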
Software
Access to software via environment modules
Environment modules are used at MPCDF to provide software packages and enable easy switching between different software versions.
Use the command
module avail
to list the available software packages on the HPC system. Note that you can search for a certain module by using the find-module tool (see below).
Use the command
module load package_name/version
to actually load a software package at a specific version.
Further information on the environment modules on Viper and their hierarchical organization is given below.
Information on the software packages provided by the MPCDF is available here.
Recommended compiler and MPI software stack on Viper
As explained below, no defaults are defined for the compiler and MPI modules, and no modules are loaded automatically at login.
We currently recommend using the following versions on Viper:
module load intel/2024.0 impi/2021.11
or
module load gcc/13 impi/2021.11
Specific optimizing compiler flags for the AMD EPYC CPU are given further below.
Hierarchical module environment
To manage the plethora of software packages resulting from all the relevant combinations of compilers and MPI libraries, we organize the environment module system for accessing these packages in a natural, hierarchical manner. Compilers (gcc, intel) are located on the uppermost level, dependent libraries (e.g., MPI) on the second level, and further dependent libraries on a third level. This means that not all modules are visible initially: only after loading a compiler module do the modules that depend on it become available. Similarly, loading an MPI module in addition makes the modules depending on the MPI library available.
No defaults are defined for the compiler and MPI modules, and no modules are loaded automatically at login. This forces users to specify explicit versions for those modules during compilation and in their batch scripts, which ensures that the same MPI library is used in both cases. It also means that users decide themselves when to switch to newer compiler and MPI versions for their code, which avoids the compatibility problems that arise when defaults are changed centrally.
For example, the FFTW library compiled with the Intel compiler and the Intel MPI library can be loaded as follows:
First, load the Intel compiler module using the command
module load intel/2024.0
second, the Intel MPI module with
module load impi/2021.11
and, finally, the FFTW module fitting exactly to the compiler and MPI library via
module load fftw-mpi
You may check by using the command
module avail
that after the first and second steps the dependent environment modules become visible, in the present example impi and fftw-mpi. Moreover, note that the environment modules can be loaded via a single ‘module load’ statement as long as the order given by the hierarchy is correct, e.g.,
module load intel/2024.0 impi/2021.11 fftw-mpi
It is important to point out that a large fraction of the available software is not affected by the hierarchy: certain HPC applications, tools such as git or cmake, mathematical software (maple, matlab, mathematica), and visualization software (visit, paraview, idl) are visible at the uppermost level of the hierarchy. Note that a hierarchy also exists for dependent Python modules, via the ‘anaconda’ module files on the top level.
Because of the hierarchy, some modules only appear after other modules (such as compiler and MPI) have been loaded. One can search all available combinations of a certain software (e.g. fftw-mpi) by using
find-module fftw-mpi
Further information on using environment modules is given here.
Slurm batch system
The batch system used on the HPC cluster Viper is the open-source workload manager Slurm (Simple Linux Utility for Resource management). To run test or production jobs, submit a job script (see below) to Slurm, which will allocate the resources required for your job (e.g. the compute nodes to run your job on).
By default, the job run limit on Viper is set to 8 and the job submit limit to 300. If your batch jobs cannot run independently of each other, please use job steps, as sketched below.
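A minimal sketch of such a script with sequential job steps (the executable names are placeholders):
#!/bin/bash -l
#SBATCH -J job_steps
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --time=04:00:00
module purge
module load intel/2024.0 impi/2021.11
# Each srun call is a separate job step; the steps run one after another within the same allocation
srun ./preprocess  > step1.out
srun ./solve       > step2.out
srun ./postprocess > step3.out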
There are mainly two types of batch jobs:
Exclusive, where all resources on the nodes are allocated to the job
Shared, where several jobs share the resources of one node. In this case it is necessary that the number of CPUs and the amount of memory are specified for each job.
The AMD processors on Viper support simultaneous multithreading (SMT) which potentially increases the performance of an application by up to 20%. To use SMT, you have to increase the product of the number of MPI tasks per node and the number of threads per MPI task from 128 to 256 in your job script. Please be aware that when doubling the number of MPI tasks per node each task only gets half of the memory compared to the non-SMT job. If you need more memory, you have to specify this in your job script (see the example batch scripts).
Overview of the available per-job resources on Viper:
Job type        Max. CPUs per node      GPUs per node   Max. memory per node [MB]   Number of nodes   Max. run time
=====================================================================================================================
shared cpu      64 / 128 in HT mode     --              250000                      < 1               24:00:00
---------------------------------------------------------------------------------------------------------------------
exclusive cpu   128 / 256 in HT mode    --              480000                      1-256             24:00:00
                                                        730000                      1-64              24:00:00
                                                        980000                      1-64              24:00:00
                                                        2250000                     1-3               24:00:00
---------------------------------------------------------------------------------------------------------------------
If an application needs more than 480000 MB per node, the required amount of memory has to be specified in the Slurm submit script, e.g. with the following options:
#SBATCH --mem=730000 # to request up to 730000 MB
or
#SBATCH --mem=2250000 # to request up to 2250000 MB
A job submit filter will automatically choose the right partition and job parameters from the resource specification.
Interactive testing and debugging is possible by using the command:
srun -n N_TASKS -p interactive --time=TIME_LESS_THAN_2HOURS --mem=MEMORY_LESS_THAN_32000M ./EXECUTABLE
Interactive jobs are limited to 8 cores, 32000M memory and 2 hours runtime.
For detailed information about the Slurm batch system, please see Slurm Workload Manager.
The most important Slurm commands are:
sbatch <job_script_name> - Submit a job script for execution
squeue - Check the status of your job(s)
scancel <job_id> - Cancel a job
sinfo - List the available batch queues (partitions)
Do not run Slurm client commands from loops in shell scripts or other programs. Ensure that programs limit calls to these commands to the minimum necessary for the information you are trying to gather.
Sample Batch job scripts can be found below.
Notes on job scripts:
The directive
#SBATCH --nodes=<number of nodes>
in your job script specifies the number of compute nodes that your program will use.
The directive
#SBATCH --ntasks-per-node=<number of cpus>
specifies the number of MPI processes for the job. The parameter ntasks-per-node cannot be greater than 128 because one compute node on Viper has 128 physical cores (with 2 threads each and thus 256 logical CPUs in SMT mode).
The directive
#SBATCH --cpus-per-task=<number of OMP threads per MPI task>
specifies the number of threads per MPI process if you are using OpenMP.
The expression
ntasks-per-node * cpus-per-task
may not exceed 256.
The expression
nodes * ntasks-per-node * cpus-per-task
gives the total number of CPUs that your job will use.
Jobs that need less than half a compute node have to specify a reasonable memory limit so that they can share a node!
A job submit filter will automatically choose the right partition/queue from the resource specification.
Slurm example batch scripts
MPI and MPI/OpenMP batch scripts
MPI batch job
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=128
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00
# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/2024.0 impi/2021.11
# Run the program:
srun ./myprog > prog.out
Hybrid MPI/OpenMP batch job
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job_hybrid.out.%j
#SBATCH -e ./job_hybrid.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=8
# for OpenMP:
#SBATCH --cpus-per-task=16
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00
# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/2024.0 impi/2021.11
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# For pinning threads correctly:
export OMP_PLACES=cores
# Run the program:
srun ./myprog > prog.out
Hybrid MPI/OpenMP batch job in simultaneous multithreading (SMT) mode
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job_hybrid.out.%j
#SBATCH -e ./job_hybrid.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
# Enable SMT:
#SBATCH --ntasks-per-core=2
# for OpenMP:
#SBATCH --cpus-per-task=32
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock Limit (max. is 24 hours):
#SBATCH --time=12:00:00
# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/2024.0 impi/2021.11
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# For pinning threads correctly:
export OMP_PLACES=threads
# Run the program:
srun ./myprog > prog.out
Batch jobs with dependencies
The following script generates a sequence of jobs, each running the given job script. The start of each individual job depends on its dependency; possible values for the --dependency flag are, e.g.:
afterany:job_id - This job starts after the previous job has terminated
afterok:job_id - This job starts after the previous job has completed successfully
#!/bin/bash
# Submit a sequence of batch jobs with dependencies
#
# Number of jobs to submit:
NR_OF_JOBS=6
# Batch job script:
JOB_SCRIPT=./my_batch_script
echo "Submitting job chain of ${NR_OF_JOBS} jobs for batch script ${JOB_SCRIPT}:"
JOBID=$(sbatch ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
echo " " ${JOBID}
I=1
while [ ${I} -lt ${NR_OF_JOBS} ]; do
JOBID=$(sbatch --dependency=afterany:${JOBID} ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
echo " " ${JOBID}
let I=${I}+1
done
Batch job using a job array
#!/bin/bash -l
# Specify the indexes of the job array elements (maximum index: 30000; at most 300 elements - the default job submit limit per user)
#SBATCH --array=1-20
# Standard output and error:
#SBATCH -o job_%A_%a.out # Standard output, %A = job ID, %a = job array index
#SBATCH -e job_%A_%a.err # Standard error, %A = job ID, %a = job array index
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_array
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00
# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/2024.0 impi/2021.11
# Run the program:
# the environment variable $SLURM_ARRAY_TASK_ID holds the index of the job array and
# can be used to discriminate between individual elements of the job array
srun ./myprog > prog.out
Single-node example job scripts for sequential programs, plain-OpenMP cases, Python, Julia, Matlab
In the following, example job scripts are given for jobs that use at maximum one full node. Use cases are sequential programs, threaded programs using OpenMP or similar models, and programs written in languages such as Python, Julia, Matlab, etc.
The Python example programs referred to below are available for download.
Single-core job
#!/bin/bash -l
#
# Single-core example job script for MPCDF Viper.
# In addition to the Python example shown here, the script
# is valid for any single-threaded program, including
# sequential Matlab, Mathematica, Julia, and similar cases.
#
#SBATCH -J PYTHON_SEQ
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH --ntasks=1 # launch job on a single core
#SBATCH --cpus-per-task=1 # on a shared node
#SBATCH --mem=2000MB # memory limit for the job
#SBATCH --time=0:10:00
module purge
module load gcc/13 impi/2021.11
module load anaconda/3/2023.03
# Set number of OMP threads to fit the number of available cpus, if applicable.
export OMP_NUM_THREADS=1
# Run single-core program
srun python3 ./python_sequential.py
Small job with multithreading, applicable to Python, Julia and Matlab, plain OpenMP, or any threaded application
#!/bin/bash -l
#
# Multithreading example job script for MPCDF Viper.
# In addition to the Python example shown here, the script
# is valid for any multi-threaded program, including
# Matlab, Mathematica, Julia, and similar cases.
#
#SBATCH -J PYTHON_MT
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH --ntasks=1 # launch job on
#SBATCH --cpus-per-task=8 # 8 cores on a shared node
#SBATCH --mem=16000MB # memory limit for the job
#SBATCH --time=0:10:00
module purge
module load gcc/13 impi/2021.11
module load anaconda/3/2023.03
# Set number of OMP threads to fit the number of available cpus, if applicable.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun python3 ./python_multithreading.py
Python/NumPy multithreading, applicable to Julia and Matlab, plain-OpenMP, or any threaded application
#!/bin/bash -l
#
# Multithreading example job script for MPCDF Viper.
# In addition to the Python example shown here, the script
# is valid for any multi-threaded program, including
# parallel Matlab, Julia, and similar cases.
#
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH -J PY_MULTITHREADING
#SBATCH --nodes=1 # request a full node
#SBATCH --ntasks-per-node=1 # only start 1 task via srun because the program spawns its threads internally
#SBATCH --cpus-per-task=128 # assign all the cores to that first task to make room for multithreading
#SBATCH --time=00:10:00
module purge
module load gcc/13 impi/2021.11
module load anaconda/3/2023.03
# set number of OMP threads *per process*
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun python3 ./python_multithreading.py
Python multiprocessing
#!/bin/bash -l
#
# Python multiprocessing example job script for MPCDF Viper.
#
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH -J PYTHON_MP
#SBATCH --nodes=1 # request a full node
#SBATCH --ntasks-per-node=1 # only start 1 task via srun because Python multiprocessing starts more tasks internally
#SBATCH --cpus-per-task=128 # assign all the cores to that first task to make room for Python's multiprocessing tasks
#SBATCH --time=00:10:00
module purge
module load gcc/13 impi/2021.11
module load anaconda/3/2023.03
# Important:
# Set the number of OMP threads *per process* to avoid overloading of the node!
export OMP_NUM_THREADS=1
# Use the environment variable SLURM_CPUS_PER_TASK to have multiprocessing
# spawn exactly as many processes as you have CPUs available.
srun python3 ./python_multiprocessing.py $SLURM_CPUS_PER_TASK
Python mpi4py
#!/bin/bash -l
#
# Python MPI4PY example job script for MPCDF Viper.
# May use more than one node.
#
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH -J MPI4PY
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:10:00
module purge
module load gcc/13 impi/2021.11
module load anaconda/3/2023.03
module load mpi4py/3.1.4
# Important:
# Set the number of OMP threads *per process* to avoid overloading of the node!
export OMP_NUM_THREADS=1
srun python3 ./python_mpi4py.py
Migration guide for users coming from Intel-based HPC systems
Application performance
On Viper, sufficiently optimized (SIMD, parallel scalability) application codes can expect a performance boost of at least a factor of two per node in direct comparison to a Raven node. Users who experience comparatively poor performance or need help with porting are encouraged to contact the MPCDF helpdesk for support. Information on important hardware details and software recommendations, including optimizing compiler flags, is given below.
Hardware
Compared to Raven, the relevant differences concern, e.g., per-core performance, NUMA structure and sub-structure, memory bandwidth, and network bandwidth.
Software
Compilers
Both Intel and GNU compilers for Fortran, C, and C++ are known to generate well-optimized code for the AMD EPYC (zen4) CPU and are therefore recommended for use on Viper.
The AMD EPYC (zen4) CPU has AVX2 (256 bit) execution units but can decode AVX512 instructions. Depending on the application, using AVX512 may or may not produce faster code (by up to approx. 15%).
For Intel compilers (ifx, icx, icpx) the relevant microarchitecture-specific flags are -march=znver4 (targeting AMD EPYC (zen4) CPUs including AVX512 vectorization, newly introduced with Intel oneAPI version 2024), and -march=skylake-avx512 (AVX512) or -march=core-avx2 (AVX2). The latter two flags are also supported and recommended for the legacy Intel Fortran compiler ifort that is still part of Intel oneAPI 2024. Users are advised to diagnose the degree of SIMD vectorization by employing the compiler option -qopt-report=3.
In general, please note that none of the Intel compiler flags starting with -x or -ax can be used on AMD CPUs because these flags enable code generation that exclusively targets Intel CPUs. In particular, using the well-known option -xHost to automatically select the target architecture of the compilation host does not work on AMD CPUs.
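As an illustration, a Fortran source file could be compiled for the zen4 target with the current Intel compiler roughly as follows (a sketch; the source and program names are placeholders):
module load intel/2024.0
# -march=znver4 targets the AMD EPYC (zen4) CPU including AVX512; -qopt-report=3 reports on SIMD vectorization
ifx -O3 -march=znver4 -qopt-report=3 -o myprog myprog.f90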
For GNU compilers (gfortran, gcc, g++) the relevant architecture-specific flags are -march=znver4 (AVX512) or -march=znver4 -mprefer-vector-width=256 (AVX2). Users are advised to diagnose the degree of SIMD vectorization by using the compiler option -fopt-info. Vectorization of non-trivial SIMD loops may require the flags -O3 -ffast-math in addition.
The AMD Optimizing C/C++ and Fortran Compilers (AOCC) are provided with limited support and for experimental purposes, to be used standalone or together with the AMD Optimizing CPU Libraries (AOCL). More details about the features, the limitations, and known issues can be found on AMD’s website for AOCC and AOCL.
Math libraries
The Intel Math Kernel Library (oneMKL) is the recommended mathematical library on Viper. It provides, among others, highly tuned implementations of linear algebra operations and Fast Fourier Transforms (FFTs), which are exposed via the standard BLAS, LAPACK, and FFTW interfaces for C/C++ and Fortran.
In addition, a broad selection of numerical libraries such as FFTW, PETSc, and SLEPc is available.
The AMD Optimizing CPU Libraries (AOCL) are provided for GCC and AMD (AOCC) compilers, with limited support and for experimental purposes. Users may evaluate them as an alternative to MKL, but careful performance benchmarking is needed.
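For instance, a code calling BLAS/LAPACK routines could be linked against oneMKL with the Intel compiler roughly as follows (a sketch; the mkl module name and the file names are assumptions to be adapted to the actual installation):
module load intel/2024.0 mkl
# -qmkl links the code against the oneMKL BLAS/LAPACK/FFTW implementations
ifx -O3 -march=znver4 -qmkl -o myprog myprog.f90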
Performance tools
Intel VTUNE can be used on Viper to identify hotspots, call stacks, and measure multithreading characteristics of an application. However, certain hardware-specific analysis types are not available on AMD CPUs. In addition, installations of AMD uProf are available, covering similar use cases.
For low-level performance measurements, perf and the likwid toolsuite are provided.
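As a quick start, basic hardware counter statistics for a program run can be collected, e.g., with (a sketch; myprog is a placeholder):
# Print elapsed cycles, instructions, and related counters for the run
perf stat ./myprog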
To perform lightweight profiling of MPI applications, Intel APS (for Intel MPI only, comes bundled with VTUNE) or mpitrace can be used.
The MPCDF performance monitoring system is currently being prepared for Viper.