Compilers and languages

Contents

Intel Compilers
GNU Compiler Collection
GPU Programming
NAG Fortran compiler
Python

Intel Compilers 

Intel C/C++ Compiler for Linux

Usage

The C compiler executable is named icc; the C++ compiler executable is named icpc.

Compilation and linking of a C program (source file myprog.c) is done as follows: icc -o myprog myprog.c

To get an overview of the available command-line options, use the command icc --help.

More information is provided by the manual page man icc.

Extensive documentation, including information on code optimization strategies, is provided by the official Intel C/C++ Compiler Documentation. For compatibility notes, updates, and changes between versions, see the Intel Composer release notes.

To compile and link MPI codes, use the wrappers mpiicc and mpiicpc, respectively.

Compiling and linking against a more recent C++ standard library

The C++ standard library headers and shared objects installed in the default system folders are relatively dated. This can cause errors such as

<<error: namespace "std" has no member named …>>
GLIBC_2.33 not found at runtime

when compiling and linking.

The recommended procedure to avoid these errors is the following:

Clean the currently loaded environment modules with module purge.
Load the compiler module you want to use and all dependent modules you need.
Export the environment variables CC and CXX with the compiler you want to use.
Lastly, load a recent gcc version, e.g. module load gcc/<version>. Do not load any other modules afterwards.
Set LDFLAGS to point to the recent GCC version of the standard library, e.g. export LDFLAGS="$LDFLAGS -L${GCC_HOME}/lib64 -Wl,-rpath,${GCC_HOME}/lib64".
Configure and build your application. Be aware that CMake checks LDFLAGS at the first invocation only. Make sure to create a new build with CMake, if applicable.

Intel Fortran Compiler for Linux

Usage

The name of the Intel Fortran Compiler executable is ifort.

Compilation and linking of a Fortran program (source file myprog.f90) is done as follows: ifort -o myprog myprog.f90

To get an overview of the available command-line options, use the command ifort --help.

More information is provided by the manual page man ifort.

Extensive documentation, including information on code optimization strategies, is provided by the official Intel Fortran Compiler Documentation. For compatibility notes, updates, and changes between versions, see the Intel Composer release notes.

To compile and link MPI codes, use the wrapper mpiifort.

How to get access to the Intel Compilers

On the Raven and Viper supercomputers, users need to load a specific version of the Intel compiler explicitly; the same applies to Intel MPI. No default versions exist for the Intel compiler and MPI modules, and none are loaded at login.

To get a list of all available Intel compilers, enter module avail intel.

To get access to a specific Intel compiler, load the module by module load intel/<version>.

Intel Compiler for Linux Optimization Flags

Compiler optimization flags

Compiler optimization flags have a strong influence on the performance of the executable. Some important flags are given below. First, different optimization levels are available:

-O2: Standard optimization. Default.
-O3: Aggressive optimization. Use it with care and check the results against a less optimized binary.
-O0: Disables all optimization. Useful for fast compilation and to check if unexpected behavior results from a higher compiler optimization level.
-O1: Very conservative optimization.

In addition, vectorization is key to achieving good floating point performance on modern CPUs. For detailed information on how to specify the instruction set level during compilation, please consult the Intel Compiler Documentation. In particular, the switches -x, -ax, and -m are relevant, for example:

-xCORE-AVX512 -qopt-zmm-usage=high: Enable AVX512 vectorization for Intel Skylake, CascadeLake, IceLake processors. These flags are recommended on Raven.
-xCORE-AVX2: Enable AVX2 vectorization for Intel Haswell and Broadwell CPUs.
-ipo: Enable interprocedural optimizations beyond individual source files.

The meta switch -fast is not supported on MPCDF systems because it forces the static linking of all libraries (i.e. it implies the switch -static) which is not possible with certain system libraries.

[*] To obtain information about the features of the Linux host CPU issue the command cat /proc/cpuinfo | grep flags | head -1. Instruction-set-related keywords are, among others, avx, avx2, avx512.
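The same check can be scripted. The following is a minimal sketch, assuming a Linux host whose /proc/cpuinfo contains a line starting with "flags" (as on x86 CPUs); the function names simd_flags and read_simd_flags are hypothetical helpers introduced only for this illustration.

```python
def simd_flags(cpuinfo_text):
    """Return the sorted SSE/AVX-related entries of the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The line has the form "flags<tabs>: fpu vme ... avx avx2 ..."
            flags = line.split(":", 1)[1].split()
            return sorted(f for f in flags if f.startswith(("sse", "avx")))
    return []


def read_simd_flags(path="/proc/cpuinfo"):
    """Read the given cpuinfo file and extract its SIMD-related flags."""
    with open(path) as fh:
        return simd_flags(fh.read())
```

Calling read_simd_flags() on a compute node returns a list of keywords such as avx, avx2, or avx512f if the CPU supports them, which can help when choosing between the -x switches listed above.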

The list of all supported switches and extensive information is covered by the official Intel Compiler Documentation.

Floating point accuracy

Intel compilers tend to adopt increasingly aggressive defaults for the optimization of floating-point semantics. The default is -fp-model fast=1. We recommend double-checking the accuracy of simulation results by using more conservative settings (which might come at the expense of computational performance) like -fp-model precise (recommended) or even -fp-model strict.

See the compiler man pages for more details.
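One reason why relaxed floating-point semantics can change results is that floating-point addition is not associative, so any reordering of a sum may alter the rounding. The following sketch illustrates this effect in plain Python with IEEE double precision; it does not reproduce what a particular compiler does, and the chosen values are purely illustrative.

```python
import math

# Four values whose exact sum is 2.0.
values = [1.0e16, 1.0, -1.0e16, 1.0]

# Strict left-to-right evaluation: the first 1.0 is lost when added to 1.0e16.
left_to_right = ((values[0] + values[1]) + values[2]) + values[3]

# A reordered evaluation that cancels the large terms first.
reordered = (values[0] + values[2]) + (values[1] + values[3])

# Correctly rounded reference sum from the standard library.
reference = math.fsum(values)

print(left_to_right, reordered, reference)
```

The two evaluation orders give different results (1.0 versus 2.0), although they are mathematically equivalent. Comparing a build with default settings against one using -fp-model precise is a way to detect whether a code is sensitive to such reordering.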

GNU Compiler Collection 

The GNU Compiler Collection provides – among others – front ends for C, C++, and Fortran.

A default version of GCC comes with the operating system. More recent versions suitable for HPC can be accessed via environment modules.

To compile and link MPI codes using the GNU compilers, use the commands mpigcc, mpig++, or mpigfortran in combination with Intel MPI.

Find the full documentation at https://gcc.gnu.org/.

GPU Programming 

The following packages are provided on the HPC clusters to enable users to develop applications for NVIDIA GPUs.

NVIDIA CUDA Toolkit

The NVIDIA CUDA Toolkit provides a development environment for the programming of NVIDIA GPUs. It includes the CUDA C++ compiler (nvcc), optimized libraries, debuggers and profilers, among others. Issue module avail cuda to get an up-to-date list of the CUDA versions available on a system.

NVIDIA HPC SDK

The NVIDIA HPC SDK provides a C, C++, and Fortran compiler for the programming of NVIDIA GPUs and multi-core CPUs. It is the successor product of the PGI compiler suite. In addition, the NVIDIA HPC SDK comprises a copy of the CUDA toolkit and various libraries for numerical computation, deep learning and AI, and communication. Issue module avail nvhpcsdk to get an up-to-date list of the versions available on a system.

Kokkos C++ Performance Portability Library

The Kokkos performance portability framework enables the development of applications that achieve consistently good performance across all relevant modern HPC platforms based on a single-source C++ implementation. It provides abstractions for parallel computation and data management, and supports several backends such as OpenMP and CUDA, among others.

Python

The high-level Python programming language can be extended with modules written in plain C++/CUDA to leverage GPU computing. In this case, the interfaces may be created with Cython or pybind11 in a comparably easy way. Alternatively, the PyCUDA module offers a straightforward way to embed CUDA code into Python modules. Codes that make heavy use of NumPy may compile costly NumPy expressions to CPU or GPU machine code using the Numba package. Note that Numba is non-intrusive as it only uses decorators. It is part of the Anaconda Python distribution.

NAG Fortran compiler 

Usually we have the latest NAG Fortran compiler installed. To see all available NAG compilers on UNIX, enter module avail nagf95. To use a specific NAG compiler, load the module by module load nagf95/<$version>. The compiler command is nagfor.

To use the compiler on Windows, follow the instructions given in

/afs/ipp-garching.mpg.de/common/soft/nag_f95/<$version>/windows/readme.txt

for versions equal to or later than rel5.3. Access requires a valid AFS token for the cell ipp-garching.mpg.de.

More information about the NAG Fortran Compiler can be found in the official NAG Fortran Compiler Documentation.

Python 

At the MPCDF, Python, including a plethora of scientific packages for numerical computing and data science (NumPy, SciPy, matplotlib, Cython, Numba, Pandas, etc.), used to be provided in an up-to-date fashion via the Anaconda Python Distribution.

Starting in 2024, a new Python basis is deployed, based on free software sources. Run the commands

module avail python-waterboa
module help python-waterboa

to get information on what is available. The versioning scheme is similar to that of Anaconda.

A list of the installed legacy Anaconda releases can be obtained via the following command:

module avail anaconda

Please note that new versions of Anaconda Python cannot be provided any more due to licensing restrictions.

Python for HPC

Being an interpreted and dynamically typed language, plain Python is not per se suited to achieving high performance. Nevertheless, with the appropriate packages, tools, and techniques, the Python programming language can be used to perform numerical computation in a very efficient manner, covering both the program’s efficiency and the programmer’s efficiency. The aim of this article is to provide some advice and orientation to the reader in order to use Python correctly on the HPC systems and to take first steps towards basic Python code optimization.

Performance

The key to achieving good performance with Python is to move notably expensive computation from the interpreted code layer down to a compiled layer which may consist of compiled libraries, code written and compiled by the user, or just-in-time compiled code. Below, three packages are discussed for these use cases.

NumPy

NumPy is the Python module that provides arrays of native datatypes (float32, float64, int64, etc.) and mathematical operations and functions on them. Typically, mathematical equations (in particular, vector and matrix arithmetic) can be written with NumPy expressions in a very readable and elegant way, which brings several advantages: NumPy expressions avoid explicit, slow loops in Python. In addition, NumPy uses compiled code and optimized mathematical libraries internally, e.g. Intel MKL on MPCDF systems, which enables vectorization and other optimizations. Parts of these libraries use thread-parallelization in a very efficient way by default, e.g. to perform matrix multiplications. In summary, NumPy provides the de-facto standard for numerical array-based computations and serves as the basis for a multitude of additional packages.
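To illustrate the difference between an explicit Python loop and a NumPy expression, the following minimal sketch computes a*x + y in both ways. The function names axpy_loop and axpy_numpy are hypothetical and chosen only for this example.

```python
import numpy as np


def axpy_loop(a, x, y):
    """Compute a*x + y element by element in interpreted Python (slow)."""
    result = np.empty_like(x)
    for i in range(x.shape[0]):
        result[i] = a * x[i] + y[i]
    return result


def axpy_numpy(a, x, y):
    """Compute a*x + y as one NumPy expression evaluated in compiled code."""
    return a * x + y


x = np.linspace(0.0, 1.0, 1000)
y = np.ones_like(x)
loop_result = axpy_loop(2.0, x, y)
numpy_result = axpy_numpy(2.0, x, y)
```

Both functions return the same values. Timing them, for example with the timeit module on larger arrays, typically shows the NumPy expression to be considerably faster, because the loop over the elements runs in compiled code.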

Cython

Cython is a Python language extension that makes it relatively easy to create compiled Python modules written in Cython, C or C++. It integrates well with NumPy arrays and can be used to implement time-critical parts of an algorithm. Moreover, Cython is very useful to create interfaces to C or C++ code, such as legacy libraries or native CUDA code. Technically, the Cython source code is translated by the Cython compiler to intermediate C code which is then compiled to machine code by a regular C compiler like GCC or ICC.

Numba

Numba is a just-in-time compiler based on the LLVM framework. It compiles Python functions at runtime for the datatypes these functions are being called with. Moreover, Numba implements a subset of NumPy’s functions, i.e. it is able to compile NumPy expressions. Functions are declared suitable for JIT compilation via a simple decorator syntax; hence, Numba is only minimally intrusive on existing code bases.
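The decorator approach might look as follows. This is a minimal sketch, assuming the numba package is installed; the fallback decorator for a missing installation and the function name sum_of_squares are assumptions made only for this illustration.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so that the example still runs (uncompiled) without Numba.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    """Explicit loop that Numba compiles on first call for the argument type."""
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


result = sum_of_squares(np.arange(4.0))
```

The first call triggers the compilation for a one-dimensional float64 array. Subsequent calls with the same argument types reuse the compiled machine code.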

Parallelization

While Python does implement threads as part of the standard library, these cannot be used to accelerate computation on more than one core in parallel due to CPython’s global interpreter lock. Nevertheless, Python is suitable for parallel computation. In the following, two important packages for intra-node and inter-node parallelism are addressed.

multiprocessing

The multiprocessing package is part of the Python standard library. It implements building blocks such as pools of workers and communication queues that can be used to parallelize data-parallel workloads. Technically, multiprocessing forks subprocesses from the main Python process that can run in parallel on multiple cores of a shared-memory machine. Note that some overhead is associated with the inter-process communication. It is, however, possible to access shared memory from several processes simultaneously. A typical use case would be large NumPy arrays.
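A pool of workers for a data-parallel workload could be set up as in the following minimal sketch. The helper names split_range, partial_sum_of_squares, and parallel_sum_of_squares are hypothetical, and the default of four workers is an arbitrary assumption.

```python
from multiprocessing import Pool


def partial_sum_of_squares(bounds):
    """Work item: sum i*i over the half-open interval [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def split_range(n, n_chunks):
    """Split range(n) into n_chunks contiguous (start, stop) pairs."""
    step, rest = divmod(n, n_chunks)
    bounds, start = [], 0
    for k in range(n_chunks):
        # The first 'rest' chunks get one extra element each.
        stop = start + step + (1 if k < rest else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def parallel_sum_of_squares(n, n_workers=4):
    """Distribute the chunks over a pool of worker processes and reduce."""
    with Pool(processes=n_workers) as pool:
        return sum(pool.map(partial_sum_of_squares, split_range(n, n_workers)))
```

In a script, parallel_sum_of_squares should be called from within an if __name__ == "__main__": block, so that the worker processes do not execute the call again when they import the main module. In a batch job, the number of workers would typically be matched to the number of cores allocated to the job.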

mpi4py

Access to the Message Passing Interface (MPI) is available via the module mpi4py. It enables parallel computation on distributed-memory computers where the processes communicate via messages with each other. In particular, the mpi4py package supports the communication of NumPy arrays without additional overhead. On MPCDF systems, the environment module mpi4py provides an optimized build based on the default Intel MPI library.

IO

NumPy implements efficient binary IO for array data that is useful, e.g., for temporary files. A better choice with respect to portability and long-term compatibility is the HDF5 file format. HDF5 is accessible via the h5py Python package and offers an easy-to-use dictionary-style interface. For parallel codes, a special build of h5py with support for MPI-parallel IO is provided via the environment module h5py-mpi.
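The NumPy binary IO for temporary data can be sketched as follows, using np.save and np.load on a file in a temporary directory. The function name roundtrip and the file name scratch.npy are chosen only for this illustration; HDF5 output via h5py is not shown here.

```python
import os
import tempfile

import numpy as np


def roundtrip(array):
    """Write an array to a temporary .npy file and read it back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "scratch.npy")
        np.save(path, array)   # binary format that stores dtype and shape
        return np.load(path)   # fully loaded before the directory is removed


data = np.arange(12, dtype=np.float64).reshape(3, 4)
restored = roundtrip(data)
```

The restored array has the same shape, datatype, and values as the original, because the .npy format stores this metadata together with the raw data.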

The Python software ecosystem

In addition to the packages discussed up to now, there is a plethora of solid and well-proven packages for scientific computation and data science available, covering, e.g., numerical libraries (SciPy), visualization (matplotlib, seaborn), data analysis (pandas), and machine learning (TensorFlow, PyTorch), to name only a few.

Software installation

Often, users need to install special Python packages for their scientific domain. In most cases, the easiest and quickest way is to create an installation local to the user’s home directory. After loading the Anaconda environment module, the command pip install --user PACKAGE_NAME downloads and installs a package from the Python package index (PyPI); similarly, the command python setup.py install --user installs a package from an unpacked source tarball. In both cases, the resulting installation is located below “~/.local” where Python will find it by default.
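To verify where such user-local installations end up for the currently loaded Python interpreter, the standard library module site can be queried. The following is a minimal sketch; the function name user_install_locations is hypothetical.

```python
import site
import sys


def user_install_locations():
    """Collect the per-user installation paths of the running interpreter."""
    user_site = site.getusersitepackages()
    return {
        "user_base": site.getuserbase(),         # typically ~/.local on Linux
        "user_site": user_site,                  # target of pip install --user
        "enabled": bool(site.ENABLE_USER_SITE),  # False e.g. in virtual environments
        "on_sys_path": user_site in sys.path,    # whether imports will find it
    }


info = user_install_locations()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the entry on_sys_path is False, the user site directory is not searched by the interpreter, which can happen, for example, when the directory does not exist yet or when a virtual environment is active.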

Summary

The software recommended in this article is available via the Anaconda Python Distribution (environment module “anaconda/3”) on MPCDF systems. Note that for some packages (mpi4py, h5py-mpi), the hierarchical environment modules matter, i.e., it is necessary to load a compiler (gcc, intel) and an MPI module (impi) in addition to Anaconda in order to get access to these dependent environment modules.

The application group at the MPCDF has developed an in-depth course on “Python for HPC” which covers all the topics touched on in this article in more detail over two days. It is taught once or twice per year and announced via the MPCDF web page.

Finally, it should be pointed out that Python 2 reached its official end-of-life on January 1, 2020. Consequently, new Python modules and updates to existing ones no longer take Python 2 compatibility into account. Users still running legacy code are strongly encouraged to migrate to Python 3.
