Compilers and languages

Intel Compilers

Intel C/C++ Compiler for Linux

Usage

The name of the C compiler executable is icc, the name of the C++ compiler executable is icpc.

Compilation and linking of a C program (source file myprog.c) is done as follows: icc -o myprog myprog.c

To get an overview of the available command-line options, use the command icc --help.

More information is provided by the manual page man icc.

Extensive documentation, including information on code optimization strategies, is provided by the official Intel C/C++ Compiler Documentation. For information on compatibility, updates, and changes between versions, see the Intel Composer release notes.

To compile and link MPI codes, use the wrappers mpiicc and mpiicpc, respectively.

C++ header files and standard library

While the Intel C/C++ compiler does support modern C++ features (C++11, C++14, and newer), the header files and standard library (‘libstdc++’) that come with the operating system may not support such features yet. To access more recent header files and, similarly, a more recent standard library required for modern C++ codes, simply load a recent GCC environment module (‘module load gcc’) before loading the Intel compiler module.

Intel Fortran Compiler for Linux

Usage

The name of the Intel Fortran Compiler executable is ifort.

Compilation and linking of a Fortran program (source file myprog.f90) is done as follows: ifort -o myprog myprog.f90

To get an overview of the available command-line options, use the command ifort --help.

More information is provided by the manual page man ifort.

Extensive documentation, including information on code optimization strategies, is provided by the official Intel Fortran Compiler Documentation. For information on compatibility, updates, and changes between versions, see the Intel Composer release notes.

To compile and link MPI codes, use the wrapper mpiifort.

How to get access to the Intel Compilers

On the Cobra and Raven supercomputers, users need to load a specific version of the Intel compiler explicitly; the same applies to Intel MPI. No default versions exist for the Intel compiler and MPI modules, and no default versions are loaded at login.

To get a list of all available Intel compilers, enter module avail intel.

To get access to a specific Intel compiler, load the corresponding module with module load intel/<version>.

Intel Compiler for Linux Optimization Flags

Compiler optimization flags

Compiler optimization flags have a strong influence on the performance of the executable. Some important flags are given below. First, different optimization levels are available:

  • -O2: Standard optimization. Default.

  • -O3: Aggressive optimization. Use it with care and check the results against a less optimized binary.

  • -O0: Disables all optimization. Useful for fast compilation and to check if unexpected behavior results from a higher compiler optimization level.

  • -O1: Very conservative optimization.

In addition, vectorization is key to achieving good floating-point performance on modern CPUs. For detailed information on how to specify the instruction set level during compilation, please consult the Intel Compiler Documentation. In particular, the switches -x, -ax, and -m are relevant, for example:

  • -xCORE-AVX512 -qopt-zmm-usage=high: Enable AVX512 vectorization for Intel Skylake, Cascade Lake, and Ice Lake processors. These flags are recommended on Cobra and Raven.

  • -xCORE-AVX2: Enable AVX2 vectorization for Intel Haswell and Broadwell CPUs.

  • -ipo: Enable interprocedural optimizations beyond individual source files.

The meta switch -fast is not supported on MPCDF systems because it forces the static linking of all libraries (i.e. it implies the switch -static) which is not possible with certain system libraries.

[*] To obtain information about the features of the Linux host CPU issue the command cat /proc/cpuinfo | grep flags | head -1. Instruction-set-related keywords are, among others, avx, avx2, avx512.
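The same check can be scripted. The following is a minimal sketch, assuming a Linux host whose /proc/cpuinfo contains a line starting with "flags" (as on x86 systems); the helper name simd_flags and the keyword list are illustrative choices, not part of any MPCDF tooling.

```python
from pathlib import Path

# Keywords of interest; this list is illustrative, not exhaustive.
SIMD_KEYWORDS = ("avx", "avx2", "avx512f")


def simd_flags(cpuinfo_text, keywords=SIMD_KEYWORDS):
    """Return the subset of `keywords` found on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line format: "flags : fpu vme ... avx avx2 ..."
            present = set(line.split(":", 1)[1].split())
            return [kw for kw in keywords if kw in present]
    return []  # no 'flags' line found (e.g. non-x86 host)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # only available on Linux
        print(simd_flags(cpuinfo.read_text()))
```

Matching whole words rather than substrings avoids counting avx2 as evidence for avx512 support, or vice versa.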

The list of all supported switches and extensive information is covered by the official Intel Compiler Documentation.

Floating point accuracy

Intel compilers tend to adopt increasingly aggressive defaults for the optimization of floating-point semantics. The default is -fp-model fast=1. We recommend double-checking the accuracy of simulation results with more conservative settings such as -fp-model precise (recommended) or even -fp-model strict, which may come at the expense of computational performance.

See the compiler man pages for more details.
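One reason why relaxed floating-point settings can change results is that floating-point addition is not associative, so a compiler that is permitted to reorder a sum may round differently. The following sketch illustrates the effect in plain Python; it is not specific to any compiler, and the values are chosen only for illustration.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # evaluation order as written in the source
right = a + (b + c)  # a mathematically equivalent, reassociated order

print(left == right)              # False: the two orders round differently
print(math.isclose(left, right))  # True: they agree to within rounding error
```

When comparing the output of binaries built with different -fp-model settings, a tolerance-based comparison of this kind is usually more meaningful than a bitwise one.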

GNU Compiler Collection

The GNU Compiler Collection provides – among others – front ends for C, C++, and Fortran.

A default version of GCC comes with the operating system. More recent versions suitable for HPC can be accessed via environment modules.

To compile and link MPI codes using the GNU compilers, use the commands mpigcc, mpig++, or mpigfortran in combination with Intel MPI.

Find the full documentation at https://gcc.gnu.org/.

GPU Programming

The following packages are provided on the HPC clusters to enable users to develop applications for NVIDIA GPUs.

NVIDIA CUDA Toolkit

The NVIDIA CUDA Toolkit provides a development environment for the programming of NVIDIA GPUs. It includes the CUDA C++ compiler (nvcc), optimized libraries, debuggers and profilers, among others. Issue module avail cuda to get an up-to-date list of the CUDA versions available on a system.

NVIDIA HPC SDK

The NVIDIA HPC SDK provides a C, C++, and Fortran compiler for the programming of NVIDIA GPUs and multi-core CPUs. It is the successor product of the PGI compiler suite. In addition, the NVIDIA HPC SDK comprises a copy of the CUDA toolkit and various libraries for numerical computation, deep learning and AI, and communication. Issue module avail nvhpcsdk to get an up-to-date list of the versions available on a system.

Kokkos C++ Performance Portability Library

The Kokkos performance portability framework enables the development of applications that achieve consistently good performance across all relevant modern HPC platforms based on a single-source C++ implementation. It provides abstractions for parallel computation and data management, and supports several backends such as OpenMP and CUDA, among others.

Python

The high-level Python programming language can be extended with modules written in plain C++/CUDA to leverage GPU computing. In this case, the interfaces can be created comparatively easily with Cython or pybind11. Alternatively, the PyCUDA module offers a straightforward way to embed CUDA code into Python modules. Codes that make heavy use of NumPy can compile costly array expressions to CPU or GPU machine code using the Numba package. Note that Numba is non-intrusive as it only uses decorators. It is part of the Anaconda Python distribution.

NAG Fortran compiler

Usually we have the latest NAG Fortran compiler installed. To see all available NAG compilers on UNIX, enter module avail nagf95. To use a specific NAG compiler, load the module by module load nagf95/<$version>. The compiler command is nagfor.

To use the compiler on Windows, follow the instructions given in

/afs/ipp-garching.mpg.de/common/soft/nag_f95/<$version>/windows/readme.txt

for versions equal to or later than rel5.3. Access requires a valid AFS token for the cell ipp-garching.mpg.de.

More information about the NAG Fortran Compiler can be found in the official NAG Fortran Compiler Documentation.

Python

At the MPCDF, Python including a plethora of scientific packages for numerical computing and data science (NumPy, SciPy, matplotlib, Cython, Numba, Pandas, etc.) is provided via the Anaconda Python Distribution.

Obtain an up-to-date list of the installed Anaconda releases via the following command:

module avail anaconda

Python for HPC

Being an interpreted and dynamically typed language, plain Python is not per se suited to achieving high performance. Nevertheless, with the appropriate packages, tools, and techniques, Python can be used to perform numerical computation very efficiently, covering both the program’s efficiency and the programmer’s efficiency. The aim of this article is to provide some advice and orientation on using Python correctly on the HPC systems and on taking first steps towards basic Python code optimization.

Performance

The key to achieving good performance with Python is to move expensive computation from the interpreted code layer down to a compiled layer, which may consist of compiled libraries, code written and compiled by the user, or just-in-time compiled code. Below, three packages are discussed for these use cases.

NumPy

NumPy is the Python module that provides arrays of native datatypes (float32, float64, int64, etc.) and mathematical operations and functions on them. Typically, mathematical equations (in particular, vector and matrix arithmetic) can be written with NumPy expressions in a very readable and elegant way, which brings several advantages: NumPy expressions avoid explicit, slow loops in Python. In addition, NumPy uses compiled code and optimized mathematical libraries internally, e.g. Intel MKL on MPCDF systems, which enables vectorization and other optimizations. Parts of these libraries use thread-parallelization in a very efficient way by default, e.g. to perform matrix multiplications. In summary, NumPy provides the de-facto standard for numerical array-based computations and serves as the basis for a multitude of additional packages.
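The difference between an explicit loop and a NumPy expression can be sketched as follows. The function names axpy_loop and axpy_numpy are made up for this illustration; both compute alpha * x + y, and only the second delegates the element-wise work to compiled code.

```python
import numpy as np


def axpy_loop(alpha, x, y):
    """Explicit Python loop: every element passes through the interpreter."""
    result = np.empty_like(x)
    for i in range(x.size):
        result[i] = alpha * x[i] + y[i]
    return result


def axpy_numpy(alpha, x, y):
    """Same computation as one NumPy expression, executed in compiled code."""
    return alpha * x + y


x = np.linspace(0.0, 1.0, 1000)
y = np.ones_like(x)

# Both variants produce the same values; only the execution path differs.
print(np.allclose(axpy_loop(2.0, x, y), axpy_numpy(2.0, x, y)))
```

For large arrays, the NumPy variant is typically much faster; the actual speedup depends on the array size and the system and should be measured, e.g. with the timeit module.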

Cython

Cython is a Python language extension that makes it relatively easy to create compiled Python modules written in Cython, C or C++. It integrates well with NumPy arrays and can be used to implement time-critical parts of an algorithm. Moreover, Cython is very useful to create interfaces to C or C++ code, such as legacy libraries or native CUDA code. Technically, the Cython source code is translated by the Cython compiler to intermediate C code which is then compiled to machine code by a regular C compiler like GCC or ICC.

Numba

Numba is a just-in-time compiler based on the LLVM framework. It compiles Python functions at runtime for the datatypes these functions are called with. Moreover, Numba implements a subset of NumPy’s functions, i.e. it is able to compile NumPy expressions. Functions are marked as suitable for JIT compilation via a simple decorator syntax, so Numba is only minimally intrusive on existing code bases.
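A minimal sketch of the decorator syntax is given below. The function sum_of_squares is a made-up example, and the try/except fallback is an assumption added here only so that the snippet also runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback for this sketch only: without Numba the function runs as plain Python.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    """Explicit loop that Numba compiles to machine code on the first call."""
    total = 0.0
    for v in values:
        total += v * v
    return total


data = np.arange(10, dtype=np.float64)
print(sum_of_squares(data))  # the first call triggers compilation for float64 arrays
```

Note that the first call includes the compilation time, so timings should be taken from subsequent calls.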

Parallelization

While Python does implement threads as part of the standard library, these cannot be used to accelerate computation on more than one core in parallel due to CPython’s global interpreter lock. Nevertheless, Python is suitable for parallel computation. In the following, two important packages for intra-node and inter-node parallelism are addressed.

multiprocessing

The multiprocessing package is part of the Python standard library. It implements building blocks such as pools of workers and communication queues that can be used to parallelize data-parallel workloads. Technically, multiprocessing forks subprocesses from the main Python process that can run in parallel on multiple cores of a shared-memory machine. Note that some overhead is associated with the inter-process communication. It is, however, possible to access shared memory from several processes simultaneously. A typical use case would be large NumPy arrays.
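A pool of workers applied to a data-parallel workload can be sketched as follows. The functions square and parallel_squares and the default of two processes are illustrative assumptions; on a batch system the number of processes would normally be matched to the cores allocated to the job.

```python
from multiprocessing import Pool


def square(x):
    """Work item executed in a worker process; must be defined at module level."""
    return x * x


def parallel_squares(values, processes=2):
    """Distribute the work items over a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # The guard prevents worker processes from re-running this block on import.
    print(parallel_squares(range(8)))
```

Arguments and results are pickled and sent between processes, which is the communication overhead mentioned above; work items should therefore be coarse enough to amortize it.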

mpi4py

Access to the Message Passing Interface (MPI) is available via the module mpi4py. It enables parallel computation on distributed-memory computers where the processes communicate via messages with each other. In particular, the mpi4py package supports the communication of NumPy arrays without additional overhead. On MPCDF systems, the environment module mpi4py provides an optimized build based on the default Intel MPI library.

IO

NumPy implements efficient binary IO for array data that is useful, e.g., for temporary files. A better choice with respect to portability and long-term compatibility is HDF5. HDF5 is accessible via the h5py Python package and offers an easy-to-use dictionary-style interface. For parallel codes, a special build of h5py with support for MPI-parallel IO is provided via the environment module h5py-mpi.
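NumPy's binary IO for a temporary file can be sketched as follows. The file name checkpoint.npy and the array contents are arbitrary; the example covers only the NumPy format, not HDF5.

```python
import os
import tempfile

import numpy as np

data = np.random.default_rng(seed=0).random((4, 3))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "checkpoint.npy")
    np.save(path, data)       # stores dtype, shape, and the raw array data
    restored = np.load(path)  # reads the array back into memory

print(restored.dtype == data.dtype and np.array_equal(restored, data))
```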

The Python software ecosystem

In addition to the packages discussed up to now, there is a plethora of solid and well-proven packages for scientific computation and data science available, covering, e.g., numerical libraries (SciPy), visualization (matplotlib, seaborn), data analysis (pandas), and machine learning (TensorFlow, PyTorch), to name only a few.

Software installation

Often, users need to install special Python packages for their scientific domain. In most cases, the easiest and quickest way is to create an installation local to the user’s home directory. After loading the Anaconda environment module, the command pip install --user PACKAGE_NAME downloads and installs a package from the Python package index (PyPI); similarly, the command python setup.py install --user installs a package from an unpacked source tarball. In both cases, the resulting installation is located below “~/.local”, where Python will find it by default.
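To check where a given interpreter looks for such user-local installations, the standard library's site module can be queried. This is a small sketch; the printed paths depend on the Python version and on environment settings such as PYTHONUSERBASE.

```python
import site
import sys

user_site = site.getusersitepackages()

print("user base:         ", site.getuserbase())
print("user site-packages:", user_site)
# False inside some virtual environments, where --user installs are ignored.
print("user site enabled: ", site.ENABLE_USER_SITE)
# The directory only appears on sys.path if it exists and the user site is enabled.
print("on sys.path:       ", user_site in sys.path)
```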

Summary

The software recommended in this article is available via the Anaconda Python Distribution (environment module “anaconda/3”) on MPCDF systems. Note that for some packages (mpi4py, h5py-mpi), the hierarchical environment modules matter, i.e., it is necessary to load a compiler (gcc, intel) and an MPI module (impi) in addition to Anaconda in order to get access to these dependent environment modules.

The application group at the MPCDF has developed an in-depth course on “Python for HPC” which covers all the topics touched on in this article in more detail over two days. It is taught once or twice per year and announced via the MPCDF web page.

Finally, it should be pointed out that Python 2 reaches its official end-of-life on January 1, 2020. Consequently, new Python modules and updates to existing ones will not take Python 2 compatibility into account in the future. Users still running legacy code are strongly encouraged to migrate to Python 3.