No.206, April 2021
Contents
High-performance Computing
HPC System Raven: deployment of the final system
The Raven-interim HPC system is currently being dismantled to make way for the final system. The new machine will eventually comprise more than 1400 compute nodes with the brand new Intel Xeon IceLake-SP processor (72 cores per node arranged in 2 “packages”/NUMA domains, and 256 GB RAM per node). A subset of 64 nodes is equipped with 512 GB RAM and 4 nodes with 2 TB RAM. In addition, Raven will provide 192 GPU-accelerated compute nodes, each with 4 Nvidia A100 GPUs (4 x 40 GB HBM2 memory per node and NVLink 3) connected to the IceLake host CPUs with PCIe Gen 4. All nodes are interconnected with a Mellanox HDR InfiniBand network (100 Gbit/s) using a pruned fat-tree topology with three non-blocking islands (2 CPU-only islands, 1 GPU island). The GPU nodes are interconnected with at least 200 Gbit/s. The first half of the final system will become operational at the beginning of May, the second half by July 2021.

The new IceLake CPU on Raven (Intel Xeon Platinum 8360Y) is based on the familiar Skylake core architecture and the corresponding software stack, which MPCDF users already know from the interim system as well as from Cobra and several clusters. Notably, the IceLake platform provides an increased memory bandwidth (8 memory channels per package, compared to 6 channels on Skylake and its sibling CascadeLake). First benchmarks have shown a STREAM Triad performance of about 320 GB/s for a new Raven node (compared to ca. 190 GB/s measured on Cobra). The IceLake CPU is based on 10 nm technology and shows a significant increase in energy efficiency over the interim CPU.

More information about Raven can be found on the MPCDF webpage and in the technical documentation. Details about the migration schedule and necessary user actions will be announced to all HPC users of MPCDF facilities in due time. Essentially, users of the interim system will only be required to recompile and relink their codes and to adapt their Slurm submission scripts to match the new node-level resources. As the new machine will use the same file systems (/raven/u, /raven/ptmp, /raven/r) as the interim system, migration of user data is not needed.
Hermann Lederer, Markus Rampp
Charliecloud and Singularity containers supported on Cobra and Raven
The Singularity and Charliecloud container engines have recently been deployed on the HPC clusters of the MPCDF, offering additional opportunities to run scientific applications at our computing centre. Through containers, users have full control over the operating system and the software stack included in their environment, so that applications, libraries and other dependencies can be packaged and transferred together. Containers also provide operating-system-level virtualization for running software. This level of isolation, provided via cgroups and namespaces of the Linux kernel, offers a logical mechanism to abstract applications from the environment in which they run, promoting software portability between different hosts.
While introducing only a small overhead compared to bare-metal runs, applications in containers gain reproducibility, running identically regardless of where they are deployed. This makes the use of containers particularly compelling when porting software with complex dependencies or executing applications that require system libraries different from those available on the host system (or even a completely different operating system). Containers also provide an easy way to access and run pre-packaged applications that are available online, usually in the form of Docker containers that can easily be converted into a Singularity or Charliecloud container image.
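As a minimal sketch, a publicly available Docker image could be pulled and executed with either engine roughly as follows; the image name python:3.9-slim and the target directory are purely illustrative, and the exact Charliecloud workflow depends on the installed version:

# Singularity: convert the Docker image into a SIF file and run a command in it
singularity pull docker://python:3.9-slim
singularity exec python_3.9-slim.sif python3 --version
# Charliecloud: pull the image, unpack it into a directory, and run a command in it
ch-image pull python:3.9-slim
ch-convert python:3.9-slim /tmp/python39-dir
ch-run /tmp/python39-dir -- python3 --version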
Additional information on the use of Singularity and Charliecloud at MPCDF can be found at the technical documentation page of the MPCDF and in Bits&Bytes No. 205.
Michele Compostella
Control and verification of the CPU affinity of processes and threads
Introduction
The correct mapping of processes and threads to processors is of paramount importance for getting the best possible performance from the HPC hardware. That mapping is often called pinning and is handled via CPU affinities. Conversely, wrong pinning is very often the cause of inferior performance, especially on systems one uses for the first time. In the worst cases of incorrect pinning, some processors stay idle whereas others are overloaded with more tasks than they can actually run simultaneously.
This article gives some information on how to check and control the pinning on MPCDF systems. The pincheck library and tool developed at MPCDF is introduced, before the article concludes with some technical background for those readers who are interested.
Checking CPU affinities at runtime
In practice it is unfortunately cumbersome to learn about the actual pinning of a job, as different batch systems, MPI libraries, and OpenMP runtimes offer different ways to turn on verbose output. For example, setting the environment variable SLURM_CPU_BIND=verbose will instruct Slurm’s srun to print the process pinning it performs. Similarly, setting I_MPI_DEBUG=4 will enable verbose output from the Intel MPI library that includes some pinning information. As a third example, the variable KMP_AFFINITY=verbose,compact will enable pinning output for OpenMP codes compiled with the Intel compilers; please be aware that verbose cannot be specified alone but always needs a type specifier (here compact), otherwise no thread pinning would be applied.
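For reference, a minimal sketch combining these settings in a Slurm job step could look as follows (the binary name ./my_application is a placeholder):

# Slurm: let srun report the process pinning it applies
export SLURM_CPU_BIND=verbose
# Intel MPI: debug output that includes pinning information
export I_MPI_DEBUG=4
# Intel OpenMP runtime: report (and apply) compact thread pinning
export KMP_AFFINITY=verbose,compact
srun ./my_application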
As each of these outputs depends on the individual software, each needs to be read and interpreted differently. To reduce that complexity, MPCDF has developed a simple library and tool that yields the pinning information of codes at runtime in a unified and human-readable fashion.
The pincheck library and tool
To give developers and HPC users the possibility to easily check the CPU affinities of the processes and threads of their actual HPC jobs at runtime, MPCDF has developed a lightweight C++ library and tool named pincheck. It collects and returns the processor affinities from all MPI ranks in MPI_COMM_WORLD and from all the related OpenMP threads. The affinities are obtained in a portable way via system calls from the kernel, and no dependency on specific compilers or runtimes exists. pincheck is publicly available under a permissive MIT license.
For C++ codes, there is a header file (‘pincheck.hpp’) available that can be easily included and used from existing codes. In this case, the C++ header already includes the implementation, and no linking to a library is necessary. For C/C++ and Fortran codes, we will provide a library in combination with a C header file and a Fortran module with the next release in the near future.

Alternatively, pincheck can be compiled and used as a stand-alone program to check the CPU affinities one gets based on certain batch scripts, environment variables, MPI and OpenMP runtimes, etc.
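A minimal sketch of such a stand-alone check inside an existing Slurm batch script could look as follows (./my_application is a placeholder for the production binary; how to build pincheck is described in its repository):

# launch pincheck with exactly the same srun options and environment
# settings as the production step, so the reported affinities match
srun ./pincheck
srun ./my_application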
Detailed information on how to use pincheck from an existing code, and on how to compile and run it as a stand-alone program, is available in the git repository.
Processor and thread affinities on Slurm-based systems at MPCDF
On the HPC systems and clusters at MPCDF, processes are typically started via the srun launcher of Slurm. Based on the resources requested for a batch job, Slurm takes care of the CPU affinities of processes (which are typically the MPI tasks) by applying useful defaults (i.e., the block distribution method).

For example, for a pure MPI job without threading, srun will pin the tasks to individual cores such that consecutive tasks share a socket. For hybrid (MPI/OpenMP) jobs that use one MPI task per socket and (per task) a number of threads equal to the number of cores per socket, srun will pin each MPI task to an individual socket.
The threads spawned by these processes inherit the affinity mask, and the user has the option to further restrict the pinning of these individual threads. For OpenMP codes, this can be done by setting the environment variable OMP_PLACES, for example to the string cores, which will pin each OpenMP thread to an individual core. Other threading models (e.g. pthreads) typically offer certain functions to achieve similar functionality.
The MPCDF documentation provides example submit scripts that already include proper settings for the pinning of MPI processes and OpenMP threads, see for example the section on Slurm scripts for the Raven system.
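As an illustration, a hybrid MPI/OpenMP submit script following these recommendations might look like the sketch below; the numbers assume a two-socket node with 36 cores per socket and are placeholders, the authoritative templates being those in the MPCDF documentation:

#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2      # one MPI task per socket
#SBATCH --cpus-per-task=36       # one thread per core of a socket
#SBATCH --time=00:30:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores          # pin each OpenMP thread to an individual core

srun ./my_application            # placeholder binary name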
Technical background
The compute nodes of today’s HPC systems typically contain two or more multi-core chips (sockets) where each chip consists of multiple individual processors (cores) – a design that implies a complex memory hierarchy: each core has its private caches (typically L1 and L2), but shares a last-level cache (typically L3) with a set of other cores that are linked via a fast on-die bus. That bus links to a memory controller to which DIMM modules are connected. Each socket contains one or more such sets of cores that are called NUMA (non-uniform memory access) domains for the following reason: a core may logically access any memory attached to the compute node, however, at different bandwidths and latencies depending on which NUMA domain a particular part of the memory is physically attached to. Different NUMA domains are connected via bus systems that are slower than the intra-domain buses.
On a NUMA system it is therefore desirable that a core accesses physical memory local to its NUMA domain. Memory allocation and use is managed by the Linux operating system in chunks (pages) that are typically 4 kB in size. A first-touch policy applies, and, if possible, memory pages are placed closest to the core on which they were first used. HPC developers must therefore write their threaded programs in a NUMA-aware fashion to optimize for caches and minimize inter-domain memory accesses, and make sure that a process or thread stays within the domain or on the particular core. For codes that implement non-ideal memory access patterns (e.g., thread 0 touches all memory first, and then other threads access that memory across NUMA domains), the automatic NUMA balancing of the Linux OS may improve the performance during the runtime.
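Whether automatic NUMA balancing is enabled on a node, and how the memory pages of a running process are distributed across the NUMA domains, can be checked with standard Linux tools, for instance (the PID 12345 is a placeholder):

cat /proc/sys/kernel/numa_balancing   # 1 means automatic NUMA balancing is enabled
numastat -p 12345                     # per-NUMA-domain memory usage of a process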
By default, the scheduler of the Linux operating system may move processes and threads (“tasks”) between the available processors. In case such moves occur within a NUMA domain, a task may suffer a temporary performance penalty when it is moved to a core which initially does not have relevant data cached in L1 or L2. In case a task is moved from one NUMA domain to another, there is in addition a more severe performance penalty caused by non-local memory accesses, i.e., when the moved task accesses memory pages physically located on another NUMA domain from where these pages had been touched first by the same task.
In most HPC scenarios it is advantageous to restrict that moving activity in order to improve the overall temporal and spatial locality of the caches and memory accesses. To enable programmers and users to control the placement of tasks relative to NUMA domains and cores, the operating system supports setting so-called affinity masks which are taken into account by the scheduler. Using such masks, tasks can be “pinned” to sets of cores (e.g. NUMA domains) or even to individual cores or hardware threads, such that they stay there and are not moved. On a low level these masks are actually bit masks, but fortunately users can mostly work on a higher level, e.g. by using the variable OMP_PLACES=cores to instruct the OpenMP runtime to pin individual threads to individual cores.
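For a quick manual check, the affinity mask of a process and the NUMA layout of a node can also be inspected with standard Linux tools, for example:

taskset -cp $$        # print the affinity list of the current shell (or of any PID)
numactl --hardware    # show the NUMA domains, their cores and memory sizes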
High-performance data analytics and AI software stack at MPCDF
In recent years we have observed an ever-growing number of researchers who want to use institute clusters and the HPC systems at MPCDF for data analytics, and especially for machine-learning and deep-learning projects. This interest stems from the fact that the extremely powerful resources of HPC servers, especially when equipped with high-end GPU devices, can substantially boost the performance of data-analytics and AI workloads. Furthermore, the possibility to use multi-node setups to parallelize the workflows can reduce the time to solution by orders of magnitude. However, for users it is a non-trivial task to obtain a software stack that really exploits the hardware features of HPC systems (SIMD vectorization, the Tensor cores of the GPUs, and high-bandwidth fabrics, to mention a few) and runs at a reasonable fraction of the theoretical performance of the systems. In particular, the framework builds which users can obtain via the usual distribution channels of the ML/DL community, such as Python-based installation methods like “pip”, usually do not run efficiently on HPC hardware.
In order to address the needs of its users for such workflows, MPCDF provides an HPC-optimized software stack for data-analytics and AI applications. Among other things, this software stack comprises:
- basic ML and AI libraries such as Nvidia’s and Intel’s DNN implementations
  - Nvidia NCCL and cuDNN
  - Intel MKL-DNN
  - opencv
- popular frameworks
  - Tensorflow
  - Pytorch
  - Mxnet
  - scikit-learn
- parallelization frameworks
  - Horovod (for Tensorflow, Pytorch and Mxnet)
  - Apache Spark
- tools for image and NLP processing
See the MPCDF documentation for a detailed list and for some examples of how to use the software together with Slurm.
Whenever possible, a CPU and a GPU variant of the software is provided, which gives the user the freedom of choice and allows a seamless migration between different nodes and even clusters. As usual, the software is provided on MPCDF systems via the module environment. Please note that MPCDF uses a hierarchical software stack (see Bits & Bytes No. 198) and not all software is always visible with the “module avail” command. We recommend using the “find-module” command, which helps users to find out whether a software package is available and which modules have to be loaded before the respective module becomes visible.
Example:
user@cobra01:~> find-module tensorflow/gpu
tensorflow/gpu/1.14.0 (after loading anaconda/3/2019.03)
tensorflow/gpu/2.1.0 (after loading anaconda/3/2019.03)
tensorflow/gpu/2.1.0 (after loading anaconda/3/2020.02)
tensorflow/gpu/2.2.0 (after loading anaconda/3/2019.03)
tensorflow/gpu/2.2.0 (after loading anaconda/3/2020.02)
tensorflow/gpu/2.3.0 (after loading anaconda/3/2020.02)
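Based on this output, a matching combination of modules can then be loaded, for example (the chosen versions are just one of the listed combinations):
user@cobra01:~> module load anaconda/3/2020.02
user@cobra01:~> module load tensorflow/gpu/2.3.0
user@cobra01:~> python -c "import tensorflow as tf; print(tf.__version__)"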
After the desired modules have been loaded, the software can be used in the usual way and for example can be used with Jupyter Notebooks.
Further readings:
Bits & Bytes No.203 for Jupyter Notebooks as a Service
Bits & Bytes No.200 for Data Analytics at MPCDF
Andreas Marek
Decommissioning of AFS
After many years of serving as the central file system at MPCDF, the time has come to say goodbye to the Andrew File System (AFS). This does not mean that AFS will disappear immediately, but as a first step it is planned that home directories will no longer be set up in AFS and that not all login nodes of the Linux clusters will provide access to AFS, as is already the case on gatezero. The lack of support for Windows forces the use of alternatives. For most users, the Sync&Share functionality provided by our DataShare service is a good solution. For experiment data and software distribution, other ways are already established or still have to be determined. Thus, we kindly ask all our users to no longer consider AFS as the one and only filesystem for data exchange, but to implement alternatives and not to store new data in AFS home directories.
Andreas Schott
Relaunch of MPCDF website and new technical documentation platform
In March 2021, MPCDF relaunched its main website, adopting the corporate design of the Max Planck Society. The technical documentation for users of MPCDF services, including a comprehensive and continuously extended FAQ, as well as the MPCDF computer bulletin Bits&Bytes, has been refurbished and is now available at https://docs.mpcdf.mpg.de/.
Markus Rampp on behalf of the MPCDF Webteam
Events
New online introductory course for new users of MPCDF
The MPCDF has started offering a new online introductory course targeting new users. The first edition was held on April 13th with over 100 registered users from more than 30 Max Planck Institutes. In the future, it will be repeated on a semi-annual schedule. The 2.5-hour online course is given by application experts of the MPCDF and includes an interactive chat option and concluding Q&A sessions. It provides a basic introduction to the essential compute and data services available at MPCDF and is intended specifically to lower the bar for their first-time usage. This course forms the basis for more advanced courses such as the annual “Advanced HPC workshop” organised by MPCDF (next issue: autumn 2021, see below). Major topics of the online introductory course include an overview and practical hints for connecting to the HPC compute and storage facilities and using them via the Slurm batch system. The course material can be found on the MPCDF webpage.
Advanced HPC workshop: save the date
Our annual Advanced High-performance Computing Workshop is scheduled for November 22nd to 24th, 2021, with an additional day of hands-on sessions for accepted projects on the 25th. The main topics will be software engineering, debugging and profiling for CPU and GPU. The talks will be given by members of the application group and by experts from Intel and Nvidia. Further details and registration options will be announced in the next issue of Bits & Bytes.
Klaus Reuter, Sebastian Ohlmann, Tilman Dannert