Raven hardware details

This page summarizes additional details on the Raven hardware and presents performance measurements from microbenchmarks.

Node architecture

The Raven system comprises compute nodes powered by dual Intel Xeon IceLake-SP processors (Platinum 8360Y) with 36 physical CPU cores per socket (i.e. 72 per node). The standard compute nodes feature 256 GB of RAM; 64 nodes provide 512 GB and 4 nodes provide 2048 GB. In addition, there are 192 GPU-accelerated compute nodes, each equipped with 4 Nvidia A100 40GB-SXM GPUs that are interconnected via NVLINK3 and attached to the host via PCIe. These GPU nodes feature 512 GB of RAM and use the same Intel Xeon IceLake-SP CPUs.

The CPU nodes are interconnected with a Mellanox HDR InfiniBand network at 100 Gbit/s, whereas the GPU nodes are linked at a rate of 200 Gbit/s.

The nominal bandwidths between the components on a Raven GPU node are approximately:

  • 100 GB/s per direction between each of the four A100 GPUs using NVLINK3

  • 67 GB/s per direction between the two CPU sockets using UPI

  • 32 GB/s per direction between each A100 GPU and the host using PCIe4 x16

  • 25 GB/s per direction via the InfiniBand network interface

The following schematic highlights the topology of a Raven GPU node (where the notation ‘2x X GB/s’ refers to full duplex, enabling a bandwidth of X GB/s in each direction simultaneously):

Raven GPU node schematic

A set of 32 Raven GPU nodes is equipped with a second InfiniBand interface that doubles the bandwidth into the network to 50 GB/s per direction. On Slurm, these nodes can be selected via the --constraint="gpu-bw" flag of sbatch (an example batch script is sketched below). Due to the limited number of available PCIe lanes, the GPUs on these nodes are connected to the host at half the bandwidth of the regular Raven GPU nodes. The following schematic shows the topology of a Raven GPU node of the ‘gpu-bw’ type:

Raven GPU-BW node schematic
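
A minimal sketch of a batch script requesting one of these nodes is shown below. Only the --constraint="gpu-bw" line is taken from this page; the job name, the GPU request, the time limit, and the executable are placeholders that must be adapted to the actual job.

  #!/bin/bash -l
  # Sketch: request one of the 32 'gpu-bw' nodes with the second InfiniBand interface.
  #SBATCH --job-name=gpu_bw_job       # placeholder job name
  #SBATCH --nodes=1
  #SBATCH --constraint="gpu-bw"       # select a node with doubled network bandwidth
  #SBATCH --gres=gpu:a100:4           # assumption: GPUs are requested as a generic resource
  #SBATCH --time=00:30:00             # placeholder time limit

  srun ./my_gpu_application           # placeholder executable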

Empirical Roofline Models

Below, empirical roofline plots are presented that are based on measurements made on a CPU node and on an individual GPU of Raven. A roofline model illustrates whether a computational kernel is limited by the memory bandwidth or by the maximum floating-point performance of the hardware, depending on the arithmetic intensity of that kernel.
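
In such a plot, the attainable performance P of a kernel with arithmetic intensity I (flops per byte transferred to and from memory) is bounded by

  P(I) = \min\left( P_{\mathrm{peak}},\ I \cdot b_{\mathrm{mem}} \right)

where P_peak denotes the maximum floating-point performance and b_mem the memory bandwidth of the hardware; the two terms correspond to the horizontal and the slanted “roofs” in the plots below.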

CPU Empirical Roofline Plot

GPU Empirical Roofline Plot

Performance measurements using microbenchmarks

The likwid suite implements a set of microbenchmarks to measure, e.g., the floating-point performance (flops) and the memory bandwidth a system can achieve. The numbers below are based on actual measurements on a full node, i.e. utilizing all CPU cores and memory channels.
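
Such numbers can be reproduced with the likwid-bench tool from the likwid suite. The following invocations are a sketch only; the kernel names and the workgroup sizes (a small, cache-resident working set for the flops kernels, a large in-memory working set for the bandwidth kernels, 36 threads per socket) are assumptions and may need to be adapted to the installed likwid version.

  # list all available microbenchmark kernels
  likwid-bench -a

  # peak floating-point performance: AVX-512 FMA kernel, cache-resident working set
  likwid-bench -t peakflops_avx512_fma -w S0:1MB:36 -w S1:1MB:36

  # memory bandwidth: stream-triad AVX kernel, large in-memory working set
  likwid-bench -t triad_avx -w S0:2GB:36 -w S1:2GB:36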

For the measurements on the GPU, the BabelStream microbenchmark was used.
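
The GPU numbers shown further below can in principle be reproduced with the CUDA variant of BabelStream. The invocation below is a sketch based on BabelStream 4.0; the binary name and the device index may differ between installations.

  # CUDA version of BabelStream on the first GPU of the node
  # (defaults: 100 repetitions, double precision, 2^25 elements per array)
  ./cuda-stream --device 0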

CPU

Currently Loaded Modulefiles:

    1)  intel/21.6.0
    2)  likwid/5.2(default)

Flops

  instruction set   GFlops/s
  ----------------- ----------
  scalar            304.660
  SSE               640.023
  AVX               1357.607
  AVX-FMA           2716.209
  AVX512            2685.938
  AVX512-FMA        5370.171
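
As a plausibility check (an interpretation of the table, not part of the likwid output): assuming two 512-bit FMA units per core, each core can execute 2 × 8 × 2 = 32 double-precision flops per cycle, so the measured AVX512-FMA value corresponds to an effective all-core clock of roughly

  f \approx \frac{5370\ \mathrm{GFlop/s}}{72\ \mathrm{cores} \cdot 32\ \mathrm{Flop/cycle}} \approx 2.3\ \mathrm{GHz}

under this AVX-512 FMA workload.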

Memory Bandwidth

  load

  instruction set   GBytes/s
  ----------------- ----------
  scalar            321.435
  SSE               336.826
  AVX               338.035


  copy

  instruction set   GBytes/s
  ----------------- ----------
  scalar            260.375
  SSE               290.084
  AVX               294.560


  stream

  instruction set   GBytes/s
  ----------------- ----------
  scalar            300.476
  SSE               303.429
  AVX               303.799


  triad

  instruction set   GBytes/s
  ----------------- ----------
  scalar            306.859
  SSE               307.839
  AVX               307.724
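
For context (a back-of-the-envelope estimate, assuming all 8 DDR4-3200 memory channels per socket are populated, which is not stated on this page): the theoretical peak memory bandwidth of a node is

  b_{\mathrm{peak}} = 2\ \mathrm{sockets} \cdot 8\ \mathrm{channels} \cdot 8\ \mathrm{Byte} \cdot 3.2\ \mathrm{GT/s} = 409.6\ \mathrm{GB/s}

so the measured triad bandwidth of about 307 GB/s corresponds to roughly 75 % of the theoretical peak, a typical value for stream-like kernels.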

To complement the numbers presented above, the following plot shows measurements of the memory bandwidth as a function of the number of threads. Each thread is bound (“pinned”) to an individual physical core.

Memory bandwidth vs. number of physical cores employed

The orange curve depicts measured results based on a scattered pinning, i.e., threads are pinned to physical cores on the two CPU sockets in a round-robin fashion, thereby making use of all available memory channels in a balanced way. As a rule of thumb and evident from the plot, at least half of the physical cores per socket are required to make efficient use of the available memory bandwidth.

In contrast, the blue curve shows results based on a compact pinning, i.e., threads are pinned to the first socket until it is fully occupied (36 threads) before the second socket is populated as well (up to 72 threads). The transition at 36 threads illustrates the memory bandwidth a single socket is able to deliver.
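
For OpenMP codes, comparable pinning schemes can be realized, for example, via the standard OpenMP affinity environment variables. The following lines are a sketch; ./my_openmp_app is a placeholder, and the likwid suite provides the likwid-pin tool as an alternative.

  export OMP_NUM_THREADS=36
  export OMP_PLACES=cores

  # scattered pinning: distribute the threads evenly over both sockets
  OMP_PROC_BIND=spread ./my_openmp_app

  # compact pinning: fill the cores of the first socket first
  OMP_PROC_BIND=close ./my_openmp_app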

GPU

  BabelStream
  Version: 4.0
  Implementation: CUDA
  Running kernels 100 times
  Precision: double
  Array size: 268.4 MB (=0.3 GB)
  Total size: 805.3 MB (=0.8 GB)
  Using CUDA device NVIDIA A100-SXM4-40GB
  Driver: 11040
  Function    MBytes/sec  Min (sec)   Max         Average
  Copy        1403896.061 0.00038     0.00039     0.00038
  Mul         1360548.080 0.00039     0.00040     0.00040
  Add         1357798.755 0.00059     0.00060     0.00060
  Triad       1362270.774 0.00059     0.00069     0.00060
  Dot         1229347.744 0.00044     0.00045     0.00044
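
As a consistency check of the table (simple arithmetic, not part of the BabelStream output): the Triad kernel reads two arrays and writes one, i.e. it moves the full ‘Total size’ of 805.3 MB per iteration, so the minimum runtime of 0.00059 s corresponds to

  b \approx \frac{805.3\ \mathrm{MB}}{0.00059\ \mathrm{s}} \approx 1.36\ \mathrm{TB/s}

in agreement with the tabulated value and close to the nominal HBM2 bandwidth of about 1.6 TB/s of the A100-SXM4-40GB.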