Using Flash-based I/O-Accelerators on Viper-GPU

Introduction

The Viper-GPU HPC system is equipped with flash accelerators, a solution that combines specialized hardware and software to leverage the high throughput and low latency of Non-Volatile Memory Express (NVMe) solid-state drives for batch jobs. These accelerators are designed to mitigate the I/O bottleneck of HPC and AI applications during intense read and write phases. The deployment on Viper-GPU is based on the Smart Bunch of Flash (SBF) product by Eviden.

On Viper-GPU, SBF provides two modes of operation that differ in the lifetime of the storage buffer:

  • An ephemeral buffer has a lifetime that ends as soon as the job that uses it ends. This is the default mode.

  • A persistent buffer survives the job that creates it. It can then be used by subsequent jobs until it is destroyed explicitly.

SBF cannot be used for jobs on shared nodes: jobs requesting SBF must allocate full node(s) and run on the gpu partition.

Getting started

To use SBF, a line starting with #BB_LUA has to be added to the top of the Slurm job script, e.g. as follows:

#!/bin/bash -l
#BB_LUA SBF <sbf options>
# (SBATCH lines below)
#SBATCH <slurm options>
# ...
# (executable section of job script below)

The #BB_LUA SBF line has to be placed directly after the first line (#!/bin/bash -l) of the script. The following mandatory parameters have to be specified:

Option              Mandatory  Description
StorageSize=<size>  Yes        The size of each storage buffer per compute node. If -N <n_nodes> nodes are requested, the job consumes <size> * <n_nodes> of disk space on the SBF servers.
Path=/var/local/bb  Yes        The mountpoint at which each individual storage buffer is mounted on each compute node.
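As a quick sanity check of the StorageSize entry above, the total server-side consumption can be computed from the two values; the numbers below are hypothetical:

```bash
# Each compute node gets its own buffer of StorageSize, so a job on
# n_nodes nodes consumes StorageSize * n_nodes on the SBF servers.
STORAGE_SIZE_GB=100   # StorageSize=100GB (hypothetical)
N_NODES=4             # -N 4 (hypothetical)
echo "$(( STORAGE_SIZE_GB * N_NODES )) GB consumed on the SBF servers"
```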

The job steps of the job script must be started with the srun command to get full access to the storage buffers. Otherwise, only the master compute node that runs the job script will be able to use the SBF feature.

Ephemeral buffer example

The following example job script requests 100 GB of temporary storage space, where the application myApp has access to the storage via the mount point /var/local/bb.

#!/bin/bash -l
#BB_LUA SBF StorageSize=100GB Path=/var/local/bb
#
#SBATCH -J BB
#SBATCH --gres=gpu:2
# --nodes is the long form of -N
#SBATCH --nodes=1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 48
#SBATCH -t 0:01:00

srun myApp

A typical use case would be an application that requires fast scratch space to write and read temporary files to a local file system.
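This scratch-space pattern can be sketched as a job script; input.dat, result.dat, and the paths used by myApp are hypothetical placeholders:

```bash
#!/bin/bash -l
#BB_LUA SBF StorageSize=100GB Path=/var/local/bb
#
#SBATCH -J BB-scratch
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH -t 0:10:00

# Stage in: copy input data to the fast node-local buffer (hypothetical paths)
cp $HOME/input.dat /var/local/bb/

# Compute: myApp reads and writes its temporary files in /var/local/bb
srun myApp /var/local/bb/input.dat

# Stage out: save results before the ephemeral buffer disappears with the job
cp /var/local/bb/result.dat $HOME/
```

Note that the stage-out copy must happen within the job, since the ephemeral buffer ceases to exist when the job ends.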

Persistent buffer example

In persistent mode, SBF distinguishes the three stages “create”, “use”, and “destroy”. Two additional parameters for #BB_LUA SBF are relevant, as illustrated below.

Stage 1: create_persistent

An initial job can create a persistent buffer using the following parameters in the job script:

#BB_LUA SBF create_persistent Name=MyPersistentSBF StorageSize=100GB Path=/var/local/bb

where Name is chosen by the user and must be unique; it identifies the buffer for the jobs that should use it. After the job completes, the <n_nodes> storage buffers (one per requested node) of size <size> each stay allocated.
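A complete create_persistent job script might look as follows; the database file my_database.db and the Slurm parameters are hypothetical:

```bash
#!/bin/bash -l
#BB_LUA SBF create_persistent Name=MyPersistentSBF StorageSize=100GB Path=/var/local/bb
#
#SBATCH -J BB-create
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH -t 0:10:00

# Stage in: populate the persistent buffer once (hypothetical source path)
cp $HOME/data/my_database.db /var/local/bb/
```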

Stage 2: use_persistent

Subsequent jobs can use the persistent buffer as follows:

#BB_LUA SBF use_persistent Name=MyPersistentSBF

After the job has completed, the storage buffers are unmounted from the compute nodes, but the buffers still exist on the servers. Subsequent jobs can reuse these storage buffers. Be aware that the same persistent Name cannot be used by multiple jobs of the same user at the same time due to the private nature of the storage buffers. The number of compute nodes (-N option) of the use_persistent request must not exceed the number of compute nodes of the create_persistent request.
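A subsequent job using the buffer might be sketched as follows; myApp and my_database.db are hypothetical. Since the buffer already exists, only the Name parameter is given:

```bash
#!/bin/bash -l
#BB_LUA SBF use_persistent Name=MyPersistentSBF
#
#SBATCH -J BB-use
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH -t 1:00:00

# myApp reads the previously staged data from the persistent buffer
srun myApp /var/local/bb/my_database.db
```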

Stage 3: destroy_persistent

Once a persistent buffer is no longer needed, the user must request its destruction (i.e. deallocation) explicitly using the following line in a job script:

#BB_LUA SBF destroy_persistent Name=MyPersistentSBF

The destroy_persistent job can only be executed if no other job is using the persistent buffer anymore. After the job completes, the storage buffers no longer exist and cannot be used by any job.

If the force=true parameter is additionally specified after destroy_persistent, any jobs still using the persistent buffer are killed immediately before the buffer is destroyed.
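A minimal destruction job might look as follows; the Slurm parameters are hypothetical, and no application work is needed:

```bash
#!/bin/bash -l
#BB_LUA SBF destroy_persistent Name=MyPersistentSBF
#
#SBATCH -J BB-destroy
#SBATCH --nodes=1
#SBATCH -t 0:01:00

# Nothing to run; the buffer deallocation is handled by SBF around the job
exit 0
```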

A typical use case would be an application that relies on the availability of a fast file system to read data from a database file, including random file access. The database could be copied to the persistent buffer during the initial create_persistent job, and read from during subsequent jobs.

Reporting Commands

The Slurm commands squeue and scontrol may display information on the buffer, in particular on the stage-in and stage-out phases.

  • Use squeue to display information about jobs located in the Slurm scheduling queue, including details such as BurstBufferStageIn, BurstBufferResources, etc.

  • Use scontrol show job <jobid> to view detailed information for a specific job, where the BurstBuffer and Reason fields provide SBF-related data.

  • The command scontrol show bbstat displays the buffer information. The State field in the report shows the job’s progress regarding the SBF activity: Staging-out indicates that the resources are not yet released; only once the state becomes Staged-out are the buffer resources fully released and the job considered complete.
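The reporting commands above can be combined into a quick status check; <jobid> is a placeholder for the ID returned by sbatch:

```bash
# Inspect SBF-related fields of a specific job (replace <jobid>)
scontrol show job <jobid> | grep -E "BurstBuffer|Reason"

# Show the overall SBF state; Staging-out means resources are not yet released
scontrol show bbstat
```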

Restrictions

  • The -N <nodes> (or --nodes=<nodes>) parameter is mandatory in your Slurm sbatch command. SBF creates a separate storage buffer for each node before the job starts, so it requires the node count in advance.

  • The SBF storage buffers are not shared between nodes.

  • Using SBF requires exclusive use of the compute node(s) by a job.

  • The minimum value of the StorageSize parameter is 16 MiB.

  • Heterogeneous jobs are not supported by SBF.

  • A batch job must be submitted to destroy a persistent buffer.

Technical Details

SBF is a software-defined storage solution by Eviden that allows users to create fast, temporary storage volumes. These volumes are attached individually to the compute nodes using NVMe-oF (NVMe over Fabrics) technology. SBF uses the XFS file system on the volumes.