Using Flash-based I/O-Accelerators on Viper-GPU
Introduction
The Viper-GPU HPC system is equipped with flash accelerators, a solution that combines specialized hardware and software to leverage the high throughput and low latency of Non-Volatile Memory Express (NVMe) solid-state drives for batch jobs. These accelerators are designed to mitigate the I/O bottleneck of HPC and AI applications during intense read and write phases. The deployment on Viper-GPU is based on the Smart Bunch of Flash (SBF) product by Eviden.
On Viper-GPU, SBF provides two modes of operation that differ in the lifetime of the storage buffer:
An ephemeral buffer has a lifetime that ends as soon as the job that uses it ends. This is the default mode.
A persistent buffer survives the job that creates it. It can then be used by subsequent jobs until it is destroyed explicitly.
SBF cannot be used by jobs on shared nodes: jobs requesting SBF must allocate full node(s) and must run on the gpu partition.
Getting started
To use SBF, a line starting with #BB_LUA has to be added to the top of the Slurm job script, e.g. as follows:
#!/bin/bash -l
#BB_LUA SBF <sbf options>
# (SBATCH lines below)
#SBATCH <slurm options>
# ...
# (executable section of job script below)
The #BB_LUA SBF line has to be placed directly after the first line (#!/bin/bash -l) of the script.
The following mandatory parameters have to be specified:
| Option | Mandatory | Description |
|---|---|---|
| StorageSize=&lt;size&gt; | Yes | The size of the storage buffer per compute node. If -N &lt;n_nodes&gt; nodes are requested, the job consumes &lt;size&gt; * &lt;n_nodes&gt; of disk space on the servers. |
| Path=/var/local/bb | Yes | The mount point where each individual storage buffer will be mounted on each compute node. |
The job steps of the job script must be started with the srun command to
get full access to the storage buffers. Otherwise, only the master compute node
that runs the job script will be able to use the SBF feature.
Ephemeral buffer example
The following example job script requests 100 GB of temporary storage space, where
the application myApp has access to the storage via the mount point /var/local/bb.
#!/bin/bash -l
#BB_LUA SBF StorageSize=100GB Path=/var/local/bb
#
#SBATCH -J BB
#SBATCH --gres=gpu:2
#SBATCH --nodes 1 # this is -N
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 48
#SBATCH -t 0:01:00
srun myApp
A typical use case would be an application that requires fast scratch space to write and read temporary files to a local file system.
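Such a scratch workflow can be sketched as follows. Note that the buffer vanishes when the job ends, so results must be copied back to the shared file system within the job; the input/output paths and the application options of myApp are placeholders, not real flags:

```bash
#!/bin/bash -l
#BB_LUA SBF StorageSize=100GB Path=/var/local/bb
#
#SBATCH -J BB-scratch
#SBATCH --gres=gpu:2
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 48
#SBATCH -t 0:10:00

# stage input data into the node-local buffer (placeholder path)
srun cp /path/to/input.dat /var/local/bb/

# run the application with its scratch files on the fast local buffer
# (--input and --scratch-dir are hypothetical options of myApp)
srun myApp --input /var/local/bb/input.dat --scratch-dir /var/local/bb

# copy results back before the ephemeral buffer is destroyed (placeholder path)
srun cp /var/local/bb/results.dat /path/to/output/
```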
Persistent buffer example
In persistent mode, SBF distinguishes three stages: create, use, and destroy. Two additional parameters for #BB_LUA SBF are relevant, as illustrated below.
Stage 1: create_persistent
An initial job can create a persistent buffer using the following parameters in the job script:
#BB_LUA SBF create_persistent Name=MyPersistentSBF StorageSize=100GB Path=/var/local/bb
where Name must be unique and is chosen by the user; all jobs that should
use the buffer refer to it by this name. After the job completes, the <n_nodes> storage buffers (one per requested node, of size <size> each) remain allocated.
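A complete job script for the create stage might look as follows, here populating the new buffer with a database file; the source path is a placeholder:

```bash
#!/bin/bash -l
#BB_LUA SBF create_persistent Name=MyPersistentSBF StorageSize=100GB Path=/var/local/bb
#
#SBATCH -J BB-create
#SBATCH --gres=gpu:2
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 48
#SBATCH -t 0:30:00

# copy the data set into the persistent buffer so that subsequent
# use_persistent jobs can read it (placeholder path)
srun cp /path/to/database.db /var/local/bb/
```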
Stage 2: use_persistent
Subsequent jobs can use the persistent buffer as follows:
#BB_LUA SBF use_persistent Name=MyPersistentSBF
After the job has completed, the storage buffers are unmounted from the compute
nodes, but the buffers still exist on the servers. Subsequent jobs can reuse
these storage buffers. Be aware that the same persistent Name cannot be
used by multiple jobs of the same user at the same time due to the private
nature of the storage buffers. The number of compute nodes (-N option) of the
use_persistent request must not exceed the number of compute nodes of the
create_persistent request.
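A job script for the use stage could be sketched as follows, continuing the database example from the create stage; the --database option of myApp is a hypothetical application flag:

```bash
#!/bin/bash -l
#BB_LUA SBF use_persistent Name=MyPersistentSBF
#
#SBATCH -J BB-use
#SBATCH --gres=gpu:2
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 48
#SBATCH -t 1:00:00

# read directly from the persistent buffer mounted at /var/local/bb
srun myApp --database /var/local/bb/database.db
```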
Stage 3: destroy_persistent
Once a persistent buffer is no longer needed, the user must request its destruction (i.e. deallocation) explicitly using the following line in a job script:
#BB_LUA SBF destroy_persistent Name=MyPersistentSBF
The destroy_persistent job can only be executed if no other job is using the
persistent buffer anymore. After the job completes, the storage buffers no
longer exist and cannot be used by any other job anymore.
If the user additionally specifies the force=true parameter after
destroy_persistent, any jobs still using the persistent buffer are stopped
and killed immediately.
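A minimal destroy-stage job script might look as follows; since no application needs to run, a trivial job step is assumed here as a placeholder:

```bash
#!/bin/bash -l
#BB_LUA SBF destroy_persistent Name=MyPersistentSBF
#
#SBATCH -J BB-destroy
#SBATCH --nodes 1
#SBATCH -t 0:01:00

# trivial job step; the buffer is deallocated as part of the job
srun true
```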
A typical use case would be an application that relies on the availability of a
fast file system to read data from a database file, including random file
access. The database could be copied to the persistent buffer during the
initial create_persistent job, and read from during subsequent jobs.
Reporting Commands
The Slurm commands squeue and scontrol may display information on the buffer, in particular information on the different phases stage-in and stage-out.
Use squeue to display information about jobs in the Slurm scheduling queue, including details such as BurstBufferStageIn, BurstBufferResources, etc.
Use scontrol show job <jobid> to view detailed information for a specific job; the BurstBuffer and Reason fields provide SBF-related data.
The command scontrol show bbstat displays the buffer information. The State field in the report shows the job's progress regarding SBF activity. The Staging-out state indicates that the resources have not yet been released. Only once the state becomes Staged-out are the buffer resources fully released and the job considered complete.
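A typical inspection sequence after submitting an SBF job might look as follows; the job ID 12345 is a placeholder:

```bash
# list your jobs in the scheduling queue
squeue -u $USER

# inspect the BurstBuffer and Reason fields of a specific job
scontrol show job 12345

# query the overall burst-buffer status, including the State field
scontrol show bbstat
```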
Restrictions
The -N <nodes> (or --nodes=<nodes>) parameter is mandatory in your Slurm sbatch command. SBF creates a separate storage buffer for each node before the job starts, so it requires the node count in advance.
The SBF storage buffers are not shared between nodes.
Using SBF requires exclusive use of the compute node(s) by a job.
The minimum value of the StorageSize parameter is 16 MiB.
Heterogeneous jobs are not supported by SBF.
A batch job must be submitted to destroy a persistent buffer.
Technical Details
SBF is a software-defined storage solution by Eviden that allows the creation of fast, temporary storage volumes. These volumes are attached to individual compute nodes using NVMe-oF (Non-Volatile Memory Express over Fabrics) technology. SBF uses the XFS file system on the volumes.