Data Analytics / Machine Learning
This page collects information about the software for data analytics and, in particular, machine learning that we support at the MPCDF.
Libraries for Deep Learning
On the MPCDF HPC systems we provide the following libraries which can be used to improve the performance of deep learning applications:
CPU based
mkl-dnn
mlsl
daal
GPU based
cudnn
nccl
Frameworks for Machine Learning
We aim to provide the machine learning software on our clusters for a large set of CUDA, Python (via Anaconda) and MPI versions. However, due to the complexity of some of the machine learning frameworks (e.g. Tensorflow and Horovod) it is not possible to provide all of these combinations. As a rule of thumb, the software will always be available for the latest versions of Anaconda, MPI and CUDA, and, if possible, will also be provided for further combinations.
The following software frameworks are supported on MPCDF systems:
CPU based
scikit-learn
tensorflow
pytorch
mxnet
horovod
apache spark
keras
onnx
GPU based
tensorflow
pytorch
mxnet
horovod
keras
NVIDIA DALI
onnx
How to find the software modules
In case you know the name of the module you wish to load (or only part of it), but you are not sure about the available versions or which dependencies need to be loaded first, you can use the find-module command.
As an example, let us suppose that you need to use scikit-learn 0.24.1 and you are looking for the respective module. You would then query find-module scikit and get an output similar to the following:
scikit-learn/0.23.1 (after loading anaconda/3/2020.02)
scikit-learn/0.24.1 (after loading anaconda/3/2020.02)
scikit-learn/0.24.1 (after loading anaconda/3/2021.05)
scikit-learn/0.24.1 (after loading anaconda/3/2021.11)
scikit-optimize/0.8.1 (after loading anaconda/3/2020.02)
scikit-optimize/0.8.1 (after loading anaconda/3/2021.05)
scikit-optimize/0.8.1 (after loading anaconda/3/2021.11)
You can now select the module featuring the version you need for scikit-learn and load it after having loaded its prerequisite:
module purge
module load anaconda/3/2021.11
module load scikit-learn/0.24.1
Please note that the most recent installation of Anaconda (as of February 2022) is provided by the module anaconda/3/2021.11.
How to use scikit-learn on HPC systems
Scikit-learn is a Python package; due to our hierarchical module system, the scikit-learn modules will therefore only be available after the anaconda (Python) module has been loaded:
module purge
module load anaconda/3/2021.11
module load scikit-learn
Scikit-learn is well suited for prototyping and for experimenting with different machine learning algorithms. It has, however, no built-in functionality for intra-node or inter-node parallelization. (Experienced users can implement such functionality themselves, e.g. with Python subprocesses, Python multiprocessing, or the mpi4py package, as sketched below.) Extensive documentation of scikit-learn is available at https://scikit-learn.org.
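As an illustration of the multiprocessing route mentioned above, the following minimal sketch trains and cross-validates several model configurations in parallel within a single node. The data set, classifier and parameter values are purely illustrative:
# Minimal sketch: evaluate several hyperparameter settings in parallel on one
# node using Python multiprocessing. Data set and parameters are illustrative.
from multiprocessing import Pool

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

def evaluate(n_estimators):
    """Train and cross-validate one model configuration."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    return n_estimators, cross_val_score(clf, X, y, cv=3).mean()

if __name__ == "__main__":
    # One worker process per core used; here up to 4 configurations in parallel.
    with Pool(processes=4) as pool:
        for n, score in pool.map(evaluate, [50, 100, 200, 400]):
            print(f"n_estimators={n}: mean CV accuracy {score:.3f}")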
How to use TensorFlow/Pytorch/Mxnet on the HPC systems
On our HPC systems we provide variants of TensorFlow, Pytorch and Mxnet for CPU or GPU usage. The GPU versions are generally preferable, since Tensorflow/Pytorch/Mxnet applications perform extremely well on GPUs. However, if the memory of the GPUs is not sufficient for the application, the CPU versions can be used, which can utilize the full memory of a node. Cobra's GPUs are equipped with 32 GB of memory, Raven's GPUs with 40 GB. The main memory of the nodes is at least 96 GB.
Using TensorFlow/Pytorch/Mxnet with GPUs on MPCDF HPC system
Due to the hierarchical module system at MPCDF, the modules for tensorflow/gpu, pytorch/gpu and mxnet/gpu will only be available after the cuda and anaconda modules have been loaded. Since May 15, 2021, the modules providing the GPU versions of tensorflow, pytorch and mxnet have been renamed to reflect the specific CUDA version employed to build the respective backend.
On COBRA and RAVEN do:
module purge
module load cuda/11.4
module load anaconda/3/2021.11
#TF
module load tensorflow/gpu-cuda-11.4/2.7.0
#Pytorch
module load pytorch/gpu-cuda-11.4/1.9.0
#Mxnet
module load mxnet/gpu-cuda-11.4/1.9.0
#and if needed (only for TF)
module load keras
Please always check the exact name of the module to be loaded and its prerequisites via the find-module command, as described above.
Please note that without additional parallelization such as Horovod (see below), TensorFlow/Pytorch/Mxnet can only be used on a single node of the HPC systems, since no inter-node parallelization is available.
An example submission script (e.g. on COBRA) looks as follows:
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./tf_gpu.out.%j
#SBATCH -e ./tf_gpu.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J TF_GPU
# Queue:
#SBATCH --partition=gpu # If using both GPUs of a node
# Node feature:
#SBATCH --constraint="gpu"
#SBATCH --gres=gpu:2 # If using both GPUs of a node
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --mail-type=all
#SBATCH --mail-user=%u@mpcdf.mpg.de
#
# wall clock limit
#SBATCH --time=00:30:00
module purge
module load anaconda/3/2021.11
module load cuda/11.4
#TF
module load tensorflow/gpu-cuda-11.4/2.7.0
#Pytorch
module load pytorch/gpu-cuda-11.4/1.9.0
#Mxnet
module load mxnet/gpu-cuda-11.4/1.9.0
# and if needed (only for TF)
module load keras
srun python your_gpu_tensorflow/pytorch_application.py
echo "job finished"
Using TensorFlow/Pytorch/Mxnet with CPUs-only on MPCDF HPC system
Due to the hierarchical module system at MPCDF, the modules for tensorflow/cpu, pytorch/cpu and mxnet/cpu will only be available after the anaconda module has been loaded.
On COBRA and RAVEN do:
module purge
module load anaconda/3/2021.11
#TF
module load tensorflow/cpu/2.7.0
#Pytorch
module load pytorch/cpu/1.9.0
#Mxnet
module load mxnet/cpu/1.9.0
#and if needed (only for TF)
module load keras
Please note that without additional parallelization such as Horovod (see below), TensorFlow/Pytorch can only be used on a single node of the HPC systems, since no inter-node parallelization is available.
An example submission script (e.g. on COBRA) looks as follows:
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./tf_cpu.out.%j
#SBATCH -e ./tf_cpu.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J TF_CPU
# Queue (Partition):
#SBATCH --partition=express
# Number of nodes:
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#
#SBATCH --mail-type=all
#SBATCH --mail-user=%u@mpcdf.mpg.de
#
# Wall clock limit:
#SBATCH --time=00:30:00
module purge
module load anaconda/3/2021.11
#TF
module load tensorflow/cpu/2.7.0
#Pytorch
module load pytorch/cpu/1.9.0
#mxnet
module load mxnet/cpu/1.9.0
# and if needed (only for TF)
module load keras
export OMP_NUM_THREADS=40
# For pinning threads correctly:
export OMP_PLACES=threads
export SLURM_HINT=multithread
# Run the program:
srun python your_cpu_tensorflow/pytorch_application.py
echo "job finished"
How to use TensorFlow/Pytorch/Mxnet with Horovod in data parallelism on HPC systems
Horovod is an MPI-based extension to frameworks such as TensorFlow, Pytorch and Mxnet which allows the training of neural networks to be parallelized following the data-parallelism approach.
The key idea is to replicate the neural network on each MPI task (which can be bound to a GPU or to a set of CPUs), to run a training step simultaneously on all replicas using a different mini-batch for each MPI task, and then to aggregate the gradients in order to update the model parameters of all replicas.
(Figure: schematic of data-parallel training; image from henning.kropponline.de)
Since in the data-parallelism approach the parallelization is done over the mini-batches of the training data set, this approach can only be used to speed up the training of a neural network.
In order to use Horovod with (Keras/)TensorFlow or Pytorch, one needs to modify the application code and to adapt the Slurm submission script so that the correct resources are requested.
Example: Modification of the Keras/TensorFlow application code
Since Horovod uses an MPI based approach, the concept of the MPI parallelization has to be introduced in the application code. This can be achieved by adding a few lines of code to the communication part of the Keras/TensorFlow code of the application. The following example, taken from the Horovod webpage, illustrates this for a simple convolutional neural network written with the Keras framework.
#!/usr/bin/env python
#-*- coding: utf-8 -*-
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import math
import tensorflow as tf
# Horovod:
import horovod.keras as hvd
# Horovod: initialize Horovod.
hvd.init()
# Horovod: pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
batch_size = 128
num_classes = 10
# Horovod: adjust number of epochs based on number of GPUs.
epochs = int(math.ceil(12.0 / hvd.size()))
# Input image dimensions
img_rows, img_cols = 28, 28
# The data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
# Horovod: adjust learning rate based on number of GPUs.
opt = keras.optimizers.Adadelta(1.0 * hvd.size())
# Horovod: add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=opt,
              metrics=['accuracy'])
callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
model.fit(x_train, y_train,
          batch_size=batch_size,
          callbacks=callbacks,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
On the Horovod webpage, more examples can be found.
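The required modifications for a Pytorch application follow the same pattern. The following is a minimal sketch based on the horovod.torch API; the model, data set and hyperparameters are placeholders only and should be replaced by your own:
import torch
import horovod.torch as hvd

# Horovod: initialize Horovod and bind this process to one GPU per local rank.
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Placeholder model and data set; replace with your own.
model = torch.nn.Linear(784, 10).cuda()
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 784),
                                         torch.randint(0, 10, (1024,)))

# Horovod: partition the training data among the MPI ranks.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Horovod: scale the learning rate by the number of workers and wrap the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Horovod: broadcast the initial model and optimizer state from rank 0.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(5):
    sampler.set_epoch(epoch)  # reshuffle the data partitioning each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:
        print(f"epoch {epoch}: last batch loss {loss.item():.4f}")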
Slurm submission scripts with GPUs for MPCDF HPC systems
Due to the hierarchical module structure at MPCDF, the horovod module will only be available after an MPI module has been loaded. Since May 15, 2021, the modules providing the GPU versions of horovod have been renamed to reflect the specific CUDA version employed to build the horovod libraries and the ML backend.
After loading a tensorflow/pytorch module for GPU (see above), do e.g.:
module load gcc/11
module load impi/2021.03
#TF
module load horovod-tensorflow-2.6.0/gpu-cuda-11.4/0.22.0
#Pytorch
module load horovod-pytorch-1.9.0/gpu-cuda-11.4/0.22.0
#mxnet
module load horovod-mxnet-1.9.0/gpu-cuda-11.4/0.22.0
It is important to keep in mind that the parallelization is done over the training data set via MPI and that each MPI task runs its own copy of the network, i.e., its own complete TensorFlow process. If GPUs are used, avoid having more than one TensorFlow process per GPU: by default each TensorFlow process reserves all the memory of a GPU for itself, so additional TensorFlow processes on the same GPU typically fail with memory errors. Since with Horovod each TensorFlow process is mapped to an MPI task, the number of MPI tasks per node should never exceed the number of GPUs per node. Furthermore, it must be guaranteed that each MPI process is bound to exactly one GPU. The following is an example submission script for Cobra:
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./tf_hvd_gpu.out.%j
#SBATCH -e ./tf_hvd_gpu.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J TF_HVD_GPUs
# Queue:
#SBATCH --partition=gpu # If using both GPUs of a node
# Node feature:
#SBATCH --constraint="gpu"
#SBATCH --gres=gpu:2 # If using both GPUs of a node
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2 # If using both GPUs of a node
#SBATCH --mail-type=all
#SBATCH --mail-user=%u@mpcdf.mpg.de
#
# wall clock limit:
#SBATCH --time=00:30:00
#
module purge
module load gcc/11
module load impi/2021.03
module load anaconda/3/2021.11
module load cuda/11.4
#TF
module load tensorflow/gpu-cuda-11.4/2.6.0
module load horovod-tensorflow-2.6.0/gpu-cuda-11.4/0.22.0
#Pytorch
module load pytorch/gpu-cuda-11.4/1.9.0
module load horovod-pytorch-1.9.0/gpu-cuda-11.4/0.22.0
#mxnet
module load mxnet/gpu-cuda-11.4/1.9.0
module load horovod-mxnet-1.9.0/gpu-cuda-11.4/0.22.0
# and if needed (only for TF)
module load keras
# Run the program:
srun python your_gpu_tensorflow/pytorch_with_horovod_application.py
echo "job finsihed"
TIP: Managing the GPU RAM
As mentioned above, by default TensorFlow grabs all the memory of all visible GPUs the first time you run a graph, so you will not be able to start a second TensorFlow process while the first one is still running. One solution is to run each process on a different GPU card. To do this, the simplest option is to set the CUDA_VISIBLE_DEVICES environment variable so that each process only sees the appropriate GPU cards. For example, after loading a CUDA module, setting in your Slurm job submission script:
export CUDA_VISIBLE_DEVICES=1
causes device 0 to be invisible and only device 1 to be visible.
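Alternatively, recent TensorFlow versions can be instructed from within the application not to reserve all GPU memory at start-up but to grow the allocation on demand. A minimal sketch for TensorFlow 2.x (the TF1-style equivalent, config.gpu_options.allow_growth, is shown in the Horovod example above):
# Minimal sketch (TensorFlow 2.x): allocate GPU memory on demand instead of
# reserving all of it at start-up; must run before any GPU is initialized.
import tensorflow as tf

for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)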
Slurm submission scripts with CPUs-only for MPCDF HPC systems
Due to the hierarchical module structure at MPCDF, the horovod module will only be available after the corresponding tensorflow/cpu module and an MPI module have been loaded.
After loading a tensorflow module for CPUs-only (see above), do:
module load gcc/11
module load impi/2021.03
module load horovod-tensorflow-2.6.0/cpu/0.22.0
It is important to keep in mind that the parallelization is done over the training data set via MPI and that each MPI task will run its own copy of the network, i.e., its own complete TensorFlow process. If only CPUs are used, you can run your distributed Tensorflow-based application using the following Slurm submission script (e.g. on COBRA):
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./tf_hvd_cpu.out.%j
#SBATCH -e ./tf_hvd_cpu.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J TF_HVD_CPU
# Queue (Partition):
#SBATCH --partition=express
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=4
# Enable Hyperthreading:
#SBATCH --ntasks-per-core=2
# for OpenMP:
#SBATCH --cpus-per-task=20
#
#SBATCH --mail-type=all
#SBATCH --mail-user=%u@mpcdf.mpg.de
#
# Wall clock limit:
#SBATCH --time=00:30:00
module purge
module load gcc/11
module load impi/2021.03
module load anaconda/3/2021.11
module load tensorflow/cpu/2.6.0
#TF
module load horovod-tensorflow-2.6.0/cpu/0.22.0
#Pytorch
module load pytorch/cpu/1.9.0
module load horovod-pytorch-1.9.0/cpu/0.19.4
# and if needed
module load keras
export OMP_NUM_THREADS=20
# For pinning threads correctly:
export OMP_PLACES=threads
export SLURM_HINT=multithread
# Run the program:
srun python your_cpu_tensorflow_with_horovod_application.py
echo "job finished"
How to install additional Python packages
Since the Python landscape of machine learning packages is changing rapidly, it is likely that you will not find all your required packages in our module system. We recommend setting up a virtual environment after loading the anaconda module to install your own packages. We also recommend loading as many packages as possible from our module system, since those were specifically built for our HPC systems.
Setting up a venv
First, load your preferred Python interpreter via the anaconda module, for example:
module load anaconda/3/2021.11
Then load your required packages, if they are available on our module system.
You can use the find-module command to check for available packages, as described above.
See our dedicated section for more information about how the module system works.
Usually, you would load some basic machine learning packages, such as PyTorch with CUDA support, for example:
module load pytorch/gpu-cuda-11.6/2.0.0
Now you create your virtual environment via:
python -m venv --system-site-packages <path/to/my_venv>
This command will create a directory at the given path where your newly created virtual environment lives.
The --system-site-packages flag gives the virtual environment access to the packages already loaded in the previous steps.
Activate your venv
To activate your newly created virtual environment, execute:
source <path/to/my_venv>/bin/activate
Install packages
Then you can simply install your required packages via pip, for example:
pip install mlflow
The above command will install the MLflow package (and all its dependencies that are not already satisfied) in the virtual environment site directory located at path/to/my_venv/lib/python3.9/site-packages/.
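To check that a package is indeed picked up from the virtual environment (and not from a system location), you can inspect its location from Python, e.g. for the MLflow package installed above:
import mlflow

# Should print a path below <path/to/my_venv>/lib/python3.9/site-packages/
print(mlflow.__file__)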
Use it with Slurm
Putting it all together, you can use your virtual environment in your Slurm scripts in the following way:
#!/bin/bash -l
#SBATCH ...
#SBATCH ...
module purge
module load anaconda/3/2021.11
module load pytorch/gpu-cuda-11.6/2.0.0
source <path/to/my_venv>/bin/activate
...
How to use Apache Spark on HPC systems
Everything you need to know to get up and running with Spark on the HPC systems of MPCDF.
Introduction to Spark
Apache Spark is an open-source cluster-computing framework that supports big data and machine learning applications.
To allow projects to explore the use of Spark, the MPCDF now provides the ability to run a Spark application on the HPC systems Draco, Cobra and Raven via the Slurm batch system. This page provides an introduction to running Spark applications on these systems.
But first here’s a bit of background information about how this works.
As a user you will create a standard Slurm batch script which will be submitted to Slurm and managed as any normal batch job. However, within the Slurm job a stand-alone Spark cluster is created with a master and several workers (which are distributed across the nodes reserved for the Slurm job).
A Spark module has been created which provides the Spark software and some helper scripts to create a Spark cluster. This makes it easier for you as a user to create a Spark cluster and submit an application to it.
Running Spark on HPC systems DRACO, COBRA and RAVEN
The first step is to load the spark module and create a shared secret for your future cluster(s). This step needs to be done only once, on a login node of the respective HPC system, and ensures that only you can run applications on your Spark cluster.
$ module load jdk
$ module load spark
$ spark-create-secure-setup
The spark-create-secure-setup script creates a Spark config directory (~/.spark-config/) and generates the configuration and key files needed to ensure that your clusters are only accessible by you.
Creating the Slurm script
The Slurm script is used to
Request the resources required by the Spark Cluster/Application
Configure the Spark software and start the cluster
Submit the Spark application to the cluster
In a sense, the resources are allocated twice. First, the Slurm batch script allocates the resources on the Slurm cluster (this is done by the directives at the start of the Slurm script). Second, the Spark application requests resources once the spark-submit command is called to submit a Spark job to the Spark cluster (see below). It is important that there is no large mismatch between these resource allocations.
The following snippet shows the resource allocation for a 3-node batch job with 4 tasks per node and 5 cores per task. Memory and run time are also specified, and we have selected a partition, although this is not strictly needed.
#!/bin/bash
#SBATCH -N 3
#SBATCH -t 00:30:00
#SBATCH --mem 20000
#SBATCH --ntasks-per-node 4
#SBATCH --cpus-per-task 5
#SBATCH -p express
Each of the “tasks” will be used to run a Spark executor process (5 cores per Spark executor is a good rule of thumb number). To make best use of the cluster resources it is recommended that you use as many cores as possible on each node. However, some tuning may be needed to fit with memory requirements and the cores per task etc.
Software setup and Spark cluster start
The next part of the Slurm script sets up the spark software and starts the Spark Cluster.
module load jdk
module load anaconda
module load spark
spark-start
echo $MASTER
The spark-start helper script starts the Spark master and one Spark worker per Slurm task on the reserved nodes. Note that one node will run both workers and the master, which may be something you need to consider when allocating resources to your executors (in the next step).
Running the Spark application
The last part of the script is where the actual Spark application is submitted to the Spark cluster.
spark-submit --total-executor-cores 60 --executor-memory 5G \
/u/jkennedy/Spark/example-wordcount.py \
file:///u/jkennedy/Spark/Data/big-text.txt
spark-submit is a command-line tool to submit an application to a Spark cluster. In this example the application is a simple word count written in Python which reads a file from GPFS. The example is an adaptation of one of the standard Spark examples distributed with the Spark code itself (see: https://github.com/apache/spark/tree/master/examples/src/main/python).
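For reference, a minimal word-count application along these lines could look as follows (the application file name is a placeholder; the input file is passed as the first command-line argument, as in the spark-submit call above):
#!/usr/bin/env python
import sys
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The master URL and the resources are provided by spark-submit.
    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Read the input file passed as the first argument (e.g. file:///path/to/big-text.txt).
    lines = spark.sparkContext.textFile(sys.argv[1])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    # Print a small sample of the results and shut down the session.
    for word, count in counts.take(20):
        print(word, count)
    spark.stop()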
When submitting a Spark application there are a few tuning parameters which should be considered: the number of executors and the executor memory. You also need to match these to the resource allocation of the Slurm job. The number of Spark executors is defined by the Slurm --ntasks-per-node setting and the number of nodes (in this example 4 tasks per node on 3 nodes = 12 executors). Spark, by default, allocates one core per executor. However, in our example we have set 5 cores per task, so the total number of executor cores for the Spark application will be 12*5=60. It is important that the Spark application matches the Slurm resource reservation (or at least does not exceed it).
Similarly, the executor memory needs to match the Slurm reservation (and cannot exceed the memory available to the Spark cluster). In this example we allocate 5 GB per executor, giving a total of 60 GB, which is easily within the limits of the Slurm reservation (remember that some resources will be required by the Spark master).
Submitting the Spark application to Slurm
Now that we have the full Slurm script (and have considered the resource allocation) we can submit it to the Slurm batch system. This is achieved by the sbatch command.
$ sbatch spark-wordcount.cmd
And from this point on the Spark job will behave as any other Slurm batch job.
Tips and Known Issues
Spark driver memory tuning: The Spark driver memory is used to store accumulator variables as well as any output of the collect() operation. By default, the driver is allocated 1 GB of memory. If the application demands it, it can make sense to increase this memory, particularly for long-running Spark jobs, which may accumulate data during their execution. The driver memory can be set using the --driver-memory Ng argument, where N is the number of gigabytes of memory to allocate for use by the driver. Here is an example:
$ spark-submit --master spark://${MASTER} --num-executors 10 \
    --executor-memory 10g --driver-memory 10g <your-spark-application.py>
List of supported software
On the HPC systems of MPCDF we support different software for machine learning and data analytics.
Due to the hierarchical module system at MPCDF, some modules will only be available if the appropriate requirements have been loaded first. For example, the module scikit-learn (which is a Python package) will only appear after the respective anaconda module has been loaded (e.g. anaconda/3/2021.11). The requirements for each of the modules listed below are given in parentheses after the module name. If you want to search for software, or if it is unclear to you which modules have to be loaded before a certain software package becomes available, please use the command find-module as follows:
KEYWORD="my-keyword-for-this-module"
find-module $KEYWORD
Please note that:
the officially supported Python environment has been anaconda/3/2021.11 since February 2022.
the CUDA versions for which GPU-accelerated software is currently built are cuda/11.2, cuda/11.4 and cuda/11.6 (last update July 2022). Software compiled with older CUDA versions might still be available on our clusters, but is no longer maintained or updated.
General software (in alphabetical order):
cudnn (cuda/11.6)
gpytorch - cpu (anaconda/3/2021.11)
gpytorch - gpu (anaconda/3/2021.11, cuda/11.6)
horovod-cpu (anaconda/3/2021.11, gcc/11, impi/2021.03, [or, as an alternative, openmpi if available on the system])
horovod-gpu (anaconda/3/2021.11, cuda/11.4, gcc/11, impi/2021.03, [or, as an alternative, openmpi if available on the system])
hyperopt (anaconda/3/2021.11)
keras (anaconda/3/2021.11)
keras-applications (anaconda/3/2021.11)
keras-preprocessing (anaconda/3/2021.11)
keras-tuner (anaconda/3/2021.11)
mkl-dnn
mxnet-cpu (anaconda/3/2021.11)
mxnet-gpu (anaconda/3/2021.11)
nccl (cuda/11.6)
NVIDIA DALI (anaconda/3/2021.11, cuda/11.6)
opencv-cpu (anaconda/3/2021.11)
opencv-gpu (anaconda/3/2021.11, cuda/11.6)
onnx (anaconda/3/2021.11)
protobuf
pytorch-cpu (anaconda/3/2021.11)
pytorch-gpu (anaconda/3/2021.11, cuda/11.6)
pytorch distributed - cpu (anaconda/3/2021.11, gcc/11, impi/2021.5)
pytorch distributed - gpu (anaconda/3/2021.11, cuda/11.6, gcc/11, impi/2021.5)
pytorch lightning - cpu (anaconda/3/2021.11)
pytorch lightning - gpu (anaconda/3/2021.11, cuda/11.6, [gcc/11, impi/2021.5])
scikit-learn (anaconda/3/2021.11)
spark
tensorboard (anaconda/3/2021.11)
tensorflow-cpu (anaconda/3/2021.11)
tensorflow-gpu (anaconda/3/2021.11, cuda/11.4)
tensorflow estimator (anaconda/3/2021.11)
tensorflow probability (anaconda/3/2021.11)
tensorrt (cuda/11.4)
Software for natural language processing (alphabetical order):
gensim (anaconda/3/2021.11)
Container solutions for Data-Analytics
charliecloud
singularity