AI software

This page collects information about software for data analytics and, in particular, machine learning that we support at the MPCDF.

Introduction

The current best practice for using AI and data analytics software on our systems is to utilize containers. Containers provide a consistent and reliable way to manage complex dependencies and ensure reproducibility across different environments. We recommend using Apptainer (formerly Singularity) for this purpose, as it is well-suited for our HPC infrastructure.

The software provided through environment modules is considered deprecated and should be used with caution. For more details on the recent changes in our Python infrastructure, please refer to the Bits & Bytes article.

Please note that users are expected to install any required software themselves. The following sections provide guidance and resources to help you get started with containers, installing Python packages locally, and best practices for setting up your environment.

Important

Always verify the performance of your software setup, as it can vary depending on the installation method. You can monitor performance using our monitoring system.

Containers

AI frameworks often come with complex dependencies that can vary across systems. To manage this effectively on MPCDF systems, containers are the recommended solution. Containers allow you to encapsulate your software environment into a single, portable image. This greatly simplifies reproducibility and collaboration.

Using containers on MPCDF Systems

On MPCDF systems, the recommended way to work with containers is through Apptainer (formerly Singularity). It integrates well with batch systems, supports GPU usage, and does not require root access, making it ideal for HPC environments.

Apptainer is available on our HPC systems via the module system. To see available versions, use the find-module command. For example:

$ find-module apptainer

apptainer/1.3.2
apptainer/1.3.6
apptainer/1.4.1
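A typical session then consists of loading the module, pulling an image once, and running your workload inside it. The version, image tag, and script name below are examples only; the `--nv` flag enables NVIDIA GPU access (use `--rocm` on AMD GPU systems):

```
$ module load apptainer/1.4.1
$ apptainer pull pytorch.sif docker://nvcr.io/nvidia/pytorch:24.05-py3
$ apptainer exec --nv pytorch.sif python train.py
```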

For more details on the general usage of Apptainer, please refer to our dedicated Apptainer documentation.

Dedicated Apptainer examples for AI frameworks

To help you get started with containers, we provide a curated AI Containers Repository on GitLab, featuring examples tailored for common AI frameworks like PyTorch and TensorFlow.

The repository includes:

  • Python scripts for typical AI workflows (e.g., training)

  • Apptainer definition files

  • Slurm job submission scripts for running containers on HPC systems

  • Best practices and tips for using containers
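As an illustration of what such a definition file looks like, here is a minimal, hypothetical example that builds on an NGC PyTorch image and adds an extra package (the repository contains complete, tested examples):

```
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:24.05-py3

%post
    # Install additional Python packages into the image
    pip install scikit-learn

%runscript
    exec python "$@"
```

Such a file is built into an image with `apptainer build my_image.sif my_image.def`.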

Use an Apptainer image as a Jupyter kernel in RVS

To add a Jupyter kernel in RVS that runs inside an Apptainer image, follow the instructions outlined in the AI Containers Repository.

Hardware compatibility

To ensure optimal performance, it’s crucial to match your containers with the appropriate hardware, especially when using GPUs.

  • NVIDIA GPUs (e.g., on the Raven system):

    • Use containers built with CUDA and install AI frameworks compiled with CUDA support.

    • Browse available images here: NVIDIA NGC Catalog

  • AMD GPUs (e.g., on the Viper system):

    • Use containers built with ROCm, and ensure your AI frameworks are installed with ROCm support.

    • Browse available images here: AMD ROCm Docker Hub
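To verify that a container actually sees the GPUs, a quick check can be run inside it. The image name below is an example; use `--nv` on NVIDIA systems and `--rocm` on AMD systems:

```
$ apptainer exec --nv pytorch.sif python -c "import torch; print(torch.cuda.is_available())"
```

This should print `True` on a GPU node. Note that ROCm builds of PyTorch also report availability through `torch.cuda.is_available()`.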

How to install Python packages locally

For rapid experimentation, or if you want to leverage software already available on the HPC systems via environment modules, we recommend setting up a virtual environment to install any additional packages you may need.

Setting up a venv

First, load the Python interpreter via the Water Boa Python module:

module load python-waterboa/2024.06 

Then load modules for any additional packages you need, if they are available in our module system. See our dedicated section for more information about how the module system works.

Now create your virtual environment via:

python -m venv --system-site-packages <path/to/my_venv>

This command creates a directory at the given path where your software will be installed. The --system-site-packages flag gives the virtual environment access to the packages loaded in the previous steps.

Activate your venv

To activate your newly created virtual environment, execute:

source <path/to/my_venv>/bin/activate
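Putting the two steps together, a minimal session looks like this. The path `/tmp/my_venv` and the use of `python3` are examples only; on our systems, load the Python module first and choose a path of your own:

```shell
# Create the virtual environment; --system-site-packages exposes
# packages from the loaded Python module inside the venv
python3 -m venv --system-site-packages /tmp/my_venv

# Activate it for the current shell session
source /tmp/my_venv/bin/activate

# 'python' and 'pip' now resolve inside the virtual environment
command -v python
```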

Install packages

Then you can install your required packages via pip. For example, to install PyTorch:

pip install torch

Important

Take care of GPU support!

If you require a particular build for CUDA or ROCm, consult the documentation of the software you want to install. For example, to install PyTorch:

  • With NVIDIA GPUs support, install a wheel built with CUDA 12.6:

    pip install torch --index-url https://download.pytorch.org/whl/cu126
    
  • With AMD GPUs support, install a wheel built with ROCm 6.3:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
    

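To check which accelerator build of PyTorch ended up in your environment, you can inspect PyTorch's version attributes (the printed values depend on the wheel you installed):

```
$ python -c "import torch; print(torch.version.cuda)"   # version string for a CUDA wheel, None otherwise
$ python -c "import torch; print(torch.version.hip)"    # version string for a ROCm wheel, None otherwise
```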
Important

Conda environments have very limited support (more details in our documentation and this Bits and Bytes article).

Use a virtual environment as a Jupyter kernel in RVS

You can add a Jupyter kernel in RVS that runs inside a virtual environment.

First install the ipykernel package inside the virtual environment:

pip install ipykernel

Then install the kernel locally:

python -m ipykernel install --user --name=my-env-name --display-name "Python (my-env-name)"
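You can confirm that the kernel was registered by listing the installed kernelspecs (`my-env-name` is the example name used above):

```
$ jupyter kernelspec list
```

With `--user`, the new kernel is installed under `~/.local/share/jupyter/kernels/my-env-name`.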

It will then automatically appear in the kernel list of a JupyterLab session in RVS.