AI software

This page collects information about software for data analytics and, in particular, machine learning that we support at the MPCDF.

Introduction

The current best practice for using AI and data analytics software on our systems is to utilize containers. Containers provide a consistent and reliable way to manage complex dependencies and ensure reproducibility across different environments. We recommend using Apptainer (formerly Singularity) for this purpose, as it is well-suited for our HPC infrastructure.

The software provided through environment modules is considered deprecated and should be used with caution. For more details on the recent changes in our Python infrastructure, please refer to the Bits & Bytes article.

Please note that users are expected to install any required software themselves. The following sections provide guidance and resources to help you get started with containers, installing Python packages locally, and best practices for setting up your environment.

Important

Always verify the performance of your software setup, as it can vary depending on the installation method. You can monitor performance using our monitoring system.

Containers

AI frameworks often come with complex dependencies that can vary across systems. To manage this effectively on MPCDF systems, containers are the recommended solution. Containers allow you to encapsulate your software environment into a single, portable image. This greatly simplifies reproducibility and collaboration.

Using containers on MPCDF Systems

On MPCDF systems, the recommended way to work with containers is through Apptainer (formerly Singularity). It integrates well with batch systems, supports GPU usage, and does not require root access, making it ideal for HPC environments.

Apptainer is available on our HPC systems via the module system. To see available versions, use the find-module command. For example:

$ find-module apptainer

apptainer/1.3.2
apptainer/1.3.6
apptainer/1.4.1

For more details on the general usage of Apptainer, please refer to our dedicated Apptainer documentation.

Dedicated Apptainer examples for AI frameworks

To help you get started with containers, we provide a curated AI Containers Repository on GitLab, featuring examples tailored to common AI frameworks such as PyTorch and TensorFlow.

The repository includes:

  • Python scripts for typical AI workflows (e.g., training)

  • Apptainer definition files

  • Slurm job submission scripts for running containers on HPC systems

  • Best practices and tips for using containers
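
For illustration, a minimal Apptainer definition file for a CUDA-based PyTorch container could look like the following sketch. The NGC image tag and the extra package are placeholders; choose versions matching your target system:

```
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:24.05-py3

%post
    # install additional Python packages into the image
    pip install --no-cache-dir <your-extra-packages>

%runscript
    exec python "$@"
```

Such an image can be built with `apptainer build my_image.sif my_image.def` and run with `apptainer exec --nv my_image.sif python train.py` on NVIDIA systems (use `--rocm` instead of `--nv` on AMD systems).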

Use an Apptainer image as a Jupyter kernel in RVS

To add a Jupyter kernel in RVS that runs inside an Apptainer image, follow the instructions outlined in the AI Containers Repository.

Hardware compatibility

To ensure optimal performance, it’s crucial to match your containers with the appropriate hardware, especially when using GPUs.

  • NVIDIA GPUs (e.g., on the Raven system):

    • Use containers built with CUDA and install AI frameworks compiled with CUDA support.

    • Browse available images here: NVIDIA NGC Catalog

  • AMD GPUs (e.g., on the Viper system):

    • Use containers built with ROCm, and ensure your AI frameworks are installed with ROCm support.

    • Browse available images here: AMD ROCm Docker Hub

How to install Python packages locally

For rapid experimentation, or if you want to leverage software already available on the HPC systems via environment modules, we recommend setting up a virtual environment to install any additional packages you may need.

Setting up a venv

First, load the Python interpreter via the Water Boa Python module:

module load python-waterboa/2024.06 

Then load the modules for any packages you require, if they are available in our module system. See our dedicated section for more information about how the module system works.

Now, create your virtual environment via:

python -m venv --system-site-packages <path/to/my_venv>

This command creates a directory at the given path where your software will be installed. The --system-site-packages flag gives the virtual environment access to the packages provided by the modules loaded in the previous steps.

Activate your venv

To activate your newly created virtual environment, execute:

source <path/to/my_venv>/bin/activate

Install packages

Then you can install your required packages via pip. For example, to install PyTorch:

pip install torch

Important

Take care of GPU support!

If you require a particular build for CUDA or ROCm, consult the documentation of the software you want to install. For example, to install PyTorch:

  • For NVIDIA GPU support, install a wheel built with CUDA 12.6:

    pip install torch --index-url https://download.pytorch.org/whl/cu126
    
  • For AMD GPU support, install a wheel built with ROCm 6.3:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
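
After installation, it is worth verifying which build was actually picked up. A minimal check might look like the sketch below; the helper function is illustrative, and note that ROCm builds of PyTorch also report GPU availability through the torch.cuda interface:

```python
import importlib.util


def gpu_backend_status():
    """Report whether PyTorch is installed and whether it sees a GPU (illustrative helper)."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    # ROCm builds of PyTorch also report through the torch.cuda interface.
    return f"torch {torch.__version__}, GPU available: {torch.cuda.is_available()}"


print(gpu_backend_status())
```

Run this inside your activated virtual environment on a GPU node; if it reports no GPU, you most likely installed a CPU-only wheel.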
    

Important

Conda environments have very limited support (more details in our documentation and this Bits & Bytes article).

Use a virtual environment as a Jupyter kernel in RVS

You can add a Jupyter kernel to RVS that runs inside a virtual environment.

First install the ipykernel package inside the virtual environment:

pip install ipykernel

Then install the kernel locally:

python -m ipykernel install --user --name=my-env-name --display-name "Python (my-env-name)"

It will then automatically appear in the kernel list of a JupyterLab session in RVS.

LLM Inference Service

To support the interactive use of open-source language models, MPCDF provides an easy-to-use LLM Inference Service, available at https://llm.mpcdf.mpg.de.

There are over 100,000 open language models hosted on Hugging Face. They cover a wide range of parameter counts, from tiny models with a few million parameters to gigantic models with up to one trillion parameters. While smaller models can outperform larger ones on specific tasks, especially when fine-tuned, the general capabilities of the biggest open models rival those of closed models such as GPT or Gemini.

However, running even the “smaller” models efficiently already requires considerable compute resources, and the largest models require substantial AI hardware that is often out of reach for individual research groups. To enable researchers to run and evaluate open-source LLMs, MPCDF provides the necessary computational resources along with an easy-to-use inference service, available at https://llm.mpcdf.mpg.de.

The LLM Inference Service is a flexible web application that allows you to create endpoints exposing a model via a REST API. For this, we rely on popular inference frameworks such as vLLM and Ollama.

Important

When a user creates an endpoint, the service submits the corresponding Slurm job under that user’s own account. This means that the job is queued, scheduled, and accounted for like any other regular Slurm job.

The service then exposes the model through a routed REST API endpoint, so it can be accessed conveniently from a local machine or existing tools.

Via an intuitive UI, users can request the desired hardware and configure the framework. We provide sensible default configurations for the frameworks to help you get started quickly. At the same time, you remain free to tune the framework parameters to your needs and to run any model and modality supported by the respective framework, including your own fine‑tuned models, provided they are hosted on the Hugging Face Hub. Examples for starting and configuring the inference service, as well as for calling the API, are available on the Recipes page in the LLM Inference UI.
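
As an illustrative sketch of calling such an endpoint from Python, the snippet below builds an OpenAI-style chat-completion request, assuming an OpenAI-compatible API as exposed by vLLM. The endpoint URL, model name, and token are placeholders; use the values shown in the LLM Inference UI for your own endpoint:

```python
import json
import urllib.request


def build_chat_request(api_url, model, prompt, token):
    """Build an OpenAI-style chat-completion request (illustrative helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholders: substitute the endpoint URL, model name, and API token
# of your own endpoint.
req = build_chat_request(
    "https://llm.mpcdf.mpg.de/<endpoint>/v1/chat/completions",
    "<model-name>", "Hello!", "<api-token>",
)
# Sending the request requires a running endpoint:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library can be pointed at the same URL instead of using urllib directly.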

Currently, two of the most powerful GPU systems at MPCDF, DAIS and Viper-GPU, are connected to the service.

Important

Everyone with an active MPCDF account should be able to log in with their normal Kerberos credentials and run the service on the Viper-GPU system. Users with access to DAIS can also run the service on that system.

The LLM Inference Service is targeted at researchers who wish to interactively evaluate specific open models or conduct interactive user studies. For non-interactive workloads, such as extensive benchmarks or offline evaluations, we recommend using Slurm batch jobs to ensure efficient resource utilization. Example scripts for submitting such jobs are available in our LLMs-meet-MPCDF GitLab repository. Additionally, for users interested in testing “standard” open models, the Chat AI service by GWDG is an excellent alternative. It offers a user-friendly chat interface as well as access to an inference API.

Cautions and best‑practice notes for AI workloads on HPC systems

AI frameworks (e.g. PyTorch, TensorFlow, JAX) often automatically create a large number of OpenMP threads internally. On the HPC systems this can exhaust the available CPU resources, increase contention, and in the worst case cause node crashes or job termination. The following recommendations help to keep the nodes stable while preserving most of the performance:

Set OMP_WAIT_POLICY=PASSIVE

This tells the OpenMP runtime to put idle threads into a low‑power wait state instead of busy‑waiting. The impact on performance is typically negligible, but it can dramatically reduce the likelihood of node failures, especially on the Viper GPU nodes where aggressive thread spawning is a common cause of instability.

Important

When running AI workloads on the Viper GPU system we strongly recommend setting OMP_WAIT_POLICY to PASSIVE.

Limit the number of OpenMP threads

OMP_NUM_THREADS=1 is a safe starting point for many AI applications and guarantees that only a single thread per process is active. For workloads that benefit from multi‑threaded CPU kernels (e.g. data loading, BLAS operations) you may increase this value, but always benchmark the impact on your specific model and dataset.
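
As a minimal sketch, both settings can also be applied at the top of a Python script, before the AI framework is imported, since the OpenMP runtime reads these variables when it is first loaded:

```python
import os

# Configure OpenMP *before* importing the AI framework: the OpenMP
# runtime reads these variables when it is first loaded.
os.environ["OMP_WAIT_POLICY"] = "PASSIVE"  # idle threads sleep instead of busy-waiting
os.environ["OMP_NUM_THREADS"] = "1"        # safe starting point; benchmark before raising

# import torch  # the framework import goes after the environment is set
```

Alternatively, export the variables in your Slurm job script before launching Python so that they apply to all processes of the job.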

Important

Use our internal monitoring system (see the monitoring system page) to watch for an unusually high number of threads, excessive CPU usage, or memory pressure. If you observe such symptoms, try setting OMP_NUM_THREADS to a low number.