AI software
This page collects information about software for data analytics and especially machine learning that we support at the MPCDF.
Introduction
The current best practice for using AI and data analytics software on our systems is to utilize containers. Containers provide a consistent and reliable way to manage complex dependencies and ensure reproducibility across different environments. We recommend using Apptainer (formerly Singularity) for this purpose, as it is well-suited for our HPC infrastructure.
The software provided through environment modules is considered deprecated and should be used with caution. For more details on the recent changes in our Python infrastructure, please refer to the Bits & Bytes article.
Please note that users are expected to install any required software themselves. The following sections provide guidance and resources to help you get started with containers, installing Python packages locally, and best practices for setting up your environment.
Important
Always verify the performance of your software setup, as it can vary depending on the installation method. You can monitor performance using our monitoring system.
Containers
AI frameworks often come with complex dependencies that can vary across systems. To manage this effectively on MPCDF systems, containers are the recommended solution. Containers allow you to encapsulate your software environment into a single, portable image. This greatly simplifies reproducibility and collaboration.
Using containers on MPCDF Systems
On MPCDF systems, the recommended way to work with containers is through Apptainer (formerly Singularity). It integrates well with batch systems, supports GPU usage, and does not require root access, making it ideal for HPC environments.
Apptainer is available on our HPC systems via the module system. To see available versions, use the find-module command. For example:
$ find-module apptainer
apptainer/1.3.2
apptainer/1.3.6
apptainer/1.4.1
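A typical workflow then consists of loading a module version, pulling an image, and executing a command inside it. The following is a minimal sketch; the image name and script path are placeholders, not MPCDF-provided artifacts:

```shell
# load a specific Apptainer version from the module system
module load apptainer/1.4.1

# pull a Docker image and convert it to a local SIF file (placeholder image)
apptainer pull pytorch.sif docker://pytorch/pytorch:latest

# run a command inside the container; --nv makes NVIDIA GPUs visible
apptainer exec --nv pytorch.sif python my_training_script.py
```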
For more details on the general usage of Apptainer, please refer to our dedicated Apptainer documentation.
Dedicated Apptainer examples for AI frameworks
To help you get started with containers, we provide a curated AI Containers Repository on GitLab, featuring examples tailored for common AI frameworks like PyTorch and TensorFlow:
The repository includes:
Python scripts for typical AI workflows (e.g., training)
Apptainer definition files
Slurm job submission scripts for running containers on HPC systems
Best practices and tips for using containers
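As an illustration, a Slurm batch script that runs a training script inside an Apptainer container might look like the sketch below; the resource requests, image name, and script are placeholders (the repository contains tested versions for the actual systems):

```shell
#!/bin/bash -l
#SBATCH --job-name=ai-training
#SBATCH --nodes=1
#SBATCH --gres=gpu:1          # request one GPU (placeholder resource spec)
#SBATCH --time=02:00:00

module load apptainer/1.4.1

# --nv exposes the NVIDIA driver and devices inside the container
srun apptainer exec --nv my_image.sif python train.py
```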
Use an Apptainer image as a Jupyter kernel in RVS
To add a Jupyter kernel in RVS that runs inside an Apptainer image, follow the instructions outlined in the AI Containers Repository.
Hardware compatibility
To ensure optimal performance, it’s crucial to match your containers with the appropriate hardware, especially when using GPUs.
NVIDIA GPUs (e.g., on the Raven system):
Use containers built with CUDA and install AI frameworks compiled with CUDA support.
Browse available images here: NVIDIA NGC Catalog
AMD GPUs (e.g., on the Viper system):
Use containers built with ROCm, and ensure your AI frameworks are installed with ROCm support.
Browse available images here: AMD ROCm Docker Hub
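In practice, the hardware match shows up in both the image you pull and the Apptainer GPU flag you use. A sketch with illustrative image tags (check the catalogs for current versions):

```shell
# NVIDIA (e.g. Raven): CUDA-based image from the NGC catalog; --nv binds the NVIDIA stack
apptainer pull pytorch_cuda.sif docker://nvcr.io/nvidia/pytorch:24.08-py3
apptainer exec --nv pytorch_cuda.sif python train.py

# AMD (e.g. Viper): ROCm-based image from Docker Hub; --rocm binds the ROCm stack
apptainer pull pytorch_rocm.sif docker://rocm/pytorch:latest
apptainer exec --rocm pytorch_rocm.sif python train.py
```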
How to install Python packages locally
For rapid experimentation, or if you want to leverage software already available on the HPC systems via environment modules, we recommend setting up a virtual environment to install any additional packages you may need.
Setting up a venv
First, load the Python interpreter via the Water Boa Python module:
module load python-waterboa/2024.06
Then load your required packages, if they are available on our module system. See our dedicated section for more information about how the module system works.
Now, create your virtual environment via:
python -m venv --system-site-packages <path/to/my_venv>
This command will create a directory at the given path where your software will be installed.
The --system-site-packages flag gives the virtual environment access to the packages provided by the modules loaded in the previous steps.
Activate your venv
To activate your newly created virtual environment, execute:
source <path/to/my_venv>/bin/activate
Install packages
Then you can simply install your required packages via pip.
For example, to install PyTorch:
pip install torch
Important
Take care of GPU support!
If you require a particular build for CUDA or ROCm, consult the documentation of the software you want to install. For example, to install PyTorch:
With NVIDIA GPU support, install a wheel built with CUDA 12.6:
pip install torch --index-url https://download.pytorch.org/whl/cu126
With AMD GPU support, install a wheel built with ROCm 6.3:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
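After installation, it is worth verifying inside a GPU job that the installed wheel actually sees the accelerator. A small sketch (ROCm builds of PyTorch also report GPUs through the torch.cuda interface):

```python
import torch

# the version string of a CUDA wheel ends in e.g. "+cu126",
# that of a ROCm wheel in e.g. "+rocm6.3"
print(torch.__version__)

# True only if the wheel was built with GPU support and a GPU is visible
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```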
Important
Conda environments have very limited support (more details in our documentation and this Bits and Bytes article).
Use a virtual environment as a Jupyter kernel in RVS
You can add a Jupyter kernel in RVS that runs inside a virtual environment.
First install the ipykernel package inside the virtual environment:
pip install ipykernel
Then install the kernel locally:
python -m ipykernel install --user --name=my-env-name --display-name "Python (my-env-name)"
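To check that the kernel was registered, you can list the locally installed kernel specs:

```shell
# the new kernel should appear under the name given via --name
jupyter kernelspec list
```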
The kernel will then automatically appear in the kernel list of a Jupyter Lab session in RVS.
LLM Inference Service
To support the interactive use of open-source language models, MPCDF provides an easy-to-use LLM Inference Service, available at https://llm.mpcdf.mpg.de.
There are over 100,000 open language models hosted on Hugging Face. They cover a wide range of parameter sizes, from tiny models with a few million parameters to the largest models with up to one trillion parameters. While smaller models can outperform larger ones on specific tasks, especially when fine‑tuned, the general capabilities of the biggest open models rival those of closed models such as GPT or Gemini.
However, running even the “smaller” models efficiently already requires notable compute resources, and the largest models require substantial AI hardware that is often out of reach for individual research groups. To enable researchers to run and evaluate open-source LLMs, MPCDF provides the necessary computational resources along with an easy-to-use inference service, available at https://llm.mpcdf.mpg.de.
The LLM Inference Service is a flexible web application that allows you to create endpoints exposing a model via a REST API. For this, we rely on popular inference frameworks such as vLLM and Ollama.
Important
When a user creates an endpoint, the service submits the corresponding Slurm job under that user’s own account. This means that the job is queued, scheduled, and accounted for like any other regular Slurm job.
The service then exposes the model through a routed REST API endpoint, so it can be accessed conveniently from a local machine or existing tools.
Via an intuitive UI, users can request the desired hardware and configure the framework. We provide sensible default configurations for the frameworks to help you get started quickly. At the same time, you remain free to tune the framework parameters to your needs and to run any model and modality supported by the respective framework, including your own fine‑tuned models, provided they are hosted on the Hugging Face Hub. Examples for starting and configuring the inference service, as well as for calling the API, are available on the Recipes page in the LLM Inference UI.
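As a rough illustration of calling such an endpoint, the sketch below assumes the endpoint exposes the OpenAI-compatible REST API that frameworks like vLLM and Ollama provide; the base URL, token, and model name are placeholders to be taken from your endpoint's details in the LLM Inference UI (see the Recipes page for authoritative examples):

```python
from openai import OpenAI

# base_url, api_key, and model are placeholders -- copy the actual
# values from your endpoint's details page in the LLM Inference UI
client = OpenAI(
    base_url="https://llm.mpcdf.mpg.de/<your-endpoint>/v1",
    api_key="<your-endpoint-token>",
)

response = client.chat.completions.create(
    model="<model-name>",
    messages=[{"role": "user",
               "content": "Summarize the idea of containers in one sentence."}],
)
print(response.choices[0].message.content)
```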
Currently, two of the most powerful GPU systems at MPCDF, DAIS and Viper-GPU, are connected to the service.
Important
Everyone with an active MPCDF account should be able to log in with their normal Kerberos credentials and run the service on the Viper-GPU system. Users with access to DAIS can also run the service on that system.
The LLM Inference Service is targeted at researchers who wish to interactively evaluate specific open models or conduct interactive user studies. For non-interactive workloads, such as extensive benchmarks or offline evaluations, we recommend using Slurm batch jobs to ensure efficient resource utilization. Example scripts for submitting such jobs are available in our LLMs-meet-MPCDF GitLab repository. Additionally, for users interested in testing “standard” open models, the Chat AI service by GWDG is an excellent alternative. It offers a user-friendly chat interface as well as access to an inference API.
Cautions and best‑practice notes for AI workloads on HPC systems
AI frameworks (e.g. PyTorch, TensorFlow, JAX) often automatically create a large number of OpenMP threads internally. On the HPC systems this can exhaust the available CPU resources, increase contention, and in the worst case cause node crashes or job termination. The following recommendations help to keep the nodes stable while preserving most of the performance:
Set OMP_WAIT_POLICY=PASSIVE
This tells the OpenMP runtime to put idle threads into a low‑power wait state instead of busy‑waiting. The impact on performance is typically negligible, but it can dramatically reduce the likelihood of node failures, especially on the Viper GPU nodes where aggressive thread spawning is a common cause of instability.
Important
When running AI workloads on the Viper GPU system we strongly recommend setting OMP_WAIT_POLICY to PASSIVE.
Limit the number of OpenMP threads
OMP_NUM_THREADS=1 is a safe starting point for many AI applications and guarantees that only a single thread per process is active. For workloads that benefit from multi‑threaded CPU kernels (e.g. data loading, BLAS operations) you may increase this value, but always benchmark the impact on your specific model and dataset.
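In a batch script, both recommendations amount to two exports before launching the application; the following is a sketch with a placeholder launch line:

```shell
#!/bin/bash -l
# recommended OpenMP settings for AI workloads (especially on Viper GPU nodes)
export OMP_WAIT_POLICY=PASSIVE   # idle threads wait passively instead of spinning
export OMP_NUM_THREADS=1         # safe default; increase only after benchmarking

srun python train.py             # placeholder application launch
```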
Important
Use our internal monitoring system (see the monitoring system page) to watch for an unusually high number of threads, CPU usage, or memory pressure. If you observe such symptoms, try setting OMP_NUM_THREADS to a low number.