Biological Intelligence


Name of the cluster:

CAJAL

Institution:

Max Planck Institute for Biological Intelligence

Login nodes:

cajalg001.wb.mpcdf.mpg.de

cajalg002.wb.mpcdf.mpg.de

SHA256 ssh host key fingerprints are:

  • cajalg001: 3kIzXDB7ZDpMcrmFcxtxJ9c6qlHfPy42we6YsvD8x8o (ED25519)

  • cajalg002: /Y9+CyjtSf2MIeLWDmfrmarSWipp/RxQx3ddmABHe90 (ED25519)

Hardware Configuration:

2 login nodes cajalg[001-002]
  128 CPU cores in total
  Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
  64 cores per node
  hyper-threading disabled (1 thread per core)
  1 TB RAM per node
  2 x NVIDIA A40 GPUs per node
51 execution nodes cajalg[003-053] for parallel computing
  3264 CPU cores in total
  Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
  64 cores per node
  hyper-threading disabled (1 thread per core)
  1 TB RAM per node
  2 x NVIDIA A40 GPUs per node
4 execution nodes cajalg[201-204] for parallel computing
  256 CPU cores in total
  Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
  64 cores per node
  hyper-threading disabled (1 thread per core)
  1 TB RAM per node
  8 x NVIDIA A40 GPUs per node
Node interconnect
  based on 25 Gb/s (25000 Mb/s) Ethernet

Filesystems:

/u
  • shared GPFS-based home filesystem

  • user quotas (256GB of data; 500k files/directories) enforced

/cajal/scratch/users/$USERNAME
  • shared GPFS-based scratch filesystem

  • user quotas (100 TB of data; 1M files/directories) enforced

  • NO BACKUPS!

/cajal/nvmescratch/users/$USERNAME
  • shared GPFS-based scratch filesystem

  • user quotas (5 TB of data; 1M files/directories) enforced

  • NO BACKUPS!

/wholebrain
  • shared GPFS-based filesystem

  • will be converted to read-only at the end of 2022

  • NO BACKUPS!

Quotas can be checked with /usr/lpp/mmfs/bin/mmlsquota.
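For example, the current usage and limits on the GPFS filesystems can be listed like this (a sketch; the exact output format depends on the installed GPFS version):

```shell
# List the user's quota and current usage on the mounted GPFS filesystems.
# --block-size auto prints sizes in human-readable units.
/usr/lpp/mmfs/bin/mmlsquota --block-size auto
```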

Compilers and Libraries:

Hierarchical environment modules are used at MPCDF to provide software packages and to enable switching between different software versions. No modules are preloaded on CAJAL: users have to specify the needed modules with explicit versions at login and during the startup of a batch job. Not all software modules are displayed immediately by the module avail command; for some, a compiler and/or MPI module must be loaded first. You can search the full hierarchy of the installed software modules with the find-module command.
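As an illustration, a session to locate and load modules might look like the following (the module names and versions below are placeholders, not guaranteed to exist on CAJAL; check the actual versions with find-module):

```shell
# Search the full module hierarchy for a package, even if it is
# currently hidden behind a compiler/MPI module:
find-module fftw

# Load a compiler and an MPI module first (explicit versions are
# required on CAJAL; the versions here are placeholders):
module load gcc/12 openmpi/4

# Modules built for this compiler/MPI stack are now visible:
module avail
module load fftw-mpi/3.3
```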

Batch system based on Slurm:

The batch system on CAJAL is the Slurm Workload Manager. A brief introduction to the basic commands (srun, sbatch, squeue, scancel, sinfo, …) can be found on the Raven home page or in the Slurm handbook.

Current Slurm configuration on CAJAL:

  • three partitions: p.cajal (for parallel/exclusive jobs), p.share (default, for serial jobs), p.large (with 8 GPUs per node)

  • run time (wallclock): 2 days (default), 7 days (maximum)

  • maximum memory per node for jobs: 1024000 MB

  • default memory per node for jobs: 1024000 MB (p.cajal, p.large), 256000 MB (p.share)

  • p.cajal partition: nodes are allocated exclusively to users

  • p.share and p.large partitions: resources on the nodes may be shared between jobs

  • all partitions: the --gres parameter must be set explicitly for jobs to access GPU resources
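Putting this configuration together, a minimal batch script for a single-GPU job in the default p.share partition might look like the following sketch (job name, resource values, module version, and executable are placeholders):

```shell
#!/bin/bash -l
#SBATCH -J gpu_test             # job name (placeholder)
#SBATCH -p p.share              # default partition; resources may be shared
#SBATCH --time=02:00:00         # wallclock limit (default 2 days, max 7 days)
#SBATCH --gres=gpu:a40:1        # GPUs must be requested explicitly via --gres

module purge
module load cuda/11.6           # explicit version required; placeholder version

srun ./my_gpu_program           # placeholder executable
```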

Useful tips:

The default run time limit for jobs that don't specify one is 48 hours. Use the --time option of sbatch/srun to set a limit on the total run time of the job allocation, up to a maximum of 168 hours.

In the shared partition p.share the default memory per node is 256000 MB. To specify the real memory required per node, use the --mem option (see also --mem-per-cpu in case of a multithreaded job).
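For instance (the memory values below are illustrative):

```shell
# Serial job in p.share requesting 500 GB instead of the 256000 MB default:
sbatch -p p.share --mem=500000 job.sh

# Multithreaded job: request memory per allocated CPU instead:
sbatch -p p.share --cpus-per-task=8 --mem-per-cpu=16000 job.sh
```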

OpenMP codes require the environment variable OMP_NUM_THREADS to be set. Its value can be obtained from the Slurm environment variable $SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is specified in an sbatch script (an example can be found on the help information page).
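A minimal sketch of such an sbatch script (partition choice, core count, and executable are placeholders):

```shell
#!/bin/bash -l
#SBATCH -p p.share
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16      # cores reserved for the OpenMP threads
#SBATCH --time=01:00:00

# Hand the Slurm allocation over to the OpenMP runtime:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./my_openmp_program        # placeholder executable
```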

To use GPUs, add the --gres option to your Slurm scripts and choose how many GPUs and/or which GPU model to request, e.g. #SBATCH --gres=gpu:a40:1 or #SBATCH --gres=gpu:a40:2
Valid gres options are: gpu[[:type]:count]
where
type is the type of GPU (a40)
count is the number of GPUs (1<=N<=2 in the p.cajal partition and 1<=N<=8 in the p.large partition)
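For example, a job that needs a full 8-GPU node would go to the p.large partition (resource values and executable are placeholders):

```shell
#!/bin/bash -l
#SBATCH -p p.large              # only p.large nodes have 8 GPUs
#SBATCH --gres=gpu:a40:8        # 1<=N<=8 is valid in p.large
#SBATCH --cpus-per-task=64
#SBATCH --time=24:00:00

srun ./my_multi_gpu_program     # placeholder executable
```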

GPU cards are in default compute mode.

Support:

For support, please create a trouble ticket at the MPCDF helpdesk.