Performance tools

Intel APS

The Intel Application Performance Snapshot (APS) can give an overview of possible performance issues. We recommend using this tool first before looking into further details. You can invoke the analysis in your SLURM job script in the following way:

module load vtune
srun hpcmd_suspend aps --stat-level=4 -r aps_result -- ./my_application

This command will write the performance data into the subdirectory aps_result. After the job has finished, the report can be generated with aps-report -a aps_result. You will find two HTML files in the current directory, which you can view in any browser.
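
If several profiling jobs run from the same directory, it can help to give each job its own result directory. The following is a minimal sketch, assuming a bash job script; the helper name aps_result_dir and the fallback suffix "nojob" are illustrative and not part of APS.

```shell
# Hypothetical helper (not part of APS): derive a per-job result directory.
# SLURM sets SLURM_JOB_ID inside a job; "nojob" is an arbitrary fallback
# for runs outside of SLURM.
aps_result_dir() {
    echo "aps_result_${SLURM_JOB_ID:-nojob}"
}

# Usage inside the job script (same commands as above, only the directory changes):
#   module load vtune
#   srun hpcmd_suspend aps --stat-level=4 -r "$(aps_result_dir)" -- ./my_application
# After the job has finished:
#   aps-report -a aps_result_<jobid>
```

The report step then has to be pointed at the directory name that includes the job ID.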

Intel VTune

Intel VTune is a statistical profiling and performance analysis tool for Intel processors.

A command line interface (vtune) and a GUI (vtune-gui) are provided by the vtune environment module (module load vtune).

Below, we give an example of how you can record the profiling data for your application in your SLURM job script:

module load vtune
srun hpcmd_suspend vtune -collect hpc-performance -r vtune_hpc_performance -- ./my_application
srun hpcmd_suspend vtune -collect hotspots -trace-mpi -r vtune_hotspots -- ./my_application
srun hpcmd_suspend vtune -collect uarch-exploration -r vtune_uarch -- ./my_application

This will create three subdirectories vtune_*, which contain the corresponding profiling data. You can inspect the data with vtune-gui, in which you can open the respective directory of interest.
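
Before opening the GUI, a quick check that all expected result directories were written can save time. This is a sketch only; the helper name missing_results is hypothetical, and the directory names are the ones used in the example above.

```shell
# Hypothetical helper: print every expected result directory that does not exist.
missing_results() {
    for mr_dir in "$@"; do
        if [ ! -d "$mr_dir" ]; then
            echo "$mr_dir"
        fi
    done
}

# Usage at the end of the job script:
#   missing_results vtune_hpc_performance vtune_hotspots vtune_uarch
```

An empty output means that every listed directory is present; it says nothing about whether the collected data is complete.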

If you encounter an error message Failed to create data directory: Too many open files, then add a line ulimit -n 16384 after module load vtune.
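
A plain ulimit -n 16384 fails when the hard limit of the job is lower than the requested value. The sketch below raises only the soft limit and caps the request at the hard limit. The function name raise_nofile_limit is an illustrative choice, and the snippet assumes a shell such as bash in which ulimit supports the -n, -S and -H options.

```shell
# Raise the soft limit on open files to at least $1, but never above the
# hard limit. Sketch only; 16384 is the value suggested above.
raise_nofile_limit() {
    rnl_want=$1
    rnl_soft=$(ulimit -S -n)
    rnl_hard=$(ulimit -H -n)
    # Nothing to do if the soft limit is already unlimited.
    [ "$rnl_soft" = "unlimited" ] && return 0
    # Cap the request at the hard limit, which an unprivileged job cannot exceed.
    if [ "$rnl_hard" != "unlimited" ] && [ "$rnl_hard" -lt "$rnl_want" ]; then
        rnl_want=$rnl_hard
    fi
    # Only ever raise the soft limit, never lower it.
    if [ "$rnl_soft" -lt "$rnl_want" ]; then
        ulimit -S -n "$rnl_want"
    fi
}

# Usage in the job script, after "module load vtune":
#   raise_nofile_limit 16384
```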

For detailed documentation, please have a look at the Intel VTune documentation and the VTune cookbook.

Intel Advisor

Intel Advisor is a threading design and prototyping tool for software developers. Since version 2016, it also provides comprehensive SIMD-vectorization analysis capabilities. It allows you to:


  • Analyze, design, tune and check your threading and SIMD-vectorization design before implementation

  • Explore and test threading options without disrupting normal development

  • Predict thread errors & performance scaling on systems with more cores

It can be made available on Linux clusters and HPC systems by invoking module load advisor.

Intel Advisor is particularly useful for analyzing the roofline data for your application. This can be achieved by inserting the following lines into your SLURM job script:

module load advisor
srun hpcmd_suspend advisor --collect survey --project-dir advisor_roofline -- ./my_application
srun hpcmd_suspend advisor --collect tripcounts --project-dir advisor_roofline --flop --no-trip-counts -- ./my_application

To produce the roofline plot, we recommend using the whole node, e.g. by specifying --cpus-per-task=72 for serial jobs. The result can then be viewed either in the GUI with advisor-gui advisor_roofline/ or as a roofline plot in HTML format, generated with advisor --report=roofline --report-output=advisor_roofline/roofline.html --project-dir=advisor_roofline.
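
To catch a job that was accidentally submitted with too few cores, the job script can check the CPU count before the collection starts. This is a sketch under assumptions: the function name has_cpus_per_task is hypothetical, 72 is the core count from the example above, and an unset SLURM_CPUS_PER_TASK (which SLURM exports when --cpus-per-task is given) is treated as a single CPU.

```shell
# Return 0 if the job has at least $1 CPUs per task, non-zero otherwise.
# An unset SLURM_CPUS_PER_TASK is assumed to mean one CPU.
has_cpus_per_task() {
    [ "${SLURM_CPUS_PER_TASK:-1}" -ge "$1" ]
}

# Usage before the advisor commands:
#   has_cpus_per_task 72 || echo "warning: roofline run does not use the whole node" >&2
```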

For a more detailed overview, please have a look at the Intel Advisor documentation and the Advisor cookbook.

Intel Trace Analyzer and Collector (ITAC)

The Intel® Trace Analyzer and Collector (ITAC) is a tool for understanding MPI behaviour of applications. In the execution phase it collects data and produces trace files that can subsequently be analyzed with the Intel® Trace Analyzer performance analysis tool. ITAC can be used to analyze and visualize MPI communication behaviour of a given program and possibly detect inefficiencies and load imbalances.

To collect data do the following:

  1. module load itac

  2. Compile your code with an Intel compiler specifying “-g -tcollect”, e.g. mpiifort -g -tcollect *.f90

  3. Set the environment variable VT_FLUSH_PREFIX to some directory with plenty of space (/ptmp/USERID)

  4. Run your program as usual
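
Step 3 can be scripted so that the directory is created before the run starts. The following is a minimal sketch: the function name setup_itac_flush_dir and the subdirectory name itac_flush are arbitrary choices, and /ptmp/$USER follows the example path given above.

```shell
# Create a directory below a base path and point ITAC at it via VT_FLUSH_PREFIX.
# The default base /ptmp/$USER follows the example above; "itac_flush" is an
# arbitrary subdirectory name chosen for this sketch.
setup_itac_flush_dir() {
    itac_dir="${1:-/ptmp/$USER}/itac_flush"
    mkdir -p "$itac_dir" || return 1
    export VT_FLUSH_PREFIX="$itac_dir"
}

# Usage in the job script, before the program is started:
#   module load itac
#   setup_itac_flush_dir
#   srun ./my_application
```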

Running the instrumented program should write a trace file (suffix .stf). It can be analyzed with the Trace Analyzer as follows:

  1. module load itac

  2. traceanalyzer <prog.exe>.stf

ITAC also provides a mechanism to check correctness of MPI applications. The runtime checker will detect data type mismatches and deadlocks. Adapt your job script as follows to evaluate your MPI application:

  1. module load itac

  2. srun --export=ALL,LD_PRELOAD=$ITAC_HOME/intel64/slib/ <prog.exe>

These recipes represent only the easiest and most obvious access. There are also more refined methods either by instrumenting the code or by specifying filters for selective analysis. Detailed documentation can be found in the Intel® ITAC Documentation.


Likwid

The Likwid Performance Tools are a lightweight command line performance tool suite for node-level profiling.

Likwid is available as an environment module.


Scalasca

Scalable tool for performance analysis of MPI/OpenMP/hybrid programs

Scalasca (Scalable performance analysis of large-scale parallel applications) is an open-source project by the Jülich Supercomputing Centre (JSC) that focuses on analyzing OpenMP, MPI and hybrid OpenMP/MPI parallel applications. It helps to identify bottlenecks and load imbalances in application codes through features such as profiling and tracing of highly parallel programs and an automated trace analysis that localizes and quantifies communication and synchronization inefficiencies. The tool is designed to scale to tens of thousands of cores and beyond.

Scalasca can automatically instrument code at the subroutine level, or the user can instrument the code manually to investigate specific regions. The performance measurements are carried out at runtime. After the program has terminated, the results can be studied in an interactive graphical interface, which shows the selected event or performance metric together with the corresponding source code section and its variation across the used partition of the system.

Scalasca can be used on all compute clusters after loading the module with module load scalasca. Documentation on how to prepare the code and select the different modes of operation can be found on the Scalasca web page.

Simple Performance Library perflib

This section describes the usage of the performance library libooperf.a.


The perflib is an instrumentation library that provides instrumented programs with a summary output containing performance information for each instrumented region. If instrumented regions sit deep in the call tree inside a loop, the library adds some overhead. This overhead does not show up in the reported region times, but it does increase the total runtime of the program.

This library supports parallel (MPI and mixed-mode) applications written in Fortran. In a multithreaded program, it only accounts for the master thread.

There are the following libraries:

perf library for MPI programs.

Use perf instrumentation to call hpmtoolkit.

Provides dummies for the perf instrumentation. You can link against this library to avoid the perf overhead in production runs.

Basic Interface

The basic interface of the perflib consists of four Fortran-callable subroutines:

perfinit
must be called after MPI_Init to do the initialization.

perfon(name)
defines a starting point for performance measurement. Name is a character string that identifies this point in the output of perfout.

perfoff
defines an end point of performance measurement.

perfout(name)
must be called before MPI_Finalize and prints the results to standard output. Name identifies a call of perfon that is supposed to account for 100% of the runtime. The runtime percentage of all other perfon regions is given relative to this one. You can call perfout from any number of MPI tasks. If all your MPI tasks do the same work, it is enough to call perfout from task 0 only. When you call perfout from several MPI tasks concurrently, their output is interleaved on stdout. You can serialize the calls to perfout as shown in the example program below.

Advanced Interface

In addition to the four basic subroutines, there are two more for context management.

starts a new context, in which all regions called from this context are separated from the calls to the same regions from other contexts. For example, if one calls some regions from within the initialization and from the timeloop and one wants to discriminate these, one can define a context for the init phase and one for the timeloop.

ends a previously started context and switches back to the parent context

There are also two more functions, which help to get the results of the performance regions directly into the program.

perf_get(name, double *inctime, double *inc_MFlops)
gives back the inclusive time and inclusive MFlops from the performance region “name”. This is useful for loops, where one wants to get the results for different loop iterations.

resets the counters and the time of the performance region “name”. Only useful in conjunction with perf_get, otherwise some performance data is lost.

Example Program

Program tperf
  use mpi
  Implicit none

  Integer::     ii, ierr, pe, npes, mype
  Real(8)::     uu

  Integer::     omp_get_thread_num
  Integer, Parameter:: SZ=10000
  Real(8)::     a(SZ), b(SZ),pi(SZ)

  print *, 'Start of Program tperf'

  Call MPI_Init(ierr)
  if (ierr /= 0) Stop "MPI_Init failed"
  Call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)
  Call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)

!$omp parallel
  print "('This is MPI task',i5,' thread',i3)",&
        mype, omp_get_thread_num()
!$omp end parallel

  Call perfinit
  Call perfon ('tperf')
  Call perfon ('calc')
!$omp   parallel do
  Do ii=1, 100
     call calc (pi)
  End Do
!$omp   end parallel do
  Call perfoff

  Call perfon ('random')
  Call random_number(a)
  Call random_number(b)
  Call perfoff

  Call perfon ('calc2')
  Do ii=1, SZ
     uu = calc2 (a, b)
  End Do
  Call perfoff

  Call perfoff                  ! tperf
  ! Serialize the output: only one MPI task prints at a time
  do pe=0, npes-1
      Call MPI_Barrier(MPI_COMM_WORLD, ierr)
      If (mype == pe) Call perfout('tperf')
  End Do

  Call MPI_Finalize(ierr)
  if (ierr /= 0) Stop "MPI_Finalize failed"

  print *, 'End of Program tperf'

Contains

 Subroutine calc (p_pi)
  Integer::     ii
  Real(8)::     p_pi(:)

  Do ii=1, size(p_pi)
    p_pi(ii) = sin(real(ii))*sqrt(real(ii))
  End Do
 End Subroutine calc

 Real Function calc2 (a, b)
  Real(8)::     a(:), b(:), c

  c = Sum (a * b)
  calc2 = c
 End Function calc2
End Program tperf

Compiling and Linking

The perf library is available as a module:

module load perflib

To compile and link the program tperf.f use the following command line:

mpif90 -o tperf tperf.f -L$PERFLIB_HOME/lib -looperf -lpfm -lstdc++

Sometimes the C++ standard library (-lstdc++ in the link line) is already added by the compiler wrapper. When using OpenMP (-qsmp=omp), measurements are only made for thread 0. Be aware that the performance of thread 0 may differ from that of the other threads.
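
If the perflib module has not been loaded, $PERFLIB_HOME expands to an empty string and the link step fails with a less obvious error. The sketch below makes that case explicit. The helper name perflib_ldflags is hypothetical, and it assumes that PERFLIB_HOME is set by module load perflib, as the link line above suggests.

```shell
# Print the perflib link flags from the command line above, or fail with a
# message if PERFLIB_HOME is not set (assumed to be set by the perflib module).
perflib_ldflags() {
    if [ -z "${PERFLIB_HOME:-}" ]; then
        echo "PERFLIB_HOME is not set; run 'module load perflib' first" >&2
        return 1
    fi
    echo "-L$PERFLIB_HOME/lib -looperf -lpfm -lstdc++"
}

# Usage:
#   flags=$(perflib_ldflags) && mpif90 -o tperf tperf.f $flags
```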


By default the output from this program looks like this:

 Start of Program tperf
 Start of Program tperf
This is MPI task    0 thread  0
This is MPI task    1 thread  0

                             Inclusive                    Exclusive
Subroutine  #calls  Time(s)      %     MFlops    Time(s)      %     MFlops
tperf          1      1.878   100.0   142.763      0.000     0.0     0.000
calc           1      0.100     5.3   681.948      0.100     5.3   681.948
random         1      0.000     0.0   444.131      0.000     0.0   444.131
calc2          1      1.778    94.7   112.473      1.778    94.7   112.473

Size of data segment used by the program:          89.12 MB

                             Inclusive                    Exclusive
Subroutine  #calls  Time(s)      %     MFlops    Time(s)      %     MFlops
tperf          1      1.877   100.0   142.878      0.000     0.0     0.000
calc           1      0.100     5.3   682.227      0.100     5.3   682.227
random         1      0.000     0.0   465.735      0.000     0.0   465.735
calc2          1      1.777    94.7   112.565      1.777    94.7   112.565

Size of data segment used by the program:          89.12 MB
 End of Program tperf
 End of Program tperf


  • Column 1 is the name given in perfon.

  • Column 2 is the number of calls of perfon with this name.

  • Columns 3-5 are inclusive and the following 3 columns are exclusive.

  • Inclusive values measure all code between a call to perfon and its corresponding call of perfoff.

  • Exclusive values exclude the parts of the code that are measured separately with nested calls to perfon and perfoff.

  • The region calc contains no nested calls to perfon, so its inclusive and exclusive values are identical.
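
Given the column layout described above, the region with the largest exclusive time can be picked out of the text output with a short awk filter. This is a sketch, not part of perflib: the function name top_exclusive_region is hypothetical, and it assumes that region names contain no spaces, so that every data row has exactly eight whitespace-separated fields.

```shell
# Read perflib text output on stdin and print the region with the largest
# exclusive time (column 6), followed by that time.
# Data rows are recognised by having 8 fields and an integer call count.
top_exclusive_region() {
    awk 'NF == 8 && $2 ~ /^[0-9]+$/ {
             if (!found || $6 + 0 > max + 0) { found = 1; max = $6; name = $1 }
         }
         END { if (found) print name, max }'
}

# Usage:
#   srun ./tperf | top_exclusive_region
```

If the output of several MPI tasks is piped through the filter, it reports the largest exclusive time over all tasks.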

By setting the environment variable


the output is not written to stdout; instead, each MPI rank writes its results to a separate XML file named perf.<PID>.xml. These XML files can then be processed further with your own tools or with the provided Python script, which prints usage information when called with --help.
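
Since every MPI rank writes its own file, counting the perf.<PID>.xml files is a simple way to check that no rank is missing. The following is a minimal sketch; the function name count_perf_xml is hypothetical, and comparing against SLURM_NTASKS assumes that the job was started with one MPI rank per SLURM task.

```shell
# Count the perf.<PID>.xml files in a directory (default: current directory).
count_perf_xml() {
    cpx_n=0
    for cpx_f in "${1:-.}"/perf.*.xml; do
        # The unmatched glob pattern itself is skipped by the -f test.
        if [ -f "$cpx_f" ]; then
            cpx_n=$((cpx_n + 1))
        fi
    done
    echo "$cpx_n"
}

# Usage at the end of the job script:
#   [ "$(count_perf_xml)" -eq "$SLURM_NTASKS" ] || echo "some ranks wrote no XML file" >&2
```

Files left over from earlier runs in the same directory are counted as well, so the check is only meaningful in a clean output directory.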