Performance tools
Intel APS
The Intel Application Performance Snapshot (APS) can give an overview of possible performance issues. We recommend using this tool first before looking into further details. You can invoke the analysis in your SLURM job script in the following way:
module load vtune
srun hpcmd_suspend aps --stat-level=4 -r aps_result -- ./my_application
This command will write the performance data into the subdirectory aps_result.
After the job has finished, the report can be generated with aps-report -a aps_result.
You will find two HTML files in the current directory, which you can view in any browser.
Intel VTune
Intel VTune is a statistical profiling and performance analysis tool for Intel processors.
A command line interface (vtune) and a GUI (vtune-gui) are provided by the environment module (module load vtune).
Below, we give an example of how you can record the profiling data for your application in your SLURM job script:
module load vtune
srun hpcmd_suspend vtune -collect hpc-performance -r vtune_hpc_performance -- ./my_application
srun hpcmd_suspend vtune -collect hotspots -trace-mpi -r vtune_hotspots -- ./my_application
srun hpcmd_suspend vtune -collect uarch-exploration -r vtune_uarch -- ./my_application
This will create three subdirectories vtune_*, which contain the corresponding profiling data. You can inspect the data
with vtune-gui, in which you can open the respective directory of interest.
If you encounter an error message Failed to create data directory: Too many open files, then add a line ulimit -n 16384 after module load vtune.
For detailed documentation please have a look at the Intel VTune documentation and the VTune cookbook.
Intel Advisor
The Intel Advisor is a threading design and prototyping tool for software developers. Since version 2016 comprehensive SIMD-vectorization analysis capabilities have been added.
Features:
Analyze, design, tune and check your threading and SIMD-vectorization design before implementation
Explore and test threading options without disrupting normal development
Predict thread errors & performance scaling on systems with more cores
It can be made available on Linux clusters and HPC systems by invoking
module load advisor.
Intel Advisor is particularly useful for analyzing the roofline data for your application. This can be achieved by inserting the following lines into your SLURM job script:
module load advisor
srun hpcmd_suspend advisor --collect survey --project-dir advisor_roofline -- ./my_application
srun hpcmd_suspend advisor --collect tripcounts --project-dir advisor_roofline --flop --no-trip-counts -- ./my_application
For producing the roofline plot it is recommended to use a whole node, e.g. by specifying --cpus-per-task=72 for serial jobs. The result can then be viewed either by invoking the GUI with advisor-gui advisor_roofline/ or by producing a roofline plot in HTML format with advisor --report=roofline --report-output=advisor_roofline/roofline.html --project-dir=advisor_roofline.
For a more detailed overview and further information please have a look at the Intel Advisor documentation and the Advisor cookbook.
Intel Trace Collector and Analyzer (ITAC)
The Intel® Trace Analyzer and Collector (ITAC) is a tool for understanding MPI behaviour of applications. In the execution phase it collects data and produces trace files that can subsequently be analyzed with the Intel® Trace Analyzer performance analysis tool. ITAC can be used to analyze and visualize MPI communication behaviour of a given program and possibly detect inefficiencies and load imbalances.
To collect data do the following:
1. Load the module: module load itac
2. Compile your code with an Intel compiler, specifying "-g -tcollect", e.g. mpiifort -g -tcollect *.f90
3. Set the environment variable VT_FLUSH_PREFIX to some directory with plenty of space (e.g. /ptmp/USERID).
4. Run your program as usual.
A trace file (suffix .stf) should now have been written.
It can be analyzed with the trace analyzer tool as follows:
module load itac
traceanalyzer <prog.exe>.stf
ITAC also provides a mechanism to check correctness of MPI applications. The runtime checker will detect data type mismatches and deadlocks. Adapt your job script as follows to evaluate your MPI application:
module load itac
srun --export=ALL,LD_PRELOAD=$ITAC_HOME/intel64/slib/libVTmc.so <prog.exe>
These recipes cover only the simplest use cases. There are also more refined methods, either by instrumenting the code or by specifying filters for selective analysis. Detailed documentation can be found in the Intel® ITAC Documentation.
Likwid
The Likwid Performance Tools are a lightweight command line performance tool suite for node-level profiling.
Likwid is available as an environment module.
Scalasca
Scalable tool for performance analysis of MPI/OpenMP/hybrid programs
Scalasca (Scalable performance analysis of large-scale parallel applications) is an open-source project by the Jülich Supercomputing Centre (JSC) which focuses on analyzing OpenMP, MPI and hybrid OpenMP/MPI parallel applications. Scalasca can be used to identify bottlenecks and load imbalances in application codes by providing a number of helpful features, among them profiling and tracing of highly parallel programs, and automated trace analysis that localizes and quantifies communication and synchronization inefficiencies. The tool is designed to scale to tens of thousands of cores and beyond.
Scalasca can automatically instrument code at the subroutine level, or the user can instrument the code manually to investigate specific regions. The performance measurements are carried out at runtime. The results are studied after program termination with a user-friendly interactive graphical interface, which shows the selected event or performance metric together with the respective source-code section and its variation across the used partition of the system.
Scalasca can be used on all compute clusters by loading the environment module (module load scalasca).
Documentation on how to prepare the code and specify different modes of
operation can be found
on the Scalasca web page www.scalasca.org/.
Lightweight MPI profiling with mpitrace
MPItrace is an open-source library that enables lightweight
profiling of MPI programs by gathering statistics about the times spent in various MPI calls together
with corresponding message sizes. The tool can be used on the HPC clusters by simply loading the environment module (module load mpitrace) at runtime. The functionality is based on the LD_PRELOAD mechanism, so no instrumentation or recompilation of the program is required.
By default the tool produces three files, corresponding to the MPI ranks with the minimum, maximum, and
median of the total communication time. They are named according to the scheme
mpi_profile.<process id>.<MPI rank>. Various knobs, including the option to record all MPI ranks
or a user-defined subset thereof, can be set via environment variables, as documented in the file $MPITRACE_HOME/doc/env_variables.txt.
An example output file (here, rank 0 was the one with minimum communication time) looks like this:
Data for MPI rank 0 of 16:
Times from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------------
MPI Routine #calls avg. bytes time(sec)
-----------------------------------------------------------------------
MPI_Comm_rank 1 0.0 0.000
MPI_Comm_size 2 0.0 0.000
MPI_Barrier 8 0.0 0.000
MPI_Reduce 94 1257.7 0.056
MPI_Allreduce 12 16.0 0.000
MPI_Gather 40 5.2 0.006
MPI_Gatherv 56 200.0 0.004
MPI_Alltoall 48 15000000.0 0.240
-----------------------------------------------------------------------
MPI task 0 of 16 had the minimum communication time.
total communication time = 0.307 seconds.
total elapsed time = 29.487 seconds.
user cpu time = 458.794 seconds.
system time = 9.018 seconds.
max resident set size = 20027.258 MiB.
-----------------------------------------------------------------
Message size distributions:
MPI_Reduce #calls avg. bytes time(sec)
26 8.0 0.054
26 16.0 0.000
28 200.0 0.001
14 8000.0 0.002
MPI_Allreduce #calls avg. bytes time(sec)
12 16.0 0.000
MPI_Gather #calls avg. bytes time(sec)
28 4.0 0.006
12 8.0 0.000
MPI_Gatherv #calls avg. bytes time(sec)
56 200.0 0.004
MPI_Alltoall #calls avg. bytes time(sec)
48 15000000.0 0.240
-----------------------------------------------------------------
Summary for all tasks:
Rank 12 reported the largest memory utilization : 20035.61 MiB
Rank 9 reported the largest elapsed time : 29.54 sec
minimum communication time = 0.307 sec for task 0
median communication time = 3.294 sec for task 7
maximum communication time = 4.040 sec for task 13
MPI timing summary for all ranks:
taskid host cpu comm(s) elapsed(s) user(s) system(s) size(MiB) switches
0 vipc2001 0 0.31 29.49 458.79 9.02 20027.26 754
1 vipc2001 16 1.94 29.49 457.57 11.31 20033.55 843
2 vipc2001 32 3.01 29.49 455.71 12.72 20031.63 705
3 vipc2001 48 3.47 29.49 455.22 13.45 20029.03 748
4 vipc2001 64 1.59 29.49 457.42 10.64 20033.55 711
5 vipc2001 80 2.47 29.49 455.80 11.85 20030.67 693
6 vipc2001 96 3.16 29.49 454.81 13.03 20026.62 703
7 vipc2001 112 3.29 29.49 454.62 13.12 20030.29 699
8 vipc2002 0 2.83 29.49 455.53 12.42 20027.21 830
9 vipc2002 16 3.35 29.54 455.60 13.08 20025.33 694
10 vipc2002 32 3.42 29.49 455.02 13.16 20030.43 675
11 vipc2002 48 3.43 29.49 454.95 13.31 20024.61 726
12 vipc2002 64 3.92 29.49 454.54 13.96 20035.61 687
13 vipc2002 80 4.04 29.49 455.13 13.95 20027.00 677
14 vipc2002 96 3.54 29.49 455.92 13.22 20026.89 683
15 vipc2002 112 3.59 29.49 455.05 13.50 20035.12 666
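The per-rank timing summary above is plain fixed-width text, so it can easily be post-processed with a few lines of script. The following is a minimal sketch (not part of mpitrace) that extracts the communication time per rank from such a table and identifies the fastest and slowest ranks; it assumes the column layout shown in the example output.

```python
# Sketch: parse the mpitrace per-rank timing summary shown above and find
# the ranks with minimum and maximum communication time. Assumes the
# nine-column layout of the example output (taskid, host, cpu, comm(s), ...).

def parse_timing_summary(text):
    """Return a list of (taskid, comm_seconds) tuples from the summary table."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have nine fields and start with the integer task id.
        if len(fields) == 9 and fields[0].isdigit():
            rows.append((int(fields[0]), float(fields[3])))
    return rows

# Two data rows taken from the example output above.
sample = """\
taskid    host  cpu  comm(s)  elapsed(s)  user(s)  system(s)  size(MiB)  switches
     0  vipc2001   0     0.31       29.49   458.79       9.02   20027.26       754
    13  vipc2002  80     4.04       29.49   455.13      13.95   20027.00       677
"""

rows = parse_timing_summary(sample)
fastest = min(rows, key=lambda r: r[1])   # rank with minimum communication time
slowest = max(rows, key=lambda r: r[1])   # rank with maximum communication time
print(fastest, slowest)
```

Such a script is useful, for example, to spot load imbalance across many ranks when recording of all ranks has been enabled.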
Simple Performance Library perflib
This section describes the usage of the performance library libooperf.a.
Description
perflib is an instrumentation library which provides instrumented programs with a summary output containing performance information for each instrumented region. If regions are called frequently, e.g. deep in the call tree inside a loop, the library adds some overhead; this overhead is not visible in the reported times but does increase the total runtime of the program.
This library supports parallel (MPI and mixed mode) applications, written in Fortran. It only accounts for the master thread in a multithreaded program.
There are the following libraries:
libooperf.a: the perf library for MPI programs.
libperfhpm.a: uses the perf instrumentation to call hpmtoolkit.
libperfdummy.a: provides dummies for the perf instrumentation. You can link against this library to avoid the perf overhead in production runs.
Basic Interface
The basic interface of perflib consists of four Fortran-callable subroutines:
perfinit: must be called after MPI_Init to do the initialization.
perfon(name): defines a starting point for performance measurement. name is a character string which identifies this point in the output of perfout.
perfoff: defines an end point of performance measurement.
perfout(name): must be called before MPI_Finalize and prints the results to standard output. name identifies a call of perfon which is assumed to account for 100% of the runtime; the runtime percentages of all other regions are reported relative to it. You can call perfout from any number of MPI tasks. If all your MPI tasks do the same work, it is sufficient to call perfout from task 0 only. When you call perfout from several MPI tasks concurrently, the output on stdout is interleaved. You can serialize the calls to perfout as shown in the example program below.
Advanced Interface
In addition to the four basic subroutines, there are two more for context management.
perf_context_start(name): starts a new context, in which all regions called from this context are kept separate from calls to the same regions from other contexts. For example, if some regions are called both during initialization and in the time loop, and one wants to discriminate between these, one can define one context for the init phase and one for the time loop.
perf_context_end: ends a previously started context and switches back to the parent context.
There are also two more functions, which help to retrieve the results of the performance regions directly within the program.
perf_get(name, double *inctime, double *inc_MFlops): gives back the inclusive time and inclusive MFlops of the performance region name. This is useful for loops where one wants to obtain the results for different loop iterations.
perf_reset(name): resets the counters and the time of the performance region name. This is only useful in conjunction with perf_get, otherwise some performance data is lost.
Example Program
Program tperf
Use mpi
Implicit none
Integer:: ii, ierr, pe, npes, mype
Real(8):: uu
Integer:: omp_get_thread_num
Integer, Parameter:: SZ=10000
Real(8):: a(SZ), b(SZ),pi(SZ)
print *, 'Start of Program tperf'
Call MPI_Init(ierr)
if (ierr /= 0) Stop "MPI_Init failed"
Call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)
Call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)
!$omp parallel
print "('This is MPI task',i5,' thread',i3)",&
mype, omp_get_thread_num()
!$omp end parallel
Call perfinit
Call perfon ('tperf')
Call perfon ('calc')
!$omp parallel do
Do ii=1, 100
call calc (pi)
Enddo
!$omp end parallel do
Call perfoff
Call perfon ('random')
Call random_number(a)
Call random_number(b)
Call perfoff
Call perfon ('calc2')
Do ii=1, SZ
uu = calc2 (a, b)
Enddo
Call perfoff
Call perfoff ! tperf
do pe=0, npes-1
Call MPI_Barrier(MPI_COMM_WORLD, ierr)
If (mype == pe) Call perfout('tperf')
Enddo
Call MPI_Finalize(ierr)
if (ierr /= 0) Stop "MPI_Finalize failed"
print *, 'End of Program tperf'
CONTAINS
Subroutine calc (p_pi)
Integer:: ii
Real(8):: p_pi(:)
Do ii=1, size(p_pi)
p_pi(ii) = sin(real(ii))*sqrt(real(ii))
Enddo
End Subroutine calc
Real Function calc2 (a, b)
Real(8):: a(:), b(:), c
c = Sum (a * b)
calc2 = c
End Function calc2
End Program tperf
Compiling and Linking
The perflib library is available as a module:
module load perflib
To compile and link the program tperf.f use the following command line:
mpif90 -o tperf tperf.f -L$PERFLIB_HOME/lib -looperf -lpfm -lstdc++
Sometimes the C++ standard library (-lstdc++ in the link line) is already set inside the compiler wrapper. When using OpenMP (-qsmp=omp), measurements are only made for thread 0. Be aware that the performance of thread 0 might differ from that of the other threads.
Output
By default the output from this program looks like this:
Start of Program tperf
Start of Program tperf
This is MPI task 0 thread 0
This is MPI task 1 thread 0
Inclusive Exclusive
Subroutine #calls Time(s) % MFlops Time(s) % MFlops
--------------------------------------------------------------------------
tperf 1 1.878 100.0 142.763 0.000 0.0 0.000
calc 1 0.100 5.3 681.948 0.100 5.3 681.948
random 1 0.000 0.0 444.131 0.000 0.0 444.131
calc2 1 1.778 94.7 112.473 1.778 94.7 112.473
Size of data segment used by the program: 89.12 MB
Inclusive Exclusive
Subroutine #calls Time(s) % MFlops Time(s) % MFlops
--------------------------------------------------------------------------
tperf 1 1.877 100.0 142.878 0.000 0.0 0.000
calc 1 0.100 5.3 682.227 0.100 5.3 682.227
random 1 0.000 0.0 465.735 0.000 0.0 465.735
calc2 1 1.777 94.7 112.565 1.777 94.7 112.565
Size of data segment used by the program: 89.12 MB
End of Program tperf
End of Program tperf
Remarks:
Column 1 is the name given in perfon.
Column 2 is the number of calls of perfon with this name.
Columns 3-5 are inclusive and the following 3 columns are exclusive.
Inclusive values measure all code between a call to perfon and its corresponding call of perfoff.
Exclusive values exclude those parts of the code, which are measured separately with calls to perfon and perfoff.
The subroutine calc has no subcalls of perfon. That’s why inclusive and exclusive values are identical.
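The relation between inclusive and exclusive values can be checked directly with the numbers from the first report above. The following illustrative sketch (not part of perflib) encodes the rule that the exclusive time of a region is its inclusive time minus the inclusive times of the regions nested inside it:

```python
# Sketch: inclusive vs. exclusive times, using the values from the first
# report above. A region's exclusive time is its inclusive time minus the
# inclusive times of the perfon regions nested inside it.

inclusive = {"tperf": 1.878, "calc": 0.100, "random": 0.000, "calc2": 1.778}
children = {"tperf": ["calc", "random", "calc2"],
            "calc": [], "random": [], "calc2": []}

def exclusive(region):
    return inclusive[region] - sum(inclusive[c] for c in children[region])

print(round(exclusive("tperf"), 3))   # tperf spends almost no time outside its nested regions
print(round(exclusive("calc"), 3))    # leaf region: exclusive equals inclusive
```

This reproduces the report: tperf has an exclusive time of 0.000 s, while for calc, random and calc2 the inclusive and exclusive columns coincide.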
By setting the environment variable
export PERFLIB_OUTPUT_FORMAT=xml
the output is not written to stdout; instead, each MPI rank writes its results to a separate XML file named perf.<PID>.xml. These XML files can then be further processed with your own tools or with the Python script comp_perf.py. You can get help with
comp_perf.py --help