Performance tools
Intel APS
The Intel Application Performance Snapshot (APS) can give an overview of possible performance issues. We recommend using this tool first before looking into further details. You can invoke the analysis in your SLURM job script in the following way:
module load vtune
srun hpcmd_suspend aps --stat-level=4 -r aps_result -- ./my_application
This command will write the performance data into the subdirectory aps_result. After the job has finished, the report can be generated with aps-report -a aps_result. You will find two HTML files in the current directory, which you can view in any browser.
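For orientation, a complete batch script for the APS analysis could look as follows. This is a minimal sketch only: the job name, node and task counts, and the time limit are placeholders that must be adapted to your application and system.
#!/bin/bash -l
#SBATCH --job-name=aps_run          # placeholder
#SBATCH --nodes=1                   # adapt to your application
#SBATCH --ntasks-per-node=72        # adapt to the cores of the target node
#SBATCH --time=01:00:00             # adapt to the expected runtime

module load vtune

# collect an APS snapshot into the subdirectory aps_result
srun hpcmd_suspend aps --stat-level=4 -r aps_result -- ./my_application
After the job has finished, generate the report with aps-report -a aps_result as described above.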
Intel VTune
Intel VTune is a statistical profiling and performance analysis tool for Intel processors.
A command line interface (vtune) and a GUI (vtune-gui) are provided by the environment module vtune, which can be loaded with module load vtune.
Below, we give an example of how you can record the profiling data for your application in your SLURM job script:
module load vtune
srun hpcmd_suspend vtune -collect hpc-performance -r vtune_hpc_performance -- ./my_application
srun hpcmd_suspend vtune -collect hotspots -trace-mpi -r vtune_hotspots -- ./my_application
srun hpcmd_suspend vtune -collect uarch-exploration -r vtune_uarch -- ./my_application
This will create three subdirectories vtune_*, which contain the corresponding profiling data. You can inspect the data with vtune-gui, in which you can open the respective directory of interest.
If you encounter the error message Failed to create data directory: Too many open files, add a line ulimit -n 16384 after module load vtune.
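In a job script this workaround is placed as follows (a minimal sketch; the hotspots collection is just one of the examples from above):
module load vtune
ulimit -n 16384   # raise the limit on open files for the VTune collector
srun hpcmd_suspend vtune -collect hotspots -trace-mpi -r vtune_hotspots -- ./my_application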
For detailed documentation please have a look at the Intel VTune documentation and the VTune cookbook.
Intel Advisor
The Intel Advisor is a threading design and prototyping tool for software developers. Since version 2016 comprehensive SIMD-vectorization analysis capabilities have been added.
Features:
Analyze, design, tune and check your threading and SIMD-vectorization design before implementation
Explore and test threading options without disrupting normal development
Predict thread errors & performance scaling on systems with more cores
It can be made available on Linux clusters and HPC systems by invoking module load advisor.
Intel Advisor is particularly useful for analyzing the roofline data for your application. This can be achieved by inserting the following lines into your SLURM job script:
module load advisor
srun hpcmd_suspend advisor --collect survey --project-dir advisor_roofline -- ./my_application
srun hpcmd_suspend advisor --collect tripcounts --project-dir advisor_roofline --flop --no-trip-counts -- ./my_application
For producing the roofline plot it is recommended to use the whole node, e.g. by specifying --cpus-per-task=72 for serial jobs. The result can then be viewed either by invoking the GUI with advisor-gui advisor_roofline/ or by producing a roofline plot in HTML format with advisor --report=roofline --report-output=advisor_roofline/roofline.html --project-dir=advisor_roofline.
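For a serial application, the corresponding batch script could look like the following minimal sketch; it assumes a node with 72 cores as in the --cpus-per-task example above, and the job name and time limit are placeholders:
#!/bin/bash -l
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=72          # reserve the whole node for reliable roofline results
#SBATCH --time=01:00:00             # placeholder

module load advisor

# survey run, followed by trip-count/FLOP collection into the same project directory
srun hpcmd_suspend advisor --collect survey --project-dir advisor_roofline -- ./my_application
srun hpcmd_suspend advisor --collect tripcounts --project-dir advisor_roofline --flop --no-trip-counts -- ./my_application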
For a more detailed overview and information please have a look at the Intel Advisor documentation and the Advisor cookbook.
Intel Trace Collector and Analyzer (ITAC)
The Intel® Trace Analyzer and Collector (ITAC) is a tool for understanding MPI behaviour of applications. In the execution phase it collects data and produces trace files that can subsequently be analyzed with the Intel® Trace Analyzer performance analysis tool. ITAC can be used to analyze and visualize MPI communication behaviour of a given program and possibly detect inefficiencies and load imbalances.
To collect data do the following:
module load itac
Compile your code with an Intel compiler specifying “-g -tcollect”, e.g.
mpiifort -g -tcollect *.f90
Set the environment variable VT_FLUSH_PREFIX to some directory with plenty of space (e.g. /ptmp/USERID).
Run your program as usual.
By this, a trace file (suffix .stf) should have been written.
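Put together, the collection step in a batch job could look like the following minimal sketch; the executable name and the flush directory under /ptmp are placeholders:
module load itac

# let ITAC flush intermediate trace data to a directory with plenty of space
export VT_FLUSH_PREFIX=/ptmp/USERID

# run the binary compiled with -g -tcollect as usual;
# a trace file (<prog.exe>.stf) is written at the end of the run
srun ./prog.exe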
It can be analyzed with the trace analyzer tool as follows:
module load itac
traceanalyzer <prog.exe>.stf
ITAC also provides a mechanism to check correctness of MPI applications. The runtime checker will detect data type mismatches and deadlocks. Adapt your job script as follows to evaluate your MPI application:
module load itac
srun --export=ALL,LD_PRELOAD=$ITAC_HOME/intel64/slib/libVTmc.so <prog.exe>
These recipes represent only the simplest and most basic usage. There are also more refined methods, either by instrumenting the code or by specifying filters for selective analysis. Detailed documentation can be found in the Intel® ITAC Documentation.
Likwid
The Likwid Performance Tools are a lightweight command line performance tool suite for node-level profiling.
Likwid is available as an environment module.
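As an illustration, a hardware-counter measurement with likwid-perfctr might look like the following minimal sketch; it assumes that the module is simply named likwid and that the FLOPS_DP event group is available on the target CPU:
module load likwid

# measure double-precision floating-point counters on cores 0-7 while running the application
likwid-perfctr -C 0-7 -g FLOPS_DP ./my_application
Other tools of the suite, e.g. likwid-topology for inspecting the node layout, are used in a similar way.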
Scalasca
Scalable tool for performance analysis of MPI/OpenMP/hybrid programs
Scalasca (Scalable performance analysis of large-scale parallel applications) is an open-source project by the Jülich Supercomputing Centre (JSC) which focuses on analyzing OpenMP, MPI and hybrid OpenMP/MPI parallel applications. The Scalasca tool can be used to identify bottlenecks and load imbalances in application codes by providing a number of helpful features, among others: profiling and tracing of highly parallel programs; automated trace analysis that localizes and quantifies communication and synchronization inefficiencies. The tool is designed to scale up to tens of thousands of cores and even more.
Scalasca is able to automatically instrument code at the subroutine level, or the user can instrument the code to investigate specific regions. The performance measurements are carried out at runtime. The results are studied after program termination with a user-friendly interactive graphical interface, which shows the selected event or performance metric together with the respective source code section and its variation across the used partition of the system.
Scalasca can be used on all compute clusters by loading the module with module load scalasca.
Documentation on how to prepare the code and specify different modes of
operation can be found
on the Scalasca web page www.scalasca.org/.
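A typical workflow could look like the following minimal sketch; it assumes that the code is instrumented via the Score-P compiler wrapper shipped with the Scalasca installation, and module names, compiler wrappers and launcher arguments may differ on your system:
module load scalasca

# compile with automatic instrumentation through the Score-P wrapper
scorep mpif90 -o my_application my_application.f90

# run the instrumented binary under the Scalasca measurement driver
scalasca -analyze srun ./my_application

# after the job has finished, inspect the generated experiment directory (scorep_*)
scalasca -examine scorep_my_application_*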
Simple Performance Library perflib
This page describes the usage of the performance library libooperf.a.
Description
The perflib consists of an instrumentation library, which provides instrumented programs with a summary output containing performance information for each instrumented region in a program. If regions deep in the call tree are instrumented inside loops, the library adds some overhead to the runtime. This overhead is not included in the reported times, but it does increase the total runtime of the program.
This library supports parallel (MPI and mixed mode) applications, written in Fortran. It only accounts for the master thread in a multithreaded program.
There are the following libraries:
libooperf.a: perf library for MPI programs.
libperfhpm.a: uses the perf instrumentation to call hpmtoolkit.
libperfdummy.a: provides dummies for the perf instrumentation. You can link against this library to avoid the perf overhead in production runs.
Basic Interface
The basic interface of the perflib consists of 4 fortran callable subroutines:
perfinit: must be called after MPI_Init to do the initialization.
perfon(name): defines a starting point for performance measurement. name is a character string which identifies this point in the output of perfout.
perfoff: defines an end point of performance measurement.
perfout(name): must be called before MPI_Finalize and prints the results to standard output. name identifies a call of perfon which is supposed to have 100% of the runtime; the percentage of runtime of all other perfon regions is relative to this. You can call perfout from any number of MPI tasks. If your MPI tasks all do the same work, it is enough to call perfout from task 0 only. When you call perfout from several MPI tasks concurrently, the output in stdout is mixed. You can serialize the calls to perfout as shown in the following example program.
Advanced Interface
In addition to the four basic subroutines, there are two more for context management.
perf_context_start(name): starts a new context, in which all regions called from this context are kept separate from calls to the same regions from other contexts. For example, if one calls some regions both from within the initialization and from the time loop and wants to discriminate these, one can define one context for the init phase and one for the time loop.
perf_context_end: ends a previously started context and switches back to the parent context.
There are also two more functions, which help to get the results of the performance regions directly into the program.
perf_get(name, double *inctime, double *inc_MFlops): gives back the inclusive time and inclusive MFlops from the performance region "name". This is useful for loops where one wants to get the results for different loop iterations.
perf_reset(name): resets the counters and the time of the performance region "name". This is only useful in conjunction with perf_get, otherwise some performance data is lost.
Example Program
Program tperf
Use mpi
Implicit none
Integer:: ii, ierr, pe, npes, mype
Real(8):: uu
Integer:: omp_get_thread_num
Integer, Parameter:: SZ=10000
Real(8):: a(SZ), b(SZ),pi(SZ)
print *, 'Start of Program tperf'
Call MPI_Init(ierr)
if (ierr /= 0) Stop "MPI_Init failed"
Call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)
Call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)
!$omp parallel
print "('This is MPI task',i5,' thread',i3)",&
mype, omp_get_thread_num()
!$omp end parallel
Call perfinit
Call perfon ('tperf')
Call perfon ('calc')
!$omp parallel do
Do ii=1, 100
call calc (pi)
Enddo
!$omp end parallel do
Call perfoff
Call perfon ('random')
Call random_number(a)
Call random_number(b)
Call perfoff
Call perfon ('calc2')
Do ii=1, SZ
uu = calc2 (a, b)
Enddo
Call perfoff
Call perfoff ! tperf
do pe=0, npes-1
Call MPI_Barrier(MPI_COMM_WORLD, ierr)
If (mype == pe) Call perfout('tperf')
Enddo
Call MPI_Finalize(ierr)
if (ierr /= 0) Stop "MPI_Finalize failed"
print *, 'End of Program tperf'
CONTAINS
Subroutine calc (p_pi)
Integer:: ii
Real(8):: p_pi(:)
Do ii=1, size(p_pi)
p_pi(ii) = sin(real(ii))*sqrt(real(ii))
Enddo
End Subroutine calc
Real Function calc2 (a, b)
Real(8):: a(:), b(:), c
c = Sum (a * b)
calc2 = c
End Function calc2
End Program tperf
Compiling and Linking
The perf library is available as a module:
module load perflib
To compile and link the program tperf.f use the following command line:
mpif90 -o tperf tperf.f -L$PERFLIB_HOME/lib -looperf -lpfm -lstdc++
Sometimes the C++ standard library (-lstdc++ in the link line) is already set inside the compiler wrapper. When using OpenMP (-qsmp=omp), measurements are only made for thread 0. Be aware that the performance of thread 0 might differ from that of the other threads.
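For production runs, libperfdummy.a can be linked instead to avoid the instrumentation overhead. A minimal sketch, assuming the dummy library follows the usual linker naming convention (-lperfdummy) and resides in the same directory:
mpif90 -o tperf tperf.f -L$PERFLIB_HOME/lib -lperfdummy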
Output
By default the output from this program looks like this:
Start of Program tperf
Start of Program tperf
This is MPI task 0 thread 0
This is MPI task 1 thread 0
Inclusive Exclusive
Subroutine #calls Time(s) % MFlops Time(s) % MFlops
--------------------------------------------------------------------------
tperf 1 1.878 100.0 142.763 0.000 0.0 0.000
calc 1 0.100 5.3 681.948 0.100 5.3 681.948
random 1 0.000 0.0 444.131 0.000 0.0 444.131
calc2 1 1.778 94.7 112.473 1.778 94.7 112.473
Size of data segment used by the program: 89.12 MB
Inclusive Exclusive
Subroutine #calls Time(s) % MFlops Time(s) % MFlops
--------------------------------------------------------------------------
tperf 1 1.877 100.0 142.878 0.000 0.0 0.000
calc 1 0.100 5.3 682.227 0.100 5.3 682.227
random 1 0.000 0.0 465.735 0.000 0.0 465.735
calc2 1 1.777 94.7 112.565 1.777 94.7 112.565
Size of data segment used by the program: 89.12 MB
End of Program tperf
End of Program tperf
Remarks:
Column 1 is the name given in perfon.
Column 2 is the number of calls of perfon with this name.
Columns 3-5 are inclusive and the following 3 columns are exclusive.
Inclusive values measure all code between a call to perfon and its corresponding call of perfoff.
Exclusive values exclude those parts of the code, which are measured separately with calls to perfon and perfoff.
The subroutine calc has no subcalls of perfon. That’s why inclusive and exclusive values are identical.
By setting the environment variable
export PERFLIB_OUTPUT_FORMAT=xml
the output is not written to stdout; instead, each MPI rank writes its results to a separate XML file named perf.<PID>.xml. These XML files can then be further processed with your own tools or with the Python script comp_perf.py. You can get help with
comp_perf.py --help
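In a batch job, the variable is simply set before launching the program; a minimal sketch using the example program from above:
module load perflib
export PERFLIB_OUTPUT_FORMAT=xml   # each MPI rank writes perf.<PID>.xml instead of printing to stdout
srun ./tperf
The resulting perf.<PID>.xml files can then be post-processed with comp_perf.py as described above.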