Performance tools
Intel APS
The Intel Application Performance Snapshot (APS) can give an overview of possible performance issues. We recommend using this tool first before looking into further details. You can invoke the analysis in your SLURM job script in the following way:
module load vtune
srun hpcmd_suspend aps --stat-level=4 -r aps_result -- ./my_application
This command will write the performance data into the subdirectory aps_result.
After the job has finished, the report can be generated with aps-report -a aps_result.
You will find two HTML files in the current directory, which you can view in any browser.
Intel VTune
Intel VTune is a statistical profiling and performance analysis tool for Intel processors.
A command line interface (vtune) and a GUI (vtune-gui) are provided by the environment module (module load vtune).
Below, we give an example of how you can record the profiling data for your application in your SLURM job script:
module load vtune
srun hpcmd_suspend vtune -collect hpc-performance -r vtune_hpc_performance -- ./my_application
srun hpcmd_suspend vtune -collect hotspots -trace-mpi -r vtune_hotspots -- ./my_application
srun hpcmd_suspend vtune -collect uarch-exploration -r vtune_uarch -- ./my_application
This will create three subdirectories vtune_*, which contain the corresponding profiling data. You can inspect the data
with vtune-gui, in which you can open the respective directory of interest.
If you encounter an error message Failed to create data directory: Too many open files, then add a line ulimit -n 16384 after module load vtune.
For detailed documentation please have a look at the Intel VTune documentation and the VTune cookbook.
Intel Advisor
The Intel Advisor is a threading design and prototyping tool for software developers. Since version 2016 comprehensive SIMD-vectorization analysis capabilities have been added.
Features:
Analyze, design, tune and check your threading and SIMD-vectorization design before implementation
Explore and test threading options without disrupting normal development
Predict thread errors & performance scaling on systems with more cores
It can be made available on Linux clusters and HPC systems by invoking
module load advisor.
Intel Advisor is particularly useful for analyzing the roofline data for your application. This can be achieved by inserting the following lines into your SLURM job script:
module load advisor
srun hpcmd_suspend advisor --collect survey --project-dir advisor_roofline -- ./my_application
srun hpcmd_suspend advisor --collect tripcounts --project-dir advisor_roofline --flop --no-trip-counts -- ./my_application
For producing the roofline plot it is recommended to use a whole node, e.g. by specifying --cpus-per-task=72 for serial jobs. The result can then be viewed either by invoking the GUI with advisor-gui advisor_roofline/ or by producing a roofline plot in HTML format with advisor --report=roofline --report-output=advisor_roofline/roofline.html --project-dir=advisor_roofline.
For a more detailed overview and further information please have a look at the Intel Advisor documentation and the Advisor cookbook.
Intel Trace Collector and Analyzer (ITAC)
The Intel® Trace Analyzer and Collector (ITAC) is a tool for understanding MPI behaviour of applications. In the execution phase it collects data and produces trace files that can subsequently be analyzed with the Intel® Trace Analyzer performance analysis tool. ITAC can be used to analyze and visualize MPI communication behaviour of a given program and possibly detect inefficiencies and load imbalances.
To collect data do the following:
1. Load the module: module load itac
2. Compile your code with an Intel compiler, specifying "-g -tcollect", e.g. mpiifort -g -tcollect *.f90
3. Set the environment variable VT_FLUSH_PREFIX to some directory with plenty of space (e.g. /ptmp/USERID).
4. Run your program as usual.
A trace file (suffix .stf) should now have been written.
It can be analyzed with the trace analyzer tool as follows:
module load itac
traceanalyzer <prog.exe>.stf
ITAC also provides a mechanism to check correctness of MPI applications. The runtime checker will detect data type mismatches and deadlocks. Adapt your job script as follows to evaluate your MPI application:
module load itac
srun --export=ALL,LD_PRELOAD=$ITAC_HOME/intel64/slib/libVTmc.so <prog.exe>
These recipes cover only the simplest use cases. There are also more refined methods, either by instrumenting the code or by specifying filters for selective analysis. Detailed documentation can be found in the Intel® ITAC Documentation.
Likwid
The Likwid Performance Tools are a lightweight command line performance tool suite for node-level profiling.
Likwid is available as an environment module.
Scalasca
Scalable tool for performance analysis of MPI/OpenMP/hybrid programs
Scalasca (Scalable performance analysis of large-scale parallel applications) is an open-source project by the Jülich Supercomputing Centre (JSC) which focuses on analyzing OpenMP, MPI and hybrid OpenMP/MPI parallel applications. Scalasca can be used to identify bottlenecks and load imbalances in application codes by providing a number of helpful features, among them profiling and tracing of highly parallel programs, and automated trace analysis that localizes and quantifies communication and synchronization inefficiencies. The tool is designed to scale to tens of thousands of cores and beyond.
Scalasca can automatically instrument code at the subroutine level, or the user can instrument the code manually to investigate specific regions. The performance measurements are carried out at runtime. The results are studied after program termination with a user-friendly interactive graphical interface, which shows the selected event or performance metric together with the respective source-code section and its variation across the used partition of the system.
Scalasca can be used on all compute clusters by loading the environment module (module load scalasca).
Documentation on how to prepare the code and specify different modes of
operation can be found
on the Scalasca web page www.scalasca.org/.
Lightweight MPI profiling with mpitrace
MPItrace is an open-source library that enables lightweight
profiling of MPI programs by gathering statistics about the times spent in various MPI calls together
with corresponding message sizes. The tool can be used on the HPC clusters by simply loading the environment module (module load mpitrace) at runtime. The functionality is based on the LD_PRELOAD mechanism, so no instrumentation or recompilation of the program is required.
By default the tool produces three files, corresponding to the MPI ranks with the minimum, maximum, and
median of the total communication time. They are named according to the scheme
mpi_profile.<process id>.<MPI rank>. Various knobs, including the option to record all MPI ranks
or a user-defined subset thereof, can be set via environment variables, as documented in the file $MPITRACE_HOME/doc/env_variables.txt.
An example output file (here, rank 0 was the one with minimum communication time) looks like this:
Data for MPI rank 0 of 16:
Times from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------------
MPI Routine #calls avg. bytes time(sec)
-----------------------------------------------------------------------
MPI_Comm_rank 1 0.0 0.000
MPI_Comm_size 2 0.0 0.000
MPI_Barrier 8 0.0 0.000
MPI_Reduce 94 1257.7 0.056
MPI_Allreduce 12 16.0 0.000
MPI_Gather 40 5.2 0.006
MPI_Gatherv 56 200.0 0.004
MPI_Alltoall 48 15000000.0 0.240
-----------------------------------------------------------------------
MPI task 0 of 16 had the minimum communication time.
total communication time = 0.307 seconds.
total elapsed time = 29.487 seconds.
user cpu time = 458.794 seconds.
system time = 9.018 seconds.
max resident set size = 20027.258 MiB.
-----------------------------------------------------------------
Message size distributions:
MPI_Reduce #calls avg. bytes time(sec)
26 8.0 0.054
26 16.0 0.000
28 200.0 0.001
14 8000.0 0.002
MPI_Allreduce #calls avg. bytes time(sec)
12 16.0 0.000
MPI_Gather #calls avg. bytes time(sec)
28 4.0 0.006
12 8.0 0.000
MPI_Gatherv #calls avg. bytes time(sec)
56 200.0 0.004
MPI_Alltoall #calls avg. bytes time(sec)
48 15000000.0 0.240
-----------------------------------------------------------------
Summary for all tasks:
Rank 12 reported the largest memory utilization : 20035.61 MiB
Rank 9 reported the largest elapsed time : 29.54 sec
minimum communication time = 0.307 sec for task 0
median communication time = 3.294 sec for task 7
maximum communication time = 4.040 sec for task 13
MPI timing summary for all ranks:
taskid host cpu comm(s) elapsed(s) user(s) system(s) size(MiB) switches
0 vipc2001 0 0.31 29.49 458.79 9.02 20027.26 754
1 vipc2001 16 1.94 29.49 457.57 11.31 20033.55 843
2 vipc2001 32 3.01 29.49 455.71 12.72 20031.63 705
3 vipc2001 48 3.47 29.49 455.22 13.45 20029.03 748
4 vipc2001 64 1.59 29.49 457.42 10.64 20033.55 711
5 vipc2001 80 2.47 29.49 455.80 11.85 20030.67 693
6 vipc2001 96 3.16 29.49 454.81 13.03 20026.62 703
7 vipc2001 112 3.29 29.49 454.62 13.12 20030.29 699
8 vipc2002 0 2.83 29.49 455.53 12.42 20027.21 830
9 vipc2002 16 3.35 29.54 455.60 13.08 20025.33 694
10 vipc2002 32 3.42 29.49 455.02 13.16 20030.43 675
11 vipc2002 48 3.43 29.49 454.95 13.31 20024.61 726
12 vipc2002 64 3.92 29.49 454.54 13.96 20035.61 687
13 vipc2002 80 4.04 29.49 455.13 13.95 20027.00 677
14 vipc2002 96 3.54 29.49 455.92 13.22 20026.89 683
15 vipc2002 112 3.59 29.49 455.05 13.50 20035.12 666
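The per-rank timing summary above is plain fixed-width text, so it can easily be post-processed with a few lines of script. The following is a minimal sketch (not part of mpitrace) that extracts the communication time per rank from such a table and identifies the fastest and slowest ranks; it assumes the column layout shown in the example output.

```python
# Sketch: parse the mpitrace per-rank timing summary shown above and find
# the ranks with minimum and maximum communication time. Assumes the
# nine-column layout of the example output (taskid, host, cpu, comm(s), ...).

def parse_timing_summary(text):
    """Return a list of (taskid, comm_seconds) tuples from the summary table."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have nine fields and start with the integer task id.
        if len(fields) == 9 and fields[0].isdigit():
            rows.append((int(fields[0]), float(fields[3])))
    return rows

# Two data rows taken from the example output above.
sample = """\
taskid    host  cpu  comm(s)  elapsed(s)  user(s)  system(s)  size(MiB)  switches
     0  vipc2001   0     0.31       29.49   458.79       9.02   20027.26       754
    13  vipc2002  80     4.04       29.49   455.13      13.95   20027.00       677
"""

rows = parse_timing_summary(sample)
fastest = min(rows, key=lambda r: r[1])   # rank with minimum communication time
slowest = max(rows, key=lambda r: r[1])   # rank with maximum communication time
print(fastest, slowest)
```

Such a script is useful, for example, to spot load imbalance across many ranks when recording of all ranks has been enabled.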
Simple Performance Library perflib
This section describes the usage of the performance library libooperf.a.
Description
perflib is an instrumentation library which provides instrumented programs with a summary output containing performance information for each instrumented region. If regions are called frequently, e.g. deep in the call tree inside a loop, the library adds some overhead; this overhead is not visible in the reported times but does increase the total runtime of the program.
This library supports parallel (MPI and mixed mode) applications, written in Fortran. It only accounts for the master thread in a multithreaded program.
There are the following libraries:
libooperf.a: the perf library for MPI programs.
libperfhpm.a: uses the perf instrumentation to call hpmtoolkit.
libperfdummy.a: provides dummies for the perf instrumentation. You can link against this library to avoid the perf overhead in production runs.
Basic Interface
The basic interface of perflib consists of four Fortran-callable subroutines:
perfinit: must be called after MPI_Init to do the initialization.
perfon(name): defines a starting point for performance measurement. name is a character string which identifies this point in the output of perfout.
perfoff: defines an end point of performance measurement.
perfout(name): must be called before MPI_Finalize and prints the results to standard output. name identifies a call of perfon which is assumed to account for 100% of the runtime; the runtime percentages of all other regions are reported relative to it. You can call perfout from any number of MPI tasks. If all your MPI tasks do the same work, it is sufficient to call perfout from task 0 only. When you call perfout from several MPI tasks concurrently, the output on stdout is interleaved. You can serialize the calls to perfout as shown in the example program below.
Advanced Interface
In addition to the four basic subroutines, there are two more for context management.
perf_context_start(name): starts a new context, in which all regions called from this context are kept separate from calls to the same regions from other contexts. For example, if some regions are called both during initialization and in the time loop, and one wants to discriminate between these, one can define one context for the init phase and one for the time loop.
perf_context_end: ends a previously started context and switches back to the parent context.
There are also two more functions, which help to retrieve the results of the performance regions directly within the program.
perf_get(name, double *inctime, double *inc_MFlops): gives back the inclusive time and inclusive MFlops of the performance region name. This is useful for loops where one wants to obtain the results for different loop iterations.
perf_reset(name): resets the counters and the time of the performance region name. This is only useful in conjunction with perf_get, otherwise some performance data is lost.
Example Program
Program tperf
Use mpi
Implicit none
Integer:: ii, ierr, pe, npes, mype
Real(8):: uu
Integer:: omp_get_thread_num
Integer, Parameter:: SZ=10000
Real(8):: a(SZ), b(SZ),pi(SZ)
print *, 'Start of Program tperf'
Call MPI_Init(ierr)
if (ierr /= 0) Stop "MPI_Init failed"
Call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)
Call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)
!$omp parallel
print "('This is MPI task',i5,' thread',i3)",&
mype, omp_get_thread_num()
!$omp end parallel
Call perfinit
Call perfon ('tperf')
Call perfon ('calc')
!$omp parallel do
Do ii=1, 100
call calc (pi)
Enddo
!$omp end parallel do
Call perfoff
Call perfon ('random')
Call random_number(a)
Call random_number(b)
Call perfoff
Call perfon ('calc2')
Do ii=1, SZ
uu = calc2 (a, b)
Enddo
Call perfoff
Call perfoff ! tperf
do pe=0, npes-1
Call MPI_Barrier(MPI_COMM_WORLD, ierr)
If (mype == pe) Call perfout('tperf')
Enddo
Call MPI_Finalize(ierr)
if (ierr /= 0) Stop "MPI_Finalize failed"
print *, 'End of Program tperf'
CONTAINS
Subroutine calc (p_pi)
Integer:: ii
Real(8):: p_pi(:)
Do ii=1, size(p_pi)
p_pi(ii) = sin(real(ii))*sqrt(real(ii))
Enddo
End Subroutine calc
Real Function calc2 (a, b)
Real(8):: a(:), b(:), c
c = Sum (a * b)
calc2 = c
End Function calc2
End Program tperf
Compiling and Linking
The perflib library is available as a module:
module load perflib
To compile and link the program tperf.f use the following command line:
mpif90 -o tperf tperf.f -L$PERFLIB_HOME/lib -looperf -lpfm -lstdc++
Sometimes the C++ standard library (-lstdc++ in the link line) is already set inside the compiler wrapper. When using OpenMP (-qsmp=omp), measurements are only made for thread 0. Be aware that the performance of thread 0 might differ from that of the other threads.
Output
By default the output from this program looks like this:
Start of Program tperf
Start of Program tperf
This is MPI task 0 thread 0
This is MPI task 1 thread 0
Inclusive Exclusive
Subroutine #calls Time(s) % MFlops Time(s) % MFlops
--------------------------------------------------------------------------
tperf 1 1.878 100.0 142.763 0.000 0.0 0.000
calc 1 0.100 5.3 681.948 0.100 5.3 681.948
random 1 0.000 0.0 444.131 0.000 0.0 444.131
calc2 1 1.778 94.7 112.473 1.778 94.7 112.473
Size of data segment used by the program: 89.12 MB
Inclusive Exclusive
Subroutine #calls Time(s) % MFlops Time(s) % MFlops
--------------------------------------------------------------------------
tperf 1 1.877 100.0 142.878 0.000 0.0 0.000
calc 1 0.100 5.3 682.227 0.100 5.3 682.227
random 1 0.000 0.0 465.735 0.000 0.0 465.735
calc2 1 1.777 94.7 112.565 1.777 94.7 112.565
Size of data segment used by the program: 89.12 MB
End of Program tperf
End of Program tperf
Remarks:
Column 1 is the name given in perfon.
Column 2 is the number of calls of perfon with this name.
Columns 3-5 are inclusive and the following 3 columns are exclusive.
Inclusive values measure all code between a call to perfon and its corresponding call of perfoff.
Exclusive values exclude those parts of the code, which are measured separately with calls to perfon and perfoff.
The subroutine calc has no subcalls of perfon. That’s why inclusive and exclusive values are identical.
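The relation between inclusive and exclusive values can be checked directly with the numbers from the first report above. The following illustrative sketch (not part of perflib) encodes the rule that the exclusive time of a region is its inclusive time minus the inclusive times of the regions nested inside it:

```python
# Sketch: inclusive vs. exclusive times, using the values from the first
# report above. A region's exclusive time is its inclusive time minus the
# inclusive times of the perfon regions nested inside it.

inclusive = {"tperf": 1.878, "calc": 0.100, "random": 0.000, "calc2": 1.778}
children = {"tperf": ["calc", "random", "calc2"],
            "calc": [], "random": [], "calc2": []}

def exclusive(region):
    return inclusive[region] - sum(inclusive[c] for c in children[region])

print(round(exclusive("tperf"), 3))   # tperf spends almost no time outside its nested regions
print(round(exclusive("calc"), 3))    # leaf region: exclusive equals inclusive
```

This reproduces the report: tperf has an exclusive time of 0.000 s, while for calc, random and calc2 the inclusive and exclusive columns coincide.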
By setting the environment variable
export PERFLIB_OUTPUT_FORMAT=xml
the output is not written to stdout; instead, each MPI rank writes its results to a separate XML file named perf.<PID>.xml. These XML files can then be further processed with your own tools or with the Python script comp_perf.py. You can get help with
comp_perf.py --help