Studying program performance on the Itanium 2 with pfmon

Hoare's dictum: "Premature optimization is the root of all evil."

"Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest."
    - Rob Pike

The pfmon program uses the Itanium 2 processor performance counters (the Performance Monitoring Unit or PMU) to collect information about where a program spends its time. The Itanium 2 processor can count up to four different events at a time, where events can be things like the number of cycles the program executes for or the number of L2 cache misses it causes. Some of the events are very fine-grained and we're mostly interested in their relation to other events, for example the number of loads per cache miss. We call a performance metric composed from several hardware event counts a derived event.

We will use pfmon to collect the event counter data for our program and a Perl script called i2prof.pl to calculate some useful derived event counts and rates from the collected data.

One thing to keep in mind when measuring derived events is that the code you are measuring has to behave in the same way each time you execute it. If you want to measure the number of instructions executed per cycle, i2prof.pl calculates that as IA64_INST_RETIRED_THIS / CPU_CYCLES. If IA64_INST_RETIRED_THIS and CPU_CYCLES are measured separately, and the program ran for 200 million cycles when IA64_INST_RETIRED_THIS was measured but for 500 million cycles when CPU_CYCLES was measured, the resulting derived event number will be very inaccurate.

Our first task is to find out how many useful instructions are executed. The Itanium 2 can execute up to 6 instructions every cycle. If the compiler can't schedule enough instructions to fill all 6 instruction slots it will fill the remaining slots with NOP instructions that do nothing but are still counted as executed instructions by the PMU. To get the number of useful instructions we count the total number of executed instructions and subtract the number of executed NOP instructions. If the number of useful instructions executed per cycle is very low the performance will be far from the theoretical peak.

i2prof.pl can tell us which events we need to count to calculate the instructions/cycle rate:

> i2prof.pl -l -q "Useful instructions/cycle"
"Useful instructions/cycle" IA64_INST_RETIRED_THIS,NOPS_RETIRED,CPU_CYCLES,

We'll then collect this data using pfmon:

> pfmon --outfile=counters.txt --events=IA64_INST_RETIRED_THIS,NOPS_RETIRED,CPU_CYCLES ./a.out > out.txt
> cat counters.txt
291988175474 IA64_INST_RETIRED_THIS
62991821350 NOPS_RETIRED
114171236224 CPU_CYCLES

And finally calculate the number of useful instructions per cycle:

> i2prof.pl -q "Useful instructions/cycle" -d ./counters.txt
Useful instructions/cycle : 2.006
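
If you want to verify the derived number by hand, the same arithmetic can be applied directly to the pfmon output file. A minimal sketch, assuming the two-column value/name format shown above, that computes (IA64_INST_RETIRED_THIS - NOPS_RETIRED) / CPU_CYCLES with awk:

> awk '{c[$2]=$1} END {printf "%.3f\n", (c["IA64_INST_RETIRED_THIS"]-c["NOPS_RETIRED"])/c["CPU_CYCLES"]}' counters.txt
2.006

Subtracting NOPS_RETIRED is what turns the raw retired-instruction count into useful instructions.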
Alternatively, i2prof.pl can call pfmon for you, adding the required counts to the file you specify:

> i2prof.pl -q "Useful instructions/cycle" -d counters.txt -p "./a.out > out.txt"
Useful instructions/cycle : 2.003

This may cause ./a.out to be executed several times if that is necessary in order to get the requested counter data. As you can see, the result differs slightly from the number we got when we called pfmon ourselves. This is caused by dynamic changes in the machine state. For instance, the number of cycles the code needs to execute may vary slightly due to different cache contents when the code is started, and if other programs are executing on the machine they may also affect the measurements. For the counts to be accurate the code needs to execute for at least a couple of seconds. Try executing the code several times while measuring CPU_CYCLES if you want to get a feel for how accurate the counts are:

> pfmon --events=CPU_CYCLES ./a.out > out.txt
> pfmon --events=CPU_CYCLES ./a.out >> out.txt
> pfmon --events=CPU_CYCLES ./a.out >> out.txt
> grep CPU_CYCLES out.txt
114613138521 CPU_CYCLES
114663020591 CPU_CYCLES
114626893096 CPU_CYCLES

The difference here is less than 0.5%, which is certainly good enough.

Looking at the number of useful instructions/cycle that we measured we can see that this code executes far from the peak 6 instructions/cycle. If we look at the ratio of NOP instructions to the total...

> i2prof.pl -q "NOPs/instruction" -d ./counters.txt
NOPs/instruction : 0.216

...we see that about 22% of all instructions are NOPs. The NOPs alone would only bring us down from 6 to around 4.7 useful instructions per cycle, yet we measured only 2. Evidently the CPU is stalling for some other reason.

> i2prof.pl -q "Total stalls" -d counters.txt -p "./a.out > out.txt"
Total stalls : 0.570

So the CPU is stalling 57% of the time. We'll try to find out why this is happening. i2prof.pl has a group of statistics that detail the various causes for the CPU to stall:

> i2prof.pl -L -t stall
"D-cache stalls"
"Branch mispredict stalls"
"I-cache stalls"
"FPU stalls"
"RSE stalls"
"Integer register dependency stalls"
"Support register dependency stalls"

This allows us to see what the CPU is stalling on:

> i2prof.pl -p -t stall -d counters.txt "./a.out > out.txt"
Stall numbers are the ratio to the total number of stall cycles.
D-cache stalls : 0.286
Branch mispredict stalls : 0.000
I-cache stalls : 0.000
FPU stalls : 0.708
RSE stalls : 0.000
Integer register dependency stalls : 0.000
Support register dependency stalls : 0.000
Sum : 0.995

The sum should be approximately 1, although, since the event counts were gathered over several program invocations, the total may vary a little bit, as we have seen. We can see that around 29% of the CPU stalls occur because of conflicts in the data cache. Note that stalls caused by load latencies are not included in the D-cache stalls; they appear either in the Integer register dependency stalls or in the FPU stalls. The remaining ~71% of the stalls are caused by the floating point unit. Branches are almost perfectly predicted here, indicating that the control flow of the program is either simple or very consistent.

The FPU stalls are caused by register-register and register-load dependencies. There is unfortunately no easy way of determining the proportions of these two causes. Code inspection tells us that the main loop in this particular code is fairly simple, with reads/writes to 12 locations per iteration and 3 constant values. The Itanium 2 processor has 126 available floating point registers and the FPU has a latency of 4 cycles. If the main loop is sufficiently unrolled then 12*4+3 = 51 registers, well below the 126 available, should be sufficient to avoid register-register dependencies, so we assume that the majority of the FPU stalls are caused by load latencies.
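
Before digging further into the memory system, note that the stall ratios above are derived from the Itanium 2's back-end pipeline bubble counters, following the methodology described in [1] under References. If you want to cross-check the overall stall fraction by hand, you can count the all-causes back-end bubble event yourself. A sketch, assuming the event is spelled BACK_END_BUBBLE_ALL by your pfmon (check the event list printed by pfmon -l if your version spells it differently):

> pfmon --outfile=stalls.txt --events=BACK_END_BUBBLE_ALL,CPU_CYCLES ./a.out > out.txt
> awk '{c[$2]=$1} END {printf "%.3f\n", c["BACK_END_BUBBLE_ALL"]/c["CPU_CYCLES"]}' stalls.txt

The result should come out close to the Total stalls figure of 0.570 reported above, give or take the run-to-run variation we saw earlier.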
We can further classify the data cache memory stalls:

> i2prof.pl -p -d counters.txt -t dstall "./a.out > out.txt"
L2 capacity stalls : 0.222
Integer load latency stalls : 0.000
L2 recirculation stalls : 0.171
Store related stalls : 0.000
Virtual memory stalls : 0.026

These statistics are not independent and may overlap, which is why the sum doesn't add up to the 29% D-cache stall figure above. In particular, capacity stalls and recirculation stalls may overlap substantially. Here the majority of the stalls are due to the memory requests overwhelming the L2 cache (the capacity stalls). When the L2 has too many simultaneous outstanding requests, or if requests are blocked because of conflicts, it becomes blocking, stalling the main pipeline. This is often caused by bank conflicts in the L2 cache; data items adjacent to each other that are referenced in the same cycle may cause this type of stall. The virtual memory system causes almost no stall cycles, and neither do stores.

Since the memory system seems to be the major bottleneck in this case we'll take a look at how the cache hierarchy performs.

> i2prof.pl -p -d counters.txt -t cache "./a.out > out.txt"
L1I hit rate : 1.000
L1I prefetch hit rate : 0.950
L2I hit rate : 0.812
L1D hit rate : 0.903
L2 hit rate : 0.963
L2D hit rate : 0.963
L2 data references : 1.000
L3D hit rate : 0.009

We'll concentrate on the data cache hit rates since we've already seen that the I-caches aren't a performance problem for this code. The hit rates are pretty good for both the L1 and L2 caches and close to zero for the L3. It is hard to tell without inspecting the code whether this is because the 256kB L2 cache is sufficient to cover the amount of data locality present in this code, because the capacity of the L3 cache is too small so that cache lines brought in by L2 misses are evicted before they are referenced again, or because the compiler prefetches data with non-temporal locality hints that cause it to bypass the L3. In this case it turns out that the problem size is such that the outer loop covers approximately 1.8MB of data, so the 1.5MB L3 cache in our system is too small. Once a load here misses in the L2 cache it almost always goes to main memory. Since the main memory latency is on the order of 150 cycles (on a 900MHz Itanium 2 with the HP zx1 chipset) it will be almost impossible for the compiler to find enough independent work to cover it unless it can prefetch the data well in advance of when it is needed. The code clearly stalls mainly because it has to go to main memory frequently.

We can also take a look at the average main memory bandwidth consumed:

> i2prof.pl -p -d counters.txt -q "Main memory bandwidth used" "./a.out > out.txt"
Main memory bandwidth used : 5.388

The code loads and stores on average 5.4 bytes/cycle from/to main memory. On our 900MHz processor this gives us 900000000*5.4/1048576 = 4634.9 MB/s, or around 72% of the theoretical peak of 6.4GB/s, which is pretty close to the maximum memory bandwidth that one would expect to be able to get in practice.

Finally we'll look at that all-important metric, the MFLOPS rate.

> i2prof.pl -p -d counters.txt -q "FLOP/cycle" "./a.out > out.txt"
FLOP/cycle : 1.086

For our 900MHz processor this becomes 1.086*900 ~= 977 MFLOPS.
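
The conversions from per-cycle rates to absolute figures are just multiplications by the clock frequency. A minimal sketch of the arithmetic, with the 900MHz clock and the two measured rates from above hard-coded:

> awk 'BEGIN {printf "%.1f MB/s  %.0f MFLOPS\n", 900e6*5.4/1048576, 900e6*1.086/1e6}'
4634.9 MB/s  977 MFLOPS

The bandwidth figure uses 1 MB = 1048576 bytes, matching the calculation above.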
Another example

To show an unexpected problem we'll analyse the BT component of the NAS Parallel Benchmarks (NPB) 2.3. BT is a simulated computational fluid dynamics (CFD) application that solves block tridiagonal systems using an implicit method. It consists of about 5500 lines of Fortran.

We've collected all counter values in a batch run using:

> i2prof.pl -a -p -d bt.A.1.counters ./bt.A.1

We'll use the "efficiency" group of statistics this time:

> i2prof.pl -t efficiency -d bt.A.1.counters
Useful instructions/cycle : 0.942
FLOP/cycle : 0.385
Instructions/cycle : 1.390
Main memory bandwidth used : 1.584
NOPs/instruction : 0.322
Total stalls : 0.700

So this code is stalling 70% of the time and delivers about 346 MFLOPS. Breaking down the stall reasons:

> i2prof.pl -t stall -d bt.A.1.counters
Stall numbers are the ratio to the total number of stall cycles.
D-cache stalls : 0.164
Branch mispredict stalls : 0.004
I-cache stalls : 0.007
FPU stalls : 0.305
RSE stalls : 0.000
Integer register dependency stalls : 0.000
Support register dependency stalls : 0.522
Sum : 1.002

We see that the majority of the stalls are attributed to dependencies on the support registers. i2prof.pl can tell us (briefly) what that means:

> i2prof.pl -c -q "Support register dependency stalls"
Support register dependency stalls : Stalls caused by bank switches, cancelled loads and dependencies on application, control and predicate registers.

At this point we'll have to leave i2prof.pl behind to figure out more about what is happening. Breaking down the support register stall reasons further, we found that the stalls were caused by dependencies on the application or control registers (counted by the event BE_EXE_BUBBLE_ARCR). This is indicative of a problem with the compiler; there should not normally be a huge number of these kinds of stalls. When we lowered the optimization level from O3 to O2 the bulk of these stalls disappeared and performance increased slightly.

Programs

pfmon can be found at http://www.hpl.hp.com/research/linux/perfmon/
i2prof.pl can be fetched from http://www.pdc.kth.se/~pek/i2prof.pl

Contact

Corrections, improvement suggestions and questions are welcome: pek@pdc.kth.se

References

[1] Jarp, Sverre. A Methodology for using the Itanium 2 Performance Counters for Bottleneck Analysis. http://www.gelato.org/pdf/Performance_counters_final.pdf
[2] Intel Itanium 2 Processor Reference Manual For Software Development and Optimization. http://developer.intel.com/design/itanium2/manuals/251110.htm

----

i2prof.pl statistics descriptions

Stall statistics

"D-cache stalls"
D-cache stalls are stalls caused by data load latencies and overhead in the memory system (TLB misses).

"Branch mispredict stalls"
Branch mispredict stalls are caused by a failure of the branch unit to correctly predict branch targets. When a branch is mispredicted it will cause a bubble of 6 or more cycles in the main pipeline. Codes with simple control flow will typically have few branch mispredict stalls. Profile-directed recompilation may help to reduce the branch mispredict ratio if it is high.

"I-cache stalls"
I-cache stalls are stalls caused by a failure of the instruction cache hierarchy to supply instructions in a timely manner. This may be linked to a failure to predict branch outcomes.

"FPU stalls"
The FPU stalls are caused by register-register or register-load dependencies in the floating point unit.

"RSE stalls"
The Register Stack Engine (RSE) is used to spill and fill registers when a program uses more registers than the CPU physically has. It will store (spill) old register contents to memory and reload (fill) the registers when the old contents are needed again.
Storing to a register when there are no free physical registers available causes a stall while a register is selected and spilled to memory, and reading from a spilled register causes a stall while the old contents are read back from memory.

"Integer register dependency stalls"
These are stalls caused by register-register dependencies (scoreboarding).

Dstall statistics

"L2 capacity stalls"
The L2 cache is non-blocking for up to 16 outstanding requests for unique cache lines. When the number of simultaneous outstanding requests exceeds 16 the L2 becomes blocking. The L2 capacity stalls are stalls caused by this phenomenon: the program is loading new values faster than the L2 can get them from the rest of the memory system.

"L2 recirculation stalls"
The L2 recirculation stalls are L2 cache misses that reference a cache line that is already in the process of being loaded into the L2. These requests cause stall cycles until the cache line is brought into the L2, after which the request is treated as a cache hit.

"Store related stalls"
Stalls caused by a lack of store resources (a full store buffer or a conflict with a returning cache fill).

"Virtual memory stalls"
Virtual memory stalls are stalls caused by virtual memory page misses: either latency caused by the transfer of a page table entry from the L2 DTLB to the L1 DTLB on an L1 DTLB miss, or latency caused by the hardware page walker (HPW) having to fetch a page table entry from the page table (VHPT). A high proportion of virtual memory stalls indicates that performance would benefit from a bigger memory page size.

$Header: /afs/pdc.kth.se/home/p/pek/public_html/ia64/ia64-profiling.txt,v 1.6 2003/11/21 14:04:36 pek Exp $