Studying program performance on the Itanium 2 with pfmon

Hoare's dictum: "Premature optimization is the root of all evil."

"Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest."
    - Rob Pike

The pfmon program uses the Itanium 2 processor performance counters (the Performance Monitoring Unit or PMU) to collect information about where a program spends its time. The Itanium 2 processor can count up to four different events at a time, where events can be things like the number of cycles the program executes for or the number of L2 cache misses it causes. Some of the events are very fine-grained and we're mostly interested in their relation to other events, for example the number of loads per cache miss. We call a performance metric composed from several hardware event counts a derived event.

We will use pfmon to collect the event counter data for our program and a Perl script called i2prof.pl to calculate some useful derived event counts and rates from the collected data.

One thing to keep in mind when measuring derived events is that the code you are measuring has to behave in the same way each time you execute it. If you want to measure the number of instructions executed per cycle, i2prof.pl calculates that as IA64_INST_RETIRED_THIS / CPU_CYCLES. If IA64_INST_RETIRED_THIS and CPU_CYCLES are measured separately, and the program ran for 200 million cycles when IA64_INST_RETIRED_THIS was measured but for 500 million cycles when CPU_CYCLES was measured, the resulting derived event number will be very inaccurate.

Our first task is to find out how many useful instructions are executed. The Itanium 2 can execute up to 6 instructions every cycle. If the compiler can't schedule enough instructions to fill all 6 instruction slots it will fill the remaining slots with NOP instructions that do nothing but are still counted as executed instructions by the PMU. To get the number of useful instructions we count the total number of executed instructions and subtract the number of executed NOP instructions. If the number of useful instructions executed per cycle is very low the performance will be far from the theoretical peak.

i2prof.pl can tell us which events we need to count to calculate the instructions/cycle rate:

> i2prof.pl -l -q "Useful instructions/cycle"
"Useful instructions/cycle" IA64_INST_RETIRED_THIS,NOPS_RETIRED,CPU_CYCLES,

We'll then collect this data using pfmon:

> pfmon --outfile=counters.txt --events=IA64_INST_RETIRED_THIS,NOPS_RETIRED,CPU_CYCLES ./a.out > out.txt
> cat counters.txt
291988175474 IA64_INST_RETIRED_THIS
62991821350 NOPS_RETIRED
114171236224 CPU_CYCLES

And finally calculate the number of useful instructions per cycle:

> i2prof.pl -q "Useful instructions/cycle" -d ./counters.txt
Useful instructions/cycle : 2.006
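
If you want to verify the derived number by hand, the same arithmetic can be applied directly to the pfmon output file. A minimal sketch, assuming the two-column value/name format shown above, that computes (IA64_INST_RETIRED_THIS - NOPS_RETIRED) / CPU_CYCLES with awk:

> awk '{c[$2]=$1} END {printf "%.3f\n", (c["IA64_INST_RETIRED_THIS"]-c["NOPS_RETIRED"])/c["CPU_CYCLES"]}' counters.txt
2.006

Subtracting NOPS_RETIRED is what turns the raw retired-instruction count into useful instructions.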
Alternatively, i2prof.pl can call pfmon for you, adding the required counts to the file you specify:

> i2prof.pl -q "Useful instructions/cycle" -d counters.txt -p "./a.out > out.txt"
Useful instructions/cycle : 2.003

This may cause ./a.out to be executed several times if that is necessary in order to get the requested counter data. As you can see, the result differs slightly from the number we got when we called pfmon ourselves. This is caused by dynamic changes in the machine state. For instance, the number of cycles the code needs to execute may vary slightly due to different cache contents when the code is started, and if other programs are executing on the machine they may also affect the measurements. For the counts to be accurate the code needs to execute for at least a couple of seconds. Try executing the code several times while measuring CPU_CYCLES if you want to get a feel for how accurate the counts are:

> pfmon --events=CPU_CYCLES ./a.out > out.txt
> pfmon --events=CPU_CYCLES ./a.out >> out.txt
> pfmon --events=CPU_CYCLES ./a.out >> out.txt
> grep CPU_CYCLES out.txt
114613138521 CPU_CYCLES
114663020591 CPU_CYCLES
114626893096 CPU_CYCLES

The difference here is less than 0.5%, which is certainly good enough.

Looking at the number of useful instructions/cycle that we measured we can see that this code executes far from the peak 6 instructions/cycle. If we look at the ratio of NOP instructions to the total...

> i2prof.pl -q "NOPs/instruction" -d ./counters.txt
NOPs/instruction : 0.216

...we see that about 22% of all instructions are NOPs. The NOPs alone would only bring us down from 6 to around 4.7 useful instructions per cycle, yet we measured only 2. Evidently the CPU is stalling for some other reason.

> i2prof.pl -q "Total stalls" -d counters.txt -p "./a.out > out.txt"
Total stalls : 0.570

So the CPU is stalling 57% of the time. We'll try to find out why this is happening. i2prof.pl has a group of statistics that detail the various causes for the CPU to stall:

> i2prof.pl -L -t stall
"D-cache stalls"
"Branch mispredict stalls"
"I-cache stalls"
"FPU stalls"
"RSE stalls"
"Integer register dependency stalls"
"Support register dependency stalls"

This allows us to see what the CPU is stalling on:

> i2prof.pl -p -t stall -d counters.txt "./a.out > out.txt"
Stall numbers are the ratio to the total number of stall cycles.
D-cache stalls : 0.286
Branch mispredict stalls : 0.000
I-cache stalls : 0.000
FPU stalls : 0.708
RSE stalls : 0.000
Integer register dependency stalls : 0.000
Support register dependency stalls : 0.000
Sum : 0.995

The sum should be approximately 1, although, since the event counts were gathered over several program invocations, the total may vary a little bit, as we have seen. We can see that around 29% of the CPU stalls occur because of conflicts in the data cache. Note that stalls caused by load latencies are not included in the D-cache stalls; they appear either in the Integer register dependency stalls or in the FPU stalls. The remaining ~71% of the stalls are caused by the floating point unit. Branches are almost perfectly predicted here, indicating that the control flow of the program is either simple or very consistent.

The FPU stalls are caused by register-register and register-load dependencies. There is unfortunately no easy way of determining the proportions of these two causes. Code inspection tells us that the main loop in this particular code is fairly simple, with reads/writes to 12 locations per iteration and 3 constant values. The Itanium 2 processor has 126 available floating point registers and the FPU has a latency of 4 cycles. If the main loop is sufficiently unrolled then 12*4+3 = 51 registers, well below the 126 available, should be sufficient to avoid register-register dependencies, so we assume that the majority of the FPU stalls are caused by load latencies.
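
Before digging further into the memory system, note that the stall ratios above are derived from the Itanium 2's back-end pipeline bubble counters, following the methodology described in [1] under References. If you want to cross-check the overall stall fraction by hand, you can count the all-causes back-end bubble event yourself. A sketch, assuming the event is spelled BACK_END_BUBBLE_ALL by your pfmon (check the event list printed by pfmon -l if your version spells it differently):

> pfmon --outfile=stalls.txt --events=BACK_END_BUBBLE_ALL,CPU_CYCLES ./a.out > out.txt
> awk '{c[$2]=$1} END {printf "%.3f\n", c["BACK_END_BUBBLE_ALL"]/c["CPU_CYCLES"]}' stalls.txt

The result should come out close to the Total stalls figure of 0.570 reported above, give or take the run-to-run variation we saw earlier.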
We can further classify the data cache memory stalls:

> i2prof.pl -p -d counters.txt -t dstall "./a.out > out.txt"
L2 capacity stalls : 0.222
Integer load latency stalls : 0.000
L2 recirculation stalls : 0.171
Store related stalls : 0.000
Virtual memory stalls : 0.026

These statistics are not independent and may overlap, which is why the sum doesn't add up to the 29% D-cache stall figure above. In particular, capacity stalls and recirculation stalls may overlap substantially. Here the majority of the stalls are due to the memory requests overwhelming the L2 cache (the capacity stalls). When the L2 has too many simultaneous outstanding requests, or if requests are blocked because of conflicts, it becomes blocking, stalling the main pipeline. This is often caused by bank conflicts in the L2 cache; data items adjacent to each other that are referenced in the same cycle may cause this type of stall. The virtual memory system causes almost no stall cycles, and neither do stores.

Since the memory system seems to be the major bottleneck in this case we'll take a look at how the cache hierarchy performs.

> i2prof.pl -p -d counters.txt -t cache "./a.out > out.txt"
L1I hit rate : 1.000
L1I prefetch hit rate : 0.950
L2I hit rate : 0.812
L1D hit rate : 0.903
L2 hit rate : 0.963
L2D hit rate : 0.963
L2 data references : 1.000
L3D hit rate : 0.009

We'll concentrate on the data cache hit rates since we've already seen that the I-caches aren't a performance problem for this code. The hit rates are pretty good for both the L1 and L2 caches and close to zero for the L3. It is hard to tell without inspecting the code whether this is because the 256kB L2 cache is sufficient to cover the amount of data locality present in this code, because the capacity of the L3 cache is too small so that cache lines brought in by L2 misses are evicted before they are referenced again, or because the compiler prefetches data with non-temporal locality hints that cause it to bypass the L3. In this case it turns out that the problem size is such that the outer loop covers approximately 1.8MB of data, so the 1.5MB L3 cache in our system is too small. Once a load here misses in the L2 cache it almost always goes to main memory. Since the main memory latency is on the order of 150 cycles (on a 900MHz Itanium 2 with the HP zx1 chipset) it will be almost impossible for the compiler to find enough independent work to cover it unless it can prefetch the data well in advance of when it is needed. The code clearly stalls mainly because it has to go to main memory frequently.

We can also take a look at the average main memory bandwidth consumed:

> i2prof.pl -p -d counters.txt -q "Main memory bandwidth used" "./a.out > out.txt"
Main memory bandwidth used : 5.388

The code loads and stores on average 5.4 bytes/cycle from/to main memory. On our 900MHz processor this gives us 900000000*5.4/1048576 = 4634.9 MB/s, or around 72% of the theoretical peak of 6.4GB/s, which is pretty close to the maximum memory bandwidth that one would expect to be able to get in practice.

Finally we'll look at that all-important metric, the MFLOPS rate.

> i2prof.pl -p -d counters.txt -q "FLOP/cycle" "./a.out > out.txt"
FLOP/cycle : 1.086

For our 900MHz processor this becomes 1.086*900 ~= 977 MFLOPS.
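
The conversions from per-cycle rates to absolute figures are just multiplications by the clock frequency. A minimal sketch of the arithmetic, with the 900MHz clock and the two measured rates from above hard-coded:

> awk 'BEGIN {printf "%.1f MB/s  %.0f MFLOPS\n", 900e6*5.4/1048576, 900e6*1.086/1e6}'
4634.9 MB/s  977 MFLOPS

The bandwidth figure uses 1 MB = 1048576 bytes, matching the calculation above.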
Another example

To show an unexpected problem we'll analyse the BT component of the NAS Parallel Benchmarks (NPB) 2.3. BT is a simulated computational fluid dynamics (CFD) application that solves block tridiagonal systems using an implicit method. It consists of about 5500 lines of Fortran.

We've collected all counter values in a batch run using:

> i2prof.pl -a -p -d bt.A.1.counters ./bt.A.1

We'll use the "efficiency" group of statistics this time:

> i2prof.pl -t efficiency -d bt.A.1.counters
Useful instructions/cycle : 0.942
FLOP/cycle : 0.385
Instructions/cycle : 1.390
Main memory bandwidth used : 1.584
NOPs/instruction : 0.322
Total stalls : 0.700

So this code is stalling 70% of the time and delivers about 346 MFLOPS. Breaking down the stall reasons:

> i2prof.pl -t stall -d bt.A.1.counters
Stall numbers are the ratio to the total number of stall cycles.
D-cache stalls : 0.164
Branch mispredict stalls : 0.004
I-cache stalls : 0.007
FPU stalls : 0.305
RSE stalls : 0.000
Integer register dependency stalls : 0.000
Support register dependency stalls : 0.522
Sum : 1.002

We see that the majority of the stalls are attributed to dependencies on the support registers. i2prof.pl can tell us (briefly) what that means:

> i2prof.pl -c -q "Support register dependency stalls"
Support register dependency stalls : Stalls caused by bank switches, cancelled loads and dependencies on application, control and predicate registers.

At this point we'll have to leave i2prof.pl behind to figure out more about what is happening. Breaking down the support register stall reasons further, we found that the stalls were caused by dependencies on the application or control registers (counted by the event BE_EXE_BUBBLE_ARCR). This is indicative of a problem with the compiler; there should not normally be a huge number of these kinds of stalls. When we lowered the optimization level from O3 to O2 the bulk of these stalls disappeared and performance increased slightly.

Programs

pfmon can be found at http://www.hpl.hp.com/research/linux/perfmon/
i2prof.pl can be fetched from http://www.pdc.kth.se/~pek/i2prof.pl

Contact

Corrections, improvement suggestions and questions are welcome: pek@pdc.kth.se

References

[1] Jarp, Sverre. A Methodology for using the Itanium 2 Performance Counters for Bottleneck Analysis. http://www.gelato.org/pdf/Performance_counters_final.pdf
[2] Intel Itanium 2 Processor Reference Manual For Software Development and Optimization. http://developer.intel.com/design/itanium2/manuals/251110.htm

----

i2prof.pl statistics descriptions

Stall statistics

"D-cache stalls"
D-cache stalls are stalls caused by data load latencies and overhead in the memory system (TLB misses).

"Branch mispredict stalls"
Branch mispredict stalls are caused by a failure of the branch unit to correctly predict branch targets. When a branch is mispredicted it will cause a bubble of 6 or more cycles in the main pipeline. Codes with simple control flow will typically have few branch mispredict stalls. Profile-directed recompilation may help to reduce the branch mispredict ratio if it is high.

"I-cache stalls"
I-cache stalls are stalls caused by a failure of the instruction cache hierarchy to supply instructions in a timely manner. This may be linked to a failure to predict branch outcomes.

"FPU stalls"
The FPU stalls are caused by register-register or register-load dependencies in the floating point unit.

"RSE stalls"
The Register Stack Engine (RSE) is used to spill and fill registers when a program uses more registers than the CPU physically has. It will store (spill) old register contents to memory and reload (fill) the registers when the old contents are needed again.
Storing to a register when there are no free physical registers available causes a stall while a register is selected and spilled to memory, and reading from a spilled register causes a stall while the old contents are read back from memory.

"Integer register dependency stalls"
These are stalls caused by register-register dependencies (scoreboarding).

Dstall statistics

"L2 capacity stalls"
The L2 cache is non-blocking for up to 16 outstanding requests for unique cache lines. When the number of simultaneous outstanding requests exceeds 16 the L2 becomes blocking. The L2 capacity stalls are stalls caused by this phenomenon: the program is loading new values faster than the L2 can get them from the rest of the memory system.

"L2 recirculation stalls"
The L2 recirculation stalls are L2 cache misses that reference a cache line that is already in the process of being loaded into the L2. These requests cause stall cycles until the cache line is brought into the L2, after which the request is treated as a cache hit.

"Store related stalls"
Stalls caused by a lack of store resources (a full store buffer or a conflict with a returning cache fill).

"Virtual memory stalls"
Virtual memory stalls are stalls caused by virtual memory page misses: either latency caused by the transfer of a page table entry from the L2 DTLB to the L1 DTLB on an L1 DTLB miss, or latency caused by the hardware page walker (HPW) having to fetch a page table entry from the page table (VHPT). A high proportion of virtual memory stalls indicates that performance would benefit from a bigger memory page size.

$Header: /afs/pdc.kth.se/home/p/pek/public_html/ia64/ia64-profiling.txt,v 1.6 2003/11/21 14:04:36 pek Exp $