This lab exercise is intended to augment the related talk, Single Processor Performance Considerations. This exercise is not a tutorial. You should already be familiar with the information in the Single Processor Performance Considerations module; and be comfortable compiling and running either C or FORTRAN on CTC's IBM SP.
This lab has two principal objectives:
The exercise consists of a number of steps, each working on the same basic program, but implementing different performance improvement techniques. If you have the time, you will get the most from this lab by doing all of the steps in order. However, if you are short on time, try doing the first three lab steps on your own, and then just copying, studying and compiling the provided solution files for the remaining steps.
These instructions are written for either Fortran or C programmers. Where there is a difference due to language, the difference will be clearly indicated by two separate sets of instructions. Otherwise, differences in the lab depend on the program files themselves, which follow these naming conventions:
| Lab Files | Solutions | |
| Fortran | something.f | something.soln.f |
|---|---|---|
| C | something.c | something.soln.c |
Create an empty subdirectory to work in so that the files from
this tutorial do not get mixed up with your other files.
Use the UNIX cp command to copy the tutorial files
to your subdirectory.
Type these commands:
mkdir labdir
cd labdir
cp /afs/pdc.kth.se/public/www/training/Tutor/Performance/SingleProcPerf/mm* .
cp command,
which refers to your current directory).
We need a "baseline" of performance for a simple program before we start tuning the code.
Use either the xlf or xlc compiler, with no options except the -o (should you want to pick a name for the executable).
Note the execution time for this naive matrix-matrix multiply. Some sample results are also output; these are provided for debugging purposes later. The code will take a minute or so to run, so you may want to go on to the next couple of steps while waiting for the results.
Here we see what different compiler options or the use of a preprocessor can do to help our simple program.
For both Fortran and C, start by simply using -O3 with the appropriate compile command. Check the sample output carefully: It may not be the same as your first run. If not, but you think the results are "close enough", then ignore the differences and go on. Otherwise, recompile using -O3 -qstrict. Run this new executable and check the output: it should be a better match to the original results.
You should see roughly a factor of 5 improvement from the plain-compiled code.
Now we begin experimenting with "hand tuning" the code. First, we tinker with the structure of the loops.
Create another copy of mma.f and switch the I and J loops. The simple solution is mmr.soln.f. Compile with -O3, run and time. Evidently it is important to minimize the memory stride in the C(I,J) matrix.
The object is to stride by 4 over middle loop, while creating 4 temporary variables to use as accumulators in the innermost loop. In addition, it will be advantageous to create a temporary variable to hold the matrix variable which needs to be used in each of the 4 assignments in the innermost loop. Section 2.7 in the related talk shows an example.
You'll need to nest three more loops inside the
triply-nested main loop, giving the outer loops a stride which
makes the inner loops fit entirely within cache. (See
Sections
3.2 and 3.3 in the related talk for discussion and an
example.)
Compile with -O3, run and note the time. (Be sure to check your
results and recompile with -qstrict if you need it.)
Did you get a
speedup? You may not have! With the -O3 option, you enable the
compiler to do code transformations of its own; by tinkering around
with the code, you may have interfered with what the compiler was
doing previously. So be aware that hand-tuning has its risks...
Your solution should look like mmb.soln.f
or mmb.soln.c in your working directory.
Now we apply array padding for some simple hand tuning. The matrix size of 512x512 was deliberately chosen to be a large power of 2. Recall that this reduces the effective size of the cache (see section 3.5 in the related talk).
To fix this problem, enlarge one of the dimensions of the arrays to A(ND,ND) where the parameter ND is not equal to a power of 2--say, N+1 instead of N. Do not change the contents of the matrices or any of the loops! The goal is merely to change the way the matrices of the original problem are stored in the main memory.
Apply this simple fix to each of the programs mma, mmb, and mmu. You will be amazed to see the results after you compile with -O3, run and time. The provided solutions are mmp.soln.f, mmbp.soln.f and mmup.soln.f for Fortran; and mmp.soln.c, mmbp.soln.c and mmup.soln.c for C.
For all you real hackers, here's a puzzle to solve that will test your understanding of the cache mapping problem. It is actually possible to achieve the performance improvement you've seen by padding the array in a single dimension. Which one?
You can try it and test, of course, but here's a hint: The dimension you choose depends on the language your program uses. Can you explain what happened? Click here to check your reasoning.
Now we try the all of the loop tricks together with array padding. You can do this by combining your mmbp and mmup codes.
The inner 3 loops of the blocked code will need to be reordered (for FORTRAN) and unrolled. The easy way to do this is to take the loops from mmup and use them as replacements for the inner loops in mmbp, after modifying them in a straightforward way.
Compile with -O3, run and time. For comparison, See how much of this improved performance is retained when ND=N instead of N+1 (i.e., no padding). The provided solutions mmbup.soln.f and mmbu.soln.f for Fortran; and mmbup.soln.c and mmbu.soln.c.
Of course, we've saved the best strategies for the last.
Humble pie #1: The easy way out. We're going to try the simple array-padded code with the preprocessor.
The last option produces the file Pmmp.f, which is the Fortran code after transformation by KAP.
The +Ksif produces the intermediate files Kmmp.c and Kmmp.cpp. Kmmp.cpp is the source code output by the cpp preprocessor (a necessary first step before kapc can use the code). Kmmp.c is the source code we're interested in: the result of the kapc preprocessor.
Run the code and compare the output with the running time for the mma code compiled with the KAP preprocessor and -O3 in an earlier exercise. Evidently, introducing the array padding has caused KAP to be less effective in producing its own optimizations! You'll find that the same is true for many of the other elementary hand-tuning tricks covered in this lab. This is because a good preprocessor can be exceedingly subtle with its code transformations. Examine the KAP-altered source (either Pmmp.f or Kmmp.c) to see what KAP did, noting in particular the more sophisticated loop unrolling.
In addition to preprocessors, there are postprocessors like IBM's -qhot for Fortran. A postprocessor performs transformations similar to KAP's, but at the back end of the compilation. The combination "-qhot -qarch=pwr2" can be particularly effective in obtaining good performance. Try it on any of the codes in this lab and see what happens. (Note that -O2 or -O3 must also be specified during compilation in order for -qhot to take effect.)
This is not to say that hand-tuning is an empty exercise. Indeed, in the past, KAP actually turned in better performance when the array padding was first done by hand. This kind of result is typical when optimizing a code: many things must be tried in order to find the one thing (or the combination) that is most effective. The message is to experiment with a number of things (including compiler flags), relying on timing tests to reveal what works best.
Humble pie #2: The easiest way out. Now replace the entire set of loops by a call to the appropriate ESSL subroutine.
Also note the use of double quotes (") in place of the single quotes (') that appear in the Fortran call. This is because ESSL treats scalar character values as strings. More information on ESSL subroutines, and calling them from C programs can be found using IBM's InfoExplorer: info -l essl.
Compile the altered program with either -O3 -lessl or -O3 -lesslp2. Run them, time them and groan...
Both solution versions are provided for you: mmep.soln.f or mmep.soln.c, which use padded arrays; and mme.soln.f or mme.soln.c, which don't.
Click here for typical timing results, for both C and FORTRAN. The morals of the story:
cd ../
rm -r labdir