[PDC - Center for Parallel Computers, KTH]

LAB: Optimizing on a Single Processor

Last modified: Thu Jul 8 16:31:09 MET DST 1999

Table of Contents


Prerequisites and Objectives

This lab exercise is intended to augment the related talk, Single Processor Performance Considerations. This exercise is not a tutorial. You should already be familiar with the information in the Single Processor Performance Considerations module; and be comfortable compiling and running either C or FORTRAN on CTC's IBM SP.

This lab has two principal objectives:

The exercise consists of a number of steps, each working on the same basic program, but implementing different performance improvement techniques. If you have the time, you will get the most from this lab by doing all of the steps in order. However, if you are short on time, try doing the first three lab steps on your own, and then just copying, studying and compiling the provided solution files for the remaining steps.

Top of Page  |  =>Objectives<= |  General Instructions  |  Basic Program  |  Opts & Preps  |  Loop Tricks  |  Array Padding  |  Everything  |  What Works  |  Conclusions ]  Cleanup ]

General Lab Instructions

Top of Page  |  Objectives  | =>General Instructions<= |  Basic Program  |  Opts & Preps  |  Loop Tricks  |  Array Padding  |  Everything  |  What Works  |  Conclusions ]  Cleanup ]

The Basic Program

We need a "baseline" of performance for a simple program before we start tuning the code.

  1. This section uses mma.f (Fortran) or mma.c (C version). Look over the code, so that you understand what it does.

  2. Compile the program without options.

    Use either the xlf or xlc compiler, with no options except the -o (should you want to pick a name for the executable).

  3. Run the code.

    Note the execution time for this naive matrix-matrix multiply. Some sample results are also output; these are provided for debugging purposes later. The code will take a minute or so to run, so you may want to go on to the next couple of steps while waiting for the results.

Top of Page  |  Objectives  |  General Instructions  | =>Basic Program<= |  Opts & Preps  |  Loop Tricks  |  Array Padding  |  Everything  |  What Works  |  Conclusions ]  Cleanup ]


Compiler Options and Preprocessors

Here we see what different compiler options or the use of a preprocessor can do to help our simple program.

  1. Compile using -O3 alone; run the code and note the time.

    For both Fortran and C, start by simply using -O3 with the appropriate compile command. Check the sample output carefully: It may not be the same as your first run. If not, but you think the results are "close enough", then ignore the differences and go on. Otherwise, recompile using -O3 -qstrict. Run this new executable and check the output: it should be a better match to the original results.

  2. Compile using the Kap preprocessor alone; run the code and note the time.

    Fortran:
    Compile using the -Pk option:  xlf -Pk mma.f

    C: *
    Use the kap driver to preprocess and compile:  kxlc mma.c

  3. Finally, try them both together.

    Fortran:
    Compile using both options:  xlf -Pk -O3 mma.f

    C: *
    Use the kap driver to preprocess and compile:  kxlc -O3 mma.c

You should see roughly a factor of 5 improvement from the plain-compiled code.

Top of Page  |  Objectives  |  General Instructions  |  Basic Program  | =>Opts & Preps<= |  Loop Tricks  |  Array Padding  |  Everything  |  What Works  |  Conclusions ]  Cleanup ]


Loop Tricks

Now we begin experimenting with "hand tuning" the code. First, we tinker with the structure of the loops.

  1. Loop reordering: FORTRAN programmers only need to complete this step.

    Create another copy of mma.f and switch the I and J loops. The simple solution is mmr.soln.f. Compile with -O3, run and time. Evidently it is important to minimize the memory stride in the C(I,J) matrix.

  2. Loop unrolling: Now unroll the middle loop to depth 4.

    The object is to stride by 4 over middle loop, while creating 4 temporary variables to use as accumulators in the innermost loop. In addition, it will be advantageous to create a temporary variable to hold the matrix variable which needs to be used in each of the 4 assignments in the innermost loop. Section 2.7 in the related talk shows an example.

  3. Blocking: Create a copy of mma.f or mma.c and implement blocking on it.

    You'll need to nest three more loops inside the triply-nested main loop, giving the outer loops a stride which makes the inner loops fit entirely within cache. (See Sections 3.2 and 3.3 in the related talk for discussion and an example.)

    Compile with -O3, run and note the time. (Be sure to check your results and recompile with -qstrict if you need it.) Did you get a speedup? You may not have! With the -O3 option, you enable the compiler to do code transformations of its own; by tinkering around with the code, you may have interfered with what the compiler was doing previously. So be aware that hand-tuning has its risks... Your solution should look like mmb.soln.f or mmb.soln.c in your working directory.

    Fortran:
    1. Use a copy of the mmr.f code.
    2. Create temporary variables to replace C(I,J) through C(I+3,J).
    3. Create a temporary variable to hold B(K,J).
    4. Compile with -O3, run and time.
    5. Your solution should look like mmu.soln.f.

C:
  1. Use a copy of the mma.c code.
  2. Create temporary variables to replace C[I][J] through C[I][J+3].
  3. Create a temporary variable to hold A[I][K].
  4. Compile with -O3, run and time.
  5. Your solution should look like mmu.soln.c.

Top of Page  |  Objectives  |  General Instructions  |  Basic Program  |  Opts & Preps  | =>Loop Tricks<= |  Array Padding  |  Everything  |  What Works  |  Conclusions ]  Cleanup ]

Array Padding

Now we apply array padding for some simple hand tuning. The matrix size of 512x512 was deliberately chosen to be a large power of 2. Recall that this reduces the effective size of the cache (see section 3.5 in the related talk).

To fix this problem, enlarge one of the dimensions of the arrays to A(ND,ND) where the parameter ND is not equal to a power of 2--say, N+1 instead of N. Do not change the contents of the matrices or any of the loops! The goal is merely to change the way the matrices of the original problem are stored in the main memory.

Apply this simple fix to each of the programs mma, mmb, and mmu. You will be amazed to see the results after you compile with -O3, run and time. The provided solutions are mmp.soln.f, mmbp.soln.f and mmup.soln.f for Fortran; and mmp.soln.c, mmbp.soln.c and mmup.soln.c for C.

Advanced Exercise

For all you real hackers, here's a puzzle to solve that will test your understanding of the cache mapping problem. It is actually possible to achieve the performance improvement you've seen by padding the array in a single dimension. Which one?

You can try it and test, of course, but here's a hint: The dimension you choose depends on the language your program uses. Can you explain what happened? Click here to check your reasoning.

Top of Page  |  Objectives  |  General Instructions  |  Basic Program  |  Opts & Preps  |  Loop Tricks  | =>Array Padding<= |  Everything  |  What Works  |  Conclusions ]  Cleanup ]

Everything Together

Now we try the all of the loop tricks together with array padding. You can do this by combining your mmbp and mmup codes.

The inner 3 loops of the blocked code will need to be reordered (for FORTRAN) and unrolled. The easy way to do this is to take the loops from mmup and use them as replacements for the inner loops in mmbp, after modifying them in a straightforward way.

Compile with -O3, run and time. For comparison, See how much of this improved performance is retained when ND=N instead of N+1 (i.e., no padding). The provided solutions mmbup.soln.f and mmbu.soln.f for Fortran; and mmbup.soln.c and mmbu.soln.c.

Top of Page  |  Objectives  |  General Instructions  |  Basic Program  |  Opts & Preps  |  Loop Tricks  |  Array Padding  | =>Everything<= |  What Works  |  Conclusions ]  Cleanup ]

What Really Works

Of course, we've saved the best strategies for the last.

Humble pie #1:  The easy way out. We're going to try the simple array-padded code with the preprocessor.

Fortran:
  1. Use your array padded version, or a copy of mmp.soln.f.
  2. Compile this using the KAP preprocessor:
    xlf -Pk -O3 -Wp,-fortran mmp.f

    The last option produces the file Pmmp.f, which is the Fortran code after transformation by KAP.

C: *
  1. Use your array padded version, or a copy of mmp.soln.c.
  2. Compile this using the KAP preprocessor:
    kxlc +Ksif -O3 mmp.c

    The +Ksif produces the intermediate files Kmmp.c and Kmmp.cpp. Kmmp.cpp is the source code output by the cpp preprocessor (a necessary first step before kapc can use the code). Kmmp.c is the source code we're interested in: the result of the kapc preprocessor.

Run the code and compare the output with the running time for the mma code compiled with the KAP preprocessor and -O3 in an earlier exercise. Evidently, introducing the array padding has caused KAP to be less effective in producing its own optimizations! You'll find that the same is true for many of the other elementary hand-tuning tricks covered in this lab. This is because a good preprocessor can be exceedingly subtle with its code transformations. Examine the KAP-altered source (either Pmmp.f or Kmmp.c) to see what KAP did, noting in particular the more sophisticated loop unrolling.

In addition to preprocessors, there are postprocessors like IBM's -qhot for Fortran. A postprocessor performs transformations similar to KAP's, but at the back end of the compilation. The combination "-qhot -qarch=pwr2" can be particularly effective in obtaining good performance. Try it on any of the codes in this lab and see what happens. (Note that -O2 or -O3 must also be specified during compilation in order for -qhot to take effect.)

This is not to say that hand-tuning is an empty exercise. Indeed, in the past, KAP actually turned in better performance when the array padding was first done by hand. This kind of result is typical when optimizing a code: many things must be tried in order to find the one thing (or the combination) that is most effective. The message is to experiment with a number of things (including compiler flags), relying on timing tests to reveal what works best.

Humble pie #2: The easiest way out. Now replace the entire set of loops by a call to the appropriate ESSL subroutine.

Fortran:
Replace the loops in either mma.f or mmp.soln.f with:
CALL DGEMUL(A, ND, 'N', B, ND, 'N', C, ND, N, N, N)

C:
  1. Using either mma.c or mmp.soln.c, add:
    #include <essl.h>
  2. Replace the loops with:
    dgemul(B, ND, "N", A, ND, "N", C, ND, N, N, N);
Note the arrays are sent in reverse order to the Fortran call. This is because ESSL follows the Fortran convention of storing arrays in column major order. This calling order is just making use of the matrix property (AB)' = B'A' = C' where ' denotes the matrix transpose operation.

Also note the use of double quotes (") in place of the single quotes (') that appear in the Fortran call. This is because ESSL treats scalar character values as strings. More information on ESSL subroutines, and calling them from C programs can be found using IBM's InfoExplorer: info -l essl.

Compile the altered program with either -O3 -lessl or -O3 -lesslp2. Run them, time them and groan...

Both solution versions are provided for you: mmep.soln.f or mmep.soln.c, which use padded arrays; and mme.soln.f or mme.soln.c, which don't.

Top of Page  |  Objectives  |  General Instructions  |  Basic Program  |  Opts & Preps  |  Loop Tricks  |  Array Padding  |  Everything  | =>What Works<= |  Conclusions ]  Cleanup ]

Summary & Conclusions

Click here for typical timing results, for both C and FORTRAN. The morals of the story:

Top of Page  |  Objectives  |  General Instructions  |  Basic Program  |  Opts & Preps  |  Loop Tricks  |  Array Padding  |  Everything  |  What Works | =>Conclusions<= ]  Cleanup ]

Cleanup

If you wish to delete the directory and files you created for this exercise:


[Copyright Statement] [Feedback]

Changed by:$Author: kinsella $,$Date: 1999/07/08 14:31:11 $