Programming exercises on OpenMP

PDC Summer School 2001
Nils Smeds

About this exercise

The aim of this exercise is to give an introduction to OpenMP programming. The test examples are all written in Fortran90, but the same concepts apply to C programs. Although of course, the OpenMP syntax is slightly different. Even if you are not a native Fortran programmer you should be able to understand the examples and be able to instrument them with OpenMP directives.

The exercise consists of five parts. The goal of the first part is to give you some familiarity to the OpenMP syntax by successively adding directives to a small test program. In the second part there is a small dummy program that you will parallelize in a guided manner. Finally, you will be given a complete program that solves the Poisson equation. Your task is then to parallelize this complete program using OpenMP. Should you still have time left, there are some extra exercises you may play with.

The hardware

In the exercises you will use the 8-way parallel shared memory Nighthawk nodes. There will be several nodes available for interactive use during the lab. Notice that since you are several users you will not be able to do accurate timings of your codes. You will, however, be able to get a feeling for parallelizing using OpenMP and see the effects of the directives.

There will be several Nighthawk nodes available to you during the lab session. You might want to try to find a node with fewer users to get more of the resources to your program. The unix command who displays which users are currently logged into the system. The command uptime gives a snapshot of the system load. The program monitor -top shows the system usage in more detail. The ``+'' and ``-'' keys control the interval of the display update in the monitor command.

You may also try some of the exercises on the SGI machine boye.pdc.kth.se if you wish. You may have to make minor changes to the codes to have them run on the SGI.

Location of the examples

A Makefile for the Nighthawk nodes is to be found together with the source codes. You should copy this directory to a work catalogue of your own choice. The commands below will copy the OPENMP exercises into a catalogue OPENMP in your working catalogue PDC-school.

mkdir OPENMP-LAB
cd OPENMP-LAB
cp -p /misc/pdc/sp2/HPCsummer2001/OPENMP/Makefile .
cp -p /misc/pdc/sp2/HPCsummer2001/OPENMP/*.* .
You should also open the URL below in your web browser.
http://www.pdc.kth.se/training/Tutor/SMP/OpenMP-LAB/

Among the files is one dummy routine that is used to prevent some compiler optimization at some instances. There is also an attempt to an OpenMP module that can be used for Fortran90 programmers to declare the interfaces to OpenMP functions. In the module is also one way to declare an OpenMP lock type. The module is used by most of these exercise programs.

Answers to the exercises

All exercises has suggestions to how they can be solved. Compare your solution to ours. What are the good things and bad things for the different solutions? If you were not able to solve the exercise, make sure you understand how our suggested solution work. The answers are available at the URL mentioned above.

If you'd like to try our suggested solution, it is probably most easy to use the ``Save As...'' function in your web browser. Save the frame with the solution in ``Text'' mode to make sure that only the program is saved and not any information about font types etc. The solutions have the relevant changes made to the original program marked in bold face.

The first set of exercises

You will start with the file Prog1.F and make changes to it in this exercise. There are four different sections in the code you will work on. Note that there is a capital ``F'' in the file name - it is needed to generate the different subexercises from a single file. Do your changes to your copy of the file without changing its name. You can then recompile it using the make utility after each change you make.

Prog1Ex1 -- Controlling a parallel section

Compile the exercise by executing the command

make Prog1Ex1

Use 4 threads in your work

setenv OMP_NUM_THREADS 4 (csh)
or
export OMP_NUM_THREADS=4 (bash)

Run the resulting program

./Prog1Ex1

Execute the program several times. Sometimes the output is incorrect. Why? Fix the parallelization directive so that the program always prints the expected output - although not necessarily in the same order each time.

Prog1Ex2 -- Controlling the number of threads

The default storage type in OpenMP is shared. In many instances it is very useful to add the clause default(none) to the directives. If you take this as a habit you will always be reminded of what variables are in use and not forget to decide if a variable is to be shared or private.

Compile the next example.

make Prog1Ex2
The resulting program is named Prog1Ex2. Your task here is to make sure that this second part always runs using as many threads as there are CPUs. Your changes to the program should be such that they only affect this part of the program and not the following sections for the next exercise.

Hint: Look at the OpenMP standard subroutines and functions. A copy of the standard document is available if you open the following URL in your web browser

file://afs/pdc.kth.se/public/www/training/Tutor/SMP/OpenMP-Fortran-spec.pdf

Prog1Ex3 -- Parallel sections and implied barriers

Compile and run the example Prog1Ex3. Run it several times. Some times the output line from the master comes out as the fourth line in the exercise and sometimes it comes out as the fifth. If this does not happen, increase the number of threads using OMP_NUM_THREADS.

Try to modify the program so that the master output always comes before the final thread output.

Prog1Ex4 -- Dynamic scheduling

In some environments you may find that it is more beneficial to allow the operating system to control the number of threads actually used in a parallel loop. You can allow for dynamic scheduling by calling the OpenMP routine omp_set_dynamic() with an argument evaluating to .true.. The number of scheduled threads will not exceed the value of omp_get_max_threads(). If this has not been set in your program the value of OMP_NUM_THREADS is used. And if this variable is not set the system supplies a value. This value is different on different systems. On the IBM the value supplied is equal to the number of available processors.

Dynamic scheduling could be of use for example on a shared memory machine with many CPUs and many concurrent users. Try the exercise with some different values for OMP_NUM_THREADS. Does the number of threads that run vary? (It didn't when I tried, so either the system doesn't do dynamic allocation of the number of threads - or the system was not loaded enough when I tried)

Examine the code and make sure you understand what it is that makes the scheduling dynamic, and how the code works. Also, note that the omp_set_dynamic() is not the same kind of control as the environment variable OMP_SCHEDULE='dynamic' (but it is connected to the environment variable OMP_DYNAMIC='TRUE')

The second set of exercises

We will in this section go through a guided parallelization of a simulated application. The application first creates three matrices according to a recursive secret formula. The program then multiplies the matrices with each other a few times. First compile a serial version of the program that you can use to check your results against.

make Prog2Serial
Run the serial program (./Prog2Serial) and save the results in a file so that you can compare your parallel program to this.

Parallel matrix generation -- Prog2Ex1

Look at the section in the code marked Exercise 1. Three matrices are generated. Try to think of a way to parallelize this. Implement this parallelization. Compile your program with the command
make Prog2Ex1
Does it produce the same check sums for the matrices and the same output for x? If not, try again. Hint: The parallelization I think of will not scale beyond three threads.

Parallelizing a do loop -- Prog2Ex2

Compile the second example make Prog2Ex2. This loop makes a matrix multiplication. Parallelize the outer-most loop. Make sure the output is correct when you are done. Run your program several times to try to verify it is correct. Compare your solution to ours (there are several correct ways to do this exercise).

Parallelizing a do loop -- Prog2Ex3

Continue to the section of the code marked Exercise 3. This time try to parallelize the second outer-most loop (the j-loop). Make sure the output is not changed from the output in the serial version.

If this had been a real application, which parallelization would you prefer? Prog2Ex2 or Prog2Ex3? Why?

Doing a matrix multiply -- Prog2Ex4

Compile the example program. Compare the compile and link commands for this program and the serial program. (If you want to recompile the serial program execute rm Prog2Serial ; make Prog2Serial. The libraries essl and esslsmp are the corresponding libraries to the sgimath-libraries used in the RISC processor tuning exercises.

Compare the execution time when you run the SMP program on a varying number of threads. I.e. vary the value of OMP_NUM_THREADS. Also compare what happens when you change the value of the run-time system variable AIXTHREAD_SCOPE. This variable can have two values ``P'' and ``S''. The default value ``P'' causes the threads to be scheduled within your process, while the value ``S'' causes the threads to be scheduled on a system wide basis. For scientific computations the latter is usually to prefer.

A complete program -- twod.f

Finally there is a complete program that solves Poisson's equation available for you to play with. This is the same program as you have earlier used in the MPI session on collective communication.

Compile a serial version by using the command

make twod-serial
The parallel version is built from the same file using the command
make twod
Parallelize the program using OpenMP directives. Compare your solution to the one we made.

Non-trivial parallelization

Reality is usually less attractive than lab exercises. Now that you have become familiar with the directives in OpenMP, you might feel up to a slightly more challenging task.

In the example program Prog4.f a cumulative sum is computed. Compile the serial version of the program using the command make Prog4Serial. Run the program using the command ./Prog4Serial.

The core in this program consists of the loop

do i=2,N 
  A(i,1)=A(i-1,1)+A(i,1)
end do
In this example N is a large number so parallelizing it should be beneficial. Try to make this loop parallel. Run it several times to make sure the result is the same each time you run and that it is the same as for the serial program. You can compile the OpenMP version of the same file Prog4.f using the command make Prog4OMP. The parallel program is run using the command ./Prog4OMP.

The suggested solution contains two slightly different ways to accomplish the task, but uses the same parallelization. Both have about the same performance. The first one is the most appealing since it is most straight forward. The second solution is not the preferred way of accomplishing the task in a real application, but is a good exercise in some of the less often used functions in OpenMP.

Don't parallelize everything

The final example illustrates that it is important that there is enough work being done in a loop if there will be any speed-up. In the example program Prog5.f a histogram is computed. The core in this program consists of the loop:
outer: do i=1,N
  do j=1,NBINS 
    if(A(i,1)<binval(j))then
      bins(j)=bins(j)+1
      cycle outer
    endif
  end do 
end do outer
The array bins(j) will in position j store the number of elements x in A that fullfil

\begin{displaymath}bins(j-1) \le x < bins(j)\end{displaymath}

Compile the serial version with the command make Prog5Serial. Try to parallelize it using OpenMP. Compare the result with the original program. Is the result the same? Is the program faster or slower?

Compile and run our suggested solution. How do they compare to your solution? How do they compare to the original serial code?

This exercise shows that it if the amount of work in a loop is small it is not very suitable to parallelization. However, had the situation been that the loop in the program not only computed the histogram, but also had to compute the values of A(j,1) from some complicated formula, then the loop might have been very well suited for parallelization.



Mike Hammill
2001-08-24