PDC Summer School 2001
Nils Smeds
The aim of this exercise is to give an introduction to OpenMP programming. The test examples are all written in Fortran90, but the same concepts apply to C programs, although the OpenMP syntax is of course slightly different. Even if you are not a native Fortran programmer, you should be able to understand the examples and instrument them with OpenMP directives.
The exercise consists of five parts. The goal of the first part is to give you some familiarity with the OpenMP syntax by successively adding directives to a small test program. In the second part there is a small dummy program that you will parallelize in a guided manner. Finally, you will be given a complete program that solves the Poisson equation. Your task is then to parallelize this complete program using OpenMP. Should you still have time left, there are some extra exercises you may play with.
There will be several Nighthawk nodes available to you during the lab session. You might want to find a node with fewer users to get more of the resources for your program. The unix command who displays which users are currently logged into the system. The command uptime gives a snapshot of the system load. The program monitor -top shows the system usage in more detail. The "+" and "-" keys control the interval of the display update in the monitor command.
You may also try some of the exercises on the SGI machine boye.pdc.kth.se if you wish. You may have to make minor changes to the codes to have them run on the SGI.
Among the files is one dummy routine that is used to prevent some compiler optimizations in certain places. There is also an attempt at an OpenMP module that Fortran90 programmers can use to declare the interfaces to the OpenMP functions. The module also shows one way to declare an OpenMP lock type. The module is used by most of these exercise programs.
If you'd like to try our suggested solution, it is probably easiest to use the "Save As..." function in your web browser. Save the frame with the solution in "Text" mode to make sure that only the program is saved and not any information about font types etc. The solutions have the relevant changes made to the original program marked in bold face.
You will start with the file Prog1.F and make changes to it in this exercise. There are four different sections in the code you will work on. Note that there is a capital "F" in the file name - it is needed to generate the different subexercises from a single file. Make your changes in your copy of the file without changing its name. You can then recompile it using the make utility after each change you make.
Compile the exercise by executing the command
Use 4 threads in your work
Run the resulting program
Execute the program several times. Sometimes the output is incorrect. Why? Fix the parallelization directive so that the program always prints the expected output - although not necessarily in the same order each time.
The default data-sharing attribute in OpenMP is shared. In many instances it is very useful to add the clause default(none) to the directives. If you make this a habit, you will always be reminded of which variables are in use, and you will not forget to decide whether each variable is to be shared or private.
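As an illustration of the default(none) habit, here is a minimal sketch. It is not taken from Prog1.F; the program name and the variables me and nthreads are made up for this example, and it assumes a compiler that provides the standard omp_lib module instead of the module supplied with the exercise files. A thread-number variable that is accidentally left shared is one typical cause of output that is only sometimes wrong.

```fortran
program default_none_sketch
  use omp_lib            ! assumption: the compiler ships the standard omp_lib module
  implicit none
  integer :: me, nthreads

  nthreads = 0
  ! default(none) makes the compiler reject any variable that is used in
  ! the region without being listed as private or shared.
  !$omp parallel default(none) private(me) shared(nthreads)
  me = omp_get_thread_num()          ! private: each thread has its own copy
  !$omp single
  nthreads = omp_get_num_threads()   ! shared: written by one thread only
  !$omp end single
  ! The implied barrier at "end single" makes nthreads visible to all threads.
  print '(a,i0,a,i0)', 'Hello from thread ', me, ' of ', nthreads
  !$omp end parallel
end program default_none_sketch
```

If me is removed from the private clause, the compiler reports an error instead of silently sharing it between the threads.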
Compile the next example.
Hint: Look at the OpenMP standard subroutines and functions. A copy of the standard document is available if you open the following URL in your web browser
Compile and run the example Prog1Ex3. Run it several times. Sometimes the output line from the master comes out as the fourth line of the output and sometimes it comes out as the fifth. If this does not happen, increase the number of threads using OMP_NUM_THREADS.
Try to modify the program so that the master output always comes before the final thread output.
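One possible approach, shown here on a toy program and not on Prog1Ex3 itself, is an explicit barrier. The sketch assumes that the master output is produced inside a master construct, which has no implied barrier at its end; the program and variable names are invented for the illustration, and the standard omp_lib module is assumed to be available.

```fortran
program master_first_sketch
  use omp_lib            ! assumption: standard omp_lib module is available
  implicit none
  integer :: me

  !$omp parallel default(none) private(me)
  me = omp_get_thread_num()

  !$omp master
  print '(a)', 'Master output'
  !$omp end master
  ! "end master" does not synchronize the team. The explicit barrier
  ! keeps the other threads from running ahead of the master.
  !$omp barrier

  print '(a,i0)', 'Final output from thread ', me
  !$omp end parallel
end program master_first_sketch
```

The final lines may still appear in any order relative to each other; only their position after the master line is fixed.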
In some environments you may find it more beneficial to let the system control the number of threads actually used in a parallel loop. You can allow for dynamic scheduling of threads, that is, dynamic adjustment of the number of threads, by calling the OpenMP routine omp_set_dynamic() with an argument evaluating to .true.. The number of scheduled threads will not exceed the value of omp_get_max_threads(). If the number of threads has not been set in your program with omp_set_num_threads(), the value of OMP_NUM_THREADS is used, and if this variable is not set the system supplies a value. This value is different on different systems. On the IBM the value supplied is equal to the number of available processors.
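The calls mentioned above can be tried in a small stand-alone program. This is a sketch under the assumption that the standard omp_lib module is available; the program and variable names are hypothetical and it is not the exercise code. An implementation is free to ignore the request, in which case omp_get_dynamic() reports false.

```fortran
program dynamic_threads_sketch
  use omp_lib            ! assumption: standard omp_lib module is available
  implicit none
  integer :: nthreads

  call omp_set_dynamic(.true.)   ! allow the runtime to pick the team size
  print '(a,i0)', 'Upper bound on team size:   ', omp_get_max_threads()
  print '(a,l1)', 'Dynamic adjustment enabled: ', omp_get_dynamic()

  nthreads = 0
  !$omp parallel default(none) shared(nthreads)
  !$omp single
  nthreads = omp_get_num_threads()   ! size of the team that was started
  !$omp end single
  !$omp end parallel

  print '(a,i0)', 'Threads actually used:      ', nthreads
end program dynamic_threads_sketch
```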
Dynamic scheduling could be of use for example on a shared memory machine with many CPUs and many concurrent users. Try the exercise with some different values for OMP_NUM_THREADS. Does the number of threads that run vary? (It didn't when I tried, so either the system doesn't do dynamic allocation of the number of threads - or the system was not loaded enough when I tried)
Examine the code and make sure you understand what it is that makes the scheduling dynamic, and how the code works. Also, note that omp_set_dynamic() is not the same kind of control as the environment variable OMP_SCHEDULE='dynamic' (but it is connected to the environment variable OMP_DYNAMIC='TRUE').
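To make the distinction concrete: OMP_SCHEDULE controls how the iterations of a loop are handed out to an already existing team, and it only takes effect for loops declared with schedule(runtime). The following sketch uses invented names (owner, n) and assumes the standard omp_lib module; it records which thread executed each iteration.

```fortran
program loop_schedule_sketch
  use omp_lib            ! assumption: standard omp_lib module is available
  implicit none
  integer, parameter :: n = 16
  integer :: i, owner(n)

  ! schedule(runtime) reads the loop schedule from OMP_SCHEDULE.
  ! This decides which thread gets which iteration, not how many threads run.
  !$omp parallel do default(none) shared(owner) private(i) schedule(runtime)
  do i = 1, n
     owner(i) = omp_get_thread_num()
  end do
  !$omp end parallel do

  print '(a,16(1x,i0))', 'Iteration owners:', owner
end program loop_schedule_sketch
```

Running it once with OMP_SCHEDULE set to static and once with dynamic typically gives different ownership patterns, while the team size stays the same.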
In this section we will go through a guided parallelization of a simulated application. The application first creates three matrices according to a recursive secret formula. The program then multiplies the matrices with each other a few times. First compile a serial version of the program that you can use to check your results against.
If this had been a real application, which parallelization would you prefer? Prog2Ex2 or Prog2Ex3? Why?
Compare the execution time when you run the SMP program on a varying number of threads, that is, vary the value of OMP_NUM_THREADS. Also compare what happens when you change the value of the run-time system variable AIXTHREAD_SCOPE. This variable can have two values, "P" and "S". The default value "P" causes the threads to be scheduled within your process, while the value "S" causes the threads to be scheduled on a system-wide basis. For scientific computations the latter is usually preferable.
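For such timing comparisons, the wall-clock time of a parallel section can be measured with omp_get_wtime(). The sketch below is a stand-in and not the exercise program: it times a plain matrix multiplication on constant matrices, with all names (a, b, c, n) chosen for the illustration and the standard omp_lib module assumed.

```fortran
program timing_sketch
  use omp_lib            ! assumption: standard omp_lib module is available
  implicit none
  integer, parameter :: n = 300
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  double precision :: t0, t1
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0d0
  b = 2.0d0
  c = 0.0d0

  t0 = omp_get_wtime()
  ! The columns of c are independent, so the outer j loop is split
  ! among the threads.
  !$omp parallel do default(none) shared(a, b, c) private(i, j, k)
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
     end do
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()

  print '(a,i0)', 'Max threads: ', omp_get_max_threads()
  print '(a,f10.4)', 'Elapsed seconds: ', t1 - t0
  ! Every element of c should equal 2*n for these constant inputs.
  print '(a,l1)', 'Result correct: ', all(c == 2.0d0*n)
end program timing_sketch
```

Run it with different values of OMP_NUM_THREADS and compare the elapsed times.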
Compile a serial version by using the command
In the example program Prog4.f a cumulative sum is computed. Compile the serial version of the program using the command make Prog4Serial. Run the program using the command ./Prog4Serial.
The core of this program is the loop
do i=2,N
   A(i,1)=A(i-1,1)+A(i,1)
end do
The suggested solution contains two slightly different ways to accomplish the task, both based on the same parallelization. Both have about the same performance. The first one is the most appealing since it is the most straightforward. The second solution is not the preferred way of accomplishing the task in a real application, but is a good exercise in some of the less often used functions in OpenMP.
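Because each iteration of the cumulative sum depends on the previous one, the loop cannot simply be marked as a parallel loop. One common technique, which is not necessarily the one used in the suggested solution, is a two-pass block scan: every thread sums its own block, the block totals are combined, and each block is then shifted by the total of the blocks before it. The sketch below uses a one-dimensional integer array and invented names (offset, chunk, lo, hi), and assumes the standard omp_lib module.

```fortran
program cumsum_sketch
  use omp_lib            ! assumption: standard omp_lib module is available
  implicit none
  integer, parameter :: n = 1000
  integer :: a(n), ref(n)
  integer, allocatable :: offset(:)
  integer :: i, me, nt, lo, hi, chunk

  a = 1
  ref = a
  do i = 2, n                        ! serial reference result
     ref(i) = ref(i-1) + ref(i)
  end do

  allocate(offset(0:omp_get_max_threads()))
  offset = 0

  !$omp parallel default(none) shared(a, offset) &
  !$omp&   private(i, me, nt, lo, hi, chunk)
  me = omp_get_thread_num()
  nt = omp_get_num_threads()
  chunk = (n + nt - 1) / nt
  lo = me*chunk + 1
  hi = min(n, lo + chunk - 1)

  ! Pass 1: each thread computes the cumulative sum of its own block.
  do i = lo + 1, hi
     a(i) = a(i-1) + a(i)
  end do
  if (lo <= hi) offset(me+1) = a(hi)   ! total of block number me
  !$omp barrier

  ! One thread turns the block totals into running offsets.
  !$omp single
  do i = 1, nt
     offset(i) = offset(i-1) + offset(i)
  end do
  !$omp end single

  ! Pass 2: add the sum of all preceding blocks to the local block.
  do i = lo, hi
     a(i) = a(i) + offset(me)
  end do
  !$omp end parallel

  print '(a,l1)', 'Matches serial result: ', all(a == ref)
end program cumsum_sketch
```

The scheme does roughly twice the arithmetic of the serial loop, which is part of the reason why the speedup for such a cheap loop is limited.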
outer: do i=1,N
   do j=1,NBINS
      if(A(i,1)<binval(j))then
         bins(j)=bins(j)+1
         cycle outer
      endif
   end do
end do outer
Compile and run our suggested solution. How does it compare to your solution? How does it compare to the original serial code?
This exercise shows that a loop with only a small amount of work is not very well suited to parallelization. However, had the loop in the program not only computed the histogram but also had to compute the values of A(i,1) from some complicated formula, then the loop might have been very well suited for parallelization.
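For reference, one way to parallelize a histogram loop of this shape is an array reduction on bins, so that each thread counts into a private copy that is summed at the end. This is a sketch and not the suggested solution: it uses a one-dimensional array filled with random numbers, evenly spaced bin edges chosen for the illustration, and assumes a compiler that supports array reductions in Fortran.

```fortran
program histogram_sketch
  implicit none
  integer, parameter :: n = 100000, nbins = 10
  double precision :: a(n), binval(nbins)
  integer :: bins(nbins), i, j

  call random_number(a)               ! values in [0,1)
  do j = 1, nbins
     binval(j) = dble(j) / nbins      ! upper bin edges 0.1, 0.2, ..., 1.0
  end do
  bins = 0

  ! reduction(+:bins) gives every thread a private, zeroed copy of bins
  ! and adds the copies together when the loop ends.
  !$omp parallel do default(none) shared(a, binval) private(i, j) &
  !$omp&   reduction(+:bins)
  outer: do i = 1, n
     do j = 1, nbins
        if (a(i) < binval(j)) then
           bins(j) = bins(j) + 1
           cycle outer
        end if
     end do
  end do outer
  !$omp end parallel do

  print '(a,10(1x,i0))', 'bins:', bins
  ! Every value is below the last edge, so the counts add up to n.
  print '(a,l1)', 'All values counted: ', sum(bins) == n
end program histogram_sketch
```

With so little work per iteration, the cost of starting the threads and combining the private copies can easily outweigh the gain.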