Advanced Programming Lab: Hybrid OpenMP/MPI Programming
Aim
In this lab exercise (from the Summer school 2010), you will parallelize some simple algorithms using shared memory programming (OpenMP) and distributed memory programming (MPI) simultaneously. While the first two examples provide a simple introduction, the third one is concerned with the hybridization of a code for solving Poisson's equation on a 2D domain using Jacobi iteration. Your task is to experimentally investigate the performance of the hybrid code and compare it to pure OpenMP and MPI codes.
If you are interested in a more advanced application, the NAS Parallel Benchmark (multi-zone) is provided in order to test a more advanced code.
You will use Ferlin in these experiments. Each node of Ferlin consists of two quad-core Intel Harpertown 2.66 GHz CPUs, thus providing 8 SMP cores.
Preparation
In this lab, you will need an MPI implementation which allows for hybrid computing. For the time being, you should use Ferlin and OpenMPI at least in version 1.3. Depending on which compiler (Intel or GNU) you intend to use, issue the following commands at the prompt:
- Intel
module add i-compilers/11.1 openmpi/1.4.1-intel
- GCC
module add gcc openmpi/1.4.1-gcc
Note that gcc must be at least version 4.3. You can check this
with the command gcc --version. A
C program can be compiled using
- Intel
mpicc -O1 -openmp -o <executable> <source>
- GCC
mpicc -O3 -fopenmp -o <executable> <source>
where <executable> is the name of your
executable file and <source> contains your
source code. The Fortran compilers are used in the same way:
replace mpicc with mpif77 or
mpif90.
Note: You can download all of the program files for this lab either by using each of the individual links found in the lab, or by downloading the batch downloader script and executing it on your own machine. Directions on running the batch download script can be found in the script itself.
Using interactive nodes
As usual, you can attach to interactive nodes using the command
spattach -i -p <nodes>
where <nodes> is the number of nodes you
intend to use. Note that you are not guaranteed to obtain
that many distinct nodes! You can check
this by looking at the machine file:
cat $SP_HOSTFILE
The number of different host names corresponds to the number of nodes allocated to you. When using OpenMP be sure not to allocate too many threads.
The number of threads allocated for each MPI process can be
provided by setting the environment variable
OMP_NUM_THREADS. Do not forget to specify it.
Otherwise, each process allocates the maximal number (8 on
Ferlin) of threads by default even if more than one MPI process
is running on one node!
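As a small sanity check for the thread settings described above, a program can ask the OpenMP runtime how many threads it would start and compare that against the cores available. The following is only a sketch: the helper names threads_per_process and oversubscribed are hypothetical and not part of the lab's sources, and the core count (8 on Ferlin) has to be passed in by the caller.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of threads the OpenMP runtime would use for the next parallel
   region. This reflects OMP_NUM_THREADS if it is set. When the code is
   compiled without OpenMP support, there is exactly one thread. */
int threads_per_process(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Returns 1 if the MPI processes placed on one node, each running
   `threads` OpenMP threads, would need more cores than the node has
   (hypothetical helper; cores_per_node would be 8 on Ferlin). */
int oversubscribed(int procs_on_node, int threads, int cores_per_node)
{
    return procs_on_node * threads > cores_per_node;
}
```

For example, two processes on one Ferlin node with the default of 8 threads each give oversubscribed(2, 8, 8) == 1, whereas 2 processes with 4 threads each fit exactly.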
You can run your program the usual way (here using
bash),
export OMP_NUM_THREADS=<threads>
mpirun -np <processes> -bynode -display-map -machinefile $SP_HOSTFILE \
-x OMP_NUM_THREADS <executable> <parameter>
Do not forget the switch -x OMP_NUM_THREADS.
Otherwise, it will not be exported to the executable. The
parameter -bynode ensures that the MPI processes are
placed on different nodes if possible. The parameter
-display-map is not necessary but convenient. It
makes mpirun print out the allocation of
processes and threads to nodes. Run mpirun -help for
an explanation of all switches. A warning: Please do not call
mpirun with its full path! It will fail.
Using batch nodes
Batch jobs can be submitted to the queue using the command
esubmit -n <nodes> -t <time> -c <MyUserCAC> ./runjob
where the script file runjob contains the
following lines:
#!/bin/bash
export OMP_NUM_THREADS=6
module add i-compilers openmpi/1.3-intel
mpirun -np <processes> -bynode -display-map -machinefile $SP_HOSTFILE \
-x OMP_NUM_THREADS ./<executable> <parameters>
When creating this file do not forget to make it executable
(chmod +x runjob).
Problems
- Implement the "Hello World" program from the lecture and run it on interactive nodes. What are the results?
- In the course folder, you will find the program
trapez.f, or trapez.c, which implements the trapezoidal rule for
evaluating the integral
The number of quadrature nodes is chosen excessively large so that the serial version runs for more than 2 s (with -O1 optimization). Your task is to compare the performance as a function of the number of processes/threads. Run the program on one compute node only, so you should have at most 8 threads.
- Use all eight cores for computation:
np   OMP_NUM_THREADS
1    8    (pure OpenMP)
2    4
4    2
8    1    (pure MPI)
- Use only 6 cores and leave the remaining two cores to
system tasks.
np   OMP_NUM_THREADS
1    6    (pure OpenMP)
2    3
3    2
6    1    (pure MPI)
What are the execution times? Did you manage to speed up things?
- A more complex example is the solution of the 2D Poisson
equation. An MPI parallelization is provided in the course
folder (oned.c and oned.f,
respectively). Do the same experiments as before, but now with
a larger number of compute nodes, say 2 or 4. Introduce OpenMP
directives and measure the hybrid performance. Remember to check
that the new version has the same output as the old one.
The program accepts two parameters
nx and ny (the number of grid points in the x- and y-directions, respectively). They must be given via the terminal (Fortran version) or as command line parameters (C version). The C version contains a usage description in the comments of the source files.
- As an optional task, I provide the source files for the multi-zone version of the NAS Parallel Benchmark here. Comments on how to build and execute the benchmarks are included in the bundle. I recommend that you experiment with problem sizes W and B. What is the best distribution of MPI processes/threads for these sizes?
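For the trapezoidal rule problem, the two levels of parallelism can be sketched as follows: the intervals are first divided into contiguous slices, one per MPI process, and each process then sums its slice with an OpenMP reduction. This is a sketch under assumptions, not the code in trapez.c: the integrand f is a stand-in (it integrates to pi on [0,1]), the function names are hypothetical, and the loop over ranks in trapez_total only imitates what MPI_Reduce with MPI_SUM would do across real processes.

```c
/* Stand-in integrand (assumption): the integral of 4/(1+x^2) over [0,1] is pi. */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Trapezoidal sum over the slice of [a,b] owned by `rank` out of `nprocs`
   processes; n is the total number of intervals. In an MPI program, rank
   and nprocs would come from MPI_Comm_rank and MPI_Comm_size. */
double trapez_partial(double a, double b, long n, int rank, int nprocs)
{
    double h = (b - a) / (double)n;
    long base = n / nprocs;
    long rem = n % nprocs;
    /* The first `rem` ranks get one extra interval each. */
    long first = rank * base + (rank < rem ? rank : rem);
    long count = base + (rank < rem ? 1 : 0);
    double sum = 0.0;

    /* Thread-level parallelism: each thread accumulates a private sum,
       and the partial sums are combined at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = first; i < first + count; i++) {
        double x0 = a + (double)i * h;
        sum += 0.5 * (f(x0) + f(x0 + h));
    }
    return sum * h;
}

/* Serial stand-in for the process-level reduction: adds the partial
   results of all ranks, as MPI_Reduce with MPI_SUM would. */
double trapez_total(double a, double b, long n, int nprocs)
{
    double total = 0.0;
    for (int rank = 0; rank < nprocs; rank++)
        total += trapez_partial(a, b, n, rank, nprocs);
    return total;
}
```

Because every rank handles a disjoint range of intervals, the result should not depend on how the work is split between processes and threads, apart from rounding. That property is a convenient check when timing the different np/OMP_NUM_THREADS combinations.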

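For the Poisson problem, the usual place for an OpenMP directive is the loop nest of the Jacobi sweep that each MPI process performs on its own strip of the grid. The sketch below makes several assumptions that may not match oned.c: the function name jacobi_sweep is hypothetical, the array is stored row-major with its first and last rows and columns acting as boundary or ghost cells, and the equation is taken as -Laplace(u) = f on a uniform grid with spacing h.

```c
#include <stddef.h>

/* One Jacobi sweep over the interior points of a rows x cols array.
   Reads the old iterate u, writes the new iterate unew, and returns the
   sum of squared changes on this strip. In an MPI code, the ghost rows
   would be exchanged before the sweep and the returned value combined
   across processes, for example with MPI_Allreduce. */
double jacobi_sweep(const double *u, double *unew, const double *f,
                    int rows, int cols, double h)
{
    double diff = 0.0;

    /* Rows are independent because only the old iterate u is read,
       so the outer loop can be shared among the threads. */
    #pragma omp parallel for reduction(+:diff)
    for (int i = 1; i < rows - 1; i++) {
        for (int j = 1; j < cols - 1; j++) {
            size_t k = (size_t)i * cols + j;
            double v = 0.25 * (u[k - cols] + u[k + cols]
                               + u[k - 1] + u[k + 1]
                               + h * h * f[k]);
            diff += (v - u[k]) * (v - u[k]);
            unew[k] = v;
        }
    }
    return diff;
}
```

Since the sweep only reads u and only writes unew, the threaded version should produce the same iterates as the serial loop, which makes it straightforward to compare the output of the hybrid code with that of the pure MPI version.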

