
Advanced Programming Lab: Hybrid OpenMP/MPI Programming


Aim

In this lab exercise (from the Summer school 2010), you will parallelize some simple algorithms using shared memory programming (OpenMP) and distributed memory programming (MPI) simultaneously. While the first two examples provide a simple introduction, the third one is concerned with the hybridization of a code for solving Poisson's equation on a 2D domain using Jacobi iteration. Your task is to experimentally investigate the performance of the hybrid code and compare it to pure OpenMP and MPI codes.

If you are interested in a more advanced application, the multi-zone version of the NAS Parallel Benchmark is provided as well.

You will use Ferlin in these experiments. Each node of Ferlin consists of two quad-core Intel Harpertown 2.66 GHz CPUs, thus providing 8 SMP cores per node.

Preparation

In this lab, you will need an MPI implementation that supports hybrid computing. For the time being, you should use Ferlin and Open MPI, version 1.3 or later. Depending on which compiler (Intel or GNU) you intend to use, issue one of the following commands at the prompt:

Intel

      module add i-compilers/11.1 openmpi/1.4.1-intel

GCC

      module add gcc openmpi/1.4.1-gcc

Note that gcc must be version 4.3 or later. You can check this with the command gcc --version. A C program can be compiled using

Intel

      mpicc -O1 -openmp -o <executable> <source>

GCC

      mpicc -O3 -fopenmp -o <executable> <source>

where <executable> is the name of your executable file and <source> is the file containing your source code. The Fortran compilers are used in the same way: replace mpicc with mpif77 or mpif90.

Note: You can download all of the program files for this lab either by using each of the individual links found in the lab, or by downloading the batch downloader script and executing it on your own machine. Directions on running the batch download script can be found in the script itself.

Using interactive nodes

As usual, you can attach to interactive nodes using the command

      spattach -i -p <nodes>

where <nodes> is the number of nodes you intend to use. Note that you are not guaranteed to actually obtain that many distinct nodes! You can check this by looking at the machine file:

      cat $SP_HOSTFILE

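The number of different host names corresponds to the number of nodes allocated to you. When using OpenMP, be sure not to start more threads on a node than it has cores (8 on Ferlin), counting the threads of all MPI processes running on that node.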

The number of threads allocated for each MPI process can be provided by setting the environment variable OMP_NUM_THREADS. Do not forget to specify it. Otherwise, each process allocates the maximal number (8 on Ferlin) of threads by default even if more than one MPI process is running on one node!

You can run your program the usual way (here using bash),

      
      export OMP_NUM_THREADS=<threads>
      mpirun -np <processes> -bynode -display-map -machinefile $SP_HOSTFILE \
             -x OMP_NUM_THREADS <executable> <parameter>

Do not forget the switch -x OMP_NUM_THREADS. Without it, the variable is not exported to the processes started by mpirun. The parameter -bynode ensures that the MPI processes are placed on different nodes if possible. The parameter -display-map is not necessary but convenient: it makes mpirun print the mapping of processes to nodes. Run mpirun -help for an explanation of all switches. A warning: please do not call mpirun with its full path! It will fail.

Using batch nodes

Batch jobs can be submitted to the queue using the command

      
      esubmit -n <nodes> -t <time> -c <MyUserCAC> ./runjob

where the script file runjob contains the following lines:

      
      #!/bin/bash

      export OMP_NUM_THREADS=6

      module add i-compilers openmpi/1.3-intel

      mpirun -np <processes> -bynode -display-map -machinefile $SP_HOSTFILE \
             -x OMP_NUM_THREADS ./<executable> <parameters>

When creating this file do not forget to make it executable (chmod +x runjob).

Problems

  1. Implement the "Hello World" program from the lecture and run it on interactive nodes. What are the results?
  2. In the course folder, you will find the program trapez.f, or trapez.c, which implements the trapezoidal rule for evaluating the integral $\int_a^b f(x)\,dx$.

    The number of integration nodes (grid points) is deliberately very large so that the serial version runs for more than 2 s (with -O1 optimization). Your task is to compare the performance as a function of the number of processes/threads. Run the program on a single compute node only, so you should use at most 8 threads in total.

    • Use all eight cores for computation:

      np   OMP_NUM_THREADS   Note
      1    8                 pure OpenMP
      2    4
      4    2
      8    1                 pure MPI

    • Use only 6 cores and leave the remaining two cores to system tasks:

      np   OMP_NUM_THREADS   Note
      1    6                 pure OpenMP
      2    3
      3    2
      6    1                 pure MPI

    What are the execution times? Did you manage to speed up things?

  3. A more complex example is the solution of Poisson's equation in 2D. An MPI parallelization is provided in the course folder (oned.c and oned.f, respectively). Do the same experiments as before, but now on a larger number of compute nodes, say 2 or 4. Introduce OpenMP directives and measure the hybrid performance. Remember to check that the new version produces the same output as the old one.

    The program accepts two parameters nx and ny (the number of grid points in the x- and y-directions, respectively). They must be given via the terminal (Fortran version) or as command line parameters (C version). The C version contains a usage description in the comments of the source files.

  4. As an optional task, I provide the source files for the multi-zone version of the NAS Parallel Benchmark here. Comments on how to build and execute the benchmarks are included in the bundle. I recommend that you experiment with problem sizes W and B. What is the best distribution of MPI processes/threads for these sizes?