Compiling and Running Programs
When developing and running software on the Ekman cluster there are some details to keep in mind:
- Each node has two AMD Opteron 2374HE CPUs, hence you should consider adapting both your choice of compiler and your compiler flags to that architecture.
- Multi-CPU AMD Opteron systems in general (and Ekman-nodes in particular) are ccNUMA architectures, which makes memory affinity important for optimal application performance. On Ekman each CPU has 8 GB of DDR2 RAM directly attached (for a total of 16 GB RAM per node). While the effects are mitigated by the cache hierarchy, directly attached memory is faster to access than memory attached to the other CPU (a quick way to inspect this layout is shown after this list).
- Each CPU has 4 cores for a total of 8 cores per node. Cores can be considered independent compute units. However, it is important to realize that cores share (and thus sometimes compete for) finite resources, the most important being memory and interconnect bandwidth.
- Ekman is a distributed memory system, hence efficient parallelisation has to be done with message passing. For that purpose the user environment on Ekman makes several different MPI-implementations available.
- The interconnect is a full bisection bandwidth (FBB) Infiniband fabric with a multiple root tree structure. All links are 4xDDR, making the per-link bandwidth 2 GB/s.
The consequences of these details will be covered further below.
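For the curious, the NUMA layout described above can be inspected on a node with the numactl tool (assuming numactl is installed on the nodes; this example is not from the original page):

numactl --hardware

This prints the NUMA nodes (on Ekman, the two CPUs), the amount of memory attached to each, and a table of relative access distances between them.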
Compiling on Ekman
Gaining Access to Compilers and Libraries
PDC uses a system called Environment Modules to give access to specific versions of installed tools and applications. Consequently a compiling session will start with commands that choose and set up the environment for the selected combination of compiler and support libraries. The following sample will load the mpi-module of the system. Loading this module will give you the preferred MPI and compiler version for the system you are on (i.e. Ekman).
module add mpi

A particular version is "preferred" due to several different criteria, for instance:
- Perceived stability.
- Level of optimization towards the system.
- Availability of wanted features.
The main benefit of using the general mpi-module is simplicity; the main drawback is that the preferred version may not be the best match for your application when judged by the criteria above. If you require a specific version, you can try:
module avail openmpi mvapich mvapich2 mpich mpich2
for a list of available MPI-implementations and
module avail i-compilers gcc pgi pathscale nag

for a list of available compilers.
Once the mpi-module is loaded you will have access to MPI-versions of compilers for C, C++, F77 and F90, not surprisingly called mpicc, mpic++, mpif77 and mpif90.
To verify that the module command worked as expected, you can try:
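The command and its output are missing from this page; a plausible check (assuming the standard Environment Modules tooling) is to list the currently loaded modules:

module list

This prints the loaded modulefiles, for instance an entry along the lines of mpi/openmpi/1.3rc1-gcc (this exact module name is illustrative).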
From such output it can be surmised that the currently preferred version is OpenMPI version 1.3rc1 using GCC (GNU Compiler Collection).
Compiling your program
When you have loaded the proper module or modules you can compile your program as follows:
mpicc -o sample sample.c
mpif90 -o sample sample.f90
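For reference, a minimal MPI program of the kind these commands would compile might look as follows (a sketch; the original page does not show sample.c itself):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* Initialize the MPI runtime, then query this process' rank
       and the total number of processes in the job. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}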
The specific flags to use are often compiler-dependent. The following is a list of flags you may consider adding:
-mtune=barcelona: this will tune the application to the CPU-type in the Ekman-nodes (this is the GCC spelling of the flag).
-tp barcelona: this will tune the application to the CPU-type in the Ekman-nodes (this is the corresponding PGI flag).
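For example, a tuned build with the GCC-based wrappers could look like this (the optimization level -O2 is illustrative, not a recommendation from the original page):

mpicc -O2 -mtune=barcelona -o sample sample.c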
Running your program
The exact way of running your program depends on the MPI-version that is used; the sample below is valid for OpenMPI (more samples will be forthcoming). When you have loaded the proper module or modules you can run your program as follows:

mpirun -machinefile hostfile -n 2 ./sample
This will start the program "sample" with two processes on the nodes specified in hostfile.
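The hostfile itself is a plain text file with one node name per line; with OpenMPI each line can also state how many processes the node should accept. A minimal sketch (the node names are illustrative):

node-a01 slots=8
node-a02 slots=8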
Processes per node

The optimal choice of processes per node is highly dependent on the application. It is strongly advised that you experiment and find the best setting for your particular application. As noted above, each node in Ekman has 8 cores. It seldom makes sense to start more processes than cores on a single node; rather, on multicore systems it is not uncommon to see increased performance from starting fewer.
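One way to run fewer processes per node (a sketch; with OpenMPI you can alternatively use its -npernode option) is to lower the slot counts in the hostfile, e.g. slots=4 on each of two nodes, and then start one process per slot:

mpirun -machinefile hostfile -n 8 ./sample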
The specific flags to use for mpirun are MPI-implementation-dependent. The following is a list of flags you may consider adding:
-mca mpi_paffinity_alone 1: this will bind the processes in the job to specific cores, possibly increasing cache performance. It will also bind memory to processes in a NUMA-aware fashion, increasing the likelihood of efficient use of the memory bandwidth. See the OpenMPI FAQ for more information on how to set affinity flags.
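Combined with the earlier example, a run with core binding enabled would look like:

mpirun -mca mpi_paffinity_alone 1 -machinefile hostfile -n 8 ./sample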
Running an MPI-job in the Queue
The easiest way to run an MPI-job in the queue-system is to submit a small script that loads the required modules and then runs mpirun. If your shell is csh or tcsh this script can be written as follows:
#!/bin/tcsh
module add mpi
cd ~/my-program-lives-here/
mpirun -machinefile $SP_HOSTLIST -n 2 ./sample
If it is bash it can be written as follows:
#!/bin/bash
module add mpi
cd ~/my-program-lives-here/
mpirun -machinefile $SP_HOSTLIST -n 2 ./sample
It is recommended to write the script using the same interpreter as your login-shell. You can then submit your job to the queue as follows:
module add easy
esubmit -n 2 -t 60 -c mycac ./my-wrapper-script
for a request of 2 nodes for 60 minutes, charging the time allocation (CAC) mycac. The queuing system will start my-wrapper-script on exactly one of the allocated nodes. The script will in turn use mpirun to start the job processes across the allocated nodes.