Some notes on terminology

The term "node" is used in the Linux kernel NUMA sense of a CPU, possibly with some local memory attached. It should not be confused with nodes in a cluster; those are referred to as "cluster nodes". For simplicity the presence of local memory on a node is assumed. "System" refers to a NUMA machine.

The MPICH devices use different terms to denote the data structure used to communicate data between processes. That type of data structure is uniformly referred to as a "packet" here. A "message" is the data buffer to send, and possibly a data structure encapsulating it.

The term "local memory" here does not indicate that the memory is non-shared, merely that it lives on the same node as the process referencing it. Non-shared memory is called "private memory" here. "Remote memory" is memory on another node, shared or non-shared. "Global" signifies data that is shared by all processes.

Unless otherwise stated, "page-aligned" means that the start address of the memory area coincides with a page boundary AND that the size of the area is a whole number of pages, so that the first byte after the end of the area lies in a new page.

Linux Kernel

The three main characteristics of the Linux 2.6 kernel that are of interest to us are the following:

A. Processes have weak processor affinity. The scheduler tries to keep processes running on the same node if possible, but will move processes between nodes if it considers it necessary.

B. A first-touch page allocation policy. Memory pages owned by a process are allocated on the node where they are first referenced (written to for anonymous pages, read from or written to for named pages), if possible. If the node on which the process is running has no free memory left, the memory will be allocated from the topologically closest node that has enough free memory. The determination of which node is closest is made at boot time. There are three cases to consider.

Anonymous virtual memory that has been inherited from the parent process is mapped to the same physical page frames used in the parent process, and the page frames are marked as copy-on-write (COW). When either the parent or the child process writes to a page it will break COW and a new page frame will be allocated for that process. This means that if the parent process breaks COW for a page, both copies of that page will end up living on the node that the parent process runs on.

Anonymous virtual memory that has been newly allocated from the kernel is initially mapped to the copy-on-write "zero" page. A write to a virtual address then breaks COW and a physical page frame will be allocated for that address on the node on which the touching process runs.

Named pages (pages that are backed by a file or a swap device) are allocated on the node where the process that first tries to read or write the page runs.

C. No page migration. Pages that have been touched will stay on the node they were allocated on until they are deallocated or paged out to secondary storage.

In essence, processes have soft affinity and memory has hard affinity.
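As an aside, the first-touch behaviour in B is easy to observe. The following minimal sketch (illustrative, not MPICH code) pins the calling process to CPU 0, creates a shared anonymous mapping of the kind MPICH uses for its arena, touches one page, and then asks the kernel where the page ended up using get_mempolicy() from the NUMA API found in newer 2.6 kernels (link with -lnuma):

    /* Illustrative sketch of first-touch placement; assumes a kernel with
     * sched_setaffinity() and the NUMA API (get_mempolicy). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numaif.h>

    int main(void)
    {
        cpu_set_t mask;
        long page = sysconf(_SC_PAGESIZE);
        char *buf;
        int node = -1;

        /* Pin this process to CPU 0 so the first touch happens on a known node. */
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        /* Shared anonymous mapping, as used for the MPICH shared memory arena. */
        buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        buf[0] = 1;   /* first touch: the physical page frame is allocated now */

        /* Ask which node the page was placed on. */
        get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR);
        printf("page placed on node %d\n", node);
        return 0;
    }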
MPICH

The MPICH version studied here is 1.2.6. We looked at the two MPICH devices most likely to be used on a NUMA system running Linux: the pure shared memory device ch_shmem and the combined shared memory / network device ch_p4.

MPICH ch_shmem internals

In the ch_shmem device all processes communicate through a shared memory structure called MPID_shmem (of type MPID_SHMEM_globmem, ch_shmem/shdef.h). MPID_shmem contains an array of packet queues (MPID_shmem->incoming), one queue for each process, an array of stacks of available packets (MPID_shmem->avail), various locks and a pool of packets (MPID_shmem->pool). Packets have a "next" pointer that is used to link packets into linked lists on the queues and stacks. To alleviate pressure on the MPID_shmem structure, each process has a private structure called MPID_lshmem (of type MPID_SHMEM_lglobmem, ch_shmem/shdef.h) with copies of pointers to the elements of MPID_shmem.

Initialization

When the application starts, a shared memory arena is set up for the internal memory allocation interface (ch_shmem/p2p.c). With new Linux kernels this is accomplished by creating a new anonymous, shared virtual memory region with mmap() (p2p_init(), ch_shmem/p2p.c). Memory for MPID_shmem is allocated from the arena and initialized (MPID_SHMEM_init(), ch_shmem/shmempriv.c). The packet pool is partitioned among the processes and each avail-stack is initialized by linking together the packets in the corresponding pool partition into a list, with MPID_shmem->avail[n] pointing to the head of the list allocated for process n. After this all child processes are spawned using fork() (p2p_create_procs(), ch_shmem/p2pprocs.c).

Sending

Sending is accomplished by fetching a free packet from the process' avail-stack, blocking until one is available (MPID_SHMEM_GetSendPkt(), ch_shmem/shmempriv.c), copying the data into the packet using memcpy(), and then inserting the packet into the incoming-queue of the destination process (MPID_SHMEM_SendControl(), ch_shmem/shmempriv.c).

Receiving

When receiving a packet (one is available on the incoming-queue), the data is copied out of the packet, and the packet is removed from the incoming-queue (by MPID_SHMEM_FreeRecvPkt(), ch_shmem/shmempriv.c) and placed in a free packet cache list (FreePkts[], ch_shmem/shmempriv.c) that is private to the process. The free packet cache is flushed (by MPID_SHMEM_FlushPkts(), ch_shmem/shmempriv.c) after a certain number of packets have been put on it; the packets in the cache are then put on top of the avail-stack of the process that originally sent each packet.

Problems with ch_shmem

Even though ch_shmem keeps packets separate for each process, the location of the packets in physical memory becomes a performance problem on Linux NUMA systems. With the default memory affinity policy of Linux, the first process to write to a page "wins" it: the page gets allocated on the node on which that process ran. In the initialization of ch_shmem all packets in the pool are touched when their "next" pointers are set to form the linked lists. This causes all the memory pages in the pool, and therefore all the packets, to live permanently on the node on which the setup code was executed. This means that all processes in the system must access memory on the startup node every time a message is sent or received. This can limit performance with respect to both bandwidth and latency. Furthermore, with the default soft processor affinity processes sometimes move between nodes but the memory does not, compounding the problem and giving non-deterministic performance.

Suggested modifications to ch_shmem

In order to get predictable performance and make sure that processes don't migrate away from their data, processes should be locked to nodes so that they display the same hard affinity as memory. In newer Linux kernels this can be accomplished with the sched_setaffinity() system call.
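A minimal sketch of what such pinning could look like is shown below. The helper name, the node-to-CPU mapping and CPUS_PER_NODE are illustrative assumptions, not MPICH code; on a real system the CPUs belonging to a node would be taken from the machine's topology description.

    /* Illustrative helper: pin the calling process to the CPUs of one node
     * and yield so the scheduler moves it there immediately.
     * bind_to_node() and CPUS_PER_NODE are assumptions, not MPICH names. */
    #define _GNU_SOURCE
    #include <sched.h>

    #define CPUS_PER_NODE 4   /* assumed topology: 4 CPUs per node */

    static void bind_to_node(int node)
    {
        cpu_set_t mask;
        int cpu;

        CPU_ZERO(&mask);
        for (cpu = node * CPUS_PER_NODE; cpu < (node + 1) * CPUS_PER_NODE; cpu++)
            CPU_SET(cpu, &mask);

        sched_setaffinity(0, sizeof(mask), &mask);  /* 0 = calling process */
        sched_yield();  /* reschedule now rather than at the end of the time-slice */
    }

    int main(void)
    {
        bind_to_node(0);   /* example: pin to node 0 */
        return 0;
    }

Called with the process's ID modulo the number of nodes, right after the ID is known, such a helper would spread the processes evenly over the nodes, as discussed next.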
It is important that the affinity is set as early as possible in the startup process, before much private memory has been touched or allocated. In order to make sure that the processes are evenly distributed among the available nodes, the affinity cannot be set until the process has discovered its ID (MPID_myid). This determines the earliest point at which the affinity can be set. For the child processes this can be done in p2p_create_procs() (ch_shmem/p2pprocs.c) right after MPID_myid is set from the sequence number MPID_shmem->globid. For the initial process, affinity can be set to node 0 at the very beginning of MPID_SH_InitMsgPass() (ch_shmem/shmeminit.c), before the shared memory arena is initialized. To force a rescheduling of the process, a sched_yield() call should directly follow the setting of the affinity. Otherwise the process could continue to execute on the current (but wrong) node until its current scheduler time-slice runs out.

Any time a message is sent, data is copied from a user buffer in the sending process (A) to a packet in shared memory, and then from shared memory into a user buffer in the receiving process (B). When the packet lives in memory local to the node where either A or B runs, the latency will be minimal (1). Distributing the packets among the nodes also improves total bandwidth, as more memory interfaces will be utilized for communication and communication will not stress any particular node unevenly (unless the communication pattern is irregular).

Most of the code needed to get processes to use packets in local memory is already present in MPICH. Any given process will always use the same set of packets for sending, and those sets are disjoint across processes. We just need to make sure that the shared memory area occupied by the packets used by a process gets allocated to physical memory on the same node that the process will run on. Since memory gets assigned to nodes at the granularity of pages, we need to make sure that:

A) the packet pool is page-aligned

B) the pool partitions allocated to processes occupy a whole number of pages in memory

C) the pages in any pool partition are first written to by the process that should own them, after that process has been started on the node where it is supposed to live

Given A and B, C will then cause the kernel to place the packet pages in physical memory on the correct node. To accomplish this, MPID_shmem needs to be allocated with a memory allocator that returns page-aligned memory, and padded so that the packet pool array is page-aligned. We need to make sure that the size of the pool is such that the size divided by the number of processes is a whole multiple of the page size. Finally, carving up the packet pool among the processes and initializing the avail-stacks must be distributed so that each child process initializes its own packets.
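A sketch of the size arithmetic and per-process initialization this implies is given below. The packet layout, PKTS_PER_PROC and the function names are simplified stand-ins, not the actual ch_shmem declarations.

    /* Illustrative arithmetic for a page-aligned, per-process packet pool.
     * All names here are hypothetical stand-ins for the MPICH quantities. */
    #include <stddef.h>
    #include <unistd.h>
    #include <sys/mman.h>

    typedef struct pkt { struct pkt *next; char payload[1024]; } pkt_t;

    #define PKTS_PER_PROC 64   /* assumed packets per process */

    /* Round the per-process partition up to a whole number of pages (B). */
    static size_t pool_bytes_per_proc(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t raw  = PKTS_PER_PROC * sizeof(pkt_t);
        return (raw + page - 1) / page * page;
    }

    /* Each child calls this for its own rank after it has been pinned to its
     * node, so the first touch (C) places the partition's pages locally. */
    static void init_my_partition(char *pool /* page-aligned (A) */, int rank,
                                  pkt_t **avail_head)
    {
        pkt_t *pkts = (pkt_t *)(pool + rank * pool_bytes_per_proc());
        int i;

        for (i = 0; i < PKTS_PER_PROC - 1; i++)
            pkts[i].next = &pkts[i + 1];   /* writing 'next' is the first touch */
        pkts[PKTS_PER_PROC - 1].next = NULL;
        *avail_head = &pkts[0];
    }

    int main(void)
    {
        int nprocs = 4;   /* assumed number of processes */
        /* mmap() returns page-aligned memory, satisfying (A); error handling omitted. */
        char *pool = mmap(NULL, nprocs * pool_bytes_per_proc(),
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        pkt_t *avail0;

        init_my_partition(pool, 0, &avail0);   /* rank 0 touches only its own pages */
        return 0;
    }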
MPICH p4_shmem

In the p4 device processes can communicate over a network between systems, and through shared memory or sockets within a system. Shared memory communication is implemented as an array of packet queues (shmem_msg_queues[]), one for each process, in a shared data structure called p4_global (of type struct p4_global_data, ch_p4/p4/lib/p4_defs.h). A global queue of available unused packets is kept in p4_global->avail_quel. Queues are built by wrapping a pointer to a message (struct p4_msg, ch_p4/p4/lib/p4_defs.h) in a queue element, which is the actual packet (struct p4_queued_msg, ch_p4/p4/lib/p4_defs.h). p4_global also contains an array of available empty message buffers (p4_global->avail_buffs[]). Each process also has a private structure called p4_local (struct local_data, ch_p4/p4/lib/p4_defs.h), which contains a private message queue for the process (p4_local->queued_messages).

Initialization

When the application starts, a shared memory arena is set up for the internal memory allocation interface (ch_p4/p4/lib/p4_MD.c) using the System V IPC interface (shmget/shmat). In Linux this is implemented in the same way as shared anonymous mmap mappings, by mmap()-ing a file in a RAM filesystem. Memory for p4_global is allocated from the arena and initialized (alloc_global(), ch_p4/p4/lib/p4_alloc.c). The message buffer array (p4_global->avail_buffs[]) is set up to cache allocated but unused message buffers in a series of power-of-two size classes up to 1 MB (init_avail_buffs(), ch_p4/p4/lib/p4_alloc.c). Memory for p4_local is allocated using malloc() and initialized (alloc_local_bm(), ch_p4/p4/lib/p4_alloc.c). After this the child processes local to the system are spawned using fork() (create_bm_processes(), ch_p4/p4/lib/p4_bm.c). The child processes free p4_local, reallocate it with malloc() and reinitialize it (alloc_local_slave(), ch_p4/p4/lib/p4_alloc.c).

Sending

On sending via shared memory (send_message(), ch_p4/p4/lib/p4_tsr.c), a buffer for the message (struct p4_msg, ch_p4/p4/lib/p4_defs.h) is fetched from the p4_global->avail_buffs[] array, or allocated from the shared memory arena (alloc_p4_msg(), ch_p4/p4/lib/p4_alloc.c) if no existing buffer of sufficient size could be found. The data is copied into the message, and a packet (struct p4_queued_msg) is fetched from the global queue of available packets (p4_global->avail_quel) if one exists; otherwise a new packet is allocated from the shared memory arena. The message is linked into the packet as a pointer and the packet is put at the end of the packet queue of the destination process (p4_global->shmem_msg_queues[n] for process number n).

Receiving

When receiving (p4_recv(), ch_p4/p4/lib/p4_tsr.c), the private queue of received messages (p4_local->queued_messages) is first searched for a matching message (one with the right sender and type). If none is found, a packet is removed from the head of the global packet queue for this process (p4_global->shmem_msg_queues[n]), if one is available. The message is retrieved from the packet and the packet is inserted at the head of the available packets queue (p4_global->avail_quel). If the message is not what was expected (its type or sender id differs from what was requested), the message is moved to the private message queue (p4_local->queued_messages). Otherwise the data is copied into a user buffer and the message buffer is either put on the available message buffer queue array (p4_global->avail_buffs[]), if its size matches one of the cached size classes, or freed.
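The relationships between these structures can be summarized with a simplified model; the declarations below are illustrative stand-ins, not the real ones from ch_p4/p4/lib/p4_defs.h, and all locking is omitted.

    #define MAXPROCS 16   /* assumed limit on local processes */

    struct msg {                        /* stand-in for struct p4_msg           */
        int from, type, len;
        char data[1];                   /* real buffers are sized on demand      */
    };

    struct queued_msg {                 /* stand-in for struct p4_queued_msg     */
        struct msg *m;                  /* the packet only carries a pointer     */
        struct queued_msg *next;
    };

    struct msg_queue {                  /* stand-in for struct p4_msg_queue      */
        struct queued_msg *head, *tail;
    };

    struct global {                     /* stand-in for p4_global                */
        struct msg_queue shmem_msg_queues[MAXPROCS]; /* one queue per receiver   */
        struct msg_queue avail_quel;                 /* shared pool of packets   */
        /* avail_buffs[]: cached free message buffers by power-of-two size       */
    };

    /* Append a packet to a queue (locking omitted). */
    static void enqueue(struct msg_queue *q, struct queued_msg *p)
    {
        p->next = NULL;
        if (q->tail)
            q->tail->next = p;
        else
            q->head = p;
        q->tail = p;
    }

    int main(void)
    {
        static struct global g;          /* stands in for the shared arena       */
        static struct msg m;
        static struct queued_msg pkt;

        pkt.m = &m;
        enqueue(&g.shmem_msg_queues[1], &pkt);   /* "send" the message to process 1 */
        return 0;
    }

Sending then amounts to filling in a struct msg, wrapping it in a queued_msg taken from avail_quel (or newly allocated) and appending it to the shmem_msg_queues[] entry of the destination; it is the sharing of avail_quel and avail_buffs[] among all processes that causes the problems discussed next.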
Problems with p4_shmem

The situation with p4_shmem is more complicated than with ch_shmem. When a message is being sent it could end up using a packet cached in p4_global->avail_quel that corresponds to physical memory on any node in the system that has sent packets before. Alternatively, the process allocates a new packet from shared memory, but where that packet ends up in physical memory depends on the page alignment of the allocated area. If the area is part of a shared page that has already been partially used for other packets, it could well already have been allocated physical memory on a remote node. This makes the performance of any given message transaction hard or impossible to predict. The same problems exist for the message buffers: the pool of free message buffers is shared among the processes. p4_shmem also has the same problems with soft processor affinity as ch_shmem.

Changes to p4_shmem

p4 processes need to be given hard processor affinity in the same way as with ch_shmem. The queue of available packets should be split up so that each process has its own queue. When a process has received a packet and is done with it, it needs to put the packet back on the avail_quel that belongs to the process that sent it. Both avail_quel as a whole and its per-process partitions need to be page-aligned. Packet allocation needs to happen either in page-aligned chunks or from a page-aligned per-process shared memory arena. In either case a new shared memory allocator must be implemented for packet allocation: one that either allocates page-aligned blocks of memory, or that takes a process ID as an argument and keeps per-process arenas. If the allocator keeps per-process arenas, the free routine must keep track of which arena the memory being freed should be returned to. The list of available message buffers (p4_global->avail_buffs) also needs to be split up into page-aligned per-process partitions, and the freeing of message buffers after use must be modified to return the buffers to the sending process's list.

1. This argument applies to data movement through the processor, which appears to be the way memcpy() works on Linux. Let us denote the time taken to move a data object from memory into the processor over the memory bus locally on a node T, and the time taken to transfer the object over the node interconnect from node A to node B T_A,B. We assume that the local memory transfer time is equal on all nodes and that writing the same object from the processor to memory takes the same time as reading it; a remote access is counted as the memory-bus time on the node holding the data plus the interconnect transfer. In the case where all packets live on node 0, the transfer time for a message between nodes X and Y, where neither X nor Y is node 0, is:

T + T_X,0 + T + T + T_0,Y + T = 4T + T_X,0 + T_0,Y

If the packets live on node X the time becomes:

T + T + T + T_X,Y + T = 4T + T_X,Y

Assuming that T_X,Y is always less than or equal to (T_X,0 + T_0,Y), which is a reasonable assumption since the path from X to Y could pass through node 0, the latency in the case where the packets live on one of the involved nodes will be smaller than, or at worst equal to, the latency in the case where all packets live on the same node.
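For illustration, with assumed values T = 1 and T_X,0 = T_0,Y = T_X,Y = 3 (arbitrary units), the case with all packets on node 0 costs 4 + 3 + 3 = 10 while the local-packet case costs 4 + 3 = 7; the two coincide only when T_X,Y = T_X,0 + T_0,Y.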