Some notes on terminology

The term "node" is used in the Linux kernel NUMA sense of a CPU, possibly with some local memory attached. It should not be confused with nodes in a cluster; those are referred to as "cluster nodes". For simplicity the presence of local memory on a node is assumed. "System" refers to a NUMA machine.

The MPICH devices use different terms to denote the data structure used to communicate data between processes. That type of data structure is uniformly referred to as a "packet" here. A "message" is the data buffer to send, and possibly a data structure encapsulating it.

The term "local memory" here does not indicate that the memory is non-shared, merely that it lives on the same node as the process referencing it. Non-shared memory is called "private memory" here. "Remote memory" is memory on another node, shared or non-shared. "Global" signifies data that is shared by all processes.

Unless otherwise stated, "page-aligned" means that the start address of the memory area coincides with a page boundary AND that the size of the area is a whole number of pages, so that the first byte after the end of the area lies in a new page.

Linux Kernel

The three main characteristics of the Linux 2.6 kernel that are of interest to us are the following:

A. Processes have weak processor affinity. The scheduler tries to keep processes running on the same node if possible, but will move processes between nodes if it considers it necessary.

B. A first-touch page allocation policy. Memory pages owned by a process are allocated on the node where they are first referenced (written to for anonymous pages, read from or written to for named pages), if possible. If the node on which the process is running has no free memory left, the memory will be allocated from the topologically closest node that has enough free memory. The determination of which node is closest is made at boot time. There are three cases to consider.

Anonymous virtual memory that has been inherited from the parent process is mapped to the same physical page frames used in the parent process, and the page frames are marked as copy-on-write (COW). When either the parent or the child process writes to a page it will break COW and a new page frame will be allocated for that process. This means that if the parent process breaks COW for a page, both copies of that page will end up living on the node that the parent process runs on.

Anonymous virtual memory that has been newly allocated from the kernel is initially mapped to the copy-on-write "zero" page. A write to a virtual address then breaks COW and a physical page frame will be allocated for that address on the node on which the touching process runs.

Named pages (pages that are backed by a file or a swap device) are allocated on the node where the process that first tries to read or write the page runs.

C. No page migration. Pages that have been touched will stay on the node they were allocated on until they are deallocated or paged out to secondary storage.

In essence, processes have soft affinity and memory has hard affinity.
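As an aside, the first-touch behaviour in B is easy to observe. The following minimal sketch (illustrative, not MPICH code) pins the calling process to CPU 0, creates a shared anonymous mapping of the kind MPICH uses for its arena, touches one page, and then asks the kernel where the page ended up using get_mempolicy() from the NUMA API found in newer 2.6 kernels (link with -lnuma):

    /* Illustrative sketch of first-touch placement; assumes a kernel with
     * sched_setaffinity() and the NUMA API (get_mempolicy). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numaif.h>

    int main(void)
    {
        cpu_set_t mask;
        long page = sysconf(_SC_PAGESIZE);
        char *buf;
        int node = -1;

        /* Pin this process to CPU 0 so the first touch happens on a known node. */
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        /* Shared anonymous mapping, as used for the MPICH shared memory arena. */
        buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        buf[0] = 1;   /* first touch: the physical page frame is allocated now */

        /* Ask which node the page was placed on. */
        get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR);
        printf("page placed on node %d\n", node);
        return 0;
    }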
MPICH

The MPICH version studied here is 1.2.6. We looked at the two MPICH devices most likely to be used on a NUMA system running Linux: the pure shared memory device ch_shmem and the combined shared memory / network device ch_p4.

MPICH ch_shmem internals

In the ch_shmem device all processes communicate through a shared memory structure called MPID_shmem (of type MPID_SHMEM_globmem, ch_shmem/shdef.h). MPID_shmem contains an array of packet queues (MPID_shmem->incoming), one queue for each process, an array of stacks of available packets (MPID_shmem->avail), various locks and a pool of packets (MPID_shmem->pool). Packets have a "next" pointer that is used to link packets into linked lists on the queues and stacks. To alleviate pressure on the MPID_shmem structure, each process has a private structure called MPID_lshmem (of type MPID_SHMEM_lglobmem, ch_shmem/shdef.h) with copies of pointers to the elements of MPID_shmem.

Initialization

When the application starts, a shared memory arena is set up for the internal memory allocation interface (ch_shmem/p2p.c). With new Linux kernels this is accomplished by creating a new anonymous, shared virtual memory region with mmap() (p2p_init(), ch_shmem/p2p.c). Memory for MPID_shmem is allocated from the arena and initialized (MPID_SHMEM_init(), ch_shmem/shmempriv.c). The packet pool is partitioned among the processes and each avail-stack is initialized by linking together the packets in the corresponding pool partition into a list, with MPID_shmem->avail[n] pointing to the head of the list allocated for process n. After this all child processes are spawned using fork() (p2p_create_procs(), ch_shmem/p2pprocs.c).

Sending

Sending is accomplished by fetching a free packet from the process' avail-stack, blocking until one is available (MPID_SHMEM_GetSendPkt(), ch_shmem/shmempriv.c), copying the data into the packet using memcpy(), and then inserting the packet into the incoming-queue of the destination process (MPID_SHMEM_SendControl(), ch_shmem/shmempriv.c).

Receiving

When receiving a packet (one is available on the incoming-queue), the data is copied out of the packet, and the packet is removed from the incoming-queue (by MPID_SHMEM_FreeRecvPkt(), ch_shmem/shmempriv.c) and placed in a free packet cache list (FreePkts[], ch_shmem/shmempriv.c) that is private to the process. The free packet cache is flushed (by MPID_SHMEM_FlushPkts(), ch_shmem/shmempriv.c) after a certain number of packets have been put on it; the packets in the cache are then put on top of the avail-stack of the process that originally sent each packet.

Problems with ch_shmem

Even though ch_shmem keeps packets separate for each process, the location of the packets in physical memory becomes a performance problem on Linux NUMA systems. With the default memory affinity policy of Linux, the first process to write to a page "wins" it: the page gets allocated on the node on which that process ran. In the initialization of ch_shmem all packets in the pool are touched when their "next" pointers are set to form the linked lists. This causes all the memory pages in the pool, and therefore all the packets, to live permanently on the node on which the setup code was executed. This means that all processes in the system must access memory on the startup node every time a message is sent or received. This can limit performance with respect to both bandwidth and latency. Furthermore, with the default soft processor affinity processes sometimes move between nodes but the memory does not, compounding the problem and giving non-deterministic performance.

Suggested modifications to ch_shmem

In order to get predictable performance and make sure that processes don't migrate away from their data, processes should be locked to nodes so that they display the same hard affinity as memory. In newer Linux kernels this can be accomplished with the sched_setaffinity() system call.
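A minimal sketch of what such pinning could look like is shown below. The helper name, the node-to-CPU mapping and CPUS_PER_NODE are illustrative assumptions, not MPICH code; on a real system the CPUs belonging to a node would be taken from the machine's topology description.

    /* Illustrative helper: pin the calling process to the CPUs of one node
     * and yield so the scheduler moves it there immediately.
     * bind_to_node() and CPUS_PER_NODE are assumptions, not MPICH names. */
    #define _GNU_SOURCE
    #include <sched.h>

    #define CPUS_PER_NODE 4   /* assumed topology: 4 CPUs per node */

    static void bind_to_node(int node)
    {
        cpu_set_t mask;
        int cpu;

        CPU_ZERO(&mask);
        for (cpu = node * CPUS_PER_NODE; cpu < (node + 1) * CPUS_PER_NODE; cpu++)
            CPU_SET(cpu, &mask);

        sched_setaffinity(0, sizeof(mask), &mask);  /* 0 = calling process */
        sched_yield();  /* reschedule now rather than at the end of the time-slice */
    }

    int main(void)
    {
        bind_to_node(0);   /* example: pin to node 0 */
        return 0;
    }

Called with the process's ID modulo the number of nodes, right after the ID is known, such a helper would spread the processes evenly over the nodes, as discussed next.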
It is important that the affinity is set as early as possible in the startup process, before much private memory has been touched or allocated. In order to make sure that the processes are evenly distributed among the available nodes, the affinity cannot be set until the process has discovered its ID (MPID_myid). This determines the earliest point at which the affinity can be set. For the child processes this can be done in p2p_create_procs() (ch_shmem/p2pprocs.c) right after MPID_myid is set from the sequence number MPID_shmem->globid. For the initial process, affinity can be set to node 0 at the very beginning of MPID_SH_InitMsgPass() (ch_shmem/shmeminit.c), before the shared memory arena is initialized. To force a rescheduling of the process, a sched_yield() call should directly follow the setting of the affinity. Otherwise the process could continue to execute on the current (but wrong) node until its current scheduler time-slice runs out.

Any time a message is sent, data is copied from a user buffer in the sending process (A) to a packet in shared memory, and then from shared memory into a user buffer in the receiving process (B). When the packet lives in memory local to the node where either A or B runs, the latency will be minimal (1). Distributing the packets among the nodes also improves total bandwidth, as more memory interfaces will be utilized for communication and communication will not stress any particular node unevenly (unless the communication pattern is irregular).

Most of the code needed to get processes to use packets in local memory is already present in MPICH. Any given process will always use the same set of packets for sending, and those sets are disjoint across processes. We just need to make sure that the shared memory area occupied by the packets used by a process gets allocated to physical memory on the same node that the process will run on. Since memory gets assigned to nodes at the granularity of pages, we need to make sure that:

A) the packet pool is page-aligned

B) the pool partitions allocated to processes occupy a whole number of pages in memory

C) the pages in any pool partition are first written to by the process that should own them, after that process has been started on the node where it is supposed to live

Given A and B, C will then cause the kernel to place the packet pages in physical memory on the correct node. To accomplish this, MPID_shmem needs to be allocated with a memory allocator that returns page-aligned memory, and padded so that the packet pool array is page-aligned. We need to make sure that the size of the pool is such that the size divided by the number of processes is a whole multiple of the page size. Finally, carving up the packet pool among the processes and initializing the avail-stacks must be distributed so that each child process initializes its own packets.
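A sketch of the size arithmetic and per-process initialization this implies is given below. The packet layout, PKTS_PER_PROC and the function names are simplified stand-ins, not the actual ch_shmem declarations.

    /* Illustrative arithmetic for a page-aligned, per-process packet pool.
     * All names here are hypothetical stand-ins for the MPICH quantities. */
    #include <stddef.h>
    #include <unistd.h>
    #include <sys/mman.h>

    typedef struct pkt { struct pkt *next; char payload[1024]; } pkt_t;

    #define PKTS_PER_PROC 64   /* assumed packets per process */

    /* Round the per-process partition up to a whole number of pages (B). */
    static size_t pool_bytes_per_proc(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t raw  = PKTS_PER_PROC * sizeof(pkt_t);
        return (raw + page - 1) / page * page;
    }

    /* Each child calls this for its own rank after it has been pinned to its
     * node, so the first touch (C) places the partition's pages locally. */
    static void init_my_partition(char *pool /* page-aligned (A) */, int rank,
                                  pkt_t **avail_head)
    {
        pkt_t *pkts = (pkt_t *)(pool + rank * pool_bytes_per_proc());
        int i;

        for (i = 0; i < PKTS_PER_PROC - 1; i++)
            pkts[i].next = &pkts[i + 1];   /* writing 'next' is the first touch */
        pkts[PKTS_PER_PROC - 1].next = NULL;
        *avail_head = &pkts[0];
    }

    int main(void)
    {
        int nprocs = 4;   /* assumed number of processes */
        /* mmap() returns page-aligned memory, satisfying (A); error handling omitted. */
        char *pool = mmap(NULL, nprocs * pool_bytes_per_proc(),
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        pkt_t *avail0;

        init_my_partition(pool, 0, &avail0);   /* rank 0 touches only its own pages */
        return 0;
    }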
MPICH p4_shmem

In the p4 device processes can communicate over a network between systems, and through shared memory or sockets within a system. Shared memory communication is implemented as an array of packet queues (shmem_msg_queues[]), one for each process, in a shared data structure called p4_global (of type struct p4_global_data, ch_p4/p4/lib/p4_defs.h). A global queue of available unused packets is kept in p4_global->avail_quel. Queues are built by wrapping a pointer to a message (struct p4_msg, ch_p4/p4/lib/p4_defs.h) in a queue element, which is the actual packet (struct p4_queued_msg, ch_p4/p4/lib/p4_defs.h). p4_global also contains an array of available empty message buffers (p4_global->avail_buffs[]). Each process also has a private structure called p4_local (struct local_data, ch_p4/p4/lib/p4_defs.h), which contains a private message queue for the process (p4_local->queued_messages).

Initialization

When the application starts, a shared memory arena is set up for the internal memory allocation interface (ch_p4/p4/lib/p4_MD.c) using the System V IPC interface (shmget/shmat). In Linux this is implemented in the same way as shared anonymous mmap mappings, by mmap()-ing a file in a RAM filesystem. Memory for p4_global is allocated from the arena and initialized (alloc_global(), ch_p4/p4/lib/p4_alloc.c). The message buffer array (p4_global->avail_buffs[]) is set up to cache allocated but unused message buffers in a series of power-of-two size classes up to 1 MB (init_avail_buffs(), ch_p4/p4/lib/p4_alloc.c). Memory for p4_local is allocated using malloc() and initialized (alloc_local_bm(), ch_p4/p4/lib/p4_alloc.c). After this the child processes local to the system are spawned using fork() (create_bm_processes(), ch_p4/p4/lib/p4_bm.c). The child processes free p4_local, reallocate it with malloc() and reinitialize it (alloc_local_slave(), ch_p4/p4/lib/p4_alloc.c).

Sending

On sending via shared memory (send_message(), ch_p4/p4/lib/p4_tsr.c), a buffer for the message (struct p4_msg, ch_p4/p4/lib/p4_defs.h) is fetched from the p4_global->avail_buffs[] array, or allocated from the shared memory arena (alloc_p4_msg(), ch_p4/p4/lib/p4_alloc.c) if no existing buffer of sufficient size could be found. The data is copied into the message, and a packet (struct p4_queued_msg) is fetched from the global queue of available packets (p4_global->avail_quel) if one exists; otherwise a new packet is allocated from the shared memory arena. The message is linked into the packet as a pointer and the packet is put at the end of the packet queue of the destination process (p4_global->shmem_msg_queues[n] for process number n).

Receiving

When receiving (p4_recv(), ch_p4/p4/lib/p4_tsr.c), the private queue of received messages (p4_local->queued_messages) is first searched for a matching message (one with the right sender and type). If none is found, a packet is removed from the head of the global packet queue for this process (p4_global->shmem_msg_queues[n]), if one is available. The message is retrieved from the packet and the packet is inserted at the head of the available packets queue (p4_global->avail_quel). If the message is not what was expected (its type or sender id differs from what was requested), the message is moved to the private message queue (p4_local->queued_messages). Otherwise the data is copied into a user buffer and the message buffer is either put on the available message buffer queue array (p4_global->avail_buffs[]), if its size matches one of the cached size classes, or freed.
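The relationships between these structures can be summarized with a simplified model; the declarations below are illustrative stand-ins, not the real ones from ch_p4/p4/lib/p4_defs.h, and all locking is omitted.

    #define MAXPROCS 16   /* assumed limit on local processes */

    struct msg {                        /* stand-in for struct p4_msg           */
        int from, type, len;
        char data[1];                   /* real buffers are sized on demand      */
    };

    struct queued_msg {                 /* stand-in for struct p4_queued_msg     */
        struct msg *m;                  /* the packet only carries a pointer     */
        struct queued_msg *next;
    };

    struct msg_queue {                  /* stand-in for struct p4_msg_queue      */
        struct queued_msg *head, *tail;
    };

    struct global {                     /* stand-in for p4_global                */
        struct msg_queue shmem_msg_queues[MAXPROCS]; /* one queue per receiver   */
        struct msg_queue avail_quel;                 /* shared pool of packets   */
        /* avail_buffs[]: cached free message buffers by power-of-two size       */
    };

    /* Append a packet to a queue (locking omitted). */
    static void enqueue(struct msg_queue *q, struct queued_msg *p)
    {
        p->next = NULL;
        if (q->tail)
            q->tail->next = p;
        else
            q->head = p;
        q->tail = p;
    }

    int main(void)
    {
        static struct global g;          /* stands in for the shared arena       */
        static struct msg m;
        static struct queued_msg pkt;

        pkt.m = &m;
        enqueue(&g.shmem_msg_queues[1], &pkt);   /* "send" the message to process 1 */
        return 0;
    }

Sending then amounts to filling in a struct msg, wrapping it in a queued_msg taken from avail_quel (or newly allocated) and appending it to the shmem_msg_queues[] entry of the destination; it is the sharing of avail_quel and avail_buffs[] among all processes that causes the problems discussed next.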
Problems with p4_shmem

The situation with p4_shmem is more complicated than with ch_shmem. When a message is being sent it could end up using a packet cached in p4_global->avail_quel that corresponds to physical memory on any node in the system that has sent packets before. Alternatively, the process allocates a new packet from shared memory, but where that packet ends up in physical memory depends on the page alignment of the allocated area. If the area is part of a shared page that has already been partially used for other packets, it could well already have been allocated physical memory on a remote node. This makes the performance of any given message transaction hard or impossible to predict. The same problems exist for the message buffers: the pool of free message buffers is shared among the processes. p4_shmem also has the same problems with soft processor affinity as ch_shmem.

Changes to p4_shmem

p4 processes need to be given hard processor affinity in the same way as with ch_shmem. The queue of available packets should be split up so that each process has its own queue. When a process has received a packet and is done with it, it needs to put the packet back on the avail_quel that belongs to the process that sent it. Both avail_quel as a whole and its per-process partitions need to be page-aligned. Packet allocation needs to happen either in page-aligned chunks or from a page-aligned per-process shared memory arena. In either case a new shared memory allocator must be implemented for packet allocation: one that either allocates page-aligned blocks of memory, or that takes a process ID as an argument and keeps per-process arenas. If the allocator keeps per-process arenas, the free routine must keep track of which arena the memory being freed should be returned to. The list of available message buffers (p4_global->avail_buffs) also needs to be split up into page-aligned per-process partitions, and the freeing of message buffers after use must be modified to return the buffers to the sending process's list.

1. This argument applies to data movement through the processor, which appears to be the way memcpy() works on Linux. Let us denote the time taken to move a data object from memory into the processor over the memory bus locally on a node T, and the time taken to transfer the object over the node interconnect from node A to node B T_A,B. We assume that the local memory transfer time is equal on all nodes and that writing the same object from the processor to memory takes the same time as reading it; a remote access is counted as the memory-bus time on the node holding the data plus the interconnect transfer. In the case where all packets live on node 0, the transfer time for a message between nodes X and Y, where neither X nor Y is node 0, is:

T + T_X,0 + T + T + T_0,Y + T = 4T + T_X,0 + T_0,Y

If the packets live on node X the time becomes:

T + T + T + T_X,Y + T = 4T + T_X,Y

Assuming that T_X,Y is always less than or equal to (T_X,0 + T_0,Y), which is a reasonable assumption since the path from X to Y could pass through node 0, the latency in the case where the packets live on one of the involved nodes will be smaller than, or at worst equal to, the latency in the case where all packets live on the same node.
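For illustration, with assumed values T = 1 and T_X,0 = T_0,Y = T_X,Y = 3 (arbitrary units), the case with all packets on node 0 costs 4 + 3 + 3 = 10 while the local-packet case costs 4 + 3 = 7; the two coincide only when T_X,Y = T_X,0 + T_0,Y.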