[PDC - Center for Parallel Computers, KTH]

Discussion:
Parallel Processing Concepts

1/99

This is the in-depth discussion layer of a two-part module. For an explanation of the layers and how to navigate within and between them, return to the top page of this module.


This primer introduces you to significant concepts concerning parallel processing and its efficient realization within a number of different types of hardware and software environments.

The material in this tutorial is arranged to first give you a basic understanding of the important concepts in the field of parallel programming, including its goals, various forms, and significant terminology. This will be followed by a more in-depth discussion of several key considerations involved in the decision to go parallel, including some sample codes demonstrating various alternatives. The conclusion recaps the significant points in the presentation, and gives you directions for obtaining more information. Following the conclusion is a link to the evaluation form for this presentation; please do not forget to fill this out and submit it when you've completed all of the material.

In order for this tutorial to be of benefit to you, you should have somewhat more than a novice-level understanding of computing in general, and proficiency in at least one of the following programming languages:

fortran 77
fortran 90
C


Table of Contents

  1. Overview and Goals of Parallel Processing
  2. Why Use Parallel Processing to Reach These Goals?
  3. Taxonomy of Architectures
  4. Terminology of Parallelism
  5. Models of Memory Access
  6. Converting From Serial to Parallel Execution
  7. Costs of Parallel Processing
  8. Parallel Programming Example
  9. Conclusions

Quiz | Evaluation

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]


1. Overview and Goals of Parallel Processing

Overview of Parallel Processing

Parallel processing is the use of multiple processors to execute different parts of the same program simultaneously.

The main goal of parallel processing is to reduce wall-clock time. There are also other reasons, which will be discussed later on in this section.

Imagine yourself having to order a deck of playing cards: a typical solution would be to first order them by suit, and then by rank-order within each suit. If there were two of you doing this, you could split the deck between you and both could follow the above strategy, combining your partial solutions at the end; or one of you could sort by suit, and the other by order within suits, both of you working simultaneously. Both of these scenarios are examples of the application of parallel processing to a particular task, and the reason for doing so is very simple: reduce the amount of time before achieving a solution. If you've ever played card games that use multiple decks, you've almost certainly engaged in parallel processing by having multiple people help with the tasks of collecting, sorting and shuffling all of the cards.

You can use this analogy to see indications of both the power and the weakness of the parallel approach, by taking it gradually to its extreme: as you increase the number of helpers involved in a particular task, you'll generally experience a characteristic speedup curve demonstrating how up-to-a-certain-number of helpers is beneficial, but any over that simply get in each others' way and reduce the overall time to completion. Consider, for example, how little it would help to have 52 people crowding around a table, each responsible for putting one particular card into its proper place in the deck -- this is exactly what is meant by the adage Too many cooks spoil the broth.

If you'd like to do an exercise which will drive home some of the aspects of both the power and the limitations of parallel programming, click here.


1.1 Goals of Parallel Processing

  1. Reduce Wall Clock Time

    In the just-stated goal, you'll notice that it isn't simply time that's being reduced, but wall-clock time; other kinds of "time" could have been emphasized, for example CPU time, which is a count of the exact number of CPU cycles spent working on your job, but wall-clock is considered to be the most significant because it's what you-the-researcher want to spend as little of as possible waiting for the solution: your own time. What is considered to be "acceptable" differs with the situation, and involves characteristics of the user, the particular code being run, and the system it's being run on. But in all cases, it is safe to assume that the user is generally going to be more pleased the faster a solution is produced.

    It should be pointed out that reducing the wall-clock time to solution is only one of many possible goals that might be of interest; some others might be:

  2. Cheapest Possible Solution Strategy

    Running programs costs money; different ways of achieving the same solution could have significantly different costs. If you're in a fiscally tight situation, you may have no reasonable recourse to a parallel strategy if it costs more than your budget allows. At the same time, you may find that running your program in parallel across a large number of workstation-type computers could cost considerably less than submitting it to a large, mainframe-style mega-beast.

  3. Local versus Non-Local Resources

    "Locality," here, usually refers to either "geographical locality" or "political locality." The former is just another way of saying that you want all of your processes to be "close" to one another in terms of communications; the latter indicates that you only want to use resources that are administratively open to you. Both can have a bearing on "cost;" e.g., the more communications latency incurred by your application, the longer it will run, and the more you might be charged, while use of resources "owned" by other organizations may also carry charges.

  4. Memory Constraints

    As researchers become increasingly computationally sophisticated, the complexity of the problems they tackle increases proportionally (some would say superlinearly, that researchers are forever trying to bite off more than their computers can chew). One of the first resources to get exhausted is local memory -- especially for Grand Challenge level projects, the amount of memory available to a single system is rarely going to be sufficient to the computational and data-storage needs encountered during application runs.

    This situation is greatly alleviated by having access to the aggregate memory made available by distributed computing environments: working storage (main memory) requirements can be spread around the various processors engaged in the cooperative computation, and long-term storage (tape and disk) can be accommodated at different locations (indeed, data security can be enhanced by arranging for multiple copies to be maintained at distinct locations in the distributed environment).

You can probably think of others; basically, any limited resource can be considered as the object of optimization, if it is deemed to be the most important quantity to conserve. In most cases involving large-scale computation, however, user time, as measured by the clock on the wall, is considered to be the single most valuable resource to be conserved.

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]


2. Why Use Parallel Processing to Reach These Goals?

Accepting reduction of wall-clock time as the fundamental goal of our activities, why necessarily focus on parallel programming as the means to this end? Aren't there other approaches that can also yield fast turnarounds? Yes, indeed there are, the most significant being the oldest: "beef up that old mainframe," make the standalone single-processor design larger (e.g., increase the amount of memory it can directly address) and more powerful (e.g., increase its basic wordlength and computational precision) and faster (e.g., use smaller-micron etching technology, packing more transistors into less space, and coupling everything with larger and faster communications pathways).

This approach, though, can only be pushed so far, and indications are getting stronger and stronger that fundamental limitations are going to put permanent roadblocks up before too much longer. Three of these are:

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]


3. Taxonomy of Architectures

To set a foundation for our examination of parallel processing, we need to understand just what kinds of processing alternatives have already been identified, and where they fit into the "parallel picture", if you will. One of the longest-lived and still very reasonable classification schemes was proposed by Flynn, in 1966, and distinguishes computer architectures according to how they can be classified along two independent, binary-valued dimensions; independent simply asserts that neither of the two dimensions has any effect on the other, and binary-valued means that each dimension has only two possible states, as a coin has only two distinct flat sides. For computer architectures, Flynn proposed that the two dimensions be termed Instruction and Data, and that, for both of them, the two values they could take be Single or Multiple. The two dimensions could then be drawn like a matrix having two rows and two columns, and each of the four cells thus defined would characterize a unique type of computer architecture.

Single Instruction, Single Data (SISD)

This is the oldest style of computer architecture, and still one of the most important: all personal computers fit within this category, as did most computers ever designed and built until fairly recently. Single instruction refers to the fact that there is only one instruction stream being acted on by the CPU during any one clock tick; single data means, analogously, that one and only one data stream is being employed as input during any one clock tick. These factors lead to two very important characteristics of SISD style computers:


Multiple Instruction, Single Data (MISD)


Few actual examples of computers in this class exist; this category was included more for the sake of completeness than to identify a working group of actual computer systems. However, special-purpose machines are certainly conceivable that would fit into this niche: multiple frequency filters operating on a single signal stream, or multiple cryptography algorithms attempting to crack a single coded message. Both of these are examples of this type of processing where multiple, independent instruction streams are applied simultaneously to a single data stream.

A less technological, but perhaps more cosmopolitan example, was suggested by a participant in the Cornell Theory Center's Virtual Workshop:


Single Instruction, Multiple Data (SIMD)

A very important class of architectures in the history of computation, single-instruction/multiple-data machines are capable of applying the exact same instruction stream to multiple streams of data simultaneously. For certain classes of problems, e.g., those known as data-parallel problems, this type of architecture is perfectly suited to achieving very high processing rates, as the data can be split into many different independent pieces, and the multiple instruction units can all operate on them at the same time.


Multiple Instruction, Multiple Data (MIMD)

Many believe that the next major advances in computational capabilities will be enabled by this approach to parallelism which provides for multiple instruction streams simultaneously applied to multiple data streams. The most general of all of the major categories, a MIMD machine is capable of being programmed to operate as if it were in fact any of the four.


SIMD + MIMD: e.g., Connection Machine CM-5

A few systems are exploring hybrid designs, mixing SIMD and MIMD capabilities in order to take advantage of either when the algorithm benefits most from it. The CM-5 is an example of such a system:

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]


4. Terminology of Parallelism

Parallel processing has its own lexicon of terms and phrases, emphasizing those concepts that are considered to be most important to its goals and the ways in which those goals may be achieved. The following are some of the more commonly encountered ones. They are listed in an order for you to learn them, assuming you do not know any, so you can start with the first and then build up to the rest.

Task

A logically discrete section of computational work.

This is a somewhat loose definition, but adequate for this introduction. For now, think of a task as computational work you can describe simply, such as, "calculate the mean and standard deviation of 100,000 numbers," or "calculate a Fast Fourier transform."


Parallel Tasks

Tasks whose computations are independent of each other, so that all such tasks can be performed simultaneously with correct results.


Serial Execution

Execution of a program sequentially, one statement at a time.


Parallelizable Problem

A problem that can be divided into parallel tasks. This may require changes in the code and/or the underlying algorithm.

Example of Parallelizable Problem:

Example of a Non-parallelizable Problem:

A non-parallelizable problem, such as the calculation of the Fibonacci sequence above, would entail dependent calculations rather than independent ones -- notice how calculation of the k + 2 value uses those of both k + 1 and k, hence those three terms cannot be calculated independently, nor, therefore, in parallel.


Types of Parallelism

There are two basic ways to partition computational work among parallel tasks:


Observed Speedup

Observed speedup of a code which has been parallelized =

         wall-clock time of serial execution 
       --------------------------------------- 
        wall-clock time of parallel execution
    

One of the most widely used indicators of parallelizability, the calculation of observed speedup is both intuitively satisfying, and potentially misleading; the former because a well-parallelized code can be shown to run in a fraction of the time that it takes the serial version, the latter because, in many respects, this is a comparison of apples and oranges: the codes are different, they perform different tasks, the algorithms may be entirely distinct. Still, there is no discounting the fact that a good job of parallelization will be evident in the amount of wallclock time it has saved the user; what is debatable is the converse: if it is not evident that a lot of time has been saved, is it because the problem itself is not parallelizable, or because the parallelization simply wasn't done well? This, by the way, is where parallel profiling tools, covered later in this presentation, can help tremendously.


Synchronization

The temporal coordination of parallel tasks. It involves waiting until two or more tasks reach a specified point (a sync point) before continuing any of the tasks.


Parallel Overhead

The amount of time required to coordinate parallel tasks, as opposed to doing useful work.

Parallelization doesn't come free, and one of the most insidious costs is the time and cycles put into making sure that all of those separate tasks are doing what they're supposed to be doing. Things that are simply taken for granted in serial execution, or that don't apply, take on special significance when there are many tasks instead of just one; the three most commonly encountered coordination tasks are:


Granularity

A measure of the ratio of the amount of computation done in a parallel task to the amount of communication.




Massively parallel system

A parallel system with many processors. "Many" is usually defined as 1000 or more processors.


Scalable parallel system

A parallel system to which the addition of more processors will yield a proportionate increase in parallel speedup. Whether or not this increase occurs typically depends on some combination of:

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]


5. Models of Memory Access

Memory access refers to the way in which the working storage, be it "main-memory", "cache-memory", or whatever, is viewed by the programmer. Regardless of how the memory is actually implemented, e.g., if it's actually remotely located but is accessed as if it were local, the access method plays a very large role in determining the conceptualization of the relationship of the program to its data.

5.1 Shared Memory

Think of a single large blackboard, marked off so that all data elements have their own unique locations assigned, and all the members of a programming team are working together to test out a particular algorithmic design, all at the same time...this is an example of shared memory in action:


5.2 Distributed Memory


The other major distinctive model of memory access is termed distributed, for a very good reason:


5.3 Distributed Memory: Some Approaches

Distributed memory is, for all intents and purposes, virtually synonymous with message-passing, although the actual characteristics of the particular communication schemes used by different systems may hide that fact.

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]


6. Converting From Serial to Parallel Execution

There are a number of bottlenecks typically encountered in the transition from serial processing to parallel processing. One of the most pernicious is that of mindset: people who have been in the computing business for a long time are understandably reluctant to have to learn a new way of designing their codes, and an efficient parallel algorithm often has little similarity to an efficient serial algorithm. The very first task in the conversion effort is to step way back from the existing serial application, and re-examine the intent it was written to serve: can this task be effectively and efficiently performed in parallel, and, if so, how best can that be accomplished?

Very often existing serial code has to be almost completely ignored, and the parallel version written virtually from scratch. This can be a major commitment of resources, and for some dusty-deck codes the projected return from such an investment is often considered to be insufficient to warrant the effort.

However, once the decision has been made to move from serial to parallel, the real nitty-gritty work of code conversion can very often be helped along by application of the growing number of automatic tools, well seasoned by the manual use of hard-learned rules of thumb.


6.1 Automatic vs. Manual Conversion

For years, ever since the first parallel system was constructed, the parallelization of existing codes has been largely the realm of manual conversion. The ultimate future goal of parallel support is to build tools capable of accepting the before-mentioned dusty-deck serial code, and returning a perfectly parallelized program suitable for execution on a particular state-of-the-art parallel system.


6.2 Converting Serial Code to Parallel: General Dependencies

In this section, we'll spend some time discussing a major factor relevant to successful code parallelization: dependency.

General Dependencies


6.3 DO-LOOP Dependencies

Since DO-loops are prime candidates for very effective parallelization, identification of dependence is very important, even more so the determination of whether or not the dependence found will render the loop incapable of the desired parallelization.

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]


7. Costs of Parallel Processing

By this point, I hope you will have gotten the joint message that:

  1. Parallel processing can be extremely useful, but...
  2. ... TANSTAAFL (There Ain't No Such Thing As A Free Lunch)
I.e., you can get great rewards from parallelizing, but you'll likely sweat blood getting there; now, that's not always the case, but it's better that you assume it will be, and be pleasantly surprised when it goes quickly and smoothly, than expecting everything will go smoothly and ending up mired to your neck in problems.

Here are some of the more significant ways that you can expect to spend time and encounter problems:

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]



9. Parallel Programming Example

This section provides access to a sample program that demonstrates parallel techniques. A simple program uses the Message Passing Interface (MPI) to send the message "Hello, world" from one task to several others. The same program runs on each node, determining whether it is a sender or receiver through a variable named "me."

The equivalent program in HPF runs on each node, in parallel sets up the message and determines its own identity, and then sends that value ("me(i)") to node 0 where it is printed along with the message.

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]


10. Conclusions

Here are some of the most important things you should take with you from this presentation:

Parallel processing can significantly reduce wall-clock time.

The whole reason for getting involved in parallelism is to reduce the time you spend waiting for your results. The characteristics of algorithms virtually insure that there will be points in most applications where significant savings can be achieved by judicious use of parallelism.

Writing and Debugging Software is More Complicated

Parallelism involves at least an order-of-magnitude more complexity in your code -- this need not be evident in the code that you write, but will certainly be evident in the size of the executable that results. The important aspect, of course, is the conceptual complexity that has been added, which has immediate implications for its debugging and maintenance.

Parallelization is not yet fully automatic

You'll have to assume that whatever parallelization is needed, you'll have to provide. There are specific situations where you'll be able to get away with as little as adding a few parallel-directives as comments in your still-serial source, but this is not yet a commonly-encountered scenario.

Overhead of parallelism costs more CPU

Parallelism doesn't come for free; not only do you have to do more work, but so does the computing system, and parallelism itself involves additional effort in terms of process-control: starting, stopping, synchronizing, and killing. Besides this, parallelism requires that, in general, both code and data exist in multiple places, and getting them there involves additional time as well as the additional space needed to hold them.

Decide which architecture is most appropriate for a given application

The characteristics of your application should drive your decision as to how it should be parallelized; the form of the parallelization should then determine what kind of underlying system, both hardware and software, is best suited to running your parallelized application.

[Table of Contents] [Section 1] [Section 2] [Section 3] [Section 4] [Section 5] [Section 6] [Section 7] [Section 8] [Section 9] [Section 10] [Less Detail]


[Quiz] Take a multiple-choice quiz on this material, and submit it for grading.

[Evaluation] Please complete this short evaluation form. Thank you!


[Copyright Statement] [Feedback]