A1 – Open Multiprocessing (OpenMP)
Contents
Q1. Hello World Program
Q2. Hello World Program – Version 2
Q3. DAXPY Loop
Q4. Matrix Multiply
Q5. Calculation of π
Q6. Calculation of π – Worksharing and Reduction
Q7. Calculation of π – Monte Carlo Simulation
Q8. Producer-Consumer Program
The specification of the OpenMP Application Program Interface (OpenMP API) provides a model for parallel programming that is portable across shared memory architectures from different vendors. Compilers from numerous vendors support the OpenMP API. The directives extend the C, C++ and Fortran base languages with single program multiple data (SPMD) constructs, tasking constructs, work-sharing constructs, and synchronization constructs, and they provide support for sharing and privatizing data.
Hello World Example
#include <stdio.h>
#include <omp.h>  /* OpenMP compiler directives and APIs are declared here */

int main() {
    /* Parallel region begins here */
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf(" hello(%d) ", ID);
        printf(" world(%d) ", ID);
    }
    return 0;
}
Compilation
$ gcc -o hello helloworld.c -fopenmp
$ ./hello
Programs to Implement
Q1. Hello World Program
Hello World Program
Create the default number of OpenMP threads. Each thread that OpenMP creates should print a Hello World message.
Q2. Hello World Program – Version 2
Hello World Program – Version 2
Create the default number of OpenMP threads. Each thread that OpenMP creates should pass its thread ID to a function that prints a Hello World message along with the thread ID. The prototype of the function could be void printHello(threadID). A sketch of one possible solution follows.
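The following is a minimal sketch of one way to do this; it uses the suggested printHello function, and the int parameter type is an assumption.

#include <stdio.h>
#include <omp.h>

/* Prints a Hello World message tagged with the calling thread's ID. */
void printHello(int threadID) {
    printf("Hello World from thread %d\n", threadID);
}

int main() {
    #pragma omp parallel
    {
        /* Each thread obtains its own ID and passes it to the function. */
        printHello(omp_get_thread_num());
    }
    return 0;
}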
Q3. DAXPY Loop
DAXPY Loop
D stands for Double precision, A is a scalar value, X and Y are one-dimensional vectors of size 2^16 each, and P stands for Plus. The operation to be completed in one iteration is X[i] = a*X[i] + Y[i]. Your task is to compare the speedup (in execution time) gained by increasing the number of threads. Start from a 2-thread implementation. How many threads give the maximum speedup? What happens if the number of threads is increased beyond this point? Why?
Speedup = T1 / Tn, where Tn is the execution time of the n-threaded OpenMP program and T1 is the execution time of the uniprocessor implementation.
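One possible starting point is sketched below; the initialization values and the thread count are placeholders to vary, timing uses omp_get_wtime(), and the loop is parallelized with the worksharing construct described under Q6.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 16)   /* 2^16 elements */

int main() {
    double a = 2.0;
    double *X = malloc(N * sizeof(double));
    double *Y = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { X[i] = 1.0; Y[i] = 2.0; }

    int nthreads = 4;                 /* vary this to measure speedup */
    omp_set_num_threads(nthreads);

    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        X[i] = a * X[i] + Y[i];      /* DAXPY: X[i] = a*X[i] + Y[i] */
    double elapsed = omp_get_wtime() - start;

    printf("%d threads: %f seconds\n", nthreads, elapsed);
    free(X);
    free(Y);
    return 0;
}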
Q4. Matrix Multiply
Matrix Multiply
Build a parallel implementation of the multiplication of large matrices (e.g., 1000×1000). Repeat the experiment from the previous question for this implementation. Think about how to partition the work amongst the threads – which elements of the product array will be calculated by each thread? A sketch of one possible partitioning follows.
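The sketch below assigns complete rows of the product matrix to each thread, which is only one of several reasonable partitionings; the matrix size and initialization values are placeholders.

#include <stdio.h>
#include <omp.h>

#define SZ 1000   /* matrix dimension, as suggested above */

int main() {
    static double A[SZ][SZ], B[SZ][SZ], C[SZ][SZ];
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    double start = omp_get_wtime();
    /* Each thread computes a block of complete rows of the product. */
    #pragma omp parallel for
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++) {
            double tmp = 0.0;
            for (int k = 0; k < SZ; k++)
                tmp += A[i][k] * B[k][j];
            C[i][j] = tmp;
        }
    printf("Done in %f seconds\n", omp_get_wtime() - start);
    return 0;
}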
Q5. Calculation of π
Calculation of π
This is the first example where many threads will cooperate to calculate a computational value. The task of this program is to arrive at an approximate value of π. The value of π is given by the following integral:

π = ∫₀¹ 4 / (1 + x²) dx    (1)
This can be approximated as the sum of the areas of rectangles under the curve shown in Figure 1. In the figure, δx is the width of the rectangle with its height equal to F(xi) at the middle of interval i.
Figure 1: The sum of the areas of the rectangles is an approximation for the value of π.
The serial version of the code follows.
static long num_steps = 100000;
double step;

int main() {
    int i;
    double x, pi, sum = 0.0;

    step = 1.0 / (double)num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    return 0;
}
The following OpenMP runtime routines will be useful:
• int omp_get_num_threads(): returns the number of threads in the team.
• int omp_get_thread_num(): returns the thread ID of the calling thread.
• double omp_get_wtime(): returns the time in seconds since a fixed point in the past.
Points to consider:
• How many threads will your program spawn?
• Assume each thread has calculated its partial sum of Equation (1). The sum variable should not be modified by multiple threads simultaneously – it should be updated inside a critical region. Consider the following example:
float Sum;
#pragma omp parallel
{
    float partial_sum;
    int i, id, nthrds;

    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();

    for (i = id; i < niters; i += nthrds) {
        partial_sum = partial_sum_calculate(i);

        /* Critical section – only one thread modifies Sum at a time. */
        #pragma omp critical
        Sum = consumePartialSum(partial_sum);
    }
}
For the Sum = consumePartialSum(partial_sum); statement, each thread waits for its turn to execute the instructions in the consumePartialSum function. The OpenMP critical construct marks the critical region (it implements mutual exclusion).
For the π calculation, the same effect can be obtained using the atomic construct. The atomic construct instructs the compiler that the modification of the memory location must complete atomically: a thread completes the read-from-memory, modify-in-register, write-back-to-memory sequence of operations as if it were a single instruction, and it will not be context-switched out during this operation. Only one thread can execute an atomic operation on a given location at a time. Example:
float Sum;
#pragma omp parallel
{
    float partial_sum;
    int i, id, nthrds;

    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();

    for (i = id; i < niters; i += nthrds) {
        partial_sum = partial_sum_calculate(i);

        /* Atomic update – only one thread modifies Sum at a time. */
        #pragma omp atomic
        Sum = Sum + partial_sum;
    }
}
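Combining these pieces, a minimal sketch of one possible parallel version of the π program (mirroring the serial code above and using the atomic construct) is shown below; the cyclic distribution of iterations among threads is one possible choice.

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;

int main() {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x, partial = 0.0;

        /* Cyclic distribution of iterations among the threads. */
        for (long i = id; i < num_steps; i += nthrds) {
            x = (i + 0.5) * step;
            partial += 4.0 / (1.0 + x * x);
        }

        /* Only one thread updates the shared sum at a time. */
        #pragma omp atomic
        sum += partial;
    }

    printf("pi = %f\n", step * sum);
    return 0;
}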
Q6. Calculation of π – Worksharing and Reduction
Calculation of π – Worksharing and Reduction
Improve the previous implementation of the π calculation using worksharing and the reduction clause.
The OpenMP parallel construct creates a “Single Program Multiple Data” (SPMD) program. Each thread redundantly executes the same code. Worksharing splits up pathways through the code between threads.
The loop worksharing construct (for) splits up loop iterations among the threads in a team. The following code shows the usage of the worksharing for construct.
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        work_for_thread(i);
    }
}
Note that the worksharing for construct should be used after eliminating any dependences between loop iterations.
In the π calculation loop, each thread calculates partial sums of Equation (1). The partial sums have to be accumulated into the sum variable. This step is called a reduction. Use the OpenMP reduction clause to calculate the sum. (To think: given 16 numbers to add up and 16 processors in hand, what is the fastest way to sum up the numbers?)
double ave = 0.0, A[MAX];
int i;

#pragma omp parallel for reduction(+:ave)
for (i = 0; i < MAX; i++) {
    ave += A[i];
}
ave = ave / MAX;
• The worksharing for construct has been appended to the parallel construct. This is shorthand for using two separate compiler directives, as in the previous example.
• The reduction clause has the form reduction(op:list). op is the operation used for the reduction (+ in the case of the π program), and list names the variables being reduced (sum in the serial π code).
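For reference, a minimal sketch of the π program rewritten with the combined parallel for construct and the reduction clause might look like the following.

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;

int main() {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    /* Loop iterations are shared among the threads, and the per-thread
       partial sums are combined by the reduction clause. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi = %f\n", step * sum);
    return 0;
}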
Q7. Calculation of π – Monte Carlo Simulation
Calculation of π – Monte Carlo Simulation
The objective of this program is to estimate the value of π using the probability of a dart landing inside a circular boundary on a dart board. For a circle of radius r inscribed in a square of side 2r, the probability of a dart landing inside the circle is the ratio of the areas: P(inside) = πr² / (2r)² = π/4.
For this question:
• Serial version of the code: compute π by randomly choosing points and counting the fraction that falls inside the circle; π ≈ 4 × (points inside / total points). Write your own pseudo-random number generator.
• Create a parallel version of this program. Use the private clause to specify the variables local to a thread (see the sketch after the bonus items).
Bonus:
• Make the random number generator thread-safe.
• Make your random number generator numerically correct (non-overlapping sequences of pseudo-random numbers).
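The following is a minimal sketch of one possible parallel Monte Carlo estimator. The linear congruential generator (using the classic constants 1103515245 and 12345) and the number of trials are illustrative choices; each thread keeps its own seed, which also keeps this simple generator thread-safe.

#include <stdio.h>
#include <omp.h>

/* A simple linear congruential generator returning a value in [0, 1);
   the constants are the classic ones and are illustrative only. */
static double lcg_random(unsigned int *seed) {
    *seed = *seed * 1103515245u + 12345u;
    return (*seed & 0x7fffffff) / 2147483648.0;
}

int main() {
    const long num_trials = 1000000;   /* illustrative trial count */
    long in_circle = 0;
    double x, y;
    unsigned int seed;

    /* x, y and seed are local to each thread via the private clause;
       the per-thread hit counts are combined with a reduction. */
    #pragma omp parallel private(x, y, seed) reduction(+:in_circle)
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        seed = 1234u + (unsigned int)id;    /* distinct seed per thread */

        for (long i = id; i < num_trials; i += nthrds) {
            x = lcg_random(&seed);
            y = lcg_random(&seed);
            if (x * x + y * y <= 1.0)
                in_circle++;
        }
    }

    /* P(inside) = pi/4, so pi is approximately 4 * (hits / trials). */
    printf("pi = %f\n", 4.0 * (double)in_circle / (double)num_trials);
    return 0;
}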
Q8. Producer-Consumer Program
Producer-Consumer Program
The producer-consumer sequence involves a producer process writing data into a memory location and a consumer process reading from that location and performing an action on it. An example producer-consumer operation is for the consumer to calculate the sum of an array after it has been populated by the producer. The example serial code is shown below. Parallelize this Producer-Consumer program.
int main()
{
    double *A, sum, runtime;
    int flag = 0;

    A = (double *)malloc(N * sizeof(double));

    runtime = omp_get_wtime();
    fill_rand(N, A);          // Producer: fill an array of data
    sum = Sum_array(N, A);    // Consumer: sum the array
    runtime = omp_get_wtime() - runtime;

    printf(" In %lf seconds, The sum is %lf \n", runtime, sum);
}
For the parallel implementation, take care of these:
• There are two threads – a producer thread and a consumer thread. Use the sections and section constructs to assign code blocks to each thread. Example:
#pragma omp parallel
{
    #pragma omp sections   // Assign the code in each section to an individual thread
    {
        #pragma omp section
        X_calculation();   // Thread 1
        #pragma omp section
        y_calculation();   // Thread 2
        #pragma omp section
        z_calculation();   // Thread 3
    }   // default barrier here
}
• How does the consumer know that the producer has completed populating the array? The consumer can be notified by setting a global flag. Hint: use the flush directive. The flush directive forces an update of the memory location as soon as a thread has modified a variable. The producer writes a flag that can be flushed. The following code sequence flushes the updated value of the flag variable to memory.
flag = 1;
#pragma omp flush (flag)
• The consumer thread waits (in a loop) for the producer to complete populating the array – it waits for the flag to be updated.
On successful completion, the code implements pairwise synchronization between threads.
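Putting the pieces together, one possible parallel sketch is shown below. It assumes the same fill_rand() and Sum_array() helpers as the serial code, uses sections for the two threads, and uses flush for the flag-based handshake.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000

/* Assumed to be the same helpers used by the serial version above. */
extern void fill_rand(int n, double *a);
extern double Sum_array(int n, double *a);

int main() {
    double *A = malloc(N * sizeof(double));
    double sum = 0.0;
    int flag = 0;

    #pragma omp parallel sections shared(A, sum, flag)
    {
        #pragma omp section
        {   /* Producer: fill the array, then publish the flag. */
            fill_rand(N, A);
            #pragma omp flush
            flag = 1;
            #pragma omp flush(flag)
        }

        #pragma omp section
        {   /* Consumer: spin until the producer signals completion. */
            int ready = 0;
            while (!ready) {
                #pragma omp flush(flag)
                ready = flag;
            }
            #pragma omp flush
            sum = Sum_array(N, A);
        }
    }

    printf("The sum is %lf\n", sum);
    free(A);
    return 0;
}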
Epilogue
This document borrows heavily from the excellent A "Hands-on" Introduction to OpenMP tutorial by Tim Mattson and Larry Meadows [1] (the accompanying videos are available on YouTube as Tim Mattson's OpenMP Video Tutorial). For the OpenMP definitions of all the terms used in this document, refer to the OpenMP 4.0 specification document [2]. The API examples document from the OpenMP website is available online [3]. The OpenMP website [4] has the OpenMP specification, resources and other useful links. Some excellent references for understanding the concepts through hands-on exercises are listed in the References section [5][6][7][8][9].
References
[1] T. Mattson and L. Meadows, "A 'Hands-on' Introduction to OpenMP," OpenMP tutorial at SC08. http://openmp.org/mp-documents/omp-hands-on-SC08.pdf
[2] "OpenMP Application Program Interface Specification," http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
[3] "OpenMP API Examples," http://openmp.org/mp-documents/OpenMP Examples 4.0.1.pdf
[4] "OpenMP Website," http://openmp.org/wp
[5] B. Barney, "OpenMP," Lawrence Livermore National Laboratory. https://computing.llnl.gov/tutorials/openMP/
[6] Wikipedia, "OpenMP — Wikipedia, the free encyclopedia," 2015. [Online]. Available: https://en.wikipedia.org/w/index.php?title=OpenMP&oldid=685091436
[7] J. Landman, "OpenMP in 30 Minutes," http://www.linux-mag.com/id/4609/
[8] J. Yliluoma, "Guide into OpenMP: Easy multithreading programming for C++," http://bisqwit.iki.fi/story/howto/openmp/
[9] M. Süß and C. Leopold, "Common mistakes in OpenMP and how to avoid them: a collection of best practices," in Proceedings of the 2005 and 2006 International Conference on OpenMP Shared Memory Parallel Programming, ser. IWOMP'05/IWOMP'06. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 312–323. [Online]. Available: http://portal.acm.org/citation.cfm?id=1892830.1892863