29/10/2021

Outline

  • OpenMP
  • Parallel for-loops
  • Race condition

Recap of key concepts

  • Many cores connected to memory.
  • Single address space.

Recap of key concepts

  • Private memory is only accessible by the owning thread
  • No notion of messages between threads
  • All communication via shared variables

OpenMP (www.openmp.org)

OpenMP is a standard and a programming model for shared memory parallelism.

Compilers “know” how to compile a program that uses OpenMP.

Its principal components are a runtime, library functions, environment variables, and pragmas.

Pragmas and directives

OpenMP heavily uses compiler directives.

If you compile your code with

g++ -fopenmp prog.cxx -o prog

the compiler looks for special directives in the source code, called pragmas:

#pragma omp directive

Here directive is a placeholder for one of the many OpenMP directives.

Very often, directives can be chained

#pragma omp target teams distribute parallel for

A first parallel program

int main()
{
  #pragma omp parallel
  {
    // this block is executed by every thread in the team
  }
  return 0;
}

The omp parallel directive creates a parallel region. This is the most important directive: there is no parallelism in OpenMP without a parallel region.

g++     -fopenmp prog.cxx -o prog     # GNU compiler
clang++ -fopenmp prog.cxx -o prog     # LLVM compiler
icpx    -fopenmp prog.cxx -o prog     # Intel compiler
dpcpp   -fopenmp prog.cxx -o prog     # Intel oneAPI compiler
nvc++ -mp=multicore prog.cxx -o prog  # NVIDIA compiler

Parallel region

Controlling the number of threads

  • Most commonly through environment variable
OMP_NUM_THREADS=4 ./myprogram

# This also works
export OMP_NUM_THREADS=4
./myprogram
  • Alternatively in directives
int main() {
  #pragma omp parallel num_threads(4)
  {
  
  }
  return 0;
}
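
For completeness: the thread count can also be set at runtime with the library function omp_set_num_threads (library functions are introduced in the next section). A minimal sketch, with 4 threads chosen arbitrarily:

#include <omp.h>

int main() {
  omp_set_num_threads(4);   // applies to parallel regions that follow
  #pragma omp parallel
  {
    // executed by 4 threads (unless overridden, e.g. by a num_threads clause)
  }
  return 0;
}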

Library functions

OpenMP provides a number of library functions.

Using them requires this at the top of your code:

#include "omp.h"
int omp_get_num_threads(); // Total number of threads
int omp_get_thread_num(); // This thread's id
double omp_get_wtime(); # Useful for measuring wall time (seconds)
#i.e. duration = t1 - t0
double t0 = omp_get_wtime();
double t1 = omp_get_wtime();
double duration = t1 - t0; // Unit is seconds
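
A minimal sketch that puts all three functions together (the printed messages are only for illustration):

#include <cstdio>
#include <omp.h>

int main() {
  double t0 = omp_get_wtime();            // start of timed section

  #pragma omp parallel
  {
    int nthreads = omp_get_num_threads(); // number of threads in this region
    int tid      = omp_get_thread_num();  // this thread's id (0 .. nthreads-1)
    printf("Hello from thread %d of %d\n", tid, nthreads);
  }

  double t1 = omp_get_wtime();            // end of timed section
  printf("The parallel region took %f seconds\n", t1 - t0);
  return 0;
}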

Shared and private data with OpenMP

Variables inside a parallel region are either shared or private.

By default, variables declared outside the parallel region are shared.

Variables declared inside the parallel region are private.

Data-sharing attribute clauses in the directive (shared, private) give more control.

int a=0;    // shared by default
int b=1;    // shared by default and explicitly marked shared below
int c=2;    // made to be private in directive below
#pragma omp parallel shared(b) private(c)
{
  int d=3; // d is private
}

Best practice

The default of treating all variables as shared can be dangerous.

Best practice is to use

#pragma omp parallel default(none)

and explicitly declare which variables are shared and which are private:

int a=0;
int b=1;
int c=2;
#pragma omp parallel default(none) shared(a,b) private(c)
{
  int d=3; // d is private
}
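
A complete, compilable variant of the fragment above (the assignment c=10 and the printf are only for illustration):

#include <cstdio>

int main() {
  int a = 0;
  int b = 1;
  int c = 2;
  #pragma omp parallel default(none) shared(a, b) private(c)
  {
    c = 10;        // each thread has its own c; the private copy starts uninitialised
    int d = 3;     // d is private because it is declared inside the region
    printf("a=%d b=%d c=%d d=%d\n", a, b, c, d);
  }
  return 0;
}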

What can be parallelised?

Most of the parallelism that is easy to exploit is in for-loops.

There is a cost to creating threads and orchestrating work.

Thread parallelism does not always pay off.

Sequential for loop

for (int i=0; i<100; i++) { /* do some work dependent on i */ }

In a sequential (non-parallel) program, the loop items are processed one after the other.

Thread-parallel for loop

In a multithreaded program, we can distribute the loop iterations to different threads.

But correct only if loop iterations are independent!

And possible only if the total loop length is known!

The programmer has some control over work distribution.
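
One example of this control is the schedule clause, which chooses how iterations are handed out to threads. A sketch, using the parallel for form shown below:

// Work is divided into contiguous blocks of iterations, one block per thread.
#pragma omp parallel for schedule(static)
for (int i = 0; i < 1000; i++) { /* work with roughly equal cost per iteration */ }

// Threads grab further iterations as they become free; useful for uneven work.
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < 1000; i++) { /* work with varying cost per iteration */ }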

#pragma omp for

#pragma omp for is used for parallelising for-loops:

#pragma omp parallel
{
  #pragma omp for
  for (int i=0;i<1000;i++) { ... ; }
}

This is equivalent:

    #pragma omp parallel for
    for (int i=0;i<1000;i++) { ... ; }

When is a loop parallelisable? A quick test: reverse the sequential loop; if the result changes, the iterations are not independent and the loop cannot safely be parallelised.

    for (int i=0;i<1000;i++) { ... ; }
    // vs
    for (int i=999;i>=0;i--) { ... ; }
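
A concrete illustration, assuming an array a: the loop below fails the test, because each iteration reads the value written by the previous one.

// Not parallelisable: iteration i reads a[i-1], which is written by iteration i-1,
// so the result depends on the order in which the iterations run.
for (int i = 1; i < 1000; i++) { a[i] = a[i-1] + 1; }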

Example

This program computes \(C_i=2\cdot(A_i+B_i)\).

#include <cstdio>

int main()
{
    int A[4] = {1,2,3,4},  B[4] = {2,3,4,5},  C[4] = {0,0,0,0};
    int i=0, j=0;

    #pragma omp parallel    // default(none) private(?,?) shared(?,?,?)
    {
        #pragma omp for
        for (i=0;i<4;i++)     // i is private due to #pragma omp for
        {
            for (j=0;j<2;j++) // j is NOT private by default
            {
                C[i] += A[i] + B[i];
            }
        }
    }
    for (i=0;i<4;i++) printf ("2*(%d + %d) = %d\n", A[i], B[i], C[i]);
    return 0;
}

This program has an error: j is shared between the threads, so they race on it. Using default(none) helps to find such errors; a corrected version is sketched below.
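
One possible corrected version (a sketch; declaring both loop variables inside the for statements makes them private and removes the error):

#include <cstdio>

int main()
{
    int A[4] = {1,2,3,4},  B[4] = {2,3,4,5},  C[4] = {0,0,0,0};

    #pragma omp parallel default(none) shared(A,B,C)
    {
        #pragma omp for
        for (int i=0;i<4;i++)         // i is private: declared in the for statement
        {
            for (int j=0;j<2;j++)     // j is now private as well
            {
                C[i] += A[i] + B[i];
            }
        }
    }
    for (int i=0;i<4;i++) printf ("2*(%d + %d) = %d\n", A[i], B[i], C[i]);
    return 0;
}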

Race condition

  • Initialise a=10, each thread: add 1 to a
  • Expect final \(a=10+N_{threads}\)
  • For \(N_{threads}>1\) there is no guarantee of correctness
#include <cstdio>

int main() {
    int a=10;
    #pragma omp parallel
    {
        a = a+1;
    }
    printf("Final result a=%d\n", a);
    return 0;
}

Compile with:

g++ -fopenmp  input.cxx -o program

Race condition

One possible interleaving of two threads, both executing a = a+1:

  • Initialise a=10 in shared memory
  • Thread 1 loads 10 into a CPU register
  • Thread 1 adds 1 to 10 in its CPU register
  • Thread 2 loads 10 into a CPU register
  • Thread 2 adds 1 to 10 in its CPU register
  • Thread 1 stores 11 in shared memory a
  • Thread 2 stores 11 in shared memory a

The increment of thread 1 is lost: the final value is 11, not 12.
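
One common way to make this particular update safe is the atomic directive (a minimal sketch; critical sections and reductions are alternatives):

#include <cstdio>

int main() {
    int a=10;
    #pragma omp parallel
    {
        #pragma omp atomic
        a = a + 1;          // the read-modify-write is now performed atomically
    }
    printf("Final result a=%d\n", a);
    return 0;
}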

Code example

Task: compute \(\sum\limits_{n=1}^{100} n\)

#include <cstdio>

int main() {
    int sum=0;
    const int N=100;

    #pragma omp parallel for
    for (int n=1;n<N+1;n++)
    {
      sum +=n;
    }
    printf("Result is %d. It should be %d\n", sum, N*(N+1)/2);
    return 0;
}
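
This loop has the same kind of race on sum. A sketch of the usual fix, the reduction clause, which gives every thread a private partial sum and combines them at the end:

#include <cstdio>

int main() {
    int sum=0;
    const int N=100;

    #pragma omp parallel for reduction(+:sum)
    for (int n=1;n<N+1;n++)
    {
      sum += n;             // each thread accumulates into its own private copy
    }
    printf("Result is %d. It should be %d\n", sum, N*(N+1)/2);
    return 0;
}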