1 Summing up a billion times

The task is to sum all numbers between 1 and \(10^9\) in parallel. We can cross-check our result using the relation \(\sum\limits_{i=1}^N = \frac{N\cdot(N+1)}{2}\)

1.1 Explicit computation

Under what circumstances is this program correct? — only works if \(N_\mathrm{threads}=4\)

The code is available as ex3/sumup_explicit.cxx from github.

#include <cstdio>
#include "omp.h"

unsigned long crosscheck(unsigned long N)
{
    return N*(N+1)/2;
}

int main()
{
    unsigned long N = 1000000000;
    unsigned long SUMS[4] = {0,0,0,0};

    #pragma omp parallel for default(none) shared(N,SUMS) num_threads(4)
    for (unsigned long i=1;i<N+1;i++)
    {
        SUMS[omp_get_thread_num()]+=i;
    }

    unsigned long result = SUMS[0] + SUMS[1] + SUMS[2] + SUMS[3];

    printf("Result:    %lu\n", result);
    printf("Should be: %lu\n", crosscheck(N));

    return 0;
}

1.2 Using private helper variables

This is a more powerful version of the previous program.

Can you make this code correct? — need to at atomic at the += and make N,result shared

The code is available as ex3/sumup_v2.cxx from github.

#include <cstdio>
#include "omp.h"

unsigned long crosscheck(unsigned long N)
{
    return N*(N+1)/2;
}

int main()
{
    unsigned long N = 1000000000;
    unsigned long result = 0;
    
    #pragma omp parallel default(none) shared(N,result)
    {
        unsigned long temp = 0;
  
        #pragma omp for 
        for (unsigned long i=1;i<N+1;i++)
        {
            temp += i;
        }

        #pragma omp atomic
        result += temp;
    }

    printf("Result:    %lu\n", result);
    printf("Should be: %lu\n", crosscheck(N));

    return 0;
}

1.3 Using reductions

The gotcha here is private(N) is bad as private variables are uninitialised! Of course, for addition, the correct operator is +. Also, you cannot have result appear in a data sharing and reduction clause at the same time.

The most compact form is achieved using a reduction clause.

This code is available as ex3/sumup_reduction.cxx from github.

#include <cstdio>
#include "omp.h"

unsigned long crosscheck(unsigned long N)
{
    return N*(N+1)/2;
}

int main()
{
    unsigned long N = 1000000000;
    unsigned long result = 0;
    #pragma omp parallel for default(none) shared(N) reduction(+:result)
    for (unsigned long i=1;i<N+1;i++)
    {
        result += i;
    }

    printf("Result:    %lu\n", result);
    printf("Should be: %lu\n", crosscheck(N));

    return 0;
}

2 Numerical integration

There are two suggested solutions commented out that allow to safely update the shared variable sum. Either using atomic or using a reduction clause. Both contain an error:

I introduce a nasty bug in the atomic update, opm instead of omp is a simple typo to make — the compiler typically does not complain about it. The other thing is to pick the correct + operator and to make sum,a,b shared.

Ideally, people do both and measure the run times and come to the conclusion that reduction is the way to go.

The task is to compute \(\pi\) using the rectangle method.

We exploit the fact that \(\int\limits_0^1 dx \frac{4}{1+x^2} = \pi\).

The following code is available as ex3/pi_simple.cxx from github.

#include <cstdio>
#include <omp.h>


int main() {

  // Numerical integration of f(x) = 4/(1+x^2) in [a,b] = [0,1]
  // We use a fixed stepsize ('delta x') here.

  const long num_steps = 500000000;
  const double a=0;
  const double b=1;
  const double stepsize = (b-a) / num_steps; // This is 'delta x'

  double sum = 0;   // This accumulates the f(x)*'delta x' values (i.e. areas)
   
  double t_0 = omp_get_wtime(); // time stamp

  #pragma omp parallel for default(none) reduction(+:sum) // Either reduction
  for (int i = 1; i <= num_steps; i++) 
  {
    double x = a + (i - 0.5) * stepsize;
//#pragma omp atomic                                      // Or atomic update
    sum += 4.0 / (1.0 + x * x) * stepsize;   // evaluate f(x)*'delta x' and add to accumulator
  }
  
  double t_tot = omp_get_wtime() - t_0; // duration in seconds

  printf("\n pi with %ld steps is %lf in %lf seconds\n ", num_steps, sum, t_tot);
}

The program is already parallel but it is not correct.

Make the program correct using appropriate directives and clauses.
Measure the execution time as a function of the number of threads (use e.g. OMP_NUM_THREADS=4).
Produce a graph similar to this:

Solution 3

Holger Schulz

16/11/2021

1 Summing up a billion times

1.1 Explicit computation

1.2 Using private helper variables

1.3 Using reductions

2 Numerical integration