2021 Code Performance Series - From analysis to insight

15 July 2021

Chebychev polynomials

Chebychev polynomials used to approximate expensive 1D function
\[f(x, \vec{c}) \approx -\frac{1}{2}c_0 + \sum_{i=0}^{N-1}c_iT_i(x)\quad x\in[-1,1]\]
Coefficients \(c_i\) are known
Chebychev polynomials \(T_i(x)\) recursively defined for a given \(x\):
\(f(x)\) in essence is a scalar product of two vectors, \(\vec{c}\cdot\vec{T}\)

Problem description

Gravitational waves with Peano4/ExaHyPE (http://www.peano-framework.org/)
Identified expensive function, chebev, that is bottleneck in initialisation
We evaluate \(\mathrm{chebev}(x, \vec{c})=f(x,\vec{c})\approx \vec{c}\cdot\vec{T}\) in a doubly nested for loop (n1=n2=n3=30):

    for (k = 0; k < n3; k++)
    for (j = 0; j < n2; j++)
    {
      for (i = 0; i < n1; i++) coeffs[i] = v[i + n1 * (j + n2 * k)];
      values2[j][k] = chebev (x, coeffs);
    }

Notably, \(x\) is constant! What changes per call is the coefficient vector \(\vec{c}\).
Coefficient storage \(v\) is double array of length 27000
Analyse nested loop with google-benchmark and MAQAO

chebev

double chebev (double c[], int m, double x)
    /* eq. 5.8.11 of C++ NR (2nd ed) */
{
  double djp2(0), djp1(0), dj(0);

  for(int j = m-1 ; j >= 1; j--)
  { 
    djp2 = djp1; 
    djp1 = dj;
    dj   = 2*x*djp1 - djp2 + c[j];
  }
  return x*dj - djp1 + 0.5*c[0];
}

Very efficient implementation if x is changing constantly
Lots of redundancy for us (n2*n3 = 900 times the same x!)

First look with MAQAO

Unaltered implementation of code clocks in at 57442 ns

g++  -lbenchmark -lpthread chebnest.cxx   -O2 -g

g++  -lbenchmark -lpthread chebnest.cxx  -funroll-loops  -march=native -mtune=native -O2 -g

yields 28615 ns, 2x speed-up for free.

First rewrite

no more redundancy -> 14785 ns but vectorisation poor

    recurrence(x,N,recvec);   # recvec is T_i(x) -- i=0...N-1
    for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++) values2[j][k] = -0.5* v[N * (j + N * k)];
    
    for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++) values2[j][k] += recvec[i]*v[i + N * (j + N * k)];

Reordering loops

array access bad!

    for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++) values2[j][k] += recvec[i]*v[i + N * (j + N * k)];

Second rewrite

array access and omp simd

    int idx(0);
    for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
    #pragma omp simd
    for (int k = 0; k < N; k++) values2[j][k] += recvec[i]*v[idx++];

Contiguous access of coefficient array - benchmark now at 5088 ns

Final look with MAQAO

Very well vectorised code with good array access
MAQAO guided us to an 11x speed-up of a bottleneck
Dramatically reduces init phase (3 hours to 15 minutes)
Allowed to find and fix several bugs in last 2 weeks