In gravitational wave simulations with Peano4/ExaHyPE (http://www.peano-framework.org/) we identified an expensive function that is called millions of times in the initialisation phase of the simulation.
I am using google benchmark to run the costly chebychev evaluation. Compiler is gcc10.2.1.
The starting point is the default implementation based on numerical recipes 2. N is 30 in the application.
for (int k = 0; k < N; k++)
{
for (int j = 0; j < N; j++)
{
for (int i = 0; i < N; i++) p[i] = v[i + N * (j + N * k)];
values2[j][k] = chebev (-1, 1, p, N, A);
}
}
inline double chebev (double a, double b, double c[], int m, double x)
/* eq. 5.8.11 of C++ NR (2nd ed) */
{
int j;
double djp2, djp1, dj;
double y;
y = 2*(x - 0.5*(b+a))/(b-a);
dj = djp1 = 0;
for(j = m-1 ; j >= 1; j--)
{
djp2 = djp1;
djp1 = dj;
dj = 2*y*djp1 - djp2 + c[j];
}
return y*dj - djp1 + 0.5*c[0];
}
With default compilation options g++ -lbenchmark -lpthread test_arrays.cxx -O2 -g
this code clocks in at 57442 ns.
This is what MAQAO thinks of our code:
Whoops! We are not using appropriate compiler flags!
If we tell the compiler to unroll loops, we find
g++ -lbenchmark -lpthread -funroll-loops test_arrays.cxx -O2 -g
a benchmark time of 44225 ns.
Adding mircoarchitecutre specific options g++ -lbenchmark -lpthread test_arrays.cxx -funroll-loops -march=native -mtune=native -O2 -g
we get 28615 ns.
This is in total a factor of 2 speedup pretty much for free.
Looking at the code, we see that a ton of computations are redundantly repeated as the evaluation point A does not change. So we can strip out the chebychev recurrence operation
void recurrence(double x, int m, double * recvec)
{
recvec[0] = 1;
recvec[1] = x;
for (int i=2;i<m;i++)
{
recvec[i] = 2*x*recvec[i-1] - recvec[i-2];
}
}
and rewrite the benchmark code
recurrence(0.3,N,recvec);
for (int k = 0; k < N; k++)
for (int j = 0; j < N; j++) values2[j][k] = -0.5* v[N * (j + N * k)];
for (int k = 0; k < N; k++)
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++) values2[j][k] += recvec[i]*v[i + N * (j + N * k)];
which clocks in at 14785 ns.
Out of ideas on what next to do we use MAQAO to find that the vectorisation is not quite there yet.
So maybe we should reorder the loops:
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++) values2[j][k] += recvec[i]*v[i + N * (j + N * k)];
This made the code slightly slower, 16650 ns.
MAQAO analysis reveals that the array access is now quite bad.
The next thing to try is a bit more evasive as we need to change the way the array is accessed. To be still correct, we also need to fill the coefficient array in a more suitable form.
This is the original filling code — NOTE that the values filled are only for correctness tests.
// Fill array as in SpecCoef
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++) v[i + N * (j + N * k)] = i + N * (j + N * k);
This fills the array contiguously
int myglob(0);
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++) v[myglob++] = i + N * (j + N * k);
This allows us to rewrite the code of interest in a way that loops are vectorisable and array access is contiguous:
myglob=0;
recurrence(0.3,N,recvec);
for (int k = 0; k < N; k++)
for (int j = 0; j < N; j++) values2[j][k] = -0.5* v[myglob++];
myglob=0;
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++) values2[j][k] += recvec[i]*v[myglob++];
Which results in a benchmark time of 11464 ns, so we are on track here.
What does MAQAO say?
This is not vectorising, let’s try with OpenMP SIMD
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
#pragma omp simd
for (int k = 0; k < N; k++) values2[j][k] += recvec[i]*v[myglob++];
and wouldn’t you know it, we find a benchmark time of 5738 ns.
We’re on a roll so let’s see if MAQAO can tell us more:
Still bad array access times, so this must be the 2d array that we are writing to. Let’s turn that one into a 1d array as well then:
myglob=0;
myglob2=0;
recurrence(0.3,N,recvec);
for (int k = 0; k < N; k++)
for (int j = 0; j < N; j++) values2[myglob2++] = -0.5* v[myglob++];
myglob=0;
for (int i = 0; i < N; i++)
{
myglob2=0;
#pragma omp simd collapse(2) safelen(4)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++) values2[myglob2++] += recvec[i]*v[myglob++];
}
Which comes in at 5088 ns.
A final message from MAQAO:
Thanks to MAQAO we were able to realise at 11x speedup of a function that is called a lot in the initialisation phase of our simulation of gravitational waves. By using appropriate compiler flags we can gain a factor 2 speedup. By some mild rewriting of the compute kernels we gain an additional factor 5.5. The key is disentangling the recurrence from the polynomial evaluation and contiguous memory access which allow to vectorise the loops properly.