I analyse why ccz4 stalls when running on more than one rank. The culprit is the OpenMP parallelisation inside libtwopunctures. I identify a trivial workaround that allows us to run on many nodes but comes with some limitations, and I sketch what needs to be done to overcome those limitations.
We use LibTwoPunctures in ccz4. Fundamentally, it is an interpolator. The interpolation is constructed once using GSL routines in what appears to be a minimisation procedure based on a Newton algorithm. We evaluate the interpolation only once, at \(t=0\) (in adjustSolution).
When running in an MPI setup, we observe that the program stalls/deadlocks at the beginning of the Newton step. This is most likely caused by the OpenMP pragmas in LibTwoPunctures. Note that the code works just fine without MPI (or, for that matter, when running with a single rank).
The way the program is set up, every rank solves the same interpolation problem. Arguably, this is acceptable since we do it only once per run, but it eats into our scalability because it is effectively a serial phase of the program.
We can identify several solutions.
The first is a dirty solution that does not address any of the underlying problems of the code but allows us to run ccz4 on more than one rank. It is, however, dead simple: in the file libtwopunctures/Makefile we make a single change and comment out the -fopenmp flag:
LINKGSL=-lgsl -lgslcblas -lm
DEBUG=-g
OPTIMIZE=-O3
library=libtwopunctures.a
# enable parallelization
#OMP=-fopenmp
The second solution is somewhat more elegant: we instruct only one rank to perform the construction of the interpolation and broadcast the required data to all other ranks. This still requires a fix for the OpenMP bit in libtwopunctures.
// Every rank sets up the TwoPunctures object and allocates the deriv buffers.
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

TP::TwoPunctures _tp;
int const ntotal = _tp.npoints_A * _tp.npoints_B * _tp.npoints_phi;
_tp.allocate_derivs (&_tp.u,    ntotal);
_tp.allocate_derivs (&_tp.v,    ntotal);
_tp.allocate_derivs (&_tp.cf_v, ntotal);

if (_tp.grid_setup_method == "Taylor expansion") _tp.gsm = TP::GSM_Taylor_expansion;
else if (_tp.grid_setup_method == "evaluation")  _tp.gsm = TP::GSM_evaluation;

MPI_Barrier(MPI_COMM_WORLD);

// Only rank 0 runs the expensive construction of the interpolation.
if (rank == 0)
{
  TP_bindding::prepare(&_tp);
}
The elegance is lost once we look at the data structures and realise that a ton of broadcast statements have to be issued. It is not just the derivs that need to be copied but also the ill-labelled Parameters: it turns out these so-called parameters are changed during the initial setup phase.
MPI_Bcast(_tp.u.d0, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.u.d1, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.u.d2, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.u.d3, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.u.d11, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.u.d12, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.u.d13, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.u.d22, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.u.d23, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.u.d33, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d0, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d1, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d2, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d3, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d11, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d12, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d13, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d22, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d23, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.v.d33, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d0, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d1, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d2, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d3, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d11, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d12, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d13, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d22, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d23, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(_tp.cf_v.d33, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.gsm, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.mp, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.mm, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.par_b, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.par_P_minus, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.par_P_plus, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.par_S_minus, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.par_S_plus, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.TP_epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.n1, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.n2, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.n3, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.nvar, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.conformal_state, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.antisymmetric_lapse, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.averaged_lapse, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.pmn_lapse, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&_tp.brownsville_lapse, 1, MPI_INT, 0, MPI_COMM_WORLD);
_tp.runned = true;   // mark the interpolation setup as complete on every rank
Probably the most flexible solution, and my preferred one, would be to design code that allows us to store the interpolation data and the “altered parameters” to disk. This has several advantages:

* we circumvent the OpenMP problems in libtwopunctures
* we no longer have a serial phase of several minutes to construct the interpolation
* we can write a const-correct interpolation evaluator
* we can do parameter studies more easily (currently a bit of an ungodly mess)
In an MPI setting, rank 0 would read the interpolation data and broadcast it to all other ranks. The interpolation data consists of 3 derivs, each containing 10 arrays of length ntotal; the parameter data is negligible. With the current default settings, ntotal=14400, meaning we need to write/read \(3 \times 10 \times 14400 = 432000\) doubles, which is equivalent to roughly 3.5MB of disk space.
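A minimal sketch of this read-and-broadcast step, assuming the deriv data has been dumped as one contiguous block of \(30 \times ntotal\) doubles (the helper, file name and layout are assumptions, not existing libtwopunctures functionality):

#include <mpi.h>
#include <cstdio>
#include <vector>

// Sketch: rank 0 reads the dumped interpolation data from disk; all other
// ranks only participate in the broadcast.
std::vector<double> loadInterpolationData(const char* fname, int ntotal, int rank)
{
  std::vector<double> buffer(30 * ntotal);   // 3 derivs x 10 arrays each
  if (rank == 0)
  {
    std::FILE* f = std::fopen(fname, "rb");
    std::fread(buffer.data(), sizeof(double), buffer.size(), f);
    std::fclose(f);
  }
  MPI_Bcast(buffer.data(), static_cast<int>(buffer.size()), MPI_DOUBLE, 0, MPI_COMM_WORLD);
  return buffer;
}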
Below is a skeleton of an interpolator class with const correctness and a clear separation of parameters and computations. We would still use libtwopunctures for the main work (chebev), so this is not a total rewrite but a more appropriate structure.
class Interpolator
{
public:
  void eval(const double* const pos, double *Q) const;
  // constructor
  Interpolator(deriv const& u, ...) : _u(u), ... {}
  static Interpolator readData(const char* fname);
private:
  deriv const _u;
  deriv const _v;
  deriv const _cf_v;
  Parameters const _params;
};
void Interpolator::eval(const double* const pos, double *Q) const
{
  // Call library functions --- these are real libtwopunctures functions, they don't change our parameters
  PunctTaylorExpandAtArbitPosition(...);
  // Calls allocate, calculate, free_derivs as well as interpol
  PunctIntPolAtArbitPositionFast(...);
  // Calls PunctEvalAtArbitPositionFast which in turn calls a bunch of functions;
  // these are Chebyshev polynomials and evaluators
  BY_Aijofxyz(...); // This one is simple, does not call other functions
}
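A hypothetical usage sketch (the file name and the adjustSolution signature shown here are assumptions, only meant to illustrate the intended call pattern):

// Construct the interpolator once from the dumped data, then evaluate it per
// grid point. Since eval is const, the object can be shared read-only once built.
Interpolator const interpolator = Interpolator::readData("tp_interpolation.dat");

void adjustSolution(const double* const pos, double *Q)
{
  interpolator.eval(pos, Q);   // fills Q with the two-punctures initial data at pos
}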
Regarding the data structure, this is almost trivial: we can store the deriv data in a \(30 \times ntotal\) array and store the parameters however we want.
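For illustration, a possible flat layout with a small index helper (the names are hypothetical, not existing libtwopunctures code):

// Hypothetical flat layout for the deriv data: 3 derivs (u, v, cf_v), each with 10
// component arrays (d0, d1, d2, d3, d11, d12, d13, d22, d23, d33) of length ntotal.
constexpr int FieldsPerDeriv = 10;
constexpr int NumDerivs      = 3;

// Index of entry i of component "field" of deriv d within one flat array of
// NumDerivs * FieldsPerDeriv * ntotal doubles.
inline int flatIndex(int d, int field, int i, int ntotal)
{
  return (d * FieldsPerDeriv + field) * ntotal + i;
}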