Executive summary

I analyse why ccz4 stalls when using more than one rank. The stall is caused by the OpenMP parallelisation in libtwopunctures. I identify a trivial solution that lets us run on more than one rank but comes with some limitations, and I sketch what needs to be done to overcome these limitations.

Intro

We use LibTwoPunctures in ccz4. Fundamentally, it is an interpolator. The interpolation is constructed once using GSL routines in what appears to be a minimisation procedure based on a Newton algorithm. We evaluate the interpolation only once, at \(t=0\) (in adjustSolution).

Issues

Deadlocking

When operating in an MPI setup, we observe stalling/deadlocking of our program at the beginning of the Newton step. This is most likely due to the OpenMP pragmas in LibTwoPunctures. It should be noted that the code works just fine in the absence of MPI (or when running with just one rank, for that matter).

Duplication of compute

The way the program is set up, every rank computes the same interpolation problem. Arguably, this is acceptable, as we do it only once per run. It will, however, eat into our scalability, since it is effectively a serial phase of the program.

Solutions

We can identify several solutions.

Switch off OpenMP in LibTwoPunctures

That’s a dirty solution which does not address any of the underlying problems of the code, but it allows us to run ccz4 on more than one rank. It is, however, dead simple. In the file libtwopunctures/Makefile we make a single change, i.e. we comment out the -fopenmp flag:

    LINKGSL=-lgsl -lgslcblas -lm
    DEBUG=-g
    OPTIMIZE=-O3
    library=libtwopunctures.a
    # enable parallelization
    #OMP=-fopenmp

Compute on just one rank and broadcast

This is somewhat more elegant. We instruct only one rank to perform the construction of the interpolation and simply broadcast the required data to all other ranks. This still requires a fix for the OpenMP bit in libtwopunctures.

    int rank = 0;
    int size = 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Every rank allocates the data structures that will hold the interpolation.
    TP::TwoPunctures _tp;
    int const ntotal = _tp.npoints_A * _tp.npoints_B * _tp.npoints_phi;
    _tp.allocate_derivs (&_tp.u,    ntotal);
    _tp.allocate_derivs (&_tp.v,    ntotal);
    _tp.allocate_derivs (&_tp.cf_v, ntotal);
    if      (_tp.grid_setup_method == "Taylor expansion") _tp.gsm = TP::GSM_Taylor_expansion;
    else if (_tp.grid_setup_method == "evaluation")       _tp.gsm = TP::GSM_evaluation;

    MPI_Barrier(MPI_COMM_WORLD);

    // Only rank 0 runs the expensive construction of the interpolation.
    if (rank == 0)
    {
      TP_bindding::prepare(&_tp);
    }

The elegance is lost when looking at the data structures and realising that a ton of broadcast statements have to be issued. It is not just the derivs that need to be copied but also the ill-labelled Parameters: it turns out the so-called Parameters are changed during the initial setup phase.

    // Broadcast the interpolation data: all ten derivative arrays of u, v and cf_v.
    MPI_Bcast(_tp.u.d0,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.u.d1,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.u.d2,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.u.d3,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.u.d11, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.u.d12, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.u.d13, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.u.d22, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.u.d23, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.u.d33, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    
    MPI_Bcast(_tp.v.d0,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.v.d1,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.v.d2,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.v.d3,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.v.d11, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.v.d12, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.v.d13, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.v.d22, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.v.d23, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.v.d33, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    
    MPI_Bcast(_tp.cf_v.d0,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.cf_v.d1,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.cf_v.d2,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.cf_v.d3,  ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.cf_v.d11, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.cf_v.d12, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.cf_v.d13, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.cf_v.d22, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.cf_v.d23, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(_tp.cf_v.d33, ntotal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    
    // Broadcast the "parameters" that were altered during the setup on rank 0.
    MPI_Bcast(&_tp.gsm, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.mp, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.mm, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.par_b, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.par_P_minus, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.par_P_plus, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.par_S_minus, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.par_S_plus, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.TP_epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.n1, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.n2, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.n3, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.nvar, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.conformal_state, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.antisymmetric_lapse, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.averaged_lapse, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.pmn_lapse, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&_tp.brownsville_lapse, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Finally, mark the interpolation as constructed on every rank.
    _tp.runned = true;
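
The amount of boilerplate could be reduced. The following is a sketch only and not part of the current code: a small free function loops over the ten derivative arrays of one struct. bcastDerivs is a hypothetical name, and the template parameter merely assumes the pointer members d0 to d33 used in the listing above.

    // Hypothetical helper: broadcast the ten derivative arrays of one struct
    // (u, v or cf_v) from rank 0. D is assumed to expose the pointer members
    // d0 ... d33 that appear in the listing above.
    template <typename D>
    void bcastDerivs(D& d, int ntotal, MPI_Comm comm)
    {
      double* fields[10] = { d.d0,  d.d1,  d.d2,  d.d3,  d.d11,
                             d.d12, d.d13, d.d22, d.d23, d.d33 };
      for (double* f : fields)
        MPI_Bcast(f, ntotal, MPI_DOUBLE, 0, comm);
    }

    // The thirty explicit broadcasts above then shrink to three calls:
    bcastDerivs(_tp.u,    ntotal, MPI_COMM_WORLD);
    bcastDerivs(_tp.v,    ntotal, MPI_COMM_WORLD);
    bcastDerivs(_tp.cf_v, ntotal, MPI_COMM_WORLD);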

External computation of interpolation

Probably the most flexible solution, and my preferred one, would be to design code that allows us to store the interpolation data and the “altered parameters” to disk. This has several advantages:

* we circumvent the OpenMP problems in libtwopunctures
* we no longer have a serial phase of several minutes to construct the interpolation
* we can write a const-correct interpolation evaluator
* parameter studies become easier (currently a bit of an ungodly mess)

In an MPI setting, rank 0 would read the interpolation data and broadcast it to all other ranks.

The interpolation data comprises three derivs, each containing 10 arrays of length ntotal. The parameter data is negligible. With the current default settings, ntotal = 14400, meaning that we need to write/read \(3 \times 10 \times 14400 = 432000\) doubles, which is equivalent to roughly 3.5 MB of disk space.
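
If the data is stored as one contiguous block, the broadcast in the MPI setting becomes a single collective call. What follows is a minimal sketch, assuming a flat binary file tp_interpolation.bin that holds the \(30 \times ntotal\) doubles in a fixed order; the file name and layout are assumptions (see the remark on the data structure at the end of this section), and rank, ntotal and _tp are the variables from the listing in the previous section.

    // Requires <cstdio>, <cstring> and <vector>.
    std::vector<double> buffer(30 * ntotal);

    if (rank == 0)
    {
      // Only rank 0 touches the file.
      std::FILE* f = std::fopen("tp_interpolation.bin", "rb");
      std::fread(buffer.data(), sizeof(double), buffer.size(), f);
      std::fclose(f);
    }

    // One collective call replaces the thirty individual broadcasts.
    MPI_Bcast(buffer.data(), static_cast<int>(buffer.size()), MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // Every rank then unpacks the buffer into the three derivative structs, e.g.
    std::memcpy(_tp.u.d0, buffer.data(), ntotal * sizeof(double));
    // ... and so on for the remaining 29 arrays, in the stored order.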

Below is a skeleton of an interpolator class with const correctness and a clear separation of parameters and computations. We would still use libtwopunctures for the main work (chebev), so this is not a total rewrite but a more appropriate structure.

    class Interpolator
    {
      public:
        // Evaluate the interpolation at position pos and write the result into Q.
        void eval(const double* const pos, double *Q) const;
        // Constructor: takes the precomputed interpolation data.
        Interpolator(const deriv& u, ...) : _u(u), ... {}
        // Reconstruct an interpolator from data previously stored on disk.
        static Interpolator readData(const char* fname);

      private:
        deriv const _u;
        deriv const _v;
        deriv const _cf_v;
        Parameters const _params;
    };

    void Interpolator::eval(const double* const pos, double *Q) const
    {
      // Call functions --- these are real library functions, they don't change our parameters.
      PunctTaylorExpandAtArbitPosition(...);
      // Calls allocate, calculate, free_derivs as well as interpol.

      PunctIntPolAtArbitPositionFast(...);
      // Calls PunctEvalAtArbitPositionFast which in turn calls a bunch of functions;
      // these are Chebyshev polynomials and evaluators.

      BY_Aijofxyz(...); // This one is simple, does not call other functions.
    }
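
Usage on the application side would then boil down to a few lines. The following is a sketch only; the file name, the state vector size of 59 and the exact call site in adjustSolution are placeholders.

    // Hypothetical call site, e.g. inside adjustSolution at t=0.
    static Interpolator const interpolator = Interpolator::readData("tp_interpolation.bin");

    const double pos[3] = {0.0, 0.0, 0.0};  // coordinates of the point to initialise
    double Q[59];                           // placeholder size for the ccz4 state vector
    interpolator.eval(pos, Q);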

Regarding the data structure, this is almost trivial: we can store the deriv data in a \(30 \times ntotal\) array and the parameters however we want.
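
A minimal sketch of the write side, assuming the same flat binary layout as in the read/broadcast sketch above (the 30 arrays back to back, followed by the few scalar parameters); writeData is a hypothetical name and D stands for the deriv struct.

    #include <cstdio>

    // Hypothetical dump routine: store the three derivative structs as one
    // 30 x ntotal block of doubles, in the same order as the broadcasts above.
    template <typename D>
    void writeData(const char* fname, const D& u, const D& v, const D& cf_v, int ntotal)
    {
      std::FILE* f = std::fopen(fname, "wb");
      const D* structs[3] = { &u, &v, &cf_v };
      for (const D* s : structs)
      {
        const double* fields[10] = { s->d0,  s->d1,  s->d2,  s->d3,  s->d11,
                                     s->d12, s->d13, s->d22, s->d23, s->d33 };
        for (const double* field : fields)
          std::fwrite(field, sizeof(double), ntotal, f);
      }
      // The altered "parameters" (masses, momenta, spins, lapse flags, ...) would
      // be appended here in a fixed order; they amount to only a few dozen values.
      std::fclose(f);
    }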