Parallel Design Patterns in C++ for the Intel Xeon Architecture: Quantitative Finance Code Modernization

Introduction

Following the recent adoption of the ISO Technical Specification for Parallel STL Algorithms in C++17, this article describes some of the parallel design patterns now available to C++ programmers seeking to deploy code on many-core architectures such as the Intel Xeon Phi. Parallel design patterns and parallel STL provide support for deploying legacy serial codes on many-core architecutures, a topic of much interest for developers and maintainers of complex serial C++ code-stacks which make heavy use of STL in addition to vendor math and scientific computing libraries which are often written in C. Examples in derivative modeling and graph traversal for efficient information retrieval illustrate and demonstrate the performance gains from using design patterns in parallel STL on complex simulation and information retrieval tasks.

Background

Modern CPUs are multicore and many-core devices have advanced significantly beyond merely running serial code very efficiently. They provide myriad hardware features to improve performance while remaining transparent to the programmer. Over the last decade, the semiconductor industry has increased the number of cores in the microprocessor to achieve higher performance, a solution which has avoided excessive power dissipation. Additionally, cores now provide one or more vector processing units (VPUs) for high floating point performance. The latest version of the Intel(R) Xeon Phi, codenamed Knights Landing (KNL), provides 72 cores each of which provides two x86 based VPUs. Multiple cores can however only provide better performance when the application can take advantage of the parallel compute resources they provide. A number of mature software technologies have been developed by CPU hardware manufactures to facilitate this task, including packages such as OpenMP and Intel(R) Thread Building Blocks. But ultimately, there is no guarantee of scalable application performance on multicore architectures without recompilation or recoding of the code base for existing applications.

GPUs are many-core microprocessors that are fundamentally different from CPUs: rather than provide complex hardware and features to run through a smaller number of threads, or sequences of instructions, as fast as possible, they instead allocate the vast majority of the silicon area to relatively simple processing units to increase total instruction throughput with the assumption that there is going to be hundreds to thousands of threads that can be executed at the same time. As a result GPUs are particularly suited to computationally intensive applications with ‘parallelism’ - a high degree of independence between the individual computations.

Kernel programming

CUDA, NVIDIA’s parallel programming platform and model and OpenCL, an open framework for programming heterogeneous platforms consisting of CPUs and accelerator devices, have been successfully embraced by quantitative infrastructure groups and provide substantial speed-ups to computational methods that are able to exploit the many thousands of cores that are provided by the latest versions of GPUs. Examples of such methods with a high degree of parallelism include Monte-Carlo simulation and iterative linear system solvers for implicit finite difference methods.

Despite conveniently basing CUDA’s programming environment on C++ with minor keyword extensions, high performance implementations require detailed knowledge of the vector instruction limitations, memory hierarchy characteristics, and cross-thread synchronization capabilities. GPU programming calls for technical skills outside those held by a typical software architect or programmer working in the finance industry and significant experience is required to architect highly parallel implementations to avoid the many pitfalls in parallel programming. Owing to the choice to restrict the silicon area to relatively simple processing units, GPUs support a more restrictive set of parallel software architectures which we shall refer to as ‘parallel design patterns’.

CPU programming

Modern CPUs readily support a wide range of programming styles including SIMD, fork-join and message passing. The task of porting legacy serial code to many-core CPUs is made easier by a wide range of implementation support infrastructure such as OpenMP and TBB in addition to the capability of the compiler to auto-vectorize. Furthermore, code can be efficiently run in parallel without the need to offload sections to an accelerator device with a separate memory.

But the simplicity and convenience that the modern CPU and compiler brings does not compensate for poor parallel software architecture. Robison and Johnson (2010) show that any single programming style may not be sufficient for an application and experience often plays the lead role in deciding how to ‘compose’ the programming styles.

The code modernization conundrum

This simple example not only illustrated the limitations of kernel programming but also the wider range of parallel architecture choices available to the programmer. Simply approaching parallel design with a one size fits all mentality is not an effective approach for utilizing the Intel(R) Xeon Phi. Instead, judicious use of openmp or TBB can lead to much more effective porting of serial code, but may be difficult for a less experienced programmer to implement scalable code. There are typically a great number of design and implementation choices which can be overwhelming for a C++ programmer. Which parallel C++ programming constructs are needed to parallelize the code? Which implementation support infrastructure to use? How should the application be re-factored to exploit the memory bandwidth and/or CPU capacity of many-core architectures?

Overview

Recently parallel STL Algorithms have been added to the C++17 ISO standard. This article shows how design patterns arise naturally in parallel STL and how the programmer can take advantage of this library to efficiently port serial algorithms to the Intel Xeon and Xeon Phi architectures. A key aspect of our research is how an application programmer can take advantage of some the parallel STL constructs to compose C style legacy functions, which are often provided by scientific library vendors. In the next section, we describe the guiding principles of parallel software architecture and introduce terminology and definitions for describing parallel design patterns. We continue with a brief review of the main advantage for using STL from a design patterns perspective and go on to explain how parallel design patterns can be implemented with parallel STL. Various examples and benchmarking results, including detailed cache and CPU utilization results are presented for a prototype application. We also point to other parallel design patterns useful for information retrieval and provide an implementation of a pipe-and-filter graph traversal algorithm using parallel STL.

Parallel Software Architecture

While traditional software architecture defines the components that make up an application, the communication among components, and the fundamental computations that occur inside components, it does not prescribe how the software architecture is mapped onto the hardware of a parallel computer. With an increasing variety of commoditized parallel platforms available, application developers with little or no parallel computing expertise seek a simple, systematic and comprehensive approach for guiding software architecture design. In addition to achieving scalability, this approach must not compromise flexibility, portability and maintainability of the software.

OPL is a unified pattern language for designing and implementing parallel software and has formed the basis for pattern oriented software frameworks described in (Chong (2010), E. Gonina et al. (2014), E. I. Gonina (2013)). OPL distills the software architecture design process into simple tried and tested recipes, which if followed likely lead to efficient mapping of the application of the architecture. It achieves this by

Defining the scope of the software;
Providing common vocabulary for application developers and efficiency programmers;
Enabling modularity and comprehensive coverage of the application domain; and
Providing a language for describing the software architecture of applications.

OPL Keutzer and Mattson (2010) considers the structural patterns (also known as architectural styles) that define the overall organization of an application, the basic computational patterns for each stage of the problem, and well as the low level details of the parallel algorithm. The following figure provides a conceptual overview of OPL and highlights some of the structural, computational, parallel algorithm and implementation strategy patterns that are common in financial simulation.

Conceptual overview of Our Pattern Language (OPL). Figure courtesy of Kurt Keutzer, ASPIRE Lab, UC Berkeley.

Computational patterns: These describe the classes of the computations that the application is composed of are often the compute bottle-necks.
Structural patterns: These describe the overall organization of the application and how the computational patterns interact.
Parallel algorithm strategy patterns: These define high-level strategies to exploit concurrency in a computation for execution on a parallel computer. They address the different ways concurrency is naturally expressed within a problem providing well known techniques to exploit that concurrency.
Implementation strategy patterns: These are the structures that are realized in source code to support (a) how the program itself is organized and (b) common data structures specific to parallel programming.

The UC Berkeley Aspire Lab’s website additionally provides a useful resource for understanding and referencing the parallel design patterns in OPL (“Our Pattern Language,” n.d.).

C++ and Parallel Programming

The level of support for parallel programming in C++ has recently been standardized. When the ISO C++14 Standard was published, an ISO Technical Specification for Parallel STL Algorithms was issued. This Technical Specification was based on extensive input from the principal developers of several widely used parallel algorithm libraries, including NVidia’s Thrust, HPX, Intel’s Thread Building Blocks and CilkPlus. Additional input came from members of the OpenMP organization, Sandia National Lab, Lawrence Livermore National Lab, Argonne National Lab, CERN, as well as Microsoft, IBM, and RedHat.

The Parallel Algorithms Technical Specification was adopted into C++17 and is now published as the official ISO C++ Standard.

C++ Programming with the Standard Template Library (STL)

C++’s STL has had a long legacy of providing standardized implementations of many commonly used algorithmic patterns, and is therefore a natural starting point for the exploration of parallel design patterns. We review some of the key concepts in C++’s STL and then discuss the most recent parallel programming support and it’s use for implementing parallel design patterns.

C++’s STL has long shipped with a library of generic algorithms in the form of the the and headers. These algorithms compliment the Standard Template Library’s containers and iterator concepts. The algorithms are optimal implementations of many common algorithmic patterns that developers invariably end up reimplementing, and have surprising breadth, a small sampling and further documentation is provided in Appendix A.

The argument for standard algorithms

The algorithms of the Standard Library are well defined and cover a broad range of functionality. It is well known that these algorithms are generally superior to bespoke implementations of the same functionality and recall some of the main points here:

They are correctly implemented across all conformant vendor implementations
Provide the same or better performance as virtually all bespoke implementations For example copy will make use of SIMD compiler intrinsics to perform vector wide copies if the container contains suitable primitive types.
Define a common vocabulary for algorithms
They reduce the cognitive burden of understanding control flow e.g.
- for_each() is a much more concise statement of intent than the eqivalent bare for loop
```
    for (auto it = v.begin(); it != v.end(); ++it) {
        ...
    }
```
- find_first() is a much more concise statement than a bare for loop with early exit
```
    for (auto it = v.begin(); !pred(*it); ++it);
```
They improve code modularity
- predicates and functions focus only on the what, not the how

They form the basis for building additional generic algorithmic patterns, for instance, insertion sort can be defined as follows:

template<class Iterator,
         class Compare>
Iterator insertion_sort(Iterator begin, Iterator end, Compare comp) {
    for (auto it = begin; it != end; ++it)
        std::rotate(std::upper_bound(start, it, *it, comp), it, std::next(it));
    return it
}

Another example is the gather operation from the Scatter/Gather Pattern:

    template<class BidIt,
             class Predicate>
    std::pair<BidIt,BidIt> gather(BidIt begin, BidIt end, BidIt partition,
                                  Predicate pred) {
        return { std::stable_partition(begin, partition, std::not_fn(pred)),
                 std::stable_partition(partition, end, pred) };
    }

C++ Standard Parallel Algorithms

The parallel algorithms extend the existing C++ Standard Template Library’s <algorithm> and <numeric> algorithm definitions with versions that follow the general form of

    template<class ExecutionPolicy, class Iterator, ...>

    Iterator some_algorithm(ExecutionPolicy policy, Iterator begin, Iterator end, ...);

where ExecutionPolicy is one of the policies defined in <execution>

std::execution::sequenced_policy - A policy which dictates that the algorithm’s execution may not be parallelized. This policy is equivalent to the existing pre-C++17 standard algorithm execution behavior.
std::execution::parallel_policy - A policy which states that the algorithm’s execution may be parallelized. The exact mechanism is dependent on the choice made by the standard library vendor, and may include Thread Building Blocks (Intel’s current implementation, also supported by NVidia’s Thrust library), OpenMP (GCC libc++, NVidia’s Thrust Library) or custom thread pools (e.g. GCD on OSX, Windows Thread Pool). This policy is equivalent to

    template<class RanIt,
             class Function>
    RanIt for_each(std::execution::parallel_policy, RanIt begin, RanIt end, Function fun) {
        #pragma parallel omp for
        for (auto it = begin; it != end; ++it)
            fun(*it);
    }

std::execution::parallel_unsequenced_policy - A policy which states that the algorithm’s execution may be parallelized, possibly vectorized, and has no thread local dependencies, so it may be used in a parent work stealing thread pool. For example, the Intel parallel STL algorithms use a mix of Thread Building Blocks for coarse grained thread parallelism, and #pragma omp simd annotated loops for SIMD vectorization if decltype(*end) is a type for which the target architecture (e.g. AVX256/512) supports vectorization.

Execution Policy Selection

The STL standard algorithms which do not take an ExecutionPolicy argument, execute the algorithm sequentially within the current thread of execution.

The execution::sequential_policy is required to produce the same observable result as the non-policy version of the standard algorithm, so the following expressions are equivalent -

    double tfun(double a, double b) { .. }

    ... 

    using vector_t = std::vector<double>;

    double example1(vector_t const& values) {
        return std::transform_reduce(values.begin(), values.end(), 0, tfun);
    }

    double example2(vector_t const& values) {
        return std::transform_reduce(std::execution::seq, values.begin(), values.end(), 0, tfun);
    }

    int main(int, char* const[]) {
        vector_t values{ 1, 2, 3, ... };

        assert(example1(values) == example2(values));
    }

Further, the caller may safely assume that all element accesses and function invocations will occur on the calling thread, however individual element accesses and function invocations are indeterminately sequenced with respect to each other.

The execution::parallel_policy gives the algorithm the choice to determine how element access and function invocation occur, with the hint that this work should be done in parallel on multiple threads of execution if the underlying hardware and platform support this. Further, the algorithm may “steal” the calling thread to complete some portion of the work (parent stealing) in parallel with the work dispatched to other execution agents. Execution however is constrained such that once work is dispatched to a thread of execution, that work will not then migrate to another thread of execution during the algorithm’s invocation. Unlike the execution::sequential_policy, the execution::parallel_policy requires that all element access and function invocation be data-race free. In the presences of data races, the following two expressions are not guaranteed to be equivalent -

    double tfun(double a, double b) { .. }

    ... 

    using vector_t = std::vector<double>;

    double example1(vector_t const& values) {
        return std::transform_reduce(std::execution::seq, values.begin(), values.end(), 0, tfun);
    }

    double example2(vector_t const& values) {
        return std::transform_reduce(std::execution::par, values.begin(), values.end(), 0, tfun);
    }

    int main(int, char* const[]) {
        vector_t values{ 1, 2, 3, ... };

        // not guaranteed to pass in the presence of data races
        assert(example1(values) == example2(values));
    }

The debugging and testing process for parallel codes is notoriously difficult. The uniform design of the STL’s algorithm dispatch policies makes it possible to trivially switch between parallel and sequential versions of the same code for testing and verification purposes -

    struct option { ... };
    struct risk_params { ... };

    using portfolio_t = std::vector<option>;

    struct risk_measures {
        risk_measures() = default;
        risk_measures& operator+(risk_measures const& lhs);
        bool operator=(risk_measures const& lhs) const;
    };

    risk_measures compute_risk(risk_params const& params, option const& o);

    risk_measures compute_risk(risk_params const& params, portfolio_t const& p) {
    #ifdef NDEBUG
        using exec_policy = std::execution::sequential_policy{};
    #else
        using exec_policy = std::execution::parallel_policy{};
    #endif // NDEBUG
        return std::transform_reduce(p.begin(), p.end(), risk_measures{},
                                     [params](auto const& o) {
                                        compute_risk(params, o);
                                    });
    }

The execution::parallel_unsequenced_policy further relaxes constraints on element access and function invocation, allowing the underlying implementation to take potentially further optimizations if the combination of types, functions, and underlying platform support it. A common example would be to generate vector unrolled inner loop invocations, so that more aggregate work is done per thread of execution. Another example would be to allow units of work to migrate to different threads of execution during algorithm invocation within a work stealing scheduler. This policy should only be used when the caller can meet all of the data-race freedom guarantees of execution::parallel_policy and the further requirement that there is no use of thread local storage. A common example of code that is not suitable for use with this policy is any code which relies on the C standard errno global variable.

Design Patterns in Parallel STL

In OPL, the main considerations for parallel design architecture involve specifying the structural and computational patterns together with their composition. OPL is a meta-language and seeks more tangible expression through parallel programming infrastructure. In this section, we describe how some of the concepts in OPL can be implemented in parallel STL. Our examples, although considerably simplified to convey the many design and implementation constructs, are drawn from numerical methods used from financial derivative pricing.

Structural Patterns

Parallel STL provides support for implementation of structural patterns such as Map-Reduce through transform_reduce as shown here

std::transform_reduce(policy, begin, end,
                      mapfn, reducefn);

The reducefn argument to transform_reduce is defaulted to std::plus<decltype(*end)> this is equivalent to the following code using OpenMP -

    #pragma omp parallel for reduction(+ : res)
    for (auto it = begin; it != end; ++it)
        res += mapfn(*it);

For example, consider the computation of the root mean squared error over a number of financial instruments being used to calibrate a parametric model such as Heston’s stochastic volatility model:

virtual double value(const Array& params) const {
    model_->setParams(projection_.include(params));
    double err = 0.0;
    #pragma omp parallel for reduction(+ : err)
    for (Size i=0; i<instruments_.size(); i++)
        err += instruments_[i]->calibrationError()
            *std::sqrt(weights_[i]);
    return err;
}

In this example, it is assumed that calibrationError() contains a computational pattern which generates a significant amount of work - in this case it is a FFT algorithm. The root mean square can be parallelized with coarse grained shared-memory parallelism using the parallel STL:

#include <boost/iterator/zip_iterator.hpp> // simplifies the code
using namespace boost;
using namespace std;

virtual double value(const Array& params) const {
    model_->setParams(projection_.include(params));
    return transform_reduce(execution::par,
                        make_zip_iterator(make_tuple(begin(instruments_), begin(weights_))),
                        make_zip_iterator(make_tuple(end(instruments_), end(weights_))),
                        0.0,
                        [](auto const& iw) {
                            return get<0>(i)->calibrationError() * sqrt(get<1>(i));
                        });
}

While the coarse-grained parallelism allocates work across cores, an efficient implementation on the Intel Xeon Phi should also use SIMD vectorization. Let’s suppose that the `calibrationError() function calls a derivative model pricer which implements the spectral transform computational pattern. Specifically consider a Fourier-Cosine series transformation of the form:

double HestonCOS(...){
    ...
    double ret = 0.0;
    #pragma omp simd reduction(+:ret)
    for (int k=0; k < N; k++)
    {
        U = 2.0/(b-a)*(xi(k,a,b,0,b) - psi(k,a,b,0,b));
        std::complex<double> HCF = HestonCF(k*pi/(b-a),T,r,sigma,lmbda,meanV,v0,rho);
        ret += unit*(HCF*exp(j1*double(k)*pi*(x-a)/(b-a))).real()*U;
         unit = 1.0;
    }
    return K*exp
}

Using parallel STL, the same function would be rewritten as

double HestonCOS(...){
    ...
    unit = 1.0;
    return K*std::transform_reduce(execution::par_unseq,
                             0, N,
                             0.0,
                             [&](auto const k) {
                                auto U = 2.0/(b-a)*(xi(k,a,b,0,b) - psi(k,a,b,0,b));
                                auto HCF = HestonCF(k*pi/(b-a),T,r,sigma,lmbda,meanV,v0,rho);
                                return unit*(HCF*exp(j1*double(k)*pi*(x-a)/(b-a))).real()*U;
                             });
}

The above examples demonstrate the use of different execution policies to exploit the shared memory and vector SIMD units provided by the Intel Xeon Phi. The parallelization is nested and is a common approach to re-factoring serial code on the Intel Xeon Phi. The efficacy of the implementation depends on the amount of work generated, in this case by ensuring N is sufficiently large and using HestonCF- a vectorizable function that calls a number of transcendental functions and perform complex arithmetic operations and only returns a scalar value.

Computational Patterns in Finance

The previous example not only showed the map-reduce function but also gave an example using vectorization of a spectral transformation operation - a common computational pattern. Random number generation, and more generally Monte-Carlo simulation, and dense linear algebra operations are common examples of computational patterns that are deployed in computational finance. In the simplest case, Monte-Carlo simulations are embarrisingly parallel, however many more complex pricing and risk management problems are more involved, such as Bayesian inference with MCMC algorithms, or as shown here, the use of generalized least squares for risk neutral valuation when the derivative has an early exercise feature.

Assuming the computational pattern is implemented in an Intel MKL routine, we show how we use structural patterns to compose the computational patterns with parallel STL. Our focus here will be on demonstrating how parallel STL is used to compose Intel MKL routines that are commonly used in simulations. It should be evident that the implementation considerations are therefore multi-faceted: on the one hand, we seek an efficient implementation of pricing and simulation approaches, but we also seek to use legacy codes, which are not written to interface with C++ and STL. The suggested approach could be adopted as a part of a more general strategy for code modernization.

Parallel Random Number Generation with Intel’s MKL and Parallel STL

The following implementation uses Intel’s Math Kernel Libraries for pseudo-random number generation. These MKL routines are batch oriented and the exposed C and Fortran compatible interfaces are not well suited to the common STL container iteration pattern. These are adapted using the OPL Agent and Repository pattern to permit efficient piece-wise parallel generation of psuedo random numbers.

All streams of random numbers support the operations of the following type -

    struct stream_base {
        stream_base(stream_base const& other);
        stream_base& operator=(stream_base const& other);
        bool valid() const;
    };

The type wrapping the MKL psuedo random number generator(s) is defined as follows -

    struct stream : stream_base {
        enum class rng_kind : MKL_INT {
            sfmt19937 = VSL_BRNG_SFMT19937
        };

        stream(rng_kind brng, MKL_UINT seed)
            : stream_base{ make_stream(brng, seed) }
        { }
    };

The MKL defines the notion of a skip-ahead stream, this type provides a general wrapper for this concept, initialized with an underlying stream type to define the fundamental generator type and seed, as well as a skip size -

    struct skip_ahead_stream : stream_base {
        skip_ahead_stream(stream_base const& strm, long long skip);
    };

The partitioned_stream type implements a Repository of skip_ahead_stream instances, constructed over an underlying stream, with a computed skip increment based on the configured number of simultaneously accessible partitions of the repository.

    struct partitioned_stream : stream_base {
        partitioned_stream(stream_base strm, unsigned num_parts = ...);
        static constexpr short max_stream_partitions();
        static constexpr size_t partition();

        skip_ahead_stream get_next_stream();
    };

The generated_range type incrementally generates a range of psuedo random numbers using a given distribution generator and skip_ahead_stream.

    template<class Generator>
    struct generated_range {
        using value_type = typename Generator::value_type;

        generated_range(skip_ahead_stream strm, Generator gen, unsigned buf_max);
        const_iterator begin() const;
        const_iterator end() const;
    };

The Agent portion of the Agent and Repository Pattern is implemented using the C++ Named Constructor idiom, taking a partitioned_stream and distribution generator -

    template<class Generator>
    auto make_distribution(partitioned_stream& stream, Generator gen, unsigned buf_max = ...)
        -> decltype(generated_range<Generator>) { ... }

An example of a distrubtion generator is the guassian_generator for pseudo random numbers of type T

    template<typename T>
    struct gaussian_generator {
        using value_type = T;

        enum class method : int {
            boxmuller2 = VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2
        };

        gaussian_generator(value_type a, value_type sigma,
                           method mthd = method::boxmuller2);

        template<class Stream, class ContiguousIterator>
        void operator()(Stream& strm, ContiguousIterator begin, ContiguousIterator end) const;
    };

MKL’s Linear Algebra Routines

Dense Linear algebra routines are commonly used in mathematical and data modeling for transformations and projections. Rather than applying the linear algebra routine to a single large problem, as is often the case in scientific computing benchmarks such as LINPACK, we consider the more common example, where the linear algebra routine is sufficiently large that it creates a bottle-neck in the serial implementation, but is assumed to be nested inside a for loop. Common examples in computational finance include solving multiple independent linear systems, such as implicit finite differencing over a basket of options. Another example, is the pricing of a portfolio of American options, each instrument being priced through independent batches of Least-Squares Monte-Carlo simulations.

Least Squares Monte Carlo and Singular Value Decomposition

Least squares Monte-Carlo requires the solution of a generalized least square problem. A numerically robust approach to the generalized least squares problem is SVD. SVD algorithms are well known to scale poorly on parallel architectures. To avoid poor parrallel performance, our parallel architecture exploits the parallelism exposed by the application, not within each SVD routine. Here we will concurrently apply the Intel MKL SVD implementation to small matrices and use the random generator to simulate the paths of the underlying. Note that typically, a time-stepper such as the Euler-Mayrama scheme would be needed, but for clarity of exposition has been ommitted here. Our challenge here is how to compose C style functions, implementing computational pattern, with the map-reduce structural pattern implemented in parallel STL.

The implementation uses Intel’s Math Kernel Libraries for singular value decomposition (SVD). The routines of the MKL are designed to be used from C and Fortran, so they expose a very classic C calling style.

The MKL SVD routines are adapted to support compile time type dispatch as follows -

    #include <mkl_lapacke.h>

    namespace lapacke {
        template<class T>
        int gesvd(int, char, char, int, int, T*, int, T*, T*, int, T*, int, T*);

        template<>
        int gesvd<float>(int matrix_layout, char jobu, char jobvt, int m, int n,
                         float *a, int lda, float *s, float *u, int ldu,
                         float *vt, int ldvt, float *superb) {
            return LAPACKE_sgesvd(matrix_layout, jobu, jobvt, m, n, a, lda, s, u, ldu,
                                    vt, ldvt, superb);
        }

        template<>
        int gesvd<double>(int matrix_layout, char jobu, char jobvt, int m, int n,
                          double *a, int lda, double *s, double *u, int ldu,
                          double *vt, int ldvt, double *superb) {
            return LAPACKE_dgesvd(matrix_layout, jobu, jobvt, m, n, a, lda, s, u, ldu,
                                    vt, ldvt, superb);
        }
    }

The svd_input type will provide the individual input parameters to the SVD functor. In the example provided this is simply the matrix dimensions and a vector of random numbers from a Gaussian distribution.

In a typical application, this type would be extended via derivation to carry additional input to the SVD functor.

    template<class T>
    struct svd_input {
        using vec_t = std::vector<T>;

        using generator_type = vsl::gaussian_generator<T>;
        struct matrix_dims { ...  };

        svd_input(vsl::partitioned_stream& strm, generator_type gen, matrix_dims dims);

        using result_type = std::pair<matrix_dims, vec_t>;
        result_type operator()(int) const;
            auto gr = vsl::make_distribution{ strm_, gen_ };

            vec_t a;
            a.reserve(dims_.size());
            std::copy_n(gr.begin(), dims_.size(),
                            std::back_inserter<vec_t>(a));

            return std::make_pair(dims_, a);
        }
    };

The return result of the svd functor is a class type which collects both the singular value decomposition result vector and a result code from the lapack SVD routine. We also define a merge operation for this type upon which the reduction phase of the Map Reduce pattern is built -

    namespace result {
        template<class T>
        struct res_t {
            res_t() = default;

            res_t(int ec, std::vector<T> v);

            void swap(res_t& other);
            size_t size() const;

            friend
            res_t merge(res_t lhs, res_t rhs);
        };
    }

The svd functor type performs singular value decomposition for a single input of type svd_input and defined as follows -

template<class T>
struct svd {
    using vec_t = std::vector<T>;
    using res_t = result::res_t<T>;

    using arg_t = typename svd_input<T>::result_type;
    res_t operator()(arg_t&& a) {
        auto [dims, aa] = a.first;

        vec_t s( dims.m_ );
        vec_t superb( dims.min() - 1 );
        vec_t u( dims.m_ * dims.m_ );
        vec_t vt( dims.n_ * dims.n_ );

        auto ec = lapacke::gesvd(LAPACK_COL_MAJOR, 'A', 'A', dims.m_, dims.n_,
                                    aa, dims.m_, s.data(), u.data(), dims.m_,
                                    vt.data(), dims.n_, superb.data());
        if (ec < 0) {
            static auto msg = boost::format("lapacke_gsvd: illegal parameter, ec= %1%");
            throw std::runtime_error(boost::str(msg % ec));
        }
        return { ec, std::move(s) };
    }
};

Parallel Random Number Generation + Singular Value Decomposition

Declaring a partitioned_stream for random number generation -

    vsl::partitioned_stream strm{
            vsl::stream{ vsl::stream::rng_kind::sfmt19937, 1 }
        };

Generating a vector of svd_input<double>

    constexpr auto M = 10000; // number of inputs to generate
    constexpr auto dt = sqrt(1.0/ M);

    // SVD matrix dimensions
    constexpr auto m = 400;
    constexpr auto n = 250;

    using input_t = svd_input<double>;
    using gen_t = vsl::gaussian_generator<double>;

    std::vector<input_t> inputs;
    inputs.reserve(M);

    std::generate_n(std::back_inserter(inputs), M,
        [&] { return input_t{ strm, gen_t{ 0, dt }, m, n }; });

A parallel Map Reduce pattern applying SVD over the input vector is given by -

    using dsvd = svd<double>;
    auto res = std::transform_reduce(std::execution::par, inputs.begin(), inputs.end(),
                                     dsvd::res_t{},     // initial value
                                     merge,             // reduction function
                                     dvsvd{}            // map function
                                    );

Application Example

Following OPL, we prioritized the consideration of the computational and structural patterns. Here we illustrate, with more detail, how a portfolio of American options can be efficiently priced on an Intel Xeon architecture using the main concepts above.

Given an american_option -

    struct american_option {
        american_option(double T, double S, double K, double r, double sigma, double q);
        ...
    };

a portfolio of such options -

    using portfolio_t = std::vector<american_option>;

a volatility curve -

    using volatility_curve { ... };

and a pricing model -

template<long M, long N>
struct lsm_functor {
    struct arg_t : svd_input {
        arg_t(svd_input a,
              volatility_curve const& v,
              american_option const& o);
    };

    struct result_greeks {
        ...
    };

    result_greeks operator()(arg_t const& arg) { ... }
};

transform a portfolio into inputs for pricing -

    portfolio_t portfolio;
    auto input = csv::range_from_csv<american_option>(std::cin));
    std::copy(input.begin(), input.end(), std::back_inserter(portfolio));


    using lsm = typename lsm_functor<10000, 2>;
    using arg_t = lsm::arg_t;
    using gen_t = vsl::gaussian_generator<double>;

    volatility_curve vol{ ... };

    std::vector<arg_t> inputs;
    std::transform(portfolio.begin(), portfolio.end(), std::back_inserter(inputs),
                   [&](auto const& o) {
                        auto const si = svd_input{ strm, gen_t{ 0, dt } };
                        return arg_t{ si, vol, o };
                   });

Define what it means to aggregate result_greeks -

    struct greeks_aggregator {
        using result_t = lsm::result_greeks; 
        result_t operator()(result_t const& lhs, result_t const& rhs);
    };

Perform risk aggregation via parallel Map Reduce -

    auto res = std::transform_reduce(std::execution::par, inputs.begin(), inputs.end(),
                                     lsm::result_greeks{},          // intial value
                                     greeks_aggregator{},           // reduction function
                                     lsm{}                          // map function, in our case a lsm pricing model
                                    );

The above example provides the template for an efficient implementation of portfolios of american options on the Intel Xeon Phi and can be readily adapted for any application where there is data parallelism introduced by, say, a portfolio and there are a number of legacy numerical routines required for numerical solution.

Experimental results

We evaluated the performance of the combined RNG+SVD code on the Intel(R) Xeon Phi(TM) CPU (7250) - the Knights Landing generation. This processor has 68 cores and is clocked at 1.40GHz. The experiment measures the wall-clock time of 272 parallel randomized SVDs of a small dense rectangular matrix of size 200x125. In each experiment, the thread count per core and the total number of cores is controlled to evaluate the scalability of the application. All times are provided in seconds.

Cores	1 thread/core	2 threads/core	3 threads/core	4 threads/core
1	41.02	28.86	25.53	22.82
2	25.92	20.39	20.24	19.84
4	13.93	10.67	10.02	10.21
8	7.90	5.99	6.05	3.88
16	4.86	5.03	1.42	1.33
32	4.90	1.21	0.69	0.45
64	1.03	0.36	0.44	0.61
68	0.70	0.32	0.46	0.54

Cache utilization

We apply VTune to analyze the cache utilization and performance characteristics in comparison with a baseline serial implementation.

Property	Baseline	Parallel transform_reduce
Overall CPU utilization	2.8%	98.89%
L2 Demand Misses	0.0%	9.8%
Prefetch	100.0% of L2 Input requests	90.2% of L2 input requests
DRAM bound performance	1.2%	95.3%
SIMD Instructions per cycle (% packed SIMD)	100.0%	97.6%

The above results demonstrate the efficient mapping of the computations to the Intel Xeon Phi.

Implementing the Pipe and Filter pattern

Another example of a parallel design pattern which is common in information retrieval from financial data is the the pipe and filter pattern. This pattern is typically implemented using a task scheduler and (concurrent) queues. For this example, however, we express this pattern as a directed acyclic graph (DAG) with the vertices representing filters and the edges representing pipes.

Vertices in our implementation are one of three types

sources - the point at which a value enters the graph of pipes and filters
filters - the filter functions to apply at each point in the graph traversal
sinks - the function(s) that ultimately collect the results of the graph traversal

Filters are represented as any function which conforms to the constructor of -

    template<class T>
    using filter_fn = std::function<T(T const&)>;

Sinks are represented as any function which conforms to the constructor of -

    template<class T>
    using sink_fn = std::function<void(T const&)>;

The vertex type is defined as -

    template<class T, class E>
    struct vertex {
        using this_type = vertex<T, E>;
        using value_type = T;
        using name_type = E;

        using optional_t = std::optional<value_type>;

        std::variant<identity_fn, filter_fn, sink_fn> filter_;
        using pipes_t = std::vector<name_type>;
        pipes_t pipes;

        vertex()
            : filter_{ identity_fn{ } }
        { }

        vertex(filter_fn f)
            : filter_{ std::move(f) }
        { }

        vertex(sink_fn f)
            : filter_{ std::move(f) }
        { } 

        // we propagate the results of a filter invocation to adjacent vertices
        // by means of this method
        void set_value(value_type const&);

        // we apply filter and sink functions to propagated values by invocation
        // of this method
        optional_t apply() const;

        // atomically set layer distance from source
        // return true if value was replaced
        bool set_distance(long);

        // reset vertex state
        void reset();
    };

The graph is defined as

    template<class Name, class T>
    graph {
        using name_type = Name;
        using value_type = T;

        bool add_source(name_type&& name);
        bool add_filter(name_type&& name, filter_fn<value_type> f);
        bool add_sink(name_type&& name, sink_fn<value_type> f);

        bool add_pipe(name_type const& from, name_type const& to);

        vertex_type& operator[](name_type const& name);
        vertex_type const& operator[](name_type const& name) const;
    };

Where Name is the type used to name each vertex in the graph (e.g. int, string, etc.) and T is the value type passed to and returned from filters and sinks.

Defining a graph of pipes and filters is done as follows

    graph<int, int> g;
    g.add_source(0);
    g.add_filter(1, [](int x) { return x * 2; });
    g.add_filter(2, [](int x) { return x * 3; });
    g.add_sink(3, [](int x) { std::cout << "x => " << x; });

    g.add_pipe(0, 1);
    g.add_pipe(0, 2);
    g.add_pipe(1, 3);
    g.add_pipe(2, 3);

To post a value to the graph and initiate evaluation of the filters and sinks, we define a function post. The sequential execution version of which could be defined as -

    template<class Graph>
    void post(Graph const& g, typename Graph::name_type const& v0, typename Graph::value_type val);

Sequential evaluation of the graph would proceed using a typical Breadth First Traversal algorithm (Moore (1959), Lee (1961)). However, we are interested in parallel evaluation of the graph. Our implementation is based on a work efficient parallel Breadth First Search algorithm (PBFS) designed by Leiferson and Schardl (2010). The post signature is extended to accept a C++ standard ExecutionPolicy -

    template<class ExecutionPolicy,
             class Graph>
    void post(ExecutionPolicy&& exec, Graph const& g, typename Graph::name_type const& v0, typename Graph::value_type val);

Our algorithm implementation utilizes the C++ standard transform_reduce and for_each algorithms forwarding the ExecutionPolicy from the invocation of post, but otherwise proceeds in accordance with the algorithm outlined in Leiferson and Schardl (2010). Our post function is implemented as follows -

    template<class ExecutionPolicy,
             class Graph>
    void post(ExecutionPolicy&& exec, Graph const& g, typename Graph::name_type const& v0, typename Graph::value_type val) {
        // reset the graph state
        std::for_each(std::forward<ExecutionPolicy>(exec), std::begin(g), std::end(g),
                      [](auto& v) { v.reset(); });

        auto d = 0;
        auto& v = g[v0];
        v.set_distance(d);
        v.set_value(val);

        detail::bag bag{};
        bag.insert(v);
        while (!bag.empty()) {
            auto out = detail::process_layer(std::forward<ExecutionPolicy>(exec), bag, d);
            std::swap(bag, out);
            ++d;
        }
    }

The post function iteratively evaluates each layer of the graph starting from the named source vertex v0, calling detail::process_layer until all vertices have been reached.

detail::process_layer is implemented as follows -

    template<class ExecutionPolicy,
             class Graph>
    bag process_layer(ExecutionPolicy exec, Graph const& g, bag& in, long d) {
        return std::transform_reduce(std::forward<ExecutionPolicy>(exec), 0, in.size(), bag{}
                                     bag::merge
                                     [&](auto k) {
                                            if (in[k] != nullptr) {
                                                auto out = process_pennant(std::forward<ExecutionPolicy>(exec), g, in[k], d);
                                                bag::merge(res, out);
                                            }
                                     });
}

This function makes use of transform_reduce to potentially parallelize the processing of each vertex in a layer at distance d from the source vertex. The results of processing individual grains of the layer are reduced by means of bag::merge. For the purpose of minimizing allocations, we currently derive vertices from the pennant type. Each pennant is processed in the transform phase of transform_reduce by invoking process_pennant.

process_pennant is implemented as follows -

    template<class ExecutionPolicy,
             class Graph>
    bag process_pennant(ExecutionPolicy&& exec, Graph const& g, pennant* in, long d, size_t grain_size = 128) {
        using vertex_type = typename Graph::vertex_type;
        bag res;
        in->visit([&](auto& u) {
                auto v = reinterpret_cast<vertex_type&>(u);
                auto val = v.apply();
                std::for_each(std::forward<ExecutionPolicy>(exec), std::begin(v.pipes), std::end(v.pipes),
                              [&](auto const& n) {
                                    auto& adjv = g[n];
                                    if (adjv.set_distance(d + 1)) {
                                        // only propagate values if previous invocation was not a sink
                                        if (val)
                                            adjv.set_value(val);
                                        res.insert(adjv);
                                    }
                                });
            });
        return res;
    }

See the Appendix for technical notes on this implementation.

Conclusion

Following the recent adoption of the ISO Technical Specification for Parallel STL Algorithms in C++17, this article describes some of the parallel design patterns now available to C++ programmers seeking to deploy code on many-core devices such as the Intel Xeon Phi. Parallel design patterns and parallel STL provide support for deploying legacy serial codes on many-core architecutures, a topic of much interest for developers and maintainers of complex serial C++ code-stacks which make heavy use of STL in addition to vendor math and scientific computing libraries which are often written in c. Examples in derivative modeling and graph traversal for information retrieval illustrate and demonstrate the performance gains from using design patterns in parallel STL on complex simulation and information retrieval tasks.

The examples illustrate how OPL can be used to guide the design of a parallel architecture - in particular identifying bottlnecks and sources of parallelism in the program is key. An important step is understanding the work generated by the application and deciding whether, once parallelized, there will be sufficient work to efficiently use each core. Once the key aspects of the parallel design pattern are defined, we then demonstrate how adopting standardized algorithms and helper functions in parallel stl can be an effective and reliable approach to porting compute bound applications. The article does not present an exhaustive exploration of parallel design patterns with parallel STL, but rather illustrates some of the main conceptual challenges and implementation approaches for some common applications.

Acknowledgments

The author acknowledges funding support from Intel(R) Corporation.

Appendix A

Common algorithmic patterns implemented in STL:

Container iteration
- for_each(begin, end, fun) visit each element in the range [begin, end) applying fun()
- find_first(begin, end, pred) and find_last(begin, end, pred) visit each element in the range [begin, end) applying pred() terminate and return iterator to element where pred() evaluates true, or end, if no such element is found
- all_of, any_of, none_of as for find_first/find_last but returns true if pred() holds for all, any, or none of the elements in a range
Container modification
- copy, copy_if copy (possibly selectively) elements from one range to another
- move, move_if move (possibly selectively) elements from one range to another
- generate create a range through successive applications of a generator function
- transform apply a function to each element of a range, and place the result in a new range
- rotate rotate elements around a point in the range [begin, end)
- shuffle and sample randomly shuffle or sample elements in a range
Numeric operations
- accumulate Apply an operation to each element in [begin, end), by default std::plus<T> in the range [begin, end) and return the result
- reduce Similar to accumulate but does relaxes accumulate’s ordering guarantees.

Full documentation can be found here.

Appendix B

Technical notes on the parallel implementation of the BFS algorithm

Note that our implementation diverges from PBFS in the process_pennant function. Specifically:

We do not treat the data race setting the distance of adjacent vertices as benign, as we want to preclude redundant execution of filter and sink functions. This is not expected to significantly impact performance as Leiferson and Schardl (2010) establishes that these races are infrequent in practice.
We do not currently split work and leverage fork-join parallelism for bags above a given grain_size as the C++ Standard currently does not define any mechanism for interacting with the underlying task scheduler for a parallel algorithm.

Issues and future direction for this approach to implementing Pipe and Filter

While graph traversal is thread safe, only a single post can be in progress at a time. This is because the mutable state of the traversal is partially stored in the vertices of the graph. Introducing the notion of Causality and associating this mutable state with the causality would make the graph reentrant, and would be required of any practical implementation.
There is work to standardize access to the underlying task executor for parallel algorithms, however, for implementations where there is a non-standard way to access the underlying task scheduler, we could provide platform configurable abstractions for the fork-join part of process_pennant.

References

Chong, J. 2010. “Pattern-Oriented Application Frameworks for Domain Experts to Effectively Utilize Highly Parallel Manycore Microprocessors.” PhD thesis, EECS Department, University of California, Berkeley.

Gonina, E. I. 2013. “A Framework for Productive, Efficient and Portable Parallel Computing.” PhD thesis, EECS Department, University of California, Berkeley.

Gonina, E., G. Friedland, E. Battenberg, M. Driscoll, P. Koanantakool, E. Georganas, and K. Keutzer. 2014. “PyCASP: Scalable Multimedia Content Analysis on Parallel Platforms Using Python.” TOMCCAP - ACM Transactions on Multimedia Computing, Communications and Applications 10 (2).

Keutzer, K., and T. G. Mattson. 2010. “A Design Pattern Language for Engineering (Parallel) Software.” Intel Technology 4.

Lee, C.Y. 1961. “An Algorithm for Path Connections and Its Applications.” http://ieeexplore.ieee.org/document/5219222/?reload=true&arnumber=5219222.

Leiferson, C.E., and T.B. Schardl. 2010. “A Work-Efficient Parallel Breadth-First Search Algorithm.” http://supertech.csail.mit.edu/papers/pbfs.pdf.

Moore, E.F. 1959. “The Shortest Path Through a Maze.” In Proceedings of the International Symposium on the Theory of Switching, 282–92. Harvard University Press.

“Our Pattern Language.” n.d. https://patterns.eecs.berkeley.edu.

Robison, A.D., and R.E. Johnson. 2010. “Three Layer Cake for Shared-Memory Programming.” In Proceedings of the 2010 Workshop on Parallel Programming Patterns, 5:1–5:8. ParaPLoP ’10. New York, NY, USA: ACM.