Introduction

Goal of this presentation

The goal of this presentation is to give the flavor of GPU programming, as opposed to both CPU (“normal”) programming and the use of GPU-accelerated libraries or tools (such as are common in the machine learning field). It will not be sufficient for you to go out and start programming. I hope it is sufficient to help you decide whether you are interested in further study of GPU programming to support your own work.

What is a GPU?

GPU stands for graphics processing unit. Usually when we say GPU we mean general purpose GPU – one intended for doing math, not for rendering graphics. But we rarely say GPGPU.

Almost all modern supercomputers get most of their processing power from GPUs. Perlmutter, for example, is (at the time of this writing) the 5th fastest supercomputer in the world. (See https://top500.org for the current list.) Perlmutter Phase 1 has 1536 compute nodes. These are the nodes on which you’d run jobs, as opposed to login nodes, which are used for things like compiling your code. Each (compute) node has 1 AMD EPYC 7763 (Milan) CPU, with 64 cores. The cores are largely independent, but they share resources like some memory caches. Each of these cores can run 2 threads simultaneously. These threads can do different work, but they share some hardware resources and so are not totally independent.

Each (compute) node also has 4 NVIDIA Ampere A100 GPUs. Each GPU has 108 streaming multiprocessors (SMs). The SMs are largely independent. Each SM has:

  • 32 64-bit floating point units
  • 64 32-bit floating point units
  • 64 32-bit integer units
  • 4 tensor cores

The multiple units are not independent. A team of threads running on any of these units must perform the same operation on different data. This is the single instruction, multiple threads (SIMT) execution model. Note that we’ll come back to this when we look at if statements in CUDA/C++.
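
As a minimal sketch of what this looks like in code (using the CUDA/C++ notation introduced later; the kernel here is invented for illustration, not taken from the library examples below), every thread computes its own index and applies the same multiplication to its own element:

__global__ void scale(double* v, double a, size_t n)
{
  size_t const i = blockIdx.x * blockDim.x + threadIdx.x; // this thread's element
  if (i < n)          // conventional bounds guard: threads past the end simply do nothing
    v[i] = a * v[i];  // every thread performs the same multiplication, on different data
}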

Just concentrating on the 64-bit floating point speed, the CPU is capable of about 2.5 TFLOPS, and each GPU is just below 10 TFLOPS (about 39 TFLOPS for all 4 on a node). Of the roughly 41.5 TFLOPS available on a node, about 94% comes from the GPUs and about 6% from the CPU. So to use a node well, you must be doing most calculations on the GPUs.

Different strategies for programming CPUs and GPUs

A modern CPU is organized to make single streams of instructions run quickly. There are multi-level memory caches to hide memory latency. To achieve very high performance, code must be concerned with latency, but much code is not. Instruction execution is very complicated, including both speculative execution and out-of-order execution. These both help support efficient execution of complex logic, for example by starting execution of the body of an if statement before knowing whether the condition of the if statement is true. Programming languages like Python depend on this kind of hardware efficiency.

A GPU is organized to run large numbers of threads, each doing the same operation on different data. The code to be executed on a GPU must be parallel code, and parallel in the manner that GPUs support, which means data parallelism. This often means that we are doing array-based programming. Constructions like if statements can be very damaging to efficiency. Languages and libraries made for programming GPUs are designed to deal with this. I’ll be showing examples of two: CUDA/C++ (an extension of the C++ language) and Kokkos (a C++ library).

Code written to execute on GPUs must be organized very differently than CPU code. First we’ll look at CUDA/C++, and then we’ll compare some of that to Kokkos.

CUDA/C++ programming basics

First we need to introduce some terminology: host and device. The host is the CPU and its memory, called host memory. The device is the GPU and its memory (device memory, of course). In order to be used in calculations, data must be moved from host memory to device memory. And eventually, the results of calculations need to be copied back from device memory to host memory.
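
As a minimal sketch of that round trip, using the explicit-copy runtime calls (the names h_data, d_data, and n are invented for illustration; later examples in this presentation use unified memory instead):

size_t const nbytes = n * sizeof(double);      // n doubles already prepared in h_data (host memory)
double* d_data = nullptr;
cudaMalloc(&d_data, nbytes);                                // allocate device memory
cudaMemcpy(d_data, h_data, nbytes, cudaMemcpyHostToDevice); // copy host -> device
// ... launch kernels that read and write d_data ...
cudaMemcpy(h_data, d_data, nbytes, cudaMemcpyDeviceToHost); // copy results device -> host
cudaFree(d_data);                                           // release the device memory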

The compiler that is used for CUDA/C++ is nvcc. It can be used to compile a program with no device code:

#include <cstdio>
void greeting() {
  std::printf("Hello world\n");
}

int main() {
  greeting();
}

$ nvcc -o helloworld helloworld.cu
$ ./helloworld
Hello world
$

Code that will be called from the host but will run on the device is called a kernel. It is marked specially with __global__ to tell nvcc to generate the necessary code for that to work. A kernel gets called (or launched) from host code; <<< and >>> are used in the host code to tell nvcc how this is to be done. These are the first few differences you’ll notice between C++ and CUDA/C++, which is a proprietary and special-purpose extension of C++.

#include <cstdio>
#include <cuda.h>

__global__
void greeting() {
    unsigned const threadID = blockIdx.x * blockDim.x + threadIdx.x;
    std::printf("Hello world from thread %u\n", threadID);
}

int main() {
    int constexpr nb  = 2;   // number of blocks
    int constexpr tpb = 3;   // threads per block
    greeting<<<nb,tpb>>>();  // launch the kernel
    cudaDeviceSynchronize(); // wait until all are done
}

This code also shows some additional extensions to C++ that are part of CUDA/C++: blockIdx, blockDim, and threadIdx are three of the special pre-defined variables in CUDA/C++. We’ll get to their importance soon.

$ nvcc -std=c++17 -o helloworld helloworld.cu
$ ./helloworld
Hello world from thread 0
Hello world from thread 1
Hello world from thread 2
Hello world from thread 3
Hello world from thread 4
Hello world from thread 5
# The order of the numbering is not guaranteed.

But this is a silly use of a GPU: this is not massive parallelism. Let’s look at some CUDA/C++ that does actual work.

What real CUDA/C++ code looks like

Example from a numerical integration library

My examples come from a multidimensional numerical integration library developed in collaboration with a team from Old Dominion University, in a project supported by a Fermilab LDRD grant. The code is available at https://github.com/marcpaterno/gpuintegration.

This library contains two integration algorithms as well as a handful of utilities intended to help users implement integrands.

A class for interpolation in 1 dimension

The simplest example is the class quad::Interp1D which is used for 1-dimensional interpolation. We’ll look at parts of it, to see how CUDA/C++ is related to C++.

CUDA/C++ can have classes, and inheritance. This looks like normal C++:

class Interp1D : public Managed {
  size_t _cols = 0;
  double* _xs = nullptr;
  double* _zs = nullptr;
  // more code elided...
};

This class uses the arrays _xs and _zs to do calculations on the device (the GPU). But the object itself will exist in host memory. That is, the this pointer for the object will be a location in host memory, not a location in device memory. Let’s see how this is done.

A constructor

Here’s a constructor template that provides data (in host memory) to initialize an Interp1D object:

template <size_t M>
quad::Interp1D::Interp1D(std::array<double, M> const& xs,
                         std::array<double, M> const& zs)
  : _cols(M)
{
  _initialize(xs.data(), zs.data());
}

The data in the arrays xs and zs have to be made available on the device. So far, this looks like normal C++. The member function _initialize is where the first bit of CUDA specialization arises.

CUDA unified memory

inline void
quad::Interp1D::_initialize(double const* x, double const* z)
{
  size_t const nbytes = sizeof(double) * _cols;
  cudaMallocManaged(&_xs, nbytes);
  memcpy(_xs, x, nbytes);
  cudaMallocManaged(&_zs, nbytes);
  memcpy(_zs, z, nbytes);
}

CUDA unified memory is a part of the CUDA runtime that automatically moves data from host to device, or from device to host, as needed. The runtime tries to choose when this is done to provide maximum efficiency. The _initialize function allocates the unified memory, and copies the data from host memory into the unified memory.
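
For contrast with explicit host-to-device copies, a minimal (illustrative) use of unified memory looks like the sketch below; there is no cudaMemcpy, because the runtime migrates the data between host and device as needed. The kernel name and launch configuration here are placeholders.

double* data = nullptr;
cudaMallocManaged(&data, n * sizeof(double)); // one allocation, visible to both host and device
for (size_t i = 0; i != n; ++i)
  data[i] = 0.0;                              // touch the data on the host
some_kernel<<<numBlocks, threadsPerBlock>>>(data, n); // touch the data on the device
cudaDeviceSynchronize();                      // wait before using the results on the host
cudaFree(data);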

Because we have allocated memory using cudaMallocManaged, we have to make sure to release it using cudaFree:

inline quad::Interp1D::~Interp1D()
{
  cudaFree(_zs);
  cudaFree(_xs);
}

The details here are of course complex, but the general point is simple: our class has to handle the marshaling of data from the host to the device, in this case doing so by making use of CUDA unified memory.

Doing the interpolation

These data are used by the function call operator, which actually does the interpolation:

inline __device__ __host__ double
quad::Interp1D::operator()(double x) const
{
  auto [x0_index, x1_index] = _find_neighbor_indices(x);
  const double y0 = _zs[x0_index];
  const double y1 = _zs[x1_index];
  const double x0 = _xs[x0_index];
  const double x1 = _xs[x1_index];
  const double y = (y0 * (x1 - x) + y1 * (x - x0)) / (x1 - x0);
  return y;
}

This (member) function is decorated with the execution space specifiers __device__ and __host__. This makes the CUDA/C++ compiler generate both CPU and GPU code for this function. Otherwise, this member function looks like normal C++. If called from the CPU, this function will use the class member data (_xs and _zs) in host memory. If called from the GPU, it will use the member data in device memory. This means that this function is suitable for being called from many GPU threads at the same time.
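
To make the calling pattern concrete, here is a small host-side usage sketch. The data values are invented, and the include of the library’s own header is elided:

// #include the library's Interp1D header here
#include <array>
#include <cstdio>

int main()
{
  std::array<double, 4> const xs{0.0, 1.0, 2.0, 3.0};
  std::array<double, 4> const zs{0.0, 1.0, 4.0, 9.0}; // samples of z = x*x
  quad::Interp1D const interp(xs, zs);
  // Host-side call: linear interpolation between (1, 1) and (2, 4) gives 2.5.
  std::printf("interp(1.5) = %f\n", interp(1.5));
}

On the device, the same operator() is called in just the same way, typically from inside a kernel or an integrand function.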

Note the difference between a __global__ function (a kernel) and this sort of function: a __global__ function is called from host code but executes only on the device, while a __host__ __device__ function like this one executes in whichever execution space it is called from.

A function like this operator() is in a way the simplest kind of CUDA/C++ code to write: classes and functions that can be used from either the CPU or the GPU. But something (a kernel, as we saw before) must start up the GPU work. We’ll look at one of them next.

A simple kernel

A kernel is launched with several (perhaps many) teams of threads scheduled to execute the code simultaneously. The threads are organized into the blocks specified in the kernel launch, and the hardware executes each block as one or more groups of 32 threads called warps; the threads in a warp must perform the same operations, but on different data. The CUDA runtime handles the scheduling, and it may execute the blocks (and their warps) in any order.

Here is an example from our numerical integration library. This kernel is used to filter sub-regions in the volume of integration to distinguish those regions that need more “polishing” from those that do not:

template <typename T>
__global__ void // kernels never have a return value
Filter(T const* dRegionsError,
       int* unpolishedRegions,
       int const* activeRegions,
       size_t numRegions,
       T errThreshold)
{
  size_t const tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < numRegions) {
    T const selfErr = dRegionsError[tid];
    int const factor = (selfErr > errThreshold);
    unpolishedRegions[tid] = factor * activeRegions[tid];
  }
}

This is a very simple example, but it is illustrative. The important feature is that each thread (identified by tid, for thread ID) is working on a different array element. Threads in the same block have consecutive thread IDs and access adjacent elements of the array, so that a warp’s memory accesses can be coalesced and the GPU’s memory caches are used well.

This is illustrated in the figure below.

Mapping of threads to array entries.

Also notice how we use the automatic conversion from bool to int to avoid an if statement. This kind of construction is common in CUDA/C++ code: we want to avoid conditional logic that could send different threads of the same warp down different execution paths.
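
For comparison, here is what a version written with an if statement might look like (a hypothetical rewrite, not code from the library). When threads in the same warp take different branches, the hardware executes the two paths one after the other (warp divergence). In a body this small the compiler may well generate branch-free code anyway, but for larger branch bodies the cost of divergence is real.

// Divergent alternative (illustrative only):
if (selfErr > errThreshold)
  unpolishedRegions[tid] = activeRegions[tid];
else
  unpolishedRegions[tid] = 0;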

Note also that nothing inside the kernel determines the number of threads that will execute the kernel. That is determined at the call site.

The importance of memory access patterns

Different levels of GPU memory (for example, global, shared and cache) have very different access latencies, and of course a faster memory type is always a more limited resource than a slower memory type. It is thus very important to organize code to make good use of the fastest memory available.

A good example of such organization is the one used for a parallel reduction algorithm. This algorithm applies a binary operation to reduce an array of data to a single value (in this case, a sum). The figure below shows an access pattern designed to:

  1. use as many threads to do work as the data will support,
  2. keep the intermediate data (e.g. the partial sums) as compact in memory as possible, and
  3. allow direct addressing of the necessary data, based only on knowing the number of threads in use.

A first step, not shown, would copy data (in chunks) from global memory to shared memory, allowing the thread team to work on the data as quickly as possible. Keeping the data compact is important to allow the best possible use of the most local cache memory.
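
A minimal sketch of such a block-level reduction is shown below, assuming the block size is a power of two and using dynamically sized shared memory; this follows the classic pattern from NVIDIA’s reduction examples rather than code from our library. The third kernel launch parameter inside <<<...>>> would give the shared memory size in bytes, e.g. threadsPerBlock * sizeof(double).

__global__ void
block_sum(double const* in, double* partial_sums, size_t n)
{
  extern __shared__ double sdata[];                      // one slot per thread in this block
  size_t const tid = threadIdx.x;
  size_t const i = blockIdx.x * blockDim.x + threadIdx.x;
  sdata[tid] = (i < n) ? in[i] : 0.0;                    // step 1: copy global -> shared memory
  __syncthreads();
  for (unsigned s = blockDim.x / 2; s > 0; s /= 2) {     // step 2: tree reduction in shared memory
    if (tid < s)
      sdata[tid] += sdata[tid + s];
    __syncthreads();
  }
  if (tid == 0)
    partial_sums[blockIdx.x] = sdata[0];                 // step 3: one partial sum per block
}

The partial sums (one per block) are then reduced again, either by a second kernel launch or on the host.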

Data access pattern for a parallel reduction algorithm. Source: NVIDIA

Calling the Filter kernel

This kernel is called from a host-only function HSClassify, which is too complex to show here. The call to Filter looks like:

Filter<<<numBlocks, numThreads>>>(dRegionsError,
                                  unpolishedRegions, // this is the output parameter
                                  activeRegions,
                                  numRegions,
                                  ErrThreshold);

While numThreads is a compile-time constant in this function, numBlocks is calculated on each call. Although the <<<>>> notation that decorates the launching of a kernel looks like the notation of C++ templates, it is not: the parameters provided do not need to be compile-time constants.
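
The usual idiom for that calculation is a ceiling division, so that there are enough threads to cover every region; the value 256 below is only an example of a typical block size, not necessarily what HSClassify uses:

size_t constexpr numThreads = 256;                                    // threads per block
size_t const numBlocks = (numRegions + numThreads - 1) / numThreads;  // ceiling division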

It is the launch of the kernel, rather than the implementation of the kernel, that determines the degree of parallelism in the execution.

What is Kokkos?

Kokkos (see https://kokkos.org/core-about/) is a programming model for parallel algorithms. It provides support for various target architectures, including CUDA. It thus offers the promise of writing an algorithm once and running it on NVIDIA, AMD, and Intel GPUs (as well as on multi-core CPUs, through OpenMP threading). This portability typically comes at some cost in efficiency. In my experience, Kokkos can often achieve about 85% of the efficiency of well-tuned CUDA/C++ on NVIDIA hardware.

The portability of the algorithms can be very important for our community. We do not often limit ourselves to running on a single type of hardware, much less a single machine (even one such as Perlmutter). We also do not typically have the resources to implement our code in several different programming systems (such as CUDA/C++ for NVIDIA, HIP for AMD, and oneAPI for Intel). If the performance of Kokkos code is sufficient for our needs, the advantage of the portability is extraordinarily valuable.

The Kokkos equivalent of the CUDA/C++ kernel

In CUDA/C++, a parallel algorithm is written in two parts:

  1. The kernel, which is essentially the body of a loop, and
  2. The call site for the kernel, which is essentially the parallel version of a for loop, providing the loop over the implied thread indices.

Using Kokkos, we write a C++ function Filter which, when called, is the equivalent of launching a kernel. The Kokkos template library and the Kokkos C++ runtime work together to make this work.

Coding using Kokkos makes wide use of closures, in C++ usually called lambda expressions. In order to provide the closures with the correct decorations (e.g. __device__ if the CUDA compilation target is chosen, but not if the OpenMP target is chosen), Kokkos provides the preprocessor macro KOKKOS_LAMBDA. This creates a function object, with the specified arguments, which also has access to copies of any local variables it needs.

A call to the resulting implementation of Filter below is the equivalent of launching the CUDA/C++ kernel.

void
Filter(ViewVectorDouble dRegionsError,
       ViewVectorInt unpolishedRegions,
       ViewVectorInt activeRegions,
       size_t numRegions,
       double errThreshold)
{
  auto do_filtering = KOKKOS_LAMBDA(int64_t const index) {
    double const selfErr = dRegionsError(index);
    int const factor = (selfErr > errThreshold);
    unpolishedRegions(index) = factor * activeRegions(index);
  };
  Kokkos::parallel_for("Filter", numRegions, do_filtering);
}

Note that the several view objects are copied into the closure. For this reason, views are made cheap to copy (they are essentially a pointer to the data, and an integer size).
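
To give a flavor of how such views are created and filled, here is a sketch. The type aliases mirror the names used in the function signature above but are my assumptions, not necessarily the library’s definitions, and numRegions and errThreshold are assumed to be in scope:

#include <Kokkos_Core.hpp>

using ViewVectorDouble = Kokkos::View<double*>;
using ViewVectorInt    = Kokkos::View<int*>;

// Views allocate in the default execution space's memory
// (device memory when the CUDA backend is enabled).
ViewVectorDouble dRegionsError("dRegionsError", numRegions);
ViewVectorInt    unpolishedRegions("unpolishedRegions", numRegions);
ViewVectorInt    activeRegions("activeRegions", numRegions);

// A host mirror gives a host-accessible view; deep_copy moves the data explicitly.
auto h_err = Kokkos::create_mirror_view(dRegionsError);
// ... fill h_err on the host ...
Kokkos::deep_copy(dRegionsError, h_err);

Filter(dRegionsError, unpolishedRegions, activeRegions, numRegions, errThreshold);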

But what determines the number of blocks and the number of threads used?

Kokkos has execution policies that determine how many “teams” (blocks, in CUDA) will be used, and how many threads are to be used for each team.

The parallel_for takes a policy argument; the default (used when just a count is given, as above) is a range policy. In this case, because no coordinated synchronization among threads is needed, the default policy is sufficient. In more complicated cases, another policy may be better.
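
Written with explicit policies, the same parallel_for could be expressed in either of the ways sketched below; the range policy form is equivalent to the call shown earlier, and numTeams in the team policy form is only a placeholder:

// A 1-D range policy over [0, numRegions): what the earlier call uses implicitly.
Kokkos::parallel_for("Filter",
                     Kokkos::RangePolicy<>(0, numRegions),
                     do_filtering);

// A team policy chooses the number of teams (blocks, in CUDA) and the threads per team,
// and hands each thread a member handle usable for intra-team synchronization.
Kokkos::parallel_for("FilterTeams",
                     Kokkos::TeamPolicy<>(numTeams, Kokkos::AUTO),
                     KOKKOS_LAMBDA(Kokkos::TeamPolicy<>::member_type const& member) {
                       // member.league_rank() identifies the team; member.team_rank() the thread.
                     });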

Conclusion

This presentation only provides a taste of what GPU programming using CUDA/C++ or Kokkos is like. In summary: programming for a GPU can provide great speed benefits but it does not come for free. Code that is to run on a GPU must be organized to take advantage of the massive parallelism available on the device.

There are many online sources of further information that can be found using your favorite search engine. Older tutorials tend to concentrate on C, which lacks the CUDA/C++ features that automate some of the work; you’ll need to augment such tutorials with your own good C++ practices and taste. Older tutorials also do not cover more recent additions to the CUDA environment (such as unified memory).

One good resource is the CUDA C++ Programming Guide from NVIDIA. NVIDIA does a generally praiseworthy job of keeping it up-to-date. Unfortunately, this is not a tutorial.

The Kokkos team provides tutorials periodically at a variety of locations; the materials from a recent one are available. I think it relevant to note that the Kokkos project is planning on moving to use of C++17 “soon”, and so you should not expect to use old C++ compilers with Kokkos.