Introduction
Goal of this presentation
The goal of this presentation is to give the flavor of GPU programming, as opposed to both CPU (“normal”) programming and the use of GPU-accelerated libraries or tools (such as are common in the machine learning field). It will not be sufficient for you to go out and start programming. I hope it is sufficient to help you decide whether you are interested in further study of GPU programming to support your own work.
What is a GPU?
GPU stands for graphics processing unit. Usually when we say GPU we mean general purpose GPU – one intended for doing math, not for rendering graphics. But we rarely say GPGPU.
Almost all modern supercomputers get most of their processing power from GPUs. Perlmutter, for example, is (at the time of this writing) the 5th fastest supercomputer in the world. (See https://top500.org for the current list.) Perlmutter Phase 1 has 1536 compute nodes. These are the nodes on which you’d run jobs, as opposed to login nodes, which are used for things like compiling your code. Each (compute) node has 1 AMD EPYC 7763 (Milan) CPU, with 64 cores. The cores are largely independent, but they share resources like some memory caches. Each of these cores can run 2 threads simultaneously. These threads can do different work, but they share some hardware resources and so are not totally independent.
Each (compute) node also has 4 NVIDIA Ampere A100 GPUs. Each GPU has 108 streaming multiprocessors (SMs). The SMs are largely independent. Each SM has:
- 32 64-bit floating point units
- 64 32-bit floating point units
- 64 32-bit integer units
- 4 tensor cores
The multiple units are not independent. A team of threads
running on any of these units must perform the same operation
on different data. This is the single instruction, multiple
threads (SIMT) execution model. Note that we’ll come back to this
when we look at if statements in CUDA/C++.
Just concentrating on the 64-bit floating point speed, the CPU is capable of about 2.5 TFLOPS, and each GPU is just below 10 TFLOPS (about 39 TFLOPS for all 4 on a node). About 94% of the node's peak FLOPS come from the GPUs, and 6% from the CPU. So to use a node well, you must be doing most calculations on the GPUs.
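To see where those percentages come from, using the approximate figures above: the node total is about 2.5 + 39 ≈ 41.5 TFLOPS, and 39 / 41.5 ≈ 0.94, so roughly 94% of the peak double-precision rate comes from the GPUs.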
Different strategies for programming CPUs and GPUs
A modern CPU is organized to make single streams of instructions run
quickly. There are multi-level memory caches to hide memory latency. To
achieve very high performance code must be concerned with latency, but
much code is not. Instruction execution is very complicated,
including both speculative execution and out-of-order execution. These both help support efficient execution of complex
logic, for example by starting execution of the body of an
if statement before knowing whether the condition of the
if statement is true. Programming languages like Python
expect this kind of efficiency.
A GPU is organized to run large numbers of threads each doing the
same operation on different data. The code to be
executed on a GPU must be parallel code – and parallel in the
manner that GPUs support, which means data parallelism. This often means that we are doing array-based programming.
Constructions like if statements can be very damaging to
efficiency. Languages and libraries made for programming GPUs are
designed to deal with this. I’ll be showing examples of two: CUDA/C++
(an extension of the C++ language) and Kokkos (a C++ library).
Code written to execute on GPUs must be organized very differently than CPU code. First we’ll look at CUDA/C++, and then we’ll compare some of that to Kokkos.
CUDA/C++ programming basics
First we need to introduce some terminology: host and device. The host is the CPU and its memory, called host memory. The device is the GPU and its memory (device memory, of course). In order to be used in calculations data must be moved from host memory to device memory. And eventually, the results of calculations need to be copied back from the device memory to host memory.
The compiler that is used for CUDA/C++ is nvcc. It can
be used to compile a program with no device code:
#include <cstdio>
void greeting() {
std::printf("Hello world\n");
}
int main() {
greeting();
}

$ nvcc -o helloworld helloworld.cu
$ ./helloworld
Hello world
$

Code that will be called from the host but will run on the device
is called a kernel. It is marked specially with
__global__ to tell nvcc to generate the
necessary code for that to work. A kernel gets called (or
launched) from host code; <<< and
>>> are used in the host code to tell
nvcc how this is to be done. These are the first few
differences you’ll note between C++ and CUDA/C++, which is a proprietary
and special-purpose extension of C++.
#include <cstdio>
#include <cuda.h>
__global__
void greeting() {
unsigned const threadID = blockIdx.x * blockDim.x + threadIdx.x;
std::printf("Hello world from thread %u\n", threadID);
}
int main() {
int constexpr nb = 2;  // number of blocks
int constexpr tpb = 3; // threads per block
greeting<<<nb, tpb>>>(); // launch the kernel
cudaDeviceSynchronize(); // wait until all are done
}

This code also shows some additional extensions to C++ that are part
of CUDA/C++: blockIdx and threadIdx are two of
the special pre-defined variables in CUDA/C++. We’ll get to their
importance soon.
$ nvcc -std=c++17 -o helloworld helloworld.cu
$ ./helloworld
Hello world from thread 0
Hello world from thread 1
Hello world from thread 2
Hello world from thread 3
Hello world from thread 4
Hello world from thread 5
# The order of the numbering is not guaranteed.

With nb = 2 and tpb = 3, blockDim.x is 3, so block 0 produces thread IDs 0, 1, and 2, while block 1 produces 3, 4, and 5; the six lines of output come from these six threads, in whatever order the runtime schedules them.

But this is a silly use of a GPU: this is not massive parallelism. Let's look at some CUDA/C++ that does actual work.
What real CUDA/C++ code looks like
Example from a numerical integration library
My examples come from a multidimensional numerical integration library developed in collaboration with a team from Old Dominion University, in a project supported by a Fermilab LDRD grant. The code is available at https://github.com/marcpaterno/gpuintegration.
This library contains two integration algorithms as well as a handful of utilities intended to help users implement integrands.
A class for interpolation in 1 dimension
The simplest example is the class quad::Interp1D which
is used for 1-dimensional interpolation. We’ll look at parts of it, to
see how CUDA/C++ is related to C++.
CUDA/C++ can have classes, and inheritance. This looks like normal C++:
class Interp1D : public Managed {
size_t _cols = 0;
double* _xs = nullptr;
double* _zs = nullptr;
// more code elided...
};

This class uses the arrays _xs and _zs
to do calculations on the device (the GPU). But the object
itself will exist in host memory. That is, the
this pointer for the object will be a location in host
memory, not a location in device memory. Let's see how this is done.
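First, an aside about the Managed base class that appears in the class declaration above but is not shown. A common implementation of this idiom (this is a sketch of the usual pattern, not necessarily this library's exact code) overloads operator new and operator delete so that objects created with new are placed in CUDA unified memory, which both the host and the device can access:

#include <cuda_runtime.h>
#include <cstddef>

// Sketch of a typical "Managed" base class (assumed, not copied from the library).
// Classes that inherit from it and are allocated with `new` get their storage
// from CUDA unified memory instead of ordinary host memory.
class Managed {
public:
  void* operator new(std::size_t len)
  {
    void* ptr = nullptr;
    cudaMallocManaged(&ptr, len);
    cudaDeviceSynchronize(); // let the host safely touch the new allocation
    return ptr;
  }

  void operator delete(void* ptr)
  {
    cudaDeviceSynchronize(); // make sure no kernel is still using the memory
    cudaFree(ptr);
  }
};

Objects constructed on the host stack, as in the constructor example below, still live in host memory; the base class only changes allocations made with new.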
A constructor
Here’s a constructor template that provides data (in host memory) to
initialize an Interp1D object:
template <size_t M>
quad::Interp1D::Interp1D(std::array<double, M> const& xs,
std::array<double, M> const& zs)
: _cols(M)
{
_initialize(xs.data(), zs.data());
}

The data in the arrays xs and zs have to be
made available on the device. So far, this looks like normal C++. The
member function _initialize is where the first bit of CUDA
specialization arises.
CUDA unified memory
void
quad::Interp1D::_initialize(double const* x, double const* z)
{
size_t const nbytes = sizeof(double) * _cols;
cudaMallocManaged(&_xs, nbytes);
memcpy(_xs, x, nbytes);
cudaMallocManaged(&_zs, nbytes);
memcpy(_zs, z, nbytes);
}

CUDA unified memory is a part of the CUDA runtime that
automatically moves data from host to device, or from device to host, as
needed. The runtime tries to choose when this is done to provide maximum
efficiency. The _initialize function allocates the unified
memory, and copies the data from host memory into the unified
memory.
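For contrast, without unified memory the data would have to be allocated in device memory explicitly and copied over with cudaMemcpy. A sketch of that more traditional style (the helper name copy_to_device is hypothetical, not taken from the library):

// Hypothetical helper (illustration only) showing the explicit style:
// allocate device memory and copy n doubles from host memory into it.
double*
copy_to_device(double const* host_data, size_t n)
{
  double* device_data = nullptr;
  size_t const nbytes = sizeof(double) * n;
  cudaMalloc(&device_data, nbytes);                                   // allocate device memory
  cudaMemcpy(device_data, host_data, nbytes, cudaMemcpyHostToDevice); // copy host -> device
  return device_data;
}

Memory allocated this way is usable only from device code; unified memory is what lets the same member pointers be used from both host and device, as the operator() shown below relies on.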
Because we have allocated memory using
cudaMallocManaged, we have to make sure to release it using
cudaFree:
inline quad::Interp1D::~Interp1D()
{
cudaFree(_zs);
cudaFree(_xs);
}

The details here are of course complex, but the general point is simple: our class has to handle the marshaling of data from the host to the device, in this case doing so by making use of CUDA unified memory.
Doing the interpolation
These data are used by the function call operator, which actually does the interpolation:
inline __device__ __host__ double
quad::Interp1D::operator()(double x) const
{
auto [x0_index, x1_index] = _find_neighbor_indices(x);
const double y0 = _zs[x0_index];
const double y1 = _zs[x1_index];
const double x0 = _xs[x0_index];
const double x1 = _xs[x1_index];
const double y = (y0 * (x1 - x) + y1 * (x - x0)) / (x1 - x0);
return y;
}

This (member) function is decorated with the execution space
specifiers __device__ and __host__. This
makes the CUDA/C++ compiler generate both CPU and GPU code for this
function. Otherwise, this member function looks like normal C++. If
called from the CPU, this function will use the class member data
(_xs and _zs) in host memory. If called from
the GPU, it will use the member data in device memory. This means that
this function is suitable for being called from many GPU threads at the
same time.
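For example, a kernel that evaluates the interpolator at many points might look like the following sketch. This is an illustration, not a kernel from the library, and it assumes the Interp1D object itself is accessible from the device (for example, because it was allocated in unified memory):

// Hypothetical kernel (not from the library): each thread evaluates the
// interpolator at one element of the input array.
__global__ void
evaluate_many(quad::Interp1D const* interp, double const* xs, double* results, size_t n)
{
  size_t const tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) {
    results[tid] = (*interp)(xs[tid]); // uses the __device__ path of operator()
  }
}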
Note the difference between a __global__ function (a kernel) and this sort of function. A __global__ function is called from the host but executes on the device; a __host__ __device__ function executes wherever its caller is executing.
A function like this operator() is, in a way, the
simplest kind of CUDA/C++ code to write: classes and functions that can
be used from either the CPU or the GPU. But something (a
kernel, as we saw before) must start up the GPU work. We’ll
look at one of them next.
A simple kernel
A kernel is launched with several (perhaps many) blocks of threads scheduled to execute the code. The hardware divides each block into teams of 32 threads, called warps, whose threads execute simultaneously and must be performing the same operations, but on different data; a single block might correspond to multiple warps. The CUDA runtime handles the scheduling, and may execute the blocks (and warps) in any order.
Here is an example from our numerical integration library. This kernel is used to filter sub-regions in the volume of integration to distinguish those regions that need more “polishing” from those that do not:
template <typename T>
__global__ void // kernels never have a return value
Filter(T const* dRegionsError,
int* unpolishedRegions,
int const* activeRegions,
size_t numRegions,
T errThreshold)
{
size_t const tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < numRegions) {
T const selfErr = dRegionsError[tid];
int const factor = (selfErr > errThreshold);
unpolishedRegions[tid] = factor * activeRegions[tid];
}
}

This is a very simple example, but it is illustrative. The important
feature is that each thread (identified by tid, for thread
ID) is working on a different array element. Threads in the same block
will have consecutive thread IDs, and will access adjacent elements of
the array. This is done to make the best use of memory caches on the
GPU.
This is illustrated in the figure below.
Mapping of threads to array entries.
Also notice how we use the automatic conversion from
bool to int to avoid an if
statement. This kind of construction is common in CUDA/C++ code. We want
to avoid conditional logic that could give different answers on
different threads.
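For contrast, the assignment inside the kernel could instead have been written with an explicit branch, as in the fragment below (an illustration, not the library's code). If some threads of a warp take one branch and others take the other, the warp executes both paths one after the other (warp divergence), which the arithmetic form above avoids.

// Branching version of the same update (illustration only).
if (selfErr > errThreshold) {
  unpolishedRegions[tid] = activeRegions[tid];
} else {
  unpolishedRegions[tid] = 0;
}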
Note also that nothing inside the kernel determines the number of threads that will execute the kernel. That is determined at the call site.
The importance of memory access patterns
Different levels of GPU memory (for example, global, shared and cache) have very different access latencies, and of course a faster memory type is always a more limited resource than a slower memory type. It is thus very important to organize code to make good use of the fastest memory available.
A good example of such organization is the one used for a parallel reduction algorithm. This algorithm applies a binary operation to reduce an array of data to a single value (in this case, a sum). The figure below shows an access pattern designed to:
- use as many threads to do work as the data will support,
- keep the intermediate data (e.g. the partial sums) as compact in memory as possible, and
- allow direct addressing of the necessary data, based only on knowing the number of threads in use.
A first step, not shown, would copy data (in chunks) from global memory to shared memory, allowing the thread team to work on the data as quickly as possible. Keeping the data compact is important to allow the best possible use of the most local cache memory.
Data access pattern for a parallel reduction algorithm. Source: NVIDIA
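A minimal sketch of a kernel implementing this pattern is shown below. It is an illustration, not code from the integration library; it assumes the block size is a power of two and leaves the final combination of the per-block sums to the caller.

// Sketch of a shared-memory tree reduction (illustration, not library code).
// Assumes blockDim.x is a power of two. Each block writes one partial sum.
__global__ void
block_sum(double const* input, double* block_sums, size_t n)
{
  extern __shared__ double sdata[]; // shared memory, sized at launch time

  size_t const tid = threadIdx.x;
  size_t const i = blockIdx.x * blockDim.x + threadIdx.x;

  // Step 1: copy this block's chunk from global memory into shared memory.
  sdata[tid] = (i < n) ? input[i] : 0.0;
  __syncthreads();

  // Step 2: tree reduction within the block, halving the active threads each pass.
  for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
    if (tid < s) {
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }

  // Thread 0 of each block writes that block's partial sum back to global memory.
  if (tid == 0) {
    block_sums[blockIdx.x] = sdata[0];
  }
}

A launch supplies the size of the shared memory array as the third launch parameter, for example block_sum<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(double)>>>(input, block_sums, n).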
Calling the Filter kernel
This kernel is called from a host-only function
HSClassify, which is too complex to show here. The call to
Filter looks like:
Filter<<<numBlocks, numThreads>>>(dRegionsError,
unpolishedRegions, // this is the output parameter
activeRegions,
numRegions,
ErrThreshold);

While numThreads is a compile-time constant in this
function, numBlocks is calculated on each call. Although the <<< … >>> notation that decorates a kernel launch looks like C++ template notation, it is not: the parameters provided do not need to be compile-time constants.
It is the launch of the kernel, rather than the implementation of the kernel, that determines the degree of parallelism in the execution.
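The usual way to compute numBlocks (a sketch of the common idiom, not the exact code of HSClassify) is to round up so that numBlocks * numThreads covers all the regions:

// Common idiom for choosing the grid size (illustrative): round up so that
// numBlocks * numThreads >= numRegions, i.e. every region gets a thread.
int constexpr numThreads = 256; // threads per block; a typical choice
int const numBlocks =
  static_cast<int>((numRegions + numThreads - 1) / numThreads);

The guard if (tid < numRegions) inside Filter is what makes the extra threads in the last block harmless.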
What is Kokkos?
Kokkos (see https://kokkos.org/core-about/) is a programming model for parallel algorithms. It provides support for various target architectures, including CUDA. It thus offers the promise of writing algorithms once and having them run on NVIDIA, AMD and Intel GPUs (as well as on multi-core CPUs, through OpenMP threading). This will typically come at some cost in efficiency. In my experience, Kokkos can often achieve about 85% of the efficiency of well-tuned CUDA/C++ on NVIDIA hardware.
The portability of the algorithms can be very important for our community. We do not often limit ourselves to running on a single type of hardware, much less a single machine (even one such as Perlmutter). We also do not typically have the resources to implement our code in several different programming systems (such as CUDA/C++ for NVIDIA, HIP for AMD and oneAPI for Intel). If the performance of Kokkos code is sufficient for our needs, the portability is extraordinarily valuable.
The Kokkos equivalent of the CUDA/C++ kernel
In CUDA/C++, a parallel algorithm is written in two parts:
- The kernel, which is essentially the body of a loop, and
- The call site for the kernel, which is essentially the parallel version of a for loop, providing the loop over the implied thread indices.
Using Kokkos, we write a C++ function Filter which, when
called, is the equivalent of launching a kernel. The Kokkos template
library and the Kokkos C++ runtime work together to make this work.
Coding using Kokkos makes wide use of closures, in C++
usually called lambda expressions. In order to provide the
closures with the correct decorations (e.g. __device__ if
the CUDA compilation target is chosen, but not if the OpenMP target is
chosen), Kokkos provides the preprocessor macro
KOKKOS_LAMBDA. This creates a function object, with the
specified arguments, which also has access to copies of any local
variables it needs.
A call to the resulting implementation of Filter below
is the equivalent of launching the CUDA/C++ kernel.
void
Filter(ViewVectorDouble dRegionsError,
ViewVectorInt unpolishedRegions,
ViewVectorInt activeRegions,
size_t numRegions,
double errThreshold)
{
auto do_filtering = KOKKOS_LAMBDA(int64_t const index) {
double const selfErr = dRegionsError(index);
int const factor = (selfErr > errThreshold);
unpolishedRegions(index) = factor * activeRegions(index);
};
Kokkos::parallel_for("Filter", numRegions, do_filtering);
}

Note that the view objects are copied into the closure. For this reason, views are made cheap to copy (they are essentially a pointer to the data and an integer size).
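The view types used in Filter are not defined in this excerpt; presumably they are aliases for one-dimensional Kokkos views, something like the following (an assumption, not the library's exact definitions):

#include <Kokkos_Core.hpp>

// Assumed definitions of the view aliases used above (illustrative).
// By default a Kokkos::View's memory lives in the memory space of the
// default execution space (device memory in a CUDA build).
using ViewVectorDouble = Kokkos::View<double*>;
using ViewVectorInt    = Kokkos::View<int*>;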
But what determines the number of blocks and the number of threads used?
Kokkos has execution policies that determine how many “teams” (blocks, in CUDA) will be used, and how many threads are to be used for each “team”.
The parallel_for takes a policy argument, with a default that is a range policy. In this case, because there is no
organized synchronization needed, the default policy is sufficient. In
more complicated cases, another policy may be better.
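Passing the integer numRegions to parallel_for, as Filter does above, is shorthand for the default range policy. Written out explicitly (using the standard Kokkos API), the same launch looks like this:

// Equivalent launch with an explicit range policy covering indices 0 .. numRegions-1.
Kokkos::parallel_for("Filter",
                     Kokkos::RangePolicy<>(0, numRegions),
                     do_filtering);

When more control is needed, Kokkos::TeamPolicy lets you choose the number of teams and the team size explicitly, which corresponds to choosing the number of blocks and threads per block in CUDA.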
Conclusion
This presentation only provides a taste of what GPU programming using CUDA/C++ or Kokkos is like. In summary: programming for a GPU can provide great speed benefits but it does not come for free. Code that is to run on a GPU must be organized to take advantage of the massive parallelism available on the device.
There are many online sources of further information that can be found using your favorite search engine. Older tutorials tend to concentrate on C, which lacks the C++ features that automate some of the work; you'll need to augment such tutorials with your own good C++ practices and taste. Older tutorials also do not cover more recent additions to the CUDA environment (such as unified memory).
One good resource is the CUDA C++ Programming Guide from NVIDIA. NVIDIA does a generally praiseworthy job of keeping it up-to-date. Unfortunately, this is not a tutorial.
The Kokkos team provides tutorials periodically at a variety of locations. The materials from a recent one are available. I think it relevant to note that the Kokkos project is planning to move to C++17 “soon”, so you should not expect to use old C++ compilers with Kokkos.