Overview

This document describes how to access the NCC GPU cluster and how to compile LLVM for OpenMP and SYCL offloading to the Nvidia (CUDA) devices present.

NCC cluster

Connection

To get access, send an email to Rob Powell. The cluster is only accessible from within the university network:

ssh CISUSERNAME@ncc1.clients.dur.ac.uk

When working from home, I tend to ssh into Hamilton first and then into NCC; the ssh config sketch below automates the two hops.
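
If you do this regularly, an OpenSSH ProxyJump entry in ~/.ssh/config saves the double login. A sketch, assuming Hamilton's login node is reachable as hamilton.dur.ac.uk (substitute the actual hostname and your username):

Host ncc
    HostName ncc1.clients.dur.ac.uk
    User CISUSERNAME
    ProxyJump CISUSERNAME@hamilton.dur.ac.uk

After that, ssh ncc goes through Hamilton transparently.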

Nodes

From the login node, you can ssh into the compute nodes:

  • cpu1, cpu2, cpu3, cpu4: 16-core Xeon, 64 GB RAM
  • cpu5: 6-core Xeon, 64 GB RAM
  • cpu6: 56-core Xeon, 512 GB RAM

Clearly, cpu6 is the machine to compile on.

The GPU nodes are:

  • gpu1: 2× TITAN X (Pascal), compute capability 6.1
  • gpu2: 4× TITAN Xp, compute capability 6.1
  • gpu3: 8× TITAN Xp, compute capability 6.1
  • gpu4, gpu5, gpu6: 8× GeForce RTX 2080 Ti, compute capability 7.5
  • gpu7, gpu8: 4× TITAN RTX, compute capability 7.5
  • gpu-toby0: 3× RTX A6000, compute capability 8.6, plus 1× RTX 8000, compute capability 7.5

The compute capabilities are important when offloading with OpenMP, since they determine the sm_XX architecture values to build and target.
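
The capability translates directly into the sm_XX architecture passed to the compiler later on (e.g. sm_75 for the TITAN RTX nodes). On a GPU node, a reasonably recent nvidia-smi can also report it directly; older drivers may not support the compute_cap query:

nvidia-smi --query-gpu=name,compute_cap --format=csv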

LLVM + SYCL + OpenMP

Here I describe how to compile LLVM so that SYCL and OpenMP offload applications can run on CUDA devices.

Prerequisites

We will need GCC >= 8.2, ninja, cmake, and CUDA.
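
A quick way to verify that the toolchain is available (the gcc-10 name and CUDA path match what is used below):

gcc-10 --version
cmake --version
ninja --version
/apps/cuda/cuda-11.4/bin/nvcc --version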

First step: compile LLVM + CUDA/SYCL

Instead of the upstream LLVM repo, we need to pull from Intel's llvm repo like so:

git clone https://github.com/intel/llvm -b sycl --single-branch --depth 1

This branch comes with a helper script (llvm/buildbot/configure.py) that makes life quite a bit easier:

CUDA_LIB_PATH=/apps/cuda/cuda-11.4/lib64/stubs CC=gcc-10 CXX=g++-10 python llvm/buildbot/configure.py --cuda --cmake-opt="-DCUDA_TOOLKIT_ROOT_DIR=/apps/cuda/cuda-11.4"

Note that CC= and CXX= select specific GNU compilers available on the system; the remaining variables and options are needed so that the build correctly locates the CUDA installation. This script will run to completion on, e.g., cpu6. To compile, we can use:

CUDA_LIB_PATH=/apps/cuda/cuda-11.4/lib64/stubs CC=gcc-10 CXX=g++-10 python llvm/buildbot/compile.py

This will compile llvm/clang with SYCL + CUDA support. The libraries and binaries end up in llvm/build/lib and llvm/build/bin, respectively.

Note that a subset is also copied into llvm/build/install, but that is not the right path to use when compiling OpenMP later on.

To make use of the new compiler, set the following:

export PATH=BASEDIR/llvm/build/bin:${PATH}
export LD_LIBRARY_PATH=BASEDIR/llvm/build/lib:${LD_LIBRARY_PATH}

where BASEDIR is the directory from which the git clone command was issued.
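
Since these exports are needed on every node you log into, it may be convenient to put them into a small script (env.sh is my naming, not part of the build) and source it:

# BASEDIR/env.sh
export PATH=BASEDIR/llvm/build/bin:${PATH}
export LD_LIBRARY_PATH=BASEDIR/llvm/build/lib:${LD_LIBRARY_PATH}

Then, on any node: source BASEDIR/env.sh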

To check that things are OK, try, e.g.:

ssh gpu7
export PATH=BASEDIR/llvm/build/bin:${PATH}
export LD_LIBRARY_PATH=BASEDIR/llvm/build/lib:${LD_LIBRARY_PATH}
sycl-ls

The output should look similar to this:

[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA TITAN RTX 0.0 [CUDA 11.6]
[ext_oneapi_cuda:gpu:1] NVIDIA CUDA BACKEND, NVIDIA TITAN RTX 0.0 [CUDA 11.6]
[ext_oneapi_cuda:gpu:2] NVIDIA CUDA BACKEND, NVIDIA TITAN RTX 0.0 [CUDA 11.6]
[ext_oneapi_cuda:gpu:3] NVIDIA CUDA BACKEND, NVIDIA TITAN RTX 0.0 [CUDA 11.6]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]
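
Beyond sycl-ls, a quick smoke test is to compile a small SYCL source (the test program at the end of this document will do) for the CUDA backend and run it on a GPU node. A sketch; note that the spelling of the target triple has varied across intel/llvm revisions (newer trees expect nvptx64-nvidia-cuda):

clang++ -std=c++17 -fsycl -fsycl-targets=nvptx64-cuda test.cxx -o test
./test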

Second step: compile OpenMP + CUDA with the new clang

For this to work, it is essential that llvm-lit is present in PATH. The steps to build OpenMP with CUDA offloading are:

ssh cpu6
export PATH=BASEDIR/llvm/build/bin:${PATH}
export LD_LIBRARY_PATH=BASEDIR/llvm/build/lib:${LD_LIBRARY_PATH}

cd llvm
mkdir buildopenmp
cd buildopenmp

CUDA_LIB_PATH=/apps/cuda/cuda-11.4/lib64/stubs cmake \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES="61;70;75;80;86" \
  -DCUDA_TOOLKIT_ROOT_DIR=/apps/cuda/cuda-11.4 \
  -DLIBOMPTARGET_ENABLE_DEBUG=YES \
  -DCMAKE_C_COMPILER=BASEDIR/llvm/build/install/bin/clang \
  -DCMAKE_CXX_COMPILER=BASEDIR/llvm/build/install/bin/clang++ \
  -DCMAKE_INSTALL_PREFIX=$PWD/local \
  -GNinja ../openmp

ninja
ninja install

The crucial option is the list of compute capabilities.

Note: there used to be a cmake flag to set the default capability (-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_61), but it is no longer available.

The default is sm_35, which will not work with any of the NCC devices. That is fine, though: explicitly specifying the desired capability is not hard and probably less error prone.

You can check that device bitcode libraries were built for all the specified architectures:

ls libomptarget/ | grep nvptx
libomptarget-nvptx-sm_61.bc
libomptarget-nvptx-sm_70.bc
libomptarget-nvptx-sm_75.bc
libomptarget-nvptx-sm_80.bc
libomptarget-nvptx-sm_86.bc

To make use of OpenMP offloading, add local/lib to the LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=BASEDIR/llvm/buildopenmp/local/lib:${LD_LIBRARY_PATH}
clang++ -fopenmp -fopenmp-targets=nvptx64 -Xopenmp-target -march=sm_75 \
  --cuda-path=/apps/cuda/cuda-11.4 \
  --libomptarget-nvptx-bc-path=BASEDIR/llvm/buildopenmp/local/lib \
  -IBASEDIR/llvm/buildopenmp/local/include \
  -LBASEDIR/llvm/buildopenmp/local/lib -lomptarget \
  test_omp.cxx
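
test_omp.cxx can be any small offload test; a minimal sketch that reports whether the target region actually ran on the device:

// test_omp.cxx -- minimal OpenMP offload check (a sketch)
#include <cstdio>
#include <omp.h>

int main()
{
    int on_device = 0;
    // omp_is_initial_device() returns false when the region runs on the GPU
    #pragma omp target map(from: on_device)
    {
        on_device = !omp_is_initial_device();
    }
    std::printf("Ran on %s\n", on_device ? "the device" : "the host (offload failed)");
    return 0;
}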

Mixing OpenMP and SYCL offloading

In short: this only works with named kernels, for reasons that are not entirely clear; see https://github.com/intel/llvm/issues/5804

A working invocation looks like this:

clang++ -std=c++17 test.cxx -fsycl -fsycl-targets=nvptx64-cuda -fopenmp \
  -L/home2/gcgt96/sycl_workspace/buildopenmp/lib -lomp -lomptarget \
  -fopenmp-targets=nvptx64-cuda -Xopenmp-target -march=sm_75 \
  --libomptarget-nvptx-bc-path=/home2/gcgt96/sycl_workspace/buildopenmp/lib \
  -I/home2/gcgt96/sycl_workspace/syclcudagcc/include/sycl \
  -I/home2/gcgt96/sycl_workspace/syclcudagcc/include

Here is a test program:

#include <iostream>

#include <CL/sycl.hpp>
using namespace sycl;
static queue Q(default_selector{});

int main()
{
    std::cout << "  Using SYCL device: "
              << Q.get_device().get_info<sycl::info::device::name>() << std::endl;

    // Shared USM allocation, accessible from both host and device
    int *data = malloc_shared<int>(3, Q);
    data[0] = 1;
    data[1] = 2;
    data[2] = 0;

    Q.submit([&](handler &cgh)
    {
#ifdef NONAME
        // Unnamed kernel: triggers the problem from intel/llvm issue 5804
        cgh.single_task(                   [=](){ data[2] = data[0] + data[1]; } );
#else
        // Named kernel: works when mixing SYCL and OpenMP offloading
        cgh.single_task<class simple_sum>( [=](){ data[2] = data[0] + data[1]; } );
#endif
    }).wait();

    std::cout << "Result: " << data[2] << "\n";
    sycl::free(data, Q);  // release the USM allocation
    return 0;
}
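
To see both behaviours, compile the program twice: the named-kernel variant (the default, built with the invocation above) should print Result: 3, while adding -DNONAME selects the unnamed kernel and reproduces the problem from the issue linked above:

clang++ -std=c++17 -DNONAME test.cxx -fsycl -fsycl-targets=nvptx64-cuda -fopenmp \
  -L/home2/gcgt96/sycl_workspace/buildopenmp/lib -lomp -lomptarget \
  -fopenmp-targets=nvptx64-cuda -Xopenmp-target -march=sm_75 \
  --libomptarget-nvptx-bc-path=/home2/gcgt96/sycl_workspace/buildopenmp/lib \
  -I/home2/gcgt96/sycl_workspace/syclcudagcc/include/sycl \
  -I/home2/gcgt96/sycl_workspace/syclcudagcc/include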