Overview

This document describes how to reproduce a problem that occurs when multiple threads access the same SYCL queue.

Requirements

Build instructions

These steps produce an executable named peano4 in syclkernelthreadingissue/examples/exahype2/euler.

Systems such as DevCloud and Fedora 34

git clone https://gitlab.lrz.de/hpcsoftware/Peano.git --branch syclkernelthreadingissue --single-branch --depth 1 syclkernelthreadingissue

cd syclkernelthreadingissue

libtoolize && aclocal && autoconf && autoheader && cp src/config.h.in . && automake --add-missing

./configure CXX=dpcpp CXXFLAGS="-g3 -std=c++17" LDFLAGS="-g3 -ltbb" --enable-exahype --enable-loadbalancing --enable-particles --with-multithreading=sycl --enable-blockstructured --enable-gpu
make -j 20

cd examples/exahype2/euler
export PYTHONPATH=$PWD/../../../python:$PYTHONPATH
python3  example-scripts/finitevolumes.py -cs 0.1 -f -t global-fixed-enclave --gpu -j 20 -d3

CentOS systems

On some CentOS systems, autotools does not find the C preprocessor (cpp) correctly. In addition, the include and library paths for TBB have to be added explicitly. Please check that the values of CXXCPP, CXXFLAGS and LDFLAGS below match your installation.

git clone https://gitlab.lrz.de/hpcsoftware/Peano.git --branch syclkernelthreadingissue --single-branch --depth 1 syclkernelthreadingissue

cd syclkernelthreadingissue

libtoolize && aclocal && autoconf && autoheader && cp src/config.h.in . && automake --add-missing

./configure CXX=dpcpp CXXCPP=/opt/rh/gcc-toolset-10/root/usr/bin/cpp CXXFLAGS="-g3 -std=c++17 -I/opt/intel/oneapi/tbb/2021.5.0/include" LDFLAGS="-g3 -L/opt/intel/oneapi/tbb/2021.5.0/lib/intel64/gcc4.8 -ltbb" --enable-exahype --enable-loadbalancing --enable-particles --with-multithreading=sycl --enable-blockstructured --enable-gpu
make -j 20

cd examples/exahype2/euler
export PYTHONPATH=$PWD/../../../python:$PYTHONPATH
python3  example-scripts/finitevolumes.py -cs 0.1 -f -t global-fixed-enclave --gpu -j 20 -d3

Error description

The program runs correctly with 1 thread. As soon as the number of threads is greater than 1, its behaviour becomes nondeterministic: we invariably observe segmentation faults and deadlock/starvation effects.

Reproducing the working program

SYCL_DEVICE_FILTER=*:gpu ./peano4 --threading-model fuse-immediately-20 --threads 1

Reproducing the issue

SYCL_DEVICE_FILTER=*:gpu ./peano4 --threading-model fuse-immediately-20 --threads 2
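
Because the failure is nondeterministic, it can help to repeat the run several times and tally the outcomes. The following is a minimal sketch, not part of the Peano repository: the helper name run_repeatedly, the number of runs and the time limit are assumptions chosen for illustration, and it relies on the timeout utility from GNU coreutils to turn a hang into a distinguishable exit status.

```shell
# Hypothetical helper (not part of Peano): run a command several times and
# count clean exits, hangs, and other failures such as crashes.
# Usage: run_repeatedly RUNS LIMIT_SECONDS COMMAND [ARGS...]
# The body runs in a subshell, so its variables do not leak into the caller.
run_repeatedly() (
  runs=$1
  limit=$2
  shift 2
  ok=0
  hung=0
  failed=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    i=$((i + 1))
    status=0
    # timeout exits with status 124 if the time limit is reached.
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case $status in
      0) ok=$((ok + 1)) ;;
      124) hung=$((hung + 1)) ;;
      *) failed=$((failed + 1)) ;;
    esac
  done
  echo "ok=$ok hung=$hung failed=$failed"
)

# Example call (10 runs and a 300 second limit are assumed values):
#   run_repeatedly 10 300 env SYCL_DEVICE_FILTER='*:gpu' \
#     ./peano4 --threading-model fuse-immediately-20 --threads 2
```

The time limit has to be longer than a successful single-threaded run on the same machine, otherwise slow but correct runs are counted as hangs.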

Remarks

The relevant code is in the file syclkernelthreadingissue/examples/exahype2/euler/tasks/EulerOnGPUEnclaveTask.cpp

The issue disappears if we guard lines 611-787 with a mutex, i.e. all the code that accesses the queue.

The queue itself is of type

static sycl::queue thisQueue(default_selector{});

Further issue

When CPU devices are chosen through the device filter, the program terminates unexpectedly.

SYCL_DEVICE_FILTER=*:cpu ./peano4 --threading-model fuse-immediately-20 --threads 1

terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (CL_DEVICE_NOT_FOUND)
Aborted
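
To check whether the runtime sees a CPU device at all, the devices can be listed with the sycl-ls tool that ships with the oneAPI DPC++ toolchain. The sketch below is an illustration only: the helper name list_sycl_devices is hypothetical, and it assumes sycl-ls is on the PATH once the oneAPI environment is loaded.

```shell
# Hypothetical helper: print the devices the SYCL runtime can see.
list_sycl_devices() {
  if command -v sycl-ls >/dev/null 2>&1; then
    # Run in a subshell with the filter unset so that all backends are listed.
    (unset SYCL_DEVICE_FILTER; sycl-ls) || echo "sycl-ls failed"
  else
    echo "sycl-ls not found; is the oneAPI environment loaded?"
  fi
}
```

If the listing shows no CPU entry, the error above would be consistent with a missing CPU runtime rather than with a problem in the application itself.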