This document describes how to reproduce a problem that occurs when multiple threads access the same SYCL queue.
On DevCloud, the following should be sufficient as a prerequisite:
pip install jinja2
The following steps build an executable peano4 in syclkernelthreadingissue/examples/exahype2/euler.
git clone --branch syclkernelthreadingissue --single-branch --depth 1 https://gitlab.lrz.de/hpcsoftware/Peano.git syclkernelthreadingissue
cd syclkernelthreadingissue
libtoolize && aclocal && autoconf && autoheader && cp src/config.h.in . && automake --add-missing
./configure CXX=dpcpp CXXFLAGS="-g3 -std=c++17" LDFLAGS="-g3 -ltbb" --enable-exahype --enable-loadbalancing --enable-particles --with-multithreading=sycl --enable-blockstructured --enable-gpu
make -j 20
cd examples/exahype2/euler
export PYTHONPATH=$PWD/../../../python:$PYTHONPATH
python3 example-scripts/finitevolumes.py -cs 0.1 -f -t global-fixed-enclave --gpu -j 20 -d3
On some CentOS systems, we noticed that autotools does not find the preprocessor (cpp) correctly. In addition, the include and library paths for TBB have to be added explicitly. Please check CXXCPP, CXXFLAGS and LDFLAGS in the configure call below and adapt them to your installation.
git clone --branch syclkernelthreadingissue --single-branch --depth 1 https://gitlab.lrz.de/hpcsoftware/Peano.git syclkernelthreadingissue
cd syclkernelthreadingissue
libtoolize && aclocal && autoconf && autoheader && cp src/config.h.in . && automake --add-missing
./configure CXX=dpcpp CXXCPP=/opt/rh/gcc-toolset-10/root/usr/bin/cpp CXXFLAGS="-g3 -std=c++17 -I/opt/intel/oneapi/tbb/2021.5.0/include" LDFLAGS="-g3 -L/opt/intel/oneapi/tbb/2021.5.0/lib/intel64/gcc4.8 -ltbb" --enable-exahype --enable-loadbalancing --enable-particles --with-multithreading=sycl --enable-blockstructured --enable-gpu
make -j 20
cd examples/exahype2/euler
export PYTHONPATH=$PWD/../../../python:$PYTHONPATH
python3 example-scripts/finitevolumes.py -cs 0.1 -f -t global-fixed-enclave --gpu -j 20 -d3
Everything works as expected when running with 1 thread. As soon as more than 1 thread is used, the program behaviour becomes nondeterministic: we invariably observe segmentation faults and deadlock/starvation effects. Compare the following two runs:
SYCL_DEVICE_FILTER=*:gpu ./peano4 --threading-model fuse-immediately-20 --threads 1
SYCL_DEVICE_FILTER=*:gpu ./peano4 --threading-model fuse-immediately-20 --threads 2
The relevant code is in the file syclkernelthreadingissue/examples/exahype2/euler/tasks/EulerOnGPUEnclaveTask.cpp.
The issue disappears if we protect lines 611-787 of this file, i.e. all the code that accesses the queue, with a mutex.
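The mutex workaround can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in EulerOnGPUEnclaveTask.cpp: StandInQueue is a hypothetical placeholder for sycl::queue (so the sketch compiles with a plain C++17 compiler), and submitUnderLock and runWorkers are hypothetical helpers, where submitUnderLock stands for the region in lines 611-787. A single std::mutex shared by all threads ensures that at most one thread touches the queue at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for sycl::queue. It is deliberately NOT
// thread-safe: submit() mutates a std::vector without synchronisation.
struct StandInQueue {
  std::vector<int> submitted;
  void submit(int taskId) { submitted.push_back(taskId); }
};

// One queue and one mutex shared by all threads, mirroring the static
// queue described in this report.
static StandInQueue thisQueue;
static std::mutex queueMutex;

// Hypothetical helper: every queue access happens while the lock is held,
// so at most one thread touches the queue at any time.
void submitUnderLock(int taskId) {
  std::lock_guard<std::mutex> lock(queueMutex);
  thisQueue.submit(taskId);
}

// Hypothetical driver: starts numThreads workers that each submit
// tasksPerThread tasks, then returns how many submissions were recorded.
std::size_t runWorkers(int numThreads, int tasksPerThread) {
  assert(numThreads > 0 && tasksPerThread >= 0);
  {
    std::lock_guard<std::mutex> lock(queueMutex);
    thisQueue.submitted.clear();
  }
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t) {
    workers.emplace_back([t, tasksPerThread] {
      for (int i = 0; i < tasksPerThread; ++i) {
        submitUnderLock(t * tasksPerThread + i);
      }
    });
  }
  for (auto& w : workers) w.join();
  std::lock_guard<std::mutex> lock(queueMutex);
  return thisQueue.submitted.size();
}
```

For example, runWorkers(4, 1000) returns 4000. Without the lock_guard in submitUnderLock, the concurrent push_back calls would be a data race, i.e. undefined behaviour. The lock serialises all host-side interaction with the queue, so it trades concurrency on the submission path for correctness.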
The queue itself is declared as
static sycl::queue thisQueue(default_selector{});
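Because the queue is declared static, all threads operate on one and the same queue object. Assuming the declaration sits inside a function (the report does not say), C++11 and later also guarantee that its initialisation runs exactly once, even if several threads reach it concurrently; the construction itself is therefore not racy, only the subsequent concurrent use. The sketch below checks this with a hypothetical CountingQueue in place of sycl::queue and a hypothetical accessor getQueue().

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Counts how often the stand-in queue is constructed.
std::atomic<int> constructions{0};

// Hypothetical stand-in for sycl::queue.
struct CountingQueue {
  CountingQueue() { ++constructions; }
};

// Hypothetical accessor holding a function-local static, analogous to
// `static sycl::queue thisQueue(default_selector{});`.
CountingQueue& getQueue() {
  static CountingQueue thisQueue;  // initialised exactly once (C++11)
  return thisQueue;
}

// Lets numThreads threads fetch the queue concurrently and returns the
// number of distinct queue addresses they observed.
std::size_t distinctQueuesSeen(int numThreads) {
  assert(numThreads > 0);
  std::mutex m;
  std::set<const CountingQueue*> seen;
  std::vector<std::thread> threads;
  for (int t = 0; t < numThreads; ++t) {
    threads.emplace_back([&] {
      const CountingQueue* q = &getQueue();
      std::lock_guard<std::mutex> lock(m);
      seen.insert(q);
    });
  }
  for (auto& th : threads) th.join();
  return seen.size();
}
```

distinctQueuesSeen(8) returns 1 and constructions ends at 1, i.e. every thread receives a reference to the same object. This is consistent with the observation that problems only appear once several threads use the shared queue concurrently.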
When selecting CPU devices through the device filter, the program terminates with an exception, even with a single thread:
SYCL_DEVICE_FILTER=*:cpu ./peano4 --threading-model fuse-immediately-20 --threads 1
terminate called after throwing an instance of 'cl::sycl::runtime_error'
what(): No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (CL_DEVICE_NOT_FOUND)
Aborted