CUDA-accelerated classification

The task

Increase the classification speed of a random forest with CUDA. Random forest classification works by letting each classification tree in the forest classify the incoming datum, then having the trees vote on the label.

CPU algorithm

Serial random forest classification is very simple. Each classification tree, one after another, classifies the datum, and votes are recorded as the classifications are produced. At the end, the label with the most votes is the output classification.
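
A minimal C++ sketch of this serial loop is shown below. The flat-array tree layout and the names Tree, classifyTree, and classifyForestCPU are illustrative assumptions, not the exact structures used in this project.

#include <vector>
#include <algorithm>

// Illustrative flat-array tree layout: feature[i] < 0 marks node i as a leaf.
struct Tree {
    const int   *feature;   // split feature per node, -1 for leaves
    const float *threshold; // split threshold per node
    const int   *left;      // index of the left child
    const int   *right;     // index of the right child
    const int   *label;     // class label stored at leaf nodes
};

// Walk one tree from the root down to a leaf and return the leaf's label.
int classifyTree(const Tree &t, const float *x) {
    int node = 0;
    while (t.feature[node] >= 0)
        node = (x[t.feature[node]] <= t.threshold[node]) ? t.left[node] : t.right[node];
    return t.label[node];
}

// Serial forest classification: every tree votes in turn, the most popular label wins.
int classifyForestCPU(const Tree *trees, int nTrees, const float *x, int nClasses) {
    std::vector<int> votes(nClasses, 0);
    for (int i = 0; i < nTrees; ++i)
        ++votes[classifyTree(trees[i], x)];
    return (int)(std::max_element(votes.begin(), votes.end()) - votes.begin());
}

The total work grows linearly with the number of trees, which is the part the GPU version parallelizes.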

GPU algorithm

The GPU algorithm has the caveat that it must wait for the slowest tree, but it is designed to avoid communication with the CPU. A round trip to the CPU appears to take on the order of 10 to 100 microseconds and should therefore be avoided. As the sketch below suggests, the GPU's advantage grows with the number of trees in the forest.
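
One natural mapping, sketched below in CUDA, assigns one thread per tree. The sketch assumes the whole forest fits in a single block (at most 1024 trees on this hardware), that trees use the flat-array Tree layout from the CPU sketch above with all arrays resident in device memory, and that the number of labels is small; MAX_CLASSES and the function names are illustrative, not the exact code used in this project.

#define MAX_CLASSES 8   // illustrative upper bound on the number of labels (nClasses <= MAX_CLASSES)

__global__ void classifyForestGPU(const Tree *trees, int nTrees,
                                  const float *x, int nClasses, int *outLabel) {
    __shared__ int votes[MAX_CLASSES];

    // zero the shared vote counters
    if (threadIdx.x < nClasses) votes[threadIdx.x] = 0;
    __syncthreads();

    // one thread per tree: walk the tree and cast one vote
    int t = threadIdx.x;
    if (t < nTrees) {
        int node = 0;
        while (trees[t].feature[node] >= 0)
            node = (x[trees[t].feature[node]] <= trees[t].threshold[node])
                       ? trees[t].left[node] : trees[t].right[node];
        atomicAdd(&votes[trees[t].label[node]], 1);
    }
    __syncthreads();   // the whole block waits here for the slowest tree

    // the tally and the argmax stay on the GPU; no intermediate trip to the CPU
    if (threadIdx.x == 0) {
        int best = 0;
        for (int c = 1; c < nClasses; ++c)
            if (votes[c] > votes[best]) best = c;
        *outLabel = best;
    }
}

A hypothetical launch looks like classifyForestGPU<<<1, nTrees>>>(d_trees, nTrees, d_x, nClasses, d_label), where d_trees, d_x, and d_label are device pointers. The only device-to-host transfer per classification is the single integer label, so the 10 to 100 microsecond trip to the CPU is paid once at the end rather than per tree.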

Hardware

The CPU is an Intel Core i7-3517U. The GPU is a GeForce GT 620M, a built-in component of an Asus Ultrabook that is really meant to cooperate with the CPU, not compete against it. The fact that speed-ups are realized here demonstrates the great potential of CUDA: affordable GPUs can have 500 or 1000 CUDA cores, and slightly more expensive cards can have a few thousand CUDA cores, with additional hardware and software capabilities beyond what I am using.

The data

I have farmed some historical FOREX EUR-USD transactions and labelled them as good calls and puts. There are 560 thousand data points in 120 dimensions. Each dimension describes the price difference relative to a recent transaction and the time elapsed since that transaction.

# Load the feature matrix and the label vector
x = read.table("./../../upTo7Feb.dat", header = F)
y = read.table("./../../upTo7Feb.lab", header = F)
n = dim(y)[1]
I = subset(1:n, y[, 1] == 1)  # indices of the first label class (plotted in red)
J = subset(1:n, y[, 1] == 2)  # indices of the second label class (plotted in blue)
plot(x[, 2], type = "l", main = "EUR-USD price chart", xlab = "time step", ylab = "price")
points(I, x[I, 2], col = "red", pch = ".")
points(J, x[J, 2], col = "blue", pch = ".")

[Figure: EUR-USD price chart over time steps, with the two labelled classes marked in red and blue]

The success of CUDA

# Measured classification times (microseconds) for forests of 100 to 750 trees
trees = c(100, 300, 500, 750)
cpu = c(70, 160, 211, 264)
gpuKernel = c(93, 146, 149, 199)
gpuKernelAndComm = c(113, 209, 240, 291)
plot(trees, gpuKernelAndComm, type = "l", lwd = 2, xlab = "trees", ylab = "microseconds (10^-6 s)",
    main = "Classification times", ylim = c(70, 291))
lines(trees, cpu, lwd = 2, col = "blue")
lines(trees, gpuKernel, lwd = 2, col = "red")

[Figure: Classification times vs. number of trees. GPU kernel and communication (black) vs GPU kernel (red) vs CPU (blue)]

Analysis

The limited capability of my small GPU restricts it to competing with the CPU only in pipeline-efficient cases: at 750 trees the kernel alone beats the CPU (199 versus 264 microseconds), but with communication included (291 microseconds) it falls slightly behind. However, I predict that a better GPU would be faster than the CPU in communication time as well as kernel time.