I managed to compute the feature correlation with reasonable computational effort. I found an R function on github that was build to calculate correlation matrices on data with a high number of features. Originally it was using the package ff to store very large matrices. I changed some minor things, added support for normal matrices (see the parameter ff in the function) and changed the parallelization framework to doSNOW for Windows. I also eliminated a pitfall: Especially for very sparse data it can happen, that a very small number of observations is used to calculate the correlation. Have a look at the following example. The data table TestData has only 2 complete observations, which results in \(r=1\) (or \(r=-1\)).
TestData <- data.table(F1 <- c(1, NA, NA, 2, NA),
F2 <- c(3, NA, 2, 4, 5))
cor(TestData, use = "pairwise.complete.obs")
## V1 V2
## V1 1 1
## V2 1 1
I added the parameter SampleSize. It acts as a threshold for the minimum number of complete pairwise observations. If a combination of features has less complete pairwise observation the value in the correlation matrix is set to NA. According to this paper I set the standard value to 250 samples. Another way to go could be a test for significance of course. Keep in mind what we are doing here: We are calculating the Pearson coefficient of complete pairwise observations of every feature combination regardless of complete observations in the data set.
system.time(NumDataCor <- BigCorPar(NumData,
nblocks = 44,
threads = 6,
use = "pairwise.complete.obs",
SampleSize = 250))
save(NumDataCor, file = "NumDataCor.Rdata")
By the way: Exactly this is the useful thing about kaggle competitions: Now I have this function in my toolbox! Feel free to add it to yours, you’ll find it at the end of this post. The calculation took a little more than 1 hour on my machine and is very memory friendly! Since we only have to do this once I will not optimize it further :-)
Now let’s have a look at the correlation matrix:
corrplot(NumDataCor, method = "color",
type = "full", na.label = "square",
na.label.col = "green", tl.pos = "n")
Sparsity strikes back again: There are a lot of features, which does not have 250 or more pairwise complete observations (the green parts). Ok, lets try to find out if there are some highly correlated variables. The next plot shows all correlarions \(\lvert r \rvert >0.8\) and all the rest is white.
#Filter values |r| < 0.8
NumDataCor <- abs(NumDataCor)
NumDataCor[NumDataCor < 0.8] <- NA
#Plot
corrplot(NumDataCor, method = "color",
type = "full", na.label = "square",
na.label.col = "white", tl.pos = "n")
So what does this picture show us? Most of the higher correlated features are near to the main diagonal of the correlation matrix. The features are belonging to certain production lines and stations which are in our case sorted, which means that they are close together on the production site. Ok, so far some basic stuff on the data to get a sense of the content.
Here is the correlation function, that will (probably) not kill your memory:
#Function for parallel chunk wise computing for data sets with lot's of features
#choose ff = TRUE for data that is too big to allocate as normal matrix
#choose nblocks so, that ncol(data) %% nblocks is 0
BigCorPar <- function(x, nblocks = 5, threads = 2, ff = FALSE, SampleSize = 250, use = "pairwise.complete.obs", ...){
require(doSNOW)
require(foreach)
require(ff)
require(psych)
require(Matrix)
# initialize parallel framework
Cluster <- makeCluster(threads, type="SOCK")
registerDoSNOW(Cluster)
NCOL <- ncol(x)
## test if ncol(x) %% nblocks gives remainder 0
if (NCOL %% nblocks != 0){stop("Choose different 'nblocks' so that ncol(x) %% nblocks = 0!")}
## preallocate square matrix of dimension ncol(x)
if (!ff) {corMAT <- matrix(nrow = NCOL, ncol = NCOL)}
if (ff) {ff(vmode = "single", dim = c(NCOL, NCOL))}
## split column numbers into 'nblocks' groups
SPLIT <- split(1:NCOL, rep(1:nblocks, each = NCOL/nblocks))
## create all unique combinations of blocks
COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
COMBS <- t(apply(COMBS, 1, sort))
COMBS <- unique(COMBS)
## iterate through each block combination, calculate correlation matrix
## between blocks and store them in the preallocated matrix on both
## symmetric sides of the diagonal
results <- foreach(i = 1:nrow(COMBS)) %do% {
COMB <- COMBS[i, ]
G1 <- SPLIT[[COMB[1]]]
G2 <- SPLIT[[COMB[2]]]
COR <- cor(x[, G1, with = F], x[, G2, with = F], use = use, ...)
SampleTest <- count.pairwise(x[, G1, with = F], x[, G2, with = F])
COR[SampleTest < SampleSize] <- NA
corMAT[G1, G2] <- COR
corMAT[G2, G1] <- t(COR)
COR <- NULL
}
stopCluster(Cluster)
gc()
return(corMAT)
}