First steps (Bosch @kaggle)

Here are some first insights on the numerical data and some tips for processing it in R. The data set is pretty challenging for in-memory computing on smaller machines. Reading in the train_numeric.csv file as a data table occupies approx. 10 GB of memory on my machine. Reading it with fread() from the data.table package took about 2 minutes. Don’t try it with read.csv; that’s going to take hours :-) Okay, let’s do it:

setwd("D:/SCRATCH/96_Bosch/")
rm(list = ls())

#Load packages
require(data.table)
require(dplyr)
require(tidyr)
require(Matrix)
require(caret)
require(doSNOW)
require(foreach)
require(ff)
require(psych)
require(corrplot) #needed for corrplot() further down

#Read in NumData
NumData <- fread("train_numeric.csv")

#Extract target and ID
Target <- NumData$Response
ID <- NumData$Id

#Delete target and ID from data table
NumData[ , c("Response", "Id") := NULL]

#Helper variables
NofObs <- nrow(NumData)
NofFeat <- length(NumData)

I also loaded some other packages I’m going to use, extracted the IDs and the target, deleted them from the data table and created some helper variables. Notice that I loaded the foreach package and the parallel backend doSNOW for Windows to do some work in parallel (you can also use other parallel frameworks). In the next part I’m calculating the variance of the features and the ratio of NA values in all features and observations.

I tried some different methods, but it seems like parallel workers are good for operations on the features. It’s pretty fast and very memory friendly. For operations on each observation the best I tried so far is using apply. Pretty fast too, but memory intense. Almost exceeded my 32 GB RAM. Any idea for a fast AND memory friendly way?

After the calculation I’m converting the results to a data table and saving them to disk, so that I don’t have to do it again. All in all the calculations take a few minutes. By the way, it’s definitely a good idea to call the garbage collection gc() from time to time when you work with the data to clean up your memory. Especially when an operation went wrong!

#Register parallel cluster
Cluster <- makeCluster(6, type = "SOCK")
registerDoSNOW(Cluster)

#Calculate variance of each feature (fast, memory friendly)
#Note: %do% runs the loop sequentially; %dopar% would use the registered
#cluster, but each worker would then need its own copy of NumData
FeatVar <- foreach(n = 1:NofFeat, .combine = c) %do% 
           var(NumData[ , n, with = FALSE],  na.rm = TRUE)

#Calculate percentage of unique values and the ratio between the
#frequency of the first and the second most occurring value
NearZeroFeat <- foreach(n = 1:NofFeat, .combine = rbind) %do% 
                nearZeroVar(NumData[ , n, with = FALSE], saveMetrics = TRUE)

#Calculate NA-ratio of each feature (fast, memory friendly)
FeatNARatio <- foreach(n = 1:NofFeat, .combine = c) %do% 
               sum(is.na(NumData[ , n, with = FALSE])) / NofObs

#Calculate NA-ratio of each observation (fast, memory intense)
ObsNARatio <- apply(NumData, 1, 
                    function(x) {sum(is.na(x)) / NofFeat})

#Close cluster
stopCluster(Cluster)

#Convert to data table
FeatStats <- data.table(FeatNARatio = FeatNARatio,
                        FeatVar = FeatVar)

FeatStats <- cbind(FeatStats, NearZeroFeat)

ObsStats <- data.table(ObsNARatio = ObsNARatio)

#Save variables to disk
save(FeatStats, ObsStats, file = "stats.Rdata")
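
One possible answer to the question about a fast AND memory friendly way for the per-observation NA ratio: instead of letting apply() turn the whole table into a matrix, walk over the columns and add up the NA indicators in a single integer vector. This is only a sketch. The helper name RowNARatio and the small data frame Toy (standing in for NumData) are made up for illustration, and I have not timed it on the full data.

```r
# Hypothetical helper: NA ratio per row, accumulated column by column.
# Only one logical column and one integer vector are held at a time.
RowNARatio <- function(DT) {
  NACount <- integer(nrow(DT))
  for (Col in seq_along(DT)) {
    NACount <- NACount + is.na(DT[[Col]])
  }
  NACount / length(DT)
}

# Toy data standing in for NumData (assumption, not the real data)
Toy <- data.frame(A = c(1, NA, 3),
                  B = c(NA, NA, 6),
                  C = c(7, 8, 9))

ToyObsNARatio <- RowNARatio(Toy)
print(ToyObsNARatio)
```

Because DT[[Col]] also works on a data table, the same helper should be usable as RowNARatio(NumData).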

Okay, let’s have a look at our results. The first thing is to look for features with zero variance:

min(FeatStats$FeatVar)

## [1] 4.052727e-09

Okay, so all features carry at least some information. A common way of deciding whether a feature has almost no information is to look at the percentage of unique values and the ratio of the frequencies of the first and the second most occurring value.

Figure: Percentage of unique values for each feature

Figure: Ratio of the frequency of the first and the second most occurring value

Now that’s pretty interesting: Look how small the percentage of unique values is! The maximum is at 0.08 and the majority is smaller than 0.02. The ratio of the frequencies of the first and second most occurring value is pretty low. With its default settings, the nearZeroVar function in the caret package flags variables with less than 10 percent unique values and a frequency ratio greater than 19. We already calculated this flag (the nzv column) in the code above. Let’s check how many features would be deleted:

sum(FeatStats$nzv)

## [1] 121
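
To make the two criteria concrete, here is a small base R sketch of the rule as described above. It is not caret’s implementation: the helper NearZeroSketch, its argument names and the toy feature are made up, and dropping NAs before counting is my assumption.

```r
# Hypothetical helper mirroring the rule described in the text:
# flag a feature if it has few unique values AND one dominant value.
NearZeroSketch <- function(x, FreqCut = 19, UniqueCut = 10) {
  Values <- x[!is.na(x)]                      # assumption: NAs are ignored
  Counts <- sort(as.vector(table(Values)), decreasing = TRUE)
  PercentUnique <- 100 * length(Counts) / length(x)
  # With fewer than two distinct values there is no runner-up; report Inf
  FreqRatio <- if (length(Counts) > 1) Counts[1] / Counts[2] else Inf
  list(PercentUnique = PercentUnique,
       FreqRatio = FreqRatio,
       NearZero = PercentUnique < UniqueCut && FreqRatio > FreqCut)
}

# Toy feature: 97 zeros, 2 ones and a single two (100 values in total)
ToyFeat <- c(rep(0, 97), rep(1, 2), 2)
ToyMetrics <- NearZeroSketch(ToyFeat)
print(ToyMetrics)
```

The toy feature has 3 percent unique values and a frequency ratio of 97 / 2, so it is flagged by both criteria.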

According to that, 121 features have nearly zero variance. However, I’m not going to delete those. We have a binary classification problem with an extremely imbalanced data set, and even a perfectly separating feature could probably yield those values. Finally, let’s have a look at the ratio of NA values per feature:

Figure: NA ratio of all features

Well, at least 81% of our matrix is NA… Our data is extremely sparse. I can’t help but think the data is somehow assembled, so that a lot of values that don’t actually exist or aren’t possible are set to NA. I think that explaining the sparsity could be a key. My next steps are to investigate that. As a preparation I’m going to create a dummy matrix where all NAs are TRUE and all existing values are FALSE:

Sparsity <- is.na(NumData)

We should also have a look at the feature correlation. The data is extremely sparse and there are feature combinations that have no pairwise complete observations. Nevertheless we can check the correlation for all features with pairwise complete observations. So far I just tried the following code, which heavily exceeded my memory. I’m pretty sure a nested, parallel foreach loop is going to do the job well. I’ll do that in my next post.

#calculate correlations
NumDataCor <- cor(NumData, use = "pairwise.complete.obs")
#plot
corrplot(NumDataCor, method = "color")
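
As a rough illustration of the pair-by-pair idea, the sketch below computes each correlation separately, so only two columns are handled at a time. It is a plain nested for loop, not the parallel foreach version, and I have not tried it on the full data. PairwiseCor, MinObs and the toy data frame are made-up names; pairs with fewer than MinObs complete observations simply stay NA.

```r
# Hypothetical helper: correlation matrix built one column pair at a time
PairwiseCor <- function(DT, MinObs = 3) {
  P <- length(DT)
  Result <- matrix(NA_real_, P, P, dimnames = list(names(DT), names(DT)))
  for (i in seq_len(P)) {
    for (j in seq_len(i)) {
      x <- DT[[i]]
      y <- DT[[j]]
      # Rows where both features are observed
      Complete <- !is.na(x) & !is.na(y)
      if (sum(Complete) >= MinObs) {
        r <- suppressWarnings(cor(x[Complete], y[Complete]))
        Result[i, j] <- r
        Result[j, i] <- r
      }
    }
  }
  Result
}

# Toy data standing in for NumData (assumption, not the real data)
Toy <- data.frame(A = c(1, 2, 3, 4, NA),
                  B = c(2, 4, 6, 8, 10),
                  C = c(NA, NA, 5, 1, 2))
ToyCor <- PairwiseCor(Toy)
print(ToyCor)
```

In the toy data, A and C share only two complete rows, so their entry stays NA instead of reporting a correlation based on two points.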

So there is a lot of investigation to do on the data before building a model that can compete. Nevertheless here are some first ideas on building a model:

It’s an extremely imbalanced data set, so for evaluation we should use metrics like AUC
We should take over- and undersampling techniques like SMOTE into account
The same applies to dimensionality reduction (sparsity is going to be a problem)
I think I will go for XGBoost. It’s probably gonna be a computationally expensive task with a big data table, and XGBoost can work with sparse matrix representations. It’s also very robust when it comes to correlated features. It’s an extremely well optimized algorithm, and brute-forced rectangular boundary approximation seems to work pretty well. XGBoost was used by the winning teams of numerous Kaggle competitions…
Use sparse matrices!
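
A minimal sketch of what that could look like with the Matrix package loaded above: store only the observed cells and leave everything else implicit. The toy matrix stands in for NumData and all names are made up. One caveat I’m assuming matters here: the implicit cells read back as 0, not NA, so an observed 0 and a missing value look the same unless a later step treats absent entries as missing.

```r
library(Matrix)

# Toy matrix standing in for NumData: mostly NA, a few observed values
Toy <- matrix(NA_real_, nrow = 4, ncol = 5)
Toy[1, 2] <- 0.5
Toy[3, 2] <- -0.1
Toy[4, 5] <- 0.3

# Row and column positions of the observed (non-NA) cells
Observed <- which(!is.na(Toy), arr.ind = TRUE)

# Store only the observed cells; all other cells are left implicit
ToySparse <- sparseMatrix(i = Observed[, "row"],
                          j = Observed[, "col"],
                          x = Toy[Observed],
                          dims = dim(Toy))

print(ToySparse)
print(nnzero(ToySparse))
```

With roughly 81% of the cells missing, only about a fifth of the entries would need to be stored this way.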