1 Exploratory Data Analysis

The dataset describes Mercedes-Benz’s car testing process. It consists of an anonymized set of predictors, each representing a custom feature of a car, and the outcome y represents the time (in seconds) the car took to pass the testing process. The variables coded with letters are categorical and those coded 0/1 are binary.

1.1 Load libraries and data files

library(tidyverse)
library(caret)
library(DT)
library(plotly) 
train <- read_csv('train.csv')
test  <- read_csv('test.csv')

1.2 Data overview

Let’s have a look at the train and test files.

dim(train)
## [1] 4209  378
datatable(head(train,20), style="bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
dim(test)
## [1] 4209  377
datatable(head(test,20), style="bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
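
As a quick sanity check on the variable types described above, a minimal sketch (with readr defaults the letter-coded predictors X0-X8 should parse as character and the 0/1 predictors as numeric):

# Tabulate the parsed column classes of train
table(sapply(train, class))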

1.3 Missing values in train and test

sum(is.na(train)) + sum(is.na(test))
## [1] 0

There are no missing values in the dataset!

1.4 Near-zero and zero variance variables

nzv <- nearZeroVar(train, saveMetrics = T)
colnames(train[,nzv$nzv])
##   [1] "X4"   "X10"  "X11"  "X15"  "X16"  "X17"  "X18"  "X21"  "X23"  "X24" 
##  [11] "X26"  "X28"  "X29"  "X30"  "X32"  "X33"  "X34"  "X36"  "X38"  "X39" 
##  [21] "X40"  "X41"  "X42"  "X44"  "X47"  "X48"  "X52"  "X53"  "X54"  "X55" 
##  [31] "X56"  "X57"  "X59"  "X60"  "X61"  "X62"  "X63"  "X65"  "X66"  "X67" 
##  [41] "X69"  "X73"  "X74"  "X75"  "X76"  "X77"  "X78"  "X79"  "X82"  "X83" 
##  [51] "X86"  "X87"  "X88"  "X89"  "X90"  "X91"  "X92"  "X93"  "X94"  "X95" 
##  [61] "X97"  "X99"  "X102" "X104" "X105" "X106" "X107" "X108" "X109" "X110"
##  [71] "X111" "X112" "X113" "X117" "X120" "X122" "X123" "X124" "X125" "X126"
##  [81] "X128" "X130" "X131" "X134" "X135" "X136" "X138" "X140" "X141" "X143"
##  [91] "X145" "X146" "X147" "X148" "X152" "X153" "X159" "X160" "X162" "X165"
## [101] "X166" "X167" "X169" "X170" "X172" "X173" "X174" "X175" "X176" "X179"
## [111] "X183" "X184" "X185" "X190" "X192" "X195" "X196" "X197" "X198" "X199"
## [121] "X200" "X203" "X204" "X205" "X206" "X207" "X210" "X211" "X212" "X213"
## [131] "X214" "X216" "X217" "X221" "X222" "X226" "X227" "X228" "X229" "X230"
## [141] "X231" "X232" "X233" "X235" "X236" "X237" "X239" "X240" "X242" "X243"
## [151] "X245" "X248" "X249" "X252" "X253" "X254" "X255" "X257" "X258" "X259"
## [161] "X260" "X262" "X263" "X264" "X266" "X267" "X268" "X269" "X270" "X271"
## [171] "X272" "X274" "X276" "X277" "X278" "X279" "X280" "X281" "X282" "X284"
## [181] "X287" "X288" "X289" "X290" "X291" "X292" "X293" "X295" "X296" "X297"
## [191] "X298" "X299" "X301" "X302" "X305" "X306" "X307" "X308" "X309" "X310"
## [201] "X312" "X315" "X317" "X318" "X319" "X320" "X322" "X323" "X325" "X326"
## [211] "X328" "X330" "X332" "X333" "X335" "X338" "X339" "X340" "X341" "X342"
## [221] "X344" "X345" "X346" "X347" "X349" "X353" "X357" "X359" "X361" "X364"
## [231] "X365" "X366" "X369" "X370" "X371" "X372" "X373" "X378" "X379" "X380"
## [241] "X382" "X383" "X384" "X385"
colnames(train[,nzv$zeroVar])
##  [1] "X11"  "X93"  "X107" "X233" "X235" "X268" "X289" "X290" "X293" "X297"
## [11] "X330" "X347"

Of the 244 near-zero-variance variables, 12 have zero variance in train. These 12 zero-variance variables carry no information and must be removed from train.

train <- train[,!(nzv$zeroVar)]

1.5 Summarize outcome (testing time) in train

summary(train$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   72.11   90.82   99.15  100.67  109.01  265.32
ggplot(train, aes(x=y, y=..density..)) +
 geom_histogram(colour="purple", size=.2, bins = length(train$y)/3) +
 geom_density() + scale_x_log10(breaks=c(seq(70,150,10),seq(170,270,20))) +
  labs(x = "Testing time(secs)", y = "Density", title = "Distribution of the outcome y in train")

We observe the following from the above plot:

  • There are 4 prominent peaks, indicating that the distribution is multi-modal.
  • There is a single observation with a testing time of about 265 secs, which looks like an outlier; a quick check follows below.
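
A minimal sketch to inspect the upper tail of y (the 150-second threshold is an arbitrary choice):

# List the observations in the extreme upper tail of y
train %>% filter(y > 150) %>% select(ID, y) %>% arrange(desc(y))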

1.6 Duplicate combinations in train

Let’s check whether any rows in train are duplicated; of course, the ID variable has to be excluded from the comparison.

nrow(unique(train[duplicated(train[,-1]) | duplicated(train[,-1], fromLast = T),-1]))
## [1] 1

So one of the rows is duplicated. Remove one copy of that row from train.

train <- train %>% filter(ID != train[duplicated(train[,-1]),]$ID)

Let’s check whether any predictor combinations in train are duplicated; for this we exclude both ID and y.

nrow(unique(train[duplicated(train[,-(1:2)]) | duplicated(train[,-(1:2)], fromLast = T),-(1:2)]))
## [1] 217

There are 217 distinct predictor combinations that occur more than once. What does this suggest? Cars of the same type and configuration have different testing times, so some predictor that could explain these differences is missing from the dataset.
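
A minimal sketch (assuming dplyr >= 1.0 for across()) to quantify how much y varies within identical predictor combinations:

# Spread of y within groups of rows that share exactly the same predictor values
dup_spread <- train %>%
  group_by(across(-c(ID, y))) %>%   # group on all predictor columns
  filter(n() > 1) %>%               # keep only duplicated combinations
  summarise(n = n(), y_range = max(y) - min(y), .groups = "drop")
summary(dup_spread$y_range)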

1.7 Interaction between outcome and covariates

1.7.1 Boxplots of outcome vs categorical predictors (X0-X8)

The boxplots show the distribution of the outcome w.r.t. values of the categorical predictors X0-X8.

train %>%
  ggplot(aes(reorder(X0, y, FUN = median) , y)) + geom_boxplot(fill="lightblue") +
  geom_jitter(color="blue", width = 0.2, size = 0.4, alpha = 0.5) +
  labs(x = "X0", y = "Testing time(secs)")

train %>%
  ggplot(aes(reorder(X1, y, FUN = median) , y)) + geom_boxplot(fill="lightblue") + 
  geom_jitter(color="blue", width = 0.2, size = 0.4, alpha = 0.5) +
  labs(x = "X1", y = "Testing time(secs)") 

train %>%
  ggplot(aes(reorder(X2, y, FUN = median) , y)) + geom_boxplot(fill="lightblue") + 
  geom_jitter(color="blue", width = 0.2, size = 0.4, alpha = 0.5) +
  labs(x = "X2", y = "Testing time(secs)") 

train %>%
  ggplot(aes(reorder(X3, y, FUN = median) , y)) + geom_boxplot(fill="lightblue") + 
  geom_jitter(color="blue", width = 0.2, size = 0.4, alpha = 0.5) +
  labs(x = "X3", y = "Testing time(secs)") 

train %>%
  ggplot(aes(reorder(X4, y, FUN = median) , y)) + geom_boxplot(fill="lightblue") + 
  geom_jitter(color="blue", width = 0.2, size = 0.4, alpha = 0.5) +
  labs(x = "X4", y = "Testing time(secs)") 

train %>%
  ggplot(aes(reorder(X5, y, FUN = median) , y)) + geom_boxplot(fill="lightblue") + 
  geom_jitter(color="blue", width = 0.2, size = 0.4, alpha = 0.5) +
  labs(x = "X5", y = "Testing time(secs)") 

train %>%
  ggplot(aes(reorder(X6, y, FUN = median) , y)) + geom_boxplot(fill="lightblue") + 
  geom_jitter(color="blue", width = 0.2, size = 0.4, alpha = 0.5) +
  labs(x = "X6", y = "Testing time(secs)") 

train %>%
  ggplot(aes(reorder(X8, y, FUN = median) , y)) + geom_boxplot(fill="lightblue") + 
  geom_jitter(color="blue", width = 0.2, size = 0.4, alpha = 0.5) +
  labs(x = "X8", y = "Testing time(secs)") 

We observe the following from the above boxplots:

  • X0 shows the highest variability across its levels, followed by X2, X1, X4, X3, X5, X6 and X8 (quantified in the sketch below).
  • However, X4 is dominated by its level “d”, with its other levels sparsely populated.
  • The observation with an outcome y of around 265 secs lies far above the rest, so it is better to remove it than risk biased predictions.

train <- train %>% filter(y < 260)
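
One way to quantify the variability ordering noted in the first bullet, as a minimal sketch (the inter-quartile range of per-level medians is just one possible measure):

# Rank X0-X8 by the IQR of their per-level medians of y
sort(sapply(c("X0", "X1", "X2", "X3", "X4", "X5", "X6", "X8"),
            function(v) IQR(tapply(train$y, train[[v]], median))),
     decreasing = TRUE)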

1.7.2 Volcano plot of categorical predictors (X0-X8)

The categorical predictors are decomposed into binary dummy variables, and to assess the relevance of each binary predictor a Wilcoxon rank-sum test is applied, followed by Benjamini-Hochberg p-value correction. Here the null hypothesis is that the outcome distributions for predictor values 1 and 0 differ by a location shift of zero.

# One hot encoding predictors X0-X8
dX0_8 <- dummyVars("~ .", data = train %>% select(X0:X8) )
binX0_8 <- as.data.frame(predict(dX0_8, newdata = train %>% select(X0:X8)))

# Apply Wilcoxon rank sum test on one hot encoded X0-X8
tests_X0_8 <- apply(binX0_8, 2,
                    function(x, y){
                      wxn <- wilcox.test(y ~ x, conf.int = T)[c("statistic", "p.value", "estimate")]
                      unlist(wxn)},
                    y = train$y)

# The results are a matrix with predictors in columns. We reverse this
tests_X0_8 <- as.data.frame(t(tests_X0_8))
names(tests_X0_8) <- c("Statistic.W", "p.value", "Estimate")
tests_X0_8$p.adj <- p.adjust(tests_X0_8$p.value , method = "fdr")
tests_X0_8$feature = rownames(tests_X0_8)
tests_X0_8_plot <- tests_X0_8 %>% 
  mutate(Significance = ifelse(tests_X0_8$p.adj<0.01, "p.adj<0.01", "Not Significant")) %>%
  ggplot(aes(x = Estimate, y = -log10(p.adj), color = Significance)) +
  geom_point(aes(text = (paste("feature:", feature, "<br>p.adj:", p.adj))), alpha = 0.4, size = 3) +
  labs(x = "Difference of means", y = "-log10(p.adj)") 
ggplotly(tests_X0_8_plot, tooltip = c("text"))

The above plot shows that the levels of predictor X0 are relatively more important than the others, which confirms our findings from the boxplots.

1.7.3 Volcano plot of binary predictors (X10-X385)

The above procedure is repeated for the binary predictors after removing the zero-variance variables.

# Apply Wilcoxon rank sum test on X10 to X385
tests_X10_385 <- apply(train %>% select(X10:X385), 2,
                  function(x, y)
                    {
                    wxn <- wilcox.test(y ~ x, conf.int = T)[c("statistic", "p.value", "estimate")]
                    unlist(wxn)
                    },
               y = train$y)
tests_X10_385 <- as.data.frame(t(tests_X10_385))
names(tests_X10_385) <- c("Statistic.W", "p.value", "Estimate")
tests_X10_385$p.adj <- p.adjust(tests_X10_385$p.value , method = "fdr")
tests_X10_385$feature = rownames(tests_X10_385)
tests_X10_385_plot <- tests_X10_385 %>% 
  mutate(Significance = ifelse(tests_X10_385$p.adj<0.01, "p.adj<0.01", "Not Significant")) %>%
  ggplot(aes(x = Estimate, y = -log10(p.adj), color = Significance)) +
  geom_point(aes(text = (paste("feature:", feature, "<br>p.adj:", p.adj))), alpha = 0.4, size = 3) +
  labs(x = "Difference of means", y = "-log10(p.adj)") 
ggplotly(tests_X10_385_plot, tooltip = c("text"))

1.7.4 Scatterplot of y vs ID

ggplot(train, aes(x = ID, y = y)) +
  geom_point(color = "blue", alpha = 0.5, size = 0.7) +
  geom_smooth()

From the above plot it is clear that ID shows a trend with the outcome y; maybe there is a potential data leak. Nevertheless, let’s include ID in model building.
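
A rough way to quantify this trend, as a minimal sketch:

# Rank correlation between ID and the outcome y
cor(train$ID, train$y, method = "spearman")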

1.8 Missing categorical predictor’s (X0-X8) levels in train and test

1.8.1 Present in train but absent in test

Levels of X0 not in test but in train

unique(train$X0)[!(unique(train$X0) %in% unique(test$X0))]
## [1] "q"  "aa" "ac" "ab"

Levels of X1 not in test but in train

unique(train$X1)[!(unique(train$X1) %in% unique(test$X1))]
## character(0)

Levels of X2 not in test but in train

unique(train$X2)[!(unique(train$X2) %in% unique(test$X2))]
## [1] "l"  "aa" "c"  "o"  "ar"

Levels of X3 not in test but in train

unique(train$X3)[!(unique(train$X3) %in% unique(test$X3))]
## character(0)

Levels of X4 not in test but in train

unique(train$X4)[!(unique(train$X4) %in% unique(test$X4))]
## character(0)

Levels of X5 not in test but in train

unique(train$X5)[!(unique(train$X5) %in% unique(test$X5))]
## [1] "u"

Levels of X6 not in test but in train

unique(train$X6)[!(unique(train$X6) %in% unique(test$X6))]
## character(0)

Levels of X8 not in test but in train

unique(train$X8)[!(unique(train$X8) %in% unique(test$X8))]
## character(0)

1.8.2 Present in test but absent in train

Levels of X0 not in train but in test:

unique(test$X0)[!(unique(test$X0) %in% unique(train$X0))]
## [1] "av" "ag" "an" "ae" "p"  "bb"

Levels of X1 not in train but in test:

unique(test$X1)[!(unique(test$X1) %in% unique(train$X1))]
## character(0)

Levels of X2 not in train but in test:

unique(test$X2)[!(unique(test$X2) %in% unique(train$X2))]
## [1] "aj" "ax" "ab" "w"  "ad" "u"

Levels of X3 not in train but in test:

unique(test$X3)[!(unique(test$X3) %in% unique(train$X3))]
## character(0)

Levels of X4 not in train but in test:

unique(test$X4)[!(unique(test$X4) %in% unique(train$X4))]
## character(0)

Levels of X5 not in train but in test:

unique(test$X5)[!(unique(test$X5) %in% unique(train$X5))]
## [1] "t" "b" "a" "z"

Levels of X6 not in train but in test:

unique(test$X6)[!(unique(test$X6) %in% unique(train$X6))]
## character(0)

Levels of X8 not in train but in test:

unique(test$X8)[!(unique(test$X8) %in% unique(train$X8))]
## character(0)

1.9 Finding clusters in y with the help of X0

From the boxplot of y vs X0 we can observe 4 prominent groups of levels within which the median of y is nearly the same. The level “aa”, having only two observations, doesn’t fit cleanly into any of the 4 groups, but since it is not present in test, we can safely assign it to group 4. (A programmatic cross-check follows the manual binning below.)

X0_levels <- unique(train$X0)

# Binning X0 levels into groups as per their medians
group_1 <- c("bc","az")
group_2 <- c("ac","am","l","b","aq","u","ad","e","al","s",
            "n","y","t","ai","k","f","z","o","ba","m","q")
group_3 <- c("d","ay","h","aj","v","ao","aw")
group_4 <- c("c","ax","x","j","w","i","ak","g","at","ab",
            "af","r","as","a","ap","au","aa")

X0_groups <- rep(NA, length(X0_levels))
names(X0_groups) <- X0_levels
X0_groups[group_1] <- 1
X0_groups[group_2] <- 2
X0_groups[group_3] <- 3
X0_groups[group_4] <- 4

# Define cluster as new variable indicating group of X0 levels
train$cluster <- as.factor(X0_groups[train$X0])
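
As a cross-check of the manual binning above, a minimal sketch (not the method used here; the k-means cluster labels are arbitrary and need not coincide with the group numbers 1-4):

# Cluster the per-level medians of y into 4 groups with k-means
lev_med <- tapply(train$y, train$X0, median)
set.seed(1)
km <- kmeans(lev_med, centers = 4)
split(names(lev_med), km$cluster)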

Let’s check whether this new variable “cluster” separates the mixture of distributions in y.

ggplot(train, aes(x = y, fill = cluster)) +
  geom_histogram(binwidth = 1, alpha = 0.6) +
  facet_wrap(~as.factor(cluster), ncol = 2) +
  labs(x = "Testing time (secs)", y = "Count", title = "Distribution of the outcome y separated by cluster")

ggplot(train, aes(x = cluster, y = y))+
  geom_violin(fill="lightblue", color = "blue") +
  geom_boxplot(fill="lightgreen",color = "darkgreen", width=.2) +
labs(x = "Cluster number", y = "Testing time (secs)", title = "Violin plot of y separed by cluster")

From the above plots it is clear that the variable “cluster” does its job very well. We also have to assign the observations in test to the above groups, but there are 6 levels of X0 present in test that are absent from train. Hence these observations cannot be assigned to their groups with the help of X0 alone. Let’s have a look at these observations in test.

(X0levMissing <- test %>% filter(X0 %in% (unique(test$X0)[!(unique(test$X0) %in% unique(train$X0))])) %>% select(ID:X8))

From the boxplots it is clear that X2 shows the highest variation after X0. Hence, these observations can be assigned to their respective groups using X2. There are 3 unique X2 levels among these observations.

unique(X0levMissing$X2)
## [1] "ac" "as" "av"

The cluster variable for these 6 observations is assigned the group number having the highest proportion in train for each of these 3 levels of X2 (three checks below, consolidated in a sketch afterwards).

train %>% filter(X2 == "ac") %>% group_by(cluster) %>% summarise(count = n())

X2 level “ac” is assigned to group 1.

train %>% filter(X2 == "as") %>% group_by(cluster) %>% summarise(count = n())

X2 level “as” is assigned to group 4.

train %>% filter(X2 == "av") %>% group_by(cluster) %>% summarise(count = n())

X2 level “av” is assigned to group 2.
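
The three checks above can be consolidated into a single query, as a minimal sketch:

# Most frequent cluster for each of the three X2 levels in question
train %>%
  filter(X2 %in% c("ac", "as", "av")) %>%
  count(X2, cluster) %>%
  group_by(X2) %>%
  filter(n == max(n)) %>%
  ungroup()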

Adding the cluster variable to test:

test$cluster <- as.factor(X0_groups[test$X0])
for(id in X0levMissing$ID){
  if(test[test$ID == id,]$X2 == "ac") test[test$ID == id,]$cluster <- 1
  if(test[test$ID == id,]$X2 == "as") test[test$ID == id,]$cluster <- 4
  if(test[test$ID == id,]$X2 == "av") test[test$ID == id,]$cluster <- 2
}

2 Selection of Features for Model Building

The categorical variables (X0-X8) were converted by one-hot encoding. These, along with the binary variables (X10-X385), were tested for significance as described in section 1.7, and only features with adjusted p-values below 0.01 are selected. The new variable cluster is also one-hot encoded and added to the train and test feature sets.

# Selecting features passing the Wilcoxon rank sum test at a significance level of 0.01
binX_train <- train %>% select(y, ID) %>% bind_cols(binX0_8 %>%
  select(tests_X0_8 %>% filter(p.adj<0.01) %>% .$feature)) %>%
   bind_cols(train %>% select(X10:X385) %>%
    select(tests_X10_385 %>% filter(p.adj<0.01) %>% .$feature))

# One hot encoding test 
dmy <- dummyVars(" ~ .", data = test)
binX_test <- data.frame(predict(dmy, test))

# Selecting columns common to binX_train
sigPrdrs <- (colnames(binX_test))[colnames(binX_test) %in% colnames(binX_train)]
binX_test <- binX_test %>% select(sigPrdrs)

# One hot encoding variable cluster and append to both train and test
dmy_cluster <- dummyVars("~ cluster", data = train)
dmyC_train <- data.frame(predict(dmy_cluster, train))
dmyC_test <- data.frame(predict(dmy_cluster, test))
binX_train <- binX_train %>% bind_cols(dmyC_train)
binX_test <- binX_test %>% bind_cols(dmyC_test)

3 Model Building

An extreme gradient boosting tree-based method (XGBoost) is used to build the model. The competition evaluation metric was the coefficient of determination R2. However, the model tuned using root-mean-square error (RMSE) as the performance metric turned out to be more accurate than those tuned with mean absolute error (MAE) or R2. For tuning we use repeated leave-group-out cross-validation (LGOCV), leaving out 10 percent as the hold-out group and repeating this 100 times.
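
For reference, the coefficient of determination is defined as

\[
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2},
\]

where \(\hat{y}_i\) are the model predictions and \(\bar{y}\) is the mean of the observed outcomes.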

3.1 Extreme Gradient Boosting Tree

set.seed(4567)
xgbt <- train(y ~ ., data = binX_train,
               method = "xgbTree",
               metric = "RMSE",
               tuneGrid = expand.grid(nrounds = seq(50, 1000, by = 50),
                            max_depth = 3,
                            eta = 0.01,
                            gamma = 0,
                            colsample_bytree = 0.6,
                            min_child_weight = 1,
                            subsample = 1),
               trControl = trainControl(method = "LGOCV",
                             number = 100, p = 0.9,
                             search = "grid")
               )
xgbt
## eXtreme Gradient Boosting 
## 
## 4207 samples
##  284 predictor
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (100 reps, 90%) 
## Summary of sample sizes: 3787, 3787, 3787, 3787, 3787, 3787, ... 
## Resampling results across tuning parameters:
## 
##   nrounds  RMSE       Rsquared   MAE      
##     50     61.408334  0.5931929  60.615849
##    100     37.704363  0.5935345  36.690524
##    150     23.679173  0.5961436  22.213498
##    200     15.654930  0.5995220  13.456302
##    250     11.357673  0.6017464   8.190205
##    300      9.288183  0.6027522   5.574775
##    350      8.399637  0.6033532   4.868769
##    400      8.048491  0.6035043   4.830779
##    450      7.916764  0.6033707   4.926723
##    500      7.868212  0.6032157   5.022770
##    550      7.849513  0.6031960   5.092193
##    600      7.842412  0.6031655   5.138588
##    650      7.840002  0.6031040   5.168426
##    700      7.839542  0.6030118   5.187133
##    750      7.840095  0.6028845   5.199259
##    800      7.841514  0.6026974   5.207224
##    850      7.843157  0.6025011   5.212743
##    900      7.844868  0.6023021   5.216606
##    950      7.846782  0.6020890   5.219720
##   1000      7.848932  0.6018539   5.222591
## 
## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'eta' was held constant at a value of 0.01
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 700, max_depth = 3,
##  eta = 0.01, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
##  and subsample = 1.

Predicting the test outcome and writing the submission file:

testResult <- predict(xgbt, binX_test)
sub_xgbt <- read_csv('sample_submission.csv') %>% mutate(y = testResult)
write_excel_csv(sub_xgbt,'sub_xgbt.csv')

A snapshot of the scores obtained by the above XGBoost model:

A snapshot of the top ten places on the private leaderboard:

4 Conclusion

The Wilcoxon rank-sum test played a very important role in separating the relevant features from the noisy ones, as well as from those present in train but absent in test. It is clear from the leaderboard scores that the given predictors were unable to explain enough of the variation in the outcome. The solution proposed in this document shows an improvement over the \(2^{nd}\)-place solution on the private leaderboard.