Since the first automobile, the Benz Patent Motor Car of 1886, Mercedes-Benz has stood for important automotive innovations, including the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium car makers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.
To ensure the safety and reliability of every unique car configuration before it hits the road, Daimler's engineers have developed a robust testing system. But optimizing the speed of that testing system across so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world's biggest manufacturers of premium cars, Daimler treats safety and efficiency as paramount on its production lines.
In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.
We will create multiple models based on the different clusters in the dataset. The testing times (the dependent variable) show four distinct distributions, so we will build a separate model for each of them.
# Load Data ---------------------------------------------------------------
path <- "./input"
train <- read.csv(file.path(path, "train.csv"))
test <- read.csv(file.path(path, "test.csv"))
There are 377 variables in this anonymised data. X0-X8 are factor variables, and X10-X377 are binary variables taking the value 0 or 1. All the variables represent car types or car features.
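As a quick sanity check, we can confirm this split of variable types programmatically. This is a minimal sketch; it assumes read.csv imported the categorical columns as factors, which is the default in R 3.x:
# Separate the factor predictors (X0-X8) from the rest
factor_vars <- names(Filter(is.factor, train))
binary_vars <- setdiff(names(train), c("ID", "y", factor_vars))
length(factor_vars); length(binary_vars)
# Every remaining predictor should contain only 0s and 1s
all(sapply(train[binary_vars], function(v) all(v %in% c(0, 1))))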
# Distribution of the testing time (the dependent variable)
plot(density(train$y))
summary(train$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   72.11   90.82   99.15  100.70  109.00  265.30
As you can see, there are multiple peaks in the distribution of the dependent variable, so there could be multiple distributions present in the data. Let's explore.
Let's regress y on one independent variable at a time to find the top variables that explain the variability of y.
# R-squared from regressing y on each predictor individually; show the top 10
rev(sort(sapply(train[3:377], function(q) summary(lm(train$y ~ q))$r.squared)))[1:10]
##        X0      X314      X261      X127        X2      X263      X279 
## 0.5748049 0.3672423 0.3466680 0.2607325 0.2258437 0.1441181 0.1441181 
##      X232       X29      X136 
## 0.1441181 0.1441181 0.1355544
X0 explains the variability the most. Let's plot the mean of y for each level of X0.
library(dplyr)
meanYbyX0 <- train %>%
  group_by(X0) %>%
  summarise(meanY = mean(y))
library(ggplot2)
ggplot(meanYbyX0, aes(reorder(X0,meanY), meanY)) + geom_point()
We can clearly see 4 clusters in the data.
Let’s create the clusters in the training and test sets.
# Combine train (dropping y) and test so clusters are assigned consistently
combi <- rbind(train[,-2], test)
combi$cluster <- ifelse(combi$X0 %in% c("bc", "az"), 1,
                 ifelse(combi$X0 %in% c("ac", "l", "am", "b", "e", "q", "al",
                                        "n", "t", "aq", "s", "f", "y", "ad",
                                        "u", "ba", "o", "m", "z", "ai", "k"), 2,
                 ifelse(combi$X0 %in% c("d", "ay", "aw", "aj", "h", "v", "ao"), 3,
                 ifelse(combi$X0 %in% c("i", "g", "j", "c", "ax", "ab", "ak",
                                        "x", "w", "af", "at", "r", "as", "a",
                                        "ap", "au"), 4,
                 ifelse(combi$X0 == "aa", -1, 5)))))
combi$cluster <- as.factor(combi$cluster)
table(combi$cluster)
##
##   -1    1    2    3    4    5 
##    2  348 4136 1310 2616    6
Eight rows need special attention. Two training rows have "aa" as the X0 value, a level that does not appear in the test set, so we can simply ignore these rows (cluster = -1). The other six rows carry X0 levels that appear in the test set but not in the training set; we will see how to deal with them later (cluster = 5).
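For reference, the unseen levels behind those six test rows can be listed directly (a small sketch, run before the re-split below while train and test still hold the original data):
# X0 levels that occur in the test set but never in the training set (cluster 5)
setdiff(unique(test$X0), unique(train$X0))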
# Re-attach y, drop the two "aa" rows, and split combi back into train and test
train <- cbind(y = train$y, combi[1:4209, ])
train <- train[train$cluster != -1, ]
test <- combi[4210:8418, ]
rm(combi)
ggplot(train, aes(y, fill = cluster)) + geom_histogram(alpha = 0.5, aes(y = ..count..), position = 'identity', bins=80)
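The histogram confirms that the four testing-time distributions separate cleanly by cluster. A quick numeric check tells the same story (a small sketch using base R's aggregate):
# Mean testing time within each training cluster
aggregate(y ~ cluster, data = train, FUN = mean)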
PS: XGBoost code can be found in my other published documents.
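For completeness, here is a minimal sketch of the per-cluster modelling idea itself. It is illustrative only: it stands in plain lm() on X314 and X261 (the two strongest binary predictors found above) for the XGBoost models of the published solution, and the global-mean fallback for cluster 5 is an assumption of this sketch, not the method used there:
# Fit one simple model per cluster, then predict test rows cluster by cluster
test$y <- NA
for (cl in c("1", "2", "3", "4")) {
  fit <- lm(y ~ X314 + X261, data = train[train$cluster == cl, ])
  idx <- test$cluster == cl
  test$y[idx] <- predict(fit, newdata = test[idx, ])
}
# Cluster-5 rows (unseen X0 levels) need a fallback; the global mean is one option
test$y[test$cluster == 5] <- mean(train$y)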