Mercedes-Benz Greener Manufacturing

Can you cut the time a Mercedes-Benz spends on the test bench?

Since the first automobile, the Benz Patent Motor Car of 1886, Mercedes-Benz has stood for important automotive innovations, including the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium car makers. Daimler’s Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can configure the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, Daimler’s engineers have developed a robust testing system. But optimizing the speed of that testing system across so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world’s biggest manufacturers of premium cars, Daimler treats safety and efficiency as paramount on its production lines.

In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

Solution Approach

We will create multiple models based on the different clusters in the dataset. The dependent variable (testing time) shows 4 distinct distributions, so we will build a separate model for each of them.

Load the data

# Load Data ---------------------------------------------------------------
path <- "./input"

train <- read.csv(file.path(path, "train.csv"))
test <- read.csv(file.path(path, "test.csv"))

There are 377 variables in this anonymised data. X0-X8 are factor variables; X10-X377 are binary variables taking the values 0 or 1. All the variables represent car type or car features.
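A quick sanity check of this split of variable types (a minimal sketch; the column positions assume the standard competition files, where ID and y come first):

# Factor columns X0-X8 sit right after ID and y.
sapply(train[, 3:10], class)

# Verify that every remaining feature column only takes the values 0 and 1.
all(sapply(train[, 11:ncol(train)], function(x) all(x %in% c(0, 1))))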

Density and Clusters

plot(density(train$y))

summary(train$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   72.11   90.82   99.15  100.70  109.00  265.30

As you can see, there are multiple peaks in the distribution of the dependent variable, so there could be multiple distributions present in the data. Let’s explore.
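One quick way to make this concrete is to locate the local maxima of the kernel density estimate directly (a small sketch):

# Peaks are points where the slope of the density estimate changes
# from positive to negative.
d <- density(train$y)
peaks <- which(diff(sign(diff(d$y))) == -2) + 1
d$x[peaks]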

Let’s regress Y using one independent variable at a time to see top variables that explain variability of y.

rev(sort(sapply(train[3:377], function(q) summary(lm(train$y ~ q))$r.squared)))[1:10]
##        X0      X314      X261      X127        X2      X263      X279 
## 0.5748049 0.3672423 0.3466680 0.2607325 0.2258437 0.1441181 0.1441181 
##      X232       X29      X136 
## 0.1441181 0.1441181 0.1355544

X0 explains the most variability. Let’s plot the mean of y for each level of X0.

library(dplyr)
meanYbyX0 <- train %>%
         group_by(X0) %>%
         summarise(meanY= mean(y))        

library(ggplot2)
ggplot(meanYbyX0, aes(reorder(X0,meanY), meanY)) + geom_point()

We can clearly see 4 clusters in the data.

  • The mean values of y below 80.
  • The mean values of y around 95.
  • The mean values of y around 105.
  • The mean values of y around 115.
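As a cross-check, the same grouping can be derived programmatically by binning the per-level means. This is only a sketch: the cut points below are assumptions read off the plot, not values from the original analysis.

# Hypothetical thresholds separating the four bands visible in the plot.
meanYbyX0$cluster_guess <- cut(meanYbyX0$meanY,
                               breaks = c(-Inf, 85, 100, 110, Inf),
                               labels = 1:4)
table(meanYbyX0$cluster_guess)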

Let’s create the clusters in the training and test sets.

combi <- rbind(train[,-2],test)

# Assign each level of X0 to one of the four clusters identified above.
cluster1 <- c("bc", "az")
cluster2 <- c("ac", "l", "am", "b", "e", "q", "al", "n", "t", "aq", "s",
              "f", "y", "ad", "u", "ba", "o", "m", "z", "ai", "k")
cluster3 <- c("d", "ay", "aw", "aj", "h", "v", "ao")
cluster4 <- c("i", "g", "j", "c", "ax", "ab", "ak", "x", "w", "af",
              "at", "r", "as", "a", "ap", "au")

combi$cluster <- ifelse(combi$X0 %in% cluster1, 1,
                 ifelse(combi$X0 %in% cluster2, 2,
                 ifelse(combi$X0 %in% cluster3, 3,
                 ifelse(combi$X0 %in% cluster4, 4,
                 ifelse(combi$X0 == "aa", -1, 5)))))

combi$cluster <- as.factor(combi$cluster)

table(combi$cluster)
## 
##   -1    1    2    3    4    5 
##    2  348 4136 1310 2616    6

There are 8 rows that need special attention. Two training rows have “aa” as the X0 value; since “aa” does not appear in the test set, we can ignore these rows (cluster = -1). The other 6 rows are in the test set with X0 levels that never appear in the training set (cluster = 5); we will see how to deal with these later.
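These levels can be listed directly by comparing the X0 values of the two sets (run this before train and test are reassigned below):

# X0 levels present only in the training set ("aa", per the text above).
setdiff(unique(as.character(train$X0)), unique(as.character(test$X0)))

# X0 levels present only in the test set; these become cluster 5.
setdiff(unique(as.character(test$X0)), unique(as.character(train$X0)))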

train <- cbind(y=train$y,combi[1:4209,])
train <- train[train$cluster != -1, ]
test <- combi[4210:8418,]
rm(combi)

ggplot(train, aes(y, fill = cluster)) + geom_histogram(alpha = 0.5, aes(y = ..count..), position = 'identity', bins=80)

Modeling

  • XGBoost (with a regression objective) was used to model each cluster individually; a minimal sketch follows this list.
  • The predictions from the cluster models were amalgamated and submitted.
  • The XGBoost hyperparameters were tuned separately for each cluster via grid-search cross-validation.
  • Cross-validation for each cluster was repeated 10 times, and the predictions were averaged to account for randomness.
  • The 6 rows in cluster 5 were predicted as the mean of the dependent variable y.
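To make these steps concrete, here is a minimal per-cluster sketch. It is not the competition code: the hyperparameters are placeholders, only the binary columns are used as features, and the repeated cross-validation and per-cluster tuning described above are omitted.

library(xgboost)

# Placeholder hyperparameters; the real values were tuned per cluster
# by grid-search cross-validation.
params <- list(objective = "reg:linear", eta = 0.05, max_depth = 4)

preds <- numeric(nrow(test))
for (k in 1:4) {
  tr <- train[train$cluster == k, ]
  te <- test[test$cluster == k, ]

  # Restrict to the numeric (binary) X columns to sidestep factor-level
  # mismatches between train and test; the original feature handling
  # is not shown in this write-up.
  bin_cols <- names(tr)[grepl("^X", names(tr)) & sapply(tr, is.numeric)]

  dtrain <- xgb.DMatrix(as.matrix(tr[, bin_cols]), label = tr$y)
  fit <- xgb.train(params, dtrain, nrounds = 200)
  preds[test$cluster == k] <- predict(fit, as.matrix(te[, bin_cols]))
}

# Rows whose X0 level never appears in training (cluster 5) are
# predicted as the mean of y, as noted above.
preds[test$cluster == 5] <- mean(train$y)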

PS: XGBoost code can be found in my other published documents.