We, a team of NYC Data Science Academy students, participated in the Kaggle AXA Telematics competition. We called ourselves Vivi’s Angels.
The challenge was to identify unique driver signatures using de-identified driver data.
The dataset made available as part of the competition included latitude and longitude displacement per minute for each of the drivers.
We split the responsibilities across two major areas of the project:
Feature generation using the available data, and
Classification using machine learning techniques based on the generated feature set.
I chose to be part of the machine learning team and was responsible for SVM-based classification, while my teammates focused on tree-based models (RandomForest, GBM) and nearest-neighbor (kNN) methods.
This was a classic multi-class SVM problem, and the following factors had to be considered.
There were around 3,600 drivers (each to be treated as a separate class) and about 40 attributes.
Given the large number of classes across relatively few dimensions, I felt a linear separating hyperplane might not be a good option and opted for a radial basis kernel.
With 3,600 distinct classes, each with 200 separate observations, a one-versus-one (OVO) approach would be computationally expensive, so I chose to go with a one-versus-all (OVA) approach.
Class balance was still a challenge, as the true class had a prevalence rate of about 1:3600 in the test dataset.
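To make the computational argument concrete, the number of binary classifiers each strategy needs can be counted directly. This is only a back-of-the-envelope sketch; the variable names are illustrative and not part of the competition code.

```r
# Number of binary SVMs needed for a problem with k classes.
k <- 3600                  # approximate number of drivers (classes)

n.ovo <- choose(k, 2)      # one-versus-one: one model per pair of classes
n.ova <- k                 # one-versus-all: one model per class

n.ovo                      # pairwise models
n.ova                      # per-class models
n.ovo / n.ova              # how many times more models OVO needs
```

Each OVO model sees only two classes' worth of data, so the individual fits are smaller, but the sheer count of models is what made the approach unattractive here.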
At this point I was convinced that tree-based models would probably work better for this situation, but I wanted to go through the phases as a learning experience and compare the performance with the other models that my fellow team members were working on.
In this blog, I am sharing the approach taken, the process, and the results with a subset of the data, shrinking this to a 3-class classification problem.
Step 1 : Read input driver files (one per driver) and create a consolidated input file.
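# Enable parallel processing, as the process is computationally intensive.
library(doMC)
library(caret)  # nearZeroVar, createDataPartition, trainControl, train, confusionMatrix
library(mice)   # mice, complete
registerDoMC(cores = 4)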
# Read the input driver attribute files (each file is expected to hold an object named `data`).
setwd("/Users/sundar/data/input/NYCDSA/attr")
allfiles <- dir(pattern = "\\.RData$")  # `pattern` is a regular expression, not a glob
attributes.raw <- data.frame()
for (i in seq_along(allfiles)) {
  data <- NULL
  load(allfiles[i])
  if (length(attributes.raw) == 0) {
    attributes.raw <- data
  } else {
    attributes.raw <- rbind(attributes.raw, data)
  }
}
# Convert the "driver" field to the class factor.
attributes.raw$driver <- as.factor(attributes.raw$driver)
# The following steps refer to this data frame as driver.attributes.
driver.attributes <- attributes.raw
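One thing to watch with load() is that it silently overwrites any object of the same name in the workspace. A possible alternative, sketched below, loads each file into its own environment and stacks the results in one call. The helper name read.driver.files and the toy files are hypothetical, and the sketch assumes each file stores a single data frame named `data`.

```r
# Sketch: read every .RData file in a directory and stack the contents.
# Assumes each file stores one data frame under the name `data`.
read.driver.files <- function(path) {
  files <- list.files(path, pattern = "\\.RData$", full.names = TRUE)
  frames <- lapply(files, function(f) {
    env <- new.env()            # private environment, so nothing is overwritten
    load(f, envir = env)
    env$data
  })
  do.call(rbind, frames)        # bind once instead of growing inside a loop
}

# Usage with two small made-up files in a temporary directory.
tmp <- file.path(tempdir(), "attr-demo")
dir.create(tmp, showWarnings = FALSE)
for (d in 1:2) {
  data <- data.frame(driver = d, speed = c(10, 20) * d)
  save(data, file = file.path(tmp, paste0("driver", d, ".RData")))
}
combined <- read.driver.files(tmp)
combined
```

Binding once with do.call(rbind, ...) also avoids repeatedly copying the growing data frame, which matters when there are thousands of driver files.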
Step 2 : Check for near zero variance variables to drop from feature set.
# check for near zero variance variables.
nzv <- nearZeroVar(driver.attributes[,-26])
head(nzv)
## integer(0)
# No near zero variance attributes.
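For readers unfamiliar with the check, a near-zero-variance predictor is one where a single value dominates and there are very few distinct values. The base-R sketch below illustrates the two statistics involved. The helper is.near.zero.var is hypothetical and is not caret's implementation; its default cut-offs are set to what I understand caret's documented defaults to be (freqCut = 95/5, uniqueCut = 10).

```r
# Base-R sketch of the two statistics behind a near-zero-variance check.
# is.near.zero.var is a hypothetical helper, not part of caret.
is.near.zero.var <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)            # a single value: zero variance
  freq.ratio <- counts[[1]] / counts[[2]]         # most common vs. second most common
  pct.unique <- 100 * length(counts) / length(x)  # distinct values as % of rows
  freq.ratio > freqCut && pct.unique <= uniqueCut
}

# Toy data: one freely varying column and one almost-constant column.
set.seed(42)
toy <- data.frame(
  speed = rnorm(200),
  flag  = c(rep(0, 198), 1, 1)
)
nzv.flags <- sapply(toy, is.near.zero.var)
nzv.flags
```

A predictor is only flagged when both conditions hold, which is why a continuous feature with many distinct values is kept even if a few values repeat.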
Step 3 : Check for and impute missing values.
# Find whether there are NA values in the input data and impute them.
impute.Attribute <- function(impute.input) {
  nullcount <- sum(is.na(impute.input))
  if (nullcount > 0) {
    # mice builds m = 5 imputed datasets; complete() returns the first one.
    mice::complete(mice::mice(impute.input, m = 5, printFlag = FALSE))
  } else {
    impute.input
  }
}
imputed <- impute.Attribute(driver.attributes)
Step 4 : create testing and training data partitions.
# Function to partition the data into testing (20%) and training (80%) datasets.
prep.for.model <- function(prep.input) {
  prep.input$driver <- as.factor(prep.input$driver)
  inTest <- createDataPartition(y = prep.input$driver, p = 0.2, list = FALSE)
  list(testing = prep.input[inTest, ], training = prep.input[-inTest, ])
}
set.seed(1)
partitions <- prep.for.model(imputed)
testing <- partitions$testing
training <- partitions$training
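The point of partitioning on the class label is that each driver contributes the same share of rows to the test set. The base-R sketch below shows that idea on toy labels. The helper stratified.split is hypothetical and only approximates what createDataPartition does for a factor outcome.

```r
# Base-R sketch of a stratified hold-out split (hypothetical helper).
stratified.split <- function(y, p = 0.2) {
  idx.by.class <- split(seq_along(y), y)       # row indices grouped by class
  unlist(lapply(idx.by.class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
}

# Toy labels: three drivers with 200 trips each.
set.seed(1)
y <- factor(rep(c("1", "2", "3"), each = 200))
in.test <- stratified.split(y, p = 0.2)

table(y[in.test])    # rows per driver in the test partition
table(y[-in.test])   # rows per driver in the training partition
```

Sampling within each class keeps the class proportions identical in both partitions, which a plain random split would only achieve on average.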
Step 5 : Set up the control file. 10-fold cross-validation with 5 repeats was selected.
# Set up the control file.
cvCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                       verboseIter = FALSE,
                       classProbs = TRUE)
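It helps to work out how much fitting this resampling scheme implies, since it explains why parallel processing was enabled. The sketch below uses the 473 training rows reported in the model summary and the tuneLength of 10 passed to train; the variable names are illustrative only.

```r
# How many SVM fits does repeated cross-validation imply?
n.train   <- 473   # training rows reported in the model summary
n.folds   <- 10    # folds per repeat
n.repeats <- 5     # repeats
n.cost    <- 10    # candidate cost values (tuneLength)

resamples  <- n.folds * n.repeats        # held-out folds per candidate value
total.fits <- resamples * n.cost + 1     # plus one final fit on all training rows
fold.size  <- n.train / n.folds          # rows held out per fold, on average

resamples
total.fits
fold.size
```

With roughly 47 rows held out per fold, each resample trains on about 425 or 426 rows, which matches the sample sizes shown in the tuning output.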
Step 6 : Train the SVM model using the training partition and the training control file.
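The tuneLength parameter was set to 10, so that 10 values of the cost parameter C were evaluated (0.25, 0.5, 1, 2, 4, 8, ..., 128).
The preProc parameter was set to center and scale the predictors, as SVM is sensitive to the units (distance).
The method parameter was set to svmRadial, as it would be difficult to find a linear hyperplane in a multi-class problem.
Accuracy was selected as the tuning metric.
set.seed(2)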
svmTune <- train(driver ~ ., data = training,
                 method = "svmRadial",
                 tuneLength = 10,
                 preProc = c("center", "scale"),
                 metric = "Accuracy",
                 trControl = cvCtrl)
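To see why centering and scaling matter for a radial basis kernel, the sketch below evaluates the kernel on two toy observations before and after scaling. It assumes the parameterisation k(x, y) = exp(-sigma * ||x - y||^2), which is my understanding of what kernlab's rbfdot uses; the helper rbf.kernel, the column names, and the numbers are made up for illustration, and sigma is the value reported in the tuning output.

```r
# RBF kernel value for two observations, assuming the parameterisation
# k(x, y) = exp(-sigma * ||x - y||^2).
rbf.kernel <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))

# Two toy predictors on very different scales (made-up numbers).
toy <- cbind(speed_kmh = c(30, 60, 90),
             accel_g   = c(0.10, 0.30, 0.20))

# Unscaled: the squared distance is dominated by speed_kmh.
k.raw <- rbf.kernel(toy[1, ], toy[2, ], sigma = 0.119121)

# Centered and scaled: both predictors contribute comparably.
scaled   <- scale(toy)
k.scaled <- rbf.kernel(scaled[1, ], scaled[2, ], sigma = 0.119121)

k.raw
k.scaled
```

On the raw data the kernel value collapses towards zero because of the large speed difference alone, so the second predictor has essentially no influence on the similarity.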
Step 7 : Explore the trained model.
svmTune # display the tuning details
## Support Vector Machines with Radial Basis Function Kernel
##
## 473 samples
## 25 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold, repeated 5 times)
##
## Summary of sample sizes: 425, 425, 426, 427, 426, 425, ...
##
## Resampling results across tuning parameters:
##
## C Accuracy Kappa Accuracy SD Kappa SD
## 0.25 0.6621712 0.4931950 0.06814183 0.10213127
## 0.50 0.6896041 0.5343549 0.06493835 0.09735857
## 1.00 0.6982338 0.5473803 0.06715926 0.10075352
## 2.00 0.7088683 0.5632353 0.06746317 0.10136260
## 4.00 0.7033159 0.5549402 0.06967262 0.10461227
## 8.00 0.6919677 0.5378928 0.06610852 0.09925308
## 16.00 0.6925844 0.5388562 0.06699142 0.10055270
## 32.00 0.6865845 0.5298612 0.06704464 0.10061134
## 64.00 0.6756987 0.5135083 0.06704830 0.10066610
## 128.00 0.6694291 0.5041290 0.06826780 0.10246086
##
## Tuning parameter 'sigma' was held constant at a value of 0.119121
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.119121 and C = 2.
Accuracy Curve
ggplot(svmTune) + scale_x_log10() + theme_bw() # plot accuracy against the cost parameter C (log scale)
svmTune$finalModel # display the final model
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 2
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.119121010528511
##
## Number of Support Vectors : 356
##
## Objective Function Value : -197.6157 -295.2407 -155.9021
## Training error : 0.137421
## Probability model included.
Step 8 : Predict classes for the testing dataset and evaluate performance.
svmPred <- predict(svmTune, newdata=testing)
confusionMatrix(svmPred,testing$driver)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 28 6 11
## 2 5 32 3
## 3 7 2 26
##
## Overall Statistics
##
## Accuracy : 0.7167
## 95% CI : (0.6272, 0.7951)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.575
## Mcnemar's Test P-Value : 0.7579
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.7000 0.8000 0.6500
## Specificity 0.7875 0.9000 0.8875
## Pos Pred Value 0.6222 0.8000 0.7429
## Neg Pred Value 0.8400 0.9000 0.8353
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.2333 0.2667 0.2167
## Detection Prevalence 0.3750 0.3333 0.2917
## Balanced Accuracy 0.7437 0.8500 0.7688
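The headline figures in this summary can be re-derived by hand from the confusion matrix, which is a useful sanity check on how accuracy, kappa, and sensitivity are defined. The sketch below uses only base R; the variable names are illustrative.

```r
# Confusion matrix from the output above (rows = prediction, columns = reference).
cm <- matrix(c(28,  6, 11,
                5, 32,  3,
                7,  2, 26),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("1", "2", "3"),
                             Reference  = c("1", "2", "3")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                        # share of correct predictions
expected <- sum(rowSums(cm) * colSums(cm)) / n^2     # agreement expected by chance
cohen.kappa <- (accuracy - expected) / (1 - expected)
sensitivity <- diag(cm) / colSums(cm)                # per-class recall

round(c(accuracy = accuracy, kappa = cohen.kappa), 4)
round(sensitivity, 4)
```

Because the three classes are balanced in the test partition, chance agreement is one third, which is also the no-information rate reported above.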
As observed in the confusion matrix summary above, the SVM classification accuracy was around 71% and kappa around 0.57 in this multi-class situation. Given that the original dataset had a few misclassified samples, this was a decent performance, but it was still relatively poor compared to the predictions achieved using boosted tree-based models. The prediction generated using gbm was selected as our final submission to the competition after comparing the relative performance of all machine learning algorithms for this challenge.
To our delight, our submission was ranked in the top 10% (rank 134 of 1,528 teams) in our first ever attempt at a Kaggle competition. Coming out of this competition, we have learnt the machine learning skills required and the relative strengths of different algorithms.
We are more than ready for the next Kaggle competition.