Introduction

This post is meant to showcase some of the work of the students of the first cohort of the NYC Data Science Academy Bootcamp. We competed in the AXA Driver Telematics Analysis Kaggle competition and this is a write-up done by a few of our team members.

The Telematics Competition Set-up:

The goal of the competition was to develop an algorithm to create driver specific signatures based upon nothing but X-Y coordinates in a number of csv’s. We were given 2736 directories, corresponding to 2736 different drivers. In each directory were 200 csv’s, one for each trip assigned to that driver. Each row of a csv corresponds to one second of a driver’s trip. To test our driver signature algorithm, the organizers substituted an unknown number of trips in each driver’s directory with false trips that were driven by another driver. We don’t know which trips are driven by the driver of interest, all we know is that the driver of interest has the majority of trips in that folder. Our goal is to submit a file identifying the correct probability of whether each trip belongs to the driver associated with their directory. This took the form of a csv with 200*2736 rows and the probability that that trip was driven by the driver of interest.

Project Management

One of our goals for this project was to get experience managing a data science project. We wanted to apply an agile project development model to our problem which would eventually iterate through the development cycle multiple times until the project deadline.

 

alt text Traditional Waterfall Development Method vs. Iterative Development method

 

This is in contrast to the waterfall development method which we believe would not have appropriately utilized our team size and talent, and would not have lent itself to the competitive framework we were contrained by. The waterfall method would also have limited us to our initial vision of our final model without giving us ample room to revise our requirements based upon interim results. Whereas, with the iterative model, we were able to create multiple working models and take advantage of many opportunities to test our models upon the public leaderboard (71 submissions to be exact). It allowed us to redefine basic assumptions late in the game which bumped our score tremendously in the final week of the competition.

Other Considerations

Git

We were instructed on the Git framework in our first week of the bootcamp. This allowed us to immediately implement a distributed version control system and allowed us to get used to working with our code as a team.

R

Alongside with Git, the first two weeks of the bootcamp was an intense introduction to R and we used R exclusively for the competition. It was the ideal choice for the data manipulation, visualization, and statistical learning methods we needed to implement.

Student Goals

We had to keep in mind that this was indeed a bootcamp student project. We were not a professional Kaggle team, and as such, we couldn’t put demands on team members like we would have in a purely professional context. Our main goal was to make sure that this was an educational experience for every member of the team. After that requirement was met, we had to take a laissez-faire approach when it came to dominating the available time of the team members. Being in the middle of an intensive bootcamp experience meant that free time was very hard to come by. Luckily, our team was blessed with some extrememly motivated and capable members. They went above and beyond the call of duty and won us all a Kaggle finish we are be proud of.

Inception

Our Kaggle team consisted of 14 bootcamp students. In order to make things manageable we decided to split the team into two sub-groups. This allowed us to make our weekly team meetings more manageable and allowed multiple people to step into similar vital roles within each group to make sure that everyone was involved near the center of the project. Eventually we merged the teams back together to maximize our performance by utilizing the best techniques and code from both groups.

In Practice

Our workflow revolved around weekly meetings at which members of each sub-group would first go over the research or code they had been working on for the previous week. The second portion of the meetings involved an open table style brainstorming session to decide on ways to improve our model. The meetings concluded with task assignments for the next week, which was done mainly on a volunteer basis in order to be sensitive to our schedules and needs.

Meetings in the first two weeks consisted of brainstorming wild ideas as well as creating a framework for the simplest model we could come up with. In order to reap the benefits of an iterative project development model, we needed to create a working algorithm as soon as possible. Within a few weeks we were making submissions using the same basic framework that we ended up employing in our final submission.

The benefit of this model was that we could send people down possible rabbit holes without it holding up the whole group. As long as we were trying new models and evolving our skill set, we were making progress. We could afford to chase down ideas like Fourier Transformation Matrices and Dynamic Time Warping with minimal risk to the overall success of the project. In the end, between the multiple trip matching algorithms and our work trying to integrate Support Vector Machines into an unsupervised problem, we were sure we made the right choice by using the agile development model.

Loading the Data

In order to keep everyone’s code as consistent as possible, we created a script for the team to use to read in the 547,000 csv’s as easily loadable binaries saved on our individual machines. It was based upon Lauri Koobas’ rebuild-data.R code in the Kaggle Forum. We figured that if we all started using the same foundation it would simplify the process of working with each others’ code later on down the line.

# Set WD to directory containing the 'drivers' folder.

require(data.table)

# Use fread from the data.table package to read in x and y coords
# Apply trip ID to new third column in data frame
fread.and.modify <- function(file.number, driver) {
  tmp <- fread(paste0("drivers/",driver,"/",file.number,".csv"), header=T, sep=",")
  tmp[, tripID:=file.number]
  return(tmp)
}
system.time({
  
# Pull down list of driver directories and create a home for binaries
driverlist <- list.files("./drivers/")
dir.create("./data/", showWarnings = TRUE, recursive = FALSE, mode = "0777")

# Loop through the driver list and use rbindlist to bind data from
# x, and y columns to the specific driver data frame
for (i in 1:length(driverlist)) {
  onedriver <- driverlist[i]
  drives <- rbindlist(lapply(1:200, fread.and.modify, onedriver))
  save(drives, file = paste('./data/DriverData',onedriver, sep=''))
}
})

With the files read in place, we could start visualizing and manipulating the data in a uniform way.

Trip Visualization

To get started visualizing the trips, we wrote a small script to read in all the trip data for a single driver in long format.

read_trips = function(x){
  setwd(x)
  dir_list = list.files(x)
  dir_list = dir_list[grep('[[:alpha:]]{0,3}.csv', dir_list)]
  num_files = length(dir_list)
  files = lapply(dir_list, read.csv)
  idx = unlist(lapply(files, nrow))
  trip = rep(1:num_files, idx)
  files = do.call(rbind, files)
  time = unlist(sapply(idx, function(x) seq(from=1, to=x, by=1)))
  files = cbind(files, trip, time)
  }

driverOne = read_trips("C:/Users/TimBo/Downloads/R docs and scripts/Collab/drivers/drivers/1")

Using ggplot2 we were able to visualize all 200 trips by Driver 1 and included a label corresponding to the trip number. Interestingly, trip numbers 48, 73, and 145 are identical trips that have been rotated. Additionally, trip 145 appears to have been reflected and is facing in the opposite the direction of trips 48 and 73. This suggested a strategy of trip matching to reduce dimensionality and also clustering trips into those taken by the driver of interest and those taken by the “false driver”.

library(ggplot2)
library(plyr)
library(dplyr)

driverOne$dist = sqrt(driverOne$x^2+driverOne$y^2)

ggplot(data = driverOne, aes(x=x,y=y, group=trip, color=factor(trip), label=trip))+geom_path()+theme_bw()+
      geom_text(data= driverOne %>% group_by(trip) %>% filter(dist == max(dist)), color='black', size=2.5, 
                position = position_jitter())+xlab('')+ylab('')+theme(legend.position='none')+ 
      ggtitle('Trips by Driver 1')

Trips by Driver 1

Driver 1 Trip Animation Example

We also wanted to monitor trip progress as a function of time so we created an animation to visualize driver progress as shown above. The following script loops over all trips and makes a plot at each time slice. The animation package binds the time slices together to create the final animation.

library(animation)
oopt <- ani.options(interval = 0.05)
trip_animation <- function() {
  lapply(1:max(driverOne$time), function(i) {
    print(ggplot(driverOne[driverOne$time <= i,], aes(x=x,y=y,group=trip))+
            ylim(range(driverOne$y))+xlim(range(driverOne$x))+
            theme(axis.text.y=element_blank(), axis.title.x=element_blank(), 
                axis.title.y=element_blank(), axis.text.x=element_blank(), 
                panel.border=element_blank(), panel.background=element_blank(), 
                axis.ticks = element_blank(), legend.position='none')+
            geom_path(aes(color=factor(trip))))
    animation::ani.pause()
  })
}

saveHTML(trip_animation(), autoplay = FALSE, loop = FALSE, verbose = TRUE, outdir = "images", 
         single.opts = "'controls': ['first', 'previous', 'play', 'next', 'last', 'loop', 'speed'], 'delayMin': 0")

After noticing the rotations and reflections of the trips in the data we wrote a script to rotate the most distant point from the origin in each trip to the x-axis using a rotation matrix. We also ensured the majority of the data was in the Quadrant 1 by reflecting each trip about the x-axis if necessary.

rot_all = function(df){
  n = matrix(ncol=2)
  for(i in 1:length(unique(df$trip))){
    dists = df[df$trip == i,c('x','y','dist')]
    mp = dists[max.col(t(dists$dist),'last'),1:2]
    rot.mat = matrix(c(mp$x,mp$y,-mp$y,mp$x),2,2)/sqrt(mp$x^2+mp$y^2) 
    rot.points = as.matrix(df[df$trip==i,1:2])%*%rot.mat
    if (sum(sign(rot.points[,2]))<0) rot.points[,2] = - rot.points[,2]
    n = rbind(n, rot.points)
  } 
  n = n[-which(is.na(n)),]
  n = as.data.frame(n)
  n = cbind(n, df$trip, df$time, df$dist)
  colnames(n) = c('x','y','trip','time', 'dist')
  return(n)
}

rot.DriverOne = rot_all(driverOne)

Revisualizing the rotated trip data confirmed our transformations were successful. Trips 48, 73, and 145 are now overlayed as desired as shown in and many more trip matches which may have previously gone unnoticed now align almost perfectly. However, as seen in “Selected Trips by Driver 1” some identical trips remain unmatched for reasons which are discussed below.

rot.DriverOne.sub = filter(rot.DriverOne, trip == 48 | trip == 73 | trip == 145 | trip == 25 | trip == 55 | trip == 183)

ggplot(rot.DriverOne.sub, aes(x=x,y=y, group=trip, color=factor(trip), label=trip))+geom_path()+theme_bw()+
  geom_text(data= rot.DriverOne.sub %>% group_by(trip) %>% filter(dist == max(dist)), color='black', size=2.5, 
            position = position_jitter())+xlab('')+ylab('')+theme(legend.position='none')+
  ggtitle('Selected Trips by Driver 1')

ggplot(rot.DriverOne, aes(x=x,y=y, group=trip, color=factor(trip), label=trip))+geom_path()+theme_bw()+
  geom_text(data= rot.DriverOne %>% group_by(trip) %>% filter(y == max(y) | y==min(y)), color='black', size=2.5, 
            position = position_jitter())+xlab('')+ylab('')+theme(legend.position='none')+
  ggtitle('Trips by Driver 1: Rotated and Reflected')
Selected Trips, Driver 1
Selected trips for Driver 1.
Rotated and Reflected Trips, Driver 1
The rotated and reflected trips.

After careful inspection of the data we discovered our rotation/reflection script can produce false negatives (unmatched identical trips) in 2 ways:

  1. It reflects only about the x-axis and a limited number of trips were seen to have been reflected about the y-axis and remain unmatched to identical trips.
  2. The Kaggle rules detailed a potential trimming of the trip data from the beginning or end of each trip. If one of two identical trips had a large negative y-component trimmed from the data a reflection about the x-axis might no longer occur leading to one reflected trip and one unreflected trip.

Both types of failure can be observed in the plot of “Selected Trips by Driver 1”. Trips 25, 55, and 183 are identical trips. Trip 55 failed to be reflected about the x-axis, and trip 183 is related to trip 55 by a reflection about the y-axis. However after reviewing the data we concluded that both of these types of false negatives were infrequent and could be neglected.

With data visualization complete, generating attributes and developing classification models could begin.

Our approach was to create a good number of variables even if we had to remove them afterwards, in the final model. Because we are always wary of losing data or having to go back and re-compute everything, we saved the intermediate steps data in different folders. This makes the computation time a bit longer, however, it avoids having to re-compute every step when we want to make one change in the most recent step (like changing the number of quantiles for example).

Creating Trip Variables

After loading the csv files and creating a unique data frame with all the trips for each driver, we calculated a number of variables (for each trip) after turning the trip so that each trip begins to go up and right (y and x positive).

library(dplyr)

The main loop of this script call the function expand_data.f (which takes other function as variable) for each drivers.

todo <- list.files('data')
registerDoParallel(cores=6)
foreach(i=1:length(todo)) %dopar% {
  expand_data.f(todo[i], turn_trip.f, calculate_trip_value.f)  
}

The expand_data.f function loads the driver data; splits the data so that each trip is an element of a list and applies the turn_trip.f function and the calculate_trip_value.f function.

expand_data.f <- function(x, turn.trip_f, calculate_trip_value.f){
  #this function load the data called the previous function and save the data 
  load(paste0('data/',x))
  tmp <- lapply(split(drives, drives$tripID), turn_trip.f)  
  drives <- lapply(tmp, calculate_trip_value.f)   
  save(drives, file = paste('data_transf/', x, sep = ''))
}

The turn_trip.f function turns the trip so that at the beginning it goes up and right (both y and x positive.)

turn_trip.f <- function(trip){
  # turn the trip by deciding it begin to go up and right (y>0 and x>0)
  if(trip$x[which.max(abs(trip$x) > 0.1)] < 0){trip$x = -trip$x}
  if(trip$y[which.max(abs(trip$y) > 0.1)] < 0){trip$y = -trip$y}
  return(trip)
}

The calculate_trip_value.f calculates the distance between two consecutive points, the velocity both in the y-axis and in the x-axis, the speed, the acceleration, the jerk, and the angle change between two points. It also adds a boolean flag indicating when there is a ‘jump’ in the data (i.e. when there are missing points, which creates ridiculous values for the variable).

Since we turned the trips so that they began all in the same direction, we then flagged the positive and negative turns with booleans. To account for both sharp turns and slow turns we created three columns for both the positive and negative turns to show flagged all turns having greater than 45-degrees total between two consecutive points, or one point and two points after or between one point and three points after.

calculate_trip_value.f <- function(trip){
  #this function calculate vel, speed accel, put a flag if data are bad and put a flag when car is moving
  trip <- trip %>% mutate(dist = sqrt((x-lag(x))^2 + (y - lag(y))^2), velx = x -lag(x), vely = y - lag(y), speed = sqrt(velx^2 + vely^2), accel = speed - lag(speed), jerk = accel - lag(accel), angle = atan2((y - lag(y)), (x - lag(x)))*180/pi)
  
  ##if missing points aka 'jump'
  trip$jump <- ifelse(abs(trip$accel) > 100, TRUE, FALSE)
  
  #boolean for when the car turn
  trip$turnpos1 <- ifelse((trip$angle - lag(trip$angle)) > 45, TRUE, FALSE)
  trip$turneg1 <- ifelse((trip$angle - lag(trip$angle)) < - 45, TRUE, FALSE)
  trip$turnpos2 <- ifelse((trip$angle - lag(trip$angle, 2)) > 45, TRUE, FALSE)
  trip$turneg2 <- ifelse((trip$angle - lag(trip$angle, 2)) < - 45, TRUE, FALSE)
  trip$turnpos3 <- ifelse((trip$angle - lag(trip$angle, 3)) > 45, TRUE, FALSE)
  trip$turneg3 <- ifelse((trip$angle - lag(trip$angle, 3)) < - 45, TRUE, FALSE)
  return(trip)
}

For example this is what the data looks like after adding these variable:

##      x    y tripID     dist velx vely    speed       accel      jerk
## 4 53.7 32.6      1 20.59733 17.6 10.7 20.59733  0.03304049  1.129083
## 5 70.1 42.8      1 19.31321 16.4 10.2 19.31321 -1.28412201 -1.317162
##     angle  jump turnpos1 turneg1 turnpos2 turneg2 turnpos3 turneg3
## 4 31.2977 FALSE    FALSE   FALSE    FALSE   FALSE       NA      NA
## 5 31.8796 FALSE    FALSE   FALSE    FALSE   FALSE    FALSE   FALSE

Summarizing the Trips

For each trip, we then summarized all the variables calculated in the previous step and created new values to describe each trip.

library(plyr)
library(dplyr)
library(reshape2)
library(tidyr)
library(doParallel)
dir.create("./data_summary/", showWarnings = TRUE, recursive = FALSE)

As previously, the main loop calls the main function find_feature.f, which loads the data, calls the function trip_feature. f for each trip and combines the results in a data frame for each driver.

todo <- list.files('data_transf')
registerDoParallel(cores=6)
foreach(i=1:length(dir())) %dopar% {
  find_feature.f(todo[i], trip_feature.f)  
}
find_feature.f <- function(x, trip_feature.f){
  load(paste0('data_transf/',x))
  res <- lapply(drives, trip_feature.f)
  drives <- do.call(rbind, res)
  save(drives, file = paste('data_summary/', x, sep = ''))
}

The trip_feature.f calculates the mean, standard deviation and quartile of the velocities, speed, acceleration, jerk and angle variables, and counts the number of turns in each direction. This function also calculates trip variables: the length of the trip, the total distance driven, the direct distance between first and last point, and the path efficiency (direct distance / total distance), the angle from first to last point, the angle change, the absolute total angle turned and the turn efficiency (angle change / total angle turned).

trip_feature.f <- function(trip){
  #this function summarise the variable value for each trip and create new variable for each trip
  len <- dim(trip)[1]
  dist <- sum(trip$dist, na.rm = TRUE)
  trip <- as.data.frame(trip)
  tmp <- trip %>% filter(!jump) %>% select(velx, vely, speed, accel, jerk, angle) %>% filter(!is.na(jerk))
  res <- t(sapply(tmp, function(x) each(mean, sd, quantile)(x)))
  res <- data.frame(param = rownames(res) , res)
  rownames(res) <- NULL
  meltres <-  melt(res, id = 'param')
  res <- data.frame(var = paste(meltres$param, meltres$variable, sep = '-'), value = meltres$value)
  res$var <- as.character(res$var)
  turn <- as.data.frame(colSums(trip %>% select(turnpos1:turneg3) %>% filter(!is.na(turneg3))))
  turn <- data.frame(var = rownames(turn), value = turn[,1])
  res <- rbind(res, turn)
  angletodest <- with(trip, atan2(y[len] - y[1], x[len]-x[1]))
  angle_change <- sum(tmp$angle)
  tot_angle <- sum(abs(tmp$angle))
  turn_efficiency <- angle_change/tot_angle
  direct_dist <- with(trip, sqrt((x[len] - x[1])^2 + (y[len] - y[1])^2))
  path_efficiency <- direct_dist/dist
  res <- rbind(res, c('len', len))
  res <- rbind(res, c('dist', dist))
  res <- rbind(res, c('angletodest', angletodest))
  res <- rbind(res, c('angle_change', angle_change))
  res <- rbind(res, c('turn_efficiency', turn_efficiency))
  res <- rbind(res, c('tot_angle', tot_angle))
  res <- rbind(res, c('direct_dist', direct_dist))
  res <- rbind(res, c('path_efficiency', path_efficiency))
  
  res <- data.frame(tripID = unique(trip$tripID), res)
  res <- spread(res, var, value)
  return(res)
}

At that point we have 56 56 variables to describe each trip.

##   tripID          accel-mean accel-quantile.0. accel-quantile.100.
## 1      1 -0.0239119644602508 -7.65506139861884    8.63982921403924
## 2      2 -0.0189847836360734 -2.99836284157543    2.57514947750462
##   accel-quantile.25. accel-quantile.50. accel-quantile.75.
## 1 -0.665030398446061                  0   0.49136969977116
## 2   -0.3454761060725                  0  0.338371707214766
##            accel-sd      angle_change        angle-mean angle-quantile.0.
## 1  1.04707189528923 -7130.09506595443 -8.29080821622608 -179.761268966901
## 2 0.727299259483334 -5826.69393710213 -10.4421038299321 -177.119622447152
##   angle-quantile.100. angle-quantile.25. angle-quantile.50.
## 1                 180  -110.564358555495                  0
## 2                 180  -13.0786718118663   2.35081989462507
##   angle-quantile.75.         angle-sd       angletodest      direct_dist
## 1   66.4552130434609 103.722626547054 -2.05412230833675 1152.29004161279
## 2   19.5366549381283 71.5475296044288 0.102449477245298 4959.40393595843
##               dist           jerk-mean  jerk-quantile.0.
## 1 14255.4357513228 0.00127446857163485 -9.71101104302095
## 2 7708.45549805557 0.00226972687180188 -3.83239034410703
##   jerk-quantile.100.  jerk-quantile.25.  jerk-quantile.50.
## 1   16.2948906126581 -0.866418248915247 0.0247681249661102
## 2   3.53784361265113 -0.346351980318013 0.0249957357232851
##   jerk-quantile.75.           jerk-sd len    path_efficiency
## 1 0.911045147892871  1.56303639433121 863 0.0808316253332251
## 2  0.37265250072084 0.825327694421798 561  0.643371935818975
##         speed-mean speed-quantile.0. speed-quantile.100.
## 1 16.5269896854414                 0    26.6582069914689
## 2 13.7737963532715                 0    33.3736422944813
##   speed-quantile.25. speed-quantile.50. speed-quantile.75.
## 1   12.8328488869921   19.4476219354035   21.9151762984753
## 2    5.3806344673959   12.3808318784358   24.4537615527427
##           speed-sd        tot_angle     turn_efficiency turneg1 turneg2
## 1 7.30317409633395 75045.8571467167 -0.0950098424755799      22      32
## 2 10.2274250224362 26163.5839884706  -0.222702437849102      13      27
##   turneg3 turnpos1 turnpos2 turnpos3          velx-mean velx-quantile.0.
## 1      30       20       30       31 -0.664651162790698            -26.5
## 2      34       14       16       25   8.80179211469534            -16.4
##   velx-quantile.100.  velx-quantile.25. velx-quantile.50.
## 1               25.5            -12.925                 0
## 2   33.1999999999998 -0.100000000000364               6.5
##   velx-quantile.75.          velx-sd         vely-mean  vely-quantile.0.
## 1             10.25 13.6357532416522 -1.21186046511628 -24.4000000000001
## 2  22.7999999999999  13.588651551997  0.89910394265233             -11.4
##   vely-quantile.100. vely-quantile.25.  vely-quantile.50.
## 1               20.5             -11.1                  0
## 2               14.1 -2.87499999999997 0.0999999999999659
##   vely-quantile.75.          vely-sd
## 1  7.45000000000005 11.7876782524607
## 2                 4 5.62017549768013

Model: lasso

To check if the variables we chose could be useful I decided to test a lasso regression because the number of variables was quite high. To do so, we first took five random drivers to be used as control data.

drivers = list.files("data_summary")
randomDrivers = sample(drivers, size = 5)
library(dplyr)
library(reshape2)
library(tidyr)
library(glmnet)
library(doParallel)

registerDoParallel(cores=3)
refData<-foreach(driver=randomDrivers, .combine=rbind) %dopar% {
  load(paste0("data_summary/",driver))
  spreadD <- drives %>% spread(var, value)
}
target <- 1
refData$target <-target  

Then we mixed the current driver data with the control data and tried to predict whether the trip of the current driver belonged to him or not. This model showed that the variables were indeed interesting to predict whether the trips were driven by the main driver, since a simple lasso model with them scored 0.761.

target <- 0
registerDoParallel(cores=6)
submission<-foreach(driver=drivers,.combine=rbind) %dopar% {
  
  ## need to grab the numeric driverID from the file name
  driverID <- gsub(driver,pattern="DriverData",replacement="")
  load(paste0("data_summary/",driver))
  print(driver)
  currentData <- drives
  currentData$target <- target
  
  train <- as.data.frame(rbind(refData,currentData))
  train <- apply(train,2, as.numeric)
  train[which(is.na(train))] = 0
  currentData <- apply(currentData ,2, as.numeric)
  x <- train[,2:93]
  y <- as.factor(train[,94])
  grid =10^ seq (10 , -2 , length =100)
  lasso.mod <- glmnet(x, y, family = 'binomial',alpha =1 , lambda = grid )
  set.seed (1)
  cv.out = cv.glmnet (train[,2:93], train[,94], alpha =1)
  bestlam = cv.out$lambda.min
  p = predict(lasso.mod , s = bestlam , newx = currentData[,2:93], type = 'response')
  p = ifelse(p > 0.5, 1, 0)
  labels <- sapply(1:200, function(x) paste0(driverID,'_', x))
  data.frame(labels, p)
}

colnames(submission) = c("driver_trip","prob")
write.csv(submission, "submission.csv", row.names=F, quote=F)

Feature Generation: Trip Matching

Early in the competition, we came across posts like this one that spoke of the success of trip matching in improving model results. Some domain experts even posted in the forums, saying that in the real world, GPS data was the best indicator of whether a driver was the owner of the car or not. The problem was that AXA rotated and flipped some trips so that they couldn’t be matched in this way! The first order of business was to re-align the trips, then compare them to one another to test for matches.

Once the team decided that trip matching should be incorporated into the features of our model, there were three challenges:

Coding for Trip Matching

This code was written in R to do two things:

  • Rotate/flip trips so that most of the trip was in the first quadrant.
  • Compare trips of roughly similar shape using euclidean distance.

With 200 trips to compare for each of 2700 drivers, parallelization was a must. library(doParallel) enabled the use of foreach loops that greatly sped up computation, even on only three cores. The flag, .combine=rbind takes the output from each loop and rbinds it to the dataframe in a do.call manner. Note that the trips for each driver have already been combined and saved as binary files before this point.

library(dplyr)
library(doParallel)
cl <- makeCluster(1)
registerDoParallel(cores=3)

dataDir <- "data/"
drivers <- list.files(dataDir)

similarTrips<-foreach(driver=drivers,.combine=rbind) %dopar%{
  # Each driver's 200 trips were loaded as a binary file before being compared with one another
  load(paste0(dataDir,driver))
  driverID<-gsub(driver,pattern="DriverData",replacement="")

Rotating/reflecting trips

As mentioned earlier, trips were rotated and reflected to aid in visualization and reduce dimensionality. Before the comparison for loops, each trip was rotated so that the last point is on the positive x-axis by changing the coordinate system. The trips were then reflected if more than half of the x or y coordinates were negative. This was useful later when trips were compared roughly before calculating their euclidean distances. When the euclidean distances were calculated, the rotated y values were used.

  test<-drives %>%
    group_by(tripID) %>%
    mutate(x = x, y = y,
           r=sqrt(x^2+y^2),
           alpha=atan2(last(y),last(x))-atan2(y,x),
           rot.x=r*cos(alpha),
           rot.y=r*sin(alpha),
           rows=n(),
           rot.x.flip=ifelse(sum(rot.x<0)>floor(rows/2), -rot.x, rot.x),
           rot.y.flip=ifelse(sum(rot.y<0)>floor(rows/2), -rot.y, rot.y)
    )%>%
    select(tripID,rot.x=rot.x.flip,rot.y=rot.y.flip)

To give a visual idea of this rotation, as in the visualization section above, we show the raw and rotated trips, respectively.

Raw Trips, Driver 1
The 200 raw trips for Driver 1, colored by trip ID.
Rotated Trips, Driver 1
The trips have now been rotated and flipped.

Even at this point, the naked eye can see some similarity among the rotated, spaghetti-esque mess. The next steps will automate the trip comparison to ultimately reveal similarities among trips.

Comparing individual trips

Only unique pairs of trips were considered, greatly reducing computational time.

  mini2<-NULL
  # begin a nested loop to check all UNIQUE combinations of trips
  for(i in 1:199){
    focus<-select(test[test$tripID==i,], foc.x=rot.x, foc.y=rot.y)
    mini1<-NULL
    for(k in (i+1):200) {
      compare<-select(test[test$tripID==k,], cmp.x=rot.x, cmp.y=rot.y)

Once two trips were selected for potential comparison, their “footprint” was compared. Trips that were more than 20% different in x- or y- range were not compared, since that large a difference would probably indicate a poor match. This selection by if statements reduced computational time by over 30% over checking every pair, generally resulting in approximately 1000 comparisons.

      if(!(diff(range(focus$foc.x))<0.8*diff(range(compare$cmp.x))) &
           !(diff(range(compare$cmp.x))<0.8*diff(range(focus$foc.x))) &
           !(diff(range(focus$foc.y))<0.8*diff(range(compare$cmp.y))) &
           !(diff(range(compare$cmp.y))<0.8*diff(range(focus$foc.y)))
      ){

In order to compare trips by their euclidean distance, their vectors must be the same length. The forum post that inspired this post imputed extra values by repeating values from the shorter of the two trips. Initially, that was my approach; however, it was faster (by about 10%) and more conservative (no imputed values) to truncate the longer of the two vectors.

        trimLength<- min(nrow(compare),nrow(focus))
        focusTrim<-focus[1:trimLength,]
        compareTrim<-compare[1:trimLength,]

Finally, the euclidean distance was calculated between the focusTrim and compareTrim y-values. I had to apply a normalization in order for the euclidean distance to be an acurate metric for similarity of trips of all sizes. Without such a normalization, trips with larger y- values would have greater euclidean distance, even if they were just as good a match. I chose to normalize by the mean of the largest y-values from each of the two trips.

        eqDistMat<-rbind(focusTrim$foc.y,compareTrim$cmp.y)
        mini1<-rbind(mini1,
                     data.frame(driver=driverID, tripA=i,tripB=k,
                   eucDist=unlist(as.numeric(dist(eqDistMat)))/mean(c(max(focusTrim$foc.y),max(compareTrim$cmp.y)))
                  ))
      } # end if loop
    } # end comparison loop
    mini2<-rbind(mini2,mini1)
  } # end focus loop
  mini2
}

Visualizing the result

Arranging the output (similarTrips) by eucDist, we get a “rank” of trip similarity, with the most similar trips towards the top. Keep in mind that eucDist doesn’t hold much value, but serves to indicate which trips are relatively similar.

> head(arrange(similarTrips,eucDist),n=10)
   driver tripA tripB   eucDist
1       1    76   182 0.6139725
2       1    46    86 0.7146952
3       1    33    68 0.8082965
4       1   139   160 0.8460641
5       1    60    68 0.9580528
6       1    31   160 1.0072537
7       1    17   170 1.0351923
8       1    68   154 1.0593948
9       1    33    91 1.0752706
10      1    43   129 1.1583642

After looking at the spaghetti plot (Fig. 1) for so long, plotting the unique trips from the tripA and tripB column produces a gratifyingly organized verification of some success in matching trips:

Matched Faceted Trips, Driver 1
Unique (rotated) trips in the top 10 pairs of matched trips for Driver 1, faceted by trip ID.


Our team now had a trip matching algorithm of our own (written in R instead of Python), and it was relatively fast at about 50 seconds per driver per cpu (about 38 cpu hours for the whole set).

Which eucDist should we choose?

This was not an easy question to answer. Choosing several drivers at random and plotting the above facets along with the ranked list of eucDist values. Our thinking was to be conservative: it was better to miss a true positive than to introduce false positives by choosing a threshold that was too high. Eventually, we chose a threshold such that approximately 30 pairs were nearly all true positives. Later we would see that it wasn’t about finding pure matches.

If all of this manual verification sounds like it was mindnumbing, don’t worry: the monotony of verifying a few of these top matches for several drivers was outweighed by the sheer joy of finding order amidst the chaos.

Trip matches as features

Now that we could get a rough idea of which trips were similar, we needed to somehow code these trips for our gradient boosting machines model. We kicked around a lot of ideas:

Maybe any trip that matches another trip is the primary owner.

This turned out to produce bad results. Intuitively, we should have realized that even non-owners of the vehicles might take the same routes from time to time.

Maybe trips that have a match should “round” our initial prediction to 0 or 1.

Again, this reduced our performance on the public leaderboard. Although it would have been so nice to apply such a manual post-processing boost to our predictions!

Maybe trips should receive a score based on their similarity at different eucDist thresholds.

This was what ultimately produced an improvement in our public leaderboard score. I chose 5 thresholds, ranging from very conservative to very lax and assigned scores to them such that trips that matched at conservative thresholds (and more lax ones as a result) received a higher score than those that only matched at the “liberal thresholds.”

Attempting Support Vector Machines

Our first thought was that this was a classic case of multi-class classification in SVM and the following factors were to be considered.

  1. There were around 3,600 drivers ( each to be categorized as a separate class) and about 40 attributes.
    Given the number of classes across a relatively few dimensions , we felt a linear classification hyperplane might not be a good option and opted for a radial basis kernal.

  2. Given there were 3,600 distinct classes , each with 200 separate observations , it would be computationally expensive for a one-versus-one approach (OVO) , we chose to go with a one-versus-all (OVA) approach.
    This was still a challenge of balancing as the true class had a prevalence rate of about 1:3600 in the test dataset.

Approach

Here is the approach we took in order to test this.


Step 1 : Read input driver files (one per driver and create a consolidated input file.


# enable parallel processing as the process is computationally intensive.

library(doMC)
registerDoMC(cores = 4)
# read input driver attribute files. 

allfiles <- dir(pattern="*.RData")

data <- attributes.raw <- NULL
attributes.raw <- data.frame()

for(i in 1:length(allfiles)){
  data <- NULL 
  load(allfiles[i])
  if (length(attributes.raw) == 0 ){
    attributes.raw <- data 
  }
  else {
    attributes.raw <- rbind(attributes.raw,data)
  }            
}

# convert "driver" field as the class factor.
attributes.raw$driver <- as.factor(attributes.raw$driver)


Step 2 : Check for near zero variance variables to drop from feature set.

# check for near zero variance variables.
nzv <- nearZeroVar(driver.attributes[,-26])
head(nzv)  
## integer(0)
# No near zero variance attributes.


Step 3 : Check for and impute missing values.

#Find if there are NA Values in the input data and impute: 

impute.Attribute <- function(impute.input){ 
nullcount  <-   sum(I(is.na(impute.input)))
 if (nullcount > 0){
     imputed <<- complete(mice(impute.input,m=5,,print=FALSE))
     }
     else {
    imputed <<- impute.input          
     }
 }

impute.Attribute(driver.attributes)


Step 4 : create testing and training data partitions.

# Function to partition test and training datasets

prep.for.model <- function(prep.input) { 
inTest <- createDataPartition(y = prep.input$driver, p = 0.2, list = FALSE)
imputed$driver <<- as.factor(prep.input$driver)
testing <<- prep.input[inTest,]
training <<- prep.input[-inTest,]
}

set.seed(1)
prep.for.model(imputed)


Step 5 : Set up control File . 10-fold cross validation with 5 repeats was selected.

#setup control file 

cvCtrl= trainControl(method="repeatedcv",repeats=5,
                     verbose=F,
                     classProbs=T)


Step 6 : Train the SVM model using the training partition and the training control file.

  • 10 values of cost parameters were selected for using tuneLength parameter (0.25,1,2,4,8 …. 126).
  • Training dataset was pre-processed using “center” and “scale” preProc parameter as SVM is sensitive to the units (distance).
  • svmRadial was chosen as the tuning method it would be difficult to find a linear hyper plane in a multi-class problem.
  • Accuracy was selected as the tuning metric.
set.seed(2)
svmTune <- train(driver ~., data=training,
                 method="svmRadial",
                 tuneLength=10,
                 preProc=c("center","scale"),
                 metric="Accuracy",
                 trControl=cvCtrl)


Step 7 : Explore the trained model.

  • Display Tuning Metrics
svmTune  # display the tuning details
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 473 samples
##  25 predictor
##   3 classes: '1', '2', '3' 
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## 
## Summary of sample sizes: 425, 425, 426, 427, 426, 425, ... 
## 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa      Accuracy SD  Kappa SD  
##     0.25  0.6621712  0.4931950  0.06814183   0.10213127
##     0.50  0.6896041  0.5343549  0.06493835   0.09735857
##     1.00  0.6982338  0.5473803  0.06715926   0.10075352
##     2.00  0.7088683  0.5632353  0.06746317   0.10136260
##     4.00  0.7033159  0.5549402  0.06967262   0.10461227
##     8.00  0.6919677  0.5378928  0.06610852   0.09925308
##    16.00  0.6925844  0.5388562  0.06699142   0.10055270
##    32.00  0.6865845  0.5298612  0.06704464   0.10061134
##    64.00  0.6756987  0.5135083  0.06704830   0.10066610
##   128.00  0.6694291  0.5041290  0.06826780   0.10246086
## 
## Tuning parameter 'sigma' was held constant at a value of 0.119121
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.119121 and C = 2.
  • Plot Accuracy Curve
ggplot(svmTune) + scale_x_log10() + theme_bw() # plot the ROC curve 

  • Explore the final model selected after tuning.
svmTune$finalModel # plot the ROC curve 
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.119121010528511 
## 
## Number of Support Vectors : 356 
## 
## Objective Function Value : -197.6157 -295.2407 -155.9021 
## Training error : 0.137421 
## Probability model included.


Step 8 : Predict classes for the testing dataset and evaluate performance.

svmPred <- predict(svmTune, newdata=testing)
confusionMatrix(svmPred,testing$driver)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 28  6 11
##          2  5 32  3
##          3  7  2 26
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7167          
##                  95% CI : (0.6272, 0.7951)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.575           
##  Mcnemar's Test P-Value : 0.7579          
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.7000   0.8000   0.6500
## Specificity            0.7875   0.9000   0.8875
## Pos Pred Value         0.6222   0.8000   0.7429
## Neg Pred Value         0.8400   0.9000   0.8353
## Prevalence             0.3333   0.3333   0.3333
## Detection Rate         0.2333   0.2667   0.2167
## Detection Prevalence   0.3750   0.3333   0.2917
## Balanced Accuracy      0.7437   0.8500   0.7688


SVM Conclusions

As observed in the confusion matrix summary above SVM classification prediciton was around 71% and kappa around 57% in this multi-class situation. Given the fact the original dataset had a few misclassified samples , this was a decent performance, but still was relatively poor compared to predictions achieved using boosted tree based models. In the next section, we detail how we approached this problem with tree methods instead

Machine Learning with Trees

Having tried SVM to a modest degree of success, we decided to focus on Gradient Boosting Machines (GBM)

Feature Map

From a general and top-down view, we explored the full potential feature list. A full feature map could guide us to explore efficiently.

Feature Map

Event-based Features

The first kind of feature set was event-driven features, which are generally independent of the underlying trips. For example, the average speed 3 second after a full stop typically relates to how aggressive a driver is. Similarly the deceleration rate before a full stop may also correspond to a similar personality of driving. Although, these features are still trip-dependent, let’s say a highway trip may not have enough full stops. They will result different clusters within a driver’s trips, and other drivers’ trips might still be outliers to those clusters. We need to keep in mind this phenomena.

First, we wrote a generic function to filter out the time steps that match an event:

#' Find event for a given path
#' Find the event by looking from index to index+s values
#' @param v vector to check
#' @param s length to take a look
#' @param fun function to check event
#' @param extend a float number to indicate how long to mark
#' from index to index + extend * s
find.event <- function(v, s, fun, extend=0, ...){
    n <- length(v)-s
    flag <- rep(FALSE,n)
    for (i in 1:n){
        if (fun(v[i:(i+s)], ...)) flag[i:floor(i+extend*s)] <-TRUE # event happened at i+s!
    }
    which(flag)
}

The above find.event() will search through the whole v array and test each segment v[i:(i+s)] with the given function fun to see whether it is an event.

We are mainly focusing on four types of events:

  1. car starts: identify when a car is fully stopped at the beginning and starts to move after a specific period of time.
#' check starting event 
#' the velocity (first item removed) and acceleration has same lenght here.
#' So the accelerration is one row after
#' @param x sliced from find.event
#' @return TRUE/FALSE
is.start <- function(x){
    x[1]==0 & x[length(x)]>0.1
}
  1. car stops: identify when a car is fully stopped at the end and was moving in a specific period of time.
#' check braking event 
#' the velocity (first item removed) and acceleration has same lenght here.
#' So the accelerration is at one row after
#' @param x sliced from find.event
#' @return TRUE/FALSE
is.brake <- function(x){
    x[length(x)]==0 & x[1]>0.1
}
  1. car moves in high-speed: identify the time period that the car is moving above a certain speed.
#' select the speed range above the speed limit
#' @param speed data
#' @param speedlimit in [m/s]
#' @return TRUE/FALSE
is.highspeed <- function(x,speedlimit=29){
    any(x>speedlimit)
}
  1. car truns: identify the time when a car is turning. To identify rotation, we need to calculate the acceleration perpendicular to the previous speed direction 1. Here we take the time steps that the rotational acceleartion is larger than \(1~m/s^2\).
#' check whether this is a turning
is.turn <- function(x,turninglimit=1){
    any(x>turninglimit)
}

Feature Extraction

Now we could feed in our speed and acceleration data deduced from the raw Kaggle data to get useful feature data. The setup in the previous section enables us to get a series of features from each type of event identifier. For instance, if we choose s=5 in find.event with is.start, we will get the time steps for the event, where the car is moving 5 seconds after a full stop.

Now for each trip, we would have a number of such events. Then we could calculate the mean, max, min, median, percentiles, and standard deviation. Then we would have dozens of features to use for machine learning.

ML Considerations

Basic Idea

The challenging part for this competition is the unsupervised nature. Typically people could use unsupervised machine learning methods to detect outliers. But in that sense, we lose some information hidden in the raw data provided by AXA because the 200 trips from one driver are mixed with randomly selected trips. So if we assume the 200 trips are all from the driver, we could train a classification model by using those 200 trips against another set of trips from other drivers. And then we predict the probability of those trips using the fitted model. This time, those low probability trips from these 200 trips might not be the trips made by the original driver.

Along with this idea, we started exploring a k-clustering approach, which is an intuitive method to explore the concentration of features in high dimensional space. Later we worked on a machine-learning algorithm using GBM. This is an efficient method to implement adaboost and gradient boosting for classification problems.

###############################################################################
# Main functions
###############################################################################
#' Find the target driver
#' @param driver driver id
#' @param res driver data feature, driver_trip column must be the first column
#' @param Ncnt count of paths from other drivers
#' @param prior prior
f_driver <- function(driver, res, Ncnt=1000, prior=NULL){
  require(gbm)
  # create a sample for target driver
  y <- as.integer(grepl(paste('^',driver,'_\\d+',sep=''),rec$driver_trip))
  if (is.null(prior)){
    idx <- c(which(y==1),sample(which(y==0),Ncnt))
  } else{
    idx <- c(which(y==1),sample(which(y==0),Ncnt,prob=prior[which(y==0)])) 
  }
  id <- rec$driver_trip[which(y==1)]
  y <- y[idx]
  x <- rec[idx,-1]
  # gbm fitting
  fit1 <- gbm.fit(x=x,y=y,
                  var.names=names(x),
                  n.trees=ntree_step,
                  interaction.depth=depth,
                  shrinkage=shrk,
                  distribution=family,
                  verbose=F)
  ntree_best <- gbm.perf(fit1,method='OOB',plot.it=F)
  
  if(ntree_best >= ntree_step){
    for(ntree in seq(ntree_step*2,ntree_max,ntree_step)){
      set.seed(114)
      fit1 <- gbm.more(fit1,ntree_step)
      ntree_best <- gbm.perf(fit1,method='OOB',plot.it=F)
      if(ntree_best < (ntree-10)) break
    }
  }
  score <- predict(fit1,as.data.frame(x[1:200,]),n.trees=ntree_best,type='response')
  data.frame(driver_trip=id,prob=score)
}

Improvement

By using the trip related statistics (quantiles of velocities, acceleration, etc.) calculated previously, we boosted our AUC score to 0.7 for the first time. Then adding the event-driven features, the score keeps raising to 0.8. And then we reached our first plateau.

To overcome the plateau, we adopted two approaches. One is by combining different results from gbm. To build a training set, we could choose an arbitrary number of trips from other drivers into the model fitting. And different numbers create different models. By combining more and more fitting models, we gradually improved our result. However the drawback on this is that it becomes harder to interpret the result.

The second approach is to purify the trips from other drivers. For instance, if the trip 1 from driver 2 is actually driven by driver 1, then using that trip to fit a model for driver 1 would be misleading. So when we select other trips, we could use our prior knowledge to choose only those trips are sure to be others. Using this approach, we finally push our AUC score into 0.9 territory.

Finally, we included the trip-matching features. This time, we improved our result further ahead, but slightly. In comparing to claims from other posts about the trip matching. We didn’t improve the result dramatically when our score was already above 0.9. On the other hand, it shows that our feature based approach has already caught the most relavent information from the Kaggle data.

Conclusion

One of the most helpful resources for our team was the Kaggle forum. The Kaggle community was a wealth of valuable information and techniques. Competitions can promote secrecy and cutthroat attitudes, but the attitude on the Kaggle forum was one of cooperation and assistance. Starting with our first post of all ones, the forum served as a source of inspiration that helped guide us towards our own unique final solution. We highly recommend that new competitors use the forum as a starting point or a resource when encountering dead ends.

We had a sense of where we ended up, but we wanted to see the changing landscape and our path to the final standings. This was easy enough to plot in R after downloading the AXA public leaderboard history csv, which Kaggle conveniently provides at the bottom of each competition’s public leaderboard page. This file contained all timestamped and identified updates made to the leaderboard since the beginning of the competition.

Animated Climb
The Kaggle Climb: Click to refresh the animated GIF.

If you click on the image above, you can observe the changing layout of scores and ranks throughout 90 days of competition and follow the crosshair of Vivi’s angels as we climb up the hill. This is where the iterative project development model really shines. It would have been impossible to chart our course at the beginning of the project without knowing in advance which techniques to abandon. In practice, that requires trial and error, and the agile project development framework makes trial and error less painful. Our time spent trying to incorporate K-means clustering and SVM into our model was not wasted. It just helped us to narrow our choices to things that worked well, like using GBM with a large feature set, or building trip visualization and trip matching tools to help choose trips for our training data. In the end, we poured hours of hard work into the project, but it served as a valuable learning experience for all of the members of the team.


  1. Speed is the first derivative of distance and the acceleration is the second derivative. Just be aware that the time steps (indices) are shiftted and we need to trim the data accordingly.