My role in the Kaggle Axa Challenge team was to create variable. My approach was to create a good number of variables even if I had to remove them afterwards, in the final model. Because I am always wary of losing data or having to go back and re-compute everything, I saved the intermediate steps data in different folders. This makes the computation time a bit longer, however, it avoids having to re-compute every step when I want to make one change in the most recent step (like changing the number of quantiles for example).

car draw

Creating Trip Variables

After loading the csv files and creating a unique data frame with all the trips for each driver with Julian’s code, I calculated a number of variables (for each trip) after turning the trip so that each trip begins to go up and right (y and x positive).

library(dplyr)

The main loop of this script call the function expand_data.f (which takes other function as variable) for each drivers.

todo <- list.files('data')
registerDoParallel(cores=6)
foreach(i=1:length(todo)) %dopar% {
  expand_data.f(todo[i], turn_trip.f, calculate_trip_value.f)  
}

The expand_data.f function loads the driver data; splits the data so that each trip is an element of a list and applies the turn_trip.f function and the calculate_trip_value.f function.

expand_data.f <- function(x, turn.trip_f, calculate_trip_value.f){
  #this function load the data called the previous function and save the data 
  load(paste0('data/',x))
  tmp <- lapply(split(drives, drives$tripID), turn_trip.f)  
  drives <- lapply(tmp, calculate_trip_value.f)   
  save(drives, file = paste('data_transf/', x, sep = ''))
}

The turn_trip.f function turns the trip so that at the beginning it goes up and right (both y and x positive.)

turn_trip.f <- function(trip){
  # turn the trip by deciding it begin to go up and right (y>0 and x>0)
  if(trip$x[which.max(abs(trip$x) > 0.1)] < 0){trip$x = -trip$x}
  if(trip$y[which.max(abs(trip$y) > 0.1)] < 0){trip$y = -trip$y}
  return(trip)
}

The calculate_trip_value.f calculates the distance between two consecutive points, the velocity both in the y-axis and in the x-axis, the speed, the acceleration, the jerk, and the angle change between two points. It also adds a boolean flag indicating when there is a ‘jump’ in the data (i.e. when there are missing points, which creates ridiculous values for the variable).

Since I turned the trips so that they began all in the same direction, I then flagged the positive and negative turns with booleans. To account for both sharp turns and slow turns I created three columns for both the positive and negative turns to show flagged all turns having greater than 45-degrees total between two consecutive points, or one point and two points after or between one point and three points after.

calculate_trip_value.f <- function(trip){
  #this function calculate vel, speed accel, put a flag if data are bad and put a flag when car is moving
  trip <- trip %>% mutate(dist = sqrt((x-lag(x))^2 + (y - lag(y))^2), velx = x -lag(x), vely = y - lag(y), speed = sqrt(velx^2 + vely^2), accel = speed - lag(speed), jerk = accel - lag(accel), angle = atan2((y - lag(y)), (x - lag(x)))*180/pi)
  
  ##if missing points aka 'jump'
  trip$jump <- ifelse(abs(trip$accel) > 100, TRUE, FALSE)
  
  #boolean for when the car turn
  trip$turnpos1 <- ifelse((trip$angle - lag(trip$angle)) > 45, TRUE, FALSE)
  trip$turneg1 <- ifelse((trip$angle - lag(trip$angle)) < - 45, TRUE, FALSE)
  trip$turnpos2 <- ifelse((trip$angle - lag(trip$angle, 2)) > 45, TRUE, FALSE)
  trip$turneg2 <- ifelse((trip$angle - lag(trip$angle, 2)) < - 45, TRUE, FALSE)
  trip$turnpos3 <- ifelse((trip$angle - lag(trip$angle, 3)) > 45, TRUE, FALSE)
  trip$turneg3 <- ifelse((trip$angle - lag(trip$angle, 3)) < - 45, TRUE, FALSE)
  return(trip)
}

For example this is what the data looks like after adding these variable:

##      x    y tripID     dist velx vely    speed       accel      jerk
## 4 53.7 32.6      1 20.59733 17.6 10.7 20.59733  0.03304049  1.129083
## 5 70.1 42.8      1 19.31321 16.4 10.2 19.31321 -1.28412201 -1.317162
##     angle  jump turnpos1 turneg1 turnpos2 turneg2 turnpos3 turneg3
## 4 31.2977 FALSE    FALSE   FALSE    FALSE   FALSE       NA      NA
## 5 31.8796 FALSE    FALSE   FALSE    FALSE   FALSE    FALSE   FALSE

car draw

Summarizing the Trips

For each trip, I then summarized all the variables calculated in the previous step and created new values to describe each trip.

library(plyr)
library(dplyr)
library(reshape2)
library(tidyr)
library(doParallel)
dir.create("./data_summary/", showWarnings = TRUE, recursive = FALSE)

As previously, the main loop calls the main function find_feature.f, which loads the data, calls the function trip_feature. f for each trip and combines the results in a data frame for each driver.

todo <- list.files('data_transf')
registerDoParallel(cores=6)
foreach(i=1:length(dir())) %dopar% {
  find_feature.f(todo[i], trip_feature.f)  
}
find_feature.f <- function(x, trip_feature.f){
  load(paste0('data_transf/',x))
  res <- lapply(drives, trip_feature.f)
  drives <- do.call(rbind, res)
  save(drives, file = paste('data_summary/', x, sep = ''))
}

The trip_feature.f calculates the mean, standard deviation and quartile of the velocities, speed, acceleration, jerk and angle variables, and counts the number of turns in each direction. This function also calculates trip variables: the length of the trip, the total distance driven, the direct distance between first and last point, and the path efficiency (direct distance / total distance), the angle from first to last point, the angle change, the absolute total angle turned and the turn efficiency (angle change / total angle turned).

trip_feature.f <- function(trip){
  #this function summarise the variable value for each trip and create new variable for each trip
  len <- dim(trip)[1]
  dist <- sum(trip$dist, na.rm = TRUE)
  trip <- as.data.frame(trip)
  tmp <- trip %>% filter(!jump) %>% select(velx, vely, speed, accel, jerk, angle) %>% filter(!is.na(jerk))
  res <- t(sapply(tmp, function(x) each(mean, sd, quantile)(x)))
  res <- data.frame(param = rownames(res) , res)
  rownames(res) <- NULL
  meltres <-  melt(res, id = 'param')
  res <- data.frame(var = paste(meltres$param, meltres$variable, sep = '-'), value = meltres$value)
  res$var <- as.character(res$var)
  turn <- as.data.frame(colSums(trip %>% select(turnpos1:turneg3) %>% filter(!is.na(turneg3))))
  turn <- data.frame(var = rownames(turn), value = turn[,1])
  res <- rbind(res, turn)
  angletodest <- with(trip, atan2(y[len] - y[1], x[len]-x[1]))
  angle_change <- sum(tmp$angle)
  tot_angle <- sum(abs(tmp$angle))
  turn_efficiency <- angle_change/tot_angle
  direct_dist <- with(trip, sqrt((x[len] - x[1])^2 + (y[len] - y[1])^2))
  path_efficiency <- direct_dist/dist
  res <- rbind(res, c('len', len))
  res <- rbind(res, c('dist', dist))
  res <- rbind(res, c('angletodest', angletodest))
  res <- rbind(res, c('angle_change', angle_change))
  res <- rbind(res, c('turn_efficiency', turn_efficiency))
  res <- rbind(res, c('tot_angle', tot_angle))
  res <- rbind(res, c('direct_dist', direct_dist))
  res <- rbind(res, c('path_efficiency', path_efficiency))
  
  res <- data.frame(tripID = unique(trip$tripID), res)
  res <- spread(res, var, value)
  return(res)
}

At that point we have 56 56 variables to describe each trip.

##   tripID          accel-mean accel-quantile.0. accel-quantile.100.
## 1      1 -0.0239119644602508 -7.65506139861884    8.63982921403924
## 2      2 -0.0189847836360734 -2.99836284157543    2.57514947750462
##   accel-quantile.25. accel-quantile.50. accel-quantile.75.
## 1 -0.665030398446061                  0   0.49136969977116
## 2   -0.3454761060725                  0  0.338371707214766
##            accel-sd      angle_change        angle-mean angle-quantile.0.
## 1  1.04707189528923 -7130.09506595443 -8.29080821622608 -179.761268966901
## 2 0.727299259483334 -5826.69393710213 -10.4421038299321 -177.119622447152
##   angle-quantile.100. angle-quantile.25. angle-quantile.50.
## 1                 180  -110.564358555495                  0
## 2                 180  -13.0786718118663   2.35081989462507
##   angle-quantile.75.         angle-sd       angletodest      direct_dist
## 1   66.4552130434609 103.722626547054 -2.05412230833675 1152.29004161279
## 2   19.5366549381283 71.5475296044288 0.102449477245298 4959.40393595843
##               dist           jerk-mean  jerk-quantile.0.
## 1 14255.4357513228 0.00127446857163485 -9.71101104302095
## 2 7708.45549805557 0.00226972687180188 -3.83239034410703
##   jerk-quantile.100.  jerk-quantile.25.  jerk-quantile.50.
## 1   16.2948906126581 -0.866418248915247 0.0247681249661102
## 2   3.53784361265113 -0.346351980318013 0.0249957357232851
##   jerk-quantile.75.           jerk-sd len    path_efficiency
## 1 0.911045147892871  1.56303639433121 863 0.0808316253332251
## 2  0.37265250072084 0.825327694421798 561  0.643371935818975
##         speed-mean speed-quantile.0. speed-quantile.100.
## 1 16.5269896854414                 0    26.6582069914689
## 2 13.7737963532715                 0    33.3736422944813
##   speed-quantile.25. speed-quantile.50. speed-quantile.75.
## 1   12.8328488869921   19.4476219354035   21.9151762984753
## 2    5.3806344673959   12.3808318784358   24.4537615527427
##           speed-sd        tot_angle     turn_efficiency turneg1 turneg2
## 1 7.30317409633395 75045.8571467167 -0.0950098424755799      22      32
## 2 10.2274250224362 26163.5839884706  -0.222702437849102      13      27
##   turneg3 turnpos1 turnpos2 turnpos3          velx-mean velx-quantile.0.
## 1      30       20       30       31 -0.664651162790698            -26.5
## 2      34       14       16       25   8.80179211469534            -16.4
##   velx-quantile.100.  velx-quantile.25. velx-quantile.50.
## 1               25.5            -12.925                 0
## 2   33.1999999999998 -0.100000000000364               6.5
##   velx-quantile.75.          velx-sd         vely-mean  vely-quantile.0.
## 1             10.25 13.6357532416522 -1.21186046511628 -24.4000000000001
## 2  22.7999999999999  13.588651551997  0.89910394265233             -11.4
##   vely-quantile.100. vely-quantile.25.  vely-quantile.50.
## 1               20.5             -11.1                  0
## 2               14.1 -2.87499999999997 0.0999999999999659
##   vely-quantile.75.          vely-sd
## 1  7.45000000000005 11.7876782524607
## 2                 4 5.62017549768013

car draw

Model: lasso

To check if the variables I chose could be useful I decided to test a lasso regression because the number of variables was quite high. To do so, I first took five random drivers to be used as control data.

drivers = list.files("data_summary")
randomDrivers = sample(drivers, size = 5)
library(dplyr)
library(reshape2)
library(tidyr)
library(glmnet)
library(doParallel)

registerDoParallel(cores=3)
refData<-foreach(driver=randomDrivers, .combine=rbind) %dopar% {
  load(paste0("data_summary/",driver))
  spreadD <- drives %>% spread(var, value)
}
target <- 1
refData$target <-target  

Then I mixed the current driver data with the control data and tried to predict whether the trip of the current driver belonged to him or not. This model showed that the variables were indeed interesting to predict whether the trips were driven by the main driver, since a simple lasso model with them scored 0.761.

target <- 0
registerDoParallel(cores=6)
submission<-foreach(driver=drivers,.combine=rbind) %dopar% {
  
  ## need to grab the numeric driverID from the file name
  driverID <- gsub(driver,pattern="DriverData",replacement="")
  load(paste0("data_summary/",driver))
  print(driver)
  currentData <- drives
  currentData$target <- target
  
  train <- as.data.frame(rbind(refData,currentData))
  train <- apply(train,2, as.numeric)
  train[which(is.na(train))] = 0
  currentData <- apply(currentData ,2, as.numeric)
  x <- train[,2:93]
  y <- as.factor(train[,94])
  grid =10^ seq (10 , -2 , length =100)
  lasso.mod <- glmnet(x, y, family = 'binomial',alpha =1 , lambda = grid )
  set.seed (1)
  cv.out = cv.glmnet (train[,2:93], train[,94], alpha =1)
  bestlam = cv.out$lambda.min
  p = predict(lasso.mod , s = bestlam , newx = currentData[,2:93], type = 'response')
  p = ifelse(p > 0.5, 1, 0)
  labels <- sapply(1:200, function(x) paste0(driverID,'_', x))
  data.frame(labels, p)
}

colnames(submission) = c("driver_trip","prob")
write.csv(submission, "submission.csv", row.names=F, quote=F)