My role in the Kaggle Axa Challenge team was to create variable. My approach was to create a good number of variables even if I had to remove them afterwards, in the final model. Because I am always wary of losing data or having to go back and re-compute everything, I saved the intermediate steps data in different folders. This makes the computation time a bit longer, however, it avoids having to re-compute every step when I want to make one change in the most recent step (like changing the number of quantiles for example).
After loading the csv files and creating a unique data frame with all the trips for each driver with Julian’s code, I calculated a number of variables (for each trip) after turning the trip so that each trip begins to go up and right (y and x positive).
library(dplyr)
The main loop of this script call the function expand_data.f (which takes other function as variable) for each drivers.
todo <- list.files('data')
registerDoParallel(cores=6)
foreach(i=1:length(todo)) %dopar% {
expand_data.f(todo[i], turn_trip.f, calculate_trip_value.f)
}
The expand_data.f function loads the driver data; splits the data so that each trip is an element of a list and applies the turn_trip.f function and the calculate_trip_value.f function.
expand_data.f <- function(x, turn.trip_f, calculate_trip_value.f){
#this function load the data called the previous function and save the data
load(paste0('data/',x))
tmp <- lapply(split(drives, drives$tripID), turn_trip.f)
drives <- lapply(tmp, calculate_trip_value.f)
save(drives, file = paste('data_transf/', x, sep = ''))
}
The turn_trip.f function turns the trip so that at the beginning it goes up and right (both y and x positive.)
turn_trip.f <- function(trip){
# turn the trip by deciding it begin to go up and right (y>0 and x>0)
if(trip$x[which.max(abs(trip$x) > 0.1)] < 0){trip$x = -trip$x}
if(trip$y[which.max(abs(trip$y) > 0.1)] < 0){trip$y = -trip$y}
return(trip)
}
The calculate_trip_value.f calculates the distance between two consecutive points, the velocity both in the y-axis and in the x-axis, the speed, the acceleration, the jerk, and the angle change between two points. It also adds a boolean flag indicating when there is a ‘jump’ in the data (i.e. when there are missing points, which creates ridiculous values for the variable).
Since I turned the trips so that they began all in the same direction, I then flagged the positive and negative turns with booleans. To account for both sharp turns and slow turns I created three columns for both the positive and negative turns to show flagged all turns having greater than 45-degrees total between two consecutive points, or one point and two points after or between one point and three points after.
calculate_trip_value.f <- function(trip){
#this function calculate vel, speed accel, put a flag if data are bad and put a flag when car is moving
trip <- trip %>% mutate(dist = sqrt((x-lag(x))^2 + (y - lag(y))^2), velx = x -lag(x), vely = y - lag(y), speed = sqrt(velx^2 + vely^2), accel = speed - lag(speed), jerk = accel - lag(accel), angle = atan2((y - lag(y)), (x - lag(x)))*180/pi)
##if missing points aka 'jump'
trip$jump <- ifelse(abs(trip$accel) > 100, TRUE, FALSE)
#boolean for when the car turn
trip$turnpos1 <- ifelse((trip$angle - lag(trip$angle)) > 45, TRUE, FALSE)
trip$turneg1 <- ifelse((trip$angle - lag(trip$angle)) < - 45, TRUE, FALSE)
trip$turnpos2 <- ifelse((trip$angle - lag(trip$angle, 2)) > 45, TRUE, FALSE)
trip$turneg2 <- ifelse((trip$angle - lag(trip$angle, 2)) < - 45, TRUE, FALSE)
trip$turnpos3 <- ifelse((trip$angle - lag(trip$angle, 3)) > 45, TRUE, FALSE)
trip$turneg3 <- ifelse((trip$angle - lag(trip$angle, 3)) < - 45, TRUE, FALSE)
return(trip)
}
For example this is what the data looks like after adding these variable:
## x y tripID dist velx vely speed accel jerk
## 4 53.7 32.6 1 20.59733 17.6 10.7 20.59733 0.03304049 1.129083
## 5 70.1 42.8 1 19.31321 16.4 10.2 19.31321 -1.28412201 -1.317162
## angle jump turnpos1 turneg1 turnpos2 turneg2 turnpos3 turneg3
## 4 31.2977 FALSE FALSE FALSE FALSE FALSE NA NA
## 5 31.8796 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
For each trip, I then summarized all the variables calculated in the previous step and created new values to describe each trip.
library(plyr)
library(dplyr)
library(reshape2)
library(tidyr)
library(doParallel)
dir.create("./data_summary/", showWarnings = TRUE, recursive = FALSE)
As previously, the main loop calls the main function find_feature.f, which loads the data, calls the function trip_feature. f for each trip and combines the results in a data frame for each driver.
todo <- list.files('data_transf')
registerDoParallel(cores=6)
foreach(i=1:length(dir())) %dopar% {
find_feature.f(todo[i], trip_feature.f)
}
find_feature.f <- function(x, trip_feature.f){
load(paste0('data_transf/',x))
res <- lapply(drives, trip_feature.f)
drives <- do.call(rbind, res)
save(drives, file = paste('data_summary/', x, sep = ''))
}
The trip_feature.f calculates the mean, standard deviation and quartile of the velocities, speed, acceleration, jerk and angle variables, and counts the number of turns in each direction. This function also calculates trip variables: the length of the trip, the total distance driven, the direct distance between first and last point, and the path efficiency (direct distance / total distance), the angle from first to last point, the angle change, the absolute total angle turned and the turn efficiency (angle change / total angle turned).
trip_feature.f <- function(trip){
#this function summarise the variable value for each trip and create new variable for each trip
len <- dim(trip)[1]
dist <- sum(trip$dist, na.rm = TRUE)
trip <- as.data.frame(trip)
tmp <- trip %>% filter(!jump) %>% select(velx, vely, speed, accel, jerk, angle) %>% filter(!is.na(jerk))
res <- t(sapply(tmp, function(x) each(mean, sd, quantile)(x)))
res <- data.frame(param = rownames(res) , res)
rownames(res) <- NULL
meltres <- melt(res, id = 'param')
res <- data.frame(var = paste(meltres$param, meltres$variable, sep = '-'), value = meltres$value)
res$var <- as.character(res$var)
turn <- as.data.frame(colSums(trip %>% select(turnpos1:turneg3) %>% filter(!is.na(turneg3))))
turn <- data.frame(var = rownames(turn), value = turn[,1])
res <- rbind(res, turn)
angletodest <- with(trip, atan2(y[len] - y[1], x[len]-x[1]))
angle_change <- sum(tmp$angle)
tot_angle <- sum(abs(tmp$angle))
turn_efficiency <- angle_change/tot_angle
direct_dist <- with(trip, sqrt((x[len] - x[1])^2 + (y[len] - y[1])^2))
path_efficiency <- direct_dist/dist
res <- rbind(res, c('len', len))
res <- rbind(res, c('dist', dist))
res <- rbind(res, c('angletodest', angletodest))
res <- rbind(res, c('angle_change', angle_change))
res <- rbind(res, c('turn_efficiency', turn_efficiency))
res <- rbind(res, c('tot_angle', tot_angle))
res <- rbind(res, c('direct_dist', direct_dist))
res <- rbind(res, c('path_efficiency', path_efficiency))
res <- data.frame(tripID = unique(trip$tripID), res)
res <- spread(res, var, value)
return(res)
}
At that point we have 56 56 variables to describe each trip.
## tripID accel-mean accel-quantile.0. accel-quantile.100.
## 1 1 -0.0239119644602508 -7.65506139861884 8.63982921403924
## 2 2 -0.0189847836360734 -2.99836284157543 2.57514947750462
## accel-quantile.25. accel-quantile.50. accel-quantile.75.
## 1 -0.665030398446061 0 0.49136969977116
## 2 -0.3454761060725 0 0.338371707214766
## accel-sd angle_change angle-mean angle-quantile.0.
## 1 1.04707189528923 -7130.09506595443 -8.29080821622608 -179.761268966901
## 2 0.727299259483334 -5826.69393710213 -10.4421038299321 -177.119622447152
## angle-quantile.100. angle-quantile.25. angle-quantile.50.
## 1 180 -110.564358555495 0
## 2 180 -13.0786718118663 2.35081989462507
## angle-quantile.75. angle-sd angletodest direct_dist
## 1 66.4552130434609 103.722626547054 -2.05412230833675 1152.29004161279
## 2 19.5366549381283 71.5475296044288 0.102449477245298 4959.40393595843
## dist jerk-mean jerk-quantile.0.
## 1 14255.4357513228 0.00127446857163485 -9.71101104302095
## 2 7708.45549805557 0.00226972687180188 -3.83239034410703
## jerk-quantile.100. jerk-quantile.25. jerk-quantile.50.
## 1 16.2948906126581 -0.866418248915247 0.0247681249661102
## 2 3.53784361265113 -0.346351980318013 0.0249957357232851
## jerk-quantile.75. jerk-sd len path_efficiency
## 1 0.911045147892871 1.56303639433121 863 0.0808316253332251
## 2 0.37265250072084 0.825327694421798 561 0.643371935818975
## speed-mean speed-quantile.0. speed-quantile.100.
## 1 16.5269896854414 0 26.6582069914689
## 2 13.7737963532715 0 33.3736422944813
## speed-quantile.25. speed-quantile.50. speed-quantile.75.
## 1 12.8328488869921 19.4476219354035 21.9151762984753
## 2 5.3806344673959 12.3808318784358 24.4537615527427
## speed-sd tot_angle turn_efficiency turneg1 turneg2
## 1 7.30317409633395 75045.8571467167 -0.0950098424755799 22 32
## 2 10.2274250224362 26163.5839884706 -0.222702437849102 13 27
## turneg3 turnpos1 turnpos2 turnpos3 velx-mean velx-quantile.0.
## 1 30 20 30 31 -0.664651162790698 -26.5
## 2 34 14 16 25 8.80179211469534 -16.4
## velx-quantile.100. velx-quantile.25. velx-quantile.50.
## 1 25.5 -12.925 0
## 2 33.1999999999998 -0.100000000000364 6.5
## velx-quantile.75. velx-sd vely-mean vely-quantile.0.
## 1 10.25 13.6357532416522 -1.21186046511628 -24.4000000000001
## 2 22.7999999999999 13.588651551997 0.89910394265233 -11.4
## vely-quantile.100. vely-quantile.25. vely-quantile.50.
## 1 20.5 -11.1 0
## 2 14.1 -2.87499999999997 0.0999999999999659
## vely-quantile.75. vely-sd
## 1 7.45000000000005 11.7876782524607
## 2 4 5.62017549768013
To check if the variables I chose could be useful I decided to test a lasso regression because the number of variables was quite high. To do so, I first took five random drivers to be used as control data.
drivers = list.files("data_summary")
randomDrivers = sample(drivers, size = 5)
library(dplyr)
library(reshape2)
library(tidyr)
library(glmnet)
library(doParallel)
registerDoParallel(cores=3)
refData<-foreach(driver=randomDrivers, .combine=rbind) %dopar% {
load(paste0("data_summary/",driver))
spreadD <- drives %>% spread(var, value)
}
target <- 1
refData$target <-target
Then I mixed the current driver data with the control data and tried to predict whether the trip of the current driver belonged to him or not. This model showed that the variables were indeed interesting to predict whether the trips were driven by the main driver, since a simple lasso model with them scored 0.761.
target <- 0
registerDoParallel(cores=6)
submission<-foreach(driver=drivers,.combine=rbind) %dopar% {
## need to grab the numeric driverID from the file name
driverID <- gsub(driver,pattern="DriverData",replacement="")
load(paste0("data_summary/",driver))
print(driver)
currentData <- drives
currentData$target <- target
train <- as.data.frame(rbind(refData,currentData))
train <- apply(train,2, as.numeric)
train[which(is.na(train))] = 0
currentData <- apply(currentData ,2, as.numeric)
x <- train[,2:93]
y <- as.factor(train[,94])
grid =10^ seq (10 , -2 , length =100)
lasso.mod <- glmnet(x, y, family = 'binomial',alpha =1 , lambda = grid )
set.seed (1)
cv.out = cv.glmnet (train[,2:93], train[,94], alpha =1)
bestlam = cv.out$lambda.min
p = predict(lasso.mod , s = bestlam , newx = currentData[,2:93], type = 'response')
p = ifelse(p > 0.5, 1, 0)
labels <- sapply(1:200, function(x) paste0(driverID,'_', x))
data.frame(labels, p)
}
colnames(submission) = c("driver_trip","prob")
write.csv(submission, "submission.csv", row.names=F, quote=F)