Introduction

This post showcases some of the work of the students in the first cohort of the NYC Data Science Academy Bootcamp. We competed in the AXA Driver Telematics Analysis Kaggle competition, and this write-up was put together by a few of our team members.

The Telematics Competition Set-up

The goal of the competition was to develop an algorithm that creates driver-specific signatures based on nothing but the X-Y coordinates contained in a collection of csv's. We were given 2736 directories, corresponding to 2736 different drivers. Each directory held 200 csv's, one for each trip assigned to that driver, and each row of a csv corresponds to one second of the trip. To test our driver-signature algorithms, the organizers substituted an unknown number of trips in each driver's directory with false trips driven by another driver. We did not know which trips were driven by the driver of interest; all we knew was that the driver of interest accounted for the majority of trips in the folder. Our goal was to submit a file giving, for each trip, the probability that it belongs to the driver associated with its directory. The submission took the form of a csv with 200 * 2736 = 547,200 rows, each pairing a trip with the probability that it was driven by the driver of interest.
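For reference, a naive baseline submission can be assembled in a few lines of R. The sketch below assumes the driver_trip / prob column layout from the competition's sample submission and simply assigns every trip a placeholder probability of 1:

# Minimal sketch of the 547,200-row submission file
# (driver_trip / prob layout assumed; the constant probability is only a placeholder)
drivers <- list.files("./drivers/")
submission <- data.frame(
  driver_trip = unlist(lapply(drivers, function(d) paste(d, 1:200, sep = "_"))),
  prob = 1
)
write.csv(submission, "submission.csv", row.names = FALSE)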

Project Management

One of our goals for this project was to gain experience managing a data science project. We wanted to apply an agile development model to the problem, iterating through the development cycle multiple times before the project deadline.

 

Traditional Waterfall Development Method vs. Iterative Development Method

 

This is in contrast to the waterfall development method, which we believe would not have made good use of our team's size and talent, nor lent itself to the competitive framework we were constrained by. The waterfall method would also have locked us into our initial vision of the final model without giving us room to revise our requirements based on interim results. With the iterative model, by contrast, we were able to create multiple working models and take advantage of many opportunities to test them against the public leaderboard (71 submissions, to be exact). It also allowed us to redefine basic assumptions late in the game, which bumped our score tremendously in the final week of the competition.

Other Considerations

Git

We were instructed in the Git framework during our first week of the bootcamp. This allowed us to immediately put our code under distributed version control and to get used to working on it as a team.

R

Alongside Git, the first two weeks of the bootcamp were an intense introduction to R, and we used R exclusively for the competition. It was the ideal choice for the data manipulation, visualization, and statistical learning methods we needed to implement.

Student Goals

We had to keep in mind that this was, first and foremost, a bootcamp student project. We were not a professional Kaggle team, and as such we couldn't put demands on team members like we would have in a purely professional context. Our main goal was to make sure this was an educational experience for every member of the team. Once that requirement was met, we took a laissez-faire approach to claiming team members' available time. Being in the middle of an intensive bootcamp meant that free time was very hard to come by. Luckily, our team was blessed with some extremely motivated and capable members. They went above and beyond the call of duty and won us all a Kaggle finish we can be proud of.

Inception

Our Kaggle team consisted of 14 bootcamp students. To keep things manageable, we split the team into two sub-groups. This kept our weekly team meetings focused and let multiple people step into similar vital roles within each group, so that everyone stayed close to the center of the project. Eventually we merged the teams back together to maximize our performance by combining the best techniques and code from both groups.

In Practice

Our workflow revolved around weekly meetings at which members of each sub-group would first go over the research or code they had been working on during the previous week. The second portion of each meeting was an open-table brainstorming session to decide on ways to improve our model. The meetings concluded with task assignments for the next week, made mainly on a volunteer basis in order to be sensitive to everyone's schedules and needs.

Meetings in the first two weeks consisted of brainstorming wild ideas as well as creating a framework for the simplest model we could come up with. In order to reap the benefits of an iterative project development model, we needed to create a working algorithm as soon as possible. Within a few weeks we were making submissions using the same basic framework that we ended up employing in our final submission.

The benefit of this model was that we could send people down possible rabbit holes without it holding up the whole group. As long as we were trying new models and evolving our skill set, we were making progress. We could afford to chase down ideas like Fourier Transformation Matrices and Dynamic Time Warping with minimal risk to the overall success of the project. In the end, between the multiple trip matching algorithms and our work trying to integrate Support Vector Machines into an unsupervised problem, we were sure we made the right choice by using the agile development model.

Loading the Data

In order to keep everyone's code as consistent as possible, we created a script for the team to read the 547,200 csv's into easily loadable binaries saved on our individual machines. It was based on Lauri Koobas' rebuild-data.R script from the Kaggle forum. We figured that if we all started from the same foundation, it would simplify working with each other's code later on.

# Set WD to the directory containing the 'drivers' folder.

require(data.table)

# Use fread from the data.table package to read in the x and y coordinates
# and add the trip ID as a new third column.
fread.and.modify <- function(file.number, driver) {
  tmp <- fread(paste0("drivers/", driver, "/", file.number, ".csv"), header = T, sep = ",")
  tmp[, tripID := file.number]
  return(tmp)
}

system.time({
  # Pull down the list of driver directories and create a home for the binaries
  driverlist <- list.files("./drivers/")
  dir.create("./data/", showWarnings = TRUE, recursive = FALSE, mode = "0777")

  # Loop through the driver list and use rbindlist to stack the 200 trips
  # for each driver into a single data.table, then save it as a binary
  for (i in 1:length(driverlist)) {
    onedriver <- driverlist[i]
    drives <- rbindlist(lapply(1:200, fread.and.modify, onedriver))
    save(drives, file = paste('./data/DriverData', onedriver, sep = ''))
  }
})
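
With the binaries written out, any driver's 200 trips can be pulled back into memory with a single call. For example, loading the file the loop above saved for driver 1:

# Loading the binary restores the 'drives' data.table saved by the script above
load("./data/DriverData1")
head(drives)   # columns: x, y, tripID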

With the files in place, we could start visualizing and manipulating the data in a uniform way.

Trip Visualization

To get started visualizing the trips, we wrote a small script to read in all the trip data for a single driver in long format.

read_trips = function(x){
  # Read every trip csv in the driver directory x and stack the trips in long
  # format, adding a trip index and a within-trip time (seconds) column
  setwd(x)
  dir_list = list.files(x)
  dir_list = dir_list[grep('[[:alpha:]]{0,3}.csv', dir_list)]   # keep only the csv files
  num_files = length(dir_list)
  files = lapply(dir_list, read.csv)
  idx = unlist(lapply(files, nrow))     # number of rows (seconds) in each trip
  trip = rep(1:num_files, idx)          # trip index, repeated once per second
  files = do.call(rbind, files)
  time = unlist(sapply(idx, function(x) seq(from=1, to=x, by=1)))
  files = cbind(files, trip, time)
  return(files)
}

driverOne = read_trips("C:/Users/TimBo/Downloads/R docs and scripts/Collab/drivers/drivers/1")

Using ggplot2 we were able to visualize all 200 trips by Driver 1, with a label corresponding to each trip number. Interestingly, trips 48, 73, and 145 are identical trips that have been rotated. In addition, trip 145 appears to have been reflected and faces in the opposite direction from trips 48 and 73. This suggested a strategy of trip matching to reduce dimensionality, as well as clustering trips into those taken by the driver of interest and those taken by the “false driver”.

library(ggplot2)
library(plyr)
library(dplyr)

# Distance of each point from the trip's starting point (the origin),
# used to place each trip label at its most distant point
driverOne$dist = sqrt(driverOne$x^2 + driverOne$y^2)

ggplot(data = driverOne, aes(x=x, y=y, group=trip, color=factor(trip), label=trip)) + geom_path() + theme_bw() +
      geom_text(data = driverOne %>% group_by(trip) %>% filter(dist == max(dist)), color='black', size=2.5,
                position = position_jitter()) + xlab('') + ylab('') + theme(legend.position='none') +
      ggtitle('Trips by Driver 1')

Trips by Driver 1

Driver 1 Trip Animation Example

We also wanted to monitor trip progress as a function of time, so we created the animation shown above. The following script loops over the time slices, plotting every trip up to that point in time; the animation package then binds the slices together to create the final animation.

library(animation)
oopt <- ani.options(interval = 0.05)

# Build one frame per second of elapsed time, drawing every trip up to that point
trip_animation <- function() {
  lapply(1:max(driverOne$time), function(i) {
    print(ggplot(driverOne[driverOne$time <= i,], aes(x=x, y=y, group=trip))+
            ylim(range(driverOne$y))+xlim(range(driverOne$x))+
            theme(axis.text.y=element_blank(), axis.title.x=element_blank(),
                  axis.title.y=element_blank(), axis.text.x=element_blank(),
                  panel.border=element_blank(), panel.background=element_blank(),
                  axis.ticks = element_blank(), legend.position='none')+
            geom_path(aes(color=factor(trip))))
    animation::ani.pause()
  })
}

# Bind the frames into an HTML animation with playback controls
saveHTML(trip_animation(), autoplay = FALSE, loop = FALSE, verbose = TRUE, outdir = "images",
         single.opts = "'controls': ['first', 'previous', 'play', 'next', 'last', 'loop', 'speed'], 'delayMin': 0")

After noticing the rotations and reflections among the trips, we wrote a script that uses a rotation matrix to rotate each trip so that its most distant point from the origin lands on the positive x-axis. We also reflected each trip about the x-axis when necessary so that the majority of its points sit above the axis, keeping most of the data in the first quadrant.

rot_all = function(df){
  n = matrix(ncol=2)
  for(i in 1:length(unique(df$trip))){
    dists = df[df$trip == i, c('x','y','dist')]
    # Coordinates of the point farthest from the origin for this trip
    mp = dists[max.col(t(dists$dist), 'last'), 1:2]
    # Rotation matrix that maps the farthest point onto the positive x-axis
    rot.mat = matrix(c(mp$x, mp$y, -mp$y, mp$x), 2, 2) / sqrt(mp$x^2 + mp$y^2)
    rot.points = as.matrix(df[df$trip == i, 1:2]) %*% rot.mat
    # Reflect about the x-axis if the majority of the trip falls below it
    if (sum(sign(rot.points[,2])) < 0) rot.points[,2] = -rot.points[,2]
    n = rbind(n, rot.points)
  }
  n = n[-1, ]            # drop the NA placeholder row used to initialize n
  n = as.data.frame(n)
  n = cbind(n, df$trip, df$time, df$dist)
  colnames(n) = c('x','y','trip','time','dist')
  return(n)
}

rot.DriverOne = rot_all(driverOne)
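
As a quick sanity check (a small illustrative snippet, not part of the original pipeline), we can confirm that this construction sends a trip's most distant point onto the positive x-axis:

# Rotate a single example point (x, y) = (3, 4) using the same matrix rot_all() builds:
# columns (x, y) and (-y, x), scaled by 1/r
mp <- c(x = 3, y = 4)
r <- sqrt(sum(mp^2))                                    # distance from the origin (= 5)
rot.mat <- matrix(c(mp['x'], mp['y'], -mp['y'], mp['x']), 2, 2) / r
mp %*% rot.mat                                          # returns (5, 0): the point lands on the positive x-axis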

Revisualizing the rotated trip data confirmed that our transformations were successful. Trips 48, 73, and 145 are now overlaid as desired, and many more trip matches that might previously have gone unnoticed now align almost perfectly. However, as seen in “Selected Trips by Driver 1”, some identical trips remain unmatched, for reasons discussed below.

rot.DriverOne.sub = filter(rot.DriverOne, trip == 48 | trip == 73 | trip == 145 | trip == 25 | trip == 55 | trip == 183)

ggplot(rot.DriverOne.sub, aes(x=x,y=y, group=trip, color=factor(trip), label=trip))+geom_path()+theme_bw()+
  geom_text(data= rot.DriverOne.sub %>% group_by(trip) %>% filter(dist == max(dist)), color='black', size=2.5, 
            position = position_jitter())+xlab('')+ylab('')+theme(legend.position='none')+
  ggtitle('Selected Trips by Driver 1')

ggplot(rot.DriverOne, aes(x=x,y=y, group=trip, color=factor(trip), label=trip))+geom_path()+theme_bw()+
  geom_text(data= rot.DriverOne %>% group_by(trip) %>% filter(y == max(y) | y==min(y)), color='black', size=2.5, 
            position = position_jitter())+xlab('')+ylab('')+theme(legend.position='none')+
  ggtitle('Trips by Driver 1: Rotated and Reflected')