Introduction

Brief Description

 

Among our drivers, some use a Fake GPS application to mock their location. This fake GPS usage is unfair to the GOJEK drivers who work honestly. Hence, we would like to apply a machine learning model to classify whether a trip was done using fake GPS or not, based on the drivers' PING behavior.

You are provided with an anonymized dataset of drivers' PING signals containing numeric feature variables, the binary target column, and a string order_id column. The task is to predict the value of the label column in the test set.

Data fields

The service_type variable comes from two services provided by the GOJEK company: GO_RIDE (transporting a passenger) and GO_FOOD (delivering food).

Libraries

library(tidyverse)         # data manipulation, plotting, ...
library(ggmap)             # map plot
library(geosphere)         # compute geographical distances
library(dataPreparation)   # data preparation for ML
library(h2o)               # ML frame

Import and prepare data

gps_train <- read_csv("train.csv")
gps_test <- read_csv("test.csv")
gps <- bind_rows(train = gps_train, test = gps_test, .id = "data")
glimpse(gps)
## Observations: 648,879
## Variables: 12
## $ data               <chr> "train", "train", "train", "train", "train", …
## $ order_id           <chr> "RB193", "RB193", "RB193", "RB193", "RB193", …
## $ service_type       <chr> "GO_RIDE", "GO_RIDE", "GO_RIDE", "GO_RIDE", "…
## $ driver_status      <chr> "UNAVAILABLE", "AVAILABLE", "AVAILABLE", "AVA…
## $ date               <date> 2018-02-05, 2018-02-05, 2018-02-05, 2018-02-…
## $ hour               <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
## $ seconds            <dbl> 1548890667, 1548890680, 1548890690, 154889070…
## $ latitude           <dbl> -6.922910, -6.923039, -6.923039, -6.923048, -…
## $ longitude          <dbl> 107.6313, 107.6312, 107.6312, 107.6312, 107.6…
## $ altitude_in_meters <dbl> NA, 712, 712, 713, 713, 713, 713, 713, 713, 7…
## $ accuracy_in_meters <dbl> 23.027, 9.577, 9.577, 8.139, 7.029, 7.029, 3.…
## $ label              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

As per the description, we can re-encode some variables as factors.

gps$order_id <- as.factor(gps$order_id)
gps$service_type <- as.factor(gps$service_type)
gps$driver_status <- as.factor(gps$driver_status)
gps$label <- as.factor(gps$label)

The order_id variable groups several observations (“pings”) from a single ride.

gps %>% 
  count(order_id, name = "nb_of_ping") %>% 
  head()
## # A tibble: 6 x 2
##   order_id nb_of_ping
##   <fct>         <int>
## 1 F0              222
## 2 F1              180
## 3 F10             337
## 4 F100            293
## 5 F1000           117
## 6 F1001           118

We have 3500 distinct rides in the train set and 500 in the test set.

gps %>% filter(data == "train") %>% select(order_id) %>% n_distinct()
## [1] 3500
gps %>% filter(data == "test") %>% select(order_id) %>% n_distinct() 
## [1] 500

The distribution of the target variable (in the train set) is:

gps %>% 
  filter(data == "train") %>% 
  group_by(order_id) %>% 
  slice(1) %>% 
  ungroup() %>% 
  with(table(label))
## label
##    0    1 
## 1039 2461

Note that we don’t know which label corresponds to a fake GPS trip.
We could suspect that 0 is the fraud value, since fraud usually appears less frequently; however, in that case, about one third of the data would be considered fraud.

 

The data is equally distributed across the service_type variable:

gps %>% 
  group_by(order_id) %>% 
  slice(1) %>% 
  ungroup() %>% 
  with(table(service_type, data))
##             data
## service_type test train
##      GO_FOOD  250  1750
##      GO_RIDE  250  1750

Also note that order_id values starting with an “R” correspond to the “GO_RIDE” service, while those starting with an “F” correspond to the “GO_FOOD” service.

gps %>% 
  group_by(order_id) %>% 
  slice(1) %>% 
  ungroup() %>% 
  mutate(first_letter_order_id = str_sub(string = order_id, start = 1, end = 1)) %>% 
  with(table(first_letter_order_id, service_type))
##                      service_type
## first_letter_order_id GO_FOOD GO_RIDE
##                     F    2000       0
##                     R       0    2000

The data has been collected over 47 consecutive days, from 2018-02-05 to 2018-03-23 (the test set covers the same period, minus the last day).

# unique date values in train set
gps %>% 
  filter(data == "train") %>% 
  distinct(date) %>% 
  pull()
##  [1] "2018-02-05" "2018-02-06" "2018-02-07" "2018-02-08" "2018-02-09"
##  [6] "2018-02-10" "2018-02-11" "2018-02-12" "2018-02-13" "2018-02-14"
## [11] "2018-02-15" "2018-02-16" "2018-02-17" "2018-02-18" "2018-02-19"
## [16] "2018-02-20" "2018-02-21" "2018-02-22" "2018-02-23" "2018-02-24"
## [21] "2018-02-25" "2018-02-26" "2018-02-27" "2018-02-28" "2018-03-01"
## [26] "2018-03-02" "2018-03-03" "2018-03-04" "2018-03-05" "2018-03-06"
## [31] "2018-03-07" "2018-03-08" "2018-03-09" "2018-03-10" "2018-03-11"
## [36] "2018-03-12" "2018-03-13" "2018-03-14" "2018-03-15" "2018-03-16"
## [41] "2018-03-17" "2018-03-18" "2018-03-19" "2018-03-20" "2018-03-21"
## [46] "2018-03-22" "2018-03-23"
# unique date values in test set
gps %>% 
  filter(data == "test") %>% 
  distinct(date) %>% 
  pull()
##  [1] "2018-02-05" "2018-02-06" "2018-02-07" "2018-02-08" "2018-02-09"
##  [6] "2018-02-10" "2018-02-11" "2018-02-12" "2018-02-13" "2018-02-14"
## [11] "2018-02-15" "2018-02-16" "2018-02-17" "2018-02-18" "2018-02-19"
## [16] "2018-02-20" "2018-02-21" "2018-02-22" "2018-02-23" "2018-02-24"
## [21] "2018-02-25" "2018-02-26" "2018-02-27" "2018-02-28" "2018-03-01"
## [26] "2018-03-02" "2018-03-03" "2018-03-04" "2018-03-05" "2018-03-06"
## [31] "2018-03-07" "2018-03-08" "2018-03-09" "2018-03-10" "2018-03-11"
## [36] "2018-03-12" "2018-03-13" "2018-03-14" "2018-03-15" "2018-03-16"
## [41] "2018-03-17" "2018-03-18" "2018-03-19" "2018-03-20" "2018-03-21"
## [46] "2018-03-22"

Finally, we can have a quick look at missing data.

sapply(gps, function(x) sum(is.na(x)))
##               data           order_id       service_type 
##                  0                  0                  0 
##      driver_status               date               hour 
##                  0                  0                  0 
##            seconds           latitude          longitude 
##                  0                  0                  0 
## altitude_in_meters accuracy_in_meters              label 
##             175606                  0              81334

The variables longitude, latitude and altitude_in_meters can be used to calculate distances between 2 consecutive “pings”. Since we are not dealing with long distances, altitude is not very important in this calculation.
As this variable also has a lot of NA’s, we will delete the altitude_in_meters variable.
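
Since nothing later uses this column, dropping it is a one-liner (a small sketch; the remaining code works the same whether or not you run it).

# drop the altitude column: mostly missing and unnecessary for short distances
gps <- gps %>% select(-altitude_in_meters)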

Quick visualization

All visualizations will be done using the train dataset.

gps_train <- gps %>% filter(data == "train")

The label variable is of interest. We can look at the distribution for each type of service.

gps_train %>% 
  group_by(order_id) %>% 
  slice(1) %>% 
  ggplot(aes(service_type)) +
    geom_bar(aes(fill = label), position = position_fill()) +
    labs(title = "Distribution of target for each type of service")

Proportionally, it seems that the 0 value appears more frequently for the “GO_RIDE” service.

We can also look at the distribution of the label variable according to the time spent on the service.

gps_train %>% 
  group_by(order_id, label, service_type) %>% 
  summarise(service_duration = max(seconds) - min(seconds)) %>% 
  ggplot(aes(x = service_duration)) +
    geom_density(aes(fill = service_type), alpha = 0.4) +
    facet_wrap(~ label)

Delivering food usually takes more time than transporting someone. However, the distributions are very close for each label.

Finally, we can look at the time spent for each “ping” status for each type of label.

gps_train %>% 
  count(label, driver_status) %>% 
  group_by(label) %>% 
  mutate(prop = n / sum(n)) %>% 
  ggplot(aes(x = label, y = prop, fill = driver_status)) +
    geom_col() +
    geom_label(aes(label = scales::percent(prop)), position = position_stack(vjust = 0.5), show.legend = FALSE) +
    labs(title = 'Proportion of time spent on each "ping" status',
         subtitle = "'label' point of view")

For each label, the time spent on each “ping” status is very similar.

gps_train %>% 
  count(service_type, driver_status) %>% 
  group_by(service_type) %>% 
  mutate(prop = n / sum(n)) %>% 
  ggplot(aes(x = service_type, y = prop, fill = driver_status)) +
    geom_col() +
    geom_label(aes(label = scales::percent(prop)), position = position_stack(vjust = 0.5), show.legend = FALSE) +
    labs(title = 'Proportion of time spent on each "ping" status',
         subtitle = "'service' point of view")

The “GO_RIDE” service spends more time in the “AVAILABLE” status than the “GO_FOOD” service.

We can also look at the distribution of the hour variable.

gps_train %>% 
  count(order_id, label, hour) %>% 
  ggplot(aes(x = hour)) +
    geom_histogram(aes(fill = label), color = "black", bins = 24, boundary = 0, position = position_dodge()) +
    labs(title = "'hour' distribution")

Quick mapping

Each ride has GPS coordinates. So it is easy to view a ride on a map.

First, let’s draw an example of 0 and 1 labels for each service type.

# Sample a 0/1 label for the GO_RIDE/GO_FOOD services
# and extract the corresponding 'order_id'
set.seed(314)
random_label <- gps_train %>% 
  group_by(label, service_type) %>% 
  sample_n(size = 1) %>% 
  pull(order_id) %>% 
  as.vector()

# Extract the corresponding rides
gps_example <- gps_train %>% 
  filter(order_id %in% random_label)

# Add a column to trace the path
gps_example <- gps_example %>% 
  group_by(order_id) %>% 
  mutate(path_flow = row_number()) %>% 
  ungroup()

We now have 4 different trips. We could use ggmap and facet_wrap to display all 4 trips, but this would not be readable, as the background map would be the same for all 4 maps.
Instead, we will use plot_grid from the cowplot package.

Let’s plot the first trip on a map, to get a glimpse of the function we will write next.

# Extract first trip
gps_example_trip1 <- gps_example %>% 
  filter(order_id == random_label[1])

# Create a plot title
plot_title <- paste0("Type of service: ", gps_example_trip1$service_type, ", label: ", gps_example_trip1$label)

# Extract a map from minimum and maximum latitude/longitude
map_gps_example_trip1 <- get_stamenmap(bbox = c(left = min(gps_example_trip1$longitude), right = max(gps_example_trip1$longitude),
                                                top = max(gps_example_trip1$latitude), bottom = min(gps_example_trip1$latitude)),
                                       zoom = 15, crop = FALSE)

# Map the ride
ggmap(map_gps_example_trip1,
      base_layer = ggplot(data = gps_example_trip1, aes(x = longitude, y = latitude))) +
  geom_point(aes(color = driver_status)) +
  geom_text(aes(label = path_flow), 
            size = 4, vjust = 0, hjust = -0.5, color = "black", check_overlap = TRUE) +
  geom_path(aes(color = driver_status, group = order_id)) +
  labs(title = plot_title)

Let’s create a function to avoid repeating all previous code.

plot_map_trip <- function(data, legend = TRUE) {
  # Create a plot title
  plot_title <- paste0("Type of service: ", data$service_type, ", label: ", data$label)
  
  # Extract a map from minimum and maximum latitude/longitude
  map_gps <- get_stamenmap(bbox = c(left = min(data$longitude), right = max(data$longitude),
                                    top = max(data$latitude), bottom = min(data$latitude)),
                           zoom = 15, crop = FALSE)
  
  # Map the ride
  ggmap(map_gps,
        base_layer = ggplot(data = data, aes(x = longitude, y = latitude))) +
    geom_point(aes(color = driver_status)) +
    geom_text(aes(label = path_flow), 
              size = 4, vjust = 0, hjust = -0.5, color = "black", check_overlap = TRUE) +
    geom_path(aes(color = driver_status, group = order_id))+
    labs(title = plot_title)
}
# Test
gps_example %>% 
  filter(order_id == random_label[1]) %>% 
  plot_map_trip()

Map the 4 random rides.

cowplot::plot_grid(plot_map_trip(gps_example %>% filter(order_id == random_label[1])) + theme(legend.position = "none"), 
                   plot_map_trip(gps_example %>% filter(order_id == random_label[3])), 
                   plot_map_trip(gps_example %>% filter(order_id == random_label[2])) + theme(legend.position = "none"), 
                   plot_map_trip(gps_example %>% filter(order_id == random_label[4])) + theme(legend.position = "none"),
                   align = 'v', axis = 'l', ncol = 1)

# clean variables
rm(gps_example, gps_example_trip1, map_gps_example_trip1, random_label, plot_title, plot_map_trip)

Feature engineering

From available variables, we can create new, more adequate features that might fit better on data modeling.

Available data and new features proposal

  • order_id: no new feature
  • service_type: binary variable is_goride
  • driver_status: no new variable, but the numeric features will be split according to the “ping” status
  • date: day of the week recoded as a cyclical variable (year and month are probably not relevant because they are too similar across rides); whether it is a business day or a weekend day
  • hour: starting and ending hours of the ride, recoded as cyclical variables; whether the ride falls within business hours
  • seconds: duration of the trip, possibly split by “ping” status; maximum time between 2 “pings”
  • latitude and longitude: distance between 2 consecutive “pings”, giving the total distance of the ride, possibly split by “ping” status; maximum distance between 2 “pings”
  • altitude_in_meters: variable deleted
  • accuracy_in_meters: minimum, maximum, mean and median, split by “ping” status

Many of these new variables will be computed per ride (“grouped ride”).
The goal is to obtain, at the end of the process, a single row per ride containing all these new variables.

Binary variables

gps_goride <- gps %>% 
  select(order_id, service_type) %>% 
  distinct() %>% 
  transmute(order_id, is_goride = as.numeric(service_type == "GO_RIDE")) %>% 
  # this variable should be a factor
  mutate(is_goride = as.factor(is_goride))

head(gps_goride)
## # A tibble: 6 x 2
##   order_id is_goride
##   <fct>    <fct>    
## 1 RB193    1        
## 2 F1452    0        
## 3 F943     0        
## 4 F1999    0        
## 5 F415     0        
## 6 RB865    1

Cyclical variables

The goal is to re-encode time variables as “cyclical” ones (to preserve their circular nature).
Indeed, if we use a 1-7 encoding for the days of the week, we’re telling the model that days 4 and 5 are very similar, while days 1 and 7 are very dissimilar. In fact, days 1 and 7 are just as similar as days 4 and 5.

Recoding into cyclical values will be done using cos/sin transformations (http://blog.davidkaleko.com/feature-engineering-cyclical-features.html).
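
As a quick illustration of this encoding (a small sketch, not part of the pipeline), the encoded distance between days 0 and 6 equals the distance between days 3 and 4:

# small sketch: with the cos/sin encoding, days 0 and 6 are as close as days 3 and 4
day_encode <- function(d) c(cos(2 * pi * d / 7), sin(2 * pi * d / 7))
sqrt(sum((day_encode(0) - day_encode(6))^2))
## [1] 0.8677674
sqrt(sum((day_encode(3) - day_encode(4))^2))
## [1] 0.8677674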

Also note that some rides span 2 days.

two_days_ride <- gps %>% 
  distinct(order_id, date) %>% 
  group_by(order_id) %>% 
  filter(n() > 1)

head(two_days_ride)
## # A tibble: 6 x 2
## # Groups:   order_id [3]
##   order_id date      
##   <fct>    <date>    
## 1 RB1889   2018-02-07
## 2 RB1889   2018-02-08
## 3 RB288    2018-02-13
## 4 RB288    2018-02-14
## 5 RB564    2018-02-17
## 6 RB564    2018-02-18

In order to ease the reading of the final dataset, we will only keep the starting date of the ride, and add another column indicating whether the ride spans 2 days.

# recode days of the week

# extract the week day as a numeric variable
# The 'wday' component of a 'POSIXlt' object is the numeric weekday (0 to 6, starting on Sunday).
gps_date <- gps %>% 
  distinct(order_id, date) %>% 
  mutate(numeric_day = as.POSIXlt(date)$wday,
         day_cos = cos(2 * pi * numeric_day / 7),
         day_sin = sin(2 * pi * numeric_day / 7),
         # add an indication of weekday/weekend
         is_weekend = as.numeric(numeric_day %in% c(0,6)))

head(gps_date)
## # A tibble: 6 x 6
##   order_id date       numeric_day day_cos day_sin is_weekend
##   <fct>    <date>           <int>   <dbl>   <dbl>      <dbl>
## 1 RB193    2018-02-05           1   0.623   0.782          0
## 2 F1452    2018-02-05           1   0.623   0.782          0
## 3 F943     2018-02-05           1   0.623   0.782          0
## 4 F1999    2018-02-05           1   0.623   0.782          0
## 5 F415     2018-02-05           1   0.623   0.782          0
## 6 RB865    2018-02-05           1   0.623   0.782          0
# Keep only the first day of the ride and
# add a column indicating a 2-day ride
gps_date <- gps_date %>% 
  group_by(order_id) %>% 
  filter(date == min(date)) %>% 
  ungroup() %>% 
  mutate(is_two_day_ride = as.numeric(order_id %in% two_days_ride$order_id),
         # change variables into factor
         is_weekend = as.factor(is_weekend),
         is_two_day_ride = as.factor(is_two_day_ride)) %>% 
  select(-date, -numeric_day)

Let’s check for the 2-day rides.

gps_date %>% 
  filter(is_two_day_ride == 1) %>% 
  head()
## # A tibble: 6 x 5
##   order_id day_cos day_sin is_weekend is_two_day_ride
##   <fct>      <dbl>   <dbl> <fct>      <fct>          
## 1 RB1889    -0.901   0.434 0          1              
## 2 RB288     -0.223   0.975 0          1              
## 3 RB564      0.623  -0.782 1          1              
## 4 RB1374     1       0     1          1              
## 5 RB806      0.623   0.782 0          1              
## 6 F1828      0.623   0.782 0          1
rm(two_days_ride)

We can now proceed with the hour variable.
In addition to the cyclical transformation, we can add a business-hours indicator: let’s define business hours as 7AM to 8PM.

# Extract the starting and ending hour of the ride
gps_hour <- gps %>% 
  group_by(order_id) %>% 
  slice(1, n()) %>% 
  select(order_id, hour) %>% 
  mutate(text = c("hour_start", "hour_end")) %>% 
  pivot_wider(names_from = text, values_from = hour) %>% 
  # add business hours indication
  mutate(is_biz_hour = as.numeric(hour_start %in% 7:20)) %>% 
  ungroup()

head(gps_hour)
## # A tibble: 6 x 4
##   order_id hour_start hour_end is_biz_hour
##   <fct>         <dbl>    <dbl>       <dbl>
## 1 F0               10       11           1
## 2 F1               10       10           1
## 3 F10              15       16           1
## 4 F100             19       20           1
## 5 F1000             7        7           1
## 6 F1001            15       15           1
# Convert into cyclical
gps_hour <- gps_hour %>% 
  mutate(hour_start_cos = cos(2 * pi * hour_start / 24),
         hour_start_sin = sin(2 * pi * hour_start / 24),
         hour_end_cos = cos(2 * pi * hour_end / 24),
         hour_end_sin = sin(2 * pi * hour_end / 24),
         is_biz_hour = as.factor(is_biz_hour)) %>% 
  select(-hour_start, -hour_end)

head(gps_hour)
## # A tibble: 6 x 6
##   order_id is_biz_hour hour_start_cos hour_start_sin hour_end_cos
##   <fct>    <fct>                <dbl>          <dbl>        <dbl>
## 1 F0       1                   -0.866          0.500       -0.966
## 2 F1       1                   -0.866          0.500       -0.866
## 3 F10      1                   -0.707         -0.707       -0.5  
## 4 F100     1                    0.259         -0.966        0.5  
## 5 F1000    1                   -0.259          0.966       -0.259
## 6 F1001    1                   -0.707         -0.707       -0.707
## # … with 1 more variable: hour_end_sin <dbl>

Geographical and timing variables

First we will compute the distance between 2 consecutive “pings”, using longitude and latitude.
We will use the “Haversine” distance which, according to the geosphere vignette, gives good results for small distances.

# small example
paste("Distance between the first 2 coordinates of the dataset :", 
      distHaversine(p1 = c(gps$longitude[1], gps$latitude[1]),
                    p2 = c(gps$longitude[2], gps$latitude[2])),
      "meters.")
## [1] "Distance between the first 2 coordinates of the dataset : 15.4468444664606 meters."

When applied to a dataframe, the process is similar.
Note that we have to prepend an NA value to get the correct number of distances (with 20 coordinates, only 19 distances are computed).

gps %>% 
  slice(1:20) %>% 
  select(longitude, latitude) %>%
  mutate(dist = c(NA, distHaversine(.))) %>% 
  head()
## # A tibble: 6 x 3
##   longitude latitude  dist
##       <dbl>    <dbl> <dbl>
## 1      108.    -6.92 NA   
## 2      108.    -6.92 15.4 
## 3      108.    -6.92  0   
## 4      108.    -6.92  2.38
## 5      108.    -6.92  9.30
## 6      108.    -6.92  0

There is just one small issue: applying this method to the whole dataset would also compute the distance between the last “ping” of a ride and the first “ping” of the next one.
We will work around this by computing every distance first, and then setting the value to NA for the first coordinate of each ride.

gps_distance <- gps %>% 
  select(longitude, latitude) %>%
  # compute all possible distance
  mutate(dist = c(NA, distHaversine(.))) %>% 
  # add the order_id column
  mutate(order_id = gps$order_id)

head(gps_distance)
## # A tibble: 6 x 4
##   longitude latitude  dist order_id
##       <dbl>    <dbl> <dbl> <fct>   
## 1      108.    -6.92 NA    RB193   
## 2      108.    -6.92 15.4  RB193   
## 3      108.    -6.92  0    RB193   
## 4      108.    -6.92  2.38 RB193   
## 5      108.    -6.92  9.30 RB193   
## 6      108.    -6.92  0    RB193

Now we can group by order_id, and modify the first line of each ride accordingly.

gps_distance <- gps_distance %>% 
  group_by(order_id) %>% 
  mutate(dist = ifelse(row_number() == 1, NA, dist)) %>% 
  select(order_id, dist) %>% 
  ungroup()

head(gps_distance)
## # A tibble: 6 x 2
##   order_id  dist
##   <fct>    <dbl>
## 1 RB193    NA   
## 2 RB193    15.4 
## 3 RB193     0   
## 4 RB193     2.38
## 5 RB193     9.30
## 6 RB193     0

We can check the number of missing values in the dist column; it should equal the total number of rides, i.e. 3500 + 500 = 4000.

sum(is.na(gps_distance$dist))
## [1] 4000

Let’s do the same and calculate the time spent between each “ping”, using the seconds variable.
As previously, we will add a “NA” value for the first line of each ride.

gps_time <- gps %>% 
  mutate(time = c(NA, diff(seconds))) %>% 
  select(order_id, time) %>% 
  group_by(order_id) %>% 
  mutate(time = ifelse(row_number() == 1, NA, time)) %>% 
  ungroup()

head(gps_time)
## # A tibble: 6 x 2
##   order_id  time
##   <fct>    <dbl>
## 1 RB193       NA
## 2 RB193       13
## 3 RB193       10
## 4 RB193       10
## 5 RB193       10
## 6 RB193       10

We can check the number of missing values in the time column; it should equal the total number of rides, i.e. 3500 + 500 = 4000.

sum(is.na(gps_time$time))
## [1] 4000

Let’s add these new columns to the original dataframe.
Also, recall that we want to compute values “per status”; so we will keep only our variables of interest: order_id and driver_status, along with the new dist and time variables.

gps_dist_time <- gps %>% 
  mutate(dist = gps_distance$dist,
         time = gps_time$time) %>% 
  select(order_id, driver_status, dist, time)

head(gps_dist_time)
## # A tibble: 6 x 4
##   order_id driver_status  dist  time
##   <fct>    <fct>         <dbl> <dbl>
## 1 RB193    UNAVAILABLE   NA       NA
## 2 RB193    AVAILABLE     15.4     13
## 3 RB193    AVAILABLE      0       10
## 4 RB193    AVAILABLE      2.38    10
## 5 RB193    AVAILABLE      9.30    10
## 6 RB193    AVAILABLE      0       10

Let’s now compute the new features, grouped by ride and “ping” status.

gps_dist_time <- gps_dist_time %>% 
  group_by(order_id, driver_status) %>% 
  summarise(dist_total = sum(dist, na.rm = TRUE),
            dist_max = max(dist, na.rm = TRUE),
            time_total = sum(time, na.rm = TRUE),
            time_max = max(time, na.rm = TRUE)) %>% 
  ungroup()

head(gps_dist_time)
## # A tibble: 6 x 6
##   order_id driver_status dist_total dist_max time_total time_max
##   <fct>    <fct>              <dbl>    <dbl>      <dbl>    <dbl>
## 1 F0       AVAILABLE          1125.    1125.       1167       44
## 2 F0       OTW_DROPOFF         579.     100.        705       12
## 3 F0       OTW_PICKUP         1497.     715.        347       16
## 4 F0       UNAVAILABLE           0        0          13        8
## 5 F1       AVAILABLE           842.     842.        352       18
## 6 F1       OTW_DROPOFF        1412.     105.       1165       15

Some rides do not have any value for some statuses. This is problematic, because when we take the maximum of an empty group, the result is -Inf.

# for example
gps_dist_time %>% 
  filter(order_id == "RB193")
## # A tibble: 4 x 6
##   order_id driver_status dist_total dist_max time_total time_max
##   <fct>    <fct>              <dbl>    <dbl>      <dbl>    <dbl>
## 1 RB193    AVAILABLE          1635.    112.         800       92
## 2 RB193    OTW_DROPOFF        1069.     59.6        377       11
## 3 RB193    OTW_PICKUP          157.     43.4        186       17
## 4 RB193    UNAVAILABLE           0    -Inf            0     -Inf
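
The -Inf (and the 0 in dist_total) comes from base R’s behaviour when a group contains only NA values: sum(..., na.rm = TRUE) returns 0, while max(..., na.rm = TRUE) returns -Inf with a warning. A minimal illustration:

# when only NA values remain after na.rm = TRUE
sum(NA_real_, na.rm = TRUE)
## [1] 0
max(NA_real_, na.rm = TRUE)   # also emits a warning
## [1] -Inf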

To solve this, we will change these -Inf values into NA values.

gps_dist_time <- gps_dist_time %>% 
  # replace -Inf values with NA
  mutate_if(.predicate = is.numeric, .funs = list(~ na_if(., -Inf)))

gps_dist_time %>% 
  filter(order_id == "RB193")
## # A tibble: 4 x 6
##   order_id driver_status dist_total dist_max time_total time_max
##   <fct>    <fct>              <dbl>    <dbl>      <dbl>    <dbl>
## 1 RB193    AVAILABLE          1635.    112.         800       92
## 2 RB193    OTW_DROPOFF        1069.     59.6        377       11
## 3 RB193    OTW_PICKUP          157.     43.4        186       17
## 4 RB193    UNAVAILABLE           0      NA            0       NA

Let’s now spread the columns, in order to have only one row per ride.

gps_dist_time <- gps_dist_time %>% 
  pivot_wider(names_from = driver_status, values_from = dist_total:time_max)

head(gps_dist_time)
## # A tibble: 6 x 17
##   order_id dist_total_AVAI… dist_total_OTW_… dist_total_OTW_…
##   <fct>               <dbl>            <dbl>            <dbl>
## 1 F0                  1125.            579.             1497.
## 2 F1                   842.           1412.              375.
## 3 F10                  802.           7809.             3393.
## 4 F100                1196.           1234.              179.
## 5 F1000               2692.             75.5            1040.
## 6 F1001               1391.           2436.                0 
## # … with 13 more variables: dist_total_UNAVAILABLE <dbl>,
## #   dist_max_AVAILABLE <dbl>, dist_max_OTW_DROPOFF <dbl>,
## #   dist_max_OTW_PICKUP <dbl>, dist_max_UNAVAILABLE <dbl>,
## #   time_total_AVAILABLE <dbl>, time_total_OTW_DROPOFF <dbl>,
## #   time_total_OTW_PICKUP <dbl>, time_total_UNAVAILABLE <dbl>,
## #   time_max_AVAILABLE <dbl>, time_max_OTW_DROPOFF <dbl>,
## #   time_max_OTW_PICKUP <dbl>, time_max_UNAVAILABLE <dbl>

Note that we have some missing values, since not every status is present in every ride.
So, for each variable, we will add a new factor column indicating whether the value was missing, and impute the missing values with 0.

gps_dist_time <- gps_dist_time %>% 
  # this adds new columns indicating NA values (except for the column 'order_id')
  mutate_at(.vars = vars(-order_id), .funs = list(is_na = ~ as.numeric(is.na(.)))) %>% 
  # this will replace NA values with 0
  mutate_at(.vars = vars(-order_id, -ends_with("is_na")), .funs = replace_na, 0) %>% 
  # this will transform the 'is_na' columns into factor
  mutate_at(.vars = vars(ends_with("is_na")), .funs = as.factor)

head(gps_dist_time)
## # A tibble: 6 x 33
##   order_id dist_total_AVAI… dist_total_OTW_… dist_total_OTW_…
##   <fct>               <dbl>            <dbl>            <dbl>
## 1 F0                  1125.            579.             1497.
## 2 F1                   842.           1412.              375.
## 3 F10                  802.           7809.             3393.
## 4 F100                1196.           1234.              179.
## 5 F1000               2692.             75.5            1040.
## 6 F1001               1391.           2436.                0 
## # … with 29 more variables: dist_total_UNAVAILABLE <dbl>,
## #   dist_max_AVAILABLE <dbl>, dist_max_OTW_DROPOFF <dbl>,
## #   dist_max_OTW_PICKUP <dbl>, dist_max_UNAVAILABLE <dbl>,
## #   time_total_AVAILABLE <dbl>, time_total_OTW_DROPOFF <dbl>,
## #   time_total_OTW_PICKUP <dbl>, time_total_UNAVAILABLE <dbl>,
## #   time_max_AVAILABLE <dbl>, time_max_OTW_DROPOFF <dbl>,
## #   time_max_OTW_PICKUP <dbl>, time_max_UNAVAILABLE <dbl>,
## #   dist_total_AVAILABLE_is_na <fct>, dist_total_OTW_DROPOFF_is_na <fct>,
## #   dist_total_OTW_PICKUP_is_na <fct>, dist_total_UNAVAILABLE_is_na <fct>,
## #   dist_max_AVAILABLE_is_na <fct>, dist_max_OTW_DROPOFF_is_na <fct>,
## #   dist_max_OTW_PICKUP_is_na <fct>, dist_max_UNAVAILABLE_is_na <fct>,
## #   time_total_AVAILABLE_is_na <fct>, time_total_OTW_DROPOFF_is_na <fct>,
## #   time_total_OTW_PICKUP_is_na <fct>, time_total_UNAVAILABLE_is_na <fct>,
## #   time_max_AVAILABLE_is_na <fct>, time_max_OTW_DROPOFF_is_na <fct>,
## #   time_max_OTW_PICKUP_is_na <fct>, time_max_UNAVAILABLE_is_na <fct>
# clean environment
rm(gps_time, gps_distance)
gc(verbose = FALSE)
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  2434634 130.1    4452424 237.8  4452424 237.8
## Vcells 19283693 147.2   40996540 312.8 40667016 310.3

Basic statistics variables

The last variable to process is accuracy_in_meters, from which we will extract basic statistics.

gps_stats <- gps %>% 
  select(order_id, driver_status, accuracy_in_meters) %>% 
  group_by(order_id, driver_status) %>% 
  summarise(accuracy_min = min(accuracy_in_meters),
            accuracy_max = max(accuracy_in_meters),
            accuracy_mean = mean(accuracy_in_meters),
            accuracy_median = median(accuracy_in_meters)) %>% 
  ungroup()

head(gps_stats)
## # A tibble: 6 x 6
##   order_id driver_status accuracy_min accuracy_max accuracy_mean
##   <fct>    <fct>                <dbl>        <dbl>         <dbl>
## 1 F0       AVAILABLE             0.1          0.1          0.1  
## 2 F0       OTW_DROPOFF           4.38         8.29         4.96 
## 3 F0       OTW_PICKUP            4.38         7.4          5.53 
## 4 F0       UNAVAILABLE           4.48         4.68         4.58 
## 5 F1       AVAILABLE             0.01         9            0.267
## 6 F1       OTW_DROPOFF           3.80        35.2          6.43 
## # … with 1 more variable: accuracy_median <dbl>

As previously, we will spread the columns to have only one row per ride, replace NA values with 0, and add columns indicating whether a value was missing.

gps_stats <- gps_stats %>% 
  pivot_wider(names_from = driver_status, values_from = accuracy_min:accuracy_median) %>% 
  # this adds new columns indicating NA values (except for the grouped column 'order_id')
  mutate_at(.vars = vars(-order_id), .funs = list(is_na = ~ as.numeric(is.na(.)))) %>% 
  # this will replace NA values with 0
  mutate_at(.vars = vars(-order_id, -ends_with("is_na")), .funs = replace_na, 0) %>% 
  # this will transform the 'is_na' columns into factor
  mutate_at(.vars = vars(ends_with("is_na")), .funs = as.factor)

head(gps_stats)
## # A tibble: 6 x 33
##   order_id accuracy_min_AV… accuracy_min_OT… accuracy_min_OT…
##   <fct>               <dbl>            <dbl>            <dbl>
## 1 F0                   0.1              4.38             4.38
## 2 F1                   0.01             3.80             0.01
## 3 F10                  4.45             3                4.44
## 4 F100                 0.1              4.37             0.1 
## 5 F1000                3.9              3.9              3.9 
## 6 F1001                0.5              0.5              0.5 
## # … with 29 more variables: accuracy_min_UNAVAILABLE <dbl>,
## #   accuracy_max_AVAILABLE <dbl>, accuracy_max_OTW_DROPOFF <dbl>,
## #   accuracy_max_OTW_PICKUP <dbl>, accuracy_max_UNAVAILABLE <dbl>,
## #   accuracy_mean_AVAILABLE <dbl>, accuracy_mean_OTW_DROPOFF <dbl>,
## #   accuracy_mean_OTW_PICKUP <dbl>, accuracy_mean_UNAVAILABLE <dbl>,
## #   accuracy_median_AVAILABLE <dbl>, accuracy_median_OTW_DROPOFF <dbl>,
## #   accuracy_median_OTW_PICKUP <dbl>, accuracy_median_UNAVAILABLE <dbl>,
## #   accuracy_min_AVAILABLE_is_na <fct>,
## #   accuracy_min_OTW_DROPOFF_is_na <fct>,
## #   accuracy_min_OTW_PICKUP_is_na <fct>,
## #   accuracy_min_UNAVAILABLE_is_na <fct>,
## #   accuracy_max_AVAILABLE_is_na <fct>,
## #   accuracy_max_OTW_DROPOFF_is_na <fct>,
## #   accuracy_max_OTW_PICKUP_is_na <fct>,
## #   accuracy_max_UNAVAILABLE_is_na <fct>,
## #   accuracy_mean_AVAILABLE_is_na <fct>,
## #   accuracy_mean_OTW_DROPOFF_is_na <fct>,
## #   accuracy_mean_OTW_PICKUP_is_na <fct>,
## #   accuracy_mean_UNAVAILABLE_is_na <fct>,
## #   accuracy_median_AVAILABLE_is_na <fct>,
## #   accuracy_median_OTW_DROPOFF_is_na <fct>,
## #   accuracy_median_OTW_PICKUP_is_na <fct>,
## #   accuracy_median_UNAVAILABLE_is_na <fct>

We can now build our final dataframe for modeling.

gps_final <- gps %>% 
  select(data, order_id, label) %>% 
  distinct() %>% 
  left_join(gps_goride, by = "order_id") %>% 
  left_join(gps_date, by = "order_id") %>% 
  left_join(gps_hour, by = "order_id") %>% 
  left_join(gps_dist_time, by = "order_id") %>% 
  left_join(gps_stats, by = "order_id")
glimpse(gps_final)
## Observations: 4,000
## Variables: 77
## $ data                              <chr> "train", "train", "train", "tr…
## $ order_id                          <fct> RB193, F1452, F943, F1999, F41…
## $ label                             <fct> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ is_goride                         <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ day_cos                           <dbl> 0.6234898, 0.6234898, 0.623489…
## $ day_sin                           <dbl> 0.7818315, 0.7818315, 0.781831…
## $ is_weekend                        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_two_day_ride                   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_biz_hour                       <fct> 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
## $ hour_start_cos                    <dbl> 6.123234e-17, -8.660254e-01, -…
## $ hour_start_sin                    <dbl> 1.0000000, 0.5000000, -0.50000…
## $ hour_end_cos                      <dbl> 6.123234e-17, -9.659258e-01, -…
## $ hour_end_sin                      <dbl> 1.0000000, 0.2588190, -0.70710…
## $ dist_total_AVAILABLE              <dbl> 1635.290795, 1514.376522, 1337…
## $ dist_total_OTW_DROPOFF            <dbl> 1068.7541, 9054.5517, 1784.790…
## $ dist_total_OTW_PICKUP             <dbl> 157.075962, 1008.015814, 1449.…
## $ dist_total_UNAVAILABLE            <dbl> 0.000000, 0.000000, 0.000000, …
## $ dist_max_AVAILABLE                <dbl> 112.197553, 985.561324, 177.18…
## $ dist_max_OTW_DROPOFF              <dbl> 59.59587, 292.06993, 108.42692…
## $ dist_max_OTW_PICKUP               <dbl> 43.436086, 79.306181, 165.3672…
## $ dist_max_UNAVAILABLE              <dbl> 0.000000, 0.000000, 0.000000, …
## $ time_total_AVAILABLE              <dbl> 800, 1026, 1163, 658, 396, 871…
## $ time_total_OTW_DROPOFF            <dbl> 377, 1957, 909, 391, 685, 245,…
## $ time_total_OTW_PICKUP             <dbl> 186, 916, 657, 262, 399, 687, …
## $ time_total_UNAVAILABLE            <dbl> 0, 0, 26, 24, 14, 23, 10, 0, 6…
## $ time_max_AVAILABLE                <dbl> 92, 373, 38, 110, 21, 164, 523…
## $ time_max_OTW_DROPOFF              <dbl> 11, 28, 52, 15, 61, 15, 66, 0,…
## $ time_max_OTW_PICKUP               <dbl> 17, 16, 20, 12, 20, 11, 16, 16…
## $ time_max_UNAVAILABLE              <dbl> 0, 0, 18, 10, 14, 7, 10, 0, 6,…
## $ dist_total_AVAILABLE_is_na        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ dist_total_OTW_DROPOFF_is_na      <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ dist_total_OTW_PICKUP_is_na       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ dist_total_UNAVAILABLE_is_na      <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ dist_max_AVAILABLE_is_na          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ dist_max_OTW_DROPOFF_is_na        <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ dist_max_OTW_PICKUP_is_na         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ dist_max_UNAVAILABLE_is_na        <fct> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ time_total_AVAILABLE_is_na        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_total_OTW_DROPOFF_is_na      <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ time_total_OTW_PICKUP_is_na       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_total_UNAVAILABLE_is_na      <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ time_max_AVAILABLE_is_na          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_max_OTW_DROPOFF_is_na        <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ time_max_OTW_PICKUP_is_na         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_max_UNAVAILABLE_is_na        <fct> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_min_AVAILABLE            <dbl> 3.000, 3.000, 0.010, 0.010, 0.…
## $ accuracy_min_OTW_DROPOFF          <dbl> 3.000, 3.000, 3.000, 3.000, 5.…
## $ accuracy_min_OTW_PICKUP           <dbl> 3.000, 3.000, 0.010, 0.010, 0.…
## $ accuracy_min_UNAVAILABLE          <dbl> 23.027, 0.000, 0.010, 0.010, 0…
## $ accuracy_max_AVAILABLE            <dbl> 9.577, 5.893, 9.648, 9.648, 16…
## $ accuracy_max_OTW_DROPOFF          <dbl> 9.000, 98.753, 59.751, 6.138, …
## $ accuracy_max_OTW_PICKUP           <dbl> 9.000, 22.045, 35.970, 17.487,…
## $ accuracy_max_UNAVAILABLE          <dbl> 23.027, 0.000, 0.010, 3.900, 0…
## $ accuracy_mean_AVAILABLE           <dbl> 4.1849577, 3.6291324, 3.585307…
## $ accuracy_mean_OTW_DROPOFF         <dbl> 4.080289, 7.556753, 10.555709,…
## $ accuracy_mean_OTW_PICKUP          <dbl> 5.981111, 7.760374, 4.506516, …
## $ accuracy_mean_UNAVAILABLE         <dbl> 23.027000, 0.000000, 0.010000,…
## $ accuracy_median_AVAILABLE         <dbl> 3.8420, 3.0000, 5.9340, 3.1470…
## $ accuracy_median_OTW_DROPOFF       <dbl> 3.5810, 3.9000, 6.3960, 3.2160…
## $ accuracy_median_OTW_PICKUP        <dbl> 6.2015, 3.4120, 3.2160, 0.0100…
## $ accuracy_median_UNAVAILABLE       <dbl> 23.0270, 0.0000, 0.0100, 3.000…
## $ accuracy_min_AVAILABLE_is_na      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_min_OTW_DROPOFF_is_na    <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_min_OTW_PICKUP_is_na     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_min_UNAVAILABLE_is_na    <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_max_AVAILABLE_is_na      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_max_OTW_DROPOFF_is_na    <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_max_OTW_PICKUP_is_na     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_max_UNAVAILABLE_is_na    <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_mean_AVAILABLE_is_na     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_mean_OTW_DROPOFF_is_na   <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_mean_OTW_PICKUP_is_na    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_mean_UNAVAILABLE_is_na   <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_median_AVAILABLE_is_na   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_median_OTW_DROPOFF_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_median_OTW_PICKUP_is_na  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_median_UNAVAILABLE_is_na <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …

We transformed the original dataset of 648,879 rows x 12 columns into this new dataset of 4,000 rows x 77 columns.

# clean environment
rm(gps_date, gps_dist_time, gps_goride, gps_hour, gps_stats)

To close this part, let’s save our gps_final dataset.

saveRDS(gps_final, file = "gps_final.rds")
rm(gps, gps_train, gps_test)

Modeling and predicting the target variable

Machine learning modeling will be done using the H2O framework.
For data preparation, we will use the dataPreparation package.

# if necessary load data
# gps_final <- readRDS("gps_final.rds")

Filter useless variables

The first thing to do, in order to make computation faster, is to filter out useless variables:

  • Constant variables
  • Variables that are in double, i.e. duplicated (for example col1 == col2)
  • Variables that are exact bijections of one another (for example col1 = A, B, B, A and col2 = 1, 2, 2, 1)

Let’s identify them.

# Which columns are constant ?
constant_cols <- whichAreConstant(gps_final)
## [1] "whichAreConstant: it took me 0.03s to identify 0 constant column(s)"

No constant column here.

# Which columns are redundant ?
double_cols <- whichAreInDouble(gps_final)
## [1] "whichAreInDouble: time_total_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_min_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_max_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_mean_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_median_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_total_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_min_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_max_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_mean_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_median_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_total_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_min_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_max_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_mean_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_median_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_total_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_min_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_max_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_mean_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_median_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_max_AVAILABLE_is_na is exactly equal to dist_max_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_max_OTW_DROPOFF_is_na is exactly equal to dist_max_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_max_OTW_PICKUP_is_na is exactly equal to dist_max_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_max_UNAVAILABLE_is_na is exactly equal to dist_max_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: it took me 0.14s to identify 24 column(s) to drop."

We have many duplicated variables. This makes sense: by construction, the is_na indicator columns only depend on which “ping” statuses are present in a ride, so indicators built from different features for the same status are identical.
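
We can verify one of the reported pairs directly (a quick sketch; it should print TRUE):

# the NA indicator built from 'time_total' duplicates the one built from 'dist_total'
identical(gps_final$time_total_AVAILABLE_is_na, gps_final$dist_total_AVAILABLE_is_na)
## [1] TRUE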

# Remove redundant columns
gps_final <- gps_final %>% select(-double_cols)
# Which columns are bijections ?
bijections_cols <- whichAreBijection(gps_final)
## [1] "whichAreBijection: day_sin is a bijection of day_cos. I put it in drop list."
## [1] "whichAreBijection: it took me 1.23s to identify 1 column(s) to drop."

We found only one bijection.

# Remove bijections columns
gps_final <- gps_final %>% select(-bijections_cols)

We now have 52 columns, instead of the 77 we started with.

# clean environment
rm(bijections_cols, constant_cols, double_cols)

Splitting the data into training and testing sets

We will build our algorithms using the “train” part of gps_final (where the label is known), the “test” part being the one we must submit to Kaggle.
So first, let’s recreate the “original” train and test datasets.

gps_train <- gps_final %>% 
  filter(data == "train") %>% 
  select(-data) %>% 
  mutate_if(is.factor, droplevels)

gps_test <- gps_final %>% 
  filter(data == "test") %>% 
  select(-data) %>% 
  mutate_if(is.factor, droplevels)

From the gps_train dataset, let’s now split our data.
To avoid introducing a bias, the train-test split should be performed before (most of) the data preparation steps.
Using the gps_train set, we will randomly split the data into 80% for the training set and 20% for the testing set.
Note that we will remove the order_id variable from the predictors, since it is only an identifier and could introduce some leakage into the algorithms.

# Random sample indexes
set.seed(314)
index <- sample(x = 1:nrow(gps_train), size = 0.8 * nrow(gps_train))

# Build training and testing sets
train <- gps_train[index, colnames(gps_train) != "order_id"]
test <- gps_train[-index, colnames(gps_train) != "order_id"]

Scaling the data

Before applying any algorithm, we should scale the data (most machine learning algorithms handle scaled data better than unscaled data).
In order to get reproducible modeling, we must compute the scaling parameters on the training set only, and then apply them to the testing set.
Indeed, the models are built using only the training set; whatever was used to build them must be applied identically to new, unseen data (such as the testing set).

So let’s first compute the scales using the function build_scales. The scaling will be performed only on numeric features.

train_scales <- build_scales(dataSet = train, cols = "auto", verbose = TRUE)
## [1] "build_scales: I will compute scale on  37 numeric columns."
## [1] "build_scales: it took me: 0.01s to compute scale for 37 numeric columns."
# Print the results for the 4 first numeric variables
print(train_scales[1:4])
## $day_cos
## $day_cos$mean
## [1] -0.01929746
## 
## $day_cos$sd
## [1] 0.7066631
## 
## 
## $hour_start_cos
## $hour_start_cos$mean
## [1] -0.3161461
## 
## $hour_start_cos$sd
## [1] 0.6051199
## 
## 
## $hour_start_sin
## $hour_start_sin$mean
## [1] -0.2999856
## 
## $hour_start_sin$sd
## [1] 0.6664679
## 
## 
## $hour_end_cos
## $hour_end_cos$mean
## [1] -0.2720265
## 
## $hour_end_cos$sd
## [1] 0.6312151

We can now apply the scaling to the training and testing sets (recall that the scaling computed on the training set is applied to both datasets).
To observe the difference after scaling, let’s look at the current mean and standard deviation of the dist_total_AVAILABLE variable.

mean(train$dist_total_AVAILABLE)
## [1] 1160.859
sd(train$dist_total_AVAILABLE)
## [1] 1478.734
mean(test$dist_total_AVAILABLE)
## [1] 1156.618
sd(test$dist_total_AVAILABLE)
## [1] 1453.397

Apply the scaling.

# apply training scaling on both training and testing datasets
train <- fastScale(dataSet = train, scales = train_scales, verbose = TRUE)
## [1] "fastScale: I will scale 37 numeric columns."
## [1] "fastScale: it took me: 0.01s to scale 37 numeric columns."
test <- fastScale(dataSet = test, scales = train_scales, verbose = TRUE)
## [1] "fastScale: I will scale 37 numeric columns."
## [1] "fastScale: it took me: 0.01s to scale 37 numeric columns."

Mean and standard deviation after scaling.

mean(train$dist_total_AVAILABLE)
## [1] -1.268333e-17
sd(train$dist_total_AVAILABLE)
## [1] 1
mean(test$dist_total_AVAILABLE)
## [1] -0.00286792
sd(test$dist_total_AVAILABLE)
## [1] 0.9828655

ML Modeling

The data is now ready for machine learning modeling.
We will use the H2O framework to build multiple models. More specifically, the h2o.automl() automatic machine learning process is very handy for building models and automatically tuning the hyperparameters of each algorithm.

h2o.init()
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/Rtmpxcg7Uz/h2o_alex_started_from_r.out
##     /tmp/Rtmpxcg7Uz/h2o_alex_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 260 milliseconds 
##     H2O cluster timezone:       Europe/Paris 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.26.0.2 
##     H2O cluster version age:    2 months and 18 days  
##     H2O cluster name:           H2O_started_from_R_alex_tkq024 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.73 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.6.1 (2019-07-05)
h2o.no_progress()

The data has to be ‘h2o-formatted’ in order to be used.

train_h2o <- as.h2o(x = train)
test_h2o  <- as.h2o(x = test)

h2o.describe(train_h2o) %>% 
  head()
##             Label Type Missing Zeros PosInf NegInf       Min      Max
## 1           label enum       0   843      0      0  0.000000 1.000000
## 2       is_goride enum       0  1392      0      0  0.000000 1.000000
## 3         day_cos real       0     0      0      0 -1.247655 1.442409
## 4      is_weekend enum       0  2083      0      0  0.000000 1.000000
## 5 is_two_day_ride enum       0  2784      0      0  0.000000 1.000000
## 6     is_biz_hour enum       0   415      0      0  0.000000 1.000000
##           Mean      Sigma Cardinality
## 1 6.989286e-01 0.45880563           2
## 2 5.028571e-01 0.50008114           2
## 3 8.895879e-17 1.00000000          NA
## 4 2.560714e-01 0.43653970           2
## 5 5.714286e-03 0.07539007           2
## 6 8.517857e-01 0.35537574           2
# Identify predictors and response
y <- "label"
x <- setdiff(names(train_h2o), y)
# Run AutoML for up to 40 base models (here limited to 120 seconds of runtime)
# Exclude Deep Learning
# The metric used on Kaggle is AUC
gps_automl <- h2o.automl(x = x, y = y,
                         training_frame = train_h2o,
                         leaderboard_frame = test_h2o,
                         max_models = 40, max_runtime_secs = 120, 
                         exclude_algos = c("DeepLearning"), 
                         sort_metric = "AUC",
                         seed = 42)
# Extract the AutoML Leaderboard
gps_lb <- gps_automl@leaderboard

# View the 10 best models
gps_lb %>% 
  head(10)
##                                               model_id       auc   logloss
## 1     StackedEnsemble_AllModels_AutoML_20191015_153643 0.9007076 0.3604215
## 2  StackedEnsemble_BestOfFamily_AutoML_20191015_153643 0.9004798 0.3599897
## 3                     XGBoost_1_AutoML_20191015_153643 0.9001660 0.3512131
## 4                         GBM_2_AutoML_20191015_153643 0.8963142 0.3623349
## 5                         GBM_3_AutoML_20191015_153643 0.8937885 0.3650331
## 6                         GBM_5_AutoML_20191015_153643 0.8935506 0.3630208
## 7                     XGBoost_3_AutoML_20191015_153643 0.8925534 0.3637106
## 8                     XGBoost_2_AutoML_20191015_153643 0.8911008 0.3677750
## 9                         GBM_4_AutoML_20191015_153643 0.8890407 0.3779071
## 10                        GBM_1_AutoML_20191015_153643 0.8854116 0.3801981
##    mean_per_class_error      rmse       mse
## 1             0.2295918 0.3392682 0.1151029
## 2             0.2062075 0.3390417 0.1149492
## 3             0.2174036 0.3382877 0.1144386
## 4             0.2434807 0.3424413 0.1172660
## 5             0.2158447 0.3449120 0.1189643
## 6             0.2406463 0.3446288 0.1187690
## 7             0.2560941 0.3451568 0.1191332
## 8             0.1965703 0.3458641 0.1196220
## 9             0.2356859 0.3491716 0.1219208
## 10            0.2633220 0.3519303 0.1238549

We can have a look at variable importance from the top model.

gps_best <- h2o.getModel(model_id = gps_automl@leader@model_id)

h2o.varimp_plot(gps_best)
## Warning in max(tvi$coefficients): no non-missing arguments to max; returning
## -Inf

In our best model, the accuracy of the GPS seems to be the best predictor.
We can also have a look at the confusion matrix on our testing set.

h2o.confusionMatrix(object = gps_best, newdata = test_h2o)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.366185574791611:
##          0   1    Error      Rate
## 0      120  76 0.387755   =76/196
## 1       36 468 0.071429   =36/504
## Totals 156 544 0.160000  =112/700

Create predictions

We have now created a model with a pretty high AUC (around 0.9).
Let’s use this model to create predictions on the Kaggle test data.

 

In order to apply the scaling to the gps_test dataset, we first have to delete the order_id column (as we did for the gps_train dataset).

kaggle_test <- gps_test %>% select(-order_id)

Now we can apply the scaling.

kaggle_test <- fastScale(dataSet = kaggle_test, scales = train_scales, verbose = TRUE)
## [1] "fastScale: I will scale 37 numeric columns."
## [1] "fastScale: it took me: 0.01s to scale 37 numeric columns."

Import the kaggle_test data into the H2O framework (removing the label column, which is empty for the test set).

kaggle_test_h2o <- kaggle_test %>% 
  select(-label) %>% 
  as.h2o()

Predict on the kaggle_test set.

gps_pred <- h2o.predict(gps_automl@leader, newdata = kaggle_test_h2o)
head(gps_pred)
##   predict         p0         p1
## 1       1 0.28170026 0.71829974
## 2       0 0.84895809 0.15104191
## 3       1 0.03785942 0.96214058
## 4       0 0.90090639 0.09909361
## 5       1 0.04817225 0.95182775
## 6       1 0.04902424 0.95097576

The predictions are in the predict column of the gps_pred object.
Let’s extract it, and build our csv file for submission.

# Create final csv file
final <- tibble(order_id = gps_test$order_id,
                label = as.vector(as.numeric(gps_pred$predict)))

head(final)
## # A tibble: 6 x 2
##   order_id label
##   <fct>    <int>
## 1 F1770        1
## 2 RB1125       0
## 3 RB714        1
## 4 RB610        0
## 5 F1432        1
## 6 F106         1
write_csv(x = final, path = "h2o_final.csv")
# close h2o framework
h2o.shutdown(prompt = FALSE)
## [1] TRUE