Brief Description
Among our drivers, some use a Fake GPS application to mock their location. This fake GPS (FGPS) usage is unfair to the GOJEK drivers who work honestly. Hence, we would like to apply a machine learning model to classify whether a trip was made using fake GPS or not, based on the drivers' PING behavior.
You are provided with an anonymized dataset of driver PING signals containing numeric feature variables, the binary target column label, and a string order_id column. The task is to predict the value of the label column in the test set.
Data fields
order_id - an anonymous id unique to a given order number
service_type - service type, can be GORIDE or GOFOOD
driver_status - status of the driver PING, can be AVAILABLE, UNAVAILABLE, OTW_PICKUP, OTW_DROPOFF
date - date of the PING
hour - hour of the PING
seconds - timestamp in Unix (epoch) seconds
latitude - GPS latitude
longitude - GPS longitude
altitude_in_meters - GPS altitude
accuracy_in_meters - GPS accuracy, the smaller the more accurate
The service_type variable comes from the two services provided by the GOJEK company: GORIDE (transporting a passenger) and GOFOOD (delivering food).
Libraries
library(tidyverse) # data manipulation, plotting, ...
library(ggmap) # map plot
library(geosphere) # compute geographical distances
library(dataPreparation) # data preparation for ML
library(h2o) # ML framework
gps_train <- read_csv("train.csv")
gps_test <- read_csv("test.csv")
gps <- bind_rows(train = gps_train, test = gps_test, .id = "data")
glimpse(gps)
## Observations: 648,879
## Variables: 12
## $ data <chr> "train", "train", "train", "train", "train", …
## $ order_id <chr> "RB193", "RB193", "RB193", "RB193", "RB193", …
## $ service_type <chr> "GO_RIDE", "GO_RIDE", "GO_RIDE", "GO_RIDE", "…
## $ driver_status <chr> "UNAVAILABLE", "AVAILABLE", "AVAILABLE", "AVA…
## $ date <date> 2018-02-05, 2018-02-05, 2018-02-05, 2018-02-…
## $ hour <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
## $ seconds <dbl> 1548890667, 1548890680, 1548890690, 154889070…
## $ latitude <dbl> -6.922910, -6.923039, -6.923039, -6.923048, -…
## $ longitude <dbl> 107.6313, 107.6312, 107.6312, 107.6312, 107.6…
## $ altitude_in_meters <dbl> NA, 712, 712, 713, 713, 713, 713, 713, 713, 7…
## $ accuracy_in_meters <dbl> 23.027, 9.577, 9.577, 8.139, 7.029, 7.029, 3.…
## $ label <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
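As an aside, the seconds column holds Unix epoch timestamps; if needed, a value can be converted to a human-readable date-time with base R (a minimal sketch; the UTC timezone is an assumption):
# Convert an epoch value (seconds since 1970-01-01) to a date-time
as.POSIXct(gps$seconds[1], origin = "1970-01-01", tz = "UTC")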
As per the description, we can re-encode some variables as factors.
gps$order_id <- as.factor(gps$order_id)
gps$service_type <- as.factor(gps$service_type)
gps$driver_status <- as.factor(gps$driver_status)
gps$label <- as.factor(gps$label)
The order_id variable groups several observations ("pings") belonging to a single ride.
gps %>%
count(order_id, name = "nb_of_ping") %>%
head()
## # A tibble: 6 x 2
## order_id nb_of_ping
## <fct> <int>
## 1 F0 222
## 2 F1 180
## 3 F10 337
## 4 F100 293
## 5 F1000 117
## 6 F1001 118
In total, there are 3500 distinct rides in the train set and 500 in the test set.
gps %>% filter(data == "train") %>% select(order_id) %>% n_distinct()
## [1] 3500
gps %>% filter(data == "test") %>% select(order_id) %>% n_distinct()
## [1] 500
The distribution of the target variable (in the train set) is :
gps %>%
filter(data == "train") %>%
group_by(order_id) %>%
slice(1) %>%
ungroup() %>%
with(table(label))
## label
## 0 1
## 1039 2461
Note that we don't know which label corresponds to a fake GPS. We might suspect that 0 is the fraud value, since fraud usually appears less frequently; however, that would mean about one third of the rides are fraudulent.
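For reference, the exact proportions can be checked with one row per ride (a quick sketch):
# Share of each label among the train rides
gps %>%
  filter(data == "train") %>%
  distinct(order_id, label) %>%
  count(label) %>%
  mutate(prop = n / sum(n))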
The data is equally distributed across the service_type variable:
gps %>%
group_by(order_id) %>%
slice(1) %>%
ungroup() %>%
with(table(service_type, data))
## data
## service_type test train
## GO_FOOD 250 1750
## GO_RIDE 250 1750
Also note that order_id values starting with "R" correspond to the "GO_RIDE" service, while those starting with "F" correspond to the "GO_FOOD" service.
gps %>%
group_by(order_id) %>%
slice(1) %>%
ungroup() %>%
mutate(first_letter_order_id = str_sub(string = order_id, start = 1, end = 1)) %>%
with(table(first_letter_order_id, service_type))
## service_type
## first_letter_order_id GO_FOOD GO_RIDE
## F 2000 0
## R 0 2000
The data has been collected over 47 consecutive days, from 2018-02-05 to 2018-03-23, in the train set; the test set covers the same period except for the last day.
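A quick sanity check on the day count:
# Number of calendar days from 2018-02-05 to 2018-03-23, inclusive
length(seq(as.Date("2018-02-05"), as.Date("2018-03-23"), by = "day"))
## [1] 47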
# unique date values in train set
gps %>%
filter(data == "train") %>%
distinct(date) %>%
pull()
## [1] "2018-02-05" "2018-02-06" "2018-02-07" "2018-02-08" "2018-02-09"
## [6] "2018-02-10" "2018-02-11" "2018-02-12" "2018-02-13" "2018-02-14"
## [11] "2018-02-15" "2018-02-16" "2018-02-17" "2018-02-18" "2018-02-19"
## [16] "2018-02-20" "2018-02-21" "2018-02-22" "2018-02-23" "2018-02-24"
## [21] "2018-02-25" "2018-02-26" "2018-02-27" "2018-02-28" "2018-03-01"
## [26] "2018-03-02" "2018-03-03" "2018-03-04" "2018-03-05" "2018-03-06"
## [31] "2018-03-07" "2018-03-08" "2018-03-09" "2018-03-10" "2018-03-11"
## [36] "2018-03-12" "2018-03-13" "2018-03-14" "2018-03-15" "2018-03-16"
## [41] "2018-03-17" "2018-03-18" "2018-03-19" "2018-03-20" "2018-03-21"
## [46] "2018-03-22" "2018-03-23"
# unique date values in test set
gps %>%
filter(data == "test") %>%
distinct(date) %>%
pull()
## [1] "2018-02-05" "2018-02-06" "2018-02-07" "2018-02-08" "2018-02-09"
## [6] "2018-02-10" "2018-02-11" "2018-02-12" "2018-02-13" "2018-02-14"
## [11] "2018-02-15" "2018-02-16" "2018-02-17" "2018-02-18" "2018-02-19"
## [16] "2018-02-20" "2018-02-21" "2018-02-22" "2018-02-23" "2018-02-24"
## [21] "2018-02-25" "2018-02-26" "2018-02-27" "2018-02-28" "2018-03-01"
## [26] "2018-03-02" "2018-03-03" "2018-03-04" "2018-03-05" "2018-03-06"
## [31] "2018-03-07" "2018-03-08" "2018-03-09" "2018-03-10" "2018-03-11"
## [36] "2018-03-12" "2018-03-13" "2018-03-14" "2018-03-15" "2018-03-16"
## [41] "2018-03-17" "2018-03-18" "2018-03-19" "2018-03-20" "2018-03-21"
## [46] "2018-03-22"
Finally, we can have a quick look at missing data.
sapply(gps, function(x) sum(is.na(x)))
## data order_id service_type
## 0 0 0
## driver_status date hour
## 0 0 0
## seconds latitude longitude
## 0 0 0
## altitude_in_meters accuracy_in_meters label
## 175606 0 81334
The longitude, latitude and altitude_in_meters variables can be used to calculate distances between 2 consecutive "pings". Since we are not dealing with long distances, altitude is not very important in this calculation. So, as we also have a lot of NA's for this variable, we will delete the altitude_in_meters variable.
All visualizations will be done using the train dataset.
gps_train <- gps %>% filter(data == "train")
The label variable is of interest. We can look at its distribution for each type of service.
gps_train %>%
group_by(order_id) %>%
slice(1) %>%
ggplot(aes(service_type)) +
geom_bar(aes(fill = label), position = position_fill()) +
labs(title = "Distribution of target for each type of service")
Proportionally, the 0 value appears more frequently for the "GO_RIDE" service.
We can also look at the distribution of the label variable according to the time spent on the service.
gps_train %>%
group_by(order_id, label, service_type) %>%
summarise(service_duration = max(seconds) - min(seconds)) %>%
ggplot(aes(x = service_duration)) +
geom_density(aes(fill = service_type), alpha = 0.4) +
facet_wrap(~ label)
Delivering food usually takes more time than transporting a passenger. However, the distributions are very similar for both labels.
Finally, we can look at the time spent for each “ping” status for each type of label.
gps_train %>%
count(label, driver_status) %>%
group_by(label) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = label, y = prop, fill = driver_status)) +
geom_col() +
geom_label(aes(label = scales::percent(prop)), position = position_stack(vjust = 0.5), show.legend = FALSE) +
labs(title = 'Proportion of time spent on each "ping" status',
subtitle = "'label' point of view")
For each label, the time spent on each “ping” status is very similar.
gps_train %>%
count(service_type, driver_status) %>%
group_by(service_type) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = service_type, y = prop, fill = driver_status)) +
geom_col() +
geom_label(aes(label = scales::percent(prop)), position = position_stack(vjust = 0.5), show.legend = FALSE) +
labs(title = 'Proportion of time spent on each "ping" status',
subtitle = "'service' point of view")
The “GORIDE” service spends more time “available” than the “GOFOOD” service.
Let's also look at the distribution of the hour variable.
gps_train %>%
count(order_id, label, hour) %>%
ggplot(aes(x = hour)) +
geom_histogram(aes(fill = label), color = "black", bins = 24, boundary = 0, position = position_dodge()) +
labs(title = "'hour' distribution")
Each ride has GPS coordinates, so it is easy to view a ride on a map.
First, let's draw an example of each label (0 and 1) for each service type.
# Sample a 0/1 label example for each GO_RIDE/GO_FOOD service
# and extract the corresponding 'order_id'
set.seed(314)
random_label <- gps_train %>%
group_by(label, service_type) %>%
sample_n(size = 1) %>%
pull(order_id) %>%
as.vector()
# Extract the corresponding rides
gps_example <- gps_train %>%
filter(order_id %in% random_label)
# Add a column to trace the path
gps_example <- gps_example %>%
group_by(order_id) %>%
mutate(path_flow = row_number()) %>%
ungroup()
We now have 4 different trips. We could use ggmap and facet_wrap to display all 4 trips, but this would not be very readable, as the background map would be the same for all 4 maps. Instead, we will use plot_grid from the cowplot package.
Let's plot the first trip on a map, to get a glimpse of the function we will write next.
# Extract first trip
gps_example_trip1 <- gps_example %>%
filter(order_id == random_label[1])
# Create a plot title
plot_title <- paste0("Type of service: ", gps_example_trip1$service_type, ", label: ", gps_example_trip1$label)
# Extract a map from minimum and maximum latitude/longitude
map_gps_example_trip1 <- get_stamenmap(bbox = c(left = min(gps_example_trip1$longitude), right = max(gps_example_trip1$longitude),
top = max(gps_example_trip1$latitude), bottom = min(gps_example_trip1$latitude)),
zoom = 15, crop = FALSE)
# Map the ride
ggmap(map_gps_example_trip1,
base_layer = ggplot(data = gps_example_trip1, aes(x = longitude, y = latitude))) +
geom_point(aes(color = driver_status)) +
geom_text(aes(label = path_flow),
size = 4, vjust = 0, hjust = -0.5, color = "black", check_overlap = TRUE) +
geom_path(aes(color = driver_status, group = order_id)) +
labs(title = plot_title)
Let’s create a function to avoid repeating all previous code.
plot_map_trip <- function(data, legend = TRUE) {
# Create a plot title
plot_title <- paste0("Type of service: ", data$service_type, ", label: ", data$label)
# Extract a map from minimum and maximum latitude/longitude
map_gps <- get_stamenmap(bbox = c(left = min(data$longitude), right = max(data$longitude),
top = max(data$latitude), bottom = min(data$latitude)),
zoom = 15, crop = FALSE)
# Map the ride
ggmap(map_gps,
base_layer = ggplot(data = data, aes(x = longitude, y = latitude))) +
geom_point(aes(color = driver_status)) +
geom_text(aes(label = path_flow),
size = 4, vjust = 0, hjust = -0.5, color = "black", check_overlap = TRUE) +
geom_path(aes(color = driver_status, group = order_id))+
labs(title = plot_title)
}
# Test
gps_example %>%
filter(order_id == random_label[1]) %>%
plot_map_trip()
Map the 4 random rides.
cowplot::plot_grid(plot_map_trip(gps_example %>% filter(order_id == random_label[1])) + theme(legend.position = "none"),
plot_map_trip(gps_example %>% filter(order_id == random_label[3])),
plot_map_trip(gps_example %>% filter(order_id == random_label[2])) + theme(legend.position = "none"),
plot_map_trip(gps_example %>% filter(order_id == random_label[4])) + theme(legend.position = "none"),
align = 'v', axis = 'l', ncol = 1)
# clean variables
rm(gps_example, gps_example_trip1, map_gps_example_trip1, random_label, plot_title, plot_map_trip)
From the available variables, we can create new, more suitable features that may work better for modeling:
order_id: no new feature
service_type: a binary variable is_goride
driver_status: no new variable, but the numeric variables will be split according to the "ping" status
date: day of the week recoded into a cyclical variable (year and month are probably not relevant because they are too similar across all rides); whether it is a business day or a weekend day
hour: starting and ending hours of the ride, recoded into cyclical variables; whether these are business hours or not
seconds: duration of the trip, possibly split by "ping" status; maximum time between 2 "pings"
latitude and longitude: distance between 2 consecutive "pings", hence the total distance of the ride, possibly split by "ping" status; maximum distance between 2 "pings"
altitude_in_meters: variable deleted
accuracy_in_meters: minimum, maximum, mean and median, split by "ping" status
Many of these new variables will be computed per "grouped ride".
The goal is to obtain at the end of the process a unique line for each ride, with all these new variables.
gps_goride <- gps %>%
select(order_id, service_type) %>%
distinct() %>%
transmute(order_id, is_goride = as.numeric(service_type == "GO_RIDE")) %>%
# this variable should be a factor
mutate(is_goride = as.factor(is_goride))
head(gps_goride)
## # A tibble: 6 x 2
## order_id is_goride
## <fct> <fct>
## 1 RB193 1
## 2 F1452 0
## 3 F943 0
## 4 F1999 0
## 5 F415 0
## 6 RB865 1
The goal is to re-encode the date and hour variables as "cyclical" features (to preserve their periodic nature).
Indeed, if we use 1-7 encoding for the days of the week, we’re telling the model that days 4 and 5 are very similar, while days 1 and 7 are very dissimilar. In fact, days 1 and 7 are just as similar as days 4 and 5.
Recoding into cyclical values will be done using cos/sin transformations (http://blog.davidkaleko.com/feature-engineering-cyclical-features.html).
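A tiny illustration of why this works: with the cos/sin encoding, days 1 and 7 end up just as close to each other as days 4 and 5 (a minimal sketch):
# Encode a day of the week (1-7) as a point on the unit circle
encode_day <- function(d) c(cos(2 * pi * d / 7), sin(2 * pi * d / 7))
# Euclidean distance between encoded days 1 and 7 ...
sqrt(sum((encode_day(1) - encode_day(7))^2))
# ... equals the distance between encoded days 4 and 5 (both are about 0.87)
sqrt(sum((encode_day(4) - encode_day(5))^2))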
Also note that some rides span 2 days.
two_days_ride <- gps %>%
distinct(order_id, date) %>%
group_by(order_id) %>%
filter(n() > 1)
head(two_days_ride)
## # A tibble: 6 x 2
## # Groups: order_id [3]
## order_id date
## <fct> <date>
## 1 RB1889 2018-02-07
## 2 RB1889 2018-02-08
## 3 RB288 2018-02-13
## 4 RB288 2018-02-14
## 5 RB564 2018-02-17
## 6 RB564 2018-02-18
In order to ease the reading of the final dataset, we will only keep the starting date of the ride, and add another column to indicate if we have 2 dates.
# recode days of the week
# extract the week day as a numeric variable
# The 'wday' component of a 'POSIXlt' object is the numeric weekday (0 to 6, starting on Sunday).
gps_date <- gps %>%
distinct(order_id, date) %>%
mutate(numeric_day = as.POSIXlt(date)$wday,
day_cos = cos(2 * pi * numeric_day / 7),
day_sin = sin(2 * pi * numeric_day / 7),
# add an indication of weekday/weekend
is_weekend = as.numeric(numeric_day %in% c(0,6)))
head(gps_date)
## # A tibble: 6 x 6
## order_id date numeric_day day_cos day_sin is_weekend
## <fct> <date> <int> <dbl> <dbl> <dbl>
## 1 RB193 2018-02-05 1 0.623 0.782 0
## 2 F1452 2018-02-05 1 0.623 0.782 0
## 3 F943 2018-02-05 1 0.623 0.782 0
## 4 F1999 2018-02-05 1 0.623 0.782 0
## 5 F415 2018-02-05 1 0.623 0.782 0
## 6 RB865 2018-02-05 1 0.623 0.782 0
# Keep only the first day of the ride and
# add a column indicating a 2-day ride
gps_date <- gps_date %>%
group_by(order_id) %>%
filter(date == min(date)) %>%
ungroup() %>%
mutate(is_two_day_ride = as.numeric(order_id %in% two_days_ride$order_id),
# change variables into factor
is_weekend = as.factor(is_weekend),
is_two_day_ride = as.factor(is_two_day_ride)) %>%
select(-date, -numeric_day)
Let’s check for the 2-day rides.
gps_date %>%
filter(is_two_day_ride == 1) %>%
head()
## # A tibble: 6 x 5
## order_id day_cos day_sin is_weekend is_two_day_ride
## <fct> <dbl> <dbl> <fct> <fct>
## 1 RB1889 -0.901 0.434 0 1
## 2 RB288 -0.223 0.975 0 1
## 3 RB564 0.623 -0.782 1 1
## 4 RB1374 1 0 1 1
## 5 RB806 0.623 0.782 0 1
## 6 F1828 0.623 0.782 0 1
rm(two_days_ride)
We can now proceed with the hour variable.
In addition to transforming it into a cyclical variable, we can add a business-hours indicator: let's define business hours as 7 AM to 8 PM.
# Extract begin and end hour of the ride
gps_hour <- gps %>%
group_by(order_id) %>%
slice(1, n()) %>%
select(order_id, hour) %>%
mutate(text = c("hour_start", "hour_end")) %>%
pivot_wider(names_from = text, values_from = hour) %>%
# add business hours indication
mutate(is_biz_hour = as.numeric(hour_start %in% 7:20)) %>%
ungroup()
head(gps_hour)
## # A tibble: 6 x 4
## order_id hour_start hour_end is_biz_hour
## <fct> <dbl> <dbl> <dbl>
## 1 F0 10 11 1
## 2 F1 10 10 1
## 3 F10 15 16 1
## 4 F100 19 20 1
## 5 F1000 7 7 1
## 6 F1001 15 15 1
# Convert into cyclical
gps_hour <- gps_hour %>%
mutate(hour_start_cos = cos(2 * pi * hour_start / 24),
hour_start_sin = sin(2 * pi * hour_start / 24),
hour_end_cos = cos(2 * pi * hour_end / 24),
hour_end_sin = sin(2 * pi * hour_end / 24),
is_biz_hour = as.factor(is_biz_hour)) %>%
select(-hour_start, -hour_end)
head(gps_hour)
## # A tibble: 6 x 6
## order_id is_biz_hour hour_start_cos hour_start_sin hour_end_cos
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 F0 1 -0.866 0.500 -0.966
## 2 F1 1 -0.866 0.500 -0.866
## 3 F10 1 -0.707 -0.707 -0.5
## 4 F100 1 0.259 -0.966 0.5
## 5 F1000 1 -0.259 0.966 -0.259
## 6 F1001 1 -0.707 -0.707 -0.707
## # … with 1 more variable: hour_end_sin <dbl>
First, we will compute distances between 2 consecutive "pings", using longitude and latitude.
We will use the “Haversine” distance which, according to the geosphere vignette, gives good results for small distances.
# small example
paste("Distance between the first 2 coordinates of the dataset :",
distHaversine(p1 = c(gps$longitude[1], gps$latitude[1]),
p2 = c(gps$longitude[2], gps$latitude[2])),
"meters.")
## [1] "Distance between the first 2 coordinates of the dataset : 15.4468444664606 meters."
When applied to a dataframe, the process is similar.
Note that we have to add an NA value to get the correct number of new distances (with 20 coordinate points, only 19 distances are computed).
gps %>%
slice(1:20) %>%
select(longitude, latitude) %>%
mutate(dist = c(NA, distHaversine(.))) %>%
head()
## # A tibble: 6 x 3
## longitude latitude dist
## <dbl> <dbl> <dbl>
## 1 108. -6.92 NA
## 2 108. -6.92 15.4
## 3 108. -6.92 0
## 4 108. -6.92 2.38
## 5 108. -6.92 9.30
## 6 108. -6.92 0
There is just a small issue. Applied to the whole dataset, this method would also compute the distance between the last "ping" of one ride and the first "ping" of the next one.
We will work around this by computing every consecutive distance first, and then setting the value to NA for the first "ping" of each ride.
gps_distance <- gps %>%
select(longitude, latitude) %>%
# compute all possible distance
mutate(dist = c(NA, distHaversine(.))) %>%
# add the order_id column
mutate(order_id = gps$order_id)
head(gps_distance)
## # A tibble: 6 x 4
## longitude latitude dist order_id
## <dbl> <dbl> <dbl> <fct>
## 1 108. -6.92 NA RB193
## 2 108. -6.92 15.4 RB193
## 3 108. -6.92 0 RB193
## 4 108. -6.92 2.38 RB193
## 5 108. -6.92 9.30 RB193
## 6 108. -6.92 0 RB193
Now we can group by order_id and modify the first line of each ride accordingly.
gps_distance <- gps_distance %>%
group_by(order_id) %>%
mutate(dist = ifelse(row_number() == 1, NA, dist)) %>%
select(order_id, dist) %>%
ungroup()
head(gps_distance)
## # A tibble: 6 x 2
## order_id dist
## <fct> <dbl>
## 1 RB193 NA
## 2 RB193 15.4
## 3 RB193 0
## 4 RB193 2.38
## 5 RB193 9.30
## 6 RB193 0
We can check the number of missing values in the dist column; it should equal the total number of rides, i.e. 3500 + 500 = 4000.
sum(is.na(gps_distance$dist))
## [1] 4000
Let's do the same and calculate the time spent between each "ping", using the seconds variable.
As previously, we will add a “NA” value for the first line of each ride.
gps_time <- gps %>%
mutate(time = c(NA, diff(seconds))) %>%
select(order_id, time) %>%
group_by(order_id) %>%
mutate(time = ifelse(row_number() == 1, NA, time)) %>%
ungroup()
head(gps_time)
## # A tibble: 6 x 2
## order_id time
## <fct> <dbl>
## 1 RB193 NA
## 2 RB193 13
## 3 RB193 10
## 4 RB193 10
## 5 RB193 10
## 6 RB193 10
We can check the number of missing values in the time column; it should equal the total number of rides, i.e. 3500 + 500 = 4000.
sum(is.na(gps_time$time))
## [1] 4000
Let's add these new columns to the original dataframe.
Also, recall that we want to compute values "per status"; so we will keep only our variables of interest, order_id and driver_status, together with the new dist and time variables.
gps_dist_time <- gps %>%
mutate(dist = gps_distance$dist,
time = gps_time$time) %>%
select(order_id, driver_status, dist, time)
head(gps_dist_time)
## # A tibble: 6 x 4
## order_id driver_status dist time
## <fct> <fct> <dbl> <dbl>
## 1 RB193 UNAVAILABLE NA NA
## 2 RB193 AVAILABLE 15.4 13
## 3 RB193 AVAILABLE 0 10
## 4 RB193 AVAILABLE 2.38 10
## 5 RB193 AVAILABLE 9.30 10
## 6 RB193 AVAILABLE 0 10
Let’s now compute the new features, grouped by ride and “ping” status.
gps_dist_time <- gps_dist_time %>%
group_by(order_id, driver_status) %>%
summarise(dist_total = sum(dist, na.rm = TRUE),
dist_max = max(dist, na.rm = TRUE),
time_total = sum(time, na.rm = TRUE),
time_max = max(time, na.rm = TRUE)) %>%
ungroup()
head(gps_dist_time)
## # A tibble: 6 x 6
## order_id driver_status dist_total dist_max time_total time_max
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 F0 AVAILABLE 1125. 1125. 1167 44
## 2 F0 OTW_DROPOFF 579. 100. 705 12
## 3 F0 OTW_PICKUP 1497. 715. 347 16
## 4 F0 UNAVAILABLE 0 0 13 8
## 5 F1 AVAILABLE 842. 842. 352 18
## 6 F1 OTW_DROPOFF 1412. 105. 1165 15
Some rides have no value at all for a given status (all dist and time values are NA). This is problematic: when we then take the maximum, the result is -Inf.
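This is standard R behaviour when every value in a group is NA (a quick base R illustration):
# max() with only NA values left after na.rm = TRUE warns and returns -Inf
max(NA_real_, na.rm = TRUE)
## [1] -Inf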
# for example
gps_dist_time %>%
filter(order_id == "RB193")
## # A tibble: 4 x 6
## order_id driver_status dist_total dist_max time_total time_max
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 RB193 AVAILABLE 1635. 112. 800 92
## 2 RB193 OTW_DROPOFF 1069. 59.6 377 11
## 3 RB193 OTW_PICKUP 157. 43.4 186 17
## 4 RB193 UNAVAILABLE 0 -Inf 0 -Inf
To solve this, we will change these -Inf values into NA values.
gps_dist_time <- gps_dist_time %>%
# replace -Inf values with NA
mutate_if(.predicate = is.numeric, .funs = list(~ na_if(., -Inf)))
gps_dist_time %>%
filter(order_id == "RB193")
## # A tibble: 4 x 6
## order_id driver_status dist_total dist_max time_total time_max
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 RB193 AVAILABLE 1635. 112. 800 92
## 2 RB193 OTW_DROPOFF 1069. 59.6 377 11
## 3 RB193 OTW_PICKUP 157. 43.4 186 17
## 4 RB193 UNAVAILABLE 0 NA 0 NA
Let’s now spread the columns, in order to have only one row per ride.
gps_dist_time <- gps_dist_time %>%
pivot_wider(names_from = driver_status, values_from = dist_total:time_max)
head(gps_dist_time)
## # A tibble: 6 x 17
## order_id dist_total_AVAI… dist_total_OTW_… dist_total_OTW_…
## <fct> <dbl> <dbl> <dbl>
## 1 F0 1125. 579. 1497.
## 2 F1 842. 1412. 375.
## 3 F10 802. 7809. 3393.
## 4 F100 1196. 1234. 179.
## 5 F1000 2692. 75.5 1040.
## 6 F1001 1391. 2436. 0
## # … with 13 more variables: dist_total_UNAVAILABLE <dbl>,
## # dist_max_AVAILABLE <dbl>, dist_max_OTW_DROPOFF <dbl>,
## # dist_max_OTW_PICKUP <dbl>, dist_max_UNAVAILABLE <dbl>,
## # time_total_AVAILABLE <dbl>, time_total_OTW_DROPOFF <dbl>,
## # time_total_OTW_PICKUP <dbl>, time_total_UNAVAILABLE <dbl>,
## # time_max_AVAILABLE <dbl>, time_max_OTW_DROPOFF <dbl>,
## # time_max_OTW_PICKUP <dbl>, time_max_UNAVAILABLE <dbl>
Note that we have some missing values, since each status is not present in every ride.
So, for each variable, we will add a new factor column indicating whether the value was missing, and impute the missing values with 0.
gps_dist_time <- gps_dist_time %>%
# this adds new columns indicating NA values (except for the column 'order_id')
mutate_at(.vars = vars(-order_id), .funs = list(is_na = ~ as.numeric(is.na(.)))) %>%
# this will replace NA values with 0
mutate_at(.vars = vars(-order_id, -ends_with("is_na")), .funs = replace_na, 0) %>%
# this will transform the 'is_na' columns into factor
mutate_at(.vars = vars(ends_with("is_na")), .funs = as.factor)
head(gps_dist_time)
## # A tibble: 6 x 33
## order_id dist_total_AVAI… dist_total_OTW_… dist_total_OTW_…
## <fct> <dbl> <dbl> <dbl>
## 1 F0 1125. 579. 1497.
## 2 F1 842. 1412. 375.
## 3 F10 802. 7809. 3393.
## 4 F100 1196. 1234. 179.
## 5 F1000 2692. 75.5 1040.
## 6 F1001 1391. 2436. 0
## # … with 29 more variables: dist_total_UNAVAILABLE <dbl>,
## # dist_max_AVAILABLE <dbl>, dist_max_OTW_DROPOFF <dbl>,
## # dist_max_OTW_PICKUP <dbl>, dist_max_UNAVAILABLE <dbl>,
## # time_total_AVAILABLE <dbl>, time_total_OTW_DROPOFF <dbl>,
## # time_total_OTW_PICKUP <dbl>, time_total_UNAVAILABLE <dbl>,
## # time_max_AVAILABLE <dbl>, time_max_OTW_DROPOFF <dbl>,
## # time_max_OTW_PICKUP <dbl>, time_max_UNAVAILABLE <dbl>,
## # dist_total_AVAILABLE_is_na <fct>, dist_total_OTW_DROPOFF_is_na <fct>,
## # dist_total_OTW_PICKUP_is_na <fct>, dist_total_UNAVAILABLE_is_na <fct>,
## # dist_max_AVAILABLE_is_na <fct>, dist_max_OTW_DROPOFF_is_na <fct>,
## # dist_max_OTW_PICKUP_is_na <fct>, dist_max_UNAVAILABLE_is_na <fct>,
## # time_total_AVAILABLE_is_na <fct>, time_total_OTW_DROPOFF_is_na <fct>,
## # time_total_OTW_PICKUP_is_na <fct>, time_total_UNAVAILABLE_is_na <fct>,
## # time_max_AVAILABLE_is_na <fct>, time_max_OTW_DROPOFF_is_na <fct>,
## # time_max_OTW_PICKUP_is_na <fct>, time_max_UNAVAILABLE_is_na <fct>
# clean environment
rm(gps_time, gps_distance)
gc(verbose = FALSE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2434634 130.1 4452424 237.8 4452424 237.8
## Vcells 19283693 147.2 40996540 312.8 40667016 310.3
The last variable to process is accuracy_in_meters, from which we will extract basic statistics.
gps_stats <- gps %>%
select(order_id, driver_status, accuracy_in_meters) %>%
group_by(order_id, driver_status) %>%
summarise(accuracy_min = min(accuracy_in_meters),
accuracy_max = max(accuracy_in_meters),
accuracy_mean = mean(accuracy_in_meters),
accuracy_median = median(accuracy_in_meters)) %>%
ungroup()
head(gps_stats)
## # A tibble: 6 x 6
## order_id driver_status accuracy_min accuracy_max accuracy_mean
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 F0 AVAILABLE 0.1 0.1 0.1
## 2 F0 OTW_DROPOFF 4.38 8.29 4.96
## 3 F0 OTW_PICKUP 4.38 7.4 5.53
## 4 F0 UNAVAILABLE 4.48 4.68 4.58
## 5 F1 AVAILABLE 0.01 9 0.267
## 6 F1 OTW_DROPOFF 3.80 35.2 6.43
## # … with 1 more variable: accuracy_median <dbl>
As previously, we will spread the columns to have only one row per ride, replace NA values with 0, and add a column indicating whether there was a missing value.
gps_stats <- gps_stats %>%
pivot_wider(names_from = driver_status, values_from = accuracy_min:accuracy_median) %>%
# this adds new columns indicating NA values (except for the grouped column 'order_id')
mutate_at(.vars = vars(-order_id), .funs = list(is_na = ~ as.numeric(is.na(.)))) %>%
# this will replace NA values with 0
mutate_at(.vars = vars(-order_id, -ends_with("is_na")), .funs = replace_na, 0) %>%
# this will transform the 'is_na' columns into factor
mutate_at(.vars = vars(ends_with("is_na")), .funs = as.factor)
head(gps_stats)
## # A tibble: 6 x 33
## order_id accuracy_min_AV… accuracy_min_OT… accuracy_min_OT…
## <fct> <dbl> <dbl> <dbl>
## 1 F0 0.1 4.38 4.38
## 2 F1 0.01 3.80 0.01
## 3 F10 4.45 3 4.44
## 4 F100 0.1 4.37 0.1
## 5 F1000 3.9 3.9 3.9
## 6 F1001 0.5 0.5 0.5
## # … with 29 more variables: accuracy_min_UNAVAILABLE <dbl>,
## # accuracy_max_AVAILABLE <dbl>, accuracy_max_OTW_DROPOFF <dbl>,
## # accuracy_max_OTW_PICKUP <dbl>, accuracy_max_UNAVAILABLE <dbl>,
## # accuracy_mean_AVAILABLE <dbl>, accuracy_mean_OTW_DROPOFF <dbl>,
## # accuracy_mean_OTW_PICKUP <dbl>, accuracy_mean_UNAVAILABLE <dbl>,
## # accuracy_median_AVAILABLE <dbl>, accuracy_median_OTW_DROPOFF <dbl>,
## # accuracy_median_OTW_PICKUP <dbl>, accuracy_median_UNAVAILABLE <dbl>,
## # accuracy_min_AVAILABLE_is_na <fct>,
## # accuracy_min_OTW_DROPOFF_is_na <fct>,
## # accuracy_min_OTW_PICKUP_is_na <fct>,
## # accuracy_min_UNAVAILABLE_is_na <fct>,
## # accuracy_max_AVAILABLE_is_na <fct>,
## # accuracy_max_OTW_DROPOFF_is_na <fct>,
## # accuracy_max_OTW_PICKUP_is_na <fct>,
## # accuracy_max_UNAVAILABLE_is_na <fct>,
## # accuracy_mean_AVAILABLE_is_na <fct>,
## # accuracy_mean_OTW_DROPOFF_is_na <fct>,
## # accuracy_mean_OTW_PICKUP_is_na <fct>,
## # accuracy_mean_UNAVAILABLE_is_na <fct>,
## # accuracy_median_AVAILABLE_is_na <fct>,
## # accuracy_median_OTW_DROPOFF_is_na <fct>,
## # accuracy_median_OTW_PICKUP_is_na <fct>,
## # accuracy_median_UNAVAILABLE_is_na <fct>
We can now build our final dataframe for modeling.
gps_final <- gps %>%
select(data, order_id, label) %>%
distinct() %>%
left_join(gps_goride, by = "order_id") %>%
left_join(gps_date, by = "order_id") %>%
left_join(gps_hour, by = "order_id") %>%
left_join(gps_dist_time, by = "order_id") %>%
left_join(gps_stats, by = "order_id")
glimpse(gps_final)
## Observations: 4,000
## Variables: 77
## $ data <chr> "train", "train", "train", "tr…
## $ order_id <fct> RB193, F1452, F943, F1999, F41…
## $ label <fct> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ is_goride <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ day_cos <dbl> 0.6234898, 0.6234898, 0.623489…
## $ day_sin <dbl> 0.7818315, 0.7818315, 0.781831…
## $ is_weekend <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_two_day_ride <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_biz_hour <fct> 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
## $ hour_start_cos <dbl> 6.123234e-17, -8.660254e-01, -…
## $ hour_start_sin <dbl> 1.0000000, 0.5000000, -0.50000…
## $ hour_end_cos <dbl> 6.123234e-17, -9.659258e-01, -…
## $ hour_end_sin <dbl> 1.0000000, 0.2588190, -0.70710…
## $ dist_total_AVAILABLE <dbl> 1635.290795, 1514.376522, 1337…
## $ dist_total_OTW_DROPOFF <dbl> 1068.7541, 9054.5517, 1784.790…
## $ dist_total_OTW_PICKUP <dbl> 157.075962, 1008.015814, 1449.…
## $ dist_total_UNAVAILABLE <dbl> 0.000000, 0.000000, 0.000000, …
## $ dist_max_AVAILABLE <dbl> 112.197553, 985.561324, 177.18…
## $ dist_max_OTW_DROPOFF <dbl> 59.59587, 292.06993, 108.42692…
## $ dist_max_OTW_PICKUP <dbl> 43.436086, 79.306181, 165.3672…
## $ dist_max_UNAVAILABLE <dbl> 0.000000, 0.000000, 0.000000, …
## $ time_total_AVAILABLE <dbl> 800, 1026, 1163, 658, 396, 871…
## $ time_total_OTW_DROPOFF <dbl> 377, 1957, 909, 391, 685, 245,…
## $ time_total_OTW_PICKUP <dbl> 186, 916, 657, 262, 399, 687, …
## $ time_total_UNAVAILABLE <dbl> 0, 0, 26, 24, 14, 23, 10, 0, 6…
## $ time_max_AVAILABLE <dbl> 92, 373, 38, 110, 21, 164, 523…
## $ time_max_OTW_DROPOFF <dbl> 11, 28, 52, 15, 61, 15, 66, 0,…
## $ time_max_OTW_PICKUP <dbl> 17, 16, 20, 12, 20, 11, 16, 16…
## $ time_max_UNAVAILABLE <dbl> 0, 0, 18, 10, 14, 7, 10, 0, 6,…
## $ dist_total_AVAILABLE_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ dist_total_OTW_DROPOFF_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ dist_total_OTW_PICKUP_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ dist_total_UNAVAILABLE_is_na <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ dist_max_AVAILABLE_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ dist_max_OTW_DROPOFF_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ dist_max_OTW_PICKUP_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ dist_max_UNAVAILABLE_is_na <fct> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ time_total_AVAILABLE_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_total_OTW_DROPOFF_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ time_total_OTW_PICKUP_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_total_UNAVAILABLE_is_na <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ time_max_AVAILABLE_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_max_OTW_DROPOFF_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ time_max_OTW_PICKUP_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_max_UNAVAILABLE_is_na <fct> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_min_AVAILABLE <dbl> 3.000, 3.000, 0.010, 0.010, 0.…
## $ accuracy_min_OTW_DROPOFF <dbl> 3.000, 3.000, 3.000, 3.000, 5.…
## $ accuracy_min_OTW_PICKUP <dbl> 3.000, 3.000, 0.010, 0.010, 0.…
## $ accuracy_min_UNAVAILABLE <dbl> 23.027, 0.000, 0.010, 0.010, 0…
## $ accuracy_max_AVAILABLE <dbl> 9.577, 5.893, 9.648, 9.648, 16…
## $ accuracy_max_OTW_DROPOFF <dbl> 9.000, 98.753, 59.751, 6.138, …
## $ accuracy_max_OTW_PICKUP <dbl> 9.000, 22.045, 35.970, 17.487,…
## $ accuracy_max_UNAVAILABLE <dbl> 23.027, 0.000, 0.010, 3.900, 0…
## $ accuracy_mean_AVAILABLE <dbl> 4.1849577, 3.6291324, 3.585307…
## $ accuracy_mean_OTW_DROPOFF <dbl> 4.080289, 7.556753, 10.555709,…
## $ accuracy_mean_OTW_PICKUP <dbl> 5.981111, 7.760374, 4.506516, …
## $ accuracy_mean_UNAVAILABLE <dbl> 23.027000, 0.000000, 0.010000,…
## $ accuracy_median_AVAILABLE <dbl> 3.8420, 3.0000, 5.9340, 3.1470…
## $ accuracy_median_OTW_DROPOFF <dbl> 3.5810, 3.9000, 6.3960, 3.2160…
## $ accuracy_median_OTW_PICKUP <dbl> 6.2015, 3.4120, 3.2160, 0.0100…
## $ accuracy_median_UNAVAILABLE <dbl> 23.0270, 0.0000, 0.0100, 3.000…
## $ accuracy_min_AVAILABLE_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_min_OTW_DROPOFF_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_min_OTW_PICKUP_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_min_UNAVAILABLE_is_na <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_max_AVAILABLE_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_max_OTW_DROPOFF_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_max_OTW_PICKUP_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_max_UNAVAILABLE_is_na <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_mean_AVAILABLE_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_mean_OTW_DROPOFF_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_mean_OTW_PICKUP_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_mean_UNAVAILABLE_is_na <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_median_AVAILABLE_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_median_OTW_DROPOFF_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ accuracy_median_OTW_PICKUP_is_na <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ accuracy_median_UNAVAILABLE_is_na <fct> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
We transformed the original dataset of 648,879 rows x 12 columns into this new dataset of 4,000 rows x 77 columns.
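A quick check of the final dimensions:
dim(gps_final)
## [1] 4000   77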
# clean environment
rm(gps_date, gps_dist_time, gps_goride, gps_hour, gps_stats)
To close this part, let's save our gps_final dataset.
saveRDS(gps_final, file = "gps_final.rds")
rm(gps, gps_train, gps_test)
Machine learning will be done within the H2O framework. For data preparation, we will use the dataPreparation package.
# if necessary load data
# gps_final <- readRDS("gps_final.rds")
The first thing to do, in order to keep computation fast, is to filter out useless variables: constant columns, duplicated columns and bijections.
Let's identify them.
# Which columns are constant ?
constant_cols <- whichAreConstant(gps_final)
## [1] "whichAreConstant: it took me 0.03s to identify 0 constant column(s)"
No constant column here.
# Which columns are redundant ?
double_cols <- whichAreInDouble(gps_final)
## [1] "whichAreInDouble: time_total_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_min_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_max_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_mean_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_median_AVAILABLE_is_na is exactly equal to dist_total_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_total_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_min_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_max_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_mean_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_median_OTW_DROPOFF_is_na is exactly equal to dist_total_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_total_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_min_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_max_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_mean_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_median_OTW_PICKUP_is_na is exactly equal to dist_total_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_total_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_min_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_max_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_mean_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: accuracy_median_UNAVAILABLE_is_na is exactly equal to dist_total_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_max_AVAILABLE_is_na is exactly equal to dist_max_AVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_max_OTW_DROPOFF_is_na is exactly equal to dist_max_OTW_DROPOFF_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_max_OTW_PICKUP_is_na is exactly equal to dist_max_OTW_PICKUP_is_na. I put it in drop list."
## [1] "whichAreInDouble: time_max_UNAVAILABLE_is_na is exactly equal to dist_max_UNAVAILABLE_is_na. I put it in drop list."
## [1] "whichAreInDouble: it took me 0.14s to identify 24 column(s) to drop."
We have many duplicated variables. This is logical: by construction, when a "ping" status is absent from a ride, all the metrics derived for that status are missing at the same time, so their _is_na indicator columns end up identical.
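We can spot-check one of the reported pairs before dropping them (it should return TRUE):
# Both indicator columns flag exactly the same rides
identical(gps_final$time_total_AVAILABLE_is_na, gps_final$dist_total_AVAILABLE_is_na)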
# Remove redundant columns
gps_final <- gps_final %>% select(-double_cols)
# Which columns are bijections ?
bijections_cols <- whichAreBijection(gps_final)
## [1] "whichAreBijection: day_sin is a bijection of day_cos. I put it in drop list."
## [1] "whichAreBijection: it took me 1.23s to identify 1 column(s) to drop."
We only found one bijection variable.
# Remove bijections columns
gps_final <- gps_final %>% select(-bijections_cols)
We now have 52 distinct columns, instead of 77 originally.
# clean environment
rm(bijections_cols, constant_cols, double_cols)
We will build our algorithms using the "train" part of gps_final (where the label feature is known), the "test" part being the one for which we must submit predictions to Kaggle.
So first, let’s recreate the “original” train and test data.
gps_train <- gps_final %>%
filter(data == "train") %>%
select(-data) %>%
mutate_if(is.factor, droplevels)
gps_test <- gps_final %>%
filter(data == "test") %>%
select(-data) %>%
mutate_if(is.factor, droplevels)
From the gps_train dataset, let's now split our data.
To avoid introducing bias, the train-test split should be performed before (most of) the data preparation steps.
Using the gps_train set, we will randomly split the data into 80% for the training set and 20% for the testing set.
Note that we will remove the order_id variable from the predictors, since it is only an identifier and could introduce some leakage into the algorithms.
# Random sample indexes
set.seed(314)
index <- sample(x = 1:nrow(gps_train), size = 0.8 * nrow(gps_train))
# Build training and testing sets
train <- gps_train[index, colnames(gps_train) != "order_id"]
test <- gps_train[-index, colnames(gps_train) != "order_id"]
Before applying any algorithm we should scale the data (most machine learning algorithms handle scaled data better than unscaled data).
To keep the modeling reproducible and leak-free, the scaling must be computed on the training set only, and then applied to the testing set.
Indeed, the models are fitted using only the training set; whatever transformation was used to build them must be applied identically to new, unseen data (such as the testing set).
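Conceptually, this amounts to standardising each numeric column of the testing set with the training statistics; a hand-rolled sketch for a single column (fastScale() will do this for all of them below):
# Center and scale one testing-set column with the *training* mean and sd
train_mean <- mean(train$dist_total_AVAILABLE)
train_sd <- sd(train$dist_total_AVAILABLE)
test_scaled_col <- (test$dist_total_AVAILABLE - train_mean) / train_sd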
So let's first compute the scales using the build_scales function. The scaling will be performed only on numeric features.
train_scales <- build_scales(dataSet = train, cols = "auto", verbose = TRUE)
## [1] "build_scales: I will compute scale on 37 numeric columns."
## [1] "build_scales: it took me: 0.01s to compute scale for 37 numeric columns."
# Print the results for the 4 first numeric variables
print(train_scales[1:4])
## $day_cos
## $day_cos$mean
## [1] -0.01929746
##
## $day_cos$sd
## [1] 0.7066631
##
##
## $hour_start_cos
## $hour_start_cos$mean
## [1] -0.3161461
##
## $hour_start_cos$sd
## [1] 0.6051199
##
##
## $hour_start_sin
## $hour_start_sin$mean
## [1] -0.2999856
##
## $hour_start_sin$sd
## [1] 0.6664679
##
##
## $hour_end_cos
## $hour_end_cos$mean
## [1] -0.2720265
##
## $hour_end_cos$sd
## [1] 0.6312151
We can now apply the scaling to the training and testing sets (recall that the scaling computed from the training set alone is applied to both datasets).
To observe the difference, let's first look at the current mean and standard deviation of the dist_total_AVAILABLE variable.
mean(train$dist_total_AVAILABLE)
## [1] 1160.859
sd(train$dist_total_AVAILABLE)
## [1] 1478.734
mean(test$dist_total_AVAILABLE)
## [1] 1156.618
sd(test$dist_total_AVAILABLE)
## [1] 1453.397
Apply the scaling.
# apply training scaling on both training and testing datasets
train <- fastScale(dataSet = train, scales = train_scales, verbose = TRUE)
## [1] "fastScale: I will scale 37 numeric columns."
## [1] "fastScale: it took me: 0.01s to scale 37 numeric columns."
test <- fastScale(dataSet = test, scales = train_scales, verbose = TRUE)
## [1] "fastScale: I will scale 37 numeric columns."
## [1] "fastScale: it took me: 0.01s to scale 37 numeric columns."
Mean and standard deviation after scaling.
mean(train$dist_total_AVAILABLE)
## [1] -1.268333e-17
sd(train$dist_total_AVAILABLE)
## [1] 1
mean(test$dist_total_AVAILABLE)
## [1] -0.00286792
sd(test$dist_total_AVAILABLE)
## [1] 0.9828655
The data is now ready for machine learning modeling.
We will use the h2o framework to build multiple models. More specifically, the h2o.automl() automated machine learning process is very handy for creating models and automatically tuning the hyperparameters of each algorithm.
h2o.init()
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /tmp/Rtmpxcg7Uz/h2o_alex_started_from_r.out
## /tmp/Rtmpxcg7Uz/h2o_alex_started_from_r.err
##
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 260 milliseconds
## H2O cluster timezone: Europe/Paris
## H2O data parsing timezone: UTC
## H2O cluster version: 3.26.0.2
## H2O cluster version age: 2 months and 18 days
## H2O cluster name: H2O_started_from_R_alex_tkq024
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.73 GB
## H2O cluster total cores: 2
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.6.1 (2019-07-05)
h2o.no_progress()
The data has to be ‘h2o-formatted’ in order to be used.
train_h2o <- as.h2o(x = train)
test_h2o <- as.h2o(x = test)
h2o.describe(train_h2o) %>%
head()
## Label Type Missing Zeros PosInf NegInf Min Max
## 1 label enum 0 843 0 0 0.000000 1.000000
## 2 is_goride enum 0 1392 0 0 0.000000 1.000000
## 3 day_cos real 0 0 0 0 -1.247655 1.442409
## 4 is_weekend enum 0 2083 0 0 0.000000 1.000000
## 5 is_two_day_ride enum 0 2784 0 0 0.000000 1.000000
## 6 is_biz_hour enum 0 415 0 0 0.000000 1.000000
## Mean Sigma Cardinality
## 1 6.989286e-01 0.45880563 2
## 2 5.028571e-01 0.50008114 2
## 3 8.895879e-17 1.00000000 NA
## 4 2.560714e-01 0.43653970 2
## 5 5.714286e-03 0.07539007 2
## 6 8.517857e-01 0.35537574 2
# Identify predictors and response
y <- "label"
x <- setdiff(names(train_h2o), y)
# Run AutoML for up to 40 base models (runtime capped here at 120 seconds)
# Exclude Deep Learning
# The metric used on Kaggle is AUC
gps_automl <- h2o.automl(x = x, y = y,
training_frame = train_h2o,
leaderboard_frame = test_h2o,
max_models = 40, max_runtime_secs = 120,
exclude_algos = c("DeepLearning"),
sort_metric = "AUC",
seed = 42)
# Extract the AutoML Leaderboard
gps_lb <- gps_automl@leaderboard
# View the 10 best models
gps_lb %>%
head(10)
## model_id auc logloss
## 1 StackedEnsemble_AllModels_AutoML_20191015_153643 0.9007076 0.3604215
## 2 StackedEnsemble_BestOfFamily_AutoML_20191015_153643 0.9004798 0.3599897
## 3 XGBoost_1_AutoML_20191015_153643 0.9001660 0.3512131
## 4 GBM_2_AutoML_20191015_153643 0.8963142 0.3623349
## 5 GBM_3_AutoML_20191015_153643 0.8937885 0.3650331
## 6 GBM_5_AutoML_20191015_153643 0.8935506 0.3630208
## 7 XGBoost_3_AutoML_20191015_153643 0.8925534 0.3637106
## 8 XGBoost_2_AutoML_20191015_153643 0.8911008 0.3677750
## 9 GBM_4_AutoML_20191015_153643 0.8890407 0.3779071
## 10 GBM_1_AutoML_20191015_153643 0.8854116 0.3801981
## mean_per_class_error rmse mse
## 1 0.2295918 0.3392682 0.1151029
## 2 0.2062075 0.3390417 0.1149492
## 3 0.2174036 0.3382877 0.1144386
## 4 0.2434807 0.3424413 0.1172660
## 5 0.2158447 0.3449120 0.1189643
## 6 0.2406463 0.3446288 0.1187690
## 7 0.2560941 0.3451568 0.1191332
## 8 0.1965703 0.3458641 0.1196220
## 9 0.2356859 0.3491716 0.1219208
## 10 0.2633220 0.3519303 0.1238549
We can have a look at variable importance from the top model.
gps_best <- h2o.getModel(model_id = gps_automl@leader@model_id)
h2o.varimp_plot(gps_best)
## Warning in max(tvi$coefficients): no non-missing arguments to max; returning
## -Inf
In our best model, the accuracy of the GPS seems to be the best predictor.
We can also have a look at the confusion matrix on our testing set.
h2o.confusionMatrix(object = gps_best, newdata = test_h2o)
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.366185574791611:
## 0 1 Error Rate
## 0 120 76 0.387755 =76/196
## 1 36 468 0.071429 =36/504
## Totals 156 544 0.160000 =112/700
We have now created a model with a fairly high AUC (around 0.90).
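We can also read this AUC directly from the held-out testing set (a quick check with h2o's performance helpers):
# AUC of the leading model on the 20% testing set
perf_test <- h2o.performance(gps_best, newdata = test_h2o)
h2o.auc(perf_test)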
Let’s use this model to create predictions on the Kaggle test data.
In order to apply the scaling to the gps_test dataset, we first have to remove the order_id column (as we did for the gps_train dataset).
kaggle_test <- gps_test %>% select(-order_id)
Now we can apply the scaling.
kaggle_test <- fastScale(dataSet = kaggle_test, scales = train_scales, verbose = TRUE)
## [1] "fastScale: I will scale 37 numeric columns."
## [1] "fastScale: it took me: 0.01s to scale 37 numeric columns."
Import the gps_test data into the h2o framework (removing the label column).
kaggle_test_h2o <- kaggle_test %>%
select(-label) %>%
as.h2o()
Predict on the kaggle_test set.
gps_pred <- h2o.predict(gps_automl@leader, newdata = kaggle_test_h2o)
head(gps_pred)
## predict p0 p1
## 1 1 0.28170026 0.71829974
## 2 0 0.84895809 0.15104191
## 3 1 0.03785942 0.96214058
## 4 0 0.90090639 0.09909361
## 5 1 0.04817225 0.95182775
## 6 1 0.04902424 0.95097576
The predictions are in the predict column of the gps_pred object.
Let's extract it and build our csv file for submission.
# Create final csv file
final <- tibble(order_id = gps_test$order_id,
label = as.vector(as.numeric(gps_pred$predict)))
head(final)
## # A tibble: 6 x 2
## order_id label
## <fct> <int>
## 1 F1770 1
## 2 RB1125 0
## 3 RB714 1
## 4 RB610 0
## 5 F1432 1
## 6 F106 1
write_csv(x = final, path = "h2o_final.csv")
# close h2o framework
h2o.shutdown(prompt = FALSE)
## [1] TRUE