Intro

Background

Scotty is a ride-sharing business that operates in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey’s citizens and really values efficiency in traveling through traffic; the app even references Star Trek’s “beam me up” in its order buttons.

Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve some forecasting and classification problems in order to improve their business processes.

Problem

Scotty turns out to be a very popular service in Turkey! Demand for Scotty began to overload in some regions at some times, and there were not enough drivers at those times and places. Fortunately, we know we can use a classification model to predict which regions and times are risky enough to have this “no drivers” problem.

Create a classification model report that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). The predictions should cover the coverage status for each hour and each area: “sufficient” or “insufficient”.

Our target is to find which hours and areas have the “insufficient” label, meaning we want to know which hours and areas really need more drivers.

Import Library

To solve this case we need to import some packages that help us process the data and build predictions:

# Pair plots and other EDA helpers
library(GGally)
# Wrangling packages
library(tidyverse)
library(tidymodels)
# Date processing package
library(lubridate)
# Scaling numeric axes
library(scales)
# Sampling helper
library(rsample)
# Confusion matrix
library(caret)
# Decision tree package
library(partykit)
# SMOTE for unbalanced data
library(DMwR)
# KNN modelling
library(class)
# ROC and AUC
library(ROCR)
# Naive Bayes
library(e1071)
# Interactive visualization
library(plotly)
# Interpret black-box models such as random forest
library(lime)

We will also source some scripts to help us finish this analysis.

source("script/matrix_result.R")
source("script/metrics.R")
source("script/confussion_matrix_plot.R")

Read Data

Scotty provided detailed transaction data from October 1st, 2017 to December 2nd, 2017 as the train dataset. Based on this train data we can start building a model and making predictions.

train <- read_csv("data/data-train.csv")
test <- read_csv("data/data-test.csv")

Let’s check the structure of the train data:

glimpse(train)
## Observations: 229,532
## Variables: 16
## $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id          <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id           <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time         <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat            <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon            <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area           <chr> "sxk9", "sxk8", "sxk9", "sxk9", "sxk9", "sxk9", "s…
## $ src_sub_area       <chr> "sxk9s", "sxk8y", "sxk9e", "sxk9s", "sxk9e", "sxk9…
## $ dest_lat           <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon           <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area          <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "s…
## $ dest_sub_area      <chr> "sxk9u", "sxk9s", "sxk9e", "sxk9e", "sxk9e", "sxk9…
## $ distance           <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status             <chr> "confirmed", "confirmed", "nodrivers", "confirmed"…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…

The dataset includes information about:

  • id: Transaction id
  • trip_id: Trip id
  • driver_id: Driver id
  • rider_id: Rider id
  • start_time: Time when the user made the order
  • src_lat: Request source latitude
  • src_lon: Request source longitude
  • src_area: Request source area
  • src_sub_area: Request source sub-area
  • dest_lat: Requested destination latitude
  • dest_lon: Requested destination longitude
  • dest_area: Requested destination area
  • dest_sub_area: Requested destination sub-area
  • distance: Trip distance (in KM)
  • status: Trip status (all status considered as a demand)
  • confirmed_time_sec: Time difference from request to confirmation (in seconds)

Data Wrangling

Based on this information, we can’t see a coverage variable inside the train data, but the train data does contain Scotty’s detailed orders. Some variables have mismatched data types, so we will change them:

train <- train %>%
  mutate(src_area = as.factor(src_area),
        src_sub_area = as.factor(src_sub_area),
        dest_area = as.factor(dest_area),
        dest_sub_area = as.factor(dest_sub_area),
        status = as.factor(status)
        )

Let’s preview the contents of the data to understand the train data better:

head(train)
## # A tibble: 6 x 16
##   id    trip_id driver_id rider_id start_time          src_lat src_lon src_area
##   <chr> <chr>   <chr>     <chr>    <dttm>                <dbl>   <dbl> <fct>   
## 1 59d0… 59d005… 59a892c5… 59ad2d6… 2017-10-01 00:00:17    41.1    29.0 sxk9    
## 2 59d0… 59d006… 59a13556… 59ce930… 2017-10-01 00:02:34    40.9    29.1 sxk8    
## 3 59d0… <NA>    <NA>      59cd704… 2017-10-01 00:02:34    41.1    29.0 sxk9    
## 4 59d0… 59d006… 599dc0df… 59bd62c… 2017-10-01 00:03:29    41.0    29.0 sxk9    
## 5 59d0… 59d007… 59a58565… 59c17cd… 2017-10-01 00:04:24    41.1    29.0 sxk9    
## 6 59d0… 59d007… 59579d5f… 594afb1… 2017-10-01 00:05:32    41.0    28.9 sxk9    
## # … with 8 more variables: src_sub_area <fct>, dest_lat <dbl>, dest_lon <dbl>,
## #   dest_area <fct>, dest_sub_area <fct>, distance <dbl>, status <fct>,
## #   confirmed_time_sec <dbl>

We found there are NA values inside trip_id and driver_id. To make it clearer, I will check the NA values in all columns before taking any data wrangling action:

summary(is.na(train))
##      id           trip_id        driver_id        rider_id      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:229532    FALSE:214631    FALSE:214632    FALSE:229532   
##                  TRUE :14901     TRUE :14900                    
##  start_time       src_lat         src_lon         src_area      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:229532    FALSE:229532    FALSE:229532    FALSE:229532   
##                                                                 
##  src_sub_area     dest_lat        dest_lon       dest_area      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:229532    FALSE:229532    FALSE:229532    FALSE:229532   
##                                                                 
##  dest_sub_area    distance         status        confirmed_time_sec
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical     
##  FALSE:229532    FALSE:229532    FALSE:229532    FALSE:229532      
## 

We found NA values only in trip_id and driver_id. Let’s compare some rows that have NA values with non-NA rows:

train[c(1:3),]
## # A tibble: 3 x 16
##   id    trip_id driver_id rider_id start_time          src_lat src_lon src_area
##   <chr> <chr>   <chr>     <chr>    <dttm>                <dbl>   <dbl> <fct>   
## 1 59d0… 59d005… 59a892c5… 59ad2d6… 2017-10-01 00:00:17    41.1    29.0 sxk9    
## 2 59d0… 59d006… 59a13556… 59ce930… 2017-10-01 00:02:34    40.9    29.1 sxk8    
## 3 59d0… <NA>    <NA>      59cd704… 2017-10-01 00:02:34    41.1    29.0 sxk9    
## # … with 8 more variables: src_sub_area <fct>, dest_lat <dbl>, dest_lon <dbl>,
## #   dest_area <fct>, dest_sub_area <fct>, distance <dbl>, status <fct>,
## #   confirmed_time_sec <dbl>

We found the relation explaining why trip_id and driver_id are NA: it happens when status is “nodrivers”, and they are filled in when status is “confirmed”. This will be our clue for extracting the coverage information.
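
We can sanity-check this relation with a quick aggregation (an illustrative check using only columns from the train data; it is not part of the original pipeline):

# count NA trip_id / driver_id per status: the NAs should concentrate
# in the "nodrivers" rows
train %>%
  group_by(status) %>%
  summarise(na_trip = sum(is.na(trip_id)),
            na_driver = sum(is.na(driver_id)),
            n = n())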

Based on the target we set before, we will predict 7 days at each hour for each area. Why don’t we use a daily time level only? Because then we can’t explain in detail at which times more drivers are ordered. But if we go down to the minute level, the data becomes too sparse and confusing. So an hourly interval is the best time level to predict.

We can see from the structure that we don’t have an hour variable. We have start_time as the user’s order time, from which we can extract the specific levels date, day, and hour:

train_hourly <- train %>%
  mutate(src_area = as.factor(src_area),
        date = date(start_time),
        day = as.factor(wday(start_time, label = TRUE, abbr = FALSE)),
        hour = as.factor(hour(start_time)))
glimpse(train_hourly)
## Observations: 229,532
## Variables: 19
## $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id          <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id           <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time         <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat            <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon            <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area           <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area       <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat           <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon           <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area          <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area      <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance           <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status             <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
## $ date               <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 2…
## $ day                <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Su…
## $ hour               <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

We have new variables: date explains when the order happened, day explains the day of the week, and hour explains the hour at which the user ordered a driver.

summary(train_hourly)
##       id              trip_id           driver_id           rider_id        
##  Length:229532      Length:229532      Length:229532      Length:229532     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    start_time                     src_lat         src_lon      src_area     
##  Min.   :2017-10-01 00:00:17   Min.   :40.79   Min.   :28.48   sxk3:  9946  
##  1st Qu.:2017-10-19 19:36:04   1st Qu.:41.00   1st Qu.:28.97   sxk8:  5636  
##  Median :2017-11-07 14:08:48   Median :41.03   Median :29.00   sxk9:213950  
##  Mean   :2017-11-04 18:32:48   Mean   :41.03   Mean   :29.00                
##  3rd Qu.:2017-11-19 03:44:19   3rd Qu.:41.06   3rd Qu.:29.04                
##  Max.   :2017-12-02 23:59:20   Max.   :41.13   Max.   :29.18                
##                                                                             
##   src_sub_area      dest_lat         dest_lon          dest_area     
##  sxk9e  :40869   Min.   : 5.923   Min.   :  0.1273   sxk9   :205449  
##  sxk9s  :24732   1st Qu.:41.001   1st Qu.: 28.9689   sxk3   : 12516  
##  sxk97  :24512   Median :41.036   Median : 29.0036   sxk8   :  4973  
##  sxk9k  :16612   Mean   :41.032   Mean   : 29.0025   sxkc   :  2264  
##  sxk9j  :12336   3rd Qu.:41.064   3rd Qu.: 29.0475   sxkb   :  1994  
##  sxk9h  :12205   Max.   :52.370   Max.   :121.4721   sxkd   :  1911  
##  (Other):98266                                       (Other):   425  
##  dest_sub_area       distance              status       confirmed_time_sec
##  sxk9e  : 35941   Min.   :   0.000   confirmed:214631   Min.   :  0.00    
##  sxk9s  : 25346   1st Qu.:   2.802   nodrivers: 14901   1st Qu.: 12.00    
##  sxk97  : 23031   Median :   4.804                      Median : 26.00    
##  sxk9k  : 13868   Mean   :   6.772                      Mean   : 46.41    
##  sxk9j  : 13704   3rd Qu.:   7.943                      3rd Qu.: 66.00    
##  sxk9h  : 10423   Max.   :9345.566                      Max.   :514.00    
##  (Other):107219                                                           
##       date                   day             hour       
##  Min.   :2017-10-01   Sunday   :21063   18     : 25403  
##  1st Qu.:2017-10-19   Monday   :27425   19     : 20572  
##  Median :2017-11-07   Tuesday  :31190   17     : 19156  
##  Mean   :2017-11-04   Wednesday:31999   16     : 16090  
##  3rd Qu.:2017-11-19   Thursday :32251   8      : 14938  
##  Max.   :2017-12-02   Friday   :50726   15     : 14227  
##                       Saturday :34878   (Other):119146

To convert the dataset cleanly to an hourly interval, we need to group the transaction data and summarise it hourly.

For example, if we have data with times 06:01:00, 06:30:00, and 07:00:00, they will be grouped so that 06:00:00 has 2 transactions and 07:00:00 has 1 transaction.
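
As a toy illustration of this bucketing (using lubridate’s floor_date, which is not used in the pipeline below but shows the same idea):

# flooring to the hour puts 06:01 and 06:30 into the same 06:00 bucket
times <- ymd_hms(c("2017-10-01 06:01:00", "2017-10-01 06:30:00", "2017-10-01 07:00:00"))
table(floor_date(times, unit = "hour"))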

We will group the data by src_area, date, hour, and status for aggregation. After grouping, we summarise total_transaction, which explains how many transactions fall within each interval; for status we already know the 2 classes, confirmed and nodrivers, and we count their frequencies:

coverage_hourly <- train_hourly %>% 
  group_by(src_area, date, day, hour, status) %>% 
  summarise(n = n()) %>%
  ungroup() %>%
  spread(status, n, fill=0) %>% 
  mutate(
    total_transaction = confirmed + nodrivers,
    datetime = ymd_h(paste(date, hour)))

#preview the data
head(coverage_hourly)
## # A tibble: 6 x 8
##   src_area date       day   hour  confirmed nodrivers total_transacti…
##   <fct>    <date>     <ord> <fct>     <dbl>     <dbl>            <dbl>
## 1 sxk3     2017-10-01 Sund… 0             0         1                1
## 2 sxk3     2017-10-01 Sund… 1             1         1                2
## 3 sxk3     2017-10-01 Sund… 3             0         1                1
## 4 sxk3     2017-10-01 Sund… 4             0         1                1
## 5 sxk3     2017-10-01 Sund… 7             0         1                1
## 6 sxk3     2017-10-01 Sund… 8             1         2                3
## # … with 1 more variable: datetime <dttm>

To build a prediction model and do good exploratory data analysis, we need a complete time interval. Let’s take a look at the summary of our interval data:

summary(coverage_hourly$datetime)
##                  Min.               1st Qu.                Median 
## "2017-10-01 00:00:00" "2017-10-16 19:00:00" "2017-11-01 17:00:00" 
##                  Mean               3rd Qu.                  Max. 
## "2017-11-01 15:05:47" "2017-11-17 10:00:00" "2017-12-02 23:00:00"

Based on the summary, our data starts at 2017-10-01 00:00:00 and ends at 2017-12-02 23:00:00. We will use these as the start and end of the interval:

sorted_datetime <- coverage_hourly[order(coverage_hourly$datetime),]

datetime_start <- sorted_datetime$datetime[1]
datetime_end <- sorted_datetime$datetime[length(sorted_datetime$datetime)]

datetime_start
## [1] "2017-10-01 UTC"
datetime_end
## [1] "2017-12-02 23:00:00 UTC"

Make a complete list from the start of the interval to the end, in hourly steps:

all_date <- seq(datetime_start, datetime_end, by = "hour")
all_date <- data.frame(list(datetime = all_date))

# total hours from start to end date
nrow(all_date)
## [1] 1512

In total we have 1512 hourly intervals from start to end. We will pad our train data with this complete time interval:

sxk3 <- coverage_hourly %>% 
  filter(src_area == "sxk3") %>% 
  arrange(desc(datetime)) %>% 
  merge(all_date, all = T) %>% 
  mutate(src_area = as.factor("sxk3"))

sxk8 <- coverage_hourly %>% 
  filter(src_area == "sxk8") %>% 
  arrange(desc(datetime)) %>% 
  merge(all_date, all = T) %>% 
  mutate(src_area = as.factor("sxk8"))

sxk9 <- coverage_hourly %>% 
  filter(src_area == "sxk9") %>% 
  arrange(desc(datetime)) %>% 
  merge(all_date, all = T) %>% 
  mutate(src_area = as.factor("sxk9"))

coverage_all_date <- do.call("rbind", list(sxk3, sxk8, sxk9))

head(coverage_all_date)
##              datetime src_area       date    day hour confirmed nodrivers
## 1 2017-10-01 00:00:00     sxk3 2017-10-01 Sunday    0         0         1
## 2 2017-10-01 01:00:00     sxk3 2017-10-01 Sunday    1         1         1
## 3 2017-10-01 02:00:00     sxk3       <NA>   <NA> <NA>        NA        NA
## 4 2017-10-01 03:00:00     sxk3 2017-10-01 Sunday    3         0         1
## 5 2017-10-01 04:00:00     sxk3 2017-10-01 Sunday    4         0         1
## 6 2017-10-01 05:00:00     sxk3       <NA>   <NA> <NA>        NA        NA
##   total_transaction
## 1                 1
## 2                 2
## 3                NA
## 4                 1
## 5                 1
## 6                NA
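
As an aside, the same padding could be written in one step with tidyr::complete; this is only a sketch of the equivalent call, not part of the pipeline:

# build every src_area x hour combination, filling unseen ones with 0
# (the manual merge above plus the NA replacement below, combined)
coverage_all_date_alt <- coverage_hourly %>%
  complete(src_area,
           datetime = seq(datetime_start, datetime_end, by = "hour"),
           fill = list(confirmed = 0, nodrivers = 0, total_transaction = 0))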

Implementing the new interval introduces NA values in some rows, because our train data does not have any transactions during some intervals. We will fill these NAs with 0:

coverage_all_date <- coverage_all_date %>% 
  mutate(
        date = date(datetime),
        hour = as.factor(hour(datetime)),
        day = as.factor(wday(datetime, label = TRUE, abbr = FALSE)),
        confirmed = replace_na(confirmed, 0),
        nodrivers = replace_na(nodrivers, 0),
        total_transaction = replace_na(total_transaction, 0)
        )
head(coverage_all_date)
##              datetime src_area       date    day hour confirmed nodrivers
## 1 2017-10-01 00:00:00     sxk3 2017-10-01 Sunday    0         0         1
## 2 2017-10-01 01:00:00     sxk3 2017-10-01 Sunday    1         1         1
## 3 2017-10-01 02:00:00     sxk3 2017-10-01 Sunday    2         0         0
## 4 2017-10-01 03:00:00     sxk3 2017-10-01 Sunday    3         0         1
## 5 2017-10-01 04:00:00     sxk3 2017-10-01 Sunday    4         0         1
## 6 2017-10-01 05:00:00     sxk3 2017-10-01 Sunday    5         0         0
##   total_transaction
## 1                 1
## 2                 2
## 3                 0
## 4                 1
## 5                 1
## 6                 0

Exploratory Data Analysis

Our task is to make predictions covering the coverage status for each hour and each area: “sufficient” or “insufficient”. Next we will make our target variable coverage based on the variables we have in the dataset. status is the variable we use to label coverage: rows with nodrivers > 0 will be labeled insufficient, and rows with nodrivers = 0 will be labeled sufficient.

coverage_all_date <- coverage_all_date %>%
  mutate(coverage = as.factor(if_else(nodrivers == 0, "sufficient", "insufficient")))

head(coverage_all_date)
##              datetime src_area       date    day hour confirmed nodrivers
## 1 2017-10-01 00:00:00     sxk3 2017-10-01 Sunday    0         0         1
## 2 2017-10-01 01:00:00     sxk3 2017-10-01 Sunday    1         1         1
## 3 2017-10-01 02:00:00     sxk3 2017-10-01 Sunday    2         0         0
## 4 2017-10-01 03:00:00     sxk3 2017-10-01 Sunday    3         0         1
## 5 2017-10-01 04:00:00     sxk3 2017-10-01 Sunday    4         0         1
## 6 2017-10-01 05:00:00     sxk3 2017-10-01 Sunday    5         0         0
##   total_transaction     coverage
## 1                 1 insufficient
## 2                 2 insufficient
## 3                 0   sufficient
## 4                 1 insufficient
## 5                 1 insufficient
## 6                 0   sufficient

Up to this step, we have our target variable coverage with 2 levels, “sufficient” and “insufficient”, in the train data.

Let’s check the proportions of our target variable coverage. Along the way we have prepared two data frames: coverage_hourly before padding, and coverage_all_date after padding. Let’s check the proportions before and after padding.

coverage_hourly <- coverage_hourly %>%
  mutate(coverage = as.factor(if_else(nodrivers == 0, "sufficient", "insufficient")))

proportion <- data.frame(prop.table(table(coverage_hourly$coverage)))

proportion %>% 
  ggplot(aes(x = Var1, y = Freq)) + 
  geom_bar(aes(fill = Freq), stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = percent(Freq)),
            fontface = "bold",
            color = "white",
            position = position_stack(vjust = 0.5)) +
  theme(legend.position = "none") +
  labs(
    title = "Coverage proportion before padding",
    x = "",
    y = ""
  )

Before padding, 64% of the data is labeled insufficient and 36% sufficient. Next we show the train data after padding with the new time interval:

proportion <- data.frame(prop.table(table(coverage_all_date$coverage)))

proportion %>% 
  ggplot(aes(x = Var1, y = Freq)) + 
  geom_bar(aes(fill = Freq), stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = percent(Freq)),
            fontface = "bold",
            color = "white",
            position = position_stack(vjust = 0.5)) +
  theme(legend.position = "none") +
  labs(
    title = "Coverage proportion after padding",
    x = "",
    y = ""
  )

The proportion after padding is better than before: 54.1% labeled insufficient and 45.9% labeled sufficient.

This dataset has src_area describing the different areas; let’s check their frequencies in our train dataset:

proportion <- data.frame(summary(coverage_all_date$src_area)) %>% 
  rownames_to_column(., "VALUE") %>%
  mutate("Source Area" = .[,1],
         "Freq" = .[,2]) %>% 
  select(3,4)

proportion
##   Source Area Freq
## 1        sxk3 1512
## 2        sxk8 1512
## 3        sxk9 1512

It shows each area has the same frequency of 1512 rows, because we padded the train dataset earlier. Let’s check the proportion of coverage for each area:

proportion <- coverage_all_date %>% 
  group_by(src_area) %>% 
  summarise(insufficient = sum(coverage == "insufficient"),
            sufficient = sum(coverage == "sufficient"),
            total = n(),
            "proportion insufficient" = insufficient/n(),
            "proportion sufficient" = sufficient/n())

proportion
## # A tibble: 3 x 6
##   src_area insufficient sufficient total `proportion insuffi… `proportion suffi…
##   <fct>           <int>      <int> <int>                <dbl>              <dbl>
## 1 sxk3              947        565  1512                0.626              0.374
## 2 sxk8              258       1254  1512                0.171              0.829
## 3 sxk9             1249        263  1512                0.826              0.174

It’s clearer if we visualize it:

proportion %>% 
  select("src_area","proportion insufficient", "proportion sufficient") %>% 
  rename("insufficient" = "proportion insufficient",
         "sufficient" = "proportion sufficient") %>% 
  gather(key = "prop",
         value = "value",
         -src_area) %>% 
  ggplot(aes(x = prop, y = value)) +
  geom_bar(aes(fill = prop), stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = percent(value)),
            fontface = "bold",
            color = "white",
            position = position_stack(vjust = 0.5)) +
  facet_wrap( ~ src_area) +
  labs(
    title = "Coverage proportion group by Area",
    x = "",
    y = ""
  ) +
  theme(legend.position = "none") 

Of the 3 areas we have, only sxk8 has more than 50% sufficient labels, which indicates the other areas, sxk3 and sxk9, need more drivers. Let’s check another visualization to get better insight.

We will try a heatmap-style visualization:

coverage_all_date %>% 
  mutate(day = as.factor(wday(datetime, label = TRUE, abbr = TRUE))) %>% 
  group_by(day, hour, src_area) %>% 
  summarise(sufficient = sum(coverage == "sufficient")) %>% 
  ggplot(aes(x = day, y = hour, fill = sufficient)) +
  geom_tile() +
  facet_wrap(~src_area) +
   theme(legend.position = "bottom") 

To make it clearer and get more detailed insight, I will make another visualization using a bar plot:

coverage_all_date %>% 
  select(-c("datetime", "date", "day")) %>% 
  group_by(hour, src_area) %>% 
  summarise(insufficient = sum(coverage == "insufficient"),
            sufficient = sum(coverage == "sufficient")) %>% 
  gather("type","value",-src_area, -hour) %>% 
  ggplot(aes(x = hour, y = value, fill = type)) +
  geom_col(position = "dodge") +
  facet_grid(src_area ~ .) +
  theme(legend.position = "bottom")  +
  labs(title = "Coverage hourly group by Area")

Some insights we can get:
1. sxk9 is the area with the most insufficient drivers; the heatmap is dark for all days, meaning on every day and at every hour users did not get a driver when they ordered.
2. sxk8 is the area with the most sufficient drivers, with more confirmed drivers at every hour than the other areas.
3. However, even in area sxk8 at 08:00 the insufficient label is nearly more frequent than the sufficient label, especially on working days from Monday to Friday, shown as darker blue on the heatmap.
4. Area sxk3 gets more drivers from 06:00 in the morning, when people start their activities, until late night around 00:00, and the heatmap shows this happens every day.
5. There is a correlation between day, hour, and area and coverage.

As a reminder, the full train dataset still contains variables we have not used yet:

glimpse(train)
## Observations: 229,532
## Variables: 16
## $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id          <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id           <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time         <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat            <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon            <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area           <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area       <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat           <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon           <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area          <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area      <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance           <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status             <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…

Data Preprocessing

Meanwhile, after several data processing steps we have made a new data frame, coverage_hourly, with this structure:

glimpse(coverage_hourly)
## Observations: 3,862
## Variables: 9
## $ src_area          <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk…
## $ date              <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 20…
## $ day               <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sun…
## $ hour              <fct> 0, 1, 3, 4, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ confirmed         <dbl> 0, 1, 0, 0, 0, 1, 2, 0, 3, 7, 4, 2, 0, 5, 0, 0, 0, …
## $ nodrivers         <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 1, 3, 1, …
## $ total_transaction <dbl> 1, 2, 1, 1, 1, 3, 3, 1, 4, 8, 4, 3, 1, 5, 1, 3, 1, …
## $ datetime          <dttm> 2017-10-01 00:00:00, 2017-10-01 01:00:00, 2017-10-…
## $ coverage          <fct> insufficient, insufficient, insufficient, insuffici…

Some variables in our train dataset are still unused in this new data frame; we can add them to it:

int_var <- train_hourly %>% 
  group_by(src_area, date, day, hour) %>% 
  summarise(mean_distance = mean(distance),
            sum_distance = sum(distance),
            median_distance = median(distance),
            mean_confirmed_time_sec = mean(confirmed_time_sec),
            sum_confirmed_time_sec = sum(confirmed_time_sec),
            median_confirmed_time_sec = median(confirmed_time_sec)) %>% 
  ungroup() %>% 
  select(-c(src_area, date, day, hour))
glimpse(int_var)
## Observations: 3,862
## Variables: 6
## $ mean_distance             <dbl> 1.6709748, 8.4075656, 14.1027795, 3.6331489…
## $ sum_distance              <dbl> 1.6709748, 16.8151311, 14.1027795, 3.633148…
## $ median_distance           <dbl> 1.6709748, 8.4075656, 14.1027795, 3.6331489…
## $ mean_confirmed_time_sec   <dbl> 0.00000, 22.00000, 0.00000, 0.00000, 0.0000…
## $ sum_confirmed_time_sec    <dbl> 0, 44, 0, 0, 0, 12, 42, 0, 306, 302, 310, 5…
## $ median_confirmed_time_sec <dbl> 0.0, 22.0, 0.0, 0.0, 0.0, 0.0, 9.0, 0.0, 43…

From the variables distance and confirmed_time_sec we compute the mean, median, and sum to make new variables. We combine them with coverage_hourly:

coverage_hourly <- cbind(coverage_hourly, int_var)

Cross Validation

Cross-validation (CV) is a statistical method for evaluating the performance of models or algorithms, where the data is separated into two subsets: data for the learning process and data for validation/evaluation. We already have train and test datasets; in this case I will use 3 datasets: train, validation, and test.

Let’s check the proportions of our existing train and test data:

nrow(train)
## [1] 229532
nrow(test)
## [1] 504

In total we have 229,532 observations, and our test data has 504 targets to predict. The proportion is already good, because we have more than 99% of the data to train our model.

I will separate this train dataset into 2 datasets: training data and validation data, taking 20% of the train data for validation. Because the coverage classes are somewhat imbalanced, I also create an upsampled version of the training data with caret::upSample, which resamples the minority class until both classes are equal:

set.seed(100)
initial_train <- initial_split(coverage_all_date, prop = 0.80, strata = "coverage")
train_training <- training(initial_train)
train_validation <- testing(initial_train)
prop.table(table(train_training$coverage))
## 
## insufficient   sufficient 
##    0.5410468    0.4589532
train_training_upsample <- upSample(x = within(train_training, rm("coverage")), y = train_training$coverage, yname = "coverage")

prop.table(table(train_training_upsample$coverage))
## 
## insufficient   sufficient 
##          0.5          0.5

Modeling

Model Fitting

Before choosing a model, note that the test dataset imposes a rule: our model’s output must fit the test structure. Let’s take a look at the test data:

glimpse(test)
## Observations: 504
## Variables: 3
## $ src_area <chr> "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk…
## $ datetime <dttm> 2017-12-03 00:00:00, 2017-12-03 01:00:00, 2017-12-03 02:00:…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
summary(test)
##    src_area            datetime                   coverage      
##  Length:504         Min.   :2017-12-03 00:00:00   Mode:logical  
##  Class :character   1st Qu.:2017-12-04 17:45:00   NA's:504      
##  Mode  :character   Median :2017-12-06 11:30:00                 
##                     Mean   :2017-12-06 11:30:00                 
##                     3rd Qu.:2017-12-08 05:15:00                 
##                     Max.   :2017-12-09 23:00:00

The test data contains 3 variables and will hold our prediction results; in this case we will try to predict the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017).

We have src_area, the source area where the user orders a driver. We can see the datetime interval is separated by 1 hour, meaning our predictions will be hourly over 7 days. We also have the coverage column, our target variable containing sufficient and insufficient; based on our business case we want to treat insufficient as the priority, so it will be the positive class.

Our predictors are strictly limited to these 3 test variables; from datetime we can extract the hourly interval:

test <- test %>%
  mutate(src_area = as.factor(src_area),
        date = date(datetime),
        day = as.factor(wday(datetime, label = TRUE, abbr = FALSE)),
        hour = as.factor(hour(datetime)))
glimpse(test)
## Observations: 504
## Variables: 6
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ datetime <dttm> 2017-12-03 00:00:00, 2017-12-03 01:00:00, 2017-12-03 02:00:…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ date     <date> 2017-12-03, 2017-12-03, 2017-12-03, 2017-12-03, 2017-12-03,…
## $ day      <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sund…
## $ hour     <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…

Based on the test data exploration above, the only predictor variables we can use are src_area, datetime, date, day, and hour, and the target is still coverage with its 2 classes, sufficient and insufficient. We can remove the unused variables from the train and validation data we made before:

train_training <- train_training %>% 
  select(c("src_area", "day", "hour", "coverage"))
train_validation <- train_validation %>% 
    select(c("src_area", "day", "hour", "coverage"))
train_training_upsample <- train_training_upsample %>% 
    select(c("src_area", "day", "hour", "coverage"))

For the models, we can only choose those that accept factor/categorical predictor variables and a categorical target variable. I decided to compare 4 types of model:

  1. Logistic Regression
  2. Naive Bayes
  3. Decision Tree
  4. Random Forest

Evaluation

In this case, to compare the models we will use a confusion matrix. A confusion matrix is a table that shows four different categories: True Positive, True Negative, False Positive, and False Negative.

The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision. Accuracy measures how much of our data is correctly predicted. Sensitivity measures, out of all positive outcomes, how many are correctly predicted. Specificity measures how many negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct.
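
To make these definitions concrete, here is a small sketch of how the four metrics derive from the confusion-matrix counts, with “insufficient” treated as the positive class (tp, tn, fp, fn are the four cells):

# sketch: the four metrics as functions of the confusion matrix cells
accuracy    <- function(tp, tn, fp, fn) (tp + tn) / (tp + tn + fp + fn)
recall      <- function(tp, tn, fp, fn) tp / (tp + fn)  # a.k.a. sensitivity
specificity <- function(tp, tn, fp, fn) tn / (tn + fp)
precision   <- function(tp, tn, fp, fn) tp / (tp + fp)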

Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).

Logistic regression is a technique well suited for examining the relationship between a categorical response variable and one or more categorical or continuous predictor variables. We can check the structure of our dataset; both the predictors and the target are factors.

glimpse(train_training_upsample)
## Observations: 3,928
## Variables: 4
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ day      <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sund…
## $ hour     <fct> 0, 3, 7, 8, 11, 12, 13, 14, 17, 19, 20, 21, 1, 7, 8, 10, 11,…
## $ coverage <fct> insufficient, insufficient, insufficient, insufficient, insu…

Assumption Checking

Logistic regression relies on 3 assumptions that we should check before using it:

  • Multicollinearity: the predictors should not correlate with each other. Our predictors are categorical and do not correlate with one another.

  • Independence of Observations: observations should be mutually independent and not originate from repeated measurements. Our training data is an hourly aggregation of historical orders; each row is a distinct area-hour combination, so it is not a repeated measurement.

  • Linearity of Predictor & Log of Odds: interpretation of the model relies on this assumption; for numerical variables, an increase of 1 unit changes the log of odds linearly.

Of these 3 assumptions, we are confident our train dataset meets the first two; the third concerns numerical predictors, which we do not have. So we can continue using our train dataset.

Model Train

We will train the logistic regression model. At first we will use all the variables inside the train data frame, with coverage as the target.

logistict_model <- glm(coverage ~ ., train_training_upsample, family = "binomial")
summary(logistict_model)
## 
## Call:
## glm(formula = coverage ~ ., family = "binomial", data = train_training_upsample)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.67425  -0.73517   0.00039   0.70405   2.69108  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.30543    0.19411  -1.573 0.115612    
## src_areasxk8  2.45943    0.10611  23.179  < 2e-16 ***
## src_areasxk9 -1.14049    0.09777 -11.665  < 2e-16 ***
## day.L        -0.26384    0.10837  -2.435 0.014912 *  
## day.Q        -0.30119    0.10796  -2.790 0.005274 ** 
## day.C         0.02223    0.10799   0.206 0.836872    
## day^4         0.41647    0.10847   3.840 0.000123 ***
## day^5         0.16530    0.10786   1.533 0.125390    
## day^6        -0.02934    0.10725  -0.274 0.784410    
## hour1         0.10385    0.27392   0.379 0.704611    
## hour2         0.68156    0.26648   2.558 0.010538 *  
## hour3         0.68393    0.26233   2.607 0.009131 ** 
## hour4         1.44160    0.27778   5.190 2.11e-07 ***
## hour5         1.39752    0.27877   5.013 5.35e-07 ***
## hour6         0.99198    0.26675   3.719 0.000200 ***
## hour7        -0.64403    0.28296  -2.276 0.022843 *  
## hour8        -2.09973    0.29788  -7.049 1.80e-12 ***
## hour9        -0.91742    0.27947  -3.283 0.001028 ** 
## hour10        0.11837    0.27001   0.438 0.661097    
## hour11        0.09742    0.27242   0.358 0.720635    
## hour12       -0.38025    0.28216  -1.348 0.177774    
## hour13       -0.69846    0.27990  -2.495 0.012581 *  
## hour14       -0.48210    0.28178  -1.711 0.087100 .  
## hour15       -0.71407    0.28540  -2.502 0.012350 *  
## hour16       -1.05535    0.28809  -3.663 0.000249 ***
## hour17       -1.12399    0.28381  -3.960 7.48e-05 ***
## hour18       -0.99653    0.28472  -3.500 0.000465 ***
## hour19       -1.10776    0.28312  -3.913 9.13e-05 ***
## hour20       -0.04379    0.27489  -0.159 0.873443    
## hour21        0.09773    0.27038   0.361 0.717750    
## hour22        0.69420    0.26831   2.587 0.009672 ** 
## hour23        0.24501    0.26759   0.916 0.359868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5445.4  on 3927  degrees of freedom
## Residual deviance: 3711.9  on 3896  degrees of freedom
## AIC: 3775.9
## 
## Number of Fisher Scoring iterations: 5

We have only 3 predictor variables, so why does the summary show so many? Because the glm function automatically performs dummy coding, which means recoding each original categorical variable into a set of binary variables that take values of one and zero.
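
We can peek at this dummy coding directly with model.matrix (an illustrative check, not part of the modeling pipeline):

# src_area (3 levels) becomes 2 binary columns; the reference level sxk3
# is absorbed into the intercept
head(model.matrix(~ src_area, data = train_training_upsample))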

From the summary we can analyze the fit and interpret what the model is telling us. We can see that, on average, our predictor variables are statistically significant. Even though some dummy-coded levels have large p-values (> 0.05), we cannot judge them one by one; we must treat each set of dummies as one variable.
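
As a side note on interpretation, exponentiating a coefficient turns a log-odds effect into an odds ratio. For example (the model predicts the probability of the second factor level, “sufficient”):

# exp(2.459) ~ 11.7: being in sxk8 multiplies the odds of "sufficient"
# by about 11.7 relative to the reference area sxk3, holding day and hour fixed
exp(coef(logistict_model)["src_areasxk8"])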

Model Evaluation

We have built our logistic regression using all the predictor variables in the dataset. Next, we use this model to predict the validation dataset.

logistict_prediction <- predict(logistict_model, train_validation, type = "response")

head(logistict_prediction)
##         2         3         5        17        23        30 
## 0.4641338 0.6068290 0.7674631 0.2136755 0.6098418 0.7396008

Logistic regression returns a value within the range [0,1], not a binary class. We will convert the probability into a class using a threshold value (by default we use 0.5).

logistict_prediction <- as.factor(if_else(logistict_prediction >= 0.5, "sufficient", "insufficient"))

head(logistict_prediction)
## [1] insufficient sufficient   sufficient   insufficient sufficient  
## [6] sufficient  
## Levels: insufficient sufficient

Now that we have class predictions, we use a confusion matrix to evaluate the model:

matrix <- confusionMatrix(logistict_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)

logistict_matrix <- matrix_result(matrix, "Logistict Reg.")
logistict_matrix
##            Model Accuracy Sensitivity Specificity Precision
## 1 Logistict Reg.  79.9117    83.26531    75.96154  80.31496

This table explains:
* Accuracy: the ability to correctly predict both classes from the total observation.
* Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
* Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
* Specificity: the ability to correctly predict the negative class from the total actual-negative class.

The result shows that our predictions on the validation dataset using the logistic regression model reach 79.91% accuracy, meaning 79.91% of our predictions are correctly classified. Precision (positive predictive value) is around 80.31%, meaning that share of our positive predictions is correctly classified. Sensitivity is 83.27% and specificity 75.96%, indicating that positive and negative cases are predicted correctly at rates of roughly 76%-83%.

Naive Bayes

The Naive Bayes algorithm, in particular, is a logic-based technique which is simple yet so powerful that it is often known to outperform complex algorithms for very large datasets. Historically, this technique became popular with applications in email filtering, spam detection, and document categorization. There are certain characteristics of Naive Bayes that should be considered:

  1. Assumes that all features of the dataset are equally important and independent. This allows Naive Bayes to perform fast computation (the algorithm is quite simple).
  2. Prone to bias due to data scarcity. In some cases, our data may have a distribution where scarce observations lead to probabilities approaching 0 or 1, which introduces a heavy bias into our model and could lead to poor performance on unseen data.
  3. More appropriate for data with categorical predictors. This is because Naive Bayes is sensitive to data scarcity, and a continuous variable might contain really scarce or even single observations for certain values.
  4. Apply the Laplace estimator/smoothing for the data scarcity problem. The Laplace estimator adds a small number (usually 1) to each of the counts in the frequency table, which ensures that each class-feature combination has a non-zero probability of occurring (illustrated in the sketch below).

Although it is often outperformed by other techniques, and despite its naive design and oversimplified assumptions, this classifier can perform well in many complex real-world problems. And since it is a resource-efficient algorithm that is fast and scales well, we will apply it to our case.
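
As a quick illustration of the Laplace estimator (point 4 above), here is a sketch comparing the class-conditional tables with and without smoothing on a single predictor:

# without smoothing, a class-feature combination never seen in training
# gets probability 0; laplace = 1 adds 1 to every count in the frequency table
m0 <- naiveBayes(coverage ~ src_area, data = train_training_upsample)
m1 <- naiveBayes(coverage ~ src_area, data = train_training_upsample, laplace = 1)
m0$tables$src_area
m1$tables$src_area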

Model Train

We need to build our model, and in this case we also apply the Laplace estimator. For the first naive Bayes model, we will use all the predictor variables we have:

naive_model <- naiveBayes(coverage ~ ., data = train_training_upsample, laplace = 1)

Model Evaluation

Next, we use this model to predict the validation dataset and evaluate it with a confusion matrix:

naive_prediction <- predict(naive_model, train_validation)
matrix <- confusionMatrix(naive_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)

naive_matrix <- matrix_result(matrix, "Naive Bayes")
naive_matrix
##         Model Accuracy Sensitivity Specificity Precision
## 1 Naive Bayes  80.3532     85.5102    74.27885  79.65779

The metrics in this table are interpreted as described in the logistic regression evaluation above.

Model Improvement

There is a way to improve our naive Bayes model; before we do it, we will check the ROC curve of our naive Bayes:

naive_prediction_raw <- as.data.frame(predict(naive_model, train_validation, type = "raw"))

# ROC
naive_roc <- data.frame(prediction = naive_prediction_raw[,2],
                        trueclass = as.numeric(train_validation$coverage=="sufficient"))

naive_roc_pred <- prediction(naive_roc$prediction, naive_roc$trueclass) 

# ROC curve
plot(performance(naive_roc_pred, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

The ROC is a curve that shows the performance of the classification model across all thresholds. The ROC plots:

  • Recall / sensitivity (true positive rate) on the y axis.
  • 1 - specificity (false positive rate) on the x axis.

Next we calculate the AUC:

# AUC
auc <- performance(naive_roc_pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8481113

AUC is the area under the ROC curve, taking values between 0 and 1. It indicates how well the model separates the two classes:
* AUC close to 1: the classification model is able to predict/distinguish the two classes well.
* AUC close to 0.5: the classification model is not able to distinguish the two classes better than chance.

Based on the ROC and AUC results, our ROC curve shows good separation with an AUC score of 0.8481, which means we have a chance to improve this Naive Bayes model.

In our case we focus on the “insufficient” label of the coverage variable, and one method of tuning the model is to change the threshold.
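
The metrics() helper comes from the sourced script/metrics.R, whose body is not shown here; a hypothetical sketch of such a helper, matching the call below, could look like this:

# hypothetical sketch: returns c(recall, accuracy, precision, specificity)
# at a given probability cutoff; postarget/negtarget are the class labels in ref
metrics_sketch <- function(cutoff, prob, ref, postarget, negtarget) {
  pred <- factor(ifelse(prob >= cutoff, postarget, negtarget), levels = levels(ref))
  cm <- table(pred, ref)
  tp <- cm[postarget, postarget]; tn <- cm[negtarget, negtarget]
  fp <- cm[postarget, negtarget]; fn <- cm[negtarget, postarget]
  c(tp / (tp + fn),            # recall
    (tp + tn) / sum(cm),       # accuracy
    tp / (tp + fp),            # precision
    tn / (tn + fp))            # specificity
}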

# model tuning - metrics function
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)

# apply function metrics
for(i in 1:100){
  result[i,] = metrics(cutoff = co[i], 
                     prob = naive_prediction_raw$sufficient, 
                     ref = as.factor(ifelse(train_validation$coverage=="sufficient", 1, 0)), 
                     postarget = "1", 
                     negtarget = "0")
}

# visualize
ggplotly(tibble("Recall" = result[,1],
           "Accuracy" = result[,2],
           "Precision" = result[,3],
           "Specificity" = result[,4],
                   "Cutoff" = co) %>% 
  gather(key = "Metrics", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Tradeoff Model Perfomance") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank()))

From the plot above, the break-even point is around 0.45. We will set the threshold at 0.48 because we are focusing on the Precision parameter while still wanting the other parameters to give good results.

#Tuning Threshold
naive_prediction_tuning <- naive_prediction_raw %>%
  mutate(label = as.factor(ifelse(sufficient >= 0.48, "sufficient", "insufficient"))) %>% 
  select(label)


naive_matrix_tuning <- confusionMatrix(naive_prediction_tuning$label, train_validation$coverage, positive = "insufficient")
naive_matrix_tuning <- matrix_result(naive_matrix_tuning, "Naive Bayes Tuning")

rbind(naive_matrix_tuning,naive_matrix)
##                Model Accuracy Sensitivity Specificity Precision
## 1 Naive Bayes Tuning  80.3532    84.69388    75.24038  80.11583
## 2        Naive Bayes  80.3532    85.51020    74.27885  79.65779

Decision Tree

The decision tree model is one of the tree-based models, whose major benefit is being interpretable. A decision tree is an algorithm that builds a set of rules visualized in a diagram that resembles a tree. There are certain characteristics of the decision tree model:

  • Perform well on both numerical and categorical variable.
  • All predictors are assumed to interact.
  • Quite robust to the problem of multicollinearity. A decision tree will choose a variable that has the highest information gain in one split, whereas a method such as logistic regression would have used both.
  • Robust and insensitive to outliers. Splitting will happen at a condition where it maximizes the homogeneity within resulting groups. Outliers will have little influence on the splitting process.

Model Train

Let’s build the decision tree model:

dtree_model <- ctree(formula = coverage ~., data = train_training_upsample)
plot(dtree_model)

A decision tree consists of several parts. The first box in the upper part of the plot is the root node. The root splits and makes branches by certain rules. Each branch ends in a node. Some nodes split again into other nodes and are called internal nodes. Nodes that do not split anymore and appear at the end of the tree are called terminal nodes or leaf nodes, just like leaves on a tree.

Each node shows:

  • the predicted class (sufficient/insufficient).
  • the probability of the sufficient/insufficient class.
  • the percentage of observations in the node.
  • the root and internal nodes also show the rules (variables with threshold/value) that partition the observations.

width(dtree_model)
## [1] 10
depth(dtree_model)
## [1] 6

We found our decision tree model creates a fairly simple tree from our train dataset: the width of 10 is the total number of leaf nodes, and the depth of 6 is the number of levels of internal nodes.

We would like to explain more about how our model chooses a class for our train data, so I will use the lime package to interpret it.

set.seed(100)
model<-as_classifier(dtree_model, labels = NULL)
explainer <- lime((train_training[,-4]),
                  model = model)
explanation <- explain(train_training[1:4,-4],
                       explainer,
                       dist_fun = "manhattan",
                       n_labels = 1,
                       n_features = 3)

plot_features(explanation)

In cases number 1, 4, 6, and 7 we can see the observations have the same src_area = sxk3 and the same day = Sunday, but the differing hour is the significant factor in how our model classifies the data. Another insight is that people use Scotty even late at night.

Case 1 shows src_area is the most important factor for the coverage label when hour and day have no significant effect.

Model Evaluation

Next, we use this model to predict the validation dataset and evaluate it with a confusion matrix:

dtree_prediction <- predict(dtree_model, train_validation)
matrix <- confusionMatrix(dtree_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)

dtree_matrix <- matrix_result(matrix, "Decision Tree")
dtree_matrix
##           Model Accuracy Sensitivity Specificity Precision
## 1 Decision Tree  80.7947    87.34694    73.07692  79.25926

The metrics in this table are interpreted as described in the logistic regression evaluation above.

Model Improvement

Our decision tree’s accuracy, sensitivity, and specificity are quite good. However, we will try tuning it, hoping for an even better result.

dtree_prediction_raw <- predict(dtree_model, train_validation, type = "prob")

dtree_roc <- prediction(dtree_prediction_raw[,2], train_validation$coverage)  
plot(performance(dtree_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

auc <- performance(dtree_roc, "auc")
auc@y.values[[1]]
## [1] 0.8448047

Based on the ROC and AUC results, our ROC curve shows good separation with an AUC score of 0.8448, which means we still have a chance to improve this decision tree model.

In our case we focus on “insufficient” in the coverage variable as the positive class, so we will try to improve Precision; one method of tuning this model is changing the pruning and tree-size parameters.

# mincriterion: value of 1 - p-value that a split must exceed (lower allows more splits);
# minsplit: minimum observations in a node for a split to be attempted;
# minbucket: minimum observations in a terminal node
ctrl <- ctree_control(mincriterion = 0.35, minsplit = 100, minbucket = 50)
dtree_tuning_model <- ctree(formula = coverage ~., data = train_training_upsample, control = ctrl)
dtree_prediction_tuning <- predict(dtree_tuning_model, train_validation)
matrix <- confusionMatrix(dtree_prediction_tuning, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)

dtree_tuning_matrix <- matrix_result(matrix, "Decision Tree Tuning")
rbind(dtree_tuning_matrix,dtree_matrix)
##                  Model Accuracy Sensitivity Specificity Precision
## 1 Decision Tree Tuning 80.68433    85.10204    75.48077  80.34682
## 2        Decision Tree 80.79470    87.34694    73.07692  79.25926

After tuning the decision tree model, precision and specificity increase by roughly 1 and 2 points compared with the previous decision tree model.

Random Forest

Random forests are based on a simple idea: “the wisdom of the crowd”. An aggregate of the results of multiple predictors gives a better prediction than the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called ensemble learning.

To improve on our previous models, we can train a group of decision tree classifiers, each on a different random subset of the train set. To make a prediction, we just obtain the predictions of all the individual trees, then predict the class that gets the most votes. This technique is called Random Forest.
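
The voting idea in miniature (a toy sketch, not tied to any model object in this analysis):

# three hypothetical trees vote on one observation; the majority class wins
tree_votes <- c("insufficient", "sufficient", "insufficient")
names(which.max(table(tree_votes)))  # "insufficient"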

One weakness of the random forest model is that it can easily overfit the train data. To decrease the chance of an overfitted model we can use k-fold cross-validation, a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
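
For intuition about what such folds look like, rsample (loaded earlier) can enumerate them; this is an illustrative aside, as the actual resampling below is handled inside caret:

# illustrative: 5 splits, each holding out roughly 1/5 of the rows for assessment
set.seed(100)
vfold_cv(train_training_upsample, v = 5)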

Model Train

Let’s implement it in our random forest model. Our case has very limited predictor variables, so to train well and hopefully get a better result, I will use repeated k-fold cross-validation:

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

Train the random forest model:

# training is slow, so the fitted model is saved to RDS and reloaded below
# randomforest_model <- train(coverage ~ ., data = train_training_upsample, method = "rf", trControl = ctrl, ntree = 10)
# 
# saveRDS(randomforest_model, file = "data/model_rforest.RDS")

Let’s see the result and get insight from our random forest model:

randomforest_model <- readRDS("data/model_rforest.RDS")

randomforest_model
## Random Forest 
## 
## 3928 samples
##    3 predictor
##    2 classes: 'insufficient', 'sufficient' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 3142, 3142, 3142, 3143, 3143, 3143, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7842424  0.5684822
##   16    0.7667785  0.5335560
##   31    0.7649200  0.5298385
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

From the model summary, the optimal number of variables considered for splitting at each tree node (mtry) is 2. Next we will use varImp() to see which features are most important to our random forest model.

varImp(randomforest_model)
## rf variable importance
## 
##   only 20 most important variables shown (out of 31)
## 
##               Overall
## src_areasxk8 100.0000
## src_areasxk9  41.2367
## hour4          5.6064
## hour8          5.4409
## hour5          3.1428
## hour6          2.3171
## hour3          2.1025
## hour17         1.5848
## hour16         1.4873
## day^6          1.3055
## day.L          1.2495
## hour19         1.1781
## hour2          1.1772
## day.Q          1.1587
## hour18         1.0768
## hour22         1.0265
## hour9          0.9842
## day^5          0.9739
## day.C          0.9728
## day^4          0.9669

We found that src_area is the most important predictor in our training dataset; the top two rows of the table are both levels of src_area. In practice, a random forest already provides an out-of-bag (OOB) estimate that can be used as a reliable estimate of its true accuracy on unseen examples. We can check the OOB error rate of our random forest model:

randomforest_model$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 50, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 21.84%
## Confusion matrix:
##              insufficient sufficient class.error
## insufficient         1609        355   0.1807536
## sufficient            503       1461   0.2561100
plot(randomforest_model$finalModel)
legend("topright", colnames(randomforest_model$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

Surprisingly, the error curves flatten out after fewer than 10 trees, so a small forest already predicts about as well as the full one on our training dataset.
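We can check this directly from the fitted forest: err.rate is a standard field of a randomForest object that stores the OOB error after each additional tree, so we can look up where it bottoms out.

# OOB error after each additional tree (column "OOB" of err.rate)
oob_error <- randomforest_model$finalModel$err.rate[, "OOB"]
which.min(oob_error) # forest size with the lowest OOB error
min(oob_error)       # the corresponding error rate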

Random forest is a black-box method, which means it is very hard to interpret what exactly happens inside the model. I will try the lime package, which can help us interpret the random forest model. We want to see how the model classifies an observation as sufficient or insufficient using the 3 features we have.

To use lime we need a dataset that has already been converted to a data frame, which we have. We can use the train_training dataset as observations. Since we only have 3 predictor variables, we will use all of them.

set.seed(100)
# Build the explainer on the predictors only (column 4 is the target)
explainer <- lime(train_training[, -4],
                  model = randomforest_model)
# Explain the first 4 observations using all 3 features
explanation <- explain(train_training[1:4, -4],
                       explainer,
                       n_permutations = 5000,
                       dist_fun = "manhattan",
                       n_labels = 1,
                       n_features = 3)

plot_features(explanation)

Before interpreting the plot: earlier we used varImp() to measure global variable importance. The difference between variable importance and LIME is that LIME shows how each feature contributes to the classification of an individual observation. Because it is a visualization, we can clearly see how the predictors work case by case.

Interpreting case 1: the label is insufficient because it is Sunday and the order comes in the middle of the night. Interestingly, from hour 3 onward, even on a Sunday, the label changes to sufficient. This suggests that area may not contribute much in this model; time and day seem to have more effect than area.

Model Evaluation

Next, we use this model to predict on the validation dataset and evaluate it with a confusion matrix:

randomforest_prediction <- predict(randomforest_model, train_validation)
matrix <- confusionMatrix(randomforest_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)

randomforest_matrix <- matrix_result(matrix, "Random Forest")
randomforest_matrix
##           Model Accuracy Sensitivity Specificity Precision
## 1 Random Forest 79.35982     82.2449    75.96154  80.11928

These columns mean (a hand computation is sketched after the list):
* Accuracy: the ability to correctly predict both classes out of all observations.
* Precision: the ability to correctly predict the positive class out of all predicted-positive observations (low false positives).
* Recall (Sensitivity): the ability to correctly predict the positive class out of all actual-positive observations (low false negatives).
* Specificity: the ability to correctly predict the negative class out of all actual-negative observations.
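As a sanity check, the same numbers can be recomputed by hand from the raw 2x2 counts of the validation predictions (a minimal sketch using the objects defined above, with “insufficient” as the positive class):

# Recompute the metrics by hand from the 2x2 validation counts
cm <- table(predicted = randomforest_prediction, actual = train_validation$coverage)
tp <- cm["insufficient", "insufficient"] # true positives
tn <- cm["sufficient", "sufficient"]     # true negatives
fp <- cm["insufficient", "sufficient"]   # false positives
fn <- cm["sufficient", "insufficient"]   # false negatives

(tp + tn) / sum(cm) # accuracy
tp / (tp + fp)      # precision
tp / (tp + fn)      # recall / sensitivity
tn / (tn + fp)      # specificity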

Conclusion

In this case we use the confusion matrix for model evaluation. Based on our problem, we need a classification model whose report will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017), and the prediction should focus on which regions and times are risky enough to have the “no drivers” problem.

In the models we built, “insufficient” is the positive class, so our target is the model with the highest precision. Let’s sum up the result matrices from all the models and tuned models we made before:

result <- rbind(logistict_matrix, naive_matrix_tuning, naive_matrix, dtree_tuning_matrix, dtree_matrix,randomforest_matrix)
result %>% arrange(desc(Precision))
##                  Model Accuracy Sensitivity Specificity Precision
## 1 Decision Tree Tuning 80.68433    85.10204    75.48077  80.34682
## 2       Logistict Reg. 79.91170    83.26531    75.96154  80.31496
## 3        Random Forest 79.35982    82.24490    75.96154  80.11928
## 4   Naive Bayes Tuning 80.35320    84.69388    75.24038  80.11583
## 5          Naive Bayes 80.35320    85.51020    74.27885  79.65779
## 6        Decision Tree 80.79470    87.34694    73.07692  79.25926

To see which model really is better, we will use the test dataset to simulate the real world and compare the results of the untuned and tuned decision tree models. We will save both prediction results as CSV files:

dtree_result <- predict(dtree_model, test)
dtree_tuning_result <- predict(dtree_tuning_model, test)

write_dtree <- test
write_dtree$coverage <- dtree_result
write_dtree <- write_dtree %>%
  select(-c(date,day,hour))

write_tuning_dtree <- test
write_tuning_dtree$coverage <- dtree_tuning_result
write_tuning_dtree <- write_tuning_dtree %>%
  select(-c(date,day,hour))

write.csv(write_dtree, "data/data-test-dtree.csv")
write.csv(write_tuning_dtree, "data/data-test-dtree_tuning.csv")

Here is the result of the Decision Tree model:

Here is the result of the Decision Tree Tuning model:

We can see there is a trade-off between the two models: increasing precision causes another metric to decrease. Since we are focusing on the precision value, we choose the Decision Tree Tuning model as the best for this case.

Decision Tree Tuning has the highest precision, so we can use this model to predict which areas and hours are most at risk of “nodrivers” status. Once these are identified, Scotty can build a strategy and provide more drivers in the areas and hours that are labeled “insufficient”.

This should also benefit the business: by fulfilling user orders better, both the drivers and the Scotty company can increase their revenue.