Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and puts a premium on moving efficiently through traffic – the app even references Star Trek's "beam me up" in its order button.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve some forecasting and classification problems in order to improve their business processes.
Scotty turned out to be a very popular service in Turkey! Demand began to overload supply in some regions at some times, and there were not enough drivers at those times and places. Fortunately, we know we can use a classification model to predict which regions and times are risky enough to run into this "no drivers" problem.
Create a classification model report that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). The prediction should cover the coverage status for each hour and each area: "sufficient" or "insufficient".
Our target is to find out which hours and areas get the "insufficient" label; in other words, we want to know which hours and areas really need more drivers.
To solve this case we need to import some packages that help with processing and prediction:
library(GGally)
# Wrangling package
library(tidyverse)
library(tidymodels)
# Date processing Package
library(lubridate)
# Scale numeric
library(scales)
# Sampling helper
library(rsample)
# Confusion matrix
library(caret)
# Decision tree package
library(partykit)
# SMOTE for unbalanced data
library(DMwR)
# KNN modelling
library(class)
#ROCR
library(ROCR)
# Naive Bayes
library(e1071)
# Visualize
library(plotly)
# Interpret the random forest model
library(lime)
We also source some helper scripts used throughout this analysis.
source("script/matrix_result.R")
source("script/metrics.R")
source("script/confussion_matrix_plot.R")
Scotty provided detailed transaction data from October 1st, 2017 to December 2nd, 2017 as the train dataset. Based on this train data we can start building models and making predictions.
train <- read_csv("data/data-train.csv")
test <- read_csv("data/data-test.csv")
Let's check the structure of the train data:
glimpse(train)
## Observations: 229,532
## Variables: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area <chr> "sxk9", "sxk8", "sxk9", "sxk9", "sxk9", "sxk9", "s…
## $ src_sub_area <chr> "sxk9s", "sxk8y", "sxk9e", "sxk9s", "sxk9e", "sxk9…
## $ dest_lat <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "s…
## $ dest_sub_area <chr> "sxk9u", "sxk9s", "sxk9e", "sxk9e", "sxk9e", "sxk9…
## $ distance <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status <chr> "confirmed", "confirmed", "nodrivers", "confirmed"…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
The dataset contains the details of every Scotty order. There is no coverage variable in the train data, so we will have to derive it ourselves. Some variables also have mismatched data types, which we convert first:
train <- train %>%
mutate(src_area = as.factor(src_area),
src_sub_area = as.factor(src_sub_area),
dest_area = as.factor(dest_area),
dest_sub_area = as.factor(dest_sub_area),
status = as.factor(status)
)
Let's preview the contents of the data to understand the train data better:
head(train)
## # A tibble: 6 x 16
## id trip_id driver_id rider_id start_time src_lat src_lon src_area
## <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl> <fct>
## 1 59d0… 59d005… 59a892c5… 59ad2d6… 2017-10-01 00:00:17 41.1 29.0 sxk9
## 2 59d0… 59d006… 59a13556… 59ce930… 2017-10-01 00:02:34 40.9 29.1 sxk8
## 3 59d0… <NA> <NA> 59cd704… 2017-10-01 00:02:34 41.1 29.0 sxk9
## 4 59d0… 59d006… 599dc0df… 59bd62c… 2017-10-01 00:03:29 41.0 29.0 sxk9
## 5 59d0… 59d007… 59a58565… 59c17cd… 2017-10-01 00:04:24 41.1 29.0 sxk9
## 6 59d0… 59d007… 59579d5f… 594afb1… 2017-10-01 00:05:32 41.0 28.9 sxk9
## # … with 8 more variables: src_sub_area <fct>, dest_lat <dbl>, dest_lon <dbl>,
## # dest_area <fct>, dest_sub_area <fct>, distance <dbl>, status <fct>,
## # confirmed_time_sec <dbl>
We found NA values in trip_id and driver_id. To make it clearer, let's check the NA values across all columns before we do any data wrangling:
summary(is.na(train))
## id trip_id driver_id rider_id
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:229532 FALSE:214631 FALSE:214632 FALSE:229532
## TRUE :14901 TRUE :14900
## start_time src_lat src_lon src_area
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:229532 FALSE:229532 FALSE:229532 FALSE:229532
##
## src_sub_area dest_lat dest_lon dest_area
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:229532 FALSE:229532 FALSE:229532 FALSE:229532
##
## dest_sub_area distance status confirmed_time_sec
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:229532 FALSE:229532 FALSE:229532 FALSE:229532
##
The NA values only appear in trip_id and driver_id. Let's compare a row that contains NA with rows that do not:
train[c(1:3),]
## # A tibble: 3 x 16
## id trip_id driver_id rider_id start_time src_lat src_lon src_area
## <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl> <fct>
## 1 59d0… 59d005… 59a892c5… 59ad2d6… 2017-10-01 00:00:17 41.1 29.0 sxk9
## 2 59d0… 59d006… 59a13556… 59ce930… 2017-10-01 00:02:34 40.9 29.1 sxk8
## 3 59d0… <NA> <NA> 59cd704… 2017-10-01 00:02:34 41.1 29.0 sxk9
## # … with 8 more variables: src_sub_area <fct>, dest_lat <dbl>, dest_lon <dbl>,
## # dest_area <fct>, dest_sub_area <fct>, distance <dbl>, status <fct>,
## # confirmed_time_sec <dbl>
There is a clear pattern behind the NA values: trip_id and driver_id are NA when the status is "nodrivers", and they are filled in when the status is "confirmed". This will be our clue for extracting the coverage information.
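To double-check this relationship, a quick summary like the one below can be used (a small sketch using only the columns already in the train data; not part of the original workflow):
# Count missing trip_id / driver_id per status -- NAs should only occur for "nodrivers"
train %>%
  group_by(status) %>%
  summarise(missing_trip_id = sum(is.na(trip_id)),
            missing_driver_id = sum(is.na(driver_id)),
            n = n())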
Based on our target, we will predict 7 days ahead for each hour and each area. Why not use a daily level only? Because then we could not tell which hours get more driver orders. On the other hand, a minute-level interval would make the data too sparse and confusing to work with. An hourly interval is therefore the best time level to predict on.
From the structure we can see there is no hour variable yet. We do have start_time, the time of the user's order, from which we can extract the date, day, and hour:
train_hourly <- train %>%
mutate(src_area = as.factor(src_area),
date = date(start_time),
day = as.factor(wday(start_time, label = TRUE, abbr = FALSE)),
hour = as.factor(hour(start_time)))
glimpse(train_hourly)
## Observations: 229,532
## Variables: 19
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
## $ date <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 2…
## $ day <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Su…
## $ hour <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
We now have three new variables: date (the date the order happened), day (the day of the week), and hour (the hour when the user ordered a driver).
summary(train_hourly)
## id trip_id driver_id rider_id
## Length:229532 Length:229532 Length:229532 Length:229532
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_time src_lat src_lon src_area
## Min. :2017-10-01 00:00:17 Min. :40.79 Min. :28.48 sxk3: 9946
## 1st Qu.:2017-10-19 19:36:04 1st Qu.:41.00 1st Qu.:28.97 sxk8: 5636
## Median :2017-11-07 14:08:48 Median :41.03 Median :29.00 sxk9:213950
## Mean :2017-11-04 18:32:48 Mean :41.03 Mean :29.00
## 3rd Qu.:2017-11-19 03:44:19 3rd Qu.:41.06 3rd Qu.:29.04
## Max. :2017-12-02 23:59:20 Max. :41.13 Max. :29.18
##
## src_sub_area dest_lat dest_lon dest_area
## sxk9e :40869 Min. : 5.923 Min. : 0.1273 sxk9 :205449
## sxk9s :24732 1st Qu.:41.001 1st Qu.: 28.9689 sxk3 : 12516
## sxk97 :24512 Median :41.036 Median : 29.0036 sxk8 : 4973
## sxk9k :16612 Mean :41.032 Mean : 29.0025 sxkc : 2264
## sxk9j :12336 3rd Qu.:41.064 3rd Qu.: 29.0475 sxkb : 1994
## sxk9h :12205 Max. :52.370 Max. :121.4721 sxkd : 1911
## (Other):98266 (Other): 425
## dest_sub_area distance status confirmed_time_sec
## sxk9e : 35941 Min. : 0.000 confirmed:214631 Min. : 0.00
## sxk9s : 25346 1st Qu.: 2.802 nodrivers: 14901 1st Qu.: 12.00
## sxk97 : 23031 Median : 4.804 Median : 26.00
## sxk9k : 13868 Mean : 6.772 Mean : 46.41
## sxk9j : 13704 3rd Qu.: 7.943 3rd Qu.: 66.00
## sxk9h : 10423 Max. :9345.566 Max. :514.00
## (Other):107219
## date day hour
## Min. :2017-10-01 Sunday :21063 18 : 25403
## 1st Qu.:2017-10-19 Monday :27425 19 : 20572
## Median :2017-11-07 Tuesday :31190 17 : 19156
## Mean :2017-11-04 Wednesday:31999 16 : 16090
## 3rd Qu.:2017-11-19 Thursday :32251 8 : 14938
## Max. :2017-12-02 Friday :50726 15 : 14227
## Saturday :34878 (Other):119146
To convert the dataset properly to an hourly interval, we need to group the transactions and summarise them per hour.
For example, transactions at 06:01:00, 06:30:00, and 07:00:00 are grouped so that 06:00:00 has 2 transactions and 07:00:00 has 1 transaction.
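As a tiny illustration of that idea (toy timestamps, not the Scotty data; tidyverse and lubridate are already loaded above), flooring each timestamp to the hour and counting reproduces exactly this grouping:
# Toy example: floor timestamps to the hour, then count per bucket
toy <- tibble(start_time = ymd_hms(c("2017-10-01 06:01:00",
                                     "2017-10-01 06:30:00",
                                     "2017-10-01 07:00:00")))
toy %>%
  mutate(hour_bucket = floor_date(start_time, unit = "hour")) %>%
  count(hour_bucket)
# -> 06:00:00 has n = 2 and 07:00:00 has n = 1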
For the real data we group by src_area, date, day, hour, and status to aggregate it. After grouping we count the frequency of each status (the two classes we saw earlier, confirmed and nodrivers) and compute total_transaction, the number of transactions within the interval:
coverage_hourly <- train_hourly %>%
group_by(src_area, date, day, hour, status) %>%
summarise(n = n()) %>%
ungroup() %>%
spread(status, n, fill=0) %>%
mutate(
total_transaction = confirmed + nodrivers,
datetime = ymd_h(paste(date, hour)))
#preview the data
head(coverage_hourly)
## # A tibble: 6 x 8
## src_area date day hour confirmed nodrivers total_transacti…
## <fct> <date> <ord> <fct> <dbl> <dbl> <dbl>
## 1 sxk3 2017-10-01 Sund… 0 0 1 1
## 2 sxk3 2017-10-01 Sund… 1 1 1 2
## 3 sxk3 2017-10-01 Sund… 3 0 1 1
## 4 sxk3 2017-10-01 Sund… 4 0 1 1
## 5 sxk3 2017-10-01 Sund… 7 0 1 1
## 6 sxk3 2017-10-01 Sund… 8 1 2 3
## # … with 1 more variable: datetime <dttm>
To build a prediction model and do good exploratory data analysis, we need a complete time interval. Let's take a look at the summary of our interval data:
summary(coverage_hourly$datetime)
## Min. 1st Qu. Median
## "2017-10-01 00:00:00" "2017-10-16 19:00:00" "2017-11-01 17:00:00"
## Mean 3rd Qu. Max.
## "2017-11-01 15:05:47" "2017-11-17 10:00:00" "2017-12-02 23:00:00"
Based on the summary, our data runs from 2017-10-01 00:00:00 to 2017-12-02 23:00:00. We will use these as the start and end of the interval:
sorted_datetime <- coverage_hourly[order(coverage_hourly$datetime),]
datetime_start <- sorted_datetime$datetime[1]
datetime_end <- sorted_datetime$datetime[length(sorted_datetime$datetime)]
datetime_start
## [1] "2017-10-01 UTC"
datetime_end
## [1] "2017-12-02 23:00:00 UTC"
Make a complete hourly sequence from the start of the interval to the end:
all_date <- seq(datetime_start, datetime_end, by = "hour")
all_date <- data.frame(list(datetime = all_date))
# total hours from start to end date
nrow(all_date)
## [1] 1512
In total we have 1512 hourly intervals from start to end. We will pad the train data with this complete time interval:
sxk3 <- coverage_hourly %>%
filter(src_area == "sxk3") %>%
arrange(desc(datetime)) %>%
merge(all_date, all = T) %>%
mutate(src_area = as.factor("sxk3"))
sxk8 <- coverage_hourly %>%
filter(src_area == "sxk8") %>%
arrange(desc(datetime)) %>%
merge(all_date, all = T) %>%
mutate(src_area = as.factor("sxk8"))
sxk9 <- coverage_hourly %>%
filter(src_area == "sxk9") %>%
arrange(desc(datetime)) %>%
merge(all_date, all = T) %>%
mutate(src_area = as.factor("sxk9"))
coverage_all_date <- do.call("rbind", list(sxk3, sxk8, sxk9))
head(coverage_all_date)
## datetime src_area date day hour confirmed nodrivers
## 1 2017-10-01 00:00:00 sxk3 2017-10-01 Sunday 0 0 1
## 2 2017-10-01 01:00:00 sxk3 2017-10-01 Sunday 1 1 1
## 3 2017-10-01 02:00:00 sxk3 <NA> <NA> <NA> NA NA
## 4 2017-10-01 03:00:00 sxk3 2017-10-01 Sunday 3 0 1
## 5 2017-10-01 04:00:00 sxk3 2017-10-01 Sunday 4 0 1
## 6 2017-10-01 05:00:00 sxk3 <NA> <NA> <NA> NA NA
## total_transaction
## 1 1
## 2 2
## 3 NA
## 4 1
## 5 1
## 6 NA
Padding with the complete interval introduces NA values for some rows, because the train data has no transactions during those hours. We will fill these NAs with 0:
coverage_all_date <- coverage_all_date %>%
mutate(
date = date(datetime),
hour = as.factor(hour(datetime)),
day = as.factor(wday(datetime, label = TRUE, abbr = FALSE)),
confirmed = replace_na(confirmed, 0),
nodrivers = replace_na(nodrivers,0),
total_transaction = replace_na(total_transaction ,0),
)
head(coverage_all_date)
## datetime src_area date day hour confirmed nodrivers
## 1 2017-10-01 00:00:00 sxk3 2017-10-01 Sunday 0 0 1
## 2 2017-10-01 01:00:00 sxk3 2017-10-01 Sunday 1 1 1
## 3 2017-10-01 02:00:00 sxk3 2017-10-01 Sunday 2 0 0
## 4 2017-10-01 03:00:00 sxk3 2017-10-01 Sunday 3 0 1
## 5 2017-10-01 04:00:00 sxk3 2017-10-01 Sunday 4 0 1
## 6 2017-10-01 05:00:00 sxk3 2017-10-01 Sunday 5 0 0
## total_transaction
## 1 1
## 2 2
## 3 0
## 4 1
## 5 1
## 6 0
Our task is to predict the coverage status for each hour and each area: "sufficient" or "insufficient". Next we create the target variable coverage from the variables we already have. We use the nodrivers count for labeling: if nodrivers > 0 the row is labeled insufficient, and if nodrivers = 0 it is labeled sufficient.
coverage_all_date <- coverage_all_date %>%
mutate(coverage = as.factor(if_else(nodrivers == 0, "sufficient", "insufficient")))
head(coverage_all_date)
## datetime src_area date day hour confirmed nodrivers
## 1 2017-10-01 00:00:00 sxk3 2017-10-01 Sunday 0 0 1
## 2 2017-10-01 01:00:00 sxk3 2017-10-01 Sunday 1 1 1
## 3 2017-10-01 02:00:00 sxk3 2017-10-01 Sunday 2 0 0
## 4 2017-10-01 03:00:00 sxk3 2017-10-01 Sunday 3 0 1
## 5 2017-10-01 04:00:00 sxk3 2017-10-01 Sunday 4 0 1
## 6 2017-10-01 05:00:00 sxk3 2017-10-01 Sunday 5 0 0
## total_transaction coverage
## 1 1 insufficient
## 2 2 insufficient
## 3 0 sufficient
## 4 1 insufficient
## 5 1 insufficient
## 6 0 sufficient
Up to this step we have a target variable, coverage, with two levels ("sufficient" and "insufficient") in the train data.
Let's check the class proportion of the target variable coverage. Along the way we prepared two data frames: coverage_hourly (before padding) and coverage_all_date (after padding). We compare the proportions before and after padding.
coverage_hourly <- coverage_hourly %>%
mutate(coverage = as.factor(if_else(nodrivers == 0, "sufficient", "insufficient")))
proportion <- data.frame(prop.table(table(coverage_hourly$coverage)))
proportion %>%
ggplot(aes(x = Var1, y = Freq)) +
geom_bar(aes(fill = Freq), stat = "identity") +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = percent(Freq)),
fontface = "bold",
color = "white",
position = position_stack(vjust = 0.5)) +
theme(legend.position = "none") +
labs(
title = "Coverage proportion before padding",
x = "",
y = ""
)
Before padding, 64% of the data is labeled insufficient and 36% sufficient. Next we show the train data after padding with the new time interval:
proportion <- data.frame(prop.table(table(coverage_all_date$coverage)))
proportion %>%
ggplot(aes(x = Var1, y = Freq)) +
geom_bar(aes(fill = Freq), stat = "identity") +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = percent(Freq)),
fontface = "bold",
color = "white",
position = position_stack(vjust = 0.5)) +
theme(legend.position = "none") +
labs(
title = "Coverage proportion after padding",
x = "",
y = ""
)
The proportion after padding is more balanced than before: 54.1% labeled insufficient and 45.9% labeled sufficient.
The dataset has src_area describing the different areas; let's check the frequency of each area in our train dataset:
proportion <- data.frame(summary(coverage_all_date$src_area)) %>%
rownames_to_column(., "VALUE") %>%
mutate("Source Area" = .[,1],
"Freq" = .[,2]) %>%
select(3,4)
proportion
## Source Area Freq
## 1 sxk3 1512
## 2 sxk8 1512
## 3 sxk9 1512
Each area has the same frequency, 1512 rows, because we padded the train dataset earlier. Let's check the proportion of coverage per area:
proprotion <- coverage_all_date %>%
group_by(src_area) %>%
summarise(insufficient = sum(coverage == "insufficient"),
sufficient = sum(coverage == "sufficient"),
total = n(),
"proportion insufficient" = insufficient/n(),
"proportion sufficient" = sufficient/n())
proprotion
## # A tibble: 3 x 6
## src_area insufficient sufficient total `proportion insuffi… `proportion suffi…
## <fct> <int> <int> <int> <dbl> <dbl>
## 1 sxk3 947 565 1512 0.626 0.374
## 2 sxk8 258 1254 1512 0.171 0.829
## 3 sxk9 1249 263 1512 0.826 0.174
It's clearer if we visualize it:
proprotion %>%
select("src_area","proportion insufficient", "proportion sufficient") %>%
rename("insufficient" = "proportion insufficient",
"sufficient" = "proportion sufficient") %>%
gather(key = "prop",
value = "value",
-src_area) %>%
ggplot(aes(x = prop, y = value)) +
geom_bar(aes(fill = prop), stat = "identity") +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = percent(value)),
fontface = "bold",
color = "white",
position = position_stack(vjust = 0.5)) +
facet_wrap( ~ src_area) +
labs(
title = "Coverage proportion group by Area",
x = "",
y = ""
) +
theme(legend.position = "none")
Of the 3 areas, only sxk8 has more than 50% of its rows labeled sufficient, which indicates that the other areas, sxk3 and sxk9, need more drivers. Let's try other visualizations to get better insight.
We will start with a heatmap:
coverage_all_date %>%
mutate(day = as.factor(wday(datetime, label = TRUE, abbr = TRUE))) %>%
group_by(day, hour, src_area) %>%
summarise(sufficient = sum(coverage == "sufficient")) %>%
ggplot(aes(x = day, y = hour, fill = sufficient)) +
geom_tile() +
facet_wrap(~src_area) +
theme(legend.position = "bottom")
To make it clearer and get more detailed insight, here is another visualization using a bar plot:
coverage_all_date %>%
select(-c("datetime", "date", "day")) %>%
group_by(hour, src_area) %>%
summarise(insufficient = sum(coverage == "insufficient"),
sufficient = sum(coverage == "sufficient")) %>%
gather("type","value",-src_area, -hour) %>%
ggplot(aes(x = hour, y = value, fill = type)) +
geom_col(position = "dodge") +
facet_grid(src_area ~ .) +
theme(legend.position = "bottom") +
labs(title = "Coverage hourly group by Area")
Some insights we can get:
1. sxk9 is the area with the most insufficient driver coverage; the heatmap is dark across all days and hours, meaning users often did not get a driver when they ordered.
2. sxk8 is the area with the most sufficient coverage; it has more confirmed orders in every hour of every day than the other areas.
3. However, even sxk8 is close to insufficient around 08:00, especially on working days from Monday to Friday, where the heatmap shows darker blue.
4. sxk3 gets more drivers from around 06:00, when people start their day, until around midnight, and the heatmap shows this pattern on every day.
5. There is a correlation between day, hour, and area and the coverage label.
As a recap before we continue, the original train data still has this structure:
glimpse(train)
## Observations: 229,532
## Variables: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
Meanwhile, after several processing steps we have made a new data frame, coverage_hourly, with this structure:
glimpse(coverage_hourly)
## Observations: 3,862
## Variables: 9
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk…
## $ date <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 20…
## $ day <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sun…
## $ hour <fct> 0, 1, 3, 4, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ confirmed <dbl> 0, 1, 0, 0, 0, 1, 2, 0, 3, 7, 4, 2, 0, 5, 0, 0, 0, …
## $ nodrivers <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 1, 3, 1, …
## $ total_transaction <dbl> 1, 2, 1, 1, 1, 3, 3, 1, 4, 8, 4, 3, 1, 5, 1, 3, 1, …
## $ datetime <dttm> 2017-10-01 00:00:00, 2017-10-01 01:00:00, 2017-10-…
## $ coverage <fct> insufficient, insufficient, insufficient, insuffici…
Some variables in the original train dataset are not yet used in this new data frame; we can add aggregated versions of them:
int_var <- train_hourly %>%
group_by(src_area, date, day, hour) %>%
summarise(mean_distance = mean(distance),
sum_distance = sum(distance),
median_distance = median(distance),
mean_confirmed_time_sec = mean(confirmed_time_sec),
sum_confirmed_time_sec = sum(confirmed_time_sec),
median_confirmed_time_sec = median(confirmed_time_sec)) %>%
ungroup() %>%
select(-c(src_area, date, day, hour))
glimpse(int_var)
## Observations: 3,862
## Variables: 6
## $ mean_distance <dbl> 1.6709748, 8.4075656, 14.1027795, 3.6331489…
## $ sum_distance <dbl> 1.6709748, 16.8151311, 14.1027795, 3.633148…
## $ median_distance <dbl> 1.6709748, 8.4075656, 14.1027795, 3.6331489…
## $ mean_confirmed_time_sec <dbl> 0.00000, 22.00000, 0.00000, 0.00000, 0.0000…
## $ sum_confirmed_time_sec <dbl> 0, 44, 0, 0, 0, 12, 42, 0, 306, 302, 310, 5…
## $ median_confirmed_time_sec <dbl> 0.0, 22.0, 0.0, 0.0, 0.0, 0.0, 9.0, 0.0, 43…
From distance and confirmed_time_sec we compute the mean, median, and sum as new variables, and combine them with coverage_hourly:
coverage_hourly <- cbind(coverage_hourly, int_var)
Cross-validation (CV) is a statistical method used to evaluate the performance of models or algorithms by separating the data into subsets: one for the learning process and one for validation/evaluation. We already have a train and a test dataset; in this case I will use 3 datasets: train, validation, and test.
Let's check the sizes of our existing train and test data:
nrow(train)
## [1] 229532
nrow(test)
## [1] 504
In total we have 229,532 observations in the train data and 504 targets to predict in the test data. The proportion is fine, because more than 99% of the data is available to train our model.
I will split the padded train dataset into 2 parts: a training set and a validation set, holding out 20% of the rows for validation.
set.seed(100)
intial_train <- initial_split(coverage_all_date, prop = 0.80, strata = "coverage")
train_training <- training(intial_train)
train_validation <- testing(intial_train)
prop.table(table(train_training$coverage))
##
## insufficient sufficient
## 0.5410468 0.4589532
The training data is slightly imbalanced, so we also create an up-sampled version in which both classes appear equally often:
train_training_upsample <- upSample(x = within(train_training, rm("coverage")), y = train_training$coverage, yname = "coverage")
prop.table(table(train_training_upsample$coverage))
##
## insufficient sufficient
## 0.5 0.5
Before we choose a model, remember that the test dataset constrains us: our model's output must fit the test structure. Let's take a look at the test data:
glimpse(test)
## Observations: 504
## Variables: 3
## $ src_area <chr> "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk…
## $ datetime <dttm> 2017-12-03 00:00:00, 2017-12-03 01:00:00, 2017-12-03 02:00:…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
summary(test)
## src_area datetime coverage
## Length:504 Min. :2017-12-03 00:00:00 Mode:logical
## Class :character 1st Qu.:2017-12-04 17:45:00 NA's:504
## Mode :character Median :2017-12-06 11:30:00
## Mean :2017-12-06 11:30:00
## 3rd Qu.:2017-12-08 05:15:00
## Max. :2017-12-09 23:00:00
The test data contains 3 variables and will hold our prediction result; we will predict the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017).
src_area is the source area where the user ordered a driver. The datetime values are spaced one hour apart, meaning our predictions must be hourly for 7 days. The coverage column is the target variable, containing sufficient and insufficient; based on our business case, insufficient is the priority and will be the positive class.
Our predictors are strictly limited to these 3 variables; from datetime we can extract the date, day, and hour:
test <- test %>%
mutate(src_area = as.factor(src_area),
date = date(datetime),
day = as.factor(wday(datetime, label = TRUE, abbr = FALSE)),
hour = as.factor(hour(datetime)))
glimpse(test)
## Observations: 504
## Variables: 6
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ datetime <dttm> 2017-12-03 00:00:00, 2017-12-03 01:00:00, 2017-12-03 02:00:…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ date <date> 2017-12-03, 2017-12-03, 2017-12-03, 2017-12-03, 2017-12-03,…
## $ day <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sund…
## $ hour <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
Based on the test data exploration above, the only predictor variables we can use are src_area, datetime, date, day, and hour, and the target is still coverage with its 2 classes, sufficient and insufficient. We can remove the unused variables from the training and validation data we made before:
train_training <- train_training %>%
select(c("src_area", "day", "hour", "coverage"))
train_validation <- train_validation %>%
select(c("src_area", "day", "hour", "coverage"))
train_training_upsample <- train_training_upsample %>%
select(c("src_area", "day", "hour", "coverage"))
For modeling, we are limited to models that can handle categorical (factor) predictors and a categorical target. I decided to compare 4 types of model: logistic regression, Naive Bayes, decision tree, and random forest.
To compare the models we will use the confusion matrix, a table that shows four different categories: True Positive, True Negative, False Positive, and False Negative.
The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision. Accuracy measures how much of the data is correctly predicted. Sensitivity measures how many of all actual positives are correctly predicted. Specificity measures how many of all actual negatives are correctly predicted. Precision measures how many of our positive predictions are correct.
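As a quick reference, here is how the four metrics follow from the confusion-matrix counts (a sketch with made-up counts; tp, tn, fp, fn are hypothetical, with "insufficient" as the positive class):
# tp = true positives, tn = true negatives, fp = false positives, fn = false negatives
tp <- 400; tn <- 320; fp <- 95; fn <- 80
c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
  sensitivity = tp / (tp + fn),   # recall
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp))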
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).
Logistic regression is a technique that is well suited for examining the relationship between a categorical response variable and one or more categorical or continuous predictor variables. Checking the structure of our dataset, we can see that both the predictors and the target are factors.
glimpse(train_training_upsample)
## Observations: 3,928
## Variables: 4
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ day <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sund…
## $ hour <fct> 0, 3, 7, 8, 11, 12, 13, 14, 17, 19, 20, 21, 1, 7, 8, 10, 11,…
## $ coverage <fct> insufficient, insufficient, insufficient, insufficient, insu…
Logistic regression comes with 3 assumptions we should consider before using it:
Multicollinearity: the predictors should not be strongly correlated with each other. Our predictors are categorical and there is no obvious dependence between them (a quick check is sketched after this list).
Independence of observations: observations should be mutually independent and not originate from repeated measurements. Our training data is an hourly extraction of historical orders, so each row describes a distinct area-hour combination.
Linearity of predictor and log of odds: this assumption mainly affects interpretation; for numerical variables, an increase of 1 in the value changes the log of odds linearly.
Of the 3 assumptions, we are confident the first two are met by our train dataset (the third concerns numeric predictors, which we do not use), so we can continue with it.
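One simple way to sanity-check the multicollinearity assumption for categorical predictors is a chi-squared test of independence on their contingency tables. This is only a sketch, not part of the original workflow; a very small p-value would hint at redundancy between predictors:
# Association between pairs of categorical predictors
chisq.test(table(train_training_upsample$src_area, train_training_upsample$day))
chisq.test(table(train_training_upsample$src_area, train_training_upsample$hour))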
We will now train the logistic regression model. At first we use all variables in the training data frame, with coverage as the target.
logistict_model <- glm(coverage ~ ., train_training_upsample, family = "binomial")
summary(logistict_model)
##
## Call:
## glm(formula = coverage ~ ., family = "binomial", data = train_training_upsample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.67425 -0.73517 0.00039 0.70405 2.69108
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.30543 0.19411 -1.573 0.115612
## src_areasxk8 2.45943 0.10611 23.179 < 2e-16 ***
## src_areasxk9 -1.14049 0.09777 -11.665 < 2e-16 ***
## day.L -0.26384 0.10837 -2.435 0.014912 *
## day.Q -0.30119 0.10796 -2.790 0.005274 **
## day.C 0.02223 0.10799 0.206 0.836872
## day^4 0.41647 0.10847 3.840 0.000123 ***
## day^5 0.16530 0.10786 1.533 0.125390
## day^6 -0.02934 0.10725 -0.274 0.784410
## hour1 0.10385 0.27392 0.379 0.704611
## hour2 0.68156 0.26648 2.558 0.010538 *
## hour3 0.68393 0.26233 2.607 0.009131 **
## hour4 1.44160 0.27778 5.190 2.11e-07 ***
## hour5 1.39752 0.27877 5.013 5.35e-07 ***
## hour6 0.99198 0.26675 3.719 0.000200 ***
## hour7 -0.64403 0.28296 -2.276 0.022843 *
## hour8 -2.09973 0.29788 -7.049 1.80e-12 ***
## hour9 -0.91742 0.27947 -3.283 0.001028 **
## hour10 0.11837 0.27001 0.438 0.661097
## hour11 0.09742 0.27242 0.358 0.720635
## hour12 -0.38025 0.28216 -1.348 0.177774
## hour13 -0.69846 0.27990 -2.495 0.012581 *
## hour14 -0.48210 0.28178 -1.711 0.087100 .
## hour15 -0.71407 0.28540 -2.502 0.012350 *
## hour16 -1.05535 0.28809 -3.663 0.000249 ***
## hour17 -1.12399 0.28381 -3.960 7.48e-05 ***
## hour18 -0.99653 0.28472 -3.500 0.000465 ***
## hour19 -1.10776 0.28312 -3.913 9.13e-05 ***
## hour20 -0.04379 0.27489 -0.159 0.873443
## hour21 0.09773 0.27038 0.361 0.717750
## hour22 0.69420 0.26831 2.587 0.009672 **
## hour23 0.24501 0.26759 0.916 0.359868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5445.4 on 3927 degrees of freedom
## Residual deviance: 3711.9 on 3896 degrees of freedom
## AIC: 3775.9
##
## Number of Fisher Scoring iterations: 5
We only have 3 predictor variables, so why does the summary show so many coefficients? Because glm automatically applies dummy coding: it recodes each categorical variable into a set of binary variables that take the values one and zero.
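To see this dummy coding explicitly, model.matrix() can be used to inspect how a factor is expanded into 0/1 columns (a minimal sketch):
# Each level of src_area (except the reference level) becomes its own 0/1 column
head(model.matrix(~ src_area, data = train_training_upsample))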
From the summary we can analyze the fit and interpret what the model is telling us. On average, our predictor variables are statistically significant. Even though some dummy-coded levels have p-values above 0.05, we should not judge the levels one by one; they belong together as a single variable.
We have built the logistic regression model with all predictor variables. Next we use it to predict on the validation dataset.
logistict_prediction <- predict(logistict_model, train_validation, type = "response")
head(logistict_prediction)
## 2 3 5 17 23 30
## 0.4641338 0.6068290 0.7674631 0.2136755 0.6098418 0.7396008
Logistic regression returns values in the range [0, 1], not a binary class. We convert the probability into a class using a threshold (by default 0.5).
logistict_prediction <- as.factor(if_else(logistict_prediction >= 0.5, "sufficient", "insufficient"))
head(logistict_prediction)
## [1] insufficient sufficient sufficient insufficient sufficient
## [6] sufficient
## Levels: insufficient sufficient
Now that we have class predictions, we use a confusion matrix to evaluate the model:
matrix <- confusionMatrix(logistict_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)
logistict_matrix <- matrix_result(matrix, "Logistict Reg.")
logistict_matrix
## Model Accuracy Sensitivity Specificity Precision
## 1 Logistict Reg. 79.9117 83.26531 75.96154 80.31496
This table shows:
* Accuracy: the ability to correctly predict both classes from the total observation.
* Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
* Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
* Specificity: the ability to correctly predict the negative class from the total actual-negative class.
The logistic regression model reaches 79.91% accuracy on the validation dataset, meaning 79.91% of the predictions are correctly classified. Precision (positive predictive value) is around 80.31%, meaning most of our positive predictions are correct. Sensitivity is 83.27% and specificity is 75.96%, so positive and negative cases are each correctly identified in roughly 76%-83% of cases.
The Naive Bayes algorithm is a probability-based technique that is simple yet powerful, and it is often known to outperform more complex algorithms on very large datasets. Historically, it became popular through applications in email filtering, spam detection, and document categorization.
Despite its naive design and oversimplified independence assumptions, this classifier can perform well in many complex real-world problems. Since it is also a resource-efficient algorithm that is fast and scales well, we will apply it to our case.
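Under the "naive" conditional-independence assumption, the model scores each class roughly as follows (written with our three predictors; this is the standard Naive Bayes formulation, not something specific to this dataset):

$$P(\text{coverage} = c \mid \text{area}, \text{day}, \text{hour}) \;\propto\; P(c)\, P(\text{area} \mid c)\, P(\text{day} \mid c)\, P(\text{hour} \mid c)$$

The class with the larger score is the predicted class.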
Next we build the model, applying the Laplace estimator as well. In this first Naive Bayes model we use all the predictor variables we have:
naive_model <- naiveBayes(coverage ~ ., data = train_training_upsample, laplace = 1)
Next we use this model to predict on the validation dataset and evaluate it with a confusion matrix:
naive_prediction <- predict(naive_model, train_validation)
matrix <- confusionMatrix(naive_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)
naive_matrix <- matrix_result(matrix, "Naive Bayes")
naive_matrix
## Model Accuracy Sensitivity Specificity Precision
## 1 Naive Bayes 80.3532 85.5102 74.27885 79.65779
This table shows:
* Accuracy: the ability to correctly predict both classes from the total observation.
* Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
* Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
* Specificity: the ability to correctly predict the negative class from the total actual-negative class.
There is a way to improve our Naive Bayes model; before we do that, let's check its ROC curve:
naive_prediction_raw <- as.data.frame(predict(naive_model, train_validation, type = "raw"))
# ROC
naive_roc <- data.frame(prediction = naive_prediction_raw[,2],
trueclass = as.numeric(train_validation$coverage=="sufficient"))
naive_roc_pred <- prediction(naive_roc$prediction, naive_roc$trueclass)
# ROC curve
plot(performance(naive_roc_pred, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
The ROC is a curve that shows the performance of the classification model across all thresholds. It plots:
* Recall/sensitivity (true positive rate) on the y axis.
* 1 - specificity (false positive rate) on the x axis.
Next we calculate the AUC:
# AUC
auc <- performance(naive_roc_pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8481113
AUC is the area under the ROC curve. The AUC value indicates how well the model separates the two classes and lies between 0 and 1. If the AUC is close to 1, the classification model is able to distinguish the two classes well; if it is close to 0.5, the model is not able to distinguish the two classes well.
* AUC value close to 1: the classification model is able to predict / distinguish the two classes well
* AUC values close to 0.5: the classification model is not able to predict / distinguish the two classes well.
Based on the ROC and AUC results, the ROC curve shows good separation with an AUC score of 0.8481113, which means we still have a chance to improve this Naive Bayes model.
In our case we focus on the "insufficient" label of the coverage variable, and one way to tune the model is to change the classification threshold.
# model tuning - metrics function
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
# apply function metrics
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = naive_prediction_raw$sufficient,
ref = as.factor(ifelse(train_validation$coverage=="sufficient", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))
From the plot above, the break-even point is around 0.45. We will set the threshold to 0.48 because we are focusing on precision while still wanting the other metrics to stay reasonable.
#Tuning Threshold
naive_prediction_tuning <- naive_prediction_raw %>%
mutate(label = as.factor(ifelse(sufficient >= 0.48, "sufficient", "insufficient"))) %>%
select(label)
naive_matrix_tuning <- confusionMatrix(naive_prediction_tuning$label, train_validation$coverage, positive = "insufficient")
naive_matrix_tuning <- matrix_result(naive_matrix_tuning, "Naive Bayes Tuning")
rbind(naive_matrix_tuning,naive_matrix)
## Model Accuracy Sensitivity Specificity Precision
## 1 Naive Bayes Tuning 80.3532 84.69388 75.24038 80.11583
## 2 Naive Bayes 80.3532 85.51020 74.27885 79.65779
The decision tree is one of the tree-based models, with the major benefit of being interpretable. It is an algorithm that derives a set of rules that can be visualized in a diagram resembling a tree.
Let's build the decision tree model:
dtree_model <- ctree(formula = coverage ~., data = train_training_upsample)
plot(dtree_model)
A decision tree consists of several parts. The first box at the top of the plot is the root node. The root splits into branches according to certain rules, and each branch ends in a node. Nodes that split again are called internal nodes, while nodes that do not split anymore, at the end of the tree, are called terminal or leaf nodes, just like the leaves on a tree.
In the plot, each internal node shows the splitting variable with its p-value, and each terminal node shows the distribution of the two coverage classes. We can also inspect the size of the tree:
width(dtree_model)
## [1] 10
depth(dtree_model)
## [1] 6
Our decision tree is not an overly complicated tree: the width of 10 means it has 10 terminal (leaf) nodes, and the depth of 6 means the tree is at most 6 levels deep.
To explain in more detail how the model chooses a class, I will use the lime package to interpret it.
set.seed(100)
model<-as_classifier(dtree_model, labels = NULL)
explainer <- lime((train_training[,-4]),
model = model)
explanation <- explain(train_training[1:4,-4],
explainer,
dist_fun = "manhattan",
n_labels = 1,
n_features = 3)
plot_features(explanation)
In cases 1, 4, 6, and 7 we can see the observations share src_area = sxk3 and day = Sunday; the hour is the factor that makes the difference in how the model classifies them. Another insight is that people use Scotty even late at night.
Case 1 shows src_area as the most important factor for the coverage label, while hour and day do not contribute significantly.
Next we use this model to predict on the validation dataset and evaluate it with a confusion matrix:
dtree_prediction <- predict(dtree_model, train_validation)
matrix <- confusionMatrix(dtree_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)
dtree_matrix <- matrix_result(matrix, "Decision Tree")
dtree_matrix
## Model Accuracy Sensitivity Specificity Precision
## 1 Decision Tree 80.7947 87.34694 73.07692 79.25926
This table shows:
* Accuracy: the ability to correctly predict both classes from the total observation.
* Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
* Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
* Specificity: the ability to correctly predict the negative class from the total actual-negative class.
The decision tree's accuracy, sensitivity, and specificity are quite decent. Still, we will try tuning it in the hope of a better result.
dtree_prediction_raw <- predict(dtree_model, train_validation, type = "prob")
dtree_roc <- prediction(dtree_prediction_raw[,2], train_validation$coverage)
plot(performance(dtree_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
auc <- performance(dtree_roc, "auc")
auc@y.values[[1]]
## [1] 0.8448047
Based on the ROC and AUC results, the ROC curve shows good separation with an AUC score of 0.8448047, which means we still have a chance to improve this decision tree model.
Since "insufficient" is our positive class, we will try to improve precision; one way to tune the model is to change the pruning criteria and tree size.
ctrl <- ctree_control(mincriterion = 0.35, minsplit = 100, minbucket = 50)
dtree_tuning_model <- ctree(formula = coverage ~., data = train_training_upsample, control = ctrl)
dtree_prediction_tuning <- predict(dtree_tuning_model, train_validation)
matrix <- confusionMatrix(dtree_prediction_tuning, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)
dtree_tuning_matrix <- matrix_result(matrix, "Decision Tree Tuning")
rbind(dtree_tuning_matrix,dtree_matrix)
## Model Accuracy Sensitivity Specificity Precision
## 1 Decision Tree Tuning 80.68433 85.10204 75.48077 80.34682
## 2 Decision Tree 80.79470 87.34694 73.07692 79.25926
After tuning the decision tree model, precision increases by about 1 percentage point and specificity by about 2, at the cost of a small drop in sensitivity compared with the untuned model.
Random forests are based on a simple idea: 'the wisdom of the crowd'. Aggregating the results of multiple predictors gives a better prediction than the best individual predictor. A group of predictors is called an ensemble, so this technique is called ensemble learning.
To improve on a single model, we can train a group of decision tree classifiers, each on a different random subset of the training set. To make a prediction, we obtain the predictions of all individual trees and predict the class that gets the most votes. This technique is called a random forest.
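As a small illustration of this voting idea (a sketch using the randomForest package directly, separate from the caret workflow used below), the per-tree votes can be extracted and aggregated by majority:
# Fit a tiny forest and look at the individual tree votes for the validation rows
library(randomForest)
rf_demo <- randomForest(coverage ~ ., data = train_training_upsample, ntree = 10)
per_tree <- predict(rf_demo, train_validation, predict.all = TRUE)$individual
# one row per observation, one column per tree; take the majority vote per row
head(apply(per_tree, 1, function(votes) names(which.max(table(votes)))))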
One weakness of the random forest model is that it can easily overfit the training data. To reduce the chance of an overfitted model we can use k-fold cross-validation. K-fold cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.
Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
Let's implement it for our random forest model. Since our predictors are very limited, I will use repeated k-fold cross-validation to train for a hopefully better result:
ctrl <- trainControl(method = "repeatedcv", number = 5,repeats = 10)
Train the random forest model (the result is saved to an RDS file so it does not have to be retrained every time):
# our result model will save on RDS
# randomforest_model <- train(coverage ~ ., data = train_training_upsample, method = "rf", trControl = ctrl, ntree = 10)
#
# saveRDS(randomforest_model, file = "data/model_rforest.RDS")
Let's look at the result and get some insight from the random forest model:
randomforest_model <- readRDS("data/model_rforest.RDS")
randomforest_model
## Random Forest
##
## 3928 samples
## 3 predictor
## 2 classes: 'insufficient', 'sufficient'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 3142, 3142, 3142, 3143, 3143, 3143, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7842424 0.5684822
## 16 0.7667785 0.5335560
## 31 0.7649200 0.5298385
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
From the model summary, the optimal number of variables considered for splitting at each tree node (mtry) is 2. Next we use varImp() to see which features are most important in our random forest model.
varImp(randomforest_model)
## rf variable importance
##
## only 20 most important variables shown (out of 31)
##
## Overall
## src_areasxk8 100.0000
## src_areasxk9 41.2367
## hour4 5.6064
## hour8 5.4409
## hour5 3.1428
## hour6 2.3171
## hour3 2.1025
## hour17 1.5848
## hour16 1.4873
## day^6 1.3055
## day.L 1.2495
## hour19 1.1781
## hour2 1.1772
## day.Q 1.1587
## hour18 1.0768
## hour22 1.0265
## hour9 0.9842
## day^5 0.9739
## day.C 0.9728
## day^4 0.9669
src_area is the most important predictor in our train dataset; the top 2 rows of the table are both levels of src_area. In practice, a random forest already provides an out-of-bag (OOB) estimate that can be used as a reliable estimate of its true accuracy on unseen examples. We can check the OOB error rate of our random forest model:
randomforest_model$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 50, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 21.84%
## Confusion matrix:
## insufficient sufficient class.error
## insufficient 1609 355 0.1807536
## sufficient 503 1461 0.2561100
plot(randomforest_model$finalModel)
legend("topright", colnames(randomforest_model$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)
Surprisingly, the error curve flattens out quickly: fewer than 10 trees are already enough for our random forest on this train dataset.
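One way to verify this from the fitted model is to inspect the stored OOB error per number of trees (a quick sketch):
# OOB error after each additional tree; see where it stops improving
oob_error <- randomforest_model$finalModel$err.rate[, "OOB"]
which.min(oob_error)   # number of trees with the lowest OOB error
head(oob_error, 10)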
A random forest is a black-box method, which makes it hard to interpret what exactly happens inside the model. I will use the lime package to help interpret it, so we can see how the random forest classifies an observation using the 3 features we have.
To use lime we need the data as a data frame, which we already have. We will use the train_training dataset for the observations; since we only have 3 predictor variables, we use all of them.
set.seed(100)
explainer <- lime((train_training[,-4]),
model = randomforest_model)
explanation <- explain(train_training[1:4,-4],
explainer,
n_permutations = 5000,
dist_fun = "manhattan",
n_labels = 1,
n_features = 3)
plot_features(explanation)
Earlier we used varImp() for global variable importance. The difference with LIME is that it shows how each feature contributes to the classification of an individual observation, and because it is a visualization we can see clearly how the predictors work.
Interpreting case 1, the label is insufficient because it is Sunday and the order comes in the middle of the night. Interestingly, from around 3 am, even on Sunday, the label changes to sufficient. We can conclude that the area may contribute less in this model; the time and day may matter more than the area.
Next we use this model to predict on the validation dataset and evaluate it with a confusion matrix:
randomforest_prediction <- predict(randomforest_model, train_validation)
matrix <- confusionMatrix(randomforest_prediction, train_validation$coverage)
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)
randomforest_matrix <- matrix_result(matrix, "Random Forest")
randomforest_matrix
## Model Accuracy Sensitivity Specificity Precision
## 1 Random Forest 79.35982 82.2449 75.96154 80.11928
This table shows:
* Accuracy: the ability to correctly predict both classes from the total observation.
* Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
* Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
* Specificity: the ability to correctly predict the negative class from the total actual-negative class.
We use the confusion matrix for model evaluation. Our problem asks for a classification model that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017), and the prediction should focus on which regions and times are risky enough to have the "no drivers" problem.
In all models we set "insufficient" as the positive class, so we want the highest precision. Let's sum up the result matrices of all the models and tuned models we built:
result <- rbind(logistict_matrix, naive_matrix_tuning, naive_matrix, dtree_tuning_matrix, dtree_matrix,randomforest_matrix)
result %>% arrange(desc(Precision))
## Model Accuracy Sensitivity Specificity Precision
## 1 Decision Tree Tuning 80.68433 85.10204 75.48077 80.34682
## 2 Logistict Reg. 79.91170 83.26531 75.96154 80.31496
## 3 Random Forest 79.35982 82.24490 75.96154 80.11928
## 4 Naive Bayes Tuning 80.35320 84.69388 75.24038 80.11583
## 5 Naive Bayes 80.35320 85.51020 74.27885 79.65779
## 6 Decision Tree 80.79470 87.34694 73.07692 79.25926
To see which model is really better, we use the test dataset to simulate the real world and compare the results of the untuned and tuned decision tree models. We save both prediction results as CSV files:
dtree_result <- predict(dtree_model, test)
dtree_tuning_result <- predict(dtree_tuning_model, test)
write_dtree <- test
write_dtree$coverage <- dtree_result
write_dtree <- write_dtree %>%
select(-c(date,day,hour))
write_tuning_dtree <- test
write_tuning_dtree$coverage <- dtree_tuning_result
write_tuning_dtree <- write_tuning_dtree %>%
select(-c(date,day,hour))
write.csv(write_dtree, "data/data-test-dtree.csv")
write.csv(write_tuning_dtree, "data/data-test-dtree_tuning.csv")
Here is the submitted result of the Decision Tree model:
And here is the submitted result of the Decision Tree Tuning model:
We can see the trade-off between the two models: increasing precision causes other metrics to drop. Since we are focusing on precision, we choose the Decision Tree Tuning model as the best one for this case.
The Decision Tree Tuning model has the highest precision, so we use it to predict which areas and hours are most at risk of the "nodrivers" status. Once these are identified, Scotty can plan ahead and provide more drivers in the areas and hours that tend to be labeled "insufficient".
This should also benefit the business: serving user orders better can increase revenue for both the drivers and the Scotty company.