This project was created as the Machine Learning capstone for Algoritma's students. This time, I'll use a dataset from Scotty to build a classification model.
Scotty is a ride-sharing business operating in several big cities in Turkey. It was established in May 2017 with the mission of solving Istanbul's traffic problem through motorcycles; over time, it expanded into food and package deliveries. According to USA Today, Scotty's founder, Tarkan Altan, said he got the idea after being shocked by Istanbul's traffic when his family moved to the city from Marmaris on the Turkish coast.
It operates like Uber and Lyft: enter your location and destination, then select a ride by tapping a button that says, "Beam me up" – a reference to Star Trek and, hence, the name Scotty. (The company's logo is a four-fingered V – Spock's "Live Long and Prosper" Vulcan salute.) cc: Podcast and Webrazzi reported that Tarkan Altan discussed how he would focus 50% of his time on fundraising, 30% on recruitment, and the remaining 20% on strategy in the near future.
There are 2 csv files for this project: data-train, which we use as our source data, and data-test, which we submit to the leaderboard to check whether the models we've created fit well or not.
Before we read the data, let's load the libraries we need:
library(tidyverse)
library(zoo)
library(lubridate)
library(padr)
library(rsample)
library(partykit)
library(ggplot2)
library(caret)
library(e1071)
library(lime)
options(scipen = 9999)
Now, let's read data-test for the submission.
submission <- read_csv("data/data-test.csv")
submission
The submission data consists of only 3 columns, which will be our reference for processing the data.
First of all, let's load the scotty data from data-train:
scotty <- read_csv("data/data-train.csv")
scotty
dim(scotty)
## [1] 229532 16
The data we have consists of 16 columns and 229,532 rows.
Now, let's explore! Does scotty already have appropriate data types? Let's find out.
str(scotty)
## tibble [229,532 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : chr [1:229532] "59d005e1ffcfa261708ce9cd" "59d0066a3d32b861760d4a01" "59d0066affcfa261708ceb11" "59d006a1ffcfa261708ceba4" ...
## $ trip_id : chr [1:229532] "59d005e9cb564761a8fe5d3e" "59d00678ffcfa261708ceb37" NA "59d006c131e39c618969343d" ...
## $ driver_id : chr [1:229532] "59a892c5568be44b2734f276" "59a135565e88a24b11f11994" NA "599dc0dfa5b4fd5471ad8453" ...
## $ rider_id : chr [1:229532] "59ad2d6efba75a581666b506" "59ce930f3d32b861760a4617" "59cd704bcf482f6ce2fadfdb" "59bd62cc7e3c3b663924ba86" ...
## $ start_time : POSIXct[1:229532], format: "2017-10-01 00:00:17" "2017-10-01 00:02:34" ...
## $ src_lat : num [1:229532] 41.1 40.9 41.1 41 41.1 ...
## $ src_lon : num [1:229532] 29 29.1 29 29 29 ...
## $ src_area : chr [1:229532] "sxk9" "sxk8" "sxk9" "sxk9" ...
## $ src_sub_area : chr [1:229532] "sxk9s" "sxk8y" "sxk9e" "sxk9s" ...
## $ dest_lat : num [1:229532] 41.1 41.1 41.1 41 41.1 ...
## $ dest_lon : num [1:229532] 29 29 29 29 29 ...
## $ dest_area : chr [1:229532] "sxk9" "sxk9" "sxk9" "sxk9" ...
## $ dest_sub_area : chr [1:229532] "sxk9u" "sxk9s" "sxk9e" "sxk9e" ...
## $ distance : num [1:229532] 5.38 15.5 1.13 4.17 3.36 ...
## $ status : chr [1:229532] "confirmed" "confirmed" "nodrivers" "confirmed" ...
## $ confirmed_time_sec: num [1:229532] 8 14 0 32 65 110 0 49 27 21 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_character(),
## .. trip_id = col_character(),
## .. driver_id = col_character(),
## .. rider_id = col_character(),
## .. start_time = col_datetime(format = ""),
## .. src_lat = col_double(),
## .. src_lon = col_double(),
## .. src_area = col_character(),
## .. src_sub_area = col_character(),
## .. dest_lat = col_double(),
## .. dest_lon = col_double(),
## .. dest_area = col_character(),
## .. dest_sub_area = col_character(),
## .. distance = col_double(),
## .. status = col_character(),
## .. confirmed_time_sec = col_double()
## .. )
Since we used read_csv(), start_time has already been parsed as POSIXct; src_area will be converted into a factor later.
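If we wanted to convert the character columns right away, a minimal sketch could look like the chunk below (not applied here; the actual conversion happens later, inside the wrangling pipeline):
# sketch only: convert the label-like character columns to factors and peek at the result
scotty %>%
  mutate(src_area = as.factor(src_area),
         status = as.factor(status)) %>%
  glimpse()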
Now, time to check whether there are any missing values in our data.
colSums(is.na(scotty))
## id trip_id driver_id rider_id
## 0 14901 14900 0
## start_time src_lat src_lon src_area
## 0 0 0 0
## src_sub_area dest_lat dest_lon dest_area
## 0 0 0 0
## dest_sub_area distance status confirmed_time_sec
## 0 0 0 0
There are missing values in both trip_id and driver_id, but we can continue, considering that we won't use those columns.
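Since trip_id and driver_id won't be used, one optional way to make that explicit is to drop them and re-check missingness; a quick sketch (not required for the rest of the workflow):
# sketch: drop the columns containing NAs and confirm nothing else is missing
scotty %>%
  select(-trip_id, -driver_id) %>%
  is.na() %>%
  colSums()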
Let’s summarize our dataset
summary(scotty)
## id trip_id driver_id rider_id
## Length:229532 Length:229532 Length:229532 Length:229532
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## start_time src_lat src_lon
## Min. :2017-10-01 00:00:17 Min. :40.79 Min. :28.48
## 1st Qu.:2017-10-19 19:36:04 1st Qu.:41.00 1st Qu.:28.97
## Median :2017-11-07 14:08:48 Median :41.03 Median :29.00
## Mean :2017-11-04 18:32:48 Mean :41.03 Mean :29.00
## 3rd Qu.:2017-11-19 03:44:19 3rd Qu.:41.06 3rd Qu.:29.04
## Max. :2017-12-02 23:59:20 Max. :41.13 Max. :29.18
## src_area src_sub_area dest_lat dest_lon
## Length:229532 Length:229532 Min. : 5.923 Min. : 0.1273
## Class :character Class :character 1st Qu.:41.001 1st Qu.: 28.9689
## Mode :character Mode :character Median :41.036 Median : 29.0036
## Mean :41.032 Mean : 29.0025
## 3rd Qu.:41.064 3rd Qu.: 29.0475
## Max. :52.370 Max. :121.4721
## dest_area dest_sub_area distance status
## Length:229532 Length:229532 Min. : 0.000 Length:229532
## Class :character Class :character 1st Qu.: 2.802 Class :character
## Mode :character Mode :character Median : 4.804 Mode :character
## Mean : 6.772
## 3rd Qu.: 7.943
## Max. :9345.566
## confirmed_time_sec
## Min. : 0.00
## 1st Qu.: 12.00
## Median : 26.00
## Mean : 46.41
## 3rd Qu.: 66.00
## Max. :514.00
From start_time we can see that the first order was placed on 2017-10-01 00:00:17 and the last order on 2017-12-02 23:59:20. So our data covers transactions from October 2017 to early December 2017, roughly two months spanning three calendar months.
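We can confirm the observation window directly from the data; a quick sketch:
# first and last order timestamps, and the length of the window in days
range(scotty$start_time)
difftime(max(scotty$start_time), min(scotty$start_time), units = "days")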
Since src_area and status are character types, let's see how many distinct values they have using unique().
unique(scotty$src_area)
## [1] "sxk9" "sxk8" "sxk3"
unique(scotty$status)
## [1] "confirmed" "nodrivers"
There are 3 areas from src_area and 2 statuses from status.
Since we use submission as the reference for our base data, let's drop the unused columns and convert the data types into appropriate ones. First, we select src_area, start_time (which will become datetime, as in submission), and status (which will define coverage). start_time will be floored to the hour using the floor_date() function. Find the code below:
scotty <- scotty %>%
  # keep only the columns needed to build the hourly coverage label
  select(src_area, start_time, status) %>%
  mutate(src_area = as.factor(src_area),
         # truncate each order timestamp to the start of its hour
         start_time = floor_date(start_time, unit = "hour"),
         datetime = start_time) %>%
  # count confirmed / nodrivers orders per area per hour
  group_by(src_area, datetime) %>%
  count(status) %>%
  pivot_wider(names_from = status, values_from = n) %>%
  # an hour with at least one "nodrivers" order is labelled insufficient
  mutate(coverage = ifelse(nodrivers == 0, "sufficient", "insufficient")) %>%
  select(-c(confirmed, nodrivers)) %>%
  # complete the hourly series for each area over the whole observation window
  group_by(src_area) %>%
  pad('hour', start_val = as.POSIXct("2017-10-01 00:00:00"), end_val = as.POSIXct("2017-12-02 23:00:00")) %>%
  # hours without any "nodrivers" record are treated as sufficient
  mutate(coverage = replace_na(coverage, "sufficient")) %>%
  ungroup()
scotty
The scotty data now has the same columns as submission.
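As a quick illustration of what floor_date() did to every timestamp, here is a minimal sketch with an arbitrary example value:
# floor_date() truncates a timestamp down to the start of its hour
floor_date(ymd_hms("2017-10-01 08:41:07"), unit = "hour")
## [1] "2017-10-01 08:00:00 UTC"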
Now, let's explore our scotty data further and see what other information we can derive from it, so we can do the same for submission. Let's clean our data and produce a final summary.
From the datetime information, we can derive day using wday(), hour using hour(), and peak, which we derive from hour.
scotty_clean <- scotty %>%
mutate(src_area = as.factor(src_area),
day = wday(datetime, label = T, abbr = F),
hour = hour(datetime),
peak = as.factor(ifelse( 7 <= hour & hour <= 20, "peak", "nonpeak")),
coverage = as.factor(coverage)) %>%
select(datetime, src_area, day, hour, peak, coverage) %>%
mutate(day = as.factor(day),
hour = as.factor(hour))
scotty_clean
As we can see, we now have far fewer rows and appropriate data types.
Let’s see the data summary
summary(scotty_clean)
## datetime src_area day hour
## Min. :2017-10-01 00:00:00 sxk3:1512 Sunday :648 0 : 189
## 1st Qu.:2017-10-16 17:45:00 sxk8:1512 Monday :648 1 : 189
## Median :2017-11-01 11:30:00 sxk9:1512 Tuesday :648 2 : 189
## Mean :2017-11-01 11:30:00 Wednesday:648 3 : 189
## 3rd Qu.:2017-11-17 05:15:00 Thursday :648 4 : 189
## Max. :2017-12-02 23:00:00 Friday :648 5 : 189
## Saturday :648 (Other):3402
## peak coverage
## nonpeak:1890 insufficient:2454
## peak :2646 sufficient :2082
##
##
##
##
##
Our datetime has already been floored to the hour. The proportions of src_area and day are equal across the scotty_clean rows. In peak, peak hours have more rows than nonpeak hours. We can also see that coverage shows more hours with insufficient drivers than sufficient ones.
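The imbalance between peak and nonpeak is simply a consequence of how peak was defined: it covers 14 of the 24 hours (07-20), so it holds 14/24 of the hourly slots. A quick sketch to verify:
# peak should account for roughly 14/24 (about 58%) of the rows
prop.table(table(scotty_clean$peak))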
Since we have already transformed our base data, it's time to do the same with submission, so we can predict it with the models we build.
submission <- submission %>%
mutate(src_area = as.factor(src_area),
day = wday(datetime, label = T, abbr = F),
hour = hour(datetime),
peak = as.factor(ifelse( 7 <= hour & hour <= 20, "peak", "nonpeak"))) %>%
select(datetime, src_area, day, hour, peak, coverage) %>%
mutate(day = as.factor(day),
hour = as.factor(hour))
submission
Yay, the submission data now has the same columns, in the same order, as our scotty_clean data.
Before we model our data, let's check the class proportions. First, let's find out the balance of our target:
round(prop.table(table(scotty_clean$coverage)),2)
##
## insufficient sufficient
## 0.54 0.46
From the totals of 2454 insufficient and 2082 sufficient observations, insufficiency accounts for 54%, which is still reasonably balanced.
Let’s find out the balance of our target by area
round(prop.table(table(scotty_clean$coverage, scotty_clean$src_area)),2)
##
## sxk3 sxk8 sxk9
## insufficient 0.21 0.06 0.28
## sufficient 0.12 0.28 0.06
Summed over the three areas, the insufficient and sufficient totals match the overall target proportion above, although the balance differs considerably between areas.
Now that we know the target proportions, let's explore how the predictors relate to the target through some plots. First, I want to create a theme for the plots:
white_theme <- theme(
panel.background = element_rect(fill="white"),
plot.background = element_rect(fill="white"),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
text = element_text(color="black"),
axis.text = element_text(color="black"),
strip.background = element_rect(fill="white"),
strip.text = element_text(colour = 'black')
)
Let's visualize the driver coverage per hour of the day.
scotty_clean %>%
group_by(src_area,hour) %>%
count(coverage) %>%
ggplot(aes(hour, n, fill = coverage)) +
geom_col(position = position_dodge(preserve = "single")) +
facet_wrap(~src_area, nrow = 3) +
scale_fill_manual(values=c("purple4", "orangered")) +
theme(plot.title = element_text(hjust = 0.5)) +
white_theme +
labs(title = "Daily Coverage Statistics",
x = "Hour",
y = NULL,
fill = "Coverage")
Let's visualize the driver coverage per day of the week.
scotty_clean %>%
group_by(src_area,day) %>%
count(coverage) %>%
ggplot(aes(day, n, fill = coverage)) +
geom_col(position = "dodge") +
facet_wrap(~src_area, nrow = 3) +
scale_fill_manual(values=c("purple4", "orangered")) +
theme(plot.title = element_text(hjust = 0.5)) +
white_theme +
labs(title = "Weekly Coverage Statistics",
x = "Day",
y = NULL,
fill = "Coverage")
From both plots, we can see that driver insufficiency is worst in sxk9, at almost every hour of every day, while in sxk8 the drivers are sufficient at almost every hour every day. In sxk3, sufficiency and insufficiency are nearly equal, although insufficiency is higher; the sufficient hours are mostly between 2 and 7 AM.
After visualizing those plots, let's draw a heatmap to see the availability at each hour of every day.
scotty_clean %>%
mutate(coverage = as.numeric(ifelse(coverage == "insufficient", "0", "100"))) %>%
group_by(src_area, day, hour) %>%
summarise(coverage = mean(coverage)) %>%
ggplot(aes(x = hour, y = day, fill = coverage)) +
geom_tile() +
facet_wrap(~src_area, scales = "free_x", ncol = 1) +
theme(plot.title = element_text(hjust = 0.5)) +
white_theme +
labs(title = "Heatmap Region Avaibility",
x = "Hour",
y = NULL,
fill = "Availability")
The heatmap tells the same story as the plots above: sxk9 is dark blue (insufficient) at almost every hour, while sxk8 shows the highest sufficiency, with mostly light blue. In sxk3, sufficiency and insufficiency are nearly equal, although insufficiency is higher; the light blue appears from 2 to 7 AM, after which dark blue dominates the panel.
Before we continue to modeling, let's split our scotty_clean data into a training set and a test set with an 85:15 proportion using initial_split().
set.seed(123)
split <- initial_split(scotty_clean, prop = 0.85, strata = "coverage")
train <- training(split)
test <- testing(split)
After splitting the data, let's check the class proportions in both sets.
round(prop.table(table(train$coverage)),2)
##
## insufficient sufficient
## 0.54 0.46
round(prop.table(table(test$coverage)),2)
##
## insufficient sufficient
## 0.54 0.46
Since we want balanced data, let's re-sample our training set with downSample().
train <- downSample(x = train[,-6], y = train$coverage, yname = "coverage")
prop.table(table(train$coverage))
##
## insufficient sufficient
## 0.5 0.5
The proportion in train is now balanced: 50% insufficient and 50% sufficient.
In this project, I will create several models and test them on the test and submission data. The models we are going to use are Naive Bayes, Decision Tree, and Random Forest.
Since we have a default threshold for the scoring, I will store it in eval_threshold as a data.frame() so we can later compare it with the confusionMatrix() results of each model.
eval_threshold <- data.frame(Accuracy = 0.75,
Recall = 0.85,
Specificity = 0.7,
Precision = 0.75)
eval_threshold
Now, let’s do the modeling!
We'll try a Naive Bayes model using naiveBayes().
naive <- naiveBayes(coverage ~ src_area + day + hour + peak, train)
summary(naive)
## Length Class Mode
## apriori 2 table numeric
## tables 4 -none- list
## levels 2 -none- character
## isnumeric 4 -none- logical
## call 4 -none- call
naive$tables
## $src_area
## src_area
## Y sxk3 sxk8 sxk9
## insufficient 0.3915254 0.1090395 0.4994350
## sufficient 0.2757062 0.5943503 0.1299435
##
## $day
## day
## Y Sunday Monday Tuesday Wednesday Thursday Friday
## insufficient 0.1463277 0.1344633 0.1446328 0.1254237 0.1418079 0.1700565
## sufficient 0.1435028 0.1412429 0.1440678 0.1644068 0.1457627 0.1209040
## day
## Y Saturday
## insufficient 0.1372881
## sufficient 0.1401130
##
## $hour
## hour
## Y 0 1 2 3 4
## insufficient 0.03672316 0.04237288 0.02937853 0.03050847 0.02372881
## sufficient 0.04576271 0.04237288 0.05254237 0.05197740 0.06779661
## hour
## Y 5 6 7 8 9
## insufficient 0.02316384 0.02881356 0.04632768 0.06158192 0.05423729
## sufficient 0.06723164 0.05423729 0.03672316 0.02033898 0.03050847
## hour
## Y 10 11 12 13 14
## insufficient 0.03559322 0.03785311 0.04011299 0.04406780 0.04576271
## sufficient 0.04745763 0.04011299 0.03841808 0.03446328 0.03389831
## hour
## Y 15 16 17 18 19
## insufficient 0.04689266 0.04971751 0.05367232 0.05310734 0.05480226
## sufficient 0.03502825 0.03220339 0.02768362 0.02824859 0.02824859
## hour
## Y 20 21 22 23
## insufficient 0.04180791 0.04293785 0.03446328 0.04237288
## sufficient 0.04180791 0.04406780 0.05480226 0.04406780
##
## $peak
## peak
## Y nonpeak peak
## insufficient 0.3344633 0.6655367
## sufficient 0.5248588 0.4751412
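The tables above are the per-class conditional probabilities that the model multiplies together with the class priors when it scores a new observation. As a small sketch, we can ask for the raw posterior probabilities of a single test row:
# posterior class probabilities for the first row of the test set
predict(naive, newdata = head(test, 1), type = "raw")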
Now, let's use the model to predict both the test data and the submission data.
pred_naive <- predict(object = naive, newdata = test)
submission1 <- submission %>%
  mutate(coverage = predict(object = naive, newdata = submission))

write.csv(submission1, "C:/Users/yohana/Documents/Kappa Data Analytics/DV & ML/Machine Learning/Capstone ML/data/submission1.csv", row.names = FALSE)
To see whether our model gives good results for Accuracy, Recall, Specificity, and Precision, let's check it with confusionMatrix().
naive_cm <- confusionMatrix(as.factor(pred_naive), as.factor(test$coverage), positive = "insufficient")
naive_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction insufficient sufficient
## insufficient 286 54
## sufficient 82 258
##
## Accuracy : 0.8
## 95% CI : (0.7679, 0.8294)
## No Information Rate : 0.5412
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6
##
## Mcnemar's Test P-Value : 0.0206
##
## Sensitivity : 0.7772
## Specificity : 0.8269
## Pos Pred Value : 0.8412
## Neg Pred Value : 0.7588
## Prevalence : 0.5412
## Detection Rate : 0.4206
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8020
##
## 'Positive' Class : insufficient
##
Time to extract the Accuracy, Recall, Specificity, and Precision of our Naive Bayes model.
eval_naive <- data.frame(Accuracy = naive_cm$overall[1],
Recall = naive_cm$byClass[1],
Specificity = naive_cm$byClass[2],
Precision = naive_cm$byClass[3])
round(eval_naive,2)
Compared to our threshold, the Naive Bayes model gives higher percentages for Accuracy, Specificity, and Precision, but a poor result for Recall, which is about 0.07 points lower than our threshold of 0.85.
Now, let's compare with our submission1.csv. Just like the result on our test data, it gives a poor Recall, which is why the model didn't pass the submission on the leaderboard, marked with a red box.
For our test data, the Naive Bayes model gives an overfit result for Specificity, while for the submission data the result is neither underfit nor overfit; it is just that the Recall is lower than our threshold.
Now, let's build a Decision Tree model using ctree().
tree <- ctree(formula = coverage ~ src_area + day + hour + peak, data = train)
summary(tree)
## Length Class Mode
## 1 23 constparty list
## 2 19 constparty list
## 3 13 constparty list
## 4 5 constparty list
## 5 3 constparty list
## 6 1 constparty list
## 7 1 constparty list
## 8 1 constparty list
## 9 7 constparty list
## 10 5 constparty list
## 11 3 constparty list
## 12 1 constparty list
## 13 1 constparty list
## 14 1 constparty list
## 15 1 constparty list
## 16 5 constparty list
## 17 3 constparty list
## 18 1 constparty list
## 19 1 constparty list
## 20 1 constparty list
## 21 3 constparty list
## 22 1 constparty list
## 23 1 constparty list
Let’s check the Decision Tree.
tree
##
## Model formula:
## coverage ~ src_area + day + hour + peak
##
## Fitted party:
## [1] root
## | [2] src_area in sxk3, sxk9
## | | [3] hour in 0, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23
## | | | [4] hour in 0, 7, 11, 12, 15, 20, 21, 23
## | | | | [5] src_area in sxk3
## | | | | | [6] day <= Thursday: insufficient (n = 280, err = 40.4%)
## | | | | | [7] day > Thursday: insufficient (n = 110, err = 26.4%)
## | | | | [8] src_area in sxk9: insufficient (n = 368, err = 22.0%)
## | | | [9] hour in 8, 9, 13, 14, 16, 17, 18, 19
## | | | | [10] day <= Monday
## | | | | | [11] hour in 8, 9, 14, 19
## | | | | | | [12] src_area in sxk3: insufficient (n = 51, err = 19.6%)
## | | | | | | [13] src_area in sxk9: insufficient (n = 55, err = 1.8%)
## | | | | | [14] hour in 13, 16, 17, 18: insufficient (n = 108, err = 29.6%)
## | | | | [15] day > Monday: insufficient (n = 527, err = 11.0%)
## | | [16] hour in 1, 2, 3, 4, 5, 6, 10, 22
## | | | [17] src_area in sxk3
## | | | | [18] hour in 1, 10, 22: insufficient (n = 153, err = 49.7%)
## | | | | [19] hour in 2, 3, 4, 5, 6: sufficient (n = 257, err = 20.2%)
## | | | [20] src_area in sxk9: insufficient (n = 386, err = 29.3%)
## | [21] src_area in sxk8
## | | [22] hour in 0, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 20, 21, 22, 23: sufficient (n = 778, err = 8.9%)
## | | [23] hour in 1, 7, 8, 9, 15, 16, 17, 18, 19: sufficient (n = 467, err = 26.6%)
##
## Number of inner nodes: 11
## Number of terminal nodes: 12
Here is the plot from our Decision Tree.
plot(tree, type = "simple")
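Besides the class label, a ctree() model from partykit can also return class probabilities or the terminal node an observation lands in; a small sketch:
# class probabilities and terminal-node ids for a few test observations
predict(tree, newdata = head(test), type = "prob")
predict(tree, newdata = head(test), type = "node")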
Now, let's use the model to predict both the test data and the submission data.
pred_tree <- predict(object = tree, newdata = test)
submission2 <- submission %>%
  mutate(coverage = predict(object = tree, newdata = submission))

write.csv(submission2, "C:/Users/yohana/Documents/Kappa Data Analytics/DV & ML/Machine Learning/Capstone ML/data/submission2.csv", row.names = FALSE)
Let's see whether our Decision Tree model gives good results for Accuracy, Recall, Specificity, and Precision, using confusionMatrix().
tree_cm <- confusionMatrix(as.factor(pred_tree), as.factor(test$coverage), positive = "insufficient")
tree_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction insufficient sufficient
## insufficient 317 71
## sufficient 51 241
##
## Accuracy : 0.8206
## 95% CI : (0.7896, 0.8487)
## No Information Rate : 0.5412
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.637
##
## Mcnemar's Test P-Value : 0.0854
##
## Sensitivity : 0.8614
## Specificity : 0.7724
## Pos Pred Value : 0.8170
## Neg Pred Value : 0.8253
## Prevalence : 0.5412
## Detection Rate : 0.4662
## Detection Prevalence : 0.5706
## Balanced Accuracy : 0.8169
##
## 'Positive' Class : insufficient
##
Time to extract the Accuracy, Recall, Specificity, and Precision of our Decision Tree model.
eval_tree <- data.frame(Accuracy = tree_cm$overall[1],
Recall = tree_cm$byClass[1],
Specificity = tree_cm$byClass[2],
Precision = tree_cm$byClass[3])
round(eval_tree,2)
Compared to our threshold, the Decision Tree model gives better results for Accuracy, Recall, Specificity, and Precision; every percentage is above the threshold.
Now, let's compare with our submission2.csv. The Decision Tree submission gives a good result, passing the threshold on the leaderboard with green boxes.
The Decision Tree model performs better than our threshold. It is the best-fitting model for both test and submission, because no metric differs from the threshold by more than 10 points.
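One way to make that "10 points" criterion explicit is to subtract the threshold data frame from the model's metrics; a minimal sketch using the objects defined above:
# difference from the threshold for each metric (positive = above the threshold)
round(eval_tree, 2) - eval_threshold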
Lastly, we'll try a Random Forest model.
set.seed(123)
# 5-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
scotty_forest <- train(coverage ~ ., data = train, method = "rf", trControl = ctrl)
summary(scotty_forest)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 3540 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 7080 matrix numeric
## oob.times 3540 -none- numeric
## classes 2 -none- character
## importance 33 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 3540 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 33 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
Here are the plot and explanation of our model.
plot(scotty_forest)
scotty_forest
## Random Forest
##
## 3540 samples
## 5 predictor
## 2 classes: 'insufficient', 'sufficient'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 2832, 2832, 2832, 2832, 2832, 2832, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7832392 0.5664783
## 17 0.7790960 0.5581921
## 33 0.7707156 0.5414313
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
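caret can also report which predictors the tuned forest relies on most via varImp(); a small sketch (not used elsewhere in this write-up):
# variable importance of the final random forest
varImp(scotty_forest)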
Now, let's use the model to predict both the test data and the submission data.
pred_forest <- predict(object = scotty_forest, newdata = test)
submission3 <- submission %>%
  mutate(coverage = predict(object = scotty_forest, newdata = submission))

write.csv(submission3, "C:/Users/yohana/Documents/Kappa Data Analytics/DV & ML/Machine Learning/Capstone ML/data/submission3.csv", row.names = FALSE)
To see whether our model gives good results for Accuracy, Recall, Specificity, and Precision, let's check it with confusionMatrix().
forest_cm <- confusionMatrix(as.factor(pred_forest), as.factor(test$coverage), positive = "insufficient")
forest_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction insufficient sufficient
## insufficient 297 62
## sufficient 71 250
##
## Accuracy : 0.8044
## 95% CI : (0.7726, 0.8336)
## No Information Rate : 0.5412
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.607
##
## Mcnemar's Test P-Value : 0.4879
##
## Sensitivity : 0.8071
## Specificity : 0.8013
## Pos Pred Value : 0.8273
## Neg Pred Value : 0.7788
## Prevalence : 0.5412
## Detection Rate : 0.4368
## Detection Prevalence : 0.5279
## Balanced Accuracy : 0.8042
##
## 'Positive' Class : insufficient
##
eval_forest <- data.frame(Accuracy = forest_cm$overall[1],
Recall = forest_cm$byClass[1],
Specificity = forest_cm$byClass[2],
Precision = forest_cm$byClass[3])
round(eval_forest,2)
Compared to our threshold, the Random Forest model gives higher percentages for Accuracy, Specificity, and Precision, but a poor result for Recall, which is about 0.04 points lower than our threshold of 0.85.
Now, let's compare with our submission3.csv. Similar to the result on the test data, the Recall is poor, so the model didn't pass the submission on the leaderboard.
On the test data, the Random Forest model is neither underfit nor overfit; only the Recall falls below the threshold. For the submission data, however, the Recall is underfitting and the Specificity is overfitting.
Local Interpretable Model-agnostic Explanations (LIME, implemented in the lime package) is a visualization technique that helps explain individual predictions. As the name implies, it is model agnostic, so it can be applied to any supervised regression or classification model. Behind the workings of LIME lies the assumption that every complex model is linear on a local scale, asserting that it is possible to fit a simple model around a single observation that mimics how the global model behaves at that locality. The simple model can then be used to explain the predictions of the more complex model locally.
The generalized algorithm LIME applies is: permute the training data to create replicated feature data with slight value modifications; compute a similarity measure between the original observation and the permuted observations; apply the complex model to predict the outcomes of the permuted data; select m features that best describe the predicted outcomes; fit a simple model to the permuted data, weighted by similarity to the original observation; and use the resulting feature weights to explain the local behaviour.
It is often important to understand the Machine Learning model that we’ve trained on a global scale, and also to zoom into local regions of your data or your predictions and derive local explanations. Global interpretations help us understand the inputs and their entire modeled relationship with the prediction target, but global interpretations can be highly approximate in some cases. Local interpretations help us understand model predictions for a single row of data or a group of similar rows.
In this project, I will visualize only the first four observations.
Before we create the lime() explainer, let's check the class of our naive model.
class(naive)
## [1] "naiveBayes"
The class is naiveBayes, so we will define the helper functions for the naiveBayes class.
model_type.naiveBayes <- function(x){
return("classification")
}
predict_model.naiveBayes <- function(x, newdata, type = "raw") {
# return classification probabilities only
res <- predict(x, newdata, type = "raw") %>% as.data.frame()
return(res)
}
Now, let's create the lime() explainer.
set.seed(123)
explainer <- lime(train %>% select(-coverage), naive)
explanation <- explain(test %>% select(-coverage) %>% head(4),
n_labels = 1,
explainer = explainer,
n_features = 3,
feature_select = "none")
It is time to visualize the interpretation
plot_features(explanation)
These values indicate that the Naive Bayes model gives around 60%-80% probability, but the Explanation Fit is only around 0.12-0.19 (12%-19%), which means LIME can explain only about 12%-19% of this model. In cases 1-3, all predictors support the insufficient prediction, while in case 4, src_area contradicts the sufficient prediction.
Before we create the lime() explainer, let's check the class of our tree model.
class(tree)
## [1] "constparty" "party"
The classes are constparty and party, so we will define the helper functions for the party class.
model_type.party <- function(x){
return("classification")
}
predict_model.party <- function(x, newdata, type = "prob") {
# return classification probabilities only
res <- predict(x, newdata, type = "prob") %>% as.data.frame()
return(res)
}
Now, let's create the lime() explainer.
set.seed(123)
explainer2 <- lime(train %>% select(-coverage), tree)
explanation2 <- explain(test %>% select(-coverage) %>% head(4),
n_labels = 1,
explainer = explainer2,
n_features = 5,
feature_select = "none")
It is time to visualize the interpretation
plot_features(explanation2)
These values show that the Decision Tree model gives around 50%-80% probability, but the Explanation Fit is only around 0.06-0.09, meaning LIME can explain only about 6%-9% of this model. Across all cases, Sunday in day contradicts the insufficient prediction the most, while src_area sxk3 supports it the most.
Before we create the lime() explainer, let's check the class of our scotty_forest model.
class(scotty_forest)
## [1] "train" "train.formula"
The classes are train and train.formula, so we will define the helper functions for the train class.
model_type.train <- function(x){
return("classification")
}
predict_model.train <- function(x, newdata, type = "prob") {
# return classification probabilities only
res <- predict(x, newdata, type = "prob") %>% as.data.frame()
return(res)
}
Now, let's create the lime() explainer.
set.seed(123)
explainer3 <- lime(train %>% select(-coverage), scotty_forest)
explanation3 <- explain(test %>% select(-coverage) %>% head(4),
n_labels = 1,
explainer = explainer3,
n_features = 3,
feature_select = "none")
It is time to visualize the interpretation
plot_features(explanation3)
These values show that the Random Forest model gives around 50%-80% probability, but the Explanation Fit is only around 0.15-0.22, meaning LIME can explain only about 15%-22% of this model. In cases 1-3, all predictors support the insufficient prediction, while in case 4, src_area sxk3 contradicts the sufficient prediction.
Let's compare our models' metric results.
eval_total <- rbind(eval_threshold, round(eval_naive,2), round(eval_tree,2), round(eval_forest,2))
rownames(eval_total) <- c("Threshold", "NaiveBayes", "Decision Tree", "Random Forest")
eval_total
Since the Accuracy, Specificity, and Precision of all models exceed the threshold, let's focus on comparing Recall.
eval_total %>%
select(Recall)
Our objective in this project is to create a model that can predict driver insufficiency based on time and area, and I use Recall because I want to capture as many insufficient events as possible so they can be anticipated. The highest Recall comes from the Decision Tree model, at 86%.
Since I wanted to predict the sufficiency and insufficiency of Scotty's drivers across areas and times, yes, the goal has been met.
Yes, it can be.
Of all the models, the Decision Tree gives the best result: not only does every metric exceed our threshold, it is also the best-fitting model, since no metric differs from the threshold by more than 10 points.