Introduction

About the Project

This project was created as the capstone for Algoritma's Machine Learning specialization. Here, I use Scotty's ride data to build a classification model.

Scotty? What’s that?

Scotty is a ride-sharing business operating in several big cities in Turkey. It was established in May 2017 with the mission of easing Istanbul’s traffic problem through motorcycles, and over time it has expanded into food and package deliveries. According to USA Today, Scotty’s founder, Tarkan Altan, said he got the idea after being shocked by Istanbul’s traffic when his family moved to the city from Marmaris on the Turkish coast.

It operates like Uber and Lyft: enter your location and destination, then select a ride by tapping a button that says, “Beam me up” – a reference to Star Trek and, hence, the name Scotty. (The company’s logo is a four-fingered V – Spock’s “Live Long and Prosper” Vulcan salute.) The cc: Podcast and Webrazzi reported Tarkan Altan discussing how, in the near future, he would focus 50% of his time on fundraising, 30% on recruitment, and the remaining 20% on strategy.

Data Pre Processing

There are two CSV files for this project: data-train, which we use as our source data, and data-test, which we submit to the leaderboard to check whether the models we have created fit well enough.

Library

Before we read the data, let’s load the libraries we need:

library(tidyverse)  # data wrangling (dplyr, tidyr, readr) and ggplot2
library(zoo)        # time-series helpers
library(lubridate)  # date-time manipulation (floor_date, wday, hour)
library(padr)       # padding time series to a complete hourly grid
library(rsample)    # train/test splitting (initial_split)
library(partykit)   # decision tree (ctree)
library(ggplot2)    # plotting (also attached by tidyverse)
library(caret)      # trainControl/train, downSample, confusionMatrix
library(e1071)      # naive Bayes
library(lime)       # local model explanations

options(scipen = 9999)  # avoid scientific notation in printed output

Submission Dataset

Now, let’s read data-test for the submission

submission <- read_csv("data/data-test.csv")
submission

The submission data consists of only 3 columns, which become our reference for how to process the training data.

Scotty

First of all, let’s load scotty data from data-train

scotty <- read_csv("data/data-train.csv")
scotty
dim(scotty)
## [1] 229532     16

The data we have consists of 16 columns and 229,532 rows.

Now, let’s discover!

Checking Data Types

Does scotty already have appropriate data types? Let’s find out.

str(scotty)
## tibble [229,532 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                : chr [1:229532] "59d005e1ffcfa261708ce9cd" "59d0066a3d32b861760d4a01" "59d0066affcfa261708ceb11" "59d006a1ffcfa261708ceba4" ...
##  $ trip_id           : chr [1:229532] "59d005e9cb564761a8fe5d3e" "59d00678ffcfa261708ceb37" NA "59d006c131e39c618969343d" ...
##  $ driver_id         : chr [1:229532] "59a892c5568be44b2734f276" "59a135565e88a24b11f11994" NA "599dc0dfa5b4fd5471ad8453" ...
##  $ rider_id          : chr [1:229532] "59ad2d6efba75a581666b506" "59ce930f3d32b861760a4617" "59cd704bcf482f6ce2fadfdb" "59bd62cc7e3c3b663924ba86" ...
##  $ start_time        : POSIXct[1:229532], format: "2017-10-01 00:00:17" "2017-10-01 00:02:34" ...
##  $ src_lat           : num [1:229532] 41.1 40.9 41.1 41 41.1 ...
##  $ src_lon           : num [1:229532] 29 29.1 29 29 29 ...
##  $ src_area          : chr [1:229532] "sxk9" "sxk8" "sxk9" "sxk9" ...
##  $ src_sub_area      : chr [1:229532] "sxk9s" "sxk8y" "sxk9e" "sxk9s" ...
##  $ dest_lat          : num [1:229532] 41.1 41.1 41.1 41 41.1 ...
##  $ dest_lon          : num [1:229532] 29 29 29 29 29 ...
##  $ dest_area         : chr [1:229532] "sxk9" "sxk9" "sxk9" "sxk9" ...
##  $ dest_sub_area     : chr [1:229532] "sxk9u" "sxk9s" "sxk9e" "sxk9e" ...
##  $ distance          : num [1:229532] 5.38 15.5 1.13 4.17 3.36 ...
##  $ status            : chr [1:229532] "confirmed" "confirmed" "nodrivers" "confirmed" ...
##  $ confirmed_time_sec: num [1:229532] 8 14 0 32 65 110 0 49 27 21 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_character(),
##   ..   trip_id = col_character(),
##   ..   driver_id = col_character(),
##   ..   rider_id = col_character(),
##   ..   start_time = col_datetime(format = ""),
##   ..   src_lat = col_double(),
##   ..   src_lon = col_double(),
##   ..   src_area = col_character(),
##   ..   src_sub_area = col_character(),
##   ..   dest_lat = col_double(),
##   ..   dest_lon = col_double(),
##   ..   dest_area = col_character(),
##   ..   dest_sub_area = col_character(),
##   ..   distance = col_double(),
##   ..   status = col_character(),
##   ..   confirmed_time_sec = col_double()
##   .. )

Since we use read_csv(), start_time is already parsed as POSIXct; character columns such as src_area are still character for now and will be converted to factor later.
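
As a side note, the factor conversion could also be done at read time through readr’s col_types argument. Below is a small sketch of that alternative (it is not run as part of this pipeline, and scotty_typed is just an illustrative name); the rest of this write-up keeps the later mutate() approach instead.

scotty_typed <- read_csv(
  "data/data-train.csv",
  col_types = cols(
    src_area = col_factor(),   # read straight into a factor
    status   = col_factor(),
    .default = col_guess()     # keep readr's guesses for everything else
  )
)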

Checking Missing Values

Now, time to check whether there are any missing values in our data.

colSums(is.na(scotty))
##                 id            trip_id          driver_id           rider_id 
##                  0              14901              14900                  0 
##         start_time            src_lat            src_lon           src_area 
##                  0                  0                  0                  0 
##       src_sub_area           dest_lat           dest_lon          dest_area 
##                  0                  0                  0                  0 
##      dest_sub_area           distance             status confirmed_time_sec 
##                  0                  0                  0                  0

There are missing values in both trip_id and driver_id, but we can continue since we won’t use those columns.

Data Summary

Let’s summarize our dataset

summary(scotty)
##       id              trip_id           driver_id           rider_id        
##  Length:229532      Length:229532      Length:229532      Length:229532     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    start_time                     src_lat         src_lon     
##  Min.   :2017-10-01 00:00:17   Min.   :40.79   Min.   :28.48  
##  1st Qu.:2017-10-19 19:36:04   1st Qu.:41.00   1st Qu.:28.97  
##  Median :2017-11-07 14:08:48   Median :41.03   Median :29.00  
##  Mean   :2017-11-04 18:32:48   Mean   :41.03   Mean   :29.00  
##  3rd Qu.:2017-11-19 03:44:19   3rd Qu.:41.06   3rd Qu.:29.04  
##  Max.   :2017-12-02 23:59:20   Max.   :41.13   Max.   :29.18  
##    src_area         src_sub_area          dest_lat         dest_lon       
##  Length:229532      Length:229532      Min.   : 5.923   Min.   :  0.1273  
##  Class :character   Class :character   1st Qu.:41.001   1st Qu.: 28.9689  
##  Mode  :character   Mode  :character   Median :41.036   Median : 29.0036  
##                                        Mean   :41.032   Mean   : 29.0025  
##                                        3rd Qu.:41.064   3rd Qu.: 29.0475  
##                                        Max.   :52.370   Max.   :121.4721  
##   dest_area         dest_sub_area         distance           status         
##  Length:229532      Length:229532      Min.   :   0.000   Length:229532     
##  Class :character   Class :character   1st Qu.:   2.802   Class :character  
##  Mode  :character   Mode  :character   Median :   4.804   Mode  :character  
##                                        Mean   :   6.772                     
##                                        3rd Qu.:   7.943                     
##                                        Max.   :9345.566                     
##  confirmed_time_sec
##  Min.   :  0.00    
##  1st Qu.: 12.00    
##  Median : 26.00    
##  Mean   : 46.41    
##  3rd Qu.: 66.00    
##  Max.   :514.00

From start_time we can see that the first order was placed on 2017-10-01 00:00:17 and the last on 2017-12-02 23:59:20, so the data covers roughly two months of transactions, from 1 October 2017 to 2 December 2017.
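
To double-check that span in code, here is a small sketch; the end points should match the summary above.

range(scotty$start_time)
difftime(max(scotty$start_time), min(scotty$start_time), units = "days")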

Since src_area and status are character columns, let’s see how many distinct values they have using unique()

unique(scotty$src_area)
## [1] "sxk9" "sxk8" "sxk3"
unique(scotty$status)
## [1] "confirmed" "nodrivers"

There are 3 areas from src_area and 2 statuses from status

Data Processing

Since submission is our reference, let’s drop the unused columns and convert the remaining ones to appropriate data types.

First, we select src_area, start_time (which will become datetime, as in submission) and status (which defines the coverage). start_time is rounded down to the hour with floor_date(). The code is below:

scotty <- scotty %>% 
   select(src_area, start_time, status) %>% 
   mutate(src_area = as.factor(src_area),
          start_time = floor_date(start_time, unit = "hour"),  # round down to the hour
          datetime = start_time) %>%
   # count orders per area, hour and status
   group_by(src_area, datetime) %>% 
   count(status) %>% 
   # one column per status: an hour is "insufficient" whenever a "nodrivers" order occurred
   pivot_wider(names_from = status, values_from = n) %>% 
   mutate(coverage = ifelse(nodrivers == 0, "sufficient", "insufficient")) %>% 
   select(-c(confirmed, nodrivers)) %>%
   # pad every area to a complete hourly grid; hours without any "nodrivers" order are "sufficient"
   group_by(src_area) %>% 
   pad('hour', start_val = as.POSIXct("2017-10-01 00:00:00"), end_val = as.POSIXct("2017-12-02 23:00:00")) %>% 
   mutate(coverage = replace_na(coverage, "sufficient")) %>%  
   ungroup()
scotty

The scotty data now has the same columns as submission.
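
As a quick sanity check (assuming the submission columns are named src_area, datetime and coverage, as printed earlier), the following sketch should return no mismatching column names:

setdiff(names(submission), names(scotty))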

Exploratory Data Analysis

Now, let’s explore our scotty data to see what other information we can derive from it, so we can do the same for submission.

Scotty

Let’s clean our data and make a final summary

Data Cleaning

From datetime we can derive day using wday(), hour using hour(), and peak, which we define from hour.

scotty_clean <- scotty %>% 
   mutate(src_area = as.factor(src_area),
          day = wday(datetime, label = T, abbr = F),
          hour = hour(datetime),
          peak = as.factor(ifelse( 7 <= hour & hour <= 20, "peak", "nonpeak")),
          coverage = as.factor(coverage)) %>% 
   select(datetime, src_area, day, hour, peak, coverage) %>% 
   mutate(day = as.factor(day),
          hour = as.factor(hour))
scotty_clean

As we can see, we now have a much more compact data set with appropriate data types.

Data Summary

Let’s see the data summary

summary(scotty_clean)
##     datetime                   src_area           day           hour     
##  Min.   :2017-10-01 00:00:00   sxk3:1512   Sunday   :648   0      : 189  
##  1st Qu.:2017-10-16 17:45:00   sxk8:1512   Monday   :648   1      : 189  
##  Median :2017-11-01 11:30:00   sxk9:1512   Tuesday  :648   2      : 189  
##  Mean   :2017-11-01 11:30:00               Wednesday:648   3      : 189  
##  3rd Qu.:2017-11-17 05:15:00               Thursday :648   4      : 189  
##  Max.   :2017-12-02 23:00:00               Friday   :648   5      : 189  
##                                            Saturday :648   (Other):3402  
##       peak              coverage   
##  nonpeak:1890   insufficient:2454  
##  peak   :2646   sufficient  :2082  
##                                    
##                                    
##                                    
##                                    
## 

Our datetime has been floored to the hour. The levels of src_area and day are evenly distributed across the scotty_clean rows, while peak has more observations than nonpeak, since peak covers 14 of the 24 hours. The coverage target shows more insufficient than sufficient hours.

Submission

Since we have transformed our base data, it’s time to do the same with submission so we can predict on it with the models we build.

submission <- submission %>%
   mutate(src_area = as.factor(src_area),
          day = wday(datetime, label = T, abbr = F),
          hour = hour(datetime),
          peak = as.factor(ifelse( 7 <= hour & hour <= 20, "peak", "nonpeak"))) %>%
   select(datetime, src_area, day, hour, peak, coverage) %>%
   mutate(day = as.factor(day),
          hour = as.factor(hour))
submission

Yay, the submission data now has the same columns in the same order as our scotty_clean data.

Data Proportion

Before we model our data, let’s see our data proportionality.

Target Proportion

Let’s find out the balance of our target

round(prop.table(table(scotty_clean$coverage)),2)
## 
## insufficient   sufficient 
##         0.54         0.46

Out of 2,454 insufficient and 2,082 sufficient hours, insufficient makes up 54%, which is still reasonably balanced.

Target Proportion by Area

Let’s find out the balance of our target by area

round(prop.table(table(scotty_clean$coverage, scotty_clean$src_area)),2)
##               
##                sxk3 sxk8 sxk9
##   insufficient 0.21 0.06 0.28
##   sufficient   0.12 0.28 0.06

Summing the insufficient and sufficient proportions across the 3 areas gives back (up to rounding) the overall target proportion above, 0.54 vs 0.46.
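
A hedged variant of the same table, normalised within each area (margin = 2), makes the per-area balance easier to read:

round(prop.table(table(scotty_clean$coverage, scotty_clean$src_area), margin = 2), 2)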

Target and Predictor Correlation

Since we already get target proportion, let’s see the correlation from our data

First, I want to create a theme for plots

white_theme <- theme(
  panel.background = element_rect(fill="white"),
  plot.background = element_rect(fill="white"),
  panel.grid.minor.x = element_blank(),
  panel.grid.major.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  panel.grid.major.y = element_blank(),
  text = element_text(color="black"),
  axis.text = element_text(color="black"),
  strip.background =element_rect(fill="white"),
  strip.text = element_text(colour = 'black')
  )

Daily Coverage Correlation

Let’s visualize the coverage of the driver in daily per hour.

scotty_clean %>% 
  group_by(src_area,hour) %>% 
  count(coverage) %>% 
  ggplot(aes(hour, n, fill = coverage)) +
  geom_col(position = position_dodge(preserve = "single")) +
  facet_wrap(~src_area, nrow = 3) +
  scale_fill_manual(values=c("purple4", "orangered")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  white_theme +
  labs(title = "Daily Coverage Statistics",
       x = "Hour",
      y = NULL,
      fill = "Coverage")

Weekly Coverage Statistics

Let’s visualize the coverage of the driver in a week per day.

scotty_clean %>% 
  group_by(src_area,day) %>% 
  count(coverage) %>% 
  ggplot(aes(day, n, fill = coverage)) +
  geom_col(position = "dodge") +
  facet_wrap(~src_area, nrow = 3) +
  scale_fill_manual(values=c("purple4", "orangered")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  white_theme +
  labs(title = "Weekly Coverage Statistics",
       x = "Day",
       y = NULL,
       fill = "Coverage")

From both plots, we can see that driver insufficiency is worst in sxk9, in almost every hour of every day, while in sxk8 drivers are sufficient in nearly every hour. In sxk3 sufficiency and insufficiency are almost balanced, although insufficiency is higher; the sufficient hours are mostly between 2 and 7 AM.

Heatmap

After visualizing those plots, let’s draw a heatmap to see the availability in each hour of every day.

scotty_clean %>% 
  mutate(coverage = as.numeric(ifelse(coverage == "insufficient", "0", "100"))) %>% 
  group_by(src_area, day, hour) %>% 
  summarise(coverage = mean(coverage)) %>% 
  ggplot(aes(x = hour, y = day, fill = coverage)) +
  geom_tile() +
  facet_wrap(~src_area, scales = "free_x", ncol = 1) +
  theme(plot.title = element_text(hjust = 0.5)) +
  white_theme +
  labs(title = "Heatmap Region Avaibility",
        x = "Hour",
        y = NULL,
        fill = "Availability")

The heatmap gives the same interpretation as the plots above: sxk9 is insufficient (dark blue) at almost every hour, while sxk8 has the highest sufficiency, shown mostly in light blue. In sxk3 sufficiency and insufficiency are almost balanced, although insufficiency is higher; the light blue band runs from 2 to 7 AM, after which dark blue dominates the plot.

Cross Validation

Before we continue to modeling, let’s split scotty_clean into a training set and a test set with an 85:15 proportion using initial_split()

set.seed(123)
split <- initial_split(scotty_clean, prop = 0.85, strata = "coverage")
train <- training(split)
test <- testing(split)

After splitting the data, it’s time to check the class proportion in both sets

round(prop.table(table(train$coverage)),2)
## 
## insufficient   sufficient 
##         0.54         0.46
round(prop.table(table(test$coverage)),2)
## 
## insufficient   sufficient 
##         0.54         0.46

Since we want balanced data, let’s re-sample train with downSample()

train <- downSample(x = train[,-6], y = train$coverage, yname = "coverage")
prop.table(table(train$coverage))
## 
## insufficient   sufficient 
##          0.5          0.5

The proportion in train is now balanced: 50% insufficient and 50% sufficient.
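
For completeness, caret also provides upSample(), which balances the classes by replicating minority-class rows instead of discarding majority-class rows. The sketch below is not run here, since it would take the place of the downSample() call above rather than follow it:

# Sketch only: an alternative to downSample(), applied to the pre-downsampling split
# train <- upSample(x = train[, -6], y = train$coverage, yname = "coverage")
# prop.table(table(train$coverage))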

Modeling

In this project, I will create several models and evaluate them on test and submission.

The models we are going to use are:

  • NaiveBayes
  • Decision Tree
  • Random Forest

Since we have a default scoring threshold, I will store it in eval_threshold as a data.frame() so we can later compare it with the confusionMatrix() results of each model.

eval_threshold <- data.frame(Accuracy = 0.75,
                       Recall = 0.85,
                       Specificity = 0.7,
                       Precision = 0.75)
eval_threshold

Now, let’s do the modeling!

NaiveBayes

We’ll try NaiveBayes model using naiveBayes()

naive <- naiveBayes(coverage ~ src_area + day + hour + peak, train)
summary(naive)
##           Length Class  Mode     
## apriori   2      table  numeric  
## tables    4      -none- list     
## levels    2      -none- character
## isnumeric 4      -none- logical  
## call      4      -none- call
naive$tables
## $src_area
##               src_area
## Y                   sxk3      sxk8      sxk9
##   insufficient 0.3915254 0.1090395 0.4994350
##   sufficient   0.2757062 0.5943503 0.1299435
## 
## $day
##               day
## Y                 Sunday    Monday   Tuesday Wednesday  Thursday    Friday
##   insufficient 0.1463277 0.1344633 0.1446328 0.1254237 0.1418079 0.1700565
##   sufficient   0.1435028 0.1412429 0.1440678 0.1644068 0.1457627 0.1209040
##               day
## Y               Saturday
##   insufficient 0.1372881
##   sufficient   0.1401130
## 
## $hour
##               hour
## Y                       0          1          2          3          4
##   insufficient 0.03672316 0.04237288 0.02937853 0.03050847 0.02372881
##   sufficient   0.04576271 0.04237288 0.05254237 0.05197740 0.06779661
##               hour
## Y                       5          6          7          8          9
##   insufficient 0.02316384 0.02881356 0.04632768 0.06158192 0.05423729
##   sufficient   0.06723164 0.05423729 0.03672316 0.02033898 0.03050847
##               hour
## Y                      10         11         12         13         14
##   insufficient 0.03559322 0.03785311 0.04011299 0.04406780 0.04576271
##   sufficient   0.04745763 0.04011299 0.03841808 0.03446328 0.03389831
##               hour
## Y                      15         16         17         18         19
##   insufficient 0.04689266 0.04971751 0.05367232 0.05310734 0.05480226
##   sufficient   0.03502825 0.03220339 0.02768362 0.02824859 0.02824859
##               hour
## Y                      20         21         22         23
##   insufficient 0.04180791 0.04293785 0.03446328 0.04237288
##   sufficient   0.04180791 0.04406780 0.05480226 0.04406780
## 
## $peak
##               peak
## Y                nonpeak      peak
##   insufficient 0.3344633 0.6655367
##   sufficient   0.5248588 0.4751412
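
To make the role of these tables concrete, here is a minimal sketch (ignoring Laplace smoothing and numerical details) of how naiveBayes() combines the prior and the conditional probabilities for one arbitrary combination of predictor values:

# posterior is proportional to: prior * P(src_area | Y) * P(day | Y) * P(hour | Y) * P(peak | Y)
post <- naive$apriori / sum(naive$apriori) *
  naive$tables$src_area[, "sxk9"] *
  naive$tables$day[, "Friday"] *
  naive$tables$hour[, "18"] *
  naive$tables$peak[, "peak"]
round(post / sum(post), 3)  # normalised posterior for insufficient vs sufficient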

Now, let’s use the model to predict on the test and submission data.

pred_naive <- predict(object = naive, newdata = test)

submission1 <- submission %>% 
  mutate(coverage = predict(object = naive, newdata = submission))

# write the predictions out for the leaderboard submission
write.csv(submission1, "C:/Users/yohana/Documents/Kappa Data Analytics/DV & ML/Machine Learning/Capstone ML/data/submission1.csv", row.names = FALSE)

To see whether our model gives good results for Accuracy, Recall, Specificity and Precision, let’s check it with confusionMatrix()

naive_cm <- confusionMatrix(as.factor(pred_naive), as.factor(test$coverage), positive = "insufficient")
naive_cm
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient          286         54
##   sufficient             82        258
##                                              
##                Accuracy : 0.8                
##                  95% CI : (0.7679, 0.8294)   
##     No Information Rate : 0.5412             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6                
##                                              
##  Mcnemar's Test P-Value : 0.0206             
##                                              
##             Sensitivity : 0.7772             
##             Specificity : 0.8269             
##          Pos Pred Value : 0.8412             
##          Neg Pred Value : 0.7588             
##              Prevalence : 0.5412             
##          Detection Rate : 0.4206             
##    Detection Prevalence : 0.5000             
##       Balanced Accuracy : 0.8020             
##                                              
##        'Positive' Class : insufficient       
## 

Time to find out the Accuracy, Recall, Specificity and Precision from our naiveBayes model

eval_naive <- data.frame(Accuracy = naive_cm$overall[1],
                       Recall = naive_cm$byClass[1],
                       Specificity = naive_cm$byClass[2],
                       Precision = naive_cm$byClass[3])
round(eval_naive,2)

Compared to our threshold, the naiveBayes model scores higher on Accuracy, Specificity and Precision, but falls short on Recall, which at 0.78 is about 0.07 below our 0.85 threshold.
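
A quick way to see that gap is to subtract the threshold data frame from the rounded metrics; positive values beat the threshold, negative values fall short. A minimal sketch:

round(eval_naive, 2) - eval_threshold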

Now, let’s compare with our submission1.csv

Just as with our test data, it falls short on Recall, which is why the model didn’t pass the submission on the leaderboard (marked with a red box).

For our test data, the naiveBayes model over-performs on Specificity, while for the submission data the result is neither underfitting nor overfitting; it is simply that the Recall returned is lower than our threshold.

Decision Tree

Now, let’s model with Decision Tree using ctree

tree <- ctree(formula = coverage ~ src_area + day + hour + peak, data = train)
summary(tree)
##    Length Class      Mode
## 1  23     constparty list
## 2  19     constparty list
## 3  13     constparty list
## 4   5     constparty list
## 5   3     constparty list
## 6   1     constparty list
## 7   1     constparty list
## 8   1     constparty list
## 9   7     constparty list
## 10  5     constparty list
## 11  3     constparty list
## 12  1     constparty list
## 13  1     constparty list
## 14  1     constparty list
## 15  1     constparty list
## 16  5     constparty list
## 17  3     constparty list
## 18  1     constparty list
## 19  1     constparty list
## 20  1     constparty list
## 21  3     constparty list
## 22  1     constparty list
## 23  1     constparty list

Let’s check the Decision Tree.

tree
## 
## Model formula:
## coverage ~ src_area + day + hour + peak
## 
## Fitted party:
## [1] root
## |   [2] src_area in sxk3, sxk9
## |   |   [3] hour in 0, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23
## |   |   |   [4] hour in 0, 7, 11, 12, 15, 20, 21, 23
## |   |   |   |   [5] src_area in sxk3
## |   |   |   |   |   [6] day <= Thursday: insufficient (n = 280, err = 40.4%)
## |   |   |   |   |   [7] day > Thursday: insufficient (n = 110, err = 26.4%)
## |   |   |   |   [8] src_area in sxk9: insufficient (n = 368, err = 22.0%)
## |   |   |   [9] hour in 8, 9, 13, 14, 16, 17, 18, 19
## |   |   |   |   [10] day <= Monday
## |   |   |   |   |   [11] hour in 8, 9, 14, 19
## |   |   |   |   |   |   [12] src_area in sxk3: insufficient (n = 51, err = 19.6%)
## |   |   |   |   |   |   [13] src_area in sxk9: insufficient (n = 55, err = 1.8%)
## |   |   |   |   |   [14] hour in 13, 16, 17, 18: insufficient (n = 108, err = 29.6%)
## |   |   |   |   [15] day > Monday: insufficient (n = 527, err = 11.0%)
## |   |   [16] hour in 1, 2, 3, 4, 5, 6, 10, 22
## |   |   |   [17] src_area in sxk3
## |   |   |   |   [18] hour in 1, 10, 22: insufficient (n = 153, err = 49.7%)
## |   |   |   |   [19] hour in 2, 3, 4, 5, 6: sufficient (n = 257, err = 20.2%)
## |   |   |   [20] src_area in sxk9: insufficient (n = 386, err = 29.3%)
## |   [21] src_area in sxk8
## |   |   [22] hour in 0, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 20, 21, 22, 23: sufficient (n = 778, err = 8.9%)
## |   |   [23] hour in 1, 7, 8, 9, 15, 16, 17, 18, 19: sufficient (n = 467, err = 26.6%)
## 
## Number of inner nodes:    11
## Number of terminal nodes: 12

Here is the plot from our Decision Tree.

plot(tree, type = "simple")

Now, let’s use the model to predict on the test and submission data.

pred_tree <- predict(object = tree, newdata = test)

submission2 <- submission %>% 
  mutate(coverage = predict(object = tree, newdata = submission))

# write the predictions out for the leaderboard submission
write.csv(submission2, "C:/Users/yohana/Documents/Kappa Data Analytics/DV & ML/Machine Learning/Capstone ML/data/submission2.csv", row.names = FALSE)

Let’s see whether our Decision Tree model gives good results for Accuracy, Recall, Specificity and Precision using confusionMatrix()

tree_cm <- confusionMatrix(as.factor(pred_tree), as.factor(test$coverage), positive = "insufficient")
tree_cm
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient          317         71
##   sufficient             51        241
##                                              
##                Accuracy : 0.8206             
##                  95% CI : (0.7896, 0.8487)   
##     No Information Rate : 0.5412             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.637              
##                                              
##  Mcnemar's Test P-Value : 0.0854             
##                                              
##             Sensitivity : 0.8614             
##             Specificity : 0.7724             
##          Pos Pred Value : 0.8170             
##          Neg Pred Value : 0.8253             
##              Prevalence : 0.5412             
##          Detection Rate : 0.4662             
##    Detection Prevalence : 0.5706             
##       Balanced Accuracy : 0.8169             
##                                              
##        'Positive' Class : insufficient       
## 

Time to find out the Accuracy, Recall, Specificity and Precision from our Decision Tree model

eval_tree <- data.frame(Accuracy = tree_cm$overall[1],
                       Recall = tree_cm$byClass[1],
                       Specificity = tree_cm$byClass[2],
                       Precision = tree_cm$byClass[3])
round(eval_tree,2)

Compared to our threshold, the Decision Tree model gives better results in Accuracy, Recall, Specificity and Precision; every metric is above the threshold.

Now, let’s compare with our submission2.csv

The submission with the Decision Tree gives a good result, passing the threshold on the leaderboard and showing green boxes.

The Decision Tree model beats our threshold on every metric. It is the best fit for both test and submission, since no metric differs from the threshold by more than 10 percentage points.

Random Forest

Finally, we’ll try a Random Forest model.

set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)  # 5-fold CV, repeated 3 times
scotty_forest <- train(coverage ~ ., data = train, method = "rf", trControl = ctrl)
summary(scotty_forest)
##                 Length Class      Mode     
## call               4   -none-     call     
## type               1   -none-     character
## predicted       3540   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           7080   matrix     numeric  
## oob.times       3540   -none-     numeric  
## classes            2   -none-     character
## importance        33   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               3540   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            33   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              0   -none-     list

Here are the plot and explanation of our model.

plot(scotty_forest)

scotty_forest
## Random Forest 
## 
## 3540 samples
##    5 predictor
##    2 classes: 'insufficient', 'sufficient' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 2832, 2832, 2832, 2832, 2832, 2832, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7832392  0.5664783
##   17    0.7790960  0.5581921
##   33    0.7707156  0.5414313
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
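
Because the random forest is wrapped in a caret train object, we can also peek at which predictors it leans on; a minimal sketch using caret’s varImp() (output not shown here):

varImp(scotty_forest)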

Now, let’s use the model to predict on the test and submission data.

pred_forest <- predict(object = scotty_forest, newdata = test)

submission3 <- submission %>% 
   mutate(coverage = predict(object = scotty_forest, newdata = submission))

# write the predictions out for the leaderboard submission
write.csv(submission3, "C:/Users/yohana/Documents/Kappa Data Analytics/DV & ML/Machine Learning/Capstone ML/data/submission3.csv", row.names = FALSE)

To see whether our model gives good results for Accuracy, Recall, Specificity and Precision, let’s check it with confusionMatrix()

forest_cm <- confusionMatrix(as.factor(pred_forest), as.factor(test$coverage), positive = "insufficient")
forest_cm
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient          297         62
##   sufficient             71        250
##                                              
##                Accuracy : 0.8044             
##                  95% CI : (0.7726, 0.8336)   
##     No Information Rate : 0.5412             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.607              
##                                              
##  Mcnemar's Test P-Value : 0.4879             
##                                              
##             Sensitivity : 0.8071             
##             Specificity : 0.8013             
##          Pos Pred Value : 0.8273             
##          Neg Pred Value : 0.7788             
##              Prevalence : 0.5412             
##          Detection Rate : 0.4368             
##    Detection Prevalence : 0.5279             
##       Balanced Accuracy : 0.8042             
##                                              
##        'Positive' Class : insufficient       
## 
eval_forest <- data.frame(Accuracy = forest_cm$overall[1],
                       Recall = forest_cm$byClass[1],
                       Specificity = forest_cm$byClass[2],
                       Precision = forest_cm$byClass[3])
round(eval_forest,2)

Compared to our threshold, the Random Forest model scores higher on Accuracy, Specificity and Precision, but falls short on Recall, which at 0.81 is about 0.04 below our 0.85 threshold.

Now, let’s compare with our submission3.csv

Similar to the test result, it falls short on Recall, so the model didn’t pass the submission on the leaderboard.

On the test data, the Random Forest model is neither underfitting nor overfitting; only Recall is lower than the threshold. On the submission data, however, Recall falls further below the threshold while Specificity over-performs.

Visualizing Machine Learning Models with LIME

Local Interpretable Model-agnostic Explanations (LIME), used here through lime(), is a visualization technique that helps explain individual predictions. As the name implies, it is model-agnostic, so it can be applied to any supervised regression or classification model. Behind LIME lies the assumption that every complex model is approximately linear on a local scale, so it is possible to fit a simple model around a single observation that mimics how the global model behaves at that locality. The simple model can then be used to explain the predictions of the more complex model locally.

The generalized algorithm LIME applies is:

  1. Given an observation, permute it to create replicated feature data with slight value modifications.
  2. Compute a similarity distance measure between the original observation and the permuted observations.
  3. Apply the selected machine learning model to predict outcomes of the permuted data.
  4. Select m features that best describe the predicted outcomes.
  5. Fit a simple model to the permuted data, explaining the complex model outcome with the m features from the permuted data, weighted by their similarity to the original observation.
  6. Use the resulting feature weights to explain local behavior.

It is often important to understand a trained machine learning model on a global scale, and also to zoom into local regions of the data or the predictions and derive local explanations. Global interpretations help us understand the inputs and their entire modeled relationship with the prediction target, but they can be highly approximate in some cases. Local interpretations help us understand model predictions for a single row of data or a group of similar rows.

In this project, I will visualize just the first four observations.

NaiveBayes

Before we create the lime() explainer, let’s check the class of our naive model

class(naive)
## [1] "naiveBayes"

The class is naiveBayes, so we will write our model_type and predict_model helpers for the naiveBayes class

model_type.naiveBayes <- function(x){
  return("classification")
}

predict_model.naiveBayes <- function(x, newdata, type = "raw") {

    # return classification probabilities only   
    res <- predict(x, newdata, type = "raw") %>% as.data.frame()
    
    return(res)
}

Now, let’s create the explainer with lime() and the explanation with explain()

set.seed(123)
explainer <- lime(train %>% select(-coverage), naive)
explanation <- explain(test %>% select(-coverage) %>% head(4),
                       n_labels = 1,
                       explainer = explainer,
                       n_features = 3,
                       feature_select = "none")

It is time to visualize the interpretation

plot_features(explanation)

These values indicate that the naiveBayes model gives prediction probabilities of around 60%–80%, but the Explanation Fit is only around 0.12–0.19, meaning LIME can explain only about 12%–19% of this model’s behavior for these cases. In cases 1–3 all predictors support the insufficient label, while in case 4 src_area contradicts the sufficient label.

Decision Tree

Before we create the lime() explainer, let’s check the class of our tree model

class(tree)
## [1] "constparty" "party"

The classes are constparty and party, so we will write our model_type and predict_model helpers for the party class

model_type.party <- function(x){
  return("classification")
}

predict_model.party <- function(x, newdata, type = "prob") {

    # return classification probabilities only   
    res <- predict(x, newdata, type = "prob") %>% as.data.frame()
    
    return(res)
}

Now, let’s create the explainer with lime() and the explanation with explain()

set.seed(123)
explainer2 <- lime(train %>% select(-coverage), tree)
explanation2 <- explain(test %>% select(-coverage) %>% head(4),
                        n_labels = 1,
                        explainer = explainer2,
                        n_features = 5,
                        feature_select = "none")

It is time to visualize the interpretation

plot_features(explanation2)

These values show that the Decision Tree model gives prediction probabilities of around 50%–80%, but the Explanation Fit is only around 0.06–0.09, meaning LIME can explain only about 6%–9% of this model’s behavior for these cases. Across the cases, day = Sunday contradicts the insufficient label the most, while src_area = sxk3 supports it the most.

Random Forest

Before we create the lime() explainer, let’s check the class of our scotty_forest model

class(scotty_forest)
## [1] "train"         "train.formula"

The classes are train and train.formula, so we will write our model_type and predict_model helpers for the train class

model_type.train <- function(x){
  return("classification")
}

predict_model.train <- function(x, newdata, type = "prob") {

    # return classification probabilities only   
    res <- predict(x, newdata, type = "prob") %>% as.data.frame()
    
    return(res)
}

Now, let’s create the explainer with lime() and the explanation with explain()

set.seed(123)
explainer3 <- lime(train %>% select(-coverage), scotty_forest)
explanation3 <- explain(test %>% select(-coverage) %>% head(4),
                        n_labels = 1,
                        explainer = explainer3,
                        n_features = 3,
                        feature_select = "none")

It is time to visualize the interpretation

plot_features(explanation3)

These values show that the Random Forest model gives prediction probabilities of around 50%–80%, but the Explanation Fit is only around 0.15–0.22, meaning LIME can explain only about 15%–22% of this model’s behavior for these cases. In cases 1–3 all predictors support the insufficient label, while in case 4 src_area = sxk3 contradicts the sufficient label.
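
As a side note, the lime package also ships plot_explanations(), which summarises the same explanations as a heatmap rather than per-case bars; a minimal sketch (plot not shown here):

plot_explanations(explanation3)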

Model Comparison

Let’s compare our models’ metric results

eval_total <- rbind(eval_threshold, round(eval_naive,2), round(eval_tree,2), round(eval_forest,2))
rownames(eval_total) <- c("Threshold", "NaiveBayes", "Decision Tree", "Random Forest")

eval_total

Since every model beats the threshold on Accuracy, Specificity and Precision, let’s just compare the Recall results.

eval_total %>% 
  select(Recall)

Our objective in this project is to create a model that can predict driver insufficiency based on time and area, so I use Recall to catch as many insufficient events as possible. The highest Recall comes from the Decision Tree model, at 86%.
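
To make that selection explicit in code, here is a small sketch (relying on the row names set above):

eval_total %>% 
  rownames_to_column("model") %>%   # turn the model names into a column
  filter(model != "Threshold") %>%  # drop the reference row
  arrange(desc(Recall))             # Decision Tree should come out on top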

Conclusion

  1. Is your goal achieved?

Since I wanted to predict the sufficiency and insufficiency of Scotty’s drivers by area and time, yes, the goal was met.

  2. Can the problem be solved by machine learning?

Yes, it can be.

  3. What model did you use and how is its performance?

Of all the models, the Decision Tree gives the best result: not only does every metric beat the threshold, it is also the best-fitting model, with no metric differing from the threshold by more than 10 percentage points.