Background

Scotty is the fastest-growing Turkish-born technology start-up, serving three verticals in İstanbul: ride-sharing, food delivery, and courier delivery. They help thousands of people and organizations save time and be happy, and they would love to keep making them happy.

Problem

Scotty was founded in 2017 and is headquartered in Istanbul. They already have more than 11K drivers, more than 1 million users, and 3 million completed transactions. In some regions and at some times, there were not enough drivers. Fortunately, we know that we can use a classification model to predict which regions and times are at risk of this “no drivers” problem.

They provided us with a real-time transaction dataset. With this dataset, we are going to help them solve a classification problem in order to improve their business processes.

Load and inspect the data:

Train Data
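A minimal sketch of how the data might be loaded and inspected; the file path and the read.csv options are assumptions:

library(dplyr)

# read the raw transaction data; path and options are assumed
scotty_train <- read.csv("data/scotty-train.csv", stringsAsFactors = TRUE)

# count missing values per column, then inspect the structure
colSums(is.na(scotty_train))
glimpse(scotty_train)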

##                 id            trip_id          driver_id           rider_id 
##                  0              14901              14900                  0 
##         start_time            src_lat            src_lon           src_area 
##                  0                  0                  0                  0 
##       src_sub_area           dest_lat           dest_lon          dest_area 
##                  0                  0                  0                  0 
##      dest_sub_area           distance             status confirmed_time_sec 
##                  0                  0                  0                  0
## Rows: 229,532
## Columns: 16
## $ id                 <fct> 59d005e1ffcfa261708ce9cd, 59d0066a3d32b861760d4a01…
## $ trip_id            <fct> 59d005e9cb564761a8fe5d3e, 59d00678ffcfa261708ceb37…
## $ driver_id          <fct> 59a892c5568be44b2734f276, 59a135565e88a24b11f11994…
## $ rider_id           <fct> 59ad2d6efba75a581666b506, 59ce930f3d32b861760a4617…
## $ start_time         <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-1…
## $ src_lat            <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon            <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area           <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area       <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat           <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon           <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area          <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area      <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance           <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status             <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <int> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…

From the output above we can see that there are some missing values in trip_id and driver_id, but that looks fine, so I will keep them as they are. We also need to convert start_time to POSIXct; everything else is good :)

The dataset includes information about:

  • id: Transaction id
  • trip_id: Trip id
  • driver_id: Driver id
  • rider_id: Rider id
  • start_time: Request start time
  • src_lat: Request source latitude
  • src_lon: Request source longitude
  • src_area: Request source area
  • src_sub_area: Request source sub-area
  • dest_lat: Requested destination latitude
  • dest_lon: Requested destination longitude
  • dest_area: Requested destination area
  • dest_sub_area: Requested destination sub-area
  • distance: Trip distance (in KM)
  • status: Trip status (every status is considered demand)
  • confirmed_time_sec: Time difference from request to confirmation (in seconds)

Test Data

## Rows: 504
## Columns: 3
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ datetime <fct> 2017-12-03T00:00:00Z, 2017-12-03T01:00:00Z, 2017-12-03T02:00…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

From the output above we can see that we only need to convert datetime to POSIXct.

## src_area datetime coverage 
##        0        0      504

In order to fulfill the requirement, we need to put some extra effort into exploring this test data.

We found the date range and source areas as written below:

  1. 2017-12-03, 00:00:00 to 2017-12-09, 23:00:00.

  2. Areas sxk3, sxk8, and sxk9.
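A quick way to check this (object name assumed; datetime is parsed on the fly here since the conversion happens in the next section):

library(lubridate)

# date range and distinct source areas in the test data
range(ymd_hms(scotty_test$datetime))
levels(scotty_test$src_area)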

Data Preprocessing

As we can see, some of the data types are still inappropriate; let's fix them.

Train Data
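A minimal sketch of the conversion, using lubridate's ymd_hms() (lubridate was loaded above; object names follow the load step):

# parse the ISO-8601 timestamp into POSIXct
scotty_train <- scotty_train %>%
  mutate(start_time = ymd_hms(start_time))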

Test Data
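And likewise for the test data (object name assumed):

# same conversion for the hourly test timestamps
scotty_test <- scotty_test %>%
  mutate(datetime = ymd_hms(datetime))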

It looks like the timestamps differ: the datetime in TRAIN DATA is recorded down to the second, while the datetime in TEST DATA is at an hourly interval. Let's make them equal at a one-hour interval.

Floor the time in TRAIN DATA down to the hour:
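A sketch with lubridate's floor_date():

# truncate each start_time down to its containing hour
scotty_train <- scotty_train %>%
  mutate(datetime = floor_date(start_time, unit = "hour"))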

Put simply, if we want to determine whether the coverage status is sufficient or insufficient, we first need to observe whether the order status is confirmed or nodrivers. From the data we can see that both statuses occur within the same hour. I am not entirely sure this happens every hour, but let's assume it does, and group the data (sketched below) so it won't distract from the preprocessing.
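A sketch of the aggregation; the grouping variables are inferred from the message below, and the count column name is an assumption:

# count requests per area, status, and hour
scotty_agg <- scotty_train %>%
  group_by(src_area, status, datetime) %>%
  summarise(n = n())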

## `summarise()` regrouping output by 'src_area', 'status' (override with `.groups` argument)

After doing this, we found that the time series is not at a regular interval; some hours, such as 02:00:00, 05:00:00, and 06:00:00, are missing from the first 6 rows, so let's make it regular. Also, going back to an earlier finding: since the TEST data has 3 different source areas, let's split the TRAIN data into three separate data frames by area so we can observe each scope properly.
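One way to regularize the series is tidyr's complete() with full_seq(); splitting by area is then a simple filter(). Object names are assumptions:

library(tidyr)

# add a row for every area-status-hour combination in the date range
scotty_agg <- scotty_agg %>%
  ungroup() %>%
  complete(src_area, status, datetime = full_seq(datetime, period = 3600))

# one data frame per area, matching the three areas in the test data
scotty_sxk3 <- scotty_agg %>% filter(src_area == "sxk3")
scotty_sxk8 <- scotty_agg %>% filter(src_area == "sxk8")
scotty_sxk9 <- scotty_agg %>% filter(src_area == "sxk9")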

To be able to count the confirmed and nodrivers statuses, we first need to change NA values to 0. Then we create a new column showing whether coverage at that hour is sufficient or insufficient.
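A sketch of that step; the coverage rule (an hour is insufficient whenever any nodrivers request occurred) is an assumption:

# one column per status, NA counts treated as 0
scotty_clean <- scotty_agg %>%
  mutate(n = replace_na(n, 0)) %>%
  pivot_wider(names_from = status, values_from = n) %>%
  mutate(coverage = factor(ifelse(nodrivers > 0,
                                  "insufficient", "sufficient")))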

Exploratory Data Analysis

Let's explore the target distribution by looking at:

Proportion Overall

Proportion of the classes of the target variable overall
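A one-liner for the class proportions (object name assumed):

# overall proportion of each coverage class
prop.table(table(scotty_clean$coverage))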

## 
## insufficient   sufficient 
##    0.5410053    0.4589947

Pattern & Correlation

Explore the relationship between the target and the features, and look for any pattern or correlation:

  1. Strip the time from the date, as sketched below
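A sketch using lubridate's hour() and wday(); the hourly and dayname column names reappear in the model summary later:

# hour of day as a factor, day of week as an ordered factor
scotty_clean <- scotty_clean %>%
  mutate(hourly = as.factor(hour(datetime)),
         dayname = wday(datetime, label = TRUE))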
## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area' (override with `.groups` argument)

## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area' (override with `.groups` argument)

## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area' (override with `.groups` argument)

Based on the bar-chart plot above:

  • Areas sxk3 and sxk9 have more insufficient days than sufficient days

  • Area sxk8 has more sufficient days than insufficient days

  • Area sxk3 has more insufficient days from 7 am till 12 am

  • Area sxk9 has more insufficient days around the clock (12 am to 11 pm)

  • Area sxk8 has quite a high number of insufficient days from 7–9 am and 3–7 pm

Heatmap

Use a heatmap of time (hour) versus weekday, grouped by area, to find the pattern.
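A condensed sketch of the heatmap, assuming ggplot2 (the original pipeline emits the two summarise() messages below):

library(ggplot2)

# share of insufficient hours per weekday-hour cell, one panel per area
scotty_clean %>%
  group_by(src_area, hourly, dayname) %>%
  summarise(p_insufficient = mean(coverage == "insufficient")) %>%
  ggplot(aes(x = hourly, y = dayname, fill = p_insufficient)) +
  geom_tile() +
  facet_wrap(~ src_area, ncol = 1)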

## `summarise()` regrouping output by 'src_area', 'hourly', 'dayname' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)

Based on the heatmap above:

  • Area sxk3 has insufficient coverage on weekends (Saturday and Sunday) and weekdays (Monday to Friday) from around 7 am to 12 am

  • Area sxk8 has insufficient coverage only on weekdays, from 8 am to 9 am (Monday to Thursday) and 3 pm to 7 pm (Friday)

  • Area sxk9 has insufficient coverage on both weekdays and weekends around the clock (12 am to 11 pm)

Model Fitting & Evaluation

Cross Validation

  • Drop some unused columns

  • Split the training data into train and validation sets, as sketched below
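A minimal sketch of the split; the 80/20 proportion and the seed are assumptions:

set.seed(100)  # assumed seed, for reproducibility

# hold out 20% of the hourly observations for validation
idx <- sample(nrow(scotty_clean), size = floor(0.8 * nrow(scotty_clean)))
scotty_train_split <- scotty_clean[idx, ]
scotty_validation <- scotty_clean[-idx, ]

# class proportions in the training split
prop.table(table(scotty_train_split$coverage))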
## 
## insufficient   sufficient 
##    0.5410053    0.4589947

We need to balance the two classes; a good practice here is down-sampling, e.g. with caret's downSample(), as sketched below.
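A sketch with downSample(); the input object name is assumed, while the output name matches the glm call further below:

library(caret)

# drop rows from the majority class until both classes are equal
scotty_train_down <- downSample(x = scotty_train_split %>% select(-coverage),
                                y = scotty_train_split$coverage,
                                yname = "coverage")

prop.table(table(scotty_train_down$coverage))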

## 
## insufficient   sufficient 
##          0.5          0.5

It’s balanced now; let’s move on to the next step :)

Model Fitting

Now it’s the fun part: finding the right model for our data. I always strive to explain things in a human way that can be easily understood.

There are 2 models that I am going to use:

  • Logistic Regression

  • Decision Tree

Logistic regression is a kind of classification method whose concept is similar to linear regression. However, binary logistic regression does not calculate the target value directly; instead, it calculates the probability of each target class.

A decision tree is a fairly simple model, yet its performance is robust and powerful, and it is easily interpretable. The decision tree model is the basis of other tree-based models such as random forest, which is well known as a robust predictor. The model builds a tree of decisions based on patterns in the data, and the result is visualized as a flowchart, from which it is easy to see how the model makes a prediction.

Other characteristics of decision trees:

  • They do not assume that the predictor variables are independent

  • They can handle multicollinearity and outliers

Logistic Regression
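The model is fitted with glm(), as the Call line in the summary below confirms; only the object name is assumed:

# binomial logistic regression on all remaining predictors
model_glm <- glm(coverage ~ ., family = "binomial", data = scotty_train_down)
summary(model_glm)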

## 
## Call:
## glm(formula = coverage ~ ., family = "binomial", data = scotty_train_down)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.68191  -0.73900   0.00614   0.69223   2.86488  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.201639   0.229410  -0.879 0.379430    
## src_areasxk8  2.423088   0.113839  21.285  < 2e-16 ***
## src_areasxk9 -1.163636   0.107249 -10.850  < 2e-16 ***
## hourly1       0.205163   0.315056   0.651 0.514920    
## hourly2       0.781255   0.313253   2.494 0.012631 *  
## hourly3       0.501428   0.308420   1.626 0.103993    
## hourly4       1.348423   0.309672   4.354 1.33e-05 ***
## hourly5       1.124973   0.312085   3.605 0.000313 ***
## hourly6       0.749114   0.304528   2.460 0.013897 *  
## hourly7      -0.695338   0.313962  -2.215 0.026779 *  
## hourly8      -2.132189   0.331128  -6.439 1.20e-10 ***
## hourly9      -1.062422   0.318258  -3.338 0.000843 ***
## hourly10     -0.106276   0.307984  -0.345 0.730041    
## hourly11      0.026123   0.311322   0.084 0.933128    
## hourly12     -0.567000   0.309009  -1.835 0.066521 .  
## hourly13     -0.760945   0.313946  -2.424 0.015359 *  
## hourly14     -0.622568   0.311359  -2.000 0.045552 *  
## hourly15     -0.821045   0.319644  -2.569 0.010210 *  
## hourly16     -1.025553   0.321629  -3.189 0.001430 ** 
## hourly17     -1.206824   0.322380  -3.743 0.000181 ***
## hourly18     -1.291317   0.319109  -4.047 5.20e-05 ***
## hourly19     -1.077813   0.317658  -3.393 0.000691 ***
## hourly20      0.022152   0.308201   0.072 0.942702    
## hourly21     -0.064482   0.300702  -0.214 0.830205    
## hourly22      0.639626   0.301692   2.120 0.033995 *  
## hourly23      0.047613   0.307533   0.155 0.876962    
## dayname.L    -0.187640   0.118142  -1.588 0.112228    
## dayname.Q    -0.260470   0.117547  -2.216 0.026700 *  
## dayname.C     0.195982   0.117362   1.670 0.094941 .  
## dayname^4     0.571191   0.118673   4.813 1.49e-06 ***
## dayname^5     0.264233   0.116753   2.263 0.023625 *  
## dayname^6     0.006165   0.116049   0.053 0.957634    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4597  on 3315  degrees of freedom
## Residual deviance: 3142  on 3284  degrees of freedom
## AIC: 3206
## 
## Number of Fisher Scoring iterations: 5

Let's pick a variable at random and interpret its value.
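Exponentiating a coefficient turns its log of odds into an odds ratio:

# odds ratio for hour 4 relative to the baseline hour 0
exp(coef(model_glm)["hourly4"])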

##  hourly4 
## 3.851349

This means that at 4:00, the odds of sufficient coverage are about 3.85 times those at the baseline hour (0:00), with other conditions the same.

Check whether there is a problem with the model, e.g. multicollinearity.
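A common check is the variance inflation factor, assuming the car package (GVIF is reported because the predictors are factors):

library(car)

# (generalized) variance inflation factors for the fitted model
vif(model_glm)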

##              GVIF Df GVIF^(1/(2*Df))
## src_area 1.207358  2        1.048236
## hourly   1.209117 23        1.004137
## dayname  1.038939  6        1.003188

The no-multicollinearity assumption has not been violated (all (G)VIF values are well below 10).

No perfect separation detected.

Predict

In predict(), there is a type argument with two options, link and response:

  • type = "link" returns the log of odds of the prediction (this is the default)
  • type = "response" returns the prediction as a probability
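A sketch of predicting and thresholding on the validation set (object names assumed; for a factor response, glm models the probability of the second level, here sufficient):

# predicted probability of sufficient coverage
prob_validation <- predict(model_glm, newdata = scotty_validation,
                           type = "response")

# classify with a 0.5 threshold
pred_validation <- factor(ifelse(prob_validation > 0.5,
                                 "sufficient", "insufficient"),
                          levels = c("insufficient", "sufficient"))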

Train Dataset

Validation Dataset (held out from the overall training data)

  • Try with the Test Dataset

Model Evaluation

After making predictions with the model, there is always a chance of error.

In classification, we can use a confusion matrix to evaluate the model:

  • TP (True Positive) = we predicted the positive class and it is correct

  • TN (True Negative) = we predicted the negative class and it is correct

  • FP (False Positive) = we predicted the positive class and it is wrong

  • FN (False Negative) = we predicted the negative class and it is wrong
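A sketch with caret's confusionMatrix(), setting insufficient as the positive class as in the outputs below:

# compare predictions against the true coverage labels
confusionMatrix(data = pred_validation,
                reference = scotty_validation$coverage,
                positive = "insufficient")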

Evaluate the Train Dataset

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient         1374        410
##   sufficient            284       1248
##                                           
##                Accuracy : 0.7907          
##                  95% CI : (0.7765, 0.8044)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5814          
##                                           
##  Mcnemar's Test P-Value : 2.086e-06       
##                                           
##             Sensitivity : 0.8287          
##             Specificity : 0.7527          
##          Pos Pred Value : 0.7702          
##          Neg Pred Value : 0.8146          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4144          
##    Detection Prevalence : 0.5380          
##       Balanced Accuracy : 0.7907          
##                                           
##        'Positive' Class : insufficient    
## 

Evaluate the Validation Dataset

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient          396        113
##   sufficient             88        311
##                                           
##                Accuracy : 0.7786          
##                  95% CI : (0.7502, 0.8052)
##     No Information Rate : 0.533           
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5537          
##                                           
##  Mcnemar's Test P-Value : 0.09049         
##                                           
##             Sensitivity : 0.8182          
##             Specificity : 0.7335          
##          Pos Pred Value : 0.7780          
##          Neg Pred Value : 0.7794          
##              Prevalence : 0.5330          
##          Detection Rate : 0.4361          
##    Detection Prevalence : 0.5606          
##       Balanced Accuracy : 0.7758          
##                                           
##        'Positive' Class : insufficient    
## 

  • Accuracy: how well the model predicts the target classes overall

  • Sensitivity/Recall: how well the model predicts the positive class

  • Specificity: how well the model predicts the negative class

  • Pos Pred Value/Precision: how precise the model is when predicting the positive class

As shown above, the results on the validation data and the train data are quite similar.

We can adjust the levels of accuracy, recall, precision, and specificity by tuning the threshold (that 0.5 value), depending on the business case and the level of performance needed on each metric.

  • Recall (sensitivity) is the key metric for this business case, since Scotty wants to keep wrong predictions of insufficient versus sufficient driver coverage at certain times as low as possible

Decision Tree

As described earlier, the decision tree is a fairly simple model, yet robust, powerful, and easily interpretable, and it is the basis of other tree-based models such as the famous random forest. The tree derives its rules from patterns in the data, and the result is visualized as a flowchart, so it is easy to see how the model arrives at a prediction. A sketch of the fitting step follows below.
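The document does not name the tree library; a minimal sketch assuming rpart and rpart.plot:

library(rpart)
library(rpart.plot)

# grow a classification tree on the down-sampled training data
model_tree <- rpart(coverage ~ ., data = scotty_train_down, method = "class")

# draw the tree as a flowchart
rpart.plot(model_tree)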

From the decision tree above, some of the interpretations are:

  1. If src_area is sxk8 and hourly is 13:00:00, then the coverage status is mostly sufficient

  2. If src_area is sxk3 or sxk9, hourly is 7, 8, 9, 12, 13, 14, 17, 18, or 19, and dayname is after Monday, the coverage status is mostly insufficient

The prediction process is pretty similar to the logistic regression model's. On the Train Dataset

On the Validation Dataset

On the Test Dataset

Model Evaluation

Evaluate the Train Dataset

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient         1418        436
##   sufficient            240       1222
##                                          
##                Accuracy : 0.7961         
##                  95% CI : (0.782, 0.8097)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5923         
##                                          
##  Mcnemar's Test P-Value : 6.382e-14      
##                                          
##             Sensitivity : 0.8552         
##             Specificity : 0.7370         
##          Pos Pred Value : 0.7648         
##          Neg Pred Value : 0.8358         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4276         
##    Detection Prevalence : 0.5591         
##       Balanced Accuracy : 0.7961         
##                                          
##        'Positive' Class : insufficient   
## 

Evaluate the Validation Dataset

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient          408        118
##   sufficient             76        306
##                                           
##                Accuracy : 0.7863          
##                  95% CI : (0.7582, 0.8126)
##     No Information Rate : 0.533           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5682          
##                                           
##  Mcnemar's Test P-Value : 0.003244        
##                                           
##             Sensitivity : 0.8430          
##             Specificity : 0.7217          
##          Pos Pred Value : 0.7757          
##          Neg Pred Value : 0.8010          
##              Prevalence : 0.5330          
##          Detection Rate : 0.4493          
##    Detection Prevalence : 0.5793          
##       Balanced Accuracy : 0.7823          
##                                           
##        'Positive' Class : insufficient    
## 

As shown above, the results on the validation data versus the train data are still acceptable, with the note that they are better than the logistic regression model :).

As with the logistic regression method, we can adjust the levels of accuracy, recall, precision, and specificity by tuning the threshold (that 0.5 value), depending on the business case and the level of performance needed on each metric.

It’s acceptable to use the decision tree model with a 0.5 threshold.

LIME

Let's try another method: LIME, short for Local Interpretable Model-agnostic Explanations.

It’s an explanation technique whose idea is reflected in each part of the name. Local refers to local fidelity, i.e., we want the explanation to really reflect the behaviour of the classifier “around” the instance being predicted. This explanation is useless unless it is interpretable, that is, unless a human can make sense of it. LIME is able to explain any model without needing to ‘peek’ into it, so it is model-agnostic. We now give a high-level overview of how LIME works.

First, a word about interpretability. Some classifiers use representations that are not intuitive to users at all (e.g. word embeddings). Lime explains those classifiers in terms of interpretable representations (words), even if that is not the representation actually used by the classifier. Further, lime takes human limitations into account: i.e. the explanations are not too long.

In order to be model-agnostic, lime can’t peek into the model. In order to figure out what parts of the interpretable input are contributing to the prediction, we perturb the input around its neighborhood and see how the model’s predictions behave. We then weight these perturbed data points by their proximity to the original example, and learn an interpretable model on those and the associated predictions. For example, if we are trying to explain the prediction for the sentence “I hate this movie”, we will perturb the sentence and get predictions on sentences such as “I hate movie”, “I this movie”, “I movie”, “I hate”, etc. Even if the original classifier takes many more words into account globally, it is reasonable to expect that around this example only the word “hate” will be relevant. Note that if the classifier uses some uninterpretable representation such as word embeddings, this still works: we just represent the perturbed sentences with word embeddings, and the explanation will still be in terms of words such as “hate” or “movie”.

Let's return to our case, shall we? :)

I am going to use the randomForest method for the model, implementing k-fold cross-validation (k = 5) repeated 3 times, as sketched below.
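A sketch with caret's train() and repeated cross-validation (object names assumed):

# 5-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# random forest via caret
model_rf <- train(coverage ~ ., data = scotty_train_down,
                  method = "rf", trControl = ctrl)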

Trying with LIME
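A sketch with the lime package; the plots in the document use a yes/no label, which suggests the target was recoded there, while this sketch keeps the coverage labels (observation selection and n_features are assumptions):

library(lime)

# build an explainer from the training predictors and the fitted model
explainer <- lime(scotty_train_down %>% select(-coverage), model_rf)

# explain the first 4 validation observations
explanation <- explain(head(scotty_validation %>% select(-coverage), 4),
                       explainer, n_labels = 1, n_features = 4)

plot_features(explanation)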

The text Label: yes shows which value of the target variable is being explained.

The Probability shows the probability of the observation belonging to the label yes.

We can see that all of the observations have a low probability, so the model would predict them as no instead of yes.

LIME can be used to analyze, for each observation, which predictor variables the model relied on.

From the LIME results:

Observation 1: most important variable = sxk8, least important = hourly 9

Observation 4: most important variable = sxk8, least important = Wednesday

Conclusion

  1. One or several machine learning models (with some parameter tuning) can produce quite good predictions, given sufficient data and treatment.

  2. This problem can be solved by machine learning. For other problems, we first need to be sure what point/target we need to achieve.

  3. Two models are used in this case because they are simple to apply and interpretable. Using LIME adds a different perspective on the prediction results.

  4. This process or way of thinking can be applied to a wide range of business cases; some may target customer satisfaction as their top goal while lowering the chance of customer churn caused by a decrease in customers' trust.