Introduction

In this document, we will look at Scotty’s dataset and try to build a machine learning model to predict driver insufficiency for Scotty in a given area, date, and time.

Wait, What’s Scotty?

Scotty is a ride-sharing business that operates in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in traveling through traffic; the app even references Star Trek’s “beam me up” in its order buttons.

Scotty turned out to be a very popular service in Turkey! Demand began to overload in some regions at some times, and there were not enough drivers at those times and places. Fortunately, we can use a classification model to predict which regions and times are risky enough to have this “no drivers” problem.

Data Input

Our Scotty dataset comes with several variables.

  • id: Transaction id
  • trip_id: Trip id
  • driver_id: Driver id
  • rider_id: Rider id
  • start_time: Request start time
  • src_lat: Request source latitude
  • src_lon: Request source longitude
  • src_area: Request source area
  • src_sub_area: Request source sub-area
  • dest_lat: Requested destination latitude
  • dest_lon: Requested destination longitude
  • dest_area: Requested destination area
  • dest_sub_area: Requested destination sub-area
  • distance: Trip distance (in KM)
  • status: Trip status (all status considered as a demand)
  • confirmed_time_sec: Time difference from request to confirmation (in seconds)

Data Test Preview

From the look of it, we only have two predictor variables to work with: src_area and the datetime.

Data Wrangling

I think one of the first challenges in our data wrangling is to determine how we judge driver sufficiency. After a discussion with my peers, we concluded that a nodrivers status means that there are insufficient drivers.

The second challenge is to pad our time intervals. We need this step because there are some hours with no orders, which is a problem for our model. We used the pad function from the padr library to solve that. We also judge that when there is no order/demand, there are sufficient drivers.

First, we need to check the min and max timestamps to use as the boundaries for our pad function.


Our data wrangling steps (a code sketch follows the list below):

  • Select predictor variables
  • Remove the minutes from our datetime with the floor_date function from the lubridate package
  • Group by area and hour, check for the nodrivers status, and encode the result as “sufficient” and “insufficient” factor levels
  • Group the data by area, pad the time variable, and replace the missing coverage with “sufficient”
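
Below is a minimal sketch of how these steps might look with dplyr, lubridate, and padr. The raw data frame name (scotty) and the derived column names (datetime, coverage) are assumptions, since the original code chunk is not shown.

library(dplyr)
library(lubridate)
library(padr)

# Boundaries for pad(): the overall min and max order hour (assumed names)
range_start <- floor_date(min(scotty$start_time), unit = "hour")
range_end   <- floor_date(max(scotty$start_time), unit = "hour")

scotty_hourly <- scotty %>%
  select(src_area, start_time, status) %>%
  # drop minutes/seconds so every order falls on an hourly timestamp
  mutate(datetime = floor_date(start_time, unit = "hour")) %>%
  # an hour is "insufficient" when at least one request got the nodrivers status
  group_by(src_area, datetime) %>%
  summarise(coverage = ifelse(any(status == "nodrivers"),
                              "insufficient", "sufficient"),
            .groups = "drop") %>%
  # fill in the missing hours per area; no demand is treated as sufficient
  pad(interval = "hour", group = "src_area",
      start_val = range_start, end_val = range_end) %>%
  mutate(coverage = ifelse(is.na(coverage), "sufficient", coverage),
         coverage = factor(coverage, levels = c("insufficient", "sufficient")))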

EDA

At this stage, we would like to measure our target class proportion using ggplot from the ggplot2 package.

Proportion of target by area

Below, we have a table with the target proportion divided by src_area. It’s a bit imbalanced for regions sxk8 and sxk9, which may affect our model’s predictions for those areas.

Plot of our target proportion per area.

Heatmap

Heatmap of Scotty Availability Based on Region

Below is our heatmap of Scotty’s availability/sufficiency rate divided by region. There are some blank tiles, which mean that at those particular times in those areas there has never been a sufficient-drivers condition. It’s especially bad for area sxk9, where there are a lot of blank tiles and very little blue, meaning the sufficiency rate in sxk9 is very low at all times.

Meanwhile, in area sxk8 the sufficiency rate is very high; the lowest seems to be on Friday between 3 PM and 6 PM.
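
As an illustration, a heatmap like the one described above could be built with geom_tile from ggplot2. This is only a sketch assuming the padded hourly data frame scotty_hourly from the wrangling sketch earlier; the original plotting code is not shown.

library(dplyr)
library(lubridate)
library(ggplot2)

scotty_hourly %>%
  mutate(weekday = wday(datetime, label = TRUE),
         hour    = hour(datetime)) %>%
  group_by(src_area, weekday, hour) %>%
  # share of hours in which drivers were sufficient, per area/weekday/hour cell
  summarise(sufficiency_rate = mean(coverage == "sufficient"), .groups = "drop") %>%
  ggplot(aes(x = hour, y = weekday, fill = sufficiency_rate)) +
  geom_tile() +
  facet_wrap(~ src_area) +
  labs(title = "Scotty Availability Based on Region",
       x = "Hour of day", y = "Day of week", fill = "Sufficiency rate")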

Data Preprocess

This is our final data wrangling step before modelling, and where we tweak our predictor variables. We purposely do this just before modelling so that we don’t mess with the data used for EDA.

We have tried and tweaked our features, and found the current set to work the best for the models we’re using later on.

What we do here:

  • Create hour and weekday columns out of our datetime
  • Set our target to 1 and 0 instead of “sufficient” and “insufficient”
  • Normalize our hour and weekday variables instead of treating them as factors
  • Remove the datetime column


We have tried using hour and weekday as factors, but our models’ results improved the most when we normalized them.
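
A rough sketch of this preprocessing, assuming the padded data frame scotty_hourly from earlier and min-max normalization for the hour and weekday columns (the exact normalization used is not shown in the original):

library(dplyr)
library(lubridate)

scotty_clean <- scotty_hourly %>%
  mutate(
    hour    = hour(datetime),    # 0-23
    weekday = wday(datetime),    # 1-7
    # target as a 0/1 factor instead of "insufficient"/"sufficient"
    coverage = factor(ifelse(coverage == "sufficient", 1, 0), levels = c(0, 1)),
    # scale hour and weekday to the 0-1 range instead of keeping them as factors
    hour    = (hour - min(hour)) / (max(hour) - min(hour)),
    weekday = (weekday - min(weekday)) / (max(weekday) - min(weekday))
  ) %>%
  select(-datetime)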

Cross Validation

Our cross-validation uses an 85-15 ratio for our training and validation datasets.
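
A sketch of the split using rsample’s initial_split(); the seed and the stratification on the target are assumptions:

library(rsample)

set.seed(123)
splits     <- initial_split(scotty_clean, prop = 0.85, strata = coverage)
data_train <- training(splits)
data_valid <- testing(splits)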

Balancing our dataset

To further improve our model, we balance our target class proportion with the SMOTE method to get a 50-50 ratio between the target classes “sufficient” and “insufficient”.

## 
##   0   1 
## 0.5 0.5
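
The original post does not show which SMOTE implementation was used; one possibility is the standalone smote() helper from the themis package, sketched below. Note that it expects numeric predictors, so src_area would need to be dummy-encoded first.

library(themis)

set.seed(123)
# over_ratio = 1 upsamples the minority class until the two classes are 50-50
data_train_smote <- smote(data_train, var = "coverage", k = 5, over_ratio = 1)

# reproduce the class proportion check above
prop.table(table(data_train_smote$coverage))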

Recipe Preparation for Tidymodels

Yes, we are using the Scotty training data, the one with the 54-46 ratio, even though we did SMOTE just before this. But from what I have tried, our tidymodels models perform better with the original training data than with the SMOTE data.
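
A minimal recipe sketch for the tidymodels workflows, assuming the training data frame data_train with target coverage; the actual steps in the original recipe are not shown:

library(recipes)

rec <- recipe(coverage ~ ., data = data_train) %>%
  step_dummy(all_nominal_predictors())   # dummy-encode src_area

# prep on the training data, then bake both sets
rec_prep    <- prep(rec)
train_baked <- bake(rec_prep, new_data = NULL)
valid_baked <- bake(rec_prep, new_data = data_valid)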

Modelling

We’re trying a few models with various results. For some, the result is better with our original training data than with the dataset balanced with SMOTE.

The models we’re trying here include:

  • Logistic Regression
  • Decision Tree
  • kNN with tidymodels


For our confusion matrix, we’re using 0 or “insufficient” as the positive class.

Logistic Regression

With our logistic regression, I get the best result when the classification threshold is at 0.45.
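
A hedged sketch of this step with base R’s glm(), assuming the data_train/data_valid split from above and a target coverage where 0 means insufficient:

model_logit <- glm(coverage ~ ., data = data_train, family = "binomial")

# predicted probability of the "1" (sufficient) class
prob_valid <- predict(model_logit, newdata = data_valid, type = "response")

# apply the 0.45 threshold: below it we predict 0 (insufficient)
pred_valid <- factor(ifelse(prob_valid >= 0.45, "1", "0"), levels = c("0", "1"))

caret::confusionMatrix(pred_valid, data_valid$coverage, positive = "0")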

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 278  78
##          1  90 234
##                                           
##                Accuracy : 0.7529          
##                  95% CI : (0.7187, 0.7849)
##     No Information Rate : 0.5412          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.504           
##                                           
##  Mcnemar's Test P-Value : 0.3961          
##                                           
##             Sensitivity : 0.7554          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.7809          
##          Neg Pred Value : 0.7222          
##              Prevalence : 0.5412          
##          Detection Rate : 0.4088          
##    Detection Prevalence : 0.5235          
##       Balanced Accuracy : 0.7527          
##                                           
##        'Positive' Class : 0               
## 


Decision Tree with partykit

We have two decision trees: one from the partykit library and the other using tidymodels, both with their own default parameters.
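
A sketch of the partykit tree, assuming the same data objects as before; the original code chunk is not shown:

library(partykit)

model_ctree <- ctree(coverage ~ ., data = data_train)

# predicted class on the validation set, then the confusion matrix
pred_ctree <- predict(model_ctree, newdata = data_valid, type = "response")
caret::confusionMatrix(pred_ctree, data_valid$coverage, positive = "0")

# plot the fitted tree
plot(model_ctree, type = "simple")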

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 306  90
##          1  62 222
##                                           
##                Accuracy : 0.7765          
##                  95% CI : (0.7433, 0.8073)
##     No Information Rate : 0.5412          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5468          
##                                           
##  Mcnemar's Test P-Value : 0.02853         
##                                           
##             Sensitivity : 0.8315          
##             Specificity : 0.7115          
##          Pos Pred Value : 0.7727          
##          Neg Pred Value : 0.7817          
##              Prevalence : 0.5412          
##          Detection Rate : 0.4500          
##    Detection Prevalence : 0.5824          
##       Balanced Accuracy : 0.7715          
##                                           
##        'Positive' Class : 0               
## 

Our Decision tree plot

Decision Tree with tidymodels

Our decision tree with tidymodels has improved performance! The only difference in the data from the other decision tree is that we convert the area to dummy-variable format.
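
A possible tidymodels setup for this tree, with the dummy encoding handled by a recipe; the engine (rpart) and other details are assumptions since the original code is not shown:

library(tidymodels)

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wflow <- workflow() %>%
  add_recipe(recipe(coverage ~ ., data = data_train) %>%
               step_dummy(all_nominal_predictors())) %>%
  add_model(tree_spec)

tree_fit <- fit(tree_wflow, data = data_train)

pred_tree <- predict(tree_fit, new_data = data_valid)$.pred_class
caret::confusionMatrix(pred_tree, data_valid$coverage, positive = "0")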

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 315  88
##          1  53 224
##                                           
##                Accuracy : 0.7926          
##                  95% CI : (0.7602, 0.8225)
##     No Information Rate : 0.5412          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5789          
##                                           
##  Mcnemar's Test P-Value : 0.004192        
##                                           
##             Sensitivity : 0.8560          
##             Specificity : 0.7179          
##          Pos Pred Value : 0.7816          
##          Neg Pred Value : 0.8087          
##              Prevalence : 0.5412          
##          Detection Rate : 0.4632          
##    Detection Prevalence : 0.5926          
##       Balanced Accuracy : 0.7870          
##                                           
##        'Positive' Class : 0               
## 


Below is the ROC curve for our decision tree.

kNN

For our kNN, we’re using the tidymodels package and its workflow. For some reason, our kNN model gives the best result with our training dataset instead of the balanced SMOTE data.
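
A sketch of the kNN workflow via tidymodels’ nearest_neighbor(); the number of neighbors here is only a placeholder, since the original parameters are not shown:

library(tidymodels)

knn_spec <- nearest_neighbor(neighbors = 5) %>%   # neighbors value is an assumption
  set_engine("kknn") %>%
  set_mode("classification")

knn_wflow <- workflow() %>%
  add_recipe(recipe(coverage ~ ., data = data_train) %>%
               step_dummy(all_nominal_predictors())) %>%
  add_model(knn_spec)

knn_fit  <- fit(knn_wflow, data = data_train)
pred_knn <- predict(knn_fit, new_data = data_valid)$.pred_class
caret::confusionMatrix(pred_knn, data_valid$coverage, positive = "0")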

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 301  88
##          1  67 224
##                                           
##                Accuracy : 0.7721          
##                  95% CI : (0.7386, 0.8031)
##     No Information Rate : 0.5412          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5386          
##                                           
##  Mcnemar's Test P-Value : 0.1082          
##                                           
##             Sensitivity : 0.8179          
##             Specificity : 0.7179          
##          Pos Pred Value : 0.7738          
##          Neg Pred Value : 0.7698          
##              Prevalence : 0.5412          
##          Detection Rate : 0.4426          
##    Detection Prevalence : 0.5721          
##       Balanced Accuracy : 0.7679          
##                                           
##        'Positive' Class : 0               
## 

Below is the ROC curve of our kNN model, illustrating the relationship between sensitivity and specificity.

Model Comparison and Metrics Explanation

We have tried various methods to improve our models, like hyperparameter tuning and feature engineering, with the current configuration being the best one so far.

After seeing the comparison table, it seems that the best-performing model uses the decision tree algorithm.

Which Metric is More Important?

Each model produces a prediction of insufficient/sufficient that is compared to the actual value (we’re calling it the outcome) to see how the model performs.

Currently we are assessing our models with four metrics: Accuracy, Recall, Specificity, and Precision. Here’s a brief explanation of what each metric does.

First, we have 4 possible scenarios for each model.

  • True Positive (TP) : Prediction is insufficient, outcome is insufficient
  • True Negative (TN) : Prediction is sufficient, outcome is sufficient
  • False Positive (FP) : Prediction is insufficient, outcome is sufficient
  • False Negative (FN) : Prediction is sufficient, outcome is insufficient


Out of all four scenarios, we of course want the True Positives and True Negatives. The balancing act lies in the False Positives and False Negatives. In our case, I think we want to keep the False Negatives as low as possible: the prediction is sufficient, but the outcome is insufficient.

A False Negative can be harmful because it leads to an overestimation of driver sufficiency.

The formulas for all of our metrics are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy measures how accurately the model predicts the overall outcome.

Recall = TP / (TP + FN)

Recall is the ratio of correctly labeled insufficient cases to all cases that are insufficient in reality.

Precision = TP / (TP + FP)

Precision is the ratio of correctly labeled insufficient to all insufficient labels.

Specificity = TN / (TN + FP)

Specificity is the ratio of correctly labeled sufficient cases to all cases that are sufficient in reality.
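
To make the formulas concrete, here is a quick check of them against the tidymodels decision tree confusion matrix shown earlier (positive class = 0, i.e. insufficient):

# counts taken from the tidymodels decision tree confusion matrix above
TP <- 315  # predicted insufficient, actually insufficient
FN <- 53   # predicted sufficient,   actually insufficient
FP <- 88   # predicted insufficient, actually sufficient
TN <- 224  # predicted sufficient,   actually sufficient

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # ~0.793
recall      <- TP / (TP + FN)                   # ~0.856 (Sensitivity in the output)
precision   <- TP / (TP + FP)                   # ~0.782 (Pos Pred Value)
specificity <- TN / (TN + FP)                   # ~0.718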

A low False Negative count leads to a high Recall rate; therefore, we will try to pick a model with the highest Recall possible.

Comparing our model metrics below, the highest Recall comes from the tidymodels decision tree, so that’s the one we will pick.


Model Interpretation with LIME

Local Interpretable Model-agnostic Explanations (LIME) is a visualization technique that helps explain individual predictions. Different from random forest variable importance or our decision tree plot, where we can understand the Global Interpretation of the model, LIME helps us understand the Local Interpretation. This means that we can interpret how a model weighs different predictors on a case-by-case basis.

Since we are using two tidymodels models, I think it will be interesting to compare how each model weighs the variables in four identical cases: two cases from area sxk3, and the other two from area sxk9.

In our model explanation, we’re only using 4 features, since that’s all we have. The explainer has been tweaked to improve the explanation fit by increasing the number of permutations, lowering the kernel width, and changing the distance function.

Explanation fit is a value that indicates how well our model can be explained with LIME, much like R squared. It ranges from 0 to 1, and the closer the value is to 1, the better our model can be interpreted.
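
A sketch of the LIME setup described above, using the lime package; the tuning values are illustrative rather than the exact ones used in the original analysis, and depending on the model object, lime may need model_type()/predict_model() methods defined:

library(lime)
library(dplyr)

explainer <- lime(data_train %>% select(-coverage), model = tree_fit)

explanation <- explain(
  data_valid %>% select(-coverage) %>% slice(1:4),  # the four cases to explain
  explainer,
  labels         = "0",          # explain the "insufficient" class
  n_features     = 4,            # show up to 4 features, as in the report
  n_permutations = 5000,         # more permutations to stabilise the explanation
  dist_fun       = "manhattan",  # switch from the default distance function
  kernel_width   = 0.5           # narrower kernel for a more local fit
)

plot_features(explanation)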

Decision Tree

In cases 1 and 2, where the area is sxk3, we can see that the hour is the biggest factor, since area sxk3 is missing and has become the ‘base variable’. However, in cases 3 and 4, where the area is sxk9, the two most significant variables are the hour of the day and the area. The day of the week is insignificant in all cases.

Currently our explanation fit is about 0.32 to 0.36. I’d say that the LIME explanation is good enough to get a basic idea, but it should be taken with a grain of salt.

kNN

Similar to our decision tree, our LIME explanation for the kNN model has the same two most significant variables, the hour of the day and the area, whereas the day of the week is the least significant variable.

Our LIME explanation for kNN has a better explanation fit than the one for the decision tree, which means that this graph can explain its model better.


Lime Conclusion

After seeing the results of both graphs, the most significant variables according to LIME are the hour of the day and the area. The least important variable is the day of the week.

Test File Performance

With our test file, we managed to get a much better result than on our validation data.

Here’s the screenshot of the leaderboard result using our decision tree model.


Interestingly, below is the result of our kNN model.

There’s a tradeoff in getting a higher recall rate with our decision tree. It seems that we can get a better overall result with our kNN model, where the Accuracy, Precision, and Specificity are higher.

Compared to our validation results below, the test result is much better. In that case, we can conclude that our model is not overfitting.


Conclusion

Our objective at the beginning of this document was to create a machine learning model capable of predicting Scotty’s driver insufficiency for a given area and time. I think we have achieved that with our decision tree model, whose performance is quite satisfactory at an 88% recall rate.

For Scotty, our model can serve to predict when and where insufficiency might happen, so the company can prevent it by various means of adding drivers to that location. Maybe something like Uber’s approach, where at certain times and locations the price is raised in order to attract more drivers.

Speaking as a user of a similar service, I think it’s very easy for a user to switch to a competitor’s product if, time and time again, the service is unavailable. Therefore, it’s very important for Scotty to solve the insufficiency problem in order to maintain its user base.