Introduction

In this document, we will look at Scotty’s dataset and try to build a machine learning model to predict driver insufficiency for Scotty in a given area, date, and time.

Wait, What’s Scotty?

Scotty is a ride-sharing business that operates in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in traveling through traffic; the app even references Star Trek’s “beam me up” in its order buttons.

Scotty turned out to be a very popular service in Turkey! Demand began to overload in some regions at some times, and there were not enough drivers at those times and places. Fortunately, we can use a classification model to predict which regions and times are risky enough to have this “no drivers” problem.

Data Input

Our Scotty dataset comes with several variables.

  • id: Transaction id
  • trip_id: Trip id
  • driver_id: Driver id
  • rider_id: Rider id
  • start_time: Request start time
  • src_lat: Request source latitude
  • src_lon: Request source longitude
  • src_area: Request source area
  • src_sub_area: Request source sub-area
  • dest_lat: Requested destination latitude
  • dest_lon: Requested destination longitude
  • dest_area: Requested destination area
  • dest_sub_area: Requested destination sub-area
  • distance: Trip distance (in KM)
  • status: Trip status (all status considered as a demand)
  • confirmed_time_sec: Time difference from request to confirmation (in seconds)

Data Test Preview

From the look of it, we only have two predictor variables to work with: src_area and the datetime.

Data Wrangling

I think one of the first challenges in our data wrangling is to determine how we judge driver sufficiency. After a discussion with my peers, we concluded that a nodrivers status means that there are insufficient drivers.

The second challenge is to pad our time intervals. We need this step because there are some hours with no orders, which is a problem for our model. We used the pad function from the padr library to solve that. We also judge that when there is no order/demand, there are sufficient drivers.

First, we need to check the min and max timestamps to use as the boundaries for our pad function.


Our data wrangling steps (a code sketch follows the list below):

  • Select predictor variables
  • Remove the minutes from our datetime with the floor_date function from the lubridate package
  • Group by area and hour, check for the nodrivers status, and encode the result as “sufficient” and “insufficient” factor levels
  • Group the data by area, pad the time variable, and replace the missing coverage with “sufficient”
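
Below is a minimal sketch of how these steps might look with dplyr, lubridate, and padr. The raw data frame name (scotty) and the derived column names (datetime, coverage) are assumptions, since the original code chunk is not shown.

library(dplyr)
library(lubridate)
library(padr)

# Boundaries for pad(): the overall min and max order hour (assumed names)
range_start <- floor_date(min(scotty$start_time), unit = "hour")
range_end   <- floor_date(max(scotty$start_time), unit = "hour")

scotty_hourly <- scotty %>%
  select(src_area, start_time, status) %>%
  # drop minutes/seconds so every order falls on an hourly timestamp
  mutate(datetime = floor_date(start_time, unit = "hour")) %>%
  # an hour is "insufficient" when at least one request got the nodrivers status
  group_by(src_area, datetime) %>%
  summarise(coverage = ifelse(any(status == "nodrivers"),
                              "insufficient", "sufficient"),
            .groups = "drop") %>%
  # fill in the missing hours per area; no demand is treated as sufficient
  pad(interval = "hour", group = "src_area",
      start_val = range_start, end_val = range_end) %>%
  mutate(coverage = ifelse(is.na(coverage), "sufficient", coverage),
         coverage = factor(coverage, levels = c("insufficient", "sufficient")))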

EDA

At this stage, we would like to measure our target class proportion using ggplot from the ggplot2 package.

Proportion of target by area

Below, we have a table with the target proportion divided by src_area. It’s a bit imbalanced for regions sxk8 and sxk9, which may affect our model’s predictions for those areas.

Plot of our target proportion per area.

Heatmap

Heatmap of Scotty Availability Based on Region

Below is our heatmap of Scotty’s availability/sufficiency rate divided by region. There are some blank tiles, which mean that at those particular times in those areas there has never been a sufficient-drivers condition. It’s especially bad for area sxk9, where there are a lot of blank tiles and very little blue, meaning the sufficiency rate in sxk9 is very low at all times.

Meanwhile, in area sxk8 the sufficiency rate is very high; the lowest seems to be on Friday between 3 PM and 6 PM.
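
As an illustration, a heatmap like the one described above could be built with geom_tile from ggplot2. This is only a sketch assuming the padded hourly data frame scotty_hourly from the wrangling sketch earlier; the original plotting code is not shown.

library(dplyr)
library(lubridate)
library(ggplot2)

scotty_hourly %>%
  mutate(weekday = wday(datetime, label = TRUE),
         hour    = hour(datetime)) %>%
  group_by(src_area, weekday, hour) %>%
  # share of hours in which drivers were sufficient, per area/weekday/hour cell
  summarise(sufficiency_rate = mean(coverage == "sufficient"), .groups = "drop") %>%
  ggplot(aes(x = hour, y = weekday, fill = sufficiency_rate)) +
  geom_tile() +
  facet_wrap(~ src_area) +
  labs(title = "Scotty Availability Based on Region",
       x = "Hour of day", y = "Day of week", fill = "Sufficiency rate")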

Data Preprocess

This is our final data wrangling step before modelling, and where we tweak our predictor variables. We purposely do this just before modelling so that we don’t mess with the data used for EDA.

We have tried and tweaked our features, and found the current set to work the best for the models we’re using later on.

What we do here:

  • Create hour and weekday columns out of our datetime
  • Set our target to 1 and 0 instead of “sufficient” and “insufficient”
  • Normalize our hour and weekday variables instead of treating them as factors
  • Remove the datetime column


We have tried using hour and weekday as factors, but our models’ results improved the most when we normalized them.
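
A rough sketch of this preprocessing, assuming the padded data frame scotty_hourly from earlier and min-max normalization for the hour and weekday columns (the exact normalization used is not shown in the original):

library(dplyr)
library(lubridate)

scotty_clean <- scotty_hourly %>%
  mutate(
    hour    = hour(datetime),    # 0-23
    weekday = wday(datetime),    # 1-7
    # target as a 0/1 factor instead of "insufficient"/"sufficient"
    coverage = factor(ifelse(coverage == "sufficient", 1, 0), levels = c(0, 1)),
    # scale hour and weekday to the 0-1 range instead of keeping them as factors
    hour    = (hour - min(hour)) / (max(hour) - min(hour)),
    weekday = (weekday - min(weekday)) / (max(weekday) - min(weekday))
  ) %>%
  select(-datetime)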

Cross Validation

Our cross-validation uses an 85-15 ratio for our training and validation datasets.
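
A sketch of the split using rsample’s initial_split(); the seed and the stratification on the target are assumptions:

library(rsample)

set.seed(123)
splits     <- initial_split(scotty_clean, prop = 0.85, strata = coverage)
data_train <- training(splits)
data_valid <- testing(splits)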

Balancing our dataset

To further improve our model, we balance our target class proportion with the SMOTE method to get a 50-50 ratio between the target classes “sufficient” and “insufficient”.

## 
##   0   1 
## 0.5 0.5
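
The original post does not show which SMOTE implementation was used; one possibility is the standalone smote() helper from the themis package, sketched below. Note that it expects numeric predictors, so src_area would need to be dummy-encoded first.

library(themis)

set.seed(123)
# over_ratio = 1 upsamples the minority class until the two classes are 50-50
data_train_smote <- smote(data_train, var = "coverage", k = 5, over_ratio = 1)

# reproduce the class proportion check above
prop.table(table(data_train_smote$coverage))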

Recipe Preparation for Tidymodels

Yes, we are using the Scotty training data, the one with the 54-46 ratio, even though we did SMOTE just before this. But from what I have tried, our tidymodels models perform better with the original training data than with the SMOTE data.
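
A minimal recipe sketch for the tidymodels workflows, assuming the training data frame data_train with target coverage; the actual steps in the original recipe are not shown:

library(recipes)

rec <- recipe(coverage ~ ., data = data_train) %>%
  step_dummy(all_nominal_predictors())   # dummy-encode src_area

# prep on the training data, then bake both sets
rec_prep    <- prep(rec)
train_baked <- bake(rec_prep, new_data = NULL)
valid_baked <- bake(rec_prep, new_data = data_valid)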

Modelling

We’re trying a few models with various results. For some, the result is better with our original training data than with the dataset balanced with SMOTE.

The models we’re trying here include:

  • Logistic Regression
  • Decision Tree
  • kNN with tidymodels


For our confusion matrix, we’re using 0 or “insufficient” as the positive class.

Logistic Regression

With our logistic regression, I get the best result when the classification threshold is at 0.45.
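
A hedged sketch of this step with base R’s glm(), assuming the data_train/data_valid split from above and a target coverage where 0 means insufficient:

model_logit <- glm(coverage ~ ., data = data_train, family = "binomial")

# predicted probability of the "1" (sufficient) class
prob_valid <- predict(model_logit, newdata = data_valid, type = "response")

# apply the 0.45 threshold: below it we predict 0 (insufficient)
pred_valid <- factor(ifelse(prob_valid >= 0.45, "1", "0"), levels = c("0", "1"))

caret::confusionMatrix(pred_valid, data_valid$coverage, positive = "0")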

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 278  78
##          1  90 234
##                                           
##                Accuracy : 0.7529          
##                  95% CI : (0.7187, 0.7849)
##     No Information Rate : 0.5412          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.504           
##                                           
##  Mcnemar's Test P-Value : 0.3961          
##                                           
##             Sensitivity : 0.7554          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.7809          
##          Neg Pred Value : 0.7222          
##              Prevalence : 0.5412          
##          Detection Rate : 0.4088          
##    Detection Prevalence : 0.5235          
##       Balanced Accuracy : 0.7527          
##                                           
##        'Positive' Class : 0               
## 


Decision Tree with partykit

We have two decision trees: one from the partykit library and the other using tidymodels, both with their own default parameters.
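
A sketch of the partykit tree, assuming the same data objects as before; the original code chunk is not shown:

library(partykit)

model_ctree <- ctree(coverage ~ ., data = data_train)

# predicted class on the validation set, then the confusion matrix
pred_ctree <- predict(model_ctree, newdata = data_valid, type = "response")
caret::confusionMatrix(pred_ctree, data_valid$coverage, positive = "0")

# plot the fitted tree
plot(model_ctree, type = "simple")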

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 306  90
##          1  62 222
##                                           
##                Accuracy : 0.7765          
##                  95% CI : (0.7433, 0.8073)
##     No Information Rate : 0.5412          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5468          
##                                           
##  Mcnemar's Test P-Value : 0.02853         
##                                           
##             Sensitivity : 0.8315          
##             Specificity : 0.7115          
##          Pos Pred Value : 0.7727          
##          Neg Pred Value : 0.7817          
##              Prevalence : 0.5412          
##          Detection Rate : 0.4500          
##    Detection Prevalence : 0.5824          
##       Balanced Accuracy : 0.7715          
##                                           
##        'Positive' Class : 0               
## 

Our Decision tree plot

Decision Tree with tidymodels

Our decision tree with tidymodels has improved performance! The only difference in the data from the other decision tree is that we convert the area to dummy-variable format.
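
A possible tidymodels setup for this tree, with the dummy encoding handled by a recipe; the engine (rpart) and other details are assumptions since the original code is not shown:

library(tidymodels)

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wflow <- workflow() %>%
  add_recipe(recipe(coverage ~ ., data = data_train) %>%
               step_dummy(all_nominal_predictors())) %>%
  add_model(tree_spec)

tree_fit <- fit(tree_wflow, data = data_train)

pred_tree <- predict(tree_fit, new_data = data_valid)$.pred_class
caret::confusionMatrix(pred_tree, data_valid$coverage, positive = "0")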

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 315  88
##          1  53 224
##                                           
##                Accuracy : 0.7926          
##                  95% CI : (0.7602, 0.8225)
##     No Information Rate : 0.5412          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5789          
##                                           
##  Mcnemar's Test P-Value : 0.004192        
##                                           
##             Sensitivity : 0.8560          
##             Specificity : 0.7179          
##          Pos Pred Value : 0.7816          
##          Neg Pred Value : 0.8087          
##              Prevalence : 0.5412          
##          Detection Rate : 0.4632          
##    Detection Prevalence : 0.5926          
##       Balanced Accuracy : 0.7870          
##                                           
##        'Positive' Class : 0               
## 


Below is the ROC curve for our decision tree.

kNN

For our kNN, we’re using the tidymodels package and its workflow. For some reason, our kNN model gives the best result with our training dataset instead of the balanced SMOTE data.
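
A sketch of the kNN workflow via tidymodels’ nearest_neighbor(); the number of neighbors here is only a placeholder, since the original parameters are not shown:

library(tidymodels)

knn_spec <- nearest_neighbor(neighbors = 5) %>%   # neighbors value is an assumption
  set_engine("kknn") %>%
  set_mode("classification")

knn_wflow <- workflow() %>%
  add_recipe(recipe(coverage ~ ., data = data_train) %>%
               step_dummy(all_nominal_predictors())) %>%
  add_model(knn_spec)

knn_fit  <- fit(knn_wflow, data = data_train)
pred_knn <- predict(knn_fit, new_data = data_valid)$.pred_class
caret::confusionMatrix(pred_knn, data_valid$coverage, positive = "0")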

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 301  88
##          1  67 224
##                                           
##                Accuracy : 0.7721          
##                  95% CI : (0.7386, 0.8031)
##     No Information Rate : 0.5412          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5386          
##                                           
##  Mcnemar's Test P-Value : 0.1082          
##                                           
##             Sensitivity : 0.8179          
##             Specificity : 0.7179          
##          Pos Pred Value : 0.7738          
##          Neg Pred Value : 0.7698          
##              Prevalence : 0.5412          
##          Detection Rate : 0.4426          
##    Detection Prevalence : 0.5721          
##       Balanced Accuracy : 0.7679          
##                                           
##        'Positive' Class : 0               
## 

Below is the ROC curve of our kNN model, illustrating the relationship between sensitivity and specificity.

Model Comparison and Metrics Explanation

We have tried various methods to improve our models, like hyperparameter tuning and feature engineering, with the current configuration being the best one so far.

After seeing the comparison table, it seems that the best-performing model uses the decision tree algorithm.

Which Metric is More Important?

Each model produces a prediction of insufficient/sufficient that is compared to the actual value (we’re calling it the outcome) to see how the model performs.

Currently we are assessing our models with four metrics: Accuracy, Recall, Specificity, and Precision. Here’s a brief explanation of what each metric does.

First, we have 4 possible scenarios for each model.

  • True Positive (TP) : Prediction is insufficient, outcome is insufficient
  • True Negative (TN) : Prediction is sufficient, outcome is sufficient
  • False Positive (FP) : Prediction is insufficient, outcome is sufficient
  • False Negative (FN) : Prediction is sufficient, outcome is insufficient


Out of all four scenarios, we of course want the True Positives and True Negatives. The balancing act lies in the False Positives and False Negatives. In our case, I think we want to keep the False Negatives as low as possible: the prediction is sufficient, but the outcome is insufficient.

A False Negative can be harmful because it leads to an overestimation of driver sufficiency.

The formulas for all of our metrics are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy measures how accurately the model predicts the overall outcome.

Recall = TP / (TP + FN)

Recall is the ratio of correctly labeled insufficient cases to all cases that are insufficient in reality.

Precision = TP / (TP + FP)

Precision is the ratio of correctly labeled insufficient to all insufficient labels.

Specificity = TN / (TN + FP)

Specificity is the ratio of correctly labeled sufficient cases to all cases that are sufficient in reality.
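
To make the formulas concrete, here is a quick check of them against the tidymodels decision tree confusion matrix shown earlier (positive class = 0, i.e. insufficient):

# counts taken from the tidymodels decision tree confusion matrix above
TP <- 315  # predicted insufficient, actually insufficient
FN <- 53   # predicted sufficient,   actually insufficient
FP <- 88   # predicted insufficient, actually sufficient
TN <- 224  # predicted sufficient,   actually sufficient

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # ~0.793
recall      <- TP / (TP + FN)                   # ~0.856 (Sensitivity in the output)
precision   <- TP / (TP + FP)                   # ~0.782 (Pos Pred Value)
specificity <- TN / (TN + FP)                   # ~0.718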

A low False Negative count leads to a high Recall rate; therefore, we will try to pick a model with the highest Recall possible.

Comparing our model metrics below, the highest Recall comes from the tidymodels decision tree, so that’s the one we will pick.


Model Interpretation with LIME

Local Interpretable Model-agnostic Explanations (LIME) is a visualization technique that helps explain individual predictions. Different from random forest variable importance or our decision tree plot, where we can understand the Global Interpretation of the model, LIME helps us understand the Local Interpretation. This means that we can interpret how a model weighs different predictors on a case-by-case basis.

Since we are using two tidymodels models, I think it will be interesting to compare how each model weighs the variables in four identical cases: two cases from area sxk3, and the other two from area sxk9.

In our model explanation, we’re only using 4 features, since that’s all we have. The explainer has been tweaked to improve the explanation fit by increasing the number of permutations, lowering the kernel width, and changing the distance function.

Explanation fit is a value that indicates how well our model can be explained with LIME, much like R squared. It ranges from 0 to 1, and the closer the value is to 1, the better our model can be interpreted.
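
A sketch of the LIME setup described above, using the lime package; the tuning values are illustrative rather than the exact ones used in the original analysis, and depending on the model object, lime may need model_type()/predict_model() methods defined:

library(lime)
library(dplyr)

explainer <- lime(data_train %>% select(-coverage), model = tree_fit)

explanation <- explain(
  data_valid %>% select(-coverage) %>% slice(1:4),  # the four cases to explain
  explainer,
  labels         = "0",          # explain the "insufficient" class
  n_features     = 4,            # show up to 4 features, as in the report
  n_permutations = 5000,         # more permutations to stabilise the explanation
  dist_fun       = "manhattan",  # switch from the default distance function
  kernel_width   = 0.5           # narrower kernel for a more local fit
)

plot_features(explanation)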

Decision Tree

In cases 1 and 2, where the area is sxk3, we can see that the hour is the biggest factor, since area sxk3 is missing and has become the ‘base variable’. However, in cases 3 and 4, where the area is sxk9, the two most significant variables are the hour of the day and the area. The day of the week is insignificant in all cases.

Currently our explanation fit is about 0.32 to 0.36. I’d say that the LIME explanation is good enough to get a basic idea, but it should be taken with a grain of salt.

kNN

Similar to our decision tree, our LIME explanation for the kNN model has the same two most significant variables, the hour of the day and the area, whereas the day of the week is the least significant variable.

Our LIME explanation for kNN has a better explanation fit than the one for the decision tree, which means that this graph can explain its model better.


Lime Conclusion

After seeing the results of both graphs, the most significant variables according to LIME are the hour of the day and the area. The least important variable is the day of the week.

Test File Performance

With our test file, we managed to get a much better result than on our validation data.

Here’s the screenshot of the leaderboard result using our decision tree model.


Interestingly, below is the result of our kNN model.

There’s a tradeoff in getting a higher recall rate with our decision tree. It seems that we can get a better overall result with our kNN model, where the Accuracy, Precision, and Specificity are higher.

Compared to our validation results below, the test result is much better. In that case, we can conclude that our model is not overfitting.


Conclusion

Our objective at the beginning of this document was to create a machine learning model capable of predicting Scotty’s driver insufficiency for a given area and time. I think we have achieved that with our decision tree model, whose performance is quite satisfactory at an 88% recall rate.

For Scotty, our model can serve to predict when and where insufficiency might happen, so the company can prevent it by various means of adding drivers to that location. Maybe something like Uber’s approach, where at certain times and locations the price is raised in order to attract more drivers.

Speaking as a user of a similar service, I think it’s very easy for a user to switch to a competitor’s product if, time and time again, the service is unavailable. Therefore, it’s very important for Scotty to solve the insufficiency problem in order to maintain its user base.