Background

Scotty is the fastest-growing Turkish-born technology start-up, serving three verticals in İstanbul: ride-sharing, food delivery, and courier delivery. They help thousands of people and organizations save time and be happy, and they would love to keep making them happy.

Problem

Scotty was founded in 2017 and is headquartered in Istanbul. They already have more than 11K drivers, more than 1 million users, and 3 million completed transactions. In some regions and at some times, there were not enough drivers. Fortunately, we know that we can use a classification model to predict which regions and times are at risk of this “no drivers” problem.

They provided us with a real-time transaction dataset. With this dataset, we are going to help them solve a classification problem in order to improve their business processes.

Load and inspect the data:

Train Data
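A minimal sketch of how the data might be loaded and inspected; the file path and the read.csv options are assumptions:

library(dplyr)

# read the raw transaction data; path and options are assumed
scotty_train <- read.csv("data/scotty-train.csv", stringsAsFactors = TRUE)

# count missing values per column, then inspect the structure
colSums(is.na(scotty_train))
glimpse(scotty_train)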

##                 id            trip_id          driver_id           rider_id 
##                  0              14901              14900                  0 
##         start_time            src_lat            src_lon           src_area 
##                  0                  0                  0                  0 
##       src_sub_area           dest_lat           dest_lon          dest_area 
##                  0                  0                  0                  0 
##      dest_sub_area           distance             status confirmed_time_sec 
##                  0                  0                  0                  0
## Rows: 229,532
## Columns: 16
## $ id                 <fct> 59d005e1ffcfa261708ce9cd, 59d0066a3d32b861760d4a01…
## $ trip_id            <fct> 59d005e9cb564761a8fe5d3e, 59d00678ffcfa261708ceb37…
## $ driver_id          <fct> 59a892c5568be44b2734f276, 59a135565e88a24b11f11994…
## $ rider_id           <fct> 59ad2d6efba75a581666b506, 59ce930f3d32b861760a4617…
## $ start_time         <fct> 2017-10-01T00:00:17Z, 2017-10-01T00:02:34Z, 2017-1…
## $ src_lat            <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon            <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area           <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area       <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat           <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon           <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area          <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area      <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance           <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status             <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <int> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…

From the output above we can see that there are some missing values in trip_id and driver_id, but that looks fine, so I will keep them as they are. We also need to convert start_time to POSIXct; everything else is good :)

The dataset includes information about:

  • id: Transaction id
  • trip_id: Trip id
  • driver_id: Driver id
  • rider_id: Rider id
  • start_time: Request start time
  • src_lat: Request source latitude
  • src_lon: Request source longitude
  • src_area: Request source area
  • src_sub_area: Request source sub-area
  • dest_lat: Requested destination latitude
  • dest_lon: Requested destination longitude
  • dest_area: Requested destination area
  • dest_sub_area: Requested destination sub-area
  • distance: Trip distance (in KM)
  • status: Trip status (every status is considered demand)
  • confirmed_time_sec: Time difference from request to confirmation (in seconds)

Test Data

## Rows: 504
## Columns: 3
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ datetime <fct> 2017-12-03T00:00:00Z, 2017-12-03T01:00:00Z, 2017-12-03T02:00…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

From the output above we can see that we only need to convert datetime to POSIXct.

## src_area datetime coverage 
##        0        0      504

In order to fulfill the requirement, we need to put some extra effort into exploring this test data.

We found the date range and source areas as written below:

  1. 2017-12-03, 00:00:00 to 2017-12-09, 23:00:00.

  2. Areas sxk3, sxk8, and sxk9.
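A quick way to check this (object name assumed; datetime is parsed on the fly here since the conversion happens in the next section):

library(lubridate)

# date range and distinct source areas in the test data
range(ymd_hms(scotty_test$datetime))
levels(scotty_test$src_area)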

Data Preprocessing

As we can see, some of the data types are still inappropriate; let's fix them.

Train Data
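A minimal sketch of the conversion, using lubridate's ymd_hms() (lubridate was loaded above; object names follow the load step):

# parse the ISO-8601 timestamp into POSIXct
scotty_train <- scotty_train %>%
  mutate(start_time = ymd_hms(start_time))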

Test Data
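And likewise for the test data (object name assumed):

# same conversion for the hourly test timestamps
scotty_test <- scotty_test %>%
  mutate(datetime = ymd_hms(datetime))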

It looks like the timestamps differ: the datetime in TRAIN DATA is recorded down to the second, while the datetime in TEST DATA is at an hourly interval. Let's make them equal at a one-hour interval.

Floor the time in TRAIN DATA down to the hour:
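A sketch with lubridate's floor_date():

# truncate each start_time down to its containing hour
scotty_train <- scotty_train %>%
  mutate(datetime = floor_date(start_time, unit = "hour"))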

Put simply, if we want to determine whether the coverage status is sufficient or insufficient, we first need to observe whether the order status is confirmed or nodrivers. From the data we can see that both statuses occur within the same hour. I am not entirely sure this happens every hour, but let's assume it does, and group the data (sketched below) so it won't distract from the preprocessing.
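A sketch of the aggregation; the grouping variables are inferred from the message below, and the count column name is an assumption:

# count requests per area, status, and hour
scotty_agg <- scotty_train %>%
  group_by(src_area, status, datetime) %>%
  summarise(n = n())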

## `summarise()` regrouping output by 'src_area', 'status' (override with `.groups` argument)

After doing this, we found that the time series is not at a regular interval; some hours, such as 02:00:00, 05:00:00, and 06:00:00, are missing from the first 6 rows, so let's make it regular. Also, going back to an earlier finding: since the TEST data has 3 different source areas, let's split the TRAIN data into three separate data frames by area so we can observe each scope properly.
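One way to regularize the series is tidyr's complete() with full_seq(); splitting by area is then a simple filter(). Object names are assumptions:

library(tidyr)

# add a row for every area-status-hour combination in the date range
scotty_agg <- scotty_agg %>%
  ungroup() %>%
  complete(src_area, status, datetime = full_seq(datetime, period = 3600))

# one data frame per area, matching the three areas in the test data
scotty_sxk3 <- scotty_agg %>% filter(src_area == "sxk3")
scotty_sxk8 <- scotty_agg %>% filter(src_area == "sxk8")
scotty_sxk9 <- scotty_agg %>% filter(src_area == "sxk9")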

To be able to count the confirmed and nodrivers statuses, we first need to change NA values to 0. Then we create a new column showing whether coverage at that hour is sufficient or insufficient.
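A sketch of that step; the coverage rule (an hour is insufficient whenever any nodrivers request occurred) is an assumption:

# one column per status, NA counts treated as 0
scotty_clean <- scotty_agg %>%
  mutate(n = replace_na(n, 0)) %>%
  pivot_wider(names_from = status, values_from = n) %>%
  mutate(coverage = factor(ifelse(nodrivers > 0,
                                  "insufficient", "sufficient")))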

Exploratory Data Analysis

Let's explore the target distribution by looking at:

Proportion Overall

Proportion of the classes of the target variable overall
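A one-liner for the class proportions (object name assumed):

# overall proportion of each coverage class
prop.table(table(scotty_clean$coverage))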

## 
## insufficient   sufficient 
##    0.5410053    0.4589947

Pattern & Correlation

Explore the relationship between the target and the features, and look for any pattern or correlation:

  1. Strip the time from the date, as sketched below
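A sketch using lubridate's hour() and wday(); the hourly and dayname column names reappear in the model summary later:

# hour of day as a factor, day of week as an ordered factor
scotty_clean <- scotty_clean %>%
  mutate(hourly = as.factor(hour(datetime)),
         dayname = wday(datetime, label = TRUE))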
## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area' (override with `.groups` argument)

## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area' (override with `.groups` argument)

## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area' (override with `.groups` argument)

Based on the bar-chart plot above:

  • Areas sxk3 and sxk9 have more insufficient days than sufficient days

  • Area sxk8 has more sufficient days than insufficient days

  • Area sxk3 has more insufficient days from 7 am till 12 am

  • Area sxk9 has more insufficient days around the clock (12 am to 11 pm)

  • Area sxk8 has quite a high number of insufficient days from 7–9 am and 3–7 pm

Heatmap

Use a heatmap of time (hour) versus weekday, grouped by area, to find the pattern.
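A condensed sketch of the heatmap, assuming ggplot2 (the original pipeline emits the two summarise() messages below):

library(ggplot2)

# share of insufficient hours per weekday-hour cell, one panel per area
scotty_clean %>%
  group_by(src_area, hourly, dayname) %>%
  summarise(p_insufficient = mean(coverage == "insufficient")) %>%
  ggplot(aes(x = hourly, y = dayname, fill = p_insufficient)) +
  geom_tile() +
  facet_wrap(~ src_area, ncol = 1)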

## `summarise()` regrouping output by 'src_area', 'hourly', 'dayname' (override with `.groups` argument)
## `summarise()` regrouping output by 'src_area', 'hourly' (override with `.groups` argument)

Based on the heatmap above:

  • Area sxk3 has insufficient coverage on weekends (Saturday and Sunday) and weekdays (Monday to Friday) from around 7 am to 12 am

  • Area sxk8 has insufficient coverage only on weekdays, from 8 am to 9 am (Monday to Thursday) and 3 pm to 7 pm (Friday)

  • Area sxk9 has insufficient coverage on both weekdays and weekends around the clock (12 am to 11 pm)

Model Fitting & Evaluation

Cross Validation

  • Drop some unused columns

  • Split the training data into train and validation sets, as sketched below
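A minimal sketch of the split; the 80/20 proportion and the seed are assumptions:

set.seed(100)  # assumed seed, for reproducibility

# hold out 20% of the hourly observations for validation
idx <- sample(nrow(scotty_clean), size = floor(0.8 * nrow(scotty_clean)))
scotty_train_split <- scotty_clean[idx, ]
scotty_validation <- scotty_clean[-idx, ]

# class proportions in the training split
prop.table(table(scotty_train_split$coverage))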
## 
## insufficient   sufficient 
##    0.5410053    0.4589947

We need to balance the two classes; a good practice here is down-sampling, e.g. with caret's downSample(), as sketched below.
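A sketch with downSample(); the input object name is assumed, while the output name matches the glm call further below:

library(caret)

# drop rows from the majority class until both classes are equal
scotty_train_down <- downSample(x = scotty_train_split %>% select(-coverage),
                                y = scotty_train_split$coverage,
                                yname = "coverage")

prop.table(table(scotty_train_down$coverage))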

## 
## insufficient   sufficient 
##          0.5          0.5

It’s balanced now; let’s move on to the next step :)

Model Fitting

Now it’s the fun part: finding the right model for our data. I always strive to explain things in a human way that can be easily understood.

There are 2 models that I am going to use:

  • Logistic Regression

  • Decision Tree

Logistic regression is a kind of classification method whose concept is similar to linear regression. However, binary logistic regression does not calculate the target value directly; instead, it calculates the probability of each target class.

A decision tree is a fairly simple model, yet its performance is robust and powerful, and it is easily interpretable. The decision tree model is the basis of other tree-based models such as random forest, which is well known as a robust predictor. The model builds a tree of decisions based on patterns in the data, and the result is visualized as a flowchart, from which it is easy to see how the model makes a prediction.

Other characteristics of decision trees:

  • They do not assume that the predictor variables are independent

  • They can handle multicollinearity and outliers

Logistic Regression
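The model is fitted with glm(), as the Call line in the summary below confirms; only the object name is assumed:

# binomial logistic regression on all remaining predictors
model_glm <- glm(coverage ~ ., family = "binomial", data = scotty_train_down)
summary(model_glm)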

## 
## Call:
## glm(formula = coverage ~ ., family = "binomial", data = scotty_train_down)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.68191  -0.73900   0.00614   0.69223   2.86488  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.201639   0.229410  -0.879 0.379430    
## src_areasxk8  2.423088   0.113839  21.285  < 2e-16 ***
## src_areasxk9 -1.163636   0.107249 -10.850  < 2e-16 ***
## hourly1       0.205163   0.315056   0.651 0.514920    
## hourly2       0.781255   0.313253   2.494 0.012631 *  
## hourly3       0.501428   0.308420   1.626 0.103993    
## hourly4       1.348423   0.309672   4.354 1.33e-05 ***
## hourly5       1.124973   0.312085   3.605 0.000313 ***
## hourly6       0.749114   0.304528   2.460 0.013897 *  
## hourly7      -0.695338   0.313962  -2.215 0.026779 *  
## hourly8      -2.132189   0.331128  -6.439 1.20e-10 ***
## hourly9      -1.062422   0.318258  -3.338 0.000843 ***
## hourly10     -0.106276   0.307984  -0.345 0.730041    
## hourly11      0.026123   0.311322   0.084 0.933128    
## hourly12     -0.567000   0.309009  -1.835 0.066521 .  
## hourly13     -0.760945   0.313946  -2.424 0.015359 *  
## hourly14     -0.622568   0.311359  -2.000 0.045552 *  
## hourly15     -0.821045   0.319644  -2.569 0.010210 *  
## hourly16     -1.025553   0.321629  -3.189 0.001430 ** 
## hourly17     -1.206824   0.322380  -3.743 0.000181 ***
## hourly18     -1.291317   0.319109  -4.047 5.20e-05 ***
## hourly19     -1.077813   0.317658  -3.393 0.000691 ***
## hourly20      0.022152   0.308201   0.072 0.942702    
## hourly21     -0.064482   0.300702  -0.214 0.830205    
## hourly22      0.639626   0.301692   2.120 0.033995 *  
## hourly23      0.047613   0.307533   0.155 0.876962    
## dayname.L    -0.187640   0.118142  -1.588 0.112228    
## dayname.Q    -0.260470   0.117547  -2.216 0.026700 *  
## dayname.C     0.195982   0.117362   1.670 0.094941 .  
## dayname^4     0.571191   0.118673   4.813 1.49e-06 ***
## dayname^5     0.264233   0.116753   2.263 0.023625 *  
## dayname^6     0.006165   0.116049   0.053 0.957634    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4597  on 3315  degrees of freedom
## Residual deviance: 3142  on 3284  degrees of freedom
## AIC: 3206
## 
## Number of Fisher Scoring iterations: 5

Let's pick a variable at random and interpret its value.
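Exponentiating a coefficient turns its log of odds into an odds ratio:

# odds ratio for hour 4 relative to the baseline hour 0
exp(coef(model_glm)["hourly4"])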

##  hourly4 
## 3.851349

This means that at 4:00, the odds of sufficient coverage are about 3.85 times those at the baseline hour (0:00), with other conditions the same.

Check whether there is a problem with the model, e.g. multicollinearity.
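A common check is the variance inflation factor, assuming the car package (GVIF is reported because the predictors are factors):

library(car)

# (generalized) variance inflation factors for the fitted model
vif(model_glm)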

##              GVIF Df GVIF^(1/(2*Df))
## src_area 1.207358  2        1.048236
## hourly   1.209117 23        1.004137
## dayname  1.038939  6        1.003188

The no-multicollinearity assumption has not been violated (all (G)VIF values are well below 10).

No perfect separation detected.

Predict

In predict(), there is a type argument with two options, link and response:

  • type = "link" returns the log of odds of the prediction (this is the default)
  • type = "response" returns the prediction as a probability
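A sketch of predicting and thresholding on the validation set (object names assumed; for a factor response, glm models the probability of the second level, here sufficient):

# predicted probability of sufficient coverage
prob_validation <- predict(model_glm, newdata = scotty_validation,
                           type = "response")

# classify with a 0.5 threshold
pred_validation <- factor(ifelse(prob_validation > 0.5,
                                 "sufficient", "insufficient"),
                          levels = c("insufficient", "sufficient"))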

Train Dataset

Validation Dataset (held out from the overall training data)

  • Try with the Test Dataset

Model Evaluation

After making predictions with the model, there is always a chance of error.

In classification, we can use a confusion matrix to evaluate the model:

  • TP (True Positive) = we predicted the positive class and it is correct

  • TN (True Negative) = we predicted the negative class and it is correct

  • FP (False Positive) = we predicted the positive class and it is wrong

  • FN (False Negative) = we predicted the negative class and it is wrong
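A sketch with caret's confusionMatrix(), setting insufficient as the positive class as in the outputs below:

# compare predictions against the true coverage labels
confusionMatrix(data = pred_validation,
                reference = scotty_validation$coverage,
                positive = "insufficient")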

Evaluate the Train Dataset

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient         1374        410
##   sufficient            284       1248
##                                           
##                Accuracy : 0.7907          
##                  95% CI : (0.7765, 0.8044)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5814          
##                                           
##  Mcnemar's Test P-Value : 2.086e-06       
##                                           
##             Sensitivity : 0.8287          
##             Specificity : 0.7527          
##          Pos Pred Value : 0.7702          
##          Neg Pred Value : 0.8146          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4144          
##    Detection Prevalence : 0.5380          
##       Balanced Accuracy : 0.7907          
##                                           
##        'Positive' Class : insufficient    
## 

Evaluate the Validation Dataset

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient          396        113
##   sufficient             88        311
##                                           
##                Accuracy : 0.7786          
##                  95% CI : (0.7502, 0.8052)
##     No Information Rate : 0.533           
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5537          
##                                           
##  Mcnemar's Test P-Value : 0.09049         
##                                           
##             Sensitivity : 0.8182          
##             Specificity : 0.7335          
##          Pos Pred Value : 0.7780          
##          Neg Pred Value : 0.7794          
##              Prevalence : 0.5330          
##          Detection Rate : 0.4361          
##    Detection Prevalence : 0.5606          
##       Balanced Accuracy : 0.7758          
##                                           
##        'Positive' Class : insufficient    
## 

  • Accuracy: how well the model predicts the target classes overall

  • Sensitivity/Recall: how well the model predicts the positive class

  • Specificity: how well the model predicts the negative class

  • Pos Pred Value/Precision: how precise the model is when predicting the positive class

As shown above, the results on the validation data and the train data are quite similar.

We can adjust the levels of accuracy, recall, precision, and specificity by tuning the threshold (that 0.5 value), depending on the business case and the level of performance needed on each metric.

  • Recall (sensitivity) is the key metric for this business case, since Scotty wants to keep wrong predictions of insufficient versus sufficient driver coverage at certain times as low as possible

Decision Tree

As described earlier, the decision tree is a fairly simple model, yet robust, powerful, and easily interpretable, and it is the basis of other tree-based models such as the famous random forest. The tree derives its rules from patterns in the data, and the result is visualized as a flowchart, so it is easy to see how the model arrives at a prediction. A sketch of the fitting step follows below.
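The document does not name the tree library; a minimal sketch assuming rpart and rpart.plot:

library(rpart)
library(rpart.plot)

# grow a classification tree on the down-sampled training data
model_tree <- rpart(coverage ~ ., data = scotty_train_down, method = "class")

# draw the tree as a flowchart
rpart.plot(model_tree)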

From the decision tree above, some of the interpretations are:

  1. If src_area is sxk8 and hourly is 13:00:00, then the coverage status is mostly sufficient

  2. If src_area is sxk3 or sxk9, hourly is 7, 8, 9, 12, 13, 14, 17, 18, or 19, and dayname is after Monday, the coverage status is mostly insufficient

The prediction process is pretty similar to the logistic regression model's. On the Train Dataset

On the Validation Dataset

On the Test Dataset

Model Evaluation

Evaluate the Train Dataset

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient         1418        436
##   sufficient            240       1222
##                                          
##                Accuracy : 0.7961         
##                  95% CI : (0.782, 0.8097)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5923         
##                                          
##  Mcnemar's Test P-Value : 6.382e-14      
##                                          
##             Sensitivity : 0.8552         
##             Specificity : 0.7370         
##          Pos Pred Value : 0.7648         
##          Neg Pred Value : 0.8358         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4276         
##    Detection Prevalence : 0.5591         
##       Balanced Accuracy : 0.7961         
##                                          
##        'Positive' Class : insufficient   
## 

Evaluate the Validation Dataset

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     insufficient sufficient
##   insufficient          408        118
##   sufficient             76        306
##                                           
##                Accuracy : 0.7863          
##                  95% CI : (0.7582, 0.8126)
##     No Information Rate : 0.533           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5682          
##                                           
##  Mcnemar's Test P-Value : 0.003244        
##                                           
##             Sensitivity : 0.8430          
##             Specificity : 0.7217          
##          Pos Pred Value : 0.7757          
##          Neg Pred Value : 0.8010          
##              Prevalence : 0.5330          
##          Detection Rate : 0.4493          
##    Detection Prevalence : 0.5793          
##       Balanced Accuracy : 0.7823          
##                                           
##        'Positive' Class : insufficient    
## 

As shown above, the results on the validation data versus the train data are still acceptable, with the note that they are better than the logistic regression model :).

As with the logistic regression method, we can adjust the levels of accuracy, recall, precision, and specificity by tuning the threshold (that 0.5 value), depending on the business case and the level of performance needed on each metric.

It’s acceptable to use the decision tree model with a 0.5 threshold.

LIME

Let's try another method: LIME, short for Local Interpretable Model-agnostic Explanations.

It’s an explanation technique whose idea is reflected in each part of the name. Local refers to local fidelity, i.e., we want the explanation to really reflect the behaviour of the classifier “around” the instance being predicted. This explanation is useless unless it is interpretable, that is, unless a human can make sense of it. LIME is able to explain any model without needing to ‘peek’ into it, so it is model-agnostic. We now give a high-level overview of how LIME works.

First, a word about interpretability. Some classifiers use representations that are not intuitive to users at all (e.g. word embeddings). Lime explains those classifiers in terms of interpretable representations (words), even if that is not the representation actually used by the classifier. Further, lime takes human limitations into account: i.e. the explanations are not too long.

In order to be model-agnostic, lime can’t peek into the model. In order to figure out what parts of the interpretable input are contributing to the prediction, we perturb the input around its neighborhood and see how the model’s predictions behave. We then weight these perturbed data points by their proximity to the original example, and learn an interpretable model on those and the associated predictions. For example, if we are trying to explain the prediction for the sentence “I hate this movie”, we will perturb the sentence and get predictions on sentences such as “I hate movie”, “I this movie”, “I movie”, “I hate”, etc. Even if the original classifier takes many more words into account globally, it is reasonable to expect that around this example only the word “hate” will be relevant. Note that if the classifier uses some uninterpretable representation such as word embeddings, this still works: we just represent the perturbed sentences with word embeddings, and the explanation will still be in terms of words such as “hate” or “movie”.

Let's return to our case, shall we? :)

I am going to use the randomForest method for the model, implementing k-fold cross-validation (k = 5) repeated 3 times, as sketched below.
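A sketch with caret's train() and repeated cross-validation (object names assumed):

# 5-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# random forest via caret
model_rf <- train(coverage ~ ., data = scotty_train_down,
                  method = "rf", trControl = ctrl)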

Trying with LIME
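A sketch with the lime package; the plots in the document use a yes/no label, which suggests the target was recoded there, while this sketch keeps the coverage labels (observation selection and n_features are assumptions):

library(lime)

# build an explainer from the training predictors and the fitted model
explainer <- lime(scotty_train_down %>% select(-coverage), model_rf)

# explain the first 4 validation observations
explanation <- explain(head(scotty_validation %>% select(-coverage), 4),
                       explainer, n_labels = 1, n_features = 4)

plot_features(explanation)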

The text Label: yes shows which value of the target variable is being explained.

The Probability shows the probability of the observation belonging to the label yes.

We can see that all of the observations have a low probability, so the model would predict them as no instead of yes.

LIME can be used to analyze, for each observation, which predictor variables the model relied on.

From the LIME results:

Observation 1: most important variable = sxk8, least important = hourly 9

Observation 4: most important variable = sxk8, least important = Wednesday

Conclusion

  1. One or several machine learning models (with some parameter tuning) can produce quite good predictions, given sufficient data and treatment.

  2. This problem can be solved by machine learning. For other problems, we first need to be sure what point/target we need to achieve.

  3. Two models are used in this case because they are simple to apply and interpretable. Using LIME adds a different perspective on the prediction results.

  4. This process or way of thinking can be applied to a wide range of business cases; some may target customer satisfaction as their top goal while lowering the chance of customer churn caused by a decrease in customers' trust.