1 Intro

1.1 Greetings

This is the Machine Learning Capstone Project of Widya Kania Rahayu.
Irish Night - Class B.
Dataset: Scotty-Classification.

1.2 Content

Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey’s citizens and puts a high value on traveling efficiently through traffic. The app even references Star Trek’s “beam me up” in its order button.

Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve a classification problem in order to improve their business processes. Demand for Scotty began to overload in some regions at some times, and there were not enough drivers at those times and places. Fortunately, we know that we can use a classification model to predict which regions and times are risky enough to have this “no drivers” problem.

The train dataset contains detailed transaction records from October 1st, 2017 to December 2nd, 2017. The dataset includes the following columns:

id: Transaction id
trip_id: Trip id
driver_id: Driver id
rider_id: Rider id
start_time: Request start time
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in km)
status: Trip status (every status is considered a demand)
confirmed_time_sec: Time difference from request to confirmation (in seconds)

Purpose:
Create a classification model whose predictions will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). The predictions should cover the coverage status for each hour and each area: “sufficient” or “insufficient”.

Load the required libraries.
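The code chunks themselves are hidden in this report; a plausible set of libraries, inferred from the functions mentioned throughout, would be:

library(dplyr)      # data wrangling
library(tidyr)      # spread()
library(lubridate)  # floor_date(), month(), wday(), hour()
library(padr)       # pad()
library(ggplot2)    # geom_tile()
library(rsample)    # initial_split()
library(caret)      # downSample(), train(), confusionMatrix()
library(e1071)      # naiveBayes()
library(partykit)   # ctree()
library(keras)      # neural network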

2 Read Data

#> Observations: 229,532
#> Variables: 16
#> $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b8...
#> $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa2...
#> $ driver_id          <chr> "59a892c5568be44b2734f276", "59a135565e88a2...
#> $ rider_id           <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b8...
#> $ start_time         <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, ...
#> $ src_lat            <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41....
#> $ src_lon            <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28....
#> $ src_area           <chr> "sxk9", "sxk8", "sxk9", "sxk9", "sxk9", "sx...
#> $ src_sub_area       <chr> "sxk9s", "sxk8y", "sxk9e", "sxk9s", "sxk9e"...
#> $ dest_lat           <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41....
#> $ dest_lon           <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28....
#> $ dest_area          <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sx...
#> $ dest_sub_area      <chr> "sxk9u", "sxk9s", "sxk9e", "sxk9e", "sxk9e"...
#> $ distance           <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3....
#> $ status             <chr> "confirmed", "confirmed", "nodrivers", "con...
#> $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 4...
#> [1] 229532     16

The dataset has 16 columns and 229,532 observations. However, it does not yet have a target variable, and the ‘start_time’ column contains two pieces of information: the date and the time.

3 Data Preprocess

The time information inside ‘start_time’ still contains minutes and seconds; it needs to be rounded down to the hour. We will use the ‘floor_date()’ function from the lubridate package.
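A minimal sketch of this step, assuming the raw data frame is named scotty:

# round every request down to the start of its hour
scotty <- scotty %>%
  mutate(start_time = floor_date(start_time, unit = "hour"))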

We can see that the ‘start_time’ column is now at the hourly level!

*Check how many areas we have

#> [1] "sxk9" "sxk8" "sxk3"

We have 3 areas: sxk9, sxk8, and sxk3.

  • Next, we have to check for missing (NA) values
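One way to produce the per-column NA count shown below (the data frame name scotty is an assumption):

# count missing values in every column
colSums(is.na(scotty))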
#>                 id            trip_id          driver_id 
#>                  0              14901              14900 
#>           rider_id         start_time            src_lat 
#>                  0                  0                  0 
#>            src_lon           src_area       src_sub_area 
#>                  0                  0                  0 
#>           dest_lat           dest_lon          dest_area 
#>                  0                  0                  0 
#>      dest_sub_area           distance             status 
#>                  0                  0                  0 
#> confirmed_time_sec 
#>                  0

The trip_id and driver_id columns contain NA values. Since we will not be using them, we can continue to the next step.

*Next, we will group the data by area, start_time, and status (following the submission dataset). We also need to check whether the timestamps are ordered and complete on an hourly basis.
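A sketch of this aggregation step, assuming the hourly data frame is named scotty and the result is stored in a hypothetical intermediate object sct_agg:

# count demand per area, hour, and status, ordered by time
sct_agg <- scotty %>%
  group_by(src_area, start_time, status) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  arrange(src_area, start_time)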

From the data above we can see that not all observations are ordered by hour: some hours are missing, and we have to fill in the missing hours before continuing to the next step. To make a complete time series, we can use the ‘pad()’ function from the padr package.

  • Pad the data by area using the ‘pad()’ function and assign the result to a new object, ‘sct’

First, define the minimum and maximum dates of the Scotty dataset:

#> [1] "2017-10-01 UTC"
#> [1] "2017-12-02 23:00:00 UTC"

Continue with pad().
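A plausible sketch of the padding step; the start and end values follow the min and max shown above, and sct_agg is the hypothetical aggregated object from the previous step:

# insert the missing hours for every area/status combination
sct <- sct_agg %>%
  pad(interval  = "hour",
      start_val = ymd_hms("2017-10-01 00:00:00"),
      end_val   = ymd_hms("2017-12-02 23:00:00"),
      group     = c("src_area", "status"))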

We now have new rows, ordered by hour and status. The pad() function fills in the blank rows automatically based on the information above; next, we need to fill the ‘count’ column with 0.

*Fill the NA values with 0

Why 0? Because no demand actually occurred during those hours, so we fill the blanks with 0.
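A one-line sketch of this fill step:

# hours with no recorded demand get a count of 0
sct <- sct %>%
  mutate(count = replace(count, is.na(count), 0))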

*Check the missing values again to make sure

#>   src_area start_time     status      count 
#>          0          0          0          0

There are no missing values left.

4 Exploratory Data Analysis

We will create a target variable named ‘coverage’.
First, we have to spread the data to see what actually happened, based on the confirmed and nodrivers events, in order to classify each observation as insufficient or sufficient.

*Create a new column named ‘coverage’ with the values ‘sufficient’ or ‘insufficient’. The coverage column will be our target variable.

An observation with at least one “nodrivers” event (even just once) is classified as “insufficient”, while an observation that is always confirmed (nodrivers = 0) is classified as “sufficient”.
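A sketch of how the target could be built, assuming the status column contains the levels “confirmed” and “nodrivers” (sct_wide is a hypothetical name):

# one row per area and hour, with one count column per status
sct_wide <- sct %>%
  spread(key = status, value = count, fill = 0) %>%
  mutate(coverage = as.factor(ifelse(nodrivers > 0, "insufficient", "sufficient")))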

*Feature engineering / variable selection

Extract the Month, Day, and Hour from ‘start_time’, then create a Peak_Hour column (peak when 7 <= Hour <= 20) and a Weekend column (Saturday and Sunday).
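A sketch of these features using lubridate accessors; the column names follow the model summaries later in the report:

sct_wide <- sct_wide %>%
  mutate(Month     = as.factor(month(start_time)),
         Day       = wday(start_time, label = TRUE, abbr = FALSE),
         Hour      = hour(start_time),
         Peak_Hour = as.factor(ifelse(Hour >= 7 & Hour <= 20, "peak", "nopeak")),
         Weekend   = as.factor(ifelse(Day %in% c("Saturday", "Sunday"),
                                      "weekends", "weekdays")))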

5 Check Proportion Data

#> 
#> insufficient   sufficient 
#>     54.10053     45.89947
#>               
#>                    sxk3     sxk8     sxk9
#>   insufficient 62.63228 17.06349 82.60582
#>   sufficient   37.36772 82.93651 17.39418

Overall, insufficient and sufficient are almost balanced, but when broken down by area the proportions are still imbalanced.

*Correlation between the features and the target variable

Visualize using geom_tile from ggplot2. We want to see how Day, Hour, and area relate to sufficient or insufficient coverage. Recode insufficient as 0 and sufficient as 100.

Coverage: insufficient –> 0; sufficient –> 100

*Find the mean of coverage per group to gauge the level of influence

*Continue to geom_tile
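A rough sketch of the heatmap (object and column names are assumptions):

sct_wide %>%
  mutate(coverage_num = ifelse(coverage == "insufficient", 0, 100)) %>%
  group_by(src_area, Day, Hour) %>%
  summarise(mean_coverage = mean(coverage_num)) %>%
  ggplot(aes(x = Hour, y = Day, fill = mean_coverage)) +
  geom_tile() +
  facet_wrap(~src_area) +
  labs(fill = "mean coverage\n(100 = sufficient)")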

* We can see from the graph above that the sxk9 area has far more insufficient events than the other areas; in this area, insufficiency happens almost all the time, day and night, which is what we might expect in a big-city area. In contrast, the sxk8 area mostly has sufficient coverage during both day and night.

6 Model Fitting & Evaluation

*Splitting Dataset

We will split the dataset using the ‘initial_split()’ function from the rsample package, with 75% for training and 25% for testing.
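A sketch of the split; the seed and object names are assumptions, and sct_train feeds the downsampling step below:

set.seed(100)                                   # hypothetical seed
splitted  <- initial_split(sct_wide, prop = 0.75)
sct_train <- training(splitted)
sct_test  <- testing(splitted)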

*Check the class proportions of the train and test datasets

#> 
#> insufficient   sufficient 
#>     54.09932     45.90068
#> 
#> insufficient   sufficient 
#>     54.10415     45.89585

*Apply downsampling to the train dataset to balance the class proportions.
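A sketch using caret::downSample(). The object name sct_train2 and the target name coverage2 are taken from the glm() call shown later in the report; coverage2 is presumably the factor version of the coverage column created earlier:

# keep an equal number of insufficient and sufficient rows in the training data
sct_train2 <- downSample(x = sct_train %>% select(-coverage2),
                         y = sct_train$coverage2,
                         yname = "coverage2")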

#> 
#> insufficient   sufficient 
#>           50           50

Now we have a balanced class proportion.

7 Creating the Models

We will try several algorithms for this case. Scotty is a classification case whose predictors are character and factor columns. We will use Logistic Regression, Naive Bayes, Decision Tree, Random Forest, and Neural Network, fit each model using all variables in the dataset, and compare which one performs best.

8 Logistic Regression

*Creating the model using ‘glm()’

#> 
#> Call:
#> glm(formula = coverage2 ~ ., family = "binomial", data = sct_train2)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.72806  -0.72922  -0.03013   0.73964   2.43359  
#> 
#> Coefficients: (1 not defined because of singularities)
#>                   Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)     -3.119e+02  9.153e+01  -3.408 0.000655 ***
#> start_time       2.068e-07  6.069e-08   3.407 0.000656 ***
#> src_areasxk8     2.429e+00  1.161e-01  20.925  < 2e-16 ***
#> src_areasxk9    -1.146e+00  1.096e-01 -10.453  < 2e-16 ***
#> Month11          3.020e-01  1.845e-01   1.637 0.101700    
#> Month12          5.343e-01  3.642e-01   1.467 0.142376    
#> Day.L           -4.041e-01  1.247e-01  -3.241 0.001190 ** 
#> Day.Q           -2.940e-01  1.209e-01  -2.432 0.015004 *  
#> Day.C            6.494e-02  1.203e-01   0.540 0.589443    
#> Day^4            5.200e-01  1.210e-01   4.298 1.73e-05 ***
#> Day^5            9.626e-02  1.204e-01   0.799 0.424094    
#> Day^6           -4.513e-02  1.190e-01  -0.379 0.704402    
#> Hour            -9.693e-03  6.804e-03  -1.425 0.154299    
#> Peak_Hourpeak   -1.313e+00  1.011e-01 -12.988  < 2e-16 ***
#> Weekendweekends         NA         NA      NA       NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 4330.8  on 3123  degrees of freedom
#> Residual deviance: 2996.5  on 3110  degrees of freedom
#> AIC: 3024.5
#> 
#> Number of Fisher Scoring iterations: 5

Interpretation: From the summary above, we can see that logistic regression creates dummy variables from each factor column. The summary also tells us that ‘start_time’, ‘src_areasxk8’, ‘src_areasxk9’, ‘Day^4’, and ‘Peak_Hourpeak’ have the strongest influence, indicated by the three-star significance codes.

*Prediction

*We will use a threshold of 0.5 to classify an observation as insufficient.

*For the Scotty case, we will focus on sensitivity/recall, so that we capture more insufficient events and can prevent them from occurring.
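A sketch of the prediction and thresholding described above (model and object names are assumptions). Note that for a two-level factor target, predict(..., type = "response") on a glm returns the probability of the second level, here “sufficient”:

# probability of the second factor level ("sufficient")
prob_glm <- predict(model_glm, newdata = sct_test, type = "response")
# classify as insufficient when the probability of "sufficient" is below 0.5
pred_glm <- as.factor(ifelse(prob_glm < 0.5, "insufficient", "sufficient"))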

#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     insufficient sufficient
#>   insufficient          106        389
#>   sufficient            507        131
#>                                          
#>                Accuracy : 0.2092         
#>                  95% CI : (0.1858, 0.234)
#>     No Information Rate : 0.541          
#>     P-Value [Acc > NIR] : 1              
#>                                          
#>                   Kappa : -0.5654        
#>                                          
#>  Mcnemar's Test P-Value : 9.28e-05       
#>                                          
#>             Sensitivity : 0.17292        
#>             Specificity : 0.25192        
#>          Pos Pred Value : 0.21414        
#>          Neg Pred Value : 0.20533        
#>              Prevalence : 0.54104        
#>          Detection Rate : 0.09356        
#>    Detection Prevalence : 0.43689        
#>       Balanced Accuracy : 0.21242        
#>                                          
#>        'Positive' Class : insufficient   
#> 

The result is far from good; we will compare it with other algorithms.

9 Naive Bayes

*Creating the model using ‘naiveBayes()’ from the e1071 package
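A one-line sketch of the fit; the predictors match the conditional probability tables below, while the object names are assumptions:

model_naive <- naiveBayes(coverage2 ~ src_area + Month + Day + Hour + Peak_Hour + Weekend,
                          data = sct_train2)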

#> $src_area
#>               src_area
#> Y                   sxk3      sxk8      sxk9
#>   insufficient 0.4001280 0.1062740 0.4935980
#>   sufficient   0.2765685 0.5966709 0.1267606
#> 
#> $Month
#>               Month
#> Y                      10         11         12
#>   insufficient 0.55057618 0.42061460 0.02880922
#>   sufficient   0.42829706 0.53713188 0.03457106
#> 
#> $Day
#>               Day
#> Y                 Sunday    Monday   Tuesday Wednesday  Thursday    Friday
#>   insufficient 0.1421255 0.1510883 0.1338028 0.1261204 0.1395647 0.1626120
#>   sufficient   0.1453265 0.1389245 0.1504481 0.1638924 0.1504481 0.1177977
#>               Day
#> Y               Saturday
#>   insufficient 0.1446863
#>   sufficient   0.1331626
#> 
#> $Hour
#>               Hour
#> Y                  [,1]     [,2]
#>   insufficient 12.03969 6.602250
#>   sufficient   10.81818 7.272687
#> 
#> $Peak_Hour
#>               Peak_Hour
#> Y                 nopeak      peak
#>   insufficient 0.3265045 0.6734955
#>   sufficient   0.5358515 0.4641485
#> 
#> $Weekend
#>               Weekend
#> Y               weekdays  weekends
#>   insufficient 0.7131882 0.2868118
#>   sufficient   0.7215109 0.2784891
  • Prediction

We want to know whether the model underfits or overfits, so we make predictions on both the test and the train dataset and compare them.
pred_mnaive1 –> from the test dataset; pred_fit_mnaive –> from the train dataset
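A sketch of the two prediction calls (the test-set object name is an assumption):

pred_mnaive1    <- predict(model_naive, newdata = sct_test)    # test set
pred_fit_mnaive <- predict(model_naive, newdata = sct_train2)  # train set
table(pred_mnaive1)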

#> pred_mnaive1
#> insufficient   sufficient 
#>          633          500

Checking the model’s ability to predict each target class, we get 633 predictions of insufficient and 500 of sufficient.

*Model Validation

Let’s set the positive class to “insufficient”, because we want to use recall to capture more insufficient events and prevent them from happening.
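A sketch of the evaluation calls; the same pattern applies to the other models in this report, and the reference column name is an assumption:

confusionMatrix(data = pred_fit_mnaive, reference = sct_train2$coverage2,
                positive = "insufficient")   # train set
confusionMatrix(data = pred_mnaive1, reference = sct_test$coverage2,
                positive = "insufficient")   # test set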

#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     insufficient sufficient
#>   insufficient         1273        391
#>   sufficient            289       1171
#>                                           
#>                Accuracy : 0.7823          
#>                  95% CI : (0.7674, 0.7967)
#>     No Information Rate : 0.5             
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.5647          
#>                                           
#>  Mcnemar's Test P-Value : 0.0001074       
#>                                           
#>             Sensitivity : 0.8150          
#>             Specificity : 0.7497          
#>          Pos Pred Value : 0.7650          
#>          Neg Pred Value : 0.8021          
#>              Prevalence : 0.5000          
#>          Detection Rate : 0.4075          
#>    Detection Prevalence : 0.5327          
#>       Balanced Accuracy : 0.7823          
#>                                           
#>        'Positive' Class : insufficient    
#> 
#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     insufficient sufficient
#>   insufficient          508        125
#>   sufficient            105        395
#>                                           
#>                Accuracy : 0.797           
#>                  95% CI : (0.7724, 0.8201)
#>     No Information Rate : 0.541           
#>     P-Value [Acc > NIR] : <2e-16          
#>                                           
#>                   Kappa : 0.59            
#>                                           
#>  Mcnemar's Test P-Value : 0.2103          
#>                                           
#>             Sensitivity : 0.8287          
#>             Specificity : 0.7596          
#>          Pos Pred Value : 0.8025          
#>          Neg Pred Value : 0.7900          
#>              Prevalence : 0.5410          
#>          Detection Rate : 0.4484          
#>    Detection Prevalence : 0.5587          
#>       Balanced Accuracy : 0.7942          
#>                                           
#>        'Positive' Class : insufficient    
#> 

Interpretation: From the test dataset we get accuracy 79.7%, sensitivity 83%, specificity 76%, and precision 80%. These are slightly higher than the accuracy, sensitivity, specificity, and precision on the train dataset, which means we have a good model that is neither underfit nor overfit.

10 Decision Tree

*Creating the model using ‘ctree()’ from the partykit package
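A sketch of the fit; the formula matches the model formula echoed below, and mtree1 is the object name used in the Conclusion:

mtree1 <- ctree(coverage2 ~ src_area + Month + Day + Hour + Peak_Hour + Weekend,
                data = sct_train2)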

#> 
#> Model formula:
#> coverage2 ~ src_area + Month + Day + Hour + Peak_Hour + Weekend
#> 
#> Fitted party:
#> [1] root
#> |   [2] src_area in sxk3, sxk9
#> |   |   [3] Peak_Hour in nopeak
#> |   |   |   [4] src_area in sxk3
#> |   |   |   |   [5] Hour <= 6
#> |   |   |   |   |   [6] Hour <= 1: insufficient (n = 90, err = 47.8%)
#> |   |   |   |   |   [7] Hour > 1: sufficient (n = 243, err = 23.0%)
#> |   |   |   |   [8] Hour > 6: insufficient (n = 128, err = 47.7%)
#> |   |   |   [9] src_area in sxk9
#> |   |   |   |   [10] Month in 10: insufficient (n = 197, err = 13.2%)
#> |   |   |   |   [11] Month in 11, 12: insufficient (n = 224, err = 39.3%)
#> |   |   [12] Peak_Hour in peak
#> |   |   |   [13] Month in 10
#> |   |   |   |   [14] src_area in sxk3: insufficient (n = 293, err = 18.4%)
#> |   |   |   |   [15] src_area in sxk9: insufficient (n = 260, err = 4.6%)
#> |   |   |   [16] Month in 11, 12
#> |   |   |   |   [17] Day <= Thursday: insufficient (n = 390, err = 31.8%)
#> |   |   |   |   [18] Day > Thursday: insufficient (n = 201, err = 17.4%)
#> |   [19] src_area in sxk8
#> |   |   [20] Peak_Hour in nopeak: sufficient (n = 465, err = 7.1%)
#> |   |   [21] Peak_Hour in peak
#> |   |   |   [22] Month in 10, 12: sufficient (n = 342, err = 27.2%)
#> |   |   |   [23] Month in 11: sufficient (n = 291, err = 13.7%)
#> 
#> Number of inner nodes:    11
#> Number of terminal nodes: 12

*Let’s plot the model and see how it looks.
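A minimal sketch:

plot(mtree1, type = "simple")   # compact view of the fitted tree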

  • Prediction

We want to know whether the model underfits or overfits, so we make predictions on both the test and the train dataset and compare them.
pred_mtree1 –> from the test dataset; pred_fit_tree1 –> from the train dataset

  • Model Validation

Set “insufficient” as the positive class; we will use recall/sensitivity because we want to capture more insufficient events and prevent them from happening.

#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     insufficient sufficient
#>   insufficient         1340        443
#>   sufficient            222       1119
#>                                           
#>                Accuracy : 0.7871          
#>                  95% CI : (0.7724, 0.8014)
#>     No Information Rate : 0.5             
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.5743          
#>                                           
#>  Mcnemar's Test P-Value : < 2.2e-16       
#>                                           
#>             Sensitivity : 0.8579          
#>             Specificity : 0.7164          
#>          Pos Pred Value : 0.7515          
#>          Neg Pred Value : 0.8345          
#>              Prevalence : 0.5000          
#>          Detection Rate : 0.4289          
#>    Detection Prevalence : 0.5707          
#>       Balanced Accuracy : 0.7871          
#>                                           
#>        'Positive' Class : insufficient    
#> 
#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     insufficient sufficient
#>   insufficient          536        141
#>   sufficient             77        379
#>                                           
#>                Accuracy : 0.8076          
#>                  95% CI : (0.7834, 0.8302)
#>     No Information Rate : 0.541           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.6089          
#>                                           
#>  Mcnemar's Test P-Value : 1.982e-05       
#>                                           
#>             Sensitivity : 0.8744          
#>             Specificity : 0.7288          
#>          Pos Pred Value : 0.7917          
#>          Neg Pred Value : 0.8311          
#>              Prevalence : 0.5410          
#>          Detection Rate : 0.4731          
#>    Detection Prevalence : 0.5975          
#>       Balanced Accuracy : 0.8016          
#>                                           
#>        'Positive' Class : insufficient    
#> 

From the confusion matrices above, we get accuracy 81%, recall 87%, specificity 73%, and precision 79% on the test dataset, close to the train results. This shows that the decision tree model is good enough, with no sign of underfitting or overfitting.

11 Random Forest

*Creating the model using the ‘train()’ function from the caret package
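A sketch of one possible setup; the resampling scheme (5-fold cross-validation with 3 repeats) and the seed are assumptions, since the actual chunk is hidden:

set.seed(100)                                   # hypothetical seed
ctrl     <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_rf <- train(coverage2 ~ ., data = sct_train2,
                  method = "rf", trControl = ctrl)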

*Prediction

We want to know whether the model underfits or overfits, so we make predictions on both the test and the train dataset and compare them.
pred_rf –> from the test dataset; pred_fit_rf –> from the train dataset

*Model Evaluation

Set “insufficient” as the positive class; we will use recall/sensitivity because we want to capture more insufficient events and prevent them from happening.

#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     insufficient sufficient
#>   insufficient         1311        337
#>   sufficient            251       1225
#>                                           
#>                Accuracy : 0.8118          
#>                  95% CI : (0.7976, 0.8254)
#>     No Information Rate : 0.5             
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.6236          
#>                                           
#>  Mcnemar's Test P-Value : 0.000456        
#>                                           
#>             Sensitivity : 0.8393          
#>             Specificity : 0.7843          
#>          Pos Pred Value : 0.7955          
#>          Neg Pred Value : 0.8299          
#>              Prevalence : 0.5000          
#>          Detection Rate : 0.4197          
#>    Detection Prevalence : 0.5275          
#>       Balanced Accuracy : 0.8118          
#>                                           
#>        'Positive' Class : insufficient    
#> 
#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     insufficient sufficient
#>   insufficient          515        126
#>   sufficient             98        394
#>                                           
#>                Accuracy : 0.8023          
#>                  95% CI : (0.7779, 0.8251)
#>     No Information Rate : 0.541           
#>     P-Value [Acc > NIR] : < 2e-16         
#>                                           
#>                   Kappa : 0.6003          
#>                                           
#>  Mcnemar's Test P-Value : 0.07123         
#>                                           
#>             Sensitivity : 0.8401          
#>             Specificity : 0.7577          
#>          Pos Pred Value : 0.8034          
#>          Neg Pred Value : 0.8008          
#>              Prevalence : 0.5410          
#>          Detection Rate : 0.4545          
#>    Detection Prevalence : 0.5658          
#>       Balanced Accuracy : 0.7989          
#>                                           
#>        'Positive' Class : insufficient    
#> 

Interpretation: From the test dataset we get accuracy 80%, sensitivity 84%, specificity 76%, and precision 80%. Accuracy and specificity are slightly lower than the corresponding values on the train dataset, which suggests mild overfitting, although the gap is small.

12 Neural Network

Feature Engineering

For a neural network, the variables have to be numeric. So we do some feature engineering: we convert several variables to numeric, create dummy variables, and drop the start_time column.

Change the target variable (coverage) to numeric, recoded as 0 and 1:
0 –> insufficient; 1 –> sufficient

*Create dummy variables

*Convert to a data frame

*Convert to a matrix

*Separate the x variables from the y variable (the target) to prepare for one-hot encoding

*Convert to an array

#>      [,1] [,2] [,3] [,4]          [,5]          [,6]          [,7]
#> [1,]    0    0    1    0 -5.669467e-01  5.455447e-01 -4.082483e-01
#> [2,]    0    1    0    0  1.889822e-01 -3.273268e-01 -4.082483e-01
#> [3,]    0    1    0    0 -3.779645e-01  9.690821e-17  4.082483e-01
#> [4,]    1    0    0    0  3.779645e-01  0.000000e+00 -4.082483e-01
#> [5,]    0    1    0    0  1.889822e-01 -3.273268e-01 -4.082483e-01
#> [6,]    0    1    0    0  2.098124e-17 -4.364358e-01  3.021644e-17
#>            [,8]          [,9]       [,10] [,11] [,12] [,13]
#> [1,]  0.2417469 -1.091089e-01  0.03289758     3     0     1
#> [2,]  0.0805823  5.455447e-01  0.49346377     7     1     0
#> [3,] -0.5640761  4.364358e-01 -0.19738551     8     1     0
#> [4,] -0.5640761 -4.364358e-01 -0.19738551    13     1     0
#> [5,]  0.0805823  5.455447e-01  0.49346377    21     0     0
#> [6,]  0.4834938 -9.751389e-16 -0.65795169     2     0     0

*One-Hot Encoding
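A sketch of the one-hot encoding of the target, assuming the 0/1 target vectors are named y_train and y_test:

y_train_keras <- to_categorical(y_train, num_classes = 2)
y_test_keras  <- to_categorical(y_test,  num_classes = 2)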

12.1 Creating Neural Network Architecture

*Neural Network sequencing

  • Create the Neural Network model

activation at the hidden layers = ‘relu’
activation at the output layer = ‘sigmoid’, because the Scotty case is a binary classification
input shape = ncol() of the keras train dataset
units = 64, 32, and 2
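A sketch of the architecture summarised below; train_x_keras is an assumed name for the predictor matrix:

model_nn <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = ncol(train_x_keras), name = "hidden_1") %>%
  layer_dense(units = 32, activation = "relu", name = "hidden_2") %>%
  layer_dense(units = 2,  activation = "sigmoid")

summary(model_nn)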

#> Model: "sequential"
#> ___________________________________________________________________________
#> Layer (type)                     Output Shape                  Param #     
#> ===========================================================================
#> hidden_1 (Dense)                 (None, 64)                    896         
#> ___________________________________________________________________________
#> hidden_2 (Dense)                 (None, 32)                    2080        
#> ___________________________________________________________________________
#> dense (Dense)                    (None, 2)                     66          
#> ===========================================================================
#> Total params: 3,042
#> Trainable params: 3,042
#> Non-trainable params: 0
#> ___________________________________________________________________________

*Compile

Assemble the full neural network architecture and compile it, with learning rate = 0.001.
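A sketch of the compile step. The optimizer (adam) and the loss are assumptions, while the learning rate follows the text; older versions of the R keras package call the learning-rate argument lr instead of learning_rate:

model_nn %>% compile(
  loss      = "binary_crossentropy",
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics   = "accuracy"
)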

13 Data Validation

#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     insufficient sufficient
#>   insufficient         1267        450
#>   sufficient            295       1112
#>                                           
#>                Accuracy : 0.7615          
#>                  95% CI : (0.7462, 0.7764)
#>     No Information Rate : 0.5             
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.523           
#>                                           
#>  Mcnemar's Test P-Value : 1.68e-08        
#>                                           
#>             Sensitivity : 0.8111          
#>             Specificity : 0.7119          
#>          Pos Pred Value : 0.7379          
#>          Neg Pred Value : 0.7903          
#>              Prevalence : 0.5000          
#>          Detection Rate : 0.4056          
#>    Detection Prevalence : 0.5496          
#>       Balanced Accuracy : 0.7615          
#>                                           
#>        'Positive' Class : insufficient    
#> 
#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     insufficient sufficient
#>   insufficient          506        144
#>   sufficient            107        376
#>                                           
#>                Accuracy : 0.7785          
#>                  95% CI : (0.7531, 0.8023)
#>     No Information Rate : 0.541           
#>     P-Value [Acc > NIR] : < 2e-16         
#>                                           
#>                   Kappa : 0.5515          
#>                                           
#>  Mcnemar's Test P-Value : 0.02307         
#>                                           
#>             Sensitivity : 0.8254          
#>             Specificity : 0.7231          
#>          Pos Pred Value : 0.7785          
#>          Neg Pred Value : 0.7785          
#>              Prevalence : 0.5410          
#>          Detection Rate : 0.4466          
#>    Detection Prevalence : 0.5737          
#>       Balanced Accuracy : 0.7743          
#>                                           
#>        'Positive' Class : insufficient    
#> 

Interpretation: From the model validation above, using the test dataset we get accuracy 77.9%, sensitivity 82.5%, specificity 72.3%, and precision 77.9%. These results are comparable to the train results, so there is no strong sign of overfitting, but overall performance is lower than that of the tree-based models.

14 Conclusion

So far the best model is the one built with the decision tree algorithm, with accuracy 81%, recall 87%, specificity 73%, and precision 79%. The ‘mtree1’ model also shows stable performance, since we found no sign of overfitting or underfitting.
We will continue with the decision tree model, ‘mtree1’, for the data submission.

15 Data Submission

*Read Data

Read the submission data and store it in an object named ‘submission’.

Here we find 3 columns: area, datetime, and coverage. We have to fill the coverage column, and first transform the columns so they match our training dataset.

*Create the same new columns as in the previous steps, then save the result into a new object, ‘sub’

Looks good! Now it has the same structure as our dataset; we just have to fill the coverage column with the predictions from the mtree1 model.

*Prediction Using Decision Tree model –> mtree1

Make a new object to store the prediction result and name it ‘pred_sub’.

*Then put the prediction results into the ‘coverage’ column of the ‘sub’ dataset
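A sketch of these two steps (object names follow the text):

pred_sub     <- predict(mtree1, newdata = sub, type = "response")
sub$coverage <- pred_sub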

Looks good! The coverage column has now been filled with the prediction results.

*Next, we need to reduce it back to the original 3 columns (area, datetime, and coverage) and save the result under the name ‘sub2’

*Finally, create a .csv file from the ‘sub2’ object and name it ‘submit.csv’.
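A sketch of the final export; the submission column names used here (src_area, datetime, coverage) are assumptions:

sub2 <- sub %>% select(src_area, datetime, coverage)
write.csv(sub2, "submit.csv", row.names = FALSE)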

Later on, the ‘submit.csv’ file will be uploaded to the leaderboard for submission.