
Intro
Greetings
This is the Machine Learning Capstone Project of Widya Kania Rahayu.
Irish Night - Class B.
Dataset : Scotty-Classification.
Content
Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey's citizens and places a high value on traveling efficiently through traffic. The app even references Star Trek's "beam me up" in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve a classification problem in order to improve their business processes. Demand for Scotty began to overload in some regions at some times, and there were not enough drivers at those times and places. Fortunately, we know that we can use a classification model to predict which regions and hours are at risk of this "no drivers" problem.
The train dataset contains detailed transaction records from October 1st 2017 to December 2nd 2017. The dataset includes the following columns:
id: Transaction id
trip_id: Trip id
driver_id: Driver id
rider_id: Rider id
start_time: Trip start time
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in KM)
status: Trip status (every status is considered a demand)
confirmed_time_sec: Time difference from request to confirmation (in seconds)
Purpose:
Create a classification model whose predictions will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). The predictions should cover the coverage status, "sufficient" or "insufficient", for each hour and each area.
Load Required Libraries
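The report does not list the packages explicitly; below is a minimal sketch of the library calls, inferred from the functions used in the following sections.
library(dplyr)      # data wrangling (mutate, group_by, select)
library(readr)      # read_csv()
library(tidyr)      # spread() for reshaping
library(lubridate)  # floor_date(), wday(), hour(), month()
library(padr)       # pad() for completing the hourly series
library(ggplot2)    # visualization (geom_tile)
library(rsample)    # initial_split()
library(caret)      # downSample(), train(), confusionMatrix()
library(e1071)      # naiveBayes()
library(partykit)   # ctree()
library(keras)      # neural network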
Read Data
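A hedged sketch of the import, assuming the training file is stored at 'data/data-train.csv' (the actual path is not shown in the report):
scotty <- read_csv("data/data-train.csv")   # assumed file name; read_csv parses start_time as a dttm
glimpse(scotty)
dim(scotty)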
#> Observations: 229,532
#> Variables: 16
#> $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b8...
#> $ trip_id <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa2...
#> $ driver_id <chr> "59a892c5568be44b2734f276", "59a135565e88a2...
#> $ rider_id <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b8...
#> $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, ...
#> $ src_lat <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41....
#> $ src_lon <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28....
#> $ src_area <chr> "sxk9", "sxk8", "sxk9", "sxk9", "sxk9", "sx...
#> $ src_sub_area <chr> "sxk9s", "sxk8y", "sxk9e", "sxk9s", "sxk9e"...
#> $ dest_lat <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41....
#> $ dest_lon <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28....
#> $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sx...
#> $ dest_sub_area <chr> "sxk9u", "sxk9s", "sxk9e", "sxk9e", "sxk9e"...
#> $ distance <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3....
#> $ status <chr> "confirmed", "confirmed", "nodrivers", "con...
#> $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 4...
#> [1] 229532 16
The dataset has 16 columns and 229,532 observations. However, it does not yet have a target variable, and the 'start_time' column contains two pieces of information: the date and the time.
Data Preprocess
The time information inside 'start_time' still contains minutes and seconds. It needs to be rounded down to the hourly level; we will use the 'floor_date()' function from the lubridate package.
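A minimal sketch, assuming the raw data is stored in an object named 'scotty':
scotty <- scotty %>%
  mutate(start_time = floor_date(start_time, unit = "hour"))  # round down to the hour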
We can see that the 'start_time' column is now grouped at the hourly level.
*Check how many areas we have
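One way to list the distinct areas:
unique(scotty$src_area)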
#> [1] "sxk9" "sxk8" "sxk3"
We have 3 areas : sxk9, sxk8 and sxk3.
- Next, we have to check for NA values
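A quick per-column count of missing values:
colSums(is.na(scotty))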
#> id trip_id driver_id
#> 0 14901 14900
#> rider_id start_time src_lat
#> 0 0 0
#> src_lon src_area src_sub_area
#> 0 0 0
#> dest_lat dest_lon dest_area
#> 0 0 0
#> dest_sub_area distance status
#> 0 0 0
#> confirmed_time_sec
#> 0
Columns trip_id and driver_id contain NA values. Since we will not be using them, we can continue to the next step.
*Next, we will group the data by area, start_time, and status (following the submission dataset). We also need to check whether the timestamps are ordered and complete on an hourly basis, as sketched below.
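A hedged sketch of the aggregation; the object name 'sct_agg' is an assumption:
sct_agg <- scotty %>%
  group_by(src_area, start_time, status) %>%
  summarise(count = n()) %>%       # demand per area, hour, and status
  ungroup() %>%
  arrange(src_area, start_time)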
From the data above we can see that the observations are not complete for every hour: some hours are missing, and we have to fill them in before continuing. To make a complete time series, we can use the 'pad()' function from the padr package.
- Pad the grouped data by area using 'pad()' and assign the result to a new object, 'sct'.
First, define the min and max dates of the Scotty dataset.
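For example:
sct_min <- min(sct_agg$start_time)
sct_max <- max(sct_agg$start_time)
sct_min
sct_max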
#> [1] "2017-10-01 UTC"
#> [1] "2017-12-02 23:00:00 UTC"
Continue to pad():
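A sketch of the padding, grouped by area and status so that every hour appears for each combination (padr detects the hourly interval automatically):
sct <- sct_agg %>%
  pad(group = c("src_area", "status"),
      start_val = sct_min, end_val = sct_max)   # insert the missing hourly rows per group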
We now have new rows, ordered by hour and status. The pad() function fills in the blank rows automatically based on the range defined above; next we need to fill the 'count' column with 0.
*Fill the NA values with '0'
Why 0? Because no demand actually occurred in those hours, so we fill the blanks with 0.
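One way to fill the padded rows:
sct <- sct %>%
  mutate(count = ifelse(is.na(count), 0, count))
# padr also offers fill_by_value(sct, count, value = 0) as an alternative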
*Check the missing values again to make sure
#> src_area start_time status count
#> 0 0 0 0
There are no missing values.
Exploratory Data Analysis
We will create a target variable named 'coverage'.
First we have to spread the data so we can see what actually happened based on the confirmed and nodrivers events, and classify each observation as insufficient or sufficient.
*Create a new column named 'coverage' with values 'sufficient' or 'insufficient'. The coverage column will be our target variable.
An observation with a "nodrivers" event (even just once) is classified as "insufficient", while an observation that is always confirmed (nodrivers = 0) is classified as "sufficient".
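A hedged sketch of the reshape and target creation, assuming 'sct' holds the padded hourly counts and that a 'nodrivers' column results from the spread:
sct_sp <- sct %>%
  spread(key = status, value = count) %>%   # one column per status
  mutate(coverage = as.factor(ifelse(nodrivers > 0, "insufficient", "sufficient")))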
*Feature engineering / variable selection
Extract the Month, Day, and Hour from 'start_time', then create a Peak_Hour column (7 <= Hour <= 20) and a Weekend column (Saturday and Sunday).
sct_sp2 <- sct_sp %>%
  mutate(Day = wday(start_time, label = TRUE, abbr = FALSE),   # day of week
         Hour = hour(start_time),
         Month = month(start_time),
         Peak_Hour = as.factor(ifelse(7 <= Hour & Hour <= 20, "peak", "nopeak")),
         Weekend = as.factor(ifelse(Day == "Sunday" | Day == "Saturday", "weekends", "weekdays")),
         src_area = as.factor(src_area),
         Month = as.factor(Month)) %>%
  select(start_time, src_area, Month, Day, Hour, Peak_Hour, Weekend, coverage)
sct_sp2
Check the Data Proportions
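One way to check the proportions (in percent), overall and per area:
prop.table(table(sct_sp2$coverage)) * 100
prop.table(table(sct_sp2$coverage, sct_sp2$src_area), margin = 2) * 100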
#>
#> insufficient sufficient
#> 54.10053 45.89947
#>
#> sxk3 sxk8 sxk9
#> insufficient 62.63228 17.06349 82.60582
#> sufficient 37.36772 82.93651 17.39418
Overall, insufficient and sufficient have an almost balanced proportion, but within each area the proportions are still not balanced.
*Correlation between the features and the target variable
Visualize using geom_tile from ggplot2. We want to see how Day, Hour, and area relate to the sufficient/insufficient outcome. Recode insufficient as '0' and sufficient as '100'.
Coverage: insufficient -> 0; sufficient -> 100
*Find the mean of coverage to gauge the level of influence
*Continue to geom_tile:
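A hedged sketch of the heatmap; the intermediate object and aesthetic mapping are assumptions about how the plot below was produced:
sct_tile <- sct_sp2 %>%
  mutate(cov_num = ifelse(coverage == "insufficient", 0, 100)) %>%
  group_by(src_area, Day, Hour) %>%
  summarise(mean_cov = mean(cov_num)) %>%
  ungroup()

ggplot(sct_tile, aes(x = Hour, y = Day, fill = mean_cov)) +
  geom_tile() +
  facet_wrap(~src_area) +
  labs(fill = "% sufficient")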
* We can see from the graph above that the sxk9 area has far more insufficient events than the other areas; there, the insufficient events happen almost all the time, day and night, as might be expected of a big-city area. By contrast, the sxk8 area mostly has sufficient events throughout the day and night.
Model Fitting & Evaluation
*Splitting Dataset
We will split the dataset using the 'initial_split()' function from the rsample package, with 75% for training and 25% for testing.
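A sketch of the split; the seed and object names are assumptions:
set.seed(100)                                    # assumed seed
splitted  <- initial_split(sct_sp2, prop = 0.75)
sct_train <- training(splitted)
sct_test  <- testing(splitted)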
*Check the class proportions of the train and test sets
#>
#> insufficient sufficient
#> 54.09932 45.90068
#>
#> insufficient sufficient
#> 54.10415 45.89585
*Downsample the train dataset to balance the class proportions before modeling.
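A hedged sketch using caret's downSample(); the report does not show which function was used, and the object name is an assumption:
set.seed(100)
sct_train_down <- downSample(x = sct_train %>% select(-coverage),
                             y = sct_train$coverage,
                             yname = "coverage")
prop.table(table(sct_train_down$coverage)) * 100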
#>
#> insufficient sufficient
#> 50 50
We now have balanced class proportions.
Building the Models
We will try several algorithms for this case. Scotty is a classification problem with character and factor columns. We will use Logistic Regression, Naive Bayes, Decision Tree, Random Forest, and Neural Network, fit each model using all variables in the dataset, and compare which one performs best.
Logistic Regression
*Creating the Model Using 'glm()'
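The fit matching the call in the summary below; the model object name 'model_glm' is an assumption, while 'sct_train2' (the balanced training set with the target stored as 'coverage2') follows the printed call:
model_glm <- glm(coverage2 ~ ., family = "binomial", data = sct_train2)
summary(model_glm)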
#>
#> Call:
#> glm(formula = coverage2 ~ ., family = "binomial", data = sct_train2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.72806 -0.72922 -0.03013 0.73964 2.43359
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.119e+02 9.153e+01 -3.408 0.000655 ***
#> start_time 2.068e-07 6.069e-08 3.407 0.000656 ***
#> src_areasxk8 2.429e+00 1.161e-01 20.925 < 2e-16 ***
#> src_areasxk9 -1.146e+00 1.096e-01 -10.453 < 2e-16 ***
#> Month11 3.020e-01 1.845e-01 1.637 0.101700
#> Month12 5.343e-01 3.642e-01 1.467 0.142376
#> Day.L -4.041e-01 1.247e-01 -3.241 0.001190 **
#> Day.Q -2.940e-01 1.209e-01 -2.432 0.015004 *
#> Day.C 6.494e-02 1.203e-01 0.540 0.589443
#> Day^4 5.200e-01 1.210e-01 4.298 1.73e-05 ***
#> Day^5 9.626e-02 1.204e-01 0.799 0.424094
#> Day^6 -4.513e-02 1.190e-01 -0.379 0.704402
#> Hour -9.693e-03 6.804e-03 -1.425 0.154299
#> Peak_Hourpeak -1.313e+00 1.011e-01 -12.988 < 2e-16 ***
#> Weekendweekends NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 4330.8 on 3123 degrees of freedom
#> Residual deviance: 2996.5 on 3110 degrees of freedom
#> AIC: 3024.5
#>
#> Number of Fisher Scoring iterations: 5
Interpretation: We can see from the summary above that logistic regression creates dummy variables from each factor column. The summary tells us that 'start_time', 'src_areasxk8', 'src_areasxk9', 'Day^4', and 'Peak_Hourpeak' have the strongest influence, indicated by the three-star significance codes.
*Prediction
*We will use a threshold of 0.5 for classifying an observation as insufficient.
*For the Scotty case, we focus on sensitivity/recall, because we want to capture as many insufficient events as possible so they can be prevented.
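A hedged sketch of the prediction and evaluation; which class the predicted probability refers to depends on the factor level order, which is not shown, so the label mapping and the test object names here are assumptions:
prob_glm <- predict(model_glm, newdata = sct_test, type = "response")
pred_glm <- as.factor(ifelse(prob_glm >= 0.5, "insufficient", "sufficient"))   # assumed mapping
confusionMatrix(pred_glm, sct_test$coverage, positive = "insufficient")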
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 106 389
#> sufficient 507 131
#>
#> Accuracy : 0.2092
#> 95% CI : (0.1858, 0.234)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : -0.5654
#>
#> Mcnemar's Test P-Value : 9.28e-05
#>
#> Sensitivity : 0.17292
#> Specificity : 0.25192
#> Pos Pred Value : 0.21414
#> Neg Pred Value : 0.20533
#> Prevalence : 0.54104
#> Detection Rate : 0.09356
#> Detection Prevalence : 0.43689
#> Balanced Accuracy : 0.21242
#>
#> 'Positive' Class : insufficient
#>
The result is far from good; we will compare it with other algorithms.
Naive Bayes
*Creating Model using ‘naiveBayes()’
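A sketch of the fit; naiveBayes() comes from the e1071 package, and dropping 'start_time' matches the conditional tables printed below (object names are assumptions):
model_naive <- naiveBayes(coverage2 ~ ., data = sct_train2 %>% select(-start_time))
model_naive$tables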
#> $src_area
#> src_area
#> Y sxk3 sxk8 sxk9
#> insufficient 0.4001280 0.1062740 0.4935980
#> sufficient 0.2765685 0.5966709 0.1267606
#>
#> $Month
#> Month
#> Y 10 11 12
#> insufficient 0.55057618 0.42061460 0.02880922
#> sufficient 0.42829706 0.53713188 0.03457106
#>
#> $Day
#> Day
#> Y Sunday Monday Tuesday Wednesday Thursday Friday
#> insufficient 0.1421255 0.1510883 0.1338028 0.1261204 0.1395647 0.1626120
#> sufficient 0.1453265 0.1389245 0.1504481 0.1638924 0.1504481 0.1177977
#> Day
#> Y Saturday
#> insufficient 0.1446863
#> sufficient 0.1331626
#>
#> $Hour
#> Hour
#> Y [,1] [,2]
#> insufficient 12.03969 6.602250
#> sufficient 10.81818 7.272687
#>
#> $Peak_Hour
#> Peak_Hour
#> Y nopeak peak
#> insufficient 0.3265045 0.6734955
#> sufficient 0.5358515 0.4641485
#>
#> $Weekend
#> Weekend
#> Y weekdays weekends
#> insufficient 0.7131882 0.2868118
#> sufficient 0.7215109 0.2784891
We want to know whether the model underfits or overfits, so we make predictions on both the test and train datasets and compare them.
pred_mnaive1 -> predictions on the test dataset; pred_fit_mnaive -> predictions on the train dataset
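A sketch of the two prediction vectors (the data object names are assumptions; the prediction names follow the report):
pred_mnaive1    <- predict(model_naive, newdata = sct_test)     # test-set predictions
pred_fit_mnaive <- predict(model_naive, newdata = sct_train2)   # train-set predictions
table(pred_mnaive1)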
#> pred_mnaive1
#> insufficient sufficient
#> 633 500
Checking the model's ability to predict each target class, we get 633 observations predicted as insufficient and 500 as sufficient.
*Model Validation
Let's set the positive class to "insufficient", because we want to use recall to capture more insufficient events and prevent them from happening.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 1273 391
#> sufficient 289 1171
#>
#> Accuracy : 0.7823
#> 95% CI : (0.7674, 0.7967)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.5647
#>
#> Mcnemar's Test P-Value : 0.0001074
#>
#> Sensitivity : 0.8150
#> Specificity : 0.7497
#> Pos Pred Value : 0.7650
#> Neg Pred Value : 0.8021
#> Prevalence : 0.5000
#> Detection Rate : 0.4075
#> Detection Prevalence : 0.5327
#> Balanced Accuracy : 0.7823
#>
#> 'Positive' Class : insufficient
#>
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 508 125
#> sufficient 105 395
#>
#> Accuracy : 0.797
#> 95% CI : (0.7724, 0.8201)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.59
#>
#> Mcnemar's Test P-Value : 0.2103
#>
#> Sensitivity : 0.8287
#> Specificity : 0.7596
#> Pos Pred Value : 0.8025
#> Neg Pred Value : 0.7900
#> Prevalence : 0.5410
#> Detection Rate : 0.4484
#> Detection Prevalence : 0.5587
#> Balanced Accuracy : 0.7942
#>
#> 'Positive' Class : insufficient
#>
Interpretation: We get accuracy 79.7%, sensitivity 82.9%, specificity 76.0%, and precision 80.3% on the test dataset. These are slightly higher than the corresponding metrics on the train dataset, which means the model is neither underfit nor overfit.
Decision Tree
*Creating the Model using 'ctree()' from the partykit package
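The fit matching the model formula printed below ('mtree1' is the name used later in the report; 'sct_train2' is assumed to be the balanced training set):
mtree1 <- ctree(coverage2 ~ src_area + Month + Day + Hour + Peak_Hour + Weekend,
                data = sct_train2)
mtree1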
#>
#> Model formula:
#> coverage2 ~ src_area + Month + Day + Hour + Peak_Hour + Weekend
#>
#> Fitted party:
#> [1] root
#> | [2] src_area in sxk3, sxk9
#> | | [3] Peak_Hour in nopeak
#> | | | [4] src_area in sxk3
#> | | | | [5] Hour <= 6
#> | | | | | [6] Hour <= 1: insufficient (n = 90, err = 47.8%)
#> | | | | | [7] Hour > 1: sufficient (n = 243, err = 23.0%)
#> | | | | [8] Hour > 6: insufficient (n = 128, err = 47.7%)
#> | | | [9] src_area in sxk9
#> | | | | [10] Month in 10: insufficient (n = 197, err = 13.2%)
#> | | | | [11] Month in 11, 12: insufficient (n = 224, err = 39.3%)
#> | | [12] Peak_Hour in peak
#> | | | [13] Month in 10
#> | | | | [14] src_area in sxk3: insufficient (n = 293, err = 18.4%)
#> | | | | [15] src_area in sxk9: insufficient (n = 260, err = 4.6%)
#> | | | [16] Month in 11, 12
#> | | | | [17] Day <= Thursday: insufficient (n = 390, err = 31.8%)
#> | | | | [18] Day > Thursday: insufficient (n = 201, err = 17.4%)
#> | [19] src_area in sxk8
#> | | [20] Peak_Hour in nopeak: sufficient (n = 465, err = 7.1%)
#> | | [21] Peak_Hour in peak
#> | | | [22] Month in 10, 12: sufficient (n = 342, err = 27.2%)
#> | | | [23] Month in 11: sufficient (n = 291, err = 13.7%)
#>
#> Number of inner nodes: 11
#> Number of terminal nodes: 12
*Let's plot the model and see how it looks.
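For example (type = "simple" keeps the labels compact; an optional choice):
plot(mtree1, type = "simple")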


We want to know whether the model underfits or overfits, so we make predictions on both the test and train datasets and compare them.
pred_mtree1 -> predictions on the test dataset; pred_fit_tree1 -> predictions on the train dataset
Set "insufficient" as the positive class; we use recall/sensitivity because we want to capture as many insufficient events as possible so they can be prevented.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 1340 443
#> sufficient 222 1119
#>
#> Accuracy : 0.7871
#> 95% CI : (0.7724, 0.8014)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.5743
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.8579
#> Specificity : 0.7164
#> Pos Pred Value : 0.7515
#> Neg Pred Value : 0.8345
#> Prevalence : 0.5000
#> Detection Rate : 0.4289
#> Detection Prevalence : 0.5707
#> Balanced Accuracy : 0.7871
#>
#> 'Positive' Class : insufficient
#>
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 536 141
#> sufficient 77 379
#>
#> Accuracy : 0.8076
#> 95% CI : (0.7834, 0.8302)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6089
#>
#> Mcnemar's Test P-Value : 1.982e-05
#>
#> Sensitivity : 0.8744
#> Specificity : 0.7288
#> Pos Pred Value : 0.7917
#> Neg Pred Value : 0.8311
#> Prevalence : 0.5410
#> Detection Rate : 0.4731
#> Detection Prevalence : 0.5975
#> Balanced Accuracy : 0.8016
#>
#> 'Positive' Class : insufficient
#>
From the confusion matrices above, we get Accuracy = 80.8%, Recall = 87.4%, Specificity = 72.9%, and Precision = 79.2% on the test dataset. This shows that the decision tree model performs well, with no sign of underfitting or overfitting.
Random Forest
*Creating the Model using the 'train()' function from the caret package
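A hedged sketch of the random forest fit via caret; the resampling scheme and tuning settings are assumptions, since the report does not show them:
set.seed(100)
model_rf <- train(coverage2 ~ src_area + Month + Day + Hour + Peak_Hour + Weekend,
                  data = sct_train2,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 5))   # assumed 5-fold CV
model_rf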
*Prediction
We want to know whether the model underfits or overfits, so we make predictions on both the test and train datasets and compare them.
pred_rf -> predictions on the test dataset; pred_fit_rf -> predictions on the train dataset
*Model Evaluation
Set "insufficient" as the positive class; we use recall/sensitivity because we want to capture as many insufficient events as possible so they can be prevented.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 1311 337
#> sufficient 251 1225
#>
#> Accuracy : 0.8118
#> 95% CI : (0.7976, 0.8254)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6236
#>
#> Mcnemar's Test P-Value : 0.000456
#>
#> Sensitivity : 0.8393
#> Specificity : 0.7843
#> Pos Pred Value : 0.7955
#> Neg Pred Value : 0.8299
#> Prevalence : 0.5000
#> Detection Rate : 0.4197
#> Detection Prevalence : 0.5275
#> Balanced Accuracy : 0.8118
#>
#> 'Positive' Class : insufficient
#>
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 515 126
#> sufficient 98 394
#>
#> Accuracy : 0.8023
#> 95% CI : (0.7779, 0.8251)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.6003
#>
#> Mcnemar's Test P-Value : 0.07123
#>
#> Sensitivity : 0.8401
#> Specificity : 0.7577
#> Pos Pred Value : 0.8034
#> Neg Pred Value : 0.8008
#> Prevalence : 0.5410
#> Detection Rate : 0.4545
#> Detection Prevalence : 0.5658
#> Balanced Accuracy : 0.7989
#>
#> 'Positive' Class : insufficient
#>
Interpretation: We get accuracy 80.2%, sensitivity 84.0%, specificity 75.8%, and precision 80.3% on the test dataset. Accuracy and specificity are slightly lower than on the train dataset, which points to a mild tendency to overfit rather than underfit, although the gap is small.
Neural Network
Feature Engineering
For the neural network, all variables have to be numeric. We therefore convert some variables to numeric types and create dummy variables; the start_time column has to be dropped as well.
Change the target variable (coverage) to numeric and recode it as '0' and '1'.
0 -> insufficient; 1 -> sufficient
*Create dummy variables
*Convert to a data frame
*Convert to a matrix
*Separate the x variables from the y (target) variable in preparation for one-hot encoding
*Convert to an array (a sketch of these steps follows below)
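A hedged sketch of these steps; the exact dummy-variable encoding and scaling behind the 13-column matrix previewed below are not shown, so the column construction here is illustrative only:
# target to numeric: 0 = insufficient, 1 = sufficient
train_y <- as.numeric(sct_train2$coverage2 == "sufficient")
test_y  <- as.numeric(sct_test$coverage == "sufficient")

# dummy variables for the factor predictors (start_time dropped)
form_x  <- ~ src_area + Month + Day + Hour + Peak_Hour + Weekend - 1
train_x <- model.matrix(form_x, data = sct_train2)
test_x  <- model.matrix(form_x, data = sct_test)

# matrix -> array, the input format expected by keras
train_x_keras <- array_reshape(train_x, dim(train_x))
test_x_keras  <- array_reshape(test_x,  dim(test_x))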
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 0 0 1 0 -5.669467e-01 5.455447e-01 -4.082483e-01
#> [2,] 0 1 0 0 1.889822e-01 -3.273268e-01 -4.082483e-01
#> [3,] 0 1 0 0 -3.779645e-01 9.690821e-17 4.082483e-01
#> [4,] 1 0 0 0 3.779645e-01 0.000000e+00 -4.082483e-01
#> [5,] 0 1 0 0 1.889822e-01 -3.273268e-01 -4.082483e-01
#> [6,] 0 1 0 0 2.098124e-17 -4.364358e-01 3.021644e-17
#> [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 0.2417469 -1.091089e-01 0.03289758 3 0 1
#> [2,] 0.0805823 5.455447e-01 0.49346377 7 1 0
#> [3,] -0.5640761 4.364358e-01 -0.19738551 8 1 0
#> [4,] -0.5640761 -4.364358e-01 -0.19738551 13 1 0
#> [5,] 0.0805823 5.455447e-01 0.49346377 21 0 0
#> [6,] 0.4834938 -9.751389e-16 -0.65795169 2 0 0
*One-Hot Encoding
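With keras' to_categorical():
train_y_keras <- to_categorical(train_y, num_classes = 2)
test_y_keras  <- to_categorical(test_y,  num_classes = 2)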
Creating Neural Network Architecture
*Neural Network sequencing
- Create the neural network model:
activation at the hidden layers = 'relu'
activation at the output layer = 'sigmoid', because the Scotty case is binary
input shape = ncol of the train keras dataset
units = 64, 32, and 2 (see the sketch below)
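A sketch matching the summary below; the 896 parameters of hidden_1 imply 13 input columns (13 x 64 + 64), so the input shape is taken from the training matrix. Layer names follow the printed summary; everything else is inferred:
model_nn <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = ncol(train_x_keras), name = "hidden_1") %>%
  layer_dense(units = 32, activation = "relu", name = "hidden_2") %>%
  layer_dense(units = 2, activation = "sigmoid")   # output layer (2 one-hot classes)
summary(model_nn)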
#> Model: "sequential"
#> ___________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ===========================================================================
#> hidden_1 (Dense) (None, 64) 896
#> ___________________________________________________________________________
#> hidden_2 (Dense) (None, 32) 2080
#> ___________________________________________________________________________
#> dense (Dense) (None, 2) 66
#> ===========================================================================
#> Total params: 3,042
#> Trainable params: 3,042
#> Non-trainable params: 0
#> ___________________________________________________________________________
*Compile
Gather the neural network architecture and compile it with a learning rate of 0.001.
*Train the model using the 'fit()' function (see the sketch below).
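A hedged sketch of the compile and fit steps; only the learning rate (0.001) is stated in the report, so the optimizer, loss, epochs, and batch size are assumptions:
model_nn %>% compile(
  optimizer = optimizer_adam(lr = 0.001),   # 'lr' is called 'learning_rate' in newer keras versions
  loss      = "binary_crossentropy",        # assumed loss for the one-hot binary target
  metrics   = "accuracy"
)

history <- model_nn %>% fit(
  x = train_x_keras, y = train_y_keras,
  epochs = 10, batch_size = 32,             # assumed values
  validation_data = list(test_x_keras, test_y_keras)
)
plot(history)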
Interpretation: From the training history we get about 71% accuracy.
*Prediction
Data Validation
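A sketch of the decoding and validation; mapping the higher predicted probability back to the original labels is an assumption about how the report converts the one-hot output:
prob_nn <- predict(model_nn, test_x_keras)      # matrix of class probabilities (col 1 = insufficient)
pred_nn <- factor(ifelse(max.col(prob_nn) == 1, "insufficient", "sufficient"),
                  levels = c("insufficient", "sufficient"))
confusionMatrix(pred_nn, sct_test$coverage, positive = "insufficient")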
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 1267 450
#> sufficient 295 1112
#>
#> Accuracy : 0.7615
#> 95% CI : (0.7462, 0.7764)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.523
#>
#> Mcnemar's Test P-Value : 1.68e-08
#>
#> Sensitivity : 0.8111
#> Specificity : 0.7119
#> Pos Pred Value : 0.7379
#> Neg Pred Value : 0.7903
#> Prevalence : 0.5000
#> Detection Rate : 0.4056
#> Detection Prevalence : 0.5496
#> Balanced Accuracy : 0.7615
#>
#> 'Positive' Class : insufficient
#>
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 506 144
#> sufficient 107 376
#>
#> Accuracy : 0.7785
#> 95% CI : (0.7531, 0.8023)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.5515
#>
#> Mcnemar's Test P-Value : 0.02307
#>
#> Sensitivity : 0.8254
#> Specificity : 0.7231
#> Pos Pred Value : 0.7785
#> Neg Pred Value : 0.7785
#> Prevalence : 0.5410
#> Detection Rate : 0.4466
#> Detection Prevalence : 0.5737
#> Balanced Accuracy : 0.7743
#>
#> 'Positive' Class : insufficient
#>
Interpretation: From the model validation above, we get accuracy 77.9%, sensitivity 82.5%, specificity 72.3%, and precision 77.9% on the test dataset, slightly higher than on the train dataset, so the model does not appear to be overfit. Its performance, however, is still below the decision tree model.
Conclusion
So far the best model is the one built with the Decision Tree algorithm, with Accuracy = 80.8%, Recall = 87.4%, Specificity = 72.9%, and Precision = 79.2%. The 'mtree1' model also generalizes well, since we find no sign of overfitting or underfitting.
We continue with the Decision Tree model (mtree1) for the data submission.
Data Submission
*Read Data
Read the submission data and store it in an object named 'submission'.
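A sketch, assuming the template is stored at 'data/data-submission.csv' (the actual path is not shown):
submission <- read_csv("data/data-submission.csv")   # assumed file name
glimpse(submission)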
Here we find 3 columns: src_area, datetime, and coverage. We have to fill the coverage column and transform the columns so they match our training dataset.
*Create the same new columns as in the previous steps, then save the result into a new object, 'sub'.
sub <- submission %>%
  mutate(Day = wday(datetime, label = TRUE, abbr = FALSE),
         Hour = hour(datetime),
         Month = month(datetime),
         Peak_Hour = as.factor(ifelse(7 <= Hour & Hour <= 20, "peak", "nopeak")),
         Weekend = as.factor(ifelse(Day == "Sunday" | Day == "Saturday", "weekends", "weekdays")),
         src_area = as.factor(src_area),
         Month = as.factor(Month),
         start_time = datetime) %>%
  select(start_time, src_area, Month, Day, Hour, Peak_Hour, Weekend, coverage)
sub
Looks good! It now has the same structure as our training data, but we still have to fill the coverage column with the predictions from the mtree1 model.
*Prediction using the Decision Tree model (mtree1)
Make a new object to store the prediction results and name it 'pred_sub'.
*Then put the prediction results into the 'coverage' column of the 'sub' dataset, as sketched below.
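For example:
pred_sub <- predict(mtree1, newdata = sub)   # decision tree predictions for every area-hour
sub$coverage <- pred_sub
head(sub)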
Looks good! The coverage column has now been filled with the prediction results.
*Next we need to convert it back to the original format, which consists of only 3 columns: src_area, datetime, and coverage. Save it under the name 'sub2'.
*Finally, create a .csv file from the 'sub2' object and name it 'submit.csv'.
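For example, restoring the original three columns before writing the file:
sub2 <- sub %>%
  mutate(datetime = start_time) %>%
  select(src_area, datetime, coverage)
write.csv(sub2, "submit.csv", row.names = FALSE)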
Later on, the 'submit.csv' file will be uploaded to the leaderboard for submission.