
Intro
Greetings
This is the Machine Learning Capstone Project of Widya Kania Rahayu.
Irish Night - Class B.
Dataset : Scotty-Classification.
Content
Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey's citizens and places a high value on traveling efficiently through traffic. The app even references Star Trek's "beam me up" in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve a classification problem in order to improve their business processes. Demand for Scotty began to overload in some regions at some times, and there were not enough drivers at those times and places. Fortunately, we know that we can use a classification model to predict which regions and hours are at risk of this "no drivers" problem.
The train dataset contains detailed transaction records from October 1st 2017 to December 2nd 2017. The dataset includes the following columns:
id: Transaction id
trip_id: Trip id
driver_id: Driver id
rider_id: Rider id
start_time: Trip start time
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in KM)
status: Trip status (every status is considered a demand)
confirmed_time_sec: Time difference from request to confirmation (in seconds)
Purpose:
Create a classification model whose predictions will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). The predictions should cover the coverage status, "sufficient" or "insufficient", for each hour and each area.
Load Required Libraries
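The report does not list the packages explicitly; below is a minimal sketch of the library calls, inferred from the functions used in the following sections.
library(dplyr)      # data wrangling (mutate, group_by, select)
library(readr)      # read_csv()
library(tidyr)      # spread() for reshaping
library(lubridate)  # floor_date(), wday(), hour(), month()
library(padr)       # pad() for completing the hourly series
library(ggplot2)    # visualization (geom_tile)
library(rsample)    # initial_split()
library(caret)      # downSample(), train(), confusionMatrix()
library(e1071)      # naiveBayes()
library(partykit)   # ctree()
library(keras)      # neural network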
Read Data
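A hedged sketch of the import, assuming the training file is stored at 'data/data-train.csv' (the actual path is not shown in the report):
scotty <- read_csv("data/data-train.csv")   # assumed file name; read_csv parses start_time as a dttm
glimpse(scotty)
dim(scotty)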
#> Observations: 229,532
#> Variables: 16
#> $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b8...
#> $ trip_id <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa2...
#> $ driver_id <chr> "59a892c5568be44b2734f276", "59a135565e88a2...
#> $ rider_id <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b8...
#> $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, ...
#> $ src_lat <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41....
#> $ src_lon <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28....
#> $ src_area <chr> "sxk9", "sxk8", "sxk9", "sxk9", "sxk9", "sx...
#> $ src_sub_area <chr> "sxk9s", "sxk8y", "sxk9e", "sxk9s", "sxk9e"...
#> $ dest_lat <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41....
#> $ dest_lon <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28....
#> $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sx...
#> $ dest_sub_area <chr> "sxk9u", "sxk9s", "sxk9e", "sxk9e", "sxk9e"...
#> $ distance <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3....
#> $ status <chr> "confirmed", "confirmed", "nodrivers", "con...
#> $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 4...
#> [1] 229532 16
The dataset has 16 columns and 229,532 observations. However, it does not yet have a target variable, and the 'start_time' column contains two pieces of information: the date and the time.
Data Preprocess
The time information inside 'start_time' still contains minutes and seconds. It needs to be rounded down to the hourly level; we will use the 'floor_date()' function from the lubridate package.
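A minimal sketch, assuming the raw data is stored in an object named 'scotty':
scotty <- scotty %>%
  mutate(start_time = floor_date(start_time, unit = "hour"))  # round down to the hour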
We can see that the 'start_time' column is now grouped at the hourly level.
*Check how many areas we have
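One way to list the distinct areas:
unique(scotty$src_area)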
#> [1] "sxk9" "sxk8" "sxk3"
We have 3 areas : sxk9, sxk8 and sxk3.
- Next, we have to check for NA values
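A quick per-column count of missing values:
colSums(is.na(scotty))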
#> id trip_id driver_id
#> 0 14901 14900
#> rider_id start_time src_lat
#> 0 0 0
#> src_lon src_area src_sub_area
#> 0 0 0
#> dest_lat dest_lon dest_area
#> 0 0 0
#> dest_sub_area distance status
#> 0 0 0
#> confirmed_time_sec
#> 0
Columns trip_id and driver_id contain NA values. Since we will not be using them, we can continue to the next step.
*Next, we will group the data by area, start_time, and status (following the submission dataset). We also need to check whether the timestamps are ordered and complete on an hourly basis, as sketched below.
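A hedged sketch of the aggregation; the object name 'sct_agg' is an assumption:
sct_agg <- scotty %>%
  group_by(src_area, start_time, status) %>%
  summarise(count = n()) %>%       # demand per area, hour, and status
  ungroup() %>%
  arrange(src_area, start_time)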
From the data above we can see that the observations are not complete for every hour: some hours are missing, and we have to fill them in before continuing. To make a complete time series, we can use the 'pad()' function from the padr package.
- Pad the grouped data by area using 'pad()' and assign the result to a new object, 'sct'.
First, define the min and max dates of the Scotty dataset.
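For example:
sct_min <- min(sct_agg$start_time)
sct_max <- max(sct_agg$start_time)
sct_min
sct_max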
#> [1] "2017-10-01 UTC"
#> [1] "2017-12-02 23:00:00 UTC"
Continue to pad():
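A sketch of the padding, grouped by area and status so that every hour appears for each combination (padr detects the hourly interval automatically):
sct <- sct_agg %>%
  pad(group = c("src_area", "status"),
      start_val = sct_min, end_val = sct_max)   # insert the missing hourly rows per group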
We now have new rows, ordered by hour and status. The pad() function fills in the blank rows automatically based on the range defined above; next we need to fill the 'count' column with 0.
*Fill the NA values with '0'
Why 0? Because no demand actually occurred in those hours, so we fill the blanks with 0.
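One way to fill the padded rows:
sct <- sct %>%
  mutate(count = ifelse(is.na(count), 0, count))
# padr also offers fill_by_value(sct, count, value = 0) as an alternative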
*Check the missing values again to make sure
#> src_area start_time status count
#> 0 0 0 0
There are no missing values.
Exploratory Data Analysis
We will create a target variable named 'coverage'.
First we have to spread the data so we can see what actually happened based on the confirmed and nodrivers events, and classify each observation as insufficient or sufficient.
*Create a new column named 'coverage' with values 'sufficient' or 'insufficient'. The coverage column will be our target variable.
An observation with a "nodrivers" event (even just once) is classified as "insufficient", while an observation that is always confirmed (nodrivers = 0) is classified as "sufficient".
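A hedged sketch of the reshape and target creation, assuming 'sct' holds the padded hourly counts and that a 'nodrivers' column results from the spread:
sct_sp <- sct %>%
  spread(key = status, value = count) %>%   # one column per status
  mutate(coverage = as.factor(ifelse(nodrivers > 0, "insufficient", "sufficient")))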
*Feature engineering / variable selection
Extract the Month, Day, and Hour from 'start_time', then create a Peak_Hour column (7 <= Hour <= 20) and a Weekend column (Saturday and Sunday).
sct_sp2 <- sct_sp %>%
  mutate(Day = wday(start_time, label = TRUE, abbr = FALSE),   # day of week
         Hour = hour(start_time),
         Month = month(start_time),
         Peak_Hour = as.factor(ifelse(7 <= Hour & Hour <= 20, "peak", "nopeak")),
         Weekend = as.factor(ifelse(Day == "Sunday" | Day == "Saturday", "weekends", "weekdays")),
         src_area = as.factor(src_area),
         Month = as.factor(Month)) %>%
  select(start_time, src_area, Month, Day, Hour, Peak_Hour, Weekend, coverage)
sct_sp2
Check the Data Proportions
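One way to check the proportions (in percent), overall and per area:
prop.table(table(sct_sp2$coverage)) * 100
prop.table(table(sct_sp2$coverage, sct_sp2$src_area), margin = 2) * 100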
#>
#> insufficient sufficient
#> 54.10053 45.89947
#>
#> sxk3 sxk8 sxk9
#> insufficient 62.63228 17.06349 82.60582
#> sufficient 37.36772 82.93651 17.39418
Overall, insufficient and sufficient have an almost balanced proportion, but within each area the proportions are still not balanced.
*Correlation between the features and the target variable
Visualize using geom_tile from ggplot2. We want to see how Day, Hour, and area relate to the sufficient/insufficient outcome. Recode insufficient as '0' and sufficient as '100'.
Coverage: insufficient -> 0; sufficient -> 100
*Find the mean of coverage to gauge the level of influence
*Continue to geom_tile:
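A hedged sketch of the heatmap; the intermediate object and aesthetic mapping are assumptions about how the plot below was produced:
sct_tile <- sct_sp2 %>%
  mutate(cov_num = ifelse(coverage == "insufficient", 0, 100)) %>%
  group_by(src_area, Day, Hour) %>%
  summarise(mean_cov = mean(cov_num)) %>%
  ungroup()

ggplot(sct_tile, aes(x = Hour, y = Day, fill = mean_cov)) +
  geom_tile() +
  facet_wrap(~src_area) +
  labs(fill = "% sufficient")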
* We can see from the graph above that the sxk9 area has far more insufficient events than the other areas; there, the insufficient events happen almost all the time, day and night, as might be expected of a big-city area. By contrast, the sxk8 area mostly has sufficient events throughout the day and night.
Model Fitting & Evaluation
*Splitting Dataset
We will split the dataset using the 'initial_split()' function from the rsample package, with 75% for training and 25% for testing.
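A sketch of the split; the seed and object names are assumptions:
set.seed(100)                                    # assumed seed
splitted  <- initial_split(sct_sp2, prop = 0.75)
sct_train <- training(splitted)
sct_test  <- testing(splitted)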
*Check the class proportions of the train and test sets
#>
#> insufficient sufficient
#> 54.09932 45.90068
#>
#> insufficient sufficient
#> 54.10415 45.89585
*Downsample the train dataset to balance the class proportions before modeling.
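A hedged sketch using caret's downSample(); the report does not show which function was used, and the object name is an assumption:
set.seed(100)
sct_train_down <- downSample(x = sct_train %>% select(-coverage),
                             y = sct_train$coverage,
                             yname = "coverage")
prop.table(table(sct_train_down$coverage)) * 100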
#>
#> insufficient sufficient
#> 50 50
We now have balanced class proportions.
Building the Models
We will try several algorithms for this case. Scotty is a classification problem with character and factor columns. We will use Logistic Regression, Naive Bayes, Decision Tree, Random Forest, and Neural Network, fit each model using all variables in the dataset, and compare which one performs best.
Logistic Regression
*Creating the Model Using 'glm()'
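The fit matching the call in the summary below; the model object name 'model_glm' is an assumption, while 'sct_train2' (the balanced training set with the target stored as 'coverage2') follows the printed call:
model_glm <- glm(coverage2 ~ ., family = "binomial", data = sct_train2)
summary(model_glm)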
#>
#> Call:
#> glm(formula = coverage2 ~ ., family = "binomial", data = sct_train2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.72806 -0.72922 -0.03013 0.73964 2.43359
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.119e+02 9.153e+01 -3.408 0.000655 ***
#> start_time 2.068e-07 6.069e-08 3.407 0.000656 ***
#> src_areasxk8 2.429e+00 1.161e-01 20.925 < 2e-16 ***
#> src_areasxk9 -1.146e+00 1.096e-01 -10.453 < 2e-16 ***
#> Month11 3.020e-01 1.845e-01 1.637 0.101700
#> Month12 5.343e-01 3.642e-01 1.467 0.142376
#> Day.L -4.041e-01 1.247e-01 -3.241 0.001190 **
#> Day.Q -2.940e-01 1.209e-01 -2.432 0.015004 *
#> Day.C 6.494e-02 1.203e-01 0.540 0.589443
#> Day^4 5.200e-01 1.210e-01 4.298 1.73e-05 ***
#> Day^5 9.626e-02 1.204e-01 0.799 0.424094
#> Day^6 -4.513e-02 1.190e-01 -0.379 0.704402
#> Hour -9.693e-03 6.804e-03 -1.425 0.154299
#> Peak_Hourpeak -1.313e+00 1.011e-01 -12.988 < 2e-16 ***
#> Weekendweekends NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 4330.8 on 3123 degrees of freedom
#> Residual deviance: 2996.5 on 3110 degrees of freedom
#> AIC: 3024.5
#>
#> Number of Fisher Scoring iterations: 5
Interpretation: We can see from the summary above that logistic regression creates dummy variables from each factor column. The summary tells us that 'start_time', 'src_areasxk8', 'src_areasxk9', 'Day^4', and 'Peak_Hourpeak' have the strongest influence, indicated by the three-star significance codes.
*Prediction
*We will use a threshold of 0.5 for classifying an observation as insufficient.
*For the Scotty case, we focus on sensitivity/recall, because we want to capture as many insufficient events as possible so they can be prevented.
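A hedged sketch of the prediction and evaluation; which class the predicted probability refers to depends on the factor level order, which is not shown, so the label mapping and the test object names here are assumptions:
prob_glm <- predict(model_glm, newdata = sct_test, type = "response")
pred_glm <- as.factor(ifelse(prob_glm >= 0.5, "insufficient", "sufficient"))   # assumed mapping
confusionMatrix(pred_glm, sct_test$coverage, positive = "insufficient")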
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 106 389
#> sufficient 507 131
#>
#> Accuracy : 0.2092
#> 95% CI : (0.1858, 0.234)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : -0.5654
#>
#> Mcnemar's Test P-Value : 9.28e-05
#>
#> Sensitivity : 0.17292
#> Specificity : 0.25192
#> Pos Pred Value : 0.21414
#> Neg Pred Value : 0.20533
#> Prevalence : 0.54104
#> Detection Rate : 0.09356
#> Detection Prevalence : 0.43689
#> Balanced Accuracy : 0.21242
#>
#> 'Positive' Class : insufficient
#>
The result is far from good; we will compare it with other algorithms.
Naive Bayes
*Creating Model using ‘naiveBayes()’
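A sketch of the fit; naiveBayes() comes from the e1071 package, and dropping 'start_time' matches the conditional tables printed below (object names are assumptions):
model_naive <- naiveBayes(coverage2 ~ ., data = sct_train2 %>% select(-start_time))
model_naive$tables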
#> $src_area
#> src_area
#> Y sxk3 sxk8 sxk9
#> insufficient 0.4001280 0.1062740 0.4935980
#> sufficient 0.2765685 0.5966709 0.1267606
#>
#> $Month
#> Month
#> Y 10 11 12
#> insufficient 0.55057618 0.42061460 0.02880922
#> sufficient 0.42829706 0.53713188 0.03457106
#>
#> $Day
#> Day
#> Y Sunday Monday Tuesday Wednesday Thursday Friday
#> insufficient 0.1421255 0.1510883 0.1338028 0.1261204 0.1395647 0.1626120
#> sufficient 0.1453265 0.1389245 0.1504481 0.1638924 0.1504481 0.1177977
#> Day
#> Y Saturday
#> insufficient 0.1446863
#> sufficient 0.1331626
#>
#> $Hour
#> Hour
#> Y [,1] [,2]
#> insufficient 12.03969 6.602250
#> sufficient 10.81818 7.272687
#>
#> $Peak_Hour
#> Peak_Hour
#> Y nopeak peak
#> insufficient 0.3265045 0.6734955
#> sufficient 0.5358515 0.4641485
#>
#> $Weekend
#> Weekend
#> Y weekdays weekends
#> insufficient 0.7131882 0.2868118
#> sufficient 0.7215109 0.2784891
We want to know whether the model underfits or overfits, so we make predictions on both the test and train datasets and compare them.
pred_mnaive1 -> predictions on the test dataset; pred_fit_mnaive -> predictions on the train dataset
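A sketch of the two prediction vectors (the data object names are assumptions; the prediction names follow the report):
pred_mnaive1    <- predict(model_naive, newdata = sct_test)     # test-set predictions
pred_fit_mnaive <- predict(model_naive, newdata = sct_train2)   # train-set predictions
table(pred_mnaive1)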
#> pred_mnaive1
#> insufficient sufficient
#> 633 500
Checking the model's ability to predict each target class, we get 633 observations predicted as insufficient and 500 as sufficient.
*Model Validation
Let's set the positive class to "insufficient", because we want to use recall to capture more insufficient events and prevent them from happening.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 1273 391
#> sufficient 289 1171
#>
#> Accuracy : 0.7823
#> 95% CI : (0.7674, 0.7967)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.5647
#>
#> Mcnemar's Test P-Value : 0.0001074
#>
#> Sensitivity : 0.8150
#> Specificity : 0.7497
#> Pos Pred Value : 0.7650
#> Neg Pred Value : 0.8021
#> Prevalence : 0.5000
#> Detection Rate : 0.4075
#> Detection Prevalence : 0.5327
#> Balanced Accuracy : 0.7823
#>
#> 'Positive' Class : insufficient
#>
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 508 125
#> sufficient 105 395
#>
#> Accuracy : 0.797
#> 95% CI : (0.7724, 0.8201)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.59
#>
#> Mcnemar's Test P-Value : 0.2103
#>
#> Sensitivity : 0.8287
#> Specificity : 0.7596
#> Pos Pred Value : 0.8025
#> Neg Pred Value : 0.7900
#> Prevalence : 0.5410
#> Detection Rate : 0.4484
#> Detection Prevalence : 0.5587
#> Balanced Accuracy : 0.7942
#>
#> 'Positive' Class : insufficient
#>
Interpretation: We get accuracy 79.7%, sensitivity 82.9%, specificity 76.0%, and precision 80.3% on the test dataset. These are slightly higher than the corresponding metrics on the train dataset, which means the model is neither underfit nor overfit.
Decision Tree
*Creating the Model using 'ctree()' from the partykit package
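The fit matching the model formula printed below ('mtree1' is the name used later in the report; 'sct_train2' is assumed to be the balanced training set):
mtree1 <- ctree(coverage2 ~ src_area + Month + Day + Hour + Peak_Hour + Weekend,
                data = sct_train2)
mtree1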
#>
#> Model formula:
#> coverage2 ~ src_area + Month + Day + Hour + Peak_Hour + Weekend
#>
#> Fitted party:
#> [1] root
#> | [2] src_area in sxk3, sxk9
#> | | [3] Peak_Hour in nopeak
#> | | | [4] src_area in sxk3
#> | | | | [5] Hour <= 6
#> | | | | | [6] Hour <= 1: insufficient (n = 90, err = 47.8%)
#> | | | | | [7] Hour > 1: sufficient (n = 243, err = 23.0%)
#> | | | | [8] Hour > 6: insufficient (n = 128, err = 47.7%)
#> | | | [9] src_area in sxk9
#> | | | | [10] Month in 10: insufficient (n = 197, err = 13.2%)
#> | | | | [11] Month in 11, 12: insufficient (n = 224, err = 39.3%)
#> | | [12] Peak_Hour in peak
#> | | | [13] Month in 10
#> | | | | [14] src_area in sxk3: insufficient (n = 293, err = 18.4%)
#> | | | | [15] src_area in sxk9: insufficient (n = 260, err = 4.6%)
#> | | | [16] Month in 11, 12
#> | | | | [17] Day <= Thursday: insufficient (n = 390, err = 31.8%)
#> | | | | [18] Day > Thursday: insufficient (n = 201, err = 17.4%)
#> | [19] src_area in sxk8
#> | | [20] Peak_Hour in nopeak: sufficient (n = 465, err = 7.1%)
#> | | [21] Peak_Hour in peak
#> | | | [22] Month in 10, 12: sufficient (n = 342, err = 27.2%)
#> | | | [23] Month in 11: sufficient (n = 291, err = 13.7%)
#>
#> Number of inner nodes: 11
#> Number of terminal nodes: 12
*Let's plot the model and see how it looks.
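For example (type = "simple" keeps the labels compact; an optional choice):
plot(mtree1, type = "simple")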


We want to know whether the model underfits or overfits, so we make predictions on both the test and train datasets and compare them.
pred_mtree1 -> predictions on the test dataset; pred_fit_tree1 -> predictions on the train dataset
Set "insufficient" as the positive class; we use recall/sensitivity because we want to capture as many insufficient events as possible so they can be prevented.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 1340 443
#> sufficient 222 1119
#>
#> Accuracy : 0.7871
#> 95% CI : (0.7724, 0.8014)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.5743
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.8579
#> Specificity : 0.7164
#> Pos Pred Value : 0.7515
#> Neg Pred Value : 0.8345
#> Prevalence : 0.5000
#> Detection Rate : 0.4289
#> Detection Prevalence : 0.5707
#> Balanced Accuracy : 0.7871
#>
#> 'Positive' Class : insufficient
#>
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 536 141
#> sufficient 77 379
#>
#> Accuracy : 0.8076
#> 95% CI : (0.7834, 0.8302)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6089
#>
#> Mcnemar's Test P-Value : 1.982e-05
#>
#> Sensitivity : 0.8744
#> Specificity : 0.7288
#> Pos Pred Value : 0.7917
#> Neg Pred Value : 0.8311
#> Prevalence : 0.5410
#> Detection Rate : 0.4731
#> Detection Prevalence : 0.5975
#> Balanced Accuracy : 0.8016
#>
#> 'Positive' Class : insufficient
#>
From the confusion matrices above, we get Accuracy = 80.8%, Recall = 87.4%, Specificity = 72.9%, and Precision = 79.2% on the test dataset. This shows that the decision tree model performs well, with no sign of underfitting or overfitting.
Random Forest
*Creating the Model using the 'train()' function from the caret package
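A hedged sketch of the random forest fit via caret; the resampling scheme and tuning settings are assumptions, since the report does not show them:
set.seed(100)
model_rf <- train(coverage2 ~ src_area + Month + Day + Hour + Peak_Hour + Weekend,
                  data = sct_train2,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 5))   # assumed 5-fold CV
model_rf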
*Prediction
We want to know whether the model underfits or overfits, so we make predictions on both the test and train datasets and compare them.
pred_rf -> predictions on the test dataset; pred_fit_rf -> predictions on the train dataset
*Model Evaluation
Set "insufficient" as the positive class; we use recall/sensitivity because we want to capture as many insufficient events as possible so they can be prevented.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 1311 337
#> sufficient 251 1225
#>
#> Accuracy : 0.8118
#> 95% CI : (0.7976, 0.8254)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6236
#>
#> Mcnemar's Test P-Value : 0.000456
#>
#> Sensitivity : 0.8393
#> Specificity : 0.7843
#> Pos Pred Value : 0.7955
#> Neg Pred Value : 0.8299
#> Prevalence : 0.5000
#> Detection Rate : 0.4197
#> Detection Prevalence : 0.5275
#> Balanced Accuracy : 0.8118
#>
#> 'Positive' Class : insufficient
#>
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 515 126
#> sufficient 98 394
#>
#> Accuracy : 0.8023
#> 95% CI : (0.7779, 0.8251)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.6003
#>
#> Mcnemar's Test P-Value : 0.07123
#>
#> Sensitivity : 0.8401
#> Specificity : 0.7577
#> Pos Pred Value : 0.8034
#> Neg Pred Value : 0.8008
#> Prevalence : 0.5410
#> Detection Rate : 0.4545
#> Detection Prevalence : 0.5658
#> Balanced Accuracy : 0.7989
#>
#> 'Positive' Class : insufficient
#>
Interpretation: We get accuracy 80.2%, sensitivity 84.0%, specificity 75.8%, and precision 80.3% on the test dataset. Accuracy and specificity are slightly lower than on the train dataset, which points to a mild tendency to overfit rather than underfit, although the gap is small.
Neural Network
Feature Engineering
For the neural network, all variables have to be numeric. We therefore convert some variables to numeric types and create dummy variables; the start_time column has to be dropped as well.
Change the target variable (coverage) to numeric and recode it as '0' and '1'.
0 -> insufficient; 1 -> sufficient
*Create dummy variables
*Convert to a data frame
*Convert to a matrix
*Separate the x variables from the y (target) variable in preparation for one-hot encoding
*Convert to an array (a sketch of these steps follows below)
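A hedged sketch of these steps; the exact dummy-variable encoding and scaling behind the 13-column matrix previewed below are not shown, so the column construction here is illustrative only:
# target to numeric: 0 = insufficient, 1 = sufficient
train_y <- as.numeric(sct_train2$coverage2 == "sufficient")
test_y  <- as.numeric(sct_test$coverage == "sufficient")

# dummy variables for the factor predictors (start_time dropped)
form_x  <- ~ src_area + Month + Day + Hour + Peak_Hour + Weekend - 1
train_x <- model.matrix(form_x, data = sct_train2)
test_x  <- model.matrix(form_x, data = sct_test)

# matrix -> array, the input format expected by keras
train_x_keras <- array_reshape(train_x, dim(train_x))
test_x_keras  <- array_reshape(test_x,  dim(test_x))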
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 0 0 1 0 -5.669467e-01 5.455447e-01 -4.082483e-01
#> [2,] 0 1 0 0 1.889822e-01 -3.273268e-01 -4.082483e-01
#> [3,] 0 1 0 0 -3.779645e-01 9.690821e-17 4.082483e-01
#> [4,] 1 0 0 0 3.779645e-01 0.000000e+00 -4.082483e-01
#> [5,] 0 1 0 0 1.889822e-01 -3.273268e-01 -4.082483e-01
#> [6,] 0 1 0 0 2.098124e-17 -4.364358e-01 3.021644e-17
#> [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 0.2417469 -1.091089e-01 0.03289758 3 0 1
#> [2,] 0.0805823 5.455447e-01 0.49346377 7 1 0
#> [3,] -0.5640761 4.364358e-01 -0.19738551 8 1 0
#> [4,] -0.5640761 -4.364358e-01 -0.19738551 13 1 0
#> [5,] 0.0805823 5.455447e-01 0.49346377 21 0 0
#> [6,] 0.4834938 -9.751389e-16 -0.65795169 2 0 0
*One-Hot Encoding
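With keras' to_categorical():
train_y_keras <- to_categorical(train_y, num_classes = 2)
test_y_keras  <- to_categorical(test_y,  num_classes = 2)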
Creating Neural Network Architecture
*Neural Network sequencing
- Create the neural network model:
activation at the hidden layers = 'relu'
activation at the output layer = 'sigmoid', because the Scotty case is binary
input shape = ncol of the train keras dataset
units = 64, 32, and 2 (see the sketch below)
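A sketch matching the summary below; the 896 parameters of hidden_1 imply 13 input columns (13 x 64 + 64), so the input shape is taken from the training matrix. Layer names follow the printed summary; everything else is inferred:
model_nn <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = ncol(train_x_keras), name = "hidden_1") %>%
  layer_dense(units = 32, activation = "relu", name = "hidden_2") %>%
  layer_dense(units = 2, activation = "sigmoid")   # output layer (2 one-hot classes)
summary(model_nn)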
#> Model: "sequential"
#> ___________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ===========================================================================
#> hidden_1 (Dense) (None, 64) 896
#> ___________________________________________________________________________
#> hidden_2 (Dense) (None, 32) 2080
#> ___________________________________________________________________________
#> dense (Dense) (None, 2) 66
#> ===========================================================================
#> Total params: 3,042
#> Trainable params: 3,042
#> Non-trainable params: 0
#> ___________________________________________________________________________
*Compile
Gather the neural network architecture and compile it with a learning rate of 0.001.
*Train the model using the 'fit()' function (see the sketch below).
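A hedged sketch of the compile and fit steps; only the learning rate (0.001) is stated in the report, so the optimizer, loss, epochs, and batch size are assumptions:
model_nn %>% compile(
  optimizer = optimizer_adam(lr = 0.001),   # 'lr' is called 'learning_rate' in newer keras versions
  loss      = "binary_crossentropy",        # assumed loss for the one-hot binary target
  metrics   = "accuracy"
)

history <- model_nn %>% fit(
  x = train_x_keras, y = train_y_keras,
  epochs = 10, batch_size = 32,             # assumed values
  validation_data = list(test_x_keras, test_y_keras)
)
plot(history)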
Interpretation: From the training history we get about 71% accuracy.
*Prediction
Data Validation
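A sketch of the decoding and validation; mapping the higher predicted probability back to the original labels is an assumption about how the report converts the one-hot output:
prob_nn <- predict(model_nn, test_x_keras)      # matrix of class probabilities (col 1 = insufficient)
pred_nn <- factor(ifelse(max.col(prob_nn) == 1, "insufficient", "sufficient"),
                  levels = c("insufficient", "sufficient"))
confusionMatrix(pred_nn, sct_test$coverage, positive = "insufficient")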
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 1267 450
#> sufficient 295 1112
#>
#> Accuracy : 0.7615
#> 95% CI : (0.7462, 0.7764)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.523
#>
#> Mcnemar's Test P-Value : 1.68e-08
#>
#> Sensitivity : 0.8111
#> Specificity : 0.7119
#> Pos Pred Value : 0.7379
#> Neg Pred Value : 0.7903
#> Prevalence : 0.5000
#> Detection Rate : 0.4056
#> Detection Prevalence : 0.5496
#> Balanced Accuracy : 0.7615
#>
#> 'Positive' Class : insufficient
#>
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction insufficient sufficient
#> insufficient 506 144
#> sufficient 107 376
#>
#> Accuracy : 0.7785
#> 95% CI : (0.7531, 0.8023)
#> No Information Rate : 0.541
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.5515
#>
#> Mcnemar's Test P-Value : 0.02307
#>
#> Sensitivity : 0.8254
#> Specificity : 0.7231
#> Pos Pred Value : 0.7785
#> Neg Pred Value : 0.7785
#> Prevalence : 0.5410
#> Detection Rate : 0.4466
#> Detection Prevalence : 0.5737
#> Balanced Accuracy : 0.7743
#>
#> 'Positive' Class : insufficient
#>
Interpretation: From the model validation above, we get accuracy 77.9%, sensitivity 82.5%, specificity 72.3%, and precision 77.9% on the test dataset, slightly higher than on the train dataset, so the model does not appear to be overfit. Its performance, however, is still below the decision tree model.
Conclusion
So far the best model is the one built with the Decision Tree algorithm, with Accuracy = 80.8%, Recall = 87.4%, Specificity = 72.9%, and Precision = 79.2%. The 'mtree1' model also generalizes well, since we find no sign of overfitting or underfitting.
We continue with the Decision Tree model (mtree1) for the data submission.
Data Submission
*Read Data
Read the submission data and store it in an object named 'submission'.
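A sketch, assuming the template is stored at 'data/data-submission.csv' (the actual path is not shown):
submission <- read_csv("data/data-submission.csv")   # assumed file name
glimpse(submission)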
Here we find 3 columns: src_area, datetime, and coverage. We have to fill the coverage column and transform the columns so they match our training dataset.
*Create the same new columns as in the previous steps, then save the result into a new object, 'sub'.
sub <- submission %>%
  mutate(Day = wday(datetime, label = TRUE, abbr = FALSE),
         Hour = hour(datetime),
         Month = month(datetime),
         Peak_Hour = as.factor(ifelse(7 <= Hour & Hour <= 20, "peak", "nopeak")),
         Weekend = as.factor(ifelse(Day == "Sunday" | Day == "Saturday", "weekends", "weekdays")),
         src_area = as.factor(src_area),
         Month = as.factor(Month),
         start_time = datetime) %>%
  select(start_time, src_area, Month, Day, Hour, Peak_Hour, Weekend, coverage)
sub
Looks good! It now has the same structure as our training data, but we still have to fill the coverage column with the predictions from the mtree1 model.
*Prediction using the Decision Tree model (mtree1)
Make a new object to store the prediction results and name it 'pred_sub'.
*Then put the prediction results into the 'coverage' column of the 'sub' dataset, as sketched below.
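For example:
pred_sub <- predict(mtree1, newdata = sub)   # decision tree predictions for every area-hour
sub$coverage <- pred_sub
head(sub)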
Looks good! The coverage column has now been filled with the prediction results.
*Next we need to convert it back to the original format, which consists of only 3 columns: src_area, datetime, and coverage. Save it under the name 'sub2'.
*Finally, create a .csv file from the 'sub2' object and name it 'submit.csv'.
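For example, restoring the original three columns before writing the file:
sub2 <- sub %>%
  mutate(datetime = start_time) %>%
  select(src_area, datetime, coverage)
write.csv(sub2, "submit.csv", row.names = FALSE)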
Later on, the 'submit.csv' file will be uploaded to the leaderboard for submission.