Introduction

The objective of this project was to build classifiers to predict whether a stop within the Melbourne metropolitan area is busy, based on the frequency of MYKI transactions collected from 2015 to 2018. Section 1 gives an overview of our methodology. Section 2 discusses each classifier's fine-tuning process and a detailed performance analysis. Section 3 compares the performance of the classifiers using the same resampling method. Section 4 critiques our methodology. The last section concludes with a summary.

Methodology

We considered two classifiers: Random Forest (RF) and \(K\)-Nearest Neighbour (KNN). Each classifier was trained to make probability predictions so that we could adjust the prediction threshold to refine performance. We split the full data set into a 70 % training set and a 30 % test set. Each set resembled the full data by having approximately the same proportion of target classes, i.e. roughly 15 % Busy (stops with 15 or more MYKI transactions per hour) and 85 % Not Busy (fewer than 15 MYKI transactions per hour). For the fine-tuning process, we ran five-fold cross-validation with stratified sampling on each classifier. Stratified sampling was used to cater for the class imbalance of the target feature.

Next, for each classifier, we determined the optimal probability threshold. Using the tuned hyperparameters and the optimal thresholds, we made predictions on the test data. During model training (hyperparameter tuning and threshold adjustment), we relied on the mean misclassification error rate (mmce). In addition to mmce, we also used the confusion matrix on the test data to evaluate each classifier's performance. The modelling was implemented in R with the mlr package (Bischl et al. 2016).

Hyperparameter Fine-Tuning

Random Forest

We fine-tuned the number of variables randomly sampled as candidates at each split (i.e. mtry). For a classification problem, Breiman (2001) suggests mtry = \(\sqrt{p}\), where \(p\) is the number of descriptive features. In our case, \(\sqrt{p} = \sqrt{11} = 3.32\). Therefore, we experimented with mtry = 2, 3, and 4. We left other hyperparameters, such as the number of trees to grow, at their default values. The optimal value was mtry = 4, with a mean test error of 0.133.

\(K\)-Nearest Neighbour

Using the optimal kernel, we ran a grid search over \(k = 2, 3, \ldots, 20\). The optimal value was k = 16, with a mean test error of 0.187.

Feature Selection

Feature selection was used to identify an optimal subset of the available features. Selecting a subset of relevant features can make training faster, reduce model complexity, improve accuracy and reduce overfitting. There are three broad categories of feature selection methods: filter methods, wrapper methods and embedded methods.

Filter method

The filter method assigns an importance value to each feature. Based on these values, the features are ranked and a feature subset is selected. The learner was fused with the filter method for training each classification model.

Wrapper method

The wrapper method uses the performance of a learning classifier to assess the usefulness of a feature set. To select a feature subset, the learner is trained repeatedly on different feature subsets, and the subset that leads to the best learner performance is chosen.

Load R packages
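The package-loading code did not survive the knitting; a minimal sketch of the libraries this report appears to rely on (plotly and kknn are loaded later in the document) is:

# Core packages assumed by this report
library(mlr)      # tasks, learners, tuning, feature selection
library(dplyr)    # glimpse() and the pipe operator %>%
library(ggplot2)  # bar plots of the target levels
library(knitr)    # kable() for formatted tables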

Load dataset

Loaded the dataset, drew a random sample of 3,000 rows, and removed the redundant index column.

# Read the raw data (path is local to the original analysis machine)
data <- read.csv('E:\\Datathon\\MelbDatathon2018\\MelbDatathon2018_2\\MelbDatathon2018\\inputML.csv', stringsAsFactors = FALSE)
# Draw a random sample of 3,000 rows to keep training times manageable
data <- data[sample(nrow(data), 3000), ]
# Drop the redundant index column
data <- data[,-1]
str(data)
## 'data.frame':    3000 obs. of  15 variables:
##  $ Var1               : chr  "2015-09-20 07:00:00" "2015-09-20 09:00:00" "2015-09-20 08:00:00" "2015-09-20 10:00:00" ...
##  $ Freq               : int  16 14 2 4 1 5 1 29 2 16 ...
##  $ DateTime           : chr  "2015-09-20 09:20:11" "2015-09-20 08:01:46" "2015-09-20 09:14:10" "2015-09-20 10:02:57" ...
##  $ FinancialMonth     : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ FinancialWeek      : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ WeekDay            : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ Hour               : int  5 4 5 6 5 6 3 6 3 6 ...
##  $ RouteID            : int  3 7 6 14 6 16 7 14 10 13 ...
##  $ StopID             : int  20021 40221 20040 19944 20041 19865 20032 19959 19879 19873 ...
##  $ Number             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StopType           : chr  "Platform" "Platform" "Platform" "Platform" ...
##  $ SuburbName         : chr  "Sunshine" "Craigieburn" "Flemington" "Malvern" ...
##  $ PostCode           : int  3020 3064 3031 3144 3031 3194 3046 3141 3135 3150 ...
##  $ LocalGovernmentArea: chr  "Brimbank" "Hume" "Moonee Valley" "Stonnington" ...
##  $ status             : chr  "Busy" "Slightly Busy" "Not Busy" "Not Busy" ...

Data processing

Merged the 'Slightly Busy' level of the target into 'Not Busy', then printed a summary of the dataset to ensure there were no missing values.

# Collapse 'Slightly Busy' (column 15, status) into 'Not Busy'
data[data$status == "Slightly Busy",15] <- "Not Busy"
summarizeColumns(data) %>% knitr::kable(caption =  'Table 1. Feature Summary')
Table 1. Feature Summary
name type na mean disp median mad min max nlevs
Var1 character 0 NA 0.7456667 NA NA 3 763 6
Freq integer 0 7.389000 7.2279845 5 4.4478 0 45 0
DateTime character 0 NA 0.9980000 NA NA 1 6 1835
FinancialMonth integer 0 9.000000 0.0000000 9 0.0000 9 9 0
FinancialWeek integer 0 12.000000 0.0000000 12 0.0000 12 12 0
WeekDay integer 0 7.000000 0.0000000 7 0.0000 7 7 0
Hour integer 0 4.483000 0.9616059 5 1.4826 1 6 0
RouteID integer 0 11.196333 5.5157285 13 4.4478 1 20 0
StopID integer 0 21036.592667 5060.4338666 19939 87.4734 19829 46468 0
Number integer 0 1.000333 0.0182574 1 0.0000 1 2 0
StopType character 0 NA 0.0000000 NA NA 3000 3000 1
SuburbName character 0 NA 0.9616667 NA NA 1 115 137
PostCode integer 0 3148.644667 174.8464989 3131 81.5430 3011 3977 0
LocalGovernmentArea character 0 NA 0.9090000 NA NA 3 273 29
status character 0 NA 0.1453333 NA NA 436 2564 2

Removed the Var1, Freq, DateTime and StopType features. Var1 and DateTime have too many levels, and StopType has only one level. The target feature 'status' was derived from Freq, therefore Freq was removed to avoid leaking the target into the predictors.

# Drop Var1, Freq, DateTime (columns 1-3) and StopType (column 11)
data <- data[,-c(1:3, 11)]
glimpse(data)
## Observations: 3,000
## Variables: 11
## $ FinancialMonth      <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ FinancialWeek       <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12...
## $ WeekDay             <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ Hour                <int> 5, 4, 5, 6, 5, 6, 3, 6, 3, 6, 4, 5, 6, 5, ...
## $ RouteID             <int> 3, 7, 6, 14, 6, 16, 7, 14, 10, 13, 13, 3, ...
## $ StopID              <int> 20021, 40221, 20040, 19944, 20041, 19865, ...
## $ Number              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ SuburbName          <chr> "Sunshine", "Craigieburn", "Flemington", "...
## $ PostCode            <int> 3020, 3064, 3031, 3144, 3031, 3194, 3046, ...
## $ LocalGovernmentArea <chr> "Brimbank", "Hume", "Moonee Valley", "Ston...
## $ status              <chr> "Busy", "Not Busy", "Not Busy", "Not Busy"...
write.csv(data, "data_MLPro.csv")
propT<-table(data$status)
prop.table(propT)
## 
##      Busy  Not Busy 
## 0.1453333 0.8546667

Plotted proportion of ‘Busy’ and ‘Not Busy’ target levels for each (i) local government area, (ii) route ID and (iii) hour.
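The plotting code was not included in the knitted output; a sketch of how the stacked bar charts in Figure 1 could be produced with ggplot2 (panel object names p1 and p2 are assumed) is:

# Hypothetical reconstruction of the Figure 1 panels
p1 <- ggplot(data, aes(x = LocalGovernmentArea, fill = status)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
  xlab("Local Government Area")
p2 <- ggplot(data, aes(x = factor(RouteID), fill = status)) +
  geom_bar() +
  xlab("Route ID")
p1
p2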

Figure 1. Transaction activity based on (top) local government area and (bottom) route ID.

Interactive plot of the proportion of ‘Busy’ and ‘Not Busy’ target levels for each suburb.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:latex2exp':
## 
##     TeX
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p4 <- ggplot(data, aes(x = SuburbName, fill = status)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) + xlab("Suburb")
gg1 <- ggplotly(p4)
gg1

Figure 2. MYKI transaction activity based on suburb.

Plotted the proportion of ‘Busy’ and ‘Not Busy’ target levels for the entire dataset.

ggplot(data, aes(x = status)) + geom_bar() + labs(x = "Status", y = "Frequency") 

Figure 3. Proportion of stops defined as busy or not busy.

Update status (target) variable

# Replace the space so the class label is a syntactically valid R name
data$status[data$status == 'Not Busy'] <- 'Not_Busy'

Dummify categorical variables (categorical variables were re-classified as factors and then converted to binary indicator variables).

# Re-class categorical variables as factors so they can be dummified
data$SuburbName <- data$SuburbName %>% as.factor()
data$LocalGovernmentArea <- data$LocalGovernmentArea %>% as.factor()
data$StopID <- data$StopID %>% as.factor()
data$PostCode <- data$PostCode %>% as.factor()
data <- createDummyFeatures(data, target = "status", 'reference', cols = NULL)

Shuffle rows prior to splitting dataset

To remove any ordering patterns in the dataset (e.g. due to yearly fluctuations) and avoid bias, the rows were randomised prior to splitting the data into training and test sets.

#Shuffle dataset rows
set.seed(1234)
n <- nrow(data)
shuffled_data <- data[sample(n), ]

Train-Test Split

The data was split to obtain the training (70 %) and test (30 %) datasets.

# Simple random split into 70 % training & 30 % test data
# This is not stratified sampling, which is used later in model training
# Obtain the training and test indices
# (sample(n) returns a permutation of 1:n; rows were already shuffled above)
training_index <- sample(nrow(shuffled_data) * 0.70)
test_index     <- setdiff(seq_len(nrow(shuffled_data)), training_index)

# Get the training data and test data
training_data  <- shuffled_data[training_index, ]
test_data      <- shuffled_data[test_index, ]

Determine the proportion of each target class in the training and test datasets.
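The code for this check was not shown; a sketch that would produce the proportions below is:

# Class proportions in the training and test sets
prop.table(table(training_data$status))
prop.table(table(test_data$status))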

## 
##      Busy  Not_Busy 
## 0.1428571 0.8571429
## 
##      Busy  Not_Busy 
## 0.1511111 0.8488889

They were not exactly balanced but were representative of the full dataset. We used the training data for modelling and the test data for model evaluation.

Modeling

Basic configuration

# Configure classification task
task <- makeClassifTask(data = training_data, target = 'status', id = 'trans_freq', positive = "Busy")

# Configure learners with probability type
learner2 <- makeLearner('classif.randomForest', predict.type = 'prob')  # Random Forest learner
learner3 <- makeLearner('classif.kknn', predict.type = 'prob')          # kNN learner
## Loading required package: kknn
## 
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
## 
##     contr.dummy

Model fine-tuning

# For randomForest, we can fine-tune mtry, i.e. the number of variables
# randomly sampled as candidates at each split. Following
# Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32,
# we try mtry = 2, 3, 4 since mtry = sqrt(p) where p = 11
ps2 <- makeParamSet(
  makeDiscreteParam('mtry', values = c(2,3,4))
)

# For kknn, we can fine-tune k = 2 to 20 
ps3 <- makeParamSet(
  makeDiscreteParam('k', values = seq(2, 20, by = 1))
)

Configure the tune control search and a 5-CV stratified sampling scheme
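The configuration code was not included in the knitted output; a minimal sketch using standard mlr calls (the object names ctrl and rdesc are assumed) is:

# Grid search over the discrete parameter sets defined above
ctrl  <- makeTuneControlGrid()
# 5-fold cross-validation with stratified sampling on the target
rdesc <- makeResampleDesc('CV', iters = 5, stratify = TRUE)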

Configure the tune wrappers with the fine-tuning settings
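A sketch of the wrapper construction, under the same naming assumptions:

# Wrap each learner so that tuning happens inside the resampling loop
tuned_learner2 <- makeTuneWrapper(learner2, resampling = rdesc,
                                  par.set = ps2, control = ctrl)  # RF
tuned_learner3 <- makeTuneWrapper(learner3, resampling = rdesc,
                                  par.set = ps3, control = ctrl)  # kNN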

Model Training

Train the tune wrappers
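Training the wrappers produces the tuning logs below; a sketch of the calls (model names assumed) is:

# Training a tune wrapper runs the grid search internally and then
# refits the learner on the full training task with the best setting
model2 <- train(tuned_learner2, task)  # Random Forest
model3 <- train(tuned_learner3, task)  # kNN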

## [Tune] Started tuning learner classif.randomForest for parameter set:
##          Type len Def Constr Req Tunable Trafo
## mtry discrete   -   -  2,3,4   -    TRUE     -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: mtry=2
## [Tune-y] 1: mmce.test.mean=0.1428571; time: 2.1 min
## [Tune-x] 2: mtry=3
## [Tune-y] 2: mmce.test.mean=0.1428571; time: 3.0 min
## [Tune-x] 3: mtry=4
## [Tune-y] 3: mmce.test.mean=0.1328571; time: 3.1 min
## [Tune] Result: mtry=4 : mmce.test.mean=0.1328571
## [Tune] Started tuning learner classif.kknn for parameter set:
##       Type len Def                                   Constr Req Tunable
## k discrete   -   - 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,...   -    TRUE
##   Trafo
## k     -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: k=2
## [Tune-y] 1: mmce.test.mean=0.2023810; time: 0.1 min
## [Tune-x] 2: k=3
## [Tune-y] 2: mmce.test.mean=0.2061905; time: 0.1 min
## [Tune-x] 3: k=4
## [Tune-y] 3: mmce.test.mean=0.1990476; time: 0.1 min
## [Tune-x] 4: k=5
## [Tune-y] 4: mmce.test.mean=0.2014286; time: 0.1 min
## [Tune-x] 5: k=6
## [Tune-y] 5: mmce.test.mean=0.2004762; time: 0.1 min
## [Tune-x] 6: k=7
## [Tune-y] 6: mmce.test.mean=0.1961905; time: 0.1 min
## [Tune-x] 7: k=8
## [Tune-y] 7: mmce.test.mean=0.1933333; time: 0.1 min
## [Tune-x] 8: k=9
## [Tune-y] 8: mmce.test.mean=0.1933333; time: 0.1 min
## [Tune-x] 9: k=10
## [Tune-y] 9: mmce.test.mean=0.1904762; time: 0.1 min
## [Tune-x] 10: k=11
## [Tune-y] 10: mmce.test.mean=0.1938095; time: 0.1 min
## [Tune-x] 11: k=12
## [Tune-y] 11: mmce.test.mean=0.1933333; time: 0.1 min
## [Tune-x] 12: k=13
## [Tune-y] 12: mmce.test.mean=0.1961905; time: 0.1 min
## [Tune-x] 13: k=14
## [Tune-y] 13: mmce.test.mean=0.1957143; time: 0.1 min
## [Tune-x] 14: k=15
## [Tune-y] 14: mmce.test.mean=0.1938095; time: 0.1 min
## [Tune-x] 15: k=16
## [Tune-y] 15: mmce.test.mean=0.1871429; time: 0.1 min
## [Tune-x] 16: k=17
## [Tune-y] 16: mmce.test.mean=0.1995238; time: 0.1 min
## [Tune-x] 17: k=18
## [Tune-y] 17: mmce.test.mean=0.1980952; time: 0.1 min
## [Tune-x] 18: k=19
## [Tune-y] 18: mmce.test.mean=0.1961905; time: 0.1 min
## [Tune-x] 19: k=20
## [Tune-y] 19: mmce.test.mean=0.2009524; time: 0.1 min
## [Tune] Result: k=16 : mmce.test.mean=0.1871429

Model Prediction

Predict on training data
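A sketch of the prediction step (object names follow the earlier assumptions):

# Probability predictions on the training task, used for threshold tuning
pred2 <- predict(model2, task)  # Random Forest
pred3 <- predict(model3, task)  # kNN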

Model Evaluation

Obtain threshold values for each learner
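The threshold-selection code was not included; one way to generate the curves in Figure 4 with mlr is sketched below:

# Evaluate mmce across the full range of probability thresholds
d2 <- generateThreshVsPerfData(pred2, measures = mmce)
d3 <- generateThreshVsPerfData(pred3, measures = mmce)
plotThreshVsPerf(d2)  # Random Forest curve
plotThreshVsPerf(d3)  # kNN curve

# Apply a chosen threshold to the positive class ('Busy');
# the value 0.5 here is a placeholder, not the tuned optimum
pred2_adj <- setThreshold(pred2, 0.5)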

Figure 4. Plot for the optimization of the threshold for the kNN and Random Forest classifiers trained on all features.

Evaluation on test data

ROC curves for Random Forest and kNN models

We evaluated each algorithm by comparing the plotted ROC curves and their AUC values.
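The plotting code was not shown; a sketch of the test-set ROC comparison in mlr (test task and prediction names assumed) is:

# Predictions on the held-out test data
test_task  <- makeClassifTask(data = test_data, target = 'status',
                              positive = 'Busy')
pred2_test <- predict(model2, test_task)
pred3_test <- predict(model3, test_task)

# ROC curves: false vs true positive rate across thresholds
roc_df <- generateThreshVsPerfData(list(RF = pred2_test, kNN = pred3_test),
                                   measures = list(fpr, tpr))
plotROCCurves(roc_df)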

Figure 5. ROC curves for the kNN and Random Forest classifiers trained on all features.

The ROC curves indicated that the Random Forest classifier performed slightly better than kNN (Fig. 5).

Misclassification Error and AUC value
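A sketch of the call that produces the figures in Table 2, reusing the test-set predictions assumed above:

# Tabulate mmce and AUC for both models on the test data
performance(pred2_test, measures = list(mmce, auc))
performance(pred3_test, measures = list(mmce, auc))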

Table 2. Performance for Random Forest and kNN classifiers

       Random Forest        kNN
mmce       0.1388889  0.1522222
auc        0.8227643  0.8567428

When the models were trained on all features, the Random Forest classifier achieved the lowest mean misclassification error (0.139 vs 0.152), although kNN achieved a higher AUC value (0.857 vs 0.823) (Table 2).

Confusion Matrix, Precision and Recall for Random Forest model
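The output below has the format produced by mlr's calculateROCMeasures; a sketch of the call (using the test-set prediction objects assumed earlier) is:

# Confusion matrix plus derived rates (tpr, ppv, etc.) for the RF model
calculateROCMeasures(pred2_test)
# The analogous call for the kNN model is calculateROCMeasures(pred3_test)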

##           predicted
## true       Busy      Not_Busy                       
##   Busy     13        123       tpr: 0.1   fnr: 0.9  
##   Not_Busy 2         762       fpr: 0     tnr: 1    
##            ppv: 0.87 for: 0.14 lrp: 36.51 acc: 0.86 
##            fdr: 0.13 npv: 0.86 lrm: 0.91  dor: 40.27
## 
## 
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio

Confusion Matrix, Precision and Recall for k-Nearest Neighbours model

##           predicted
## true       Busy     Not_Busy                     
##   Busy     51       85        tpr: 0.38 fnr: 0.62
##   Not_Busy 52       712       fpr: 0.07 tnr: 0.93
##            ppv: 0.5 for: 0.11 lrp: 5.51 acc: 0.85
##            fdr: 0.5 npv: 0.89 lrm: 0.67 dor: 8.22
## 
## 
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio

Feature selection

Random forest filter method for feature selection

Filter methods assign an importance value to each feature. Features are then ranked by importance, and a subset is selected. We created an object named mfv by calling generateFilterValuesData from mlr on the classification task, using the filter method randomForest.importance.
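The call itself was not shown; a sketch consistent with the description above is:

# Rank features by random forest importance (object name from the text)
mfv <- generateFilterValuesData(task, method = 'randomForest.importance')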

## Supervised task: datathon
## Type: regr
## Target: status
## Observations: 3000
## Features:
##    numerics     factors     ordered functionals 
##         467           0           0           0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE

Plot filtered features obtained using information gain and chi squared methods
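A sketch of how the two panels of Figure 6 could be generated (object name fv assumed):

# Importance under two alternative filter criteria, then plot the top features
# (the information.gain and chi.squared filters require the FSelector package)
fv <- generateFilterValuesData(task,
                               method = c('information.gain', 'chi.squared'))
plotFilterValues(fv, n.show = 10)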

Figure 6. Features selected using a filter selection algorithm in mlr based on (left) information gain and (right) chi-squared (https://mlr-org.github.io/mlr/articles/tutorial/devel/feature_selection.html) (Bischl et al. 2016).

Fuse the random forest learner with the information gain filter

We then ‘fused’ the random forest classification learner with the information.gain filter before training the model.
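A sketch of the fusing step with makeFilterWrapper (wrapper name assumed):

# Fuse the RF learner with the information gain filter; the number of
# features to keep (fw.abs) is tuned in the next step
lrn_filtered <- makeFilterWrapper(learner = learner2,
                                  fw.method = 'information.gain')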

Determine optimal number of features to keep

The optimal number of features to keep was determined by 5-fold cross-validation. We used information gain as the importance measure and selected the 10 features with the highest importance. In each resampling iteration, feature selection is carried out on the corresponding training data before fitting the learner.
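A sketch of the tuning call that produces the log below (parameter set and result names assumed):

# Treat the number of retained features (fw.abs) as a tuning parameter
ps_fw  <- makeParamSet(makeDiscreteParam('fw.abs', values = 10))
res_fw <- tuneParams(lrn_filtered, task = task, resampling = rdesc,
                     par.set = ps_fw, control = ctrl)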

## [Tune] Started tuning learner classif.randomForest.filtered for parameter set:
##            Type len Def Constr Req Tunable Trafo
## fw.abs discrete   -   -     10   -    TRUE     -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: fw.abs=10
## [Tune-y] 1: mmce.test.mean=0.1300000; time: 0.4 min
## [Tune] Result: fw.abs=10 : mmce.test.mean=0.1300000

Performance (misclassification error)

The optimal number of features and the corresponding misclassification error are:

## $fw.abs
## [1] 10
## mmce.test.mean 
##           0.13

Fuse learner with feature selection

We then fused the random forest learner with the chosen fw.abs value, using makeFilterWrapper with the chi-squared method, before training the model:
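A sketch of this step (names assumed):

# Fuse with the chi-squared filter, keeping the tuned number of features
lrn_chisq <- makeFilterWrapper(learner = learner2,
                               fw.method = 'chi.squared', fw.abs = 10)
filtered_model <- train(lrn_chisq, task)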

View selected features

We then applied getFilteredFeatures to the trained model to view the selected features.
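A sketch of the call, reusing the filtered_model object assumed above:

# Names of the features retained by the filter wrapper
getFilteredFeatures(filtered_model)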

Wrapper Methods

Select optimal features to use

We used a random search with ten iterations on the random forest classifier and the classification task.
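A sketch of the wrapper-based search (object names assumed):

# Random search over feature subsets, evaluated by cross-validated mmce
ctrl_random <- makeFeatSelControlRandom(maxit = 10)
sfeats <- selectFeatures(learner = learner2, task = task,
                         resampling = rdesc, control = ctrl_random)
sfeats$y  # cross-validated performance of the best subset
sfeats$x  # names of the selected features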

Performance (misclassification error)

## mmce.test.mean 
##      0.1280952

View the important features

Table 3. Selected Features for Random Forest classifier with mlr Feature Selection (selectFeatures)

Filter                                   Wrapper
RouteID                                  WeekDay
StopID.19943                             RouteID
StopID.20021                             StopID.19832
StopID.20025                             StopID.19837
SuburbName.Caulfield.East                StopID.19839
SuburbName.Footscray                     StopID.19840
SuburbName.Sunshine                      StopID.19844
PostCode.3020                            StopID.19846
LocalGovernmentArea.Greater.Dandenong    StopID.19848
LocalGovernmentArea.Maribyrnong          StopID.19849

Wrap feature selection method with learner

Comparing the misclassification error rates, the random search wrapper method outperformed the chi-squared (filter) method. We then fused the wrapper method into a learner using makeFeatSelWrapper, together with makeFeatSelControlRandom and makeResampleDesc objects.
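A sketch of the fusing step, reusing the objects assumed above:

# Wrap feature selection inside the learner so that it is repeated
# on each training fold during resampling
lrn_wrapped <- makeFeatSelWrapper(learner2, resampling = rdesc,
                                  control = ctrl_random)
wrapped_model <- train(lrn_wrapped, task)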

Define the test data

Prediction

Evaluation

Obtained the confusion matrix by running calculateConfusionMatrix(pred_on_test) and generated the ROC curve.
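A sketch of the evaluation step, reusing the test task defined earlier (object names assumed):

# Predict with the feature-selection-wrapped model on the test data
pred_on_test <- predict(wrapped_model, test_task)
# Confusion matrix with marginal error counts
calculateConfusionMatrix(pred_on_test)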

Confusion Matrix

##           predicted
## true       Busy Not_Busy -err.-
##   Busy       26      110    110
##   Not_Busy   18      746     18
##   -err.-     18      110    128

ROC curve

Figure 7. ROC curves for the random forest classifier. RF - random forest classifier trained on all features; RF_spsa_tuned - random forest classifier with tuned hyperparameters, trained on the features selected by RF_wrapper.

The ROC curves show that the random forest classifier with optimised hyperparameters and wrapper-based feature selection performed similarly to the classifier trained on all features.

Performance for random forest with mlr feature selection

Misclassification Error and AUC value

Table 4. Performance for the random forest classifier with wrapper feature selection

mmce   0.1422222
auc    0.8300402

The random forest classifier using the wrapper method for feature selection had a mean misclassification error of 0.142 and an AUC value of 0.830 (Table 4).

Confusion Matrix, Precision and Recall for Random Forest model with feature selection

##           predicted
## true       Busy      Not_Busy                     
##   Busy     26        110       tpr: 0.19 fnr: 0.81
##   Not_Busy 18        746       fpr: 0.02 tnr: 0.98
##            ppv: 0.59 for: 0.13 lrp: 8.11 acc: 0.86
##            fdr: 0.41 npv: 0.87 lrm: 0.83 dor: 9.8 
## 
## 
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio

Discussion

No Feature Selection: RF (mmce 0.118, AUC 0.881, precision 0.67, recall 0.16) kNN (mmce 0.133, AUC 0.900, precision 0.48, recall 0.50)

Feature Selection: RF (mmce 0.112, AUC 0.74, precision 0.65, recall 0.74)

Random forest, with and without feature selection, exhibited better precision than kNN, while all classifiers had similar AUC values. However, being able to correctly predict that a stop is congested (a true positive) would be the best outcome from the perspective of both passengers and service providers. The random forest classifier with feature selection performed best at correctly predicting true positives (recall).

Conclusion

References

Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1-5.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5-32.