The objective of this project was to build classifiers that predict whether a stop in the Melbourne metropolitan area is busy, based on the frequency of MYKI transactions recorded from 2015 to 2018. Section 1 provides an overview of our methodology. Section 2 discusses the classifiers' fine-tuning process and a detailed performance analysis of each classifier. Section 3 compares the performance of the classifiers under the same resampling method. Section 4 critiques our methodology. The last section concludes with a summary.
We considered two classifiers, Random Forest (RF) and \(K\)-Nearest Neighbour (KNN), with Naive Bayes (NB) as the baseline classifier. Each classifier was trained to make probability predictions so that we could adjust the prediction threshold to refine performance. We split the full data set into a 70% training set and a 30% test set. Each set resembled the full data by having the same proportion of target classes, i.e. approximately 15% Busy (stops with 15 or more MYKI transactions per hour) and 85% Not Busy (fewer than 15 MYKI transactions per hour). For the fine-tuning process, we ran five-fold cross-validation with stratified sampling on each classifier. Stratified sampling was used to cater for the class imbalance of the target feature.
Next, for each classifier, we determined the optimal probability threshold. Using the tuned hyperparameters and the optimal thresholds, we made predictions on the test data. During model training (hyperparameter tuning and threshold adjustment), we relied on the mean misclassification error rate (mmce). In addition to mmce, we also used the confusion matrix on the test data to evaluate the classifiers' performance. The modelling was implemented in R with the mlr package (Bischl et al. 2016).
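As a minimal sketch, the shared evaluation setup looked roughly as follows (the stratified 5-fold resample description is reused throughout; the object name rdesc is illustrative):
# Load mlr and define a stratified 5-fold cross-validation description
library(mlr)
rdesc <- makeResampleDesc('CV', iters = 5, stratify = TRUE)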
We fine-tuned the number of variables randomly sampled as candidates at each split (i.e. mtry). For a classification problem, Breiman (2001) suggests mtry = \(\sqrt{p}\), where \(p\) is the number of descriptive features. In our case, \(\sqrt{p} = \sqrt{11} = 3.31\), so we experimented with mtry = 2, 3, and 4. We left other hyperparameters, such as the number of trees to grow, at their default values. The optimal value was mtry = 4, with a mean test error of 0.133.
Using the optimal kernel, we ran a grid search over \(k = 2, 3, \ldots, 20\). The optimal value was \(k = 16\), with a mean test error of 0.187.
Feature selection was used to identify an optimal subset of the available features. Selecting a subset of relevant features can make training faster, reduce model complexity, improve accuracy, and reduce overfitting. There are three broad categories of feature selection methods: filter methods, wrapper methods, and embedded methods.
The filter method assigns an importance value to each feature. Based on these values, the features are ranked and a feature subset is selected. The learner was fused with the filter method for the training of each classification model.
The wrapper method uses the performance of a learning classifier to assess the usefulness of a feature set. To select a feature subset, the learner was trained repeatedly on different feature subsets, and the subset that led to the best learner performance was chosen.
Loaded the dataset, sampled 3,000 rows, and removed the redundant index column.
# Read the raw MYKI feature file from a local path
data <- read.csv('E:\\Datathon\\MelbDatathon2018\\MelbDatathon2018_2\\MelbDatathon2018\\inputML.csv', stringsAsFactors = FALSE)
# Take a random sample of 3,000 rows
data <- data[sample(nrow(data), 3000), ]
# Drop the redundant index column
data <- data[, -1]
str(data)
## 'data.frame': 3000 obs. of 15 variables:
## $ Var1 : chr "2015-09-20 07:00:00" "2015-09-20 09:00:00" "2015-09-20 08:00:00" "2015-09-20 10:00:00" ...
## $ Freq : int 16 14 2 4 1 5 1 29 2 16 ...
## $ DateTime : chr "2015-09-20 09:20:11" "2015-09-20 08:01:46" "2015-09-20 09:14:10" "2015-09-20 10:02:57" ...
## $ FinancialMonth : int 9 9 9 9 9 9 9 9 9 9 ...
## $ FinancialWeek : int 12 12 12 12 12 12 12 12 12 12 ...
## $ WeekDay : int 7 7 7 7 7 7 7 7 7 7 ...
## $ Hour : int 5 4 5 6 5 6 3 6 3 6 ...
## $ RouteID : int 3 7 6 14 6 16 7 14 10 13 ...
## $ StopID : int 20021 40221 20040 19944 20041 19865 20032 19959 19879 19873 ...
## $ Number : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StopType : chr "Platform" "Platform" "Platform" "Platform" ...
## $ SuburbName : chr "Sunshine" "Craigieburn" "Flemington" "Malvern" ...
## $ PostCode : int 3020 3064 3031 3144 3031 3194 3046 3141 3135 3150 ...
## $ LocalGovernmentArea: chr "Brimbank" "Hume" "Moonee Valley" "Stonnington" ...
## $ status : chr "Busy" "Slightly Busy" "Not Busy" "Not Busy" ...
Recoded 'Slightly Busy' stops as 'Not Busy' to obtain a binary target, then printed a summary of the dataset to ensure there were no missing values.
# Collapse the 'Slightly Busy' level into 'Not Busy' (column 15 is status)
data[data$status == "Slightly Busy", 15] <- "Not Busy"
summarizeColumns(data) %>% knitr::kable(caption = 'Table 1. Feature Summary')
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| Var1 | character | 0 | NA | 0.7456667 | NA | NA | 3 | 763 | 6 |
| Freq | integer | 0 | 7.389000 | 7.2279845 | 5 | 4.4478 | 0 | 45 | 0 |
| DateTime | character | 0 | NA | 0.9980000 | NA | NA | 1 | 6 | 1835 |
| FinancialMonth | integer | 0 | 9.000000 | 0.0000000 | 9 | 0.0000 | 9 | 9 | 0 |
| FinancialWeek | integer | 0 | 12.000000 | 0.0000000 | 12 | 0.0000 | 12 | 12 | 0 |
| WeekDay | integer | 0 | 7.000000 | 0.0000000 | 7 | 0.0000 | 7 | 7 | 0 |
| Hour | integer | 0 | 4.483000 | 0.9616059 | 5 | 1.4826 | 1 | 6 | 0 |
| RouteID | integer | 0 | 11.196333 | 5.5157285 | 13 | 4.4478 | 1 | 20 | 0 |
| StopID | integer | 0 | 21036.592667 | 5060.4338666 | 19939 | 87.4734 | 19829 | 46468 | 0 |
| Number | integer | 0 | 1.000333 | 0.0182574 | 1 | 0.0000 | 1 | 2 | 0 |
| StopType | character | 0 | NA | 0.0000000 | NA | NA | 3000 | 3000 | 1 |
| SuburbName | character | 0 | NA | 0.9616667 | NA | NA | 1 | 115 | 137 |
| PostCode | integer | 0 | 3148.644667 | 174.8464989 | 3131 | 81.5430 | 3011 | 3977 | 0 |
| LocalGovernmentArea | character | 0 | NA | 0.9090000 | NA | NA | 3 | 273 | 29 |
| status | character | 0 | NA | 0.1453333 | NA | NA | 436 | 2564 | 2 |
# Drop Var1, Freq, DateTime and the single-level StopType column
data <- data[, -c(1:3, 11)]
glimpse(data)
## Observations: 3,000
## Variables: 11
## $ FinancialMonth <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ FinancialWeek <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12...
## $ WeekDay <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ Hour <int> 5, 4, 5, 6, 5, 6, 3, 6, 3, 6, 4, 5, 6, 5, ...
## $ RouteID <int> 3, 7, 6, 14, 6, 16, 7, 14, 10, 13, 13, 3, ...
## $ StopID <int> 20021, 40221, 20040, 19944, 20041, 19865, ...
## $ Number <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ SuburbName <chr> "Sunshine", "Craigieburn", "Flemington", "...
## $ PostCode <int> 3020, 3064, 3031, 3144, 3031, 3194, 3046, ...
## $ LocalGovernmentArea <chr> "Brimbank", "Hume", "Moonee Valley", "Ston...
## $ status <chr> "Busy", "Not Busy", "Not Busy", "Not Busy"...
# Save the cleaned data and check the class balance of the target
write.csv(data, "data_MLPro.csv")
propT <- table(data$status)
prop.table(propT)
##
## Busy Not Busy
## 0.1453333 0.8546667
Figure 1. Transaction activity by (top) local government area and (bottom) route ID.
library(plotly)
# Bar chart of status by suburb, rendered interactively with plotly
p4 <- ggplot(data, aes(x = SuburbName, fill = status)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
  xlab("Suburb")
gg1 <- ggplotly(p4)
gg1
Figure 2. MYKI transaction activity based on suburb.
ggplot(data, aes(x = status)) + geom_bar() + labs(x = "Status", y = "Frequency")
Figure 3. Proportion of stops defined as busy or not busy.
# Replace the space so the class level is a syntactically valid R name
data$status[data$status == 'Not Busy'] <- 'Not_Busy'
# Convert the categorical features to factors before dummy encoding
data$SuburbName <- data$SuburbName %>% as.factor()
data$LocalGovernmentArea <- data$LocalGovernmentArea %>% as.factor()
data$StopID <- data$StopID %>% as.factor()
data$PostCode <- data$PostCode %>% as.factor()
# Reference-encode all factor features except the target
data <- createDummyFeatures(data, target = "status", method = 'reference', cols = NULL)
To remove any ordering patterns in the dataset (e.g. due to yearly fluctuations) and avoid any resulting bias, the rows were randomized prior to splitting the data into training and test sets.
#Shuffle dataset rows
set.seed(1234)
n <- nrow(data)
shuffled_data <- data[sample(n), ]
The data was split to obtain the training (70%) and test (30%) datasets.
# Simple random split into 70% training and 30% test data.
# This is not stratified sampling, which will be used during model training.
training_index <- sample(nrow(shuffled_data), size = floor(0.7 * nrow(shuffled_data)))
test_index <- setdiff(seq_len(nrow(shuffled_data)), training_index)
# Extract the training and test data
training_data <- shuffled_data[training_index, ]
test_data <- shuffled_data[test_index, ]
##
## Busy Not_Busy
## 0.1428571 0.8571429
##
## Busy Not_Busy
## 0.1511111 0.8488889
The two sets were not exactly equally balanced, but both were representative of the full dataset. We used the training data for modeling and the test data for model evaluation.
# Configure classification task
task <- makeClassifTask(data = training_data, target = 'status', id = 'trans_freq', positive = "Busy")
# Configure learners with probability type
learner2 <- makeLearner('classif.randomForest', predict.type = 'prob') # Random Forest learner
learner3 <- makeLearner('classif.kknn', predict.type = 'prob') # kNN learner
# For randomForest, we can fine-tune mtry, i.e. the number of variables randomly
# sampled as candidates at each split. Following
# Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32,
# we try mtry = 2, 3, 4 since mtry = sqrt(p) with p = 11
ps2 <- makeParamSet(
makeDiscreteParam('mtry', values = c(2,3,4))
)
# For kknn, we can fine-tune k = 2 to 20
ps3 <- makeParamSet(
makeDiscreteParam('k', values = seq(2, 20, by = 1))
)
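A rough sketch of the tuning calls that produced the logs below (the resample description and grid control are assumptions consistent with the logged output; learner2, learner3, ps2, ps3 and task are defined above):
# Stratified 5-fold CV with an exhaustive grid search over each parameter set
rdesc <- makeResampleDesc('CV', iters = 5, stratify = TRUE)
ctrl <- makeTuneControlGrid()
tuned_rf <- tuneParams(learner2, task = task, resampling = rdesc,
                       par.set = ps2, control = ctrl)
tuned_knn <- tuneParams(learner3, task = task, resampling = rdesc,
                        par.set = ps3, control = ctrl)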
## [Tune] Started tuning learner classif.randomForest for parameter set:
## Type len Def Constr Req Tunable Trafo
## mtry discrete - - 2,3,4 - TRUE -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: mtry=2
## [Tune-y] 1: mmce.test.mean=0.1428571; time: 2.1 min
## [Tune-x] 2: mtry=3
## [Tune-y] 2: mmce.test.mean=0.1428571; time: 3.0 min
## [Tune-x] 3: mtry=4
## [Tune-y] 3: mmce.test.mean=0.1328571; time: 3.1 min
## [Tune] Result: mtry=4 : mmce.test.mean=0.1328571
## [Tune] Started tuning learner classif.kknn for parameter set:
## Type len Def Constr Req Tunable
## k discrete - - 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,... - TRUE
## Trafo
## k -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: k=2
## [Tune-y] 1: mmce.test.mean=0.2023810; time: 0.1 min
## [Tune-x] 2: k=3
## [Tune-y] 2: mmce.test.mean=0.2061905; time: 0.1 min
## [Tune-x] 3: k=4
## [Tune-y] 3: mmce.test.mean=0.1990476; time: 0.1 min
## [Tune-x] 4: k=5
## [Tune-y] 4: mmce.test.mean=0.2014286; time: 0.1 min
## [Tune-x] 5: k=6
## [Tune-y] 5: mmce.test.mean=0.2004762; time: 0.1 min
## [Tune-x] 6: k=7
## [Tune-y] 6: mmce.test.mean=0.1961905; time: 0.1 min
## [Tune-x] 7: k=8
## [Tune-y] 7: mmce.test.mean=0.1933333; time: 0.1 min
## [Tune-x] 8: k=9
## [Tune-y] 8: mmce.test.mean=0.1933333; time: 0.1 min
## [Tune-x] 9: k=10
## [Tune-y] 9: mmce.test.mean=0.1904762; time: 0.1 min
## [Tune-x] 10: k=11
## [Tune-y] 10: mmce.test.mean=0.1938095; time: 0.1 min
## [Tune-x] 11: k=12
## [Tune-y] 11: mmce.test.mean=0.1933333; time: 0.1 min
## [Tune-x] 12: k=13
## [Tune-y] 12: mmce.test.mean=0.1961905; time: 0.1 min
## [Tune-x] 13: k=14
## [Tune-y] 13: mmce.test.mean=0.1957143; time: 0.1 min
## [Tune-x] 14: k=15
## [Tune-y] 14: mmce.test.mean=0.1938095; time: 0.1 min
## [Tune-x] 15: k=16
## [Tune-y] 15: mmce.test.mean=0.1871429; time: 0.1 min
## [Tune-x] 16: k=17
## [Tune-y] 16: mmce.test.mean=0.1995238; time: 0.1 min
## [Tune-x] 17: k=18
## [Tune-y] 17: mmce.test.mean=0.1980952; time: 0.1 min
## [Tune-x] 18: k=19
## [Tune-y] 18: mmce.test.mean=0.1961905; time: 0.1 min
## [Tune-x] 19: k=20
## [Tune-y] 19: mmce.test.mean=0.2009524; time: 0.1 min
## [Tune] Result: k=16 : mmce.test.mean=0.1871429
Figure 4. Optimization of the prediction threshold for the kNN and Random Forest classifiers trained on all features.
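A minimal sketch of how such threshold curves can be generated in mlr (pred_rf and pred_knn are assumed prediction objects, and the 0.3 threshold is purely illustrative):
# Threshold-vs-performance curves for both models
thresh_d <- generateThreshVsPerfData(list(RF = pred_rf, kNN = pred_knn),
                                     measures = list(fpr, tpr, mmce))
plotThreshVsPerf(thresh_d)
# Apply a chosen threshold to the positive (Busy) class probabilities
pred_rf_adj <- setThreshold(pred_rf, 0.3)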
Evaluated each algorithm by comparing the plotted ROC curves and their AUC values.
Figure 5. AUC plot for the kNN and Random Forest classifiers trained on all features.
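A sketch of how the ROC curves and AUC values can be produced (again assuming the pred_rf and pred_knn prediction objects):
# ROC curves and AUC for both classifiers
roc_d <- generateThreshVsPerfData(list(RF = pred_rf, kNN = pred_knn),
                                  measures = list(fpr, tpr))
plotROCCurves(roc_d)
performance(pred_rf, measures = auc)
performance(pred_knn, measures = auc)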
The plots indicated that kNN achieved a slightly higher AUC, while Random Forest achieved the lower misclassification error (Fig. 5).
| | Random Forest | kNN |
|---|---|---|
| mmce | 0.1388889 | 0.1522222 |
| auc | 0.8227643 | 0.8567428 |
When the models were trained on all features, the Random Forest classifier achieved the lowest mean misclassification error (0.139), while kNN achieved the higher AUC (0.857) (Table 4). The confusion matrices of the two classifiers on the test data are shown below.
## predicted
## true Busy Not_Busy
## Busy 13 123 tpr: 0.1 fnr: 0.9
## Not_Busy 2 762 fpr: 0 tnr: 1
## ppv: 0.87 for: 0.14 lrp: 36.51 acc: 0.86
## fdr: 0.13 npv: 0.86 lrm: 0.91 dor: 40.27
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
## predicted
## true Busy Not_Busy
## Busy 51 85 tpr: 0.38 fnr: 0.62
## Not_Busy 52 712 fpr: 0.07 tnr: 0.93
## ppv: 0.5 for: 0.11 lrp: 5.51 acc: 0.85
## fdr: 0.5 npv: 0.89 lrm: 0.67 dor: 8.22
Filter methods assign an importance value to each feature; the features are then ranked by importance and a subset is selected. We created an object named mfv by calling generateFilterValuesData from mlr on the classification task, using the randomForest.importance filter method, as sketched below.
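A minimal sketch of this step (assuming the task object defined earlier):
# Compute per-feature importance values and plot them
mfv <- generateFilterValuesData(task, method = 'randomForest.importance')
plotFilterValues(mfv)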
## Supervised task: datathon
## Type: regr
## Target: status
## Observations: 3000
## Features:
## numerics factors ordered functionals
## 467 0 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
Figure 6. Features selected using a filter selection algorithm in mlr based on (left) information gain and (right) chi-squared (https://mlr-org.github.io/mlr/articles/tutorial/devel/feature_selection.html) (Bischl et al. 2016).
We then ‘fused’ the random forest classification learner with the information.gain filter to train the model. The optimal number of features to keep was determined by 5-fold cross-validation, using information gain as the importance measure and selecting the 10 features with the highest importance. In each resampling iteration, feature selection was carried out on the corresponding training data before fitting the learner; a sketch of this setup follows.
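A minimal sketch consistent with the tuning log below (object names assumed; the information.gain filter requires the FSelector package):
# Fuse the RF learner with the information gain filter and tune fw.abs
lrn_filtered <- makeFilterWrapper(learner2, fw.method = 'information.gain')
ps_fw <- makeParamSet(makeDiscreteParam('fw.abs', values = 10))
res_fw <- tuneParams(lrn_filtered, task = task,
                     resampling = makeResampleDesc('CV', iters = 5, stratify = TRUE),
                     par.set = ps_fw, control = makeTuneControlGrid())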
## [Tune] Started tuning learner classif.randomForest.filtered for parameter set:
## Type len Def Constr Req Tunable Trafo
## fw.abs discrete - - 10 - TRUE -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: fw.abs=10
## [Tune-y] 1: mmce.test.mean=0.1300000; time: 0.4 min
## [Tune] Result: fw.abs=10 : mmce.test.mean=0.1300000
The selected number of features and the corresponding misclassification error were:
## $fw.abs
## [1] 10
## mmce.test.mean
## 0.13
We then fused the random forest learner with the chi-squared filter, using the tuned number of features, before training the model:
Then applied getFilteredFeatures to the trained model to view the selected features, roughly as sketched below.
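A hedged sketch of these two steps (object names assumed; the chi.squared filter also requires FSelector):
# Fuse the RF learner with the chi-squared filter, train, and inspect
lrn_chi <- makeFilterWrapper(learner2, fw.method = 'chi.squared', fw.abs = 10)
mod_chi <- train(lrn_chi, task)
getFilteredFeatures(mod_chi)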
Used a random search with ten iterations (the wrapper approach) on the random forest classifier and the classification task; see the sketch below.
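A minimal sketch of the random-search wrapper selection (names assumed):
# Wrapper feature selection: random search with ten iterations
ctrl_rand <- makeFeatSelControlRandom(maxit = 10)
sfeats <- selectFeatures(learner2, task = task,
                         resampling = makeResampleDesc('CV', iters = 5, stratify = TRUE),
                         control = ctrl_rand)
sfeats$x  # the selected feature subset
sfeats$y  # the cross-validated mmce of that subset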
## mmce.test.mean
## 0.1280952
| Filter | Wrapper |
|---|---|
| RouteID | WeekDay |
| StopID.19943 | RouteID |
| StopID.20021 | StopID.19832 |
| StopID.20025 | StopID.19837 |
| SuburbName.Caulfield.East | StopID.19839 |
| SuburbName.Footscray | StopID.19840 |
| SuburbName.Sunshine | StopID.19844 |
| PostCode.3020 | StopID.19846 |
| LocalGovernmentArea.Greater.Dandenong | StopID.19848 |
| LocalGovernmentArea.Maribyrnong | StopID.19849 |
Comparing the misclassification error rates, the random search wrapper method outperformed the chi-squared (filter) method. We then fused the wrapper method into a learner using makeFeatSelWrapper together with makeFeatSelControlRandom and makeResampleDesc objects, roughly as sketched below.
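A sketch of the fused wrapper learner (object names assumed):
# Wrap the RF learner so feature selection is repeated inside each CV fold
lrn_wrapped <- makeFeatSelWrapper(learner2,
                                  resampling = makeResampleDesc('CV', iters = 5, stratify = TRUE),
                                  control = makeFeatSelControlRandom(maxit = 10))
mod_wrapped <- train(lrn_wrapped, task)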
Obtained the confusion matrix by running calculateConfusionMatrix(pred_on_test) and generated the ROC curve, along the lines sketched below.
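A minimal sketch (mod_wrapped is the assumed name of the trained wrapper model):
# Predict on the held-out test data and tabulate the confusion matrix
pred_on_test <- predict(mod_wrapped, newdata = test_data)
calculateConfusionMatrix(pred_on_test)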
## predicted
## true Busy Not_Busy -err.-
## Busy 26 110 110
## Not_Busy 18 746 18
## -err.- 18 110 128
Figure 7. AUC plot for the random forest classifier. RF - random forest classifier trained on all features; RF_spsa_tuned - random forest classifier with tuned hyperparameters, trained on the features selected by RF_wrapper.
The ROC curves show that the random forest classifier with optimized hyperparameters and wrapper-based feature selection performed similarly to the classifier trained on all features.
| | RF (wrapper) |
|---|---|
| mmce | 0.1422222 |
| auc | 0.8300402 |
The random forest classifier using the wrapper method for feature selection had a mean misclassification error of 0.142 and an AUC value of 0.830 (Table 10).
## predicted
## true Busy Not_Busy
## Busy 26 110 tpr: 0.19 fnr: 0.81
## Not_Busy 18 746 fpr: 0.02 tnr: 0.98
## ppv: 0.59 for: 0.13 lrp: 8.11 acc: 0.86
## fdr: 0.41 npv: 0.87 lrm: 0.83 dor: 9.8
Summary of test-set performance:

| Classifier | Feature selection | mmce | AUC | Precision | Recall |
|---|---|---|---|---|---|
| Random Forest | No | 0.118 | 0.881 | 0.67 | 0.16 |
| kNN | No | 0.133 | 0.900 | 0.48 | 0.50 |
| Random Forest | Yes | 0.112 | 0.74 | 0.65 | 0.74 |
Random forest, with and without feature selection, exhibited better precision than kNN, and the classifiers trained on all features had similar AUC values. However, correctly predicting that a stop is congested (a true positive) is the most valuable outcome from the perspective of both passengers and service providers. The random forest classifier with feature selection performed best at correctly identifying true positives (recall).
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.