Introduction

The objective of this project was to build classifiers that predict whether a tram, bus, or train stop will be busy at a particular time of day. The data were obtained from Public Transport Victoria and comprised MYKI touch-on/touch-off transactions from the previous three years. The target feature was derived from the number of transactions at given times of the day and was categorised as busy, slightly busy, or not busy.

Descriptive Features

  • Calendar Year
  • Financial Year
  • Financial Month
  • Calendar Month
  • CalendarMonthSeq
  • CalendarQuarter
  • FinancialQuarter
  • CalendarWeek
  • DayType
  • DayTypeCategory
  • WeekdaySeq
  • WeekDay
  • FinancialMonthSeq
  • FinancialMonthName
  • MonthNumber
  • ABSWeek
  • QuarterName
  • Card_SubType_ID
  • Card_SubType_Desc
  • Payment_Type
  • Fare_Type
  • Concession_Type
  • MI_Card_Group
  • StopLocationID
  • StopNameShort
  • StopNameLong
  • StopType
  • SuburbName
  • PostCode
  • RegionName
  • LocalGovernmentArea
  • StatDivision
  • GPSLat
  • GPSLong
  • Mode
  • BusinessDate
  • DateTime
  • CardID
  • CardType
  • VehicleID
  • ParentRoute
  • RouteID
  • StopID

The target feature has three classes: busy, slightly busy, and not busy. To reiterate, the goal is to predict whether a stop is busy at a particular time of the day.

Methodology

We considered two classifiers: Random Forest (RF) and \(K\)-Nearest Neighbours (KNN). Each classifier was trained to make probability predictions so that we could adjust the prediction threshold to refine performance. We split the full data set into a 70% training set and a 30% test set. Each set preserved the class proportions of the full data, i.e. approximately 16% of stops labelled as busy and 84% as not busy. For the fine-tuning process, we ran five-fold cross-validation with stratified sampling on each classifier; stratified sampling was used to address the class imbalance in the target feature.
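
As a sketch of this setup with mlr (assuming a prepared data frame of descriptive features with target column status, built later in this document as melbDatathonData; the variable names are illustrative):

# Classification task on the prepared modelling data
task <- makeClassifTask(data = melbDatathonData, target = "status")

# Stratified 70/30 holdout split that preserves the class proportions
holdout <- makeResampleInstance(
  makeResampleDesc("Holdout", split = 0.7, stratify = TRUE), task)
train.set <- holdout$train.inds[[1]]
test.set <- holdout$test.inds[[1]]

# Stratified five-fold cross-validation used during fine-tuning
rdesc <- makeResampleDesc("CV", iters = 5, stratify = TRUE)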

Next, for each classifier, we determined the optimal probability threshold. Using the tuned hyperparameters and the optimal thresholds, we made predictions on the test data. During model training (hyperparameter tuning and threshold adjustment), we relied on the mean misclassification error rate (mmce). In addition to mmce, we used the confusion matrix on the test data to evaluate the classifiers’ performance. The modelling was implemented in R with the mlr package (Bischl et al. 2016).
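
A sketch of the threshold adjustment and test-set evaluation, with names as above (the learner must be created with predict.type = "prob"):

# Probability-predicting learner; random forest shown as an example
lrn <- makeLearner("classif.randomForest", predict.type = "prob")

# Train on the training indices, predict on the held-out test set
model <- train(lrn, task, subset = train.set)
pred <- predict(model, task = task, subset = test.set)

# Find the probability threshold that minimises mmce and apply it
thr <- tuneThreshold(pred = pred, measure = mmce)
pred <- setThreshold(pred, thr$th)

# Evaluate with mmce and the confusion matrix
performance(pred, measures = mmce)
calculateConfusionMatrix(pred)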

Hyperparameter Fine-Tuning

Random Forest

We fine-tuned the number of variables randomly sampled as candidates at each split (i.e. mtry). For a classification problem, Breiman (2001) suggests mtry = \(\sqrt{p}\), where \(p\) is the number of descriptive features. In our case, \(\sqrt{p} = \sqrt{11} \approx 3.32\). Therefore, we experimented with mtry = 2, 3, and 4. We left other hyperparameters, such as the number of trees to grow, at their default values.
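
A sketch of this grid search with mlr, reusing the task and the stratified cross-validation description rdesc from above:

# Grid over mtry = 2, 3, 4 around sqrt(p)
rf.lrn <- makeLearner("classif.randomForest", predict.type = "prob")
rf.ps <- makeParamSet(makeDiscreteParam("mtry", values = 2:4))
rf.tuned <- tuneParams(rf.lrn, task = task, resampling = rdesc,
                       par.set = rf.ps, control = makeTuneControlGrid(),
                       measures = mmce)
rf.tuned$x  # best mtry
rf.tuned$y  # cross-validated mmce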

\(K\)-Nearest Neighbours

Using the optimal kernel, we ran a grid search over \(k = 2, 3, \ldots, 20\). The outcome was \(k = 20\), with an mmce of 0.199.
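
The corresponding sketch for KNN uses the classif.kknn learner with its "optimal" kernel (this assumes the kknn package is installed):

# Grid search over k = 2, ..., 20 with the optimal kernel
knn.lrn <- makeLearner("classif.kknn", kernel = "optimal", predict.type = "prob")
knn.ps <- makeParamSet(makeDiscreteParam("k", values = 2:20))
knn.tuned <- tuneParams(knn.lrn, task = task, resampling = rdesc,
                        par.set = knn.ps, control = makeTuneControlGrid(),
                        measures = mmce)
knn.tuned$x  # best k
knn.tuned$y  # cross-validated mmce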

Feature Selection

Feature selection was used to identify an optimal subset of the available features. Selecting a subset of relevant features can make training faster, reduce model complexity, improve accuracy, and reduce overfitting. There are three broad categories of feature selection methods: filter methods, wrapper methods, and embedded methods.

Filter method

The filter method assigns an importance value to each feature. Based on these values, the features are ranked and a feature subset is selected. The learner was fused with the filter method for the training of each classification model, as sketched below.
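
In mlr, fusing a learner with a filter can be sketched with makeFilterWrapper; the filter method and the number of retained features below are illustrative choices:

# Fuse the learner with an information-gain filter (needs FSelector);
# the filter is re-applied inside every resampling fold
filtered.lrn <- makeFilterWrapper(
  learner = makeLearner("classif.randomForest", predict.type = "prob"),
  fw.method = "information.gain",
  fw.abs = 5)  # keep the five top-ranked features
resample(filtered.lrn, task, resampling = rdesc, measures = mmce)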

Wrapper method

The wrapper method used the performance of a learning classifier to assess the usefulness of a feature set. To select a feature subset, the learner was trained repeatedly on different feature subsets, and the subset that led to the best learner performance was chosen (see the sketch below).
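
A sketch of sequential forward selection wrapped around the learner (the control settings are illustrative):

# Sequential forward search: start from the empty set and greedily
# add the feature that most improves cross-validated mmce
ctrl <- makeFeatSelControlSequential(method = "sfs", alpha = 0.001)
sel <- selectFeatures(learner = makeLearner("classif.randomForest"),
                      task = task, resampling = rdesc,
                      measures = mmce, control = ctrl)
sel$x  # the selected feature subset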

Install and load required libraries

Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre-9.0.4')
library(dplyr)
library(readr)
library(tidyverse) # for ggplot2 and data wrangling
library(lubridate)
library(mlr)
library(rJava)
library(FSelector) 
library(taRifx)
library(ggplot2)
library(gridExtra)
library(gmodels)
library(GGally)
library(cowplot)
library(tidyr)
library(magrittr)
library(moments)
library(purrr)
library(data.table)
library(latex2exp)
library(caret)
library(robustHD)
library(spFSR)
library(rjson)
library(party)
library(knitr)
library(kableExtra)
library(stringr)
library(mlbench)
library(e1071)
library(MASS)
library(ggmap)
set.seed(999) 

Load datasets

Load Public Holidays Dataset

pubHol <- read.csv("publicHol.csv")
pubHol$Date <- ymd(pubHol$Date)
head(pubHol)

Load Calendar Dataset

calendar <- fread('E:\\Datathon\\MelbDatathon2018\\calendar.txt')
colnames(calendar) <- c('', 'Date', 'CalendarYear', 'FinancialYear', 'FinancialMonth', 'CalendarMonth', 'CalendarMonthSeq', 'CalendarQuarter', 'FinancialQuarter', 'CalendarWeek', 'FinancialWeek', 'DayType', 'DayTypeCategory', 'WeekdaySeq', 'WeekDay', 'Day', 'FinancialMonthSeq', 'FinancialMonthName', 'MonthNumber', 'ABSWeek', 'WeekEnding', 'QuarterName')
calendar <- calendar[,-1]
head(calendar)

Load Card Types Dataset

card_types <- fread('E:\\Datathon\\MelbDatathon2018\\card_types.txt')
colnames(card_types) <- c('CardType', 'Card_SubType_Desc', 'Payment_Type', 'Fare_Type', 'Concession_Type', 'MI_Card_Group')
head(card_types)

Load Stop Locations Dataset

stop_loc <- fread('E:\\Datathon\\MelbDatathon2018\\stop_locations.txt')
colnames(stop_loc) <- c('StopID', 'StopNameShort', 'StopNameLong', 'StopType', 'SuburbName', 'PostCode', 'RegionName', 'LocalGovernmentArea', 'StatDivision', 'GPSLat', 'GPSLong')
head(stop_loc)

Build the scan-on and scan-off folder paths and list their transaction files

ScanOnFolderMaster <- 'E:\\Datathon\\MelbDatathon2018\\Samp_X\\ScanOnTransaction'
ScanOffFolderMaster <- 'E:\\Datathon\\MelbDatathon2018\\Samp_X\\ScanOffTransaction'

mySamp <- 0
condDatON <- FALSE
condDatOFF <- FALSE

ScanOnFolder <- sub("X",mySamp,ScanOnFolderMaster)
ScanOffFolder <- sub("X",mySamp,ScanOffFolderMaster)
onFiles <- list.files(ScanOnFolder,recursive = TRUE,full.names = TRUE, pattern = "\\.txt$")
offFiles <- list.files(ScanOffFolder,recursive = TRUE,full.names = TRUE, pattern = "\\.txt$")

Determine the total number of files

allFiles <- union(onFiles,offFiles)
cat("\nthere are", length(allFiles),'files')
## 
## there are 40 files

Extract data sample

Extract a sample of specific rows and columns

#myFiles <- onFiles[1:5]

# Take a random sample of 20 of the scan-on files
myFiles <- sample(onFiles, 20, replace = FALSE)

#Extract information for trains, trams or buses from files 
first <- TRUE
count <- 0

for (myOn in myFiles){
  #cmd <- paste0("gzip -dc ", myOn)

  dt <- fread(myOn, na.strings="")
  
  #Choose the mode to extract (V1: 1 = bus, 2 = train, 3 = tram)
  dt_train <- subset(dt, V1 == 2)
  #dt_tram <- subset(dt, V1 == 3)
  #dt_bus <- subset(dt, V1 == 1)
  #dt_train <- subset(dt_train, V7 != "Headless Mode")
 
  #stack the records together
  if (first == TRUE){
    #Initialise with a random sample of 3,000 rows from the first file
    allON <- dt_train[sample(nrow(dt_train), 3000), ]
    first <- FALSE
  } else {
    #Append all rows from each subsequent file
    allON <- rbindlist(list(dt_train, allON))
  }
  count <- count + 1
  cat('\n',count,' of ',length(myFiles))
}
## 
##  1  of  20
##  2  of  20
##  3  of  20
##  4  of  20
##  5  of  20
##  6  of  20
##  7  of  20
##  8  of  20
##  9  of  20
##  10  of  20
##  11  of  20
##  12  of  20
##  13  of  20
##  14  of  20
##  15  of  20
##  16  of  20
##  17  of  20
##  18  of  20
##  19  of  20
##  20  of  20
cat('\n there are ', format(nrow(allON),big.mark = ","),'rows')
## 
##  there are  6,953,096 rows
#Add column names
colnames(allON) <- c('Mode','Date','DateTime','CardID','CardType','VehicleID','ParentRoute','RouteID','StopID')

Summary of allON table containing MYKI transaction data

summary(allON)
##       Mode       Date             DateTime             CardID        
##  Min.   :2   Length:6953096     Length:6953096     Min.   :   15990  
##  1st Qu.:2   Class :character   Class :character   1st Qu.:10992050  
##  Median :2   Mode  :character   Mode  :character   Median :14850010  
##  Mean   :2                                         Mean   :14415174  
##  3rd Qu.:2                                         3rd Qu.:18511970  
##  Max.   :2                                         Max.   :24451870  
##     CardType        VehicleID ParentRoute           RouteID     
##  Min.   : 0.000   Min.   :0   Length:6953096     Min.   : 1.00  
##  1st Qu.: 1.000   1st Qu.:0   Class :character   1st Qu.: 1.00  
##  Median : 1.000   Median :0   Mode  :character   Median : 7.00  
##  Mean   : 6.007   Mean   :0                      Mean   : 7.69  
##  3rd Qu.: 2.000   3rd Qu.:0                      3rd Qu.:14.00  
##  Max.   :71.000   Max.   :0                      Max.   :20.00  
##      StopID     
##  Min.   :19827  
##  1st Qu.:19924  
##  Median :20021  
##  Mean   :37531  
##  3rd Qu.:64403  
##  Max.   :64408

Data pre-processing

Remove Mode, ParentRoute and VehicleID features

allON <- allON[,-c(6,7)]  # drop VehicleID and ParentRoute
allON <- allON[,-1]       # drop Mode

Convert the DateTime field to a POSIXct date-time

allON[,DateTime := as.POSIXct(DateTime,tz='Australia/Sydney')]
allON[,unixTime := as.numeric(DateTime)]

Combine allON and Card Type datasets

allON_card <- left_join(allON, card_types, by = "CardType")
head(allON_card)

Combine with Stop Locations dataset (Greater Metro district only)

allON_card_Loc <- left_join(allON_card, stop_loc, by = "StopID")
allON_card_Loc <- subset(allON_card_Loc, allON_card_Loc$StatDivision == "Greater Metro")
head(allON_card_Loc)

Combine with Calendar dataset

allON_card_Loc_Cal <- left_join(allON_card_Loc, calendar, by = "Date")
head(allON_card_Loc_Cal)

Dimensions of entire dataset

dim(allON_card_Loc_Cal)
## [1] 4296785      42

Combine with public holidays datasets

pubHol$Date <- as.character(pubHol$Date)
allON_card_Loc_Cal_Hol <- left_join(allON_card_Loc_Cal, pubHol, by = "Date")
allON_card_Loc_Cal_Hol$HolidayName <- as.character(allON_card_Loc_Cal_Hol$HolidayName)
allON_card_Loc_Cal_Hol$HolidayName[is.na(allON_card_Loc_Cal_Hol$HolidayName)] <- "noHoliday"

Determine the number of tap-on transactions at each Stop along each Route per Hour.

meta <- allON_card_Loc_Cal %>%
  group_by(DateTime, FinancialMonth, FinancialWeek, WeekDay,
           Hour = cut(DateTime, breaks = "1 hour", labels = FALSE, ordered_result = TRUE),
           RouteID, StopID) %>%
  summarize(Number = n())
head(meta)

Add location data and stop type to table

loc <- data.frame(StopID = stop_loc$StopID, StopType = stop_loc$StopType, SuburbName = stop_loc$SuburbName, PostCode = stop_loc$PostCode, LocalGovernmentArea = stop_loc$LocalGovernmentArea)
head(loc)

Combine transaction (meta) table with stop (loc) table

meta_loc <- left_join(meta, loc, by = "StopID")
meta_loc2 <- meta_loc[1:2000,]
head(meta_loc2)

Determine transaction frequency per hour for each stop

x <- TRUE
while (length(meta_loc2$StopID) >= 1) {
  # Take the next remaining stop
  num <- as.numeric(meta_loc2[1, 'StopID'])
  stop_trains <- meta_loc2[meta_loc2$StopID == num, ]
  stop_trains <- stop_trains[order(stop_trains$DateTime), ]
  allONBreaks <- data.frame(stop_trains,
                            cuts = cut(stop_trains$DateTime, breaks = "1 hour",
                                       labels = FALSE, ordered_result = TRUE))
  # Count this stop's transactions in each hourly bin
  myDates <- table(cut(allONBreaks$DateTime, breaks = "hour"))
  myDatesN <- as.data.frame(myDates)
  myDatesN <- merge(myDatesN, stop_trains)

  if (x == TRUE) {
    all_comb <- myDatesN
    x <- FALSE
  } else {
    all_comb <- rbind(all_comb, myDatesN)
  }
  # Drop the processed stop before the next iteration
  meta_loc2 <- meta_loc2[meta_loc2$StopID != num, ]
}
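
For reference, a compact dplyr sketch of the same per-stop hourly tally (an alternative formulation; the loop above is what generated the results shown below):

# Attach to each transaction the count of transactions at the same
# stop within the same hourly bin
all_comb_alt <- meta_loc %>%
  ungroup() %>%
  slice(1:2000) %>%
  mutate(HourBin = cut(DateTime, breaks = "hour")) %>%
  group_by(StopID, HourBin) %>%
  mutate(Freq = n()) %>%
  ungroup()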

Data Visualisation

Plot distribution of transaction frequency for each stop per hour

ggplot(all_comb, aes(x = Freq)) + geom_histogram(fill = "cyan", colour = "black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Set initial threshold for ML training

# Label a stop-hour as Busy when it has 12 or more transactions
status <- ifelse(all_comb$Freq >= 12, "Busy", "Not Busy")

mynewDates <- cbind(all_comb, status) 
locData <- data.frame("StopID" = stop_loc$StopID, "Lat" = stop_loc$GPSLat, "Long" = stop_loc$GPSLong)
melbDatathonData <- left_join(mynewDates, locData, by = "StopID")
head(melbDatathonData)

Save data for training ML algorithms

write.csv(melbDatathonData, file = "melbDatathonSampleData.csv")

Plot data showing the threshold cut-off

ggplot(melbDatathonData, aes(x = Freq, fill = status)) +
  geom_histogram(colour = "black") +
  labs(x = "Number of MYKI Transactions", y = "Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot stop congestion for each train line

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggmap':
## 
##     wind
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:latex2exp':
## 
##     TeX
## The following object is masked from 'package:taRifx':
## 
##     distinct
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p7 <- ggplot(melbDatathonData, aes(x = RouteID, fill = status)) +
  geom_bar() +
  theme(axis.title.y = element_text(size = 20, face = "bold")) +
  xlab("") + ylab("Count") +
  scale_x_continuous(limits = c(0, 21),
                     breaks = c(1, 2, 3, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20),
                     labels = c("Werribee", "Williamstown", "Sunbury", "Flemington",
                                "Craigieburn", "Upfield", "Lilydale", "Belgrave",
                                "Ashburton", "Glen Waverly", "Pakenham", "Cranbourne",
                                "Frankston", "Sandringham", "Mernda", "Hurstbridge")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8))
gg2 <- ggplotly(p7)
gg2

Plot of MYKI transaction frequency per hour for each stop in each suburb

p4 <- ggplot(melbDatathonData, aes(x = SuburbName, fill = status)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) + xlab("Suburb")
gg1 <- ggplotly(p4)
gg1

Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.