The objective of this project was to build classifiers that predict whether a tram, bus, or train stop will be busy at a particular time of day. The data were obtained from Public Transport Victoria and comprised MYKI touch-on/touch-off transactions for the previous three years. The target feature was derived from the number of transactions at given times of the day and has three classes: busy, slightly busy, and not busy.
We considered two classifiers: Random Forest (RF) and \(K\)-Nearest Neighbour (KNN). Each classifier was trained to output probability predictions so that we could adjust the prediction threshold to refine performance. We split the full data set into a 70% training set and a 30% test set, with each set preserving the class proportions of the full data (approximately 16% of stops labelled as busy and 84% as not busy). For the fine-tuning process, we ran five-fold cross-validation with stratified sampling on each classifier; stratification was used to cater for the class imbalance in the target feature.
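As a minimal sketch of this setup, assuming the cleaned data frame melbDatathonData and target column status used later in this report (the exact pipeline may differ):

melbDatathonData$status <- factor(melbDatathonData$status)
# Stratified 70/30 split: createDataPartition preserves the class proportions
trainIdx <- caret::createDataPartition(melbDatathonData$status, p = 0.7, list = FALSE)
trainSet <- melbDatathonData[trainIdx, ]
testSet  <- melbDatathonData[-trainIdx, ]
# Classification task and five-fold stratified cross-validation in mlr
task  <- makeClassifTask(data = trainSet, target = "status")
rdesc <- makeResampleDesc("CV", iters = 5, stratify = TRUE)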
Next, for each classifier, we determined the optimal probability threshold. Using the tuned hyperparameters and the optimal thresholds, we made predictions on the test data. During model training (hyperparameter tuning and threshold adjustment), we relied on the mean misclassification error rate (mmce). In addition to mmce, we used the confusion matrix on the test data to evaluate each classifier's performance. The modelling was implemented in R with the mlr package (Bischl et al. 2016).
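For illustration, a hedged sketch of the threshold adjustment and test-set evaluation with mlr, continuing from the sketch above (the 0.4 cut-off is purely illustrative):

lrn  <- makeLearner("classif.randomForest", predict.type = "prob")
mod  <- mlr::train(lrn, task)        # mlr::train avoids the clash with caret::train
pred <- predict(mod, newdata = testSet)
pred2 <- setThreshold(pred, 0.4)     # shift the probability cut-off for the positive class
performance(pred2, measures = mmce)  # mean misclassification error
calculateConfusionMatrix(pred2)      # confusion matrix on the test data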
We fine-tuned the number of variables randomly sampled as candidates at each split (i.e. mtry). For a classification problem, Breiman (2001) suggests mtry = \(\sqrt{p}\), where \(p\) is the number of descriptive features; in our case \(\sqrt{p} = \sqrt{11} \approx 3.32\). We therefore experimented with mtry = 2, 3, and 4, leaving the other hyperparameters, such as the number of trees to grow, at their default values.
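A sketch of this grid search with mlr, reusing the task and rdesc objects from above:

ps <- makeParamSet(makeDiscreteParam("mtry", values = c(2, 3, 4)))
tunedRF <- tuneParams(makeLearner("classif.randomForest", predict.type = "prob"),
                      task = task, resampling = rdesc, par.set = ps,
                      control = makeTuneControlGrid(), measures = mmce)
tunedRF$x  # the mtry value with the lowest cross-validated mmce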
Using the optimal kernel, we ran a grid search over \(k = 2, 3, \ldots, 20\). The best result was \(k = 20\), with an mmce test error of 0.199.
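A comparable sketch for the KNN search with mlr's kknn learner; the kernel candidates listed are assumptions about what was searched:

psKnn <- makeParamSet(
  makeDiscreteParam("k", values = 2:20),
  makeDiscreteParam("kernel", values = c("rectangular", "gaussian", "optimal"))
)
tunedKnn <- tuneParams(makeLearner("classif.kknn", predict.type = "prob"),
                       task = task, resampling = rdesc, par.set = psKnn,
                       control = makeTuneControlGrid(), measures = mmce)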
Feature selection was used to identify an optimal subset of the available features. Selecting a subset of relevant features can make training faster, reduce model complexity, improve accuracy, and reduce overfitting. There are three broad categories of feature selection methods: filter methods, wrapper methods, and embedded methods.
The filter method assigns an importance value to each feature; the features are ranked by these values and a subset is selected. The learner was fused with the filter method for training each classification model, as sketched below.
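A minimal sketch of fusing a learner with a filter in mlr; the information-gain filter (computed via the FSelector package) and the number of features kept (fw.abs = 6) are illustrative choices:

fv <- generateFilterValuesData(task, method = "information.gain")  # importance value per feature
filtLrn <- makeFilterWrapper(makeLearner("classif.randomForest", predict.type = "prob"),
                             fw.method = "information.gain", fw.abs = 6)
filtMod <- mlr::train(filtLrn, task)  # the filter is re-applied inside every training call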
The wrapper method used the performance of a learning classifier to assess the usefulness of a feature set. To select a feature subset, the learner was trained repeatedly on different feature subsets, and the subset that led to the best learner performance was chosen; a sketch follows.
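A sketch of the wrapper approach using mlr's sequential forward search; "sfs" is one of several available search strategies:

ctrlSfs <- makeFeatSelControlSequential(method = "sfs")
sfRes <- selectFeatures(makeLearner("classif.randomForest"), task = task,
                        resampling = rdesc, control = ctrlSfs, measures = mmce)
sfRes$x  # the feature subset with the best cross-validated performance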
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre-9.0.4')  # point rJava at a local Java installation
library(dplyr)
library(readr)
library(tidyverse) # for ggplot2 and data wrangling
library(lubridate)
library(mlr)
library(rJava)
library(FSelector)
library(taRifx)
library(ggplot2)
library(gridExtra)
library(gmodels)
library(GGally)
library(cowplot)
library(tidyr)
library(magrittr)
library(moments)
library(purrr)
library(data.table)
library(latex2exp)
library(caret)
library(robustHD)
library(spFSR)
library(rjson)
library(party)
library(knitr)
library(kableExtra)
library(stringr)
library(mlbench)
library(e1071)
library(MASS)
library(ggmap)
set.seed(999)
pubHol <- read.csv("publicHol.csv")
pubHol$Date <- ymd(pubHol$Date)
head(pubHol)
calendar <- fread('E:\\Datathon\\MelbDatathon2018\\calendar.txt')
colnames(calendar) <- c('', 'Date', 'CalendarYear', 'FinancialYear', 'FinancialMonth', 'CalendarMonth', 'CalendarMonthSeq', 'CalendarQuarter', 'FinancialQuarter', 'CalendarWeek', 'FinancialWeek', 'DayType', 'DayTypeCategory', 'WeekdaySeq', 'WeekDay', 'Day', 'FinancialMonthSeq', 'FinancialMonthName', 'MonthNumber', 'ABSWeek', 'WeekEnding', 'QuarterName')
calendar <- calendar[,-1]
head(calendar)
card_types <- fread('E:\\Datathon\\MelbDatathon2018\\card_types.txt')
colnames(card_types) <- c('CardType', 'Card_SubType_Desc', 'Payment_Type', 'Fare_Type', 'Concession_Type', 'MI_Card_Group')
head(card_types)
stop_loc <- fread('E:\\Datathon\\MelbDatathon2018\\stop_locations.txt')
colnames(stop_loc) <- c('StopID', 'StopNameShort', 'StopNameLong', 'StopType', 'SuburbName', 'PostCode', 'RegionName', 'LocalGovernmentArea', 'StatDivision', 'GPSLat', 'GPSLong')
head(stop_loc)
ScanOnFolderMaster <- 'E:\\Datathon\\MelbDatathon2018\\Samp_X\\ScanOnTransaction'
ScanOffFolderMaster <- 'E:\\Datathon\\MelbDatathon2018\\Samp_X\\ScanOffTransaction'
mySamp <- 0
condDatON <- FALSE
condDatOFF <- FALSE
ScanOnFolder <- sub("X",mySamp,ScanOnFolderMaster)
ScanOffFolder <- sub("X",mySamp,ScanOffFolderMaster)
onFiles <- list.files(ScanOnFolder,recursive = TRUE,full.names = TRUE, pattern = "\\.txt$")
offFiles <- list.files(ScanOffFolder,recursive = TRUE,full.names = TRUE, pattern = "\\.txt$")
allFiles <- union(onFiles,offFiles)
cat("\nthere are", length(allFiles),'files')
##
## there are 40 files
#myFiles <- onFiles[1:5]
# Take a random sample of 20 scan-on files and store them in myFiles
myFiles <- sample(onFiles, 20, replace = FALSE)
# Extract information for trains, trams or buses from the files
first <- TRUE
count <- 0
for (myOn in myFiles){
  #cmd <- paste0("gzip -dc ", myOn)
  dt <- fread(myOn, na.strings = "")
  # Choose the mode (train, tram or bus) to extract; V1 == 2 keeps trains
  dt_train <- subset(dt, V1 == 2)
  #dt_tram <- subset(dt, V1 == 3)
  #dt_bus <- subset(dt, V1 == 1)
  #dt_train <- subset(dt_train, V7 != "Headless Mode")
  # Stack the records together; note that only the first file is
  # subsampled, while subsequent files are appended in full
  if (first == TRUE){
    # Randomly sample a given number of rows of data
    allON <- dt_train[sample(nrow(dt_train), 3000), ]
    first <- FALSE
  } else {
    l <- list(dt_train, allON)
    allON <- rbindlist(l)
  }
  count <- count + 1
  cat('\n', count, ' of ', length(myFiles))
}
##
## 1 of 20
## 2 of 20
## 3 of 20
## 4 of 20
## 5 of 20
## 6 of 20
## 7 of 20
## 8 of 20
## 9 of 20
## 10 of 20
## 11 of 20
## 12 of 20
## 13 of 20
## 14 of 20
## 15 of 20
## 16 of 20
## 17 of 20
## 18 of 20
## 19 of 20
## 20 of 20
cat('\n there are ', format(nrow(allON),big.mark = ","),'rows')
##
## there are 6,953,096 rows
#Add column names
colnames(allON) <- c('Mode','Date','DateTime','CardID','CardType','VehicleID','ParentRoute','RouteID','StopID')
summary(allON)
## Mode Date DateTime CardID
## Min. :2 Length:6953096 Length:6953096 Min. : 15990
## 1st Qu.:2 Class :character Class :character 1st Qu.:10992050
## Median :2 Mode :character Mode :character Median :14850010
## Mean :2 Mean :14415174
## 3rd Qu.:2 3rd Qu.:18511970
## Max. :2 Max. :24451870
## CardType VehicleID ParentRoute RouteID
## Min. : 0.000 Min. :0 Length:6953096 Min. : 1.00
## 1st Qu.: 1.000 1st Qu.:0 Class :character 1st Qu.: 1.00
## Median : 1.000 Median :0 Mode :character Median : 7.00
## Mean : 6.007 Mean :0 Mean : 7.69
## 3rd Qu.: 2.000 3rd Qu.:0 3rd Qu.:14.00
## Max. :71.000 Max. :0 Max. :20.00
## StopID
## Min. :19827
## 1st Qu.:19924
## Median :20021
## Mean :37531
## 3rd Qu.:64403
## Max. :64408
# Drop VehicleID and ParentRoute (columns 6-7), then the constant Mode column
allON <- allON[,-c(6,7)]
allON <- allON[,-1]
# Parse timestamps and add the time as Unix seconds
allON[,DateTime := as.POSIXct(DateTime,tz='Australia/Sydney')]
allON[,unixTime := as.numeric(DateTime)]
allON_card <- left_join(allON, card_types, by = "CardType")
head(allON_card)
allON_card_Loc <- left_join(allON_card, stop_loc, by = "StopID")
allON_card_Loc <- subset(allON_card_Loc, allON_card_Loc$StatDivision == "Greater Metro")
head(allON_card_Loc)
allON_card_Loc_Cal <- left_join(allON_card_Loc, calendar, by = "Date")
head(allON_card_Loc_Cal)
dim(allON_card_Loc_Cal)
## [1] 4296785 42
pubHol$Date <- as.character(pubHol$Date)
allON_card_Loc_Cal_Hol <- left_join(allON_card_Loc_Cal, pubHol, by = "Date")
allON_card_Loc_Cal_Hol$HolidayName <- as.character(allON_card_Loc_Cal_Hol$HolidayName)
allON_card_Loc_Cal_Hol$HolidayName[is.na(allON_card_Loc_Cal_Hol$HolidayName)] <- "noHoliday"
meta <- allON_card_Loc_Cal %>%
  group_by(DateTime, FinancialMonth, FinancialWeek, WeekDay,
           Hour = cut(DateTime, breaks = "1 hour", labels = FALSE, ordered_result = TRUE),
           RouteID, StopID) %>%
  summarize(Number = n())
head(meta)
loc <- data.frame(StopID = stop_loc$StopID, StopType = stop_loc$StopType, SuburbName = stop_loc$SuburbName, PostCode = stop_loc$PostCode, LocalGovernmentArea = stop_loc$LocalGovernmentArea)
head(loc)
meta_loc <- left_join(meta, loc, by = "StopID")
meta_loc2 <- meta_loc[1:2000,]  # work with the first 2,000 rows only
head(meta_loc2)
x <- TRUE
# For each stop, count transactions per one-hour window and stack the results
while (length(meta_loc2$StopID) >= 1) {
  num <- as.numeric(meta_loc2[1, 'StopID'])  # next unprocessed stop
  stop_trains <- meta_loc2[meta_loc2$StopID == num, ]
  # sort() does not modify in place, so assign the ordered rows back
  stop_trains <- stop_trains[order(stop_trains$DateTime), ]
  allONBreaks <- data.frame(stop_trains,
                            cuts = cut(stop_trains$DateTime, breaks = "1 hour",
                                       labels = FALSE, ordered_result = TRUE))
  myDates <- table(cut(allONBreaks$DateTime, breaks = "hour"))  # transactions per hour
  myDatesN <- as.data.frame(myDates)
  myDatesN <- merge(myDatesN, stop_trains)
  if (x == TRUE) {
    all_comb <- myDatesN
    x <- FALSE
  } else {
    all_comb <- rbind(all_comb, myDatesN)
  }
  meta_loc2 <- meta_loc2[meta_loc2$StopID != num, ]  # drop the processed stop
}
ggplot(all_comb, aes(x = Freq)) + geom_histogram(fill = "cyan", colour = "black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Label an hour "Busy" when it records at least 12 transactions (vectorised ifelse)
status <- ifelse(all_comb$Freq >= 12, "Busy", "Not Busy")
mynewDates <- cbind(all_comb, status)
locData <- data.frame("StopID" = stop_loc$StopID, "Lat" = stop_loc$GPSLat, "Long" = stop_loc$GPSLong)
melbDatathonData <- left_join(mynewDates, locData, by = "StopID")
head(melbDatathonData)
write.csv(melbDatathonData, file = "melbDatathonSampleData.csv")
ggplot(melbDatathonData, aes(x = Freq, fill = status)) +
  geom_histogram(colour = "black") +
  labs(x = "Number of MYKI Transactions", y = "Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Plot stop congestion for each train line
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggmap':
##
## wind
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:latex2exp':
##
## TeX
## The following object is masked from 'package:taRifx':
##
## distinct
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p7 <- ggplot(melbDatathonData, aes(x = RouteID, fill = status)) +
  geom_bar() +
  theme(axis.title.y = element_text(size = 20, face = "bold")) +
  xlab("") + ylab("Count") +
  scale_x_continuous(limits = c(0, 21),
                     breaks = c(1, 2, 3, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20),
                     labels = c("Werribee", "Williamstown", "Sunbury", "Flemington", "Craigieburn",
                                "Upfield", "Lilydale", "Belgrave", "Ashburton", "Glen Waverley",
                                "Pakenham", "Cranbourne", "Frankston", "Sandringham", "Mernda",
                                "Hurstbridge")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8))
gg2 <- ggplotly(p7)
gg2
p4 <- ggplot(melbDatathonData, aes(x = SuburbName, fill = status)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
  xlab("Suburb")
gg1 <- ggplotly(p4)
gg1
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1-5.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5-32.