rm(list = ls())
setwd("/Users/noam/Desktop/")
library(tidyverse) # for data splitting and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(zoo) # for moving averages
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(lubridate) # for mdy formatting
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(forecast) # for forecasting models
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(fpp2) # for loading the data
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✓ fma 2.4 ✓ expsmooth 2.3
##
library(ggplot2) # for visualization
library(ggthemes) # for checking residuals
data <- read.csv("Metra.csv", header = TRUE)
data$DATE <- as.yearmon(paste(data$YEAR, data$MONTH), "%Y %m")
data <- data %>% arrange(mdy(data$DATE))
BNSF <- data %>% filter(LONGNAME == "BNSF")
library(zoo)
BNSF.ts <- ts(BNSF$RIDES, start = c(2002,1), frequency = 12)
ma.trailing <- rollmean(BNSF.ts, k = 12, align = "right")
ma.centered <- ma(BNSF.ts, order = 12)
plot(BNSF.ts, ylab = "Monthly Ridership", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2002,2022), main = "BNSF line Monthly Ridership, 2002 - 2022")
axis(1, at = seq(2002, 2022, 1), labels = format(seq(2002, 2022, 1)))
lines(ma.centered, lwd = 2, col = "blue")
lines(ma.trailing, lwd = 2, lty = 2, col = "red")
legend(x = "center", legend = c("Moving Average", "Trailing Average"),lty = c(1, 2), bty = "n", col = c("blue", 2), lwd = 2)
The drastic drop towards the 20th year of this time series represents significant noise. The drop in the number of passengers riding the BNSF train line every month stemmed from the outbreak of COVID-19 in February and March of 2020 in the United States. The random nature of this drop is evident from the decomposition of the time series, as shown under the graph titled “random”. Aiming to focus on predicting the time series’ level, trend, and seasonality in “normal” circumstances, I decided to refer to a subset of the data that contains information recorded until January 2020, in order to avoid the significant noise. This subset spans for 18 years, and is sufficient for making robust predictions.
data <- data %>% slice(1:2396) # Slicing Feb 2020 and onward to avoid COVID-19 noise
BNSF <- data %>% filter(LONGNAME == "Heritage Corridor")
Heritage <- data %>% filter(LONGNAME == "Heritage Corridor")
Metra <- data %>% filter(LONGNAME == "Metra Electric District Main Line")
Milwaukee <- data %>% filter(LONGNAME == "Milwaukee District North Line")
North <- data %>% filter(LONGNAME == "North Central Service")
Rock <- data %>% filter(LONGNAME == "Rock Island District Main Line")
SouthWest <- data %>% filter(LONGNAME == "SouthWest Service")
Union <- data %>% filter(LONGNAME == "Union Pacific North Line")
After slicing the data, let’s take a look at the behavior of the eleven train lines covered in this data set.
data %>%
ggplot(aes(x = DATE, y = RIDES, group = LONGNAME, colour = LONGNAME)) +
geom_line()+
ggtitle("Metra Monthly Ridership by Line, 2002-2020")+
theme(
text = element_text(size = 10),
plot.title = element_text(hjust = 0.66, size = 15),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
legend.position = c(1.55, 0.85),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.margin = margin(10, 10, 10, 10),
plot.margin = unit(c(0,5.8,0,0), "cm"))
We see that different train lines followed somewhat different ridership patterns over time, with the BNSF line being the most popular among them. For the remainder of this analysis, I will focus on three lines that have demonstrated significantly different ridership patterns over the course of 18 years, from January 2002 to January 2020.
BNSF.ts <- ts(BNSF$RIDES, start = c(2002,1), frequency = 12)
ma.trailing <- rollmean(BNSF.ts, k = 12, align = "right")
ma.centered <- ma(BNSF.ts, order = 12)
plot(BNSF.ts, ylab = "Monthly Ridership", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2002,2020), main = "BNSF Line Monthly Ridership, 2002 - 2020")
axis(1, at = seq(2002, 2020, 1), labels = format(seq(2002, 2020, 1)))
lines(ma.centered, lwd = 2, col = "blue")
lines(ma.trailing, lwd = 2, lty = 2, col = "red")
legend(x = "topleft", legend = c("Moving Average", "Trailing Average"),lty = c(1, 2), bty = "n", col = c("blue", 2), lwd = 2)
North.ts <- ts(North$RIDES, start = c(2002,1), frequency = 12)
ma.trailing <- rollmean(North.ts, k = 12, align = "right")
ma.centered <- ma(North.ts, order = 12)
plot(North.ts, ylab = "Monthly Ridership", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2002,2020), main = "North Line Monthly Ridership, 2002 - 2020")
axis(1, at = seq(2002, 2020, 1), labels = format(seq(2002, 2020, 1)))
lines(ma.centered, lwd = 2, col = "blue")
lines(ma.trailing, lwd = 2, lty = 2, col = "red")
legend(x = "topleft", legend = c("Moving Average", "Trailing Average"),lty = c(1, 2), bty = "n", col = c("blue", 2), lwd = 2)
Metra.ts <- ts(Metra$RIDES, start = c(2002,1), frequency = 12)
ma.trailing <- rollmean(Metra.ts, k = 12, align = "right")
ma.centered <- ma(Metra.ts, order = 12)
plot(Metra.ts, ylab = "Monthly Ridership", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2002,2020), main = "Metra Electric Line Monthly Ridership, 2002 - 2020")
axis(1, at = seq(2002, 2020, 1), labels = format(seq(2002, 2020, 1)))
lines(ma.centered, lwd = 2, col = "blue")
lines(ma.trailing, lwd = 2, lty = 2, col = "red")
legend(x = "bottomleft", legend = c("Moving Average", "Trailing Average"),lty = c(1, 2), bty = "n", col = c("blue", 2), lwd = 2)
result = decompose(BNSF.ts)
plot(result)
Following the exclusion of the COVID-19 drop, we see that the BNSF time series demonstrates a more standard random noise pattern. Considering the graphs of the BNSF, Metra, and North lines, we see that each line followed a different ridership trend. The goal of this project is to evaluate the performance of three predictive methods using a two-year validation period on 16 years of time series data from the BNSF, Metra, and North lines.
nValid <- 24
nTrain <- length(BNSF.ts) - nValid
BNSF.train.ts <- window(BNSF.ts, start = c(2002, 1), end = c(2002, nTrain))
BNSF.valid.ts <- window(BNSF.ts, start = c(2002, nTrain+1), end = c(2002, nTrain+nValid))
nTrain <- length(Metra.ts) - nValid
Metra.train.ts <- window(BNSF.ts, start = c(2002, 1), end = c(2002, nTrain))
Metra.valid.ts <- window(BNSF.ts, start = c(2002, nTrain+1), end = c(2002, nTrain+nValid))
nTrain <- length(North.ts) - nValid
North.train.ts <- window(BNSF.ts, start = c(2002, 1), end = c(2002, nTrain))
North.valid.ts <- window(BNSF.ts, start = c(2002, nTrain+1), end = c(2002, nTrain+nValid))
BNSF.lm <- tslm(BNSF.train.ts ~ trend + I(trend^2)+I(trend^3))
BNSF.lm.pred <- forecast(BNSF.lm, h = nValid, level = 95)
BNSF.lm.naive <- naive(BNSF.train.ts, h = nValid, level=95)
BNSF.lm.snaive <- snaive(BNSF.train.ts, h = nValid, level=95)
plot(BNSF.lm.naive, ylab = "Monthly Ridership", xlab = "Year", bty = "l",xaxt="n",main="BNSF Line, Simple Naïve Forecast", flty = 2)
axis(1, at = seq(2002, 2018, 1))
lines(BNSF.valid.ts)
grid()
lines(c(2018,2018), c(0, 1800000),lwd=3,col="red")
text(2008, 40000, "Training",cex=1)
text(2019.5, 40000, "Validation",cex=1)
plot(BNSF.lm.snaive, ylab = "Monthly Ridership", xlab = "Year", bty = "l",xaxt="n",main="BNSF Line, Seasonal Naïve Forecast", ylim = c(20000,90000))
axis(1, at = seq(2002, 2018, 1))
lines(BNSF.valid.ts)
grid()
lines(c(2018,2018), c(0, 1800000),lwd=3,col="red")
text(2008, 40000, "Training",cex=1)
text(2019.5, 40000, "Validation",cex=1)
# Find multiple naive forecasts
BNSF.lm.meanf <- meanf(BNSF.train.ts, h=nValid, level=0)
BNSF.lm.naive <- naive(BNSF.train.ts, h = nValid, level = 0)
BNSF.lm.snaive <- snaive(BNSF.train.ts, h = nValid, level=0)
BNSF.lm.drift <- rwf(BNSF.train.ts, h = nValid, drift = TRUE, level = 0)
# ggplot2 methods
# plot multiple forecasts
p <- autoplot(BNSF.ts, ylab = "Monthly Ridership", xlab = "Year", main = "BNSF Line Naïve Models")+autolayer(BNSF.lm.meanf,series="mean")+
autolayer(BNSF.lm.naive,series="naive")+
autolayer(BNSF.lm.snaive,series="snaive")+
autolayer(BNSF.lm.drift,series="drift")+theme_bw()
print(p)
accuracy(BNSF.lm.naive)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 61.16321 3604.393 2670.298 -0.06548805 4.57503 0.8518555 -0.419589
accuracy(BNSF.lm.snaive)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 861.5962 4257.002 3134.684 1.359965 5.307325 1 0.5388567
checkresiduals(BNSF.lm.snaive)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 184.74, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
Setting a benchmark: RMSE of less than 3,500, ME of less than 60, and MAPE of less than 4.5%.
BNSF.lmp <- tslm(BNSF.train.ts ~ trend + I(trend^2) + I(trend^3)) # Justification for a 3rd order polynomial trend
BNSF.pmas <- tslm(BNSF.train.ts ~ trend + I(trend^2) + I(trend^3) + season) # Adding Additive Seasonality
BNSF.pmms <- tslm(BNSF.train.ts ~ (trend + I(trend^2) + I(trend^3)) * season) # Adding Multiplicative seasonality for comparison
BNSF.lmp.pred <- forecast(BNSF.lmp, h = nValid, level = 0.95)
BNSF.pmas.pred <- forecast(BNSF.pmas, h = nValid, level = 0.95)
BNSF.pmms.pred <- forecast(BNSF.pmms, h = nValid, level = 0.95)
plot(BNSF.pmas.pred, ylab = "Monthly Ridership", xlab = "Year", bty = "l",
main = "BNSF Line 3rd Order Polynomial Model with Additive Seasonality", flty = 1)
lines(BNSF.pmas$fitted.values, lwd = 2, col = "blue")
lines(BNSF.lmp$fitted.values, lwd = 2, col = "purple")
lines(BNSF.ts)
plot(BNSF.pmms.pred, ylab = "Monthly Ridership", xlab = "Year", bty = "l",
main = "BNSF Line 3rd Order Polynomial Model with Multiplicative Seasonality", flty = 1)
lines(BNSF.pmms$fitted.values, lwd = 2, col = "orange")
lines(BNSF.ts)
accuracy(BNSF.pmas)
## ME RMSE MAE MPE MAPE MASE
## Training set -2.623125e-13 2745.124 1935.189 -0.219082 3.312406 0.6173475
## ACF1
## Training set 0.399961
accuracy(BNSF.pmms)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.873164e-13 2501.744 1784.753 -0.1765967 3.031852 0.5693565
## ACF1
## Training set 0.4846989
checkresiduals(BNSF.pmas)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals from Linear regression model
## LM test = 47.711, df = 24, p-value = 0.002741
checkresiduals(BNSF.pmms)
##
## Breusch-Godfrey test for serial correlation of order up to 51
##
## data: Residuals from Linear regression model
## LM test = 132.17, df = 51, p-value = 3.966e-09
valid.err <- BNSF.valid.ts - BNSF.pmms.pred$mean
train.err <- BNSF.train.ts - BNSF.pmms$fitted.values
plot(train.err,xlim=c(2002,2020), ylim = c(-15000,17000))
lines(valid.err,col="red")
grid()
mseTrain <- mean( train.err^2 )
mseValid <- mean( valid.err^2 )
print(mseTrain)
## [1] 6258725
print(mseValid)
## [1] 51753905
Evidently, the multiplicative seasonal model demonstrates higher accuracy compared to the additive one, with a root mean squared error of 2500 passengers compared to 2745.
BNSFhw <- ets(BNSF.train.ts, model = "ZZZ")
BNSFhw.pred <- forecast(BNSFhw, h = nValid, level = 0.95)
plot(BNSFhw.pred, ylab = "Monthly Ridership", xlab = "Time",
bty = "l", xaxt = "n", main = "Holt-Winter's Exponential Smoothing Model", flty = 1,)
axis(1, at = seq(2002, 2020, 1), labels = format(seq(2002, 2020, 1)))
lines(BNSFhw.pred$fitted, lwd = 2, col = "blue")
lines(BNSF.ts)
checkresiduals(BNSFhw)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,Ad,A)
## Q* = 16.272, df = 7, p-value = 0.02275
##
## Model df: 17. Total lags used: 24
accuracy(BNSFhw.pred)
## ME RMSE MAE MPE MAPE MASE
## Training set -52.46347 2548.746 1769.331 -0.2532258 3.047338 0.5644367
## ACF1
## Training set 0.08980178
accuracy(BNSF.pmms.pred)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.873164e-13 2501.744 1784.753 -0.1765967 3.031852 0.5693565
## ACF1
## Training set 0.4846989
We see that the 3rd order polynomial model with multiplicative seasonality performed slightly better than the automated Hult-Winter’s model, demonstrating lower ME, RMSE, and MAPE %, with close (though higher) MAE and MASE. In short, it is a close competition between the two models. Importantly, the Hult-Winter’s model outperforms the polynomial model in terms of its Autocorrelation function, which shows no statistical significance throughout the lags.
BNSF.arima <- auto.arima(BNSF.train.ts) # Automated ARIMA model
checkresiduals(BNSF.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(2,0,0)[12]
## Q* = 16.289, df = 21, p-value = 0.7532
##
## Model df: 3. Total lags used: 24
arima.pred <- forecast(BNSF.arima, h = nValid, level = .95)
plot(arima.pred, xlab = "Time", ylab = "Monthly Rides",bty = "l",
main = "BNSF Ridership Forecast with ARIMA")
lines(arima.pred$fitted, lwd = 2, col = "blue")
lines(BNSF.ts)
accuracy(arima.pred)
## ME RMSE MAE MPE MAPE MASE
## Training set 120.1516 2844.571 2065.668 0.05327138 3.546806 0.6589717
## ACF1
## Training set 0.05065384
accuracy(BNSF.pmms.pred)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.873164e-13 2501.744 1784.753 -0.1765967 3.031852 0.5693565
## ACF1
## Training set 0.4846989
The Polynomial model introduced earlier has outperformed R’s automated ARIMA model across most performance measures. In addition, it has outperformed R’s automated Hult-Winter’s model on some parameters. Overall, the three predictive methods examined in this project performed better than the benchmark naïve models which were introduced in section 2 of this project. We see that while both auto regressive integrated mean average models and advanced exponential smoothing methods provide can provide robust predictive estimations of time series data, independently fitted models can reach equally impressive forecasts to the point of outperforming automated methods.