1. Preparation

A. Getting a clear working environment:
rm(list = ls())
B. Setting a working directory and importing relevant libraries:
setwd("/Users/noam/Desktop/")
library(tidyverse)  # for data splitting and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(zoo)        # for moving averages
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lubridate)  # for mdy formatting
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(forecast)   # for forecasting models
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(fpp2)       # for loading the data
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✓ fma       2.4     ✓ expsmooth 2.3
## 
library(ggplot2)    # for visualization 
library(ggthemes)   # for checking residuals
C. Loading the Data:
data      <- read.csv("Metra.csv", header = TRUE)
data$DATE <- as.yearmon(paste(data$YEAR, data$MONTH), "%Y %m")
data      <- data %>% arrange(mdy(data$DATE))
BNSF      <- data %>% filter(LONGNAME == "BNSF")

2.Data Inspection and Processing

A.Moving Average Visualization (The COVID-19 Drop):
library(zoo)
BNSF.ts <- ts(BNSF$RIDES, start = c(2002,1), frequency = 12)
ma.trailing <- rollmean(BNSF.ts, k = 12, align = "right")
ma.centered <- ma(BNSF.ts, order = 12)
plot(BNSF.ts,  ylab = "Monthly Ridership", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2002,2022), main = "BNSF line Monthly Ridership, 2002 - 2022")
axis(1, at = seq(2002, 2022, 1), labels = format(seq(2002, 2022, 1)))
lines(ma.centered, lwd = 2, col = "blue")
lines(ma.trailing, lwd = 2, lty = 2, col = "red")
legend(x = "center", legend = c("Moving Average", "Trailing Average"),lty = c(1, 2), bty = "n", col = c("blue", 2), lwd = 2)   

The drastic drop towards the 20th year of this time series represents significant noise. The drop in the number of passengers riding the BNSF train line every month stemmed from the outbreak of COVID-19 in February and March of 2020 in the United States. The random nature of this drop is evident from the decomposition of the time series, as shown under the graph titled “random”. Aiming to focus on predicting the time series’ level, trend, and seasonality in “normal” circumstances, I decided to refer to a subset of the data that contains information recorded until January 2020, in order to avoid the significant noise. This subset spans for 18 years, and is sufficient for making robust predictions.

B.Slicing the Data and Creating Dataframes by Station:
data      <- data %>% slice(1:2396) # Slicing Feb 2020 and onward to avoid COVID-19 noise
BNSF      <- data %>% filter(LONGNAME == "Heritage Corridor")
Heritage  <- data %>% filter(LONGNAME == "Heritage Corridor")
Metra     <- data %>% filter(LONGNAME == "Metra Electric District Main Line")
Milwaukee <- data %>% filter(LONGNAME == "Milwaukee District North Line")
North     <- data %>% filter(LONGNAME == "North Central Service")
Rock      <- data %>% filter(LONGNAME == "Rock Island District Main Line")
SouthWest <- data %>% filter(LONGNAME == "SouthWest Service")
Union     <- data %>% filter(LONGNAME == "Union Pacific North Line")

After slicing the data, let’s take a look at the behavior of the eleven train lines covered in this data set.

C.Visualizing ridership data from all stations:
data %>%
  ggplot(aes(x = DATE, y = RIDES, group = LONGNAME, colour = LONGNAME)) + 
  geom_line()+
  ggtitle("Metra Monthly Ridership by Line, 2002-2020")+
theme(
    text = element_text(size = 10),
    plot.title = element_text(hjust = 0.66, size = 15),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    legend.position      = c(1.55, 0.85),
    legend.justification = c("right", "top"),
    legend.box.just      = "right",
    legend.margin        = margin(10, 10, 10, 10),
    plot.margin          = unit(c(0,5.8,0,0), "cm"))

We see that different train lines followed somewhat different ridership patterns over time, with the BNSF line being the most popular among them. For the remainder of this analysis, I will focus on three lines that have demonstrated significantly different ridership patterns over the course of 18 years, from January 2002 to January 2020.

D. Post-Slicing Moving Average Visualization:
BNSF.ts <- ts(BNSF$RIDES, start = c(2002,1), frequency = 12)
ma.trailing <- rollmean(BNSF.ts, k = 12, align = "right")
ma.centered <- ma(BNSF.ts, order = 12)
plot(BNSF.ts,  ylab = "Monthly Ridership", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2002,2020), main = "BNSF Line Monthly Ridership, 2002 - 2020")
axis(1, at = seq(2002, 2020, 1), labels = format(seq(2002, 2020, 1)))
lines(ma.centered, lwd = 2, col = "blue")
lines(ma.trailing, lwd = 2, lty = 2, col = "red")
legend(x = "topleft", legend = c("Moving Average", "Trailing Average"),lty = c(1, 2), bty = "n", col = c("blue", 2), lwd = 2) 

North.ts <- ts(North$RIDES, start = c(2002,1), frequency = 12)
ma.trailing <- rollmean(North.ts, k = 12, align = "right")
ma.centered <- ma(North.ts, order = 12)
plot(North.ts,  ylab = "Monthly Ridership", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2002,2020), main = "North Line Monthly Ridership, 2002 - 2020")
axis(1, at = seq(2002, 2020, 1), labels = format(seq(2002, 2020, 1)))
lines(ma.centered, lwd = 2, col = "blue")
lines(ma.trailing, lwd = 2, lty = 2, col = "red")
legend(x = "topleft", legend = c("Moving Average", "Trailing Average"),lty = c(1, 2), bty = "n", col = c("blue", 2), lwd = 2) 

Metra.ts <- ts(Metra$RIDES, start = c(2002,1), frequency = 12)
ma.trailing <- rollmean(Metra.ts, k = 12, align = "right")
ma.centered <- ma(Metra.ts, order = 12)
plot(Metra.ts,  ylab = "Monthly Ridership", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2002,2020), main = "Metra Electric Line Monthly Ridership, 2002 - 2020")
axis(1, at = seq(2002, 2020, 1), labels = format(seq(2002, 2020, 1)))
lines(ma.centered, lwd = 2, col = "blue")
lines(ma.trailing, lwd = 2, lty = 2, col = "red")
legend(x = "bottomleft", legend = c("Moving Average", "Trailing Average"),lty = c(1, 2), bty = "n", col = c("blue", 2), lwd = 2) 

result = decompose(BNSF.ts)
plot(result)

Following the exclusion of the COVID-19 drop, we see that the BNSF time series demonstrates a more standard random noise pattern. Considering the graphs of the BNSF, Metra, and North lines, we see that each line followed a different ridership trend. The goal of this project is to evaluate the performance of three predictive methods using a two-year validation period on 16 years of time series data from the BNSF, Metra, and North lines.

3. Predictive Model Training

A. Data Splitting:
nValid <- 24
nTrain <- length(BNSF.ts) - nValid
BNSF.train.ts <- window(BNSF.ts, start = c(2002, 1), end = c(2002, nTrain))
BNSF.valid.ts <- window(BNSF.ts, start = c(2002, nTrain+1), end = c(2002, nTrain+nValid))


nTrain <- length(Metra.ts) - nValid
Metra.train.ts <- window(BNSF.ts, start = c(2002, 1), end = c(2002, nTrain))
Metra.valid.ts <- window(BNSF.ts, start = c(2002, nTrain+1), end = c(2002, nTrain+nValid))

nTrain <- length(North.ts) - nValid
North.train.ts <- window(BNSF.ts, start = c(2002, 1), end = c(2002, nTrain))
North.valid.ts <- window(BNSF.ts, start = c(2002, nTrain+1), end = c(2002, nTrain+nValid))

B.Benchmark Predictive Methods: Simple Naïve Models

BNSF.lm <- tslm(BNSF.train.ts ~ trend + I(trend^2)+I(trend^3))
BNSF.lm.pred <- forecast(BNSF.lm, h = nValid, level = 95)
BNSF.lm.naive <- naive(BNSF.train.ts, h = nValid, level=95)
BNSF.lm.snaive <- snaive(BNSF.train.ts, h = nValid, level=95)

plot(BNSF.lm.naive,  ylab = "Monthly Ridership", xlab = "Year", bty = "l",xaxt="n",main="BNSF Line, Simple Naïve Forecast", flty = 2)
axis(1, at = seq(2002, 2018, 1)) 
lines(BNSF.valid.ts)
grid()

lines(c(2018,2018), c(0, 1800000),lwd=3,col="red")
text(2008, 40000, "Training",cex=1)
text(2019.5, 40000, "Validation",cex=1)

plot(BNSF.lm.snaive,  ylab = "Monthly Ridership", xlab = "Year", bty = "l",xaxt="n",main="BNSF Line, Seasonal Naïve Forecast", ylim = c(20000,90000))
axis(1, at = seq(2002, 2018, 1)) 
lines(BNSF.valid.ts)
grid()

lines(c(2018,2018), c(0, 1800000),lwd=3,col="red")
text(2008, 40000, "Training",cex=1)
text(2019.5, 40000, "Validation",cex=1)

# Find multiple naive forecasts
BNSF.lm.meanf <-  meanf(BNSF.train.ts, h=nValid, level=0)
BNSF.lm.naive <-  naive(BNSF.train.ts, h = nValid, level = 0)
BNSF.lm.snaive <- snaive(BNSF.train.ts, h = nValid, level=0)
BNSF.lm.drift <-  rwf(BNSF.train.ts, h = nValid, drift = TRUE, level = 0)

# ggplot2 methods
# plot multiple forecasts
p <- autoplot(BNSF.ts, ylab = "Monthly Ridership", xlab = "Year", main = "BNSF Line Naïve Models")+autolayer(BNSF.lm.meanf,series="mean")+
  autolayer(BNSF.lm.naive,series="naive")+
  autolayer(BNSF.lm.snaive,series="snaive")+
  autolayer(BNSF.lm.drift,series="drift")+theme_bw()

print(p)

accuracy(BNSF.lm.naive)
##                    ME     RMSE      MAE         MPE    MAPE      MASE      ACF1
## Training set 61.16321 3604.393 2670.298 -0.06548805 4.57503 0.8518555 -0.419589
accuracy(BNSF.lm.snaive)
##                    ME     RMSE      MAE      MPE     MAPE MASE      ACF1
## Training set 861.5962 4257.002 3134.684 1.359965 5.307325    1 0.5388567
checkresiduals(BNSF.lm.snaive)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 184.74, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

Setting a benchmark: RMSE of less than 3,500, ME of less than 60, and MAPE of less than 4.5%.

C.Predictive Method #1: Fitting a 3rd order ploynomial trend model with seasonality

BNSF.lmp       <- tslm(BNSF.train.ts ~ trend + I(trend^2) + I(trend^3))            # Justification for a 3rd order polynomial trend
BNSF.pmas      <- tslm(BNSF.train.ts ~ trend + I(trend^2) + I(trend^3) + season)   # Adding Additive Seasonality
BNSF.pmms      <- tslm(BNSF.train.ts ~ (trend + I(trend^2) + I(trend^3)) * season) # Adding Multiplicative seasonality for comparison

BNSF.lmp.pred   <- forecast(BNSF.lmp, h = nValid, level = 0.95)
BNSF.pmas.pred  <- forecast(BNSF.pmas, h = nValid, level = 0.95)
BNSF.pmms.pred  <- forecast(BNSF.pmms, h = nValid, level = 0.95)

plot(BNSF.pmas.pred, ylab = "Monthly Ridership", xlab = "Year", bty = "l",
main = "BNSF Line 3rd Order Polynomial Model with Additive Seasonality", flty = 1)
lines(BNSF.pmas$fitted.values, lwd = 2, col = "blue")
lines(BNSF.lmp$fitted.values, lwd = 2, col = "purple")
lines(BNSF.ts)

plot(BNSF.pmms.pred, ylab = "Monthly Ridership", xlab = "Year", bty = "l",
main = "BNSF Line 3rd Order Polynomial Model with Multiplicative Seasonality", flty = 1)
lines(BNSF.pmms$fitted.values, lwd = 2, col = "orange")
lines(BNSF.ts)

accuracy(BNSF.pmas)
##                         ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -2.623125e-13 2745.124 1935.189 -0.219082 3.312406 0.6173475
##                  ACF1
## Training set 0.399961
accuracy(BNSF.pmms)
##                        ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.873164e-13 2501.744 1784.753 -0.1765967 3.031852 0.5693565
##                   ACF1
## Training set 0.4846989
checkresiduals(BNSF.pmas)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 47.711, df = 24, p-value = 0.002741
checkresiduals(BNSF.pmms)

## 
##  Breusch-Godfrey test for serial correlation of order up to 51
## 
## data:  Residuals from Linear regression model
## LM test = 132.17, df = 51, p-value = 3.966e-09
valid.err <- BNSF.valid.ts - BNSF.pmms.pred$mean
train.err <- BNSF.train.ts - BNSF.pmms$fitted.values
plot(train.err,xlim=c(2002,2020), ylim = c(-15000,17000))
lines(valid.err,col="red")
grid()

mseTrain <- mean( train.err^2 )
mseValid <- mean( valid.err^2 )
print(mseTrain)
## [1] 6258725
print(mseValid)
## [1] 51753905

Evidently, the multiplicative seasonal model demonstrates higher accuracy compared to the additive one, with a root mean squared error of 2500 passengers compared to 2745.

D.Predictive Method #2: Automated Holt-Winter’s Exponential Smoothing

BNSFhw <- ets(BNSF.train.ts, model = "ZZZ")
BNSFhw.pred <- forecast(BNSFhw, h = nValid, level = 0.95)
plot(BNSFhw.pred,  ylab = "Monthly Ridership", xlab = "Time",
    bty = "l", xaxt = "n", main = "Holt-Winter's Exponential Smoothing Model", flty = 1,)
axis(1, at = seq(2002, 2020, 1), labels = format(seq(2002, 2020, 1)))
lines(BNSFhw.pred$fitted, lwd = 2, col = "blue")
lines(BNSF.ts)

checkresiduals(BNSFhw)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,A)
## Q* = 16.272, df = 7, p-value = 0.02275
## 
## Model df: 17.   Total lags used: 24
accuracy(BNSFhw.pred)
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -52.46347 2548.746 1769.331 -0.2532258 3.047338 0.5644367
##                    ACF1
## Training set 0.08980178
accuracy(BNSF.pmms.pred)
##                        ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.873164e-13 2501.744 1784.753 -0.1765967 3.031852 0.5693565
##                   ACF1
## Training set 0.4846989

We see that the 3rd order polynomial model with multiplicative seasonality performed slightly better than the automated Hult-Winter’s model, demonstrating lower ME, RMSE, and MAPE %, with close (though higher) MAE and MASE. In short, it is a close competition between the two models. Importantly, the Hult-Winter’s model outperforms the polynomial model in terms of its Autocorrelation function, which shows no statistical significance throughout the lags.

E.Predictive Method #3: Automated Autoregressive Integrated Moving Average Model (ARIMA)

BNSF.arima <- auto.arima(BNSF.train.ts) # Automated ARIMA model
checkresiduals(BNSF.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(2,0,0)[12]
## Q* = 16.289, df = 21, p-value = 0.7532
## 
## Model df: 3.   Total lags used: 24
arima.pred <- forecast(BNSF.arima, h = nValid, level = .95)
plot(arima.pred, xlab = "Time", ylab = "Monthly Rides",bty = "l",
     main = "BNSF Ridership Forecast with ARIMA")
lines(arima.pred$fitted, lwd = 2, col = "blue")
lines(BNSF.ts)

accuracy(arima.pred)
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 120.1516 2844.571 2065.668 0.05327138 3.546806 0.6589717
##                    ACF1
## Training set 0.05065384
accuracy(BNSF.pmms.pred)
##                        ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.873164e-13 2501.744 1784.753 -0.1765967 3.031852 0.5693565
##                   ACF1
## Training set 0.4846989

4. Comparison

A.Discussion

The Polynomial model introduced earlier has outperformed R’s automated ARIMA model across most performance measures. In addition, it has outperformed R’s automated Hult-Winter’s model on some parameters. Overall, the three predictive methods examined in this project performed better than the benchmark naïve models which were introduced in section 2 of this project. We see that while both auto regressive integrated mean average models and advanced exponential smoothing methods provide can provide robust predictive estimations of time series data, independently fitted models can reach equally impressive forecasts to the point of outperforming automated methods.