Time Series Forecasting with Amtrak Data

# Load necessary libraries
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

1. Introduction to Time Series Forecasting

Time series forecasting is the use of a model to predict future values based on previously observed values. This is commonly used in fields such as economics, finance, and transportation. The goal is to model patterns in the data, including trends, seasonality, and cycles, to make accurate predictions for future data points.

Learning Objectives:

Understand the concept of time series forecasting.
Explore different models such as Regression-based forecasting and ARIMA models.
Evaluate model performance using common metrics.

2. Loading the Amtrak Data

For this exercise, we will be working with the Amtrak ridership data, which records the monthly number of passengers traveling on Amtrak between January 1991 and March 2004.

2.1 Importing the Dataset and Exploratory Data Analysis

# Load Amtrak ridership data
amtrak_data <- read_csv("ARIMA Models/Amtrak data.csv")  # Adjust file path as needed

## Rows: 159 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): Ridership
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(amtrak_data)

## # A tibble: 6 × 2
##   Month      Ridership
##   <chr>          <dbl>
## 1 01-01-1991     1709.
## 2 01-02-1991     1621.
## 3 01-03-1991     1973.
## 4 01-04-1991     1812.
## 5 01-05-1991     1975.
## 6 01-06-1991     1862.

summary(amtrak_data)

##     Month             Ridership   
##  Length:159         Min.   :1361  
##  Class :character   1st Qu.:1699  
##  Mode  :character   Median :1831  
##                     Mean   :1822  
##                     3rd Qu.:1967  
##                     Max.   :2223

# Add column trend & month (seasonal) to the data
amtrak_data$trend <- 1:nrow(amtrak_data)
amtrak_data$season <- month(amtrak_data$Month) |> as.factor()

amtrak_data

## # A tibble: 159 × 4
##    Month      Ridership trend season
##    <chr>          <dbl> <int> <fct> 
##  1 01-01-1991     1709.     1 1     
##  2 01-02-1991     1621.     2 2     
##  3 01-03-1991     1973.     3 3     
##  4 01-04-1991     1812.     4 4     
##  5 01-05-1991     1975.     5 5     
##  6 01-06-1991     1862.     6 6     
##  7 01-07-1991     1940.     7 7     
##  8 01-08-1991     2013.     8 8     
##  9 01-09-1991     1596.     9 9     
## 10 01-10-1991     1725.    10 10    
## # ℹ 149 more rows

# Convert to time series object
amtrak_data_ts <- ts(amtrak_data$Ridership, start = c(1991, 1), frequency = 12)
autoplot(amtrak_data_ts) + ggtitle("Amtrak Ridership Over Time")

# Decompose the time series into trend, seasonal, and random components
decompose(amtrak_data_ts) |>
  autoplot() + ggtitle("Decomposition of Amtrak Ridership Data")

3. Train-test split with 36 Observations for Validation

We will reserve the last 36 observations as the test set (validation set), and the remaining earlier data will be used for training.

# Define nvalid = 36 (validation/test set size)
nvalid <- 36

# Calculate the split point for training and test sets
n <- length(amtrak_data_ts)
train_ts <- window(amtrak_data_ts, end = c(1991 + (n - nvalid) %/% 12, (n - nvalid) %% 12))
test_ts <- window(amtrak_data_ts, start = c(1991 + (n - nvalid) %/% 12, (n - nvalid) %% 12 + 1))

# Visualize the training and test data
autoplot(train_ts, series = "Training Data") +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Training and Test Split (Last 36 Observations as Validation)") +
  xlab("Year") + ylab("Ridership") +
  theme_minimal()

4. Regression-Based Forecasting

Regression-based forecasting involves building a regression model using time series data, where future values are predicted based on a linear combination of explanatory variables.

4.1 Model 1: Ridership = β0 + β1 * Time

# Fit linear regression model
reg_model1 <- tslm(train_ts ~ trend)

# Summary of the regression model
summary(reg_model1)

## 
## Call:
## tslm(formula = train_ts ~ trend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.29 -114.02   16.06  129.28  306.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1750.3595    29.0729  60.206   <2e-16 ***
## trend          0.3514     0.4069   0.864     0.39    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 160.2 on 121 degrees of freedom
## Multiple R-squared:  0.006125,   Adjusted R-squared:  -0.002089 
## F-statistic: 0.7456 on 1 and 121 DF,  p-value: 0.3896

# Plotting the fitted regression line
autoplot(train_ts, series = "Training Data") +
  xlab("Year") + 
  # x-axis step marks
  scale_x_continuous(breaks = seq(1991, 2004, by = 1)) +
  ylab("Ridership") +
  theme_minimal() +
  autolayer(fitted(reg_model1), series = "Fitted Values") +
  autolayer(forecast(reg_model1, h = length(test_ts)), series = "Forecasted Values") + # to use dashed lines add 
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Regression Model with Time as Predictor")

## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.

4.2 Model 2: Ridership = β0 + β1 * Time + β2 * Time^2

# Fit quadratic regression model on training data
reg_model2 <- tslm(train_ts ~ trend + I(trend^2))

# Summary of the quadratic regression model
summary(reg_model2)

## 
## Call:
## tslm(formula = train_ts ~ trend + I(trend^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -344.79 -101.86   40.89   98.54  279.81 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1888.88401   40.91521  46.166  < 2e-16 ***
## trend         -6.29780    1.52327  -4.134 6.63e-05 ***
## I(trend^2)     0.05362    0.01190   4.506 1.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 148.8 on 120 degrees of freedom
## Multiple R-squared:  0.1499, Adjusted R-squared:  0.1358 
## F-statistic: 10.58 on 2 and 120 DF,  p-value: 5.844e-05

# Plotting the fitted regression line
autoplot(train_ts, series = "Training Data") +
  xlab("Year") + 
  # x-axis step marks
  # scale_x_continuous(breaks = seq(1991, 2004, by = 1)) +
  ylab("Ridership") +
  theme_minimal() +
  autolayer(fitted(reg_model2), series = "Fitted Values") +
  autolayer(forecast(reg_model2, h = length(test_ts)), series = "Forecasted Values") +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Regression Model with Quadratic Trend")

### 4.3 Model 3: Ridership = β0 + β1 * Time + β2 * Month

# Fit regression model with month as a categorical variable
reg_model3 <- tslm(train_ts ~ trend + season)

# Summary of the regression model
summary(reg_model3)

## 
## Call:
## tslm(formula = train_ts ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -276.165  -57.709    5.602   64.844  194.755 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1551.0115    34.2150  45.331  < 2e-16 ***
## trend          0.3764     0.2567   1.466   0.1455    
## season2      -43.3066    43.0218  -1.007   0.3163    
## season3      260.0149    43.0241   6.043 2.10e-08 ***
## season4      246.2211    44.0902   5.584 1.71e-07 ***
## season5      278.9750    44.0864   6.328 5.52e-09 ***
## season6      233.8362    44.0842   5.304 5.91e-07 ***
## season7      345.3265    44.0834   7.833 3.22e-12 ***
## season8      396.2831    44.0842   8.989 8.05e-15 ***
## season9       75.0087    44.0864   1.701   0.0917 .  
## season10     199.4784    44.0902   4.524 1.54e-05 ***
## season11     190.8496    44.0954   4.328 3.33e-05 ***
## season12     228.5331    44.1021   5.182 1.00e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100.9 on 110 degrees of freedom
## Multiple R-squared:  0.6418, Adjusted R-squared:  0.6027 
## F-statistic: 16.42 on 12 and 110 DF,  p-value: < 2.2e-16

# Forecast using the regression model
# forecasted_ridership <- forecast(reg_model3, h = length(test_ts))

# Plotting the fitted regression
autoplot(train_ts, series = "Training Data") +
  xlab("Year") + 
  # x-axis step marks
  # scale_x_continuous(breaks = seq(1991, 2004, by = 1)) +
  ylab("Ridership") +
  theme_minimal() +
  autolayer(fitted(reg_model3), series = "Fitted Values") +
  autolayer(forecast(reg_model3, h = length(test_ts)), series = "Forecasted Values", alpha = 0.5) +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Regression Model with Month as Categorical Variable")

4.4 Model 4: Ridership = β0 + β1 * Time + β2 * Time^2 + β3 * Month

# Fit regression model with month as a categorical variable
reg_model4 <- tslm(train_ts ~ trend + I(trend^2) + season)

# Summary of the regression model
summary(reg_model4)

## 
## Call:
## tslm(formula = train_ts ~ trend + I(trend^2) + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -213.775  -39.363    9.711   42.422  152.187 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.697e+03  2.768e+01  61.318  < 2e-16 ***
## trend       -7.156e+00  7.293e-01  -9.812  < 2e-16 ***
## I(trend^2)   6.074e-02  5.698e-03  10.660  < 2e-16 ***
## season2     -4.325e+01  3.024e+01  -1.430  0.15556    
## season3      2.600e+02  3.024e+01   8.598 6.60e-14 ***
## season4      2.606e+02  3.102e+01   8.401 1.83e-13 ***
## season5      2.938e+02  3.102e+01   9.471 6.89e-16 ***
## season6      2.490e+02  3.102e+01   8.026 1.26e-12 ***
## season7      3.606e+02  3.102e+01  11.626  < 2e-16 ***
## season8      4.117e+02  3.102e+01  13.270  < 2e-16 ***
## season9      9.032e+01  3.102e+01   2.911  0.00437 ** 
## season10     2.146e+02  3.102e+01   6.917 3.29e-10 ***
## season11     2.057e+02  3.103e+01   6.629 1.34e-09 ***
## season12     2.429e+02  3.103e+01   7.829 3.44e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.92 on 109 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.8037 
## F-statistic: 39.42 on 13 and 109 DF,  p-value: < 2.2e-16

# Forecast using the regression model
# forecasted_ridership <- forecast(reg_model4, h = length(test_ts))

# Plotting the fitted regression
autoplot(train_ts, series = "Training Data") +
  xlab("Year") + 
  # x-axis step marks
  # scale_x_continuous(breaks = seq(1991, 2004, by = 1)) +
  ylab("Ridership") +
  theme_minimal() +
  autolayer(fitted(reg_model4), series = "Fitted Values") +
  autolayer(forecast(reg_model4, h = length(test_ts)), series = "Forecasted Values", alpha = 0.5) +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Regression Model with Month as Categorical Variable")

5. ARIMA Models

Autoregressive Integrated Moving Average (ARIMA) models are widely used for forecasting time series data due to their flexibility in handling various types of data patterns.

Autoregressive (AR) terms capture the effect of previous values on the current value.
Integrated (I) terms account for differencing to make the series stationary.
Moving Average (MA) terms capture the effect of past forecast errors on the current value.

ARIMA vs. Auto ARIMA While ARIMA models require manual selection of parameters (p, d, q), Auto ARIMA can automatically select the best parameters based on AIC.

p: The number of autoregressive (AR) terms.
d: The degree of differencing to make the data stationary.
q: The number of moving average (MA) terms.

5.1 Building an ARIMA Model

# Fit ARIMA model on training data
auto_arima_model <- auto.arima(train_ts)
summary(auto_arima_model)

## Series: train_ts 
## ARIMA(1,1,1)(0,1,1)[12] 
## 
## Coefficients:
##          ar1      ma1     sma1
##       0.3221  -0.7446  -0.7017
## s.e.  0.1644   0.1185   0.1021
## 
## sigma^2 = 4032:  log likelihood = -615.44
## AIC=1238.88   AICc=1239.26   BIC=1249.68
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE     MASE        ACF1
## Training set 6.754576 59.22559 45.54802 0.297054 2.593327 0.552151 -0.03298124

5.2 Forecasting with ARIMA Model

# Forecast using the ARIMA model
forecasted_arima <- forecast(auto_arima_model, h = length(test_ts))
autoplot(forecasted_arima) + autolayer(test_ts, series = "Test Data", PI = FALSE) +
  ggtitle("ARIMA Forecast vs Actual Test Data") +
  xlab("Year") + ylab("Ridership")

## Warning in ggplot2::geom_line(ggplot2::aes(x = .data[["timeVal"]], y =
## .data[["seriesVal"]], : Ignoring unknown parameters: `PI`

6. Model Comparison

In this section, we will compare the forecasts from the regression model and ARIMA model to see which provides better results.

6.1 Comparing Forecast Accuracy

# Compare the accuracy of the regression and ARIMA models on test data

 
# Check residuals
checkresiduals(reg_model1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 98.996, df = 24, p-value = 4.454e-11

checkresiduals(reg_model2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 97.523, df = 24, p-value = 7.92e-11

checkresiduals(reg_model3)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 85.205, df = 24, p-value = 8.829e-09

checkresiduals(reg_model4)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 57.142, df = 24, p-value = 0.0001599

checkresiduals(auto_arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(0,1,1)[12]
## Q* = 13.302, df = 21, p-value = 0.8977
## 
## Model df: 3.   Total lags used: 24

7. Conclusion

In this tutorial, we explored two common approaches for time series forecasting: regression-based forecasting and ARIMA models using the Amtrak ridership data. Each method has its strengths and is suited for different types of data. The choice of model depends on the specific patterns in the time series data and the accuracy required for the predictions.

```

Adjustments for Amtrak Data:

Amtrak Ridership Data: The Amtrak.csv file is used to load the monthly ridership data from 1991 to 2004. This data is suitable for time series forecasting.
Data Splitting: The data is split chronologically into training (80%) and testing (20%) sets.
Model Building and Forecasting: Both the regression-based model and ARIMA models are applied to the training data and used to forecast ridership. Their predictions are evaluated against the test data.

This structure and workflow are typical for time series forecasting tasks and illustrate how to handle both regression-based and ARIMA approaches in R using the Amtrak data.