# Load necessary libraries
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
Time series forecasting is the use of a model to predict future values based on previously observed values. It is commonly used in fields such as economics, finance, and transportation. The goal is to capture patterns in the data, including trend, seasonality, and cycles, in order to make accurate predictions for future data points.
For this exercise, we will be working with the Amtrak ridership data, which records the monthly number of passengers traveling on Amtrak between January 1991 and March 2004.
# Load Amtrak ridership data
amtrak_data <- read_csv("ARIMA Models/Amtrak data.csv") # Adjust file path as needed
## Rows: 159 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): Ridership
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(amtrak_data)
## # A tibble: 6 × 2
## Month Ridership
## <chr> <dbl>
## 1 01-01-1991 1709.
## 2 01-02-1991 1621.
## 3 01-03-1991 1973.
## 4 01-04-1991 1812.
## 5 01-05-1991 1975.
## 6 01-06-1991 1862.
summary(amtrak_data)
##     Month             Ridership
##  Length:159         Min.   :1361
##  Class :character   1st Qu.:1699
##  Mode  :character   Median :1831
##                     Mean   :1822
##                     3rd Qu.:1967
##                     Max.   :2223
# Add column trend & month (seasonal) to the data
amtrak_data$trend <- 1:nrow(amtrak_data)
amtrak_data$season <- month(amtrak_data$Month) |> as.factor()
amtrak_data
## # A tibble: 159 × 4
## Month Ridership trend season
## <chr> <dbl> <int> <fct>
## 1 01-01-1991 1709. 1 1
## 2 01-02-1991 1621. 2 2
## 3 01-03-1991 1973. 3 3
## 4 01-04-1991 1812. 4 4
## 5 01-05-1991 1975. 5 5
## 6 01-06-1991 1862. 6 6
## 7 01-07-1991 1940. 7 7
## 8 01-08-1991 2013. 8 8
## 9 01-09-1991 1596. 9 9
## 10 01-10-1991 1725. 10 10
## # ℹ 149 more rows
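Note that `Month` is stored as a character vector in day-month-year format (so `01-02-1991` is February 1991). If `month()` does not parse the raw strings in your version of lubridate, converting them to proper dates first with `dmy()` is a safer route; a minimal sketch:
# Parse the dd-mm-yyyy strings into Date objects, then extract the month
amtrak_data$Month <- dmy(amtrak_data$Month) # e.g. "01-02-1991" -> 1991-02-01
amtrak_data$season <- factor(month(amtrak_data$Month)) # same result as above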
# Convert to time series object
amtrak_data_ts <- ts(amtrak_data$Ridership, start = c(1991, 1), frequency = 12)
autoplot(amtrak_data_ts) + ggtitle("Amtrak Ridership Over Time")
# Decompose the time series into trend, seasonal, and random components
decompose(amtrak_data_ts) |>
autoplot() + ggtitle("Decomposition of Amtrak Ridership Data")
We will reserve the last 36 observations as the test set (validation set), and the remaining earlier data will be used for training.
# Define nvalid = 36 (validation/test set size)
nvalid <- 36
# Calculate the split point for training and test sets
n <- length(amtrak_data_ts)
train_ts <- window(amtrak_data_ts, end = c(1991 + (n - nvalid) %/% 12, (n - nvalid) %% 12))
test_ts <- window(amtrak_data_ts, start = c(1991 + (n - nvalid) %/% 12, (n - nvalid) %% 12 + 1))
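An equivalent split can be written directly from the series' time index, which avoids the modular arithmetic; a minimal sketch:
# Last time point belonging to the training set (observation n - nvalid)
split_time <- time(amtrak_data_ts)[n - nvalid]
train_ts <- window(amtrak_data_ts, end = split_time)
test_ts <- window(amtrak_data_ts, start = split_time + 1/12) # next month onward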
# Visualize the training and test data
autoplot(train_ts, series = "Training Data") +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Training and Test Split (Last 36 Observations as Validation)") +
  xlab("Year") + ylab("Ridership") +
  theme_minimal()
Regression-based forecasting fits a regression model to the time series and predicts future values from a linear combination of explanatory variables. Below we build four models of increasing complexity: a linear trend, a quadratic trend, a linear trend with monthly seasonality, and a quadratic trend with monthly seasonality.
### 4.1 Model 1: Ridership = β0 + β1 * Time
# Fit linear regression model
reg_model1 <- tslm(train_ts ~ trend)
# Summary of the regression model
summary(reg_model1)
##
## Call:
## tslm(formula = train_ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.29 -114.02 16.06 129.28 306.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1750.3595 29.0729 60.206 <2e-16 ***
## trend 0.3514 0.4069 0.864 0.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 160.2 on 121 degrees of freedom
## Multiple R-squared: 0.006125, Adjusted R-squared: -0.002089
## F-statistic: 0.7456 on 1 and 121 DF, p-value: 0.3896
# Plotting the fitted regression line
autoplot(train_ts, series = "Training Data") +
  xlab("Year") +
  # x-axis step marks
  # scale_x_continuous(breaks = seq(1991, 2004, by = 1)) +
  ylab("Ridership") +
  theme_minimal() +
  autolayer(fitted(reg_model1), series = "Fitted Values") +
  autolayer(forecast(reg_model1, h = length(test_ts)), series = "Forecasted Values") +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Regression Model with Time as Predictor")
### 4.2 Model 2: Ridership = β0 + β1 * Time + β2 * Time^2
# Fit quadratic regression model on training data
reg_model2 <- tslm(train_ts ~ trend + I(trend^2))
# Summary of the quadratic regression model
summary(reg_model2)
##
## Call:
## tslm(formula = train_ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -344.79 -101.86 40.89 98.54 279.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1888.88401 40.91521 46.166 < 2e-16 ***
## trend -6.29780 1.52327 -4.134 6.63e-05 ***
## I(trend^2) 0.05362 0.01190 4.506 1.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 148.8 on 120 degrees of freedom
## Multiple R-squared: 0.1499, Adjusted R-squared: 0.1358
## F-statistic: 10.58 on 2 and 120 DF, p-value: 5.844e-05
# Plotting the fitted regression line
autoplot(train_ts, series = "Training Data") +
  xlab("Year") +
  # x-axis step marks
  # scale_x_continuous(breaks = seq(1991, 2004, by = 1)) +
  ylab("Ridership") +
  theme_minimal() +
  autolayer(fitted(reg_model2), series = "Fitted Values") +
  autolayer(forecast(reg_model2, h = length(test_ts)), series = "Forecasted Values") +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Regression Model with Quadratic Trend")
### 4.3 Model 3: Ridership = β0 + β1 * Time + β2 * Month
# Fit regression model with month as a categorical variable
reg_model3 <- tslm(train_ts ~ trend + season)
# Summary of the regression model
summary(reg_model3)
##
## Call:
## tslm(formula = train_ts ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -276.165 -57.709 5.602 64.844 194.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1551.0115 34.2150 45.331 < 2e-16 ***
## trend 0.3764 0.2567 1.466 0.1455
## season2 -43.3066 43.0218 -1.007 0.3163
## season3 260.0149 43.0241 6.043 2.10e-08 ***
## season4 246.2211 44.0902 5.584 1.71e-07 ***
## season5 278.9750 44.0864 6.328 5.52e-09 ***
## season6 233.8362 44.0842 5.304 5.91e-07 ***
## season7 345.3265 44.0834 7.833 3.22e-12 ***
## season8 396.2831 44.0842 8.989 8.05e-15 ***
## season9 75.0087 44.0864 1.701 0.0917 .
## season10 199.4784 44.0902 4.524 1.54e-05 ***
## season11 190.8496 44.0954 4.328 3.33e-05 ***
## season12 228.5331 44.1021 5.182 1.00e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.9 on 110 degrees of freedom
## Multiple R-squared: 0.6418, Adjusted R-squared: 0.6027
## F-statistic: 16.42 on 12 and 110 DF, p-value: < 2.2e-16
# Forecast using the regression model
# forecasted_ridership <- forecast(reg_model3, h = length(test_ts))
# Plotting the fitted regression
autoplot(train_ts, series = "Training Data") +
  xlab("Year") +
  # x-axis step marks
  # scale_x_continuous(breaks = seq(1991, 2004, by = 1)) +
  ylab("Ridership") +
  theme_minimal() +
  autolayer(fitted(reg_model3), series = "Fitted Values") +
  autolayer(forecast(reg_model3, h = length(test_ts)), series = "Forecasted Values", alpha = 0.5) +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Regression Model with Month as Categorical Variable")
### 4.4 Model 4: Ridership = β0 + β1 * Time + β2 * Time^2 + β3 * Month
# Fit regression model with a quadratic trend and month as a categorical variable
reg_model4 <- tslm(train_ts ~ trend + I(trend^2) + season)
# Summary of the regression model
summary(reg_model4)
##
## Call:
## tslm(formula = train_ts ~ trend + I(trend^2) + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -213.775 -39.363 9.711 42.422 152.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.697e+03 2.768e+01 61.318 < 2e-16 ***
## trend -7.156e+00 7.293e-01 -9.812 < 2e-16 ***
## I(trend^2) 6.074e-02 5.698e-03 10.660 < 2e-16 ***
## season2 -4.325e+01 3.024e+01 -1.430 0.15556
## season3 2.600e+02 3.024e+01 8.598 6.60e-14 ***
## season4 2.606e+02 3.102e+01 8.401 1.83e-13 ***
## season5 2.938e+02 3.102e+01 9.471 6.89e-16 ***
## season6 2.490e+02 3.102e+01 8.026 1.26e-12 ***
## season7 3.606e+02 3.102e+01 11.626 < 2e-16 ***
## season8 4.117e+02 3.102e+01 13.270 < 2e-16 ***
## season9 9.032e+01 3.102e+01 2.911 0.00437 **
## season10 2.146e+02 3.102e+01 6.917 3.29e-10 ***
## season11 2.057e+02 3.103e+01 6.629 1.34e-09 ***
## season12 2.429e+02 3.103e+01 7.829 3.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.92 on 109 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.8037
## F-statistic: 39.42 on 13 and 109 DF, p-value: < 2.2e-16
# Forecast using the regression model
# forecasted_ridership <- forecast(reg_model4, h = length(test_ts))
# Plotting the fitted regression
autoplot(train_ts, series = "Training Data") +
  xlab("Year") +
  # x-axis step marks
  # scale_x_continuous(breaks = seq(1991, 2004, by = 1)) +
  ylab("Ridership") +
  theme_minimal() +
  autolayer(fitted(reg_model4), series = "Fitted Values") +
  autolayer(forecast(reg_model4, h = length(test_ts)), series = "Forecasted Values", alpha = 0.5) +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("Regression Model with Quadratic Trend and Monthly Seasonality")
Autoregressive Integrated Moving Average (ARIMA) models are widely used for forecasting time series data due to their flexibility in handling various types of data patterns.
ARIMA vs. Auto ARIMA: a plain ARIMA model requires manual selection of the parameters (p, d, q), whereas `auto.arima()` selects them automatically by minimizing an information criterion (the AICc by default).
# Fit ARIMA model on training data
auto_arima_model <- auto.arima(train_ts)
summary(auto_arima_model)
## Series: train_ts
## ARIMA(1,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ma1 sma1
## 0.3221 -0.7446 -0.7017
## s.e. 0.1644 0.1185 0.1021
##
## sigma^2 = 4032: log likelihood = -615.44
## AIC=1238.88 AICc=1239.26 BIC=1249.68
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 6.754576 59.22559 45.54802 0.297054 2.593327 0.552151 -0.03298124
# Forecast using the ARIMA model
forecasted_arima <- forecast(auto_arima_model, h = length(test_ts))
autoplot(forecasted_arima) +
  autolayer(test_ts, series = "Test Data") +
  ggtitle("ARIMA Forecast vs Actual Test Data") +
  xlab("Year") + ylab("Ridership")
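For comparison, the specification chosen automatically above can also be fit by hand with `Arima()`, which is the route to take when you want to fix (p, d, q) yourself; a minimal sketch that reproduces the auto-selected ARIMA(1,1,1)(0,1,1)[12]:
# Fit the same seasonal ARIMA specification manually
manual_arima <- Arima(train_ts, order = c(1, 1, 1), seasonal = c(0, 1, 1))
summary(manual_arima)
# Forecasts are produced the same way: forecast(manual_arima, h = length(test_ts))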
In this section, we compare the regression models and the ARIMA model to see which provides better forecasts.
# Compare the regression and ARIMA models: first residual diagnostics, then test-set accuracy
checkresiduals(reg_model1)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals from Linear regression model
## LM test = 98.996, df = 24, p-value = 4.454e-11
checkresiduals(reg_model2)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals from Linear regression model
## LM test = 97.523, df = 24, p-value = 7.92e-11
checkresiduals(reg_model3)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals from Linear regression model
## LM test = 85.205, df = 24, p-value = 8.829e-09
checkresiduals(reg_model4)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals from Linear regression model
## LM test = 57.142, df = 24, p-value = 0.0001599
checkresiduals(auto_arima_model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)(0,1,1)[12]
## Q* = 13.302, df = 21, p-value = 0.8977
##
## Model df: 3. Total lags used: 24
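The Breusch-Godfrey tests show that all four regression models leave significant autocorrelation in their residuals (very small p-values), while the ARIMA residuals are consistent with white noise (Ljung-Box p = 0.90). To compare predictive performance directly, we can also compute test-set accuracy with `accuracy()`; a minimal sketch (the holdout measures appear in the "Test set" row of each table):
# Out-of-sample accuracy: compare each model's forecasts to the test set
accuracy(forecast(reg_model3, h = length(test_ts)), test_ts)
accuracy(forecast(reg_model4, h = length(test_ts)), test_ts)
accuracy(forecasted_arima, test_ts)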
In this tutorial, we explored two common approaches to time series forecasting, regression-based models and ARIMA models, using the Amtrak ridership data. Each method has its strengths and suits different kinds of data; the choice of model depends on the patterns present in the series and the accuracy required of the predictions.
- Amtrak Ridership Data: the `Amtrak data.csv` file provides the monthly ridership data from January 1991 to March 2004, a series well suited to time series forecasting.
- Data Splitting: the data is split chronologically, with the last 36 observations (roughly the final 23%) reserved as the test set and the earlier observations used for training.
- Model Building and Forecasting: the regression-based models and the ARIMA model are fit to the training data and used to forecast ridership, and their predictions are evaluated against the test data.

This structure and workflow are typical of time series forecasting tasks and illustrate how to handle both regression-based and ARIMA approaches in R using the Amtrak data.