##Packages
suppressWarnings({
suppressMessages({
library(dplyr)
library(ggplot2)
library(lubridate)
library(gridExtra)
library(forecast)
library(modeltime)
library(fabletools)
library(feasts)
library(tsibble)
library(tibble)
library(fable)
library(data.table)
library(knitr)
library(prophet)
library(fable.prophet)
library(tidyr)
library(slider)
library(stats)
library(lagged)
library(tidyverse)
library(ggfortify)
library(zoo)
library(tseries)
options(warn=-1)
})
})
Section 1.1: Exploratory Data Analysis & Time Series Decomposition
To delve into fundamental concepts of time-series decomposition and forecasting, we will utilize data on “Total US Vehicle Sales” sourced from the “FRED” Economic Data platform, curated by the Federal Reserve Bank of St. Louis. This dataset presents non-seasonally adjusted monthly figures (in thousands) representing vehicle sales across the United States, spanning from 1976 to 2023. These numbers, compiled by the U.S. Bureau of Economic Analysis, reflect aggregated sales data from automotive dealers nationwide.
The dynamics of vehicle sales are influenced by a myriad of factors, encompassing the cost of raw materials such as steel and rubber, climatic conditions, economic fluctuations, seasonal variations, insurance premiums, taxation policies, consumer preferences, labor market dynamics, and demographic trends. Given the multifaceted nature of these variables shaping sales trends, generating precise forecasts poses a significant analytical challenge.
Section 1.2: Data Summary
We initiate our analysis by delving into the structure of our dataset, leveraging statistical metrics and visualizations to uncover insights. Along the way, we will provide commentary on noteworthy observations and draw key conclusions as they emerge.
vehicle_df <- read.csv('total_vehicle_sales.csv')
vehicle_sales_tbl_ts <- vehicle_df %>% select(date, vehicle_sales) %>%
mutate(value = vehicle_sales) %>%
mutate(date = yearmonth(date)) %>%
as_tsibble(index = date)
head(vehicle_df)
## date vehicle_sales
## 1 1976-01-01 885.2
## 2 1976-02-01 994.7
## 3 1976-03-01 1243.6
## 4 1976-04-01 1191.2
## 5 1976-05-01 1203.2
## 6 1976-06-01 1254.7
summary(vehicle_df)
## date vehicle_sales
## Length:575 Min. : 670.5
## Class :character 1st Qu.:1117.2
## Mode :character Median :1268.5
## Mean :1261.7
## 3rd Qu.:1420.2
## Max. :1845.7
date_range <- range(vehicle_df$date)
date_range
## [1] "1976-01-01" "2023-11-01"
nrow(vehicle_df)
## [1] 575
The data consists of monthly vehicle sales between January 1976 and November 2023, summarized by month - resulting in 575 observations. On average, 1261.7 thousand sales were done per month, although that number varied from a low of 670.5 thousand to a peak of 1845.7 thousand.
##Histogram
hist(vehicle_df$vehicle_sales, freq = FALSE)
lines(density(vehicle_sales_tbl_ts$value), lwd = 3, col = "red")
The vehicles sales seem to be distributed fairly normal without any outliers in the data set.
ggplot() +
geom_line(data = vehicle_sales_tbl_ts, aes(x = date, y = value)) +
labs(x = "Month", y = "Vehicle Sales", title = "Monthly Vehicle Sales Over Time")
When visualizing the sales data chronologically, we observe a gradual
upward trend alongside notable seasonal fluctuations, suggesting a
recurring pattern, possibly on an annual basis. Given that the sales
data is aggregated monthly, it’s conceivable that the seasonal
variations contribute to the observed quarterly peaks and troughs.
Section 1.3: Time Series Analysis
After examining various moving average plots, it becomes evident that a 6-month moving average strikes a balance between smoothing the data and retaining the underlying seasonal or cyclical patterns.
# Calculate 6-month moving average
vehicle_sales_tbl_ts <- vehicle_sales_tbl_ts %>%
mutate(ma_6 = slide_dbl(.x = value, .f = ~mean(.x, na.rm = TRUE), .before = 5))
# Create the plot
ggplot(vehicle_sales_tbl_ts, aes(x = date)) +
geom_line(aes(y = value), color = "blue", na.rm = TRUE) +
geom_line(aes(y = ma_6), color = "red", na.rm = TRUE) +
labs(x = "Date", y = "Vehicle Sales",
title = "Vehicle Sales with 6-Month",
subtitle = "Blue: Actual Sales, Red: 6-Month MA") +
theme_minimal()
Given our belief in the presence of seasonality within the dataset, we proceed to employ “STL” decomposition for further analysis. The decomposition reveals a distinct seasonal pattern, even after removing the overall trend using a moving average. Notably, we observe a recurring yearly pattern characterized by peaks followed by low periods, succeeded by 2-3 months of heightened activity, and then another downturn before the start of a new seasonal cycle. Considering the observed dates and the pattern’s structure, it’s reasonable to assume that this time series is strongly influenced by the financial year and holidays. Subsequently, the data is additive based on the below plot.
# Calculate the difference between the 12-month moving average and the actual data
vehicle_sales_tbl_ts <- vehicle_sales_tbl_ts %>%
mutate(diff_12 = value - ma_6)
# Convert the tsibble to a ts object with a frequency of 12
vehicle_sales_ts <- ts(vehicle_sales_tbl_ts$value, frequency = 12)
# Decompose the time series
decomposed <- stl(vehicle_sales_ts, s.window="periodic")
# Plot the decomposed time series
autoplot(decomposed) +
labs(x = "Date", y = "Vehicle Sales",
title = "Decomposition of Vehicle Sales Time Series",
subtitle = "Trend, Seasonal, and Random components") +
theme_minimal()
In this section, we will construct a model for our dataset using the ARIMA methodology. To begin, let’s break the data into two parts, initial 80% of data would be taining data and the rest 20% is considered as testing data. The visual representation of the same is shown below (training in blue and testing in red):
set.seed(123)
# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)
# Split the data into training and testing sets
train_data <- vehicle_sales_tbl_ts %>% slice(1:train_rows)
test_data <- vehicle_sales_tbl_ts %>% slice((train_rows + 1):n_rows)
# Visualize the training and test sets
ggplot() +
geom_line(data = train_data, aes(x = date, y = value, color = "Training"), linewidth = 0.3) +
geom_line(data = test_data, aes(x = date, y = value, color = "Test"), linewidth = 0.3) +
scale_color_manual(values = c("Training" = "blue", "Test" = "red")) +
labs(title = "Training and Test Sets", x = "Date", y = "vehicle sales") +
theme_minimal()
Section 2.1: ARIMA Modeling
# Load the necessary library
library(zoo)
# Calculate the 6-month moving average for the training data
train_data <- train_data %>%
mutate(ma_6 = rollmean(value, 6, align = "center", fill = NA))
# Plot the 6-month moving average
ggplot(train_data, aes(x = date)) +
geom_line(aes(y = ma_6), color = "blue", na.rm = TRUE) +
labs(x = "Date", y = "Vehicle Sales",
title = "Vehicle Sales with 6-Month Moving Average") +
theme_minimal()
As observed in the 6-month rolling average plot above, a distinct upward trend indicates that our time series is not mean-stationary, necessitating correction before applying ARIMA. The data displays minor variance stationarity, characterized by fluctuations in peaks and valleys. Although there are occasional jumps within three or four consecutive periods, no variance-stabilizing transformations seem necessary. To confirm this observation, we examine a rolling standard deviation plot:
# Convert the training data to a ts object with a frequency of 12
train_sales_ts <- ts(train_data$value, frequency = 12)
# Calculate the rolling standard deviation for the training data
rolling_sd_train <- rollapply(train_sales_ts, width = 13, FUN = sd, align = "center", fill = NA)
# Create a data frame with the rolling standard deviation for the training data
rolling_sd_train_df <- data.frame(time = time(rolling_sd_train), sd = as.vector(rolling_sd_train))
# Plot the rolling standard deviation for the training data
ggplot(rolling_sd_train_df, aes(x = time, y = sd)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(x = "Time", y = "Standard Deviation",
title = "Rolling Standard Deviation of Vehicle Sales (Training Data)") +
theme_minimal()
## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.
## `geom_smooth()` using formula = 'y ~ x'
The rolling Standard Deviation plot indicates a slight decreasing trend in variance over time. Despite this trend’s slow rate of change and the overall variation observed across the years, we maintain our conclusion that correction for variance non-stationarity is unnecessary.
# Create a seasonally differenced series for the training data
train_data_diff <- diff(train_data$value, lag = 12)
# Convert the differenced training data back into a time series object
train_data_diff_ts <- ts(train_data_diff, start = start(train_data$value), frequency = frequency(train_data$value))
# Create the ACF and PACF plot for the training data
tsdisplay(train_data_diff_ts)
kpss_result <- kpss.test(train_data$value)
kpss_result
##
## KPSS Test for Level Stationarity
##
## data: train_data$value
## KPSS Level = 1.1514, Truncation lag parameter = 5, p-value = 0.01
kpss_diff <- kpss.test(diff(train_data$value))
kpss_diff
##
## KPSS Test for Level Stationarity
##
## data: diff(train_data$value)
## KPSS Level = 0.013876, Truncation lag parameter = 5, p-value = 0.1
Upon testing the original sales data, we obtained a very small p-value, leading us to reject the hypothesis of mean stationarity in our data. However, after differencing to remove seasonality and trend, the KPSS Test yields a p-value of 0.10, surpassing our threshold of 0.05. This result confirms that the transformed time series is mean-stationary, indicating that no further transformations are required.
Section 2.2: Analysis of Transformed Time Series and Model fitting
mod4 = Arima(train_data$value, order = c(2, 0, 2), seasonal = c(0, 0, 0))
summary(mod4)
## Series: train_data$value
## ARIMA(2,0,2) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2 mean
## 0.1891 0.6556 0.3410 -0.3700 1229.7697
## s.e. 0.0998 0.0824 0.1072 0.0578 41.1733
##
## sigma^2 = 20981: log likelihood = -2939.5
## AIC=5891.01 AICc=5891.19 BIC=5915.79
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1.349663 144.0588 117.0126 -1.34284 9.788813 0.9002761 -0.0220673
# Extract the residuals from the model
residuals <- residuals(mod4)
# Perform the Ljung-Box test
ljung_box_test <- Box.test(residuals, type = "Ljung-Box")
ljung_box_test
##
## Box-Ljung test
##
## data: residuals
## X-squared = 0.22547, df = 1, p-value = 0.6349
forecast_values <- forecast(mod4, h = 6)
forecast_values
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 461 1376.227 1190.596 1561.857 1092.3295 1660.124
## 462 1359.722 1149.627 1569.817 1038.4091 1681.034
## 463 1350.358 1128.390 1572.327 1010.8867 1689.830
## 464 1337.767 1102.476 1573.058 977.9209 1697.614
## 465 1329.248 1085.998 1572.498 957.2291 1701.267
## 466 1319.383 1068.147 1570.618 935.1508 1703.614
# Plot the forecast
autoplot(forecast_values, conf.int = TRUE) +
ylab('Retail Sales') +
ggtitle('6 Month Forecast of Retail Sales')
The Ljung-Box test performed with various lag choices yields p-values greater than 0.05. Consequently, we can reject the hypothesis that the time series retains residual autocorrelation at those lags. This suggests that the ARIMA model would be a suitable candidate for our forecasting task.
However, before proceeding with the ARIMA model, we also consider Meta’s Prophet model algorithm as an alternative.
Section 3.1: Meta Prophet Modeling
Prophet, a forecasting tool created by Facebook, specializes in analyzing time-series data with daily observations showcasing trends, seasonality, and holiday effects. It stands out for its user-friendly interface, offering intuitive parameter adjustments, automatic feature selection, and robust modeling techniques. Prophet efficiently captures intricate data patterns, making it a valuable asset for accurate time-series forecasting.
The basic model is fit as:
yt = gt + st +ht + et
where gts the trend, stis seasonality, and ht are holidays.
set.seed(123)
# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)
# Split the data into training and testing sets
vehicle_train <- vehicle_sales_tbl_ts %>% slice(1:train_rows)
vehicle_test <- vehicle_sales_tbl_ts %>% slice((train_rows + 1):n_rows)
Section 3.2: Seasonality Analysis
The prophet model indicates a strong yearly seasonal pattern which reinforces our earlier conclusion that seasonality is additive, rather than multiplicative. We begin be re-examining the decomposition plot, updated with the latest version of the model:
model = vehicle_train %>%
model(prophet = fable.prophet::prophet(vehicle_sales))
model %>%
components() %>%
autoplot()
The analysis of the time series data indicates a rising trend until the early 2000s, followed by a decline thereafter. Furthermore, there is an additive seasonal pattern observed, with no evidence of multiplicative seasonality.
changepoints = model %>%
glance() %>%
pull(changepoints) %>%
bind_rows() %>%
.$changepoints
vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
geom_vline(xintercept=as.Date(changepoints),color='red',linetype='dashed')
model2 = vehicle_train %>%
model(
prophet_orig = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 25) + season(period = 'year') + season(period = 'month', order = 12)),
prophet_10 = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 10) + season(period = 'year') + season(period = 'month', order = 12))
)
changepoints_orig = model2 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints
changepoints_10 = model2 %>%
glance() %>%
filter(.model == 'prophet_10') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints
vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
# geom_vline(aes(xintercept=ymd('2000-01-01')))
geom_vline(xintercept=as.Date(changepoints_orig),color='red')+
geom_vline(xintercept=as.Date(changepoints_10),color='blue',linetype='dashed')
model1 = vehicle_train %>%
model(
prophet_orig = fable.prophet::prophet(vehicle_sales),
prophet_90_range = fable.prophet::prophet(vehicle_sales~growth(changepoint_range=0.9)+season(period='year')+ season(period = 'month', order = 12)),
)
changepoints_orig = model1 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints
changepoints_90_range = model1 %>%
glance() %>%
filter(.model == 'prophet_90_range') %>%
unnest(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints
model1 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_90_range),color='red',linetype='dashed')
model3 = vehicle_train %>%
model(
prophet_orig = fable.prophet::prophet(vehicle_sales),
prophet_less_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.01)+season(period='year')),
prophet_more_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.5)+season(period='year'))
)
changepoints_orig = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_orig') %>%
filter(abs(adjustment)>=0.05) %>%
.$changepoints
changepoints_more_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_more_flexible') %>%
filter(abs(adjustment)>=0.50) %>%
.$changepoints
changepoints_less_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_less_flexible') %>%
filter(abs(adjustment)>=0.01) %>%
.$changepoints
model3 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_more_flexible),color='green',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_less_flexible),color='red',linetype='dashed')
vehicle_train %>%
model(
prophet_orig = fable.prophet::prophet(vehicle_sales),
prophet_saturating = fable.prophet::prophet(vehicle_sales~growth(type='linear',capacity=3.7,floor=2.8)+season('year'))
) %>%
forecast(h=68) %>%
autoplot(vehicle_train)
The forecast predictions from changepoints_less_flexible outperform other changepoint options, suggesting that a less flexible approach may better capture the underlying data patterns. This implies that simpler models with fewer changepoints could yield more dependable forecasts.
Although adjustments were made to the floor and capacity parameters to a significant extent, the forecasts produced show minimal deviation from the base Prophet model. This suggests that while these parameters theoretically affect the model’s behavior, their practical impact on forecasting accuracy may be limited in this dataset. Further exploration and experimentation may be necessary to gain a deeper understanding of their effect.
model4 = vehicle_train %>%
model(
additive = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='additive')),
multiplicative = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='multiplicative')))
model4 %>%
components() %>%
autoplot()
Both the decomposition graph and table reveal the existence of additive seasonality within the time series, rather than multiplicative seasonality. This implies that the seasonal fluctuations within the data maintain a consistent magnitude over time, rather than scaling proportionally with the trend.
Section 4: Model Comparison and Validation
Now that we have our models constructed, we can measure their effectiveness through cross-validation: we will use each model to build and plot 15 consecutive forecasts across subsets of our training data and compare their performance:
vehicle_train %>%
bind_rows(vehicle_test) %>%
ggplot()+
geom_line(aes(date, vehicle_sales))+
geom_vline(aes(xintercept=ymd('2014-05-01')),color='red')+
annotate("text", x = as.Date("1976-01-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9,
label = "Train", color = "blue", size = 4) +
annotate("text", x = as.Date("2023-11-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9,
label = "Test", color = "blue", size = 4) +
labs(x = "Week", y = "vehicle sales", title = "Training and Testing Dataset") +
theme_bw()
vehicle_train_data <- vehicle_train %>%
stretch_tsibble(.init = 22*12, .step = 12)
vehicle_train_data %>%
ggplot()+
geom_point(aes(date,factor(.id),color=factor(.id)))+
ylab('Iteration')+
ggtitle('Samples included in each CV Iteration')
vehicle_train_cv_forecast <- vehicle_train_data %>%
model(
arima = ARIMA(vehicle_sales~pdq(2,0,2)),
prophet = fable.prophet::prophet(vehicle_sales~growth()+season(period='year')),
naive = NAIVE(vehicle_sales~drift())
) %>%
forecast(h=6)
vehicle_train_cv_forecast %>%
as_tsibble() %>%
dplyr::select(-vehicle_sales) %>%
left_join(
vehicle_train
) %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
geom_line(aes(date,.mean,color=factor(.id),linetype=.model))+
scale_color_discrete(name='Iteration')+
ylab('Ridership Count')+
theme_bw()
## Joining with `by = join_by(date)`
vehicle_train_cv_forecast %>%
as_tibble() %>%
dplyr::select(-vehicle_sales) %>%
left_join(vehicle_train) %>%
group_by(.id,.model) %>%
mutate(
months_ahead = seq(1:6)
) %>%
ungroup() %>%
mutate(error = abs(vehicle_sales - .mean)) %>%
ggplot()+
geom_boxplot(aes(factor(months_ahead),error))+
facet_wrap(~.model,ncol=1)+
guides(color='none')+
ylab('Absolute Error')+
xlab('Days Ahead')+
ggtitle('Absolute Error by Iteration')+
ylim(0,600)
## Joining with `by = join_by(date)`
vehicle_train_cv_forecast %>%
accuracy(vehicle_train)
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -16.4 91.1 72.0 -1.94 5.88 0.644 0.616 0.533
## 2 naive Test -27.2 190. 152. -4.49 12.6 1.36 1.29 0.278
## 3 prophet Test -34.4 210. 166. -4.83 14.5 1.48 1.42 0.865
vehicle_train_cv_forecast %>%
as_tibble() %>%
dplyr::select(-vehicle_sales) %>%
left_join(vehicle_train) %>%
group_by(.id,.model) %>%
mutate(
months_ahead = seq(1:6)
) %>%
ungroup() %>%
filter(!is.na(vehicle_sales)) %>%
group_by(months_ahead,.model) %>%
summarize(
rmse = sqrt(mean((vehicle_sales - .mean)^2,na.rm=T)),
) %>%
ungroup() %>%
ggplot()+
geom_point(aes(months_ahead,rmse,color=.model),alpha=0.4)+
geom_smooth(aes(months_ahead,rmse,color=.model),se=F)+
xlab("Months Ahead")+
ylab("Smoothed RMSE")
## Joining with `by = join_by(date)`
## `summarise()` has grouped output by 'months_ahead'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
vehicle_train_cv_forecast %>%
as_tibble() %>%
dplyr::select(-vehicle_sales) %>%
left_join(vehicle_train) %>%
group_by(.id,.model) %>%
mutate(
months_ahead = seq(1:6)
) %>%
ungroup() %>%
group_by(months_ahead,.model) %>%
mutate(
mape = mean(abs((vehicle_sales - .mean)/vehicle_sales),na.rm=T)
) %>%
ungroup() %>%
ggplot()+
geom_point(aes(months_ahead,mape,color=.model),alpha=0.4)+
geom_smooth(aes(months_ahead,mape,color=.model),se=F)+
xlab("Months Ahead")+
ylab("Smoothed MAPE")
## Joining with `by = join_by(date)`
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Arima model exhibits lower MAPE and RMSE values based on above plots.
Although the prophet model adeptly captures the variability and trend within the data, it struggles to accurately estimate the magnitude of the observations. This indicates that while the model effectively tracks the general patterns, it tends to underestimate the true values, leading to discrepancies in performance when evaluated on the test set. Notably, the ARIMA model exhibits the lowest error rate among all models, as evidenced by its superior performance in terms of RMSE and MAPE.
When contrasting the fitted values with the actual observations, it’s evident that the model displays minimal bias, indicating a strong alignment with the training dataset. Yet, upon scrutinizing the forecast against the test data, a notable reduction in variability is observed, signaling a potential overfitting challenge. This disparity suggests that while the model effectively captures the intrinsic patterns, its ability to generalize to new data is compromised. Further exploration and adjustments may be necessary to mitigate the overfitting issue and enhance the model’s predictive performance.