vehicle_df <- read.csv("/Users/keerthichereddy/Documents/Sem2/Forecasting Methods/Project/total_vehicle_sales.csv", stringsAsFactors = FALSE)
vehicle_sales_tbl_ts <- vehicle_df %>% select(date, vehicle_sales) %>%
  mutate(value = vehicle_sales) %>%
  mutate(date = yearmonth(date)) %>%
  as_tsibble(index = date)
BANA 7050_FinalProject_KeerthiChereddy
Section 1
Data source and data-generating process
The dataset represents total vehicle sales in the USA (the “TOTALNSA” series) published by the Federal Reserve Bank of St. Louis (FRED). The series is updated monthly and serves as an important economic indicator, reflecting consumer spending and trends in the automotive industry. The dataset spans several decades, offering a long-run view of vehicle sales dynamics.
Vehicle sales are closely tied to macroeconomic conditions, so fluctuations in income levels, economic policy, and similar factors can create significant variability in the data. Government policies, environmental regulations, shifts in consumer preferences, and new technologies (e.g., electric vehicles) also have a direct impact on the industry. The data exhibit seasonality as well; for example, year-end months traditionally show higher sales figures. The interplay of all these factors makes forecasting this time series challenging.
Summary Statistics
head(vehicle_df)
        date vehicle_sales
1 1976-01-01 885.2
2 1976-02-01 994.7
3 1976-03-01 1243.6
4 1976-04-01 1191.2
5 1976-05-01 1203.2
6 1976-06-01 1254.7
summary(vehicle_df)
     date           vehicle_sales
 Length:575         Min.   : 670.5
 Class :character   1st Qu.:1117.2
 Mode  :character   Median :1268.5
                    Mean   :1261.7
                    3rd Qu.:1420.2
                    Max.   :1845.7
date_range <- range(vehicle_df$date)
date_range
[1] "1976-01-01" "2023-11-01"
nrow(vehicle_df)
[1] 575
Visualizations
Histogram
hist(vehicle_df$vehicle_sales, freq = FALSE)
lines(density(vehicle_sales_tbl_ts$value), lwd = 3, col = "red")
ggplot() +
  geom_line(data = vehicle_sales_tbl_ts, aes(x = date, y = value)) +
  labs(x = "Month", y = "Vehicle Sales", title = "Monthly Vehicle Sales Over Time")
The histogram suggests that the sales values are roughly normally distributed, with no obvious outliers. Plotting the series over time reveals a gradual upward trend with pronounced recurring swings, indicating a likely annual seasonal pattern; because the data are monthly, these within-year highs and lows are clearly visible in the plot.
Moving average of the time series
# Calculate 6-month moving average
vehicle_sales_tbl_ts <- vehicle_sales_tbl_ts %>%
mutate(ma_6 = slide_dbl(.x = value, .f = ~mean(.x, na.rm = TRUE), .before = 5))
# Create the plot
ggplot(vehicle_sales_tbl_ts, aes(x = date)) +
geom_line(aes(y = value), color = "blue", na.rm = TRUE) +
geom_line(aes(y = ma_6), color = "red", na.rm = TRUE) +
labs(x = "Date", y = "Vehicle Sales",
title = "Vehicle Sales with 6-Month Moving Average",
subtitle = "Blue: Actual Sales, Red: 6-Month MA") +
theme_minimal()
Comparing moving-average windows, a 6-month moving average offers a good balance: it smooths the month-to-month noise while still preserving the seasonal and cyclical structure. This, together with the seasonality anticipated in the data, motivates an STL decomposition for deeper analysis.
The decomposition uncovers a clear seasonal pattern that persists even after the overall trend is removed with the moving average: a consistent annual cycle in which a 2-3 month period of elevated sales is followed by a decline before the cycle restarts. The timing and structure of this pattern point to the influence of the financial year and holiday periods. As the accompanying plot shows, the seasonal component appears additive.
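For reference, the window comparison behind the 6-month choice can be reproduced with the same slide_dbl() helper used above; the sketch below overlays 3-, 6-, and 12-month trailing averages (window lengths chosen here purely for illustration).
# Sketch: compare 3-, 6-, and 12-month trailing moving averages (assumes slider and ggplot2 are loaded, as above)
ma_compare <- vehicle_sales_tbl_ts %>%
  mutate(
    ma_3  = slide_dbl(value, mean, .before = 2,  na.rm = TRUE),
    ma_6  = slide_dbl(value, mean, .before = 5,  na.rm = TRUE),
    ma_12 = slide_dbl(value, mean, .before = 11, na.rm = TRUE)
  )
ggplot(ma_compare, aes(x = date)) +
  geom_line(aes(y = value), color = "grey70") +
  geom_line(aes(y = ma_3), color = "green") +
  geom_line(aes(y = ma_6), color = "red") +
  geom_line(aes(y = ma_12), color = "blue") +
  labs(x = "Date", y = "Vehicle Sales",
       title = "3-, 6-, and 12-Month Moving Averages (green, red, blue)")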
# Calculate the difference between the actual data and the 6-month moving average
vehicle_sales_tbl_ts <- vehicle_sales_tbl_ts %>%
  mutate(diff_6 = value - ma_6)
# Convert the tsibble to a ts object with a frequency of 12
vehicle_sales_ts <- ts(vehicle_sales_tbl_ts$value, frequency = 12)
# Decompose the time series
decomposed <- stl(vehicle_sales_ts, s.window="periodic")
# Plot the decomposed time series
autoplot(decomposed) +
labs(x = "Date", y = "Vehicle Sales",
title = "Decomposition of Vehicle Sales Time Series",
subtitle = "Trend, Seasonal, and Random components") +
theme_minimal()
set.seed(123)
# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)
# Split the data into training and testing sets
train_data <- vehicle_sales_tbl_ts %>% slice(1:train_rows)
test_data <- vehicle_sales_tbl_ts %>% slice((train_rows + 1):n_rows)
# Visualize the training and test sets
ggplot() +
geom_line(data = train_data, aes(x = date, y = value, color = "Training"), linewidth = 0.3) +
geom_line(data = test_data, aes(x = date, y = value, color = "Test"), linewidth = 0.3) +
scale_color_manual(values = c("Training" = "blue", "Test" = "red")) +
labs(title = "Training and Test Sets", x = "Date", y = "vehicle sales") +
theme_minimal()
Section 2
ARIMA Modeling
# Load the necessary library
library(zoo)
# Calculate the 6-month moving average for the training data
train_data <- train_data %>%
mutate(ma_6 = rollmean(value, 6, align = "center", fill = NA))
# Plot the 6-month moving average
ggplot(train_data, aes(x = date)) +
geom_line(aes(y = ma_6), color = "blue", na.rm = TRUE) +
labs(x = "Date", y = "Vehicle Sales",
title = "Vehicle Sales with 6-Month Moving Average") +
theme_minimal()
The 6-month rolling average plot reveals a clear upward trend, indicating that the series is not mean-stationary and will need to be adjusted before ARIMA modeling. The variance looks reasonably stable: the peaks and troughs vary somewhat in size, and there are occasional spikes over three or four consecutive periods, but nothing that suggests a variance-stabilizing transformation is required.
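One way to corroborate that decision is to estimate a Box-Cox lambda for the training series; a value close to 1 indicates that no transformation is needed. A minimal sketch, assuming the forecast package (used below for Arima() and forecast()) is loaded:
# Sketch: estimated Box-Cox lambda for the training series; a value near 1 supports
# skipping a variance-stabilizing transformation (assumes the forecast package is loaded)
BoxCox.lambda(train_data$value)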
suppressWarnings({
suppressMessages({
# Convert the training data to a ts object with a frequency of 12
train_sales_ts <- ts(train_data$value, frequency = 12)
# Calculate the rolling standard deviation for the training data
rolling_sd_train <- rollapply(train_sales_ts, width = 13, FUN = sd, align = "center", fill = NA)
# Create a data frame with the rolling standard deviation for the training data
rolling_sd_train_df <- data.frame(time = time(rolling_sd_train), sd = as.vector(rolling_sd_train))
# Plot the rolling standard deviation for the training data
ggplot(rolling_sd_train_df, aes(x = time, y = sd)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(x = "Time", y = "Standard Deviation",
title = "Rolling Standard Deviation of Training Data") +
theme_minimal()
options(warn=-1)
})
})
The rolling standard deviation plot shows a slight downward drift in variance over time. Given how gradual this drift is and the consistent year-over-year variation, we conclude that no adjustment for variance non-stationarity is required.
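Before differencing, it can also be useful to check how many ordinary and seasonal differences the standard unit-root tests suggest; a minimal sketch using forecast's ndiffs() and nsdiffs() on the monthly ts object built above:
# Sketch: unit-root-based suggestions for the differencing orders (assumes forecast is loaded)
ndiffs(train_sales_ts)    # suggested number of first differences
nsdiffs(train_sales_ts)   # suggested number of seasonal differences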
# Create a seasonally differenced series for the training data
train_data_diff <- diff(train_data$value, lag = 12)
# Convert the differenced training data into a monthly time series object
train_data_diff_ts <- ts(train_data_diff, frequency = 12)
# Create the ACF and PACF plots for the differenced training data
tsdisplay(train_data_diff_ts)
kpss_result <- kpss.test(train_data$value)
kpss_result
KPSS Test for Level Stationarity
data: train_data$value
KPSS Level = 1.1514, Truncation lag parameter = 5, p-value = 0.01
kpss_diff <- kpss.test(diff(train_data$value))
kpss_diff
KPSS Test for Level Stationarity
data: diff(train_data$value)
KPSS Level = 0.013876, Truncation lag parameter = 5, p-value = 0.1
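As a complementary check, an Augmented Dickey-Fuller test (whose null hypothesis is a unit root, the reverse of KPSS) could be run on the original and differenced series; a minimal sketch assuming the tseries package that supplies kpss.test() is loaded:
# Sketch: ADF test as a cross-check of the KPSS results (adf.test is in the tseries package)
adf.test(train_data$value)
adf.test(diff(train_data$value))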
Analysis and Model fitting
mod4 = Arima(train_data$value, order = c(2, 0, 2), seasonal = c(0, 0, 0))
summary(mod4)
Series: train_data$value
ARIMA(2,0,2) with non-zero mean
Coefficients:
ar1 ar2 ma1 ma2 mean
0.1891 0.6556 0.3410 -0.3700 1229.7697
s.e. 0.0998 0.0824 0.1072 0.0578 41.1733
sigma^2 = 20981: log likelihood = -2939.5
AIC=5891.01 AICc=5891.19 BIC=5915.79
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 1.349663 144.0588 117.0126 -1.34284 9.788813 0.9002761 -0.0220673
# Extract the residuals from the model
residuals <- residuals(mod4)
# Perform the Ljung-Box test
ljung_box_test <- Box.test(residuals, type = "Ljung-Box")
ljung_box_test
Box-Ljung test
data: residuals
X-squared = 0.22547, df = 1, p-value = 0.6349
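A fuller set of residual diagnostics (time plot, ACF, and histogram, plus a Ljung-Box test at a larger lag) is available in a single call; a minimal sketch using forecast::checkresiduals() on the fitted model:
# Sketch: combined residual diagnostics for mod4 (assumes the forecast package is loaded)
checkresiduals(mod4)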
forecast_values <- forecast(mod4, h = 6)
forecast_values
    Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
461 1376.227 1190.596 1561.857 1092.3295 1660.124
462 1359.722 1149.627 1569.817 1038.4091 1681.034
463 1350.358 1128.390 1572.327 1010.8867 1689.830
464 1337.767 1102.476 1573.058 977.9209 1697.614
465 1329.248 1085.998 1572.498 957.2291 1701.267
466 1319.383 1068.147 1570.618 935.1508 1703.614
# Plot the forecast
autoplot(forecast_values, conf.int = TRUE) +
  ylab('Vehicle Sales') +
  ggtitle('6-Month Forecast of Vehicle Sales')
For the original sales series, the KPSS test's very small p-value (0.01) leads us to reject the hypothesis that the series is level-stationary. Once the series is differenced to remove the trend, the KPSS test returns a p-value of 0.10, above the 0.05 cutoff.
This confirms that the differenced series is mean-stationary, and no additional transformations are necessary.
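As a cross-check on the manually specified ARIMA(2,0,2), an automated order search could also be run on the monthly training series; a minimal sketch using forecast::auto.arima() (the selected order may differ from the manual choice):
# Sketch: automated order selection as a cross-check of the manual ARIMA(2,0,2)
# (assumes the forecast package is loaded; train_sales_ts is the monthly ts object built earlier)
auto_mod <- auto.arima(train_sales_ts, seasonal = TRUE)
summary(auto_mod)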
Section 3
Prophet Modeling
Facebook’s Prophet is a forecasting tool designed for time series, typically with daily or finer observations, that captures trend, seasonal variation, and holiday effects. It is built to be easy to use, with interpretable parameters, automatic changepoint detection, and sensible defaults. Prophet is adept at identifying complex patterns within data, making it a useful tool for time-series prediction.
The basic model is fit as:
y_t = g_t + s_t + h_t + e_t
where g_t is the trend, s_t the seasonality, h_t the holiday effects, and e_t the error term.
set.seed(123)
# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)
# Split the data into training and testing sets
vehicle_train <- vehicle_sales_tbl_ts %>% slice(1:train_rows)
vehicle_test <- vehicle_sales_tbl_ts %>% slice((train_rows + 1):n_rows)
Analysis of Seasonality
The Prophet model reveals a pronounced annual seasonal pattern, supporting the earlier assessment that the seasonality in the data is additive rather than multiplicative. We first fit a baseline Prophet model and examine its decomposition:
model = vehicle_train %>%
model(prophet = fable.prophet::prophet(vehicle_sales))
model %>%
components() %>%
autoplot()
changepoints = model %>%
glance() %>%
pull(changepoints) %>%
bind_rows() %>%
.$changepoints
vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
geom_vline(xintercept=as.Date(changepoints),color='red',linetype='dashed')
model2 = vehicle_train %>%
model(
prophet_orig = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 25) + season(period = 'year') + season(period = 'month', order = 12)),
prophet_10 = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 10) + season(period = 'year') + season(period = 'month', order = 12))
)
changepoints_orig = model2 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints
changepoints_10 = model2 %>%
glance() %>%
filter(.model == 'prophet_10') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints
vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
# geom_vline(aes(xintercept=ymd('2000-01-01')))
geom_vline(xintercept=as.Date(changepoints_orig),color='red')+
geom_vline(xintercept=as.Date(changepoints_10),color='blue',linetype='dashed')
model1 = vehicle_train %>%
model(
prophet_orig = fable.prophet::prophet(vehicle_sales),
prophet_90_range = fable.prophet::prophet(vehicle_sales~growth(changepoint_range=0.9)+season(period='year')+ season(period = 'month', order = 12)),
)
changepoints_orig = model1 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints
changepoints_90_range = model1 %>%
glance() %>%
filter(.model == 'prophet_90_range') %>%
unnest(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints
model1 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_90_range),color='red',linetype='dashed')
model3 = vehicle_train %>%
model(
prophet_orig = fable.prophet::prophet(vehicle_sales),
prophet_less_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.01)+season(period='year')),
prophet_more_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.5)+season(period='year'))
)
changepoints_orig = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_orig') %>%
filter(abs(adjustment)>=0.05) %>%
.$changepoints
changepoints_more_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_more_flexible') %>%
filter(abs(adjustment)>=0.50) %>%
.$changepoints
changepoints_less_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_less_flexible') %>%
filter(abs(adjustment)>=0.01) %>%
.$changepoints
model3 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_more_flexible),color='green',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_less_flexible),color='red',linetype='dashed')
vehicle_train %>%
model(
prophet_orig = fable.prophet::prophet(vehicle_sales),
prophet_saturating = fable.prophet::prophet(vehicle_sales~growth(type='linear',capacity=3.7,floor=2.8)+season('year'))
) %>%
forecast(h=68) %>%
autoplot(vehicle_train)
The forecasts from the less flexible changepoint configuration outperform those from the models with more changepoints, suggesting that a simpler specification better captures the data's underlying trend and may yield more reliable forecasts.
Despite substantial modifications to the floor and capacity parameters, the forecasts remain closely aligned with those from the original Prophet model. This indicates that, although in theory these parameters should influence the model’s performance, their actual contribution to forecast precision seems minimal in this case. Additional investigation and testing might be required to fully comprehend their impact.
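A likely reason the floor and capacity settings barely move the forecasts is that Prophet only uses them in its logistic (saturating) growth curve; with growth(type = 'linear') they have no effect. The sketch below shows what a saturating specification would look like; the capacity and floor values are illustrative placeholders on the scale of monthly vehicle sales, not values taken from the analysis above.
# Sketch: logistic (saturating) growth, where capacity and floor actually constrain the trend
# (capacity = 2000 and floor = 500 are illustrative placeholders, not fitted values)
vehicle_train %>%
  model(
    prophet_logistic = fable.prophet::prophet(
      vehicle_sales ~ growth(type = 'logistic', capacity = 2000, floor = 500) +
        season(period = 'year')
    )
  ) %>%
  forecast(h = 68) %>%
  autoplot(vehicle_train)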
model4 = vehicle_train %>%
model(
additive = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='additive')),
multiplicative = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='multiplicative')))
model4 %>%
components() %>%
autoplot()
The decomposition plots for the two specifications indicate that the seasonality in the time series is additive rather than multiplicative.
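The same conclusion can be checked numerically by comparing the in-sample fit of the two specifications; a minimal sketch using fabletools' accuracy() on the fitted models:
# Sketch: in-sample accuracy of the additive vs. multiplicative seasonal specifications
model4 %>%
  accuracy() %>%
  dplyr::select(.model, RMSE, MAE, MAPE)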
Section 4
vehicle_train %>%
bind_rows(vehicle_test) %>%
ggplot()+
geom_line(aes(date, vehicle_sales))+
geom_vline(aes(xintercept=ymd('2014-05-01')),color='red')+
annotate("text", x = as.Date("1976-01-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9,
label = "Train", color = "blue", size = 4) +
annotate("text", x = as.Date("2023-11-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9,
label = "Test", color = "blue", size = 4) +
labs(x = "Week", y = "vehicle sales", title = "Training and Testing Dataset") +
theme_bw()vehicle_train_data <- vehicle_train %>%
stretch_tsibble(.init = 22*12, .step = 12)
vehicle_train_data %>%
ggplot()+
geom_point(aes(date,factor(.id),color=factor(.id)))+
ylab('Iteration')+
ggtitle('Samples included in each CV Iteration')
vehicle_train_cv_forecast <- vehicle_train_data %>%
model(
arima = ARIMA(vehicle_sales~pdq(2,0,2)),
prophet = fable.prophet::prophet(vehicle_sales~growth()+season(period='year')),
naive = NAIVE(vehicle_sales~drift())
) %>%
forecast(h=6)
vehicle_train_cv_forecast %>%
as_tsibble() %>%
dplyr::select(-vehicle_sales) %>%
left_join(
vehicle_train
) %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
geom_line(aes(date,.mean,color=factor(.id),linetype=.model))+
scale_color_discrete(name='Iteration')+
ylab('Vehicle Sales')+
theme_bw()
Joining with `by = join_by(date)`
vehicle_train_cv_forecast %>%
as_tibble() %>%
dplyr::select(-vehicle_sales) %>%
left_join(vehicle_train) %>%
group_by(.id,.model) %>%
mutate(
months_ahead = seq(1:6)
) %>%
ungroup() %>%
mutate(error = abs(vehicle_sales - .mean)) %>%
ggplot()+
geom_boxplot(aes(factor(months_ahead),error))+
facet_wrap(~.model,ncol=1)+
guides(color='none')+
ylab('Absolute Error')+
xlab('Months Ahead')+
ggtitle('Absolute Error by Forecast Horizon')+
ylim(0,600)
Joining with `by = join_by(date)`
vehicle_train_cv_forecast %>%
accuracy(vehicle_train)
# A tibble: 3 × 10
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 arima Test -16.4 91.1 72.0 -1.94 5.88 0.644 0.616 0.533
2 naive Test -27.2 190. 152. -4.49 12.6 1.36 1.29 0.278
3 prophet Test -34.5 210. 166. -4.83 14.5 1.48 1.42 0.865
vehicle_train_cv_forecast %>%
as_tibble() %>%
dplyr::select(-vehicle_sales) %>%
left_join(vehicle_train) %>%
group_by(.id,.model) %>%
mutate(
months_ahead = seq(1:6)
) %>%
ungroup() %>%
filter(!is.na(vehicle_sales)) %>%
group_by(months_ahead,.model) %>%
summarize(
rmse = sqrt(mean((vehicle_sales - .mean)^2,na.rm=T)),
) %>%
ungroup() %>%
ggplot()+
geom_point(aes(months_ahead,rmse,color=.model),alpha=0.4)+
geom_smooth(aes(months_ahead,rmse,color=.model),se=F)+
xlab("Months Ahead")+
ylab("Smoothed RMSE")Joining with `by = join_by(date)`
`summarise()` has grouped output by 'months_ahead'. You can override using the
`.groups` argument.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
vehicle_train_cv_forecast %>%
as_tibble() %>%
dplyr::select(-vehicle_sales) %>%
left_join(vehicle_train) %>%
group_by(.id,.model) %>%
mutate(
months_ahead = seq(1:6)
) %>%
ungroup() %>%
group_by(months_ahead,.model) %>%
mutate(
mape = mean(abs((vehicle_sales - .mean)/vehicle_sales),na.rm=T)
) %>%
ungroup() %>%
ggplot()+
geom_point(aes(months_ahead,mape,color=.model),alpha=0.4)+
geom_smooth(aes(months_ahead,mape,color=.model),se=F)+
xlab("Months Ahead")+
ylab("Smoothed MAPE")Joining with `by = join_by(date)`
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The ARIMA model demonstrates superior performance with lower MAPE and RMSE values compared to other models.
While the Prophet model captures the overall trend and variability in the data, it is less precise about the magnitude of individual points: it tracks the broad patterns well but tends to underestimate the actual values, which hurts its accuracy on the held-out folds. The ARIMA model, by contrast, achieves the lowest error rates, with better RMSE and MAPE.
The ARIMA model's fitted values track the training data closely, suggesting minimal bias and a good in-sample fit. However, the forecasts show less variability than the test data, hinting at possible overfitting: the model captures the historical patterns well, but its generalizability to unseen data may be limited. Addressing this concern and refining the model could improve forecast accuracy.
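To make that assessment concrete, the selected specification could be refit on the full training set and scored against the held-out test data; a minimal sketch using the same fable interface as the cross-validation above:
# Sketch: refit the selected ARIMA specification on the training set and evaluate it on the test set
final_fit <- vehicle_train %>%
  model(arima = ARIMA(vehicle_sales ~ pdq(2, 0, 2)))
final_fc <- final_fit %>%
  forecast(h = nrow(vehicle_test))
final_fc %>% accuracy(vehicle_test)                            # test-set error metrics
final_fc %>% autoplot(bind_rows(vehicle_train, vehicle_test))  # forecast vs. actuals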