BANA 7050_FinalProject_KeerthiChereddy

Section 1

Data source and data-generating process

The dataset represents the total vehicle sales in the USA (“TOTALNSA” series) published by the Federal Reserve Bank of St. Louis (FRED). This time series data, updated monthly, serves as a critical economic indicator, reflecting consumer spending trends and the trends in the automotive industry. The dataset spans several years, offering a holistic view of vehicle sales dynamics.

Vehicle sales are very closely influenced by macro economic trends and hence fluctuations in income levels, economic policies etc. can lead to significant variability in the data. Several factors such as Government policies, environmental regulations, shift in consumer preferences, new technologies (Electric Vehicles) can have a direct impact on the industry. Vehicle sales data exhibits seasonality such as year-end sales traditionally showing higher sales figures. The impact of all the above factors and their interplay makes forecasting this time series challenging.

Summary Statistics

vehicle_df <- read.csv("//Users/keerthichereddy/Documents/Sem2/Forecasting Methods/Project/total_vehicle_sales.csv", stringsAsFactors = FALSE)

 vehicle_sales_tbl_ts <- vehicle_df %>% select(date, vehicle_sales) %>%
  mutate(value = vehicle_sales) %>%
  mutate(date = yearmonth(date)) %>%
  as_tsibble(index = date)
head(vehicle_df)
        date vehicle_sales
1 1976-01-01         885.2
2 1976-02-01         994.7
3 1976-03-01        1243.6
4 1976-04-01        1191.2
5 1976-05-01        1203.2
6 1976-06-01        1254.7
summary(vehicle_df)
     date           vehicle_sales   
 Length:575         Min.   : 670.5  
 Class :character   1st Qu.:1117.2  
 Mode  :character   Median :1268.5  
                    Mean   :1261.7  
                    3rd Qu.:1420.2  
                    Max.   :1845.7  
date_range <- range(vehicle_df$date)
date_range
[1] "1976-01-01" "2023-11-01"
nrow(vehicle_df)
[1] 575

Visualizations

##Histogram
hist(vehicle_df$vehicle_sales, freq = FALSE)
lines(density(vehicle_sales_tbl_ts$value), lwd = 3, col = "red")

ggplot() +
  geom_line(data = vehicle_sales_tbl_ts, aes(x = date, y = value)) +
  labs(x = "Month", y = "Vehicle Sales", title = "Monthly Vehicle Sales Over Time")

The histogram suggests that the data is distributed fairly normal without any outliers in the data set. Visualizing sales data over time reveals a steady increase with significant seasonal changes, indicating a possible annual pattern. The monthly aggregated data likely explains the quarterly highs and lows observed.

Moving average of the time series

# Calculate 6-month moving average
vehicle_sales_tbl_ts <- vehicle_sales_tbl_ts %>%
  mutate(ma_6 = slide_dbl(.x = value, .f = ~mean(.x, na.rm = TRUE), .before = 5))

# Create the plot
ggplot(vehicle_sales_tbl_ts, aes(x = date)) +
  geom_line(aes(y = value), color = "blue", na.rm = TRUE) +
  geom_line(aes(y = ma_6), color = "red", na.rm = TRUE) +
  labs(x = "Date", y = "Vehicle Sales", 
       title = "Vehicle Sales with 6-Month Moving Average",
       subtitle = "Blue: Actual Sales, Red: 6-Month MA") +
  theme_minimal()

Analyzing the moving average plots, a 6-month moving average emerges as optimal for smoothing the data while preserving seasonal or cyclical trends. This approach, combined with the anticipation of seasonality in the data suggests applying STL decomposition for deeper analysis.

This method uncovers a clear seasonal trend that persists even after adjusting for the overall trend with a moving average. We identify a consistent annual cycle, marked by peaks and troughs, followed by a 2-3 month period of increased activity, and then a decline before the cycle restarts. The timing and structure of this pattern suggest a significant influence from the financial year and holiday periods. The resulting data, illustrated by the accompanying plot, exhibits an additive nature.

# Calculate the difference between the 12-month moving average and the actual data
vehicle_sales_tbl_ts <- vehicle_sales_tbl_ts %>%
  mutate(diff_12 = value - ma_6)

# Convert the tsibble to a ts object with a frequency of 12
vehicle_sales_ts <- ts(vehicle_sales_tbl_ts$value, frequency = 12)

# Decompose the time series
decomposed <- stl(vehicle_sales_ts, s.window="periodic")

# Plot the decomposed time series
autoplot(decomposed) +
  labs(x = "Date", y = "Vehicle Sales", 
       title = "Decomposition of Vehicle Sales Time Series",
       subtitle = "Trend, Seasonal, and Random components") +
  theme_minimal()

set.seed(123)

# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)

# Split the data into training and testing sets
train_data <- vehicle_sales_tbl_ts %>% slice(1:train_rows)
test_data <- vehicle_sales_tbl_ts %>% slice((train_rows + 1):n_rows)

# Visualize the training and test sets
ggplot() +
  geom_line(data = train_data, aes(x = date, y = value, color = "Training"), linewidth = 0.3) +
  geom_line(data = test_data, aes(x = date, y = value, color = "Test"), linewidth = 0.3) +
  scale_color_manual(values = c("Training" = "blue", "Test" = "red")) +
  labs(title = "Training and Test Sets", x = "Date", y = "vehicle sales") +
  theme_minimal()

Section 2

ARIMA Modeling

# Load the necessary library
library(zoo)

# Calculate the 6-month moving average for the training data
train_data <- train_data %>%
  mutate(ma_6 = rollmean(value, 6, align = "center", fill = NA))

# Plot the 6-month moving average
ggplot(train_data, aes(x = date)) +
  geom_line(aes(y = ma_6), color = "blue", na.rm = TRUE) +
  labs(x = "Date", y = "Vehicle Sales", 
       title = "Vehicle Sales with 6-Month Moving Average") +
  theme_minimal()

The 6-month rolling average plot reveals a clear upward trend, suggesting that the time series is not mean-stationary, which requires adjustment before we can use ARIMA modeling. The data shows some level of variance stationarity, with variations in its peaks and troughs. Despite occasional spikes over three or four consecutive periods, there appears to be no need for variance-stabilizing transformations.

suppressWarnings({
  suppressMessages({
    
# Convert the training data to a ts object with a frequency of 12
train_sales_ts <- ts(train_data$value, frequency = 12)

# Calculate the rolling standard deviation for the training data
rolling_sd_train <- rollapply(train_sales_ts, width = 13, FUN = sd, align = "center", fill = NA)

# Create a data frame with the rolling standard deviation for the training data
rolling_sd_train_df <- data.frame(time = time(rolling_sd_train), sd = as.vector(rolling_sd_train))

# Plot the rolling standard deviation for the training data
ggplot(rolling_sd_train_df, aes(x = time, y = sd)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Time", y = "Standard Deviation", 
       title = "Rolling Standard Deviation of Training Data") +
  theme_minimal()


options(warn=-1)
})
})

The rolling standard deviation plot shows a minor downtrend in variance over time. Given the slow pace of this trend and the consistent variation noted year over year, we conclude that adjustments for variance non-stationarity are not required.

# Create a seasonally differenced series for the training data
train_data_diff <- diff(train_data$value, lag = 12)

# Convert the differenced training data back into a time series object
train_data_diff_ts <- ts(train_data_diff, start = start(train_data$value), frequency = frequency(train_data$value))

# Create the ACF and PACF plot for the training data
tsdisplay(train_data_diff_ts)

kpss_result <- kpss.test(train_data$value)
kpss_result

    KPSS Test for Level Stationarity

data:  train_data$value
KPSS Level = 1.1514, Truncation lag parameter = 5, p-value = 0.01
kpss_diff <- kpss.test(diff(train_data$value))
kpss_diff

    KPSS Test for Level Stationarity

data:  diff(train_data$value)
KPSS Level = 0.013876, Truncation lag parameter = 5, p-value = 0.1

Analysis and Model fitting

mod4 = Arima(train_data$value, order = c(2, 0, 2), seasonal = c(0, 0, 0))

summary(mod4)
Series: train_data$value 
ARIMA(2,0,2) with non-zero mean 

Coefficients:
         ar1     ar2     ma1      ma2       mean
      0.1891  0.6556  0.3410  -0.3700  1229.7697
s.e.  0.0998  0.0824  0.1072   0.0578    41.1733

sigma^2 = 20981:  log likelihood = -2939.5
AIC=5891.01   AICc=5891.19   BIC=5915.79

Training set error measures:
                   ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
Training set 1.349663 144.0588 117.0126 -1.34284 9.788813 0.9002761 -0.0220673
# Extract the residuals from the model
residuals <- residuals(mod4)

# Perform the Ljung-Box test
ljung_box_test <- Box.test(residuals, type = "Ljung-Box")
ljung_box_test

    Box-Ljung test

data:  residuals
X-squared = 0.22547, df = 1, p-value = 0.6349
forecast_values <- forecast(mod4, h = 6)
forecast_values
    Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
461       1376.227 1190.596 1561.857 1092.3295 1660.124
462       1359.722 1149.627 1569.817 1038.4091 1681.034
463       1350.358 1128.390 1572.327 1010.8867 1689.830
464       1337.767 1102.476 1573.058  977.9209 1697.614
465       1329.248 1085.998 1572.498  957.2291 1701.267
466       1319.383 1068.147 1570.618  935.1508 1703.614
# Plot the forecast
autoplot(forecast_values, conf.int = TRUE) +
  ylab('Retail Sales') +
  ggtitle('6 Month Forecast of Retail Sales')

From the analysis of the original sales data, the extremely low p-value suggests discarding the notion that the data was mean-stationary. However, once differencing is applied to eliminate trend and seasonality, the KPSS Test produced a p-value of 0.10, which exceeds the 0.05 cutoff.

This outcome verifies that the altered time series has achieved mean stationarity, suggesting no additional transformations are necessary.

Section 3

Prophet Modeling

Facebook’s Prophet is a forecasting tool designed for time-series data that typically includes daily records, capturing trends, seasonal variations, and impacts of holidays. Its user-friendly design features easy parameter tuning, automatic selection of characteristics, and advanced modeling methods. Prophet is adept at identifying complex patterns within data, positioning it as a powerful tool for precise time-series predictions.

The basic model is fit as:

yt = gt + st +ht + et

where gt is the trend, st is seasonality, and ht are holidays.

set.seed(123)

# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)

# Split the data into training and testing sets
vehicle_train <- vehicle_sales_tbl_ts %>% slice(1:train_rows)
vehicle_test <- vehicle_sales_tbl_ts %>% slice((train_rows + 1):n_rows)

Analysis of Seasonality

The Prophet model reveals a pronounced annual seasonal trend, supporting the previous assessment that the seasonality in the data is additive, not multiplicative. Now revisit the decomposition plot, updated to reflect the most recent iteration of the model:

model = vehicle_train %>%
    model(prophet = fable.prophet::prophet(vehicle_sales))

model %>%
components() %>%
autoplot()

changepoints = model %>%
glance() %>%
pull(changepoints) %>%
bind_rows() %>%
.$changepoints

vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
geom_vline(xintercept=as.Date(changepoints),color='red',linetype='dashed')

model2 = vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 25) + season(period = 'year') + season(period = 'month', order = 12)),
        prophet_10 = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 10) + season(period = 'year') + season(period = 'month', order = 12))
        )


changepoints_orig = model2 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints

changepoints_10 = model2 %>%
glance() %>%
filter(.model == 'prophet_10') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints

vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
# geom_vline(aes(xintercept=ymd('2000-01-01')))
geom_vline(xintercept=as.Date(changepoints_orig),color='red')+
geom_vline(xintercept=as.Date(changepoints_10),color='blue',linetype='dashed')

model1 = vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales),
        prophet_90_range = fable.prophet::prophet(vehicle_sales~growth(changepoint_range=0.9)+season(period='year')+ season(period = 'month', order = 12)),

        )

changepoints_orig = model1 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints

changepoints_90_range = model1 %>%
glance() %>%
filter(.model == 'prophet_90_range') %>%
unnest(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints

model1 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_90_range),color='red',linetype='dashed')

model3 = vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales),
        prophet_less_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.01)+season(period='year')),
        prophet_more_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.5)+season(period='year'))
        )

changepoints_orig = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>% 
filter(.model == 'prophet_orig') %>%
filter(abs(adjustment)>=0.05) %>%
.$changepoints

changepoints_more_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_more_flexible') %>%
filter(abs(adjustment)>=0.50) %>%
.$changepoints

changepoints_less_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_less_flexible') %>%
filter(abs(adjustment)>=0.01) %>%
.$changepoints

model3 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_more_flexible),color='green',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_less_flexible),color='red',linetype='dashed')

vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales),
        prophet_saturating = fable.prophet::prophet(vehicle_sales~growth(type='linear',capacity=3.7,floor=2.8)+season('year'))
        ) %>%
    forecast(h=68) %>%
    autoplot(vehicle_train)

The forecasting results using the less flexible changepoint configuration outshine those from models with more changepoints, indicating that a more straightforward approach might more accurately reflect the data’s inherent trends. This suggests that models with fewer changepoints could lead to more reliable forecasts.

Despite substantial modifications to the floor and capacity parameters, the forecasts remain closely aligned with those from the original Prophet model. This indicates that, although in theory these parameters should influence the model’s performance, their actual contribution to forecast precision seems minimal in this case. Additional investigation and testing might be required to fully comprehend their impact.

model4 = vehicle_train %>%
    model(
      additive = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='additive')),
      multiplicative = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='multiplicative')))

model4 %>%
components() %>%
autoplot()

The decomposition chart and accompanying data table highlight the presence of additive, as opposed to multiplicative, seasonality in the time series.

Section 4

vehicle_train %>%
bind_rows(vehicle_test) %>%
    ggplot()+
    geom_line(aes(date, vehicle_sales))+
    geom_vline(aes(xintercept=ymd('2014-05-01')),color='red')+
  annotate("text", x = as.Date("1976-01-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9, 
           label = "Train", color = "blue", size = 4) +
  annotate("text", x = as.Date("2023-11-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9, 
           label = "Test", color = "blue", size = 4) +
  labs(x = "Week", y = "vehicle sales", title = "Training and Testing Dataset") +
    theme_bw()

vehicle_train_data <- vehicle_train %>%
  stretch_tsibble(.init = 22*12, .step = 12)

vehicle_train_data %>%
    ggplot()+
    geom_point(aes(date,factor(.id),color=factor(.id)))+
    ylab('Iteration')+
    ggtitle('Samples included in each CV Iteration')

vehicle_train_cv_forecast <- vehicle_train_data %>%
  model(
    arima = ARIMA(vehicle_sales~pdq(2,0,2)),
    prophet = fable.prophet::prophet(vehicle_sales~growth()+season(period='year')),
    naive = NAIVE(vehicle_sales~drift())
  ) %>%
  forecast(h=6)


vehicle_train_cv_forecast %>%
  as_tsibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(
      vehicle_train
  ) %>%
    ggplot()+
    geom_line(aes(date,vehicle_sales))+
    geom_line(aes(date,.mean,color=factor(.id),linetype=.model))+
    scale_color_discrete(name='Iteration')+
    ylab('Ridership Count')+
    theme_bw()
Joining with `by = join_by(date)`

vehicle_train_cv_forecast %>%
  as_tibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(vehicle_train) %>%
  group_by(.id,.model) %>%
  mutate(
    months_ahead = seq(1:6)
  ) %>%
  ungroup() %>%
  mutate(error = abs(vehicle_sales - .mean)) %>%
  ggplot()+
  geom_boxplot(aes(factor(months_ahead),error))+
  facet_wrap(~.model,ncol=1)+
  guides(color='none')+
  ylab('Absolute Error')+
  xlab('Days Ahead')+
  ggtitle('Absolute Error by Iteration')+
  ylim(0,600)
Joining with `by = join_by(date)`

vehicle_train_cv_forecast %>%
  accuracy(vehicle_train)
# A tibble: 3 × 10
  .model  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 arima   Test  -16.4  91.1  72.0 -1.94  5.88 0.644 0.616 0.533
2 naive   Test  -27.2 190.  152.  -4.49 12.6  1.36  1.29  0.278
3 prophet Test  -34.5 210.  166.  -4.83 14.5  1.48  1.42  0.865
vehicle_train_cv_forecast %>%
  as_tibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(vehicle_train) %>% 
  group_by(.id,.model) %>%
  mutate(
    months_ahead = seq(1:6)
  ) %>%
  ungroup() %>% 
  filter(!is.na(vehicle_sales)) %>%
  group_by(months_ahead,.model) %>%
  summarize(
    rmse = sqrt(mean((vehicle_sales - .mean)^2,na.rm=T)),
  ) %>%
  ungroup() %>%
  ggplot()+
  geom_point(aes(months_ahead,rmse,color=.model),alpha=0.4)+ 
  geom_smooth(aes(months_ahead,rmse,color=.model),se=F)+
  xlab("Months Ahead")+
  ylab("Smoothed RMSE")
Joining with `by = join_by(date)`
`summarise()` has grouped output by 'months_ahead'. You can override using the
`.groups` argument.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

vehicle_train_cv_forecast %>%
  as_tibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(vehicle_train) %>%
  group_by(.id,.model) %>%
  mutate(
    months_ahead = seq(1:6)
  ) %>%
  ungroup() %>%
  group_by(months_ahead,.model) %>%
  mutate(
    mape = mean(abs((vehicle_sales - .mean)/vehicle_sales),na.rm=T)
  ) %>%
  ungroup() %>%
  ggplot()+
  geom_point(aes(months_ahead,mape,color=.model),alpha=0.4)+ 
  geom_smooth(aes(months_ahead,mape,color=.model),se=F)+
  xlab("Months Ahead")+
  ylab("Smoothed MAPE")
Joining with `by = join_by(date)`
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The ARIMA model demonstrates superior performance with lower MAPE and RMSE values compared to other models.

While the Prophet model effectively identifies data trends and variability, its precision in predicting the actual magnitude of data points falls short. This indicates that although it tracks the overall data patterns well, it tends to underestimate the actual values, resulting in less accurate performance on the test set. The ARIMA model, however, stands out with the lowest error rates, showcasing its efficacy through better RMSE and MAPE metrics.

Comparing the model’s fitted values to the actual data shows a close match, suggesting minimal bias and a good fit with the training data. However, a closer examination of the forecasts against the test data reveals reduced variability, hinting at the possibility of overfitting. This indicates that, despite its accuracy in capturing the data’s inherent patterns, the model’s generalizability to unseen data might be limited. Addressing the overfitting concern and refining the model could improve its forecast accuracy.