For this assignment, I forcasted Monthly Sales for Total Retail from the US Census Bureau. I trained 4 predictive models on data from 2018 to 2021 and tested the models on monthly 2022 data. The predictive models include an ARIMA model, ETS model, an ARIMA dynamic regression model, and an ensemble model averaging the first three models. For my dynamic regression model I used data on US inflation as an external regressor along with a fourier transformation.

Load Data

setwd("~/Documents/MSAE/Predictive Analytics")
retaildata <- read.csv("retaildata.csv")

Creating Tsibble

retaildata  <- retaildata  %>% 
  mutate(Month = yearmonth(Month)) %>% 
  tsibble(index = Month)

retaildata$Retail = as.numeric(retaildata$Retail)

Training Set (2018-2021) & Test Set (2022) with CPI lag

training <- retaildata %>% 
  filter(year(Month) < 2022 & year(Month) > 2017)

test <- retaildata %>% 
  filter(year(Month) == 2022)


#creating a 12 month lag in CPI for test set
test$CPI = retaildata$CPI[13:24]

Time-Series Plot

training %>% autoplot() + 
  labs(title = "Monthly Sales for Total Retail Sales, 2018-2021") +
  xlab("Month") +
  ylab("Sales ($ in Millions)") 
## Plot variable not specified, automatically selected `.vars = Retail`

training %>% autoplot(CPI) + 
  labs(title = "Consumer Price Index, 2018-2021") +
  xlab("Month") +
  ylab("CPI") 

Seasonal Plots

There is an observed seasonal trend in the data, with retail sales increasing over the course of the year and spiking in December around the holidays. You can see the effects of the pandemic in 2020, as sales dropped significantly in March and April when many businesses were closed and there was a lot of economic uncertainty. Sales quickly pick up again though and actually trend significantly higher in 2021. It would be interesting to compare to data of the service industry to see how retail sales may have been favored over in person services during this time.

training %>% gg_season(Retail) + 
  labs(title = "Seasonality of Retail Sales") +
  xlab("Month") +
  ylab("Sales ($ in Millions)") 

ARIMA

ARIMA_model <- training %>% 
  model(ARIMA(log(Retail)))

ARIMA_fcst <- ARIMA_model %>% forecast(test)
ARIMA_plot <- ARIMA_model %>% forecast(test) %>% autoplot(test) + 
  labs(title = "ARIMA model") +
  xlab("Month") +
  ylab("Sales ($ in Millions)") 
ARIMA_accuracy <- accuracy(ARIMA_fcst, test)

report(ARIMA_model)
## Series: Retail 
## Model: ARIMA(0,1,2)(0,1,0)[12] 
## Transformation: log(Retail) 
## 
## Coefficients:
##          ma1      ma2
##       0.0148  -0.6296
## s.e.  0.1535   0.1590
## 
## sigma^2 estimated as 0.003821:  log likelihood=48.6
## AIC=-91.19   AICc=-90.42   BIC=-86.52
ARIMA_model %>% gg_tsresiduals(lag_max = 12) + labs(title = "ARIMA model")

ARIMA_plot

ARIMA with regressors

Here I used CPI as an external regressor, ensuring the model used a lag of 12 months. Additionally, I used fourier terms to capture seasonal patterns. Including the fourier terms significantly reduces the bias of the model.

ARIMA_withCPI <- training %>% 
  model(ARIMA(Retail~lag(CPI, n=12)+fourier(K=2)))

ARIMA_fcstwithCPI <- ARIMA_withCPI %>% forecast(test)


ARIMA_plot2 <- ARIMA_withCPI %>% forecast(test) %>% autoplot(test) + 
  labs(title = "ARIMA model with regressors") +
  xlab("Month") +
  ylab("Sales ($ in Millions)") 
ARIMAwithCPI_accuracy <- accuracy(ARIMA_fcstwithCPI, test)

report(ARIMA_withCPI)
## Series: Retail 
## Model: LM w/ ARIMA(0,1,1) errors 
## 
## Coefficients:
##           ma1  lag(CPI, n = 12)  fourier(K = 2)C1_12  fourier(K = 2)S1_12
##       -0.8069         -15408.64           -10374.298            -6788.183
## s.e.   0.1502          27161.46             7573.704             7981.667
##       fourier(K = 2)C2_12  fourier(K = 2)S2_12  intercept
##                  400.1748           -21605.327   4073.025
## s.e.            7171.5965             7285.938   1302.032
## 
## sigma^2 estimated as 946050148:  log likelihood=-414.21
## AIC=844.43   AICc=848.21   BIC=859.23
ARIMA_withCPI %>% gg_tsresiduals(lag_max = 12) + labs(title = "ARIMA model with regressors")
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

ARIMA_plot2

ETS model

ETS_model <- training %>% 
  model(ETS(log(Retail)))


ETS_fcst <- ETS_model %>% forecast(test)
ETS_plot <- ETS_model %>% forecast(test) %>% autoplot(test) + 
  labs(title = "ETS model")+
  xlab("Month") +
  ylab("Sales ($ in Millions)") 
ETS_accuracy <- accuracy(ETS_fcst, test)


report(ETS_model)
## Series: Retail 
## Model: ETS(M,N,A) 
## Transformation: log(Retail) 
##   Smoothing parameters:
##     alpha = 0.5660889 
##     gamma = 0.0003171286 
## 
##   Initial states:
##      l[0]      s[0]      s[-1]     s[-2]       s[-3]      s[-4]      s[-5]
##  13.01298 0.1242254 0.02607795 0.0111748 -0.03110678 0.03582279 0.02661762
##       s[-6]      s[-7]       s[-8]      s[-9]     s[-10]      s[-11]
##  0.02823638 0.04986643 -0.04434762 0.01126362 -0.1410059 -0.09682472
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -93.69351 -78.69351 -65.62549
ETS_model %>% gg_tsresiduals(lag_max = 12) + labs(title="ETS Model")

ETS_plot 

Ensemble

## Series: Retail 
## Model: COMBINATION 
## Combination: Retail * 0.333333333333333
## 
## =======================================
## 
## Series: Retail 
## Model: COMBINATION 
## Combination: Retail + Retail
## 
## ============================
## 
## Series: Retail 
## Model: COMBINATION 
## Combination: Retail + Retail
## 
## ============================
## 
## Series: Retail 
## Model: ARIMA(0,1,0)(0,1,0)[12] 
## 
## sigma^2 estimated as 983304805:  log likelihood=-412.02
## AIC=826.05   AICc=826.17   BIC=827.6
## 
## Series: Retail 
## Model: ETS(M,A,A) 
##   Smoothing parameters:
##     alpha = 0.9066967 
##     beta  = 0.0001000256 
##     gamma = 0.0001000857 
## 
##   Initial states:
##      l[0]     b[0]     s[0]    s[-1]    s[-2]     s[-3]    s[-4]    s[-5]
##  432406.5 2177.801 59157.51 15349.39 4206.951 -15973.33 18558.59 13509.38
##     s[-6]    s[-7]     s[-8]    s[-9]    s[-10]    s[-11]
##  10894.68 20069.15 -19665.72 3282.002 -59704.73 -49683.87
## 
##   sigma^2:  0.002
## 
##      AIC     AICc      BIC 
## 1156.645 1177.045 1188.455 
## 
## 
## Series: Retail 
## Model: LM w/ ARIMA(0,1,1) errors 
## 
## Coefficients:
##           ma1  lag(CPI, n = 12)  fourier(K = 2)C1_12  fourier(K = 2)S1_12
##       -0.8069         -15408.64           -10374.298            -6788.183
## s.e.   0.1502          27161.46             7573.704             7981.667
##       fourier(K = 2)C2_12  fourier(K = 2)S2_12  intercept
##                  400.1748           -21605.327   4073.025
## s.e.            7171.5965             7285.938   1302.032
## 
## sigma^2 estimated as 946050148:  log likelihood=-414.21
## AIC=844.43   AICc=848.21   BIC=859.23

#Plot

fit <- training %>% 
  model('ARIMA' = ARIMA(log(Retail)),
        'ARIMA with regressors' = ARIMA(Retail~lag(CPI, n=12)+fourier(K=2)),
        'ETS' = ETS(log(Retail)),
        'Ensemble' = (ARIMA(Retail)+ETS(Retail) +ARIMA(Retail~lag(CPI, n=12)))/3
        )

fcst <- fit %>% forecast(test)

fcst %>% 
  autoplot(test, level = NULL) +
  autolayer(test) +
  labs(title = "Model Forecasts of 2022 Sales") +
  ylab("Sales ($ in Millions)") 
## Plot variable not specified, automatically selected `.vars = Retail`

Accuracy Metrics and Discussion

accuracy_metrics = rbind("ETS" = ETS_accuracy, 
                         "ARIMA" = ARIMA_accuracy,
                         "ARIMA with regressors" = ARIMAwithCPI_accuracy, 
                         "Ensemble" = ensemble_accuracy
                         )
accuracy_metrics <- accuracy_metrics %>% arrange(RMSE)
accuracy_metrics
## # A tibble: 4 × 10
##   .model              .type      ME   RMSE    MAE    MPE  MAPE  MASE RMSSE  ACF1
## * <chr>               <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "(ARIMA(Retail) + … Test  -3.99e3 12022. 10111. -0.771  1.78   NaN   NaN 0.243
## 2 "ARIMA(Retail ~ la… Test  -3.16e1 29578. 23616. -0.318  4.13   NaN   NaN 0.107
## 3 "ARIMA(log(Retail)… Test  -2.96e4 35527. 31378. -4.82   5.17   NaN   NaN 0.388
## 4 "ETS(log(Retail))"  Test   3.46e4 36524. 34579.  5.82   5.82   NaN   NaN 0.104

The ensemble model performs best with regard to variability, with the smallest RMSE of 12,021. Its bias measurements (ME and MAE) are also better than both the ARIMA and ETS models, however overall, the ARIMA with CPI as a regressor has the significantly bias, with a mean error of -32. The dynamic regression model does not perform as well regarding variability however with a RMSE of 29,578. Using CPI as a regressor and a fourier transformation smooths out the forecast more than the other models, however it does not account for seasonal fluctuations as well as the other models. Due to the different strengths of each model, the ensemble model averages out errors in predictions across the three models.