For this assignment, I forcasted Monthly Sales for Total Retail from the US Census Bureau. I trained 4 predictive models on data from 2018 to 2021 and tested the models on monthly 2022 data. The predictive models include an ARIMA model, ETS model, an ARIMA dynamic regression model, and an ensemble model averaging the first three models. For my dynamic regression model I used data on US inflation as an external regressor along with a fourier transformation.
setwd("~/Documents/MSAE/Predictive Analytics")
retaildata <- read.csv("retaildata.csv")
retaildata <- retaildata %>%
mutate(Month = yearmonth(Month)) %>%
tsibble(index = Month)
retaildata$Retail = as.numeric(retaildata$Retail)
training <- retaildata %>%
filter(year(Month) < 2022 & year(Month) > 2017)
test <- retaildata %>%
filter(year(Month) == 2022)
#creating a 12 month lag in CPI for test set
test$CPI = retaildata$CPI[13:24]
training %>% autoplot() +
labs(title = "Monthly Sales for Total Retail Sales, 2018-2021") +
xlab("Month") +
ylab("Sales ($ in Millions)")
## Plot variable not specified, automatically selected `.vars = Retail`
training %>% autoplot(CPI) +
labs(title = "Consumer Price Index, 2018-2021") +
xlab("Month") +
ylab("CPI")
There is an observed seasonal trend in the data, with retail sales increasing over the course of the year and spiking in December around the holidays. You can see the effects of the pandemic in 2020, as sales dropped significantly in March and April when many businesses were closed and there was a lot of economic uncertainty. Sales quickly pick up again though and actually trend significantly higher in 2021. It would be interesting to compare to data of the service industry to see how retail sales may have been favored over in person services during this time.
training %>% gg_season(Retail) +
labs(title = "Seasonality of Retail Sales") +
xlab("Month") +
ylab("Sales ($ in Millions)")
ARIMA_model <- training %>%
model(ARIMA(log(Retail)))
ARIMA_fcst <- ARIMA_model %>% forecast(test)
ARIMA_plot <- ARIMA_model %>% forecast(test) %>% autoplot(test) +
labs(title = "ARIMA model") +
xlab("Month") +
ylab("Sales ($ in Millions)")
ARIMA_accuracy <- accuracy(ARIMA_fcst, test)
report(ARIMA_model)
## Series: Retail
## Model: ARIMA(0,1,2)(0,1,0)[12]
## Transformation: log(Retail)
##
## Coefficients:
## ma1 ma2
## 0.0148 -0.6296
## s.e. 0.1535 0.1590
##
## sigma^2 estimated as 0.003821: log likelihood=48.6
## AIC=-91.19 AICc=-90.42 BIC=-86.52
ARIMA_model %>% gg_tsresiduals(lag_max = 12) + labs(title = "ARIMA model")
ARIMA_plot
Here I used CPI as an external regressor, ensuring the model used a lag of 12 months. Additionally, I used fourier terms to capture seasonal patterns. Including the fourier terms significantly reduces the bias of the model.
ARIMA_withCPI <- training %>%
model(ARIMA(Retail~lag(CPI, n=12)+fourier(K=2)))
ARIMA_fcstwithCPI <- ARIMA_withCPI %>% forecast(test)
ARIMA_plot2 <- ARIMA_withCPI %>% forecast(test) %>% autoplot(test) +
labs(title = "ARIMA model with regressors") +
xlab("Month") +
ylab("Sales ($ in Millions)")
ARIMAwithCPI_accuracy <- accuracy(ARIMA_fcstwithCPI, test)
report(ARIMA_withCPI)
## Series: Retail
## Model: LM w/ ARIMA(0,1,1) errors
##
## Coefficients:
## ma1 lag(CPI, n = 12) fourier(K = 2)C1_12 fourier(K = 2)S1_12
## -0.8069 -15408.64 -10374.298 -6788.183
## s.e. 0.1502 27161.46 7573.704 7981.667
## fourier(K = 2)C2_12 fourier(K = 2)S2_12 intercept
## 400.1748 -21605.327 4073.025
## s.e. 7171.5965 7285.938 1302.032
##
## sigma^2 estimated as 946050148: log likelihood=-414.21
## AIC=844.43 AICc=848.21 BIC=859.23
ARIMA_withCPI %>% gg_tsresiduals(lag_max = 12) + labs(title = "ARIMA model with regressors")
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
ARIMA_plot2
ETS_model <- training %>%
model(ETS(log(Retail)))
ETS_fcst <- ETS_model %>% forecast(test)
ETS_plot <- ETS_model %>% forecast(test) %>% autoplot(test) +
labs(title = "ETS model")+
xlab("Month") +
ylab("Sales ($ in Millions)")
ETS_accuracy <- accuracy(ETS_fcst, test)
report(ETS_model)
## Series: Retail
## Model: ETS(M,N,A)
## Transformation: log(Retail)
## Smoothing parameters:
## alpha = 0.5660889
## gamma = 0.0003171286
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 13.01298 0.1242254 0.02607795 0.0111748 -0.03110678 0.03582279 0.02661762
## s[-6] s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.02823638 0.04986643 -0.04434762 0.01126362 -0.1410059 -0.09682472
##
## sigma^2: 0
##
## AIC AICc BIC
## -93.69351 -78.69351 -65.62549
ETS_model %>% gg_tsresiduals(lag_max = 12) + labs(title="ETS Model")
ETS_plot
## Series: Retail
## Model: COMBINATION
## Combination: Retail * 0.333333333333333
##
## =======================================
##
## Series: Retail
## Model: COMBINATION
## Combination: Retail + Retail
##
## ============================
##
## Series: Retail
## Model: COMBINATION
## Combination: Retail + Retail
##
## ============================
##
## Series: Retail
## Model: ARIMA(0,1,0)(0,1,0)[12]
##
## sigma^2 estimated as 983304805: log likelihood=-412.02
## AIC=826.05 AICc=826.17 BIC=827.6
##
## Series: Retail
## Model: ETS(M,A,A)
## Smoothing parameters:
## alpha = 0.9066967
## beta = 0.0001000256
## gamma = 0.0001000857
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 432406.5 2177.801 59157.51 15349.39 4206.951 -15973.33 18558.59 13509.38
## s[-6] s[-7] s[-8] s[-9] s[-10] s[-11]
## 10894.68 20069.15 -19665.72 3282.002 -59704.73 -49683.87
##
## sigma^2: 0.002
##
## AIC AICc BIC
## 1156.645 1177.045 1188.455
##
##
## Series: Retail
## Model: LM w/ ARIMA(0,1,1) errors
##
## Coefficients:
## ma1 lag(CPI, n = 12) fourier(K = 2)C1_12 fourier(K = 2)S1_12
## -0.8069 -15408.64 -10374.298 -6788.183
## s.e. 0.1502 27161.46 7573.704 7981.667
## fourier(K = 2)C2_12 fourier(K = 2)S2_12 intercept
## 400.1748 -21605.327 4073.025
## s.e. 7171.5965 7285.938 1302.032
##
## sigma^2 estimated as 946050148: log likelihood=-414.21
## AIC=844.43 AICc=848.21 BIC=859.23
#Plot
fit <- training %>%
model('ARIMA' = ARIMA(log(Retail)),
'ARIMA with regressors' = ARIMA(Retail~lag(CPI, n=12)+fourier(K=2)),
'ETS' = ETS(log(Retail)),
'Ensemble' = (ARIMA(Retail)+ETS(Retail) +ARIMA(Retail~lag(CPI, n=12)))/3
)
fcst <- fit %>% forecast(test)
fcst %>%
autoplot(test, level = NULL) +
autolayer(test) +
labs(title = "Model Forecasts of 2022 Sales") +
ylab("Sales ($ in Millions)")
## Plot variable not specified, automatically selected `.vars = Retail`
accuracy_metrics = rbind("ETS" = ETS_accuracy,
"ARIMA" = ARIMA_accuracy,
"ARIMA with regressors" = ARIMAwithCPI_accuracy,
"Ensemble" = ensemble_accuracy
)
accuracy_metrics <- accuracy_metrics %>% arrange(RMSE)
accuracy_metrics
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "(ARIMA(Retail) + … Test -3.99e3 12022. 10111. -0.771 1.78 NaN NaN 0.243
## 2 "ARIMA(Retail ~ la… Test -3.16e1 29578. 23616. -0.318 4.13 NaN NaN 0.107
## 3 "ARIMA(log(Retail)… Test -2.96e4 35527. 31378. -4.82 5.17 NaN NaN 0.388
## 4 "ETS(log(Retail))" Test 3.46e4 36524. 34579. 5.82 5.82 NaN NaN 0.104
The ensemble model performs best with regard to variability, with the smallest RMSE of 12,021. Its bias measurements (ME and MAE) are also better than both the ARIMA and ETS models, however overall, the ARIMA with CPI as a regressor has the significantly bias, with a mean error of -32. The dynamic regression model does not perform as well regarding variability however with a RMSE of 29,578. Using CPI as a regressor and a fourier transformation smooths out the forecast more than the other models, however it does not account for seasonal fluctuations as well as the other models. Due to the different strengths of each model, the ensemble model averages out errors in predictions across the three models.