For this assignment, I am working with data on home sales taken from the U.S. Census. This resource contains data on new single-family home sales in the U.S. for the years 2017-2022. The methodology in this report trains ETS, ARIMA, and Fourier models on training data for the years 2017-2020 and then forecasts home sales for the testing data set, which covers the 12 months of 2021.
A basic overview of the data can be seen in the first figures below. While home sales do exhibit a slight seasonal pattern, the general trend over the past five years has been positive irrespective of seasonality. The decomposition and seasonal plots highlight a significant spike in home sales beginning around June 2020 and persisting for about nine months, tapering off after March 2021. These outlying data points are crucial to model construction and selection, because the peak falls partly in the training data and partly in the testing data. Fitting models that can predict through this peak period will be critical.
The first model constructed is an ARIMA. The ARIMA function selected ARIMA(0,1,0)(1,0,0)[12] as the best-fit model: a first difference plus one seasonal autoregressive term at lag 12. Review of the residual plots suggests that this model is a good fit, with no autocorrelations exceeding the significance bounds and residuals distributed approximately normally around zero. However, we will aim to refine the fit of this ARIMA by generating a dynamic Fourier model.
## Series: Value
## Model: ARIMA(0,1,0)(1,0,0)[12]
##
## Coefficients:
##         sar1
##       0.4314
## s.e.  0.1486
##
## sigma^2 estimated as 38.89: log likelihood=-153.45
## AIC=310.89 AICc=311.17 BIC=314.59
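As a rough illustration of what ARIMA(0,1,0)(1,0,0)[12] implies, the month-to-month change is modeled as `sar1` times the change observed twelve months earlier, plus noise. The Python sketch below rolls that recursion forward to obtain point forecasts, using the `sar1` estimate reported above. It is not the code used to produce this report; the function name and the toy series are hypothetical.

```python
def forecast_sarima_010_100(history, phi, horizon, period=12):
    """Point forecasts for ARIMA(0,1,0)(1,0,0)[period] with no constant.

    Model: (y_t - y_{t-1}) = phi * (y_{t-period} - y_{t-period-1}) + e_t
    """
    if len(history) < period + 1:
        raise ValueError("need at least period + 1 observations")
    series = list(history)
    for _ in range(horizon):
        t = len(series)
        # Change observed one seasonal period earlier.
        seasonal_change = series[t - period] - series[t - period - 1]
        # Future errors have mean zero, so the point forecast drops e_t.
        series.append(series[t - 1] + phi * seasonal_change)
    return series[len(history):]


# Toy series (hypothetical values, not the Census data): 13 months rising by 1.
toy_history = [float(v) for v in range(40, 53)]
print(forecast_sarima_010_100(toy_history, phi=0.4314, horizon=3))
```

With `phi = 0` the recursion collapses to a random walk, whose forecast is simply the last observed value; the seasonal term only nudges that baseline by a fraction of last year's change.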
The results of the ETS model can be observed below. The selected model, ETS(M,N,N), is simple exponential smoothing with multiplicative errors and no trend or seasonal component. This forecast comes much closer to capturing the testing data values than the ARIMA model did. Exploration of the residual plot shows that residuals are distributed around zero and, while one lag falls outside the bounds of the ACF plot, a single spike is no cause to discredit the model on this basis. Assuming a 90% confidence cut-off for the ACF plot, the residual diagnostics seem to confirm that this is a good fit.
## Series: Value
## Model: ETS(M,N,N)
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l[0]
## 44.36728
##
## sigma^2: 0.0153
##
## AIC AICc BIC
## 374.8945 375.4400 380.5081
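An `alpha` this close to 1 means the level is reset to almost exactly the latest observation each month, so the ETS(M,N,N) point forecast is essentially a flat line at the last observed training value. The Python sketch below shows the level recursion, assuming the standard multiplicative-error form of the model. The function name and toy series are hypothetical, while `alpha` and `l[0]` are taken from the output above.

```python
def ets_mnn_level(history, alpha, initial_level):
    """Final level of an ETS(M,N,N) model after filtering `history`.

    The point forecast for every future horizon equals this final level.
    """
    level = initial_level
    for y in history:
        # Multiplicative error: y_t = level_{t-1} * (1 + e_t)
        relative_error = (y - level) / level
        # Level update: level_t = level_{t-1} * (1 + alpha * e_t)
        level = level * (1.0 + alpha * relative_error)
    return level


# Toy series (hypothetical values, not the Census data).
toy_history = [50.0, 55.0, 47.0, 60.0]
final_level = ets_mnn_level(toy_history, alpha=0.9999, initial_level=44.36728)
print(final_level)  # very close to the last observation, 60.0
```

This helps explain the ETS result on the test set: the forecast carries the elevated late-2020 level forward unchanged.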
A dynamic model is then created using Fourier terms as external regressors in an ARIMA model. Both a time series linear model and an ARIMA model produced similar results with the Fourier terms applied so, for this assignment, I elected to use ARIMA as the dynamic model. Although this dataset contains only monthly data, the Fourier terms make the ARIMA model slightly more accurate in its forecasting. It is important to note that both the last year of the training dataset and the first few months of the test year exhibit extreme outliers. Using the Fourier terms helps to offset some of the disruption that these outliers cause in the model. The results of the dynamic model can be seen below.
## Series: Value
## Model: LM w/ ARIMA(1,0,0) errors
##
## Coefficients:
##          ar1  trend(knots = yearmonth("2018 Jan"))trend
##       0.5791                                    -0.1958
## s.e.  0.1148                                     0.6030
##       trend(knots = yearmonth("2018 Jan"))trend_13  fourier(K = 1)C1_12
##                                             0.7965              -3.8978
## s.e.                                        0.7175               1.9561
##       fourier(K = 1)S1_12  intercept
##                    5.2101    51.2016
## s.e.               2.0291     6.0975
##
## sigma^2 estimated as 36.17: log likelihood=-151.23
## AIC=316.45 AICc=319.25 BIC=329.55
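With `K = 1` and a 12-month period, the Fourier regressors are a single cosine and sine pair, so the seasonal pattern is constrained to one smooth annual wave rather than twelve free monthly effects. The Python sketch below evaluates that wave using the `C1_12` and `S1_12` coefficients from the output above. The time index origin (which calendar month corresponds to `t = 1`) is an assumption, so the phase of the wave is illustrative only, and the function names are hypothetical.

```python
import math

C1_COEF = -3.8978  # fourier(K = 1)C1_12 from the model output
S1_COEF = 5.2101   # fourier(K = 1)S1_12 from the model output


def fourier_pair(t, period=12):
    """Return (cos, sin) of the first harmonic at time index t."""
    angle = 2.0 * math.pi * t / period
    return math.cos(angle), math.sin(angle)


def seasonal_effect(t, period=12):
    """Seasonal contribution of the K = 1 Fourier terms at time index t."""
    cos_term, sin_term = fourier_pair(t, period)
    return C1_COEF * cos_term + S1_COEF * sin_term


# Largest swing above or below the trend that these two coefficients allow.
amplitude = math.hypot(C1_COEF, S1_COEF)

for t in range(1, 13):
    print(t, round(seasonal_effect(t), 2))
print("amplitude:", round(amplitude, 2))
```

Because the wave has only two parameters, it cannot chase individual outlying months, which is one way to read the claim that the Fourier terms dampen the effect of the 2020 spike.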
Lastly, an ensemble model is created by averaging the forecasts of the three previous models with equal weights. The results are as follows. Exploration of the residual plot shows that residuals are distributed around zero and all lags fall within the bounds of the ACF plot.
## Series: Value
## Model: COMBINATION
## Combination: Value * 0.333333333333333
##
## ======================================
##
## Series: Value
## Model: COMBINATION
## Combination: Value + Value
##
## ==========================
##
## Series: Value
## Model: COMBINATION
## Combination: Value + Value
##
## ==========================
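The combination printed above amounts to summing the three component forecasts and multiplying by one third. The Python sketch below shows that equal-weight average; the function name and the forecast values are hypothetical placeholders, not the report's forecasts.

```python
def average_forecasts(*forecasts):
    """Equal-weight ensemble: element-wise mean of same-length forecast paths."""
    if not forecasts:
        raise ValueError("at least one forecast path is required")
    horizon = len(forecasts[0])
    if any(len(path) != horizon for path in forecasts):
        raise ValueError("all forecast paths must cover the same horizon")
    return [sum(values) / len(forecasts) for values in zip(*forecasts)]


# Hypothetical three-month forecasts from three models.
arima_fc = [60.0, 61.0, 62.0]
ets_fc = [66.0, 66.0, 66.0]
fourier_fc = [57.0, 59.0, 64.0]
ensemble_fc = average_forecasts(arima_fc, ets_fc, fourier_fc)
print(ensemble_fc)
```

Because the average is linear, the ensemble's mean error is the average of the components' mean errors. Using the ME values in the accuracy table at the end of this report, (-7.13 + 1.25 - 9.25) / 3 is approximately -5.04, which matches the ensemble's reported ME.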
When comparing these models, it is apparent that the ETS model is the most accurate predictor on the testing data, with the lowest RMSE (9.54), MAE (7.75), and MAPE (11.8) and the mean error closest to zero (1.25). The ensemble model ranks second on each of these measures (RMSE 11.8). The Fourier terms improved the ARIMA model only slightly (RMSE 14.0 versus 14.7), and the plain ARIMA still provides the worst forecast by RMSE, MAE, and MAPE. All four models show substantial lag-1 autocorrelation in their forecast errors (ACF1 between 0.61 and 0.70), so that statistic does not separate them; on the error measures, the ETS model provides the most accurate forecast.
## # A tibble: 4 × 7
##   .model            ME  RMSE   MAE     MPE  MAPE  ACF1
##   <chr>          <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
## 1 ETS(Value)      1.25  9.54  7.75  -0.109  11.8 0.658
## 2 Ensemble_Model -5.04 11.8  10.9  -10.3    17.8 0.664
## 3 Fourier_Model  -9.25 14.0  12.6  -16.8    21.0 0.609
## 4 ARIMA(Value)   -7.13 14.7  13.1  -14.1    21.7 0.703
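For reference, the columns in this table follow the usual forecast-accuracy definitions. Assuming the common convention that the error is actual minus forecast, a negative ME means the model over-forecast on average. The Python sketch below computes the same measures for a pair of hypothetical series (not the report's data); the function name is an assumption, and ACF1 is computed as the lag-1 autocorrelation of the errors.

```python
import math


def accuracy_measures(actual, forecast):
    """Return ME, RMSE, MAE, MPE, MAPE and ACF1 for one forecast path."""
    if len(actual) != len(forecast) or len(actual) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    errors = [a - f for a, f in zip(actual, forecast)]
    # Percentage errors are relative to the actual values (assumed non-zero).
    pct_errors = [100.0 * e / a for e, a in zip(errors, actual)]
    n = len(errors)
    mean_error = sum(errors) / n
    # Lag-1 autocorrelation of the errors around their mean.
    deviations = [e - mean_error for e in errors]
    denom = sum(d * d for d in deviations)
    numer = sum(deviations[i] * deviations[i - 1] for i in range(1, n))
    return {
        "ME": mean_error,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(p) for p in pct_errors) / n,
        "ACF1": numer / denom if denom else float("nan"),
    }


# Hypothetical values for illustration only.
actual = [100.0, 110.0, 120.0]
forecast = [90.0, 115.0, 120.0]
for name, value in accuracy_measures(actual, forecast).items():
    print(f"{name}: {value:.3f}")
```

RMSE penalizes large misses more heavily than MAE, so the gap between the two columns gives a rough sense of how much each model's error is driven by a few badly missed months.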