ARIMA Modeling

Supply Chain HW2

library(forecast)
install.packages('fpp')

## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror

library(fpp)
library(tseries)


data <- read.csv("USRTI.csv")
head(data)

##    Month   US.RTI
## 1 1/1/55 36.26374
## 2 1/1/56 36.86062
## 3 1/1/57 37.59969
## 4 1/1/58 34.84664
## 5 1/1/59 36.98137
## 6 1/1/60 37.07525

series <- ts(data, start=1955, frequency=1)[,2]
plot.ts(series, main="US Annual Retail Trade Index")

plot of chunk unnamed-chunk-1

1. (5 pts.) Run the time-series diagnostics on the above time series and difference it as many times as necessary until you can conclude the differenced time-series is stationary. Report the diagnostics and explain your conclusion.

#number of differences necessary to make data stationary
ndiffs(series)

## [1] 1

diff_series <- diff(series)
tsdisplay(diff_series)

plot of chunk unnamed-chunk-2

adf.test(diff_series)

## Warning in adf.test(diff_series): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff_series
## Dickey-Fuller = -4.5898, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary

We can conclude that the time series is stationary after differencing it 1 time because the ACF drops to zero relatively quickly. Likewise, the Augmented Dickey-Fuller Test indicates that we can reject the null hypothesis that the data is non-stationary (p-value=0.01).

2. (5 pts.) Based on the ACF and PACF of the stationary series justify the selection of an ARIMA model with the property that 𝑝𝑞 = 0 that it either 𝑝 = 0 or 𝑞 = 0. Plot and discuss the diagnostics of the residuals of your selected model.

#autoregressive model AR(p)
#moving average model MA(q)
model1 <- Arima(diff_series, order=c(2,1,0))
summary(model1)

## Series: diff_series 
## ARIMA(2,1,0)                    
## 
## Coefficients:
##           ar1      ar2
##       -0.1592  -0.3047
## s.e.   0.1243   0.1241
## 
## sigma^2 estimated as 6.795:  log likelihood=-137.97
## AIC=281.95   AICc=282.39   BIC=288.13
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.05224897 2.584624 1.864245 -31.53451 183.3555 0.9450365
##                     ACF1
## Training set -0.06528859

tsdiag(model1)

plot of chunk unnamed-chunk-3

The following model Arima(p=2, d=1, q=0) sets p equal to two because the PACF shows significant autocorrelations up until lag 2, while the ACF (q=0) dies down after merely 1 lag. Additionally, the above plot indicates that the residuals do not exhibit any systematic patterns. Thus, the model meets the statistical assumption of i.i.d. observations.

3. (5 pts.) Obtain the value of the best lambda for the Box-Cox transformation of the time series y. Transform the series and fit the following three models:

a. ARIMA(0,1,1) b. ARIMA(2,1,0) c. ARIMA(1,1,0)

For each of the three models, comment on the estimates of the model parameters. Which model would you select? Why?

#find optimal value of lambda
lambda <- BoxCox.lambda(series)
lambda

## [1] 0.562376

#transform the series
transformed_data <- BoxCox(series, lambda)


arima1 <- Arima(transformed_data, order=c(0,1,1))
summary(arima1)

## Series: transformed_data 
## ARIMA(0,1,1)                    
## 
## Coefficients:
##          ma1
##       0.5002
## s.e.  0.1148
## 
## sigma^2 estimated as 0.1341:  log likelihood=-24.58
## AIC=53.17   AICc=53.38   BIC=57.32
## 
## Training set error measures:
##                    ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.134036 0.3630971 0.2903422 0.7790161 1.745476 0.8134532
##                    ACF1
## Training set -0.1067373

arima2 <- Arima(transformed_data, order=c(2,1,0))
summary(arima2)

## Series: transformed_data 
## ARIMA(2,1,0)                    
## 
## Coefficients:
##          ar1      ar2
##       0.5141  -0.1023
## s.e.  0.1284   0.1285
## 
## sigma^2 estimated as 0.1333:  log likelihood=-24.41
## AIC=54.82   AICc=55.25   BIC=61.05
## 
## Training set error measures:
##                     ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.1186693 0.3620784 0.2790365 0.6913397 1.676919 0.7817781
##                   ACF1
## Training set -0.103203

arima3 <- Arima(transformed_data, order=c(1,1,0))
summary(arima3)

## Series: transformed_data 
## ARIMA(1,1,0)                    
## 
## Coefficients:
##          ar1
##       0.4661
## s.e.  0.1139
## 
## sigma^2 estimated as 0.1348:  log likelihood=-24.72
## AIC=53.45   AICc=53.66   BIC=57.6
## 
## Training set error measures:
##                     ME      RMSE       MAE       MPE    MAPE      MASE
## Training set 0.1086841 0.3640817 0.2768466 0.6342619 1.65597 0.7756426
##                     ACF1
## Training set -0.03599984

Model 1 ARIMA(0, 1, 1) performs objectively the best in terms of AICc. However, Model 3 ARIMA(1, 1, 0) performs only marginally worse in terms of AICc. Additionally, model 1 has slightly more bias than model 3 based on the ME scores. It's worth noting that the only difference between these models is whether p is set to zero or q is set to zero. Also, the model we fit in question 2, ARIMA(2, 1, 0) performs the worst out of the three based on AICc. Based on the results, I would select model 1.

4. (5 pts.) Use the auto.arima(…) function with the value of the lambda selected for Question 3 above to obtain a non-seasonal ARIMA model. How does this model compare with the one you selected in Question 3.

auto_selection <- auto.arima(series, lambda=lambda)
summary(auto_selection)

## Series: series 
## ARIMA(0,1,1) with drift         
## Box Cox transformation: lambda= 0.562376 
## 
## Coefficients:
##          ma1   drift
##       0.4043  0.2016
## s.e.  0.1297  0.0622
## 
## sigma^2 estimated as 0.117:  log likelihood=-19.66
## AIC=45.33   AICc=45.76   BIC=51.56
## 
## Training set error measures:
##                       ME     RMSE      MAE         MPE     MAPE      MASE
## Training set 0.006936779 2.185831 1.538405 -0.04691915 2.351403 0.6680674
##                   ACF1
## Training set 0.0575882

This model ARIMA(0, 1, 1) with drift outperforms the model I selected in question 3 on all measures. Most importantly, the AICc score improves to 45.76 over the previous 53.38. The only difference between these two models is the inclusion of drift.

5. (10 pts.) Write a representation the model obtained by the auto.arima(…) function using B- transform notation, then obtain explicitly its statistical model, 𝑦𝑡 = ⋯ and its forecasting model 𝑦̂𝑇+1|𝑇 = ⋯

#not sure what "obtain explicity its statistical model" means
#below is the code that would be used to make forecasts based on the model in Q4

#point forecasts
forecast(auto_selection, h=1)

##      Point Forecast    Lo 80    Hi 80    Lo 95   Hi 95
## 2015       114.5164 111.0499 118.0295 109.2337 119.908

#summary of forecasting model
auto_forecast <- forecast(auto_selection, h=1)

6. (10 pts) For the model in Question 4, you can use the fitted(…) function to obtain the fitted values obtained by the model, then use your explicit model obtained in Question 5 to forecast the mean value of the US Retail Index for 2015 through 2017. Please explain step by step your calculations for each of the three forecasts, and then compare the values you obtained with those reported by the forecast(…) function (they should be the same).

Hint: The values reported by the fitted(…) function are in the original scale. To calculate your forecast, you must work with the BoxCox transformed values, and then after you have obtained the forecast, you can use the InvBoxCox(…) function to express your forecasts in the original scale.

fitted <- fitted(auto_selection)
head(fitted)

## [1] 36.20881 37.21232 37.70962 38.54010 34.34537 39.06780

fitted_trans <- BoxCox(fitted, lambda)
head(fitted_trans)

## [1] 11.60666 11.81403 11.91589 12.08468 11.21481 12.19111

#error = tail of transformed df vs. fitted transformed
error = tail(transformed_data, n =1 )-tail(fitted_trans, n =1 )
error

## [1] 0.05887823

#0.2016=drift
#"explicit model" from Q5 to forecast the mean value of the US Retail Index for 2015 through 2017
forecast_2015 = 0.2016 + 0.4043* error + (tail(transformed_data, n =1))

forecast_2016= 0.2016 + 0 + 0.4043* 0 + forecast_2015

forecast_2017 = 0.2016 + forecast_2016

#express forecasts in the original scale
InvBoxCox(forecast_2015,lambda)

## [1] 114.5161

InvBoxCox(forecast_2016,lambda)

## [1] 116.1262

InvBoxCox(forecast_2017,lambda)

## [1] 117.746

#compare: were the explicit models correct?
forecast(auto_selection, h=3)

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2015       114.5164 111.0499 118.0295 109.2337 119.9080
## 2016       116.1267 110.1429 122.2487 107.0314 125.5451
## 2017       117.7469 110.0073 125.7158 106.0037 130.0265

Step by step your calculations for each of the three forecasts:

Fit the model obtained in question 4 using the fitted() function
Transform the data using a BoxCox transformation
Define the error (for the hand calculated equation below) as the difference between the last value from the fitted model from the last value of the original data set
Define the explicit model obtained in Q5 in order to calculate sales in 2015, including error. Repeat the process for 2016 and 2017 and set the error to zero, since we do not have enough data to define error terms for those equations.
Use the InvBoxCox(…) function to express the forecasts in the original scale and compare the results to our results from the forecasts for 2015, 2016, and 2017.
Finally, use the forecast() function on the model from question 4 and compare.

In doing so, we find that the forecasts generated by our “explicit” models are correct as they match the forecasts obtained using the forecast() function.

7. (5 pts.) The dataset hsales2 in fpp includes the monthly sales of new single-family homes in the US for the period from Jan. 1987 through Nov. 1995. Use the tsdisplay(…) function to view the dynamic behavior of the data. What do these plots tell you about the seasonality and stationarity of the data?

head(hsales2)

## [1] 53 59 73 72 62 58

tsdisplay(hsales2)

plot of chunk unnamed-chunk-8

#stationary?
adf.test(hsales2)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  hsales2
## Dickey-Fuller = -3.2778, Lag order = 4, p-value = 0.07857
## alternative hypothesis: stationary

The ADF test concludes that this data is not stationary with a p-value of approximately .07. The above plots show that both the PACF and ACF are slow to die down, which supports the conclusion of the data not being stationary. Additionally, the time series plot depicts a clear seasonal trend.

8. (10 pts.) Compute the 12-period difference of the dataset hsales2 and store the differenced time-series in h.D12 and examine it.

a. Does the differenced data appears to be stationary? Justify your answer.

b. What kind of model does the 1-lag autocorrelation in the PACF plot suggest?

c. What type of seasonal model do the ACF and PACF of h.D12 suggest?

d. Compute the model you suggested in part (b) above.

e. Run the auto.arima(…) function on the hsales2 time-series and compare the resulting model with the one you recommend in (b) in terms of AICc and the log-likelihood of the fit.

h.D12 <- diff(hsales2, lag= 12)
tsdisplay(h.D12)

plot of chunk unnamed-chunk-9

adf.test(h.D12)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  h.D12
## Dickey-Fuller = -2.2569, Lag order = 4, p-value = 0.4703
## alternative hypothesis: stationary

A. The data does not appear to be stationary, which it is not, as confirmed by a high p-value (approximately .47) on an ADF test.

B. The 1-lag autocorrelation in the PACF plot suggests that the non-seasonal component of the model (p) should equal 1. This corresponds to the auto-regressive (AR) portion of the model.

C. The ACF and PACF results indicate that a seasonal ARIMA(1, 1, 0)12 would be most appropriate.

D. Compute the model you suggested in part (b) above.

Combining parts b and c above, our model is the following: ARIMA(1,0,0)(1,1,0)12

order_seasonal <- Arima(h.D12, order = c(1,0,0), seasonal = c(1,1,0))
summary(order_seasonal)

## Series: h.D12 
## ARIMA(1,0,0)(1,1,0)[12]                    
## 
## Coefficients:
##          ar1     sar1
##       0.7435  -0.6596
## s.e.  0.0743   0.0892
## 
## sigma^2 estimated as 50.48:  log likelihood=-284.32
## AIC=574.65   AICc=574.95   BIC=581.91
## 
## Training set error measures:
##                       ME     RMSE      MAE MPE MAPE      MASE        ACF1
## Training set -0.02041005 6.640804 5.041526 NaN  Inf 0.5243693 -0.07831108

#auto arima
auto.arima(hsales2)

## Series: hsales2 
## ARIMA(1,0,0)(2,0,0)[12] with non-zero mean 
## 
## Coefficients:
##          ar1    sar1    sar2  intercept
##       0.7767  0.4248  0.2832    53.6744
## s.e.  0.0607  0.0946  0.1142     5.0468
## 
## sigma^2 estimated as 21.01:  log likelihood=-318.82
## AIC=647.65   AICc=648.24   BIC=661.01

E. Run the auto.arima(…) function on the hsales2 time-series and compare the resulting model with the one you recommend in (b) in terms of AICc and the log-likelihood of the fit.

The model obtained in part D performs far better than the model generated by the auto arima function, as indicated by the difference in AICc values.

9. (5 pts.) Use the forecast() function and the best model you obtained in question 8 to prepare a plot of the two-year forecast (h=24) of the mean, 80% and 95% confidence intervals of the sales of single-family new-houses in the US

plotf <- forecast(order_seasonal, h=24)
plot(plotf, xlab="Time", ylab="Sales")

plot of chunk unnamed-chunk-11

plotf

##          Point Forecast      Lo 80     Hi 80     Lo 95    Hi 95
## Dec 1995   -1.921145062 -11.026136  7.183846 -15.84603 12.00374
## Jan 1996   -1.717977261 -13.063728  9.627773 -19.06981 15.63385
## Feb 1996   -0.741071303 -13.153038 11.670895 -19.72354 18.24139
## Mar 1996    2.778779556 -10.184980 15.742539 -17.04758 22.60514
## Apr 1996   -4.299197866 -17.558121  8.959726 -24.57697 15.97858
## May 1996    3.002078025 -10.417218 16.421374 -17.52096 23.52512
## Jun 1996   -0.269842753 -13.776970 13.237285 -20.92721 20.38753
## Jul 1996    1.589140579 -11.966294 15.144575 -19.14211 22.32039
## Aug 1996    2.296702551 -11.285361 15.878766 -18.47527 23.06868
## Sep 1996   -1.924020462 -15.520781 11.672740 -22.71847 18.87043
## Oct 1996   -0.573817457 -14.178695 13.031060 -21.38068 20.23305
## Nov 1996   -5.775177014 -19.384540  7.834186 -26.58890 15.03855
## Dec 1996   -8.027087716 -22.044827  5.990651 -29.46537 13.41119
## Jan 1997    0.647224783 -13.591227 14.885677 -21.12861 22.42306
## Feb 1997   -7.572930989 -21.931931  6.786069 -29.53313 14.38726
## Mar 1997   -8.337062455 -22.762265  6.088140 -30.39851 13.72438
## Apr 1997   -6.116557475 -20.578225  8.345111 -28.23377 16.00066
## May 1997   -0.324050545 -14.805836 14.157735 -22.47203 21.82393
## Jun 1997    5.824907111  -8.667987 20.317801 -16.34006 27.98988
## Jul 1997    8.441611845  -6.057419 22.940642 -13.73274 30.61597
## Aug 1997    3.409274656 -11.093147 17.911696 -18.77027 25.58881
## Sep 1997   -0.003424947 -14.507721 14.500871 -22.18583 22.17898
## Oct 1997   -2.180231621 -16.685563 12.325100 -24.36422 20.00376
## Nov 1997   -2.629879092 -17.135783 11.876025 -24.81475 19.55499

10. (5 pts.) Consider an 𝐴𝑅𝐼𝑀𝐴(1,0,0)(0,1,1)12 model for the forecasting of the hsales2 time- series.

a. Write a symbolic expression for the model using the back-shift operator 𝐵

b. Suppose 𝑇 is the last period in the time-series, write an explicit equation for the forecast for 𝑇 +1 calculated at 𝑇

11. (5 pts.) Fit the 𝐴𝑅𝐼𝑀𝐴(1,1,0)(0,1,1)12 model to the hsales2 time-series using the Arima(…) function in R and answer the following questions. What are the values of 𝜙1, Θ1, 𝑒𝑇−1 and 𝑒𝑇 in your forecasting model?

hsales_mod <- Arima(hsales2, order = c(1,1,0), seasonal = c(0,1,1))
summary(hsales_mod)

## Series: hsales2 
## ARIMA(1,1,0)(0,1,1)[12]                    
## 
## Coefficients:
##           ar1     sma1
##       -0.1606  -1.0000
## s.e.   0.1016   0.1604
## 
## sigma^2 estimated as 15.63:  log likelihood=-275.67
## AIC=557.34   AICc=557.61   BIC=564.97
## 
## Training set error measures:
##                     ME    RMSE      MAE       MPE     MAPE      MASE
## Training set 0.2162045 3.70557 2.800947 0.1969829 5.509797 0.4340782
##                     ACF1
## Training set -0.03801325

tail(fitted(hsales_mod), n = 2) - tail(hsales2, n =2)

## [1] 2.070578 3.466128

ϕ1= 0.0908,

Θ_1= 0,

e(T-1) = 2.07,

e_T= 3.46