DATA624: Homework 6
library(fpp3)
library(ggpubr)
Task
Do exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, and 9.8 in Hyndman. Please submit both the RPubs link and your .rmd file.
Exercises
9.1
Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
Explain the differences among these figures. Do they all indicate that the data are white noise?
Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?
Part A
The difference among these figures is the sample size: the critical-value bounds move closer to zero as the sample size increases. Each figure indicates that the data are white noise, since the autocorrelations all remain within (or very close to) the bounds of the ACF plot.
Part B
The critical values are \(\pm1.96 / \sqrt{T}\), where \(T\) is the length of the time series. A larger sample size means a longer series, which in turn shrinks the critical values' distance from zero. The autocorrelations differ in each figure because each is computed from a different set of random numbers; for white noise the sample autocorrelations are themselves random, and they shrink toward zero as \(T\) grows.
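As a quick check of this formula, the bounds for the three sample sizes can be computed directly (simple arithmetic, not part of the original assignment output):
# 95% critical values for a white-noise ACF at each sample size:
# 1.96/sqrt(36) ~ 0.327, 1.96/sqrt(360) ~ 0.103, 1.96/sqrt(1000) ~ 0.062
round(1.96 / sqrt(c(36, 360, 1000)), 3)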
9.2
A classic example of a non-stationary series is stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
q2 <- gafa_stock %>%
  filter(Symbol == "AMZN")

gg_tsdisplay(q2, y = Close, plot_type = "partial") +
  ggtitle("Amazon Closing Price, ACF, and PACF")
## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
print("Unit Root Test Results:")
## [1] "Unit Root Test Results:"
q2 %>%
  features(Close, unitroot_kpss)
## # A tibble: 1 x 3
## Symbol kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 AMZN 14.0 0.01
Amazon Closing price data is non-stationary and should be differenced because:
- In the daily stock prices plot, the data is clearly trended in a positive direction.
- The ACF decays extremely slowly, decreasing only slightly from one lag to the next; for a stationary series the ACF drops to near zero quickly.
- The PACF has a first-lag spike close to 1, indicating that the data are non-stationary.
- The unit root (KPSS) test indicates that differencing is required: the p-value of 0.01 is below 0.05, so we reject the null hypothesis that the series is stationary.
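As a hedged follow-up (not in the original output), one more check is to apply the KPSS test to the differenced closing prices; if a single difference is enough, the p-value should come back above 0.05.
# KPSS test on the first-differenced series; a small statistic supports stationarity
q2 %>%
  features(difference(Close), unitroot_kpss)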
9.3
For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.
- Turkish GDP from global_economy.
- Accommodation takings in the state of Tasmania from aus_accommodation.
- Monthly sales from souvenirs.
Part A
The appropriate Box-Cox transformation for the Turkish GDP uses a Guerrero-selected lambda of 0.1572. Only one order of differencing is then needed to obtain stationary data.
q3a <- global_economy %>%
  filter(Country == "Turkey")

lambda <- q3a %>%
  features(GDP, features = guerrero) %>%
  pull(lambda_guerrero)

paste0("Lambda value: ", round(lambda, 4))
## [1] "Lambda value: 0.1572"
q3a %>%
  features(box_cox(GDP, lambda), unitroot_ndiffs)
## # A tibble: 1 x 2
## Country ndiffs
## <fct> <int>
## 1 Turkey 1
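As a hedged visual check (not part of the original output), the transformed series can be plotted after one difference; a stationary result should fluctuate around a constant mean with no trend.
# Plot the Box-Cox transformed GDP after a single difference
q3a %>%
  autoplot(difference(box_cox(GDP, lambda))) +
  ggtitle("Turkish GDP: Box-Cox Transformed, First Difference")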
Part B
The appropriate Box-Cox transformation for the accommodation takings in Tasmania uses a lambda of 0.0018, which is effectively a log transformation. Only one order of differencing is needed to obtain stationary data.
q3b <- aus_accommodation %>%
  filter(State == "Tasmania")

lambda <- q3b %>%
  features(Takings, features = guerrero) %>%
  pull(lambda_guerrero)

paste0("Lambda value: ", round(lambda, 4))
## [1] "Lambda value: 0.0018"
q3b %>%
  features(box_cox(Takings, lambda), unitroot_ndiffs)
## # A tibble: 1 x 2
## State ndiffs
## <chr> <int>
## 1 Tasmania 1
Part C
The appropriate Box-Cox transformation for the monthly sales of souvenirs uses a lambda of 0.0021, again effectively a log transformation. Only one order of differencing is needed to obtain stationary data.
q3c <- souvenirs

lambda <- q3c %>%
  features(Sales, features = guerrero) %>%
  pull(lambda_guerrero)

paste0("Lambda value: ", round(lambda, 4))
q3c %>%
  features(box_cox(Sales, lambda), unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 1
9.5
For your retail data (from Exercise 8 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.
The retail data need more than one difference to achieve stationarity: one seasonal difference and one first difference. After differencing twice, the KPSS test statistic is small (0.02) and the p-value is larger than 0.05, so we fail to reject the null hypothesis that the series is stationary. It should be noted that a fair amount of autocorrelation and partial autocorrelation still remains.
set.seed(15)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
myseries %>%
  gg_tsdisplay(Turnover, plot_type = "partial") +
  ggtitle("Original Data")
myseries %>%
  features(Turnover, unitroot_ndiffs)
## # A tibble: 1 x 3
## State Industry ndiffs
## <chr> <chr> <int>
## 1 New South Wales Furniture, floor coverings, houseware and textile good~ 1
myseries %>%
  features(Turnover, unitroot_nsdiffs)
## # A tibble: 1 x 3
## State Industry nsdiffs
## <chr> <chr> <int>
## 1 New South Wales Furniture, floor coverings, houseware and textile goo~ 1
myseries %>%
  features(difference(difference(Turnover, lag = 12)), unitroot_kpss)
## # A tibble: 1 x 4
## State Industry kpss_stat kpss_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 New South Wales Furniture, floor coverings, houseware a~ 0.0214 0.1
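One correction before plotting: lambda was last computed for the souvenirs series in 9.3, so it is recomputed here for Turnover before applying the Box-Cox transformation below.
# Guerrero lambda for the retail Turnover series
lambda <- myseries %>%
  features(Turnover, features = guerrero) %>%
  pull(lambda_guerrero)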
myseries %>%
  gg_tsdisplay(difference(difference(box_cox(Turnover, lambda), lag = 12)), plot_type = 'partial') +
  ggtitle("Differenced Data")
## Warning: Removed 13 row(s) containing missing values (geom_path).
## Warning: Removed 13 rows containing missing values (geom_point).
9.6
Simulate and plot some data from simple ARIMA models.
- Use the following R code to generate data from an AR(1) model with \(\phi_1 = 0.6\) and \(\sigma^2 = 1\). The process starts with \(y_1 = 0\).
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
- Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?
- Write your own code to generate data from an MA(1) model with \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
- Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?
- Generate data from an ARMA(1,1) model with \(\phi_1 = 0.6\), \(\theta_1 = 0.6\), and \(\sigma^2 = 1\).
- Generate data from an AR(2) model with \(\phi_1 = -0.8\), \(\phi_2 = 0.3\), and \(\sigma^2 = 1\). (Note that these parameters will give a non-stationary series.)
- Graph the latter two series and compare them.
Part A
set.seed(15)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
head(sim)
## # A tsibble: 6 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 1.83
## 3 3 0.759
## 4 4 1.35
## 5 5 1.30
## 6 6 -0.476
Part B
The plot changes significantly as \(\phi_1\) is adjusted. With \(\phi_1\) near zero the series resembles white noise and stays centered around 0. As \(\phi_1\) approaches 1 the series wanders away from zero for long stretches, approaching a random walk, and for \(\phi_1\) greater than 1 it diverges roughly exponentially. Negative values of \(\phi_1\) make the series oscillate around zero at a much higher frequency than positive values.
for(i in 2:100){
  y[i] <- 0.01*y[i-1] + e[i]
}
small_phi <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100){
  y[i] <- 0.99*y[i-1] + e[i]
}
large_phi <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100){
  y[i] <- 1.2*y[i-1] + e[i]
}
positive_phi <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100){
  y[i] <- -1.2*y[i-1] + e[i]
}
negative_phi <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100){
  y[i] <- -0.6*y[i-1] + e[i]
}
negative_phi2 <- tsibble(idx = seq_len(100), y = y, index = idx)
p6 <- sim %>%
  autoplot(y) +
  ggtitle("Phi = 0.6")
p1 <- small_phi %>%
  autoplot(y) +
  ggtitle("Phi = 0.01")
p2 <- large_phi %>%
  autoplot(y) +
  ggtitle("Phi = 0.99")
p3 <- positive_phi %>%
  autoplot(y) +
  ggtitle("Phi = 1.2")
p4 <- negative_phi %>%
  autoplot(y) +
  ggtitle("Phi = -1.2")
p5 <- negative_phi2 %>%
  autoplot(y) +
  ggtitle("Phi = -0.6")

ggarrange(p1, p2,
          p3, p4,
          p6, p5,
          nrow = 3, ncol = 2)
Part C
theta <- 0.6

for(i in 2:100){
  y[i] <- e[i] + theta * e[i-1]
}
MA <- tsibble(idx = seq_len(100), y = y, index = idx)
head(MA)
## # A tsibble: 6 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 1.99
## 3 3 0.759
## 4 4 0.693
## 5 5 1.03
## 6 6 -0.963
Part D
Changing \(\theta_1\) does not drastically alter the overall shape of the series. Negative values of \(\theta_1\) produce a spikier plot, since consecutive observations tend to alternate in sign, whereas large positive values of \(\theta_1\) give a smoother series because consecutive errors reinforce each other.
MA %>%
  autoplot(y) +
  ggtitle("MA(1) model: Theta = 0.6")
for(i in 2:100){
  y[i] <- e[i] + 0.01 * e[i-1]
}
small_theta <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100){
  y[i] <- e[i] + 0.99 * e[i-1]
}
large_theta <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100){
  y[i] <- e[i] + 1.5 * e[i-1]
}
positive_theta <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100){
  y[i] <- e[i] - 1.5 * e[i-1]
}
negative_theta <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100){
  y[i] <- e[i] - 0.6 * e[i-1]
}
negative_theta2 <- tsibble(idx = seq_len(100), y = y, index = idx)
p6 <- MA %>%
  autoplot(y) +
  ggtitle("Theta = 0.6")
p1 <- small_theta %>%
  autoplot(y) +
  ggtitle("Theta = 0.01")
p2 <- large_theta %>%
  autoplot(y) +
  ggtitle("Theta = 0.99")
p3 <- positive_theta %>%
  autoplot(y) +
  ggtitle("Theta = 1.5")
p4 <- negative_theta %>%
  autoplot(y) +
  ggtitle("Theta = -1.5")
p5 <- negative_theta2 %>%
  autoplot(y) +
  ggtitle("Theta = -0.6")

ggarrange(p1, p2,
          p3, p4,
          p6, p5,
          nrow = 3, ncol = 2)
Part E
for(i in 2:100){
  y[i] <- 0.6*y[i-1] + 0.6*e[i-1] + e[i]
}
arma_E <- tsibble(idx = seq_len(100), y = y, index = idx)
head(arma_E)
## # A tsibble: 6 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 1.99
## 3 3 1.95
## 4 4 1.86
## 5 5 2.14
## 6 6 0.324
Part F
# y[1] and y[2] carry over from the previous simulation as starting values
for(i in 3:100){
  y[i] <- -0.8*y[i-1] + 0.3*y[i-2] + e[i]
}
ar_F <- tsibble(idx = seq_len(100), y = y, index = idx)
head(ar_F)
## # A tsibble: 6 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 1.99
## 3 3 -1.93
## 4 4 3.04
## 5 5 -2.52
## 6 6 1.67
Part G
The ARMA(1, 1) series appears to be stationary: it fluctuates around 0 with no trend, and both the ACF and PACF drop to within the significance bounds after the first few lags. The AR(2) series oscillates around 0, but its swings grow progressively larger as the series progresses. Its ACF alternates between positive and negative values and decays only slowly, while its PACF drops off rapidly after the first lag.
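This matches the AR(2) stationarity conditions given in the text, which require \(-1 \le \phi_2 \le 1\), \(\phi_1 + \phi_2 < 1\), and \(\phi_2 - \phi_1 < 1\). With \(\phi_1 = -0.8\) and \(\phi_2 = 0.3\), the third condition fails, since \(0.3 - (-0.8) = 1.1 > 1\), so the process is non-stationary and its oscillations grow over time.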
arma_E %>%
  gg_tsdisplay(y, plot_type = "partial") +
  ggtitle("ARMA(1, 1) Model")

ar_F %>%
  gg_tsdisplay(y, plot_type = "partial") +
  ggtitle("AR(2) Model")
9.7
Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.
- Use ARIMA() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
- Write the model in terms of the backshift operator.
- Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.
- Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.
- Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?
Part A
The selected ARIMA model is ARIMA(0, 2, 1). The residuals do appear to be white noise.
fit <- aus_airpassengers %>%
  model(
    ARIMA(Passengers)
  )
report(fit)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8963
## s.e. 0.0594
##
## sigma^2 estimated as 4.308: log likelihood=-97.02
## AIC=198.04 AICc=198.32 BIC=201.65
fit %>%
  gg_tsresiduals() +
  ggtitle("ARIMA(0, 2, 1): Residuals")

fit %>%
  forecast(h = 10) %>%
  autoplot(aus_airpassengers) +
  ggtitle("Australian Airpassengers Forecast")
Part B
Backshift operator form:
\[ (1 - B)^2 y_t = (1 - 0.8963B)\epsilon_t \]
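Expanding the backshift polynomials gives the equivalent difference-equation form:
\[ y_t = 2y_{t-1} - y_{t-2} + \epsilon_t - 0.8963\epsilon_{t-1} \]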
Part C
Part A's ARIMA(0, 2, 1) forecast climbs more steeply and is consistently higher than the ARIMA(0, 1, 0)-with-drift forecast from Part C. The Part C residuals also appear to exhibit white noise.
fit2 <- aus_airpassengers %>%
  model(
    ARIMA(Passengers ~ pdq(0, 1, 0))
  )

fit_vis <- aus_airpassengers %>%
  model(
    PartA = ARIMA(Passengers),
    PartC = ARIMA(Passengers ~ pdq(0, 1, 0))
  )

fit2 %>%
  gg_tsresiduals() +
  ggtitle("ARIMA(0, 1, 0): Residuals")

fit_vis %>%
  forecast(h = 10) %>%
  autoplot(aus_airpassengers) +
  ggtitle("Australian Airpassengers Forecast")
Part D
The ARIMA(2, 1, 2) model with a constant produces forecasts very similar to the ARIMA(0, 1, 0) model with drift, but its fit statistics are slightly worse overall. By AICc, the best-fitting model remains Part A's ARIMA(0, 2, 1). Removing the constant throws an error (non-stationary AR part from CSS), so no model is produced.
fit3 <- aus_airpassengers %>%
  model(
    ARIMA(Passengers ~ 1 + pdq(2, 1, 2))
  )

fit4 <- aus_airpassengers %>%
  model(
    ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
  )
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
fit_vis <- aus_airpassengers %>%
  model(
    PartA = ARIMA(Passengers),
    PartC = ARIMA(Passengers ~ pdq(0, 1, 0)),
    PartD_Constant = ARIMA(Passengers ~ 1 + pdq(2, 1, 2)),
    PartD_NoConstant = ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
  )
## Warning: 1 error encountered for PartD_NoConstant
## [1] non-stationary AR part from CSS
fit3 %>%
  gg_tsresiduals() +
  ggtitle("ARIMA(2, 1, 2) With Constant: Residuals")

fit4 %>%
  gg_tsresiduals() +
  ggtitle("ARIMA(2, 1, 2) Without Constant: Residuals")
## Error in na.contiguous.default(as.ts(x)): all times contain an NA
fit_vis %>%
  forecast(h = 10) %>%
  autoplot(aus_airpassengers) +
  ggtitle("Australian Airpassengers Forecast")
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 10 row(s) containing missing values (geom_path).
report(fit_vis)
## Warning in report.mdl_df(fit_vis): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 3 x 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 PartA 4.31 -97.0 198. 198. 202. <cpl [0]> <cpl [1]>
## 2 PartC 4.27 -98.2 200. 201. 204. <cpl [0]> <cpl [0]>
## 3 PartD_Constant 4.03 -96.2 204. 207. 215. <cpl [2]> <cpl [2]>
Part E
The ARIMA(0, 2, 1) model with a constant produces forecasts that curve upward rather than following a straight line: a constant combined with two orders of differencing induces a quadratic trend in the long-term forecasts, which is exactly what the warning below cautions against.
fit <- aus_airpassengers %>%
  model(
    ARIMA(Passengers ~ 1 + pdq(0, 2, 1))
  )
## Warning: Model specification induces a quadratic or higher order polynomial trend.
## This is generally discouraged, consider removing the constant or reducing the number of differences.
fit %>%
  gg_tsresiduals() +
  ggtitle("ARIMA(0, 2, 1) with a Constant: Residuals")

fit %>%
  forecast(h = 10) %>%
  autoplot(aus_airpassengers) +
  ggtitle("Australian Airpassengers Forecast")
9.8
For the United States GDP series (from global_economy):
- if necessary, find a suitable Box-Cox transformation for the data;
- fit a suitable ARIMA model to the transformed data using ARIMA();
- try some other plausible models by experimenting with the orders chosen;
- choose what you think is the best model and check the residual diagnostics;
- produce forecasts of your fitted model. Do the forecasts look reasonable?
- compare the results with what you would obtain using ETS() (with no transformation).
Part A
There are no large changes in the variance of this time series, so a Box-Cox transformation is not necessary.
us.data <- global_economy %>%
  filter(Code == "USA")

us.data %>%
  gg_tsdisplay(GDP, plot_type = 'partial')
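As a hedged double-check (not in the original analysis), the Guerrero lambda could be computed for GDP; whether the resulting value is close enough to 1 to skip the transformation is a judgment call.
# Guerrero lambda for US GDP (illustrative check only)
us.data %>%
  features(GDP, features = guerrero)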
Part B
A suitable ARIMA model for this series, selected automatically by ARIMA(), is ARIMA(0, 2, 2).
fit <- us.data %>%
  model(
    ARIMA(GDP)
  )
report(fit)
## Series: GDP
## Model: ARIMA(0,2,2)
##
## Coefficients:
## ma1 ma2
## -0.4206 -0.3048
## s.e. 0.1197 0.1078
##
## sigma^2 estimated as 2.615e+22: log likelihood=-1524.08
## AIC=3054.15 AICc=3054.61 BIC=3060.23
fit %>%
  forecast(h = 10) %>%
  autoplot(us.data) +
  ggtitle("US GDP Forecast")
Part C
The model from Part B remains the best in terms of AICc and the other fit measures.
fit <- us.data %>%
  model(
    PartB = ARIMA(GDP),
    PartC1 = ARIMA(GDP ~ pdq(0, 2, 1)),
    PartC2 = ARIMA(GDP ~ pdq(1, 2, 0))
  )
glance(fit)
## # A tibble: 3 x 9
## Country .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 United States PartB 2.61e22 -1524. 3054. 3055. 3060. <cpl [0]> <cpl [2]>
## 2 United States PartC1 2.92e22 -1528. 3059. 3059. 3063. <cpl [0]> <cpl [1]>
## 3 United States PartC2 3.06e22 -1529. 3061. 3062. 3065. <cpl [1]> <cpl [0]>
fit %>%
  forecast(h = 10) %>%
  autoplot(us.data) +
  ggtitle("US GDP Forecast")
Part D
As mentioned above, the best model was ARIMA(0, 2, 2) from Part B. The residuals contain an outlier or two but are otherwise distributed around the mean, and the ACF plot confirms that the residuals resemble white noise.
fit <- us.data %>%
  model(
    ARIMA(GDP)
  )

fit %>%
  gg_tsresiduals() +
  ggtitle("ARIMA(0, 2, 2) Residual Diagnostic Plots")
Part E
The forecasts look reasonable: they continue the series' historical upward trajectory, with prediction intervals that widen over the horizon.
fit %>%
  forecast(h = 10) %>%
  autoplot(us.data) +
  ggtitle("US GDP Forecast")
Part F
The ARIMA model shows better fit metrics than the ETS model (keeping in mind that AICc values are not directly comparable across the two model classes), so ARIMA(0, 2, 2) remains the choice for this series. Its forecast is slightly less steep than the ETS forecast, and the ETS model has a much wider prediction interval than the ARIMA model.
fit_F <- us.data %>%
  model(
    ARIMA = ARIMA(GDP),
    ETS = ETS(GDP)
  )
glance(fit_F)
## # A tibble: 2 x 12
## Country .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl>
## 1 United S~ ARIMA 2.61e+22 -1524. 3054. 3055. 3060. <cpl [0~ <cpl [2~ NA
## 2 United S~ ETS 6.78e- 4 -1590. 3191. 3192. 3201. <NULL> <NULL> 2.79e22
## # ... with 2 more variables: AMSE <dbl>, MAE <dbl>
fit_F %>%
  forecast(h = 10) %>%
  autoplot(us.data) +
  ggtitle("US GDP Forecast")