library(fpp3)
library(readxl)
Each dataset's ACF plot shows the same number of lags, but the critical value a lag must exceed to be considered significantly autocorrelated changes with the sample size. For the smallest dataset, the critical value is largest: with so few observations it is harder to be confident that any given lag is truly autocorrelated. For the largest dataset, the critical value is smallest, because the large number of observations makes it easier to detect autocorrelation at each lag. The dataset with 360 observations has a threshold in between.
I believe each series is white noise because none of them shows significant autocorrelation at any lag. This makes sense given that the data are randomly generated: if the data are truly random, there is no way for any particular lag to be correlated.
As the number of observations increases, the critical values decrease. The sample sizes are 36, 360, and 1,000, and this change in sample size changes the critical value required for a lag to be considered autocorrelated. With a larger dataset, each lag's autocorrelation is estimated from more pairs of observations. For example, in the 36-observation dataset the autocorrelation at lag 20 is estimated from only 16 pairs, while in the largest dataset the same lag is estimated from nearly 1,000 pairs.
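For reference, the dashed significance bounds drawn on an ACF plot sit at approximately ±1.96/√T (often rounded to ±2/√T), where T is the number of observations. A minimal sketch of those bounds for the three sample sizes:
# Approximate 95% significance bounds on the ACF: +/- 1.96 / sqrt(T)
n_obs <- c(36, 360, 1000)
round(1.96 / sqrt(n_obs), 3) # roughly 0.327, 0.103, and 0.062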
amzn <- gafa_stock %>%
filter(Symbol == "AMZN") %>%
select(Date, Close)
amzn %>%
ggplot(aes(x = Date)) +
geom_line(aes(y = Close)) +
labs(title = "Amazon Stock Price Over Time", y = "Closing Price")
amzn %>%
gg_tsdisplay(Close, plot_type = 'partial')
## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
amzn %>%
features(Close, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 14.0 0.01
amzn %>%
ggplot(aes(x = Date)) +
geom_line(aes(y = difference(Close))) +
labs(title = "Amazon Differenced Stock Price Over Time", y = "Differenced Closing Price")
## Warning: Removed 1 row(s) containing missing values (geom_path).
Amazon stock is most certainly not stationary. When it is plotted, there is a clear upward trend, with the series appearing to follow a random walk around that trend. To check this, I plotted the ACF and PACF of the Amazon closing price. The ACF shows almost perfect correlation through lag thirty, which is expected because the price of a stock today is heavily influenced by its price yesterday. After removing the effects of the earlier lags, the PACF shows an almost perfect correlation at lag one, again consistent with the nature of stock prices described above. Finally, I used the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test to confirm that the series is not stationary: the p-value below 0.05 means I reject the null hypothesis that the data are stationary. After the data are differenced, no trend remains, so the differenced series is much closer to stationary.
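As a follow-up check not run in the original, the KPSS test could also be applied to the differenced series; if differencing removed the trend, the reported p-value should come back above 0.05 (this implementation caps it at 0.1), so we would fail to reject stationarity.
# KPSS test on the differenced closing price (sketch; output not shown in the original)
amzn %>%
mutate(diff_close = difference(Close)) %>%
features(diff_close, unitroot_kpss)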
turkey <- global_economy %>%
filter(Country == "Turkey") %>%
mutate(GDP = (GDP/1000000)) %>%
select(Year, GDP) %>%
na.omit()
turkey %>%
ggplot(aes(x = Year, y = GDP)) +
geom_line(aes(y = GDP)) +
labs(y = 'GDP in Millions', title = "Turkey's GDP Over Time")
This series has an upward trend with nonconstant variance. Based on this, I believe I will have to apply a Box-Cox transformation to make the series homoskedastic and then difference GDP to make the series stationary.
turkey %>%
gg_tsdisplay(GDP, plot_type = 'partial')
Plotting the series together with its ACF and PACF confirms that the data are not stationary: there is high correlation in the ACF out to lag 10 and a large spike in the PACF at lag 1.
turkey_guerro <- turkey %>%
features(GDP, features = guerrero)
turkeybc <- turkey %>%
mutate(GDP = box_cox(GDP, turkey_guerro$lambda_guerrero))
Here I performed a Box-Cox Transformation in order to make the variance more constant.
turkeybc %>%
features(GDP, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
turkeydcdif <- turkeybc %>%
mutate(GDP = difference(GDP)) %>%
na.omit()
Using the unitroot_ndiffs function, I found that I need to difference this series once. After computing the difference, I can plot the result to see whether the data are now stationary.
turkeydcdif %>%
ggplot(aes(x = Year, y = GDP)) +
geom_line(aes(y = GDP)) +
labs(y = 'Box-Cox and Differenced GDP', title = "Turkey's GDP After Box-Cox and Differencing")
turkeydcdif %>%
features(GDP, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0889 0.1
After using the Box-Cox and differencing, I can see in the plot that my data is stationary. This is confirmed by my KPSS test in which the p-value is above 0.05. Therefore, I fail to reject the null and my transformed data is stationary.
tasmania_acc <- aus_accommodation %>%
filter(State == "Tasmania") %>%
select(Date, Takings)
tasmania_acc %>%
ggplot(aes(x = Date, y = Takings)) +
geom_line(aes(y = Takings)) +
labs(y = 'Takings', title = "Tasmanian Takings Over Time")
This series has an upward trend with roughly constant variance. Based on this, I believe I will not have to do a Box-Cox transformation to make the series homoskedastic. I will have to difference Takings to make the series stationary.
tasmania_acc %>%
gg_tsdisplay(Takings, plot_type = 'partial')
Plotting the series with its ACF and PACF confirms that the data are not stationary: there is high correlation in the ACF across most of the displayed lags and a large spike in the PACF at lag 6.
tasmania_acc %>%
features(Takings, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
tasmania_acc_sdiff <- tasmania_acc %>%
mutate(Takings = difference(Takings, 4)) %>%
na.omit()
tasmania_acc_sdiff %>%
features(Takings, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
Using the unitroot_nsdiffs function, I found that I need to apply the seasonal difference once to this data. After computing the seasonal difference, I used the unitroot_ndiffs function to check whether a first difference is still needed on top of the seasonal difference. For this data, no further first difference is necessary.
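To make the lag-4 differencing explicit: difference(Takings, 4) subtracts the value from the same quarter one year earlier. A minimal check, not in the original and shown only as a sketch, that it matches Takings - lag(Takings, 4):
# The seasonal difference y_t - y_{t-4}, verified against a manual lag-4 subtraction
tasmania_acc %>%
mutate(sdiff_takings = difference(Takings, 4),
manual_sdiff = Takings - lag(Takings, 4)) %>%
head()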
tasmania_acc_sdiff %>%
ggplot(aes(x = Date, y = Takings)) +
geom_line(aes(y = Takings)) +
labs(y = 'Seasonally Differenced Takings', title = "Tasmania's Seasonally Differenced Takings Over Time")
tasmania_acc_sdiff %>%
features(Takings, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.196 0.1
After the seasonal differencing, I can see in the plot that my data is stationary. This is confirmed by my KPSS test, in which the p-value is above 0.05. Therefore, I fail to reject the null and my transformed data is stationary.
souvenirs %>%
ggplot(aes(x = Month, y = Sales)) +
geom_line(aes(y = Sales)) +
labs(y = 'Sales', title = "Monthly Souvenir Sales")
souvenirs %>%
gg_tsdisplay(Sales, plot_type = 'partial')
This series has an upward trend with nonconstant variance. Based on this, I believe I will have to apply a Box-Cox transformation to make the series homoskedastic and then difference Sales to make the series stationary.
souvenirs_guerro <- souvenirs %>%
features(Sales, features = guerrero)
souvenirsbc <- souvenirs %>%
mutate(Sales = box_cox(Sales, souvenirs_guerro$lambda_guerrero))
Here I performed a Box-Cox Transformation in order to make the variance more constant.
souvenirsbc %>%
features(Sales, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
souvenirsbcsdiff <- souvenirsbc %>%
mutate(Sales = difference(Sales, 12)) %>%
na.omit()
souvenirsbcsdiff %>%
features(Sales, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
Using the unitroot_nsdiffs function, I found that I need to apply the seasonal difference once to this data. After computing the seasonal difference, I used the unitroot_ndiffs function to check whether a first difference is still needed on top of the seasonal difference. For this data, no further first difference is necessary.
souvenirsbcsdiff %>%
ggplot(aes(x = Month, y = Sales)) +
geom_line(aes(y = Sales)) +
labs(y = 'Seasonally Differenced Sales', title = "Seasonally Differenced Souvenir Sales Over Time")
souvenirsbcsdiff %>%
features(Sales, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.240 0.1
After using the Box-Cox and seasonal differencing, I can see in the plot that my data is stationary. This is confirmed by my KPSS test in which the p-value is above 0.05. Therefore, I fail to reject the null and my transformed data is stationary.
# Simulate an AR(1) process: y_t = phi * y_{t-1} + e_t
ar1 <- function(phi, n = 100) {
y <- numeric(n)
e <- rnorm(n)
for (i in 2:n) {
y[i] <- phi * y[i - 1] + e[i]
}
tsibble(idx = seq_len(n), y = y, index = idx)
}
ar1(0.6) %>%
autoplot() +
labs(title = "AR1(0.6)")
## Plot variable not specified, automatically selected `.vars = y`
ar1(0) %>%
autoplot() +
labs(title = "AR1(0)")
## Plot variable not specified, automatically selected `.vars = y`
ar1(-1) %>%
autoplot() +
labs(title = "AR1(-1)")
## Plot variable not specified, automatically selected `.vars = y`
ar1(1) %>%
autoplot() +
labs(title = "AR1(1)")
## Plot variable not specified, automatically selected `.vars = y`
As I change phi, the behavior of the series changes. When phi is 0.6, the series is stationary, varying around a constant mean; this is the kind of value we want for a stationary AR(1). When phi is 0, the series is simply white noise. When phi is -1, the series flips sign almost every period, similar to strong negative serial correlation, and it is not stationary since |phi| must be less than one. When phi is 1, the series is a random walk: it wanders without reverting to a mean, so any apparent trend is just accumulated randomness.
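One way to see these differences, as a sketch not included in the original output, is to compare the sample ACFs of the simulated series: the phi = 0.6 series should show geometrically decaying autocorrelation, while the phi = 0 series should show none beyond sampling noise.
# Sample ACFs of two simulated AR(1) series (results vary with the random draw)
ar1(0.6) %>% ACF(y) %>% autoplot() + labs(title = "ACF of AR1(0.6)")
ar1(0) %>% ACF(y) %>% autoplot() + labs(title = "ACF of AR1(0)")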
ma1 = function(theta = 0.6, n = 100)
{
# MA(1): y_t = theta * e_{t-1} + e_t, built from a single error series
e <- rnorm(n)
y <- ts(numeric(n))
y[1] <- e[1]
for (i in 2:n)
y[i] <- theta * e[i - 1] + e[i]
return(y)
}
ma1(0.6) %>%
as_tsibble() %>%
autoplot() +
labs(title = "MA1(0.6)")
## Plot variable not specified, automatically selected `.vars = value`
ma1(0) %>%
as_tsibble() %>%
autoplot() +
labs(title = "MA1(0)")
## Plot variable not specified, automatically selected `.vars = value`
ma1(1) %>%
as_tsibble() %>%
autoplot() +
labs(title = "MA1(1)")
## Plot variable not specified, automatically selected `.vars = value`
ma1(-1) %>%
as_tsibble() %>%
autoplot() +
labs(title = "MA1(-1)")
## Plot variable not specified, automatically selected `.vars = value`
The series changes with each value of theta, but it remains stationary in every case. Only with theta = 0 is it pure white noise: a nonzero theta introduces autocorrelation at lag 1, although this is hard to see from the time plot alone.
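A quick hedged check of that claim, not part of the original output: the sample ACF of an MA(1) series should show a single significant spike at lag 1 when theta is nonzero, and no significant spikes when theta = 0.
# Sample ACFs of the simulated MA(1) series (results vary with the random draw)
ma1(0.6) %>% as_tsibble() %>% ACF(value) %>% autoplot() + labs(title = "ACF of MA1(0.6)")
ma1(0) %>% as_tsibble() %>% ACF(value) %>% autoplot() + labs(title = "ACF of MA1(0)")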
arma1 = function(alpha = 0.6, theta = 0.6)
{
# ARMA(1,1): y_t = alpha * y_{t-1} + e_t + theta * e_{t-1}, using one error series
y <- ts(numeric(100))
e <- rnorm(100)
for (i in 2:100)
y[i] <- alpha * y[i - 1] + e[i] + theta * e[i - 1]
return(y)
}
arma1() %>%
as_tsibble() %>%
autoplot() +
labs(title = "ARMA(1,1)")
## Plot variable not specified, automatically selected `.vars = value`
ar2 = function(alpha1 = -0.8, alpha2 = 0.3)
{
# AR(2): y_t = alpha1 * y_{t-1} + alpha2 * y_{t-2} + e_t
y <- ts(numeric(100))
e <- rnorm(100)
for (i in 3:100)
y[i] <- alpha1 * y[i - 1] + alpha2 * y[i - 2] + e[i]
return(y)
}
ar2() %>%
as_tsibble() %>%
autoplot() +
labs(title = "AR(2)")
## Plot variable not specified, automatically selected `.vars = value`
The ARMA(1,1) series is stationary: it has no trend or seasonality, and I do not see any systematic pattern within the series. These are all good characteristics for modeling and forecasting. The AR(2) series is not stationary: its oscillations grow over time, and the series flips sign almost every period, similar to strong negative serial correlation. It would not be ready for forecasting in its current state.
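The non-stationarity of this AR(2) can also be checked against the standard stationarity conditions for an AR(2) process (phi1 + phi2 < 1, phi2 - phi1 < 1, and |phi2| < 1). A minimal sketch for the coefficients used above:
# Stationarity conditions for an AR(2) with alpha1 = -0.8, alpha2 = 0.3
alpha1 <- -0.8
alpha2 <- 0.3
c(alpha2 + alpha1 < 1, # TRUE: -0.5 < 1
alpha2 - alpha1 < 1, # FALSE: 1.1 > 1, so the process is not stationary
abs(alpha2) < 1) # TRUE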
aus_airpassengers %>%
filter(Year < 2012)
## # A tsibble: 42 x 2 [1Y]
## Year Passengers
## <dbl> <dbl>
## 1 1970 7.32
## 2 1971 7.33
## 3 1972 7.80
## 4 1973 9.38
## 5 1974 10.7
## 6 1975 11.1
## 7 1976 10.9
## 8 1977 11.3
## 9 1978 12.1
## 10 1979 13.0
## # … with 32 more rows
## # ℹ Use `print(n = ...)` to see more rows
aus_airpassengers_auto <- aus_airpassengers %>%
model(
'Auto' = ARIMA(Passengers)
)
aus_airpassengers_auto %>%
gg_tsresiduals()
augment(aus_airpassengers_auto) %>%
features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 Auto 6.70 0.754
aus_airpassengers_auto %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers)
aus_airpassengers_010 <- aus_airpassengers %>%
model(
ARIMA(Passengers ~ pdq(0,1,0))
)
aus_airpassengers_010 %>%
gg_tsresiduals()
augment(aus_airpassengers_010) %>%
features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Passengers ~ pdq(0, 1, 0)) 6.77 0.747
aus_airpassengers_010 %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers)
Compared to my forecast in part (a), this model has a much narrower prediction interval, and its forecast is less steep. The automatically chosen model in part (a) was an ARIMA(0,2,1), which differences the data twice; with two orders of differencing the point forecasts extrapolate the recent trend and the forecast variance grows more quickly, whereas the ARIMA(0,1,0) uses only one difference, giving a flatter forecast with slower-growing intervals.
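The report for the automatically selected model from part (a) is not shown above; a quick hedged check of its order (stated above as ARIMA(0,2,1)) would be:
# Inspect the automatically selected specification from part (a); output not shown here
aus_airpassengers_auto %>%
report()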
aus_airpassengers_211_constant <- aus_airpassengers %>%
model(
ARIMA(Passengers ~ 1 + pdq(2,1,1))
)
aus_airpassengers_211_constant %>%
gg_tsresiduals()
augment(aus_airpassengers_211_constant) %>%
features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Passengers ~ 1 + pdq(2, 1, 1)) 4.35 0.930
aus_airpassengers_211_constant %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers)
aus_airpassengers_211 <- aus_airpassengers %>%
model(
ARIMA(Passengers ~ 0 + pdq(2,1,1))
)
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 1))
## [1] non-stationary AR part from CSS
The forecast from the ARIMA(2,1,1) model with the constant looks reasonable: it continues the upward trend, and although the prediction interval is fairly wide, we are forecasting ten years into the future. When I try to fit the ARIMA(2,1,1) without a constant, the model cannot be estimated. The once-differenced series still drifts upward, so without a constant the AR terms try to absorb that drift and the estimated AR part ends up non-stationary, which is why the estimation fails. The constant is what accounts for the drift in this series.
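To illustrate the role of the constant when there is one order of differencing, here is a small sketch using simple random-walk models rather than the ARIMA(2,1,1) above: with a constant the point forecasts follow a linear drift, while without one they are flat.
# Random walk with and without drift: the constant produces the trending forecast
aus_airpassengers %>%
model(
'RW with drift' = ARIMA(Passengers ~ 1 + pdq(0,1,0)),
'RW no drift' = ARIMA(Passengers ~ 0 + pdq(0,1,0))
) %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers, level = NULL)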
aus_airpassengers_021 <- aus_airpassengers %>%
model(
ARIMA(Passengers ~ pdq(0,2,1))
)
aus_airpassengers_021 %>%
gg_tsresiduals()
augment(aus_airpassengers_021) %>%
features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Passengers ~ pdq(0, 2, 1)) 6.70 0.754
aus_airpassengers_021 %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers)
aus_airpassengers_021_constant <- aus_airpassengers %>%
model(
ARIMA(Passengers ~ 1 + pdq(0,2,1))
)
## Warning: Model specification induces a quadratic or higher order polynomial trend.
## This is generally discouraged, consider removing the constant or reducing the number of differences.
aus_airpassengers_021_constant %>%
gg_tsresiduals()
augment(aus_airpassengers_021_constant) %>%
features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Passengers ~ 1 + pdq(0, 2, 1)) 6.91 0.734
aus_airpassengers_021_constant %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers)
I am able to plot the forecast from the ARIMA(0,2,1) model without a constant, and it appears sensible for the data shown. When I add a constant, the forecast becomes much steeper and the prediction interval narrower, and I get a warning that including a constant with two orders of differencing induces a quadratic (or higher-order) polynomial trend. I believe the forecast with the constant is worse than the one without it, as its slope is too steep.
us <- global_economy %>%
filter(Code == "USA") %>%
mutate(GDP = (GDP/1000000000000)) %>%
select(Year, GDP)
us %>%
ggplot(aes(x = Year, y = GDP)) +
geom_line(aes(y = GDP)) +
labs(title = "US GDP Since 1960", y = "GDP in Trillions")
us_guerro <- us %>%
features(GDP, features = guerrero)
usbc <- us %>%
mutate(GDP = box_cox(GDP, us_guerro$lambda_guerrero))
usbc %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`
After cleaning and plotting the data for US GDP, I can see that there is little need for a Box-Cox transformation: the variance is relatively constant. I applied a Box-Cox transformation anyway to see how it might change the data, and the result stayed very similar, so I will not use it past part (a).
us %>%
gg_tsdisplay(GDP, plot_type = 'partial')
us %>%
features(GDP, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 2
us_auto <- us %>%
model(
'Auto' = ARIMA(GDP)
)
us_auto %>%
report()
## Series: GDP
## Model: ARIMA(0,2,2)
##
## Coefficients:
## ma1 ma2
## -0.4206 -0.3048
## s.e. 0.1197 0.1078
##
## sigma^2 estimated as 0.02615: log likelihood=23.26
## AIC=-40.52 AICc=-40.06 BIC=-34.45
I fitted the data using the automatic ARIMA method. The automatically chosen model was ARIMA(0,2,2). I checked the AICc using report(), which gave a value of -40.06. As I will see later, this ends up being the lowest AICc I find, as expected.
I also checked the number of differences needed for this series and found that two differences are necessary. This is not surprising, as the data are nowhere near stationary as currently constructed. The automatically chosen ARIMA model matches this need for two differences.
us_compare <- us %>%
model(
'Auto' = ARIMA(GDP),
'Diffs Only' = ARIMA(GDP ~ pdq(0,2,0)),
'Autoregression + Diff' = ARIMA(GDP ~ pdq(1,2,0)),
"Moving Average + Diff" = ARIMA(GDP ~ pdq(0,2,1)),
"AR + Diff + MA" = ARIMA(GDP ~ pdq(1,2,1))
)
us_compare %>%
report()
## Warning in report.mdl_df(.): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 5 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 Auto 0.0261 23.3 -40.5 -40.1 -34.4 <cpl [0]> <cpl [2]>
## 2 Diffs Only 0.0310 17.8 -33.5 -33.5 -31.5 <cpl [0]> <cpl [0]>
## 3 Autoregression + Diff 0.0306 18.6 -33.3 -33.0 -29.2 <cpl [1]> <cpl [0]>
## 4 Moving Average + Diff 0.0292 19.8 -35.6 -35.3 -31.5 <cpl [0]> <cpl [1]>
## 5 AR + Diff + MA 0.0263 23.1 -40.2 -39.7 -34.1 <cpl [1]> <cpl [1]>
I tried other ARIMA specifications and compared them to the automatic model using the report() function. I checked a model with only the differences, the differences plus an autoregressive term, the differences plus a moving average term, and all three components together as ARIMA(1,2,1). None of these resulted in a lower AICc than the automatic ARIMA, and the ARIMA(1,2,1) came very close to it. I still believe the ARIMA(1,2,1) to be the best choice, because the automatic ARIMA does not use any autoregressive term, which I think this series needs given the pattern in the ACF.
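An equivalent way to see the ranking, shown here only as a sketch with output omitted, is to pull the AICc values from glance() and sort them:
# Rank the candidate models by AICc (smaller is better)
us_compare %>%
glance() %>%
arrange(AICc) %>%
select(.model, AICc)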
us_all <- us %>%
model(
"AR + Diff + MA" = ARIMA(GDP ~ pdq(1,2,1))
)
us_auto %>%
gg_tsresiduals()
us_all %>%
gg_tsresiduals()
augment(us_auto) %>%
features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 Auto 12.2 0.273
augment(us_all) %>%
features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 AR + Diff + MA 14.1 0.169
As explained in part (c), I believe the ARIMA(1,2,1) is the best model for this series, and I compare its residuals here to those of the automatic ARIMA(0,2,2). Based on the residual plots, the ARIMA(1,2,1) has slightly more autocorrelation at lags 7 and 9, but the differences are small and at lags that are not particularly meaningful for this dataset, so I still prefer the ARIMA(1,2,1). Both models have roughly normally distributed residuals centered on zero, and their residual plots are nearly identical. The Ljung-Box tests agree: neither rejects the null hypothesis of independently distributed residuals, so both sets of residuals are consistent with white noise. The automatic model has the slightly higher p-value, but the difference is small and does not change my preference for the ARIMA(1,2,1), which rests on the AR term discussed above.
us_all %>%
forecast(h = 10) %>%
autoplot(us)
I believe the forecast from this model makes sense: it continues the upward trend of the series. The downside is that it does not account for the possibility of a major recession in the next ten years; although the plot includes 80% and 95% prediction intervals, a recession like the one in 2008 could fall well outside them.
us_ets_compare <- us %>%
model('Auto' = ETS(GDP),
'Multiplicative Trend' = ETS(GDP ~ error("A") + trend("M") + season("N"))
)
us_ets_compare %>%
report()
## Warning in report.mdl_df(.): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 2 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Auto 0.000455 23.8 -37.7 -36.5 -27.4 0.0280 0.128 0.0169
## 2 Multiplicative Trend 0.0318 -15.7 41.4 42.6 51.7 0.0296 0.137 0.0983
us_ets_compare %>%
forecast(h = 10) %>%
autoplot(us)
I compared my ARIMA models to the automatic ETS and to an ETS with a multiplicative trend. Of the two ETS models, I believe the multiplicative trend version makes more conceptual sense because GDP grows roughly exponentially over time, even though the automatic ETS has the lower AICc. When comparing against the ARIMA models, AICc values cannot be compared directly between the ARIMA and ETS model classes, so I rely on the forecasts: the ARIMA(1,2,1) forecast looks more sensible over a ten-year horizon, and I prefer it. If I were forecasting much further ahead, the multiplicative trend ETS might become more attractive.
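Since AICc is not comparable across model classes, a fairer head-to-head comparison would be time-series cross-validation of one-step forecasts, sketched below but not run in the original; the shortest training sets may fail to fit and would be dropped with a warning.
# One-step cross-validated accuracy for the chosen ARIMA and ETS specifications
us %>%
stretch_tsibble(.init = 10) %>%
model(
'ARIMA(1,2,1)' = ARIMA(GDP ~ pdq(1,2,1)),
'Multiplicative Trend ETS' = ETS(GDP ~ error("A") + trend("M") + season("N"))
) %>%
forecast(h = 1) %>%
accuracy(us)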
japan_arrivals <- aus_arrivals %>%
filter(Origin == "Japan")
autoplot(japan_arrivals)
## Plot variable not specified, automatically selected `.vars = Arrivals`
Japanese arrivals increase from 1980 to about 1995, stay roughly level until 2005, and then decline. The series is heteroskedastic, with variance proportional to the level of the series, which makes the variation in the 1980s look much smaller than in the 2000s. The number of Japanese arrivals appears to be leveling off at about 100,000 per quarter by the end of the series.
japan_arrivalsdiff <- japan_arrivals %>%
gg_tsdisplay(difference(Arrivals, 4) %>%
difference(), plot_type = "partial" , lag = 16)
japan_arrivalsdiff
## Warning: Removed 5 row(s) containing missing values (geom_path).
## Warning: Removed 5 rows containing missing values (geom_point).
There is high autocorrelation at lag four, which indicates that the data are heavily seasonal, because four lags back is the same quarter of the previous year. This is expected, since I saw a lot of quarter-to-quarter variation when I plotted the data, and I know travel is heavily seasonal, with more travel in the second and fourth quarters of the year.
The PACF shows that even after removing the effects of the earlier lags shown in the ACF, there is still autocorrelation, most prominently at four periods back. This makes sense for heavily seasonal data. The PACF also shows autocorrelation beyond one year: the spikes at lags 8 and 12 exceed the critical bounds, indicating correlation with the same quarter two and three years earlier in addition to one year earlier.
Together these graphs show a lot of seasonality in the data. They also suggest the data may need further differencing to be fully stationary, but I do not want to difference more than twice, because then I lose the interpretation of the values in my series. I believe a seasonal difference at lag 4 together with a first difference is the most appropriate differencing to keep the values meaningful, as checked below.
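As a hedged check of this differencing choice, not run in the original, the unit-root features can suggest how many seasonal and first differences are needed:
# How many seasonal differences, then how many first differences after seasonal differencing?
japan_arrivals %>%
features(Arrivals, unitroot_nsdiffs)
japan_arrivals %>%
mutate(Arrivals = difference(Arrivals, 4)) %>%
features(Arrivals, unitroot_ndiffs)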
japan_arrivalsfit <- japan_arrivals %>%
model(ARIMA(Arrivals))
report(japan_arrivalsfit)
## Series: Arrivals
## Model: ARIMA(0,1,1)(1,1,1)[4]
##
## Coefficients:
## ma1 sar1 sma1
## -0.4228 -0.2305 -0.5267
## s.e. 0.0944 0.1337 0.1246
##
## sigma^2 estimated as 174801727: log likelihood=-1330.66
## AIC=2669.32 AICc=2669.66 BIC=2680.54
japan_arrivalsarima <- japan_arrivals %>%
model(ARIMA(Arrivals ~ pdq(0,1,1) + PDQ(1,1,1)))
report(japan_arrivalsarima)
## Series: Arrivals
## Model: ARIMA(0,1,1)(1,1,1)[4]
##
## Coefficients:
## ma1 sar1 sma1
## -0.4228 -0.2305 -0.5267
## s.e. 0.0944 0.1337 0.1246
##
## sigma^2 estimated as 174801727: log likelihood=-1330.66
## AIC=2669.32 AICc=2669.66 BIC=2680.54
japan_arrivalsarima2 <- japan_arrivals %>%
model(ARIMA(Arrivals ~ pdq(1,1,0) + PDQ(1,1,1)))
report(japan_arrivalsarima2)
## Series: Arrivals
## Model: ARIMA(1,1,0)(1,1,1)[4]
##
## Coefficients:
## ar1 sar1 sma1
## -0.3091 -0.2129 -0.5534
## s.e. 0.0889 0.1321 0.1224
##
## sigma^2 estimated as 180712140: log likelihood=-1332.66
## AIC=2673.32 AICc=2673.67 BIC=2684.54
I believe the ARIMA(0,1,1)(1,1,1)[4] model is the best: it gives the lowest AICc of the models I checked. This is expected, as it is the model chosen automatically by R.
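As a final hedged check not shown in the original, the residuals of the chosen model could be inspected and tested with a Ljung-Box test; here dof is set to 3 for the three estimated coefficients and lag to 8 (two seasonal periods).
# Residual diagnostics for the selected ARIMA(0,1,1)(1,1,1)[4] model (sketch)
japan_arrivalsfit %>%
gg_tsresiduals()
augment(japan_arrivalsfit) %>%
features(.resid, ljung_box, lag = 8, dof = 3)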
us_data <- read_excel("//Users//colinadams//Documents//GCSU//Fall 2022//Forecasting//Homeworks//Homework 6//gdp.xlsx")
us_gdp <- us_data %>%
mutate(Date = year(Date)) %>%
as_tsibble(index = Date) %>%
mutate(GDP = Value)
us_gdp %>%
autoplot() +
labs(title = "US GDP Over Time")
## Plot variable not specified, automatically selected `.vars = Value`
us_gdp %>%
features(GDP, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
us_gdp_compare <- us_gdp %>%
model(
'Auto' = ARIMA(GDP),
'Diffs Only' = ARIMA(GDP ~ pdq(0,1,0)),
'Autoregression + Diff' = ARIMA(GDP ~ pdq(1,1,0)),
"Moving Average + Diff" = ARIMA(GDP ~ pdq(0,1,1)),
"All" = ARIMA(GDP ~ pdq(1,1,1))
)
us_gdp_compare %>%
report()
## Warning in report.mdl_df(.): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 5 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 Auto 13855. -136. 277. 278. 280. <cpl [1]> <cpl [0]>
## 2 Diffs Only 25846. -142. 289. 290. 291. <cpl [0]> <cpl [0]>
## 3 Autoregression + Diff 13855. -136. 277. 278. 280. <cpl [1]> <cpl [0]>
## 4 Moving Average + Diff 16757. -137. 281. 282. 284. <cpl [0]> <cpl [1]>
## 5 All 14101. -135. 278. 281. 283. <cpl [1]> <cpl [1]>
In order to identify an ARIMA model, I first checked how many differences were necessary and found that one difference is enough. I then compared several specifications against the automatically chosen model. The automatic ARIMA(1,1,0) with drift has the lowest AICc, tied with my manually specified ARIMA(1,1,0), which turns out to be the same model. This is not surprising, as the automatic function searches for the specification with the lowest possible AICc.
us_auto <- us_gdp %>%
model('Auto' = ARIMA(GDP))
us_auto %>%
gg_tsresiduals()
augment(us_auto) %>%
features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 Auto 7.90 0.638
The residual diagnostics are somewhat mixed. The time plot of the residuals shows some runs over time, but there is no significant autocorrelation, which is good for the model, and the residuals are close to normally distributed with a mean of about zero. Because the plots gave me mixed signals, I used a Ljung-Box test to check for white noise. The p-value is well above 0.05, so I fail to reject the null hypothesis of independently distributed residuals; the residuals are therefore consistent with white noise.
us_auto %>%
forecast(h = 4) %>%
autoplot(us_gdp) +
labs(title = "US GDP Forecasted (ARIMA(1,1,0)) for Four Years")
us_gdp %>%
model(
'Auto' = ETS(GDP),
'Simple' = ETS(GDP ~ error("A") + trend("N") + season("N")),
'Holt' = ETS(GDP ~ error("A") + trend("A") + season("N")),
'Multiplicative Trend' = ETS(GDP ~ error("A") + trend("M") + season("N"))
) %>%
report(us_gdp)
## Warning in report.mdl_df(., us_gdp): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 4 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Auto 5.81e-5 -144. 298. 301. 304. 12620. 7.77e4 5.32e-3
## 2 Simple 4.40e+5 -184. 375. 376. 378. 401348. 1.79e6 6.01e+2
## 3 Holt 1.54e+4 -145. 299. 303. 305. 12720. 8.02e4 8.47e+1
## 4 Multiplicative Trend 5.36e+4 -159. 328. 332. 334. 44311. 1.84e5 1.46e+2
In order to find a proper ETS specification, I compared the automatically chosen model to the simple (ANN), Holt (AAN), and multiplicative trend methods. I found the automatic method to have the lowest AICc, as expected; it is an ETS(M,A,N). Because we have been advised not to use multiplicative errors, I believe the Holt method is the most appropriate for this series: its AICc is the second lowest, and almost as low as that of the ETS(M,A,N).
us_holt <- us_gdp %>%
model(
'Holt' = ETS(GDP ~ error("A") + trend("A") + season("N"))
)
us_holt %>%
gg_tsresiduals()
augment(us_holt) %>%
features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 Holt 13.0 0.222
Using the ETS model, the residuals look almost identical to those from the ARIMA model: some runs over time, no significant autocorrelation, an approximately normal distribution, and a mean near zero. I again used a Ljung-Box test to check whether the residuals are white noise. The p-value above 0.05 tells me that I fail to reject the null; the residuals are therefore indistinguishable from white noise.
us_holt %>%
forecast(h = 4) %>%
autoplot(us_gdp) +
labs(title = "US GDP Forecasted (AAN) for Four Years")
us_gdp %>%
stretch_tsibble(.init = 10) %>%
model(
'Holt' = ETS(GDP ~ error("A") + trend("A") + season("N")),
'Auto ARIMA' = ARIMA(GDP ~ pdq(1,1,0))
) %>%
forecast(h = 1) %>%
accuracy(us_gdp)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 1 observation is missing at 2023
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Auto ARIMA Test 69.0 154. 128. 0.318 0.678 0.204 0.238 -0.116
## 2 Holt Test 7.53 178. 138. -0.0455 0.760 0.219 0.274 0.333
In order to choose a final model, I used cross-validation to check the one-step accuracy of both methods. The ARIMA(1,1,0) has the lower RMSE and MAE, while the Holt ETS only has the smaller mean error, so the cross-validation slightly favors the ARIMA(1,1,0). Of all the methods I examined, the ARIMA(1,1,0) with drift is therefore my preferred model, although the differences between the two are not large.