Exercises from Section 9 of Forecasting: Principles & Practice at https://otexts.com/fpp3/arima-exercises.html
Technically yes: since roughly 95% of the spikes on each plot fall within the blue dashed lines, and all of the autocorrelations are close to zero with only random variation, we’d conclude the data are white noise.
ACF plots
This is due to the differing lengths of the series. The critical values are drawn at roughly +/-1.96/sqrt(T), where T is the series length, so the bounds narrow as more data is generated; in smaller data sets it’s also easier for a sample autocorrelation to look extreme.
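To make this concrete, here is a quick sketch (not part of the original exercise output) computing the bounds for the three series lengths used in the exercise:
#critical values +/- 1.96/sqrt(T) for the three series lengths
n <- c(36, 360, 1000)
round(1.96 / sqrt(n), 3) #gives 0.327, 0.103, 0.062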
First I plot the data as-is. The data don’t appear to have any seasonality, but the series begins with very little growth and then increases much more steeply as time goes on. The ACF plot suggests non-stationarity, with r1 close to 1.0 and many values outside the blue dashed bounds. A Box-Cox transformation may be useful since the values and variance are so low until about 1990 in this dataset.
turkey_gdp <- global_economy %>%
filter(Country == 'Turkey')
gg_tsdisplay(turkey_gdp, GDP)
I check the lambda value for this dataset and it comes out at about 0.157, so the Box-Cox transformation applied is close to a log. The transformed plot appears more linear, but we still have non-stationarity to deal with.
#strip df down
turkey_gdp_prepped <- dplyr::select(turkey_gdp, GDP)
#determine lambda
BoxCox.lambda(turkey_gdp_prepped$GDP)
## [1] 0.1571804
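As an aside, the same lambda can be estimated within the fable/feasts framework using the guerrero feature; a quick sketch (run before the transformation below; output not shown) that should return a similar value:
#estimate lambda with the guerrero feature
turkey_gdp_prepped %>%
features(GDP, features = guerrero)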
#perform boxcox and apply to df
turkey_gdp_prepped$GDP <- box_cox(turkey_gdp_prepped$GDP, BoxCox.lambda(turkey_gdp_prepped$GDP))
#plot transformed data
turkey_gdp_prepped %>% gg_tsdisplay(GDP, plot_type = 'partial')
Next we use the KPSS test to check whether the data are stationary. The reported p-value of 0.01 is a lower bound (the actual p-value is at most this), so the null hypothesis of stationarity is rejected: the data are not stationary.
turkey_gdp_prepped %>%
features(GDP, unitroot_kpss)
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 1.50 0.01
Using the ndiffs() function (or the unitroot_ndiffs feature) we see that the appropriate number of first differences is one.
ndiffs(turkey_gdp_prepped$GDP)
## [1] 1
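For completeness, here is the unitroot_ndiffs feature mentioned above via the fable-style interface; it should agree (a sketch, output not shown):
#the fable-style equivalent of ndiffs()
turkey_gdp_prepped %>%
features(GDP, unitroot_ndiffs)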
Plotting the Box-Cox transformed series after taking a first difference, we see stationary data at last.
ggtsdisplay(diff(turkey_gdp_prepped$GDP))
Below we clearly see seasonality across the quarters (a seasonal period of 4), as well as variance that increases over time.
# create df
tasmania_takings <- aus_accommodation %>%
filter(State == 'Tasmania')
#display data
gg_tsdisplay(tasmania_takings, Takings)
First we perform a Box-Cox transformation. With a lambda of -0.005 the transformation is essentially a log, and it does correct some of the variance.
#strip df down
tasmania_takings_prepped <- dplyr::select(tasmania_takings, Takings)
#determine lambda
BoxCox.lambda(tasmania_takings_prepped$Takings)
## [1] -0.005076712
#perform boxcox and apply to df
tasmania_takings_prepped$Takings <- box_cox(tasmania_takings_prepped$Takings, BoxCox.lambda(tasmania_takings_prepped$Takings))
#plot transformed data
tasmania_takings_prepped %>% gg_tsdisplay(Takings, plot_type = 'partial')
We won’t even start with a KPSS test, as it’s very obvious the data aren’t stationary. The ndiffs() function suggests one first difference is best.
ndiffs(tasmania_takings_prepped$Takings)
## [1] 1
The display below shows the Box-Cox (log) transformed data with one difference taken.
ggtsdisplay(diff(tasmania_takings_prepped$Takings))
While I can still see seasonality affecting the ACF plot, I’ll use the KPSS test to check whether the data are now considered stationary. With a reported p-value of 0.1 (an upper bound) we fail to reject the null hypothesis: the data are stationary.
unitroot_kpss((diff(tasmania_takings_prepped$Takings)))
## kpss_stat kpss_pvalue
## 0.2573405 0.1000000
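Since the seasonality is still visible in the ACF, a seasonal difference is worth checking as well. A sketch (output not shown) using the unitroot_nsdiffs feature and a lag-4 difference:
#check whether a seasonal (lag-4) difference is warranted
tasmania_takings_prepped %>%
features(Takings, unitroot_nsdiffs)
#display the seasonally differenced series
tasmania_takings_prepped %>%
gg_tsdisplay(difference(Takings, 4), plot_type = 'partial')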
The plot shows annual spikes in December, with increasing variance as time goes on.
#display data
gg_tsdisplay(souvenirs, Sales)
With a lambda of -0.24, closer to 0 than to -0.5, the Box-Cox transformation is roughly log-like. We can see in the time plot that the variance has decreased.
souvenirs_prepped <- souvenirs
#determine lambda
BoxCox.lambda(souvenirs_prepped$Sales)
## [1] -0.2444328
#perform boxcox and apply to df
souvenirs_prepped$Sales <- box_cox(souvenirs_prepped$Sales, BoxCox.lambda(souvenirs_prepped$Sales))
#plot transformed data
souvenirs_prepped %>% gg_tsdisplay(Sales, plot_type = 'partial')
Again, we’ll skip the KPSS test as it’s obvious the data need differencing. A single first difference is suggested again.
ndiffs(souvenirs_prepped$Sales)
## [1] 1
Finally, below we see the Box-Cox transformed and once-differenced data, which still display some seasonality.
ggtsdisplay(diff(souvenirs_prepped$Sales))
A final check with the KPSS test confirms the data are stationary now.
unitroot_kpss((diff(souvenirs_prepped$Sales)))
## kpss_stat kpss_pvalue
## 0.06153858 0.10000000
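The same seasonal caveat applies to this monthly series. With the forecast package the check is nsdiffs(), which needs a ts object with the frequency attached; a sketch (output not shown):
#seasonal differencing check on the transformed monthly sales
nsdiffs(ts(souvenirs_prepped$Sales, frequency = 12))
#if one seasonal difference is suggested, it would be applied with diff(x, lag = 12)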
Reproducing the same sample of data for ‘myseries’ and plotting it, I see the data are certainly not stationary, likely seasonal, and showing increasing variance.
set.seed(8675309)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
gg_tsdisplay(myseries, Turnover)
Below I estimate the Box-Cox lambda and get 0.038, so the transformation will be close to a log.
#strip df down
myseries_prepped <- dplyr::select(myseries, Turnover)
#determine lambda
BoxCox.lambda(myseries_prepped$Turnover)
## [1] 0.03753454
I see the transformation has made the data more linear and smoothed out some of the variance across time.
#perform boxcox and apply to df
myseries_prepped$Turnover <- box_cox(myseries_prepped$Turnover, BoxCox.lambda(myseries_prepped$Turnover))
#plot transformed data
myseries_prepped %>%
gg_tsdisplay(Turnover, plot_type = "partial")
Next I check if any differencing is needed, and just one is suggested.
ndiffs(myseries_prepped$Turnover)
## [1] 1
Below we see the transformed data with one difference taken. Outside of the seasonality, the data appear to be stationary now.
ggtsdisplay(diff(myseries_prepped$Turnover))
A quick check with the KPSS test confirms the data is stationary.
unitroot_kpss(diff(myseries_prepped$Turnover))
## kpss_stat kpss_pvalue
## 0.1631769 0.1000000
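To address the seasonality that remains, one option (a sketch, not run here) is a lag-12 seasonal difference followed by a first difference of the transformed series:
#lag-12 seasonal difference, then a first difference
myseries_prepped %>%
gg_tsdisplay(difference(difference(Turnover, 12)), plot_type = "partial")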
#simulate an AR(1) with phi = 0.6, starting from y[1] = 0
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim_0.6 <- tsibble(idx = seq_len(100), y = y, index = idx)
sim_0.6 %>%
autoplot(y)
for(i in 2:100)
y[i] <- 0.05*y[i-1] + e[i] #AR(1) with phi = 0.05, reusing the same errors
sim_0.05 <- tsibble(idx = seq_len(100), y = y, index = idx)
sim_0.05 %>%
autoplot(y)
for(i in 2:100)
y[i] <- 0.95*y[i-1] + e[i] #AR(1) with phi = 0.95
sim_0.95 <- tsibble(idx = seq_len(100), y = y, index = idx)
sim_0.95 %>%
autoplot(y)
From the plots above, lower values of phi produce series that look more like white noise, while higher values of phi give smoother series with longer runs away from the mean that can resemble trends or cycles.
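This matches the theory: an AR(1)’s autocorrelation at lag k is phi^k, so a small phi forgets the past almost immediately while a large phi decays slowly. A quick check:
#theoretical autocorrelations phi^k for k = 1..5 (columns: phi = 0.05, 0.6, 0.95)
sapply(c(0.05, 0.6, 0.95), function(p) round(p^(1:5), 3))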
y[1] <- e[1] #initialize the first value; previously y[1] was left at 0
for(i in 2:100)
y[i] <- 0.6*e[i-1] + e[i] #1st order moving average model, theta = 0.6
sim_ma1_0.6 <- tsibble(idx = seq_len(100), y = y, index = idx)
sim_ma1_0.6 %>%
autoplot(y)
y[1] <- e[1]
for(i in 2:100)
y[i] <- 0.01*e[i-1] + e[i] #1st order moving average model, theta = 0.01
sim_ma1_0.01 <- tsibble(idx = seq_len(100), y = y, index = idx)
sim_ma1_0.01 %>%
autoplot(y)
Comparing the plots, the theta1 = 0.01 series is essentially indistinguishable from the white noise errors themselves, while theta1 = 0.6 produces a slightly smoother series with somewhat larger swings - the MA(1) variance is (1 + theta1^2) * sigma^2.
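A quick simulation check of that variance formula, using a hypothetical seed and a large sample:
#the MA(1) variance is (1 + theta^2) * sigma^2; verify by simulation
set.seed(1) #hypothetical seed for reproducibility
e_big <- rnorm(1e5)
ma_sim <- function(theta) c(e_big[1], theta * head(e_big, -1) + tail(e_big, -1))
var(ma_sim(0.6)) #expect about 1 + 0.6^2 = 1.36
var(ma_sim(0.01)) #expect about 1.00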
y[1] <- e[1] #initialize before simulating the ARMA(1,1)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + 0.6*e[i-1] + e[i] #ARMA(1,1): phi1 = 0.6, theta1 = 0.6
arma_1 <- tsibble(idx = seq_len(100), y = y, index = idx)
y[1:2] <- e[1:2] #re-initialize so the AR(2) doesn't inherit the ARMA(1,1) values
for(i in 3:100)
y[i] <- -0.8*y[i-1] + 0.3*y[i-2] + e[i] #AR(2): phi1 = -0.8, phi2 = 0.3
arma_2 <- tsibble(idx = seq_len(100), y = y, index = idx)
arma_1 %>%
autoplot(y)
arma_2 %>%
autoplot(y)
The first series doesn’t appear to have any seasonality, though there are cycles present; to the eye I don’t see a trend. The second series oscillates around the mean with ever-increasing amplitude - this isn’t seasonality but the explosive behaviour of a non-stationary AR(2), since the chosen coefficients violate the stationarity conditions.
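That non-stationarity can be verified directly: an AR(2) is stationary only if all roots of 1 - phi1*z - phi2*z^2 lie outside the unit circle.
#roots of the AR(2) characteristic polynomial with phi1 = -0.8, phi2 = 0.3
Mod(polyroot(c(1, 0.8, -0.3))) #one modulus is below 1, so the process is non-stationary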
First I build the model and take a look at the report results. An ARIMA(0,2,1) was selected.
fit_1 <- aus_airpassengers %>%
model(ARIMA(Passengers))
report(fit_1)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8963
## s.e. 0.0594
##
## sigma^2 estimated as 4.308: log likelihood=-97.02
## AIC=198.04 AICc=198.32 BIC=201.65
Looking at the residuals, we see a fairly normal histogram, and the ACF values are all within the bounds, suggesting the residuals are white noise.
fit_1 %>%
gg_tsresiduals()
A forecast of 10 years is displayed with prediction intervals below. The upward trend continues, with the intervals widening as the horizon grows, as expected.
fit_1 %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers)
I found this very short section of the chapter hard to follow without worked examples, but for reference the fitted ARIMA(0,2,1) can be written in backshift notation as (1 - B)^2 y_t = (1 + theta_1 B) e_t, with theta_1 = -0.8963; expanding gives y_t = 2 y_{t-1} - y_{t-2} + e_t + theta_1 e_{t-1}.
This ARIMA(0,1,0) model with drift produces more conservative forecasts: still an upward trend, but not as steep.
#model ARIMA with forced 0,1,0
fit_2 <- aus_airpassengers %>%
model(ARIMA(Passengers ~ pdq(0,1,0)))
fit_2 %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers)
This model didn’t appear to generate any forecast at all with the package I was using, so I switched packages to use the include.drift parameter. After exploring further, removing the constant prevented a model from being fitted at all, so the constant appears necessary here - possibly for stationarity?
# force ARIMA to 2,1,2 - switching to other package as I don't know how to add drift
fit_3 <- Arima(aus_airpassengers$Passengers,
order = c(2,1,2), include.drift = TRUE)
fit_3 %>%
forecast(h = 10) %>%
autoplot()
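For reference, switching packages isn’t strictly necessary: in fable a constant/drift can be requested by adding 1 to the model formula. A sketch (fit_3_fable is a hypothetical name; output not shown):
#the fable equivalent of Arima(..., order = c(2,1,2), include.drift = TRUE)
fit_3_fable <- aus_airpassengers %>%
model(ARIMA(Passengers ~ 1 + pdq(2,1,2)))
fit_3_fable %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers)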
This is the same model and plot as in part (a), except this time we’ve forced the order rather than letting ARIMA() select it.
#model ARIMA with forced 0,2,1
fit_4 <- aus_airpassengers %>%
model(ARIMA(Passengers ~ pdq(0,2,1)))
fit_4 %>%
forecast(h = 10) %>%
autoplot(aus_airpassengers)
I create the subset data frame and plot it. The series shows a fairly smooth upward trend with no seasonality or cycles.
usa_gdp <- global_economy %>%
filter(Country == "United States") %>%
dplyr::select(GDP)
usa_gdp %>%
autoplot(GDP)
With a lambda of 0.28 we will perform a Box-Cox transformation.
#determine lambda
BoxCox.lambda(usa_gdp$GDP)
## [1] 0.2819714
Despite the data being quite smooth to begin with, the transformation makes the trend even closer to linear - this might not be strictly necessary, but it will likely help the model.
#perform boxcox and apply to df
usa_gdp$GDP <- box_cox(usa_gdp$GDP, BoxCox.lambda(usa_gdp$GDP))
#plot transformed data
usa_gdp %>%
gg_tsdisplay(GDP, plot_type = 'partial')
We see a constant (drift) is included and an ARIMA(1,1,0) model was selected.
fit <- usa_gdp %>%
model(ARIMA(GDP))
report(fit)
## Series: GDP
## Model: ARIMA(1,1,0) w/ drift
##
## Coefficients:
## ar1 constant
## 0.4586 118.2764
## s.e. 0.1198 9.5123
##
## sigma^2 estimated as 5488: log likelihood=-325.37
## AIC=656.74 AICc=657.19 BIC=662.87
We haven’t yet checked whether any differencing is necessary, so I’ll start there. It appears one difference is needed on the Box-Cox transformed data.
ndiffs(usa_gdp$GDP)
## [1] 1
All but one value is within the ACF plot bounds. A check with the KPSS test confirms the data is stationary after one order of differencing.
ggtsdisplay(diff(usa_gdp$GDP))
unitroot_kpss((diff(usa_gdp$GDP)))
## kpss_stat kpss_pvalue
## 0.2082358 0.1000000
usa_gdp_1diff <- usa_gdp %>%
mutate(GDP_diff = difference(GDP))
The same model is selected - unsurprising, since ARIMA(GDP) is still fitted to the original GDP column rather than GDP_diff, and ARIMA() chooses the order of differencing itself anyway.
fit2 <- usa_gdp_1diff %>%
model(ARIMA(GDP))
report(fit2)
## Series: GDP
## Model: ARIMA(1,1,0) w/ drift
##
## Coefficients:
## ar1 constant
## 0.4586 118.2764
## s.e. 0.1198 9.5123
##
## sigma^2 estimated as 5488: log likelihood=-325.37
## AIC=656.74 AICc=657.19 BIC=662.87
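If the goal were to model the manually differenced series itself, the model call would need to point at the new column; in that case ARIMA() should select d = 0. A sketch (output not shown):
#fit to the differenced column instead of the original GDP
usa_gdp_1diff %>%
model(ARIMA(GDP_diff)) %>%
report()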
I’ll try another ARIMA specification to see how it performs. This one performs worse (higher AICc) than the automatically selected model above. I’d imagine that with more experience and background a data scientist may at times find a better model than the automated search.
#model ARIMA with forced 1,1,2
fit_force <- usa_gdp_1diff %>%
model(ARIMA(GDP ~ pdq(1,1,2)))
report(fit_force)
## Series: GDP
## Model: ARIMA(1,1,2) w/ drift
##
## Coefficients:
## ar1 ma1 ma2 constant
## 0.8296 -0.3892 -0.1937 36.5887
## s.e. 0.2039 0.2480 0.1586 4.0053
##
## sigma^2 estimated as 5639: log likelihood=-325.11
## AIC=660.23 AICc=661.4 BIC=670.44
I’ll choose the original model and check how it performs. The histogram is a bit skewed, but the ACF plot looks good, suggesting the residuals are white noise.
gg_tsresiduals(fit)
A visual inspection of the forecast looks quite reasonable; the trend continues.
fit %>%
forecast(h = 10) %>%
autoplot(usa_gdp)
Switching back to the forecast package, I build an ETS model to compare against our ARIMA model. The ETS AIC is much higher, but information criteria are not comparable between ETS and ARIMA models (nor across different orders of differencing), so this isn’t a valid benchmark on its own.
ets_check <- usa_gdp$GDP %>%
ets()
ets_check
## ETS(M,A,N)
##
## Call:
## ets(y = .)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.3705
##
## Initial states:
## l = 7059.7092
## b = 192.8105
##
## sigma: 0.0059
##
## AIC AICc BIC
## 743.5101 744.6639 753.8123
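Since the information criteria can’t settle this, a fairer check is out-of-sample accuracy. A sketch on the Box-Cox transformed scale, with a hypothetical 2010 training cutoff (the Year index carries through from global_economy; output not shown):
#hold out the final years and compare forecast accuracy instead of AIC
usa_train <- usa_gdp %>%
filter(Year <= 2010)
usa_train %>%
model(ets = ETS(GDP), arima = ARIMA(GDP)) %>%
forecast(h = 7) %>%
accuracy(usa_gdp)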