Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
The differences between the three plots are in the heights of the significance bounds (the blue dashed lines) and in which individual autocorrelations come close to them.
The bounds differ because they depend on the length of the series: they are drawn at roughly ±1.96/√T, so the longer the series, the tighter the bounds. The autocorrelations themselves also differ from plot to plot because each is estimated from a different random sample. For example, in the first plot the spike at lag 11 almost reaches 0.3, suggesting some similarity between observations eleven periods apart, but it does not cross the significance bound, so it is still consistent with white noise.
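A small sketch of where those bounds come from, assuming the usual white-noise limits of roughly ±1.96/√T (this calculation is not part of the original output):
# approximate significance bounds for each series length
tibble(n_obs = c(36, 360, 1000), bound = 1.96 / sqrt(n_obs))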
amzn <- gafa_stock %>% filter(Symbol == 'AMZN', year(Date) == 2017)
amzn %>% ACF(Close) %>% autoplot()
For the Amazon closing prices, every lag shown (out to lag 26 and beyond) is statistically significant, with the autocorrelations sitting well above the blue dashed significance bounds and decaying only slowly. For the data to be treated as stationary, essentially all of the autocorrelations should fall inside those bounds. Stationary data have no trend or seasonality: their statistical properties do not depend on when the series is observed. Autocorrelations that stay above the significance bounds indicate that some pattern remains.
This can also be thought of in terms of dependence. In a non-stationary, highly autocorrelated series such as this one, each observation depends strongly on the values several lags back, whereas in a stationary series the observations behave much more like independent draws around a stable mean.
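One formal check of stationarity is the KPSS unit-root test from the feasts package; a small sketch (not part of the original output), where a p-value reported as 0.01 would indicate the series is non-stationary and needs differencing:
amzn %>% features(Close, unitroot_kpss)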
amzn %>% PACF(Close) %>% autoplot()
In an ARIMA model, the AR(p) part regresses the series on its own past values, while the MA(q) part models the current value using past forecast errors. The PACF shows the correlation that remains at each lag after the effects of shorter lags have been removed, so it is mainly used to choose the AR order p; the ACF is the more useful guide for the MA order q.
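For reference, the two components can be written in the book's notation as
\[ y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \cdots + \phi_{p}y_{t-p} + \varepsilon_{t} \]
for the AR(p) part and
\[ y_{t} = c + \varepsilon_{t} + \theta_{1}\varepsilon_{t-1} + \theta_{2}\varepsilon_{t-2} + \cdots + \theta_{q}\varepsilon_{t-q} \]
for the MA(q) part, where the epsilons are white-noise errors.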
Turkish GDP per capita shows steady growth with relatively little variation around the trend. The trend means the data are not stationary, so the series will need to be differenced (and possibly transformed) to become stationary.
turk <- global_economy %>%
filter(Country=='Turkey') %>%
mutate(gdp_per_capita=GDP/Population)
turk %>% gg_tsdisplay(gdp_per_capita, plot_type='partial') +
labs(title = "Turkish GDP Per Capita")
turk %>% features(gdp_per_capita,unitroot_ndiffs)
## # A tibble: 1 × 2
## Country ndiffs
## <fct> <int>
## 1 Turkey 1
The unit root test suggests that the data should be differenced once.
turk %>% gg_tsdisplay(difference(gdp_per_capita), plot_type='partial') +
labs(title = "Differenced Turkish GDP Per Capita")
The differenced data show little variation between observations. The ACF and PACF plots show a significant spike at lag 11. Because the series varies so little, particularly before 1980, differencing a second time would achieve little, since the values up to around 1980 would barely change. A transformation is a better way to remove the remaining autocorrelation and make the data stationary.
lambda <- turk %>%
features(gdp_per_capita, features = guerrero) %>%
pull(lambda_guerrero)
turk %>% gg_tsdisplay(difference(box_cox(gdp_per_capita,lambda)),plot_type = 'partial')+
labs(title = "Differenced Box-Cox Turkish GDP Per Capita")
After a Box-Cox transformation followed by a first difference, the series looks like white noise: no lags reach statistical significance.
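A sketch of how that white-noise claim could be checked formally with a Ljung-Box test (not run here; a large p-value would be consistent with white noise):
# portmanteau test on the differenced Box-Cox series, using 10 lags
turk %>% features(difference(box_cox(gdp_per_capita, lambda)), ljung_box, lag = 10)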
The plot shows a steady upward trend, with the variance of the series growing over time, and the pattern looks strongly seasonal. Both features indicate that the data are not stationary, and the growing variance suggests a transformation will be needed.
accom <- aus_accommodation %>%
filter(State=='Tasmania')
accom %>% gg_tsdisplay(Takings, plot_type='partial') +
labs(title = "Accommodation Takings")
The ACF and PACF plots confirm that the data are not stationary: many lags are statistically significant, with especially strong spikes at multiples of 4, the seasonal period of this quarterly series. The initial impressions of seasonality and non-stationarity are correct, so a seasonal difference at lag 4 will be needed.
accom %>% features(Takings,unitroot_nsdiffs)
## # A tibble: 1 × 2
## State nsdiffs
## <chr> <int>
## 1 Tasmania 1
The unit root test with STL decomposition says that one seasonal difference will make the data stationary; for this quarterly series that is the lag-4 difference taken below, after which the partial autocorrelation function has no lags beyond the significance bounds.
accom %>% gg_tsdisplay(difference(Takings,4), plot_type='partial') +
labs(title = "Differenced Accommodations Takings")
lambda <- accom %>%
features(Takings, features = guerrero) %>%
pull(lambda_guerrero)
accom %>% gg_tsdisplay(difference(box_cox(Takings,lambda),12), plot_type='partial') +
labs(title = "Differenced Box-Cox Accommodations Takings")
Transforming the data with a Box-Cox transformation before differencing does not improve matters, since the partial autocorrelation function still shows a significant spike at lag 4.
The best choice for this data set is simply the seasonal (lag-4) difference of the raw series.
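As a quick check of that choice (a sketch, not part of the output above), the unit-root features can ask whether a further first difference is needed after the lag-4 seasonal difference:
# 0 would mean no additional ordinary differencing is required
accom %>% features(difference(Takings, 4), unitroot_ndiffs)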
Looking at the initial time plot and the ACF and PACF plots, this data set is clearly seasonal. The variance grows with the level of the series, with very large spikes every December (the 12th month of each year). A transformation will be needed before the data can be made stationary.
suvs <- souvenirs
suvs %>% gg_tsdisplay(difference(sqrt(Sales), 12), plot_type = 'partial')
lambda <- suvs %>%
features(Sales, features = guerrero) %>%
pull(lambda_guerrero)
suvs %>% gg_tsdisplay(difference(box_cox(Sales,lambda),12), plot_type='partial') +
labs(title = "Differenced Box-Cox Accommodations Takings")
Transforming the data with a Box-Cox transformation and then taking a lag-12 seasonal difference brings this data set close to stationary, although there is still a significant correlation at lag 2.
A square-root transformation with the same lag-12 seasonal difference brings all of the PACF lags under the significance line, so that seems to be the better option here.
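A sketch (not run in the original) of how the same unit-root features could confirm that choice for the square-root-transformed sales:
# seasonal differences needed for sqrt(Sales), then first differences after the lag-12 difference
suvs %>% features(sqrt(Sales), unitroot_nsdiffs)
suvs %>% features(difference(sqrt(Sales), 12), unitroot_ndiffs)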
For the retail series there is a spike in the ACF at every multiple of 12, which points to monthly seasonality. The data therefore need a seasonal (lag-12) difference, followed by an ordinary first difference, before they resemble white noise.
After the double differencing a few lags are still marginally significant, but overall the series now looks much like white noise.
set.seed(456)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries %>%
gg_tsdisplay(Turnover, plot_type='partial', lag = 72) +
labs(title = "Original Retail Turnover")
myseries %>%
gg_tsdisplay(difference(Turnover,12), plot_type='partial', lag =36) +
labs(title = "Seasonal Differenced(12) Retail Turnover")
myseries %>%
gg_tsdisplay(difference(Turnover,12) %>% difference(), plot_type='partial', lag =36) +
labs(title = "Seasonal Double Differenced Retail Turnover")
## Exercise 9.6
# Simulate an AR(1) process: y[t] = phi * y[t-1] + e[t]
AR_sim <- function(phi) {
data <- tsibble(
date = as.Date("2017-01-01") + 0:99,
y = numeric(100),
e = rnorm(100)
)
for(i in 2:100)
data$y[i] <- phi*data$y[i-1] + data$e[i]
data %>% gg_tsdisplay(y=y,plot_type = 'partial')
}
AR_sim(0.6)
## Using `date` as index variable.
The book states that when phi equals zero and the constant equals zero, the data will look like white noise. When phi equals 1 and the constant equals zero, the data will look like a random walk; when phi equals 1 and the constant is non-zero, the data will look like a random walk with drift. If phi is less than zero, y[t] tends to oscillate around the mean.
For the most part phi acts as a control on how much the series spreads out, and what happens at each step depends on the signs of phi, y[t-1] and e[i]. If all three are positive the swings keep increasing, and the same is true if phi and y[t-1] are both negative while e[i] is positive. If all three are negative the movement is held back somewhat by e[i], and most other sign combinations also rein the series in. In short, the behaviour depends on how the signs line up.
This is why, when phi equals 1 and the constant is non-zero, the series looks like a random walk with drift: each new value is the previous value plus a random e[i] plus a constant push in one direction. It is also why, when phi is zero, y[t] simply equals e[i] and the series just oscillates around the mean.
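For a stationary AR(1) process with no constant and -1 < phi < 1, the link between phi and the spread of the series can be made precise: the variance of the process is
\[ \operatorname{Var}(y_{t}) = \frac{\sigma_{\varepsilon}^{2}}{1-\phi_{1}^{2}} \]
which grows as phi approaches 1 and is no longer finite at phi = 1, the random walk case.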
AR_sim(1)
## Using `date` as index variable.
AR_sim(0)
## Using `date` as index variable.
AR_sim(-.2)
## Using `date` as index variable.
# Simulate an MA(1) process: y[t] = theta * e[t-1] + e[t]
MA_sim <- function(theta) {
data <- tsibble(
date = as.Date("2017-01-01") + 0:99,
y = numeric(100),
e = rnorm(100)
)
for(i in 2:100)
data$y[i] <- theta*data$e[i-1] + data$e[i]
data %>% gg_tsdisplay(y=y,plot_type = 'partial')
}
When theta is greater than 1, observations further back in the time series have a greater influence on the current error. When theta equals 1, the weights are constant in size, so observations at every distance have the same influence. When theta is less than 1, the most recent observations carry the greatest weight.
The overall behaviour is much like the AR case, except that the MA model works on the errors rather than on past values of the series. Only two signs matter here, those of theta and the errors: if the previous error is negative and theta is positive the series is pushed downward, if the previous error is positive and theta is negative the movement is dampened, and if both are positive the series is pushed upward.
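This follows from writing the MA(1) model in its AR(∞) form, where the current error is a weighted sum of past observations:
\[ \varepsilon_{t} = \sum_{j=0}^{\infty}(-\theta_{1})^{j}\,y_{t-j} \]
so the weights die out when the absolute value of theta is less than 1, stay the same size when it equals 1, and grow with the lag when it is greater than 1.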
MA_sim(0.6)
## Using `date` as index variable.
MA_sim(1)
## Using `date` as index variable.
MA_sim(2)
## Using `date` as index variable.
# Simulate an ARMA(1,1) process: y[t] = phi * y[t-1] + theta * e[t-1] + e[t]
ARMA_sim <- function(phi,theta) {
data <- tsibble(
date = as.Date("2017-01-01") + 0:99,
y = numeric(100),
e = rnorm(100)
)
for(i in 2:100)
data$y[i] <- phi*data$y[i-1] + theta*data$e[i-1]+ data$e[i]
data %>% gg_tsdisplay(y=y,plot_type = 'partial')
}
ARMA_sim(.6,.6)
## Using `date` as index variable.
# Simulate an AR(2) process: y[t] = phi1 * y[t-1] + phi2 * y[t-2] + e[t]
AR2_sim <- function(phi1,phi2) {
data <- tsibble(
date = as.Date("2017-01-01") + 0:99,
y = numeric(100),
e = rnorm(100)
)
for(i in 3:100)
data$y[i] <- phi1*data$y[i-1] + phi2*data$y[i-2]+ data$e[i]
data %>% gg_tsdisplay(y=y,plot_type = 'partial')
}
AR2_sim(-.8,.3)
## Using `date` as index variable.
The AR(2) model goes back to what I said at the beginning about independence and dependence: here the model also depends on y[t-2]. With phi1 = -0.8 and phi2 = 0.3, when y[t-2] is positive and y[t-1] is negative, both phi1*y[t-1] and phi2*y[t-2] are positive, producing a large step upward; when the signs are reversed the step is a large one downward. The errors are small, mostly between -1 and 1, so as the series grows they have less and less relative effect, which makes the growth of the oscillations look uniform. For this model with these coefficients, each move is effectively governed by the sign of y[t-2].
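The book's stationarity conditions for an AR(2) model make this behaviour precise:
\[ -1 < \phi_{2} < 1, \qquad \phi_{1} + \phi_{2} < 1, \qquad \phi_{2} - \phi_{1} < 1 \]
With phi1 = -0.8 and phi2 = 0.3, phi2 - phi1 = 1.1 > 1, so the third condition is violated: the simulated series is non-stationary and its oscillations grow without bound.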
In the ARMA(1,1) model, phi acts on the previous value y[t-1] while theta acts on the previous error e[i-1], so even with phi and theta both set to 0.6 the two terms are doing different jobs. Each new value is 60% of the previous value, plus 60% of the previous shock, plus a fresh error term. That fresh, independent error is what keeps the growth rate from being constant and stops the series from settling into a fixed pattern.
aus_air <- aus_airpassengers %>% filter(Year < 2012)
set.seed(19283)
model <- aus_air %>%
model(ARIMA(Passengers))
report(model)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8756
## s.e. 0.0722
##
## sigma^2 estimated as 4.671: log likelihood=-87.8
## AIC=179.61 AICc=179.93 BIC=182.99
model <- aus_air %>% model(ARIMA(Passengers ))
model %>% forecast(h=10) %>% autoplot(aus_airpassengers)
model %>% gg_tsresiduals()
\[ y''_{t} = \varepsilon_{t} - 0.8756\,\varepsilon_{t-1} \]
where y'' denotes the twice-differenced series; in backshift notation,
\[ (1-B)^{2}y_{t} = (1 - 0.8756B)\,\varepsilon_{t} \]
First, the ARIMA() function handles the constant automatically: when d = 1 an included constant is equivalent to drift, so there is no need to specify drift manually.
The first ARIMA model over-predicted 2012, while the second model under-predicted it. The prediction intervals also differ: the first model's intervals widen along a smooth, steeper path, while the lower bound of the second model's interval curves slightly downward before turning upward.
The residuals also behave differently. The first model's first two residuals are close to zero, while the second model's first residual is close to zero and its second is below. In the histogram of the residuals, the first model's values appear to sit in a slightly tighter range.
model2 <- aus_air %>% model(ARIMA(Passengers ~ pdq(0,1,0)))
model2 %>% forecast(h=10) %>%
autoplot(aus_airpassengers)
model2 %>% gg_tsresiduals()
This model's predictions fall between those of the first two models, and its residuals look more like the second model's.
The first model is still the best one so far in terms of how well it fits.
Removing the constant results in a NULL model. The likely reason is that include.constant is an argument of forecast::Arima(), not of fable's ARIMA(), so the call below is not a valid fable specification.
model3 <- aus_air %>% model(ARIMA(Passengers ~ 1+ pdq(2,1,2)))
model3 %>% forecast(h=10) %>%
autoplot(aus_airpassengers)
model3 %>% gg_tsresiduals()
model3_noC <- aus_air %>% model(ARIMA(Passengers ~ pdq(2,1,2),include.constant=FALSE))
model3_noC
## # A mable: 1 x 1
## `ARIMA(Passengers ~ pdq(2, 1, 2), include.constant = FALSE)`
## <model>
## 1 <NULL model>
This has the same effect as the previous attempt at removing the constant: the result is again a NULL model rather than a fitted ARIMA.
model4 <- aus_air %>% model(ARIMA(Passengers ~ pdq(0,2,1),include.constant=TRUE))
model4
## # A mable: 1 x 1
## `ARIMA(Passengers ~ pdq(0, 2, 1), include.constant = TRUE)`
## <model>
## 1 <NULL model>
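Both NULL models most likely arise because include.constant belongs to forecast::Arima() rather than fable's ARIMA(); in fable the constant is controlled through the model formula instead. A sketch of that syntax (not re-run here):
# drop the constant by adding 0 to the formula (hypothetical re-specification)
aus_air %>% model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))
# include it explicitly with 1 (fable may warn that d = 2 plus a constant implies a quadratic trend)
aus_air %>% model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))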
The first transformation is to put GDP on a per-capita basis, which makes the series comparable with other countries. From the look of the data a full Box-Cox transformation isn't necessary; a simple log transformation should suffice.
The data show a steady growth trend and are not stationary, so they will need to be differenced.
gdp <- global_economy %>% filter(Country=="United States") %>% mutate(gdp_per_capita = GDP/Population)
gdp %>% gg_tsdisplay(gdp_per_capita)
log_gdp <- gdp %>% mutate(gdp_per_capita = log(gdp_per_capita))
log_gdp %>% gg_tsdisplay(gdp_per_capita)
log_gdp %>% features(gdp_per_capita, unitroot_ndiffs)
## # A tibble: 1 × 2
## Country ndiffs
## <fct> <int>
## 1 United States 2
log_gdp_d <- log_gdp %>% mutate(gdp_diff = difference(gdp_per_capita,1))
log_gdp_d %>% gg_tsdisplay(gdp_diff,plot_type = 'partial')
## Warning: Removed 1 row containing missing values (`geom_line()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
log_gdp_d2 <- log_gdp %>% mutate(gdp_diff = difference(gdp_per_capita,2))
log_gdp_d2 %>% gg_tsdisplay(gdp_diff,plot_type = 'partial')
## Warning: Removed 2 rows containing missing values (`geom_line()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
The unitroot_ndiffs feature suggests two differences to make the data stationary. With two differences there is a significant spike at lag 3, while a single difference leaves no significant lags beyond lag 1.
gdp_model <- log_gdp_d %>% model(ARIMA(gdp_diff))
gdp_model
## # A mable: 1 x 2
## # Key: Country [1]
## Country `ARIMA(gdp_diff)`
## <fct> <model>
## 1 United States <ARIMA(0,1,1)>
gdp_model %>% gg_tsresiduals()
gdp_model2 <- log_gdp_d2 %>% model(ARIMA(gdp_diff))
gdp_model2
## # A mable: 1 x 2
## # Key: Country [1]
## Country `ARIMA(gdp_diff)`
## <fct> <model>
## 1 United States <ARIMA(0,1,0)>
gdp_model2 %>% gg_tsresiduals()
log_gdp_d3 <- log_gdp %>% mutate(gdp_diff = difference(gdp_per_capita,3))
log_gdp_d3 %>% gg_tsdisplay(gdp_diff,plot_type = 'partial')
gdp_model3 <- log_gdp_d3 %>% model(ARIMA(gdp_diff))
gdp_model3
## # A mable: 1 x 2
## # Key: Country [1]
## Country `ARIMA(gdp_diff)`
## <fct> <model>
## 1 United States <ARIMA(0,1,1)>
gdp_model3 %>% gg_tsresiduals()
gdp_models <- log_gdp %>%
model(one = ARIMA(gdp_per_capita ~ 1 + pdq(0,1,1)),
two = ARIMA(gdp_per_capita ~ pdq(0,2,0)),
three= ARIMA(gdp_per_capita ~ 1 + pdq(1,1,1)))
glance(gdp_models) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 3 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 three 0.000411 143. -277. -276. -269.
## 2 one 0.000536 135. -263. -263. -257.
## 3 two 0.000520 132. -263. -263. -261.
The model that I chose was model three because it has the highest log likelihood score and the lowest AIC, AICc and BIC scores. The sigma2 score is also the lowest.
gdp_models %>%
forecast(h=10) %>%
filter(.model=='three') %>%
autoplot(log_gdp)
The ARIMA model's forecast has a narrower 95% prediction interval, and its mean forecast line is not as steep as the ETS model's. The ARIMA model appears more confident, with its forecasts in a tighter range, but this is partly because the two models are working on different scales: the ETS model is fitted to GDP in raw dollars, while the ARIMA model is fitted to logged per-capita values in the ones and tens.
ets <- gdp %>%
model(ETS(GDP))
report(ets)
## Series: GDP
## Model: ETS(M,A,N)
## Smoothing parameters:
## alpha = 0.9990876
## beta = 0.5011949
##
## Initial states:
## l[0] b[0]
## 448093333334 64917355687
##
## sigma^2: 7e-04
##
## AIC AICc BIC
## 3190.787 3191.941 3201.089
ets %>%
forecast(h=10) %>%
autoplot(gdp)
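One way to put the two models on a more comparable footing (a hypothetical sketch, not part of the analysis above) would be to fit the ETS model to the same logged per-capita series used for the ARIMA model, so both sets of prediction intervals are on the same scale:
# Hypothetical: ETS on log GDP per capita, for a like-for-like comparison
ets_log <- log_gdp %>% model(ETS(gdp_per_capita))
ets_log %>% forecast(h = 10) %>% autoplot(log_gdp)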