library(forecast)
library(fpp3)
library(zoo)
set.seed(314159)
DATA 624: Assignment 06
Setup
Assignment 06
Do the exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8 in Hyndman. Please submit both the Rpubs link as well as your .rmd file.
Exercise 9.1
Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
- Explain the differences among these figures. Do they all indicate that the data are white noise?
No, they don’t suggest the same amount of correlation, as we can see from the range of the ACF spikes. The ACF for the 36 random numbers shows autocorrelations ranging over roughly [-0.5, 0.5], while the ACF for the 360 random numbers shows much weaker correlation, roughly [-0.1, 0.1], and the 1,000 random numbers show even less. The ACF for 36 random numbers does not look much like white noise; it suggests fairly strong correlation, which is not what we expect from a white noise series. The ACF for 360 random numbers suggests only weak correlation, which is more consistent with white noise, and the ACF for 1,000 random numbers shows even weaker correlation, which is exactly what we would expect from a white noise series.
- Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?
The critical values are different distances from the mean because of the way the critical threshold is typically calculated. The critical threshold is typically calculated as \(\pm 2/\sqrt{n}\), where \(n\) is the sample size. Which, for our three samples, means that:
- For a sample of 36 numbers, the critical threshold is \(\pm 2/\sqrt{36} = \pm 0.3333\).
- For a sample of 360 numbers, the critical threshold is \(\pm 2/\sqrt{360} = \pm 0.1054\).
- For a sample of 1,000 numbers, the critical threshold is \(\pm 2/\sqrt{1000} = \pm 0.0632\).
As the sample size increases, the critical threshold approaches zero.
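Just to make those numbers concrete, here is a quick computation of the \(\pm 2/\sqrt{n}\) bounds (my own check, not part of the original assignment code):

# approximate 95% critical bounds for the ACF of a white noise series
n <- c(36, 360, 1000)
round(2 / sqrt(n), 4)
# 0.3333 0.1054 0.0632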
The reason the autocorrelations differ is, I believe, effectively the law of large numbers. As the sample size increases, the sample estimates converge to the population values, and for white noise the population autocorrelation at every lag is zero. For the ACF, the “sample” at lag \(k\) is the set of \(n - k\) products \((x_t - \bar{x})(x_{t+k} - \bar{x})\). With only 36 numbers there are very few such products at each lag, and at the longer lags (say, lag 20) hardly any, so the chance of stumbling on apparent positive or negative correlation is fairly high. Let’s look at the equation:
\[ \text{ACF}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} \]
Let’s let R generate two random numbers that we shall consider to be a single lag and calculate its ACF.
r <- rnorm(2)
rm <- mean(r)
print((r[1] - rm) * (r[2] - rm) / ((r[1] - rm)^2 + (r[2] - rm)^2))
[1] -0.5
Our ACF for lag 1 is -0.5. (In fact, with only two points the deviations from the mean are equal and opposite, so the lag-1 autocorrelation is always exactly -0.5, regardless of the values drawn.)
For a sample of 360 numbers, the ACF has many more products to work with at every lag, and the odds of the random numbers happening to line up into an apparently correlated sequence are much lower than for 36 numbers. For a sample of 1,000 numbers, the odds are lower still. So, as the sample size increases, the sample ACF converges to zero.
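To see the law of large numbers at work, here is a small simulation of my own (not in the original assignment): draw white noise of each length many times and look at the spread of the lag-1 sample autocorrelation. The spread shrinks roughly like \(1/\sqrt{n}\).

# spread of the lag-1 sample autocorrelation for white noise of each length
# (results vary from run to run; the point is the shrinking standard deviation)
sapply(c(36, 360, 1000), function(n) {
  r1 <- replicate(2000, {
    x <- rnorm(n)
    acf(x, lag.max = 1, plot = FALSE)$acf[2]
  })
  sd(r1)
})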
Exercise 9.2
A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
Let’s have a look at our data first. We are getting warnings below that are saying, “…the provided data has an irregular interval.” Which I imagine is due to missing weekends and holidays. And the ACF plot is bonkers.
gafa_stock |>
  filter(Symbol == "AMZN") |>
  head(n = 10)
# A tsibble: 10 x 8 [!]
# Key: Symbol [1]
Symbol Date Open High Low Close Adj_Close Volume
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AMZN 2014-01-02 399. 399. 394. 398. 398. 2137800
2 AMZN 2014-01-03 398. 403. 396. 396. 396. 2210200
3 AMZN 2014-01-06 396. 397 388. 394. 394. 3170600
4 AMZN 2014-01-07 395. 398. 394. 398. 398. 1916000
5 AMZN 2014-01-08 398. 403 396. 402. 402. 2316500
6 AMZN 2014-01-09 404. 407. 398. 401. 401. 2103000
7 AMZN 2014-01-10 403. 404. 394. 398. 398. 2679500
8 AMZN 2014-01-13 398. 400. 388. 391. 391. 2844900
9 AMZN 2014-01-14 392. 399. 391. 398. 398. 2340100
10 AMZN 2014-01-15 399. 399. 393. 396. 396. 2678300
Let’s experiment. Let’s create a parallel imputed timeseries and see if we get different results.
amzn <- gafa_stock |>
  filter(Symbol == "AMZN")

# Create a complete sequence of dates from the min to the max in the data
dates <- seq(min(amzn$Date), max(amzn$Date), by = "day")

# Merge with our amzn data to fill in missing dates
amzn <- merge(amzn, data.frame(Date = dates), all = TRUE)

# And now we shall interpolate missing values using LOCF
amzn <- amzn |>
  arrange(Date) |>
  na.locf()

# Huh, that worked. But our dataset is a data.frame. Let's convert it.
amzn <- as_tsibble(amzn, index = Date)
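As an aside, an alternative to imputing calendar days is the approach fpp3 itself tends to use for stock data: re-index by trading day so the tsibble becomes regular without inventing weekend prices. A sketch of my own (not what was run above):

# re-index the AMZN series by trading day instead of calendar date
amzn_trading <- gafa_stock |>
  filter(Symbol == "AMZN") |>
  mutate(trading_day = row_number()) |>
  update_tsibble(index = trading_day, regular = TRUE)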
Plot the original data
gafa_stock |>
  filter(Symbol == "AMZN") |>
  ggplot(aes(x = Date, y = Close)) +
  geom_line() +
  labs(
    title = "Daily Closing Prices for Amazon Stock",
    x = "Date",
    y = "Closing Price"
  )
Plot the ACF for the original data
gafa_stock |>
  filter(Symbol == "AMZN") |>
  ACF(Close) |>
  autoplot() +
  labs(
    title = "ACF for Daily Closing Prices for Amazon Stock"
  )
Warning: Provided data has an irregular interval, results should be treated
with caution. Computing ACF by observation.
That is bonkers. Either we have total correlation at every lag or something is awry. Let’s try plotting the ACF for our imputed data. Oh, wait! It’s not 1.0 across the board; it descends from 1.0 at lag 1 to ~0.85 at lag 30. To say the least, the ACF is well above the significance line at every lag and decays only very slowly. This slow decay suggests that the series is non-stationary and should be differenced.
Before we realized that the ACF wasn’t bonkers, we had pursued imputation to see if we got different results. If there are differences between the non-imputed ACF and the imputed ACF, they are too small for the plot to reveal.
amzn |>
  ACF(Close) |>
  autoplot() +
  labs(
    title = "ACF for Daily Closing Prices for Amazon Stock (Imputed)"
  )
The PACF was our first reassurance that the ACF was not bonkers. Let’s have a look at it for the original data.
gafa_stock |>
  filter(Symbol == "AMZN") |>
  PACF(Close) |>
  autoplot() +
  labs(
    title = "PACF for Daily Closing Prices for Amazon Stock"
  )
Warning: Provided data has an irregular interval, results should be treated
with caution. Computing ACF by observation.
Ahhhh, when we remove the intervening effects of the earlier lags, the plot reveals which lags are significant in their own right. The PACF, too, suggests that the series is non-stationary and should be differenced.
And now the PACF with our imputed version:
amzn |>
  PACF(Close) |>
  autoplot() +
  labs(
    title = "PACF for Daily Closing Prices for Amazon Stock (Imputed)"
  )
We do get slightly different results with the imputed data, especially at lag 2, which makes sense: with LOCF, at least three consecutive days out of every seven (Friday, Saturday and Sunday) have identical closing prices.
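As a quick follow-up of my own (not part of the assignment answer), differencing the imputed closing price once and re-plotting the ACF should collapse the slow decay, which is exactly what “should be differenced” predicts:

# ACF of the first-differenced (imputed) closing price
amzn |>
  mutate(diff_close = difference(Close)) |>
  ACF(diff_close) |>
  autoplot() +
  labs(title = "ACF for Differenced Amazon Closing Prices (Imputed)")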
Exercise 9.3
For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.
- Turkish GDP from global_economy.
data <- global_economy |>
  filter(Code == "TUR")
Let’s have a look at him.
data |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "Turkish GDP",
    x = "Year",
    y = "GDP"
  )
He’s pretty exponential. Let’s give the guerrero method a try.
lambda <- data |>
  features(GDP, guerrero)

data <- data |>
  mutate(GDP = box_cox(GDP, lambda$lambda_guerrero))
Let’s see how we did.
data |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "Turkish GDP",
    x = "Year",
    y = "GDP"
  )
Not bad. It doesn’t look like there is any rhyme or reason to his ups and downs; nonetheless there is a strong, linear trend. Let’s apply a first difference to see if we can make him stationary.
data <- data |>
  mutate(GDP = difference(GDP, lag = 1))
Let’s see how we did.
data |>
  filter(!is.na(GDP)) |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "Differenced Turkish GDP",
    x = "Year",
    y = "GDP"
  )
Looks like we are getting closer. Let’s check the ACF to see whether it passes the white noise test.
data |>
  ACF(GDP) |>
  autoplot() +
  labs(
    title = "ACF for Turkish GDP"
  )
We are below the significance line at every lag, but only just. Lag 4 is a little bothersome, but I don’t think any further differencing is going to reduce it. We could try a seasonal difference, but the data is annual, so it doesn’t make sense, and the mild correlation at lag 4 isn’t enough to justify one. It looks like we have a stationary series: a Box-Cox transformation followed by one first difference does the job. Let’s move on; there are a lot of questions below.
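As a sanity check of my own (not part of the original run), we can ask the unit-root features how many ordinary differences the transformed series needs, and whether the differenced series passes a KPSS test:

# how many first differences does the Box-Cox transformed Turkish GDP need?
global_economy |>
  filter(Code == "TUR") |>
  mutate(GDP_bc = box_cox(GDP, lambda$lambda_guerrero)) |>
  features(GDP_bc, unitroot_ndiffs)

# KPSS test on the differenced series held in `data` at this point
data |>
  filter(!is.na(GDP)) |>
  features(GDP, unitroot_kpss)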
- Accommodation takings in the state of Tasmania from aus_accommodation.
data <- aus_accommodation |>
  filter(State == "Tasmania")
Let’s see what he looks like.
data |>
  ggplot(aes(x = Date, y = Takings)) +
  geom_line() +
  labs(
    title = "Accommodation Takings in Tasmania"
  )
The trend is slightly sinusoidal, but there is a strong linear trend overall. We don’t need an ACF plot to tell us that there is a very strong annual seasonal component. So, we need to apply differencing to remove the trend and the seasonal pattern, so that the mean and autocovariance are constant over time.
data <- data |>
  mutate(Takings = difference(Takings, lag = 1)) |>
  filter(!is.na(Takings))
Now let’s deal with the seasonality. The data is quarterly, and the seasonality looks like it is yearly, so let’s apply a seasonal difference with lag 4.
data <- data |>
  mutate(Takings = difference(Takings, lag = 4)) |>
  filter(!is.na(Takings))
Let’s see how the trend line looks now.
data |>
  ggplot(aes(x = Date, y = Takings)) +
  geom_line() +
  labs(
    title = "Differenced Accommodation Takings in Tasmania"
  )
That looks pretty good. Let’s look at the ACF.
data |>
  ACF(Takings) |>
  autoplot() +
  labs(
    title = "ACF for Differenced Accommodation Takings in Tasmania"
  )
Hmmm, it’s showing some correlation at lags 2 and 3. Let’s run a KPSS test to see if we are stationary.
data |>
  features(Takings, unitroot_kpss)
# A tibble: 1 × 3
State kpss_stat kpss_pvalue
<chr> <dbl> <dbl>
1 Tasmania 0.0467 0.1
He says we are stationary: the p-value is reported at its upper bound of 0.1, so we cannot reject the null hypothesis of stationarity. Let’s move on.
- Monthly sales from souvenirs.
Let’s see what he looks like.
souvenirs |>
  ggplot(aes(x = Month, y = Sales)) +
  geom_line() +
  labs(
    title = "Sales from Souvenirs",
    x = "Month",
    y = "Sales"
  )
This one will benefit from a Box-Cox transformation for two reasons:
- It has a slight exponential trend.
- It has a strong seasonal component that is growing at an exponential rate.
Let’s apply the guerrero method to find a suitable Box-Cox transformation.
lambda <- souvenirs |>
  features(Sales, guerrero)

data <- souvenirs |>
  mutate(Sales = box_cox(Sales, lambda$lambda_guerrero))
Let’s have a look.
data |>
  ggplot(aes(x = Month, y = Sales)) +
  geom_line() +
  labs(
    title = "Sales from Souvenirs (Box-Cox)",
    x = "Month",
    y = "Sales"
  )
guerrero did a pretty good job, but there is still a trend. Let’s apply a first difference to see if we can level him out.
data <- data |>
  mutate(Sales = difference(Sales, lag = 1)) |>
  filter(!is.na(Sales))
One more look (hoping):
data |>
  ggplot(aes(x = Month, y = Sales)) +
  geom_line() +
  labs(
    title = "Differenced Sales from Souvenirs",
    x = "Month",
    y = "Sales"
  )
That looks much better! Now to tackle the seasonality. The data is monthly and the seasonality looks yearly, so let’s apply a seasonal difference with lag 12.
data <- data |>
  mutate(Sales = difference(Sales, lag = 12)) |>
  filter(!is.na(Sales))
Let’s plot an ACF and see how it looks.
data |>
  ACF(Sales) |>
  autoplot() +
  labs(
    title = "ACF for Differenced Sales from Souvenirs"
  )
Hmmm, there is a fair amount of negative correlation (~ -0.4) at lag 1, which can be a sign of over-differencing. Once again, let’s run a KPSS test to see if we are stationary.
data |>
  features(Sales, unitroot_kpss)
# A tibble: 1 × 2
kpss_stat kpss_pvalue
<dbl> <dbl>
1 0.0381 0.1
And KPSS says that we are stationary. Let’s move on.
Exercise 9.5
For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.
Hey, I remember this dataset! Let’s see what it looks like.
data <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
data |>
  ggplot(aes(x = Month, y = Turnover)) +
  geom_line() +
  labs(
    title = "Retail Turnover",
    x = "Month",
    y = "Turnover"
  )
Oh, this line looks a little reptilian. Its trend does have an upward curve, but it’s not strongly exponential: in the beginning it’s linear, then it levels off, then rises more quickly, and towards the end it fires its boosters and is off. Let’s give guerrero a try.
lambda <- data |>
  features(Turnover, guerrero)

print(lambda$lambda_guerrero)
[1] 0.08473322
data <- data |>
  mutate(Turnover = box_cox(Turnover, lambda$lambda_guerrero))
It also detected an exponential tendency: the lambda is small (~0.085), which puts the transformation close to a log. Let’s see how he did.
data |>
  ggplot(aes(x = Month, y = Turnover)) +
  geom_line() +
  labs(
    title = "Retail Turnover (Box-Cox)",
    x = "Month",
    y = "Turnover"
  )
Wow, guerrero did a good job of straightening out the line and of making the variability constant. But we still have a linear trend. Let’s apply a first difference to see if we can get him stationary.
data <- data |>
  mutate(Turnover = difference(Turnover, lag = 1)) |>
  filter(!is.na(Turnover))
data |>
  ggplot(aes(x = Month, y = Turnover)) +
  geom_line() +
  labs(
    title = "Differenced Retail Turnover",
    x = "Month",
    y = "Turnover"
  )
It’s certainly level. The seasonality has been somewhat obscured, but we know it’s still there. Let’s try a seasonal difference to deal with it.
data <- data |>
  mutate(Turnover = difference(Turnover, lag = 12)) |>
  filter(!is.na(Turnover))
Let’s look at it through the KPSS test.
data |>
  features(Turnover, unitroot_kpss)
# A tibble: 1 × 4
State Industry kpss_stat kpss_pvalue
<chr> <chr> <dbl> <dbl>
1 South Australia Food retailing 0.0114 0.1
The p-value is reported at the KPSS test’s upper bound of 0.1, which means the null hypothesis of stationarity cannot be rejected. So, we are stationary.
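As a cross-check (not part of the run above), the unit-root features can suggest the differencing orders directly. A minimal sketch, where retail_bc is a hypothetical copy of the sampled series saved right after the Box-Cox step and before any differencing:

# `retail_bc`: hypothetical copy of the transformed (but undifferenced) series
retail_bc |>
  features(Turnover, unitroot_nsdiffs)   # suggested number of seasonal differences

retail_bc |>
  features(Turnover, unitroot_ndiffs)    # suggested number of ordinary differences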
Exercise 9.6
Simulate and plot some data from simple ARIMA models.
- Use the following R code to generate data from an AR(1) model with \(\phi_1 = 0.6\) and \(\sigma^2 = 1\). The process starts with \(y_1 = 0\).
run_sim_ar <- function(phi = 0.6, sigma_sq = 1, e = NULL) {
  y <- numeric(100)
  if(is.null(e))
    e <- rnorm(100, sd = sqrt(sigma_sq))
  for(i in 2:100)
    y[i] <- phi*y[i-1] + e[i]
  return(tsibble(idx = seq_len(100), y = y, index = idx))
}
- Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?
# let's use the same random error for all series so that we isolate the effect of phi
e <- rnorm(100)

df <- data.frame(
  idx = run_sim_ar()$idx,
  `phi=0.0` = run_sim_ar(phi = 0.0, e = e)$y,
  `phi=0.2` = run_sim_ar(phi = 0.2, e = e)$y,
  `phi=0.4` = run_sim_ar(phi = 0.4, e = e)$y,
  `phi=0.6` = run_sim_ar(phi = 0.6, e = e)$y,
  `phi=0.8` = run_sim_ar(phi = 0.8, e = e)$y,
  `phi=0.9` = run_sim_ar(phi = 0.9, e = e)$y
)

df |>
  pivot_longer(cols = -idx, names_to = "phi", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = phi)) +
  geom_line() +
  labs(
    title = "AR(1) Model",
    x = "Time",
    y = "Value"
  ) +
  facet_wrap(~phi, ncol = 2)
The series wanders further from zero as phi approaches 1.0. We used the same error component for each panel so that we could isolate the effect of phi. As phi approaches 1.0, each value depends more strongly on the previous value, so the series becomes smoother and more persistent. The series with phi = 0.0 looks an awful lot like noise, while the series with phi = 0.9 appears to drift in the negative direction. Let’s look at the means.
summary(df)
idx phi.0.0 phi.0.2 phi.0.4
Min. : 1.00 Min. :-2.30219 Min. :-2.6750 Min. :-3.0555
1st Qu.: 25.75 1st Qu.:-0.79987 1st Qu.:-0.7021 1st Qu.:-0.7894
Median : 50.50 Median :-0.02475 Median :-0.1176 Median :-0.1989
Mean : 50.50 Mean :-0.08163 Mean :-0.1009 Mean :-0.1305
3rd Qu.: 75.25 3rd Qu.: 0.53431 3rd Qu.: 0.5514 3rd Qu.: 0.5666
Max. :100.00 Max. : 2.22818 Max. : 2.3886 Max. : 2.5302
phi.0.6 phi.0.8 phi.0.9
Min. :-3.6765 Min. :-4.5943 Min. :-6.1635
1st Qu.:-0.8831 1st Qu.:-1.1916 1st Qu.:-1.7654
Median :-0.1505 Median :-0.1036 Median :-0.2752
Mean :-0.1830 Mean :-0.3228 Mean :-0.6222
3rd Qu.: 0.6326 3rd Qu.: 0.7204 3rd Qu.: 0.9378
Max. : 2.8152 Max. : 3.2870 Max. : 3.3500
And the means confirm that, for this particular error sequence, the series drifts further in the negative direction as phi approaches 1.0.
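One way to quantify the widening spread (my own aside, not part of the assignment): for a stationary AR(1), the process variance is \(\sigma^2 / (1 - \phi_1^2)\), so the spread grows quickly as \(\phi_1\) approaches 1.

# theoretical variance of a stationary AR(1) with sigma^2 = 1
phi <- c(0.0, 0.2, 0.4, 0.6, 0.8, 0.9)
round(1 / (1 - phi^2), 2)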
- Write your own code to generate data from an MA(1) model with \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
run_sim_ma <- function(theta = 0.6, sigma_sq = 1, e = NULL) {
  y <- numeric(100)
  if(is.null(e))
    e <- rnorm(100, sd = sqrt(sigma_sq))
  for(i in 2:100)
    y[i] <- e[i] + theta*e[i-1]
  return(tsibble(idx = seq_len(100), y = y, index = idx))
}
- Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?
# let's use the same random error for all series so that we isolate the effect of theta
e <- rnorm(100)

df <- data.frame(
  idx = run_sim_ma()$idx,
  `theta=0.0` = run_sim_ma(theta = 0.0, e = e)$y,
  `theta=0.2` = run_sim_ma(theta = 0.2, e = e)$y,
  `theta=0.4` = run_sim_ma(theta = 0.4, e = e)$y,
  `theta=0.6` = run_sim_ma(theta = 0.6, e = e)$y,
  `theta=0.8` = run_sim_ma(theta = 0.8, e = e)$y,
  `theta=0.9` = run_sim_ma(theta = 0.9, e = e)$y
)

df |>
  pivot_longer(cols = -idx, names_to = "theta", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = theta)) +
  geom_line() +
  labs(
    title = "MA(1) Model",
    x = "Time",
    y = "Value"
  ) +
  facet_wrap(~theta, ncol = 2)
Again, we used a common error sequence for all of the thetas so that we could isolate the effect of theta. I was about to say that it looks like an amplifier, but that’s not entirely correct. Or maybe it is correct and just not a very faithful amplifier. Let’s try superimposing the plots and see if that helps make the distortion more clear.
df |>
  pivot_longer(cols = -idx, names_to = "theta", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = theta)) +
  geom_line() +
  labs(
    title = "MA(1) Model",
    x = "Time",
    y = "Value"
  )
Not only is it a very cool plot, but it does a good job of showing the distortion. It’s most pronounced at the peaks and valleys: the higher the theta, the higher the peaks and the lower the valleys.
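That matches the theory (my aside, not in the assignment): an MA(1) process has variance \((1 + \theta_1^2)\sigma^2\), so a larger theta stretches the peaks and valleys.

# theoretical variance of an MA(1) with sigma^2 = 1
theta <- c(0.0, 0.2, 0.4, 0.6, 0.8, 0.9)
round(1 + theta^2, 2)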
- Generate data from an ARMA(1,1) model with \(\phi_1 = 0.6\), \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
run_sim_arma <- function(phi = 0.6, theta = 0.6, sigma_sq = 1, e = NULL) {
  y <- numeric(100)
  if(is.null(e))
    e <- rnorm(100, sd = sqrt(sigma_sq))
  for(i in 2:100)
    y[i] <- phi*y[i-1] + e[i] + theta*e[i-1]
  return(tsibble(idx = seq_len(100), y = y, index = idx))
}
This will be interesting to see. Let’s use the same trick of using the same error series for all of the plots so that we can isolate the effects of phi and theta.
e <- rnorm(100)

df <- data.frame(
  idx = run_sim_arma()$idx,
  `phi=0.0, theta=0.0` = run_sim_arma(phi = 0.0, theta = 0.0, e = e)$y,
  `phi=0.2, theta=0.2` = run_sim_arma(phi = 0.2, theta = 0.2, e = e)$y,
  `phi=0.4, theta=0.4` = run_sim_arma(phi = 0.4, theta = 0.4, e = e)$y,
  `phi=0.6, theta=0.6` = run_sim_arma(phi = 0.6, theta = 0.6, e = e)$y,
  `phi=0.8, theta=0.8` = run_sim_arma(phi = 0.8, theta = 0.8, e = e)$y,
  `phi=1.0, theta=1.0` = run_sim_arma(phi = 1.0, theta = 1.0, e = e)$y
)

df |>
  pivot_longer(cols = -idx, names_to = "phi_theta", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = phi_theta)) +
  geom_line() +
  labs(
    title = "ARMA(1,1) Model",
    x = "Time",
    y = "Value"
  ) +
  facet_wrap(~phi_theta, ncol = 2)
The effects of phi are way more obvious than the effects of theta. It steals the show.
- Generate data from an AR(2) model with \(\phi_1 = 0.6\), \(\phi_2 = -0.8\) and \(\sigma^2 = 1\). (Note that these parameters will give a non-stationary series.)
run_sim_ar2 <- function(phi1 = 0.6, phi2 = -0.8, sigma_sq = 1, e = NULL) {
  y <- numeric(100)
  if(is.null(e))
    e <- rnorm(100, sd = sqrt(sigma_sq))
  for(i in 3:100) {
    y[i] <- phi1*y[i-1] + phi2*y[i-2] + e[i]
  }
  return(tsibble(idx = seq_len(100), y = y, index = idx))
}
Oh, we are plotting for question g. We won’t spoil the surprise. But we will have a peek using our scheme above. We shall use the following to determine the value of phi2: \(\phi_2=\frac{\phi_1}{2}\).
# We will use the same error component as we did for the ARMA(1,1) model
# Let's set phi2 = phi1/2.
df <- data.frame(
  idx = run_sim_ar2()$idx,
  `phi1=0.0, phi2=0.0` = run_sim_ar2(phi1 = 0.0, phi2 = 0.0, e = e)$y,
  `phi1=0.2, phi2=0.1` = run_sim_ar2(phi1 = 0.2, phi2 = 0.1, e = e)$y,
  `phi1=0.4, phi2=0.2` = run_sim_ar2(phi1 = 0.4, phi2 = 0.2, e = e)$y,
  `phi1=0.6, phi2=0.3` = run_sim_ar2(phi1 = 0.6, phi2 = 0.3, e = e)$y,
  `phi1=0.8, phi2=0.4` = run_sim_ar2(phi1 = 0.8, phi2 = 0.4, e = e)$y
)

df |>
  pivot_longer(cols = -idx, names_to = "phi", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = phi)) +
  geom_line() +
  labs(
    title = "AR(2) Model (phi2 = phi1/2)",
    x = "Time",
    y = "Value"
  ) +
  facet_wrap(~phi, ncol = 2, scales = "free_y")
Pretty crazy. It doesn’t take much to push the formula over the edge: at phi1 = 0.8, phi2 = 0.4 we have phi1 + phi2 = 1.2, which violates the AR(2) stationarity condition phi1 + phi2 < 1 (see the quick check below).
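Here is a small helper of my own (not part of the assignment) that checks the textbook AR(2) stationarity conditions for the pairs used above:

# stationarity requires phi1 + phi2 < 1, phi2 - phi1 < 1, and |phi2| < 1
ar2_stationary <- function(phi1, phi2) {
  (phi1 + phi2 < 1) && (phi2 - phi1 < 1) && (abs(phi2) < 1)
}
ar2_stationary(0.6, 0.3)  # TRUE  - still stationary
ar2_stationary(0.8, 0.4)  # FALSE - phi1 + phi2 = 1.2, over the edge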
- Graph the latter two series and compare them.
e <- rnorm(100)

df <- data.frame(
  idx = run_sim_ar2()$idx,
  e = e,
  `AR(2)` = run_sim_ar2(phi1 = 0.6, phi2 = -0.8, e = e)$y,
  `ARMA(1,1)` = run_sim_arma(phi = 0.6, theta = 0.6, e = e)$y
)

df |>
  pivot_longer(cols = -idx, names_to = "name", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = name)) +
  geom_area(mapping = aes(fill = name), alpha = 0.2) +
  labs(
    title = "AR(2) vs ARMA(1,1)",
    x = "Time",
    y = "Value"
  )
summary(df)
idx e AR.2. ARMA.1.1.
Min. : 1.00 Min. :-2.16953 Min. :-3.10514 Min. :-3.3733
1st Qu.: 25.75 1st Qu.:-0.66048 1st Qu.:-1.10201 1st Qu.:-1.1667
Median : 50.50 Median :-0.10414 Median :-0.20388 Median :-0.4256
Mean : 50.50 Mean :-0.09002 Mean :-0.06205 Mean :-0.4002
3rd Qu.: 75.25 3rd Qu.: 0.37771 3rd Qu.: 1.12483 3rd Qu.: 0.4540
Max. :100.00 Max. : 1.85255 Max. : 2.64083 Max. : 1.8936
Once again the AR series is biased, but not as much as it was with AR(1). Where AR(2) pushes the peaks and valleys to the extreme, ARMA(1,1) has a smoothing effect around 0. AR(2)’s mean is closer to 0 than e’s, even though its max and min are further from 0 than e’s.
Exercise 9.7
Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.
aus_airpassengers |>
  ggplot(mapping = aes(x = Year, y = Passengers)) +
  geom_line() +
  labs(
    title = "Total Number of Passengers from Australian Air Carriers",
    x = "Year",
    y = "Passengers (millions)"
  )
- Use ARIMA() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(,,) + PDQ(,,)))
# A mable: 1 x 1
`ARIMA(Passengers ~ pdq(, , ) + PDQ(, , ))`
<model>
1 <ARIMA(0,2,1)>
The model selected was ARIMA(0,2,1). Let’s check the residuals.
aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(0,2,1) + PDQ(0,0,0))) |>
  gg_tsresiduals()
The distribution of the residuals is approximately normal with a slight leftward skew. The ACF shows every lag well within the significance lines. We shall proceed with the forecast.
fit <- aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(0,2,1) + PDQ(0,0,0)))

aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(0,2,1) + PDQ(0,0,0))) |>
  report() |>
  forecast(h = 10) |>
  autoplot(aus_airpassengers) +
  labs(
    title = "Forecast for Total Number of Passengers from Australian Air Carriers",
    x = "Year",
    y = "Passengers (millions)"
  )
Series: Passengers
Model: ARIMA(0,2,1)
Coefficients:
ma1
-0.8963
s.e. 0.0594
sigma^2 estimated as 4.308: log likelihood=-97.02
AIC=198.04 AICc=198.32 BIC=201.65
- Write the model in terms of the backshift operator.
Let’s recall. \(By_{t} = y_{t-1}\). Now let’s think. The model we found was ARIMA(0,2,1) with no seasonal component. This means we have a second order differencing and a first order moving average.
First order differencing may be written as: \[y_t^{'} = y_t - By_{t}\]
Second order differencing may be written as: \[y_t^{''} = y_t - 2By_{t} + B^2y_{t} = (1 - B)^2y_t\]
The MA(1) part may be written as: \[e_t + \theta_1 e_{t-1} = (1 + \theta_1 B)e_t\]
So, putting the two together, we can express the ARIMA(0,2,1) model in backshift notation as: \[ \begin{align} (1 - B)^2 y_t &= (1 + \theta_1 B)e_t \end{align} \] With the estimated coefficient \(\hat{\theta}_1 = -0.8963\) from the report above, this is \((1 - B)^2 y_t = (1 - 0.8963\,B)e_t\).
- Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.
I struggled with drift; I tried to force it in a number of ways, none of which worked, yet drift shows up in the model below anyway. The likely reason: when d = 1, ARIMA() automatically considers including a constant (reported as “drift”) and keeps it when it improves the fit, and it can be forced in or out with 1 + or 0 + in the model formula (see the sketch below).
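Just to convince myself, here is a small sketch of my own (not part of the answer) forcing the constant in and out of the ARIMA(0,1,0) specification:

aus_airpassengers |>
  model(
    with_drift = ARIMA(Passengers ~ 1 + pdq(0,1,0) + PDQ(0,0,0)),  # force the constant ("drift") in
    no_drift   = ARIMA(Passengers ~ 0 + pdq(0,1,0) + PDQ(0,0,0))   # force it out
  ) |>
  glance()

With that out of the way, here is the forecast: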
aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(0,1,0) + PDQ(0,0,0))) |>
  report() |>
  forecast(h = 10) |>
  autoplot(aus_airpassengers) +
  labs(
    title = "Forecast for Passengers from Australian Air Carriers (ARIMA(0,1,0) with Drift)",
    x = "Year",
    y = "Passengers (millions)"
  )
Series: Passengers
Model: ARIMA(0,1,0) w/ drift
Coefficients:
constant
1.4191
s.e. 0.3014
sigma^2 estimated as 4.271: log likelihood=-98.16
AIC=200.31 AICc=200.59 BIC=203.97
The AIC is almost identical. The prediction interval for ARIMA(0,2,1) fans out conically, whereas the prediction interval for ARIMA(0,1,0) is somewhat bell shaped. The ARIMA(0,2,1) model has a slightly more pronounced upward trend than the ARIMA(0,1,0) model, and the mouth of its prediction interval is narrower than ARIMA(0,1,0)’s.
- Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.
aus_airpassengers |>
  model(ARIMA(Passengers ~ 1 + pdq(2,1,2) + PDQ(0,0,0))) |>
  report() |>
  forecast(h = 10) |>
  autoplot(aus_airpassengers) +
  labs(
    title = "Forecast for Passengers from Australian Air Carriers (ARIMA(2,1,2) with Drift)",
    x = "Year",
    y = "Passengers (millions)"
  )
Series: Passengers
Model: ARIMA(2,1,2) w/ drift
Coefficients:
ar1 ar2 ma1 ma2 constant
-0.5518 -0.7327 0.5895 1.0000 3.2216
s.e. 0.1684 0.1224 0.0916 0.1008 0.7224
sigma^2 estimated as 4.031: log likelihood=-96.23
AIC=204.46 AICc=206.61 BIC=215.43
The AIC score is going up. The forecast appears to be carrying some of the noise from the training data. The prediction interval is the same shape as the ARIMA(0,1,0) model’s, only its outline is wavy due to the noise in the forecast, which I believe is coming from the MA(2) component.
Let’s try removing the drift which should remove the constant.
aus_airpassengers |>
  model(ARIMA(Passengers ~ 0 + pdq(2,1,2) + PDQ(0,0,0)))
Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2) + PDQ(0, 0, 0))
[1] non-stationary AR part from CSS
# A mable: 1 x 1
`ARIMA(Passengers ~ 0 + pdq(2, 1, 2) + PDQ(0, 0, 0))`
<model>
1 <NULL model>
What happened is that I encountered an error:
Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2) + PDQ(0, 0, 0)) [1] non-stationary AR part from CSS
This means the estimation failed: with the constant removed, the conditional sum-of-squares (CSS) fit produced a non-stationary AR part, so fable returned a NULL model rather than a non-stationary fit. We’ll come back to this if we have time.
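If we wanted to dig further, feasts provides (if I recall its name correctly) gg_arma() for plotting the characteristic roots of a fitted ARIMA model; a sketch of my own, applied to the ARIMA(2,1,2) with drift that did fit:

# inverse AR/MA roots; roots close to the unit circle hint at why dropping the
# constant tips the AR part into non-stationarity
aus_airpassengers |>
  model(ARIMA(Passengers ~ 1 + pdq(2,1,2) + PDQ(0,0,0))) |>
  gg_arma()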
- Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?
aus_airpassengers |>
  model(ARIMA(Passengers ~ 1 + pdq(0,1,2) + PDQ(0,0,0))) |>
  report() |>
  forecast(h = 10) |>
  autoplot(aus_airpassengers) +
  labs(
    title = "Forecast for Passengers from Australian Air Carriers (ARIMA(0,1,2) with Constant)",
    x = "Year",
    y = "Passengers (millions)"
  )
Series: Passengers
Model: ARIMA(0,1,2) w/ drift
Coefficients:
ma1 ma2 constant
-0.0321 0.0957 1.4176
s.e. 0.1466 0.1521 0.3180
sigma^2 estimated as 4.427: log likelihood=-97.95
AIC=203.9 AICc=204.87 BIC=211.21
Ah, the noise component goes away. Wait, I thought it was the MA(2) component that was causing the noise, but the wavy forecast seems to have been coming from the AR(2) component. Perhaps the waviness wasn’t coming from noise after all.
Exercise 9.8
For the United States GDP series (from global_economy):
Let’s isolate the data
data <- global_economy |>
  filter(Code == "USA")
And now let’s have a look
data |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "United States GDP",
    x = "Year",
    y = "GDP"
  )
Oh my gosh, he’s a textbook example of exponential growth. Let’s apply the guerrero method to see what lambda it suggests for a Box-Cox transformation.
- if necessary, find a suitable Box-Cox transformation for the data;
lambda <- data |>
  features(GDP, guerrero)

print(lambda$lambda_guerrero)
[1] 0.2819443
data <- data |>
  mutate(GDP = box_cox(GDP, lambda$lambda_guerrero))
Let’s see how we did.
data |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "United States GDP (Box-Cox)",
    x = "Year",
    y = "GDP"
  )
- fit a suitable ARIMA model to the transformed data using ARIMA();
data |>
  model(ARIMA(GDP ~ pdq(,,) + PDQ(0,0,0))) |>
  report()
Series: GDP
Model: ARIMA(1,1,0) w/ drift
Coefficients:
ar1 constant
0.4586 118.1822
s.e. 0.1198 9.5047
sigma^2 estimated as 5479: log likelihood=-325.32
AIC=656.65 AICc=657.1 BIC=662.78
- try some other plausible models by experimenting with the orders chosen;
Let’s try the same model without drift
data |>
  model(ARIMA(GDP ~ 0 + pdq(1,1,0) + PDQ(0,0,0))) |>
  report()
Series: GDP
Model: ARIMA(1,1,0)
Coefficients:
ar1
0.9287
s.e. 0.0428
sigma^2 estimated as 7038: log likelihood=-333.85
AIC=671.69 AICc=671.91 BIC=675.78
This model has a higher AIC and BIC than the ARIMA(1,1,0) with drift.
Let’s look at the residuals
data |>
  model(ARIMA(GDP ~ 0 + pdq(1,1,0) + PDQ(0,0,0))) |>
  gg_tsresiduals()
It appears to be slightly more normally distributed than ARIMA(1,1,0) with drift (see below), but the ACF shows that lag 1 is showing a fair amount of correlation. So, we shall stick with ARIMA(1,1,0) with drift for the time being.
data |>
  model(ARIMA(GDP ~ pdq(0,1,0) + PDQ(0,0,0))) |>
  report()
Series: GDP
Model: ARIMA(0,1,0) w/ drift
Coefficients:
constant
220.3489
s.e. 10.8051
sigma^2 estimated as 6774: log likelihood=-331.77
AIC=667.53 AICc=667.76 BIC=671.62
He’s got higher AIC, AICc and BIC across the board than ARIMA(1,1,0) with drift, so he fits worse.
data |>
  model(ARIMA(GDP ~ pdq(1,1,1) + PDQ(0,0,0))) |>
  report()
Series: GDP
Model: ARIMA(1,1,1) w/ drift
Coefficients:
ar1 ma1 constant
0.4736 -0.0190 114.8547
s.e. 0.2851 0.3286 9.3486
sigma^2 estimated as 5580: log likelihood=-325.32
AIC=658.64 AICc=659.41 BIC=666.82
This model’s AIC, AICc and BIC are also slightly higher, though only marginally, and the added complexity of the MA term doesn’t warrant selecting ARIMA(1,1,1) over ARIMA(1,1,0). The side-by-side comparison sketched below summarizes the candidates.
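As a compact cross-check of my own (not in the original run), we can fit the candidates in one model() call and compare their information criteria with glance():

data |>
  model(
    arima110_drift = ARIMA(GDP ~ pdq(1,1,0) + PDQ(0,0,0)),
    arima110       = ARIMA(GDP ~ 0 + pdq(1,1,0) + PDQ(0,0,0)),
    arima010_drift = ARIMA(GDP ~ pdq(0,1,0) + PDQ(0,0,0)),
    arima111_drift = ARIMA(GDP ~ pdq(1,1,1) + PDQ(0,0,0))
  ) |>
  glance() |>
  select(.model, AIC, AICc, BIC) |>
  arrange(AICc)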
- choose what you think is the best model and check the residual diagnostics;
We shall select the first model, ARIMA(1,1,0) with drift, as it has the lowest AIC, AICc and BIC. Let’s check the residuals.
data |>
  model(ARIMA(GDP ~ pdq(1,1,0) + PDQ(0,0,0))) |>
  gg_tsresiduals()
If the residuals can be considered normally distributed, then they are quite left-skewed. The ACF shows that the residual autocorrelations are well within the significance lines.
- produce forecasts of your fitted model. Do the forecasts look reasonable?
data |>
  model(ARIMA(GDP ~ pdq(1,1,0) + PDQ(0,0,0))) |>
  forecast(h = 10) |>
  autoplot(data) +
  labs(
    title = "Forecast for United States GDP (ARIMA(1,1,0) w/drift)",
    x = "Year",
    y = "GDP"
  )
The forecast looks very reasonable. Not only is it in line with the apparent trend, but the prediction interval is relatively narrow compared to those we saw above. (Note that these forecasts are on the Box-Cox transformed scale, since we modelled the transformed GDP directly.)
- compare the results with what you would obtain using ETS() (with no transformation).
global_economy |>
  filter(Code == "USA") |>
  model(ETS(GDP ~ error("A") + trend("M") + season("N"))) |>
  report()
Series: GDP
Model: ETS(A,M,N)
Smoothing parameters:
alpha = 0.9998995
beta = 0.485905
Initial states:
l[0] b[0]
453235324559 1.101298
sigma^2: 3.191607e+22
AIC AICc BIC
3246.770 3247.924 3257.072
Oh my gosh, this model is off the charts. Its AIC, AICc and BIC are all roughly five times higher than the ARIMA(1,1,0) model’s, although the comparison isn’t entirely fair since the ETS model is fit to the untransformed GDP. Let’s see what the residuals look like.
global_economy |>
  filter(Code == "USA") |>
  model(ETS(GDP ~ error("A") + trend("M") + season("N"))) |>
  gg_tsresiduals()
The residuals look good up to about the year 2000, at which point they lose their noise-like behavior. The distribution looks more normal than ARIMA(1,1,0)’s residuals. The ACF is borderline: lags 3 and 7 both marginally exceed the critical limit.