Questions
Exercise 1
# a.
# As the number of random values increases, the sample autocorrelations and the
# critical value bounds both shrink toward zero. This is because more observations
# give more precise estimates, just as standard errors shrink as the number of
# observations in a regression grows. The critical bounds are approximately
# +/- 1.96 / sqrt(T), so with only a little data they must be wide to allow for the
# larger sampling error, and they narrow as T increases. The pattern of lags is
# otherwise similar across the graphs.
# In each graph, every spike stays within the critical bounds, so there is no
# significant autocorrelation and each series looks like white noise.
# b.
# As noted in part a, larger samples give more accurate autocorrelation estimates,
# so the critical bounds (roughly +/- 1.96 / sqrt(T)) move closer to zero and there
# is less room for sampling error. Think of the sampling distribution of each
# autocorrelation becoming more tightly clustered around its mean of zero as the
# sample size grows. The autocorrelations themselves also differ across the plots
# because each one is estimated from a different random sample; with more data
# they simply sit closer to zero.
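# A minimal sketch reproducing the kind of comparison described above (my
# addition, assuming white-noise series of 36, 360, and 1,000 standard normal
# values like those in the figure the exercise refers to). The dashed bounds in
# each ACF plot sit at roughly +/- 1.96 / sqrt(T), which is why they narrow as
# the series gets longer.
set.seed(123)
for (n in c(36, 360, 1000)) {
  p <- tsibble(idx = seq_len(n), y = rnorm(n), index = idx) %>%
    ACF(y) %>%
    autoplot() +
    ggtitle(paste("White noise series of length", n))
  print(p)
}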
Exercise 2
# Plotting the daily closing prices for Amazon stock
View(gafa_stock)
gafa_stock %>%
distinct(Symbol)
## # A tibble: 4 × 1
## Symbol
## <chr>
## 1 AAPL
## 2 AMZN
## 3 FB
## 4 GOOG
# Data manipulation to filter the daily amazon close price
amzn <- gafa_stock %>%
filter(Symbol == "AMZN") %>%
select(Close)
# The plot
autoplot(amzn)
## Plot variable not specified, automatically selected `.vars = Close`

# There is a clear upward trend, which is the first sign that the series is not stationary.
# The ACF plot
amzn %>%
ACF(Close) %>%
autoplot()
## Warning: Provided data has an irregular interval, results should be treated with
## caution. Computing ACF by observation.

# The ACF decays very slowly rather than cutting off or dying out quickly, which
# is another sign of non-stationarity. From this plot alone, the exact values of
# p, d, and q are not distinguishable.
# The PACF plot
amzn %>%
PACF(Close) %>%
autoplot()
## Warning: Provided data has an irregular interval, results should be treated with
## caution. Computing ACF by observation.

# The PACF has an extremely large spike at the first lag, but it is not the only
# significant lag, so the PACF does not point to clear values of p, d, and q
# either.
# Both plots show strong autocorrelation in the series, so it is definitely not
# white noise and needs to be differenced before a model can be identified.
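# A quick follow-up check (my addition, not part of the original answer): the
# unit-root test should confirm that one difference is needed, and the ACF of the
# differenced close price should look much closer to white noise (the same
# irregular-interval warning will appear).
amzn %>%
  features(Close, unitroot_ndiffs)
amzn %>%
  mutate(diff_close = difference(Close)) %>%
  ACF(diff_close) %>%
  autoplot()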
Exercise 3
# a. Turkish GDP from global_economy
View(global_economy)
turk <- global_economy %>%
filter(Country == "Turkey") %>%
select(Year, GDP)
turk
## # A tsibble: 58 x 2 [1Y]
## Year GDP
## <dbl> <dbl>
## 1 1960 13995067818.
## 2 1961 8022222222.
## 3 1962 8922222222.
## 4 1963 10355555556.
## 5 1964 11177777778.
## 6 1965 11944444444.
## 7 1966 14122222222.
## 8 1967 15666666667.
## 9 1968 17500000000
## 10 1969 19466666667.
## # … with 48 more rows
autoplot(turk)
## Plot variable not specified, automatically selected `.vars = GDP`

# The appropriate Box-Cox transformation parameter (Guerrero method) is
turk %>%
features(GDP, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.157
turk %>%
autoplot(box_cox(GDP, 0.157))

turk1 <- turk %>%
mutate(GDP = log(GDP))
autoplot(turk1)
## Plot variable not specified, automatically selected `.vars = GDP`

# The plot of the original data suggested that the variation could be stabilized
# with a log transformation, and since the Guerrero lambda is fairly close to 0,
# I tried the log. It looks good to me.
# Now, let's find the order of differencing:
turk %>%
features(GDP, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# This function says we need one first difference, denoted y_t' = (1 - B)y_t in
# backshift notation.
# Now, to check whether I need a seasonal difference:
turk %>%
features(GDP, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
# No seasonal difference is needed, which makes sense because this is annual data
# with no seasonal pattern.
# Now, let's look at the plot to see.
turk2 <- turk1 %>%
mutate(GDP = difference(GDP)) %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`
turk2
## Warning: Removed 1 row containing missing values (`geom_line()`).

# The series looks stationary for the most part.
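# An extra confirmation (my addition): a KPSS test on the differenced log series
# should not reject stationarity (feasts caps the reported p-value at 0.1).
turk1 %>%
  mutate(GDP = difference(GDP)) %>%
  features(GDP, unitroot_kpss)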
# b. Accommodation takings in the state of Tasmania from aus_accommodation
View(aus_accommodation)
aus_accommodation %>%
distinct(State)
## # A tibble: 8 × 1
## State
## <chr>
## 1 Australian Capital Territory
## 2 New South Wales
## 3 Northern Territory
## 4 Queensland
## 5 South Australia
## 6 Tasmania
## 7 Victoria
## 8 Western Australia
tas <- aus_accommodation %>%
filter(State == "Tasmania") %>%
select(Date, Takings)
tas
## # A tsibble: 74 x 2 [1Q]
## Date Takings
## <qtr> <dbl>
## 1 1998 Q1 28.7
## 2 1998 Q2 19.0
## 3 1998 Q3 16.1
## 4 1998 Q4 25.9
## 5 1999 Q1 28.4
## 6 1999 Q2 20.1
## 7 1999 Q3 17.3
## 8 1999 Q4 24.3
## 9 2000 Q1 30.0
## 10 2000 Q2 21.7
## # … with 64 more rows
autoplot(tas)
## Plot variable not specified, automatically selected `.vars = Takings`

# There's a clear trend, so the series needs to be differenced. Let's see if
# it needs a Box-Cox transformation first.
tas %>%
features(Takings, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.00182
# The value is extremely close to zero, so it needs to be logged.
tas %>%
features(Takings, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# It needs a first difference, which is denoted y_t' = (1 - B)y_t.
tas %>%
features(Takings, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
# It also needs a seasonal difference! Thus, a seasonal difference followed by a
# first difference can be written as
# (1 - B)(1 - B^m)y_t = (1 - B - B^m + B^(m+1))y_t
#                     = y_t - y_(t-1) - y_(t-m) + y_(t-m-1)
# This is quarterly data with annual seasonality, so the seasonal lag m is 4.
tas1 <- tas %>%
mutate(Takings = difference(log(Takings)))
# Let's see the series so far.
autoplot(tas1)
## Plot variable not specified, automatically selected `.vars = Takings`
## Warning: Removed 1 row containing missing values (`geom_line()`).

# The data seems more stable, but the seasonality is still apparent. Thus,
# I'm taking the seasonal difference with lag 4 since it's quarterly data with
# annual seasonality.
tas2 <- tas1 %>%
mutate(Takings = difference(Takings, 4))
autoplot(tas2)
## Plot variable not specified, automatically selected `.vars = Takings`
## Warning: Removed 5 rows containing missing values (`geom_line()`).

# Now this looks like white noise.
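# As a sanity check (my addition), the ACF of the doubly differenced series
# should show few significant spikes if it really is close to white noise.
tas2 %>%
  ACF(Takings) %>%
  autoplot()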
# c. Monthly Sales from souvenirs
View(souvenirs)
autoplot(souvenirs)
## Plot variable not specified, automatically selected `.vars = Sales`

# There's an increasing trend with some seasonality.
souvenirs %>%
features(Sales, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.00212
# The lambda is very close to 0, so we will use log as our transformation.
souvenirs %>%
features(Sales, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# So one non-seasonal difference is needed. Combined with the seasonal difference
# checked next, the backshift notation can be written as
# (1 - B)(1 - B^m)y_t = (1 - B - B^m + B^(m+1))y_t
#                     = y_t - y_(t-1) - y_(t-m) + y_(t-m-1)
souvenirs %>%
features(Sales, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
# and a seasonal difference. It's monthly data with annual seasonality. Thus,
# the lag will be equal to 12.
souvenirs1 <- souvenirs %>%
mutate(Sales = difference(log(Sales)))
autoplot(souvenirs1)
## Plot variable not specified, automatically selected `.vars = Sales`
## Warning: Removed 1 row containing missing values (`geom_line()`).

# There is still seasonality so,
souvenirs2 <- souvenirs1 %>%
mutate(Sales = difference(Sales, 12))
autoplot(souvenirs2)
## Plot variable not specified, automatically selected `.vars = Sales`
## Warning: Removed 13 rows containing missing values (`geom_line()`).

# This indeed looks like white noise!
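# A formal check to go with the visual one (my addition): a Ljung-Box test on the
# transformed and differenced sales, dropping the NAs created by the differencing.
# A large p-value is consistent with white noise.
souvenirs2 %>%
  filter(!is.na(Sales)) %>%
  features(Sales, ljung_box, lag = 24)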
Exercise 4
# The function given in the homework set is
ar1 <- function(phi, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + e[i]
  }
  tsibble(idx = seq_len(n), y = y, index = idx)
}
data <- ar1(0.6)
data
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 -0.554
## 3 3 -0.212
## 4 4 0.0510
## 5 5 -0.690
## 6 6 0.596
## 7 7 1.98
## 8 8 1.89
## 9 9 1.79
## 10 10 1.57
## # … with 90 more rows
# a.
# Using the code to generate data from an AR(1) model with phi = 0.6 and sigma
# squared equal to 1. Since e is drawn with rnorm(), whose default standard
# deviation is 1, sigma squared is already equal to 1. The function has order
# p = 1 because only one lagged value enters the recursion. Thus,
ar1(0.6)
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 0.0795
## 3 3 -1.28
## 4 4 0.125
## 5 5 -0.777
## 6 6 -2.13
## 7 7 -2.22
## 8 8 -1.74
## 9 9 -2.74
## 10 10 -1.28
## # … with 90 more rows
# is all we need.
# b. produce a time plot for the series
# I used the function from above to create a function that plots the
# series given the value of phi.
ar2 <- function(phi, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + e[i]
  }
  autoplot(tsibble(idx = seq_len(n), y = y, index = idx))
}
# Now I'm choosing different values of phi to observe the graph.
# when phi = 0.1:
trail1 <- ar2(0.1)
## Plot variable not specified, automatically selected `.vars = y`
trail1

# when phi = 0.2:
trail2 <- ar2(0.2)
## Plot variable not specified, automatically selected `.vars = y`
trail2

# when phi = 0.3
trail3 <- ar2(0.3)
## Plot variable not specified, automatically selected `.vars = y`
trail3

# when phi = 0.5
trail4 <- ar2(0.5)
## Plot variable not specified, automatically selected `.vars = y`
trail4

# when phi = 0.8
trail5 <- ar2(0.8)
## Plot variable not specified, automatically selected `.vars = y`
trail5

# when phi = 0.9
trail6 <- ar2(0.9)
## Plot variable not specified, automatically selected `.vars = y`
trail6

# As phi gets closer to 1, the series is supposed to behave more and more like a
# random walk (with c = 0 there is no drift), and with phi = 0.9 the plot already
# wanders like one.
# Let's try a negative value for phi
trail7 <- ar2(-0.5)
## Plot variable not specified, automatically selected `.vars = y`
trail7

# With a negative phi, the series does indeed tend to oscillate between positive
# and negative values around its mean.
# Let's try phi = 0 to see if it's white noise.
trail8 <- ar2(0)
## Plot variable not specified, automatically selected `.vars = y`
trail8

# Seems like white noise to me.
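# A compact way to see the effect of phi in a single figure (my addition; the
# chosen phi values and facet layout are my own). Re-seeding before each call
# means every panel uses the same error sequence, so only phi differs.
phis <- c(-0.5, 0, 0.3, 0.6, 0.9)
sims <- dplyr::bind_rows(lapply(phis, function(p) {
  set.seed(42)
  ar1(p) %>%
    as_tibble() %>%
    mutate(phi = paste("phi =", p))
}))
ggplot(sims, aes(x = idx, y = y)) +
  geom_line() +
  facet_wrap(~ phi, ncol = 1, scales = "free_y")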
# c. Write your own code to generate data from an MA(1) model with
# theta = 0.6 and sigma squared equal to 1.
# The AR part disappears when phi = 0, so I created a function that sets phi to
# zero by default and adds the previous period's error term to get an MA(1).
# I am assuming c = 0 for simplicity.
ma1 <- function(theta, phi = 0, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + e[i] + theta * e[i - 1]
  }
  tsibble(idx = seq_len(n), y = y, index = idx)
}
# Since the errors come from rnorm() with its default standard deviation of 1,
# sigma squared is automatically 1.
# Thus, an MA(1) model with theta_1 = 0.6 and sigma squared equal to 1 is
data1 <- ma1(0.6)
data1
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 0.237
## 3 3 0.612
## 4 4 2.03
## 5 5 2.26
## 6 6 2.62
## 7 7 2.68
## 8 8 -0.451
## 9 9 -0.304
## 10 10 0.171
## # … with 90 more rows
# Now, a time plot for the series has the code:
ma2 <- function(theta, phi = 0, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + e[i] + theta * e[i - 1]
  }
  autoplot(tsibble(idx = seq_len(n), y = y, index = idx))
}
# d. Now we can examine how the plot changes as theta_1 changes.
# When theta_1 = 0:
t1 <- ma2(0)
## Plot variable not specified, automatically selected `.vars = y`
t1

# This is a graph of the pure white noise with no weight from the errors
# in the past.
# when theta_1 = 0.5
t2 <- ma2(0.5)
## Plot variable not specified, automatically selected `.vars = y`
t2

# It seems to be becoming more "smooth"
# when theta_1 = 0.8:
t3 <- ma2(0.8)
## Plot variable not specified, automatically selected `.vars = y`
t3

# Again, more "smooth"
# when theta_1 = 0.9:
t4 <- ma2(0.9)
## Plot variable not specified, automatically selected `.vars = y`
t4

# e. Generate data from an ARMA(1,1) model with phi_1 = 0.6,
# theta_1 = 0.6, and sigma squared equal to 1.
# Here is the code for the ARMA(1,1) model; it is the MA(1) function with phi no
# longer fixed at 0.
ARMA11 <- function(theta, phi, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + theta * e[i - 1] + e[i]
  }
  tsibble(idx = seq_len(n), y = y, index = idx)
}
# and so my generated data from an ARMA(1,1) model with phi_1 = 0.6,
# theta_1 = 0.6, and sigma squared equal to 1 is
t11 <- ARMA11(0.6, 0.6)
t11
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 0.212
## 3 3 -1.10
## 4 4 -1.90
## 5 5 -0.838
## 6 6 1.55
## 7 7 1.49
## 8 8 -2.58
## 9 9 -4.26
## 10 10 -4.16
## # … with 90 more rows
# f. From an AR(2) model
AR2 <- function(phi1, phi2, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 3:n) {
    y[i] <- phi1 * y[i - 1] + phi2 * y[i - 2] + e[i]
  }
  tsibble(idx = seq_len(n), y = y, index = idx)
}
# with phi_1 = -0.8, phi_2 = 0.3, and sigma squared = 1 generates data:
data3 <- AR2(-0.8, 0.3)
data3
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 0
## 3 3 2.18
## 4 4 1.85
## 5 5 -1.47
## 6 6 -3.67
## 7 7 -0.590
## 8 8 4.38
## 9 9 1.50
## 10 10 -2.25
## # … with 90 more rows
# g. Graphing the latter two series to compare.
autoplot(t11)
## Plot variable not specified, automatically selected `.vars = y`

# The movement of the ARMA(1,1) series is fairly smooth.
autoplot(data3)
## Plot variable not specified, automatically selected `.vars = y`

# This series is much less smooth: it oscillates around zero with growing
# amplitude, because phi_1 = -0.8 and phi_2 = 0.3 violate the stationarity
# conditions for an AR(2), whereas the ARMA(1,1) above is stationary.
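# To back up that comparison (my addition), the ACFs of the two simulated series
# are informative: the ARMA(1,1) ACF should die out quickly, while the ACF of the
# non-stationary AR(2) should stay large and alternate in sign.
t11 %>%
  ACF(y) %>%
  autoplot()
data3 %>%
  ACF(y) %>%
  autoplot()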
Exercise 5
# a.
View(aus_airpassengers)
# Using the ARIMA() function to find an appropriate ARIMA model.
# The data:
data <- aus_airpassengers %>%
as_tsibble(index = Year)
autoplot(data)
## Plot variable not specified, automatically selected `.vars = Passengers`

# Now the fit
fit <- data %>%
model(ARIMA(Passengers))
report(fit)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8963
## s.e. 0.0594
##
## sigma^2 estimated as 4.308: log likelihood=-97.02
## AIC=198.04 AICc=198.32 BIC=201.65
# and so the appropriate ARIMA model is ARIMA(0,2,1).
# Now, I am checking the residuals of my model.
gg_tsresiduals(fit)

# The variance isn't perfectly constant since there are several outliers, but the
# mean is centered around zero and there is no significant autocorrelation, which
# is what matters most. Overall, the residuals look like white noise.
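# A formal residual check to go with the plots (my addition; dof = 1 because the
# ARIMA(0,2,1) model has a single estimated coefficient).
augment(fit) %>%
  features(.innov, ljung_box, lag = 10, dof = 1)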
# Now, I am plotting the forecasts of the next 10 periods.
fc <- fit %>%
forecast(h = 10) %>%
autoplot(data)
fc

# seems pretty accurate to me.
# b. Write the model in terms of the backshift operator.
# The model was differenced twice and has a first-order moving average term.
# There is no autoregressive part and no constant in the fitted model. The
# second difference is
# y_t'' = (y_t - y_t-1) - (y_t-1 - y_t-2) = y_t - 2y_t-1 + y_t-2
#       = (1 - 2B + B^2)y_t = (1 - B)^2 y_t
# and the first-order moving average part contributes (1 + theta_1*B)e_t.
# Thus, the overall model in backshift notation is:
# (1 - B)^2 y_t = (1 + theta_1*B)e_t
# c.
# Plotting forecasts from an ARIMA(0,1,0) model with drift. The `1` in the model
# formula below tells ARIMA() to include a constant, which provides the drift.
fit1 <- data %>%
model(ARIMA(Passengers ~ 1 + pdq(0,1,0)))
# Now, the forecast
fc1 <- fit1 %>%
forecast(h = 10) %>%
autoplot(data)
fc1

# The forecast doesn't look too different from the automatically selected ARIMA model.
report(fit1)
## Series: Passengers
## Model: ARIMA(0,1,0) w/ drift
##
## Coefficients:
## constant
## 1.4191
## s.e. 0.3014
##
## sigma^2 estimated as 4.271: log likelihood=-98.16
## AIC=200.31 AICc=200.59 BIC=203.97
# The AICc is 2 units higher though.
# d.
# Plotting the forecast of the ARIMA(2,1,2) model w/ drift.
fit2 <- data %>%
model(ARIMA(Passengers ~ 1 + pdq(2,1,2)))
fc2 <- fit2 %>%
forecast(h = 10) %>%
autoplot(data)
fc2

# The point forecasts from this model are not a straight line; the AR and MA
# terms give the forecast path a bit more curvature.
report(fit2)
## Series: Passengers
## Model: ARIMA(2,1,2) w/ drift
##
## Coefficients:
## ar1 ar2 ma1 ma2 constant
## -0.5518 -0.7327 0.5895 1.0000 3.2216
## s.e. 0.1684 0.1224 0.0916 0.1008 0.7224
##
## sigma^2 estimated as 4.031: log likelihood=-96.23
## AIC=204.46 AICc=206.61 BIC=215.43
# However, the AICc is the largest of any model fitted so far.
# Now, removing the constant
fit3 <- data %>%
model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
# It returned an error: non-stationary AR part from CSS. I am guessing this is
# because the series trends upward, so without a constant (drift) the AR part
# has to absorb the trend and its estimated coefficients become non-stationary.
# e.
# The forecast from an ARIMA(0,2,1) model with a constant.
fit4 <- data %>%
model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))
## Warning: Model specification induces a quadratic or higher order polynomial trend.
## This is generally discouraged, consider removing the constant or reducing the number of differences.
# This returns a warning message: Model specification induces a quadratic or
# higher order polynomial trend. This is generally discouraged, Consider
# removing the constant or reducing the number of differences.
fc4 <- fit4 %>%
forecast(h = 10) %>%
autoplot(data)
fc4

# The warning makes sense: the point forecasts follow a quadratic-looking trend,
# because a constant combined with two differences induces a quadratic (or
# higher order) polynomial trend.
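# Putting the candidate models side by side makes the AICc comparison easier to
# read (my addition; the model labels are my own).
data %>%
  model(
    auto      = ARIMA(Passengers),
    drift_010 = ARIMA(Passengers ~ 1 + pdq(0, 1, 0)),
    drift_212 = ARIMA(Passengers ~ 1 + pdq(2, 1, 2))
  ) %>%
  glance() %>%
  select(.model, AICc)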
Exercise 6
# United States GDP
View(global_economy)
global_economy %>%
distinct(Country)
## # A tibble: 263 × 1
## Country
## <fct>
## 1 Afghanistan
## 2 Albania
## 3 Algeria
## 4 American Samoa
## 5 Andorra
## 6 Angola
## 7 Antigua and Barbuda
## 8 Arab World
## 9 Argentina
## 10 Armenia
## # … with 253 more rows
data <- global_economy %>%
filter(Country == "United States") %>%
select(Year, Country, GDP) %>%
as_tsibble(index = Year)
data
## # A tsibble: 58 x 3 [1Y]
## # Key: Country [1]
## Year Country GDP
## <dbl> <fct> <dbl>
## 1 1960 United States 543300000000
## 2 1961 United States 563300000000
## 3 1962 United States 605100000000
## 4 1963 United States 638600000000
## 5 1964 United States 685800000000
## 6 1965 United States 743700000000
## 7 1966 United States 815000000000
## 8 1967 United States 861700000000
## 9 1968 United States 942500000000
## 10 1969 United States 1019900000000
## # … with 48 more rows
autoplot(data)
## Plot variable not specified, automatically selected `.vars = GDP`

# a.
data %>%
features(GDP, features = guerrero)
## # A tibble: 1 × 2
## Country lambda_guerrero
## <fct> <dbl>
## 1 United States 0.282
# The lambda is fairly close to 0, so I will use a log transformation.
data1 <- data %>%
mutate(GDP = log(GDP))
autoplot(data1)
## Plot variable not specified, automatically selected `.vars = GDP`

# it does seem more balanced.
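# For comparison (my addition), the exact Guerrero lambda can be applied directly;
# the resulting shape is very close to the log-transformed series.
data %>%
  autoplot(box_cox(GDP, 0.282))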
# b.
# using the ARIMA() function to fit a suitable ARIMA model to the
# transformed data.
fit <- data %>%
model(ARIMA(log(GDP)))
report(fit)
## Series: GDP
## Model: ARIMA(0,2,1)
## Transformation: log(GDP)
##
## Coefficients:
## ma1
## -0.6353
## s.e. 0.1138
##
## sigma^2 estimated as 0.0004278: log likelihood=139.76
## AIC=-275.53 AICc=-275.3 BIC=-271.48
# The chosen, suitable ARIMA model is ARIMA(0,2,1).
# c.
# This function chooses appropriate orders
data %>%
model(ARIMA(log(GDP) ~ pdq(p=1:3, d=1, q=0:2)))
## # A mable: 1 x 2
## # Key: Country [1]
## Country `ARIMA(log(GDP) ~ pdq(p = 1:3, d = 1, q = 0:2))`
## <fct> <model>
## 1 United States <ARIMA(1,1,1) w/ drift>
# It gave me an ARIMA(1,1,1) w/ a drift
# The fit
fit1 <- data %>%
model(ARIMA(log(GDP) ~ 1 + pdq(1,1,1)))
report(fit1)
## Series: GDP
## Model: ARIMA(1,1,1) w/ drift
## Transformation: log(GDP)
##
## Coefficients:
## ar1 ma1 constant
## 0.9270 -0.5751 0.0042
## s.e. 0.0636 0.1552 0.0010
##
## sigma^2 estimated as 0.0004144: log likelihood=143.16
## AIC=-278.32 AICc=-277.55 BIC=-270.14
# This has a lower AICc than the automatically selected ARIMA(0,2,1)!
# d.
# Based on the AICc, the second model, ARIMA(1,1,1) w/ drift is best!
# Now, to check the residuals
gg_tsresiduals(fit1)

# The variance seems roughly constant with a few outliers. There is no significant
# autocorrelation and the mean is centered around zero with a roughly normal
# distribution. So far, so good.
# e.
# Now, the forecast and the plot
fc2 <- fit1 %>%
forecast(h = 10) %>%
autoplot(data)
fc2

# Since the original plot is exponentially increasing, it's reasonable that
# the forecast is too!
# f.
# Now, an ETS() model with no transformation
fit2 <- data %>%
model(ETS(GDP))
report(fit2)
## Series: GDP
## Model: ETS(M,A,N)
## Smoothing parameters:
## alpha = 0.9990876
## beta = 0.5011949
##
## Initial states:
## l[0] b[0]
## 448093333334 64917355687
##
## sigma^2: 7e-04
##
## AIC AICc BIC
## 3190.787 3191.941 3201.089
# The AICc is huge, but it cannot be compared with the ARIMA model's AICc: the
# ETS model is fitted to the untransformed GDP and is a different model class, so
# the likelihoods (and hence the information criteria) are on completely different
# scales. A fair comparison has to be based on forecast accuracy instead; a quick
# holdout comparison is sketched at the end of this exercise.
# Let's see what the forecast looks like
fc3 <- fit2 %>%
forecast(h = 10) %>%
autoplot(data)
fc3

# The prediction interval is much wider, presumably because of the larger residual variance.
# Let's see
gg_tsresiduals(fit2)

# Indeed, the residuals do not look normally distributed. The point forecasts
# from the model still seem fairly reasonable, though.
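# A quick way to compare the two model classes fairly (my addition, a sketch with
# an arbitrary holdout of the last ten years): fit both on data up to 2007 and
# measure accuracy on 2008-2017.
gdp_train <- data %>%
  filter(Year <= 2007)
bind_rows(
  gdp_train %>%
    model(ARIMA(log(GDP) ~ 1 + pdq(1, 1, 1))) %>%
    forecast(h = 10) %>%
    accuracy(data),
  gdp_train %>%
    model(ETS(GDP)) %>%
    forecast(h = 10) %>%
    accuracy(data)
) %>%
  select(.model, RMSE, MAPE)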
Exercise 7
# a.
# Choosing Japan as my country.
data <- aus_arrivals %>%
filter(Origin == "Japan") %>%
as_tsibble(index = Quarter)
data
## # A tsibble: 127 x 3 [1Q]
## # Key: Origin [1]
## Quarter Origin Arrivals
## <qtr> <chr> <int>
## 1 1981 Q1 Japan 14763
## 2 1981 Q2 Japan 9321
## 3 1981 Q3 Japan 10166
## 4 1981 Q4 Japan 19509
## 5 1982 Q1 Japan 17117
## 6 1982 Q2 Japan 10617
## 7 1982 Q3 Japan 11737
## 8 1982 Q4 Japan 20961
## 9 1983 Q1 Japan 20671
## 10 1983 Q2 Japan 12235
## # … with 117 more rows
# b.
# Apparently I need to difference the data to obtain stationary data. I want to
# check this and, if so, how many differences I should take.
# First, let's see if I need a Box-Cox transformation.
data %>%
features(Arrivals, features = guerrero)
## # A tibble: 1 × 2
## Origin lambda_guerrero
## <chr> <dbl>
## 1 Japan 0.254
# It's fairly close to zero, but let's check the original graph before
# any changes are made.
autoplot(data)
## Plot variable not specified, automatically selected `.vars = Arrivals`

# Yes, the variation grows with the level of the series, so a transformation is
# worth trying before differencing. Let's see what the log does.
data1 <- data %>%
mutate(Arrivals = log(Arrivals))
autoplot(data1)
## Plot variable not specified, automatically selected `.vars = Arrivals`

# It didn't really fix anything, so maybe I should just stick to taking the difference
# of the original data. Let's see how many we should take.
data %>%
features(Arrivals, unitroot_ndiffs)
## # A tibble: 1 × 2
## Origin ndiffs
## <chr> <int>
## 1 Japan 1
# There should be one non-seasonal difference.
data %>%
features(Arrivals, unitroot_nsdiffs)
## # A tibble: 1 × 2
## Origin nsdiffs
## <chr> <int>
## 1 Japan 1
# and also a seasonal difference.
# Choosing the lag to be 4 since it's quarterly data with annual seasonality.
data2 <- data %>%
mutate(Arrivals = difference(difference(Arrivals), 4)) %>%
as_tsibble(index = Quarter)
# Let's look at the plot
autoplot(data2)
## Plot variable not specified, automatically selected `.vars = Arrivals`
## Warning: Removed 5 rows containing missing values (`geom_line()`).

# Seems stationary to me!
# c.
# The ACF graph of the differenced data
data2 %>%
ACF(Arrivals) %>%
autoplot()

# There are a few significant spikes in the ACF. The significant spike at lag 4
# suggests a seasonal MA(1) component, and the significant spike at lag 1 suggests
# a non-seasonal MA(1) component. I am not sure how to deal with the spike at
# lag 6; maybe an MA(5) or MA(6) would capture it.
# d.
# The PACF graph
data2 %>%
PACF(Arrivals) %>%
autoplot()

# There is a very significant spike at lag 4, along with several other smaller
# spikes throughout the graph.
# e.
# As stated above, the plots suggest an ARIMA(0,1,1)(0,1,1)[4]. If we had instead
# started with the PACF to choose the non-seasonal component and used the ACF to
# select the seasonal part of the model, we would have been left with the same
# model as before, ARIMA(0,1,1)(0,1,1)[4].
# f.
# I am now using the automatic ARIMA() function to choose a model. I will then
# set stepwise = FALSE and approx = FALSE so the function searches a much larger
# set of models. I am using the original (undifferenced) data to fit the models.
fit <- data %>%
model(auto1 = ARIMA(Arrivals))
report(fit)
## Series: Arrivals
## Model: ARIMA(0,1,1)(1,1,1)[4]
##
## Coefficients:
## ma1 sar1 sma1
## -0.4228 -0.2305 -0.5267
## s.e. 0.0944 0.1337 0.1246
##
## sigma^2 estimated as 174801727: log likelihood=-1330.66
## AIC=2669.32 AICc=2669.66 BIC=2680.54
# The automatic ARIMA() function gives ARIMA(0,1,1)(1,1,1)[4], very close to my
# estimate, just with a seasonal AR(1) added. Now I am going to make the search
# exhaustive.
fit2 <- data %>%
model(auto2 = ARIMA(Arrivals, stepwise = FALSE, approx = FALSE))
report(fit2)
## Series: Arrivals
## Model: ARIMA(5,1,0)(0,1,1)[4]
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 sma1
## -0.4099 -0.2729 -0.2468 -0.3897 -0.2206 -0.4577
## s.e. 0.0880 0.0910 0.0889 0.1326 0.0887 0.1462
##
## sigma^2 estimated as 167809816: log likelihood=-1326.65
## AIC=2667.3 AICc=2668.28 BIC=2686.93
# The exhaustive search gives an ARIMA(5,1,0)(0,1,1)[4]. I'm guessing the
# significant spike at lag 6 in the ACF graph is part of why a non-seasonal
# AR(5) component ends up in the model.
# Now, I am checking the AICc of my estimated model based on the ACF and PACF
# graph.
fit3 <- data %>%
model(ARIMA(Arrivals ~ pdq(0,1,1) + PDQ(0,1,1)))
report(fit3)
## Series: Arrivals
## Model: ARIMA(0,1,1)(0,1,1)[4]
##
## Coefficients:
## ma1 sma1
## -0.4256 -0.6755
## s.e. 0.1005 0.0659
##
## sigma^2 estimated as 177223035: log likelihood=-1332.01
## AIC=2670.02 AICc=2670.23 BIC=2678.43
# My model has a slightly higher AICc than the exhaustive-search model. Therefore,
# the ARIMA(5,1,0)(0,1,1)[4] is the best of these by AICc.
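# Collecting all three candidates in one mable makes the AICc comparison easier
# to read (my addition; the labels are my own).
data %>%
  model(
    auto   = ARIMA(Arrivals),
    search = ARIMA(Arrivals, stepwise = FALSE, approx = FALSE),
    manual = ARIMA(Arrivals ~ pdq(0, 1, 1) + PDQ(0, 1, 1))
  ) %>%
  glance() %>%
  select(.model, AICc)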
# g.
# It's hard to write out the equation in R comments because of the Greek letters,
# so phi_i denotes the non-seasonal AR coefficients and Theta_1 the seasonal MA
# coefficient.
# The ARIMA(5,1,0)(0,1,1)[4] model in terms of the backshift operator is:
# (1 - phi_1*B - phi_2*B^2 - phi_3*B^3 - phi_4*B^4 - phi_5*B^5)(1 - B)(1 - B^4)y_t = (1 + Theta_1*B^4)epsilon_t
# Without the backshift operator, letting w_t = y_t - y_(t-1) - y_(t-4) + y_(t-5):
# y_t = y_(t-1) + y_(t-4) - y_(t-5)
#     + phi_1*w_(t-1) + phi_2*w_(t-2) + phi_3*w_(t-3) + phi_4*w_(t-4) + phi_5*w_(t-5)
#     + Theta_1*epsilon_(t-4) + epsilon_t
# that took a little while :) hopefully it's comprehensible.
Exercise 8
# The link that was in the homework set wouldn't load, so I looked up the data
# online and imported it manually. I found gold price data.
library(readxl)  # needed for read_excel()
Gold <- read_excel("C:/Users/summe_iy44w2w/Downloads/Gold.xlsx")
View(Gold)
# I guess I'll do the Close price.
# The original graph
data <- Gold %>%
mutate(Date1 = row_number()) %>%
select(Date1, `Close/Last`) %>%
as_tsibble(index = Date1)
data
## # A tsibble: 2,547 x 2 [1]
## Date1 `Close/Last`
## <int> <dbl>
## 1 1 1769
## 2 2 1763
## 3 3 1776.
## 4 4 1775.
## 5 5 1774.
## 6 6 1774.
## 7 7 1754.
## 8 8 1716.
## 9 9 1716
## 10 10 1680.
## # … with 2,537 more rows
autoplot(data)
## Plot variable not specified, automatically selected `.vars = Close/Last`

# The ACF plot
data %>%
ACF(`Close/Last`) %>%
autoplot()

# I wonder if it's autocorrelated (kidding). I know it's decreasing, but
# I can't tell if it is decreasing exponentially.
data %>%
PACF(`Close/Last`) %>%
autoplot()

# There is a huge spike at the first lag of the PACF.
# Let's see how many differences the data needs and whether it needs a Box-Cox transformation.
data %>%
features(`Close/Last`, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 -0.161
# It's pretty close to zero, so let's try the log.
data1 <- data %>%
mutate(`Close/Last` = log(`Close/Last`))
autoplot(data1)
## Plot variable not specified, automatically selected `.vars = Close/Last`

# The log barely changes the shape, so let's skip the transformation.
data %>%
features(`Close/Last`, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# There should be one difference.
data %>%
features(`Close/Last`, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
# and no seasonal difference, which makes sense for daily price data.
data2 <- data %>%
mutate(`Close/Last` = difference(`Close/Last`))
autoplot(data2)
## Plot variable not specified, automatically selected `.vars = Close/Last`
## Warning: Removed 1 row containing missing values (`geom_line()`).

# Looks stationary
data2 %>%
ACF(`Close/Last`) %>%
autoplot()

# No strong pattern stands out in the ACF of the differenced series.
data2 %>%
PACF(`Close/Last`) %>%
autoplot()

# The PACF looks pretty similar to the ACF.
# I am going to use the auto ARIMA() method to find the appropriate
# model.
fit <- data %>%
model(ARIMA(`Close/Last`, stepwise = FALSE, approx = FALSE))
report(fit)
## Series: Close/Last
## Model: ARIMA(4,1,1)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1
## 0.5937 0.0431 0.0190 -0.0652 -0.6367
## s.e. 0.1951 0.0244 0.0232 0.0201 0.1951
##
## sigma^2 estimated as 217.7: log likelihood=-10462.93
## AIC=20937.87 AICc=20937.9 BIC=20972.92
# It gave me an ARIMA(4,1,1) model, which is consistent with the patterns in the
# ACF and PACF plots.
# c.
# I am checking the residuals of my ARIMA model.
gg_tsresiduals(fit)

# The variance is pretty constant and the mean is centered around zero with a
# roughly normal distribution. A few residual autocorrelations look borderline,
# so I'll check them with a Ljung-Box test.
augment(fit) %>%
features(.resid, ljung_box, lag = 10, dof = 1)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(`Close/Last`, stepwise = FALSE, approx = FALSE) 3.92 0.917
# The p-value is > 0.05, so the residual autocorrelations are not significant;
# the residuals are indistinguishable from white noise.
# d.
# My forecast for roughly the next year of trading days (h = 261)
fc <- fit %>%
forecast(h = 261) %>%
autoplot(data)
fc

# The point forecasts look reasonable, though the prediction interval is quite wide.
# e.
# I use the automatic ETS() model to choose one for me.
fit1 <- data %>%
model(ETS(`Close/Last`))
report(fit1)
## Series: Close/Last
## Model: ETS(M,N,N)
## Smoothing parameters:
## alpha = 0.9542947
##
## Initial states:
## l[0]
## 1768.619
##
## sigma^2: 1e-04
##
## AIC AICc BIC
## 33435.72 33435.73 33453.25
# The returned ETS model is ETS(M,N,N).
gg_tsresiduals(fit1)

# The variance is pretty constant and the mean is centered around zero with a
# roughly normal distribution. Again, a few residual autocorrelations look
# borderline, so I'll run a Ljung-Box test.
augment(fit1) %>%
features(.innov, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(`Close/Last`) 11.3 0.335
# The p-value is greater than 0.05, thus, the residuals are indistinguishable
# from white noise!
# g.
# Now, I am using my ETS() model to forecast the same horizon (h = 261, about a
# year of trading days).
fc1 <- fit1 %>%
forecast(h = 261) %>%
autoplot(data)
fc1

# The forecast for the ETS() model honestly looks very similar to the ARIMA()
# model forecast.
# h.
# This one is a hard call, but I am going to go with the ARIMA(4,1,1), even
# though both models look adequate. I can check the test-set RMSE to make sure,
# holding out the last 764 observations (about 30% of the series) as a test set.
train <- data %>%
filter(Date1 < 1783)
recent <- data %>%
filter(Date1 >= 1)
fit4 <- train %>%
model(ARIMA(`Close/Last`, stepwise = FALSE, approx = FALSE))
fc4 <- fit4 %>%
forecast(h = 764) %>%
autoplot(recent)
fc4

# The RMSE
fc44 <- fit4 %>%
forecast(h = 764)
accuracy(fc44, recent)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(`Close/Last`, ste… Test 137. 206. 146. 9.37 10.2 14.4 13.7 0.990
# The RMSE is 206.
# Now, the ETS() model.
fit5 <- train %>%
model(ETS(`Close/Last`))
fc5 <- fit5 %>%
forecast(h = 764) %>%
autoplot(recent)
fc5

# The RMSE
fc55 <- fit5 %>%
forecast(h = 764)
accuracy(fc55, recent)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(`Close/Last`) Test 131. 202. 141. 8.90 9.84 13.9 13.5 0.990
# The ETS() model has the lower RMSE on the test set! Thus, it is the chosen
# model, which is not what I expected.
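# The same comparison in a single table (my addition).
bind_rows(
  accuracy(fc44, recent),
  accuracy(fc55, recent)
) %>%
  select(.model, RMSE, MAE, MAPE)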