Questions
Exercise 1
# a.
# As the number of random values increases, the sample autocorrelations and the
# critical value bounds both shrink toward zero. This is because more observations
# give more precise estimates, just as standard errors shrink as the number of
# observations in a regression grows. The critical bounds are approximately
# +/- 1.96 / sqrt(T), so with only a little data they must be wide to allow for the
# larger sampling error, and they narrow as T increases. The pattern of lags is
# otherwise similar across the graphs.
# In each graph, every spike stays within the critical bounds, so there is no
# significant autocorrelation and each series looks like white noise.
# b.
# As noted in part a, larger samples give more accurate autocorrelation estimates,
# so the critical bounds (roughly +/- 1.96 / sqrt(T)) move closer to zero and there
# is less room for sampling error. Think of the sampling distribution of each
# autocorrelation becoming more tightly clustered around its mean of zero as the
# sample size grows. The autocorrelations themselves also differ across the plots
# because each one is estimated from a different random sample; with more data
# they simply sit closer to zero.
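# A minimal sketch reproducing the kind of comparison described above (my
# addition, assuming white-noise series of 36, 360, and 1,000 standard normal
# values like those in the figure the exercise refers to). The dashed bounds in
# each ACF plot sit at roughly +/- 1.96 / sqrt(T), which is why they narrow as
# the series gets longer.
set.seed(123)
for (n in c(36, 360, 1000)) {
  p <- tsibble(idx = seq_len(n), y = rnorm(n), index = idx) %>%
    ACF(y) %>%
    autoplot() +
    ggtitle(paste("White noise series of length", n))
  print(p)
}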
Exercise 2
# Plotting the daily closing prices for Amazon stock
View(gafa_stock)
gafa_stock %>%
distinct(Symbol)
## # A tibble: 4 × 1
## Symbol
## <chr>
## 1 AAPL
## 2 AMZN
## 3 FB
## 4 GOOG
# Data manipulation to filter the daily amazon close price
amzn <- gafa_stock %>%
filter(Symbol == "AMZN") %>%
select(Close)
# The plot
autoplot(amzn)
## Plot variable not specified, automatically selected `.vars = Close`

# There is a clear upward trend, which is the first sign that the series is not stationary.
# The ACF plot
amzn %>%
ACF(Close) %>%
autoplot()
## Warning: Provided data has an irregular interval, results should be treated with
## caution. Computing ACF by observation.

# The ACF decays very slowly rather than cutting off or dying out quickly, which
# is another sign of non-stationarity. From this plot alone, the exact values of
# p, d, and q are not distinguishable.
# The PACF plot
amzn %>%
PACF(Close) %>%
autoplot()
## Warning: Provided data has an irregular interval, results should be treated with
## caution. Computing ACF by observation.

# The PACF has an extremely large spike at the first lag, but it is not the only
# significant lag, so the PACF does not point to clear values of p, d, and q
# either.
# Both plots show strong autocorrelation in the series, so it is definitely not
# white noise and needs to be differenced before a model can be identified.
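# A quick follow-up check (my addition, not part of the original answer): the
# unit-root test should confirm that one difference is needed, and the ACF of the
# differenced close price should look much closer to white noise (the same
# irregular-interval warning will appear).
amzn %>%
  features(Close, unitroot_ndiffs)
amzn %>%
  mutate(diff_close = difference(Close)) %>%
  ACF(diff_close) %>%
  autoplot()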
Exercise 3
# a. Turkish GDP from global_economy
View(global_economy)
turk <- global_economy %>%
filter(Country == "Turkey") %>%
select(Year, GDP)
turk
## # A tsibble: 58 x 2 [1Y]
## Year GDP
## <dbl> <dbl>
## 1 1960 13995067818.
## 2 1961 8022222222.
## 3 1962 8922222222.
## 4 1963 10355555556.
## 5 1964 11177777778.
## 6 1965 11944444444.
## 7 1966 14122222222.
## 8 1967 15666666667.
## 9 1968 17500000000
## 10 1969 19466666667.
## # … with 48 more rows
autoplot(turk)
## Plot variable not specified, automatically selected `.vars = GDP`

# The appropriate Box-Cox transformation parameter (Guerrero method) is
turk %>%
features(GDP, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.157
turk %>%
autoplot(box_cox(GDP, 0.157))

turk1 <- turk %>%
mutate(GDP = log(GDP))
autoplot(turk1)
## Plot variable not specified, automatically selected `.vars = GDP`

# The plot of the original data suggested that the variation could be stabilized
# with a log transformation, and since the Guerrero lambda is fairly close to 0,
# I tried the log. It looks good to me.
# Now, let's find the order of differencing:
turk %>%
features(GDP, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# This function says we need one first difference, denoted y_t' = (1 - B)y_t in
# backshift notation.
# Now, to check whether I need a seasonal difference:
turk %>%
features(GDP, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
# No seasonal difference is needed, which makes sense because this is annual data
# with no seasonal pattern.
# Now, let's look at the plot to see.
turk2 <- turk1 %>%
mutate(GDP = difference(GDP)) %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`
turk2
## Warning: Removed 1 row containing missing values (`geom_line()`).

# The series looks stationary for the most part.
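# An extra confirmation (my addition): a KPSS test on the differenced log series
# should not reject stationarity (feasts caps the reported p-value at 0.1).
turk1 %>%
  mutate(GDP = difference(GDP)) %>%
  features(GDP, unitroot_kpss)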
# b. Accommodation takings in the state of Tasmania from aus_accommodation
View(aus_accommodation)
aus_accommodation %>%
distinct(State)
## # A tibble: 8 × 1
## State
## <chr>
## 1 Australian Capital Territory
## 2 New South Wales
## 3 Northern Territory
## 4 Queensland
## 5 South Australia
## 6 Tasmania
## 7 Victoria
## 8 Western Australia
tas <- aus_accommodation %>%
filter(State == "Tasmania") %>%
select(Date, Takings)
tas
## # A tsibble: 74 x 2 [1Q]
## Date Takings
## <qtr> <dbl>
## 1 1998 Q1 28.7
## 2 1998 Q2 19.0
## 3 1998 Q3 16.1
## 4 1998 Q4 25.9
## 5 1999 Q1 28.4
## 6 1999 Q2 20.1
## 7 1999 Q3 17.3
## 8 1999 Q4 24.3
## 9 2000 Q1 30.0
## 10 2000 Q2 21.7
## # … with 64 more rows
autoplot(tas)
## Plot variable not specified, automatically selected `.vars = Takings`

# There's a clear trend, so the series needs to be differenced. Let's see if
# it needs a Box-Cox transformation first.
tas %>%
features(Takings, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.00182
# The value is extremely close to zero, so it needs to be logged.
tas %>%
features(Takings, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# It needs a first difference, which is denoted y_t' = (1 - B)y_t.
tas %>%
features(Takings, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
# It also needs a seasonal difference! Thus, a seasonal difference followed by a
# first difference can be written as
# (1 - B)(1 - B^m)y_t = (1 - B - B^m + B^(m+1))y_t
#                     = y_t - y_(t-1) - y_(t-m) + y_(t-m-1)
# This is quarterly data with annual seasonality, so the seasonal lag m is 4.
tas1 <- tas %>%
mutate(Takings = difference(log(Takings)))
# Let's see the series so far.
autoplot(tas1)
## Plot variable not specified, automatically selected `.vars = Takings`
## Warning: Removed 1 row containing missing values (`geom_line()`).

# The data seems more stable, but the seasonality is still apparent. Thus,
# I'm taking the seasonal difference with lag 4 since it's quarterly data with
# annual seasonality.
tas2 <- tas1 %>%
mutate(Takings = difference(Takings, 4))
autoplot(tas2)
## Plot variable not specified, automatically selected `.vars = Takings`
## Warning: Removed 5 rows containing missing values (`geom_line()`).

# Now this looks like white noise.
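# As a sanity check (my addition), the ACF of the doubly differenced series
# should show few significant spikes if it really is close to white noise.
tas2 %>%
  ACF(Takings) %>%
  autoplot()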
# c. Monthly Sales from souvenirs
View(souvenirs)
autoplot(souvenirs)
## Plot variable not specified, automatically selected `.vars = Sales`

# There's an increasing trend with some seasonality.
souvenirs %>%
features(Sales, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.00212
# The lambda is very close to 0, so we will use log as our transformation.
souvenirs %>%
features(Sales, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# So one non-seasonal difference is needed. Combined with the seasonal difference
# checked next, the backshift notation can be written as
# (1 - B)(1 - B^m)y_t = (1 - B - B^m + B^(m+1))y_t
#                     = y_t - y_(t-1) - y_(t-m) + y_(t-m-1)
souvenirs %>%
features(Sales, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
# and a seasonal difference. It's monthly data with annual seasonality. Thus,
# the lag will be equal to 12.
souvenirs1 <- souvenirs %>%
mutate(Sales = difference(log(Sales)))
autoplot(souvenirs1)
## Plot variable not specified, automatically selected `.vars = Sales`
## Warning: Removed 1 row containing missing values (`geom_line()`).

# There is still seasonality so,
souvenirs2 <- souvenirs1 %>%
mutate(Sales = difference(Sales, 12))
autoplot(souvenirs2)
## Plot variable not specified, automatically selected `.vars = Sales`
## Warning: Removed 13 rows containing missing values (`geom_line()`).

# This indeed looks like white noise!
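# A formal check to go with the visual one (my addition): a Ljung-Box test on the
# transformed and differenced sales, dropping the NAs created by the differencing.
# A large p-value is consistent with white noise.
souvenirs2 %>%
  filter(!is.na(Sales)) %>%
  features(Sales, ljung_box, lag = 24)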
Exercise 4
# The function given in the homework set is
ar1 <- function(phi, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + e[i]
  }
  tsibble(idx = seq_len(n), y = y, index = idx)
}
data <- ar1(0.6)
data
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 -0.554
## 3 3 -0.212
## 4 4 0.0510
## 5 5 -0.690
## 6 6 0.596
## 7 7 1.98
## 8 8 1.89
## 9 9 1.79
## 10 10 1.57
## # … with 90 more rows
# a.
# Using the code to generate data from an AR(1) model with phi = 0.6 and sigma
# squared equal to 1. Since e is drawn with rnorm(), whose default standard
# deviation is 1, sigma squared is already equal to 1. The function has order
# p = 1 because only one lagged value enters the recursion. Thus,
ar1(0.6)
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 0.0795
## 3 3 -1.28
## 4 4 0.125
## 5 5 -0.777
## 6 6 -2.13
## 7 7 -2.22
## 8 8 -1.74
## 9 9 -2.74
## 10 10 -1.28
## # … with 90 more rows
# is all we need.
# b. produce a time plot for the series
# I used the function from above to create a function that plots the
# series given the value of phi.
ar2 <- function(phi, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + e[i]
  }
  autoplot(tsibble(idx = seq_len(n), y = y, index = idx))
}
# Now I'm choosing different values of phi to observe the graph.
# when phi = 0.1:
trail1 <- ar2(0.1)
## Plot variable not specified, automatically selected `.vars = y`
trail1

# when phi = 0.2:
trail2 <- ar2(0.2)
## Plot variable not specified, automatically selected `.vars = y`
trail2

# when phi = 0.3
trail3 <- ar2(0.3)
## Plot variable not specified, automatically selected `.vars = y`
trail3

# when phi = 0.5
trail4 <- ar2(0.5)
## Plot variable not specified, automatically selected `.vars = y`
trail4

# when phi = 0.8
trail5 <- ar2(0.8)
## Plot variable not specified, automatically selected `.vars = y`
trail5

# when phi = 0.9
trail6 <- ar2(0.9)
## Plot variable not specified, automatically selected `.vars = y`
trail6

# As phi gets closer to 1, the series is supposed to behave more and more like a
# random walk (with c = 0 there is no drift), and with phi = 0.9 the plot already
# wanders like one.
# Let's try a negative value for phi
trail7 <- ar2(-0.5)
## Plot variable not specified, automatically selected `.vars = y`
trail7

# With a negative phi, the series does indeed tend to oscillate between positive
# and negative values around its mean.
# Let's try phi = 0 to see if it's white noise.
trail8 <- ar2(0)
## Plot variable not specified, automatically selected `.vars = y`
trail8

# Seems like white noise to me.
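# A compact way to see the effect of phi in a single figure (my addition; the
# chosen phi values and facet layout are my own). Re-seeding before each call
# means every panel uses the same error sequence, so only phi differs.
phis <- c(-0.5, 0, 0.3, 0.6, 0.9)
sims <- dplyr::bind_rows(lapply(phis, function(p) {
  set.seed(42)
  ar1(p) %>%
    as_tibble() %>%
    mutate(phi = paste("phi =", p))
}))
ggplot(sims, aes(x = idx, y = y)) +
  geom_line() +
  facet_wrap(~ phi, ncol = 1, scales = "free_y")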
# c. Write your own code to generate data from an MA(1) model with
# theta = 0.6 and sigma squared equal to 1.
# The AR part disappears when phi = 0, so I created a function that sets phi to
# zero by default and adds the previous period's error term to get an MA(1).
# I am assuming c = 0 for simplicity.
ma1 <- function(theta, phi = 0, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + e[i] + theta * e[i - 1]
  }
  tsibble(idx = seq_len(n), y = y, index = idx)
}
# Since the errors come from rnorm() with its default standard deviation of 1,
# sigma squared is automatically 1.
# Thus, an MA(1) model with theta_1 = 0.6 and sigma squared equal to 1 is
data1 <- ma1(0.6)
data1
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 0.237
## 3 3 0.612
## 4 4 2.03
## 5 5 2.26
## 6 6 2.62
## 7 7 2.68
## 8 8 -0.451
## 9 9 -0.304
## 10 10 0.171
## # … with 90 more rows
# Now, a time plot for the series has the code:
ma2 <- function(theta, phi = 0, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + e[i] + theta * e[i - 1]
  }
  autoplot(tsibble(idx = seq_len(n), y = y, index = idx))
}
# d. Now we can examine how the plot changes as theta_1 changes.
# When theta_1 = 0:
t1 <- ma2(0)
## Plot variable not specified, automatically selected `.vars = y`
t1

# This is a graph of the pure white noise with no weight from the errors
# in the past.
# when theta_1 = 0.5
t2 <- ma2(0.5)
## Plot variable not specified, automatically selected `.vars = y`
t2

# It seems to be becoming more "smooth"
# when theta_1 = 0.8:
t3 <- ma2(0.8)
## Plot variable not specified, automatically selected `.vars = y`
t3

# Again, more "smooth"
# when theta_1 = 0.9:
t4 <- ma2(0.9)
## Plot variable not specified, automatically selected `.vars = y`
t4

# e. Generate data from an ARMA(1,1) model with phi_1 = 0.6,
# theta_1 = 0.6, and sigma squared equal to 1.
# Here is the code for the ARMA(1,1) model; it is the MA(1) function with phi no
# longer fixed at 0.
ARMA11 <- function(theta, phi, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) {
    y[i] <- phi * y[i - 1] + theta * e[i - 1] + e[i]
  }
  tsibble(idx = seq_len(n), y = y, index = idx)
}
# and so my generated data from an ARMA(1,1) model with phi_1 = 0.6,
# theta_1 = 0.6, and sigma squared equal to 1 is
t11 <- ARMA11(0.6, 0.6)
t11
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 0.212
## 3 3 -1.10
## 4 4 -1.90
## 5 5 -0.838
## 6 6 1.55
## 7 7 1.49
## 8 8 -2.58
## 9 9 -4.26
## 10 10 -4.16
## # … with 90 more rows
# f. From an AR(2) model
AR2 <- function(phi1, phi2, n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 3:n) {
    y[i] <- phi1 * y[i - 1] + phi2 * y[i - 2] + e[i]
  }
  tsibble(idx = seq_len(n), y = y, index = idx)
}
# with phi_1 = -0.8, phi_2 = 0.3, and sigma squared = 1 generates data:
data3 <- AR2(-0.8, 0.3)
data3
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 0
## 3 3 2.18
## 4 4 1.85
## 5 5 -1.47
## 6 6 -3.67
## 7 7 -0.590
## 8 8 4.38
## 9 9 1.50
## 10 10 -2.25
## # … with 90 more rows
# g. Graphing the latter two series to compare.
autoplot(t11)
## Plot variable not specified, automatically selected `.vars = y`

# The movement of the ARMA(1,1) series is fairly smooth.
autoplot(data3)
## Plot variable not specified, automatically selected `.vars = y`

# This series is much less smooth: it oscillates around zero with growing
# amplitude, because phi_1 = -0.8 and phi_2 = 0.3 violate the stationarity
# conditions for an AR(2), whereas the ARMA(1,1) above is stationary.
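# To back up that comparison (my addition), the ACFs of the two simulated series
# are informative: the ARMA(1,1) ACF should die out quickly, while the ACF of the
# non-stationary AR(2) should stay large and alternate in sign.
t11 %>%
  ACF(y) %>%
  autoplot()
data3 %>%
  ACF(y) %>%
  autoplot()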
Exercise 5
# a.
View(aus_airpassengers)
# Using the ARIMA() function to find an appropriate ARIMA model.
# The data:
data <- aus_airpassengers %>%
as_tsibble(index = Year)
autoplot(data)
## Plot variable not specified, automatically selected `.vars = Passengers`

# Now the fit
fit <- data %>%
model(ARIMA(Passengers))
report(fit)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8963
## s.e. 0.0594
##
## sigma^2 estimated as 4.308: log likelihood=-97.02
## AIC=198.04 AICc=198.32 BIC=201.65
# and so the appropriate ARIMA model is ARIMA(0,2,1).
# Now, I am checking the residuals of my model.
gg_tsresiduals(fit)

# The variance isn't perfectly constant since there are several outliers, but the
# mean is centered around zero and there is no significant autocorrelation, which
# is what matters most. Overall, the residuals look like white noise.
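# A formal residual check to go with the plots (my addition; dof = 1 because the
# ARIMA(0,2,1) model has a single estimated coefficient).
augment(fit) %>%
  features(.innov, ljung_box, lag = 10, dof = 1)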
# Now, I am plotting the forecasts of the next 10 periods.
fc <- fit %>%
forecast(h = 10) %>%
autoplot(data)
fc

# seems pretty accurate to me.
# b. Write the model in terms of the backshift operator.
# The model was differenced twice and has a first-order moving average term.
# There is no autoregressive part and no constant in the fitted model. The
# second difference is
# y_t'' = (y_t - y_t-1) - (y_t-1 - y_t-2) = y_t - 2y_t-1 + y_t-2
#       = (1 - 2B + B^2)y_t = (1 - B)^2 y_t
# and the first-order moving average part contributes (1 + theta_1*B)e_t.
# Thus, the overall model in backshift notation is:
# (1 - B)^2 y_t = (1 + theta_1*B)e_t
# c.
# Plotting forecasts from an ARIMA(0,1,0) model with drift. The `1` in the model
# formula below tells ARIMA() to include a constant, which provides the drift.
fit1 <- data %>%
model(ARIMA(Passengers ~ 1 + pdq(0,1,0)))
# Now, the forecast
fc1 <- fit1 %>%
forecast(h = 10) %>%
autoplot(data)
fc1

# The forecast doesn't look too different from the automatically selected ARIMA model.
report(fit1)
## Series: Passengers
## Model: ARIMA(0,1,0) w/ drift
##
## Coefficients:
## constant
## 1.4191
## s.e. 0.3014
##
## sigma^2 estimated as 4.271: log likelihood=-98.16
## AIC=200.31 AICc=200.59 BIC=203.97
# The AICc is 2 units higher though.
# d.
# Plotting the forecast of the ARIMA(2,1,2) model w/ drift.
fit2 <- data %>%
model(ARIMA(Passengers ~ 1 + pdq(2,1,2)))
fc2 <- fit2 %>%
forecast(h = 10) %>%
autoplot(data)
fc2

# The point forecasts from this model are not a straight line; the AR and MA
# terms give the forecast path a bit more curvature.
report(fit2)
## Series: Passengers
## Model: ARIMA(2,1,2) w/ drift
##
## Coefficients:
## ar1 ar2 ma1 ma2 constant
## -0.5518 -0.7327 0.5895 1.0000 3.2216
## s.e. 0.1684 0.1224 0.0916 0.1008 0.7224
##
## sigma^2 estimated as 4.031: log likelihood=-96.23
## AIC=204.46 AICc=206.61 BIC=215.43
# However, the AICc is the largest of any model fitted so far.
# Now, removing the constant
fit3 <- data %>%
model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
# It returned an error: non-stationary AR part from CSS. I am guessing this is
# because the series trends upward, so without a constant (drift) the AR part
# has to absorb the trend and its estimated coefficients become non-stationary.
# e.
# The forecast from an ARIMA(0,2,1) model with a constant.
fit4 <- data %>%
model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))
## Warning: Model specification induces a quadratic or higher order polynomial trend.
## This is generally discouraged, consider removing the constant or reducing the number of differences.
# This returns a warning message: Model specification induces a quadratic or
# higher order polynomial trend. This is generally discouraged, Consider
# removing the constant or reducing the number of differences.
fc4 <- fit4 %>%
forecast(h = 10) %>%
autoplot(data)
fc4

# The warning makes sense: the point forecasts follow a quadratic-looking trend,
# because a constant combined with two differences induces a quadratic (or
# higher order) polynomial trend.
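# Putting the candidate models side by side makes the AICc comparison easier to
# read (my addition; the model labels are my own).
data %>%
  model(
    auto      = ARIMA(Passengers),
    drift_010 = ARIMA(Passengers ~ 1 + pdq(0, 1, 0)),
    drift_212 = ARIMA(Passengers ~ 1 + pdq(2, 1, 2))
  ) %>%
  glance() %>%
  select(.model, AICc)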
Exercise 6
# United States GDP
View(global_economy)
global_economy %>%
distinct(Country)
## # A tibble: 263 × 1
## Country
## <fct>
## 1 Afghanistan
## 2 Albania
## 3 Algeria
## 4 American Samoa
## 5 Andorra
## 6 Angola
## 7 Antigua and Barbuda
## 8 Arab World
## 9 Argentina
## 10 Armenia
## # … with 253 more rows
data <- global_economy %>%
filter(Country == "United States") %>%
select(Year, Country, GDP) %>%
as_tsibble(index = Year)
data
## # A tsibble: 58 x 3 [1Y]
## # Key: Country [1]
## Year Country GDP
## <dbl> <fct> <dbl>
## 1 1960 United States 543300000000
## 2 1961 United States 563300000000
## 3 1962 United States 605100000000
## 4 1963 United States 638600000000
## 5 1964 United States 685800000000
## 6 1965 United States 743700000000
## 7 1966 United States 815000000000
## 8 1967 United States 861700000000
## 9 1968 United States 942500000000
## 10 1969 United States 1019900000000
## # … with 48 more rows
autoplot(data)
## Plot variable not specified, automatically selected `.vars = GDP`

# a.
data %>%
features(GDP, features = guerrero)
## # A tibble: 1 × 2
## Country lambda_guerrero
## <fct> <dbl>
## 1 United States 0.282
# The lambda is fairly close to 0, so I will use a log transformation.
data1 <- data %>%
mutate(GDP = log(GDP))
autoplot(data1)
## Plot variable not specified, automatically selected `.vars = GDP`

# it does seem more balanced.
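# For comparison (my addition), the exact Guerrero lambda can be applied directly;
# the resulting shape is very close to the log-transformed series.
data %>%
  autoplot(box_cox(GDP, 0.282))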
# b.
# using the ARIMA() function to fit a suitable ARIMA model to the
# transformed data.
fit <- data %>%
model(ARIMA(log(GDP)))
report(fit)
## Series: GDP
## Model: ARIMA(0,2,1)
## Transformation: log(GDP)
##
## Coefficients:
## ma1
## -0.6353
## s.e. 0.1138
##
## sigma^2 estimated as 0.0004278: log likelihood=139.76
## AIC=-275.53 AICc=-275.3 BIC=-271.48
# The chosen, suitable ARIMA model is ARIMA(0,2,1).
# c.
# This function chooses appropriate orders
data %>%
model(ARIMA(log(GDP) ~ pdq(p=1:3, d=1, q=0:2)))
## # A mable: 1 x 2
## # Key: Country [1]
## Country `ARIMA(log(GDP) ~ pdq(p = 1:3, d = 1, q = 0:2))`
## <fct> <model>
## 1 United States <ARIMA(1,1,1) w/ drift>
# It gave me an ARIMA(1,1,1) w/ a drift
# The fit
fit1 <- data %>%
model(ARIMA(log(GDP) ~ 1 + pdq(1,1,1)))
report(fit1)
## Series: GDP
## Model: ARIMA(1,1,1) w/ drift
## Transformation: log(GDP)
##
## Coefficients:
## ar1 ma1 constant
## 0.9270 -0.5751 0.0042
## s.e. 0.0636 0.1552 0.0010
##
## sigma^2 estimated as 0.0004144: log likelihood=143.16
## AIC=-278.32 AICc=-277.55 BIC=-270.14
# This has a lower AICc than the automatically selected ARIMA(0,2,1)!
# d.
# Based on the AICc, the second model, ARIMA(1,1,1) w/ drift is best!
# Now, to check the residuals
gg_tsresiduals(fit1)

# The variance seems roughly constant with a few outliers. There is no significant
# autocorrelation and the mean is centered around zero with a roughly normal
# distribution. So far, so good.
# e.
# Now, the forecast and the plot
fc2 <- fit1 %>%
forecast(h = 10) %>%
autoplot(data)
fc2

# Since the original plot is exponentially increasing, it's reasonable that
# the forecast is too!
# f.
# Now, an ETS() model with no transformation
fit2 <- data %>%
model(ETS(GDP))
report(fit2)
## Series: GDP
## Model: ETS(M,A,N)
## Smoothing parameters:
## alpha = 0.9990876
## beta = 0.5011949
##
## Initial states:
## l[0] b[0]
## 448093333334 64917355687
##
## sigma^2: 7e-04
##
## AIC AICc BIC
## 3190.787 3191.941 3201.089
# The AICc is huge, but it cannot be compared with the ARIMA model's AICc: the
# ETS model is fitted to the untransformed GDP and is a different model class, so
# the likelihoods (and hence the information criteria) are on completely different
# scales. A fair comparison has to be based on forecast accuracy instead; a quick
# holdout comparison is sketched at the end of this exercise.
# Let's see what the forecast looks like
fc3 <- fit2 %>%
forecast(h = 10) %>%
autoplot(data)
fc3

# The prediction interval is much wider, presumably because of the larger residual variance.
# Let's see
gg_tsresiduals(fit2)

# Indeed, the residuals do not look normally distributed. The point forecasts
# from the model still seem fairly reasonable, though.
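# A quick way to compare the two model classes fairly (my addition, a sketch with
# an arbitrary holdout of the last ten years): fit both on data up to 2007 and
# measure accuracy on 2008-2017.
gdp_train <- data %>%
  filter(Year <= 2007)
bind_rows(
  gdp_train %>%
    model(ARIMA(log(GDP) ~ 1 + pdq(1, 1, 1))) %>%
    forecast(h = 10) %>%
    accuracy(data),
  gdp_train %>%
    model(ETS(GDP)) %>%
    forecast(h = 10) %>%
    accuracy(data)
) %>%
  select(.model, RMSE, MAPE)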
Exercise 7
# a.
# Choosing Japan as my country.
data <- aus_arrivals %>%
filter(Origin == "Japan") %>%
as_tsibble(index = Quarter)
data
## # A tsibble: 127 x 3 [1Q]
## # Key: Origin [1]
## Quarter Origin Arrivals
## <qtr> <chr> <int>
## 1 1981 Q1 Japan 14763
## 2 1981 Q2 Japan 9321
## 3 1981 Q3 Japan 10166
## 4 1981 Q4 Japan 19509
## 5 1982 Q1 Japan 17117
## 6 1982 Q2 Japan 10617
## 7 1982 Q3 Japan 11737
## 8 1982 Q4 Japan 20961
## 9 1983 Q1 Japan 20671
## 10 1983 Q2 Japan 12235
## # … with 117 more rows
# b.
# Apparently I need to difference the data to obtain stationary data. I want to
# check this and, if so, how many differences I should take.
# First, let's see if I need a Box-Cox transformation.
data %>%
features(Arrivals, features = guerrero)
## # A tibble: 1 × 2
## Origin lambda_guerrero
## <chr> <dbl>
## 1 Japan 0.254
# It's fairly close to zero, but let's check the original graph before
# any changes are made.
autoplot(data)
## Plot variable not specified, automatically selected `.vars = Arrivals`

# Yes, the variation grows with the level of the series, so a transformation is
# worth trying before differencing. Let's see what the log does.
data1 <- data %>%
mutate(Arrivals = log(Arrivals))
autoplot(data1)
## Plot variable not specified, automatically selected `.vars = Arrivals`

# It didn't really fix anything, so maybe I should just stick to taking the difference
# of the original data. Let's see how many we should take.
data %>%
features(Arrivals, unitroot_ndiffs)
## # A tibble: 1 × 2
## Origin ndiffs
## <chr> <int>
## 1 Japan 1
# There should be one non-seasonal difference.
data %>%
features(Arrivals, unitroot_nsdiffs)
## # A tibble: 1 × 2
## Origin nsdiffs
## <chr> <int>
## 1 Japan 1
# and also a seasonal difference.
# Choosing the lag to be 4 since it's quarterly data with annual seasonality.
data2 <- data %>%
mutate(Arrivals = difference(difference(Arrivals), 4)) %>%
as_tsibble(index = Quarter)
# Let's look at the plot
autoplot(data2)
## Plot variable not specified, automatically selected `.vars = Arrivals`
## Warning: Removed 5 rows containing missing values (`geom_line()`).

# Seems stationary to me!
# c.
# The ACF graph of the differenced data
data2 %>%
ACF(Arrivals) %>%
autoplot()

# There are a few significant spikes in the ACF. The significant spike at lag 4
# suggests a seasonal MA(1) component, and the significant spike at lag 1 suggests
# a non-seasonal MA(1) component. I am not sure how to deal with the spike at
# lag 6; maybe an MA(5) or MA(6) would capture it.
# d.
# The PACF graph
data2 %>%
PACF(Arrivals) %>%
autoplot()

# There is a very significant spike at lag 4, along with several other smaller
# spikes throughout the graph.
# e.
# As stated above, the plots suggest an ARIMA(0,1,1)(0,1,1)[4]. If we had instead
# started with the PACF to choose the non-seasonal component and used the ACF to
# select the seasonal part of the model, we would have been left with the same
# model as before, ARIMA(0,1,1)(0,1,1)[4].
# f.
# I am now using the automatic ARIMA() function to choose a model. I will then
# set stepwise = FALSE and approx = FALSE so the function searches a much larger
# set of models. I am using the original (undifferenced) data to fit the models.
fit <- data %>%
model(auto1 = ARIMA(Arrivals))
report(fit)
## Series: Arrivals
## Model: ARIMA(0,1,1)(1,1,1)[4]
##
## Coefficients:
## ma1 sar1 sma1
## -0.4228 -0.2305 -0.5267
## s.e. 0.0944 0.1337 0.1246
##
## sigma^2 estimated as 174801727: log likelihood=-1330.66
## AIC=2669.32 AICc=2669.66 BIC=2680.54
# The automatic ARIMA() function gives ARIMA(0,1,1)(1,1,1)[4], very close to my
# estimate, just with a seasonal AR(1) added. Now I am going to make the search
# exhaustive.
fit2 <- data %>%
model(auto2 = ARIMA(Arrivals, stepwise = FALSE, approx = FALSE))
report(fit2)
## Series: Arrivals
## Model: ARIMA(5,1,0)(0,1,1)[4]
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 sma1
## -0.4099 -0.2729 -0.2468 -0.3897 -0.2206 -0.4577
## s.e. 0.0880 0.0910 0.0889 0.1326 0.0887 0.1462
##
## sigma^2 estimated as 167809816: log likelihood=-1326.65
## AIC=2667.3 AICc=2668.28 BIC=2686.93
# The exhaustive search gives an ARIMA(5,1,0)(0,1,1)[4]. I'm guessing the
# significant spike at lag 6 in the ACF graph is part of why a non-seasonal
# AR(5) component ends up in the model.
# Now, I am checking the AICc of my estimated model based on the ACF and PACF
# graph.
fit3 <- data %>%
model(ARIMA(Arrivals ~ pdq(0,1,1) + PDQ(0,1,1)))
report(fit3)
## Series: Arrivals
## Model: ARIMA(0,1,1)(0,1,1)[4]
##
## Coefficients:
## ma1 sma1
## -0.4256 -0.6755
## s.e. 0.1005 0.0659
##
## sigma^2 estimated as 177223035: log likelihood=-1332.01
## AIC=2670.02 AICc=2670.23 BIC=2678.43
# My model has a slightly higher AICc than the exhaustive-search model. Therefore,
# the ARIMA(5,1,0)(0,1,1)[4] is the best of these by AICc.
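# Collecting all three candidates in one mable makes the AICc comparison easier
# to read (my addition; the labels are my own).
data %>%
  model(
    auto   = ARIMA(Arrivals),
    search = ARIMA(Arrivals, stepwise = FALSE, approx = FALSE),
    manual = ARIMA(Arrivals ~ pdq(0, 1, 1) + PDQ(0, 1, 1))
  ) %>%
  glance() %>%
  select(.model, AICc)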
# g.
# It's hard to write out the equation in R comments because of the Greek letters,
# so phi_i denotes the non-seasonal AR coefficients and Theta_1 the seasonal MA
# coefficient.
# The ARIMA(5,1,0)(0,1,1)[4] model in terms of the backshift operator is:
# (1 - phi_1*B - phi_2*B^2 - phi_3*B^3 - phi_4*B^4 - phi_5*B^5)(1 - B)(1 - B^4)y_t = (1 + Theta_1*B^4)epsilon_t
# Without the backshift operator, letting w_t = y_t - y_(t-1) - y_(t-4) + y_(t-5):
# y_t = y_(t-1) + y_(t-4) - y_(t-5)
#     + phi_1*w_(t-1) + phi_2*w_(t-2) + phi_3*w_(t-3) + phi_4*w_(t-4) + phi_5*w_(t-5)
#     + Theta_1*epsilon_(t-4) + epsilon_t
# that took a little while :) hopefully it's comprehensible.
Exercise 8
# The link that was in the homework set wouldn't load, so I looked up the data
# online and imported it manually. I found gold price data.
library(readxl)  # needed for read_excel()
Gold <- read_excel("C:/Users/summe_iy44w2w/Downloads/Gold.xlsx")
View(Gold)
# I guess I'll do the Close price.
# The original graph
data <- Gold %>%
mutate(Date1 = row_number()) %>%
select(Date1, `Close/Last`) %>%
as_tsibble(index = Date1)
data
## # A tsibble: 2,547 x 2 [1]
## Date1 `Close/Last`
## <int> <dbl>
## 1 1 1769
## 2 2 1763
## 3 3 1776.
## 4 4 1775.
## 5 5 1774.
## 6 6 1774.
## 7 7 1754.
## 8 8 1716.
## 9 9 1716
## 10 10 1680.
## # … with 2,537 more rows
autoplot(data)
## Plot variable not specified, automatically selected `.vars = Close/Last`

# The ACF plot
data %>%
ACF(`Close/Last`) %>%
autoplot()

# I wonder if it's autocorrelated (kidding). I know it's decreasing, but
# I can't tell if it is decreasing exponentially.
data %>%
PACF(`Close/Last`) %>%
autoplot()

# There is a huge spike at the first lag of the PACF.
# Let's see how many differences the data needs and whether it needs a Box-Cox transformation.
data %>%
features(`Close/Last`, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 -0.161
# It's pretty close to zero, so let's try the log.
data1 <- data %>%
mutate(`Close/Last` = log(`Close/Last`))
autoplot(data1)
## Plot variable not specified, automatically selected `.vars = Close/Last`

# The log barely changes the shape, so let's skip the transformation.
data %>%
features(`Close/Last`, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# There should be one difference.
data %>%
features(`Close/Last`, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
# and no seasonal difference, which makes sense for daily price data.
data2 <- data %>%
mutate(`Close/Last` = difference(`Close/Last`))
autoplot(data2)
## Plot variable not specified, automatically selected `.vars = Close/Last`
## Warning: Removed 1 row containing missing values (`geom_line()`).

# Looks stationary
data2 %>%
ACF(`Close/Last`) %>%
autoplot()

# No strong pattern stands out in the ACF of the differenced series.
data2 %>%
PACF(`Close/Last`) %>%
autoplot()

# The PACF looks pretty similar to the ACF.
# I am going to use the auto ARIMA() method to find the appropriate
# model.
fit <- data %>%
model(ARIMA(`Close/Last`, stepwise = FALSE, approx = FALSE))
report(fit)
## Series: Close/Last
## Model: ARIMA(4,1,1)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1
## 0.5937 0.0431 0.0190 -0.0652 -0.6367
## s.e. 0.1951 0.0244 0.0232 0.0201 0.1951
##
## sigma^2 estimated as 217.7: log likelihood=-10462.93
## AIC=20937.87 AICc=20937.9 BIC=20972.92
# It gave me an ARIMA(4,1,1) model, which is consistent with the patterns in the
# ACF and PACF plots.
# c.
# I am checking the residuals of my ARIMA model.
gg_tsresiduals(fit)

# The variance is pretty constant and the mean is centered around zero with a
# roughly normal distribution. A few residual autocorrelations look borderline,
# so I'll check them with a Ljung-Box test.
augment(fit) %>%
features(.resid, ljung_box, lag = 10, dof = 1)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(`Close/Last`, stepwise = FALSE, approx = FALSE) 3.92 0.917
# The p-value is > 0.05, so the residual autocorrelations are not significant;
# the residuals are indistinguishable from white noise.
# d.
# My forecast for roughly the next year of trading days (h = 261)
fc <- fit %>%
forecast(h = 261) %>%
autoplot(data)
fc

# The point forecasts look reasonable, though the prediction interval is quite wide.
# e.
# I use the automatic ETS() model to choose one for me.
fit1 <- data %>%
model(ETS(`Close/Last`))
report(fit1)
## Series: Close/Last
## Model: ETS(M,N,N)
## Smoothing parameters:
## alpha = 0.9542947
##
## Initial states:
## l[0]
## 1768.619
##
## sigma^2: 1e-04
##
## AIC AICc BIC
## 33435.72 33435.73 33453.25
# The returned ETS model is ETS(M,N,N).
gg_tsresiduals(fit1)

# The variance is pretty constant and the mean is centered around zero with a
# roughly normal distribution. Again, a few residual autocorrelations look
# borderline, so I'll run a Ljung-Box test.
augment(fit1) %>%
features(.innov, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(`Close/Last`) 11.3 0.335
# The p-value is greater than 0.05, thus, the residuals are indistinguishable
# from white noise!
# g.
# Now, I am using my ETS() model to forecast the same horizon (h = 261, about a
# year of trading days).
fc1 <- fit1 %>%
forecast(h = 261) %>%
autoplot(data)
fc1

# The forecast for the ETS() model honestly looks very similar to the ARIMA()
# model forecast.
# h.
# This one is a hard call, but I am going to go with the ARIMA(4,1,1), even
# though both models look adequate. I can check the test-set RMSE to make sure,
# holding out the last 764 observations (about 30% of the series) as a test set.
train <- data %>%
filter(Date1 < 1783)
recent <- data %>%
filter(Date1 >= 1)
fit4 <- train %>%
model(ARIMA(`Close/Last`, stepwise = FALSE, approx = FALSE))
fc4 <- fit4 %>%
forecast(h = 764) %>%
autoplot(recent)
fc4

# The RMSE
fc44 <- fit4 %>%
forecast(h = 764)
accuracy(fc44, recent)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(`Close/Last`, ste… Test 137. 206. 146. 9.37 10.2 14.4 13.7 0.990
# The RMSE is 206.
# Now, the ETS() model.
fit5 <- train %>%
model(ETS(`Close/Last`))
fc5 <- fit5 %>%
forecast(h = 764) %>%
autoplot(recent)
fc5

# The RMSE
fc55 <- fit5 %>%
forecast(h = 764)
accuracy(fc55, recent)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(`Close/Last`) Test 131. 202. 141. 8.90 9.84 13.9 13.5 0.990
# The ETS() model has the lower RMSE on the test set! Thus, it is the chosen
# model, which is not what I expected.
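# The same comparison in a single table (my addition).
bind_rows(
  accuracy(fc44, recent),
  accuracy(fc55, recent)
) %>%
  select(.model, RMSE, MAE, MAPE)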