9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8 in Hyndman

9.1 Explain the differences among these figures. Do they all indicate that the data are white noise? Each figure is based on a different number of observations, and as the series length increases the ACF bounds (critical values) become narrower. All three do indicate white noise, since none of the autocorrelations extend beyond the critical values. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise? The critical values sit at different distances from zero because they are approximately +/- 1.96/sqrt(T), so larger samples give tighter bounds; the sample autocorrelations are also smaller with larger samples because random fluctuations average out, meaning a smaller deviation from zero is enough to signal a real autocorrelation. Because each figure is a different random sample, the ACFs differ in pattern, but none of them show spikes crossing the critical values at any lag.
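As a rough check, the 95% ACF bounds are approximately +/- 1.96/sqrt(T). A small sketch (the textbook figures use series of 36, 360 and 1,000 random numbers):

#approximate 95% ACF critical bounds for each series length
T_sizes <- c(36, 360, 1000)
data.frame(T = T_sizes, bound = round(1.96 / sqrt(T_sizes), 3))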

9.2 A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

library(fpp3)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.5
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.3     ✔ fable       0.4.0
## ✔ ggplot2     3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ readr   2.1.5
## ✔ purrr   1.0.2     ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(latex2exp)
library(seasonal)
## 
## Attaching package: 'seasonal'
## 
## The following object is masked from 'package:tibble':
## 
##     view
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(mlbench)
library(AppliedPredictiveModeling)
library(e1071)
## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:fabletools':
## 
##     interpolate
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## 
## The following object is masked from 'package:e1071':
## 
##     impute
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## The following objects are masked from 'package:fabletools':
## 
##     MAE, RMSE
library(dplyr)
#select relevant rows and columns and graph
head(gafa_stock)
## # A tsibble: 6 x 8 [!]
## # Key:       Symbol [1]
##   Symbol Date        Open  High   Low Close Adj_Close    Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
## 1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200
## 2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900
## 3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700
## 4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300
## 5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400
## 6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200
  unique(gafa_stock$Symbol)
## [1] "AAPL" "AMZN" "FB"   "GOOG"
  gafa_stock |> filter(Symbol == "AMZN") |> select(Date, Close, Symbol) |> autoplot() + labs(subtitle = "Amazon Close")
## Plot variable not specified, automatically selected `.vars = Close`

  amazonstock <- gafa_stock |>  filter(Symbol == "AMZN") |> select(Date, Close, Symbol)
  #graph ACF
amazonstock |> ACF(Close) |>
  autoplot() + labs(subtitle = "Amazon closing stock price")
## Warning: Provided data has an irregular interval, results should be treated
## with caution. Computing ACF by observation.

#graph PACF
amazonstock |>
  PACF(Close) |>
  autoplot()
## Warning: Provided data has an irregular interval, results should be treated
## with caution. Computing ACF by observation.

#graph the difference
amazonstock |>
  mutate(diff_close = difference(Close)) |> 
 autoplot(diff_close) + labs(title = "Difference in closing Amazon")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

The plot of Amazon's closing stock price shows the series is non-stationary: there is a clear trend, with a general increase followed by a decrease. The ACF shows very strong autocorrelations that decay slowly over many lags, and the PACF also shows significant spikes at several lags. After taking the difference between each observation and the previous one, the data resemble white noise and appear stationary.
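As a quick cross-check (a sketch using the amazonstock tsibble defined above; output not shown), the KPSS statistic and unitroot_ndiffs should point to the same conclusion that one difference is needed:

amazonstock |>
  features(Close, list(unitroot_kpss, unitroot_ndiffs))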

9.3 For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

Turkish GDP from global_economy. Accommodation takings in the state of Tasmania from aus_accommodation. Monthly sales from souvenirs.

head(global_economy)
## # A tsibble: 6 x 9 [1Y]
## # Key:       Country [1]
##   Country     Code   Year         GDP Growth   CPI Imports Exports Population
##   <fct>       <fct> <dbl>       <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>
## 1 Afghanistan AFG    1960  537777811.     NA    NA    7.02    4.13    8996351
## 2 Afghanistan AFG    1961  548888896.     NA    NA    8.10    4.45    9166764
## 3 Afghanistan AFG    1962  546666678.     NA    NA    9.35    4.88    9345868
## 4 Afghanistan AFG    1963  751111191.     NA    NA   16.9     9.17    9533954
## 5 Afghanistan AFG    1964  800000044.     NA    NA   18.1     8.89    9731361
## 6 Afghanistan AFG    1965 1006666638.     NA    NA   21.4    11.3     9938414
turkeygdp <- global_economy |> dplyr::select(Country, Year, GDP) |> filter(Country == "Turkey")  
turkeygdp |> autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`

turkeygdp$GDP <- as.numeric(turkeygdp$GDP)
#boxturkey <-  BoxCoxTrans(turkeygdp$GDP)
lambda <- turkeygdp |>  #test to see if any lambda would be helpful
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)
turkeygdp |>
  autoplot(box_cox(GDP, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed GDP with $\\lambda$ = ",
         round(lambda,2))))

A Box-Cox transformation alone is not enough because a trend still remains, so a first difference was applied.
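The number of first differences required can also be checked directly with unitroot_ndiffs (a sketch using the lambda estimated above; output not shown):

turkeygdp |>
  mutate(bc_gdp = box_cox(GDP, lambda)) |>
  features(bc_gdp, unitroot_ndiffs)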

#turkeybox <- turkeygdp |> box_cox(GDP, lambda = lambda)
#graph the difference
turkeygdp2 <- turkeygdp |>
  mutate(diff_boxGDP = difference(box_cox(GDP, lambda = lambda)))
turkeygdp |>
  mutate(diff_boxGDP = difference(box_cox(GDP, lambda = lambda))) |> 
 autoplot(diff_boxGDP) + labs(title = "Turkey GDP Box Cox then difference")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

turkeygdp2 |>
  features(diff_boxGDP, unitroot_kpss)
## # A tibble: 1 × 3
##   Country kpss_stat kpss_pvalue
##   <fct>       <dbl>       <dbl>
## 1 Turkey     0.0889         0.1

Now the data look stationary. The null hypothesis of the KPSS test is that the data are stationary; since the p-value (0.1) is well above typical significance levels, we fail to reject the null and conclude the differenced series is stationary.

  #graph ACF
turkeygdp2 |> ACF(diff_boxGDP) |>
  autoplot() + labs(subtitle = "TurkeyGDP box cox then difference")

#graph PACF
turkeygdp2 |>
  PACF(diff_boxGDP) |>
  autoplot() + labs(subtitle = "TurkeyGDP box cox then difference")

No significant autocorrelations are detected in the ACF or PACF.

Accommodation takings in the state of Tasmania from aus_accommodation.

head(aus_accommodation)
## # A tsibble: 6 x 5 [1Q]
## # Key:       State [1]
##      Date State                        Takings Occupancy   CPI
##     <qtr> <chr>                          <dbl>     <dbl> <dbl>
## 1 1998 Q1 Australian Capital Territory    24.3        65  67  
## 2 1998 Q2 Australian Capital Territory    22.3        59  67.4
## 3 1998 Q3 Australian Capital Territory    22.5        58  67.5
## 4 1998 Q4 Australian Capital Territory    24.4        59  67.8
## 5 1999 Q1 Australian Capital Territory    23.7        58  67.8
## 6 1999 Q2 Australian Capital Territory    25.4        61  68.1
#select tasmania takings
tastak <- aus_accommodation |> select(Date, State, Takings) |> filter(State == "Tasmania")
  
  tastak |> autoplot() + labs(subtitle = "Tasmania Takings")
## Plot variable not specified, automatically selected `.vars = Takings`

The data are seasonal with an upward trend; the presence of a trend means the series is non-stationary.

#boxcox it

lambda <- tastak |>  #test to see if any lambda would be helpful
  features(Takings, features = guerrero) |>
  pull(lambda_guerrero)
#plot with the found lambda
tastak |>
  autoplot(box_cox(Takings, lambda )) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Tasmania Takings with $\\lambda$ = ",
         round(lambda,5))))

The small lambda evens out the variance of the data, but the series is still not centred around a constant mean (it still trends), so a difference needs to be applied.

#add the box cox difference to the dataframe
tastak2 <- tastak |>
  mutate(diff_boxtak = difference(box_cox(Takings, lambda = lambda)))
#graph the Box-Cox differenced series
tastak2 |>
  autoplot(diff_boxtak) + labs(title = "Tasmania Takings Box Cox then difference")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

tastak2 |> ACF(diff_boxtak) |>
  autoplot() + labs(subtitle = "Tasmania Takings box cox then difference")

#graph PACF
tastak2 |>
  PACF(diff_boxtak) |>
  autoplot() + labs(subtitle = "Tasmania Takings box cox then difference")

tastak2 |>
  features(diff_boxtak, unitroot_kpss)
## # A tibble: 1 × 3
##   State    kpss_stat kpss_pvalue
##   <chr>        <dbl>       <dbl>
## 1 Tasmania     0.258         0.1

The transformed and differenced data are centred around zero but still show strong autocorrelation (consistent with the quarterly seasonality). The KPSS test fails to reject the null hypothesis, indicating the series is stationary. Attempting one more difference anyway.

head(tastak2)
## # A tsibble: 6 x 4 [1Q]
## # Key:       State [1]
##      Date State    Takings diff_boxtak
##     <qtr> <chr>      <dbl>       <dbl>
## 1 1998 Q1 Tasmania    28.7     NA     
## 2 1998 Q2 Tasmania    19.0     -0.414 
## 3 1998 Q3 Tasmania    16.1     -0.167 
## 4 1998 Q4 Tasmania    25.9      0.477 
## 5 1999 Q1 Tasmania    28.4      0.0915
## 6 1999 Q2 Tasmania    20.1     -0.345
#drop the NA created by differencing before differencing again
tastak2 <- tastak2 |>
  filter(!is.na(diff_boxtak))
tastak3 <- tastak2 |>
  mutate(diff_boxtak2 = difference(diff_boxtak))
#graph the twice-differenced Box-Cox series
tastak3 |>
  autoplot(diff_boxtak2) + labs(title = "Tasmania Takings Box Cox difference = 2")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

tastak3 |> ACF(diff_boxtak2) |>
  autoplot() + labs(subtitle = "Tasmania Takings box cox difference = 2")

#graph PACF
tastak3 |>
  PACF(diff_boxtak2) |>
  autoplot() + labs(subtitle = "Tasmania Takings box cox then difference = 2")

tastak3 |>
  features(diff_boxtak2, unitroot_kpss)
## # A tibble: 1 × 3
##   State    kpss_stat kpss_pvalue
##   <chr>        <dbl>       <dbl>
## 1 Tasmania     0.444      0.0581

Taking an additional difference does not help; the KPSS p-value actually drops. It is probably best to stick with the first difference after the Box-Cox transformation.
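Given the quarterly seasonality, a seasonal difference may be the better next step; unitroot_nsdiffs can check this (a sketch using the objects defined above; output not shown):

#check whether a seasonal (lag-4) difference is suggested
tastak2 |>
  features(diff_boxtak, unitroot_nsdiffs)
#if nsdiffs > 0, the next step would be difference(box_cox(Takings, lambda), lag = 4)
#before any ordinary differencing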

Monthly sales from souvenirs.

head(souvenirs)
## # A tsibble: 6 x 2 [1M]
##      Month Sales
##      <mth> <dbl>
## 1 1987 Jan 1665.
## 2 1987 Feb 2398.
## 3 1987 Mar 2841.
## 4 1987 Apr 3547.
## 5 1987 May 3753.
## 6 1987 Jun 3715.
#plot souvenir sales

  
  souvenirs |> autoplot() + labs(subtitle = "souvenir sales")
## Plot variable not specified, automatically selected `.vars = Sales`

  #boxcox it

lambda <- souvenirs |>  #test to see if any lambda would be helpful
  features(Sales, features = guerrero) |>
  pull(lambda_guerrero)
#plot with the found lambda
souvenirs |>
  autoplot(box_cox(Sales, lambda )) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Souvenir sales with $\\lambda$ = ",
         round(lambda,5))))

#add the box cox difference to the dataframe
souvenirs2 <- souvenirs |>
  mutate(diff_boxsale = difference(box_cox(Sales, lambda = lambda)))
#graph the ACF of the box cox difference
souvenirs2  |> 
 autoplot(diff_boxsale) + labs(title = "souvenir Box Cox then difference")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

souvenirs2 |> ACF(diff_boxsale) |>
  autoplot() + labs(subtitle = "souvenir box cox then difference")

#graph PACF
souvenirs2 |>
  PACF(diff_boxsale) |>
  autoplot() + labs(subtitle = "souvenir box cox then difference")

souvenirs2 |>
  features(diff_boxsale, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0631         0.1
#attempting second difference
#remove any nas
souvenirs2 <- souvenirs2 |>
  filter(!is.na(diff_boxsale))

souvenirs3 <- souvenirs2 |>
  mutate(diff_boxsale2 = difference(diff_boxsale))
#graph the ACF of the box cox difference
souvenirs3  |> 
 autoplot(diff_boxsale2) + labs(title = "souvenir Box Cox difference = 2")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

souvenirs3 |> ACF(diff_boxsale2) |>
  autoplot() + labs(subtitle = "souvenir box cox difference = 2")

#graph PACF
souvenirs3 |>
  PACF(diff_boxsale2) |> autoplot() + labs(subtitle = "souvenir box cox then difference = 2")

souvenirs3 |>
  features(diff_boxsale2, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0419         0.1

The series initially shows an upward trend with increasing spread. The Box-Cox transformation evens out the spread, and differencing centres the data around zero. The PACF shows fewer significant autocorrelations after the first difference, so it is probably best to stop at one difference; the KPSS p-value for the first difference is also well above 0.01, indicating stationarity.
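The unit-root features can confirm the choice of a single ordinary difference, and whether a seasonal difference is suggested as well (a sketch using the lambda estimated above; output not shown):

souvenirs |>
  mutate(bc_sales = box_cox(Sales, lambda)) |>
  features(bc_sales, list(unitroot_ndiffs, unitroot_nsdiffs))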

9.5 For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

set.seed(987654321)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries |> autoplot()
## Plot variable not specified, automatically selected `.vars = Turnover`

head(myseries)
## # A tsibble: 6 x 5 [1M]
## # Key:       State, Industry [1]
##   State    Industry                           `Series ID`    Month Turnover
##   <chr>    <chr>                              <chr>          <mth>    <dbl>
## 1 Victoria Other recreational goods retailing A3349415T   1982 Apr     15.6
## 2 Victoria Other recreational goods retailing A3349415T   1982 May     15.8
## 3 Victoria Other recreational goods retailing A3349415T   1982 Jun     15.2
## 4 Victoria Other recreational goods retailing A3349415T   1982 Jul     15.2
## 5 Victoria Other recreational goods retailing A3349415T   1982 Aug     14.5
## 6 Victoria Other recreational goods retailing A3349415T   1982 Sep     15.1

Box-Cox transform the data to even out the spread.

  #boxcox it

lambda <- myseries |>  #test to see if any lambda would be helpful
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)
#plot with the found lambda
myseries |>
  autoplot(box_cox(Turnover, lambda )) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Turnover with $\\lambda$ = ",
         round(lambda,5))))

The spread of the data is now more even; the series still needs to be centred by differencing.

#add the box cox difference to the dataframe
myseries2 <- myseries |>
  mutate(diff_boxturn = difference(box_cox(Turnover, lambda = lambda)))
#graph the Box-Cox differenced series
myseries2 |>
  autoplot(diff_boxturn) + labs(title = "Turnover Box Cox then difference")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

myseries2 |> ACF(diff_boxturn) |>
  autoplot() + labs(subtitle = "turnover box cox then difference")

#graph PACF
myseries2 |>
  PACF(diff_boxturn) |>
  autoplot() + labs(subtitle = "turnover box cox then difference")

#kpss test null is data is stationary
myseries2 |>
  features(diff_boxturn, unitroot_kpss)
## # A tibble: 1 × 4
##   State    Industry                           kpss_stat kpss_pvalue
##   <chr>    <chr>                                  <dbl>       <dbl>
## 1 Victoria Other recreational goods retailing    0.0418         0.1
 myseries2 |> features(diff_boxturn, unitroot_nsdiffs)
## # A tibble: 1 × 3
##   State    Industry                           nsdiffs
##   <chr>    <chr>                                <int>
## 1 Victoria Other recreational goods retailing       1

A first-order difference transforms the data into something that looks stationary, and the KPSS p-value of 0.1 (> 0.01) also indicates stationarity. The unitroot_nsdiffs feature suggests one seasonal difference is still needed. A second-order ordinary difference can also be checked.

#difference the difference
myseries3 <- myseries2 |>
  mutate(diff_boxturn2 = difference(diff_boxturn))
#graph the twice-differenced Box-Cox series
myseries3 |>
  autoplot(diff_boxturn2) + labs(title = "Turnover Box Cox difference = 2")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

myseries3 |> ACF(diff_boxturn2) |>
  autoplot() + labs(subtitle = " turnover box cox difference = 2")

#graph PACF
myseries3 |>
  PACF(diff_boxturn2) |> autoplot() + labs(subtitle = "turnover box cox then difference = 2")

myseries3 |>
  features(diff_boxturn2, unitroot_kpss)
## # A tibble: 1 × 4
##   State    Industry                           kpss_stat kpss_pvalue
##   <chr>    <chr>                                  <dbl>       <dbl>
## 1 Victoria Other recreational goods retailing    0.0148         0.1

The first- and second-order differences look very similar, so I would stick with the first-order difference. Given that unitroot_nsdiffs reports one seasonal difference, a lag-12 seasonal difference may be needed rather than additional ordinary differencing.
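Since unitroot_nsdiffs reports one seasonal difference, a natural alternative is a lag-12 seasonal difference of the Box-Cox series, followed by an ordinary difference if still needed (a sketch using the lambda estimated above; output not shown):

myseries |>
  mutate(
    bc_turnover = box_cox(Turnover, lambda),
    sdiff = difference(bc_turnover, lag = 12),
    ddiff = difference(sdiff)
  ) |>
  features(ddiff, unitroot_kpss)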

9.6 Simulate and plot some data from simple ARIMA models.

Use the following R code to generate data from an AR(1) model with phi1 = 0.6 and sigma^2 = 1. The process starts with y1 = 0.

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

Produce a time plot for the series. How does the plot change as you change phi1?

Write your own code to generate data from an MA(1) model with theta1 = 0.6 and sigma^2 = 1.

Produce a time plot for the series. How does the plot change as you change theta1?

Generate data from an ARMA(1,1) model with phi1 = 0.6, theta1 = 0.6 and sigma^2 = 1.

Generate data from an AR(2) model with phi1 = -0.8, phi2 = 0.3 and sigma^2 = 1. (Note that these parameters will give a non-stationary series.)

Graph the latter two series and compare them.

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

sim |> autoplot() + labs(subtitle = "phi 0.6")
## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.1*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

sim |> autoplot() + labs(subtitle = "phi 0.1")
## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 1*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
sim |> autoplot() + labs(subtitle = "phi 1")
## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 2*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
sim |> autoplot() + labs(subtitle = "phi 2")
## Plot variable not specified, automatically selected `.vars = y`

How does the plot change as you change phi1? As phi1 increases, the series wanders further from zero and shows longer, trend-like excursions, so it looks less stationary; with small phi1 the series oscillates quickly around zero. At phi1 = 1 the model becomes a random walk, and for phi1 > 1 the series diverges, which is why |phi1| must be less than 1 for a stationary AR(1).
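A compact way to compare several values of phi1 side by side (a sketch; the helper simulate_ar1 and the chosen phi1 values are illustrative):

simulate_ar1 <- function(phi, n = 100) {
  #simulate an AR(1) series y[t] = phi * y[t-1] + e[t] starting from y[1] = 0
  y <- numeric(n)
  e <- rnorm(n)
  for (i in 2:n) y[i] <- phi * y[i - 1] + e[i]
  y
}
set.seed(1)
ar1_sims <- purrr::map_dfr(
  c(0.1, 0.6, 0.95),
  function(phi) tibble(phi = phi, idx = 1:100, y = simulate_ar1(phi))
) |>
  as_tsibble(index = idx, key = phi)
ar1_sims |> autoplot(y) + facet_wrap(~phi, ncol = 1, labeller = label_both)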

Write your own code to generate data from an MA(1) model with theta1 = 0.6 and sigma^2 = 1.

set.seed(6)
#MA(1) uses theta1 * e[i-1] + e[i], whereas AR(1) uses phi1 * y[i-1] + e[i]
y <- numeric(100)
e <- rnorm(100, sd = 1)
for(i in 2:100)
  y[i] <- 0.6*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)


#attempting to fit a MA(1) model to the simulated MA(1)
fit <- sim |>
  model(
    ma01 = ARIMA(y ~ pdq(0,0,1) )
    
  )
fitted_data <- fit %>%
  augment() %>%
  mutate(idx = sim$idx)

# 
fitted_data |> 
  ggplot(aes(x = idx, y = .fitted)) +  # fitted values from the model (black)
  geom_line() +
  geom_line(data = sim, aes(x = idx, y = y), color = "blue") + # colour set outside aes() so the simulated series shows in blue
  labs(title = "fit ARIMA pdq(0,0,1) model for MA(1) theta1 = 0.6")

An MA(1) model was fitted to the simulated data. Produce a time plot for the series. How does the plot change as you change theta1?

#copy from above, changing theta1 to 0.9
set.seed(6)
#copied the simulated
y <- numeric(100)
e <- rnorm(100, sd = 1)
for(i in 2:100)
  y[i] <- .9*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
head(sim)
## # A tsibble: 6 x 2 [1]
##     idx      y
##   <int>  <dbl>
## 1     1  0    
## 2     2 -0.387
## 3     3  0.302
## 4     4  2.51 
## 5     5  1.58 
## 6     6  0.390
#then fit a model to the simulated
fit <- sim |>
  model(
    ma01 = ARIMA(y ~ pdq(0,0,1) )
    
  )
fitted_data <- fit %>%
  augment() %>%
  mutate(idx = sim$idx)

# 
fitted_data |> 
  ggplot(aes(x = idx, y = .fitted)) +  # fitted is from the model
  geom_line() +
  geom_line(data = sim, aes(x = idx, y = y), color = "blue") + # colour set outside aes()
  labs(title = "fit ARIMA pdq(0,0,1) model for MA(1) theta1 = 0.9")

#changing theta1 to 0.2
set.seed(6)
#copied the simulated
y <- numeric(100)
e <- rnorm(100, sd = 1)
for(i in 2:100)
  y[i] <- .2*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
head(sim)
## # A tsibble: 6 x 2 [1]
##     idx      y
##   <int>  <dbl>
## 1     1  0    
## 2     2 -0.576
## 3     3  0.743
## 4     4  1.90 
## 5     5  0.370
## 6     6  0.373
#then fit a model to the simulated
fit <- sim |>
  model(
    ma01 = ARIMA(y ~ pdq(0,0,1) )
    
  )
fitted_data <- fit %>%
  augment() %>%
  mutate(idx = sim$idx)

# 
fitted_data |> 
  ggplot(aes(x = idx, y = .fitted)) +  # fitted is from the model
  geom_line() +
  geom_line(data = sim, aes(x=idx,y=y), color = "blue")+ #
  labs(title = "fit ARIMA pdq(0,0,1) model for MA(1) theta1 = 0.2")

As theta1 decreases, the spread/range of the series decreases and it tends to stick closer to zero, because the lagged error term carries less weight.
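This matches the theoretical variance of an MA(1) process, Var(y_t) = (1 + theta1^2) * sigma^2, so a smaller theta1 gives a tighter spread around zero (a quick check with sigma^2 = 1; the theta1 values are illustrative):

theta1 <- c(0.2, 0.6, 0.9)
data.frame(theta1 = theta1, variance = 1 + theta1^2)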

Generate data from an ARMA(1,1) model with phi1 = 0.6, theta1 = 0.6 and sigma^2 = 1.

set.seed(6)
y <- numeric(100)
e <- rnorm(100, sd = 1)
for(i in 2:100)
  y[i] <- e[i] + 0.6 * e[i-1] + 0.6 * y[i-1]
sim2 <- tsibble(idx = seq_len(100), y = y, index = idx)
 sim2 |> autoplot(y)

6f Generate data from an AR(2) model with phi1 = -0.8, phi2 = 0.3 and sigma^2 = 1. (Note that these parameters will give a non-stationary series.)

#formula: y[t] = phi1*y[t-1] + phi2*y[t-2] + e[t]
set.seed(6)
y <- numeric(100)
e <- rnorm(100, sd = 1)
for(i in 3:100)
  y[i] <- -0.8 * y[i-1] + 0.3 * y[i-2] + e[i]
sim3 <- tsibble(idx = seq_len(100), y = y, index = idx)
 sim3 |> autoplot(y)

The series indeed looks non-stationary.
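The non-stationarity can be confirmed from the AR(2) characteristic polynomial 1 - phi1*z - phi2*z^2; stationarity requires every root to lie outside the unit circle (a quick check for phi1 = -0.8, phi2 = 0.3):

#roots of 1 + 0.8*z - 0.3*z^2, i.e. 1 - phi1*z - phi2*z^2 with phi1 = -0.8, phi2 = 0.3
ar2_roots <- polyroot(c(1, 0.8, -0.3))
abs(ar2_roots)
#one root has modulus about 0.93 < 1, so the process is explosive (non-stationary)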

ggplot() +
  geom_line(data = sim2 |> filter(idx < 50), aes(x = idx, y = y, color = "ARMA(1,1)")) +
  geom_line(data = sim3 |> filter(idx < 50), aes(x = idx, y = y, color = "AR(2)")) +
  labs(title = "ARMA(1,1), AR(2)", color = "Series") +
  scale_color_manual(values = c("ARMA(1,1)" = "blue", "AR(2)" = "red"))

The range of the AR(2) series grows over time (its oscillations explode), while the spread of the ARMA(1,1) series stays roughly constant; the AR(2) series also shows a more regular, alternating cyclic pattern.

9.7 Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

Use ARIMA() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

head(aus_airpassengers)
## # A tsibble: 6 x 2 [1Y]
##    Year Passengers
##   <dbl>      <dbl>
## 1  1970       7.32
## 2  1971       7.33
## 3  1972       7.80
## 4  1973       9.38
## 5  1974      10.7 
## 6  1975      11.1
#select the relevant years
air <- aus_airpassengers |> filter(Year<2012) |> filter(Year>1969)
aus_airpassengers |> autoplot()
## Plot variable not specified, automatically selected `.vars = Passengers`

#fit a model 
air_fit <- air |>
  model(ARIMA(Passengers, stepwise=FALSE))
air_fit |> glance()
## # A tibble: 1 × 8
##   .model                      sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots
##   <chr>                        <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>  
## 1 ARIMA(Passengers, stepwise…   4.67   -87.8  180.  180.  183. <cpl>    <cpl>
air_fit |> report()
## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8756
## s.e.   0.0722
## 
## sigma^2 estimated as 4.671:  log likelihood=-87.8
## AIC=179.61   AICc=179.93   BIC=182.99
# dof = p + q
#augment to see features
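# note: lag = 1 with dof = 1 leaves zero degrees of freedom, so the Ljung-Box
# p-value below is not informative; lag = 10 (with dof = p + q = 1) is more typical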
air_fit |> augment() |> 
  features(.innov, ljung_box, lag = 1, dof = 1)
## # A tibble: 1 × 3
##   .model                              lb_stat lb_pvalue
##   <chr>                                 <dbl>     <dbl>
## 1 ARIMA(Passengers, stepwise = FALSE)    2.38         0
#check residuals
air_fit  |>
  gg_tsresiduals()

#look at forecast
air_fit |>
  forecast(h=10) |>
  autoplot(aus_airpassengers)

The automatically selected ARIMA(0,2,1) model has residuals that resemble white noise: no autocorrelations fall outside the critical values, and the histogram of the residuals is roughly Gaussian. (The Ljung-Box p-value above is not informative here because lag = 1 with dof = 1 leaves zero degrees of freedom.)

b. Write the model in terms of the backshift operator. The selected model is ARIMA(0,2,1), so p = 0, d = 2 and q = 1. The second difference is (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) = (1 - 2B + B^2) y_t = (1 - B)^2 y_t, and the MA(1) part is (1 + theta1*B) e_t, so the model is

(1 - B)^2 y_t = (1 + theta1*B) e_t

Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.

#comparing the automatically found model (0,2,1) and the 0,1,0
allfit <- air |>
  model(
   arima021 = ARIMA(Passengers, stepwise=FALSE),
    arima010 = ARIMA(Passengers  ~ pdq(0,1,0) ) 
   #,    arima212 = ARIMA(Passengers ~ pdq(2,1,2)  )
   # ,    drift1 = NAIVE(Passengers~ drift())
  )
#forecast for both 021 and 010
forecast(allfit, h=10) |>
  autoplot(aus_airpassengers) +
  labs(title = "Air Passengers",
       y="Passengers")

#021 and 212 comparison
allfit2 <- air |>
  model(
   arima021 = ARIMA(Passengers, stepwise=FALSE),
   # arima010 = ARIMA(Passengers  ~ pdq(0,1,0) ) ,
   arima212 = ARIMA(Passengers ~ pdq(2,1,2)  )
   # ,    drift1 = NAIVE(Passengers~ drift())
  )

forecast(allfit2, h=10) |>
  autoplot(aus_airpassengers) +
  labs(title = "Air Passengers",
       y="Passengers")

#setting Passengers ~ 0 removes the constant
fitrm <- air |>
  model(
    arima212 = ARIMA(Passengers ~ 1+ pdq(2,1,2)   ), 
    arima212000 = ARIMA(Passengers ~ 0+ pdq(2,1,2)   )
  )
## Warning: 1 error encountered for arima212000
## [1] non-stationary AR part from CSS
forecast(fitrm, h=10) |>
  autoplot(aus_airpassengers) +
  labs(title = "Air Passengers constant removed",
       y="Passengers")
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_line()`).

The ARIMA(0,1,0) with drift produces lower forecasts than the ARIMA(0,2,1).
The (0,2,1) and (2,1,2) forecasts match each other closely compared to the (0,1,0). Removing the constant gives the ARIMA(2,1,2) a non-stationary AR part, so the model fails to estimate and produces no forecasts.

9.8 For the United States GDP series (from global_economy):

if necessary, find a suitable Box-Cox transformation for the data; fit a suitable ARIMA model to the transformed data using ARIMA(); try some other plausible models by experimenting with the orders chosen; choose what you think is the best model and check the residual diagnostics; produce forecasts of your fitted model. Do the forecasts look reasonable? compare the results with what you would obtain using ETS() (with no transformation).

head(global_economy)
## # A tsibble: 6 x 9 [1Y]
## # Key:       Country [1]
##   Country     Code   Year         GDP Growth   CPI Imports Exports Population
##   <fct>       <fct> <dbl>       <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>
## 1 Afghanistan AFG    1960  537777811.     NA    NA    7.02    4.13    8996351
## 2 Afghanistan AFG    1961  548888896.     NA    NA    8.10    4.45    9166764
## 3 Afghanistan AFG    1962  546666678.     NA    NA    9.35    4.88    9345868
## 4 Afghanistan AFG    1963  751111191.     NA    NA   16.9     9.17    9533954
## 5 Afghanistan AFG    1964  800000044.     NA    NA   18.1     8.89    9731361
## 6 Afghanistan AFG    1965 1006666638.     NA    NA   21.4    11.3     9938414
#unique(global_economy$Country) checked if it was united states , united states of america, US ect.
usgdp <- global_economy |> filter(Country == 'United States') |> select(Country, Year, GDP)
usgdp |> autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`

Initially it is not obvious that a Box-Cox transformation would help, since at this time scale the data show no seasonal or cyclic variation and the spread seems confined to a smooth curve. A Box-Cox transformation was tested anyway.

lambda <- usgdp |>  #test to see if any lambda would be helpful
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)
usgdp |>
  autoplot(box_cox(GDP, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed GDP with $\\lambda$ = ",
         round(lambda,2))))

usgdp <- usgdp |> mutate(boxgdp = box_cox(GDP, lambda = lambda))

The Box-Cox transformation makes the trend much closer to a straight line.

#fit a model 
us_fit <- usgdp |>
  model(ARIMA(boxgdp, stepwise=FALSE, approx = FALSE))
us_fit |> glance()
## # A tibble: 1 × 9
##   Country       .model        sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots
##   <fct>         <chr>          <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>  
## 1 United States ARIMA(boxgdp…  5479.   -325.  657.  657.  663. <cpl>    <cpl>
us_fit |> report()
## Series: boxgdp 
## Model: ARIMA(1,1,0) w/ drift 
## 
## Coefficients:
##          ar1  constant
##       0.4586  118.1822
## s.e.  0.1198    9.5047
## 
## sigma^2 estimated as 5479:  log likelihood=-325.32
## AIC=656.65   AICc=657.1   BIC=662.78
# dof = p + q
#augment to see features
us_fit |> augment() |> 
  features(.innov, ljung_box, lag = 10, dof = 1)
## # A tibble: 1 × 4
##   Country       .model                                         lb_stat lb_pvalue
##   <fct>         <chr>                                            <dbl>     <dbl>
## 1 United States ARIMA(boxgdp, stepwise = FALSE, approx = FALS…    3.81     0.923
#check residuals
us_fit  |>
  gg_tsresiduals()

#look at forecast
us_fit |>
  forecast(h=10) |>
  autoplot(data = usgdp)

There are no autocorrelations outside the critical values in the residual ACF. The residuals are not perfectly Gaussian, but the Ljung-Box test indicates they are indistinguishable from white noise.

usgdp |> select(Year, boxgdp)|> gg_lag()
## Plot variable not specified, automatically selected `y = boxgdp`

 usgdp |>select( boxgdp) |>  gg_tsdisplay(boxgdp, plot_type = 'partial')

 #ARIMA boxcox (0,0,14)
 usgdp |>select( boxgdp) |>  gg_tsdisplay(difference(boxgdp), plot_type = 'partial')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

 # for the differenced data: PACF spike at lag 1 suggests ARIMA(1,1,0); ACF spike at lag 1 suggests ARIMA(0,1,1)

I will try an ARIMA(0,1,1), the order suggested by the ACF of the differenced Box-Cox series, since the ARIMA(1,1,0) suggested by the PACF matches the automatically selected model.

us_fit2 <- usgdp |>
  model(     arima110 =   ARIMA(boxgdp, stepwise=FALSE, approx = FALSE),
               arima011 = ARIMA(boxgdp ~ pdq(p=0,d=1,q=1)))
us_fit2 |> glance()
## # A tibble: 2 × 9
##   Country       .model   sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##   <fct>         <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
## 1 United States arima110  5479.   -325.  657.  657.  663. <cpl [1]> <cpl [0]>
## 2 United States arima011  5689.   -326.  659.  659.  665. <cpl [0]> <cpl [1]>
us_fit2 |> report()
## Warning in report.mdl_df(us_fit2): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 2 × 9
##   Country       .model   sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##   <fct>         <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
## 1 United States arima110  5479.   -325.  657.  657.  663. <cpl [1]> <cpl [0]>
## 2 United States arima011  5689.   -326.  659.  659.  665. <cpl [0]> <cpl [1]>
# dof = p + q
#augment to see features
us_fit2 |> augment() |> 
  features(.innov, ljung_box, lag = 10, dof = 1)
## # A tibble: 2 × 4
##   Country       .model   lb_stat lb_pvalue
##   <fct>         <chr>      <dbl>     <dbl>
## 1 United States arima011    6.22     0.717
## 2 United States arima110    3.81     0.923
#check residuals
us_fit2  |> select(arima011) |> 
  gg_tsresiduals()

#look at forecast

  forecast(us_fit2, h=10) |> autoplot(data = usgdp)

Compared to the ARIMA(0,1,1), the ARIMA(1,1,0) has slightly wider prediction intervals, a larger Ljung-Box p-value (its residuals look more like white noise), and a slightly lower AIC/AICc. The two models are nevertheless very comparable and produce similar forecasts; of the two I would select the ARIMA(1,1,0). Both sets of forecasts look reasonable.
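An out-of-sample comparison would make this more concrete (a sketch: hold out the last 10 years, refit both models, and compare forecast accuracy; the 2007 cutoff is illustrative):

train <- usgdp |> filter(Year <= 2007)
fit_cv <- train |>
  model(
    arima110 = ARIMA(boxgdp ~ pdq(1, 1, 0)),
    arima011 = ARIMA(boxgdp ~ pdq(0, 1, 1))
  )
fit_cv |>
  forecast(h = 10) |>
  accuracy(usgdp)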

etsfit <- usgdp |> model(ETS(GDP))
forecast(etsfit, h =10) |> autoplot(usgdp)

etsfit |> glance()
## # A tibble: 1 × 10
##   Country       .model   sigma2 log_lik   AIC  AICc   BIC     MSE    AMSE    MAE
##   <fct>         <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
## 1 United States ETS(GD… 6.78e-4  -1590. 3191. 3192. 3201. 2.79e22 1.20e23 0.0191
etsfit |> report()
## Series: GDP 
## Model: ETS(M,A,N) 
##   Smoothing parameters:
##     alpha = 0.9990876 
##     beta  = 0.5011949 
## 
##   Initial states:
##          l[0]        b[0]
##  448093333334 64917355687
## 
##   sigma^2:  7e-04
## 
##      AIC     AICc      BIC 
## 3190.787 3191.941 3201.089
#augment to see features
etsfit |> augment() |> 
  features(.innov, ljung_box, lag = 10, dof = 1)
## # A tibble: 1 × 4
##   Country       .model   lb_stat lb_pvalue
##   <fct>         <chr>      <dbl>     <dbl>
## 1 United States ETS(GDP)    3.82     0.923
#check residuals
etsfit  |>  
  gg_tsresiduals()

The ETS model performs well. At first glance it has wider prediction intervals than the ARIMA models fitted to the Box-Cox transformed data (though the two are on different scales, so the widths are not directly comparable). The ETS residuals behave like white noise, so it may be an accurate model. The ARIMA models offer tighter predictions, while the ETS model's wider intervals are more likely to contain the true values.
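For a comparison on the original GDP scale, the Box-Cox transformation can be written inside the ARIMA() formula so that fable back-transforms the forecasts automatically (a sketch using the lambda estimated above; output not shown):

fair_fit <- usgdp |>
  model(
    arima = ARIMA(box_cox(GDP, lambda)),
    ets = ETS(GDP)
  )
fair_fit |>
  forecast(h = 10) |>
  autoplot(usgdp, level = 80)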