knitr::opts_chunk$set(echo = TRUE)
# Packages used throughout: fpp3 (tsibble datasets, ARIMA(), features),
# forecast (auto.arima, ggAcf/ggPacf, BoxCox, ndiffs, ets) and gridExtra (grid.arrange)
library(fpp3)
library(forecast)
library(gridExtra)

##9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

  1. Explain the differences among these figures. Do they all indicate that the data are white noise?
library(magick)
## Warning: package 'magick' was built under R version 4.3.3
## Linking to ImageMagick 6.9.12.98
## Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fontconfig, x11
url <- "https://github.com/sokkarbishoy/DATA624-Public/raw/39c2dbd957bde26a3234b84978e7089610ce907e/9.223.png"
img <- image_read(url)
print(img)
##   format width height colorspace matte filesize density
## 1    PNG  1344    403       sRGB FALSE    26883   76x76

The ACF values measure how strongly a time series is correlated with itself at each lag. In the first plot (36 observations), the ACF values vary widely across lags and several approach the critical bounds. As the sample size increases, the ACF values show less variation at each lag, with nearly all values inside the confidence bounds. All three plots are consistent with white-noise behavior, but the evidence becomes more convincing with sample size: as the series grows longer, the sample autocorrelations settle closer to zero.

  1. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

In the plots above, the critical values (the dashed lines) vary in their distance from zero because the critical bounds for the ACF depend on the sample size. Even though each sample was drawn randomly, the autocorrelations look different across the plots because of sampling variability. The larger the sample, the more closely the sample ACF approximates the true autocorrelations, which are zero at every lag for white noise.
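
The critical bounds are approximately ±1.96/√T, where T is the number of observations, so they shrink as the sample grows. A quick sketch of the bounds for the three sample sizes in Figure 9.32:

# ACF critical bounds for white noise at each sample size
T <- c(36, 360, 1000)
round(1.96 / sqrt(T), 3)
# approximately 0.327, 0.103 and 0.062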

##9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

amzn <- gafa_stock %>%
  filter(Symbol == "AMZN") %>%
  arrange(Date)
#Convert Date column to Date type if it's not already
amzn$Date <- as.Date(amzn$Date)
amzn_ts <- ts(amzn$Close, frequency = 365)
#Plot the daily closing prices
p1 <- ggplot(amzn, aes(x = Date, y = Close)) +
  geom_line(color = "blue") +
  labs(title = "Amazon Daily Closing Prices",
       x = "Date",
       y = "Closing Price (USD)") +
  theme_minimal()
p1
# Plot the Autocorrelation Function (ACF)
p2 <- ggAcf(amzn_ts, lag.max = 50) +
  ggtitle("ACF of Amazon Closing Prices") +
  theme_minimal()
p2

p3 <- ggPacf(amzn_ts, lag.max = 50) +
  ggtitle("PACF of Amazon Closing Prices") +
  theme_minimal()
p3

The plots above indicate that the Amazon daily closing series is non-stationary. The time plot shows a clear upward trend, so the mean is not constant over time. The ACF shows large autocorrelations that decrease only slowly as the lag increases, rather than dropping off quickly; this slow decay is a classic sign of non-stationarity. The PACF has a dominant spike at lag 1, with other lags much smaller but still significant, further supporting non-stationarity. Differencing the series should remove the trend.
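
As a check that differencing is the right remedy, a minimal sketch differencing the closing prices; the ACF of the differenced series should collapse towards zero at all lags:

amzn_diff <- diff(amzn_ts)  # first differences of the closing prices
ggAcf(amzn_diff, lag.max = 50) +
  ggtitle("ACF of Differenced Amazon Closing Prices") +
  theme_minimal()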

##9.3 For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

  1. Turkish GDP from global_economy.
turkish_GDP <- global_economy %>%
  filter(Country == "Turkey")
autoplot(turkish_GDP, GDP) + 
  ggtitle("Turkish GDP over Time") +
  ylab("GDP") +
  xlab("Year")

The data is from 1960 to 2017.

# We use the BoxCox.lambda function to find the optimal lambda
lambda <- BoxCox.lambda(turkish_GDP$GDP)
turkish_GDP$GDP_transformed <- BoxCox(turkish_GDP$GDP, lambda)
autoplot(turkish_GDP, GDP_transformed) + 
  ggtitle("Box-Cox Transformed Turkish GDP") +
  ylab("Transformed GDP") +
  xlab("Year")

ggAcf(turkish_GDP$GDP_transformed, lag.max = 50) + 
  ggtitle("ACF of Transformed Turkish GDP")

ggPacf(turkish_GDP$GDP_transformed, lag.max = 50) + 
  ggtitle("PACF of Transformed Turkish GDP")

BoxCoxedData <- BoxCox(turkish_GDP$GDP, lambda = BoxCox.lambda(turkish_GDP$GDP))
DifferencedData <- diff(BoxCoxedData)

# Explore the data after BoxCoxing and differencing to check for improvement.
ggtsdisplay(DifferencedData)

# Step 2: Determine the appropriate order of differencing
d <- ndiffs(turkish_GDP$GDP_transformed)
d
## [1] 1

ndiffs indicates that one order of first differencing is required.
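
As a cross-check, a KPSS unit-root test (via the feasts unit-root features, a sketch) should reject stationarity for the transformed series but not for its first difference:

turkish_GDP %>% features(GDP_transformed, unitroot_kpss)
turkish_GDP %>% features(difference(GDP_transformed), unitroot_kpss)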

  1. Accommodation takings in the state of Tasmania from aus_accommodation.
Tasmania_takings <- aus_accommodation %>%
  filter(State == "Tasmania")
Tasmania_takings
## # A tsibble: 74 x 5 [1Q]
## # Key:       State [1]
##       Date State    Takings Occupancy   CPI
##      <qtr> <chr>      <dbl>     <dbl> <dbl>
##  1 1998 Q1 Tasmania    28.7      67    67  
##  2 1998 Q2 Tasmania    19.0      45    67.4
##  3 1998 Q3 Tasmania    16.1      39    67.5
##  4 1998 Q4 Tasmania    25.9      56    67.8
##  5 1999 Q1 Tasmania    28.4      66    67.8
##  6 1999 Q2 Tasmania    20.1      48    68.1
##  7 1999 Q3 Tasmania    17.3      41    68.7
##  8 1999 Q4 Tasmania    24.3      56    69.1
##  9 2000 Q1 Tasmania    30.0      66.2  69.7
## 10 2000 Q2 Tasmania    21.7      49.7  70.2
## # ℹ 64 more rows
autoplot(Tasmania_takings, Takings) + 
  ggtitle("Tasmania Taking over Time") +
  ylab("Takings") +
  xlab("Year")

#determine the optimal box-cox transformation to stabilize the variance
lambda <- BoxCox.lambda(Tasmania_takings$Takings)
cat("Optimal Box-Cox lambda:", lambda, "\n")
## Optimal Box-Cox lambda: -0.005076712
Tasmania_takings <- Tasmania_takings %>%
  mutate(Takings_transformed = BoxCox(Takings, lambda))
d <- ndiffs(Tasmania_takings$Takings_transformed)
cat("Order of differencing required:", d, "\n")
## Order of differencing required: 1
Tasmania_takings_diff <- Tasmania_takings %>%
  mutate(Takings_diff = c(NA, diff(Takings_transformed, differences = d)))
autoplot(na.omit(Tasmania_takings_diff), Takings_diff) +
  ggtitle("Differenced and Transformed Tasmania Takings") +
  ylab("Differenced Transformed Takings") +
  xlab("Year")

The data are strongly seasonal. Although the transformed and first-differenced series has a more stable mean and variance than the original, a clear seasonal pattern remains. Stationarity requires that no predictable pattern or seasonality is left in the data, so a seasonal difference is also needed.
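
Since the data are quarterly, a lag-4 seasonal difference on top of the first difference should remove that pattern; a minimal sketch:

Tasmania_takings %>%
  mutate(Takings_sdiff = difference(difference(Takings_transformed, 4))) %>%
  autoplot(Takings_sdiff) +
  ggtitle("Seasonally and First Differenced Transformed Takings")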

  1. Monthly sales from souvenirs.
souvenirs %>%
  gg_tsdisplay(Sales, plot_type='partial', lag = 36) +
  labs(title = "Non-transformed Monthly Souvenir Sales")

# calculate lambda
lambda <- souvenirs %>%
  features(Sales, features = guerrero) %>%
  pull(lambda_guerrero)
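
The chunk above stops after computing lambda, so a minimal sketch completing the answer: apply the Box-Cox transformation, take a lag-12 seasonal difference followed by a first difference (the series is monthly and strongly seasonal), and inspect the result (souvenirs_t is an illustrative name):

souvenirs_t <- souvenirs %>%
  mutate(Sales_bc = box_cox(Sales, lambda))
souvenirs_t %>% features(Sales_bc, unitroot_nsdiffs)                  # seasonal differences needed
souvenirs_t %>% features(difference(Sales_bc, 12), unitroot_ndiffs)   # regular differences after that
souvenirs_t %>%
  gg_tsdisplay(difference(Sales_bc, 12) %>% difference(),
               plot_type = 'partial', lag = 36) +
  labs(title = "Transformed and Doubly Differenced Souvenir Sales")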

##9.5 For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

set.seed(242423)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1)) 
autoplot(myseries, Turnover) +
  ggtitle("Selected Retail Series over Time") +
  ylab("Turnover") +
  xlab("Year")

lambda <- BoxCox.lambda(myseries$Turnover)
cat("Optimal Box-Cox lambda:", lambda, "\n")
## Optimal Box-Cox lambda: -0.231131
myseries <- myseries %>%
  mutate(Turnover_transformed = BoxCox(Turnover, lambda))
d <- ndiffs(myseries$Turnover_transformed)
cat("Order of regular differencing required:", d, "\n")
## Order of regular differencing required: 1
myseries_diff <- myseries %>%
  mutate(Turnover_diff = c(NA, diff(Turnover_transformed, differences = d)))
autoplot(na.omit(myseries_diff), Turnover_diff) +
  ggtitle("Differenced and Transformed Series") +
  ylab("Differenced Transformed Turnover") +
  xlab("Year")

ggAcf(na.omit(myseries_diff$Turnover_diff), lag.max = 50) +
  ggtitle("ACF of Differenced and Transformed Series")

ggPacf(na.omit(myseries_diff$Turnover_diff), lag.max = 50) +
  ggtitle("PACF of Differenced and Transformed Series")

The original series plot shows a clear upward trend and variance that increases over time, both signs of non-stationarity. After applying a Box-Cox transformation and first-order differencing, the differenced series has a much more stable mean, but strong within-year fluctuations remain.
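
Those remaining fluctuations are consistent with the monthly seasonality of retail turnover, so a seasonal-differencing check is a sensible follow-up (a sketch using nsdiffs, which needs a ts with frequency 12):

D <- nsdiffs(ts(myseries$Turnover_transformed, frequency = 12))
cat("Order of seasonal differencing required:", D, "\n")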

##9.6 Simulate and plot some data from simple ARIMA models.

a) Use the following R code to generate data from an AR(1) model with ϕ1 = 0.6 and σ² = 1. The process starts with y1 = 0.

set.seed(2434243)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
ggplot(sim, aes(x = idx, y = y)) +
  geom_line() +
  ggtitle("Simulated AR(1) Process with φ1 = 0.6") +
  xlab("Time") +
  ylab("y")

set.seed(244243)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.9*y[i-1] + e[i]
sim2 <- tsibble(idx = seq_len(100), y = y, index = idx)
ggplot(sim2, aes(x = idx, y = y)) +
  geom_line() +
  ggtitle("Simulated AR(1) Process with φ1 = 0.9") +
  xlab("Time") +
  ylab("y")

b) Produce a time plot for the series. How does the plot change as you change ϕ1? As ϕ1 increases, the series becomes more persistent: it retains the memory of past values for longer, producing longer runs away from zero. As ϕ1 decreases, consecutive values show less autocorrelation and the memory decays faster. A comparison of the two sample ACFs is sketched below; the MA(1) chunk that follows then answers part c) with θ1 = 0.6 and σ² = 1.
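
A minimal sketch comparing the sample ACFs of the two simulations (using the sim and sim2 objects created above; the ϕ1 = 0.9 series should show much slower ACF decay):

grid.arrange(
  ggAcf(sim$y) + ggtitle("ACF: AR(1), φ1 = 0.6"),
  ggAcf(sim2$y) + ggtitle("ACF: AR(1), φ1 = 0.9"),
  ncol = 2
)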

set.seed(2434243)
theta1 <- 0.6 
n <- 100      
e <- rnorm(n, mean = 0, sd = 1)  # White noise with variance σ^2 = 1

# Initialize y as a numeric vector
y <- numeric(n)
y[1] <- e[1]  # Start the process

# Generate the MA(1) series
for(i in 2:n) {
  y[i] <- e[i] + theta1 * e[i - 1]
}

sim_ma1 <- tsibble(idx = seq_len(n), y = y, index = idx)

# Plot the MA(1) series
ggplot(sim_ma1, aes(x = idx, y = y)) +
  geom_line() +
  ggtitle("Simulated MA(1) Process with θ1 = 0.6 and σ^2 = 1") +
  xlab("Time") +
  ylab("y")

d) As θ1 increases toward 1, the series exhibits stronger correlation between consecutive terms, because each term is more heavily influenced by the previous noise shock; this produces a smoother, more oscillatory pattern. When θ1 is close to 0, the correlation between consecutive terms vanishes and the series resembles white noise, with little or no dependence between terms.

  1. Generate data from an ARMA(1,1) model with ϕ1 = 0.6, θ1 = 0.6 and σ² = 1.

  2. Generate data from an AR(2) model with ϕ1 = −0.8, ϕ2 = 0.3 and σ² = 1. (Note that these parameters will give a non-stationary series.)

  3. Graph the latter two series and compare them.

set.seed(2434243)

#   the ARMA(1,1) model
phi1_arma <- 0.6
theta1_arma <- 0.6
n <- 100
e <- rnorm(n, mean = 0, sd = 1)

#  ARMA(1,1) series
y_arma <- numeric(n)
y_arma[1] <- e[1]

# Generate the ARMA(1,1) series
for(i in 2:n) {
  y_arma[i] <- phi1_arma * y_arma[i - 1] + e[i] + theta1_arma * e[i - 1]
}

# Convert ARMA(1,1) series to a tsibble for plotting
sim_arma11 <- tsibble(idx = seq_len(n), y = y_arma, index = idx)

# Parameters for the AR(2) model
phi1_ar2 <- -0.8
phi2_ar2 <- 0.3

# Initialize the AR(2) series
y_ar2 <- numeric(n)
y_ar2[1:2] <- e[1:2]

# AR(2) series (non-stationary)
for(i in 3:n) {
  y_ar2[i] <- phi1_ar2 * y_ar2[i - 1] + phi2_ar2 * y_ar2[i - 2] + e[i]
}


sim_ar2 <- tsibble(idx = seq_len(n), y = y_ar2, index = idx)

p1 <- ggplot(sim_arma11, aes(x = idx, y = y)) +
  geom_line() +
  ggtitle("Simulated ARMA(1,1) Process with φ1 = 0.6, θ1 = 0.6") +
  xlab("Time") +
  ylab("y")


p2 <- ggplot(sim_ar2, aes(x = idx, y = y)) +
  geom_line() +
  ggtitle("Simulated AR(2) Process with φ1 = -0.8, φ2 = 0.3 (Non-stationary)") +
  xlab("Time") +
  ylab("y")

grid.arrange(p1, p2, ncol = 2)

Comparing the two: the ARMA(1,1) series is stationary, fluctuating around a constant mean with roughly constant variance. The AR(2) series with ϕ1 = −0.8 and ϕ2 = 0.3 violates the stationarity condition ϕ2 − ϕ1 < 1 (here ϕ2 − ϕ1 = 1.1), so it oscillates with an amplitude that grows exponentially over time.

##9.7 Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

  1. Use ARIMA() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

a) We use auto.arima to find the best-fitting ARIMA model for the aus_airpassengers data and check whether the residuals look like white noise; the fit, diagnostics and forecasts appear in the chunks below. b) In terms of the backshift operator, the selected ARIMA(0,2,1) model is (1 − B)² yt = (1 − 0.8963B) εt.

c) Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

air_ts_fit1 <- aus_airpassengers %>%
  model(arima1 = ARIMA(Passengers ~ pdq(0,1,0)))

report(air_ts_fit1)
## Series: Passengers 
## Model: ARIMA(0,1,0) w/ drift 
## 
## Coefficients:
##       constant
##         1.4191
## s.e.    0.3014
## 
## sigma^2 estimated as 4.271:  log likelihood=-98.16
## AIC=200.31   AICc=200.59   BIC=203.97
air_ts_fit1 %>%
  gg_tsresiduals()

air_ts_fit1 %>%
  forecast(h=12) %>%
  autoplot(aus_airpassengers)

  1. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens. (Note: the chunk below actually fits the automatic model from part a; a sketch of the ARIMA(2,1,2) fits follows its output.)
data(aus_airpassengers)
model_a <- auto.arima(aus_airpassengers)
print(model_a)
## Series: aus_airpassengers 
## ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8963
## s.e.   0.0594
## 
## sigma^2 = 4.308:  log likelihood = -97.02
## AIC=198.04   AICc=198.32   BIC=201.65
checkresiduals(model_a) 

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,1)
## Q* = 6.5774, df = 8, p-value = 0.5828
## 
## Model df: 1.   Total lags used: 9
summary(model_a)
## Series: aus_airpassengers 
## ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8963
## s.e.   0.0594
## 
## sigma^2 = 4.308:  log likelihood = -97.02
## AIC=198.04   AICc=198.32   BIC=201.65
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE     MASE       ACF1
## Training set 0.3453292 2.008183 1.215008 0.9528709 4.809966 0.699953 -0.1743841
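
The chunk above refits the automatic ARIMA(0,2,1) from part a rather than the ARIMA(2,1,2) the prompt names. A minimal sketch of the requested fits, assuming the fable syntax used in part c (the model names fit_212, with_drift and no_constant are illustrative):

fit_212 <- aus_airpassengers %>%
  model(
    with_drift  = ARIMA(Passengers ~ 1 + pdq(2,1,2)),
    no_constant = ARIMA(Passengers ~ 0 + pdq(2,1,2))
  )
fit_212 %>%
  forecast(h = 10) %>%
  autoplot(aus_airpassengers)

Removing the constant removes the drift term, so those forecasts level off instead of continuing the upward trend.
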
  1. Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?
data("aus_airpassengers")
air_passengers <- aus_airpassengers %>%
  filter(Year >= 1970, Year <= 2011) %>%
  summarise(Total_Passengers = sum(Passengers))

# Convert to time series object
air_ts <- ts(air_passengers$Total_Passengers, start = 1970, frequency = 1)

# Part a: Fit an ARIMA model
fit_arima <- auto.arima(air_ts)
print(fit_arima)
## Series: air_ts 
## ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8756
## s.e.   0.0722
## 
## sigma^2 = 4.671:  log likelihood = -87.8
## AIC=179.61   AICc=179.93   BIC=182.99
# Check residuals to see if they resemble white noise
checkresiduals(fit_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,1)
## Q* = 5.7075, df = 7, p-value = 0.5743
## 
## Model df: 1.   Total lags used: 8
# Plot forecast for the next 10 periods
forecast_arima <- forecast(fit_arima, h = 10)
autoplot(forecast_arima) +
  ggtitle("ARIMA Model Forecasts for Australian Air Passengers") +
  xlab("Year") + ylab("Passengers (millions)")

The residuals appear to oscillate around zero without any clear pattern or trend, which is a good indication that the model has removed the systematic structure from the data. The ACF plot shows that most autocorrelations are within the 95% confidence bounds (the dashed blue lines), so little or no significant autocorrelation remains in the residuals, a key characteristic of white noise. Overall, the residuals from this ARIMA(0,2,1) model are close to white noise, indicating that the model has effectively captured the patterns in the data.
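
The last part asks for an ARIMA(0,2,1) model with a constant, which the chunk above does not actually fit. A sketch of what the prompt asks, assuming the fable syntax from part c: with d = 2, including a constant induces a quadratic polynomial trend in the forecasts, which is why fable warns that this specification is generally discouraged.

fit_021c <- aus_airpassengers %>%
  model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))
fit_021c %>%
  forecast(h = 10) %>%
  autoplot(aus_airpassengers)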

##9.8 For the United States GDP series (from global_economy):

if necessary, find a suitable Box-Cox transformation for the data; fit a suitable ARIMA model to the transformed data using ARIMA();

us_gdp <- global_economy %>%
  filter(Code == 'USA') %>%
  select(Country, GDP)

lambda_value <- us_gdp %>%
  features(GDP, features = guerrero) %>%
  pull(lambda_guerrero)

us_gdp_transformed <- us_gdp %>%
  mutate(GDP_transformed = box_cox(GDP, lambda_value))
# Note: this call fits ARIMA() to the raw GDP series; the Box-Cox transformed
# series created above is used in the order experiments further below
fit <- us_gdp %>%
  model(
    arima = ARIMA(GDP, stepwise = FALSE, approx = FALSE))

report(fit)
## Series: GDP 
## Model: ARIMA(0,2,2) 
## 
## Coefficients:
##           ma1      ma2
##       -0.4206  -0.3048
## s.e.   0.1197   0.1078
## 
## sigma^2 estimated as 2.615e+22:  log likelihood=-1524.08
## AIC=3054.15   AICc=3054.61   BIC=3060.23

try some other plausible models by experimenting with the orders chosen;

us_gdp_ts <- ts(us_gdp_transformed$GDP_transformed, start = 1960, frequency = 1)  # the USA series in global_economy runs 1960-2017

# Experimenting with different ARIMA models
# ARIMA(1,1,1)
model_111 <- Arima(us_gdp_ts, order = c(1, 1, 1))
# ARIMA(1,1,2)
model_112 <- Arima(us_gdp_ts, order = c(1, 1, 2))
# ARIMA(2,1,1)
model_211 <- Arima(us_gdp_ts, order = c(2, 1, 1))
# ARIMA(0,1,1)
model_011 <- Arima(us_gdp_ts, order = c(0, 1, 1))

# Compare models based on AIC values
models <- list(model_111, model_112, model_211, model_011)
aic_values <- sapply(models, AIC)
best_model <- models[[which.min(aic_values)]]
print(best_model)
## Series: us_gdp_ts 
## ARIMA(1,1,2) 
## 
## Coefficients:
##          ar1      ma1      ma2
##       0.9944  -0.4986  -0.2366
## s.e.  0.0082   0.1314   0.1261
## 
## sigma^2 = 5821:  log likelihood = -327.69
## AIC=663.37   AICc=664.14   BIC=671.54
# Residual diagnostics for the best model
checkresiduals(best_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)
## Q* = 4.0278, df = 7, p-value = 0.7766
## 
## Model df: 3.   Total lags used: 10
# Plotting the ACF and PACF of residuals for white noise check
acf(best_model$residuals, main="ACF of Residuals")

pacf(best_model$residuals, main="PACF of Residuals")

# Additional tests for residuals (Ljung-Box test)
# Additional test for residuals (Ljung-Box), accounting for the 3 fitted parameters
Box.test(best_model$residuals, lag = 10, type = "Ljung-Box", fitdf = 3)
## 
##  Box-Ljung test
## 
## data:  best_model$residuals
## X-squared = 4.0278, df = 7, p-value = 0.7766

choose what you think is the best model and check the residual diagnostics; produce forecasts of your fitted model. Do the forecasts look reasonable? compare the results with what you would obtain using ETS() (with no transformation).

arima_forecast <- forecast(best_model, h = 10)  # Forecast 10 periods ahead
plot(arima_forecast, main = "Forecast from Best ARIMA Model")

# Forecast summary for inspection
print(arima_forecast)
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2018       19978.52 19880.74 20076.30 19828.98 20128.06
## 2019       20167.39 19991.45 20343.33 19898.32 20436.46
## 2020       20355.21 20109.62 20600.80 19979.61 20730.80
## 2021       20541.98 20227.58 20856.37 20061.15 21022.80
## 2022       20727.71 20343.36 21112.06 20139.90 21315.52
## 2023       20912.41 20456.20 21368.61 20214.70 21610.11
## 2024       21096.08 20565.82 21626.34 20285.11 21907.05
## 2025       21278.73 20672.09 21885.37 20350.95 22206.51
## 2026       21460.36 20774.99 22145.73 20412.17 22508.54
## 2027       21640.98 20874.55 22407.41 20468.82 22813.14
# Forecasting with ETS on the same series, for direct comparability
ets_model <- ets(us_gdp_ts)  # note: us_gdp_ts is the Box-Cox transformed series
ets_forecast <- forecast(ets_model, h = 10)
plot(ets_forecast, main = "Forecast from ETS Model")

# Display both ARIMA and ETS forecast results for comparison
print(ets_forecast)
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2018       19972.82 19821.34 20124.30 19741.14 20204.49
## 2019       20170.52 19912.32 20428.71 19775.64 20565.39
## 2020       20368.22 19996.79 20739.65 19800.17 20936.27
## 2021       20565.92 20072.20 21059.64 19810.84 21321.00
## 2022       20763.62 20138.22 21389.03 19807.14 21720.10
## 2023       20961.33 20194.98 21727.67 19789.29 22133.36
## 2024       21159.03 20242.75 22075.31 19757.70 22560.35
## 2025       21356.73 20281.81 22431.65 19712.78 23000.67
## 2026       21554.43 20312.43 22796.42 19654.96 23453.90
## 2027       21752.13 20334.86 23169.40 19584.60 23919.66
# Plot ARIMA and ETS forecasts together for direct comparison
plot(arima_forecast, main = "ARIMA vs ETS Forecast Comparison")
lines(ets_forecast$mean, col = "blue", lty = 2)  # ETS forecast as blue dashed line
legend("topleft", legend = c("ARIMA Forecast", "ETS Forecast"), col = c("black", "blue"), lty = c(1, 2))

Both the ARIMA and ETS models capture the upward trend of the GDP data, so both recognize the long-term growth pattern in the historical series. Their point forecasts are very close, but the ETS prediction intervals are noticeably wider than the ARIMA intervals, reflecting greater uncertainty in the ETS trend component.