Data624 Assignment 6

suppressMessages(suppressWarnings(library(fpp2)))
suppressMessages(suppressWarnings(library(readxl)))
suppressMessages(suppressWarnings(library(seasonal)))
suppressMessages(suppressWarnings(library(rdatamarket)))
suppressMessages(suppressWarnings(library(tseries)))

Question 1

Figure 8.24 shows the ACFs for 36 random numbers, 360 random numbers and for 1,000 random numbers.

a. Explain the differences among these figures. Do they all indicate the data are white noise?

Answer:

All three plots indicate white noise: none of the spikes exceed the critical value bounds (the blue dashed lines), so none of the autocorrelations are significantly different from zero.

b. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

Answer:

As the sample size increases, the critical values shrink, so the critical-value region narrows. The bounds are +/- 1.96/sqrt(T), where T is the length of the series; with more observations, each sample autocorrelation estimates the true value (zero) more precisely. The autocorrelations themselves differ from figure to figure because each plot is computed from a different random sample, and sampling variation is larger when T is small.
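For example, the bounds for each sample size can be computed directly from this formula (a quick sketch):

crit_bounds <- 1.96 / sqrt(c(36, 360, 1000))
round(crit_bounds, 3)
# roughly +/- 0.327, +/- 0.103 and +/- 0.062, so larger samples give narrower bounds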

Question 2

A classic example of a non-stationary series is the daily closing IBM stock prices (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows the series is non-stationary and should be differenced.

Answer:

plot(ibmclose)

acf(ibmclose)

pacf(ibmclose)

The time plot shows a clear trend throughout the series. The ACF decays very slowly and every spike is significant, which indicates non-stationarity, so the data should be differenced to remove the autocorrelation.

The PACF has a single dominant spike at lag 1 (close to 1), showing that each observation is strongly correlated with the previous day's value.

Differencing helps stabilize the mean of a time series and removes (or reduces) trend and seasonality, making the series stationary.
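As a quick check (not required by the question), differencing the series once and re-plotting shows that most of the autocorrelation has been removed:

ggtsdisplay(diff(ibmclose))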

Question 3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

transform.plot <- function(x){
  # choose the Box-Cox lambda, transform the series, and display it with its ACF/PACF
  lam <- BoxCox.lambda(x)
  transformed <- BoxCox(x, lambda = lam)
  tsdisplay(transformed)
  return(paste("Box-Cox Lambda = ", round(lam, 4)))
}

a. usnetelec

data("usnetelec")
transform.plot(usnetelec)

## [1] "Box-Cox Lambda =  0.5168"

The transformed series needs one difference to become stationary.
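This can be confirmed with ndiffs on the transformed series (a sketch; the exact answer can depend on the unit-root test used):

ndiffs(BoxCox(usnetelec, lambda = BoxCox.lambda(usnetelec)))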

b. usgdp

data("usgdp")
transform.plot(usgdp)

## [1] "Box-Cox Lambda =  0.3664"

Likewise, the transformed series needs one difference to become stationary.

c. mcopper

data("mcopper")
transform.plot(mcopper)

## [1] "Box-Cox Lambda =  0.1919"

Here the transformed series needs two differences to become stationary.

d. enplanements

data("enplanements")
transform.plot(enplanements)

## [1] "Box-Cox Lambda =  -0.2269"

Here one seasonal difference and one first difference are needed for stationarity.
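The required seasonal and first differences can be checked with nsdiffs and ndiffs on the transformed series (a sketch):

enp_bc <- BoxCox(enplanements, lambda = BoxCox.lambda(enplanements))
nsdiffs(enp_bc)                  # seasonal differences needed
ndiffs(diff(enp_bc, lag = 12))   # first differences needed after seasonal differencing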

e. visitors

data("visitors")
transform.plot(visitors)

## [1] "Box-Cox Lambda =  0.2775"

Again, one seasonal difference and one first difference are needed for stationarity.

Question 4

For your retail data (from Exercise 3 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

retaildata <- readxl::read_excel("C:/Users/rites/Documents/GitHub/Data624_Assignment1/retail.xlsx", skip=1)
myts <- ts(retaildata[,"A3349873A"],frequency=12, start=c(1982,4))
autoplot(myts) + xlab("Time") + ylab("Sales")

The data show an increasing trend and strong seasonality, so a Box-Cox transformation (to stabilize the variance) followed by seasonal and first differencing is appropriate.

ndiffs(myts)
## [1] 1
nsdiffs(myts)
## [1] 1

One seasonal difference followed by one first difference should be sufficient here.
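A minimal sketch that applies the transformation and both differences and displays the result:

lambda_retail <- BoxCox.lambda(myts)
myts_stationary <- diff(diff(BoxCox(myts, lambda_retail), lag = 12))
ggtsdisplay(myts_stationary)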

Question 5

Use R to simulate and plot some data from simple ARIMA models.

a. Use the following R code to generate data from an AR(1) model with phi1 = 0.6 and sigma^2 = 1. The process starts with y1 = 0.

y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100){
  y[i] <- 0.6*y[i-1] + e[i]
}
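As a cross-check (not part of the assigned code), the same kind of AR(1) series can be generated with arima.sim; the seed is an arbitrary choice, assumed only for reproducibility:

set.seed(123)  # arbitrary seed for reproducibility
y_sim <- arima.sim(model = list(ar = 0.6), n = 100)
autoplot(y_sim) + ylab("AR(1) simulated with arima.sim")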

b. Produce a time plot for the series. How does the plot change as you change phi1?

ar1generator <- function(phi1){
  # generate 100 data points from an AR(1) model with input phi1.
  y <- ts(numeric(100))
  # error 'e's have variation sigma^2 as 1.
  e <- rnorm(100)
  for(i in 2:100){
    y[i] <- phi1*y[i-1] + e[i]
  }
  return(y)
}
# produce plots changing phi1 value.

autoplot(ar1generator(0.3), series = "0.3") +
  geom_line(size = 1, colour = "red") +
  autolayer(y, series = "0.6", size = 1) +
  autolayer(ar1generator(0.9), series = "0.9", size = 1) +
  ylab("AR(1) models") +
  guides(colour = guide_legend(title = "Phi1"))

As phi1 increases, the variance of y increases and the series shows longer, more persistent swings.

c. Write your own code to generate data from an MA(1) model with theta1 = 0.6 and sigma^2 = 1.

ma1generator <- function(theta1){
  # generate 100 data points from an MA(1) model with input theta1.

  y <- ts(numeric(100))
  # error 'e's have variation sigma^2 as 1.
  e <- rnorm(100)
  for(i in 2:100){
    y[i] <- theta1*e[i-1] + e[i]
  }
  return(y)
}

d. Produce a time plot for the series. How does the plot change as you change theta1?

# produce plots changing theta1 value.

autoplot(ma1generator(0.3), series = "0.3") +
  geom_line(size = 1, colour = "red") +
  autolayer(ma1generator(0.6), series = "0.6", size = 1) +
  autolayer(ma1generator(0.9), series = "0.9", size = 1) +
  ylab("MA(1) models") +
  guides(colour = guide_legend(title = "Theta1"))

As theta1 increases, the variance of y increases, although the dependence only extends one lag back.

e. Generate data from an ARMA(1,1) model with phi1 = 0.6, theta1 = 0.6 and sigma^2 = 1.

y_arima.1.0.1 <- ts(numeric(50))
e <- rnorm(50)
for(i in 2:50){
  y_arima.1.0.1[i] <- 0.6*y_arima.1.0.1[i-1] + 0.6*e[i-1] + e[i]
}

f. Generate data from an AR(2) model with phi1 = -0.8, phi2 = 0.3 and sigma^2 = 1. (Note that these parameters will give a non-stationary series.)

y_arima.2.0.0 <- ts(numeric(50))
e <- rnorm(50)
for(i in 3:50){
  y_arima.2.0.0[i] <- -0.8*y_arima.2.0.0[i-1] + 0.3*y_arima.2.0.0[i-2] + e[i]
}

g. Graph the latter two series and compare them.

autoplot(y_arima.1.0.1, series = "ARMA(1, 1)") +
  autolayer(y_arima.2.0.0, series = "AR(2)") +
  ylab("y") +
  guides(colour = guide_legend(title = "Models"))

autoplot(y_arima.1.0.1)

The AR(2) series grows with increasing oscillation, confirming it is non-stationary, whereas the ARMA(1, 1) series stays roughly stationary around a constant level.
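One way to confirm that these AR(2) parameters give a non-stationary process (a sketch, not part of the exercise) is to check whether all roots of the AR polynomial 1 - phi1*z - phi2*z^2 lie outside the unit circle:

# with phi1 = -0.8 and phi2 = 0.3 the polynomial is 1 + 0.8*z - 0.3*z^2
abs(polyroot(c(1, 0.8, -0.3)))  # one modulus is below 1, so the process is non-stationary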

Question 6

Consider the number of women murdered each year (per 100,000 standard population) in the United States. (Data set wmurders).

a. By studying appropriate graphs of the series in R, find an appropriate ARIMA(p,d,q) model for these data.

autoplot(wmurders)

The data are annual, so no seasonal differencing is needed, and the variation looks stable enough that a Box-Cox transformation is unnecessary.

autoplot(diff(wmurders))

The once-differenced series still drifts downward over time, so one more difference appears to be needed for stationarity.

ndiffs(wmurders)
## [1] 2

The ndiffs function confirms that two differences are needed.

autoplot(diff(wmurders, differences = 2))

kpss.test(diff(wmurders, differences = 2))
## Warning in kpss.test(diff(wmurders, differences = 2)): p-value greater than
## printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(wmurders, differences = 2)
## KPSS Level = 0.045793, Truncation lag parameter = 3, p-value = 0.1

The KPSS test does not reject level stationarity, so differencing twice makes the data stationary.

diff(wmurders, differences = 2) %>% ggtsdisplay()

The PACF decays gradually, and the ACF has significant spikes at lags 1 and 2 but none beyond lag 2. This pattern points to an ARIMA(0, 2, q) model rather than an ARIMA(p, 2, 0) model, so I will fit ARIMA(0, 2, 2).

b. Should you include a constant in the model? Explain.

The model already includes two differences. Adding a constant to a twice-differenced model would induce a quadratic trend in the long-term forecasts, which is dangerous, so I will not include a constant.

c. Write this model in terms of the backshift operator.

(1 - B)^2 * y_t = (1 + theta1*B + theta2*B^2) * e_t

d. Fit the model using R and examine the residuals. Is the model satisfactory?

wmurders_arima.0.2.2 <- Arima(wmurders,order = c(0, 2, 2))
checkresiduals(wmurders_arima.0.2.2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,2)
## Q* = 11.764, df = 8, p-value = 0.1621
## 
## Model df: 2.   Total lags used: 10

The Ljung-Box test does not reject the null hypothesis, so the residuals can be treated as white noise. They are not perfectly normally distributed, but the model is satisfactory.

e. Forecast three times ahead. Check your forecasts by hand to make sure that you know how they have been calculated.

fc_wmurders_arima.0.2.2 <- forecast(wmurders_arima.0.2.2, h = 3)
# forecasts by Arima function
fc_wmurders_arima.0.2.2$mean
## Time Series:
## Start = 2005 
## End = 2007 
## Frequency = 1 
## [1] 2.480525 2.374890 2.269256
# get forecasts by manual calculation
fc_wmurders_arima.0.2.2$model
## Series: wmurders 
## ARIMA(0,2,2) 
## 
## Coefficients:
##           ma1     ma2
##       -1.0181  0.1470
## s.e.   0.1220  0.1156
## 
## sigma^2 estimated as 0.04702:  log likelihood=6.03
## AIC=-6.06   AICc=-5.57   BIC=-0.15
# formula
# (1 - B)^2*yt = (1 - 1.0181*B + 0.1470*B^2)*et
# yt = 2yt-1 - yt-2 + et - 1.0181*et-1 + 0.1470*et-2
years <- length(wmurders)
e <- fc_wmurders_arima.0.2.2$residuals
fc1 <- 2*wmurders[years] - wmurders[years - 1] - 1.0181*e[years] + 0.1470*e[years - 1]
fc2 <- 2*fc1 - wmurders[years] + 0.1470*e[years]
fc3 <- 2*fc2 - fc1
# forecasts by manual calculation
c(fc1, fc2, fc3)
## [1] 2.480523 2.374887 2.269252

The manual forecasts match those from the Arima function almost exactly; the tiny differences come from rounding the coefficients.

f. Create a plot of the series with forecasts and prediction intervals for the next three periods shown.

autoplot(fc_wmurders_arima.0.2.2)

g. Does auto.arima give the same model you have chosen? If not, which model do you think is better?

fc_wmurders_autoarima <- forecast(auto.arima(wmurders), h = 3)
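Printing the fitted model shows which order auto.arima selected (output omitted here):

fc_wmurders_autoarima$model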

auto.arima selects an ARIMA(1, 2, 1) model. Comparing training-set accuracy below, every error measure except RMSE favours the ARIMA(0, 2, 2) model.

accuracy(fc_wmurders_arima.0.2.2)
##                      ME      RMSE       MAE        MPE     MAPE      MASE
## Training set -0.0113461 0.2088162 0.1525773 -0.2403396 4.331729 0.9382785
##                     ACF1
## Training set -0.05094066
accuracy(fc_wmurders_autoarima)
##                       ME      RMSE       MAE        MPE     MAPE      MASE
## Training set -0.01065956 0.2072523 0.1528734 -0.2149476 4.335214 0.9400996
##                    ACF1
## Training set 0.02176343

Next, try the auto.arima function with the stepwise and approximation options set to FALSE.

fc_wmurders_autoarima2 <- forecast(
  auto.arima(wmurders, stepwise = FALSE, approximation = FALSE), h = 3)

This selects an ARIMA(0, 2, 3) model.

accuracy(fc_wmurders_autoarima2)
##                       ME      RMSE       MAE        MPE     MAPE      MASE
## Training set -0.01336585 0.2016929 0.1531053 -0.3332051 4.387024 0.9415259
##                     ACF1
## Training set -0.03193856

In this case some error measures improved while others got worse, so I will compare the residual diagnostics and the ACF/PACF plots.

ggtsdisplay(diff(wmurders, differences = 2))

The ACF/PACF pattern looks more consistent with ARIMA(0, 2, 2) than with ARIMA(0, 2, 3).

checkresiduals(fc_wmurders_arima.0.2.2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,2)
## Q* = 11.764, df = 8, p-value = 0.1621
## 
## Model df: 2.   Total lags used: 10
checkresiduals(fc_wmurders_autoarima2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,3)
## Q* = 10.706, df = 7, p-value = 0.152
## 
## Model df: 3.   Total lags used: 10

The residual diagnostics are similar for both models, so I choose the simpler ARIMA(0, 2, 2) model.