Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
Explain the differences among these figures. Do they all indicate that the data are white noise?
The first thing I notice is that the interval for the critical values gets narrower as the sample size grows. The first figure has a spike that sits right at the cutoff, which we can discount because it is not among the first few lags, and an occasional spike near the 95% bounds is expected by chance. Since essentially none of the spikes fall outside the critical-value limits, we can say that all three series look like white noise.
Why are the critical values at different distances from the mean of zero? Why are the auto-correlations different in each figure when they each refer to white noise?
To find the critical values, take \[\pm 1.96/\sqrt{T}\], where T is the number of data points we have. It makes sense for the interval to get smaller as we divide by larger values of T. The autocorrelations themselves differ across the figures because each one is computed from a different random sample; with more data, the sample autocorrelations of white noise cluster more tightly around zero.
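As a quick sketch of this calculation (the values are just \[\pm 1.96/\sqrt{T}\] worked out in R):
n <- c(36, 360, 1000)
data.frame(T = n, bound = 1.96 / sqrt(n)) # roughly 0.327, 0.103, and 0.062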
A classic example of a non-stationary series is the daily closing IBM stock price series(data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary.
Before we do the plots, note from the text that a data set is considered stationary when its “properties do not depend on the time at which the series is observed.” Data sets that show seasonality or trend are NOT stationary. Time plots of stationary series are roughly horizontal, and the ACF of a stationary series drops to zero quickly, while the ACF of a non-stationary series decreases slowly.
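As a quick illustration of that contrast (a sketch using only base R; output not shown), white noise is stationary while a random walk is not:
set.seed(1)
wn <- ts(rnorm(100)) # white noise: roughly horizontal time plot, ACF drops to zero quickly
rw <- ts(cumsum(rnorm(100))) # random walk: wandering level, ACF decays slowly
par(mfrow = c(2, 2)); plot(wn); acf(wn); plot(rw); acf(rw)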
library(fpp2)
data(ibmclose)
tsdisplay(ibmclose)
The time plot shows a downward trend, which immediately tells us the series is non-stationary. The ACF decays toward zero very slowly, not the quick drop we would expect with stationary data. The PACF has a single large spike at the first lag (a large positive \[r_{1}\]), so the series behaves like an AR(1) with a coefficient near one, i.e. close to a random walk; this also tells us the series should be differenced once.
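As an additional check (a sketch; output not shown), a unit root test and ndiffs should both point toward differencing:
library(tseries)
kpss.test(ibmclose) # a small p-value rejects the null of level stationarity
ndiffs(ibmclose) # number of first differences suggested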
For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.
data(usnetelec)
tsdisplay(usnetelec)
lambda <- BoxCox.lambda(usnetelec)
usnetelec.boxcox <- BoxCox(usnetelec, lambda)
tsdisplay(usnetelec.boxcox)
print(paste0("The lambda for the usnetelec time series data is: ", lambda, "."))
## [1] "The lambda for the usnetelec time series data is: 0.516771443964645."
The lambda value is about 0.52. From the PACF plot, I believe only a first-order difference will be needed. Now that we have our lambda value, we can use ndiffs to find the order of differencing.
library(forecast)
usnetelec.diff <- ndiffs(usnetelec.boxcox)
print(paste0("The order of differencing for the usnetelec time series data is: ", usnetelec.diff, "."))
## [1] "The order of differencing for the usnetelec time series data is: 2."
Using the ndiffs function, my initial thought about the order turned out to be incorrect: the order found by ndiffs is 2.
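To confirm the ndiffs suggestion (a sketch; output not shown), we can difference the transformed series twice and inspect it:
usnetelec.stationary <- diff(diff(usnetelec.boxcox))
tsdisplay(usnetelec.stationary)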
data(usgdp)
tsdisplay(usgdp)
lambda.gdp <- BoxCox.lambda(usgdp)
usgdp.boxcox <- BoxCox(usgdp, lambda.gdp) # lambda.gdp = 0.3663520
tsdisplay(usgdp.boxcox)
print(paste0("The lambda for the usgdp time series data is: ", lambda.gdp, "."))
## [1] "The lambda for the usgdp time series data is: 0.366352049520934."
The lambda value is about 0.37, and the PACF plot suggests that a differencing of order one is needed.
usgdp.diff <- ndiffs(usgdp.boxcox)
print(paste0("The order of differencing for the usgdp time series data is: ", usgdp.diff, "."))
## [1] "The order of differencing for the usgdp time series data is: 1."
I was correct in my assumption that the order was 1, as shown by ndiffs.
data(mcopper)
tsdisplay(mcopper)
lambda.mcopper <- BoxCox.lambda(mcopper)
mcopper.boxcox <- BoxCox(mcopper, lambda.mcopper)
tsdisplay(mcopper.boxcox)
print(paste0("The lambda for the mcopper time series data is: ", lambda.mcopper, "."))
## [1] "The lambda for the mcopper time series data is: 0.191904709003829."
The lambda value is about 0.19, and the plots lead me to believe that the order of differencing is 3. We will once again use ndiffs to verify.
mcopper.diff <- ndiffs(mcopper.boxcox)
print(paste0("The order of differencing for the mcopper time series data is: ", mcopper.diff, "."))
## [1] "The order of differencing for the mcopper time series data is: 1."
data(enplanements)
tsdisplay(enplanements)
lambda.enplanement <- BoxCox.lambda(enplanements)
enplanements.boxcox <- BoxCox(enplanements, lambda.enplanement)
tsdisplay(enplanements.boxcox)
print(paste0("The lambda for the enplanements time series data is: ", lambda.enplanement, "."))
## [1] "The lambda for the enplanements time series data is: -0.226946111237065."
enplanements.diff <- ndiffs(enplanements.boxcox)
print(paste0("The order of differencing for the enplanements time series data is: ", enplanements.diff, "."))
## [1] "The order of differencing for the enplanements time series data is: 1."
data(visitors)
tsdisplay(visitors)
lambda.visitors <- BoxCox.lambda(visitors)
visitors.boxcox <- BoxCox(visitors, lambda.visitors)
print(paste0("The lambda for the visitors time series data is: ", lambda.visitors, "."))
## [1] "The lambda for the visitors time series data is: 0.277524910835111."
visitors.diff <- ndiffs(visitors.boxcox)
print(paste0("The order of differencing for the visitors time series data is: ", visitors.diff, "."))
## [1] "The order of differencing for the visitors time series data is: 1."
For your retail data, find the appropriate order of differencing to obtain stationary data.
retaildata <- readxl::read_excel("retail.xlsx", skip = 1)
myts_2 <- ts(retaildata[, "A3349336V"], frequency = 12, start = c(1982, 4) )
tsdisplay(myts_2)
lambda.myts <- BoxCox.lambda(myts_2)
myts.boxcox <- BoxCox(myts_2, lambda.myts)
print(paste0("The lambda for my retail time series data is: ", lambda.myts, "."))
## [1] "The lambda for my retail time series data is: 0.0248179447964746."
myts.diff <- ndiffs(myts.boxcox)
print(paste0("The order of differencing for my retail time series data is: ", myts.diff, "."))
## [1] "The order of differencing for my retail time series data is: 1."
Use R to simulate and plot some data from simple ARIMA models.
Use the following R code to generate data from an AR(1) model with \[\phi_{1} = 0.6\] and \[\sigma^{2} = 1\]. The process starts with \[y_{1} = 0\].
set.seed(120)
y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100)
y[i] <- .6 * y[i-1] + e[i]
tsdisplay(y)
ndiffs(y)
## [1] 0
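The same AR(1) can also be simulated with base R's arima.sim; this is just a sketch for comparison, and because arima.sim initializes the process differently, the exact series will not match the manual loop:
y.sim <- arima.sim(model = list(ar = 0.6), n = 100, sd = 1)
tsdisplay(y.sim)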
phi <- .1 # the coefficient being varied is the AR parameter phi, not sigma; the same errors e are reused
for(i in 2:100)
y[i] <- phi * y[i-1] + e[i]
tsdisplay(y)
ndiffs(y)
## [1] 0
phi <- .5
for(i in 2:100)
y[i] <- phi * y[i-1] + e[i]
tsdisplay(y)
ndiffs(y)
## [1] 0
phi <- .9
for(i in 2:100)
y[i] <- phi * y[i-1] + e[i]
tsdisplay(y)
ndiffs(y)
## [1] 1
phi <- .8
for(i in 2:100)
y[i] <- phi * y[i-1] + e[i]
tsdisplay(y)
ndiffs(y)
## [1] 1
phi <- .7
for(i in 2:100)
y[i] <- phi * y[i-1] + e[i]
tsdisplay(y)
ndiffs(y)
## [1] 1
phi <- .2
for(i in 2:100)
y[i] <- phi * y[i-1] + e[i]
tsdisplay(y)
ndiffs(y)
## [1] 0
When we increase \[\phi\], the critical interval stays at about \[\pm 0.2\] (that is, \[\pm 1.96/\sqrt{100}\]) in every plot, but the range/variability of the series increases. I also wanted to see how the changes affect the output of ndiffs: for the smaller values of \[\phi\] it reports that no differencing is needed, while for \[\phi\] of 0.7 and above it suggests one difference, presumably because the series starts to resemble a random walk.
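A minimal check of why the variability grows with \[\phi\]: for a stationary AR(1) with \[\sigma^{2} = 1\], the theoretical variance is \[1/(1 - \phi^{2})\], which blows up as \[\phi\] approaches 1:
phis <- c(.1, .2, .5, .6, .7, .8, .9)
data.frame(phi = phis, theoretical_var = 1 / (1 - phis^2))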
Write your own code to generate data from an MA(1) with \[\theta_{1} = .6\] and \[\sigma^{2} = 1\].
We will use the formula \[y_{t} = \varepsilon_{t} + \theta_{1}\varepsilon_{t-1}\].
set.seed(100)
y.MA <- ts(numeric(100))
e.MA <- rnorm(100)
for(i in 2:100){
y.MA[i] <- .6*e.MA[i-1] + e.MA[i]
}
tsdisplay(y.MA)
ndiffs(y.MA)
## [1] 0
theta = .5
yMA <- ts(numeric(100))
eMA <- rnorm(100)
for(i in 2:100){
yMA[i] <- theta*eMA[i-1] + eMA[i]
}
tsdisplay(yMA)
theta = .9
yMA <- ts(numeric(100))
eMA <- rnorm(100)
for(i in 2:100){
yMA[i] <- theta*eMA[i-1] + eMA[i]
}
tsdisplay(yMA)
theta = .4
yMA <- ts(numeric(100))
eMA <- rnorm(100)
for(i in 2:100){
yMA[i] <- theta*eMA[i-1] + eMA[i]
}
tsdisplay(yMA)
It looks as though increasing theta increases the size of the lag-1 spike in the ACF and the variability of the series.
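This matches the theory: for an MA(1) process the only nonzero autocorrelation is at lag 1, \[\rho_{1} = \theta_{1}/(1 + \theta_{1}^{2})\], which increases with theta for theta up to 1. A quick sketch of the values:
thetas <- c(.4, .5, .6, .9)
data.frame(theta = thetas, rho1 = thetas / (1 + thetas^2))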
Generate data from an ARMA(1,1) model with \[\phi_{1} = 0.6\], \[\theta_{1} = 0.6\] and \[\sigma^{2} = 1\].
set.seed(100)
phi = 0.6
theta=0.6
y.arma1 <- ts(numeric(100))
e.arma1 <- rnorm(100)
for(i in 2:100)
y.arma1[i] <- phi*y.arma1[i-1] + theta*e.arma1[i-1] + e.arma1[i]
tsdisplay(y.arma1)
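As a sketch for comparison (the seed handling differs, so the series will not match the loop exactly), the same ARMA(1,1) can be generated with arima.sim:
y.arma.sim <- arima.sim(model = list(ar = 0.6, ma = 0.6), n = 100, sd = 1)
tsdisplay(y.arma.sim)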
Generate data from an AR(2) model with \[\phi_{1} = -.8\], \[\phi_{2} = 0.3\] and \[\sigma^{2} = 1\]. (Note that these parameters will give a non-stationary series.)
set.seed(100)
phi1 = -.8
phi2 = .3
y.ar1 <- ts(numeric(100))
e.ar1 <- rnorm(100)
for(i in 3:100)
y.ar1[i] <- phi1*y.ar1[i-1] + phi2*y.ar1[i-2] + e.ar1[i] # AR(2): the second-lag term uses y, not the errors
tsdisplay(y.ar1)
From the above plots, the AR(2) series oscillates and its amplitude grows over time, which confirms the note that these parameters give a non-stationary series; it is far less stable (more variable) than the ARMA(1,1) series above.
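We can verify the non-stationarity by checking the roots of the AR characteristic polynomial \[1 - \phi_{1}B - \phi_{2}B^{2}\]; stationarity requires every root to lie outside the unit circle (modulus greater than 1). This is a sketch (output not shown); note that arima.sim cannot be used here, because it refuses non-stationary AR coefficients:
Mod(polyroot(c(1, -phi1, -phi2))) # with phi1 = -.8, phi2 = .3, one root has modulus below 1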
Consider wmurders, the number of women murdered each year (per 100,000 standard population in the United States).
By studying appropriate graphs of the series in R, find an appropriate ARIMA(p, d, q) model for these data.
data("wmurders")
tsdisplay(wmurders)
ndiffs(wmurders)
## [1] 2
Looking at the time plot for this series, we see that it is not stationary, and ndiffs suggests a differencing of order two. I will complete the differencing, then look at the plot again.
new.wmurders <- diff(diff(wmurders))
tsdisplay(new.wmurders)
After completing two orders of differencing, I did not like how the plot looked, so I went back and applied just one.
diff.wmurders <- diff(wmurders)
tsdisplay(diff.wmurders)
The differencing of order one looked significantly better, with less variability. I will also perform a unit root test to verify, because I was initially led to believe that a second-order differencing should be performed.
library(tseries)
kpss.test(diff.wmurders)
##
## KPSS Test for Level Stationarity
##
## data: diff.wmurders
## KPSS Level = 0.46973, Truncation lag parameter = 3, p-value =
## 0.04848
The KPSS null hypothesis is stationarity, and the p-value of 0.048 is just below 0.05, so strictly the test marginally rejects level stationarity for the once-differenced series; this is consistent with ndiffs suggesting two differences. The first-differenced series still looks stable in the plot, however, so I will proceed with d = 1.
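For comparison (a sketch; output not shown), the same test can be run on the twice-differenced series:
kpss.test(diff(diff(wmurders)))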
Should you include a constant in the model? Explain.
The model we are going with is ARIMA(0,1,0), with d = 1. With one order of differencing, including a constant adds a drift term (an average year-to-year change) to the model, which seems like the way to go for this data.
Write this model in terms of the backshift operator.
The backshift operator is convenient for describing the process of differencing. We performed a first difference on our time series and included a drift, so the model is:
\[(1 - B)y_{t} = c + \varepsilon_{t}\]
Fit the model using R and examine the residuals. Is the model satisfactory?
library(forecast)
(fit <- Arima(wmurders, order = c(0,1,0), include.constant=TRUE)) #because we determined in a previous question we needed the constant for drift.
## Series: wmurders
## ARIMA(0,1,0) with drift
##
## Coefficients:
## drift
## 0.0030
## s.e. 0.0291
##
## sigma^2 estimated as 0.04649: log likelihood=6.73
## AIC=-9.47 AICc=-9.23 BIC=-5.49
Acf(residuals(fit))
Box.test(residuals(fit), lag = 24, fitdf = 4, type = "Ljung") # strictly, fitdf should equal the number of estimated parameters, which is 1 here (the drift)
##
## Box-Ljung test
##
## data: residuals(fit)
## X-squared = 23.329, df = 20, p-value = 0.2729
The model seems satisfactory to me; initially it did not, but that was due to using the incorrect data, as explained below.
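The forecast package's checkresiduals() bundles the residual time plot, ACF, and Ljung-Box test into a single call, as a convenient alternative to the steps above (a sketch; output not shown):
checkresiduals(fit)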
fc <- forecast(fit, h = 3)
fc
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005 2.592345 2.316033 2.868658 2.169762 3.014929
## 2006 2.595308 2.204543 2.986073 1.997684 3.192931
## 2007 2.598270 2.119683 3.076858 1.866334 3.330207
Earlier my forecasts came out negative, which they should not be. I went back and realized I had accidentally fit the model to the differenced data. Once I switched to the original series, the forecasts above look significantly better.
Create a plot of the series with forecasts and prediction intervals for the next 3 periods.
plot(fc)
Does auto.arima( ) give the same model you have chosen? If not, which model do you think is better?
auto.arima(wmurders, stepwise = FALSE, approximation = FALSE)
## Series: wmurders
## ARIMA(0,2,3)
##
## Coefficients:
## ma1 ma2 ma3
## -1.0154 0.4324 -0.3217
## s.e. 0.1282 0.2278 0.1737
##
## sigma^2 estimated as 0.04475: log likelihood=7.77
## AIC=-7.54 AICc=-6.7 BIC=0.35
fit
## Series: wmurders
## ARIMA(0,1,0) with drift
##
## Coefficients:
## drift
## 0.0030
## s.e. 0.0291
##
## sigma^2 estimated as 0.04649: log likelihood=6.73
## AIC=-9.47 AICc=-9.23 BIC=-5.49
The auto.arima() function went with an ARIMA(0,2,3) model, which is not what I chose. Its measures (AIC, AICc) are higher, which indicates the automatic model is not as good as the one we chose, though strictly speaking AIC values are only comparable between models with the same order of differencing, so the comparison should be treated with caution.