AR, MA Exercises

Nile Exploratory Data Analysis

Let’s start with a quick warm-up where we explore the Nile dataset. First, load the data and make a plot of the autocorrelation function using the Acf function in the forecast library.
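
A minimal setup chunk (assuming the packages are already installed) loads everything used in these exercises.

library(forecast)   # Acf(), Pacf(), forecast()
library(ggplot2)    # scatterplots with fitted regression lines
library(quantmod)   # getSymbols() for the FRED data later on
library(tseries)    # adf.test()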

Try regenerating one of the plots on Rick’s 4th slide. There were two: the time series of the Nile annual flows and the relationship between \(y(t)\) and \(y(t+1)\). Remember to use functions like start(), end(), and frequency() to explore the data.

data(Nile)
head(Nile) # ts object
## Time Series:
## Start = 1871 
## End = 1876 
## Frequency = 1 
## [1] 1120 1160  963 1210 1160 1160
plot(Nile, 
     #type="b",
     col="blue", 
     xlab="Time",
     ylab= "Flow (10^8 m^3)", main="Nile Annual Flow Rate From 1871 to 1970")

plot(Nile[1:99], Nile[2:100],col="blue", 
     xlab="year t", ylab="year t + 1", 
     main="Relationship in Flows in Consecutives Years")

Acf(Nile, main="")

  • Describe the autocorrelations (e.g., are they positive, how do they change as the lag increases, etc.)

They are positive and decay as the lag increases: the first few lags show some memory in the system, and the later lags are insignificant.

  • On how many lags do you think the current value of the flow in the Nile depends? Why?

Roughly three lags: the first three or so autocorrelations are large, and the later lags are insignificant. An MA(1) process, by contrast, would show only the first autocorrelation as significant and the rest essentially zero.
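
As a quick reference for this reasoning: for a stationary AR(1) process the autocorrelations decay geometrically with the lag, while for an MA(1) they cut off after lag 1:

\[
\text{AR(1)}: \ \rho_k = \phi^k, \qquad
\text{MA(1)}: \ \rho_1 = \frac{\theta}{1+\theta^2}, \quad \rho_k = 0 \ \text{for } k \ge 2.
\]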

Let’s now see if our instinct is correct. Make a series of plots like the one on slide 4 that relate the flow at time \(t\) to the flow at time \(t-j\) for some \(j\).

nile <- as.vector(Nile) # drop the ts attributes so we can index freely
# build lagged copies: in each row, column nileK holds the flow k-1 years after nile1
nile.df <- data.frame(nile1 = nile[1:95],
                      nile2 = nile[2:96],
                      nile3 = nile[3:97],
                      nile4 = nile[4:98],
                      nile5 = nile[5:99],
                      nile6 = nile[6:100])

length(Nile) # 100
## [1] 100
head(nile.df) # 95 rows: building lags up to 5 years drops 5 observations
##   nile1 nile2 nile3 nile4 nile5 nile6
## 1  1120  1160   963  1210  1160  1160
## 2  1160   963  1210  1160  1160   813
## 3   963  1210  1160  1160   813  1230
## 4  1210  1160  1160   813  1230  1370
## 5  1160  1160   813  1230  1370  1140
## 6  1160   813  1230  1370  1140   995
# can also get more precise correlations
cor(nile.df["nile1"],nile.df["nile2"])
##           nile2
## nile1 0.4966515
cor(nile.df["nile1"],nile.df["nile3"])
##           nile3
## nile1 0.3837741
cor(nile.df["nile1"],nile.df["nile4"])
##          nile4
## nile1 0.334788
cor(nile.df["nile1"],nile.df["nile5"])
##           nile5
## nile1 0.2460455
cor(nile.df["nile1"],nile.df["nile6"])
##           nile6
## nile1 0.2490949
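
These hand-computed correlations should be close to the values behind the Acf plot (they will not match exactly, because the sample ACF uses the full series with a common mean and variance). The sample autocorrelations can be printed directly with acf():

# print the first few sample autocorrelations without re-drawing the plot
acf(Nile, lag.max=5, plot=FALSE)
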
ggplot(nile.df, aes(x=nile1, y=nile2)) +
         geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(nile.df, aes(x=nile1, y=nile3)) +
         geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(nile.df, aes(x=nile1, y=nile4)) +
         geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(nile.df, aes(x=nile1, y=nile5)) +
         geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(nile.df, aes(x=nile1, y=nile6)) +
         geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
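
As an aside (not part of the original exercise), base R's lag.plot() draws the same grid of lagged scatterplots in a single call:

lag.plot(Nile, lags=6, do.lines=FALSE)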

  • Do these plots agree or disagree with what you observe in the Acf plot? In what ways?

They agree: the scatterplot correlations weaken as the lag increases, matching the decaying autocorrelations in the Acf plot.

AR Models

In the next set of exercises, we are going to fit a series of AR(p) models and check whether the model captures the dependence structure of the noise terms. As we discussed in class, our aim is for the residuals of a fitted model to resemble white noise. In this exercise, we are not doing any formal hypothesis testing to determine whether one model is more appropriate than another. For now, use the Nile dataset and identify the model that appears most appropriate. Justify your findings.

In the following code chunk, we use the arima function to fit the model. The two key arguments we need are x, which is our dataset, and order, which tells the function how many lags to use. For now, only toggle the first element of order to choose an appropriate AR(p) model (i.e., you will have order=c(p,0,0), where p is replaced by an appropriate number of lags for your model).

# (AR order, differencing, MA order)
# Example AR(1) model
mod.ar1 <- arima(x=nile, order=c(1,0,0))
Acf(residuals(mod.ar1), main='')

# Now try fitting your own models and examine the Acf of the residuals
mod.ar3 <- arima(x=nile, order=c(3,0,0))
Acf(residuals(mod.ar3), main='')

mod.ar5 <- arima(x=nile, order=c(5,0,0))
Acf(residuals(mod.ar5), main='') # AR(5) looks okay! But is it too much??

Now, use the partial autocorrelation function (PACF) in \(\mathsf{R}\) to check whether it agrees with what you found from your experiments above.

Acf(nile, type="partial", main='')

  • What does this tell us about how we should go about finding an appropriate number of lags for an AR model (i.e., going about choosing \(p\))?

The PACF has a single significant spike at lag 1, which points to an AR(1) model: for an AR(p) process the PACF cuts off after lag p, so it is a reliable guide for choosing \(p\).
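
As a cross-check (not required by the exercise), the base function ar() selects the AR order by AIC, which can be compared with the PACF-based choice:

# let AIC pick the order of an AR(p) fit (Yule-Walker by default)
ar(nile, order.max=10)$order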

Model Diagnostics

With the model that you chose, run some diagnostics. Consider, for example, a histogram of the residuals. What do you expect it to look like? (Hint: Suppose we assume that the white noise is normally distributed). Also, check out the QQ plot for the residuals.

# Make a histogram of the residuals from your model
hist(residuals(mod.ar1))

# Make a QQ plot for the residuals from your model
qqnorm(residuals(mod.ar1))
qqline(residuals(mod.ar1))

# Make a histogram of the residuals from your model
hist(residuals(mod.ar3))

# Make a QQ plot for the residuals from your model
qqnorm(residuals(mod.ar3))
qqline(residuals(mod.ar3))
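
A formal complement to these visual checks (a sketch, not part of the original exercise) is the Ljung-Box test on the residuals: a large p-value is consistent with white noise. Here fitdf is set to the number of estimated AR coefficients.

# Ljung-Box test for remaining autocorrelation in the residuals
Box.test(residuals(mod.ar1), lag=10, type="Ljung-Box", fitdf=1)
Box.test(residuals(mod.ar3), lag=10, type="Ljung-Box", fitdf=3)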

Forecasting

Let’s try some forecasting with the model that we chose.

# Forecast data forward 10 periods
forecasted.data <- data.frame(forecast(mod.ar1, h=10))
total.data = data.frame(year = c(1:110),
                        flow = c(Nile, forecasted.data$Point.Forecast))

head(total.data)
##   year flow
## 1    1 1120
## 2    2 1160
## 3    3  963
## 4    4 1210
## 5    5 1160
## 6    6 1160
plot(total.data, type="l", ylim=c(450, 1350))
lines(c(101:110), forecasted.data$Lo.95, col="red", type="l", lty=2)
lines(c(101:110), forecasted.data$Hi.95, col="red", type="l", lty=2)
lines(c(101:110), forecasted.data$Lo.80, col="blue", type="l", lty=2)
lines(c(101:110), forecasted.data$Hi.80, col="blue", type="l", lty=2)
abline(v=101, col="grey", lty=3)
abline(v=110, col="grey", lty=3)
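
Note that the forecast package can draw essentially the same picture in one call (an alternative to the manual plotting above):

# fan chart with the 80% and 95% intervals shaded automatically
plot(forecast(mod.ar1, h=10))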

Time Series Pipeline Exercises

In this series of exercises, you will be in charge of your own journey. The aim is to combine the concepts in time series you have been mastering over the last couple of days as well as the themes from the entire bootcamp to analyze data.

We will be using the quantmod package’s API to bring financial data into \(\mathsf{R}\). We will be examining housing price index (HPI) data available from the St. Louis Federal Reserve (FRED) database.

rm(list=ls())
# Start by loading the data and converting to log growth rates
hp <- getSymbols("HPIPONM226S", src="FRED", auto.assign=FALSE) # "xts" "zoo"
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
# Symbols = a character vector specifying the names of each symbol to be loaded
# src = source. Current src methods available are: yahoo, google, MySQL, FRED, csv, RData, oanda, and av.
# https://fred.stlouisfed.org/series/HPIPONM226S
# xts objects are simple. Think of them as a matrix of observations combined with an index of corresponding dates and times. xts = matrix + times 

# hp
# # if getSymbols doesn't work, read the series from a downloaded CSV instead
# library(xts)
# data <- read.csv("HPIPONM226S.csv", header=TRUE)
# hpi <- as.numeric(data$HPIPONM226S)
# date <- as.Date(as.character(data$DATE), "%Y-%m-%d")
# hp <- xts(hpi, date)
# colnames(hp) <- "HPI"
# hp

# sessionInfo()
colnames(hp) <- "HPI"
plot(hp)

plot(log(hp$HPI)) # log scale; differencing the log (next step) removes the trend and gives growth rates

hp$growth <- diff(log(hp$HPI)) # <--- Make sure you understand what's happening here!
hp <- hp[-1,]
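
For reference, the log difference computed above is (approximately) the period-over-period growth rate, which is why the new column is called growth:

\[
\Delta \log y_t = \log y_t - \log y_{t-1} = \log\!\left(\frac{y_t}{y_{t-1}}\right) \approx \frac{y_t - y_{t-1}}{y_{t-1}}.
\]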

# Plot the time series of data
# Hint: Just use the base R package
plot(hp$growth)

  • Are there any unusual jumps in the data? What do you think caused them?

Yes: the sharp drop in growth rates around 2008–2009 corresponds to the housing crisis.

  • Do you think the data is stationary? Test it in the next block of code using the Augmented Dickey-Fuller test. Hint: try to use the graph to guide your understanding of what is happening.

library(tseries)
# Null hypothesis: the series has a unit root (is non-stationary).
# The large p-value below means we fail to reject the null, so the full
# series does not look stationary. The reported "Lag order" is just the
# test's default choice based on the sample size, not a detected cycle.
adf.test(hp$growth)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  hp$growth
## Dickey-Fuller = -2.0518, Lag order = 7, p-value = 0.5549
## alternative hypothesis: stationary

Let’s focus on HPI data starting from May 1, 2014 to the present. Check whether that is stationary.

adf.test(hp[280:353,"growth"]) # p-value < 0.05, so we reject the null: this subsample looks stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  hp[280:353, "growth"]
## Dickey-Fuller = -3.7404, Lag order = 4, p-value = 0.02736
## alternative hypothesis: stationary
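
Since hp is an xts object, the same window can also be selected by date rather than by row number (a sketch; it assumes rows 280:353 above correspond to May 2014 onward):

# ISO-8601 date-range subsetting on the xts object
adf.test(hp["2014-05-01/", "growth"])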

Use an ACF and a PACF plot to identify temporal dependence in the hp$growth data.

# Make an ACF plot
Acf(hp$growth, main='')

# Make a PACF plot
Acf(hp$growth, type="partial", main='')

#install.packages("ppcor")
# library(MASS)
# library(ppcor)

# g <- as.numeric(hp$growth)
# hp.df <- data.frame(hp1 = g[1:(length(g)-5)],
#                     hp2 = g[2:(length(g)-4)],
#                     hp3 = g[3:(length(g)-3)],
#                     hp4 = g[4:(length(g)-2)],
#                     hp5 = g[5:(length(g)-1)],
#                     hp6 = g[6:length(g)])
# pcor(hp.df, method="pearson")

The PACF measures the correlation that remains after removing the effects already explained by the earlier lag(s). In general, the “partial” correlation between two variables is the amount of correlation between them that is not explained by their mutual correlations with a specified set of other variables. For example, if we regress a variable Y on other variables X1, X2, and X3, the partial correlation between Y and X3 is the amount of correlation between Y and X3 that is not explained by their common correlations with X1 and X2.
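
A quick numeric illustration of this idea (a sketch, not part of the exercise): the lag-2 partial autocorrelation should be close to the coefficient on the second lag in a regression of the series on its first two lags.

g <- as.numeric(hp$growth)
n <- length(g)
# regress the series on its first two lags; the coefficient on the second
# lag plays the role of the partial autocorrelation at lag 2
fit2 <- lm(g[3:n] ~ g[2:(n-1)] + g[1:(n-2)])
coef(fit2)[3]
# compare with the PACF value at lag 2
Acf(hp$growth, type="partial", plot=FALSE)$acf[2]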

  • What kind of model do you think is appropriate?

Fit a model that you think is appropriate and run some diagnostics on it.

# Fit the model
mod.hpi <- arima(x=hp$growth, order=c(6,0,0))
summary(mod.hpi)
## 
## Call:
## arima(x = hp$growth, order = c(6, 0, 0))
## 
## Coefficients:
##          ar1     ar2     ar3     ar4     ar5     ar6  intercept
##       0.2194  0.1351  0.1782  0.0775  0.1759  0.1190     0.0030
## s.e.  0.0534  0.0543  0.0547  0.0546  0.0544  0.0538     0.0015
## 
## sigma^2 estimated as 7.938e-06:  log likelihood = 1571.18,  aic = -3126.35
## 
## Training set error measures:
##                        ME        RMSE         MAE  MPE MAPE      MASE
## Training set 1.550527e-05 0.002817366 0.001961957 -Inf  Inf 0.7816305
##                     ACF1
## Training set 0.004109879
# Check residuals for white noise
Acf(mod.hpi$residuals)

# Check for normality of residuals
hist(mod.hpi$residuals)

qqnorm(mod.hpi$residuals)
qqline(mod.hpi$residuals)

Finally, let’s forecast the data a few quarters into the future.

forecast(mod.hpi, h=4)
##     Point Forecast         Lo 80       Hi 80        Lo 95       Hi 95
## 354    0.004341119  0.0007305189 0.007951719 -0.001180817 0.009863055
## 355    0.003458885 -0.0002375824 0.007155352 -0.002194374 0.009112144
## 356    0.003514898 -0.0002403030 0.007270099 -0.002228186 0.009257983
## 357    0.002740049 -0.0011204928 0.006600591 -0.003164140 0.008644238
  • What value does the point forecast converge to if you forecast many quarters into the future?
  • Challenge Question: How do you think this is related to the model coefficients?
  • Are you satisfied with the model we fit? Would you choose another model? Why?