Nile

Exploratory Data Analysis

Let’s start with a quick warm-up where we explore the Nile dataset. First, load the data and make a plot of the autocorrelation function using the Acf function in the forecast library.
Try regenerating one of the plots on Rick’s 4th slide. There were two: the time series of the Nile annual flows and the relationship between y(t) and y(t+1). Remember to use functions like start(), end(), and frequency() to explore the data.
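The chunks below call Acf, ggplot, and (later) getSymbols without attaching their packages first; here is a minimal setup sketch, assuming the forecast, ggplot2, and quantmod packages are installed:
# Attach the packages used throughout this worksheet
library(forecast)  # Acf(), Pacf(), forecast()
library(ggplot2)   # scatter plots with fitted lines
library(quantmod)  # getSymbols() for the FRED exercise below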
data(Nile)
head(Nile) # ts object
## Time Series:
## Start = 1871
## End = 1876
## Frequency = 1
## [1] 1120 1160 963 1210 1160 1160
plot(Nile,
     # type = "b",  # uncomment to draw both points and lines
     col = "blue",
     xlab = "Time",
     ylab = "Flow (10^8 m^3)",
     main = "Nile Annual Flow Rate From 1871 to 1970")
plot(Nile[1:99], Nile[2:100], col = "blue",
     xlab = "year t", ylab = "year t + 1",
     main = "Relationship in Flows in Consecutive Years")
Acf(Nile, main="")
The first few autocorrelations are positive, so there seems to be some initial memory in the system, and then the later lags are insignificant. This pattern looks more like an AR process (roughly AR(3)) than an MA(1): an MA(1) would have only the first lag significant and the rest insignificant, and could also produce negative autocorrelation values. Here the first three or so lags show high correlation carried over from the previous lags.
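To see why a gradual, positive decay points toward an AR rather than an MA process, one quick sanity check is to simulate both and compare their ACFs. A sketch, where the seed and the coefficients of 0.5 are arbitrary illustrative choices:
# Simulate an AR(1) and an MA(1) and compare their ACF fingerprints
set.seed(42)
sim.ar <- arima.sim(model = list(ar = 0.5), n = 100)  # AR(1), phi = 0.5
sim.ma <- arima.sim(model = list(ma = 0.5), n = 100)  # MA(1), theta = 0.5
par(mfrow = c(1, 2))
Acf(sim.ar, main = "AR(1): gradual decay")
Acf(sim.ma, main = "MA(1): cuts off after lag 1")
par(mfrow = c(1, 1))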
Let’s now see if our instinct is correct. Make a series of plots like the one on slide 4 that relate the flow at time \(t\) to the flow at time \(t-j\) for some \(j\).
nile <- as.vector(Nile)
nile.df <- data.frame(nile1 = nile[1:95],
nile2 = nile[2:96],
nile3 = nile[3:97],
nile4 = nile[4:98],
nile5 = nile[5:99],
nile6 = nile[6:100])
length(Nile) # 100
## [1] 100
head(nile.df) # 95 rows: creating lags up to 5 costs 5 observations
## nile1 nile2 nile3 nile4 nile5 nile6
## 1 1120 1160 963 1210 1160 1160
## 2 1160 963 1210 1160 1160 813
## 3 963 1210 1160 1160 813 1230
## 4 1210 1160 1160 813 1230 1370
## 5 1160 1160 813 1230 1370 1140
## 6 1160 813 1230 1370 1140 995
# can also get more precise correlations
cor(nile.df["nile1"],nile.df["nile2"])
## nile2
## nile1 0.4966515
cor(nile.df["nile1"],nile.df["nile3"])
## nile3
## nile1 0.3837741
cor(nile.df["nile1"],nile.df["nile4"])
## nile4
## nile1 0.334788
cor(nile.df["nile1"],nile.df["nile5"])
## nile5
## nile1 0.2460455
cor(nile.df["nile1"],nile.df["nile6"])
## nile6
## nile1 0.2490949
ggplot(nile.df, aes(x=nile1, y=nile2)) +
geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(nile.df, aes(x=nile1, y=nile3)) +
geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(nile.df, aes(x=nile1, y=nile4)) +
geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(nile.df, aes(x=nile1, y=nile5)) +
geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(nile.df, aes(x=nile1, y=nile6)) +
geom_point(shape=1) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
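For the record, base R can draw this whole grid of lagged scatter plots in one call with lag.plot; a sketch, where the number of lags is a choice:
# One-call alternative to building the lagged data frame by hand
lag.plot(Nile, lags = 6, do.lines = FALSE)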
Do these correlations match what you saw in the Acf plot? In what ways? They agree.

In the next set of exercises, we are going to fit a series of AR(p) models and check whether the model captures the dependence structure of the noise terms. As we discussed in class, our aim is that the residuals of a fitted model exhibit white noise. In this exercise, we are not doing any formal hypothesis testing to determine whether one model is more appropriate than another. For now, use the Nile dataset and identify a model that appears to be most appropriate. Justify your findings.
In the following code chunk, we use the arima function to fit the model. The two key parameters we need are x, which is our dataset, and order, which tells us how many lags to use. For now, only toggle the first parameter to choose an appropriate AR(p) model (i.e., you will have order=c(x,0,0), where x is replaced by an appropriate number of lags for your model).
# (AR order, differencing, MA order)
# Example AR(1) model
mod.ar1 <- arima(x=nile, order=c(1,0,0))
Acf(residuals(mod.ar1), main='')
# Now try fitting your own models and examine the Acf of the residuals
mod.ar3 <- arima(x=nile, order=c(3,0,0))
Acf(residuals(mod.ar3), main='')
mod.ar5 <- arima(x=nile, order=c(5,0,0))
Acf(residuals(mod.ar5), main='') # AR(5) looks okay! But is it too much??
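One way to ask whether AR(5) is “too much” is to compare information criteria, which reward fit but penalize extra parameters. A sketch using the models fit above (lower values are better):
# Penalized goodness-of-fit comparison across candidate AR orders
AIC(mod.ar1, mod.ar3, mod.ar5)
BIC(mod.ar1, mod.ar3, mod.ar5)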
Now, use the partial ACF in \(\mathsf{R}\) (the Pacf function, or Acf with type="partial") to check whether it suggests what you found from your experiments above.
Acf(nile, type="partial", main='')
The PACF is clearly significant only at lag 1, so the AR(1) choice still looks pretty reliable.
With the model that you chose, run some diagnostics. Consider, for example, a histogram of the residuals. What do you expect it to look like? (Hint: Suppose we assume that the white noise is normally distributed). Also, check out the QQ plot for the residuals.
# Make a histogram of the residuals from your model
hist(residuals(mod.ar1))
# Make a QQ plot for the residuals from your model
qqnorm(residuals(mod.ar1))
qqline(residuals(mod.ar1))
# Make a histogram of the residuals from your model
hist(residuals(mod.ar3))
# Make a QQ plot for the residuals from your model
qqnorm(residuals(mod.ar3))
qqline(residuals(mod.ar3))
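Beyond eyeballing the residual ACF, the Ljung-Box test summarizes whether the first several residual autocorrelations are jointly zero. A sketch; lag = 10 is a conventional choice here, and fitdf counts the AR coefficients estimated:
# H0: the residuals are white noise up to the given lag
Box.test(residuals(mod.ar1), lag = 10, type = "Ljung-Box", fitdf = 1)
Box.test(residuals(mod.ar3), lag = 10, type = "Ljung-Box", fitdf = 3)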
Let’s try some forecasting with the model that we chose.
# Forecast data forward 10 periods
forecasted.data <- data.frame(forecast(mod.ar1, h=10))
total.data = data.frame(year = c(1:110),
flow = c(Nile, forecasted.data$Point.Forecast))
head(total.data)
## year flow
## 1 1 1120
## 2 2 1160
## 3 3 963
## 4 4 1210
## 5 5 1160
## 6 6 1160
plot(total.data, type="l", ylim=c(450, 1350))
lines(c(101:110), forecasted.data$Lo.95, col="red", type="l", lty=2)
lines(c(101:110), forecasted.data$Hi.95, col="red", type="l", lty=2)
lines(c(101:110), forecasted.data$Lo.80, col="blue", type="l", lty=2)
lines(c(101:110), forecasted.data$Hi.80, col="blue", type="l", lty=2)
abline(v=101, col="grey", lty=3)
abline(v=110, col="grey", lty=3)
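The forecast package can draw essentially the same picture in one line, shading the 80% and 95% intervals automatically; a sketch:
# Equivalent plot with shaded prediction intervals
plot(forecast(mod.ar1, h = 10))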
In this series of exercises, you will be in charge of your own journey. The aim is to combine the concepts in time series you have been mastering over the last couple of days as well as the themes from the entire bootcamp to analyze data.
We will be using the quantmod package’s API to bring financial data into \(\mathsf{R}\). We will be examining housing price index (HPI) data available from the St. Louis Federal Reserve (FRED) database.
rm(list=ls())
# Start by loading the data and converting to log growth rates
hp <- getSymbols("HPIPONM226S", src="FRED", auto.assign=FALSE) # "xts" "zoo"
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
# Symbols = a character vector specifying the names of each symbol to be loaded
# src = source. Current src methods available are: yahoo, google, MySQL, FRED, csv, RData, oanda, and av.
# https://fred.stlouisfed.org/series/HPIPONM226S
# xts objects are simple. Think of them as a matrix of observations combined with an index of corresponding dates and times. xts = matrix + times
# hp
# # if the getSymbols doesn't work
# library(xts)
# data <- read.csv("HPIPONM226S.csv", header=TRUE)
# hpi <- as.numeric(data$HPIPONM226S)
# date <- as.Date(as.character(hp$DATE), "%Y-%m-%d")
# hp <- xts(hpi, date)
# colnames(hp) = "HPI"
# hp
# sessionInfo()
colnames(hp) <- "HPI"
plot(hp)
plot(log(hp$HPI)) # log scale; differencing the log below gives growth rates
hp$growth <- diff(log(hp$HPI)) # <--- Make sure you understand what's happening here!
hp <- hp[-1,]
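To make sure you understand the diff(log(...)) step above: the difference of logs is log(x_t / x_{t-1}), the log growth rate, which for small changes is approximately the percentage change. A tiny numeric check with made-up index values:
x <- c(100, 105)   # hypothetical index values
diff(log(x))       # log growth: log(105/100), about 0.0488
105 / 100 - 1      # simple growth: 0.05, close for small changes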
# Plot the time series of data
# Hint: Just use the base R package
plot(hp$growth)
Yes, the housing crisis is clearly visible in the growth series.
library(tseries)
# ADF null hypothesis: the series is non-stationary (has a unit root).
# A large p-value means we fail to reject the null, consistent with the
# nonconstant mean and variance visible in the plot; the reported lag
# order (7) is the test's default truncation lag, not a detected cycle.
adf.test(hp$growth)
##
## Augmented Dickey-Fuller Test
##
## data: hp$growth
## Dickey-Fuller = -2.0518, Lag order = 7, p-value = 0.5549
## alternative hypothesis: stationary
Let’s focus on HPI data starting from May 1, 2014 to the present. Check whether that is stationary.
adf.test(hp[280:353,"growth"]) # small p-value: reject the null, so yes, stationary
##
## Augmented Dickey-Fuller Test
##
## data: hp[280:353, "growth"]
## Dickey-Fuller = -3.7404, Lag order = 4, p-value = 0.02736
## alternative hypothesis: stationary
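Hard-coded row indices like 280:353 are fragile if FRED revises or extends the series. Since hp is an xts object, date-string subsetting selects the same window more robustly; a sketch that should match the result above as of when the data were pulled:
# Same window, selected by date rather than by row number
growth.recent <- hp["2014-05-01/", "growth"]
adf.test(growth.recent)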
Use an ACF and a PACF plot to identify temporal dependence in the hp$growth data.
# Make an ACF plot
Acf(hp$growth, main='')
# Make a PACF plot
Acf(hp$growth, type="partial", main='')
#install.packages("ppcor")
# library(MASS)
# library(ppcor)
# n <- nrow(hp)
# g <- as.vector(hp$growth)
# hp.df <- data.frame(hp1 = g[1:(n-5)],
#                     hp2 = g[2:(n-4)],
#                     hp3 = g[3:(n-3)],
#                     hp4 = g[4:(n-2)],
#                     hp5 = g[5:(n-1)],
#                     hp6 = g[6:n])
# pcor(hp.df, method = "pearson")
The PACF finds the correlation of the residuals, i.e., what remains after removing the effects already explained by the earlier lag(s). In general, the “partial” correlation between two variables is the amount of correlation between them that is not explained by their mutual correlations with a specified set of other variables. For example, if we regress a variable Y on other variables X1, X2, and X3, the partial correlation between Y and X3 is the amount of correlation between Y and X3 that is not explained by their common correlations with X1 and X2.
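One way to see this definition in action is to compute a lag-2 partial autocorrelation by hand: regress both the series and its lag-2 values on the intervening lag-1 values, then correlate the two sets of residuals. A sketch applied to hp$growth; the result should be close to the lag-2 bar of the PACF plot above (the sample PACF uses a slightly different recursion, so it will not match exactly):
g  <- as.vector(hp$growth)
n  <- length(g)
y  <- g[3:n]       # series at time t
l1 <- g[2:(n-1)]   # lag 1
l2 <- g[1:(n-2)]   # lag 2
# Strip out what lag 1 explains from both, then correlate the leftovers
r.y  <- residuals(lm(y ~ l1))
r.l2 <- residuals(lm(l2 ~ l1))
cor(r.y, r.l2)     # compare to the lag-2 partial autocorrelation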
Fit a model that you think is appropriate and run some diagnostics on it.
# Fit the model
mod.hpi <- arima(x=hp$growth, order=c(6,0,0))
summary(mod.hpi)
##
## Call:
## arima(x = hp$growth, order = c(6, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 intercept
## 0.2194 0.1351 0.1782 0.0775 0.1759 0.1190 0.0030
## s.e. 0.0534 0.0543 0.0547 0.0546 0.0544 0.0538 0.0015
##
## sigma^2 estimated as 7.938e-06: log likelihood = 1571.18, aic = -3126.35
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.550527e-05 0.002817366 0.001961957 -Inf Inf 0.7816305
## ACF1
## Training set 0.004109879
# Check residuals for white noise
Acf(mod.hpi$residuals)
# Check for normality of residuals
hist(mod.hpi$residuals)
qqnorm(mod.hpi$residuals)
qqline(mod.hpi$residuals)
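As a cross-check on the hand-picked AR(6), the forecast package’s auto.arima searches over orders using information criteria; a sketch (its choice may well differ from ours, which is itself informative):
# Automated order search as a sanity check on the manual choice
mod.auto <- auto.arima(hp$growth, seasonal = FALSE)
summary(mod.auto)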
Finally, let’s forecast the data a few periods into the future.
forecast(mod.hpi, h=4)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 354 0.004341119 0.0007305189 0.007951719 -0.001180817 0.009863055
## 355 0.003458885 -0.0002375824 0.007155352 -0.002194374 0.009112144
## 356 0.003514898 -0.0002403030 0.007270099 -0.002228186 0.009257983
## 357 0.002740049 -0.0011204928 0.006600591 -0.003164140 0.008644238