Outline
This project has several sections and will provide you with a concise introduction to time series concepts in R. We will learn the essential theory and also practice fitting the four main types of time series models, getting you up and running with all the basics in a little more than an hour!
Introduction to Rhyme Environment
Time Series Data Overview (Theory)
Why Time Series? (Theory)
Key Concepts: Autocorrelation / Autocovariance (Theory)
Key Concepts: Stationarity (Theory)
Checking for Stationarity (Practice)
Transforming for Stationarity: Differencing (Practice)
Transforming for Stationarity: Detrending (Practice)
Basic Model Types: AR(p), MA(q), ARMA(p,q), ARIMA(p,d,q), Decomposition (Theory)
Fitting AR / MA / ARMA / ARIMA models with the Box Jenkins Method (Theory)
Box Jenkins Method: Checking for Stationarity (Practice)
Box Jenkins Method: Transforming for Stationarity & Identifying Model Parameters (Practice)
Box Jenkins Method: Checking the Residuals of the Model Fit (Practice)
Making a Forecast for Each Model (Practice)
Fitting STL (Seasonal Trend Loess) Decomposition Models (Practice)
Where to go Next
#load required R packages
library(IRdisplay)
library(magrittr)
library(tidyverse)
## ── Attaching packages ──────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────── tidyverse_conflicts() ──
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(forecast)
library(tseries)
library(ggthemes)
theme_set(theme_economist())
#load helper R functions
source("/cloud/project/compare_models_function.R")
source("/cloud/project/sim_random_walk_function.R")
source("/cloud/project/sim_stationary_example_function.R")
print("Loading is completed")
## [1] "Loading is completed"
Time Series Data Overview
(Univariate) time series data is defined as sequence data over time: $X_1, X_2, \ldots, X_T$, where $t$ is the time period and $X_t$ is the value of the time series at that point.
Examples: daily temperatures in Boston, US presidential election turnout by year, minute stock prices
Variables in time series models generally fall into three categories:
(1) endogenous
(2) random noise
(3) exogenous
All time series models involve (1) and (2), but (3) is optional.
Why Time Series?
Why not simply use a standard machine learning model for forecasting? The answer is that:
many forecasting tasks actually involve small samples, which makes machine learning less effective
time series models are more interpretable and less black box than machine learning algorithms
time series models appropriately account for forecasting uncertainty
As an example, let's look at the following data generating process, known as a random walk: $X_t = X_{t-1} + \epsilon_t$. We can compare the forecasting performance of linear regression to that of a basic time series model known as an AR(1) model.
#run function to compare linear regression to basic AR(1) time series model
compare.models(n=100)
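The sourced compare.models() helper wraps this comparison; a minimal sketch of the underlying idea (an assumed illustration, not the helper's actual code) might look like this:
#sketch: simulate a random walk and compare linear regression to an AR(1) fit
set.seed(1)
n <- 100
x <- cumsum(rnorm(n))                                    #simulated random walk
lm.fit <- lm(X ~ t, data = data.frame(t = 1:n, X = x))   #regression on time
ar1.fit <- Arima(x, order = c(1, 0, 0))                  #AR(1) model
lm.pred <- predict(lm.fit, newdata = data.frame(t = (n + 1):(n + 10)))
ar1.pred <- forecast(ar1.fit, h = 10)$mean               #10-step-ahead forecasts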
Key Concepts: Autocorrelation/Autocovariance
Autocorrelation/autocovariance refers to the correlation/covariance between two observations of the time series at different points in time.
The central idea is how related the values of the time series are over time.
For ease of interpretation we typically focus on autocorrelation, i.e. the correlation between $X_t$ and $X_{t+p}$ for some integer $p$.
A related concept is partial autocorrelation, which computes the correlation adjusting for previous lags/periods, i.e. the autocorrelation between $X_t$ and $X_{t+p}$ adjusting for the correlation of $X_t$ with $X_{t+1}, \ldots, X_{t+p-1}$.
When analyzing time series we usually view autocorrelation/partial autocorrelation in ACF/PACF plots.
Let's view this for the random walk model we analyzed above: $X_t = X_{t-1} + \epsilon_t$.
#simulate random walk
dat<-sim.random.walk()
#plot random walk
dat %>%
ggplot(aes(t,X)) + geom_line() + xlab("T") + ylab("X") + ggtitle("Time Series Plot")
#ACF plot
ggAcf(dat$X,type="correlation") + ggtitle("Autocorrelation ACF Plot")
#PACF plot
ggAcf(dat$X,type="partial") + ggtitle("Partial Autocorrelation PACF Plot")
Key Concepts: Stationarity
The second key concept in time series is stationarity.
While the concept can get quite technical, the basic idea is examining whether the distribution of the data over time is consistent.
There are two main forms of stationarity.
Strong stationarity requires that the cumulative distribution function of the data does not depend on time:
$F_X(X_1, \ldots, X_T) = F_X(X_{1+\Delta}, \ldots, X_{T+\Delta}) \quad \forall \Delta \in \mathbb{R}$
Weak stationarity requires that:
the mean of the time series is constant, $E(X_t) = E(X_{t+\Delta})$
the autocovariance/autocorrelation only depends on the time difference between points, $ACF(X_t, X_{t+\Delta-1}) = ACF(X_1, X_\Delta)$
the time series has a finite variance, $Var(X_\Delta) < \infty \ \forall \Delta \in \mathbb{R}$
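As a quick illustration (an assumed sketch, not from the project helper files), iid white noise satisfies these weak stationarity conditions:
#sketch: iid noise has a constant mean and autocorrelations that depend only on the lag
set.seed(1)
x <- rnorm(500)
c(mean(x[1:250]), mean(x[251:500]))  #means of the two halves are roughly equal
acf(x, lag.max = 5, plot = FALSE)    #autocorrelations near zero at every lag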
Checking for Stationarity
#create three time series for example
df<-sim.stationary.example(n=1000)
head(df);dim(df)
## t X1 X2 X3
## 1 1 0.3376258 1.144196 1.8088315
## 2 2 -0.9906829 1.799888 0.7675419
## 3 3 -1.9938056 1.293361 1.5449087
## 4 4 -0.4263313 5.455526 0.4429517
## 5 5 0.7133339 5.452631 0.2237386
## 6 6 -0.6529274 7.750084 -1.4701338
## [1] 1000 4
Check a plot of the time series over time and look for a constant mean and finite variance, i.e. the values appear bounded.
#plot nonstationary and stationary time series
g1<- ggplot(df,aes(x=t,y=X1)) + geom_line() + xlab("T") + ylab("X1") + ggtitle("Nonstationary")
g2<- ggplot(df,aes(x=t,y=X3)) + geom_line() + xlab("T") + ylab("X3") + ggtitle("Stationary")
grid.arrange(g1,g2)
Look at the ACF plot and see if it dies off quickly as opposed to a gradual decline.
#ACF for nonstationary and stationary time series
g1<- ggAcf(df$X1,type="correlation") + xlab("T") + ylab("X1") + ggtitle("Nonstationary")
g2<- ggAcf(df$X3,type="correlation") + xlab("T") + ylab("X3") + ggtitle("Stationary")
grid.arrange(g1,g2)
Perform unit root tests such as the Augmented Dickey–Fuller test.
#perform unit root test; the nonstationary example has a large, non-significant p-value
adf.test(df$X1)
##
## Augmented Dickey-Fuller Test
##
## data: df$X1
## Dickey-Fuller = -2.4507, Lag order = 9, p-value = 0.3875
## alternative hypothesis: stationary
adf.test(df$X3)
## Warning in adf.test(df$X3): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: df$X3
## Dickey-Fuller = -9.8636, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
Transforming for Stationarity
Differencing
Differencing involves taking differences between successive time series values. The order of differencing is defined as $p$ for $X_t - X_{t-p}$.
Let's transform a nonstationary time series to a stationary one by differencing, using the random walk model $X_t = X_{t-1} + \epsilon_t$, where $\epsilon_t \sim N(0, \sigma^2)$ iid.
Differencing with an order of one gives $\tilde{X}_t = X_t - X_{t-1} = \epsilon_t$.
#difference time series to make stationary
X1.diff<- df$X1 - lag(df$X1,1)
#plot ACF of original and differenced time series
g1<- ggAcf(df$X1,type="correlation")
g2<- ggAcf(X1.diff,type="correlation")
grid.arrange(g1,g2)
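Base R's diff() computes the same first difference; as an optional sanity check (a small assumed sketch), the differenced random walk should now pass the ADF test:
#ADF test on the differenced series should reject the unit root null (small p-value)
adf.test(diff(df$X1))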
Detrending
Detrending involves removing a deterministic relationship with time. As an example, suppose we have the following data generating process: $X_t = \beta t + \epsilon_t$, where $\epsilon_t \sim N(0, \sigma^2)$ iid. Detrending uses the transformed time series $\tilde{X}_t = X_t - \beta t = \epsilon_t$.
#detrending time series to make stationary
detrended<- resid(lm(X2~t,data=df))
#plot ACF of original and detrended time series
g1<- ggAcf(df$X2,type="correlation")
g2<- ggAcf(detrended,type="correlation")
grid.arrange(g1,g2)
Basic Model Types: AR(p), MA(q), ARMA(p,q), ARIMA(p,d,q), Decomposition
Autoregressive AR(p) Models
AR models specify $X_t$ as a function of lagged time series values $X_{t-1}, X_{t-2}, \ldots$,
i.e. $X_t = \mu + \phi_1 X_{t-1} + \ldots + \phi_p X_{t-p} + \epsilon_t$,
where $\mu$ is a mean term and $\epsilon_t \sim N(0, \sigma^2)$ iid is a random error term.
When fitting an AR model the key choice is p, the number of lags to include.
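As a small illustrative sketch (assumed, not part of the original project code), we can simulate an AR(2) process and recover its coefficients with Arima() from the forecast package:
#simulate an AR(2) process and fit an AR(2) model to it
set.seed(1)
ar.sim <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 500)
Arima(ar.sim, order = c(2, 0, 0))  #estimated ar1, ar2 should be near 0.6 and 0.2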
Moving Average MA(q) Models
MA models specify $X_t$ using random noise lags:
$X_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \ldots + \theta_q \epsilon_{t-q}$,
where $\mu$ is a mean term and $\epsilon_t \sim N(0, \sigma^2)$ iid is a random error term.
Similar to an AR model, the key choice is q, the number of random shock lags.
Autoregressive Moving Average ARMA(p,q) Models
ARMA(p,q) models are a combination of an AR and an MA model:
$X_t = \mu + \phi_1 X_{t-1} + \ldots + \phi_p X_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \ldots + \theta_q \epsilon_{t-q}$,
where $\mu$ is a mean term and $\epsilon_t \sim N(0, \sigma^2)$ iid is a random error term.
When fitting an ARMA model, we need to choose two things: p, the number of AR lags, and q, the number of MA lags.
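A similar assumed sketch for an ARMA(1,1); an MA(q) model is simply the special case with no AR lags (p = 0):
#simulate an ARMA(1,1) process and fit an ARMA(1,1) model to it
set.seed(1)
arma.sim <- arima.sim(model = list(ar = 0.5, ma = 0.4), n = 500)
Arima(arma.sim, order = c(1, 0, 1))  #estimated ar1, ma1 should be near 0.5 and 0.4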
Autoregressive Integrated Moving Average ARIMA(p,d,q) Models
ARIMA(p,d,q) is an ARMA model with differencing.
When fitting an ARIMA model we need to choose three things: p, the number of AR lags, q, the number of MA lags, and d, the number of differences to use.
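An assumed sketch: an ARIMA(1,1,0) is an AR(1) fit to the first-differenced series, and arima.sim() can generate such an integrated process directly:
#simulate an ARIMA(1,1,0) process and fit an ARIMA(1,1,0) model to it
set.seed(1)
int.sim <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 500)
Arima(int.sim, order = c(1, 1, 0))  #estimated ar1 should be near 0.5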
Decomposition Models
Decomposition models specify $X_t$ as a combination of a trend component ($T_t$), a seasonal component ($S_t$), and an error component/residual ($E_t$), i.e. $X_t = f(T_t, S_t, E_t)$.
Common decomposition forms are: $X_t = T_t + S_t + E_t$ or $X_t = T_t \times S_t \times E_t$ (where we then take logs to recover the additive form).
There are various ways to estimate the different components: exponential smoothing, state space models/Kalman filtering, STL models, etc.
In this project we will cover STL models because of their ease of use and flexibility.
Fitting AR/MA/ARMA/ARIMA Models with the Box Jenkins Method
We will now go over how to fit AR/MA/ARMA/ARIMA models on a real data set and review a generic strategy for fitting them known as the Box Jenkins method.
This process involves several steps to help identify the p, d, and q parameters that we need:
- identify whether the time series is stationary or not
- identify p, d, and q of the time series by:
  - making the time series stationary through differencing/detrending to find d
  - looking at the ACF/PACF to find p and q
  - using model fit diagnostics like AIC or BIC to select the best model to find p, d, and q
- check the model fit using the Ljung-Box test
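The auto.arima() calls below automate the order search; as a generic, assumed sketch of the manual version, candidate (p, q) orders for a stationary series y (a placeholder name) can be compared by AIC:
#hypothetical helper: grid-search ARMA orders for a stationary series y by AIC
select.order <- function(y, max.p = 3, max.q = 3) {
  grid <- expand.grid(p = 0:max.p, q = 0:max.q)
  grid$AIC <- apply(grid, 1, function(r) {
    fit <- try(Arima(y, order = c(r[["p"]], 0, r[["q"]])), silent = TRUE)
    if (inherits(fit, "try-error")) NA else AIC(fit)
  })
  grid[order(grid$AIC), ]  #lowest AIC first
}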
#load data
ur<- read.csv("Mass Monthly Unemployment Rate.csv")
head(ur); dim(ur)
## DATE MAURN
## 1 1976-01-01 11.6
## 2 1976-02-01 11.3
## 3 1976-03-01 10.9
## 4 1976-04-01 9.9
## 5 1976-05-01 9.4
## 6 1976-06-01 9.8
## [1] 529 2
#check date
class(ur$DATE)
## [1] "factor"
#change date class to date type
ur$DATE<- as.Date(ur$DATE)
class(ur$DATE)
## [1] "Date"
Checking for Stationarity
#check time series plot
ggplot(ur,aes(x=DATE,y=MAURN)) +geom_line()
#check ACF plot
ggAcf(ur$MAURN,type='correlation')
#run ADF test
adf.test(ur$MAURN)
##
## Augmented Dickey-Fuller Test
##
## data: ur$MAURN
## Dickey-Fuller = -3.0954, Lag order = 8, p-value = 0.1146
## alternative hypothesis: stationary
#fit AR model
ar.model<- auto.arima(ur$MAURN, max.d=0, max.q=0,allowdrift=T)
ar.model
## Series: ur$MAURN
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.9787 5.7425
## s.e. 0.0101 0.8498
##
## sigma^2 estimated as 0.2: log likelihood=-325.44
## AIC=656.88 AICc=656.93 BIC=669.7
#fit MA model
ma.model<- auto.arima(ur$MAURN, max.d=0, max.p=0,allowdrift=T)
ma.model
## Series: ur$MAURN
## ARIMA(0,0,5) with non-zero mean
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 mean
## 1.3646 1.7103 1.4882 1.2714 0.4804 5.4588
## s.e. 0.0368 0.0492 0.0578 0.0393 0.0350 0.1507
##
## sigma^2 estimated as 0.229: log likelihood=-361.03
## AIC=736.05 AICc=736.27 BIC=765.95
#fit ARMA model
arma.model<- auto.arima(ur$MAURN, max.d=0,allowdrift=T)
arma.model
## Series: ur$MAURN
## ARIMA(3,0,2) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 mean
## -0.2267 0.5998 0.5573 1.3361 0.8876 5.7038
## s.e. 0.0885 0.0544 0.0569 0.0544 0.0221 0.7764
##
## sigma^2 estimated as 0.1693: log likelihood=-280.15
## AIC=574.3 AICc=574.51 BIC=604.19
#fit ARIMA model
arima.model<- auto.arima(ur$MAURN, allowdrift=T)
arima.model
## Series: ur$MAURN
## ARIMA(4,1,2)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2
## 1.0029 -0.1834 -0.3982 0.4872 -1.1149 0.2512
## s.e. 0.0708 0.0750 0.0560 0.0394 0.0793 0.0711
##
## sigma^2 estimated as 0.1509: log likelihood=-247.45
## AIC=508.9 AICc=509.12 BIC=538.78
Checking the Residuals of the Model Fit
#calculate residuals of each model
ar.residual<- resid(ar.model)
ma.residual<- resid(ma.model)
arma.residual<- resid(arma.model)
arima.residual<- resid(arima.model)
#plot the PACF of each model's residuals
ggAcf(ar.residual,type='partial')
ggAcf(ma.residual,type='partial')
ggAcf(arma.residual,type='partial')
ggAcf(arima.residual,type='partial')
#run the Ljung Box test on the residuals
Box.test(ar.residual,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: ar.residual
## X-squared = 0.88802, df = 1, p-value = 0.346
Box.test(ma.residual,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: ma.residual
## X-squared = 0.56386, df = 1, p-value = 0.4527
Box.test(arma.residual,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: arma.residual
## X-squared = 0.96749, df = 1, p-value = 0.3253
Box.test(arima.residual,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: arima.residual
## X-squared = 0.0032696, df = 1, p-value = 0.9544
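The forecast package's checkresiduals() bundles a residual time plot, ACF plot, histogram, and a Ljung-Box test into a single call, e.g.:
#one-call residual diagnostics for the ARIMA fit
checkresiduals(arima.model)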
Making a Forecast for Each Model
#make forecast for each model
ar.forecast<- forecast(ar.model,h=24,level=80)
ma.forecast<- forecast(ma.model,h=24,level=80)
arma.forecast<- forecast(arma.model,h=24,level=80)
arima.forecast<- forecast(arima.model,h=24,level=80)
#plot forecast for each model
g1<- autoplot(ar.forecast)
g2<- autoplot(ma.forecast)
g3<- autoplot(arma.forecast)
g4<- autoplot(arima.forecast)
grid.arrange(g1,g2,g3,g4,nrow=2,ncol=2)
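To compare the models' predictive accuracy (an assumed sketch, not part of the original project code), we can hold out the last 24 months, refit on the remainder, and score the forecasts with accuracy() from the forecast package:
#hold out the last 24 observations, refit, and evaluate the forecasts
train <- head(ur$MAURN, length(ur$MAURN) - 24)
test <- tail(ur$MAURN, 24)
holdout.fit <- auto.arima(train, allowdrift = TRUE)
accuracy(forecast(holdout.fit, h = 24), test)  #test-set RMSE, MAE, etc.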
Fitting Seasonal Trend Loess (STL) Decomposition Models
#transform to time series object; need to specify frequency
ur.ts<- ts(ur$MAURN,frequency=12)
#fit stl model
stl.model<- stl(ur.ts,s.window='periodic')
#plot model fit
autoplot(stl.model)
#make forecast
stl.forecast<- forecast(stl.model,h=24,level=80)
autoplot(stl.forecast)
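As a shortcut (an optional sketch), stlf() from the forecast package combines the STL decomposition and the forecast step in one call:
#STL decomposition plus forecast in a single call
stlf.forecast <- stlf(ur.ts, h = 24, level = 80)
autoplot(stlf.forecast)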
Where to go Next