library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(tseries)
library(timeSeries)
## Loading required package: timeDate
##
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
##
## time<-
library(forecast)
library(xts)
For this R Guide, we will be taking a look at time series regression models. First, we will look at the traditional method of modeling regression in R, using the lm() function. We will assess the model's fit and performance in addition to some hypothesis testing. Then, we will look at an autoregressive moving average model and assess its performance in the same manner. Lastly, we will discuss forecasting and some of the methods one can use to implement it in an applied context.
Time series regression models describe the relationship between a dependent variable \(\sf{ŷ_{t}}\) and time. Similarly to the multiple linear regression R guide, we can use any order of polynomial to represent the trend of the time series. An example of a third-order polynomial trend model is as follows:
\(\sf{ŷ_{t} = \beta_{0} + \beta_{1}t + \beta_{2}t^{2} + \beta_{3}t^{3}}\)
As stated before, this is a third-order polynomial trend model. It can be generalized to a pth-order polynomial trend model.
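As a minimal sketch (using a made-up series, not our data), such a trend model could be fit in R with lm() and poly(); the objects t and y below are hypothetical placeholders:
# Hypothetical example: fit a cubic trend to a simulated series
t <- 1:100
y <- 5 + 0.3*t - 0.004*t^2 + 0.00002*t^3 + rnorm(100)
cubicmod <- lm(y ~ poly(t, 3, raw = TRUE))  # raw = TRUE keeps the t, t^2, t^3 terms
summary(cubicmod)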
We can create this regression model in the same way that we did for multiple linear regression, except that time will be our only predictor. Let's bring in the csv file that was downloaded. For this R Guide, we will be looking at minimum daily temperatures (degrees Fahrenheit) in Australia.
austempdata <- read.csv(file="/Users/Cloutiers2/Downloads/austemp.csv")
attach(austempdata)
Considering the context of our data, it may be useful to create a dummy variable for each season in order to analyze the fluctuations in temperature over time. The system for creating this dummy variable is outlined in the syntax below. We start by using the floor function to round each time value down to the nearest whole year. Then we obtain the decimal part by taking the difference between the original value and that year; so if we have a temperature recorded at year 81.44, we are left with the value 0.44. By multiplying each of these decimals by four, rounding down, and using the factor function, we obtain four categorical levels, which we name according to the seasons.
justyear <- floor(austempdata$Year)
seadecimal <- austempdata$Year - justyear
seafactor <- factor(floor(seadecimal*4))
levels(seafactor) <- c("Spring","Summer","Fall","Winter")
summary(seafactor)
## Spring Summer Fall Winter
## 909 910 910 918
Now, we can create our model using justyear as the time trend and the seafactor dummy variable, in a very similar manner to the multiple linear regression R guide.
seasonmod <- lm(log(Temperature) ~ justyear + seafactor)
summary(seasonmod)
##
## Call:
## lm(formula = log(Temperature) ~ justyear + seafactor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41736 -0.06730 0.00312 0.06709 0.37678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8956388 0.0529535 73.567 < 2e-16 ***
## justyear 0.0020608 0.0006179 3.335 0.000861 ***
## seafactorSummer -0.1815267 0.0050231 -36.138 < 2e-16 ***
## seafactorFall -0.2440210 0.0050231 -48.580 < 2e-16 ***
## seafactorWinter -0.0863581 0.0050122 -17.230 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1071 on 3642 degrees of freedom
## Multiple R-squared: 0.4296, Adjusted R-squared: 0.4289
## F-statistic: 685.7 on 4 and 3642 DF, p-value: < 2.2e-16
To better represent the fit of our model, let's plot it against a line graph of our original data set. We need to specify the type of graph using type = "l". Additionally, we can set the color of the line used for the fitted model within the lines function. The complete syntax of the plotting process is shown below.
with(austempdata,plot(log(Temperature)~Year,type = "l"))
lines(austempdata$Year, seasonmod$fitted.values,col="yellow")
As you can see, the fit of our model seems to follow the seasonal trend of the minimum temperature in Australia for these years.
Similarly, we can execute the same procedure and find out how well our model does when we use a monthly dummy variable. Below is the syntax for that process:
justyear <- floor(austempdata$Year)
mondecimal <- austempdata$Year - justyear
monfactor <- factor(floor(mondecimal*12))
levels(monfactor) <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec")
summary(monfactor)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 329 290 290 330 290 290 330 290 290 330 290 298
monthmod <- lm(log(Temperature) ~ justyear + monfactor)
summary(monthmod)
##
## Call:
## lm(formula = log(Temperature) ~ justyear + monfactor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34833 -0.06227 0.00192 0.06530 0.36797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8952993 0.0477704 81.542 < 2e-16 ***
## justyear 0.0020773 0.0005552 3.742 0.000186 ***
## monfactorFeb 0.0188871 0.0077519 2.436 0.014880 *
## monfactorMar -0.0222405 0.0077519 -2.869 0.004141 **
## monfactorApr -0.0980774 0.0074981 -13.080 < 2e-16 ***
## monfactorMay -0.1903121 0.0077519 -24.550 < 2e-16 ***
## monfactorJun -0.2710575 0.0077519 -34.967 < 2e-16 ***
## monfactorJul -0.2844145 0.0074981 -37.932 < 2e-16 ***
## monfactorAug -0.2440464 0.0077519 -31.482 < 2e-16 ***
## monfactorSep -0.2013872 0.0077519 -25.979 < 2e-16 ***
## monfactorOct -0.1480668 0.0074981 -19.747 < 2e-16 ***
## monfactorNov -0.0755277 0.0077519 -9.743 < 2e-16 ***
## monfactorDec -0.0318574 0.0076964 -4.139 3.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09624 on 3634 degrees of freedom
## Multiple R-squared: 0.5405, Adjusted R-squared: 0.539
## F-statistic: 356.3 on 12 and 3634 DF, p-value: < 2.2e-16
with(austempdata,plot(log(Temperature)~Year,type = "l"))
lines(austempdata$Year, monthmod$fitted.values,col="yellow")
Once again, the fit of this model follows the trend fairly well, and it seems to capture the seasonality even better than our first model.
One way in which we can compare these two models is with an ANOVA test. Analysis of variance (ANOVA) is a statistical technique used to check whether the means of two or more groups are significantly different from each other. ANOVA checks the influence of factors by comparing the means of different samples.
In our case, we can use it to tell whether our models are significantly different from one another. Specifically, R will perform an F-test to see whether the reduction in the residual sum of squares is statistically significant. The syntax for an ANOVA test comparing the two models is found below.
anova(seasonmod, monthmod)
With such a low p-value, we can conclude that our monthly model is significantly different from the model with a season-based dummy variable. Additionally, since monthmod has a smaller residual sum of squares, we can be confident that it is the better of the two models we created.
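As a quick sanity check on that claim, we can compute each model's residual sum of squares directly:
sum(residuals(seasonmod)^2)  # RSS for the season-based model
sum(residuals(monthmod)^2)   # RSS for the month-based model (should be smaller)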
ARIMA stands for AutoRegressive Integrated Moving Average. It is another way to model time series and contains three components, which are as follows:
AR: Autoregression. The model looks at the relationship between an observation and some number of "lagged" observations.
I: Integrated. Observations are subtracted from the observations at the previous time step in order to make the time series stationary, meaning that its mean, variance, and autocorrelation structure do not change over time.
MA: Moving Average. The model also uses the dependency between an observation and the residuals from a moving average model applied to lagged observations.
All of the outlined components are essentially parameters within the ARIMA function in R.
The format is ARIMA(p,d,q) and we need to specify the parameters that we’d like to use for our model. Descriptions of these parameters are as follows:
p: The number of lag observations included in the model.
d: The number of times that the raw observations are differenced (so that the time series is stationary).
q: The size (order) of the moving average window.
Each of these parameters also has a seasonal counterpart, P, D, and Q, which matter for a seasonal ARIMA model. Our notation for the parameters will be:
ARIMA (p,d,q) x (P,D,Q) Period
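To preview how this notation maps onto R's arima function, a hypothetical ARIMA(1,0,1) x (1,1,1) model with period 12, fit to a placeholder series x, would look like this:
# Hypothetical sketch: x is a placeholder time series, not our data
# order supplies (p,d,q); seasonal supplies (P,D,Q) and the period
arima(x, order = c(1,0,1), seasonal = list(order = c(1,1,1), period = 12))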
We can set all of these parameters by decomposing a time series in R and going step by step through the process. It is useful to first put our data into something that R will recognize as a time series using the ts function. We need to specify the frequency of our time series (365, since our daily data completes one seasonal cycle per year) as well as the start date for the data.
temptimeseries <- ts(austempdata,freq=365,start=c(1981),end=c(1990,360))
TSdecomp <- decompose(temptimeseries)
summary(TSdecomp)
## Length Class Mode
## x 7290 mts numeric
## seasonal 7290 ts numeric
## trend 7290 mts numeric
## random 7290 mts numeric
## figure 365 -none- numeric
## type 1 -none- character
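Although the output is not shown here, plotting the decomposed object gives a quick visual of what we are about to remove:
plot(TSdecomp)  # panels for the observed, trend, seasonal, and random components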
After decomposing the time series, we will remove the seasonal component.
noseason <- temptimeseries - TSdecomp$seasonal
plot(noseason)
Now that we have plotted our data, we can move forward by differencing the series; here we take a seasonal difference with a lag of 365 observations (one year). Once we have these differences, we can use the augmented Dickey-Fuller test to determine whether or not our data is stationary at that point.
diff1 <- diff(austempdata$Temperature,lag = 365)
adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -13.381, Lag order = 14, p-value = 0.01
## alternative hypothesis: stationary
The test rejects the null hypothesis in favor of the alternative hypothesis of stationarity, so we can treat our differenced data as stationary. Thus, we can keep the d parameter of our ARIMA model at zero at this point.
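As a cross-check on our choice of d, the forecast package also offers the ndiffs function (and nsdiffs for the seasonal order D), which estimates how many differences a series needs in order to become stationary:
ndiffs(austempdata$Temperature)  # estimated number of non-seasonal differences needed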
Next, we will use the ACF and PACF to determine the remaining parameters. The acf() and pacf() functions in R estimate the autocorrelation function and partial autocorrelation function of a time series. These functions tell us the correlation between points separated by various time lags. That information, in turn, will help us set the parameters for the ARIMA model.
pacf(diff1)
acf(diff1)
Looking at the partial ACF, we can see spikes at low lags, which indicates that our p should be greater than or equal to 1.
On the ACF, we again see a couple of spikes at low lags, which indicates that our q should also be greater than or equal to 1.
Our non-seasonal component, then, is currently (1,0,1).
Turning to the seasonal component, the partial ACF spikes around lag 1.0 again suggest P >= 1, and the ACF indicates the same, so Q >= 1 as well. Combined with the seasonal difference we took earlier (D = 1), our seasonal component is currently (1,1,1).
Now let's build the model and make sure that it is adequate for our uses. We will use a Box-Ljung test to verify that the residuals show no remaining autocorrelation.
modtemp <- arima(austempdata$Temperature, order = c(1,0,1), seasonal = list(order = c(1,1,1), period = 12))
Box.test(modtemp$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: modtemp$residuals
## X-squared = 6.5518, df = 1, p-value = 0.01048
Since the p-value from the Box-Ljung test is fairly small (about 0.01), we reject the null hypothesis that the residuals are independently distributed; there is some evidence of leftover autocorrelation, which suggests the model can still be improved. From here, we could go step by step and try to get the best model fit that we possibly can, with as little residual autocorrelation as possible. However, R provides us with some useful tools to expedite this process. Namely, we have the auto.arima function, which optimizes the parameters that we just worked on finding.
automod <- auto.arima(austempdata$Temperature)
automod
## Series: austempdata$Temperature
## ARIMA(5,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 mean
## 0.6504 -0.0624 0.0937 0.0633 0.1312 52.1362
## s.e. 0.0164 0.0196 0.0196 0.0196 0.0164 0.5924
##
## sigma^2 estimated as 19.75: log likelihood=-10612.63
## AIC=21239.26 AICc=21239.29 BIC=21282.67
Now we will try to forecast into the future, which will hopefully follow our trend line fairly well. We can use the forecast function from the forecast package, and then plot our results. It can be difficult to see when using daily data like we have been, but overall our forecast seems to follow the trend fairly well.
plot(forecast(automod,h=12))
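If we want the numbers behind the plot, the object returned by forecast holds the point forecasts and prediction intervals, for example:
fc <- forecast(automod, h = 12)
fc$mean   # point forecasts for the next 12 observations
fc$lower  # lower bounds of the 80% and 95% prediction intervals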