Data Modeling

Using the aatemp in the faraway package in R, we will try to model a trend of air temperature over time.

library(faraway)
data(airpass)

First we look at the data and try to determine what type of model to build. In class we learned how to connect the data with lines to help visualize the trend over time.

plot(temp ~ year, data=aatemp, type="l")

It is difficult to decipher a trend. There is possibly a curve in the plot for the later years. Let’s start by building a linear model and increasing the degree until we think we have a decent model. Then we can compare the models by discussing the diagnostic plots.

mod<-lm(temp ~ year, data=aatemp)
summary(mod)

## 
## Call:
## lm(formula = temp ~ year, data = aatemp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9843 -0.9113 -0.0820  0.9946  3.5343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 24.005510   7.310781   3.284  0.00136 **
## year         0.012237   0.003768   3.247  0.00153 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.466 on 113 degrees of freedom
## Multiple R-squared:  0.08536,    Adjusted R-squared:  0.07727 
## F-statistic: 10.55 on 1 and 113 DF,  p-value: 0.001533

A T-test tells us that year is a significant predictor for temperature in our linear model. We can check our model assumptions using diagnostic plots.

plot(mod)

We see that we violate our homoscedasticity assumption because the spread of our residuals in not the same throughout. We also have some systematic over and under predictions happening. So we will stop our plot analysis here and look to improve by trying a quadratic term.

quadmod<-lm(temp ~ year + I(year^2), data=aatemp)
summary(quadmod)

## 
## Call:
## lm(formula = temp ~ year + I(year^2), data = aatemp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0412 -0.9538 -0.0624  0.9959  3.5820 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.127e+02  3.837e+02  -0.554    0.580
## year         2.567e-01  3.962e-01   0.648    0.518
## I(year^2)   -6.307e-05  1.022e-04  -0.617    0.539
## 
## Residual standard error: 1.47 on 112 degrees of freedom
## Multiple R-squared:  0.08846,    Adjusted R-squared:  0.07218 
## F-statistic: 5.434 on 2 and 112 DF,  p-value: 0.005591

plot(quadmod)

We can see that none of our predictors are significant and that we violate the homoscedasticity assumption even more. Let’s try a cubic model.

cubemod<-lm(temp ~ year + I(year^2)+ I (year^3), data=aatemp)
summary(cubemod)

## 
## Call:
## lm(formula = temp ~ year + I(year^2) + I(year^3), data = aatemp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8557 -0.9646 -0.1552  1.0485  4.1538 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  3.959e+04  1.734e+04   2.283   0.0243 *
## year        -6.159e+01  2.694e+01  -2.286   0.0241 *
## I(year^2)    3.197e-02  1.395e-02   2.291   0.0238 *
## I(year^3)   -5.527e-06  2.407e-06  -2.296   0.0236 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.443 on 111 degrees of freedom
## Multiple R-squared:  0.1298, Adjusted R-squared:  0.1063 
## F-statistic: 5.518 on 3 and 111 DF,  p-value: 0.001436

plot(cubemod)

At a significance level of alpha=.05, all of our predictors are significant. This model greatly improves the residual vs fitted plot. We no longer are violating homoscedasticity assumption and the red line is nearly horizontal at 0 indicating we don’t have systematic predictions. Looking back at the plot of our data, we can see where there might be two curves in our data so it makes sense a cubic model is our best model.

Autocorrelation

We will test for first-order autocorrelation. This means testing whether the error term in time period t is related to the error term in the previous time period, t-1.

Positive first-order autocorrelation indicates that a positive error term in period t-1 tends to be followed by a positive error term in period t. It also means that a negative error term in period t-1 tends to be followed by a negative error term in period t.

Negative first-order autocorrelation indiates that a positive error term in period t-1 tends to be followed by a negative error term in time period t. It also means a negative term in period t-1 tends to be followed by a positive term period t.

The Durbin-Watson test is a decisive check:

library(lmtest)

dwtest(cubemod)

## 
##  Durbin-Watson test
## 
## data:  cubemod
## DW = 1.7171, p-value = 0.03464
## alternative hypothesis: true autocorrelation is greater than 0

The dwtest command defaults to checking for positive fist-order autocorrelation. We can use alternative=“less” in our code to test for negative autocorrelation. With a significance level of alpha=.05, we conclude that error terms are positively autocorrelated.

Learning Log Day 16: Time Series Introduction

Jill Wanner

April 3, 2018

Overview

Data Modeling

Autocorrelation