Today, in class, we discussed the very early concepts of time series analysis and autocorrelation. Generally, a time series is any variable graphed against time, and autocorrelation is the value at one point in time being correlated with the value at another point, either earlier or later.
We use a time series analysis when we want to know how some variable changes with time, and we need to check for autocorrelation in any time series analysis we create.
To illustrate the commands used for time series analyses and testing for autocorrelation, I will go through an example using the data set aatemp from the R package faraway. This data set gives the annual mean temperature in Ann Arbor (in F) for the years 1854 to 2000.
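Before running any of the commands below, we need to load the two packages used in this analysis:
# faraway provides the aatemp data set; lmtest provides dwtest()
library(faraway)
library(lmtest)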
We can start by plotting the temperature against time to develop reasonable expectations for the relationship between the two variables. We want to connect the data points with lines to follow the trend over time, so we'll add the argument type = "l" to our plot() command.
plot(temp~year, data = aatemp, type = "l")
We see that the data has a slight upward trend, but it is hard to make out much more than that from the plot alone.
We can then start creating models. We'll begin with a very simple linear model.
mod1 <- lm(temp~year, data = aatemp)
We next need to look at the summary of the model to determine whether the relationship between the two variables is significant, and then we can look at the diagnostic plots to check whether there is a trend in our residuals.
summary(mod1)
##
## Call:
## lm(formula = temp ~ year, data = aatemp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9843 -0.9113 -0.0820  0.9946  3.5343
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.005510   7.310781   3.284  0.00136 **
## year         0.012237   0.003768   3.247  0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.466 on 113 degrees of freedom
## Multiple R-squared: 0.08536, Adjusted R-squared: 0.07727
## F-statistic: 10.55 on 1 and 113 DF, p-value: 0.001533
plot(mod1)
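Calling plot() on an lm object cycles through several diagnostic plots; if we only want the residuals-versus-fitted plot, we can request it by itself with the which argument:
# show only the first diagnostic plot (residuals vs. fitted values)
plot(mod1, which = 1)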
We can see that the relationship between the two variables is significant, but there is a trend in the residuals on our residual plot. We can try to improve this model by adding a quadratic term, and then check the summary and diagnostic plots for that model.
mod2 <- lm(temp~year + I(year^2), data = aatemp)
summary(mod2)
##
## Call:
## lm(formula = temp ~ year + I(year^2), data = aatemp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.0412 -0.9538 -0.0624  0.9959  3.5820
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.127e+02  3.837e+02  -0.554    0.580
## year         2.567e-01  3.962e-01   0.648    0.518
## I(year^2)   -6.307e-05  1.022e-04  -0.617    0.539
##
## Residual standard error: 1.47 on 112 degrees of freedom
## Multiple R-squared: 0.08846, Adjusted R-squared: 0.07218
## F-statistic: 5.434 on 2 and 112 DF, p-value: 0.005591
plot(mod2)
We see from our summary that none of our predictors is individually significant any longer, which is likely a symptom of the strong collinearity between year and year^2 when raw polynomial terms are used, and our residual plot is still not adequate. We will take this analysis one step further by adding a cubic term to our model.
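Before we do, we can confirm that collinearity directly (a quick check of my own, not from class):
# over 1854-2000, year and year^2 are almost perfectly correlated,
# which inflates the standard errors of the raw polynomial terms
with(aatemp, cor(year, year^2))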
mod3 <- lm(temp~year + I(year^2) + I(year^3), data = aatemp)
summary(mod3)
##
## Call:
## lm(formula = temp ~ year + I(year^2) + I(year^3), data = aatemp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.8557 -0.9646 -0.1552  1.0485  4.1538
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.959e+04  1.734e+04   2.283   0.0243 *
## year        -6.159e+01  2.694e+01  -2.286   0.0241 *
## I(year^2)    3.197e-02  1.395e-02   2.291   0.0238 *
## I(year^3)   -5.527e-06  2.407e-06  -2.296   0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.443 on 111 degrees of freedom
## Multiple R-squared: 0.1298, Adjusted R-squared: 0.1063
## F-statistic: 5.518 on 3 and 111 DF, p-value: 0.001436
plot(mod3)
Now, all three of our predictors are significant and our residual plot looks relatively good. It seems that we have found the trend that our data follows.
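As an aside, base R's poly() fits orthogonal polynomials that avoid the collinearity among the raw powers of year. Here is a quick sketch (the name mod3_poly is mine, not from class):
# same cubic trend fit with orthogonal polynomial terms;
# the fitted values and R-squared are identical to mod3
mod3_poly <- lm(temp ~ poly(year, 3), data = aatemp)
summary(mod3_poly)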
Now, we want to check whether our model has autocorrelation. We will perform a Durbin-Watson test, a formal test for first-order autocorrelation in the residuals (autocorrelation can sometimes be detected visually, but the test gives us a p-value to work with). To do this, we'll use dwtest(model), found in the R package lmtest. If we add the argument alternative = "greater", the test checks for positive autocorrelation; if alternative = "less", it checks for negative autocorrelation.
dwtest(mod3, alternative = "greater")
##
## Durbin-Watson test
##
## data: mod3
## DW = 1.7171, p-value = 0.03464
## alternative hypothesis: true autocorrelation is greater than 0
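For reference, the Durbin-Watson statistic is computed from the residuals e_1, ..., e_n as DW = sum_{t=2}^{n} (e_t - e_{t-1})^2 / sum_{t=1}^{n} e_t^2. Values near 2 indicate no first-order autocorrelation, while values below 2 point toward positive autocorrelation, so our DW = 1.7171 agrees with the small p-value.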
Since our p-value is less than 0.05, we conclude that our model has positive autocorrelation, meaning a positive residual is more likely to be followed by another positive residual and a negative residual by another negative one.
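Since autocorrelation can sometimes be spotted visually, a complementary check (a quick sketch of my own, not part of the class example) is to plot the autocorrelation function of the residuals with base R's acf():
# spikes outside the dashed confidence bands at low lags
# suggest autocorrelation in the residuals
acf(residuals(mod3))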
This topic begins our exploration into time series analysis, a large part of our course. Time series analysis is similar to regression in that it uses many of the same commands in R, but it differs in the type of data we are looking at (time series rather than cross-sectional) and in the questions we are trying to answer. This is just the beginning, though, and as we develop a deeper understanding, the relationship between the two types of modeling may become more evident.
Regardless of its relationship to other modeling techniques, there is no doubt that time series analysis is important in modeling the world's data.