This week we covered linear regression and also conducting some residual analysis.
The following code demonstrates how to create a linear model, and comparing two different models with ANOVA tests. In order to conduct an ANOVA test, you must be using nested models. From the example below, model1 is nested in model2. We could not do ANOVA between model1 and model3 since they aren’t nested.
model0 <- lm(Rating ~ 1, beer)
model1 <- lm(Rating ~ IBU, beer)
model2 <- lm(Rating ~ IBU + Brewery, beer)
model3 <- lm(Rating ~ Brewery, beer)
anova(model1, model0)
## Analysis of Variance Table
##
## Model 1: Rating ~ IBU
## Model 2: Rating ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 509.69
## 2 43 598.55 -1 -88.859 7.3223 0.0098 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model2, model3)
## Analysis of Variance Table
##
## Model 1: Rating ~ IBU + Brewery
## Model 2: Rating ~ Brewery
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 246.55
## 2 36 301.38 -1 -54.829 7.7835 0.008475 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If you don’t have nested models, there are other ways to compare two different models. This is shown below using AIC and BIC. You want to minimize these values. It also shows how to transform your variables if there isn’t a linear relationship. You can also find the log likelihoods, but since they aren’t penalized like AIC and BIC, you can’t directly compare them between models that have different numbers of variables.
mod1 <- lm(therms ~ month, data)
AIC(mod1)
## [1] 1252.867
BIC(mod1)
## [1] 1260.652
logLik(mod1)
## 'log Lik.' -623.4335 (df=3)
data$monthsquared <- (data$month)^2
mod2 <- lm(therms ~ month + monthsquared, data)
AIC(mod2)
## [1] 1121.437
BIC(mod2)
## [1] 1131.817
logLik(mod2)
## 'log Lik.' -556.7183 (df=4)
To find nonlinear relationships, you can use plots to find them. Below shows a couple ways to view plots. You’ll be able to see that the first plot has a parabolic relationship, and the second plot is like a log graph. Then you’d want to use transformed variables to cover that.
ggplot(data, aes(month, therms)) + geom_point()
## Warning: Removed 12 rows containing missing values (geom_point).
with(lifedata, plot(GNP, MaleLife))
Below shows a way to analyze your residuals. Look through these plots to find any weird relationships.
plot(mod2)
Also, here is a really cool way to add a line to a graph that I finally learned how to do! You use the coefficients from your summary to build the line.
xvar <- 1:12
coef(mod2)
## (Intercept) month monthsquared
## 549.8271 -147.6907 10.4416
ypreds <- 549.8271+ - 147.607*xvar + 10.4416*xvar*xvar
with(data, (plot(month, therms)))
## NULL
lines(xvar, ypreds)