Linear Regression Review

Here is an example of a linear model that predicts male life expectancy from GNP.Two assumptions of linear models are:

lifedata <- read.csv("http://www.cknudson.com/data/LifeExp.csv")

mod1<-lm(MaleLife~GNP,data=lifedata)

After creating a linear model, we may want to test whether or not a coefficient is signficant. One way to do this is with the Wald Test. For this, our test stat is the estimated coefficient divided by its standard error. This test stat follows a t-distribution with the number of degrees of freedom equal to: (# observations - # of parameters).

The summary function gives us the corresponding Wald Test p-value.

summary(mod1)
## 
## Call:
## lm(formula = MaleLife ~ GNP, data = lifedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.999  -5.388   1.222   5.840  12.553 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.694e+01  9.647e-01   59.03  < 2e-16 ***
## GNP         7.728e-04  9.758e-05    7.92 6.35e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.492 on 89 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.4134, Adjusted R-squared:  0.4068 
## F-statistic: 62.72 on 1 and 89 DF,  p-value: 6.345e-12

The fourth column of the Coefficients section gives us this p-value. In this case, the coefficient for GNP is very small at 6.35e-12.Thus, we would reject the null hypothesis that this coefficient is 0 in favor of the alternative hypothesis that it is non-zero. We have reason to believe that there is a linear relationship between GNP and male life expectancy.

Transforming Variables

plot(lifedata$GNP,lifedata$MaleLife)

This graph shows how the relationship between GNP and male life expectancy is more likely to be exponential rather than linear. We can adjust for this by transforming GNP with a natural log.

life<-transform(lifedata,lnGNP=log(GNP)) 
attach(life)
plot(life$lnGNP,life$MaleLife)

After transforming GNP, the relationship looks to be linear rather thatn exponential. Next, we can perform polynomial regression to create a second model.

mod2<-lm(MaleLife~GNP+lnGNP, data = life)
summary(mod2)
## 
## Call:
## lm(formula = MaleLife ~ GNP + lnGNP, data = life)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8862  -3.1499   0.3632   3.5823  14.3717 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.4063459  4.6241607   4.413 2.88e-05 ***
## GNP         -0.0001969  0.0001423  -1.384     0.17    
## lnGNP        5.6053605  0.7003013   8.004 4.54e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.732 on 88 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.6605, Adjusted R-squared:  0.6528 
## F-statistic: 85.62 on 2 and 88 DF,  p-value: < 2.2e-16

Note that this model performs better than the previous model in terms of adjusted R^2.

Comparing Models

A few ways to compare models include AIC, BIC, ANOVA, and LRT.

AIC(mod1)
## [1] 628.7462
AIC(mod2)
## [1] 580.9704
BIC(mod1)
## [1] 636.2788
BIC(mod2)
## [1] 591.0138

Here we can see that the second model outperforms the first in terms of AIC and BIC.

mod1<-lm(MaleLife~GNP,data=lifedata)  
mod2<-lm(MaleLife~GNP+lnGNP, data = life)

In this example,since mod2 contains all of the predictors in mod1, mod1 is nested in mod2.

anova(mod1,mod2)
## Analysis of Variance Table
## 
## Model 1: MaleLife ~ GNP
## Model 2: MaleLife ~ GNP + lnGNP
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     89 4995.8                                  
## 2     88 2891.0  1    2104.8 64.067 4.544e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here is how we interpret the results of this ANOVA:

The null hypothesis states that the simpler model is sufficient versus the alternative hypothesis that the more complex model should be used. When our test stat is significant, we reject the null hypothesis.

So since the test stat for the ANOVA above is 4.544e-12, we favor the more complex model, mod2.

Lastly, the LRT test involves calculating a test stat that is 2 times the difference in loglikelihoods between the bigger and smaller model.It corresponds to a Chi-square distribution with degrees of freedom equal to the difference in number of parameters.

logLik(mod2)
## 'log Lik.' -286.4852 (df=4)
logLik(mod1)
## 'log Lik.' -311.3731 (df=3)
teststat<-2*(-286.4852-(-311.3731))
teststat
## [1] 49.7758
pchisq(teststat,lower.tail = FALSE,df=1)
## [1] 1.723568e-12

Since we have a significant p-value, we conclude that the more complex model should be used.