The slides for Lab 6 session are now available to review here.

options(scipen=999, digits = 2)
library(car)

Load the Dataset

good <- read.csv("good.csv")
good_1 <- na.omit(good[,c( "readss97","loginc02","read02","AGE97", 
                             "bthwht", "HOME97", "faminc97","nfa01","nw01")])
attach(good_1)

Exploratory Data Analysis (Simplified)

hist(readss97)
hist(AGE97)
hist(bthwht)
hist(HOME97)
hist(faminc97)

6.1 Check for Omitted Variable Bias (Added Variable plots)

“Added-variable plots” provide graphic information about the marginal importance of a predictor variable (e.g.,HOME97) given the other variables in the model.

The added variable plot has some nice characteristics: 1. The slope of the line in the added variable plot is the coefficient of Added Variable (e.g., HOME97) in the full regression. 2. The residuals from the fitted line in the added variable plot are the same as the residuals from the full regression.

Recall what we have already done in Lab 5 here.

Initial regression with 3 variables with readss97 as the DV

lm <- lm(readss97~AGE97+faminc97+bthwht)
summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + faminc97 + bthwht)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.01  -9.11   0.06   9.14  63.47 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept) 94.33089777  1.07526293   87.73 < 0.0000000000000002 ***
## AGE97        0.73804325  0.13300514    5.55          0.000000034 ***
## faminc97     0.00008475  0.00000708   11.97 < 0.0000000000000002 ***
## bthwht      -4.23248060  0.80092553   -5.28          0.000000144 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1540 degrees of freedom
## Multiple R-squared:  0.123,  Adjusted R-squared:  0.121 
## F-statistic: 72.1 on 3 and 1540 DF,  p-value: <0.0000000000000002
# View(lm)

Dataset with the residuals from the previous regression

resid_read <- as.data.frame(lm$residuals)
# View(resid_read)

Regressions with HOME97 as the DV, store residuals into a new dataframe

lm2 <- lm(HOME97~AGE97+faminc97+bthwht, data = good_1)
summary(lm2)
## 
## Call:
## lm(formula = HOME97 ~ AGE97 + faminc97 + bthwht, data = good_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.014  -1.623   0.283   1.902   6.825 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept) 17.69018579  0.20236511   87.42 < 0.0000000000000002 ***
## AGE97        0.22014952  0.02503164    8.79 < 0.0000000000000002 ***
## faminc97     0.00002323  0.00000133   17.43 < 0.0000000000000002 ***
## bthwht      -0.71832248  0.15073465   -4.77            0.0000021 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.8 on 1540 degrees of freedom
## Multiple R-squared:  0.22,   Adjusted R-squared:  0.219 
## F-statistic:  145 on 3 and 1540 DF,  p-value: <0.0000000000000002
resid_home <- as.data.frame(lm2$residuals)

Added Variable Plot

Plotting residuals from initial model against residuals from model with home97 as DV, and then adding regression line and/or lowess curve.

plot(lm2$residuals, lm$residuals)
abline(lm(lm$residuals~lm2$residuals), col="red") 
lines(lowess(lm2$residuals,lm$residuals), col="blue")

# If there is a trend, regression/lowess is not horizontal, 
# then there is reason to suspect that home97 is the omitted variable. 

Diagnosis

Check to see if the added variable has a linear relationship with our DV, in term of the direction and the slope. If so, then it is reasonable to include that variable (i.e.,HOME97) in the model. If the the relationship between the added variable and DV is linear, this plot should look linear.

# Alternatively, use `avPlot` from library(car)
avPlot(model=lm(good_1$readss97 ~ good_1$AGE97 + good_1$faminc97 + good_1$bthwht + good_1$HOME97), variable=good_1$HOME97)

NOTE: As evaluating the residual plots and added varible plot, look for need to transform, non-constant variance, outliers, etc. in the next step.


Problem Diagnosis:

Based on the diagnostics it looks like there may be problems with the variables faminc97 and age97. The problems can be fixed largely by re-specifying our model. Primarily what we need to do is 1) log-transform our faminc97 variable and 2) center age97 (i.e. subtract the mean value of age97 from each observation so that its new mean is 0) and also include its squared term.


6.2 Logarithm transformation

Produce a new data set which makes faminc97 for any observations less than one equal to one, then take the log of this and create a new variable,

assumption <- good_1
hist(assumption$faminc97)

summary(assumption$faminc97)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   20686   43720   53901   71175  646743
assumption$faminc97 <- ifelse(assumption$faminc97 <= 1, 1, assumption$faminc97)
summary(assumption$faminc97)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1   20686   43720   53901   71175  646743
assumption$loginc <- log(assumption$faminc97)
hist(assumption$loginc)

# View(assumption)

6.3 Centering and Adding a Quadratic Term

Two Steps: create a new age variable centered around 0, then square it

  1. Centering the Age variable by deducting the mean from the value
summary(good$AGE97)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       3       5       7       7      10      13    1340

The mean value for age is about 7.5, but the age variable is a discrete variable (i.e. age is measured in whole year here). So, it is better to center it around the year that is closest to mean value, i.e., 8 years old.

assumption$ageC <- (assumption$AGE97 - 8) 
summary(assumption)
##     readss97      loginc02        read02        AGE97          bthwht    
##  Min.   : 48   Min.   : 0.0   Min.   : 16   Min.   : 3.0   Min.   :0.00  
##  1st Qu.: 93   1st Qu.:10.2   1st Qu.: 65   1st Qu.: 5.0   1st Qu.:0.00  
##  Median :102   Median :10.9   Median : 74   Median : 7.0   Median :0.00  
##  Mean   :103   Mean   :10.7   Mean   : 72   Mean   : 7.4   Mean   :0.35  
##  3rd Qu.:113   3rd Qu.:11.4   3rd Qu.: 82   3rd Qu.:10.0   3rd Qu.:1.00  
##  Max.   :166   Max.   :14.2   Max.   :100   Max.   :13.0   Max.   :1.00  
##      HOME97        faminc97          nfa01               nw01         
##  Min.   : 7.9   Min.   :     1   Min.   : -314901   Min.   : -187925  
##  1st Qu.:18.4   1st Qu.: 20686   1st Qu.:    1016   1st Qu.:    3047  
##  Median :20.7   Median : 43720   Median :   10666   Median :   38310  
##  Mean   :20.3   Mean   : 53901   Mean   :  122358   Mean   :  174502  
##  3rd Qu.:22.6   3rd Qu.: 71175   3rd Qu.:   57901   3rd Qu.:  133071  
##  Max.   :26.9   Max.   :646743   Max.   :42875320   Max.   :43687968  
##      loginc          ageC     
##  Min.   : 0.0   Min.   :-5.0  
##  1st Qu.: 9.9   1st Qu.:-3.0  
##  Median :10.7   Median :-1.0  
##  Mean   :10.5   Mean   :-0.6  
##  3rd Qu.:11.2   3rd Qu.: 2.0  
##  Max.   :13.4   Max.   : 5.0
  1. Squaring the centered variable ageC
assumption$ageC2 <- (assumption$ageC^2)
summary(assumption)
##     readss97      loginc02        read02        AGE97          bthwht    
##  Min.   : 48   Min.   : 0.0   Min.   : 16   Min.   : 3.0   Min.   :0.00  
##  1st Qu.: 93   1st Qu.:10.2   1st Qu.: 65   1st Qu.: 5.0   1st Qu.:0.00  
##  Median :102   Median :10.9   Median : 74   Median : 7.0   Median :0.00  
##  Mean   :103   Mean   :10.7   Mean   : 72   Mean   : 7.4   Mean   :0.35  
##  3rd Qu.:113   3rd Qu.:11.4   3rd Qu.: 82   3rd Qu.:10.0   3rd Qu.:1.00  
##  Max.   :166   Max.   :14.2   Max.   :100   Max.   :13.0   Max.   :1.00  
##      HOME97        faminc97          nfa01               nw01         
##  Min.   : 7.9   Min.   :     1   Min.   : -314901   Min.   : -187925  
##  1st Qu.:18.4   1st Qu.: 20686   1st Qu.:    1016   1st Qu.:    3047  
##  Median :20.7   Median : 43720   Median :   10666   Median :   38310  
##  Mean   :20.3   Mean   : 53901   Mean   :  122358   Mean   :  174502  
##  3rd Qu.:22.6   3rd Qu.: 71175   3rd Qu.:   57901   3rd Qu.:  133071  
##  Max.   :26.9   Max.   :646743   Max.   :42875320   Max.   :43687968  
##      loginc          ageC          ageC2     
##  Min.   : 0.0   Min.   :-5.0   Min.   : 0.0  
##  1st Qu.: 9.9   1st Qu.:-3.0   1st Qu.: 1.0  
##  Median :10.7   Median :-1.0   Median : 4.0  
##  Mean   :10.5   Mean   :-0.6   Mean   : 8.6  
##  3rd Qu.:11.2   3rd Qu.: 2.0   3rd Qu.:16.0  
##  Max.   :13.4   Max.   : 5.0   Max.   :25.0

6.4 Run a Polynomial Regression with a Log-transformed Variable

Run a model with these new transformations,

lm3 <- lm(readss97~loginc+ageC+ageC2+bthwht+HOME97,data = assumption)
summary(lm3)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + ageC2 + bthwht + HOME97, 
##     data = assumption)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56.20  -9.17  -0.04   9.30  56.35 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  55.4312     3.5060   15.81 < 0.0000000000000002 ***
## loginc        1.8015     0.3420    5.27           0.00000016 ***
## ageC          0.3476     0.1461    2.38              0.01750 *  
## ageC2        -0.0638     0.0516   -1.24              0.21626    
## bthwht       -2.9602     0.8120   -3.65              0.00028 ***
## HOME97        1.4959     0.1337   11.19 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14 on 1538 degrees of freedom
## Multiple R-squared:  0.179,  Adjusted R-squared:  0.177 
## F-statistic: 67.2 on 5 and 1538 DF,  p-value: <0.0000000000000002

Run a Scatterplot of residuals against independent variables, with regression line and lowess curve

plot(lm3$fitted.values, lm3$residuals)
abline(lm(lm3$residuals~lm3$fitted.values), col="red")  
lines(lowess(lm3$residuals~lm3$fitted.values), col="blue")

# Create a data set with **the fitted values and residuals** 
pred1 <- as.data.frame(lm3$fitted.values)
pred1$residuals <- (lm3$residuals)
# View(pred1)

Refer the slides for Lab 5 session for the illustrations of the plots here.

plot(lm3) 


Scatterplot of residuals against independent variables, with regression line and lowess curve

# plot of loginc and fitted residuals
plot(assumption$loginc, pred1$residuals)
abline(lm(pred1$residuals~assumption$loginc), col="red")  
lines(lowess(assumption$loginc, pred1$residuals), col="blue")

# plot of HOME97 and fitted residuals
plot(assumption$HOME97, pred1$residuals)
abline(lm(pred1$residuals~assumption$HOME97), col="red")  
lines(lowess(assumption$HOME97, pred1$residuals), col="blue")

There should be NO relationship between the random residuals and x. That is, the ‘residual by predicted’ plot and the ‘residual by x’ plot should show no systematic patterns that for different values of x (or ŷ).


Since ageC2 is not significant, drop it and re-run the model:

lm4 <- lm(readss97~loginc+ageC+bthwht+HOME97,data = assumption)
summary(lm4)
## 
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = assumption)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55.54  -9.18  -0.12   9.32  55.65 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   54.973      3.487   15.77 < 0.0000000000000002 ***
## loginc         1.788      0.342    5.23           0.00000019 ***
## ageC           0.425      0.132    3.22               0.0013 ** 
## bthwht        -3.234      0.781   -4.14           0.00003666 ***
## HOME97         1.506      0.133   11.28 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14 on 1539 degrees of freedom
## Multiple R-squared:  0.178,  Adjusted R-squared:  0.176 
## F-statistic: 83.5 on 4 and 1539 DF,  p-value: <0.0000000000000002

6.5 Multicollinearity (also see Lab 5)

library(car)
# To test for multicollinearity use vif() function (Variable Inflation Factor)
vif(lm4)
## loginc   ageC bthwht HOME97 
##    1.3    1.1    1.1    1.3
# An example with multicollinearity
lm5 <- lm(read02~AGE97+loginc02+nfa01+nw01+HOME97,data =assumption)
vif(lm5)
##    AGE97 loginc02    nfa01     nw01   HOME97 
##      1.1      1.2    101.5    102.3      1.3