options(scipen=999, digits = 2)
library(car)good <- read.csv("good.csv")
good_1 <- na.omit(good[,c( "readss97","loginc02","read02","AGE97",
"bthwht", "HOME97", "faminc97","nfa01","nw01")])
attach(good_1)hist(readss97)
hist(AGE97)
hist(bthwht)
hist(HOME97)
hist(faminc97)“Added-variable plots” provide graphic information about the marginal importance of a predictor variable (e.g.,HOME97) given the other variables in the model.
The added variable plot has some nice characteristics: 1. The slope of the line in the added variable plot is the coefficient of Added Variable (e.g., HOME97) in the full regression. 2. The residuals from the fitted line in the added variable plot are the same as the residuals from the full regression.
Initial regression with 3 variables with readss97 as the DV
lm <- lm(readss97~AGE97+faminc97+bthwht)
summary(lm)##
## Call:
## lm(formula = readss97 ~ AGE97 + faminc97 + bthwht)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.01 -9.11 0.06 9.14 63.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.33089777 1.07526293 87.73 < 0.0000000000000002 ***
## AGE97 0.73804325 0.13300514 5.55 0.000000034 ***
## faminc97 0.00008475 0.00000708 11.97 < 0.0000000000000002 ***
## bthwht -4.23248060 0.80092553 -5.28 0.000000144 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1540 degrees of freedom
## Multiple R-squared: 0.123, Adjusted R-squared: 0.121
## F-statistic: 72.1 on 3 and 1540 DF, p-value: <0.0000000000000002
# View(lm)Dataset with the residuals from the previous regression
resid_read <- as.data.frame(lm$residuals)
# View(resid_read)Regressions with HOME97 as the DV, store residuals into a new dataframe
lm2 <- lm(HOME97~AGE97+faminc97+bthwht, data = good_1)
summary(lm2)##
## Call:
## lm(formula = HOME97 ~ AGE97 + faminc97 + bthwht, data = good_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.014 -1.623 0.283 1.902 6.825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.69018579 0.20236511 87.42 < 0.0000000000000002 ***
## AGE97 0.22014952 0.02503164 8.79 < 0.0000000000000002 ***
## faminc97 0.00002323 0.00000133 17.43 < 0.0000000000000002 ***
## bthwht -0.71832248 0.15073465 -4.77 0.0000021 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.8 on 1540 degrees of freedom
## Multiple R-squared: 0.22, Adjusted R-squared: 0.219
## F-statistic: 145 on 3 and 1540 DF, p-value: <0.0000000000000002
resid_home <- as.data.frame(lm2$residuals)Plotting residuals from initial model against residuals from model with home97 as DV, and then adding regression line and/or lowess curve.
plot(lm2$residuals, lm$residuals)
abline(lm(lm$residuals~lm2$residuals), col="red")
lines(lowess(lm2$residuals,lm$residuals), col="blue")# If there is a trend, regression/lowess is not horizontal,
# then there is reason to suspect that home97 is the omitted variable. Check to see if the added variable has a linear relationship with our DV, in term of the direction and the slope. If so, then it is reasonable to include that variable (i.e.,HOME97) in the model. If the the relationship between the added variable and DV is linear, this plot should look linear.
# Alternatively, use `avPlot` from library(car)
avPlot(model=lm(good_1$readss97 ~ good_1$AGE97 + good_1$faminc97 + good_1$bthwht + good_1$HOME97), variable=good_1$HOME97)NOTE: As evaluating the residual plots and added varible plot, look for need to transform, non-constant variance, outliers, etc. in the next step.
Based on the diagnostics it looks like there may be problems with the variables faminc97 and age97. The problems can be fixed largely by re-specifying our model. Primarily what we need to do is 1) log-transform our faminc97 variable and 2) center age97 (i.e. subtract the mean value of age97 from each observation so that its new mean is 0) and also include its squared term.
Produce a new data set which makes faminc97 for any observations less than one equal to one, then take the log of this and create a new variable,
assumption <- good_1
hist(assumption$faminc97)summary(assumption$faminc97)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 20686 43720 53901 71175 646743
assumption$faminc97 <- ifelse(assumption$faminc97 <= 1, 1, assumption$faminc97)
summary(assumption$faminc97)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 20686 43720 53901 71175 646743
assumption$loginc <- log(assumption$faminc97)
hist(assumption$loginc)# View(assumption)Two Steps: create a new age variable centered around 0, then square it
summary(good$AGE97)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3 5 7 7 10 13 1340
The mean value for age is about 7.5, but the age variable is a discrete variable (i.e. age is measured in whole year here). So, it is better to center it around the year that is closest to mean value, i.e., 8 years old.
assumption$ageC <- (assumption$AGE97 - 8)
summary(assumption)## readss97 loginc02 read02 AGE97 bthwht
## Min. : 48 Min. : 0.0 Min. : 16 Min. : 3.0 Min. :0.00
## 1st Qu.: 93 1st Qu.:10.2 1st Qu.: 65 1st Qu.: 5.0 1st Qu.:0.00
## Median :102 Median :10.9 Median : 74 Median : 7.0 Median :0.00
## Mean :103 Mean :10.7 Mean : 72 Mean : 7.4 Mean :0.35
## 3rd Qu.:113 3rd Qu.:11.4 3rd Qu.: 82 3rd Qu.:10.0 3rd Qu.:1.00
## Max. :166 Max. :14.2 Max. :100 Max. :13.0 Max. :1.00
## HOME97 faminc97 nfa01 nw01
## Min. : 7.9 Min. : 1 Min. : -314901 Min. : -187925
## 1st Qu.:18.4 1st Qu.: 20686 1st Qu.: 1016 1st Qu.: 3047
## Median :20.7 Median : 43720 Median : 10666 Median : 38310
## Mean :20.3 Mean : 53901 Mean : 122358 Mean : 174502
## 3rd Qu.:22.6 3rd Qu.: 71175 3rd Qu.: 57901 3rd Qu.: 133071
## Max. :26.9 Max. :646743 Max. :42875320 Max. :43687968
## loginc ageC
## Min. : 0.0 Min. :-5.0
## 1st Qu.: 9.9 1st Qu.:-3.0
## Median :10.7 Median :-1.0
## Mean :10.5 Mean :-0.6
## 3rd Qu.:11.2 3rd Qu.: 2.0
## Max. :13.4 Max. : 5.0
ageCassumption$ageC2 <- (assumption$ageC^2)
summary(assumption)## readss97 loginc02 read02 AGE97 bthwht
## Min. : 48 Min. : 0.0 Min. : 16 Min. : 3.0 Min. :0.00
## 1st Qu.: 93 1st Qu.:10.2 1st Qu.: 65 1st Qu.: 5.0 1st Qu.:0.00
## Median :102 Median :10.9 Median : 74 Median : 7.0 Median :0.00
## Mean :103 Mean :10.7 Mean : 72 Mean : 7.4 Mean :0.35
## 3rd Qu.:113 3rd Qu.:11.4 3rd Qu.: 82 3rd Qu.:10.0 3rd Qu.:1.00
## Max. :166 Max. :14.2 Max. :100 Max. :13.0 Max. :1.00
## HOME97 faminc97 nfa01 nw01
## Min. : 7.9 Min. : 1 Min. : -314901 Min. : -187925
## 1st Qu.:18.4 1st Qu.: 20686 1st Qu.: 1016 1st Qu.: 3047
## Median :20.7 Median : 43720 Median : 10666 Median : 38310
## Mean :20.3 Mean : 53901 Mean : 122358 Mean : 174502
## 3rd Qu.:22.6 3rd Qu.: 71175 3rd Qu.: 57901 3rd Qu.: 133071
## Max. :26.9 Max. :646743 Max. :42875320 Max. :43687968
## loginc ageC ageC2
## Min. : 0.0 Min. :-5.0 Min. : 0.0
## 1st Qu.: 9.9 1st Qu.:-3.0 1st Qu.: 1.0
## Median :10.7 Median :-1.0 Median : 4.0
## Mean :10.5 Mean :-0.6 Mean : 8.6
## 3rd Qu.:11.2 3rd Qu.: 2.0 3rd Qu.:16.0
## Max. :13.4 Max. : 5.0 Max. :25.0
Run a model with these new transformations,
lm3 <- lm(readss97~loginc+ageC+ageC2+bthwht+HOME97,data = assumption)
summary(lm3)##
## Call:
## lm(formula = readss97 ~ loginc + ageC + ageC2 + bthwht + HOME97,
## data = assumption)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.20 -9.17 -0.04 9.30 56.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.4312 3.5060 15.81 < 0.0000000000000002 ***
## loginc 1.8015 0.3420 5.27 0.00000016 ***
## ageC 0.3476 0.1461 2.38 0.01750 *
## ageC2 -0.0638 0.0516 -1.24 0.21626
## bthwht -2.9602 0.8120 -3.65 0.00028 ***
## HOME97 1.4959 0.1337 11.19 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14 on 1538 degrees of freedom
## Multiple R-squared: 0.179, Adjusted R-squared: 0.177
## F-statistic: 67.2 on 5 and 1538 DF, p-value: <0.0000000000000002
Run a Scatterplot of residuals against independent variables, with regression line and lowess curve
plot(lm3$fitted.values, lm3$residuals)
abline(lm(lm3$residuals~lm3$fitted.values), col="red")
lines(lowess(lm3$residuals~lm3$fitted.values), col="blue")# Create a data set with **the fitted values and residuals**
pred1 <- as.data.frame(lm3$fitted.values)
pred1$residuals <- (lm3$residuals)
# View(pred1)Refer the slides for Lab 5 session for the illustrations of the plots here.
plot(lm3) Scatterplot of residuals against independent variables, with regression line and lowess curve
# plot of loginc and fitted residuals
plot(assumption$loginc, pred1$residuals)
abline(lm(pred1$residuals~assumption$loginc), col="red")
lines(lowess(assumption$loginc, pred1$residuals), col="blue")# plot of HOME97 and fitted residuals
plot(assumption$HOME97, pred1$residuals)
abline(lm(pred1$residuals~assumption$HOME97), col="red")
lines(lowess(assumption$HOME97, pred1$residuals), col="blue")There should be NO relationship between the random residuals and x. That is, the ‘residual by predicted’ plot and the ‘residual by x’ plot should show no systematic patterns that for different values of x (or ŷ).
Since ageC2 is not significant, drop it and re-run the model:
lm4 <- lm(readss97~loginc+ageC+bthwht+HOME97,data = assumption)
summary(lm4)##
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = assumption)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.54 -9.18 -0.12 9.32 55.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.973 3.487 15.77 < 0.0000000000000002 ***
## loginc 1.788 0.342 5.23 0.00000019 ***
## ageC 0.425 0.132 3.22 0.0013 **
## bthwht -3.234 0.781 -4.14 0.00003666 ***
## HOME97 1.506 0.133 11.28 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14 on 1539 degrees of freedom
## Multiple R-squared: 0.178, Adjusted R-squared: 0.176
## F-statistic: 83.5 on 4 and 1539 DF, p-value: <0.0000000000000002
library(car)
# To test for multicollinearity use vif() function (Variable Inflation Factor)
vif(lm4)## loginc ageC bthwht HOME97
## 1.3 1.1 1.1 1.3
# An example with multicollinearity
lm5 <- lm(read02~AGE97+loginc02+nfa01+nw01+HOME97,data =assumption)
vif(lm5)## AGE97 loginc02 nfa01 nw01 HOME97
## 1.1 1.2 101.5 102.3 1.3