Answer the following questions using the NLSY dataset and the NLSY Codebook.

For each question, provide your code and the answer.


Q1:

Using the added variable plot method, assess whether the highest grade completed by the child’s mother (medu) is an omitted relevant variable in the following regression model:
read = magebirth + breastfed
(include all relevant plots in your answer)

Answer - Q1:

# Load the data and keep only complete cases on the four variables used below
NLSY <- read.csv("/Users/YanfeiQin/Desktop/Fall 2021/897-002 Applied Linear Modeling/Lab 5/NLSY-3.csv", header = TRUE, sep = ",")
NLSY2 <- na.omit(NLSY[, c("read", "magebirth", "breastfed", "medu")])
# Regression 1: the outcome on the predictors already in the model
reg1 <- lm(read ~ magebirth + breastfed, data = NLSY2)
resid_read <- resid(reg1)
# Regression 2: the candidate omitted variable on the same predictors
reg2 <- lm(medu ~ magebirth + breastfed, data = NLSY2)
resid_medu <- resid(reg2)
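# Density plots of the two sets of residuals
plot(density(resid(reg1)), main = "Density of residuals, regression 1 (read)")

plot(density(resid(reg2)), main = "Density of residuals, regression 2 (medu)")

# Normal Q-Q plots of the two sets of residuals
qqnorm(resid(reg1), main = "Normal Q-Q plot, regression 1 (read)")
qqline(resid(reg1))

qqnorm(resid(reg2), main = "Normal Q-Q plot, regression 2 (medu)")
qqline(resid(reg2))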


Judging from the density plots and the normal quantile plots, the residuals from regression 1 (read on magebirth and breastfed) appear approximately normally distributed, while the residuals from regression 2 (medu on magebirth and breastfed) depart from normality.
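The visual judgment above could be supplemented with a formal test. The following is only a sketch on simulated data, not on the NLSY residuals: the vectors normal_resid and skewed_resid are hypothetical stand-ins, and shapiro.test() from base R accepts between 3 and 5000 observations, so it may not apply directly to a larger sample.

```r
# Simulated stand-ins for residuals (assumption: not the NLSY data)
set.seed(42)
normal_resid <- rnorm(500)          # roughly normal residuals
skewed_resid <- rexp(500) - 1       # right-skewed residuals, centered near 0

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_normal <- shapiro.test(normal_resid)
sw_skewed <- shapiro.test(skewed_resid)

print(sw_normal$p.value)
print(sw_skewed$p.value)
```

With the real data, the same call would be applied to resid(reg1) and resid(reg2), provided the sample size is within the range the test supports.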

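# Added variable plot: residuals of read against residuals of medu,
# both adjusted for magebirth and breastfed
plot(reg2$residuals, reg1$residuals,
     xlab = "Residuals of medu | magebirth, breastfed",
     ylab = "Residuals of read | magebirth, breastfed")
abline(lm(reg1$residuals ~ reg2$residuals), col = "red")      # linear fit
lines(lowess(reg1$residuals ~ reg2$residuals), col = "blue")  # lowess smooth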


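The added variable plot puts the residuals of read on the vertical axis and the residuals of medu on the horizontal axis, both after adjusting for magebirth and breastfed. Since the regression and lowess lines are not horizontal, read is still associated with medu once the other two predictors are accounted for, so there is reason to suspect that the highest grade completed by the child’s mother (medu) is an omitted relevant variable in the regression model read = magebirth + breastfed.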
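The reasoning behind the plot can be checked numerically: the slope of the line in an added variable plot equals the coefficient the added variable would receive in the full model. The sketch below illustrates this on simulated data only. The data frame sim and all of its coefficients are made up for illustration and are not the NLSY data; only the variable names are borrowed from the question.

```r
# Simulated data (assumption: invented values, not the NLSY dataset)
set.seed(1)
n <- 200
magebirth <- rnorm(n, mean = 25, sd = 5)
breastfed <- rbinom(n, size = 1, prob = 0.5)
medu <- 8 + 0.2 * magebirth + rnorm(n)
read <- 10 + 0.5 * magebirth + 2 * breastfed + 1.5 * medu + rnorm(n, sd = 3)
sim <- data.frame(read, magebirth, breastfed, medu)

# Residuals of the outcome and of the candidate variable,
# each adjusted for the predictors already in the model
e_read <- resid(lm(read ~ magebirth + breastfed, data = sim))
e_medu <- resid(lm(medu ~ magebirth + breastfed, data = sim))

# Slope of the added variable plot line
av_slope <- unname(coef(lm(e_read ~ e_medu))["e_medu"])

# Coefficient on medu in the full model
full_slope <- unname(coef(lm(read ~ magebirth + breastfed + medu, data = sim))["medu"])

# The two agree up to floating-point error
print(all.equal(av_slope, full_slope))
```

On the real data, the analogous comparison would use reg1, reg2, and a third model that adds medu to the original regression.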