For this homework, we will investigate the Nashville_housing.csv dataset located on D2L. The dataset contains information on houses sold in the Nashville area between 2013 and 2016. Be sure to download the file to your computer and import using “Import Dataset”, then “From Text File…”
Before we start, please run the following code to limit the size of our analysis. It keeps only the variables listed below, which we think could be influential, and then drops any rows with missing values.
housing2 <- housing[names(housing)%in%c("logSP","Finished.Area","Acreage","Grade","Bedrooms","Full.Bath","Half.Bath")]
housing2 = na.omit(housing2)
1) Using forward stepwise regression (with AIC), find the best subset of predictor variables to predict logSP (make sure you are using housing2 for this homework).
biggest1 = formula(lm(logSP ~ ., data = housing2))
step1 <- step(lm(logSP ~ 1, data = housing2), direction = "forward", scope = biggest1)
summary(step1)
Forward stepwise regression retains several significant variables. The fitted model is:
logSP = 1.173e+01 + 3.215e-02(Acreage) + 4.171e-04(Finished.Area) - 2.770e-01(Grade C) - 5.900e-01(Grade D) - 2.549e-02(Bedrooms) - 4.513e-02(Half.Bath)
2) Using backward stepwise regression (with AIC), find the best subset of predictor variables to predict logSP.
step2 <- step(lm(logSP ~ ., data = housing2), direction = "backward")
summary(step2)
Backward stepwise regression also retains several significant variables. The fitted model is:
logSP = 1.173e+01 + 3.215e-02(Acreage) + 4.171e-04(Finished.Area) - 2.770e-01(Grade C) - 5.900e-01(Grade D) - 2.549e-02(Bedrooms) - 4.513e-02(Half.Bath)
3) Are the models in part (a) and part (b) the same?
Yes. The forward and backward stepwise regression models are identical: they select the same predictors with the same coefficients.
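One way to confirm that two stepwise fits agree, rather than comparing printed summaries by eye, is to compare the selected terms and the coefficients directly. The sketch below is illustrative only: it uses the built-in mtcars data as a stand-in for housing2 (with mpg in place of logSP), and the helper same_model is a hypothetical name, not part of the assignment.

```r
# Illustrative stand-in data: mtcars replaces housing2, mpg replaces logSP.
dat <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]

full_fit <- lm(mpg ~ ., data = dat)
null_fit <- lm(mpg ~ 1, data = dat)

fwd <- step(null_fit, direction = "forward", scope = formula(full_fit), trace = 0)
bwd <- step(full_fit, direction = "backward", trace = 0)

# Hypothetical helper: TRUE when both fits use the same terms and
# their coefficients agree up to numerical tolerance.
same_model <- function(a, b) {
  terms_a <- sort(attr(terms(a), "term.labels"))
  terms_b <- sort(attr(terms(b), "term.labels"))
  if (!identical(terms_a, terms_b)) return(FALSE)
  coef_a <- coef(a)
  coef_b <- coef(b)[names(coef_a)]
  isTRUE(all.equal(coef_a, coef_b))
}

same_model(fwd, bwd)
```

With the homework objects, the equivalent call would be same_model(step1, step2).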
4) Using forward stepwise regression (with BIC), find the best subset of predictor variables to predict logSP.
n = nrow(housing2)
# The starting model here is already the full model, so the forward search has no terms left to add.
step3 = step(lm(logSP ~ ., data = housing2), direction = "forward", scope = biggest1, k = log(n))
summary(step3)
logSP = 1.173e+01 + 3.236e-02(Acreage) + 4.145e-04(Finished.Area) - 2.754e-01(Grade C) - 5.879e-01(Grade D) - 2.644e-02(Bedrooms) - 4.318e-02(Half.Bath)
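A forward search is normally started from the intercept-only model so that step() has terms to add; when it is started from the full model, it simply returns that model. The following is a minimal sketch of the intercept-only variant with the BIC penalty. It uses the built-in mtcars data as a stand-in for housing2, and the object names are assumptions for illustration.

```r
# Illustrative stand-in data: mtcars replaces housing2, mpg replaces logSP.
dat <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
n <- nrow(dat)

# Start from the intercept-only model; the upper scope is the full formula.
null_fit <- lm(mpg ~ 1, data = dat)
upper <- formula(lm(mpg ~ ., data = dat))

# k = log(n) turns the AIC penalty used by step() into the BIC penalty.
fwd_bic <- step(null_fit, direction = "forward", scope = upper,
                k = log(n), trace = 0)
formula(fwd_bic)
```

For the homework data, the analogous call would start from lm(logSP ~ 1, data = housing2) with scope = biggest1.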
5) Using backward stepwise regression (with BIC), find the best subset of predictor variables to predict logSP.
step4 = step(lm(logSP ~ ., data = housing2), direction = "backward", k = log(n))
summary(step4)
logSP = 1.173e+01 + 3.215e-02(Acreage) + 4.171e-04(Finished.Area) - 2.770e-01(Grade C) - 5.900e-01(Grade D) - 2.549e-02(Bedrooms) - 4.513e-02(Half.Bath)
6) Are the models in part (a) and part (d) the same? Are the models in part (b) and part (e) the same?
The BIC models are essentially the same as their AIC counterparts. The same variables are selected under both criteria, and the coefficients are identical except for the forward BIC model (step3), whose coefficients differ only slightly.
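The only difference between the AIC and BIC runs is the penalty multiplier k passed to step(): k = 2 gives AIC and k = log(n) gives BIC. For a linear model, extractAIC() reports n * log(RSS / n) + k * edf, so the two criteria differ by (log(n) - 2) * edf for the same fit. The sketch below checks this on the built-in mtcars data, used here only as a stand-in for housing2.

```r
# Illustrative stand-in fit; any lm object would do.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

aic <- extractAIC(fit, k = 2)       # c(edf, AIC)
bic <- extractAIC(fit, k = log(n))  # c(edf, BIC)

edf <- aic[1]                       # number of estimated coefficients
penalty_gap <- bic[2] - aic[2]

# The gap between the criteria is exactly the extra penalty per coefficient.
all.equal(penalty_gap, (log(n) - 2) * edf)
```

Because log(n) exceeds 2 once n is at least 8, BIC charges more for each extra coefficient, so it tends to select a model that is the same size as or smaller than the AIC choice.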
7) Calculate \(PRESS_{p}\) for the 4 best models above.
m1 <- resid(step1)
m2 <- resid(step2)
m3 <- resid(step3)
m4 <- resid(step4)
pr1 <- m1/(1 - lm.influence(step1)$hat)
pr2 <- m2/(1 - lm.influence(step2)$hat)
pr3 <- m3/(1 - lm.influence(step3)$hat)
pr4 <- m4/(1 - lm.influence(step4)$hat)
sum(pr1^2)
sum(pr2^2)
sum(pr3^2)
sum(pr4^2)
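The computation above relies on the leave-one-out identity for linear models: the prediction residual for observation i equals \(e_i/(1-h_{ii})\), where \(e_i\) is the ordinary residual and \(h_{ii}\) is the leverage, so \(PRESS_p = \sum_i (e_i/(1-h_{ii}))^2\) can be obtained without refitting. The sketch below wraps this in a hypothetical helper press() and checks it against explicit leave-one-out refits, using the built-in mtcars data as a stand-in for housing2.

```r
# Hypothetical helper: PRESS from ordinary residuals and leverages.
press <- function(fit) {
  h <- lm.influence(fit)$hat
  sum((resid(fit) / (1 - h))^2)
}

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Brute-force check: refit without each row and predict the held-out value.
loo_errors <- sapply(seq_len(nrow(mtcars)), function(i) {
  refit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(refit, newdata = mtcars[i, , drop = FALSE])
})

# Both routes should give the same PRESS value.
all.equal(press(fit), sum(loo_errors^2))
```

For the homework models, press(step1) through press(step4) would reproduce the four sums computed above.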
For the PRESS statistic we would pick the model with the lowest value, but since the four models are essentially identical, the choice makes little difference.
8) Calculate Mallows' \(C_p\) for the 4 best models above.
library(wle)
cp1 = mle.cp(step1)
summary(cp1)
cp2 = mle.cp(step2)
summary(cp2)
cp3 = mle.cp(step3)
summary(cp3)
cp4 = mle.cp(step4)
summary(cp4)
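As a cross-check that does not depend on the wle package, Mallows' \(C_p\) can also be computed by hand from its usual definition, \(C_p = SSE_p / MSE_{full} - (n - 2p)\), where p counts the coefficients in the candidate model (including the intercept) and \(MSE_{full}\) comes from the model containing all candidate predictors. The sketch below is a minimal version under those assumptions; mallows_cp is a hypothetical helper name and mtcars stands in for housing2.

```r
# Hypothetical helper: Mallows' Cp of `fit` relative to `full_fit`.
mallows_cp <- function(fit, full_fit) {
  n <- nobs(fit)
  p <- length(coef(fit))                       # coefficients incl. intercept
  sse_p <- sum(resid(fit)^2)                   # SSE of the candidate model
  mse_full <- sum(resid(full_fit)^2) / df.residual(full_fit)
  sse_p / mse_full - (n - 2 * p)
}

# Illustrative stand-in data: mtcars replaces housing2, mpg replaces logSP.
dat <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
full_fit <- lm(mpg ~ ., data = dat)
sub_fit <- lm(mpg ~ wt + hp, data = dat)

mallows_cp(sub_fit, full_fit)
mallows_cp(full_fit, full_fit)  # equals the number of coefficients, here 5
```

For the homework models, the analogous call would be mallows_cp(step1, lm(logSP ~ ., data = housing2)). A candidate model with \(C_p\) close to p shows little evidence of bias from the omitted predictors.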