library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(leaps)
library(readr)
AmesHousing<-read.csv("/Users/timogunsalus/Downloads/AmesHousing.csv", header = TRUE)
# Select only the useful parts
AH<-subset(AmesHousing,select = c(SalePrice,MS.Zoning,Lot.Area,Neighborhood,Gr.Liv.Area,
Overall.Cond,Overall.Qual,Land.Slope,Lot.Config,Condition.1,Bldg.Type,
Year.Built,Year.Remod.Add,Garage.Cars))
# I only want residental buildings
AH1 <- filter(AH,(MS.Zoning == "RH")| (MS.Zoning =="RM")|(MS.Zoning =="RL"))
AH2<-AH1
AH2$MS.Zoning <- as.factor(AH2$MS.Zoning)
AH2$Neighborhood <- as.factor(AH2$Neighborhood)
AH2$Land.Slope <- as.factor(AH2$Land.Slope)
AH2$Lot.Config <- as.factor(AH2$Lot.Config)
AH2$Condition.1 <- as.factor(AH2$Condition.1)
AH2$Bldg.Type <- as.factor(AH2$Bldg.Type)
AH2$Condition.1 <- factor(AH2$Condition.1, ordered = FALSE)
AH2$Condition.1 <- relevel(AH2$Condition.1, "Norm")
#create partition variable
AH2$part<-rep(0,2762)
set.seed(1)
test<-sample(1:2762,1381, replace = FALSE)
AH2$part[test]<-1
#training df
AHTrain<-AH2%>%
filter(part ==0)
#testing df
AHTest<-AH2%>%
filter(part ==1)
A) Identify your response variable, a categorical predictor, and a numeric predictor (that you suspect might be related to your response). Describe the units for these variables and for the categorical variable describe the levels.
The response variable is the sale price of the home and is in dollars.
One numeric predictor is the year the home was built in.
One catigorical predictor is various conditions that the house is in. The categories are:
Artery = Adjacent to arterial street
Feedr = Adjacent to feeder street
Norm = Normal
PosA = Adjacent to postive off-site feature
PosN = Near positive off-site feature-park, greenbelt, etc.
RRAe = Adjacent to East-West Railroad
RRAn = Adjacent to North-South Railroad
RRNe = Within 200’ of East-West Railroad
RRNn = Within 200’ of North-South Railroad
B) Fit a simple linear model with a response variable and the numeric predictor that you chose. Does the relationship appear to be significant? Make sure to also include a graphic.
It is significant but the data does not look linear. (Also I dont know why the goem_abline is not working)
LMB<-lm(SalePrice~Year.Built, AHTrain)
summary(LMB)
##
## Call:
## lm(formula = SalePrice ~ Year.Built, data = AHTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147718 -40316 -13098 23682 409608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.703e+06 1.149e+05 -23.52 <2e-16 ***
## Year.Built 1.463e+03 5.836e+01 25.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64730 on 1379 degrees of freedom
## Multiple R-squared: 0.3131, Adjusted R-squared: 0.3126
## F-statistic: 628.7 on 1 and 1379 DF, p-value: < 2.2e-16
ggplot(AHTrain, aes(Year.Built, SalePrice))+
geom_point(alpha = .5)+
geom_abline(slope=AHTrain$coefficients[2], intercept = AHTrain$coefficients[1])+
theme_bw()
C) Now, write the “dummy” variable coding for your categorical variable. (Hint: the contrasts() function might help).
contrasts(AH2$Condition.1)
## Artery Feedr PosA PosN RRAe RRAn RRNe RRNn
## Norm 0 0 0 0 0 0 0 0
## Artery 1 0 0 0 0 0 0 0
## Feedr 0 1 0 0 0 0 0 0
## PosA 0 0 1 0 0 0 0 0
## PosN 0 0 0 1 0 0 0 0
## RRAe 0 0 0 0 1 0 0 0
## RRAn 0 0 0 0 0 1 0 0
## RRNe 0 0 0 0 0 0 1 0
## RRNn 0 0 0 0 0 0 0 1
D) Fit a linear model with response variable and the categorical variable. Does it appear that there are differences among the means of levels of the categorical variable? (Hint: Look at the ANOVA F-test). Be sure to include an appropriate graphic (i.e. side-by-side boxplot)
It appears that proximity to major roads and positive features are significant but the proximity to trains are mostly not significant.
LMD<-lm(SalePrice~Condition.1, AHTrain)
summary(LMD)
##
## Call:
## lm(formula = SalePrice ~ Condition.1, data = AHTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -169304 -49593 -16110 27907 442907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 182093 2220 82.012 < 2e-16 ***
## Condition.1Artery -49419 11138 -4.437 9.85e-06 ***
## Condition.1Feedr -42983 9097 -4.725 2.54e-06 ***
## Condition.1PosA 91102 24262 3.755 0.000181 ***
## Condition.1PosN 54136 19851 2.727 0.006470 **
## Condition.1RRAe -41126 18144 -2.267 0.023566 *
## Condition.1RRAn -9006 14870 -0.606 0.544827
## Condition.1RRNe 12407 76432 0.162 0.871069
## Condition.1RRNn -41093 54068 -0.760 0.447376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76400 on 1372 degrees of freedom
## Multiple R-squared: 0.04796, Adjusted R-squared: 0.04241
## F-statistic: 8.64 on 8 and 1372 DF, p-value: 1.488e-11
anova(LMD)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Condition.1 8 4.0343e+11 5.0429e+10 8.6397 1.488e-11 ***
## Residuals 1372 8.0082e+12 5.8369e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(AHTrain, aes(Condition.1, SalePrice, fill = Condition.1))+
geom_boxplot()+
theme_bw()
E) Now fit a multiple linear model that combines parts (b) and (d), with both the numeric and categorical variables. What are the estimated models for the different levels? Include a graphic of the scatter plot with lines overlaid for each level.
LME<-lm(SalePrice~Year.Built+Condition.1, data = AHTrain)
summary(LME)
##
## Call:
## lm(formula = SalePrice ~ Year.Built + Condition.1, data = AHTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -148354 -39538 -12932 23917 408226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.700e+06 1.183e+05 -22.827 < 2e-16 ***
## Year.Built 1.462e+03 5.998e+01 24.370 < 2e-16 ***
## Condition.1Artery 5.399e+03 9.575e+03 0.564 0.572968
## Condition.1Feedr -1.490e+04 7.688e+03 -1.938 0.052778 .
## Condition.1PosA 8.580e+04 2.027e+04 4.232 2.47e-05 ***
## Condition.1PosN 4.528e+04 1.659e+04 2.729 0.006432 **
## Condition.1RRAe -5.420e+04 1.517e+04 -3.573 0.000365 ***
## Condition.1RRAn -2.333e+04 1.244e+04 -1.876 0.060934 .
## Condition.1RRNe -1.935e+04 6.388e+04 -0.303 0.762005
## Condition.1RRNn -8.530e+03 4.520e+04 -0.189 0.850338
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63840 on 1371 degrees of freedom
## Multiple R-squared: 0.3357, Adjusted R-squared: 0.3314
## F-statistic: 76.99 on 9 and 1371 DF, p-value: < 2.2e-16
ggplot(AH2, aes(Year.Built, SalePrice))+
geom_point(aes(color = Condition.1, alpha = .5))+
geom_abline(aes(slope=LME$coefficients[2], intercept = LME$coefficients[1], colour = "#95A900"))+ #norm
geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1] + LME$coefficients[3], color = "#F8766D")+ #Artey
geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[4], color = "#CF9400")+ # Feedr
geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[5], color = "#5BB300")+ #PosA
geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[6], color = "#00BF7D")+ #PosN
geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[7], color = "#00BFC4")+ #RRAe
geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[8], color = "#00A5FF")+ #RRAn
geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[9], color = "#DC71FA")+ #RRNe
geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[10], color = "#FF61C9")+ #RRNn
theme_bw()
F) Finally, fit a multiple linear model that includes also the interaction between the numeric and categorical variables, which allows for different slopes. What are the estimated models for the different levels? Include a graphic of the scatter plot with lines overlaid for each level.
LMF<-lm(SalePrice~Year.Built*Condition.1, data = AHTrain)
summary(LMF)
##
## Call:
## lm(formula = SalePrice ~ Year.Built * Condition.1, data = AHTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -151267 -38980 -10908 24427 406236
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.865e+06 1.224e+05 -23.409 < 2e-16 ***
## Year.Built 1.546e+03 6.208e+01 24.900 < 2e-16 ***
## Condition.1Artery 4.955e+06 8.508e+05 5.824 7.15e-09 ***
## Condition.1Feedr 1.412e+06 5.264e+05 2.682 0.007407 **
## Condition.1PosA -7.340e+06 2.209e+06 -3.324 0.000912 ***
## Condition.1PosN -4.239e+06 1.779e+06 -2.383 0.017300 *
## Condition.1RRAe 1.985e+06 1.666e+06 1.192 0.233623
## Condition.1RRAn 1.298e+06 8.270e+05 1.569 0.116774
## Condition.1RRNe -2.117e+04 6.264e+04 -0.338 0.735397
## Condition.1RRNn 2.132e+06 2.978e+06 0.716 0.474051
## Year.Built:Condition.1Artery -2.558e+03 4.398e+02 -5.817 7.47e-09 ***
## Year.Built:Condition.1Feedr -7.300e+02 2.695e+02 -2.709 0.006834 **
## Year.Built:Condition.1PosA 3.760e+03 1.118e+03 3.362 0.000794 ***
## Year.Built:Condition.1PosN 2.167e+03 8.996e+02 2.408 0.016154 *
## Year.Built:Condition.1RRAe -1.030e+03 8.414e+02 -1.225 0.220934
## Year.Built:Condition.1RRAn -6.674e+02 4.174e+02 -1.599 0.110128
## Year.Built:Condition.1RRNe NA NA NA NA
## Year.Built:Condition.1RRNn -1.097e+03 1.528e+03 -0.718 0.472614
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62600 on 1364 degrees of freedom
## Multiple R-squared: 0.3646, Adjusted R-squared: 0.3572
## F-statistic: 48.92 on 16 and 1364 DF, p-value: < 2.2e-16
ggplot(AH2, aes(Year.Built, SalePrice))+
geom_point(aes(color = Condition.1, alpha = .5))+
geom_abline(aes(slope=LMF$coefficients[2], intercept = LMF$coefficients[1], colour = "#95A900"))+ #norm
geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[11], intercept = LMF$coefficients[1] + LMF$coefficients[3], color = "#F8766D")+ #Artey
geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[12], intercept = LMF$coefficients[1] + LMF$coefficients[4], color = "#CF9400")+ #Feedr
geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[13], intercept = LMF$coefficients[1] + LMF$coefficients[5], color = "#5BB300")+ #PosA
geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[14], intercept = LMF$coefficients[1] + LMF$coefficients[6], color = "#00BF7D")+ #PosN
geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[15], intercept = LMF$coefficients[1] + LMF$coefficients[7], color = "#00BFC4")+ #RRAe
geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[16], intercept = LMF$coefficients[1] + LMF$coefficients[8], color = "#00A5FF")+ #RRAn
geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[17], intercept = LMF$coefficients[1] + LMF$coefficients[9], color = "#DC71FA")+ #RRNe
geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[18], intercept = LMF$coefficients[1] + LMF$coefficients[10], color = "#FF61C9")+ #RRNn
theme_bw()
## Warning: Removed 1 rows containing missing values (geom_abline).
G) Compare the models from parts (B), (D), (E), and (F).
Calculate the MSEs
The MSE are large!
#B
#LMB<-lm(SalePrice~Year.Built, AHTrain)
anova(LMB)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Year.Built 1 2.6340e+12 2.6340e+12 628.69 < 2.2e-16 ***
## Residuals 1379 5.7776e+12 4.1897e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BtestFit<-predict(LMB, AHTrain)
BtestRSS<-sum((AHTrain$SalePrice-BtestFit)^2)
BtestRSS
## [1] 5.777646e+12
BtestMSE<-mean((AHTrain$SalePrice-BtestFit)^2)
BtestMSE
## [1] 4183668603
#D
#LMD<-lm(SalePrice~Condition.1, AHTrain)
anova(LMD)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Condition.1 8 4.0343e+11 5.0429e+10 8.6397 1.488e-11 ***
## Residuals 1372 8.0082e+12 5.8369e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
DtestFit<-predict(LMD, AHTrain)
DtestRSS<-sum((AHTrain$SalePrice-DtestFit)^2)
DtestRSS
## [1] 8.008249e+12
DtestMSE<-mean((AHTrain$SalePrice-DtestFit)^2)
DtestMSE
## [1] 5798876970
#E
#LME<-lm(SalePrice~Year.Built+Condition.1, data = AHTrain)
anova(LME)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Year.Built 1 2.6340e+12 2.6340e+12 646.2828 < 2.2e-16 ***
## Condition.1 8 1.8990e+11 2.3737e+10 5.8242 2.463e-07 ***
## Residuals 1371 5.5877e+12 4.0757e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
EtestFit<-predict(LME, AHTrain)
EtestRSS<-sum((AHTrain$SalePrice-EtestFit)^2)
EtestRSS
## [1] 5.587746e+12
EtestMSE<-mean((AHTrain$SalePrice-EtestFit)^2)
EtestMSE
## [1] 4046159638
#F
#LMF<-lm(SalePrice~Year.Built*Condition.1, data = AHTrain)
anova(LMF)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Year.Built 1 2.6340e+12 2.6340e+12 672.2293 < 2.2e-16 ***
## Condition.1 8 1.8990e+11 2.3737e+10 6.0580 1.114e-07 ***
## Year.Built:Condition.1 7 2.4310e+11 3.4729e+10 8.8631 1.053e-10 ***
## Residuals 1364 5.3446e+12 3.9184e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FtestFit<-predict(LMF, AHTrain)
## Warning in predict.lm(LMF, AHTrain): prediction from a rank-deficient fit may be
## misleading
FtestRSS<-sum((AHTrain$SalePrice-FtestFit)^2)
FtestRSS
## [1] 5.344644e+12
FtestMSE<-mean((AHTrain$SalePrice-FtestFit)^2)
FtestMSE
## [1] 3870125685
H) Conclusion: What did you learn from this exercise? Were any of the relationships significant? (Note: This would be great to include in your final project write up!)
Firstly, I found a surprising discovery when looking at the price of homes compared to when they were built. I assumed that newer homes would be more expensive on a linear scale. I don’t really think of homes as devaluing or that new homes are exponentially better then older homes. I also overlooked that some very old homes would become more expensive with time. Thus the scatter plot looks like it almost has a small U shape. One important caveat is that this model does not look at when the home was remodeled or renovated. I do think that the “Condition” variable confirmed my suspicion. Some non-market value (or value that is not necessarily represented directly in the homes price) can be explained. Having some green-space or a park increased the price of a home by an estimated 54,136 USD. Having other positive features (I could not find out what but I assume things like access to public transposition) increases the homes value by an astounding 91,102 USD. I don’t know if I should trust the MLR model. Although there is lots of data most of that falls under the “normal” condition. I don’t know how much data there is for other conditions across different years.