Tidy the data up

library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(leaps)
library(readr)
AmesHousing<-read.csv("/Users/timogunsalus/Downloads/AmesHousing.csv", header = TRUE)

# Select only the useful parts
AH<-subset(AmesHousing,select = c(SalePrice,MS.Zoning,Lot.Area,Neighborhood,Gr.Liv.Area,
          Overall.Cond,Overall.Qual,Land.Slope,Lot.Config,Condition.1,Bldg.Type,
          Year.Built,Year.Remod.Add,Garage.Cars))

# I only want residental buildings
AH1 <- filter(AH,(MS.Zoning == "RH")| (MS.Zoning =="RM")|(MS.Zoning =="RL"))


AH2<-AH1
AH2$MS.Zoning <- as.factor(AH2$MS.Zoning)
AH2$Neighborhood <- as.factor(AH2$Neighborhood)
AH2$Land.Slope <- as.factor(AH2$Land.Slope)
AH2$Lot.Config <- as.factor(AH2$Lot.Config)
AH2$Condition.1 <- as.factor(AH2$Condition.1)
AH2$Bldg.Type <- as.factor(AH2$Bldg.Type)

AH2$Condition.1 <- factor(AH2$Condition.1, ordered = FALSE)
AH2$Condition.1 <- relevel(AH2$Condition.1, "Norm")

#create partition variable
AH2$part<-rep(0,2762)
set.seed(1)
test<-sample(1:2762,1381, replace = FALSE)
AH2$part[test]<-1

#training df
AHTrain<-AH2%>%
  filter(part ==0)

#testing df
AHTest<-AH2%>%
  filter(part ==1)

Regression: If your response variable is numeric…

A) Identify your response variable, a categorical predictor, and a numeric predictor (that you suspect might be related to your response). Describe the units for these variables and for the categorical variable describe the levels.

The response variable is the sale price of the home and is in dollars.

One numeric predictor is the year the home was built in.

One catigorical predictor is various conditions that the house is in. The categories are:

  • Artery = Adjacent to arterial street

  • Feedr = Adjacent to feeder street

  • Norm = Normal

  • PosA = Adjacent to postive off-site feature

  • PosN = Near positive off-site feature-park, greenbelt, etc.

  • RRAe = Adjacent to East-West Railroad

  • RRAn = Adjacent to North-South Railroad

  • RRNe = Within 200’ of East-West Railroad

  • RRNn = Within 200’ of North-South Railroad

B) Fit a simple linear model with a response variable and the numeric predictor that you chose. Does the relationship appear to be significant? Make sure to also include a graphic.

It is significant but the data does not look linear. (Also I dont know why the goem_abline is not working)

LMB<-lm(SalePrice~Year.Built, AHTrain)
summary(LMB)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built, data = AHTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -147718  -40316  -13098   23682  409608 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.703e+06  1.149e+05  -23.52   <2e-16 ***
## Year.Built   1.463e+03  5.836e+01   25.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64730 on 1379 degrees of freedom
## Multiple R-squared:  0.3131, Adjusted R-squared:  0.3126 
## F-statistic: 628.7 on 1 and 1379 DF,  p-value: < 2.2e-16
ggplot(AHTrain, aes(Year.Built, SalePrice))+
  geom_point(alpha = .5)+
  geom_abline(slope=AHTrain$coefficients[2], intercept = AHTrain$coefficients[1])+
  theme_bw()

C) Now, write the “dummy” variable coding for your categorical variable. (Hint: the contrasts() function might help).

contrasts(AH2$Condition.1)
##        Artery Feedr PosA PosN RRAe RRAn RRNe RRNn
## Norm        0     0    0    0    0    0    0    0
## Artery      1     0    0    0    0    0    0    0
## Feedr       0     1    0    0    0    0    0    0
## PosA        0     0    1    0    0    0    0    0
## PosN        0     0    0    1    0    0    0    0
## RRAe        0     0    0    0    1    0    0    0
## RRAn        0     0    0    0    0    1    0    0
## RRNe        0     0    0    0    0    0    1    0
## RRNn        0     0    0    0    0    0    0    1

D) Fit a linear model with response variable and the categorical variable. Does it appear that there are differences among the means of levels of the categorical variable? (Hint: Look at the ANOVA F-test). Be sure to include an appropriate graphic (i.e. side-by-side boxplot)

It appears that proximity to major roads and positive features are significant but the proximity to trains are mostly not significant.

LMD<-lm(SalePrice~Condition.1, AHTrain)
summary(LMD)
## 
## Call:
## lm(formula = SalePrice ~ Condition.1, data = AHTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -169304  -49593  -16110   27907  442907 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         182093       2220  82.012  < 2e-16 ***
## Condition.1Artery   -49419      11138  -4.437 9.85e-06 ***
## Condition.1Feedr    -42983       9097  -4.725 2.54e-06 ***
## Condition.1PosA      91102      24262   3.755 0.000181 ***
## Condition.1PosN      54136      19851   2.727 0.006470 ** 
## Condition.1RRAe     -41126      18144  -2.267 0.023566 *  
## Condition.1RRAn      -9006      14870  -0.606 0.544827    
## Condition.1RRNe      12407      76432   0.162 0.871069    
## Condition.1RRNn     -41093      54068  -0.760 0.447376    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76400 on 1372 degrees of freedom
## Multiple R-squared:  0.04796,    Adjusted R-squared:  0.04241 
## F-statistic:  8.64 on 8 and 1372 DF,  p-value: 1.488e-11
anova(LMD)
## Analysis of Variance Table
## 
## Response: SalePrice
##               Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Condition.1    8 4.0343e+11 5.0429e+10  8.6397 1.488e-11 ***
## Residuals   1372 8.0082e+12 5.8369e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(AHTrain, aes(Condition.1, SalePrice, fill = Condition.1))+
  geom_boxplot()+
  theme_bw()

E) Now fit a multiple linear model that combines parts (b) and (d), with both the numeric and categorical variables. What are the estimated models for the different levels? Include a graphic of the scatter plot with lines overlaid for each level.

LME<-lm(SalePrice~Year.Built+Condition.1, data = AHTrain)
summary(LME)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + Condition.1, data = AHTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -148354  -39538  -12932   23917  408226 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.700e+06  1.183e+05 -22.827  < 2e-16 ***
## Year.Built         1.462e+03  5.998e+01  24.370  < 2e-16 ***
## Condition.1Artery  5.399e+03  9.575e+03   0.564 0.572968    
## Condition.1Feedr  -1.490e+04  7.688e+03  -1.938 0.052778 .  
## Condition.1PosA    8.580e+04  2.027e+04   4.232 2.47e-05 ***
## Condition.1PosN    4.528e+04  1.659e+04   2.729 0.006432 ** 
## Condition.1RRAe   -5.420e+04  1.517e+04  -3.573 0.000365 ***
## Condition.1RRAn   -2.333e+04  1.244e+04  -1.876 0.060934 .  
## Condition.1RRNe   -1.935e+04  6.388e+04  -0.303 0.762005    
## Condition.1RRNn   -8.530e+03  4.520e+04  -0.189 0.850338    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63840 on 1371 degrees of freedom
## Multiple R-squared:  0.3357, Adjusted R-squared:  0.3314 
## F-statistic: 76.99 on 9 and 1371 DF,  p-value: < 2.2e-16
ggplot(AH2, aes(Year.Built, SalePrice))+
  geom_point(aes(color = Condition.1, alpha = .5))+
  geom_abline(aes(slope=LME$coefficients[2], intercept = LME$coefficients[1], colour = "#95A900"))+ #norm
  geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1] + LME$coefficients[3], color = "#F8766D")+ #Artey
  geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[4], color = "#CF9400")+ # Feedr
  geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[5], color = "#5BB300")+ #PosA
  geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[6], color = "#00BF7D")+ #PosN
  geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[7], color = "#00BFC4")+ #RRAe
  geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[8], color = "#00A5FF")+ #RRAn
  geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[9], color = "#DC71FA")+ #RRNe
  geom_abline(slope=LME$coefficients[2], intercept = LME$coefficients[1]+ LME$coefficients[10], color = "#FF61C9")+ #RRNn
  theme_bw()

F) Finally, fit a multiple linear model that includes also the interaction between the numeric and categorical variables, which allows for different slopes. What are the estimated models for the different levels? Include a graphic of the scatter plot with lines overlaid for each level.

LMF<-lm(SalePrice~Year.Built*Condition.1, data = AHTrain)
summary(LMF)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built * Condition.1, data = AHTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -151267  -38980  -10908   24427  406236 
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.865e+06  1.224e+05 -23.409  < 2e-16 ***
## Year.Built                    1.546e+03  6.208e+01  24.900  < 2e-16 ***
## Condition.1Artery             4.955e+06  8.508e+05   5.824 7.15e-09 ***
## Condition.1Feedr              1.412e+06  5.264e+05   2.682 0.007407 ** 
## Condition.1PosA              -7.340e+06  2.209e+06  -3.324 0.000912 ***
## Condition.1PosN              -4.239e+06  1.779e+06  -2.383 0.017300 *  
## Condition.1RRAe               1.985e+06  1.666e+06   1.192 0.233623    
## Condition.1RRAn               1.298e+06  8.270e+05   1.569 0.116774    
## Condition.1RRNe              -2.117e+04  6.264e+04  -0.338 0.735397    
## Condition.1RRNn               2.132e+06  2.978e+06   0.716 0.474051    
## Year.Built:Condition.1Artery -2.558e+03  4.398e+02  -5.817 7.47e-09 ***
## Year.Built:Condition.1Feedr  -7.300e+02  2.695e+02  -2.709 0.006834 ** 
## Year.Built:Condition.1PosA    3.760e+03  1.118e+03   3.362 0.000794 ***
## Year.Built:Condition.1PosN    2.167e+03  8.996e+02   2.408 0.016154 *  
## Year.Built:Condition.1RRAe   -1.030e+03  8.414e+02  -1.225 0.220934    
## Year.Built:Condition.1RRAn   -6.674e+02  4.174e+02  -1.599 0.110128    
## Year.Built:Condition.1RRNe           NA         NA      NA       NA    
## Year.Built:Condition.1RRNn   -1.097e+03  1.528e+03  -0.718 0.472614    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62600 on 1364 degrees of freedom
## Multiple R-squared:  0.3646, Adjusted R-squared:  0.3572 
## F-statistic: 48.92 on 16 and 1364 DF,  p-value: < 2.2e-16
ggplot(AH2, aes(Year.Built, SalePrice))+
  geom_point(aes(color = Condition.1, alpha = .5))+
  geom_abline(aes(slope=LMF$coefficients[2], intercept = LMF$coefficients[1], colour = "#95A900"))+ #norm
  geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[11], intercept = LMF$coefficients[1] + LMF$coefficients[3], color = "#F8766D")+ #Artey
  geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[12], intercept = LMF$coefficients[1] + LMF$coefficients[4], color = "#CF9400")+ #Feedr
  geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[13], intercept = LMF$coefficients[1] + LMF$coefficients[5], color = "#5BB300")+ #PosA
  geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[14], intercept = LMF$coefficients[1] + LMF$coefficients[6], color = "#00BF7D")+ #PosN
  geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[15], intercept = LMF$coefficients[1] + LMF$coefficients[7], color = "#00BFC4")+ #RRAe
  geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[16], intercept = LMF$coefficients[1] + LMF$coefficients[8], color = "#00A5FF")+ #RRAn
  geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[17], intercept = LMF$coefficients[1] + LMF$coefficients[9], color = "#DC71FA")+ #RRNe
  geom_abline(slope=LMF$coefficients[2] + LMF$coefficients[18], intercept = LMF$coefficients[1] + LMF$coefficients[10], color = "#FF61C9")+ #RRNn
  theme_bw()
## Warning: Removed 1 rows containing missing values (geom_abline).

G) Compare the models from parts (B), (D), (E), and (F).

Calculate the MSEs

The MSE are large!

#B
#LMB<-lm(SalePrice~Year.Built, AHTrain)
anova(LMB)
## Analysis of Variance Table
## 
## Response: SalePrice
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Year.Built    1 2.6340e+12 2.6340e+12  628.69 < 2.2e-16 ***
## Residuals  1379 5.7776e+12 4.1897e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BtestFit<-predict(LMB, AHTrain)
BtestRSS<-sum((AHTrain$SalePrice-BtestFit)^2)
BtestRSS
## [1] 5.777646e+12
BtestMSE<-mean((AHTrain$SalePrice-BtestFit)^2)
BtestMSE
## [1] 4183668603
#D
#LMD<-lm(SalePrice~Condition.1, AHTrain)
anova(LMD)
## Analysis of Variance Table
## 
## Response: SalePrice
##               Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Condition.1    8 4.0343e+11 5.0429e+10  8.6397 1.488e-11 ***
## Residuals   1372 8.0082e+12 5.8369e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
DtestFit<-predict(LMD, AHTrain)
DtestRSS<-sum((AHTrain$SalePrice-DtestFit)^2)
DtestRSS
## [1] 8.008249e+12
DtestMSE<-mean((AHTrain$SalePrice-DtestFit)^2)
DtestMSE
## [1] 5798876970
#E
#LME<-lm(SalePrice~Year.Built+Condition.1, data = AHTrain)
anova(LME)
## Analysis of Variance Table
## 
## Response: SalePrice
##               Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Year.Built     1 2.6340e+12 2.6340e+12 646.2828 < 2.2e-16 ***
## Condition.1    8 1.8990e+11 2.3737e+10   5.8242 2.463e-07 ***
## Residuals   1371 5.5877e+12 4.0757e+09                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
EtestFit<-predict(LME, AHTrain)
EtestRSS<-sum((AHTrain$SalePrice-EtestFit)^2)
EtestRSS
## [1] 5.587746e+12
EtestMSE<-mean((AHTrain$SalePrice-EtestFit)^2)
EtestMSE
## [1] 4046159638
#F
#LMF<-lm(SalePrice~Year.Built*Condition.1, data = AHTrain)
anova(LMF)
## Analysis of Variance Table
## 
## Response: SalePrice
##                          Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Year.Built                1 2.6340e+12 2.6340e+12 672.2293 < 2.2e-16 ***
## Condition.1               8 1.8990e+11 2.3737e+10   6.0580 1.114e-07 ***
## Year.Built:Condition.1    7 2.4310e+11 3.4729e+10   8.8631 1.053e-10 ***
## Residuals              1364 5.3446e+12 3.9184e+09                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FtestFit<-predict(LMF, AHTrain)
## Warning in predict.lm(LMF, AHTrain): prediction from a rank-deficient fit may be
## misleading
FtestRSS<-sum((AHTrain$SalePrice-FtestFit)^2)
FtestRSS
## [1] 5.344644e+12
FtestMSE<-mean((AHTrain$SalePrice-FtestFit)^2)
FtestMSE
## [1] 3870125685

H) Conclusion: What did you learn from this exercise? Were any of the relationships significant? (Note: This would be great to include in your final project write up!)

Firstly, I found a surprising discovery when looking at the price of homes compared to when they were built. I assumed that newer homes would be more expensive on a linear scale. I don’t really think of homes as devaluing or that new homes are exponentially better then older homes. I also overlooked that some very old homes would become more expensive with time. Thus the scatter plot looks like it almost has a small U shape. One important caveat is that this model does not look at when the home was remodeled or renovated. I do think that the “Condition” variable confirmed my suspicion. Some non-market value (or value that is not necessarily represented directly in the homes price) can be explained. Having some green-space or a park increased the price of a home by an estimated 54,136 USD. Having other positive features (I could not find out what but I assume things like access to public transposition) increases the homes value by an astounding 91,102 USD. I don’t know if I should trust the MLR model. Although there is lots of data most of that falls under the “normal” condition. I don’t know how much data there is for other conditions across different years.