library(GGally)
library(MASS)
library(modelr)
library(tidyverse)
library(stats)
set.seed(235711)

Introduction

This data set contains 79 variables that describe houses in Ames, IA. The goal is to build a model that uses this data to explain the Sale Price for each house.

Import Data

model.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_train.csv")
predict.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_test.csv")

Tidy

Much of the work of generating a model for this dataset amounts to identifying which data to exclude. I followed a few principles in choosing which variables to include or drop from the final model:

I start by dropping Id, Utilities, RoofMatl. Id merely signifies row numbers for observations, and is not a determinant of Sale Price. Almost every entry contains the same values for Utilities, Condition2, and RoofMatl.

model.dat <- model.dat[,c(2:9,11:14,16:22,24:81)]

The dataset contains a large number of NA entries. Reviewing the data dictionary explains that these entries actually do contain information. Often, they indicate that a particular feature of a house is missing, or that a measurement is zero. Based on the information in the data dictionary, I replace all NAs with more descriptive entries. The resulting dataset has no NA entries.

Here I also transform some very rare entries in Exterior1st and BldgType to the modes for those variables.

model.dat$Alley[is.na(model.dat$Alley)] <- "none"
model.dat$FireplaceQu[is.na(model.dat$FireplaceQu)] <- "none"
model.dat$PoolQC[is.na(model.dat$PoolQC)] <- "none"
model.dat$Fence[is.na(model.dat$Fence)] <- "none"
model.dat$MiscFeature[is.na(model.dat$MiscFeature)] <- "none"
model.dat$BsmtQual[is.na(model.dat$BsmtQual)] <- "none"
model.dat$BsmtCond[is.na(model.dat$BsmtCond)] <- "none"
model.dat$BsmtExposure[is.na(model.dat$BsmtExposure)] <- "none"
model.dat$BsmtFinType1[is.na(model.dat$BsmtFinType1)] <- "none"
model.dat$BsmtFinType2[is.na(model.dat$BsmtFinType2)] <- "none"
model.dat$GarageType[is.na(model.dat$GarageType)] <- "none"
model.dat$GarageFinish[is.na(model.dat$GarageFinish)] <- "none"
model.dat$GarageQual[is.na(model.dat$GarageQual)] <- "none"
model.dat$GarageCond[is.na(model.dat$GarageCond)] <- "none"
model.dat$MasVnrArea[is.na(model.dat$MasVnrArea)] <- 0
model.dat$MasVnrType[is.na(model.dat$MasVnrType)] <- "none"
model.dat$Electrical[is.na(model.dat$Electrical)] <- "SBrkr"
model.dat$GarageYrBlt[is.na(model.dat$GarageYrBlt)] <- 2006 #Most common value
model.dat$Exterior1st[model.dat$Exterior1st %in% c("AsphShn","BrkComm","CBlock","ImStucc","Stone")] <- "VinylSd" #Most common value
model.dat$BldgType[!(model.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model.dat$LotFrontage[is.na(model.dat$LotFrontage)] <- 0

Test/Train Split

Now that NAs have been resolved, it’s time to partition the data into a training set and a testing set. I include 75% of the data in the training set, and reserve 25% for testing.

training_rows <- sample(nrow(model.dat), round(0.75 * nrow(model.dat)), replace = FALSE)
model_train.dat <- model.dat[training_rows,]
model_test.dat <- model.dat[-training_rows,]

Default Linear Model

My plan for constructing a model is to use the stepAIC function from modelr. This function eliminates variables as long as the elimination results in a reduction in the AIC. The stepAIC model requires a complete linear model to begin its work.

model_train.lm <- lm(SalePrice ~ ., data = model_train.dat)

Because the output from stepAIC is so long, I set include = FALSE in the code chunk containing it. The output of the stepAIC function is summarized below.

summary(model_train.slm)
## 
## Call:
## lm(formula = SalePrice ~ MSZoning + LotArea + Street + LotShape + 
##     LandContour + LotConfig + Neighborhood + Condition1 + BldgType + 
##     OverallQual + OverallCond + YearBuilt + YearRemodAdd + Exterior1st + 
##     MasVnrArea + ExterQual + BsmtQual + BsmtExposure + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + 
##     BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + 
##     Functional + Fireplaces + FireplaceQu + GarageCars + GarageArea + 
##     GarageQual + GarageCond + WoodDeckSF + ScreenPorch + PoolArea + 
##     PoolQC + MiscFeature + MoSold + SaleType + SaleCondition, 
##     data = model_train.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -305887   -9843    -209    9736  186164 
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -3.886e+06  4.826e+05  -8.053 2.38e-15 ***
## MSZoningFV            4.466e+04  1.441e+04   3.100 0.001995 ** 
## MSZoningRH            3.531e+04  1.404e+04   2.514 0.012090 *  
## MSZoningRL            3.122e+04  1.213e+04   2.574 0.010203 *  
## MSZoningRM            2.808e+04  1.167e+04   2.407 0.016277 *  
## LotArea               3.028e-01  9.471e-02   3.197 0.001432 ** 
## StreetPave            3.422e+04  1.650e+04   2.074 0.038372 *  
## LotShapeIR2           6.378e+03  4.772e+03   1.336 0.181705    
## LotShapeIR3           2.798e+04  1.317e+04   2.125 0.033857 *  
## LotShapeReg           2.005e+03  1.848e+03   1.085 0.278335    
## LandContourHLS        1.405e+04  5.921e+03   2.374 0.017814 *  
## LandContourLow        4.166e+03  7.278e+03   0.572 0.567174    
## LandContourLvl        1.154e+04  4.341e+03   2.658 0.007997 ** 
## LotConfigCulDSac      4.256e+03  3.716e+03   1.145 0.252389    
## LotConfigFR2         -7.659e+03  4.558e+03  -1.680 0.093224 .  
## LotConfigFR3         -1.878e+04  1.530e+04  -1.227 0.220007    
## LotConfigInside      -1.620e+03  1.964e+03  -0.825 0.409533    
## NeighborhoodBlueste  -7.319e+03  2.521e+04  -0.290 0.771612    
## NeighborhoodBrDale   -2.456e+03  1.170e+04  -0.210 0.833776    
## NeighborhoodBrkSide  -1.333e+03  9.946e+03  -0.134 0.893387    
## NeighborhoodClearCr   1.421e+03  1.018e+04   0.140 0.889019    
## NeighborhoodCollgCr  -7.693e+02  7.774e+03  -0.099 0.921194    
## NeighborhoodCrawfor   1.179e+04  9.297e+03   1.268 0.205107    
## NeighborhoodEdwards  -2.110e+04  8.624e+03  -2.447 0.014599 *  
## NeighborhoodGilbert  -1.198e+03  8.196e+03  -0.146 0.883798    
## NeighborhoodIDOTRR   -6.155e+03  1.155e+04  -0.533 0.594259    
## NeighborhoodMeadowV   1.797e+03  1.198e+04   0.150 0.880827    
## NeighborhoodMitchel  -1.304e+04  8.816e+03  -1.479 0.139485    
## NeighborhoodNAmes    -1.264e+04  8.404e+03  -1.504 0.133031    
## NeighborhoodNoRidge   3.758e+04  8.982e+03   4.184 3.13e-05 ***
## NeighborhoodNPkVill   1.230e+04  1.130e+04   1.088 0.276764    
## NeighborhoodNridgHt   2.269e+04  7.957e+03   2.852 0.004440 ** 
## NeighborhoodNWAmes   -5.353e+03  8.594e+03  -0.623 0.533514    
## NeighborhoodOldTown  -1.274e+04  1.025e+04  -1.243 0.214314    
## NeighborhoodSawyer   -3.260e+03  8.824e+03  -0.369 0.711901    
## NeighborhoodSawyerW   1.696e+03  8.324e+03   0.204 0.838639    
## NeighborhoodSomerst   2.785e+03  1.015e+04   0.274 0.783848    
## NeighborhoodStoneBr   3.700e+04  9.027e+03   4.098 4.51e-05 ***
## NeighborhoodSWISU    -2.990e+03  1.076e+04  -0.278 0.781048    
## NeighborhoodTimber   -2.982e+03  8.782e+03  -0.340 0.734237    
## NeighborhoodVeenker   1.042e+04  1.113e+04   0.936 0.349270    
## Condition1Feedr       3.504e+03  5.765e+03   0.608 0.543441    
## Condition1Norm        1.412e+04  4.862e+03   2.905 0.003762 ** 
## Condition1PosA        2.777e+03  1.081e+04   0.257 0.797367    
## Condition1PosN       -1.698e+04  8.073e+03  -2.103 0.035724 *  
## Condition1RRAe       -9.474e+03  1.071e+04  -0.885 0.376478    
## Condition1RRAn        8.903e+03  8.030e+03   1.109 0.267856    
## Condition1RRNe        4.419e+03  1.811e+04   0.244 0.807232    
## Condition1RRNn        1.519e+04  1.515e+04   1.002 0.316520    
## BldgTypeNot1Fam      -2.010e+04  3.386e+03  -5.935 4.09e-09 ***
## OverallQual           6.245e+03  1.137e+03   5.495 5.00e-08 ***
## OverallCond           4.851e+03  9.397e+02   5.162 2.98e-07 ***
## YearBuilt             2.841e+02  7.454e+01   3.811 0.000147 ***
## YearRemodAdd          1.259e+02  6.340e+01   1.986 0.047297 *  
## Exterior1stBrkFace    1.921e+04  8.554e+03   2.246 0.024951 *  
## Exterior1stCemntBd   -1.062e+04  8.941e+03  -1.188 0.235222    
## Exterior1stHdBoard   -5.544e+03  7.836e+03  -0.708 0.479402    
## Exterior1stMetalSd    1.801e+03  7.693e+03   0.234 0.814996    
## Exterior1stPlywood   -7.043e+03  8.224e+03  -0.856 0.391991    
## Exterior1stStucco     5.363e+03  9.254e+03   0.579 0.562388    
## Exterior1stVinylSd    1.125e+03  7.697e+03   0.146 0.883820    
## Exterior1stWd Sdng    7.144e+01  7.608e+03   0.009 0.992509    
## Exterior1stWdShing    8.461e+02  9.164e+03   0.092 0.926456    
## MasVnrArea            1.046e+01  5.394e+00   1.940 0.052704 .  
## ExterQualFa          -1.059e+04  1.196e+04  -0.886 0.376095    
## ExterQualGd          -2.845e+04  5.735e+03  -4.960 8.34e-07 ***
## ExterQualTA          -2.864e+04  6.259e+03  -4.577 5.34e-06 ***
## BsmtQualFa           -1.480e+04  7.635e+03  -1.938 0.052891 .  
## BsmtQualGd           -1.790e+04  3.900e+03  -4.590 5.01e-06 ***
## BsmtQualnone         -5.698e+03  8.859e+03  -0.643 0.520276    
## BsmtQualTA           -1.660e+04  4.747e+03  -3.497 0.000492 ***
## BsmtExposureGd        1.966e+04  3.522e+03   5.581 3.10e-08 ***
## BsmtExposureMn        4.611e+03  3.431e+03   1.344 0.179245    
## BsmtExposureNo       -1.251e+03  2.409e+03  -0.519 0.603584    
## BsmtExposurenone             NA         NA      NA       NA    
## BsmtFinSF1            3.019e+01  5.253e+00   5.746 1.23e-08 ***
## BsmtFinSF2            2.277e+01  6.784e+00   3.357 0.000820 ***
## BsmtUnfSF             1.776e+01  4.953e+00   3.587 0.000352 ***
## `1stFlrSF`            4.812e+01  5.587e+00   8.612  < 2e-16 ***
## `2ndFlrSF`            4.612e+01  3.763e+00  12.255  < 2e-16 ***
## BsmtFullBath          4.550e+03  2.099e+03   2.168 0.030401 *  
## BedroomAbvGr         -3.569e+03  1.502e+03  -2.377 0.017637 *  
## KitchenAbvGr         -7.760e+03  5.430e+03  -1.429 0.153346    
## KitchenQualFa        -2.090e+04  7.173e+03  -2.913 0.003660 ** 
## KitchenQualGd        -2.264e+04  4.221e+03  -5.363 1.03e-07 ***
## KitchenQualTA        -2.402e+04  4.682e+03  -5.130 3.50e-07 ***
## TotRmsAbvGrd          2.655e+03  1.071e+03   2.479 0.013331 *  
## FunctionalMaj2        1.130e+04  1.548e+04   0.730 0.465672    
## FunctionalMin1        1.182e+04  9.550e+03   1.237 0.216281    
## FunctionalMin2        1.670e+04  9.590e+03   1.741 0.081944 .  
## FunctionalMod         1.017e+04  1.064e+04   0.956 0.339199    
## FunctionalSev        -3.060e+04  2.670e+04  -1.146 0.252063    
## FunctionalTyp         3.041e+04  7.976e+03   3.813 0.000146 ***
## Fireplaces            1.192e+04  3.268e+03   3.649 0.000277 ***
## FireplaceQuFa        -1.819e+04  8.154e+03  -2.231 0.025938 *  
## FireplaceQuGd        -1.217e+04  6.325e+03  -1.924 0.054712 .  
## FireplaceQunone      -2.742e+03  7.597e+03  -0.361 0.718270    
## FireplaceQuPo        -7.722e+03  9.820e+03  -0.786 0.431838    
## FireplaceQuTA        -1.486e+04  6.525e+03  -2.278 0.022936 *  
## GarageCars            4.790e+03  2.618e+03   1.830 0.067622 .  
## GarageArea            1.611e+01  8.755e+00   1.840 0.066096 .  
## GarageQualFa         -1.561e+05  2.681e+04  -5.823 7.89e-09 ***
## GarageQualGd         -1.441e+05  2.803e+04  -5.141 3.30e-07 ***
## GarageQualnone        7.501e+01  1.808e+04   0.004 0.996690    
## GarageQualPo         -1.465e+05  3.553e+04  -4.123 4.06e-05 ***
## GarageQualTA         -1.518e+05  2.653e+04  -5.719 1.43e-08 ***
## GarageCondFa          1.368e+05  3.245e+04   4.217 2.71e-05 ***
## GarageCondGd          1.274e+05  3.369e+04   3.781 0.000166 ***
## GarageCondnone               NA         NA      NA       NA    
## GarageCondPo          1.388e+05  3.501e+04   3.963 7.95e-05 ***
## GarageCondTA          1.430e+05  3.199e+04   4.471 8.73e-06 ***
## WoodDeckSF            1.551e+01  6.633e+00   2.339 0.019557 *  
## ScreenPorch           4.681e+01  1.511e+01   3.098 0.002005 ** 
## PoolArea              5.900e+03  8.542e+02   6.907 9.01e-12 ***
## PoolQCFa             -7.895e+05  1.024e+05  -7.708 3.17e-14 ***
## PoolQCGd             -3.523e+05  4.759e+04  -7.404 2.89e-13 ***
## PoolQCnone            3.024e+06  4.566e+05   6.623 5.85e-11 ***
## MiscFeaturenone       2.775e+02  2.422e+04   0.011 0.990861    
## MiscFeatureOthr       2.040e+04  3.419e+04   0.597 0.550842    
## MiscFeatureShed       9.029e+02  2.448e+04   0.037 0.970580    
## MiscFeatureTenC       7.202e+05  1.188e+05   6.062 1.93e-09 ***
## MoSold               -5.857e+02  2.815e+02  -2.080 0.037758 *  
## SaleTypeCon           1.649e+04  1.835e+04   0.898 0.369226    
## SaleTypeConLD         2.445e+04  1.288e+04   1.898 0.057940 .  
## SaleTypeConLI         7.363e+03  1.323e+04   0.557 0.577942    
## SaleTypeConLw         1.075e+04  1.368e+04   0.785 0.432525    
## SaleTypeCWD           1.873e+04  1.333e+04   1.405 0.160279    
## SaleTypeNew           1.313e+05  2.628e+04   4.996 6.93e-07 ***
## SaleTypeOth           1.361e+04  1.504e+04   0.905 0.365760    
## SaleTypeWD            1.144e+03  4.925e+03   0.232 0.816393    
## SaleConditionAdjLand  2.552e+04  1.839e+04   1.388 0.165460    
## SaleConditionAlloca   5.754e+03  9.877e+03   0.583 0.560347    
## SaleConditionFamily  -2.546e+02  7.118e+03  -0.036 0.971479    
## SaleConditionNormal   1.062e+04  3.362e+03   3.160 0.001629 ** 
## SaleConditionPartial -1.114e+05  2.582e+04  -4.315 1.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23560 on 962 degrees of freedom
## Multiple R-squared:  0.9184, Adjusted R-squared:  0.9072 
## F-statistic: 81.98 on 132 and 962 DF,  p-value: < 2.2e-16

The stepAIC model results in a value for \(R^2\) so high that I suspect overfitting.

rsquare(model_train.slm, model_test.dat)
## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading
## [1] 0.6050672

With this particular test/train split, overfitting does not look like a major concern. \(R^2\) for the test data on the trained model is still quite high at 78.5%. However, as I ran this model multiple times with different test/train splits, overfitting was, in general, an issue. I obtained values for \(R^2\) as low as 0.35. If I were to revisit my work on this model, I would be interested to explore cross-validation as a way to summarize changes in \(R^2\) depending on the particular data in the test/train splits.

To resolve the overfit, I need to remove variables from the model. The normal procedure of backward elimination would be relevant here, except that all variables in the model include at least one level with significance less than 0.05. Typically, this would mean the variable should remain in the model. My strategy for removing variables will rely on the principles described in the Tidy section. I’ll reserve only one variable for each attribute of the house (for example, multiple variables regarding the garage will be removed until only one about the garage remains). I’ll rely on some of my informal understanding of real estate sales. And I’ll try to reduce the number of factors of highly imbalanced categorical variables when possible.

Variable transformations

model_train.dat$BldgType[!(model_train.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model_train.dat$HouseStyle[model_train.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
model_train.dat$BsmtFullBath[model_train.dat$BsmtFullBath > 1] <- 1
model_train.dat$HalfBath[model_train.dat$HalfBath > 1] <- 1
model_train.dat$BedroomAbvGr[model_train.dat$BedroomAbvGr > 4] <- 4
model_train.dat$Fireplaces[model_train.dat$Fireplaces > 1] <- 1
model_train.dat$SaleType[model_train.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"

model_test.dat$BldgType[!(model_test.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model_test.dat$HouseStyle[model_test.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
model_test.dat$BsmtFullBath[model_test.dat$BsmtFullBath > 1] <- 1
model_test.dat$HalfBath[model_test.dat$HalfBath > 1] <- 1
model_test.dat$BedroomAbvGr[model_test.dat$BedroomAbvGr > 4] <- 4
model_test.dat$Fireplaces[model_test.dat$Fireplaces > 1] <- 1
model_test.dat$SaleType[model_test.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"

Updated model

model_train_210515.lm <- lm(SalePrice ~ LotFrontage + LotArea + LotConfig + Neighborhood + BldgType + HouseStyle + OverallQual + OverallCond + YearBuilt + YearRemodAdd + Exterior1st + BsmtFinSF1 + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenQual + Fireplaces + GarageArea + SaleCondition, data = model_train.dat)

summary(model_train_210515.lm)
## 
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + LotConfig + 
##     Neighborhood + BldgType + HouseStyle + OverallQual + OverallCond + 
##     YearBuilt + YearRemodAdd + Exterior1st + BsmtFinSF1 + `1stFlrSF` + 
##     `2ndFlrSF` + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + 
##     KitchenQual + Fireplaces + GarageArea + SaleCondition, data = model_train.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -348254  -11469     436   10432  226107 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.989e+05  1.913e+05  -4.698 2.99e-06 ***
## LotFrontage           5.406e+01  2.851e+01   1.896 0.058176 .  
## LotArea               5.074e-01  9.242e-02   5.490 5.05e-08 ***
## LotConfigCulDSac      4.582e+03  4.238e+03   1.081 0.279838    
## LotConfigFR2         -6.652e+03  5.360e+03  -1.241 0.214889    
## LotConfigFR3         -3.235e+04  1.676e+04  -1.930 0.053833 .  
## LotConfigInside      -1.582e+03  2.243e+03  -0.705 0.480872    
## NeighborhoodBlueste  -1.761e+04  2.954e+04  -0.596 0.551170    
## NeighborhoodBrDale   -8.387e+03  1.253e+04  -0.669 0.503489    
## NeighborhoodBrkSide  -6.085e+03  1.058e+04  -0.575 0.565494    
## NeighborhoodClearCr  -8.661e+03  1.121e+04  -0.773 0.439872    
## NeighborhoodCollgCr  -8.365e+03  8.745e+03  -0.957 0.338999    
## NeighborhoodCrawfor   6.891e+03  1.023e+04   0.674 0.500546    
## NeighborhoodEdwards  -2.568e+04  9.626e+03  -2.667 0.007763 ** 
## NeighborhoodGilbert  -8.515e+03  9.272e+03  -0.918 0.358653    
## NeighborhoodIDOTRR   -1.518e+04  1.122e+04  -1.354 0.176150    
## NeighborhoodMeadowV  -6.180e+03  1.293e+04  -0.478 0.632871    
## NeighborhoodMitchel  -1.705e+04  9.879e+03  -1.726 0.084569 .  
## NeighborhoodNAmes    -2.129e+04  9.295e+03  -2.290 0.022230 *  
## NeighborhoodNoRidge   3.216e+04  1.008e+04   3.192 0.001456 ** 
## NeighborhoodNPkVill   2.277e+03  1.319e+04   0.173 0.863009    
## NeighborhoodNridgHt   2.680e+04  9.023e+03   2.971 0.003042 ** 
## NeighborhoodNWAmes   -2.174e+04  9.579e+03  -2.269 0.023450 *  
## NeighborhoodOldTown  -2.134e+04  1.019e+04  -2.094 0.036500 *  
## NeighborhoodSawyer   -1.697e+04  9.810e+03  -1.729 0.084043 .  
## NeighborhoodSawyerW  -1.187e+04  9.461e+03  -1.255 0.209881    
## NeighborhoodSomerst   5.608e+03  8.972e+03   0.625 0.532060    
## NeighborhoodStoneBr   2.471e+04  1.031e+04   2.397 0.016713 *  
## NeighborhoodSWISU    -8.300e+03  1.191e+04  -0.697 0.486029    
## NeighborhoodTimber   -5.334e+03  9.897e+03  -0.539 0.590002    
## NeighborhoodVeenker   6.259e+03  1.260e+04   0.497 0.619503    
## BldgTypeNot1Fam      -1.876e+04  3.098e+03  -6.054 1.97e-09 ***
## HouseStyle1.5Unf      1.241e+04  1.052e+04   1.180 0.238413    
## HouseStyle1Story      1.583e+04  4.718e+03   3.354 0.000825 ***
## HouseStyle2Story     -5.398e+03  4.139e+03  -1.304 0.192459    
## HouseStyleSFoyer      1.546e+04  7.674e+03   2.014 0.044220 *  
## HouseStyleSLvl        9.800e+03  5.888e+03   1.664 0.096332 .  
## OverallQual           1.211e+04  1.208e+03  10.026  < 2e-16 ***
## OverallCond           5.132e+03  1.012e+03   5.073 4.65e-07 ***
## YearBuilt             3.511e+02  8.139e+01   4.314 1.76e-05 ***
## YearRemodAdd          1.008e+02  7.061e+01   1.427 0.153872    
## Exterior1stBrkFace    7.104e+03  9.476e+03   0.750 0.453626    
## Exterior1stCemntBd   -1.411e+04  1.006e+04  -1.403 0.160973    
## Exterior1stHdBoard   -9.650e+03  8.757e+03  -1.102 0.270751    
## Exterior1stMetalSd   -4.125e+03  8.498e+03  -0.485 0.627479    
## Exterior1stPlywood   -1.071e+04  9.125e+03  -1.174 0.240783    
## Exterior1stStucco     6.443e+02  1.039e+04   0.062 0.950557    
## Exterior1stVinylSd   -6.342e+03  8.638e+03  -0.734 0.462986    
## Exterior1stWd Sdng   -3.984e+03  8.472e+03  -0.470 0.638313    
## Exterior1stWdShing   -2.504e+03  1.043e+04  -0.240 0.810217    
## BsmtFinSF1            2.324e+01  2.934e+00   7.923 6.02e-15 ***
## `1stFlrSF`            6.334e+01  4.498e+00  14.081  < 2e-16 ***
## `2ndFlrSF`            6.759e+01  6.062e+00  11.151  < 2e-16 ***
## BsmtFullBath          2.422e+03  2.391e+03   1.013 0.311380    
## FullBath              6.365e+02  2.867e+03   0.222 0.824346    
## HalfBath              4.890e+03  2.860e+03   1.710 0.087599 .  
## BedroomAbvGr         -2.555e+03  1.663e+03  -1.536 0.124735    
## KitchenQualFa        -3.452e+04  7.848e+03  -4.398 1.21e-05 ***
## KitchenQualGd        -3.992e+04  4.265e+03  -9.360  < 2e-16 ***
## KitchenQualTA        -3.836e+04  4.911e+03  -7.811 1.39e-14 ***
## Fireplaces            1.111e+03  2.224e+03   0.500 0.617498    
## GarageArea            2.288e+01  5.878e+00   3.893 0.000105 ***
## SaleConditionAdjLand  2.201e+04  2.130e+04   1.033 0.301649    
## SaleConditionAlloca  -1.239e+04  1.055e+04  -1.175 0.240325    
## SaleConditionFamily   1.198e+03  8.002e+03   0.150 0.881049    
## SaleConditionNormal   9.106e+03  3.490e+03   2.609 0.009212 ** 
## SaleConditionPartial  1.927e+04  4.985e+03   3.865 0.000118 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28130 on 1028 degrees of freedom
## Multiple R-squared:  0.8756, Adjusted R-squared:  0.8676 
## F-statistic: 109.6 on 66 and 1028 DF,  p-value: < 2.2e-16
rsquare(model_train_210515.lm, model_test.dat)
## [1] 0.7778742

The simplified model performs much better on the test data. For this particular test/train split, the \(R^2\) on the test data is actually higher than on the training data!

Diagnostics

ggplot(model_train.slm, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line()

The QQ plot indicates that there are some important deviations from the linear model at the tails of the Sale Price distribution. One avenue to pursue if I were to continue this investigation might be to have separate models for predicting low, medium, and high-price houses. The first step of the model would be to predict the range of the house price, and then a separate multiple regression model for each group would be used to predict the final price.

ggplot(data = model_train.slm, aes(x = .fitted, y = .resid)) +
geom_point()

Again, we see problems at the higher values of SalePrice. The model systematically underestimates the prices of the most expensive houses in Ames.

Generate predictions for submission

predict.dat <- predict.dat %>%
  dplyr::select(LotFrontage, LotArea, LotConfig, Neighborhood, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, Exterior1st, BsmtFinSF1, `1stFlrSF`, `2ndFlrSF`, BsmtFullBath, FullBath, HalfBath, BedroomAbvGr, KitchenQual, Fireplaces, GarageArea, SaleType, SaleCondition)

predict.dat$LotFrontage[is.na(predict.dat$LotFrontage)] <- 0

predict.dat$Exterior1st[predict.dat$Exterior1st %in% c("AsphShn","BrkComm","CBlock","ImStucc","Stone")] <- "VinylSd"

predict.dat$Exterior1st[is.na(predict.dat$Exterior1st)] <- "VinylSd"

predict.dat$BsmtFinSF1[is.na(predict.dat$BsmtFinSF1)] <- 0
predict.dat$BsmtFullBath[is.na(predict.dat$BsmtFullBath)] <- 0
predict.dat$KitchenQual[is.na(predict.dat$KitchenQual)] <- "TA"
predict.dat$GarageArea[is.na(predict.dat$GarageArea)] <- 0
predict.dat$SaleType[is.na(predict.dat$SaleType)] <- "WD"
sum(is.na(predict.dat))
## [1] 0
predict.dat$BldgType[!(predict.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
predict.dat$HouseStyle[predict.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
predict.dat$BsmtFullBath[predict.dat$BsmtFullBath > 1] <- 1
predict.dat$HalfBath[predict.dat$HalfBath > 1] <- 1
predict.dat$BedroomAbvGr[predict.dat$BedroomAbvGr > 4] <- 4
predict.dat$Fireplaces[predict.dat$Fireplaces > 1] <- 1
predict.dat$SaleType[predict.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"

preds <- predict.lm(model_train_210515.lm, predict.dat, type = "response")
submit <- data.frame("Id" = seq(from = 1461, to = 2919), "SalePrice" = preds)
write_csv(submit, "C:/Users/dmosc/OneDrive/Desktop/y.csv")

Kaggle results

This model earned a score of 0.15828. My user name is Daniel Moscoe.