library(GGally)
library(MASS)
library(modelr)
library(tidyverse)
library(stats)
set.seed(235711)
This data set contains 79 variables that describe houses in Ames, IA. The goal is to build a model that uses this data to explain the Sale Price for each house.
model.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_train.csv")
predict.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_test.csv")
Much of the work of generating a model for this dataset amounts to identifying which data to exclude. I followed a few principles in choosing which variables to include or drop from the final model:
If a group of variables all describe the same aspect of the house, such as the garage, then it might be possible to summarize them with a smaller number of variables. These variables are also likely to show correlation with each other, which may be a reason to exclude one or more of them.
If almost all the entries for a variable are the same, then the variation in this variable can’t explain variation in Sale Price. Variables like this are good candidates for exclusion.
Sale Price decisions are made by people. That means whatever actual people value most in a home will be strong determinants of Sale Price. So my informal understanding of home sales is at least potentially relevant to determining a model for Sale Price.
I start by dropping Id, Utilities, RoofMatl. Id merely signifies row numbers for observations, and is not a determinant of Sale Price. Almost every entry contains the same values for Utilities, Condition2, and RoofMatl.
model.dat <- model.dat[,c(2:9,11:14,16:22,24:81)]
The dataset contains a large number of NA entries. Reviewing the data dictionary explains that these entries actually do contain information. Often, they indicate that a particular feature of a house is missing, or that a measurement is zero. Based on the information in the data dictionary, I replace all NAs with more descriptive entries. The resulting dataset has no NA entries.
Here I also transform some very rare entries in Exterior1st and BldgType to the modes for those variables.
model.dat$Alley[is.na(model.dat$Alley)] <- "none"
model.dat$FireplaceQu[is.na(model.dat$FireplaceQu)] <- "none"
model.dat$PoolQC[is.na(model.dat$PoolQC)] <- "none"
model.dat$Fence[is.na(model.dat$Fence)] <- "none"
model.dat$MiscFeature[is.na(model.dat$MiscFeature)] <- "none"
model.dat$BsmtQual[is.na(model.dat$BsmtQual)] <- "none"
model.dat$BsmtCond[is.na(model.dat$BsmtCond)] <- "none"
model.dat$BsmtExposure[is.na(model.dat$BsmtExposure)] <- "none"
model.dat$BsmtFinType1[is.na(model.dat$BsmtFinType1)] <- "none"
model.dat$BsmtFinType2[is.na(model.dat$BsmtFinType2)] <- "none"
model.dat$GarageType[is.na(model.dat$GarageType)] <- "none"
model.dat$GarageFinish[is.na(model.dat$GarageFinish)] <- "none"
model.dat$GarageQual[is.na(model.dat$GarageQual)] <- "none"
model.dat$GarageCond[is.na(model.dat$GarageCond)] <- "none"
model.dat$MasVnrArea[is.na(model.dat$MasVnrArea)] <- 0
model.dat$MasVnrType[is.na(model.dat$MasVnrType)] <- "none"
model.dat$Electrical[is.na(model.dat$Electrical)] <- "SBrkr"
model.dat$GarageYrBlt[is.na(model.dat$GarageYrBlt)] <- 2006 #Most common value
model.dat$Exterior1st[model.dat$Exterior1st %in% c("AsphShn","BrkComm","CBlock","ImStucc","Stone")] <- "VinylSd" #Most common value
model.dat$BldgType[!(model.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model.dat$LotFrontage[is.na(model.dat$LotFrontage)] <- 0
Now that NAs have been resolved, it’s time to partition the data into a training set and a testing set. I include 75% of the data in the training set, and reserve 25% for testing.
training_rows <- sample(nrow(model.dat), round(0.75 * nrow(model.dat)), replace = FALSE)
model_train.dat <- model.dat[training_rows,]
model_test.dat <- model.dat[-training_rows,]
My plan for constructing a model is to use the stepAIC function from modelr. This function eliminates variables as long as the elimination results in a reduction in the AIC. The stepAIC model requires a complete linear model to begin its work.
model_train.lm <- lm(SalePrice ~ ., data = model_train.dat)
Because the output from stepAIC is so long, I set include = FALSE in the code chunk containing it. The output of the stepAIC function is summarized below.
summary(model_train.slm)
##
## Call:
## lm(formula = SalePrice ~ MSZoning + LotArea + Street + LotShape +
## LandContour + LotConfig + Neighborhood + Condition1 + BldgType +
## OverallQual + OverallCond + YearBuilt + YearRemodAdd + Exterior1st +
## MasVnrArea + ExterQual + BsmtQual + BsmtExposure + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath +
## BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd +
## Functional + Fireplaces + FireplaceQu + GarageCars + GarageArea +
## GarageQual + GarageCond + WoodDeckSF + ScreenPorch + PoolArea +
## PoolQC + MiscFeature + MoSold + SaleType + SaleCondition,
## data = model_train.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -305887 -9843 -209 9736 186164
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.886e+06 4.826e+05 -8.053 2.38e-15 ***
## MSZoningFV 4.466e+04 1.441e+04 3.100 0.001995 **
## MSZoningRH 3.531e+04 1.404e+04 2.514 0.012090 *
## MSZoningRL 3.122e+04 1.213e+04 2.574 0.010203 *
## MSZoningRM 2.808e+04 1.167e+04 2.407 0.016277 *
## LotArea 3.028e-01 9.471e-02 3.197 0.001432 **
## StreetPave 3.422e+04 1.650e+04 2.074 0.038372 *
## LotShapeIR2 6.378e+03 4.772e+03 1.336 0.181705
## LotShapeIR3 2.798e+04 1.317e+04 2.125 0.033857 *
## LotShapeReg 2.005e+03 1.848e+03 1.085 0.278335
## LandContourHLS 1.405e+04 5.921e+03 2.374 0.017814 *
## LandContourLow 4.166e+03 7.278e+03 0.572 0.567174
## LandContourLvl 1.154e+04 4.341e+03 2.658 0.007997 **
## LotConfigCulDSac 4.256e+03 3.716e+03 1.145 0.252389
## LotConfigFR2 -7.659e+03 4.558e+03 -1.680 0.093224 .
## LotConfigFR3 -1.878e+04 1.530e+04 -1.227 0.220007
## LotConfigInside -1.620e+03 1.964e+03 -0.825 0.409533
## NeighborhoodBlueste -7.319e+03 2.521e+04 -0.290 0.771612
## NeighborhoodBrDale -2.456e+03 1.170e+04 -0.210 0.833776
## NeighborhoodBrkSide -1.333e+03 9.946e+03 -0.134 0.893387
## NeighborhoodClearCr 1.421e+03 1.018e+04 0.140 0.889019
## NeighborhoodCollgCr -7.693e+02 7.774e+03 -0.099 0.921194
## NeighborhoodCrawfor 1.179e+04 9.297e+03 1.268 0.205107
## NeighborhoodEdwards -2.110e+04 8.624e+03 -2.447 0.014599 *
## NeighborhoodGilbert -1.198e+03 8.196e+03 -0.146 0.883798
## NeighborhoodIDOTRR -6.155e+03 1.155e+04 -0.533 0.594259
## NeighborhoodMeadowV 1.797e+03 1.198e+04 0.150 0.880827
## NeighborhoodMitchel -1.304e+04 8.816e+03 -1.479 0.139485
## NeighborhoodNAmes -1.264e+04 8.404e+03 -1.504 0.133031
## NeighborhoodNoRidge 3.758e+04 8.982e+03 4.184 3.13e-05 ***
## NeighborhoodNPkVill 1.230e+04 1.130e+04 1.088 0.276764
## NeighborhoodNridgHt 2.269e+04 7.957e+03 2.852 0.004440 **
## NeighborhoodNWAmes -5.353e+03 8.594e+03 -0.623 0.533514
## NeighborhoodOldTown -1.274e+04 1.025e+04 -1.243 0.214314
## NeighborhoodSawyer -3.260e+03 8.824e+03 -0.369 0.711901
## NeighborhoodSawyerW 1.696e+03 8.324e+03 0.204 0.838639
## NeighborhoodSomerst 2.785e+03 1.015e+04 0.274 0.783848
## NeighborhoodStoneBr 3.700e+04 9.027e+03 4.098 4.51e-05 ***
## NeighborhoodSWISU -2.990e+03 1.076e+04 -0.278 0.781048
## NeighborhoodTimber -2.982e+03 8.782e+03 -0.340 0.734237
## NeighborhoodVeenker 1.042e+04 1.113e+04 0.936 0.349270
## Condition1Feedr 3.504e+03 5.765e+03 0.608 0.543441
## Condition1Norm 1.412e+04 4.862e+03 2.905 0.003762 **
## Condition1PosA 2.777e+03 1.081e+04 0.257 0.797367
## Condition1PosN -1.698e+04 8.073e+03 -2.103 0.035724 *
## Condition1RRAe -9.474e+03 1.071e+04 -0.885 0.376478
## Condition1RRAn 8.903e+03 8.030e+03 1.109 0.267856
## Condition1RRNe 4.419e+03 1.811e+04 0.244 0.807232
## Condition1RRNn 1.519e+04 1.515e+04 1.002 0.316520
## BldgTypeNot1Fam -2.010e+04 3.386e+03 -5.935 4.09e-09 ***
## OverallQual 6.245e+03 1.137e+03 5.495 5.00e-08 ***
## OverallCond 4.851e+03 9.397e+02 5.162 2.98e-07 ***
## YearBuilt 2.841e+02 7.454e+01 3.811 0.000147 ***
## YearRemodAdd 1.259e+02 6.340e+01 1.986 0.047297 *
## Exterior1stBrkFace 1.921e+04 8.554e+03 2.246 0.024951 *
## Exterior1stCemntBd -1.062e+04 8.941e+03 -1.188 0.235222
## Exterior1stHdBoard -5.544e+03 7.836e+03 -0.708 0.479402
## Exterior1stMetalSd 1.801e+03 7.693e+03 0.234 0.814996
## Exterior1stPlywood -7.043e+03 8.224e+03 -0.856 0.391991
## Exterior1stStucco 5.363e+03 9.254e+03 0.579 0.562388
## Exterior1stVinylSd 1.125e+03 7.697e+03 0.146 0.883820
## Exterior1stWd Sdng 7.144e+01 7.608e+03 0.009 0.992509
## Exterior1stWdShing 8.461e+02 9.164e+03 0.092 0.926456
## MasVnrArea 1.046e+01 5.394e+00 1.940 0.052704 .
## ExterQualFa -1.059e+04 1.196e+04 -0.886 0.376095
## ExterQualGd -2.845e+04 5.735e+03 -4.960 8.34e-07 ***
## ExterQualTA -2.864e+04 6.259e+03 -4.577 5.34e-06 ***
## BsmtQualFa -1.480e+04 7.635e+03 -1.938 0.052891 .
## BsmtQualGd -1.790e+04 3.900e+03 -4.590 5.01e-06 ***
## BsmtQualnone -5.698e+03 8.859e+03 -0.643 0.520276
## BsmtQualTA -1.660e+04 4.747e+03 -3.497 0.000492 ***
## BsmtExposureGd 1.966e+04 3.522e+03 5.581 3.10e-08 ***
## BsmtExposureMn 4.611e+03 3.431e+03 1.344 0.179245
## BsmtExposureNo -1.251e+03 2.409e+03 -0.519 0.603584
## BsmtExposurenone NA NA NA NA
## BsmtFinSF1 3.019e+01 5.253e+00 5.746 1.23e-08 ***
## BsmtFinSF2 2.277e+01 6.784e+00 3.357 0.000820 ***
## BsmtUnfSF 1.776e+01 4.953e+00 3.587 0.000352 ***
## `1stFlrSF` 4.812e+01 5.587e+00 8.612 < 2e-16 ***
## `2ndFlrSF` 4.612e+01 3.763e+00 12.255 < 2e-16 ***
## BsmtFullBath 4.550e+03 2.099e+03 2.168 0.030401 *
## BedroomAbvGr -3.569e+03 1.502e+03 -2.377 0.017637 *
## KitchenAbvGr -7.760e+03 5.430e+03 -1.429 0.153346
## KitchenQualFa -2.090e+04 7.173e+03 -2.913 0.003660 **
## KitchenQualGd -2.264e+04 4.221e+03 -5.363 1.03e-07 ***
## KitchenQualTA -2.402e+04 4.682e+03 -5.130 3.50e-07 ***
## TotRmsAbvGrd 2.655e+03 1.071e+03 2.479 0.013331 *
## FunctionalMaj2 1.130e+04 1.548e+04 0.730 0.465672
## FunctionalMin1 1.182e+04 9.550e+03 1.237 0.216281
## FunctionalMin2 1.670e+04 9.590e+03 1.741 0.081944 .
## FunctionalMod 1.017e+04 1.064e+04 0.956 0.339199
## FunctionalSev -3.060e+04 2.670e+04 -1.146 0.252063
## FunctionalTyp 3.041e+04 7.976e+03 3.813 0.000146 ***
## Fireplaces 1.192e+04 3.268e+03 3.649 0.000277 ***
## FireplaceQuFa -1.819e+04 8.154e+03 -2.231 0.025938 *
## FireplaceQuGd -1.217e+04 6.325e+03 -1.924 0.054712 .
## FireplaceQunone -2.742e+03 7.597e+03 -0.361 0.718270
## FireplaceQuPo -7.722e+03 9.820e+03 -0.786 0.431838
## FireplaceQuTA -1.486e+04 6.525e+03 -2.278 0.022936 *
## GarageCars 4.790e+03 2.618e+03 1.830 0.067622 .
## GarageArea 1.611e+01 8.755e+00 1.840 0.066096 .
## GarageQualFa -1.561e+05 2.681e+04 -5.823 7.89e-09 ***
## GarageQualGd -1.441e+05 2.803e+04 -5.141 3.30e-07 ***
## GarageQualnone 7.501e+01 1.808e+04 0.004 0.996690
## GarageQualPo -1.465e+05 3.553e+04 -4.123 4.06e-05 ***
## GarageQualTA -1.518e+05 2.653e+04 -5.719 1.43e-08 ***
## GarageCondFa 1.368e+05 3.245e+04 4.217 2.71e-05 ***
## GarageCondGd 1.274e+05 3.369e+04 3.781 0.000166 ***
## GarageCondnone NA NA NA NA
## GarageCondPo 1.388e+05 3.501e+04 3.963 7.95e-05 ***
## GarageCondTA 1.430e+05 3.199e+04 4.471 8.73e-06 ***
## WoodDeckSF 1.551e+01 6.633e+00 2.339 0.019557 *
## ScreenPorch 4.681e+01 1.511e+01 3.098 0.002005 **
## PoolArea 5.900e+03 8.542e+02 6.907 9.01e-12 ***
## PoolQCFa -7.895e+05 1.024e+05 -7.708 3.17e-14 ***
## PoolQCGd -3.523e+05 4.759e+04 -7.404 2.89e-13 ***
## PoolQCnone 3.024e+06 4.566e+05 6.623 5.85e-11 ***
## MiscFeaturenone 2.775e+02 2.422e+04 0.011 0.990861
## MiscFeatureOthr 2.040e+04 3.419e+04 0.597 0.550842
## MiscFeatureShed 9.029e+02 2.448e+04 0.037 0.970580
## MiscFeatureTenC 7.202e+05 1.188e+05 6.062 1.93e-09 ***
## MoSold -5.857e+02 2.815e+02 -2.080 0.037758 *
## SaleTypeCon 1.649e+04 1.835e+04 0.898 0.369226
## SaleTypeConLD 2.445e+04 1.288e+04 1.898 0.057940 .
## SaleTypeConLI 7.363e+03 1.323e+04 0.557 0.577942
## SaleTypeConLw 1.075e+04 1.368e+04 0.785 0.432525
## SaleTypeCWD 1.873e+04 1.333e+04 1.405 0.160279
## SaleTypeNew 1.313e+05 2.628e+04 4.996 6.93e-07 ***
## SaleTypeOth 1.361e+04 1.504e+04 0.905 0.365760
## SaleTypeWD 1.144e+03 4.925e+03 0.232 0.816393
## SaleConditionAdjLand 2.552e+04 1.839e+04 1.388 0.165460
## SaleConditionAlloca 5.754e+03 9.877e+03 0.583 0.560347
## SaleConditionFamily -2.546e+02 7.118e+03 -0.036 0.971479
## SaleConditionNormal 1.062e+04 3.362e+03 3.160 0.001629 **
## SaleConditionPartial -1.114e+05 2.582e+04 -4.315 1.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23560 on 962 degrees of freedom
## Multiple R-squared: 0.9184, Adjusted R-squared: 0.9072
## F-statistic: 81.98 on 132 and 962 DF, p-value: < 2.2e-16
The stepAIC model results in a value for \(R^2\) so high that I suspect overfitting.
rsquare(model_train.slm, model_test.dat)
## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading
## [1] 0.6050672
With this particular test/train split, overfitting does not look like a major concern. \(R^2\) for the test data on the trained model is still quite high at 78.5%. However, as I ran this model multiple times with different test/train splits, overfitting was, in general, an issue. I obtained values for \(R^2\) as low as 0.35. If I were to revisit my work on this model, I would be interested to explore cross-validation as a way to summarize changes in \(R^2\) depending on the particular data in the test/train splits.
To resolve the overfit, I need to remove variables from the model. The normal procedure of backward elimination would be relevant here, except that all variables in the model include at least one level with significance less than 0.05. Typically, this would mean the variable should remain in the model. My strategy for removing variables will rely on the principles described in the Tidy section. I’ll reserve only one variable for each attribute of the house (for example, multiple variables regarding the garage will be removed until only one about the garage remains). I’ll rely on some of my informal understanding of real estate sales. And I’ll try to reduce the number of factors of highly imbalanced categorical variables when possible.
model_train.dat$BldgType[!(model_train.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model_train.dat$HouseStyle[model_train.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
model_train.dat$BsmtFullBath[model_train.dat$BsmtFullBath > 1] <- 1
model_train.dat$HalfBath[model_train.dat$HalfBath > 1] <- 1
model_train.dat$BedroomAbvGr[model_train.dat$BedroomAbvGr > 4] <- 4
model_train.dat$Fireplaces[model_train.dat$Fireplaces > 1] <- 1
model_train.dat$SaleType[model_train.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"
model_test.dat$BldgType[!(model_test.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model_test.dat$HouseStyle[model_test.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
model_test.dat$BsmtFullBath[model_test.dat$BsmtFullBath > 1] <- 1
model_test.dat$HalfBath[model_test.dat$HalfBath > 1] <- 1
model_test.dat$BedroomAbvGr[model_test.dat$BedroomAbvGr > 4] <- 4
model_test.dat$Fireplaces[model_test.dat$Fireplaces > 1] <- 1
model_test.dat$SaleType[model_test.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"
model_train_210515.lm <- lm(SalePrice ~ LotFrontage + LotArea + LotConfig + Neighborhood + BldgType + HouseStyle + OverallQual + OverallCond + YearBuilt + YearRemodAdd + Exterior1st + BsmtFinSF1 + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenQual + Fireplaces + GarageArea + SaleCondition, data = model_train.dat)
summary(model_train_210515.lm)
##
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + LotConfig +
## Neighborhood + BldgType + HouseStyle + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + Exterior1st + BsmtFinSF1 + `1stFlrSF` +
## `2ndFlrSF` + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr +
## KitchenQual + Fireplaces + GarageArea + SaleCondition, data = model_train.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348254 -11469 436 10432 226107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.989e+05 1.913e+05 -4.698 2.99e-06 ***
## LotFrontage 5.406e+01 2.851e+01 1.896 0.058176 .
## LotArea 5.074e-01 9.242e-02 5.490 5.05e-08 ***
## LotConfigCulDSac 4.582e+03 4.238e+03 1.081 0.279838
## LotConfigFR2 -6.652e+03 5.360e+03 -1.241 0.214889
## LotConfigFR3 -3.235e+04 1.676e+04 -1.930 0.053833 .
## LotConfigInside -1.582e+03 2.243e+03 -0.705 0.480872
## NeighborhoodBlueste -1.761e+04 2.954e+04 -0.596 0.551170
## NeighborhoodBrDale -8.387e+03 1.253e+04 -0.669 0.503489
## NeighborhoodBrkSide -6.085e+03 1.058e+04 -0.575 0.565494
## NeighborhoodClearCr -8.661e+03 1.121e+04 -0.773 0.439872
## NeighborhoodCollgCr -8.365e+03 8.745e+03 -0.957 0.338999
## NeighborhoodCrawfor 6.891e+03 1.023e+04 0.674 0.500546
## NeighborhoodEdwards -2.568e+04 9.626e+03 -2.667 0.007763 **
## NeighborhoodGilbert -8.515e+03 9.272e+03 -0.918 0.358653
## NeighborhoodIDOTRR -1.518e+04 1.122e+04 -1.354 0.176150
## NeighborhoodMeadowV -6.180e+03 1.293e+04 -0.478 0.632871
## NeighborhoodMitchel -1.705e+04 9.879e+03 -1.726 0.084569 .
## NeighborhoodNAmes -2.129e+04 9.295e+03 -2.290 0.022230 *
## NeighborhoodNoRidge 3.216e+04 1.008e+04 3.192 0.001456 **
## NeighborhoodNPkVill 2.277e+03 1.319e+04 0.173 0.863009
## NeighborhoodNridgHt 2.680e+04 9.023e+03 2.971 0.003042 **
## NeighborhoodNWAmes -2.174e+04 9.579e+03 -2.269 0.023450 *
## NeighborhoodOldTown -2.134e+04 1.019e+04 -2.094 0.036500 *
## NeighborhoodSawyer -1.697e+04 9.810e+03 -1.729 0.084043 .
## NeighborhoodSawyerW -1.187e+04 9.461e+03 -1.255 0.209881
## NeighborhoodSomerst 5.608e+03 8.972e+03 0.625 0.532060
## NeighborhoodStoneBr 2.471e+04 1.031e+04 2.397 0.016713 *
## NeighborhoodSWISU -8.300e+03 1.191e+04 -0.697 0.486029
## NeighborhoodTimber -5.334e+03 9.897e+03 -0.539 0.590002
## NeighborhoodVeenker 6.259e+03 1.260e+04 0.497 0.619503
## BldgTypeNot1Fam -1.876e+04 3.098e+03 -6.054 1.97e-09 ***
## HouseStyle1.5Unf 1.241e+04 1.052e+04 1.180 0.238413
## HouseStyle1Story 1.583e+04 4.718e+03 3.354 0.000825 ***
## HouseStyle2Story -5.398e+03 4.139e+03 -1.304 0.192459
## HouseStyleSFoyer 1.546e+04 7.674e+03 2.014 0.044220 *
## HouseStyleSLvl 9.800e+03 5.888e+03 1.664 0.096332 .
## OverallQual 1.211e+04 1.208e+03 10.026 < 2e-16 ***
## OverallCond 5.132e+03 1.012e+03 5.073 4.65e-07 ***
## YearBuilt 3.511e+02 8.139e+01 4.314 1.76e-05 ***
## YearRemodAdd 1.008e+02 7.061e+01 1.427 0.153872
## Exterior1stBrkFace 7.104e+03 9.476e+03 0.750 0.453626
## Exterior1stCemntBd -1.411e+04 1.006e+04 -1.403 0.160973
## Exterior1stHdBoard -9.650e+03 8.757e+03 -1.102 0.270751
## Exterior1stMetalSd -4.125e+03 8.498e+03 -0.485 0.627479
## Exterior1stPlywood -1.071e+04 9.125e+03 -1.174 0.240783
## Exterior1stStucco 6.443e+02 1.039e+04 0.062 0.950557
## Exterior1stVinylSd -6.342e+03 8.638e+03 -0.734 0.462986
## Exterior1stWd Sdng -3.984e+03 8.472e+03 -0.470 0.638313
## Exterior1stWdShing -2.504e+03 1.043e+04 -0.240 0.810217
## BsmtFinSF1 2.324e+01 2.934e+00 7.923 6.02e-15 ***
## `1stFlrSF` 6.334e+01 4.498e+00 14.081 < 2e-16 ***
## `2ndFlrSF` 6.759e+01 6.062e+00 11.151 < 2e-16 ***
## BsmtFullBath 2.422e+03 2.391e+03 1.013 0.311380
## FullBath 6.365e+02 2.867e+03 0.222 0.824346
## HalfBath 4.890e+03 2.860e+03 1.710 0.087599 .
## BedroomAbvGr -2.555e+03 1.663e+03 -1.536 0.124735
## KitchenQualFa -3.452e+04 7.848e+03 -4.398 1.21e-05 ***
## KitchenQualGd -3.992e+04 4.265e+03 -9.360 < 2e-16 ***
## KitchenQualTA -3.836e+04 4.911e+03 -7.811 1.39e-14 ***
## Fireplaces 1.111e+03 2.224e+03 0.500 0.617498
## GarageArea 2.288e+01 5.878e+00 3.893 0.000105 ***
## SaleConditionAdjLand 2.201e+04 2.130e+04 1.033 0.301649
## SaleConditionAlloca -1.239e+04 1.055e+04 -1.175 0.240325
## SaleConditionFamily 1.198e+03 8.002e+03 0.150 0.881049
## SaleConditionNormal 9.106e+03 3.490e+03 2.609 0.009212 **
## SaleConditionPartial 1.927e+04 4.985e+03 3.865 0.000118 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28130 on 1028 degrees of freedom
## Multiple R-squared: 0.8756, Adjusted R-squared: 0.8676
## F-statistic: 109.6 on 66 and 1028 DF, p-value: < 2.2e-16
rsquare(model_train_210515.lm, model_test.dat)
## [1] 0.7778742
The simplified model performs much better on the test data. For this particular test/train split, the \(R^2\) on the test data is actually higher than on the training data!
ggplot(model_train.slm, aes(sample = .resid)) +
geom_qq() +
geom_qq_line()
The QQ plot indicates that there are some important deviations from the linear model at the tails of the Sale Price distribution. One avenue to pursue if I were to continue this investigation might be to have separate models for predicting low, medium, and high-price houses. The first step of the model would be to predict the range of the house price, and then a separate multiple regression model for each group would be used to predict the final price.
ggplot(data = model_train.slm, aes(x = .fitted, y = .resid)) +
geom_point()
Again, we see problems at the higher values of SalePrice. The model systematically underestimates the prices of the most expensive houses in Ames.
predict.dat <- predict.dat %>%
dplyr::select(LotFrontage, LotArea, LotConfig, Neighborhood, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, Exterior1st, BsmtFinSF1, `1stFlrSF`, `2ndFlrSF`, BsmtFullBath, FullBath, HalfBath, BedroomAbvGr, KitchenQual, Fireplaces, GarageArea, SaleType, SaleCondition)
predict.dat$LotFrontage[is.na(predict.dat$LotFrontage)] <- 0
predict.dat$Exterior1st[predict.dat$Exterior1st %in% c("AsphShn","BrkComm","CBlock","ImStucc","Stone")] <- "VinylSd"
predict.dat$Exterior1st[is.na(predict.dat$Exterior1st)] <- "VinylSd"
predict.dat$BsmtFinSF1[is.na(predict.dat$BsmtFinSF1)] <- 0
predict.dat$BsmtFullBath[is.na(predict.dat$BsmtFullBath)] <- 0
predict.dat$KitchenQual[is.na(predict.dat$KitchenQual)] <- "TA"
predict.dat$GarageArea[is.na(predict.dat$GarageArea)] <- 0
predict.dat$SaleType[is.na(predict.dat$SaleType)] <- "WD"
sum(is.na(predict.dat))
## [1] 0
predict.dat$BldgType[!(predict.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
predict.dat$HouseStyle[predict.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
predict.dat$BsmtFullBath[predict.dat$BsmtFullBath > 1] <- 1
predict.dat$HalfBath[predict.dat$HalfBath > 1] <- 1
predict.dat$BedroomAbvGr[predict.dat$BedroomAbvGr > 4] <- 4
predict.dat$Fireplaces[predict.dat$Fireplaces > 1] <- 1
predict.dat$SaleType[predict.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"
preds <- predict.lm(model_train_210515.lm, predict.dat, type = "response")
submit <- data.frame("Id" = seq(from = 1461, to = 2919), "SalePrice" = preds)
write_csv(submit, "C:/Users/dmosc/OneDrive/Desktop/y.csv")
This model earned a score of 0.15828. My user name is Daniel Moscoe.