Join the Kaggle competition House Prices Advanced Regression Techniques and build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
This is a long assignment with detailed code outputs. Please use the floating Table of Contents to follow the structure of the response, and note that commentary precedes the code and output.
Our data is publicly available as the Ames Housing Data.
The target value is SalePrice. There are 79 predictors and one ID column, with 1460 rows in our train data and 1459 rows in the test data. The test data does not contain the target value; our predictions for it are ultimately what we need to submit for the competition.
We’ve uploaded the competition data to our GitHub and read it from there.
datalocation_train = 'https://raw.githubusercontent.com/pkofy/DATA605/main/Final%20Project/train.csv'
datalocation_test = 'https://raw.githubusercontent.com/pkofy/DATA605/main/Final%20Project/test.csv'
train <- read.csv(file=datalocation_train)
test <- read.csv(file=datalocation_test)
We’re going to scale the target variable from 0-100 so that it’s easier to evaluate the model iterations.
max_sp <- max(train$SalePrice)
min_sp <- min(train$SalePrice)
range <- max_sp - min_sp
train$ssp <- 100 * (train$SalePrice - min_sp) / range
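Because the model is fit on the scaled target, its predictions have to be mapped back to dollars before they can be submitted. The following is a minimal sketch of that round trip on toy prices. The helper names to_scaled and to_price and the variable span_sp are illustrative assumptions, not part of the original code.

```r
# Toy sale prices standing in for train$SalePrice (illustrative values only)
sale_price <- c(100000, 150000, 250000, 400000)

min_sp <- min(sale_price)
span_sp <- max(sale_price) - min_sp  # hypothetical name, avoids masking base::range()

# Forward transform: map prices onto a 0-100 scale
to_scaled <- function(price) 100 * (price - min_sp) / span_sp

# Inverse transform: map scaled values (e.g. model predictions) back to dollars
to_price <- function(scaled) scaled * span_sp / 100 + min_sp

ssp <- to_scaled(sale_price)
round_trip <- to_price(ssp)
```

The inverse must use the minimum and span computed from the training prices, since the test data has no SalePrice column to compute them from.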
We’re trying to figure out which predictors we can exclude before we start backwards elimination.
For the numeric predictors we’re going to eliminate any that don’t appear to have a linear relationship when we compare them to scaled sale price, ssp.
For the non-numeric predictors we’re going to eliminate any that have a lot of N/A values or only two unique values, as including those in the model would break the linear regression model function.
str(train)
## 'data.frame': 1460 obs. of 82 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
## $ ssp : num 24.1 20.4 26.2 14.6 29.9 ...
We use pairs charts to look for linear relationships between the values. We want to look at the top row, which shows ssp on the y-axis and the predictor in that column on the x-axis. We tried setting verInd=1 to show only the first row; however, we couldn’t fix the distortion, since the charts were still stretched on the y-axis for the whole length of the chart.
From this we want to eliminate MSSubClass. The problem with eliminating one of these variables is that it could have predictive value in conjunction with another variable, but we’ll excuse that for now.
pairs1 <- train[c("ssp", "MSSubClass", "LotFrontage", "LotArea", "OverallQual", "OverallCond")]
pairs(pairs1, gap=.5)
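One possible way around the stretched single-row layout is to skip pairs() and draw one scatterplot per predictor in a one-row grid. This is a sketch under assumptions: the function name plot_target_row and the toy data frame are hypothetical, and the random values only stand in for the real columns.

```r
# Toy data standing in for a few columns of train (random, illustrative only)
set.seed(1)
toy <- data.frame(
  ssp = runif(50, 0, 100),
  LotArea = runif(50, 5000, 15000),
  OverallQual = sample(1:10, 50, replace = TRUE)
)

# Draw the target against each predictor in a single row of panels
plot_target_row <- function(df, target, predictors) {
  old_par <- par(mfrow = c(1, length(predictors)), mar = c(4, 4, 2, 1))
  on.exit(par(old_par))  # restore the previous layout when done
  for (p in predictors) {
    plot(df[[p]], df[[target]], xlab = p, ylab = target, main = p)
  }
  invisible(predictors)
}

shown <- plot_target_row(toy, "ssp", c("LotArea", "OverallQual"))
```

Each panel gets the full height of the plotting device, so the aspect ratio is controlled by the figure size rather than by the pairs grid.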
From this set of pairs we’re going to eliminate MasVnrArea and BsmtFinSF2. The first looks like a cloud with a lot of zeros and the second looks like it has a lot of zero values.
pairs2 <- train[c("ssp", "YearBuilt", "YearRemodAdd", "MasVnrArea", "BsmtFinSF2", "BsmtUnfSF")]
pairs(pairs2, gap=.5)
We’re going to eliminate X2ndFlrSF and LowQualFinSF because of lots of zero values, and the second doesn’t look like it has a linear relationship.
pairs3 <- train[c("ssp", "TotalBsmtSF", "X1stFlrSF", "X2ndFlrSF", "LowQualFinSF", "GrLivArea")]
pairs(pairs3, gap=.5)
We’re going to eliminate all of the variables from pairs4 because they seem not to have linear relationships with scaled sale price. It could be that we should treat them as qualitative variables and only eliminate them from the backwards elimination step if they have too many 0 or N/A values, but we’ll excuse that for now. FullBath looks linear but maybe it’s a proxy for another variable like size of house.
pairs4 <- train[c("ssp", "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "BedroomAbvGr")]
pairs(pairs4, gap=.5)
We’re going to keep TotRmsAbvGrd because it seems linear and doesn’t have too many zeros, and exclude the rest. In retrospect Fireplaces and GarageCars seem linear (if you exclude 3 fireplaces and 4-car garages) but we’re okay with the judgement call we made then.
pairs5 <- train[c("ssp", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces", "GarageYrBlt", "GarageCars")]
pairs(pairs5, gap=.5)
From the sixth group we’ll keep GarageArea, but the rest seem to have a lot of zero values or don’t seem linear.
pairs6 <- train[c("ssp", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "X3SsnPorch")]
pairs(pairs6, gap=.5)
We’re not going to keep any of these predictors, either because of too many zeros or because they are not linear enough (a line from the bottom left corner to the top right corner, or if inversely linear, a line from the top left corner to the bottom right corner).
pairs7 <- train[c("ssp", "ScreenPorch", "PoolArea", "MiscVal", "MoSold", "YrSold")]
pairs(pairs7, gap=.5)
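The visual screening above could be cross-checked numerically by ranking the numeric predictors by their absolute Pearson correlation with ssp. This is only a sketch on simulated data with hypothetical column names. Pearson correlation measures linear association only, so it complements the charts and does not replace them.

```r
# Simulated data: one predictor tied to the target, one pure noise, one non-numeric
set.seed(42)
n <- 200
toy <- data.frame(
  strong = rnorm(n),
  noise = rnorm(n),
  label = sample(c("a", "b"), n, replace = TRUE)
)
toy$ssp <- 10 * toy$strong + rnorm(n)

# Keep only numeric columns, excluding the target itself
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
predictors <- setdiff(numeric_cols, "ssp")

# Correlation of each predictor with the target, ignoring rows with NA
cors <- sapply(predictors, function(p) {
  cor(toy[[p]], toy$ssp, use = "complete.obs")
})

# Strongest linear relationships first
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```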
Now we have to evaluate the non-numeric predictors for too many missing values or too few unique values. We could also look at the number of occurrences of each unique value, but we’ll leave that aside.
Here we check the number of missing values in the non-numeric predictors. From this we can eliminate Alley, FireplaceQu, PoolQC, Fence, and MiscFeature.
not_numeric <- c("MSZoning", "Street", "Alley", "LotShape", "LandContour", "Utilities", "LotConfig", "LandSlope", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "ExterQual", "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC", "CentralAir", "Electrical", "KitchenQual", "Functional", "FireplaceQu", "GarageType", "GarageFinish", "GarageQual", "GarageCond", "PavedDrive", "PoolQC", "Fence", "MiscFeature", "SaleType", "SaleCondition")
colSums(is.na(train[not_numeric]))
## MSZoning Street Alley LotShape LandContour
## 0 0 1369 0 0
## Utilities LotConfig LandSlope Neighborhood Condition1
## 0 0 0 0 0
## Condition2 BldgType HouseStyle RoofStyle RoofMatl
## 0 0 0 0 0
## Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
## 0 0 8 0 0
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
## 0 37 37 38 37
## BsmtFinType2 Heating HeatingQC CentralAir Electrical
## 38 0 0 0 1
## KitchenQual Functional FireplaceQu GarageType GarageFinish
## 0 0 690 81 81
## GarageQual GarageCond PavedDrive PoolQC Fence
## 81 81 0 1453 1179
## MiscFeature SaleType SaleCondition
## 1406 0 0
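The same selection can be made programmatically by comparing each column's share of missing values to a cutoff. The sketch below uses a toy data frame. The 0.4 cutoff is a hypothetical value chosen for illustration, not one used in the original analysis.

```r
# Toy stand-in for train[not_numeric]; values are illustrative only
toy <- data.frame(
  Alley = c(NA, NA, NA, "Grvl"),
  MSZoning = c("RL", "RM", "RL", "RL"),
  GarageType = c("Attchd", NA, "Detchd", "Attchd"),
  stringsAsFactors = FALSE
)

# Share of missing values per column (colSums(is.na(.)) divided by the row count)
na_share <- colMeans(is.na(toy))

# Hypothetical cutoff: flag columns with more than 40% missing
na_cutoff <- 0.4
too_sparse <- names(na_share)[na_share > na_cutoff]
too_sparse
```

Working with shares rather than raw counts keeps the rule independent of the number of rows.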
Here we take out three of the remaining non-numeric predictors, Street, Utilities, and CentralAir, because they only have two unique values. We’re not showing the output for the rest, but we keep the code below.
length(unique(train$Street))
## [1] 2
length(unique(train$Utilities))
## [1] 2
length(unique(train$CentralAir))
## [1] 2
length(unique(train$MSZoning))
length(unique(train$LotShape))
length(unique(train$LandContour))
length(unique(train$LotConfig))
length(unique(train$LandSlope))
length(unique(train$Neighborhood))
length(unique(train$Condition1))
length(unique(train$Condition2))
length(unique(train$BldgType))
length(unique(train$HouseStyle))
length(unique(train$RoofStyle))
length(unique(train$RoofMatl))
length(unique(train$Exterior1st))
length(unique(train$Exterior2nd))
length(unique(train$MasVnrType))
length(unique(train$ExterQual))
length(unique(train$ExterCond))
length(unique(train$Foundation))
length(unique(train$BsmtQual))
length(unique(train$BsmtCond))
length(unique(train$BsmtExposure))
length(unique(train$BsmtFinType1))
length(unique(train$BsmtFinType2))
length(unique(train$Heating))
length(unique(train$HeatingQC))
length(unique(train$Electrical))
length(unique(train$KitchenQual))
length(unique(train$Functional))
length(unique(train$GarageType))
length(unique(train$GarageFinish))
length(unique(train$GarageQual))
length(unique(train$GarageCond))
length(unique(train$PavedDrive))
length(unique(train$SaleType))
length(unique(train$SaleCondition))
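The repeated calls above can be collapsed into a single sapply() over the column names. This is a sketch on a toy data frame. With the real data the same call would presumably be applied to train[not_numeric].

```r
# Toy stand-in for a few non-numeric columns of train (illustrative values)
toy <- data.frame(
  Street = c("Pave", "Pave", "Grvl", "Pave"),
  MSZoning = c("RL", "RM", "FV", "RL"),
  Electrical = c("SBrkr", NA, "FuseA", "SBrkr"),
  stringsAsFactors = FALSE
)
cols <- c("Street", "MSZoning", "Electrical")

# Number of unique values per column; unique() counts NA as its own value,
# which matches the behaviour of length(unique(train$...)) above
n_unique <- sapply(toy[cols], function(x) length(unique(x)))
n_unique

# Columns with exactly two unique values
two_valued <- names(n_unique)[n_unique == 2]
two_valued
```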
Here we perform Backwards Elimination using the variables remaining after our commonsense eliminations.
In Backwards Elimination you remove one predictor at a time, reevaluating your model between eliminations; however, we’ll batch some of the earlier eliminations for efficiency.
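For reference, the one-at-a-time procedure can be sketched as a loop around drop1(), which tests each term as a whole, so a multi-level factor gets one F-test instead of one p-value per level. The example uses the built-in mtcars data as a stand-in. The fixed 0.05 cutoff is a hypothetical choice and differs from the judgement-based approach taken here.

```r
# mtcars ships with R and is used only as a stand-in for the housing data
fit <- lm(mpg ~ wt + hp + qsec + drat + factor(cyl), data = mtcars)
alpha <- 0.05  # hypothetical p-value cutoff

repeat {
  # Stop if no predictors are left to test
  if (length(attr(terms(fit), "term.labels")) == 0) break

  # F-test for dropping each term; the first row is the "<none>" baseline
  tests <- drop1(fit, test = "F")
  p_values <- tests[["Pr(>F)"]][-1]
  term_names <- rownames(tests)[-1]

  # Stop once every remaining term is below the cutoff
  if (max(p_values) <= alpha) break

  # Remove the single least significant term and refit
  worst <- term_names[which.max(p_values)]
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

formula(fit)
```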
After running our first model, based on the p-values of the respective coefficients being high we can remove the following in our first batch elimination. We’re not predetermining a threshold for p to make our eliminations.
YearRemodAdd
TotRmsAbvGrd
LotShape
RoofStyle
MasVnrType
ExterCond
BsmtCond
Heating
HeatingQC
Electrical
Functional
GarageFinish
PavedDrive
SaleType
SaleCondition
lm1 <- lm(ssp ~ LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + YearRemodAdd + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + GrLivArea + TotRmsAbvGrd + GarageArea + MSZoning + LotShape + LandContour + LotConfig + LandSlope + Neighborhood + Condition1 + Condition2 + BldgType + HouseStyle + RoofStyle + RoofMatl + Exterior1st + Exterior2nd + MasVnrType + ExterQual + ExterCond + Foundation + BsmtQual + BsmtCond + BsmtExposure + BsmtFinType1 + BsmtFinType2 + Heating + HeatingQC + Electrical + KitchenQual + Functional + GarageType + GarageFinish + GarageQual + GarageCond + PavedDrive + SaleType + SaleCondition, data=train)
summary(lm1)
##
## Call:
## lm(formula = ssp ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + BsmtUnfSF + TotalBsmtSF + X1stFlrSF +
## GrLivArea + TotRmsAbvGrd + GarageArea + MSZoning + LotShape +
## LandContour + LotConfig + LandSlope + Neighborhood + Condition1 +
## Condition2 + BldgType + HouseStyle + RoofStyle + RoofMatl +
## Exterior1st + Exterior2nd + MasVnrType + ExterQual + ExterCond +
## Foundation + BsmtQual + BsmtCond + BsmtExposure + BsmtFinType1 +
## BsmtFinType2 + Heating + HeatingQC + Electrical + KitchenQual +
## Functional + GarageType + GarageFinish + GarageQual + GarageCond +
## PavedDrive + SaleType + SaleCondition, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.1684 -1.3217 0.0134 1.3400 26.1684
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.464e+02 3.247e+01 -7.589 8.04e-14 ***
## LotFrontage 1.324e-02 7.930e-03 1.670 0.095276 .
## LotArea 1.169e-04 2.179e-05 5.365 1.03e-07 ***
## OverallQual 1.060e+00 1.831e-01 5.792 9.61e-09 ***
## OverallCond 9.498e-01 1.564e-01 6.071 1.87e-09 ***
## YearBuilt 5.139e-02 1.323e-02 3.883 0.000111 ***
## YearRemodAdd 1.289e-02 9.955e-03 1.295 0.195670
## BsmtUnfSF -2.826e-03 4.554e-04 -6.205 8.32e-10 ***
## TotalBsmtSF 6.789e-03 9.644e-04 7.039 3.83e-12 ***
## X1stFlrSF -4.519e-03 1.321e-03 -3.422 0.000650 ***
## GrLivArea 1.100e-02 9.087e-04 12.107 < 2e-16 ***
## TotRmsAbvGrd -4.350e-02 1.509e-01 -0.288 0.773217
## GarageArea 3.269e-03 9.230e-04 3.542 0.000418 ***
## MSZoningFV 6.449e+00 1.987e+00 3.246 0.001212 **
## MSZoningRH 4.881e+00 2.128e+00 2.293 0.022071 *
## MSZoningRL 4.756e+00 1.744e+00 2.728 0.006500 **
## MSZoningRM 4.216e+00 1.617e+00 2.607 0.009276 **
## LotShapeIR2 8.978e-01 8.174e-01 1.098 0.272322
## LotShapeIR3 1.077e+00 1.819e+00 0.592 0.553784
## LotShapeReg 3.086e-01 2.945e-01 1.048 0.295049
## LandContourHLS 1.314e+00 9.216e-01 1.426 0.154189
## LandContourLow -3.208e+00 1.383e+00 -2.320 0.020574 *
## LandContourLvl 8.046e-01 6.921e-01 1.163 0.245324
## LotConfigCulDSac 1.992e+00 7.247e-01 2.749 0.006091 **
## LotConfigFR2 -1.594e+00 7.654e-01 -2.082 0.037582 *
## LotConfigFR3 -2.188e+00 1.914e+00 -1.143 0.253192
## LotConfigInside -5.342e-02 3.244e-01 -0.165 0.869267
## LandSlopeMod 9.145e-01 7.265e-01 1.259 0.208408
## LandSlopeSev -6.162e+00 2.283e+00 -2.700 0.007074 **
## NeighborhoodBlueste 4.405e-01 2.957e+00 0.149 0.881587
## NeighborhoodBrDale 1.324e+00 1.790e+00 0.740 0.459515
## NeighborhoodBrkSide 1.231e-01 1.636e+00 0.075 0.940048
## NeighborhoodClearCr -1.907e+00 1.712e+00 -1.114 0.265745
## NeighborhoodCollgCr -1.386e+00 1.179e+00 -1.175 0.240124
## NeighborhoodCrawfor 2.065e+00 1.433e+00 1.441 0.149857
## NeighborhoodEdwards -2.809e+00 1.321e+00 -2.127 0.033655 *
## NeighborhoodGilbert -1.309e+00 1.291e+00 -1.014 0.310779
## NeighborhoodIDOTRR 4.482e-02 1.854e+00 0.024 0.980717
## NeighborhoodMeadowV -7.656e-01 1.899e+00 -0.403 0.686960
## NeighborhoodMitchel -2.367e+00 1.387e+00 -1.706 0.088303 .
## NeighborhoodNAmes -2.116e+00 1.278e+00 -1.656 0.098112 .
## NeighborhoodNoRidge 4.103e+00 1.358e+00 3.021 0.002591 **
## NeighborhoodNPkVill 1.214e+00 2.786e+00 0.436 0.663218
## NeighborhoodNridgHt 2.634e+00 1.195e+00 2.204 0.027764 *
## NeighborhoodNWAmes -2.323e+00 1.350e+00 -1.720 0.085764 .
## NeighborhoodOldTown -1.090e+00 1.635e+00 -0.667 0.505243
## NeighborhoodSawyer -9.059e-01 1.364e+00 -0.664 0.506723
## NeighborhoodSawyerW -2.794e-01 1.298e+00 -0.215 0.829670
## NeighborhoodSomerst -3.999e-01 1.428e+00 -0.280 0.779528
## NeighborhoodStoneBr 6.329e+00 1.372e+00 4.614 4.52e-06 ***
## NeighborhoodSWISU -6.314e-01 1.622e+00 -0.389 0.697206
## NeighborhoodTimber -1.828e+00 1.335e+00 -1.369 0.171185
## NeighborhoodVeenker 1.133e+00 1.876e+00 0.604 0.546017
## Condition1Feedr 3.456e-01 8.844e-01 0.391 0.696055
## Condition1Norm 1.786e+00 6.961e-01 2.566 0.010443 *
## Condition1PosA 9.717e-01 2.396e+00 0.406 0.685190
## Condition1PosN 3.099e-01 1.615e+00 0.192 0.847852
## Condition1RRAe -1.820e+00 1.644e+00 -1.107 0.268530
## Condition1RRAn 1.251e+00 1.127e+00 1.109 0.267610
## Condition1RRNe 1.514e+00 3.595e+00 0.421 0.673823
## Condition1RRNn -3.425e-01 2.293e+00 -0.149 0.881274
## Condition2Feedr 4.975e-01 3.767e+00 0.132 0.894971
## Condition2Norm 1.367e+00 3.231e+00 0.423 0.672175
## Condition2PosA 7.034e+00 6.371e+00 1.104 0.269800
## Condition2PosN -3.095e+01 4.424e+00 -6.996 5.14e-12 ***
## Condition2RRNn 2.767e+00 4.222e+00 0.655 0.512387
## BldgType2fmCon -2.009e+00 1.041e+00 -1.930 0.053880 .
## BldgTypeDuplex -4.347e+00 9.715e-01 -4.474 8.64e-06 ***
## BldgTypeTwnhs -3.283e+00 9.503e-01 -3.454 0.000578 ***
## BldgTypeTwnhsE -2.432e+00 6.501e-01 -3.741 0.000195 ***
## HouseStyle1.5Unf 2.319e+00 1.475e+00 1.573 0.116102
## HouseStyle1Story 2.001e+00 6.921e-01 2.891 0.003934 **
## HouseStyle2.5Fin -6.240e+00 1.932e+00 -3.231 0.001279 **
## HouseStyle2.5Unf -2.870e+00 1.641e+00 -1.749 0.080557 .
## HouseStyle2Story -7.926e-01 5.716e-01 -1.386 0.165951
## HouseStyleSFoyer 1.248e+00 1.102e+00 1.132 0.257933
## HouseStyleSLvl 1.049e+00 9.098e-01 1.153 0.249392
## RoofStyleGable 4.587e+00 4.285e+00 1.071 0.284593
## RoofStyleGambrel 4.729e+00 4.498e+00 1.052 0.293310
## RoofStyleHip 4.490e+00 4.291e+00 1.046 0.295621
## RoofStyleMansard 7.219e+00 4.795e+00 1.506 0.132515
## RoofMatlCompShg 9.547e+01 5.180e+00 18.432 < 2e-16 ***
## RoofMatlMembran 1.130e+02 8.243e+00 13.706 < 2e-16 ***
## RoofMatlRoll 9.720e+01 6.489e+00 14.978 < 2e-16 ***
## RoofMatlTar&Grv 9.638e+01 6.389e+00 15.086 < 2e-16 ***
## RoofMatlWdShake 9.253e+01 6.432e+00 14.387 < 2e-16 ***
## RoofMatlWdShngl 1.034e+02 5.365e+00 19.266 < 2e-16 ***
## Exterior1stBrkComm -7.932e+00 5.570e+00 -1.424 0.154801
## Exterior1stBrkFace 4.660e-01 2.178e+00 0.214 0.830612
## Exterior1stCBlock 3.672e-02 4.348e+00 0.008 0.993263
## Exterior1stCemntBd -2.684e+00 3.511e+00 -0.765 0.444754
## Exterior1stHdBoard -2.654e+00 2.202e+00 -1.205 0.228489
## Exterior1stImStucc -8.850e+00 4.267e+00 -2.074 0.038364 *
## Exterior1stMetalSd -2.375e-01 2.543e+00 -0.093 0.925586
## Exterior1stPlywood -2.883e+00 2.194e+00 -1.314 0.189053
## Exterior1stStone -2.088e+00 5.781e+00 -0.361 0.718061
## Exterior1stStucco -1.846e+00 2.430e+00 -0.760 0.447731
## Exterior1stVinylSd -2.280e+00 2.214e+00 -1.030 0.303230
## Exterior1stWd Sdng -1.724e+00 2.116e+00 -0.815 0.415539
## Exterior1stWdShing -1.467e+00 2.257e+00 -0.650 0.515871
## Exterior2ndAsphShn 2.015e+00 3.662e+00 0.550 0.582333
## Exterior2ndBrk Cmn 3.872e+00 3.647e+00 1.061 0.288756
## Exterior2ndBrkFace 1.186e+00 2.265e+00 0.524 0.600594
## Exterior2ndCBlock NA NA NA NA
## Exterior2ndCmentBd 3.092e+00 3.445e+00 0.898 0.369654
## Exterior2ndHdBoard 2.270e+00 2.136e+00 1.063 0.288172
## Exterior2ndImStucc 6.082e+00 2.366e+00 2.571 0.010310 *
## Exterior2ndMetalSd 9.759e-01 2.475e+00 0.394 0.693509
## Exterior2ndOther -2.023e+00 4.215e+00 -0.480 0.631350
## Exterior2ndPlywood 2.157e+00 2.063e+00 1.046 0.296056
## Exterior2ndStone 6.028e-01 4.231e+00 0.142 0.886740
## Exterior2ndStucco 2.186e+00 2.349e+00 0.930 0.352393
## Exterior2ndVinylSd 2.566e+00 2.135e+00 1.202 0.229715
## Exterior2ndWd Sdng 2.252e+00 2.028e+00 1.110 0.267266
## Exterior2ndWd Shng 1.683e+00 2.109e+00 0.798 0.425012
## MasVnrTypeBrkFace 1.228e+00 1.296e+00 0.947 0.343711
## MasVnrTypeNone 1.098e+00 1.279e+00 0.859 0.390785
## MasVnrTypeStone 2.145e+00 1.334e+00 1.607 0.108319
## ExterQualFa -3.481e+00 2.106e+00 -1.653 0.098720 .
## ExterQualGd -3.529e+00 7.840e-01 -4.501 7.65e-06 ***
## ExterQualTA -3.266e+00 8.900e-01 -3.670 0.000257 ***
## ExterCondFa 7.266e-01 3.982e+00 0.183 0.855229
## ExterCondGd -5.676e-01 3.816e+00 -0.149 0.881779
## ExterCondTA -2.909e-02 3.811e+00 -0.008 0.993910
## FoundationCBlock 7.934e-01 5.612e-01 1.414 0.157752
## FoundationPConc 6.422e-01 6.046e-01 1.062 0.288445
## FoundationStone 5.987e-01 1.723e+00 0.347 0.728351
## FoundationWood -5.060e+00 2.743e+00 -1.845 0.065377 .
## BsmtQualFa -1.424e+00 1.035e+00 -1.376 0.169073
## BsmtQualGd -2.479e+00 5.340e-01 -4.643 3.95e-06 ***
## BsmtQualTA -2.432e+00 6.869e-01 -3.541 0.000419 ***
## BsmtCondGd -2.889e-01 9.055e-01 -0.319 0.749741
## BsmtCondPo 5.890e+00 5.996e+00 0.982 0.326253
## BsmtCondTA 2.596e-02 7.288e-01 0.036 0.971596
## BsmtExposureGd 1.897e+00 5.253e-01 3.611 0.000321 ***
## BsmtExposureMn -8.482e-01 5.157e-01 -1.645 0.100375
## BsmtExposureNo -1.053e+00 3.698e-01 -2.848 0.004496 **
## BsmtFinType1BLQ 8.549e-02 4.970e-01 0.172 0.863459
## BsmtFinType1GLQ 7.884e-01 4.413e-01 1.787 0.074351 .
## BsmtFinType1LwQ -8.771e-01 6.505e-01 -1.348 0.177934
## BsmtFinType1Rec -1.649e-01 5.108e-01 -0.323 0.746982
## BsmtFinType1Unf 2.494e-01 5.063e-01 0.493 0.622435
## BsmtFinType2BLQ -1.818e-01 1.334e+00 -0.136 0.891607
## BsmtFinType2GLQ 1.069e-01 1.631e+00 0.066 0.947769
## BsmtFinType2LwQ -3.113e-01 1.295e+00 -0.240 0.810120
## BsmtFinType2Rec -4.743e-01 1.264e+00 -0.375 0.707562
## BsmtFinType2Unf 5.369e-01 1.130e+00 0.475 0.634658
## HeatingGasW -5.862e-01 1.158e+00 -0.506 0.612922
## HeatingGrav 2.426e+00 3.430e+00 0.707 0.479459
## HeatingOthW -5.547e-01 4.170e+00 -0.133 0.894210
## HeatingQCFa 2.954e-01 8.646e-01 0.342 0.732715
## HeatingQCGd -4.381e-01 3.632e-01 -1.206 0.228072
## HeatingQCPo 5.401e-02 4.215e+00 0.013 0.989779
## HeatingQCTA -3.809e-01 3.674e-01 -1.037 0.300162
## ElectricalFuseF -4.577e-01 1.173e+00 -0.390 0.696483
## ElectricalFuseP -1.970e-01 3.744e+00 -0.053 0.958041
## ElectricalMix NA NA NA NA
## ElectricalSBrkr 3.245e-01 5.238e-01 0.619 0.535779
## KitchenQualFa -2.778e+00 1.122e+00 -2.475 0.013489 *
## KitchenQualGd -3.364e+00 5.699e-01 -5.903 5.05e-09 ***
## KitchenQualTA -3.071e+00 6.661e-01 -4.611 4.58e-06 ***
## FunctionalMaj2 -2.709e+00 2.561e+00 -1.058 0.290504
## FunctionalMin1 5.606e-01 1.482e+00 0.378 0.705220
## FunctionalMin2 1.088e-01 1.451e+00 0.075 0.940276
## FunctionalMod -8.693e-01 1.875e+00 -0.464 0.643042
## FunctionalTyp 1.754e+00 1.254e+00 1.399 0.162172
## GarageTypeAttchd 3.044e+00 1.802e+00 1.689 0.091639 .
## GarageTypeBasment 3.123e+00 2.120e+00 1.473 0.141103
## GarageTypeBuiltIn 2.497e+00 1.900e+00 1.314 0.189078
## GarageTypeCarPort 5.387e+00 2.552e+00 2.111 0.035040 *
## GarageTypeDetchd 3.627e+00 1.796e+00 2.020 0.043702 *
## GarageFinishRFn -3.433e-01 3.387e-01 -1.014 0.310978
## GarageFinishUnf -2.593e-01 4.242e-01 -0.611 0.541082
## GarageQualFa -1.360e+01 4.583e+00 -2.967 0.003091 **
## GarageQualGd -1.190e+01 4.741e+00 -2.511 0.012216 *
## GarageQualPo -1.776e+01 6.453e+00 -2.752 0.006045 **
## GarageQualTA -1.304e+01 4.530e+00 -2.879 0.004085 **
## GarageCondFa 1.299e+01 5.234e+00 2.482 0.013251 *
## GarageCondGd 1.292e+01 5.567e+00 2.320 0.020567 *
## GarageCondPo 1.309e+01 5.742e+00 2.279 0.022881 *
## GarageCondTA 1.338e+01 5.169e+00 2.589 0.009785 **
## PavedDriveP -5.227e-01 1.028e+00 -0.509 0.611142
## PavedDriveY -2.432e-02 6.949e-01 -0.035 0.972083
## SaleTypeCon 3.046e+00 2.734e+00 1.114 0.265397
## SaleTypeConLD 2.356e+00 2.057e+00 1.146 0.252271
## SaleTypeConLI 5.484e-02 2.303e+00 0.024 0.981010
## SaleTypeConLw 7.081e-01 2.058e+00 0.344 0.730854
## SaleTypeCWD 1.139e+00 1.998e+00 0.570 0.568707
## SaleTypeNew 1.154e+00 2.539e+00 0.455 0.649541
## SaleTypeOth 3.038e+00 3.558e+00 0.854 0.393421
## SaleTypeWD -1.985e-01 7.402e-01 -0.268 0.788589
## SaleConditionAdjLand 4.699e+00 3.870e+00 1.214 0.224932
## SaleConditionAlloca 9.881e-01 1.775e+00 0.557 0.577802
## SaleConditionFamily -7.086e-01 1.005e+00 -0.705 0.480904
## SaleConditionNormal 5.878e-01 5.182e-01 1.134 0.256994
## SaleConditionPartial 1.210e+00 2.447e+00 0.494 0.621135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.417 on 900 degrees of freedom
## (366 observations deleted due to missingness)
## Multiple R-squared: 0.9279, Adjusted R-squared: 0.9125
## F-statistic: 60.04 on 193 and 900 DF, p-value: < 2.2e-16
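The summary above reports 366 observations deleted due to missingness, so lm1 is fit on 1094 of the 1460 rows. By default lm() drops any row with an NA in a variable used by the formula. The sketch below shows how to check this on a toy data frame with hypothetical columns y, x1, and x2.

```r
# Toy data with missing values in a numeric and a non-numeric predictor
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x1 = c(1, 2, NA, 4, 5, 6),
  x2 = c("a", "b", "a", NA, "b", "a"),
  stringsAsFactors = FALSE
)

fit <- lm(y ~ x1 + x2, data = toy)

# Rows actually used by the fit versus rows silently dropped
used <- nobs(fit)
dropped <- nrow(toy) - used

# Which predictors are responsible for the missing rows
na_per_column <- colSums(is.na(toy[c("x1", "x2")]))
na_per_column
```

Since each elimination can change which rows are complete, successive models in a backwards elimination may be fit on different subsets of the data.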
From model 2 below we identify five more predictors to batch eliminate based on p-values that are too high.
Exterior1st
Exterior2nd
Foundation
BsmtFinType1
BsmtFinType2
lm2 <- update(lm1, .~. -YearRemodAdd -TotRmsAbvGrd -LotShape -RoofStyle -MasVnrType -ExterCond -BsmtCond -Heating -HeatingQC -Electrical -Functional -GarageFinish -PavedDrive -SaleType -SaleCondition, data=train)
summary(lm2)
##
## Call:
## lm(formula = ssp ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + GrLivArea +
## GarageArea + MSZoning + LandContour + LotConfig + LandSlope +
## Neighborhood + Condition1 + Condition2 + BldgType + HouseStyle +
## RoofMatl + Exterior1st + Exterior2nd + ExterQual + Foundation +
## BsmtQual + BsmtExposure + BsmtFinType1 + BsmtFinType2 + KitchenQual +
## GarageType + GarageQual + GarageCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.395 -1.380 0.058 1.368 25.395
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.318e+02 2.617e+01 -8.860 < 2e-16 ***
## LotFrontage 1.296e-02 7.716e-03 1.680 0.093269 .
## LotArea 1.193e-04 1.971e-05 6.052 2.05e-09 ***
## OverallQual 1.044e+00 1.760e-01 5.932 4.17e-09 ***
## OverallCond 1.052e+00 1.321e-01 7.964 4.70e-15 ***
## YearBuilt 6.308e-02 1.225e-02 5.148 3.20e-07 ***
## BsmtUnfSF -2.890e-03 4.430e-04 -6.524 1.11e-10 ***
## TotalBsmtSF 6.868e-03 9.136e-04 7.518 1.28e-13 ***
## X1stFlrSF -4.457e-03 1.242e-03 -3.588 0.000351 ***
## GrLivArea 1.050e-02 8.303e-04 12.651 < 2e-16 ***
## GarageArea 3.720e-03 8.963e-04 4.150 3.61e-05 ***
## MSZoningFV 5.693e+00 1.903e+00 2.991 0.002851 **
## MSZoningRH 4.887e+00 2.077e+00 2.353 0.018849 *
## MSZoningRL 4.671e+00 1.670e+00 2.797 0.005256 **
## MSZoningRM 4.100e+00 1.534e+00 2.672 0.007664 **
## LandContourHLS 1.320e+00 8.948e-01 1.475 0.140487
## LandContourLow -3.149e+00 1.336e+00 -2.358 0.018569 *
## LandContourLvl 7.412e-01 6.618e-01 1.120 0.263032
## LotConfigCulDSac 1.910e+00 6.924e-01 2.758 0.005931 **
## LotConfigFR2 -1.627e+00 7.517e-01 -2.165 0.030637 *
## LotConfigFR3 -2.463e+00 1.912e+00 -1.288 0.198033
## LotConfigInside -3.349e-02 3.166e-01 -0.106 0.915771
## LandSlopeMod 5.300e-01 7.046e-01 0.752 0.452087
## LandSlopeSev -6.814e+00 2.143e+00 -3.180 0.001520 **
## NeighborhoodBlueste 1.322e-01 2.907e+00 0.045 0.963748
## NeighborhoodBrDale 5.991e-01 1.743e+00 0.344 0.731075
## NeighborhoodBrkSide -6.242e-01 1.551e+00 -0.402 0.687420
## NeighborhoodClearCr -2.809e+00 1.660e+00 -1.693 0.090841 .
## NeighborhoodCollgCr -2.108e+00 1.122e+00 -1.878 0.060688 .
## NeighborhoodCrawfor 1.498e+00 1.373e+00 1.092 0.275318
## NeighborhoodEdwards -3.437e+00 1.261e+00 -2.726 0.006536 **
## NeighborhoodGilbert -1.909e+00 1.233e+00 -1.548 0.122032
## NeighborhoodIDOTRR -7.177e-01 1.754e+00 -0.409 0.682541
## NeighborhoodMeadowV -1.846e+00 1.849e+00 -0.998 0.318426
## NeighborhoodMitchel -3.482e+00 1.336e+00 -2.607 0.009282 **
## NeighborhoodNAmes -3.067e+00 1.224e+00 -2.506 0.012381 *
## NeighborhoodNoRidge 3.290e+00 1.305e+00 2.521 0.011849 *
## NeighborhoodNPkVill 3.613e-01 2.740e+00 0.132 0.895146
## NeighborhoodNridgHt 2.510e+00 1.136e+00 2.211 0.027301 *
## NeighborhoodNWAmes -3.264e+00 1.298e+00 -2.515 0.012069 *
## NeighborhoodOldTown -2.010e+00 1.549e+00 -1.298 0.194736
## NeighborhoodSawyer -1.913e+00 1.311e+00 -1.459 0.144942
## NeighborhoodSawyerW -1.168e+00 1.237e+00 -0.945 0.345094
## NeighborhoodSomerst 8.796e-02 1.350e+00 0.065 0.948077
## NeighborhoodStoneBr 5.845e+00 1.333e+00 4.386 1.28e-05 ***
## NeighborhoodSWISU -1.352e+00 1.553e+00 -0.870 0.384506
## NeighborhoodTimber -2.443e+00 1.298e+00 -1.882 0.060100 .
## NeighborhoodVeenker 4.908e-01 1.802e+00 0.272 0.785362
## Condition1Feedr 3.332e-01 8.636e-01 0.386 0.699742
## Condition1Norm 1.683e+00 6.678e-01 2.521 0.011869 *
## Condition1PosA 5.758e-01 2.297e+00 0.251 0.802111
## Condition1PosN 5.628e-01 1.583e+00 0.355 0.722301
## Condition1RRAe -2.016e+00 1.494e+00 -1.350 0.177481
## Condition1RRAn 1.326e+00 1.086e+00 1.221 0.222318
## Condition1RRNe 1.325e+00 3.599e+00 0.368 0.712904
## Condition1RRNn 5.144e-03 2.245e+00 0.002 0.998173
## Condition2Feedr 8.947e-01 3.680e+00 0.243 0.807977
## Condition2Norm 2.018e+00 3.148e+00 0.641 0.521519
## Condition2PosA 7.881e+00 5.006e+00 1.574 0.115775
## Condition2PosN -2.969e+01 4.323e+00 -6.867 1.18e-11 ***
## Condition2RRNn 3.297e+00 4.167e+00 0.791 0.428966
## BldgType2fmCon -1.505e+00 9.432e-01 -1.596 0.110913
## BldgTypeDuplex -4.325e+00 8.965e-01 -4.824 1.64e-06 ***
## BldgTypeTwnhs -3.419e+00 8.996e-01 -3.800 0.000154 ***
## BldgTypeTwnhsE -2.521e+00 6.115e-01 -4.123 4.06e-05 ***
## HouseStyle1.5Unf 2.446e+00 1.334e+00 1.834 0.066952 .
## HouseStyle1Story 2.023e+00 6.655e-01 3.040 0.002433 **
## HouseStyle2.5Fin -5.366e+00 1.890e+00 -2.839 0.004626 **
## HouseStyle2.5Unf -2.331e+00 1.456e+00 -1.601 0.109747
## HouseStyle2Story -6.128e-01 5.476e-01 -1.119 0.263369
## HouseStyleSFoyer 1.104e+00 1.074e+00 1.028 0.304368
## HouseStyleSLvl 8.281e-01 8.721e-01 0.950 0.342596
## RoofMatlCompShg 9.225e+01 4.900e+00 18.828 < 2e-16 ***
## RoofMatlMembran 1.041e+02 6.741e+00 15.441 < 2e-16 ***
## RoofMatlRoll 9.474e+01 6.235e+00 15.195 < 2e-16 ***
## RoofMatlTar&Grv 8.969e+01 5.223e+00 17.173 < 2e-16 ***
## RoofMatlWdShake 9.236e+01 5.795e+00 15.937 < 2e-16 ***
## RoofMatlWdShngl 9.975e+01 5.112e+00 19.511 < 2e-16 ***
## Exterior1stBrkComm -1.038e+01 5.061e+00 -2.051 0.040540 *
## Exterior1stBrkFace 4.492e-01 2.119e+00 0.212 0.832194
## Exterior1stCBlock 1.625e+00 4.271e+00 0.381 0.703632
## Exterior1stCemntBd -2.250e+00 3.396e+00 -0.662 0.507839
## Exterior1stHdBoard -2.161e+00 2.133e+00 -1.013 0.311358
## Exterior1stImStucc -7.692e+00 4.233e+00 -1.817 0.069524 .
## Exterior1stMetalSd -7.480e-01 2.454e+00 -0.305 0.760613
## Exterior1stPlywood -2.369e+00 2.124e+00 -1.115 0.265057
## Exterior1stStone -1.364e+00 5.686e+00 -0.240 0.810476
## Exterior1stStucco -1.177e+00 2.329e+00 -0.506 0.613302
## Exterior1stVinylSd -2.044e+00 2.163e+00 -0.945 0.344781
## Exterior1stWd Sdng -1.587e+00 2.071e+00 -0.766 0.443780
## Exterior1stWdShing -8.988e-01 2.201e+00 -0.408 0.683067
## Exterior2ndAsphShn 4.044e+00 3.281e+00 1.233 0.218022
## Exterior2ndBrk Cmn 3.863e+00 3.568e+00 1.083 0.279149
## Exterior2ndBrkFace 1.638e+00 2.208e+00 0.742 0.458239
## Exterior2ndCBlock NA NA NA NA
## Exterior2ndCmentBd 3.095e+00 3.336e+00 0.928 0.353797
## Exterior2ndHdBoard 1.901e+00 2.048e+00 0.928 0.353576
## Exterior2ndImStucc 5.069e+00 2.281e+00 2.222 0.026492 *
## Exterior2ndMetalSd 1.584e+00 2.382e+00 0.665 0.506316
## Exterior2ndOther -5.194e-01 4.170e+00 -0.125 0.900897
## Exterior2ndPlywood 1.770e+00 1.982e+00 0.893 0.371889
## Exterior2ndStone 1.218e-01 4.193e+00 0.029 0.976829
## Exterior2ndStucco 1.646e+00 2.201e+00 0.748 0.454785
## Exterior2ndVinylSd 2.688e+00 2.080e+00 1.292 0.196634
## Exterior2ndWd Sdng 2.208e+00 1.964e+00 1.124 0.261161
## Exterior2ndWd Shng 1.247e+00 2.052e+00 0.608 0.543636
## ExterQualFa -3.922e+00 1.999e+00 -1.963 0.049988 *
## ExterQualGd -3.864e+00 7.633e-01 -5.062 4.97e-07 ***
## ExterQualTA -3.626e+00 8.694e-01 -4.171 3.30e-05 ***
## FoundationCBlock 5.432e-01 5.378e-01 1.010 0.312777
## FoundationPConc 5.290e-01 5.836e-01 0.906 0.364908
## FoundationStone 4.195e-01 1.599e+00 0.262 0.793103
## FoundationWood -5.557e+00 2.727e+00 -2.038 0.041800 *
## BsmtQualFa -2.233e+00 9.789e-01 -2.281 0.022782 *
## BsmtQualGd -3.016e+00 5.251e-01 -5.743 1.25e-08 ***
## BsmtQualTA -2.946e+00 6.680e-01 -4.410 1.15e-05 ***
## BsmtExposureGd 1.985e+00 5.177e-01 3.834 0.000134 ***
## BsmtExposureMn -8.730e-01 5.111e-01 -1.708 0.087930 .
## BsmtExposureNo -1.170e+00 3.640e-01 -3.214 0.001354 **
## BsmtFinType1BLQ 2.138e-01 4.868e-01 0.439 0.660649
## BsmtFinType1GLQ 9.523e-01 4.333e-01 2.198 0.028206 *
## BsmtFinType1LwQ -9.165e-01 6.341e-01 -1.445 0.148684
## BsmtFinType1Rec -1.863e-01 4.974e-01 -0.375 0.708096
## BsmtFinType1Unf 4.898e-01 4.910e-01 0.998 0.318761
## BsmtFinType2BLQ -2.780e-01 1.304e+00 -0.213 0.831225
## BsmtFinType2GLQ -3.146e-01 1.596e+00 -0.197 0.843769
## BsmtFinType2LwQ -4.751e-01 1.270e+00 -0.374 0.708361
## BsmtFinType2Rec -7.747e-01 1.236e+00 -0.627 0.531041
## BsmtFinType2Unf 4.348e-01 1.093e+00 0.398 0.690910
## KitchenQualFa -3.201e+00 1.070e+00 -2.993 0.002837 **
## KitchenQualGd -3.464e+00 5.620e-01 -6.164 1.05e-09 ***
## KitchenQualTA -3.386e+00 6.493e-01 -5.215 2.25e-07 ***
## GarageTypeAttchd 3.805e+00 1.768e+00 2.153 0.031606 *
## GarageTypeBasment 4.514e+00 2.059e+00 2.193 0.028575 *
## GarageTypeBuiltIn 3.600e+00 1.853e+00 1.943 0.052327 .
## GarageTypeCarPort 6.391e+00 2.479e+00 2.578 0.010074 *
## GarageTypeDetchd 4.257e+00 1.754e+00 2.428 0.015386 *
## GarageQualFa -1.453e+01 4.474e+00 -3.248 0.001204 **
## GarageQualGd -1.298e+01 4.624e+00 -2.807 0.005106 **
## GarageQualPo -1.682e+01 5.490e+00 -3.064 0.002246 **
## GarageQualTA -1.404e+01 4.422e+00 -3.175 0.001547 **
## GarageCondFa 1.383e+01 5.129e+00 2.697 0.007130 **
## GarageCondGd 1.371e+01 5.461e+00 2.511 0.012194 *
## GarageCondPo 1.334e+01 5.604e+00 2.380 0.017502 *
## GarageCondTA 1.380e+01 5.064e+00 2.724 0.006569 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.449 on 957 degrees of freedom
## (359 observations deleted due to missingness)
## Multiple R-squared: 0.9227, Adjusted R-squared: 0.9112
## F-statistic: 79.94 on 143 and 957 DF, p-value: < 2.2e-16
From model 3 below we identify two more predictors to batch eliminate because their p-values are too high. They were on the border of being eliminated in model 2, and they appear to have even less impact in model 3.
LotFrontage
LandContour
lm3 <- update(lm2, .~. -Exterior1st -Exterior2nd -Foundation -BsmtFinType1 -BsmtFinType2, data=train)
summary(lm3)
##
## Call:
## lm(formula = ssp ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + GrLivArea +
## GarageArea + MSZoning + LandContour + LotConfig + LandSlope +
## Neighborhood + Condition1 + Condition2 + BldgType + HouseStyle +
## RoofMatl + ExterQual + BsmtQual + BsmtExposure + KitchenQual +
## GarageType + GarageQual + GarageCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.4647 -1.5244 0.0051 1.3967 25.4647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.393e+02 2.277e+01 -10.513 < 2e-16 ***
## LotFrontage 1.002e-02 7.503e-03 1.336 0.181847
## LotArea 1.178e-04 1.906e-05 6.182 9.22e-10 ***
## OverallQual 1.085e+00 1.697e-01 6.392 2.50e-10 ***
## OverallCond 1.031e+00 1.267e-01 8.133 1.23e-15 ***
## YearBuilt 6.969e-02 1.052e-02 6.625 5.65e-11 ***
## BsmtUnfSF -2.628e-03 3.026e-04 -8.685 < 2e-16 ***
## TotalBsmtSF 6.187e-03 7.995e-04 7.738 2.46e-14 ***
## X1stFlrSF -3.938e-03 1.189e-03 -3.313 0.000958 ***
## GrLivArea 1.041e-02 8.073e-04 12.891 < 2e-16 ***
## GarageArea 3.722e-03 8.776e-04 4.241 2.43e-05 ***
## MSZoningFV 5.122e+00 1.870e+00 2.738 0.006283 **
## MSZoningRH 4.400e+00 2.035e+00 2.162 0.030885 *
## MSZoningRL 4.121e+00 1.629e+00 2.530 0.011545 *
## MSZoningRM 3.581e+00 1.510e+00 2.372 0.017893 *
## LandContourHLS 1.370e+00 8.610e-01 1.592 0.111791
## LandContourLow -2.482e+00 1.321e+00 -1.878 0.060666 .
## LandContourLvl 7.501e-01 6.358e-01 1.180 0.238395
## LotConfigCulDSac 1.618e+00 6.853e-01 2.361 0.018393 *
## LotConfigFR2 -1.691e+00 7.430e-01 -2.276 0.023083 *
## LotConfigFR3 -2.155e+00 1.914e+00 -1.126 0.260601
## LotConfigInside -1.040e-01 3.109e-01 -0.335 0.737979
## LandSlopeMod 3.104e-01 6.796e-01 0.457 0.647957
## LandSlopeSev -6.748e+00 2.138e+00 -3.156 0.001647 **
## NeighborhoodBlueste -5.888e-01 2.796e+00 -0.211 0.833224
## NeighborhoodBrDale 1.427e-01 1.676e+00 0.085 0.932183
## NeighborhoodBrkSide -1.717e-01 1.507e+00 -0.114 0.909267
## NeighborhoodClearCr -2.453e+00 1.611e+00 -1.523 0.128066
## NeighborhoodCollgCr -1.794e+00 1.115e+00 -1.609 0.107896
## NeighborhoodCrawfor 1.870e+00 1.338e+00 1.398 0.162335
## NeighborhoodEdwards -3.404e+00 1.242e+00 -2.742 0.006217 **
## NeighborhoodGilbert -1.727e+00 1.218e+00 -1.418 0.156422
## NeighborhoodIDOTRR -9.836e-02 1.708e+00 -0.058 0.954082
## NeighborhoodMeadowV -1.589e+00 1.701e+00 -0.934 0.350452
## NeighborhoodMitchel -3.526e+00 1.309e+00 -2.693 0.007191 **
## NeighborhoodNAmes -2.924e+00 1.193e+00 -2.452 0.014364 *
## NeighborhoodNoRidge 3.758e+00 1.274e+00 2.950 0.003256 **
## NeighborhoodNPkVill 4.216e-01 1.725e+00 0.244 0.806937
## NeighborhoodNridgHt 2.425e+00 1.133e+00 2.141 0.032502 *
## NeighborhoodNWAmes -3.690e+00 1.245e+00 -2.964 0.003113 **
## NeighborhoodOldTown -1.670e+00 1.514e+00 -1.103 0.270287
## NeighborhoodSawyer -2.263e+00 1.286e+00 -1.759 0.078814 .
## NeighborhoodSawyerW -1.479e+00 1.194e+00 -1.239 0.215619
## NeighborhoodSomerst 3.526e-01 1.340e+00 0.263 0.792460
## NeighborhoodStoneBr 5.593e+00 1.298e+00 4.309 1.80e-05 ***
## NeighborhoodSWISU -1.194e+00 1.537e+00 -0.777 0.437375
## NeighborhoodTimber -2.345e+00 1.277e+00 -1.836 0.066625 .
## NeighborhoodVeenker 7.326e-01 1.723e+00 0.425 0.670796
## Condition1Feedr 2.537e-01 8.495e-01 0.299 0.765297
## Condition1Norm 1.443e+00 6.586e-01 2.191 0.028698 *
## Condition1PosA 1.924e+00 2.158e+00 0.892 0.372840
## Condition1PosN 1.136e+00 1.560e+00 0.728 0.466876
## Condition1RRAe -1.531e+00 1.488e+00 -1.029 0.303957
## Condition1RRAn 1.106e+00 1.074e+00 1.030 0.303358
## Condition1RRNe 1.157e+00 3.606e+00 0.321 0.748414
## Condition1RRNn -5.344e-02 2.149e+00 -0.025 0.980171
## Condition2Feedr 4.055e-01 3.649e+00 0.111 0.911549
## Condition2Norm 1.509e+00 3.124e+00 0.483 0.629228
## Condition2PosA 7.333e+00 4.985e+00 1.471 0.141621
## Condition2PosN -3.044e+01 4.287e+00 -7.100 2.37e-12 ***
## Condition2RRNn 1.849e+00 4.119e+00 0.449 0.653717
## BldgType2fmCon -1.440e+00 9.224e-01 -1.561 0.118742
## BldgTypeDuplex -4.359e+00 8.730e-01 -4.993 7.01e-07 ***
## BldgTypeTwnhs -3.232e+00 8.689e-01 -3.720 0.000210 ***
## BldgTypeTwnhsE -2.363e+00 5.866e-01 -4.029 6.02e-05 ***
## HouseStyle1.5Unf 2.664e+00 1.326e+00 2.009 0.044792 *
## HouseStyle1Story 2.008e+00 6.432e-01 3.122 0.001848 **
## HouseStyle2.5Fin -4.888e+00 1.856e+00 -2.634 0.008563 **
## HouseStyle2.5Unf -2.398e+00 1.423e+00 -1.685 0.092274 .
## HouseStyle2Story -5.532e-01 5.269e-01 -1.050 0.294062
## HouseStyleSFoyer 1.559e+00 1.047e+00 1.489 0.136689
## HouseStyleSLvl 8.815e-01 8.331e-01 1.058 0.290236
## RoofMatlCompShg 9.029e+01 4.616e+00 19.559 < 2e-16 ***
## RoofMatlMembran 9.885e+01 6.364e+00 15.533 < 2e-16 ***
## RoofMatlRoll 9.243e+01 5.888e+00 15.700 < 2e-16 ***
## RoofMatlTar&Grv 8.730e+01 4.938e+00 17.680 < 2e-16 ***
## RoofMatlWdShake 8.959e+01 5.564e+00 16.103 < 2e-16 ***
## RoofMatlWdShngl 9.697e+01 4.833e+00 20.063 < 2e-16 ***
## ExterQualFa -4.193e+00 1.789e+00 -2.344 0.019280 *
## ExterQualGd -3.946e+00 7.476e-01 -5.279 1.59e-07 ***
## ExterQualTA -3.923e+00 8.490e-01 -4.621 4.32e-06 ***
## BsmtQualFa -2.518e+00 9.632e-01 -2.615 0.009069 **
## BsmtQualGd -3.327e+00 5.108e-01 -6.512 1.17e-10 ***
## BsmtQualTA -3.306e+00 6.437e-01 -5.136 3.37e-07 ***
## BsmtExposureGd 2.252e+00 5.091e-01 4.424 1.07e-05 ***
## BsmtExposureMn -7.599e-01 5.054e-01 -1.503 0.133057
## BsmtExposureNo -1.105e+00 3.588e-01 -3.081 0.002121 **
## KitchenQualFa -3.491e+00 1.048e+00 -3.332 0.000894 ***
## KitchenQualGd -3.678e+00 5.500e-01 -6.688 3.77e-11 ***
## KitchenQualTA -3.762e+00 6.316e-01 -5.957 3.56e-09 ***
## GarageTypeAttchd 3.664e+00 1.697e+00 2.160 0.031024 *
## GarageTypeBasment 4.101e+00 1.978e+00 2.074 0.038354 *
## GarageTypeBuiltIn 3.547e+00 1.785e+00 1.986 0.047269 *
## GarageTypeCarPort 5.238e+00 2.348e+00 2.231 0.025921 *
## GarageTypeDetchd 3.976e+00 1.680e+00 2.366 0.018174 *
## GarageQualFa -1.422e+01 4.480e+00 -3.174 0.001547 **
## GarageQualGd -1.312e+01 4.622e+00 -2.838 0.004629 **
## GarageQualPo -1.685e+01 5.465e+00 -3.084 0.002100 **
## GarageQualTA -1.395e+01 4.425e+00 -3.153 0.001664 **
## GarageCondFa 1.404e+01 5.124e+00 2.741 0.006243 **
## GarageCondGd 1.351e+01 5.440e+00 2.483 0.013195 *
## GarageCondPo 1.377e+01 5.591e+00 2.462 0.013979 *
## GarageCondTA 1.412e+01 5.061e+00 2.790 0.005371 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.48 on 999 degrees of freedom
## (358 observations deleted due to missingness)
## Multiple R-squared: 0.918, Adjusted R-squared: 0.9096
## F-statistic: 109.6 on 102 and 999 DF, p-value: < 2.2e-16
From model 4 below we identify one more predictor to eliminate based on p-values that are too high. While one of the predictor’s levels has a p-value below 0.05, the average across all four is well above 0.05. (Note: the predictor has a fifth level, but it is the baseline with no coefficient in the model. It’s the equivalent of selecting none of the other four options.)
However, there are six more variables to consider removing. In the Predictor Triage section, we remove each one in turn and compare just the resulting Adjusted R-Squared and the number of observations dropped for missingness, for each candidate version of Model 5 against Model 4, so we can work out an order in which to remove them.
LotConfig
LandSlope
Neighborhood
Condition1
Condition2
HouseStyle
GarageType
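As a complement to averaging the per-level p-values, a multi-level predictor can also be judged as a whole with a term-level F-test, which base R provides through drop1(). The sketch below is illustrative only: it uses the built-in mtcars data rather than the Ames data, and the names cars_df, fit, term_tests, and term_p are hypothetical.

```r
# Illustrative data only: mtcars ships with base R; it is not the Ames data.
cars_df <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

fit <- lm(mpg ~ wt + hp + cyl + gear, data = cars_df)

# Per-level t-tests: one row per non-baseline level of each factor.
print(coef(summary(fit)))

# Term-level F-tests: one row per predictor, however many levels it has.
term_tests <- drop1(fit, test = "F")
print(term_tests)

# p-value for each whole term; the first row is "<none>" and has no p-value.
term_p <- setNames(term_tests[["Pr(>F)"]], rownames(term_tests))
print(round(term_p[-1], 4))
```

A factor whose term-level p-value is high is a candidate for removal even if one of its individual levels looks significant on its own.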
lm4 <- update(lm3, .~. -LotFrontage -LandContour, data=train)
summary(lm4)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + X1stFlrSF + GrLivArea + GarageArea +
## MSZoning + LotConfig + LandSlope + Neighborhood + Condition1 +
## Condition2 + BldgType + HouseStyle + RoofMatl + ExterQual +
## BsmtQual + BsmtExposure + KitchenQual + GarageType + GarageQual +
## GarageCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.487 -1.397 0.000 1.467 25.487
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.258e+02 2.038e+01 -11.075 < 2e-16 ***
## LotArea 9.522e-05 1.360e-05 7.000 4.18e-12 ***
## OverallQual 1.176e+00 1.468e-01 8.010 2.61e-15 ***
## OverallCond 9.617e-01 1.094e-01 8.787 < 2e-16 ***
## YearBuilt 6.657e-02 9.484e-03 7.020 3.64e-12 ***
## BsmtUnfSF -2.390e-03 2.703e-04 -8.841 < 2e-16 ***
## TotalBsmtSF 5.639e-03 6.842e-04 8.241 4.27e-16 ***
## X1stFlrSF -3.897e-03 1.013e-03 -3.847 0.000126 ***
## GrLivArea 1.054e-02 6.977e-04 15.101 < 2e-16 ***
## GarageArea 3.965e-03 7.701e-04 5.148 3.05e-07 ***
## MSZoningFV 5.256e+00 1.765e+00 2.978 0.002956 **
## MSZoningRH 4.316e+00 1.871e+00 2.306 0.021272 *
## MSZoningRL 4.325e+00 1.528e+00 2.831 0.004716 **
## MSZoningRM 3.822e+00 1.436e+00 2.662 0.007861 **
## LotConfigCulDSac 8.916e-01 4.554e-01 1.958 0.050492 .
## LotConfigFR2 -1.344e+00 5.950e-01 -2.259 0.024083 *
## LotConfigFR3 -2.130e+00 1.857e+00 -1.147 0.251663
## LotConfigInside -9.965e-02 2.573e-01 -0.387 0.698632
## LandSlopeMod 1.860e-01 5.231e-01 0.356 0.722266
## LandSlopeSev -6.043e+00 1.494e+00 -4.045 5.55e-05 ***
## NeighborhoodBlueste -8.443e-01 2.686e+00 -0.314 0.753367
## NeighborhoodBrDale -5.399e-02 1.533e+00 -0.035 0.971902
## NeighborhoodBrkSide -1.921e-01 1.339e+00 -0.143 0.885946
## NeighborhoodClearCr -2.528e+00 1.265e+00 -1.998 0.045939 *
## NeighborhoodCollgCr -1.860e+00 9.921e-01 -1.875 0.060996 .
## NeighborhoodCrawfor 1.812e+00 1.168e+00 1.551 0.121129
## NeighborhoodEdwards -3.137e+00 1.112e+00 -2.821 0.004870 **
## NeighborhoodGilbert -1.533e+00 1.061e+00 -1.444 0.149051
## NeighborhoodIDOTRR -4.464e-01 1.530e+00 -0.292 0.770521
## NeighborhoodMeadowV -1.541e+00 1.502e+00 -1.026 0.305018
## NeighborhoodMitchel -3.306e+00 1.143e+00 -2.891 0.003904 **
## NeighborhoodNAmes -2.974e+00 1.059e+00 -2.807 0.005073 **
## NeighborhoodNoRidge 3.624e+00 1.128e+00 3.214 0.001343 **
## NeighborhoodNPkVill 1.922e-01 1.496e+00 0.128 0.897825
## NeighborhoodNridgHt 2.336e+00 1.023e+00 2.284 0.022531 *
## NeighborhoodNWAmes -3.389e+00 1.088e+00 -3.116 0.001876 **
## NeighborhoodOldTown -1.901e+00 1.354e+00 -1.404 0.160697
## NeighborhoodSawyer -2.414e+00 1.122e+00 -2.151 0.031664 *
## NeighborhoodSawyerW -1.529e+00 1.072e+00 -1.427 0.153965
## NeighborhoodSomerst 1.031e-01 1.234e+00 0.084 0.933435
## NeighborhoodStoneBr 4.208e+00 1.131e+00 3.722 0.000206 ***
## NeighborhoodSWISU -1.797e+00 1.393e+00 -1.290 0.197257
## NeighborhoodTimber -2.497e+00 1.124e+00 -2.222 0.026475 *
## NeighborhoodVeenker -4.497e-01 1.423e+00 -0.316 0.752062
## Condition1Feedr 4.659e-01 7.842e-01 0.594 0.552578
## Condition1Norm 1.392e+00 6.281e-01 2.217 0.026837 *
## Condition1PosA 9.222e-01 1.425e+00 0.647 0.517597
## Condition1PosN 1.598e+00 1.077e+00 1.484 0.138065
## Condition1RRAe -2.244e+00 1.320e+00 -1.700 0.089435 .
## Condition1RRAn 1.028e+00 1.001e+00 1.028 0.304370
## Condition1RRNe -2.716e-01 2.567e+00 -0.106 0.915776
## Condition1RRNn 8.310e-01 1.841e+00 0.451 0.651790
## Condition2Feedr -1.495e+00 3.481e+00 -0.429 0.667729
## Condition2Norm 2.510e-02 2.974e+00 0.008 0.993268
## Condition2PosA 5.007e+00 4.775e+00 1.049 0.294513
## Condition2PosN -3.191e+01 4.010e+00 -7.958 3.89e-15 ***
## Condition2RRAe -3.081e+00 4.719e+00 -0.653 0.514028
## Condition2RRAn -1.289e-01 4.660e+00 -0.028 0.977933
## Condition2RRNn 4.235e-01 3.942e+00 0.107 0.914450
## BldgType2fmCon -1.818e+00 8.560e-01 -2.124 0.033874 *
## BldgTypeDuplex -4.124e+00 7.948e-01 -5.189 2.47e-07 ***
## BldgTypeTwnhs -3.566e+00 7.667e-01 -4.651 3.65e-06 ***
## BldgTypeTwnhsE -2.569e+00 5.034e-01 -5.103 3.86e-07 ***
## HouseStyle1.5Unf 2.598e+00 1.233e+00 2.107 0.035329 *
## HouseStyle1Story 2.118e+00 5.693e-01 3.721 0.000208 ***
## HouseStyle2.5Fin -4.996e+00 1.759e+00 -2.840 0.004586 **
## HouseStyle2.5Unf -1.762e+00 1.278e+00 -1.379 0.168287
## HouseStyle2Story -6.911e-01 4.657e-01 -1.484 0.138068
## HouseStyleSFoyer 1.412e+00 9.138e-01 1.546 0.122466
## HouseStyleSLvl 1.076e+00 6.932e-01 1.552 0.120941
## RoofMatlCompShg 8.625e+01 4.136e+00 20.855 < 2e-16 ***
## RoofMatlMembran 9.250e+01 5.647e+00 16.380 < 2e-16 ***
## RoofMatlMetal 9.308e+01 5.708e+00 16.307 < 2e-16 ***
## RoofMatlRoll 8.869e+01 5.480e+00 16.183 < 2e-16 ***
## RoofMatlTar&Grv 8.452e+01 4.355e+00 19.407 < 2e-16 ***
## RoofMatlWdShake 8.759e+01 4.515e+00 19.399 < 2e-16 ***
## RoofMatlWdShngl 9.218e+01 4.323e+00 21.321 < 2e-16 ***
## ExterQualFa -3.686e+00 1.673e+00 -2.204 0.027725 *
## ExterQualGd -3.429e+00 6.742e-01 -5.086 4.22e-07 ***
## ExterQualTA -3.461e+00 7.502e-01 -4.614 4.37e-06 ***
## BsmtQualFa -2.743e+00 9.053e-01 -3.030 0.002499 **
## BsmtQualGd -3.521e+00 4.678e-01 -7.527 9.94e-14 ***
## BsmtQualTA -3.434e+00 5.743e-01 -5.980 2.91e-09 ***
## BsmtExposureGd 2.342e+00 4.339e-01 5.398 8.05e-08 ***
## BsmtExposureMn -5.742e-01 4.435e-01 -1.295 0.195730
## BsmtExposureNo -1.101e+00 3.162e-01 -3.483 0.000513 ***
## KitchenQualFa -3.929e+00 9.490e-01 -4.140 3.71e-05 ***
## KitchenQualGd -3.906e+00 5.044e-01 -7.743 2.00e-14 ***
## KitchenQualTA -4.081e+00 5.658e-01 -7.213 9.46e-13 ***
## GarageTypeAttchd 3.385e+00 1.527e+00 2.217 0.026835 *
## GarageTypeBasment 3.200e+00 1.756e+00 1.822 0.068681 .
## GarageTypeBuiltIn 3.276e+00 1.597e+00 2.051 0.040453 *
## GarageTypeCarPort 2.539e+00 2.102e+00 1.208 0.227334
## GarageTypeDetchd 3.520e+00 1.514e+00 2.325 0.020206 *
## GarageQualFa -1.434e+01 4.311e+00 -3.327 0.000903 ***
## GarageQualGd -1.307e+01 4.420e+00 -2.958 0.003158 **
## GarageQualPo -1.662e+01 5.145e+00 -3.230 0.001269 **
## GarageQualTA -1.422e+01 4.263e+00 -3.335 0.000878 ***
## GarageCondFa 1.379e+01 4.965e+00 2.778 0.005560 **
## GarageCondGd 1.342e+01 5.138e+00 2.612 0.009099 **
## GarageCondPo 1.361e+01 5.312e+00 2.563 0.010508 *
## GarageCondTA 1.417e+01 4.911e+00 2.886 0.003974 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.418 on 1246 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.9103, Adjusted R-squared: 0.903
## F-statistic: 125.2 on 101 and 1246 DF, p-value: < 2.2e-16
After running the seven versions of model 5 below, we note the following Adjusted R-Squared values. None of the removals added records with missing values back in, so the Adjusted R-Squared values can be compared directly.
Predictor removed: Adjusted R-Squared
LotConfig: 0.9023538
LandSlope: 0.9018605
Neighborhood: 0.8860973
Condition1: 0.9022634
Condition2: 0.8922981
HouseStyle: 0.9016426
GarageType: 0.9029784
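The comparison above only holds when the models are fit on the same rows. When removing a predictor does bring rows back, one option is to refit the smaller model on the rows the larger model used. This is a minimal sketch under stated assumptions: it uses the built-in airquality data (which has missing values in Ozone and Solar.R) rather than the Ames data, and vars_big, fit_big, fit_small, common, and fit_small_common are hypothetical names.

```r
# Illustrative data only: airquality ships with base R and contains NAs.
vars_big <- c("Ozone", "Solar.R", "Wind", "Temp")

fit_big   <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
fit_small <- update(fit_big, . ~ . - Solar.R)  # rows with NA only in Solar.R return

# Different row counts: the two fits describe different samples.
print(c(big = nobs(fit_big), small = nobs(fit_small)))

# Refit the smaller model on exactly the rows the larger model used.
common <- airquality[complete.cases(airquality[, vars_big]), ]
fit_small_common <- lm(Ozone ~ Wind + Temp, data = common)

# Now the adjusted R-squared values are computed on the same observations.
print(c(big          = summary(fit_big)$adj.r.squared,
        small_common = summary(fit_small_common)$adj.r.squared))
```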
Here we try removing LotConfig.
lm5a <- update(lm4, .~. -LotConfig, data=train)
summary(lm5a)$adj.r.squared
## [1] 0.9023538
length(summary(lm5a)$na.action)
## [1] 112
summary(lm4)$adj.r.squared
## [1] 0.9030483
length(summary(lm4)$na.action)
## [1] 112
Here we try removing LandSlope.
lm5b <- update(lm4, .~. -LandSlope, data=train)
summary(lm5b)$adj.r.squared
## [1] 0.9018605
length(summary(lm5b)$na.action)
## [1] 112
summary(lm4)$adj.r.squared
## [1] 0.9030483
length(summary(lm4)$na.action)
## [1] 112
Here we try removing Neighborhood.
lm5c <- update(lm4, .~. -Neighborhood, data=train)
summary(lm5c)$adj.r.squared
## [1] 0.8860973
length(summary(lm5c)$na.action)
## [1] 112
summary(lm4)$adj.r.squared
## [1] 0.9030483
length(summary(lm4)$na.action)
## [1] 112
Here we try removing Condition1.
lm5d <- update(lm4, .~. -Condition1, data=train)
summary(lm5d)$adj.r.squared
## [1] 0.9022634
length(summary(lm5d)$na.action)
## [1] 112
summary(lm4)$adj.r.squared
## [1] 0.9030483
length(summary(lm4)$na.action)
## [1] 112
Here we try removing Condition2.
lm5e <- update(lm4, .~. -Condition2, data=train)
summary(lm5e)$adj.r.squared
## [1] 0.8922981
length(summary(lm5e)$na.action)
## [1] 112
summary(lm4)$adj.r.squared
## [1] 0.9030483
length(summary(lm4)$na.action)
## [1] 112
Here we try removing HouseStyle.
lm5f <- update(lm4, .~. -HouseStyle, data=train)
summary(lm5f)$adj.r.squared
## [1] 0.9016426
length(summary(lm5f)$na.action)
## [1] 112
summary(lm4)$adj.r.squared
## [1] 0.9030483
length(summary(lm4)$na.action)
## [1] 112
Here we try removing GarageType.
lm5g <- update(lm4, .~. -GarageType, data=train)
summary(lm5g)$adj.r.squared
## [1] 0.9029784
length(summary(lm5g)$na.action)
## [1] 112
summary(lm4)$adj.r.squared
## [1] 0.9030483
length(summary(lm4)$na.action)
## [1] 112
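The seven near-identical runs above could also be written as a loop, which avoids copy-and-paste slips in the model names. The following is a sketch only, using the built-in mtcars data in place of the Ames data; cars_df, base_fit, candidates, and triage are hypothetical names, and the three candidate terms are arbitrary.

```r
# Illustrative data only: mtcars ships with base R; it is not the Ames data.
cars_df <- transform(mtcars,
                     cyl  = factor(cyl),
                     gear = factor(gear),
                     am   = factor(am))

base_fit   <- lm(mpg ~ wt + hp + cyl + gear + am, data = cars_df)
candidates <- c("cyl", "gear", "am")

# Refit once per candidate, dropping only that term from the base model.
triage <- do.call(rbind, lapply(candidates, function(term) {
  reduced <- update(base_fit, as.formula(paste(". ~ . -", term)))
  data.frame(
    removed       = term,
    adj_r_squared = summary(reduced)$adj.r.squared,
    rows_dropped  = length(reduced$na.action)  # 0 when no rows were omitted
  )
}))

# Highest adjusted R-squared first: the term whose removal costs the least.
triage <- triage[order(triage$adj_r_squared, decreasing = TRUE), ]
print(triage)
```

Sorting by adjusted R-squared after removal gives a removal order directly, and the rows_dropped column flags any case where the row count changed.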
We now have the first seven predictors we want to try eliminating, listed in the order we’ll remove them (highest Adjusted R-Squared after removal first). We’ll note anything that stands out after each elimination.
Predictor removed: Adjusted R-Squared
GarageType: 0.9029784
LotConfig: 0.9023538
Condition1: 0.9022634
LandSlope: 0.9018605
HouseStyle: 0.9016426
Condition2: 0.8922981
Neighborhood: 0.8860973
Here we remove GarageType. There is nothing to note, so we continue.
lm5 <- update(lm4, .~. -GarageType, data=train)
summary(lm5)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + X1stFlrSF + GrLivArea + GarageArea +
## MSZoning + LotConfig + LandSlope + Neighborhood + Condition1 +
## Condition2 + BldgType + HouseStyle + RoofMatl + ExterQual +
## BsmtQual + BsmtExposure + KitchenQual + GarageQual + GarageCond,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5628 -1.4026 -0.0165 1.4816 25.5628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.227e+02 1.986e+01 -11.211 < 2e-16 ***
## LotArea 9.390e-05 1.356e-05 6.923 7.07e-12 ***
## OverallQual 1.196e+00 1.457e-01 8.212 5.39e-16 ***
## OverallCond 9.760e-01 1.092e-01 8.935 < 2e-16 ***
## YearBuilt 6.667e-02 9.210e-03 7.239 7.87e-13 ***
## BsmtUnfSF -2.410e-03 2.687e-04 -8.970 < 2e-16 ***
## TotalBsmtSF 5.700e-03 6.730e-04 8.470 < 2e-16 ***
## X1stFlrSF -3.952e-03 9.623e-04 -4.107 4.27e-05 ***
## GrLivArea 1.052e-02 6.449e-04 16.309 < 2e-16 ***
## GarageArea 3.660e-03 7.388e-04 4.954 8.26e-07 ***
## MSZoningFV 5.221e+00 1.757e+00 2.972 0.003019 **
## MSZoningRH 3.971e+00 1.859e+00 2.136 0.032861 *
## MSZoningRL 4.311e+00 1.522e+00 2.832 0.004697 **
## MSZoningRM 3.784e+00 1.431e+00 2.644 0.008294 **
## LotConfigCulDSac 8.741e-01 4.554e-01 1.919 0.055165 .
## LotConfigFR2 -1.361e+00 5.942e-01 -2.291 0.022116 *
## LotConfigFR3 -2.084e+00 1.856e+00 -1.123 0.261643
## LotConfigInside -1.315e-01 2.563e-01 -0.513 0.607983
## LandSlopeMod 8.828e-02 5.175e-01 0.171 0.864572
## LandSlopeSev -5.878e+00 1.491e+00 -3.942 8.53e-05 ***
## NeighborhoodBlueste -8.392e-01 2.683e+00 -0.313 0.754476
## NeighborhoodBrDale 7.230e-02 1.530e+00 0.047 0.962306
## NeighborhoodBrkSide -7.662e-02 1.332e+00 -0.058 0.954141
## NeighborhoodClearCr -2.483e+00 1.264e+00 -1.965 0.049656 *
## NeighborhoodCollgCr -1.845e+00 9.914e-01 -1.861 0.063013 .
## NeighborhoodCrawfor 1.823e+00 1.165e+00 1.564 0.118066
## NeighborhoodEdwards -3.083e+00 1.104e+00 -2.792 0.005325 **
## NeighborhoodGilbert -1.577e+00 1.060e+00 -1.487 0.137138
## NeighborhoodIDOTRR -3.559e-01 1.528e+00 -0.233 0.815879
## NeighborhoodMeadowV -1.541e+00 1.493e+00 -1.033 0.302015
## NeighborhoodMitchel -3.404e+00 1.142e+00 -2.980 0.002936 **
## NeighborhoodNAmes -2.960e+00 1.059e+00 -2.794 0.005281 **
## NeighborhoodNoRidge 3.659e+00 1.125e+00 3.252 0.001176 **
## NeighborhoodNPkVill 1.883e-01 1.493e+00 0.126 0.899603
## NeighborhoodNridgHt 2.348e+00 1.020e+00 2.301 0.021529 *
## NeighborhoodNWAmes -3.379e+00 1.087e+00 -3.109 0.001922 **
## NeighborhoodOldTown -1.799e+00 1.352e+00 -1.330 0.183659
## NeighborhoodSawyer -2.395e+00 1.123e+00 -2.133 0.033091 *
## NeighborhoodSawyerW -1.507e+00 1.072e+00 -1.406 0.160019
## NeighborhoodSomerst 1.715e-01 1.234e+00 0.139 0.889419
## NeighborhoodStoneBr 4.209e+00 1.129e+00 3.727 0.000202 ***
## NeighborhoodSWISU -1.675e+00 1.384e+00 -1.210 0.226484
## NeighborhoodTimber -2.573e+00 1.123e+00 -2.292 0.022050 *
## NeighborhoodVeenker -4.570e-01 1.423e+00 -0.321 0.748150
## Condition1Feedr 4.234e-01 7.827e-01 0.541 0.588591
## Condition1Norm 1.336e+00 6.276e-01 2.128 0.033496 *
## Condition1PosA 9.347e-01 1.425e+00 0.656 0.511873
## Condition1PosN 1.552e+00 1.077e+00 1.442 0.149676
## Condition1RRAe -2.231e+00 1.309e+00 -1.704 0.088677 .
## Condition1RRAn 9.686e-01 1.000e+00 0.968 0.333112
## Condition1RRNe -3.860e-01 2.567e+00 -0.150 0.880498
## Condition1RRNn 5.534e-01 1.820e+00 0.304 0.761160
## Condition2Feedr -9.538e-01 3.359e+00 -0.284 0.776518
## Condition2Norm 5.517e-01 2.836e+00 0.195 0.845810
## Condition2PosA 5.444e+00 4.719e+00 1.154 0.248854
## Condition2PosN -3.133e+01 3.905e+00 -8.024 2.34e-15 ***
## Condition2RRAe -2.464e+00 4.603e+00 -0.535 0.592573
## Condition2RRAn 7.167e-01 4.541e+00 0.158 0.874616
## Condition2RRNn 1.046e+00 3.828e+00 0.273 0.784636
## BldgType2fmCon -1.921e+00 8.523e-01 -2.254 0.024388 *
## BldgTypeDuplex -4.297e+00 7.782e-01 -5.521 4.09e-08 ***
## BldgTypeTwnhs -3.580e+00 7.583e-01 -4.721 2.61e-06 ***
## BldgTypeTwnhsE -2.589e+00 4.994e-01 -5.184 2.52e-07 ***
## HouseStyle1.5Unf 2.662e+00 1.222e+00 2.179 0.029501 *
## HouseStyle1Story 2.103e+00 5.547e-01 3.792 0.000157 ***
## HouseStyle2.5Fin -4.932e+00 1.740e+00 -2.834 0.004664 **
## HouseStyle2.5Unf -1.711e+00 1.274e+00 -1.343 0.179588
## HouseStyle2Story -7.253e-01 4.625e-01 -1.568 0.117065
## HouseStyleSFoyer 1.461e+00 9.023e-01 1.620 0.105548
## HouseStyleSLvl 1.055e+00 6.728e-01 1.567 0.117295
## RoofMatlCompShg 8.609e+01 4.134e+00 20.828 < 2e-16 ***
## RoofMatlMembran 9.207e+01 5.645e+00 16.312 < 2e-16 ***
## RoofMatlMetal 9.275e+01 5.704e+00 16.261 < 2e-16 ***
## RoofMatlRoll 8.883e+01 5.477e+00 16.219 < 2e-16 ***
## RoofMatlTar&Grv 8.426e+01 4.348e+00 19.377 < 2e-16 ***
## RoofMatlWdShake 8.740e+01 4.514e+00 19.361 < 2e-16 ***
## RoofMatlWdShngl 9.206e+01 4.322e+00 21.302 < 2e-16 ***
## ExterQualFa -3.774e+00 1.632e+00 -2.312 0.020920 *
## ExterQualGd -3.451e+00 6.739e-01 -5.121 3.51e-07 ***
## ExterQualTA -3.483e+00 7.491e-01 -4.649 3.68e-06 ***
## BsmtQualFa -2.756e+00 9.043e-01 -3.048 0.002353 **
## BsmtQualGd -3.579e+00 4.664e-01 -7.673 3.37e-14 ***
## BsmtQualTA -3.530e+00 5.715e-01 -6.176 8.87e-10 ***
## BsmtExposureGd 2.345e+00 4.331e-01 5.416 7.31e-08 ***
## BsmtExposureMn -5.504e-01 4.434e-01 -1.241 0.214756
## BsmtExposureNo -1.090e+00 3.161e-01 -3.449 0.000582 ***
## KitchenQualFa -3.827e+00 9.445e-01 -4.052 5.40e-05 ***
## KitchenQualGd -3.826e+00 5.017e-01 -7.626 4.76e-14 ***
## KitchenQualTA -3.997e+00 5.631e-01 -7.098 2.11e-12 ***
## GarageQualFa -1.451e+01 4.305e+00 -3.370 0.000775 ***
## GarageQualGd -1.305e+01 4.414e+00 -2.956 0.003179 **
## GarageQualPo -1.673e+01 5.140e+00 -3.254 0.001168 **
## GarageQualTA -1.427e+01 4.257e+00 -3.351 0.000829 ***
## GarageCondFa 1.371e+01 4.960e+00 2.764 0.005785 **
## GarageCondGd 1.328e+01 5.130e+00 2.589 0.009746 **
## GarageCondPo 1.361e+01 5.306e+00 2.565 0.010430 *
## GarageCondTA 1.408e+01 4.907e+00 2.870 0.004169 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.419 on 1251 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.9099, Adjusted R-squared: 0.903
## F-statistic: 131.6 on 96 and 1251 DF, p-value: < 2.2e-16
Here we remove LotConfig. There is nothing to note, so we continue.
lm6 <- update(lm5, .~. -LotConfig, data=train)
summary(lm6)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + X1stFlrSF + GrLivArea + GarageArea +
## MSZoning + LandSlope + Neighborhood + Condition1 + Condition2 +
## BldgType + HouseStyle + RoofMatl + ExterQual + BsmtQual +
## BsmtExposure + KitchenQual + GarageQual + GarageCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.7763 -1.3754 -0.0038 1.4847 25.7763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.251e+02 1.990e+01 -11.312 < 2e-16 ***
## LotArea 9.729e-05 1.347e-05 7.224 8.72e-13 ***
## OverallQual 1.198e+00 1.461e-01 8.197 6.01e-16 ***
## OverallCond 9.867e-01 1.095e-01 9.014 < 2e-16 ***
## YearBuilt 6.749e-02 9.229e-03 7.313 4.64e-13 ***
## BsmtUnfSF -2.389e-03 2.694e-04 -8.869 < 2e-16 ***
## TotalBsmtSF 5.681e-03 6.746e-04 8.422 < 2e-16 ***
## X1stFlrSF -3.847e-03 9.652e-04 -3.985 7.12e-05 ***
## GrLivArea 1.052e-02 6.468e-04 16.271 < 2e-16 ***
## GarageArea 3.672e-03 7.404e-04 4.959 8.06e-07 ***
## MSZoningFV 5.262e+00 1.761e+00 2.989 0.002854 **
## MSZoningRH 4.174e+00 1.864e+00 2.240 0.025292 *
## MSZoningRL 4.496e+00 1.525e+00 2.948 0.003255 **
## MSZoningRM 3.804e+00 1.436e+00 2.650 0.008161 **
## LandSlopeMod 9.149e-02 5.179e-01 0.177 0.859820
## LandSlopeSev -5.898e+00 1.495e+00 -3.945 8.43e-05 ***
## NeighborhoodBlueste -5.885e-01 2.691e+00 -0.219 0.826951
## NeighborhoodBrDale 3.067e-01 1.533e+00 0.200 0.841498
## NeighborhoodBrkSide 2.040e-02 1.335e+00 0.015 0.987812
## NeighborhoodClearCr -2.398e+00 1.267e+00 -1.892 0.058715 .
## NeighborhoodCollgCr -1.799e+00 9.926e-01 -1.813 0.070116 .
## NeighborhoodCrawfor 1.833e+00 1.169e+00 1.568 0.117080
## NeighborhoodEdwards -3.035e+00 1.107e+00 -2.742 0.006185 **
## NeighborhoodGilbert -1.553e+00 1.061e+00 -1.464 0.143456
## NeighborhoodIDOTRR -1.285e-01 1.531e+00 -0.084 0.933087
## NeighborhoodMeadowV -1.344e+00 1.496e+00 -0.898 0.369131
## NeighborhoodMitchel -3.292e+00 1.140e+00 -2.887 0.003951 **
## NeighborhoodNAmes -2.956e+00 1.061e+00 -2.786 0.005414 **
## NeighborhoodNoRidge 3.701e+00 1.124e+00 3.293 0.001020 **
## NeighborhoodNPkVill 1.185e-01 1.497e+00 0.079 0.936909
## NeighborhoodNridgHt 2.222e+00 1.020e+00 2.179 0.029529 *
## NeighborhoodNWAmes -3.363e+00 1.089e+00 -3.088 0.002057 **
## NeighborhoodOldTown -1.600e+00 1.354e+00 -1.182 0.237533
## NeighborhoodSawyer -2.302e+00 1.124e+00 -2.047 0.040851 *
## NeighborhoodSawyerW -1.475e+00 1.075e+00 -1.373 0.169985
## NeighborhoodSomerst 2.447e-01 1.235e+00 0.198 0.842956
## NeighborhoodStoneBr 4.421e+00 1.129e+00 3.916 9.50e-05 ***
## NeighborhoodSWISU -1.606e+00 1.388e+00 -1.157 0.247468
## NeighborhoodTimber -2.530e+00 1.124e+00 -2.251 0.024584 *
## NeighborhoodVeenker -4.637e-01 1.416e+00 -0.327 0.743448
## Condition1Feedr 2.440e-01 7.819e-01 0.312 0.755051
## Condition1Norm 1.343e+00 6.296e-01 2.132 0.033173 *
## Condition1PosA 8.511e-01 1.429e+00 0.596 0.551554
## Condition1PosN 1.732e+00 1.078e+00 1.607 0.108345
## Condition1RRAe -2.017e+00 1.309e+00 -1.540 0.123726
## Condition1RRAn 1.103e+00 1.001e+00 1.103 0.270450
## Condition1RRNe 1.219e-01 2.567e+00 0.047 0.962134
## Condition1RRNn 1.802e-01 1.793e+00 0.100 0.919975
## Condition2Feedr -1.384e+00 3.349e+00 -0.413 0.679537
## Condition2Norm 5.223e-01 2.838e+00 0.184 0.854022
## Condition2PosA 5.123e+00 4.726e+00 1.084 0.278509
## Condition2PosN -3.171e+01 3.907e+00 -8.118 1.13e-15 ***
## Condition2RRAe -2.375e+00 4.609e+00 -0.515 0.606464
## Condition2RRAn 9.479e-01 4.556e+00 0.208 0.835228
## Condition2RRNn 1.299e+00 3.827e+00 0.339 0.734397
## BldgType2fmCon -1.990e+00 8.543e-01 -2.329 0.020000 *
## BldgTypeDuplex -4.365e+00 7.789e-01 -5.604 2.58e-08 ***
## BldgTypeTwnhs -3.672e+00 7.584e-01 -4.842 1.45e-06 ***
## BldgTypeTwnhsE -2.577e+00 5.005e-01 -5.149 3.04e-07 ***
## HouseStyle1.5Unf 2.686e+00 1.225e+00 2.192 0.028595 *
## HouseStyle1Story 2.085e+00 5.561e-01 3.750 0.000185 ***
## HouseStyle2.5Fin -5.052e+00 1.746e+00 -2.894 0.003869 **
## HouseStyle2.5Unf -1.694e+00 1.278e+00 -1.325 0.185252
## HouseStyle2Story -6.874e-01 4.639e-01 -1.482 0.138635
## HouseStyleSFoyer 1.660e+00 9.038e-01 1.837 0.066453 .
## HouseStyleSLvl 1.030e+00 6.749e-01 1.526 0.127209
## RoofMatlCompShg 8.636e+01 4.142e+00 20.852 < 2e-16 ***
## RoofMatlMembran 9.323e+01 5.639e+00 16.533 < 2e-16 ***
## RoofMatlMetal 9.396e+01 5.696e+00 16.494 < 2e-16 ***
## RoofMatlRoll 8.948e+01 5.492e+00 16.291 < 2e-16 ***
## RoofMatlTar&Grv 8.475e+01 4.353e+00 19.469 < 2e-16 ***
## RoofMatlWdShake 8.761e+01 4.528e+00 19.350 < 2e-16 ***
## RoofMatlWdShngl 9.221e+01 4.334e+00 21.275 < 2e-16 ***
## ExterQualFa -3.693e+00 1.637e+00 -2.256 0.024256 *
## ExterQualGd -3.538e+00 6.736e-01 -5.253 1.76e-07 ***
## ExterQualTA -3.569e+00 7.492e-01 -4.763 2.13e-06 ***
## BsmtQualFa -2.747e+00 9.042e-01 -3.038 0.002432 **
## BsmtQualGd -3.572e+00 4.669e-01 -7.650 3.98e-14 ***
## BsmtQualTA -3.465e+00 5.712e-01 -6.067 1.72e-09 ***
## BsmtExposureGd 2.344e+00 4.338e-01 5.404 7.80e-08 ***
## BsmtExposureMn -4.632e-01 4.441e-01 -1.043 0.297223
## BsmtExposureNo -1.068e+00 3.167e-01 -3.372 0.000768 ***
## KitchenQualFa -3.823e+00 9.463e-01 -4.040 5.67e-05 ***
## KitchenQualGd -3.743e+00 5.025e-01 -7.449 1.75e-13 ***
## KitchenQualTA -3.925e+00 5.640e-01 -6.959 5.51e-12 ***
## GarageQualFa -1.457e+01 4.312e+00 -3.378 0.000753 ***
## GarageQualGd -1.307e+01 4.425e+00 -2.954 0.003196 **
## GarageQualPo -1.673e+01 5.153e+00 -3.247 0.001195 **
## GarageQualTA -1.434e+01 4.266e+00 -3.363 0.000794 ***
## GarageCondFa 1.375e+01 4.973e+00 2.764 0.005789 **
## GarageCondGd 1.362e+01 5.141e+00 2.648 0.008188 **
## GarageCondPo 1.372e+01 5.322e+00 2.577 0.010075 *
## GarageCondTA 1.416e+01 4.920e+00 2.879 0.004063 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.432 on 1255 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.909, Adjusted R-squared: 0.9023
## F-statistic: 136.2 on 92 and 1255 DF, p-value: < 2.2e-16
Here we remove Condition1. There’s no real shift yet on the two main metrics we’re looking at:
* The interquartile range of the residuals is still tight and evenly borders zero
* Adjusted R-squared doesn’t drop precipitously
We know there are no NA values in these seven variables for the records that are included, so we can compare the adjusted R-squared values directly across Models 5 through 11.
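One quick way to confirm that adjusted R-squared values are comparable is to check that every model was fit on the same number of rows. The sketch below is illustrative only: it uses R’s built-in airquality data rather than train, and the helper name fit_stats is hypothetical.

```r
# Illustrative only: the built-in airquality data stands in for train.
data(airquality)

# Hypothetical helper: rows actually used and adjusted R-squared for one model
fit_stats <- function(fit) {
  c(n_used = nobs(fit), adj_r2 = summary(fit)$adj.r.squared)
}

m_full  <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
m_wind  <- update(m_full, . ~ . - Wind)     # Wind has no NAs: same rows are used
m_solar <- update(m_full, . ~ . - Solar.R)  # Solar.R has NAs: more rows are used

stats <- rbind(full     = fit_stats(m_full),
               no_wind  = fit_stats(m_wind),
               no_solar = fit_stats(m_solar))
print(stats)
```

When n_used matches between two rows of the table, their adjusted R-squared values describe the same records and can be compared directly; when it differs, the comparison is only approximate.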
lm7 <- update(lm6, .~. -Condition1, data=train)
summary(lm7)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + X1stFlrSF + GrLivArea + GarageArea +
## MSZoning + LandSlope + Neighborhood + Condition2 + BldgType +
## HouseStyle + RoofMatl + ExterQual + BsmtQual + BsmtExposure +
## KitchenQual + GarageQual + GarageCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.8004 -1.4255 0.0177 1.5167 25.8004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.260e+02 1.980e+01 -11.413 < 2e-16 ***
## LotArea 9.576e-05 1.347e-05 7.107 1.98e-12 ***
## OverallQual 1.206e+00 1.462e-01 8.253 3.86e-16 ***
## OverallCond 9.846e-01 1.095e-01 8.991 < 2e-16 ***
## YearBuilt 6.733e-02 9.197e-03 7.321 4.38e-13 ***
## BsmtUnfSF -2.350e-03 2.692e-04 -8.731 < 2e-16 ***
## TotalBsmtSF 5.739e-03 6.738e-04 8.517 < 2e-16 ***
## X1stFlrSF -3.847e-03 9.616e-04 -4.000 6.69e-05 ***
## GrLivArea 1.051e-02 6.480e-04 16.218 < 2e-16 ***
## GarageArea 3.584e-03 7.415e-04 4.834 1.50e-06 ***
## MSZoningFV 5.514e+00 1.761e+00 3.132 0.001777 **
## MSZoningRH 4.246e+00 1.866e+00 2.275 0.023048 *
## MSZoningRL 4.450e+00 1.526e+00 2.917 0.003602 **
## MSZoningRM 3.629e+00 1.437e+00 2.526 0.011649 *
## LandSlopeMod 9.128e-02 5.192e-01 0.176 0.860478
## LandSlopeSev -5.670e+00 1.495e+00 -3.793 0.000156 ***
## NeighborhoodBlueste -3.739e-01 2.701e+00 -0.138 0.889915
## NeighborhoodBrDale 5.452e-01 1.536e+00 0.355 0.722744
## NeighborhoodBrkSide 2.748e-01 1.329e+00 0.207 0.836180
## NeighborhoodClearCr -2.329e+00 1.272e+00 -1.832 0.067229 .
## NeighborhoodCollgCr -1.667e+00 9.957e-01 -1.674 0.094420 .
## NeighborhoodCrawfor 1.990e+00 1.170e+00 1.701 0.089275 .
## NeighborhoodEdwards -2.925e+00 1.110e+00 -2.635 0.008512 **
## NeighborhoodGilbert -1.461e+00 1.062e+00 -1.376 0.169104
## NeighborhoodIDOTRR 7.609e-02 1.527e+00 0.050 0.960272
## NeighborhoodMeadowV -1.130e+00 1.500e+00 -0.753 0.451338
## NeighborhoodMitchel -3.102e+00 1.143e+00 -2.714 0.006732 **
## NeighborhoodNAmes -2.922e+00 1.063e+00 -2.748 0.006082 **
## NeighborhoodNoRidge 3.826e+00 1.127e+00 3.394 0.000710 ***
## NeighborhoodNPkVill 1.824e-01 1.502e+00 0.121 0.903399
## NeighborhoodNridgHt 2.300e+00 1.023e+00 2.247 0.024800 *
## NeighborhoodNWAmes -3.296e+00 1.085e+00 -3.038 0.002427 **
## NeighborhoodOldTown -1.611e+00 1.353e+00 -1.191 0.233950
## NeighborhoodSawyer -2.544e+00 1.125e+00 -2.262 0.023875 *
## NeighborhoodSawyerW -1.710e+00 1.072e+00 -1.595 0.110940
## NeighborhoodSomerst 3.622e-02 1.221e+00 0.030 0.976336
## NeighborhoodStoneBr 4.477e+00 1.133e+00 3.951 8.20e-05 ***
## NeighborhoodSWISU -1.624e+00 1.389e+00 -1.170 0.242422
## NeighborhoodTimber -2.402e+00 1.128e+00 -2.129 0.033412 *
## NeighborhoodVeenker -6.486e-01 1.418e+00 -0.458 0.647388
## Condition2Feedr -4.030e-01 3.194e+00 -0.126 0.899621
## Condition2Norm 1.804e+00 2.769e+00 0.652 0.514797
## Condition2PosA 5.045e+00 4.740e+00 1.064 0.287383
## Condition2PosN -3.002e+01 3.757e+00 -7.991 2.99e-15 ***
## Condition2RRAe -1.301e+00 4.552e+00 -0.286 0.775041
## Condition2RRAn 1.516e+00 4.506e+00 0.337 0.736520
## Condition2RRNn 1.646e+00 3.756e+00 0.438 0.661298
## BldgType2fmCon -1.889e+00 8.537e-01 -2.212 0.027122 *
## BldgTypeDuplex -4.699e+00 7.738e-01 -6.073 1.66e-09 ***
## BldgTypeTwnhs -3.612e+00 7.609e-01 -4.747 2.29e-06 ***
## BldgTypeTwnhsE -2.490e+00 5.015e-01 -4.965 7.79e-07 ***
## HouseStyle1.5Unf 2.440e+00 1.225e+00 1.992 0.046596 *
## HouseStyle1Story 2.229e+00 5.520e-01 4.038 5.71e-05 ***
## HouseStyle2.5Fin -4.712e+00 1.748e+00 -2.696 0.007105 **
## HouseStyle2.5Unf -1.367e+00 1.266e+00 -1.079 0.280594
## HouseStyle2Story -5.355e-01 4.588e-01 -1.167 0.243320
## HouseStyleSFoyer 1.834e+00 9.009e-01 2.036 0.041964 *
## HouseStyleSLvl 1.242e+00 6.712e-01 1.851 0.064425 .
## RoofMatlCompShg 8.752e+01 4.119e+00 21.250 < 2e-16 ***
## RoofMatlMembran 9.428e+01 5.628e+00 16.753 < 2e-16 ***
## RoofMatlMetal 9.498e+01 5.683e+00 16.714 < 2e-16 ***
## RoofMatlRoll 9.004e+01 5.508e+00 16.346 < 2e-16 ***
## RoofMatlTar&Grv 8.590e+01 4.326e+00 19.856 < 2e-16 ***
## RoofMatlWdShake 8.874e+01 4.493e+00 19.752 < 2e-16 ***
## RoofMatlWdShngl 9.334e+01 4.316e+00 21.625 < 2e-16 ***
## ExterQualFa -4.164e+00 1.621e+00 -2.569 0.010311 *
## ExterQualGd -3.508e+00 6.750e-01 -5.198 2.35e-07 ***
## ExterQualTA -3.561e+00 7.505e-01 -4.745 2.32e-06 ***
## BsmtQualFa -2.845e+00 9.051e-01 -3.143 0.001712 **
## BsmtQualGd -3.575e+00 4.677e-01 -7.643 4.17e-14 ***
## BsmtQualTA -3.504e+00 5.717e-01 -6.130 1.17e-09 ***
## BsmtExposureGd 2.323e+00 4.353e-01 5.336 1.13e-07 ***
## BsmtExposureMn -5.126e-01 4.435e-01 -1.156 0.248048
## BsmtExposureNo -1.074e+00 3.175e-01 -3.383 0.000738 ***
## KitchenQualFa -3.895e+00 9.466e-01 -4.115 4.12e-05 ***
## KitchenQualGd -3.769e+00 5.038e-01 -7.482 1.37e-13 ***
## KitchenQualTA -3.942e+00 5.653e-01 -6.974 4.97e-12 ***
## GarageQualFa -1.325e+01 4.292e+00 -3.088 0.002061 **
## GarageQualGd -1.175e+01 4.408e+00 -2.667 0.007760 **
## GarageQualPo -1.470e+01 5.098e+00 -2.884 0.003999 **
## GarageQualTA -1.318e+01 4.253e+00 -3.099 0.001987 **
## GarageCondFa 1.241e+01 4.967e+00 2.498 0.012613 *
## GarageCondGd 1.235e+01 5.135e+00 2.405 0.016337 *
## GarageCondPo 1.180e+01 5.280e+00 2.235 0.025566 *
## GarageCondTA 1.285e+01 4.913e+00 2.615 0.009026 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.445 on 1263 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.9077, Adjusted R-squared: 0.9015
## F-statistic: 147.8 on 84 and 1263 DF, p-value: < 2.2e-16
In Model 8 we remove LandSlope. Look at how, in the output below for Model 8, there is a drop in significance for X1stFlrSF once LandSlope is removed (its p-value rises from 6.69e-05 in Model 7 to 0.000106). My estimate is that our models so far overfit, meaning we’re modeling a lot of the noise in the data and not just the underlying general predictors. Once we removed LandSlope, X1stFlrSF was no longer as useful for fitting the noise, and so it dropped in significance. This tells me that we need to keep going and look for more variables that drop off in significance.
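Shifts in p-values are only an indirect sign of overfitting. A more direct check, which is a different technique from the one used in this write-up, is to compare in-sample error with cross-validated error. The sketch below is a minimal base R version under stated assumptions: it uses the built-in mtcars data instead of train, and the helper cv_rmse is hypothetical.

```r
# Illustrative only: the built-in mtcars data stands in for train.
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Hypothetical helper: k-fold cross-validated RMSE for a model formula
cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  sq_err <- unlist(lapply(sort(unique(folds)), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])       # fit on k-1 folds
    pred <- predict(fit, newdata = data[folds == i, ])   # predict held-out fold
    (data[[response]][folds == i] - pred)^2
  }))
  sqrt(mean(sq_err))
}

small <- mpg ~ wt + hp
large <- mpg ~ wt + hp + disp + drat + qsec + vs + am + gear + carb + cyl

# In-sample RMSE can never get worse when predictors are added to a nested model;
# out-of-sample RMSE can.
in_sample <- c(small = sqrt(mean(residuals(lm(small, mtcars))^2)),
               large = sqrt(mean(residuals(lm(large, mtcars))^2)))
out_of_sample <- c(small = cv_rmse(small, mtcars, folds),
                   large = cv_rmse(large, mtcars, folds))
print(rbind(in_sample, out_of_sample))
```

A large gap between the two rows for the bigger model would support the overfitting reading; a small gap would suggest the extra predictors are carrying real signal.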
lm8 <- update(lm7, .~. -LandSlope, data=train)
summary(lm8)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + X1stFlrSF + GrLivArea + GarageArea +
## MSZoning + Neighborhood + Condition2 + BldgType + HouseStyle +
## RoofMatl + ExterQual + BsmtQual + BsmtExposure + KitchenQual +
## GarageQual + GarageCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5290 -1.4056 0.0139 1.5141 25.5290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.219e+02 1.987e+01 -11.164 < 2e-16 ***
## LotArea 6.638e-05 1.112e-05 5.971 3.05e-09 ***
## OverallQual 1.224e+00 1.467e-01 8.344 < 2e-16 ***
## OverallCond 9.814e-01 1.101e-01 8.918 < 2e-16 ***
## YearBuilt 6.587e-02 9.235e-03 7.133 1.65e-12 ***
## BsmtUnfSF -2.339e-03 2.702e-04 -8.658 < 2e-16 ***
## TotalBsmtSF 5.702e-03 6.749e-04 8.449 < 2e-16 ***
## X1stFlrSF -3.755e-03 9.657e-04 -3.889 0.000106 ***
## GrLivArea 1.056e-02 6.509e-04 16.219 < 2e-16 ***
## GarageArea 3.639e-03 7.451e-04 4.884 1.17e-06 ***
## MSZoningFV 5.332e+00 1.766e+00 3.019 0.002587 **
## MSZoningRH 4.070e+00 1.871e+00 2.176 0.029767 *
## MSZoningRL 4.303e+00 1.530e+00 2.812 0.004996 **
## MSZoningRM 3.542e+00 1.440e+00 2.460 0.014024 *
## NeighborhoodBlueste -4.048e-01 2.714e+00 -0.149 0.881469
## NeighborhoodBrDale 5.897e-01 1.544e+00 0.382 0.702573
## NeighborhoodBrkSide 2.561e-01 1.335e+00 0.192 0.847905
## NeighborhoodClearCr -2.670e+00 1.269e+00 -2.105 0.035496 *
## NeighborhoodCollgCr -1.596e+00 1.001e+00 -1.596 0.110842
## NeighborhoodCrawfor 2.078e+00 1.170e+00 1.776 0.076019 .
## NeighborhoodEdwards -2.816e+00 1.115e+00 -2.526 0.011644 *
## NeighborhoodGilbert -1.346e+00 1.067e+00 -1.261 0.207415
## NeighborhoodIDOTRR 5.870e-02 1.534e+00 0.038 0.969476
## NeighborhoodMeadowV -1.111e+00 1.508e+00 -0.737 0.461258
## NeighborhoodMitchel -2.931e+00 1.143e+00 -2.564 0.010463 *
## NeighborhoodNAmes -2.824e+00 1.068e+00 -2.643 0.008308 **
## NeighborhoodNoRidge 3.897e+00 1.133e+00 3.440 0.000601 ***
## NeighborhoodNPkVill 2.450e-01 1.510e+00 0.162 0.871101
## NeighborhoodNridgHt 2.383e+00 1.028e+00 2.318 0.020626 *
## NeighborhoodNWAmes -3.199e+00 1.090e+00 -2.935 0.003395 **
## NeighborhoodOldTown -1.761e+00 1.359e+00 -1.296 0.195352
## NeighborhoodSawyer -2.415e+00 1.130e+00 -2.137 0.032752 *
## NeighborhoodSawyerW -1.630e+00 1.077e+00 -1.514 0.130263
## NeighborhoodSomerst 1.032e-01 1.227e+00 0.084 0.932952
## NeighborhoodStoneBr 4.592e+00 1.137e+00 4.040 5.67e-05 ***
## NeighborhoodSWISU -1.613e+00 1.396e+00 -1.155 0.248223
## NeighborhoodTimber -2.365e+00 1.134e+00 -2.086 0.037184 *
## NeighborhoodVeenker -4.455e-01 1.423e+00 -0.313 0.754300
## Condition2Feedr -4.597e-01 3.200e+00 -0.144 0.885817
## Condition2Norm 1.773e+00 2.774e+00 0.639 0.522863
## Condition2PosA 4.921e+00 4.761e+00 1.033 0.301574
## Condition2PosN -2.977e+01 3.764e+00 -7.908 5.65e-15 ***
## Condition2RRAe -1.167e+00 4.569e+00 -0.255 0.798505
## Condition2RRAn 1.533e+00 4.526e+00 0.339 0.734882
## Condition2RRNn 1.807e+00 3.768e+00 0.480 0.631571
## BldgType2fmCon -1.817e+00 8.579e-01 -2.118 0.034355 *
## BldgTypeDuplex -4.733e+00 7.774e-01 -6.089 1.50e-09 ***
## BldgTypeTwnhs -3.808e+00 7.631e-01 -4.990 6.89e-07 ***
## BldgTypeTwnhsE -2.642e+00 5.025e-01 -5.257 1.72e-07 ***
## HouseStyle1.5Unf 2.405e+00 1.231e+00 1.954 0.050934 .
## HouseStyle1Story 2.230e+00 5.548e-01 4.020 6.16e-05 ***
## HouseStyle2.5Fin -4.718e+00 1.749e+00 -2.698 0.007061 **
## HouseStyle2.5Unf -1.392e+00 1.272e+00 -1.094 0.274007
## HouseStyle2Story -5.351e-01 4.611e-01 -1.160 0.246073
## HouseStyleSFoyer 1.853e+00 9.027e-01 2.053 0.040315 *
## HouseStyleSLvl 1.294e+00 6.733e-01 1.922 0.054838 .
## RoofMatlCompShg 8.637e+01 4.128e+00 20.922 < 2e-16 ***
## RoofMatlMembran 8.869e+01 5.465e+00 16.228 < 2e-16 ***
## RoofMatlMetal 8.894e+01 5.488e+00 16.206 < 2e-16 ***
## RoofMatlRoll 8.892e+01 5.528e+00 16.084 < 2e-16 ***
## RoofMatlTar&Grv 8.396e+01 4.316e+00 19.454 < 2e-16 ***
## RoofMatlWdShake 8.656e+01 4.479e+00 19.324 < 2e-16 ***
## RoofMatlWdShngl 9.277e+01 4.335e+00 21.402 < 2e-16 ***
## ExterQualFa -4.088e+00 1.620e+00 -2.523 0.011750 *
## ExterQualGd -3.551e+00 6.783e-01 -5.235 1.93e-07 ***
## ExterQualTA -3.620e+00 7.542e-01 -4.799 1.78e-06 ***
## BsmtQualFa -2.859e+00 9.097e-01 -3.143 0.001713 **
## BsmtQualGd -3.579e+00 4.696e-01 -7.622 4.87e-14 ***
## BsmtQualTA -3.555e+00 5.744e-01 -6.188 8.20e-10 ***
## BsmtExposureGd 2.329e+00 4.321e-01 5.389 8.44e-08 ***
## BsmtExposureMn -4.716e-01 4.446e-01 -1.061 0.288956
## BsmtExposureNo -1.045e+00 3.170e-01 -3.296 0.001007 **
## KitchenQualFa -4.086e+00 9.502e-01 -4.300 1.84e-05 ***
## KitchenQualGd -3.684e+00 5.058e-01 -7.283 5.72e-13 ***
## KitchenQualTA -3.858e+00 5.678e-01 -6.795 1.66e-11 ***
## GarageQualFa -1.288e+01 4.309e+00 -2.989 0.002854 **
## GarageQualGd -1.141e+01 4.426e+00 -2.578 0.010055 *
## GarageQualPo -1.435e+01 5.121e+00 -2.802 0.005154 **
## GarageQualTA -1.288e+01 4.270e+00 -3.017 0.002604 **
## GarageCondFa 1.213e+01 4.990e+00 2.431 0.015202 *
## GarageCondGd 1.210e+01 5.157e+00 2.346 0.019148 *
## GarageCondPo 1.155e+01 5.304e+00 2.177 0.029654 *
## GarageCondTA 1.256e+01 4.934e+00 2.546 0.011028 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.463 on 1265 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.9066, Adjusted R-squared: 0.9005
## F-statistic: 149.7 on 82 and 1265 DF, p-value: < 2.2e-16
In Model 9 below we remove X1stFlrSF. We’re starting to see an uptick in the residual spread, but it’s only in the min and max, which we can safely ignore because we’re only concerned with the interquartile spread and whether it evenly borders zero.
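The residual quartiles we keep reading off each summary can also be pulled out as numbers, which makes them easier to track from model to model. This is a small sketch on the built-in mtcars data, not on train; the variable names are just for illustration.

```r
# Illustrative only: the built-in mtcars data stands in for train.
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Same five numbers that summary(fit) prints under "Residuals:"
res_q <- quantile(residuals(fit), probs = c(0, 0.25, 0.5, 0.75, 1))
print(round(res_q, 3))

# Width of the interquartile range, and how lopsided it is around zero
iqr_width <- unname(res_q["75%"] - res_q["25%"])
asymmetry <- unname(res_q["75%"] + res_q["25%"])  # near 0 when 1Q and 3Q mirror each other
print(c(iqr_width = iqr_width, asymmetry = asymmetry))
```

Collecting iqr_width and asymmetry for each model in the elimination sequence would give a compact record of the "tight and evenly bordering zero" criterion.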
lm9 <- update(lm8, .~. -X1stFlrSF, data=train)
summary(lm9)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + MSZoning +
## Neighborhood + Condition2 + BldgType + HouseStyle + RoofMatl +
## ExterQual + BsmtQual + BsmtExposure + KitchenQual + GarageQual +
## GarageCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.119 -1.413 0.061 1.463 26.279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.238e+02 1.998e+01 -11.204 < 2e-16 ***
## LotArea 6.418e-05 1.116e-05 5.749 1.12e-08 ***
## OverallQual 1.184e+00 1.472e-01 8.046 1.95e-15 ***
## OverallCond 9.913e-01 1.106e-01 8.960 < 2e-16 ***
## YearBuilt 6.801e-02 9.270e-03 7.337 3.90e-13 ***
## BsmtUnfSF -2.327e-03 2.717e-04 -8.564 < 2e-16 ***
## TotalBsmtSF 4.215e-03 5.593e-04 7.537 9.10e-14 ***
## GrLivArea 8.691e-03 4.423e-04 19.650 < 2e-16 ***
## GarageArea 3.557e-03 7.490e-04 4.750 2.27e-06 ***
## MSZoningFV 5.435e+00 1.776e+00 3.061 0.002256 **
## MSZoningRH 4.302e+00 1.880e+00 2.288 0.022291 *
## MSZoningRL 4.349e+00 1.539e+00 2.827 0.004775 **
## MSZoningRM 3.593e+00 1.448e+00 2.481 0.013216 *
## NeighborhoodBlueste -2.753e-01 2.729e+00 -0.101 0.919662
## NeighborhoodBrDale 4.020e-01 1.552e+00 0.259 0.795618
## NeighborhoodBrkSide 2.516e-01 1.343e+00 0.187 0.851382
## NeighborhoodClearCr -2.312e+00 1.272e+00 -1.817 0.069380 .
## NeighborhoodCollgCr -1.457e+00 1.005e+00 -1.449 0.147665
## NeighborhoodCrawfor 1.961e+00 1.176e+00 1.667 0.095718 .
## NeighborhoodEdwards -2.825e+00 1.121e+00 -2.521 0.011835 *
## NeighborhoodGilbert -1.227e+00 1.072e+00 -1.144 0.252680
## NeighborhoodIDOTRR 1.265e-01 1.542e+00 0.082 0.934632
## NeighborhoodMeadowV -1.067e+00 1.516e+00 -0.704 0.481672
## NeighborhoodMitchel -2.800e+00 1.149e+00 -2.436 0.014968 *
## NeighborhoodNAmes -2.761e+00 1.074e+00 -2.570 0.010291 *
## NeighborhoodNoRidge 4.285e+00 1.135e+00 3.776 0.000167 ***
## NeighborhoodNPkVill 3.024e-01 1.518e+00 0.199 0.842131
## NeighborhoodNridgHt 2.507e+00 1.034e+00 2.426 0.015410 *
## NeighborhoodNWAmes -3.231e+00 1.096e+00 -2.949 0.003250 **
## NeighborhoodOldTown -1.684e+00 1.367e+00 -1.232 0.218014
## NeighborhoodSawyer -2.404e+00 1.136e+00 -2.116 0.034577 *
## NeighborhoodSawyerW -1.610e+00 1.083e+00 -1.487 0.137336
## NeighborhoodSomerst 1.199e-01 1.234e+00 0.097 0.922605
## NeighborhoodStoneBr 4.800e+00 1.142e+00 4.204 2.81e-05 ***
## NeighborhoodSWISU -1.685e+00 1.404e+00 -1.200 0.230228
## NeighborhoodTimber -2.325e+00 1.140e+00 -2.040 0.041562 *
## NeighborhoodVeenker -3.800e-01 1.431e+00 -0.266 0.790653
## Condition2Feedr -7.199e-01 3.217e+00 -0.224 0.822980
## Condition2Norm 1.648e+00 2.789e+00 0.591 0.554755
## Condition2PosA 5.407e+00 4.786e+00 1.130 0.258754
## Condition2PosN -2.964e+01 3.785e+00 -7.831 1.02e-14 ***
## Condition2RRAe -2.972e-02 4.585e+00 -0.006 0.994830
## Condition2RRAn 1.367e+00 4.550e+00 0.300 0.763852
## Condition2RRNn 1.924e+00 3.789e+00 0.508 0.611728
## BldgType2fmCon -2.007e+00 8.612e-01 -2.331 0.019919 *
## BldgTypeDuplex -4.705e+00 7.816e-01 -6.019 2.30e-09 ***
## BldgTypeTwnhs -3.925e+00 7.667e-01 -5.119 3.56e-07 ***
## BldgTypeTwnhsE -2.754e+00 5.045e-01 -5.459 5.74e-08 ***
## HouseStyle1.5Unf 1.475e+00 1.214e+00 1.215 0.224644
## HouseStyle1Story 1.078e+00 4.716e-01 2.286 0.022442 *
## HouseStyle2.5Fin -3.806e+00 1.742e+00 -2.184 0.029109 *
## HouseStyle2.5Unf -8.240e-01 1.270e+00 -0.649 0.516709
## HouseStyle2Story 7.708e-02 4.358e-01 0.177 0.859656
## HouseStyleSFoyer 8.558e-01 8.703e-01 0.983 0.325638
## HouseStyleSLvl 2.705e-01 6.231e-01 0.434 0.664240
## RoofMatlCompShg 8.522e+01 4.140e+00 20.582 < 2e-16 ***
## RoofMatlMembran 8.753e+01 5.487e+00 15.951 < 2e-16 ***
## RoofMatlMetal 8.763e+01 5.508e+00 15.910 < 2e-16 ***
## RoofMatlRoll 8.817e+01 5.556e+00 15.870 < 2e-16 ***
## RoofMatlTar&Grv 8.230e+01 4.319e+00 19.056 < 2e-16 ***
## RoofMatlWdShake 8.484e+01 4.482e+00 18.928 < 2e-16 ***
## RoofMatlWdShngl 9.177e+01 4.351e+00 21.091 < 2e-16 ***
## ExterQualFa -3.966e+00 1.629e+00 -2.435 0.015031 *
## ExterQualGd -3.548e+00 6.820e-01 -5.202 2.29e-07 ***
## ExterQualTA -3.693e+00 7.581e-01 -4.872 1.25e-06 ***
## BsmtQualFa -3.187e+00 9.108e-01 -3.499 0.000484 ***
## BsmtQualGd -3.647e+00 4.718e-01 -7.729 2.19e-14 ***
## BsmtQualTA -3.644e+00 5.771e-01 -6.313 3.77e-10 ***
## BsmtExposureGd 2.353e+00 4.345e-01 5.415 7.33e-08 ***
## BsmtExposureMn -3.644e-01 4.462e-01 -0.817 0.414241
## BsmtExposureNo -9.608e-01 3.180e-01 -3.022 0.002565 **
## KitchenQualFa -3.950e+00 9.548e-01 -4.137 3.75e-05 ***
## KitchenQualGd -3.676e+00 5.086e-01 -7.226 8.56e-13 ***
## KitchenQualTA -3.820e+00 5.709e-01 -6.692 3.30e-11 ***
## GarageQualFa -1.407e+01 4.322e+00 -3.256 0.001161 **
## GarageQualGd -1.290e+01 4.433e+00 -2.911 0.003669 **
## GarageQualPo -1.492e+01 5.147e+00 -2.899 0.003805 **
## GarageQualTA -1.424e+01 4.279e+00 -3.329 0.000897 ***
## GarageCondFa 1.312e+01 5.012e+00 2.618 0.008944 **
## GarageCondGd 1.335e+01 5.175e+00 2.579 0.010011 *
## GarageCondPo 1.235e+01 5.329e+00 2.318 0.020601 *
## GarageCondTA 1.362e+01 4.954e+00 2.750 0.006044 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.482 on 1266 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.9054, Adjusted R-squared: 0.8994
## F-statistic: 149.7 on 81 and 1266 DF, p-value: < 2.2e-16
In Model 10 below we remove HouseStyle. There are no meaningful changes, so let’s keep simplifying this model.
lm10 <- update(lm9, .~. -HouseStyle, data=train)
summary(lm10)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + MSZoning +
## Neighborhood + Condition2 + BldgType + RoofMatl + ExterQual +
## BsmtQual + BsmtExposure + KitchenQual + GarageQual + GarageCond,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.9550 -1.4096 0.0264 1.4135 27.0198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.277e+02 1.979e+01 -11.508 < 2e-16 ***
## LotArea 6.488e-05 1.115e-05 5.821 7.38e-09 ***
## OverallQual 1.171e+00 1.448e-01 8.086 1.43e-15 ***
## OverallCond 9.947e-01 1.106e-01 8.991 < 2e-16 ***
## YearBuilt 6.977e-02 9.138e-03 7.635 4.42e-14 ***
## BsmtUnfSF -2.339e-03 2.709e-04 -8.634 < 2e-16 ***
## TotalBsmtSF 5.186e-03 3.896e-04 13.310 < 2e-16 ***
## GrLivArea 7.805e-03 2.936e-04 26.586 < 2e-16 ***
## GarageArea 3.662e-03 7.419e-04 4.936 9.02e-07 ***
## MSZoningFV 5.377e+00 1.773e+00 3.033 0.002467 **
## MSZoningRH 4.479e+00 1.874e+00 2.390 0.016997 *
## MSZoningRL 4.415e+00 1.536e+00 2.875 0.004106 **
## MSZoningRM 3.706e+00 1.445e+00 2.565 0.010443 *
## NeighborhoodBlueste -6.045e-01 2.727e+00 -0.222 0.824609
## NeighborhoodBrDale -9.021e-02 1.533e+00 -0.059 0.953094
## NeighborhoodBrkSide -8.693e-02 1.328e+00 -0.065 0.947807
## NeighborhoodClearCr -2.545e+00 1.270e+00 -2.003 0.045340 *
## NeighborhoodCollgCr -1.635e+00 1.005e+00 -1.627 0.103952
## NeighborhoodCrawfor 1.874e+00 1.177e+00 1.592 0.111602
## NeighborhoodEdwards -3.047e+00 1.117e+00 -2.729 0.006447 **
## NeighborhoodGilbert -1.457e+00 1.069e+00 -1.363 0.173049
## NeighborhoodIDOTRR -1.694e-01 1.534e+00 -0.110 0.912091
## NeighborhoodMeadowV -1.328e+00 1.505e+00 -0.882 0.377816
## NeighborhoodMitchel -3.000e+00 1.146e+00 -2.618 0.008938 **
## NeighborhoodNAmes -2.856e+00 1.072e+00 -2.665 0.007800 **
## NeighborhoodNoRidge 4.210e+00 1.133e+00 3.715 0.000212 ***
## NeighborhoodNPkVill 3.192e-02 1.512e+00 0.021 0.983159
## NeighborhoodNridgHt 2.291e+00 1.031e+00 2.221 0.026497 *
## NeighborhoodNWAmes -3.303e+00 1.094e+00 -3.020 0.002575 **
## NeighborhoodOldTown -1.988e+00 1.360e+00 -1.462 0.144104
## NeighborhoodSawyer -2.462e+00 1.133e+00 -2.174 0.029890 *
## NeighborhoodSawyerW -1.759e+00 1.081e+00 -1.627 0.103918
## NeighborhoodSomerst -5.575e-02 1.234e+00 -0.045 0.963981
## NeighborhoodStoneBr 4.708e+00 1.143e+00 4.121 4.02e-05 ***
## NeighborhoodSWISU -2.401e+00 1.379e+00 -1.741 0.081888 .
## NeighborhoodTimber -2.472e+00 1.140e+00 -2.168 0.030317 *
## NeighborhoodVeenker -4.751e-01 1.431e+00 -0.332 0.739941
## Condition2Feedr -7.017e-01 3.148e+00 -0.223 0.823657
## Condition2Norm 1.440e+00 2.725e+00 0.529 0.597208
## Condition2PosA 4.731e+00 4.590e+00 1.031 0.302874
## Condition2PosN -2.970e+01 3.748e+00 -7.924 4.97e-15 ***
## Condition2RRAe -2.197e-01 4.514e+00 -0.049 0.961194
## Condition2RRAn 1.104e+00 4.513e+00 0.245 0.806839
## Condition2RRNn 1.870e+00 3.735e+00 0.501 0.616768
## BldgType2fmCon -2.013e+00 8.586e-01 -2.345 0.019203 *
## BldgTypeDuplex -4.707e+00 7.603e-01 -6.191 8.06e-10 ***
## BldgTypeTwnhs -4.020e+00 7.671e-01 -5.241 1.87e-07 ***
## BldgTypeTwnhsE -2.748e+00 5.024e-01 -5.469 5.45e-08 ***
## RoofMatlCompShg 8.702e+01 4.030e+00 21.593 < 2e-16 ***
## RoofMatlMembran 8.961e+01 5.387e+00 16.634 < 2e-16 ***
## RoofMatlMetal 8.893e+01 5.444e+00 16.335 < 2e-16 ***
## RoofMatlRoll 8.997e+01 5.478e+00 16.425 < 2e-16 ***
## RoofMatlTar&Grv 8.447e+01 4.187e+00 20.175 < 2e-16 ***
## RoofMatlWdShake 8.727e+01 4.329e+00 20.160 < 2e-16 ***
## RoofMatlWdShngl 9.357e+01 4.253e+00 22.002 < 2e-16 ***
## ExterQualFa -3.887e+00 1.622e+00 -2.397 0.016693 *
## ExterQualGd -3.581e+00 6.821e-01 -5.250 1.78e-07 ***
## ExterQualTA -3.706e+00 7.587e-01 -4.885 1.16e-06 ***
## BsmtQualFa -3.027e+00 9.068e-01 -3.337 0.000870 ***
## BsmtQualGd -3.669e+00 4.694e-01 -7.816 1.14e-14 ***
## BsmtQualTA -3.620e+00 5.758e-01 -6.286 4.46e-10 ***
## BsmtExposureGd 2.320e+00 4.347e-01 5.338 1.11e-07 ***
## BsmtExposureMn -2.886e-01 4.297e-01 -0.672 0.501969
## BsmtExposureNo -8.966e-01 2.947e-01 -3.042 0.002397 **
## KitchenQualFa -3.907e+00 9.541e-01 -4.095 4.49e-05 ***
## KitchenQualGd -3.681e+00 5.083e-01 -7.243 7.60e-13 ***
## KitchenQualTA -3.897e+00 5.702e-01 -6.835 1.27e-11 ***
## GarageQualFa -1.154e+01 4.003e+00 -2.884 0.003994 **
## GarageQualGd -1.033e+01 4.102e+00 -2.519 0.011895 *
## GarageQualPo -1.225e+01 4.890e+00 -2.506 0.012342 *
## GarageQualTA -1.184e+01 3.967e+00 -2.984 0.002896 **
## GarageCondFa 1.052e+01 4.757e+00 2.211 0.027232 *
## GarageCondGd 1.043e+01 4.862e+00 2.145 0.032106 *
## GarageCondPo 9.753e+00 5.078e+00 1.920 0.055032 .
## GarageCondTA 1.105e+01 4.694e+00 2.355 0.018669 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.488 on 1273 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.9046, Adjusted R-squared: 0.899
## F-statistic: 163.1 on 74 and 1273 DF, p-value: < 2.2e-16
In Model 11 below we remove Condition2. Let’s keep simplifying the model.
lm11 <- update(lm10, .~. -Condition2, data=train)
summary(lm11)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + MSZoning +
## Neighborhood + BldgType + RoofMatl + ExterQual + BsmtQual +
## BsmtExposure + KitchenQual + GarageQual + GarageCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.499 -1.417 -0.014 1.499 28.300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.097e+02 2.033e+01 -10.315 < 2e-16 ***
## LotArea 6.009e-05 1.173e-05 5.122 3.48e-07 ***
## OverallQual 1.138e+00 1.514e-01 7.519 1.04e-13 ***
## OverallCond 9.780e-01 1.164e-01 8.400 < 2e-16 ***
## YearBuilt 6.456e-02 9.545e-03 6.763 2.05e-11 ***
## BsmtUnfSF -2.127e-03 2.842e-04 -7.487 1.31e-13 ***
## TotalBsmtSF 4.759e-03 4.081e-04 11.661 < 2e-16 ***
## GrLivArea 7.500e-03 3.070e-04 24.430 < 2e-16 ***
## GarageArea 3.362e-03 7.769e-04 4.327 1.63e-05 ***
## MSZoningFV 4.972e+00 1.866e+00 2.664 0.007814 **
## MSZoningRH 4.189e+00 1.972e+00 2.123 0.033906 *
## MSZoningRL 4.247e+00 1.616e+00 2.629 0.008670 **
## MSZoningRM 3.709e+00 1.523e+00 2.435 0.015012 *
## NeighborhoodBlueste -8.735e-01 2.873e+00 -0.304 0.761176
## NeighborhoodBrDale -5.992e-01 1.613e+00 -0.372 0.710269
## NeighborhoodBrkSide -9.817e-01 1.385e+00 -0.709 0.478451
## NeighborhoodClearCr -2.548e+00 1.338e+00 -1.904 0.057184 .
## NeighborhoodCollgCr -1.689e+00 1.059e+00 -1.596 0.110843
## NeighborhoodCrawfor 1.659e+00 1.240e+00 1.338 0.181124
## NeighborhoodEdwards -3.819e+00 1.175e+00 -3.251 0.001182 **
## NeighborhoodGilbert -1.573e+00 1.126e+00 -1.397 0.162634
## NeighborhoodIDOTRR -9.880e-01 1.610e+00 -0.614 0.539459
## NeighborhoodMeadowV -1.785e+00 1.584e+00 -1.127 0.259971
## NeighborhoodMitchel -3.121e+00 1.207e+00 -2.585 0.009846 **
## NeighborhoodNAmes -3.149e+00 1.129e+00 -2.789 0.005368 **
## NeighborhoodNoRidge 4.667e+00 1.193e+00 3.912 9.63e-05 ***
## NeighborhoodNPkVill -8.470e-02 1.594e+00 -0.053 0.957624
## NeighborhoodNridgHt 2.558e+00 1.086e+00 2.355 0.018682 *
## NeighborhoodNWAmes -3.338e+00 1.151e+00 -2.899 0.003804 **
## NeighborhoodOldTown -2.617e+00 1.430e+00 -1.830 0.067535 .
## NeighborhoodSawyer -2.821e+00 1.192e+00 -2.367 0.018068 *
## NeighborhoodSawyerW -1.814e+00 1.139e+00 -1.592 0.111559
## NeighborhoodSomerst 1.877e-01 1.301e+00 0.144 0.885251
## NeighborhoodStoneBr 5.158e+00 1.203e+00 4.286 1.95e-05 ***
## NeighborhoodSWISU -2.902e+00 1.452e+00 -1.998 0.045889 *
## NeighborhoodTimber -2.255e+00 1.201e+00 -1.878 0.060675 .
## NeighborhoodVeenker -4.527e-01 1.508e+00 -0.300 0.764143
## BldgType2fmCon -2.005e+00 8.677e-01 -2.310 0.021036 *
## BldgTypeDuplex -4.553e+00 7.737e-01 -5.884 5.11e-09 ***
## BldgTypeTwnhs -4.237e+00 8.080e-01 -5.244 1.84e-07 ***
## BldgTypeTwnhsE -2.895e+00 5.287e-01 -5.475 5.26e-08 ***
## RoofMatlCompShg 8.149e+01 4.217e+00 19.326 < 2e-16 ***
## RoofMatlMembran 8.416e+01 5.656e+00 14.879 < 2e-16 ***
## RoofMatlMetal 8.314e+01 5.713e+00 14.553 < 2e-16 ***
## RoofMatlRoll 8.446e+01 5.747e+00 14.698 < 2e-16 ***
## RoofMatlTar&Grv 7.917e+01 4.386e+00 18.050 < 2e-16 ***
## RoofMatlWdShake 8.186e+01 4.536e+00 18.048 < 2e-16 ***
## RoofMatlWdShngl 8.871e+01 4.460e+00 19.890 < 2e-16 ***
## ExterQualFa -3.249e+00 1.702e+00 -1.908 0.056569 .
## ExterQualGd -2.777e+00 7.046e-01 -3.941 8.56e-05 ***
## ExterQualTA -3.002e+00 7.860e-01 -3.819 0.000141 ***
## BsmtQualFa -3.338e+00 9.502e-01 -3.513 0.000459 ***
## BsmtQualGd -3.696e+00 4.938e-01 -7.484 1.33e-13 ***
## BsmtQualTA -3.641e+00 6.053e-01 -6.016 2.33e-09 ***
## BsmtExposureGd 2.486e+00 4.577e-01 5.433 6.64e-08 ***
## BsmtExposureMn -1.271e-01 4.526e-01 -0.281 0.778904
## BsmtExposureNo -7.870e-01 3.105e-01 -2.535 0.011367 *
## KitchenQualFa -3.889e+00 9.971e-01 -3.900 0.000101 ***
## KitchenQualGd -3.621e+00 5.339e-01 -6.782 1.80e-11 ***
## KitchenQualTA -3.894e+00 5.995e-01 -6.497 1.17e-10 ***
## GarageQualFa -1.158e+01 4.218e+00 -2.746 0.006114 **
## GarageQualGd -1.007e+01 4.303e+00 -2.341 0.019400 *
## GarageQualPo -1.237e+01 5.151e+00 -2.401 0.016508 *
## GarageQualTA -1.181e+01 4.179e+00 -2.826 0.004790 **
## GarageCondFa 1.033e+01 5.013e+00 2.061 0.039459 *
## GarageCondGd 1.039e+01 5.120e+00 2.030 0.042566 *
## GarageCondPo 9.800e+00 5.352e+00 1.831 0.067323 .
## GarageCondTA 1.106e+01 4.946e+00 2.236 0.025515 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.677 on 1280 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.8934, Adjusted R-squared: 0.8878
## F-statistic: 160.1 on 67 and 1280 DF, p-value: < 2.2e-16
Look in the output below at how GarageCond, GarageQual, and MSZoning have dropped off in significance, with GarageCond now the least significant, then GarageQual, then MSZoning. That tells me we did have overfitting from too many variables. We now have our next three predictors to remove and the order in which to remove them.
Otherwise, our residuals look great, with the median essentially zero and the 1st and 3rd quartiles tight around the median (remember we scaled our target value, SalePrice, to 0-100). Adjusted R-squared is going down, but that’s OK: housing is a complicated market. We want a generalizable model, not a perfect model.
My only concern at this stage is whether it would have been better to raise any of our variables to a power. Maybe we can go back and evaluate pairs() and the residuals when we get to the end of this backwards elimination process.
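One way to test whether a squared term is worth adding is to fit the model with and without it and compare the two fits. The sketch below assumes nothing about the housing data: it uses the built-in mtcars data, and the choice of hp as the predictor is purely for illustration.

```r
# Illustrative only: the built-in mtcars data stands in for train.
data(mtcars)

linear    <- lm(mpg ~ hp, data = mtcars)
quadratic <- lm(mpg ~ hp + I(hp^2), data = mtcars)  # I() protects ^ inside a formula

# F-test for the single added term
cmp <- anova(linear, quadratic)
print(cmp)

# Adjusted R-squared side by side (same rows in both fits, so comparable)
print(c(linear    = summary(linear)$adj.r.squared,
        quadratic = summary(quadratic)$adj.r.squared))
```

The visual counterpart would be plot(fitted(linear), residuals(linear)): a curved band of residuals suggests a missing power term, while a flat band suggests the linear form is adequate.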
In Model 12 below we remove Neighborhood. It’s the last of our triaged predictors to be removed, and this is where we identify the next three to try removing. Out of its many dummy variables, only a few levels of Neighborhood have a definitive effect on the price. I don’t know the math for it, but if we aggregated the p-values for all of them in some way, it’s likely the combined p-value would not meet our significance criteria.
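The standard way to get a single combined p-value for all levels of a factor is a partial F-test, which compares the model with the factor against the model without it. The sketch below runs on simulated data, not on train, and every name in it (toy, grp, full, reduced) is made up for illustration; whether the result for Neighborhood would confirm or contradict the guess above would have to be checked on the actual data.

```r
# Illustrative only: simulated data with a 4-level factor where one level matters.
set.seed(1)
n   <- 200
grp <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
x   <- rnorm(n)
y   <- 2 * x + ifelse(grp == "D", 1.5, 0) + rnorm(n)
toy <- data.frame(y, x, grp)

full    <- lm(y ~ x + grp, data = toy)
reduced <- update(full, . ~ . - grp)

# One F-test covering all of the factor's dummy coefficients at once
f_test <- anova(reduced, full)
print(f_test)

# drop1() reports the same kind of test for every term in the model
print(drop1(full, test = "F"))
```

Both anova() on nested models and drop1(..., test = "F") require that the two fits use the same rows, which holds here because the simulated data has no missing values.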
lm12 <- update(lm11, .~. -Neighborhood, data=train)
summary(lm12)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + MSZoning +
## BldgType + RoofMatl + ExterQual + BsmtQual + BsmtExposure +
## KitchenQual + GarageQual + GarageCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.616 -1.609 0.013 1.546 29.672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.949e+02 1.653e+01 -11.792 < 2e-16 ***
## LotArea 5.605e-05 1.207e-05 4.643 3.78e-06 ***
## OverallQual 1.556e+00 1.572e-01 9.898 < 2e-16 ***
## OverallCond 9.665e-01 1.227e-01 7.880 6.85e-15 ***
## YearBuilt 5.184e-02 7.715e-03 6.719 2.73e-11 ***
## BsmtUnfSF -2.206e-03 2.938e-04 -7.510 1.10e-13 ***
## TotalBsmtSF 5.061e-03 4.201e-04 12.047 < 2e-16 ***
## GrLivArea 7.964e-03 3.097e-04 25.711 < 2e-16 ***
## GarageArea 4.132e-03 8.097e-04 5.103 3.83e-07 ***
## MSZoningFV 3.776e+00 1.623e+00 2.327 0.020143 *
## MSZoningRH 2.932e+00 1.918e+00 1.529 0.126581
## MSZoningRL 2.807e+00 1.529e+00 1.836 0.066569 .
## MSZoningRM 2.076e+00 1.533e+00 1.355 0.175732
## BldgType2fmCon -2.235e+00 9.218e-01 -2.424 0.015475 *
## BldgTypeDuplex -4.737e+00 8.270e-01 -5.728 1.26e-08 ***
## BldgTypeTwnhs -2.522e+00 7.165e-01 -3.520 0.000447 ***
## BldgTypeTwnhsE -1.437e+00 4.548e-01 -3.159 0.001620 **
## RoofMatlCompShg 9.029e+01 4.456e+00 20.263 < 2e-16 ***
## RoofMatlMembran 9.204e+01 5.989e+00 15.367 < 2e-16 ***
## RoofMatlMetal 9.233e+01 6.035e+00 15.299 < 2e-16 ***
## RoofMatlRoll 9.231e+01 6.139e+00 15.037 < 2e-16 ***
## RoofMatlTar&Grv 8.678e+01 4.638e+00 18.709 < 2e-16 ***
## RoofMatlWdShake 8.817e+01 4.790e+00 18.406 < 2e-16 ***
## RoofMatlWdShngl 9.644e+01 4.723e+00 20.421 < 2e-16 ***
## ExterQualFa -3.707e+00 1.833e+00 -2.023 0.043282 *
## ExterQualGd -2.976e+00 7.517e-01 -3.959 7.95e-05 ***
## ExterQualTA -4.117e+00 8.362e-01 -4.923 9.59e-07 ***
## BsmtQualFa -4.175e+00 1.006e+00 -4.148 3.57e-05 ***
## BsmtQualGd -4.125e+00 5.136e-01 -8.031 2.14e-15 ***
## BsmtQualTA -4.639e+00 6.271e-01 -7.397 2.48e-13 ***
## BsmtExposureGd 2.202e+00 4.877e-01 4.516 6.86e-06 ***
## BsmtExposureMn -1.468e-01 4.878e-01 -0.301 0.763518
## BsmtExposureNo -7.930e-01 3.291e-01 -2.410 0.016098 *
## KitchenQualFa -3.759e+00 1.071e+00 -3.511 0.000463 ***
## KitchenQualGd -3.958e+00 5.692e-01 -6.954 5.60e-12 ***
## KitchenQualTA -4.338e+00 6.369e-01 -6.811 1.48e-11 ***
## GarageQualFa -7.458e+00 4.566e+00 -1.633 0.102671
## GarageQualGd -6.087e+00 4.658e+00 -1.307 0.191532
## GarageQualPo -6.209e+00 5.557e+00 -1.117 0.264011
## GarageQualTA -7.759e+00 4.522e+00 -1.716 0.086444 .
## GarageCondFa 6.086e+00 5.413e+00 1.124 0.261059
## GarageCondGd 5.416e+00 5.525e+00 0.980 0.327127
## GarageCondPo 4.580e+00 5.772e+00 0.793 0.427688
## GarageCondTA 6.321e+00 5.338e+00 1.184 0.236543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.008 on 1304 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.871, Adjusted R-squared: 0.8667
## F-statistic: 204.7 on 43 and 1304 DF, p-value: < 2.2e-16
In Model 13 below, we remove GarageCond. In the output below, Adjusted R-squared increases for the first time (from 0.8667 to 0.8669) while the number of records excluded due to missing values stays at 112. This is an improvement we’ll keep, and the residual interquartile range is still tightly banded around a median near zero.
lm13 <- update(lm12, .~. -GarageCond, data=train)
summary(lm13)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + MSZoning +
## BldgType + RoofMatl + ExterQual + BsmtQual + BsmtExposure +
## KitchenQual + GarageQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.623 -1.661 0.005 1.564 29.611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.937e+02 1.632e+01 -11.864 < 2e-16 ***
## LotArea 5.621e-05 1.206e-05 4.662 3.46e-06 ***
## OverallQual 1.566e+00 1.569e-01 9.987 < 2e-16 ***
## OverallCond 9.773e-01 1.221e-01 8.007 2.59e-15 ***
## YearBuilt 5.203e-02 7.673e-03 6.781 1.80e-11 ***
## BsmtUnfSF -2.195e-03 2.932e-04 -7.486 1.30e-13 ***
## TotalBsmtSF 5.043e-03 4.191e-04 12.031 < 2e-16 ***
## GrLivArea 7.981e-03 3.086e-04 25.859 < 2e-16 ***
## GarageArea 4.139e-03 8.091e-04 5.116 3.59e-07 ***
## MSZoningFV 3.776e+00 1.621e+00 2.330 0.019978 *
## MSZoningRH 2.956e+00 1.917e+00 1.542 0.123243
## MSZoningRL 2.808e+00 1.526e+00 1.840 0.065964 .
## MSZoningRM 2.095e+00 1.530e+00 1.369 0.171168
## BldgType2fmCon -2.199e+00 9.194e-01 -2.392 0.016914 *
## BldgTypeDuplex -4.703e+00 8.259e-01 -5.695 1.53e-08 ***
## BldgTypeTwnhs -2.515e+00 7.159e-01 -3.513 0.000458 ***
## BldgTypeTwnhsE -1.420e+00 4.542e-01 -3.126 0.001811 **
## RoofMatlCompShg 9.026e+01 4.453e+00 20.272 < 2e-16 ***
## RoofMatlMembran 9.204e+01 5.985e+00 15.378 < 2e-16 ***
## RoofMatlMetal 9.235e+01 6.031e+00 15.311 < 2e-16 ***
## RoofMatlRoll 9.203e+01 6.081e+00 15.133 < 2e-16 ***
## RoofMatlTar&Grv 8.670e+01 4.634e+00 18.710 < 2e-16 ***
## RoofMatlWdShake 8.813e+01 4.787e+00 18.412 < 2e-16 ***
## RoofMatlWdShngl 9.715e+01 4.684e+00 20.739 < 2e-16 ***
## ExterQualFa -3.652e+00 1.829e+00 -1.997 0.045996 *
## ExterQualGd -2.941e+00 7.498e-01 -3.923 9.22e-05 ***
## ExterQualTA -4.094e+00 8.347e-01 -4.904 1.05e-06 ***
## BsmtQualFa -4.177e+00 1.005e+00 -4.155 3.47e-05 ***
## BsmtQualGd -4.089e+00 5.125e-01 -7.979 3.19e-15 ***
## BsmtQualTA -4.584e+00 6.248e-01 -7.337 3.83e-13 ***
## BsmtExposureGd 2.180e+00 4.868e-01 4.479 8.15e-06 ***
## BsmtExposureMn -1.266e-01 4.854e-01 -0.261 0.794242
## BsmtExposureNo -7.977e-01 3.286e-01 -2.427 0.015351 *
## KitchenQualFa -3.827e+00 1.068e+00 -3.582 0.000354 ***
## KitchenQualGd -4.002e+00 5.654e-01 -7.078 2.39e-12 ***
## KitchenQualTA -4.378e+00 6.332e-01 -6.914 7.35e-12 ***
## GarageQualFa -3.168e+00 2.488e+00 -1.273 0.203214
## GarageQualGd -1.834e+00 2.645e+00 -0.693 0.488131
## GarageQualPo -3.381e+00 3.408e+00 -0.992 0.321290
## GarageQualTA -3.242e+00 2.417e+00 -1.341 0.180071
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.005 on 1308 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.8707, Adjusted R-squared: 0.8669
## F-statistic: 225.9 on 39 and 1308 DF, p-value: < 2.2e-16
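In Model 14 below, we remove GarageQual. The number of records excluded due to missing values drops from 112 to 38, so the model is now fit on more rows and its Adjusted R-squared (0.8702) can’t be compared directly with Model 13’s (0.8669). Let’s continue.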
lm14 <- update(lm13, .~. -GarageQual, data=train)
summary(lm14)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + MSZoning +
## BldgType + RoofMatl + ExterQual + BsmtQual + BsmtExposure +
## KitchenQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.997 -1.668 0.026 1.615 30.387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.903e+02 1.480e+01 -12.856 < 2e-16 ***
## LotArea 5.578e-05 1.188e-05 4.696 2.92e-06 ***
## OverallQual 1.495e+00 1.470e-01 10.166 < 2e-16 ***
## OverallCond 9.202e-01 1.127e-01 8.166 7.09e-16 ***
## YearBuilt 5.036e-02 7.002e-03 7.192 1.04e-12 ***
## BsmtUnfSF -2.206e-03 2.849e-04 -7.745 1.83e-14 ***
## TotalBsmtSF 4.938e-03 4.091e-04 12.069 < 2e-16 ***
## GrLivArea 7.813e-03 2.937e-04 26.604 < 2e-16 ***
## GarageArea 3.642e-03 6.783e-04 5.369 9.27e-08 ***
## MSZoningFV 3.728e+00 1.458e+00 2.557 0.010677 *
## MSZoningRH 2.465e+00 1.664e+00 1.482 0.138592
## MSZoningRL 2.828e+00 1.355e+00 2.088 0.036981 *
## MSZoningRM 2.042e+00 1.359e+00 1.503 0.133045
## BldgType2fmCon -2.061e+00 7.822e-01 -2.635 0.008517 **
## BldgTypeDuplex -3.754e+00 7.016e-01 -5.351 1.02e-07 ***
## BldgTypeTwnhs -2.501e+00 6.732e-01 -3.715 0.000212 ***
## BldgTypeTwnhsE -1.525e+00 4.439e-01 -3.436 0.000607 ***
## RoofMatlCompShg 8.889e+01 4.396e+00 20.220 < 2e-16 ***
## RoofMatlMembran 9.079e+01 5.917e+00 15.343 < 2e-16 ***
## RoofMatlMetal 9.091e+01 5.964e+00 15.242 < 2e-16 ***
## RoofMatlRoll 8.966e+01 5.996e+00 14.953 < 2e-16 ***
## RoofMatlTar&Grv 8.590e+01 4.565e+00 18.816 < 2e-16 ***
## RoofMatlWdShake 8.690e+01 4.729e+00 18.377 < 2e-16 ***
## RoofMatlWdShngl 9.661e+01 4.619e+00 20.914 < 2e-16 ***
## ExterQualFa -3.782e+00 1.529e+00 -2.474 0.013480 *
## ExterQualGd -3.219e+00 7.404e-01 -4.348 1.47e-05 ***
## ExterQualTA -4.589e+00 8.193e-01 -5.601 2.57e-08 ***
## BsmtQualFa -4.396e+00 9.537e-01 -4.609 4.42e-06 ***
## BsmtQualGd -4.275e+00 5.055e-01 -8.456 < 2e-16 ***
## BsmtQualTA -4.810e+00 6.071e-01 -7.922 4.74e-15 ***
## BsmtExposureGd 2.234e+00 4.726e-01 4.727 2.51e-06 ***
## BsmtExposureMn -5.700e-02 4.723e-01 -0.121 0.903950
## BsmtExposureNo -7.758e-01 3.189e-01 -2.432 0.015128 *
## KitchenQualFa -3.555e+00 9.506e-01 -3.740 0.000192 ***
## KitchenQualGd -3.905e+00 5.459e-01 -7.155 1.36e-12 ***
## KitchenQualTA -4.354e+00 6.075e-01 -7.168 1.24e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.974 on 1386 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.8734, Adjusted R-squared: 0.8702
## F-statistic: 273.1 on 35 and 1386 DF, p-value: < 2.2e-16
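Because Model 14 is fit on more rows than Model 13, one way to make the two Adjusted R-squared values comparable is to refit the smaller model on exactly the rows the larger model used. The sketch below is only an illustration of that idea: it uses R's built-in airquality data as a stand-in for train, and the names full, reduced, and common are ours, not part of the original analysis.

```r
# Stand-in data: airquality ships with R and has NAs in Ozone and Solar.R.
dat <- airquality

full    <- lm(Ozone ~ Solar.R + Wind + Temp, data = dat)
reduced <- update(full, . ~ . - Solar.R)  # drops a predictor that has NAs

# The two fits use different numbers of rows, so their
# adjusted R-squared values are not directly comparable.
c(full = nobs(full), reduced = nobs(reduced))

# Keep only the rows that are complete for every variable in the full model,
# then refit the reduced model on that common subset.
common         <- dat[complete.cases(dat[, all.vars(formula(full))]), ]
reduced_common <- update(reduced, data = common)

# Both values below are now computed on the same observations.
c(full           = summary(full)$adj.r.squared,
  reduced_common = summary(reduced_common)$adj.r.squared)
```

With the report's objects, the same pattern would filter train on the variables of lm13 before refitting lm14.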
After removing MSZoning below, Adjusted R-squared dips only slightly (0.8702 to 0.8695) with four fewer coefficients, so it looks like Model 15 is our best so far.
Let’s keep going to see if we’ve hit the inflection point between simplicity and accuracy.
lm15 <- update(lm14, .~. -MSZoning, data=train)
summary(lm15)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + BldgType +
## RoofMatl + ExterQual + BsmtQual + BsmtExposure + KitchenQual,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.179 -1.679 0.000 1.623 30.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.034e+02 1.387e+01 -14.665 < 2e-16 ***
## LotArea 5.830e-05 1.185e-05 4.919 9.71e-07 ***
## OverallQual 1.525e+00 1.466e-01 10.398 < 2e-16 ***
## OverallCond 9.316e-01 1.122e-01 8.302 2.40e-16 ***
## YearBuilt 5.797e-02 6.501e-03 8.917 < 2e-16 ***
## BsmtUnfSF -2.269e-03 2.840e-04 -7.989 2.83e-15 ***
## TotalBsmtSF 4.988e-03 4.009e-04 12.445 < 2e-16 ***
## GrLivArea 7.824e-03 2.939e-04 26.625 < 2e-16 ***
## GarageArea 3.542e-03 6.748e-04 5.248 1.77e-07 ***
## BldgType2fmCon -2.060e+00 7.823e-01 -2.633 0.008548 **
## BldgTypeDuplex -3.721e+00 7.008e-01 -5.310 1.27e-07 ***
## BldgTypeTwnhs -2.782e+00 6.453e-01 -4.311 1.74e-05 ***
## BldgTypeTwnhsE -1.717e+00 4.262e-01 -4.028 5.93e-05 ***
## RoofMatlCompShg 8.927e+01 4.386e+00 20.351 < 2e-16 ***
## RoofMatlMembran 9.137e+01 5.916e+00 15.443 < 2e-16 ***
## RoofMatlMetal 9.144e+01 5.962e+00 15.338 < 2e-16 ***
## RoofMatlRoll 9.025e+01 5.987e+00 15.074 < 2e-16 ***
## RoofMatlTar&Grv 8.644e+01 4.554e+00 18.982 < 2e-16 ***
## RoofMatlWdShake 8.730e+01 4.717e+00 18.505 < 2e-16 ***
## RoofMatlWdShngl 9.697e+01 4.614e+00 21.014 < 2e-16 ***
## ExterQualFa -4.444e+00 1.501e+00 -2.960 0.003126 **
## ExterQualGd -3.209e+00 7.421e-01 -4.324 1.64e-05 ***
## ExterQualTA -4.582e+00 8.196e-01 -5.591 2.71e-08 ***
## BsmtQualFa -4.128e+00 9.515e-01 -4.338 1.54e-05 ***
## BsmtQualGd -4.169e+00 5.053e-01 -8.251 3.62e-16 ***
## BsmtQualTA -4.651e+00 6.056e-01 -7.681 2.97e-14 ***
## BsmtExposureGd 2.190e+00 4.729e-01 4.631 3.99e-06 ***
## BsmtExposureMn 2.576e-02 4.722e-01 0.055 0.956493
## BsmtExposureNo -6.799e-01 3.180e-01 -2.138 0.032712 *
## KitchenQualFa -3.435e+00 9.523e-01 -3.607 0.000321 ***
## KitchenQualGd -3.856e+00 5.467e-01 -7.053 2.76e-12 ***
## KitchenQualTA -4.334e+00 6.090e-01 -7.117 1.77e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.985 on 1390 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.8723, Adjusted R-squared: 0.8695
## F-statistic: 306.3 on 31 and 1390 DF, p-value: < 2.2e-16
Removing BsmtExposure in Model 16 below gave us back an additional observation (37 excluded instead of 38), our Adjusted R-squared didn’t drop meaningfully (0.8695 to 0.8655), and the residual interquartile range is still tight.
lm16 <- update(lm15, .~. -BsmtExposure, data=train)
summary(lm16)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + BldgType +
## RoofMatl + ExterQual + BsmtQual + KitchenQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.322 -1.702 0.000 1.592 29.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.054e+02 1.402e+01 -14.657 < 2e-16 ***
## LotArea 7.263e-05 1.181e-05 6.152 9.98e-10 ***
## OverallQual 1.600e+00 1.483e-01 10.788 < 2e-16 ***
## OverallCond 9.463e-01 1.138e-01 8.318 < 2e-16 ***
## YearBuilt 5.786e-02 6.570e-03 8.808 < 2e-16 ***
## BsmtUnfSF -2.741e-03 2.791e-04 -9.821 < 2e-16 ***
## TotalBsmtSF 5.618e-03 3.954e-04 14.208 < 2e-16 ***
## GrLivArea 7.644e-03 2.952e-04 25.899 < 2e-16 ***
## GarageArea 3.566e-03 6.831e-04 5.221 2.05e-07 ***
## BldgType2fmCon -1.845e+00 7.903e-01 -2.334 0.019722 *
## BldgTypeDuplex -3.290e+00 7.071e-01 -4.653 3.58e-06 ***
## BldgTypeTwnhs -2.806e+00 6.538e-01 -4.292 1.89e-05 ***
## BldgTypeTwnhsE -1.652e+00 4.321e-01 -3.823 0.000138 ***
## RoofMatlCompShg 9.066e+01 4.445e+00 20.396 < 2e-16 ***
## RoofMatlMembran 9.423e+01 5.987e+00 15.739 < 2e-16 ***
## RoofMatlMetal 9.503e+01 6.025e+00 15.773 < 2e-16 ***
## RoofMatlRoll 9.127e+01 6.073e+00 15.030 < 2e-16 ***
## RoofMatlTar&Grv 8.951e+01 4.598e+00 19.468 < 2e-16 ***
## RoofMatlWdShake 8.886e+01 4.781e+00 18.589 < 2e-16 ***
## RoofMatlWdShngl 9.930e+01 4.668e+00 21.273 < 2e-16 ***
## ExterQualFa -4.057e+00 1.522e+00 -2.666 0.007769 **
## ExterQualGd -3.395e+00 7.501e-01 -4.526 6.53e-06 ***
## ExterQualTA -4.663e+00 8.296e-01 -5.620 2.30e-08 ***
## BsmtQualFa -4.413e+00 9.634e-01 -4.581 5.04e-06 ***
## BsmtQualGd -4.433e+00 5.111e-01 -8.674 < 2e-16 ***
## BsmtQualTA -5.111e+00 6.095e-01 -8.386 < 2e-16 ***
## KitchenQualFa -3.283e+00 9.653e-01 -3.401 0.000691 ***
## KitchenQualGd -3.673e+00 5.522e-01 -6.652 4.15e-11 ***
## KitchenQualTA -4.250e+00 6.164e-01 -6.894 8.18e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.043 on 1394 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.8682, Adjusted R-squared: 0.8655
## F-statistic: 327.8 on 28 and 1394 DF, p-value: < 2.2e-16
Model 16 is good; however, we want to compare it to what Model 17 could be by looking at the Adjusted R-squared and the number of excluded records depending on what we remove next.
In reviewing Model 16’s output above, between BldgType and ExterQual, it looks like BldgType has less significance for the model, so we’ll try removing that first.
lm17a <- update(lm16, .~. -BldgType, data=train)
summary(lm17a)$adj.r.squared
## [1] 0.8605953
length(summary(lm17a)$na.action)
## [1] 37
summary(lm16)$adj.r.squared
## [1] 0.8655162
length(summary(lm16)$na.action)
## [1] 37
Removing either of them lowers Adjusted R-squared by less than 0.005 (from 0.8655 to 0.8606 without BldgType and to 0.8626 without ExterQual), so it looks like we can remove both of them.
lm17b <- update(lm16, .~. -ExterQual, data=train)
summary(lm17b)$adj.r.squared
## [1] 0.8626119
length(summary(lm17b)$na.action)
## [1] 37
summary(lm16)$adj.r.squared
## [1] 0.8655162
length(summary(lm16)$na.action)
## [1] 37
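The two one-off comparisons above can also be collected into a single table. This is a minimal sketch under stated assumptions: mtcars and the model formula are stand-ins for train and lm16, and describe_fit, candidates, and comparison are hypothetical names introduced here for illustration.

```r
# Stand-in data and model; with the report's objects this would be lm16 and train.
base_fit   <- lm(mpg ~ wt + hp + qsec + factor(cyl), data = mtcars)
candidates <- c("qsec", "factor(cyl)")  # terms we are considering dropping

# Summarise one fit: adjusted R-squared and rows dropped for missing values.
describe_fit <- function(fit, label) {
  data.frame(dropped    = label,
             adj_r2     = summary(fit)$adj.r.squared,
             n_excluded = length(fit$na.action))
}

# Refit once per candidate, removing that term from the base formula.
rows <- lapply(candidates, function(term) {
  reduced <- update(base_fit, as.formula(paste(". ~ . -", term)))
  describe_fit(reduced, term)
})

# First row is the unchanged base model, for reference.
comparison <- do.call(rbind, c(list(describe_fit(base_fit, "(none)")), rows))
comparison
```

Reading the table row by row shows at a glance which removal costs the least Adjusted R-squared and whether any removal changes the number of excluded records.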
Removing ExterQual actually had less of an impact than removing BldgType, so we’ll take out ExterQual first.
lm17 <- update(lm16, .~. -ExterQual, data=train)
summary(lm17)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + BldgType +
## RoofMatl + BsmtQual + KitchenQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.116 -1.746 -0.048 1.598 27.672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.153e+02 1.395e+01 -15.430 < 2e-16 ***
## LotArea 7.164e-05 1.193e-05 6.007 2.41e-09 ***
## OverallQual 1.761e+00 1.456e-01 12.096 < 2e-16 ***
## OverallCond 9.617e-01 1.138e-01 8.453 < 2e-16 ***
## YearBuilt 6.131e-02 6.569e-03 9.334 < 2e-16 ***
## BsmtUnfSF -2.613e-03 2.805e-04 -9.314 < 2e-16 ***
## TotalBsmtSF 5.756e-03 3.986e-04 14.443 < 2e-16 ***
## GrLivArea 7.693e-03 2.980e-04 25.813 < 2e-16 ***
## GarageArea 3.648e-03 6.898e-04 5.288 1.44e-07 ***
## BldgType2fmCon -1.747e+00 7.939e-01 -2.200 0.027965 *
## BldgTypeDuplex -3.436e+00 7.131e-01 -4.818 1.61e-06 ***
## BldgTypeTwnhs -2.747e+00 6.607e-01 -4.158 3.41e-05 ***
## BldgTypeTwnhsE -1.493e+00 4.355e-01 -3.428 0.000625 ***
## RoofMatlCompShg 8.988e+01 4.485e+00 20.041 < 2e-16 ***
## RoofMatlMembran 9.309e+01 6.045e+00 15.400 < 2e-16 ***
## RoofMatlMetal 9.416e+01 6.084e+00 15.478 < 2e-16 ***
## RoofMatlRoll 9.064e+01 6.131e+00 14.784 < 2e-16 ***
## RoofMatlTar&Grv 8.848e+01 4.639e+00 19.074 < 2e-16 ***
## RoofMatlWdShake 8.826e+01 4.821e+00 18.307 < 2e-16 ***
## RoofMatlWdShngl 9.831e+01 4.709e+00 20.875 < 2e-16 ***
## BsmtQualFa -5.050e+00 9.642e-01 -5.237 1.88e-07 ***
## BsmtQualGd -5.110e+00 4.950e-01 -10.322 < 2e-16 ***
## BsmtQualTA -5.853e+00 6.004e-01 -9.748 < 2e-16 ***
## KitchenQualFa -4.261e+00 9.434e-01 -4.517 6.80e-06 ***
## KitchenQualGd -4.549e+00 5.223e-01 -8.710 < 2e-16 ***
## KitchenQualTA -5.492e+00 5.836e-01 -9.411 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.086 on 1397 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.865, Adjusted R-squared: 0.8626
## F-statistic: 358.1 on 25 and 1397 DF, p-value: < 2.2e-16
After removing BldgType in Model 18 below, every remaining predictor clearly contributes to the model, with strong (low) p-values.
lm18 <- update(lm17, .~. -BldgType, data=train)
summary(lm18)
##
## Call:
## lm(formula = ssp ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + RoofMatl +
## BsmtQual + KitchenQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.125 -1.839 -0.052 1.714 27.907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.101e+02 1.404e+01 -14.963 < 2e-16 ***
## LotArea 8.286e-05 1.190e-05 6.963 5.10e-12 ***
## OverallQual 1.805e+00 1.460e-01 12.359 < 2e-16 ***
## OverallCond 1.028e+00 1.151e-01 8.939 < 2e-16 ***
## YearBuilt 5.798e-02 6.583e-03 8.808 < 2e-16 ***
## BsmtUnfSF -2.539e-03 2.845e-04 -8.925 < 2e-16 ***
## TotalBsmtSF 5.670e-03 3.982e-04 14.237 < 2e-16 ***
## GrLivArea 7.663e-03 2.964e-04 25.854 < 2e-16 ***
## GarageArea 4.136e-03 6.959e-04 5.943 3.52e-09 ***
## RoofMatlCompShg 9.010e+01 4.535e+00 19.868 < 2e-16 ***
## RoofMatlMembran 9.338e+01 6.130e+00 15.233 < 2e-16 ***
## RoofMatlMetal 9.459e+01 6.167e+00 15.339 < 2e-16 ***
## RoofMatlRoll 8.777e+01 6.149e+00 14.273 < 2e-16 ***
## RoofMatlTar&Grv 8.859e+01 4.689e+00 18.892 < 2e-16 ***
## RoofMatlWdShake 8.873e+01 4.879e+00 18.185 < 2e-16 ***
## RoofMatlWdShngl 9.841e+01 4.770e+00 20.630 < 2e-16 ***
## BsmtQualFa -4.968e+00 9.797e-01 -5.071 4.48e-07 ***
## BsmtQualGd -5.218e+00 5.030e-01 -10.374 < 2e-16 ***
## BsmtQualTA -5.765e+00 6.086e-01 -9.472 < 2e-16 ***
## KitchenQualFa -4.290e+00 9.565e-01 -4.485 7.87e-06 ***
## KitchenQualGd -4.494e+00 5.311e-01 -8.463 < 2e-16 ***
## KitchenQualTA -5.619e+00 5.914e-01 -9.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.156 on 1401 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.86, Adjusted R-squared: 0.8579
## F-statistic: 409.7 on 21 and 1401 DF, p-value: < 2.2e-16
We could try removing each one of the variables and compare to our proposed final Model 18 like we did twice before. However, let’s examine our remaining predictors like we did in the Commonsense Elimination step.
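For reference, the try-each-removal comparison can be done in one call with drop1() from R's stats package, which tests each term as a whole (all levels of a factor together) rather than level by level. The sketch below is an illustration on the built-in mtcars data, not a rerun of Model 18; fit and tab are names chosen here.

```r
# Stand-in model with a mix of numeric terms and one factor.
fit <- lm(mpg ~ wt + hp + qsec + factor(cyl), data = mtcars)

# One row per term: the F test compares the full model against
# the model with that single term removed.
tab <- drop1(fit, test = "F")

# Sort so the weakest term (largest p-value) comes first;
# the "<none>" reference row has no p-value and sorts last.
tab[order(tab[["Pr(>F)"]], decreasing = TRUE), ]
```

The term at the top of the sorted table would be the next candidate for backward elimination.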
First we need to identify and set aside the non-numeric predictors, RoofMatl, BsmtQual, and KitchenQual, so we can chart the pairwise relationships between our target variable, ssp, on the y-axis and our numeric predictors on the x-axes.
Looking at the pairwise chart below, a number of the relationships look curved rather than linear. Let’s try a non-linear transformation on our target variable, SalePrice, and see if any of the relationships with the predictors straighten out.
Code Summary + Identify Numeric Predictors + Review Pairwise Plots
str(train[c("ssp", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "BsmtUnfSF", "TotalBsmtSF", "GrLivArea", "GarageArea", "RoofMatl", "BsmtQual", "KitchenQual")])
## 'data.frame': 1460 obs. of 12 variables:
## $ ssp : num 24.1 20.4 26.2 14.6 29.9 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ OverallQual: int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond: int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF: int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ KitchenQual: chr "Gd" "TA" "Gd" "Gd" ...
pairs18 <- train[c("ssp", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "BsmtUnfSF", "TotalBsmtSF", "GrLivArea", "GarageArea")]
pairs(pairs18,gap=.5)
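The numeric columns can also be picked out programmatically instead of by reading the str() output. This is a small sketch on the built-in iris data as a stand-in for train; model_vars and numeric_vars are hypothetical names used only here.

```r
# Stand-in data: iris has four numeric columns and one factor (Species).
dat        <- iris
model_vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Species")

# Flag which of the model's variables are numeric.
is_num       <- vapply(dat[model_vars], is.numeric, logical(1))
numeric_vars <- model_vars[is_num]

# Pairwise scatterplots of the numeric variables only.
pairs(dat[numeric_vars], gap = 0.5)
```

With the report's data, model_vars would hold the target followed by Model 18's predictors, and the non-numeric ones would be filtered out automatically.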
Based on the previous section we’re going to take a log transformation of the target variable. We need to start with the original variable so we can easily undo the transformation on our predicted values if we end up keeping the transformation for our model.
train$sspln <- log(train$SalePrice)
max_sspln <- max(train$sspln)
min_sspln <- min(train$sspln)
range <- max_sspln - min_sspln
train$sspln <- 100 * (train$sspln - min_sspln) / range
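To make the "undo" step concrete, the scaled log value can be mapped back to dollars by reversing the two operations in order: first the 0-100 scaling, then the log. The sketch below uses a short vector of illustrative prices rather than train, and the names lo, span, scaled, and recovered are ours.

```r
# Illustrative sale prices (not read from the competition data).
price <- c(34900, 129500, 163000, 214000, 755000)

# Forward: log, then rescale to 0-100 (same steps as for sspln).
log_price <- log(price)
lo        <- min(log_price)
span      <- max(log_price) - lo
scaled    <- 100 * (log_price - lo) / span

# Inverse: undo the scaling first, then exponentiate.
recovered <- exp(scaled / 100 * span + lo)

all.equal(recovered, price)
```

The inverse needs the minimum and range of the logged training prices, which is why those two numbers have to be kept around if predictions are made on the scaled target.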
It’s hard to tell with certainty, but it does look like the obvious curve in OverallQual has linearized, so we’ll keep this change for our model.
pairs18 <- train[c("sspln", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "BsmtUnfSF", "TotalBsmtSF", "GrLivArea", "GarageArea")]
pairs(pairs18,gap=.5)
Let’s retrain our model using log-transformed (and then scaled to 100) SalePrice to compare against our previous model.
It looks like Adjusted R-squared has gone up (0.8579 to 0.874), though the two values are computed on different response scales. The median residual is now slightly positive (0.316), so the model slightly underestimates the typical home.
Let’s keep it, submit, and see how we did.
lm19 <- lm(sspln ~ LotArea + OverallQual + OverallCond + YearBuilt +
BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + RoofMatl +
BsmtQual + KitchenQual, data=train)
summary(lm19)
##
## Call:
## lm(formula = sspln ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + RoofMatl +
## BsmtQual + KitchenQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.642 -1.875 0.316 2.397 15.579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.954e+02 1.536e+01 -19.228 < 2e-16 ***
## LotArea 9.797e-05 1.302e-05 7.524 9.46e-14 ***
## OverallQual 2.439e+00 1.598e-01 15.262 < 2e-16 ***
## OverallCond 1.901e+00 1.259e-01 15.097 < 2e-16 ***
## YearBuilt 1.039e-01 7.203e-03 14.430 < 2e-16 ***
## BsmtUnfSF -2.709e-03 3.113e-04 -8.703 < 2e-16 ***
## TotalBsmtSF 6.286e-03 4.357e-04 14.426 < 2e-16 ***
## GrLivArea 8.815e-03 3.243e-04 27.180 < 2e-16 ***
## GarageArea 6.491e-03 7.614e-04 8.525 < 2e-16 ***
## RoofMatlCompShg 9.753e+01 4.962e+00 19.656 < 2e-16 ***
## RoofMatlMembran 1.023e+02 6.708e+00 15.255 < 2e-16 ***
## RoofMatlMetal 1.053e+02 6.748e+00 15.599 < 2e-16 ***
## RoofMatlRoll 9.666e+01 6.729e+00 14.366 < 2e-16 ***
## RoofMatlTar&Grv 9.700e+01 5.131e+00 18.904 < 2e-16 ***
## RoofMatlWdShake 9.586e+01 5.339e+00 17.955 < 2e-16 ***
## RoofMatlWdShngl 9.844e+01 5.219e+00 18.860 < 2e-16 ***
## BsmtQualFa -2.702e+00 1.072e+00 -2.520 0.011832 *
## BsmtQualGd -1.832e+00 5.504e-01 -3.328 0.000899 ***
## BsmtQualTA -2.711e+00 6.660e-01 -4.071 4.94e-05 ***
## KitchenQualFa -2.996e+00 1.047e+00 -2.863 0.004262 **
## KitchenQualGd -1.329e+00 5.811e-01 -2.288 0.022304 *
## KitchenQualTA -2.778e+00 6.471e-01 -4.293 1.89e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.548 on 1401 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.8758, Adjusted R-squared: 0.874
## F-statistic: 470.5 on 21 and 1401 DF, p-value: < 2.2e-16
It’s time to train the final model, predict the target values for the test data, and submit our results to the contest!
Here we train the final model but without the 0-100 scaling we used to help compare the residual interquartile range.
train$spln <- log(train$SalePrice)
lm20 <- lm(spln ~ LotArea + OverallQual + OverallCond + YearBuilt +
BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + RoofMatl +
BsmtQual + KitchenQual, data=train)
This code produces target values for the test data, removes the log-transformation in the predicted sale price, and writes it to a file so we can submit it.
test_targetsln <- predict(lm20, newdata=test)
test_targets <- exp(test_targetsln)
targets <- data.frame(cbind(test_targets))
test$SalePrice <- targets[,1]
sub = data.frame(test$Id,test$SalePrice)
colnames(sub)[1] ="Id"
colnames(sub)[2] ="SalePrice"
write.csv(sub, file="./submission.csv", row.names=FALSE)
Our submission failed because there were NA values. It looks like the majority of the NA values come from the BsmtQual predictor, so we’re going to adjust our model one more time.
Here we take just the rows where the predicted SalePrice is NA and look at just the predictors used in the model, which shows that we need to remove BsmtQual from our model.
Currently there are 46 predictions missing.
new_test <- test[is.na(test$SalePrice),]
new_test[c('Id', 'SalePrice', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'BsmtUnfSF', 'TotalBsmtSF', 'GrLivArea', 'GarageArea', 'RoofMatl', 'BsmtQual', 'KitchenQual')]
## Id SalePrice LotArea OverallQual OverallCond YearBuilt BsmtUnfSF
## 96 1556 NA 10632 5 3 1917 689
## 126 1586 NA 8777 3 6 1945 0
## 134 1594 NA 7200 4 6 1967 0
## 270 1730 NA 8250 6 7 1981 0
## 319 1779 NA 9533 5 5 1953 0
## 355 1815 NA 5925 2 4 1940 0
## 388 1848 NA 9000 2 2 1947 0
## 389 1849 NA 15635 4 5 1954 0
## 397 1857 NA 26400 5 7 1880 0
## 398 1858 NA 7018 5 5 1979 0
## 399 1859 NA 7018 5 5 1979 0
## 401 1861 NA 7007 5 5 1979 0
## 456 1916 NA 21780 2 4 1910 0
## 591 2051 NA 7785 5 5 1956 0
## 607 2067 NA 8838 5 3 1957 0
## 609 2069 NA 10122 4 6 1948 0
## 661 2121 NA 5940 4 7 1946 NA
## 663 2123 NA 6120 5 6 1945 0
## 729 2189 NA 47007 5 7 1959 0
## 730 2190 NA 6012 4 5 1955 0
## 731 2191 NA 6845 4 5 1955 0
## 734 2194 NA 8050 5 8 1947 0
## 757 2217 NA 14584 1 5 1952 0
## 758 2218 NA 5280 4 7 1895 173
## 759 2219 NA 5150 4 7 1910 356
## 765 2225 NA 10260 5 4 1976 0
## 928 2388 NA 10899 4 5 1964 0
## 976 2436 NA 7000 5 6 1961 0
## 993 2453 NA 8626 4 6 1956 0
## 994 2454 NA 11800 4 7 1949 0
## 1031 2491 NA 9000 4 7 1945 0
## 1039 2499 NA 11515 4 5 1958 0
## 1088 2548 NA 9555 5 6 1979 0
## 1093 2553 NA 6882 4 3 1955 0
## 1105 2565 NA 13108 5 5 1951 0
## 1117 2577 NA 9060 5 6 1923 311
## 1119 2579 NA 11067 2 4 1939 0
## 1140 2600 NA 43500 3 5 1953 0
## 1243 2703 NA 8927 6 6 1977 0
## 1304 2764 NA 11650 7 5 1959 0
## 1307 2767 NA 8544 3 4 1950 0
## 1344 2804 NA 21370 5 5 1950 0
## 1345 2805 NA 8250 5 7 1935 0
## 1365 2825 NA 12048 5 6 1952 0
## 1432 2892 NA 12366 3 5 1945 0
## 1445 2905 NA 31250 1 3 1951 0
## TotalBsmtSF GrLivArea GarageArea RoofMatl BsmtQual KitchenQual
## 96 689 1224 180 CompShg Gd <NA>
## 126 0 640 240 CompShg <NA> TA
## 134 0 2650 0 Tar&Grv <NA> TA
## 270 0 1882 612 CompShg <NA> TA
## 319 0 1210 616 CompShg <NA> TA
## 355 0 612 308 CompShg <NA> TA
## 388 0 660 0 CompShg <NA> Fa
## 389 0 1383 498 CompShg <NA> TA
## 397 0 2016 576 CompShg <NA> TA
## 398 0 2228 720 CompShg <NA> TA
## 399 0 1535 400 CompShg <NA> TA
## 401 0 1513 400 CompShg <NA> TA
## 456 0 810 280 CompShg <NA> TA
## 591 0 1014 267 CompShg <NA> TA
## 607 0 1764 301 CompShg <NA> TA
## 609 0 869 390 CompShg <NA> TA
## 661 NA 896 280 CompShg <NA> TA
## 663 0 808 164 CompShg <NA> TA
## 729 0 3820 624 CompShg <NA> Ex
## 730 0 1152 0 CompShg <NA> TA
## 731 0 1152 0 CompShg <NA> TA
## 734 0 1137 0 CompShg <NA> TA
## 757 0 733 487 CompShg <NA> Fa
## 758 173 1361 185 CompShg <NA> TA
## 759 356 1049 195 CompShg <NA> TA
## 765 0 1872 484 CompShg <NA> TA
## 928 0 1224 530 CompShg <NA> TA
## 976 0 925 300 CompShg <NA> TA
## 993 0 968 331 CompShg <NA> TA
## 994 0 1382 384 CompShg <NA> TA
## 1031 0 998 460 CompShg <NA> TA
## 1039 0 943 308 CompShg <NA> Gd
## 1088 0 2233 579 CompShg <NA> TA
## 1093 0 1152 0 CompShg <NA> Fa
## 1105 0 1226 400 CompShg <NA> TA
## 1117 859 1828 NA CompShg Gd Gd
## 1119 0 845 256 CompShg <NA> TA
## 1140 0 2034 1041 CompShg <NA> TA
## 1243 0 1654 528 CompShg <NA> TA
## 1304 0 1472 484 CompShg <NA> Gd
## 1307 0 1040 400 CompShg <NA> TA
## 1344 0 1640 394 CompShg <NA> TA
## 1345 0 1032 260 CompShg <NA> TA
## 1365 0 1488 569 CompShg <NA> TA
## 1432 0 729 0 CompShg <NA> TA
## 1445 0 1600 270 CompShg <NA> TA
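Rather than scanning the table by eye, the number of missing values per predictor among the failed predictions can be counted directly. This is a minimal sketch on the built-in airquality data, standing in for test and the fitted model; fit, pred, predictors, and missing_rows are illustrative names.

```r
# Stand-in data and model; airquality has NAs in Ozone and Solar.R.
dat <- airquality
fit <- lm(Temp ~ Ozone + Solar.R + Wind, data = dat)

# predict() returns NA for any row where a predictor is missing.
pred <- predict(fit, newdata = dat)

# Names of the variables on the right-hand side of the model formula.
predictors <- all.vars(delete.response(terms(fit)))

# Among rows with no prediction, count the NAs in each predictor.
missing_rows <- dat[is.na(pred), predictors]
colSums(is.na(missing_rows))
```

The predictor with the largest count is the one responsible for most of the missing predictions, which is the role BsmtQual plays in the table above.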
Originally I was going to train a model without BsmtQual to run on just the 46 rows with missing predictions. This is how I’d like to proceed, but then it wouldn’t be one model, so I’ll save that approach for real-world applications and drop BsmtQual from the model instead. Looking at the model output, it still looks great.
lm21 <- lm(spln ~ LotArea + OverallQual + OverallCond + YearBuilt +
BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + RoofMatl + KitchenQual, data=train)
summary(lm21)
##
## Call:
## lm(formula = spln ~ LotArea + OverallQual + OverallCond + YearBuilt +
## BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea + RoofMatl +
## KitchenQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.79879 -0.05970 0.00734 0.07420 0.47087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.610e-01 4.011e-01 2.147 0.031979 *
## LotArea 3.009e-06 3.998e-07 7.525 9.21e-14 ***
## OverallQual 7.898e-02 4.688e-03 16.846 < 2e-16 ***
## OverallCond 5.603e-02 3.782e-03 14.815 < 2e-16 ***
## YearBuilt 3.417e-03 1.865e-04 18.326 < 2e-16 ***
## BsmtUnfSF -8.465e-05 9.563e-06 -8.852 < 2e-16 ***
## TotalBsmtSF 1.970e-04 1.211e-05 16.267 < 2e-16 ***
## GrLivArea 2.725e-04 9.778e-06 27.870 < 2e-16 ***
## GarageArea 2.089e-04 2.285e-05 9.144 < 2e-16 ***
## RoofMatlCompShg 3.013e+00 1.519e-01 19.834 < 2e-16 ***
## RoofMatlMembran 3.173e+00 2.061e-01 15.399 < 2e-16 ***
## RoofMatlMetal 3.233e+00 2.072e-01 15.608 < 2e-16 ***
## RoofMatlRoll 2.979e+00 2.065e-01 14.424 < 2e-16 ***
## RoofMatlTar&Grv 3.004e+00 1.568e-01 19.160 < 2e-16 ***
## RoofMatlWdShake 2.964e+00 1.636e-01 18.114 < 2e-16 ***
## RoofMatlWdShngl 3.040e+00 1.601e-01 18.986 < 2e-16 ***
## KitchenQualFa -1.226e-01 3.101e-02 -3.955 8.02e-05 ***
## KitchenQualGd -6.098e-02 1.650e-02 -3.695 0.000228 ***
## KitchenQualTA -1.077e-01 1.883e-02 -5.719 1.30e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1403 on 1441 degrees of freedom
## Multiple R-squared: 0.8781, Adjusted R-squared: 0.8766
## F-statistic: 576.7 on 18 and 1441 DF, p-value: < 2.2e-16
We know there will still be a few NAs. We’re going to replace them with the median predicted SalePrice.
Good news! We’re down to 3 NAs, spread over four predictors (BsmtUnfSF, TotalBsmtSF, GarageArea, and KitchenQual).
Let’s swap the NAs with the median value since we’re not going to train a model without the offending predictors just for these few.
Code Summary + predict test values with new Model 21 + generate table with the NA records and their predictors + overwrite their values with the median predicted sale price, $158,008.5
test_targetsln <- predict(lm21, newdata=test)
test_targets <- exp(test_targetsln)
targets <- data.frame(cbind(test_targets))
test$SalePrice <- targets[,1]
new_test <- test[is.na(test$SalePrice),]
new_test[c('Id', 'SalePrice', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'BsmtUnfSF', 'TotalBsmtSF', 'GrLivArea', 'GarageArea', 'RoofMatl', 'KitchenQual')]
## Id SalePrice LotArea OverallQual OverallCond YearBuilt BsmtUnfSF
## 96 1556 NA 10632 5 3 1917 689
## 661 2121 NA 5940 4 7 1946 NA
## 1117 2577 NA 9060 5 6 1923 311
## TotalBsmtSF GrLivArea GarageArea RoofMatl KitchenQual
## 96 689 1224 180 CompShg <NA>
## 661 NA 896 280 CompShg TA
## 1117 859 1828 NA CompShg Gd
medianSalesPrice <- median(test$SalePrice, na.rm=TRUE)
# Fill every NA prediction (rows 96, 661, and 1117) with the median
test$SalePrice[is.na(test$SalePrice)] <- medianSalesPrice
Here we write Model 21’s predictions for the test data to the submission file.
sub = data.frame(test$Id,test$SalePrice)
colnames(sub)[1] ="Id"
colnames(sub)[2] ="SalePrice"
write.csv(sub, file="./submission.csv", row.names=FALSE)
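Since the first submission was rejected for NA values, a quick check before uploading can catch that kind of problem early. The helper below, check_submission, is a hypothetical function written for this illustration, and toy is a made-up three-row data frame rather than the real submission.

```r
# Stop with an error if the submission frame has an obvious problem.
check_submission <- function(sub, expected_rows) {
  stopifnot(
    identical(names(sub), c("Id", "SalePrice")),  # exact column names and order
    nrow(sub) == expected_rows,                   # one row per test record
    !anyNA(sub$SalePrice),                        # no missing predictions
    all(sub$SalePrice > 0),                       # prices must be positive
    !anyDuplicated(sub$Id)                        # each Id appears once
  )
  invisible(TRUE)
}

# Toy example with three rows.
toy <- data.frame(Id = 1461:1463, SalePrice = c(120000, 158008.5, 210000))
check_submission(toy, expected_rows = 3)
```

For the real file, the call would be check_submission(sub, expected_rows = 1459), matching the 1459 rows in test.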
We got a score of 0.15197.
From a Google search, “the majority of scores are between 0 and 0.25, with a median value of 0.1446”, so we did OK!
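To put the score in context, here is a small sketch of the metric, assuming the leaderboard score is the root mean squared error between the log of the predicted price and the log of the observed price, as the competition page describes. The function name rmsle and the example vectors are ours, and the prices are made up.

```r
# Root mean squared error on the log scale (assumed leaderboard metric).
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Made-up prices purely for illustration.
actual    <- c(100000, 150000, 200000)
predicted <- c(110000, 140000, 205000)

rmsle(actual, predicted)
```

Under that assumption, Model 21’s residual standard error of 0.1403 on the log scale is a rough in-sample counterpart to the 0.15197 we scored on the held-out test data.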