Angus Huang, Pavan Akula, Nnaemezue Obi-Eyisi, Aryeh Sturm, Nathan Cooper, Joshua Sturm
May 14, 2018
Maximize throughput of data, visualizations and analysis by delegation of tasks.
Utilize regression skills learned in DATA 621 to create numerous models and select the model that performs best.
Tentative Project Breakdown:
## Tasks Data.Exploration Data.Visualization Model.Building Analysis
## 1 Lead Angus Pavan Pavan Joshua
## 2 Lead Mezu Aryeh Nathan Joshua
## Write.up Presentation
## 1 Aryeh Nathan
## 2 Joshua Nathan
## Id MSSubClass MSZoning LotFrontage LotArea Street
## 1 1 60 RL 65 8450 Pave
## 2 2 20 RL 80 9600 Pave
## 3 3 60 RL 68 11250 Pave
## 4 4 70 RL 60 9550 Pave
## 5 5 60 RL 84 14260 Pave
## 6 6 50 RL 85 14115 Pave
A short description of each predictor variable is given below.
## # A tibble: 81 x 2
## Columns_Name Short_Desc
## <fct> <fct>
## 1 Id RowINdex
## 2 MSSubClass Building Class
## 3 MSZoning Zoning Classification
## 4 LotFrontage Linear feet of street connnected to property
## 5 LotArea Lot size in square feet
## 6 Street Type of road access
## 7 Alley Type of alley access
## 8 LotShape General shape of property
## 9 LandContour Flatness of the property
## 10 Utilities Type of utilities available
## 11 LotConfig Lot Configuration
## 12 LandSlope Slope of property
## 13 Neighborhood Physical locations within Ames city limits
## # ... with 68 more rows
The data is checked to see if it contains any blank or NA values.
## na_count col_names
## 17 1453 PoolQC
## 19 1406 MiscFeature
## 2 1369 Alley
## 18 1179 Fence
## 11 690 FireplaceQu
## 1 259 LotFrontage
## 12 81 GarageType
## 13 81 GarageYrBlt
## 14 81 GarageFinish
## 15 81 GarageQual
## 16 81 GarageCond
## 7 38 BsmtExposure
## 9 38 BsmtFinType2
## 5 37 BsmtQual
## 6 37 BsmtCond
## 8 37 BsmtFinType1
## 3 8 MasVnrType
## 4 8 MasVnrArea
## 10 1 Electrical
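A count like the one above can be produced with base R. The following is a minimal sketch on a made-up data frame (`toy` and its values are assumptions standing in for `housetrain`), not necessarily the code used for this report.

```r
# Toy data frame standing in for housetrain (values are made up).
toy <- data.frame(
  PoolQC      = c(NA, NA, "Gd", NA),
  LotFrontage = c(65, NA, 68, 60),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = TRUE
)

# Count NA values per column, keep only columns that have any,
# and sort from most to fewest missing.
na_count <- colSums(is.na(toy))
na_table <- data.frame(na_count = na_count, col_names = names(na_count))
na_table <- na_table[na_table$na_count > 0, ]
na_table <- na_table[order(-na_table$na_count), ]
print(na_table)
```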
There are quite a few categorical variables, as shown in the examples below.
unique(housetrain$GarageFinish)
## [1] RFn Unf Fin <NA>
## Levels: Fin RFn Unf
unique(housetrain$GarageQual)
## [1] TA Fa Gd <NA> Ex Po
## Levels: Ex Fa Gd Po TA
We will separate out the categorical data columns with missing values and create a new data frame.
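One possible way to do this separation in base R is sketched here. The data frame `toy` and the name `toy_cat_na` are hypothetical; the report's own object names may differ.

```r
# Toy data: one factor with NA, one factor without, one numeric column.
toy <- data.frame(
  GarageQual = factor(c("TA", NA, "Gd")),
  Street     = factor(c("Pave", "Pave", "Grvl")),
  LotArea    = c(8450, 9600, 11250)
)

is_cat <- sapply(toy, is.factor)   # TRUE for categorical columns
has_na <- sapply(toy, anyNA)       # TRUE for columns with any NA

# New data frame holding only categorical columns that contain NA.
toy_cat_na <- toy[, is_cat & has_na, drop = FALSE]
print(names(toy_cat_na))
```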
Here is an example that uses the numeric variable PoolArea to recode PoolQC so that houses with no pool get an explicit No Pool (NP) category.
housetrain.asis$PoolQC <- as.character(housetrain.asis$PoolQC)
housetrain.asis$PoolQC <- ifelse(housetrain.asis$PoolArea == 0, 'NP', housetrain.asis$PoolQC) #NP - No Pool
housetrain.asis$PoolQC <- factor(housetrain.asis$PoolQC)
Other NAs are the result of a house lacking a feature such as a garage or basement.
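The same idea extends to these feature-absent NAs. The sketch below uses a hypothetical helper, `na_to_level()`, on made-up data; the `NG` (No Garage) label is taken from the level names that appear in the model output, but the helper itself is an assumption.

```r
toy <- data.frame(
  GarageType = factor(c("Attchd", NA, "Detchd", NA)),
  GarageArea = c(548, 0, 642, 0)
)

# Hypothetical helper: replace NA in a factor with an explicit level.
na_to_level <- function(x, level) {
  x <- as.character(x)
  x[is.na(x)] <- level
  factor(x)
}

toy$GarageType <- na_to_level(toy$GarageType, "NG")  # NG - No Garage
print(table(toy$GarageType))
```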
The next step was to check for missing data that has no logical explanation.
These data will be candidates for imputation before the models are created.
We also want to make sure that there are no zero values that do not make sense, e.g. in SalePrice.
Any invalid zeros will also be candidates for imputation.
variable | Total | Percentage |
---|---|---|
YearBuilt | 64 | 4.38% |
YearRemodAdd | 124 | 8.49% |
MasVnrArea | 861 | 58.97% |
BsmtFinSF1 | 467 | 31.99% |
BsmtFinSF2 | 1293 | 88.56% |
BsmtUnfSF | 118 | 8.08% |
TotalBsmtSF | 37 | 2.53% |
X2ndFlrSF | 829 | 56.78% |
LowQualFinSF | 1434 | 98.22% |
GarageYrBlt | 165 | 11.3% |
GarageArea | 81 | 5.55% |
WoodDeckSF | 761 | 52.12% |
OpenPorchSF | 656 | 44.93% |
EnclosedPorch | 1252 | 85.75% |
X3SsnPorch | 1436 | 98.36% |
ScreenPorch | 1344 | 92.05% |
PoolArea | 1453 | 99.52% |
MiscVal | 1408 | 96.44% |
MiscVal is an example of data that consisted of mostly zeros.
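Assuming the table above counts zero-valued entries per numeric column, a summary of that shape could be built as follows. This is a sketch on made-up data; `toy` and `zero_table` are illustrative names.

```r
toy <- data.frame(
  MiscVal   = c(0, 0, 700, 0),
  PoolArea  = c(0, 0, 0, 0),
  SalePrice = c(208500, 181500, 223500, 140000)
)

# Count zeros in every numeric column and express them as a percentage.
num_cols   <- toy[sapply(toy, is.numeric)]
zero_total <- colSums(num_cols == 0, na.rm = TRUE)
zero_table <- data.frame(
  variable   = names(zero_total),
  Total      = as.integer(zero_total),
  Percentage = paste0(round(100 * zero_total / nrow(toy), 2), "%")
)

# Keep only the columns that contain at least one zero.
zero_table <- zero_table[zero_table$Total > 0, ]
print(zero_table)
```

A column such as SalePrice, where a zero would be invalid, drops out of the table when it contains none.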
SalePrice shows a unimodal distribution with a strong right skew.
OverallCond shows a weak correlation to SalePrice, with varying medians but some overlap in IQR.
OverallQual shows stronger correlation to SalePrice with a clear trend in medians and little overlap in IQR.
We used two regression techniques as a basis of comparison.
We built multiple linear regression models based on the data cleaning described above.
We also built KNN regression models on data using a simpler central imputation.
For the multiple linear regression models, missing data was imputed using the knnImputation() function of the DMwR library.
For the KNN regression models, the centralImputation() function, also from the DMwR library, was used for missing data.
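To illustrate what central imputation does, here is a base R sketch that fills numeric NAs with the column median and categorical NAs with the most frequent level. `central_impute()` is a hypothetical stand-in written for this example, not the DMwR implementation, and the data are made up.

```r
# Hypothetical helper imitating central imputation with base R only.
central_impute <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!anyNA(x)) next
    if (is.numeric(x)) {
      fill <- median(x, na.rm = TRUE)
    } else {
      counts <- table(x)                       # table() drops NA by default
      fill <- names(counts)[which.max(counts)] # most frequent level
    }
    x[is.na(x)] <- fill
    df[[col]] <- x
  }
  df
}

toy <- data.frame(
  LotFrontage = c(65, NA, 68, 60, 84),
  Electrical  = factor(c("SBrkr", "SBrkr", NA, "FuseA", "SBrkr"))
)
filled <- central_impute(toy)
print(filled)
```

Because every missing value in a column receives the same fill value, the result does not depend on neighboring rows, which is what makes this approach simpler than KNN imputation.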
K values from 5 to 20 were tested to find the best performance.
The graph below shows the error metrics (MAE, MSE, RMSE, and MAPE) versus the K value.
The plots show that 11 or 13 is the optimal number of neighbors. We selected k = 11.
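For reference, the four error metrics can be computed as in the sketch below. `error_metrics()` is a hypothetical helper and the numbers are made up; MAPE is expressed as a fraction rather than a percentage, which is an assumption based on the magnitude of the values reported in this document.

```r
# actual and predicted are numeric vectors of equal length.
error_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(
    MAE  = mean(abs(err)),          # mean absolute error
    MSE  = mean(err^2),             # mean squared error
    RMSE = sqrt(mean(err^2)),       # root mean squared error
    MAPE = mean(abs(err / actual))  # mean absolute percentage error (fraction)
  )
}

m <- error_metrics(actual = c(100, 200, 400), predicted = c(110, 190, 380))
print(m)
```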
The initial model was an all-in multiple linear regression with no transformations performed on any of the variables, predictor or response.
The initial model had a residual standard error of 21670 on 1153 degrees of freedom, a multiple R\(^2\) of 0.9412, an adjusted R\(^2\) of 0.9256, and an F-statistic of 60.31 on 306 and 1153 DF (p-value < 2.2e-16).
12 coefficients were not estimated (returned NA) because they were linearly dependent on other variables.
37 of the predictor variables were statistically significant at the 5% level.
##
## Call:
## lm(formula = SalePrice ~ ., data = housetrain.knn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175318 -8247 0 8403 175318
##
## Coefficients: (12 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -856585.8319 157716.3303 -5.431 6.82e-08 ***
## MSSubClass30 312.5315 4749.6819 0.066 0.947548
## MSSubClass40 -409.5717 17178.9075 -0.024 0.980983
## MSSubClass45 -8448.1143 24905.5631 -0.339 0.734516
## MSSubClass50 -7789.8938 8883.9598 -0.877 0.380751
## MSSubClass60 -3455.2813 7988.2812 -0.433 0.665427
## MSSubClass70 1407.6420 8632.4992 0.163 0.870497
## MSSubClass75 -16035.4407 17178.6156 -0.933 0.350781
## MSSubClass80 -12511.4539 12767.3952 -0.980 0.327315
## MSSubClass85 -19952.9352 11440.4242 -1.744 0.081413 .
## MSSubClass90 -13673.4045 8666.1110 -1.578 0.114885
## MSSubClass120 -16420.7858 14393.4399 -1.141 0.254168
## MSSubClass160 -23288.8672 17582.6446 -1.325 0.185587
## MSSubClass180 -27703.9733 19593.5634 -1.414 0.157652
## MSSubClass190 -11951.5154 28135.6002 -0.425 0.671074
## MSZoningFV 42969.1076 12180.5092 3.528 0.000436 ***
## MSZoningRH 35748.6420 12351.1849 2.894 0.003871 **
## MSZoningRL 37363.5453 10587.6176 3.529 0.000434 ***
## MSZoningRM 32768.1718 9986.3092 3.281 0.001064 **
## LotFrontage 20.9988 45.9370 0.457 0.647669
## LotArea 0.6401 0.1119 5.721 1.35e-08 ***
## StreetPave 26130.3168 12945.7496 2.018 0.043776 *
## AlleyNAly -2668.2968 4213.4032 -0.633 0.526671
## AlleyPave -1682.7633 6157.6373 -0.273 0.784686
## LotShapeIR2 6358.7594 4106.9414 1.548 0.121826
## LotShapeIR3 9541.4823 8621.2235 1.107 0.268636
## LotShapeReg 1894.3697 1575.0553 1.203 0.229327
## LandContourHLS 7728.6638 5115.8398 1.511 0.131131
## LandContourLow -3788.3299 6471.8550 -0.585 0.558424
## LandContourLvl 6462.7773 3699.0781 1.747 0.080881 .
## UtilitiesNoSeWa -54405.5796 27880.6483 -1.951 0.051254 .
## LotConfigCulDSac 7679.3814 3250.9841 2.362 0.018334 *
## LotConfigFR2 -6874.7706 3926.6985 -1.751 0.080250 .
## LotConfigFR3 -16419.4216 12253.8526 -1.340 0.180529
## LotConfigInside -1253.4217 1772.8608 -0.707 0.479706
## LandSlopeMod 8787.3664 3979.4485 2.208 0.027427 *
## LandSlopeSev -35399.1020 11398.8698 -3.105 0.001946 **
## NeighborhoodBlueste 7572.3432 19589.2186 0.387 0.699156
## NeighborhoodBrDale 222.9895 11597.3364 0.019 0.984663
## NeighborhoodBrkSide -2714.4237 9603.5291 -0.283 0.777497
## NeighborhoodClearCr -9383.2870 9400.8443 -0.998 0.318425
## NeighborhoodCollgCr -6779.0451 7339.0932 -0.924 0.355841
## NeighborhoodCrawfor 14877.3462 8737.1968 1.703 0.088883 .
## NeighborhoodEdwards -20142.1236 8117.2991 -2.481 0.013229 *
## NeighborhoodGilbert -5301.0633 7745.8229 -0.684 0.493875
## NeighborhoodIDOTRR -9914.5639 10845.9875 -0.914 0.360843
## NeighborhoodMeadowV -12908.5356 12283.7122 -1.051 0.293540
## NeighborhoodMitchel -15578.3349 8289.1501 -1.879 0.060447 .
## NeighborhoodNAmes -13625.8055 7993.8666 -1.705 0.088551 .
## NeighborhoodNoRidge 20006.9890 8449.2484 2.368 0.018054 *
## NeighborhoodNPkVill 14385.0367 13742.7571 1.047 0.295441
## NeighborhoodNridgHt 14416.5586 7548.3604 1.910 0.056395 .
## NeighborhoodNWAmes -11363.7422 8167.2857 -1.391 0.164381
## NeighborhoodOldTown -11497.1115 9740.5410 -1.180 0.238110
## NeighborhoodSawyer -7551.0614 8228.7155 -0.918 0.358995
## NeighborhoodSawyerW -1267.9200 7857.6183 -0.161 0.871837
## NeighborhoodSomerst -143.1670 9062.2290 -0.016 0.987398
## NeighborhoodStoneBr 34444.5007 8497.2360 4.054 5.38e-05 ***
## NeighborhoodSWISU -7443.5025 9823.7496 -0.758 0.448783
## NeighborhoodTimber -6187.1498 8176.1157 -0.757 0.449363
## NeighborhoodVeenker 3661.4979 10550.4316 0.347 0.728619
## Condition1Feedr 4936.5026 4941.8952 0.999 0.318048
## Condition1Norm 14939.4485 4128.3378 3.619 0.000309 ***
## Condition1PosA 12556.2385 9770.1399 1.285 0.198993
## Condition1PosN 13904.7262 7324.1268 1.898 0.057882 .
## Condition1RRAe -15239.3626 8854.0985 -1.721 0.085489 .
## Condition1RRAn 13984.8142 6793.2366 2.059 0.039753 *
## Condition1RRNe -1202.2300 16928.8206 -0.071 0.943397
## Condition1RRNn 13839.8090 12546.4842 1.103 0.270221
## Condition2Feedr 8776.1690 26428.8248 0.332 0.739898
## Condition2Norm 2378.5998 24002.3491 0.099 0.921077
## Condition2PosA -18834.5971 39170.0381 -0.481 0.630720
## Condition2PosN -258489.4013 30430.3119 -8.494 < 2e-16 ***
## Condition2RRAe -107562.5019 73291.8879 -1.468 0.142488
## Condition2RRAn -4699.2716 33637.6551 -0.140 0.888919
## Condition2RRNn 10020.6099 29937.9729 0.335 0.737903
## BldgType2fmCon -2212.3330 27346.4174 -0.081 0.935535
## BldgTypeDuplex NA NA NA NA
## BldgTypeTwnhs -2088.1771 15296.6039 -0.137 0.891440
## BldgTypeTwnhsE 554.7325 14557.6446 0.038 0.969610
## HouseStyle1.5Unf 7316.6389 24498.4576 0.299 0.765255
## HouseStyle1Story -7828.3626 9029.7557 -0.867 0.386149
## HouseStyle2.5Fin -6561.5638 19009.3043 -0.345 0.730025
## HouseStyle2.5Unf 5641.4270 17247.6215 0.327 0.743663
## HouseStyle2Story -10516.5652 8237.3960 -1.277 0.201970
## HouseStyleSFoyer 2310.0048 12573.1599 0.184 0.854261
## HouseStyleSLvl 2065.1615 14037.0544 0.147 0.883061
## OverallQual2 26840.6553 31267.5125 0.858 0.390839
## OverallQual3 11242.8658 28835.4037 0.390 0.696684
## OverallQual4 10677.5264 28585.6786 0.374 0.708825
## OverallQual5 10001.3553 28733.4023 0.348 0.727848
## OverallQual6 13383.2947 28803.0126 0.465 0.642270
## OverallQual7 20047.1855 28836.3697 0.695 0.487067
## OverallQual8 32971.8738 28978.0187 1.138 0.255431
## OverallQual9 63180.6625 29527.6733 2.140 0.032588 *
## OverallQual10 101560.1520 30421.2659 3.338 0.000869 ***
## OverallCond2 -23026.8519 52799.2337 -0.436 0.662831
## OverallCond3 -51538.9146 54972.2965 -0.938 0.348675
## OverallCond4 -40842.2151 55222.5517 -0.740 0.459698
## OverallCond5 -33476.7489 55321.9175 -0.605 0.545214
## OverallCond6 -27577.4167 55313.3528 -0.499 0.618179
## OverallCond7 -21069.9957 55309.5363 -0.381 0.703313
## OverallCond8 -17420.5720 55363.9373 -0.315 0.753080
## OverallCond9 -9748.6585 55477.2757 -0.176 0.860542
## YearBuilt -398.0932 83.3638 -4.775 2.02e-06 ***
## YearRemodAdd -75.4919 55.8749 -1.351 0.176933
## RoofStyleGable -3871.8558 18240.0897 -0.212 0.831933
## RoofStyleGambrel -3081.4416 20092.1772 -0.153 0.878137
## RoofStyleHip -4582.1507 18303.3675 -0.250 0.802365
## RoofStyleMansard 1227.4811 21859.3903 0.056 0.955229
## RoofStyleShed 85967.0079 40024.9183 2.148 0.031935 *
## RoofMatlCompShg 585529.3349 54166.5268 10.810 < 2e-16 ***
## RoofMatlMembran 661166.2063 63596.3180 10.396 < 2e-16 ***
## RoofMatlMetal 635156.8076 63237.2752 10.044 < 2e-16 ***
## RoofMatlRoll 587564.7093 59654.3825 9.849 < 2e-16 ***
## RoofMatlTarGrv 574270.6328 57660.1437 9.960 < 2e-16 ***
## RoofMatlWdShake 578481.5889 56417.7754 10.254 < 2e-16 ***
## RoofMatlWdShngl 630849.0006 55216.2901 11.425 < 2e-16 ***
## Exterior1stAsphShn -12740.4988 32393.4394 -0.393 0.694167
## Exterior1stBrkComm 9229.6717 28221.9229 0.327 0.743698
## Exterior1stBrkFace 19898.7413 12885.3882 1.544 0.122793
## Exterior1stCBlock -6262.5973 27440.6493 -0.228 0.819513
## Exterior1stCemntBd 4142.8709 19080.6403 0.217 0.828150
## Exterior1stHdBoard 478.0806 13067.8272 0.037 0.970823
## Exterior1stImStucc -15290.1013 27565.1749 -0.555 0.579215
## Exterior1stMetalSd 7396.1988 14788.9052 0.500 0.617087
## Exterior1stPlywood -594.1876 12859.6453 -0.046 0.963154
## Exterior1stStone 6864.0167 23986.0633 0.286 0.774802
## Exterior1stStucco 4875.4891 14525.0416 0.336 0.737188
## Exterior1stVinylSd 792.7966 13587.0974 0.058 0.953481
## Exterior1stWd Sdng 95.3849 12583.1559 0.008 0.993953
## Exterior1stWdShing 3269.5070 13659.5921 0.239 0.810872
## Exterior2ndAsphShn 10495.9599 22184.0199 0.473 0.636209
## Exterior2ndBrk Cmn 2109.6978 20304.2606 0.104 0.917263
## Exterior2ndBrkFace -3872.1856 13257.6970 -0.292 0.770285
## Exterior2ndCBlock NA NA NA NA
## Exterior2ndCmentBd 4310.1523 18665.2459 0.231 0.817419
## Exterior2ndHdBoard 1071.2443 12513.4361 0.086 0.931793
## Exterior2ndImStucc 4670.4074 14451.5051 0.323 0.746619
## Exterior2ndMetalSd -1449.7693 14371.1190 -0.101 0.919663
## Exterior2ndOther -21576.2416 26520.0761 -0.814 0.416053
## Exterior2ndPlywood 774.8445 12156.8698 0.064 0.949191
## Exterior2ndStone -13676.9448 17235.1148 -0.794 0.427620
## Exterior2ndStucco -1879.4273 13861.3453 -0.136 0.892171
## Exterior2ndVinylSd 4633.7771 13043.7155 0.355 0.722467
## Exterior2ndWd Sdng 5956.6046 12106.0224 0.492 0.622787
## Exterior2ndWd Shng 681.3413 12719.7997 0.054 0.957291
## MasVnrTypeBrkFace 7664.6659 6208.2553 1.235 0.217234
## MasVnrTypeNone 9208.9986 6246.7646 1.474 0.140700
## MasVnrTypeStone 10195.0205 6570.4481 1.552 0.121021
## MasVnrArea 18.0445 5.6607 3.188 0.001473 **
## ExterQualFa -2204.5097 12989.7205 -0.170 0.865267
## ExterQualGd -5895.7940 5137.4725 -1.148 0.251369
## ExterQualTA -6570.6929 5565.0549 -1.181 0.237963
## ExterCondFa -9579.9182 18315.4165 -0.523 0.601038
## ExterCondGd -14087.9576 17453.0756 -0.807 0.419723
## ExterCondPo -42296.7164 35966.5800 -1.176 0.239837
## ExterCondTA -11641.5717 17500.0714 -0.665 0.506036
## FoundationCBlock 1115.6569 3196.5688 0.349 0.727140
## FoundationPConc 3774.7042 3407.6568 1.108 0.268217
## FoundationSlab -3844.9456 9952.4991 -0.386 0.699324
## FoundationStone 4080.2253 12316.4734 0.331 0.740492
## FoundationWood -33236.7575 14480.1387 -2.295 0.021893 *
## BsmtQualFa -4372.2416 6420.5463 -0.681 0.496023
## BsmtQualGd -11284.2298 3388.1880 -3.330 0.000894 ***
## BsmtQualNB 732.5989 13512.1403 0.054 0.956771
## BsmtQualTA -9243.4600 4149.4311 -2.228 0.026097 *
## BsmtCondGd 3164.3826 5295.5566 0.598 0.550255
## BsmtCondNB NA NA NA NA
## BsmtCondPo -10843.0162 37529.5570 -0.289 0.772695
## BsmtCondTA 5816.9029 4295.9220 1.354 0.175985
## BsmtExposureGd 10706.9710 3004.4654 3.564 0.000381 ***
## BsmtExposureMn -2003.0047 2977.6050 -0.673 0.501279
## BsmtExposureNB NA NA NA NA
## BsmtExposureNo -4166.6345 2156.1283 -1.932 0.053547 .
## BsmtFinType1BLQ 1506.4290 2748.5311 0.548 0.583740
## BsmtFinType1GLQ 5177.3063 2473.9355 2.093 0.036590 *
## BsmtFinType1LwQ -2403.3611 3662.5666 -0.656 0.511829
## BsmtFinType1NB NA NA NA NA
## BsmtFinType1Rec 173.7837 2946.8354 0.059 0.952984
## BsmtFinType1Unf 2749.8198 2900.7502 0.948 0.343344
## BsmtFinSF1 33.8308 5.1443 6.576 7.29e-11 ***
## BsmtFinType2BLQ -8805.3928 7282.6886 -1.209 0.226878
## BsmtFinType2GLQ -3284.9591 9100.3017 -0.361 0.718186
## BsmtFinType2LwQ -8467.4203 7133.7992 -1.187 0.235494
## BsmtFinType2NB NA NA NA NA
## BsmtFinType2Rec -6382.6716 6802.4816 -0.938 0.348294
## BsmtFinType2Unf -3855.7317 7250.4149 -0.532 0.594971
## BsmtFinSF2 29.0478 8.8092 3.297 0.001005 **
## BsmtUnfSF 14.9952 4.7291 3.171 0.001560 **
## TotalBsmtSF NA NA NA NA
## HeatingGasA 22192.1664 25410.0888 0.873 0.382648
## HeatingGasW 22970.5173 26232.4959 0.876 0.381402
## HeatingGrav 7560.5202 27647.0722 0.273 0.784544
## HeatingOthW 8770.9961 31266.3951 0.281 0.779125
## HeatingWall 33611.6671 29252.1490 1.149 0.250781
## HeatingQCFa -1341.0831 4734.4495 -0.283 0.777028
## HeatingQCGd -2938.4171 2024.0437 -1.452 0.146842
## HeatingQCPo 9488.6207 25966.2178 0.365 0.714864
## HeatingQCTA -2589.3972 2037.7375 -1.271 0.204084
## CentralAirY -416.1486 3868.0706 -0.108 0.914343
## ElectricalFuseF -2745.2904 5841.1398 -0.470 0.638449
## ElectricalFuseP -10212.7563 18649.6849 -0.548 0.584066
## ElectricalMix NA NA NA NA
## ElectricalSBrkr -2825.2548 2950.5700 -0.958 0.338501
## X1stFlrSF 48.4013 5.5595 8.706 < 2e-16 ***
## X2ndFlrSF 52.5641 6.0736 8.654 < 2e-16 ***
## LowQualFinSF -4.6670 19.2784 -0.242 0.808759
## GrLivArea NA NA NA NA
## BsmtFullBath1 1177.8488 1997.6882 0.590 0.555570
## BsmtFullBath2 5505.5181 9962.1301 0.553 0.580614
## BsmtFullBath3 29366.4887 27298.1445 1.076 0.282256
## BsmtHalfBath1 2649.5711 3118.1774 0.850 0.395658
## BsmtHalfBath2 -23464.8224 29650.6071 -0.791 0.428887
## FullBath1 -7464.1327 17478.7877 -0.427 0.669430
## FullBath2 -7047.3866 17772.4836 -0.397 0.691785
## FullBath3 16637.4679 18535.7249 0.898 0.369592
## HalfBath1 3307.8658 2207.3000 1.499 0.134250
## HalfBath2 -1763.0251 9266.7583 -0.190 0.849145
## BedroomAbvGr1 18147.7194 16784.4248 1.081 0.279824
## BedroomAbvGr2 22185.8470 16541.3568 1.341 0.180108
## BedroomAbvGr3 15933.9648 16679.2490 0.955 0.339618
## BedroomAbvGr4 17253.2694 16911.5135 1.020 0.307844
## BedroomAbvGr5 8833.1865 18068.8629 0.489 0.625032
## BedroomAbvGr6 20038.4118 20235.5406 0.990 0.322256
## BedroomAbvGr8 37862.6508 35315.2674 1.072 0.283885
## KitchenAbvGr1 -5630.6399 44118.1705 -0.128 0.898467
## KitchenAbvGr2 -15664.0709 44590.5941 -0.351 0.725438
## KitchenAbvGr3 -26808.1967 48140.5373 -0.557 0.577722
## KitchenQualFa -17447.3368 6256.0149 -2.789 0.005376 **
## KitchenQualGd -17430.3145 3516.6326 -4.957 8.25e-07 ***
## KitchenQualTA -17969.5274 3933.3294 -4.569 5.44e-06 ***
## TotRmsAbvGrd 1317.5771 950.2140 1.387 0.165828
## FunctionalMaj2 -9876.0786 14324.0639 -0.689 0.490663
## FunctionalMin1 3385.2590 8540.6642 0.396 0.691906
## FunctionalMin2 3180.4629 8724.4960 0.365 0.715519
## FunctionalMod -7104.6055 10488.3939 -0.677 0.498302
## FunctionalSev -34338.6173 29070.0654 -1.181 0.237752
## FunctionalTyp 13063.0907 7564.5761 1.727 0.084458 .
## Fireplaces1 -6582.4482 5483.7242 -1.200 0.230246
## Fireplaces2 127.2270 6014.5918 0.021 0.983127
## Fireplaces3 1869.1869 12866.5352 0.145 0.884519
## FireplaceQuFa 4549.1504 6786.9507 0.670 0.502814
## FireplaceQuGd 8208.6063 5316.1123 1.544 0.122839
## FireplaceQuNF NA NA NA NA
## FireplaceQuPo 13969.8680 7798.4762 1.791 0.073498 .
## FireplaceQuTA 9360.7897 5505.9034 1.700 0.089375 .
## GarageTypeAttchd 33119.6708 11809.5244 2.804 0.005124 **
## GarageTypeBasment 39938.5136 13369.5144 2.987 0.002874 **
## GarageTypeBuiltIn 30354.1390 12287.1570 2.470 0.013640 *
## GarageTypeCarPort 38737.7148 15411.6653 2.514 0.012088 *
## GarageTypeDetchd 34919.9286 11803.0625 2.959 0.003154 **
## GarageTypeNG -99198.6001 49654.0658 -1.998 0.045974 *
## GarageYrBlt 2.2244 60.7921 0.037 0.970818
## GarageFinishNG 114505.6266 47214.7084 2.425 0.015452 *
## GarageFinishRFn -701.9726 1929.9356 -0.364 0.716127
## GarageFinishUnf -1033.6589 2384.3036 -0.434 0.664713
## GarageCars1 -15045.6134 19130.8915 -0.786 0.431762
## GarageCars2 -15940.7516 19061.8343 -0.836 0.403179
## GarageCars3 -6305.7140 19187.2391 -0.329 0.742487
## GarageArea 20.6652 7.8375 2.637 0.008484 **
## GarageQualFa -73062.0313 31793.5482 -2.298 0.021739 *
## GarageQualGd -68166.6546 32566.1915 -2.093 0.036552 *
## GarageQualNG NA NA NA NA
## GarageQualPo -78037.5768 40775.8366 -1.914 0.055891 .
## GarageQualTA -68294.4160 31429.2263 -2.173 0.029986 *
## GarageCondFa 63157.9279 35385.9877 1.785 0.074552 .
## GarageCondGd 63734.0725 36764.4691 1.734 0.083260 .
## GarageCondNG NA NA NA NA
## GarageCondPo 69100.6168 38774.4055 1.782 0.074993 .
## GarageCondTA 64411.1056 35167.4126 1.832 0.067275 .
## PavedDriveP -4946.4662 5489.7861 -0.901 0.367760
## PavedDriveY -703.1284 3461.6202 -0.203 0.839076
## WoodDeckSF 10.1153 5.7623 1.755 0.079453 .
## OpenPorchSF 5.0249 11.4393 0.439 0.660553
## EnclosedPorch 7.8812 12.4620 0.632 0.527237
## X3SsnPorch 52.1939 21.7640 2.398 0.016635 *
## ScreenPorch 46.3312 12.2859 3.771 0.000171 ***
## PoolArea 651.3711 226.5377 2.875 0.004110 **
## PoolQCFa -143233.9603 40580.7579 -3.530 0.000433 ***
## PoolQCGd -108692.0229 36466.6736 -2.981 0.002937 **
## PoolQCNP 252148.3295 123801.7614 2.037 0.041907 *
## FenceGdWo 3602.4174 4875.0861 0.739 0.460091
## FenceMnPrv 6919.0492 3968.7734 1.743 0.081535 .
## FenceMnWw 617.7255 8009.5008 0.077 0.938538
## FenceNF 5291.5861 3626.2491 1.459 0.144770
## MiscFeatureNM -452.7031 98780.6709 -0.005 0.996344
## MiscFeatureOthr 14837.5813 90648.6409 0.164 0.870010
## MiscFeatureShed 1399.1800 94548.0229 0.015 0.988195
## MiscFeatureTenC 6595.4966 97756.4521 0.067 0.946220
## MiscVal 0.2458 6.2278 0.039 0.968524
## MoSold2 -7731.8508 4640.6001 -1.666 0.095959 .
## MoSold3 -2669.7935 4084.7448 -0.654 0.513499
## MoSold4 -2592.0432 3873.1388 -0.669 0.503479
## MoSold5 -952.1368 3706.0580 -0.257 0.797291
## MoSold6 -2476.5190 3655.8633 -0.677 0.498282
## MoSold7 -347.0836 3713.8167 -0.093 0.925556
## MoSold8 -6521.0467 3919.2241 -1.664 0.096412 .
## MoSold9 -5516.9199 4484.2655 -1.230 0.218842
## MoSold10 -7657.0159 4256.0910 -1.799 0.072269 .
## MoSold11 -4929.3721 4292.5573 -1.148 0.251061
## MoSold12 -4544.2673 4605.6940 -0.987 0.324015
## YrSold2007 86.9338 1923.1899 0.045 0.963953
## YrSold2008 2545.6042 2015.5626 1.263 0.206854
## YrSold2009 -3.1244 1957.9171 -0.002 0.998727
## YrSold2010 3058.1111 2448.0367 1.249 0.211842
## SaleTypeCon 25106.9899 17511.8883 1.434 0.151926
## SaleTypeConLD 14126.8993 9891.0703 1.428 0.153491
## SaleTypeConLI 547.0299 11431.8229 0.048 0.961843
## SaleTypeConLw 2737.2419 12140.6171 0.225 0.821660
## SaleTypeCWD 11139.1104 12647.6958 0.881 0.378652
## SaleTypeNew 25491.2792 15371.7619 1.658 0.097525 .
## SaleTypeOth 7346.2181 14491.6217 0.507 0.612302
## SaleTypeWD 145.1526 4109.8105 0.035 0.971832
## SaleConditionAdjLand 27076.3022 16205.4457 1.671 0.095030 .
## SaleConditionAlloca -3537.0955 9952.9400 -0.355 0.722368
## SaleConditionFamily 753.0752 6013.8156 0.125 0.900368
## SaleConditionNormal 7021.8337 2873.5884 2.444 0.014692 *
## SaleConditionPartial -2789.4755 14754.9403 -0.189 0.850084
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21670 on 1153 degrees of freedom
## Multiple R-squared: 0.9412, Adjusted R-squared: 0.9256
## F-statistic: 60.31 on 306 and 1153 DF, p-value: < 2.2e-16
Linearly dependent categories were producing NA coefficients.
As the data was formatted, we could not remove individual levels of a categorical variable.
To work around this, we expanded each category into its own column so that individual columns could be removed.
This considerably widened the data set.
Once the linearly dependent categories were removed, we resumed building models.
Both forward selection and backward elimination were employed.
These techniques produced the same models.
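The workaround described above can be sketched with `model.matrix()`, which expands each factor into one indicator column per level. The toy data below are made up so that the "no basement" level (`NB`) appears in two factors at once, reproducing the kind of dependency seen in the full model; all object names are illustrative.

```r
toy <- data.frame(
  SalePrice = c(200, 150, 120, 210, 160, 125, 205, 118),
  BsmtQual  = factor(c("Gd", "TA", "NB", "Gd", "TA", "NB", "Gd", "NB")),
  BsmtCond  = factor(c("TA", "TA", "NB", "Gd", "TA", "NB", "TA", "NB"))
)

# One column per factor level (minus the baseline); drop the intercept column.
wide <- as.data.frame(model.matrix(SalePrice ~ ., data = toy)[, -1])
wide$SalePrice <- toy$SalePrice

# Coefficients that lm() cannot estimate come back as NA.
fit <- lm(SalePrice ~ ., data = wide)
dependent <- names(coef(fit))[is.na(coef(fit))]
print(dependent)

# Drop only the dependent columns and refit.
wide2 <- wide[, !(names(wide) %in% dependent), drop = FALSE]
fit2 <- lm(SalePrice ~ ., data = wide2)
print(coef(fit2))
```

With the levels in separate columns, a single indicator can be removed without discarding the rest of the categorical variable.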
#These values are linearly dependent on other variables
housetrain.lm6 <- lm(SalePrice~BsmtQual+BsmtExposure, data = housetrain.knn)
summary(housetrain.lm6)
The 8th iteration of the multiple linear regression is the model we decided to use for the contest.
Residual standard error (RSE): 30440 on 1347 DF.
R\(^2\) is 0.8645, explaining 86% of the variance in sale price.
The F-statistic is 76.7, suggesting Model-8 is better than the null model.
The \(GVIF^{1/(2 \cdot Df)}\) value of every variable is less than 3.0, suggesting that only a moderate degree of multicollinearity exists.
Standard errors and 95% confidence intervals are smaller than in any other model.
KNN was used to see if a nonparametric regression method would work better on such a large data set.
Because the KNN imputation used above may create problems generalizing to the test set, central imputation was used instead.
Exploring the K parameter from 5 to 20 in 6-fold cross-validation shows that a K of 7 or 11 minimizes error.
K = 11: MAE: 23034.5377504002, MSE: 1437553316.64935, RMSE: 37915.080332888, MAPE: 0.133958673609819
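A search over K with 6-fold cross-validation can be sketched in base R as follows. The helpers `knn_predict()` and `cv_rmse()` are hypothetical, the data are synthetic rather than the Ames data, and only RMSE is tracked to keep the example short.

```r
# Hypothetical helper: predict each test row as the mean response of its
# k nearest training rows (Euclidean distance on numeric predictors).
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Hypothetical helper: mean RMSE over `folds` cross-validation folds.
cv_rmse <- function(x, y, k, folds = 6) {
  fold_id <- rep_len(seq_len(folds), nrow(x))
  rmse <- sapply(seq_len(folds), function(f) {
    test <- fold_id == f
    pred <- knn_predict(x[!test, , drop = FALSE], y[!test],
                        x[test, , drop = FALSE], k)
    sqrt(mean((y[test] - pred)^2))
  })
  mean(rmse)
}

set.seed(621)                              # synthetic data, not the Ames data
x <- scale(matrix(runif(240), ncol = 2))   # 120 rows, 2 scaled predictors
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(120, sd = 0.3)

k_grid <- 5:20
scores <- sapply(k_grid, function(k) cv_rmse(x, y, k))
best_k <- k_grid[which.min(scores)]
print(data.frame(k = k_grid, rmse = round(scores, 3)))
```

Rows are assigned to folds in order here because the synthetic rows are already random; real data would normally be shuffled first.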
Model-8 is applied to predict the sale price.
The output shows the predicted sale price (fit) and the lower (lwr) and upper (upr) limits.
All predicted values (fit) are positive.
Row | fit | lwr | upr |
---|---|---|---|
1 | 11537.87 | -54293.97 | 77369.71 |
2 | 19599.66 | -47171.58 | 86370.90 |
3 | 28033.83 | -85935.58 | 142003.24 |
4 | 28797.12 | -38749.81 | 96344.06 |
5 | 30153.52 | -77718.34 | 138025.37 |
6 | 32288.34 | -34975.26 | 99551.94 |
7 | 35331.33 | -29525.27 | 100187.93 |
8 | 38340.33 | -35238.25 | 111918.91 |
9 | 42118.46 | -32085.87 | 116322.78 |
10 | 43876.84 | -73913.55 | 161667.22 |
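Columns named fit, lwr and upr are what `predict()` returns for an `lm` object when an interval is requested. The sketch below assumes a 95% prediction interval and sorts by predicted price; both are assumptions, as is the single-predictor synthetic model.

```r
set.seed(621)                                   # synthetic data only
train <- data.frame(GrLivArea = runif(50, 800, 3000))
train$SalePrice <- 20000 + 100 * train$GrLivArea + rnorm(50, sd = 15000)
test <- data.frame(GrLivArea = c(900, 1500, 2400))

fit <- lm(SalePrice ~ GrLivArea, data = train)

# interval = "prediction" returns a matrix with columns fit, lwr and upr.
pred <- as.data.frame(predict(fit, newdata = test,
                              interval = "prediction", level = 0.95))
pred <- pred[order(pred$fit), ]                 # sort by predicted price
print(head(pred, 10))
```

A prediction interval covers a single new observation, so it is wider than a confidence interval for the mean response; a lower limit below zero is possible even when the fitted value is positive.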
Model-9 was built by removing BsmtQual and Neighborhood.
The difference between the lower and upper limits is very large.
Some predicted values are negative, indicating that Model-9 is an overfitted model and does not fit the test data well.
Row | fit | lwr | upr |
---|---|---|---|
1 | -9050.59 | -129702.34 | 111601.16 |
2 | 4541.86 | -66614.21 | 75697.92 |
3 | 16437.65 | -55704.25 | 88579.55 |
4 | 19254.72 | -94764.09 | 133273.54 |
5 | 19519.60 | -53449.37 | 92488.56 |
6 | 22202.95 | -49356.22 | 93762.11 |
7 | 24145.89 | -95026.96 | 143318.74 |
8 | 27868.54 | -98062.90 | 153799.98 |
9 | 28388.78 | -41718.84 | 98496.41 |
10 | 30323.87 | -90381.49 | 151029.24 |
Model-10 was built by removing BsmtQual, Neighborhood and GarageType.
The difference between the lower and upper limits is very large.
Some predicted values are negative, indicating that Model-10 is an overfitted model and does not fit the test data well.
Row | fit | lwr | upr |
---|---|---|---|
1 | -10807.48 | -132920.28 | 111305.31 |
2 | 1923.39 | -113212.93 | 117059.70 |
3 | 4076.57 | -67828.73 | 75981.87 |
4 | 15594.48 | -57272.02 | 88460.97 |
5 | 17371.82 | -56382.23 | 91125.87 |
6 | 21698.34 | -98938.62 | 142335.29 |
7 | 24441.14 | -47841.68 | 96723.96 |
8 | 24983.82 | -102478.96 | 152446.60 |
9 | 28217.72 | -42700.22 | 99135.67 |
10 | 30684.52 | -91468.75 | 152837.80 |
We decided that MLR 8 had the best overall performance, with an RSE of 30440 on 1347 DF and an R\(^2\) of 0.86.
It scored a Root Mean Squared Logarithmic Error (RMSLE) of 0.19290 and ranked 4,031 out of 5,419 in the Kaggle competition.
For comparison, the model in 10th place out of 5,419 has an RMSLE of 0.10964.
Out of curiosity, the KNN k = 11 model was also submitted; it ranked 3,485 out of 5,419 with an RMSLE of 0.18599.
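For readers who want to score a model locally, a common definition of RMSLE is sketched below. The `rmsle()` helper and the prices are illustrative, and the use of `log1p()` (log of 1 + x) is an assumption; the competition's exact scoring formula may differ slightly.

```r
# Root mean squared logarithmic error on the log(1 + x) scale.
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Made-up prices for illustration.
score <- rmsle(actual    = c(208500, 181500, 223500),
               predicted = c(200000, 190000, 223500))
print(score)
```

Because the metric works on logarithms, it penalizes relative rather than absolute errors, and it is undefined for predictions at or below -1, which is one practical reason negative predicted prices are a problem.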