First, let us take a glance at the structure of train.
introduce(train)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 1460 81 43 38 0
## total_missing_values total_observations memory_usage
## 1 6965 118260 516808
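The totals reported by `introduce()` are enough to derive the overall share of missing values by hand. A minimal sketch that only reuses the numbers printed above (the variable names are illustrative, not part of the original analysis):

```r
# Figures copied from the introduce(train) output above
rows <- 1460
columns <- 81
total_missing_values <- 6965

# total_observations is simply rows * columns
total_observations <- rows * columns

# Overall share of missing cells, in percent
missing_pct <- round(100 * total_missing_values / total_observations, 2)

print(total_observations)
print(missing_pct)
```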
We can conclude the following from the output above:

* There is a balance between discrete and continuous features (43 versus 38).
* Nearly 6% of the data is missing (6,965 of 118,260 observations).

Let us see how the data is organized.
glimpse(train)
## Observations: 1,460
## Variables: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, ...
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ...
## $ Alley <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg...
## $ LandContour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, ...
## $ LotConfig <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside...
## $ LandSlope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit...
## $ Condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN,...
## $ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ...
## $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, ...
## $ HouseStyle <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, ...
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable,...
## $ RoofMatl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com...
## $ Exterior1st <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, Vin...
## $ Exterior2nd <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, Vin...
## $ MasVnrType <fct> BrkFace, None, BrkFace, None, BrkFace, None, Sto...
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ ExterCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ Foundation <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc...
## $ BsmtQual <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, ...
## $ BsmtCond <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ BsmtExposure <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, ...
## $ BsmtFinType1 <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ...
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2 <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf...
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ...
## $ HeatingQC <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, ...
## $ CentralAir <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,...
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Ty...
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, ...
## $ GarageType <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, ...
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn...
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, ...
## $ GarageCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ PavedDrive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, N...
## $ MiscFeature <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, ...
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New,...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,...
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.000
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.000
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.000
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.099
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000
## PosA : 1 1.5Unf : 14 Max. :10.000
## (Other): 2 (Other): 19
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Flat : 13
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Gable :1141
## Median :5.000 Median :1973 Median :1994 Gambrel: 11
## Mean :5.575 Mean :1971 Mean :1985 Hip : 286
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7
## Max. :9.000 Max. :2010 Max. :2010 Shed : 2
##
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea
## CompShg:1434 VinylSd:515 VinylSd:504 BrkCmn : 15 Min. : 0.0
## Tar&Grv: 11 HdBoard:222 MetalSd:214 BrkFace:445 1st Qu.: 0.0
## WdShngl: 6 MetalSd:220 HdBoard:207 None :864 Median : 0.0
## WdShake: 5 Wd Sdng:206 Wd Sdng:197 Stone :128 Mean : 103.7
## ClyTile: 1 Plywood:108 Plywood:142 NA's : 8 3rd Qu.: 166.0
## Membran: 1 CemntBd: 61 CmentBd: 60 Max. :1600.0
## (Other): 2 (Other):128 (Other):136 NA's :8
## ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## Ex: 52 Ex: 3 BrkTil:146 Ex :121 Fa : 45 Av :221
## Fa: 14 Fa: 28 CBlock:634 Fa : 35 Gd : 65 Gd :134
## Gd:488 Gd: 146 PConc :647 Gd :618 Po : 2 Mn :114
## TA:906 Po: 1 Slab : 24 TA :649 TA :1311 No :953
## TA:1282 Stone : 6 NA's: 37 NA's: 37 NA's: 38
## Wood : 3
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## ALQ :220 Min. : 0.0 ALQ : 19 Min. : 0.00
## BLQ :148 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00
## GLQ :418 Median : 383.5 GLQ : 14 Median : 0.00
## LwQ : 74 Mean : 443.6 LwQ : 46 Mean : 46.55
## Rec :133 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00
## Unf :430 Max. :5644.0 Unf :1256 Max. :1474.00
## NA's: 37 NA's: 38
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741 N: 95
## 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365
## Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :2336.0 Max. :6110.0 Wall : 4
##
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39
## Median :0.0000 Median :3.000 Median :1.000 Gd:586
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## Min. : 2.000 Maj1: 14 Min. :0.000 Ex : 24 2Types : 6
## 1st Qu.: 5.000 Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870
## Median : 6.000 Min1: 31 Median :1.000 Gd :380 Basment: 19
## Mean : 6.518 Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88
## 3rd Qu.: 7.000 Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9
## Max. :14.000 Sev : 1 Max. :3.000 NA's:690 Detchd :387
## Typ :1360 NA's : 81
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## Min. :1900 Fin :352 Min. :0.000 Min. : 0.0 Ex : 3
## 1st Qu.:1961 RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48
## Median :1980 Unf :605 Median :2.000 Median : 480.0 Gd : 14
## Mean :1979 NA's: 81 Mean :1.767 Mean : 473.0 Po : 3
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311
## Max. :2010 Max. :4.000 Max. :1418.0 NA's: 81
## NA's :81
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Ex : 2 N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Fa : 35 P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Gd : 9 Y:1340 Median : 0.00 Median : 25.00 Median : 0.00
## Po : 7 Mean : 94.24 Mean : 46.66 Mean : 21.95
## TA :1326 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## NA's: 81 Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Ex : 2
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2
## Median : 0.00 Median : 0.00 Median : 0.000 Gd : 3
## Mean : 3.41 Mean : 15.06 Mean : 2.759 NA's:1453
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## GdPrv: 59 Gar2: 2 Min. : 0.00 Min. : 1.000
## GdWo : 54 Othr: 2 1st Qu.: 0.00 1st Qu.: 5.000
## MnPrv: 157 Shed: 49 Median : 0.00 Median : 6.000
## MnWw : 11 TenC: 1 Mean : 43.49 Mean : 6.322
## NA's :1179 NA's:1406 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
## 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
## Median :2008 COD : 43 Alloca : 12 Median :163000
## Mean :2008 ConLD : 9 Family : 20 Mean :180921
## 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
## Max. :2010 ConLw : 5 Partial: 125 Max. :755000
## (Other): 9
From the above we see:

* There are some outliers scattered here and there; we investigate them in detail in the next sections.
* The missing observations are scattered among the features; let us investigate that further.
plot_missing(train, title = "Missing Data", ggtheme = theme_gray(base_size = 15))
The categorical features with the largest number of missing values are:

* PoolQC (99.52%): pool quality, no wonder :)
* MiscFeature (96.3%): miscellaneous features not covered in other categories
* Alley (93.77%): type of alley access
* Fence (80.75%): fence quality
* FireplaceQu (47.26%): fireplace quality
* GarageType (5.55%): garage type (the related garage features below have the same share missing)
* GarageYrBlt (5.55%): I will convert this feature to categorical and treat it as such
* GarageFinish (5.55%): interior finish of the garage
* GarageQual (5.55%): garage quality
* GarageCond (5.55%): garage condition
* BsmtExposure (2.6%): refers to walkout or garden-level walls
* BsmtFinType2 (2.6%): rating of basement finished area (if multiple types)
* BsmtQual (2.53%): evaluates the height of the basement
* BsmtCond (2.53%): evaluates the general condition of the basement
* BsmtFinType1 (2.53%): rating of basement finished area
* MasVnrType (0.55%): masonry veneer type

I will impute the categorical features by converting NA to a "Not available" level, except for MasVnrType, where I will add an "Others" level, since something must have been used to build with.

The missing values indicate that the majority of houses have no alley access, no pool, no fence, and none of the extras covered by MiscFeature (elevator, second garage, shed or tennis court).
The numeric variables do not have as many missing values but there are still some present:

* LotFrontage (17.74%): linear feet of street connected to the property; to be imputed with the mean or median
* MasVnrArea (0.55%): masonry veneer area in square feet
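
The percentages quoted above are the share of `NA` entries in each column. A minimal base-R sketch on a small made-up data frame (`toy` is a hypothetical stand-in for `train`, and its values are invented for illustration):

```r
# Hypothetical stand-in for train: two columns with gaps, one complete
toy <- data.frame(
  PoolQC      = factor(c(NA, NA, NA, "Gd")),
  LotFrontage = c(65, NA, 68, 60),
  LotArea     = c(8450, 9600, 11250, 9550)
)

# is.na() gives a logical matrix; column means are the missing fractions
missing_pct <- round(100 * colMeans(is.na(toy)), 2)

# Keep only columns that actually have missing values, largest first
missing_pct <- sort(missing_pct[missing_pct > 0], decreasing = TRUE)
print(missing_pct)
```

Running the same two lines on `train` should reproduce the ranking shown by `plot_missing()`.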
Let us have a quick look at the discrete features.
plot_bar(train)
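Level counts can also be inspected numerically rather than read off the bar charts. The sketch below rebuilds `Condition2` from the named counts in the `summary(train)` output (the two rows hidden under `(Other)` are left out) and lists the levels with fewer than 10 observations; the threshold of 10 is an arbitrary assumption for illustration.

```r
# Named level counts for Condition2, copied from summary(train);
# the 2 observations grouped under "(Other)" are omitted
counts <- c(Norm = 1445, Feedr = 6, Artery = 2, PosN = 2, RRNn = 2, PosA = 1)

# Rebuild a factor with those frequencies
cond2 <- factor(rep(names(counts), times = counts))

# Tabulate and flag sparse levels (threshold is an illustrative assumption)
tab <- table(cond2)
rare_levels <- names(tab)[tab < 10]
print(rare_levels)
```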
At first glance, some features have many levels, several of them with no or very few observations, for example:

* Neighborhood
* Condition1
* Condition2
* HouseStyle
* RoofMatl
* Exterior1st
* Exterior2nd
* Functional
* SaleType

Now, let us check the continuous features.
plot_density(train[,-c(1)], ggtheme = theme_gray(base_size = 15, base_family = "serif"))
From the plots, it seems there are many fluctuations in many features, and we will need to deal with each of them individually.
Now, let us see how the discrete and continuous features interact with the response variable.
plot_scatterplot(train[,-c(1)], by = "SalePrice")
## Warning: Removed 267 rows containing missing values (geom_point).
## Warning: Removed 81 rows containing missing values (geom_point).
The plots confirm my doubts about the continuous features in particular: they need serious handling. Now let us move to the final stage of our EDA, correlation.
numeric_var <- names(train)[which(sapply(train, is.numeric))]
correlations <- cor(na.omit(train[, numeric_var]))
# correlations
row_indic <- apply(correlations, 1, function(x) sum(x > 0.3 | x < -0.3) > 1)
correlations <- correlations[row_indic, row_indic]
corrplot(correlations, method="square")
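The filtering step above keeps a variable only if its absolute correlation exceeds 0.3 with at least one other variable; the `> 1` is needed because every variable correlates perfectly with itself. A small sketch on made-up data (`toy` and its columns are hypothetical) shows the effect:

```r
# a and b move together; c is constructed to be uncorrelated with a
toy <- data.frame(
  a = 1:8,
  b = c(2, 4, 6, 8, 10, 12, 14, 17),
  c = c(1, -1, -1, 1, 1, -1, -1, 1)
)

cors <- cor(toy)

# Count entries per row with |r| > 0.3; the diagonal always contributes 1,
# so "> 1" means "at least one other variable"
keep <- apply(cors, 1, function(x) sum(abs(x) > 0.3) > 1)

filtered <- cors[keep, keep]
print(round(filtered, 3))
```

Here `c` is dropped because it has no partner above the threshold, while `a` and `b` are kept.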
It seems there is high correlation among the continuous features; we will need to treat that in the Feature Engineering phase.
The correlation matrix shows that several variables are strongly and positively correlated with housing price.
High positive correlation:
Enclosed porch area (EnclosedPorch) is negatively correlated with the year built. It seems that potential house buyers do not want an enclosed porch, and developers have been building fewer of them in recent years. It is also negatively correlated with SalePrice, which makes sense.
There is a slight negative correlation between OverallCond and SalePrice. There is also a strong negative correlation between YearBuilt and OverallCond. It seems that recently built houses tend to be rated in worse overall condition.
train %>%
  select(OverallCond, YearBuilt) %>%
  ggplot(aes(as.factor(OverallCond), YearBuilt)) +
  geom_boxplot() +
  xlab('Overall Condition')
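The boxplot can be complemented by a per-group summary. The sketch below uses only the first eight rows of `OverallCond` and `YearBuilt` as printed by `glimpse(train)`, so it illustrates the mechanics rather than the full-data result:

```r
# First eight values of each column, copied from the glimpse() output
overall_cond <- c(5, 8, 5, 5, 5, 5, 5, 6)
year_built   <- c(2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973)

# Median construction year within each condition rating
median_year <- tapply(year_built, overall_cond, median)
print(median_year)
```

On the full data the same call would be `tapply(train$YearBuilt, train$OverallCond, median)`.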
Now we come to the most critical part, which will determine which features our model depends on. I will check all features with the following in mind:
Let us check the summary first.
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
OK, the good news is that we do not have missing data, but it seems we have outliers. Let us make sure.
outlier_values <- boxplot.stats(train$SalePrice)$out # outlier values.
boxplot(train$SalePrice, main="Price", boxwex=0.1)
mtext(paste("Outliers: ", paste(outlier_values, collapse=", ")), cex=0.6)
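For reference, `boxplot.stats()` flags points lying more than 1.5 times the hinge spread beyond the lower or upper hinge. A minimal sketch that reproduces the rule by hand on a made-up vector (`x` is hypothetical):

```r
# Small made-up sample with one obviously extreme value
x <- c(10, 12, 13, 14, 15, 16, 18, 100)

# Lower and upper hinges are elements 2 and 4 of Tukey's five-number summary
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)

# Fences at 1.5 times the hinge spread (the default coef of boxplot.stats)
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread

manual_out <- x[x < lower_fence | x > upper_fence]
print(manual_out)
print(boxplot.stats(x)$out)
```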
And by using the outlier() function from the outliers package:
outlier(train$SalePrice)
## [1] 755000
OK, it seems that we have at least one or two observations that are outliers; that is not much trouble, is it? Let us check for normality.
Density plot and Q-Q plot can be used to check normality visually.
ggdensity(train$SalePrice,
          main = "Density plot of SalePrice",
          xlab = "Sale Price")
qqPlot(train$SalePrice)
## [1] 692 1183
OK, we have a long right tail in the first plot, and the points deviate from the reference line in the second, so the distribution is not normal. Let us confirm that by performing a significance test.
shapiro.test(train$SalePrice)
##
## Shapiro-Wilk normality test
##
## data: train$SalePrice
## W = 0.86967, p-value < 2.2e-16
It is confirmed. Let us now log-transform the response variable and recheck.
train$SalePrice <- log(train$SalePrice)
ggdensity(train$SalePrice,
          main = "Density plot of SalePrice",
          xlab = "Sale Price")
qqPlot(train$SalePrice)
## [1] 496 917
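The effect of a log transform on a right-skewed variable can also be checked numerically. The sketch below does not use the housing data; it builds a deterministic right-skewed sample from log-normal quantiles (an assumption made purely for illustration) and compares the Shapiro-Wilk statistic before and after taking logs.

```r
# Deterministic right-skewed sample: quantiles of a standard log-normal
skewed <- qlnorm(ppoints(200))

# Shapiro-Wilk test on the raw and on the log-transformed values
raw_test <- shapiro.test(skewed)
log_test <- shapiro.test(log(skewed))

print(raw_test$statistic)
print(log_test$statistic)

# Values on the log scale can be mapped back with exp()
back_transformed <- exp(log(skewed))
```

Because the response is now on the log scale, any later predictions would need `exp()` to return to the original price units.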
Much better. Now let us move on to the features with many missing values.
I changed my mind: I will drop the features with a high share of missing values, as it seems too risky to keep them.
summary(train$PoolQC)
## Ex Fa Gd NA's
## 2 2 3 1453
train$PoolQC <- NULL
test$PoolQC <- NULL
summary(train$MiscFeature)
## Gar2 Othr Shed TenC NA's
## 2 2 49 1 1406
train$MiscFeature <- NULL
test$MiscFeature <- NULL
summary(train$Alley)
## Grvl Pave NA's
## 50 41 1369
train$Alley <- NULL
test$Alley <- NULL
summary(train$Fence )
## GdPrv GdWo MnPrv MnWw NA's
## 59 54 157 11 1179
train$Fence <- NULL
test$Fence <- NULL
I will impute the others.
summary(train$FireplaceQu)
## Ex Fa Gd Po TA NA's
## 24 33 380 20 313 690
train$FireplaceQu <- fct_explicit_na(train$FireplaceQu, "NA")
test$FireplaceQu <- fct_explicit_na(test$FireplaceQu, "NA")
summary(train$LotFrontage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.00 59.00 69.00 70.05 80.00 313.00 259
train$LotFrontage[is.na(train$LotFrontage)] <- mean(train$LotFrontage, na.rm = TRUE)
test$LotFrontage[is.na(test$LotFrontage)] <- mean(test$LotFrontage, na.rm = TRUE)
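The code above fills each data set with its own mean. One common variant, shown here only as an alternative and not as what this analysis does, is to compute the fill value once on the training data and reuse it for the test data, so that both sets are treated identically. A sketch on made-up frames (`train_toy` and `test_toy` are hypothetical):

```r
# Hypothetical miniature train/test frames with gaps in LotFrontage
train_toy <- data.frame(LotFrontage = c(65, 80, NA, 60))
test_toy  <- data.frame(LotFrontage = c(NA, 70))

# Fill value estimated from the training data only
fill_value <- mean(train_toy$LotFrontage, na.rm = TRUE)

# Apply the same value to both data sets
train_toy$LotFrontage[is.na(train_toy$LotFrontage)] <- fill_value
test_toy$LotFrontage[is.na(test_toy$LotFrontage)]   <- fill_value

print(test_toy)
```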
summary(train$GarageType)
## 2Types Attchd Basment BuiltIn CarPort Detchd NA's
## 6 870 19 88 9 387 81
train$GarageType <- fct_explicit_na(train$GarageType, "NA")
test$GarageType <- fct_explicit_na(test$GarageType, "NA")
# I will convert GarageYrBlt to factor
train$GarageYrBlt <- as.factor(train$GarageYrBlt)
test$GarageYrBlt <- as.factor(test$GarageYrBlt)
summary(train$GarageYrBlt)
## 1900 1906 1908 1910 1914 1915 1916 1918 1920 1921 1922 1923 1924 1925 1926
## 1 1 1 3 2 2 5 2 14 3 5 3 3 10 6
## 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
## 1 4 2 8 4 3 1 2 4 5 2 3 9 14 10
## 1942 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958
## 2 4 4 2 11 8 24 6 3 12 19 13 16 20 21
## 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
## 17 19 13 21 16 18 21 21 15 26 15 20 13 14 14
## 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988
## 18 9 29 35 19 15 15 10 4 7 8 10 6 11 14
## 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
## 10 16 9 13 22 18 18 20 19 31 30 27 20 26 50
## 2004 2005 2006 2007 2008 2009 2010 NA's
## 53 65 59 49 29 21 3 81
train$GarageYrBlt <- fct_explicit_na(train$GarageYrBlt, "NA")
test$GarageYrBlt <- fct_explicit_na(test$GarageYrBlt, "NA")
summary(train$GarageFinish)
## Fin RFn Unf NA's
## 352 422 605 81
train$GarageFinish <- fct_explicit_na(train$GarageFinish, "NA")
test$GarageFinish <- fct_explicit_na(test$GarageFinish, "NA")
summary(train$GarageQual)
## Ex Fa Gd Po TA NA's
## 3 48 14 3 1311 81
train$GarageQual <- fct_explicit_na(train$GarageQual, "NA")
test$GarageQual <- fct_explicit_na(test$GarageQual, "NA")
summary(train$GarageCond)
## Ex Fa Gd Po TA NA's
## 2 35 9 7 1326 81
train$GarageCond <- fct_explicit_na(train$GarageCond, "NA")
test$GarageCond <- fct_explicit_na(test$GarageCond, "NA")
summary(train$BsmtExposure)
## Av Gd Mn No NA's
## 221 134 114 953 38
train$BsmtExposure <- fct_explicit_na(train$BsmtExposure, "NA")
test$BsmtExposure <- fct_explicit_na(test$BsmtExposure, "NA")
summary(train$BsmtFinType2)
## ALQ BLQ GLQ LwQ Rec Unf NA's
## 19 33 14 46 54 1256 38
train$BsmtFinType2 <- fct_explicit_na(train$BsmtFinType2, "NA")
test$BsmtFinType2 <- fct_explicit_na(test$BsmtFinType2, "NA")
summary(train$BsmtQual)
## Ex Fa Gd TA NA's
## 121 35 618 649 37
train$BsmtQual <- fct_explicit_na(train$BsmtQual, "NA")
test$BsmtQual <- fct_explicit_na(test$BsmtQual, "NA")
summary(train$BsmtCond)
## Fa Gd Po TA NA's
## 45 65 2 1311 37
train$BsmtCond <- fct_explicit_na(train$BsmtCond, "NA")
test$BsmtCond <- fct_explicit_na(test$BsmtCond, "NA")
summary(train$BsmtFinType1)
## ALQ BLQ GLQ LwQ Rec Unf NA's
## 220 148 418 74 133 430 37
train$BsmtFinType1 <- fct_explicit_na(train$BsmtFinType1, "NA")
test$BsmtFinType1 <- fct_explicit_na(test$BsmtFinType1, "NA")
summary(train$MasVnrType)
## BrkCmn BrkFace None Stone NA's
## 15 445 864 128 8
train$MasVnrType <- fct_explicit_na(train$MasVnrType, "NA")
test$MasVnrType <- fct_explicit_na(test$MasVnrType, "NA")
summary(train$MasVnrArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 103.7 166.0 1600.0 8
train$MasVnrArea[is.na(train$MasVnrArea)] <- 0
test$MasVnrArea[is.na(test$MasVnrArea)] <- 0
summary(train$Electrical)
## FuseA FuseF FuseP Mix SBrkr NA's
## 94 27 3 1 1334 1
train$Electrical <- fct_explicit_na(train$Electrical, "NA")
test$Electrical <- fct_explicit_na(test$Electrical, "NA")
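The categorical imputations above repeat the same call for every column. A base-R sketch of the same idea applied to several columns at once is shown below; `na_to_level()` is a hypothetical helper written for this illustration (it is not a forcats function), and `toy` is made-up data.

```r
# Hypothetical helper: turn NA into an explicit factor level
na_to_level <- function(f, level = "NA") {
  f <- factor(f)
  if (anyNA(f)) {
    levels(f) <- c(levels(f), level)  # register the new level first
    f[is.na(f)] <- level              # then assign it to the missing entries
  }
  f
}

# Made-up data with gaps in two quality columns
toy <- data.frame(
  GarageQual = factor(c("TA", NA, "Fa", NA)),
  BsmtQual   = factor(c("Gd", "TA", NA, "Ex"))
)

# Apply the helper to every listed column in one step
cols <- c("GarageQual", "BsmtQual")
toy[cols] <- lapply(toy[cols], na_to_level)

print(summary(toy))
```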
First, I need to identify the most important features to work on and eliminate the others to save effort and time.
# Decide whether a variable is important using Boruta
# Note: `response` is a copy of SalePrice, and SalePrice is still a column of
# `train`, so `.` also includes it as a predictor (it appears in the output below)
response <- train[, "SalePrice"]
boruta_output <- Boruta(response ~ ., data = train, doTrace = 2) # perform Boruta search
# collect Confirmed and Tentative variables
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")])
boruta_signif
## [1] "MSSubClass" "MSZoning" "LotFrontage" "LotArea"
## [5] "LotShape" "LandContour" "LandSlope" "Neighborhood"
## [9] "Condition1" "BldgType" "HouseStyle" "OverallQual"
## [13] "OverallCond" "YearBuilt" "YearRemodAdd" "RoofStyle"
## [17] "Exterior1st" "Exterior2nd" "MasVnrType" "MasVnrArea"
## [21] "ExterQual" "Foundation" "BsmtQual" "BsmtCond"
## [25] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [29] "BsmtUnfSF" "TotalBsmtSF" "HeatingQC" "CentralAir"
## [33] "Electrical" "X1stFlrSF" "X2ndFlrSF" "GrLivArea"
## [37] "BsmtFullBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [41] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [45] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [49] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [53] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [57] "ScreenPorch" "SaleCondition" "SalePrice"
We eliminated 20 features; let us use another method.
lmMod <- earth(SalePrice ~ ., data = train) # fit a MARS model with earth()
ev <- evimp(lmMod) # estimate variable importance
plot(ev)
I will work in a loop: start by building different models using these important features as a baseline, then improve them by going through the feature engineering steps one by one, remodeling and comparing until we are satisfied. So let us continue investigating the important features.

### YearBuilt, YearRemodAdd, OverallQual and OverallCond

These are correlated fields that we need to treat together.

#### Description

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

OverallQual: Rates the overall material and finish of the house

* 10 Very Excellent
* 9 Excellent
* 8 Very Good
* 7 Good
* 6 Above Average
* 5 Average
* 4 Below Average
* 3 Fair
* 2 Poor
* 1 Very Poor

OverallCond: Rates the overall condition of the house

* 10 Very Excellent
* 9 Excellent
* 8 Very Good
* 7 Good
* 6 Above Average
* 5 Average
* 4 Below Average
* 3 Fair
* 2 Poor
* 1 Very Poor
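
Since both ratings share the same 1 to 10 scale, the integer codes can be mapped to ordered labels when a readable, order-aware representation is wanted. A minimal sketch using the first six `OverallQual` values printed by `glimpse(train)`; whether to treat these ratings as ordered factors or as plain integers is a modeling choice left open here.

```r
# Labels for codes 1..10, taken from the description above
rating_labels <- c("Very Poor", "Poor", "Fair", "Below Average", "Average",
                   "Above Average", "Good", "Very Good", "Excellent",
                   "Very Excellent")

# First six OverallQual values from the glimpse() output
overall_qual <- c(7, 6, 7, 7, 8, 5)

# Ordered factor: keeps the ranking so comparisons such as > still work
qual_f <- factor(overall_qual, levels = 1:10, labels = rating_labels,
                 ordered = TRUE)

print(qual_f)
print(qual_f[5] > qual_f[1])
```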