EDA

Data Structure

First, let us take a glance at the structure of the train data.

introduce(train)
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1 1460      81               43                 38                   0
##   total_missing_values total_observations memory_usage
## 1                 6965             118260       516808

From the above we can conclude the following:

  • There is a balance between discrete and continuous features (43 vs. 38).
  • Nearly 6% of the data is missing.

Next, let us see how the data is organized.
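The missing-data share can be checked by hand from the `introduce(train)` output: it is the total number of missing values divided by the total number of observations (rows times columns). A minimal check, using only the numbers printed above:

```r
# Numbers copied from the introduce(train) output above
n_rows <- 1460
n_cols <- 81
total_missing <- 6965

# Rows times columns gives the total_observations figure (118260)
total_observations <- n_rows * n_cols

# Share of missing cells, in percent
missing_pct <- 100 * total_missing / total_observations
round(missing_pct, 2)
```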

glimpse(train)
## Observations: 1,460
## Variables: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning      <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, ...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street        <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ...
## $ Alley         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape      <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg...
## $ LandContour   <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities     <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, ...
## $ LotConfig     <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside...
## $ LandSlope     <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood  <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit...
## $ Condition1    <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN,...
## $ Condition2    <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ...
## $ BldgType      <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, ...
## $ HouseStyle    <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, ...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle     <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable,...
## $ RoofMatl      <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com...
## $ Exterior1st   <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, Vin...
## $ Exterior2nd   <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, Vin...
## $ MasVnrType    <fct> BrkFace, None, BrkFace, None, BrkFace, None, Sto...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual     <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ ExterCond     <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ Foundation    <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc...
## $ BsmtQual      <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, ...
## $ BsmtCond      <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ BsmtExposure  <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, ...
## $ BsmtFinType1  <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2  <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating       <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ...
## $ HeatingQC     <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, ...
## $ CentralAir    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ Electrical    <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual   <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional    <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Ty...
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu   <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, ...
## $ GarageType    <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, ...
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish  <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual    <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, ...
## $ GarageCond    <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ PavedDrive    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC        <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence         <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, N...
## $ MiscFeature   <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, ...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType      <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New,...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
summary(train)
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  Median :  9478               NA's:1369   IR3: 10   Low:  36   
##  Mean   : 10517                           Reg:925   Lvl:1311   
##  3rd Qu.: 11602                                                
##  Max.   :215245                                                
##                                                                
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle   OverallQual    
##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.000  
##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.000  
##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.000  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.099  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000  
##  PosA   :   1                 1.5Unf : 14   Max.   :10.000  
##  (Other):   2                 (Other): 19                   
##   OverallCond      YearBuilt     YearRemodAdd    RoofStyle   
##  Min.   :1.000   Min.   :1872   Min.   :1950   Flat   :  13  
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Gable  :1141  
##  Median :5.000   Median :1973   Median :1994   Gambrel:  11  
##  Mean   :5.575   Mean   :1971   Mean   :1985   Hip    : 286  
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   Mansard:   7  
##  Max.   :9.000   Max.   :2010   Max.   :2010   Shed   :   2  
##                                                              
##     RoofMatl     Exterior1st   Exterior2nd    MasVnrType    MasVnrArea    
##  CompShg:1434   VinylSd:515   VinylSd:504   BrkCmn : 15   Min.   :   0.0  
##  Tar&Grv:  11   HdBoard:222   MetalSd:214   BrkFace:445   1st Qu.:   0.0  
##  WdShngl:   6   MetalSd:220   HdBoard:207   None   :864   Median :   0.0  
##  WdShake:   5   Wd Sdng:206   Wd Sdng:197   Stone  :128   Mean   : 103.7  
##  ClyTile:   1   Plywood:108   Plywood:142   NA's   :  8   3rd Qu.: 166.0  
##  Membran:   1   CemntBd: 61   CmentBd: 60                 Max.   :1600.0  
##  (Other):   2   (Other):128   (Other):136                 NA's   :8       
##  ExterQual ExterCond  Foundation  BsmtQual   BsmtCond    BsmtExposure
##  Ex: 52    Ex:   3   BrkTil:146   Ex  :121   Fa  :  45   Av  :221    
##  Fa: 14    Fa:  28   CBlock:634   Fa  : 35   Gd  :  65   Gd  :134    
##  Gd:488    Gd: 146   PConc :647   Gd  :618   Po  :   2   Mn  :114    
##  TA:906    Po:   1   Slab  : 24   TA  :649   TA  :1311   No  :953    
##            TA:1282   Stone :  6   NA's: 37   NA's:  37   NA's: 38    
##                      Wood  :  3                                      
##                                                                      
##  BsmtFinType1   BsmtFinSF1     BsmtFinType2   BsmtFinSF2     
##  ALQ :220     Min.   :   0.0   ALQ :  19    Min.   :   0.00  
##  BLQ :148     1st Qu.:   0.0   BLQ :  33    1st Qu.:   0.00  
##  GLQ :418     Median : 383.5   GLQ :  14    Median :   0.00  
##  LwQ : 74     Mean   : 443.6   LwQ :  46    Mean   :  46.55  
##  Rec :133     3rd Qu.: 712.2   Rec :  54    3rd Qu.:   0.00  
##  Unf :430     Max.   :5644.0   Unf :1256    Max.   :1474.00  
##  NA's: 37                      NA's:  38                     
##    BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC CentralAir
##  Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741    N:  95    
##  1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49    Y:1365    
##  Median : 477.5   Median : 991.5   GasW :  18   Gd:241              
##  Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1              
##  3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428              
##  Max.   :2336.0   Max.   :6110.0   Wall :   4                       
##                                                                     
##  Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##  Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                              
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100     
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39     
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586     
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735     
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000              
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000              
##                                                              
##   TotRmsAbvGrd    Functional    Fireplaces    FireplaceQu   GarageType 
##  Min.   : 2.000   Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6  
##  1st Qu.: 5.000   Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870  
##  Median : 6.000   Min1:  31   Median :1.000   Gd  :380    Basment: 19  
##  Mean   : 6.518   Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88  
##  3rd Qu.: 7.000   Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9  
##  Max.   :14.000   Sev :   1   Max.   :3.000   NA's:690    Detchd :387  
##                   Typ :1360                               NA's   : 81  
##   GarageYrBlt   GarageFinish   GarageCars      GarageArea     GarageQual 
##  Min.   :1900   Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3  
##  1st Qu.:1961   RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48  
##  Median :1980   Unf :605     Median :2.000   Median : 480.0   Gd  :  14  
##  Mean   :1979   NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3  
##  3rd Qu.:2002                3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311  
##  Max.   :2010                Max.   :4.000   Max.   :1418.0   NA's:  81  
##  NA's   :81                                                              
##  GarageCond  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch   
##  Ex  :   2   N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  Fa  :  35   P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Gd  :   9   Y:1340     Median :  0.00   Median : 25.00   Median :  0.00  
##  Po  :   7              Mean   : 94.24   Mean   : 46.66   Mean   : 21.95  
##  TA  :1326              3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00  
##  NA's:  81              Max.   :857.00   Max.   :547.00   Max.   :552.00  
##                                                                           
##    X3SsnPorch      ScreenPorch        PoolArea        PoolQC    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Ex  :   2  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2  
##  Median :  0.00   Median :  0.00   Median :  0.000   Gd  :   3  
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759   NA's:1453  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000              
##  Max.   :508.00   Max.   :480.00   Max.   :738.000              
##                                                                 
##    Fence      MiscFeature    MiscVal             MoSold      
##  GdPrv:  59   Gar2:   2   Min.   :    0.00   Min.   : 1.000  
##  GdWo :  54   Othr:   2   1st Qu.:    0.00   1st Qu.: 5.000  
##  MnPrv: 157   Shed:  49   Median :    0.00   Median : 6.000  
##  MnWw :  11   TenC:   1   Mean   :   43.49   Mean   : 6.322  
##  NA's :1179   NA's:1406   3rd Qu.:    0.00   3rd Qu.: 8.000  
##                           Max.   :15500.00   Max.   :12.000  
##                                                              
##      YrSold        SaleType    SaleCondition    SalePrice     
##  Min.   :2006   WD     :1267   Abnorml: 101   Min.   : 34900  
##  1st Qu.:2007   New    : 122   AdjLand:   4   1st Qu.:129975  
##  Median :2008   COD    :  43   Alloca :  12   Median :163000  
##  Mean   :2008   ConLD  :   9   Family :  20   Mean   :180921  
##  3rd Qu.:2009   ConLI  :   5   Normal :1198   3rd Qu.:214000  
##  Max.   :2010   ConLw  :   5   Partial: 125   Max.   :755000  
##                 (Other):   9

From the above we see:

  • There are some outliers scattered here and there; we investigate them in detail in the next sections.
  • The missing observations are scattered among features; let us investigate that further.

Missing Data

plot_missing(train, title = "Missing Data", ggtheme = theme_gray(base_size = 15))

The categorical features with the largest number of missing values are:

  • PoolQC (99.52%): Pool Quality, no wonder :)
  • MiscFeature (96.3%): Miscellaneous features not covered in other categories
  • Alley (93.77%): indicates the type of alley access
  • Fence (80.75%): Fence Quality
  • FireplaceQu (47.26%): Fireplace quality
  • GarageType (5.55%): Garage location, along with the related garage features below
  • GarageYrBlt (5.55%): I will convert this feature to categorical and treat it as such
  • GarageFinish (5.55%): Interior finish of the garage
  • GarageQual (5.55%): Garage quality
  • GarageCond (5.55%): Garage condition
  • BsmtExposure (2.6%): Refers to walkout or garden level walls.
  • BsmtFinType2 (2.6%): Rating of basement finished area (if multiple types)
  • BsmtQual (2.53%): Evaluates the height of the basement
  • BsmtCond (2.53%): Evaluates the general condition of the basement
  • BsmtFinType1 (2.53%): Rating of basement finished area
  • MasVnrType (0.55%): Masonry veneer type

I will impute the categorical features by converting NA to a "Not available" level, except for MasVnrType, where I will add an "Others" level, since something must have been used to build with.

The missing values indicate that the majority of the houses have no alley access, no pool, no fence, and none of the elevator, 2nd garage, shed or tennis court features covered by MiscFeature.
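The percentages quoted in the list can be reproduced without the plot by taking the column-wise mean of `is.na()`. The sketch below uses a small made-up data frame (`toy`, hypothetical values) in place of `train`, and then re-derives the PoolQC figure from the counts printed by `summary(train)`:

```r
# Toy stand-in for `train` (hypothetical values, not the real data)
toy <- data.frame(
  PoolQC      = factor(c(NA, NA, NA, "Gd")),
  LotFrontage = c(65, NA, 68, 60),
  LotArea     = c(8450, 9600, 11250, 9550)
)

# Share of missing values per column, in percent, largest first
missing_pct <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
missing_pct

# The PoolQC figure, from the 1453 NA's out of 1460 rows in summary(train)
pool_pct <- 100 * 1453 / 1460
round(pool_pct, 2)
```

On the real data the same expression would be `colMeans(is.na(train)) * 100`.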

The numeric variables do not have as many missing values but there are still some present:

  • LotFrontage (17.74%): Linear feet of street connected to property
    • Maybe I will use the mean or the median.
  • MasVnrArea (0.55%): Masonry veneer area in square feet
    • Will impute by 0, as missing means that it does not exist.
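As a rough illustration of the two strategies, the sketch below imputes one vector with its median and another with zero. The vectors hold hypothetical values, and the choice of the median over the mean is only one of the two options mentioned above:

```r
# Hypothetical values standing in for the two numeric columns
lot_frontage <- c(65, 80, NA, 60, 313, NA)
mas_vnr_area <- c(196, 0, NA, 350)

# Median imputation: less sensitive to extreme values such as 313 than the mean
lot_median <- median(lot_frontage, na.rm = TRUE)
lot_frontage[is.na(lot_frontage)] <- lot_median

# Zero imputation: a missing veneer area is read as "no veneer"
mas_vnr_area[is.na(mas_vnr_area)] <- 0
```

One design choice to consider is computing the statistic on the training set and reusing it for the test set, so that both are filled with the same value.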

Discrete Features Overview

Let us have a quick look.

plot_bar(train)

At first glance, some features have many levels, several of which have very few observations, such as:

  • Neighborhood
  • Condition1
  • Condition2
  • HouseStyle
  • RoofMatl
  • Exterior1st
  • Exterior2nd
  • Functional
  • SaleType
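One way to make this observation concrete is to count the observations per level and flag the levels below some minimum. The sketch below is an assumption about how that could be done, not a step taken in this analysis: the factor, the threshold `min_count`, and the idea of collapsing rare levels into "Other" are all illustrative.

```r
# Hypothetical factor standing in for a column such as RoofMatl
roof <- factor(c(rep("CompShg", 95), rep("Tar&Grv", 3), "WdShngl", "Metal"))

counts <- table(roof)
min_count <- 5                       # assumed threshold, tune as needed
rare_levels <- names(counts)[counts < min_count]

# Collapse the rare levels into a single "Other" level;
# assigning the same name to several levels merges them
roof_lumped <- roof
levels(roof_lumped)[levels(roof_lumped) %in% rare_levels] <- "Other"
table(roof_lumped)
```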

Continuous Features Overview

Now, let us check the continuous features.

plot_density(train[,-c(1)], ggtheme = theme_gray(base_size = 15, base_family = "serif"))

From the plots, there seem to be many fluctuations in many features, and we will need to deal with each of them individually.

Response Variable Against Features Overview

Now, let us see how the discrete and continuous features interact with the response variable.

plot_scatterplot(train[,-c(1)], by = "SalePrice")
## Warning: Removed 267 rows containing missing values (geom_point).

## Warning: Removed 81 rows containing missing values (geom_point).

The plots confirm my doubts about the continuous features in particular: they need serious handling. Now let us move to the final stage of our EDA, correlation.

Correlation

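# Keep only the numeric columns and correlate them on complete rows
numeric_var <- names(train)[sapply(train, is.numeric)]
correlations <- cor(na.omit(train[, numeric_var]))
# Keep variables correlated (|r| > 0.3) with at least one other variable;
# the "> 1" discounts each variable's correlation of 1 with itself
row_indic <- apply(correlations, 1, function(x) sum(x > 0.3 | x < -0.3) > 1)

correlations <- correlations[row_indic, row_indic]
corrplot(correlations, method = "square")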

There seems to be high correlation among the continuous features; we will need to treat that in the Feature Engineering phase.
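The filtering step in the code above can be hard to read, so here is the same idea on a toy data frame (made-up columns `x`, `y`, `z`). A variable is kept only if it correlates above the threshold with at least one variable other than itself:

```r
# Toy data: x and y are strongly related, z is nearly unrelated to both
toy <- data.frame(
  x = 1:10,
  y = (1:10)^2,
  z = rep(c(1, -1), 5)
)

correlations <- cor(toy)
threshold <- 0.3

# Each row contains the variable's correlation with itself (exactly 1),
# so "more than one hit" means "at least one other variable above threshold"
keep <- apply(correlations, 1, function(r) sum(abs(r) > threshold) > 1)
filtered <- correlations[keep, keep]
filtered
```

Here `z` is dropped because its correlation with both `x` and `y` stays below 0.3 in absolute value.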

Plot scatter plot for variables that have high correlation.

The correlation matrix above shows that there are several variables that are strongly and positively correlated with housing price.

High positive correlation:

  • OverallQual
  • YearBuilt
  • YearRemodAdd
  • MasVnrArea
  • BsmtFinSF1
  • TotalBsmtSF
  • X1stFlrSF
  • GrLivArea
  • FullBath
  • TotRmsAbvGrd
  • Fireplaces
  • GarageYrBlt
  • GarageCars
  • GarageArea
  • WoodDeckSF
  • OpenPorchSF

Enclosed porch area (EnclosedPorch) is negatively correlated with year built. It seems that potential house buyers do not want an enclosed porch and developers have been building fewer enclosed porches in recent years. It is also negatively correlated with SalePrice, which makes sense.

There is a slight negative correlation between OverallCond and SalePrice. There is also a strong negative correlation between YearBuilt and OverallCond: recently built houses tend to be in worse overall condition.

train %>% 
  select(OverallCond, YearBuilt) %>% 
  ggplot(aes(as.factor(OverallCond),YearBuilt)) +
  geom_boxplot() +
  xlab('Overall Condition')

Feature Engineering

Now we come to the most critical part, which will determine what features our model will depend on. I will check all features with the following in mind:

Response Variable

Outlier check

Univariate approach

Let us check the summary first.

summary(train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

OK, the good news is that we do not have missing data, but it seems we have outliers. Let us make sure.

outlier_values <- boxplot.stats(train$SalePrice)$out  # outlier values.
boxplot(train$SalePrice, main="Price", boxwex=0.1)
mtext(paste("Outliers: ", paste(outlier_values, collapse=", ")), cex=0.6)

and by using the outliers package:

outlier(train$SalePrice)
## [1] 755000

OK, it seems that we have at least one or two observations as outliers; it is not much trouble, is it? Let us check for normality.
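The two checks above answer slightly different questions, which is why the box plot can list several values while `outlier()` returns a single one. The sketch below shows both rules in base R on a hypothetical vector. The last line is a base-R equivalent of what `outliers::outlier()` reports by default (the value farthest from the sample mean), not a call to the package itself:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # hypothetical values

# boxplot.stats() flags points beyond 1.5 times the hinge spread (coef = 1.5)
stats <- boxplot.stats(x)
hinges <- stats$stats[c(2, 4)]            # lower and upper hinge
fences <- hinges + c(-1.5, 1.5) * diff(hinges)
flagged <- stats$out

# The single value farthest from the sample mean
farthest <- x[which.max(abs(x - mean(x)))]
```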

Normality

Density plot and Q-Q plot can be used to check normality visually.

ggdensity(train$SalePrice, 
          main = "Density plot of SalePrice",
          xlab = "Sale Price")

qqPlot(train$SalePrice)

## [1]  692 1183

OK, we have a long right tail on the first plot and a biased line on the second, so it is not so normal. Let us confirm that by performing a significance test.

shapiro.test(train$SalePrice)
## 
##  Shapiro-Wilk normality test
## 
## data:  train$SalePrice
## W = 0.86967, p-value < 2.2e-16

It is confirmed; let us now transform the response variable and recheck.

train$SalePrice <- log(train$SalePrice)
ggdensity(train$SalePrice, 
          main = "Density plot of SalePrice",
          xlab = "Sale Price")

qqPlot(train$SalePrice)

## [1] 496 917

Much better. Now let us move to the features with a high share of missing values.
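To see the effect of the log transform in isolation, the sketch below repeats the Shapiro-Wilk comparison on simulated right-skewed data. The `rlnorm()` parameters are arbitrary assumptions and the values are not the real prices. It also shows the step that is easy to forget later: a model fitted on `log(SalePrice)` predicts on the log scale, so predictions need `exp()` to get back to prices.

```r
set.seed(1)
# Simulated right-skewed "prices" (hypothetical, not the real data)
price <- rlnorm(200, meanlog = 12, sdlog = 0.4)

# Shapiro-Wilk W statistic before and after the transform (closer to 1 is better)
w_raw <- unname(shapiro.test(price)$statistic)
log_price <- log(price)
w_log <- unname(shapiro.test(log_price)$statistic)
c(raw = w_raw, log = w_log)

# Values on the log scale are mapped back to the original scale with exp()
back <- exp(log_price)
```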

Missing Features Treatment

High Missing Values Percentage

I changed my mind: I will drop the features with a high percentage of missing values, as it seems too risky to keep them.

summary(train$PoolQC)
##   Ex   Fa   Gd NA's 
##    2    2    3 1453
train$PoolQC <- NULL
test$PoolQC <- NULL
summary(train$MiscFeature)
## Gar2 Othr Shed TenC NA's 
##    2    2   49    1 1406
train$MiscFeature <- NULL
test$MiscFeature <- NULL
summary(train$Alley)
## Grvl Pave NA's 
##   50   41 1369
train$Alley <- NULL
test$Alley <- NULL
summary(train$Fence )
## GdPrv  GdWo MnPrv  MnWw  NA's 
##    59    54   157    11  1179
train$Fence  <- NULL
test$Fence  <- NULL
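The four drops above could also be driven by a threshold instead of being written out per column. A minimal sketch, assuming a cut-off of 80% (`max_missing`, a made-up parameter) and toy data frames in place of `train` and `test`:

```r
# Toy stand-ins for train and test (hypothetical values)
toy_train <- data.frame(
  PoolQC  = c(NA, NA, NA, NA, NA, "Gd"),
  Alley   = c(NA, NA, NA, NA, NA, NA),
  LotArea = c(8450, 9600, 11250, 9550, 14260, 14115)
)
toy_test <- toy_train[1:2, ]

max_missing <- 0.8                                  # assumed cut-off
missing_share <- colMeans(is.na(toy_train))
drop_cols <- names(missing_share)[missing_share > max_missing]

# Decide on the training set, then drop the same columns from both sets
toy_train <- toy_train[, !(names(toy_train) %in% drop_cols), drop = FALSE]
toy_test  <- toy_test[,  !(names(toy_test)  %in% drop_cols), drop = FALSE]
```

Deciding on the training set only keeps the two data frames with identical columns.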

Others

I will impute the others.

summary(train$FireplaceQu)
##   Ex   Fa   Gd   Po   TA NA's 
##   24   33  380   20  313  690
train$FireplaceQu <- fct_explicit_na(train$FireplaceQu, "NA")
test$FireplaceQu <- fct_explicit_na(test$FireplaceQu, "NA")
summary(train$LotFrontage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   21.00   59.00   69.00   70.05   80.00  313.00     259
train$LotFrontage[is.na(train$LotFrontage)] <- mean(train$LotFrontage, na.rm = TRUE)
test$LotFrontage[is.na(test$LotFrontage)] <- mean(test$LotFrontage, na.rm = TRUE)
summary(train$GarageType)
##  2Types  Attchd Basment BuiltIn CarPort  Detchd    NA's 
##       6     870      19      88       9     387      81
train$GarageType <- fct_explicit_na(train$GarageType, "NA")
test$GarageType <- fct_explicit_na(test$GarageType, "NA")

# I will convert GarageYrBlt to factor
train$GarageYrBlt <- as.factor(train$GarageYrBlt)
test$GarageYrBlt <- as.factor(test$GarageYrBlt)

summary(train$GarageYrBlt)
## 1900 1906 1908 1910 1914 1915 1916 1918 1920 1921 1922 1923 1924 1925 1926 
##    1    1    1    3    2    2    5    2   14    3    5    3    3   10    6 
## 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 
##    1    4    2    8    4    3    1    2    4    5    2    3    9   14   10 
## 1942 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 
##    2    4    4    2   11    8   24    6    3   12   19   13   16   20   21 
## 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 
##   17   19   13   21   16   18   21   21   15   26   15   20   13   14   14 
## 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 
##   18    9   29   35   19   15   15   10    4    7    8   10    6   11   14 
## 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 
##   10   16    9   13   22   18   18   20   19   31   30   27   20   26   50 
## 2004 2005 2006 2007 2008 2009 2010 NA's 
##   53   65   59   49   29   21    3   81
train$GarageYrBlt <- fct_explicit_na(train$GarageYrBlt, "NA")
test$GarageYrBlt <- fct_explicit_na(test$GarageYrBlt, "NA")

summary(train$GarageFinish)
##  Fin  RFn  Unf NA's 
##  352  422  605   81
train$GarageFinish <- fct_explicit_na(train$GarageFinish, "NA")
test$GarageFinish <- fct_explicit_na(test$GarageFinish, "NA")

summary(train$GarageQual)
##   Ex   Fa   Gd   Po   TA NA's 
##    3   48   14    3 1311   81
train$GarageQual <- fct_explicit_na(train$GarageQual, "NA")
test$GarageQual <- fct_explicit_na(test$GarageQual, "NA")

summary(train$GarageCond)
##   Ex   Fa   Gd   Po   TA NA's 
##    2   35    9    7 1326   81
train$GarageCond <- fct_explicit_na(train$GarageCond, "NA")
test$GarageCond <- fct_explicit_na(test$GarageCond, "NA")
summary(train$BsmtExposure)
##   Av   Gd   Mn   No NA's 
##  221  134  114  953   38
train$BsmtExposure <- fct_explicit_na(train$BsmtExposure, "NA")
test$BsmtExposure <- fct_explicit_na(test$BsmtExposure, "NA")

summary(train$BsmtFinType2)
##  ALQ  BLQ  GLQ  LwQ  Rec  Unf NA's 
##   19   33   14   46   54 1256   38
train$BsmtFinType2 <- fct_explicit_na(train$BsmtFinType2, "NA")
test$BsmtFinType2 <- fct_explicit_na(test$BsmtFinType2, "NA")

summary(train$BsmtQual)
##   Ex   Fa   Gd   TA NA's 
##  121   35  618  649   37
train$BsmtQual <- fct_explicit_na(train$BsmtQual, "NA")
test$BsmtQual <- fct_explicit_na(test$BsmtQual, "NA")

summary(train$BsmtCond)
##   Fa   Gd   Po   TA NA's 
##   45   65    2 1311   37
train$BsmtCond <- fct_explicit_na(train$BsmtCond, "NA")
test$BsmtCond <- fct_explicit_na(test$BsmtCond, "NA")

summary(train$BsmtFinType1)
##  ALQ  BLQ  GLQ  LwQ  Rec  Unf NA's 
##  220  148  418   74  133  430   37
train$BsmtFinType1 <- fct_explicit_na(train$BsmtFinType1, "NA")
test$BsmtFinType1 <- fct_explicit_na(test$BsmtFinType1, "NA")
summary(train$MasVnrType)
##  BrkCmn BrkFace    None   Stone    NA's 
##      15     445     864     128       8
train$MasVnrType <- fct_explicit_na(train$MasVnrType, "NA")
test$MasVnrType <- fct_explicit_na(test$MasVnrType, "NA")

summary(train$MasVnrArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0     0.0   103.7   166.0  1600.0       8
train$MasVnrArea[is.na(train$MasVnrArea)]<- 0
test$MasVnrArea[is.na(test$MasVnrArea)]<- 0
summary(train$Electrical)
## FuseA FuseF FuseP   Mix SBrkr  NA's 
##    94    27     3     1  1334     1
train$Electrical <- fct_explicit_na(train$Electrical, "NA")
test$Electrical <- fct_explicit_na(test$Electrical, "NA")
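The repeated pattern above (turn NA into an explicit factor level) can be wrapped in a small helper and applied to many columns at once. The function below, `explicit_na()`, is a hypothetical base-R sketch of that idea and not part of forcats; the data frame holds made-up values.

```r
# Hypothetical helper: replace NA in a factor with an explicit level
explicit_na <- function(f, na_level = "NA") {
  stopifnot(is.factor(f))
  original_levels <- levels(f)
  x <- as.character(f)
  x[is.na(x)] <- na_level
  # Keep the original level order and append the new level at the end
  factor(x, levels = union(original_levels, na_level))
}

toy <- data.frame(
  GarageQual = factor(c("TA", NA, "Gd")),
  BsmtCond   = factor(c(NA, "TA", "TA")),
  LotArea    = c(8450, 9600, 11250)
)

# Apply the helper to every factor column, leaving numeric columns alone
factor_cols <- names(toy)[sapply(toy, is.factor)]
toy[factor_cols] <- lapply(toy[factor_cols], explicit_na)
```

Applying it blindly to every factor column is a shortcut: it also relabels columns where NA means "unknown" rather than "feature absent", so the column list may be better chosen by hand.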

Important Features

Definition

First I will need to identify the most important features to work on and eliminate the others to save effort and time.

# Decide if a variable is important or not using Boruta
response <- train[, "SalePrice"]
boruta_output <- Boruta(response ~ . , data = train, doTrace=2)  # perform Boruta search
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")])  # collect Confirmed and Tentative variables
boruta_signif
##  [1] "MSSubClass"    "MSZoning"      "LotFrontage"   "LotArea"      
##  [5] "LotShape"      "LandContour"   "LandSlope"     "Neighborhood" 
##  [9] "Condition1"    "BldgType"      "HouseStyle"    "OverallQual"  
## [13] "OverallCond"   "YearBuilt"     "YearRemodAdd"  "RoofStyle"    
## [17] "Exterior1st"   "Exterior2nd"   "MasVnrType"    "MasVnrArea"   
## [21] "ExterQual"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [25] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [29] "BsmtUnfSF"     "TotalBsmtSF"   "HeatingQC"     "CentralAir"   
## [33] "Electrical"    "X1stFlrSF"     "X2ndFlrSF"     "GrLivArea"    
## [37] "BsmtFullBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [41] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [45] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [49] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [53] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [57] "ScreenPorch"   "SaleCondition" "SalePrice"
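SalePrice shows up in its own importance list because of how the formula is written. The `.` expands to every column of `data` that is not already named in the formula, and `response` is a separate vector, so SalePrice stays on the predictor side. The base-R sketch below shows this with `terms()` on a toy data frame (hypothetical values). Excluding `Id` as well is an assumption about what one would want, not something done in the analysis above.

```r
toy <- data.frame(
  Id        = 1:4,
  GrLivArea = c(1710, 1262, 1786, 1717),
  SalePrice = c(208500, 181500, 223500, 140000)
)
response <- toy[, "SalePrice"]

# "." expands to every column of `data` that is not already in the formula
leaky <- attr(terms(response ~ ., data = toy), "term.labels")
clean <- attr(terms(SalePrice ~ . - Id, data = toy), "term.labels")

leaky   # still contains SalePrice
clean   # target and Id are both excluded
```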

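We eliminated 20 features. Note that SalePrice appears in the list only because the formula response ~ . leaves it among the predictors, so it should be ignored. Let us use another method.

lmMod <- earth(SalePrice ~ . , data = train)  # fit a MARS model with earth() (not lm(), despite the name)
ev <- evimp(lmMod)  # estimate variable importance
plot(ev)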

I will enter a loop: start by building different models using these important features as a baseline, then improve them by going through the feature engineering steps one by one, remodelling and comparing until we are satisfied. So let us continue the investigation of the important features.

YearBuilt, YearRemodAdd, OverallQual and OverallCond

These are correlated fields that we need to treat together.

Description

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

OverallQual: Rates the overall material and finish of the house

   * 10 Very Excellent
   * 9  Excellent
   * 8  Very Good
   * 7  Good
   * 6  Above Average
   * 5  Average
   * 4  Below Average
   * 3  Fair
   * 2  Poor
   * 1  Very Poor

OverallCond: Rates the overall condition of the house

   * 10 Very Excellent
   * 9  Excellent
   * 8  Very Good
   * 7  Good
   * 6  Above Average   
   * 5  Average
   * 4  Below Average   
   * 3  Fair
   * 2  Poor
   * 1  Very Poor
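Both rating scales are ordinal, so one option (an assumption, not a step taken above) is to store them as ordered factors carrying the labels from the description. The sketch uses the first few OverallQual values shown by `glimpse(train)`:

```r
# Labels for codes 1 to 10, taken from the description above
rating_labels <- c("Very Poor", "Poor", "Fair", "Below Average", "Average",
                   "Above Average", "Good", "Very Good", "Excellent",
                   "Very Excellent")

overall_qual <- c(7, 6, 7, 7, 8, 5)       # first values shown by glimpse(train)
qual_ordered <- factor(overall_qual, levels = 1:10,
                       labels = rating_labels, ordered = TRUE)

qual_ordered[1] > qual_ordered[2]          # comparisons respect the scale
as.integer(qual_ordered)                   # recovers the original 1-10 codes
```

Keeping the integer codes is equally valid when the model should treat the rating as numeric; the ordered factor mainly helps with readable plots and tables.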