This is the first part of the Machine learning workflow with Amnes Housing price data. In this part, we will explore the dataset and possible ways to deal with the missing data.

In the next part, we will

1 Load data and libraries

This is needed libraries.

# library(Amelia)
# library(data.table)
# library(ggplot2)
# library(dplyr)
# library(caret)
# library(magrittr)
# library(tidyverse)

# I did not load them all here to avoid conflict within different packages. 

Load the training and test data.

boston_raw=data.table::fread("train.csv")
boston_test=data.table::fread("test.csv")

Look at the data for the first few lines.

head(boston_raw)
##    Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1:  1         60       RL          65    8450   Pave    NA      Reg
## 2:  2         20       RL          80    9600   Pave    NA      Reg
## 3:  3         60       RL          68   11250   Pave    NA      IR1
## 4:  4         70       RL          60    9550   Pave    NA      IR1
## 5:  5         60       RL          84   14260   Pave    NA      IR1
## 6:  6         50       RL          85   14115   Pave    NA      IR1
##    LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1:         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2:         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3:         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4:         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5:         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6:         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##    Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1:       Norm     1Fam     2Story           7           5      2003
## 2:       Norm     1Fam     1Story           6           8      1976
## 3:       Norm     1Fam     2Story           7           5      2001
## 4:       Norm     1Fam     2Story           7           5      1915
## 5:       Norm     1Fam     2Story           8           5      2000
## 6:       Norm     1Fam     1.5Fin           5           5      1993
##    YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1:         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2:         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3:         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4:         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5:         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6:         1995     Gable  CompShg     VinylSd     VinylSd       None
##    MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
## 1:        196        Gd        TA      PConc       Gd       TA
## 2:          0        TA        TA     CBlock       Gd       TA
## 3:        162        Gd        TA      PConc       Gd       TA
## 4:          0        TA        TA     BrkTil       TA       Gd
## 5:        350        Gd        TA      PConc       Gd       TA
## 6:          0        TA        TA       Wood       Gd       TA
##    BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF
## 1:           No          GLQ        706          Unf          0       150
## 2:           Gd          ALQ        978          Unf          0       284
## 3:           Mn          GLQ        486          Unf          0       434
## 4:           No          ALQ        216          Unf          0       540
## 5:           Av          GLQ        655          Unf          0       490
## 6:           No          GLQ        732          Unf          0        64
##    TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF
## 1:         856    GasA        Ex          Y      SBrkr      856      854
## 2:        1262    GasA        Ex          Y      SBrkr     1262        0
## 3:         920    GasA        Ex          Y      SBrkr      920      866
## 4:         756    GasA        Gd          Y      SBrkr      961      756
## 5:        1145    GasA        Ex          Y      SBrkr     1145     1053
## 6:         796    GasA        Ex          Y      SBrkr      796      566
##    LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 1:            0      1710            1            0        2        1
## 2:            0      1262            0            1        2        0
## 3:            0      1786            1            0        2        1
## 4:            0      1717            1            0        1        0
## 5:            0      2198            1            0        2        1
## 6:            0      1362            1            0        1        1
##    BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1:            3            1          Gd            8        Typ
## 2:            3            1          TA            6        Typ
## 3:            3            1          Gd            6        Typ
## 4:            3            1          Gd            7        Typ
## 5:            4            1          Gd            9        Typ
## 6:            1            1          TA            5        Typ
##    Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1:          0          NA     Attchd        2003          RFn          2
## 2:          1          TA     Attchd        1976          RFn          2
## 3:          1          TA     Attchd        2001          RFn          2
## 4:          1          Gd     Detchd        1998          Unf          3
## 5:          1          TA     Attchd        2000          RFn          3
## 6:          0          NA     Attchd        1993          Unf          2
##    GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1:        548         TA         TA          Y          0          61
## 2:        460         TA         TA          Y        298           0
## 3:        608         TA         TA          Y          0          42
## 4:        642         TA         TA          Y          0          35
## 5:        836         TA         TA          Y        192          84
## 6:        480         TA         TA          Y         40          30
##    EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1:             0         0           0        0     NA    NA          NA
## 2:             0         0           0        0     NA    NA          NA
## 3:             0         0           0        0     NA    NA          NA
## 4:           272         0           0        0     NA    NA          NA
## 5:             0         0           0        0     NA    NA          NA
## 6:             0       320           0        0     NA MnPrv        Shed
##    MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1:       0      2   2008       WD        Normal    208500
## 2:       0      5   2007       WD        Normal    181500
## 3:       0      9   2008       WD        Normal    223500
## 4:       0      2   2006       WD       Abnorml    140000
## 5:       0     12   2008       WD        Normal    250000
## 6:     700     10   2009       WD        Normal    143000

Let us look at the dimensions of the data

dim(boston_raw)
## [1] 1460   81

Do a summary of different variables.

summary(boston_raw)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig        
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   LandSlope         Neighborhood        Condition1       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   Condition2          BldgType          HouseStyle         OverallQual    
##  Length:1460        Length:1460        Length:1460        Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 6.099  
##                                                           3rd Qu.: 7.000  
##                                                           Max.   :10.000  
##                                                                           
##   OverallCond      YearBuilt     YearRemodAdd   RoofStyle        
##  Min.   :1.000   Min.   :1872   Min.   :1950   Length:1460       
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Class :character  
##  Median :5.000   Median :1973   Median :1994   Mode  :character  
##  Mean   :5.575   Mean   :1971   Mean   :1985                     
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004                     
##  Max.   :9.000   Max.   :2010   Max.   :2010                     
##                                                                  
##    RoofMatl         Exterior1st        Exterior2nd       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   MasVnrType          MasVnrArea      ExterQual          ExterCond        
##  Length:1460        Min.   :   0.0   Length:1460        Length:1460       
##  Class :character   1st Qu.:   0.0   Class :character   Class :character  
##  Mode  :character   Median :   0.0   Mode  :character   Mode  :character  
##                     Mean   : 103.7                                        
##                     3rd Qu.: 166.0                                        
##                     Max.   :1600.0                                        
##                     NA's   :8                                             
##   Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical           1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##     2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      3SsnPorch       ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     MiscVal             MoSold           YrSold       SaleType        
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   Length:1460       
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   Class :character  
##  Median :    0.00   Median : 6.000   Median :2008   Mode  :character  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008                     
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009                     
##  Max.   :15500.00   Max.   :12.000   Max.   :2010                     
##                                                                       
##  SaleCondition        SalePrice     
##  Length:1460        Min.   : 34900  
##  Class :character   1st Qu.:129975  
##  Mode  :character   Median :163000  
##                     Mean   :180921  
##                     3rd Qu.:214000  
##                     Max.   :755000  
## 

Further glimpse on the type of each dataset

dplyr::glimpse(boston_raw)
## Observations: 1,460
## Variables: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning      <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", ...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ...
## $ Alley         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape      <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg",...
## $ LandContour   <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl",...
## $ Utilities     <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub"...
## $ LotConfig     <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Ins...
## $ LandSlope     <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl",...
## $ Neighborhood  <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoR...
## $ Condition1    <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm",...
## $ Condition2    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ...
## $ BldgType      <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ...
## $ HouseStyle    <chr> "2Story", "1Story", "2Story", "2Story", "2Story"...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle     <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Ga...
## $ RoofMatl      <chr> "CompShg", "CompShg", "CompShg", "CompShg", "Com...
## $ Exterior1st   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "Vin...
## $ Exterior2nd   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "Vin...
## $ MasVnrType    <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace",...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual     <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", ...
## $ ExterCond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ Foundation    <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "...
## $ BsmtQual      <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", ...
## $ BsmtCond      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", ...
## $ BsmtExposure  <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", ...
## $ BsmtFinType1  <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ",...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2  <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf",...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating       <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ...
## $ HeatingQC     <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", ...
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ Electrical    <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SB...
## $ 1stFlrSF      <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ 2ndFlrSF      <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual   <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", ...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional    <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ",...
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu   <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA"...
## $ GarageType    <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd"...
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish  <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn",...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ GarageCond    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ PavedDrive    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ 3SsnPorch     <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence         <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA,...
## $ MiscFeature   <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, ...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", ...
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, ...

2 Data exploration

Let us look at the percentage of missing values

library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(boston_raw,main="Training Data - Missings Map", col=c("yellow", "black"), legend=FALSE)
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used

It seems that most data all complete. Let us take a closer look at the percentile of the variable with most missing values

2.1 Percentage plot of missing value

miss_pct_plot=function(data){
  library(dplyr)
  library(magrittr)
  library(ggplot2)
  # Define a function to calculate percentage of missing value in each feature
  miss_pct=function(data){
    pct=sum(is.na(data))/length(data)
  }
  
  # apply the function define above to every columns of the data set 
  missing=data%>%sapply(miss_pct)%>%as.data.frame()
  
  
  colnames(missing)="miss_pct"
  missing=data.frame(missing_pct=missing,names=row.names(missing)%>%as.character())
  
  # Select only the features with missing values and their missing value percentage
  missing=missing%>%filter(miss_pct>0)%>%arrange(desc(miss_pct))
  
  paste("There are ",  dim(missing)[1] ," features with missing values.",sep="")%>%print()
  
  # Plot the missing percentage
  plot=missing%>%ggplot(aes(x=reorder(names,-miss_pct),y=miss_pct))+
    geom_bar(stat = "identity",fill="blue")+
    theme(axis.text.x = element_text(angle = 60,hjust=1))+
    labs(x="Features with missing value", y="Percentage of missing value")
  
  return(plot)
}

miss_pct_plot(boston_raw)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## [1] "There are 19 features with missing values."

There are features with missing values.

Get rid of variables with more than 75% missing value.

# Define a function to filter out variables with missing value more than p, which has a default value of 50%. 


## The function name is short for complete data with percentile p
select_comp=function(data,p=0.5){
  # Define a function to calculate percentage of missing value in each feature
  miss_pct=function(data){
    pct=sum(is.na(data))/length(data)
  }
  
  # apply the function define above to every columns of the data set 
  missing=sapply(data,miss_pct)%>%as.data.frame()
  
  names=row.names(missing)
  colnames(missing)="miss_pct"
  missing=data.frame(missing_pct=missing,names=row.names(missing)%>%as.character())
  
  # Select only the features with missing values and their missing value percentage
  missing=missing%>%filter(miss_pct>p)%>%arrange(desc(miss_pct))
  
  
  
  names=missing$names%>%as.character()
  
  data=data%>%select(-one_of(names) )
  return(data)
}


boston_clean1=boston_raw%>%select_comp(p=0.75)

Change the character value to factor.

char_to_factor=function(data){
  data=as.data.frame(data)
  library(tidyverse)
  sapply(data,class)%>%table()%>%print
  char_index=sapply(data,class)=="character"
  char_index=char_index%>%as.logical()
  
  char_features=data[,char_index]
  
  
  char_features=char_features%>%sapply(as.factor)
  char_features=as.data.frame(char_features)
  
  data[,!char_index]%>%sapply(class)%>%table
  
  data=cbind(data[,!char_index],char_features)
  
  data%>%sapply(class)%>%table%>%print
  return(data)
  
}

boston_clean1=boston_clean1%>%char_to_factor()
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
## .
## character   integer 
##        39        38 
## .
##  factor integer 
##      39      38

Define a function to select integer and numeric features in the dataset.

select_num=function(data){
  library(magrittr)
  data=as.data.frame(data)
  
  data%>%sapply(class)%>%table()%>%print
  int_index=sapply(data,class)== "integer"
  int_index=int_index%>%as.logical()
  num_index=sapply(data,class)== "numeric"
  num_index=num_index%>%as.logical()
  # data[,!int_index]%>%sapply(class)%>%table
  
  data1=data[,int_index]
  data2=data[,num_index]
  data=cbind(data1,data2)
  data%>%sapply(class)%>%table%>%print
  return(data)
}


# Find all factor variables 

select_fac=function(data){
  library(magrittr)
  data=as.data.frame(data)
  
  data%>%sapply(class)%>%table()%>%print
  fac_index=sapply(data,class)== "factor"
  fac_index=fac_index%>%as.logical()
  # data[,!int_index]%>%sapply(class)%>%table
  
  data=data[,fac_index]
  
  data%>%sapply(class)%>%table%>%print
  return(data)
}

Let us look at the correlation between the variables.

source('~/Dropbox/Utility/correlation.R', echo=TRUE)
## 
## > correlation = function(data) {
## +     select_num = function(data) {
## +         library(magrittr)
## +         data = as.data.frame(data)
## +         int_in .... [TRUNCATED]
library(corrplot)

correlation(boston_raw)%>%corrplot(method="pie",type="lower",
                                   order="hclust", # First principle order
                                   diag=FALSE, # No diagonal values
                                   tl.cex=0.65, # size of text label 
                                   tl.col="red"# color of the text label
                                   )
## .
## character   integer 
##        43        38 
## .
##  factor integer 
##      43      38 
## .
## character   integer 
##        43        38 
## .
##  factor integer 
##      43      38 
## .
##  factor integer 
##      27      35 
## .
## factor 
##     27

This gives a basic sense of how continuous and factor variables are related to the price. If there are too many features available, then it is best to apply dimension reduction techniques likePCA (Principle component analysis) first.

Next, we want to look at features that have the highest correlation with target variable, SalePrice.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(dplyr)
correlate=function(){
  # Convert character to factor, select variables with no missing value by setting p=0, select only numerical variables 
  boston_raw1=boston_raw%>%select_comp(p=0)%>%select_num()
  # Convert character to factor, select variables with no missing value by setting p=0, select only factor variables, and convert it into integer values 
  boston_raw2=boston_raw%>%char_to_factor()%>%select_comp(p=0)%>%select_fac()%>%sapply(as.integer)
  # combine the numerical and factor variables 
  boston_raw=cbind(boston_raw1,boston_raw2)
  # calculate correlations 
  correlation=cor(boston_raw$SalePrice,boston_raw%>%select(-SalePrice))
  #print("contains only numeric or interger for correlation calculation")
  # Return the correlation data 
  names=correlation%>%colnames()
  result=data.frame(features=names,correlation=correlation%>%as.vector())%>%arrange(desc(abs(correlation)))
  return(result)
}
correlate()
## .
## character   integer 
##        27        35 
## .
## integer 
##      35 
## .
## character   integer 
##        43        38 
## .
##  factor integer 
##      43      38 
## .
##  factor integer 
##      27      35 
## .
## factor 
##     27
##         features  correlation
## 1    OverallQual  0.790981601
## 2      GrLivArea  0.708624478
## 3     GarageCars  0.640409197
## 4      ExterQual -0.636883694
## 5     GarageArea  0.623431439
## 6    TotalBsmtSF  0.613580552
## 7       1stFlrSF  0.605852185
## 8    KitchenQual -0.589188778
## 9       FullBath  0.560663763
## 10  TotRmsAbvGrd  0.533723156
## 11     YearBuilt  0.522897333
## 12  YearRemodAdd  0.507100967
## 13    Fireplaces  0.466928837
## 14     HeatingQC -0.400177543
## 15    BsmtFinSF1  0.386419806
## 16    Foundation  0.382478991
## 17    WoodDeckSF  0.324413445
## 18      2ndFlrSF  0.319333803
## 19   OpenPorchSF  0.315856227
## 20      HalfBath  0.284107676
## 21       LotArea  0.263843354
## 22      LotShape -0.255579870
## 23    CentralAir  0.251328164
## 24    PavedDrive  0.231356952
## 25  BsmtFullBath  0.227122233
## 26     RoofStyle  0.222405292
## 27     BsmtUnfSF  0.214479106
## 28 SaleCondition  0.213092030
## 29    HouseStyle  0.180162623
## 30  Neighborhood  0.170941316
## 31  BedroomAbvGr  0.168213154
## 32      MSZoning -0.166872203
## 33  KitchenAbvGr -0.135907371
## 34      RoofMatl  0.132383112
## 35 EnclosedPorch -0.128577958
## 36     ExterCond  0.117302657
## 37    Functional  0.115327917
## 38   ScreenPorch  0.111446571
## 39   Exterior2nd  0.103765537
## 40   Exterior1st  0.103551011
## 41       Heating -0.098812076
## 42      PoolArea  0.092403549
## 43    Condition1  0.091154912
## 44      BldgType -0.085590608
## 45    MSSubClass -0.084284135
## 46   OverallCond -0.077855894
## 47     LotConfig -0.067396023
## 48     LandSlope  0.051152248
## 49      SaleType -0.050369536
## 50        MoSold  0.046432245
## 51     3SsnPorch  0.044583665
## 52        Street  0.041035536
## 53        YrSold -0.028922585
## 54  LowQualFinSF -0.025606130
## 55            Id -0.021916719
## 56       MiscVal -0.021189580
## 57  BsmtHalfBath -0.016844154
## 58   LandContour  0.015453242
## 59     Utilities -0.014314296
## 60    BsmtFinSF2 -0.011378121
## 61    Condition2  0.007512734

Let us look at the relationship between OverallQual and SalePrice.

library(ggplot2)
boston_raw$OverallQual=as.factor(boston_raw$OverallQual)
boston_raw%>%ggplot(aes(x=OverallQual,y=log1p(SalePrice)))+geom_violin(aes(fill=OverallQual))

Let us look at the relationship between GrLivArea and SalePrice. Plus, we will color the points with different OverallQual.

library(ggplot2)
boston_raw$GrLivArea=as.numeric(boston_raw$GrLivArea)
boston_raw%>%ggplot(aes(x=GrLivArea,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.5)

Let us look at the relationship between GarageCars and SalePrice.

library(ggplot2)
boston_raw$GarageCars=as.factor(boston_raw$GarageCars)
boston_raw%>%ggplot(aes(x=GarageCars,y=log1p(SalePrice)))+geom_violin(aes(fill=GarageCars))

Let us look at the relationship between ExterQual and SalePrice.

library(ggplot2)

boston_raw%>%ggplot(aes(x=ExterQual,y=SalePrice))+geom_violin(aes(fill=ExterQual))

Let us look at the general relationship between those top influential features.

boston_raw%>%ggplot()+geom_point(aes(y=log1p(SalePrice),x=GrLivArea,color=OverallQual))+facet_wrap(ExterQual~GarageCars)

Let us look at the relationship between GarageArea and SalePrice.

library(ggplot2)
boston_raw$GarageArea=as.numeric(boston_raw$GarageArea)
boston_raw%>%ggplot(aes(x=GarageArea,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.5)

library(ggplot2)
boston_raw$TotalBsmtSF=as.numeric(boston_raw$TotalBsmtSF)
boston_raw%>%ggplot(aes(x=TotalBsmtSF,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.3)

library(ggplot2)
boston_raw%>%ggplot(aes(x=FullBath,y=log1p(SalePrice)))+geom_violin(aes(fill=ExterQual))
## Warning: position_dodge requires non-overlapping x intervals

library(ggplot2)
boston_raw%>%ggplot(aes(x=TotRmsAbvGrd,y=log1p(SalePrice)))+geom_violin(aes(fill=ExterQual))
## Warning: position_dodge requires non-overlapping x intervals

library(ggplot2)
boston_raw$YearBuilt=as.numeric(boston_raw$YearBuilt)
boston_raw%>%ggplot(aes(x=YearBuilt,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.3)

A further look into the dataset shows that there are some zero variance or near zero variance features. Zero variance features would definitely needs to be removed from the dataset, since it contains no information for predicting the target variable.

But whether we should remove near zero variance features is a more complicated question. We could either remove the near zero variance features or use PCA (Principle component analysis) to extract some potential useful signal from the dataset.

Let us find the zero and near-zero variance features.

library(dplyr)
library(caret)
nzv=function(data){
  names=nearZeroVar(data,saveMetrics = TRUE)%>%row.names()
result=cbind(features=names,nearZeroVar(boston_raw,saveMetrics = TRUE) )%>%arrange(percentUnique)
return(result)
}

boston_raw%>%nzv()
##         features   freqRatio percentUnique zeroVar   nzv
## 1         Street  242.333333     0.1369863   FALSE  TRUE
## 2          Alley    1.219512     0.1369863   FALSE FALSE
## 3      Utilities 1459.000000     0.1369863   FALSE  TRUE
## 4     CentralAir   14.368421     0.1369863   FALSE FALSE
## 5      LandSlope   21.261538     0.2054795   FALSE  TRUE
## 6   BsmtHalfBath   17.225000     0.2054795   FALSE FALSE
## 7       HalfBath    1.706542     0.2054795   FALSE FALSE
## 8   GarageFinish    1.433649     0.2054795   FALSE FALSE
## 9     PavedDrive   14.888889     0.2054795   FALSE FALSE
## 10        PoolQC    1.500000     0.2054795   FALSE FALSE
## 11      LotShape    1.911157     0.2739726   FALSE FALSE
## 12   LandContour   20.809524     0.2739726   FALSE  TRUE
## 13    MasVnrType    1.941573     0.2739726   FALSE FALSE
## 14     ExterQual    1.856557     0.2739726   FALSE FALSE
## 15      BsmtQual    1.050162     0.2739726   FALSE FALSE
## 16      BsmtCond   20.169231     0.2739726   FALSE  TRUE
## 17  BsmtExposure    4.312217     0.2739726   FALSE FALSE
## 18  BsmtFullBath    1.455782     0.2739726   FALSE FALSE
## 19      FullBath    1.181538     0.2739726   FALSE FALSE
## 20  KitchenAbvGr   21.415385     0.2739726   FALSE  TRUE
## 21   KitchenQual    1.254266     0.2739726   FALSE FALSE
## 22    Fireplaces    1.061538     0.2739726   FALSE FALSE
## 23         Fence    2.661017     0.2739726   FALSE FALSE
## 24   MiscFeature   24.500000     0.2739726   FALSE  TRUE
## 25      MSZoning    5.279817     0.3424658   FALSE FALSE
## 26     LotConfig    4.000000     0.3424658   FALSE FALSE
## 27      BldgType   10.701754     0.3424658   FALSE FALSE
## 28     ExterCond    8.780822     0.3424658   FALSE FALSE
## 29     HeatingQC    1.731308     0.3424658   FALSE FALSE
## 30    Electrical   14.191489     0.3424658   FALSE FALSE
## 31   FireplaceQu    1.214058     0.3424658   FALSE FALSE
## 32    GarageCars    2.233062     0.3424658   FALSE FALSE
## 33    GarageQual   27.312500     0.3424658   FALSE  TRUE
## 34    GarageCond   37.885714     0.3424658   FALSE  TRUE
## 35        YrSold    1.027356     0.3424658   FALSE FALSE
## 36     RoofStyle    3.989510     0.4109589   FALSE FALSE
## 37    Foundation    1.020505     0.4109589   FALSE FALSE
## 38  BsmtFinType1    1.028708     0.4109589   FALSE FALSE
## 39  BsmtFinType2   23.259259     0.4109589   FALSE  TRUE
## 40       Heating   79.333333     0.4109589   FALSE  TRUE
## 41    GarageType    2.248062     0.4109589   FALSE FALSE
## 42 SaleCondition    9.584000     0.4109589   FALSE FALSE
## 43    Functional   40.000000     0.4794521   FALSE  TRUE
## 44    Condition2  240.833333     0.5479452   FALSE  TRUE
## 45    HouseStyle    1.631461     0.5479452   FALSE FALSE
## 46      RoofMatl  130.363636     0.5479452   FALSE  TRUE
## 47  BedroomAbvGr    2.245810     0.5479452   FALSE FALSE
## 48      PoolArea 1453.000000     0.5479452   FALSE  TRUE
## 49    Condition1   15.555556     0.6164384   FALSE FALSE
## 50   OverallCond    3.257937     0.6164384   FALSE FALSE
## 51      SaleType   10.385246     0.6164384   FALSE FALSE
## 52   OverallQual    1.061497     0.6849315   FALSE FALSE
## 53  TotRmsAbvGrd    1.221884     0.8219178   FALSE FALSE
## 54        MoSold    1.081197     0.8219178   FALSE FALSE
## 55    MSSubClass    1.792642     1.0273973   FALSE FALSE
## 56   Exterior1st    2.319820     1.0273973   FALSE FALSE
## 57   Exterior2nd    2.355140     1.0958904   FALSE FALSE
## 58     3SsnPorch  478.666667     1.3698630   FALSE  TRUE
## 59       MiscVal  128.000000     1.4383562   FALSE  TRUE
## 60  LowQualFinSF  478.000000     1.6438356   FALSE  TRUE
## 61  Neighborhood    1.500000     1.7123288   FALSE FALSE
## 62  YearRemodAdd    1.835052     4.1780822   FALSE FALSE
## 63   ScreenPorch  224.000000     5.2054795   FALSE  TRUE
## 64   GarageYrBlt    1.101695     6.6438356   FALSE FALSE
## 65   LotFrontage    2.042857     7.5342466   FALSE FALSE
## 66     YearBuilt    1.046875     7.6712329   FALSE FALSE
## 67 EnclosedPorch   83.466667     8.2191781   FALSE  TRUE
## 68    BsmtFinSF2  258.600000     9.8630137   FALSE  TRUE
## 69   OpenPorchSF   22.620690    13.8356164   FALSE FALSE
## 70    WoodDeckSF   20.026316    18.7671233   FALSE FALSE
## 71    MasVnrArea  107.625000    22.3972603   FALSE FALSE
## 72      2ndFlrSF   82.900000    28.5616438   FALSE FALSE
## 73    GarageArea    1.653061    30.2054795   FALSE FALSE
## 74    BsmtFinSF1   38.916667    43.6301370   FALSE FALSE
## 75     SalePrice    1.176471    45.4109589   FALSE FALSE
## 76   TotalBsmtSF    1.057143    49.3835616   FALSE FALSE
## 77      1stFlrSF    1.562500    51.5753425   FALSE FALSE
## 78     BsmtUnfSF   13.111111    53.4246575   FALSE FALSE
## 79     GrLivArea    1.571429    58.9726027   FALSE FALSE
## 80       LotArea    1.041667    73.4931507   FALSE FALSE
## 81            Id    1.000000   100.0000000   FALSE FALSE

Let us look at how does Alley features relate with SalePrice, the target variable.

library(ggplot2)
boston_raw%>%ggplot()+geom_violin(aes(x=Alley,y=log1p(SalePrice),fill=Alley))

3 Data cleaning

3.1 Missing value and unreasonable value

The next step is to understand the missing value and impute it.

There are basically two types of missing value: 1. Systematic missing value 2. Missing value at random boston_clean1%>%select_num()%>%sapply(skewness)%>%sort()

Let us first look at the features that have missing values.

miss_name=function(data){
  # Define a function to calculate percentage of missing value in each feature
  miss_pct=function(data){
    pct=sum(is.na(data))/length(data)
  }
  
  # apply the function define above to every columns of the data set 
  missing=data%>%sapply(miss_pct)%>%as.data.frame()
  
  
  colnames(missing)="miss_pct"
  missing=data.frame(names=row.names(missing)%>%as.character(),missing_pct=missing)
  
  # Select only the features with missing values and their missing value percentage
  missing=missing%>%filter(miss_pct>0)%>%arrange(desc(miss_pct))
  
  
  return(missing)
}

miss_name(boston_raw)
##           names     miss_pct
## 1        PoolQC 0.9952054795
## 2   MiscFeature 0.9630136986
## 3         Alley 0.9376712329
## 4         Fence 0.8075342466
## 5   FireplaceQu 0.4726027397
## 6   LotFrontage 0.1773972603
## 7    GarageType 0.0554794521
## 8   GarageYrBlt 0.0554794521
## 9  GarageFinish 0.0554794521
## 10   GarageQual 0.0554794521
## 11   GarageCond 0.0554794521
## 12 BsmtExposure 0.0260273973
## 13 BsmtFinType2 0.0260273973
## 14     BsmtQual 0.0253424658
## 15     BsmtCond 0.0253424658
## 16 BsmtFinType1 0.0253424658
## 17   MasVnrType 0.0054794521
## 18   MasVnrArea 0.0054794521
## 19   Electrical 0.0006849315

Let us look at the data dictionary to find out what NA value acutally means something else.

Alley: NA No alley access BsmtQual: NA No Basement BsmtCond: NA No Basement BsmtExposure:NA No Basement BsmtFinType1:NA No Basement BsmtFinType2: NA No Basement

FireplaceQu: NA No Fireplace

GarageType: NA No Garage GarageFinish:NA No Garage GarageQual:NA No Garage GarageCond:NA No Garage

PoolQC: NA No Poolcaret

Fence:NA No FenceGarageYrBlt MiscFeature: NA None

It seems that in all these features, NA does not mean missing value, it simply means there is no facility. Thus, we need to convert NA in these features into a factor value like “None”

## Change NA values of the features into "None", a new factor value 

na_to_fac=function(data,names){
  # libraries (change it into installing it if not installed later)
  library(magrittr)
  library(dplyr)
  # Split the data into data with NA value and data with no NA value 
  data_na=data%>%select(names)
  data_nna=data%>%select(-one_of(names))
  data_na[is.na(data_na)]="None"
  
  data=cbind(data_na,data_nna)
  return(data)
  
}

boston_clean2=boston_raw%>%na_to_fac(names=c("Alley","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PoolQC","Fence","MiscFeature"))%>%char_to_factor()
## .
## character    factor   integer   numeric 
##        43         2        32         4 
## .
##  factor integer numeric 
##      45      32       4
miss_pct_plot(boston_clean2)
## [1] "There are 5 features with missing values."

# boston_clean2%>%sapply(class)

Then, after this cleaning we are left with 5 features with missing values.

Let us look futher into those variables.

LotFrontage is the Linear feet of street connected to property.

GarageYrBlt is the Year garage was built.

MasVnrArea is the Masonry veneer area in square feet. MasVnrType is the Masonry veneer type

Electrical is Electrical system with various factors.

  1. Look at the missing value at Electrical
boston_raw%>%filter(is.na(Electrical))
##     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1380         80       RL          73    9735   Pave  <NA>      Reg
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl       Timber       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam       SLvl           5           5      2006
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2007     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1          0        TA        TA      PConc       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          Unf          0          Unf          0       384         384
##   Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## 1    GasA        Gd          Y       <NA>      754      640            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1394            0            0        2        1            3
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            7        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1    BuiltIn        2007          Fin          2        400         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 1         TA          Y        100           0             0         0
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      5   2008
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    167500

There is only one missing value. We could definitely impute it with some methods.

2.Look at the missing value at MasVnrArea and Masonry veneer type.

boston_raw%>%filter(is.na(MasVnrArea))
##     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  235         60       RL          NA    7851   Pave  <NA>      Reg
## 2  530         20       RL          NA   32668   Pave  <NA>      IR1
## 3  651         60       FV          65    8125   Pave  <NA>      Reg
## 4  937         20       RL          67   10083   Pave  <NA>      Reg
## 5  974         20       FV          95   11639   Pave  <NA>      Reg
## 6  978        120       FV          35    4274   Pave  Pave      IR1
## 7 1244         20       RL         107   13891   Pave  <NA>      Reg
## 8 1279         60       RL          75    9473   Pave  <NA>      Reg
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      Gilbert       Norm
## 2         Lvl    AllPub   CulDSac       Gtl      Crawfor       Norm
## 3         Lvl    AllPub    Inside       Gtl      Somerst       Norm
## 4         Lvl    AllPub    Inside       Gtl      SawyerW       Norm
## 5         Lvl    AllPub    Corner       Gtl      Somerst       Norm
## 6         Lvl    AllPub    Inside       Gtl      Somerst       Norm
## 7         Lvl    AllPub    Inside       Gtl      NridgHt       Norm
## 8         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           6           5      2002
## 2       Norm     1Fam     1Story           6           3      1957
## 3       Norm     1Fam     2Story           7           6      2007
## 4       Norm     1Fam     1Story           7           5      2003
## 5       Norm     1Fam     1Story           7           5      2007
## 6       Norm   TwnhsE     1Story           7           5      2006
## 7       Norm     1Fam     1Story          10           5      2006
## 8       Norm     1Fam     2Story           8           5      2002
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2002     Gable  CompShg     VinylSd     VinylSd       <NA>
## 2         1975       Hip  CompShg     Wd Sdng       Stone       <NA>
## 3         2007     Gable  CompShg     CemntBd     CmentBd       <NA>
## 4         2003     Gable  CompShg     VinylSd     VinylSd       <NA>
## 5         2008     Gable  CompShg     CemntBd     CmentBd       <NA>
## 6         2007     Gable  CompShg     VinylSd     VinylSd       <NA>
## 7         2006     Gable  CompShg     VinylSd     VinylSd       <NA>
## 8         2002     Gable  CompShg     VinylSd     VinylSd       <NA>
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1         NA        Gd        TA      PConc       Gd       TA           No
## 2         NA        Gd        TA      PConc       TA       TA           No
## 3         NA        Gd        TA      PConc       Gd       TA           No
## 4         NA        Gd        TA      PConc       Gd       TA           No
## 5         NA        Gd        TA      PConc       Gd       TA           No
## 6         NA        Gd        TA      PConc       Gd       TA           No
## 7         NA        Ex        TA      PConc       Ex       Gd           Gd
## 8         NA        Gd        TA      PConc       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        625          Unf          0       235         860
## 2          Rec       1219          Unf          0       816        2035
## 3          Unf          0          Unf          0       813         813
## 4          GLQ        833          Unf          0       343        1176
## 5          Unf          0          Unf          0      1428        1428
## 6          GLQ       1106          Unf          0       135        1241
## 7          GLQ       1386          Unf          0       690        2076
## 8          GLQ        804          Unf          0       324        1128
##   Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr      860     1100            0
## 2    GasA        TA          Y      SBrkr     2515        0            0
## 3    GasA        Ex          Y      SBrkr      822      843            0
## 4    GasA        Ex          Y      SBrkr     1200        0            0
## 5    GasA        Ex          Y      SBrkr     1428        0            0
## 6    GasA        Ex          Y      SBrkr     1241        0            0
## 7    GasA        Ex          Y      SBrkr     2076        0            0
## 8    GasA        Ex          Y      SBrkr     1128      903            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1960            1            0        2        1            4
## 2      2515            1            0        3        0            4
## 3      1665            0            0        2        1            3
## 4      1200            1            0        2        0            2
## 5      1428            0            0        2        0            3
## 6      1241            1            0        1        1            1
## 7      2076            1            0        2        1            2
## 8      2031            1            0        2        1            3
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          2          TA
## 2            2          TA            9       Maj1          2          TA
## 3            1          Gd            7        Typ          0        <NA>
## 4            1          Gd            5        Typ          0        <NA>
## 5            1          Gd            6        Typ          0        <NA>
## 6            1          Gd            4        Typ          0        <NA>
## 7            1          Ex            7        Typ          1          Gd
## 8            1          Gd            7        Typ          1          Gd
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1    BuiltIn        2002          Fin          2        440         TA
## 2     Attchd        1975          RFn          2        484         TA
## 3     Attchd        2007          RFn          2        562         TA
## 4     Attchd        2003          RFn          2        555         TA
## 5     Attchd        2007          Fin          2        480         TA
## 6     Attchd        2007          Fin          2        569         TA
## 7     Attchd        2006          Fin          3        850         TA
## 8     Attchd        2002          RFn          2        577         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 1         TA          Y        288          48             0         0
## 2         TA          Y          0           0           200         0
## 3         TA          Y          0           0             0         0
## 4         TA          Y          0          41             0         0
## 5         TA          Y          0         120             0         0
## 6         TA          Y          0         116             0         0
## 7         TA          Y        216         229             0         0
## 8         TA          Y          0         211             0         0
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      5   2010
## 2           0        0   <NA>  <NA>        <NA>       0      3   2007
## 3           0        0   <NA>  <NA>        <NA>       0      5   2008
## 4           0        0   <NA>  <NA>        <NA>       0      8   2009
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA>  <NA>        <NA>       0     11   2007
## 7           0        0   <NA>  <NA>        <NA>       0      9   2006
## 8           0        0   <NA>  <NA>        <NA>       0      3   2008
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    216500
## 2       WD        Alloca    200624
## 3       WD        Normal    205950
## 4       WD        Normal    184900
## 5      New       Partial    182000
## 6      New       Partial    199900
## 7      New       Partial    465000
## 8       WD        Normal    237000
boston_raw%>%filter(is.na(MasVnrType))%>%select(Id)==boston_raw%>%filter(is.na(MasVnrArea))%>%select(Id)
##        Id
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
## [7,] TRUE
## [8,] TRUE

The house that has a missing value in MasVnrType also has a missing value in MasVnrArea. source(‘~/Dropbox/Utility/miss_pct_df.R’, echo=TRUE)source(‘~/Dropbox/Utility/miss_pct_df.R’, echo=TRUE) I would consider two solutions: 1) use mice package to do the imputation 2) Assign missing values in MasVnrType to None and MasVnrArea to 0.

  1. Look at the missing values at GarageYrBlt.
boston_raw%>%filter(is.na(GarageYrBlt))%>%head()
##    Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  40         90       RL          65    6040   Pave  <NA>      Reg
## 2  49        190       RM          33    4456   Pave  <NA>      Reg
## 3  79         90       RL          72   10778   Pave  <NA>      Reg
## 4  89         50  C (all)         105    8470   Pave  <NA>      IR1
## 5  90         20       RL          60    8070   Pave  <NA>      Reg
## 6 100         20       RL          77    9320   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      Edwards       Norm
## 2         Lvl    AllPub    Inside       Gtl      OldTown       Norm
## 3         Lvl    AllPub    Inside       Gtl       Sawyer       Norm
## 4         Lvl    AllPub    Corner       Gtl       IDOTRR      Feedr
## 5         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 6         Lvl    AllPub    Inside       Gtl        NAmes       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm   Duplex     1Story           4           5      1955
## 2       Norm   2fmCon     2Story           4           5      1920
## 3       Norm   Duplex     1Story           4           5      1968
## 4      Feedr     1Fam     1.5Fin           3           2      1915
## 5       Norm     1Fam     1Story           4           5      1994
## 6       Norm     1Fam     1Story           4           5      1959
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         1955     Gable  CompShg     AsbShng     Plywood       None
## 2         2008     Gable  CompShg     MetalSd     MetalSd       None
## 3         1968       Hip  CompShg     HdBoard     HdBoard       None
## 4         1982       Hip  CompShg     Plywood     Plywood       None
## 5         1995     Gable  CompShg     VinylSd     VinylSd       None
## 6         1959     Gable  CompShg     Plywood     Plywood       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1          0        TA        TA      PConc     <NA>     <NA>         <NA>
## 2          0        TA        TA     BrkTil       TA       TA           No
## 3          0        TA        TA     CBlock       TA       TA           No
## 4          0        Fa        Fa     CBlock       TA       Fa           No
## 5          0        TA        TA      PConc       Gd       TA           No
## 6          0        TA        TA     CBlock       TA       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1         <NA>          0         <NA>          0         0           0
## 2          Unf          0          Unf          0       736         736
## 3          Unf          0          Unf          0      1768        1768
## 4          Unf          0          Unf          0      1013        1013
## 5          GLQ        588          Unf          0       402         990
## 6          ALQ        569          Unf          0       381         950
##   Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## 1    GasA        TA          N      FuseP     1152        0            0
## 2    GasA        Gd          Y      SBrkr      736      716            0
## 3    GasA        TA          N      SBrkr     1768        0            0
## 4    GasA        TA          N      SBrkr     1013        0          513
## 5    GasA        Ex          Y      SBrkr      990        0            0
## 6    GasA        Fa          Y      SBrkr     1225        0            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1152            0            0        2        0            2
## 2      1452            0            0        2        0            2
## 3      1768            0            0        2        0            4
## 4      1526            0            0        1        0            2
## 5       990            1            0        1        0            3
## 6      1225            1            0        1        1            3
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            2          Fa            6        Typ          0        <NA>
## 2            3          TA            8        Typ          0        <NA>
## 3            2          TA            8        Typ          0        <NA>
## 4            1          Fa            6        Typ          0        <NA>
## 5            1          TA            5        Typ          0        <NA>
## 6            1          TA            6        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1       <NA>          NA         <NA>          0          0       <NA>
## 2       <NA>          NA         <NA>          0          0       <NA>
## 3       <NA>          NA         <NA>          0          0       <NA>
## 4       <NA>          NA         <NA>          0          0       <NA>
## 5       <NA>          NA         <NA>          0          0       <NA>
## 6       <NA>          NA         <NA>          0          0       <NA>
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 1       <NA>          N          0           0             0         0
## 2       <NA>          N          0           0           102         0
## 3       <NA>          Y          0           0             0         0
## 4       <NA>          N          0           0           156         0
## 5       <NA>          Y          0           0             0         0
## 6       <NA>          Y        352           0             0         0
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      6   2008
## 2           0        0   <NA>  <NA>        <NA>       0      6   2009
## 3           0        0   <NA>  <NA>        <NA>       0      4   2010
## 4           0        0   <NA> MnPrv        <NA>       0     10   2009
## 5           0        0   <NA>  <NA>        <NA>       0      8   2007
## 6           0        0   <NA>  <NA>        Shed     400      1   2010
##   SaleType SaleCondition SalePrice
## 1       WD       AdjLand     82000
## 2      New       Partial    113000
## 3       WD        Normal    136500
## 4    ConLD       Abnorml     85000
## 5       WD        Normal    123600
## 6       WD        Normal    128950

It seems that other garage features are also missing when GarageYrBlt is missing.

Let us check.

boston_raw%>%filter(is.na(GarageYrBlt))%>%select(Id)==boston_raw%>%filter(is.na(GarageQual))%>%select(Id)
##         Id
##  [1,] TRUE
##  [2,] TRUE
##  [3,] TRUE
##  [4,] TRUE
##  [5,] TRUE
##  [6,] TRUE
##  [7,] TRUE
##  [8,] TRUE
##  [9,] TRUE
## [10,] TRUE
## [11,] TRUE
## [12,] TRUE
## [13,] TRUE
## [14,] TRUE
## [15,] TRUE
## [16,] TRUE
## [17,] TRUE
## [18,] TRUE
## [19,] TRUE
## [20,] TRUE
## [21,] TRUE
## [22,] TRUE
## [23,] TRUE
## [24,] TRUE
## [25,] TRUE
## [26,] TRUE
## [27,] TRUE
## [28,] TRUE
## [29,] TRUE
## [30,] TRUE
## [31,] TRUE
## [32,] TRUE
## [33,] TRUE
## [34,] TRUE
## [35,] TRUE
## [36,] TRUE
## [37,] TRUE
## [38,] TRUE
## [39,] TRUE
## [40,] TRUE
## [41,] TRUE
## [42,] TRUE
## [43,] TRUE
## [44,] TRUE
## [45,] TRUE
## [46,] TRUE
## [47,] TRUE
## [48,] TRUE
## [49,] TRUE
## [50,] TRUE
## [51,] TRUE
## [52,] TRUE
## [53,] TRUE
## [54,] TRUE
## [55,] TRUE
## [56,] TRUE
## [57,] TRUE
## [58,] TRUE
## [59,] TRUE
## [60,] TRUE
## [61,] TRUE
## [62,] TRUE
## [63,] TRUE
## [64,] TRUE
## [65,] TRUE
## [66,] TRUE
## [67,] TRUE
## [68,] TRUE
## [69,] TRUE
## [70,] TRUE
## [71,] TRUE
## [72,] TRUE
## [73,] TRUE
## [74,] TRUE
## [75,] TRUE
## [76,] TRUE
## [77,] TRUE
## [78,] TRUE
## [79,] TRUE
## [80,] TRUE
## [81,] TRUE

Let us look at the relationship between GarageYrBlt and SalePrice. Plus, we will color the points with different OverallQual.

library(ggplot2)
boston_raw%>%ggplot(aes(x=GarageYrBlt,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.6)
## Warning: Removed 81 rows containing missing values (geom_point).

There are no perfect solutions to this NA value.

There are three solutions available: 1) assign it a median value 2) assign it a zero value 3) assign it a value using CART imputation.

  1. Last but not the least LotFrontage:the Linear feet of street connected to property.
boston_raw%>%filter(is.na(LotFrontage))%>%head()
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  8         60       RL          NA   10382   Pave  <NA>      IR1
## 2 13         20       RL          NA   12968   Pave  <NA>      IR2
## 3 15         20       RL          NA   10920   Pave  <NA>      IR1
## 4 17         20       RL          NA   11241   Pave  <NA>      IR1
## 5 25         20       RL          NA    8246   Pave  <NA>      IR1
## 6 32         20       RL          NA    8544   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Corner       Gtl       NWAmes       PosN
## 2         Lvl    AllPub    Inside       Gtl       Sawyer       Norm
## 3         Lvl    AllPub    Corner       Gtl        NAmes       Norm
## 4         Lvl    AllPub   CulDSac       Gtl        NAmes       Norm
## 5         Lvl    AllPub    Inside       Gtl       Sawyer       Norm
## 6         Lvl    AllPub   CulDSac       Gtl       Sawyer       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           6      1973
## 2       Norm     1Fam     1Story           5           6      1962
## 3       Norm     1Fam     1Story           6           5      1960
## 4       Norm     1Fam     1Story           6           7      1970
## 5       Norm     1Fam     1Story           5           8      1968
## 6       Norm     1Fam     1Story           5           6      1966
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         1973     Gable  CompShg     HdBoard     HdBoard      Stone
## 2         1962       Hip  CompShg     HdBoard     Plywood       None
## 3         1960       Hip  CompShg     MetalSd     MetalSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Sdng    BrkFace
## 5         2001     Gable  CompShg     Plywood     Plywood       None
## 6         2006     Gable  CompShg     HdBoard     HdBoard       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        240        TA        TA     CBlock       Gd       TA           Mn
## 2          0        TA        TA     CBlock       TA       TA           No
## 3        212        TA        TA     CBlock       TA       TA           No
## 4        180        TA        TA     CBlock       TA       TA           No
## 5          0        TA        Gd     CBlock       TA       TA           Mn
## 6          0        TA        TA     CBlock       TA       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          ALQ        859          BLQ         32       216        1107
## 2          ALQ        737          Unf          0       175         912
## 3          BLQ        733          Unf          0       520        1253
## 4          ALQ        578          Unf          0       426        1004
## 5          Rec        188          ALQ        668       204        1060
## 6          Unf          0          Unf          0      1228        1228
##   Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr     1107      983            0
## 2    GasA        TA          Y      SBrkr      912        0            0
## 3    GasA        TA          Y      SBrkr     1253        0            0
## 4    GasA        Ex          Y      SBrkr     1004        0            0
## 5    GasA        Ex          Y      SBrkr     1060        0            0
## 6    GasA        Gd          Y      SBrkr     1228        0            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      2090            1            0        2        1            3
## 2       912            1            0        1        0            2
## 3      1253            1            0        1        1            2
## 4      1004            1            0        1        0            2
## 5      1060            1            0        1        0            3
## 6      1228            0            0        1        1            3
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          TA            7        Typ          2          TA
## 2            1          TA            4        Typ          0        <NA>
## 3            1          TA            5        Typ          1          Fa
## 4            1          TA            5        Typ          1          TA
## 5            1          Gd            6        Typ          1          TA
## 6            1          Gd            6        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        1973          RFn          2        484         TA
## 2     Detchd        1962          Unf          1        352         TA
## 3     Attchd        1960          RFn          1        352         TA
## 4     Attchd        1970          Fin          2        480         TA
## 5     Attchd        1968          Unf          1        270         TA
## 6     Attchd        1966          Unf          1        271         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 1         TA          Y        235         204           228         0
## 2         TA          Y        140           0             0         0
## 3         TA          Y          0         213           176         0
## 4         TA          Y          0           0             0         0
## 5         TA          Y        406          90             0         0
## 6         TA          Y          0          65             0         0
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        Shed     350     11   2009
## 2         176        0   <NA>  <NA>        <NA>       0      9   2008
## 3           0        0   <NA>  GdWo        <NA>       0      5   2008
## 4           0        0   <NA>  <NA>        Shed     700      3   2010
## 5           0        0   <NA> MnPrv        <NA>       0      5   2010
## 6           0        0   <NA> MnPrv        <NA>       0      6   2008
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    200000
## 2       WD        Normal    144000
## 3       WD        Normal    157000
## 4       WD        Normal    149000
## 5       WD        Normal    154000
## 6       WD        Normal    149350

As per usual, the two solutions we could compare is: 1) assign 0 to the NA value in LotFrontage 2) use CART to impute the value.

After analyzing the missing values, let us actually do clean the data with various method and compare them with baseline models.

The details of the data cleaning code will be in the next section.

source('~/Dropbox/Boston_house/2_Data_Clean/data_clean.R', echo=TRUE)
## 
## > source("~/Dropbox/Utility/source_self.R", echo = TRUE)
## 
## > source("~/Dropbox/Utility/char_to_factor.R", echo = TRUE)
## 
## > char_to_factor = function(data) {
## +     library(magrittr)
## +     data = as.data.frame(data)
## +     sapply(data, class) %>% table() %>% print
## +     cha .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/correlation.R", echo = TRUE)
## 
## > correlation = function(data) {
## +     select_num = function(data) {
## +         library(magrittr)
## +         data = as.data.frame(data)
## +         int_in .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/miss_name.R", echo = TRUE)
## 
## > miss_name = function(data) {
## +     miss_pct = function(data) {
## +         pct = sum(is.na(data))/length(data)
## +     }
## +     missing = data %>% sapply .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/miss_pct_plot.R", echo = TRUE)
## 
## > miss_pct_plot = function(data) {
## +     miss_pct = function(data) {
## +         pct = sum(is.na(data))/length(data)
## +     }
## +     missing = boston_raw  .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/na_to_fac.R", echo = TRUE)
## 
## > na_to_fac = function(data, names) {
## +     library(magrittr)
## +     library(dplyr)
## +     data_na = data %>% select(names)
## +     data_nna = data %>% se .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/select_comp.R", echo = TRUE)
## 
## > select_comp = function(data, p = 0.5) {
## +     miss_pct = function(data) {
## +         pct = sum(is.na(data))/length(data)
## +     }
## +     missing = sapp .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/select_fac.R", echo = TRUE)
## 
## > select_fac = function(data) {
## +     library(magrittr)
## +     data = as.data.frame(data)
## +     data %>% sapply(class) %>% table() %>% print
## +     fac_ .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/select_num.R", echo = TRUE)
## 
## > select_num = function(data) {
## +     library(magrittr)
## +     data = as.data.frame(data)
## +     data %>% sapply(class) %>% table() %>% print
## +     int_ .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/timeRecord_functions.R", 
## +     echo = TRUE)
## 
## > timeRecordA = function() {
## +     file_name = paste(format(Sys.time(), "%F_%T"), ".log", sep = "")
## +     file.create(file_name)
## +     extremely_long_ .... [TRUNCATED] 
## 
## > timeRecordB = function() {
## +     if (file.exists(extremely_long_name_i_do_not_think_anyone_would_be_sanely_to_use)) {
## +         file_name = extremel .... [TRUNCATED] 
## 
## > timeRecordR = function(output = "job", unit = "s") {
## +     if (file.exists(extremely_long_name_i_do_not_think_anyone_would_be_sanely_to_use)) {
## +    .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/miss_pct_df.R", echo = TRUE)
## 
## > miss_pct_df = function(data) {
## +     miss_pct = function(data) {
## +         pct = sum(is.na(data))/length(data)
## +     }
## +     missing = data %>% sapp .... [TRUNCATED] 
## 
## > source("~/Dropbox/Utility/int_to_num.R", echo = TRUE)
## 
## > int_to_num = function(data) {
## +     data = as.data.frame(data)
## +     library(magrittr)
## +     data = as.data.frame(data)
## +     sapply(data, class) %> .... [TRUNCATED] 
## 
## > library(dplyr)
## 
## > library(magrittr)
## 
## > boston_raw = data.table::fread("train.csv")
## 
## > timeRecordA()
## [1] "Fun time log has been created"
## 
## > data_clean1 = function(data) {
## +     library(mice)
## +     data1 = data %>% select_comp(p = 0.5) %>% char_to_factor()
## +     data1_impute = data1 %>% s .... [TRUNCATED] 
## 
## > boston_clean1 = boston_raw %>% data_clean1()
## 
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
## 
##     complete
## .
## character   integer 
##        39        38 
## .
##  factor integer 
##      39      38 
## 
##  iter imp variable
##   1   1  LotFrontage  MasVnrArea  GarageYrBlt  MasVnrType  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinType2  Electrical  FireplaceQu  GarageType  GarageFinish  GarageQual  GarageCond
##   2   1  LotFrontage  MasVnrArea  GarageYrBlt  MasVnrType  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinType2  Electrical  FireplaceQu  GarageType  GarageFinish  GarageQual  GarageCond
##   3   1  LotFrontage  MasVnrArea  GarageYrBlt  MasVnrType  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinType2  Electrical  FireplaceQu  GarageType  GarageFinish  GarageQual  GarageCond
##   4   1  LotFrontage  MasVnrArea  GarageYrBlt  MasVnrType  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinType2  Electrical  FireplaceQu  GarageType  GarageFinish  GarageQual  GarageCond
##   5   1  LotFrontage  MasVnrArea  GarageYrBlt  MasVnrType  BsmtQual  BsmtCond  BsmtExposure  BsmtFinType1  BsmtFinType2  Electrical  FireplaceQu  GarageType  GarageFinish  GarageQual  GarageCond
## 
## > timeRecordB()
## 
## > data_clean2 = function(data) {
## +     features = c("Alley", "BsmtQual", "BsmtCond", "BsmtExposure", 
## +         "BsmtFinType1", "BsmtFinType2", "Firep ..." ... [TRUNCATED] 
## 
## > boston_clean2 = boston_raw %>% data_clean2()
## [1] "There are 2 features with missing values."
## .
## character   integer 
##        43        38 
## .
##  factor integer 
##      43      38 
## 
## > timeRecordB()
## 
## > data_clean3 = function(data, method) {
## +     library(mice)
## +     features = c("Alley", "BsmtQual", "BsmtCond", "BsmtExposure", 
## +         "BsmtFinTy ..." ... [TRUNCATED] 
## 
## > boston_clean3 = boston_raw %>% data_clean3(method = "rf")
## 
##  iter imp variable
##   1   1  LotFrontage  MasVnrArea  GarageYrBlt
##   2   1  LotFrontage  MasVnrArea  GarageYrBlt
##   3   1  LotFrontage  MasVnrArea  GarageYrBlt
##   4   1  LotFrontage  MasVnrArea  GarageYrBlt
##   5   1  LotFrontage  MasVnrArea  GarageYrBlt
## .
## character   integer 
##        43        38 
## .
##  factor integer 
##      43      38 
## [1] "There are 0 features with missing values."
## 
## > timeRecordB()
## 
## > boston_clean4 = boston_raw %>% data_clean3(method = "cart")
## 
##  iter imp variable
##   1   1  LotFrontage  MasVnrArea  GarageYrBlt
##   2   1  LotFrontage  MasVnrArea  GarageYrBlt
##   3   1  LotFrontage  MasVnrArea  GarageYrBlt
##   4   1  LotFrontage  MasVnrArea  GarageYrBlt
##   5   1  LotFrontage  MasVnrArea  GarageYrBlt
## .
## character   integer 
##        43        38 
## .
##  factor integer 
##      43      38 
## [1] "There are 0 features with missing values."
## 
## > timeRecordB()
## 
## > boston_clean5 = boston_raw %>% data_clean3(method = "sample")
## 
##  iter imp variable
##   1   1  LotFrontage  MasVnrArea  GarageYrBlt
##   2   1  LotFrontage  MasVnrArea  GarageYrBlt
##   3   1  LotFrontage  MasVnrArea  GarageYrBlt
##   4   1  LotFrontage  MasVnrArea  GarageYrBlt
##   5   1  LotFrontage  MasVnrArea  GarageYrBlt
## .
## character   integer 
##        43        38 
## .
##  factor integer 
##      43      38 
## [1] "There are 0 features with missing values."
## 
## > timeRecordB()
## 
## > data_clean5 = function(data, method) {
## +     library(caret)
## +     features = c("Alley", "BsmtQual", "BsmtCond", "BsmtExposure", 
## +         "BsmtFinT ..." ... [TRUNCATED] 
## 
## > boston_clean6 = boston_raw %>% data_clean5(method = "knnImpute")
## .
## character   integer 
##        43        37 
## .
##  factor integer 
##      43      37 
## 
## > timeRecordB()
## 
## > boston_clean7 = boston_raw %>% data_clean5(method = "medianImpute")
## .
## character   integer 
##        43        37 
## .
##  factor integer 
##      43      37 
## 
## > timeRecordB()
## 
## > timeRecordR()
##   user.self sys.self elapsed user.child sys.child run_time
## 1    12.780    0.116  12.866          0         0    0.000
## 2    81.532    0.844  82.456          0         0   69.590
## 3    81.588    0.844  82.512          0         0    0.056
## 4    84.288    0.852  85.221          0         0    2.709
## 5    85.960    0.852  86.897          0         0    1.676
## 6    86.216    0.852  87.150          0         0    0.253
## 7    86.832    0.880  87.795          0         0    0.645
## 8    87.184    0.880  88.146          0         0    0.351
## 
## > boston_raw = boston_raw %>% char_to_factor()
## .
## character   integer 
##        43        38 
## .
##  factor integer 
##      43      38 
## 
## > miss_pct_df(boston_raw)
## [1] "There are 19 features with missing values."
##        miss_pct        names
## 1  0.9952054795       PoolQC
## 2  0.9630136986  MiscFeature
## 3  0.9376712329        Alley
## 4  0.8075342466        Fence
## 5  0.4726027397  FireplaceQu
## 6  0.1773972603  LotFrontage
## 7  0.0554794521  GarageYrBlt
## 8  0.0554794521   GarageType
## 9  0.0554794521 GarageFinish
## 10 0.0554794521   GarageQual
## 11 0.0554794521   GarageCond
## 12 0.0260273973 BsmtExposure
## 13 0.0260273973 BsmtFinType2
## 14 0.0253424658     BsmtQual
## 15 0.0253424658     BsmtCond
## 16 0.0253424658 BsmtFinType1
## 17 0.0054794521   MasVnrArea
## 18 0.0054794521   MasVnrType
## 19 0.0006849315   Electrical
## 
## > miss_pct_df(boston_clean1)
## [1] "There are 0 features with missing values."
## [1] miss_pct names   
## <0 rows> (or 0-length row.names)
## 
## > miss_pct_df(boston_clean2)
## [1] "There are 0 features with missing values."
## [1] miss_pct names   
## <0 rows> (or 0-length row.names)
## 
## > miss_pct_df(boston_clean3)
## [1] "There are 0 features with missing values."
## [1] miss_pct names   
## <0 rows> (or 0-length row.names)
## 
## > miss_pct_df(boston_clean4)
## [1] "There are 0 features with missing values."
## [1] miss_pct names   
## <0 rows> (or 0-length row.names)
## 
## > miss_pct_df(boston_clean5)
## [1] "There are 0 features with missing values."
## [1] miss_pct names   
## <0 rows> (or 0-length row.names)
## 
## > miss_pct_df(boston_clean6)
## [1] "There are 0 features with missing values."
## [1] miss_pct names   
## <0 rows> (or 0-length row.names)
## 
## > miss_pct_df(boston_clean7)
## [1] "There are 0 features with missing values."
## [1] miss_pct names   
## <0 rows> (or 0-length row.names)

The last few lines show that all dataset are imputed with no more missing values in them.

Next, let us look at the skewness of each feature. The skewness of a feature could affect the predictions of some machine learning algorithms, if the features have a really big value for skewness, then we need to do a Box-Cox transformation.

# First define a function to select integer or numeric value in features. 
select_num=function(data){
  library(magrittr)
  data=as.data.frame(data)
  
  data%>%sapply(class)%>%table()%>%print
  int_index=sapply(data,class)== "integer"
  int_index=int_index%>%as.logical()
  num_index=sapply(data,class)== "numeric"
  num_index=num_index%>%as.logical()
  # data[,!int_index]%>%sapply(class)%>%table
  
  data1=data[,int_index]
  data2=data[,num_index]
  data=cbind(data1,data2)
  data%>%sapply(class)%>%table%>%print
  return(data)
}
library(e1071)
library(dplyr)
# Select only integer or numeric data types for calculating skewness. 
boston_clean1%>%select_num()%>%sapply(skewness)
## .
##  factor integer 
##      39      38 
## .
## integer 
##      38
##            Id    MSSubClass   LotFrontage       LotArea   OverallQual 
##    0.00000000    1.40476562    2.01792140   12.18261502    0.21649836 
##   OverallCond     YearBuilt  YearRemodAdd    MasVnrArea    BsmtFinSF1 
##    0.69164401   -0.61220121   -0.50252776    2.66751655    1.68204129 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF      1stFlrSF      2ndFlrSF 
##    4.24652141    0.91837835    1.52112395    1.37392896    0.81135997 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##    8.99283329    1.36375364    0.59484237    4.09497490    0.03648647 
##      HalfBath  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces 
##    0.67450925    0.21135511    4.47917826    0.67495173    0.64823107 
##   GarageYrBlt    GarageCars    GarageArea    WoodDeckSF   OpenPorchSF 
##   -0.57900250   -0.34184538    0.17961125    1.53820999    2.35948572 
## EnclosedPorch     3SsnPorch   ScreenPorch      PoolArea       MiscVal 
##    3.08352575   10.28317840    4.11374731   14.79791829   24.42652237 
##        MoSold        YrSold     SalePrice 
##    0.21161746    0.09607079    1.87900860
boston_clean2%>%select_num()%>%sapply(skewness)
## .
##  factor integer 
##      43      38 
## .
## integer 
##      38
##            Id    MSSubClass   LotFrontage       LotArea   OverallQual 
##    0.00000000    1.40476562    0.26727232   12.18261502    0.21649836 
##   OverallCond     YearBuilt  YearRemodAdd    MasVnrArea    BsmtFinSF1 
##    0.69164401   -0.61220121   -0.50252776    2.67211701    1.68204129 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF      1stFlrSF      2ndFlrSF 
##    4.24652141    0.91837835    1.52112395    1.37392896    0.81135997 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##    8.99283329    1.36375364    0.59484237    4.09497490    0.03648647 
##      HalfBath  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces 
##    0.67450925    0.21135511    4.47917826    0.67495173    0.64823107 
##   GarageYrBlt    GarageCars    GarageArea    WoodDeckSF   OpenPorchSF 
##   -3.86155342   -0.34184538    0.17961125    1.53820999    2.35948572 
## EnclosedPorch     3SsnPorch   ScreenPorch      PoolArea       MiscVal 
##    3.08352575   10.28317840    4.11374731   14.79791829   24.42652237 
##        MoSold        YrSold     SalePrice 
##    0.21161746    0.09607079    1.87900860

It is obvious that some features have very high skewness. Thus, a Box-Cox transformation would seem necessary for better performance for some of the models. Is it really true? It is for you to try it out.