This is the first part of the Machine learning workflow with Amnes Housing price data. In this part, we will explore the dataset and possible ways to deal with the missing data.
In the next part, we will
This is needed libraries.
# library(Amelia)
# library(data.table)
# library(ggplot2)
# library(dplyr)
# library(caret)
# library(magrittr)
# library(tidyverse)
# I did not load them all here to avoid conflict within different packages.
Load the training and test data.
boston_raw=data.table::fread("train.csv")
boston_test=data.table::fread("test.csv")
Look at the data for the first few lines.
head(boston_raw)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1: 1 60 RL 65 8450 Pave NA Reg
## 2: 2 20 RL 80 9600 Pave NA Reg
## 3: 3 60 RL 68 11250 Pave NA IR1
## 4: 4 70 RL 60 9550 Pave NA IR1
## 5: 5 60 RL 84 14260 Pave NA IR1
## 6: 6 50 RL 85 14115 Pave NA IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1: Lvl AllPub Inside Gtl CollgCr Norm
## 2: Lvl AllPub FR2 Gtl Veenker Feedr
## 3: Lvl AllPub Inside Gtl CollgCr Norm
## 4: Lvl AllPub Corner Gtl Crawfor Norm
## 5: Lvl AllPub FR2 Gtl NoRidge Norm
## 6: Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1: Norm 1Fam 2Story 7 5 2003
## 2: Norm 1Fam 1Story 6 8 1976
## 3: Norm 1Fam 2Story 7 5 2001
## 4: Norm 1Fam 2Story 7 5 1915
## 5: Norm 1Fam 2Story 8 5 2000
## 6: Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1: 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2: 1976 Gable CompShg MetalSd MetalSd None
## 3: 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4: 1970 Gable CompShg Wd Sdng Wd Shng None
## 5: 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6: 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
## 1: 196 Gd TA PConc Gd TA
## 2: 0 TA TA CBlock Gd TA
## 3: 162 Gd TA PConc Gd TA
## 4: 0 TA TA BrkTil TA Gd
## 5: 350 Gd TA PConc Gd TA
## 6: 0 TA TA Wood Gd TA
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF
## 1: No GLQ 706 Unf 0 150
## 2: Gd ALQ 978 Unf 0 284
## 3: Mn GLQ 486 Unf 0 434
## 4: No ALQ 216 Unf 0 540
## 5: Av GLQ 655 Unf 0 490
## 6: No GLQ 732 Unf 0 64
## TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF
## 1: 856 GasA Ex Y SBrkr 856 854
## 2: 1262 GasA Ex Y SBrkr 1262 0
## 3: 920 GasA Ex Y SBrkr 920 866
## 4: 756 GasA Gd Y SBrkr 961 756
## 5: 1145 GasA Ex Y SBrkr 1145 1053
## 6: 796 GasA Ex Y SBrkr 796 566
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 1: 0 1710 1 0 2 1
## 2: 0 1262 0 1 2 0
## 3: 0 1786 1 0 2 1
## 4: 0 1717 1 0 1 0
## 5: 0 2198 1 0 2 1
## 6: 0 1362 1 0 1 1
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1: 3 1 Gd 8 Typ
## 2: 3 1 TA 6 Typ
## 3: 3 1 Gd 6 Typ
## 4: 3 1 Gd 7 Typ
## 5: 4 1 Gd 9 Typ
## 6: 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1: 0 NA Attchd 2003 RFn 2
## 2: 1 TA Attchd 1976 RFn 2
## 3: 1 TA Attchd 2001 RFn 2
## 4: 1 Gd Detchd 1998 Unf 3
## 5: 1 TA Attchd 2000 RFn 3
## 6: 0 NA Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1: 548 TA TA Y 0 61
## 2: 460 TA TA Y 298 0
## 3: 608 TA TA Y 0 42
## 4: 642 TA TA Y 0 35
## 5: 836 TA TA Y 192 84
## 6: 480 TA TA Y 40 30
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1: 0 0 0 0 NA NA NA
## 2: 0 0 0 0 NA NA NA
## 3: 0 0 0 0 NA NA NA
## 4: 272 0 0 0 NA NA NA
## 5: 0 0 0 0 NA NA NA
## 6: 0 320 0 0 NA MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1: 0 2 2008 WD Normal 208500
## 2: 0 5 2007 WD Normal 181500
## 3: 0 9 2008 WD Normal 223500
## 4: 0 2 2006 WD Abnorml 140000
## 5: 0 12 2008 WD Normal 250000
## 6: 700 10 2009 WD Normal 143000
Let us look at the dimensions of the data
dim(boston_raw)
## [1] 1460 81
Do a summary of different variables.
summary(boston_raw)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## LandSlope Neighborhood Condition1
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Condition2 BldgType HouseStyle OverallQual
## Length:1460 Length:1460 Length:1460 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 5.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 6.099
## 3rd Qu.: 7.000
## Max. :10.000
##
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Length:1460
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Class :character
## Median :5.000 Median :1973 Median :1994 Mode :character
## Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :9.000 Max. :2010 Max. :2010
##
## RoofMatl Exterior1st Exterior2nd
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## MasVnrType MasVnrArea ExterQual ExterCond
## Length:1460 Min. : 0.0 Length:1460 Length:1460
## Class :character 1st Qu.: 0.0 Class :character Class :character
## Mode :character Median : 0.0 Mode :character Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical 1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 Length:1460
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 Class :character
## Median : 0.00 Median : 6.000 Median :2008 Mode :character
## Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :15500.00 Max. :12.000 Max. :2010
##
## SaleCondition SalePrice
## Length:1460 Min. : 34900
## Class :character 1st Qu.:129975
## Mode :character Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Further glimpse on the type of each dataset
dplyr::glimpse(boston_raw)
## Observations: 1,460
## Variables: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", ...
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ...
## $ Alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg",...
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl",...
## $ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub"...
## $ LotConfig <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Ins...
## $ LandSlope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl",...
## $ Neighborhood <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoR...
## $ Condition1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm",...
## $ Condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ...
## $ BldgType <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ...
## $ HouseStyle <chr> "2Story", "1Story", "2Story", "2Story", "2Story"...
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Ga...
## $ RoofMatl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "Com...
## $ Exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "Vin...
## $ Exterior2nd <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "Vin...
## $ MasVnrType <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace",...
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", ...
## $ ExterCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ Foundation <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "...
## $ BsmtQual <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", ...
## $ BsmtCond <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", ...
## $ BsmtExposure <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", ...
## $ BsmtFinType1 <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ",...
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2 <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf",...
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ...
## $ HeatingQC <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", ...
## $ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ Electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SB...
## $ 1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ 2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", ...
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ",...
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA"...
## $ GarageType <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd"...
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn",...
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ GarageCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", ...
## $ PavedDrive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ 3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA,...
## $ MiscFeature <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, ...
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", ...
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal...
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
Let us look at the percentage of missing values
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(boston_raw,main="Training Data - Missings Map", col=c("yellow", "black"), legend=FALSE)
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
It seems that most data all complete. Let us take a closer look at the percentile of the variable with most missing values
miss_pct_plot=function(data){
library(dplyr)
library(magrittr)
library(ggplot2)
# Define a function to calculate percentage of missing value in each feature
miss_pct=function(data){
pct=sum(is.na(data))/length(data)
}
# apply the function define above to every columns of the data set
missing=data%>%sapply(miss_pct)%>%as.data.frame()
colnames(missing)="miss_pct"
missing=data.frame(missing_pct=missing,names=row.names(missing)%>%as.character())
# Select only the features with missing values and their missing value percentage
missing=missing%>%filter(miss_pct>0)%>%arrange(desc(miss_pct))
paste("There are ", dim(missing)[1] ," features with missing values.",sep="")%>%print()
# Plot the missing percentage
plot=missing%>%ggplot(aes(x=reorder(names,-miss_pct),y=miss_pct))+
geom_bar(stat = "identity",fill="blue")+
theme(axis.text.x = element_text(angle = 60,hjust=1))+
labs(x="Features with missing value", y="Percentage of missing value")
return(plot)
}
miss_pct_plot(boston_raw)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## [1] "There are 19 features with missing values."
There are features with missing values.
Get rid of variables with more than 75% missing value.
# Define a function to filter out variables with missing value more than p, which has a default value of 50%.
## The function name is short for complete data with percentile p
select_comp=function(data,p=0.5){
# Define a function to calculate percentage of missing value in each feature
miss_pct=function(data){
pct=sum(is.na(data))/length(data)
}
# apply the function define above to every columns of the data set
missing=sapply(data,miss_pct)%>%as.data.frame()
names=row.names(missing)
colnames(missing)="miss_pct"
missing=data.frame(missing_pct=missing,names=row.names(missing)%>%as.character())
# Select only the features with missing values and their missing value percentage
missing=missing%>%filter(miss_pct>p)%>%arrange(desc(miss_pct))
names=missing$names%>%as.character()
data=data%>%select(-one_of(names) )
return(data)
}
boston_clean1=boston_raw%>%select_comp(p=0.75)
Change the character value to factor.
char_to_factor=function(data){
data=as.data.frame(data)
library(tidyverse)
sapply(data,class)%>%table()%>%print
char_index=sapply(data,class)=="character"
char_index=char_index%>%as.logical()
char_features=data[,char_index]
char_features=char_features%>%sapply(as.factor)
char_features=as.data.frame(char_features)
data[,!char_index]%>%sapply(class)%>%table
data=cbind(data[,!char_index],char_features)
data%>%sapply(class)%>%table%>%print
return(data)
}
boston_clean1=boston_clean1%>%char_to_factor()
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
## .
## character integer
## 39 38
## .
## factor integer
## 39 38
Define a function to select integer and numeric features in the dataset.
select_num=function(data){
library(magrittr)
data=as.data.frame(data)
data%>%sapply(class)%>%table()%>%print
int_index=sapply(data,class)== "integer"
int_index=int_index%>%as.logical()
num_index=sapply(data,class)== "numeric"
num_index=num_index%>%as.logical()
# data[,!int_index]%>%sapply(class)%>%table
data1=data[,int_index]
data2=data[,num_index]
data=cbind(data1,data2)
data%>%sapply(class)%>%table%>%print
return(data)
}
# Find all factor variables
select_fac=function(data){
library(magrittr)
data=as.data.frame(data)
data%>%sapply(class)%>%table()%>%print
fac_index=sapply(data,class)== "factor"
fac_index=fac_index%>%as.logical()
# data[,!int_index]%>%sapply(class)%>%table
data=data[,fac_index]
data%>%sapply(class)%>%table%>%print
return(data)
}
Let us look at the correlation between the variables.
source('~/Dropbox/Utility/correlation.R', echo=TRUE)
##
## > correlation = function(data) {
## + select_num = function(data) {
## + library(magrittr)
## + data = as.data.frame(data)
## + int_in .... [TRUNCATED]
library(corrplot)
correlation(boston_raw)%>%corrplot(method="pie",type="lower",
order="hclust", # First principle order
diag=FALSE, # No diagonal values
tl.cex=0.65, # size of text label
tl.col="red"# color of the text label
)
## .
## character integer
## 43 38
## .
## factor integer
## 43 38
## .
## character integer
## 43 38
## .
## factor integer
## 43 38
## .
## factor integer
## 27 35
## .
## factor
## 27
This gives a basic sense of how continuous and factor variables are related to the price. If there are too many features available, then it is best to apply dimension reduction techniques likePCA (Principle component analysis) first.
Next, we want to look at features that have the highest correlation with target variable, SalePrice.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr)
correlate=function(){
# Convert character to factor, select variables with no missing value by setting p=0, select only numerical variables
boston_raw1=boston_raw%>%select_comp(p=0)%>%select_num()
# Convert character to factor, select variables with no missing value by setting p=0, select only factor variables, and convert it into integer values
boston_raw2=boston_raw%>%char_to_factor()%>%select_comp(p=0)%>%select_fac()%>%sapply(as.integer)
# combine the numerical and factor variables
boston_raw=cbind(boston_raw1,boston_raw2)
# calculate correlations
correlation=cor(boston_raw$SalePrice,boston_raw%>%select(-SalePrice))
#print("contains only numeric or interger for correlation calculation")
# Return the correlation data
names=correlation%>%colnames()
result=data.frame(features=names,correlation=correlation%>%as.vector())%>%arrange(desc(abs(correlation)))
return(result)
}
correlate()
## .
## character integer
## 27 35
## .
## integer
## 35
## .
## character integer
## 43 38
## .
## factor integer
## 43 38
## .
## factor integer
## 27 35
## .
## factor
## 27
## features correlation
## 1 OverallQual 0.790981601
## 2 GrLivArea 0.708624478
## 3 GarageCars 0.640409197
## 4 ExterQual -0.636883694
## 5 GarageArea 0.623431439
## 6 TotalBsmtSF 0.613580552
## 7 1stFlrSF 0.605852185
## 8 KitchenQual -0.589188778
## 9 FullBath 0.560663763
## 10 TotRmsAbvGrd 0.533723156
## 11 YearBuilt 0.522897333
## 12 YearRemodAdd 0.507100967
## 13 Fireplaces 0.466928837
## 14 HeatingQC -0.400177543
## 15 BsmtFinSF1 0.386419806
## 16 Foundation 0.382478991
## 17 WoodDeckSF 0.324413445
## 18 2ndFlrSF 0.319333803
## 19 OpenPorchSF 0.315856227
## 20 HalfBath 0.284107676
## 21 LotArea 0.263843354
## 22 LotShape -0.255579870
## 23 CentralAir 0.251328164
## 24 PavedDrive 0.231356952
## 25 BsmtFullBath 0.227122233
## 26 RoofStyle 0.222405292
## 27 BsmtUnfSF 0.214479106
## 28 SaleCondition 0.213092030
## 29 HouseStyle 0.180162623
## 30 Neighborhood 0.170941316
## 31 BedroomAbvGr 0.168213154
## 32 MSZoning -0.166872203
## 33 KitchenAbvGr -0.135907371
## 34 RoofMatl 0.132383112
## 35 EnclosedPorch -0.128577958
## 36 ExterCond 0.117302657
## 37 Functional 0.115327917
## 38 ScreenPorch 0.111446571
## 39 Exterior2nd 0.103765537
## 40 Exterior1st 0.103551011
## 41 Heating -0.098812076
## 42 PoolArea 0.092403549
## 43 Condition1 0.091154912
## 44 BldgType -0.085590608
## 45 MSSubClass -0.084284135
## 46 OverallCond -0.077855894
## 47 LotConfig -0.067396023
## 48 LandSlope 0.051152248
## 49 SaleType -0.050369536
## 50 MoSold 0.046432245
## 51 3SsnPorch 0.044583665
## 52 Street 0.041035536
## 53 YrSold -0.028922585
## 54 LowQualFinSF -0.025606130
## 55 Id -0.021916719
## 56 MiscVal -0.021189580
## 57 BsmtHalfBath -0.016844154
## 58 LandContour 0.015453242
## 59 Utilities -0.014314296
## 60 BsmtFinSF2 -0.011378121
## 61 Condition2 0.007512734
Let us look at the relationship between OverallQual and SalePrice.
library(ggplot2)
boston_raw$OverallQual=as.factor(boston_raw$OverallQual)
boston_raw%>%ggplot(aes(x=OverallQual,y=log1p(SalePrice)))+geom_violin(aes(fill=OverallQual))
Let us look at the relationship between GrLivArea and SalePrice. Plus, we will color the points with different OverallQual.
library(ggplot2)
boston_raw$GrLivArea=as.numeric(boston_raw$GrLivArea)
boston_raw%>%ggplot(aes(x=GrLivArea,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.5)
Let us look at the relationship between GarageCars and SalePrice.
library(ggplot2)
boston_raw$GarageCars=as.factor(boston_raw$GarageCars)
boston_raw%>%ggplot(aes(x=GarageCars,y=log1p(SalePrice)))+geom_violin(aes(fill=GarageCars))
Let us look at the relationship between ExterQual and SalePrice.
library(ggplot2)
boston_raw%>%ggplot(aes(x=ExterQual,y=SalePrice))+geom_violin(aes(fill=ExterQual))
Let us look at the general relationship between those top influential features.
boston_raw%>%ggplot()+geom_point(aes(y=log1p(SalePrice),x=GrLivArea,color=OverallQual))+facet_wrap(ExterQual~GarageCars)
Let us look at the relationship between GarageArea and SalePrice.
library(ggplot2)
boston_raw$GarageArea=as.numeric(boston_raw$GarageArea)
boston_raw%>%ggplot(aes(x=GarageArea,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.5)
library(ggplot2)
boston_raw$TotalBsmtSF=as.numeric(boston_raw$TotalBsmtSF)
boston_raw%>%ggplot(aes(x=TotalBsmtSF,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.3)
library(ggplot2)
boston_raw%>%ggplot(aes(x=FullBath,y=log1p(SalePrice)))+geom_violin(aes(fill=ExterQual))
## Warning: position_dodge requires non-overlapping x intervals
library(ggplot2)
boston_raw%>%ggplot(aes(x=TotRmsAbvGrd,y=log1p(SalePrice)))+geom_violin(aes(fill=ExterQual))
## Warning: position_dodge requires non-overlapping x intervals
library(ggplot2)
boston_raw$YearBuilt=as.numeric(boston_raw$YearBuilt)
boston_raw%>%ggplot(aes(x=YearBuilt,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.3)
A further look into the dataset shows that there are some zero variance or near zero variance features. Zero variance features would definitely needs to be removed from the dataset, since it contains no information for predicting the target variable.
But whether we should remove near zero variance features is a more complicated question. We could either remove the near zero variance features or use PCA (Principle component analysis) to extract some potential useful signal from the dataset.
Let us find the zero and near-zero variance features.
library(dplyr)
library(caret)
nzv=function(data){
names=nearZeroVar(data,saveMetrics = TRUE)%>%row.names()
result=cbind(features=names,nearZeroVar(boston_raw,saveMetrics = TRUE) )%>%arrange(percentUnique)
return(result)
}
boston_raw%>%nzv()
## features freqRatio percentUnique zeroVar nzv
## 1 Street 242.333333 0.1369863 FALSE TRUE
## 2 Alley 1.219512 0.1369863 FALSE FALSE
## 3 Utilities 1459.000000 0.1369863 FALSE TRUE
## 4 CentralAir 14.368421 0.1369863 FALSE FALSE
## 5 LandSlope 21.261538 0.2054795 FALSE TRUE
## 6 BsmtHalfBath 17.225000 0.2054795 FALSE FALSE
## 7 HalfBath 1.706542 0.2054795 FALSE FALSE
## 8 GarageFinish 1.433649 0.2054795 FALSE FALSE
## 9 PavedDrive 14.888889 0.2054795 FALSE FALSE
## 10 PoolQC 1.500000 0.2054795 FALSE FALSE
## 11 LotShape 1.911157 0.2739726 FALSE FALSE
## 12 LandContour 20.809524 0.2739726 FALSE TRUE
## 13 MasVnrType 1.941573 0.2739726 FALSE FALSE
## 14 ExterQual 1.856557 0.2739726 FALSE FALSE
## 15 BsmtQual 1.050162 0.2739726 FALSE FALSE
## 16 BsmtCond 20.169231 0.2739726 FALSE TRUE
## 17 BsmtExposure 4.312217 0.2739726 FALSE FALSE
## 18 BsmtFullBath 1.455782 0.2739726 FALSE FALSE
## 19 FullBath 1.181538 0.2739726 FALSE FALSE
## 20 KitchenAbvGr 21.415385 0.2739726 FALSE TRUE
## 21 KitchenQual 1.254266 0.2739726 FALSE FALSE
## 22 Fireplaces 1.061538 0.2739726 FALSE FALSE
## 23 Fence 2.661017 0.2739726 FALSE FALSE
## 24 MiscFeature 24.500000 0.2739726 FALSE TRUE
## 25 MSZoning 5.279817 0.3424658 FALSE FALSE
## 26 LotConfig 4.000000 0.3424658 FALSE FALSE
## 27 BldgType 10.701754 0.3424658 FALSE FALSE
## 28 ExterCond 8.780822 0.3424658 FALSE FALSE
## 29 HeatingQC 1.731308 0.3424658 FALSE FALSE
## 30 Electrical 14.191489 0.3424658 FALSE FALSE
## 31 FireplaceQu 1.214058 0.3424658 FALSE FALSE
## 32 GarageCars 2.233062 0.3424658 FALSE FALSE
## 33 GarageQual 27.312500 0.3424658 FALSE TRUE
## 34 GarageCond 37.885714 0.3424658 FALSE TRUE
## 35 YrSold 1.027356 0.3424658 FALSE FALSE
## 36 RoofStyle 3.989510 0.4109589 FALSE FALSE
## 37 Foundation 1.020505 0.4109589 FALSE FALSE
## 38 BsmtFinType1 1.028708 0.4109589 FALSE FALSE
## 39 BsmtFinType2 23.259259 0.4109589 FALSE TRUE
## 40 Heating 79.333333 0.4109589 FALSE TRUE
## 41 GarageType 2.248062 0.4109589 FALSE FALSE
## 42 SaleCondition 9.584000 0.4109589 FALSE FALSE
## 43 Functional 40.000000 0.4794521 FALSE TRUE
## 44 Condition2 240.833333 0.5479452 FALSE TRUE
## 45 HouseStyle 1.631461 0.5479452 FALSE FALSE
## 46 RoofMatl 130.363636 0.5479452 FALSE TRUE
## 47 BedroomAbvGr 2.245810 0.5479452 FALSE FALSE
## 48 PoolArea 1453.000000 0.5479452 FALSE TRUE
## 49 Condition1 15.555556 0.6164384 FALSE FALSE
## 50 OverallCond 3.257937 0.6164384 FALSE FALSE
## 51 SaleType 10.385246 0.6164384 FALSE FALSE
## 52 OverallQual 1.061497 0.6849315 FALSE FALSE
## 53 TotRmsAbvGrd 1.221884 0.8219178 FALSE FALSE
## 54 MoSold 1.081197 0.8219178 FALSE FALSE
## 55 MSSubClass 1.792642 1.0273973 FALSE FALSE
## 56 Exterior1st 2.319820 1.0273973 FALSE FALSE
## 57 Exterior2nd 2.355140 1.0958904 FALSE FALSE
## 58 3SsnPorch 478.666667 1.3698630 FALSE TRUE
## 59 MiscVal 128.000000 1.4383562 FALSE TRUE
## 60 LowQualFinSF 478.000000 1.6438356 FALSE TRUE
## 61 Neighborhood 1.500000 1.7123288 FALSE FALSE
## 62 YearRemodAdd 1.835052 4.1780822 FALSE FALSE
## 63 ScreenPorch 224.000000 5.2054795 FALSE TRUE
## 64 GarageYrBlt 1.101695 6.6438356 FALSE FALSE
## 65 LotFrontage 2.042857 7.5342466 FALSE FALSE
## 66 YearBuilt 1.046875 7.6712329 FALSE FALSE
## 67 EnclosedPorch 83.466667 8.2191781 FALSE TRUE
## 68 BsmtFinSF2 258.600000 9.8630137 FALSE TRUE
## 69 OpenPorchSF 22.620690 13.8356164 FALSE FALSE
## 70 WoodDeckSF 20.026316 18.7671233 FALSE FALSE
## 71 MasVnrArea 107.625000 22.3972603 FALSE FALSE
## 72 2ndFlrSF 82.900000 28.5616438 FALSE FALSE
## 73 GarageArea 1.653061 30.2054795 FALSE FALSE
## 74 BsmtFinSF1 38.916667 43.6301370 FALSE FALSE
## 75 SalePrice 1.176471 45.4109589 FALSE FALSE
## 76 TotalBsmtSF 1.057143 49.3835616 FALSE FALSE
## 77 1stFlrSF 1.562500 51.5753425 FALSE FALSE
## 78 BsmtUnfSF 13.111111 53.4246575 FALSE FALSE
## 79 GrLivArea 1.571429 58.9726027 FALSE FALSE
## 80 LotArea 1.041667 73.4931507 FALSE FALSE
## 81 Id 1.000000 100.0000000 FALSE FALSE
Let us look at how does Alley features relate with SalePrice, the target variable.
library(ggplot2)
boston_raw%>%ggplot()+geom_violin(aes(x=Alley,y=log1p(SalePrice),fill=Alley))
The next step is to understand the missing value and impute it.
There are basically two types of missing value: 1. Systematic missing value 2. Missing value at random boston_clean1%>%select_num()%>%sapply(skewness)%>%sort()
Let us first look at the features that have missing values.
miss_name=function(data){
# Define a function to calculate percentage of missing value in each feature
miss_pct=function(data){
pct=sum(is.na(data))/length(data)
}
# apply the function define above to every columns of the data set
missing=data%>%sapply(miss_pct)%>%as.data.frame()
colnames(missing)="miss_pct"
missing=data.frame(names=row.names(missing)%>%as.character(),missing_pct=missing)
# Select only the features with missing values and their missing value percentage
missing=missing%>%filter(miss_pct>0)%>%arrange(desc(miss_pct))
return(missing)
}
miss_name(boston_raw)
## names miss_pct
## 1 PoolQC 0.9952054795
## 2 MiscFeature 0.9630136986
## 3 Alley 0.9376712329
## 4 Fence 0.8075342466
## 5 FireplaceQu 0.4726027397
## 6 LotFrontage 0.1773972603
## 7 GarageType 0.0554794521
## 8 GarageYrBlt 0.0554794521
## 9 GarageFinish 0.0554794521
## 10 GarageQual 0.0554794521
## 11 GarageCond 0.0554794521
## 12 BsmtExposure 0.0260273973
## 13 BsmtFinType2 0.0260273973
## 14 BsmtQual 0.0253424658
## 15 BsmtCond 0.0253424658
## 16 BsmtFinType1 0.0253424658
## 17 MasVnrType 0.0054794521
## 18 MasVnrArea 0.0054794521
## 19 Electrical 0.0006849315
Let us look at the data dictionary to find out what NA value acutally means something else.
Alley: NA No alley access BsmtQual: NA No Basement BsmtCond: NA No Basement BsmtExposure:NA No Basement BsmtFinType1:NA No Basement BsmtFinType2: NA No Basement
FireplaceQu: NA No Fireplace
GarageType: NA No Garage GarageFinish:NA No Garage GarageQual:NA No Garage GarageCond:NA No Garage
PoolQC: NA No Poolcaret
Fence:NA No FenceGarageYrBlt MiscFeature: NA None
It seems that in all these features, NA does not mean missing value, it simply means there is no facility. Thus, we need to convert NA in these features into a factor value like “None”
## Change NA values of the features into "None", a new factor value
na_to_fac=function(data,names){
# libraries (change it into installing it if not installed later)
library(magrittr)
library(dplyr)
# Split the data into data with NA value and data with no NA value
data_na=data%>%select(names)
data_nna=data%>%select(-one_of(names))
data_na[is.na(data_na)]="None"
data=cbind(data_na,data_nna)
return(data)
}
boston_clean2=boston_raw%>%na_to_fac(names=c("Alley","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PoolQC","Fence","MiscFeature"))%>%char_to_factor()
## .
## character factor integer numeric
## 43 2 32 4
## .
## factor integer numeric
## 45 32 4
miss_pct_plot(boston_clean2)
## [1] "There are 5 features with missing values."
# boston_clean2%>%sapply(class)
Then, after this cleaning we are left with 5 features with missing values.
Let us look futher into those variables.
LotFrontage is the Linear feet of street connected to property.
GarageYrBlt is the Year garage was built.
MasVnrArea is the Masonry veneer area in square feet. MasVnrType is the Masonry veneer type
Electrical is Electrical system with various factors.
boston_raw%>%filter(is.na(Electrical))
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1380 80 RL 73 9735 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl Timber Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam SLvl 5 5 2006
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2007 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 0 TA TA PConc Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 Unf 0 Unf 0 384 384
## Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## 1 GasA Gd Y <NA> 754 640 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1394 0 0 2 1 3
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 7 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 BuiltIn 2007 Fin 2 400 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 1 TA Y 100 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 5 2008
## SaleType SaleCondition SalePrice
## 1 WD Normal 167500
There is only one missing value. We could definitely impute it with some methods.
2.Look at the missing value at MasVnrArea and Masonry veneer type.
boston_raw%>%filter(is.na(MasVnrArea))
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 235 60 RL NA 7851 Pave <NA> Reg
## 2 530 20 RL NA 32668 Pave <NA> IR1
## 3 651 60 FV 65 8125 Pave <NA> Reg
## 4 937 20 RL 67 10083 Pave <NA> Reg
## 5 974 20 FV 95 11639 Pave <NA> Reg
## 6 978 120 FV 35 4274 Pave Pave IR1
## 7 1244 20 RL 107 13891 Pave <NA> Reg
## 8 1279 60 RL 75 9473 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl Gilbert Norm
## 2 Lvl AllPub CulDSac Gtl Crawfor Norm
## 3 Lvl AllPub Inside Gtl Somerst Norm
## 4 Lvl AllPub Inside Gtl SawyerW Norm
## 5 Lvl AllPub Corner Gtl Somerst Norm
## 6 Lvl AllPub Inside Gtl Somerst Norm
## 7 Lvl AllPub Inside Gtl NridgHt Norm
## 8 Lvl AllPub Inside Gtl CollgCr Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 6 5 2002
## 2 Norm 1Fam 1Story 6 3 1957
## 3 Norm 1Fam 2Story 7 6 2007
## 4 Norm 1Fam 1Story 7 5 2003
## 5 Norm 1Fam 1Story 7 5 2007
## 6 Norm TwnhsE 1Story 7 5 2006
## 7 Norm 1Fam 1Story 10 5 2006
## 8 Norm 1Fam 2Story 8 5 2002
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2002 Gable CompShg VinylSd VinylSd <NA>
## 2 1975 Hip CompShg Wd Sdng Stone <NA>
## 3 2007 Gable CompShg CemntBd CmentBd <NA>
## 4 2003 Gable CompShg VinylSd VinylSd <NA>
## 5 2008 Gable CompShg CemntBd CmentBd <NA>
## 6 2007 Gable CompShg VinylSd VinylSd <NA>
## 7 2006 Gable CompShg VinylSd VinylSd <NA>
## 8 2002 Gable CompShg VinylSd VinylSd <NA>
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 NA Gd TA PConc Gd TA No
## 2 NA Gd TA PConc TA TA No
## 3 NA Gd TA PConc Gd TA No
## 4 NA Gd TA PConc Gd TA No
## 5 NA Gd TA PConc Gd TA No
## 6 NA Gd TA PConc Gd TA No
## 7 NA Ex TA PConc Ex Gd Gd
## 8 NA Gd TA PConc Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 625 Unf 0 235 860
## 2 Rec 1219 Unf 0 816 2035
## 3 Unf 0 Unf 0 813 813
## 4 GLQ 833 Unf 0 343 1176
## 5 Unf 0 Unf 0 1428 1428
## 6 GLQ 1106 Unf 0 135 1241
## 7 GLQ 1386 Unf 0 690 2076
## 8 GLQ 804 Unf 0 324 1128
## Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 860 1100 0
## 2 GasA TA Y SBrkr 2515 0 0
## 3 GasA Ex Y SBrkr 822 843 0
## 4 GasA Ex Y SBrkr 1200 0 0
## 5 GasA Ex Y SBrkr 1428 0 0
## 6 GasA Ex Y SBrkr 1241 0 0
## 7 GasA Ex Y SBrkr 2076 0 0
## 8 GasA Ex Y SBrkr 1128 903 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1960 1 0 2 1 4
## 2 2515 1 0 3 0 4
## 3 1665 0 0 2 1 3
## 4 1200 1 0 2 0 2
## 5 1428 0 0 2 0 3
## 6 1241 1 0 1 1 1
## 7 2076 1 0 2 1 2
## 8 2031 1 0 2 1 3
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 2 TA
## 2 2 TA 9 Maj1 2 TA
## 3 1 Gd 7 Typ 0 <NA>
## 4 1 Gd 5 Typ 0 <NA>
## 5 1 Gd 6 Typ 0 <NA>
## 6 1 Gd 4 Typ 0 <NA>
## 7 1 Ex 7 Typ 1 Gd
## 8 1 Gd 7 Typ 1 Gd
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 BuiltIn 2002 Fin 2 440 TA
## 2 Attchd 1975 RFn 2 484 TA
## 3 Attchd 2007 RFn 2 562 TA
## 4 Attchd 2003 RFn 2 555 TA
## 5 Attchd 2007 Fin 2 480 TA
## 6 Attchd 2007 Fin 2 569 TA
## 7 Attchd 2006 Fin 3 850 TA
## 8 Attchd 2002 RFn 2 577 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 1 TA Y 288 48 0 0
## 2 TA Y 0 0 200 0
## 3 TA Y 0 0 0 0
## 4 TA Y 0 41 0 0
## 5 TA Y 0 120 0 0
## 6 TA Y 0 116 0 0
## 7 TA Y 216 229 0 0
## 8 TA Y 0 211 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 5 2010
## 2 0 0 <NA> <NA> <NA> 0 3 2007
## 3 0 0 <NA> <NA> <NA> 0 5 2008
## 4 0 0 <NA> <NA> <NA> 0 8 2009
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> <NA> <NA> 0 11 2007
## 7 0 0 <NA> <NA> <NA> 0 9 2006
## 8 0 0 <NA> <NA> <NA> 0 3 2008
## SaleType SaleCondition SalePrice
## 1 WD Normal 216500
## 2 WD Alloca 200624
## 3 WD Normal 205950
## 4 WD Normal 184900
## 5 New Partial 182000
## 6 New Partial 199900
## 7 New Partial 465000
## 8 WD Normal 237000
boston_raw%>%filter(is.na(MasVnrType))%>%select(Id)==boston_raw%>%filter(is.na(MasVnrArea))%>%select(Id)
## Id
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
## [7,] TRUE
## [8,] TRUE
The house that has a missing value in MasVnrType also has a missing value in MasVnrArea. source(‘~/Dropbox/Utility/miss_pct_df.R’, echo=TRUE)source(‘~/Dropbox/Utility/miss_pct_df.R’, echo=TRUE) I would consider two solutions: 1) use mice package to do the imputation 2) Assign missing values in MasVnrType to None and MasVnrArea to 0.
boston_raw%>%filter(is.na(GarageYrBlt))%>%head()
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 40 90 RL 65 6040 Pave <NA> Reg
## 2 49 190 RM 33 4456 Pave <NA> Reg
## 3 79 90 RL 72 10778 Pave <NA> Reg
## 4 89 50 C (all) 105 8470 Pave <NA> IR1
## 5 90 20 RL 60 8070 Pave <NA> Reg
## 6 100 20 RL 77 9320 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl Edwards Norm
## 2 Lvl AllPub Inside Gtl OldTown Norm
## 3 Lvl AllPub Inside Gtl Sawyer Norm
## 4 Lvl AllPub Corner Gtl IDOTRR Feedr
## 5 Lvl AllPub Inside Gtl CollgCr Norm
## 6 Lvl AllPub Inside Gtl NAmes Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm Duplex 1Story 4 5 1955
## 2 Norm 2fmCon 2Story 4 5 1920
## 3 Norm Duplex 1Story 4 5 1968
## 4 Feedr 1Fam 1.5Fin 3 2 1915
## 5 Norm 1Fam 1Story 4 5 1994
## 6 Norm 1Fam 1Story 4 5 1959
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 1955 Gable CompShg AsbShng Plywood None
## 2 2008 Gable CompShg MetalSd MetalSd None
## 3 1968 Hip CompShg HdBoard HdBoard None
## 4 1982 Hip CompShg Plywood Plywood None
## 5 1995 Gable CompShg VinylSd VinylSd None
## 6 1959 Gable CompShg Plywood Plywood None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 0 TA TA PConc <NA> <NA> <NA>
## 2 0 TA TA BrkTil TA TA No
## 3 0 TA TA CBlock TA TA No
## 4 0 Fa Fa CBlock TA Fa No
## 5 0 TA TA PConc Gd TA No
## 6 0 TA TA CBlock TA TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 <NA> 0 <NA> 0 0 0
## 2 Unf 0 Unf 0 736 736
## 3 Unf 0 Unf 0 1768 1768
## 4 Unf 0 Unf 0 1013 1013
## 5 GLQ 588 Unf 0 402 990
## 6 ALQ 569 Unf 0 381 950
## Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## 1 GasA TA N FuseP 1152 0 0
## 2 GasA Gd Y SBrkr 736 716 0
## 3 GasA TA N SBrkr 1768 0 0
## 4 GasA TA N SBrkr 1013 0 513
## 5 GasA Ex Y SBrkr 990 0 0
## 6 GasA Fa Y SBrkr 1225 0 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1152 0 0 2 0 2
## 2 1452 0 0 2 0 2
## 3 1768 0 0 2 0 4
## 4 1526 0 0 1 0 2
## 5 990 1 0 1 0 3
## 6 1225 1 0 1 1 3
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 2 Fa 6 Typ 0 <NA>
## 2 3 TA 8 Typ 0 <NA>
## 3 2 TA 8 Typ 0 <NA>
## 4 1 Fa 6 Typ 0 <NA>
## 5 1 TA 5 Typ 0 <NA>
## 6 1 TA 6 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 <NA> NA <NA> 0 0 <NA>
## 2 <NA> NA <NA> 0 0 <NA>
## 3 <NA> NA <NA> 0 0 <NA>
## 4 <NA> NA <NA> 0 0 <NA>
## 5 <NA> NA <NA> 0 0 <NA>
## 6 <NA> NA <NA> 0 0 <NA>
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 1 <NA> N 0 0 0 0
## 2 <NA> N 0 0 102 0
## 3 <NA> Y 0 0 0 0
## 4 <NA> N 0 0 156 0
## 5 <NA> Y 0 0 0 0
## 6 <NA> Y 352 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 6 2008
## 2 0 0 <NA> <NA> <NA> 0 6 2009
## 3 0 0 <NA> <NA> <NA> 0 4 2010
## 4 0 0 <NA> MnPrv <NA> 0 10 2009
## 5 0 0 <NA> <NA> <NA> 0 8 2007
## 6 0 0 <NA> <NA> Shed 400 1 2010
## SaleType SaleCondition SalePrice
## 1 WD AdjLand 82000
## 2 New Partial 113000
## 3 WD Normal 136500
## 4 ConLD Abnorml 85000
## 5 WD Normal 123600
## 6 WD Normal 128950
It seems that other garage features are also missing when GarageYrBlt is missing.
Let us check.
boston_raw%>%filter(is.na(GarageYrBlt))%>%select(Id)==boston_raw%>%filter(is.na(GarageQual))%>%select(Id)
## Id
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
## [7,] TRUE
## [8,] TRUE
## [9,] TRUE
## [10,] TRUE
## [11,] TRUE
## [12,] TRUE
## [13,] TRUE
## [14,] TRUE
## [15,] TRUE
## [16,] TRUE
## [17,] TRUE
## [18,] TRUE
## [19,] TRUE
## [20,] TRUE
## [21,] TRUE
## [22,] TRUE
## [23,] TRUE
## [24,] TRUE
## [25,] TRUE
## [26,] TRUE
## [27,] TRUE
## [28,] TRUE
## [29,] TRUE
## [30,] TRUE
## [31,] TRUE
## [32,] TRUE
## [33,] TRUE
## [34,] TRUE
## [35,] TRUE
## [36,] TRUE
## [37,] TRUE
## [38,] TRUE
## [39,] TRUE
## [40,] TRUE
## [41,] TRUE
## [42,] TRUE
## [43,] TRUE
## [44,] TRUE
## [45,] TRUE
## [46,] TRUE
## [47,] TRUE
## [48,] TRUE
## [49,] TRUE
## [50,] TRUE
## [51,] TRUE
## [52,] TRUE
## [53,] TRUE
## [54,] TRUE
## [55,] TRUE
## [56,] TRUE
## [57,] TRUE
## [58,] TRUE
## [59,] TRUE
## [60,] TRUE
## [61,] TRUE
## [62,] TRUE
## [63,] TRUE
## [64,] TRUE
## [65,] TRUE
## [66,] TRUE
## [67,] TRUE
## [68,] TRUE
## [69,] TRUE
## [70,] TRUE
## [71,] TRUE
## [72,] TRUE
## [73,] TRUE
## [74,] TRUE
## [75,] TRUE
## [76,] TRUE
## [77,] TRUE
## [78,] TRUE
## [79,] TRUE
## [80,] TRUE
## [81,] TRUE
Let us look at the relationship between GarageYrBlt and SalePrice. Plus, we will color the points with different OverallQual.
library(ggplot2)
boston_raw%>%ggplot(aes(x=GarageYrBlt,y=log1p(SalePrice)))+geom_point(aes(color=OverallQual),alpha=0.6)
## Warning: Removed 81 rows containing missing values (geom_point).
There are no perfect solutions to this NA value.
There are three solutions available: 1) assign it a median value 2) assign it a zero value 3) assign it a value using CART imputation.
boston_raw%>%filter(is.na(LotFrontage))%>%head()
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 8 60 RL NA 10382 Pave <NA> IR1
## 2 13 20 RL NA 12968 Pave <NA> IR2
## 3 15 20 RL NA 10920 Pave <NA> IR1
## 4 17 20 RL NA 11241 Pave <NA> IR1
## 5 25 20 RL NA 8246 Pave <NA> IR1
## 6 32 20 RL NA 8544 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Corner Gtl NWAmes PosN
## 2 Lvl AllPub Inside Gtl Sawyer Norm
## 3 Lvl AllPub Corner Gtl NAmes Norm
## 4 Lvl AllPub CulDSac Gtl NAmes Norm
## 5 Lvl AllPub Inside Gtl Sawyer Norm
## 6 Lvl AllPub CulDSac Gtl Sawyer Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 6 1973
## 2 Norm 1Fam 1Story 5 6 1962
## 3 Norm 1Fam 1Story 6 5 1960
## 4 Norm 1Fam 1Story 6 7 1970
## 5 Norm 1Fam 1Story 5 8 1968
## 6 Norm 1Fam 1Story 5 6 1966
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 1973 Gable CompShg HdBoard HdBoard Stone
## 2 1962 Hip CompShg HdBoard Plywood None
## 3 1960 Hip CompShg MetalSd MetalSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Sdng BrkFace
## 5 2001 Gable CompShg Plywood Plywood None
## 6 2006 Gable CompShg HdBoard HdBoard None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 240 TA TA CBlock Gd TA Mn
## 2 0 TA TA CBlock TA TA No
## 3 212 TA TA CBlock TA TA No
## 4 180 TA TA CBlock TA TA No
## 5 0 TA Gd CBlock TA TA Mn
## 6 0 TA TA CBlock TA TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 ALQ 859 BLQ 32 216 1107
## 2 ALQ 737 Unf 0 175 912
## 3 BLQ 733 Unf 0 520 1253
## 4 ALQ 578 Unf 0 426 1004
## 5 Rec 188 ALQ 668 204 1060
## 6 Unf 0 Unf 0 1228 1228
## Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 1107 983 0
## 2 GasA TA Y SBrkr 912 0 0
## 3 GasA TA Y SBrkr 1253 0 0
## 4 GasA Ex Y SBrkr 1004 0 0
## 5 GasA Ex Y SBrkr 1060 0 0
## 6 GasA Gd Y SBrkr 1228 0 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 2090 1 0 2 1 3
## 2 912 1 0 1 0 2
## 3 1253 1 0 1 1 2
## 4 1004 1 0 1 0 2
## 5 1060 1 0 1 0 3
## 6 1228 0 0 1 1 3
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 TA 7 Typ 2 TA
## 2 1 TA 4 Typ 0 <NA>
## 3 1 TA 5 Typ 1 Fa
## 4 1 TA 5 Typ 1 TA
## 5 1 Gd 6 Typ 1 TA
## 6 1 Gd 6 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 1973 RFn 2 484 TA
## 2 Detchd 1962 Unf 1 352 TA
## 3 Attchd 1960 RFn 1 352 TA
## 4 Attchd 1970 Fin 2 480 TA
## 5 Attchd 1968 Unf 1 270 TA
## 6 Attchd 1966 Unf 1 271 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 1 TA Y 235 204 228 0
## 2 TA Y 140 0 0 0
## 3 TA Y 0 213 176 0
## 4 TA Y 0 0 0 0
## 5 TA Y 406 90 0 0
## 6 TA Y 0 65 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> Shed 350 11 2009
## 2 176 0 <NA> <NA> <NA> 0 9 2008
## 3 0 0 <NA> GdWo <NA> 0 5 2008
## 4 0 0 <NA> <NA> Shed 700 3 2010
## 5 0 0 <NA> MnPrv <NA> 0 5 2010
## 6 0 0 <NA> MnPrv <NA> 0 6 2008
## SaleType SaleCondition SalePrice
## 1 WD Normal 200000
## 2 WD Normal 144000
## 3 WD Normal 157000
## 4 WD Normal 149000
## 5 WD Normal 154000
## 6 WD Normal 149350
As per usual, the two solutions we could compare is: 1) assign 0 to the NA value in LotFrontage 2) use CART to impute the value.
After analyzing the missing values, let us actually do clean the data with various method and compare them with baseline models.
The details of the data cleaning code will be in the next section.
source('~/Dropbox/Boston_house/2_Data_Clean/data_clean.R', echo=TRUE)
##
## > source("~/Dropbox/Utility/source_self.R", echo = TRUE)
##
## > source("~/Dropbox/Utility/char_to_factor.R", echo = TRUE)
##
## > char_to_factor = function(data) {
## + library(magrittr)
## + data = as.data.frame(data)
## + sapply(data, class) %>% table() %>% print
## + cha .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/correlation.R", echo = TRUE)
##
## > correlation = function(data) {
## + select_num = function(data) {
## + library(magrittr)
## + data = as.data.frame(data)
## + int_in .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/miss_name.R", echo = TRUE)
##
## > miss_name = function(data) {
## + miss_pct = function(data) {
## + pct = sum(is.na(data))/length(data)
## + }
## + missing = data %>% sapply .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/miss_pct_plot.R", echo = TRUE)
##
## > miss_pct_plot = function(data) {
## + miss_pct = function(data) {
## + pct = sum(is.na(data))/length(data)
## + }
## + missing = boston_raw .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/na_to_fac.R", echo = TRUE)
##
## > na_to_fac = function(data, names) {
## + library(magrittr)
## + library(dplyr)
## + data_na = data %>% select(names)
## + data_nna = data %>% se .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/select_comp.R", echo = TRUE)
##
## > select_comp = function(data, p = 0.5) {
## + miss_pct = function(data) {
## + pct = sum(is.na(data))/length(data)
## + }
## + missing = sapp .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/select_fac.R", echo = TRUE)
##
## > select_fac = function(data) {
## + library(magrittr)
## + data = as.data.frame(data)
## + data %>% sapply(class) %>% table() %>% print
## + fac_ .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/select_num.R", echo = TRUE)
##
## > select_num = function(data) {
## + library(magrittr)
## + data = as.data.frame(data)
## + data %>% sapply(class) %>% table() %>% print
## + int_ .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/timeRecord_functions.R",
## + echo = TRUE)
##
## > timeRecordA = function() {
## + file_name = paste(format(Sys.time(), "%F_%T"), ".log", sep = "")
## + file.create(file_name)
## + extremely_long_ .... [TRUNCATED]
##
## > timeRecordB = function() {
## + if (file.exists(extremely_long_name_i_do_not_think_anyone_would_be_sanely_to_use)) {
## + file_name = extremel .... [TRUNCATED]
##
## > timeRecordR = function(output = "job", unit = "s") {
## + if (file.exists(extremely_long_name_i_do_not_think_anyone_would_be_sanely_to_use)) {
## + .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/miss_pct_df.R", echo = TRUE)
##
## > miss_pct_df = function(data) {
## + miss_pct = function(data) {
## + pct = sum(is.na(data))/length(data)
## + }
## + missing = data %>% sapp .... [TRUNCATED]
##
## > source("~/Dropbox/Utility/int_to_num.R", echo = TRUE)
##
## > int_to_num = function(data) {
## + data = as.data.frame(data)
## + library(magrittr)
## + data = as.data.frame(data)
## + sapply(data, class) %> .... [TRUNCATED]
##
## > library(dplyr)
##
## > library(magrittr)
##
## > boston_raw = data.table::fread("train.csv")
##
## > timeRecordA()
## [1] "Fun time log has been created"
##
## > data_clean1 = function(data) {
## + library(mice)
## + data1 = data %>% select_comp(p = 0.5) %>% char_to_factor()
## + data1_impute = data1 %>% s .... [TRUNCATED]
##
## > boston_clean1 = boston_raw %>% data_clean1()
##
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
##
## complete
## .
## character integer
## 39 38
## .
## factor integer
## 39 38
##
## iter imp variable
## 1 1 LotFrontage MasVnrArea GarageYrBlt MasVnrType BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType GarageFinish GarageQual GarageCond
## 2 1 LotFrontage MasVnrArea GarageYrBlt MasVnrType BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType GarageFinish GarageQual GarageCond
## 3 1 LotFrontage MasVnrArea GarageYrBlt MasVnrType BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType GarageFinish GarageQual GarageCond
## 4 1 LotFrontage MasVnrArea GarageYrBlt MasVnrType BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType GarageFinish GarageQual GarageCond
## 5 1 LotFrontage MasVnrArea GarageYrBlt MasVnrType BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType GarageFinish GarageQual GarageCond
##
## > timeRecordB()
##
## > data_clean2 = function(data) {
## + features = c("Alley", "BsmtQual", "BsmtCond", "BsmtExposure",
## + "BsmtFinType1", "BsmtFinType2", "Firep ..." ... [TRUNCATED]
##
## > boston_clean2 = boston_raw %>% data_clean2()
## [1] "There are 2 features with missing values."
## .
## character integer
## 43 38
## .
## factor integer
## 43 38
##
## > timeRecordB()
##
## > data_clean3 = function(data, method) {
## + library(mice)
## + features = c("Alley", "BsmtQual", "BsmtCond", "BsmtExposure",
## + "BsmtFinTy ..." ... [TRUNCATED]
##
## > boston_clean3 = boston_raw %>% data_clean3(method = "rf")
##
## iter imp variable
## 1 1 LotFrontage MasVnrArea GarageYrBlt
## 2 1 LotFrontage MasVnrArea GarageYrBlt
## 3 1 LotFrontage MasVnrArea GarageYrBlt
## 4 1 LotFrontage MasVnrArea GarageYrBlt
## 5 1 LotFrontage MasVnrArea GarageYrBlt
## .
## character integer
## 43 38
## .
## factor integer
## 43 38
## [1] "There are 0 features with missing values."
##
## > timeRecordB()
##
## > boston_clean4 = boston_raw %>% data_clean3(method = "cart")
##
## iter imp variable
## 1 1 LotFrontage MasVnrArea GarageYrBlt
## 2 1 LotFrontage MasVnrArea GarageYrBlt
## 3 1 LotFrontage MasVnrArea GarageYrBlt
## 4 1 LotFrontage MasVnrArea GarageYrBlt
## 5 1 LotFrontage MasVnrArea GarageYrBlt
## .
## character integer
## 43 38
## .
## factor integer
## 43 38
## [1] "There are 0 features with missing values."
##
## > timeRecordB()
##
## > boston_clean5 = boston_raw %>% data_clean3(method = "sample")
##
## iter imp variable
## 1 1 LotFrontage MasVnrArea GarageYrBlt
## 2 1 LotFrontage MasVnrArea GarageYrBlt
## 3 1 LotFrontage MasVnrArea GarageYrBlt
## 4 1 LotFrontage MasVnrArea GarageYrBlt
## 5 1 LotFrontage MasVnrArea GarageYrBlt
## .
## character integer
## 43 38
## .
## factor integer
## 43 38
## [1] "There are 0 features with missing values."
##
## > timeRecordB()
##
## > data_clean5 = function(data, method) {
## + library(caret)
## + features = c("Alley", "BsmtQual", "BsmtCond", "BsmtExposure",
## + "BsmtFinT ..." ... [TRUNCATED]
##
## > boston_clean6 = boston_raw %>% data_clean5(method = "knnImpute")
## .
## character integer
## 43 37
## .
## factor integer
## 43 37
##
## > timeRecordB()
##
## > boston_clean7 = boston_raw %>% data_clean5(method = "medianImpute")
## .
## character integer
## 43 37
## .
## factor integer
## 43 37
##
## > timeRecordB()
##
## > timeRecordR()
## user.self sys.self elapsed user.child sys.child run_time
## 1 12.780 0.116 12.866 0 0 0.000
## 2 81.532 0.844 82.456 0 0 69.590
## 3 81.588 0.844 82.512 0 0 0.056
## 4 84.288 0.852 85.221 0 0 2.709
## 5 85.960 0.852 86.897 0 0 1.676
## 6 86.216 0.852 87.150 0 0 0.253
## 7 86.832 0.880 87.795 0 0 0.645
## 8 87.184 0.880 88.146 0 0 0.351
##
## > boston_raw = boston_raw %>% char_to_factor()
## .
## character integer
## 43 38
## .
## factor integer
## 43 38
##
## > miss_pct_df(boston_raw)
## [1] "There are 19 features with missing values."
## miss_pct names
## 1 0.9952054795 PoolQC
## 2 0.9630136986 MiscFeature
## 3 0.9376712329 Alley
## 4 0.8075342466 Fence
## 5 0.4726027397 FireplaceQu
## 6 0.1773972603 LotFrontage
## 7 0.0554794521 GarageYrBlt
## 8 0.0554794521 GarageType
## 9 0.0554794521 GarageFinish
## 10 0.0554794521 GarageQual
## 11 0.0554794521 GarageCond
## 12 0.0260273973 BsmtExposure
## 13 0.0260273973 BsmtFinType2
## 14 0.0253424658 BsmtQual
## 15 0.0253424658 BsmtCond
## 16 0.0253424658 BsmtFinType1
## 17 0.0054794521 MasVnrArea
## 18 0.0054794521 MasVnrType
## 19 0.0006849315 Electrical
##
## > miss_pct_df(boston_clean1)
## [1] "There are 0 features with missing values."
## [1] miss_pct names
## <0 rows> (or 0-length row.names)
##
## > miss_pct_df(boston_clean2)
## [1] "There are 0 features with missing values."
## [1] miss_pct names
## <0 rows> (or 0-length row.names)
##
## > miss_pct_df(boston_clean3)
## [1] "There are 0 features with missing values."
## [1] miss_pct names
## <0 rows> (or 0-length row.names)
##
## > miss_pct_df(boston_clean4)
## [1] "There are 0 features with missing values."
## [1] miss_pct names
## <0 rows> (or 0-length row.names)
##
## > miss_pct_df(boston_clean5)
## [1] "There are 0 features with missing values."
## [1] miss_pct names
## <0 rows> (or 0-length row.names)
##
## > miss_pct_df(boston_clean6)
## [1] "There are 0 features with missing values."
## [1] miss_pct names
## <0 rows> (or 0-length row.names)
##
## > miss_pct_df(boston_clean7)
## [1] "There are 0 features with missing values."
## [1] miss_pct names
## <0 rows> (or 0-length row.names)
The last few lines show that all dataset are imputed with no more missing values in them.
Next, let us look at the skewness of each feature. The skewness of a feature could affect the predictions of some machine learning algorithms, if the features have a really big value for skewness, then we need to do a Box-Cox transformation.
# First define a function to select integer or numeric value in features.
select_num=function(data){
library(magrittr)
data=as.data.frame(data)
data%>%sapply(class)%>%table()%>%print
int_index=sapply(data,class)== "integer"
int_index=int_index%>%as.logical()
num_index=sapply(data,class)== "numeric"
num_index=num_index%>%as.logical()
# data[,!int_index]%>%sapply(class)%>%table
data1=data[,int_index]
data2=data[,num_index]
data=cbind(data1,data2)
data%>%sapply(class)%>%table%>%print
return(data)
}
library(e1071)
library(dplyr)
# Select only integer or numeric data types for calculating skewness.
boston_clean1%>%select_num()%>%sapply(skewness)
## .
## factor integer
## 39 38
## .
## integer
## 38
## Id MSSubClass LotFrontage LotArea OverallQual
## 0.00000000 1.40476562 2.01792140 12.18261502 0.21649836
## OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## 0.69164401 -0.61220121 -0.50252776 2.66751655 1.68204129
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF
## 4.24652141 0.91837835 1.52112395 1.37392896 0.81135997
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 8.99283329 1.36375364 0.59484237 4.09497490 0.03648647
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## 0.67450925 0.21135511 4.47917826 0.67495173 0.64823107
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
## -0.57900250 -0.34184538 0.17961125 1.53820999 2.35948572
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal
## 3.08352575 10.28317840 4.11374731 14.79791829 24.42652237
## MoSold YrSold SalePrice
## 0.21161746 0.09607079 1.87900860
boston_clean2%>%select_num()%>%sapply(skewness)
## .
## factor integer
## 43 38
## .
## integer
## 38
## Id MSSubClass LotFrontage LotArea OverallQual
## 0.00000000 1.40476562 0.26727232 12.18261502 0.21649836
## OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## 0.69164401 -0.61220121 -0.50252776 2.67211701 1.68204129
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF
## 4.24652141 0.91837835 1.52112395 1.37392896 0.81135997
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 8.99283329 1.36375364 0.59484237 4.09497490 0.03648647
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## 0.67450925 0.21135511 4.47917826 0.67495173 0.64823107
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
## -3.86155342 -0.34184538 0.17961125 1.53820999 2.35948572
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal
## 3.08352575 10.28317840 4.11374731 14.79791829 24.42652237
## MoSold YrSold SalePrice
## 0.21161746 0.09607079 1.87900860
It is obvious that some features have very high skewness. Thus, a Box-Cox transformation would seem necessary for better performance for some of the models. Is it really true? It is for you to try it out.