Problem 2 You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Prepare working environment. Load the most frequently used packages for data analysis.

Data exploration

Load the train (housing characteristics data) and test (to use for prediction submitted to Kaggle.com) files from GitHub.

train <- read.csv("https://raw.githubusercontent.com/johnpannyc/data605-final/master/train.csv")
test<- read.csv("https://raw.githubusercontent.com/johnpannyc/data605-final/master/test.csv")

Use the “glimpse” function to dig deeper into the datasets. We can see that some of the variables are numerical and some are categorical. Categorical variables need to be converted to integers (levels) for further analysis. Also, there are some missing values in the dataset.

glimpse(train)
## Observations: 1,460
## Variables: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning      <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, ...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street        <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ...
## $ Alley         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape      <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg...
## $ LandContour   <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities     <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, ...
## $ LotConfig     <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside...
## $ LandSlope     <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood  <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit...
## $ Condition1    <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN,...
## $ Condition2    <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ...
## $ BldgType      <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, ...
## $ HouseStyle    <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, ...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle     <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable,...
## $ RoofMatl      <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com...
## $ Exterior1st   <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, Vin...
## $ Exterior2nd   <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, Vin...
## $ MasVnrType    <fct> BrkFace, None, BrkFace, None, BrkFace, None, Sto...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual     <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ ExterCond     <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ Foundation    <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc...
## $ BsmtQual      <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, ...
## $ BsmtCond      <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ BsmtExposure  <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, ...
## $ BsmtFinType1  <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2  <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating       <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ...
## $ HeatingQC     <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, ...
## $ CentralAir    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ Electrical    <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual   <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional    <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Ty...
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu   <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, ...
## $ GarageType    <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, ...
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish  <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual    <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, ...
## $ GarageCond    <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ PavedDrive    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC        <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence         <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, N...
## $ MiscFeature   <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, ...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType      <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New,...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
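
The text before the `glimpse(train)` output notes that categorical variables need to be converted to integer levels. A minimal base-R sketch of that conversion is shown here; the vector `zoning` is a made-up toy example that only mimics the `MSZoning` column, not the real data.

```r
# Hypothetical toy column standing in for a categorical variable such as MSZoning
zoning <- factor(c("RL", "RM", "RL", "RH", NA))

# as.integer() on a factor returns the index of each value's level;
# levels are sorted alphabetically by default, and NA stays NA
zoning_code <- as.integer(zoning)

print(levels(zoning))
print(zoning_code)
```

The integer codes simply follow the alphabetical order of the levels, so they carry no real ordering. For regression, leaving such a column as a factor is often the safer choice.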

The test dataset is almost the same as the training dataset, except that it has no SalePrice column. We need to use a model based on the train data to predict the sale price for the test dataset and submit the predictions to Kaggle.

glimpse(test)
## Observations: 1,459
## Variables: 80
## $ Id            <int> 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, ...
## $ MSSubClass    <int> 20, 20, 60, 60, 120, 60, 20, 60, 20, 20, 120, 16...
## $ MSZoning      <fct> RH, RL, RL, RL, RL, RL, RL, RL, RL, RL, RH, RM, ...
## $ LotFrontage   <int> 80, 81, 74, 78, 43, 75, NA, 63, 85, 70, 26, 21, ...
## $ LotArea       <int> 11622, 14267, 13830, 9978, 5005, 10000, 7980, 84...
## $ Street        <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ...
## $ Alley         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape      <fct> Reg, IR1, IR1, IR1, IR1, IR1, IR1, IR1, Reg, Reg...
## $ LandContour   <fct> Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities     <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, ...
## $ LotConfig     <fct> Inside, Corner, Inside, Inside, Inside, Corner, ...
## $ LandSlope     <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood  <fct> NAmes, NAmes, Gilbert, Gilbert, StoneBr, Gilbert...
## $ Condition1    <fct> Feedr, Norm, Norm, Norm, Norm, Norm, Norm, Norm,...
## $ Condition2    <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ...
## $ BldgType      <fct> 1Fam, 1Fam, 1Fam, 1Fam, TwnhsE, 1Fam, 1Fam, 1Fam...
## $ HouseStyle    <fct> 1Story, 1Story, 2Story, 2Story, 1Story, 2Story, ...
## $ OverallQual   <int> 5, 6, 5, 6, 8, 6, 6, 6, 7, 4, 7, 6, 5, 6, 7, 9, ...
## $ OverallCond   <int> 6, 6, 5, 6, 5, 5, 7, 5, 5, 5, 5, 5, 5, 6, 6, 5, ...
## $ YearBuilt     <int> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, ...
## $ YearRemodAdd  <int> 1961, 1958, 1998, 1998, 1992, 1994, 2007, 1998, ...
## $ RoofStyle     <fct> Gable, Hip, Gable, Gable, Gable, Gable, Gable, G...
## $ RoofMatl      <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com...
## $ Exterior1st   <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdB...
## $ Exterior2nd   <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdB...
## $ MasVnrType    <fct> None, BrkFace, None, BrkFace, None, None, None, ...
## $ MasVnrArea    <int> 0, 108, 0, 20, 0, 0, 0, 0, 0, 0, 0, 504, 492, 0,...
## $ ExterQual     <fct> TA, TA, TA, TA, Gd, TA, TA, TA, TA, TA, Gd, TA, ...
## $ ExterCond     <fct> TA, TA, TA, TA, TA, TA, Gd, TA, TA, TA, TA, TA, ...
## $ Foundation    <fct> CBlock, CBlock, PConc, PConc, PConc, PConc, PCon...
## $ BsmtQual      <fct> TA, TA, Gd, TA, Gd, Gd, Gd, Gd, Gd, TA, Gd, TA, ...
## $ BsmtCond      <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ BsmtExposure  <fct> No, No, No, No, No, No, No, No, Gd, No, No, No, ...
## $ BsmtFinType1  <fct> Rec, ALQ, GLQ, GLQ, ALQ, Unf, ALQ, Unf, GLQ, ALQ...
## $ BsmtFinSF1    <int> 468, 923, 791, 602, 263, 0, 935, 0, 637, 804, 10...
## $ BsmtFinType2  <fct> LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Rec...
## $ BsmtFinSF2    <int> 144, 0, 0, 0, 0, 0, 0, 0, 0, 78, 0, 0, 0, 0, 0, ...
## $ BsmtUnfSF     <int> 270, 406, 137, 324, 1017, 763, 233, 789, 663, 0,...
## $ TotalBsmtSF   <int> 882, 1329, 928, 926, 1280, 763, 1168, 789, 1300,...
## $ Heating       <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ...
## $ HeatingQC     <fct> TA, TA, Gd, Ex, Ex, Gd, Ex, Gd, Gd, TA, Ex, TA, ...
## $ CentralAir    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ Electrical    <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,...
## $ X1stFlrSF     <int> 896, 1329, 928, 926, 1280, 763, 1187, 789, 1341,...
## $ X2ndFlrSF     <int> 0, 0, 701, 678, 0, 892, 0, 676, 0, 0, 0, 504, 56...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 896, 1329, 1629, 1604, 1280, 1655, 1187, 1465, 1...
## $ BsmtFullBath  <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, ...
## $ BsmtHalfBath  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath      <int> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, ...
## $ HalfBath      <int> 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, ...
## $ BedroomAbvGr  <int> 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 3, 3, 2, 3, ...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual   <fct> TA, Gd, TA, Gd, Gd, TA, TA, TA, Gd, TA, Gd, TA, ...
## $ TotRmsAbvGrd  <int> 5, 6, 6, 7, 5, 7, 6, 7, 5, 4, 5, 5, 6, 6, 4, 10,...
## $ Functional    <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ...
## $ Fireplaces    <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, ...
## $ FireplaceQu   <fct> NA, NA, TA, Gd, NA, TA, NA, Gd, Po, NA, Fa, NA, ...
## $ GarageType    <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, ...
## $ GarageYrBlt   <int> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, ...
## $ GarageFinish  <fct> Unf, Unf, Fin, Fin, RFn, Fin, Fin, Fin, Unf, Fin...
## $ GarageCars    <int> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 3, ...
## $ GarageArea    <int> 730, 312, 482, 470, 506, 440, 420, 393, 506, 525...
## $ GarageQual    <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ GarageCond    <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ PavedDrive    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ WoodDeckSF    <int> 140, 393, 212, 360, 0, 157, 483, 0, 192, 240, 20...
## $ OpenPorchSF   <int> 0, 36, 34, 36, 82, 84, 21, 75, 0, 0, 68, 0, 0, 0...
## $ EnclosedPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ ScreenPorch   <int> 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC        <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence         <fct> MnPrv, NA, MnPrv, NA, NA, NA, GdPrv, NA, NA, MnP...
## $ MiscFeature   <fct> NA, Gar2, NA, NA, NA, NA, Shed, NA, NA, NA, NA, ...
## $ MiscVal       <int> 0, 12500, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, ...
## $ MoSold        <int> 6, 6, 3, 6, 1, 4, 3, 5, 2, 4, 6, 2, 3, 6, 6, 1, ...
## $ YrSold        <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...
## $ SaleType      <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, COD,...
## $ SaleCondition <fct> Normal, Normal, Normal, Normal, Normal, Normal, ...

Visualize all numerical variables

Select the numerical variables together

ntrain<-select_if(train, is.numeric)

Density plot of variables

ntrain <- as.data.frame((ntrain))

par(mfrow=c(3, 3))
colnames <- dimnames(ntrain)[[2]]

  for(col in 2:ncol(ntrain)) {

    d <- density(na.omit(ntrain[,col]))
   
    plot(d, type="n", main=colnames[col])
    polygon(d, col="light green", border="red")
  }

vis_miss(train)

n_miss(train)
## [1] 6965
prop_miss(train)
## [1] 0.05889565
miss_var_summary(train)
## # A tibble: 81 x 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <dbl>
##  1 PoolQC         1453    99.5 
##  2 MiscFeature    1406    96.3 
##  3 Alley          1369    93.8 
##  4 Fence          1179    80.8 
##  5 FireplaceQu     690    47.3 
##  6 LotFrontage     259    17.7 
##  7 GarageType       81     5.55
##  8 GarageYrBlt      81     5.55
##  9 GarageFinish     81     5.55
## 10 GarageQual       81     5.55
## # ... with 71 more rows
miss_case_summary(train)
## # A tibble: 1,460 x 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1    40     15     18.5
##  2   534     15     18.5
##  3  1012     15     18.5
##  4  1219     15     18.5
##  5   521     14     17.3
##  6   706     14     17.3
##  7  1180     14     17.3
##  8   288     11     13.6
##  9   343     11     13.6
## 10   376     11     13.6
## # ... with 1,450 more rows
gg_miss_var(train)

vis_miss(test)

n_miss(test)
## [1] 7000
prop_miss(test)
## [1] 0.05997258
sapply(train, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
Street 0
Alley 1369
LotShape 0
LandContour 0
Utilities 0
LotConfig 0
LandSlope 0
Neighborhood 0
Condition1 0
Condition2 0
BldgType 0
HouseStyle 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
RoofStyle 0
RoofMatl 0
Exterior1st 0
Exterior2nd 0
MasVnrType 8
MasVnrArea 8
ExterQual 0
ExterCond 0
Foundation 0
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinSF1 0
BsmtFinType2 38
BsmtFinSF2 0
BsmtUnfSF 0
TotalBsmtSF 0
Heating 0
HeatingQC 0
CentralAir 0
Electrical 1
X1stFlrSF 0
X2ndFlrSF 0
LowQualFinSF 0
GrLivArea 0
BsmtFullBath 0
BsmtHalfBath 0
FullBath 0
HalfBath 0
BedroomAbvGr 0
KitchenAbvGr 0
KitchenQual 0
TotRmsAbvGrd 0
Functional 0
Fireplaces 0
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageCars 0
GarageArea 0
GarageQual 81
GarageCond 81
PavedDrive 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
X3SsnPorch 0
ScreenPorch 0
PoolArea 0
PoolQC 1453
Fence 1179
MiscFeature 1406
MiscVal 0
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
sapply(test, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x
Id 0
MSSubClass 0
MSZoning 4
LotFrontage 227
LotArea 0
Street 0
Alley 1352
LotShape 0
LandContour 0
Utilities 2
LotConfig 0
LandSlope 0
Neighborhood 0
Condition1 0
Condition2 0
BldgType 0
HouseStyle 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
RoofStyle 0
RoofMatl 0
Exterior1st 1
Exterior2nd 1
MasVnrType 16
MasVnrArea 15
ExterQual 0
ExterCond 0
Foundation 0
BsmtQual 44
BsmtCond 45
BsmtExposure 44
BsmtFinType1 42
BsmtFinSF1 1
BsmtFinType2 42
BsmtFinSF2 1
BsmtUnfSF 1
TotalBsmtSF 1
Heating 0
HeatingQC 0
CentralAir 0
Electrical 0
X1stFlrSF 0
X2ndFlrSF 0
LowQualFinSF 0
GrLivArea 0
BsmtFullBath 2
BsmtHalfBath 2
FullBath 0
HalfBath 0
BedroomAbvGr 0
KitchenAbvGr 0
KitchenQual 1
TotRmsAbvGrd 0
Functional 2
Fireplaces 0
FireplaceQu 730
GarageType 76
GarageYrBlt 78
GarageFinish 78
GarageCars 1
GarageArea 1
GarageQual 78
GarageCond 78
PavedDrive 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
X3SsnPorch 0
ScreenPorch 0
PoolArea 0
PoolQC 1456
Fence 1169
MiscFeature 1408
MiscVal 0
MoSold 0
YrSold 0
SaleType 1
SaleCondition 0
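
The per-column missing-value counts above can also be obtained with base R alone. The sketch below assumes a small made-up data frame `toy` (hypothetical; its column names only mimic the housing data) and shows the idea behind `n_miss`, `prop_miss` and the `sapply` call.

```r
# Hypothetical toy data frame; values are invented for illustration
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, "Grvl", NA),
  LotArea     = c(8450, 9600, 11250, 9550)
)

na_count <- colSums(is.na(toy))   # missing values per column
na_total <- sum(is.na(toy))       # total number of missing cells
na_share <- mean(is.na(toy))      # overall proportion of missing cells

print(sort(na_count, decreasing = TRUE))
print(c(total = na_total, share = na_share))
```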

Univariate descriptive statistics

selectCols <- c('GrLivArea', 'SalePrice')
select.data <- train[selectCols]

kable(select.data[sample(nrow(select.data), 12), ] , format="pandoc", align="l", row.names = F, caption = "Sample of Univariate Descriptive Stat.")
Sample of Univariate Descriptive Stat.
GrLivArea SalePrice
1936 230500
1863 219210
1477 145000
1092 139000
1983 225000
954 119750
1348 117000
2018 378500
1593 178000
1285 143500
2792 256000
996 108000
summary(select.data)
##    GrLivArea      SalePrice     
##  Min.   : 334   Min.   : 34900  
##  1st Qu.:1130   1st Qu.:129975  
##  Median :1464   Median :163000  
##  Mean   :1515   Mean   :180921  
##  3rd Qu.:1777   3rd Qu.:214000  
##  Max.   :5642   Max.   :755000

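Create contingency tables of observation counts and joint probabilities, with x = 1130 sq. ft. (the rounded 1st quartile of GrLivArea) and y = $163000 (the median of SalePrice)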

total.obs <- nrow(select.data)

#No. of observations, X>x and Y>y
xy.pos.data <- select.data[which(select.data$GrLivArea > 1130 & select.data$SalePrice > 163000),]
xy.pos <- nrow(xy.pos.data)

#No. of observations, X<=x and Y<=y
xy.neg.data <- select.data[which(select.data$GrLivArea <= 1130 & select.data$SalePrice <= 163000),]
xy.neg <- nrow(xy.neg.data)

#No. of observations, X>x and Y<=y
x.pos.y.neg.data <- select.data[which(select.data$GrLivArea > 1130 & select.data$SalePrice <= 163000),]
x.pos.y.neg <- nrow(x.pos.y.neg.data)

#No. of observations, X<=x and Y>y
x.neg.y.pos.data <- select.data[which(select.data$GrLivArea <= 1130 & select.data$SalePrice > 163000),]
x.neg.y.pos <- nrow(x.neg.y.pos.data)

house.data<- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)
#add column and row totals
house.data<- cbind(house.data, Total = rowSums(house.data))
house.data<- rbind(house.data, Total = colSums(house.data))

rownames(house.data)<- c('(X>x)', '(X<=x)', 'Total')


kable(house.data, digits = 2, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Correlation Matrix of Observations')
Correlation Matrix of Observations
(Y>y) (Y<=y) Total
(X>x) 720 374 1094
(X<=x) 8 358 366
Total 728 732 1460
house.data.prob <- matrix(c(round(xy.pos/total.obs,4), round(x.neg.y.pos/total.obs,4),round(x.pos.y.neg/total.obs,4),round(xy.neg/total.obs,4)), nrow=2, ncol=2)

house.data.prob <- cbind(house.data.prob, Total = round(rowSums(house.data.prob),2))
house.data.prob <- rbind(house.data.prob, Total = round(colSums(house.data.prob),2))

rownames(house.data.prob) <- c('(X>x)', '(X<=x)', 'Total')


kable(house.data.prob, digits = 4, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Correlation Matrix of Joint Probabilities')
Correlation Matrix of Joint Probabilities
(Y>y) (Y<=y) Total
(X>x) 0.4932 0.2562 0.75
(X<=x) 0.0055 0.2452 0.25
Total 0.5000 0.5000 1.00
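
As an alternative to counting each cell by hand, the same kind of 2 x 2 table can be built with `table()`. This is only a sketch on a made-up six-row data frame `toy` (hypothetical values); the cut-offs `x_cut` and `y_cut` are the ones used in the text.

```r
# Hypothetical toy data; the real analysis uses train$GrLivArea and train$SalePrice
toy <- data.frame(
  GrLivArea = c(900, 1200, 1500, 1000, 2000, 1100),
  SalePrice = c(100000, 170000, 150000, 165000, 300000, 120000)
)
x_cut <- 1130     # threshold for GrLivArea used in the text
y_cut <- 163000   # threshold for SalePrice used in the text

# Cross-tabulate the two threshold indicators; TRUE is listed first
counts <- table(
  X = factor(toy$GrLivArea > x_cut, levels = c(TRUE, FALSE), labels = c("X>x", "X<=x")),
  Y = factor(toy$SalePrice > y_cut, levels = c(TRUE, FALSE), labels = c("Y>y", "Y<=y"))
)

print(addmargins(counts))                           # counts with row and column totals
print(round(addmargins(prop.table(counts)), 4))     # joint probabilities with totals
```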

a. \(P(X>x~ |~ Y>y )\), read as probability X (GrLivArea) is greater than 1130 square feet given Y (SalePrice) is greater than $163000.

This is known as a conditional probability because we are computing the probability under a condition: SalePrice is greater than $163000. A conditional probability has two parts, the outcome of interest and the condition. We can think of the condition as information we know to be true, and this information can usually be used to describe the outcome.

\(P(GrLivArea~ > 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)

= \(\frac{720}{728} = 98.9 \%\)

Using the joint probabilities,

= \(\frac{0.4932}{0.5000} = 98.64 \%\)

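Therefore, the probability that GrLivArea is greater than \(1130~ sq.ft.\), given that SalePrice is greater than \(\$163000\), is about \(99\%\). The small difference between the two results comes from rounding in the joint probability table.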

b. \(P(X>x~ \&~ Y>y)\), read as the probability that X (GrLivArea) is greater than 1130 square feet and Y (SalePrice) is greater than $163000.

This is known as joint probability because we are computing the probability using outcomes of two variables.

\(P(GrLivArea~ > 1130~ sq.ft. and SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ observed}\)

= \(\frac{720}{1460} = 49.32 \%\)

Therefore, probability that GrLivArea is greater than \(1130~ sq.ft.\) and SalePrice will be greater than \(\$163000\), is \(49.32 \%\)

c. \(P(X<x~ |~ Y>y )\), read as probability X (GrLivArea) is less than 1130 square feet given Y (SalePrice) is greater than $163000. This is conditional probability.

\(P(GrLivArea~ < 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ < 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)

= \(\frac{8}{728} = 1.1 \%\)

Using the joint probabilities,

= \(\frac{0.0055}{0.5000} = 1.1 \%\)

Therefore, the probability that GrLivArea is less than \(1130~ sq.ft.\), given that SalePrice is greater than \(\$163000\), is \(1.1 \%\)
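
The three probabilities in parts a to c can be reproduced directly from the four cell counts reported in the observation table. The following base-R sketch hard-codes those counts; the object names (`counts`, `joint`, `cond_x_given_y`) are illustrative only.

```r
# Cell counts taken from the observation table in the text
counts <- matrix(c(720, 8, 374, 358), nrow = 2,
                 dimnames = list(c("X>x", "X<=x"), c("Y>y", "Y<=y")))

joint <- counts / sum(counts)                      # joint probabilities
cond_x_given_y <- prop.table(counts, margin = 2)   # each column divided by its column total

print(round(joint["X>x", "Y>y"], 4))            # part b: P(X>x and Y>y)
print(round(cond_x_given_y["X>x", "Y>y"], 4))   # part a: P(X>x | Y>y)
print(round(cond_x_given_y["X<=x", "Y>y"], 4))  # part c: P(X<=x | Y>y)
```

Working from the unrounded counts avoids the small rounding discrepancy that appears when the rounded joint probabilities are divided.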


Relation to independence:

\(P(XY) = P(X)P(Y)\)

Above condition can be rewritten as

\(P(X \cap Y) = P(X)P(Y)\), condition will be true only when \(X\) and \(Y\) are independent.

We can say that above grade living area and sale price are independent only when an increase or decrease in the area does not affect the probability of increase or decrease of the sale price of the house. We can test the condition by using the following hypothesis.

Null Hypothesis(\(H_0\)): Sale price of the house is not influenced by above grade living area.

Alternative Hypothesis(\(H_A\)): Above grade living area has significant influence on sale price of the house.

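By the multiplication rule, the joint probability can always be written in two ways:

\(P(X>x~ |~ Y>y)P(Y>y) = P(X>x~ \cap ~ Y>y) = P(Y>y~ |~ X>x)P(X>x)\)

If the two variables were independent, the conditional probabilities would additionally equal the marginal probabilities, i.e. \(P(X>x~ |~ Y>y) = P(X>x)\) and \(P(Y>y~ |~ X>x) = P(Y>y)\).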

We will solve above conditions in two parts,

\(P(X>x~ |~ Y>y)P(Y>y) = P(X>x~ \cap ~ Y>y)\)

Rearranging, \(P(X>x~ |~ Y>y) = \frac{P(X>x~ \cap ~ Y>y)}{P(Y>y)}\), where

\(P(X>x~ \cap ~ Y>y)\) is the probability that GrLivArea > 1130 sq.ft. and SalePrice > $163000, and

\(P(Y>y)\) is the probability that SalePrice > $163000.

\(P(X>x~ |~ Y>y) = \frac{720}{728} = 98.9 \%\)

Comparing other way,

\(P(Y>y~ |~ X>x)P(X>x) = P(X>x~ \cap ~ Y>y)\)

Rearranging, \(P(Y>y~ |~ X>x) = \frac{P(X>x~ \cap ~ Y>y)}{P(X>x)}\), where

\(P(X>x)\) is the probability that GrLivArea > 1130 sq.ft.

\(P(Y>y~ |~ X>x) = \frac{720}{1094} = 65.81 \%\)

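Under independence we would have \(P(X>x~ |~ Y>y) = P(X>x)\) and \(P(Y>y~ |~ X>x) = P(Y>y)\). Here \(P(X>x~ |~ Y>y) = 98.9 \%\) while \(P(X>x) = \frac{1094}{1460} = 74.93 \%\), and \(P(Y>y~ |~ X>x) = 65.81 \%\) while \(P(Y>y) = \frac{728}{1460} = 49.86 \%\). Since the independence condition is not met, we reject Null Hypothesis(\(H_0\)), and accept Alternative Hypothesis(\(H_A\)) that above grade living area has significant influence on sale price of the house.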
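
One way to check the independence condition \(P(X \cap Y) = P(X)P(Y)\) numerically is to compare the observed joint probabilities with the product of the marginals. This is a sketch using the cell counts reported earlier; the names `p_x`, `p_y` and `indep` are illustrative.

```r
# Cell counts taken from the observation table in the text
counts <- matrix(c(720, 8, 374, 358), nrow = 2,
                 dimnames = list(c("X>x", "X<=x"), c("Y>y", "Y<=y")))
n <- sum(counts)

p_x   <- rowSums(counts) / n   # marginal probabilities of X>x and X<=x
p_y   <- colSums(counts) / n   # marginal probabilities of Y>y and Y<=y
joint <- counts / n            # observed joint probabilities
indep <- outer(p_x, p_y)       # joint probabilities implied by independence

print(round(joint, 4))
print(round(indep, 4))
print(round(joint - indep, 4)) # would be all zeros under exact independence
```

For the top-left cell the observed value is about 0.4932 against about 0.3736 under independence, which is the gap the argument above relies on.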


Using Chi-square test

house.data <- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)
house.data
##      [,1] [,2]
## [1,]  720  374
## [2,]    8  358
chisq.test(house.data) 
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  house.data
## X-squared = 441.58, df = 1, p-value < 2.2e-16

Because the contingency table is 2 x 2, degrees of freedom (df) = (2 - 1)(2 - 1) = 1. The p-value is below \(2.2 \times 10^{-16}\), essentially “0”, which is far smaller than the \(0.05\) significance level. So we reject Null Hypothesis(\(H_0\)), and accept Alternative Hypothesis(\(H_A\)) that GrLivArea has significant influence on sale price of the house.
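
To see where the test statistic comes from, the Yates-corrected chi-square can be recomputed by hand from the same 2 x 2 counts. This is a sketch for illustration; `expected` and `x2_yates` are names introduced here, and the formula is the usual continuity-corrected Pearson statistic.

```r
# Same matrix as passed to chisq.test() in the text
counts <- matrix(c(720, 8, 374, 358), nrow = 2)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic with Yates' continuity correction (subtract 0.5 from |O - E|)
x2_yates <- sum((abs(counts - expected) - 0.5)^2 / expected)

df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(x2_yates, df = df, lower.tail = FALSE)

print(round(expected, 2))
print(c(statistic = x2_yates, df = df, p_value = p_value))
```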

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

house.stats <- matrix(data = NA, nrow=11,ncol=2)
qarea <- quantile(select.data$GrLivArea)
qprice <- quantile(select.data$SalePrice)

house.stats[1,1] <- nrow(select.data)
house.stats[1,2] <- nrow(select.data)

house.stats[2,1] <- length(select.data$GrLivArea[!is.na(select.data$GrLivArea)])
house.stats[2,2] <- length(select.data$SalePrice[!is.na(select.data$SalePrice)])

house.stats[3,1] <- paste0(min(select.data$GrLivArea), ' sq. ft.')
house.stats[3,2] <- paste0('$', min(select.data$SalePrice))

house.stats[4,1] <- paste0(max(select.data$GrLivArea), ' sq. ft.')
house.stats[4,2] <- paste0('$', max(select.data$SalePrice))

house.stats[5,1] <- paste0(median(select.data$GrLivArea), ' sq. ft.')
house.stats[5,2] <- paste0('$', median(select.data$SalePrice))

house.stats[6,1] <- paste0(qarea[2], ' sq. ft.')
house.stats[6,2] <- paste0('$', qprice[2])

house.stats[7,1] <- paste0(qarea[4], ' sq. ft.')
house.stats[7,2] <- paste0('$', qprice[4])

house.stats[8,1] <- paste0(round(mean(select.data$GrLivArea),2), ' sq. ft.')
house.stats[8,2] <- paste0('$', round(mean(select.data$SalePrice),2))

house.stats[9,1] <- round(sd(select.data$GrLivArea),2)
house.stats[9,2] <- round(sd(select.data$SalePrice),2)

house.stats[10,1] <- paste0(getmode(select.data$GrLivArea), ' sq. ft.')
house.stats[10,2] <- paste0('$', getmode(select.data$SalePrice))

house.stats[11,1] <- paste0(IQR(select.data$GrLivArea), ' sq. ft.')
house.stats[11,2] <- paste0('$', IQR(select.data$SalePrice))

rownames(house.stats)<- c('Number of Observations', 'Non-missing values', 'Minimum','Maximum', 'Median','1st quartile','3rd quartile', 'Average(mean)', 'Standard deviation', 'Mode', 'Interquartile range(IQR)')

kable(house.stats, digits = 2, 
      col.names = c('GrLivArea', 'SalePrice'), 
      align = "l", 
      caption = 'Univariate Descriptive Statistics', "html") %>%  
  kable_styling(bootstrap_options = c("striped", "hover"))
Univariate Descriptive Statistics
GrLivArea SalePrice
Number of Observations 1460 1460
Non-missing values 1460 1460
Minimum 334 sq. ft. $34900
Maximum 5642 sq. ft. $755000
Median 1464 sq. ft. $163000
1st quartile 1129.5 sq. ft. $129975
3rd quartile 1776.75 sq. ft. $214000
Average(mean) 1515.46 sq. ft. $180921.2
Standard deviation 525.48 79442.5
Mode 864 sq. ft. $140000
Interquartile range(IQR) 647.25 sq. ft. $84025

Graphs

ggplot(train, aes(GrLivArea)) + geom_histogram(binwidth = 150, alpha=0.5, color="red", fill="light green")

The histogram shows the distribution of “GrLivArea”. The average area is 1515.46 sq.ft. with a standard deviation of 525.48 sq.ft. It also shows a long right tail, suggesting the existence of outliers to the right of the average.

ggplot(train, aes(SalePrice)) + geom_histogram(binwidth = 30000, alpha=0.5, color="red", fill="light green")

The histogram shows the distribution of the sale price of houses. The average sale price is $180921.2, with a standard deviation of $79442.5. It also shows a long right tail, suggesting the existence of outliers to the right of the average.

ggplot(train, aes(GrLivArea,SalePrice))+geom_boxplot(color="red", fill="light green", outlier.size=3)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

d <- ggplot(train, aes(x = GrLivArea, y = SalePrice)) +
    geom_boxplot() 
  
d
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

ggplot(train, aes(x=GrLivArea, y=SalePrice)) + geom_point(color="green", size=3, alpha=0.3) + stat_smooth(method="lm", color="red", se=FALSE)+ggtitle("Correlation of GrLivArea and SalePrice")

From the boxplots, histograms and scatterplot above, we can see that there are some outliers and that the variation of “SalePrice” across “GrLivArea” is not constant. The outliers produce the longer tail on the right side.

Quantiles of SalePrice

kable(qprice, digits = 2, 
      caption = 'Quartiles of "SalePrice" ($)', 
      align = 'l', padding = 10, "html") %>%  
  kable_styling(bootstrap_options = c("striped", "hover"))
Quartiles of “SalePrice” ($)
x
0% 34900
25% 129975
50% 163000
75% 214000
100% 755000

Linear Model

lm_model_price_area <- lm(train$SalePrice ~ train$GrLivArea)
summary(lm_model_price_area)
## 
## Call:
## lm(formula = train$SalePrice ~ train$GrLivArea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462999  -29800   -1124   21957  339832 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     18569.026   4480.755   4.144 3.61e-05 ***
## train$GrLivArea   107.130      2.794  38.348  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared:  0.5021, Adjusted R-squared:  0.5018 
## F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16

Multiple R-squared: 0.5021 means that the regression model can explain 50.21% of the variation in the data. Residual standard error: 56070 suggests that the typical distance of the data points from the fitted line is about $56070. If the residuals are approximately normal, about 95% of the time the sale price should fall within ±2 × 56070 = ±112140 of the fitted value.
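
The two-standard-error rule of thumb can be checked on simulated data. The sketch below does not use the housing data: `area` and `price` are hypothetical values generated with normal noise, so it only illustrates the rule under that assumption.

```r
set.seed(1)

# Hypothetical data with normally distributed errors (not the Kaggle data)
area  <- runif(500, min = 500, max = 3000)
price <- 20000 + 100 * area + rnorm(500, sd = 50000)

fit <- lm(price ~ area)
rse <- sigma(fit)   # residual standard error of the fitted model

# Share of observations whose residual lies within two residual standard errors
share_within <- mean(abs(residuals(fit)) <= 2 * rse)

print(c(rse = rse, share_within = share_within))
```

With heavy-tailed or heteroscedastic residuals, as the housing plots suggest, the share can deviate from 95%, so the rule is best read as an approximation.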

Box-Cox Transformation

As we can see, the variables “GrLivArea” and “SalePrice” are not normally distributed. Normality is an important assumption for many statistical techniques; a Box-Cox transformation is a way to transform a non-normal dependent variable into an approximately normal shape.

For all positive values of \(y\), it is defined by

\[ y(\lambda)=\begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \text{if }\lambda \neq 0 \\ \log y, & \text{if }\lambda = 0 \end{cases} \]

If \(y\) has negative values, a shift \(\lambda_2\) with \(y + \lambda_2 > 0\) is added and it is defined as

\[ y(\lambda)=\begin{cases} \frac{(y + \lambda_2)^{\lambda_1} - 1}{\lambda_1}, & \text{if }\lambda_1 \neq 0 \\ \log (y + \lambda_2), & \text{if }\lambda_1 = 0 \end{cases} \]
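
Both definitions above can be written as one small R function. This is a minimal sketch; the function name `box_cox` and its arguments are hypothetical and not part of the MASS package, and `lambda2 = 0` gives the one-parameter form.

```r
# Box-Cox transform; lambda2 is the optional shift that makes y + lambda2 positive
box_cox <- function(y, lambda1, lambda2 = 0) {
  shifted <- y + lambda2
  stopifnot(all(shifted > 0))          # the transform is only defined for positive values
  if (abs(lambda1) < 1e-8) {
    log(shifted)                       # limiting case lambda1 = 0
  } else {
    (shifted^lambda1 - 1) / lambda1    # general case
  }
}

print(box_cox(c(1, 10, 100), lambda1 = 0.1))
print(box_cox(c(1, 10, 100), lambda1 = 0))
print(box_cox(c(-1, 0, 2), lambda1 = 0.5, lambda2 = 3))
```

As \(\lambda_1\) approaches 0 the general case converges to the logarithm, which is why the two branches join smoothly.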

We will use the R function boxcox from the MASS library to determine the optimal lambda(\(\lambda\)) value.

par(mfrow=c(1,2))
house.bc <- boxcox(lm_model_price_area)
house.bc.df <- as.data.frame(house.bc)
lambda <- house.bc.df[which.max(house.bc.df$y),1]
boxcox(lm_model_price_area, plotit=T, lambda=seq(0,0.20,by=0.05))

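From the Box-Cox plot above, the optimal lambda(\(\lambda\)) is about \(0.10\), with a confidence interval running from about \(0.02\) to \(0.18\). Because this interval excludes \(\lambda = 1\), which would mean no transformation, the plot suggests that transforming SalePrice could help; a value this close to \(0\) is near a log transformation.

We perform the transformation with the estimated \(\lambda\) and compare the result with the untransformed model.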

train$SalePrice_trans <- ((train$SalePrice^lambda) -1)/lambda


ggplot(train, aes(x=GrLivArea, y=SalePrice_trans)) +
  geom_point(alpha=0.3, size=3)+
  stat_smooth(method="lm", color="blue", se=FALSE) +
  # the trailing "+" on the line above attaches the labels to the plot
  labs(title="Scatterplot GrLivArea Vs. Transformed SalePrice",
       x="GrLivArea(sq.ft.)", y = "Transformed SalePrice")
house_t.lm <- lm(train$SalePrice_trans ~ train$GrLivArea)
summary(house_t.lm)
## 
## Call:
## lm(formula = train$SalePrice_trans ~ train$GrLivArea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6488 -0.4945  0.0843  0.5314  3.1560 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.073e+01  7.656e-02  270.75   <2e-16 ***
## train$GrLivArea 1.813e-03  4.774e-05   37.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9581 on 1458 degrees of freedom
## Multiple R-squared:  0.4975, Adjusted R-squared:  0.4971 
## F-statistic:  1443 on 1 and 1458 DF,  p-value: < 2.2e-16

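As we see, the Multiple R-squared value (0.4975) is slightly smaller than that of the untransformed model (0.5021), so the transformation does not improve the linear fit in this case. Note that R-squared values computed on different response scales are not strictly comparable.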
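One way to compare the two models on the same footing is to back-transform the fitted values and measure the error on the original scale. The sketch below does this on simulated data; the data, the value `lambda = 0.1`, and the helper names `inv_box_cox` and `rmse` are assumptions for illustration only.

```r
# Simulated data with a multiplicative error; not the housing data.
set.seed(2)
x <- runif(400, 500, 3000)
y <- exp(11 + 0.0005 * x + rnorm(400, sd = 0.3))

lambda <- 0.1                                  # assumed value for illustration
z <- (y^lambda - 1) / lambda                   # Box-Cox transform of y
inv_box_cox <- function(z, lambda) (lambda * z + 1)^(1 / lambda)

fit_raw   <- lm(y ~ x)
fit_trans <- lm(z ~ x)

# Bring the fitted values of the transformed model back to the original scale
pred_raw   <- fitted(fit_raw)
pred_trans <- inv_box_cox(fitted(fit_trans), lambda)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
print(c(raw = rmse(y, pred_raw), transformed = rmse(y, pred_trans)))
```

Both errors are then in the units of the original response, which avoids comparing R-squared values across scales.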

ggplot(train, aes(x=LotArea, y=SalePrice)) + geom_point(color="green", size=3, alpha=0.3) + stat_smooth(method="lm", color="red", se=FALSE)+ggtitle("Correlation of LotArea and SalePrice")

ggplot(train, aes(x=X1stFlrSF, y=SalePrice)) + geom_point(color="green", size=3, alpha=0.3) + stat_smooth(method="lm", color="red", se=FALSE)+ggtitle("Correlation of X1stFlrSF and SalePrice")

corr_data<-subset(train,select=c("X1stFlrSF","LotArea", "SalePrice"))


correlation_matrix <- round(cor(corr_data),2)


get_lower_tri<-function(correlation_matrix){
    correlation_matrix[upper.tri(correlation_matrix)] <- NA
    return(correlation_matrix)
}
  
get_upper_tri <- function(correlation_matrix){
    correlation_matrix[lower.tri(correlation_matrix)]<- NA
    return(correlation_matrix)
}
  
upper_tri <- get_upper_tri(correlation_matrix)
  

melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)


ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))

Test the hypotheses that the correlation between each pair of variables is 0 and provide an 80% confidence interval.

cor.test(corr_data$X1stFlrSF, corr_data$SalePrice, method = c("pearson", "kendall", "spearman"), conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$X1stFlrSF and corr_data$SalePrice
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5841687 0.6266715
## sample estimates:
##       cor 
## 0.6058522
cor.test(corr_data$LotArea, corr_data$SalePrice, method = c("pearson", "kendall", "spearman"), conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$LotArea and corr_data$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434
cor.test(corr_data$X1stFlrSF, corr_data$LotArea, method = c("pearson", "kendall", "spearman"), conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$X1stFlrSF and corr_data$LotArea
## t = 11.985, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2686127 0.3297222
## sample estimates:
##       cor 
## 0.2994746

For each pair of variables, we have generated an 80 percent confidence interval. All the p-values are < 0.001, and none of the intervals contains 0. Hence, for all three tests, we reject the null hypothesis and conclude that the true correlation is not 0 for the selected variables.
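The three calls can also be run in a loop over all pairs. The sketch below uses simulated columns `a`, `b` and `c` (placeholders, not housing variables) and collects the estimate, the 80% interval and the p-value for each pair in one table.

```r
# Simulated stand-in data with three correlated columns.
set.seed(3)
dat <- data.frame(a = rnorm(200))
dat$b <- 0.6 * dat$a + rnorm(200)
dat$c <- 0.3 * dat$a + rnorm(200)

pairs <- combn(names(dat), 2, simplify = FALSE)   # all unordered pairs
results <- do.call(rbind, lapply(pairs, function(p) {
  ct <- cor.test(dat[[p[1]]], dat[[p[2]]], method = "pearson", conf.level = 0.80)
  data.frame(var1 = p[1], var2 = p[2],
             r = unname(ct$estimate),
             lower = ct$conf.int[1], upper = ct$conf.int[2],
             p_value = ct$p.value)
}))
print(results)
```

Passing a single `method` avoids relying on R picking the first element of a vector of method names.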

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

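The familywise error rate (FWER) is the probability of making at least one Type I error (a false rejection) across a family of tests. Running several tests inflates this probability beyond the level of any single test. For \(n\) independent tests each run at level \(\alpha\), the FWER is \(1-(1-\alpha)^n\). Here we ran three tests, and an 80% confidence level corresponds to \(\alpha = 0.2\).

n <- 3

alpha <- 0.2

print(paste0("Familywise error rate is ", round(1-(1-alpha)^n, 3)))
## [1] "Familywise error rate is 0.488"

A familywise error rate of about 49% would be a concern if any p-value were close to the threshold. In this case all three p-values are below 2.2e-16, far under even the Bonferroni-adjusted threshold of 0.2/3, so familywise error does not change our conclusions.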
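A common way to keep the familywise error rate under control is to adjust the p-values before comparing them with the threshold. The sketch below applies the Bonferroni and Holm adjustments with base R's `p.adjust`; the p-values are made up for illustration.

```r
p_values <- c(0.001, 0.020, 0.300)   # made-up p-values for three tests
alpha <- 0.20

p_bonf <- p.adjust(p_values, method = "bonferroni")  # multiply by the number of tests, cap at 1
p_holm <- p.adjust(p_values, method = "holm")        # step-down variant of Bonferroni

print(data.frame(p_values, p_bonf, p_holm,
                 reject_bonf = p_bonf < alpha))
```

With p-values as small as those reported by `cor.test` above, either adjustment leaves every rejection in place.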

Linear Algebra

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

The following are the three untransformed independent variables (LotArea, TotalBsmtSF, GrLivArea) and one dependent variable (SalePrice).

Correlation measures the linear trend shared between two variables. If the value is close to 1, the variables are positively related. If the value is close to -1, the variables are negatively (inversely) related. If the value is close to 0, the two variables have little linear relationship.

select.Cols <- c('LotArea','TotalBsmtSF','GrLivArea', 'SalePrice')
select.data <- train[select.Cols]

pearson.cor <- cor(select.data,method="pearson")
pearson.cor
##               LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea     1.0000000   0.2608331 0.2631162 0.2638434
## TotalBsmtSF 0.2608331   1.0000000 0.4548682 0.6135806
## GrLivArea   0.2631162   0.4548682 1.0000000 0.7086245
## SalePrice   0.2638434   0.6135806 0.7086245 1.0000000

The correlation between TotalBsmtSF and SalePrice is 0.61: houses with a larger basement area tend to sell for more. The squared coefficient is 0.3765, so in a simple linear regression about 37.65% of the variance in sale price is explained by total basement area.

The correlation between GrLivArea and SalePrice is 0.71: houses with a larger living area tend to sell for more. The squared coefficient is 0.5021, so about 50.21% of the variance in sale price is explained by above-grade living area, consistent with the Multiple R-squared of 0.5021 reported earlier.

The precision matrix is the inverse of the correlation matrix.

inv.cor <- solve(pearson.cor)
inv.cor
##                 LotArea TotalBsmtSF  GrLivArea   SalePrice
## LotArea      1.10622180  -0.1703170 -0.1623394 -0.07232846
## TotalBsmtSF -0.17031695   1.6321069 -0.0397442 -0.92832834
## GrLivArea   -0.16233936  -0.0397442  2.0350650 -1.37487844
## SalePrice   -0.07232846  -0.9283283 -1.3748784  2.56296011
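The assignment text notes that the diagonal of the precision matrix holds the variance inflation factors. The sketch below checks this on simulated data by comparing the diagonal with \(1/(1-R^2)\) from regressing each variable on the others; the variables `x1`, `x2`, `x3` are placeholders.

```r
# Simulated, mutually correlated variables; not the housing data.
set.seed(4)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- 0.3 * x1 + 0.4 * x2 + rnorm(n)
X  <- data.frame(x1, x2, x3)

precision <- solve(cor(X))

# VIF of each variable from regressing it on the others: 1 / (1 - R^2)
vif_by_regression <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(rbind(diag_precision = diag(precision), vif_by_regression))
```

The two rows should agree up to floating-point error. In the matrix above, the largest diagonal entry (2.56 for SalePrice) identifies the variable best explained by the other three.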

Matrix Multiplication

Correlation Matrix multiplied by Precision Matrix

round(pearson.cor %*% inv.cor)
##             LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea           1           0         0         0
## TotalBsmtSF       0           1         0         0
## GrLivArea         0           0         1         0
## SalePrice         0           0         0         1

Precision Matrix multiplied by Correlation Matrix

round(inv.cor %*% pearson.cor)
##             LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea           1           0         0         0
## TotalBsmtSF       0           1         0         0
## GrLivArea         0           0         1         0
## SalePrice         0           0         0         1

Correlation Matrix multiplied by Precision Matrix and Precision Matrix multiplied by Correlation Matrix results in identity matrix.
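The assignment also asks for an LU decomposition. As a minimal sketch, the function below performs a Doolittle factorization without pivoting in base R on the correlation matrix printed above. The function name `lu_decompose` is my own, and skipping pivoting is an assumption that is safe here because a correlation matrix of this kind is positive definite.

```r
# Correlation matrix, values copied from the output above.
A <- matrix(c(1.0000000, 0.2608331, 0.2631162, 0.2638434,
              0.2608331, 1.0000000, 0.4548682, 0.6135806,
              0.2631162, 0.4548682, 1.0000000, 0.7086245,
              0.2638434, 0.6135806, 0.7086245, 1.0000000),
            nrow = 4, byrow = TRUE)

# Doolittle LU without pivoting: A = L %*% U, with L unit lower triangular.
lu_decompose <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

lu <- lu_decompose(A)
round(lu$L, 4)
round(lu$U, 4)
max(abs(lu$L %*% lu$U - A))   # reconstruction error, should be near zero
```

Multiplying the two factors back together recovers the correlation matrix, which is the check the last line performs.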

Calculus-Based Probability & Statistics

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
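For reference, the exponential workflow described in the task can be sketched in base R on simulated data. This is only an illustration under stated assumptions: the variable `x` is simulated, and the rate is estimated with the closed-form maximum likelihood estimate `1 / mean(x)`, which is the same quantity an exponential fit with fitdistr reports.

```r
set.seed(5)
x <- rexp(1460, rate = 1 / 500) + 1      # simulated right-skewed variable, minimum above zero

rate_hat <- 1 / mean(x)                  # MLE of the exponential rate
sims <- rexp(1000, rate = rate_hat)      # 1000 draws from the fitted distribution

# 5th and 95th percentiles from the fitted exponential CDF
exp_pct <- qexp(c(0.05, 0.95), rate = rate_hat)

# Normal-based 95% interval for the data (mean +/- 1.96 sd)
norm_int <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)

# Empirical 5th and 95th percentiles
emp_pct <- quantile(x, c(0.05, 0.95))

print(list(rate = rate_hat, exponential = exp_pct,
           normal = norm_int, empirical = emp_pct))
```

For skewed data the normal-based interval typically extends below zero, which is one of the points the comparison is meant to bring out.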

In this part, the professor asks for a right-skewed variable, shifted if necessary so that its minimum value is strictly above zero. After reviewing the numerical data, we found that most size-related variables are concentrated at small and medium values with a long tail toward large values, which makes sense because most houses are small or medium sized. The year-related variables are concentrated at recent years, with a tail toward older years, because most houses were built or renovated recently. We look at YearBuilt below.

ggplot(train, aes(x=YearBuilt)) + geom_histogram(binwidth=10,color="red", fill="light green")

ggplot(train, aes(x=YearBuilt, y=SalePrice)) + geom_point(color="green", size=3, alpha=0.3) + stat_smooth(method="lm", color="red", se=FALSE)+ggtitle("Correlation of YearBuilt and SalePrice")

In general, YearBuilt should be positively correlated with SalePrice, since inflation, rising wages and the increasing cost of building materials all push the sale price up. However, a few high-end units act as outliers and shift the fitted line.

selectCols <- c('YearBuilt', 'SalePrice')
select.data <- train[selectCols]

fit.output <- fitdistr(select.data$YearBuilt, densfun="normal")
fit.output
##        mean            sd     
##   1971.2678082     30.1925588 
##  (   0.7901754) (   0.5587384)

Output of the ‘fitdistr’ function: mean (\(\mu\)) = 1971.27 and standard deviation (\(\sigma\)) = 30.19.

To check these estimates, I will use the ‘optim’ and ‘dnorm’ functions. ‘dnorm’ is the R function that calculates the probability density of the normal distribution. The output of ‘fitdistr’ supplies the starting values for ‘optim’. The NaN warnings below arise when ‘optim’ tries a non-positive standard deviation during its search.

likelihood.func <- function(params) { -sum(dnorm(select.data$YearBuilt, params[1], params[2], log=TRUE)) }
optim.output <- optim(c(fit.output$estimate[1], 30), likelihood.func)   
## Warning in dnorm(select.data$YearBuilt, params[1], params[2], log = TRUE):
## NaNs produced

## Warning in dnorm(select.data$YearBuilt, params[1], params[2], log = TRUE):
## NaNs produced
optim.output
## $par
##      mean           
## 1971.2765   30.1827 
## 
## $value
## [1] 7046.74
## 
## $counts
## function gradient 
##       57       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Output of ‘optim function’, average(\(\mu\)) = 1971.28 and standard deviation(\(\sigma\)) = 30.18.

Apart from small numerical differences, the ‘optim’ function produced the same estimates as ‘fitdistr’.
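The NaN warnings can be avoided by optimizing the standard deviation on the log scale, so that every trial value maps to a positive sd. The sketch below does this on simulated data; the data, the starting values, and the `parscale` setting are my assumptions, not part of the analysis above.

```r
set.seed(6)
x <- rnorm(1460, mean = 1971, sd = 30)   # simulated stand-in for a year variable

# Negative log-likelihood; sd is optimised on the log scale so it stays positive
nll <- function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))

start <- c(median(x), log(mad(x)))
# parscale tells optim that the first parameter is about 100 times larger than the second
opt <- optim(start, nll, control = list(parscale = c(100, 1)))

est <- c(mean = opt$par[1], sd = exp(opt$par[2]))
closed_form <- c(mean = mean(x), sd = sqrt(mean((x - mean(x))^2)))
print(rbind(optim = est, closed_form = closed_form))
```

The closed-form row uses the maximum likelihood formulas (division by n rather than n - 1), which is what a numerical search should approach.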

Sample Data

To generate 1000 samples, ‘rnorm’ function with the optimal parameters generated by ‘optim’ function will be used.

#generate 1000 samples
year.sample <- rnorm(n=1000, mean=round(optim.output$par[1],2), sd=round(optim.output$par[2],2))
year.sample <- data.frame(year.sample)
names(year.sample)[1] <- "Samples"

select.data$yearSplit <- as.factor(select.data[,1]>=round(optim.output$par[1],2))

# yearSplit has levels FALSE (< mean) and TRUE (>= mean), in that order
a <- ggplot(select.data, aes(YearBuilt, fill=yearSplit)) + 
          geom_histogram(color="light pink", binwidth=10) + 
          scale_x_continuous(name = "Year") +
          scale_fill_manual(values=c("light green","green"),labels=c("Year < 1971","Year >= 1971")) +
          ylab("Number of Observations") +
          ggtitle("Observed Year Distribution") +
          geom_vline(xintercept = mean(select.data$YearBuilt), color="red", lwd=1)
year.sample$yearSplit <- as.factor(year.sample[,1]>=round(optim.output$par[1],2))

b <- ggplot(year.sample, aes(Samples, fill=yearSplit)) + 
          geom_histogram(color="light pink", binwidth=10) + 
          scale_x_continuous(name = "Sample Year") +
          scale_fill_manual(values=c("light green","green"),labels=c("Year < 1971","Year >= 1971")) +
          ylab("Number of Samples") +
          ggtitle("Sample Year Distribution") +
          geom_vline(xintercept = mean(select.data$YearBuilt), color="red", lwd=1)
grid.arrange(b, a, nrow = 2, top='Sampling Data and Original Data')

The mean and SD of the samples and of the observed data are about the same, 1971 and 30 respectively. The red line marks the mean of the observed data.

The observed data contain some outliers, while the sampled data do not.

Goodness of fit test

A chi-square test will be used to check whether the generated sample is consistent with a normal distribution. Under the null hypothesis, we expect 50% of the cases to have a year greater than or equal to the mean and 50% to fall below it.

Hypothesis,

\(H_0\) : Sample data follow a specified distribution.

\(H_A\) : Sample data do not follow the specified distribution.

#Chi-square test

#Expected proportions under the null hypothesis
null_p<-c(0.50, 0.50)

#Samples generated
sample.rows <- c(sum(year.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(year.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))

#Goodness-of-Fit Test
chisq.test(sample.rows, p=null_p)
## 
##  Chi-squared test for given probabilities
## 
## data:  sample.rows
## X-squared = 1.156, df = 1, p-value = 0.2823

Next is a chi-square test of whether the sample represents the actual observed data.

Hypothesis,

\(H_0\) : Sample data represent the actual observed data.

\(H_A\) : Sample data do not represent the actual observed data.

#Ratio of actual values
null_p<-c(round((sum(select.data$YearBuilt >= round(optim.output$par[1],2), na.rm=TRUE)) / nrow(select.data),2), round((sum(select.data$YearBuilt < round(optim.output$par[1],2), na.rm=TRUE)) / nrow(select.data),2))

#Samples generated
sample.rows <- c(sum(year.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(year.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))

#Goodness-of-Fit Test
chisq.test(sample.rows, p=null_p)
## 
##  Chi-squared test for given probabilities
## 
## data:  sample.rows
## X-squared = 5.4848, df = 1, p-value = 0.01918

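For the first test, the p-value is 0.2823, which is greater than 0.05, so we fail to reject \(H_0\): the split of the sample around the mean is consistent with a normal distribution. For the second test, the p-value is 0.01918, which is less than 0.05, so at the 5% level we reject \(H_0\): the proportions above and below the mean in the sample differ from those in the observed data. Because the samples are drawn at random, these p-values can change from run to run unless a seed is fixed.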
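The tests above only compare counts above and below the mean. A finer check, sketched here on simulated data, bins the values into four intervals that each have probability 0.25 under the fitted normal and fixes the seed so the result is reproducible. The choice of four bins is arbitrary, and the p-value is only approximate because the degrees of freedom are not reduced for the two estimated parameters.

```r
set.seed(7)
x <- rnorm(1000, mean = 1971, sd = 30)   # simulated sample

# Four bins with equal probability under the fitted normal
mu <- mean(x)
sigma <- sd(x)
breaks <- qnorm(c(0, 0.25, 0.5, 0.75, 1), mean = mu, sd = sigma)  # -Inf and Inf at the ends
observed <- table(cut(x, breaks = breaks))

gof <- chisq.test(observed, p = rep(0.25, 4))
print(observed)
print(gof$p.value)
```

Applied to a variable with a long tail on one side, the outer bins would be visibly unbalanced.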

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

At the beginning, we used the ‘glimpse’ function and noticed that the dataset includes not only numeric variables but also categorical and ordinal variables. Numeric variables are easy to handle. Categorical or ordinal variables that are already coded as numbers can be treated as numbers when building the model. Character-valued variables are harder to deal with: we are not realtors, so it is hard to assign a rank to each level. Still, we try our best to incorporate a couple of categorical variables in our study.

Here is the plan for dealing with the data:

1. Variables with more than 50% missing values, such as “Alley”, “PoolQC”, “Fence” and “MiscFeature”, will be dropped.
2. Numerical variables will be kept as far as possible. Missing values will be imputed with the mean.
3. Categorical variables are hard to analyze as is, but dropping all of them is not wise because they carry a lot of information. I will keep some of them by converting the character factor into an ordinal variable coded with a series of numbers. Categorical variables that cannot easily be given an ordinal code will be dropped.
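The first step of the plan can be automated rather than done by eye. The sketch below computes the share of missing values per column on a toy data frame and keeps only the columns at or below the 50% threshold; the toy values are made up.

```r
# Toy data frame; column names borrowed from the housing data, values invented.
df <- data.frame(
  Alley   = c(NA, NA, NA, "Pave"),
  LotArea = c(8450, 9600, NA, 9550),
  Street  = c("Pave", "Pave", "Pave", "Grvl")
)

missing_share <- colMeans(is.na(df))               # fraction of NA per column
kept <- df[, missing_share <= 0.5, drop = FALSE]   # drop columns with > 50% missing

print(missing_share)
print(names(kept))
```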

head(train)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           5      2003
## 2       Norm     1Fam     1Story           6           8      1976
## 3       Norm     1Fam     2Story           7           5      2001
## 4       Norm     1Fam     2Story           7           5      1915
## 5       Norm     1Fam     2Story           8           5      2000
## 6       Norm     1Fam     1.5Fin           5           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6         1995     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        196        Gd        TA      PConc       Gd       TA           No
## 2          0        TA        TA     CBlock       Gd       TA           Gd
## 3        162        Gd        TA      PConc       Gd       TA           Mn
## 4          0        TA        TA     BrkTil       TA       Gd           No
## 5        350        Gd        TA      PConc       Gd       TA           Av
## 6          0        TA        TA       Wood       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        706          Unf          0       150         856
## 2          ALQ        978          Unf          0       284        1262
## 3          GLQ        486          Unf          0       434         920
## 4          ALQ        216          Unf          0       540         756
## 5          GLQ        655          Unf          0       490        1145
## 6          GLQ        732          Unf          0        64         796
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr       856       854            0
## 2    GasA        Ex          Y      SBrkr      1262         0            0
## 3    GasA        Ex          Y      SBrkr       920       866            0
## 4    GasA        Gd          Y      SBrkr       961       756            0
## 5    GasA        Ex          Y      SBrkr      1145      1053            0
## 6    GasA        Ex          Y      SBrkr       796       566            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1710            1            0        2        1            3
## 2      1262            0            1        2        0            3
## 3      1786            1            0        2        1            3
## 4      1717            1            0        1        0            3
## 5      2198            1            0        2        1            4
## 6      1362            1            0        1        1            1
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          0        <NA>
## 2            1          TA            6        Typ          1          TA
## 3            1          Gd            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            9        Typ          1          TA
## 6            1          TA            5        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        2003          RFn          2        548         TA
## 2     Attchd        1976          RFn          2        460         TA
## 3     Attchd        2001          RFn          2        608         TA
## 4     Detchd        1998          Unf          3        642         TA
## 5     Attchd        2000          RFn          3        836         TA
## 6     Attchd        1993          Unf          2        480         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y          0          61             0          0
## 2         TA          Y        298           0             0          0
## 3         TA          Y          0          42             0          0
## 4         TA          Y          0          35           272          0
## 5         TA          Y        192          84             0          0
## 6         TA          Y         40          30             0        320
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      2   2008
## 2           0        0   <NA>  <NA>        <NA>       0      5   2007
## 3           0        0   <NA>  <NA>        <NA>       0      9   2008
## 4           0        0   <NA>  <NA>        <NA>       0      2   2006
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA> MnPrv        Shed     700     10   2009
##   SaleType SaleCondition SalePrice SalePrice_trans
## 1       WD        Normal    208500        24.21290
## 2       WD        Normal    181500        23.73836
## 3       WD        Normal    223500        24.45313
## 4       WD       Abnorml    140000        22.86771
## 5       WD        Normal    250000        24.84415
## 6       WD        Normal    143000        22.93796

Drop the SalePrice_trans from dataframe

train$SalePrice_trans <- NULL
train <- read.csv("https://raw.githubusercontent.com/johnpannyc/data605-final/master/train.csv")

Simplify train database by using the above principles

train1 <- dplyr::select(train,MSSubClass,Neighborhood,LotFrontage,LotArea,BldgType,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,CentralAir,X1stFlrSF,X2ndFlrSF,LowQualFinSF,GrLivArea,TotRmsAbvGrd,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,ScreenPorch,X3SsnPorch,PoolArea,MiscVal,MoSold,YrSold,SaleCondition,SalePrice)
glimpse(train1)
## Observations: 1,460
## Variables: 33
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ Neighborhood  <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ BldgType      <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, ...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ CentralAir    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, ...

Get the rank of average Neighborhood Sales Price

Neighborhood is an important factor for the price per square foot. Because I am not familiar with the neighborhoods in the dataset, I will first get the median sale price of each neighborhood.

df <- train %>%
  group_by(Neighborhood) %>%
  summarize(medianSalePrice = median(SalePrice)) %>%
  arrange(desc(medianSalePrice))
df
## # A tibble: 25 x 2
##    Neighborhood medianSalePrice
##    <fct>                  <dbl>
##  1 NridgHt               315000
##  2 NoRidge               301500
##  3 StoneBr               278000
##  4 Timber                228475
##  5 Somerst               225500
##  6 Veenker               218000
##  7 Crawfor               200624
##  8 ClearCr               200250
##  9 CollgCr               197200
## 10 Blmngtn               191000
## # ... with 15 more rows

Each neighborhood is assigned a score from 25 to 1 according to its medianSalePrice, from the highest (NridgHt) to the lowest (MeadowV).

train1$Neighborhood <- as.character(train1$Neighborhood)
train1$Neighborhood[which(train1$Neighborhood == "NridgHt")] <- "25"
train1$Neighborhood[which(train1$Neighborhood == "NoRidge")] <- "24"
train1$Neighborhood[which(train1$Neighborhood == "StoneBr")] <- "23"
train1$Neighborhood[which(train1$Neighborhood == "Timber")] <- "22"
train1$Neighborhood[which(train1$Neighborhood == "Somerst")] <- "21"
train1$Neighborhood[which(train1$Neighborhood == "Veenker")] <- "20"
train1$Neighborhood[which(train1$Neighborhood == "Crawfor")] <- "19"
train1$Neighborhood[which(train1$Neighborhood == "ClearCr")] <- "18"
train1$Neighborhood[which(train1$Neighborhood == "CollgCr")] <- "17"
train1$Neighborhood[which(train1$Neighborhood == "Blmngtn")] <- "16"

train1$Neighborhood[which(train1$Neighborhood == "NWAmes")] <- "15"
train1$Neighborhood[which(train1$Neighborhood == "Gilbert")] <- "14"
train1$Neighborhood[which(train1$Neighborhood == "SawyerW")] <- "13"
train1$Neighborhood[which(train1$Neighborhood == "Mitchel")] <- "12"
train1$Neighborhood[which(train1$Neighborhood == "NPkVill")] <- "11"
train1$Neighborhood[which(train1$Neighborhood == "NAmes")] <- "10"
train1$Neighborhood[which(train1$Neighborhood == "SWISU")] <- "9"
train1$Neighborhood[which(train1$Neighborhood == "Blueste")] <- "8"
train1$Neighborhood[which(train1$Neighborhood == "Sawyer")] <- "7"
train1$Neighborhood[which(train1$Neighborhood == "BrkSide")] <- "6"

train1$Neighborhood[which(train1$Neighborhood == "Edwards")] <- "5"
train1$Neighborhood[which(train1$Neighborhood == "OldTown")] <- "4"
train1$Neighborhood[which(train1$Neighborhood == "BrDale")] <- "3"
train1$Neighborhood[which(train1$Neighborhood == "IDOTRR")] <- "2"
train1$Neighborhood[which(train1$Neighborhood == "MeadowV")] <- "1"


train1$Neighborhood <- as.numeric(train1$Neighborhood)
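The same encoding can be derived from the data instead of being typed by hand, which also guards against typos in level names. A minimal sketch on a toy data frame follows; the column name `NeighborhoodScore` and the toy values are hypothetical.

```r
# Toy data; three made-up neighborhoods.
toy <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "C", "C"),
  SalePrice    = c(100, 120, 300, 320, 200, 210)
)

# Median price per neighborhood as a named vector
med <- sapply(split(toy$SalePrice, toy$Neighborhood), median)

# Rank the medians: 1 = cheapest, highest number = most expensive
score <- setNames(rank(med, ties.method = "first"), names(med))

# Look up each row's score by its neighborhood name
toy$NeighborhoodScore <- unname(score[as.character(toy$Neighborhood)])
print(toy)
```

Because the scores are built from the training prices, the same lookup table would need to be reused for the test set rather than recomputed there.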

Convert indicator variables to numbers.

train1$CentralAir <- as.character(train1$CentralAir)
train1$CentralAir[which(train1$CentralAir == "Y")] <- "1"
train1$CentralAir[which(train1$CentralAir == "N")] <- "0"
train1$CentralAir <- as.numeric(train1$CentralAir)



train1$SaleCondition <- as.character(train1$SaleCondition)
train1$SaleCondition[which(train1$SaleCondition == "Normal")] <- "1"
train1$SaleCondition[which(train1$SaleCondition == "Abnorml")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "AdjLand")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "Alloca")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "Family")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "Partial")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "N")] <- "0"
train1$SaleCondition <- as.numeric(train1$SaleCondition)
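For two-valued indicators like these, a logical comparison gives the same 0/1 coding in a single step. A short sketch on toy vectors (values made up):

```r
central_air    <- factor(c("Y", "N", "Y"))
sale_condition <- factor(c("Normal", "Abnorml", "Partial", "Normal"))

central_air_num    <- as.integer(central_air == "Y")          # 1 for "Y", 0 otherwise
sale_condition_num <- as.integer(sale_condition == "Normal")  # 1 for "Normal", 0 otherwise

print(central_air_num)
print(sale_condition_num)
```

This form also maps any unexpected level to 0 instead of leaving a character value that `as.numeric` would turn into NA.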



train1$BldgType <- as.character(train1$BldgType)
train1$BldgType[which(train1$BldgType == "1Fam")] <- "5"
train1$BldgType[which(train1$BldgType == "2fmCon")] <- "4"
train1$BldgType[which(train1$BldgType == "Duplex")] <- "3"
train1$BldgType[which(train1$BldgType == "Twnhs")] <- "2"
train1$BldgType[which(train1$BldgType == "TwnhsE")] <- "1"
train1$BldgType <- as.numeric(train1$BldgType)
sapply(train1, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x
MSSubClass 0
Neighborhood 0
LotFrontage 259
LotArea 0
BldgType 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
MasVnrArea 8
BsmtFinSF1 0
BsmtFinSF2 0
BsmtUnfSF 0
TotalBsmtSF 0
CentralAir 0
X1stFlrSF 0
X2ndFlrSF 0
LowQualFinSF 0
GrLivArea 0
TotRmsAbvGrd 0
GarageCars 0
GarageArea 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
ScreenPorch 0
X3SsnPorch 0
PoolArea 0
MiscVal 0
MoSold 0
YrSold 0
SaleCondition 0
SalePrice 0

Impute missing data by mean

train1$LotFrontage[is.na(train1$LotFrontage)] <- mean(train1$LotFrontage, na.rm=TRUE)
train1$MasVnrArea[is.na(train1$MasVnrArea)] <- mean(train1$MasVnrArea, na.rm=TRUE)
vis_miss(train1)
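If more numeric columns had missing values, the two assignments above could be generalized with a small helper. The function name `impute_mean` and the toy data are hypothetical; the sketch leaves non-numeric columns untouched.

```r
# Replace NA in every numeric column by that column's mean.
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

toy <- data.frame(LotFrontage = c(60, NA, 80),
                  MasVnrArea  = c(NA, 100, 0),
                  Street      = c("Pave", NA, "Grvl"))
filled <- impute_mean(toy)
print(filled)
```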

sapply(train1, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x
MSSubClass 0
Neighborhood 0
LotFrontage 0
LotArea 0
BldgType 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
MasVnrArea 0
BsmtFinSF1 0
BsmtFinSF2 0
BsmtUnfSF 0
TotalBsmtSF 0
CentralAir 0
X1stFlrSF 0
X2ndFlrSF 0
LowQualFinSF 0
GrLivArea 0
TotRmsAbvGrd 0
GarageCars 0
GarageArea 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
ScreenPorch 0
X3SsnPorch 0
PoolArea 0
MiscVal 0
MoSold 0
YrSold 0
SaleCondition 0
SalePrice 0

View the counts of the recoded categorical variable BldgType

table(train1$BldgType)
## 
##    1    2    3    4    5 
##  114   43   52   31 1220
head(train1)
##   MSSubClass Neighborhood LotFrontage LotArea BldgType OverallQual
## 1         60           17          65    8450        5           7
## 2         20           20          80    9600        5           6
## 3         60           17          68   11250        5           7
## 4         70           19          60    9550        5           7
## 5         60           24          84   14260        5           8
## 6         50           12          85   14115        5           5
##   OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2
## 1           5      2003         2003        196        706          0
## 2           8      1976         1976          0        978          0
## 3           5      2001         2002        162        486          0
## 4           5      1915         1970          0        216          0
## 5           5      2000         2000        350        655          0
## 6           5      1993         1995          0        732          0
##   BsmtUnfSF TotalBsmtSF CentralAir X1stFlrSF X2ndFlrSF LowQualFinSF
## 1       150         856          1       856       854            0
## 2       284        1262          1      1262         0            0
## 3       434         920          1       920       866            0
## 4       540         756          1       961       756            0
## 5       490        1145          1      1145      1053            0
## 6        64         796          1       796       566            0
##   GrLivArea TotRmsAbvGrd GarageCars GarageArea WoodDeckSF OpenPorchSF
## 1      1710            8          2        548          0          61
## 2      1262            6          2        460        298           0
## 3      1786            6          2        608          0          42
## 4      1717            7          3        642          0          35
## 5      2198            9          3        836        192          84
## 6      1362            5          2        480         40          30
##   EnclosedPorch ScreenPorch X3SsnPorch PoolArea MiscVal MoSold YrSold
## 1             0           0          0        0       0      2   2008
## 2             0           0          0        0       0      5   2007
## 3             0           0          0        0       0      9   2008
## 4           272           0          0        0       0      2   2006
## 5             0           0          0        0       0     12   2008
## 6             0           0        320        0     700     10   2009
##   SaleCondition SalePrice
## 1             1    208500
## 2             1    181500
## 3             1    223500
## 4             0    140000
## 5             1    250000
## 6             1    143000
glimpse(train1)
## Observations: 1,460
## Variables: 33
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ Neighborhood  <dbl> 17, 20, 17, 19, 24, 12, 21, 15, 4, 6, 7, 25, 7, ...
## $ LotFrontage   <dbl> 65.00000, 80.00000, 68.00000, 60.00000, 84.00000...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ BldgType      <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, ...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ MasVnrArea    <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ CentralAir    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleCondition <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, ...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
vis_miss(train1)

mb_cor <- cor(train1)

round(mb_cor, 3)
##               MSSubClass Neighborhood LotFrontage LotArea BldgType
## MSSubClass         1.000       -0.046      -0.357  -0.140   -0.746
## Neighborhood      -0.046        1.000       0.219   0.167   -0.061
## LotFrontage       -0.357        0.219       1.000   0.307    0.409
## LotArea           -0.140        0.167       0.307   1.000    0.206
## BldgType          -0.746       -0.061       0.409   0.206    1.000
## OverallQual        0.033        0.671       0.234   0.106   -0.050
## OverallCond       -0.059       -0.215      -0.053  -0.006    0.162
## YearBuilt          0.028        0.684       0.118   0.014   -0.218
## YearRemodAdd       0.041        0.514       0.083   0.014   -0.105
## MasVnrArea         0.023        0.370       0.179   0.104   -0.043
## BsmtFinSF1        -0.070        0.244       0.216   0.214   -0.007
## BsmtFinSF2        -0.066       -0.034       0.043   0.111    0.017
## BsmtUnfSF         -0.141        0.213       0.122  -0.003    0.051
## TotalBsmtSF       -0.239        0.455       0.363   0.261    0.050
## CentralAir        -0.102        0.268       0.069   0.050   -0.018
## X1stFlrSF         -0.252        0.403       0.414   0.299    0.074
## X2ndFlrSF          0.308        0.135       0.072   0.051    0.084
## LowQualFinSF       0.046       -0.088       0.037   0.005    0.030
## GrLivArea          0.075        0.400       0.368   0.263    0.127
## TotRmsAbvGrd       0.040        0.271       0.320   0.190    0.198
## GarageCars        -0.040        0.571       0.270   0.155   -0.007
## GarageArea        -0.099        0.527       0.324   0.180    0.061
## WoodDeckSF        -0.013        0.224       0.077   0.172    0.013
## OpenPorchSF       -0.006        0.205       0.137   0.085    0.037
## EnclosedPorch     -0.012       -0.216       0.010  -0.018    0.115
## ScreenPorch       -0.026        0.013       0.038   0.043    0.028
## X3SsnPorch        -0.044        0.024       0.062   0.020    0.023
## PoolArea           0.008       -0.007       0.181   0.078    0.028
## MiscVal           -0.008       -0.040       0.001   0.038    0.010
## MoSold            -0.014        0.049       0.010   0.001    0.026
## YrSold            -0.021       -0.028       0.007  -0.014   -0.002
## SaleCondition      0.024       -0.139      -0.072   0.006    0.027
## SalePrice         -0.084        0.696       0.335   0.264    0.086
##               OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea
## MSSubClass          0.033      -0.059     0.028        0.041      0.023
## Neighborhood        0.671      -0.215     0.684        0.514      0.370
## LotFrontage         0.234      -0.053     0.118        0.083      0.179
## LotArea             0.106      -0.006     0.014        0.014      0.104
## BldgType           -0.050       0.162    -0.218       -0.105     -0.043
## OverallQual         1.000      -0.092     0.572        0.551      0.410
## OverallCond        -0.092       1.000    -0.376        0.074     -0.128
## YearBuilt           0.572      -0.376     1.000        0.593      0.315
## YearRemodAdd        0.551       0.074     0.593        1.000      0.179
## MasVnrArea          0.410      -0.128     0.315        0.179      1.000
## BsmtFinSF1          0.240      -0.046     0.250        0.128      0.264
## BsmtFinSF2         -0.059       0.040    -0.049       -0.068     -0.072
## BsmtUnfSF           0.308      -0.137     0.149        0.181      0.114
## TotalBsmtSF         0.538      -0.171     0.391        0.291      0.362
## CentralAir          0.272       0.119     0.382        0.299      0.127
## X1stFlrSF           0.476      -0.144     0.282        0.240      0.342
## X2ndFlrSF           0.295       0.029     0.010        0.140      0.174
## LowQualFinSF       -0.030       0.025    -0.184       -0.062     -0.069
## GrLivArea           0.593      -0.080     0.199        0.287      0.390
## TotRmsAbvGrd        0.427      -0.058     0.096        0.192      0.280
## GarageCars          0.601      -0.186     0.538        0.421      0.364
## GarageArea          0.562      -0.152     0.479        0.372      0.373
## WoodDeckSF          0.239      -0.003     0.225        0.206      0.159
## OpenPorchSF         0.309      -0.033     0.189        0.226      0.125
## EnclosedPorch      -0.114       0.070    -0.387       -0.194     -0.110
## ScreenPorch         0.065       0.055    -0.050       -0.039      0.061
## X3SsnPorch          0.030       0.026     0.031        0.045      0.019
## PoolArea            0.065      -0.002     0.005        0.006      0.012
## MiscVal            -0.031       0.069    -0.034       -0.010     -0.030
## MoSold              0.071      -0.004     0.012        0.021     -0.006
## YrSold             -0.027       0.044    -0.014        0.036     -0.008
## SaleCondition      -0.143       0.162    -0.158       -0.121     -0.084
## SalePrice           0.791      -0.078     0.523        0.507      0.475
##               BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF CentralAir
## MSSubClass        -0.070     -0.066    -0.141      -0.239     -0.102
## Neighborhood       0.244     -0.034     0.213       0.455      0.268
## LotFrontage        0.216      0.043     0.122       0.363      0.069
## LotArea            0.214      0.111    -0.003       0.261      0.050
## BldgType          -0.007      0.017     0.051       0.050     -0.018
## OverallQual        0.240     -0.059     0.308       0.538      0.272
## OverallCond       -0.046      0.040    -0.137      -0.171      0.119
## YearBuilt          0.250     -0.049     0.149       0.391      0.382
## YearRemodAdd       0.128     -0.068     0.181       0.291      0.299
## MasVnrArea         0.264     -0.072     0.114       0.362      0.127
## BsmtFinSF1         1.000     -0.050    -0.495       0.522      0.166
## BsmtFinSF2        -0.050      1.000    -0.209       0.105      0.040
## BsmtUnfSF         -0.495     -0.209     1.000       0.415      0.020
## TotalBsmtSF        0.522      0.105     0.415       1.000      0.208
## CentralAir         0.166      0.040     0.020       0.208      1.000
## X1stFlrSF          0.446      0.097     0.318       0.820      0.147
## X2ndFlrSF         -0.137     -0.099     0.004      -0.175     -0.012
## LowQualFinSF      -0.065      0.015     0.028      -0.033     -0.050
## GrLivArea          0.208     -0.010     0.240       0.455      0.094
## TotRmsAbvGrd       0.044     -0.035     0.251       0.286      0.035
## GarageCars         0.224     -0.038     0.214       0.435      0.234
## GarageArea         0.297     -0.018     0.183       0.487      0.231
## WoodDeckSF         0.204      0.068    -0.005       0.232      0.146
## OpenPorchSF        0.112      0.003     0.129       0.247      0.026
## EnclosedPorch     -0.102      0.037    -0.003      -0.095     -0.157
## ScreenPorch        0.062      0.089    -0.013       0.084      0.051
## X3SsnPorch         0.026     -0.030     0.021       0.037      0.031
## PoolArea           0.140      0.042    -0.035       0.126      0.018
## MiscVal            0.004      0.005    -0.024      -0.018     -0.002
## MoSold            -0.016     -0.015     0.035       0.013      0.010
## YrSold             0.014      0.032    -0.041      -0.015     -0.009
## SaleCondition     -0.020      0.041    -0.154      -0.160     -0.015
## SalePrice          0.386     -0.011     0.214       0.614      0.251
##               X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea TotRmsAbvGrd
## MSSubClass       -0.252     0.308        0.046     0.075        0.040
## Neighborhood      0.403     0.135       -0.088     0.400        0.271
## LotFrontage       0.414     0.072        0.037     0.368        0.320
## LotArea           0.299     0.051        0.005     0.263        0.190
## BldgType          0.074     0.084        0.030     0.127        0.198
## OverallQual       0.476     0.295       -0.030     0.593        0.427
## OverallCond      -0.144     0.029        0.025    -0.080       -0.058
## YearBuilt         0.282     0.010       -0.184     0.199        0.096
## YearRemodAdd      0.240     0.140       -0.062     0.287        0.192
## MasVnrArea        0.342     0.174       -0.069     0.390        0.280
## BsmtFinSF1        0.446    -0.137       -0.065     0.208        0.044
## BsmtFinSF2        0.097    -0.099        0.015    -0.010       -0.035
## BsmtUnfSF         0.318     0.004        0.028     0.240        0.251
## TotalBsmtSF       0.820    -0.175       -0.033     0.455        0.286
## CentralAir        0.147    -0.012       -0.050     0.094        0.035
## X1stFlrSF         1.000    -0.203       -0.014     0.566        0.410
## X2ndFlrSF        -0.203     1.000        0.063     0.688        0.616
## LowQualFinSF     -0.014     0.063        1.000     0.135        0.131
## GrLivArea         0.566     0.688        0.135     1.000        0.825
## TotRmsAbvGrd      0.410     0.616        0.131     0.825        1.000
## GarageCars        0.439     0.184       -0.094     0.467        0.362
## GarageArea        0.490     0.138       -0.068     0.469        0.338
## WoodDeckSF        0.235     0.092       -0.025     0.247        0.166
## OpenPorchSF       0.212     0.208        0.018     0.330        0.234
## EnclosedPorch    -0.065     0.062        0.061     0.009        0.004
## ScreenPorch       0.089     0.041        0.027     0.102        0.059
## X3SsnPorch        0.056    -0.024       -0.004     0.021       -0.007
## PoolArea          0.132     0.081        0.062     0.170        0.084
## MiscVal          -0.021     0.016       -0.004    -0.002        0.025
## MoSold            0.031     0.035       -0.022     0.050        0.037
## YrSold           -0.014    -0.029       -0.029    -0.037       -0.035
## SaleCondition    -0.159     0.032       -0.012    -0.092       -0.093
## SalePrice         0.606     0.319       -0.026     0.709        0.534
##               GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch
## MSSubClass        -0.040     -0.099     -0.013      -0.006        -0.012
## Neighborhood       0.571      0.527      0.224       0.205        -0.216
## LotFrontage        0.270      0.324      0.077       0.137         0.010
## LotArea            0.155      0.180      0.172       0.085        -0.018
## BldgType          -0.007      0.061      0.013       0.037         0.115
## OverallQual        0.601      0.562      0.239       0.309        -0.114
## OverallCond       -0.186     -0.152     -0.003      -0.033         0.070
## YearBuilt          0.538      0.479      0.225       0.189        -0.387
## YearRemodAdd       0.421      0.372      0.206       0.226        -0.194
## MasVnrArea         0.364      0.373      0.159       0.125        -0.110
## BsmtFinSF1         0.224      0.297      0.204       0.112        -0.102
## BsmtFinSF2        -0.038     -0.018      0.068       0.003         0.037
## BsmtUnfSF          0.214      0.183     -0.005       0.129        -0.003
## TotalBsmtSF        0.435      0.487      0.232       0.247        -0.095
## CentralAir         0.234      0.231      0.146       0.026        -0.157
## X1stFlrSF          0.439      0.490      0.235       0.212        -0.065
## X2ndFlrSF          0.184      0.138      0.092       0.208         0.062
## LowQualFinSF      -0.094     -0.068     -0.025       0.018         0.061
## GrLivArea          0.467      0.469      0.247       0.330         0.009
## TotRmsAbvGrd       0.362      0.338      0.166       0.234         0.004
## GarageCars         1.000      0.882      0.226       0.214        -0.151
## GarageArea         0.882      1.000      0.225       0.241        -0.122
## WoodDeckSF         0.226      0.225      1.000       0.059        -0.126
## OpenPorchSF        0.214      0.241      0.059       1.000        -0.093
## EnclosedPorch     -0.151     -0.122     -0.126      -0.093         1.000
## ScreenPorch        0.050      0.051     -0.074       0.074        -0.083
## X3SsnPorch         0.036      0.035     -0.033      -0.006        -0.037
## PoolArea           0.021      0.061      0.073       0.061         0.054
## MiscVal           -0.043     -0.027     -0.010      -0.019         0.018
## MoSold             0.041      0.028      0.021       0.071        -0.029
## YrSold            -0.039     -0.027      0.022      -0.058        -0.010
## SaleCondition     -0.122     -0.131      0.027      -0.096         0.026
## SalePrice          0.640      0.623      0.324       0.316        -0.129
##               ScreenPorch X3SsnPorch PoolArea MiscVal MoSold YrSold
## MSSubClass         -0.026     -0.044    0.008  -0.008 -0.014 -0.021
## Neighborhood        0.013      0.024   -0.007  -0.040  0.049 -0.028
## LotFrontage         0.038      0.062    0.181   0.001  0.010  0.007
## LotArea             0.043      0.020    0.078   0.038  0.001 -0.014
## BldgType            0.028      0.023    0.028   0.010  0.026 -0.002
## OverallQual         0.065      0.030    0.065  -0.031  0.071 -0.027
## OverallCond         0.055      0.026   -0.002   0.069 -0.004  0.044
## YearBuilt          -0.050      0.031    0.005  -0.034  0.012 -0.014
## YearRemodAdd       -0.039      0.045    0.006  -0.010  0.021  0.036
## MasVnrArea          0.061      0.019    0.012  -0.030 -0.006 -0.008
## BsmtFinSF1          0.062      0.026    0.140   0.004 -0.016  0.014
## BsmtFinSF2          0.089     -0.030    0.042   0.005 -0.015  0.032
## BsmtUnfSF          -0.013      0.021   -0.035  -0.024  0.035 -0.041
## TotalBsmtSF         0.084      0.037    0.126  -0.018  0.013 -0.015
## CentralAir          0.051      0.031    0.018  -0.002  0.010 -0.009
## X1stFlrSF           0.089      0.056    0.132  -0.021  0.031 -0.014
## X2ndFlrSF           0.041     -0.024    0.081   0.016  0.035 -0.029
## LowQualFinSF        0.027     -0.004    0.062  -0.004 -0.022 -0.029
## GrLivArea           0.102      0.021    0.170  -0.002  0.050 -0.037
## TotRmsAbvGrd        0.059     -0.007    0.084   0.025  0.037 -0.035
## GarageCars          0.050      0.036    0.021  -0.043  0.041 -0.039
## GarageArea          0.051      0.035    0.061  -0.027  0.028 -0.027
## WoodDeckSF         -0.074     -0.033    0.073  -0.010  0.021  0.022
## OpenPorchSF         0.074     -0.006    0.061  -0.019  0.071 -0.058
## EnclosedPorch      -0.083     -0.037    0.054   0.018 -0.029 -0.010
## ScreenPorch         1.000     -0.031    0.051   0.032  0.023  0.011
## X3SsnPorch         -0.031      1.000   -0.008   0.000  0.029  0.019
## PoolArea            0.051     -0.008    1.000   0.030 -0.034 -0.060
## MiscVal             0.032      0.000    0.030   1.000 -0.006  0.005
## MoSold              0.023      0.029   -0.034  -0.006  1.000 -0.146
## YrSold              0.011      0.019   -0.060   0.005 -0.146  1.000
## SaleCondition       0.011     -0.009   -0.069   0.037 -0.072  0.131
## SalePrice           0.111      0.045    0.092  -0.021  0.046 -0.029
##               SaleCondition SalePrice
## MSSubClass            0.024    -0.084
## Neighborhood         -0.139     0.696
## LotFrontage          -0.072     0.335
## LotArea               0.006     0.264
## BldgType              0.027     0.086
## OverallQual          -0.143     0.791
## OverallCond           0.162    -0.078
## YearBuilt            -0.158     0.523
## YearRemodAdd         -0.121     0.507
## MasVnrArea           -0.084     0.475
## BsmtFinSF1           -0.020     0.386
## BsmtFinSF2            0.041    -0.011
## BsmtUnfSF            -0.154     0.214
## TotalBsmtSF          -0.160     0.614
## CentralAir           -0.015     0.251
## X1stFlrSF            -0.159     0.606
## X2ndFlrSF             0.032     0.319
## LowQualFinSF         -0.012    -0.026
## GrLivArea            -0.092     0.709
## TotRmsAbvGrd         -0.093     0.534
## GarageCars           -0.122     0.640
## GarageArea           -0.131     0.623
## WoodDeckSF            0.027     0.324
## OpenPorchSF          -0.096     0.316
## EnclosedPorch         0.026    -0.129
## ScreenPorch           0.011     0.111
## X3SsnPorch           -0.009     0.045
## PoolArea             -0.069     0.092
## MiscVal               0.037    -0.021
## MoSold               -0.072     0.046
## YrSold                0.131    -0.029
## SaleCondition         1.000    -0.154
## SalePrice            -0.154     1.000
M <- cor(train1)
corrplot(M, method="number")

Collinearity

From the correlation matrix and plot, ‘TotalBsmtSF’ is highly correlated with ‘X1stFlrSF’ (0.820) and moderately correlated with ‘BsmtFinSF1’ (0.522) and ‘GrLivArea’ (0.455), so we drop ‘TotalBsmtSF’.

‘GrLivArea’ is also highly correlated with ‘TotRmsAbvGrd’ (0.825), ‘X2ndFlrSF’ (0.688) and ‘X1stFlrSF’ (0.566), so we drop those three variables as well.

train1$TotalBsmtSF <- NULL
train1$TotRmsAbvGrd <- NULL
train1$X1stFlrSF <- NULL
train1$X2ndFlrSF <- NULL
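
Reading a 33 x 33 correlation matrix by eye is error-prone, so the strongly correlated pairs can also be listed programmatically. The sketch below is one possible way to do it. The helper `high_cor_pairs` and the 0.8 threshold are assumptions for illustration, and the small matrix is filled with values copied from the correlation output above.

```r
# List variable pairs whose absolute correlation exceeds a threshold
high_cor_pairs <- function(cor_mat, threshold = 0.8) {
  # Keep only the upper triangle so each pair appears once
  idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
}

# Small matrix built from values in the correlation output above
vars <- c("TotalBsmtSF", "X1stFlrSF", "GrLivArea", "TotRmsAbvGrd")
m <- matrix(c(1.000, 0.820, 0.455, 0.286,
              0.820, 1.000, 0.566, 0.410,
              0.455, 0.566, 1.000, 0.825,
              0.286, 0.410, 0.825, 1.000),
            nrow = 4, dimnames = list(vars, vars))
pairs_found <- high_cor_pairs(m, threshold = 0.8)
print(pairs_found)
```

Applied to the full matrix, the same threshold would also flag ‘GarageCars’ and ‘GarageArea’ (0.882), a pair that is kept in the models below.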

Outliers

From the density plots and summary statistics, the following variables (“LotFrontage”, “LotArea”, “MasVnrArea”, “BsmtFinSF1”, “BsmtFinSF2”, “BsmtUnfSF”, “GrLivArea”, “SalePrice”, “WoodDeckSF”, “OpenPorchSF”, “EnclosedPorch”, “ScreenPorch”, “X3SsnPorch”, “PoolArea”, “MiscVal”) may contain outliers. We clamp values that fall outside two quantile cutoffs to the cutoff values.

Replace outliers

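# Clamp x to the range between two of its quantiles
replaceOutliers <- function(x) {
    # Note: the lower cutoff here is the 0.5 quantile (the median), not 0.05,
    # so every value below the median is raised to the median
    quantiles <- quantile(x, c(0.5, 0.95))
    x[x < quantiles[1]] <- quantiles[1]
    # Values above the 0.95 quantile are lowered to it
    x[x > quantiles[2]] <- quantiles[2]
    return(x)
}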
   
train1$LotFrontage <- replaceOutliers(train1$LotFrontage)
train1$LotArea <- replaceOutliers(train1$LotArea)
train1$MasVnrArea <- replaceOutliers(train1$MasVnrArea)
train1$BsmtFinSF1 <- replaceOutliers(train1$BsmtFinSF1)
train1$BsmtFinSF2 <- replaceOutliers(train1$BsmtFinSF2)
train1$BsmtUnfSF <- replaceOutliers(train1$BsmtUnfSF)
train1$GrLivArea <- replaceOutliers(train1$GrLivArea)
train1$SalePrice <- replaceOutliers(train1$SalePrice)
train1$WoodDeckSF <- replaceOutliers(train1$WoodDeckSF)
train1$OpenPorchSF <- replaceOutliers(train1$OpenPorchSF)
train1$EnclosedPorch <- replaceOutliers(train1$EnclosedPorch)
train1$ScreenPorch <- replaceOutliers(train1$ScreenPorch)
train1$X3SsnPorch <- replaceOutliers(train1$X3SsnPorch)
train1$PoolArea <- replaceOutliers(train1$PoolArea)
train1$MiscVal <- replaceOutliers(train1$MiscVal)
summary(train1)
##    MSSubClass     Neighborhood    LotFrontage        LotArea     
##  Min.   : 20.0   Min.   : 1.00   Min.   : 70.05   Min.   : 9478  
##  1st Qu.: 20.0   1st Qu.: 7.00   1st Qu.: 70.05   1st Qu.: 9478  
##  Median : 50.0   Median :13.00   Median : 70.05   Median : 9479  
##  Mean   : 56.9   Mean   :12.84   Mean   : 75.74   Mean   :10917  
##  3rd Qu.: 70.0   3rd Qu.:17.00   3rd Qu.: 79.00   3rd Qu.:11602  
##  Max.   :190.0   Max.   :25.00   Max.   :104.00   Max.   :17401  
##     BldgType      OverallQual      OverallCond      YearBuilt   
##  Min.   :1.000   Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  1st Qu.:5.000   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Median :5.000   Median : 6.000   Median :5.000   Median :1973  
##  Mean   :4.507   Mean   : 6.099   Mean   :5.575   Mean   :1971  
##  3rd Qu.:5.000   3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##  Max.   :5.000   Max.   :10.000   Max.   :9.000   Max.   :2010  
##   YearRemodAdd    MasVnrArea       BsmtFinSF1       BsmtFinSF2    
##  Min.   :1950   Min.   :  0.00   Min.   : 383.5   Min.   :  0.00  
##  1st Qu.:1967   1st Qu.:  0.00   1st Qu.: 383.5   1st Qu.:  0.00  
##  Median :1994   Median :  0.00   Median : 383.8   Median :  0.00  
##  Mean   :1985   Mean   : 92.62   Mean   : 583.3   Mean   : 32.93  
##  3rd Qu.:2004   3rd Qu.:164.25   3rd Qu.: 712.2   3rd Qu.:  0.00  
##  Max.   :2010   Max.   :456.00   Max.   :1274.0   Max.   :396.20  
##    BsmtUnfSF        CentralAir      LowQualFinSF       GrLivArea   
##  Min.   : 477.5   Min.   :0.0000   Min.   :  0.000   Min.   :1464  
##  1st Qu.: 477.5   1st Qu.:1.0000   1st Qu.:  0.000   1st Qu.:1464  
##  Median : 478.2   Median :1.0000   Median :  0.000   Median :1464  
##  Mean   : 685.3   Mean   :0.9349   Mean   :  5.845   Mean   :1666  
##  3rd Qu.: 808.0   3rd Qu.:1.0000   3rd Qu.:  0.000   3rd Qu.:1777  
##  Max.   :1468.0   Max.   :1.0000   Max.   :572.000   Max.   :2466  
##    GarageCars      GarageArea       WoodDeckSF      OpenPorchSF    
##  Min.   :0.000   Min.   :   0.0   Min.   :  0.00   Min.   : 25.00  
##  1st Qu.:1.000   1st Qu.: 334.5   1st Qu.:  0.00   1st Qu.: 25.00  
##  Median :2.000   Median : 480.0   Median :  0.00   Median : 25.00  
##  Mean   :1.767   Mean   : 473.0   Mean   : 88.89   Mean   : 54.37  
##  3rd Qu.:2.000   3rd Qu.: 576.0   3rd Qu.:168.00   3rd Qu.: 68.00  
##  Max.   :4.000   Max.   :1418.0   Max.   :335.00   Max.   :175.05  
##  EnclosedPorch     ScreenPorch       X3SsnPorch    PoolArea    MiscVal 
##  Min.   :  0.00   Min.   :  0.00   Min.   :0    Min.   :0   Min.   :0  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:0    1st Qu.:0   1st Qu.:0  
##  Median :  0.00   Median :  0.00   Median :0    Median :0   Median :0  
##  Mean   : 19.15   Mean   : 11.58   Mean   :0    Mean   :0   Mean   :0  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:0    3rd Qu.:0   3rd Qu.:0  
##  Max.   :180.15   Max.   :160.00   Max.   :0    Max.   :0   Max.   :0  
##      MoSold           YrSold     SaleCondition      SalePrice     
##  Min.   : 1.000   Min.   :2006   Min.   :0.0000   Min.   :163000  
##  1st Qu.: 5.000   1st Qu.:2007   1st Qu.:1.0000   1st Qu.:163000  
##  Median : 6.000   Median :2008   Median :1.0000   Median :163000  
##  Mean   : 6.322   Mean   :2008   Mean   :0.8205   Mean   :195479  
##  3rd Qu.: 8.000   3rd Qu.:2009   3rd Qu.:1.0000   3rd Qu.:214000  
##  Max.   :12.000   Max.   :2010   Max.   :1.0000   Max.   :326100
vis_miss(train1)
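
In the summary above, the clamped columns show identical Min, 1st Qu. and Median values (for example SalePrice at 163000), which is what a lower cutoff at the 0.5 quantile produces. If a 5%/95% clamp was the intent (an assumption; the source does not say), a minimal sketch could look like the following. The function `winsorize` and its default cutoffs are hypothetical names, not part of the original analysis.

```r
# Clamp a numeric vector to its [lower, upper] quantiles (winsorizing).
# The 0.05 default is an assumption about the intended lower cutoff.
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

x <- c(1:99, 1000)   # hypothetical data with one extreme value
w <- winsorize(x)
print(range(w))                 # extremes are pulled in to the cutoffs
print(median(w) == median(x))   # the middle of the distribution is untouched
```

Columns that are zero in more than 95% of rows would still collapse to all zeros under a 95% upper cutoff, as the summary shows for ‘X3SsnPorch’, ‘PoolArea’ and ‘MiscVal’, so such columns may be better left unclamped.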

Full Model (including all above variables)

full.model <- lm(SalePrice~MSSubClass+Neighborhood+LotFrontage+LotArea+BldgType+OverallQual+OverallCond+YearBuilt+YearRemodAdd+MasVnrArea+BsmtFinSF1+BsmtFinSF2+BsmtUnfSF+CentralAir+LowQualFinSF+GrLivArea+GarageCars+GarageArea+WoodDeckSF+OpenPorchSF+EnclosedPorch+ScreenPorch+X3SsnPorch+PoolArea+MiscVal+MoSold+YrSold+SaleCondition, data=train1)
summary(full.model)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + Neighborhood + LotFrontage + 
##     LotArea + BldgType + OverallQual + OverallCond + YearBuilt + 
##     YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + 
##     CentralAir + LowQualFinSF + GrLivArea + GarageCars + GarageArea + 
##     WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + 
##     X3SsnPorch + PoolArea + MiscVal + MoSold + YrSold + SaleCondition, 
##     data = train1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -166284  -14553   -1482   13177   90498 
## 
## Coefficients: (3 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.040e+06  9.531e+05  -1.091 0.275247    
## MSSubClass    -1.264e+02  2.528e+01  -4.998 6.50e-07 ***
## Neighborhood   1.258e+03  1.482e+02   8.484  < 2e-16 ***
## LotFrontage    1.275e+02  7.182e+01   1.775 0.076160 .  
## LotArea        8.315e-01  3.262e-01   2.549 0.010903 *  
## BldgType      -1.282e+03  9.049e+02  -1.416 0.156874    
## OverallQual    1.036e+04  7.544e+02  13.739  < 2e-16 ***
## OverallCond    5.883e+02  7.078e+02   0.831 0.405982    
## YearBuilt     -2.133e+00  4.160e+01  -0.051 0.959110    
## YearRemodAdd   1.617e+02  4.328e+01   3.737 0.000194 ***
## MasVnrArea     1.333e+01  5.159e+00   2.584 0.009863 ** 
## BsmtFinSF1     3.308e+01  3.104e+00  10.658  < 2e-16 ***
## BsmtFinSF2     2.141e+00  6.601e+00   0.324 0.745723    
## BsmtUnfSF      8.201e+00  2.811e+00   2.918 0.003583 ** 
## CentralAir    -1.283e+04  2.891e+03  -4.437 9.84e-06 ***
## LowQualFinSF  -2.113e+01  1.318e+01  -1.603 0.109051    
## GrLivArea      5.762e+01  2.874e+00  20.048  < 2e-16 ***
## GarageCars    -5.585e+02  1.879e+03  -0.297 0.766324    
## GarageArea     1.198e+01  6.367e+00   1.882 0.059984 .  
## WoodDeckSF     2.028e+01  6.182e+00   3.281 0.001060 ** 
## OpenPorchSF    3.093e+01  1.486e+01   2.082 0.037559 *  
## EnclosedPorch  9.064e+00  1.373e+01   0.660 0.509327    
## ScreenPorch    1.299e+01  1.582e+01   0.821 0.411689    
## X3SsnPorch            NA         NA      NA       NA    
## PoolArea              NA         NA      NA       NA    
## MiscVal               NA         NA      NA       NA    
## MoSold         1.382e+02  2.315e+02   0.597 0.550572    
## YrSold         3.562e+02  4.748e+02   0.750 0.453224    
## SaleCondition -5.034e+03  1.673e+03  -3.009 0.002668 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23430 on 1434 degrees of freedom
## Multiple R-squared:  0.7773, Adjusted R-squared:  0.7734 
## F-statistic: 200.2 on 25 and 1434 DF,  p-value: < 2.2e-16
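
The three NA coefficients (‘X3SsnPorch’, ‘PoolArea’, ‘MiscVal’) correspond to the columns whose summary shows Min = Max = 0 after clamping. A constant column is collinear with the intercept, so lm() cannot estimate a coefficient for it. The sketch below reproduces this on hypothetical toy data (`toy`, `constant_cols`, `fit` and `fit2` are illustrative names) and shows one way to detect and drop such columns before fitting.

```r
# Hypothetical toy data: 'z' is constant, like a column clamped to all zeros
set.seed(1)
toy <- data.frame(x = rnorm(20), z = rep(0, 20))
toy$y <- 2 * toy$x + rnorm(20)

# Columns with a single distinct value carry no information for lm()
constant_cols <- names(toy)[sapply(toy, function(col) length(unique(col)) == 1)]
print(constant_cols)

# lm() reports NA for such a predictor because it is collinear with the intercept
fit <- lm(y ~ x + z, data = toy)
print(coef(fit))

# Refit without the constant columns
fit2 <- lm(y ~ ., data = toy[setdiff(names(toy), constant_cols)])
print(coef(fit2))
```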

Reduced Model: predictors significant at the 0.05 level in the full model, plus YearBuilt

reduced.model <- lm(SalePrice~MSSubClass+Neighborhood+LotArea+OverallQual+YearBuilt+YearRemodAdd+MasVnrArea+BsmtFinSF1+BsmtUnfSF+CentralAir+GrLivArea+WoodDeckSF+OpenPorchSF+SaleCondition, data=train1)
summary(reduced.model)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + Neighborhood + LotArea + 
##     OverallQual + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtUnfSF + CentralAir + GrLivArea + WoodDeckSF + OpenPorchSF + 
##     SaleCondition, data = train1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -158692  -14854   -1359   13074   89497 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.476e+05  8.463e+04  -4.107 4.24e-05 ***
## MSSubClass    -1.103e+02  1.533e+01  -7.192 1.02e-12 ***
## Neighborhood   1.296e+03  1.477e+02   8.770  < 2e-16 ***
## LotArea        1.017e+00  3.114e-01   3.267 0.001113 ** 
## OverallQual    1.083e+04  7.304e+02  14.830  < 2e-16 ***
## YearBuilt      1.693e+00  3.314e+01   0.051 0.959270    
## YearRemodAdd   1.713e+02  3.993e+01   4.290 1.91e-05 ***
## MasVnrArea     1.682e+01  5.083e+00   3.309 0.000959 ***
## BsmtFinSF1     3.474e+01  2.897e+00  11.993  < 2e-16 ***
## BsmtUnfSF      9.070e+00  2.594e+00   3.497 0.000485 ***
## CentralAir    -1.183e+04  2.779e+03  -4.258 2.20e-05 ***
## GrLivArea      5.753e+01  2.700e+00  21.310  < 2e-16 ***
## WoodDeckSF     1.951e+01  6.078e+00   3.210 0.001357 ** 
## OpenPorchSF    3.153e+01  1.480e+01   2.131 0.033293 *  
## SaleCondition -4.975e+03  1.650e+03  -3.016 0.002608 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23490 on 1445 degrees of freedom
## Multiple R-squared:  0.7744, Adjusted R-squared:  0.7722 
## F-statistic: 354.3 on 14 and 1445 DF,  p-value: < 2.2e-16
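
The reduced model gives up very little fit (adjusted R-squared 0.7722 versus 0.7734 for the full model). Because the reduced model is nested in the full one, the two can also be compared formally. The sketch below uses the built-in mtcars data as a stand-in, so the variable names and results are illustrative only and say nothing about the housing models.

```r
# Built-in mtcars data used as a stand-in for train1
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: do the dropped predictors jointly improve the fit?
cmp <- anova(reduced, full)
print(cmp)

# AIC for each model (lower is better)
print(AIC(reduced, full))
```

A large p-value in the F-test would suggest that the dropped predictors add little beyond the ones that were kept.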

Backward elimination

backward.model <- step(full.model, direction = "backward")
## Start:  AIC=29406.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd + 
##     MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir + 
##     LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF + 
##     OpenPorchSF + EnclosedPorch + ScreenPorch + X3SsnPorch + 
##     PoolArea + MiscVal + MoSold + YrSold + SaleCondition
## 
## 
## Step:  AIC=29406.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd + 
##     MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir + 
##     LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF + 
##     OpenPorchSF + EnclosedPorch + ScreenPorch + X3SsnPorch + 
##     PoolArea + MoSold + YrSold + SaleCondition
## 
## 
## Step:  AIC=29406.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd + 
##     MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir + 
##     LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF + 
##     OpenPorchSF + EnclosedPorch + ScreenPorch + X3SsnPorch + 
##     MoSold + YrSold + SaleCondition
## 
## 
## Step:  AIC=29406.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd + 
##     MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir + 
##     LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF + 
##     OpenPorchSF + EnclosedPorch + ScreenPorch + MoSold + YrSold + 
##     SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## - YearBuilt      1 1.4435e+06 7.8720e+11 29404
## - GarageCars     1 4.8502e+07 7.8725e+11 29404
## - BsmtFinSF2     1 5.7750e+07 7.8726e+11 29404
## - MoSold         1 1.9568e+08 7.8739e+11 29405
## - EnclosedPorch  1 2.3916e+08 7.8744e+11 29405
## - YrSold         1 3.0900e+08 7.8751e+11 29405
## - ScreenPorch    1 3.7017e+08 7.8757e+11 29405
## - OverallCond    1 3.7930e+08 7.8758e+11 29405
## <none>                        7.8720e+11 29406
## - BldgType       1 1.1013e+09 7.8830e+11 29406
## - LowQualFinSF   1 1.4114e+09 7.8861e+11 29407
## - LotFrontage    1 1.7290e+09 7.8893e+11 29407
## - GarageArea     1 1.9452e+09 7.8914e+11 29408
## - OpenPorchSF    1 2.3786e+09 7.8958e+11 29409
## - LotArea        1 3.5671e+09 7.9077e+11 29411
## - MasVnrArea     1 3.6655e+09 7.9086e+11 29411
## - BsmtUnfSF      1 4.6727e+09 7.9187e+11 29413
## - SaleCondition  1 4.9700e+09 7.9217e+11 29413
## - WoodDeckSF     1 5.9085e+09 7.9311e+11 29415
## - YearRemodAdd   1 7.6649e+09 7.9486e+11 29418
## - CentralAir     1 1.0805e+10 7.9800e+11 29424
## - MSSubClass     1 1.3713e+10 8.0091e+11 29429
## - Neighborhood   1 3.9512e+10 8.2671e+11 29476
## - BsmtFinSF1     1 6.2360e+10 8.4956e+11 29515
## - OverallQual    1 1.0362e+11 8.9082e+11 29585
## - GrLivArea      1 2.2063e+11 1.0078e+12 29765
## 
## Step:  AIC=29404.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea + 
##     BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir + LowQualFinSF + 
##     GrLivArea + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + 
##     EnclosedPorch + ScreenPorch + MoSold + YrSold + SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## - GarageCars     1 4.9934e+07 7.8725e+11 29402
## - BsmtFinSF2     1 5.7201e+07 7.8726e+11 29402
## - MoSold         1 1.9758e+08 7.8740e+11 29403
## - EnclosedPorch  1 2.8229e+08 7.8748e+11 29403
## - YrSold         1 3.0970e+08 7.8751e+11 29403
## - ScreenPorch    1 3.7937e+08 7.8758e+11 29403
## - OverallCond    1 5.2126e+08 7.8772e+11 29403
## <none>                        7.8720e+11 29404
## - BldgType       1 1.1049e+09 7.8831e+11 29404
## - LowQualFinSF   1 1.4346e+09 7.8864e+11 29405
## - LotFrontage    1 1.7289e+09 7.8893e+11 29405
## - GarageArea     1 1.9439e+09 7.8914e+11 29406
## - OpenPorchSF    1 2.3780e+09 7.8958e+11 29407
## - LotArea        1 3.5998e+09 7.9080e+11 29409
## - MasVnrArea     1 3.6784e+09 7.9088e+11 29409
## - BsmtUnfSF      1 4.6799e+09 7.9188e+11 29411
## - SaleCondition  1 4.9688e+09 7.9217e+11 29411
## - WoodDeckSF     1 5.9189e+09 7.9312e+11 29413
## - YearRemodAdd   1 9.0187e+09 7.9622e+11 29419
## - CentralAir     1 1.1822e+10 7.9902e+11 29424
## - MSSubClass     1 1.3732e+10 8.0093e+11 29427
## - Neighborhood   1 4.4157e+10 8.3136e+11 29482
## - BsmtFinSF1     1 6.2377e+10 8.4958e+11 29513
## - OverallQual    1 1.0592e+11 8.9312e+11 29586
## - GrLivArea      1 2.2865e+11 1.0159e+12 29774
## 
## Step:  AIC=29402.21
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea + 
##     BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir + LowQualFinSF + 
##     GrLivArea + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + 
##     ScreenPorch + MoSold + YrSold + SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## - BsmtFinSF2     1 6.1445e+07 7.8731e+11 29400
## - MoSold         1 1.9406e+08 7.8744e+11 29401
## - EnclosedPorch  1 2.9205e+08 7.8754e+11 29401
## - YrSold         1 3.1835e+08 7.8757e+11 29401
## - ScreenPorch    1 3.7533e+08 7.8763e+11 29401
## - OverallCond    1 5.5615e+08 7.8781e+11 29401
## <none>                        7.8725e+11 29402
## - BldgType       1 1.0881e+09 7.8834e+11 29402
## - LowQualFinSF   1 1.4036e+09 7.8865e+11 29403
## - LotFrontage    1 1.7164e+09 7.8897e+11 29403
## - OpenPorchSF    1 2.4329e+09 7.8968e+11 29405
## - LotArea        1 3.5761e+09 7.9083e+11 29407
## - MasVnrArea     1 3.6739e+09 7.9092e+11 29407
## - GarageArea     1 4.0137e+09 7.9126e+11 29408
## - BsmtUnfSF      1 4.6647e+09 7.9192e+11 29409
## - SaleCondition  1 5.0151e+09 7.9227e+11 29410
## - WoodDeckSF     1 5.9006e+09 7.9315e+11 29411
## - YearRemodAdd   1 8.9708e+09 7.9622e+11 29417
## - CentralAir     1 1.1834e+10 7.9908e+11 29422
## - MSSubClass     1 1.3758e+10 8.0101e+11 29426
## - Neighborhood   1 4.4252e+10 8.3150e+11 29480
## - BsmtFinSF1     1 6.2663e+10 8.4991e+11 29512
## - OverallQual    1 1.0673e+11 8.9398e+11 29586
## - GrLivArea      1 2.2863e+11 1.0159e+12 29773
## 
## Step:  AIC=29400.32
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea + 
##     BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea + 
##     GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + 
##     MoSold + YrSold + SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## - MoSold         1 1.9346e+08 7.8751e+11 29399
## - EnclosedPorch  1 3.0561e+08 7.8762e+11 29399
## - YrSold         1 3.2292e+08 7.8763e+11 29399
## - ScreenPorch    1 4.0108e+08 7.8771e+11 29399
## - OverallCond    1 5.5771e+08 7.8787e+11 29399
## <none>                        7.8731e+11 29400
## - BldgType       1 1.1945e+09 7.8851e+11 29401
## - LowQualFinSF   1 1.3953e+09 7.8871e+11 29401
## - LotFrontage    1 1.7652e+09 7.8908e+11 29402
## - OpenPorchSF    1 2.4401e+09 7.8975e+11 29403
## - MasVnrArea     1 3.6331e+09 7.9094e+11 29405
## - LotArea        1 3.6926e+09 7.9100e+11 29405
## - GarageArea     1 4.0768e+09 7.9139e+11 29406
## - BsmtUnfSF      1 4.7426e+09 7.9205e+11 29407
## - SaleCondition  1 5.0124e+09 7.9232e+11 29408
## - WoodDeckSF     1 6.1424e+09 7.9345e+11 29410
## - YearRemodAdd   1 8.9160e+09 7.9623e+11 29415
## - CentralAir     1 1.1783e+10 7.9909e+11 29420
## - MSSubClass     1 1.4485e+10 8.0180e+11 29425
## - Neighborhood   1 4.4334e+10 8.3165e+11 29478
## - BsmtFinSF1     1 6.6334e+10 8.5365e+11 29516
## - OverallQual    1 1.0672e+11 8.9403e+11 29584
## - GrLivArea      1 2.2948e+11 1.0168e+12 29772
## 
## Step:  AIC=29398.68
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea + 
##     BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea + 
##     GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + 
##     YrSold + SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## - YrSold         1 2.6350e+08 7.8777e+11 29397
## - EnclosedPorch  1 2.9459e+08 7.8780e+11 29397
## - ScreenPorch    1 4.1391e+08 7.8792e+11 29397
## - OverallCond    1 5.6762e+08 7.8807e+11 29398
## <none>                        7.8751e+11 29399
## - BldgType       1 1.1745e+09 7.8868e+11 29399
## - LowQualFinSF   1 1.4311e+09 7.8894e+11 29399
## - LotFrontage    1 1.7511e+09 7.8926e+11 29400
## - OpenPorchSF    1 2.5072e+09 7.9001e+11 29401
## - MasVnrArea     1 3.5864e+09 7.9109e+11 29403
## - LotArea        1 3.6212e+09 7.9113e+11 29403
## - GarageArea     1 4.0445e+09 7.9155e+11 29404
## - BsmtUnfSF      1 4.7798e+09 7.9229e+11 29406
## - SaleCondition  1 5.1210e+09 7.9263e+11 29406
## - WoodDeckSF     1 6.1904e+09 7.9370e+11 29408
## - YearRemodAdd   1 8.8516e+09 7.9636e+11 29413
## - CentralAir     1 1.1786e+10 7.9929e+11 29418
## - MSSubClass     1 1.4474e+10 8.0198e+11 29423
## - Neighborhood   1 4.4413e+10 8.3192e+11 29477
## - BsmtFinSF1     1 6.6398e+10 8.5390e+11 29515
## - OverallQual    1 1.0719e+11 8.9470e+11 29583
## - GrLivArea      1 2.3010e+11 1.0176e+12 29771
## 
## Step:  AIC=29397.17
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea + 
##     BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea + 
##     GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + 
##     SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## - EnclosedPorch  1 3.0131e+08 7.8807e+11 29396
## - ScreenPorch    1 4.3757e+08 7.8821e+11 29396
## - OverallCond    1 5.7491e+08 7.8834e+11 29396
## <none>                        7.8777e+11 29397
## - BldgType       1 1.2080e+09 7.8898e+11 29397
## - LowQualFinSF   1 1.4522e+09 7.8922e+11 29398
## - LotFrontage    1 1.8119e+09 7.8958e+11 29399
## - OpenPorchSF    1 2.4526e+09 7.9022e+11 29400
## - LotArea        1 3.5364e+09 7.9131e+11 29402
## - MasVnrArea     1 3.6424e+09 7.9141e+11 29402
## - GarageArea     1 4.0311e+09 7.9180e+11 29403
## - BsmtUnfSF      1 4.7054e+09 7.9247e+11 29404
## - SaleCondition  1 4.9120e+09 7.9268e+11 29404
## - WoodDeckSF     1 6.2795e+09 7.9405e+11 29407
## - YearRemodAdd   1 9.1267e+09 7.9690e+11 29412
## - CentralAir     1 1.1913e+10 7.9968e+11 29417
## - MSSubClass     1 1.4676e+10 8.0245e+11 29422
## - Neighborhood   1 4.4291e+10 8.3206e+11 29475
## - BsmtFinSF1     1 6.6316e+10 8.5408e+11 29513
## - OverallQual    1 1.0710e+11 8.9487e+11 29581
## - GrLivArea      1 2.2999e+11 1.0178e+12 29769
## 
## Step:  AIC=29395.73
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea + 
##     BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea + 
##     GarageArea + WoodDeckSF + OpenPorchSF + ScreenPorch + SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## - ScreenPorch    1 3.6864e+08 7.8844e+11 29394
## - OverallCond    1 6.2477e+08 7.8869e+11 29395
## <none>                        7.8807e+11 29396
## - BldgType       1 1.1235e+09 7.8919e+11 29396
## - LowQualFinSF   1 1.4633e+09 7.8953e+11 29396
## - LotFrontage    1 1.8296e+09 7.8990e+11 29397
## - OpenPorchSF    1 2.3675e+09 7.9044e+11 29398
## - LotArea        1 3.5210e+09 7.9159e+11 29400
## - MasVnrArea     1 3.5508e+09 7.9162e+11 29400
## - GarageArea     1 3.9816e+09 7.9205e+11 29401
## - BsmtUnfSF      1 4.7474e+09 7.9282e+11 29403
## - SaleCondition  1 4.9142e+09 7.9298e+11 29403
## - WoodDeckSF     1 6.0868e+09 7.9416e+11 29405
## - YearRemodAdd   1 8.8830e+09 7.9695e+11 29410
## - CentralAir     1 1.2286e+10 8.0036e+11 29416
## - MSSubClass     1 1.4550e+10 8.0262e+11 29420
## - Neighborhood   1 4.4024e+10 8.3209e+11 29473
## - BsmtFinSF1     1 6.6378e+10 8.5445e+11 29512
## - OverallQual    1 1.0846e+11 8.9653e+11 29582
## - GrLivArea      1 2.3135e+11 1.0194e+12 29770
## 
## Step:  AIC=29394.41
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea + 
##     BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea + 
##     GarageArea + WoodDeckSF + OpenPorchSF + SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## - OverallCond    1 6.8851e+08 7.8913e+11 29394
## <none>                        7.8844e+11 29394
## - BldgType       1 1.1691e+09 7.8961e+11 29395
## - LowQualFinSF   1 1.4739e+09 7.8991e+11 29395
## - LotFrontage    1 1.8439e+09 7.9028e+11 29396
## - OpenPorchSF    1 2.3550e+09 7.9079e+11 29397
## - MasVnrArea     1 3.6127e+09 7.9205e+11 29399
## - LotArea        1 3.6421e+09 7.9208e+11 29399
## - GarageArea     1 4.0082e+09 7.9245e+11 29400
## - BsmtUnfSF      1 4.6706e+09 7.9311e+11 29401
## - SaleCondition  1 4.8989e+09 7.9334e+11 29402
## - WoodDeckSF     1 5.8284e+09 7.9427e+11 29403
## - YearRemodAdd   1 8.6845e+09 7.9712e+11 29408
## - CentralAir     1 1.2143e+10 8.0058e+11 29415
## - MSSubClass     1 1.4736e+10 8.0317e+11 29419
## - Neighborhood   1 4.3878e+10 8.3232e+11 29472
## - BsmtFinSF1     1 6.6710e+10 8.5515e+11 29511
## - OverallQual    1 1.0950e+11 8.9793e+11 29582
## - GrLivArea      1 2.3226e+11 1.0207e+12 29769
## 
## Step:  AIC=29393.68
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea + GarageArea + 
##     WoodDeckSF + OpenPorchSF + SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## - BldgType       1 9.5995e+08 7.9009e+11 29394
## <none>                        7.8913e+11 29394
## - LowQualFinSF   1 1.4576e+09 7.9058e+11 29394
## - LotFrontage    1 1.7141e+09 7.9084e+11 29395
## - OpenPorchSF    1 2.3467e+09 7.9147e+11 29396
## - MasVnrArea     1 3.5372e+09 7.9266e+11 29398
## - GarageArea     1 3.7775e+09 7.9290e+11 29399
## - LotArea        1 3.8245e+09 7.9295e+11 29399
## - BsmtUnfSF      1 4.2428e+09 7.9337e+11 29400
## - SaleCondition  1 4.5360e+09 7.9366e+11 29400
## - WoodDeckSF     1 5.8308e+09 7.9496e+11 29402
## - YearRemodAdd   1 1.0261e+10 7.9939e+11 29411
## - CentralAir     1 1.1551e+10 8.0068e+11 29413
## - MSSubClass     1 1.4591e+10 8.0372e+11 29418
## - Neighborhood   1 4.3532e+10 8.3266e+11 29470
## - BsmtFinSF1     1 6.6024e+10 8.5515e+11 29509
## - OverallQual    1 1.1102e+11 9.0015e+11 29584
## - GrLivArea      1 2.3263e+11 1.0218e+12 29769
## 
## Step:  AIC=29393.46
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     OverallQual + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF + 
##     CentralAir + LowQualFinSF + GrLivArea + GarageArea + WoodDeckSF + 
##     OpenPorchSF + SaleCondition
## 
##                 Df  Sum of Sq        RSS   AIC
## <none>                        7.9009e+11 29394
## - LowQualFinSF   1 1.5046e+09 7.9159e+11 29394
## - LotFrontage    1 1.5529e+09 7.9164e+11 29394
## - OpenPorchSF    1 2.2599e+09 7.9235e+11 29396
## - LotArea        1 3.5038e+09 7.9359e+11 29398
## - GarageArea     1 3.7251e+09 7.9381e+11 29398
## - MasVnrArea     1 3.8423e+09 7.9393e+11 29399
## - SaleCondition  1 4.6649e+09 7.9475e+11 29400
## - BsmtUnfSF      1 5.4688e+09 7.9556e+11 29402
## - WoodDeckSF     1 5.7555e+09 7.9584e+11 29402
## - YearRemodAdd   1 1.0411e+10 8.0050e+11 29411
## - CentralAir     1 1.1062e+10 8.0115e+11 29412
## - MSSubClass     1 2.4161e+10 8.1425e+11 29435
## - Neighborhood   1 4.6055e+10 8.3614e+11 29474
## - BsmtFinSF1     1 7.2486e+10 8.6257e+11 29520
## - OverallQual    1 1.1045e+11 9.0053e+11 29583
## - GrLivArea      1 2.4667e+11 1.0368e+12 29788
summary(backward.model)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + Neighborhood + LotFrontage + 
##     LotArea + OverallQual + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea + GarageArea + 
##     WoodDeckSF + OpenPorchSF + SaleCondition, data = train1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -166244  -14826   -1306   13387   90702 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.390e+05  7.423e+04  -4.567 5.36e-06 ***
## MSSubClass    -1.023e+02  1.539e+01  -6.643 4.35e-11 ***
## Neighborhood   1.228e+03  1.339e+02   9.171  < 2e-16 ***
## LotFrontage    1.199e+02  7.118e+01   1.684  0.09238 .  
## LotArea        8.102e-01  3.203e-01   2.530  0.01152 *  
## OverallQual    1.044e+04  7.347e+02  14.203  < 2e-16 ***
## YearRemodAdd   1.660e+02  3.806e+01   4.361 1.39e-05 ***
## MasVnrArea     1.351e+01  5.099e+00   2.649  0.00816 ** 
## BsmtFinSF1     3.338e+01  2.901e+00  11.506  < 2e-16 ***
## BsmtUnfSF      8.189e+00  2.591e+00   3.160  0.00161 ** 
## CentralAir    -1.220e+04  2.714e+03  -4.495 7.52e-06 ***
## LowQualFinSF  -2.145e+01  1.294e+01  -1.658  0.09759 .  
## GrLivArea      5.679e+01  2.675e+00  21.225  < 2e-16 ***
## GarageArea     1.002e+01  3.842e+00   2.608  0.00919 ** 
## WoodDeckSF     1.961e+01  6.048e+00   3.242  0.00121 ** 
## OpenPorchSF    2.994e+01  1.474e+01   2.032  0.04238 *  
## SaleCondition -4.793e+03  1.642e+03  -2.919  0.00357 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23400 on 1443 degrees of freedom
## Multiple R-squared:  0.7765, Adjusted R-squared:  0.774 
## F-statistic: 313.3 on 16 and 1443 DF,  p-value: < 2.2e-16
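
The AIC values printed in the trace above are the ones `step()` obtains from `extractAIC()`. For a linear model that is n * log(RSS / n) + 2 * edf, which differs from the value `AIC()` reports by an additive constant. A minimal sketch checking this on the built-in mtcars data; the dataset and the chosen predictors are assumptions for illustration only and are not part of this analysis:

```r
# Illustration only: mtcars stands in for train1.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
edf <- n - df.residual(fit)       # number of estimated coefficients

manual_aic <- n * log(rss / n) + 2 * edf
step_aic   <- extractAIC(fit)[2]  # the value step() prints

print(c(manual = manual_aic, step = step_aic))
```

Plugging in the numbers from the start of the trace (RSS = 7.8720e+11, n = 1460, and 26 estimated coefficients, which follows from the 1434 residual degrees of freedom) gives 1460 * log(7.8720e11 / 1460) + 2 * 26, about 29406.1, in line with the printed starting AIC of 29406.11.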

Model 4: build a model that uses the five columns most highly correlated with SalePrice as features

cors <- sapply(train1, cor, y=train1$SalePrice)
## Warning in FUN(X[[i]], ...): the standard deviation is zero

## Warning in FUN(X[[i]], ...): the standard deviation is zero

## Warning in FUN(X[[i]], ...): the standard deviation is zero
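# rank() <= 6 keeps SalePrice itself (correlation 1) plus the five best predictors
mask <- (rank(-abs(cors)) <= 6)
best5.pred <- train1[, mask]

# drop the response so that only the five predictors remain
best5.pred <- subset(best5.pred, select = c(-SalePrice))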
summary(best5.pred)
##   Neighborhood    OverallQual       GrLivArea      GarageCars   
##  Min.   : 1.00   Min.   : 1.000   Min.   :1464   Min.   :0.000  
##  1st Qu.: 7.00   1st Qu.: 5.000   1st Qu.:1464   1st Qu.:1.000  
##  Median :13.00   Median : 6.000   Median :1464   Median :2.000  
##  Mean   :12.84   Mean   : 6.099   Mean   :1666   Mean   :1.767  
##  3rd Qu.:17.00   3rd Qu.: 7.000   3rd Qu.:1777   3rd Qu.:2.000  
##  Max.   :25.00   Max.   :10.000   Max.   :2466   Max.   :4.000  
##    GarageArea    
##  Min.   :   0.0  
##  1st Qu.: 334.5  
##  Median : 480.0  
##  Mean   : 473.0  
##  3rd Qu.: 576.0  
##  Max.   :1418.0
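
The selection above can also be written so that the constant columns (the source of the "standard deviation is zero" warnings) and the response are excluded explicitly before ranking. A minimal sketch, assuming mtcars as a stand-in for train1 and `mpg` as a stand-in for SalePrice; these names are illustrative and not part of the original analysis:

```r
# Illustration only: mtcars stands in for train1 and mpg for SalePrice.
dat    <- mtcars
target <- "mpg"

# Correlation of every column with the target; suppressWarnings() hides the
# "standard deviation is zero" warning that constant columns would trigger.
cors <- suppressWarnings(sapply(dat, cor, y = dat[[target]]))

# Drop the target itself and any NA correlations, then take the 5 largest |r|.
cors <- cors[names(cors) != target & !is.na(cors)]
top5 <- names(sort(abs(cors), decreasing = TRUE))[1:5]

best5 <- dat[, top5]
print(top5)
```

Ranking on the absolute value keeps strongly negative predictors as well as strongly positive ones, which is what `rank(-abs(cors))` does in the code above.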

Stepwise backward regression

model.best5 <- lm(SalePrice ~ Neighborhood + OverallQual + GrLivArea + GarageCars + GarageArea, data = train1)

model.best5 <- step(model.best5, direction = "backward")
## Start:  AIC=29680.5
## SalePrice ~ Neighborhood + OverallQual + GrLivArea + GarageCars + 
##     GarageArea
## 
##                Df  Sum of Sq        RSS   AIC
## - GarageCars    1 6.0233e+08 9.7695e+11 29679
## <none>                       9.7634e+11 29681
## - GarageArea    1 1.6068e+10 9.9241e+11 29702
## - Neighborhood  1 1.0040e+11 1.0767e+12 29821
## - OverallQual   1 1.7478e+11 1.1511e+12 29919
## - GrLivArea     1 3.5575e+11 1.3321e+12 30132
## 
## Step:  AIC=29679.4
## SalePrice ~ Neighborhood + OverallQual + GrLivArea + GarageArea
## 
##                Df  Sum of Sq        RSS   AIC
## <none>                       9.7695e+11 29679
## - GarageArea    1 3.2766e+10 1.0097e+12 29726
## - Neighborhood  1 1.0030e+11 1.0772e+12 29820
## - OverallQual   1 1.7567e+11 1.1526e+12 29919
## - GrLivArea     1 3.5573e+11 1.3327e+12 30131
summary(model.best5)
## 
## Call:
## lm(formula = SalePrice ~ Neighborhood + OverallQual + GrLivArea + 
##     GarageArea, data = train1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -140973  -15613   -1288   13625  100183 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -14519.434   4213.363  -3.446 0.000585 ***
## Neighborhood   1723.991    141.056  12.222  < 2e-16 ***
## OverallQual   11899.446    735.671  16.175  < 2e-16 ***
## GrLivArea        61.235      2.660  23.017  < 2e-16 ***
## GarageArea       28.115      4.025   6.986  4.3e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25910 on 1455 degrees of freedom
## Multiple R-squared:  0.7236, Adjusted R-squared:  0.7229 
## F-statistic: 952.5 on 4 and 1455 DF,  p-value: < 2.2e-16
anova(full.model,reduced.model,backward.model, model.best5, test="Chisq")
## Analysis of Variance Table
## 
## Model 1: SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd + 
##     MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir + 
##     LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF + 
##     OpenPorchSF + EnclosedPorch + ScreenPorch + X3SsnPorch + 
##     PoolArea + MiscVal + MoSold + YrSold + SaleCondition
## Model 2: SalePrice ~ MSSubClass + Neighborhood + LotArea + OverallQual + 
##     YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF + 
##     CentralAir + GrLivArea + WoodDeckSF + OpenPorchSF + SaleCondition
## Model 3: SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea + 
##     OverallQual + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF + 
##     CentralAir + LowQualFinSF + GrLivArea + GarageArea + WoodDeckSF + 
##     OpenPorchSF + SaleCondition
## Model 4: SalePrice ~ Neighborhood + OverallQual + GrLivArea + GarageArea
##   Res.Df        RSS  Df   Sum of Sq  Pr(>Chi)    
## 1   1434 7.8720e+11                              
## 2   1445 7.9752e+11 -11 -1.0322e+10  0.064735 .  
## 3   1443 7.9009e+11   2  7.4334e+09  0.001147 ** 
## 4   1455 9.7695e+11 -12 -1.8686e+11 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
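
`anova()` compares each model with the one listed before it and assumes the two are nested. Models 2 and 3 are not: Model 2 contains YearBuilt, which Model 3 lacks, while Model 3 adds LotFrontage, LowQualFinSF and GarageArea. That row of the table should therefore be read with caution. One common alternative for non-nested models is to compare information criteria side by side. A minimal sketch on mtcars; the two models and their names are assumptions for illustration only:

```r
# Illustration only: mtcars models stand in for the house-price models.
m_a <- lm(mpg ~ wt + hp,   data = mtcars)
m_b <- lm(mpg ~ wt + qsec, data = mtcars)  # not nested in m_a

# AIC() and BIC() accept several fitted models and return one row per model.
comparison <- data.frame(
  model  = c("m_a", "m_b"),
  adj_r2 = c(summary(m_a)$adj.r.squared, summary(m_b)$adj.r.squared),
  AIC    = AIC(m_a, m_b)$AIC,
  BIC    = BIC(m_a, m_b)$BIC
)
print(comparison)
```

Lower AIC or BIC indicates the preferred model; the same call pattern would apply to full.model, reduced.model, backward.model and model.best5.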

Prepare test data

test<- read.csv("https://raw.githubusercontent.com/johnpannyc/data605-final/master/test.csv")
names(test)
##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
##  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
## [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
## [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
## [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
## [41] "HeatingQC"     "CentralAir"    "Electrical"    "X1stFlrSF"    
## [45] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [69] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
## [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
test1 <- dplyr::select(test,Id,MSSubClass,Neighborhood,LotFrontage,LotArea,BldgType,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,CentralAir,X1stFlrSF,X2ndFlrSF,LowQualFinSF,GrLivArea,TotRmsAbvGrd,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,ScreenPorch,X3SsnPorch,PoolArea,MiscVal,MoSold,YrSold,SaleCondition)
test1$Neighborhood <- as.character(test1$Neighborhood)
test1$Neighborhood[which(test1$Neighborhood == "NridgHt")] <- "25"
test1$Neighborhood[which(test1$Neighborhood == "NoRidge")] <- "24"
test1$Neighborhood[which(test1$Neighborhood == "StoneBr")] <- "23"
test1$Neighborhood[which(test1$Neighborhood == "Timber")] <- "22"
test1$Neighborhood[which(test1$Neighborhood == "Somerst")] <- "21"
test1$Neighborhood[which(test1$Neighborhood == "Veenker")] <- "20"
test1$Neighborhood[which(test1$Neighborhood == "Crawfor")] <- "19"
test1$Neighborhood[which(test1$Neighborhood == "ClearCr")] <- "18"
test1$Neighborhood[which(test1$Neighborhood == "CollgCr")] <- "17"
test1$Neighborhood[which(test1$Neighborhood == "Blmngtn")] <- "16"

test1$Neighborhood[which(test1$Neighborhood == "NWAmes")] <- "15"
test1$Neighborhood[which(test1$Neighborhood == "Gilbert")] <- "14"
test1$Neighborhood[which(test1$Neighborhood == "SawyerW")] <- "13"
test1$Neighborhood[which(test1$Neighborhood == "Mitchel")] <- "12"
test1$Neighborhood[which(test1$Neighborhood == "NPkVill")] <- "11"
test1$Neighborhood[which(test1$Neighborhood == "NAmes")] <- "10"
test1$Neighborhood[which(test1$Neighborhood == "SWISU")] <- "9"
test1$Neighborhood[which(test1$Neighborhood == "Blueste")] <- "8"
test1$Neighborhood[which(test1$Neighborhood == "Sawyer")] <- "7"
test1$Neighborhood[which(test1$Neighborhood == "BrkSide")] <- "6"

test1$Neighborhood[which(test1$Neighborhood == "Edwards")] <- "5"
test1$Neighborhood[which(test1$Neighborhood == "OldTown")] <- "4"
test1$Neighborhood[which(test1$Neighborhood == "BrDale")] <- "3"
test1$Neighborhood[which(test1$Neighborhood == "IDOTRR")] <- "2"
test1$Neighborhood[which(test1$Neighborhood == "MeadowV")] <- "1"


test1$Neighborhood <- as.numeric(test1$Neighborhood)
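
The 25 assignments above can be expressed more compactly with a named lookup vector. The sketch below copies the codes from those assignments; the `toy` data frame and the name `neighborhood_rank` are hypothetical and only serve to make the example self-contained:

```r
# Codes copied from the assignments above (1 = MeadowV ... 25 = NridgHt).
neighborhood_rank <- c(
  MeadowV = 1, IDOTRR = 2, BrDale = 3, OldTown = 4, Edwards = 5,
  BrkSide = 6, Sawyer = 7, Blueste = 8, SWISU = 9, NAmes = 10,
  NPkVill = 11, Mitchel = 12, SawyerW = 13, Gilbert = 14, NWAmes = 15,
  Blmngtn = 16, CollgCr = 17, ClearCr = 18, Crawfor = 19, Veenker = 20,
  Somerst = 21, Timber = 22, StoneBr = 23, NoRidge = 24, NridgHt = 25
)

# Hypothetical toy data frame standing in for test1.
toy <- data.frame(Neighborhood = c("NAmes", "NridgHt", "MeadowV", "CollgCr"),
                  stringsAsFactors = TRUE)

# Index the lookup by name; unname() drops the neighborhood labels.
toy$Neighborhood <- unname(neighborhood_rank[as.character(toy$Neighborhood)])
print(toy$Neighborhood)
```

A neighborhood that is missing from the lookup comes back as NA, which makes an unmapped level easy to spot. Reusing one lookup for both train1 and test1 would also keep the two encodings identical.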

Convert indicator variables to numbers.

test1$CentralAir <- as.character(test1$CentralAir)
test1$CentralAir[which(test1$CentralAir == "Y")] <- "1"
test1$CentralAir[which(test1$CentralAir == "N")] <- "0"
test1$CentralAir <- as.numeric(test1$CentralAir)



test1$SaleCondition <- as.character(test1$SaleCondition)
test1$SaleCondition[which(test1$SaleCondition == "Normal")] <- "1"
test1$SaleCondition[which(test1$SaleCondition == "Abnorml")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "AdjLand")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "Alloca")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "Family")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "Partial")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "N")] <- "0"
test1$SaleCondition <- as.numeric(test1$SaleCondition)



test1$BldgType <- as.character(test1$BldgType)
test1$BldgType[which(test1$BldgType == "1Fam")] <- "5"
test1$BldgType[which(test1$BldgType == "2fmCon")] <- "4"
test1$BldgType[which(test1$BldgType == "Duplex")] <- "3"
test1$BldgType[which(test1$BldgType == "Twnhs")] <- "2"
test1$BldgType[which(test1$BldgType == "TwnhsE")] <- "1"
test1$BldgType <- as.numeric(test1$BldgType)
sapply(test1, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
Variable        Missing values
Id              0
MSSubClass      0
Neighborhood    0
LotFrontage     227
LotArea         0
BldgType        0
OverallQual     0
OverallCond     0
YearBuilt       0
YearRemodAdd    0
MasVnrArea      15
BsmtFinSF1      1
BsmtFinSF2      1
BsmtUnfSF       1
TotalBsmtSF     1
CentralAir      0
X1stFlrSF       0
X2ndFlrSF       0
LowQualFinSF    0
GrLivArea       0
TotRmsAbvGrd    0
GarageCars      1
GarageArea      1
WoodDeckSF      0
OpenPorchSF     0
EnclosedPorch   0
ScreenPorch     0
X3SsnPorch      0
PoolArea        0
MiscVal         0
MoSold          0
YrSold          0
SaleCondition   0

Impute missing values with the column mean

test1$LotFrontage[is.na(test1$LotFrontage)] <- mean(test1$LotFrontage, na.rm=TRUE)
test1$MasVnrArea[is.na(test1$MasVnrArea)] <- mean(test1$MasVnrArea, na.rm=TRUE)
test1$BsmtFinSF1[is.na(test1$BsmtFinSF1)] <- mean(test1$BsmtFinSF1, na.rm=TRUE)
test1$BsmtFinSF2[is.na(test1$BsmtFinSF2)] <- mean(test1$BsmtFinSF2, na.rm=TRUE)
test1$BsmtUnfSF[is.na(test1$BsmtUnfSF)] <- mean(test1$BsmtUnfSF, na.rm=TRUE)
test1$TotalBsmtSF[is.na(test1$TotalBsmtSF)] <- mean(test1$TotalBsmtSF, na.rm=TRUE)
test1$GarageCars[is.na(test1$GarageCars)] <- mean(test1$GarageCars, na.rm=TRUE)
test1$GarageArea[is.na(test1$GarageArea)] <- mean(test1$GarageArea, na.rm=TRUE)
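
The eight imputation lines above follow one pattern, so they can be written as a loop over every numeric column that contains an NA. A minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration and stand in for test1):

```r
# Hypothetical toy data frame standing in for test1.
toy <- data.frame(LotFrontage = c(60, NA, 80),
                  GarageCars  = c(2, 1, NA),
                  LotArea     = c(8450, 9600, 11250))

# Find numeric columns with at least one NA and fill each with its own mean.
na_cols <- names(toy)[sapply(toy, function(x) is.numeric(x) && anyNA(x))]
for (col in na_cols) {
  toy[[col]][is.na(toy[[col]])] <- mean(toy[[col]], na.rm = TRUE)
}
print(toy)
```

Note that mean imputation turns a count such as GarageCars into a non-integer value; the median would be one alternative for count variables.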

Drop the following columns from train1 because of collinearity

train1$TotalBsmtSF <- NULL
train1$TotRmsAbvGrd<- NULL
train1$X1stFlrSF<- NULL
train1$X2ndFlrSF<- NULL

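Replace outliers: values below the 0.5 quantile (the median) are raised to it, and values above the 0.95 quantile are lowered to it

replaceOutliers <- function(x) {
    # quantiles[1] is the median (0.5), quantiles[2] is the 95th percentile
    quantiles <- quantile(x, c(0.5, 0.95))
    # raise everything below the median to the median
    x[x < quantiles[1]] <- quantiles[1]
    # lower everything above the 95th percentile to that percentile
    x[x > quantiles[2]] <- quantiles[2]
    return(x)
}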
   
train1$LotFrontage <- replaceOutliers(train1$LotFrontage)
train1$LotArea <- replaceOutliers(train1$LotArea)
train1$MasVnrArea <- replaceOutliers(train1$MasVnrArea)
train1$BsmtFinSF1 <- replaceOutliers(train1$BsmtFinSF1)
train1$BsmtFinSF2 <- replaceOutliers(train1$BsmtFinSF2)
train1$BsmtUnfSF <- replaceOutliers(train1$BsmtUnfSF)
train1$GrLivArea <- replaceOutliers(train1$GrLivArea)
train1$WoodDeckSF <- replaceOutliers(train1$WoodDeckSF)
train1$OpenPorchSF <- replaceOutliers(train1$OpenPorchSF)
train1$EnclosedPorch <- replaceOutliers(train1$EnclosedPorch)
train1$ScreenPorch <- replaceOutliers(train1$ScreenPorch)
train1$X3SsnPorch <- replaceOutliers(train1$X3SsnPorch)
train1$PoolArea <- replaceOutliers(train1$PoolArea)
train1$MiscVal <- replaceOutliers(train1$MiscVal)
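
Because `replaceOutliers` uses 0.5 as its lower quantile, every value below the median is raised to the median, which is why the earlier summary shows GrLivArea with the same minimum, first quartile and median (1464). If the intent was to trim only the tails, a conventional variant caps at the 5th and 95th percentiles instead. The sketch below is an assumption about that intent, not the method used in this analysis; the function name `winsorize` and the sample vector are hypothetical:

```r
# Hypothetical variant: cap values at the 5th and 95th percentiles.
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x < q[1]] <- q[1]   # raise the low tail
  x[!is.na(x) & x > q[2]] <- q[2]   # lower the high tail
  x
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)
capped <- winsorize(x)
print(range(x))
print(range(capped))
```

With this variant the middle of the distribution is left untouched, so the median of the capped vector equals the median of the original.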
summary(test1)
##        Id         MSSubClass      Neighborhood    LotFrontage    
##  Min.   :1461   Min.   : 20.00   Min.   : 1.00   Min.   : 21.00  
##  1st Qu.:1826   1st Qu.: 20.00   1st Qu.: 7.00   1st Qu.: 60.00  
##  Median :2190   Median : 50.00   Median :12.00   Median : 68.58  
##  Mean   :2190   Mean   : 57.38   Mean   :12.55   Mean   : 68.58  
##  3rd Qu.:2554   3rd Qu.: 70.00   3rd Qu.:17.00   3rd Qu.: 78.00  
##  Max.   :2919   Max.   :190.00   Max.   :25.00   Max.   :200.00  
##     LotArea         BldgType      OverallQual      OverallCond   
##  Min.   : 1470   Min.   :1.000   Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 7391   1st Qu.:5.000   1st Qu.: 5.000   1st Qu.:5.000  
##  Median : 9399   Median :5.000   Median : 6.000   Median :5.000  
##  Mean   : 9819   Mean   :4.482   Mean   : 6.079   Mean   :5.554  
##  3rd Qu.:11518   3rd Qu.:5.000   3rd Qu.: 7.000   3rd Qu.:6.000  
##  Max.   :56600   Max.   :5.000   Max.   :10.000   Max.   :9.000  
##    YearBuilt     YearRemodAdd    MasVnrArea       BsmtFinSF1    
##  Min.   :1879   Min.   :1950   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:1953   1st Qu.:1963   1st Qu.:   0.0   1st Qu.:   0.0  
##  Median :1973   Median :1992   Median :   0.0   Median : 351.0  
##  Mean   :1971   Mean   :1984   Mean   : 100.7   Mean   : 439.2  
##  3rd Qu.:2001   3rd Qu.:2004   3rd Qu.: 162.0   3rd Qu.: 752.0  
##  Max.   :2010   Max.   :2010   Max.   :1290.0   Max.   :4010.0  
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF     CentralAir    
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0   Min.   :0.0000  
##  1st Qu.:   0.00   1st Qu.: 219.5   1st Qu.: 784   1st Qu.:1.0000  
##  Median :   0.00   Median : 460.0   Median : 988   Median :1.0000  
##  Mean   :  52.62   Mean   : 554.3   Mean   :1046   Mean   :0.9308  
##  3rd Qu.:   0.00   3rd Qu.: 797.5   3rd Qu.:1304   3rd Qu.:1.0000  
##  Max.   :1526.00   Max.   :2140.0   Max.   :5095   Max.   :1.0000  
##    X1stFlrSF        X2ndFlrSF     LowQualFinSF        GrLivArea   
##  Min.   : 407.0   Min.   :   0   Min.   :   0.000   Min.   : 407  
##  1st Qu.: 873.5   1st Qu.:   0   1st Qu.:   0.000   1st Qu.:1118  
##  Median :1079.0   Median :   0   Median :   0.000   Median :1432  
##  Mean   :1156.5   Mean   : 326   Mean   :   3.543   Mean   :1486  
##  3rd Qu.:1382.5   3rd Qu.: 676   3rd Qu.:   0.000   3rd Qu.:1721  
##  Max.   :5095.0   Max.   :1862   Max.   :1064.000   Max.   :5095  
##   TotRmsAbvGrd      GarageCars      GarageArea       WoodDeckSF     
##  Min.   : 3.000   Min.   :0.000   Min.   :   0.0   Min.   :   0.00  
##  1st Qu.: 5.000   1st Qu.:1.000   1st Qu.: 318.0   1st Qu.:   0.00  
##  Median : 6.000   Median :2.000   Median : 480.0   Median :   0.00  
##  Mean   : 6.385   Mean   :1.766   Mean   : 472.8   Mean   :  93.17  
##  3rd Qu.: 7.000   3rd Qu.:2.000   3rd Qu.: 576.0   3rd Qu.: 168.00  
##  Max.   :15.000   Max.   :5.000   Max.   :1488.0   Max.   :1424.00  
##   OpenPorchSF     EnclosedPorch      ScreenPorch       X3SsnPorch     
##  Min.   :  0.00   Min.   :   0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:   0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median : 28.00   Median :   0.00   Median :  0.00   Median :  0.000  
##  Mean   : 48.31   Mean   :  24.24   Mean   : 17.06   Mean   :  1.794  
##  3rd Qu.: 72.00   3rd Qu.:   0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :742.00   Max.   :1012.00   Max.   :576.00   Max.   :360.000  
##     PoolArea          MiscVal             MoSold           YrSold    
##  Min.   :  0.000   Min.   :    0.00   Min.   : 1.000   Min.   :2006  
##  1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 4.000   1st Qu.:2007  
##  Median :  0.000   Median :    0.00   Median : 6.000   Median :2008  
##  Mean   :  1.744   Mean   :   58.17   Mean   : 6.104   Mean   :2008  
##  3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009  
##  Max.   :800.000   Max.   :17000.00   Max.   :12.000   Max.   :2010  
##  SaleCondition   
##  Min.   :0.0000  
##  1st Qu.:1.0000  
##  Median :1.0000  
##  Mean   :0.8252  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
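par(mfrow=c(3, 3))
# Column names of test1, used as plot titles
col.names <- colnames(test1)

# Skip the first column (Id) and draw one density plot per remaining variable
for (col in 2:ncol(test1)) {
  # Kernel density estimate of the column, ignoring missing values
  d <- density(na.omit(test1[, col]))
  plot(d, type="n", main=col.names[col])
  polygon(d, col="light green", border="red")
}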


Price Prediction

Full Model Prediction

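# Append the full model's predictions to the test data as SalePrice
full.model.pred <- cbind(test1, SalePrice = predict(full.model, test1))

# Keep only the two columns needed for the submission
full.model.submission <- dplyr::select(full.model.pred, Id, SalePrice)

# row.names = FALSE avoids writing an extra, unnamed index column
write.csv(full.model.submission, file="full.model.submission.csv", row.names = FALSE)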
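
Before uploading a file like this, it can help to check it. `predict()` on a linear model returns `NA` for test rows with missing predictors, and a log-based score is undefined for non-positive predictions. The sketch below is one possible check. The helper name `check_submission` and the toy data frames are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): returns a character
# vector describing any problems found in a submission data frame.
check_submission <- function(sub, expected_rows) {
  problems <- character(0)
  if (!identical(names(sub), c("Id", "SalePrice"))) {
    problems <- c(problems, "columns are not exactly Id, SalePrice")
  }
  if (nrow(sub) != expected_rows) {
    problems <- c(problems, "unexpected number of rows")
  }
  if (anyNA(sub$SalePrice)) {
    problems <- c(problems, "missing predictions")
  }
  if (any(sub$SalePrice <= 0, na.rm = TRUE)) {
    problems <- c(problems, "non-positive predictions")
  }
  problems
}

# Toy data frames standing in for a real submission
good <- data.frame(Id = 1461:1463, SalePrice = c(120000, 180000, 250000))
bad  <- data.frame(Id = 1461:1463, SalePrice = c(120000, NA, -5000))

check_submission(good, expected_rows = 3)  # no problems reported
check_submission(bad,  expected_rows = 3)  # flags the NA and the negative value
```

With the real data, the call would look like `check_submission(full.model.submission, expected_rows = nrow(test1))`.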

Reduced Model Prediction

# Append the reduced model's predictions to the test data as SalePrice
reduced.model.pred <- cbind(test1, SalePrice = predict(reduced.model, test1))

reduced.model.submission <- dplyr::select(reduced.model.pred, Id, SalePrice)

# row.names = FALSE avoids writing an extra, unnamed index column
write.csv(reduced.model.submission, file="reduced.model.submission.csv", row.names = FALSE)

Backward Model Prediction

# Append the backward-selection model's predictions to the test data as SalePrice
backward.model.pred <- cbind(test1, SalePrice = predict(backward.model, test1))

backward.model.submission <- dplyr::select(backward.model.pred, Id, SalePrice)

# row.names = FALSE avoids writing an extra, unnamed index column
write.csv(backward.model.submission, file="backward.model.submission.csv", row.names = FALSE)

Model Best 5

# Append the Best 5 model's predictions to the test data as SalePrice
model.best5.pred <- cbind(test1, SalePrice = predict(model.best5, test1))

model.best5.submission <- dplyr::select(model.best5.pred, Id, SalePrice)

# row.names = FALSE avoids writing an extra, unnamed index column
write.csv(model.best5.submission, file="model.best5.submission.csv", row.names = FALSE)
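
The four prediction blocks above repeat the same three steps: predict, keep Id and SalePrice, write a csv. A minimal sketch of how those steps could be wrapped in one function follows. The name `make_submission` and the simulated toy data are assumptions for illustration. The real analysis would pass `full.model`, `reduced.model`, `backward.model` or `model.best5` together with `test1`.

```r
# Hypothetical helper (not in the original analysis): build an Id/SalePrice
# data frame from any fitted model that has a predict() method.
make_submission <- function(model, newdata, id_col = "Id") {
  preds <- as.numeric(predict(model, newdata))
  data.frame(Id = newdata[[id_col]], SalePrice = preds)
}

# Simulated toy data so the block runs on its own; not the Kaggle data.
set.seed(1)
toy_train <- data.frame(Id = 1:50, GrLivArea = runif(50, 500, 3000))
toy_train$SalePrice <- 50000 + 100 * toy_train$GrLivArea + rnorm(50, sd = 10000)
toy_test <- data.frame(Id = 51:60, GrLivArea = runif(10, 500, 3000))

toy_model <- lm(SalePrice ~ GrLivArea, data = toy_train)
toy_submission <- make_submission(toy_model, toy_test)

# row.names = FALSE keeps the file to exactly the two columns Id and SalePrice.
out_file <- file.path(tempdir(), "toy.submission.csv")
write.csv(toy_submission, file = out_file, row.names = FALSE)
```

With the models from this report, a call such as `make_submission(full.model, test1)` would replace the `cbind` and `dplyr::select` steps.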

Summary of predicted SalePrice for the 4 models

summary(full.model.submission$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   46528  135289  168058  176857  209857  626428
summary(reduced.model.submission$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49564  135726  168064  177044  209976  629480
summary(backward.model.submission$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49564  135553  167851  176928  209720  623185
summary(model.best5.submission$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49448  143541  178091  183743  216547  457530

Histograms of predicted SalePrice for the 4 models

ggplot(full.model.submission, aes(SalePrice)) + geom_histogram(binwidth = 20000, alpha=0.5, color="red", fill="light green")

ggplot(reduced.model.submission, aes(SalePrice)) + geom_histogram(binwidth = 20000, alpha=0.5, color="red", fill="light green")

ggplot(backward.model.submission, aes(SalePrice)) + geom_histogram(binwidth = 20000, alpha=0.5, color="red", fill="light green")

ggplot(model.best5.submission, aes(SalePrice)) + geom_histogram(binwidth = 20000, alpha=0.5, color="red", fill="light green")

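After submitting the csv files from the four models above to Kaggle, the scores were: Best 5 (0.213), Backward (0.186), Full (0.185), Reduced (0.185). Lower scores are better, so the full and reduced models tie for the best result and the Best 5 model scores worst.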
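
Kaggle scores this competition by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. The test-set prices are not public, so the sketch below can only be applied locally, for example to a validation split held out from the training data. The function name `rmsle` and the toy price vectors are assumptions for illustration.

```r
# Hypothetical helper: root mean squared error on the log scale.
# Both vectors must be strictly positive for the logarithm to be defined.
rmsle <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted),
            all(actual > 0), all(predicted > 0))
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Toy prices, not taken from the competition data
actual       <- c(200000, 150000, 300000)
perfect      <- actual
off_by_20pct <- actual * 1.2   # every prediction 20% too high

rmsle(actual, perfect)        # 0
rmsle(actual, off_by_20pct)   # log(1.2), about 0.182
```

Because the error is measured on the log scale, it reflects relative rather than absolute mistakes: overpredicting every house by 20% gives a score of about 0.182, the same order as the scores reported above.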