Problem 2 You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Prepare the working environment. Load the most frequently used packages for data analysis.
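The report does not show its package-loading step. A minimal sketch follows; the package list is an assumption inferred from the functions called later (`glimpse` and `select_if` from dplyr, `vis_miss` and `n_miss` from naniar, `kable` from knitr, `kable_styling` from kableExtra, `ggplot` from ggplot2, `boxcox` from MASS, `melt` from reshape2), not something the original states.

```r
# Packages inferred from the functions used in this report (an assumption).
# MASS is listed first so that dplyr::select, attached later,
# takes precedence over MASS::select.
pkgs <- c("MASS", "dplyr", "naniar", "knitr", "kableExtra", "ggplot2", "reshape2")

# Check which packages are installed, without attaching anything yet.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available.
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Any package reported as not installed can be added with `install.packages()` before rerunning the block.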
Load the train (housing characteristics data) and test (to use for prediction submitted to Kaggle.com) files from GitHub.
train <- read.csv("https://raw.githubusercontent.com/johnpannyc/data605-final/master/train.csv")
test<- read.csv("https://raw.githubusercontent.com/johnpannyc/data605-final/master/test.csv")
Use the “glimpse” function to dig deeper into the datasets. We can see that some of the variables are numerical and some are categorical. Categorical variables need to be converted to integers (levels) for further analysis. Also, there are some missing values in the dataset.
glimpse(train)
## Observations: 1,460
## Variables: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, ...
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ...
## $ Alley <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg...
## $ LandContour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, ...
## $ LotConfig <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside...
## $ LandSlope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit...
## $ Condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN,...
## $ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ...
## $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, ...
## $ HouseStyle <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, ...
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable,...
## $ RoofMatl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com...
## $ Exterior1st <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, Vin...
## $ Exterior2nd <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, Vin...
## $ MasVnrType <fct> BrkFace, None, BrkFace, None, BrkFace, None, Sto...
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ ExterCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ Foundation <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc...
## $ BsmtQual <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, ...
## $ BsmtCond <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ BsmtExposure <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, ...
## $ BsmtFinType1 <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ...
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2 <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf...
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ...
## $ HeatingQC <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, ...
## $ CentralAir <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,...
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Ty...
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, ...
## $ GarageType <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, ...
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn...
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, ...
## $ GarageCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ PavedDrive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, N...
## $ MiscFeature <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, ...
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New,...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,...
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
The test dataset is almost the same as the training dataset, except that it has no SalePrice column. We need to use a model built on the training data to predict SalePrice for the test dataset and submit the predictions to Kaggle.
glimpse(test)
## Observations: 1,459
## Variables: 80
## $ Id <int> 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, ...
## $ MSSubClass <int> 20, 20, 60, 60, 120, 60, 20, 60, 20, 20, 120, 16...
## $ MSZoning <fct> RH, RL, RL, RL, RL, RL, RL, RL, RL, RL, RH, RM, ...
## $ LotFrontage <int> 80, 81, 74, 78, 43, 75, NA, 63, 85, 70, 26, 21, ...
## $ LotArea <int> 11622, 14267, 13830, 9978, 5005, 10000, 7980, 84...
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ...
## $ Alley <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape <fct> Reg, IR1, IR1, IR1, IR1, IR1, IR1, IR1, Reg, Reg...
## $ LandContour <fct> Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, ...
## $ LotConfig <fct> Inside, Corner, Inside, Inside, Inside, Corner, ...
## $ LandSlope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood <fct> NAmes, NAmes, Gilbert, Gilbert, StoneBr, Gilbert...
## $ Condition1 <fct> Feedr, Norm, Norm, Norm, Norm, Norm, Norm, Norm,...
## $ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ...
## $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, TwnhsE, 1Fam, 1Fam, 1Fam...
## $ HouseStyle <fct> 1Story, 1Story, 2Story, 2Story, 1Story, 2Story, ...
## $ OverallQual <int> 5, 6, 5, 6, 8, 6, 6, 6, 7, 4, 7, 6, 5, 6, 7, 9, ...
## $ OverallCond <int> 6, 6, 5, 6, 5, 5, 7, 5, 5, 5, 5, 5, 5, 6, 6, 5, ...
## $ YearBuilt <int> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, ...
## $ YearRemodAdd <int> 1961, 1958, 1998, 1998, 1992, 1994, 2007, 1998, ...
## $ RoofStyle <fct> Gable, Hip, Gable, Gable, Gable, Gable, Gable, G...
## $ RoofMatl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com...
## $ Exterior1st <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdB...
## $ Exterior2nd <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdB...
## $ MasVnrType <fct> None, BrkFace, None, BrkFace, None, None, None, ...
## $ MasVnrArea <int> 0, 108, 0, 20, 0, 0, 0, 0, 0, 0, 0, 504, 492, 0,...
## $ ExterQual <fct> TA, TA, TA, TA, Gd, TA, TA, TA, TA, TA, Gd, TA, ...
## $ ExterCond <fct> TA, TA, TA, TA, TA, TA, Gd, TA, TA, TA, TA, TA, ...
## $ Foundation <fct> CBlock, CBlock, PConc, PConc, PConc, PConc, PCon...
## $ BsmtQual <fct> TA, TA, Gd, TA, Gd, Gd, Gd, Gd, Gd, TA, Gd, TA, ...
## $ BsmtCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ BsmtExposure <fct> No, No, No, No, No, No, No, No, Gd, No, No, No, ...
## $ BsmtFinType1 <fct> Rec, ALQ, GLQ, GLQ, ALQ, Unf, ALQ, Unf, GLQ, ALQ...
## $ BsmtFinSF1 <int> 468, 923, 791, 602, 263, 0, 935, 0, 637, 804, 10...
## $ BsmtFinType2 <fct> LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Rec...
## $ BsmtFinSF2 <int> 144, 0, 0, 0, 0, 0, 0, 0, 0, 78, 0, 0, 0, 0, 0, ...
## $ BsmtUnfSF <int> 270, 406, 137, 324, 1017, 763, 233, 789, 663, 0,...
## $ TotalBsmtSF <int> 882, 1329, 928, 926, 1280, 763, 1168, 789, 1300,...
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ...
## $ HeatingQC <fct> TA, TA, Gd, Ex, Ex, Gd, Ex, Gd, Gd, TA, Ex, TA, ...
## $ CentralAir <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,...
## $ X1stFlrSF <int> 896, 1329, 928, 926, 1280, 763, 1187, 789, 1341,...
## $ X2ndFlrSF <int> 0, 0, 701, 678, 0, 892, 0, 676, 0, 0, 0, 504, 56...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea <int> 896, 1329, 1629, 1604, 1280, 1655, 1187, 1465, 1...
## $ BsmtFullBath <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, ...
## $ BsmtHalfBath <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath <int> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, ...
## $ HalfBath <int> 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, ...
## $ BedroomAbvGr <int> 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 3, 3, 2, 3, ...
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual <fct> TA, Gd, TA, Gd, Gd, TA, TA, TA, Gd, TA, Gd, TA, ...
## $ TotRmsAbvGrd <int> 5, 6, 6, 7, 5, 7, 6, 7, 5, 4, 5, 5, 6, 6, 4, 10,...
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ...
## $ Fireplaces <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, ...
## $ FireplaceQu <fct> NA, NA, TA, Gd, NA, TA, NA, Gd, Po, NA, Fa, NA, ...
## $ GarageType <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, ...
## $ GarageYrBlt <int> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, ...
## $ GarageFinish <fct> Unf, Unf, Fin, Fin, RFn, Fin, Fin, Fin, Unf, Fin...
## $ GarageCars <int> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 3, ...
## $ GarageArea <int> 730, 312, 482, 470, 506, 440, 420, 393, 506, 525...
## $ GarageQual <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ GarageCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ PavedDrive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ WoodDeckSF <int> 140, 393, 212, 360, 0, 157, 483, 0, 192, 240, 20...
## $ OpenPorchSF <int> 0, 36, 34, 36, 82, 84, 21, 75, 0, 0, 68, 0, 0, 0...
## $ EnclosedPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ ScreenPorch <int> 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence <fct> MnPrv, NA, MnPrv, NA, NA, NA, GdPrv, NA, NA, MnP...
## $ MiscFeature <fct> NA, Gar2, NA, NA, NA, NA, Shed, NA, NA, NA, NA, ...
## $ MiscVal <int> 0, 12500, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, ...
## $ MoSold <int> 6, 6, 3, 6, 1, 4, 3, 5, 2, 4, 6, 2, 3, 6, 6, 1, ...
## $ YrSold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...
## $ SaleType <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, COD,...
## $ SaleCondition <fct> Normal, Normal, Normal, Normal, Normal, Normal, ...
Visualize all numerical variables. Select the numerical variables together.
ntrain<-select_if(train, is.numeric)
Density plot of variables
ntrain <- as.data.frame((ntrain))
par(mfrow=c(3, 3))
colnames <- dimnames(ntrain)[[2]]
for(col in 2:ncol(ntrain)) {
d <- density(na.omit(ntrain[,col]))
plot(d, type="n", main=colnames[col])
polygon(d, col="light green", border="red")
}
vis_miss(train)
n_miss(train)
## [1] 6965
prop_miss(train)
## [1] 0.05889565
miss_var_summary(train)
## # A tibble: 81 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 PoolQC 1453 99.5
## 2 MiscFeature 1406 96.3
## 3 Alley 1369 93.8
## 4 Fence 1179 80.8
## 5 FireplaceQu 690 47.3
## 6 LotFrontage 259 17.7
## 7 GarageType 81 5.55
## 8 GarageYrBlt 81 5.55
## 9 GarageFinish 81 5.55
## 10 GarageQual 81 5.55
## # ... with 71 more rows
miss_case_summary(train)
## # A tibble: 1,460 x 3
## case n_miss pct_miss
## <int> <int> <dbl>
## 1 40 15 18.5
## 2 534 15 18.5
## 3 1012 15 18.5
## 4 1219 15 18.5
## 5 521 14 17.3
## 6 706 14 17.3
## 7 1180 14 17.3
## 8 288 11 13.6
## 9 343 11 13.6
## 10 376 11 13.6
## # ... with 1,450 more rows
gg_miss_var(train)
vis_miss(test)
n_miss(test)
## [1] 7000
prop_miss(test)
## [1] 0.05997258
sapply(train, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
Variable | Missing values |
---|---|
Id | 0 |
MSSubClass | 0 |
MSZoning | 0 |
LotFrontage | 259 |
LotArea | 0 |
Street | 0 |
Alley | 1369 |
LotShape | 0 |
LandContour | 0 |
Utilities | 0 |
LotConfig | 0 |
LandSlope | 0 |
Neighborhood | 0 |
Condition1 | 0 |
Condition2 | 0 |
BldgType | 0 |
HouseStyle | 0 |
OverallQual | 0 |
OverallCond | 0 |
YearBuilt | 0 |
YearRemodAdd | 0 |
RoofStyle | 0 |
RoofMatl | 0 |
Exterior1st | 0 |
Exterior2nd | 0 |
MasVnrType | 8 |
MasVnrArea | 8 |
ExterQual | 0 |
ExterCond | 0 |
Foundation | 0 |
BsmtQual | 37 |
BsmtCond | 37 |
BsmtExposure | 38 |
BsmtFinType1 | 37 |
BsmtFinSF1 | 0 |
BsmtFinType2 | 38 |
BsmtFinSF2 | 0 |
BsmtUnfSF | 0 |
TotalBsmtSF | 0 |
Heating | 0 |
HeatingQC | 0 |
CentralAir | 0 |
Electrical | 1 |
X1stFlrSF | 0 |
X2ndFlrSF | 0 |
LowQualFinSF | 0 |
GrLivArea | 0 |
BsmtFullBath | 0 |
BsmtHalfBath | 0 |
FullBath | 0 |
HalfBath | 0 |
BedroomAbvGr | 0 |
KitchenAbvGr | 0 |
KitchenQual | 0 |
TotRmsAbvGrd | 0 |
Functional | 0 |
Fireplaces | 0 |
FireplaceQu | 690 |
GarageType | 81 |
GarageYrBlt | 81 |
GarageFinish | 81 |
GarageCars | 0 |
GarageArea | 0 |
GarageQual | 81 |
GarageCond | 81 |
PavedDrive | 0 |
WoodDeckSF | 0 |
OpenPorchSF | 0 |
EnclosedPorch | 0 |
X3SsnPorch | 0 |
ScreenPorch | 0 |
PoolArea | 0 |
PoolQC | 1453 |
Fence | 1179 |
MiscFeature | 1406 |
MiscVal | 0 |
MoSold | 0 |
YrSold | 0 |
SaleType | 0 |
SaleCondition | 0 |
SalePrice | 0 |
sapply(test, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
Variable | Missing values |
---|---|
Id | 0 |
MSSubClass | 0 |
MSZoning | 4 |
LotFrontage | 227 |
LotArea | 0 |
Street | 0 |
Alley | 1352 |
LotShape | 0 |
LandContour | 0 |
Utilities | 2 |
LotConfig | 0 |
LandSlope | 0 |
Neighborhood | 0 |
Condition1 | 0 |
Condition2 | 0 |
BldgType | 0 |
HouseStyle | 0 |
OverallQual | 0 |
OverallCond | 0 |
YearBuilt | 0 |
YearRemodAdd | 0 |
RoofStyle | 0 |
RoofMatl | 0 |
Exterior1st | 1 |
Exterior2nd | 1 |
MasVnrType | 16 |
MasVnrArea | 15 |
ExterQual | 0 |
ExterCond | 0 |
Foundation | 0 |
BsmtQual | 44 |
BsmtCond | 45 |
BsmtExposure | 44 |
BsmtFinType1 | 42 |
BsmtFinSF1 | 1 |
BsmtFinType2 | 42 |
BsmtFinSF2 | 1 |
BsmtUnfSF | 1 |
TotalBsmtSF | 1 |
Heating | 0 |
HeatingQC | 0 |
CentralAir | 0 |
Electrical | 0 |
X1stFlrSF | 0 |
X2ndFlrSF | 0 |
LowQualFinSF | 0 |
GrLivArea | 0 |
BsmtFullBath | 2 |
BsmtHalfBath | 2 |
FullBath | 0 |
HalfBath | 0 |
BedroomAbvGr | 0 |
KitchenAbvGr | 0 |
KitchenQual | 1 |
TotRmsAbvGrd | 0 |
Functional | 2 |
Fireplaces | 0 |
FireplaceQu | 730 |
GarageType | 76 |
GarageYrBlt | 78 |
GarageFinish | 78 |
GarageCars | 1 |
GarageArea | 1 |
GarageQual | 78 |
GarageCond | 78 |
PavedDrive | 0 |
WoodDeckSF | 0 |
OpenPorchSF | 0 |
EnclosedPorch | 0 |
X3SsnPorch | 0 |
ScreenPorch | 0 |
PoolArea | 0 |
PoolQC | 1456 |
Fence | 1169 |
MiscFeature | 1408 |
MiscVal | 0 |
MoSold | 0 |
YrSold | 0 |
SaleType | 1 |
SaleCondition | 0 |
selectCols <- c('GrLivArea', 'SalePrice')
select.data <- train[selectCols]
kable(select.data[sample(nrow(select.data), 12), ] , format="pandoc", align="l", row.names = F, caption = "Sample of Univariate Descriptive Stat.")
GrLivArea | SalePrice |
---|---|
1936 | 230500 |
1863 | 219210 |
1477 | 145000 |
1092 | 139000 |
1983 | 225000 |
954 | 119750 |
1348 | 117000 |
2018 | 378500 |
1593 | 178000 |
1285 | 143500 |
2792 | 256000 |
996 | 108000 |
summary(select.data)
## GrLivArea SalePrice
## Min. : 334 Min. : 34900
## 1st Qu.:1130 1st Qu.:129975
## Median :1464 Median :163000
## Mean :1515 Mean :180921
## 3rd Qu.:1777 3rd Qu.:214000
## Max. :5642 Max. :755000
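Create 2 × 2 tables of observed counts and joint probabilities (captioned “Correlation Matrix” in the code), using x = 1130 (the 1st quartile of GrLivArea, rounded) and y = 163000 (the median of SalePrice) as cut points.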
total.obs <- nrow(select.data)
#No. of observations, X>x and Y>y
xy.pos.data <- select.data[which(select.data$GrLivArea > 1130 & select.data$SalePrice > 163000),]
xy.pos <- nrow(xy.pos.data)
#No. of observations,X<x and Y<y
xy.neg.data <- select.data[which(select.data$GrLivArea <= 1130 & select.data$SalePrice <= 163000),]
xy.neg <- nrow(xy.neg.data)
#No. of observations, X>x and Y<y
x.pos.y.neg.data <- select.data[which(select.data$GrLivArea > 1130 & select.data$SalePrice <= 163000),]
x.pos.y.neg <- nrow(x.pos.y.neg.data)
#No. of observations, X<x and Y>y
x.neg.y.pos.data <- select.data[which(select.data$GrLivArea <= 1130 & select.data$SalePrice > 163000),]
x.neg.y.pos <- nrow(x.neg.y.pos.data)
house.data<- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)
#add column and row totals
house.data<- cbind(house.data, Total = rowSums(house.data))
house.data<- rbind(house.data, Total = colSums(house.data))
rownames(house.data)<- c('(X>x)', '(X<=x)', 'Total')
kable(house.data, digits = 2, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Correlation Matrix of Observations')
GrLivArea (X) | (Y>y) | (Y<=y) | Total |
---|---|---|---|
(X>x) | 720 | 374 | 1094 |
(X<=x) | 8 | 358 | 366 |
Total | 728 | 732 | 1460 |
house.data.prob <- matrix(c(round(xy.pos/total.obs,4), round(x.neg.y.pos/total.obs,4),round(x.pos.y.neg/total.obs,4),round(xy.neg/total.obs,4)), nrow=2, ncol=2)
house.data.prob <- cbind(house.data.prob, Total = round(rowSums(house.data.prob),2))
house.data.prob <- rbind(house.data.prob, Total = round(colSums(house.data.prob),2))
rownames(house.data.prob) <- c('(X>x)', '(X<=x)', 'Total')
kable(house.data.prob, digits = 4, col.names = c('(Y>y)', '(Y<=y)', 'Total'), align = "l", caption = 'Correlation Matrix of Joint Probabilities')
GrLivArea (X) | (Y>y) | (Y<=y) | Total |
---|---|---|---|
(X>x) | 0.4932 | 0.2562 | 0.75 |
(X<=x) | 0.0055 | 0.2452 | 0.25 |
Total | 0.5000 | 0.5000 | 1.00 |
a. \(P(X>x~ |~ Y>y )\), read as the probability that X (GrLivArea) is greater than 1130 square feet given that Y (SalePrice) is greater than $163000.
This is known as a conditional probability because we compute the probability under a condition: SalePrice is greater than $163000. A conditional probability has two parts, the outcome of interest and the condition. We treat the condition as information known to be true, and that information usually helps describe the outcome.
\(P(GrLivArea~ > 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)
= \(\frac{720}{728} = 98.9 \%\)
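Using the joint probabilities,
= \(\frac{0.4932}{0.5000} = 98.64 \%\)
The small difference from \(98.9 \%\) arises because the table rounds \(P(Y>y) = \frac{728}{1460} = 0.4986\) to \(0.5000\).
Therefore, the probability that GrLivArea is greater than \(1130~ sq.ft.\), given that SalePrice is greater than \(\$163000\), is about \(99\%\)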
b. \(P(X>x~ \&~ Y>y)\), read as the probability that X (GrLivArea) is greater than 1130 square feet and Y (SalePrice) is greater than $163000.
This is known as a joint probability because we compute the probability using the outcomes of two variables together.
\(P(GrLivArea~ > 1130~ sq.ft. and SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ > 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ observed}\)
= \(\frac{720}{1460} = 49.32 \%\)
Therefore, the probability that GrLivArea is greater than \(1130~ sq.ft.\) and SalePrice is greater than \(\$163000\) is \(49.32 \%\)
c. \(P(X<x~ |~ Y>y )\), read as probability X (GrLivArea) is less than 1130 square feet given Y (SalePrice) is greater than $163000. This is conditional probability.
\(P(GrLivArea~ < 1130~ sq.ft. | SalePrice~ >~ \$163000) = \frac {\#~ cases~ where~ GrLivArea~ < 1130~ sq.ft.~ and~ SalePrice~ >~ \$163000 }{\# cases~ where~ SalePrice~ >~ \$163000}\)
= \(\frac{8}{728} = 1.1 \%\)
Using the joint probabilities,
= \(\frac{0.0055}{0.5000} = 1.1 \%\)
Therefore, the probability that GrLivArea is less than \(1130~ sq.ft.\), given that SalePrice is greater than \(\$163000\), is \(1.1 \%\)
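The three probabilities above can be recomputed directly from the table of counts. A minimal sketch in base R, with the four cell counts copied from the table of observations; the object names are illustrative and not part of the original analysis.

```r
# Cell counts from the table of observations:
# rows are X > x and X <= x, columns are Y > y and Y <= y,
# where X = GrLivArea (x = 1130) and Y = SalePrice (y = 163000).
counts <- matrix(c(720, 374,
                     8, 358),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("X>x", "X<=x"), c("Y>y", "Y<=y")))
n <- sum(counts)

# a. P(X > x | Y > y): restrict attention to the Y > y column
p_a <- counts["X>x", "Y>y"] / sum(counts[, "Y>y"])

# b. P(X > x and Y > y): joint probability over all observations
p_b <- counts["X>x", "Y>y"] / n

# c. P(X <= x | Y > y): the complement of (a) within the Y > y column
p_c <- counts["X<=x", "Y>y"] / sum(counts[, "Y>y"])

round(c(a = p_a, b = p_b, c = p_c), 4)
```

Working from the raw counts avoids the rounding that appears when the joint-probability table is used instead.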
\(P(XY) = P(X)P(Y)\)
Above condition can be rewritten as
\(P(X \cap Y) = P(X)P(Y)\), condition will be true only when \(X\) and \(Y\) are independent.
We can say that above grade living area and sale price are independent only when an increase or decrease in the area does not affect the probability of increase or decrease of the sale price of the house. We can test the condition by using the following hypothesis.
Null Hypothesis(\(H_0\)): Sale price of the house is not influenced by above grade living area.
Alternative Hypothesis(\(H_A\)): Above grade living area has significant influence on sale price of the house.
For any two events, the multiplication rule gives
\(P(X>x~ |~ Y>y)P(Y>y) = P(X>x~ \cap ~ Y>y) = P(Y>y~ |~ X>x)P(X>x)\)
If the two variables were independent, each conditional probability would equal the corresponding marginal probability, that is \(P(X>x~ |~ Y>y) = P(X>x)\) and \(P(Y>y~ |~ X>x) = P(Y>y)\). We check this in two parts.
From \(P(X>x~ |~ Y>y)P(Y>y) = P(X>x~ \cap ~ Y>y)\),
\(P(X>x~ |~ Y>y) = \frac{P(X>x~ \cap ~ Y>y)}{P(Y>y)}\)
where \(P(X>x~ \cap ~ Y>y)\) is the probability that GrLivArea > 1130 sq.ft. and SalePrice > $163000, and \(P(Y>y)\) is the probability that SalePrice > $163000.
\(P(X>x~ |~ Y>y) = \frac{720}{728} = 98.9 \%\), compared with \(P(X>x) = \frac{1094}{1460} = 74.93 \%\)
Comparing the other way, from \(P(Y>y~ |~ X>x)P(X>x) = P(X>x~ \cap ~ Y>y)\),
\(P(Y>y~ |~ X>x) = \frac{P(X>x~ \cap ~ Y>y)}{P(X>x)}\)
where \(P(X>x)\) is the probability that GrLivArea > 1130 sq.ft.
\(P(Y>y~ |~ X>x) = \frac{720}{1094} = 65.81 \%\), compared with \(P(Y>y) = \frac{728}{1460} = 49.86 \%\)
Neither conditional probability equals its marginal probability. Equivalently, \(P(X>x~ \cap ~ Y>y) = 49.32 \%\) while \(P(X>x)P(Y>y) = 37.36 \%\). The independence condition is therefore not met, which points against the Null Hypothesis(\(H_0\)) and toward the Alternative Hypothesis(\(H_A\)) that above grade living area is associated with the sale price of the house. The chi-squared test that follows checks this formally.
house.data <- matrix(c(xy.pos, x.neg.y.pos,x.pos.y.neg,xy.neg), nrow=2, ncol=2)
house.data
## [,1] [,2]
## [1,] 720 374
## [2,] 8 358
chisq.test(house.data)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: house.data
## X-squared = 441.58, df = 1, p-value < 2.2e-16
Because the table is 2 × 2 (two categories each for GrLivArea and SalePrice), degrees of freedom(df) = (2 - 1)(2 - 1) = 1. The p-value is below \(2.2 \times 10^{-16}\), which is effectively “0” and far smaller than the \(0.05\) significance level. So we reject the Null Hypothesis(\(H_0\)) and accept the Alternative Hypothesis(\(H_A\)) that GrLivArea is significantly associated with the sale price of the house.
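To see where the test statistic comes from, the sketch below rebuilds it by hand for the 2 × 2 table. Expected counts under independence are (row total × column total) / n, and Yates' continuity correction subtracts 0.5 from each absolute deviation. The counts are those printed above; the variable names are illustrative.

```r
# Observed counts, laid out as in the printed house.data matrix
observed <- matrix(c(720, 374,
                       8, 358), nrow = 2, byrow = TRUE)
n <- sum(observed)

# Expected counts if the GrLivArea and SalePrice categories were independent
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic with Yates' continuity correction
chi_sq <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Cross-check against the built-in test
builtin <- chisq.test(observed)
c(by_hand = chi_sq, builtin = unname(builtin$statistic))
```

Both values should agree with the X-squared of 441.58 reported by `chisq.test(house.data)`.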
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
house.stats <- matrix(data = NA, nrow=11,ncol=2)
qarea <- quantile(select.data$GrLivArea)
qprice <- quantile(select.data$SalePrice)
house.stats[1,1] <- nrow(select.data)
house.stats[1,2] <- nrow(select.data)
house.stats[2,1] <- length(select.data$GrLivArea[!is.na(select.data$GrLivArea)])
house.stats[2,2] <- length(select.data$SalePrice[!is.na(select.data$SalePrice)])
house.stats[3,1] <- paste0(min(select.data$GrLivArea), ' sq. ft.')
house.stats[3,2] <- paste0('$', min(select.data$SalePrice))
house.stats[4,1] <- paste0(max(select.data$GrLivArea), ' sq. ft.')
house.stats[4,2] <- paste0('$', max(select.data$SalePrice))
house.stats[5,1] <- paste0(median(select.data$GrLivArea), ' sq. ft.')
house.stats[5,2] <- paste0('$', median(select.data$SalePrice))
house.stats[6,1] <- paste0(qarea[2], ' sq. ft.')
house.stats[6,2] <- paste0('$', qprice[2])
house.stats[7,1] <- paste0(qarea[4], ' sq. ft.')
house.stats[7,2] <- paste0('$', qprice[4])
house.stats[8,1] <- paste0(round(mean(select.data$GrLivArea),2), ' sq. ft.')
house.stats[8,2] <- paste0('$', round(mean(select.data$SalePrice),2))
house.stats[9,1] <- round(sd(select.data$GrLivArea),2)
house.stats[9,2] <- round(sd(select.data$SalePrice),2)
house.stats[10,1] <- paste0(getmode(select.data$GrLivArea), ' sq. ft.')
house.stats[10,2] <- paste0('$', getmode(select.data$SalePrice))
house.stats[11,1] <- paste0(IQR(select.data$GrLivArea), ' sq. ft.')
house.stats[11,2] <- paste0('$', IQR(select.data$SalePrice))
rownames(house.stats)<- c('Number of Observations', 'Non-missing values', 'Minimum','Maximum', 'Median','1st quartile','3rd quartile', 'Average(mean)', 'Standard deviation', 'Mode', 'Interquartile range(IQR)')
kable(house.stats, digits = 2,
col.names = c('GrLivArea', 'SalePrice'),
align = "l",
caption = 'Univariate Descriptive Statistics', "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Statistic | GrLivArea | SalePrice |
---|---|---|
Number of Observations | 1460 | 1460 |
Non-missing values | 1460 | 1460 |
Minimum | 334 sq. ft. | $34900 |
Maximum | 5642 sq. ft. | $755000 |
Median | 1464 sq. ft. | $163000 |
1st quartile | 1129.5 sq. ft. | $129975 |
3rd quartile | 1776.75 sq. ft. | $214000 |
Average(mean) | 1515.46 sq. ft. | $180921.2 |
Standard deviation | 525.48 | 79442.5 |
Mode | 864 sq. ft. | $140000 |
Interquartile range(IQR) | 647.25 sq. ft. | $84025 |
ggplot(train, aes(GrLivArea)) + geom_histogram(binwidth = 150, alpha=0.5, color="red", fill="light green")
The histogram shows the distribution of “GrLivArea”. The average area is 1515.46 sq.ft. with a standard deviation of 525.48 sq.ft. The distribution has a long right tail, suggesting outliers above the average.
ggplot(train, aes(SalePrice)) + geom_histogram(binwidth = 30000, alpha=0.5, color="red", fill="light green")
The histogram shows the distribution of the sale price of houses. The average sale price is $180921.2, with a standard deviation of $79442.5. The distribution has a long right tail, suggesting outliers above the average.
ggplot(train, aes(GrLivArea,SalePrice))+geom_boxplot(color="red", fill="light green", outlier.size=3)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
d <- ggplot(train, aes(x = GrLivArea, y = SalePrice)) +
geom_boxplot()
d
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
ggplot(train, aes(x=GrLivArea, y=SalePrice)) + geom_point(color="green", size=3, alpha=0.3) + stat_smooth(method="lm", color="red", se=FALSE)+ggtitle("Correlation of GrLivArea and SalePrice")
From the boxplot, histograms and scatterplot above, we can see that there are some outliers and that the variation of “SalePrice” across “GrLivArea” is not constant. The outliers contribute to the longer tail on the right side.
Quantiles of GrLivArea
kable(qarea, digits = 2,
caption = 'Quartiles of "GrLivArea" (sq. ft.)',
align = 'l', padding = 10, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Quantile | GrLivArea (sq. ft.) |
---|---|
0% | 334.00 |
25% | 1129.50 |
50% | 1464.00 |
75% | 1776.75 |
100% | 5642.00 |
lm_model_price_area <- lm(train$SalePrice ~ train$GrLivArea)
summary(lm_model_price_area)
##
## Call:
## lm(formula = train$SalePrice ~ train$GrLivArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462999 -29800 -1124 21957 339832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18569.026 4480.755 4.144 3.61e-05 ***
## train$GrLivArea 107.130 2.794 38.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared: 0.5021, Adjusted R-squared: 0.5018
## F-statistic: 1471 on 1 and 1458 DF, p-value: < 2.2e-16
Multiple R-squared: 0.5021 means that the regression model explains 50.21% of the variation in SalePrice. Residual standard error: 56070 suggests that the typical distance of the data points from the fitted line is about $56070. If the residuals are roughly normal, about 95% of sale prices should fall within ±2 × 56070 of the fitted value.
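As a worked illustration of reading this output, the fitted line and the rough ±2 residual-standard-error band can be evaluated for a single house. The sketch uses the coefficients printed above; the 1,500 sq. ft. input is a hypothetical example, and the band is only an approximation that assumes roughly normal residuals with constant variance (a proper interval would come from `predict()` with `interval = "prediction"`).

```r
# Coefficients and residual standard error copied from the summary output
intercept <- 18569.026
slope     <- 107.130     # dollars per additional square foot
rse       <- 56070

# Hypothetical house size, chosen only for illustration
area <- 1500

# Point estimate from the fitted line
fitted_price <- intercept + slope * area

# Rough band: fitted value plus or minus two residual standard errors
rough_band <- fitted_price + c(-2, 2) * rse

fitted_price
rough_band
```

For this input the fitted price is about $179,264, with a rough band from about $67,124 to $291,404. The width of that band shows how much variation a single predictor leaves unexplained.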
As we can see, neither “GrLivArea” nor “SalePrice” is normally distributed. Normality is an important assumption for many statistical techniques; the Box-Cox transformation is a way to transform a non-normal dependent variable toward a normal shape.
For all positive values of \(y\), it is defined by
\[ y(\lambda)=\begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \text{if }\lambda \neq 0\ \\ log~ y, & \text{if }\lambda = 0\ \end{cases} \]
If \(y\) has negative values then it is defined as
\[ y(\lambda)=\begin{cases} \frac{(y + {\lambda}_2)^{\lambda_1} - 1}{\lambda_1}, & \text{if }\lambda_1 \neq 0\ \\ log~ (y + {\lambda}_2), & \text{if }\lambda_1 = 0\ \end{cases} \]
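The one-parameter definition can be written as a small function, together with its inverse for mapping transformed predictions back to the original scale. This is a sketch for illustration, not code from the report; `box_cox` and `inv_box_cox` are hypothetical helper names.

```r
# One-parameter Box-Cox transform for strictly positive y
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform: recovers y from a transformed value z
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

# Minimum, median and maximum SalePrice, taken from the summary output
prices <- c(34900, 163000, 755000)

# Transform, then check that the inverse returns the original prices
z <- box_cox(prices, lambda = 0.1)
all.equal(inv_box_cox(z, lambda = 0.1), prices)
```

The inverse matters whenever a model fitted on the transformed response is used to produce predictions in dollars.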
We will use the R function boxcox from the MASS library to determine the optimal lambda(\(\lambda\)) value.
par(mfrow=c(1,2))
house.bc <- boxcox(lm_model_price_area)
house.bc.df <- as.data.frame(house.bc)
lambda <- house.bc.df[which.max(house.bc.df$y),1]
boxcox(lm_model_price_area, plotit=T, lambda=seq(0,0.20,by=0.05))
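From the boxcox plot above, the optimal lambda(\(\lambda\)) is about \(0.10\), and the confidence interval runs between \(0.02\) and \(0.18\). Because this interval excludes \(\lambda = 1\) (no transformation) and lies close to \(0\) (the log transformation), the plot suggests that a transformation close to the logarithm is appropriate.
We therefore perform the transformation with the estimated \(\lambda\) to compare the result.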
train$SalePrice_trans <- ((train$SalePrice^lambda) -1)/lambda
ggplot(train, aes(x=GrLivArea, y=SalePrice_trans)) +
geom_point(alpha=0.3, size=3)+
stat_smooth(method="lm", color="blue", se=FALSE)+
labs(title="Scatterplot GrLivArea Vs. Transformed SalePrice",
x="GrLivArea(sq.ft.)", y = "Transformed SalePrice")
house_t.lm <- lm(train$SalePrice_trans ~ train$GrLivArea)
summary(house_t.lm)
##
## Call:
## lm(formula = train$SalePrice_trans ~ train$GrLivArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6488 -0.4945 0.0843 0.5314 3.1560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.073e+01 7.656e-02 270.75 <2e-16 ***
## train$GrLivArea 1.813e-03 4.774e-05 37.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9581 on 1458 degrees of freedom
## Multiple R-squared: 0.4975, Adjusted R-squared: 0.4971
## F-statistic: 1443 on 1 and 1458 DF, p-value: < 2.2e-16
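As we see, the multiple R-squared (0.4975) is slightly smaller than that of the untransformed model, so by this measure the transformation does not improve the fit. Note that R-squared values computed on different response scales are not strictly comparable.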
ggplot(train, aes(x=LotArea, y=SalePrice)) + geom_point(color="green", size=3, alpha=0.3) + stat_smooth(method="lm", color="red", se=FALSE)+ggtitle("Correlation of LotArea and SalePrice")
ggplot(train, aes(x=X1stFlrSF, y=SalePrice)) + geom_point(color="green", size=3, alpha=0.3) + stat_smooth(method="lm", color="red", se=FALSE)+ggtitle("Correlation of X1stFlrSF and SalePrice")
corr_data<-subset(train,select=c("X1stFlrSF","LotArea", "SalePrice"))
correlation_matrix <- round(cor(corr_data),2)
get_lower_tri<-function(correlation_matrix){
correlation_matrix[upper.tri(correlation_matrix)] <- NA
return(correlation_matrix)
}
get_upper_tri <- function(correlation_matrix){
correlation_matrix[lower.tri(correlation_matrix)]<- NA
return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1))+
coord_fixed()
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_text(size=rel(0.8), angle=90),
axis.text.y=element_text(size=rel(0.8)),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
cor.test(corr_data$X1stFlrSF, corr_data$SalePrice, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corr_data$X1stFlrSF and corr_data$SalePrice
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5841687 0.6266715
## sample estimates:
## cor
## 0.6058522
cor.test(corr_data$LotArea, corr_data$SalePrice, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corr_data$LotArea and corr_data$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(corr_data$X1stFlrSF, corr_data$LotArea, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corr_data$X1stFlrSF and corr_data$LotArea
## t = 11.985, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2686127 0.3297222
## sample estimates:
## cor
## 0.2994746
For each pair of variables, we generated an 80 percent confidence interval. All three p-values are below 0.001 and none of the intervals contains 0. Hence, in all three tests we reject the null hypothesis and conclude that the true correlation is not 0 for the selected variables.
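The interval and t statistic reported by cor.test can be reproduced by hand. The sketch below assumes the usual Fisher z-transformation for the interval and uses the rounded correlation between X1stFlrSF and SalePrice from the output above, so the last digits may differ slightly.

```r
r    <- 0.6058522   # sample correlation of X1stFlrSF and SalePrice (from the output)
n    <- 1460        # number of observations
conf <- 0.80        # confidence level

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# Fisher z-transformation: atanh(r) is approximately normal with sd 1/sqrt(n - 3)
z    <- atanh(r)
se   <- 1 / sqrt(n - 3)
crit <- qnorm(1 - (1 - conf) / 2)
ci   <- tanh(z + c(-1, 1) * crit * se)

t_stat   # compare with t = 29.078
ci       # compare with 0.5841687 0.6266715
```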
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
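The familywise error rate (FWER) is the probability of making at least one Type I error (a false rejection) across a family of tests. The more tests we run, the more likely it is that at least one of them rejects a true null hypothesis by chance. For \(n\) independent tests, each at significance level \(\alpha\), the FWER is \(1-(1-\alpha)^n\). Our three tests share variables, so this is only an approximation. With three tests at \(\alpha = 0.05\):
n <- 3
alpha <- 0.05
fwer <- 1 - (1 - alpha)^n
print(paste0("Familywise error rate is ", round(fwer, 4)))
## [1] "Familywise error rate is 0.1426"
A Bonferroni correction would test each correlation at \(\alpha/n \approx 0.0167\). All three p-values are below 2.2e-16, far under that threshold, so familywise error is not a concern here.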
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
The following are three untransformed independent variables (LotArea, TotalBsmtSF, GrLivArea) and one dependent variable (SalePrice).
Correlation measures the linear trend shared between two variables. If the value is close to 1, the variables are positively related. If the value is close to -1, the variables are negatively (inversely) related. If the value is close to 0, the two variables are only weakly linearly related.
select.Cols <- c('LotArea','TotalBsmtSF','GrLivArea', 'SalePrice')
select.data <- train[select.Cols]
pearson.cor <- cor(select.data,method="pearson")
pearson.cor
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1.0000000 0.2608331 0.2631162 0.2638434
## TotalBsmtSF 0.2608331 1.0000000 0.4548682 0.6135806
## GrLivArea 0.2631162 0.4548682 1.0000000 0.7086245
## SalePrice 0.2638434 0.6135806 0.7086245 1.0000000
The correlation between TotalBsmtSF and SalePrice is 0.61, which indicates that a bigger basement area is associated with a higher sale price. The square of the coefficient is approximately 0.376, meaning that about 37.6% of the variance in the sale price of a house can be explained by the total area of the basement.
The correlation between GrLivArea and SalePrice is 0.71, which indicates that a bigger living area is associated with a higher sale price. The square of the coefficient is approximately 0.502, meaning that about 50.2% of the variance in the sale price of a house can be explained by the total above-grade living area.
inv.cor <- solve(pearson.cor)
inv.cor
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1.10622180 -0.1703170 -0.1623394 -0.07232846
## TotalBsmtSF -0.17031695 1.6321069 -0.0397442 -0.92832834
## GrLivArea -0.16233936 -0.0397442 2.0350650 -1.37487844
## SalePrice -0.07232846 -0.9283283 -1.3748784 2.56296011
Correlation Matrix multiplied by Precision Matrix
round(pearson.cor %*% inv.cor)
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1 0 0 0
## TotalBsmtSF 0 1 0 0
## GrLivArea 0 0 1 0
## SalePrice 0 0 0 1
Precision Matrix multiplied by Correlation Matrix
round(inv.cor %*% pearson.cor)
## LotArea TotalBsmtSF GrLivArea SalePrice
## LotArea 1 0 0 0
## TotalBsmtSF 0 1 0 0
## GrLivArea 0 0 1 0
## SalePrice 0 0 0 1
Correlation Matrix multiplied by Precision Matrix and Precision Matrix multiplied by Correlation Matrix results in identity matrix.
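The prompt also asks for an LU decomposition of the matrix. A minimal sketch follows, assuming a Doolittle factorization without pivoting, which should be acceptable here because a positive-definite correlation matrix has nonzero pivots. `lu_doolittle` is a hypothetical helper name, and the matrix entries are copied from the rounded output above.

```r
# Correlation matrix copied from the printed output (7 decimals)
A <- matrix(c(1.0000000, 0.2608331, 0.2631162, 0.2638434,
              0.2608331, 1.0000000, 0.4548682, 0.6135806,
              0.2631162, 0.4548682, 1.0000000, 0.7086245,
              0.2638434, 0.6135806, 0.7086245, 1.0000000),
            nrow = 4, byrow = TRUE)

# Doolittle LU: L is unit lower triangular, U is upper triangular, A = L %*% U
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

lu <- lu_doolittle(A)
round(lu$L, 4)
round(lu$U, 4)
all.equal(lu$L %*% lu$U, A)   # the product recovers the original matrix
```

For matrices that may have zero or tiny pivots, a pivoted routine from an established package would be the safer choice.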
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
In this part, the professor asks for a right-skewed variable, shifted if necessary so that its minimum value is strictly above zero. After reviewing the numerical data, we found that most size-related variables have a long right tail, which makes sense because most houses are small or medium sized. The year-related variables are skewed the other way: most houses were built or renovated recently, with a long tail toward older years. We work with YearBuilt below.
ggplot(train, aes(x=YearBuilt)) + geom_histogram(binwidth=10,color="red", fill="light green")
ggplot(train, aes(x=YearBuilt, y=SalePrice)) + geom_point(color="green", size=3, alpha=0.3) + stat_smooth(method="lm", color="red", se=FALSE)+ggtitle("Correlation of YearBuilt and SalePrice")
In general, YearBuilt should be positively and linearly correlated with SalePrice, since inflation, rising wages, and the increasing cost of building materials all play an important role in pushing SalePrice higher. However, a few high-end units act as outliers and shift the fitted line.
selectCols <- c('YearBuilt', 'SalePrice')
select.data <- train[selectCols]
fit.output <- fitdistr(select.data$YearBuilt, densfun="normal")
fit.output
## mean sd
## 1971.2678082 30.1925588
## ( 0.7901754) ( 0.5587384)
Output of the 'fitdistr' function: mean (\(\mu\)) = 1971.27 and standard deviation (\(\sigma\)) = 30.19.
To find the optimal estimates, I will use the 'optim' and 'dnorm' functions. 'dnorm' is the R function that calculates the probability density of the normal distribution. The mean estimated by 'fitdistr', together with 30 for the standard deviation, serves as the starting value for 'optim'.
likelihood.func <- function(params) { -sum(dnorm(select.data$YearBuilt, params[1], params[2], log=TRUE)) }
optim.output <- optim(c(fit.output$estimate[1], 30), likelihood.func)
## Warning in dnorm(select.data$YearBuilt, params[1], params[2], log = TRUE):
## NaNs produced
## Warning in dnorm(select.data$YearBuilt, params[1], params[2], log = TRUE):
## NaNs produced
optim.output
## $par
## mean
## 1971.2765 30.1827
##
## $value
## [1] 7046.74
##
## $counts
## function gradient
## 57 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Output of the 'optim' function: mean (\(\mu\)) = 1971.28 and standard deviation (\(\sigma\)) = 30.18.
Apart from small numerical differences, 'optim' produces the same estimates as 'fitdistr'.
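The same maximum-likelihood idea can be sketched on simulated data. This is an illustration under stated assumptions, not the analysis above: the data are simulated stand-ins for YearBuilt, and the standard deviation is optimized on the log scale so that 'optim' can never try a non-positive value, which is the most likely source of the NaN warnings shown earlier.

```r
set.seed(605)
x <- rnorm(1460, mean = 1971, sd = 30)   # simulated stand-in for YearBuilt (assumption)

# Negative log-likelihood of a normal model; par[2] is log(sd), so sd = exp(par[2]) > 0
nll <- function(par) {
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}

# Start from the sample moments and minimize with the default Nelder-Mead method
fit <- optim(c(mean(x), log(sd(x))), nll)

mle_optim  <- c(mean = fit$par[1], sd = exp(fit$par[2]))
mle_closed <- c(mean = mean(x), sd = sqrt(mean((x - mean(x))^2)))  # closed-form MLE
round(rbind(mle_optim, mle_closed), 2)
```

The closed-form row uses the divisor \(n\) rather than \(n-1\), which is what maximum likelihood gives for the normal standard deviation.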
To generate 1000 samples, the 'rnorm' function is used with the optimal parameters produced by 'optim'.
#generate 1000 samples
year.sample <- rnorm(n=1000, mean=round(optim.output$par[1],2), sd=round(optim.output$par[2],2))
year.sample <- data.frame(year.sample)
names(year.sample)[1] <- "Samples"
select.data$yearSplit <- as.factor(select.data[,1]>=round(optim.output$par[1],2))
a <- ggplot(select.data, aes(YearBuilt, fill=yearSplit)) +
geom_histogram(color="light pink", binwidth=10) +
scale_x_continuous(name = "Year") +
# factor levels are FALSE then TRUE, so the "Year < 1971" label must come first
scale_fill_manual(values=c("light green","green"),labels=c("Year < 1971","Year >= 1971")) +
ylab("Number of Observations") +
ggtitle("Observed Year Distribution") +
# geom_vline() has no 'labels' argument, so it is dropped to avoid the warning
geom_vline(xintercept = mean(select.data$YearBuilt), color="red", lwd=1)
year.sample$yearSplit <- as.factor(year.sample[,1]>=round(optim.output$par[1],2))
b <- ggplot(year.sample, aes(Samples, fill=yearSplit)) +
geom_histogram(color="light pink", binwidth=10) +
scale_x_continuous(name = "Sample Year") +
scale_fill_manual(values=c("light green","green"),labels=c("Year < 1971","Year >= 1971")) +
ylab("Number of Samples") +
ggtitle("Sample Year Distribution") +
geom_vline(xintercept = mean(select.data$YearBuilt), color="red", lwd=1)
grid.arrange(b, a, nrow = 2, top='Sampling Data and Original Data')
The mean and SD of the samples and of the observed data are approximately the same, about 1971 and 30 respectively. The red line represents the average of the observed data.
The actual observed data have some outliers, while the sample data do not.
A chi-square test will be used to see whether the generated sample is consistent with a normal distribution. Under normality, we expect 50% of the cases to have a year greater than or equal to the average and 50% of the cases to be below the average.
Hypotheses:
\(H_0\) : Sample data follow a specified distribution.
\(H_A\) : Sample data do not follow the specified distribution.
#Chi-square test
#Expected proportions under the null hypothesis (50/50 split around the mean)
null_p<-c(0.50, 0.50)
#Samples generated
sample.rows <- c(sum(year.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(year.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))
#Goodness-of-Fit Test
chisq.test(sample.rows, p=null_p)
##
## Chi-squared test for given probabilities
##
## data: sample.rows
## X-squared = 1.156, df = 1, p-value = 0.2823
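The statistic reported by chisq.test can be checked by hand. The sketch below rests on an assumption: the counts 517 and 483 are not printed in the source but are inferred from the reported statistic, since with 1000 samples and a 50/50 null, X-squared = 1.156 implies a split 17 away from 500 (which of the two groups is larger cannot be recovered).

```r
# Counts inferred from the reported statistic (assumption, not shown in the source)
observed <- c(517, 483)
null_p   <- c(0.5, 0.5)

expected <- sum(observed) * null_p                 # 500 and 500
x2       <- sum((observed - expected)^2 / expected)
p_value  <- pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)

c(statistic = x2, p.value = p_value)               # compare with 1.156 and 0.2823

# The built-in test gives the same statistic (no continuity correction is applied here)
chisq.test(observed, p = null_p)$statistic
```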
The following chi-square test checks whether the sample represents the actual observed data. Hypotheses: \(H_0\): The sample data represent the actual observed data. \(H_A\): The sample data do not represent the actual observed data.
#Ratio of actual values
null_p<-c(round((sum(select.data$YearBuilt >= round(optim.output$par[1],2), na.rm=TRUE)) / nrow(select.data),2), round((sum(select.data$YearBuilt < round(optim.output$par[1],2), na.rm=TRUE)) / nrow(select.data),2))
#Samples generated
sample.rows <- c(sum(year.sample$Samples >= round(optim.output$par[1],2), na.rm=TRUE), sum(year.sample$Samples < round(optim.output$par[1],2), na.rm=TRUE))
#Goodness-of-Fit Test
chisq.test(sample.rows, p=null_p)
##
## Chi-squared test for given probabilities
##
## data: sample.rows
## X-squared = 5.4848, df = 1, p-value = 0.01918
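For the first test, the p-value is 0.2823, which is greater than 0.05, so we fail to reject \(H_0\): the generated sample is consistent with a 50/50 split around the mean. For the second test, the p-value is 0.01918, which is less than 0.05, so we reject \(H_0\) at the 5% level: the share of sampled values at or above the mean differs from the share in the actual observed data. This suggests that the observed YearBuilt values are not symmetric around their mean in the way a normal sample is.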
At the beginning, we used the 'glimpse' function and noticed that the dataset includes not only numeric variables but also categorical and ordinal variables. Numeric variables pose no problem. Categorical or ordinal variables that are already coded as numbers can be treated as numbers when building the model. Character-valued variables are harder to handle: we are not realtors, so it is difficult to assign a meaningful rank to their levels. Still, we try our best to incorporate a couple of categorical variables in our study.
Here is the plan for dealing with the data:
1. Variables with more than 50% missing values, such as "Alley", "PoolQC", "Fence", and "MiscFeature", will be dropped.
2. For numerical variables, I will try to keep as many as possible. If values are missing, I will impute them with the mean.
3. Categorical variables are hard to analyze "as is", but dropping all of them is not wise because they carry a lot of information. I will keep some categorical variables by transforming them from character factors into ordinal variables coded with a series of numbers. Categorical variables that cannot easily be given an ordinal code will be dropped.
head(train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice SalePrice_trans
## 1 WD Normal 208500 24.21290
## 2 WD Normal 181500 23.73836
## 3 WD Normal 223500 24.45313
## 4 WD Abnorml 140000 22.86771
## 5 WD Normal 250000 24.84415
## 6 WD Normal 143000 22.93796
Drop SalePrice_trans from the data frame.
train$SalePrice_trans <- NULL
train <- read.csv("https://raw.githubusercontent.com/johnpannyc/data605-final/master/train.csv")
train1 <- dplyr::select(train,MSSubClass,Neighborhood,LotFrontage,LotArea,BldgType,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,CentralAir,X1stFlrSF,X2ndFlrSF,LowQualFinSF,GrLivArea,TotRmsAbvGrd,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,ScreenPorch,X3SsnPorch,PoolArea,MiscVal,MoSold,YrSold,SaleCondition,SalePrice)
glimpse(train1)
## Observations: 1,460
## Variables: 33
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit...
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, ...
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ CentralAir <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,...
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
Get the rank of the median neighborhood sale price. Neighborhood is an important factor for the price per square foot. Because I am not familiar with the neighborhoods in the dataset, I will get the median sale price of each neighborhood first.
df <- train %>%
group_by(Neighborhood) %>%
summarize(medianSalePrice = median(SalePrice)) %>% arrange(desc(medianSalePrice))
df
## # A tibble: 25 x 2
## Neighborhood medianSalePrice
## <fct> <dbl>
## 1 NridgHt 315000
## 2 NoRidge 301500
## 3 StoneBr 278000
## 4 Timber 228475
## 5 Somerst 225500
## 6 Veenker 218000
## 7 Crawfor 200624
## 8 ClearCr 200250
## 9 CollgCr 197200
## 10 Blmngtn 191000
## # ... with 15 more rows
Each neighborhood is then assigned a score from 25 to 1 according to its medianSalePrice, from highest (NridgHt) to lowest (MeadowV).
train1$Neighborhood <- as.character(train1$Neighborhood)
train1$Neighborhood[which(train1$Neighborhood == "NridgHt")] <- "25"
train1$Neighborhood[which(train1$Neighborhood == "NoRidge")] <- "24"
train1$Neighborhood[which(train1$Neighborhood == "StoneBr")] <- "23"
train1$Neighborhood[which(train1$Neighborhood == "Timber")] <- "22"
train1$Neighborhood[which(train1$Neighborhood == "Somerst")] <- "21"
train1$Neighborhood[which(train1$Neighborhood == "Veenker")] <- "20"
train1$Neighborhood[which(train1$Neighborhood == "Crawfor")] <- "19"
train1$Neighborhood[which(train1$Neighborhood == "ClearCr")] <- "18"
train1$Neighborhood[which(train1$Neighborhood == "CollgCr")] <- "17"
train1$Neighborhood[which(train1$Neighborhood == "Blmngtn")] <- "16"
train1$Neighborhood[which(train1$Neighborhood == "NWAmes")] <- "15"
train1$Neighborhood[which(train1$Neighborhood == "Gilbert")] <- "14"
train1$Neighborhood[which(train1$Neighborhood == "SawyerW")] <- "13"
train1$Neighborhood[which(train1$Neighborhood == "Mitchel")] <- "12"
train1$Neighborhood[which(train1$Neighborhood == "NPkVill")] <- "11"
train1$Neighborhood[which(train1$Neighborhood == "NAmes")] <- "10"
train1$Neighborhood[which(train1$Neighborhood == "SWISU")] <- "9"
train1$Neighborhood[which(train1$Neighborhood == "Blueste")] <- "8"
train1$Neighborhood[which(train1$Neighborhood == "Sawyer")] <- "7"
train1$Neighborhood[which(train1$Neighborhood == "BrkSide")] <- "6"
train1$Neighborhood[which(train1$Neighborhood == "Edwards")] <- "5"
train1$Neighborhood[which(train1$Neighborhood == "OldTown")] <- "4"
train1$Neighborhood[which(train1$Neighborhood == "BrDale")] <- "3"
train1$Neighborhood[which(train1$Neighborhood == "IDOTRR")] <- "2"
train1$Neighborhood[which(train1$Neighborhood == "MeadowV")] <- "1"
train1$Neighborhood <- as.numeric(train1$Neighborhood)
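The same scoring can be derived programmatically instead of typing one line per neighborhood. The following is a sketch on a toy data frame: the sale prices are made-up values for illustration, and `NeighborhoodScore` is a hypothetical column name.

```r
# Toy data with made-up prices (assumption); the real data has 25 neighborhoods
toy <- data.frame(
  Neighborhood = c("NridgHt", "NridgHt", "MeadowV", "MeadowV", "NAmes", "NAmes", "NAmes"),
  SalePrice    = c(315000, 320000, 88000, 90000, 140000, 150000, 145000)
)

# Median sale price per neighborhood, as a named vector
med <- tapply(toy$SalePrice, toy$Neighborhood, median)

# Rank the medians: 1 = lowest median, highest rank = highest median
score <- rank(med, ties.method = "first")

# Look up each row's score by neighborhood name
toy$NeighborhoodScore <- unname(score[as.character(toy$Neighborhood)])
toy
```

Because the score is built from SalePrice itself, it should be computed on the training data only and then applied unchanged to the test data.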
Convert indicator variables to numbers.
train1$CentralAir <- as.character(train1$CentralAir)
train1$CentralAir[which(train1$CentralAir == "Y")] <- "1"
train1$CentralAir[which(train1$CentralAir == "N")] <- "0"
train1$CentralAir <- as.numeric(train1$CentralAir)
train1$SaleCondition <- as.character(train1$SaleCondition)
train1$SaleCondition[which(train1$SaleCondition == "Normal")] <- "1"
train1$SaleCondition[which(train1$SaleCondition == "Abnorml")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "AdjLand")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "Alloca")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "Family")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "Partial")] <- "0"
train1$SaleCondition[which(train1$SaleCondition == "N")] <- "0"
train1$SaleCondition <- as.numeric(train1$SaleCondition)
train1$BldgType <- as.character(train1$BldgType)
train1$BldgType[which(train1$BldgType == "1Fam")] <- "5"
train1$BldgType[which(train1$BldgType == "2fmCon")] <- "4"
train1$BldgType[which(train1$BldgType == "Duplex")] <- "3"
train1$BldgType[which(train1$BldgType == "Twnhs")] <- "2"
train1$BldgType[which(train1$BldgType == "TwnhsE")] <- "1"
train1$BldgType <- as.numeric(train1$BldgType)
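The recoding steps above can also be expressed with a logical comparison and a named lookup vector. This is a sketch on small made-up vectors; `bldg_map` is a hypothetical name, and its values follow the codes used above.

```r
# Made-up example values (assumption), using levels that occur in the data
central_air <- c("Y", "N", "Y", "Y")
bldg_type   <- c("1Fam", "Duplex", "TwnhsE", "2fmCon")

# Indicator variable: TRUE/FALSE converted to 1/0
central_air_num <- as.integer(central_air == "Y")

# Named lookup vector: names are the original levels, values are the numeric codes
bldg_map <- c("1Fam" = 5, "2fmCon" = 4, "Duplex" = 3, "Twnhs" = 2, "TwnhsE" = 1)
bldg_type_num <- unname(bldg_map[bldg_type])

central_air_num
bldg_type_num
```

A level that is missing from the lookup vector becomes NA, which makes unexpected categories in the test data easy to spot.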
sapply(train1, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
Variable | Missing values |
---|---|
MSSubClass | 0 |
Neighborhood | 0 |
LotFrontage | 259 |
LotArea | 0 |
BldgType | 0 |
OverallQual | 0 |
OverallCond | 0 |
YearBuilt | 0 |
YearRemodAdd | 0 |
MasVnrArea | 8 |
BsmtFinSF1 | 0 |
BsmtFinSF2 | 0 |
BsmtUnfSF | 0 |
TotalBsmtSF | 0 |
CentralAir | 0 |
X1stFlrSF | 0 |
X2ndFlrSF | 0 |
LowQualFinSF | 0 |
GrLivArea | 0 |
TotRmsAbvGrd | 0 |
GarageCars | 0 |
GarageArea | 0 |
WoodDeckSF | 0 |
OpenPorchSF | 0 |
EnclosedPorch | 0 |
ScreenPorch | 0 |
X3SsnPorch | 0 |
PoolArea | 0 |
MiscVal | 0 |
MoSold | 0 |
YrSold | 0 |
SaleCondition | 0 |
SalePrice | 0 |
Impute missing data by mean
train1$LotFrontage[is.na(train1$LotFrontage)] <- mean(train1$LotFrontage, na.rm=TRUE)
train1$MasVnrArea[is.na(train1$MasVnrArea)] <- mean(train1$MasVnrArea, na.rm=TRUE)
vis_miss(train1)
sapply(train1, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
Variable | Missing values |
---|---|
MSSubClass | 0 |
Neighborhood | 0 |
LotFrontage | 0 |
LotArea | 0 |
BldgType | 0 |
OverallQual | 0 |
OverallCond | 0 |
YearBuilt | 0 |
YearRemodAdd | 0 |
MasVnrArea | 0 |
BsmtFinSF1 | 0 |
BsmtFinSF2 | 0 |
BsmtUnfSF | 0 |
TotalBsmtSF | 0 |
CentralAir | 0 |
X1stFlrSF | 0 |
X2ndFlrSF | 0 |
LowQualFinSF | 0 |
GrLivArea | 0 |
TotRmsAbvGrd | 0 |
GarageCars | 0 |
GarageArea | 0 |
WoodDeckSF | 0 |
OpenPorchSF | 0 |
EnclosedPorch | 0 |
ScreenPorch | 0 |
X3SsnPorch | 0 |
PoolArea | 0 |
MiscVal | 0 |
MoSold | 0 |
YrSold | 0 |
SaleCondition | 0 |
SalePrice | 0 |
View the counts of the recoded BldgType variable.
table(train1$BldgType)
##
## 1 2 3 4 5
## 114 43 52 31 1220
head(train1)
## MSSubClass Neighborhood LotFrontage LotArea BldgType OverallQual
## 1 60 17 65 8450 5 7
## 2 20 20 80 9600 5 6
## 3 60 17 68 11250 5 7
## 4 70 19 60 9550 5 7
## 5 60 24 84 14260 5 8
## 6 50 12 85 14115 5 5
## OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2
## 1 5 2003 2003 196 706 0
## 2 8 1976 1976 0 978 0
## 3 5 2001 2002 162 486 0
## 4 5 1915 1970 0 216 0
## 5 5 2000 2000 350 655 0
## 6 5 1993 1995 0 732 0
## BsmtUnfSF TotalBsmtSF CentralAir X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 150 856 1 856 854 0
## 2 284 1262 1 1262 0 0
## 3 434 920 1 920 866 0
## 4 540 756 1 961 756 0
## 5 490 1145 1 1145 1053 0
## 6 64 796 1 796 566 0
## GrLivArea TotRmsAbvGrd GarageCars GarageArea WoodDeckSF OpenPorchSF
## 1 1710 8 2 548 0 61
## 2 1262 6 2 460 298 0
## 3 1786 6 2 608 0 42
## 4 1717 7 3 642 0 35
## 5 2198 9 3 836 192 84
## 6 1362 5 2 480 40 30
## EnclosedPorch ScreenPorch X3SsnPorch PoolArea MiscVal MoSold YrSold
## 1 0 0 0 0 0 2 2008
## 2 0 0 0 0 0 5 2007
## 3 0 0 0 0 0 9 2008
## 4 272 0 0 0 0 2 2006
## 5 0 0 0 0 0 12 2008
## 6 0 0 320 0 700 10 2009
## SaleCondition SalePrice
## 1 1 208500
## 2 1 181500
## 3 1 223500
## 4 0 140000
## 5 1 250000
## 6 1 143000
glimpse(train1)
## Observations: 1,460
## Variables: 33
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ Neighborhood <dbl> 17, 20, 17, 19, 24, 12, 21, 15, 4, 6, 7, 25, 7, ...
## $ LotFrontage <dbl> 65.00000, 80.00000, 68.00000, 60.00000, 84.00000...
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ BldgType <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, ...
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ MasVnrArea <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ CentralAir <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleCondition <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, ...
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
vis_miss(train1)
mb_cor <- cor(train1)
round(mb_cor, 3)
## MSSubClass Neighborhood LotFrontage LotArea BldgType
## MSSubClass 1.000 -0.046 -0.357 -0.140 -0.746
## Neighborhood -0.046 1.000 0.219 0.167 -0.061
## LotFrontage -0.357 0.219 1.000 0.307 0.409
## LotArea -0.140 0.167 0.307 1.000 0.206
## BldgType -0.746 -0.061 0.409 0.206 1.000
## OverallQual 0.033 0.671 0.234 0.106 -0.050
## OverallCond -0.059 -0.215 -0.053 -0.006 0.162
## YearBuilt 0.028 0.684 0.118 0.014 -0.218
## YearRemodAdd 0.041 0.514 0.083 0.014 -0.105
## MasVnrArea 0.023 0.370 0.179 0.104 -0.043
## BsmtFinSF1 -0.070 0.244 0.216 0.214 -0.007
## BsmtFinSF2 -0.066 -0.034 0.043 0.111 0.017
## BsmtUnfSF -0.141 0.213 0.122 -0.003 0.051
## TotalBsmtSF -0.239 0.455 0.363 0.261 0.050
## CentralAir -0.102 0.268 0.069 0.050 -0.018
## X1stFlrSF -0.252 0.403 0.414 0.299 0.074
## X2ndFlrSF 0.308 0.135 0.072 0.051 0.084
## LowQualFinSF 0.046 -0.088 0.037 0.005 0.030
## GrLivArea 0.075 0.400 0.368 0.263 0.127
## TotRmsAbvGrd 0.040 0.271 0.320 0.190 0.198
## GarageCars -0.040 0.571 0.270 0.155 -0.007
## GarageArea -0.099 0.527 0.324 0.180 0.061
## WoodDeckSF -0.013 0.224 0.077 0.172 0.013
## OpenPorchSF -0.006 0.205 0.137 0.085 0.037
## EnclosedPorch -0.012 -0.216 0.010 -0.018 0.115
## ScreenPorch -0.026 0.013 0.038 0.043 0.028
## X3SsnPorch -0.044 0.024 0.062 0.020 0.023
## PoolArea 0.008 -0.007 0.181 0.078 0.028
## MiscVal -0.008 -0.040 0.001 0.038 0.010
## MoSold -0.014 0.049 0.010 0.001 0.026
## YrSold -0.021 -0.028 0.007 -0.014 -0.002
## SaleCondition 0.024 -0.139 -0.072 0.006 0.027
## SalePrice -0.084 0.696 0.335 0.264 0.086
## OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea
## MSSubClass 0.033 -0.059 0.028 0.041 0.023
## Neighborhood 0.671 -0.215 0.684 0.514 0.370
## LotFrontage 0.234 -0.053 0.118 0.083 0.179
## LotArea 0.106 -0.006 0.014 0.014 0.104
## BldgType -0.050 0.162 -0.218 -0.105 -0.043
## OverallQual 1.000 -0.092 0.572 0.551 0.410
## OverallCond -0.092 1.000 -0.376 0.074 -0.128
## YearBuilt 0.572 -0.376 1.000 0.593 0.315
## YearRemodAdd 0.551 0.074 0.593 1.000 0.179
## MasVnrArea 0.410 -0.128 0.315 0.179 1.000
## BsmtFinSF1 0.240 -0.046 0.250 0.128 0.264
## BsmtFinSF2 -0.059 0.040 -0.049 -0.068 -0.072
## BsmtUnfSF 0.308 -0.137 0.149 0.181 0.114
## TotalBsmtSF 0.538 -0.171 0.391 0.291 0.362
## CentralAir 0.272 0.119 0.382 0.299 0.127
## X1stFlrSF 0.476 -0.144 0.282 0.240 0.342
## X2ndFlrSF 0.295 0.029 0.010 0.140 0.174
## LowQualFinSF -0.030 0.025 -0.184 -0.062 -0.069
## GrLivArea 0.593 -0.080 0.199 0.287 0.390
## TotRmsAbvGrd 0.427 -0.058 0.096 0.192 0.280
## GarageCars 0.601 -0.186 0.538 0.421 0.364
## GarageArea 0.562 -0.152 0.479 0.372 0.373
## WoodDeckSF 0.239 -0.003 0.225 0.206 0.159
## OpenPorchSF 0.309 -0.033 0.189 0.226 0.125
## EnclosedPorch -0.114 0.070 -0.387 -0.194 -0.110
## ScreenPorch 0.065 0.055 -0.050 -0.039 0.061
## X3SsnPorch 0.030 0.026 0.031 0.045 0.019
## PoolArea 0.065 -0.002 0.005 0.006 0.012
## MiscVal -0.031 0.069 -0.034 -0.010 -0.030
## MoSold 0.071 -0.004 0.012 0.021 -0.006
## YrSold -0.027 0.044 -0.014 0.036 -0.008
## SaleCondition -0.143 0.162 -0.158 -0.121 -0.084
## SalePrice 0.791 -0.078 0.523 0.507 0.475
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF CentralAir
## MSSubClass -0.070 -0.066 -0.141 -0.239 -0.102
## Neighborhood 0.244 -0.034 0.213 0.455 0.268
## LotFrontage 0.216 0.043 0.122 0.363 0.069
## LotArea 0.214 0.111 -0.003 0.261 0.050
## BldgType -0.007 0.017 0.051 0.050 -0.018
## OverallQual 0.240 -0.059 0.308 0.538 0.272
## OverallCond -0.046 0.040 -0.137 -0.171 0.119
## YearBuilt 0.250 -0.049 0.149 0.391 0.382
## YearRemodAdd 0.128 -0.068 0.181 0.291 0.299
## MasVnrArea 0.264 -0.072 0.114 0.362 0.127
## BsmtFinSF1 1.000 -0.050 -0.495 0.522 0.166
## BsmtFinSF2 -0.050 1.000 -0.209 0.105 0.040
## BsmtUnfSF -0.495 -0.209 1.000 0.415 0.020
## TotalBsmtSF 0.522 0.105 0.415 1.000 0.208
## CentralAir 0.166 0.040 0.020 0.208 1.000
## X1stFlrSF 0.446 0.097 0.318 0.820 0.147
## X2ndFlrSF -0.137 -0.099 0.004 -0.175 -0.012
## LowQualFinSF -0.065 0.015 0.028 -0.033 -0.050
## GrLivArea 0.208 -0.010 0.240 0.455 0.094
## TotRmsAbvGrd 0.044 -0.035 0.251 0.286 0.035
## GarageCars 0.224 -0.038 0.214 0.435 0.234
## GarageArea 0.297 -0.018 0.183 0.487 0.231
## WoodDeckSF 0.204 0.068 -0.005 0.232 0.146
## OpenPorchSF 0.112 0.003 0.129 0.247 0.026
## EnclosedPorch -0.102 0.037 -0.003 -0.095 -0.157
## ScreenPorch 0.062 0.089 -0.013 0.084 0.051
## X3SsnPorch 0.026 -0.030 0.021 0.037 0.031
## PoolArea 0.140 0.042 -0.035 0.126 0.018
## MiscVal 0.004 0.005 -0.024 -0.018 -0.002
## MoSold -0.016 -0.015 0.035 0.013 0.010
## YrSold 0.014 0.032 -0.041 -0.015 -0.009
## SaleCondition -0.020 0.041 -0.154 -0.160 -0.015
## SalePrice 0.386 -0.011 0.214 0.614 0.251
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea TotRmsAbvGrd
## MSSubClass -0.252 0.308 0.046 0.075 0.040
## Neighborhood 0.403 0.135 -0.088 0.400 0.271
## LotFrontage 0.414 0.072 0.037 0.368 0.320
## LotArea 0.299 0.051 0.005 0.263 0.190
## BldgType 0.074 0.084 0.030 0.127 0.198
## OverallQual 0.476 0.295 -0.030 0.593 0.427
## OverallCond -0.144 0.029 0.025 -0.080 -0.058
## YearBuilt 0.282 0.010 -0.184 0.199 0.096
## YearRemodAdd 0.240 0.140 -0.062 0.287 0.192
## MasVnrArea 0.342 0.174 -0.069 0.390 0.280
## BsmtFinSF1 0.446 -0.137 -0.065 0.208 0.044
## BsmtFinSF2 0.097 -0.099 0.015 -0.010 -0.035
## BsmtUnfSF 0.318 0.004 0.028 0.240 0.251
## TotalBsmtSF 0.820 -0.175 -0.033 0.455 0.286
## CentralAir 0.147 -0.012 -0.050 0.094 0.035
## X1stFlrSF 1.000 -0.203 -0.014 0.566 0.410
## X2ndFlrSF -0.203 1.000 0.063 0.688 0.616
## LowQualFinSF -0.014 0.063 1.000 0.135 0.131
## GrLivArea 0.566 0.688 0.135 1.000 0.825
## TotRmsAbvGrd 0.410 0.616 0.131 0.825 1.000
## GarageCars 0.439 0.184 -0.094 0.467 0.362
## GarageArea 0.490 0.138 -0.068 0.469 0.338
## WoodDeckSF 0.235 0.092 -0.025 0.247 0.166
## OpenPorchSF 0.212 0.208 0.018 0.330 0.234
## EnclosedPorch -0.065 0.062 0.061 0.009 0.004
## ScreenPorch 0.089 0.041 0.027 0.102 0.059
## X3SsnPorch 0.056 -0.024 -0.004 0.021 -0.007
## PoolArea 0.132 0.081 0.062 0.170 0.084
## MiscVal -0.021 0.016 -0.004 -0.002 0.025
## MoSold 0.031 0.035 -0.022 0.050 0.037
## YrSold -0.014 -0.029 -0.029 -0.037 -0.035
## SaleCondition -0.159 0.032 -0.012 -0.092 -0.093
## SalePrice 0.606 0.319 -0.026 0.709 0.534
## GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch
## MSSubClass -0.040 -0.099 -0.013 -0.006 -0.012
## Neighborhood 0.571 0.527 0.224 0.205 -0.216
## LotFrontage 0.270 0.324 0.077 0.137 0.010
## LotArea 0.155 0.180 0.172 0.085 -0.018
## BldgType -0.007 0.061 0.013 0.037 0.115
## OverallQual 0.601 0.562 0.239 0.309 -0.114
## OverallCond -0.186 -0.152 -0.003 -0.033 0.070
## YearBuilt 0.538 0.479 0.225 0.189 -0.387
## YearRemodAdd 0.421 0.372 0.206 0.226 -0.194
## MasVnrArea 0.364 0.373 0.159 0.125 -0.110
## BsmtFinSF1 0.224 0.297 0.204 0.112 -0.102
## BsmtFinSF2 -0.038 -0.018 0.068 0.003 0.037
## BsmtUnfSF 0.214 0.183 -0.005 0.129 -0.003
## TotalBsmtSF 0.435 0.487 0.232 0.247 -0.095
## CentralAir 0.234 0.231 0.146 0.026 -0.157
## X1stFlrSF 0.439 0.490 0.235 0.212 -0.065
## X2ndFlrSF 0.184 0.138 0.092 0.208 0.062
## LowQualFinSF -0.094 -0.068 -0.025 0.018 0.061
## GrLivArea 0.467 0.469 0.247 0.330 0.009
## TotRmsAbvGrd 0.362 0.338 0.166 0.234 0.004
## GarageCars 1.000 0.882 0.226 0.214 -0.151
## GarageArea 0.882 1.000 0.225 0.241 -0.122
## WoodDeckSF 0.226 0.225 1.000 0.059 -0.126
## OpenPorchSF 0.214 0.241 0.059 1.000 -0.093
## EnclosedPorch -0.151 -0.122 -0.126 -0.093 1.000
## ScreenPorch 0.050 0.051 -0.074 0.074 -0.083
## X3SsnPorch 0.036 0.035 -0.033 -0.006 -0.037
## PoolArea 0.021 0.061 0.073 0.061 0.054
## MiscVal -0.043 -0.027 -0.010 -0.019 0.018
## MoSold 0.041 0.028 0.021 0.071 -0.029
## YrSold -0.039 -0.027 0.022 -0.058 -0.010
## SaleCondition -0.122 -0.131 0.027 -0.096 0.026
## SalePrice 0.640 0.623 0.324 0.316 -0.129
## ScreenPorch X3SsnPorch PoolArea MiscVal MoSold YrSold
## MSSubClass -0.026 -0.044 0.008 -0.008 -0.014 -0.021
## Neighborhood 0.013 0.024 -0.007 -0.040 0.049 -0.028
## LotFrontage 0.038 0.062 0.181 0.001 0.010 0.007
## LotArea 0.043 0.020 0.078 0.038 0.001 -0.014
## BldgType 0.028 0.023 0.028 0.010 0.026 -0.002
## OverallQual 0.065 0.030 0.065 -0.031 0.071 -0.027
## OverallCond 0.055 0.026 -0.002 0.069 -0.004 0.044
## YearBuilt -0.050 0.031 0.005 -0.034 0.012 -0.014
## YearRemodAdd -0.039 0.045 0.006 -0.010 0.021 0.036
## MasVnrArea 0.061 0.019 0.012 -0.030 -0.006 -0.008
## BsmtFinSF1 0.062 0.026 0.140 0.004 -0.016 0.014
## BsmtFinSF2 0.089 -0.030 0.042 0.005 -0.015 0.032
## BsmtUnfSF -0.013 0.021 -0.035 -0.024 0.035 -0.041
## TotalBsmtSF 0.084 0.037 0.126 -0.018 0.013 -0.015
## CentralAir 0.051 0.031 0.018 -0.002 0.010 -0.009
## X1stFlrSF 0.089 0.056 0.132 -0.021 0.031 -0.014
## X2ndFlrSF 0.041 -0.024 0.081 0.016 0.035 -0.029
## LowQualFinSF 0.027 -0.004 0.062 -0.004 -0.022 -0.029
## GrLivArea 0.102 0.021 0.170 -0.002 0.050 -0.037
## TotRmsAbvGrd 0.059 -0.007 0.084 0.025 0.037 -0.035
## GarageCars 0.050 0.036 0.021 -0.043 0.041 -0.039
## GarageArea 0.051 0.035 0.061 -0.027 0.028 -0.027
## WoodDeckSF -0.074 -0.033 0.073 -0.010 0.021 0.022
## OpenPorchSF 0.074 -0.006 0.061 -0.019 0.071 -0.058
## EnclosedPorch -0.083 -0.037 0.054 0.018 -0.029 -0.010
## ScreenPorch 1.000 -0.031 0.051 0.032 0.023 0.011
## X3SsnPorch -0.031 1.000 -0.008 0.000 0.029 0.019
## PoolArea 0.051 -0.008 1.000 0.030 -0.034 -0.060
## MiscVal 0.032 0.000 0.030 1.000 -0.006 0.005
## MoSold 0.023 0.029 -0.034 -0.006 1.000 -0.146
## YrSold 0.011 0.019 -0.060 0.005 -0.146 1.000
## SaleCondition 0.011 -0.009 -0.069 0.037 -0.072 0.131
## SalePrice 0.111 0.045 0.092 -0.021 0.046 -0.029
## SaleCondition SalePrice
## MSSubClass 0.024 -0.084
## Neighborhood -0.139 0.696
## LotFrontage -0.072 0.335
## LotArea 0.006 0.264
## BldgType 0.027 0.086
## OverallQual -0.143 0.791
## OverallCond 0.162 -0.078
## YearBuilt -0.158 0.523
## YearRemodAdd -0.121 0.507
## MasVnrArea -0.084 0.475
## BsmtFinSF1 -0.020 0.386
## BsmtFinSF2 0.041 -0.011
## BsmtUnfSF -0.154 0.214
## TotalBsmtSF -0.160 0.614
## CentralAir -0.015 0.251
## X1stFlrSF -0.159 0.606
## X2ndFlrSF 0.032 0.319
## LowQualFinSF -0.012 -0.026
## GrLivArea -0.092 0.709
## TotRmsAbvGrd -0.093 0.534
## GarageCars -0.122 0.640
## GarageArea -0.131 0.623
## WoodDeckSF 0.027 0.324
## OpenPorchSF -0.096 0.316
## EnclosedPorch 0.026 -0.129
## ScreenPorch 0.011 0.111
## X3SsnPorch -0.009 0.045
## PoolArea -0.069 0.092
## MiscVal 0.037 -0.021
## MoSold -0.072 0.046
## YrSold 0.131 -0.029
## SaleCondition 1.000 -0.154
## SalePrice -0.154 1.000
M<-cor(train1)
corrplot(M, method="number")
From the correlation matrix and plot, we find that ‘TotalBsmtSF’ is strongly correlated with ‘X1stFlrSF’ (0.820) and moderately correlated with ‘BsmtFinSF1’ (0.522) and ‘GrLivArea’ (0.455), so we will drop ‘TotalBsmtSF’.
‘GrLivArea’ is also strongly correlated with ‘TotRmsAbvGrd’ (0.825), ‘X2ndFlrSF’ (0.688) and ‘X1stFlrSF’ (0.566). We keep ‘GrLivArea’ and drop these three variables as well.
# Drop the predictors that are highly correlated with variables we keep
train1$TotalBsmtSF <- NULL
train1$TotRmsAbvGrd <- NULL
train1$X1stFlrSF <- NULL
train1$X2ndFlrSF <- NULL
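Reading a 33 by 33 correlation matrix by eye is error-prone, so the strongly correlated pairs can also be listed programmatically. The following is a minimal sketch on toy data, not the Kaggle data: the data frame `toy`, its columns, and the 0.8 cutoff are all assumptions made for illustration.

```r
# Toy data standing in for train1 (hypothetical values, not the Kaggle data)
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),   # nearly a copy of a, so |r| is close to 1
  c = rnorm(n)                  # generated independently of a and b
)

cor_mat <- cor(toy)

# Keep each pair once by blanking the lower triangle and the diagonal
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# 0.8 is an assumed cutoff; which() skips the NA entries
high <- which(abs(cor_mat) > 0.8, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

Replacing `toy` with the numeric training data frame would list the candidate pairs, from which one variable per pair could be dropped.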
From the density plots and the summary, we feel that the following variables (“LotFrontage”, “LotArea”, “MasVnrArea”, “BsmtFinSF1”, “BsmtFinSF2”, “BsmtUnfSF”, “GrLivArea”, “SalePrice”, “WoodDeckSF”, “OpenPorchSF”, “EnclosedPorch”, “ScreenPorch”, “X3SsnPorch”, “PoolArea”, “MiscVal”) may have outliers. We will replace the outliers with quantile cutoffs.
Replace the outliers
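# Clip a numeric vector to the range between two of its quantiles.
# Note: 0.5 is the median, so every value below the median is raised to it;
# a lower cutoff such as 0.05 would clip only the low tail.
replaceOutliers <- function(x) {
  quantiles <- quantile(x, c(0.5, 0.95))
  x[x < quantiles[1]] <- quantiles[1]
  x[x > quantiles[2]] <- quantiles[2]
  return(x)
}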
train1$LotFrontage <- replaceOutliers(train1$LotFrontage)
train1$LotArea <- replaceOutliers(train1$LotArea)
train1$MasVnrArea <- replaceOutliers(train1$MasVnrArea)
train1$BsmtFinSF1 <- replaceOutliers(train1$BsmtFinSF1)
train1$BsmtFinSF2 <- replaceOutliers(train1$BsmtFinSF2)
train1$BsmtUnfSF <- replaceOutliers(train1$BsmtUnfSF)
train1$GrLivArea <- replaceOutliers(train1$GrLivArea)
train1$SalePrice <- replaceOutliers(train1$SalePrice)
train1$WoodDeckSF <- replaceOutliers(train1$WoodDeckSF)
train1$OpenPorchSF <- replaceOutliers(train1$OpenPorchSF)
train1$EnclosedPorch <- replaceOutliers(train1$EnclosedPorch)
train1$ScreenPorch <- replaceOutliers(train1$ScreenPorch)
train1$X3SsnPorch <- replaceOutliers(train1$X3SsnPorch)
train1$PoolArea <- replaceOutliers(train1$PoolArea)
train1$MiscVal <- replaceOutliers(train1$MiscVal)
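To see how the choice of lower cutoff changes the result, here is a small sketch on a toy vector. The helper `clip_to_quantiles` and its `lower` and `upper` arguments are hypothetical names introduced only for this illustration, and the 0.05 alternative is an assumption, not a setting used in this analysis.

```r
# Hypothetical helper: same clipping idea, with the cutoffs as arguments
clip_to_quantiles <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])   # raise low values to q[1], cap high values at q[2]
}

x <- c(1:99, 1000)   # toy vector with one extreme value

clipped_median <- clip_to_quantiles(x, lower = 0.5)    # cutoffs used in the text
clipped_5pct   <- clip_to_quantiles(x, lower = 0.05)   # assumed alternative

summary(clipped_median)
summary(clipped_5pct)

# Share of values sitting exactly at the lower bound in each version
mean(clipped_median == min(clipped_median))
mean(clipped_5pct == min(clipped_5pct))
```

With the 0.5 cutoff, half of the toy values collapse onto the median, which is the same pattern as a summary in which the minimum, first quartile, and median coincide.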
summary(train1)
## MSSubClass Neighborhood LotFrontage LotArea
## Min. : 20.0 Min. : 1.00 Min. : 70.05 Min. : 9478
## 1st Qu.: 20.0 1st Qu.: 7.00 1st Qu.: 70.05 1st Qu.: 9478
## Median : 50.0 Median :13.00 Median : 70.05 Median : 9479
## Mean : 56.9 Mean :12.84 Mean : 75.74 Mean :10917
## 3rd Qu.: 70.0 3rd Qu.:17.00 3rd Qu.: 79.00 3rd Qu.:11602
## Max. :190.0 Max. :25.00 Max. :104.00 Max. :17401
## BldgType OverallQual OverallCond YearBuilt
## Min. :1.000 Min. : 1.000 Min. :1.000 Min. :1872
## 1st Qu.:5.000 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Median :5.000 Median : 6.000 Median :5.000 Median :1973
## Mean :4.507 Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.:5.000 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :5.000 Max. :10.000 Max. :9.000 Max. :2010
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2
## Min. :1950 Min. : 0.00 Min. : 383.5 Min. : 0.00
## 1st Qu.:1967 1st Qu.: 0.00 1st Qu.: 383.5 1st Qu.: 0.00
## Median :1994 Median : 0.00 Median : 383.8 Median : 0.00
## Mean :1985 Mean : 92.62 Mean : 583.3 Mean : 32.93
## 3rd Qu.:2004 3rd Qu.:164.25 3rd Qu.: 712.2 3rd Qu.: 0.00
## Max. :2010 Max. :456.00 Max. :1274.0 Max. :396.20
## BsmtUnfSF CentralAir LowQualFinSF GrLivArea
## Min. : 477.5 Min. :0.0000 Min. : 0.000 Min. :1464
## 1st Qu.: 477.5 1st Qu.:1.0000 1st Qu.: 0.000 1st Qu.:1464
## Median : 478.2 Median :1.0000 Median : 0.000 Median :1464
## Mean : 685.3 Mean :0.9349 Mean : 5.845 Mean :1666
## 3rd Qu.: 808.0 3rd Qu.:1.0000 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :1468.0 Max. :1.0000 Max. :572.000 Max. :2466
## GarageCars GarageArea WoodDeckSF OpenPorchSF
## Min. :0.000 Min. : 0.0 Min. : 0.00 Min. : 25.00
## 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00 1st Qu.: 25.00
## Median :2.000 Median : 480.0 Median : 0.00 Median : 25.00
## Mean :1.767 Mean : 473.0 Mean : 88.89 Mean : 54.37
## 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :4.000 Max. :1418.0 Max. :335.00 Max. :175.05
## EnclosedPorch ScreenPorch X3SsnPorch PoolArea MiscVal
## Min. : 0.00 Min. : 0.00 Min. :0 Min. :0 Min. :0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median : 0.00 Median : 0.00 Median :0 Median :0 Median :0
## Mean : 19.15 Mean : 11.58 Mean :0 Mean :0 Mean :0
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :180.15 Max. :160.00 Max. :0 Max. :0 Max. :0
## MoSold YrSold SaleCondition SalePrice
## Min. : 1.000 Min. :2006 Min. :0.0000 Min. :163000
## 1st Qu.: 5.000 1st Qu.:2007 1st Qu.:1.0000 1st Qu.:163000
## Median : 6.000 Median :2008 Median :1.0000 Median :163000
## Mean : 6.322 Mean :2008 Mean :0.8205 Mean :195479
## 3rd Qu.: 8.000 3rd Qu.:2009 3rd Qu.:1.0000 3rd Qu.:214000
## Max. :12.000 Max. :2010 Max. :1.0000 Max. :326100
vis_miss(train1)
Full Model (including all above variables)
full.model <- lm(SalePrice~MSSubClass+Neighborhood+LotFrontage+LotArea+BldgType+OverallQual+OverallCond+YearBuilt+YearRemodAdd+MasVnrArea+BsmtFinSF1+BsmtFinSF2+BsmtUnfSF+CentralAir+LowQualFinSF+GrLivArea+GarageCars+GarageArea+WoodDeckSF+OpenPorchSF+EnclosedPorch+ScreenPorch+X3SsnPorch+PoolArea+MiscVal+MoSold+YrSold+SaleCondition, data=train1)
summary(full.model)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + Neighborhood + LotFrontage +
## LotArea + BldgType + OverallQual + OverallCond + YearBuilt +
## YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF +
## CentralAir + LowQualFinSF + GrLivArea + GarageCars + GarageArea +
## WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## X3SsnPorch + PoolArea + MiscVal + MoSold + YrSold + SaleCondition,
## data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -166284 -14553 -1482 13177 90498
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.040e+06 9.531e+05 -1.091 0.275247
## MSSubClass -1.264e+02 2.528e+01 -4.998 6.50e-07 ***
## Neighborhood 1.258e+03 1.482e+02 8.484 < 2e-16 ***
## LotFrontage 1.275e+02 7.182e+01 1.775 0.076160 .
## LotArea 8.315e-01 3.262e-01 2.549 0.010903 *
## BldgType -1.282e+03 9.049e+02 -1.416 0.156874
## OverallQual 1.036e+04 7.544e+02 13.739 < 2e-16 ***
## OverallCond 5.883e+02 7.078e+02 0.831 0.405982
## YearBuilt -2.133e+00 4.160e+01 -0.051 0.959110
## YearRemodAdd 1.617e+02 4.328e+01 3.737 0.000194 ***
## MasVnrArea 1.333e+01 5.159e+00 2.584 0.009863 **
## BsmtFinSF1 3.308e+01 3.104e+00 10.658 < 2e-16 ***
## BsmtFinSF2 2.141e+00 6.601e+00 0.324 0.745723
## BsmtUnfSF 8.201e+00 2.811e+00 2.918 0.003583 **
## CentralAir -1.283e+04 2.891e+03 -4.437 9.84e-06 ***
## LowQualFinSF -2.113e+01 1.318e+01 -1.603 0.109051
## GrLivArea 5.762e+01 2.874e+00 20.048 < 2e-16 ***
## GarageCars -5.585e+02 1.879e+03 -0.297 0.766324
## GarageArea 1.198e+01 6.367e+00 1.882 0.059984 .
## WoodDeckSF 2.028e+01 6.182e+00 3.281 0.001060 **
## OpenPorchSF 3.093e+01 1.486e+01 2.082 0.037559 *
## EnclosedPorch 9.064e+00 1.373e+01 0.660 0.509327
## ScreenPorch 1.299e+01 1.582e+01 0.821 0.411689
## X3SsnPorch NA NA NA NA
## PoolArea NA NA NA NA
## MiscVal NA NA NA NA
## MoSold 1.382e+02 2.315e+02 0.597 0.550572
## YrSold 3.562e+02 4.748e+02 0.750 0.453224
## SaleCondition -5.034e+03 1.673e+03 -3.009 0.002668 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23430 on 1434 degrees of freedom
## Multiple R-squared: 0.7773, Adjusted R-squared: 0.7734
## F-statistic: 200.2 on 25 and 1434 DF, p-value: < 2.2e-16
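The three NA coefficients (X3SsnPorch, PoolArea, MiscVal) belong to the columns that are all zero after the outlier replacement, as the summary of train1 shows. A constant column is collinear with the intercept, so lm() cannot estimate it. The sketch below reproduces this on toy data; the data frame `toy` and its columns are assumptions for illustration.

```r
set.seed(2)
toy <- data.frame(
  x1 = rnorm(50),
  x2 = rep(0, 50)   # constant column, like PoolArea after clipping
)
toy$y <- 3 + 2 * toy$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = toy)
coef(fit)   # the x2 coefficient is reported as NA

# Columns with a single distinct value can be found before fitting
constant_cols <- names(toy)[sapply(toy, function(col) length(unique(col)) == 1)]
constant_cols
```

Running the same check on the training data before fitting would flag such columns so they can be left out of the formula.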
Reduced Model: the variables that were significant at the 0.05 level in the full model, plus YearBuilt
reduced.model <- lm(SalePrice~MSSubClass+Neighborhood+LotArea+OverallQual+YearBuilt+YearRemodAdd+MasVnrArea+BsmtFinSF1+BsmtUnfSF+CentralAir+GrLivArea+WoodDeckSF+OpenPorchSF+SaleCondition, data=train1)
summary(reduced.model)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + Neighborhood + LotArea +
## OverallQual + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtUnfSF + CentralAir + GrLivArea + WoodDeckSF + OpenPorchSF +
## SaleCondition, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158692 -14854 -1359 13074 89497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.476e+05 8.463e+04 -4.107 4.24e-05 ***
## MSSubClass -1.103e+02 1.533e+01 -7.192 1.02e-12 ***
## Neighborhood 1.296e+03 1.477e+02 8.770 < 2e-16 ***
## LotArea 1.017e+00 3.114e-01 3.267 0.001113 **
## OverallQual 1.083e+04 7.304e+02 14.830 < 2e-16 ***
## YearBuilt 1.693e+00 3.314e+01 0.051 0.959270
## YearRemodAdd 1.713e+02 3.993e+01 4.290 1.91e-05 ***
## MasVnrArea 1.682e+01 5.083e+00 3.309 0.000959 ***
## BsmtFinSF1 3.474e+01 2.897e+00 11.993 < 2e-16 ***
## BsmtUnfSF 9.070e+00 2.594e+00 3.497 0.000485 ***
## CentralAir -1.183e+04 2.779e+03 -4.258 2.20e-05 ***
## GrLivArea 5.753e+01 2.700e+00 21.310 < 2e-16 ***
## WoodDeckSF 1.951e+01 6.078e+00 3.210 0.001357 **
## OpenPorchSF 3.153e+01 1.480e+01 2.131 0.033293 *
## SaleCondition -4.975e+03 1.650e+03 -3.016 0.002608 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23490 on 1445 degrees of freedom
## Multiple R-squared: 0.7744, Adjusted R-squared: 0.7722
## F-statistic: 354.3 on 14 and 1445 DF, p-value: < 2.2e-16
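Because the reduced model is nested in the full model, the two can be compared with a partial F-test as well as by adjusted R-squared. This is a minimal sketch on simulated data; `toy`, `full`, and `reduced` are illustrative names, and the comparison is offered as one possible check, not as a step performed in this analysis.

```r
set.seed(3)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)   # only x1 enters the simulated response

full    <- lm(y ~ x1 + x2 + x3, data = toy)
reduced <- lm(y ~ x1, data = toy)

# Partial F-test: do the extra predictors reduce the residual sum of squares
# by more than chance alone would explain?
cmp <- anova(reduced, full)
print(cmp)

# AIC for both fits, lower is better
AIC(reduced, full)
```

A large p-value in the anova table suggests the dropped predictors add little, which would support preferring the smaller model.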
Backward elimination
backward.model <- step(full.model, direction = "backward")
## Start: AIC=29406.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd +
## MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir +
## LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + ScreenPorch + X3SsnPorch +
## PoolArea + MiscVal + MoSold + YrSold + SaleCondition
##
##
## Step: AIC=29406.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd +
## MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir +
## LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + ScreenPorch + X3SsnPorch +
## PoolArea + MoSold + YrSold + SaleCondition
##
##
## Step: AIC=29406.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd +
## MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir +
## LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + ScreenPorch + X3SsnPorch +
## MoSold + YrSold + SaleCondition
##
##
## Step: AIC=29406.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd +
## MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir +
## LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + ScreenPorch + MoSold + YrSold +
## SaleCondition
##
## Df Sum of Sq RSS AIC
## - YearBuilt 1 1.4435e+06 7.8720e+11 29404
## - GarageCars 1 4.8502e+07 7.8725e+11 29404
## - BsmtFinSF2 1 5.7750e+07 7.8726e+11 29404
## - MoSold 1 1.9568e+08 7.8739e+11 29405
## - EnclosedPorch 1 2.3916e+08 7.8744e+11 29405
## - YrSold 1 3.0900e+08 7.8751e+11 29405
## - ScreenPorch 1 3.7017e+08 7.8757e+11 29405
## - OverallCond 1 3.7930e+08 7.8758e+11 29405
## <none> 7.8720e+11 29406
## - BldgType 1 1.1013e+09 7.8830e+11 29406
## - LowQualFinSF 1 1.4114e+09 7.8861e+11 29407
## - LotFrontage 1 1.7290e+09 7.8893e+11 29407
## - GarageArea 1 1.9452e+09 7.8914e+11 29408
## - OpenPorchSF 1 2.3786e+09 7.8958e+11 29409
## - LotArea 1 3.5671e+09 7.9077e+11 29411
## - MasVnrArea 1 3.6655e+09 7.9086e+11 29411
## - BsmtUnfSF 1 4.6727e+09 7.9187e+11 29413
## - SaleCondition 1 4.9700e+09 7.9217e+11 29413
## - WoodDeckSF 1 5.9085e+09 7.9311e+11 29415
## - YearRemodAdd 1 7.6649e+09 7.9486e+11 29418
## - CentralAir 1 1.0805e+10 7.9800e+11 29424
## - MSSubClass 1 1.3713e+10 8.0091e+11 29429
## - Neighborhood 1 3.9512e+10 8.2671e+11 29476
## - BsmtFinSF1 1 6.2360e+10 8.4956e+11 29515
## - OverallQual 1 1.0362e+11 8.9082e+11 29585
## - GrLivArea 1 2.2063e+11 1.0078e+12 29765
##
## Step: AIC=29404.11
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea +
## BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir + LowQualFinSF +
## GrLivArea + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF +
## EnclosedPorch + ScreenPorch + MoSold + YrSold + SaleCondition
##
## Df Sum of Sq RSS AIC
## - GarageCars 1 4.9934e+07 7.8725e+11 29402
## - BsmtFinSF2 1 5.7201e+07 7.8726e+11 29402
## - MoSold 1 1.9758e+08 7.8740e+11 29403
## - EnclosedPorch 1 2.8229e+08 7.8748e+11 29403
## - YrSold 1 3.0970e+08 7.8751e+11 29403
## - ScreenPorch 1 3.7937e+08 7.8758e+11 29403
## - OverallCond 1 5.2126e+08 7.8772e+11 29403
## <none> 7.8720e+11 29404
## - BldgType 1 1.1049e+09 7.8831e+11 29404
## - LowQualFinSF 1 1.4346e+09 7.8864e+11 29405
## - LotFrontage 1 1.7289e+09 7.8893e+11 29405
## - GarageArea 1 1.9439e+09 7.8914e+11 29406
## - OpenPorchSF 1 2.3780e+09 7.8958e+11 29407
## - LotArea 1 3.5998e+09 7.9080e+11 29409
## - MasVnrArea 1 3.6784e+09 7.9088e+11 29409
## - BsmtUnfSF 1 4.6799e+09 7.9188e+11 29411
## - SaleCondition 1 4.9688e+09 7.9217e+11 29411
## - WoodDeckSF 1 5.9189e+09 7.9312e+11 29413
## - YearRemodAdd 1 9.0187e+09 7.9622e+11 29419
## - CentralAir 1 1.1822e+10 7.9902e+11 29424
## - MSSubClass 1 1.3732e+10 8.0093e+11 29427
## - Neighborhood 1 4.4157e+10 8.3136e+11 29482
## - BsmtFinSF1 1 6.2377e+10 8.4958e+11 29513
## - OverallQual 1 1.0592e+11 8.9312e+11 29586
## - GrLivArea 1 2.2865e+11 1.0159e+12 29774
##
## Step: AIC=29402.21
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea +
## BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir + LowQualFinSF +
## GrLivArea + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch +
## ScreenPorch + MoSold + YrSold + SaleCondition
##
## Df Sum of Sq RSS AIC
## - BsmtFinSF2 1 6.1445e+07 7.8731e+11 29400
## - MoSold 1 1.9406e+08 7.8744e+11 29401
## - EnclosedPorch 1 2.9205e+08 7.8754e+11 29401
## - YrSold 1 3.1835e+08 7.8757e+11 29401
## - ScreenPorch 1 3.7533e+08 7.8763e+11 29401
## - OverallCond 1 5.5615e+08 7.8781e+11 29401
## <none> 7.8725e+11 29402
## - BldgType 1 1.0881e+09 7.8834e+11 29402
## - LowQualFinSF 1 1.4036e+09 7.8865e+11 29403
## - LotFrontage 1 1.7164e+09 7.8897e+11 29403
## - OpenPorchSF 1 2.4329e+09 7.8968e+11 29405
## - LotArea 1 3.5761e+09 7.9083e+11 29407
## - MasVnrArea 1 3.6739e+09 7.9092e+11 29407
## - GarageArea 1 4.0137e+09 7.9126e+11 29408
## - BsmtUnfSF 1 4.6647e+09 7.9192e+11 29409
## - SaleCondition 1 5.0151e+09 7.9227e+11 29410
## - WoodDeckSF 1 5.9006e+09 7.9315e+11 29411
## - YearRemodAdd 1 8.9708e+09 7.9622e+11 29417
## - CentralAir 1 1.1834e+10 7.9908e+11 29422
## - MSSubClass 1 1.3758e+10 8.0101e+11 29426
## - Neighborhood 1 4.4252e+10 8.3150e+11 29480
## - BsmtFinSF1 1 6.2663e+10 8.4991e+11 29512
## - OverallQual 1 1.0673e+11 8.9398e+11 29586
## - GrLivArea 1 2.2863e+11 1.0159e+12 29773
##
## Step: AIC=29400.32
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea +
## BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea +
## GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## MoSold + YrSold + SaleCondition
##
## Df Sum of Sq RSS AIC
## - MoSold 1 1.9346e+08 7.8751e+11 29399
## - EnclosedPorch 1 3.0561e+08 7.8762e+11 29399
## - YrSold 1 3.2292e+08 7.8763e+11 29399
## - ScreenPorch 1 4.0108e+08 7.8771e+11 29399
## - OverallCond 1 5.5771e+08 7.8787e+11 29399
## <none> 7.8731e+11 29400
## - BldgType 1 1.1945e+09 7.8851e+11 29401
## - LowQualFinSF 1 1.3953e+09 7.8871e+11 29401
## - LotFrontage 1 1.7652e+09 7.8908e+11 29402
## - OpenPorchSF 1 2.4401e+09 7.8975e+11 29403
## - MasVnrArea 1 3.6331e+09 7.9094e+11 29405
## - LotArea 1 3.6926e+09 7.9100e+11 29405
## - GarageArea 1 4.0768e+09 7.9139e+11 29406
## - BsmtUnfSF 1 4.7426e+09 7.9205e+11 29407
## - SaleCondition 1 5.0124e+09 7.9232e+11 29408
## - WoodDeckSF 1 6.1424e+09 7.9345e+11 29410
## - YearRemodAdd 1 8.9160e+09 7.9623e+11 29415
## - CentralAir 1 1.1783e+10 7.9909e+11 29420
## - MSSubClass 1 1.4485e+10 8.0180e+11 29425
## - Neighborhood 1 4.4334e+10 8.3165e+11 29478
## - BsmtFinSF1 1 6.6334e+10 8.5365e+11 29516
## - OverallQual 1 1.0672e+11 8.9403e+11 29584
## - GrLivArea 1 2.2948e+11 1.0168e+12 29772
##
## Step: AIC=29398.68
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea +
## BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea +
## GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## YrSold + SaleCondition
##
## Df Sum of Sq RSS AIC
## - YrSold 1 2.6350e+08 7.8777e+11 29397
## - EnclosedPorch 1 2.9459e+08 7.8780e+11 29397
## - ScreenPorch 1 4.1391e+08 7.8792e+11 29397
## - OverallCond 1 5.6762e+08 7.8807e+11 29398
## <none> 7.8751e+11 29399
## - BldgType 1 1.1745e+09 7.8868e+11 29399
## - LowQualFinSF 1 1.4311e+09 7.8894e+11 29399
## - LotFrontage 1 1.7511e+09 7.8926e+11 29400
## - OpenPorchSF 1 2.5072e+09 7.9001e+11 29401
## - MasVnrArea 1 3.5864e+09 7.9109e+11 29403
## - LotArea 1 3.6212e+09 7.9113e+11 29403
## - GarageArea 1 4.0445e+09 7.9155e+11 29404
## - BsmtUnfSF 1 4.7798e+09 7.9229e+11 29406
## - SaleCondition 1 5.1210e+09 7.9263e+11 29406
## - WoodDeckSF 1 6.1904e+09 7.9370e+11 29408
## - YearRemodAdd 1 8.8516e+09 7.9636e+11 29413
## - CentralAir 1 1.1786e+10 7.9929e+11 29418
## - MSSubClass 1 1.4474e+10 8.0198e+11 29423
## - Neighborhood 1 4.4413e+10 8.3192e+11 29477
## - BsmtFinSF1 1 6.6398e+10 8.5390e+11 29515
## - OverallQual 1 1.0719e+11 8.9470e+11 29583
## - GrLivArea 1 2.3010e+11 1.0176e+12 29771
##
## Step: AIC=29397.17
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea +
## BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea +
## GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## SaleCondition
##
## Df Sum of Sq RSS AIC
## - EnclosedPorch 1 3.0131e+08 7.8807e+11 29396
## - ScreenPorch 1 4.3757e+08 7.8821e+11 29396
## - OverallCond 1 5.7491e+08 7.8834e+11 29396
## <none> 7.8777e+11 29397
## - BldgType 1 1.2080e+09 7.8898e+11 29397
## - LowQualFinSF 1 1.4522e+09 7.8922e+11 29398
## - LotFrontage 1 1.8119e+09 7.8958e+11 29399
## - OpenPorchSF 1 2.4526e+09 7.9022e+11 29400
## - LotArea 1 3.5364e+09 7.9131e+11 29402
## - MasVnrArea 1 3.6424e+09 7.9141e+11 29402
## - GarageArea 1 4.0311e+09 7.9180e+11 29403
## - BsmtUnfSF 1 4.7054e+09 7.9247e+11 29404
## - SaleCondition 1 4.9120e+09 7.9268e+11 29404
## - WoodDeckSF 1 6.2795e+09 7.9405e+11 29407
## - YearRemodAdd 1 9.1267e+09 7.9690e+11 29412
## - CentralAir 1 1.1913e+10 7.9968e+11 29417
## - MSSubClass 1 1.4676e+10 8.0245e+11 29422
## - Neighborhood 1 4.4291e+10 8.3206e+11 29475
## - BsmtFinSF1 1 6.6316e+10 8.5408e+11 29513
## - OverallQual 1 1.0710e+11 8.9487e+11 29581
## - GrLivArea 1 2.2999e+11 1.0178e+12 29769
##
## Step: AIC=29395.73
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea +
## BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea +
## GarageArea + WoodDeckSF + OpenPorchSF + ScreenPorch + SaleCondition
##
## Df Sum of Sq RSS AIC
## - ScreenPorch 1 3.6864e+08 7.8844e+11 29394
## - OverallCond 1 6.2477e+08 7.8869e+11 29395
## <none> 7.8807e+11 29396
## - BldgType 1 1.1235e+09 7.8919e+11 29396
## - LowQualFinSF 1 1.4633e+09 7.8953e+11 29396
## - LotFrontage 1 1.8296e+09 7.8990e+11 29397
## - OpenPorchSF 1 2.3675e+09 7.9044e+11 29398
## - LotArea 1 3.5210e+09 7.9159e+11 29400
## - MasVnrArea 1 3.5508e+09 7.9162e+11 29400
## - GarageArea 1 3.9816e+09 7.9205e+11 29401
## - BsmtUnfSF 1 4.7474e+09 7.9282e+11 29403
## - SaleCondition 1 4.9142e+09 7.9298e+11 29403
## - WoodDeckSF 1 6.0868e+09 7.9416e+11 29405
## - YearRemodAdd 1 8.8830e+09 7.9695e+11 29410
## - CentralAir 1 1.2286e+10 8.0036e+11 29416
## - MSSubClass 1 1.4550e+10 8.0262e+11 29420
## - Neighborhood 1 4.4024e+10 8.3209e+11 29473
## - BsmtFinSF1 1 6.6378e+10 8.5445e+11 29512
## - OverallQual 1 1.0846e+11 8.9653e+11 29582
## - GrLivArea 1 2.3135e+11 1.0194e+12 29770
##
## Step: AIC=29394.41
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearRemodAdd + MasVnrArea +
## BsmtFinSF1 + BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea +
## GarageArea + WoodDeckSF + OpenPorchSF + SaleCondition
##
## Df Sum of Sq RSS AIC
## - OverallCond 1 6.8851e+08 7.8913e+11 29394
## <none> 7.8844e+11 29394
## - BldgType 1 1.1691e+09 7.8961e+11 29395
## - LowQualFinSF 1 1.4739e+09 7.8991e+11 29395
## - LotFrontage 1 1.8439e+09 7.9028e+11 29396
## - OpenPorchSF 1 2.3550e+09 7.9079e+11 29397
## - MasVnrArea 1 3.6127e+09 7.9205e+11 29399
## - LotArea 1 3.6421e+09 7.9208e+11 29399
## - GarageArea 1 4.0082e+09 7.9245e+11 29400
## - BsmtUnfSF 1 4.6706e+09 7.9311e+11 29401
## - SaleCondition 1 4.8989e+09 7.9334e+11 29402
## - WoodDeckSF 1 5.8284e+09 7.9427e+11 29403
## - YearRemodAdd 1 8.6845e+09 7.9712e+11 29408
## - CentralAir 1 1.2143e+10 8.0058e+11 29415
## - MSSubClass 1 1.4736e+10 8.0317e+11 29419
## - Neighborhood 1 4.3878e+10 8.3232e+11 29472
## - BsmtFinSF1 1 6.6710e+10 8.5515e+11 29511
## - OverallQual 1 1.0950e+11 8.9793e+11 29582
## - GrLivArea 1 2.3226e+11 1.0207e+12 29769
##
## Step: AIC=29393.68
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea + GarageArea +
## WoodDeckSF + OpenPorchSF + SaleCondition
##
## Df Sum of Sq RSS AIC
## - BldgType 1 9.5995e+08 7.9009e+11 29394
## <none> 7.8913e+11 29394
## - LowQualFinSF 1 1.4576e+09 7.9058e+11 29394
## - LotFrontage 1 1.7141e+09 7.9084e+11 29395
## - OpenPorchSF 1 2.3467e+09 7.9147e+11 29396
## - MasVnrArea 1 3.5372e+09 7.9266e+11 29398
## - GarageArea 1 3.7775e+09 7.9290e+11 29399
## - LotArea 1 3.8245e+09 7.9295e+11 29399
## - BsmtUnfSF 1 4.2428e+09 7.9337e+11 29400
## - SaleCondition 1 4.5360e+09 7.9366e+11 29400
## - WoodDeckSF 1 5.8308e+09 7.9496e+11 29402
## - YearRemodAdd 1 1.0261e+10 7.9939e+11 29411
## - CentralAir 1 1.1551e+10 8.0068e+11 29413
## - MSSubClass 1 1.4591e+10 8.0372e+11 29418
## - Neighborhood 1 4.3532e+10 8.3266e+11 29470
## - BsmtFinSF1 1 6.6024e+10 8.5515e+11 29509
## - OverallQual 1 1.1102e+11 9.0015e+11 29584
## - GrLivArea 1 2.3263e+11 1.0218e+12 29769
##
## Step: AIC=29393.46
## SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## OverallQual + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF +
## CentralAir + LowQualFinSF + GrLivArea + GarageArea + WoodDeckSF +
## OpenPorchSF + SaleCondition
##
## Df Sum of Sq RSS AIC
## <none> 7.9009e+11 29394
## - LowQualFinSF 1 1.5046e+09 7.9159e+11 29394
## - LotFrontage 1 1.5529e+09 7.9164e+11 29394
## - OpenPorchSF 1 2.2599e+09 7.9235e+11 29396
## - LotArea 1 3.5038e+09 7.9359e+11 29398
## - GarageArea 1 3.7251e+09 7.9381e+11 29398
## - MasVnrArea 1 3.8423e+09 7.9393e+11 29399
## - SaleCondition 1 4.6649e+09 7.9475e+11 29400
## - BsmtUnfSF 1 5.4688e+09 7.9556e+11 29402
## - WoodDeckSF 1 5.7555e+09 7.9584e+11 29402
## - YearRemodAdd 1 1.0411e+10 8.0050e+11 29411
## - CentralAir 1 1.1062e+10 8.0115e+11 29412
## - MSSubClass 1 2.4161e+10 8.1425e+11 29435
## - Neighborhood 1 4.6055e+10 8.3614e+11 29474
## - BsmtFinSF1 1 7.2486e+10 8.6257e+11 29520
## - OverallQual 1 1.1045e+11 9.0053e+11 29583
## - GrLivArea 1 2.4667e+11 1.0368e+12 29788
summary(backward.model)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + Neighborhood + LotFrontage +
## LotArea + OverallQual + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtUnfSF + CentralAir + LowQualFinSF + GrLivArea + GarageArea +
## WoodDeckSF + OpenPorchSF + SaleCondition, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -166244 -14826 -1306 13387 90702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.390e+05 7.423e+04 -4.567 5.36e-06 ***
## MSSubClass -1.023e+02 1.539e+01 -6.643 4.35e-11 ***
## Neighborhood 1.228e+03 1.339e+02 9.171 < 2e-16 ***
## LotFrontage 1.199e+02 7.118e+01 1.684 0.09238 .
## LotArea 8.102e-01 3.203e-01 2.530 0.01152 *
## OverallQual 1.044e+04 7.347e+02 14.203 < 2e-16 ***
## YearRemodAdd 1.660e+02 3.806e+01 4.361 1.39e-05 ***
## MasVnrArea 1.351e+01 5.099e+00 2.649 0.00816 **
## BsmtFinSF1 3.338e+01 2.901e+00 11.506 < 2e-16 ***
## BsmtUnfSF 8.189e+00 2.591e+00 3.160 0.00161 **
## CentralAir -1.220e+04 2.714e+03 -4.495 7.52e-06 ***
## LowQualFinSF -2.145e+01 1.294e+01 -1.658 0.09759 .
## GrLivArea 5.679e+01 2.675e+00 21.225 < 2e-16 ***
## GarageArea 1.002e+01 3.842e+00 2.608 0.00919 **
## WoodDeckSF 1.961e+01 6.048e+00 3.242 0.00121 **
## OpenPorchSF 2.994e+01 1.474e+01 2.032 0.04238 *
## SaleCondition -4.793e+03 1.642e+03 -2.919 0.00357 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23400 on 1443 degrees of freedom
## Multiple R-squared: 0.7765, Adjusted R-squared: 0.774
## F-statistic: 313.3 on 16 and 1443 DF, p-value: < 2.2e-16
MODEL4: Create a model using the 5 columns most highly correlated with SalePrice as features
cors <- sapply(train1, cor, y=train1$SalePrice)
## Warning in FUN(X[[i]], ...): the standard deviation is zero
## Warning in FUN(X[[i]], ...): the standard deviation is zero
## Warning in FUN(X[[i]], ...): the standard deviation is zero
mask <- (rank(-abs(cors)) <= 6 )
best5.pred <- train1[, mask]
best5.pred <- subset(best5.pred, select = c(-SalePrice) )
summary(best5.pred)
## Neighborhood OverallQual GrLivArea GarageCars
## Min. : 1.00 Min. : 1.000 Min. :1464 Min. :0.000
## 1st Qu.: 7.00 1st Qu.: 5.000 1st Qu.:1464 1st Qu.:1.000
## Median :13.00 Median : 6.000 Median :1464 Median :2.000
## Mean :12.84 Mean : 6.099 Mean :1666 Mean :1.767
## 3rd Qu.:17.00 3rd Qu.: 7.000 3rd Qu.:1777 3rd Qu.:2.000
## Max. :25.00 Max. :10.000 Max. :2466 Max. :4.000
## GarageArea
## Min. : 0.0
## 1st Qu.: 334.5
## Median : 480.0
## Mean : 473.0
## 3rd Qu.: 576.0
## Max. :1418.0
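The selection step above ranks columns by absolute correlation with SalePrice; the cutoff of 6 leaves room for SalePrice itself, which is then dropped. A minimal sketch of the same idea on a toy data frame follows. The data frame `toy`, its columns, and the value of `k` are made up for illustration and are not part of the original analysis.

```r
# Toy data frame standing in for train1 (hypothetical values)
set.seed(1)
n <- 100
toy <- data.frame(
  a = rnorm(n),
  b = rnorm(n),
  c = rnorm(n),
  const = rep(1, n)   # zero variance: cor() returns NA with a warning
)
toy$y <- 3 * toy$a - 2 * toy$b + rnorm(n, sd = 0.1)

# Correlate every predictor with the response; the response is excluded up
# front, so the cutoff does not need an extra slot for it
predictors <- setdiff(names(toy), "y")
cors <- suppressWarnings(sapply(toy[predictors], cor, y = toy$y))

# Keep the k predictors with the largest absolute correlation.
# sort() drops NA values by default, so constant columns are ignored.
k <- 2
top_k <- names(sort(abs(cors), decreasing = TRUE))[seq_len(k)]
top_k
```

Excluding the response before ranking keeps the cutoff equal to the number of features wanted.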
Stepwise backward regression
model.best5 <- lm (SalePrice ~ Neighborhood + OverallQual + GrLivArea + GarageCars + GarageArea, data=train1)
model.best5<- step (model.best5, direction = "backward")
## Start: AIC=29680.5
## SalePrice ~ Neighborhood + OverallQual + GrLivArea + GarageCars +
## GarageArea
##
## Df Sum of Sq RSS AIC
## - GarageCars 1 6.0233e+08 9.7695e+11 29679
## <none> 9.7634e+11 29681
## - GarageArea 1 1.6068e+10 9.9241e+11 29702
## - Neighborhood 1 1.0040e+11 1.0767e+12 29821
## - OverallQual 1 1.7478e+11 1.1511e+12 29919
## - GrLivArea 1 3.5575e+11 1.3321e+12 30132
##
## Step: AIC=29679.4
## SalePrice ~ Neighborhood + OverallQual + GrLivArea + GarageArea
##
## Df Sum of Sq RSS AIC
## <none> 9.7695e+11 29679
## - GarageArea 1 3.2766e+10 1.0097e+12 29726
## - Neighborhood 1 1.0030e+11 1.0772e+12 29820
## - OverallQual 1 1.7567e+11 1.1526e+12 29919
## - GrLivArea 1 3.5573e+11 1.3327e+12 30131
summary(model.best5)
##
## Call:
## lm(formula = SalePrice ~ Neighborhood + OverallQual + GrLivArea +
## GarageArea, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140973 -15613 -1288 13625 100183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14519.434 4213.363 -3.446 0.000585 ***
## Neighborhood 1723.991 141.056 12.222 < 2e-16 ***
## OverallQual 11899.446 735.671 16.175 < 2e-16 ***
## GrLivArea 61.235 2.660 23.017 < 2e-16 ***
## GarageArea 28.115 4.025 6.986 4.3e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25910 on 1455 degrees of freedom
## Multiple R-squared: 0.7236, Adjusted R-squared: 0.7229
## F-statistic: 952.5 on 4 and 1455 DF, p-value: < 2.2e-16
anova(full.model,reduced.model,backward.model, model.best5, test="Chisq")
## Analysis of Variance Table
##
## Model 1: SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd +
## MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + CentralAir +
## LowQualFinSF + GrLivArea + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + ScreenPorch + X3SsnPorch +
## PoolArea + MiscVal + MoSold + YrSold + SaleCondition
## Model 2: SalePrice ~ MSSubClass + Neighborhood + LotArea + OverallQual +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF +
## CentralAir + GrLivArea + WoodDeckSF + OpenPorchSF + SaleCondition
## Model 3: SalePrice ~ MSSubClass + Neighborhood + LotFrontage + LotArea +
## OverallQual + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF +
## CentralAir + LowQualFinSF + GrLivArea + GarageArea + WoodDeckSF +
## OpenPorchSF + SaleCondition
## Model 4: SalePrice ~ Neighborhood + OverallQual + GrLivArea + GarageArea
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 1434 7.8720e+11
## 2 1445 7.9752e+11 -11 -1.0322e+10 0.064735 .
## 3 1443 7.9009e+11 2 7.4334e+09 0.001147 **
## 4 1455 9.7695e+11 -12 -1.8686e+11 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
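The sequential tests in `anova()` are designed for nested models. Models 2 and 3 above are not nested: Model 2 contains YearBuilt, which Model 3 lacks, and Model 3 contains LotFrontage, LowQualFinSF and GarageArea, which Model 2 lacks. One possible complement is an information criterion such as AIC, which does not require nesting. The sketch below uses the built-in `mtcars` data and two illustrative formulas as stand-ins; neither comes from the original analysis.

```r
# Built-in mtcars data stands in for train1; the formulas are illustrative only
m_a <- lm(mpg ~ wt + hp, data = mtcars)
m_b <- lm(mpg ~ wt + qsec + drat, data = mtcars)   # not nested in m_a

# AIC() accepts several fitted models and returns a data frame
# with one row per model and the columns df and AIC
aic_table <- AIC(m_a, m_b)

# Sort so that the model with the lowest AIC comes first
aic_table[order(aic_table$AIC), ]
```

The same call should work with the four fitted models, for example `AIC(full.model, reduced.model, backward.model, model.best5)`, since all of them are fitted to the same response and the same rows.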
Prepare test data
test<- read.csv("https://raw.githubusercontent.com/johnpannyc/data605-final/master/test.csv")
names(test)
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
test1 <- dplyr::select(test,Id,MSSubClass,Neighborhood,LotFrontage,LotArea,BldgType,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,CentralAir,X1stFlrSF,X2ndFlrSF,LowQualFinSF,GrLivArea,TotRmsAbvGrd,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,ScreenPorch,X3SsnPorch,PoolArea,MiscVal,MoSold,YrSold,SaleCondition)
test1$Neighborhood <- as.character(test1$Neighborhood)
test1$Neighborhood[which(test1$Neighborhood == "NridgHt")] <- "25"
test1$Neighborhood[which(test1$Neighborhood == "NoRidge")] <- "24"
test1$Neighborhood[which(test1$Neighborhood == "StoneBr")] <- "23"
test1$Neighborhood[which(test1$Neighborhood == "Timber")] <- "22"
test1$Neighborhood[which(test1$Neighborhood == "Somerst")] <- "21"
test1$Neighborhood[which(test1$Neighborhood == "Veenker")] <- "20"
test1$Neighborhood[which(test1$Neighborhood == "Crawfor")] <- "19"
test1$Neighborhood[which(test1$Neighborhood == "ClearCr")] <- "18"
test1$Neighborhood[which(test1$Neighborhood == "CollgCr")] <- "17"
test1$Neighborhood[which(test1$Neighborhood == "Blmngtn")] <- "16"
test1$Neighborhood[which(test1$Neighborhood == "NWAmes")] <- "15"
test1$Neighborhood[which(test1$Neighborhood == "Gilbert")] <- "14"
test1$Neighborhood[which(test1$Neighborhood == "SawyerW")] <- "13"
test1$Neighborhood[which(test1$Neighborhood == "Mitchel")] <- "12"
test1$Neighborhood[which(test1$Neighborhood == "NPkVill")] <- "11"
test1$Neighborhood[which(test1$Neighborhood == "NAmes")] <- "10"
test1$Neighborhood[which(test1$Neighborhood == "SWISU")] <- "9"
test1$Neighborhood[which(test1$Neighborhood == "Blueste")] <- "8"
test1$Neighborhood[which(test1$Neighborhood == "Sawyer")] <- "7"
test1$Neighborhood[which(test1$Neighborhood == "BrkSide")] <- "6"
test1$Neighborhood[which(test1$Neighborhood == "Edwards")] <- "5"
test1$Neighborhood[which(test1$Neighborhood == "OldTown")] <- "4"
test1$Neighborhood[which(test1$Neighborhood == "BrDale")] <- "3"
test1$Neighborhood[which(test1$Neighborhood == "IDOTRR")] <- "2"
test1$Neighborhood[which(test1$Neighborhood == "MeadowV")] <- "1"
test1$Neighborhood <- as.numeric(test1$Neighborhood)
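The 25 assignments above can also be written as a single named lookup vector. This is a minimal sketch, not the method used in this report: only four of the levels are listed, and `neighborhood_codes` and `nb` are hypothetical names.

```r
# Named lookup vector: names are neighborhood labels, values are the codes.
# Only four of the 25 levels are shown; their codes match the mapping above.
neighborhood_codes <- c(MeadowV = 1, IDOTRR = 2, NoRidge = 24, NridgHt = 25)

# Toy input standing in for as.character(test1$Neighborhood).
# Convert factors with as.character() first; indexing by a factor would use
# its integer codes rather than its labels.
nb <- c("NridgHt", "MeadowV", "IDOTRR", "NoRidge", "Unknown")

# Indexing by name recodes every element at once; unmatched labels become NA
nb_num <- unname(neighborhood_codes[nb])
nb_num
```

A label that is missing from the lookup shows up as `NA`, which makes unexpected levels in the test set easy to spot.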
Convert indicator variables to numbers.
test1$CentralAir <- as.character(test1$CentralAir)
test1$CentralAir[which(test1$CentralAir == "Y")] <- "1"
test1$CentralAir[which(test1$CentralAir == "N")] <- "0"
test1$CentralAir <- as.numeric(test1$CentralAir)
test1$SaleCondition <- as.character(test1$SaleCondition)
test1$SaleCondition[which(test1$SaleCondition == "Normal")] <- "1"
test1$SaleCondition[which(test1$SaleCondition == "Abnorml")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "AdjLand")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "Alloca")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "Family")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "Partial")] <- "0"
test1$SaleCondition[which(test1$SaleCondition == "N")] <- "0"
test1$SaleCondition <- as.numeric(test1$SaleCondition)
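A two-level indicator can also be built directly from a logical comparison. The sketch below is an assumed equivalent of the CentralAir and SaleCondition conversions above, shown on toy vectors rather than on `test1`.

```r
# Toy vectors standing in for test1$CentralAir and test1$SaleCondition
central_air <- c("Y", "N", "Y")
sale_cond   <- c("Normal", "Partial", "Abnorml", "Normal")

# TRUE/FALSE converts to 1/0: "Y" becomes 1, anything else becomes 0
central_air_num <- as.integer(central_air == "Y")

# "Normal" becomes 1, every other sale condition becomes 0
sale_cond_num <- as.integer(sale_cond == "Normal")

central_air_num
sale_cond_num
```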
test1$BldgType <- as.character(test1$BldgType)
test1$BldgType[which(test1$BldgType == "1Fam")] <- "5"
test1$BldgType[which(test1$BldgType == "2fmCon")] <- "4"
test1$BldgType[which(test1$BldgType == "Duplex")] <- "3"
test1$BldgType[which(test1$BldgType == "Twnhs")] <- "2"
test1$BldgType[which(test1$BldgType == "TwnhsE")] <- "1"
test1$BldgType <- as.numeric(test1$BldgType)
sapply(test1, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
Variable | Missing values |
---|---|
Id | 0 |
MSSubClass | 0 |
Neighborhood | 0 |
LotFrontage | 227 |
LotArea | 0 |
BldgType | 0 |
OverallQual | 0 |
OverallCond | 0 |
YearBuilt | 0 |
YearRemodAdd | 0 |
MasVnrArea | 15 |
BsmtFinSF1 | 1 |
BsmtFinSF2 | 1 |
BsmtUnfSF | 1 |
TotalBsmtSF | 1 |
CentralAir | 0 |
X1stFlrSF | 0 |
X2ndFlrSF | 0 |
LowQualFinSF | 0 |
GrLivArea | 0 |
TotRmsAbvGrd | 0 |
GarageCars | 1 |
GarageArea | 1 |
WoodDeckSF | 0 |
OpenPorchSF | 0 |
EnclosedPorch | 0 |
ScreenPorch | 0 |
X3SsnPorch | 0 |
PoolArea | 0 |
MiscVal | 0 |
MoSold | 0 |
YrSold | 0 |
SaleCondition | 0 |
Impute missing values with the column mean
test1$LotFrontage[is.na(test1$LotFrontage)] <- mean(test1$LotFrontage, na.rm=TRUE)
test1$MasVnrArea[is.na(test1$MasVnrArea)] <- mean(test1$MasVnrArea, na.rm=TRUE)
test1$BsmtFinSF1[is.na(test1$BsmtFinSF1)] <- mean(test1$BsmtFinSF1, na.rm=TRUE)
test1$BsmtFinSF2[is.na(test1$BsmtFinSF2)] <- mean(test1$BsmtFinSF2, na.rm=TRUE)
test1$BsmtUnfSF[is.na(test1$BsmtUnfSF)] <- mean(test1$BsmtUnfSF, na.rm=TRUE)
test1$TotalBsmtSF[is.na(test1$TotalBsmtSF)] <- mean(test1$TotalBsmtSF, na.rm=TRUE)
test1$GarageCars[is.na(test1$GarageCars)] <- mean(test1$GarageCars, na.rm=TRUE)
test1$GarageArea[is.na(test1$GarageArea)] <- mean(test1$GarageArea, na.rm=TRUE)
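The eight imputation lines above repeat one pattern, which can be wrapped in a small helper. The function `impute_mean` and its `reference` argument are hypothetical. By default the means come from the data being imputed, as in the code above; passing the training data as `reference` would fill test-set gaps with training means instead, which is a common alternative.

```r
# Fill NA values in the listed columns with the column mean.
# `reference` is the data frame the means are computed from (default: df itself).
impute_mean <- function(df, cols, reference = df) {
  for (col in cols) {
    fill <- mean(reference[[col]], na.rm = TRUE)
    df[[col]][is.na(df[[col]])] <- fill
  }
  df
}

# Toy data frame standing in for test1
toy <- data.frame(LotFrontage = c(60, NA, 80), GarageCars = c(2, 1, NA))
toy <- impute_mean(toy, c("LotFrontage", "GarageCars"))
toy
```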
Delete the following columns because of collinearity
train1$TotalBsmtSF <- NULL
train1$TotRmsAbvGrd<- NULL
train1$X1stFlrSF<- NULL
train1$X2ndFlrSF<- NULL
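Replace outliers by capping each variable at a lower and an upper quantile
replaceOutliers = function(x) {
# Cutoffs: 0.5 is the median (a 5th-percentile floor would be 0.05) and .95 is the 95th percentile
quantiles <- quantile( x, c(0.5,.95 ) )
# Values below the lower cutoff are raised to it; values above the upper cutoff are lowered to it
x[ x < quantiles[1] ] <- quantiles[1]
x[ x > quantiles[2] ] <- quantiles[2]
return(x)
}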
train1$LotFrontage <- replaceOutliers(train1$LotFrontage)
train1$LotArea <- replaceOutliers(train1$LotArea)
train1$MasVnrArea <- replaceOutliers(train1$MasVnrArea)
train1$BsmtFinSF1 <- replaceOutliers(train1$BsmtFinSF1)
train1$BsmtFinSF2 <- replaceOutliers(train1$BsmtFinSF2)
train1$BsmtUnfSF <- replaceOutliers(train1$BsmtUnfSF)
train1$GrLivArea <- replaceOutliers(train1$GrLivArea)
train1$WoodDeckSF <- replaceOutliers(train1$WoodDeckSF)
train1$OpenPorchSF <- replaceOutliers(train1$OpenPorchSF)
train1$EnclosedPorch <- replaceOutliers(train1$EnclosedPorch)
train1$ScreenPorch <- replaceOutliers(train1$ScreenPorch)
train1$X3SsnPorch <- replaceOutliers(train1$X3SsnPorch)
train1$PoolArea <- replaceOutliers(train1$PoolArea)
train1$MiscVal <- replaceOutliers(train1$MiscVal)
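Because `replaceOutliers` passes `c(0.5, .95)` to `quantile`, every value below the median is raised to the median. This is consistent with the GrLivArea summary earlier, where the minimum, first quartile and median are all 1464. If a 5th-percentile floor was intended, a parameterized version might look like the sketch below. The function name `winsorize` and its default cutoffs are assumptions, not part of the original analysis.

```r
# Cap x at the given lower and upper quantiles (defaults are illustrative)
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, c(lower, upper), na.rm = TRUE, names = FALSE)
  # pmax raises small values to the lower cutoff, pmin lowers large values
  pmin(pmax(x, q[1]), q[2])
}

# Toy vector with one extreme value
x <- c(1:99, 1000)
w <- winsorize(x)
range(w)
```

With these cutoffs the middle of the distribution is left unchanged, so the median of `w` equals the median of `x`.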
summary(test1)
## Id MSSubClass Neighborhood LotFrontage
## Min. :1461 Min. : 20.00 Min. : 1.00 Min. : 21.00
## 1st Qu.:1826 1st Qu.: 20.00 1st Qu.: 7.00 1st Qu.: 60.00
## Median :2190 Median : 50.00 Median :12.00 Median : 68.58
## Mean :2190 Mean : 57.38 Mean :12.55 Mean : 68.58
## 3rd Qu.:2554 3rd Qu.: 70.00 3rd Qu.:17.00 3rd Qu.: 78.00
## Max. :2919 Max. :190.00 Max. :25.00 Max. :200.00
## LotArea BldgType OverallQual OverallCond
## Min. : 1470 Min. :1.000 Min. : 1.000 Min. :1.000
## 1st Qu.: 7391 1st Qu.:5.000 1st Qu.: 5.000 1st Qu.:5.000
## Median : 9399 Median :5.000 Median : 6.000 Median :5.000
## Mean : 9819 Mean :4.482 Mean : 6.079 Mean :5.554
## 3rd Qu.:11518 3rd Qu.:5.000 3rd Qu.: 7.000 3rd Qu.:6.000
## Max. :56600 Max. :5.000 Max. :10.000 Max. :9.000
## YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## Min. :1879 Min. :1950 Min. : 0.0 Min. : 0.0
## 1st Qu.:1953 1st Qu.:1963 1st Qu.: 0.0 1st Qu.: 0.0
## Median :1973 Median :1992 Median : 0.0 Median : 351.0
## Mean :1971 Mean :1984 Mean : 100.7 Mean : 439.2
## 3rd Qu.:2001 3rd Qu.:2004 3rd Qu.: 162.0 3rd Qu.: 752.0
## Max. :2010 Max. :2010 Max. :1290.0 Max. :4010.0
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF CentralAir
## Min. : 0.00 Min. : 0.0 Min. : 0 Min. :0.0000
## 1st Qu.: 0.00 1st Qu.: 219.5 1st Qu.: 784 1st Qu.:1.0000
## Median : 0.00 Median : 460.0 Median : 988 Median :1.0000
## Mean : 52.62 Mean : 554.3 Mean :1046 Mean :0.9308
## 3rd Qu.: 0.00 3rd Qu.: 797.5 3rd Qu.:1304 3rd Qu.:1.0000
## Max. :1526.00 Max. :2140.0 Max. :5095 Max. :1.0000
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 407.0 Min. : 0 Min. : 0.000 Min. : 407
## 1st Qu.: 873.5 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1118
## Median :1079.0 Median : 0 Median : 0.000 Median :1432
## Mean :1156.5 Mean : 326 Mean : 3.543 Mean :1486
## 3rd Qu.:1382.5 3rd Qu.: 676 3rd Qu.: 0.000 3rd Qu.:1721
## Max. :5095.0 Max. :1862 Max. :1064.000 Max. :5095
## TotRmsAbvGrd GarageCars GarageArea WoodDeckSF
## Min. : 3.000 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.: 5.000 1st Qu.:1.000 1st Qu.: 318.0 1st Qu.: 0.00
## Median : 6.000 Median :2.000 Median : 480.0 Median : 0.00
## Mean : 6.385 Mean :1.766 Mean : 472.8 Mean : 93.17
## 3rd Qu.: 7.000 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.: 168.00
## Max. :15.000 Max. :5.000 Max. :1488.0 Max. :1424.00
## OpenPorchSF EnclosedPorch ScreenPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 28.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 48.31 Mean : 24.24 Mean : 17.06 Mean : 1.794
## 3rd Qu.: 72.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :742.00 Max. :1012.00 Max. :576.00 Max. :360.000
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 4.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 1.744 Mean : 58.17 Mean : 6.104 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :800.000 Max. :17000.00 Max. :12.000 Max. :2010
## SaleCondition
## Min. :0.0000
## 1st Qu.:1.0000
## Median :1.0000
## Mean :0.8252
## 3rd Qu.:1.0000
## Max. :1.0000
par(mfrow=c(3, 3))
colnames <- dimnames(test1)[[2]]
for(col in 2:ncol(test1)) {
d <- density(na.omit(test1[,col]))
plot(d, type="n", main=colnames[col])
polygon(d, col="light green", border="red")
}
full.model.pred <- cbind(test1, s<-predict(full.model, test1))
names(full.model.pred)[ncol(full.model.pred)] <- "SalePrice"
full.model.submission <- dplyr::select(full.model.pred,Id,SalePrice)
write.csv(full.model.submission, file="full.model.submission.csv")
reduced.model.pred <- cbind(test1, s<-predict(reduced.model, test1))
names(reduced.model.pred)[ncol(reduced.model.pred)] <- "SalePrice"
reduced.model.submission <- dplyr::select(reduced.model.pred,Id,SalePrice)
write.csv(reduced.model.submission, file="reduced.model.submission.csv")
backward.model.pred <- cbind(test1, s<-predict(backward.model, test1))
names(backward.model.pred)[ncol(backward.model.pred)] <- "SalePrice"
backward.model.submission <- dplyr::select(backward.model.pred,Id,SalePrice)
write.csv(backward.model.submission, file="backward.model.submission.csv")
model.best5.pred <- cbind(test1, s<-predict(model.best5, test1))
names(model.best5.pred)[ncol(model.best5.pred)] <- "SalePrice"
model.best5.submission <- dplyr::select(model.best5.pred,Id,SalePrice)
write.csv(model.best5.submission, file="model.best5.submission.csv")
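The four blocks above differ only in the model and the file name, so they could be folded into one helper. The sketch below is an assumed variant, not the code used for the reported scores: `make_submission` is a hypothetical name, the toy data are made up, and `row.names = FALSE` is added so that the file contains only the Id and SalePrice columns.

```r
# Predict with a fitted model and write a two-column submission file
make_submission <- function(model, newdata, file) {
  out <- data.frame(
    Id = newdata$Id,
    SalePrice = as.numeric(predict(model, newdata))
  )
  # row.names = FALSE avoids an extra unnamed index column in the csv
  write.csv(out, file = file, row.names = FALSE)
  invisible(out)
}

# Toy training and test sets standing in for train1 and test1
toy_train <- data.frame(Id = 1:5, x = 1:5, SalePrice = c(10, 21, 29, 41, 50))
toy_test  <- data.frame(Id = 6:7, x = 6:7)
toy_model <- lm(SalePrice ~ x, data = toy_train)

path <- tempfile(fileext = ".csv")
sub <- make_submission(toy_model, toy_test, path)
names(read.csv(path))
```

With the real objects, a call would look like `make_submission(full.model, test1, "full.model.submission.csv")`.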
Summary of the predicted SalePrice from the 4 models
summary(full.model.submission$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 46528 135289 168058 176857 209857 626428
summary(reduced.model.submission$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49564 135726 168064 177044 209976 629480
summary(backward.model.submission$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49564 135553 167851 176928 209720 623185
summary(model.best5.submission$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49448 143541 178091 183743 216547 457530
Histograms of the predicted SalePrice from the 4 models
ggplot(full.model.submission, aes(SalePrice)) + geom_histogram(binwidth = 20000, alpha=0.5, color="red", fill="light green")
ggplot(reduced.model.submission, aes(SalePrice)) + geom_histogram(binwidth = 20000, alpha=0.5, color="red", fill="light green")
ggplot(backward.model.submission, aes(SalePrice)) + geom_histogram(binwidth = 20000, alpha=0.5, color="red", fill="light green")
ggplot(model.best5.submission, aes(SalePrice)) + geom_histogram(binwidth = 20000, alpha=0.5, color="red", fill="light green")
After submitting the csv files from the above models to Kaggle, the scores were: Best 5 (0.213), Backward (0.186), Full (0.185), Reduced (0.185).
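This competition is scored by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, so lower scores are better. A minimal sketch of that metric follows. The function name `rmse_log` and the example prices are made up for illustration; the function could be applied to held-out training rows to estimate a score before submitting.

```r
# RMSE on the log scale; assumes all prices are strictly positive
# (a negative prediction from a linear model would produce NaN)
rmse_log <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Hypothetical observed and predicted sale prices
actual    <- c(200000, 150000, 300000)
predicted <- c(210000, 140000, 330000)
rmse_log(actual, predicted)
```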