You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following:
The House Prices dataset from Kaggle contains 79 explanatory variables that describe residential homes in Ames, Iowa. Columns include the lot area, the overall quality rating, the year built, and so on. The purpose of this contest is to predict the final sale price of each home using the variables provided.
I will explore the dataset, create a regression model using variables of my choice to predict the final price of homes, attach the predictions to the test dataset, and submit them to the Kaggle competition.
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
test <- read.csv("test.csv",stringsAsFactors = FALSE)
train <- read.csv("train.csv",stringsAsFactors = FALSE)
Looking at the data, we can see that several columns contain NA values and that the columns have a mix of data types (integer and character). These are issues we will have to address when cleaning the data.
glimpse(train)
## Rows: 1,460
## Columns: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,…
## $ MSZoning <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "R…
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ Alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ LotShape <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1", …
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", …
## $ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu…
## $ LotConfig <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Inside", "I…
## $ LandSlope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", …
## $ Neighborhood <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "…
## $ Condition1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm",…
## $ Condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
## $ BldgType <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", …
## $ HouseStyle <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fi…
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,…
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,…
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19…
## $ RoofStyle <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "G…
## $ RoofMatl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "…
## $ Exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "…
## $ Exterior2nd <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "VinylSd", "…
## $ MasVnrType <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace", "None",…
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, …
## $ ExterQual <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", "TA", "T…
## $ ExterCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ Foundation <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "Wood", "…
## $ BsmtQual <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", "TA", "T…
## $ BsmtCond <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "T…
## $ BsmtExposure <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", "No", "N…
## $ BsmtFinType1 <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ", …
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99…
## $ BsmtFinType2 <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "BLQ", …
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17…
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
## $ Heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
## $ HeatingQC <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", "Gd", "E…
## $ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ Electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S…
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, …
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,…
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10…
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,…
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,…
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ KitchenQual <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", "TA", "T…
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
## $ Functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", …
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,…
## $ FireplaceQu <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA", "TA", …
## $ GarageType <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attch…
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19…
## $ GarageFinish <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "RFn", …
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
## $ GarageQual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "Fa", "G…
## $ GarageCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ PavedDrive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160…
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,…
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, …
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, …
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PoolQC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Fence <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, NA,…
## $ MiscFeature <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, NA, NA, …
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,…
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10…
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
## $ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W…
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm…
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …
The test data has 1,459 rows, one fewer than the train data (1,460). It also has
80 columns instead of 81, because the column SalePrice, the value we are trying
to predict, is missing from the test data.
glimpse(test)
## Rows: 1,459
## Columns: 80
## $ Id <int> 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 14…
## $ MSSubClass <int> 20, 20, 60, 60, 120, 60, 20, 60, 20, 20, 120, 160, 160, …
## $ MSZoning <chr> "RH", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "R…
## $ LotFrontage <int> 80, 81, 74, 78, 43, 75, NA, 63, 85, 70, 26, 21, 21, 24, …
## $ LotArea <int> 11622, 14267, 13830, 9978, 5005, 10000, 7980, 8402, 1017…
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ Alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ LotShape <chr> "Reg", "IR1", "IR1", "IR1", "IR1", "IR1", "IR1", "IR1", …
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "HLS", "Lvl", "Lvl", "Lvl", …
## $ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu…
## $ LotConfig <chr> "Inside", "Corner", "Inside", "Inside", "Inside", "Corne…
## $ LandSlope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", …
## $ Neighborhood <chr> "NAmes", "NAmes", "Gilbert", "Gilbert", "StoneBr", "Gilb…
## $ Condition1 <chr> "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm",…
## $ Condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
## $ BldgType <chr> "1Fam", "1Fam", "1Fam", "1Fam", "TwnhsE", "1Fam", "1Fam"…
## $ HouseStyle <chr> "1Story", "1Story", "2Story", "2Story", "1Story", "2Stor…
## $ OverallQual <int> 5, 6, 5, 6, 8, 6, 6, 6, 7, 4, 7, 6, 5, 6, 7, 9, 8, 9, 8,…
## $ OverallCond <int> 6, 6, 5, 6, 5, 5, 7, 5, 5, 5, 5, 5, 5, 6, 6, 5, 5, 5, 5,…
## $ YearBuilt <int> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, 1990, 19…
## $ YearRemodAdd <int> 1961, 1958, 1998, 1998, 1992, 1994, 2007, 1998, 1990, 19…
## $ RoofStyle <chr> "Gable", "Hip", "Gable", "Gable", "Gable", "Gable", "Gab…
## $ RoofMatl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "…
## $ Exterior1st <chr> "VinylSd", "Wd Sdng", "VinylSd", "VinylSd", "HdBoard", "…
## $ Exterior2nd <chr> "VinylSd", "Wd Sdng", "VinylSd", "VinylSd", "HdBoard", "…
## $ MasVnrType <chr> "None", "BrkFace", "None", "BrkFace", "None", "None", "N…
## $ MasVnrArea <int> 0, 108, 0, 20, 0, 0, 0, 0, 0, 0, 0, 504, 492, 0, 0, 162,…
## $ ExterQual <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "T…
## $ ExterCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "Gd", "TA", "TA", "T…
## $ Foundation <chr> "CBlock", "CBlock", "PConc", "PConc", "PConc", "PConc", …
## $ BsmtQual <chr> "TA", "TA", "Gd", "TA", "Gd", "Gd", "Gd", "Gd", "Gd", "T…
## $ BsmtCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ BsmtExposure <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Gd", "N…
## $ BsmtFinType1 <chr> "Rec", "ALQ", "GLQ", "GLQ", "ALQ", "Unf", "ALQ", "Unf", …
## $ BsmtFinSF1 <int> 468, 923, 791, 602, 263, 0, 935, 0, 637, 804, 1051, 156,…
## $ BsmtFinType2 <chr> "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", …
## $ BsmtFinSF2 <int> 144, 0, 0, 0, 0, 0, 0, 0, 0, 78, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BsmtUnfSF <int> 270, 406, 137, 324, 1017, 763, 233, 789, 663, 0, 354, 32…
## $ TotalBsmtSF <int> 882, 1329, 928, 926, 1280, 763, 1168, 789, 1300, 882, 14…
## $ Heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
## $ HeatingQC <chr> "TA", "TA", "Gd", "Ex", "Ex", "Gd", "Ex", "Gd", "Gd", "T…
## $ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ Electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S…
## $ X1stFlrSF <int> 896, 1329, 928, 926, 1280, 763, 1187, 789, 1341, 882, 13…
## $ X2ndFlrSF <int> 0, 0, 701, 678, 0, 892, 0, 676, 0, 0, 0, 504, 567, 601, …
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GrLivArea <int> 896, 1329, 1629, 1604, 1280, 1655, 1187, 1465, 1341, 882…
## $ BsmtFullBath <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BsmtHalfBath <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FullBath <int> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ HalfBath <int> 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,…
## $ BedroomAbvGr <int> 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3,…
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ KitchenQual <chr> "TA", "Gd", "TA", "Gd", "Gd", "TA", "TA", "TA", "Gd", "T…
## $ TotRmsAbvGrd <int> 5, 6, 6, 7, 5, 7, 6, 7, 5, 4, 5, 5, 6, 6, 4, 10, 7, 7, 8…
## $ Functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", …
## $ Fireplaces <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,…
## $ FireplaceQu <chr> NA, NA, "TA", "Gd", NA, "TA", NA, "Gd", "Po", NA, "Fa", …
## $ GarageType <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attchd", "Attch…
## $ GarageYrBlt <int> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, 1990, 19…
## $ GarageFinish <chr> "Unf", "Unf", "Fin", "Fin", "RFn", "Fin", "Fin", "Fin", …
## $ GarageCars <int> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 3, 3, 3, 3,…
## $ GarageArea <int> 730, 312, 482, 470, 506, 440, 420, 393, 506, 525, 511, 2…
## $ GarageQual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ GarageCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ PavedDrive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ WoodDeckSF <int> 140, 393, 212, 360, 0, 157, 483, 0, 192, 240, 203, 275, …
## $ OpenPorchSF <int> 0, 36, 34, 36, 82, 84, 21, 75, 0, 0, 68, 0, 0, 0, 30, 13…
## $ EnclosedPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ScreenPorch <int> 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PoolQC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Fence <chr> "MnPrv", NA, "MnPrv", NA, NA, NA, "GdPrv", NA, NA, "MnPr…
## $ MiscFeature <chr> NA, "Gar2", NA, NA, NA, NA, "Shed", NA, NA, NA, NA, NA, …
## $ MiscVal <int> 0, 12500, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MoSold <int> 6, 6, 3, 6, 1, 4, 3, 5, 2, 4, 6, 2, 3, 6, 6, 1, 6, 6, 2,…
## $ YrSold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 20…
## $ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W…
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Normal", "Normal", "Norma…
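The difference in shape between the two data frames can also be checked directly rather than read off the glimpse output. The sketch below is only illustrative: `train_demo` and `test_demo` are hypothetical toy stand-ins built from the first few values shown above, so the block runs without the CSV files. On the real data the same `setdiff` calls would be applied to `train` and `test`.

```r
# Toy stand-ins for the real data frames (names ending in _demo are hypothetical)
train_demo <- data.frame(
  Id        = 1:3,
  LotArea   = c(8450, 9600, 11250),
  SalePrice = c(208500, 181500, 223500)
)
test_demo <- data.frame(
  Id      = 1461:1462,
  LotArea = c(11622, 14267)
)

# Columns that appear in one data frame but not the other
only_in_train <- setdiff(names(train_demo), names(test_demo))
only_in_test  <- setdiff(names(test_demo), names(train_demo))

print(only_in_train)
print(only_in_test)
```

If the only name returned for the train side is the target column and nothing is returned for the test side, the two data frames otherwise share the same schema.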
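Using the skim function, we get an even clearer look at
our missing values. Among the numeric columns, LotFrontage has the most missing
values in both the train (259) and test (227) datasets. Among the character
columns, PoolQC, MiscFeature, Alley, and Fence are each at least 80% missing in
both datasets.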
skim(train)
| Name | train |
| Number of rows | 1460 |
| Number of columns | 81 |
| _______________________ | |
| Column type frequency: | |
| character | 43 |
| numeric | 38 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MSZoning | 0 | 1.00 | 2 | 7 | 0 | 5 | 0 |
| Street | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
| Alley | 1369 | 0.06 | 4 | 4 | 0 | 2 | 0 |
| LotShape | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
| LandContour | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
| Utilities | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
| LotConfig | 0 | 1.00 | 3 | 7 | 0 | 5 | 0 |
| LandSlope | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
| Neighborhood | 0 | 1.00 | 5 | 7 | 0 | 25 | 0 |
| Condition1 | 0 | 1.00 | 4 | 6 | 0 | 9 | 0 |
| Condition2 | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
| BldgType | 0 | 1.00 | 4 | 6 | 0 | 5 | 0 |
| HouseStyle | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
| RoofStyle | 0 | 1.00 | 3 | 7 | 0 | 6 | 0 |
| RoofMatl | 0 | 1.00 | 4 | 7 | 0 | 8 | 0 |
| Exterior1st | 0 | 1.00 | 5 | 7 | 0 | 15 | 0 |
| Exterior2nd | 0 | 1.00 | 5 | 7 | 0 | 16 | 0 |
| MasVnrType | 8 | 0.99 | 4 | 7 | 0 | 4 | 0 |
| ExterQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
| ExterCond | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| Foundation | 0 | 1.00 | 4 | 6 | 0 | 6 | 0 |
| BsmtQual | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtCond | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtExposure | 38 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtFinType1 | 37 | 0.97 | 3 | 3 | 0 | 6 | 0 |
| BsmtFinType2 | 38 | 0.97 | 3 | 3 | 0 | 6 | 0 |
| Heating | 0 | 1.00 | 4 | 5 | 0 | 6 | 0 |
| HeatingQC | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| CentralAir | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| Electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
| KitchenQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
| Functional | 0 | 1.00 | 3 | 4 | 0 | 7 | 0 |
| FireplaceQu | 690 | 0.53 | 2 | 2 | 0 | 5 | 0 |
| GarageType | 81 | 0.94 | 6 | 7 | 0 | 6 | 0 |
| GarageFinish | 81 | 0.94 | 3 | 3 | 0 | 3 | 0 |
| GarageQual | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
| GarageCond | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
| PavedDrive | 0 | 1.00 | 1 | 1 | 0 | 3 | 0 |
| PoolQC | 1453 | 0.00 | 2 | 2 | 0 | 3 | 0 |
| Fence | 1179 | 0.19 | 4 | 5 | 0 | 4 | 0 |
| MiscFeature | 1406 | 0.04 | 4 | 4 | 0 | 4 | 0 |
| SaleType | 0 | 1.00 | 2 | 5 | 0 | 9 | 0 |
| SaleCondition | 0 | 1.00 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 730.50 | 421.61 | 1 | 365.75 | 730.5 | 1095.25 | 1460 | ▇▇▇▇▇ |
| MSSubClass | 0 | 1.00 | 56.90 | 42.30 | 20 | 20.00 | 50.0 | 70.00 | 190 | ▇▅▂▁▁ |
| LotFrontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 | ▇▃▁▁▁ |
| LotArea | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 | ▇▁▁▁▁ |
| OverallQual | 0 | 1.00 | 6.10 | 1.38 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▂▇▅▁ |
| OverallCond | 0 | 1.00 | 5.58 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
| YearBuilt | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 | ▁▂▃▆▇ |
| YearRemodAdd | 0 | 1.00 | 1984.87 | 20.65 | 1950 | 1967.00 | 1994.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
| MasVnrArea | 8 | 0.99 | 103.69 | 181.07 | 0 | 0.00 | 0.0 | 166.00 | 1600 | ▇▁▁▁▁ |
| BsmtFinSF1 | 0 | 1.00 | 443.64 | 456.10 | 0 | 0.00 | 383.5 | 712.25 | 5644 | ▇▁▁▁▁ |
| BsmtFinSF2 | 0 | 1.00 | 46.55 | 161.32 | 0 | 0.00 | 0.0 | 0.00 | 1474 | ▇▁▁▁▁ |
| BsmtUnfSF | 0 | 1.00 | 567.24 | 441.87 | 0 | 223.00 | 477.5 | 808.00 | 2336 | ▇▅▂▁▁ |
| TotalBsmtSF | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 | ▇▃▁▁▁ |
| X1stFlrSF | 0 | 1.00 | 1162.63 | 386.59 | 334 | 882.00 | 1087.0 | 1391.25 | 4692 | ▇▅▁▁▁ |
| X2ndFlrSF | 0 | 1.00 | 346.99 | 436.53 | 0 | 0.00 | 0.0 | 728.00 | 2065 | ▇▃▂▁▁ |
| LowQualFinSF | 0 | 1.00 | 5.84 | 48.62 | 0 | 0.00 | 0.0 | 0.00 | 572 | ▇▁▁▁▁ |
| GrLivArea | 0 | 1.00 | 1515.46 | 525.48 | 334 | 1129.50 | 1464.0 | 1776.75 | 5642 | ▇▇▁▁▁ |
| BsmtFullBath | 0 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 | ▇▆▁▁▁ |
| BsmtHalfBath | 0 | 1.00 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 | ▇▁▁▁▁ |
| FullBath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▁▇▁▇▁ |
| HalfBath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
| BedroomAbvGr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
| KitchenAbvGr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
| TotRmsAbvGrd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 | ▂▇▇▁▁ |
| Fireplaces | 0 | 1.00 | 0.61 | 0.64 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▁▁ |
| GarageYrBlt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961.00 | 1980.0 | 2002.00 | 2010 | ▁▁▅▅▇ |
| GarageCars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▃▇▂▁ |
| GarageArea | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 | ▂▇▃▁▁ |
| WoodDeckSF | 0 | 1.00 | 94.24 | 125.34 | 0 | 0.00 | 0.0 | 168.00 | 857 | ▇▂▁▁▁ |
| OpenPorchSF | 0 | 1.00 | 46.66 | 66.26 | 0 | 0.00 | 25.0 | 68.00 | 547 | ▇▁▁▁▁ |
| EnclosedPorch | 0 | 1.00 | 21.95 | 61.12 | 0 | 0.00 | 0.0 | 0.00 | 552 | ▇▁▁▁▁ |
| X3SsnPorch | 0 | 1.00 | 3.41 | 29.32 | 0 | 0.00 | 0.0 | 0.00 | 508 | ▇▁▁▁▁ |
| ScreenPorch | 0 | 1.00 | 15.06 | 55.76 | 0 | 0.00 | 0.0 | 0.00 | 480 | ▇▁▁▁▁ |
| PoolArea | 0 | 1.00 | 2.76 | 40.18 | 0 | 0.00 | 0.0 | 0.00 | 738 | ▇▁▁▁▁ |
| MiscVal | 0 | 1.00 | 43.49 | 496.12 | 0 | 0.00 | 0.0 | 0.00 | 15500 | ▇▁▁▁▁ |
| MoSold | 0 | 1.00 | 6.32 | 2.70 | 1 | 5.00 | 6.0 | 8.00 | 12 | ▃▆▇▃▃ |
| YrSold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▅ |
| SalePrice | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 | ▇▅▁▁▁ |
skim(test)
| Name | test |
| Number of rows | 1459 |
| Number of columns | 80 |
| _______________________ | |
| Column type frequency: | |
| character | 43 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MSZoning | 4 | 1.00 | 2 | 7 | 0 | 5 | 0 |
| Street | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
| Alley | 1352 | 0.07 | 4 | 4 | 0 | 2 | 0 |
| LotShape | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
| LandContour | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
| Utilities | 2 | 1.00 | 6 | 6 | 0 | 1 | 0 |
| LotConfig | 0 | 1.00 | 3 | 7 | 0 | 5 | 0 |
| LandSlope | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
| Neighborhood | 0 | 1.00 | 5 | 7 | 0 | 25 | 0 |
| Condition1 | 0 | 1.00 | 4 | 6 | 0 | 9 | 0 |
| Condition2 | 0 | 1.00 | 4 | 6 | 0 | 5 | 0 |
| BldgType | 0 | 1.00 | 4 | 6 | 0 | 5 | 0 |
| HouseStyle | 0 | 1.00 | 4 | 6 | 0 | 7 | 0 |
| RoofStyle | 0 | 1.00 | 3 | 7 | 0 | 6 | 0 |
| RoofMatl | 0 | 1.00 | 7 | 7 | 0 | 4 | 0 |
| Exterior1st | 1 | 1.00 | 6 | 7 | 0 | 13 | 0 |
| Exterior2nd | 1 | 1.00 | 5 | 7 | 0 | 15 | 0 |
| MasVnrType | 16 | 0.99 | 4 | 7 | 0 | 4 | 0 |
| ExterQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
| ExterCond | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| Foundation | 0 | 1.00 | 4 | 6 | 0 | 6 | 0 |
| BsmtQual | 44 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtCond | 45 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtExposure | 44 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtFinType1 | 42 | 0.97 | 3 | 3 | 0 | 6 | 0 |
| BsmtFinType2 | 42 | 0.97 | 3 | 3 | 0 | 6 | 0 |
| Heating | 0 | 1.00 | 4 | 4 | 0 | 4 | 0 |
| HeatingQC | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| CentralAir | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| Electrical | 0 | 1.00 | 5 | 5 | 0 | 4 | 0 |
| KitchenQual | 1 | 1.00 | 2 | 2 | 0 | 4 | 0 |
| Functional | 2 | 1.00 | 3 | 4 | 0 | 7 | 0 |
| FireplaceQu | 730 | 0.50 | 2 | 2 | 0 | 5 | 0 |
| GarageType | 76 | 0.95 | 6 | 7 | 0 | 6 | 0 |
| GarageFinish | 78 | 0.95 | 3 | 3 | 0 | 3 | 0 |
| GarageQual | 78 | 0.95 | 2 | 2 | 0 | 4 | 0 |
| GarageCond | 78 | 0.95 | 2 | 2 | 0 | 5 | 0 |
| PavedDrive | 0 | 1.00 | 1 | 1 | 0 | 3 | 0 |
| PoolQC | 1456 | 0.00 | 2 | 2 | 0 | 2 | 0 |
| Fence | 1169 | 0.20 | 4 | 5 | 0 | 4 | 0 |
| MiscFeature | 1408 | 0.03 | 4 | 4 | 0 | 3 | 0 |
| SaleType | 1 | 1.00 | 2 | 5 | 0 | 9 | 0 |
| SaleCondition | 0 | 1.00 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 2190.00 | 421.32 | 1461 | 1825.50 | 2190.0 | 2554.50 | 2919 | ▇▇▇▇▇ |
| MSSubClass | 0 | 1.00 | 57.38 | 42.75 | 20 | 20.00 | 50.0 | 70.00 | 190 | ▇▅▂▁▁ |
| LotFrontage | 227 | 0.84 | 68.58 | 22.38 | 21 | 58.00 | 67.0 | 80.00 | 200 | ▃▇▁▁▁ |
| LotArea | 0 | 1.00 | 9819.16 | 4955.52 | 1470 | 7391.00 | 9399.0 | 11517.50 | 56600 | ▇▂▁▁▁ |
| OverallQual | 0 | 1.00 | 6.08 | 1.44 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▁▇▅▁ |
| OverallCond | 0 | 1.00 | 5.55 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
| YearBuilt | 0 | 1.00 | 1971.36 | 30.39 | 1879 | 1953.00 | 1973.0 | 2001.00 | 2010 | ▁▂▃▆▇ |
| YearRemodAdd | 0 | 1.00 | 1983.66 | 21.13 | 1950 | 1963.00 | 1992.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
| MasVnrArea | 15 | 0.99 | 100.71 | 177.63 | 0 | 0.00 | 0.0 | 164.00 | 1290 | ▇▁▁▁▁ |
| BsmtFinSF1 | 1 | 1.00 | 439.20 | 455.27 | 0 | 0.00 | 350.5 | 753.50 | 4010 | ▇▂▁▁▁ |
| BsmtFinSF2 | 1 | 1.00 | 52.62 | 176.75 | 0 | 0.00 | 0.0 | 0.00 | 1526 | ▇▁▁▁▁ |
| BsmtUnfSF | 1 | 1.00 | 554.29 | 437.26 | 0 | 219.25 | 460.0 | 797.75 | 2140 | ▇▆▂▁▁ |
| TotalBsmtSF | 1 | 1.00 | 1046.12 | 442.90 | 0 | 784.00 | 988.0 | 1305.00 | 5095 | ▇▇▁▁▁ |
| X1stFlrSF | 0 | 1.00 | 1156.53 | 398.17 | 407 | 873.50 | 1079.0 | 1382.50 | 5095 | ▇▃▁▁▁ |
| X2ndFlrSF | 0 | 1.00 | 325.97 | 420.61 | 0 | 0.00 | 0.0 | 676.00 | 1862 | ▇▃▂▁▁ |
| LowQualFinSF | 0 | 1.00 | 3.54 | 44.04 | 0 | 0.00 | 0.0 | 0.00 | 1064 | ▇▁▁▁▁ |
| GrLivArea | 0 | 1.00 | 1486.05 | 485.57 | 407 | 1117.50 | 1432.0 | 1721.00 | 5095 | ▇▇▁▁▁ |
| BsmtFullBath | 2 | 1.00 | 0.43 | 0.53 | 0 | 0.00 | 0.0 | 1.00 | 3 | ▇▆▁▁▁ |
| BsmtHalfBath | 2 | 1.00 | 0.07 | 0.25 | 0 | 0.00 | 0.0 | 0.00 | 2 | ▇▁▁▁▁ |
| FullBath | 0 | 1.00 | 1.57 | 0.56 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▇▇▁▁ |
| HalfBath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
| BedroomAbvGr | 0 | 1.00 | 2.85 | 0.83 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▁▃▇▂▁ |
| KitchenAbvGr | 0 | 1.00 | 1.04 | 0.21 | 0 | 1.00 | 1.0 | 1.00 | 2 | ▁▁▇▁▁ |
| TotRmsAbvGrd | 0 | 1.00 | 6.39 | 1.51 | 3 | 5.00 | 6.0 | 7.00 | 15 | ▅▇▃▁▁ |
| Fireplaces | 0 | 1.00 | 0.58 | 0.65 | 0 | 0.00 | 0.0 | 1.00 | 4 | ▇▇▁▁▁ |
| GarageYrBlt | 78 | 0.95 | 1977.72 | 26.43 | 1895 | 1959.00 | 1979.0 | 2002.00 | 2207 | ▂▇▁▁▁ |
| GarageCars | 1 | 1.00 | 1.77 | 0.78 | 0 | 1.00 | 2.0 | 2.00 | 5 | ▅▇▂▁▁ |
| GarageArea | 1 | 1.00 | 472.77 | 217.05 | 0 | 318.00 | 480.0 | 576.00 | 1488 | ▃▇▃▁▁ |
| WoodDeckSF | 0 | 1.00 | 93.17 | 127.74 | 0 | 0.00 | 0.0 | 168.00 | 1424 | ▇▁▁▁▁ |
| OpenPorchSF | 0 | 1.00 | 48.31 | 68.88 | 0 | 0.00 | 28.0 | 72.00 | 742 | ▇▁▁▁▁ |
| EnclosedPorch | 0 | 1.00 | 24.24 | 67.23 | 0 | 0.00 | 0.0 | 0.00 | 1012 | ▇▁▁▁▁ |
| X3SsnPorch | 0 | 1.00 | 1.79 | 20.21 | 0 | 0.00 | 0.0 | 0.00 | 360 | ▇▁▁▁▁ |
| ScreenPorch | 0 | 1.00 | 17.06 | 56.61 | 0 | 0.00 | 0.0 | 0.00 | 576 | ▇▁▁▁▁ |
| PoolArea | 0 | 1.00 | 1.74 | 30.49 | 0 | 0.00 | 0.0 | 0.00 | 800 | ▇▁▁▁▁ |
| MiscVal | 0 | 1.00 | 58.17 | 630.81 | 0 | 0.00 | 0.0 | 0.00 | 17000 | ▇▁▁▁▁ |
| MoSold | 0 | 1.00 | 6.10 | 2.72 | 1 | 4.00 | 6.0 | 8.00 | 12 | ▅▆▇▃▃ |
| YrSold | 0 | 1.00 | 2007.77 | 1.30 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▃ |
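The skim tables are long, so it can help to rank columns by their missing count. The following is a minimal sketch, assuming a small helper named `missing_counts` (a hypothetical name, not part of skimr) and a toy data frame `demo` in place of the real data; on the real data it would be called as `missing_counts(train)` or `missing_counts(test)`.

```r
# Hypothetical helper: count NAs per column, keep only columns that have
# at least one NA, and sort from most to least missing
missing_counts <- function(df) {
  counts <- colSums(is.na(df))
  sort(counts[counts > 0], decreasing = TRUE)
}

# Toy data frame standing in for train/test
demo <- data.frame(
  LotFrontage = c(65, 80, NA, 60),
  PoolQC      = c(NA, NA, NA, "Gd"),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

result <- missing_counts(demo)
print(result)
```

Columns with no missing values are dropped from the result, which keeps the output short for a data frame with around 80 columns.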
test %>% plot_density()
train %>% plot_density()
Prior to creating our model, we need to ensure that NA values in the train and test sets are handled properly. In this section I fill every NA in a character column with the value “missing” and convert the column from character to factor. Doing this on the combined data makes sure both the train and test dataframes have the same factor levels, and also that NAs in these columns won’t cause problems when creating our model.
# Temporarily remove SalePrice so train and test have the same columns
SalePrice = train$SalePrice
train$SalePrice = NULL
# Merge datasets
full_data = rbind(train, test)
# Change character cols to factor and replace NAs with "missing"
for (col in colnames(full_data)) {
  if (typeof(full_data[, col]) == "character") {
    new_col = full_data[, col]
    new_col[is.na(new_col)] = "missing"
    full_data[col] = as.factor(new_col)
  }
}
# Split datasets again (train rows come first in full_data)
train = full_data[1:nrow(train), ]
train$SalePrice = SalePrice
test = full_data[(nrow(train) + 1):nrow(full_data), ]
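The reason for converting to factor on the combined data, rather than on each data frame separately, is that both halves then carry the same set of levels even when a level only occurs in one of them. The sketch below illustrates this on a single toy column; `train_demo`, `test_demo`, and the other names are hypothetical stand-ins, not the real data.

```r
# Toy stand-ins: "Pave" only occurs in the train rows
train_demo <- data.frame(Alley = c("Grvl", NA, "Pave"), stringsAsFactors = FALSE)
test_demo  <- data.frame(Alley = c(NA, "Grvl"), stringsAsFactors = FALSE)

# Same steps as above: combine, fill NA with "missing", convert to factor
full_demo <- rbind(train_demo, test_demo)
alley <- full_demo$Alley
alley[is.na(alley)] <- "missing"
full_demo$Alley <- as.factor(alley)

# Split again; row subsetting keeps the full set of factor levels
n_train   <- nrow(train_demo)
train_out <- full_demo[1:n_train, , drop = FALSE]
test_out  <- full_demo[(n_train + 1):nrow(full_demo), , drop = FALSE]

same_levels <- identical(levels(train_out$Alley), levels(test_out$Alley))
print(same_levels)
print(levels(test_out$Alley))
```

This is also why a summary of the train data can list a level such as `missing` with a count of 0: the level exists because it occurs in the other data frame.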
Looking at the summary output, we can see that some numeric columns (LotFrontage, MasVnrArea, and GarageYrBlt) still contain NA values.
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 missing: 0 Median : 69.00
## Mean : 730.5 Mean : 56.9 RH : 16 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RL :1151 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 RM : 218 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl : 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 missing:1369 IR2: 41 HLS: 50
## Median : 9478 Pave : 41 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub :1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## missing: 0 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## NoSeWa : 1 FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual OverallCond
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.000 Min. :1.000
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.000 1st Qu.:5.000
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.000 Median :5.000
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.099 Mean :5.575
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000 3rd Qu.:6.000
## PosA : 1 1.5Unf : 14 Max. :10.000 Max. :9.000
## (Other): 2 (Other): 19
## YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1872 Min. :1950 Flat : 13 CompShg:1434 VinylSd:515
## 1st Qu.:1954 1st Qu.:1967 Gable :1141 Tar&Grv: 11 HdBoard:222
## Median :1973 Median :1994 Gambrel: 11 WdShngl: 6 MetalSd:220
## Mean :1971 Mean :1985 Hip : 286 WdShake: 5 Wd Sdng:206
## 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7 ClyTile: 1 Plywood:108
## Max. :2010 Max. :2010 Shed : 2 Membran: 1 CemntBd: 61
## (Other): 2 (Other):128
## Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## VinylSd:504 BrkCmn : 15 Min. : 0.0 Ex: 52 Ex: 3 BrkTil:146
## MetalSd:214 BrkFace:445 1st Qu.: 0.0 Fa: 14 Fa: 28 CBlock:634
## HdBoard:207 missing: 8 Median : 0.0 Gd:488 Gd: 146 PConc :647
## Wd Sdng:197 None :864 Mean : 103.7 TA:906 Po: 1 Slab : 24
## Plywood:142 Stone :128 3rd Qu.: 166.0 TA:1282 Stone : 6
## CmentBd: 60 Max. :1600.0 Wood : 3
## (Other):136 NA's :8
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## Ex :121 Fa : 45 Av :221 ALQ :220 Min. : 0.0
## Fa : 35 Gd : 65 Gd :134 BLQ :148 1st Qu.: 0.0
## Gd :618 missing: 37 missing: 38 GLQ :418 Median : 383.5
## missing: 37 Po : 2 Mn :114 LwQ : 74 Mean : 443.6
## TA :649 TA :1311 No :953 missing: 37 3rd Qu.: 712.2
## Rec :133 Max. :5644.0
## Unf :430
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## ALQ : 19 Min. : 0.00 Min. : 0.0 Min. : 0.0
## BLQ : 33 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## GLQ : 14 Median : 0.00 Median : 477.5 Median : 991.5
## LwQ : 46 Mean : 46.55 Mean : 567.2 Mean :1057.4
## missing: 38 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Rec : 54 Max. :1474.00 Max. :2336.0 Max. :6110.0
## Unf :1256
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## Floor: 1 Ex:741 N: 95 FuseA : 94 Min. : 334 Min. : 0
## GasA :1428 Fa: 49 Y:1365 FuseF : 27 1st Qu.: 882 1st Qu.: 0
## GasW : 18 Gd:241 FuseP : 3 Median :1087 Median : 0
## Grav : 7 Po: 1 missing: 1 Mean :1163 Mean : 347
## OthW : 2 TA:428 Mix : 1 3rd Qu.:1391 3rd Qu.: 728
## Wall : 4 SBrkr :1334 Max. :4692 Max. :2065
##
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## Min. : 0.000 Min. : 334 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.000 Median :1464 Median :0.0000 Median :0.00000
## Mean : 5.845 Mean :1515 Mean :0.4253 Mean :0.05753
## 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :572.000 Max. :5642 Max. :3.0000 Max. :2.00000
##
## FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000 Ex :100
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa : 39
## Median :2.000 Median :0.0000 Median :3.000 Median :1.000 Gd :586
## Mean :1.565 Mean :0.3829 Mean :2.866 Mean :1.047 missing: 0
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 TA :735
## Max. :3.000 Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## Min. : 2.000 Typ :1360 Min. :0.000 Ex : 24 2Types : 6
## 1st Qu.: 5.000 Min2 : 34 1st Qu.:0.000 Fa : 33 Attchd :870
## Median : 6.000 Min1 : 31 Median :1.000 Gd :380 Basment: 19
## Mean : 6.518 Mod : 15 Mean :0.613 missing:690 BuiltIn: 88
## 3rd Qu.: 7.000 Maj1 : 14 3rd Qu.:1.000 Po : 20 CarPort: 9
## Max. :14.000 Maj2 : 5 Max. :3.000 TA :313 Detchd :387
## (Other): 1 missing: 81
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## Min. :1900 Fin :352 Min. :0.000 Min. : 0.0 Ex : 3
## 1st Qu.:1961 missing: 81 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48
## Median :1980 RFn :422 Median :2.000 Median : 480.0 Gd : 14
## Mean :1979 Unf :605 Mean :1.767 Mean : 473.0 missing: 81
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 Po : 3
## Max. :2010 Max. :4.000 Max. :1418.0 TA :1311
## NA's :81
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Ex : 2 N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Fa : 35 P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Gd : 9 Y:1340 Median : 0.00 Median : 25.00 Median : 0.00
## missing: 81 Mean : 94.24 Mean : 46.66 Mean : 21.95
## Po : 7 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## TA :1326 Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Ex : 2
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2
## Median : 0.00 Median : 0.00 Median : 0.000 Gd : 3
## Mean : 3.41 Mean : 15.06 Mean : 2.759 missing:1453
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## GdPrv : 59 Gar2 : 2 Min. : 0.00 Min. : 1.000
## GdWo : 54 missing:1406 1st Qu.: 0.00 1st Qu.: 5.000
## missing:1179 Othr : 2 Median : 0.00 Median : 6.000
## MnPrv : 157 Shed : 49 Mean : 43.49 Mean : 6.322
## MnWw : 11 TenC : 1 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
## 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
## Median :2008 COD : 43 Alloca : 12 Median :163000
## Mean :2008 ConLD : 9 Family : 20 Mean :180921
## 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
## Max. :2010 ConLw : 5 Partial: 125 Max. :755000
## (Other): 9
We will replace the remaining NA values with -1. The -1 flags where a value was missing without blocking the normalization of the data or the creation of our model.
# Fill remaining NA values with -1
train[is.na(train)] = -1
test[is.na(test)] = -1
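As a quick check on the fill step, the sketch below applies the same assignment to a small toy data frame. The `toy` frame is hypothetical; its values are the first ten LotFrontage entries shown in the glimpse output. It confirms that no NA values remain and counts the flagged entries. It also shows that the -1 sentinel is included in numeric summaries such as the mean, which is worth keeping in mind when interpreting later statistics.

```r
# Toy stand-in for one numeric column of the training data (hypothetical frame)
toy <- data.frame(LotFrontage = c(65, 80, 68, 60, 84, 85, 75, NA, 51, 50))

# Mean of the observed values only, before filling
mean_before <- mean(toy$LotFrontage, na.rm = TRUE)

# Same fill used above: every NA becomes the sentinel -1
toy[is.na(toy)] <- -1

# No NA values should remain, and the sentinel marks the formerly missing entry
any_na_left <- anyNA(toy)
n_flagged <- sum(toy$LotFrontage == -1)

# The sentinel now takes part in summaries, so the mean shifts downward
mean_after <- mean(toy$LotFrontage)

print(c(any_na_left = any_na_left, n_flagged = n_flagged))
print(c(mean_before = mean_before, mean_after = mean_after))
```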
For my study, I selected the variables LotArea,
GarageArea and SalePrice. Living in a populous
environment like New York, I know firsthand how much of a luxury open
space can be. Based on this intuition, I suspect that the amount of space,
such as the lot area and garage area, has some effect on how much
a house sells for.
# Select the variables of interest
variables <- c("LotArea", "GarageArea", "SalePrice")
# Scatterplot matrix
plot(train[, variables], pch = 19)
Looking at the scatterplot and correlation matrix, we can see
that LotArea has a weak positive correlation with
SalePrice (about 0.26). GarageArea and
SalePrice have a moderate positive correlation
with each other (about 0.62).
# Correlation matrix
cor_matrix <- cor(train[, variables])
cor_matrix
## LotArea GarageArea SalePrice
## LotArea 1.0000000 0.1804028 0.2638434
## GarageArea 0.1804028 1.0000000 0.6234314
## SalePrice 0.2638434 0.6234314 1.0000000
corrplot(cor_matrix, method="number")
The correlation coefficient (cor) between LotArea and GarageArea is
estimated to be 0.1804028, which tells us there is a weak correlation.
However, the p-value of 3.803e-12 tells us the relationship is
statistically significant. The 80 percent confidence interval for the
correlation coefficient is 0.1477356 to 0.2126767.
(cor1 <-cor.test(formula = ~LotArea + GarageArea,data=train, conf.level = .80))
##
## Pearson's product-moment correlation
##
## data: LotArea and GarageArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1477356 0.2126767
## sample estimates:
## cor
## 0.1804028
The correlation coefficient (cor) between LotArea and SalePrice is estimated to be 0.2638434, which tells us there is a weak correlation. However, the p-value of less than 2.2e-16 tells us the relationship is statistically significant. The 80 percent confidence interval for the correlation coefficient is 0.2323391 to 0.2947946.
(cor2 <-cor.test(formula = ~LotArea + SalePrice,data=train, conf.level = .80))
##
## Pearson's product-moment correlation
##
## data: LotArea and SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
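The prompt also asks about familywise error. With three pairwise tests each run at the 0.20 level, the chance of at least one false positive would be 1 - 0.8^3, or about 0.49, if the tests were independent. The sketch below is one possible way to run every pairwise test in one pass and add Bonferroni-adjusted p-values with `p.adjust`. The helper name `pairwise_cor_tests` and the simulated `toy` data are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: run cor.test on every pair of columns and collect the results
pairwise_cor_tests <- function(df, conf_level = 0.80) {
  pairs <- combn(names(df), 2, simplify = FALSE)
  rows <- lapply(pairs, function(p) {
    ct <- cor.test(df[[p[1]]], df[[p[2]]], conf.level = conf_level)
    data.frame(
      var1 = p[1], var2 = p[2],
      cor = unname(ct$estimate),
      lower = ct$conf.int[1], upper = ct$conf.int[2],
      p_value = ct$p.value
    )
  })
  out <- do.call(rbind, rows)
  # Bonferroni correction controls the familywise error rate across all pairs
  out$p_bonferroni <- p.adjust(out$p_value, method = "bonferroni")
  out
}

# Simulated stand-in data (assumption): b depends on a, c is independent noise
set.seed(1)
toy <- data.frame(a = rnorm(200))
toy$b <- toy$a + rnorm(200)
toy$c <- rnorm(200)

res <- pairwise_cor_tests(toy)
print(res)
```

On the real data the call would be `pairwise_cor_tests(train[, variables])`, which also covers the GarageArea and SalePrice pair. Given how small the reported p-values are, a Bonferroni adjustment across three tests would be unlikely to change the conclusions.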
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Christian’s Response:
We can use the solve function to invert the correlation
matrix.
inverse_matrix <- solve(cor_matrix)
We can perform matrix multiplication using the %*% operator, which I'll use to multiply the two matrices together in both orders.
# multiply correlation matrix by precision matrix
precision_matrix <- inverse_matrix
result1 <- cor_matrix %*% precision_matrix
# multiply precision matrix by correlation matrix
result2 <- precision_matrix %*% cor_matrix
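Both products should equal the 3x3 identity matrix up to floating-point error, and the diagonal of the precision matrix holds the variance inflation factors. The sketch below checks this using the correlation values copied from the printed output above. The names `left`, `right` and `vifs` are introduced here for illustration only.

```r
# Correlation matrix rebuilt from the values printed above
vars <- c("LotArea", "GarageArea", "SalePrice")
cor_matrix <- matrix(
  c(1.0000000, 0.1804028, 0.2638434,
    0.1804028, 1.0000000, 0.6234314,
    0.2638434, 0.6234314, 1.0000000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# Precision matrix is the inverse of the correlation matrix
precision_matrix <- solve(cor_matrix)

# Multiply in both orders; each product should be the identity matrix
left  <- cor_matrix %*% precision_matrix
right <- precision_matrix %*% cor_matrix
is_identity_left  <- isTRUE(all.equal(unname(left),  diag(3)))
is_identity_right <- isTRUE(all.equal(unname(right), diag(3)))

# Variance inflation factors sit on the diagonal of the precision matrix
vifs <- diag(precision_matrix)

print(c(left = is_identity_left, right = is_identity_right))
print(vifs)
```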
The lu function from the Matrix package performs LU decomposition. We can
then inspect the lower triangular matrix (L), the upper triangular matrix
(U), and the permutation matrix (P).
decomposition <- lu(result1)
decomposition2 <- lu(result2)
print(decomposition)
## 'MatrixFactorization' of Formal class 'denseLU' [package "Matrix"] with 4 slots
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:3] "LotArea" "GarageArea" "SalePrice"
## .. ..$ : chr [1:3] "LotArea" "GarageArea" "SalePrice"
## ..@ x : num [1:9] 1 0 0 0 1 0 0 0 1
## ..@ perm : int [1:3] 1 2 3
## ..@ Dim : int [1:2] 3 3
print(decomposition2)
## 'MatrixFactorization' of Formal class 'denseLU' [package "Matrix"] with 4 slots
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:3] "LotArea" "GarageArea" "SalePrice"
## .. ..$ : chr [1:3] "LotArea" "GarageArea" "SalePrice"
## ..@ x : num [1:9] 1 0 0 0 1 ...
## ..@ perm : int [1:3] 1 2 3
## ..@ Dim : int [1:2] 3 3
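The printed objects only show the raw slots. Because both products are the identity matrix, their factors are trivial. As a minimal sketch, assuming the Matrix package's `expand` function is available for `denseLU` objects (newer versions of Matrix also offer `expand2`), the factors of the correlation matrix itself can be extracted and multiplied back together. The correlation values are copied from the output above, and the names `parts` and `rebuilt` are illustrative.

```r
library(Matrix)

# Correlation matrix rebuilt from the values printed above
vars <- c("LotArea", "GarageArea", "SalePrice")
cor_matrix <- matrix(
  c(1.0000000, 0.1804028, 0.2638434,
    0.1804028, 1.0000000, 0.6234314,
    0.2638434, 0.6234314, 1.0000000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# LU decomposition of the correlation matrix, then extract L, U and P
parts <- expand(lu(cor_matrix))
L <- as.matrix(parts$L)  # unit lower triangular factor
U <- as.matrix(parts$U)  # upper triangular factor
P <- as.matrix(parts$P)  # permutation matrix

# Multiplying the factors back together should recover the original matrix
rebuilt <- P %*% L %*% U
lu_ok <- isTRUE(all.equal(unname(rebuilt), unname(cor_matrix)))

print(L)
print(U)
print(lu_ok)
```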
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Christian’s Response:
Looking at the results from the plot_density function,
I found two right-skewed columns: BsmtUnfSF and X1stFlrSF. I fit the exponential distribution to BsmtUnfSF.
hist(train$BsmtUnfSF, breaks="FD")
hist(train$X1stFlrSF,breaks="FD")
library(MASS)
fit <- fitdistr(train$BsmtUnfSF, "exponential")
lambda <- fit$estimate
# Generate 1000 samples from the exponential distribution
samples <- rexp(1000, rate = lambda)
Comparing the two histograms, we can see that the histogram of generated samples has a smoother distribution.
# Plot histogram of original variable
hist(train$BsmtUnfSF, breaks = "FD", col = "lightblue", main = "Original Variable", xlab = "Value")
# Plot histogram of generated samples
hist(samples, breaks = "FD", col = "lightgreen", main = "Exponential Distribution Samples", xlab = "Value")
# Find 5th and 95th percentiles using the exponential CDF
percentile_5 <- qexp(0.05, rate = lambda)
percentile_95 <- qexp(0.95, rate = lambda)
# Generate 95% confidence interval assuming normality
confidence_interval <- t.test(train$BsmtUnfSF)$conf.int
# Find empirical 5th and 95th percentiles
empirical_percentile_5 <- quantile(train$BsmtUnfSF, 0.05)
empirical_percentile_95 <- quantile(train$BsmtUnfSF, 0.95)
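For the exponential distribution, the maximum likelihood estimate of the rate is the reciprocal of the sample mean, and the quantile function has the closed form -log(1 - p) / lambda. The sketch below illustrates both facts on simulated data. The vector `x` is a hypothetical stand-in for a right-skewed column, not the Kaggle data, so the numbers it prints are not results from this analysis.

```r
library(MASS)

# Simulated right-skewed stand-in for a column such as BsmtUnfSF (assumption)
set.seed(42)
x <- rexp(500, rate = 1 / 500)

# Fit the exponential distribution and pull out the estimated rate
fit <- fitdistr(x, "exponential")
lambda <- unname(fit$estimate)

# The fitted rate should match 1 / mean(x)
rate_matches_mean <- isTRUE(all.equal(lambda, 1 / mean(x)))

# 5th and 95th percentiles: qexp versus the closed-form inverse CDF
probs <- c(0.05, 0.95)
from_qexp <- qexp(probs, rate = lambda)
from_formula <- -log(1 - probs) / lambda

# Empirical percentiles for comparison
from_data <- unname(quantile(x, probs))

print(rate_matches_mean)
print(cbind(probs, from_qexp, from_formula, from_data))
```

Printing the fitted and empirical percentiles side by side in this way makes it easier to discuss how well the exponential model describes the tails of the variable.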
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
Christian’s Report:
# select only numeric columns
numeric_cols <- sapply(train, is.numeric)
train_numeric <- train[, numeric_cols]
# Compute the correlation matrix
correlation_matrix <- cor(train_numeric)
# Extract correlation values between SalePrice and the other variables
saleprice_correlation <- correlation_matrix[,"SalePrice"]
# Turns matrix to dataframe
correlation_df <- data.frame(Variable = names(saleprice_correlation), Correlation = saleprice_correlation)
correlation_df
## Variable Correlation
## Id Id -0.02191672
## MSSubClass MSSubClass -0.08428414
## LotFrontage LotFrontage 0.20780482
## LotArea LotArea 0.26384335
## OverallQual OverallQual 0.79098160
## OverallCond OverallCond -0.07785589
## YearBuilt YearBuilt 0.52289733
## YearRemodAdd YearRemodAdd 0.50710097
## MasVnrArea MasVnrArea 0.47258506
## BsmtFinSF1 BsmtFinSF1 0.38641981
## BsmtFinSF2 BsmtFinSF2 -0.01137812
## BsmtUnfSF BsmtUnfSF 0.21447911
## TotalBsmtSF TotalBsmtSF 0.61358055
## X1stFlrSF X1stFlrSF 0.60585218
## X2ndFlrSF X2ndFlrSF 0.31933380
## LowQualFinSF LowQualFinSF -0.02560613
## GrLivArea GrLivArea 0.70862448
## BsmtFullBath BsmtFullBath 0.22712223
## BsmtHalfBath BsmtHalfBath -0.01684415
## FullBath FullBath 0.56066376
## HalfBath HalfBath 0.28410768
## BedroomAbvGr BedroomAbvGr 0.16821315
## KitchenAbvGr KitchenAbvGr -0.13590737
## TotRmsAbvGrd TotRmsAbvGrd 0.53372316
## Fireplaces Fireplaces 0.46692884
## GarageYrBlt GarageYrBlt 0.26135424
## GarageCars GarageCars 0.64040920
## GarageArea GarageArea 0.62343144
## WoodDeckSF WoodDeckSF 0.32441344
## OpenPorchSF OpenPorchSF 0.31585623
## EnclosedPorch EnclosedPorch -0.12857796
## X3SsnPorch X3SsnPorch 0.04458367
## ScreenPorch ScreenPorch 0.11144657
## PoolArea PoolArea 0.09240355
## MiscVal MiscVal -0.02118958
## MoSold MoSold 0.04643225
## YrSold YrSold -0.02892259
## SalePrice SalePrice 1.00000000
# arrange dataframe based on absolute value of correlation values
correlation_df <- correlation_df[order(-abs(correlation_df$Correlation)),]
print(correlation_df)
## Variable Correlation
## SalePrice SalePrice 1.00000000
## OverallQual OverallQual 0.79098160
## GrLivArea GrLivArea 0.70862448
## GarageCars GarageCars 0.64040920
## GarageArea GarageArea 0.62343144
## TotalBsmtSF TotalBsmtSF 0.61358055
## X1stFlrSF X1stFlrSF 0.60585218
## FullBath FullBath 0.56066376
## TotRmsAbvGrd TotRmsAbvGrd 0.53372316
## YearBuilt YearBuilt 0.52289733
## YearRemodAdd YearRemodAdd 0.50710097
## MasVnrArea MasVnrArea 0.47258506
## Fireplaces Fireplaces 0.46692884
## BsmtFinSF1 BsmtFinSF1 0.38641981
## WoodDeckSF WoodDeckSF 0.32441344
## X2ndFlrSF X2ndFlrSF 0.31933380
## OpenPorchSF OpenPorchSF 0.31585623
## HalfBath HalfBath 0.28410768
## LotArea LotArea 0.26384335
## GarageYrBlt GarageYrBlt 0.26135424
## BsmtFullBath BsmtFullBath 0.22712223
## BsmtUnfSF BsmtUnfSF 0.21447911
## LotFrontage LotFrontage 0.20780482
## BedroomAbvGr BedroomAbvGr 0.16821315
## KitchenAbvGr KitchenAbvGr -0.13590737
## EnclosedPorch EnclosedPorch -0.12857796
## ScreenPorch ScreenPorch 0.11144657
## PoolArea PoolArea 0.09240355
## MSSubClass MSSubClass -0.08428414
## OverallCond OverallCond -0.07785589
## MoSold MoSold 0.04643225
## X3SsnPorch X3SsnPorch 0.04458367
## YrSold YrSold -0.02892259
## LowQualFinSF LowQualFinSF -0.02560613
## Id Id -0.02191672
## MiscVal MiscVal -0.02118958
## BsmtHalfBath BsmtHalfBath -0.01684415
## BsmtFinSF2 BsmtFinSF2 -0.01137812
# Select predictor variables for the regression model
independent_vars <- c("GrLivArea", "OverallQual")
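# Create a new data frame with the predictor variables and the response variable
# Note: the -1 row index drops the first observation (not the Id column),
# so the model below is fit on 1,459 rows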
regression_data <- train[-1, c(independent_vars, "SalePrice")]
# Remove nulls
regression_data <- na.omit(regression_data)
# Fit the multiple regression model
model <- lm(SalePrice ~ ., data = regression_data)
# Print model summary
summary(model)
##
## Call:
## lm(formula = SalePrice ~ ., data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -379596 -22335 -381 19895 289477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.041e+05 5.047e+03 -20.63 <2e-16 ***
## GrLivArea 5.586e+01 2.631e+00 21.24 <2e-16 ***
## OverallQual 3.285e+04 9.996e+02 32.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42510 on 1456 degrees of freedom
## Multiple R-squared: 0.7142, Adjusted R-squared: 0.7138
## F-statistic: 1819 on 2 and 1456 DF, p-value: < 2.2e-16
We can observe a high R-squared of 0.7142 and a p-value below 2.2e-16 for the overall F-test. This tells us the model is statistically significant, and that GrLivArea and OverallQual together explain about 71% of the variance in SalePrice.
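To make the coefficients concrete, the sketch below applies the fitted equation by hand using the rounded estimates from the summary above. The helper `predict_price` and the example house (1,500 square feet of living area, overall quality 6) are hypothetical and only illustrate how the model turns inputs into a price.

```r
# Rounded coefficient estimates copied from the model summary above
coefs <- c(intercept = -1.041e5, GrLivArea = 5.586e1, OverallQual = 3.285e4)

# Hypothetical helper: apply the fitted linear equation by hand
predict_price <- function(gr_liv_area, overall_qual) {
  coefs[["intercept"]] +
    coefs[["GrLivArea"]] * gr_liv_area +
    coefs[["OverallQual"]] * overall_qual
}

# Example house (assumed values): 1,500 sq ft living area, overall quality 6
# -104100 + 55.86 * 1500 + 32850 * 6 = 176790
example_price <- predict_price(1500, 6)
print(example_price)
```

Each additional square foot of living area adds about 56 dollars to the predicted price, and each additional quality point adds about 32,850 dollars, holding the other variable fixed.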
# Predict with the test dataset
predicted_prices <- predict(model, test[, independent_vars])
# Create dataframe with predicted prices and IDs
predicted_df <- data.frame(Id = test$Id, SalePrice = predicted_prices)
write.csv(predicted_df, file = "predictions_submission.csv", row.names = FALSE)
knitr::include_graphics("C:\\Users\\urios\\OneDrive\\Pictures\\Screenshots\\Entry.png", error = FALSE)