Instructions

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following:

Introduction

The House Prices dataset from Kaggle contains 79 variables that describe residential homes in Ames, Iowa. Columns include the lot area, the overall quality rating, the year built, etc. The purpose of this contest is to predict the final sale price of each home using the variables provided.

I will explore the dataset, create a regression model using variables of my choice to predict the final price of homes, place the predictions into a dataset, and submit them to the Kaggle competition.


Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
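
As a rough sketch of the correlation tests and the familywise error question above, the snippet below runs `cor.test()` with `conf.level = 0.80` on every pair of three columns. The data frame `toy` and its columns `a`, `b`, `c` are simulated stand-ins, not the Kaggle variables, and the familywise figure assumes the three tests are independent and each run at alpha = 0.20 (the level matching an 80% interval).

```r
# Toy data standing in for three quantitative columns
# (hypothetical values, not the Kaggle data)
set.seed(1)
n <- 200
x <- rnorm(n)
toy <- data.frame(a = x, b = x + rnorm(n), c = rnorm(n))

# Test H0: correlation = 0 for each pair, with an 80% confidence interval
pairs <- combn(names(toy), 2, simplify = FALSE)
results <- lapply(pairs, function(p) {
  ct <- cor.test(toy[[p[1]]], toy[[p[2]]], conf.level = 0.80)
  data.frame(var1 = p[1], var2 = p[2], r = unname(ct$estimate),
             lower = ct$conf.int[1], upper = ct$conf.int[2],
             p_value = ct$p.value)
})
results <- do.call(rbind, results)
print(results)

# Familywise error rate for m tests at level alpha, assuming independence:
# the chance of at least one false rejection across all tests
alpha <- 0.20
m <- length(pairs)
fwer <- 1 - (1 - alpha)^m
print(fwer)
```

With three tests at alpha = 0.20 the familywise rate under independence is 1 - 0.8^3 = 0.488, which is why the question of familywise error is worth raising at this confidence level.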

Load data

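# Packages assumed for the calls below: dplyr (glimpse, %>%), skimr (skim), DataExplorer (plot_density)
library(dplyr); library(skimr); library(DataExplorer)
test <- read.csv("test.csv", stringsAsFactors = FALSE)
train <- read.csv("train.csv", stringsAsFactors = FALSE)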

Exploratory Data Analysis

Looking at the data, we can see that several columns contain missing (NA) values, for example LotFrontage, Alley, FireplaceQu, PoolQC, and Fence. The columns are also a mix of integer and character types. These are issues we will have to address when cleaning the data.

glimpse(train)
## Rows: 1,460
## Columns: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,…
## $ MSZoning      <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "R…
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ Alley         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ LotShape      <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1", …
## $ LandContour   <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", …
## $ Utilities     <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu…
## $ LotConfig     <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Inside", "I…
## $ LandSlope     <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", …
## $ Neighborhood  <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "…
## $ Condition1    <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm",…
## $ Condition2    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
## $ BldgType      <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", …
## $ HouseStyle    <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fi…
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,…
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,…
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19…
## $ RoofStyle     <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "G…
## $ RoofMatl      <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "…
## $ Exterior1st   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "…
## $ Exterior2nd   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "VinylSd", "…
## $ MasVnrType    <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace", "None",…
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, …
## $ ExterQual     <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", "TA", "T…
## $ ExterCond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ Foundation    <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "Wood", "…
## $ BsmtQual      <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", "TA", "T…
## $ BsmtCond      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "T…
## $ BsmtExposure  <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", "No", "N…
## $ BsmtFinType1  <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ", …
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99…
## $ BsmtFinType2  <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "BLQ", …
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17…
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
## $ Heating       <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
## $ HeatingQC     <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", "Gd", "E…
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ Electrical    <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S…
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, …
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,…
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10…
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,…
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,…
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ KitchenQual   <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", "TA", "T…
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
## $ Functional    <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", …
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,…
## $ FireplaceQu   <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA", "TA", …
## $ GarageType    <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attch…
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19…
## $ GarageFinish  <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "RFn", …
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
## $ GarageQual    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "Fa", "G…
## $ GarageCond    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ PavedDrive    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160…
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,…
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, …
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, …
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PoolQC        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Fence         <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, NA,…
## $ MiscFeature   <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, NA, NA, …
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,…
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10…
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
## $ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W…
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm…
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …

The test data has 1,459 rows, one fewer than the 1,460 rows of the train data. It also has 80 columns instead of 81, because the column SalePrice is missing from the test data.

glimpse(test)
## Rows: 1,459
## Columns: 80
## $ Id            <int> 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 14…
## $ MSSubClass    <int> 20, 20, 60, 60, 120, 60, 20, 60, 20, 20, 120, 160, 160, …
## $ MSZoning      <chr> "RH", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "R…
## $ LotFrontage   <int> 80, 81, 74, 78, 43, 75, NA, 63, 85, 70, 26, 21, 21, 24, …
## $ LotArea       <int> 11622, 14267, 13830, 9978, 5005, 10000, 7980, 8402, 1017…
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ Alley         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ LotShape      <chr> "Reg", "IR1", "IR1", "IR1", "IR1", "IR1", "IR1", "IR1", …
## $ LandContour   <chr> "Lvl", "Lvl", "Lvl", "Lvl", "HLS", "Lvl", "Lvl", "Lvl", …
## $ Utilities     <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu…
## $ LotConfig     <chr> "Inside", "Corner", "Inside", "Inside", "Inside", "Corne…
## $ LandSlope     <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", …
## $ Neighborhood  <chr> "NAmes", "NAmes", "Gilbert", "Gilbert", "StoneBr", "Gilb…
## $ Condition1    <chr> "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm",…
## $ Condition2    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
## $ BldgType      <chr> "1Fam", "1Fam", "1Fam", "1Fam", "TwnhsE", "1Fam", "1Fam"…
## $ HouseStyle    <chr> "1Story", "1Story", "2Story", "2Story", "1Story", "2Stor…
## $ OverallQual   <int> 5, 6, 5, 6, 8, 6, 6, 6, 7, 4, 7, 6, 5, 6, 7, 9, 8, 9, 8,…
## $ OverallCond   <int> 6, 6, 5, 6, 5, 5, 7, 5, 5, 5, 5, 5, 5, 6, 6, 5, 5, 5, 5,…
## $ YearBuilt     <int> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, 1990, 19…
## $ YearRemodAdd  <int> 1961, 1958, 1998, 1998, 1992, 1994, 2007, 1998, 1990, 19…
## $ RoofStyle     <chr> "Gable", "Hip", "Gable", "Gable", "Gable", "Gable", "Gab…
## $ RoofMatl      <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "…
## $ Exterior1st   <chr> "VinylSd", "Wd Sdng", "VinylSd", "VinylSd", "HdBoard", "…
## $ Exterior2nd   <chr> "VinylSd", "Wd Sdng", "VinylSd", "VinylSd", "HdBoard", "…
## $ MasVnrType    <chr> "None", "BrkFace", "None", "BrkFace", "None", "None", "N…
## $ MasVnrArea    <int> 0, 108, 0, 20, 0, 0, 0, 0, 0, 0, 0, 504, 492, 0, 0, 162,…
## $ ExterQual     <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "T…
## $ ExterCond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "Gd", "TA", "TA", "T…
## $ Foundation    <chr> "CBlock", "CBlock", "PConc", "PConc", "PConc", "PConc", …
## $ BsmtQual      <chr> "TA", "TA", "Gd", "TA", "Gd", "Gd", "Gd", "Gd", "Gd", "T…
## $ BsmtCond      <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ BsmtExposure  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Gd", "N…
## $ BsmtFinType1  <chr> "Rec", "ALQ", "GLQ", "GLQ", "ALQ", "Unf", "ALQ", "Unf", …
## $ BsmtFinSF1    <int> 468, 923, 791, 602, 263, 0, 935, 0, 637, 804, 1051, 156,…
## $ BsmtFinType2  <chr> "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", …
## $ BsmtFinSF2    <int> 144, 0, 0, 0, 0, 0, 0, 0, 0, 78, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BsmtUnfSF     <int> 270, 406, 137, 324, 1017, 763, 233, 789, 663, 0, 354, 32…
## $ TotalBsmtSF   <int> 882, 1329, 928, 926, 1280, 763, 1168, 789, 1300, 882, 14…
## $ Heating       <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
## $ HeatingQC     <chr> "TA", "TA", "Gd", "Ex", "Ex", "Gd", "Ex", "Gd", "Gd", "T…
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ Electrical    <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S…
## $ X1stFlrSF     <int> 896, 1329, 928, 926, 1280, 763, 1187, 789, 1341, 882, 13…
## $ X2ndFlrSF     <int> 0, 0, 701, 678, 0, 892, 0, 676, 0, 0, 0, 504, 567, 601, …
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GrLivArea     <int> 896, 1329, 1629, 1604, 1280, 1655, 1187, 1465, 1341, 882…
## $ BsmtFullBath  <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BsmtHalfBath  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FullBath      <int> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ HalfBath      <int> 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,…
## $ BedroomAbvGr  <int> 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3,…
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ KitchenQual   <chr> "TA", "Gd", "TA", "Gd", "Gd", "TA", "TA", "TA", "Gd", "T…
## $ TotRmsAbvGrd  <int> 5, 6, 6, 7, 5, 7, 6, 7, 5, 4, 5, 5, 6, 6, 4, 10, 7, 7, 8…
## $ Functional    <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", …
## $ Fireplaces    <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,…
## $ FireplaceQu   <chr> NA, NA, "TA", "Gd", NA, "TA", NA, "Gd", "Po", NA, "Fa", …
## $ GarageType    <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attchd", "Attch…
## $ GarageYrBlt   <int> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, 1990, 19…
## $ GarageFinish  <chr> "Unf", "Unf", "Fin", "Fin", "RFn", "Fin", "Fin", "Fin", …
## $ GarageCars    <int> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 3, 3, 3, 3,…
## $ GarageArea    <int> 730, 312, 482, 470, 506, 440, 420, 393, 506, 525, 511, 2…
## $ GarageQual    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ GarageCond    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ PavedDrive    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ WoodDeckSF    <int> 140, 393, 212, 360, 0, 157, 483, 0, 192, 240, 203, 275, …
## $ OpenPorchSF   <int> 0, 36, 34, 36, 82, 84, 21, 75, 0, 0, 68, 0, 0, 0, 30, 13…
## $ EnclosedPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ScreenPorch   <int> 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PoolQC        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Fence         <chr> "MnPrv", NA, "MnPrv", NA, NA, NA, "GdPrv", NA, NA, "MnPr…
## $ MiscFeature   <chr> NA, "Gar2", NA, NA, NA, NA, "Shed", NA, NA, NA, NA, NA, …
## $ MiscVal       <int> 0, 12500, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MoSold        <int> 6, 6, 3, 6, 1, 4, 3, 5, 2, 4, 6, 2, 3, 6, 6, 1, 6, 6, 2,…
## $ YrSold        <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 20…
## $ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W…
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Normal", "Normal", "Norma…
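
The column difference between the two data frames can be checked directly rather than read off the `glimpse()` output. The following is a minimal sketch on small stand-in data frames (`train_toy` and `test_toy` are hypothetical names holding only a few values copied from the output above); the same `setdiff()` calls would apply to `train` and `test`.

```r
# Small stand-ins for the train and test data frames (illustrative subset)
train_toy <- data.frame(Id = 1:3, LotArea = c(8450, 9600, 11250),
                        SalePrice = c(208500, 181500, 223500))
test_toy <- data.frame(Id = c(1461, 1462), LotArea = c(11622, 14267))

# Columns present in one data frame but not the other
only_in_train <- setdiff(names(train_toy), names(test_toy))
only_in_test <- setdiff(names(test_toy), names(train_toy))
print(only_in_train)
print(only_in_test)
```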

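Using the skim function, we get an even clearer look at our missing values. Among the numeric columns, LotFrontage has the most missing values in both the train (259) and test (227) datasets. Among the character columns, PoolQC, MiscFeature, Alley, and Fence are missing in most rows of both datasets.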

skim(train)
Data summary
Name train
Number of rows 1460
Number of columns 81
_______________________
Column type frequency:
character 43
numeric 38
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MSZoning 0 1.00 2 7 0 5 0
Street 0 1.00 4 4 0 2 0
Alley 1369 0.06 4 4 0 2 0
LotShape 0 1.00 3 3 0 4 0
LandContour 0 1.00 3 3 0 4 0
Utilities 0 1.00 6 6 0 2 0
LotConfig 0 1.00 3 7 0 5 0
LandSlope 0 1.00 3 3 0 3 0
Neighborhood 0 1.00 5 7 0 25 0
Condition1 0 1.00 4 6 0 9 0
Condition2 0 1.00 4 6 0 8 0
BldgType 0 1.00 4 6 0 5 0
HouseStyle 0 1.00 4 6 0 8 0
RoofStyle 0 1.00 3 7 0 6 0
RoofMatl 0 1.00 4 7 0 8 0
Exterior1st 0 1.00 5 7 0 15 0
Exterior2nd 0 1.00 5 7 0 16 0
MasVnrType 8 0.99 4 7 0 4 0
ExterQual 0 1.00 2 2 0 4 0
ExterCond 0 1.00 2 2 0 5 0
Foundation 0 1.00 4 6 0 6 0
BsmtQual 37 0.97 2 2 0 4 0
BsmtCond 37 0.97 2 2 0 4 0
BsmtExposure 38 0.97 2 2 0 4 0
BsmtFinType1 37 0.97 3 3 0 6 0
BsmtFinType2 38 0.97 3 3 0 6 0
Heating 0 1.00 4 5 0 6 0
HeatingQC 0 1.00 2 2 0 5 0
CentralAir 0 1.00 1 1 0 2 0
Electrical 1 1.00 3 5 0 5 0
KitchenQual 0 1.00 2 2 0 4 0
Functional 0 1.00 3 4 0 7 0
FireplaceQu 690 0.53 2 2 0 5 0
GarageType 81 0.94 6 7 0 6 0
GarageFinish 81 0.94 3 3 0 3 0
GarageQual 81 0.94 2 2 0 5 0
GarageCond 81 0.94 2 2 0 5 0
PavedDrive 0 1.00 1 1 0 3 0
PoolQC 1453 0.00 2 2 0 3 0
Fence 1179 0.19 4 5 0 4 0
MiscFeature 1406 0.04 4 4 0 4 0
SaleType 0 1.00 2 5 0 9 0
SaleCondition 0 1.00 6 7 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1.00 730.50 421.61 1 365.75 730.5 1095.25 1460 ▇▇▇▇▇
MSSubClass 0 1.00 56.90 42.30 20 20.00 50.0 70.00 190 ▇▅▂▁▁
LotFrontage 259 0.82 70.05 24.28 21 59.00 69.0 80.00 313 ▇▃▁▁▁
LotArea 0 1.00 10516.83 9981.26 1300 7553.50 9478.5 11601.50 215245 ▇▁▁▁▁
OverallQual 0 1.00 6.10 1.38 1 5.00 6.0 7.00 10 ▁▂▇▅▁
OverallCond 0 1.00 5.58 1.11 1 5.00 5.0 6.00 9 ▁▁▇▅▁
YearBuilt 0 1.00 1971.27 30.20 1872 1954.00 1973.0 2000.00 2010 ▁▂▃▆▇
YearRemodAdd 0 1.00 1984.87 20.65 1950 1967.00 1994.0 2004.00 2010 ▅▂▂▃▇
MasVnrArea 8 0.99 103.69 181.07 0 0.00 0.0 166.00 1600 ▇▁▁▁▁
BsmtFinSF1 0 1.00 443.64 456.10 0 0.00 383.5 712.25 5644 ▇▁▁▁▁
BsmtFinSF2 0 1.00 46.55 161.32 0 0.00 0.0 0.00 1474 ▇▁▁▁▁
BsmtUnfSF 0 1.00 567.24 441.87 0 223.00 477.5 808.00 2336 ▇▅▂▁▁
TotalBsmtSF 0 1.00 1057.43 438.71 0 795.75 991.5 1298.25 6110 ▇▃▁▁▁
X1stFlrSF 0 1.00 1162.63 386.59 334 882.00 1087.0 1391.25 4692 ▇▅▁▁▁
X2ndFlrSF 0 1.00 346.99 436.53 0 0.00 0.0 728.00 2065 ▇▃▂▁▁
LowQualFinSF 0 1.00 5.84 48.62 0 0.00 0.0 0.00 572 ▇▁▁▁▁
GrLivArea 0 1.00 1515.46 525.48 334 1129.50 1464.0 1776.75 5642 ▇▇▁▁▁
BsmtFullBath 0 1.00 0.43 0.52 0 0.00 0.0 1.00 3 ▇▆▁▁▁
BsmtHalfBath 0 1.00 0.06 0.24 0 0.00 0.0 0.00 2 ▇▁▁▁▁
FullBath 0 1.00 1.57 0.55 0 1.00 2.0 2.00 3 ▁▇▁▇▁
HalfBath 0 1.00 0.38 0.50 0 0.00 0.0 1.00 2 ▇▁▅▁▁
BedroomAbvGr 0 1.00 2.87 0.82 0 2.00 3.0 3.00 8 ▁▇▂▁▁
KitchenAbvGr 0 1.00 1.05 0.22 0 1.00 1.0 1.00 3 ▁▇▁▁▁
TotRmsAbvGrd 0 1.00 6.52 1.63 2 5.00 6.0 7.00 14 ▂▇▇▁▁
Fireplaces 0 1.00 0.61 0.64 0 0.00 1.0 1.00 3 ▇▇▁▁▁
GarageYrBlt 81 0.94 1978.51 24.69 1900 1961.00 1980.0 2002.00 2010 ▁▁▅▅▇
GarageCars 0 1.00 1.77 0.75 0 1.00 2.0 2.00 4 ▁▃▇▂▁
GarageArea 0 1.00 472.98 213.80 0 334.50 480.0 576.00 1418 ▂▇▃▁▁
WoodDeckSF 0 1.00 94.24 125.34 0 0.00 0.0 168.00 857 ▇▂▁▁▁
OpenPorchSF 0 1.00 46.66 66.26 0 0.00 25.0 68.00 547 ▇▁▁▁▁
EnclosedPorch 0 1.00 21.95 61.12 0 0.00 0.0 0.00 552 ▇▁▁▁▁
X3SsnPorch 0 1.00 3.41 29.32 0 0.00 0.0 0.00 508 ▇▁▁▁▁
ScreenPorch 0 1.00 15.06 55.76 0 0.00 0.0 0.00 480 ▇▁▁▁▁
PoolArea 0 1.00 2.76 40.18 0 0.00 0.0 0.00 738 ▇▁▁▁▁
MiscVal 0 1.00 43.49 496.12 0 0.00 0.0 0.00 15500 ▇▁▁▁▁
MoSold 0 1.00 6.32 2.70 1 5.00 6.0 8.00 12 ▃▆▇▃▃
YrSold 0 1.00 2007.82 1.33 2006 2007.00 2008.0 2009.00 2010 ▇▇▇▇▅
SalePrice 0 1.00 180921.20 79442.50 34900 129975.00 163000.0 214000.00 755000 ▇▅▁▁▁

skim(test)
Data summary
Name test
Number of rows 1459
Number of columns 80
_______________________
Column type frequency:
character 43
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MSZoning 4 1.00 2 7 0 5 0
Street 0 1.00 4 4 0 2 0
Alley 1352 0.07 4 4 0 2 0
LotShape 0 1.00 3 3 0 4 0
LandContour 0 1.00 3 3 0 4 0
Utilities 2 1.00 6 6 0 1 0
LotConfig 0 1.00 3 7 0 5 0
LandSlope 0 1.00 3 3 0 3 0
Neighborhood 0 1.00 5 7 0 25 0
Condition1 0 1.00 4 6 0 9 0
Condition2 0 1.00 4 6 0 5 0
BldgType 0 1.00 4 6 0 5 0
HouseStyle 0 1.00 4 6 0 7 0
RoofStyle 0 1.00 3 7 0 6 0
RoofMatl 0 1.00 7 7 0 4 0
Exterior1st 1 1.00 6 7 0 13 0
Exterior2nd 1 1.00 5 7 0 15 0
MasVnrType 16 0.99 4 7 0 4 0
ExterQual 0 1.00 2 2 0 4 0
ExterCond 0 1.00 2 2 0 5 0
Foundation 0 1.00 4 6 0 6 0
BsmtQual 44 0.97 2 2 0 4 0
BsmtCond 45 0.97 2 2 0 4 0
BsmtExposure 44 0.97 2 2 0 4 0
BsmtFinType1 42 0.97 3 3 0 6 0
BsmtFinType2 42 0.97 3 3 0 6 0
Heating 0 1.00 4 4 0 4 0
HeatingQC 0 1.00 2 2 0 5 0
CentralAir 0 1.00 1 1 0 2 0
Electrical 0 1.00 5 5 0 4 0
KitchenQual 1 1.00 2 2 0 4 0
Functional 2 1.00 3 4 0 7 0
FireplaceQu 730 0.50 2 2 0 5 0
GarageType 76 0.95 6 7 0 6 0
GarageFinish 78 0.95 3 3 0 3 0
GarageQual 78 0.95 2 2 0 4 0
GarageCond 78 0.95 2 2 0 5 0
PavedDrive 0 1.00 1 1 0 3 0
PoolQC 1456 0.00 2 2 0 2 0
Fence 1169 0.20 4 5 0 4 0
MiscFeature 1408 0.03 4 4 0 3 0
SaleType 1 1.00 2 5 0 9 0
SaleCondition 0 1.00 6 7 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1.00 2190.00 421.32 1461 1825.50 2190.0 2554.50 2919 ▇▇▇▇▇
MSSubClass 0 1.00 57.38 42.75 20 20.00 50.0 70.00 190 ▇▅▂▁▁
LotFrontage 227 0.84 68.58 22.38 21 58.00 67.0 80.00 200 ▃▇▁▁▁
LotArea 0 1.00 9819.16 4955.52 1470 7391.00 9399.0 11517.50 56600 ▇▂▁▁▁
OverallQual 0 1.00 6.08 1.44 1 5.00 6.0 7.00 10 ▁▁▇▅▁
OverallCond 0 1.00 5.55 1.11 1 5.00 5.0 6.00 9 ▁▁▇▅▁
YearBuilt 0 1.00 1971.36 30.39 1879 1953.00 1973.0 2001.00 2010 ▁▂▃▆▇
YearRemodAdd 0 1.00 1983.66 21.13 1950 1963.00 1992.0 2004.00 2010 ▅▂▂▃▇
MasVnrArea 15 0.99 100.71 177.63 0 0.00 0.0 164.00 1290 ▇▁▁▁▁
BsmtFinSF1 1 1.00 439.20 455.27 0 0.00 350.5 753.50 4010 ▇▂▁▁▁
BsmtFinSF2 1 1.00 52.62 176.75 0 0.00 0.0 0.00 1526 ▇▁▁▁▁
BsmtUnfSF 1 1.00 554.29 437.26 0 219.25 460.0 797.75 2140 ▇▆▂▁▁
TotalBsmtSF 1 1.00 1046.12 442.90 0 784.00 988.0 1305.00 5095 ▇▇▁▁▁
X1stFlrSF 0 1.00 1156.53 398.17 407 873.50 1079.0 1382.50 5095 ▇▃▁▁▁
X2ndFlrSF 0 1.00 325.97 420.61 0 0.00 0.0 676.00 1862 ▇▃▂▁▁
LowQualFinSF 0 1.00 3.54 44.04 0 0.00 0.0 0.00 1064 ▇▁▁▁▁
GrLivArea 0 1.00 1486.05 485.57 407 1117.50 1432.0 1721.00 5095 ▇▇▁▁▁
BsmtFullBath 2 1.00 0.43 0.53 0 0.00 0.0 1.00 3 ▇▆▁▁▁
BsmtHalfBath 2 1.00 0.07 0.25 0 0.00 0.0 0.00 2 ▇▁▁▁▁
FullBath 0 1.00 1.57 0.56 0 1.00 2.0 2.00 4 ▁▇▇▁▁
HalfBath 0 1.00 0.38 0.50 0 0.00 0.0 1.00 2 ▇▁▅▁▁
BedroomAbvGr 0 1.00 2.85 0.83 0 2.00 3.0 3.00 6 ▁▃▇▂▁
KitchenAbvGr 0 1.00 1.04 0.21 0 1.00 1.0 1.00 2 ▁▁▇▁▁
TotRmsAbvGrd 0 1.00 6.39 1.51 3 5.00 6.0 7.00 15 ▅▇▃▁▁
Fireplaces 0 1.00 0.58 0.65 0 0.00 0.0 1.00 4 ▇▇▁▁▁
GarageYrBlt 78 0.95 1977.72 26.43 1895 1959.00 1979.0 2002.00 2207 ▂▇▁▁▁
GarageCars 1 1.00 1.77 0.78 0 1.00 2.0 2.00 5 ▅▇▂▁▁
GarageArea 1 1.00 472.77 217.05 0 318.00 480.0 576.00 1488 ▃▇▃▁▁
WoodDeckSF 0 1.00 93.17 127.74 0 0.00 0.0 168.00 1424 ▇▁▁▁▁
OpenPorchSF 0 1.00 48.31 68.88 0 0.00 28.0 72.00 742 ▇▁▁▁▁
EnclosedPorch 0 1.00 24.24 67.23 0 0.00 0.0 0.00 1012 ▇▁▁▁▁
X3SsnPorch 0 1.00 1.79 20.21 0 0.00 0.0 0.00 360 ▇▁▁▁▁
ScreenPorch 0 1.00 17.06 56.61 0 0.00 0.0 0.00 576 ▇▁▁▁▁
PoolArea 0 1.00 1.74 30.49 0 0.00 0.0 0.00 800 ▇▁▁▁▁
MiscVal 0 1.00 58.17 630.81 0 0.00 0.0 0.00 17000 ▇▁▁▁▁
MoSold 0 1.00 6.10 2.72 1 4.00 6.0 8.00 12 ▅▆▇▃▃
YrSold 0 1.00 2007.77 1.30 2006 2007.00 2008.0 2009.00 2010 ▇▇▇▇▃
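
To rank columns by missingness without scanning the full `skim()` tables, a small helper can count NAs per column. This is a sketch only: `count_missing` is a hypothetical helper name, and the data frame `toy` holds made-up values purely to show the shape of the output.

```r
# Count missing values per column, keeping only columns with at least one NA,
# sorted from most to least missing
count_missing <- function(df) {
  n_missing <- colSums(is.na(df))
  sort(n_missing[n_missing > 0], decreasing = TRUE)
}

# Toy data frame (hypothetical values) to show the output shape
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  PoolQC = c(NA, NA, NA, "Gd"),
  LotArea = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)
missing_counts <- count_missing(toy)
print(missing_counts)
```

Calling `count_missing(train)` and `count_missing(test)` on the data loaded earlier should give the same `n_missing` figures that `skim()` reports.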

test %>% plot_density()

train %>% plot_density()

Data Cleaning

Prior to creating our model, we need to ensure that we handle missing (NA) values in the train and test sets properly. In this section I opted to fill every NA in a character column with the value “missing” and to convert the column from character to factor. Doing this on the combined data ensures that the train and test dataframes share the same factor levels and that NAs in these columns won’t cause problems when creating our model.

# Temporarily remove SalePrice so train and test have the same columns
SalePrice <- train$SalePrice
train$SalePrice <- NULL

# Merge datasets so that factor levels are built from both
full_data <- rbind(train, test)

# Change character cols to factor and replace NAs with "missing"
for (col in colnames(full_data)) {
  if (is.character(full_data[[col]])) {
    new_col <- full_data[[col]]
    new_col[is.na(new_col)] <- "missing"
    full_data[[col]] <- as.factor(new_col)
  }
}

# Split datasets again (train rows come first in full_data)
n_train <- nrow(train)
train <- full_data[1:n_train, ]
train$SalePrice <- SalePrice
test <- full_data[(n_train + 1):nrow(full_data), ]
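
The reason for converting after the merge can be seen on a toy example. The sketch below uses two hypothetical one-column data frames (`train_toy`, `test_toy`) with made-up MSZoning values; it is an illustration of the idea, not a replacement for the cleaning code above.

```r
# Toy train/test columns (hypothetical values): "RH" appears only in test_toy,
# and the NA appears only in train_toy
train_toy <- data.frame(MSZoning = c("RL", "RM", NA), stringsAsFactors = FALSE)
test_toy <- data.frame(MSZoning = c("RH", "RL"), stringsAsFactors = FALSE)

# Converting separately gives each data frame its own level set
sep_train <- as.factor(train_toy$MSZoning)
sep_test <- as.factor(test_toy$MSZoning)
print(identical(levels(sep_train), levels(sep_test)))

# Converting after rbind() gives one shared level set
full_toy <- rbind(train_toy, test_toy)
full_toy$MSZoning[is.na(full_toy$MSZoning)] <- "missing"
full_toy$MSZoning <- as.factor(full_toy$MSZoning)
joint_train <- full_toy[1:nrow(train_toy), , drop = FALSE]
joint_test <- full_toy[(nrow(train_toy) + 1):nrow(full_toy), , drop = FALSE]
print(levels(joint_train$MSZoning))
print(identical(levels(joint_train$MSZoning), levels(joint_test$MSZoning)))
```

A side effect of sharing levels is that a level can exist in one data frame with a count of zero. This is why `summary(train)` can list `missing: 0` for a column such as MSZoning: that column has no NAs in train but does in test.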

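Looking at the summary display, we can see that some numeric columns still contain NA values (LotFrontage: 259, MasVnrArea: 8, GarageYrBlt: 81), because only the character columns were recoded above.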

summary(train)
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   missing:   0   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RH     :  16   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RL     :1151   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0   RM     : 218   Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street         Alley      LotShape  LandContour
##  Min.   :  1300   Grvl:   6   Grvl   :  50   IR1:484   Bnk:  63   
##  1st Qu.:  7554   Pave:1454   missing:1369   IR2: 41   HLS:  50   
##  Median :  9478               Pave   :  41   IR3: 10   Low:  36   
##  Mean   : 10517                              Reg:925   Lvl:1311   
##  3rd Qu.: 11602                                                   
##  Max.   :215245                                                   
##                                                                   
##    Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub :1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  missing:   0   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##  NoSeWa :   1   FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                 FR3    :   4              Edwards:100   RRAn   :  26  
##                 Inside :1052              Somerst: 86   PosN   :  19  
##                                           Gilbert: 79   RRAe   :  11  
##                                           (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle   OverallQual      OverallCond   
##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.000   Min.   :1.000  
##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.000   1st Qu.:5.000  
##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.000   Median :5.000  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.099   Mean   :5.575  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000   3rd Qu.:6.000  
##  PosA   :   1                 1.5Unf : 14   Max.   :10.000   Max.   :9.000  
##  (Other):   2                 (Other): 19                                   
##    YearBuilt     YearRemodAdd    RoofStyle       RoofMatl     Exterior1st 
##  Min.   :1872   Min.   :1950   Flat   :  13   CompShg:1434   VinylSd:515  
##  1st Qu.:1954   1st Qu.:1967   Gable  :1141   Tar&Grv:  11   HdBoard:222  
##  Median :1973   Median :1994   Gambrel:  11   WdShngl:   6   MetalSd:220  
##  Mean   :1971   Mean   :1985   Hip    : 286   WdShake:   5   Wd Sdng:206  
##  3rd Qu.:2000   3rd Qu.:2004   Mansard:   7   ClyTile:   1   Plywood:108  
##  Max.   :2010   Max.   :2010   Shed   :   2   Membran:   1   CemntBd: 61  
##                                               (Other):   2   (Other):128  
##   Exterior2nd    MasVnrType    MasVnrArea     ExterQual ExterCond  Foundation 
##  VinylSd:504   BrkCmn : 15   Min.   :   0.0   Ex: 52    Ex:   3   BrkTil:146  
##  MetalSd:214   BrkFace:445   1st Qu.:   0.0   Fa: 14    Fa:  28   CBlock:634  
##  HdBoard:207   missing:  8   Median :   0.0   Gd:488    Gd: 146   PConc :647  
##  Wd Sdng:197   None   :864   Mean   : 103.7   TA:906    Po:   1   Slab  : 24  
##  Plywood:142   Stone  :128   3rd Qu.: 166.0             TA:1282   Stone :  6  
##  CmentBd: 60                 Max.   :1600.0                       Wood  :  3  
##  (Other):136                 NA's   :8                                        
##     BsmtQual      BsmtCond     BsmtExposure  BsmtFinType1   BsmtFinSF1    
##  Ex     :121   Fa     :  45   Av     :221   ALQ    :220   Min.   :   0.0  
##  Fa     : 35   Gd     :  65   Gd     :134   BLQ    :148   1st Qu.:   0.0  
##  Gd     :618   missing:  37   missing: 38   GLQ    :418   Median : 383.5  
##  missing: 37   Po     :   2   Mn     :114   LwQ    : 74   Mean   : 443.6  
##  TA     :649   TA     :1311   No     :953   missing: 37   3rd Qu.: 712.2  
##                                             Rec    :133   Max.   :5644.0  
##                                             Unf    :430                   
##   BsmtFinType2    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF    
##  ALQ    :  19   Min.   :   0.00   Min.   :   0.0   Min.   :   0.0  
##  BLQ    :  33   1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8  
##  GLQ    :  14   Median :   0.00   Median : 477.5   Median : 991.5  
##  LwQ    :  46   Mean   :  46.55   Mean   : 567.2   Mean   :1057.4  
##  missing:  38   3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2  
##  Rec    :  54   Max.   :1474.00   Max.   :2336.0   Max.   :6110.0  
##  Unf    :1256                                                      
##   Heating     HeatingQC CentralAir   Electrical     X1stFlrSF      X2ndFlrSF   
##  Floor:   1   Ex:741    N:  95     FuseA  :  94   Min.   : 334   Min.   :   0  
##  GasA :1428   Fa: 49    Y:1365     FuseF  :  27   1st Qu.: 882   1st Qu.:   0  
##  GasW :  18   Gd:241               FuseP  :   3   Median :1087   Median :   0  
##  Grav :   7   Po:  1               missing:   1   Mean   :1163   Mean   : 347  
##  OthW :   2   TA:428               Mix    :   1   3rd Qu.:1391   3rd Qu.: 728  
##  Wall :   4                        SBrkr  :1334   Max.   :4692   Max.   :2065  
##                                                                                
##   LowQualFinSF       GrLivArea     BsmtFullBath     BsmtHalfBath    
##  Min.   :  0.000   Min.   : 334   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :  0.000   Median :1464   Median :0.0000   Median :0.00000  
##  Mean   :  5.845   Mean   :1515   Mean   :0.4253   Mean   :0.05753  
##  3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :572.000   Max.   :5642   Max.   :3.0000   Max.   :2.00000  
##                                                                     
##     FullBath        HalfBath       BedroomAbvGr    KitchenAbvGr    KitchenQual 
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex     :100  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa     : 39  
##  Median :2.000   Median :0.0000   Median :3.000   Median :1.000   Gd     :586  
##  Mean   :1.565   Mean   :0.3829   Mean   :2.866   Mean   :1.047   missing:  0  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000   TA     :735  
##  Max.   :3.000   Max.   :2.0000   Max.   :8.000   Max.   :3.000                
##                                                                                
##   TotRmsAbvGrd      Functional     Fireplaces     FireplaceQu    GarageType 
##  Min.   : 2.000   Typ    :1360   Min.   :0.000   Ex     : 24   2Types :  6  
##  1st Qu.: 5.000   Min2   :  34   1st Qu.:0.000   Fa     : 33   Attchd :870  
##  Median : 6.000   Min1   :  31   Median :1.000   Gd     :380   Basment: 19  
##  Mean   : 6.518   Mod    :  15   Mean   :0.613   missing:690   BuiltIn: 88  
##  3rd Qu.: 7.000   Maj1   :  14   3rd Qu.:1.000   Po     : 20   CarPort:  9  
##  Max.   :14.000   Maj2   :   5   Max.   :3.000   TA     :313   Detchd :387  
##                   (Other):   1                                 missing: 81  
##   GarageYrBlt    GarageFinish   GarageCars      GarageArea       GarageQual  
##  Min.   :1900   Fin    :352   Min.   :0.000   Min.   :   0.0   Ex     :   3  
##  1st Qu.:1961   missing: 81   1st Qu.:1.000   1st Qu.: 334.5   Fa     :  48  
##  Median :1980   RFn    :422   Median :2.000   Median : 480.0   Gd     :  14  
##  Mean   :1979   Unf    :605   Mean   :1.767   Mean   : 473.0   missing:  81  
##  3rd Qu.:2002                 3rd Qu.:2.000   3rd Qu.: 576.0   Po     :   3  
##  Max.   :2010                 Max.   :4.000   Max.   :1418.0   TA     :1311  
##  NA's   :81                                                                  
##    GarageCond   PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch   
##  Ex     :   2   N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  Fa     :  35   P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Gd     :   9   Y:1340     Median :  0.00   Median : 25.00   Median :  0.00  
##  missing:  81              Mean   : 94.24   Mean   : 46.66   Mean   : 21.95  
##  Po     :   7              3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00  
##  TA     :1326              Max.   :857.00   Max.   :547.00   Max.   :552.00  
##                                                                              
##    X3SsnPorch      ScreenPorch        PoolArea           PoolQC    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Ex     :   2  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Fa     :   2  
##  Median :  0.00   Median :  0.00   Median :  0.000   Gd     :   3  
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759   missing:1453  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000                 
##  Max.   :508.00   Max.   :480.00   Max.   :738.000                 
##                                                                    
##      Fence       MiscFeature      MiscVal             MoSold      
##  GdPrv  :  59   Gar2   :   2   Min.   :    0.00   Min.   : 1.000  
##  GdWo   :  54   missing:1406   1st Qu.:    0.00   1st Qu.: 5.000  
##  missing:1179   Othr   :   2   Median :    0.00   Median : 6.000  
##  MnPrv  : 157   Shed   :  49   Mean   :   43.49   Mean   : 6.322  
##  MnWw   :  11   TenC   :   1   3rd Qu.:    0.00   3rd Qu.: 8.000  
##                                Max.   :15500.00   Max.   :12.000  
##                                                                   
##      YrSold        SaleType    SaleCondition    SalePrice     
##  Min.   :2006   WD     :1267   Abnorml: 101   Min.   : 34900  
##  1st Qu.:2007   New    : 122   AdjLand:   4   1st Qu.:129975  
##  Median :2008   COD    :  43   Alloca :  12   Median :163000  
##  Mean   :2008   ConLD  :   9   Family :  20   Mean   :180921  
##  3rd Qu.:2009   ConLI  :   5   Normal :1198   3rd Qu.:214000  
##  Max.   :2010   ConLw  :   5   Partial: 125   Max.   :755000  
##                 (Other):   9

We will replace the remaining NA values with -1. The -1 flags values that were originally missing without blocking normalization or model fitting.

# Fill remaining NA values with -1
train[is.na(train)] = -1
test[is.na(test)] = -1

For this analysis, I selected the variables LotArea, GarageArea and SalePrice. Living in a populous environment like New York, I know firsthand how much of a luxury open space can be. Based on this intuition, I suspect that the amount of space in the lot and the garage has some effect on how much a house sells for.

# Select the variables of interest
variables <- c("LotArea", "GarageArea", "SalePrice")
# Scatterplot matrix
plot(train[, variables], pch = 19)

Looking at the scatterplot and correlation matrix, we can see that LotArea has a weak positive correlation with SalePrice (about 0.26), while GarageArea and SalePrice have a moderately strong positive correlation (about 0.62).

# Correlation matrix
cor_matrix <- cor(train[, variables])
cor_matrix
##              LotArea GarageArea SalePrice
## LotArea    1.0000000  0.1804028 0.2638434
## GarageArea 0.1804028  1.0000000 0.6234314
## SalePrice  0.2638434  0.6234314 1.0000000
corrplot(cor_matrix, method="number")

The correlation coefficient (cor) between LotArea and GarageArea is estimated to be 0.1804028, which tells us there is a weak positive correlation. However, the p-value of 3.803e-12 tells us the correlation is significantly different from 0. The 80 percent confidence interval for the correlation coefficient runs from 0.1477356 to 0.2126767.

(cor1 <-cor.test(formula = ~LotArea + GarageArea,data=train,   conf.level = .80))
## 
##  Pearson's product-moment correlation
## 
## data:  LotArea and GarageArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1477356 0.2126767
## sample estimates:
##       cor 
## 0.1804028
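
As a sanity check, the interval and t statistic above can be reproduced by hand. The sketch below assumes the usual Fisher z-transformation for a Pearson correlation and plugs in the estimate and sample size printed above; the variable names are illustrative only.

```r
# Reproduce the 80% confidence interval for cor(LotArea, GarageArea)
# from the printed estimate, using the Fisher z-transformation.
r <- 0.1804028      # sample correlation reported by cor.test
n <- 1460           # rows in train (df = n - 2 = 1458)
conf_level <- 0.80

z    <- atanh(r)                         # Fisher z-transform of r
se   <- 1 / sqrt(n - 3)                  # approximate standard error of z
crit <- qnorm((1 + conf_level) / 2)      # two-sided normal critical value
ci   <- tanh(z + c(-1, 1) * crit * se)   # back-transform to the r scale

# t statistic for H0: true correlation is 0
t_stat <- r * sqrt((n - 2) / (1 - r^2))

print(round(ci, 4))
print(round(t_stat, 3))
```

Both values should agree with the cor.test output up to rounding.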

The correlation coefficient (cor) between LotArea and SalePrice is estimated to be 0.2638434, which tells us there is a weak positive correlation. However, the p-value of less than 2.2e-16 tells us the correlation is significantly different from 0. The 80 percent confidence interval for the correlation coefficient runs from 0.2323391 to 0.2947946.

(cor2 <-cor.test(formula = ~LotArea + SalePrice,data=train,   conf.level = .80))
## 
##  Pearson's product-moment correlation
## 
## data:  LotArea and SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434
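
The output above covers two of the three pairs; the GarageArea and SalePrice pair can be tested with the same call. On familywise error, the following is a rough sketch rather than a definitive treatment. It assumes the three tests are independent (they are not strictly, since they share variables) and shows how the chance of at least one false positive grows at the 0.20 level, and what a Bonferroni correction does to the p-values reported above. The value 2.2e-16 is used as an upper bound for the second p-value.

```r
# Familywise error rate for m tests, each run at level alpha
alpha <- 0.20   # matches the 80% confidence level used above
m     <- 3      # three pairwise correlations among three variables
fwer  <- 1 - (1 - alpha)^m
print(fwer)     # chance of at least one false positive if all nulls were true

# Bonferroni adjustment of the p-values printed above
p_reported <- c(LotArea_GarageArea = 3.803e-12,
                LotArea_SalePrice  = 2.2e-16)   # upper bound, reported as "< 2.2e-16"
p_bonferroni <- p.adjust(p_reported, method = "bonferroni", n = m)
print(p_bonferroni)

# The remaining pair, run only if the train data frame is loaded
if (exists("train") && is.data.frame(train)) {
  print(cor.test(~ GarageArea + SalePrice, data = train, conf.level = 0.80))
}
```

At this level the familywise error rate is close to one half, which would normally be a concern. Here, though, the adjusted p-values remain far below any conventional threshold, so the conclusions for the two reported pairs do not change.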

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Christian’s Response:

We can use the function solve to invert the correlation matrix

inverse_matrix <- solve(cor_matrix)

We can perform matrix multiplication using the %*% operator. I’ll use it to multiply the matrices together.

# multiply correlation matrix by precision matrix
precision_matrix <- inverse_matrix
result1 <- cor_matrix %*% precision_matrix
# multiply precision matrix by correlation matrix
result2 <- precision_matrix %*% cor_matrix
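
Multiplying a matrix by its inverse in either order should give the identity matrix, and the diagonal of the precision matrix holds the variance inflation factors. A minimal sketch of that check is shown below; it rebuilds the correlation matrix from the values printed earlier so that it runs on its own, and the `_demo` names are illustrative.

```r
# Rebuild the 3x3 correlation matrix from the printed output
vars <- c("LotArea", "GarageArea", "SalePrice")
cor_demo <- matrix(c(1.0000000, 0.1804028, 0.2638434,
                     0.1804028, 1.0000000, 0.6234314,
                     0.2638434, 0.6234314, 1.0000000),
                   nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

precision_demo <- solve(cor_demo)   # inverse of the correlation matrix

# Both products should equal the identity up to floating-point error
left  <- cor_demo %*% precision_demo
right <- precision_demo %*% cor_demo
print(round(left, 10))
print(round(right, 10))

# Variance inflation factors sit on the diagonal of the precision matrix
vif <- diag(precision_demo)
print(vif)
```

VIF values close to 1 indicate little multicollinearity among the three variables.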

The lu function from the Matrix package performs LU decomposition. We can then look into the lower triangular matrix (L), the upper triangular matrix (U), and the permutation matrix (P).

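library(Matrix)  # provides lu()
# result1 and result2 are numerically the identity matrix, so their LU factors are trivial
decomposition <- lu(result1)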
decomposition2 <- lu(result2)
print(decomposition)
## 'MatrixFactorization' of Formal class 'denseLU' [package "Matrix"] with 4 slots
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:3] "LotArea" "GarageArea" "SalePrice"
##   .. ..$ : chr [1:3] "LotArea" "GarageArea" "SalePrice"
##   ..@ x       : num [1:9] 1 0 0 0 1 0 0 0 1
##   ..@ perm    : int [1:3] 1 2 3
##   ..@ Dim     : int [1:2] 3 3
print(decomposition2)
## 'MatrixFactorization' of Formal class 'denseLU' [package "Matrix"] with 4 slots
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:3] "LotArea" "GarageArea" "SalePrice"
##   .. ..$ : chr [1:3] "LotArea" "GarageArea" "SalePrice"
##   ..@ x       : num [1:9] 1 0 0 0 1 ...
##   ..@ perm    : int [1:3] 1 2 3
##   ..@ Dim     : int [1:2] 3 3
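
Because both products are the identity matrix, their factors carry little information. Decomposing the correlation matrix itself is more informative. The sketch below is a hand-rolled Doolittle factorization without pivoting, written in base R for illustration. The helper `lu_doolittle` is hypothetical and is not the algorithm used by `Matrix::lu`, which pivots; for the objects above, `expand()` from the Matrix package extracts the L, U and P factors.

```r
# Doolittle LU factorization without pivoting: A = L %*% U,
# with L unit lower triangular and U upper triangular.
# Adequate here because a correlation matrix with positive determinant
# has nonzero pivots; general matrices need pivoting.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

# Correlation matrix rebuilt from the printed output
A <- matrix(c(1.0000000, 0.1804028, 0.2638434,
              0.1804028, 1.0000000, 0.6234314,
              0.2638434, 0.6234314, 1.0000000),
            nrow = 3, byrow = TRUE)

f <- lu_doolittle(A)
print(f$L)
print(f$U)
print(f$L %*% f$U)   # should reproduce A
```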

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Christian’s Response:

Looking at the results from the plot_density function, I found two columns that are right-skewed: BsmtUnfSF and X1stFlrSF.

hist(train$BsmtUnfSF, breaks="FD")

hist(train$X1stFlrSF,breaks="FD")

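library(MASS)
# Fit an exponential distribution by maximum likelihood
fit <- fitdistr(train$BsmtUnfSF, "exponential")
# The estimate is the rate parameter lambda
lambda <- fit$estimate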
# Generate 1000 samples from the exponential distribution
samples <- rexp(1000, rate = lambda)

Comparing the two histograms, we can see that the histogram of generated samples has a smoother distribution than the original variable.

# Plot histogram of original variable
hist(train$BsmtUnfSF, breaks = "FD", col = "lightblue", main = "Original Variable", xlab = "Value")

# Plot histogram of generated samples
hist(samples, breaks = "FD", col = "lightgreen", main = "Exponential Distribution Samples", xlab = "Value")

# Find 5th and 95th percentiles using the exponential CDF
percentile_5 <- qexp(0.05, rate = lambda)
percentile_95 <- qexp(0.95, rate = lambda)
# Generate 95% confidence interval assuming normality
confidence_interval <- t.test(train$BsmtUnfSF)$conf.int
# Find empirical 5th and 95th percentiles
empirical_percentile_5 <- quantile(train$BsmtUnfSF, 0.05)
empirical_percentile_95 <- quantile(train$BsmtUnfSF, 0.95)
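
To make the comparison concrete, the sketch below walks through the same three quantities on a simulated right-skewed vector. The vector `x` is a hypothetical stand-in for train$BsmtUnfSF, so the numbers it prints are not the results for the real data. It relies on two standard facts: the maximum likelihood estimate of the exponential rate is 1 / mean, and the exponential quantile function is -log(1 - p) / lambda.

```r
set.seed(42)
x <- rexp(1000, rate = 1 / 500)   # hypothetical stand-in for train$BsmtUnfSF

# MLE of the exponential rate is the reciprocal of the sample mean
lambda_hat <- 1 / mean(x)

# 5th and 95th percentiles from the exponential CDF, in closed form and via qexp
p <- c(0.05, 0.95)
q_closed  <- -log(1 - p) / lambda_hat
q_builtin <- qexp(p, rate = lambda_hat)

# 95% confidence interval for the mean, assuming normality
# (this is what t.test()$conf.int returns)
n <- length(x)
ci_mean <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)

# Empirical 5th and 95th percentiles of the data
q_empirical <- quantile(x, p)

print(q_closed)
print(ci_mean)
print(q_empirical)
```

Note that the t.test interval describes uncertainty about the mean, so it is much narrower than the spread between the 5th and 95th percentiles of the data. The two should not be compared as if they measured the same thing.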

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.

Christian’s Report:

# select only numeric columns
numeric_cols <- sapply(train, is.numeric)
train_numeric <- train[, numeric_cols]
# Compute the correlation matrix
correlation_matrix <- cor(train_numeric)
# Extract correlation values between SalePrice and the other variables
saleprice_correlation <- correlation_matrix[,"SalePrice"]
# Turns matrix to dataframe
correlation_df <- data.frame(Variable = names(saleprice_correlation), Correlation = saleprice_correlation)
correlation_df
##                    Variable Correlation
## Id                       Id -0.02191672
## MSSubClass       MSSubClass -0.08428414
## LotFrontage     LotFrontage  0.20780482
## LotArea             LotArea  0.26384335
## OverallQual     OverallQual  0.79098160
## OverallCond     OverallCond -0.07785589
## YearBuilt         YearBuilt  0.52289733
## YearRemodAdd   YearRemodAdd  0.50710097
## MasVnrArea       MasVnrArea  0.47258506
## BsmtFinSF1       BsmtFinSF1  0.38641981
## BsmtFinSF2       BsmtFinSF2 -0.01137812
## BsmtUnfSF         BsmtUnfSF  0.21447911
## TotalBsmtSF     TotalBsmtSF  0.61358055
## X1stFlrSF         X1stFlrSF  0.60585218
## X2ndFlrSF         X2ndFlrSF  0.31933380
## LowQualFinSF   LowQualFinSF -0.02560613
## GrLivArea         GrLivArea  0.70862448
## BsmtFullBath   BsmtFullBath  0.22712223
## BsmtHalfBath   BsmtHalfBath -0.01684415
## FullBath           FullBath  0.56066376
## HalfBath           HalfBath  0.28410768
## BedroomAbvGr   BedroomAbvGr  0.16821315
## KitchenAbvGr   KitchenAbvGr -0.13590737
## TotRmsAbvGrd   TotRmsAbvGrd  0.53372316
## Fireplaces       Fireplaces  0.46692884
## GarageYrBlt     GarageYrBlt  0.26135424
## GarageCars       GarageCars  0.64040920
## GarageArea       GarageArea  0.62343144
## WoodDeckSF       WoodDeckSF  0.32441344
## OpenPorchSF     OpenPorchSF  0.31585623
## EnclosedPorch EnclosedPorch -0.12857796
## X3SsnPorch       X3SsnPorch  0.04458367
## ScreenPorch     ScreenPorch  0.11144657
## PoolArea           PoolArea  0.09240355
## MiscVal             MiscVal -0.02118958
## MoSold               MoSold  0.04643225
## YrSold               YrSold -0.02892259
## SalePrice         SalePrice  1.00000000
# arrange dataframe based on absolute value of correlation values
correlation_df <- correlation_df[order(-abs(correlation_df$Correlation)),]
print(correlation_df)
##                    Variable Correlation
## SalePrice         SalePrice  1.00000000
## OverallQual     OverallQual  0.79098160
## GrLivArea         GrLivArea  0.70862448
## GarageCars       GarageCars  0.64040920
## GarageArea       GarageArea  0.62343144
## TotalBsmtSF     TotalBsmtSF  0.61358055
## X1stFlrSF         X1stFlrSF  0.60585218
## FullBath           FullBath  0.56066376
## TotRmsAbvGrd   TotRmsAbvGrd  0.53372316
## YearBuilt         YearBuilt  0.52289733
## YearRemodAdd   YearRemodAdd  0.50710097
## MasVnrArea       MasVnrArea  0.47258506
## Fireplaces       Fireplaces  0.46692884
## BsmtFinSF1       BsmtFinSF1  0.38641981
## WoodDeckSF       WoodDeckSF  0.32441344
## X2ndFlrSF         X2ndFlrSF  0.31933380
## OpenPorchSF     OpenPorchSF  0.31585623
## HalfBath           HalfBath  0.28410768
## LotArea             LotArea  0.26384335
## GarageYrBlt     GarageYrBlt  0.26135424
## BsmtFullBath   BsmtFullBath  0.22712223
## BsmtUnfSF         BsmtUnfSF  0.21447911
## LotFrontage     LotFrontage  0.20780482
## BedroomAbvGr   BedroomAbvGr  0.16821315
## KitchenAbvGr   KitchenAbvGr -0.13590737
## EnclosedPorch EnclosedPorch -0.12857796
## ScreenPorch     ScreenPorch  0.11144657
## PoolArea           PoolArea  0.09240355
## MSSubClass       MSSubClass -0.08428414
## OverallCond     OverallCond -0.07785589
## MoSold               MoSold  0.04643225
## X3SsnPorch       X3SsnPorch  0.04458367
## YrSold               YrSold -0.02892259
## LowQualFinSF   LowQualFinSF -0.02560613
## Id                       Id -0.02191672
## MiscVal             MiscVal -0.02118958
## BsmtHalfBath   BsmtHalfBath -0.01684415
## BsmtFinSF2       BsmtFinSF2 -0.01137812
# Select predictor variables for the regression model
independent_vars <- c("GrLivArea", "OverallQual")

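# Create a new data frame with the predictor variables and the response variable
# Note: the -1 row index drops the first observation, so the model is fit on
# 1,459 rows (hence the 1456 residual degrees of freedom in the summary below)
regression_data <- train[-1, c(independent_vars, "SalePrice")]

# Remove rows with NA values (a no-op here, since NAs were already replaced with -1)
regression_data <- na.omit(regression_data)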
# Fit the multiple regression model
model <- lm(SalePrice ~ ., data = regression_data)

# Print model summary
summary(model)
## 
## Call:
## lm(formula = SalePrice ~ ., data = regression_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -379596  -22335    -381   19895  289477 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.041e+05  5.047e+03  -20.63   <2e-16 ***
## GrLivArea    5.586e+01  2.631e+00   21.24   <2e-16 ***
## OverallQual  3.285e+04  9.996e+02   32.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42510 on 1456 degrees of freedom
## Multiple R-squared:  0.7142, Adjusted R-squared:  0.7138 
## F-statistic:  1819 on 2 and 1456 DF,  p-value: < 2.2e-16

Final Thoughts

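The model has an R-squared of 0.7142, meaning that GrLivArea and OverallQual together explain about 71% of the variance in SalePrice. The F-test p-value is below 2.2e-16, so the model as a whole is statistically significant, and both coefficients are individually significant as well. The residual standard error of about 42,510 shows that individual predictions can still miss by a wide margin, so the fit is useful but far from exact.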
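
To illustrate what the fitted coefficients imply, the sketch below applies the rounded estimates from the summary to two made-up houses. The helper `predict_price` and both houses are hypothetical. Because the coefficients are rounded, the results will differ slightly from what predict(model, ...) returns.

```r
# Rounded coefficient estimates copied from the model summary
b0          <- -1.041e+05   # intercept
b_grlivarea <-  5.586e+01   # dollars per square foot of above-ground living area
b_overall   <-  3.285e+04   # dollars per one-point increase in OverallQual

# Hypothetical helper: linear prediction from the rounded coefficients
predict_price <- function(gr_liv_area, overall_qual) {
  b0 + b_grlivarea * gr_liv_area + b_overall * overall_qual
}

typical <- predict_price(gr_liv_area = 1500, overall_qual = 6)
small   <- predict_price(gr_liv_area = 400,  overall_qual = 1)

print(typical)   # mid-sized house of above-average quality
print(small)     # very small, lowest-quality house
```

The second house receives a negative predicted price, which shows a limitation of a plain linear model on this scale. Since the competition scores the logarithm of predicted prices, it may be worth checking the submission for non-positive predictions before uploading.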

# Predict with the test dataset
predicted_prices <- predict(model, test[, independent_vars])

# Create dataframe with predicted prices and IDs
predicted_df <- data.frame(Id = test$Id, SalePrice = predicted_prices)
write.csv(predicted_df, file = "predictions_submission.csv", row.names = FALSE)
knitr::include_graphics("C:\\Users\\urios\\OneDrive\\Pictures\\Screenshots\\Entry.png", error = FALSE)