1 Introduction

This document details the univariate of the 24 columns and bivariate exploration of variables with noticeable correlation with house sale price (38 columns) of the Kaggle Advanced Regression Housing Price competition train.csv dataset. The 24 columns representing 24 house characteristics include:

  • SalePrice
  • OverallQual
  • Neighborhood
  • MSSubClass
  • TotalBath: Number of full bath (FullBath + BsmtFullBath + HalfBath + BsmtHalfBath)
  • GarageArea: Size of garage in square feet
  • TotalBsmtSF
  • 1stFlrSF
  • 2ndFlrSF
  • GrLivArea
  • LotArea: Lot size in square feet
  • LotConfig: Lot configuration
  • AgeSold: House age (YrSold - YearBuilt)
  • Exterior1st: Exterior covering on house
  • HouseStyle: Style of dwelling
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • BedroomAbvGr: Number of bedrooms above basement level
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • LotFrontage: Linear feet of street connected to property
  • BldgType: Type of dwelling
  • RemodSold: The time difference in years between when a house was remodeled and when it was sold (YrSold - YearRemodAdd)
  • BsmtFinSF1: Quality of basement finished area in square feet

The original data set consists of 1460 rows and 81 of data. The first part of this document provides the univariate results for each of the 24 variables. Next, the bivariate analysis is conducted along with a discussion of which predictors were found to be most strongly related to SalePrice. Finally, we note any significant relationships that exist between the shortlisted predictor variables.

2 Univariate Analysis and Possible Transformation

2.1 SalePrice

Saleprice represents the dollar value that each house in the sample sold for. Of the 1460 prices included in the column, the minimum saleprice was $34900, the mean was $180921, the median was $163000, the max was $755000, and the standard deviation was $79443. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most SalePrices falling between $100,000 and $200,000. However, the column also contains several very large values that range from approximately $375,000 to $750,000. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high sale prices.

From the density and boxplot, we can see that SalePrice distribution has a right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since log10 transformation is both suitable for monetary variables (like SalePrice) and able to shift data shape to the right, this transformation will be implemented on SalePrice.

2.2 OverallQual

OverallQual represents the material and finish quality of each house in the sample. Of the 1460 prices included in the column, the minimum OverallQual was 1, the mean was 6, the median was 6, the max was 10, and the standard deviation was 1. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was imputed with integers, because there are some troughs between the whole numbers such as 5,6 and 7.

The results show that most houses have an overall quality of 5-7. The column contains no unusual or missing values that need cleaning. It looks like only a few rows have OverallQual ratings less than 4 and above 9.

2.3 Neighborhood

The neighborhood categorical variable represents the area in Ames, Iowa where each house exists. In total, 25 unique neighborhoods exist in the file. Several of those neighborhood contain very few observations. For example, Blmngton, BrDale, and Bluest each represented less than 1% of the data, while 15% of the observations are in the NAmes neighborhood.

Table1: Neighborhood Proportions
Neighborhood Proportion
NAmes 0.15
CollgCr 0.10
OldTown 0.08
Edwards 0.07
Somerst 0.06
Gilbert 0.05
NridgHt 0.05
NWAmes 0.05
Sawyer 0.05
BrkSide 0.04
SawyerW 0.04
Crawfor 0.03
IDOTRR 0.03
Mitchel 0.03
NoRidge 0.03
Timber 0.03
ClearCr 0.02
StoneBr 0.02
SWISU 0.02
Blmngtn 0.01
BrDale 0.01
MeadowV 0.01
NPkVill 0.01
Veenker 0.01
Blueste 0.00

Since there are a lot of neighborhoods, further correlation analysis should be conducted to decide which neighbors are strongly correlated with SalePrice to be kept in the prediction model. Grouping of meaningful and non-meaningful neighborhood predictors can be conducted in further data cleaning process.

2.4 MSSubClass

The MSSubClass categorical variable represents the the building class of each house. Even though MSSubClass is a numeric column in the dataset, it should be treated as a catergorical variable. Because, the values of MSSubClass represent different MSSubClass types instead of having a numeric meaning.

In total, 15 unique building classes exist in the file. Several of those neighborhood contain very few observations. For example, 40 represented close to 0% of the data, while 37% of the observations are of 20 building class.

Table2: MSSubClass Proportions
MSSubClass Proportion
20 0.37
60 0.20
50 0.10
120 0.06
30 0.05
70 0.04
80 0.04
90 0.04
160 0.04
190 0.02
45 0.01
75 0.01
85 0.01
180 0.01
40 0.00

Since there are a lot of MSSubClass, further correlation analysis should be conducted to decide which building classes are strongly correlated with SalePrice to be kept in the prediction model. Grouping of meaningful and non-meaningful MSSubClass predictors can be conducted in further data cleaning process.

2.5 TotalFullBath

The TotalFullBath represents the total number of full bathrooms of each house, including both full bathrooms above ground and under basement. This column is the sum of FullBath and BsmtFullBath. Even though TotalFullBath is returned as a numeric variable after calculation, the column should be treated as categorical to see the detailed statistics for each level of total number of full bathrooms.

Table3: TotalFullBath Proportions
TotalFullBath Proportion
2 0.51
1 0.25
3 0.22
4 0.01
0 0.00
6 0.00

In total, 6 total number of full bathrooms exist in the file. Several of those TotalFullBath contain very few observations. For example, 0, 4, 6 total numbers of full bath represented equal or lower than 1% of the data, while 51% of the observations have 2 full bathrooms in total.

2.6 GarageArea

The GarageArea represents the size of garage in square feet of each house. Of the 1460 prices included in the column, the minimum GarageArea was 0, the mean was 473, the median was 480, the max was 1418, and the standard deviation was 214. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most GarageArea falling between 300 and 600. However, the column also contains several large values that range from approximately 1000 to 1450. This indicates that we will need to be judicious in our exploration to identify outlier based on high GarageArea values.

From the density and boxplot, we can see that GarageArea distribution has a right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since square root transformation can shift data shape to the right, this transformation will be implemented on GarageArea.

2.7 TotalBsmtSF

The TotalBsmtSF represents the total basement area in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum TotalBsmtSF was 0, the mean was 473, the median was 480, the max was 1418, and the standard deviation was 214. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was very right skewed with most TotalBsmtSF falling between 750 and 1300. However, the column also contains several large values that range from approximately 2000 to 6000 and some very small values at 0. This indicates that we will need to be judicious in our exploration to identify outlier based on high TotalBsmtSF values.

From the density and boxplot, we can see that TotalBsmtSF distribution has a right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since cube root transformation can shift data shape to the right, this transformation will be implemented on TotalBsmtSF.

2.8 1stFlrSF

The 1stFlrSF represents the first floor area in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum 1stFlrSF was 334, the mean was 1163, the median was 1087, the max was 4692, and the standard deviation was 387. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was very right skewed with most 1stFlrSF falling between 800 and 1300. However, the column also contains several large values that range from approximately 2200 to 4700. This indicates that we will need to be judicious in our exploration to identify outlier based on high 1stFlrSF values.

From the density and boxplot, we can see that 1stFlrSF distribution has a right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since natural logarithm transformation can shift data shape to the right, this transformation will be implemented on 1stFlrSF.

2.9 2ndFlrSF

The 2ndFlrSF represents the second floor area in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum 2ndFlrSF was 0, the mean was 347, the median was 0, the max was 2065, and the standard deviation was 437. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most 2ndFlrSF falling between 0 and 800. However, the column also contains several large values that range from approximately 1800 to 2100. This indicates that we will need to be judicious in our exploration to identify outlier based on high 2ndFlrSF values.

From the density and boxplot, we can see that 2ndFlrSF distribution has strange shape. Applying non-linear transformation will not bring the data into a normally distributed shape.

For example, before applying natural logarithm transformation to 2ndFlrSF, we have to add some small number to 2ndFlrSF values of 0 to avoid creating non-finite values.

The problem lies in the fact that half or more houses have 0 second floor area (the median of 2ndFlrSF is 0), which implicates those are one-story house. In this case, if we want to include 2ndFlrSF in the prediction model, we have to combine it with other variables to create a new meaningful predictor. Further cleaning work will address this problem.

2.10 GrLivArea

The GrLivArea represents the above grade (ground) living area in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum GrLivArea was 334, the mean was 1515, the median was r median, the max was 5642, and the standard deviation was 525. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most GrLivArea falling between 1200 and 1750. However, the column also contains several large values that range from approximately 2750 to 5700. This indicates that we will need to be judicious in our exploration to identify outlier based on high GrLivArea values.

From the density and boxplot, we can see that GrLivArea distribution has a right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since natural logarithm transformation can shift data shape to the right, this transformation will be implemented on GrLivArea.

2.11 LotArea

The LotArea represents the lot size in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum LotArea was 1300, the mean was 10517, the median was 9478, the max was 215245, and the standard deviation was 9981. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was extremely right skewed with most LotArea falling between 0 and 2000. However, the column also contains several large values that range from approximately 5000 to 250000. This indicates that we will need to be judicious in our exploration to identify outlier based on high LotArea values.

From the density and boxplot, we can see that LotArea distribution has a significant right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since logarithm with base 10 transformation can considerably shift data shape to the right, this transformation will be implemented on LotArea.

2.12 LotConfig

The LotConfig categorical variable represents the lot configuration of each house in the sample. In total, 5 unique lot configuration exist in the file. Two of those configuration, FR2 and FR3 , contain very few observations with proportions of less than 5%. Meanwhile, 72% of the observations have Inside lot configuration.

Table4: LotConfig Proportions
LotConfig Proportion
Inside 0.72
Corner 0.18
CulDSac 0.06
FR2 0.03
FR3 0.00

Further correlation analysis should be conducted to decide which lot configuration are strongly correlated with SalePrice to be kept in the prediction model.

2.13 AgeSold

The AgeSold represents the number of years between when each house was built and when it was sold, which shows the newness of each house as of when it was sold. The variable is calculated by the difference between YrSold and YearBuilt.

Of the 1460 prices included in the column, the minimum AgeSold was 0, the mean was 37, the median was 35, the max was 136, and the standard deviation was 30. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was slightly right skewed with most AgeSold falling between 10 and 50. However, the column also contains several large values that range from approximately 120 to 140. This indicates that we will need to be judicious in our exploration to identify outlier based on high Age values.

From the density and boxplot, we can see that AgeSold distribution has a slight right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since square root transformation can shift data shape to the right, this transformation will be implemented on AgeSold.

2.14 Exterior1st

The Exterior1st categorical variable represents the exterior covering on each house in the sample. In total, 15 unique lot configuration exist in the file. Several of those coverings, AsbShng, AsphShn, BrkComm, CBlock, ImStucc and Stone , contain very few observations with proportions of equal or less than 1%. Meanwhile, 35% of the observations have VinylSd exterior covering.

Table5: Exterior1st Proportions
Exterior1st Proportion
VinylSd 0.35
HdBoard 0.15
MetalSd 0.15
Wd Sdng 0.14
Plywood 0.07
CemntBd 0.04
BrkFace 0.03
Stucco 0.02
WdShing 0.02
AsbShng 0.01
AsphShn 0.00
BrkComm 0.00
CBlock 0.00
ImStucc 0.00
Stone 0.00

Further correlation analysis should be conducted to decide which exterior coverings are strongly correlated with SalePrice to be kept in the prediction model.

2.15 HouseStyle

The HouseStyle categorical variable represents the dwelling style of each house in the sample. In total, 8 unique lot configuration exist in the file. Several of those HouseStyle, 1.5Unf, 2.5Fin, 2.5Unf, contain very few observations with proportions of 1%. Meanwhile, 50% of the observations are of 1Story HouseStyle.

Table6: HouseStyle Proportions
HouseStyle Proportion
1Story 0.50
2Story 0.30
1.5Fin 0.11
SLvl 0.04
SFoyer 0.03
1.5Unf 0.01
2.5Fin 0.01
2.5Unf 0.01

Further correlation analysis should be conducted to decide which HouseStyle are strongly correlated with SalePrice to be kept in the prediction model.

2.16 MasVnrType

The MasVnrType categorical variable represents the masonry veneer type of each house in the sample. In total, 5 unique lot configuration exist in the file. Among those MasVnrtype, BrkCmn contains very few observations with proportions of 1%. Meanwhile, 60% of the observations have no masonry veneer type. The column contained 8 missing values. Thoses values will be updated as the “Unknown” type for better analysis.

Table7: MasVnrType Proportions
MasVnrType Proportion
None 0.59
BrkFace 0.30
Stone 0.09
BrkCmn 0.01
Unknown 0.01

Further correlation analysis should be conducted to decide which masonry veneer types are strongly correlated with SalePrice to be kept in the prediction model.

2.17 MasVnrArea

MasVnrArea represents the masonry veneer area in square feet of each house in the sample. Of the 1460 prices included in the column, the min masonry veneer area was NA, the mean was NA, the median was NA, the max was NA, and the standard deviation was NA. The column contained 8 missing values. Therefore, these missing values should be updated. One possible solution is to replace them with the mean MasVnrArea based on the specific neighborhood that the house is located in. Further data cleaning work will address this problem.

Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most MasVnrArea falling between 0 and 200. However, the column also contains several very large values that range from approximately 400 to 1600. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high MasVnrArea.

From the density and boxplot, we can see that MasVnrArea distribution has a significant right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since natural logarithm transformation can considerably shift data shape to the right, this transformation will be implemented on MasVnrArea.

2.18 BedroomAbvGr

BedroomAbvGr represents the number of bedrooms above basement level of each house in the sample. Even though BedroomAbvGr is numeric in the dataset, we want to treat it as categorical variable to see the detailed statistics for each level of total full bathroom number.

Table8: BedroomAbvGr Proportions
BedroomAbvGr Proportion
3 0.55
2 0.25
4 0.15
1 0.03
5 0.01
0 0.00
6 0.00
8 0.00

In total, 8 total number of bedrooms above ground exist in the file. Several of those BedroomAbvGr contain very few observations. For example, 0,6,8 numbers of bedrooms represented closely t 0% of the data, while 55% of the observations have 3 bedrooms.

2.19 KitchenQual

The KitchenQual categorical variable represents the kitchen quality of each house in the sample. In total, 4 unique lot configuration exist in the file. Among those, Fa kitchen quality contains very few observations with proportions of 3%. Meanwhile, 50% and 40% of the observations are of TA and Gd kitchen qualities, respectively.

Table9: KitchenQual Proportions
KitchenQual Proportion
TA 0.50
Gd 0.40
Ex 0.07
Fa 0.03

Further correlation analysis should be conducted to decide which KitchenQual are strongly correlated with SalePrice to be kept in the prediction model.

2.20 TotRmsAbvGrd

TotRmsAbvGrd represents the total number of rooms above grade (not including bathrooms) of each house in the sample. Even though TotRmsAbvGrd is numeric in the dataset, we want to treat it as categorical variable to see the detailed statistics for each level of total room number.

Table10: TotRmsAbvGrd Proportions
TotRmsAbvGrd Proportion
6 0.28
7 0.23
5 0.19
8 0.13
4 0.07
9 0.05
10 0.03
3 0.01
11 0.01
12 0.01
2 0.00
14 0.00

In total, 12 total number of rooms above ground exist in the file. Several of those TotRmsAbvGrd contain very few observations. For example, 2 and 14 numbers of room represented closely t 0% of the data. Further data cleaning should be applied to determine the meaning levels that should be included in the SalePrice prediction model.

2.21 LotFrontage

LotFrontage represents the linear feet of street connected to property of each house in the sample.

Of the 1460 prices included in the column, the min LotFrontage was NA, the mean was NA, the median was NA, the max was NA, and the standard deviation was NA. The column contained 259 missing values. Further cleaning work should be conducted to replaced these missing values with suitable data that would improve the SalePrice prediction model.

Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most LotFrontage falling between 50 and 100. However, the column also contains several very large values that range from approximately 250 to slightly over 300. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high LotFrontage.

From the density and boxplot, we can see that LotFrontage distribution has a right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since natural logarithm transformation can considerably shift data shape to the right, this transformation will be implemented on LotFrontage.

2.22 BldgType

BldgType represents the type of dwelling of each house in the sample.

Table11: BldgType Proportions
BldgType Proportion
1Fam 0.84
TwnhsE 0.08
Duplex 0.04
Twnhs 0.03
2fmCon 0.02

In total, 5 types of dwelling exist in the file. Several of those BldgType contain very few observations. For example, Twnhs and 2fmCon represented 3% and 2% of the data, respectively. Further data cleaning should be applied to determine the meaning levels that should be included in the SalePrice prediction model.

2.23 RemodSold

RemodSold represents the time gap between when each house was sold and when it was remodeled, which partially shows the quality of each house as of when it was sold. The column is calculated by subtracting YearRemodAdd from YrSold.

Of the 1460 prices included in the column, the min RemodSold was -1, the mean was 23, the median was 14, the max was 60, and the standard deviation was 21. The column contained 0 missing values. Interestingly, there is one house that was remodeled after being sold. Thus, the remodeling did not affect the SalePrice that had been decided on earlier. Since this point is not contributing to the SalePrice prediction model, it will possibly be cleaned later on.

Visual inspection of both a density plot and a boxplot, revealed that the data was slightly right skewed with most RemodSold falling between 10 and 20. However, the column also contains several very large values that range from approximately 50 to 60. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high RemodSold.

From the density and boxplot, we can see that RemodSold distribution has a right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since cube root transformation can shift data shape to the right, this transformation will be implemented on RemodSold.

2.24 BsmtFinSF1

BsmtFinSF1 represents type 1 finished square feet of each house as of when it was sold.

Of the 1460 prices included in the column, the min BsmtFinSF1 was 0, the mean was 444, the median was 384, the max was 5644, and the standard deviation was 456. The column contained 0 missing values.

Visual inspection of both a density plot and a boxplot, revealed that the data was slightly right skewed with most BsmtFinSF1 falling between 0 and 750. However, the column also contains several very large values that range from approximately 1750 to 2250. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high BsmtFinSF1.

From the density and boxplot, we can see that BsmtFinSF1 distribution has a right skew. Therefore, non-linear transformation should be applied to bring the data into a more normally distributed shape. Since cube root transformation can shift data shape to the right, this transformation will be implemented on BsmtFinSF1.

3 Bivariate analysis

3.1 Correlation between all numeric variables and SalePrice

In this section, a correlation table between SalePrice and numeric variables available in the train dataset will be present. From the table, a list of highly correlated variables will be selected to have further one-to-one bivariate analysis with SalePrice and to be included in the SalePrice prediction model.

Table12: Correlation table between SalePrice and numeric variables
DependentVar IndependentVar Correlation
OverallQual SalePrice 0.79
GrLivArea SalePrice 0.71
GarageCars SalePrice 0.64
GarageArea SalePrice 0.62
TotalBsmtSF SalePrice 0.61
X1stFlrSF SalePrice 0.61
YearRemodAdd SalePrice 0.51
Fireplaces SalePrice 0.47
BsmtFinSF1 SalePrice 0.39
X2ndFlrSF SalePrice 0.32
WoodDeckSF SalePrice 0.32
OpenPorchSF SalePrice 0.32
LotArea SalePrice 0.26
BsmtUnfSF SalePrice 0.21
ScreenPorch SalePrice 0.11
PoolArea SalePrice 0.09
MoSold SalePrice 0.05
X3SsnPorch SalePrice 0.04
BsmtFinSF2 SalePrice -0.01
MiscVal SalePrice -0.02
LowQualFinSF SalePrice -0.03
OverallCond SalePrice -0.08
EnclosedPorch SalePrice -0.13
KitchenAbvGr SalePrice -0.14
AgeSold SalePrice -0.52

From the table, we notice some interesting correlation values between SalePrice and each of these following variables:

  • Correlation over 0.3 or below -0.3: OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, X1stFlrSF, YearRemodAdd, Fireplaces, BsmtFinSF1, X2ndFlrSF, WoodDeckSF, OpenPorchSF, AgeSold
  • Correlation between 0.09 and 0.3 or between -0.3 and -0.09: LotArea, BsmtUnfSF, ScreenPorch, PoolArea, KitchenAbvGr, EnclosedPorch.

In those variables, we will conduct detailed bivariate analysis on OverallQual, GrLivArea, GarageArea, TotalBsmtSF, X1stFlrSF, X2ndFlrSF, WoodDeckSF, OpenPorchSF, AgeSold, LotArea, BsmtUnfSF, ScreenPorch, and EnclosedPorch.

Meanwhile, GarageCars, Fireplaces, PoolArea and KitchenAbvGr will be converted into categorical to explore detailed statistics of the different levels.

3.2 Bivariate analysis between SalePrice and highly-correlated numeric variables

In this section of the report, we examine the relationships between the SalePrice and 13 shortlisted numeric variables: OverallQual, GrLivArea, GarageArea, TotalBsmtSF, X1stFlrSF, X2ndFlrSF, WoodDeckSF, OpenPorchSF, AgeSold, LotArea, BsmtUnfSF, ScreenPorch, and EnclosedPorch.

3.2.1 SalePrice & OverallQual

The scatter plots of OverallQual relative to Saleprice confirm that there is a positive association between the two variables. In general, the higher the house’s overall rating, the higher the price. However, the changes in price appears to get larger for houses with quality above 7. The wide distribution of points at qualities at 8 and above indicate that prices rises more for very high quality houses. Two interesting points stand out. Two houses that have OverallQual around 10, were sold for less than $200,000. These points could be overly influential in any model built from the data so we should remove them during our cleaning process.

3.2.2 SalePrice & GarageArea

The scatter plots of GarageArea relative to Saleprice confirm that there is a positive association between the two variables. In general, the larger the house’s garage area bathrooms, the higher the price. However, the changes in price appears to get larger for houses with quality above 600. The wide distribution of points at qualities at around 700 above indicate that prices rises more for houses with very large garage. Additionally, there are 4 outlier observations: some houses with garage areas larger than 1200 square feet but not having high SalePrice. These points could be overly influential in any model built from the data so we should remove them during our cleaning process.

3.2.3 SalePrice & TotalBsmtSF

The scatter plots of TotalBsmtSF relative to Saleprice confirm that there is a positive association between the two variables. In general, the larger the house’s basement, the higher the price. However, the changes in price appears to get larger for houses with quality above 1000. The wide distribution of points at qualities at around 1000 above indicate that prices rises more for houses with very large basement. Additionally, there is one outlier observation: a house has basement area of more than 6000 square feet but was sold for less than $200,000. This point could be overly influential in any model built from the data so we should remove it during our cleaning process.

3.2.4 SalePrice & 1stFlrSF

The scatter plots of 1stFlrSF relative to Saleprice confirm that there is a positive association between the two variables. In general, the larger the house’s 1st floor area, the higher the price. However, the changes in price appears to get larger for houses with quality above 1500. The wide distribution of points at qualities at 1500 and above indicate that prices rises more for houses with very large 1st floor. Additionally, there is one outlier observation: a house has the 1st floor area of larger than 4500 square feet but was sold for less than $200,000. This point could be overly influential in any model built from the data so we should remove it during our cleaning process.

3.2.5 SalePrice & 2ndFlrSF

The scatter plot of 2ndFlrSF relative to Saleprice confirms that there is a positive association between the two variables. In general, the larger the house’s 2nd floor area, the higher the price. However, the changes in price appears to get larger for houses with quality above 1000. The wide distribution of points at qualities at 1400 and above indicate that prices rises more for houses with very large 2nd floor.

3.2.6 SalePrice & GrLivArea

The point plots of GrlivArea relative to Saleprice show clearly that larger sized houses garner higher prices. Two interesting points stand out. Two houses that are both greater than 4500 square feet, were sold for less than $200,000. These points could be overly influential in any model built from the data so we should remove them during our cleaning process.

3.2.7 SalePrice & LotArea

The point plots of LotArea relative to Saleprice show clearly that houses with larger lot garner higher prices. Three interesting points stand out. Three houses that have lot area greater than 150,000 square feet, were sold for less than $400,000. These points could be overly influential in any model built from the data so we should remove them during our cleaning process.

3.2.8 SalePrice & AgeSold

The point plots of Age relative to Saleprice show clearly that houses with higher age garner lower prices. One interesting observation stand out. One house that was built 110 years ago was sold for nearly $500,000. This point could be overly influential in any model built from the data so we should remove it during our cleaning process.

3.2.9 SalePrice & WoodDeckSF

The point plots of WoodDeckSSF relative to Saleprice show clearly that houses with larger wood deck area garner higher prices. Two interesting observation stand out. Two house with nearly 750 square feet of wood deck were sold for under $200,000. These point could be overly influential in any model built from the data so we should remove it during our cleaning process.

3.2.10 SalePrice & OpenPorchSF

OpenPorchSF represents the open porch area in square feet of each house in the dataset. The point plots of OpenPorchSF relative to Saleprice show clearly that houses with larger open porch area garner higher prices. One interesting observation stand out. A house with more than 600 square feet of open porch was sold for under $50,000. This point could be overly influential in any model built from the data so we should remove it during our cleaning process. Besides, there are a lot of zero values, representing houses without an open porch (656 observations).

3.2.11 SalePrice & BsmtUnfSF

The point plots of BsmtUnfSF relative to Saleprice show that the house price remained relatively flat as the area of unfinished basement increases. Interestingly, two highest sold houses had BsmtUnfSF of more than 250 and nearly 100 square feet, respectively.

3.2.12 SalePrice & ScreenPorch

ScreenPorch represents the screen porch area in square feet of each house in the dataset. The point plots of ScreenPorch relative to Saleprice show that the house price remained relatively flat as the area of screen porch increases. Interestingly, there are a lot of observations with a value of screen porch area of 0, which represents houses without a screen porch (1344 observations). Therefore, this variable should be combined with another suitable variable for better analysis.

3.2.13 SalePrice & OpenScreenPorch

OpenScreenPorch column is created as the sum of OpenPorch and ScreenPorch. This variable represents the total area in squared feet of open and screen porch a house has.

The point plots of OpenScreenPorchSF relative to Saleprice show clearly that houses with larger the total area of open and screen porch garner higher prices. One interesting observation stand out. A house with more than 1000 square feet of open & screen porch was sold for under $200,000. This point could be overly influential in any model built from the data so we should remove it during our cleaning process. Additionally, after combining the two variables, there are still a lot of zero values (particularly 603 observations).

3.2.14 SalePrice & EnclosedPorch

EnclosedPorch respresents the enclosed porch area in square feet. The point plots of EnclosedPorch relative to Saleprice show that houses with larger area of enclosed porch garner slightly lower prices. There are a lot of 0 values of EnclosedPorch, respresenting houses with no enclosed porch.

3.2.15 SalePrice & LotFrontage

The point plots of LotFrontage relative to Saleprice show that houses with larger area of LotFrontage garner higher prices. There are two interesting observations: two houses with more than 300 feet of LotFrontage were sold for only around $200,000. These two outliers will possibly be replaced in the data cleaning process.

3.2.16 SalePrice & BsmtFinSF1

The point plots of BsmtFinSF1 relative to Saleprice show that houses with larger area of BsmtFinSF1 garner higher prices. There is one interesting observation: a house with 2,250 square feet of BsmtFinSF1 was sold for only around $200,000. This outlier will possibly be replaced in the data cleaning process.

3.2.17 SalePrice & RemodSold

The point plots of RemodSold relative to Saleprice show that houses with longer time gap between remodeling and selling would garner lower prices. However, the correlation seems to be stronger in the first 20 years, the later time is relatively stable.

3.3 Bivariate analysis between SalePrice and categorical variables

In this section, further one-to-one bivariate analysis with SalePrice and categorical variables available in the train dataset will be conducted. These categorical variables also include numeric variable with a few number of levels that may have different statistics. Levels of different categorical that are highly correlated with SalePrice (with the correlation above 0.09) will be shortlisted to make further analysis whether to be included in SalePrice prediction model.

After calculating the percentage of missing values in each categorical variables, we notice that Alley, PoolQC, Fence and MiscFeature have significantly high percentage of NAs (of more than 80% of the values in these columns are NAs). Therefore, these variables are ruled out from the bivariate analysis.

Categorical values with more than 50% of entries are NAs
na_per
Alley 0.9376712
PoolQC 0.9952055
Fence 0.8075342
MiscFeature 0.9630137

For other columns, all of the missing values are turned into “Unknown”.

3.3.1 SalePrice & Neighborhood

The boxplot shows that there is wide variation in the distribution of SalePrices across the neighborhoods. NridgHt, NoRidge, and StoneBr have the largest mean SalePrices. Interesting 6 of the 7 most expensive homes are located in NridgHt and NoRidge. MeadowV, IDOTRR, Brkside, and BrDale have the lowest mean SalePrice.

Finally, table 13 shows the correlation for each neighborhood with SalePrice. As the boxplot showed, NridgHt, NoRidge, StoneBr, Somerst and Timber have the strongest positive correlation with SalePrice at .40, .33, .22, .14 and .13 respectively. In contrast, BrDale, MeadowV, Sawyer, BrkSide, IDOTRR, Edwards, NAmes, and OldTown are all noticeably negatively correlated with SalePrice.

Table13: Neighborhood Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice Neighborhood_NridgHt 0.40
SalePrice Neighborhood_NoRidge 0.33
SalePrice Neighborhood_StoneBr 0.22
SalePrice Neighborhood_Somerst 0.14
SalePrice Neighborhood_Timber 0.13
SalePrice Neighborhood_CollgCr 0.07
SalePrice Neighborhood_Crawfor 0.07
SalePrice Neighborhood_ClearCr 0.06
SalePrice Neighborhood_Veenker 0.06
SalePrice Neighborhood_Gilbert 0.04
SalePrice Neighborhood_Blmngtn 0.02
SalePrice Neighborhood_NWAmes 0.02
SalePrice Neighborhood_SawyerW 0.01
SalePrice Neighborhood_Blueste -0.02
SalePrice Neighborhood_NPkVill -0.04
SalePrice Neighborhood_Mitchel -0.06
SalePrice Neighborhood_SWISU -0.06
SalePrice Neighborhood_BrDale -0.10
SalePrice Neighborhood_MeadowV -0.11
SalePrice Neighborhood_Sawyer -0.13
SalePrice Neighborhood_BrkSide -0.14
SalePrice Neighborhood_IDOTRR -0.16
SalePrice Neighborhood_Edwards -0.18
SalePrice Neighborhood_NAmes -0.19
SalePrice Neighborhood_OldTown -0.19

3.3.2 SalePrice & MSZoning

MSZoning represents the general zoning classification of each house in the dataset. According to the boxplot, FV and Rl have the largest mean SalePrices. Interestingly, 8 most expensive homes have MSZoning of RL. Finally, table 14 shows the correlation for each MSZoning with SalePrice. As the boxplot showed, MSZoning_RL and MSZoning_FV have the strongest positive correlation with SalePrice at .25 and .09 respectively. In contrast, MSZoning_C (all) and MSZoning_RM are noticeably negatively correlated with SalePrice.

Table14: MSZoning Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice MSZoning_RL 0.25
SalePrice MSZoning_FV 0.09
SalePrice MSZoning_RH -0.07
SalePrice MSZoning_C (all) -0.11
SalePrice MSZoning_RM -0.29

3.3.3 SalePrice & MSSubClass

According to the boxplot, MSSubClass_60 has the largest mean SalePrices. Interestingly, 2 most expensive homes have MSSubClass of 60.

Finally, table 15 shows the correlation for each MSSubClass with SalePrice. As the boxplot showed, MSSubClass_60 has the strongest positive correlation with SalePrice at 0.38. In contrast, MSSubClass_190, MSSubClass_90, MSSubClass_160, MSSubClass_50 and MSSubClass_30 are noticeably negatively correlated with SalePrice.

Table15: MSSubClass Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice MSSubClass_60 0.38
SalePrice MSSubClass_120 0.06
SalePrice MSSubClass_20 0.04
SalePrice MSSubClass_75 0.02
SalePrice MSSubClass_40 -0.02
SalePrice MSSubClass_80 -0.03
SalePrice MSSubClass_70 -0.04
SalePrice MSSubClass_85 -0.05
SalePrice MSSubClass_45 -0.08
SalePrice MSSubClass_180 -0.08
SalePrice MSSubClass_190 -0.09
SalePrice MSSubClass_90 -0.11
SalePrice MSSubClass_160 -0.11
SalePrice MSSubClass_50 -0.16
SalePrice MSSubClass_30 -0.24

3.3.4 SalePrice & Fireplaces

Fireplaces represents the number of fireplaces of each house in the dataset. According to the boxplot, Fireplaces_3 has the largest mean SalePrices. Interestingly, 2 most expensive homes have the number of fireplaces of 2.

Finally, table 16 shows the correlation for each Fireplaces with SalePrice. Fireplaces_1 and Fireplaces_3 have the strongest positive correlation with SalePrice at 0.35 and 0.22 respectively. In contrast, Fireplaces_0 is noticeably negatively correlated with SalePrice at -0.47.

Table16: Fireplaces Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice Fireplaces_1 0.35
SalePrice Fireplaces_2 0.22
SalePrice Fireplaces_3 0.05
SalePrice Fireplaces_0 -0.47

3.3.5 SalePrice & TotalFullBath

According to the correlation table, TotalFullBath_3 and TotalFullBath_4 have the strongest positive correlation with SalePrice at 0.49 and 0.19 respectively. In contrast, TotalFullBath_1 is noticeably negatively correlated with SalePrice at -0.42.

Table17: TotalFullBath Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice TotalFullBath_5 0.32
SalePrice TotalFullBath_2 -0.03
SalePrice TotalFullBath_1 -0.04
SalePrice TotalFullBath_3 -0.13
SalePrice TotalFullBath_4 -0.15
SalePrice TotalFullBath_6 -0.16

3.3.6 SalePrice & GarageCars

GarageCars represents the size of garage in car capacity of each house in the dataset. According to the correlation table, GarageCars_3 has the strongest positive correlation with SalePrice at 0.61. In contrast, GarageCars_0 and GarageCars_1 are noticeably negatively correlated with SalePrice at -0.24 and -0.39 respectively.

Table18: GarageCars Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice GarageCars_3 0.61
SalePrice GarageCars_2 0.04
SalePrice GarageCars_4 0.01
SalePrice GarageCars_0 -0.24
SalePrice GarageCars_1 -0.39

3.3.7 SalePrice & OverallCond

OverallCond represents the overall condition rating of each house in the dataset. According to the correlation table, OverallCond_5 has the strongest positive correlation with SalePrice at 0.32. In contrast, OverallCond_7, OverallCond_3, OverallCond_4 and OverallCond_6 are noticeably negatively correlated with SalePrice at -0.12, -0.13, -0.15 and -0.16 respectively.

Table19: OverallCond Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice OverallCond_5 0.32
SalePrice OverallCond_9 0.05
SalePrice OverallCond_2 -0.03
SalePrice OverallCond_1 -0.04
SalePrice OverallCond_8 -0.07
SalePrice OverallCond_7 -0.12
SalePrice OverallCond_3 -0.13
SalePrice OverallCond_4 -0.15
SalePrice OverallCond_6 -0.16

3.3.8 SalePrice & LotShape

LotShape represents the general shape of property of each house in the dataset. According to the correlation table, LotShape_IR1 and LotShape_IR2 have the strongest positive correlation with SalePrice at 0.22 and 0.13 respectively. In contrast, LotShape_Reg is noticeably negatively correlated with SalePrice at -0.24 and -0.27.

Table20: LotShape Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice LotShape_IR1 0.22
SalePrice LotShape_IR2 0.13
SalePrice LotShape_IR3 0.04
SalePrice LotShape_Reg -0.27

3.3.9 SalePrice & RoofStyle

RoofStyle represents the type of roof of each house in the dataset. According to the correlation table, RoofStyle_Hip has the strongest positive correlation with SalePrice at 0.24. In contrast, RoofStyle_Gable is noticeably negatively correlated with SalePrice at -0.22.

Table21: RoofStyle Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice RoofStyle_Hip 0.24
SalePrice RoofStyle_Flat 0.02
SalePrice RoofStyle_Shed 0.02
SalePrice RoofStyle_Mansard 0.00
SalePrice RoofStyle_Gambrel -0.04
SalePrice RoofStyle_Gable -0.22

3.3.10 SalePrice & ExterQual

ExterQual represents the exterior material quality of each house in the dataset. According to the correlation table, ExterQual_Ex and ExterQual_Gd have the strongest positive correlation with SalePrice at 0.45. In contrast, ExterQual_Fa and ExterQual_TA are noticeably negatively correlated with SalePrice at -0.12 and -0.59 respectively.

Table22: ExterQual Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice ExterQual_Ex 0.45
SalePrice ExterQual_Gd 0.45
SalePrice ExterQual_Fa -0.12
SalePrice ExterQual_TA -0.59

3.3.11 SalePrice & Foundation

Foundation represents the type of foundation of each house in the dataset. According to the correlation table, Foundation_PConc has the strongest positive correlation with SalePrice at 0.5. In contrast, Foundation_Slab, Foundation_BrkTil and Foundation_CBlock are noticeably negatively correlated with SalePrice at -0.12, -0.2 and -0.34 respectively.

Table23: Foundation Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice Foundation_PConc 0.50
SalePrice Foundation_Wood 0.00
SalePrice Foundation_Stone -0.01
SalePrice Foundation_Slab -0.12
SalePrice Foundation_BrkTil -0.20
SalePrice Foundation_CBlock -0.34

3.3.12 SalePrice & HeatingQC

HeatingQC variable represents the heating quality and condition of each house in the dataset. According to the correlation table, HeatingQC_Ex has the strongest positive correlation with SalePrice at 0.43. In contrast, HeatingQC_Fa, HeatingQC_Gd and HeatingQC_TA are noticeably negatively correlated with SalePrice at -0.13, -0.13 and -0.31 respectively.

Table24: HeatingQC Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice HeatingQC_Ex 0.43
SalePrice HeatingQC_Po -0.03
SalePrice HeatingQC_Fa -0.13
SalePrice HeatingQC_Gd -0.13
SalePrice HeatingQC_TA -0.31

3.3.13 SalePrice & SaleType

SaleType represents the type of sale of each house in the dataset. According to the correlation table, SaleType_New has the strongest positive correlation with SalePrice at 0.36. In contrast, SaleType_WD is noticeably negatively correlated with SalePrice at -0.24.

Table25: SaleType Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice SaleType_New 0.36
SalePrice SaleType_Con 0.04
SalePrice SaleType_CWD 0.02
SalePrice SaleType_ConLI 0.01
SalePrice SaleType_ConLw -0.03
SalePrice SaleType_Oth -0.03
SalePrice SaleType_ConLD -0.04
SalePrice SaleType_COD -0.08
SalePrice SaleType_WD -0.24

3.3.14 SalePrice & SaleCondition

SaleCondition represents the condition of sale of each house in the dataset. According to the correlation table, SaleCondition_Partial has the strongest positive correlation with SalePrice at 0.35. In contrast, SaleCondition_Abnorml and SaleCondition_Normal are noticeably negatively correlated with SalePrice at -0.12 and -0.15 respectively.

Table26: SaleCondition Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice SaleCondition_Partial 0.35
SalePrice SaleCondition_Alloca -0.02
SalePrice SaleCondition_AdjLand -0.05
SalePrice SaleCondition_Family -0.05
SalePrice SaleCondition_Abnorml -0.12
SalePrice SaleCondition_Normal -0.15

3.3.15 SalePrice & HouseStyle

According to the correlation table, HouseStyle_2Story has the strongest positive correlation with SalePrice at 0.24. In contrast, HouseStyle_1.5Unf, HouseStyle_SFoyer and HouseStyle_1.5Fin are noticeably negatively correlated with SalePrice at -0.09, -0.09 and -0.16 respectively.

Table27: HouseStyle Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice HouseStyle_2Story 0.24
SalePrice HouseStyle_2.5Fin 0.04
SalePrice HouseStyle_2.5Unf -0.03
SalePrice HouseStyle_SLvl -0.04
SalePrice HouseStyle_1Story -0.06
SalePrice HouseStyle_1.5Unf -0.09
SalePrice HouseStyle_SFoyer -0.09
SalePrice HouseStyle_1.5Fin -0.16

3.3.16 SalePrice & MasVnrType

According to the correlation table, MasVnrType_Stone and MasVnrType_BrkFace have the strongest positive correlation with SalePrice at 0.33 and 0.2. In contrast, MasVnrType_None is noticeably negatively correlated with SalePrice at -0.37.

Table28: MasVnrType Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice MasVnrType_Stone 0.33
SalePrice MasVnrType_BrkFace 0.20
SalePrice MasVnrType_Unknown 0.05
SalePrice MasVnrType_BrkCmn -0.04
SalePrice MasVnrType_None -0.37

3.3.17 SalePrice & LotConfig

According to the correlation table, LotConfig_CulDSac has the strongest correlation with SalePrice at 0.14.

Table29: LotConfig Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice LotConfig_CulDSac 0.14
SalePrice LotConfig_FR3 0.02
SalePrice LotConfig_Corner 0.00
SalePrice LotConfig_FR2 -0.01
SalePrice LotConfig_Inside -0.08

3.3.18 SalePrice & Exterior1st

According to the correlation table, Exterior1st_VinylSd and Exterior1st_CemntBd have the strongest positive correlation with SalePrice at 0.31 and 0.13. In contrast, Exterior1st_HdBoard, Exterior1st_AsbShng, Exterior1st_Wd Sdng and Exterior1st_MetalSd are noticeably negatively correlated with SalePrice at -0.10, -0.11, -0.16 and -0.17 respectively.

Table30: Exterior1st Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice Exterior1st_VinylSd 0.31
SalePrice Exterior1st_CemntBd 0.13
SalePrice Exterior1st_Stone 0.04
SalePrice Exterior1st_BrkFace 0.03
SalePrice Exterior1st_ImStucc 0.03
SalePrice Exterior1st_Plywood -0.02
SalePrice Exterior1st_AsphShn -0.03
SalePrice Exterior1st_CBlock -0.03
SalePrice Exterior1st_Stucco -0.03
SalePrice Exterior1st_BrkComm -0.05
SalePrice Exterior1st_WdShing -0.05
SalePrice Exterior1st_HdBoard -0.10
SalePrice Exterior1st_AsbShng -0.11
SalePrice Exterior1st_Wd Sdng -0.16
SalePrice Exterior1st_MetalSd -0.17

3.3.19 SalePrice & BedroomAbvGr

BedroomAbvGr represents the number of bedrooms above basement level of each house in the dataset. According to the correlation table, BedroomAbvGr_4 has the strongest positive correlation with SalePrice at 0.21. In contrast, BedroomAbvGr_2 is noticeably negatively correlated with SalePrice at -0.16.

Table31: BedroomAbvGr Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice BedroomAbvGr_4 0.21
SalePrice BedroomAbvGr_0 0.03
SalePrice BedroomAbvGr_8 0.01
SalePrice BedroomAbvGr_3 0.00
SalePrice BedroomAbvGr_5 0.00
SalePrice BedroomAbvGr_1 -0.02
SalePrice BedroomAbvGr_6 -0.03
SalePrice BedroomAbvGr_2 -0.16

3.3.20 SalePrice & TotRmsAbvGrd

According to the correlation table, TotRmsAbvGrd_10, TotRmsAbvGrd_9, TotRmsAbvGrd_11, TotRmsAbvGrd_8, TotRmsAbvGrd_7 and TotRmsAbvGrd_12 have the strongest positive correlation with SalePrice at 0.26, 0.21, 0.19, 0.16, 0.11 and 0.11 respectively. In contrast, TotRmsAbvGrd_3, TotRmsAbvGrd_6, TotRmsAbvGrd_4 and TotRmsAbvGrd_5 are noticeably negatively correlated with SalePrice at -0.10, -0.15, -0.2 and -0.24 respectively.

Table32: TotRmsAbvGrd Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice TotRmsAbvGrd_10 0.26
SalePrice TotRmsAbvGrd_9 0.21
SalePrice TotRmsAbvGrd_11 0.19
SalePrice TotRmsAbvGrd_8 0.16
SalePrice TotRmsAbvGrd_7 0.11
SalePrice TotRmsAbvGrd_12 0.11
SalePrice TotRmsAbvGrd_14 0.01
SalePrice TotRmsAbvGrd_2 -0.05
SalePrice TotRmsAbvGrd_3 -0.10
SalePrice TotRmsAbvGrd_6 -0.15
SalePrice TotRmsAbvGrd_4 -0.20
SalePrice TotRmsAbvGrd_5 -0.24

3.3.21 SalePrice & BldgType

According to the correlation table, 1Fam, 2fmCon, Twnhs, and Duplex have the strongest correlation with SalePrice at 0.14, -0.10, -0.10 and -0.11 respectively. In contrast, TwnhsE does not have any noticeable correlation with SalePrice.

Table33: BldgType Correlations with SalePrice
DependentVar IndependentVar Correlation
SalePrice BldgType_1Fam 0.14
SalePrice BldgType_TwnhsE 0.00
SalePrice BldgType_2fmCon -0.10
SalePrice BldgType_Twnhs -0.10
SalePrice BldgType_Duplex -0.11

4 Data Cleaning

In this section, data cleaning will be conducted to:

  • Work on outliers in some numeric variables that may be overly influential in the final SalePrice prediction model by either removing them or replacing them with another values.
  • Work on missing values:
    • Replace them with another suitable value for numeric variables.
    • Replace them with “Unknown” for categorical variables.
  • Group the non-meaningful levels in different categorical variables (levels with correlation with SalePrice below 0.09) in to one level called “Other”.

4.1 MasVnrArea

MasVnrArea contained 8 missing values. Therefore, these missing values should be updated. One possible solution is to replace them with the mean MasVnrArea based on the specific house style.

After cleaning, the correlation between SalePrice and MasvnrArea is around 0.48.

4.2 OverallQual

Based on the bivariate analysis between SalePrice and OverallQual, two outliers are noticeable: Two houses that have OverallQual around 10, were sold for less than $200,000. These two points can either be removed or replaced by other values. To keep most of the data available, I will replace them with the average OverallQual values in their Neighborhood.

After replacing the two outliers, the correlation between SalePrice and OverallQual increases slightly (from 0.791 to 0.796)

4.3 GrLivArea

Based on the bivariate analysis between SalePrice and GrLivArea, two outliers are noticeable: Two houses that are both greater than 4500 square feet, were sold for less than $200,000. These values will be replaced by the average GrLivArea values of their HouseStyle.

After replacing the two outliers, the correlation between SalePrice and GrLivArea increases slightly (from 0.709 to 0.736)

4.4 LotArea

Based on the bivariate analysis between SalePrice and LotArea, three outliers are noticeable: Three houses that have lot area greater than 150,000 square feet, were sold for less than $400,000.

These three outliers will be replaced by the average LotArea based on their HouseStyle.

After replacing the two outliers, the correlation between SalePrice and LotArea increases noticeably (from 0.26 to 0.39)

4.5 AgeSold

Based on the bivariate analysis between SalePrice and AgeSold, one outlier is noticeable: One house that was built 110 years ago was sold for nearly $500,000. This outlier will be replaced by the average AgeSold of the house neighborhood.

After replacing the two outliers, the correlation between SalePrice and LotArea decreases slightly (from -0.52 to -0.57).

4.6 WoodDeckSF

Based on the bivariate analysis between SalePrice and WoodDeckSF, two outliers are noticeable: Two house with nearly 750 square feet of wood deck were sold for under $200,000. These outliers will be replaced by the average WoodDeckSF of the house styles.

After replacing the two outliers, the correlation between SalePrice and WoodDeckSF increases slightly (from 0.32 to 0.33).

4.7 Neighborhood

There are 25 different Neighborhoods in the dataset, but only 13 have a meaningful relationship with SalePrice, including NridgHt, NoRidge, StoneBr, Somerst, Timber, BrDale, MeadowV, Sawyer, BrkSide, IDOTRR, Edwards, NAmes, and OldTown. Therefore, the rest will be group into a group called “Other”.

4.8 MSSubClass

There are 15 different MSSubClass in the dataset, but only 6 have a meaningful relationship with SalePrice, including “60”,“190”,“90”,“160”,“50” and “30”. Therefore, the rest will be group into a group called “Other”.

4.9 TotalFullBath

There are 6 TotalFullBath values, but only 3 have a meaningful relationship with SalePrice, including “1”, “3”, “4”. Therefore, the rest will be group into a group called “Other”.

4.10 OverallCond

There are 10 TotalFullBath values, but only 5 have a meaningful relationship with SalePrice, including “3”, “4”, “5”, “6” and “7”. Therefore, the rest will be group into a group called “Other”.

4.11 SaleType

There are 9 TotalFullBath values, but only 2 have a meaningful relationship with SalePrice, including “New” and “WD”. Therefore, the rest will be group into a group called “Other”.

4.12 Fireplaces

Among 4 numbers of Fireplaces, one, three and zero fireplaces have a meaningful relationship with SalePrice. Two fireplaces will be turned into the “Other” level.

4.13 GarageCars

There are 5 GarageCars values, but only 3 have a meaningful relationship with SalePrice, including “0”, “1”, “3”. Therefore, the rest will be group into a group called “Other”.

4.14 LotShape

Among 4 numbers of Fireplaces, IR1, IR2 and Reg have a meaningful relationship with SalePrice. IR3 will be turned into the “Other” level.

4.15 RoofStyle

There are 6 RoofStyle values, but only 2 have a meaningful relationship with SalePrice, including “Hip” and “Gable”. Therefore, the rest will be group into a group called “Other”.

4.16 Foundation

There are 6 RoofStyle values, but only 4 have a meaningful relationship with SalePrice, including “PConc”, “Slab”, “BrkTil”, “CBlock”. Therefore, the rest will be group into a group called “Other”.

4.17 HeatingQC

Among 4 numbers of HeatingQC, Ex, Fa, Gd and TA have a meaningful relationship with SalePrice. Po HeatingQC will be turned into the “Other” level.

4.18 SaleCondition

There are 6 SaleCondition values, but only 3 have a meaningful relationship with SalePrice, including “Partial”, “Abnorml”, “Normal”. Therefore, the rest will be group into a group called “Other”.

4.19 HouseStyle

There are 8 HouseStyle values, but only 4 have a meaningful relationship with SalePrice, including “Partial”, “Abnorml”, “Normal”. Therefore, the rest will be group into a group called “Other”.

4.20 MasVnrType

There are 5 HouseStyle values, but only 3 have a meaningful relationship with SalePrice, including “Stone”, “BrkFace”, “None”. Therefore, the rest will be group into a group called “Other”.

4.21 LotConfig

There are 5 LotConfig values, but only 1 has a meaningful relationship with SalePrice - “CulDSac”. Therefore, the rest will be group into a group called “Other”.

4.22 Exterior1st

There are 15 Exterior1st values, but only 1 has a meaningful relationship with SalePrice, including VinylSd, CemntBd, HdBoard, AsbShng, Wd Sdng, MetalSd. Therefore, the rest will be group into a group called “Other”.

4.23 BedroomAbvGr

There are 8 BedroomAbvGr values, but only 2 have a meaningful relationship with SalePrice, including “2” and “4”. Therefore, the rest will be group into a group called “Other”.

4.24 TotRmsAbvGrd

There are 12 BedroomAbvGr values, 10 have a meaningful relationship with SalePrice, including 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. Therefore, the rest will be group into a group called “Other”.

4.25 LotFrontage

Because LotFrontage has na_count missing values, these values will be replaced by the mean LotFrontage of the specific neighborhood.

4.26 BldgType

There are 5 BldgType values, among which TwnhsE has insignificant correlation with SalePrice. Therefore, this value will be turned into a level called “Other”.

5 Linear Model

33 variables chosen to be the predictors in SalePrice prediction model include:

  • GarageArea
  • LotArea
  • AgeSold
  • WoodDeckSF
  • BsmtUnfSF
  • OpenScreenPorch
  • Neighborhood
  • MSZoning
  • MSSubClass
  • Fireplaces
  • TotalFullBath
  • LotShape
  • ExterQual
  • SaleType
  • MasVnrType
  • LotConfig
  • Exterior1st
  • PoolArea
  • CentralAir
  • PavedDrive
  • GarageCars
  • Fence
  • HouseStyle
  • LotFrontage
  • BldgType
  • LandContour
  • BsmtFinSF1
  • GrLivArea_transformed
  • TotalBsmtSF_transformed
  • X1stFlrSF
  • X2ndFlrSF_transformed
  • OverallQualFactor
  • RemodSold_transformed

Multi-variable linear model for SalePrice_transformed:

Running the multi-variate linear model between SalePrice_transformed and the 33 predictors would return a model with an R-squared of 0.9144 and an adjusted R-squared of 0.9105. In other words, the 33 variables included in the model explained approximately 91% of the variance in SalePrice. The model was significatn with F = 231.2, p < 0.001. The median of the residuals was very close to zero. Among 33 predictors, 10 make minimal contribution even though they have significant correlation with SalePrice. This happens because their statistical significance are not as strong as other chosen predictors.

  • Note: In experimenting the model, observation in row 1069 was proven to be an outlier that is overinfluential to the model. Removing the observation results in a higher R-squared model.
## 
## Call:
## lm(formula = SalePrice_transformed ~ ., data = model_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35905 -0.02488  0.00334  0.02888  0.16367 
## 
## Coefficients:
##                              Estimate    Std. Error t value
## (Intercept)              3.5122536729  0.1195226192  29.386
## GarageArea               0.0000365081  0.0000164848   2.215
## LotArea                  0.0000009692  0.0000001956   4.956
## AgeSold                 -0.0002886972  0.0001218927  -2.368
## WoodDeckSF               0.0000534309  0.0000134147   3.983
## BsmtUnfSF               -0.0000045802  0.0000087271  -0.525
## OpenScreenPorch          0.0000895581  0.0000180493   4.962
## NeighborhoodNoRidge      0.0421379870  0.0102786555   4.100
## NeighborhoodOldTown     -0.0239080216  0.0072676928  -3.290
## NeighborhoodNridgHt      0.0273388355  0.0084513357   3.235
## NeighborhoodEdwards     -0.0191958025  0.0061205926  -3.136
## NeighborhoodStoneBr      0.0504179674  0.0121820317   4.139
## MSZoningFV               0.2090336184  0.0197976999  10.558
## MSZoningRH               0.1647662767  0.0232699123   7.081
## MSZoningRL               0.1834033608  0.0182409110  10.055
## MSZoningRM               0.1638870060  0.0184735598   8.871
## MSSubClass60            -0.0133692913  0.0087874385  -1.521
## MSSubClass30            -0.0225775219  0.0084329439  -2.677
## MSSubClass160           -0.0490015177  0.0114304570  -4.287
## Fireplaces1              0.0126410684  0.0037574577   3.364
## Fireplaces2              0.0289492680  0.0070526426   4.105
## TotalFullBath1          -0.0061780183  0.0043719766  -1.413
## TotalFullBath3           0.0061464958  0.0048272670   1.273
## LotShapeIR2              0.0143974404  0.0101421153   1.420
## ExterQualTA             -0.0116500600  0.0052836017  -2.205
## ExterQualEx              0.0242168280  0.0115357810   2.099
## SaleTypeNew              0.0030311754  0.0069596478   0.436
## MasVnrTypeStone          0.0487439568  0.0137490856   3.545
## MasVnrTypeBrkFace        0.0430526260  0.0128014432   3.363
## MasVnrTypeNone           0.0456751153  0.0125863029   3.629
## LotConfigCulDSac         0.0184382068  0.0072309923   2.550
## Exterior1stHdBoard      -0.0097469722  0.0047467917  -2.053
## Exterior1stWd Sdng      -0.0117935482  0.0048287719  -2.442
## PoolArea555              0.1099156644  0.0577907191   1.902
## CentralAirY              0.0373026997  0.0070461067   5.294
## PavedDriveY              0.0126342340  0.0110481033   1.144
## PavedDriveN             -0.0005248141  0.0124292870  -0.042
## GarageCars0             -0.0369556495  0.0111244011  -3.322
## GarageCars1             -0.0062628902  0.0055996361  -1.118
## GarageCars3              0.0117957936  0.0075472544   1.563
## FenceGdWo               -0.0203638022  0.0080225121  -2.538
## HouseStyle1.5Fin        -0.0191158744  0.0069938029  -2.733
## LotFrontage              0.0001353186  0.0000900476   1.503
## BldgType1Fam             0.0208088543  0.0055470632   3.751
## LandContourHLS           0.0146413547  0.0084231807   1.738
## BsmtFinSF1               0.0000412186  0.0000089682   4.596
## GrLivArea_transformed    0.1676985713  0.0180889203   9.271
## TotalBsmtSF_transformed  0.0057104230  0.0013758605   4.150
## X1stFlrSF                0.0000004218  0.0000146489   0.029
## X2ndFlrSF_transformed    0.0001653528  0.0005225978   0.316
## OverallQualFactor3       0.0509089383  0.0279665008   1.820
## OverallQualFactor4       0.0927783360  0.0260791877   3.558
## OverallQualFactor5       0.1148530978  0.0263418909   4.360
## OverallQualFactor6       0.1395246557  0.0265676159   5.252
## OverallQualFactor7       0.1670171654  0.0268985399   6.209
## OverallQualFactor8       0.1999161015  0.0274659538   7.279
## OverallQualFactor9       0.2490727366  0.0294692461   8.452
## OverallQualFactor10      0.2780381659  0.0325967111   8.530
## RemodSold_transformed   -0.0173037859  0.0021496115  -8.050
##                                     Pr(>|t|)    
## (Intercept)             < 0.0000000000000002 ***
## GarageArea                          0.026964 *  
## LotArea                  0.00000081966148790 ***
## AgeSold                             0.018013 *  
## WoodDeckSF               0.00007195666198071 ***
## BsmtUnfSF                           0.599802    
## OpenScreenPorch          0.00000079417818437 ***
## NeighborhoodNoRidge      0.00004405348012279 ***
## NeighborhoodOldTown                 0.001031 ** 
## NeighborhoodNridgHt                 0.001249 ** 
## NeighborhoodEdwards                 0.001751 ** 
## NeighborhoodStoneBr      0.00003725483258440 ***
## MSZoningFV              < 0.0000000000000002 ***
## MSZoningRH               0.00000000000238250 ***
## MSZoningRL              < 0.0000000000000002 ***
## MSZoningRM              < 0.0000000000000002 ***
## MSSubClass60                        0.128409    
## MSSubClass30                        0.007519 ** 
## MSSubClass160            0.00001950308409929 ***
## Fireplaces1                         0.000791 ***
## Fireplaces2              0.00004309105068371 ***
## TotalFullBath1                      0.157876    
## TotalFullBath3                      0.203152    
## LotShapeIR2                         0.155981    
## ExterQualTA                         0.027639 *  
## ExterQualEx                         0.035992 *  
## SaleTypeNew                         0.663248    
## MasVnrTypeStone                     0.000407 ***
## MasVnrTypeBrkFace                   0.000794 ***
## MasVnrTypeNone                      0.000296 ***
## LotConfigCulDSac                    0.010894 *  
## Exterior1stHdBoard                  0.040243 *  
## Exterior1stWd Sdng                  0.014729 *  
## PoolArea555                         0.057405 .  
## CentralAirY              0.00000014108137116 ***
## PavedDriveY                         0.253022    
## PavedDriveN                         0.966327    
## GarageCars0                         0.000919 ***
## GarageCars1                         0.263591    
## GarageCars3                         0.118322    
## FenceGdWo                           0.011258 *  
## HouseStyle1.5Fin                    0.006359 ** 
## LotFrontage                         0.133156    
## BldgType1Fam                        0.000184 ***
## LandContourHLS                      0.082417 .  
## BsmtFinSF1               0.00000473959556518 ***
## GrLivArea_transformed   < 0.0000000000000002 ***
## TotalBsmtSF_transformed  0.00003542178495436 ***
## X1stFlrSF                           0.977031    
## X2ndFlrSF_transformed               0.751747    
## OverallQualFactor3                  0.068943 .  
## OverallQualFactor4                  0.000388 ***
## OverallQualFactor5       0.00001406450409477 ***
## OverallQualFactor6       0.00000017687825412 ***
## OverallQualFactor7       0.00000000072277667 ***
## OverallQualFactor8       0.00000000000059271 ***
## OverallQualFactor9      < 0.0000000000000002 ***
## OverallQualFactor10     < 0.0000000000000002 ***
## RemodSold_transformed    0.00000000000000191 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05334 on 1255 degrees of freedom
##   (145 observations deleted due to missingness)
## Multiple R-squared:  0.9144, Adjusted R-squared:  0.9105 
## F-statistic: 231.2 on 58 and 1255 DF,  p-value: < 0.00000000000000022

Variance Inflation Factor of the linear model:

Variance inflation factor (VIF) is a way to detect multicollinearity within many predictors between a model to see if the model is overstating the prediction. Calculating VIF for every choen predictors in the SalePrice predictiion model, it can be concluded that there is no multicollinearity within the predictors. Because no predictor has a score exceeding 4.1, meanwhile the threshold in this case is 6.0.

##                              GVIF Df GVIF^(1/(2*Df))
## GarageArea               5.870681  1        2.422949
## LotArea                  1.344816  1        1.159662
## AgeSold                  6.603545  1        2.569736
## WoodDeckSF               1.239698  1        1.113417
## BsmtUnfSF                6.962755  1        2.638703
## OpenScreenPorch          1.208743  1        1.099429
## Neighborhood             6.793364  5        1.211179
## MSZoning                 3.638147  4        1.175195
## MSSubClass              15.439574  3        1.577996
## Fireplaces               1.892270  2        1.172859
## TotalFullBath            2.827840  2        1.296772
## LotShape                 1.094249  1        1.046064
## ExterQual                6.110426  2        1.572236
## SaleType                 1.856226  1        1.362434
## MasVnrType               2.088637  3        1.130604
## LotConfig                1.151927  1        1.073279
## Exterior1st              1.590798  2        1.123062
## PoolArea                 1.172875  1        1.082993
## CentralAir               1.522849  1        1.234038
## PavedDrive               1.571140  2        1.119576
## GarageCars              12.404889  3        1.521477
## Fence                    1.067033  1        1.032973
## HouseStyle               2.257656  1        1.502550
## LotFrontage              1.870442  1        1.367641
## BldgType                 2.002288  1        1.415022
## LandContour              1.130049  1        1.063038
## BsmtFinSF1               6.944447  1        2.635232
## GrLivArea_transformed   16.577368  1        4.071531
## TotalBsmtSF_transformed  3.598660  1        1.897013
## X1stFlrSF               13.767244  1        3.710424
## X2ndFlrSF_transformed   15.895099  1        3.986866
## OverallQualFactor       20.998535  8        1.209586
## RemodSold_transformed    3.030923  1        1.740955

The four diagnostic plots demonstrates no concerns with the model:

  • The Residual vs Fitted plot plots the model residuals against corresponding Fitted value. The Residual vs Fitted plot below indicated that the residuals were normally distributed with a mean close to 0.
  • The Normal QQ plot shows the standardized residuals relative to the theoretical quantiles of normality. Since most of the points fall on the dotted line, there is no possibility of multicollinearity issues. Even though the ends of the plotted points deviate away from the line, the points do not form a distinct S shape. Therefore, it would not cause any problem.
  • The Scale-Location plot shows the standardized residuals relative to the fitted values. Because there is randomness in the point distribution, there should not be any significant problem in this model.
  • The Residuals vs Leverage plot shows the relationship between each residual and the leverage that each row had on the model. Rows that have a level of leverage that might be problematic will be shown outside of a red dashed line representing Cook’s distance, which is a statistical message of problematic levels of leverage. Since there is no observation being plotted outside of a red dashed line, the model is not over-influenced by problematic points.