This document details the univariate of the 24 columns and bivariate exploration of variables with noticeable correlation with house sale price (38 columns) of the Kaggle Advanced Regression Housing Price competition train.csv dataset. The 24 columns representing 24 house characteristics include:
SalePriceOverallQualNeighborhoodMSSubClassTotalBath: Number of full bath (FullBath +
BsmtFullBath + HalfBath +
BsmtHalfBath)GarageArea: Size of garage in square feetTotalBsmtSF1stFlrSF2ndFlrSFGrLivAreaLotArea: Lot size in square feetLotConfig: Lot configurationAgeSold: House age (YrSold -
YearBuilt)Exterior1st: Exterior covering on houseHouseStyle: Style of dwellingMasVnrType: Masonry veneer typeMasVnrArea: Masonry veneer area in square feetBedroomAbvGr: Number of bedrooms above basement
levelKitchenQual: Kitchen qualityTotRmsAbvGrd: Total rooms above grade (does not include
bathrooms)LotFrontage: Linear feet of street connected to
propertyBldgType: Type of dwellingRemodSold: The time difference in years between when a
house was remodeled and when it was sold (YrSold -
YearRemodAdd)BsmtFinSF1: Quality of basement finished area in square
feetThe original data set consists of 1460 rows and 81 of data. The first part of this document provides the univariate results for each of the 24 variables. Next, the bivariate analysis is conducted along with a discussion of which predictors were found to be most strongly related to SalePrice. Finally, we note any significant relationships that exist between the shortlisted predictor variables.
SalePriceSaleprice represents the dollar value that each house in the sample sold for. Of the 1460 prices included in the column, the minimum saleprice was $34900, the mean was $180921, the median was $163000, the max was $755000, and the standard deviation was $79443. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most SalePrices falling between $100,000 and $200,000. However, the column also contains several very large values that range from approximately $375,000 to $750,000. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high sale prices.
From the density and boxplot, we can see that SalePrice
distribution has a right skew. Therefore, non-linear transformation
should be applied to bring the data into a more normally distributed
shape. Since log10 transformation is both suitable for monetary
variables (like SalePrice) and able to shift data shape to
the right, this transformation will be implemented on
SalePrice.
OverallQualOverallQual represents the material and finish quality of each house in the sample. Of the 1460 prices included in the column, the minimum OverallQual was 1, the mean was 6, the median was 6, the max was 10, and the standard deviation was 1. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was imputed with integers, because there are some troughs between the whole numbers such as 5,6 and 7.
The results show that most houses have an overall quality of 5-7. The column contains no unusual or missing values that need cleaning. It looks like only a few rows have OverallQual ratings less than 4 and above 9.
NeighborhoodThe neighborhood categorical variable represents the area in Ames, Iowa where each house exists. In total, 25 unique neighborhoods exist in the file. Several of those neighborhood contain very few observations. For example, Blmngton, BrDale, and Bluest each represented less than 1% of the data, while 15% of the observations are in the NAmes neighborhood.
| Neighborhood | Proportion |
|---|---|
| NAmes | 0.15 |
| CollgCr | 0.10 |
| OldTown | 0.08 |
| Edwards | 0.07 |
| Somerst | 0.06 |
| Gilbert | 0.05 |
| NridgHt | 0.05 |
| NWAmes | 0.05 |
| Sawyer | 0.05 |
| BrkSide | 0.04 |
| SawyerW | 0.04 |
| Crawfor | 0.03 |
| IDOTRR | 0.03 |
| Mitchel | 0.03 |
| NoRidge | 0.03 |
| Timber | 0.03 |
| ClearCr | 0.02 |
| StoneBr | 0.02 |
| SWISU | 0.02 |
| Blmngtn | 0.01 |
| BrDale | 0.01 |
| MeadowV | 0.01 |
| NPkVill | 0.01 |
| Veenker | 0.01 |
| Blueste | 0.00 |
Since there are a lot of neighborhoods, further correlation analysis should be conducted to decide which neighbors are strongly correlated with SalePrice to be kept in the prediction model. Grouping of meaningful and non-meaningful neighborhood predictors can be conducted in further data cleaning process.
MSSubClassThe MSSubClass categorical variable represents the the building class
of each house. Even though MSSubClass is a numeric column
in the dataset, it should be treated as a catergorical variable.
Because, the values of MSSubClass represent different
MSSubClass types instead of having a numeric meaning.
In total, 15 unique building classes exist in the file. Several of those neighborhood contain very few observations. For example, 40 represented close to 0% of the data, while 37% of the observations are of 20 building class.
| MSSubClass | Proportion |
|---|---|
| 20 | 0.37 |
| 60 | 0.20 |
| 50 | 0.10 |
| 120 | 0.06 |
| 30 | 0.05 |
| 70 | 0.04 |
| 80 | 0.04 |
| 90 | 0.04 |
| 160 | 0.04 |
| 190 | 0.02 |
| 45 | 0.01 |
| 75 | 0.01 |
| 85 | 0.01 |
| 180 | 0.01 |
| 40 | 0.00 |
Since there are a lot of MSSubClass, further correlation analysis should be conducted to decide which building classes are strongly correlated with SalePrice to be kept in the prediction model. Grouping of meaningful and non-meaningful MSSubClass predictors can be conducted in further data cleaning process.
TotalFullBathThe TotalFullBath represents the total number of full bathrooms of
each house, including both full bathrooms above ground and under
basement. This column is the sum of FullBath and
BsmtFullBath. Even though TotalFullBath is
returned as a numeric variable after calculation, the column should be
treated as categorical to see the detailed statistics for each level of
total number of full bathrooms.
| TotalFullBath | Proportion |
|---|---|
| 2 | 0.51 |
| 1 | 0.25 |
| 3 | 0.22 |
| 4 | 0.01 |
| 0 | 0.00 |
| 6 | 0.00 |
In total, 6 total number of full bathrooms exist in the file. Several of those TotalFullBath contain very few observations. For example, 0, 4, 6 total numbers of full bath represented equal or lower than 1% of the data, while 51% of the observations have 2 full bathrooms in total.
GarageAreaThe GarageArea represents the size of garage in square feet of each house. Of the 1460 prices included in the column, the minimum GarageArea was 0, the mean was 473, the median was 480, the max was 1418, and the standard deviation was 214. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most GarageArea falling between 300 and 600. However, the column also contains several large values that range from approximately 1000 to 1450. This indicates that we will need to be judicious in our exploration to identify outlier based on high GarageArea values.
From the density and boxplot, we can see that GarageArea
distribution has a right skew. Therefore, non-linear transformation
should be applied to bring the data into a more normally distributed
shape. Since square root transformation can shift data shape to the
right, this transformation will be implemented on
GarageArea.
TotalBsmtSFThe TotalBsmtSF represents the total basement area in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum TotalBsmtSF was 0, the mean was 473, the median was 480, the max was 1418, and the standard deviation was 214. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was very right skewed with most TotalBsmtSF falling between 750 and 1300. However, the column also contains several large values that range from approximately 2000 to 6000 and some very small values at 0. This indicates that we will need to be judicious in our exploration to identify outlier based on high TotalBsmtSF values.
From the density and boxplot, we can see that
TotalBsmtSF distribution has a right skew. Therefore,
non-linear transformation should be applied to bring the data into a
more normally distributed shape. Since cube root transformation can
shift data shape to the right, this transformation will be implemented
on TotalBsmtSF.
1stFlrSFThe 1stFlrSF represents the first floor area in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum 1stFlrSF was 334, the mean was 1163, the median was 1087, the max was 4692, and the standard deviation was 387. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was very right skewed with most 1stFlrSF falling between 800 and 1300. However, the column also contains several large values that range from approximately 2200 to 4700. This indicates that we will need to be judicious in our exploration to identify outlier based on high 1stFlrSF values.
From the density and boxplot, we can see that 1stFlrSF
distribution has a right skew. Therefore, non-linear transformation
should be applied to bring the data into a more normally distributed
shape. Since natural logarithm transformation can shift data shape to
the right, this transformation will be implemented on
1stFlrSF.
2ndFlrSFThe 2ndFlrSF represents the second floor area in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum 2ndFlrSF was 0, the mean was 347, the median was 0, the max was 2065, and the standard deviation was 437. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most 2ndFlrSF falling between 0 and 800. However, the column also contains several large values that range from approximately 1800 to 2100. This indicates that we will need to be judicious in our exploration to identify outlier based on high 2ndFlrSF values.
From the density and boxplot, we can see that 2ndFlrSF
distribution has strange shape. Applying non-linear transformation will
not bring the data into a normally distributed shape.
For example, before applying natural logarithm transformation to
2ndFlrSF, we have to add some small number to
2ndFlrSF values of 0 to avoid creating non-finite
values.
The problem lies in the fact that half or more houses have 0 second
floor area (the median of 2ndFlrSF is 0), which implicates
those are one-story house. In this case, if we want to include
2ndFlrSF in the prediction model, we have to combine it
with other variables to create a new meaningful predictor. Further
cleaning work will address this problem.
GrLivAreaThe GrLivArea represents the above grade (ground) living area in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum GrLivArea was 334, the mean was 1515, the median was r median, the max was 5642, and the standard deviation was 525. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most GrLivArea falling between 1200 and 1750. However, the column also contains several large values that range from approximately 2750 to 5700. This indicates that we will need to be judicious in our exploration to identify outlier based on high GrLivArea values.
From the density and boxplot, we can see that GrLivArea
distribution has a right skew. Therefore, non-linear transformation
should be applied to bring the data into a more normally distributed
shape. Since natural logarithm transformation can shift data shape to
the right, this transformation will be implemented on
GrLivArea.
LotAreaThe LotArea represents the lot size in square feet of each house in the sample. Of the 1460 prices included in the column, the minimum LotArea was 1300, the mean was 10517, the median was 9478, the max was 215245, and the standard deviation was 9981. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was extremely right skewed with most LotArea falling between 0 and 2000. However, the column also contains several large values that range from approximately 5000 to 250000. This indicates that we will need to be judicious in our exploration to identify outlier based on high LotArea values.
From the density and boxplot, we can see that LotArea
distribution has a significant right skew. Therefore, non-linear
transformation should be applied to bring the data into a more normally
distributed shape. Since logarithm with base 10 transformation can
considerably shift data shape to the right, this transformation will be
implemented on LotArea.
LotConfigThe LotConfig categorical variable represents the lot configuration of each house in the sample. In total, 5 unique lot configuration exist in the file. Two of those configuration, FR2 and FR3 , contain very few observations with proportions of less than 5%. Meanwhile, 72% of the observations have Inside lot configuration.
| LotConfig | Proportion |
|---|---|
| Inside | 0.72 |
| Corner | 0.18 |
| CulDSac | 0.06 |
| FR2 | 0.03 |
| FR3 | 0.00 |
Further correlation analysis should be conducted to decide which lot configuration are strongly correlated with SalePrice to be kept in the prediction model.
AgeSoldThe AgeSold represents the number of years between when each house was built and when it was sold, which shows the newness of each house as of when it was sold. The variable is calculated by the difference between YrSold and YearBuilt.
Of the 1460 prices included in the column, the minimum AgeSold was 0, the mean was 37, the median was 35, the max was 136, and the standard deviation was 30. The column contained no missing values. Visual inspection of both a density plot and a boxplot, revealed that the data was slightly right skewed with most AgeSold falling between 10 and 50. However, the column also contains several large values that range from approximately 120 to 140. This indicates that we will need to be judicious in our exploration to identify outlier based on high Age values.
From the density and boxplot, we can see that AgeSold
distribution has a slight right skew. Therefore, non-linear
transformation should be applied to bring the data into a more normally
distributed shape. Since square root transformation can shift data shape
to the right, this transformation will be implemented on
AgeSold.
Exterior1stThe Exterior1st categorical variable represents the exterior covering on each house in the sample. In total, 15 unique lot configuration exist in the file. Several of those coverings, AsbShng, AsphShn, BrkComm, CBlock, ImStucc and Stone , contain very few observations with proportions of equal or less than 1%. Meanwhile, 35% of the observations have VinylSd exterior covering.
| Exterior1st | Proportion |
|---|---|
| VinylSd | 0.35 |
| HdBoard | 0.15 |
| MetalSd | 0.15 |
| Wd Sdng | 0.14 |
| Plywood | 0.07 |
| CemntBd | 0.04 |
| BrkFace | 0.03 |
| Stucco | 0.02 |
| WdShing | 0.02 |
| AsbShng | 0.01 |
| AsphShn | 0.00 |
| BrkComm | 0.00 |
| CBlock | 0.00 |
| ImStucc | 0.00 |
| Stone | 0.00 |
Further correlation analysis should be conducted to decide which exterior coverings are strongly correlated with SalePrice to be kept in the prediction model.
HouseStyleThe HouseStyle categorical variable represents the dwelling style of each house in the sample. In total, 8 unique lot configuration exist in the file. Several of those HouseStyle, 1.5Unf, 2.5Fin, 2.5Unf, contain very few observations with proportions of 1%. Meanwhile, 50% of the observations are of 1Story HouseStyle.
| HouseStyle | Proportion |
|---|---|
| 1Story | 0.50 |
| 2Story | 0.30 |
| 1.5Fin | 0.11 |
| SLvl | 0.04 |
| SFoyer | 0.03 |
| 1.5Unf | 0.01 |
| 2.5Fin | 0.01 |
| 2.5Unf | 0.01 |
Further correlation analysis should be conducted to decide which HouseStyle are strongly correlated with SalePrice to be kept in the prediction model.
MasVnrTypeThe MasVnrType categorical variable represents the masonry veneer type of each house in the sample. In total, 5 unique lot configuration exist in the file. Among those MasVnrtype, BrkCmn contains very few observations with proportions of 1%. Meanwhile, 60% of the observations have no masonry veneer type. The column contained 8 missing values. Thoses values will be updated as the “Unknown” type for better analysis.
| MasVnrType | Proportion |
|---|---|
| None | 0.59 |
| BrkFace | 0.30 |
| Stone | 0.09 |
| BrkCmn | 0.01 |
| Unknown | 0.01 |
Further correlation analysis should be conducted to decide which masonry veneer types are strongly correlated with SalePrice to be kept in the prediction model.
MasVnrAreaMasVnrArea represents the masonry veneer area in square feet of each house in the sample. Of the 1460 prices included in the column, the min masonry veneer area was NA, the mean was NA, the median was NA, the max was NA, and the standard deviation was NA. The column contained 8 missing values. Therefore, these missing values should be updated. One possible solution is to replace them with the mean MasVnrArea based on the specific neighborhood that the house is located in. Further data cleaning work will address this problem.
Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most MasVnrArea falling between 0 and 200. However, the column also contains several very large values that range from approximately 400 to 1600. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high MasVnrArea.
From the density and boxplot, we can see that MasVnrArea
distribution has a significant right skew. Therefore, non-linear
transformation should be applied to bring the data into a more normally
distributed shape. Since natural logarithm transformation can
considerably shift data shape to the right, this transformation will be
implemented on MasVnrArea.
BedroomAbvGrBedroomAbvGr represents the number of bedrooms above basement level
of each house in the sample. Even though BedroomAbvGr is
numeric in the dataset, we want to treat it as categorical variable to
see the detailed statistics for each level of total full bathroom
number.
| BedroomAbvGr | Proportion |
|---|---|
| 3 | 0.55 |
| 2 | 0.25 |
| 4 | 0.15 |
| 1 | 0.03 |
| 5 | 0.01 |
| 0 | 0.00 |
| 6 | 0.00 |
| 8 | 0.00 |
In total, 8 total number of bedrooms above ground exist in the file. Several of those BedroomAbvGr contain very few observations. For example, 0,6,8 numbers of bedrooms represented closely t 0% of the data, while 55% of the observations have 3 bedrooms.
KitchenQualThe KitchenQual categorical variable represents the kitchen quality of each house in the sample. In total, 4 unique lot configuration exist in the file. Among those, Fa kitchen quality contains very few observations with proportions of 3%. Meanwhile, 50% and 40% of the observations are of TA and Gd kitchen qualities, respectively.
| KitchenQual | Proportion |
|---|---|
| TA | 0.50 |
| Gd | 0.40 |
| Ex | 0.07 |
| Fa | 0.03 |
Further correlation analysis should be conducted to decide which KitchenQual are strongly correlated with SalePrice to be kept in the prediction model.
TotRmsAbvGrdTotRmsAbvGrd represents the total number of rooms above grade (not
including bathrooms) of each house in the sample. Even though
TotRmsAbvGrd is numeric in the dataset, we want to treat it
as categorical variable to see the detailed statistics for each level of
total room number.
| TotRmsAbvGrd | Proportion |
|---|---|
| 6 | 0.28 |
| 7 | 0.23 |
| 5 | 0.19 |
| 8 | 0.13 |
| 4 | 0.07 |
| 9 | 0.05 |
| 10 | 0.03 |
| 3 | 0.01 |
| 11 | 0.01 |
| 12 | 0.01 |
| 2 | 0.00 |
| 14 | 0.00 |
In total, 12 total number of rooms above ground exist in the file. Several of those TotRmsAbvGrd contain very few observations. For example, 2 and 14 numbers of room represented closely t 0% of the data. Further data cleaning should be applied to determine the meaning levels that should be included in the SalePrice prediction model.
LotFrontageLotFrontage represents the linear feet of street connected to property of each house in the sample.
Of the 1460 prices included in the column, the min LotFrontage was NA, the mean was NA, the median was NA, the max was NA, and the standard deviation was NA. The column contained 259 missing values. Further cleaning work should be conducted to replaced these missing values with suitable data that would improve the SalePrice prediction model.
Visual inspection of both a density plot and a boxplot, revealed that the data was right skewed with most LotFrontage falling between 50 and 100. However, the column also contains several very large values that range from approximately 250 to slightly over 300. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high LotFrontage.
From the density and boxplot, we can see that
LotFrontage distribution has a right skew. Therefore,
non-linear transformation should be applied to bring the data into a
more normally distributed shape. Since natural logarithm transformation
can considerably shift data shape to the right, this transformation will
be implemented on LotFrontage.
BldgTypeBldgType represents the type of dwelling of each house in the sample.
| BldgType | Proportion |
|---|---|
| 1Fam | 0.84 |
| TwnhsE | 0.08 |
| Duplex | 0.04 |
| Twnhs | 0.03 |
| 2fmCon | 0.02 |
In total, 5 types of dwelling exist in the file. Several of those BldgType contain very few observations. For example, Twnhs and 2fmCon represented 3% and 2% of the data, respectively. Further data cleaning should be applied to determine the meaning levels that should be included in the SalePrice prediction model.
RemodSoldRemodSold represents the time gap between when each house was sold and when it was remodeled, which partially shows the quality of each house as of when it was sold. The column is calculated by subtracting YearRemodAdd from YrSold.
Of the 1460 prices included in the column, the min RemodSold was -1, the mean was 23, the median was 14, the max was 60, and the standard deviation was 21. The column contained 0 missing values. Interestingly, there is one house that was remodeled after being sold. Thus, the remodeling did not affect the SalePrice that had been decided on earlier. Since this point is not contributing to the SalePrice prediction model, it will possibly be cleaned later on.
Visual inspection of both a density plot and a boxplot, revealed that the data was slightly right skewed with most RemodSold falling between 10 and 20. However, the column also contains several very large values that range from approximately 50 to 60. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high RemodSold.
From the density and boxplot, we can see that RemodSold
distribution has a right skew. Therefore, non-linear transformation
should be applied to bring the data into a more normally distributed
shape. Since cube root transformation can shift data shape to the right,
this transformation will be implemented on RemodSold.
BsmtFinSF1BsmtFinSF1 represents type 1 finished square feet of each house as of when it was sold.
Of the 1460 prices included in the column, the min BsmtFinSF1 was 0, the mean was 444, the median was 384, the max was 5644, and the standard deviation was 456. The column contained 0 missing values.
Visual inspection of both a density plot and a boxplot, revealed that the data was slightly right skewed with most BsmtFinSF1 falling between 0 and 750. However, the column also contains several very large values that range from approximately 1750 to 2250. This indicates that we will need to be judicious in our exploration to identify outlier based on exceedingly high BsmtFinSF1.
From the density and boxplot, we can see that BsmtFinSF1
distribution has a right skew. Therefore, non-linear transformation
should be applied to bring the data into a more normally distributed
shape. Since cube root transformation can shift data shape to the right,
this transformation will be implemented on BsmtFinSF1.
In this section, a correlation table between SalePrice and numeric variables available in the train dataset will be present. From the table, a list of highly correlated variables will be selected to have further one-to-one bivariate analysis with SalePrice and to be included in the SalePrice prediction model.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| OverallQual | SalePrice | 0.79 |
| GrLivArea | SalePrice | 0.71 |
| GarageCars | SalePrice | 0.64 |
| GarageArea | SalePrice | 0.62 |
| TotalBsmtSF | SalePrice | 0.61 |
| X1stFlrSF | SalePrice | 0.61 |
| YearRemodAdd | SalePrice | 0.51 |
| Fireplaces | SalePrice | 0.47 |
| BsmtFinSF1 | SalePrice | 0.39 |
| X2ndFlrSF | SalePrice | 0.32 |
| WoodDeckSF | SalePrice | 0.32 |
| OpenPorchSF | SalePrice | 0.32 |
| LotArea | SalePrice | 0.26 |
| BsmtUnfSF | SalePrice | 0.21 |
| ScreenPorch | SalePrice | 0.11 |
| PoolArea | SalePrice | 0.09 |
| MoSold | SalePrice | 0.05 |
| X3SsnPorch | SalePrice | 0.04 |
| BsmtFinSF2 | SalePrice | -0.01 |
| MiscVal | SalePrice | -0.02 |
| LowQualFinSF | SalePrice | -0.03 |
| OverallCond | SalePrice | -0.08 |
| EnclosedPorch | SalePrice | -0.13 |
| KitchenAbvGr | SalePrice | -0.14 |
| AgeSold | SalePrice | -0.52 |
From the table, we notice some interesting correlation values between SalePrice and each of these following variables:
OverallQual,
GrLivArea, GarageCars,
GarageArea, TotalBsmtSF,
X1stFlrSF, YearRemodAdd,
Fireplaces, BsmtFinSF1,
X2ndFlrSF, WoodDeckSF,
OpenPorchSF, AgeSoldLotArea, BsmtUnfSF, ScreenPorch,
PoolArea, KitchenAbvGr,
EnclosedPorch.In those variables, we will conduct detailed bivariate analysis on
OverallQual, GrLivArea,
GarageArea, TotalBsmtSF,
X1stFlrSF, X2ndFlrSF, WoodDeckSF,
OpenPorchSF, AgeSold, LotArea,
BsmtUnfSF, ScreenPorch, and
EnclosedPorch.
Meanwhile, GarageCars, Fireplaces,
PoolArea and KitchenAbvGr will be converted
into categorical to explore detailed statistics of the different
levels.
In this section, further one-to-one bivariate analysis with SalePrice and categorical variables available in the train dataset will be conducted. These categorical variables also include numeric variable with a few number of levels that may have different statistics. Levels of different categorical that are highly correlated with SalePrice (with the correlation above 0.09) will be shortlisted to make further analysis whether to be included in SalePrice prediction model.
After calculating the percentage of missing values in each
categorical variables, we notice that Alley,
PoolQC, Fence and MiscFeature
have significantly high percentage of NAs (of more than 80% of the
values in these columns are NAs). Therefore, these variables are ruled
out from the bivariate analysis.
| na_per | |
|---|---|
| Alley | 0.9376712 |
| PoolQC | 0.9952055 |
| Fence | 0.8075342 |
| MiscFeature | 0.9630137 |
For other columns, all of the missing values are turned into “Unknown”.
SalePrice & NeighborhoodThe boxplot shows that there is wide variation in the distribution of SalePrices across the neighborhoods. NridgHt, NoRidge, and StoneBr have the largest mean SalePrices. Interesting 6 of the 7 most expensive homes are located in NridgHt and NoRidge. MeadowV, IDOTRR, Brkside, and BrDale have the lowest mean SalePrice.
Finally, table 13 shows the correlation for each neighborhood with SalePrice. As the boxplot showed, NridgHt, NoRidge, StoneBr, Somerst and Timber have the strongest positive correlation with SalePrice at .40, .33, .22, .14 and .13 respectively. In contrast, BrDale, MeadowV, Sawyer, BrkSide, IDOTRR, Edwards, NAmes, and OldTown are all noticeably negatively correlated with SalePrice.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | Neighborhood_NridgHt | 0.40 |
| SalePrice | Neighborhood_NoRidge | 0.33 |
| SalePrice | Neighborhood_StoneBr | 0.22 |
| SalePrice | Neighborhood_Somerst | 0.14 |
| SalePrice | Neighborhood_Timber | 0.13 |
| SalePrice | Neighborhood_CollgCr | 0.07 |
| SalePrice | Neighborhood_Crawfor | 0.07 |
| SalePrice | Neighborhood_ClearCr | 0.06 |
| SalePrice | Neighborhood_Veenker | 0.06 |
| SalePrice | Neighborhood_Gilbert | 0.04 |
| SalePrice | Neighborhood_Blmngtn | 0.02 |
| SalePrice | Neighborhood_NWAmes | 0.02 |
| SalePrice | Neighborhood_SawyerW | 0.01 |
| SalePrice | Neighborhood_Blueste | -0.02 |
| SalePrice | Neighborhood_NPkVill | -0.04 |
| SalePrice | Neighborhood_Mitchel | -0.06 |
| SalePrice | Neighborhood_SWISU | -0.06 |
| SalePrice | Neighborhood_BrDale | -0.10 |
| SalePrice | Neighborhood_MeadowV | -0.11 |
| SalePrice | Neighborhood_Sawyer | -0.13 |
| SalePrice | Neighborhood_BrkSide | -0.14 |
| SalePrice | Neighborhood_IDOTRR | -0.16 |
| SalePrice | Neighborhood_Edwards | -0.18 |
| SalePrice | Neighborhood_NAmes | -0.19 |
| SalePrice | Neighborhood_OldTown | -0.19 |
SalePrice & MSZoningMSZoning represents the general zoning classification of each house in the dataset. According to the boxplot, FV and Rl have the largest mean SalePrices. Interestingly, 8 most expensive homes have MSZoning of RL. Finally, table 14 shows the correlation for each MSZoning with SalePrice. As the boxplot showed, MSZoning_RL and MSZoning_FV have the strongest positive correlation with SalePrice at .25 and .09 respectively. In contrast, MSZoning_C (all) and MSZoning_RM are noticeably negatively correlated with SalePrice.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | MSZoning_RL | 0.25 |
| SalePrice | MSZoning_FV | 0.09 |
| SalePrice | MSZoning_RH | -0.07 |
| SalePrice | MSZoning_C (all) | -0.11 |
| SalePrice | MSZoning_RM | -0.29 |
SalePrice & MSSubClassAccording to the boxplot, MSSubClass_60 has the largest mean SalePrices. Interestingly, 2 most expensive homes have MSSubClass of 60.
Finally, table 15 shows the correlation for each MSSubClass with SalePrice. As the boxplot showed, MSSubClass_60 has the strongest positive correlation with SalePrice at 0.38. In contrast, MSSubClass_190, MSSubClass_90, MSSubClass_160, MSSubClass_50 and MSSubClass_30 are noticeably negatively correlated with SalePrice.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | MSSubClass_60 | 0.38 |
| SalePrice | MSSubClass_120 | 0.06 |
| SalePrice | MSSubClass_20 | 0.04 |
| SalePrice | MSSubClass_75 | 0.02 |
| SalePrice | MSSubClass_40 | -0.02 |
| SalePrice | MSSubClass_80 | -0.03 |
| SalePrice | MSSubClass_70 | -0.04 |
| SalePrice | MSSubClass_85 | -0.05 |
| SalePrice | MSSubClass_45 | -0.08 |
| SalePrice | MSSubClass_180 | -0.08 |
| SalePrice | MSSubClass_190 | -0.09 |
| SalePrice | MSSubClass_90 | -0.11 |
| SalePrice | MSSubClass_160 | -0.11 |
| SalePrice | MSSubClass_50 | -0.16 |
| SalePrice | MSSubClass_30 | -0.24 |
SalePrice & FireplacesFireplaces represents the number of fireplaces of each house in the dataset. According to the boxplot, Fireplaces_3 has the largest mean SalePrices. Interestingly, 2 most expensive homes have the number of fireplaces of 2.
Finally, table 16 shows the correlation for each Fireplaces with SalePrice. Fireplaces_1 and Fireplaces_3 have the strongest positive correlation with SalePrice at 0.35 and 0.22 respectively. In contrast, Fireplaces_0 is noticeably negatively correlated with SalePrice at -0.47.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | Fireplaces_1 | 0.35 |
| SalePrice | Fireplaces_2 | 0.22 |
| SalePrice | Fireplaces_3 | 0.05 |
| SalePrice | Fireplaces_0 | -0.47 |
SalePrice & TotalFullBathAccording to the correlation table, TotalFullBath_3 and TotalFullBath_4 have the strongest positive correlation with SalePrice at 0.49 and 0.19 respectively. In contrast, TotalFullBath_1 is noticeably negatively correlated with SalePrice at -0.42.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | TotalFullBath_5 | 0.32 |
| SalePrice | TotalFullBath_2 | -0.03 |
| SalePrice | TotalFullBath_1 | -0.04 |
| SalePrice | TotalFullBath_3 | -0.13 |
| SalePrice | TotalFullBath_4 | -0.15 |
| SalePrice | TotalFullBath_6 | -0.16 |
SalePrice & GarageCarsGarageCars represents the size of garage in car capacity of each house in the dataset. According to the correlation table, GarageCars_3 has the strongest positive correlation with SalePrice at 0.61. In contrast, GarageCars_0 and GarageCars_1 are noticeably negatively correlated with SalePrice at -0.24 and -0.39 respectively.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | GarageCars_3 | 0.61 |
| SalePrice | GarageCars_2 | 0.04 |
| SalePrice | GarageCars_4 | 0.01 |
| SalePrice | GarageCars_0 | -0.24 |
| SalePrice | GarageCars_1 | -0.39 |
SalePrice & OverallCondOverallCond represents the overall condition rating of each house in the dataset. According to the correlation table, OverallCond_5 has the strongest positive correlation with SalePrice at 0.32. In contrast, OverallCond_7, OverallCond_3, OverallCond_4 and OverallCond_6 are noticeably negatively correlated with SalePrice at -0.12, -0.13, -0.15 and -0.16 respectively.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | OverallCond_5 | 0.32 |
| SalePrice | OverallCond_9 | 0.05 |
| SalePrice | OverallCond_2 | -0.03 |
| SalePrice | OverallCond_1 | -0.04 |
| SalePrice | OverallCond_8 | -0.07 |
| SalePrice | OverallCond_7 | -0.12 |
| SalePrice | OverallCond_3 | -0.13 |
| SalePrice | OverallCond_4 | -0.15 |
| SalePrice | OverallCond_6 | -0.16 |
SalePrice & LotShapeLotShape represents the general shape of property of each house in the dataset. According to the correlation table, LotShape_IR1 and LotShape_IR2 have the strongest positive correlation with SalePrice at 0.22 and 0.13 respectively. In contrast, LotShape_Reg is noticeably negatively correlated with SalePrice at -0.24 and -0.27.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | LotShape_IR1 | 0.22 |
| SalePrice | LotShape_IR2 | 0.13 |
| SalePrice | LotShape_IR3 | 0.04 |
| SalePrice | LotShape_Reg | -0.27 |
SalePrice & RoofStyleRoofStyle represents the type of roof of each house in the dataset. According to the correlation table, RoofStyle_Hip has the strongest positive correlation with SalePrice at 0.24. In contrast, RoofStyle_Gable is noticeably negatively correlated with SalePrice at -0.22.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | RoofStyle_Hip | 0.24 |
| SalePrice | RoofStyle_Flat | 0.02 |
| SalePrice | RoofStyle_Shed | 0.02 |
| SalePrice | RoofStyle_Mansard | 0.00 |
| SalePrice | RoofStyle_Gambrel | -0.04 |
| SalePrice | RoofStyle_Gable | -0.22 |
SalePrice & ExterQualExterQual represents the exterior material quality of each house in the dataset. According to the correlation table, ExterQual_Ex and ExterQual_Gd have the strongest positive correlation with SalePrice at 0.45. In contrast, ExterQual_Fa and ExterQual_TA are noticeably negatively correlated with SalePrice at -0.12 and -0.59 respectively.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | ExterQual_Ex | 0.45 |
| SalePrice | ExterQual_Gd | 0.45 |
| SalePrice | ExterQual_Fa | -0.12 |
| SalePrice | ExterQual_TA | -0.59 |
SalePrice & FoundationFoundation represents the type of foundation of each house in the dataset. According to the correlation table, Foundation_PConc has the strongest positive correlation with SalePrice at 0.5. In contrast, Foundation_Slab, Foundation_BrkTil and Foundation_CBlock are noticeably negatively correlated with SalePrice at -0.12, -0.2 and -0.34 respectively.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | Foundation_PConc | 0.50 |
| SalePrice | Foundation_Wood | 0.00 |
| SalePrice | Foundation_Stone | -0.01 |
| SalePrice | Foundation_Slab | -0.12 |
| SalePrice | Foundation_BrkTil | -0.20 |
| SalePrice | Foundation_CBlock | -0.34 |
SalePrice & HeatingQCHeatingQC variable represents the heating quality and condition of each house in the dataset. According to the correlation table, HeatingQC_Ex has the strongest positive correlation with SalePrice at 0.43. In contrast, HeatingQC_Fa, HeatingQC_Gd and HeatingQC_TA are noticeably negatively correlated with SalePrice at -0.13, -0.13 and -0.31 respectively.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | HeatingQC_Ex | 0.43 |
| SalePrice | HeatingQC_Po | -0.03 |
| SalePrice | HeatingQC_Fa | -0.13 |
| SalePrice | HeatingQC_Gd | -0.13 |
| SalePrice | HeatingQC_TA | -0.31 |
SalePrice & SaleTypeSaleType represents the type of sale of each house in the dataset. According to the correlation table, SaleType_New has the strongest positive correlation with SalePrice at 0.36. In contrast, SaleType_WD is noticeably negatively correlated with SalePrice at -0.24.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | SaleType_New | 0.36 |
| SalePrice | SaleType_Con | 0.04 |
| SalePrice | SaleType_CWD | 0.02 |
| SalePrice | SaleType_ConLI | 0.01 |
| SalePrice | SaleType_ConLw | -0.03 |
| SalePrice | SaleType_Oth | -0.03 |
| SalePrice | SaleType_ConLD | -0.04 |
| SalePrice | SaleType_COD | -0.08 |
| SalePrice | SaleType_WD | -0.24 |
SalePrice & SaleConditionSaleCondition represents the condition of sale of each house in the dataset. According to the correlation table, SaleCondition_Partial has the strongest positive correlation with SalePrice at 0.35. In contrast, SaleCondition_Abnorml and SaleCondition_Normal are noticeably negatively correlated with SalePrice at -0.12 and -0.15 respectively.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | SaleCondition_Partial | 0.35 |
| SalePrice | SaleCondition_Alloca | -0.02 |
| SalePrice | SaleCondition_AdjLand | -0.05 |
| SalePrice | SaleCondition_Family | -0.05 |
| SalePrice | SaleCondition_Abnorml | -0.12 |
| SalePrice | SaleCondition_Normal | -0.15 |
SalePrice & HouseStyleAccording to the correlation table, HouseStyle_2Story has the strongest positive correlation with SalePrice at 0.24. In contrast, HouseStyle_1.5Unf, HouseStyle_SFoyer and HouseStyle_1.5Fin are noticeably negatively correlated with SalePrice at -0.09, -0.09 and -0.16 respectively.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | HouseStyle_2Story | 0.24 |
| SalePrice | HouseStyle_2.5Fin | 0.04 |
| SalePrice | HouseStyle_2.5Unf | -0.03 |
| SalePrice | HouseStyle_SLvl | -0.04 |
| SalePrice | HouseStyle_1Story | -0.06 |
| SalePrice | HouseStyle_1.5Unf | -0.09 |
| SalePrice | HouseStyle_SFoyer | -0.09 |
| SalePrice | HouseStyle_1.5Fin | -0.16 |
SalePrice & MasVnrTypeAccording to the correlation table, MasVnrType_Stone and MasVnrType_BrkFace have the strongest positive correlation with SalePrice at 0.33 and 0.2. In contrast, MasVnrType_None is noticeably negatively correlated with SalePrice at -0.37.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | MasVnrType_Stone | 0.33 |
| SalePrice | MasVnrType_BrkFace | 0.20 |
| SalePrice | MasVnrType_Unknown | 0.05 |
| SalePrice | MasVnrType_BrkCmn | -0.04 |
| SalePrice | MasVnrType_None | -0.37 |
SalePrice & LotConfigAccording to the correlation table, LotConfig_CulDSac has the strongest correlation with SalePrice at 0.14.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | LotConfig_CulDSac | 0.14 |
| SalePrice | LotConfig_FR3 | 0.02 |
| SalePrice | LotConfig_Corner | 0.00 |
| SalePrice | LotConfig_FR2 | -0.01 |
| SalePrice | LotConfig_Inside | -0.08 |
SalePrice & Exterior1stAccording to the correlation table, Exterior1st_VinylSd and Exterior1st_CemntBd have the strongest positive correlation with SalePrice at 0.31 and 0.13. In contrast, Exterior1st_HdBoard, Exterior1st_AsbShng, Exterior1st_Wd Sdng and Exterior1st_MetalSd are noticeably negatively correlated with SalePrice at -0.10, -0.11, -0.16 and -0.17 respectively.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | Exterior1st_VinylSd | 0.31 |
| SalePrice | Exterior1st_CemntBd | 0.13 |
| SalePrice | Exterior1st_Stone | 0.04 |
| SalePrice | Exterior1st_BrkFace | 0.03 |
| SalePrice | Exterior1st_ImStucc | 0.03 |
| SalePrice | Exterior1st_Plywood | -0.02 |
| SalePrice | Exterior1st_AsphShn | -0.03 |
| SalePrice | Exterior1st_CBlock | -0.03 |
| SalePrice | Exterior1st_Stucco | -0.03 |
| SalePrice | Exterior1st_BrkComm | -0.05 |
| SalePrice | Exterior1st_WdShing | -0.05 |
| SalePrice | Exterior1st_HdBoard | -0.10 |
| SalePrice | Exterior1st_AsbShng | -0.11 |
| SalePrice | Exterior1st_Wd Sdng | -0.16 |
| SalePrice | Exterior1st_MetalSd | -0.17 |
SalePrice & BedroomAbvGrBedroomAbvGr represents the number of bedrooms above basement level of each house in the dataset. According to the correlation table, BedroomAbvGr_4 has the strongest positive correlation with SalePrice at 0.21. In contrast, BedroomAbvGr_2 is noticeably negatively correlated with SalePrice at -0.16.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | BedroomAbvGr_4 | 0.21 |
| SalePrice | BedroomAbvGr_0 | 0.03 |
| SalePrice | BedroomAbvGr_8 | 0.01 |
| SalePrice | BedroomAbvGr_3 | 0.00 |
| SalePrice | BedroomAbvGr_5 | 0.00 |
| SalePrice | BedroomAbvGr_1 | -0.02 |
| SalePrice | BedroomAbvGr_6 | -0.03 |
| SalePrice | BedroomAbvGr_2 | -0.16 |
SalePrice & TotRmsAbvGrdAccording to the correlation table, TotRmsAbvGrd_10, TotRmsAbvGrd_9, TotRmsAbvGrd_11, TotRmsAbvGrd_8, TotRmsAbvGrd_7 and TotRmsAbvGrd_12 have the strongest positive correlation with SalePrice at 0.26, 0.21, 0.19, 0.16, 0.11 and 0.11 respectively. In contrast, TotRmsAbvGrd_3, TotRmsAbvGrd_6, TotRmsAbvGrd_4 and TotRmsAbvGrd_5 are noticeably negatively correlated with SalePrice at -0.10, -0.15, -0.2 and -0.24 respectively.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | TotRmsAbvGrd_10 | 0.26 |
| SalePrice | TotRmsAbvGrd_9 | 0.21 |
| SalePrice | TotRmsAbvGrd_11 | 0.19 |
| SalePrice | TotRmsAbvGrd_8 | 0.16 |
| SalePrice | TotRmsAbvGrd_7 | 0.11 |
| SalePrice | TotRmsAbvGrd_12 | 0.11 |
| SalePrice | TotRmsAbvGrd_14 | 0.01 |
| SalePrice | TotRmsAbvGrd_2 | -0.05 |
| SalePrice | TotRmsAbvGrd_3 | -0.10 |
| SalePrice | TotRmsAbvGrd_6 | -0.15 |
| SalePrice | TotRmsAbvGrd_4 | -0.20 |
| SalePrice | TotRmsAbvGrd_5 | -0.24 |
SalePrice & BldgTypeAccording to the correlation table, 1Fam, 2fmCon, Twnhs, and Duplex have the strongest correlation with SalePrice at 0.14, -0.10, -0.10 and -0.11 respectively. In contrast, TwnhsE does not have any noticeable correlation with SalePrice.
| DependentVar | IndependentVar | Correlation |
|---|---|---|
| SalePrice | BldgType_1Fam | 0.14 |
| SalePrice | BldgType_TwnhsE | 0.00 |
| SalePrice | BldgType_2fmCon | -0.10 |
| SalePrice | BldgType_Twnhs | -0.10 |
| SalePrice | BldgType_Duplex | -0.11 |
In this section, data cleaning will be conducted to:
MasVnrAreaMasVnrArea contained 8 missing values. Therefore, these missing values should be updated. One possible solution is to replace them with the mean MasVnrArea based on the specific house style.
After cleaning, the correlation between SalePrice and MasvnrArea is around 0.48.
OverallQualBased on the bivariate analysis between SalePrice and OverallQual, two outliers are noticeable: Two houses that have OverallQual around 10, were sold for less than $200,000. These two points can either be removed or replaced by other values. To keep most of the data available, I will replace them with the average OverallQual values in their Neighborhood.
After replacing the two outliers, the correlation between SalePrice and OverallQual increases slightly (from 0.791 to 0.796)
GrLivAreaBased on the bivariate analysis between SalePrice and GrLivArea, two outliers are noticeable: Two houses that are both greater than 4500 square feet, were sold for less than $200,000. These values will be replaced by the average GrLivArea values of their HouseStyle.
After replacing the two outliers, the correlation between SalePrice and GrLivArea increases slightly (from 0.709 to 0.736)
LotAreaBased on the bivariate analysis between SalePrice and LotArea, three outliers are noticeable: Three houses that have lot area greater than 150,000 square feet, were sold for less than $400,000.
These three outliers will be replaced by the average LotArea based on their HouseStyle.
After replacing the two outliers, the correlation between SalePrice and LotArea increases noticeably (from 0.26 to 0.39)
AgeSoldBased on the bivariate analysis between SalePrice and AgeSold, one outlier is noticeable: One house that was built 110 years ago was sold for nearly $500,000. This outlier will be replaced by the average AgeSold of the house neighborhood.
After replacing the two outliers, the correlation between SalePrice and LotArea decreases slightly (from -0.52 to -0.57).
WoodDeckSFBased on the bivariate analysis between SalePrice and WoodDeckSF, two outliers are noticeable: Two house with nearly 750 square feet of wood deck were sold for under $200,000. These outliers will be replaced by the average WoodDeckSF of the house styles.
After replacing the two outliers, the correlation between SalePrice and WoodDeckSF increases slightly (from 0.32 to 0.33).
NeighborhoodThere are 25 different Neighborhoods in the dataset, but only 13 have a meaningful relationship with SalePrice, including NridgHt, NoRidge, StoneBr, Somerst, Timber, BrDale, MeadowV, Sawyer, BrkSide, IDOTRR, Edwards, NAmes, and OldTown. Therefore, the rest will be group into a group called “Other”.
MSSubClassThere are 15 different MSSubClass in the dataset, but only 6 have a meaningful relationship with SalePrice, including “60”,“190”,“90”,“160”,“50” and “30”. Therefore, the rest will be group into a group called “Other”.
TotalFullBathThere are 6 TotalFullBath values, but only 3 have a meaningful relationship with SalePrice, including “1”, “3”, “4”. Therefore, the rest will be group into a group called “Other”.
OverallCondThere are 10 TotalFullBath values, but only 5 have a meaningful relationship with SalePrice, including “3”, “4”, “5”, “6” and “7”. Therefore, the rest will be group into a group called “Other”.
SaleTypeThere are 9 TotalFullBath values, but only 2 have a meaningful relationship with SalePrice, including “New” and “WD”. Therefore, the rest will be group into a group called “Other”.
FireplacesAmong 4 numbers of Fireplaces, one, three and zero fireplaces have a meaningful relationship with SalePrice. Two fireplaces will be turned into the “Other” level.
GarageCarsThere are 5 GarageCars values, but only 3 have a meaningful relationship with SalePrice, including “0”, “1”, “3”. Therefore, the rest will be group into a group called “Other”.
LotShapeAmong 4 numbers of Fireplaces, IR1, IR2 and Reg have a meaningful relationship with SalePrice. IR3 will be turned into the “Other” level.
RoofStyleThere are 6 RoofStyle values, but only 2 have a meaningful relationship with SalePrice, including “Hip” and “Gable”. Therefore, the rest will be group into a group called “Other”.
FoundationThere are 6 RoofStyle values, but only 4 have a meaningful relationship with SalePrice, including “PConc”, “Slab”, “BrkTil”, “CBlock”. Therefore, the rest will be group into a group called “Other”.
HeatingQCAmong 4 numbers of HeatingQC, Ex, Fa, Gd and TA have a meaningful relationship with SalePrice. Po HeatingQC will be turned into the “Other” level.
SaleConditionThere are 6 SaleCondition values, but only 3 have a meaningful relationship with SalePrice, including “Partial”, “Abnorml”, “Normal”. Therefore, the rest will be group into a group called “Other”.
HouseStyleThere are 8 HouseStyle values, but only 4 have a meaningful relationship with SalePrice, including “Partial”, “Abnorml”, “Normal”. Therefore, the rest will be group into a group called “Other”.
MasVnrTypeThere are 5 HouseStyle values, but only 3 have a meaningful relationship with SalePrice, including “Stone”, “BrkFace”, “None”. Therefore, the rest will be group into a group called “Other”.
LotConfigThere are 5 LotConfig values, but only 1 has a meaningful relationship with SalePrice - “CulDSac”. Therefore, the rest will be group into a group called “Other”.
Exterior1stThere are 15 Exterior1st values, but only 1 has a meaningful relationship with SalePrice, including VinylSd, CemntBd, HdBoard, AsbShng, Wd Sdng, MetalSd. Therefore, the rest will be group into a group called “Other”.
BedroomAbvGrThere are 8 BedroomAbvGr values, but only 2 have a meaningful relationship with SalePrice, including “2” and “4”. Therefore, the rest will be group into a group called “Other”.
TotRmsAbvGrdThere are 12 BedroomAbvGr values, 10 have a meaningful relationship with SalePrice, including 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. Therefore, the rest will be group into a group called “Other”.
LotFrontageBecause LotFrontage has na_count missing
values, these values will be replaced by the mean
LotFrontage of the specific neighborhood.
BldgTypeThere are 5 BldgType values, among which TwnhsE has insignificant correlation with SalePrice. Therefore, this value will be turned into a level called “Other”.
33 variables chosen to be the predictors in SalePrice prediction model include:
Multi-variable linear model for
SalePrice_transformed:
Running the multi-variate linear model between SalePrice_transformed and the 33 predictors would return a model with an R-squared of 0.9144 and an adjusted R-squared of 0.9105. In other words, the 33 variables included in the model explained approximately 91% of the variance in SalePrice. The model was significatn with F = 231.2, p < 0.001. The median of the residuals was very close to zero. Among 33 predictors, 10 make minimal contribution even though they have significant correlation with SalePrice. This happens because their statistical significance are not as strong as other chosen predictors.
##
## Call:
## lm(formula = SalePrice_transformed ~ ., data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35905 -0.02488 0.00334 0.02888 0.16367
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.5122536729 0.1195226192 29.386
## GarageArea 0.0000365081 0.0000164848 2.215
## LotArea 0.0000009692 0.0000001956 4.956
## AgeSold -0.0002886972 0.0001218927 -2.368
## WoodDeckSF 0.0000534309 0.0000134147 3.983
## BsmtUnfSF -0.0000045802 0.0000087271 -0.525
## OpenScreenPorch 0.0000895581 0.0000180493 4.962
## NeighborhoodNoRidge 0.0421379870 0.0102786555 4.100
## NeighborhoodOldTown -0.0239080216 0.0072676928 -3.290
## NeighborhoodNridgHt 0.0273388355 0.0084513357 3.235
## NeighborhoodEdwards -0.0191958025 0.0061205926 -3.136
## NeighborhoodStoneBr 0.0504179674 0.0121820317 4.139
## MSZoningFV 0.2090336184 0.0197976999 10.558
## MSZoningRH 0.1647662767 0.0232699123 7.081
## MSZoningRL 0.1834033608 0.0182409110 10.055
## MSZoningRM 0.1638870060 0.0184735598 8.871
## MSSubClass60 -0.0133692913 0.0087874385 -1.521
## MSSubClass30 -0.0225775219 0.0084329439 -2.677
## MSSubClass160 -0.0490015177 0.0114304570 -4.287
## Fireplaces1 0.0126410684 0.0037574577 3.364
## Fireplaces2 0.0289492680 0.0070526426 4.105
## TotalFullBath1 -0.0061780183 0.0043719766 -1.413
## TotalFullBath3 0.0061464958 0.0048272670 1.273
## LotShapeIR2 0.0143974404 0.0101421153 1.420
## ExterQualTA -0.0116500600 0.0052836017 -2.205
## ExterQualEx 0.0242168280 0.0115357810 2.099
## SaleTypeNew 0.0030311754 0.0069596478 0.436
## MasVnrTypeStone 0.0487439568 0.0137490856 3.545
## MasVnrTypeBrkFace 0.0430526260 0.0128014432 3.363
## MasVnrTypeNone 0.0456751153 0.0125863029 3.629
## LotConfigCulDSac 0.0184382068 0.0072309923 2.550
## Exterior1stHdBoard -0.0097469722 0.0047467917 -2.053
## Exterior1stWd Sdng -0.0117935482 0.0048287719 -2.442
## PoolArea555 0.1099156644 0.0577907191 1.902
## CentralAirY 0.0373026997 0.0070461067 5.294
## PavedDriveY 0.0126342340 0.0110481033 1.144
## PavedDriveN -0.0005248141 0.0124292870 -0.042
## GarageCars0 -0.0369556495 0.0111244011 -3.322
## GarageCars1 -0.0062628902 0.0055996361 -1.118
## GarageCars3 0.0117957936 0.0075472544 1.563
## FenceGdWo -0.0203638022 0.0080225121 -2.538
## HouseStyle1.5Fin -0.0191158744 0.0069938029 -2.733
## LotFrontage 0.0001353186 0.0000900476 1.503
## BldgType1Fam 0.0208088543 0.0055470632 3.751
## LandContourHLS 0.0146413547 0.0084231807 1.738
## BsmtFinSF1 0.0000412186 0.0000089682 4.596
## GrLivArea_transformed 0.1676985713 0.0180889203 9.271
## TotalBsmtSF_transformed 0.0057104230 0.0013758605 4.150
## X1stFlrSF 0.0000004218 0.0000146489 0.029
## X2ndFlrSF_transformed 0.0001653528 0.0005225978 0.316
## OverallQualFactor3 0.0509089383 0.0279665008 1.820
## OverallQualFactor4 0.0927783360 0.0260791877 3.558
## OverallQualFactor5 0.1148530978 0.0263418909 4.360
## OverallQualFactor6 0.1395246557 0.0265676159 5.252
## OverallQualFactor7 0.1670171654 0.0268985399 6.209
## OverallQualFactor8 0.1999161015 0.0274659538 7.279
## OverallQualFactor9 0.2490727366 0.0294692461 8.452
## OverallQualFactor10 0.2780381659 0.0325967111 8.530
## RemodSold_transformed -0.0173037859 0.0021496115 -8.050
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## GarageArea 0.026964 *
## LotArea 0.00000081966148790 ***
## AgeSold 0.018013 *
## WoodDeckSF 0.00007195666198071 ***
## BsmtUnfSF 0.599802
## OpenScreenPorch 0.00000079417818437 ***
## NeighborhoodNoRidge 0.00004405348012279 ***
## NeighborhoodOldTown 0.001031 **
## NeighborhoodNridgHt 0.001249 **
## NeighborhoodEdwards 0.001751 **
## NeighborhoodStoneBr 0.00003725483258440 ***
## MSZoningFV < 0.0000000000000002 ***
## MSZoningRH 0.00000000000238250 ***
## MSZoningRL < 0.0000000000000002 ***
## MSZoningRM < 0.0000000000000002 ***
## MSSubClass60 0.128409
## MSSubClass30 0.007519 **
## MSSubClass160 0.00001950308409929 ***
## Fireplaces1 0.000791 ***
## Fireplaces2 0.00004309105068371 ***
## TotalFullBath1 0.157876
## TotalFullBath3 0.203152
## LotShapeIR2 0.155981
## ExterQualTA 0.027639 *
## ExterQualEx 0.035992 *
## SaleTypeNew 0.663248
## MasVnrTypeStone 0.000407 ***
## MasVnrTypeBrkFace 0.000794 ***
## MasVnrTypeNone 0.000296 ***
## LotConfigCulDSac 0.010894 *
## Exterior1stHdBoard 0.040243 *
## Exterior1stWd Sdng 0.014729 *
## PoolArea555 0.057405 .
## CentralAirY 0.00000014108137116 ***
## PavedDriveY 0.253022
## PavedDriveN 0.966327
## GarageCars0 0.000919 ***
## GarageCars1 0.263591
## GarageCars3 0.118322
## FenceGdWo 0.011258 *
## HouseStyle1.5Fin 0.006359 **
## LotFrontage 0.133156
## BldgType1Fam 0.000184 ***
## LandContourHLS 0.082417 .
## BsmtFinSF1 0.00000473959556518 ***
## GrLivArea_transformed < 0.0000000000000002 ***
## TotalBsmtSF_transformed 0.00003542178495436 ***
## X1stFlrSF 0.977031
## X2ndFlrSF_transformed 0.751747
## OverallQualFactor3 0.068943 .
## OverallQualFactor4 0.000388 ***
## OverallQualFactor5 0.00001406450409477 ***
## OverallQualFactor6 0.00000017687825412 ***
## OverallQualFactor7 0.00000000072277667 ***
## OverallQualFactor8 0.00000000000059271 ***
## OverallQualFactor9 < 0.0000000000000002 ***
## OverallQualFactor10 < 0.0000000000000002 ***
## RemodSold_transformed 0.00000000000000191 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05334 on 1255 degrees of freedom
## (145 observations deleted due to missingness)
## Multiple R-squared: 0.9144, Adjusted R-squared: 0.9105
## F-statistic: 231.2 on 58 and 1255 DF, p-value: < 0.00000000000000022
Variance Inflation Factor of the linear model:
Variance inflation factor (VIF) is a way to detect multicollinearity within many predictors between a model to see if the model is overstating the prediction. Calculating VIF for every choen predictors in the SalePrice predictiion model, it can be concluded that there is no multicollinearity within the predictors. Because no predictor has a score exceeding 4.1, meanwhile the threshold in this case is 6.0.
## GVIF Df GVIF^(1/(2*Df))
## GarageArea 5.870681 1 2.422949
## LotArea 1.344816 1 1.159662
## AgeSold 6.603545 1 2.569736
## WoodDeckSF 1.239698 1 1.113417
## BsmtUnfSF 6.962755 1 2.638703
## OpenScreenPorch 1.208743 1 1.099429
## Neighborhood 6.793364 5 1.211179
## MSZoning 3.638147 4 1.175195
## MSSubClass 15.439574 3 1.577996
## Fireplaces 1.892270 2 1.172859
## TotalFullBath 2.827840 2 1.296772
## LotShape 1.094249 1 1.046064
## ExterQual 6.110426 2 1.572236
## SaleType 1.856226 1 1.362434
## MasVnrType 2.088637 3 1.130604
## LotConfig 1.151927 1 1.073279
## Exterior1st 1.590798 2 1.123062
## PoolArea 1.172875 1 1.082993
## CentralAir 1.522849 1 1.234038
## PavedDrive 1.571140 2 1.119576
## GarageCars 12.404889 3 1.521477
## Fence 1.067033 1 1.032973
## HouseStyle 2.257656 1 1.502550
## LotFrontage 1.870442 1 1.367641
## BldgType 2.002288 1 1.415022
## LandContour 1.130049 1 1.063038
## BsmtFinSF1 6.944447 1 2.635232
## GrLivArea_transformed 16.577368 1 4.071531
## TotalBsmtSF_transformed 3.598660 1 1.897013
## X1stFlrSF 13.767244 1 3.710424
## X2ndFlrSF_transformed 15.895099 1 3.986866
## OverallQualFactor 20.998535 8 1.209586
## RemodSold_transformed 3.030923 1 1.740955
The four diagnostic plots demonstrates no concerns with the model: