To view all the files in this project, please visit my GitHub site.
Let's say someone is interested in buying a house and doesn't know how to prepare for it. Imagine this person is looking for a dream home and doesn't even know where to start. Buying a house starts with knowing the sale price, so a buyer who has no idea what that price should be has a problem. For many first-time home buyers, buying a house is a once-in-a-lifetime experience. It is an exciting moment, but it can turn into a bad one if things go unexpectedly. Thus, the question is: how can a home buyer estimate the sale price of a house? In other words, how can we predict it? Machine learning can help answer this question. A machine learning algorithm is a computer algorithm that learns a pattern from data and then uses that pattern to predict future outcomes.
Saying that machine learning is a computer algorithm that builds a pattern for the sale price of a house is not fully correct, because there are many algorithms and the pattern relies heavily on the data. If there is no data, there is no price prediction. Data therefore plays an important role in predicting the price of anything with a machine learning algorithm. The dataset used to predict the sale price of a house comes from the Ames housing division in Iowa and is available at Kaggle.com. It has 1460 observations and 81 variables, including the sale price (the target variable). These features describe just about everything that is part of a house: the lot size, the year the house was built, the garage condition, the fireplace condition, etc. Although there are plenty of variables in this dataset, we still need to clean the data before fitting a model. The cleaning involved removing unnecessary variables, variables that duplicate the meaning of other variables, and variables that, by visual inspection, have too many missing values. Finally, we used correlation to find which variables have a strong influence on the target variable.
The purpose of cleaning the data is to ensure that we have the right data for the method used in the analysis. Let's recall that data analysis is the process of collecting, modeling, and analyzing data to extract meaningful information that ultimately leads to decision-making. In other words, leadership wants to make decisions based on solid facts, and building these solid facts is where the methodology comes into play. There are several methods and techniques used in analyzing data, and the choice depends on the nature of the data (quantitative or qualitative). We treat the Ames Housing data as quantitative because its values come from measurements collected on each house individually, for example the number of bedrooms, the number of cars the garage holds, and the size of the lot. Some values look qualitative, such as the condition of the garage (excellent, good, fair, or none), but we still treat them as quantitative: this is not a survey with non-uniform responses, but a rating made against a standard that is applied uniformly to all houses, which requires expertise. Cleaning the data is the step that precedes the analysis. We used multiple linear regression and a neural network to analyze the data. The technique is essentially fitting the data to the model, where the model refers to the computerized formulas behind each machine learning algorithm.
The purpose of the analysis is to predict the sale price of a house, which answers our initial question. The accuracy of the prediction depends on the model. We chose multiple linear regression because, while preparing the data, we plotted a few predictors against the target variable and saw a linear relationship. We chose a neural network as the second model because we wanted to see how a different kind of algorithm would perform against regression. (Both are supervised here, since each is trained on known sale prices.) In the end, the verdict was tight because each algorithm performed well. Since neural networks seem to be popular in neuron studies (a complicated topic), we thought the network would easily outperform the multiple linear regression. That assumption did not hold: the regression showed better results, with a lower mean squared error than the neural network. In addition, based on outputs such as the p-values, most of the chosen explanatory variables explained the sale price of a house well. Furthermore, we prefer the regression because we can apply it in a business. Knowing a set of features that directly influence the sale price of a house, along with a machine learning algorithm, we can build an application that tells customers the expected sale price. We think this can be a starting point for many home buyers. Other factors such as speculation (demand and supply) and inflation also influence the sale price of a house, but we believe home buyers want to know this price first.
How can we predict the sale price of a house?
We found an interesting dataset at: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data
This data was originally generated by Ames Housing, a division within the city of Ames, Iowa. Thanks to Kaggle for making the data freely available. The data is a record of house sales with all the defined criteria. Below is a description of the variables present in the dataset.
For a quicker look, we made the data (unmodified) available at my GitHub site.
This data is composed of 80 features for 1460 observations, with a mix of integer and character data types. The target variable, or variable of interest, is "SalePrice". According to the data dictionary, "SalePrice" is the price at which a house was sold. The other variables are criteria used to derive the price of the house. We believe 81 variables is excessive for determining a price. Not all of these variables are relevant to the target variable, and some have missing values. Therefore, we will reduce the number of features by pertinence to SalePrice and clean the missing values.
Let's see the missing values.
There are 43 character (categorical) variables and 38 numerical variables. We know there are 1460 observations, so seeing the number (or percentage) of missing values per variable gives us an idea of the cleaning approach.
Variable | Missing values |
---|---|
Id | 0 |
MSSubClass | 0 |
MSZoning | 0 |
LotFrontage | 259 |
LotArea | 0 |
Street | 0 |
Alley | 1369 |
LotShape | 0 |
LandContour | 0 |
Utilities | 0 |
LotConfig | 0 |
LandSlope | 0 |
Neighborhood | 0 |
Condition1 | 0 |
Condition2 | 0 |
BldgType | 0 |
HouseStyle | 0 |
OverallQual | 0 |
OverallCond | 0 |
YearBuilt | 0 |
YearRemodAdd | 0 |
RoofStyle | 0 |
RoofMatl | 0 |
Exterior1st | 0 |
Exterior2nd | 0 |
MasVnrType | 8 |
MasVnrArea | 8 |
ExterQual | 0 |
ExterCond | 0 |
Foundation | 0 |
BsmtQual | 37 |
BsmtCond | 37 |
BsmtExposure | 38 |
BsmtFinType1 | 37 |
BsmtFinSF1 | 0 |
BsmtFinType2 | 38 |
BsmtFinSF2 | 0 |
BsmtUnfSF | 0 |
TotalBsmtSF | 0 |
Heating | 0 |
We showed the first half of the variables. Let's see the other half.
Variable | Missing values |
---|---|
HeatingQC | 0 |
CentralAir | 0 |
Electrical | 1 |
X1stFlrSF | 0 |
X2ndFlrSF | 0 |
LowQualFinSF | 0 |
GrLivArea | 0 |
BsmtFullBath | 0 |
BsmtHalfBath | 0 |
FullBath | 0 |
HalfBath | 0 |
BedroomAbvGr | 0 |
KitchenAbvGr | 0 |
KitchenQual | 0 |
TotRmsAbvGrd | 0 |
Functional | 0 |
Fireplaces | 0 |
FireplaceQu | 690 |
GarageType | 81 |
GarageYrBlt | 81 |
GarageFinish | 81 |
GarageCars | 0 |
GarageArea | 0 |
GarageQual | 81 |
GarageCond | 81 |
PavedDrive | 0 |
WoodDeckSF | 0 |
OpenPorchSF | 0 |
EnclosedPorch | 0 |
X3SsnPorch | 0 |
ScreenPorch | 0 |
PoolArea | 0 |
PoolQC | 1453 |
Fence | 1179 |
MiscFeature | 1406 |
MiscVal | 0 |
MoSold | 0 |
YrSold | 0 |
SaleType | 0 |
SaleCondition | 0 |
SalePrice | 0 |
By visual inspection, we can remove these variables: PoolQC, Fence, Alley, MiscFeature, FireplaceQu, and LotFrontage, because they are missing too many values. That probably means these variables are not important: the houses that do have values for them are probably special, with extras most houses lack.
LotFrontage: linear feet of street connected to the property. We want to remove this variable because 17.7% of its values are missing and because every house is built under the city regulations.
We will also remove the variable "Id". There is no need for it because it is just a record identifier.
Variable | Missing values |
---|---|
LotFrontage | 259 |
Alley | 1369 |
MasVnrType | 8 |
MasVnrArea | 8 |
BsmtQual | 37 |
BsmtCond | 37 |
BsmtExposure | 38 |
BsmtFinType1 | 37 |
BsmtFinType2 | 38 |
Electrical | 1 |
FireplaceQu | 690 |
GarageType | 81 |
GarageYrBlt | 81 |
GarageFinish | 81 |
GarageQual | 81 |
GarageCond | 81 |
PoolQC | 1453 |
Fence | 1179 |
MiscFeature | 1406 |
Above are variables with missing values greater than 0. I like this already!
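The counting and dropping steps above can be sketched in base R. This is a minimal illustration, not the original analysis code: the `houses` data frame is a hypothetical toy stand-in for the Ames data, and the 15% cutoff is an assumed threshold (the text chose its variables by visual inspection).

```r
# Hypothetical toy data standing in for the Ames training set
houses <- data.frame(
  Id = 1:8,
  LotArea = c(8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382),
  Alley = c(NA, NA, "Grvl", NA, NA, NA, NA, NA),
  PoolQC = c(NA, NA, NA, NA, NA, "Gd", NA, NA),
  MasVnrType = c("BrkFace", "None", NA, "None", "BrkFace", "None", "Stone", "Stone"),
  SalePrice = c(208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000),
  stringsAsFactors = FALSE
)

# Count missing values per column and show only the columns that have any
na_counts <- colSums(is.na(houses))
print(na_counts[na_counts > 0])

# Drop columns whose share of missing values exceeds the assumed cutoff, plus Id
na_share <- na_counts / nrow(houses)
too_sparse <- names(na_share)[na_share > 0.15]
houses_reduced <- houses[, !(names(houses) %in% c(too_sparse, "Id")), drop = FALSE]
print(names(houses_reduced))
```

A fixed cutoff makes the choice reproducible, but it is only a starting point: a variable can sit below the cutoff and still be dropped for the domain reasons discussed next.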
This is where having friends who work in real estate or construction pays off.
Next, we will evaluate all other variables with missing values (does it really matter? can it be done easily? does it cost a lot?).
GarageFinish (81 missing values): interior finish of the garage. Average home buyers don't care about it, and the condition of the garage is sufficient. Remove.
GarageQual (81 missing values): garage quality, which might mean the quality of the materials used to build the garage. Remove.
GarageCond (81 missing values): garage condition, which might mean its condition (look) at the time the house is sold. Keep.
GarageType (81 missing values): garage location. The garage condition is enough. Remove.
GarageYrBlt (81 missing values): year the garage was built. Largely the same as YearBuilt. Remove.
MasVnrType (8 missing values): masonry veneer type, the type of material used to construct the house (brick, stone, ...). We can keep it for now.
MasVnrArea (8 missing values): masonry veneer area in square feet. This is the kind of work home buyers won't do, because it involves doing the math to find the ratio of the area covered by veneer vs. not covered. Remove.
BsmtQual (37 missing values): evaluates the height of the basement. It will be removed because another variable describes the same thing.
BsmtCond (37 missing values): evaluates the general condition of the basement. We can keep it.
BsmtExposure (38 missing values): refers to walkout or garden-level walls. Average home buyers don't care about it. Remove.
BsmtFinType1 (37 missing values): rating of the basement finished area. Already described. Remove.
BsmtFinType2 (38 missing values): rating of the second basement finished area, if there is one. No tricky math for the buyer. Remove.
Electrical (1 missing value): we will only remove the one row with the missing value, which does no harm.
## [1] "MSSubClass" "MSZoning" "LotArea" "Street"
## [5] "LotShape" "LandContour" "Utilities" "LotConfig"
## [9] "LandSlope" "Neighborhood" "Condition1" "Condition2"
## [13] "BldgType" "HouseStyle" "OverallQual" "OverallCond"
## [17] "YearBuilt" "YearRemodAdd" "RoofStyle" "RoofMatl"
## [21] "Exterior1st" "Exterior2nd" "MasVnrType" "ExterQual"
## [25] "ExterCond" "Foundation" "BsmtCond" "BsmtFinSF1"
## [29] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [33] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [37] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [41] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [45] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [49] "Fireplaces" "GarageCars" "GarageArea" "GarageCond"
## [53] "PavedDrive" "WoodDeckSF" "OpenPorchSF" "EnclosedPorch"
## [57] "X3SsnPorch" "ScreenPorch" "PoolArea" "MiscVal"
## [61] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [65] "SalePrice"
The list above gives the names of the variables remaining after we removed the first set as described earlier. We continue to remove variables by pertinence to the sale price (if a variable seems like an extra, or something one can live without in the area).
LandContour, LotShape, Street, LotConfig, LandSlope, Condition2, RoofStyle, Exterior2nd, OverallQual, Foundation, TotalBsmtSF, BsmtUnfSF, Heating, Electrical, X2ndFlrSF, LowQualFinSF, BsmtFullBath, BsmtHalfBath, TotRmsAbvGrd, PavedDrive, EnclosedPorch, X3SsnPorch, ScreenPorch, MoSold, MSSubClass, Condition1
## [1] "MSZoning" "LotArea" "Utilities" "Neighborhood"
## [5] "BldgType" "HouseStyle" "OverallCond" "YearBuilt"
## [9] "YearRemodAdd" "RoofMatl" "Exterior1st" "MasVnrType"
## [13] "ExterQual" "ExterCond" "BsmtCond" "BsmtFinSF1"
## [17] "BsmtFinSF2" "HeatingQC" "CentralAir" "X1stFlrSF"
## [21] "GrLivArea" "FullBath" "HalfBath" "BedroomAbvGr"
## [25] "KitchenAbvGr" "KitchenQual" "Functional" "Fireplaces"
## [29] "GarageCars" "GarageArea" "GarageCond" "WoodDeckSF"
## [33] "OpenPorchSF" "PoolArea" "MiscVal" "YrSold"
## [37] "SaleType" "SaleCondition" "SalePrice"
## MasVnrType BsmtCond GarageCond
## 8 37 81
We reduced the number of variables by nearly half. There is no need for imputation by the mean or other numerical imputations. We just need to replace each missing value according to the definition of the variable, and remove the rows where "NA" has no meaning and the number of missing values is very low (for example, 10 missing values will not matter much among 1460 observations).
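A minimal sketch of this definition-based cleaning follows. It assumes that NA in BsmtCond and GarageCond means "no basement" and "no garage", as in the data dictionary. The toy `houses` data frame and the "None" label are hypothetical choices for illustration.

```r
# Hypothetical toy data with the three columns that still contain NA
houses <- data.frame(
  MasVnrType = c("BrkFace", NA, "None", "Stone", "None"),
  BsmtCond = c("TA", "Gd", NA, "TA", "Fa"),
  GarageCond = c("TA", NA, "TA", NA, "Gd"),
  SalePrice = c(208500, 181500, 223500, 140000, 250000),
  stringsAsFactors = FALSE
)

# Here NA carries meaning (the house has no basement / no garage),
# so recode it as an explicit level; "None" is an assumed label
for (col in c("BsmtCond", "GarageCond")) {
  houses[[col]][is.na(houses[[col]])] <- "None"
}

# NA in MasVnrType has no such meaning, and there are few of them: drop those rows
houses_clean <- houses[!is.na(houses$MasVnrType), ]
print(colSums(is.na(houses_clean)))
```

Recoding keeps the houses without a garage or basement in the data, which matters because the absence of a feature is itself information about the price.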
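39 variables remain after cleaning (38 predictors plus SalePrice). Let's convert the character variables to categorical ones. In the summary below, an asterisk marks a categorical variable that was converted to numeric codes for the statistics.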
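One way to do this conversion in base R is sketched below, on a hypothetical toy data frame; the original code may have used a different approach.

```r
# Hypothetical toy data with two character columns and one numeric column
houses <- data.frame(
  MSZoning = c("RL", "RM", "RL", "FV"),
  CentralAir = c("Y", "Y", "N", "Y"),
  LotArea = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(houses, is.character, logical(1))
houses[is_chr] <- lapply(houses[is_chr], factor)

str(houses)
```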
Variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MSZoning* | 1 | 1452 | 4.033058e+00 | 6.269907e-01 | 4.0 | 4.062823e+00 | 0.0000 | 1 | 5 | 4 | -1.7341136 | 6.4223786 | 0.0164542 |
LotArea | 2 | 1452 | 1.050728e+04 | 9.989564e+03 | 9478.5 | 9.560115e+03 | 2962.2348 | 1300 | 215245 | 213945 | 12.2147550 | 202.7382100 | 262.1580559 |
Utilities* | 3 | 1452 | 1.000689e+00 | 2.624320e-02 | 1.0 | 1.000000e+00 | 0.0000 | 1 | 2 | 1 | 38.0264243 | 1445.0041308 | 0.0006887 |
Neighborhood* | 4 | 1452 | 1.313912e+01 | 5.889512e+00 | 13.0 | 1.309380e+01 | 7.4130 | 1 | 25 | 24 | 0.0198860 | -1.0506232 | 0.1545596 |
BldgType* | 5 | 1452 | 1.493113e+00 | 1.197554e+00 | 1.0 | 1.144578e+00 | 0.0000 | 1 | 5 | 4 | 2.2406039 | 3.4060800 | 0.0314276 |
HouseStyle* | 6 | 1452 | 4.037879e+00 | 1.913519e+00 | 3.0 | 4.028399e+00 | 1.4826 | 1 | 8 | 7 | 0.3058160 | -0.9646592 | 0.0502169 |
OverallCond | 7 | 1452 | 5.579201e+00 | 1.113136e+00 | 5.0 | 5.481067e+00 | 0.0000 | 1 | 9 | 8 | 0.6934936 | 1.0807585 | 0.0292122 |
YearBuilt | 8 | 1452 | 1.971116e+03 | 3.019376e+01 | 1972.0 | 1.973943e+03 | 37.0650 | 1872 | 2010 | 138 | -0.6076580 | -0.4482779 | 0.7923807 |
YearRemodAdd | 9 | 1452 | 1.984775e+03 | 2.065247e+01 | 1993.0 | 1.986256e+03 | 19.2738 | 1950 | 2010 | 60 | -0.4962541 | -1.2807746 | 0.5419867 |
RoofMatl* | 10 | 1452 | 2.075758e+00 | 6.007500e-01 | 2.0 | 2.000000e+00 | 0.0000 | 1 | 8 | 7 | 8.0694933 | 65.8874243 | 0.0157656 |
Exterior1st* | 11 | 1452 | 1.062052e+01 | 3.197594e+00 | 13.0 | 1.092427e+01 | 1.4826 | 1 | 15 | 14 | -0.7230384 | -0.3648052 | 0.0839151 |
MasVnrType* | 12 | 1452 | 2.761019e+00 | 6.157107e-01 | 3.0 | 2.728916e+00 | 0.0000 | 1 | 4 | 3 | -0.0673056 | -0.1344926 | 0.0161582 |
ExterQual* | 13 | 1452 | 3.544077e+00 | 6.916750e-01 | 4.0 | 3.654905e+00 | 0.0000 | 1 | 4 | 3 | -1.8375828 | 3.9126787 | 0.0181518 |
ExterCond* | 14 | 1452 | 4.732094e+00 | 7.335536e-01 | 5.0 | 4.944062e+00 | 0.0000 | 1 | 5 | 4 | -2.5500640 | 5.2369125 | 0.0192508 |
BsmtCond* | 15 | 1452 | 4.691460e+00 | 9.493638e-01 | 5.0 | 4.996558e+00 | 0.0000 | 1 | 5 | 4 | -2.9401349 | 7.1989685 | 0.0249143 |
BsmtFinSF1 | 16 | 1452 | 4.419704e+02 | 4.553603e+02 | 381.0 | 3.842900e+02 | 564.8706 | 0 | 5644 | 5644 | 1.6993684 | 11.2275285 | 11.9501082 |
BsmtFinSF2 | 17 | 1452 | 4.680579e+01 | 1.617262e+02 | 0.0 | 1.492255e+00 | 0.0000 | 0 | 1474 | 1474 | 4.2331421 | 19.8767340 | 4.2442131 |
HeatingQC* | 18 | 1452 | 2.544077e+00 | 1.739828e+00 | 1.0 | 2.430293e+00 | 0.0000 | 1 | 5 | 4 | 0.4761525 | -1.5173517 | 0.0456586 |
CentralAir* | 19 | 1452 | 1.934573e+00 | 2.473630e-01 | 2.0 | 2.000000e+00 | 0.0000 | 1 | 2 | 1 | -3.5112263 | 10.3358300 | 0.0064916 |
X1stFlrSF | 20 | 1452 | 1.161271e+03 | 3.850184e+02 | 1086.0 | 1.129282e+03 | 347.6697 | 334 | 4692 | 4358 | 1.3705592 | 5.7935362 | 10.1041132 |
GrLivArea | 21 | 1452 | 1.514092e+03 | 5.256278e+02 | 1461.5 | 1.466049e+03 | 483.3276 | 334 | 5642 | 5308 | 1.3715369 | 4.9011921 | 13.7941514 |
FullBath | 22 | 1452 | 1.562672e+00 | 5.502313e-01 | 2.0 | 1.558520e+00 | 0.0000 | 0 | 3 | 3 | 0.0363711 | -0.8723593 | 0.0144398 |
HalfBath | 23 | 1452 | 3.815427e-01 | 5.026637e-01 | 0.0 | 3.416523e-01 | 0.0000 | 0 | 2 | 2 | 0.6816203 | -1.0673211 | 0.0131915 |
BedroomAbvGr | 24 | 1452 | 2.867080e+00 | 8.148122e-01 | 3.0 | 2.852840e+00 | 0.0000 | 0 | 8 | 8 | 0.2171317 | 2.2390257 | 0.0213833 |
KitchenAbvGr | 25 | 1452 | 1.046143e+00 | 2.194982e-01 | 1.0 | 1.000000e+00 | 0.0000 | 0 | 3 | 3 | 4.5052679 | 21.7240175 | 0.0057603 |
KitchenQual* | 26 | 1452 | 3.342287e+00 | 8.297051e-01 | 4.0 | 3.506885e+00 | 0.0000 | 1 | 4 | 3 | -1.4239365 | 1.7256217 | 0.0217741 |
Functional* | 27 | 1452 | 6.752066e+00 | 9.705320e-01 | 7.0 | 7.000000e+00 | 0.0000 | 1 | 7 | 6 | -4.0811771 | 16.4229147 | 0.0254699 |
Fireplaces | 28 | 1452 | 6.122590e-01 | 6.434218e-01 | 1.0 | 5.344234e-01 | 1.4826 | 0 | 3 | 3 | 0.6473945 | -0.2166199 | 0.0168854 |
GarageCars | 29 | 1452 | 1.765151e+00 | 7.484957e-01 | 2.0 | 1.771945e+00 | 0.0000 | 0 | 4 | 4 | -0.3374301 | 0.2003861 | 0.0196429 |
GarageArea | 30 | 1452 | 4.724752e+02 | 2.141064e+02 | 478.0 | 4.692849e+02 | 174.9468 | 0 | 1418 | 1418 | 0.1829216 | 0.8988780 | 5.6188357 |
GarageCond* | 31 | 1452 | 5.761708e+00 | 8.046860e-01 | 6.0 | 6.000000e+00 | 0.0000 | 1 | 6 | 5 | -3.5643891 | 12.3529271 | 0.0211175 |
WoodDeckSF | 32 | 1452 | 9.441667e+01 | 1.253937e+02 | 0.0 | 7.194234e+01 | 0.0000 | 0 | 857 | 857 | 1.5391213 | 2.9784831 | 3.2907314 |
OpenPorchSF | 33 | 1452 | 4.639050e+01 | 6.604862e+01 | 24.0 | 3.301291e+01 | 35.5824 | 0 | 547 | 547 | 2.3807985 | 8.6198899 | 1.7333267 |
PoolArea | 34 | 1452 | 2.774105e+00 | 4.028739e+01 | 0.0 | 0.000000e+00 | 0.0000 | 0 | 738 | 738 | 14.7566829 | 220.9454623 | 1.0572698 |
MiscVal | 35 | 1452 | 4.372865e+01 | 4.974783e+02 | 0.0 | 0.000000e+00 | 0.0000 | 0 | 15500 | 15500 | 24.3594783 | 693.8077959 | 13.0554190 |
YrSold | 36 | 1452 | 2.007815e+03 | 1.328927e+00 | 2008.0 | 2.007769e+03 | 1.4826 | 2006 | 2010 | 4 | 0.0956796 | -1.1956650 | 0.0348753 |
SaleType* | 37 | 1452 | 8.511019e+00 | 1.563458e+00 | 9.0 | 8.925129e+00 | 0.0000 | 1 | 9 | 8 | -3.8350611 | 14.5699843 | 0.0410301 |
SaleCondition* | 38 | 1452 | 4.768595e+00 | 1.101421e+00 | 5.0 | 5.000000e+00 | 0.0000 | 1 | 6 | 5 | -2.7410913 | 6.8301526 | 0.0289048 |
SalePrice | 39 | 1452 | 1.806151e+05 | 7.928554e+04 | 162700.0 | 1.705286e+05 | 55894.0200 | 34900 | 755000 | 720100 | 1.8801536 | 6.5401878 | 2080.7058509 |
Above is the distribution of the numerical variables. YrSold looks like it has many separate peaks; in fact, variables with multiple modes take a small set of values that repeat. It is hard to spot unusual values (outliers) this way.
Below is the distribution of the categorical variables. Not all of them are plotted, because some have too many levels to plot at once; there may be another technique for those.
One way to handle these categorical variables is to redefine some of them as numerical, since some of their levels are empty or close to empty. We could potentially use a rating: 0 means none, 1 means poor, and so on with 2, 3, 4, 5.
Of course, this is not the only way to transform them.
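As a sketch of that rating idea, the snippet below maps quality codes to the integers 0 to 5. The codes Po, Fa, TA, Gd, and Ex follow the data dictionary's quality scale. The "None" label and the sample vector are assumptions made for illustration.

```r
# Ordered quality scale: position in this vector minus one gives the rating
quality_levels <- c("None", "Po", "Fa", "TA", "Gd", "Ex")

# Hypothetical sample of garage condition codes
garage_cond <- c("TA", "None", "Gd", "Fa", "Ex", "Po")

# factor() fixes the level order; as.integer() returns 1..6, so subtract 1 for 0..5
garage_score <- as.integer(factor(garage_cond, levels = quality_levels)) - 1L
print(garage_score)
```

This encoding assumes the steps between ratings are equally spaced, which is a modeling choice rather than a property of the data.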
Let's see the range of the year variables.
## The earliest house was sold in 2006
Another way of seeing the houses sold in 2006...
Let's see some values.
## The latest house was sold in 2010
## The earliest house was built in 1872
## The latest house was built in 2010
## The house that was the first to be rebuilt was in 1950
## The house that was latest to be rebuilt was in 2010
Let's see how many houses were sold in each year.
YrSold | Freq |
---|---|
2006 | 313 |
2007 | 327 |
2008 | 301 |
2009 | 337 |
2010 | 174 |
Not sure what to think of the year variables. Do they affect the target variable? Let's transform the categorical variables to factors.
Let's see some boxplots.
As we can see, SalePrice and GrLivArea have some outliers. Because we want to run a regression analysis, we limit the variables to the numerical ones moving forward.
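The points a boxplot draws as outliers can also be listed numerically. The sketch below uses `boxplot.stats()` from base R on a hypothetical vector of living areas; it flags values more than 1.5 times the hinge spread beyond the hinges, which is the rule the boxplot itself applies.

```r
# Hypothetical living areas in square feet, with one very large house
gr_liv_area <- c(1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077, 5642)

# $out holds the values lying beyond the boxplot whiskers
outliers <- boxplot.stats(gr_liv_area)$out
print(outliers)
```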
Let's see some correlations.
Based on the correlation plot, we want to select the variables with a high correlation with the target variable. Let's run another correlation function to help us select these variables rather than relying on visuals alone (nothing wrong with visuals).
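We will keep the variables with a correlation above 0.5. There are 8 such variables, counting SalePrice itself (YearBuilt, YearRemodAdd, X1stFlrSF, GrLivArea, FullBath, GarageCars, GarageArea, SalePrice).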
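The 0.5 cutoff can be applied programmatically with base R's `cor()`. The sketch below uses simulated data: `GrLivArea` is built to drive `SalePrice` while `MiscVal` is independent noise. All values are hypothetical, and the original analysis may have used a different correlation function.

```r
set.seed(1)
n <- 200
gr_liv_area <- rnorm(n, mean = 1500, sd = 500)

# Simulated numeric predictors plus the target
num_vars <- data.frame(
  GrLivArea = gr_liv_area,
  MiscVal = rnorm(n, mean = 40, sd = 500),
  SalePrice = 100 * gr_liv_area + rnorm(n, sd = 20000)
)

# Correlation of every variable with SalePrice, excluding SalePrice itself
cors <- cor(num_vars)[, "SalePrice"]
cors <- cors[names(cors) != "SalePrice"]

# Keep predictors whose absolute correlation exceeds 0.5
selected <- names(cors)[abs(cors) > 0.5]
print(round(cors, 2))
print(selected)
```

Using the absolute value keeps strongly negative predictors too, since they are as informative for a regression as positive ones.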
Let's see SalePrice vs. GrLivArea.
Let's see SalePrice vs. YearBuilt.
Let's see SalePrice vs. YearRemodAdd.
Definitely a high correlation and a linear relationship.
We will run a multiple linear regression and a neural network. There are definitely linear relationships in this dataset that suit regression. The neural network is there to see how another kind of algorithm performs. Neural networks seem to perform well in neuron analysis (according to the neuroscientists who use them).
The split will be 0.8 train and 0.2 test.
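A minimal sketch of an 80/20 split with base R's `sample()` is shown below. The data frame, the seed, and the rounding with `floor()` are assumptions for illustration; the original split may have been made with a different function.

```r
set.seed(123)

# Hypothetical data: 100 houses with a living area and a price
houses <- data.frame(GrLivArea = seq(800, 2780, by = 20))
houses$SalePrice <- 50000 + 80 * houses$GrLivArea

# Draw 80% of the row indices at random for training; the rest is the test set
train_idx <- sample(seq_len(nrow(houses)), size = floor(0.8 * nrow(houses)))
train <- houses[train_idx, ]
test <- houses[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

Setting the seed makes the split reproducible, so the two models are compared on exactly the same test rows.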
We use the glm() function, which with its default Gaussian family fits a linear regression on more than one predictor.
##
## Call:
## glm(formula = SalePrice ~ ., data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -476681 -20824 -3747 17360 296387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.114e+06 1.469e+05 -14.391 < 2e-16 ***
## YearBuilt 4.827e+02 6.049e+01 7.980 3.52e-15 ***
## YearRemodAdd 5.875e+02 7.809e+01 7.524 1.07e-13 ***
## X1stFlrSF 3.499e+01 4.143e+00 8.445 < 2e-16 ***
## GrLivArea 6.928e+01 3.665e+00 18.902 < 2e-16 ***
## FullBath -4.560e+03 3.309e+03 -1.378 0.169
## GarageCars 1.886e+04 3.732e+03 5.053 5.04e-07 ***
## GarageArea 1.271e+01 1.294e+01 0.982 0.326
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1844051805)
##
## Null deviance: 7.4845e+12 on 1161 degrees of freedom
## Residual deviance: 2.1280e+12 on 1154 degrees of freedom
## AIC: 28099
##
## Number of Fisher Scoring iterations: 2
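Wow! Most of the variables are highly significant for SalePrice. The exceptions are FullBath (p = 0.169) and GarageArea (p = 0.326), which are not significant at the 0.05 level.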
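Rather than reading the stars off the summary, the p-values can be pulled out of the fitted model directly. The sketch below fits `glm()` on simulated data; the coefficients, the noise level, and the extra `Noise` column are hypothetical and chosen only to show the mechanics.

```r
set.seed(42)
n <- 300

# Simulated predictors; Noise has no real effect on the price
houses <- data.frame(
  GrLivArea = runif(n, 600, 3000),
  GarageCars = sample(0:3, n, replace = TRUE),
  Noise = rnorm(n)
)
houses$SalePrice <- 20000 + 70 * houses$GrLivArea +
  19000 * houses$GarageCars + rnorm(n, sd = 30000)

# Gaussian glm (the default family) is an ordinary multiple linear regression
fit <- glm(SalePrice ~ ., data = houses)

# Coefficient table without the intercept row; keep predictors with p < 0.05
coefs <- summary(fit)$coefficients
p_values <- coefs[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
print(signif(p_values, 3))
print(significant)
```

Predictors that fail this check are candidates for removal before refitting, although correlated predictors can mask each other's significance.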
Neural network plot
Well done!
Predicting with the neural network
## MSE_model2
## MSE_model1 922874054.912377
## 1477004184.67592 1
The neural network seems to perform about as well as the multiple linear regression.
A closer look
It is a hard call here, but we want to go with multiple linear regression since it has a better MSE and p-values.
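For reference, the MSE comparison itself can be sketched as follows. The actual and predicted prices are made-up numbers, not the results of this analysis, and `mse()` is a helper defined here for illustration.

```r
# Mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Hypothetical test-set prices and predictions from two models
actual <- c(200000, 150000, 300000, 180000)
pred_regression <- c(210000, 140000, 290000, 185000)
pred_neural_net <- c(230000, 120000, 310000, 160000)

results <- c(
  regression = mse(actual, pred_regression),
  neural_net = mse(actual, pred_neural_net)
)
print(results)

# The model with the smaller MSE, and the errors on the dollar scale (RMSE)
best_model <- names(which.min(results))
print(best_model)
print(sqrt(results))
```

Taking the square root puts the error back in dollars, which is easier to explain to a home buyer than a squared quantity.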