This analysis builds a machine learning model to predict house prices through the following steps: data preprocessing, data wrangling, exploratory data analysis, model building, model comparison, assumption testing, model improvement, model interpretation, and conclusion. The first round of analysis recommends model_all, but that model does not meet the four assumption tests. Outlier treatment is therefore applied to each numeric variable so that the four assumptions can be met. As a result, model_tuning is recommended for predicting house prices, and step_model is appropriate for examining the significance of the predictors. Finally, this research recommends trying a more complex model (one that does not rely on these assumptions) so that non-linear relationships can be captured, and using the step-wise regression model, step_model, as the recommended model; there is still room for improvement.
Ask a home buyer to describe their dream house, and they
probably won’t begin with the height of the basement ceiling or the
proximity to an east-west railroad. But this playground competition’s
dataset proves that much more influences price negotiations than the
number of bedrooms or a white-picket fence.
With 81 variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
The research objective is to predict house prices from
several predictor variables.
Here is a brief description of the variables that remain in the dataset after
variable selection.
| No. | Feature | Description |
|---|---|---|
| 1. | MSSubClass | Identifies the type of dwelling involved in the sale. |
| 2. | MSZoning | Identifies the general zoning classification of the sale. |
| 3. | LotFrontage | Linear feet of street connected to property. |
| 4. | LotArea | Lot size in square feet. |
| 5. | Street | Type of road access to property. |
| 6. | LotConfig | Lot configuration. |
| 7. | Condition1 | Proximity to various conditions. |
| 8. | Condition2 | Proximity to various conditions (if more than one is present). |
| 9. | BldgType | Type of dwelling. |
| 10. | HouseStyle | Style of dwelling. |
| 11. | OverallQual | Rates the overall material and finish of the house. |
| 12. | OverallCond | Rates the overall condition of the house. |
| 13. | YearBuilt | Original construction date. |
| 14. | YearRemodAdd | Remodel date (same as construction date if no remodeling or additions). |
| 15. | Exterior1st | Exterior covering on house. |
| 16. | ExterQual | Evaluates the quality of the material on the exterior. |
| 17. | X1stFlrSF | First Floor square feet. |
| 18. | X2ndFlrSF | Second floor square feet. |
| 19. | GrLivArea | Above grade (ground) living area square feet. |
| 20. | BsmtFullBath | Basement full bathrooms. |
| 21. | FullBath | Full bathrooms above grade. |
| 22. | HalfBath | Half baths above grade. |
| 23. | BedroomAbvGr | Bedrooms above grade (does NOT include basement bedrooms). |
| 24. | TotRmsAbvGrd | Total rooms above grade (does not include bathrooms). |
| 25. | Fireplaces | Number of fireplaces. |
| 26. | GarageType | Garage location. |
| 27. | GarageCars | Size of garage in car capacity. |
| 28. | GarageArea | Size of garage in square feet. |
| 29. | PavedDrive | Paved driveway. |
| 30. | OpenPorchSF | Open porch area in square feet. |
| 31. | SaleType | Type of sale. |
| 32. | SalePrice | House’s price. |
Several packages are used in this research, as
follows:
# data cleaning
library(readr)
library(dplyr)
# data analysis
library(GGally)
# data visualization
library(ggplot2)
library(scales)
library(echarts4r)
# Cross Validation
library(rsample)
# RMSE
library(MLmetrics)
# Model Performance Comparison
library(performance)
# Hypothesis test
library(lmtest)
library(car)
This data set contains 38 columns of numeric data type and 43
columns of character data type.
The data set (“train.csv”) is loaded
and assigned to the house object. The house data is then ready for the data
wrangling process.
house <- read.csv("data_input/train.csv")
house
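The column-type counts quoted above could be checked with a tally like the one below. This is a minimal base R sketch on a hypothetical three-column stand-in called `toy`; the same two lines applied to `house` would give the real counts, which are not reproduced here.

```r
# Hypothetical stand-in for `house`; the real data comes from train.csv.
toy <- data.frame(
  LotArea   = c(8450L, 9600L, 11250L),
  SalePrice = c(208500L, 181500L, 223500L),
  Street    = c("Pave", "Pave", "Grvl"),
  stringsAsFactors = FALSE
)

# Class of each column, then a tally of columns per class.
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
type_counts <- table(col_classes)
print(type_counts)
```

Note that read.csv stores whole-number columns as integer, so "numeric" in the text would correspond to the integer (and double) classes in such a tally.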
This step inspects the data in three ways: the full data set,
the top 6 observations, and the bottom 6 observations, using the
head and tail functions to get a first impression of
the data set.
1. Full data set.
house
2. Top 6 observations of data set.
# Top 6 data
head(house)
3. Bottom 6 observations of data set.
# Bottom 6 data
tail(house)
The Id variable doesn’t provide any useful information for linear regression analysis. In addition, from a business perspective and to avoid redundancy, this analysis excludes the following variables:
house_new <- house %>%
select(-c(Id,Alley,LotShape,LandContour,Utilities,LandSlope,Neighborhood,RoofStyle,
RoofMatl,Exterior2nd,MasVnrType,MasVnrArea,ExterCond,Foundation,BsmtQual,
BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,
BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,
LowQualFinSF,BsmtHalfBath,KitchenAbvGr,KitchenQual,Functional,FireplaceQu,
GarageYrBlt,GarageFinish,GarageQual,GarageCond,WoodDeckSF,EnclosedPorch,
X3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,
YrSold,SaleCondition))
house_new
The data type of each column is inspected with the glimpse()
function from the dplyr package to ensure it is appropriate. Several
variables are then converted to factor: MSZoning, Street, LotConfig,
Condition1, Condition2, BldgType, OverallQual, ExterQual, and
PavedDrive.
house_new <- house_new %>%
mutate_at(vars(MSZoning,Street,LotConfig,Condition1,Condition2,
BldgType,OverallQual,ExterQual,PavedDrive),as.factor)
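# Relevel
# Note: assigning to levels() renames the existing (alphabetically sorted)
# levels by position; it does not reorder them. The outputs below reflect
# this assignment. factor(x, levels = ...) reorders without relabeling.
levels(house_new$ExterQual) <- c("Fa","TA","Gd","Ex")
glimpse(house_new)
#> Rows: 1,460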
#> Columns: 32
#> $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20, …
#> $ MSZoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, RL, RL, R…
#> $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, N…
#> $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120…
#> $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
#> $ LotConfig <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside, Corner,…
#> $ Condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN, Artery, …
#> $ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Art…
#> $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 2fm…
#> $ HouseStyle <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fin…
#> $ OverallQual <fct> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5, …
#> $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5, …
#> $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 193…
#> $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 195…
#> $ Exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "V…
#> $ ExterQual <fct> Gd, Ex, Gd, Ex, Gd, Ex, Gd, Ex, Ex, Ex, Ex, Fa, Ex, Gd, E…
#> $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, 1…
#> $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0, …
#> $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 107…
#> $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, …
#> $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1, …
#> $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, …
#> $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3, …
#> $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6,…
#> $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0, …
#> $ GarageType <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attchd…
#> $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2, …
#> $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 73…
#> $ PavedDrive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
#> $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213, …
#> $ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD…
#> $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, 307000, 2…
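One detail of the releveling step is worth checking: in R, assigning a character vector to `levels()` renames the levels by position, whereas `factor(x, levels = ...)` only changes their order. The sketch below shows the difference on toy values (not the Ames data); the vector `x` and both result names are illustrative.

```r
# Toy quality codes; as.factor() sorts the levels alphabetically.
x <- factor(c("Gd", "TA", "Gd", "Ex", "Fa"))
levels(x)  # "Ex" "Fa" "Gd" "TA"

# Positional renaming: "Ex" becomes "Fa", "Fa" becomes "TA", "TA" becomes "Ex".
renamed <- x
levels(renamed) <- c("Fa", "TA", "Gd", "Ex")

# Reordering: every observation keeps its label, only the level order changes.
reordered <- factor(x, levels = c("Fa", "TA", "Gd", "Ex"))

as.character(renamed)    # labels have changed
as.character(reordered)  # labels are unchanged
```

If the raw ExterQual column uses these same four codes, the reordering form would be the safer way to set the reference level for the regression.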
- Check Missing Values.
Missing values must be checked in this step
so that they can be treated
before the data is analyzed.
colSums(is.na(house_new))
#> MSSubClass MSZoning LotFrontage LotArea Street LotConfig
#> 0 0 259 0 0 0
#> Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond
#> 0 0 0 0 0 0
#> YearBuilt YearRemodAdd Exterior1st ExterQual X1stFlrSF X2ndFlrSF
#> 0 0 0 0 0 0
#> GrLivArea BsmtFullBath FullBath HalfBath BedroomAbvGr TotRmsAbvGrd
#> 0 0 0 0 0 0
#> Fireplaces GarageType GarageCars GarageArea PavedDrive OpenPorchSF
#> 0 81 0 0 0 0
#> SaleType SalePrice
#> 0 0
- Missing Value Treatment.
The treatments
for missing values are to drop the rows where GarageType is missing and to
replace the missing values of LotFrontage with the LotFrontage median.
# Drop rows with missing GarageType
house_new <- house_new[!is.na(house_new$GarageType),]
# Impute NA with median of LotFrontage = 70
house_new$LotFrontage[is.na(house_new$LotFrontage)] <- median(house_new$LotFrontage, na.rm = TRUE)
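The two treatments above can be traced on a small example. This is a minimal sketch with a hypothetical five-row data frame `toy`; only the column names are taken from the report.

```r
# Hypothetical rows with one missing GarageType and one missing LotFrontage.
toy <- data.frame(
  GarageType  = c("Attchd", NA, "Detchd", "Attchd", "BuiltIn"),
  LotFrontage = c(65, 80, NA, 60, 84)
)

# Drop the rows (not the column) where GarageType is missing.
toy <- toy[!is.na(toy$GarageType), ]

# Impute the remaining LotFrontage gaps with the median of the observed values.
lot_median <- median(toy$LotFrontage, na.rm = TRUE)
toy$LotFrontage[is.na(toy$LotFrontage)] <- lot_median

colSums(is.na(toy))
```

Because the rows are dropped first, the median is computed on the remaining rows only, which is also the order used in the report.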
- Recheck Missing Values.
Recheck each
variable to ensure that there are no missing values
left.
colSums(is.na(house_new))
#> MSSubClass MSZoning LotFrontage LotArea Street LotConfig
#> 0 0 0 0 0 0
#> Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond
#> 0 0 0 0 0 0
#> YearBuilt YearRemodAdd Exterior1st ExterQual X1stFlrSF X2ndFlrSF
#> 0 0 0 0 0 0
#> GrLivArea BsmtFullBath FullBath HalfBath BedroomAbvGr TotRmsAbvGrd
#> 0 0 0 0 0 0
#> Fireplaces GarageType GarageCars GarageArea PavedDrive OpenPorchSF
#> 0 0 0 0 0 0
#> SaleType SalePrice
#> 0 0
Check whether the data contains any duplicated
rows.
house_new %>%
duplicated() %>%
sum()
#> [1] 0
The summary() function is used to inspect the five-number summary and
the mean of each numeric variable (to understand its
distribution), the number of observations in each level of each factor variable, and
whether any variable appears to contain
outliers.
summary(house_new)
#> MSSubClass MSZoning LotFrontage LotArea Street
#> Min. : 20.00 C (all): 8 Min. : 21.00 Min. : 1300 Grvl: 5
#> 1st Qu.: 20.00 FV : 65 1st Qu.: 60.00 1st Qu.: 7741 Pave:1374
#> Median : 50.00 RH : 12 Median : 70.00 Median : 9591
#> Mean : 56.02 RL :1101 Mean : 70.56 Mean : 10696
#> 3rd Qu.: 70.00 RM : 193 3rd Qu.: 79.00 3rd Qu.: 11708
#> Max. :190.00 Max. :313.00 Max. :215245
#>
#> LotConfig Condition1 Condition2 BldgType HouseStyle
#> Corner :250 Norm :1195 Norm :1365 1Fam :1166 Length:1379
#> CulDSac: 93 Feedr : 69 Feedr : 5 2fmCon: 22 Class :character
#> FR2 : 44 Artery : 44 Artery : 2 Duplex: 40 Mode :character
#> FR3 : 4 RRAn : 26 PosN : 2 Twnhs : 38
#> Inside :988 PosN : 19 RRNn : 2 TwnhsE: 113
#> RRAe : 11 PosA : 1
#> (Other): 15 (Other): 2
#> OverallQual OverallCond YearBuilt YearRemodAdd Exterior1st
#> 5 :365 Min. :2.000 Min. :1880 Min. :1950 Length:1379
#> 6 :362 1st Qu.:5.000 1st Qu.:1955 1st Qu.:1968 Class :character
#> 7 :318 Median :5.000 Median :1976 Median :1994 Mode :character
#> 8 :167 Mean :5.578 Mean :1973 Mean :1985
#> 4 : 90 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004
#> 9 : 43 Max. :9.000 Max. :2010 Max. :2010
#> (Other): 34
#> ExterQual X1stFlrSF X2ndFlrSF GrLivArea BsmtFullBath
#> Fa: 52 Min. : 438 Min. : 0.0 Min. : 438 Min. :0.0000
#> TA: 7 1st Qu.: 894 1st Qu.: 0.0 1st Qu.:1154 1st Qu.:0.0000
#> Gd:487 Median :1098 Median : 0.0 Median :1479 Median :0.0000
#> Ex:833 Mean :1177 Mean : 353.4 Mean :1535 Mean :0.4307
#> 3rd Qu.:1414 3rd Qu.: 738.5 3rd Qu.:1790 3rd Qu.:1.0000
#> Max. :4692 Max. :2065.0 Max. :5642 Max. :2.0000
#>
#> FullBath HalfBath BedroomAbvGr TotRmsAbvGrd
#> Min. :0.00 Min. :0.0000 Min. :0.000 Min. : 3.000
#> 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.: 5.000
#> Median :2.00 Median :0.0000 Median :3.000 Median : 6.000
#> Mean :1.58 Mean :0.3959 Mean :2.865 Mean : 6.553
#> 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.: 7.000
#> Max. :3.00 Max. :2.0000 Max. :6.000 Max. :12.000
#>
#> Fireplaces GarageType GarageCars GarageArea
#> Min. :0.0000 Length:1379 Min. :1.000 Min. : 160.0
#> 1st Qu.:0.0000 Class :character 1st Qu.:1.000 1st Qu.: 380.0
#> Median :1.0000 Mode :character Median :2.000 Median : 484.0
#> Mean :0.6418 Mean :1.871 Mean : 500.8
#> 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.: 580.0
#> Max. :3.0000 Max. :4.000 Max. :1418.0
#>
#> PavedDrive OpenPorchSF SaleType SalePrice
#> N: 58 Min. : 0.00 Length:1379 Min. : 35311
#> P: 28 1st Qu.: 0.00 Class :character 1st Qu.:134000
#> Y:1293 Median : 27.00 Mode :character Median :167500
#> Mean : 47.28 Mean :185480
#> 3rd Qu.: 69.50 3rd Qu.:217750
#> Max. :547.00 Max. :755000
#>
Insight:
- There are many outliers in the
numeric variables.
In this step, boxplots are used to visualize the distribution
of each numeric variable. The outlier boundaries are then defined, and
observations beyond those boundaries are removed.
1. SalePrice Variable
# Visualization before treatment
boxplot(house_new$SalePrice,horizontal = T)
# Treatment
house_new_outlier <- house_new[!(house_new$SalePrice>300000),]
# Visualization after treatment
boxplot(house_new_outlier$SalePrice,horizontal = T)
2. LotArea Variable
# Visualization
boxplot(house_new$LotArea,horizontal = T)
# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$LotArea>15600),]
# Visualization after treatment
boxplot(house_new_outlier$LotArea,horizontal = T)
3. First Floor Variable
# Visualization
boxplot(house_new_outlier$X1stFlrSF,horizontal = T)
# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$X1stFlrSF>2000),]
# Visualization after treatment
boxplot(house_new_outlier$X1stFlrSF,horizontal = T)
4. Second Floor Variable
# Visualization
boxplot(house_new_outlier$X2ndFlrSF,horizontal = T)
# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$X2ndFlrSF>1700),]
# Visualization after treatment
boxplot(house_new_outlier$X2ndFlrSF,horizontal = T)
5. GrLivArea Variable
# Visualization
boxplot(house_new_outlier$GrLivArea,horizontal = T)
# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$GrLivArea>2550),]
# Visualization after treatment
boxplot(house_new_outlier$GrLivArea,horizontal = T)
6. GarageArea Variable
# Visualization
boxplot(house_new_outlier$GarageArea,horizontal = T)
# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$GarageArea>850),]
# Visualization after treatment
boxplot(house_new_outlier$GarageArea,horizontal = T)
7. OpenPorchSF Variable
# Visualization
boxplot(house_new_outlier$OpenPorchSF,horizontal = T)
# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$OpenPorchSF>115),]
# Visualization after treatment
boxplot(house_new_outlier$OpenPorchSF,horizontal = T)
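The cut-offs used above appear to be read off the boxplots. One common way to derive such boundaries numerically is the 1.5 × IQR rule (Tukey fences), which is approximately what the boxplot whiskers show. The sketch below is an assumption about how the boundaries could be computed, not the report's own method; `tukey_fences` and `toy_price` are hypothetical names.

```r
# Lower and upper Tukey fences: Q1 - k * IQR and Q3 + k * IQR.
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Toy prices with one obvious outlier at the top.
toy_price <- c(100, 120, 130, 140, 150, 160, 170, 180, 400)
fences <- tukey_fences(toy_price)

# Keep only the observations inside the fences.
kept <- toy_price[toy_price >= fences["lower"] & toy_price <= fences["upper"]]
```

Computing the fences this way makes the treatment reproducible when the data change, at the cost of losing the manual judgement behind thresholds such as 300000 for SalePrice.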
The ggcorr function from the GGally package is used to measure the
strength of correlation among the numeric variables.
# Change the variable of OverallQual's data type to integer for visualization of correlation purposes.
house_new1 <- house_new %>%
mutate(OverallQual=as.integer(OverallQual))
# Visualization of correlation
ggcorr(house_new1,label = T,label_round = 2, nbreaks = 4, palette = "RdGy",
label_size = 3.5, label_color = "white",hjust = 0.8,layout.exp = 4)
Insight:
- Nine numeric variables
have a strong correlation (above 0.51).
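A threshold like this can also be applied programmatically. The sketch below uses base R's `cor()` on a hypothetical data frame and assumes the threshold is applied to each variable's correlation with SalePrice (the text does not say which pairs are meant); the `Noise` column and all values are made up for illustration.

```r
# Hypothetical numeric data; Noise is deliberately unrelated to SalePrice.
toy <- data.frame(
  GrLivArea  = c(900, 1200, 1500, 1800, 2100, 2400),
  GarageCars = c(1, 1, 2, 2, 3, 3),
  Noise      = c(5, 1, 4, 2, 6, 3),
  SalePrice  = c(110000, 150000, 175000, 210000, 260000, 290000)
)

# Pearson correlation of every predictor with the target.
cors <- cor(toy)[, "SalePrice"]
cors <- cors[names(cors) != "SalePrice"]

# Predictors whose absolute correlation exceeds the 0.51 threshold.
strong <- names(cors)[abs(cors) > 0.51]
strong
```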
The goal is to predict the value of the SalePrice variable from all or some of the predictor variables.
Note: This requires finding which set of predictor variables produces the best linear regression model.
In the cross validation step, the data is divided into train data
(80% of the original data) and test data (20%
of the original data), which serves as unseen data.
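To make the idea of a stratified 80/20 split concrete, here is a minimal base R sketch on a hypothetical 20-row data frame. It samples 80% of the rows within each level of the stratification variable; it is not a reproduction of the algorithm used by the rsample package, and `toy`, `train_idx`, `toy_train` and `toy_test` are illustrative names.

```r
set.seed(1230)

# Hypothetical data: 15 rows in one stratum, 5 in the other.
toy <- data.frame(
  id = 1:20,
  Condition2 = rep(c("Norm", "Feedr"), times = c(15, 5))
)

# Within each stratum, draw 80% of the row indices for training.
rows_by_stratum <- split(seq_len(nrow(toy)), toy$Condition2)
train_idx <- unlist(lapply(rows_by_stratum, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]
```

Stratifying keeps the proportion of each level roughly the same in both parts, which matters when some levels are rare.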
1. Without Outlier Treatment.
# stratified random sampling method
# Set seed so the random split is reproducible
set.seed(1230)
# determine the indices for train and test
splitted <- initial_split(data = house_new,
prop = 0.80,
strata = "Condition2")
# extract the train data
house_train <- training(splitted)
# extract the test data
house_test <- testing(splitted)
house_train
house_test
2. With Outlier Treatment.
# stratified random sampling method
# Set seed so the random split is reproducible
RNGkind(sample.kind = "Rounding")
set.seed(99)
# determine the indices for train and test
splitted <- initial_split(data = house_new_outlier,
prop = 0.80,
strata = "Condition2")
# extract the train data
house_train_outlier <- training(splitted)
# extract the test data
house_test_outlier <- testing(splitted)
house_train_outlier
house_test_outlier
In this section, four models are built, classified
by how their predictors are selected: all predictors,
business / domain knowledge, feature selection, and the step-wise
method. The Adjusted R-squared represents the goodness of fit of a
model, and the stars represent the strength of a predictor variable’s
significance towards the target variable.
The model_all model involves all predictors.
The lm function is used to build the linear regression
model and the summary function to see the model performance.
model_all <- lm(formula = SalePrice ~.,
data = house_train)
summary(model_all)
#>
#> Call:
#> lm(formula = SalePrice ~ ., data = house_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -415645 -13291 -385 11367 190662
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -594346.1722 191705.7709 -3.100 0.001987 **
#> MSSubClass -17.5858 110.1381 -0.160 0.873173
#> MSZoningFV 20433.7681 15152.7743 1.349 0.177794
#> MSZoningRH 17836.1492 17147.9142 1.040 0.298526
#> MSZoningRL 21250.9990 14326.8909 1.483 0.138308
#> MSZoningRM 15732.6801 14377.3248 1.094 0.274097
#> LotFrontage -169.8132 57.5795 -2.949 0.003259 **
#> LotArea 0.5837 0.1576 3.703 0.000225 ***
#> StreetPave 27171.6219 21646.4151 1.255 0.209678
#> LotConfigCulDSac 8577.0974 4480.6661 1.914 0.055870 .
#> LotConfigFR2 -2884.1299 5894.8654 -0.489 0.624763
#> LotConfigFR3 -26113.8205 25402.3744 -1.028 0.304192
#> LotConfigInside 893.8776 2578.8792 0.347 0.728953
#> Condition1Feedr -3175.2640 7114.3194 -0.446 0.655461
#> Condition1Norm 16595.2310 5662.8086 2.931 0.003459 **
#> Condition1PosA 23622.5083 14318.9643 1.650 0.099307 .
#> Condition1PosN 11763.3182 10000.8452 1.176 0.239779
#> Condition1RRAe -5221.6671 12314.3448 -0.424 0.671633
#> Condition1RRAn 11834.3216 10126.3029 1.169 0.242811
#> Condition1RRNe -3089.6544 31814.4231 -0.097 0.922654
#> Condition1RRNn 15695.0836 16749.7724 0.937 0.348965
#> Condition2Feedr -1505.9231 34814.2031 -0.043 0.965506
#> Condition2Norm -6231.5881 26774.4005 -0.233 0.816007
#> Condition2RRAe -16555.6116 43353.7651 -0.382 0.702636
#> Condition2RRAn -759.4410 41968.2820 -0.018 0.985566
#> Condition2RRNn -4098.1940 41662.2233 -0.098 0.921660
#> BldgType2fmCon -15574.3401 17145.3888 -0.908 0.363900
#> BldgTypeDuplex -21517.3356 8598.3512 -2.502 0.012489 *
#> BldgTypeTwnhs -19706.1523 12977.4186 -1.518 0.129202
#> BldgTypeTwnhsE -17073.5061 11724.5764 -1.456 0.145643
#> HouseStyle1.5Unf 17885.2634 13425.0519 1.332 0.183084
#> HouseStyle1Story 21803.1108 5878.9360 3.709 0.000220 ***
#> HouseStyle2.5Fin -37150.4728 17728.6785 -2.096 0.036374 *
#> HouseStyle2.5Unf -4209.8782 15311.7199 -0.275 0.783415
#> HouseStyle2Story -11088.9031 4822.3152 -2.299 0.021680 *
#> HouseStyleSFoyer 14856.3610 8511.6942 1.745 0.081218 .
#> HouseStyleSLvl 13698.4419 7191.4361 1.905 0.057086 .
#> OverallQual3 14516.0978 26416.6602 0.550 0.582780
#> OverallQual4 11160.5651 23923.5341 0.467 0.640951
#> OverallQual5 18650.9367 24080.2091 0.775 0.438796
#> OverallQual6 24274.2646 24209.9720 1.003 0.316267
#> OverallQual7 35795.3655 24390.3637 1.468 0.142523
#> OverallQual8 71909.3214 24661.8990 2.916 0.003626 **
#> OverallQual9 139605.0303 25353.5318 5.506 0.000000046447180 ***
#> OverallQual10 190322.5510 27223.8806 6.991 0.000000000004948 ***
#> OverallCond 7235.3226 1160.5670 6.234 0.000000000665267 ***
#> YearBuilt 308.5308 83.6556 3.688 0.000238 ***
#> YearRemodAdd -55.5534 77.3254 -0.718 0.472654
#> Exterior1stBrkComm -15422.2649 34731.1658 -0.444 0.657104
#> Exterior1stBrkFace 20464.0459 11737.7091 1.743 0.081560 .
#> Exterior1stCBlock 942.6757 39369.3844 0.024 0.980902
#> Exterior1stCemntBd 14414.4189 11589.4544 1.244 0.213878
#> Exterior1stHdBoard 3365.9676 10767.3196 0.313 0.754642
#> Exterior1stImStucc -3815.4144 32881.9252 -0.116 0.907649
#> Exterior1stMetalSd 7251.2136 10565.1329 0.686 0.492660
#> Exterior1stPlywood 2566.5134 11246.0993 0.228 0.819526
#> Exterior1stStone 4014.4481 24796.3396 0.162 0.871419
#> Exterior1stStucco -31789.1961 13019.4544 -2.442 0.014790 *
#> Exterior1stVinylSd 5360.7418 10727.3905 0.500 0.617378
#> Exterior1stWd Sdng 5334.5480 10619.1052 0.502 0.615528
#> Exterior1stWdShing -4606.6128 12615.1979 -0.365 0.715065
#> ExterQualTA -1842.6563 22426.0497 -0.082 0.934531
#> ExterQualGd -12772.8839 7008.0476 -1.823 0.068659 .
#> ExterQualEx -20412.0775 7672.4081 -2.660 0.007927 **
#> X1stFlrSF -2.8886 28.4998 -0.101 0.919288
#> X2ndFlrSF 25.1131 27.6304 0.909 0.363623
#> GrLivArea 55.5959 28.4411 1.955 0.050885 .
#> BsmtFullBath 15254.3875 2084.2501 7.319 0.000000000000509 ***
#> FullBath 5808.7673 3113.3075 1.866 0.062360 .
#> HalfBath 3570.3219 2984.5635 1.196 0.231874
#> BedroomAbvGr -4876.7072 1972.3070 -2.473 0.013577 *
#> TotRmsAbvGrd 743.5013 1327.9429 0.560 0.575679
#> Fireplaces 6107.2338 1859.9431 3.284 0.001060 **
#> GarageTypeAttchd 19858.9326 15858.2004 1.252 0.210757
#> GarageTypeBasment 14238.5164 17768.1677 0.801 0.423117
#> GarageTypeBuiltIn 16897.2057 16540.6000 1.022 0.307234
#> GarageTypeCarPort 5500.7329 19584.2208 0.281 0.778863
#> GarageTypeDetchd 19627.4108 15660.0358 1.253 0.210370
#> GarageCars 16560.1245 3159.7165 5.241 0.000000194328272 ***
#> GarageArea -7.2699 10.4101 -0.698 0.485118
#> PavedDriveP -751.6286 8032.4585 -0.094 0.925466
#> PavedDriveY 2644.6623 5665.5244 0.467 0.640744
#> OpenPorchSF 5.8766 17.1008 0.344 0.731184
#> SaleTypeCon 38531.1726 23060.5633 1.671 0.095057 .
#> SaleTypeConLD 7546.5763 14045.7974 0.537 0.591190
#> SaleTypeConLI -83.3702 17217.4689 -0.005 0.996137
#> SaleTypeConLw -1606.9659 19816.0967 -0.081 0.935383
#> SaleTypeCWD 24391.1574 16915.7097 1.442 0.149633
#> SaleTypeNew 26975.2006 6948.4937 3.882 0.000110 ***
#> SaleTypeOth 25857.0486 31648.1974 0.817 0.414112
#> SaleTypeWD 8179.2117 5774.9202 1.416 0.156986
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 30730 on 1012 degrees of freedom
#> Multiple R-squared: 0.8643, Adjusted R-squared: 0.8522
#> F-statistic: 71.62 on 90 and 1012 DF, p-value: < 0.00000000000000022
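Insight of model_all :
- OverallQual9
and OverallQual10 (ratings of the overall material and finish of the house)
are among the most significant predictors in model_all (three stars), together with, among others, BsmtFullBath, OverallCond, and GarageCars. The intercept is significant at the two-star level.
- The Adjusted R-squared, which represents the goodness of fit
of the model, is 0.8522, meaning that about 85.22% of the variance in SalePrice is
explained by the model after adjusting for the number of predictors.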
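The Adjusted R-squared can be recovered from the other figures in the summary with the standard formula $1 - (1 - R^2)\,(n - 1)/(n - p - 1)$. The sketch below plugs in the values printed for model_all (Multiple R-squared 0.8643, 90 model and 1012 residual degrees of freedom); the variable names are illustrative.

```r
r2       <- 0.8643  # Multiple R-squared from the summary
p        <- 90      # model degrees of freedom (F-statistic: 90 and 1012 DF)
df_resid <- 1012    # residual degrees of freedom, i.e. n - p - 1

# Number of observations used in the fit.
n <- df_resid + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)
```

The gap between the two values grows with the number of estimated coefficients, which is why models with many dummy variables should be compared on the adjusted figure.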
The assumption tests on model_all show
that several predictors are strongly correlated with one another. Such
predictors are redundant in the model,
so only one variable from each strongly related group
should be kept. The
lm function is then used to build a linear regression model with the selected
predictors, and the summary function to see the model performance.
# Drop several variables due to multicollinearity issue.
model_all_nomulti <- lm(formula = SalePrice ~. -HouseStyle -X2ndFlrSF -MSSubClass -BldgType -OverallCond -Exterior1st-ExterQual,
data = house_train)
summary(model_all_nomulti)
#>
#> Call:
#> lm(formula = SalePrice ~ . - HouseStyle - X2ndFlrSF - MSSubClass -
#> BldgType - OverallCond - Exterior1st - ExterQual, data = house_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -480835 -14568 -432 12821 223071
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -602581.5105 167983.9710 -3.587 0.000350 ***
#> MSZoningFV 11593.8122 15858.2693 0.731 0.464889
#> MSZoningRH 9712.1398 17971.9238 0.540 0.589032
#> MSZoningRL 17160.1661 14988.9747 1.145 0.252533
#> MSZoningRM 3828.7905 15015.9038 0.255 0.798787
#> LotFrontage -83.5639 57.7586 -1.447 0.148260
#> LotArea 0.4757 0.1570 3.030 0.002506 **
#> StreetPave 12888.2506 18852.6734 0.684 0.494361
#> LotConfigCulDSac 11159.6848 4711.9989 2.368 0.018049 *
#> LotConfigFR2 -2312.1381 6279.0163 -0.368 0.712775
#> LotConfigFR3 -30367.9034 26998.5086 -1.125 0.260933
#> LotConfigInside -372.6487 2738.3679 -0.136 0.891781
#> Condition1Feedr -2351.7683 7443.1791 -0.316 0.752094
#> Condition1Norm 16911.6231 5882.7125 2.875 0.004125 **
#> Condition1PosA 24539.3091 15047.1488 1.631 0.103229
#> Condition1PosN 13718.5455 10493.6759 1.307 0.191394
#> Condition1RRAe 2410.8062 13017.4473 0.185 0.853110
#> Condition1RRAn 8030.7798 10722.8639 0.749 0.454063
#> Condition1RRNe 1503.6648 33949.2808 0.044 0.964681
#> Condition1RRNn 16079.8879 17764.2904 0.905 0.365579
#> Condition2Feedr 18768.6079 34461.4484 0.545 0.586127
#> Condition2Norm 11931.4388 25596.2088 0.466 0.641212
#> Condition2RRAe -5329.6477 43169.0898 -0.123 0.901767
#> Condition2RRAn 4198.6815 42424.5267 0.099 0.921183
#> Condition2RRNn 33050.1157 42518.5706 0.777 0.437152
#> OverallQual3 1442.3408 27210.2746 0.053 0.957736
#> OverallQual4 15862.0420 24842.1864 0.639 0.523281
#> OverallQual5 26858.2580 24907.1903 1.078 0.281135
#> OverallQual6 32311.5780 24977.3058 1.294 0.196077
#> OverallQual7 46630.2699 25157.7675 1.854 0.064091 .
#> OverallQual8 85455.4552 25411.1203 3.363 0.000799 ***
#> OverallQual9 156924.9798 25997.6188 6.036 0.0000000021918765 ***
#> OverallQual10 214083.0480 27826.1183 7.694 0.0000000000000331 ***
#> YearBuilt 64.9826 69.7059 0.932 0.351429
#> YearRemodAdd 194.6838 70.6606 2.755 0.005968 **
#> X1stFlrSF 5.5491 4.7953 1.157 0.247457
#> GrLivArea 50.1192 5.5883 8.969 < 0.0000000000000002 ***
#> BsmtFullBath 13260.0181 2176.3604 6.093 0.0000000015597503 ***
#> FullBath 3127.6154 3276.8625 0.954 0.340075
#> HalfBath -242.5368 3080.3211 -0.079 0.937257
#> BedroomAbvGr -2935.5775 2026.5075 -1.449 0.147753
#> TotRmsAbvGrd -224.5892 1365.9136 -0.164 0.869429
#> Fireplaces 6763.4803 1958.5690 3.453 0.000576 ***
#> GarageTypeAttchd 45070.5371 16218.8586 2.779 0.005552 **
#> GarageTypeBasment 25947.2446 18231.4251 1.423 0.154973
#> GarageTypeBuiltIn 50289.2951 16910.7318 2.974 0.003009 **
#> GarageTypeCarPort 22460.7387 19647.1110 1.143 0.253215
#> GarageTypeDetchd 39635.1201 16087.0406 2.464 0.013908 *
#> GarageCars 15234.1616 3320.0665 4.589 0.0000050071189558 ***
#> GarageArea -0.8073 10.9888 -0.073 0.941451
#> PavedDriveP -1243.3405 8495.3630 -0.146 0.883669
#> PavedDriveY 2702.1394 5857.0642 0.461 0.644646
#> OpenPorchSF 25.2194 17.9364 1.406 0.160010
#> SaleTypeCon 40131.9640 24442.8876 1.642 0.100919
#> SaleTypeConLD 2289.1457 14259.7284 0.161 0.872493
#> SaleTypeConLI -5265.5360 18304.6720 -0.288 0.773664
#> SaleTypeConLw 5982.1864 20544.5749 0.291 0.770971
#> SaleTypeCWD 13211.7694 17913.5687 0.738 0.460967
#> SaleTypeNew 29470.6728 7265.4691 4.056 0.0000535833201582 ***
#> SaleTypeOth 30115.7711 33907.1368 0.888 0.374647
#> SaleTypeWD 9393.0505 6082.3043 1.544 0.122813
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 33030 on 1042 degrees of freedom
#> Multiple R-squared: 0.8386, Adjusted R-squared: 0.8294
#> F-statistic: 90.26 on 60 and 1042 DF, p-value: < 0.00000000000000022
Insight of model_all_nomulti :
- The Adjusted
R-squared, which represents the goodness of fit of the model, is 0.8294,
meaning that about 82.94% of the variance in SalePrice is explained by the
model after adjusting for the number of predictors.
- OverallQual9 and OverallQual10 (ratings of the overall
material and finish of the house) and the intercept are highly
significant (three stars) in model_all_nomulti; GrLivArea has the smallest p-value.
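The multicollinearity check that motivates dropping predictors is usually based on the variance inflation factor, $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors. The report loads the car package, which provides a `vif()` function; the sketch below computes the same quantity by hand in base R on hypothetical data, with `vif_manual` and the columns `x1`, `x2`, `x3` as illustrative names.

```r
# Hypothetical predictors: x2 is almost a linear function of x1, x3 is unrelated.
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1),
  x3 = c(5, 3, 8, 1, 7, 2, 6, 4)
)

# Regress each predictor on all the others and convert R-squared to a VIF.
vif_manual <- function(data) {
  vapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

vifs <- vif_manual(toy)
vifs
```

A common rule of thumb treats values above 10 as a sign of problematic multicollinearity; here that would flag `x1` and `x2`, so one of the two would be dropped.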
The model_all_outlier model is built on data with outlier treatment, although
outliers can actually be beneficial for a model. Whether
the outliers are useful for the model can be judged
after the model comparison with the
compare_performance() function. The lm function is then used
to build a linear regression model with the selected predictors, and the
summary function to see the model performance.
# Model with Outlier Treatment
model_all_outlier <- lm(formula = SalePrice ~.,
data = house_train_outlier)
# Drop several variables due to multicollinearity issue.
model_all_outlier_nomulti <- lm(formula = SalePrice ~. -HouseStyle -X2ndFlrSF -BldgType-OverallCond -Exterior1st,
data = house_train_outlier)
summary(model_all_outlier_nomulti)
#>
#> Call:
#> lm(formula = SalePrice ~ . - HouseStyle - X2ndFlrSF - BldgType -
#> OverallCond - Exterior1st, data = house_train_outlier)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -100352 -9992 379 10158 59005
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -834963.2207 114749.8757 -7.276 0.000000000000881 ***
#> MSSubClass -103.7956 22.5957 -4.594 0.000005124313814 ***
#> MSZoningFV 22633.5428 9384.8555 2.412 0.016122 *
#> MSZoningRH 21531.5988 10769.7897 1.999 0.045948 *
#> MSZoningRL 24212.4378 8149.0328 2.971 0.003063 **
#> MSZoningRM 16290.4253 8107.6274 2.009 0.044873 *
#> LotFrontage 13.0998 52.9200 0.248 0.804560
#> LotArea 1.0965 0.3568 3.073 0.002196 **
#> StreetPave 27254.6324 19473.4230 1.400 0.162060
#> LotConfigCulDSac 7129.1412 3437.7392 2.074 0.038446 *
#> LotConfigFR2 -781.7392 4183.9549 -0.187 0.851836
#> LotConfigFR3 11705.1970 19079.9425 0.613 0.539747
#> LotConfigInside 1306.0901 1903.3442 0.686 0.492798
#> Condition1Feedr 1353.1374 5320.1630 0.254 0.799303
#> Condition1Norm 6585.8952 4591.9282 1.434 0.151930
#> Condition1PosA 5457.3425 13940.6033 0.391 0.695562
#> Condition1PosN 12670.9877 8559.9566 1.480 0.139231
#> Condition1RRAe -19025.1436 8033.4867 -2.368 0.018131 *
#> Condition1RRAn -2060.1518 7083.5675 -0.291 0.771259
#> Condition1RRNe 11690.4429 19178.2878 0.610 0.542337
#> Condition1RRNn 42400.4886 16897.2924 2.509 0.012311 *
#> Condition2Feedr -7969.8174 19554.0250 -0.408 0.683701
#> Condition2Norm -5874.9931 15240.1214 -0.385 0.699982
#> Condition2RRAn -21795.3197 24219.0778 -0.900 0.368455
#> Condition2RRNn 1801.1657 24638.1712 0.073 0.941743
#> OverallQual3 -93.1707 15299.7548 -0.006 0.995143
#> OverallQual4 18809.9212 13907.0281 1.353 0.176616
#> OverallQual5 27304.1010 13955.3087 1.957 0.050780 .
#> OverallQual6 36559.8307 14021.9152 2.607 0.009310 **
#> OverallQual7 46537.2116 14214.0974 3.274 0.001110 **
#> OverallQual8 73950.4921 14471.1476 5.110 0.000000410530664 ***
#> OverallQual9 100226.5966 25085.3984 3.995 0.000071075973398 ***
#> YearBuilt 188.3124 47.8204 3.938 0.000090002118209 ***
#> YearRemodAdd 218.0644 44.9081 4.856 0.000001465152692 ***
#> ExterQualTA -22827.3985 13799.7416 -1.654 0.098515 .
#> ExterQualGd -15180.8660 9582.8042 -1.584 0.113583
#> ExterQualEx -21027.0218 9491.8210 -2.215 0.027046 *
#> X1stFlrSF 11.1426 3.5642 3.126 0.001840 **
#> GrLivArea 48.1948 4.3220 11.151 < 0.0000000000000002 ***
#> BsmtFullBath 9887.6527 1472.4897 6.715 0.000000000037631 ***
#> FullBath 2047.1349 2193.8708 0.933 0.351066
#> HalfBath 1839.3413 1998.6116 0.920 0.357713
#> BedroomAbvGr -3125.9904 1410.9372 -2.216 0.027028 *
#> TotRmsAbvGrd -1655.3771 975.2597 -1.697 0.090049 .
#> Fireplaces 5495.6729 1272.0639 4.320 0.000017722640509 ***
#> GarageTypeAttchd 23189.6827 11092.5479 2.091 0.036910 *
#> GarageTypeBasment 18879.7536 12434.9605 1.518 0.129373
#> GarageTypeBuiltIn 23891.8056 11637.5846 2.053 0.040427 *
#> GarageTypeCarPort 8090.6653 13703.6027 0.590 0.555101
#> GarageTypeDetchd 23520.9531 11022.1341 2.134 0.033175 *
#> GarageCars 1991.4255 2340.9373 0.851 0.395215
#> GarageArea 19.2540 8.7326 2.205 0.027773 *
#> PavedDriveP 5410.8376 6083.7656 0.889 0.374084
#> PavedDriveY 3437.7019 3706.6489 0.927 0.354001
#> OpenPorchSF 45.6289 23.6919 1.926 0.054499 .
#> SaleTypeCon 39623.8272 19500.7714 2.032 0.042521 *
#> SaleTypeConLD 11966.9582 10221.5081 1.171 0.242073
#> SaleTypeConLI -3575.3449 11676.3333 -0.306 0.759536
#> SaleTypeConLw 9770.7908 11768.2631 0.830 0.406658
#> SaleTypeCWD 46031.5369 19132.8369 2.406 0.016379 *
#> SaleTypeNew 16856.1648 5057.5016 3.333 0.000902 ***
#> SaleTypeWD 8238.7698 3922.7058 2.100 0.036044 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 18330 on 736 degrees of freedom
#> Multiple R-squared: 0.8584, Adjusted R-squared: 0.8467
#> F-statistic: 73.17 on 61 and 736 DF, p-value: < 0.00000000000000022
Insight of model_all_outlier:
- The Adjusted R-squared, which represents the goodness of fit of the model, is 0.8467 in the summary above;
about 84.67% of the variance in SalePrice is explained by
model_all_outlier.
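- OverallQual8 and OverallQual9, which rate the
overall material and finish of the house, and the Intercept are highly
significant in model_all_outlier; GrLivArea has the smallest p-value among the coefficients shown. OverallQual10 does not appear in this summary.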
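The relationship between the Multiple R-squared and the Adjusted R-squared can be checked by hand. The following is a minimal sketch that plugs in the values printed in the summary above; the helper name `adj_r2` is hypothetical, and the number of observations is inferred from the residual degrees of freedom rather than taken from the data.

```r
# Adjusted R-squared from the values printed by summary():
#   adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
# adj_r2 is a hypothetical helper, not part of the original analysis.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

r2       <- 0.8584            # Multiple R-squared from the summary
p        <- 61                # numerator DF of the F-statistic (coefficients excluding the intercept)
df_resid <- 736               # residual degrees of freedom
n        <- df_resid + p + 1  # inferred number of observations used in the fit

result <- adj_r2(r2, n, p)
round(result, 4)
```

Because the penalty term grows with the number of coefficients, the Adjusted R-squared is the fairer score when models with different numbers of predictors are compared.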
In model_bob, the predictors are selected from a business perspective;
the variables listed with a minus sign in the formula are excluded from
the model.
model_bob <- lm(formula = SalePrice ~ . - OpenPorchSF - HalfBath - GarageArea - MSSubClass - TotRmsAbvGrd - HouseStyle - X2ndFlrSF - BldgType - OverallCond - Exterior1st - ExterQual,
data = house_train)
summary(model_bob)
#>
#> Call:
#> lm(formula = SalePrice ~ . - OpenPorchSF - HalfBath - GarageArea -
#> MSSubClass - TotRmsAbvGrd - HouseStyle - X2ndFlrSF - BldgType -
#> OverallCond - Exterior1st - ExterQual, data = house_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -480124 -15086 -295 13292 220950
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -615205.1363 163415.8103 -3.765 0.000176 ***
#> MSZoningFV 13076.1073 15714.7262 0.832 0.405547
#> MSZoningRH 9440.4067 17908.8558 0.527 0.598211
#> MSZoningRL 17266.1127 14899.5672 1.159 0.246789
#> MSZoningRM 4255.8347 14909.7024 0.285 0.775363
#> LotFrontage -80.5513 57.2302 -1.407 0.159577
#> LotArea 0.4754 0.1564 3.040 0.002428 **
#> StreetPave 12077.5804 18798.6010 0.642 0.520707
#> LotConfigCulDSac 11045.7455 4703.4308 2.348 0.019037 *
#> LotConfigFR2 -2090.0374 6259.5698 -0.334 0.738526
#> LotConfigFR3 -30481.8285 26970.1690 -1.130 0.258649
#> LotConfigInside -560.6366 2728.6531 -0.205 0.837251
#> Condition1Feedr -2367.8882 7399.9532 -0.320 0.749042
#> Condition1Norm 17035.3780 5852.2563 2.911 0.003680 **
#> Condition1PosA 24064.5973 15004.9030 1.604 0.109064
#> Condition1PosN 14630.5312 10439.4476 1.401 0.161371
#> Condition1RRAe 2777.6774 12951.6394 0.214 0.830226
#> Condition1RRAn 9420.4309 10618.1641 0.887 0.375175
#> Condition1RRNe 2125.9221 33872.5450 0.063 0.949968
#> Condition1RRNn 15973.0407 17704.1972 0.902 0.367149
#> Condition2Feedr 17520.7624 34328.0850 0.510 0.609885
#> Condition2Norm 11762.8144 25540.9607 0.461 0.645219
#> Condition2RRAe -7465.4676 42823.7206 -0.174 0.861640
#> Condition2RRAn 3006.2708 42366.0028 0.071 0.943444
#> Condition2RRNn 33882.9787 42364.7257 0.800 0.424013
#> OverallQual3 1381.7739 27174.8319 0.051 0.959457
#> OverallQual4 15643.5858 24813.2198 0.630 0.528536
#> OverallQual5 26546.1387 24875.0676 1.067 0.286137
#> OverallQual6 32155.9573 24950.1293 1.289 0.197749
#> OverallQual7 46390.1047 25130.4510 1.846 0.065179 .
#> OverallQual8 85395.7690 25374.0698 3.365 0.000792 ***
#> OverallQual9 156506.8133 25942.8516 6.033 0.0000000022338611 ***
#> OverallQual10 214124.5684 27791.4553 7.705 0.0000000000000304 ***
#> YearBuilt 69.6097 66.6149 1.045 0.296284
#> YearRemodAdd 196.7529 70.5542 2.789 0.005388 **
#> X1stFlrSF 5.7905 4.0244 1.439 0.150485
#> GrLivArea 50.2984 4.0109 12.541 < 0.0000000000000002 ***
#> BsmtFullBath 13420.2763 2161.4858 6.209 0.0000000007686819 ***
#> FullBath 3296.8402 2957.1861 1.115 0.265168
#> BedroomAbvGr -3102.9855 1799.9205 -1.724 0.085011 .
#> Fireplaces 6859.9001 1937.1719 3.541 0.000416 ***
#> GarageTypeAttchd 44819.8448 16089.4068 2.786 0.005438 **
#> GarageTypeBasment 25634.7808 18107.6412 1.416 0.157164
#> GarageTypeBuiltIn 49892.0733 16765.6741 2.976 0.002989 **
#> GarageTypeCarPort 22448.2759 19498.4254 1.151 0.249877
#> GarageTypeDetchd 39353.0340 15986.9581 2.462 0.013993 *
#> GarageCars 14889.1534 2360.2420 6.308 0.0000000004156525 ***
#> PavedDriveP -1199.9741 8467.2315 -0.142 0.887329
#> PavedDriveY 2484.5299 5822.9129 0.427 0.669699
#> SaleTypeCon 38759.4775 24380.5891 1.590 0.112189
#> SaleTypeConLD 1468.9956 14174.0486 0.104 0.917475
#> SaleTypeConLI -4578.1400 18236.5709 -0.251 0.801831
#> SaleTypeConLw 5613.3793 20474.0631 0.274 0.784008
#> SaleTypeCWD 12225.9999 17858.0434 0.685 0.493734
#> SaleTypeNew 29709.9205 7236.4622 4.106 0.0000434708337898 ***
#> SaleTypeOth 29926.3096 33861.1247 0.884 0.377010
#> SaleTypeWD 9393.6980 6067.2116 1.548 0.121859
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 33000 on 1046 degrees of freedom
#> Multiple R-squared: 0.8383, Adjusted R-squared: 0.8297
#> F-statistic: 96.86 on 56 and 1046 DF, p-value: < 0.00000000000000022
Insight of model_bob:
- The Adjusted R-squared, which represents the goodness of fit of the model, is 0.8297;
about 82.97% of the variance in SalePrice is explained by
model_bob.
- OverallQual9 and OverallQual10, which rate the overall
material and finish of the house, are highly significant (p < 0.001), as is the Intercept;
GrLivArea has the smallest p-value in model_bob.
The ggcorr function is used to identify which variables have
a strong correlation with the target. Those variables are then used as
the predictors of this model.
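The correlation-based screening described above can be sketched in base R. This is an illustration only: the built-in `mtcars` data stands in for `house_train`, `mpg` stands in for `SalePrice`, and the 0.7 cut-off is an assumed threshold, not one taken from the original analysis.

```r
# Illustrative data only: mtcars stands in for house_train, mpg for SalePrice.
data(mtcars)
target <- "mpg"

# Correlation of every numeric column with the target
cors <- cor(mtcars)[, target]
cors <- cors[names(cors) != target]

# Keep predictors whose absolute correlation meets an assumed cut-off
threshold <- 0.7
strong <- names(cors)[abs(cors) >= threshold]

# Build the model formula from the selected predictors and fit it
fml <- reformulate(strong, response = target)
model_cor <- lm(fml, data = mtcars)
summary(model_cor)$adj.r.squared
```

Screening on pairwise correlation only looks at numeric columns one at a time, so categorical predictors and joint effects need to be judged separately.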
After the strongly correlated variables are identified, the
lm function builds a linear regression model with
the predictors selected from the ggcorr result, and the
summary function shows the model performance.
model_fs <- lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd +
X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd + GarageCars + GarageArea,
data = house_train)
summary(model_fs)
#>
#> Call:
#> lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd +
#> X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd + GarageCars +
#> GarageArea, data = house_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -497536 -15795 -562 14621 212560
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1135297.200 146793.240 -7.734 0.0000000000000237 ***
#> OverallQual3 -11120.600 27810.210 -0.400 0.689327
#> OverallQual4 2195.598 25490.631 0.086 0.931376
#> OverallQual5 17929.448 25308.185 0.708 0.478821
#> OverallQual6 24843.987 25407.735 0.978 0.328385
#> OverallQual7 42037.997 25653.932 1.639 0.101574
#> OverallQual8 80810.327 25917.674 3.118 0.001869 **
#> OverallQual9 156471.923 26602.255 5.882 0.0000000053948494 ***
#> OverallQual10 205374.403 28579.112 7.186 0.0000000000012364 ***
#> YearBuilt 323.796 56.771 5.704 0.0000000151177135 ***
#> YearRemodAdd 264.053 71.830 3.676 0.000248 ***
#> X1stFlrSF 18.574 3.642 5.099 0.0000004018952094 ***
#> GrLivArea 56.238 4.740 11.866 < 0.0000000000000002 ***
#> FullBath -3130.896 2953.832 -1.060 0.289406
#> TotRmsAbvGrd -1275.484 1234.137 -1.034 0.301599
#> GarageCars 14171.199 3427.015 4.135 0.0000382080031060 ***
#> GarageArea -2.935 11.062 -0.265 0.790780
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 35400 on 1086 degrees of freedom
#> Multiple R-squared: 0.8068, Adjusted R-squared: 0.804
#> F-statistic: 283.5 on 16 and 1086 DF, p-value: < 0.00000000000000022
Insight of model_fs:
- The Adjusted R-squared, which represents the goodness of fit of the model, is 0.804;
about 80.4% of the variance in SalePrice is explained by
model_fs.
- OverallQual9 and OverallQual10, which rate the overall
material and finish of the house, are highly significant (p < 0.001), as is the Intercept;
GrLivArea has the smallest p-value in model_fs.
Step-wise regression can run in three directions (backward,
forward, and both), each of which searches for the model with the
lowest AIC and helps identify significant predictors. The
step function builds the step-wise regression model and the
summary function shows the model performance.
# Backward
model_backward <- step(object = model_all, direction = "backward", trace = FALSE)
summary(model_backward)
#>
#> Call:
#> lm(formula = SalePrice ~ LotFrontage + LotArea + Street + Condition1 +
#> BldgType + HouseStyle + OverallQual + OverallCond + YearBuilt +
#> Exterior1st + ExterQual + X2ndFlrSF + GrLivArea + BsmtFullBath +
#> FullBath + HalfBath + BedroomAbvGr + Fireplaces + GarageCars +
#> SaleType, data = house_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -422439 -13197 -778 12119 188125
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -786499.0006 140891.2919 -5.582 0.000000030290286 ***
#> LotFrontage -188.6030 54.5253 -3.459 0.000564 ***
#> LotArea 0.6668 0.1492 4.469 0.000008739166845 ***
#> StreetPave 43544.0374 19579.6729 2.224 0.026367 *
#> Condition1Feedr -4402.9343 6828.6026 -0.645 0.519213
#> Condition1Norm 16579.0004 5490.8553 3.019 0.002595 **
#> Condition1PosA 23687.8673 14143.8842 1.675 0.094279 .
#> Condition1PosN 14571.4734 9786.0916 1.489 0.136792
#> Condition1RRAe -6265.1549 11910.3582 -0.526 0.598983
#> Condition1RRAn 13976.1861 9515.5049 1.469 0.142195
#> Condition1RRNe 3998.6221 31453.9067 0.127 0.898865
#> Condition1RRNn 12315.6384 15311.0932 0.804 0.421373
#> BldgType2fmCon -17766.3732 7838.7317 -2.266 0.023627 *
#> BldgTypeDuplex -24354.8809 6229.6890 -3.909 0.000098494375329 ***
#> BldgTypeTwnhs -25351.9189 6655.9721 -3.809 0.000148 ***
#> BldgTypeTwnhsE -21259.5159 4243.0131 -5.010 0.000000637902494 ***
#> HouseStyle1.5Unf 18886.8962 12557.6952 1.504 0.132883
#> HouseStyle1Story 22429.1527 4934.0787 4.546 0.000006117341215 ***
#> HouseStyle2.5Fin -33915.4882 13661.1687 -2.483 0.013199 *
#> HouseStyle2.5Unf -5274.3912 14858.2929 -0.355 0.722677
#> HouseStyle2Story -11031.6404 4371.9343 -2.523 0.011775 *
#> HouseStyleSFoyer 14994.9972 7757.7825 1.933 0.053521 .
#> HouseStyleSLvl 12914.4486 5952.2806 2.170 0.030259 *
#> OverallQual3 23494.7917 25183.6954 0.933 0.351070
#> OverallQual4 15545.9581 22799.2523 0.682 0.495478
#> OverallQual5 25238.7751 22784.7378 1.108 0.268246
#> OverallQual6 30812.8917 22925.3233 1.344 0.179224
#> OverallQual7 42332.8739 23121.5056 1.831 0.067404 .
#> OverallQual8 78379.3239 23411.2185 3.348 0.000843 ***
#> OverallQual9 144883.0769 24128.3746 6.005 0.000000002647875 ***
#> OverallQual10 195718.7099 26109.0195 7.496 0.000000000000141 ***
#> OverallCond 7171.1265 1031.2383 6.954 0.000000000006276 ***
#> YearBuilt 356.9247 71.2067 5.013 0.000000631324648 ***
#> Exterior1stBrkComm -12970.3959 34210.8847 -0.379 0.704668
#> Exterior1stBrkFace 25193.1714 11209.7709 2.247 0.024822 *
#> Exterior1stCBlock 16634.8744 37336.6044 0.446 0.656024
#> Exterior1stCemntBd 16814.1070 11107.6346 1.514 0.130395
#> Exterior1stHdBoard 7807.6483 10244.4291 0.762 0.446152
#> Exterior1stImStucc -1516.1955 32449.0942 -0.047 0.962741
#> Exterior1stMetalSd 10687.7551 10048.3098 1.064 0.287741
#> Exterior1stPlywood 7298.7102 10679.1675 0.683 0.494473
#> Exterior1stStone 8195.2462 24362.3959 0.336 0.736645
#> Exterior1stStucco -28234.3894 12496.7649 -2.259 0.024069 *
#> Exterior1stVinylSd 8077.2289 10216.8532 0.791 0.429370
#> Exterior1stWd Sdng 8823.0709 10063.7498 0.877 0.380843
#> Exterior1stWdShing -948.8236 12097.5736 -0.078 0.937500
#> ExterQualTA -7624.7596 19966.0777 -0.382 0.702624
#> ExterQualGd -13835.8114 6872.5674 -2.013 0.044352 *
#> ExterQualEx -21263.8868 7540.1616 -2.820 0.004893 **
#> X2ndFlrSF 26.6791 7.2021 3.704 0.000223 ***
#> GrLivArea 53.7350 4.7377 11.342 < 0.0000000000000002 ***
#> BsmtFullBath 15206.2459 2042.8101 7.444 0.000000000000205 ***
#> FullBath 5432.3114 2999.1635 1.811 0.070387 .
#> HalfBath 3978.4012 2889.3962 1.377 0.168841
#> BedroomAbvGr -3948.2383 1739.3187 -2.270 0.023412 *
#> Fireplaces 6537.5602 1801.5365 3.629 0.000299 ***
#> GarageCars 14509.5169 2162.6635 6.709 0.000000000032137 ***
#> SaleTypeCon 41526.6382 22691.2760 1.830 0.067526 .
#> SaleTypeConLD 6725.4425 13656.6261 0.492 0.622493
#> SaleTypeConLI -870.7761 16849.5954 -0.052 0.958794
#> SaleTypeConLw 1343.6075 19582.5580 0.069 0.945311
#> SaleTypeCWD 27120.5765 16630.5083 1.631 0.103242
#> SaleTypeNew 26654.8031 6651.3120 4.007 0.000065762456150 ***
#> SaleTypeOth 28659.5255 31405.1208 0.913 0.361678
#> SaleTypeWD 8326.1918 5535.5370 1.504 0.132851
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 30600 on 1038 degrees of freedom
#> Multiple R-squared: 0.8621, Adjusted R-squared: 0.8536
#> F-statistic: 101.4 on 64 and 1038 DF, p-value: < 0.00000000000000022
Insight of model_backward:
- The Adjusted R-squared, which represents the goodness of fit of the model, is 0.8536;
about 85.36% of the variance in SalePrice is explained by
model_backward.
- OverallQual9 and OverallQual10, which rate the
overall material and finish of the house, are highly significant (p < 0.001), as is the Intercept;
GrLivArea has the smallest p-value in model_backward.
- Backward model without multicollinearity variables
# Backward model without multicollinearity variables
model_backward_nomulti <- step(object = model_all_nomulti, direction = "backward", trace = FALSE)
summary(model_backward_nomulti)
#>
#> Call:
#> lm(formula = SalePrice ~ MSZoning + LotFrontage + LotArea + LotConfig +
#> Condition1 + OverallQual + YearRemodAdd + X1stFlrSF + GrLivArea +
#> BsmtFullBath + BedroomAbvGr + Fireplaces + GarageType + GarageCars +
#> OpenPorchSF + SaleType, data = house_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -476988 -14701 -293 13539 225308
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -534450.3022 130395.0757 -4.099 0.00004473504361153 ***
#> MSZoningFV 18987.8030 14567.9547 1.303 0.192724
#> MSZoningRH 16451.3438 16868.0511 0.975 0.329637
#> MSZoningRL 23117.3469 13715.3975 1.686 0.092187 .
#> MSZoningRM 8522.0633 13893.9602 0.613 0.539768
#> LotFrontage -94.4625 56.3250 -1.677 0.093820 .
#> LotArea 0.4208 0.1431 2.940 0.003354 **
#> LotConfigCulDSac 11461.8176 4653.1776 2.463 0.013928 *
#> LotConfigFR2 -1388.4660 6212.7762 -0.223 0.823201
#> LotConfigFR3 -25968.8445 24872.5712 -1.044 0.296690
#> LotConfigInside -243.1049 2702.5968 -0.090 0.928342
#> Condition1Feedr -1063.8292 7124.1599 -0.149 0.881324
#> Condition1Norm 18533.6980 5652.9646 3.279 0.001077 **
#> Condition1PosA 26453.1744 14893.9155 1.776 0.076004 .
#> Condition1PosN 14919.2315 10341.7849 1.443 0.149425
#> Condition1RRAe 5453.0482 12711.1790 0.429 0.668014
#> Condition1RRAn 10326.0327 10095.8209 1.023 0.306636
#> Condition1RRNe 3012.0927 33744.1985 0.089 0.928890
#> Condition1RRNn 16418.1760 16787.9137 0.978 0.328310
#> OverallQual3 -1283.0143 26580.2227 -0.048 0.961511
#> OverallQual4 14840.2286 24577.2565 0.604 0.546093
#> OverallQual5 25279.6082 24561.4294 1.029 0.303603
#> OverallQual6 31843.3896 24661.1250 1.291 0.196904
#> OverallQual7 47023.7816 24831.6983 1.894 0.058538 .
#> OverallQual8 85732.6120 25064.9046 3.420 0.000649 ***
#> OverallQual9 156920.5019 25645.1639 6.119 0.00000000132604167 ***
#> OverallQual10 215455.4610 27435.1425 7.853 0.00000000000000993 ***
#> YearRemodAdd 234.1588 66.4971 3.521 0.000448 ***
#> X1stFlrSF 6.2253 3.9830 1.563 0.118354
#> GrLivArea 49.7728 3.7709 13.199 < 0.0000000000000002 ***
#> BsmtFullBath 13170.1166 2088.6661 6.306 0.00000000042159795 ***
#> BedroomAbvGr -2723.6570 1749.3177 -1.557 0.119775
#> Fireplaces 6355.8283 1908.8502 3.330 0.000900 ***
#> GarageTypeAttchd 49698.4849 15618.8561 3.182 0.001506 **
#> GarageTypeBasment 29350.3777 17871.3966 1.642 0.100824
#> GarageTypeBuiltIn 55568.1036 16220.3513 3.426 0.000637 ***
#> GarageTypeCarPort 24335.1893 19157.2554 1.270 0.204263
#> GarageTypeDetchd 42475.5928 15590.0468 2.725 0.006546 **
#> GarageCars 16321.2513 2210.5816 7.383 0.00000000000031290 ***
#> OpenPorchSF 26.3347 17.6971 1.488 0.137029
#> SaleTypeCon 38321.6146 24300.7567 1.577 0.115102
#> SaleTypeConLD 691.8352 13957.1022 0.050 0.960475
#> SaleTypeConLI -6810.0815 18178.0978 -0.375 0.708010
#> SaleTypeConLw 6591.7751 20408.4041 0.323 0.746764
#> SaleTypeCWD 12835.9309 17795.4664 0.721 0.470883
#> SaleTypeNew 29123.8208 7181.0888 4.056 0.00005368045966956 ***
#> SaleTypeOth 29498.8907 33743.6148 0.874 0.382205
#> SaleTypeWD 8819.6595 6019.9018 1.465 0.143196
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 32910 on 1055 degrees of freedom
#> Multiple R-squared: 0.8379, Adjusted R-squared: 0.8306
#> F-statistic: 116 on 47 and 1055 DF, p-value: < 0.00000000000000022
Insight of model_backward_nomulti:
- The Adjusted R-squared, which represents the goodness of fit of the model, is
0.8306; about 83.06% of the variance in SalePrice is explained by
model_backward_nomulti.
- OverallQual9 and OverallQual10, which rate the overall material and finish of the house,
are highly significant (p < 0.001), as is the Intercept; GrLivArea has the smallest p-value in
model_backward_nomulti.
- Backward model without multicollinearity variables
and with outlier treatment
# Model with outlier treatment
model_bwd_outlier_nomulti <- step(object = model_all_outlier_nomulti, direction = "backward", trace = FALSE)
summary(model_bwd_outlier_nomulti)
#>
#> Call:
#> lm(formula = SalePrice ~ MSSubClass + MSZoning + LotArea + Street +
#> Condition1 + OverallQual + YearBuilt + YearRemodAdd + ExterQual +
#> X1stFlrSF + GrLivArea + BsmtFullBath + BedroomAbvGr + TotRmsAbvGrd +
#> Fireplaces + GarageArea + OpenPorchSF + SaleType, data = house_train_outlier)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -99865 -10368 530 10464 59296
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -897701.0328 98760.4504 -9.090 < 0.0000000000000002 ***
#> MSSubClass -108.1648 21.0628 -5.135 0.0000003587000 ***
#> MSZoningFV 24555.8638 9172.5996 2.677 0.007588 **
#> MSZoningRH 20955.3630 10667.4204 1.964 0.049847 *
#> MSZoningRL 26005.2417 8049.8480 3.231 0.001289 **
#> MSZoningRM 17622.3153 8023.0817 2.196 0.028363 *
#> LotArea 1.0835 0.3151 3.438 0.000617 ***
#> StreetPave 33588.8196 18858.7656 1.781 0.075303 .
#> Condition1Feedr 400.4230 5007.9692 0.080 0.936293
#> Condition1Norm 6445.8084 4326.1189 1.490 0.136648
#> Condition1PosA 5607.7584 13821.3171 0.406 0.685054
#> Condition1PosN 14882.6840 8373.1424 1.777 0.075900 .
#> Condition1RRAe -17069.6556 7726.9136 -2.209 0.027466 *
#> Condition1RRAn -2154.0693 6408.1199 -0.336 0.736854
#> Condition1RRNe 10442.0547 19016.1381 0.549 0.583089
#> Condition1RRNn 42915.9209 16063.7887 2.672 0.007712 **
#> OverallQual3 2595.0319 15059.6503 0.172 0.863235
#> OverallQual4 19385.5120 13809.3563 1.404 0.160791
#> OverallQual5 28565.3775 13852.6284 2.062 0.039541 *
#> OverallQual6 38153.6193 13919.2538 2.741 0.006269 **
#> OverallQual7 48476.4737 14110.4683 3.435 0.000624 ***
#> OverallQual8 75777.3731 14370.8690 5.273 0.0000001754226 ***
#> OverallQual9 102852.2705 24992.9263 4.115 0.0000429328528 ***
#> YearBuilt 233.3913 37.5982 6.208 0.0000000008876 ***
#> YearRemodAdd 212.8192 44.1620 4.819 0.0000017440676 ***
#> ExterQualTA -25584.9304 13498.0981 -1.895 0.058415 .
#> ExterQualGd -13831.5120 9426.1789 -1.467 0.142697
#> ExterQualEx -19631.5618 9358.4760 -2.098 0.036261 *
#> X1stFlrSF 8.4047 2.8326 2.967 0.003101 **
#> GrLivArea 51.3943 3.7077 13.862 < 0.0000000000000002 ***
#> BsmtFullBath 9726.2100 1427.6222 6.813 0.0000000000196 ***
#> BedroomAbvGr -3303.2001 1356.6996 -2.435 0.015133 *
#> TotRmsAbvGrd -1651.6995 947.1219 -1.744 0.081581 .
#> Fireplaces 5969.0460 1227.0830 4.864 0.0000013976129 ***
#> GarageArea 23.4403 5.6157 4.174 0.0000334006048 ***
#> OpenPorchSF 45.5167 23.2544 1.957 0.050676 .
#> SaleTypeCon 39259.8665 19400.9809 2.024 0.043363 *
#> SaleTypeConLD 11423.0389 10073.5776 1.134 0.257171
#> SaleTypeConLI -2584.7130 11564.1097 -0.224 0.823198
#> SaleTypeConLw 10028.9174 11666.3250 0.860 0.390257
#> SaleTypeCWD 45415.9074 19043.9235 2.385 0.017334 *
#> SaleTypeNew 17501.7101 5006.8467 3.496 0.000501 ***
#> SaleTypeWD 9015.9153 3868.6396 2.331 0.020042 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 18310 on 755 degrees of freedom
#> Multiple R-squared: 0.8551, Adjusted R-squared: 0.8471
#> F-statistic: 106.1 on 42 and 755 DF, p-value: < 0.00000000000000022
# Create a model without predictors
model_none <- lm(formula = SalePrice ~ 1, data = house_train)
# Forward
model_fwd <- step(object = model_none, direction = "forward",
scope = list(lower = model_none, upper = model_all), trace = FALSE)
summary(model_fwd)
#>
#> Call:
#> lm(formula = SalePrice ~ OverallQual + GrLivArea + YearBuilt +
#> MSSubClass + BsmtFullBath + OverallCond + GarageCars + Fireplaces +
#> SaleType + Condition1 + Exterior1st + LotArea + LotFrontage +
#> ExterQual + Street + BedroomAbvGr + BldgType + FullBath,
#> data = house_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -439880 -13369 -1067 11852 201540
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -878109.6221 131291.0292 -6.688 0.0000000000367069 ***
#> OverallQual3 19406.3364 25361.9434 0.765 0.444340
#> OverallQual4 9038.8137 22864.8056 0.395 0.692691
#> OverallQual5 19118.7592 22819.0301 0.838 0.402310
#> OverallQual6 23468.1986 22931.9686 1.023 0.306363
#> OverallQual7 35180.1098 23128.7975 1.521 0.128549
#> OverallQual8 71124.9450 23375.4995 3.043 0.002403 **
#> OverallQual9 136205.3135 24074.0820 5.658 0.0000000197889253 ***
#> OverallQual10 187997.3530 26062.6357 7.213 0.0000000000010466 ***
#> GrLivArea 58.2522 3.7477 15.543 < 0.0000000000000002 ***
#> YearBuilt 417.0213 65.6872 6.349 0.0000000003232699 ***
#> MSSubClass -162.1606 50.5734 -3.206 0.001385 **
#> BsmtFullBath 15071.8219 1994.6654 7.556 0.0000000000000905 ***
#> OverallCond 7520.6002 1028.8193 7.310 0.0000000000005301 ***
#> GarageCars 14711.1993 2160.8328 6.808 0.0000000000166372 ***
#> Fireplaces 6719.6159 1800.6284 3.732 0.000200 ***
#> SaleTypeCon 39461.7762 22895.2421 1.724 0.085079 .
#> SaleTypeConLD 13853.0234 13683.7896 1.012 0.311597
#> SaleTypeConLI -3471.9377 16988.3810 -0.204 0.838103
#> SaleTypeConLw 1800.4902 19176.8468 0.094 0.925216
#> SaleTypeCWD 27539.4650 16775.5757 1.642 0.100965
#> SaleTypeNew 25597.2301 6707.0059 3.816 0.000143 ***
#> SaleTypeOth 30131.6419 31712.5001 0.950 0.342255
#> SaleTypeWD 7738.3298 5581.8605 1.386 0.165940
#> Condition1Feedr -2846.6480 6813.4661 -0.418 0.676181
#> Condition1Norm 17359.5603 5466.4071 3.176 0.001539 **
#> Condition1PosA 22219.5089 14214.5188 1.563 0.118318
#> Condition1PosN 15006.9623 9732.4539 1.542 0.123388
#> Condition1RRAe -3887.1407 11899.2341 -0.327 0.743982
#> Condition1RRAn 13490.3923 9533.1795 1.415 0.157337
#> Condition1RRNe -2373.7801 31719.5799 -0.075 0.940359
#> Condition1RRNn 12795.9801 15427.5258 0.829 0.407053
#> Exterior1stBrkComm -11400.5518 34518.5874 -0.330 0.741260
#> Exterior1stBrkFace 28412.5596 11178.4792 2.542 0.011174 *
#> Exterior1stCBlock 20996.1963 37752.0375 0.556 0.578220
#> Exterior1stCemntBd 17204.5232 11080.2557 1.553 0.120793
#> Exterior1stHdBoard 9963.3373 10181.1889 0.979 0.328003
#> Exterior1stImStucc -1892.1702 32752.4274 -0.058 0.953941
#> Exterior1stMetalSd 11392.4557 10057.8347 1.133 0.257602
#> Exterior1stPlywood 9515.2715 10629.4789 0.895 0.370898
#> Exterior1stStone 4994.0305 24424.8687 0.204 0.838030
#> Exterior1stStucco -26114.6003 12469.3205 -2.094 0.036473 *
#> Exterior1stVinylSd 8529.3085 10194.2425 0.837 0.402964
#> Exterior1stWd Sdng 10460.2524 10029.3110 1.043 0.297204
#> Exterior1stWdShing 538.3750 12115.6980 0.044 0.964565
#> LotArea 0.6254 0.1506 4.153 0.0000355079325200 ***
#> LotFrontage -186.8743 54.3782 -3.437 0.000612 ***
#> ExterQualTA -11630.4892 20246.7180 -0.574 0.565795
#> ExterQualGd -14039.4167 6900.8020 -2.034 0.042157 *
#> ExterQualEx -21266.8478 7584.1058 -2.804 0.005139 **
#> StreetPave 41638.3240 19848.8834 2.098 0.036165 *
#> BedroomAbvGr -3972.1728 1731.7083 -2.294 0.022000 *
#> BldgType2fmCon 2198.1863 10518.1943 0.209 0.834498
#> BldgTypeDuplex -15988.9734 6574.1816 -2.432 0.015179 *
#> BldgTypeTwnhs -14379.7114 8661.6718 -1.660 0.097183 .
#> BldgTypeTwnhsE -6723.6941 6479.9812 -1.038 0.299691
#> FullBath 4268.2036 2747.8539 1.553 0.120657
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 30920 on 1046 degrees of freedom
#> Multiple R-squared: 0.8581, Adjusted R-squared: 0.8505
#> F-statistic: 113 on 56 and 1046 DF, p-value: < 0.00000000000000022
Insight of model_fwd:
- The Adjusted R-squared, which represents the goodness of fit of the model, is 0.8505;
about 85.05% of the variance in SalePrice is explained by
model_fwd.
- OverallQual9 and OverallQual10, which rate the overall
material and finish of the house, are highly significant (p < 0.001), as is the Intercept;
GrLivArea has the smallest p-value in model_fwd.
# Both
model_both <- step(object = model_none, direction = "both",
scope = list(upper = model_all), trace = FALSE)
summary(model_both)
#>
#> Call:
#> lm(formula = SalePrice ~ OverallQual + GrLivArea + YearBuilt +
#> MSSubClass + BsmtFullBath + OverallCond + GarageCars + Fireplaces +
#> SaleType + Condition1 + Exterior1st + LotArea + LotFrontage +
#> ExterQual + Street + BedroomAbvGr + BldgType + FullBath,
#> data = house_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -439880 -13369 -1067 11852 201540
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -878109.6221 131291.0292 -6.688 0.0000000000367069 ***
#> OverallQual3 19406.3364 25361.9434 0.765 0.444340
#> OverallQual4 9038.8137 22864.8056 0.395 0.692691
#> OverallQual5 19118.7592 22819.0301 0.838 0.402310
#> OverallQual6 23468.1986 22931.9686 1.023 0.306363
#> OverallQual7 35180.1098 23128.7975 1.521 0.128549
#> OverallQual8 71124.9450 23375.4995 3.043 0.002403 **
#> OverallQual9 136205.3135 24074.0820 5.658 0.0000000197889253 ***
#> OverallQual10 187997.3530 26062.6357 7.213 0.0000000000010466 ***
#> GrLivArea 58.2522 3.7477 15.543 < 0.0000000000000002 ***
#> YearBuilt 417.0213 65.6872 6.349 0.0000000003232699 ***
#> MSSubClass -162.1606 50.5734 -3.206 0.001385 **
#> BsmtFullBath 15071.8219 1994.6654 7.556 0.0000000000000905 ***
#> OverallCond 7520.6002 1028.8193 7.310 0.0000000000005301 ***
#> GarageCars 14711.1993 2160.8328 6.808 0.0000000000166372 ***
#> Fireplaces 6719.6159 1800.6284 3.732 0.000200 ***
#> SaleTypeCon 39461.7762 22895.2421 1.724 0.085079 .
#> SaleTypeConLD 13853.0234 13683.7896 1.012 0.311597
#> SaleTypeConLI -3471.9377 16988.3810 -0.204 0.838103
#> SaleTypeConLw 1800.4902 19176.8468 0.094 0.925216
#> SaleTypeCWD 27539.4650 16775.5757 1.642 0.100965
#> SaleTypeNew 25597.2301 6707.0059 3.816 0.000143 ***
#> SaleTypeOth 30131.6419 31712.5001 0.950 0.342255
#> SaleTypeWD 7738.3298 5581.8605 1.386 0.165940
#> Condition1Feedr -2846.6480 6813.4661 -0.418 0.676181
#> Condition1Norm 17359.5603 5466.4071 3.176 0.001539 **
#> Condition1PosA 22219.5089 14214.5188 1.563 0.118318
#> Condition1PosN 15006.9623 9732.4539 1.542 0.123388
#> Condition1RRAe -3887.1407 11899.2341 -0.327 0.743982
#> Condition1RRAn 13490.3923 9533.1795 1.415 0.157337
#> Condition1RRNe -2373.7801 31719.5799 -0.075 0.940359
#> Condition1RRNn 12795.9801 15427.5258 0.829 0.407053
#> Exterior1stBrkComm -11400.5518 34518.5874 -0.330 0.741260
#> Exterior1stBrkFace 28412.5596 11178.4792 2.542 0.011174 *
#> Exterior1stCBlock 20996.1963 37752.0375 0.556 0.578220
#> Exterior1stCemntBd 17204.5232 11080.2557 1.553 0.120793
#> Exterior1stHdBoard 9963.3373 10181.1889 0.979 0.328003
#> Exterior1stImStucc -1892.1702 32752.4274 -0.058 0.953941
#> Exterior1stMetalSd 11392.4557 10057.8347 1.133 0.257602
#> Exterior1stPlywood 9515.2715 10629.4789 0.895 0.370898
#> Exterior1stStone 4994.0305 24424.8687 0.204 0.838030
#> Exterior1stStucco -26114.6003 12469.3205 -2.094 0.036473 *
#> Exterior1stVinylSd 8529.3085 10194.2425 0.837 0.402964
#> Exterior1stWd Sdng 10460.2524 10029.3110 1.043 0.297204
#> Exterior1stWdShing 538.3750 12115.6980 0.044 0.964565
#> LotArea 0.6254 0.1506 4.153 0.0000355079325200 ***
#> LotFrontage -186.8743 54.3782 -3.437 0.000612 ***
#> ExterQualTA -11630.4892 20246.7180 -0.574 0.565795
#> ExterQualGd -14039.4167 6900.8020 -2.034 0.042157 *
#> ExterQualEx -21266.8478 7584.1058 -2.804 0.005139 **
#> StreetPave 41638.3240 19848.8834 2.098 0.036165 *
#> BedroomAbvGr -3972.1728 1731.7083 -2.294 0.022000 *
#> BldgType2fmCon 2198.1863 10518.1943 0.209 0.834498
#> BldgTypeDuplex -15988.9734 6574.1816 -2.432 0.015179 *
#> BldgTypeTwnhs -14379.7114 8661.6718 -1.660 0.097183 .
#> BldgTypeTwnhsE -6723.6941 6479.9812 -1.038 0.299691
#> FullBath 4268.2036 2747.8539 1.553 0.120657
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 30920 on 1046 degrees of freedom
#> Multiple R-squared: 0.8581, Adjusted R-squared: 0.8505
#> F-statistic: 113 on 56 and 1046 DF, p-value: < 0.00000000000000022
Insight of model_both:
- The Adjusted R-squared, which represents the goodness of fit of the model, is 0.8505;
about 85.05% of the variance in SalePrice is explained by
model_both, the same fit as model_fwd because both searches selected the same predictors.
- OverallQual9 and OverallQual10, which rate the overall
material and finish of the house, are highly significant (p < 0.001), as is the Intercept;
GrLivArea has the smallest p-value in model_both.
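The three step-wise searches can be put side by side to see how much each one lowers the AIC. The sketch below is illustrative only: it uses the built-in `mtcars` data with `mpg` as the response instead of `house_train`, and the object names (`m_null`, `m_full`, `m_back`, `m_fwd`, `m_both`) are hypothetical.

```r
# Illustrative data only: mtcars stands in for house_train.
data(mtcars)
m_null <- lm(mpg ~ 1, data = mtcars)   # model without predictors
m_full <- lm(mpg ~ ., data = mtcars)   # model with all predictors

# Backward: start from the full model and drop terms
m_back <- step(m_full, direction = "backward", trace = 0)

# Forward: start from the empty model and add terms up to the full model
m_fwd <- step(m_null, direction = "forward",
              scope = list(lower = formula(m_null), upper = formula(m_full)),
              trace = 0)

# Both: start from the empty model, adding or dropping at each step
m_both <- step(m_null, direction = "both",
               scope = list(upper = formula(m_full)),
               trace = 0)

aic_table <- data.frame(
  model = c("null", "full", "backward", "forward", "both"),
  AIC   = c(AIC(m_null), AIC(m_full), AIC(m_back), AIC(m_fwd), AIC(m_both))
)
aic_table

# The "anova" component records which term was added or removed at each step
m_back$anova
```

Each search only accepts a move that lowers the AIC, so the result is never worse than its starting model, but different starting points can end at different final models.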
In the comparison step, all models are compared to find the one with
a high Adj. R-squared, a low AIC, and a low RMSE.
- 1st Step - Before Assumption Test.
comparison <- compare_performance(model_none, model_all, model_bob, model_fs, model_backward, model_fwd, model_both)
as.data.frame(comparison)
# Range of Data Train's Target
range(house_train$SalePrice)
#> [1] 35311 755000
Insight:
The best linear regression models are as follows:
- Based on the highest Adj. R-squared (R2_adjusted column): model_backward.
- Based on the lowest AIC (AIC column): model_backward.
- Based on the lowest RMSE (RMSE column): model_all.
So model_all is suitable for making
predictions, while model_backward is
appropriate for examining the significance of the predictors. Note that step-wise
regression is a greedy algorithm: it finds a good
result in a relatively short time, but not necessarily the
optimal one. Therefore, this research uses the result of step-wise
regression as the recommended model, and there is still room for
improvement.
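The three scores used in the comparison can also be computed by hand in base R, which makes it easier to see why the largest model tends to win on RMSE. This is a minimal sketch on the built-in `mtcars` data, not the original `compare_performance` call; the `rmse` helper and the model names are hypothetical.

```r
# Illustrative data only: nested models on mtcars, not the house price models.
data(mtcars)
models <- list(
  small  = lm(mpg ~ wt, data = mtcars),
  medium = lm(mpg ~ wt + hp, data = mtcars),
  full   = lm(mpg ~ ., data = mtcars)
)

# Training RMSE: square root of the mean squared residual (hypothetical helper)
rmse <- function(model) sqrt(mean(residuals(model)^2))

metrics <- data.frame(
  model       = names(models),
  R2_adjusted = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC         = sapply(models, AIC),
  RMSE        = sapply(models, rmse),
  row.names   = NULL
)
metrics
```

For nested least-squares models, the training RMSE cannot increase when predictors are added, whereas the Adjusted R-squared and the AIC penalize extra coefficients. A model that wins only on training RMSE is therefore worth checking on held-out data.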
comparison2 <- compare_performance(model_all_nomulti, model_all_outlier_nomulti, model_backward_nomulti, model_bwd_outlier_nomulti)
as.data.frame(comparison2)
# Range of Data Train's Target
range(house_train_outlier$SalePrice)
#> [1] 35311 297000
Insight:
The best linear regression models are as follows:
- Based on the highest Adj. R-squared (R2_adjusted column): model_bwd_outlier_nomulti.
- Based on the lowest AIC (AIC column): model_bwd_outlier_nomulti.
- Based on the lowest RMSE (RMSE column): model_all_outlier_nomulti.
So model_all_outlier_nomulti is suitable
for making predictions, while
model_bwd_outlier_nomulti is appropriate for examining the
significance of the predictors. As before, step-wise regression is a greedy
algorithm: it finds a good result in a relatively short
time, but not necessarily the optimal one. Therefore,
this research uses the result of step-wise regression as the recommended
model, and there is still room for improvement.
The prediction is made with the selected model, the one with the best
performance scores.
pred_price_interval <- predict(object = model_backward, newdata = house_test, interval = "prediction", level = 0.95)
bwd_pred_result <- house_test %>%
  select(SalePrice) %>%
  bind_cols(as.data.frame(pred_price_interval)) %>%
  relocate(fit, .after = lwr)
colnames(bwd_pred_result) <- c("SalePrice_Actual", "LowPrice_Pred", "Fit_Pred", "UprPrice_Pred")
bwd_pred_result
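As a self-contained illustration of what `predict()` returns with `interval = "prediction"`, the following sketch uses simulated data. The column names `area` and `price` and all numbers are hypothetical, not taken from the house data.

```r
set.seed(1)
# Simulated stand-in data; `area` and `price` are hypothetical names.
train <- data.frame(area = runif(100, 50, 250))
train$price <- 20000 + 900 * train$area + rnorm(100, sd = 15000)
new_homes <- data.frame(area = c(80, 150, 220))

fit <- lm(price ~ area, data = train)

# interval = "prediction" returns the columns fit, lwr and upr for each row.
pred <- predict(fit, newdata = new_homes, interval = "prediction", level = 0.95)

# Selecting columns by name avoids relying on their position.
pred_table <- data.frame(LowPrice_Pred = pred[, "lwr"],
                         Fit_Pred      = pred[, "fit"],
                         UprPrice_Pred = pred[, "upr"])
pred_table
```

A prediction interval covers a single new observation, so it is wider than a confidence interval for the mean response at the same predictor values.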
The model is checked against the four linear regression assumptions,
starting with linearity.
- 1st Step - Before Assumption Test.
In this section, the linearity assumption is checked with a residuals
vs fitted plot for model_backward.
options(scipen = 9999)
plot(model_backward, which = 1)
abline(h = 10, col = "green")
abline(h = -10, col = "green")
resfit1 <- data.frame(residual = model_backward$residuals, fitted = model_backward$fitted.values)
resfit1 %>% ggplot(aes(fitted, residual)) +
  geom_point() +
  # colour is set outside aes() so the reference line is actually drawn in red
  geom_hline(yintercept = 0, colour = "red") +
  theme_minimal() +
  labs(title = "Residual vs Fitted Plot") +
  theme(legend.position = "none")
The plots show that the residuals do not scatter randomly around 0: many lie far from 0, and the spread is not uniform across the fitted values. The red trend line also falls outside the tolerance range of -10 to +10 used here, so the conditional mean of the residuals is not 0 and the condition E(ϵ)=0 is not fulfilled. It is concluded that the relationship in the data may not be linear. The linearity assumption is not met.
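The trend line in a residuals vs fitted plot can also be summarized numerically. The sketch below is an illustration on simulated data, not the house data. It assumes a lowess smoother (the kind drawn by `plot.lm`), and `max_trend` is a hypothetical helper name.

```r
set.seed(42)
# Simulated data with a curved relationship (hypothetical, for illustration).
x <- runif(300, 0, 10)
y <- 5 + x^2 + rnorm(300, sd = 3)

fit_line <- lm(y ~ x)           # misspecified: straight line
fit_quad <- lm(y ~ x + I(x^2))  # includes the curvature

# Largest absolute value of the lowess smooth of residuals against fitted
# values; a value far from 0 suggests a systematic (non-linear) pattern.
max_trend <- function(fit) {
  sm <- lowess(fitted(fit), residuals(fit))
  max(abs(sm$y))
}

c(line = max_trend(fit_line), quadratic = max_trend(fit_quad))

# The overall mean of OLS residuals is ~0 by construction (model with an
# intercept), even for the misspecified fit, so it cannot detect curvature.
mean(residuals(fit_line))
```

This is why the check focuses on how the trend of the residuals behaves across the fitted values rather than on their overall average.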
The linear regression model is expected to produce errors that are
normally distributed and mostly located around 0.
hist(model_backward$residuals)
shapiro.test(model_backward$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_backward$residuals
#> W = 0.79245, p-value < 0.00000000000000022
Based on the histogram above and the Shapiro-Wilk test, most of the errors are not located around 0, and the Shapiro-Wilk p-value is below 0.05, so H0 (normally distributed errors) is rejected. The errors are therefore not normally distributed and not concentrated around 0.
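The decision rule of the Shapiro-Wilk test can be illustrated with a short sketch on simulated residuals. The two vectors below are hypothetical stand-ins, not model residuals from this analysis.

```r
set.seed(7)
# Two simulated residual vectors: one normal, one strongly right-skewed.
normal_resid <- rnorm(200)
skewed_resid <- rexp(200) - 1  # centred near 0 but not normal

sw_normal <- shapiro.test(normal_resid)
sw_skewed <- shapiro.test(skewed_resid)

# H0: the sample comes from a normal distribution.
# p < 0.05  -> reject H0 (not normal); p >= 0.05 -> fail to reject H0.
c(normal = sw_normal$p.value, skewed = sw_skewed$p.value)
```

The skewed example shows that residuals can be centred near 0 and still fail the normality test, so the location of the residuals and their normality are separate questions.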
bptest(model_backward)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_backward
#> BP = 503.22, df = 64, p-value < 0.00000000000000022
Breusch-Pagan hypothesis test:
H0: the error spread is constant (homoscedasticity)
H1: the error spread is not constant (heteroscedasticity)
Based on the BP test, the p-value is lower than 0.05, so H0 is rejected in favor of the alternative hypothesis (H1). The model therefore does not meet the assumption of homoscedastic residuals, so a data transformation is applied to the target or predictor variables.
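To make the test statistic less of a black box, the sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R on simulated data. It is an assumption that this matches the default variant of `bptest()`. The data and variable names are hypothetical.

```r
set.seed(3)
# Simulated data whose error spread grows with x (heteroscedastic by design).
x <- runif(400, 1, 10)
y <- 2 + 3 * x + rnorm(400, sd = x)
fit <- lm(y ~ x)

# Studentized Breusch-Pagan statistic: regress the squared residuals on the
# predictors, then LM = n * R^2, compared with a chi-squared distribution
# whose df equals the number of predictors.
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- length(x) * summary(aux)$r.squared
bp_df   <- 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p_value = bp_p)
# A small p-value leads to rejecting H0 (homoscedasticity).
```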
In this section, multicollinearity between the predictor variables is
measured with the vif() function from library(car), to discover which
predictors are strongly related to other predictors and are therefore
redundant. One variable is then kept from each redundant group so that
no redundant predictor remains in the model.
vif(model_backward)
#> GVIF Df GVIF^(1/(2*Df))
#> LotFrontage 1.709594 1 1.307515
#> LotArea 1.694626 1 1.301778
#> Street 2.038089 1 1.427617
#> Condition1 1.945497 8 1.042472
#> BldgType 3.161430 4 1.154743
#> HouseStyle 19.775693 7 1.237602
#> OverallQual 16.733791 8 1.192545
#> OverallCond 1.498539 1 1.224148
#> YearBuilt 5.194551 1 2.279156
#> Exterior1st 8.965802 13 1.088023
#> ExterQual 12.061682 3 1.514379
#> X2ndFlrSF 11.627028 1 3.409843
#> GrLivArea 6.965922 1 2.639303
#> BsmtFullBath 1.316555 1 1.147412
#> FullBath 3.207742 1 1.791017
#> HalfBath 2.493750 1 1.579161
#> BedroomAbvGr 2.233516 1 1.494495
#> Fireplaces 1.565500 1 1.251199
#> GarageCars 2.163001 1 1.470714
#> SaleType 2.217563 8 1.051035
There is multicollinearity involving the HouseStyle, OverallQual, ExterQual, and X2ndFlrSF variables, because their GVIF scores are higher than 10 (VIF > 10).
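For numeric predictors, the VIF can be reproduced by hand, which shows what the score measures. The sketch below uses the built-in `mtcars` data with an arbitrary set of predictors as a stand-in; it does not cover the generalized VIF that is reported for factor terms.

```r
# VIF for numeric predictors: regress each predictor on the others and
# compute 1 / (1 - R^2). Built-in mtcars is used as stand-in data.
predictors <- c("wt", "hp", "disp", "qsec")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

round(vif_manual, 2)
names(vif_manual)[vif_manual > 10]  # predictors flagged by the VIF > 10 rule
```

A VIF of 10 corresponds to an R-squared of 0.9 in the auxiliary regression, meaning 90% of that predictor's variance is explained by the other predictors.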
From the residual analysis, it is concluded that model_backward shows non-linearity, non-normally distributed errors, heteroscedasticity, and multicollinearity. Thus, the target variable and the predictor variables with the highest p-values are transformed using sqrt.
# Using model_all_outlier_nomulti with sqrt
model_tuning <- lm(formula= sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea+ Street + LotConfig + Condition1 + Condition2 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + GrLivArea + BsmtFullBath + FullBath + HalfBath + sqrt(BedroomAbvGr) + sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + sqrt(GarageCars) + sqrt(GarageArea) + PavedDrive + sqrt(OpenPorchSF) + SaleType,
data = house_train_outlier)
summary(model_tuning)
#>
#> Call:
#> lm(formula = sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea +
#> Street + LotConfig + Condition1 + Condition2 + OverallQual +
#> YearBuilt + YearRemodAdd + X1stFlrSF + GrLivArea + BsmtFullBath +
#> FullBath + HalfBath + sqrt(BedroomAbvGr) + sqrt(TotRmsAbvGrd) +
#> Fireplaces + GarageType + sqrt(GarageCars) + sqrt(GarageArea) +
#> PavedDrive + sqrt(OpenPorchSF) + SaleType, data = house_train_outlier)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -141.729 -12.118 0.835 12.784 69.262
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1116.475753 137.484004 -8.121 0.00000000000000194 ***
#> MSZoningFV 46.054058 11.958543 3.851 0.000128 ***
#> MSZoningRH 40.272081 13.581357 2.965 0.003122 **
#> MSZoningRL 45.480091 10.370746 4.385 0.00001325494221484 ***
#> MSZoningRM 31.397148 10.334130 3.038 0.002464 **
#> LotFrontage 0.099851 0.065968 1.514 0.130548
#> LotArea 0.001813 0.000432 4.196 0.00003051881464823 ***
#> StreetPave 37.383186 21.714059 1.722 0.085558 .
#> LotConfigCulDSac 7.724227 4.371240 1.767 0.077631 .
#> LotConfigFR2 -0.957160 5.312280 -0.180 0.857062
#> LotConfigFR3 8.347318 24.309643 0.343 0.731414
#> LotConfigInside 1.088415 2.409413 0.452 0.651593
#> Condition1Feedr 1.934627 6.777392 0.285 0.775377
#> Condition1Norm 8.779067 5.821484 1.508 0.131969
#> Condition1PosA 10.511465 17.724906 0.593 0.553340
#> Condition1PosN 17.328153 10.892441 1.591 0.112072
#> Condition1RRAe -22.499589 10.235173 -2.198 0.028239 *
#> Condition1RRAn 1.102445 9.024720 0.122 0.902807
#> Condition1RRNe 23.740742 24.390736 0.973 0.330697
#> Condition1RRNn 44.225949 21.506011 2.056 0.040090 *
#> Condition2Feedr 3.145933 24.589227 0.128 0.898232
#> Condition2Norm 11.327439 18.949399 0.598 0.550174
#> Condition2RRAn -15.686096 30.808883 -0.509 0.610804
#> Condition2RRNn 11.157646 31.048291 0.359 0.719425
#> OverallQual3 18.291895 19.552337 0.936 0.349818
#> OverallQual4 49.231338 17.776758 2.769 0.005756 **
#> OverallQual5 61.431489 17.851872 3.441 0.000612 ***
#> OverallQual6 73.642988 17.913457 4.111 0.00004379712312152 ***
#> OverallQual7 89.050753 18.151076 4.906 0.00000114317687146 ***
#> OverallQual8 117.194340 18.438324 6.356 0.00000000036189349 ***
#> OverallQual9 154.025502 29.674093 5.191 0.00000027110104495 ***
#> YearBuilt 0.214500 0.059066 3.632 0.000301 ***
#> YearRemodAdd 0.377873 0.055620 6.794 0.00000000002246257 ***
#> X1stFlrSF 0.018516 0.004451 4.160 0.00003558289401388 ***
#> GrLivArea 0.051610 0.005240 9.850 < 0.0000000000000002 ***
#> BsmtFullBath 12.171482 1.868824 6.513 0.00000000013608698 ***
#> FullBath 1.555679 2.819148 0.552 0.581234
#> HalfBath 1.998786 2.545470 0.785 0.432569
#> sqrt(BedroomAbvGr) -2.136031 5.143939 -0.415 0.678078
#> sqrt(TotRmsAbvGrd) -12.119338 6.073817 -1.995 0.046372 *
#> Fireplaces 6.976545 1.616723 4.315 0.00001810957322644 ***
#> GarageTypeAttchd 31.445119 14.081550 2.233 0.025842 *
#> GarageTypeBasment 19.994641 15.792678 1.266 0.205886
#> GarageTypeBuiltIn 31.229928 14.776678 2.113 0.034895 *
#> GarageTypeCarPort 13.257012 17.352688 0.764 0.445126
#> GarageTypeDetchd 31.369102 14.002183 2.240 0.025367 *
#> sqrt(GarageCars) 9.179581 7.961105 1.153 0.249261
#> sqrt(GarageArea) 0.625992 0.483889 1.294 0.196184
#> PavedDriveP 9.153242 7.719502 1.186 0.236110
#> PavedDriveY 6.389496 4.667343 1.369 0.171421
#> sqrt(OpenPorchSF) 0.517155 0.265092 1.951 0.051451 .
#> SaleTypeCon 40.632632 24.730949 1.643 0.100810
#> SaleTypeConLD 13.749883 12.980034 1.059 0.289804
#> SaleTypeConLI -8.443244 14.849875 -0.569 0.569818
#> SaleTypeConLw 14.067374 14.999319 0.938 0.348618
#> SaleTypeCWD 46.355721 24.355765 1.903 0.057392 .
#> SaleTypeNew 20.084775 6.445700 3.116 0.001904 **
#> SaleTypeWD 10.078975 5.002207 2.015 0.044276 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 23.39 on 740 degrees of freedom
#> Multiple R-squared: 0.851, Adjusted R-squared: 0.8395
#> F-statistic: 74.15 on 57 and 740 DF, p-value: < 0.00000000000000022
Insight of model_tuning :
- The Adjusted R-squared, which represents the goodness of fit of the
model, is 0.8395. This means that roughly 84% of the variance in
sqrt(SalePrice) is explained by model_tuning, after adjusting for the
number of predictors.
- OverallQual6, OverallQual7, OverallQual8, and OverallQual9 (levels of
the rating of the overall material and finish of the house) have the
largest positive coefficients and are significant at the 0.001 level, as
is the Intercept.
Next, backward step-wise regression is performed to find the best model starting from model_tuning. AIC is used as the selection criterion: at each step the term whose removal lowers AIC the most is dropped, and the model with the smallest AIC is selected.
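The AIC values printed by `step()` for linear models can be checked by hand. The sketch below is a hedged illustration that assumes the `extractAIC()` form `n * log(RSS / n) + 2 * edf`. `step_aic` is a hypothetical helper name. The inputs are read from the model_tuning summary and from the `<none>` row of the first step table: 740 residual degrees of freedom plus 58 coefficients give n = 798, and RSS = 404910.

```r
# AIC as reported by step() for lm fits: n * log(RSS / n) + 2 * edf,
# where edf is the number of estimated coefficients.
step_aic <- function(rss, n, edf) n * log(rss / n) + 2 * edf

# Values read from the printed model_tuning output (RSS is rounded there).
aic_start <- step_aic(rss = 404910, n = 798, edf = 58)
round(aic_start, 2)

# Cross-check the formula against extractAIC() on built-in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(step_aic(sum(residuals(fit)^2), nrow(mtcars), 3),
          extractAIC(fit)[2])
```

The first value should agree with the reported `Start: AIC=5086.99` up to rounding of the printed RSS. Note that this scale differs from `AIC()` by a constant for a fixed sample, so the ranking of models is the same.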
# Backward
model_tuning_bwd <- step(object = model_tuning, direction = "backward")
#> Start: AIC=5086.99
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street +
#> LotConfig + Condition1 + Condition2 + OverallQual + YearBuilt +
#> YearRemodAdd + X1stFlrSF + GrLivArea + BsmtFullBath + FullBath +
#> HalfBath + sqrt(BedroomAbvGr) + sqrt(TotRmsAbvGrd) + Fireplaces +
#> GarageType + sqrt(GarageCars) + sqrt(GarageArea) + PavedDrive +
#> sqrt(OpenPorchSF) + SaleType
#>
#> Df Sum of Sq RSS AIC
#> - Condition2 4 1050 405960 5081.1
#> - LotConfig 4 1975 406885 5082.9
#> - sqrt(BedroomAbvGr) 1 94 405005 5085.2
#> - FullBath 1 167 405077 5085.3
#> - PavedDrive 2 1207 406117 5085.4
#> - HalfBath 1 337 405248 5085.7
#> - sqrt(GarageCars) 1 727 405638 5086.4
#> - sqrt(GarageArea) 1 916 405826 5086.8
#> <none> 404910 5087.0
#> - LotFrontage 1 1254 406164 5087.5
#> - GarageType 5 5623 410534 5088.0
#> - Street 1 1622 406532 5088.2
#> - sqrt(OpenPorchSF) 1 2082 406993 5089.1
#> - sqrt(TotRmsAbvGrd) 1 2179 407089 5089.3
#> - SaleType 7 8536 413446 5089.6
#> - Condition1 8 13526 418436 5097.2
#> - YearBuilt 1 7216 412126 5099.1
#> - X1stFlrSF 1 9469 414379 5103.4
#> - LotArea 1 9632 414542 5103.8
#> - Fireplaces 1 10189 415099 5104.8
#> - MSZoning 4 20462 425373 5118.3
#> - BsmtFullBath 1 23210 428120 5129.5
#> - YearRemodAdd 1 25255 430166 5133.3
#> - GrLivArea 1 53084 457994 5183.3
#> - OverallQual 7 126926 531836 5290.6
#>
#> Step: AIC=5081.06
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street +
#> LotConfig + Condition1 + OverallQual + YearBuilt + YearRemodAdd +
#> X1stFlrSF + GrLivArea + BsmtFullBath + FullBath + HalfBath +
#> sqrt(BedroomAbvGr) + sqrt(TotRmsAbvGrd) + Fireplaces + GarageType +
#> sqrt(GarageCars) + sqrt(GarageArea) + PavedDrive + sqrt(OpenPorchSF) +
#> SaleType
#>
#> Df Sum of Sq RSS AIC
#> - LotConfig 4 2034 407994 5077.0
#> - PavedDrive 2 1089 407049 5079.2
#> - sqrt(BedroomAbvGr) 1 93 406053 5079.2
#> - FullBath 1 132 406092 5079.3
#> - HalfBath 1 354 406314 5079.8
#> - sqrt(GarageCars) 1 673 406633 5080.4
#> - sqrt(GarageArea) 1 967 406927 5081.0
#> <none> 405960 5081.1
#> - LotFrontage 1 1329 407289 5081.7
#> - Street 1 1575 407535 5082.1
#> - GarageType 5 6069 412029 5082.9
#> - sqrt(OpenPorchSF) 1 2071 408031 5083.1
#> - sqrt(TotRmsAbvGrd) 1 2158 408119 5083.3
#> - SaleType 7 8437 414397 5083.5
#> - YearBuilt 1 7529 413489 5093.7
#> - Condition1 8 14915 420875 5093.9
#> - X1stFlrSF 1 9247 415207 5097.0
#> - LotArea 1 9655 415615 5097.8
#> - Fireplaces 1 10515 416475 5099.5
#> - MSZoning 4 20583 426543 5112.5
#> - BsmtFullBath 1 23442 429402 5123.9
#> - YearRemodAdd 1 25521 431481 5127.7
#> - GrLivArea 1 53335 459295 5177.6
#> - OverallQual 7 128324 534284 5286.2
#>
#> Step: AIC=5077.05
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street +
#> Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF +
#> GrLivArea + BsmtFullBath + FullBath + HalfBath + sqrt(BedroomAbvGr) +
#> sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + sqrt(GarageCars) +
#> sqrt(GarageArea) + PavedDrive + sqrt(OpenPorchSF) + SaleType
#>
#> Df Sum of Sq RSS AIC
#> - PavedDrive 2 1018 409012 5075.0
#> - FullBath 1 86 408080 5075.2
#> - sqrt(BedroomAbvGr) 1 163 408157 5075.4
#> - HalfBath 1 350 408344 5075.7
#> - sqrt(GarageCars) 1 665 408659 5076.3
#> - LotFrontage 1 797 408791 5076.6
#> - sqrt(GarageArea) 1 981 408975 5077.0
#> <none> 407994 5077.0
#> - Street 1 1834 409828 5078.6
#> - sqrt(OpenPorchSF) 1 2048 410042 5079.0
#> - GarageType 5 6233 414227 5079.1
#> - SaleType 7 8343 416337 5079.2
#> - sqrt(TotRmsAbvGrd) 1 2198 410192 5079.3
#> - YearBuilt 1 8182 416176 5090.9
#> - Condition1 8 15561 423555 5090.9
#> - X1stFlrSF 1 9043 417037 5092.5
#> - Fireplaces 1 10429 418423 5095.2
#> - LotArea 1 11566 419560 5097.4
#> - MSZoning 4 20924 428918 5109.0
#> - BsmtFullBath 1 23956 431950 5120.6
#> - YearRemodAdd 1 25730 433725 5123.8
#> - GrLivArea 1 55478 463472 5176.8
#> - OverallQual 7 127300 535294 5279.8
#>
#> Step: AIC=5075.04
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street +
#> Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF +
#> GrLivArea + BsmtFullBath + FullBath + HalfBath + sqrt(BedroomAbvGr) +
#> sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + sqrt(GarageCars) +
#> sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType
#>
#> Df Sum of Sq RSS AIC
#> - FullBath 1 61 409073 5073.2
#> - sqrt(BedroomAbvGr) 1 139 409151 5073.3
#> - HalfBath 1 366 409379 5073.7
#> - sqrt(GarageCars) 1 541 409553 5074.1
#> - LotFrontage 1 842 409855 5074.7
#> <none> 409012 5075.0
#> - sqrt(GarageArea) 1 1170 410182 5075.3
#> - SaleType 7 8197 417210 5076.9
#> - sqrt(OpenPorchSF) 1 2025 411037 5077.0
#> - Street 1 2158 411170 5077.2
#> - GarageType 5 6454 415467 5077.5
#> - sqrt(TotRmsAbvGrd) 1 2459 411471 5077.8
#> - Condition1 8 15899 424911 5089.5
#> - X1stFlrSF 1 9076 418088 5090.5
#> - YearBuilt 1 10092 419105 5092.5
#> - LotArea 1 11239 420252 5094.7
#> - Fireplaces 1 11280 420293 5094.7
#> - MSZoning 4 22426 431438 5109.6
#> - BsmtFullBath 1 23756 432769 5118.1
#> - YearRemodAdd 1 25048 434061 5120.5
#> - GrLivArea 1 56454 465466 5176.2
#> - OverallQual 7 126953 535965 5276.8
#>
#> Step: AIC=5073.15
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street +
#> Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF +
#> GrLivArea + BsmtFullBath + HalfBath + sqrt(BedroomAbvGr) +
#> sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + sqrt(GarageCars) +
#> sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType
#>
#> Df Sum of Sq RSS AIC
#> - sqrt(BedroomAbvGr) 1 114 409187 5071.4
#> - HalfBath 1 305 409379 5071.7
#> - sqrt(GarageCars) 1 603 409676 5072.3
#> - LotFrontage 1 809 409882 5072.7
#> <none> 409073 5073.2
#> - sqrt(GarageArea) 1 1154 410227 5073.4
#> - SaleType 7 8182 417255 5075.0
#> - sqrt(OpenPorchSF) 1 2118 411191 5075.3
#> - Street 1 2155 411228 5075.3
#> - GarageType 5 6396 415469 5075.5
#> - sqrt(TotRmsAbvGrd) 1 2424 411497 5075.9
#> - Condition1 8 15951 425024 5087.7
#> - X1stFlrSF 1 9060 418133 5088.6
#> - LotArea 1 11181 420254 5092.7
#> - Fireplaces 1 11221 420295 5092.7
#> - YearBuilt 1 11669 420742 5093.6
#> - MSZoning 4 22867 431940 5108.6
#> - BsmtFullBath 1 24075 433148 5116.8
#> - YearRemodAdd 1 25429 434502 5119.3
#> - GrLivArea 1 68076 477149 5194.0
#> - OverallQual 7 126912 535986 5274.8
#>
#> Step: AIC=5071.38
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street +
#> Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF +
#> GrLivArea + BsmtFullBath + HalfBath + sqrt(TotRmsAbvGrd) +
#> Fireplaces + GarageType + sqrt(GarageCars) + sqrt(GarageArea) +
#> sqrt(OpenPorchSF) + SaleType
#>
#> Df Sum of Sq RSS AIC
#> - HalfBath 1 321 409509 5070.0
#> - sqrt(GarageCars) 1 664 409851 5070.7
#> - LotFrontage 1 754 409941 5070.8
#> <none> 409187 5071.4
#> - sqrt(GarageArea) 1 1111 410298 5071.5
#> - SaleType 7 8250 417437 5073.3
#> - sqrt(OpenPorchSF) 1 2118 411305 5073.5
#> - Street 1 2170 411357 5073.6
#> - GarageType 5 6321 415508 5073.6
#> - sqrt(TotRmsAbvGrd) 1 3562 412749 5076.3
#> - Condition1 8 16093 425281 5086.2
#> - X1stFlrSF 1 9306 418493 5087.3
#> - LotArea 1 11090 420277 5090.7
#> - Fireplaces 1 11412 420600 5091.3
#> - YearBuilt 1 11603 420790 5091.7
#> - MSZoning 4 22761 431948 5106.6
#> - BsmtFullBath 1 24968 434155 5116.6
#> - YearRemodAdd 1 26364 435551 5119.2
#> - GrLivArea 1 68351 477538 5192.6
#> - OverallQual 7 127978 537166 5274.5
#>
#> Step: AIC=5070
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street +
#> Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF +
#> GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces +
#> GarageType + sqrt(GarageCars) + sqrt(GarageArea) + sqrt(OpenPorchSF) +
#> SaleType
#>
#> Df Sum of Sq RSS AIC
#> - sqrt(GarageCars) 1 693 410201 5069.4
#> - LotFrontage 1 697 410206 5069.4
#> <none> 409509 5070.0
#> - sqrt(GarageArea) 1 1054 410562 5070.1
#> - SaleType 7 8206 417715 5071.8
#> - GarageType 5 6196 415705 5072.0
#> - Street 1 2153 411662 5072.2
#> - sqrt(OpenPorchSF) 1 2235 411744 5072.3
#> - sqrt(TotRmsAbvGrd) 1 3517 413026 5074.8
#> - Condition1 8 15867 425375 5084.3
#> - X1stFlrSF 1 10462 419971 5088.1
#> - LotArea 1 11331 420840 5089.8
#> - Fireplaces 1 12097 421606 5091.2
#> - YearBuilt 1 12687 422195 5092.3
#> - MSZoning 4 23500 433008 5106.5
#> - BsmtFullBath 1 25083 434592 5115.4
#> - YearRemodAdd 1 26112 435620 5117.3
#> - GrLivArea 1 79060 488568 5208.9
#> - OverallQual 7 127752 537261 5272.7
#>
#> Step: AIC=5069.35
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street +
#> Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF +
#> GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces +
#> GarageType + sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType
#>
#> Df Sum of Sq RSS AIC
#> - LotFrontage 1 827 411028 5069.0
#> <none> 410201 5069.4
#> - Street 1 1966 412167 5071.2
#> - SaleType 7 8229 418430 5071.2
#> - GarageType 5 6248 416449 5071.4
#> - sqrt(OpenPorchSF) 1 2196 412397 5071.6
#> - sqrt(TotRmsAbvGrd) 1 3232 413433 5073.6
#> - sqrt(GarageArea) 1 6485 416687 5079.9
#> - Condition1 8 15611 425813 5083.2
#> - X1stFlrSF 1 10359 420561 5087.3
#> - LotArea 1 10964 421166 5088.4
#> - Fireplaces 1 13053 423254 5092.3
#> - YearBuilt 1 14282 424483 5094.7
#> - MSZoning 4 23087 433288 5105.0
#> - BsmtFullBath 1 24766 434967 5114.1
#> - YearRemodAdd 1 26637 436838 5117.6
#> - GrLivArea 1 78760 488961 5207.5
#> - OverallQual 7 129216 539418 5273.9
#>
#> Step: AIC=5068.96
#> sqrt(SalePrice) ~ MSZoning + LotArea + Street + Condition1 +
#> OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + GrLivArea +
#> BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces + GarageType +
#> sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType
#>
#> Df Sum of Sq RSS AIC
#> <none> 411028 5069.0
#> - GarageType 5 6061 417088 5070.6
#> - Street 1 1924 412952 5070.7
#> - SaleType 7 8241 419269 5070.8
#> - sqrt(OpenPorchSF) 1 2316 413344 5071.4
#> - sqrt(TotRmsAbvGrd) 1 3198 414226 5073.1
#> - sqrt(GarageArea) 1 7038 418066 5080.5
#> - Condition1 8 16121 427149 5083.7
#> - X1stFlrSF 1 10668 421695 5087.4
#> - Fireplaces 1 12799 423827 5091.4
#> - YearBuilt 1 13898 424926 5093.5
#> - LotArea 1 18281 429309 5101.7
#> - MSZoning 4 24616 435644 5107.4
#> - BsmtFullBath 1 24596 435624 5113.3
#> - YearRemodAdd 1 26350 437378 5116.5
#> - GrLivArea 1 78425 489453 5206.3
#> - OverallQual 7 129265 540293 5273.2
summary(model_tuning_bwd)
#>
#> Call:
#> lm(formula = sqrt(SalePrice) ~ MSZoning + LotArea + Street +
#> Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF +
#> GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces +
#> GarageType + sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType,
#> data = house_train_outlier)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -141.385 -11.810 1.202 12.867 71.895
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1185.5496140 123.7422772 -9.581 < 0.0000000000000002 ***
#> MSZoningFV 47.8327683 11.7181389 4.082 0.00004942175816 ***
#> MSZoningRH 40.8003856 13.4583576 3.032 0.002516 **
#> MSZoningRL 47.6556618 10.2449909 4.652 0.00000388838676 ***
#> MSZoningRM 32.6167465 10.2439534 3.184 0.001512 **
#> LotArea 0.0021135 0.0003647 5.795 0.00000001004670 ***
#> StreetPave 40.1259346 21.3463959 1.880 0.060527 .
#> Condition1Feedr 2.4689381 6.3809933 0.387 0.698924
#> Condition1Norm 10.0801595 5.5097687 1.830 0.067718 .
#> Condition1PosA 12.1236213 17.5064693 0.693 0.488822
#> Condition1PosN 20.5070543 10.6439358 1.927 0.054399 .
#> Condition1RRAe -20.5124816 9.9282631 -2.066 0.039162 *
#> Condition1RRAn 0.9719768 8.1581268 0.119 0.905194
#> Condition1RRNe 24.3649777 24.1708902 1.008 0.313763
#> Condition1RRNn 53.4566319 20.8362100 2.566 0.010493 *
#> OverallQual3 22.3893533 19.2112411 1.165 0.244213
#> OverallQual4 51.0667564 17.6013377 2.901 0.003824 **
#> OverallQual5 63.2400359 17.6278380 3.588 0.000355 ***
#> OverallQual6 75.9773606 17.7257100 4.286 0.00002052194416 ***
#> OverallQual7 91.2502798 17.9776344 5.076 0.00000048643747 ***
#> OverallQual8 119.1814461 18.2830832 6.519 0.00000000012970 ***
#> OverallQual9 154.6624657 29.5499614 5.234 0.00000021525869 ***
#> YearBuilt 0.2595350 0.0513673 5.053 0.00000054731803 ***
#> YearRemodAdd 0.3762884 0.0540873 6.957 0.00000000000754 ***
#> X1stFlrSF 0.0163162 0.0036860 4.427 0.00001098513175 ***
#> GrLivArea 0.0540103 0.0045000 12.002 < 0.0000000000000002 ***
#> BsmtFullBath 12.1182690 1.8028793 6.722 0.00000000003543 ***
#> sqrt(TotRmsAbvGrd) -13.0538697 5.3857683 -2.424 0.015594 *
#> Fireplaces 7.5785219 1.5629759 4.849 0.00000150881746 ***
#> GarageTypeAttchd 31.9801391 13.9081963 2.299 0.021756 *
#> GarageTypeBasment 20.2429567 15.5219714 1.304 0.192580
#> GarageTypeBuiltIn 31.0424046 14.5649284 2.131 0.033386 *
#> GarageTypeCarPort 12.8412252 16.8497521 0.762 0.446237
#> GarageTypeDetchd 31.1441776 13.8167248 2.254 0.024476 *
#> sqrt(GarageArea) 1.1124980 0.3094135 3.596 0.000345 ***
#> sqrt(OpenPorchSF) 0.5393338 0.2614866 2.063 0.039495 *
#> SaleTypeCon 39.8799194 24.5923544 1.622 0.105298
#> SaleTypeConLD 12.8355513 12.8187209 1.001 0.316996
#> SaleTypeConLI -8.2347973 14.7519869 -0.558 0.576862
#> SaleTypeConLw 12.6486178 14.8703324 0.851 0.395265
#> SaleTypeCWD 45.1134960 24.2419945 1.861 0.063138 .
#> SaleTypeNew 19.8067478 6.3970793 3.096 0.002033 **
#> SaleTypeWD 10.1147931 4.9676047 2.036 0.042084 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 23.33 on 755 degrees of freedom
#> Multiple R-squared: 0.8488, Adjusted R-squared: 0.8403
#> F-statistic: 100.9 on 42 and 755 DF, p-value: < 0.00000000000000022
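# Assign a reduced model, based on the backward step-wise search, to a new object.
# Note: this formula keeps sqrt(GarageCars) and PavedDrive and omits sqrt(GarageArea)
# and SaleType, so it is not identical to the final model_tuning_bwd formula.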
step_model <- lm(formula = sqrt(SalePrice) ~ MSZoning + LotArea + Street +
Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF +
GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces +
GarageType + sqrt(GarageCars) + PavedDrive + sqrt(OpenPorchSF),
data = house_train_outlier)
summary(step_model)
#>
#> Call:
#> lm(formula = sqrt(SalePrice) ~ MSZoning + LotArea + Street +
#> Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF +
#> GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces +
#> GarageType + sqrt(GarageCars) + PavedDrive + sqrt(OpenPorchSF),
#> data = house_train_outlier)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -142.41 -12.11 1.65 12.86 70.85
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1208.2984979 125.0734246 -9.661 < 0.0000000000000002 ***
#> MSZoningFV 50.0889398 11.6262010 4.308 0.000018612838147 ***
#> MSZoningRH 42.5967228 13.4652658 3.163 0.001621 **
#> MSZoningRL 46.7804527 10.2008932 4.586 0.000005284496505 ***
#> MSZoningRM 31.8018003 10.1839850 3.123 0.001860 **
#> LotArea 0.0023296 0.0003615 6.444 0.000000000207131 ***
#> StreetPave 39.6475164 21.5124593 1.843 0.065718 .
#> Condition1Feedr 2.6781205 6.4071369 0.418 0.676072
#> Condition1Norm 9.5608921 5.5468660 1.724 0.085177 .
#> Condition1PosA 12.6366657 17.5883350 0.718 0.472689
#> Condition1PosN 19.8276346 10.5008782 1.888 0.059381 .
#> Condition1RRAe -22.5903214 9.8814318 -2.286 0.022521 *
#> Condition1RRAn 0.0251232 8.2226516 0.003 0.997563
#> Condition1RRNe 24.0103669 24.3008947 0.988 0.323445
#> Condition1RRNn 57.6447242 20.9007983 2.758 0.005955 **
#> OverallQual3 17.8207239 19.4331200 0.917 0.359419
#> OverallQual4 49.2644705 17.7235749 2.780 0.005577 **
#> OverallQual5 60.9327469 17.7571891 3.431 0.000633 ***
#> OverallQual6 73.1408528 17.8609274 4.095 0.000046739449480 ***
#> OverallQual7 89.9823886 18.0867720 4.975 0.000000807221600 ***
#> OverallQual8 117.5375402 18.3887588 6.392 0.000000000285855 ***
#> OverallQual9 153.1181821 29.7133326 5.153 0.000000326754145 ***
#> YearBuilt 0.2452017 0.0540553 4.536 0.000006656531028 ***
#> YearRemodAdd 0.4079384 0.0539746 7.558 0.000000000000118 ***
#> X1stFlrSF 0.0168860 0.0036582 4.616 0.000004592604823 ***
#> GrLivArea 0.0529243 0.0044757 11.825 < 0.0000000000000002 ***
#> BsmtFullBath 12.0178470 1.7871470 6.725 0.000000000034607 ***
#> sqrt(TotRmsAbvGrd) -13.6751331 5.4038522 -2.531 0.011587 *
#> Fireplaces 7.3269770 1.5717891 4.662 0.000003706088344 ***
#> GarageTypeAttchd 29.3776897 13.9682468 2.103 0.035779 *
#> GarageTypeBasment 18.2469830 15.6016305 1.170 0.242546
#> GarageTypeBuiltIn 27.9145622 14.5981125 1.912 0.056226 .
#> GarageTypeCarPort 8.0907643 16.9038593 0.479 0.632337
#> GarageTypeDetchd 28.6098451 13.8811916 2.061 0.039638 *
#> sqrt(GarageCars) 17.9522014 5.0237734 3.573 0.000375 ***
#> PavedDriveP 8.6330662 7.6417951 1.130 0.258952
#> PavedDriveY 5.8898092 4.6117561 1.277 0.201947
#> sqrt(OpenPorchSF) 0.5710397 0.2615929 2.183 0.029346 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 23.46 on 760 degrees of freedom
#> Multiple R-squared: 0.846, Adjusted R-squared: 0.8386
#> F-statistic: 112.9 on 37 and 760 DF, p-value: < 0.00000000000000022
Insight of step_model :
- The Adjusted R-squared, which represents the goodness of fit of the
model, is 0.8386. This means that roughly 84% of the variance in
sqrt(SalePrice) is explained by step_model, after adjusting for the
number of predictors.
- OverallQual6, OverallQual7, OverallQual8, and OverallQual9 (levels of
the rating of the overall material and finish of the house) have the
largest positive coefficients and are significant at the 0.001 level, as
are the Intercept and MSZoningFV. StreetPave has a large coefficient
(39.65) but is only marginally significant (p = 0.066).
For the two candidate models, model_tuning and step_model, a
performance comparison is made. The goal is to find the best regression
model for the data.
comparison_tuning <- compare_performance(model_tuning, step_model)
as.data.frame(comparison_tuning)
Insight:
Best linear regression model, as follows:
- Based on the highest Adj. R-squared (R2_adjusted column): model_tuning.
- Based on the lowest AIC (AIC column): step_model.
- Based on the lowest RMSE (RMSE column): model_tuning.
In this section, the linearity assumption is checked with a residuals vs fitted plot for model_tuning.
plot(model_tuning, which = 1)
abline(h = 10, col = "green")
abline(h = -10, col = "green")
The plot shows that the residuals scatter randomly around 0 and most of them are not far from 0, so the condition E(ϵ)=0 is fulfilled. The red line also stays within the tolerance range of -10 to +10, so the relationship modeled by model_tuning is approximately linear. The linearity assumption is met.
The linear regression model is expected to produce errors that are
normally distributed and mostly located around 0.
# Plot 1
hist(model_tuning$residuals, breaks = 3)
# Plot 2
res_tun <- data.frame(residual = model_tuning$residuals)
res_tun %>%
e_charts() %>%
e_histogram(residual, name = "Error", legend = F) %>%
e_title(
text = "Normality of Residuals",
textStyle = list(fontFamily = "Cardo", fontSize = 25),
subtext = "Normally Distributed Error",
subtextStyle = list(fontFamily = "Cardo", fontSize = 15, fontStyle = "italic"),
left = "center") %>%
e_hide_grid_lines() %>%
e_tooltip() %>%
e_x_axis(axisLabel = list(fontFamily = "Cardo")) %>%
e_y_axis(axisLabel = list(color = "#FAF7E6"))
# Shapiro wilk test
shapiro.test(model_tuning$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_tuning$residuals
#> W = 0.95853, p-value = 0.00000000000003094
Based on the histogram above, most of the residuals are located around 0. However, the Shapiro-Wilk p-value is lower than 0.05, so H0 (normally distributed errors) is rejected. In other words, the residuals are centered near 0 but are not normally distributed.
bptest(model_tuning)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_tuning
#> BP = 73.313, df = 57, p-value = 0.07166
Breusch-Pagan hypothesis test:
H0: the error spread is constant (homoscedasticity)
H1: the error spread is not constant (heteroscedasticity)
Based on the BP test, the p-value (0.07166) is higher than 0.05, so H0 cannot be rejected at the 5% level. The model_tuning model therefore meets the homoscedasticity assumption at that level, although the p-value is close to the threshold and the result should be read with caution.
In this section, multicollinearity between the predictor variables is
measured with the vif() function from library(car), to discover which
predictors are strongly related to other predictors and are therefore
redundant. One variable is then kept from each redundant group so that
no redundant predictor remains in the model.
vif(model_tuning)
#> GVIF Df GVIF^(1/(2*Df))
#> MSZoning 3.300226 4 1.160962
#> LotFrontage 2.023633 1 1.422545
#> LotArea 2.320601 1 1.523352
#> Street 1.719076 1 1.311135
#> LotConfig 1.570648 4 1.058059
#> Condition1 3.800535 8 1.087027
#> Condition2 1.964427 4 1.088064
#> OverallQual 4.692352 7 1.116751
#> YearBuilt 4.287220 1 2.070560
#> YearRemodAdd 1.909852 1 1.381974
#> X1stFlrSF 2.438140 1 1.561454
#> GrLivArea 5.609487 1 2.368436
#> BsmtFullBath 1.266517 1 1.125396
#> FullBath 3.205857 1 1.790491
#> HalfBath 2.334597 1 1.527939
#> sqrt(BedroomAbvGr) 2.363064 1 1.537226
#> sqrt(TotRmsAbvGrd) 3.884699 1 1.970964
#> Fireplaces 1.429814 1 1.195748
#> GarageType 4.073371 5 1.150788
#> sqrt(GarageCars) 4.595087 1 2.143615
#> sqrt(GarageArea) 4.218352 1 2.053863
#> PavedDrive 1.530158 2 1.112203
#> sqrt(OpenPorchSF) 1.421658 1 1.192333
#> SaleType 1.798233 7 1.042805
There is no multicollinearity among the variables, because all VIF scores are lower than 10 (VIF < 10).
After the assumption tests for model_tuning, linearity and non-multicollinearity are met, the Breusch-Pagan test does not reject homoscedasticity at the 5% level (p = 0.07166), and normality of the residuals is not met. The model is interpreted as follows:
\[
\begin{aligned}
\sqrt{\text{SalePrice}} ={}& -1116.48 + 46.05\,\text{MSZoningFV} + 40.27\,\text{MSZoningRH} + 45.48\,\text{MSZoningRL} + 31.40\,\text{MSZoningRM} \\
& + 0.10\,\text{LotFrontage} + 0.0018\,\text{LotArea} + 37.38\,\text{StreetPave} \\
& + 7.72\,\text{LotConfigCulDSac} - 0.96\,\text{LotConfigFR2} + 8.35\,\text{LotConfigFR3} + 1.09\,\text{LotConfigInside} \\
& + 1.93\,\text{Condition1Feedr} + 8.78\,\text{Condition1Norm} + 10.51\,\text{Condition1PosA} + 17.33\,\text{Condition1PosN} \\
& - 22.50\,\text{Condition1RRAe} + 1.10\,\text{Condition1RRAn} + 23.74\,\text{Condition1RRNe} + 44.23\,\text{Condition1RRNn} \\
& + 3.15\,\text{Condition2Feedr} + 11.33\,\text{Condition2Norm} - 15.69\,\text{Condition2RRAn} + 11.16\,\text{Condition2RRNn} \\
& + 18.29\,\text{OverallQual3} + 49.23\,\text{OverallQual4} + 61.43\,\text{OverallQual5} + 73.64\,\text{OverallQual6} \\
& + 89.05\,\text{OverallQual7} + 117.19\,\text{OverallQual8} + 154.03\,\text{OverallQual9} \\
& + 0.21\,\text{YearBuilt} + 0.38\,\text{YearRemodAdd} + 0.019\,\text{X1stFlrSF} + 0.052\,\text{GrLivArea} \\
& + 12.17\,\text{BsmtFullBath} + 1.56\,\text{FullBath} + 2.00\,\text{HalfBath} \\
& - 2.14\,\sqrt{\text{BedroomAbvGr}} - 12.12\,\sqrt{\text{TotRmsAbvGrd}} + 6.98\,\text{Fireplaces} \\
& + 31.45\,\text{GarageTypeAttchd} + 19.99\,\text{GarageTypeBasment} + 31.23\,\text{GarageTypeBuiltIn} \\
& + 13.26\,\text{GarageTypeCarPort} + 31.37\,\text{GarageTypeDetchd} \\
& + 9.18\,\sqrt{\text{GarageCars}} + 0.63\,\sqrt{\text{GarageArea}} + 9.15\,\text{PavedDriveP} + 6.39\,\text{PavedDriveY} \\
& + 0.52\,\sqrt{\text{OpenPorchSF}} + 40.63\,\text{SaleTypeCon} + 13.75\,\text{SaleTypeConLD} - 8.44\,\text{SaleTypeConLI} \\
& + 14.07\,\text{SaleTypeConLw} + 46.36\,\text{SaleTypeCWD} + 20.08\,\text{SaleTypeNew} + 10.08\,\text{SaleTypeWD}
\end{aligned}
\]
Linear Regression Model Used: model_tuning
1. Intercept: The value of the target variable, here sqrt(SalePrice),
when all predictor values are equal to zero.
2. Coefficient/Slope: For an increase of 1 unit in a predictor, with the
other predictors held constant, the target variable changes by the
slope value.
3. Predictor Significance: Shows which predictors significantly affect
the target variable.
4. Goodness of Fit (R-Squared): Describes how much of the variance of
the target variable the predictors can explain.
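Because the target is sqrt(SalePrice), coefficients and predictions are on the square-root scale. The following sketch shows, under stated assumptions, how predictions could be brought back to the price scale. It uses simulated data with hypothetical names (`area`, `price`), not the house data.

```r
set.seed(11)
# Simulated stand-in data; `area` and `price` are hypothetical names.
homes <- data.frame(area = runif(200, 60, 240))
homes$price <- (150 + 1.5 * homes$area + rnorm(200, sd = 15))^2

fit_sqrt <- lm(sqrt(price) ~ area, data = homes)

# Predictions and intervals come back on the sqrt scale...
new_home  <- data.frame(area = 120)
pred_sqrt <- predict(fit_sqrt, newdata = new_home, interval = "prediction")

# ...so square them to return to the price scale (valid while lwr >= 0).
pred_price <- pred_sqrt^2
pred_price

# Approximate price change for one extra unit of area at this home:
# d(price) / d(area) = 2 * slope * sqrt(price)
2 * coef(fit_sqrt)[["area"]] * pred_sqrt[, "fit"]
```

On the price scale the effect of a predictor is no longer constant: it grows with the predicted price. Squaring the fitted value also tends to understate the mean price slightly (by roughly the residual variance), so it is better read as a typical value than as an exact mean.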
Based on the step_model summary, the terms that do not significantly
influence sqrt(SalePrice) at the 5% level (p-value > 0.05) are
StreetPave, Condition1Feedr, Condition1Norm, Condition1PosA,
Condition1PosN, Condition1RRAn, Condition1RRNe, OverallQual3,
GarageTypeBasment, GarageTypeBuiltIn, GarageTypeCarPort, PavedDriveP,
and PavedDriveY.
The model_tuning model is suitable for making predictions, while
step_model is more appropriate for examining the significance of the
predictors. Note that step-wise regression is a greedy algorithm: it
finds a good model in a relatively short time, but not necessarily the
optimal one. Therefore, this research uses the result of step-wise
regression as the model recommendation, and there is still room for
improvement.
Several numeric variables with large p-values were transformed with
sqrt(), yet some residuals still fall outside the -10 and +10 limits.
The residuals are also still not normally distributed (Shapiro-Wilk
p-value < 0.05), and the Breusch-Pagan p-value (0.07166) is only
marginally above 0.05, so homoscedasticity is borderline at best. It is
recommended to try a more flexible model (one that does not rely on
these assumptions) so that it can capture non-linear relationships.
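As a rough illustration of why a more flexible model can help, the sketch below compares a straight-line fit with a local regression (`loess`) on simulated non-linear data. `loess` is only a base R stand-in chosen for this sketch. The data are hypothetical, and tree-based ensembles or similar methods would be a more typical choice for the house data.

```r
set.seed(5)
# Simulated non-linear relationship (hypothetical data for illustration).
train <- data.frame(x = runif(300, 0, 2 * pi))
train$y <- sin(train$x) + rnorm(300, sd = 0.2)
test <- data.frame(x = seq(0.5, 5.8, length.out = 100))
test$y <- sin(test$x) + rnorm(100, sd = 0.2)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit_lm    <- lm(y ~ x, data = train)
fit_loess <- loess(y ~ x, data = train)  # local regression, no global line

# Out-of-sample RMSE of both fits on the held-out test points.
c(lm    = rmse(test$y, predict(fit_lm, newdata = test)),
  loess = rmse(test$y, predict(fit_loess, newdata = test)))
```

Comparing models on held-out data, as done here, also avoids rewarding a model simply for having more parameters.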