1 Summary

This research builds a machine learning model to predict house prices. The analysis proceeds through data preprocessing, data wrangling, exploratory data analysis, model building, model comparison, assumption tests, model improvement, model interpretation, and a conclusion. The first round of analysis recommends model_all, but that model does not meet the four assumption tests. Outlier treatment is then applied to the numeric variables in order to meet those assumptions. As a result, model_tuning is recommended for predicting house prices, while step_model is appropriate for examining the significance of each predictor. Finally, this research recommends trying a more complex model (one without these assumptions) that can capture non-linear relationships, and using the step-wise regression model, step_model, as the recommended model; there is still room for improvement.


2 Preface

2.1 Background


Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 81 variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.


2.2 Research Objectives


The research objective is to predict the house price with several predictor variables.


2.4 Data Description


Here is a brief description of the variables that remain in the dataset after variable selection.

No. Feature Description
1. MSSubClass Identifies the type of dwelling involved in the sale.
2. MSZoning Identifies the general zoning classification of the sale.
3. LotFrontage Linear feet of street connected to property.
4. LotArea Lot size in square feet.
5. Street Type of road access to property.
6. LotConfig Lot configuration.
7. Condition1 Proximity to various conditions.
8. Condition2 Proximity to various conditions (if more than one is present).
9. BldgType Type of dwelling.
10. HouseStyle Style of dwelling.
11. OverallQual Rates the overall material and finish of the house.
12. OverallCond Rates the overall condition of the house.
13. YearBuilt Original construction date.
14. YearRemodAdd Remodel date (same as construction date if no remodeling or additions).
15. Exterior1st Exterior covering on house.
16. ExterQual Evaluates the quality of the material on the exterior.
17. X1stFlrSF First Floor square feet.
18. X2ndFlrSF Second floor square feet.
19. GrLivArea Above grade (ground) living area square feet.
20. BsmtFullBath Basement full bathrooms.
21. FullBath Full bathrooms above grade.
22. HalfBath Half baths above grade.
23. BedroomAbvGr Bedrooms above grade (does NOT include basement bedrooms).
24. TotRmsAbvGrd Total rooms above grade (does not include bathrooms).
25. Fireplaces Number of fireplaces.
26. GarageType Garage location.
27. GarageCars Size of garage in car capacity.
28. GarageArea Size of garage in square feet.
29. PavedDrive Paved driveway.
30. OpenPorchSF Open porch area in square feet.
31. SaleType Type of sale.
32. SalePrice House’s price.

2.5 List Packages


Several packages are used in this research, as follows:

# data cleaning
library(readr)
library(dplyr)

# data analysis
library(GGally)

# data visualization
library(ggplot2)
library(scales)
library(echarts4r)

# cross validation
library(rsample)

# RMSE
library(MLmetrics)

# model performance comparison
library(performance)

# hypothesis test
library(lmtest)
library(car)

3 Data Preprocessing

3.1 Read & Extracting Data


This data set contains 38 columns of numeric data type and 43 columns of character data type.
The data set ("train.csv") is loaded and assigned to the house object, after which it is ready for the data wrangling process.

house <- read.csv("data_input/train.csv")
house
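
The column-type counts quoted above can be checked directly in R. The following is a minimal sketch on a toy data frame (the `toy` object and its values are hypothetical stand-ins, not rows from train.csv); applied to `house`, the same two lines would tabulate its column classes.

```r
# Toy data frame standing in for `house` (hypothetical values, not from train.csv)
toy <- data.frame(
  LotArea   = c(8450L, 9600L, 11250L),
  SalePrice = c(208500L, 181500L, 223500L),
  Street    = c("Pave", "Pave", "Grvl"),
  stringsAsFactors = FALSE
)

# Class of each column, then the number of columns per class
col_types   <- sapply(toy, class)
type_counts <- table(col_types)
print(type_counts)
```

Note that read.csv typically stores whole-number columns as integer, so the "numeric" count may appear under the integer class.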

4 Data Wrangling

4.1 Data Inspection


This step inspects the full data set, its top 6 observations, and its bottom 6 observations, using the head and tail functions, to get a first impression of the data.

1. Full data set.

house


2. Top 6 observations of data set.

# Top 6 data
head(house)


3. Bottom 6 observations of data set.

# Bottom 6 data
tail(house)

4.2 Variable Elimination

The Id variable does not provide any valuable information for linear regression analysis. In addition, from a business perspective and to avoid redundancy, this analysis excludes the following variables:

house_new <- house %>% 
  select(-c(Id,Alley,LotShape,LandContour,Utilities,LandSlope,Neighborhood,RoofStyle,
            RoofMatl,Exterior2nd,MasVnrType,MasVnrArea,ExterCond,Foundation,BsmtQual,
            BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,
            BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,
            LowQualFinSF,BsmtHalfBath,KitchenAbvGr,KitchenQual,Functional,FireplaceQu,
            GarageYrBlt,GarageFinish,GarageQual,GarageCond,WoodDeckSF,EnclosedPorch,
            X3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,
            YrSold,SaleCondition))

house_new

4.3 Change The Data Type


The data type of each column is inspected with the glimpse() function from the dplyr package to ensure it is appropriate. Several variables are converted to factor: MSZoning, Street, LotConfig, Condition1, Condition2, BldgType, OverallQual, ExterQual, and PavedDrive.

house_new <- house_new %>% 
  mutate_at(vars(MSZoning,Street,LotConfig,Condition1,Condition2,
                 BldgType,OverallQual,ExterQual,PavedDrive),as.factor)


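# Relevel
# (note: assigning to levels() relabels the alphabetical levels Ex, Fa, Gd, TA by position rather than reordering them)
levels(house_new$ExterQual) <- c("Fa","TA","Gd","Ex")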

glimpse(house_new)
#> Rows: 1,460
#> Columns: 32
#> $ MSSubClass   <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20, …
#> $ MSZoning     <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, RL, RL, R…
#> $ LotFrontage  <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, N…
#> $ LotArea      <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120…
#> $ Street       <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
#> $ LotConfig    <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside, Corner,…
#> $ Condition1   <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN, Artery, …
#> $ Condition2   <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Art…
#> $ BldgType     <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 2fm…
#> $ HouseStyle   <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fin…
#> $ OverallQual  <fct> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5, …
#> $ OverallCond  <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5, …
#> $ YearBuilt    <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 193…
#> $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 195…
#> $ Exterior1st  <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "V…
#> $ ExterQual    <fct> Gd, Ex, Gd, Ex, Gd, Ex, Gd, Ex, Ex, Ex, Ex, Fa, Ex, Gd, E…
#> $ X1stFlrSF    <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, 1…
#> $ X2ndFlrSF    <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0, …
#> $ GrLivArea    <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 107…
#> $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, …
#> $ FullBath     <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1, …
#> $ HalfBath     <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, …
#> $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3, …
#> $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6,…
#> $ Fireplaces   <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0, …
#> $ GarageType   <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attchd…
#> $ GarageCars   <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2, …
#> $ GarageArea   <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 73…
#> $ PavedDrive   <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
#> $ OpenPorchSF  <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213, …
#> $ SaleType     <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD…
#> $ SalePrice    <int> 208500, 181500, 223500, 140000, 250000, 143000, 307000, 2…
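
One detail of the relevel step is worth illustrating: in R, assigning a vector to levels() renames the existing levels by position, whereas factor(x, levels = ...) changes their order and leaves the labels untouched. The sketch below uses a small hypothetical vector `q` (not taken from the data set) to show the difference.

```r
# Toy quality ratings (hypothetical), using the same codes as ExterQual
q <- as.factor(c("Gd", "TA", "Ex", "Fa", "TA"))
levels(q)  # as.factor() sorts levels alphabetically: "Ex" "Fa" "Gd" "TA"

# Assigning to levels() relabels by position: Ex->Fa, Fa->TA, Gd->Gd, TA->Ex
renamed <- q
levels(renamed) <- c("Fa", "TA", "Gd", "Ex")

# factor(..., levels = ) reorders the levels and keeps every label as it was
reordered <- factor(q, levels = c("Fa", "TA", "Gd", "Ex"))

as.character(renamed)    # "Gd" "Ex" "Fa" "TA" "Ex"
as.character(reordered)  # "Gd" "TA" "Ex" "Fa" "TA"
```

If the intent is only to set the order Fa < TA < Gd < Ex, the second form is the one that preserves the original ratings.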

4.4 Missing Values


- Check Missing Values.
Checking for missing values is a must so that they can be treated and the data is ready to analyze.

colSums(is.na(house_new))
#>   MSSubClass     MSZoning  LotFrontage      LotArea       Street    LotConfig 
#>            0            0          259            0            0            0 
#>   Condition1   Condition2     BldgType   HouseStyle  OverallQual  OverallCond 
#>            0            0            0            0            0            0 
#>    YearBuilt YearRemodAdd  Exterior1st    ExterQual    X1stFlrSF    X2ndFlrSF 
#>            0            0            0            0            0            0 
#>    GrLivArea BsmtFullBath     FullBath     HalfBath BedroomAbvGr TotRmsAbvGrd 
#>            0            0            0            0            0            0 
#>   Fireplaces   GarageType   GarageCars   GarageArea   PavedDrive  OpenPorchSF 
#>            0           81            0            0            0            0 
#>     SaleType    SalePrice 
#>            0            0


- Treatment Missing Values.
The treatments for missing values are to drop the rows with a missing GarageType and to replace the missing values of LotFrontage with the LotFrontage median.

# Drop the rows where GarageType is missing
house_new <- house_new[!is.na(house_new$GarageType), ]

# Impute NA with the median of LotFrontage (= 70)
house_new$LotFrontage[is.na(house_new$LotFrontage)] <- median(house_new$LotFrontage, na.rm = TRUE)


- Recheck Missing Values.
Recheck each variable to ensure that no missing values are left.

colSums(is.na(house_new))
#>   MSSubClass     MSZoning  LotFrontage      LotArea       Street    LotConfig 
#>            0            0            0            0            0            0 
#>   Condition1   Condition2     BldgType   HouseStyle  OverallQual  OverallCond 
#>            0            0            0            0            0            0 
#>    YearBuilt YearRemodAdd  Exterior1st    ExterQual    X1stFlrSF    X2ndFlrSF 
#>            0            0            0            0            0            0 
#>    GrLivArea BsmtFullBath     FullBath     HalfBath BedroomAbvGr TotRmsAbvGrd 
#>            0            0            0            0            0            0 
#>   Fireplaces   GarageType   GarageCars   GarageArea   PavedDrive  OpenPorchSF 
#>            0            0            0            0            0            0 
#>     SaleType    SalePrice 
#>            0            0
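
The check, treat, and recheck sequence above can be sketched end to end on a toy data frame. The values below are hypothetical; only the two treatments (dropping rows with a missing GarageType, median imputation for LotFrontage) mirror the steps in this section.

```r
# Toy data (hypothetical values) with missing entries in both columns
toy <- data.frame(
  LotFrontage = c(65, NA, 80, 70, NA),
  GarageType  = c("Attchd", "Detchd", NA, "Attchd", "BuiltIn"),
  stringsAsFactors = FALSE
)

# Drop the rows (not the column) where GarageType is missing
toy <- toy[!is.na(toy$GarageType), ]

# Impute missing LotFrontage with the median of the observed values
med <- median(toy$LotFrontage, na.rm = TRUE)
toy$LotFrontage[is.na(toy$LotFrontage)] <- med

# Recheck: every column should now report zero missing values
colSums(is.na(toy))
```

Because the rows are dropped first, the median is computed on the remaining rows only, which matches the order of operations used above.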

4.5 Check Duplicate Data


Check whether the data contains any duplicated rows.

house_new %>% 
  duplicated() %>% 
  sum()
#> [1] 0

5 EDA

5.1 Data Status


The summary() function is used to inspect the five-number summary and the mean of each numeric variable to understand its distribution, to see the number of observations in each level of each factor variable, and to find out whether any variable has outliers.

summary(house_new)
#>    MSSubClass        MSZoning     LotFrontage        LotArea        Street    
#>  Min.   : 20.00   C (all):   8   Min.   : 21.00   Min.   :  1300   Grvl:   5  
#>  1st Qu.: 20.00   FV     :  65   1st Qu.: 60.00   1st Qu.:  7741   Pave:1374  
#>  Median : 50.00   RH     :  12   Median : 70.00   Median :  9591              
#>  Mean   : 56.02   RL     :1101   Mean   : 70.56   Mean   : 10696              
#>  3rd Qu.: 70.00   RM     : 193   3rd Qu.: 79.00   3rd Qu.: 11708              
#>  Max.   :190.00                  Max.   :313.00   Max.   :215245              
#>                                                                               
#>    LotConfig     Condition1     Condition2     BldgType     HouseStyle       
#>  Corner :250   Norm   :1195   Norm   :1365   1Fam  :1166   Length:1379       
#>  CulDSac: 93   Feedr  :  69   Feedr  :   5   2fmCon:  22   Class :character  
#>  FR2    : 44   Artery :  44   Artery :   2   Duplex:  40   Mode  :character  
#>  FR3    :  4   RRAn   :  26   PosN   :   2   Twnhs :  38                     
#>  Inside :988   PosN   :  19   RRNn   :   2   TwnhsE: 113                     
#>                RRAe   :  11   PosA   :   1                                   
#>                (Other):  15   (Other):   2                                   
#>   OverallQual   OverallCond      YearBuilt     YearRemodAdd  Exterior1st       
#>  5      :365   Min.   :2.000   Min.   :1880   Min.   :1950   Length:1379       
#>  6      :362   1st Qu.:5.000   1st Qu.:1955   1st Qu.:1968   Class :character  
#>  7      :318   Median :5.000   Median :1976   Median :1994   Mode  :character  
#>  8      :167   Mean   :5.578   Mean   :1973   Mean   :1985                     
#>  4      : 90   3rd Qu.:6.000   3rd Qu.:2001   3rd Qu.:2004                     
#>  9      : 43   Max.   :9.000   Max.   :2010   Max.   :2010                     
#>  (Other): 34                                                                   
#>  ExterQual   X1stFlrSF      X2ndFlrSF        GrLivArea     BsmtFullBath   
#>  Fa: 52    Min.   : 438   Min.   :   0.0   Min.   : 438   Min.   :0.0000  
#>  TA:  7    1st Qu.: 894   1st Qu.:   0.0   1st Qu.:1154   1st Qu.:0.0000  
#>  Gd:487    Median :1098   Median :   0.0   Median :1479   Median :0.0000  
#>  Ex:833    Mean   :1177   Mean   : 353.4   Mean   :1535   Mean   :0.4307  
#>            3rd Qu.:1414   3rd Qu.: 738.5   3rd Qu.:1790   3rd Qu.:1.0000  
#>            Max.   :4692   Max.   :2065.0   Max.   :5642   Max.   :2.0000  
#>                                                                           
#>     FullBath       HalfBath       BedroomAbvGr    TotRmsAbvGrd   
#>  Min.   :0.00   Min.   :0.0000   Min.   :0.000   Min.   : 3.000  
#>  1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.: 5.000  
#>  Median :2.00   Median :0.0000   Median :3.000   Median : 6.000  
#>  Mean   :1.58   Mean   :0.3959   Mean   :2.865   Mean   : 6.553  
#>  3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.: 7.000  
#>  Max.   :3.00   Max.   :2.0000   Max.   :6.000   Max.   :12.000  
#>                                                                  
#>    Fireplaces      GarageType          GarageCars      GarageArea    
#>  Min.   :0.0000   Length:1379        Min.   :1.000   Min.   : 160.0  
#>  1st Qu.:0.0000   Class :character   1st Qu.:1.000   1st Qu.: 380.0  
#>  Median :1.0000   Mode  :character   Median :2.000   Median : 484.0  
#>  Mean   :0.6418                      Mean   :1.871   Mean   : 500.8  
#>  3rd Qu.:1.0000                      3rd Qu.:2.000   3rd Qu.: 580.0  
#>  Max.   :3.0000                      Max.   :4.000   Max.   :1418.0  
#>                                                                      
#>  PavedDrive  OpenPorchSF       SaleType           SalePrice     
#>  N:  58     Min.   :  0.00   Length:1379        Min.   : 35311  
#>  P:  28     1st Qu.:  0.00   Class :character   1st Qu.:134000  
#>  Y:1293     Median : 27.00   Mode  :character   Median :167500  
#>             Mean   : 47.28                      Mean   :185480  
#>             3rd Qu.: 69.50                      3rd Qu.:217750  
#>             Max.   :547.00                      Max.   :755000  
#> 


Insight:
- There are many outliers in the numeric variables.


5.2 Outlier Treatment


In this step, boxplots are used to visualize the distribution of each numeric variable and to define the outlier boundaries. Observations beyond those boundaries are removed.

1. SalePrice Variable

# Visualization before treatment
boxplot(house_new$SalePrice,horizontal = T)

# Treatment
house_new_outlier <- house_new[!(house_new$SalePrice>300000),]

# Visualization after treatment
boxplot(house_new_outlier$SalePrice,horizontal = T)

2. LotArea Variable

# Visualization
boxplot(house_new_outlier$LotArea,horizontal = T)

# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$LotArea>15600),]

# Visualization after treatment
boxplot(house_new_outlier$LotArea,horizontal = T)

3. First Floor Variable

# Visualization
boxplot(house_new_outlier$X1stFlrSF,horizontal = T)

# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$X1stFlrSF>2000),]

# Visualization after treatment
boxplot(house_new_outlier$X1stFlrSF,horizontal = T)

4. Second Floor Variable

# Visualization
boxplot(house_new_outlier$X2ndFlrSF,horizontal = T)


# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$X2ndFlrSF>1700),]

# Visualization after treatment
boxplot(house_new_outlier$X2ndFlrSF,horizontal = T)

5. GrLivArea Variable

# Visualization
boxplot(house_new_outlier$GrLivArea,horizontal = T)

# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$GrLivArea>2550),]

# Visualization after treatment
boxplot(house_new_outlier$GrLivArea,horizontal = T)

6. GarageArea Variable

# Visualization
boxplot(house_new_outlier$GarageArea,horizontal = T)

# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$GarageArea>850),]

# Visualization after treatment
boxplot(house_new_outlier$GarageArea,horizontal = T)

7. OpenPorchSF Variable

# Visualization
boxplot(house_new_outlier$OpenPorchSF,horizontal = T)

# Treatment
house_new_outlier <- house_new_outlier[!(house_new_outlier$OpenPorchSF>115),]

# Visualization after treatment
boxplot(house_new_outlier$OpenPorchSF,horizontal = T)
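
The cut-offs above are read off the boxplots by hand. As an alternative, a boundary can be computed with the common Q3 + 1.5 × IQR rule, which is similar to the whisker rule boxplot() draws. The sketch below is an assumption about how such a boundary could be derived, not the method used in this analysis; `upper_fence` and the toy prices are hypothetical.

```r
# Hypothetical helper: upper outlier boundary, Q3 + 1.5 * IQR
upper_fence <- function(x) {
  q3 <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  q3 + 1.5 * IQR(x, na.rm = TRUE)
}

# Toy prices (hypothetical), with one value far above the rest
price <- c(100, 120, 130, 140, 150, 160, 170, 180, 200, 900)

limit <- upper_fence(price)      # boundary computed from the data
kept  <- price[price <= limit]   # observations inside the boundary
```

A computed boundary makes the treatment reproducible when the data changes, at the cost of less manual control than a hand-picked threshold.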


5.3 Check Correlation


The ggcorr function from the GGally package is used to measure the strength of correlation among the numeric variables.

# Convert OverallQual back to integer for the correlation plot.
# as.character() comes first so the original ratings are recovered instead of the factor's internal codes.
house_new1 <- house_new %>% 
  mutate(OverallQual = as.integer(as.character(OverallQual)))

# Visualization of correlation

ggcorr(house_new1,label = T,label_round = 2, nbreaks = 4, palette = "RdGy",
       label_size = 3.5, label_color = "white",hjust = 0.8,layout.exp = 4)


Insight:
- There are nine numeric variables with a strong correlation (above 0.51).
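
The variables behind such a count can also be listed numerically rather than read off the plot. The sketch below assumes the correlations of interest are those between each numeric predictor and SalePrice, and it uses a toy data frame with hypothetical values; the 0.51 cut-off is the one quoted in the insight above.

```r
# Toy numeric data (hypothetical values)
toy <- data.frame(
  GrLivArea   = c(1000, 1200, 1500, 1800, 2100, 2500),
  GarageArea  = c(300, 280, 450, 500, 520, 700),
  OverallCond = c(5, 7, 5, 6, 5, 6),
  SalePrice   = c(120000, 135000, 170000, 200000, 230000, 290000)
)

# Pearson correlation of every predictor with the target
cors <- cor(toy[, names(toy) != "SalePrice"], toy$SalePrice)[, 1]

# Predictors whose absolute correlation exceeds the 0.51 cut-off
strong <- names(cors)[abs(cors) > 0.51]
print(strong)
```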


6 Model

6.1 Business Problem


Predicting the value of the SalePrice variable based on all or some of the predictor variables.

  • Target variable (y): SalePrice variable.
  • Predictor variable (x): All or some of the variables other than the price column.

Note: It is required to find which predictor variable will produce the best linear regression model.


6.2 Cross Validation


In the cross validation step, the data is divided into train data (80% of the original data) and test data (20% of the original data), which serves as unseen data.

1. Without Outlier Treatment.

# Stratified random sampling method

# Set the seed so the random split is reproducible
set.seed(1230)


# Determine the indices for train and test
splitted <- initial_split(data = house_new,
                          prop = 0.80,
                          strata = "Condition2")

# Extract the train data
house_train <- training(splitted)

# Extract the test data
house_test <- testing(splitted)

house_train
house_test


2. With Outlier Treatment.

# Stratified random sampling method

# Set the seed so the random split is reproducible
RNGkind(sample.kind = "Rounding")
set.seed(99)


# Determine the indices for train and test
splitted <- initial_split(data = house_new_outlier,
                          prop = 0.80,
                          strata = "Condition2")

# Extract the train data
house_train_outlier <- training(splitted)

# Extract the test data
house_test_outlier <- testing(splitted)

house_train_outlier
house_test_outlier
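
For readers who want to see what the 80/20 split does under the hood, here is a minimal base-R sketch. It is a simplification: it draws a simple random sample and does not stratify, unlike the initial_split() calls above, and the `toy` data frame is hypothetical.

```r
# Toy data (hypothetical): 100 rows standing in for house_new
set.seed(1230)
toy <- data.frame(id = 1:100, SalePrice = round(runif(100, 50000, 300000)))

# Draw 80% of the row indices at random for training
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.80 * nrow(toy)))

toy_train <- toy[train_idx, ]   # 80% used to fit the model
toy_test  <- toy[-train_idx, ]  # remaining 20% kept as unseen data
```

Every row ends up in exactly one of the two sets, so the test set stays unseen during model fitting.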

6.3 Build Model

In this section, four models are built, classified by how their predictors are selected: all predictors, business / domain knowledge, feature selection, and the step-wise method. The Adjusted R-squared represents the goodness of fit of a model, and the stars indicate how significant each predictor variable is to the target variable.

6.3.1 Model - All Predictors


The model_all model involves all predictors. The lm function is used to build the linear regression model and the summary function to see the model performance.

model_all <- lm(formula = SalePrice ~.,
                data = house_train)
summary(model_all)
#> 
#> Call:
#> lm(formula = SalePrice ~ ., data = house_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -415645  -13291    -385   11367  190662 
#> 
#> Coefficients:
#>                        Estimate   Std. Error t value          Pr(>|t|)    
#> (Intercept)        -594346.1722  191705.7709  -3.100          0.001987 ** 
#> MSSubClass             -17.5858     110.1381  -0.160          0.873173    
#> MSZoningFV           20433.7681   15152.7743   1.349          0.177794    
#> MSZoningRH           17836.1492   17147.9142   1.040          0.298526    
#> MSZoningRL           21250.9990   14326.8909   1.483          0.138308    
#> MSZoningRM           15732.6801   14377.3248   1.094          0.274097    
#> LotFrontage           -169.8132      57.5795  -2.949          0.003259 ** 
#> LotArea                  0.5837       0.1576   3.703          0.000225 ***
#> StreetPave           27171.6219   21646.4151   1.255          0.209678    
#> LotConfigCulDSac      8577.0974    4480.6661   1.914          0.055870 .  
#> LotConfigFR2         -2884.1299    5894.8654  -0.489          0.624763    
#> LotConfigFR3        -26113.8205   25402.3744  -1.028          0.304192    
#> LotConfigInside        893.8776    2578.8792   0.347          0.728953    
#> Condition1Feedr      -3175.2640    7114.3194  -0.446          0.655461    
#> Condition1Norm       16595.2310    5662.8086   2.931          0.003459 ** 
#> Condition1PosA       23622.5083   14318.9643   1.650          0.099307 .  
#> Condition1PosN       11763.3182   10000.8452   1.176          0.239779    
#> Condition1RRAe       -5221.6671   12314.3448  -0.424          0.671633    
#> Condition1RRAn       11834.3216   10126.3029   1.169          0.242811    
#> Condition1RRNe       -3089.6544   31814.4231  -0.097          0.922654    
#> Condition1RRNn       15695.0836   16749.7724   0.937          0.348965    
#> Condition2Feedr      -1505.9231   34814.2031  -0.043          0.965506    
#> Condition2Norm       -6231.5881   26774.4005  -0.233          0.816007    
#> Condition2RRAe      -16555.6116   43353.7651  -0.382          0.702636    
#> Condition2RRAn        -759.4410   41968.2820  -0.018          0.985566    
#> Condition2RRNn       -4098.1940   41662.2233  -0.098          0.921660    
#> BldgType2fmCon      -15574.3401   17145.3888  -0.908          0.363900    
#> BldgTypeDuplex      -21517.3356    8598.3512  -2.502          0.012489 *  
#> BldgTypeTwnhs       -19706.1523   12977.4186  -1.518          0.129202    
#> BldgTypeTwnhsE      -17073.5061   11724.5764  -1.456          0.145643    
#> HouseStyle1.5Unf     17885.2634   13425.0519   1.332          0.183084    
#> HouseStyle1Story     21803.1108    5878.9360   3.709          0.000220 ***
#> HouseStyle2.5Fin    -37150.4728   17728.6785  -2.096          0.036374 *  
#> HouseStyle2.5Unf     -4209.8782   15311.7199  -0.275          0.783415    
#> HouseStyle2Story    -11088.9031    4822.3152  -2.299          0.021680 *  
#> HouseStyleSFoyer     14856.3610    8511.6942   1.745          0.081218 .  
#> HouseStyleSLvl       13698.4419    7191.4361   1.905          0.057086 .  
#> OverallQual3         14516.0978   26416.6602   0.550          0.582780    
#> OverallQual4         11160.5651   23923.5341   0.467          0.640951    
#> OverallQual5         18650.9367   24080.2091   0.775          0.438796    
#> OverallQual6         24274.2646   24209.9720   1.003          0.316267    
#> OverallQual7         35795.3655   24390.3637   1.468          0.142523    
#> OverallQual8         71909.3214   24661.8990   2.916          0.003626 ** 
#> OverallQual9        139605.0303   25353.5318   5.506 0.000000046447180 ***
#> OverallQual10       190322.5510   27223.8806   6.991 0.000000000004948 ***
#> OverallCond           7235.3226    1160.5670   6.234 0.000000000665267 ***
#> YearBuilt              308.5308      83.6556   3.688          0.000238 ***
#> YearRemodAdd           -55.5534      77.3254  -0.718          0.472654    
#> Exterior1stBrkComm  -15422.2649   34731.1658  -0.444          0.657104    
#> Exterior1stBrkFace   20464.0459   11737.7091   1.743          0.081560 .  
#> Exterior1stCBlock      942.6757   39369.3844   0.024          0.980902    
#> Exterior1stCemntBd   14414.4189   11589.4544   1.244          0.213878    
#> Exterior1stHdBoard    3365.9676   10767.3196   0.313          0.754642    
#> Exterior1stImStucc   -3815.4144   32881.9252  -0.116          0.907649    
#> Exterior1stMetalSd    7251.2136   10565.1329   0.686          0.492660    
#> Exterior1stPlywood    2566.5134   11246.0993   0.228          0.819526    
#> Exterior1stStone      4014.4481   24796.3396   0.162          0.871419    
#> Exterior1stStucco   -31789.1961   13019.4544  -2.442          0.014790 *  
#> Exterior1stVinylSd    5360.7418   10727.3905   0.500          0.617378    
#> Exterior1stWd Sdng    5334.5480   10619.1052   0.502          0.615528    
#> Exterior1stWdShing   -4606.6128   12615.1979  -0.365          0.715065    
#> ExterQualTA          -1842.6563   22426.0497  -0.082          0.934531    
#> ExterQualGd         -12772.8839    7008.0476  -1.823          0.068659 .  
#> ExterQualEx         -20412.0775    7672.4081  -2.660          0.007927 ** 
#> X1stFlrSF               -2.8886      28.4998  -0.101          0.919288    
#> X2ndFlrSF               25.1131      27.6304   0.909          0.363623    
#> GrLivArea               55.5959      28.4411   1.955          0.050885 .  
#> BsmtFullBath         15254.3875    2084.2501   7.319 0.000000000000509 ***
#> FullBath              5808.7673    3113.3075   1.866          0.062360 .  
#> HalfBath              3570.3219    2984.5635   1.196          0.231874    
#> BedroomAbvGr         -4876.7072    1972.3070  -2.473          0.013577 *  
#> TotRmsAbvGrd           743.5013    1327.9429   0.560          0.575679    
#> Fireplaces            6107.2338    1859.9431   3.284          0.001060 ** 
#> GarageTypeAttchd     19858.9326   15858.2004   1.252          0.210757    
#> GarageTypeBasment    14238.5164   17768.1677   0.801          0.423117    
#> GarageTypeBuiltIn    16897.2057   16540.6000   1.022          0.307234    
#> GarageTypeCarPort     5500.7329   19584.2208   0.281          0.778863    
#> GarageTypeDetchd     19627.4108   15660.0358   1.253          0.210370    
#> GarageCars           16560.1245    3159.7165   5.241 0.000000194328272 ***
#> GarageArea              -7.2699      10.4101  -0.698          0.485118    
#> PavedDriveP           -751.6286    8032.4585  -0.094          0.925466    
#> PavedDriveY           2644.6623    5665.5244   0.467          0.640744    
#> OpenPorchSF              5.8766      17.1008   0.344          0.731184    
#> SaleTypeCon          38531.1726   23060.5633   1.671          0.095057 .  
#> SaleTypeConLD         7546.5763   14045.7974   0.537          0.591190    
#> SaleTypeConLI          -83.3702   17217.4689  -0.005          0.996137    
#> SaleTypeConLw        -1606.9659   19816.0967  -0.081          0.935383    
#> SaleTypeCWD          24391.1574   16915.7097   1.442          0.149633    
#> SaleTypeNew          26975.2006    6948.4937   3.882          0.000110 ***
#> SaleTypeOth          25857.0486   31648.1974   0.817          0.414112    
#> SaleTypeWD            8179.2117    5774.9202   1.416          0.156986    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 30730 on 1012 degrees of freedom
#> Multiple R-squared:  0.8643, Adjusted R-squared:  0.8522 
#> F-statistic: 71.62 on 90 and 1012 DF,  p-value: < 0.00000000000000022


Insight of model_all :
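- OverallQual9 and OverallQual10 (OverallQual rates the overall material and finish of the house) are highly significant in model_all (***), as are BsmtFullBath, OverallCond, and GarageCars, which have similarly small p-values; the Intercept is significant at the ** level.
- The Adjusted R-squared, which represents the goodness of fit of the model, is 0.8522, meaning that 85.22% of the variance in SalePrice is explained by the model after adjusting for the number of predictors.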
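
The Adjusted R-squared can be reproduced from the other figures in the summary output using the standard formula 1 - (1 - R²)(n - 1) / (n - p - 1). The sketch below plugs in the values printed above; the variable names are only illustrative.

```r
# Values read from the summary(model_all) output
r2     <- 0.8643          # Multiple R-squared
p      <- 90              # model degrees of freedom from the F-statistic
df_res <- 1012            # residual degrees of freedom (n - p - 1)
n      <- df_res + p + 1  # observations used in the fit: 1103

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res
round(adj_r2, 4)  # 0.8522
```

The gap between 0.8643 and 0.8522 is the penalty for fitting 90 coefficients on 1,103 observations.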


The assumption tests show that several predictors in model_all are strongly correlated with each other. Such predictors are redundant, so only one variable from each strongly related group should be kept. The lm function is then used to build a linear regression model with the selected predictors, and the summary function to see the model performance.

# Drop several variables due to multicollinearity issues.
model_all_nomulti <- lm(formula = SalePrice ~ . - HouseStyle - X2ndFlrSF - MSSubClass - BldgType - OverallCond - Exterior1st - ExterQual,
                data = house_train)
summary(model_all_nomulti)
#> 
#> Call:
#> lm(formula = SalePrice ~ . - HouseStyle - X2ndFlrSF - MSSubClass - 
#>     BldgType - OverallCond - Exterior1st - ExterQual, data = house_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -480835  -14568    -432   12821  223071 
#> 
#> Coefficients:
#>                       Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)       -602581.5105  167983.9710  -3.587             0.000350 ***
#> MSZoningFV          11593.8122   15858.2693   0.731             0.464889    
#> MSZoningRH           9712.1398   17971.9238   0.540             0.589032    
#> MSZoningRL          17160.1661   14988.9747   1.145             0.252533    
#> MSZoningRM           3828.7905   15015.9038   0.255             0.798787    
#> LotFrontage           -83.5639      57.7586  -1.447             0.148260    
#> LotArea                 0.4757       0.1570   3.030             0.002506 ** 
#> StreetPave          12888.2506   18852.6734   0.684             0.494361    
#> LotConfigCulDSac    11159.6848    4711.9989   2.368             0.018049 *  
#> LotConfigFR2        -2312.1381    6279.0163  -0.368             0.712775    
#> LotConfigFR3       -30367.9034   26998.5086  -1.125             0.260933    
#> LotConfigInside      -372.6487    2738.3679  -0.136             0.891781    
#> Condition1Feedr     -2351.7683    7443.1791  -0.316             0.752094    
#> Condition1Norm      16911.6231    5882.7125   2.875             0.004125 ** 
#> Condition1PosA      24539.3091   15047.1488   1.631             0.103229    
#> Condition1PosN      13718.5455   10493.6759   1.307             0.191394    
#> Condition1RRAe       2410.8062   13017.4473   0.185             0.853110    
#> Condition1RRAn       8030.7798   10722.8639   0.749             0.454063    
#> Condition1RRNe       1503.6648   33949.2808   0.044             0.964681    
#> Condition1RRNn      16079.8879   17764.2904   0.905             0.365579    
#> Condition2Feedr     18768.6079   34461.4484   0.545             0.586127    
#> Condition2Norm      11931.4388   25596.2088   0.466             0.641212    
#> Condition2RRAe      -5329.6477   43169.0898  -0.123             0.901767    
#> Condition2RRAn       4198.6815   42424.5267   0.099             0.921183    
#> Condition2RRNn      33050.1157   42518.5706   0.777             0.437152    
#> OverallQual3         1442.3408   27210.2746   0.053             0.957736    
#> OverallQual4        15862.0420   24842.1864   0.639             0.523281    
#> OverallQual5        26858.2580   24907.1903   1.078             0.281135    
#> OverallQual6        32311.5780   24977.3058   1.294             0.196077    
#> OverallQual7        46630.2699   25157.7675   1.854             0.064091 .  
#> OverallQual8        85455.4552   25411.1203   3.363             0.000799 ***
#> OverallQual9       156924.9798   25997.6188   6.036   0.0000000021918765 ***
#> OverallQual10      214083.0480   27826.1183   7.694   0.0000000000000331 ***
#> YearBuilt              64.9826      69.7059   0.932             0.351429    
#> YearRemodAdd          194.6838      70.6606   2.755             0.005968 ** 
#> X1stFlrSF               5.5491       4.7953   1.157             0.247457    
#> GrLivArea              50.1192       5.5883   8.969 < 0.0000000000000002 ***
#> BsmtFullBath        13260.0181    2176.3604   6.093   0.0000000015597503 ***
#> FullBath             3127.6154    3276.8625   0.954             0.340075    
#> HalfBath             -242.5368    3080.3211  -0.079             0.937257    
#> BedroomAbvGr        -2935.5775    2026.5075  -1.449             0.147753    
#> TotRmsAbvGrd         -224.5892    1365.9136  -0.164             0.869429    
#> Fireplaces           6763.4803    1958.5690   3.453             0.000576 ***
#> GarageTypeAttchd    45070.5371   16218.8586   2.779             0.005552 ** 
#> GarageTypeBasment   25947.2446   18231.4251   1.423             0.154973    
#> GarageTypeBuiltIn   50289.2951   16910.7318   2.974             0.003009 ** 
#> GarageTypeCarPort   22460.7387   19647.1110   1.143             0.253215    
#> GarageTypeDetchd    39635.1201   16087.0406   2.464             0.013908 *  
#> GarageCars          15234.1616    3320.0665   4.589   0.0000050071189558 ***
#> GarageArea             -0.8073      10.9888  -0.073             0.941451    
#> PavedDriveP         -1243.3405    8495.3630  -0.146             0.883669    
#> PavedDriveY          2702.1394    5857.0642   0.461             0.644646    
#> OpenPorchSF            25.2194      17.9364   1.406             0.160010    
#> SaleTypeCon         40131.9640   24442.8876   1.642             0.100919    
#> SaleTypeConLD        2289.1457   14259.7284   0.161             0.872493    
#> SaleTypeConLI       -5265.5360   18304.6720  -0.288             0.773664    
#> SaleTypeConLw        5982.1864   20544.5749   0.291             0.770971    
#> SaleTypeCWD         13211.7694   17913.5687   0.738             0.460967    
#> SaleTypeNew         29470.6728    7265.4691   4.056   0.0000535833201582 ***
#> SaleTypeOth         30115.7711   33907.1368   0.888             0.374647    
#> SaleTypeWD           9393.0505    6082.3043   1.544             0.122813    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 33030 on 1042 degrees of freedom
#> Multiple R-squared:  0.8386, Adjusted R-squared:  0.8294 
#> F-statistic: 90.26 on 60 and 1042 DF,  p-value: < 0.00000000000000022

Insight of model_all_nomulti:
- The adjusted R-squared, which measures the goodness of fit of the model, is 0.8294. This means the model explains about 82.94% of the variance in SalePrice, adjusted for the number of predictors.
- OverallQual9 and OverallQual10, levels of OverallQual (which rates the overall material and finish of the house), and the intercept are highly significant in model_all_nomulti, as are GrLivArea and BsmtFullBath.
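
As a worked check of the figures in the summary output above, the adjusted R-squared can be recomputed from the multiple R-squared and the degrees of freedom. This is a minimal sketch: the three input values are copied from the printed output, and the variable names are illustrative only.

```r
# Values read from the summary output of model_all_nomulti above.
r_squared   <- 0.8386   # Multiple R-squared
df_model    <- 60       # numerator degrees of freedom of the F-statistic
df_residual <- 1042     # residual degrees of freedom

# Total degrees of freedom (n - 1) = model df + residual df.
df_total <- df_model + df_residual

# Adjusted R-squared penalises R-squared for the number of estimated terms.
adj_r_squared <- 1 - (1 - r_squared) * df_total / df_residual

round(adj_r_squared, 3)
```

The result agrees with the reported 0.8294 up to the rounding of the printed inputs. Because the penalty grows with the number of terms, adjusted R-squared is the fairer figure when comparing models with different numbers of predictors.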


The model model_all_outlier is fitted on data with outlier treatment applied, although outliers can sometimes be beneficial for a model. Whether the outliers are useful here can be judged after the model comparison with the function compare_performance(). The lm function is then used to build the linear regression model with the selected predictors, and the summary function to inspect model performance.

# Model with Outlier Treatment
model_all_outlier <- lm(formula = SalePrice ~.,
                data = house_train_outlier)

# Drop several variables due to multicollinearity issues.
model_all_outlier_nomulti <- lm(formula = SalePrice ~ . - HouseStyle - X2ndFlrSF - BldgType - OverallCond - Exterior1st,
                data = house_train_outlier)

summary(model_all_outlier_nomulti)
#> 
#> Call:
#> lm(formula = SalePrice ~ . - HouseStyle - X2ndFlrSF - BldgType - 
#>     OverallCond - Exterior1st, data = house_train_outlier)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -100352   -9992     379   10158   59005 
#> 
#> Coefficients:
#>                       Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)       -834963.2207  114749.8757  -7.276    0.000000000000881 ***
#> MSSubClass           -103.7956      22.5957  -4.594    0.000005124313814 ***
#> MSZoningFV          22633.5428    9384.8555   2.412             0.016122 *  
#> MSZoningRH          21531.5988   10769.7897   1.999             0.045948 *  
#> MSZoningRL          24212.4378    8149.0328   2.971             0.003063 ** 
#> MSZoningRM          16290.4253    8107.6274   2.009             0.044873 *  
#> LotFrontage            13.0998      52.9200   0.248             0.804560    
#> LotArea                 1.0965       0.3568   3.073             0.002196 ** 
#> StreetPave          27254.6324   19473.4230   1.400             0.162060    
#> LotConfigCulDSac     7129.1412    3437.7392   2.074             0.038446 *  
#> LotConfigFR2         -781.7392    4183.9549  -0.187             0.851836    
#> LotConfigFR3        11705.1970   19079.9425   0.613             0.539747    
#> LotConfigInside      1306.0901    1903.3442   0.686             0.492798    
#> Condition1Feedr      1353.1374    5320.1630   0.254             0.799303    
#> Condition1Norm       6585.8952    4591.9282   1.434             0.151930    
#> Condition1PosA       5457.3425   13940.6033   0.391             0.695562    
#> Condition1PosN      12670.9877    8559.9566   1.480             0.139231    
#> Condition1RRAe     -19025.1436    8033.4867  -2.368             0.018131 *  
#> Condition1RRAn      -2060.1518    7083.5675  -0.291             0.771259    
#> Condition1RRNe      11690.4429   19178.2878   0.610             0.542337    
#> Condition1RRNn      42400.4886   16897.2924   2.509             0.012311 *  
#> Condition2Feedr     -7969.8174   19554.0250  -0.408             0.683701    
#> Condition2Norm      -5874.9931   15240.1214  -0.385             0.699982    
#> Condition2RRAn     -21795.3197   24219.0778  -0.900             0.368455    
#> Condition2RRNn       1801.1657   24638.1712   0.073             0.941743    
#> OverallQual3          -93.1707   15299.7548  -0.006             0.995143    
#> OverallQual4        18809.9212   13907.0281   1.353             0.176616    
#> OverallQual5        27304.1010   13955.3087   1.957             0.050780 .  
#> OverallQual6        36559.8307   14021.9152   2.607             0.009310 ** 
#> OverallQual7        46537.2116   14214.0974   3.274             0.001110 ** 
#> OverallQual8        73950.4921   14471.1476   5.110    0.000000410530664 ***
#> OverallQual9       100226.5966   25085.3984   3.995    0.000071075973398 ***
#> YearBuilt             188.3124      47.8204   3.938    0.000090002118209 ***
#> YearRemodAdd          218.0644      44.9081   4.856    0.000001465152692 ***
#> ExterQualTA        -22827.3985   13799.7416  -1.654             0.098515 .  
#> ExterQualGd        -15180.8660    9582.8042  -1.584             0.113583    
#> ExterQualEx        -21027.0218    9491.8210  -2.215             0.027046 *  
#> X1stFlrSF              11.1426       3.5642   3.126             0.001840 ** 
#> GrLivArea              48.1948       4.3220  11.151 < 0.0000000000000002 ***
#> BsmtFullBath         9887.6527    1472.4897   6.715    0.000000000037631 ***
#> FullBath             2047.1349    2193.8708   0.933             0.351066    
#> HalfBath             1839.3413    1998.6116   0.920             0.357713    
#> BedroomAbvGr        -3125.9904    1410.9372  -2.216             0.027028 *  
#> TotRmsAbvGrd        -1655.3771     975.2597  -1.697             0.090049 .  
#> Fireplaces           5495.6729    1272.0639   4.320    0.000017722640509 ***
#> GarageTypeAttchd    23189.6827   11092.5479   2.091             0.036910 *  
#> GarageTypeBasment   18879.7536   12434.9605   1.518             0.129373    
#> GarageTypeBuiltIn   23891.8056   11637.5846   2.053             0.040427 *  
#> GarageTypeCarPort    8090.6653   13703.6027   0.590             0.555101    
#> GarageTypeDetchd    23520.9531   11022.1341   2.134             0.033175 *  
#> GarageCars           1991.4255    2340.9373   0.851             0.395215    
#> GarageArea             19.2540       8.7326   2.205             0.027773 *  
#> PavedDriveP          5410.8376    6083.7656   0.889             0.374084    
#> PavedDriveY          3437.7019    3706.6489   0.927             0.354001    
#> OpenPorchSF            45.6289      23.6919   1.926             0.054499 .  
#> SaleTypeCon         39623.8272   19500.7714   2.032             0.042521 *  
#> SaleTypeConLD       11966.9582   10221.5081   1.171             0.242073    
#> SaleTypeConLI       -3575.3449   11676.3333  -0.306             0.759536    
#> SaleTypeConLw        9770.7908   11768.2631   0.830             0.406658    
#> SaleTypeCWD         46031.5369   19132.8369   2.406             0.016379 *  
#> SaleTypeNew         16856.1648    5057.5016   3.333             0.000902 ***
#> SaleTypeWD           8238.7698    3922.7058   2.100             0.036044 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 18330 on 736 degrees of freedom
#> Multiple R-squared:  0.8584, Adjusted R-squared:  0.8467 
#> F-statistic: 73.17 on 61 and 736 DF,  p-value: < 0.00000000000000022

Insight of model_all_outlier_nomulti:
- The adjusted R-squared, which measures the goodness of fit of the model, is 0.8467 in the summary above. This means the model explains about 84.67% of the variance in SalePrice, adjusted for the number of predictors.
- GrLivArea, the intercept, and BsmtFullBath have the smallest p-values in this model. Among the levels of OverallQual, which rates the overall material and finish of the house, OverallQual8 and OverallQual9 are the most significant; OverallQual10 does not appear in the coefficient table of this model.
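
The models above drop several variables because of multicollinearity. The report does not show how those variables were identified, so the following is only a sketch of one common diagnostic, the variance inflation factor (VIF), computed by hand in base R. The built-in mtcars data and the chosen predictor names stand in for house_train and are assumptions for illustration.

```r
# Illustrative data only: mtcars stands in for house_train.
predictors <- c("disp", "hp", "wt", "drat")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all the other predictors.
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(fit)$r.squared)
})

# Larger values indicate a predictor that is well explained by the others.
round(sort(vif_manual, decreasing = TRUE), 2)
```

This hand-rolled version only handles numeric predictors. For factor predictors such as HouseStyle or BldgType, a generalized VIF is needed, which is what car::vif reports for terms with more than one degree of freedom.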


6.3.2 Model - Based on Business Perspective


The model model_bob takes a business perspective in choosing which variables enter the model as predictors.

model_bob <- lm(formula = SalePrice ~ . - OpenPorchSF - HalfBath - GarageArea - MSSubClass - TotRmsAbvGrd - HouseStyle - X2ndFlrSF - BldgType - OverallCond - Exterior1st - ExterQual,
                data = house_train)
summary(model_bob)
#> 
#> Call:
#> lm(formula = SalePrice ~ . - OpenPorchSF - HalfBath - GarageArea - 
#>     MSSubClass - TotRmsAbvGrd - HouseStyle - X2ndFlrSF - BldgType - 
#>     OverallCond - Exterior1st - ExterQual, data = house_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -480124  -15086    -295   13292  220950 
#> 
#> Coefficients:
#>                       Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)       -615205.1363  163415.8103  -3.765             0.000176 ***
#> MSZoningFV          13076.1073   15714.7262   0.832             0.405547    
#> MSZoningRH           9440.4067   17908.8558   0.527             0.598211    
#> MSZoningRL          17266.1127   14899.5672   1.159             0.246789    
#> MSZoningRM           4255.8347   14909.7024   0.285             0.775363    
#> LotFrontage           -80.5513      57.2302  -1.407             0.159577    
#> LotArea                 0.4754       0.1564   3.040             0.002428 ** 
#> StreetPave          12077.5804   18798.6010   0.642             0.520707    
#> LotConfigCulDSac    11045.7455    4703.4308   2.348             0.019037 *  
#> LotConfigFR2        -2090.0374    6259.5698  -0.334             0.738526    
#> LotConfigFR3       -30481.8285   26970.1690  -1.130             0.258649    
#> LotConfigInside      -560.6366    2728.6531  -0.205             0.837251    
#> Condition1Feedr     -2367.8882    7399.9532  -0.320             0.749042    
#> Condition1Norm      17035.3780    5852.2563   2.911             0.003680 ** 
#> Condition1PosA      24064.5973   15004.9030   1.604             0.109064    
#> Condition1PosN      14630.5312   10439.4476   1.401             0.161371    
#> Condition1RRAe       2777.6774   12951.6394   0.214             0.830226    
#> Condition1RRAn       9420.4309   10618.1641   0.887             0.375175    
#> Condition1RRNe       2125.9221   33872.5450   0.063             0.949968    
#> Condition1RRNn      15973.0407   17704.1972   0.902             0.367149    
#> Condition2Feedr     17520.7624   34328.0850   0.510             0.609885    
#> Condition2Norm      11762.8144   25540.9607   0.461             0.645219    
#> Condition2RRAe      -7465.4676   42823.7206  -0.174             0.861640    
#> Condition2RRAn       3006.2708   42366.0028   0.071             0.943444    
#> Condition2RRNn      33882.9787   42364.7257   0.800             0.424013    
#> OverallQual3         1381.7739   27174.8319   0.051             0.959457    
#> OverallQual4        15643.5858   24813.2198   0.630             0.528536    
#> OverallQual5        26546.1387   24875.0676   1.067             0.286137    
#> OverallQual6        32155.9573   24950.1293   1.289             0.197749    
#> OverallQual7        46390.1047   25130.4510   1.846             0.065179 .  
#> OverallQual8        85395.7690   25374.0698   3.365             0.000792 ***
#> OverallQual9       156506.8133   25942.8516   6.033   0.0000000022338611 ***
#> OverallQual10      214124.5684   27791.4553   7.705   0.0000000000000304 ***
#> YearBuilt              69.6097      66.6149   1.045             0.296284    
#> YearRemodAdd          196.7529      70.5542   2.789             0.005388 ** 
#> X1stFlrSF               5.7905       4.0244   1.439             0.150485    
#> GrLivArea              50.2984       4.0109  12.541 < 0.0000000000000002 ***
#> BsmtFullBath        13420.2763    2161.4858   6.209   0.0000000007686819 ***
#> FullBath             3296.8402    2957.1861   1.115             0.265168    
#> BedroomAbvGr        -3102.9855    1799.9205  -1.724             0.085011 .  
#> Fireplaces           6859.9001    1937.1719   3.541             0.000416 ***
#> GarageTypeAttchd    44819.8448   16089.4068   2.786             0.005438 ** 
#> GarageTypeBasment   25634.7808   18107.6412   1.416             0.157164    
#> GarageTypeBuiltIn   49892.0733   16765.6741   2.976             0.002989 ** 
#> GarageTypeCarPort   22448.2759   19498.4254   1.151             0.249877    
#> GarageTypeDetchd    39353.0340   15986.9581   2.462             0.013993 *  
#> GarageCars          14889.1534    2360.2420   6.308   0.0000000004156525 ***
#> PavedDriveP         -1199.9741    8467.2315  -0.142             0.887329    
#> PavedDriveY          2484.5299    5822.9129   0.427             0.669699    
#> SaleTypeCon         38759.4775   24380.5891   1.590             0.112189    
#> SaleTypeConLD        1468.9956   14174.0486   0.104             0.917475    
#> SaleTypeConLI       -4578.1400   18236.5709  -0.251             0.801831    
#> SaleTypeConLw        5613.3793   20474.0631   0.274             0.784008    
#> SaleTypeCWD         12225.9999   17858.0434   0.685             0.493734    
#> SaleTypeNew         29709.9205    7236.4622   4.106   0.0000434708337898 ***
#> SaleTypeOth         29926.3096   33861.1247   0.884             0.377010    
#> SaleTypeWD           9393.6980    6067.2116   1.548             0.121859    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 33000 on 1046 degrees of freedom
#> Multiple R-squared:  0.8383, Adjusted R-squared:  0.8297 
#> F-statistic: 96.86 on 56 and 1046 DF,  p-value: < 0.00000000000000022

Insight of model_bob:
- The adjusted R-squared, which measures the goodness of fit of the model, is 0.8297. This means model_bob explains about 82.97% of the variance in SalePrice, adjusted for the number of predictors.
- OverallQual9 and OverallQual10, levels of OverallQual (which rates the overall material and finish of the house), and the intercept are all significant at the 0.001 level in model_bob; GrLivArea has the smallest p-value.
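
Statements about which terms are most significant can be checked by sorting the coefficient table by p-value instead of reading the stars by eye. The following is a minimal sketch in base R; the model on the built-in mtcars data is a stand-in for the housing models and is an assumption for illustration.

```r
# Illustrative model only: mtcars stands in for house_train.
fit <- lm(mpg ~ wt + hp + qsec + factor(cyl), data = mtcars)

# coef(summary(fit)) is a matrix with the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coef_table <- coef(summary(fit))

# Order the terms from the smallest to the largest p-value.
ranked <- coef_table[order(coef_table[, "Pr(>|t|)"]), ]

head(ranked, 5)
```

Applied to a model such as model_bob, the same two lines would list the terms in order of significance. Note that each factor level (for example OverallQual9) is tested against the baseline level of that factor, not against zero quality.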


6.3.3 Model - Feature Selection


The ggcorr function is used to identify which variables are strongly correlated with the target. Those variables are then used as the predictors for this model.

After identifying the variables with a strong relationship to the target, the lm function is used to build a linear regression model with the predictors selected from the ggcorr result, and the summary function to inspect model performance.

model_fs <- lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + 
     X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd + GarageCars + GarageArea,
   data = house_train)

summary(model_fs)
#> 
#> Call:
#> lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + 
#>     X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd + GarageCars + 
#>     GarageArea, data = house_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -497536  -15795    -562   14621  212560 
#> 
#> Coefficients:
#>                   Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)   -1135297.200   146793.240  -7.734   0.0000000000000237 ***
#> OverallQual3    -11120.600    27810.210  -0.400             0.689327    
#> OverallQual4      2195.598    25490.631   0.086             0.931376    
#> OverallQual5     17929.448    25308.185   0.708             0.478821    
#> OverallQual6     24843.987    25407.735   0.978             0.328385    
#> OverallQual7     42037.997    25653.932   1.639             0.101574    
#> OverallQual8     80810.327    25917.674   3.118             0.001869 ** 
#> OverallQual9    156471.923    26602.255   5.882   0.0000000053948494 ***
#> OverallQual10   205374.403    28579.112   7.186   0.0000000000012364 ***
#> YearBuilt          323.796       56.771   5.704   0.0000000151177135 ***
#> YearRemodAdd       264.053       71.830   3.676             0.000248 ***
#> X1stFlrSF           18.574        3.642   5.099   0.0000004018952094 ***
#> GrLivArea           56.238        4.740  11.866 < 0.0000000000000002 ***
#> FullBath         -3130.896     2953.832  -1.060             0.289406    
#> TotRmsAbvGrd     -1275.484     1234.137  -1.034             0.301599    
#> GarageCars       14171.199     3427.015   4.135   0.0000382080031060 ***
#> GarageArea          -2.935       11.062  -0.265             0.790780    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 35400 on 1086 degrees of freedom
#> Multiple R-squared:  0.8068, Adjusted R-squared:  0.804 
#> F-statistic: 283.5 on 16 and 1086 DF,  p-value: < 0.00000000000000022

Insight of model_fs:
- The adjusted R-squared, which measures the goodness of fit of the model, is 0.804. This means model_fs explains about 80.4% of the variance in SalePrice, adjusted for the number of predictors.
- OverallQual9 and OverallQual10, levels of OverallQual (which rates the overall material and finish of the house), and the intercept are among the most significant terms in model_fs; GrLivArea has the smallest p-value.
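
The correlation-based selection described above can also be done numerically. ggcorr draws a correlation plot, while the sketch below uses base cor() to rank numeric variables by their correlation with the target. The built-in mtcars data, the target mpg, and the 0.7 cut-off are all assumptions for illustration, not values taken from this analysis.

```r
# Illustrative data only: mtcars stands in for house_train, mpg for SalePrice.
target <- "mpg"
numeric_vars <- names(mtcars)[sapply(mtcars, is.numeric)]
candidates <- setdiff(numeric_vars, target)

# Pearson correlation of every numeric candidate with the target.
cor_with_target <- sapply(candidates, function(v) {
  cor(mtcars[[v]], mtcars[[target]], use = "complete.obs")
})

# Keep variables whose absolute correlation exceeds a hypothetical cut-off.
threshold <- 0.7
selected <- names(cor_with_target)[abs(cor_with_target) > threshold]

# Build the model formula from the selected predictors and fit it.
model_sketch <- lm(reformulate(selected, response = target), data = mtcars)
selected
```

Selecting on the absolute value keeps strongly negative correlations as well as positive ones. Pairwise correlation only covers numeric columns, so factor predictors have to be judged separately.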


6.3.4 Model - Step-wise Regression


Step-wise regression can be run in three directions (backward, forward, and both) to find the model with the lowest AIC and to identify significant predictors. The step function is used to build the step-wise regression model, and the summary function to inspect model performance.
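
The three directions can be sketched side by side as follows. This is an illustration on the built-in mtcars data, not the housing data; the names full_model and null_model are assumptions introduced for the example.

```r
# Illustrative data only: mtcars stands in for house_train.
full_model <- lm(mpg ~ ., data = mtcars)   # all predictors
null_model <- lm(mpg ~ 1, data = mtcars)   # intercept only

# Backward: start from the full model and drop terms.
step_backward <- step(full_model, direction = "backward", trace = 0)

# Forward: start from the null model and add terms up to the full formula.
step_forward <- step(null_model,
                     scope = list(lower = formula(null_model),
                                  upper = formula(full_model)),
                     direction = "forward", trace = 0)

# Both: terms may be added or dropped at each step.
step_both <- step(null_model,
                  scope = list(lower = formula(null_model),
                               upper = formula(full_model)),
                  direction = "both", trace = 0)

# Compare the AIC of the three selected models (lower is better).
sapply(list(backward = step_backward, forward = step_forward, both = step_both), AIC)
```

Forward and both-direction searches need a scope argument, because step must know the largest model it is allowed to reach. The three searches do not always end at the same model, which is why comparing their AIC values is worthwhile.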

  • Backward
# Backward
model_backward <- step(object = model_all, direction = "backward", trace = FALSE)

summary(model_backward)
#> 
#> Call:
#> lm(formula = SalePrice ~ LotFrontage + LotArea + Street + Condition1 + 
#>     BldgType + HouseStyle + OverallQual + OverallCond + YearBuilt + 
#>     Exterior1st + ExterQual + X2ndFlrSF + GrLivArea + BsmtFullBath + 
#>     FullBath + HalfBath + BedroomAbvGr + Fireplaces + GarageCars + 
#>     SaleType, data = house_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -422439  -13197    -778   12119  188125 
#> 
#> Coefficients:
#>                        Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)        -786499.0006  140891.2919  -5.582    0.000000030290286 ***
#> LotFrontage           -188.6030      54.5253  -3.459             0.000564 ***
#> LotArea                  0.6668       0.1492   4.469    0.000008739166845 ***
#> StreetPave           43544.0374   19579.6729   2.224             0.026367 *  
#> Condition1Feedr      -4402.9343    6828.6026  -0.645             0.519213    
#> Condition1Norm       16579.0004    5490.8553   3.019             0.002595 ** 
#> Condition1PosA       23687.8673   14143.8842   1.675             0.094279 .  
#> Condition1PosN       14571.4734    9786.0916   1.489             0.136792    
#> Condition1RRAe       -6265.1549   11910.3582  -0.526             0.598983    
#> Condition1RRAn       13976.1861    9515.5049   1.469             0.142195    
#> Condition1RRNe        3998.6221   31453.9067   0.127             0.898865    
#> Condition1RRNn       12315.6384   15311.0932   0.804             0.421373    
#> BldgType2fmCon      -17766.3732    7838.7317  -2.266             0.023627 *  
#> BldgTypeDuplex      -24354.8809    6229.6890  -3.909    0.000098494375329 ***
#> BldgTypeTwnhs       -25351.9189    6655.9721  -3.809             0.000148 ***
#> BldgTypeTwnhsE      -21259.5159    4243.0131  -5.010    0.000000637902494 ***
#> HouseStyle1.5Unf     18886.8962   12557.6952   1.504             0.132883    
#> HouseStyle1Story     22429.1527    4934.0787   4.546    0.000006117341215 ***
#> HouseStyle2.5Fin    -33915.4882   13661.1687  -2.483             0.013199 *  
#> HouseStyle2.5Unf     -5274.3912   14858.2929  -0.355             0.722677    
#> HouseStyle2Story    -11031.6404    4371.9343  -2.523             0.011775 *  
#> HouseStyleSFoyer     14994.9972    7757.7825   1.933             0.053521 .  
#> HouseStyleSLvl       12914.4486    5952.2806   2.170             0.030259 *  
#> OverallQual3         23494.7917   25183.6954   0.933             0.351070    
#> OverallQual4         15545.9581   22799.2523   0.682             0.495478    
#> OverallQual5         25238.7751   22784.7378   1.108             0.268246    
#> OverallQual6         30812.8917   22925.3233   1.344             0.179224    
#> OverallQual7         42332.8739   23121.5056   1.831             0.067404 .  
#> OverallQual8         78379.3239   23411.2185   3.348             0.000843 ***
#> OverallQual9        144883.0769   24128.3746   6.005    0.000000002647875 ***
#> OverallQual10       195718.7099   26109.0195   7.496    0.000000000000141 ***
#> OverallCond           7171.1265    1031.2383   6.954    0.000000000006276 ***
#> YearBuilt              356.9247      71.2067   5.013    0.000000631324648 ***
#> Exterior1stBrkComm  -12970.3959   34210.8847  -0.379             0.704668    
#> Exterior1stBrkFace   25193.1714   11209.7709   2.247             0.024822 *  
#> Exterior1stCBlock    16634.8744   37336.6044   0.446             0.656024    
#> Exterior1stCemntBd   16814.1070   11107.6346   1.514             0.130395    
#> Exterior1stHdBoard    7807.6483   10244.4291   0.762             0.446152    
#> Exterior1stImStucc   -1516.1955   32449.0942  -0.047             0.962741    
#> Exterior1stMetalSd   10687.7551   10048.3098   1.064             0.287741    
#> Exterior1stPlywood    7298.7102   10679.1675   0.683             0.494473    
#> Exterior1stStone      8195.2462   24362.3959   0.336             0.736645    
#> Exterior1stStucco   -28234.3894   12496.7649  -2.259             0.024069 *  
#> Exterior1stVinylSd    8077.2289   10216.8532   0.791             0.429370    
#> Exterior1stWd Sdng    8823.0709   10063.7498   0.877             0.380843    
#> Exterior1stWdShing    -948.8236   12097.5736  -0.078             0.937500    
#> ExterQualTA          -7624.7596   19966.0777  -0.382             0.702624    
#> ExterQualGd         -13835.8114    6872.5674  -2.013             0.044352 *  
#> ExterQualEx         -21263.8868    7540.1616  -2.820             0.004893 ** 
#> X2ndFlrSF               26.6791       7.2021   3.704             0.000223 ***
#> GrLivArea               53.7350       4.7377  11.342 < 0.0000000000000002 ***
#> BsmtFullBath         15206.2459    2042.8101   7.444    0.000000000000205 ***
#> FullBath              5432.3114    2999.1635   1.811             0.070387 .  
#> HalfBath              3978.4012    2889.3962   1.377             0.168841    
#> BedroomAbvGr         -3948.2383    1739.3187  -2.270             0.023412 *  
#> Fireplaces            6537.5602    1801.5365   3.629             0.000299 ***
#> GarageCars           14509.5169    2162.6635   6.709    0.000000000032137 ***
#> SaleTypeCon          41526.6382   22691.2760   1.830             0.067526 .  
#> SaleTypeConLD         6725.4425   13656.6261   0.492             0.622493    
#> SaleTypeConLI         -870.7761   16849.5954  -0.052             0.958794    
#> SaleTypeConLw         1343.6075   19582.5580   0.069             0.945311    
#> SaleTypeCWD          27120.5765   16630.5083   1.631             0.103242    
#> SaleTypeNew          26654.8031    6651.3120   4.007    0.000065762456150 ***
#> SaleTypeOth          28659.5255   31405.1208   0.913             0.361678    
#> SaleTypeWD            8326.1918    5535.5370   1.504             0.132851    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 30600 on 1038 degrees of freedom
#> Multiple R-squared:  0.8621, Adjusted R-squared:  0.8536 
#> F-statistic: 101.4 on 64 and 1038 DF,  p-value: < 0.00000000000000022

Insight of model_backward:
- The adjusted R-squared, which measures the goodness of fit of the model, is 0.8536. This means model_backward explains about 85.36% of the variance in SalePrice, adjusted for the number of predictors.
- OverallQual9 and OverallQual10, levels of OverallQual (which rates the overall material and finish of the house), and the intercept are all significant at the 0.001 level in model_backward; GrLivArea has the smallest p-value.


- Backward model without multicollinearity variables

# Backward model without multicollinearity variables.
model_backward_nomulti <- step(object = model_all_nomulti, direction = "backward", trace = FALSE)

summary(model_backward_nomulti)
#> 
#> Call:
#> lm(formula = SalePrice ~ MSZoning + LotFrontage + LotArea + LotConfig + 
#>     Condition1 + OverallQual + YearRemodAdd + X1stFlrSF + GrLivArea + 
#>     BsmtFullBath + BedroomAbvGr + Fireplaces + GarageType + GarageCars + 
#>     OpenPorchSF + SaleType, data = house_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -476988  -14701    -293   13539  225308 
#> 
#> Coefficients:
#>                       Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)       -534450.3022  130395.0757  -4.099  0.00004473504361153 ***
#> MSZoningFV          18987.8030   14567.9547   1.303             0.192724    
#> MSZoningRH          16451.3438   16868.0511   0.975             0.329637    
#> MSZoningRL          23117.3469   13715.3975   1.686             0.092187 .  
#> MSZoningRM           8522.0633   13893.9602   0.613             0.539768    
#> LotFrontage           -94.4625      56.3250  -1.677             0.093820 .  
#> LotArea                 0.4208       0.1431   2.940             0.003354 ** 
#> LotConfigCulDSac    11461.8176    4653.1776   2.463             0.013928 *  
#> LotConfigFR2        -1388.4660    6212.7762  -0.223             0.823201    
#> LotConfigFR3       -25968.8445   24872.5712  -1.044             0.296690    
#> LotConfigInside      -243.1049    2702.5968  -0.090             0.928342    
#> Condition1Feedr     -1063.8292    7124.1599  -0.149             0.881324    
#> Condition1Norm      18533.6980    5652.9646   3.279             0.001077 ** 
#> Condition1PosA      26453.1744   14893.9155   1.776             0.076004 .  
#> Condition1PosN      14919.2315   10341.7849   1.443             0.149425    
#> Condition1RRAe       5453.0482   12711.1790   0.429             0.668014    
#> Condition1RRAn      10326.0327   10095.8209   1.023             0.306636    
#> Condition1RRNe       3012.0927   33744.1985   0.089             0.928890    
#> Condition1RRNn      16418.1760   16787.9137   0.978             0.328310    
#> OverallQual3        -1283.0143   26580.2227  -0.048             0.961511    
#> OverallQual4        14840.2286   24577.2565   0.604             0.546093    
#> OverallQual5        25279.6082   24561.4294   1.029             0.303603    
#> OverallQual6        31843.3896   24661.1250   1.291             0.196904    
#> OverallQual7        47023.7816   24831.6983   1.894             0.058538 .  
#> OverallQual8        85732.6120   25064.9046   3.420             0.000649 ***
#> OverallQual9       156920.5019   25645.1639   6.119  0.00000000132604167 ***
#> OverallQual10      215455.4610   27435.1425   7.853  0.00000000000000993 ***
#> YearRemodAdd          234.1588      66.4971   3.521             0.000448 ***
#> X1stFlrSF               6.2253       3.9830   1.563             0.118354    
#> GrLivArea              49.7728       3.7709  13.199 < 0.0000000000000002 ***
#> BsmtFullBath        13170.1166    2088.6661   6.306  0.00000000042159795 ***
#> BedroomAbvGr        -2723.6570    1749.3177  -1.557             0.119775    
#> Fireplaces           6355.8283    1908.8502   3.330             0.000900 ***
#> GarageTypeAttchd    49698.4849   15618.8561   3.182             0.001506 ** 
#> GarageTypeBasment   29350.3777   17871.3966   1.642             0.100824    
#> GarageTypeBuiltIn   55568.1036   16220.3513   3.426             0.000637 ***
#> GarageTypeCarPort   24335.1893   19157.2554   1.270             0.204263    
#> GarageTypeDetchd    42475.5928   15590.0468   2.725             0.006546 ** 
#> GarageCars          16321.2513    2210.5816   7.383  0.00000000000031290 ***
#> OpenPorchSF            26.3347      17.6971   1.488             0.137029    
#> SaleTypeCon         38321.6146   24300.7567   1.577             0.115102    
#> SaleTypeConLD         691.8352   13957.1022   0.050             0.960475    
#> SaleTypeConLI       -6810.0815   18178.0978  -0.375             0.708010    
#> SaleTypeConLw        6591.7751   20408.4041   0.323             0.746764    
#> SaleTypeCWD         12835.9309   17795.4664   0.721             0.470883    
#> SaleTypeNew         29123.8208    7181.0888   4.056  0.00005368045966956 ***
#> SaleTypeOth         29498.8907   33743.6148   0.874             0.382205    
#> SaleTypeWD           8819.6595    6019.9018   1.465             0.143196    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 32910 on 1055 degrees of freedom
#> Multiple R-squared:  0.8379, Adjusted R-squared:  0.8306 
#> F-statistic:   116 on 47 and 1055 DF,  p-value: < 0.00000000000000022

Insight of model_backward_nomulti:
- The adjusted R-squared, which measures the goodness of fit of the model, is 0.8306 in the summary above. This means model_backward_nomulti explains about 83.06% of the variance in SalePrice, adjusted for the number of predictors.
- OverallQual9 and OverallQual10, levels of OverallQual (which rates the overall material and finish of the house), and the intercept are all significant at the 0.001 level in model_backward_nomulti; GrLivArea has the smallest p-value.


- Backward model without multicollinearity variables and with outlier treatment

# Backward model without multicollinearity variables and with outlier treatment.
model_bwd_outlier_nomulti <- step(object = model_all_outlier_nomulti, direction = "backward", trace = FALSE)

summary(model_bwd_outlier_nomulti)
#> 
#> Call:
#> lm(formula = SalePrice ~ MSSubClass + MSZoning + LotArea + Street + 
#>     Condition1 + OverallQual + YearBuilt + YearRemodAdd + ExterQual + 
#>     X1stFlrSF + GrLivArea + BsmtFullBath + BedroomAbvGr + TotRmsAbvGrd + 
#>     Fireplaces + GarageArea + OpenPorchSF + SaleType, data = house_train_outlier)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -99865 -10368    530  10464  59296 
#> 
#> Coefficients:
#>                     Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)     -897701.0328   98760.4504  -9.090 < 0.0000000000000002 ***
#> MSSubClass         -108.1648      21.0628  -5.135      0.0000003587000 ***
#> MSZoningFV        24555.8638    9172.5996   2.677             0.007588 ** 
#> MSZoningRH        20955.3630   10667.4204   1.964             0.049847 *  
#> MSZoningRL        26005.2417    8049.8480   3.231             0.001289 ** 
#> MSZoningRM        17622.3153    8023.0817   2.196             0.028363 *  
#> LotArea               1.0835       0.3151   3.438             0.000617 ***
#> StreetPave        33588.8196   18858.7656   1.781             0.075303 .  
#> Condition1Feedr     400.4230    5007.9692   0.080             0.936293    
#> Condition1Norm     6445.8084    4326.1189   1.490             0.136648    
#> Condition1PosA     5607.7584   13821.3171   0.406             0.685054    
#> Condition1PosN    14882.6840    8373.1424   1.777             0.075900 .  
#> Condition1RRAe   -17069.6556    7726.9136  -2.209             0.027466 *  
#> Condition1RRAn    -2154.0693    6408.1199  -0.336             0.736854    
#> Condition1RRNe    10442.0547   19016.1381   0.549             0.583089    
#> Condition1RRNn    42915.9209   16063.7887   2.672             0.007712 ** 
#> OverallQual3       2595.0319   15059.6503   0.172             0.863235    
#> OverallQual4      19385.5120   13809.3563   1.404             0.160791    
#> OverallQual5      28565.3775   13852.6284   2.062             0.039541 *  
#> OverallQual6      38153.6193   13919.2538   2.741             0.006269 ** 
#> OverallQual7      48476.4737   14110.4683   3.435             0.000624 ***
#> OverallQual8      75777.3731   14370.8690   5.273      0.0000001754226 ***
#> OverallQual9     102852.2705   24992.9263   4.115      0.0000429328528 ***
#> YearBuilt           233.3913      37.5982   6.208      0.0000000008876 ***
#> YearRemodAdd        212.8192      44.1620   4.819      0.0000017440676 ***
#> ExterQualTA      -25584.9304   13498.0981  -1.895             0.058415 .  
#> ExterQualGd      -13831.5120    9426.1789  -1.467             0.142697    
#> ExterQualEx      -19631.5618    9358.4760  -2.098             0.036261 *  
#> X1stFlrSF             8.4047       2.8326   2.967             0.003101 ** 
#> GrLivArea            51.3943       3.7077  13.862 < 0.0000000000000002 ***
#> BsmtFullBath       9726.2100    1427.6222   6.813      0.0000000000196 ***
#> BedroomAbvGr      -3303.2001    1356.6996  -2.435             0.015133 *  
#> TotRmsAbvGrd      -1651.6995     947.1219  -1.744             0.081581 .  
#> Fireplaces         5969.0460    1227.0830   4.864      0.0000013976129 ***
#> GarageArea           23.4403       5.6157   4.174      0.0000334006048 ***
#> OpenPorchSF          45.5167      23.2544   1.957             0.050676 .  
#> SaleTypeCon       39259.8665   19400.9809   2.024             0.043363 *  
#> SaleTypeConLD     11423.0389   10073.5776   1.134             0.257171    
#> SaleTypeConLI     -2584.7130   11564.1097  -0.224             0.823198    
#> SaleTypeConLw     10028.9174   11666.3250   0.860             0.390257    
#> SaleTypeCWD       45415.9074   19043.9235   2.385             0.017334 *  
#> SaleTypeNew       17501.7101    5006.8467   3.496             0.000501 ***
#> SaleTypeWD         9015.9153    3868.6396   2.331             0.020042 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 18310 on 755 degrees of freedom
#> Multiple R-squared:  0.8551, Adjusted R-squared:  0.8471 
#> F-statistic: 106.1 on 42 and 755 DF,  p-value: < 0.00000000000000022

  • Forward
# Create a model without predictors
model_none <- lm(formula = SalePrice ~ 1, data = house_train)

# Forward
model_fwd <- step(object = model_none, direction = "forward",
                  scope = list(lower = model_none, upper = model_all), trace = FALSE)

summary(model_fwd)
#> 
#> Call:
#> lm(formula = SalePrice ~ OverallQual + GrLivArea + YearBuilt + 
#>     MSSubClass + BsmtFullBath + OverallCond + GarageCars + Fireplaces + 
#>     SaleType + Condition1 + Exterior1st + LotArea + LotFrontage + 
#>     ExterQual + Street + BedroomAbvGr + BldgType + FullBath, 
#>     data = house_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -439880  -13369   -1067   11852  201540 
#> 
#> Coefficients:
#>                        Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)        -878109.6221  131291.0292  -6.688   0.0000000000367069 ***
#> OverallQual3         19406.3364   25361.9434   0.765             0.444340    
#> OverallQual4          9038.8137   22864.8056   0.395             0.692691    
#> OverallQual5         19118.7592   22819.0301   0.838             0.402310    
#> OverallQual6         23468.1986   22931.9686   1.023             0.306363    
#> OverallQual7         35180.1098   23128.7975   1.521             0.128549    
#> OverallQual8         71124.9450   23375.4995   3.043             0.002403 ** 
#> OverallQual9        136205.3135   24074.0820   5.658   0.0000000197889253 ***
#> OverallQual10       187997.3530   26062.6357   7.213   0.0000000000010466 ***
#> GrLivArea               58.2522       3.7477  15.543 < 0.0000000000000002 ***
#> YearBuilt              417.0213      65.6872   6.349   0.0000000003232699 ***
#> MSSubClass            -162.1606      50.5734  -3.206             0.001385 ** 
#> BsmtFullBath         15071.8219    1994.6654   7.556   0.0000000000000905 ***
#> OverallCond           7520.6002    1028.8193   7.310   0.0000000000005301 ***
#> GarageCars           14711.1993    2160.8328   6.808   0.0000000000166372 ***
#> Fireplaces            6719.6159    1800.6284   3.732             0.000200 ***
#> SaleTypeCon          39461.7762   22895.2421   1.724             0.085079 .  
#> SaleTypeConLD        13853.0234   13683.7896   1.012             0.311597    
#> SaleTypeConLI        -3471.9377   16988.3810  -0.204             0.838103    
#> SaleTypeConLw         1800.4902   19176.8468   0.094             0.925216    
#> SaleTypeCWD          27539.4650   16775.5757   1.642             0.100965    
#> SaleTypeNew          25597.2301    6707.0059   3.816             0.000143 ***
#> SaleTypeOth          30131.6419   31712.5001   0.950             0.342255    
#> SaleTypeWD            7738.3298    5581.8605   1.386             0.165940    
#> Condition1Feedr      -2846.6480    6813.4661  -0.418             0.676181    
#> Condition1Norm       17359.5603    5466.4071   3.176             0.001539 ** 
#> Condition1PosA       22219.5089   14214.5188   1.563             0.118318    
#> Condition1PosN       15006.9623    9732.4539   1.542             0.123388    
#> Condition1RRAe       -3887.1407   11899.2341  -0.327             0.743982    
#> Condition1RRAn       13490.3923    9533.1795   1.415             0.157337    
#> Condition1RRNe       -2373.7801   31719.5799  -0.075             0.940359    
#> Condition1RRNn       12795.9801   15427.5258   0.829             0.407053    
#> Exterior1stBrkComm  -11400.5518   34518.5874  -0.330             0.741260    
#> Exterior1stBrkFace   28412.5596   11178.4792   2.542             0.011174 *  
#> Exterior1stCBlock    20996.1963   37752.0375   0.556             0.578220    
#> Exterior1stCemntBd   17204.5232   11080.2557   1.553             0.120793    
#> Exterior1stHdBoard    9963.3373   10181.1889   0.979             0.328003    
#> Exterior1stImStucc   -1892.1702   32752.4274  -0.058             0.953941    
#> Exterior1stMetalSd   11392.4557   10057.8347   1.133             0.257602    
#> Exterior1stPlywood    9515.2715   10629.4789   0.895             0.370898    
#> Exterior1stStone      4994.0305   24424.8687   0.204             0.838030    
#> Exterior1stStucco   -26114.6003   12469.3205  -2.094             0.036473 *  
#> Exterior1stVinylSd    8529.3085   10194.2425   0.837             0.402964    
#> Exterior1stWd Sdng   10460.2524   10029.3110   1.043             0.297204    
#> Exterior1stWdShing     538.3750   12115.6980   0.044             0.964565    
#> LotArea                  0.6254       0.1506   4.153   0.0000355079325200 ***
#> LotFrontage           -186.8743      54.3782  -3.437             0.000612 ***
#> ExterQualTA         -11630.4892   20246.7180  -0.574             0.565795    
#> ExterQualGd         -14039.4167    6900.8020  -2.034             0.042157 *  
#> ExterQualEx         -21266.8478    7584.1058  -2.804             0.005139 ** 
#> StreetPave           41638.3240   19848.8834   2.098             0.036165 *  
#> BedroomAbvGr         -3972.1728    1731.7083  -2.294             0.022000 *  
#> BldgType2fmCon        2198.1863   10518.1943   0.209             0.834498    
#> BldgTypeDuplex      -15988.9734    6574.1816  -2.432             0.015179 *  
#> BldgTypeTwnhs       -14379.7114    8661.6718  -1.660             0.097183 .  
#> BldgTypeTwnhsE       -6723.6941    6479.9812  -1.038             0.299691    
#> FullBath              4268.2036    2747.8539   1.553             0.120657    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 30920 on 1046 degrees of freedom
#> Multiple R-squared:  0.8581, Adjusted R-squared:  0.8505 
#> F-statistic:   113 on 56 and 1046 DF,  p-value: < 0.00000000000000022

Insight of model_fwd :
- The Adjusted R-squared score, which represents the goodness of fit of the model, is 0.8505. It means that model_fwd explains about 85.05% of the variance in SalePrice.
- OverallQual9 and OverallQual10 (levels of the rating of the overall material and finish of the house) have the largest positive coefficients and are highly significant (p < 0.001), as is the Intercept. GrLivArea has the smallest p-value in model_fwd.
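
The adjusted R-squared of 0.8505 can be reproduced from the figures printed in the summary. The sketch below is only an illustration: the helper name adj_r2 is hypothetical, and the sample size is an assumption derived from the output (1046 residual degrees of freedom, 56 slopes, and one intercept).

```r
# Hypothetical helper: adjusted R-squared from R-squared, n rows, and p slopes
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values read from summary(model_fwd): R2 = 0.8581, 56 slopes, 1046 residual df
n_obs <- 1046 + 56 + 1
adj_r2_fwd <- adj_r2(r2 = 0.8581, n = n_obs, p = 56)
round(adj_r2_fwd, 4)
```

The penalty term (n - 1) / (n - p - 1) grows with the number of slopes, which is why the adjusted score is lower than the multiple R-squared.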


  • Both
model_both <- step(object = model_none,direction = "both",
                   scope = list(upper=model_all),trace = F)

summary(model_both)
#> 
#> Call:
#> lm(formula = SalePrice ~ OverallQual + GrLivArea + YearBuilt + 
#>     MSSubClass + BsmtFullBath + OverallCond + GarageCars + Fireplaces + 
#>     SaleType + Condition1 + Exterior1st + LotArea + LotFrontage + 
#>     ExterQual + Street + BedroomAbvGr + BldgType + FullBath, 
#>     data = house_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -439880  -13369   -1067   11852  201540 
#> 
#> Coefficients:
#>                        Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)        -878109.6221  131291.0292  -6.688   0.0000000000367069 ***
#> OverallQual3         19406.3364   25361.9434   0.765             0.444340    
#> OverallQual4          9038.8137   22864.8056   0.395             0.692691    
#> OverallQual5         19118.7592   22819.0301   0.838             0.402310    
#> OverallQual6         23468.1986   22931.9686   1.023             0.306363    
#> OverallQual7         35180.1098   23128.7975   1.521             0.128549    
#> OverallQual8         71124.9450   23375.4995   3.043             0.002403 ** 
#> OverallQual9        136205.3135   24074.0820   5.658   0.0000000197889253 ***
#> OverallQual10       187997.3530   26062.6357   7.213   0.0000000000010466 ***
#> GrLivArea               58.2522       3.7477  15.543 < 0.0000000000000002 ***
#> YearBuilt              417.0213      65.6872   6.349   0.0000000003232699 ***
#> MSSubClass            -162.1606      50.5734  -3.206             0.001385 ** 
#> BsmtFullBath         15071.8219    1994.6654   7.556   0.0000000000000905 ***
#> OverallCond           7520.6002    1028.8193   7.310   0.0000000000005301 ***
#> GarageCars           14711.1993    2160.8328   6.808   0.0000000000166372 ***
#> Fireplaces            6719.6159    1800.6284   3.732             0.000200 ***
#> SaleTypeCon          39461.7762   22895.2421   1.724             0.085079 .  
#> SaleTypeConLD        13853.0234   13683.7896   1.012             0.311597    
#> SaleTypeConLI        -3471.9377   16988.3810  -0.204             0.838103    
#> SaleTypeConLw         1800.4902   19176.8468   0.094             0.925216    
#> SaleTypeCWD          27539.4650   16775.5757   1.642             0.100965    
#> SaleTypeNew          25597.2301    6707.0059   3.816             0.000143 ***
#> SaleTypeOth          30131.6419   31712.5001   0.950             0.342255    
#> SaleTypeWD            7738.3298    5581.8605   1.386             0.165940    
#> Condition1Feedr      -2846.6480    6813.4661  -0.418             0.676181    
#> Condition1Norm       17359.5603    5466.4071   3.176             0.001539 ** 
#> Condition1PosA       22219.5089   14214.5188   1.563             0.118318    
#> Condition1PosN       15006.9623    9732.4539   1.542             0.123388    
#> Condition1RRAe       -3887.1407   11899.2341  -0.327             0.743982    
#> Condition1RRAn       13490.3923    9533.1795   1.415             0.157337    
#> Condition1RRNe       -2373.7801   31719.5799  -0.075             0.940359    
#> Condition1RRNn       12795.9801   15427.5258   0.829             0.407053    
#> Exterior1stBrkComm  -11400.5518   34518.5874  -0.330             0.741260    
#> Exterior1stBrkFace   28412.5596   11178.4792   2.542             0.011174 *  
#> Exterior1stCBlock    20996.1963   37752.0375   0.556             0.578220    
#> Exterior1stCemntBd   17204.5232   11080.2557   1.553             0.120793    
#> Exterior1stHdBoard    9963.3373   10181.1889   0.979             0.328003    
#> Exterior1stImStucc   -1892.1702   32752.4274  -0.058             0.953941    
#> Exterior1stMetalSd   11392.4557   10057.8347   1.133             0.257602    
#> Exterior1stPlywood    9515.2715   10629.4789   0.895             0.370898    
#> Exterior1stStone      4994.0305   24424.8687   0.204             0.838030    
#> Exterior1stStucco   -26114.6003   12469.3205  -2.094             0.036473 *  
#> Exterior1stVinylSd    8529.3085   10194.2425   0.837             0.402964    
#> Exterior1stWd Sdng   10460.2524   10029.3110   1.043             0.297204    
#> Exterior1stWdShing     538.3750   12115.6980   0.044             0.964565    
#> LotArea                  0.6254       0.1506   4.153   0.0000355079325200 ***
#> LotFrontage           -186.8743      54.3782  -3.437             0.000612 ***
#> ExterQualTA         -11630.4892   20246.7180  -0.574             0.565795    
#> ExterQualGd         -14039.4167    6900.8020  -2.034             0.042157 *  
#> ExterQualEx         -21266.8478    7584.1058  -2.804             0.005139 ** 
#> StreetPave           41638.3240   19848.8834   2.098             0.036165 *  
#> BedroomAbvGr         -3972.1728    1731.7083  -2.294             0.022000 *  
#> BldgType2fmCon        2198.1863   10518.1943   0.209             0.834498    
#> BldgTypeDuplex      -15988.9734    6574.1816  -2.432             0.015179 *  
#> BldgTypeTwnhs       -14379.7114    8661.6718  -1.660             0.097183 .  
#> BldgTypeTwnhsE       -6723.6941    6479.9812  -1.038             0.299691    
#> FullBath              4268.2036    2747.8539   1.553             0.120657    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 30920 on 1046 degrees of freedom
#> Multiple R-squared:  0.8581, Adjusted R-squared:  0.8505 
#> F-statistic:   113 on 56 and 1046 DF,  p-value: < 0.00000000000000022

Insight of model_both :
- The Adjusted R-squared score, which represents the goodness of fit of the model, is 0.8505. It means that model_both explains about 85.05% of the variance in SalePrice.
- OverallQual9 and OverallQual10 (levels of the rating of the overall material and finish of the house) have the largest positive coefficients and are highly significant (p < 0.001), as is the Intercept. GrLivArea has the smallest p-value in model_both.


6.4 Model Comparison


In the comparison step, all models are compared in order to find the one with a high Adj. R-squared score, a low AIC score, and a low RMSE score.
- 1st Step - Before Assumption Test.

comparison <- compare_performance(model_none,model_all,model_bob,model_fs,model_backward,model_fwd,model_both)

as.data.frame(comparison)
# Range of Data Train's Target
range(house_train$SalePrice)
#> [1]  35311 755000

Insight:
The best linear regression models are as follows:
- Based on Adj. R-squared with highest score (R2_adjusted column) : model_backward.
- Based on AIC with lowest score (AIC column) : model_backward.
- Based on RMSE with lowest score (RMSE column) : model_all.

So model_all is suitable for making predictions, while model_backward is more appropriate for examining the significance of the predictors. Step-wise regression is a greedy algorithm: it finds a good result in a relatively short time, but not necessarily the optimal one. Therefore, this research uses the result of step-wise regression as the recommended model, and there is still room for improvement.
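
The three criteria can also be computed by hand in base R. This is a minimal sketch, not the report's pipeline: the built-in mtcars data stands in for house_train, and the model names m_small, m_mid, and m_full and the table comparison_sketch are hypothetical. It also illustrates why the largest model tends to win on RMSE: for nested least-squares models, the training RMSE cannot increase when predictors are added.

```r
# Assumption: built-in mtcars stands in for house_train; models are illustrative
m_small <- lm(mpg ~ wt, data = mtcars)
m_mid   <- lm(mpg ~ wt + hp, data = mtcars)
m_full  <- lm(mpg ~ ., data = mtcars)

# Training RMSE: square root of the mean squared residual
rmse <- function(model) sqrt(mean(residuals(model)^2))

models <- list(m_small = m_small, m_mid = m_mid, m_full = m_full)

# One row per model, one column per criterion
comparison_sketch <- data.frame(
  model       = names(models),
  R2_adjusted = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC         = sapply(models, AIC),
  RMSE        = sapply(models, rmse),
  row.names   = NULL
)
comparison_sketch
```

Because training RMSE rewards every extra predictor, adjusted R-squared and AIC, which both penalize model size, are the fairer criteria when the candidates differ in the number of terms.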


  • 2nd Step - After 1st Step in The Assumption Test.
comparison2 <- compare_performance(model_all_nomulti,model_all_outlier_nomulti,model_backward_nomulti,model_bwd_outlier_nomulti)

as.data.frame(comparison2)
# Range of Data Train's Target
range(house_train_outlier$SalePrice)
#> [1]  35311 297000

Insight:
The best linear regression models are as follows:
- Based on Adj. R-squared with highest score (R2_adjusted column) : model_bwd_outlier_nomulti.
- Based on AIC with lowest score (AIC column) : model_bwd_outlier_nomulti.
- Based on RMSE with lowest score (RMSE column) : model_all_outlier_nomulti.

So model_all_outlier_nomulti is suitable for making predictions, while model_bwd_outlier_nomulti is more appropriate for examining the significance of the predictors. Step-wise regression is a greedy algorithm: it finds a good result in a relatively short time, but not necessarily the optimal one. Therefore, this research uses the result of step-wise regression as the recommended model, and there is still room for improvement.


6.5 Prediction Interval


The prediction is made with the selected model, the one with the best performance scores.

  • 1st Step - Before Assumption Test.

    This section uses the best model from step-wise regression, model_backward, which has the highest Adj. R-squared score, to generate the price prediction interval.
pred_price_interval <- predict(object = model_backward, newdata = house_test, interval = "prediction", level = 0.95)

bwd_pred_result <- house_test %>% 
  select(SalePrice) %>% 
  bind_cols(as.data.frame(pred_price_interval)) %>% 
  relocate(fit,.after = lwr) 

colnames(bwd_pred_result) <- c("SalePrice_Actual","LowPrice_Pred","Fit_Pred","UprPrice_Pred")
bwd_pred_result
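
One way to judge a 95% prediction interval is to check how often the actual values fall between the lower and upper bounds. The following is a minimal sketch under stated assumptions: the built-in mtcars data stands in for the house data, and the names toy_train, toy_test, toy_model, toy_pred, and coverage are hypothetical.

```r
# Assumption: mtcars stands in for the house data; the split is illustrative
set.seed(100)
idx       <- sample(nrow(mtcars), size = 24)
toy_train <- mtcars[idx, ]
toy_test  <- mtcars[-idx, ]

toy_model <- lm(mpg ~ wt + hp, data = toy_train)

# Columns fit, lwr, upr, as in the prediction table of this section
toy_pred <- as.data.frame(
  predict(toy_model, newdata = toy_test, interval = "prediction", level = 0.95)
)

# Share of held-out actual values that fall inside [lwr, upr]
inside   <- toy_test$mpg >= toy_pred$lwr & toy_test$mpg <= toy_pred$upr
coverage <- mean(inside)
coverage
```

The same two lines that build inside and coverage could be applied to a table with the actual price and the lower and upper predicted prices, provided the model assumptions behind the interval are reasonable.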

6.6 Model Assumption

6.6.1 Linearity


The model is tested against the linearity assumption, one of the four pillars of the assumption tests.
- 1st Step - Before Assumption Test.
In this section, the linearity assumption is checked by making a residuals vs fitted plot for model_backward.

options(scipen = 9999)
plot(model_backward, which = 1)
abline(h = 10, col = "green")
abline(h = -10, col = "green")

resfit1 <- data.frame(residual = model_backward$residuals, fitted = model_backward$fitted.values)

resfit1 %>% ggplot(aes(fitted, residual)) + 
  geom_point() + 
  geom_hline(yintercept = 0, colour = "red") +
  theme_minimal() + 
  labs(title = "Residual vs Fitted Plot") + 
  theme(legend.position = "none")

The plots show that the residuals/errors do not scatter randomly around 0: their spread is not uniform and many of them lie very far from 0, so the condition E(ϵ)=0 is not fulfilled across the range of fitted values. The red line also leaves the tolerance range of -10 to +10, which indicates that the relationship captured by model_backward is not linear. It is concluded that the data may not be linear. The linearity assumption is not met.

6.6.2 Normality of Residuals

A linear regression model is expected to produce errors that are normally distributed and mostly located around 0.

hist(model_backward$residuals)

shapiro.test(model_backward$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward$residuals
#> W = 0.79245, p-value < 0.00000000000000022

Based on the histogram above and the Shapiro-Wilk test, most of the errors are not located around 0, and the Shapiro-Wilk p-value is < 0.05, so H0 (the errors are normally distributed) is rejected. This shows that the errors are not normally distributed.
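
The decision rule for the Shapiro-Wilk test can be illustrated on simulated data. This sketch does not use the house data; the vectors normal_errors and skewed_errors are hypothetical samples drawn only to show how the p-value is read.

```r
# Simulated residual-like samples (not the house data)
set.seed(42)
normal_errors <- rnorm(500)      # symmetric, bell-shaped
skewed_errors <- rexp(500) - 1   # right-skewed, mean near 0

sw_normal <- shapiro.test(normal_errors)
sw_skewed <- shapiro.test(skewed_errors)

# Decision rule at alpha = 0.05: reject H0 (normality) when p-value < alpha
alpha <- 0.05
reject_skewed <- sw_skewed$p.value < alpha
c(p_normal = sw_normal$p.value, p_skewed = sw_skewed$p.value)
```

A small p-value is evidence against normality, so a p-value below 0.05 leads to rejecting H0, which is the situation reported for model_backward.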

6.6.3 Homoscedasticity of Residuals

bptest(model_backward)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward
#> BP = 503.22, df = 64, p-value < 0.00000000000000022

Breusch-Pagan hypothesis test:

H0: the error variance is constant (homoscedasticity)
H1: the error variance is not constant (heteroscedasticity)

Based on the BP test, the p-value is lower than 0.05, so the null hypothesis (H0) is rejected in favor of the alternative hypothesis (H1). The model therefore does not meet the assumption of homoscedasticity of residuals, and a data transformation of the target or predictor variables is needed.
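
The idea behind the studentized Breusch-Pagan statistic can be sketched in base R: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The example below uses simulated data, not the house data, and the names fit, aux, bp_stat, bp_df, and bp_p are hypothetical.

```r
# Simulated data whose error spread grows with x (not the house data)
set.seed(1)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictors
e2  <- residuals(fit)^2
aux <- lm(e2 ~ x)

# Studentized (Koenker) statistic: n times the auxiliary R-squared
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(BP = bp_stat, df = bp_df, p_value = bp_p)
```

The degrees of freedom equal the number of slopes in the auxiliary regression, which is consistent with the df = 64 reported for model_backward, whose factor variables expand into many dummy columns.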

6.6.4 No Multicollinearity

In this section, multicollinearity among the predictor variables is measured with the vif() function from the car library in order to discover which predictors are strongly related to the other predictors, which can lead to redundant predictors. One variable from each such group is then chosen so that there is no redundant predictor in the model.

vif(model_backward)
#>                   GVIF Df GVIF^(1/(2*Df))
#> LotFrontage   1.709594  1        1.307515
#> LotArea       1.694626  1        1.301778
#> Street        2.038089  1        1.427617
#> Condition1    1.945497  8        1.042472
#> BldgType      3.161430  4        1.154743
#> HouseStyle   19.775693  7        1.237602
#> OverallQual  16.733791  8        1.192545
#> OverallCond   1.498539  1        1.224148
#> YearBuilt     5.194551  1        2.279156
#> Exterior1st   8.965802 13        1.088023
#> ExterQual    12.061682  3        1.514379
#> X2ndFlrSF    11.627028  1        3.409843
#> GrLivArea     6.965922  1        2.639303
#> BsmtFullBath  1.316555  1        1.147412
#> FullBath      3.207742  1        1.791017
#> HalfBath      2.493750  1        1.579161
#> BedroomAbvGr  2.233516  1        1.494495
#> Fireplaces    1.565500  1        1.251199
#> GarageCars    2.163001  1        1.470714
#> SaleType      2.217563  8        1.051035

There is multicollinearity involving the HouseStyle, OverallQual, ExterQual, and X2ndFlrSF variables, because their scores in the GVIF column are higher than 10 (VIF > 10).
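
For a numeric predictor, the VIF can be computed by hand: the predictor is regressed on all other predictors, and the VIF is 1 / (1 - R-squared) of that regression. The sketch below assumes the built-in mtcars data as a stand-in for house_train; the names predictors and vif_manual are hypothetical.

```r
# Assumption: mtcars stands in for house_train; predictors are illustrative
predictors <- c("disp", "hp", "wt")

# VIF of each predictor: regress it on the remaining predictors
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux    <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})
vif_manual
```

For terms with one degree of freedom, the GVIF column in the output above is the ordinary VIF, and squaring the last column recovers it (for X2ndFlrSF, 3.409843 squared is about 11.63). For factors with several levels, the GVIF^(1/(2*Df)) column is the dimension-adjusted figure, so the raw GVIF of a multi-level factor such as HouseStyle is not directly comparable with the VIF of a single numeric predictor.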

6.7 Model Improvement

From the residual analysis, it is concluded that the residuals of model_backward show non-linearity, non-normally distributed errors, heteroscedasticity, and multicollinearity. Thus, a square-root (sqrt) transformation is applied to the target variable and to the predictor variables with the highest p-values.

6.7.1 Model Tuning

6.7.1.1 Transformations with sqrt

# Using model_all_outlier_nomulti with sqrt

model_tuning <- lm(formula= sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea+ Street + LotConfig + Condition1 + Condition2 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + GrLivArea + BsmtFullBath + FullBath + HalfBath + sqrt(BedroomAbvGr) + sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + sqrt(GarageCars) + sqrt(GarageArea) + PavedDrive + sqrt(OpenPorchSF) + SaleType,
    data = house_train_outlier)

summary(model_tuning)
#> 
#> Call:
#> lm(formula = sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + 
#>     Street + LotConfig + Condition1 + Condition2 + OverallQual + 
#>     YearBuilt + YearRemodAdd + X1stFlrSF + GrLivArea + BsmtFullBath + 
#>     FullBath + HalfBath + sqrt(BedroomAbvGr) + sqrt(TotRmsAbvGrd) + 
#>     Fireplaces + GarageType + sqrt(GarageCars) + sqrt(GarageArea) + 
#>     PavedDrive + sqrt(OpenPorchSF) + SaleType, data = house_train_outlier)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -141.729  -12.118    0.835   12.784   69.262 
#> 
#> Coefficients:
#>                        Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept)        -1116.475753   137.484004  -8.121  0.00000000000000194 ***
#> MSZoningFV            46.054058    11.958543   3.851             0.000128 ***
#> MSZoningRH            40.272081    13.581357   2.965             0.003122 ** 
#> MSZoningRL            45.480091    10.370746   4.385  0.00001325494221484 ***
#> MSZoningRM            31.397148    10.334130   3.038             0.002464 ** 
#> LotFrontage            0.099851     0.065968   1.514             0.130548    
#> LotArea                0.001813     0.000432   4.196  0.00003051881464823 ***
#> StreetPave            37.383186    21.714059   1.722             0.085558 .  
#> LotConfigCulDSac       7.724227     4.371240   1.767             0.077631 .  
#> LotConfigFR2          -0.957160     5.312280  -0.180             0.857062    
#> LotConfigFR3           8.347318    24.309643   0.343             0.731414    
#> LotConfigInside        1.088415     2.409413   0.452             0.651593    
#> Condition1Feedr        1.934627     6.777392   0.285             0.775377    
#> Condition1Norm         8.779067     5.821484   1.508             0.131969    
#> Condition1PosA        10.511465    17.724906   0.593             0.553340    
#> Condition1PosN        17.328153    10.892441   1.591             0.112072    
#> Condition1RRAe       -22.499589    10.235173  -2.198             0.028239 *  
#> Condition1RRAn         1.102445     9.024720   0.122             0.902807    
#> Condition1RRNe        23.740742    24.390736   0.973             0.330697    
#> Condition1RRNn        44.225949    21.506011   2.056             0.040090 *  
#> Condition2Feedr        3.145933    24.589227   0.128             0.898232    
#> Condition2Norm        11.327439    18.949399   0.598             0.550174    
#> Condition2RRAn       -15.686096    30.808883  -0.509             0.610804    
#> Condition2RRNn        11.157646    31.048291   0.359             0.719425    
#> OverallQual3          18.291895    19.552337   0.936             0.349818    
#> OverallQual4          49.231338    17.776758   2.769             0.005756 ** 
#> OverallQual5          61.431489    17.851872   3.441             0.000612 ***
#> OverallQual6          73.642988    17.913457   4.111  0.00004379712312152 ***
#> OverallQual7          89.050753    18.151076   4.906  0.00000114317687146 ***
#> OverallQual8         117.194340    18.438324   6.356  0.00000000036189349 ***
#> OverallQual9         154.025502    29.674093   5.191  0.00000027110104495 ***
#> YearBuilt              0.214500     0.059066   3.632             0.000301 ***
#> YearRemodAdd           0.377873     0.055620   6.794  0.00000000002246257 ***
#> X1stFlrSF              0.018516     0.004451   4.160  0.00003558289401388 ***
#> GrLivArea              0.051610     0.005240   9.850 < 0.0000000000000002 ***
#> BsmtFullBath          12.171482     1.868824   6.513  0.00000000013608698 ***
#> FullBath               1.555679     2.819148   0.552             0.581234    
#> HalfBath               1.998786     2.545470   0.785             0.432569    
#> sqrt(BedroomAbvGr)    -2.136031     5.143939  -0.415             0.678078    
#> sqrt(TotRmsAbvGrd)   -12.119338     6.073817  -1.995             0.046372 *  
#> Fireplaces             6.976545     1.616723   4.315  0.00001810957322644 ***
#> GarageTypeAttchd      31.445119    14.081550   2.233             0.025842 *  
#> GarageTypeBasment     19.994641    15.792678   1.266             0.205886    
#> GarageTypeBuiltIn     31.229928    14.776678   2.113             0.034895 *  
#> GarageTypeCarPort     13.257012    17.352688   0.764             0.445126    
#> GarageTypeDetchd      31.369102    14.002183   2.240             0.025367 *  
#> sqrt(GarageCars)       9.179581     7.961105   1.153             0.249261    
#> sqrt(GarageArea)       0.625992     0.483889   1.294             0.196184    
#> PavedDriveP            9.153242     7.719502   1.186             0.236110    
#> PavedDriveY            6.389496     4.667343   1.369             0.171421    
#> sqrt(OpenPorchSF)      0.517155     0.265092   1.951             0.051451 .  
#> SaleTypeCon           40.632632    24.730949   1.643             0.100810    
#> SaleTypeConLD         13.749883    12.980034   1.059             0.289804    
#> SaleTypeConLI         -8.443244    14.849875  -0.569             0.569818    
#> SaleTypeConLw         14.067374    14.999319   0.938             0.348618    
#> SaleTypeCWD           46.355721    24.355765   1.903             0.057392 .  
#> SaleTypeNew           20.084775     6.445700   3.116             0.001904 ** 
#> SaleTypeWD            10.078975     5.002207   2.015             0.044276 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 23.39 on 740 degrees of freedom
#> Multiple R-squared:  0.851,  Adjusted R-squared:  0.8395 
#> F-statistic: 74.15 on 57 and 740 DF,  p-value: < 0.00000000000000022

Insight of model_tuning :
- The Adjusted R-squared score, which represents the goodness of fit of the model, is 0.8395 (the Multiple R-squared is 0.851). It means that model_tuning explains about 83.95% of the variance in sqrt(SalePrice).
- OverallQual9, OverallQual8, OverallQual7, and OverallQual6 (levels of the rating of the overall material and finish of the house) have the largest positive coefficients and are highly significant (p < 0.001), as is the Intercept. GrLivArea has the smallest p-value in model_tuning.
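
Because model_tuning is fitted on sqrt(SalePrice), its fitted values and residual standard error are on the square-root scale and cannot be compared directly with the untransformed models. A minimal sketch of the back-transformation follows, assuming the built-in mtcars data as a stand-in for house_train_outlier; the names toy_sqrt, pred_sqrt, pred_mpg, and rmse_original are hypothetical.

```r
# Assumption: mtcars stands in for house_train_outlier; names are illustrative
toy_sqrt <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)

# predict() returns values on the square-root scale of the response
pred_sqrt <- predict(toy_sqrt, newdata = mtcars)

# Squaring maps the fitted values back to the original units
pred_mpg <- pred_sqrt^2

# RMSE in original units, comparable with an untransformed model
rmse_original <- sqrt(mean((mtcars$mpg - pred_mpg)^2))
rmse_original
```

Simply squaring the fitted values ignores the retransformation bias, so this is a rough comparison rather than an exact estimate of the mean on the original scale.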

6.7.1.2 Backward Regression Model in Transformation Model

Next, backward regression is performed to find the best model starting from model_tuning. Candidate models are compared with the AIC criterion, and the model with the smallest AIC value is selected.
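
For linear models, the AIC that step() prints is computed, up to an additive constant, as n * log(RSS / n) + 2 * edf, where edf counts all estimated coefficients. The sketch below reproduces the starting value of the trace from numbers read off this report (RSS of 404910 for the full model, 740 residual degrees of freedom, 57 slopes, and one intercept); the helper name step_aic is hypothetical.

```r
# AIC as reported by step() for lm fits (up to an additive constant):
# n * log(RSS / n) + 2 * edf, where edf counts all estimated coefficients
step_aic <- function(rss, n, edf) {
  n * log(rss / n) + 2 * edf
}

# Values read from the report: 740 residual df + 57 slopes + 1 intercept = 798 rows
aic_start <- step_aic(rss = 404910, n = 798, edf = 58)
round(aic_start, 2)  # close to the Start AIC of 5086.99 in the trace
```

Dropping a term raises RSS but lowers edf, so a term is removed only when the saving of 2 per coefficient outweighs the loss of fit. This is why Condition2, with 4 coefficients and a small sum of squares, is the first to go.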

# Backward   
model_tuning_bwd <- step(object = model_tuning, direction = "backward")
#> Start:  AIC=5086.99
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street + 
#>     LotConfig + Condition1 + Condition2 + OverallQual + YearBuilt + 
#>     YearRemodAdd + X1stFlrSF + GrLivArea + BsmtFullBath + FullBath + 
#>     HalfBath + sqrt(BedroomAbvGr) + sqrt(TotRmsAbvGrd) + Fireplaces + 
#>     GarageType + sqrt(GarageCars) + sqrt(GarageArea) + PavedDrive + 
#>     sqrt(OpenPorchSF) + SaleType
#> 
#>                      Df Sum of Sq    RSS    AIC
#> - Condition2          4      1050 405960 5081.1
#> - LotConfig           4      1975 406885 5082.9
#> - sqrt(BedroomAbvGr)  1        94 405005 5085.2
#> - FullBath            1       167 405077 5085.3
#> - PavedDrive          2      1207 406117 5085.4
#> - HalfBath            1       337 405248 5085.7
#> - sqrt(GarageCars)    1       727 405638 5086.4
#> - sqrt(GarageArea)    1       916 405826 5086.8
#> <none>                            404910 5087.0
#> - LotFrontage         1      1254 406164 5087.5
#> - GarageType          5      5623 410534 5088.0
#> - Street              1      1622 406532 5088.2
#> - sqrt(OpenPorchSF)   1      2082 406993 5089.1
#> - sqrt(TotRmsAbvGrd)  1      2179 407089 5089.3
#> - SaleType            7      8536 413446 5089.6
#> - Condition1          8     13526 418436 5097.2
#> - YearBuilt           1      7216 412126 5099.1
#> - X1stFlrSF           1      9469 414379 5103.4
#> - LotArea             1      9632 414542 5103.8
#> - Fireplaces          1     10189 415099 5104.8
#> - MSZoning            4     20462 425373 5118.3
#> - BsmtFullBath        1     23210 428120 5129.5
#> - YearRemodAdd        1     25255 430166 5133.3
#> - GrLivArea           1     53084 457994 5183.3
#> - OverallQual         7    126926 531836 5290.6
#> 
#> Step:  AIC=5081.06
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street + 
#>     LotConfig + Condition1 + OverallQual + YearBuilt + YearRemodAdd + 
#>     X1stFlrSF + GrLivArea + BsmtFullBath + FullBath + HalfBath + 
#>     sqrt(BedroomAbvGr) + sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + 
#>     sqrt(GarageCars) + sqrt(GarageArea) + PavedDrive + sqrt(OpenPorchSF) + 
#>     SaleType
#> 
#>                      Df Sum of Sq    RSS    AIC
#> - LotConfig           4      2034 407994 5077.0
#> - PavedDrive          2      1089 407049 5079.2
#> - sqrt(BedroomAbvGr)  1        93 406053 5079.2
#> - FullBath            1       132 406092 5079.3
#> - HalfBath            1       354 406314 5079.8
#> - sqrt(GarageCars)    1       673 406633 5080.4
#> - sqrt(GarageArea)    1       967 406927 5081.0
#> <none>                            405960 5081.1
#> - LotFrontage         1      1329 407289 5081.7
#> - Street              1      1575 407535 5082.1
#> - GarageType          5      6069 412029 5082.9
#> - sqrt(OpenPorchSF)   1      2071 408031 5083.1
#> - sqrt(TotRmsAbvGrd)  1      2158 408119 5083.3
#> - SaleType            7      8437 414397 5083.5
#> - YearBuilt           1      7529 413489 5093.7
#> - Condition1          8     14915 420875 5093.9
#> - X1stFlrSF           1      9247 415207 5097.0
#> - LotArea             1      9655 415615 5097.8
#> - Fireplaces          1     10515 416475 5099.5
#> - MSZoning            4     20583 426543 5112.5
#> - BsmtFullBath        1     23442 429402 5123.9
#> - YearRemodAdd        1     25521 431481 5127.7
#> - GrLivArea           1     53335 459295 5177.6
#> - OverallQual         7    128324 534284 5286.2
#> 
#> Step:  AIC=5077.05
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street + 
#>     Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + 
#>     GrLivArea + BsmtFullBath + FullBath + HalfBath + sqrt(BedroomAbvGr) + 
#>     sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + sqrt(GarageCars) + 
#>     sqrt(GarageArea) + PavedDrive + sqrt(OpenPorchSF) + SaleType
#> 
#>                      Df Sum of Sq    RSS    AIC
#> - PavedDrive          2      1018 409012 5075.0
#> - FullBath            1        86 408080 5075.2
#> - sqrt(BedroomAbvGr)  1       163 408157 5075.4
#> - HalfBath            1       350 408344 5075.7
#> - sqrt(GarageCars)    1       665 408659 5076.3
#> - LotFrontage         1       797 408791 5076.6
#> - sqrt(GarageArea)    1       981 408975 5077.0
#> <none>                            407994 5077.0
#> - Street              1      1834 409828 5078.6
#> - sqrt(OpenPorchSF)   1      2048 410042 5079.0
#> - GarageType          5      6233 414227 5079.1
#> - SaleType            7      8343 416337 5079.2
#> - sqrt(TotRmsAbvGrd)  1      2198 410192 5079.3
#> - YearBuilt           1      8182 416176 5090.9
#> - Condition1          8     15561 423555 5090.9
#> - X1stFlrSF           1      9043 417037 5092.5
#> - Fireplaces          1     10429 418423 5095.2
#> - LotArea             1     11566 419560 5097.4
#> - MSZoning            4     20924 428918 5109.0
#> - BsmtFullBath        1     23956 431950 5120.6
#> - YearRemodAdd        1     25730 433725 5123.8
#> - GrLivArea           1     55478 463472 5176.8
#> - OverallQual         7    127300 535294 5279.8
#> 
#> Step:  AIC=5075.04
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street + 
#>     Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + 
#>     GrLivArea + BsmtFullBath + FullBath + HalfBath + sqrt(BedroomAbvGr) + 
#>     sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + sqrt(GarageCars) + 
#>     sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType
#> 
#>                      Df Sum of Sq    RSS    AIC
#> - FullBath            1        61 409073 5073.2
#> - sqrt(BedroomAbvGr)  1       139 409151 5073.3
#> - HalfBath            1       366 409379 5073.7
#> - sqrt(GarageCars)    1       541 409553 5074.1
#> - LotFrontage         1       842 409855 5074.7
#> <none>                            409012 5075.0
#> - sqrt(GarageArea)    1      1170 410182 5075.3
#> - SaleType            7      8197 417210 5076.9
#> - sqrt(OpenPorchSF)   1      2025 411037 5077.0
#> - Street              1      2158 411170 5077.2
#> - GarageType          5      6454 415467 5077.5
#> - sqrt(TotRmsAbvGrd)  1      2459 411471 5077.8
#> - Condition1          8     15899 424911 5089.5
#> - X1stFlrSF           1      9076 418088 5090.5
#> - YearBuilt           1     10092 419105 5092.5
#> - LotArea             1     11239 420252 5094.7
#> - Fireplaces          1     11280 420293 5094.7
#> - MSZoning            4     22426 431438 5109.6
#> - BsmtFullBath        1     23756 432769 5118.1
#> - YearRemodAdd        1     25048 434061 5120.5
#> - GrLivArea           1     56454 465466 5176.2
#> - OverallQual         7    126953 535965 5276.8
#> 
#> Step:  AIC=5073.15
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street + 
#>     Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + 
#>     GrLivArea + BsmtFullBath + HalfBath + sqrt(BedroomAbvGr) + 
#>     sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + sqrt(GarageCars) + 
#>     sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType
#> 
#>                      Df Sum of Sq    RSS    AIC
#> - sqrt(BedroomAbvGr)  1       114 409187 5071.4
#> - HalfBath            1       305 409379 5071.7
#> - sqrt(GarageCars)    1       603 409676 5072.3
#> - LotFrontage         1       809 409882 5072.7
#> <none>                            409073 5073.2
#> - sqrt(GarageArea)    1      1154 410227 5073.4
#> - SaleType            7      8182 417255 5075.0
#> - sqrt(OpenPorchSF)   1      2118 411191 5075.3
#> - Street              1      2155 411228 5075.3
#> - GarageType          5      6396 415469 5075.5
#> - sqrt(TotRmsAbvGrd)  1      2424 411497 5075.9
#> - Condition1          8     15951 425024 5087.7
#> - X1stFlrSF           1      9060 418133 5088.6
#> - LotArea             1     11181 420254 5092.7
#> - Fireplaces          1     11221 420295 5092.7
#> - YearBuilt           1     11669 420742 5093.6
#> - MSZoning            4     22867 431940 5108.6
#> - BsmtFullBath        1     24075 433148 5116.8
#> - YearRemodAdd        1     25429 434502 5119.3
#> - GrLivArea           1     68076 477149 5194.0
#> - OverallQual         7    126912 535986 5274.8
#> 
#> Step:  AIC=5071.38
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street + 
#>     Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + 
#>     GrLivArea + BsmtFullBath + HalfBath + sqrt(TotRmsAbvGrd) + 
#>     Fireplaces + GarageType + sqrt(GarageCars) + sqrt(GarageArea) + 
#>     sqrt(OpenPorchSF) + SaleType
#> 
#>                      Df Sum of Sq    RSS    AIC
#> - HalfBath            1       321 409509 5070.0
#> - sqrt(GarageCars)    1       664 409851 5070.7
#> - LotFrontage         1       754 409941 5070.8
#> <none>                            409187 5071.4
#> - sqrt(GarageArea)    1      1111 410298 5071.5
#> - SaleType            7      8250 417437 5073.3
#> - sqrt(OpenPorchSF)   1      2118 411305 5073.5
#> - Street              1      2170 411357 5073.6
#> - GarageType          5      6321 415508 5073.6
#> - sqrt(TotRmsAbvGrd)  1      3562 412749 5076.3
#> - Condition1          8     16093 425281 5086.2
#> - X1stFlrSF           1      9306 418493 5087.3
#> - LotArea             1     11090 420277 5090.7
#> - Fireplaces          1     11412 420600 5091.3
#> - YearBuilt           1     11603 420790 5091.7
#> - MSZoning            4     22761 431948 5106.6
#> - BsmtFullBath        1     24968 434155 5116.6
#> - YearRemodAdd        1     26364 435551 5119.2
#> - GrLivArea           1     68351 477538 5192.6
#> - OverallQual         7    127978 537166 5274.5
#> 
#> Step:  AIC=5070
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street + 
#>     Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + 
#>     GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces + 
#>     GarageType + sqrt(GarageCars) + sqrt(GarageArea) + sqrt(OpenPorchSF) + 
#>     SaleType
#> 
#>                      Df Sum of Sq    RSS    AIC
#> - sqrt(GarageCars)    1       693 410201 5069.4
#> - LotFrontage         1       697 410206 5069.4
#> <none>                            409509 5070.0
#> - sqrt(GarageArea)    1      1054 410562 5070.1
#> - SaleType            7      8206 417715 5071.8
#> - GarageType          5      6196 415705 5072.0
#> - Street              1      2153 411662 5072.2
#> - sqrt(OpenPorchSF)   1      2235 411744 5072.3
#> - sqrt(TotRmsAbvGrd)  1      3517 413026 5074.8
#> - Condition1          8     15867 425375 5084.3
#> - X1stFlrSF           1     10462 419971 5088.1
#> - LotArea             1     11331 420840 5089.8
#> - Fireplaces          1     12097 421606 5091.2
#> - YearBuilt           1     12687 422195 5092.3
#> - MSZoning            4     23500 433008 5106.5
#> - BsmtFullBath        1     25083 434592 5115.4
#> - YearRemodAdd        1     26112 435620 5117.3
#> - GrLivArea           1     79060 488568 5208.9
#> - OverallQual         7    127752 537261 5272.7
#> 
#> Step:  AIC=5069.35
#> sqrt(SalePrice) ~ MSZoning + LotFrontage + LotArea + Street + 
#>     Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + 
#>     GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces + 
#>     GarageType + sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType
#> 
#>                      Df Sum of Sq    RSS    AIC
#> - LotFrontage         1       827 411028 5069.0
#> <none>                            410201 5069.4
#> - Street              1      1966 412167 5071.2
#> - SaleType            7      8229 418430 5071.2
#> - GarageType          5      6248 416449 5071.4
#> - sqrt(OpenPorchSF)   1      2196 412397 5071.6
#> - sqrt(TotRmsAbvGrd)  1      3232 413433 5073.6
#> - sqrt(GarageArea)    1      6485 416687 5079.9
#> - Condition1          8     15611 425813 5083.2
#> - X1stFlrSF           1     10359 420561 5087.3
#> - LotArea             1     10964 421166 5088.4
#> - Fireplaces          1     13053 423254 5092.3
#> - YearBuilt           1     14282 424483 5094.7
#> - MSZoning            4     23087 433288 5105.0
#> - BsmtFullBath        1     24766 434967 5114.1
#> - YearRemodAdd        1     26637 436838 5117.6
#> - GrLivArea           1     78760 488961 5207.5
#> - OverallQual         7    129216 539418 5273.9
#> 
#> Step:  AIC=5068.96
#> sqrt(SalePrice) ~ MSZoning + LotArea + Street + Condition1 + 
#>     OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + GrLivArea + 
#>     BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces + GarageType + 
#>     sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType
#> 
#>                      Df Sum of Sq    RSS    AIC
#> <none>                            411028 5069.0
#> - GarageType          5      6061 417088 5070.6
#> - Street              1      1924 412952 5070.7
#> - SaleType            7      8241 419269 5070.8
#> - sqrt(OpenPorchSF)   1      2316 413344 5071.4
#> - sqrt(TotRmsAbvGrd)  1      3198 414226 5073.1
#> - sqrt(GarageArea)    1      7038 418066 5080.5
#> - Condition1          8     16121 427149 5083.7
#> - X1stFlrSF           1     10668 421695 5087.4
#> - Fireplaces          1     12799 423827 5091.4
#> - YearBuilt           1     13898 424926 5093.5
#> - LotArea             1     18281 429309 5101.7
#> - MSZoning            4     24616 435644 5107.4
#> - BsmtFullBath        1     24596 435624 5113.3
#> - YearRemodAdd        1     26350 437378 5116.5
#> - GrLivArea           1     78425 489453 5206.3
#> - OverallQual         7    129265 540293 5273.2
summary(model_tuning_bwd)
#> 
#> Call:
#> lm(formula = sqrt(SalePrice) ~ MSZoning + LotArea + Street + 
#>     Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + 
#>     GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces + 
#>     GarageType + sqrt(GarageArea) + sqrt(OpenPorchSF) + SaleType, 
#>     data = house_train_outlier)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -141.385  -11.810    1.202   12.867   71.895 
#> 
#> Coefficients:
#>                         Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept)        -1185.5496140   123.7422772  -9.581 < 0.0000000000000002 ***
#> MSZoningFV            47.8327683    11.7181389   4.082     0.00004942175816 ***
#> MSZoningRH            40.8003856    13.4583576   3.032             0.002516 ** 
#> MSZoningRL            47.6556618    10.2449909   4.652     0.00000388838676 ***
#> MSZoningRM            32.6167465    10.2439534   3.184             0.001512 ** 
#> LotArea                0.0021135     0.0003647   5.795     0.00000001004670 ***
#> StreetPave            40.1259346    21.3463959   1.880             0.060527 .  
#> Condition1Feedr        2.4689381     6.3809933   0.387             0.698924    
#> Condition1Norm        10.0801595     5.5097687   1.830             0.067718 .  
#> Condition1PosA        12.1236213    17.5064693   0.693             0.488822    
#> Condition1PosN        20.5070543    10.6439358   1.927             0.054399 .  
#> Condition1RRAe       -20.5124816     9.9282631  -2.066             0.039162 *  
#> Condition1RRAn         0.9719768     8.1581268   0.119             0.905194    
#> Condition1RRNe        24.3649777    24.1708902   1.008             0.313763    
#> Condition1RRNn        53.4566319    20.8362100   2.566             0.010493 *  
#> OverallQual3          22.3893533    19.2112411   1.165             0.244213    
#> OverallQual4          51.0667564    17.6013377   2.901             0.003824 ** 
#> OverallQual5          63.2400359    17.6278380   3.588             0.000355 ***
#> OverallQual6          75.9773606    17.7257100   4.286     0.00002052194416 ***
#> OverallQual7          91.2502798    17.9776344   5.076     0.00000048643747 ***
#> OverallQual8         119.1814461    18.2830832   6.519     0.00000000012970 ***
#> OverallQual9         154.6624657    29.5499614   5.234     0.00000021525869 ***
#> YearBuilt              0.2595350     0.0513673   5.053     0.00000054731803 ***
#> YearRemodAdd           0.3762884     0.0540873   6.957     0.00000000000754 ***
#> X1stFlrSF              0.0163162     0.0036860   4.427     0.00001098513175 ***
#> GrLivArea              0.0540103     0.0045000  12.002 < 0.0000000000000002 ***
#> BsmtFullBath          12.1182690     1.8028793   6.722     0.00000000003543 ***
#> sqrt(TotRmsAbvGrd)   -13.0538697     5.3857683  -2.424             0.015594 *  
#> Fireplaces             7.5785219     1.5629759   4.849     0.00000150881746 ***
#> GarageTypeAttchd      31.9801391    13.9081963   2.299             0.021756 *  
#> GarageTypeBasment     20.2429567    15.5219714   1.304             0.192580    
#> GarageTypeBuiltIn     31.0424046    14.5649284   2.131             0.033386 *  
#> GarageTypeCarPort     12.8412252    16.8497521   0.762             0.446237    
#> GarageTypeDetchd      31.1441776    13.8167248   2.254             0.024476 *  
#> sqrt(GarageArea)       1.1124980     0.3094135   3.596             0.000345 ***
#> sqrt(OpenPorchSF)      0.5393338     0.2614866   2.063             0.039495 *  
#> SaleTypeCon           39.8799194    24.5923544   1.622             0.105298    
#> SaleTypeConLD         12.8355513    12.8187209   1.001             0.316996    
#> SaleTypeConLI         -8.2347973    14.7519869  -0.558             0.576862    
#> SaleTypeConLw         12.6486178    14.8703324   0.851             0.395265    
#> SaleTypeCWD           45.1134960    24.2419945   1.861             0.063138 .  
#> SaleTypeNew           19.8067478     6.3970793   3.096             0.002033 ** 
#> SaleTypeWD            10.1147931     4.9676047   2.036             0.042084 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 23.33 on 755 degrees of freedom
#> Multiple R-squared:  0.8488, Adjusted R-squared:  0.8403 
#> F-statistic: 100.9 on 42 and 755 DF,  p-value: < 0.00000000000000022
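
The AIC values printed in the step-wise trace can be checked by hand. A minimal sketch, assuming the trace uses the usual least-squares form AIC = n * log(RSS / n) + 2 * edf, with n = 798 observations inferred from the 755 residual degrees of freedom plus the 43 estimated coefficients shown in the summary:

```r
# Values read from the final step of the trace and the summary above
rss <- 411028   # residual sum of squares of the final model
edf <- 43       # estimated coefficients (42 predictors terms + intercept)
n   <- 755 + edf  # residual df + estimated coefficients = 798 observations

# AIC in the form assumed for the step-wise trace
aic_final <- n * log(rss / n) + 2 * edf
round(aic_final, 2)
```

The result agrees with the `Step:  AIC=5068.96` line up to rounding of the printed RSS, which is why dropping a term only helps when the rise in RSS is small relative to the 2-per-coefficient penalty it saves.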
# Assign to a new object from backward step-wise with lowest AIC score

step_model <- lm(formula = sqrt(SalePrice) ~ MSZoning + LotArea + Street + 
    Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + 
    GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces + 
    GarageType + sqrt(GarageCars) + PavedDrive + sqrt(OpenPorchSF), 
    data = house_train_outlier)

summary(step_model)
#> 
#> Call:
#> lm(formula = sqrt(SalePrice) ~ MSZoning + LotArea + Street + 
#>     Condition1 + OverallQual + YearBuilt + YearRemodAdd + X1stFlrSF + 
#>     GrLivArea + BsmtFullBath + sqrt(TotRmsAbvGrd) + Fireplaces + 
#>     GarageType + sqrt(GarageCars) + PavedDrive + sqrt(OpenPorchSF), 
#>     data = house_train_outlier)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -142.41  -12.11    1.65   12.86   70.85 
#> 
#> Coefficients:
#>                         Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept)        -1208.2984979   125.0734246  -9.661 < 0.0000000000000002 ***
#> MSZoningFV            50.0889398    11.6262010   4.308    0.000018612838147 ***
#> MSZoningRH            42.5967228    13.4652658   3.163             0.001621 ** 
#> MSZoningRL            46.7804527    10.2008932   4.586    0.000005284496505 ***
#> MSZoningRM            31.8018003    10.1839850   3.123             0.001860 ** 
#> LotArea                0.0023296     0.0003615   6.444    0.000000000207131 ***
#> StreetPave            39.6475164    21.5124593   1.843             0.065718 .  
#> Condition1Feedr        2.6781205     6.4071369   0.418             0.676072    
#> Condition1Norm         9.5608921     5.5468660   1.724             0.085177 .  
#> Condition1PosA        12.6366657    17.5883350   0.718             0.472689    
#> Condition1PosN        19.8276346    10.5008782   1.888             0.059381 .  
#> Condition1RRAe       -22.5903214     9.8814318  -2.286             0.022521 *  
#> Condition1RRAn         0.0251232     8.2226516   0.003             0.997563    
#> Condition1RRNe        24.0103669    24.3008947   0.988             0.323445    
#> Condition1RRNn        57.6447242    20.9007983   2.758             0.005955 ** 
#> OverallQual3          17.8207239    19.4331200   0.917             0.359419    
#> OverallQual4          49.2644705    17.7235749   2.780             0.005577 ** 
#> OverallQual5          60.9327469    17.7571891   3.431             0.000633 ***
#> OverallQual6          73.1408528    17.8609274   4.095    0.000046739449480 ***
#> OverallQual7          89.9823886    18.0867720   4.975    0.000000807221600 ***
#> OverallQual8         117.5375402    18.3887588   6.392    0.000000000285855 ***
#> OverallQual9         153.1181821    29.7133326   5.153    0.000000326754145 ***
#> YearBuilt              0.2452017     0.0540553   4.536    0.000006656531028 ***
#> YearRemodAdd           0.4079384     0.0539746   7.558    0.000000000000118 ***
#> X1stFlrSF              0.0168860     0.0036582   4.616    0.000004592604823 ***
#> GrLivArea              0.0529243     0.0044757  11.825 < 0.0000000000000002 ***
#> BsmtFullBath          12.0178470     1.7871470   6.725    0.000000000034607 ***
#> sqrt(TotRmsAbvGrd)   -13.6751331     5.4038522  -2.531             0.011587 *  
#> Fireplaces             7.3269770     1.5717891   4.662    0.000003706088344 ***
#> GarageTypeAttchd      29.3776897    13.9682468   2.103             0.035779 *  
#> GarageTypeBasment     18.2469830    15.6016305   1.170             0.242546    
#> GarageTypeBuiltIn     27.9145622    14.5981125   1.912             0.056226 .  
#> GarageTypeCarPort      8.0907643    16.9038593   0.479             0.632337    
#> GarageTypeDetchd      28.6098451    13.8811916   2.061             0.039638 *  
#> sqrt(GarageCars)      17.9522014     5.0237734   3.573             0.000375 ***
#> PavedDriveP            8.6330662     7.6417951   1.130             0.258952    
#> PavedDriveY            5.8898092     4.6117561   1.277             0.201947    
#> sqrt(OpenPorchSF)      0.5710397     0.2615929   2.183             0.029346 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 23.46 on 760 degrees of freedom
#> Multiple R-squared:  0.846,  Adjusted R-squared:  0.8386 
#> F-statistic: 112.9 on 37 and 760 DF,  p-value: < 0.00000000000000022

Insight of step_model :
- The adjusted R-squared, which represents the goodness of fit of the model, is 0.8386. This means about 83.86% of the variance in sqrt(SalePrice) is explained by step_model.
- The OverallQual levels from OverallQual6 to OverallQual9, which rate the overall material and finish of the house, have the largest positive coefficient estimates in step_model. The intercept, MSZoningFV, and StreetPave also have large estimates, although StreetPave is not significant at the 0.05 level (p-value = 0.0657).
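
The adjusted R-squared in the step_model summary can be reproduced from the reported multiple R-squared and degrees of freedom. A small sketch, assuming the standard adjustment formula and using only the rounded values printed in the summary:

```r
# Values read from summary(step_model) above
r2       <- 0.846   # multiple R-squared (rounded as printed)
df_resid <- 760     # residual degrees of freedom
n        <- 760 + 38  # residual df + 38 estimated coefficients = 798

# Adjusted R-squared penalises R-squared for the number of coefficients
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)
```

The small gap to the printed 0.8386 comes from using the rounded R-squared as input.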

6.7.2 Model Performance Comparison

For the two candidate models, model_tuning and step_model, a performance comparison is made. The goal is to find the regression model that best fits the data.

comparison_tuning <- compare_performance(model_tuning,step_model)

as.data.frame(comparison_tuning)

Insight:
The best linear regression model by each criterion is as follows:
- Based on Adj. R-squared with highest score (R2_adjusted column) : model_tuning.
- Based on AIC with lowest score (AIC column) : step_model.
- Based on RMSE with lowest score (RMSE column) : model_tuning.
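
The three criteria can also be computed with base R alone. The sketch below is illustrative only: it uses the built-in mtcars data as a stand-in, and the two formulas and the helper `rmse()` are hypothetical, not the models from this report.

```r
# Stand-in data and formulas (assumptions for illustration only)
fit_full  <- lm(sqrt(mpg) ~ wt + hp + disp + qsec, data = mtcars)
fit_small <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)

# In-sample root mean squared error on the scale of the fitted response
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

comparison <- data.frame(
  model       = c("fit_full", "fit_small"),
  R2_adjusted = c(summary(fit_full)$adj.r.squared,
                  summary(fit_small)$adj.r.squared),
  AIC         = c(AIC(fit_full), AIC(fit_small)),
  RMSE        = c(rmse(fit_full), rmse(fit_small))
)
comparison
```

For nested models the larger one never has a higher in-sample RMSE, so AIC and adjusted R-squared, which penalise extra coefficients, are the more informative columns when the criteria disagree.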

6.7.3 Assumption Test

6.7.3.1 Linearity

In this section, the linearity assumption is checked with a residuals vs fitted plot for model_tuning.

plot(model_tuning, which = 1)
abline(h = 10, col = "green")
abline(h = -10, col = "green")

The plot shows that the residuals scatter randomly around 0 without a clear pattern, and most residuals are not far from 0, so the condition E(ϵ)=0 is fulfilled. The red line stays within the tolerance range of -10 to +10, so model_tuning can be treated as a linear model and the linearity assumption is met.

6.7.3.2 Normality of Residuals

The linear regression model is expected to produce errors that are normally distributed and mostly located around 0.

# Plot 1
hist(model_tuning$residuals,breaks = 3)

# Plot 2

res_tun <- data.frame(residual = model_tuning$residuals)

res_tun %>% 
  e_charts() %>% 
  e_histogram(residual, name = "Error", legend = F) %>% 
  e_title(
    text = "Normality of Residuals",
    textStyle = list(fontFamily = "Cardo", fontSize = 25),
    subtext = "Normally Distributed Error",
    subtextStyle = list(fontFamily = "Cardo", fontSize = 15, fontStyle = "italic"),
    left = "center") %>% 
  e_hide_grid_lines() %>% 
  e_tooltip() %>% 
  e_x_axis(axisLabel = list(fontFamily = "Cardo")) %>% 
  e_y_axis(axisLabel = list(color = "#FAF7E6"))
# Shapiro wilk test

shapiro.test(model_tuning$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_tuning$residuals
#> W = 0.95853, p-value = 0.00000000000003094

Based on the histogram above, most of the residuals are located around 0. However, the Shapiro-Wilk test gives a p-value lower than 0.05, so H0 (the errors are normally distributed) is rejected. In other words, the residuals are centered near 0 but do not follow a normal distribution, so the normality assumption is not met.
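
Being centered at 0 and being normally distributed are separate properties. A simulated sketch (not the residuals of model_tuning; the sample and seed are assumptions for illustration) shows a sample with mean close to 0 that the Shapiro-Wilk test still rejects:

```r
set.seed(42)

# Right-skewed sample shifted so that its mean is close to 0
skewed <- rexp(500) - 1

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw <- shapiro.test(skewed)

mean(skewed)        # close to 0
sw$p.value < 0.05   # H0 of normality is rejected
```

A Q-Q plot of the residuals (`qqnorm()` followed by `qqline()`) is a common visual companion to the test, since it shows where the departure from normality occurs.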

6.7.3.3 Homoscedasticity of Residuals

bptest(model_tuning)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_tuning
#> BP = 73.313, df = 57, p-value = 0.07166

Breusch-Pagan hypothesis test:

H0: the error variance is constant (homoscedasticity)
H1: the error variance is not constant (heteroscedasticity)

Based on the BP test, the p-value (0.07166) is higher than 0.05, so H0 is not rejected. At the 0.05 significance level, model_tuning therefore shows no significant evidence of heteroscedasticity, although the p-value is close to the threshold. If heteroscedasticity were detected, the usual remedy would be a data transformation on the target or predictor variables.
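
The reported p-value can be reproduced from the test statistic and its degrees of freedom, assuming the statistic is referred to a chi-squared distribution as the output suggests. A short sketch using only the numbers printed above:

```r
# Values read from the bptest() output above
bp_stat <- 73.313
bp_df   <- 57

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
round(p_value, 5)

# Decision at the 0.05 level: TRUE means H0 (homoscedasticity) is rejected
reject_h0 <- p_value < 0.05
reject_h0
```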

6.7.3.4 No Multicollinearity

In this section, the correlation between predictor variables is measured with the vif() function from library(car), in order to discover which predictors are strongly related to each other and may therefore be redundant. If such predictors are found, only one of them is kept so that there is no redundant predictor in the model.

vif(model_tuning)
#>                        GVIF Df GVIF^(1/(2*Df))
#> MSZoning           3.300226  4        1.160962
#> LotFrontage        2.023633  1        1.422545
#> LotArea            2.320601  1        1.523352
#> Street             1.719076  1        1.311135
#> LotConfig          1.570648  4        1.058059
#> Condition1         3.800535  8        1.087027
#> Condition2         1.964427  4        1.088064
#> OverallQual        4.692352  7        1.116751
#> YearBuilt          4.287220  1        2.070560
#> YearRemodAdd       1.909852  1        1.381974
#> X1stFlrSF          2.438140  1        1.561454
#> GrLivArea          5.609487  1        2.368436
#> BsmtFullBath       1.266517  1        1.125396
#> FullBath           3.205857  1        1.790491
#> HalfBath           2.334597  1        1.527939
#> sqrt(BedroomAbvGr) 2.363064  1        1.537226
#> sqrt(TotRmsAbvGrd) 3.884699  1        1.970964
#> Fireplaces         1.429814  1        1.195748
#> GarageType         4.073371  5        1.150788
#> sqrt(GarageCars)   4.595087  1        2.143615
#> sqrt(GarageArea)   4.218352  1        2.053863
#> PavedDrive         1.530158  2        1.112203
#> sqrt(OpenPorchSF)  1.421658  1        1.192333
#> SaleType           1.798233  7        1.042805

There is no multicollinearity among the predictors because every VIF score is lower than 10 (VIF<10).
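
For factors with more than one degree of freedom, the last column of the output rescales GVIF so that terms of different sizes are comparable. A small sketch that reproduces that column from the first two, using three rows copied from the table above:

```r
# GVIF and Df copied from the vif() output above
gvif <- c(MSZoning = 3.300226, YearBuilt = 4.287220, GrLivArea = 5.609487)
df   <- c(MSZoning = 4,        YearBuilt = 1,        GrLivArea = 1)

# Adjusted value as labelled in the output: GVIF^(1/(2*Df))
gvif_adj <- gvif^(1 / (2 * df))
round(gvif_adj, 6)
```

For a term with Df = 1 the adjusted value is simply the square root of the VIF, so squaring the last column is one way to put every term on the ordinary VIF scale before applying a cutoff such as 10.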

6.8 Model Interpretation

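After the assumption tests for model_tuning, the linearity and no-multicollinearity assumptions are met and the Breusch-Pagan test does not reject homoscedasticity at the 0.05 level (p-value = 0.07166), while the normality of residuals is not met. The model interpretation is as follows: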

\[
\begin{aligned}
\sqrt{\text{SalePrice}} ={}& -1115.02 \\
&+ 43.08\,\text{MSZoningFV} + 26.69\,\text{MSZoningRH} + 35.50\,\text{MSZoningRL} + 23.16\,\text{MSZoningRM} \\
&+ 0.018\,\text{LotFrontage} + 0.0016\,\text{LotArea} + 65.78\,\text{StreetPave} \\
&+ 5.37\,\text{LotConfigCulDSac} + 0.65\,\text{LotConfigFR2} - 21.04\,\text{LotConfigFR3} + 1.36\,\text{LotConfigInside} \\
&+ 8.28\,\text{Condition1Feedr} + 15.69\,\text{Condition1Norm} + 16.28\,\text{Condition1PosA} + 32.19\,\text{Condition1PosN} \\
&- 15.53\,\text{Condition1RRAe} + 7.54\,\text{Condition1RRAn} + 8.95\,\text{Condition1RRNe} + 42.91\,\text{Condition1RRNn} \\
&- 3.34\,\text{Condition2Feedr} + 6.77\,\text{Condition2Norm} - 19.88\,\text{Condition2RRAn} + 8.76\,\text{Condition2RRNn} \\
&+ 54.49\,\text{OverallQual3} + 75.00\,\text{OverallQual4} + 92.84\,\text{OverallQual5} + 103.13\,\text{OverallQual6} \\
&+ 119.86\,\text{OverallQual7} + 146.25\,\text{OverallQual8} + 176.98\,\text{OverallQual9} \\
&+ 0.22\,\text{YearBuilt} + 0.34\,\text{YearRemodAdd} + 0.01\,\text{X1stFlrSF} + 0.05\,\text{GrLivArea} \\
&+ 11.96\,\text{BsmtFullBath} + 1.06\,\text{FullBath} + 1.21\,\text{HalfBath} \\
&- 3.33\,\sqrt{\text{BedroomAbvGr}} - 10.53\,\sqrt{\text{TotRmsAbvGrd}} + 6.99\,\text{Fireplaces} \\
&+ 32.03\,\text{GarageTypeAttchd} + 26.51\,\text{GarageTypeBasment} + 32.67\,\text{GarageTypeBuiltIn} \\
&+ 11.15\,\text{GarageTypeCarPort} + 30.52\,\text{GarageTypeDetchd} \\
&+ 13.34\,\sqrt{\text{GarageCars}} + 0.73\,\sqrt{\text{GarageArea}} \\
&+ 8.79\,\text{PavedDriveP} + 10.91\,\text{PavedDriveY} + 0.46\,\sqrt{\text{OpenPorchSF}} \\
&+ 30.72\,\text{SaleTypeCon} + 5.93\,\text{SaleTypeConLD} - 10.33\,\text{SaleTypeConLI} + 10.83\,\text{SaleTypeConLw} \\
&+ 16.83\,\text{SaleTypeCWD} + 20.50\,\text{SaleTypeNew} + 28.27\,\text{SaleTypeOth} + 9.88\,\text{SaleTypeWD}
\end{aligned}
\]

  • Linear Regression Model Used: model_tuning

  • 1. Intercept: The value of the target variable when all predictors are equal to zero.

    • Interpretation of the model's intercept:
      • When all predictors are equal to 0, the predicted sqrt(SalePrice) is -1115.02. This value has no practical meaning, because predictors such as YearBuilt are never 0 in the data.
  • 2. Coefficient/Slope: A 1-unit increase in a predictor changes the target variable by the slope value, with all other predictors held constant.

    • Interpretation of the model's coefficients/slopes:
      • There are 7 terms with a negative coefficient. They decrease sqrt(SalePrice) when one of them increases by 1 unit and all other predictors are held constant. Because the target is sqrt(SalePrice), the decrease in SalePrice itself is found by squaring the predicted values, not by exponentiating the coefficient.
      • The rest of the terms increase sqrt(SalePrice) by their slope value when one of them increases by 1 unit and all other predictors are held constant. Again, the increase in SalePrice itself is found by squaring the predicted values, not by exponentiating the coefficient.
  • 3. Predictor Significance: Shows which predictors have a significant effect on the target variable.

    • Interpretation of the model's predictor significance:
      • There are 27 terms with a p-value lower than 0.05, which have a significant effect on the target variable.
  • 4. Goodness of Fit (R-squared): Describes how well the predictors explain the variance of the target variable.

    • Interpretation of the model's R-squared:
      • The adjusted R-squared, which represents the goodness of fit of the model, is 0.8517. This means about 85.17% of the variance in the target is explained by model_tuning.
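
Because the target is on the square-root scale, the effect of a coefficient on SalePrice depends on the starting price. A minimal sketch, using the Fireplaces coefficient (6.99) from the equation above and a hypothetical baseline prediction of 400 on the sqrt scale (an assumption chosen for illustration, equal to a price of 160,000):

```r
# Coefficient of Fireplaces from the fitted equation (sqrt scale)
b_fireplaces <- 6.99

# Hypothetical baseline prediction on the sqrt scale (assumption)
sqrt_price_base <- 400

# Back-transform by squaring, before and after adding one fireplace
price_base <- sqrt_price_base^2
price_new  <- (sqrt_price_base + b_fireplaces)^2

# Change in SalePrice: 2 * baseline * b + b^2
price_change <- price_new - price_base
price_change
```

The same coefficient gives a larger change in SalePrice for a more expensive house, since the change equals 2 * sqrt_price_base * b + b^2.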

7 Conclusion

  • Based on the summary of step_model, the terms that do not significantly influence SalePrice at the 0.05 level are StreetPave, Condition1Feedr, Condition1Norm, Condition1PosA, Condition1PosN, Condition1RRAn, Condition1RRNe, OverallQual3, GarageTypeBasment, GarageTypeBuiltIn, GarageTypeCarPort, PavedDriveP, and PavedDriveY.

  • The model_tuning model is suitable for making predictions, while step_model is appropriate for examining the significance of the predictors. Step-wise regression is a greedy algorithm that focuses on finding good results in a relatively short time, but it does not necessarily give the optimal result. Therefore, this research uses the result of step-wise regression as a model recommendation, and there is still room for improvement.

  • Several numeric variables with large p-values were transformed with sqrt(), and some residuals still fall outside the -10 and +10 limits. The residuals are also not normally distributed (Shapiro-Wilk p-value lower than 0.05), and the Breusch-Pagan p-value (0.07166) is only slightly above 0.05. It is recommended to use a more complex model (a model without these assumptions) so that it can capture non-linear relationships.