1 Introduction

In this LBB Project, we will create Regression Model using dataset provided by Kaggle to analyze house pricing based on its property attributes.

2 Data Preparation

2.1 Import Libraries

# Library setup and Installation necessary packages
library(dplyr)
library(lubridate) # working with datetime
library(GGally) # correlation relationship
library(MLmetrics) # for MAE calculations
library(performance) # model performance comparison
library(lmtest) # Testing Linear Regression Model
library(car)

2.2 Data Wrangling

2.2.1 Read Dataset

# Read dataset
house <- read.csv("data_input/HousePrices_HalfMil.csv")#, stringsAsFactors = F) %>% 

str(house)
#> 'data.frame':    500000 obs. of  16 variables:
#>  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
#>  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
#>  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
#>  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
#>  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
#>  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
#>  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
#>  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
#>  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
#>  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
#>  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
#>  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
#>  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
#>  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
#>  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
#>  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

2.2.2 Explore Dataset

As there have not been prior knowledge on information regarding variables in the dataset given from the data source information page, hence we will explore the dataset further to interpret the variables meaning of the house attributes.

glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble  <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City          <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar         <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric      <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber         <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors   <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden        <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …

There are 500,000 observation data in the dataset

summary(house)
#>       Area           Garage        FirePlace         Baths      
#>  Min.   :  1.0   Min.   :1.000   Min.   :0.000   Min.   :1.000  
#>  1st Qu.: 63.0   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
#>  Median :125.0   Median :2.000   Median :2.000   Median :3.000  
#>  Mean   :124.9   Mean   :2.001   Mean   :2.003   Mean   :2.998  
#>  3rd Qu.:187.0   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000  
#>  Max.   :249.0   Max.   :3.000   Max.   :4.000   Max.   :5.000  
#>   White.Marble    Black.Marble    Indian.Marble        Floors      
#>  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
#>  Median :0.000   Median :0.0000   Median :0.0000   Median :0.0000  
#>  Mean   :0.333   Mean   :0.3327   Mean   :0.3343   Mean   :0.4994  
#>  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
#>  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
#>       City           Solar           Electric          Fiber       
#>  Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
#>  Median :2.000   Median :0.0000   Median :1.0000   Median :1.0000  
#>  Mean   :2.001   Mean   :0.4987   Mean   :0.5007   Mean   :0.5005  
#>  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
#>  Max.   :3.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
#>   Glass.Doors      Swiming.Pool        Garden           Prices     
#>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   : 7725  
#>  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:33500  
#>  Median :0.0000   Median :1.0000   Median :1.0000   Median :41850  
#>  Mean   :0.4999   Mean   :0.5004   Mean   :0.5016   Mean   :42050  
#>  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:50750  
#>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :77975

Check any missing values in the dataset

# Check for missing values
colSums(is.na(house))
#>          Area        Garage     FirePlace         Baths  White.Marble 
#>             0             0             0             0             0 
#>  Black.Marble Indian.Marble        Floors          City         Solar 
#>             0             0             0             0             0 
#>      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
#>             0             0             0             0             0 
#>        Prices 
#>             0

Check for unique values on several columns

# Check unique values 
unique(house$Garage)
#> [1] 2 1 3
unique(house$FirePlace)
#> [1] 0 4 3 1 2
unique(house$Baths)
#> [1] 2 4 3 1 5
unique(house$City)
#> [1] 3 2 1

All the columns except Area and Prices can be changed to categorial type as those columns have unique values at most 5 unique values.

2.2.3 Dataset Descriptions

Based on information from our data exploration on the structure of our original dataset house, we can interpret the variable attributes as follows: - Area : Area of the house - Garage : Number of Garages in the house
- FirePlace : Number of Fireplaces built in the house
- Baths : Number of Bathrooms in the house
- White.Marble : Whether the house using white marble. 1 for Yes, 0 for No. - Black.Marble : Whether the house using black marble. 1 for Yes, 0 for No.
- Indian.Marble : Whether the house using black marble. 1 for Yes, 0 for No. - Floors : How many levels of floors in the house. 0 for ground level, 1 for 1 level above ground level, etc.
- City : Code for city area which the house is located
- Solar : Whether the house is using Solar based power sourced. 1 for Yes, 0 for No.
- Electric : Whether the house is using Electricity based power sourced. 1 for Yes, 0 for No. - Fiber : Whether the house is using Fiber. 1 for Yes, 0 for No. - Glass.Doors : Whether the house has Glass Doors. 1 for Yes, 0 for No. - Swiming.Pool : Whether the house has Swiming Pool. 1 for Yes, 0 for No. - Garden : Whether the house has a Garden. 1 for Yes, 0 for No.
- Prices : House Price

2.3 Data Cleansing

We will change all the data types except Area and Price to categorial factor type and store the new dataset into new dataframe named house_clean.

# Change data_type 
house_clean <- house %>% 
  mutate_at(vars(Garage, FirePlace, Baths, White.Marble, Black.Marble, 
                 Indian.Marble, Floors, City, Solar, Electric, 
                 Fiber, Glass.Doors, Swiming.Pool, Garden),
            as.factor)

# Confirm Data Type Change
glimpse(house_clean)
#> Rows: 500,000
#> Columns: 16
#> $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage        <fct> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace     <fct> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths         <fct> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble  <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble  <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors        <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City          <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar         <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric      <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber         <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors   <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden        <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …

3 Data Exploration

3.1 Business Question

In this project, we will analyze House Pricing Prices based on the house attributes to find which house attributes affect house pricing.

  • Variable Target : Prices

3.2 Variable Correlations

Check variable correlations using dataframe house_clean

ggcorr(house_clean, label = T)

Because all the variables have been changed to categorial type, hence the only attribute that can be used to analyze Price can only be Area.

This is not what we wanted because our purpose is to analyze House Pricing based on the House’s attributes, so we will need to keep the data types of all the columns as it is.

Check variable correlations using dataframe house

ggcorr(house, label = T, hjust = T, layout.exp = 3)

Based on the figure above, we can observe the correlations between each variable with Prices and begin to create several modelling to analyze as follows:

  1. Modelling using predictor variable with the highest correlations which is:

    • Floors : 0.6
  2. Modelling using predictor variables with higher correlations of ranges > 0.3 or < -0.3 :

    • Floors : 0.6
    • Fiber : 0.5
    • Indian.Marble : -0.4
    • White.Marble : 0.4
  3. Modelling using all variables as predictor variables.

4 Data Modelling

4.1 First Model with Highest Correlation Variable

Predictor Variable : Floors

model_price_1 <- lm(formula = Prices ~ Floors, 
                    data = house)
summary(model_price_1)
#> 
#> Call:
#> lm(formula = Prices ~ Floors, data = house)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -27211  -6986    -11   6592  28542 
#> 
#> Coefficients:
#>             Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept) 34557.66      19.00    1819 <0.0000000000000002 ***
#> Floors      15003.39      26.89     558 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9507 on 499998 degrees of freedom
#> Multiple R-squared:  0.3837, Adjusted R-squared:  0.3837 
#> F-statistic: 3.113e+05 on 1 and 499998 DF,  p-value: < 0.00000000000000022

💡 Insight :

  • Using one predictor variable Floors, R-squared value is 0.3837 = 38.37%.

  • 38.37% variance from target, house pricing Prices can be explained from our chosen predictor variable.

4.2 Second Model using Higher Correlation Variables

Predictor Variables : Floors, Fiber, Indian.Marble, White.Marble.

model_price_2 <- lm(formula = Prices ~ Floors + Fiber + Indian.Marble + White.Marble, 
                    data = house)
summary(model_price_2)
#> 
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + Indian.Marble + White.Marble, 
#>     data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15059.3  -3325.2     -0.2   3328.2  14999.8 
#> 
#> Coefficients:
#>               Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)   27362.24      14.98  1827.1 <0.0000000000000002 ***
#> Floors        14987.99      13.39  1119.1 <0.0000000000000002 ***
#> Fiber         11724.54      13.39   875.4 <0.0000000000000002 ***
#> Indian.Marble -4990.47      16.40  -304.3 <0.0000000000000002 ***
#> White.Marble   9020.49      16.41   549.5 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4735 on 499995 degrees of freedom
#> Multiple R-squared:  0.8471, Adjusted R-squared:  0.8471 
#> F-statistic: 6.926e+05 on 4 and 499995 DF,  p-value: < 0.00000000000000022

💡 Insight :

  • Using four predictor variable, R-squared value is 0.8471 = 84.71%.

  • 84.71% variance from target, house pricing Prices can be explained from our four chosen predictor variables.

4.3 Third Model using All Variables

Predictor Variables : all variables.

model_price_all <- lm(formula = Prices ~ . , 
                    data = house)
summary(model_price_all)
#> 
#> Call:
#> lm(formula = Prices ~ ., data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134259  0.000000000  0.000000000  0.000000001  0.000000359 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                           Estimate           Std. Error            t value
#> (Intercept)     999.99999980934274     0.00000000150884   662761448609.378
#> Area             25.00000000000007     0.00000000000374  6684081274085.180
#> Garage         1500.00000000008686     0.00000000032868  4563726084680.788
#> FirePlace       750.00000000030684     0.00000000018991  3949347384946.940
#> Baths          1250.00000000023488     0.00000000018988  6583118144426.266
#> White.Marble  13999.99999999921420     0.00000000065744 21294550363191.246
#> Black.Marble   4999.99999999932425     0.00000000065759  7603483994802.182
#> Indian.Marble                   NA                   NA                 NA
#> Floors        15000.00000000155524     0.00000000053706 27929898615753.586
#> City           3499.99999999960710     0.00000000032900 10638349212123.465
#> Solar           249.99999999946226     0.00000000053707   465491312128.012
#> Electric       1249.99999999946249     0.00000000053706  2327493360996.926
#> Fiber         11749.99999999948704     0.00000000053707 21878164464387.484
#> Glass.Doors    4449.99999999952979     0.00000000053706  8285809124254.771
#> Swiming.Pool      0.00000000054141     0.00000000053706              1.008
#> Garden            0.00000000053856     0.00000000053707              1.003
#>                          Pr(>|t|)    
#> (Intercept)   <0.0000000000000002 ***
#> Area          <0.0000000000000002 ***
#> Garage        <0.0000000000000002 ***
#> FirePlace     <0.0000000000000002 ***
#> Baths         <0.0000000000000002 ***
#> White.Marble  <0.0000000000000002 ***
#> Black.Marble  <0.0000000000000002 ***
#> Indian.Marble                  NA    
#> Floors        <0.0000000000000002 ***
#> City          <0.0000000000000002 ***
#> Solar         <0.0000000000000002 ***
#> Electric      <0.0000000000000002 ***
#> Fiber         <0.0000000000000002 ***
#> Glass.Doors   <0.0000000000000002 ***
#> Swiming.Pool                0.313    
#> Garden                      0.316    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001899 on 499985 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.453e+26 on 14 and 499985 DF,  p-value: < 0.00000000000000022

💡 Insight :

  • Using all variables as our predictor variable, R-squared value is 1 = 100%.

  • 100% variance from target, house pricing Prices can be explained from all variables above.

  • Significant predictor variables almost all except Indian.Marble, Swimming.Pool and Garden.

  • ⚠️ R-square value = 1, we may need to be cautious that this model may be overfitting.

4.4 Fourth Model with all except Indian.Marble

Based on the Third Model using All Variables as the predictor variables, we observed that the value of R-squared equals to 1 (one) and there is one variable Indian.Marble gives us NA value and we will create additional model by eliminating this variable to see the effect it gave to our model.

Predictor Variables : all except Indian.Marble.

model_price_no_IndianMarble <- lm(formula = Prices ~ . - Indian.Marble, 
                    data = house)
summary(model_price_no_IndianMarble)
#> 
#> Call:
#> lm(formula = Prices ~ . - Indian.Marble, data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134259  0.000000000  0.000000000  0.000000001  0.000000359 
#> 
#> Coefficients:
#>                          Estimate           Std. Error            t value
#> (Intercept)    999.99999980934274     0.00000000150884   662761448609.378
#> Area            25.00000000000007     0.00000000000374  6684081274085.180
#> Garage        1500.00000000008686     0.00000000032868  4563726084680.788
#> FirePlace      750.00000000030684     0.00000000018991  3949347384946.940
#> Baths         1250.00000000023488     0.00000000018988  6583118144426.266
#> White.Marble 13999.99999999921420     0.00000000065744 21294550363191.246
#> Black.Marble  4999.99999999932425     0.00000000065759  7603483994802.182
#> Floors       15000.00000000155524     0.00000000053706 27929898615753.586
#> City          3499.99999999960710     0.00000000032900 10638349212123.465
#> Solar          249.99999999946226     0.00000000053707   465491312128.012
#> Electric      1249.99999999946249     0.00000000053706  2327493360996.926
#> Fiber        11749.99999999948704     0.00000000053707 21878164464387.484
#> Glass.Doors   4449.99999999952979     0.00000000053706  8285809124254.771
#> Swiming.Pool     0.00000000054141     0.00000000053706              1.008
#> Garden           0.00000000053856     0.00000000053707              1.003
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> Swiming.Pool               0.313    
#> Garden                     0.316    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001899 on 499985 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.453e+26 on 14 and 499985 DF,  p-value: < 0.00000000000000022

💡 Insight :

  • Using all variables as our predictor variables except Indian.Marble, R-squared value is still 1 = 100%.

  • 100% variance from target, house pricing Prices can be explained from all variables above.

  • ⚠️ R-square value = 1, we may need to be cautious that this model may be overfitting.

4.5 Fifth Model with all Significant Variables

Based on the Fourth Model using All Variables except Indian.Marble, we observed that the value of R-squared remains the same as equals to 1 (one) and let us create additional model by eliminating the other two non-significant variables Swimming.Pool and Garden.

Predictor Variables : all variables except Indian.Marble, Swiming.Pool and Garden.

model_price_signf <- lm(formula = Prices ~ . - Indian.Marble - Swiming.Pool - Garden, 
                    data = house)
summary(model_price_signf)
#> 
#> Call:
#> lm(formula = Prices ~ . - Indian.Marble - Swiming.Pool - Garden, 
#>     data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134260  0.000000000  0.000000000  0.000000001  0.000000359 
#> 
#> Coefficients:
#>                          Estimate           Std. Error        t value
#> (Intercept)    999.99999980987025     0.00000000146116   684387154347
#> Area            25.00000000000009     0.00000000000374  6684089292557
#> Garage        1500.00000000008708     0.00000000032868  4563729715707
#> FirePlace      750.00000000030707     0.00000000018991  3949350666525
#> Baths         1250.00000000023556     0.00000000018988  6583144017318
#> White.Marble 13999.99999999921238     0.00000000065744 21294598533522
#> Black.Marble  4999.99999999932515     0.00000000065759  7603486054622
#> Floors       15000.00000000155524     0.00000000053706 27929902155431
#> City          3499.99999999960755     0.00000000032900 10638357415806
#> Solar          249.99999999945970     0.00000000053706   465495581371
#> Electric      1249.99999999946317     0.00000000053706  2327494383680
#> Fiber        11749.99999999948886     0.00000000053706 21878350420159
#> Glass.Doors   4449.99999999953161     0.00000000053706  8285855516810
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001899 on 499987 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.695e+26 on 12 and 499987 DF,  p-value: < 0.00000000000000022

💡 Insight :

  • Using all variables as our predictor variables except Indian.Marble, Swiming.Pool and Garden, R-squared value remains the same at 1 = 100%.

  • 100% variance from target, house pricing Prices can be explained from all variables above.

  • ⚠️ R-square value = 1, we may need to be cautious that this model may be overfitting.

5 Model Comparison

At this point, we have created several models which are :

  1. model_price_1 : Model with Highest Correlation Variable Floors.

  2. model_price_2 : Model using Higher Correlation Variables Floors, Fiber, Indian.Marble, White.Marble.

  3. model_price_all : Model using all variables as Predictor Variables.

  4. model_price_no_IndianMarble : Model using all variables except Indian.Marble.

  5. model_price_signf : Model using all significant variables which is all variables except Indian.Marble, Swiming.Pool and Garden.

Let us recap and compare all of our five modellings above using r-squared and error :

5.1 Using R-Squared

summary(model_price_1)$r.squared # One Predictor
#> [1] 0.3837194
summary(model_price_2)$adj.r.squared # Multiple Predictors
#> [1] 0.8471227
summary(model_price_all)$adj.r.squared # Multiple Predictors
#> [1] 1
summary(model_price_no_IndianMarble)$adj.r.squared # Multiple Predictors
#> [1] 1
summary(model_price_signf)$adj.r.squared # Multiple Predictors
#> [1] 1

💡 Insight :

  • model_price_all, model_price_no_IndianMarble, and model_price_signf has the highest r-squared value at 1.

  • model_price_1 has the lowest r-squared value at 0.3837 = 38.37%.

  • ⚠️ However, we may need to be cautious that those models may be overfitting because it is too perfect.

5.2 Using Error Metrics (RMSE)

Make predictions for each model

price_pred_1 <- predict(model_price_1, house)
price_pred_2 <- predict(model_price_2, house)
price_pred_all <- predict(model_price_all, house)
price_pred_no_IndianMarble <- predict(model_price_no_IndianMarble, house)
price_pred_signf <- predict(model_price_signf, house)

Prediction Evaluation using Error Metrics (RMSE)

RMSE_1 <- RMSE(y_pred = price_pred_1,
               y_true = house$Prices)

RMSE_2 <- RMSE(y_pred = price_pred_2,
               y_true = house$Prices)

RMSE_all <- RMSE(y_pred = price_pred_all,
                 y_true = house$Prices)

RMSE_no_IndianMarble <- RMSE(y_pred = price_pred_no_IndianMarble,
                             y_true = house$Prices)

RMSE_signf <- RMSE(y_pred = price_pred_signf,
                   y_true = house$Prices)

RMSE_1
#> [1] 9506.956
RMSE_2
#> [1] 4735.022
RMSE_all
#> [1] 0.0000001901459
RMSE_no_IndianMarble
#> [1] 0.0000001901459
RMSE_signf
#> [1] 0.0000001901501

💡 Insight :

  • Based on RMSE values, model_price_all and model_price_no_IndianMarble have the best performance because the two models have the lowest RMSE values at 0.0000001901459.

  • However, model_price_signf also considered to have best performance with almost no difference on RMSE values at 0.0000001901501.

  • ⚠️ Still, with RMSE values almost non-existent, we may need to be cautious.

  • In the opposite side, model_price_1 have the worst performance with the highest RMSE values at 9506.956 and this is unexpected because this model is using the highest correlation variable as its predictor variable.

5.3 Conclusions

Based on the Model Comparison that we have combined and created, we can conclude that there are many variables which can affect the value of our house pricing. We can observed it by summarizing as follows:

  • Model using only 1 predictor variable (with highest correlations) Floors = 0.6, gives the worst model performance.

  • Model using 4 predictor variables (with higher correlations variables) (Floors, Fiber, Indian.Marble, White.Marble), gives better model performance compare to only using 1 predictor variable.

  • Model using all variables as predictor variables or also only using all significant variables, gives best model performances but ⚠️ note of caution that this may be overfitting model as it is too perfect with almost zero error.

In other words, the more the house has many additional significant attributes except Indian Marble, Swimming Pool and Garden, the more it can affect the house pricing.

6 Stepwise Regression Model

If in our previous attempt of modelling is more like trial and error to find the best combinations of predictor variables, let us now use method of Stepwise Regression instead.

6.1 Forward Stepwise

# Create model without predictor to start
model_price_base <- lm(formula = Prices ~ 1, 
                    data = house)

# We already have model for upper limit : model_price_all

# Stepwise Regression
model_price_forward <- step(object = model_price_base,
                            direction = "forward",
                            scope = list(upper = model_price_all)
                            )
#> Start:  AIC=9401807
#> Prices ~ 1
#> 
#>                 Df      Sum of Sq            RSS     AIC
#> + Floors         1 28137670256676 45191105612975 9159783
#> + Fiber          1 17222144966830 56106630902821 9267960
#> + White.Marble   1 14727518113609 58601257756042 9289711
#> + Indian.Marble  1 10025454477185 63303321392466 9328302
#> + City           1  3989805570061 69338970299590 9373836
#> + Glass.Doors    1  2428230058991 70900545810660 9384972
#> + Area           1  1600050598200 71728725271451 9390779
#> + Baths          1  1543586418569 71785189451082 9391172
#> + Garage         1   737604900715 72591170968936 9396755
#> + FirePlace      1   582650002516 72746125867135 9397821
#> + Black.Marble   1   446697232267 72882078637384 9398754
#> + Electric       1   201670610161 73127105259490 9400432
#> + Solar          1     5209679561 73323566190090 9401774
#> <none>                            73328775869651 9401807
#> + Swiming.Pool   1      234043881 73328541825770 9401808
#> + Garden         1      173920415 73328601949236 9401808
#> 
#> Step:  AIC=9159783
#> Prices ~ Floors
#> 
#>                 Df      Sum of Sq            RSS     AIC
#> + Fiber          1 17161774856129 28029330756847 8920961
#> + White.Marble   1 14724329410278 30466776202698 8962653
#> + Indian.Marble  1 10035180243376 35155925369599 9034231
#> + City           1  4003393234440 41187712378535 9113405
#> + Glass.Doors    1  2428367593536 42762738019440 9132168
#> + Area           1  1610488043954 43580617569021 9141641
#> + Baths          1  1555207029826 43635898583149 9142275
#> + Garage         1   746110257205 44444995355770 9151461
#> + FirePlace      1   581149541139 44609956071836 9153313
#> + Black.Marble   1   444090818181 44747014794794 9154847
#> + Electric       1   201282051847 44989823561128 9157553
#> + Solar          1     7437646280 45183667966695 9159703
#> + Swiming.Pool   1      269529965 45190836083010 9159782
#> + Garden         1      249560798 45190856052177 9159782
#> <none>                            45191105612975 9159783
#> 
#> Step:  AIC=8920961
#> Prices ~ Floors + Fiber
#> 
#>                 Df      Sum of Sq            RSS     AIC
#> + White.Marble   1 14742671033035 13286659723812 8547717
#> + Indian.Marble  1 10048366919693 17980963837154 8698994
#> + City           1  4048561433438 23980769323408 8842963
#> + Glass.Doors    1  2457747958687 25571582798160 8875078
#> + Area           1  1609279115650 26420051641197 8891398
#> + Baths          1  1562306360520 26467024396327 8892287
#> + Garage         1   750129260320 27279201496527 8907399
#> + FirePlace      1   569726306346 27459604450501 8910695
#> + Black.Marble   1   444499183390 27584831573457 8912970
#> + Electric       1   202433546973 27826897209874 8917338
#> + Solar          1     7265670476 28022065086371 8920833
#> + Garden         1      252502079 28029078254768 8920958
#> <none>                            28029330756847 8920961
#> + Swiming.Pool   1         464385 28029330292462 8920963
#> 
#> Step:  AIC=8547717
#> Prices ~ Floors + Fiber + White.Marble
#> 
#>                 Df     Sum of Sq            RSS     AIC
#> + City           1 4061895141055  9224764582757 8365285
#> + Glass.Doors    1 2462605272635 10824054451177 8445224
#> + Black.Marble   1 2076445268323 11210214455490 8462752
#> + Indian.Marble  1 2076445268323 11210214455490 8462752
#> + Area           1 1584782472730 11701877251082 8484214
#> + Baths          1 1538485207921 11748174515891 8486188
#> + Garage         1  746534973249 12540124750563 8518806
#> + FirePlace      1  564215464953 12722444258859 8526023
#> + Electric       1  200509914544 13086149809269 8540116
#> + Solar          1    8338855184 13278320868628 8547405
#> + Garden         1     149069376 13286510654437 8547714
#> <none>                           13286659723812 8547717
#> + Swiming.Pool   1      43506897 13286616216915 8547717
#> 
#> Step:  AIC=8365285
#> Prices ~ Floors + Fiber + White.Marble + City
#> 
#>                 Df     Sum of Sq           RSS     AIC
#> + Glass.Doors    1 2457781442951 6766983139806 8210369
#> + Black.Marble   1 2081506095751 7143258487006 8237425
#> + Indian.Marble  1 2081506095751 7143258487006 8237425
#> + Area           1 1602371800156 7622392782601 8269886
#> + Baths          1 1542779602026 7681984980731 8273780
#> + Garage         1  743782030855 8480982551902 8323254
#> + FirePlace      1  564835861461 8659928721296 8333694
#> + Electric       1  199091680363 9025672902394 8354377
#> + Solar          1    8161007847 9216603574909 8364844
#> + Garden         1      95579979 9224669002778 8365282
#> <none>                           9224764582757 8365285
#> + Swiming.Pool   1      35130383 9224729452374 8365285
#> 
#> Step:  AIC=8210369
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors
#> 
#>                 Df     Sum of Sq           RSS     AIC
#> + Black.Marble   1 2082103704397 4684879435409 8026513
#> + Indian.Marble  1 2082103704397 4684879435409 8026513
#> + Area           1 1607244583078 5159738556728 8074786
#> + Baths          1 1549285292131 5217697847674 8080371
#> + Garage         1  749672149111 6017310990695 8151663
#> + FirePlace      1  565687046739 6201296093066 8166722
#> + Electric       1  197573697371 6569409442435 8195555
#> + Solar          1    8394120147 6758589019658 8209750
#> + Swiming.Pool   1      28014024 6766955125782 8210368
#> <none>                           6766983139806 8210369
#> + Garden         1      20781279 6766962358527 8210369
#> 
#> Step:  AIC=8026513
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble
#> 
#>                Df     Sum of Sq           RSS     AIC
#> + Area          1 1608175453133 3076703982276 7816274
#> + Baths         1 1555500334873 3129379100536 7824762
#> + Garage        1  743586714890 3941292720518 7940099
#> + FirePlace     1  566807351596 4118072083812 7962037
#> + Electric      1  197972686303 4486906749106 8004927
#> + Solar         1    8099860660 4676779574749 8025650
#> + Swiming.Pool  1      32266754 4684847168654 8026512
#> <none>                          4684879435409 8026513
#> + Garden        1      12521931 4684866913477 8026514
#> 
#> Step:  AIC=7816274
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area
#> 
#>                Df     Sum of Sq           RSS     AIC
#> + Baths         1 1556798499478 1519905482798 7463671
#> + Garage        1  745553282885 2331150699390 7677528
#> + FirePlace     1  566099969257 2510604013018 7714609
#> + Electric      1  198113917947 2878590064329 7782997
#> + Solar         1    7979583108 3068724399167 7814978
#> + Swiming.Pool  1      24009667 3076679972609 7816272
#> <none>                          3076703982276 7816274
#> + Garden        1       2962361 3076701019915 7816276
#> 
#> Step:  AIC=7463671
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths
#> 
#>                Df    Sum of Sq           RSS     AIC
#> + Garage        1 753445238325  766460244472 7121363
#> + FirePlace     1 564897553777  955007929020 7231331
#> + Electric      1 196952649137 1322952833661 7394282
#> + Solar         1   8147991183 1511757491615 7460985
#> <none>                         1519905482798 7463671
#> + Swiming.Pool  1      4538430 1519900944368 7463671
#> + Garden        1       189502 1519905293296 7463673
#> 
#> Step:  AIC=7121363
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths + Garage
#> 
#>                Df    Sum of Sq          RSS     AIC
#> + FirePlace     1 563189576547 203270667925 6457742
#> + Electric      1 196348696668 570111547805 6973389
#> + Solar         1   7919027911 758541216561 7116172
#> <none>                         766460244472 7121363
#> + Swiming.Pool  1      1379736 766458864737 7121364
#> + Garden        1        18324 766460226148 7121365
#> 
#> Step:  AIC=6457742
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths + Garage + FirePlace
#> 
#>                Df    Sum of Sq          RSS     AIC
#> + Electric      1 195458365125   7812302801 4828325
#> + Solar         1   7960332630 195310335295 6437770
#> <none>                         203270667925 6457742
#> + Swiming.Pool  1        43820 203270624105 6457744
#> + Garden        1         1465 203270666460 6457744
#> 
#> Step:  AIC=4828325
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths + Garage + FirePlace + Electric
#> 
#>                Df  Sum of Sq        RSS       AIC
#> + Solar         1 7812302801          0 -15391789
#> + Garden        1     141896 7812160905   4828318
#> <none>                       7812302801   4828325
#> + Swiming.Pool  1       1731 7812301069   4828327
#> 
#> Step:  AIC=-15391789
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths + Garage + FirePlace + Electric + 
#>     Solar
#> 
#>                Df            Sum of Sq           RSS       AIC
#> <none>                                 0.00000002137 -15391789
#> + Garden        1 0.000000000000016711 0.00000002137 -15391787
#> + Swiming.Pool  1 0.000000000000016452 0.00000002137 -15391787

Let us look at the summary for our forward stepwise model model_price_forward.

summary(model_price_forward)
#> 
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City + 
#>     Glass.Doors + Black.Marble + Area + Baths + Garage + FirePlace + 
#>     Electric + Solar, data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134145  0.000000000  0.000000000  0.000000001  0.000004353 
#> 
#> Coefficients:
#>                           Estimate            Std. Error        t value
#> (Intercept)    999.999999904196670     0.000000001590924   628565405640
#> Floors       14999.999999853393092     0.000000000584754 25651811499613
#> Fiber        11749.999999947447577     0.000000000584756 20093852022064
#> White.Marble 14000.000000052734322     0.000000000715830 19557713611217
#> City          3499.999999992095582     0.000000000358216  9770644292922
#> Glass.Doors   4450.000000006944902     0.000000000584755  7610023216287
#> Black.Marble  4999.999999999229658     0.000000000715993  6983310930652
#> Area            24.999999999998334     0.000000000004072  6138904376604
#> Baths         1250.000000000046839     0.000000000206742  6046192660042
#> Garage        1500.000000000003183     0.000000000357868  4191491031784
#> FirePlace      750.000000000298542     0.000000000206770  3627223549882
#> Electric      1249.999999999690544     0.000000000584753  2137653288745
#> Solar          249.999999999673605     0.000000000584758   427527631169
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000002067 on 499987 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.43e+26 on 12 and 499987 DF,  p-value: < 0.00000000000000022

The best performance models using combination of the following predictor variables:

  • Floors
  • Fiber
  • White.Marble
  • City
  • Glass.Doors
  • Black.Marble
  • Area
  • Baths
  • Garage
  • FirePlace
  • Electric
  • Solar

Total predictor variables used above are combination of 12 variables, except Indian.Marble, Swiming.Pool and Garden.

6.2 Backward Stepwise

# Backward Stepwise Regression
model_price_backward <- step(object = model_price_all,
                             direction = "backward",
                             scope = list(lower = model_price_base)
                             )
#> Start:  AIC=-15476871
#> Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
#>     Indian.Marble + Floors + City + Solar + Electric + Fiber + 
#>     Glass.Doors + Swiming.Pool + Garden
#> 
#> 
#> Step:  AIC=-15476871
#> Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
#>     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
#>     Swiming.Pool + Garden
#> 
#>                Df      Sum of Sq            RSS       AIC
#> - Garden        1              0              0 -15476872
#> - Swiming.Pool  1              0              0 -15476872
#> <none>                                        0 -15476871
#> - Solar         1     7812159167     7812159167   4828320
#> - Electric      1   195310155285   195310155285   6437774
#> - FirePlace     1   562339506701   562339506701   6966532
#> - Garage        1   750908575108   750908575108   7111121
#> - Baths         1  1562468325721  1562468325721   7477490
#> - Area          1  1610761970684  1610761970684   7492711
#> - Black.Marble  1  2084362234399  2084362234399   7621588
#> - Glass.Doors   1  2475242609737  2475242609737   7707526
#> - City          1  4080340520010  4080340520010   7957447
#> - White.Marble  1 16348762026008 16348762026008   8651433
#> - Fiber         1 17257174438008 17257174438008   8678471
#> - Floors        1 28124618625749 28124618625749   8922680
#> 
#> Step:  AIC=-15476872
#> Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
#>     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
#>     Swiming.Pool
#> 
#>                Df      Sum of Sq            RSS       AIC
#> - Swiming.Pool  1              0              0 -15476873
#> <none>                                        0 -15476872
#> - Solar         1     7812301069     7812301069   4828327
#> - Electric      1   195310272173   195310272173   6437772
#> - FirePlace     1   562339536264   562339536264   6966530
#> - Garage        1   750908894209   750908894209   7111119
#> - Baths         1  1562472960134  1562472960134   7477490
#> - Area          1  1610765290914  1610765290914   7492710
#> - Black.Marble  1  2084363305721  2084363305721   7621587
#> - Glass.Doors   1  2475270016081  2475270016081   7707530
#> - City          1  4080346531325  4080346531325   7957446
#> - White.Marble  1 16348784685940 16348784685940   8651432
#> - Fiber         1 17257174439261 17257174439261   8678469
#> - Floors        1 28124625667462 28124625667462   8922678
#> 
#> Step:  AIC=-15476873
#> Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
#>     Floors + City + Solar + Electric + Fiber + Glass.Doors
#> 
#>                Df      Sum of Sq            RSS       AIC
#> <none>                                        0 -15476873
#> - Solar         1     7812302801     7812302801   4828325
#> - Electric      1   195310335295   195310335295   6437770
#> - FirePlace     1   562340465328   562340465328   6966529
#> - Garage        1   750909802193   750909802193   7111118
#> - Baths         1  1562480674329  1562480674329   7477490
#> - Area          1  1610765904412  1610765904412   7492708
#> - Black.Marble  1  2084363453100  2084363453100   7621585
#> - Glass.Doors   1  2475270433901  2475270433901   7707528
#> - City          1  4080346988017  4080346988017   7957444
#> - White.Marble  1 16348836692056 16348836692056   8651431
#> - Fiber         1 17257468537577 17257468537577   8678475
#> - Floors        1 28124626960386 28124626960386   8922676

Let us look at the summary for our backward stepwise model model_price_backward.

summary(model_price_backward)
#> 
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble + 
#>     Black.Marble + Floors + City + Solar + Electric + Fiber + 
#>     Glass.Doors, data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134260  0.000000000  0.000000000  0.000000001  0.000000359 
#> 
#> Coefficients:
#>                          Estimate           Std. Error        t value
#> (Intercept)    999.99999980987025     0.00000000146116   684387154347
#> Area            25.00000000000009     0.00000000000374  6684089292557
#> Garage        1500.00000000008708     0.00000000032868  4563729715707
#> FirePlace      750.00000000030707     0.00000000018991  3949350666525
#> Baths         1250.00000000023556     0.00000000018988  6583144017318
#> White.Marble 13999.99999999921238     0.00000000065744 21294598533522
#> Black.Marble  4999.99999999932515     0.00000000065759  7603486054622
#> Floors       15000.00000000155524     0.00000000053706 27929902155431
#> City          3499.99999999960755     0.00000000032900 10638357415806
#> Solar          249.99999999945970     0.00000000053706   465495581371
#> Electric      1249.99999999946317     0.00000000053706  2327494383680
#> Fiber        11749.99999999948886     0.00000000053706 21878350420159
#> Glass.Doors   4449.99999999953161     0.00000000053706  8285855516810
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001899 on 499987 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.695e+26 on 12 and 499987 DF,  p-value: < 0.00000000000000022

The best performance models using combination of the following predictor variables:

  • Area
  • Garage
  • FirePlace
  • Baths
  • White.Marble
  • Black.Marble
  • Floors
  • City
  • Solar
  • Electric
  • Fiber
  • Glass.Doors

Total predictor variables used above are combination of 12 variables, except Indian.Marble, Swiming.Pool and Garden.

6.3 Both

# Both
model_price_both <- step(object = model_price_base,
                         direction = "both",
                         scope = list(upper = model_price_all,
                                      lower = model_price_base)
                         )
#> Start:  AIC=9401807
#> Prices ~ 1
#> 
#>                 Df      Sum of Sq            RSS     AIC
#> + Floors         1 28137670256676 45191105612975 9159783
#> + Fiber          1 17222144966830 56106630902821 9267960
#> + White.Marble   1 14727518113609 58601257756042 9289711
#> + Indian.Marble  1 10025454477185 63303321392466 9328302
#> + City           1  3989805570061 69338970299590 9373836
#> + Glass.Doors    1  2428230058991 70900545810660 9384972
#> + Area           1  1600050598200 71728725271451 9390779
#> + Baths          1  1543586418569 71785189451082 9391172
#> + Garage         1   737604900715 72591170968936 9396755
#> + FirePlace      1   582650002516 72746125867135 9397821
#> + Black.Marble   1   446697232267 72882078637384 9398754
#> + Electric       1   201670610161 73127105259490 9400432
#> + Solar          1     5209679561 73323566190090 9401774
#> <none>                            73328775869651 9401807
#> + Swiming.Pool   1      234043881 73328541825770 9401808
#> + Garden         1      173920415 73328601949236 9401808
#> 
#> Step:  AIC=9159783
#> Prices ~ Floors
#> 
#>                 Df      Sum of Sq            RSS     AIC
#> + Fiber          1 17161774856129 28029330756847 8920961
#> + White.Marble   1 14724329410278 30466776202698 8962653
#> + Indian.Marble  1 10035180243376 35155925369599 9034231
#> + City           1  4003393234440 41187712378535 9113405
#> + Glass.Doors    1  2428367593536 42762738019440 9132168
#> + Area           1  1610488043954 43580617569021 9141641
#> + Baths          1  1555207029826 43635898583149 9142275
#> + Garage         1   746110257205 44444995355770 9151461
#> + FirePlace      1   581149541139 44609956071836 9153313
#> + Black.Marble   1   444090818181 44747014794794 9154847
#> + Electric       1   201282051847 44989823561128 9157553
#> + Solar          1     7437646280 45183667966695 9159703
#> + Swiming.Pool   1      269529965 45190836083010 9159782
#> + Garden         1      249560798 45190856052177 9159782
#> <none>                            45191105612975 9159783
#> - Floors         1 28137670256676 73328775869651 9401807
#> 
#> Step:  AIC=8920961
#> Prices ~ Floors + Fiber
#> 
#>                 Df      Sum of Sq            RSS     AIC
#> + White.Marble   1 14742671033035 13286659723812 8547717
#> + Indian.Marble  1 10048366919693 17980963837154 8698994
#> + City           1  4048561433438 23980769323408 8842963
#> + Glass.Doors    1  2457747958687 25571582798160 8875078
#> + Area           1  1609279115650 26420051641197 8891398
#> + Baths          1  1562306360520 26467024396327 8892287
#> + Garage         1   750129260320 27279201496527 8907399
#> + FirePlace      1   569726306346 27459604450501 8910695
#> + Black.Marble   1   444499183390 27584831573457 8912970
#> + Electric       1   202433546973 27826897209874 8917338
#> + Solar          1     7265670476 28022065086371 8920833
#> + Garden         1      252502079 28029078254768 8920958
#> <none>                            28029330756847 8920961
#> + Swiming.Pool   1         464385 28029330292462 8920963
#> - Fiber          1 17161774856129 45191105612975 9159783
#> - Floors         1 28077300145974 56106630902821 9267960
#> 
#> Step:  AIC=8547717
#> Prices ~ Floors + Fiber + White.Marble
#> 
#>                 Df      Sum of Sq            RSS     AIC
#> + City           1  4061895141055  9224764582757 8365285
#> + Glass.Doors    1  2462605272635 10824054451177 8445224
#> + Black.Marble   1  2076445268323 11210214455490 8462752
#> + Indian.Marble  1  2076445268323 11210214455490 8462752
#> + Area           1  1584782472730 11701877251082 8484214
#> + Baths          1  1538485207921 11748174515891 8486188
#> + Garage         1   746534973249 12540124750563 8518806
#> + FirePlace      1   564215464953 12722444258859 8526023
#> + Electric       1   200509914544 13086149809269 8540116
#> + Solar          1     8338855184 13278320868628 8547405
#> + Garden         1      149069376 13286510654437 8547714
#> <none>                            13286659723812 8547717
#> + Swiming.Pool   1       43506897 13286616216915 8547717
#> - White.Marble   1 14742671033035 28029330756847 8920961
#> - Fiber          1 17180116478885 30466776202698 8962653
#> - Floors         1 28074080667113 41360740390925 9115501
#> 
#> Step:  AIC=8365285
#> Prices ~ Floors + Fiber + White.Marble + City
#> 
#>                 Df      Sum of Sq            RSS     AIC
#> + Glass.Doors    1  2457781442951  6766983139806 8210369
#> + Black.Marble   1  2081506095751  7143258487006 8237425
#> + Indian.Marble  1  2081506095751  7143258487006 8237425
#> + Area           1  1602371800156  7622392782601 8269886
#> + Baths          1  1542779602026  7681984980731 8273780
#> + Garage         1   743782030855  8480982551902 8323254
#> + FirePlace      1   564835861461  8659928721296 8333694
#> + Electric       1   199091680363  9025672902394 8354377
#> + Solar          1     8161007847  9216603574909 8364844
#> + Garden         1       95579979  9224669002778 8365282
#> <none>                             9224764582757 8365285
#> + Swiming.Pool   1       35130383  9224729452374 8365285
#> - City           1  4061895141055 13286659723812 8547717
#> - White.Marble   1 14756004740652 23980769323408 8842963
#> - Fiber          1 17225391617828 26450156200585 8891968
#> - Floors         1 28087670960865 37312435543622 9064000
#> 
#> Step:  AIC=8210369
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors
#> 
#>                 Df      Sum of Sq            RSS     AIC
#> + Black.Marble   1  2082103704397  4684879435409 8026513
#> + Indian.Marble  1  2082103704397  4684879435409 8026513
#> + Area           1  1607244583078  5159738556728 8074786
#> + Baths          1  1549285292131  5217697847674 8080371
#> + Garage         1   749672149111  6017310990695 8151663
#> + FirePlace      1   565687046739  6201296093066 8166722
#> + Electric       1   197573697371  6569409442435 8195555
#> + Solar          1     8394120147  6758589019658 8209750
#> + Swiming.Pool   1       28014024  6766955125782 8210368
#> <none>                             6766983139806 8210369
#> + Garden         1       20781279  6766962358527 8210369
#> - Glass.Doors    1  2457781442951  9224764582757 8365285
#> - City           1  4057071311372 10824054451177 8445224
#> - White.Marble   1 14760851566768 21527834706573 8789012
#> - Fiber          1 17254802476653 24021785616459 8843819
#> - Floors         1 28087748852556 34854731992362 9029933
#> 
#> Step:  AIC=8026513
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble
#> 
#>                Df      Sum of Sq            RSS     AIC
#> + Area          1  1608175453133  3076703982276 7816274
#> + Baths         1  1555500334873  3129379100536 7824762
#> + Garage        1   743586714890  3941292720518 7940099
#> + FirePlace     1   566807351596  4118072083812 7962037
#> + Electric      1   197972686303  4486906749106 8004927
#> + Solar         1     8099860660  4676779574749 8025650
#> + Swiming.Pool  1       32266754  4684847168654 8026512
#> <none>                            4684879435409 8026513
#> + Garden        1       12521931  4684866913477 8026514
#> - Black.Marble  1  2082103704397  6766983139806 8210369
#> - Glass.Doors   1  2458379051597  7143258487006 8237425
#> - City          1  4062129275196  8747008710605 8338697
#> - White.Marble  1 16399145659060 21084025094468 8778599
#> - Fiber         1 17257789445128 21942668880536 8798557
#> - Floors        1 28093555546010 32778434981418 8999226
#> 
#> Step:  AIC=7816274
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area
#> 
#>                Df      Sum of Sq            RSS     AIC
#> + Baths         1  1556798499478  1519905482798 7463671
#> + Garage        1   745553282885  2331150699390 7677528
#> + FirePlace     1   566099969257  2510604013018 7714609
#> + Electric      1   198113917947  2878590064329 7782997
#> + Solar         1     7979583108  3068724399167 7814978
#> + Swiming.Pool  1       24009667  3076679972609 7816272
#> <none>                            3076703982276 7816274
#> + Garden        1        2962361  3076701019915 7816276
#> - Area          1  1608175453133  4684879435409 8026513
#> - Black.Marble  1  2083034574452  5159738556728 8074786
#> - Glass.Doors   1  2463253972628  5539957954904 8110336
#> - City          1  4079747417786  7156451400062 8238350
#> - White.Marble  1 16377938345664 19454642327940 8738386
#> - Fiber         1 17256690991214 20333394973490 8760475
#> - Floors        1 28104013042666 31180717024941 8974243
#> 
#> Step:  AIC=7463671
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths
#> 
#>                Df      Sum of Sq            RSS     AIC
#> + Garage        1   753445238325   766460244472 7121363
#> + FirePlace     1   564897553777   955007929020 7231331
#> + Electric      1   196952649137  1322952833661 7394282
#> + Solar         1     8147991183  1511757491615 7460985
#> <none>                            1519905482798 7463671
#> + Swiming.Pool  1        4538430  1519900944368 7463671
#> + Garden        1         189502  1519905293296 7463673
#> - Baths         1  1556798499478  3076703982276 7816274
#> - Area          1  1609473617738  3129379100536 7824762
#> - Black.Marble  1  2089253977860  3609159460658 7896082
#> - Glass.Doors   1  2469785555845  3989691038643 7946202
#> - City          1  4084079035568  5603984518365 8116084
#> - White.Marble  1 16364805574575 17884711057373 8696318
#> - Fiber         1 17263849668437 18783755151235 8720841
#> - Floors        1 28115643074462 29635548557259 8948832
#> 
#> Step:  AIC=7121363
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths + Garage
#> 
#>                Df      Sum of Sq            RSS     AIC
#> + FirePlace     1   563189576547   203270667925 6457742
#> + Electric      1   196348696668   570111547805 6973389
#> + Solar         1     7919027911   758541216561 7116172
#> <none>                             766460244472 7121363
#> + Swiming.Pool  1        1379736   766458864737 7121364
#> + Garden        1          18324   766460226148 7121365
#> - Garage        1   753445238325  1519905482798 7463671
#> - Baths         1  1564690454918  2331150699390 7677528
#> - Area          1  1611454680757  2377914925229 7687459
#> - Black.Marble  1  2083134304329  2849594548801 7777935
#> - Glass.Doors   1  2475723608373  3242183852846 7842470
#> - City          1  4081309331261  4847769575733 8043606
#> - White.Marble  1 16352865847243 17119326091715 8674451
#> - Fiber         1 17267920331614 18034380576087 8700487
#> - Floors        1 28124203167579 28890663412051 8936106
#> 
#> Step:  AIC=6457742
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths + Garage + FirePlace
#> 
#>                Df      Sum of Sq            RSS     AIC
#> + Electric      1   195458365125     7812302801 4828325
#> + Solar         1     7960332630   195310335295 6437770
#> <none>                             203270667925 6457742
#> + Swiming.Pool  1          43820   203270624105 6457744
#> + Garden        1           1465   203270666460 6457744
#> - FirePlace     1   563189576547   766460244472 7121363
#> - Garage        1   751737261095   955007929020 7231331
#> - Baths         1  1563477838353  1766748506278 7538919
#> - Area          1  1610745649461  1814016317387 7552121
#> - Black.Marble  1  2084255608832  2287526276757 7668084
#> - Glass.Doors   1  2476565823927  2679836491852 7747227
#> - City          1  4081928408880  4285199076805 7981932
#> - White.Marble  1 16349433795392 16552704463317 8657624
#> - Fiber         1 17256526080070 17459796747996 8684299
#> - Floors        1 28122732979287 28326003647212 8926239
#> 
#> Step:  AIC=4828325
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths + Garage + FirePlace + Electric
#> 
#>                Df      Sum of Sq            RSS       AIC
#> + Solar         1     7812302801              0 -15391789
#> + Garden        1         141896     7812160905   4828318
#> <none>                               7812302801   4828325
#> + Swiming.Pool  1           1731     7812301069   4828327
#> - Electric      1   195458365125   203270667925   6457742
#> - FirePlace     1   562299245004   570111547805   6973389
#> - Garage        1   751136710960   758949013760   7116441
#> - Baths         1  1562316285510  1570128588311   7479930
#> - Area          1  1610885328953  1618697631753   7495162
#> - Black.Marble  1  2084651543827  2092463846627   7623522
#> - Glass.Doors   1  2475044981222  2482857284022   7709056
#> - City          1  4080521442083  4088333744884   7958420
#> - White.Marble  1 16348265198893 16356077501694   8651651
#> - Fiber         1 17257647498350 17265459801151   8678705
#> - Floors        1 28122340428933 28130152731733   8922772
#> 
#> Step:  AIC=-15391789
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors + 
#>     Black.Marble + Area + Baths + Garage + FirePlace + Electric + 
#>     Solar
#> 
#>                Df      Sum of Sq            RSS       AIC
#> <none>                                        0 -15391789
#> + Garden        1              0              0 -15391787
#> + Swiming.Pool  1              0              0 -15391787
#> - Solar         1     7812302801     7812302801   4828325
#> - Electric      1   195310335295   195310335295   6437770
#> - FirePlace     1   562340465328   562340465328   6966529
#> - Garage        1   750909802193   750909802193   7111118
#> - Baths         1  1562480674329  1562480674329   7477490
#> - Area          1  1610765904413  1610765904413   7492708
#> - Black.Marble  1  2084363453100  2084363453100   7621585
#> - Glass.Doors   1  2475270433900  2475270433900   7707528
#> - City          1  4080346988017  4080346988017   7957444
#> - White.Marble  1 16348836692056 16348836692056   8651431
#> - Fiber         1 17257468537578 17257468537578   8678475
#> - Floors        1 28124626960385 28124626960385   8922676

Let us look at the summary for our both stepwise model model_price_both.

summary(model_price_both)
#> 
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City + 
#>     Glass.Doors + Black.Marble + Area + Baths + Garage + FirePlace + 
#>     Electric + Solar, data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134145  0.000000000  0.000000000  0.000000001  0.000004353 
#> 
#> Coefficients:
#>                           Estimate            Std. Error        t value
#> (Intercept)    999.999999904196670     0.000000001590924   628565405640
#> Floors       14999.999999853393092     0.000000000584754 25651811499613
#> Fiber        11749.999999947447577     0.000000000584756 20093852022064
#> White.Marble 14000.000000052734322     0.000000000715830 19557713611217
#> City          3499.999999992095582     0.000000000358216  9770644292922
#> Glass.Doors   4450.000000006944902     0.000000000584755  7610023216287
#> Black.Marble  4999.999999999229658     0.000000000715993  6983310930652
#> Area            24.999999999998334     0.000000000004072  6138904376604
#> Baths         1250.000000000046839     0.000000000206742  6046192660042
#> Garage        1500.000000000003183     0.000000000357868  4191491031784
#> FirePlace      750.000000000298542     0.000000000206770  3627223549882
#> Electric      1249.999999999690544     0.000000000584753  2137653288745
#> Solar          249.999999999673605     0.000000000584758   427527631169
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000002067 on 499987 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.43e+26 on 12 and 499987 DF,  p-value: < 0.00000000000000022

The best performance models using combination of the following predictor variables:

  • Floors
  • Fiber
  • White.Marble
  • City
  • Glass.Doors
  • Black.Marble
  • Area
  • Baths
  • Garage
  • FirePlace
  • Electric
  • Solar

Total predictor variables used above are combination of 12 variables, except Indian.Marble, Swiming.Pool and Garden.

6.4 Model Comparison (Stepwise)

6.4.1 R-Squared Value

summary(model_price_forward)$adj.r.squared
#> [1] 1
summary(model_price_backward)$adj.r.squared
#> [1] 1
summary(model_price_both)$adj.r.squared
#> [1] 1

All three step-wise models also gives the best performance with r-squared value equals to 1.

6.4.2 Using Error Metrics (RMSE)

Make prediction for each model

price_pred_forward <- predict(model_price_forward, house)
price_pred_backward <- predict(model_price_backward, house)
price_pred_both <- predict(model_price_both, house)

Prediction Evaluation using Error Metrics (RMSE)

RMSE(y_pred = price_pred_forward,
     y_true = house$Prices)
#> [1] 0.0000002071309
RMSE(y_pred = price_pred_backward,
     y_true = house$Prices)
#> [1] 0.0000001901501
RMSE(y_pred = price_pred_both,
     y_true = house$Prices)
#> [1] 0.0000002071309

Even though all the error from three step-wise models above almost non-existent, model_price_backward give the lowest error value of 0.0000001901501.

7 Conclusion With and Without Stepwise Regression

We can conclude that the best performance models based on our Stepwise Regression Models are using the combination from 12 predictor variables which essentially all variables except Indian.Marble, Swiming.Pool and Garden.

Although all three models gives r-squared value equals to 1, but the best modelling is using backward step-wise regression model_price_backward with error value of RMSE equals to 0.0000001901501.

The RMSE value from our model_price_backward has the same value as our model model_price_signf which essentially also using all variables as its predictor variables except variables Indian.Marble, Swiming.Pool and Garden.

However, out of our previous modelings without Stepwise Regression, there are two models that give slightly lower RMSE values compare to model_price_backward and model_price_signf, which are:

- `model_price_all` 
- `model_price_no_IndianMarble`

With RMSE value of 0.0000001901459

8 Linear Regression Assumptions

As we have several best models to choose from let us choose model_price_no_IndianMarble to check for Breusch-Pagan for Homoscedasticity Test and No Multicolinearity because it seems that our model has best performance so far to the point almost perfect.

8.1 Breusch-Pagan for Homoscedasticity Test

Let us do Breusch-Pagan testing using the following hypotheses:

\(H_0\): Error is considered Homoscedasticity
\(H_1\): Error is considered Heteroscedasticity

bptest(model_price_no_IndianMarble)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_price_no_IndianMarble
#> BP = 13.316, df = 14, p-value = 0.5018

Conclusion :

  • p-value : 0.5018
  • alpha : 0.05
  • Accept \(H_0\) because p-value > alpha
  • Hence, the error distribution is constant (Homoscedasticity)
  • Thus, we fulfill Homoscedasticity condition

8.2 No Multicolinearity

VIF (Variance Inflation Factor) Testing will indicate if:

  • VIF > 10: multicollinearity happens in our model
  • VIF < 10: no multicollinearity happens in our model
vif(model_price_no_IndianMarble)
#>         Area       Garage    FirePlace        Baths White.Marble Black.Marble 
#>     1.000024     1.000033     1.000011     1.000037     1.331392     1.331387 
#>       Floors         City        Solar     Electric        Fiber  Glass.Doors 
#>     1.000012     1.000026     1.000037     1.000010     1.000036     1.000028 
#> Swiming.Pool       Garden 
#>     1.000030     1.000039

Conclusion : All predictor variables used in model_price_no_IndianMarble passed the No Multicolinearity Test as the VIF values are all less than 10

9 Closing Statements

It seems that we have created several best models to analyze and predict House Pricing based on the given dataset.

At the same time, we may need to be cautious whether these models are indeed provide us with the correct solution for the purpose of to analyze the list of House Attributes that can affect House Pricing, because these models give results that are near perfect with almost non-existent error.

The best models conclude that the more attributes within the House (up to the 12 variables), the best house pricing it will generates.

Further studies may need to be conducted with different dataset to check the validity of these models because using the provided dataset, these are already the best models created.