Import Libraries

# Data Wrangling
library(dplyr)     

# Visualization
library(GGally)

# Model testing
library(performance) 
library(lmtest)
library(car)

Data Source

This project will use Linear Regression to model the relationship between house prices and the features that the houses have. The data used in this project is taken from: https://www.kaggle.com/datasets/greenwing1985/housepricing

Data Input

houseprice <- read.csv("HousePrices_HalfMil.csv")
head(houseprice)
#>   Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
#> 1  164      2         0     2            0            1             0      0
#> 2   84      2         0     4            0            0             1      1
#> 3  190      2         4     4            1            0             0      0
#> 4   75      2         4     4            0            0             1      1
#> 5  148      1         4     2            1            0             0      1
#> 6  124      3         3     3            0            1             0      1
#>   City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
#> 1    3     1        1     1           1            0      0  43800
#> 2    2     0        0     0           1            1      1  37550
#> 3    2     0        0     1           0            0      0  49500
#> 4    1     1        1     1           1            1      1  50075
#> 5    2     1        0     0           1            1      1  52400
#> 6    1     0        0     1           1            1      1  54300
glimpse(houseprice)
#> Rows: 500,000
#> Columns: 16
#> $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble  <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City          <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar         <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric      <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber         <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors   <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden        <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …

The data has 500,000 rows and 16 columns.

anyNA(houseprice)
#> [1] FALSE

There is no NA values in our data, so we’re good to use it as is.

Exploratory Data Analysis

The Prices column in our data will be our target variable.

Let’s check the house prices distribution.

summary(houseprice$Prices)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    7725   33500   41850   42050   50750   77975

We want to see the correlation between features.

ggcorr(houseprice, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)

Features correlation with Prices:

  • Very weak correlation: Electric, Baths, FirePlace, Garage, Area
  • Weak correlation: Glass.Doors, City
  • Moderate Correlation: Fiber, Indian.Marble, White.Marble
  • Strong correlation: Floors

The rest of the features has negligible correlation with Prices.

We also need to note that White.Marble, Black.Marble, and Indian.Marble has moderate correlation with each other. This may cause multicollinearity to happen in our linear regression model. We will deal with these later after looking at our models.

Modeling

Using all features as predictor variable

For the first model, we will use all features as predictor variables.

model_all <- lm(formula = Prices ~ ., data = houseprice)
summary(model_all)
#> 
#> Call:
#> lm(formula = Prices ~ ., data = houseprice)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134260  0.000000000  0.000000000  0.000000001  0.000000360 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                           Estimate           Std. Error            t value
#> (Intercept)     999.99999980933296     0.00000000150884   662758838130.408
#> Area             25.00000000000007     0.00000000000374  6684054946884.657
#> Garage         1500.00000000008640     0.00000000032868  4563708109116.631
#> FirePlace       750.00000000030752     0.00000000018991  3949331829292.207
#> Baths          1250.00000000023601     0.00000000018988  6583092214898.437
#> White.Marble  13999.99999999921783     0.00000000065745 21294466488403.605
#> Black.Marble   4999.99999999932425     0.00000000065760  7603454046266.312
#> Indian.Marble                   NA                   NA                 NA
#> Floors        15000.00000000155524     0.00000000053706 27929788605714.617
#> City           3499.99999999960937     0.00000000032900 10638307309887.180
#> Solar           249.99999999946232     0.00000000053707   465489478654.927
#> Electric       1249.99999999946135     0.00000000053706  2327484193486.549
#> Fiber         11749.99999999948523     0.00000000053707 21878078290865.754
#> Glass.Doors    4449.99999999952979     0.00000000053707  8285776488182.684
#> Swiming.Pool      0.00000000054140     0.00000000053707              1.008
#> Garden            0.00000000053855     0.00000000053707              1.003
#>                          Pr(>|t|)    
#> (Intercept)   <0.0000000000000002 ***
#> Area          <0.0000000000000002 ***
#> Garage        <0.0000000000000002 ***
#> FirePlace     <0.0000000000000002 ***
#> Baths         <0.0000000000000002 ***
#> White.Marble  <0.0000000000000002 ***
#> Black.Marble  <0.0000000000000002 ***
#> Indian.Marble                  NA    
#> Floors        <0.0000000000000002 ***
#> City          <0.0000000000000002 ***
#> Solar         <0.0000000000000002 ***
#> Electric      <0.0000000000000002 ***
#> Fiber         <0.0000000000000002 ***
#> Glass.Doors   <0.0000000000000002 ***
#> Swiming.Pool                0.313    
#> Garden                      0.316    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001899 on 499985 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.453e+26 on 14 and 499985 DF,  p-value: < 0.00000000000000022

It seems like there is a problem with our model when defining the coefficients of Indian.Marble. As mentioned before, Indian.Marble has a strong correlation with White.Marble and Black.Marble, which may cause multicollinearity and cause problems in our model.

To handle this, we will remove the feature Indian.Marble from our model.

Remove predictor that causes NA

model_all <- lm(formula = Prices ~ .-Indian.Marble, data = houseprice)
summary(model_all)
#> 
#> Call:
#> lm(formula = Prices ~ . - Indian.Marble, data = houseprice)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134260  0.000000000  0.000000000  0.000000001  0.000000360 
#> 
#> Coefficients:
#>                          Estimate           Std. Error            t value
#> (Intercept)    999.99999980933296     0.00000000150884   662758838130.408
#> Area            25.00000000000007     0.00000000000374  6684054946884.657
#> Garage        1500.00000000008640     0.00000000032868  4563708109116.631
#> FirePlace      750.00000000030752     0.00000000018991  3949331829292.207
#> Baths         1250.00000000023601     0.00000000018988  6583092214898.437
#> White.Marble 13999.99999999921783     0.00000000065745 21294466488403.605
#> Black.Marble  4999.99999999932425     0.00000000065760  7603454046266.312
#> Floors       15000.00000000155524     0.00000000053706 27929788605714.617
#> City          3499.99999999960937     0.00000000032900 10638307309887.180
#> Solar          249.99999999946232     0.00000000053707   465489478654.927
#> Electric      1249.99999999946135     0.00000000053706  2327484193486.549
#> Fiber        11749.99999999948523     0.00000000053707 21878078290865.754
#> Glass.Doors   4449.99999999952979     0.00000000053707  8285776488182.684
#> Swiming.Pool     0.00000000054140     0.00000000053707              1.008
#> Garden           0.00000000053855     0.00000000053707              1.003
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> Swiming.Pool               0.313    
#> Garden                     0.316    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001899 on 499985 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.453e+26 on 14 and 499985 DF,  p-value: < 0.00000000000000022

Model Interpretation

  • From the model above, it seems like all variables are significant aside from Swiming.Pool and Garden.
  • The adjusted R-squared value is 1, meaning that all (100%) variance from target variable Prices can be explained by the chosen predictor variables.

Remove insignificant variables

We will remove the non-significant variables: Swiming.Pool and Garden.

model_all <- lm(formula = Prices ~ .-Indian.Marble -Swiming.Pool -Garden, data = houseprice)
summary(model_all)
#> 
#> Call:
#> lm(formula = Prices ~ . - Indian.Marble - Swiming.Pool - Garden, 
#>     data = houseprice)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134260  0.000000000  0.000000000  0.000000001  0.000000360 
#> 
#> Coefficients:
#>                          Estimate           Std. Error        t value
#> (Intercept)    999.99999980986900     0.00000000146117   684384458756
#> Area            25.00000000000009     0.00000000000374  6684062965981
#> Garage        1500.00000000008663     0.00000000032868  4563711740576
#> FirePlace      750.00000000030764     0.00000000018991  3949335111245
#> Baths         1250.00000000023670     0.00000000018988  6583118088334
#> White.Marble 13999.99999999921783     0.00000000065745 21294514660633
#> Black.Marble  4999.99999999932515     0.00000000065760  7603456106824
#> Floors       15000.00000000155524     0.00000000053706 27929792148117
#> City          3499.99999999960983     0.00000000032900 10638315514581
#> Solar          249.99999999945976     0.00000000053706   465493747927
#> Electric      1249.99999999946203     0.00000000053706  2327485216394
#> Fiber        11749.99999999948704     0.00000000053706 21878264248050
#> Glass.Doors   4449.99999999953161     0.00000000053706  8285822881367
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001899 on 499987 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.695e+26 on 12 and 499987 DF,  p-value: < 0.00000000000000022

Both this and the previous model has adjusted R-squared value 1, therefore it is fine to keep this model.

Model Interpretation

  • All variables are significant.
  • The adjusted R-squared value is 1, meaning that all (100%) variance from target variable Prices can be explained by the chosen predictor variables.

Using only several features as predictor variable

In this model we will try and only use several features as predictor variable, based on their correlation.

ggcorr(houseprice, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)

We will try to use the features with minimum absolute correlation value 0.2. So the features that we will use are: Glass.Doors, Fiber, City, Floors, Indian.Marble, White.Marble

houseprice_multiple <- houseprice %>% select(Prices, Glass.Doors, Fiber, City, Floors, Indian.Marble, White.Marble)

model_multiple <- lm(formula = Prices ~ ., data = houseprice_multiple)
summary(model_multiple)
#> 
#> Call:
#> lm(formula = Prices ~ ., data = houseprice_multiple)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9370.5 -2153.1     1.1  2147.2  9351.4 
#> 
#> Coefficients:
#>                Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)   18144.277     15.012  1208.6 <0.0000000000000002 ***
#> Glass.Doors    4434.766      8.658   512.2 <0.0000000000000002 ***
#> Fiber         11750.084      8.658  1357.1 <0.0000000000000002 ***
#> City           3492.153      5.304   658.4 <0.0000000000000002 ***
#> Floors        14991.642      8.658  1731.6 <0.0000000000000002 ***
#> Indian.Marble -4997.262     10.601  -471.4 <0.0000000000000002 ***
#> White.Marble   9024.192     10.612   850.4 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3061 on 499993 degrees of freedom
#> Multiple R-squared:  0.9361, Adjusted R-squared:  0.9361 
#> F-statistic: 1.221e+06 on 6 and 499993 DF,  p-value: < 0.00000000000000022

Model Interpretation

  • All variables are significant.
  • The adjusted R-squared value is 0.9361, meaning that 93.61% variance from target variable Prices can be explained by the chosen predictor variables.

Using Step wise regression

We will do feature selection with all 3 kinds of step wise regression: forward, backward, and both.

# Model without predictor 
model_base_1 <- lm(formula = Prices ~ 1,
                      data = houseprice)

# Model with all predictor
model_base_all <- lm(formula = Prices ~ .,
                     data = houseprice)

# Stepwise regression 
model_step_forward <-  step(object = model_base_1,  
                       direction = "forward", 
                       scope = list(upper = model_base_all), 
                       trace = F)

model_step_backward <- step(object = model_base_all,
                       direction = "backward",
                       scope = list(lower = model_base_1), 
                       trace = F)

model_step_both <- step(object = model_base_1,
                   direction = "both",
                   scope = list(upper = model_base_all,
                                lower = model_base_1),
                   trace = F)

Forward

summary(model_step_forward)
#> 
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City + 
#>     Glass.Doors + Black.Marble + Area + Baths + Garage + FirePlace + 
#>     Electric + Solar, data = houseprice)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134145  0.000000000  0.000000000  0.000000001  0.000004352 
#> 
#> Coefficients:
#>                           Estimate            Std. Error        t value
#> (Intercept)    999.999999904189053     0.000000001590920   628567198577
#> Floors       14999.999999853393092     0.000000000584752 25651884669521
#> Fiber        11749.999999947449396     0.000000000584754 20093909338301
#> White.Marble 14000.000000052730684     0.000000000715828 19557769398158
#> City          3499.999999992096491     0.000000000358215  9770672162967
#> Glass.Doors   4450.000000006955815     0.000000000584753  7610044923319
#> Black.Marble  4999.999999999224201     0.000000000715991  6983330850034
#> Area            24.999999999998334     0.000000000004072  6138921887378
#> Baths         1250.000000000046384     0.000000000206741  6046209906362
#> Garage        1500.000000000003638     0.000000000357867  4191502987704
#> FirePlace      750.000000000298428     0.000000000206769  3627233896271
#> Electric      1249.999999999690090     0.000000000584752  2137659386244
#> Solar          249.999999999674060     0.000000000584756   427528850660
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000002067 on 499987 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.43e+26 on 12 and 499987 DF,  p-value: < 0.00000000000000022

Model Interpretation

  • All chosen predictor variables are significant.
  • The adjusted R-squared value is 1, meaning that all (100%) variance from target variable Prices can be explained by the chosen predictor variables.
  • The chosen predictor variables for this model are:
names(model_step_forward$coefficients) %>% sort
#>  [1] "(Intercept)"  "Area"         "Baths"        "Black.Marble" "City"        
#>  [6] "Electric"     "Fiber"        "FirePlace"    "Floors"       "Garage"      
#> [11] "Glass.Doors"  "Solar"        "White.Marble"

Backward

summary(model_step_backward)
#> 
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble + 
#>     Black.Marble + Floors + City + Solar + Electric + Fiber + 
#>     Glass.Doors, data = houseprice)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134260  0.000000000  0.000000000  0.000000001  0.000000360 
#> 
#> Coefficients:
#>                          Estimate           Std. Error        t value
#> (Intercept)    999.99999980986900     0.00000000146117   684384458756
#> Area            25.00000000000009     0.00000000000374  6684062965981
#> Garage        1500.00000000008663     0.00000000032868  4563711740576
#> FirePlace      750.00000000030764     0.00000000018991  3949335111245
#> Baths         1250.00000000023670     0.00000000018988  6583118088334
#> White.Marble 13999.99999999921783     0.00000000065745 21294514660633
#> Black.Marble  4999.99999999932515     0.00000000065760  7603456106824
#> Floors       15000.00000000155524     0.00000000053706 27929792148117
#> City          3499.99999999960983     0.00000000032900 10638315514581
#> Solar          249.99999999945976     0.00000000053706   465493747927
#> Electric      1249.99999999946203     0.00000000053706  2327485216394
#> Fiber        11749.99999999948704     0.00000000053706 21878264248050
#> Glass.Doors   4449.99999999953161     0.00000000053706  8285822881367
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001899 on 499987 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.695e+26 on 12 and 499987 DF,  p-value: < 0.00000000000000022

Model Interpretation

  • All chosen predictor variables are significant.
  • The adjusted R-squared value is 1, meaning that all (100%) variance from target variable Prices can be explained by the chosen predictor variables.
  • The chosen predictor variables for this model are:
names(model_step_backward$coefficients) %>% sort
#>  [1] "(Intercept)"  "Area"         "Baths"        "Black.Marble" "City"        
#>  [6] "Electric"     "Fiber"        "FirePlace"    "Floors"       "Garage"      
#> [11] "Glass.Doors"  "Solar"        "White.Marble"

Both

summary(model_step_both)
#> 
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City + 
#>     Glass.Doors + Black.Marble + Area + Baths + Garage + FirePlace + 
#>     Electric + Solar, data = houseprice)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000134145  0.000000000  0.000000000  0.000000001  0.000004352 
#> 
#> Coefficients:
#>                           Estimate            Std. Error        t value
#> (Intercept)    999.999999904189053     0.000000001590920   628567198577
#> Floors       14999.999999853393092     0.000000000584752 25651884669521
#> Fiber        11749.999999947449396     0.000000000584754 20093909338301
#> White.Marble 14000.000000052730684     0.000000000715828 19557769398158
#> City          3499.999999992096491     0.000000000358215  9770672162967
#> Glass.Doors   4450.000000006955815     0.000000000584753  7610044923319
#> Black.Marble  4999.999999999224201     0.000000000715991  6983330850034
#> Area            24.999999999998334     0.000000000004072  6138921887378
#> Baths         1250.000000000046384     0.000000000206741  6046209906362
#> Garage        1500.000000000003638     0.000000000357867  4191502987704
#> FirePlace      750.000000000298428     0.000000000206769  3627233896271
#> Electric      1249.999999999690090     0.000000000584752  2137659386244
#> Solar          249.999999999674060     0.000000000584756   427528850660
#>                         Pr(>|t|)    
#> (Intercept)  <0.0000000000000002 ***
#> Floors       <0.0000000000000002 ***
#> Fiber        <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> City         <0.0000000000000002 ***
#> Glass.Doors  <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Area         <0.0000000000000002 ***
#> Baths        <0.0000000000000002 ***
#> Garage       <0.0000000000000002 ***
#> FirePlace    <0.0000000000000002 ***
#> Electric     <0.0000000000000002 ***
#> Solar        <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000002067 on 499987 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.43e+26 on 12 and 499987 DF,  p-value: < 0.00000000000000022

Model Interpretation

  • All chosen predictor variables are significant.
  • The adjusted R-squared value is 1, meaning that all (100%) variance from target variable Prices can be explained by the chosen predictor variables.
  • The chosen predictor variables for this model are:
names(model_step_both$coefficients) %>% sort
#>  [1] "(Intercept)"  "Area"         "Baths"        "Black.Marble" "City"        
#>  [6] "Electric"     "Fiber"        "FirePlace"    "Floors"       "Garage"      
#> [11] "Glass.Doors"  "Solar"        "White.Marble"

Prediction

We will make predictions for each model on our data and evaluate the prediction results.

houseprice_pred <- data.frame(actual_price = houseprice$Prices)

houseprice_pred$model_all <- predict(model_all, houseprice)
houseprice_pred$model_multiple <-predict(model_multiple, houseprice)

houseprice_pred$model_step_forward <- predict(model_step_forward, houseprice)
houseprice_pred$model_step_backward <-predict(model_step_backward, houseprice)
houseprice_pred$model_step_both <-predict(model_step_both, houseprice)

head(houseprice_pred)
#>   actual_price model_all model_multiple model_step_forward model_step_backward
#> 1        43800     43800       44805.59              43800               43800
#> 2        37550     37550       39557.73              37550               37550
#> 3        49500     49500       45902.86              49500               49500
#> 4        50075     50075       47815.66              50075               50075
#> 5        52400     52400       53579.18              52400               52400
#> 6        54300     54300       52812.92              54300               54300
#>   model_step_both
#> 1           43800
#> 2           37550
#> 3           49500
#> 4           50075
#> 5           52400
#> 6           54300

Here we can see the actual_price and compare it with the predicted house price from each models. Most of them seem to be able to predict the house price accurately. However, we may not be able to tell which one is the best model just by looking at this table. Therefore, we need to create a model comparison.

Model Comparison

We have made several models for our data. We want to pick the best model that can both predict and also describe our target variable (prices). The evaluation method that we will use are by looking at:

  • AIC, or Akaike Information Criterion

    AIC value represents information loss. 
    The smaller AIC value, the less information is lost from our model.
  • R-square value

    The higher r-square value, the better our chosen predictors can describe target variable variance.
    Because all of our models use more than 1 predictor, we will use the adjusted R-square value.
  • Error value (difference between predicted and actual value), The smaller error value, the better our model can predict target variable.

    For this case we will calculate the error using RMSE, or Root Mean Squared Error, 
    which is the square root of squared error value. 
    It is sensitive to error and can be easily interpreted.

To do the comparison, we will utilize the performance library.

comparison <- compare_performance(  model_all, model_multiple,
                                    model_step_forward, model_step_backward, model_step_both  )

as.data.frame(comparison)[, c('Name','AIC','R2_adjusted', 'RMSE')] %>% arrange(RMSE)
#>                  Name       AIC R2_adjusted               RMSE
#> 1           model_all -14057928   1.0000000    0.0000001898760
#> 2 model_step_backward -14057928   1.0000000    0.0000001898760
#> 3  model_step_forward -13972851   1.0000000    0.0000002067371
#> 4     model_step_both -13972851   1.0000000    0.0000002067371
#> 5      model_multiple   9445453   0.9361105 3061.0061860142678

We want a model that has:

  1. Small AIC
  2. High R-square value
  3. Low RMSE

From the table above we can see that model_all has the smallest AIC value, highest R-squared value, and smallest RMSE. Therefore, we can conclude that model_all is the best model that we have.

Assumption Checking

We need to check if our model fulfills the assumptions for Linear Regression model, to ensure that it can predict new data consistently (Best Linear Unbiased Estimator / BLUE model).

The assumptions that we need to check are:

  1. Linearity
  2. Normality of Residuals
  3. Homoscedasticity of Residuals
  4. No multicollinearity

We will perform the assumption checking only on our best model, which is model_all.

Linearity

The assumption of linearity means that the target variable and its predictors have a linear relationship. In multiple linear regression, the assumption of linearity can be assessed by visualizing the residuals vs. the fitted / predicted values using a scatter plot.

The line in the plot should be close to linear / have no pattern.

plot(model_all, which = 1)

Conclusion:

  • The target variable and predictors have a linear relationship.
  • The linearity assumption is satisfied.

Normality of Residual

The errors/residuals are expected to have a normal distribution, so the error values will cluster around zero.

We can check this assumption using:

  1. Histogram visualization of residual using the hist() function.
  2. Shapiro-Wilk hypothesis test using shapiro.test()

However, using the shapiro.test() function to our data will cause an error as follows: > Error in shapiro.test(.) : sample size must be between 3 and 5000

This is because our model size exceeds 5000, as seen below:

length(model_all$residuals)
#> [1] 500000

Therefore, we will use histogram visualization.

hist(model_all$residuals)

As the residual values cluster around 0, we can conclude that the assumption is satisfied.

Homoscedasticity of Residuals

Error/residuals are expected to have no pattern. If we plot the model’s residuals vs fitted.values, and the residual values spread around zero, then it is homoscedasticity.

We can check this assumption in 2 ways:

  1. Scatter plot visualisation of the model’s residuals vs fitted.values
plot(model_all$fitted.values, model_all$residuals)
abline(h=0, col="red")

  1. Breuch-Pagan hypothesis test using bptest() from package lmtest
bptest(model_all)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_all
#> BP = 11.298, df = 12, p-value = 0.5035

The residuals spreads around 0 in the scatter plot and the p-value > 0.05 in th Breuch-Pagan test. Therefore, we can conclude that the homoscedasticity assumption is satisfied.

No multicollinearity

Multicollinearity is the condition where there are strong correlation(s) between predictor variables. This is undesirable because it indicates redundant predictors in the model, and ideally, only one of the highly correlated variables should be selected.

We will do VIF (Variance Inflation Factor) test using vif() function from package car:

  • If VIF > 10: there is multicollinearity in the model
  • If VIF < 10: there is no multicollinearity in the model
vif(model_all)
#>         Area       Garage    FirePlace        Baths White.Marble Black.Marble 
#>     1.000022     1.000031     1.000009     1.000030     1.331386     1.331386 
#>       Floors         City        Solar     Electric        Fiber  Glass.Doors 
#>     1.000012     1.000024     1.000018     1.000009     1.000019     1.000017

All VIF values are < 10, therefore there is no multicollinearity in the model and the assumption is satisfied.

Conclusion

Our best model is model_all which includes all house features except for Indian.marble, Swiming.Pool and Garden as predictors for target variable Prices. This model’s adjusted R-square value is 1 and its RMSE value is 0.0000001898756. This model satisfy all linearity regression assumption and can be used to predict house prices based on its features.