Introduction

Machine learning aims to build models that learn patterns from data and use them to predict future outcomes. Here we use a house price data set to examine which variables affect the price of a house, and to find out which model best predicts the price from the variables in the data set.

Data Preparation

Load the required packages.

library(dplyr)
library(MLmetrics)
library(lmtest)
library(car)
library(GGally)
library(performance)
library(nortest)

options(scipen = 100, max.print = 1e+06)

Read data

# read data house price
houseprice <- read.csv("HousePrices_HalfMil.csv")
rmarkdown::paged_table(houseprice)

str(houseprice)

#> 'data.frame':    500000 obs. of  16 variables:
#>  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
#>  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
#>  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
#>  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
#>  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
#>  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
#>  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
#>  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
#>  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
#>  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
#>  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
#>  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
#>  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
#>  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
#>  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
#>  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

The data has 500,000 rows and 16 variables. Below is a description of each variable:
- Area: area of the house
- Garage: number of cars the garage holds
- FirePlace: number of fireplaces in the house
- Baths: number of bathrooms in the house
- White.Marble: whether the house uses white marble (0 = No, 1 = Yes)
- Black.Marble: whether the house uses black marble (0 = No, 1 = Yes)
- Indian.Marble: whether the house uses Indian marble (0 = No, 1 = Yes)
- Floors: number of floors in the house
- City: city where the house is located
- Solar: whether the house uses solar panels (0 = No, 1 = Yes)
- Electric: whether the house uses electricity for the solar panels (0 = No, 1 = Yes)
- Fiber: whether the house uses fiber (0 = No, 1 = Yes)
- Glass.Doors: whether the house has glass doors (0 = No, 1 = Yes)
- Swiming.Pool: whether the house has a swimming pool (0 = No, 1 = Yes)
- Garden: whether the house has a garden (0 = No, 1 = Yes)
- Prices: price of the house

We want to predict the price of a house from the other variables, so our target variable is Prices.

Before going further, we need to make sure the data is clean and that each column has the proper data type. Some variables are stored as integers but are really categorical, so we convert them to factors.

# transform int into factor
houseprice_clean <- houseprice %>% mutate_at(
  c("White.Marble","Black.Marble","Indian.Marble","City","Solar","Electric","Fiber","Glass.Doors","Swiming.Pool","Garden"), factor
)

glimpse(houseprice_clean)

#> Rows: 500,000
#> Columns: 16
#> $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, ~
#> $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,~
#> $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,~
#> $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,~
#> $ White.Marble  <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
#> $ Black.Marble  <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,~
#> $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,~
#> $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,~
#> $ City          <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,~
#> $ Solar         <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,~
#> $ Electric      <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,~
#> $ Fiber         <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,~
#> $ Glass.Doors   <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,~
#> $ Swiming.Pool  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,~
#> $ Garden        <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,~
#> $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, ~
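To see what the factor conversion changes for the model, here is a minimal sketch on a three-row toy data frame (`toy` is hypothetical and not part of the data set). `model.matrix()` shows the design columns that `lm()` would build from a formula: a numeric City gets a single slope, while a factor City gets one dummy column per non-reference level.

```r
# Toy data (hypothetical): one row per City level
toy <- data.frame(City = c(1, 2, 3))

# City kept as a number: a single column, so the model fits one slope
mm_numeric <- model.matrix(~ City, data = toy)

# City as a factor: one dummy column per non-reference level
toy$City <- factor(toy$City)
mm_factor <- model.matrix(~ City, data = toy)

colnames(mm_numeric)
colnames(mm_factor)
```

This is why the regression summaries report City2 and City3 separately, each measured against City 1 as the reference level.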

# check for missing values
colSums(is.na(houseprice_clean))

#>          Area        Garage     FirePlace         Baths  White.Marble 
#>             0             0             0             0             0 
#>  Black.Marble Indian.Marble        Floors          City         Solar 
#>             0             0             0             0             0 
#>      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
#>             0             0             0             0             0 
#>        Prices 
#>             0

Exploratory Data Analysis

Now we can explore the data and look for patterns that suggest relationships between variables.

Check distribution of Prices variable:

boxplot(houseprice_clean$Prices)

💡 Insight:

- There are outliers at the high end of the price range.
- Most prices are concentrated in the middle of the range (30,000-50,000).
- The box is fairly narrow, so prices do not vary widely.
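A boxplot shows outliers visually; to count them, one option is `boxplot.stats()`, which applies the same 1.5 * IQR rule. The sketch below runs on simulated prices (the vector `prices` is a hypothetical stand-in for `houseprice_clean$Prices`), so the numbers it prints are not results from the real data.

```r
# Simulated prices (hypothetical stand-in for houseprice_clean$Prices)
set.seed(1)
prices <- c(rnorm(1000, mean = 42000, sd = 6000), 75000, 77000)

# boxplot.stats() flags points beyond 1.5 * IQR from the box edges
stats <- boxplot.stats(prices)
n_outliers <- length(stats$out)
share_outliers <- n_outliers / length(prices)

n_outliers
share_outliers
range(stats$out)
```

Running the same three lines on the real Prices column would show how many houses fall outside the whiskers and how extreme they are.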

Check the distribution of the Area, Garage, FirePlace, Baths, and Floors variables:

boxplot(houseprice_clean$Area)

boxplot(houseprice_clean$Garage)

boxplot(houseprice_clean$FirePlace)

boxplot(houseprice_clean$Baths)

boxplot(houseprice_clean$Floors)

💡 Insight:

- No outliers in any of these variables.

Check correlation between variables:

ggcorr(houseprice_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

💡 Insight:

- Area, Garage, FirePlace, Baths, and Floors are all positively correlated with Prices.
- Of these, Floors has the strongest positive correlation with Prices.
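The correlation plot only covers numeric columns, so the factor variables do not appear in it. As a rough cross-check, the same numbers can be computed with base R's `cor()`. The sketch below uses simulated data (the data frame `toy` and its coefficients are assumptions for illustration only).

```r
# Simulated data (hypothetical): price driven mostly by Floors, weakly by Area
set.seed(42)
n <- 500
toy <- data.frame(
  Area   = sample(50:250, n, replace = TRUE),
  Floors = sample(0:1, n, replace = TRUE),
  City   = factor(sample(1:3, n, replace = TRUE))
)
toy$Prices <- 25 * toy$Area + 15000 * toy$Floors + rnorm(n, sd = 3000)

# Keep numeric columns only; factor columns such as City are skipped
num_cols <- vapply(toy, is.numeric, logical(1))
cor_matrix <- cor(toy[, num_cols])

# Correlation of each numeric variable with Prices, strongest first
sort(cor_matrix[, "Prices"], decreasing = TRUE)
```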

Modeling

Let’s use all variables other than Prices as our predictor variables.

model_houseprice_multi <- lm(Prices ~ ., houseprice_clean)
summary(model_houseprice_multi)

#> 
#> Call:
#> lm(formula = Prices ~ ., data = houseprice_clean)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000114740  0.000000000  0.000000000  0.000000001  0.000000504 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                             Estimate            Std. Error            t value
#> (Intercept)     4499.999999838227268     0.000000001204656  3735506482202.710
#> Area              24.999999999988276     0.000000000003196  7821089712763.962
#> Garage          1500.000000000106411     0.000000000280896  5340047166572.192
#> FirePlace        750.000000000248633     0.000000000162297  4621155359901.714
#> Baths           1250.000000000239197     0.000000000162276  7702942606260.459
#> White.Marble1  13999.999999999954525     0.000000000561868 24916899410430.270
#> Black.Marble1   4999.999999999263309     0.000000000561994  8896888137234.410
#> Indian.Marble1                    NA                    NA                 NA
#> Floors         15000.000000000349246     0.000000000458984 32680891666696.090
#> City2           3500.000000000096406     0.000000000562242  6225082231790.734
#> City3           6999.999999999426109     0.000000000562339 12448001978974.068
#> Solar1           249.999999999535987     0.000000000458990   544674534492.101
#> Electric1       1249.999999999550255     0.000000000458983  2723414492680.783
#> Fiber1         11749.999999999592546     0.000000000458990 25599693872701.074
#> Glass.Doors1    4449.999999999497959     0.000000000458986  9695282581124.541
#> Swiming.Pool1      0.000000000462530     0.000000000458987              1.008
#> Garden1            0.000000000462974     0.000000000458991              1.009
#>                           Pr(>|t|)    
#> (Intercept)    <0.0000000000000002 ***
#> Area           <0.0000000000000002 ***
#> Garage         <0.0000000000000002 ***
#> FirePlace      <0.0000000000000002 ***
#> Baths          <0.0000000000000002 ***
#> White.Marble1  <0.0000000000000002 ***
#> Black.Marble1  <0.0000000000000002 ***
#> Indian.Marble1                  NA    
#> Floors         <0.0000000000000002 ***
#> City2          <0.0000000000000002 ***
#> City3          <0.0000000000000002 ***
#> Solar1         <0.0000000000000002 ***
#> Electric1      <0.0000000000000002 ***
#> Fiber1         <0.0000000000000002 ***
#> Glass.Doors1   <0.0000000000000002 ***
#> Swiming.Pool1                0.314    
#> Garden1                      0.313    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001623 on 499984 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.856e+26 on 15 and 499984 DF,  p-value: < 0.00000000000000022

💡 Insight:

Slopes for numeric predictors (each holding the other predictors constant):

- A one-unit increase in Area raises Prices by about 25.
- A one-unit increase in Garage raises Prices by about 1500.
- A one-unit increase in FirePlace raises Prices by about 750.
- A one-unit increase in Baths raises Prices by about 1250.
- A one-unit increase in Floors raises Prices by about 15000.

Slopes for categorical predictors (dummy variables), again holding the other predictors constant:

- A house with White.Marble is priced about 14000 higher than one without.
- A house with Black.Marble is priced about 5000 higher than one without.
- A house in City 2 is priced about 3500 higher than one in City 1.
- A house in City 3 is priced about 7000 higher than one in City 1.
- A house with Solar is priced about 250 higher than one without.
- A house with Electric is priced about 1250 higher than one without.
- A house with Fiber is priced about 11750 higher than one without.
- A house with Glass.Doors is priced about 4450 higher than one without.
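To make the slope interpretation concrete, here is a small worked example that rebuilds the price of the first house in the data from the coefficients. The coefficients are rounded from the summary output, and the near-zero Swiming.Pool and Garden terms are left out; the vectors `coefs` and `house` are names introduced here for illustration.

```r
# Coefficients rounded from summary(model_houseprice_multi)
coefs <- c(
  intercept = 4500, Area = 25, Garage = 1500, FirePlace = 750, Baths = 1250,
  White.Marble1 = 14000, Black.Marble1 = 5000, Floors = 15000,
  City2 = 3500, City3 = 7000, Solar1 = 250, Electric1 = 1250,
  Fiber1 = 11750, Glass.Doors1 = 4450
)

# First row of the data as printed by str(): Area 164, Garage 2, FirePlace 0,
# Baths 2, Black.Marble 1, Floors 0, City 3, Solar 1, Electric 1, Fiber 1,
# Glass.Doors 1 (a dummy is 1 when the level applies, otherwise 0)
house <- c(
  intercept = 1, Area = 164, Garage = 2, FirePlace = 0, Baths = 2,
  White.Marble1 = 0, Black.Marble1 = 1, Floors = 0,
  City2 = 0, City3 = 1, Solar1 = 1, Electric1 = 1,
  Fiber1 = 1, Glass.Doors1 = 1
)

# Prediction = sum of coefficient * value over all terms
predicted_price <- sum(coefs * house[names(coefs)])
predicted_price
#> [1] 43800
```

The result matches the first value of Prices shown earlier (43800), which is what a residual of practically zero looks like for a single row.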

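Significant predictors: Based on the Pr(>|t|) values, all estimated predictors are significant except Swiming.Pool and Garden, whose coefficients are effectively zero (p-values 0.314 and 0.313). Indian.Marble has no estimate at all (NA): the summary reports "1 not defined because of singularities", which means this dummy is an exact linear combination of other predictors and adds no information of its own. We can therefore try removing the predictors that do not contribute.
Adjusted R-squared: Because this is a multiple linear regression (more than one predictor), we look at the adjusted R-squared, which equals 1. The predictors explain essentially 100% of the variation in Prices, leaving nothing for variables outside the model. Together with a residual standard error of about 0.00000016, this indicates that Prices is, for practical purposes, an exact linear function of the predictors.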
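The NA row for Indian.Marble1 comes with the note "1 not defined because of singularities" in the summary. A likely explanation is that the three marble dummies always add up to a constant, so the last one is fully determined by the other two. The sketch below reproduces that situation on simulated data; the data frame `toy` and the assumption that every house has exactly one marble type are ours, not something the data set documents.

```r
# Toy data (hypothetical): every house has exactly one marble type
set.seed(7)
n <- 300
marble <- sample(c("white", "black", "indian"), n, replace = TRUE)
toy <- data.frame(
  White.Marble  = factor(as.integer(marble == "white")),
  Black.Marble  = factor(as.integer(marble == "black")),
  Indian.Marble = factor(as.integer(marble == "indian"))
)
toy$Prices <- 10000 + 14000 * (marble == "white") +
  5000 * (marble == "black") + rnorm(n, sd = 500)

fit <- lm(Prices ~ ., data = toy)

# Indian.Marble1 comes out NA: it equals 1 - White.Marble1 - Black.Marble1
coef(fit)

# alias() reports which term is a linear combination of the others
alias(fit)$Complete
```

In such a case the NA is not a sign of an unimportant variable; the effect of Indian marble is absorbed into the intercept as the reference category.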

Re-Modeling

Now let’s create a new model that uses only the three predictors with the largest t-values in model_houseprice_multi: White.Marble, Floors, and Fiber.

model_houseprice_multi2 <- lm(Prices ~ White.Marble+Floors+Fiber, data = houseprice_clean)
summary(model_houseprice_multi2)

#> 
#> Call:
#> lm(formula = Prices ~ White.Marble + Floors + Fiber, data = houseprice_clean)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -17547  -3594      6   3589  17501 
#> 
#> Coefficients:
#>               Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)   24862.19      13.63  1823.8 <0.0000000000000002 ***
#> White.Marble1 11521.81      15.47   744.8 <0.0000000000000002 ***
#> Floors        14986.44      14.58  1027.8 <0.0000000000000002 ***
#> Fiber1        11723.54      14.58   804.1 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5155 on 499996 degrees of freedom
#> Multiple R-squared:  0.8188, Adjusted R-squared:  0.8188 
#> F-statistic: 7.532e+05 on 3 and 499996 DF,  p-value: < 0.00000000000000022

The adjusted R-squared of this model is 0.8188.

Prediction

Since we do not have new data, let’s use the training data to see how the models perform, i.e. predict house prices for the same rows the models were fitted on.

Let’s compare the performance of two models:
1. model_houseprice_multi: all predictor variables (numeric and categorical)
2. model_houseprice_multi2: three significant predictor variables (White.Marble, Floors, and Fiber)

# save the prediction result in new column
houseprice_clean$pred_multi <- predict(object = model_houseprice_multi, newdata = houseprice_clean)
houseprice_clean$pred_multi2 <- predict(object = model_houseprice_multi2, newdata = houseprice_clean)

head(houseprice_clean)
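Predicting on the training rows tends to give an optimistic picture of accuracy. When no new data is available, a common alternative is to hold out part of the data before fitting. The following is a minimal sketch on simulated data; the data frame `toy`, the 80/20 split, and the coefficients are assumptions for illustration, not part of the original analysis.

```r
# Simulated stand-in for the house price data (hypothetical values)
set.seed(123)
n <- 1000
toy <- data.frame(
  Area   = sample(50:250, n, replace = TRUE),
  Floors = sample(0:1, n, replace = TRUE)
)
toy$Prices <- 4500 + 25 * toy$Area + 15000 * toy$Floors + rnorm(n, sd = 2000)

# Hold out 20% of the rows as unseen test data
test_idx <- sample(seq_len(n), size = 0.2 * n)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Fit on the training rows only
fit <- lm(Prices ~ Area + Floors, data = train)

# RMSE on rows the model has never seen
test_pred <- predict(fit, newdata = test)
rmse_test <- sqrt(mean((test$Prices - test_pred)^2))
rmse_test
```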

Model Evaluation

Now let’s evaluate the models to find out which model is better than others.

R-squared

First, we look at goodness of fit based on the adjusted R-squared value.

# check r-squared value for each model
summary(model_houseprice_multi)$adj.r.squared # all predictors

#> [1] 1

summary(model_houseprice_multi2)$adj.r.squared # 3 predictors

#> [1] 0.8188059

Based on adjusted R-squared, the best model is model_houseprice_multi.

Error

Remember that a regression model aims to minimize prediction error, so we also measure the difference between the actual and predicted values.

# check RMSE for each model
RMSE(y_pred = houseprice_clean$pred_multi, y_true = houseprice_clean$Prices)

#> [1] 0.0000001625065

RMSE(y_pred = houseprice_clean$pred_multi2, y_true = houseprice_clean$Prices)

#> [1] 5154.932

Based on RMSE, the best model is model_houseprice_multi. Its predictions deviate from the actual price by about 0.0000001625 (root mean squared error), which is effectively zero.
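For reference, RMSE can be computed by hand in one line, which makes clear what the metric measures. In this sketch the actual values are the first four prices from the data shown earlier, while the predicted values are made up for illustration.

```r
# Actual prices from the first four rows; predictions are hypothetical
actual    <- c(43800, 37550, 49500, 50075)
predicted <- c(44000, 37000, 49500, 51000)

# RMSE: square the errors, average them, then take the square root
errors <- actual - predicted
rmse <- sqrt(mean(errors^2))
rmse
```

Because the errors are squared before averaging, a few large misses weigh more than many small ones, and the result is in the same unit as Prices.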

Step-wise Regression

As another way to find the best model, let’s try step-wise regression. Step-wise regression searches for a good set of predictors by adding or removing one predictor at a time and keeping the combination with the best AIC value. The Akaike Information Criterion (AIC) estimates how much information a model loses, so the best model is the one with the smallest AIC.
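To show what the AIC value is made of, the sketch below computes it by hand for a small model and compares it with `AIC()`. The data is simulated (`toy`, `x1`, and `x2` are hypothetical names). Note that `step()` relies on `extractAIC()`, which for linear models differs from `AIC()` by an additive constant, so the absolute numbers differ while the ranking of models fitted to the same data stays the same.

```r
# Simulated data (hypothetical): x1 matters, x2 is pure noise
set.seed(2022)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 + rnorm(n)

fit_small <- lm(y ~ x1, data = toy)
fit_large <- lm(y ~ x1 + x2, data = toy)

# AIC = 2 * k - 2 * log-likelihood, with k the number of estimated parameters
ll <- logLik(fit_small)
k <- attr(ll, "df")  # coefficients plus the residual variance
aic_by_hand <- 2 * k - 2 * as.numeric(ll)

aic_by_hand
AIC(fit_small)
AIC(fit_large)
```

The penalty term 2 * k is what discourages adding predictors that barely improve the fit.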

Backward Elimination

# stepwise regression: backward elimination
model_houseprice_backward <- step(object = model_houseprice_multi,
                       direction = "backward",
                       trace = FALSE) # step-wise process not shown

# summary model backward
summary(model_houseprice_backward)

#> 
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble + 
#>     Black.Marble + Floors + City + Solar + Electric + Fiber + 
#>     Glass.Doors, data = houseprice_clean)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000114741  0.000000000  0.000000000  0.000000001  0.000000504 
#> 
#> Coefficients:
#>                            Estimate            Std. Error        t value
#> (Intercept)    4499.999999838680196     0.000000001160900  3876302704558
#> Area             24.999999999988287     0.000000000003196  7821099005137
#> Garage         1500.000000000106411     0.000000000280896  5340051354462
#> FirePlace       750.000000000248974     0.000000000162297  4621159145740
#> Baths          1250.000000000240107     0.000000000162275  7702972827634
#> White.Marble1 13999.999999999954525     0.000000000561866 24916955485305
#> Black.Marble1  4999.999999999263309     0.000000000561994  8896890444215
#> Floors        15000.000000000349246     0.000000000458984 32680895394628
#> City2          3500.000000000097316     0.000000000562241  6225092621617
#> City3          6999.999999999427018     0.000000000562339 12448011458101
#> Solar1          249.999999999533827     0.000000000458985   544679525694
#> Electric1      1249.999999999550710     0.000000000458982  2723415655671
#> Fiber1        11749.999999999592546     0.000000000458986 25599911210639
#> Glass.Doors1   4449.999999999498868     0.000000000458984  9695336766293
#>                          Pr(>|t|)    
#> (Intercept)   <0.0000000000000002 ***
#> Area          <0.0000000000000002 ***
#> Garage        <0.0000000000000002 ***
#> FirePlace     <0.0000000000000002 ***
#> Baths         <0.0000000000000002 ***
#> White.Marble1 <0.0000000000000002 ***
#> Black.Marble1 <0.0000000000000002 ***
#> Floors        <0.0000000000000002 ***
#> City2         <0.0000000000000002 ***
#> City3         <0.0000000000000002 ***
#> Solar1        <0.0000000000000002 ***
#> Electric1     <0.0000000000000002 ***
#> Fiber1        <0.0000000000000002 ***
#> Glass.Doors1  <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001623 on 499986 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 2.142e+26 on 13 and 499986 DF,  p-value: < 0.00000000000000022

Forward Selection

# create model without predictor variables
model_houseprice_none <- lm(Prices ~ 1, data = houseprice_clean)

# stepwise regression: forward selection
model_houseprice_forward <- step(
  object = model_houseprice_none, # lower limit
  direction = "forward",
  scope = list(upper = model_houseprice_multi), # upper limit
  trace = FALSE) # step-wise process not shown

# summary model forward
summary(model_houseprice_forward)

#> 
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City + 
#>     Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace + 
#>     Electric + Solar, data = houseprice_clean)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000105009  0.000000000  0.000000000  0.000000001  0.000016350 
#> 
#> Coefficients:
#>                             Estimate            Std. Error        t value
#> (Intercept)     9499.999999827012289     0.000000001077595  8815929964174
#> Floors         15000.000000046766218     0.000000000425905 35219132098377
#> Fiber1         11749.999999995810867     0.000000000425907 27588187035442
#> White.Marble1   8999.999999984282113     0.000000000522011 17241008183984
#> City2           3500.000000009361429     0.000000000521720  6708578719096
#> City3           7000.000000005459697     0.000000000521811 13414814821034
#> Glass.Doors1    4450.000000006350092     0.000000000425905 10448347335250
#> Indian.Marble1 -5000.000000000656655     0.000000000521491 -9587887848100
#> Area              24.999999999999648     0.000000000002966  8428542599274
#> Baths           1250.000000000126420     0.000000000150580  8301241881238
#> Garage          1500.000000000003183     0.000000000260652  5754798691823
#> FirePlace        750.000000000237151     0.000000000150600  4980072070729
#> Electric1       1249.999999999593683     0.000000000425904  2934935979493
#> Solar1           249.999999999554802     0.000000000425906   586983310434
#>                           Pr(>|t|)    
#> (Intercept)    <0.0000000000000002 ***
#> Floors         <0.0000000000000002 ***
#> Fiber1         <0.0000000000000002 ***
#> White.Marble1  <0.0000000000000002 ***
#> City2          <0.0000000000000002 ***
#> City3          <0.0000000000000002 ***
#> Glass.Doors1   <0.0000000000000002 ***
#> Indian.Marble1 <0.0000000000000002 ***
#> Area           <0.0000000000000002 ***
#> Baths          <0.0000000000000002 ***
#> Garage         <0.0000000000000002 ***
#> FirePlace      <0.0000000000000002 ***
#> Electric1      <0.0000000000000002 ***
#> Solar1         <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001506 on 499986 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 2.488e+26 on 13 and 499986 DF,  p-value: < 0.00000000000000022

Both

# stepwise regression: both
model_houseprice_both <- step(
  object = model_houseprice_none, # lower limit
  direction = "both",
  scope = list(upper = model_houseprice_multi), # upper limit
  trace = FALSE
)

# summary model both
summary(model_houseprice_both)

#> 
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City + 
#>     Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace + 
#>     Electric + Solar, data = houseprice_clean)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -0.000105009  0.000000000  0.000000000  0.000000001  0.000016350 
#> 
#> Coefficients:
#>                             Estimate            Std. Error        t value
#> (Intercept)     9499.999999827012289     0.000000001077595  8815929964174
#> Floors         15000.000000046766218     0.000000000425905 35219132098377
#> Fiber1         11749.999999995810867     0.000000000425907 27588187035442
#> White.Marble1   8999.999999984282113     0.000000000522011 17241008183984
#> City2           3500.000000009361429     0.000000000521720  6708578719096
#> City3           7000.000000005459697     0.000000000521811 13414814821034
#> Glass.Doors1    4450.000000006350092     0.000000000425905 10448347335250
#> Indian.Marble1 -5000.000000000656655     0.000000000521491 -9587887848100
#> Area              24.999999999999648     0.000000000002966  8428542599274
#> Baths           1250.000000000126420     0.000000000150580  8301241881238
#> Garage          1500.000000000003183     0.000000000260652  5754798691823
#> FirePlace        750.000000000237151     0.000000000150600  4980072070729
#> Electric1       1249.999999999593683     0.000000000425904  2934935979493
#> Solar1           249.999999999554802     0.000000000425906   586983310434
#>                           Pr(>|t|)    
#> (Intercept)    <0.0000000000000002 ***
#> Floors         <0.0000000000000002 ***
#> Fiber1         <0.0000000000000002 ***
#> White.Marble1  <0.0000000000000002 ***
#> City2          <0.0000000000000002 ***
#> City3          <0.0000000000000002 ***
#> Glass.Doors1   <0.0000000000000002 ***
#> Indian.Marble1 <0.0000000000000002 ***
#> Area           <0.0000000000000002 ***
#> Baths          <0.0000000000000002 ***
#> Garage         <0.0000000000000002 ***
#> FirePlace      <0.0000000000000002 ***
#> Electric1      <0.0000000000000002 ***
#> Solar1         <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000001506 on 499986 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 2.488e+26 on 13 and 499986 DF,  p-value: < 0.00000000000000022

Comparison

Let’s compare Adjusted R-squared value for those 3 models:

summary(model_houseprice_backward)$adj.r.squared

#> [1] 1

summary(model_houseprice_forward)$adj.r.squared

#> [1] 1

summary(model_houseprice_both)$adj.r.squared

#> [1] 1

Let’s see comparison of all models that we have created.

comparison <- compare_performance(model_houseprice_none, model_houseprice_multi,  model_houseprice_multi2, model_houseprice_backward, model_houseprice_forward, model_houseprice_both)

as.data.frame(comparison)

💡 Insight:

- All models have an adjusted R-squared of 1, except model_houseprice_none and model_houseprice_multi2.
- The models with the smallest RMSE are model_houseprice_forward and model_houseprice_both.
- The models with the smallest AIC are model_houseprice_forward and model_houseprice_both.

Prediction Interval

In this section, let’s use one of the best models from step-wise regression, i.e. model_houseprice_forward.

# ordinary prediction
pred_model_step <- predict(model_houseprice_forward, newdata = houseprice_clean)
head(pred_model_step)

#>     1     2     3     4     5     6 
#> 43800 37550 49500 50075 52400 54300

If we need a range rather than a single value, we can add interval = "prediction" to get a prediction interval.

pred_model_step_interval <- predict(
  object = model_houseprice_forward,
  newdata = houseprice_clean,
  interval = "prediction",
  level = 0.95) # level of confidence, with 0.05 error rate

head(pred_model_step_interval)

#>     fit   lwr   upr
#> 1 43800 43800 43800
#> 2 37550 37550 37550
#> 3 49500 49500 49500
#> 4 50075 50075 50075
#> 5 52400 52400 52400
#> 6 54300 54300 54300
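In the output above, lwr and upr are identical to fit because the residual standard error of the model is practically zero, and the interval width scales with that error. To show how the interval behaves when there is visible noise, here is a sketch on simulated data (the data frame `toy` and the noise level are assumptions). It also contrasts the prediction interval with the narrower confidence interval for the mean.

```r
# Simulated data (hypothetical) with visible noise around the line
set.seed(99)
n <- 200
toy <- data.frame(Area = runif(n, 50, 250))
toy$Prices <- 4500 + 25 * toy$Area + rnorm(n, sd = 1500)

fit <- lm(Prices ~ Area, data = toy)
new_house <- data.frame(Area = 150)

# Interval for the mean price of houses with Area = 150
conf_int <- predict(fit, newdata = new_house,
                    interval = "confidence", level = 0.95)
# Interval for the price of one individual house with Area = 150
pred_int <- predict(fit, newdata = new_house,
                    interval = "prediction", level = 0.95)

width_conf <- unname(conf_int[1, "upr"] - conf_int[1, "lwr"])
width_pred <- unname(pred_int[1, "upr"] - pred_int[1, "lwr"])
c(confidence = width_conf, prediction = width_pred)
```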

Conclusion

- Adjusted R-squared: model_houseprice_forward and model_houseprice_both reach 1, the highest possible value (as do model_houseprice_multi and model_houseprice_backward).
- RMSE: model_houseprice_forward and model_houseprice_both have the smallest value.
- Number of predictor variables: 12
- Selected model for predicting new data: model_houseprice_forward or model_houseprice_both (both end up with the same formula).

Assumption

Below are the assumptions we need to check before our model can be considered a Best Linear Unbiased Estimator (BLUE), i.e. a model that can predict new data consistently.

Normality of Residuals

The model’s errors should be normally distributed, which means the residuals should cluster around zero.

# residual histogram
hist(model_houseprice_forward$residuals)

# Anderson-Darling test from residual
ad.test(model_houseprice_forward$residuals)

#> 
#>  Anderson-Darling normality test
#> 
#> data:  model_houseprice_forward$residuals
#> A = 190847, p-value < 0.00000000000000022

With p-value < alpha (0.05), we reject the hypothesis that the residuals are normally distributed. The assumption is therefore not fulfilled, and we would need to look at how to handle this.
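With 500,000 observations, a formal normality test can reject for deviations that are too small to matter in practice, so a visual check is a useful complement. The sketch below draws a Q-Q plot and runs `shapiro.test()` on a random subsample, since that test accepts at most 5000 values. The residuals here are simulated (`resid_sim` is a hypothetical stand-in for `model_houseprice_forward$residuals`).

```r
# Simulated residuals (hypothetical stand-in for the model residuals)
set.seed(5)
resid_sim <- rnorm(50000)

# shapiro.test() accepts at most 5000 values, so test a random subsample
resid_sample <- sample(resid_sim, size = 5000)
sw <- shapiro.test(resid_sample)
sw$p.value

# Q-Q plot: points close to the line suggest approximately normal residuals
qqnorm(resid_sample)
qqline(resid_sample, col = "red")
```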

Homoscedasticity of Residuals

# scatter plot
plot(x = model_houseprice_forward$fitted.values, y = model_houseprice_forward$residuals)
abline(h = 0, col = "red") # horizontal line at 0

# bptest of model
bptest(model_houseprice_forward)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_houseprice_forward
#> BP = 11.606, df = 13, p-value = 0.5602

With p-value > alpha (0.05), we fail to reject the hypothesis of constant residual variance (homoscedasticity). Therefore, the assumption is fulfilled.
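To show what the studentized Breusch-Pagan test measures, the sketch below computes the statistic by hand in base R: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by the number of rows. The data is simulated with an error spread that deliberately grows with x (the data frame `toy` is hypothetical), so here the test should reject.

```r
# Simulated data (hypothetical) where the error spread grows with x
set.seed(11)
n <- 500
toy <- data.frame(x = runif(n, 1, 10))
toy$y <- 1 + 2 * toy$x + rnorm(n, sd = toy$x)

fit <- lm(y ~ x, data = toy)

# Auxiliary regression: squared residuals on the same predictors
toy$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ x, data = toy)

# Studentized (Koenker) statistic: n times the auxiliary R-squared
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1  # number of predictors
bp_pvalue <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_pvalue)
```

A small p-value means the squared residuals can be predicted from the predictors, i.e. the variance is not constant.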

No Multicollinearity

summary(model_houseprice_forward)$call

#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City + 
#>     Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace + 
#>     Electric + Solar, data = houseprice_clean)

# vif of model
vif(model_houseprice_forward)

#>                   GVIF Df GVIF^(1/(2*Df))
#> Floors        1.000016  1        1.000008
#> Fiber         1.000027  1        1.000013
#> White.Marble  1.334649  1        1.155270
#> City          1.000042  2        1.000011
#> Glass.Doors   1.000017  1        1.000008
#> Indian.Marble 1.334637  1        1.155265
#> Area          1.000022  1        1.000011
#> Baths         1.000032  1        1.000016
#> Garage        1.000032  1        1.000016
#> FirePlace     1.000010  1        1.000005
#> Electric      1.000011  1        1.000005
#> Solar         1.000019  1        1.000009

All VIF values are below 10, so there is no multicollinearity. Therefore, the assumption is fulfilled.
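For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated predictors (`toy`, `x1`, `x2`, and `x3` are hypothetical names, with `x3` built to overlap with `x1`). For factors with more than two levels, `vif()` reports a generalized VIF instead, which this sketch does not cover.

```r
# Simulated predictors (hypothetical): x3 is built to overlap with x1
set.seed(3)
n <- 1000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x3 <- 0.9 * toy$x1 + rnorm(n, sd = 0.3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_by_hand <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

round(vif_by_hand, 2)
```

Values near 1, as for most predictors in the output above, mean a predictor shares almost no information with the others.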

Conclusion

model_houseprice_forward is selected as the best model. The variables that are useful for explaining the variance in house prices are Floors, Fiber, White.Marble, City, Glass.Doors, Indian.Marble, Area, Baths, Garage, FirePlace, Electric, and Solar. The final model satisfies the homoscedasticity and no-multicollinearity assumptions, but not the normality of residuals. The adjusted R-squared is 1, meaning the predictors explain 100% of the variance in house prices. Prediction accuracy is measured with RMSE, which is 0.0000001505767 on the training data.

Regression Model: House Price Prediction

Ivan Shindunata

May 21, 2022
