library(dplyr)
library(GGally)
library(MLmetrics)
library(performance)
library(ggplot2)
library(lmtest)
library(car)
options(scipen=99)

This report predicts house prices with linear regression, using a house price data set taken from Kaggle. We will explore which predictors have a significant impact on house price.

This data set is computer generated, to help beginners in the field of machine learning who wish to practice R and different ML models.

1. Load House Price Dataset

house <- read.csv("house.csv")

2. Exploratory Data Analysis

Checking data structure of house data set.

glimpse(house)
## Rows: 500,000
## Columns: 16
## $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, ~
## $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,~
## $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,~
## $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,~
## $ White.Marble  <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
## $ Black.Marble  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,~
## $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,~
## $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,~
## $ City          <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,~
## $ Solar         <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,~
## $ Electric      <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,~
## $ Fiber         <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,~
## $ Glass.Doors   <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,~
## $ Swiming.Pool  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,~
## $ Garden        <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,~
## $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, ~

Below is an explanation of each column:

  • Area : Area of the house
  • Garage : Number of garages in the house
  • FirePlace : Number of fireplaces in the house
  • Baths : Number of bathrooms in the house
  • White.Marble : Whether the house uses white marble or not
  • Black.Marble : Whether the house uses black marble or not
  • Indian.Marble : Whether the house uses Indian marble or not
  • Floors : Whether the house has floors or not
  • City : Location of the house
  • Solar : Whether the house has a solar water heater or not
  • Electric : Whether the house has an electric heater or not
  • Fiber : Whether the house has a fiber connection or not
  • Glass.Doors : Whether the house has glass doors or not
  • Swiming.Pool : Whether the house has a swimming pool or not (the column name is spelled this way in the data set)
  • Garden : Whether the house has a garden or not
  • Prices : Price of the house

Checking for NA value:

anyNA(house)
## [1] FALSE

Fortunately, there are no NA values in the data set.

The data set has 500,000 rows, so we take a sample from the original data set to fit and test the models. We are going to sample 5,000 rows.

set.seed(100)
indx_hs <- sample(nrow(house), 5000)
hs_samp <- house[indx_hs,]

After taking samples we want to check the distribution of the data.

hist(house$Prices)

hist(hs_samp$Prices)

From the two histograms above, the distributions of Prices in house and hs_samp look nearly identical, so we can safely use hs_samp as the data set to fit the linear regression models.
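The visual comparison can be backed by a numeric check. The sketch below is only an illustration: `pop_prices` is a simulated stand-in for `house$Prices` (the Kaggle file is not reproduced here), and the normal distribution and its parameters are assumptions. It compares a few quantiles of a 5,000-row sample against the full vector.

```r
# Simulated stand-in for house$Prices; distribution and parameters are assumed.
set.seed(100)
pop_prices  <- round(rnorm(500000, mean = 42000, sd = 12000))
samp_prices <- pop_prices[sample(length(pop_prices), 5000)]

# Compare selected quantiles of the sample against the full data.
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
quantile_cmp <- rbind(
  population = quantile(pop_prices, probs),
  sample     = quantile(samp_prices, probs)
)

# Relative gap per quantile; small values suggest the sample is representative.
rel_diff <- abs(quantile_cmp["sample", ] - quantile_cmp["population", ]) /
  quantile_cmp["population", ]
print(round(quantile_cmp))
print(round(rel_diff, 4))
```

On the real data, the same comparison would use `house$Prices` and `hs_samp$Prices` in place of the simulated vectors.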

Next, we look at the correlation between each pair of variables:

ggcorr(hs_samp, label = TRUE,label_size = 3, hjust = 1)

From the chart above we can see that most of the predictors are not strongly correlated with Prices. The exceptions are Fiber, Floors, and White.Marble, which have a moderate correlation with Prices compared to the other predictors.

Besides that, Black.Marble, White.Marble, and Indian.Marble are correlated with each other, which could indicate multicollinearity, so we will have to drop at least one of these columns from the model.
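The ranking read off the correlation chart can also be obtained numerically with `cor()`. The sketch below runs on hypothetical toy data: the column names follow the report, but the effect sizes and the noise level are invented for illustration.

```r
# Hypothetical toy data; effect sizes and noise are assumptions, not the Kaggle data.
set.seed(42)
n <- 5000
toy <- data.frame(
  Floors = rbinom(n, 1, 0.5),
  Fiber  = rbinom(n, 1, 0.5),
  Garden = rbinom(n, 1, 0.5)
)
toy$Prices <- 28000 + 15000 * toy$Floors + 11750 * toy$Fiber + rnorm(n, sd = 5000)

# Correlation of every predictor with the target, strongest first.
cor_with_price <- cor(toy)[, "Prices"]
cor_with_price <- cor_with_price[names(cor_with_price) != "Prices"]
cor_sorted <- cor_with_price[order(abs(cor_with_price), decreasing = TRUE)]
print(round(cor_sorted, 2))
```

With the real sample, `cor(hs_samp)[, "Prices"]` would give the values that ggcorr displays in the Prices row.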

3. Linear Regression

Regression with Fiber & Floors

We will start with Fiber and Floors, which show no multicollinearity, in the model mod_house_ff below.

mod_house_ff <- lm(formula=Prices~Fiber+Floors, data=hs_samp)
summary(mod_house_ff)
## 
## Call:
## lm(formula = Prices ~ Fiber + Floors, data = hs_samp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19810.3  -5549.5   -407.5   5591.1  21600.5 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  28515.5      184.5  154.53 <0.0000000000000002 ***
## Fiber        11914.2      212.5   56.06 <0.0000000000000002 ***
## Floors       15344.7      212.5   72.20 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7513 on 4997 degrees of freedom
## Multiple R-squared:  0.6231, Adjusted R-squared:  0.6229 
## F-statistic:  4130 on 2 and 4997 DF,  p-value: < 0.00000000000000022
coef(mod_house_ff)
## (Intercept)       Fiber      Floors 
##    28515.52    11914.23    15344.73

From the summary above we can note several points:

  1. Both Fiber and Floors have positive coefficients, 11914.23 and 15344.73 respectively, with an intercept of 28515.52.
  2. Both Fiber and Floors are significant predictors in mod_house_ff, as shown by p-values below 0.05.
  3. Based on the adjusted R-squared, mod_house_ff explains 62.29% of the variance in Prices, while the rest is explained by other variables that are not included in the model.

Regression with all Predictors

The explanatory power of mod_house_ff might be improved by using all predictors, so we fit that model and compare.

mod_house_all <- lm(formula=Prices~., data=hs_samp)
summary(mod_house_all)
## 
## Call:
## lm(formula = Prices ~ ., data = hs_samp)
## 
## Residuals:
##             Min              1Q          Median              3Q             Max 
## -0.000000000876 -0.000000000037 -0.000000000010  0.000000000017  0.000000053765 
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate             Std. Error             t value
## (Intercept)    1000.0000000008236611     0.0000000000606875  16477854946452.666
## Area             24.9999999999999822     0.0000000000001492 167568017144802.688
## Garage         1500.0000000000120508     0.0000000000132916 112852850157247.984
## FirePlace       750.0000000000053433     0.0000000000077381  96922660742804.609
## Baths          1249.9999999999902229     0.0000000000075672 165185662267132.281
## White.Marble  13999.9999999999981810     0.0000000000262460 533414713650915.000
## Black.Marble   5000.0000000000418368     0.0000000000267158 187155176597633.344
## Indian.Marble                     NA                     NA                  NA
## Floors        15000.0000000000072760     0.0000000000216129 694030933314415.750
## City           3499.9999999999786269     0.0000000000131179 266811384074804.969
## Solar           249.9999999999790248     0.0000000000216080  11569804692099.191
## Electric       1249.9999999999779448     0.0000000000216196  57817851986206.141
## Fiber         11749.9999999999417923     0.0000000000215977 544040442124294.750
## Glass.Doors    4450.0000000000463842     0.0000000000216148 205877330706994.969
## Swiming.Pool      0.0000000000235680     0.0000000000216042               1.091
## Garden           -0.0000000000212118     0.0000000000216120              -0.981
##                          Pr(>|t|)    
## (Intercept)   <0.0000000000000002 ***
## Area          <0.0000000000000002 ***
## Garage        <0.0000000000000002 ***
## FirePlace     <0.0000000000000002 ***
## Baths         <0.0000000000000002 ***
## White.Marble  <0.0000000000000002 ***
## Black.Marble  <0.0000000000000002 ***
## Indian.Marble                  NA    
## Floors        <0.0000000000000002 ***
## City          <0.0000000000000002 ***
## Solar         <0.0000000000000002 ***
## Electric      <0.0000000000000002 ***
## Fiber         <0.0000000000000002 ***
## Glass.Doors   <0.0000000000000002 ***
## Swiming.Pool                0.275    
## Garden                      0.326    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.000000000763 on 4985 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 9.182e+28 on 14 and 4985 DF,  p-value: < 0.00000000000000022

From the summary above we can note several points:

  1. Most of the predictors have positive coefficients, except Indian.Marble, whose coefficient is NA. The note "1 not defined because of singularities" indicates collinearity with other predictors, which is consistent with the ggcorr plot above. So, we decided to drop Black.Marble and Indian.Marble from the model.
  2. Most predictors are significant in mod_house_all, with p-values below 0.05. Swiming.Pool and Garden are not significant, as their p-values are above 0.05. Those predictors will be removed in the revised model.
  3. Based on the adjusted R-squared, mod_house_all explains 100% of the variance in Prices. With a perfect fit, feature selection using step cannot be run.
  4. An adjusted R-squared of 1 could indicate a problem in the model, so we will drop some predictors as mentioned in point 1 above.
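The NA coefficient can be reproduced on a small example. The sketch below assumes that the three marble columns are mutually exclusive indicators, so that exactly one of them equals 1 in every row; the glimpse output is consistent with this, but the source does not state it. The toy data, its effect sizes, and its noise are hypothetical.

```r
# Toy data: each house gets exactly one marble type (an assumption for illustration).
set.seed(1)
n <- 200
marble <- sample(c("White", "Black", "Indian"), n, replace = TRUE)
toy <- data.frame(
  White.Marble  = as.integer(marble == "White"),
  Black.Marble  = as.integer(marble == "Black"),
  Indian.Marble = as.integer(marble == "Indian")
)
toy$Prices <- 1000 + 14000 * toy$White.Marble + 5000 * toy$Black.Marble +
  rnorm(n, sd = 100)

# The three indicators always sum to 1, which duplicates the intercept column.
stopifnot(all(rowSums(toy[, 1:3]) == 1))

# lm() cannot estimate all three plus an intercept, so the last one becomes NA.
fit_trap <- lm(Prices ~ ., data = toy)
print(coef(fit_trap))
```

Under that assumption, dropping any single marble column is enough to remove the singularity; the dropped category then acts as the baseline absorbed by the intercept.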

Dropping Predictors

We will drop the collinear and insignificant predictors, along with several weakly correlated ones, and run ggcorr once again.

house_eliminate <- hs_samp %>% 
  select(-Garden, -Swiming.Pool, -Indian.Marble, -Black.Marble,
         -Solar, -Electric, -Baths, -Garage, -FirePlace)
ggcorr(house_eliminate,label = TRUE,label_size = 3, hjust = 1)

From the correlation plot above, there are no more predictors showing multicollinearity. White.Marble also stands out as one additional predictor to add to the first model.

Running Model with Fiber, Floors, White.Marble

After dropping several predictors and running ggcorr, we will run the regression with Fiber, Floors and White.Marble.

mod_hs_ffw <- lm(formula=Prices~Fiber+Floors+White.Marble, data=hs_samp)
summary(mod_hs_ffw)
## 
## Call:
## lm(formula = Prices ~ Fiber + Floors + White.Marble, data = hs_samp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15808.9  -3612.9     16.1   3580.3  15912.1 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   24774.7      137.1  180.77 <0.0000000000000002 ***
## Fiber         11809.2      146.7   80.48 <0.0000000000000002 ***
## Floors        15029.0      146.8  102.38 <0.0000000000000002 ***
## White.Marble  11431.7      154.3   74.07 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5187 on 4996 degrees of freedom
## Multiple R-squared:  0.8203, Adjusted R-squared:  0.8202 
## F-statistic:  7605 on 3 and 4996 DF,  p-value: < 0.00000000000000022

From the summary above we can note several points:

  1. Fiber, Floors, and White.Marble all have positive coefficients (11809.2, 15029.0, and 11431.7), with an intercept of 24774.7.
  2. Fiber, Floors, and White.Marble are significant predictors in mod_hs_ffw, as shown by p-values below 0.05.
  3. Based on the adjusted R-squared, mod_hs_ffw explains 82.02% of the variance in Prices, while the rest is explained by other variables that are not included in the model.

Running Model with All revised Predictors

Below is the model with all remaining predictors after dropping the multicollinear and insignificant ones.

mod_house_all_f <- lm(formula=Prices~., data=house_eliminate)
summary(mod_house_all_f)
## 
## Call:
## lm(formula = Prices ~ ., data = house_eliminate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8663.8 -2254.7    26.6  2270.4  8773.0 
## 
## Coefficients:
##                Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  12451.8789   165.3735   75.30 <0.0000000000000002 ***
## Area            24.3637     0.6271   38.85 <0.0000000000000002 ***
## White.Marble 11543.3204    95.4672  120.91 <0.0000000000000002 ***
## Floors       15057.2971    90.8337  165.77 <0.0000000000000002 ***
## City          3523.5828    55.1490   63.89 <0.0000000000000002 ***
## Fiber        11770.4246    90.7903  129.64 <0.0000000000000002 ***
## Glass.Doors   4439.4131    90.7702   48.91 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3208 on 4993 degrees of freedom
## Multiple R-squared:  0.9313, Adjusted R-squared:  0.9312 
## F-statistic: 1.129e+04 on 6 and 4993 DF,  p-value: < 0.00000000000000022

From the summary of the revised model above we can see that:

  1. After removing insignificant predictors and predictors with multicollinearity, all predictors in mod_house_all_f are significant in predicting the house price.
  2. mod_house_all_f explains 93.12% of the variance in Prices.
  3. Although the adjusted R-squared is very high, it does not mean that the model is perfect at predicting prices. So, we have to verify this model using other tools.

Running backward stepwise regression results in the same model as mod_house_all_f.

step(object=mod_house_all_f, direction = "backward", trace = 0)
## 
## Call:
## lm(formula = Prices ~ Area + White.Marble + Floors + City + Fiber + 
##     Glass.Doors, data = house_eliminate)
## 
## Coefficients:
##  (Intercept)          Area  White.Marble        Floors          City  
##     12451.88         24.36      11543.32      15057.30       3523.58  
##        Fiber   Glass.Doors  
##     11770.42       4439.41

Note that this model is the result of a trial-and-error process to find a more balanced model than the alternatives. When using all predictors (after removing the predictors with multicollinearity), the adjusted R-squared is 1, which indicates a problem in the model. With an adjusted R-squared of 1, feature selection with stepwise regression is also not possible, because the model already fits perfectly. After removing the multicollinear and insignificant predictors, the model is still considered “too good”, with an adjusted R-squared around 0.97. After that, we removed some predictors that are significant but have a correlation of only about 0.1 in ggcorr: Solar, Electric, Baths, Garage, and FirePlace. Dropping these predictors was intended to improve homoscedasticity, but the Breusch-Pagan test in the Model Evaluation section still reports a p-value below 0.05 for this model.

Model Candidate:

Below are candidate models for determining house Prices.

coef(mod_house_ff)
## (Intercept)       Fiber      Floors 
##    28515.52    11914.23    15344.73

Prices = 28515.52 + 11914.23(Fiber) + 15344.73(Floors)

coef(mod_hs_ffw)
##  (Intercept)        Fiber       Floors White.Marble 
##     24774.71     11809.17     15029.03     11431.74

Prices = 24774.71 + 11809.17(Fiber) + 15029.03(Floors) + 11431.74(White.Marble)

coef(mod_house_all_f)
##  (Intercept)         Area White.Marble       Floors         City        Fiber 
##  12451.87889     24.36368  11543.32035  15057.29714   3523.58277  11770.42463 
##  Glass.Doors 
##   4439.41313

Prices = 12451.87889 + 24.36368(Area) + 11543.32035(White.Marble) + 15057.29714(Floors) + 3523.58277(City) + 11770.42463(Fiber) + 4439.41313(Glass.Doors)
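To make the last equation concrete, the sketch below evaluates it by hand for the first row shown in the glimpse output (Area 164, White.Marble 0, Floors 0, City 3, Fiber 1, Glass.Doors 1, with a listed price of 43800). The coefficients are copied from coef(mod_house_all_f); `predict_by_hand` is a hypothetical helper written for this illustration, not part of the original analysis.

```r
# Coefficients copied from coef(mod_house_all_f) above.
coef_all_f <- c(
  "(Intercept)" = 12451.87889,
  Area          = 24.36368,
  White.Marble  = 11543.32035,
  Floors        = 15057.29714,
  City          = 3523.58277,
  Fiber         = 11770.42463,
  Glass.Doors   = 4439.41313
)

# Hypothetical helper: intercept plus the weighted sum of the predictor values.
predict_by_hand <- function(coefs, newdata) {
  predictors <- names(coefs)[-1]
  X <- cbind(1, as.matrix(newdata[, predictors, drop = FALSE]))
  drop(X %*% coefs)
}

# First row of the glimpse output, restricted to the predictors in the model.
first_row <- data.frame(
  Area = 164, White.Marble = 0, Floors = 0,
  City = 3, Fiber = 1, Glass.Doors = 1
)
pred_first <- predict_by_hand(coef_all_f, first_row)
print(pred_first)
```

The result is about 43228, roughly 570 below the listed price of 43800 for that row.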

Model Prediction and Error

In this section we will check the predictions and errors of each model.

Prediction

h_p_pred <- predict(mod_house_all_f, house_eliminate)
h_p_pred_ff <- predict(mod_house_ff, house_eliminate)
h_p_pred_ffw <- predict(mod_hs_ffw, house_eliminate)
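The predictions above are made on the same 5,000 rows that were used to fit the models. As a complement, the sketch below shows how rows outside the training sample could be scored. It uses simulated data (the `sim` data frame, its coefficients, and its noise level are assumptions), and `rmse` is a small helper defined here rather than the MLmetrics function.

```r
# Simulated stand-in for the full data; column names follow the report.
set.seed(100)
n_full <- 20000
sim <- data.frame(
  Area   = sample(50:250, n_full, replace = TRUE),
  Floors = rbinom(n_full, 1, 0.5),
  Fiber  = rbinom(n_full, 1, 0.5)
)
sim$Prices <- 12000 + 25 * sim$Area + 15000 * sim$Floors +
  11750 * sim$Fiber + rnorm(n_full, sd = 3000)

# Fit on a 5,000-row sample, then score rows the model has never seen.
train_idx <- sample(nrow(sim), 5000)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit <- lm(Prices ~ Area + Floors + Fiber, data = train)
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

rmse_train <- rmse(train$Prices, predict(fit, train))
rmse_test  <- rmse(test$Prices,  predict(fit, test))
print(c(train = rmse_train, test = rmse_test))
```

In this report the equivalent held-out set would be `house[-indx_hs, ]`, the rows that were not drawn into hs_samp.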

RMSE

We will check each model using RMSE (Root Mean Square Error).

  • mod_house_all_f
RMSE(y_pred=h_p_pred, y_true=house_eliminate$Prices)
## [1] 3206.039
  • mod_hs_ffw
RMSE(y_pred=h_p_pred_ffw, y_true=house_eliminate$Prices)
## [1] 5185.418
  • mod_house_ff
RMSE(y_pred=h_p_pred_ff, y_true=house_eliminate$Prices)
## [1] 7511.113

From the RMSE of each model, we can see that mod_house_all_f has the lowest RMSE.

MAE

We will check each model using MAE (Mean Absolute Error).

summary(house_eliminate$Prices)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9650   33700   42000   42118   50900   77375

We can see from summary of Prices, the data ranges from 9650 to 77375

  • mod_house_all_f
MAE(y_pred=h_p_pred, y_true=house_eliminate$Prices)
## [1] 2614.859
  • mod_hs_ffw
MAE(y_pred=h_p_pred_ffw, y_true=house_eliminate$Prices)
## [1] 4160.664
  • mod_house_ff
MAE(y_pred=h_p_pred_ff, y_true=house_eliminate$Prices)
## [1] 6222.643

From the MAE of each model, we can see that mod_house_all_f has the lowest MAE, and its error is small relative to the range of Prices.

MAPE

We will check each model using MAPE (Mean Absolute Percentage Error).

  • mod_house_all_f
MAPE(y_pred=h_p_pred, y_true=house_eliminate$Prices)
## [1] 0.07172891
  • mod_hs_ffw
MAPE(y_pred=h_p_pred_ffw, y_true=house_eliminate$Prices)
## [1] 0.1144254
  • mod_house_ff
MAPE(y_pred=h_p_pred_ff, y_true=house_eliminate$Prices)
## [1] 0.1685246

From the MAPE of each model, we can see that mod_house_all_f has the lowest MAPE.
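For reference, the three error metrics used above can be written in a few lines of base R. The functions below are illustrative re-implementations of the standard definitions (their names are chosen here), applied to a tiny made-up example rather than to the house data.

```r
# Standard definitions; y_true and y_pred are numeric vectors of equal length.
rmse_manual <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))
mae_manual  <- function(y_true, y_pred) mean(abs(y_true - y_pred))
mape_manual <- function(y_true, y_pred) mean(abs((y_true - y_pred) / y_true))

# Tiny made-up example: the errors are -10, 10 and 20.
y_true <- c(100, 200, 400)
y_pred <- c(110, 190, 380)

print(rmse_manual(y_true, y_pred))  # sqrt(200), about 14.14
print(mae_manual(y_true, y_pred))   # 40 / 3, about 13.33
print(mape_manual(y_true, y_pred))  # a fraction, about 0.0667 (6.67%)
```

MAPE is returned as a fraction, so the value 0.0717 reported for mod_house_all_f corresponds to an average error of about 7.2% of the price.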

Model Evaluation

In this section, we will evaluate each model using a normality test, a homoscedasticity test, and a multicollinearity test.

Normality

  • mod_house_all_f
hist(mod_house_all_f$residuals)

shapiro.test(mod_house_all_f$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mod_house_all_f$residuals
## W = 0.99653, p-value = 0.000000002084
  • mod_hs_ffw
hist(mod_hs_ffw$residuals)

shapiro.test(mod_hs_ffw$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mod_hs_ffw$residuals
## W = 0.99911, p-value = 0.01036
  • mod_house_ff
hist(mod_house_ff$residuals)

shapiro.test(mod_house_ff$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mod_house_ff$residuals
## W = 0.99183, p-value = 0.0000000000000002233

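Using the Shapiro-Wilk normality test, none of the models has a p-value above 0.05, so normality of the residuals is rejected for all of them. The model whose p-value is closest to 0.05 is mod_hs_ffw.

Judging by the histograms alone, the residuals of all models look approximately normal, with most residuals concentrated around 0. With 5,000 observations, the Shapiro-Wilk test is sensitive to even small departures from normality, which may explain the disagreement.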
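The gap between the formal test and the visual impression can be illustrated with a Q-Q comparison. The sketch below uses made-up residuals (`fake_resid` is uniform noise, chosen only because it is bounded and light-tailed, and is not taken from any of the models). It shows that the Shapiro-Wilk test can reject decisively while the correlation between sample and normal quantiles stays high.

```r
# Illustrative residuals only: uniform noise is bounded and light-tailed.
set.seed(7)
fake_resid <- runif(5000, min = -8000, max = 8000)

# shapiro.test() accepts between 3 and 5000 observations.
sw <- shapiro.test(fake_resid)

# Q-Q coordinates without drawing; a correlation near 1 means a near-normal shape.
qq <- qqnorm(fake_resid, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

print(sw$p.value)
print(qq_cor)
```

For the fitted models, `qqnorm(mod_house_all_f$residuals)` followed by `qqline(mod_house_all_f$residuals)` would draw the corresponding plot.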

Homoscedasticity

Below are homoscedasticity test of each model using bptest.

  • mod_house_all_f
bptest(mod_house_all_f)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_house_all_f
## BP = 314.38, df = 6, p-value < 0.00000000000000022
plot(x = mod_house_all_f$fitted.values, y = mod_house_all_f$residuals)
abline(h = 0, col = "red", lty = 2)

  • mod_hs_ffw
bptest(mod_hs_ffw)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_hs_ffw
## BP = 42.745, df = 3, p-value = 0.000000002788
plot(x = mod_hs_ffw$fitted.values, y = mod_hs_ffw$residuals)
abline(h = 0, col = "red", lty = 2)

  • mod_house_ff
bptest(mod_house_ff)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_house_ff
## BP = 0.24471, df = 2, p-value = 0.8848
plot(x = mod_house_ff$fitted.values, y = mod_house_ff$residuals)
abline(h = 0, col = "red", lty = 2)

In the Breusch-Pagan test, only mod_house_ff has a p-value above 0.05, so we fail to reject homoscedasticity for that model. The other models' p-values are below 0.05, which indicates heteroscedasticity.
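To show what the studentized Breusch-Pagan statistic measures, the sketch below computes it by hand in base R: the squared residuals are regressed on the same predictors, and the statistic is n times the R-squared of that auxiliary regression. The toy data are hypothetical, with noise that grows with x so that the variance is clearly not constant.

```r
# Toy data where the noise grows with x, so the variance is not constant.
set.seed(11)
n <- 2000
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictors.
sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ x)

# Studentized (Koenker) statistic: n * R^2, chi-squared with df = number of predictors.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p.value = bp_p))
```

A small p-value means the squared residuals depend on the predictors, which is what the test reports for mod_house_all_f and mod_hs_ffw.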

Multicollinearity

In this section we will test multicollinearity of each model.

  • mod_house_ff
vif(mod_house_ff)
##    Fiber   Floors 
## 1.000144 1.000144
  • mod_house_all_f
vif(mod_house_all_f)
##         Area White.Marble       Floors         City        Fiber  Glass.Doors 
##     1.001263     1.001267     1.001975     1.001084     1.001015     1.000493
  • mod_hs_ffw
vif(mod_hs_ffw)
##        Fiber       Floors White.Marble 
##     1.000238     1.000988     1.000931

From the VIF values above, which are all close to 1, there is no multicollinearity in any of the models.
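The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on hypothetical data; `vif_by_hand` is a helper written for this illustration and only covers numeric predictors with one coefficient each.

```r
# Hypothetical predictors: x1 and x2 are independent, x3 is nearly a copy of x1.
set.seed(3)
n <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.1)

# Illustrative helper: regress each column on the others and apply 1 / (1 - R^2).
vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vif_indep <- vif_by_hand(data.frame(x1, x2))
vif_corr  <- vif_by_hand(data.frame(x1, x2, x3))
print(round(vif_indep, 3))
print(round(vif_corr, 1))
```

Values near 1, like those reported for the three models, mean each predictor is almost unrelated to the others, while the near-duplicate pair in the toy data produces very large values.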

4. Summary

From all the models above, we can see that:

  • mod_house_ff (Low Adj R Square, failed Shapiro-Wilk, residual histogram looks normal, Homoscedasticity, BP p-value > 0.05, No Multicollinearity)

  • mod_hs_ffw (Medium Adj R Square, failed Shapiro-Wilk, residual histogram looks normal, Heteroscedasticity, BP p-value < 0.05, No Multicollinearity)

  • mod_house_all_f (High Adj R Square, failed Shapiro-Wilk, residual histogram looks normal, Heteroscedasticity, BP p-value < 0.05, No Multicollinearity)

From the summary above we can conclude that no model is perfect. But we think that mod_house_all_f has the best balance of pros and cons and performs well in predicting house Prices.

So, according to mod_house_all_f, house Prices have a positive relationship with Area, White.Marble, Floors, City, Fiber, and Glass.Doors.
