House Price Prediction Using Linear Regression

Introduction

We will try to predict house prices using a linear regression model, looking for relationships between the variables that affect the price. Let's do it!

Data Preparation

Package loading

First of all, we have to load the required packages, import our data, and take a first look at it.
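
The setup chunk is not echoed in this report; below is a minimal sketch of what it likely contains (the file name and the exact set of packages loaded here are assumptions):

library(GGally)   # ggcorr() for the correlation plot used later
library(lmtest)   # bptest() for the Breusch-Pagan test
library(car)      # vif() for the multicollinearity check

house <- read.csv("house_data.csv")   # file name is an assumption
head(house)
str(house)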

##   Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
## 1  164      2         0     2            0            1             0      0
## 2   84      2         0     4            0            0             1      1
## 3  190      2         4     4            1            0             0      0
## 4   75      2         4     4            0            0             1      1
## 5  148      1         4     2            1            0             0      1
## 6  124      3         3     3            0            1             0      1
##   City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
## 1    3     1        1     1           1            0      0  43800
## 2    2     0        0     0           1            1      1  37550
## 3    2     0        0     1           0            0      0  49500
## 4    1     1        1     1           1            1      1  50075
## 5    2     1        0     0           1            1      1  52400
## 6    1     0        0     1           1            1      1  54300
## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

All variables are already of the right class.

Now let's check for missing values (NA).
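
One likely call for this check (assuming the data frame is named house, as in the sketch above):

colSums(is.na(house))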

##          Area        Garage     FirePlace         Baths  White.Marble 
##             0             0             0             0             0 
##  Black.Marble Indian.Marble        Floors          City         Solar 
##             0             0             0             0             0 
##      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
##             0             0             0             0             0 
##        Prices 
##             0

Great! There are no missing values in our data.

Exploratory Data Analysis

Our goal is to predict the price, so let's start by looking at the price distribution with hist().
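
A sketch of the histogram call (the plot itself is not reproduced in this text version):

hist(house$Prices, main = "Distribution of Prices", xlab = "Price")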

It seems that the prices are approximately normally distributed.

In this part, we will explore our variables to see whether there are any patterns among them. We'll use the ggcorr() function to plot a correlation matrix.
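
A likely call (any extra styling arguments are assumptions):

ggcorr(house, label = TRUE)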

From the chart above, we can see that our Prices variable is highly correlated with Fiber, Floors, Indian.Marble, and White.Marble.
The good news is that our predictors are independent of each other, except for Indian.Marble, Black.Marble, and White.Marble.
Therefore, we will select the relevant columns for our model and store them in the house_clean object.

Since White.Marble and Indian.Marble are correlated with each other (predictors should be independent of one another) and have the same correlation value with Prices, we only need one of them as a predictor. In this case, I will pick White.Marble.
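
A sketch of the selection; the columns kept are an assumption based on the predictors that appear in the model summaries below:

house_clean <- house[, c("Prices", "Fiber", "Floors", "White.Marble")]
head(house_clean)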

Now let's take a look at a boxplot of Prices.

As we can see, there are some outliers in Prices.
Let's explore the prices more closely.
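
The list printed below is the object returned by boxplot(); a sketch:

price_box <- boxplot(house_clean$Prices, ylab = "Prices")
price_box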

## $stats
##       [,1]
## [1,]  7725
## [2,] 33500
## [3,] 41850
## [4,] 50750
## [5,] 76600
## attr(,"class")
##         1 
## "integer" 
## 
## $n
## [1] 5e+05
## 
## $conf
##          [,1]
## [1,] 41811.46
## [2,] 41888.54
## 
## $out
##  [1] 76975 77225 77000 77175 77375 77525 76825 77700 76950 77075 77250 77975
## [13] 76750 77225 76775 76800
## 
## $group
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## $names
## [1] "1"

We have 16 outlier values in Prices. Let's try removing them.
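
A sketch of the removal, reusing price_box from above; mapping the name house_clean to the outlier-free data and house_free to the data that keeps the outliers follows the wording of the Modelling section:

house_free  <- house_clean                                            # copy that keeps the outliers
house_clean <- house_clean[!house_clean$Prices %in% price_box$out, ]  # drop the 16 outlier rows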

Modelling

Now for the main part: creating the models!
We will create two models: one without the outliers (house_clean) and one with the outliers (house_free).

Without Outliers

Before creating the model, we divide our data into two parts: a training set and a test set.
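
A typical split; the 80/20 proportion and the seed value are assumptions (roughly 400,000 training rows are implied by the degrees of freedom in the summary below):

set.seed(100)
idx         <- sample(nrow(house_clean), round(0.8 * nrow(house_clean)))
clean_train <- house_clean[idx, ]
clean_test  <- house_clean[-idx, ]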

Creating Model
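
The fit itself, matching the Call shown in the summary below:

model_clean <- lm(Prices ~ ., data = clean_train)
summary(model_clean)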

## 
## Call:
## lm(formula = Prices ~ ., data = clean_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17547.5  -3586.7      6.2   3590.2  17504.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24859.82      15.23  1632.5   <2e-16 ***
## Fiber        11726.88      16.30   719.7   <2e-16 ***
## Floors       14985.81      16.30   919.7   <2e-16 ***
## White.Marble 11521.30      17.28   666.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5153 on 399995 degrees of freedom
## Multiple R-squared:  0.8191, Adjusted R-squared:  0.8191 
## F-statistic: 6.038e+05 on 3 and 399995 DF,  p-value: < 2.2e-16

From the summary above we can say:
1. All predictors are highly significant to price changes.
2. The adjusted R-squared value is quite high (0.819), which means this model is good enough to use.

With Outliers

As before, we divide the data into a training set and a test set.
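
The same split, applied this time to house_free (proportion and seed are again assumptions):

set.seed(100)
idx           <- sample(nrow(house_free), round(0.8 * nrow(house_free)))
outlier_train <- house_free[idx, ]
outlier_test  <- house_free[-idx, ]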

Creating Model
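
The fit, matching the Call in the summary below:

model_outlier <- lm(Prices ~ ., data = outlier_train)
summary(model_outlier)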

## 
## Call:
## lm(formula = Prices ~ ., data = outlier_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17546.8  -3584.7      5.8   3590.3  17505.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24857.12      15.23  1632.0   <2e-16 ***
## Fiber        11727.57      16.29   719.8   <2e-16 ***
## Floors       14987.13      16.29   919.8   <2e-16 ***
## White.Marble 11519.01      17.29   666.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5152 on 399996 degrees of freedom
## Multiple R-squared:  0.819,  Adjusted R-squared:  0.819 
## F-statistic: 6.031e+05 on 3 and 399996 DF,  p-value: < 2.2e-16

From the summary above we can say:
1. All predictors are highly significant to price changes.
2. The adjusted R-squared value is quite high (0.819), which means this model is good enough to use.

Model Analysis

Both models, model_clean and model_outlier, have similar properties. They have the same R-squared, and all predictors are significant. This means that the outliers do not play a significant role in our models. Therefore, we can use either model_outlier or model_clean.

Prediction

Now we use model_clean to predict the clean_test data using predict().

We will check the Root Mean Square Error of the prediction (pred1) using RMSE().
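
A sketch of the two calls; pred1 is the name implied by the text, and the RMSE is computed by hand here (a helper such as caret's RMSE() would give the same number):

pred1 <- predict(model_clean, newdata = clean_test)
sqrt(mean((clean_test$Prices - pred1)^2))   # RMSE of model_clean on the test set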

## [1] 5162.972

For model_outlier

## [1] 5165.072

After a quick comparison, we can conclude that the RMSE of model_clean is lower than that of model_outlier, hence model_clean is the better model.

Stepwise

How about building a model by picking the predictors automatically? We can use the step() function to do this.
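
A sketch of the stepwise call; backward elimination is consistent with the trace below, and house_train is assumed to be a train split of the full data set with all 15 predictors, as the resulting Call suggests:

model_all  <- lm(Prices ~ ., data = house_train)   # full model with every predictor
clean_step <- step(model_all, direction = "backward")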

## Start:  AIC=-14266229
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Indian.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors + Swiming.Pool + Garden
## 
## 
## Step:  AIC=-14266229
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool + Garden
## 
##                Df  Sum of Sq        RSS       AIC
## - Garden        1 0.0000e+00 0.0000e+00 -14266230
## - Swiming.Pool  1 0.0000e+00 0.0000e+00 -14266230
## <none>                       0.0000e+00 -14266229
## - Solar         1 6.2497e+09 6.2497e+09   3862660
## - Electric      1 1.5625e+11 1.5625e+11   5150225
## - FirePlace     1 4.4969e+11 4.4969e+11   5573074
## - Garage        1 6.0067e+11 6.0067e+11   5688863
## - Baths         1 1.2496e+12 1.2496e+12   5981871
## - Area          1 1.2886e+12 1.2886e+12   5994161
## - Black.Marble  1 1.6679e+12 1.6679e+12   6097363
## - Glass.Doors   1 1.9802e+12 1.9802e+12   6166020
## - City          1 3.2662e+12 3.2662e+12   6366194
## - White.Marble  1 1.3084e+13 1.3084e+13   6921316
## - Fiber         1 1.3806e+13 1.3806e+13   6942778
## - Floors        1 2.2500e+13 2.2500e+13   7138147
## 
## Step:  AIC=-14266230
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool
## 
##                Df  Sum of Sq        RSS       AIC
## - Swiming.Pool  1 0.0000e+00 0.0000e+00 -14266231
## <none>                       0.0000e+00 -14266230
## - Solar         1 6.2498e+09 6.2498e+09   3862667
## - Electric      1 1.5625e+11 1.5625e+11   5150223
## - FirePlace     1 4.4969e+11 4.4969e+11   5573072
## - Garage        1 6.0067e+11 6.0067e+11   5688862
## - Baths         1 1.2496e+12 1.2496e+12   5981871
## - Area          1 1.2886e+12 1.2886e+12   5994160
## - Black.Marble  1 1.6679e+12 1.6679e+12   6097362
## - Glass.Doors   1 1.9802e+12 1.9802e+12   6166024
## - City          1 3.2662e+12 3.2662e+12   6366193
## - White.Marble  1 1.3084e+13 1.3084e+13   6921316
## - Fiber         1 1.3806e+13 1.3806e+13   6942776
## - Floors        1 2.2500e+13 2.2500e+13   7138146
## 
## Step:  AIC=-14266231
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors
## 
##                Df  Sum of Sq        RSS       AIC
## <none>                       0.0000e+00 -14266231
## - Solar         1 6.2498e+09 6.2498e+09   3862665
## - Electric      1 1.5625e+11 1.5625e+11   5150221
## - FirePlace     1 4.4969e+11 4.4969e+11   5573070
## - Garage        1 6.0067e+11 6.0067e+11   5688860
## - Baths         1 1.2496e+12 1.2496e+12   5981870
## - Area          1 1.2886e+12 1.2886e+12   5994158
## - Black.Marble  1 1.6679e+12 1.6679e+12   6097360
## - Glass.Doors   1 1.9802e+12 1.9802e+12   6166022
## - City          1 3.2662e+12 3.2662e+12   6366192
## - White.Marble  1 1.3084e+13 1.3084e+13   6921316
## - Fiber         1 1.3806e+13 1.3806e+13   6942785
## - Floors        1 2.2500e+13 2.2500e+13   7138145

We now have our model, stored in clean_step.
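
Inspecting it with summary() reproduces the output below:

summary(clean_step)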

## 
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble + 
##     Black.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors, data = house_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.138e-05  0.000e+00  0.000e+00  1.000e-10  3.408e-07 
## 
## Coefficients:
##               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  1.000e+03  1.549e-10 6.454e+12   <2e-16 ***
## Area         2.500e+01  3.965e-13 6.306e+13   <2e-16 ***
## Garage       1.500e+03  3.484e-11 4.305e+13   <2e-16 ***
## FirePlace    7.500e+02  2.013e-11 3.725e+13   <2e-16 ***
## Baths        1.250e+03  2.013e-11 6.210e+13   <2e-16 ***
## White.Marble 1.400e+04  6.967e-11 2.009e+14   <2e-16 ***
## Black.Marble 5.000e+03  6.970e-11 7.174e+13   <2e-16 ***
## Floors       1.500e+04  5.693e-11 2.635e+14   <2e-16 ***
## City         3.500e+03  3.486e-11 1.004e+14   <2e-16 ***
## Solar        2.500e+02  5.693e-11 4.392e+12   <2e-16 ***
## Electric     1.250e+03  5.693e-11 2.196e+13   <2e-16 ***
## Fiber        1.175e+04  5.693e-11 2.064e+14   <2e-16 ***
## Glass.Doors  4.450e+03  5.693e-11 7.817e+13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.8e-08 on 399987 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.508e+28 on 12 and 399987 DF,  p-value: < 2.2e-16

Let's check the prediction and RMSE for clean_step.
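
A sketch of the check; house_test is assumed to be the test split that pairs with house_train:

pred_step <- predict(clean_step, newdata = house_test)
sqrt(mean((house_test$Prices - pred_step)^2))   # RMSE of clean_step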

## [1] 1.802131e-08

Our clean_step model looks surprisingly perfect. The error is extremely low and all predictors are highly significant.

However, this situation seems abnormal, since the adjusted R-squared value is exactly 1.

Therefore, we have to move on to the next step: model evaluation.

Model Evaluation

In this part, we will evaluate our models by checking their assumptions.
There are four assumptions:
1. Normality
2. Homoscedasticity
3. Multicollinearity
4. Linearity

Normality

  1. Model model_clean

  2. Model model_outlier

  3. Model clean_step
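
The check here is visual; a sketch of the residual histograms behind the analysis below (the plots are not reproduced in this text version):

hist(model_clean$residuals,   main = "Residuals of model_clean")
hist(model_outlier$residuals, main = "Residuals of model_outlier")
hist(clean_step$residuals,    main = "Residuals of clean_step")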

Analysis:

  1. Both model_clean and model_outlier have normally distributed residuals, which means most of the errors (residuals) are centered around 0.

  2. The clean_step residuals are all vanishingly small (on the order of 1e-06 or less, as its summary shows), and their distribution does not look normal.

Homoscedasticity

  1. Model model_clean
## 
##  studentized Breusch-Pagan test
## 
## data:  model_clean
## BP = 3097.5, df = 3, p-value < 2.2e-16
  2. Model model_outlier
## 
##  studentized Breusch-Pagan test
## 
## data:  model_outlier
## BP = 3098.9, df = 3, p-value < 2.2e-16
  3. Model clean_step
## 
##  studentized Breusch-Pagan test
## 
## data:  clean_step
## BP = 11.054, df = 12, p-value = 0.5243
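
The output above comes from the studentized Breusch-Pagan test in the lmtest package:

library(lmtest)

bptest(model_clean)
bptest(model_outlier)
bptest(clean_step)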

Conclusion: model_clean and model_outlier have p-values lower than 0.05, hence these models' residuals are not homogeneous. On the other hand, the clean_step model has a p-value > 0.05, so we can say that its residuals are homogeneous.

Multicolinearity

  1. Model model_clean
##        Fiber       Floors White.Marble 
##     1.000007     1.000007     1.000001
  2. Model model_outlier
##        Fiber       Floors White.Marble 
##     1.000004     1.000003     1.000001
  3. Model clean_step
##         Area       Garage    FirePlace        Baths White.Marble Black.Marble 
##     1.000027     1.000036     1.000011     1.000039     1.330525     1.330542 
##       Floors         City        Solar     Electric        Fiber  Glass.Doors 
##     1.000012     1.000028     1.000021     1.000009     1.000020     1.000029
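
The variance inflation factors above are most likely computed with vif() from the car package:

library(car)

vif(model_clean)
vif(model_outlier)
vif(clean_step)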

Conclusion: none of the three models has a predictor with a VIF value >= 10. Therefore, our predictors are not dependent on each other.

Linearity

  1. Model model_clean

  2. Model model_outlier

  3. Model clean_step
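
The exact linearity check is not shown; one simple screen consistent with the conclusion below is the correlation of each predictor with the target:

# correlation of every predictor with Prices, for the reduced and the full training data
cor(clean_train[, setdiff(names(clean_train), "Prices")], clean_train$Prices)
cor(house_train[, setdiff(names(house_train), "Prices")], house_train$Prices)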

Conclusion :

For model_clean and model_outlier, all predictors are correlated with the target variable, Prices. In the clean_step model, however, there are some variables that have zero correlation with Prices.

Conclusion

Both model_clean and model_outlier have similar properties. They have a high adjusted R-squared and all of their predictors are significant. Although they fail one assumption, homoscedasticity, they can still be regarded as good, interpretable models because of their adjusted R-squared.

The clean_step model seems to have perfect properties, with an R-squared of 1 and highly significant predictors. However, as noted in the Model Evaluation section, such a perfect fit is abnormal and its residuals are not normally distributed, so it should be used with caution.

Alfado Dhusi Sembiring

22/2/2020