House Price Prediction Using Linear Regression

Introduction

We will try to predict house prices using a linear regression model, looking for relationships between the variables that affect the price. Let's do it!

Data Preparation

Package loading

First of all, we have to load the required packages, import our data, and take a first look at it.
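
The setup chunk is not echoed in this report; below is a minimal sketch of what it likely contains (the file name and the exact set of packages loaded here are assumptions):

library(GGally)   # ggcorr() for the correlation plot used later
library(lmtest)   # bptest() for the Breusch-Pagan test
library(car)      # vif() for the multicollinearity check

house <- read.csv("house_data.csv")   # file name is an assumption
head(house)
str(house)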

##   Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
## 1  164      2         0     2            0            1             0      0
## 2   84      2         0     4            0            0             1      1
## 3  190      2         4     4            1            0             0      0
## 4   75      2         4     4            0            0             1      1
## 5  148      1         4     2            1            0             0      1
## 6  124      3         3     3            0            1             0      1
##   City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
## 1    3     1        1     1           1            0      0  43800
## 2    2     0        0     0           1            1      1  37550
## 3    2     0        0     1           0            0      0  49500
## 4    1     1        1     1           1            1      1  50075
## 5    2     1        0     0           1            1      1  52400
## 6    1     0        0     1           1            1      1  54300
## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

All variables are already of the right class.

Now let's check for missing values (NA).
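
One likely call for this check (assuming the data frame is named house, as in the sketch above):

colSums(is.na(house))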

##          Area        Garage     FirePlace         Baths  White.Marble 
##             0             0             0             0             0 
##  Black.Marble Indian.Marble        Floors          City         Solar 
##             0             0             0             0             0 
##      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
##             0             0             0             0             0 
##        Prices 
##             0

Great! There are no missing values in our data.

Exploratory Data Analysis

Our goal is to predict the price, so let's start by looking at the price distribution with hist().
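
A sketch of the histogram call (the plot itself is not reproduced in this text version):

hist(house$Prices, main = "Distribution of Prices", xlab = "Price")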

It seems that the prices are approximately normally distributed.

In this part, we will explore our variables to see whether there are any patterns among them. We'll use the ggcorr() function to plot a correlation matrix.
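
A likely call (any extra styling arguments are assumptions):

ggcorr(house, label = TRUE)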

From the chart above, we can see that our Prices variable is highly correlated with Fiber, Floors, Indian.Marble, and White.Marble.
The good news is that our predictors are independent of each other, except for Indian.Marble, Black.Marble, and White.Marble.
Therefore, we will select the relevant columns for our model and store them in the house_clean object.

Since White.Marble and Indian.Marble are correlated with each other (predictors should be independent of one another) and have the same correlation value with Prices, we only need one of them as a predictor. In this case, I will pick White.Marble.
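
A sketch of the selection; the columns kept are an assumption based on the predictors that appear in the model summaries below:

house_clean <- house[, c("Prices", "Fiber", "Floors", "White.Marble")]
head(house_clean)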

Now let's take a look at a boxplot of Prices.

As we can see, there are some outliers in Prices.
Let's explore the prices more closely.
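
The list printed below is the object returned by boxplot(); a sketch:

price_box <- boxplot(house_clean$Prices, ylab = "Prices")
price_box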

## $stats
##       [,1]
## [1,]  7725
## [2,] 33500
## [3,] 41850
## [4,] 50750
## [5,] 76600
## attr(,"class")
##         1 
## "integer" 
## 
## $n
## [1] 5e+05
## 
## $conf
##          [,1]
## [1,] 41811.46
## [2,] 41888.54
## 
## $out
##  [1] 76975 77225 77000 77175 77375 77525 76825 77700 76950 77075 77250 77975
## [13] 76750 77225 76775 76800
## 
## $group
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## $names
## [1] "1"

We have 16 outlier values in Prices. Let's try removing them.
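
A sketch of the removal, reusing price_box from above; mapping the name house_clean to the outlier-free data and house_free to the data that keeps the outliers follows the wording of the Modelling section:

house_free  <- house_clean                                            # copy that keeps the outliers
house_clean <- house_clean[!house_clean$Prices %in% price_box$out, ]  # drop the 16 outlier rows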

Modelling

Now for the main part: creating the models!
We will create two models: one without the outliers (house_clean) and one with the outliers (house_free).

Without Outliers

Before creating the model, we divide our data into two parts: a training set and a test set.
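
A typical split; the 80/20 proportion and the seed value are assumptions (roughly 400,000 training rows are implied by the degrees of freedom in the summary below):

set.seed(100)
idx         <- sample(nrow(house_clean), round(0.8 * nrow(house_clean)))
clean_train <- house_clean[idx, ]
clean_test  <- house_clean[-idx, ]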

Creating Model
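
The fit itself, matching the Call shown in the summary below:

model_clean <- lm(Prices ~ ., data = clean_train)
summary(model_clean)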

## 
## Call:
## lm(formula = Prices ~ ., data = clean_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17547.5  -3586.7      6.2   3590.2  17504.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24859.82      15.23  1632.5   <2e-16 ***
## Fiber        11726.88      16.30   719.7   <2e-16 ***
## Floors       14985.81      16.30   919.7   <2e-16 ***
## White.Marble 11521.30      17.28   666.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5153 on 399995 degrees of freedom
## Multiple R-squared:  0.8191, Adjusted R-squared:  0.8191 
## F-statistic: 6.038e+05 on 3 and 399995 DF,  p-value: < 2.2e-16

From the summary above we can say:
1. All predictors are highly significant to price changes.
2. The adjusted R-squared value is quite high (0.819), which means this model is good enough to use.

With Outliers

As before, we divide the data into a training set and a test set.
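
The same split, applied this time to house_free (proportion and seed are again assumptions):

set.seed(100)
idx           <- sample(nrow(house_free), round(0.8 * nrow(house_free)))
outlier_train <- house_free[idx, ]
outlier_test  <- house_free[-idx, ]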

Creating Model
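
The fit, matching the Call in the summary below:

model_outlier <- lm(Prices ~ ., data = outlier_train)
summary(model_outlier)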

## 
## Call:
## lm(formula = Prices ~ ., data = outlier_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17546.8  -3584.7      5.8   3590.3  17505.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24857.12      15.23  1632.0   <2e-16 ***
## Fiber        11727.57      16.29   719.8   <2e-16 ***
## Floors       14987.13      16.29   919.8   <2e-16 ***
## White.Marble 11519.01      17.29   666.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5152 on 399996 degrees of freedom
## Multiple R-squared:  0.819,  Adjusted R-squared:  0.819 
## F-statistic: 6.031e+05 on 3 and 399996 DF,  p-value: < 2.2e-16

From the summary above we can say:
1. All predictors are highly significant to price changes.
2. The adjusted R-squared value is quite high (0.819), which means this model is good enough to use.

Model Analysis

Both models, model_clean and model_outlier, have similar properties. They have the same R-squared, and all predictors are significant. This means that the outliers do not play a significant role in our models. Therefore, we can use either model_outlier or model_clean.

Prediction

Now we use model_clean to predict the clean_test data using predict().

We will check the Root Mean Square Error of the prediction (pred1) using RMSE().
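
A sketch of the two calls; pred1 is the name implied by the text, and the RMSE is computed by hand here (a helper such as caret's RMSE() would give the same number):

pred1 <- predict(model_clean, newdata = clean_test)
sqrt(mean((clean_test$Prices - pred1)^2))   # RMSE of model_clean on the test set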

## [1] 5162.972

For model_outlier

## [1] 5165.072

After a quick comparison, we can conclude that the RMSE of model_clean is lower than that of model_outlier, hence model_clean is the better model.

Stepwise

How about building a model by picking the predictors automatically? We can use the step() function to do this.
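
A sketch of the stepwise call; backward elimination is consistent with the trace below, and house_train is assumed to be a train split of the full data set with all 15 predictors, as the resulting Call suggests:

model_all  <- lm(Prices ~ ., data = house_train)   # full model with every predictor
clean_step <- step(model_all, direction = "backward")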

## Start:  AIC=-14266229
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Indian.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors + Swiming.Pool + Garden
## 
## 
## Step:  AIC=-14266229
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool + Garden
## 
##                Df  Sum of Sq        RSS       AIC
## - Garden        1 0.0000e+00 0.0000e+00 -14266230
## - Swiming.Pool  1 0.0000e+00 0.0000e+00 -14266230
## <none>                       0.0000e+00 -14266229
## - Solar         1 6.2497e+09 6.2497e+09   3862660
## - Electric      1 1.5625e+11 1.5625e+11   5150225
## - FirePlace     1 4.4969e+11 4.4969e+11   5573074
## - Garage        1 6.0067e+11 6.0067e+11   5688863
## - Baths         1 1.2496e+12 1.2496e+12   5981871
## - Area          1 1.2886e+12 1.2886e+12   5994161
## - Black.Marble  1 1.6679e+12 1.6679e+12   6097363
## - Glass.Doors   1 1.9802e+12 1.9802e+12   6166020
## - City          1 3.2662e+12 3.2662e+12   6366194
## - White.Marble  1 1.3084e+13 1.3084e+13   6921316
## - Fiber         1 1.3806e+13 1.3806e+13   6942778
## - Floors        1 2.2500e+13 2.2500e+13   7138147
## 
## Step:  AIC=-14266230
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool
## 
##                Df  Sum of Sq        RSS       AIC
## - Swiming.Pool  1 0.0000e+00 0.0000e+00 -14266231
## <none>                       0.0000e+00 -14266230
## - Solar         1 6.2498e+09 6.2498e+09   3862667
## - Electric      1 1.5625e+11 1.5625e+11   5150223
## - FirePlace     1 4.4969e+11 4.4969e+11   5573072
## - Garage        1 6.0067e+11 6.0067e+11   5688862
## - Baths         1 1.2496e+12 1.2496e+12   5981871
## - Area          1 1.2886e+12 1.2886e+12   5994160
## - Black.Marble  1 1.6679e+12 1.6679e+12   6097362
## - Glass.Doors   1 1.9802e+12 1.9802e+12   6166024
## - City          1 3.2662e+12 3.2662e+12   6366193
## - White.Marble  1 1.3084e+13 1.3084e+13   6921316
## - Fiber         1 1.3806e+13 1.3806e+13   6942776
## - Floors        1 2.2500e+13 2.2500e+13   7138146
## 
## Step:  AIC=-14266231
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors
## 
##                Df  Sum of Sq        RSS       AIC
## <none>                       0.0000e+00 -14266231
## - Solar         1 6.2498e+09 6.2498e+09   3862665
## - Electric      1 1.5625e+11 1.5625e+11   5150221
## - FirePlace     1 4.4969e+11 4.4969e+11   5573070
## - Garage        1 6.0067e+11 6.0067e+11   5688860
## - Baths         1 1.2496e+12 1.2496e+12   5981870
## - Area          1 1.2886e+12 1.2886e+12   5994158
## - Black.Marble  1 1.6679e+12 1.6679e+12   6097360
## - Glass.Doors   1 1.9802e+12 1.9802e+12   6166022
## - City          1 3.2662e+12 3.2662e+12   6366192
## - White.Marble  1 1.3084e+13 1.3084e+13   6921316
## - Fiber         1 1.3806e+13 1.3806e+13   6942785
## - Floors        1 2.2500e+13 2.2500e+13   7138145

We now have our model, stored in clean_step.
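
Inspecting it with summary() reproduces the output below:

summary(clean_step)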

## 
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble + 
##     Black.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors, data = house_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.138e-05  0.000e+00  0.000e+00  1.000e-10  3.408e-07 
## 
## Coefficients:
##               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  1.000e+03  1.549e-10 6.454e+12   <2e-16 ***
## Area         2.500e+01  3.965e-13 6.306e+13   <2e-16 ***
## Garage       1.500e+03  3.484e-11 4.305e+13   <2e-16 ***
## FirePlace    7.500e+02  2.013e-11 3.725e+13   <2e-16 ***
## Baths        1.250e+03  2.013e-11 6.210e+13   <2e-16 ***
## White.Marble 1.400e+04  6.967e-11 2.009e+14   <2e-16 ***
## Black.Marble 5.000e+03  6.970e-11 7.174e+13   <2e-16 ***
## Floors       1.500e+04  5.693e-11 2.635e+14   <2e-16 ***
## City         3.500e+03  3.486e-11 1.004e+14   <2e-16 ***
## Solar        2.500e+02  5.693e-11 4.392e+12   <2e-16 ***
## Electric     1.250e+03  5.693e-11 2.196e+13   <2e-16 ***
## Fiber        1.175e+04  5.693e-11 2.064e+14   <2e-16 ***
## Glass.Doors  4.450e+03  5.693e-11 7.817e+13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.8e-08 on 399987 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.508e+28 on 12 and 399987 DF,  p-value: < 2.2e-16

Let's check the prediction and RMSE for clean_step.
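
A sketch of the check; house_test is assumed to be the test split that pairs with house_train:

pred_step <- predict(clean_step, newdata = house_test)
sqrt(mean((house_test$Prices - pred_step)^2))   # RMSE of clean_step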

## [1] 1.802131e-08

Our clean_step model looks surprisingly perfect. The error is extremely low and all predictors are highly significant.

However, this situation seems abnormal, since the adjusted R-squared value is exactly 1.

Therefore, we have to move on to the next step: model evaluation.

Model Evaluation

In this part, we will evaluate our models by checking their assumptions.
There are four assumptions:
1. Normality
2. Homoscedasticity
3. Multicollinearity
4. Linearity

Normality

  1. Model model_clean

  2. Model model_outlier

  3. Model clean_step
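
The check here is visual; a sketch of the residual histograms behind the analysis below (the plots are not reproduced in this text version):

hist(model_clean$residuals,   main = "Residuals of model_clean")
hist(model_outlier$residuals, main = "Residuals of model_outlier")
hist(clean_step$residuals,    main = "Residuals of clean_step")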

Analysis:

  1. Both model_clean and model_outlier have normally distributed residuals, which means most of the errors (residuals) are centered around 0.

  2. The clean_step residuals are all vanishingly small (on the order of 1e-06 or less, as its summary shows), and their distribution does not look normal.

Homoscedasticity

  1. Model model_clean
## 
##  studentized Breusch-Pagan test
## 
## data:  model_clean
## BP = 3097.5, df = 3, p-value < 2.2e-16
  2. Model model_outlier
## 
##  studentized Breusch-Pagan test
## 
## data:  model_outlier
## BP = 3098.9, df = 3, p-value < 2.2e-16
  3. Model clean_step
## 
##  studentized Breusch-Pagan test
## 
## data:  clean_step
## BP = 11.054, df = 12, p-value = 0.5243
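
The output above comes from the studentized Breusch-Pagan test in the lmtest package:

library(lmtest)

bptest(model_clean)
bptest(model_outlier)
bptest(clean_step)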

Conclusion: model_clean and model_outlier have p-values lower than 0.05, hence these models' residuals are not homogeneous. On the other hand, the clean_step model has a p-value > 0.05, so we can say that its residuals are homogeneous.

Multicolinearity

  1. Model model_clean
##        Fiber       Floors White.Marble 
##     1.000007     1.000007     1.000001
  2. Model model_outlier
##        Fiber       Floors White.Marble 
##     1.000004     1.000003     1.000001
  3. Model clean_step
##         Area       Garage    FirePlace        Baths White.Marble Black.Marble 
##     1.000027     1.000036     1.000011     1.000039     1.330525     1.330542 
##       Floors         City        Solar     Electric        Fiber  Glass.Doors 
##     1.000012     1.000028     1.000021     1.000009     1.000020     1.000029
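
The variance inflation factors above are most likely computed with vif() from the car package:

library(car)

vif(model_clean)
vif(model_outlier)
vif(clean_step)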

Conclusion: none of the three models has a predictor with a VIF value >= 10. Therefore, our predictors are not dependent on each other.

Linearity

  1. Model model_clean

  2. Model model_outlier

  3. Model clean_step
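
The exact linearity check is not shown; one simple screen consistent with the conclusion below is the correlation of each predictor with the target:

# correlation of every predictor with Prices, for the reduced and the full training data
cor(clean_train[, setdiff(names(clean_train), "Prices")], clean_train$Prices)
cor(house_train[, setdiff(names(house_train), "Prices")], house_train$Prices)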

Conclusion :

For model_clean and model_outlier, all predictors are correlated with the target variable, Prices. In the clean_step model, however, there are some variables that have zero correlation with Prices.

Conclusion

Both model_clean and model_outlier have similar properties. They have a high adjusted R-squared and all of their predictors are significant. Although they fail one assumption, homoscedasticity, they can still be regarded as good, interpretable models because of their adjusted R-squared.

The clean_step model seems to have perfect properties, with an R-squared of 1 and highly significant predictors. However, as noted in the Model Evaluation section, such a perfect fit is abnormal and its residuals are not normally distributed, so it should be used with caution.

Alfado Dhusi Sembiring

22/2/2020