In this exercise, I want to predict the price of a house based on various predictor variables.
The data is taken from Kaggle, and the description suggests it is computer-generated, so we'll see how that affects our numbers.
I will use MSE and RMSE as measures of my model's accuracy. For those who are not familiar, MSE (mean squared error) and RMSE (root mean squared error) quantify the model's prediction error. Their magnitude depends on the scale of the target: an RMSE of 1,000 is a very low error rate if our prices are in the millions, but a very high one if our prices are in the hundreds.
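For reference, with $y_i$ the observed price, $\hat{y}_i$ the predicted price, and $n$ the number of observations, the two metrics are:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$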
Library packages that I’m using
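The original library chunk isn't reproduced here; based on the functions used later in this write-up, a likely set of packages is sketched below (treat the exact list as an assumption).

```r
# Likely packages, inferred from the functions used later; the original chunk is not shown
library(readr)    # read_csv() for loading the data
library(dplyr)    # rename(), mutate(), case_when() for feature engineering
library(stringr)  # str_to_lower(), str_replace_all() for cleaning column names
library(lmtest)   # bptest() for the Breusch-Pagan test
library(car)      # vif() for multicollinearity checks
```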
Reading the data
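The read step itself isn't shown; something like the following would do it, with the file name being a placeholder.

```r
# File name is a placeholder; read_csv() keeps the original column names
# (with spaces), which are cleaned up in a later step
house_price <- read_csv("house_price.csv")
```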
Checking the data
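The preview table below could be produced with, for example:

```r
# Quick look at the first ten rows
head(house_price, 10)
```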
| Area | Garage | FirePlace | Baths | White Marble | Black Marble | Indian Marble | Floors | City | Solar | Electric | Fiber | Glass Doors | Swiming Pool | Garden | Prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 2 | 0 | 2 | 0 | 1 | 0 | 0 | 3 | 1 | 1 | 1 | 1 | 0 | 0 | 43800 |
| 84 | 2 | 0 | 4 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 37550 |
| 190 | 2 | 4 | 4 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 49500 |
| 75 | 2 | 4 | 4 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 50075 |
| 148 | 1 | 4 | 2 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 52400 |
| 124 | 3 | 3 | 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 54300 |
| 58 | 1 | 0 | 2 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 1 | 0 | 1 | 34400 |
| 249 | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 50425 |
| 243 | 1 | 0 | 2 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 29575 |
| 242 | 1 | 2 | 4 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 22300 |
At first glance, there is some linearity in our data, but I want to compare the results after engineering the data.
The data has a few features we need to engineer first.
1. White Marble, Black Marble, and Indian Marble should be collapsed into a single factor column instead of three separate columns.
2. Garage, FirePlace, Baths, and City should be converted to factors.
3. Columns that only contain 0 and 1, such as Floors, Solar, Electric, Fiber, Glass Doors, Swimming Pool, and Garden, will be converted to logical.
names(house_price) <- str_replace_all(str_to_lower(names(house_price)), " ", "_")
house_price_filter <- house_price %>%
rename(swimming_pool = swiming_pool) %>% #spelling!!
mutate( floors = as.logical(as.integer(floors)),
solar = as.logical(as.integer(solar)),
electric = as.logical(as.integer(electric)),
fiber = as.logical(as.integer(fiber)),
glass_doors = as.logical(as.integer(glass_doors)),
swimming_pool = as.logical(as.integer(swimming_pool)),
garden = as.logical(as.integer(garden))
) %>%
mutate(garage = as.factor(garage),
fireplace = as.factor(fireplace),
baths = as.factor(baths),
city = as.factor(city)) %>%
mutate(marble_choice = case_when( white_marble == 1 ~ "white_marble",
black_marble == 1 ~ "black_marble",
indian_marble == 1 ~ "indian_marble") ) %>%
select(-c(white_marble, black_marble, indian_marble))

In order to test the model at a later stage, I'll split the dataset into two parts, training and testing, with an 80:20 ratio.
set.seed(100)
intrain <- sample(nrow(house_price_filter), nrow(house_price_filter) * .8)
house_price_train <- house_price_filter[intrain,]
house_price_test <- house_price_filter[-intrain,]

We ended up with version 2 of the model. After trying out version 1, it appeared to be overfit: its R-squared was exactly 1, its RMSE was extremely low, and it relied on predictors with weak correlation to the target.
Version 2, however, is built with predictors that have relatively stronger correlation with our target variable and that should plausibly affect the price in a business-case sense.
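The full model (version 1) regresses price on every remaining predictor. The object name `model_house_v1` is my assumption, but the formula matches the Call shown in the output below.

```r
# Version 1: regress price on all predictors (object name assumed)
model_house_v1 <- lm(prices ~ ., data = house_price_train)
summary(model_house_v1)
```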
##
## Call:
## lm(formula = prices ~ ., data = house_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.000046590 0.000000000 0.000000000 0.000000000 0.000000198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12249.999999925450538 0.000000000583347 20999507180557.438 <0.0000000000000002 ***
## area 25.000000000004619 0.000000000001623 15408197241292.121 <0.0000000000000002 ***
## garage2 1500.000000000111413 0.000000000285641 5251338988395.544 <0.0000000000000002 ***
## garage3 2999.999999999419742 0.000000000285171 10520012508983.896 <0.0000000000000002 ***
## fireplace1 749.999999999717488 0.000000000369017 2032427319033.808 <0.0000000000000002 ***
## fireplace2 1499.999999999714419 0.000000000369037 4064628304190.725 <0.0000000000000002 ***
## fireplace3 2249.999999999310603 0.000000000368570 6104671996517.958 <0.0000000000000002 ***
## fireplace4 3000.000000000022283 0.000000000368597 8138972100765.117 <0.0000000000000002 ***
## baths2 1249.999999998968860 0.000000000368232 3394599186564.467 <0.0000000000000002 ***
## baths3 2499.999999999759893 0.000000000367638 6800172397216.058 <0.0000000000000002 ***
## baths4 3749.999999999752617 0.000000000368039 10189128624311.863 <0.0000000000000002 ***
## baths5 4999.999999999542524 0.000000000368346 13574209596262.346 <0.0000000000000002 ***
## floorsTRUE 15000.000000000394721 0.000000000232974 64384820861913.500 <0.0000000000000002 ***
## city2 3500.000000000356522 0.000000000285400 12263477558947.682 <0.0000000000000002 ***
## city3 7000.000000000159162 0.000000000285355 24530818169298.816 <0.0000000000000002 ***
## solarTRUE 250.000000000234593 0.000000000232976 1073070716705.686 <0.0000000000000002 ***
## electricTRUE 1250.000000000187811 0.000000000232975 5365379701286.465 <0.0000000000000002 ***
## fiberTRUE 11750.000000000161890 0.000000000232976 50434366647922.492 <0.0000000000000002 ***
## glass_doorsTRUE 4449.999999999679858 0.000000000232976 19100663635603.164 <0.0000000000000002 ***
## swimming_poolTRUE -0.000000000239923 0.000000000232977 -1.030 0.303
## gardenTRUE 0.000000000232821 0.000000000232980 0.999 0.318
## marble_choiceindian_marble -4999.999999999145075 0.000000000285229 -17529758565682.213 <0.0000000000000002 ***
## marble_choicewhite_marble 9000.000000000556611 0.000000000285637 31508535793061.383 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00000007367 on 399977 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.912e+26 on 22 and 399977 DF, p-value: < 0.00000000000000022
Checking fitted values and residuals.
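The residual plot for version 1 isn't reproduced here; a minimal sketch of how it could be drawn, assuming the `model_house_v1` object from above:

```r
# Fitted values vs residuals for the version 1 model (object name assumed)
plot(model_house_v1$fitted.values, model_house_v1$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```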
model_house_v2 <- lm(formula = prices ~ area + floors + marble_choice + fiber + city, data = house_price_train)
summary(model_house_v2)

##
## Call:
## lm(formula = prices ~ area + floors + marble_choice + fiber +
## city, data = house_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8504.7 -2474.5 2.6 2470.9 8500.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20742.0789 16.6831 1243.3 <0.0000000000000002 ***
## area 24.8906 0.0733 339.6 <0.0000000000000002 ***
## floorsTRUE 14990.8490 10.5256 1424.2 <0.0000000000000002 ***
## marble_choiceindian_marble -4996.4787 12.8864 -387.7 <0.0000000000000002 ***
## marble_choicewhite_marble 9008.5019 12.9048 698.1 <0.0000000000000002 ***
## fiberTRUE 11741.9339 10.5256 1115.6 <0.0000000000000002 ***
## city2 3505.7211 12.8942 271.9 <0.0000000000000002 ***
## city3 7004.1872 12.8920 543.3 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3328 on 399992 degrees of freedom
## Multiple R-squared: 0.9245, Adjusted R-squared: 0.9244
## F-statistic: 6.992e+05 on 7 and 399992 DF, p-value: < 0.00000000000000022
Checking fitted values and residuals.
The residual plot looks close to normally distributed, although the Shapiro-Wilk test below (run on the first 5,000 residuals) still rejects strict normality; with this many observations, the test flags even tiny departures from normality.
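The check was presumably produced with something like the following; the plotting call is my assumption, while the Shapiro-Wilk call matches the `data:` line in the output.

```r
# Visual check of the residual distribution (plot call assumed), then
# Shapiro-Wilk on the first 5,000 residuals -- shapiro.test() accepts at most 5,000 values
hist(model_house_v2$residuals, breaks = 50,
     main = "Residuals of model_house_v2", xlab = "Residual")
shapiro.test(model_house_v2$residuals[0:5000])
```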
##
## Shapiro-Wilk normality test
##
## data: model_house_v2$residuals[0:5000]
## W = 0.99481, p-value = 0.000000000002007
Our model appears to satisfy the homoscedasticity assumption according to the Breusch-Pagan test: with p = 0.8355 we fail to reject the null hypothesis of constant error variance.
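The output below is consistent with a call like this, using `bptest()` from the lmtest package:

```r
# Breusch-Pagan test for heteroscedasticity on the version 2 model
lmtest::bptest(model_house_v2)
```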
##
## studentized Breusch-Pagan test
##
## data: model_house_v2
## BP = 3.4978, df = 7, p-value = 0.8355
The predictor variables do not seem to have strong linear relationships with each other: all of the (generalized) VIF values below are essentially 1.
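The table below matches what `vif()` from the car package reports for a model containing factor predictors (GVIF):

```r
# Generalized VIF; values near 1 indicate negligible multicollinearity
car::vif(model_house_v2)
```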
## GVIF Df GVIF^(1/(2*Df))
## area 1.000018 1 1.000009
## floors 1.000011 1 1.000006
## marble_choice 1.000010 2 1.000002
## fiber 1.000011 1 1.000005
## city 1.000031 2 1.000008
Below is the price range of the original dataset, as a guide to interpreting our model's error.
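A sketch of how that range could be obtained (the exact call used isn't shown):

```r
# Price range of the dataset, for scale when reading the RMSE values
range(house_price_filter$prices)
```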
Below are our train and test MSE and RMSE. As you can see, the two RMSE values are very close, which suggests our model is not overfit.
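The metric values themselves aren't reproduced here; a minimal sketch of how they could be computed, with variable names of my own choosing:

```r
# Predictions on the training and test sets
pred_train <- predict(model_house_v2, newdata = house_price_train)
pred_test  <- predict(model_house_v2, newdata = house_price_test)

# MSE and RMSE for each split
mse_train <- mean((house_price_train$prices - pred_train)^2)
mse_test  <- mean((house_price_test$prices - pred_test)^2)
c(rmse_train = sqrt(mse_train), rmse_test = sqrt(mse_test))
```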
Below is a plot of our predicted prices against the actual test prices.
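The plot isn't embedded here; a base-R sketch of the comparison, using the prediction variable assumed above:

```r
# Predicted vs actual prices on the test set; points close to the diagonal indicate a good fit
plot(house_price_test$prices, pred_test,
     xlab = "Actual price", ylab = "Predicted price")
abline(0, 1, col = "red")
```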
Currently, I think our model is a good one. The difference in RMSE between the train and test sets is very small, indicating that the model is not overfit, and it passes the homoscedasticity and multicollinearity checks, with residuals that look close to normal. We achieved a high adjusted R-squared of 0.9244, meaning the chosen predictors explain about 92% of the variance in our target variable.
Based on our model, the variables that affect house price positively are Area, Floors, Marble Choice, Fiber, and City. Within Marble Choice, Indian Marble is associated with a lower price and White Marble with a higher price, both relative to Black Marble, the reference level.
Floors and Fiber affect the price the most, while Area has the smallest coefficient, adding roughly 25 to the price per unit of area.