Intro

House Price Prediction

This report will try to predict house price using linear regression using house price database taken from Kaggle. We will explore which predictors/variables that has significant impact in determining house price.

Brief

library(dplyr)
library(tidyverse)
library(caret)
library(GGally)
library(rsample)

Get the Data

house <- read.csv("house_price.csv")

head(house)
#>   Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
#> 1  164      2         0     2            0            1             0      0
#> 2   84      2         0     4            0            0             1      1
#> 3  190      2         4     4            1            0             0      0
#> 4   75      2         4     4            0            0             1      1
#> 5  148      1         4     2            1            0             0      1
#> 6  124      3         3     3            0            1             0      1
#>   City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
#> 1    3     1        1     1           1            0      0  43800
#> 2    2     0        0     0           1            1      1  37550
#> 3    2     0        0     1           0            0      0  49500
#> 4    1     1        1     1           1            1      1  50075
#> 5    2     1        0     0           1            1      1  52400
#> 6    1     0        0     1           1            1      1  54300

Data Cleansing

Check the Data Type

glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble  <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City          <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar         <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric      <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber         <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors   <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden        <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …

From the result above, some of data type are not in the corect type. So we need to convert it.

house <- house %>% 
  mutate_if(is.numeric, as.factor) %>% 
  mutate_at(c("Garage", "FirePlace", "Area", "FirePlace", "Baths", "Prices"), as.numeric)

glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area          <dbl> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage        <dbl> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace     <dbl> 1, 1, 5, 5, 5, 4, 1, 2, 1, 3, 5, 1, 1, 4, 4, 5, 1, 4, 4,…
#> $ Baths         <dbl> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble  <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble  <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors        <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City          <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar         <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric      <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber         <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors   <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden        <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices        <dbl> 1399, 1149, 1627, 1650, 1743, 1819, 1023, 1664, 830, 539…

The data is in the correct format.

Look at White.Marble, Black.Marble, and Indian.Marble column. Theese columns should be a factor of one column instead of 3.

house <- house %>% 
  mutate(marble_type = case_when(White.Marble == 1 ~ "white_marble",
                                 Black.Marble == 1 ~ "black_marble",
                                 Indian.Marble == 1 ~ "indian_marble") ) %>% 
  select(-c(White.Marble, Black.Marble, Indian.Marble)) %>% 
  mutate(marble_type = as.factor(marble_type))

Check Missing Value

Now, we check whether there are missing value from the data.

colSums(is.na(house))
#>         Area       Garage    FirePlace        Baths       Floors         City 
#>            0            0            0            0            0            0 
#>        Solar     Electric        Fiber  Glass.Doors Swiming.Pool       Garden 
#>            0            0            0            0            0            0 
#>       Prices  marble_type 
#>            0            0

There are no missing values in our data, so we can continue to the next process.

Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the data variables, see if there are any pattern that can indicate any kind of correlation between variables.

ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The graphic shows that variable Bath, FirePlace, Garage, Area has positive correlation with the prices variable.

Modelling

set.seed(20)
index <- initial_split(data = house, 
                       prop = 0.8) 

data_train <- training(index)
data_test <-testing(index)
model_all <- lm(Prices ~., data = data_train)

summary(model_all)
#> 
#> Call:
#> lm(formula = Prices ~ ., data = data_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -51.967  -0.005   0.000   0.006  44.971 
#> 
#> Coefficients:
#>                            Estimate Std. Error    t value Pr(>|t|)    
#> (Intercept)              -2.965e+00  1.715e-03 -1.729e+03   <2e-16 ***
#> Area                      1.000e+00  4.357e-06  2.295e+05   <2e-16 ***
#> Garage                    6.000e+01  3.828e-04  1.567e+05   <2e-16 ***
#> FirePlace                 3.000e+01  2.214e-04  1.355e+05   <2e-16 ***
#> Baths                     5.000e+01  2.211e-04  2.261e+05   <2e-16 ***
#> Floors1                   6.000e+02  6.254e-04  9.593e+05   <2e-16 ***
#> City2                     1.400e+02  7.662e-04  1.827e+05   <2e-16 ***
#> City3                     2.800e+02  7.663e-04  3.654e+05   <2e-16 ***
#> Solar1                    9.999e+00  6.255e-04  1.599e+04   <2e-16 ***
#> Electric1                 5.000e+01  6.254e-04  7.994e+04   <2e-16 ***
#> Fiber1                    4.700e+02  6.255e-04  7.514e+05   <2e-16 ***
#> Glass.Doors1              1.780e+02  6.255e-04  2.846e+05   <2e-16 ***
#> Swiming.Pool1             7.824e-04  6.255e-04  1.251e+00    0.211    
#> Garden1                   2.793e-04  6.255e-04  4.470e-01    0.655    
#> marble_typeindian_marble -2.000e+02  7.660e-04 -2.611e+05   <2e-16 ***
#> marble_typewhite_marble   3.600e+02  7.664e-04  4.697e+05   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1978 on 399984 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.602e+11 on 15 and 399984 DF,  p-value: < 2.2e-16
model_partial <- lm(Prices ~ Area+Garage+FirePlace+Baths+Floors+City+Solar+Electric+Fiber+Glass.Doors+marble_type,
                    data = data_train)

summary(model_partial)
#> 
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths + Floors + 
#>     City + Solar + Electric + Fiber + Glass.Doors + marble_type, 
#>     data = data_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -51.968  -0.005   0.000   0.006  44.970 
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)              -2.965e+00  1.658e-03   -1788   <2e-16 ***
#> Area                      1.000e+00  4.357e-06  229531   <2e-16 ***
#> Garage                    6.000e+01  3.828e-04  156736   <2e-16 ***
#> FirePlace                 3.000e+01  2.214e-04  135522   <2e-16 ***
#> Baths                     5.000e+01  2.211e-04  226129   <2e-16 ***
#> Floors1                   6.000e+02  6.254e-04  959305   <2e-16 ***
#> City2                     1.400e+02  7.662e-04  182709   <2e-16 ***
#> City3                     2.800e+02  7.663e-04  365407   <2e-16 ***
#> Solar1                    9.999e+00  6.254e-04   15986   <2e-16 ***
#> Electric1                 5.000e+01  6.254e-04   79938   <2e-16 ***
#> Fiber1                    4.700e+02  6.255e-04  751451   <2e-16 ***
#> Glass.Doors1              1.780e+02  6.255e-04  284587   <2e-16 ***
#> marble_typeindian_marble -2.000e+02  7.660e-04 -261110   <2e-16 ***
#> marble_typewhite_marble   3.600e+02  7.664e-04  469732   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1978 on 399986 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.849e+11 on 13 and 399986 DF,  p-value: < 2.2e-16

Compare R-squared model_all and model_partial:

summary(model_all)$adj.r.squared
#> [1] 0.9999998
summary(model_partial)$adj.r.squared
#> [1] 0.9999998

Because the r-squared values are the same, so the error values will be compared using RMSE.

library(MLmetrics)

# prediksi
pred_model_all <- predict(model_all, data_test)
pred_model_partial <- predict(model_partial, data_test)


# evaluasi
RMSE(y_pred = pred_model_all, y_true = data_test$Prices)
#> [1] 0.2953036
RMSE(y_pred = pred_model_partial, y_true = data_test$Prices)
#> [1] 0.2953008

Based on R-squared and errors, the model model_partial is selected.

Then we will try using a step-wise regression model, by trying a combination of predictors until the smallest AIC (information loss) is produced.

# stepwise regression - backward
model_backward <- step(object = model_all, direction = "backward", trace = F)
# stepwise regression - forward

model_none <- lm(Prices ~ 1, data = data_train)

model_forward <- step(object = model_none, scope = list(upper = model_all), direction = "forward", trace = F)
# stepwise regression - both

model_both <- step(object = model_none, scope = list(upper = model_all), direction = "both", trace = F)

The adjusted R-squared values in the four models can be compared:

summary(model_backward)$adj.r.squared
#> [1] 0.9999998
summary(model_forward)$adj.r.squared
#> [1] 0.9999998
summary(model_both)$adj.r.squared
#> [1] 0.9999998
summary(model_partial)$adj.r.squared
#> [1] 0.9999998

Prediction

pred_model_backward <- predict(model_backward, newdata = data_test)
pred_model_forward <- predict(model_forward, newdata = data_test)
pred_model_both <- predict(model_both, newdata = data_test)

Model Evaluation

# evaluation
RMSE(y_pred = pred_model_partial, y_true = data_test$Prices)
#> [1] 0.2953008
RMSE(y_pred = pred_model_backward, y_true = data_test$Prices)
#> [1] 0.2953008
RMSE(y_pred = pred_model_forward, y_true = data_test$Prices)
#> [1] 0.2953008
RMSE(y_pred = pred_model_both, y_true = data_test$Prices)
#> [1] 0.2953008

The four models have the exact same RMSE. The stepwise model has the same RMSE as the multiple predictor model. We can choose models with multiple predictors or stepwise models.

Asumption

Normality of Residuals

hist(model_partial$residuals, breaks = 10)

From the graph above, it can be concluded that the Residuals are distributed normally.

Homoscedasticity

library(lmtest)
bptest(model_backward)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward
#> BP = 23.118, df = 13, p-value = 0.04028

From the above results obtained a p-value of 0.04028. Because the p-value < alpha (0.05) then H0 is rejected. So, it can be concluded that the error variance spreads not constant/forms a pattern (Heteroscedasticity).

Multicollinearity

library(car)

vif(model_backward)
#>                 GVIF Df GVIF^(1/(2*Df))
#> Area        1.000034  1        1.000017
#> Garage      1.000078  1        1.000039
#> FirePlace   1.000014  1        1.000007
#> Baths       1.000041  1        1.000020
#> Floors      1.000025  1        1.000013
#> City        1.000055  2        1.000014
#> Solar       1.000013  1        1.000006
#> Electric    1.000015  1        1.000008
#> Fiber       1.000032  1        1.000016
#> Glass.Doors 1.000038  1        1.000019
#> marble_type 1.000054  2        1.000014

Because there is no VIF value that is more than 10 there is no multicollinearity. Thus, the assumption test is fulfilled.

Conclusion

Both model produces a reliable prediction to forecast the House Price. Also by using Step-Wise regression we were able to produce an equal model with similar results from the model without Step-Wise regression, this means that out Step-Wise regression for feature reduction succesfully reduce the number of features for trainning efficiency but also at the same time retain the information needed to produce reliable predictions.