Linear Regression on House Pricing

Intro

This report is built around the House Prices dataset provided on Kaggle. Be aware that this dataset is computer generated and is provided for academic practice in model development. We will use linear regression to build a model that predicts Prices from the most important predictors provided.

Data Preparation

Load the required packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode

Read the CSV data

house_price <- read.csv("dataset_houseprices.csv")
house_price %>% head(5)
##   Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
## 1  164      2         0     2            0            1             0      0
## 2   84      2         0     4            0            0             1      1
## 3  190      2         4     4            1            0             0      0
## 4   75      2         4     4            0            0             1      1
## 5  148      1         4     2            1            0             0      1
##   City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
## 1    3     1        1     1           1            0      0  43800
## 2    2     0        0     0           1            1      1  37550
## 3    2     0        0     1           0            0      0  49500
## 4    1     1        1     1           1            1      1  50075
## 5    2     1        0     0           1            1      1  52400

The dataset contains many integer values, as we can see above, but most of them are in fact categorical values encoded as numbers. For example, in the Black.Marble column, 1 means ‘Yes, there is black marble in the house’ and 0 means there is not. So, to let the model treat each level separately, these columns need to be converted to type factor.

house_price_clean <- house_price %>% 
  mutate_at(vars(Area,Garage,FirePlace,Baths,White.Marble,Black.Marble,Indian.Marble,Floors,City,Solar,Electric,Fiber,Glass.Doors,Swiming.Pool,Garden), as.factor)
glimpse(house_price_clean)
## Rows: 500,000
## Columns: 16
## $ Area          <fct> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
## $ Garage        <fct> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
## $ FirePlace     <fct> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
## $ Baths         <fct> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
## $ White.Marble  <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ Black.Marble  <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ Floors        <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
## $ City          <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
## $ Solar         <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
## $ Electric      <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
## $ Fiber         <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ Glass.Doors   <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
## $ Swiming.Pool  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
## $ Garden        <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
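To illustrate why the conversion to factor matters, here is a minimal sketch on hypothetical toy data (the `toy` data frame and its values are made up, not taken from the Kaggle file). It compares how `lm()` treats the same column as a number and as a factor.

```r
# Hypothetical toy data (not from the Kaggle file): City coded 1, 2, 3.
toy <- data.frame(
  City   = c(1, 2, 3, 1, 2, 3),
  Prices = c(10, 14, 30, 12, 16, 32)
)

# As a number, City gets a single slope: each step 1 -> 2 -> 3 is forced
# to add the same amount to the predicted price.
fit_numeric <- lm(Prices ~ City, data = toy)

# As a factor, each level other than the baseline gets its own coefficient.
toy$City_f <- as.factor(toy$City)
fit_factor <- lm(Prices ~ City_f, data = toy)

names(coef(fit_numeric))  # "(Intercept)" "City"
names(coef(fit_factor))   # "(Intercept)" "City_f2" "City_f3"
```

In the factor version the intercept is the mean price of the baseline level, and each remaining coefficient is the difference from that baseline.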

Check for missing data

anyNA(house_price_clean)
## [1] FALSE
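If `anyNA()` had returned TRUE, a per-column count would show where the gaps are. The sketch below uses a hypothetical toy data frame with two missing values planted on purpose.

```r
# Hypothetical toy data with two missing values, for illustration only.
toy <- data.frame(
  Garage = c(1, 2, NA, 3),
  Prices = c(43800, NA, 49500, 50075)
)

anyNA(toy)                            # TRUE: at least one value is missing
na_per_column <- colSums(is.na(toy))  # count of missing values per column
na_per_column
```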

We also notice that the predictor Area has 249 unique values. As a factor it would have too many levels and be too specific, so we decide to leave this predictor out of the training set.

length(unique(house_price_clean$Area))
## [1] 249
house_price_clean <- house_price_clean %>% 
  select(-Area)
names(house_price_clean)
##  [1] "Garage"        "FirePlace"     "Baths"         "White.Marble" 
##  [5] "Black.Marble"  "Indian.Marble" "Floors"        "City"         
##  [9] "Solar"         "Electric"      "Fiber"         "Glass.Doors"  
## [13] "Swiming.Pool"  "Garden"        "Prices"

Exploratory Data Analysis (EDA)

EDA is the phase where we assess the dataset. A correlation test is generally used to check for linear relationships in the data, but since all of our predictors are factors, we cannot do a correlation assessment. Instead, we check the frequency of each level in every predictor to make sure that no column is badly imbalanced.

house_price %>% group_by(Area) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 249 × 2
##     Area count
##    <int> <int>
##  1     1  2038
##  2     2  2030
##  3     3  1905
##  4     4  1967
##  5     5  2071
##  6     6  1990
##  7     7  2060
##  8     8  1983
##  9     9  1920
## 10    10  2025
## # … with 239 more rows
house_price %>% group_by(Garage) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 3 × 2
##   Garage  count
##    <int>  <int>
## 1      1 166552
## 2      2 166251
## 3      3 167197
house_price %>% group_by(FirePlace) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 5 × 2
##   FirePlace  count
##       <int>  <int>
## 1         0  99569
## 2         1  99983
## 3         2  99954
## 4         3 100168
## 5         4 100326
house_price %>% group_by(Baths) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 5 × 2
##   Baths  count
##   <int>  <int>
## 1     1 100319
## 2     2  99794
## 3     3 100158
## 4     4  99989
## 5     5  99740
house_price %>% group_by(White.Marble) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   White.Marble  count
##          <int>  <int>
## 1            0 333504
## 2            1 166496
house_price %>% group_by(Black.Marble) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   Black.Marble  count
##          <int>  <int>
## 1            0 333655
## 2            1 166345
house_price %>% group_by(Indian.Marble) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   Indian.Marble  count
##           <int>  <int>
## 1             0 332841
## 2             1 167159
house_price %>% group_by(Floors) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   Floors  count
##    <int>  <int>
## 1      0 250307
## 2      1 249693
house_price %>% group_by(City) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 3 × 2
##    City  count
##   <int>  <int>
## 1     1 166314
## 2     2 166902
## 3     3 166784
house_price %>% group_by(Solar) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   Solar  count
##   <int>  <int>
## 1     0 250653
## 2     1 249347
house_price %>% group_by(Electric) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   Electric  count
##      <int>  <int>
## 1        0 249675
## 2        1 250325
house_price %>% group_by(Fiber) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   Fiber  count
##   <int>  <int>
## 1     0 249766
## 2     1 250234
house_price %>% group_by(Glass.Doors) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   Glass.Doors  count
##         <int>  <int>
## 1           0 250065
## 2           1 249935
house_price %>% group_by(Swiming.Pool) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   Swiming.Pool  count
##          <int>  <int>
## 1            0 249782
## 2            1 250218
house_price %>% group_by(Garden) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
##   Garden  count
##    <int>  <int>
## 1      0 249177
## 2      1 250823

Although there are many tables, we can observe that the levels of every variable are fairly distributed. Most of the variables are evenly proportioned, except for White.Marble, Black.Marble, and Indian.Marble, where 1 and 0 appear in a ratio of roughly 1:2, which is still acceptable.
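The same balance check can be written more compactly by looping over the columns. This is a sketch in base R on a hypothetical toy data frame that stands in for house_price; the column values are made up.

```r
# Hypothetical toy data standing in for house_price.
toy <- data.frame(
  Garage       = c(1, 2, 3, 1, 2, 3),
  White.Marble = c(1, 0, 0, 1, 0, 0),
  Floors       = c(0, 1, 0, 1, 0, 1)
)

# One proportion table per column: the share of rows at each level.
level_shares <- lapply(toy, function(col) prop.table(table(col)))
level_shares$White.Marble  # level 0 -> 2/3, level 1 -> 1/3, i.e. 1:2
```

Proportions make it easier to compare columns at a glance than raw counts do.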

Modelling

model_house_price <- lm(formula =  Prices ~ .  , data = house_price_clean)
summary(model_house_price)
## 
## Call:
## lm(formula = Prices ~ ., data = house_price_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3127.69 -1556.89     0.65  1552.74  3137.98 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    10379.346     11.910  871.512   <2e-16 ***
## Garage2         1505.576      6.223  241.946   <2e-16 ***
## Garage3         2996.052      6.214  482.149   <2e-16 ***
## FirePlace1       751.615      8.036   93.530   <2e-16 ***
## FirePlace2      1489.221      8.037  185.301   <2e-16 ***
## FirePlace3      2251.582      8.032  280.312   <2e-16 ***
## FirePlace4      3002.395      8.029  373.931   <2e-16 ***
## Baths2          1247.118      8.025  155.406   <2e-16 ***
## Baths3          2499.609      8.018  311.763   <2e-16 ***
## Baths4          3745.880      8.021  467.011   <2e-16 ***
## Baths5          4997.958      8.026  622.724   <2e-16 ***
## White.Marble1  14009.039      6.215 2254.144   <2e-16 ***
## Black.Marble1   4998.862      6.216  804.164   <2e-16 ***
## Indian.Marble1        NA         NA       NA       NA    
## Floors1        14997.228      5.077 2954.075   <2e-16 ***
## City2           3493.203      6.219  561.707   <2e-16 ***
## City3           6984.857      6.220 1122.962   <2e-16 ***
## Solar1           251.915      5.077   49.620   <2e-16 ***
## Electric1       1249.509      5.077  246.122   <2e-16 ***
## Fiber1         11750.359      5.077 2314.506   <2e-16 ***
## Glass.Doors1    4445.571      5.077  875.660   <2e-16 ***
## Swiming.Pool1      2.222      5.077    0.438    0.662    
## Garden1            5.114      5.077    1.007    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1795 on 499978 degrees of freedom
## Multiple R-squared:  0.978,  Adjusted R-squared:  0.978 
## F-statistic: 1.06e+06 on 21 and 499978 DF,  p-value: < 2.2e-16

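The initial model shows that one coefficient, Indian.Marble1, is not defined because of singularities. This means Indian.Marble is an exact linear combination of other predictors: the counts of houses with White.Marble, Black.Marble, and Indian.Marble equal to 1 (166,496 + 166,345 + 167,159) add up to exactly 500,000, which is consistent with every house having exactly one marble type, so Indian.Marble is fully determined by the other two marble columns. For the next round of tuning, we remove this redundant predictor and rebuild the model with step-wise feature selection to find the most efficient model.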
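The singularity can be reproduced on a small scale. The sketch below assumes, as a labeled assumption, that the three marble columns are mutually exclusive indicators; the toy data and column names are hypothetical.

```r
# Hypothetical toy data: every house has exactly one of three marble types.
toy <- data.frame(
  White  = c(1, 0, 0, 1, 0, 0, 1, 0),
  Black  = c(0, 1, 0, 0, 1, 0, 0, 1),
  Indian = c(0, 0, 1, 0, 0, 1, 0, 0),
  Prices = c(50, 41, 36, 52, 40, 35, 51, 42)
)

# The three dummies always sum to 1, the same as the intercept column,
# so the last one carries no extra information.
all(toy$White + toy$Black + toy$Indian == 1)

fit <- lm(Prices ~ White + Black + Indian, data = toy)
coef(fit)  # Indian is NA: "not defined because of singularities"

# Dropping the redundant dummy gives identical fitted values.
fit_reduced <- lm(Prices ~ White + Black, data = toy)
all.equal(fitted(fit), fitted(fit_reduced))
```

Because the fitted values do not change, removing the aliased column loses no information.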

house_price_clean <- house_price_clean %>% select(-Indian.Marble)

model_house_price_clean <- lm(formula =  Prices ~ .  , data = house_price_clean)
model_house_price_step <- step(object = model_house_price_clean,
                          direction = "both",
                          trace = FALSE)
summary(model_house_price_step)
## 
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + White.Marble + 
##     Black.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors, data = house_price_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3124.25 -1556.78     0.72  1552.72  3134.30 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10383.005     11.360  913.97   <2e-16 ***
## Garage2        1505.584      6.223  241.95   <2e-16 ***
## Garage3        2996.051      6.214  482.15   <2e-16 ***
## FirePlace1      751.625      8.036   93.53   <2e-16 ***
## FirePlace2     1489.212      8.037  185.30   <2e-16 ***
## FirePlace3     2251.574      8.032  280.31   <2e-16 ***
## FirePlace4     3002.412      8.029  373.93   <2e-16 ***
## Baths2         1247.103      8.025  155.40   <2e-16 ***
## Baths3         2499.613      8.018  311.76   <2e-16 ***
## Baths4         3745.891      8.021  467.01   <2e-16 ***
## Baths5         4997.970      8.026  622.73   <2e-16 ***
## White.Marble1 14009.041      6.215 2254.15   <2e-16 ***
## Black.Marble1  4998.866      6.216  804.17   <2e-16 ***
## Floors1       14997.225      5.077 2954.08   <2e-16 ***
## City2          3493.215      6.219  561.71   <2e-16 ***
## City3          6984.865      6.220 1122.96   <2e-16 ***
## Solar1          251.892      5.077   49.62   <2e-16 ***
## Electric1      1249.514      5.077  246.12   <2e-16 ***
## Fiber1        11750.368      5.077 2314.53   <2e-16 ***
## Glass.Doors1   4445.589      5.077  875.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1795 on 499980 degrees of freedom
## Multiple R-squared:  0.978,  Adjusted R-squared:  0.978 
## F-statistic: 1.172e+06 on 19 and 499980 DF,  p-value: < 2.2e-16

Step-wise regression, which step() performs by comparing candidate models on AIC, helps us drop predictors that do not improve the model. In this case Swiming.Pool and Garden, the two predictors that were not significant in the full model, are removed from the list of predictors.
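For readers curious about the score behind the selection, here is a rough sketch of the AIC value that `step()` obtains from `extractAIC()` for a linear model. The data are simulated and hypothetical, and `aic_by_hand` is a helper written only for this illustration.

```r
# Hypothetical toy data: y depends on x1 only; x2 is pure noise.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 + rnorm(n)
toy <- data.frame(y, x1, x2)

fit_full <- lm(y ~ x1 + x2, data = toy)
fit_drop <- lm(y ~ x1, data = toy)

# For lm, extractAIC() returns n * log(RSS / n) + 2 * k,
# where k is the number of estimated coefficients.
aic_by_hand <- function(fit) {
  rss <- sum(residuals(fit)^2)
  k   <- length(coef(fit))
  n * log(rss / n) + 2 * k
}

c(full = aic_by_hand(fit_full), drop_x2 = aic_by_hand(fit_drop))
extractAIC(fit_full)[2]  # same value as aic_by_hand(fit_full)
```

At each step the candidate with the lowest value is kept, so a predictor is dropped when the fit it buys does not outweigh the penalty of 2 per coefficient.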

Prediction

Since we have no new data, we will predict the prices for the same house_price dataset that the models were trained on. We will use the two models that we already built to predict Prices:

  1. model_house_price_clean, a model without predictor Area and Indian.Marble

  2. model_house_price_step, a model made using step-wise regression to find the best predictors

house_price$pred_all_clean <- predict(object = model_house_price_clean, newdata = house_price_clean)
house_price$pred_step <- predict(object = model_house_price_step, newdata = house_price_clean)
head(house_price)
##   Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
## 1  164      2         0     2            0            1             0      0
## 2   84      2         0     4            0            0             1      1
## 3  190      2         4     4            1            0             0      0
## 4   75      2         4     4            0            0             1      1
## 5  148      1         4     2            1            0             0      1
## 6  124      3         3     3            0            1             0      1
##   City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
## 1    3     1        1     1           1            0      0  43800
## 2    2     0        0     0           1            1      1  37550
## 3    2     0        0     1           0            0      0  49500
## 4    1     1        1     1           1            1      1  50075
## 5    2     1        0     0           1            1      1  52400
## 6    1     0        0     1           1            1      1  54300
##   pred_all_clean pred_step
## 1       42813.11  42816.78
## 2       38574.14  38570.51
## 3       47885.80  47889.52
## 4       51335.11  51331.48
## 5       51833.15  51829.48
## 6       54325.95  54322.29
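Because every predictor is a factor, a prediction is simply the intercept plus the coefficients of the levels that apply to a house. As a worked example, the sketch below rebuilds pred_step for row 1 by hand, using coefficients copied from the summary of model_house_price_step; the vector name `row1_terms` is introduced here only for illustration.

```r
# Coefficients copied from the summary of model_house_price_step above,
# keeping only the terms that apply to row 1 of house_price
# (Garage 2, FirePlace 0, Baths 2, Black.Marble 1, Floors 0, City 3,
#  Solar 1, Electric 1, Fiber 1, Glass.Doors 1).
row1_terms <- c(
  Intercept     = 10383.005,
  Garage2       = 1505.584,
  Baths2        = 1247.103,
  Black.Marble1 = 4998.866,
  City3         = 6984.865,
  Solar1        = 251.892,
  Electric1     = 1249.514,
  Fiber1        = 11750.368,
  Glass.Doors1  = 4445.589
)

# Baseline levels (FirePlace 0, White.Marble 0, Floors 0) contribute nothing.
pred_row1 <- sum(row1_terms)
pred_row1  # about 42816.79, matching pred_step for row 1 up to rounding
```

The small difference from the printed 42816.78 comes from the coefficients being rounded to three decimals in the summary.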

Model Evaluation

R-squared

We will compare the goodness of fit of the two models based on their adjusted R-squared values.

summary(model_house_price_clean)$adj.r.squared
## [1] 0.978033
summary(model_house_price_step)$adj.r.squared
## [1] 0.9780331
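Adjusted R-squared is R-squared with a penalty for the number of predictors. The sketch below recomputes it by hand on simulated, hypothetical data and checks it against the value reported by `summary()`.

```r
# Hypothetical toy data, for illustration only.
set.seed(42)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = toy)
s   <- summary(fit)

n <- nrow(toy)              # number of observations
p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj_by_hand, s$adj.r.squared)
```

With 500,000 rows and about 20 coefficients, the factor (n - 1) / (n - p - 1) is almost exactly 1, which is why R-squared and adjusted R-squared print as the same 0.978 in the summaries above.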

Error

We will calculate the error, meaning the difference between the actual value and the predicted value, using RMSE (Root Mean Squared Error), which is sensitive to large errors.

RMSE(y_pred = house_price$pred_all_clean, y_true = house_price$Prices)
## [1] 1794.85
RMSE(y_pred = house_price$pred_step, y_true = house_price$Prices)
## [1] 1794.852

By comparing the RMSE of both models we can say that they produce almost identical predictions. An RMSE of about 1,795 means the predicted price typically differs from the actual price by roughly that amount; it is not an upper bound, since individual residuals reach about ±3,130.
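To make the metric concrete, here is a minimal base-R sketch of RMSE. The `rmse` helper is written for this illustration and is not the MLmetrics implementation; the six price pairs are copied from the head(house_price) output above, and the two error vectors are hypothetical.

```r
# RMSE written out in base R: square the errors, average, take the root.
rmse <- function(y_pred, y_true) sqrt(mean((y_true - y_pred)^2))

# First six rows of Prices and pred_step, copied from the output above.
actual    <- c(43800, 37550, 49500, 50075, 52400, 54300)
predicted <- c(42816.78, 38570.51, 47889.52, 51331.48, 51829.48, 54322.29)
rmse(predicted, actual)

# Two hypothetical error patterns with the same mean absolute error (2):
even_errors  <- c(2, 2, 2, 2)
one_big_miss <- c(0, 0, 0, 8)
sqrt(mean(even_errors^2))   # 2
sqrt(mean(one_big_miss^2))  # 4: the single large error dominates
```

Squaring before averaging is what makes RMSE react more strongly to a few large misses than to many small ones.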

Conclusion

Both models produce reliable predictions of the house price on this dataset. The step-wise model matches the model built without step-wise selection on both adjusted R-squared and RMSE. This means that our step-wise regression successfully reduced the number of features, improving training efficiency, while retaining the information needed to produce reliable predictions.