House Price Prediction using Linear Regression

Debora Sanjaya

October 11, 2022

Introduction

Linear regression is a supervised machine learning model that attempts to model a linear relationship between a dependent or target variable (Y) and one or more independent or predictor variables (X).
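In its general form, the model can be written as Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk + e, where b0 is the intercept, each b is a coefficient estimated from the data, and e is the error term.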

The HousePricing dataset is used to predict house prices from features of the property itself and its surrounding environment, in order to determine whether these aspects add value to a house.

Data Preparation

Load libraries

Load the required packages

library(dplyr)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)

Read data

house <- read.csv("HousePrices_HalfMil.csv")

head(house)
#>   Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
#> 1  164      2         0     2            0            1             0      0
#> 2   84      2         0     4            0            0             1      1
#> 3  190      2         4     4            1            0             0      0
#> 4   75      2         4     4            0            0             1      1
#> 5  148      1         4     2            1            0             0      1
#> 6  124      3         3     3            0            1             0      1
#>   City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
#> 1    3     1        1     1           1            0      0  43800
#> 2    2     0        0     0           1            1      1  37550
#> 3    2     0        0     1           0            0      0  49500
#> 4    1     1        1     1           1            1      1  50075
#> 5    2     1        0     0           1            1      1  52400
#> 6    1     0        0     1           1            1      1  54300

Exploratory Data Analysis

# check data structure
glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble  <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City          <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar         <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric      <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber         <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors   <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden        <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …

The data has 500,000 rows and 16 columns. For the purpose of this analysis, Prices is the target variable (Y) and the rest can be considered as predictor variables (X).

# check missing value
anyNA(house)
#> [1] FALSE

No missing values found in this dataset.

Correlation between variables

Correlation measures the statistical relationship between two variables, i.e. the extent to which they are linearly related to each other.

ggcorr(house, label = TRUE, label_size = 2.5, hjust = 1, layout.exp = 2)

Based on the graph, only a few variables have a positive correlation with Prices; Floors (0.6), Fiber (0.5), and White.Marble (0.4) are more strongly correlated with it than the other attributes.

This indicates that house prices are strongly influenced by the presence of these attributes and much less affected by the number of Baths (0.1) or whether the house has Glass.Doors (0.2).
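As a numeric cross-check of the heat map, the same correlations can be computed directly with base R (a minimal sketch using the columns shown above):

# correlation of every column with Prices, sorted from strongest to weakest
sort(cor(house)[, "Prices"], decreasing = TRUE)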

Data distribution

A boxplot is used to show the spread and center of a distribution.

# the spread of `Prices`
boxplot(house$Prices)

💡 Insight:

  • There are outliers above the upper whisker of the boxplot.
  • The interquartile range (Q1 to Q3) spans roughly 30,000 to 50,000, with a median (Q2) of around 40,000 (confirmed numerically below).
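The quartiles read off the boxplot can be confirmed numerically (a quick sketch using base R):

# five-number summary of Prices to verify the boxplot reading
quantile(house$Prices)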

# `Prices` and `Floors`
plot(house$Prices, house$Floors)

💡 Insight:

  • Houses with floors reach prices of up to around 80,000, while those without floors are less expensive.
  • The starting price point of houses with floors is around 20,000 (see the per-group summary below).
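To put numbers on the pattern seen in the plot, Prices can be summarised for each value of Floors (a sketch using dplyr, which is already loaded):

# minimum, median and maximum price for houses with and without floors
house %>%
  group_by(Floors) %>%
  summarise(min_price = min(Prices),
            median_price = median(Prices),
            max_price = max(Prices))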

Multiple Linear Regression

Prices is the target, with Floors, Fiber, and White.Marble as predictors, because these three variables have the highest correlation with Prices.

# multiple predictors
model_multi <- lm(Prices ~ Floors + Fiber + White.Marble, house)
summary(model_multi)
#> 
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble, data = house)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -17547  -3594      6   3589  17501 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  24862.19      13.63  1823.8   <2e-16 ***
#> Floors       14986.44      14.58  1027.8   <2e-16 ***
#> Fiber        11723.54      14.58   804.1   <2e-16 ***
#> White.Marble 11521.81      15.47   744.8   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5155 on 499996 degrees of freedom
#> Multiple R-squared:  0.8188, Adjusted R-squared:  0.8188 
#> F-statistic: 7.532e+05 on 3 and 499996 DF,  p-value: < 2.2e-16

💡 Insight:

  • Even if a house has none of these features, its baseline price (the intercept) is 24862.19.
  • The presence of Floors adds 14986.44 to Prices.
  • The presence of Fiber adds 11723.54 to Prices.
  • The presence of White.Marble adds 11521.81 to Prices.
  • The adjusted R-squared is 0.8188, meaning these three predictors explain about 81.88% of the variation in house prices, which is good enough to model the value of a house.

Prediction

house$pred_multi <- predict(model_multi, house)
head(house)
#>   Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
#> 1  164      2         0     2            0            1             0      0
#> 2   84      2         0     4            0            0             1      1
#> 3  190      2         4     4            1            0             0      0
#> 4   75      2         4     4            0            0             1      1
#> 5  148      1         4     2            1            0             0      1
#> 6  124      3         3     3            0            1             0      1
#>   City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices pred_multi
#> 1    3     1        1     1           1            0      0  43800   36585.73
#> 2    2     0        0     0           1            1      1  37550   39848.63
#> 3    2     0        0     1           0            0      0  49500   48107.54
#> 4    1     1        1     1           1            1      1  50075   51572.17
#> 5    2     1        0     0           1            1      1  52400   51370.45
#> 6    1     0        0     1           1            1      1  54300   51572.17
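The predictions are simply the fitted coefficients added together. For example, rows 4 and 6 above (Floors = 1, Fiber = 1, White.Marble = 0) are priced at the intercept plus the Floors and Fiber effects; the same value can be reproduced for a hypothetical new house:

# hypothetical house with floors and fiber but no white marble
new_house <- data.frame(Floors = 1, Fiber = 1, White.Marble = 0)
predict(model_multi, new_house)
# by hand: 24862.19 + 14986.44 + 11723.54 = 51572.17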

Model Evaluation

RMSE performance

Root Mean Square Error (RMSE) measures prediction error as the standard deviation of the residuals. Residuals measure how far the data points are from the regression line, and RMSE measures how spread out these residuals are. In other words, RMSE shows how concentrated the data is around the line of best fit.

RMSE(y_pred = house$pred_multi, y_true = house$Prices)
#> [1] 5154.932
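The same value can also be obtained by hand from the residuals, which makes the definition above concrete:

# RMSE computed manually: square root of the mean squared prediction error
sqrt(mean((house$Prices - house$pred_multi)^2))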

To judge whether this RMSE is acceptable, it needs to be compared with the range of the target variable.

range(house$Prices)
#> [1]  7725 77975

The target variable Prices ranges from 7725 to 77975, a span of about 70,000, so an RMSE of roughly 5155 (about 7% of that span) indicates that the model is good enough to be used for prediction.

Normality of residuals

The linear regression model is expected to produce errors that are normally distributed, so that most errors cluster around zero.

# histogram
hist(model_multi$residuals)

The Shapiro-Wilk hypothesis test is another method to check the normality of residuals. However, it cannot be used directly here because R's shapiro.test() only accepts sample sizes up to 5,000, and this model has 500,000 residuals.
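A common workaround is to inspect a Q-Q plot or to run the Shapiro-Wilk test on a random subsample of at most 5,000 residuals, for example:

# Q-Q plot of the residuals against a theoretical normal distribution
qqnorm(model_multi$residuals)
qqline(model_multi$residuals)

# Shapiro-Wilk on a random subsample (the test accepts at most 5,000 values)
set.seed(100)
shapiro.test(sample(model_multi$residuals, 5000))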

Breusch-Pagan hypothesis test

It is expected that the errors generated by the model spread randomly with constant variance; there should be no pattern when they are visualised. This condition is also known as homoscedasticity of residuals.
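Before running a formal test, the residuals can be plotted against the fitted values; under homoscedasticity there should be no visible pattern such as a fan or funnel shape (a minimal sketch):

# residuals vs fitted values; a fan/funnel shape suggests non-constant variance
plot(model_multi$fitted.values, model_multi$residuals)
abline(h = 0, col = "red")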

bptest(model_multi)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_multi
#> BP = 3859.5, df = 3, p-value < 2.2e-16

The p-value is less than 0.05, so the null hypothesis of constant variance is rejected: heteroscedasticity is present in the regression model's residuals.

Multicollinearity

Multicollinearity is a condition in which there is a strong correlation between predictors. This is undesirable because it indicates redundant predictors in the model.

  • VIF value > 10: multicollinearity occurs in the model
  • VIF value < 10: no multicollinearity in the model

vif(model_multi)
#>       Floors        Fiber White.Marble 
#>     1.000002     1.000002     1.000000

There is no multicollinearity as VIF values are all less than 10.

Conclusion

The variables that are useful for determining house prices are floors, fiber, and white marble; the presence of these features in a house drives its price higher. The adjusted R-squared of the regression model is fairly high: the model explains 81.88% of the variation in house prices. The RMSE of this model is also quite low relative to the range of prices. Based on the results of the tests carried out, the model is significant, although the heteroscedasticity found in its residuals means there is still room for improvement.