Introduction
Linear regression is a supervised machine learning method that models the linear relationship between a dependent or target variable (Y) and one or more independent or predictor variables (X).
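In its general form, the model can be written as Y = b0 + b1*X1 + ... + bk*Xk + e, where b0 is the intercept, b1 to bk are the coefficients of the predictors X1 to Xk, and e is the error term.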
The HousePricing dataset is used to predict house prices, taking into account features of the property itself and its surrounding environment, in order to determine whether these aspects add value to a house.
Data Preparation
Load libraries
Load the required packages
library(dplyr)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
Read data
house <- read.csv("HousePrices_HalfMil.csv")
head(house)
#> Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
#> 1 164 2 0 2 0 1 0 0
#> 2 84 2 0 4 0 0 1 1
#> 3 190 2 4 4 1 0 0 0
#> 4 75 2 4 4 0 0 1 1
#> 5 148 1 4 2 1 0 0 1
#> 6 124 3 3 3 0 1 0 1
#> City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
#> 1 3 1 1 1 1 0 0 43800
#> 2 2 0 0 0 1 1 1 37550
#> 3 2 0 0 1 0 0 0 49500
#> 4 1 1 1 1 1 1 1 50075
#> 5 2 1 0 0 1 1 1 52400
#> 6 1 0 0 1 1 1 1 54300
Exploratory Data Analysis
# check data structure
glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
The data has 500,000 rows and 16 columns. For the purpose of this analysis, Prices is the target variable (Y) and the rest can be considered predictor variables (X).
# check missing value
anyNA(house)
#> [1] FALSE
No missing values were found in this dataset.
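For a more granular check (a sketch; this step was not part of the original analysis), missing values can also be counted per column:
# count of NA values in each column; all zeros confirm the anyNA() result
colSums(is.na(house))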
Correlation between variables
Correlation measures the statistical relationship between two variables, that is, the extent to which they have a linear relationship with each other.
ggcorr(house, label = TRUE, label_size = 2.5, hjust = 1, layout.exp = 2)
Based on the graph, only a few variables have a positive correlation with Prices; among them, Floors (0.6), Fiber (0.5), and White.Marble (0.4) have stronger correlations than the other attributes. This indicates that house prices are strongly influenced by the presence of these attributes and much less by the number of Baths (0.1) or whether the house has Glass.Doors (0.2).
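A single pairwise coefficient can also be computed directly with base R (a sketch; the result should match the rounded label in the plot above):
# Pearson correlation between Floors and Prices; roughly 0.6 per the correlation plot
cor(house$Floors, house$Prices)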
Data distribution
A boxplot is used to show the spread and center of a dataset.
# the spread of `Prices`
boxplot(house$Prices)
💡 Insight:
- There are outliers above the upper whisker of the boxplot.
- The interquartile range (Q1 to Q3) spreads from around 30,000 to 50,000, with the median (Q2) around 40,000.
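The approximate quartiles read off the boxplot can be checked numerically (a sketch; the exact output was not part of the original analysis):
# first quartile, median, and third quartile of Prices
quantile(house$Prices, probs = c(0.25, 0.5, 0.75))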
# `Prices` and `Floors`
plot(house$Prices, house$Floors)
💡 Insight:
- Houses with Floors can reach Prices of up to around 80,000, while those without floors are less expensive.
- The starting price point of houses with floors is around 20,000.
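Since Floors is a binary variable, a grouped boxplot may show the price difference more clearly than a scatter plot (a sketch; this plot was not part of the original analysis):
# distribution of Prices for houses without (0) and with (1) floors
boxplot(Prices ~ Floors, data = house)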
Multiple Linear Regression
Prices is used as the target, with Floors, Fiber, and White.Marble as predictors, because these three have the highest correlation with Prices.
# multiple predictors
model_multi <- lm(Prices ~ Floors + Fiber + White.Marble, house)
summary(model_multi)
#>
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17547 -3594 6 3589 17501
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 24862.19 13.63 1823.8 <2e-16 ***
#> Floors 14986.44 14.58 1027.8 <2e-16 ***
#> Fiber 11723.54 14.58 804.1 <2e-16 ***
#> White.Marble 11521.81 15.47 744.8 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5155 on 499996 degrees of freedom
#> Multiple R-squared: 0.8188, Adjusted R-squared: 0.8188
#> F-statistic: 7.532e+05 on 3 and 499996 DF, p-value: < 2.2e-16
💡 Insight:
- Even if the house has none of these features, the baseline price (the intercept) is 24862.19.
- The presence of Floors adds 14986.44 to Prices.
- The presence of Fiber adds 11723.54 to Prices.
- The presence of White.Marble adds 11521.81 to Prices.
- The adjusted R-squared is 0.8188, meaning these three predictors alone explain about 81.88% of the variance in Prices, which makes them good enough to value a house.
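The fitted equation can be applied by hand. As a quick sanity check using the coefficients reported above, a house with Floors and Fiber but no White.Marble should be priced at:
# intercept + Floors coefficient + Fiber coefficient
24862.19 + 14986.44 + 11723.54
#> [1] 51572.17
This matches pred_multi for rows 4 and 6 in the prediction table below.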
Prediction
house$pred_multi <- predict(model_multi, house)
head(house)
#> Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
#> 1 164 2 0 2 0 1 0 0
#> 2 84 2 0 4 0 0 1 1
#> 3 190 2 4 4 1 0 0 0
#> 4 75 2 4 4 0 0 1 1
#> 5 148 1 4 2 1 0 0 1
#> 6 124 3 3 3 0 1 0 1
#> City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices pred_multi
#> 1 3 1 1 1 1 0 0 43800 36585.73
#> 2 2 0 0 0 1 1 1 37550 39848.63
#> 3 2 0 0 1 0 0 0 49500 48107.54
#> 4 1 1 1 1 1 1 1 50075 51572.17
#> 5 2 1 0 0 1 1 1 52400 51370.45
#> 6 1 0 0 1 1 1 1 54300 51572.17
Model Evaluation
RMSE performance
Root Mean Square Error (RMSE) measures prediction error as the standard deviation of the residuals. Residuals measure how far data points fall from the regression line; RMSE measures how spread out these residuals are. In other words, RMSE shows how concentrated the data is around the line of best fit.
RMSE(y_pred = house$pred_multi, y_true = house$Prices)
#> [1] 5154.932
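The same value can be reproduced manually from the definition of RMSE (a sketch; MLmetrics::RMSE() performs the equivalent computation):
# square root of the mean squared difference between predictions and actual prices
sqrt(mean((house$pred_multi - house$Prices)^2))
#> [1] 5154.932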
To judge RMSE performance, it is necessary to compare it with the range of the target variable.
range(house$Prices)
#> [1] 7725 77975
The range of the target variable Prices is from 7725 to 77975, a spread of 70250. The RMSE of about 5155 is small relative to this spread, which indicates that the model is good enough to be used for prediction.
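One informal way to express this (a sketch, not part of the original analysis) is the ratio of the RMSE to the width of the target range:
# RMSE as a fraction of the target range; roughly 0.073, i.e. the typical
# prediction error is about 7% of the full price range
RMSE(y_pred = house$pred_multi, y_true = house$Prices) / diff(range(house$Prices))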
Normality of residuals
A linear regression model is expected to produce errors that are normally distributed, with most of them clustered around zero.
# histogram
hist(model_multi$residuals)
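A normal Q-Q plot is another common visual check (a sketch; this plot was not part of the original analysis). Points lying close to the reference line support normality:
# Q-Q plot of residuals against a theoretical normal distribution
qqnorm(model_multi$residuals)
qqline(model_multi$residuals)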
The Shapiro-Wilk hypothesis test is another method to check the normality of residuals. However, it cannot be applied directly here: R's shapiro.test() only accepts sample sizes between 3 and 5,000, and this dataset has 500,000 observations.
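A common workaround (a sketch, under the assumption that a random subsample is representative of the full residuals) is to run the test on at most 5,000 residuals:
set.seed(417)  # arbitrary seed, chosen only for reproducibility
# Shapiro-Wilk on a random subsample, since shapiro.test() accepts at most 5,000 values
shapiro.test(sample(model_multi$residuals, 5000))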
Breusch-Pagan hypothesis test
The errors generated by the model are expected to spread randomly with constant variance, showing no pattern when visualised. This condition is known as homoscedasticity of residuals.
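Before the formal test, the residuals can be inspected visually (a sketch; this plot was not part of the original analysis). A funnel shape or any other systematic pattern suggests non-constant variance:
# residuals against fitted values; random scatter around zero indicates homoscedasticity
plot(model_multi$fitted.values, model_multi$residuals)
abline(h = 0, col = "red")  # reference line at zero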
bptest(model_multi)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_multi
#> BP = 3859.5, df = 3, p-value < 2.2e-16
The p-value is less than 0.05, which indicates that heteroscedasticity is present in the regression model.
Multicollinearity
Multicollinearity is a condition in which predictors are strongly correlated with one another. This is undesirable because it indicates redundant predictors in the model.
- VIF value > 10: multicollinearity occurs in the model
- VIF value < 10: no multicollinearity in the model
vif(model_multi)
#> Floors Fiber White.Marble
#> 1.000002 1.000002 1.000000
There is no multicollinearity as VIF values are all less than 10.
Conclusion
Variables that are useful for determining house prices are Floors, Fiber, and White.Marble; the presence of these features in a house drives its price higher. The adjusted R-squared of the regression model is fairly high: the model explains 81.88% of the variance in house prices. The RMSE of the model is also quite low relative to the price range. Based on the results of the tests carried out, the model is significant; however, there is room for improvement, in particular addressing the heteroscedasticity detected by the Breusch-Pagan test.