Introduction

In this analysis, we will learn how to analyze and predict house prices using a linear regression model, one of the classic machine learning models. We also aim to understand the relationship between house prices and the other variables in the dataset. The data used for this analysis comes from https://www.kaggle.com/greenwing1985/housepricing.

Objectives

  • Identify the significant variables that influence house price predictions
  • Interpret the linear regression model
  • Predict house prices
  • Evaluate the linear regression model
  • Check the assumptions of the linear regression model
# Libraries
library(dplyr)    # data wrangling
library(scales)   # number and label formatting
library(ggplot2)  # visualization
library(GGally)   # correlation plot (ggcorr)
library(caret)    # MAE and RMSE helpers
library(lmtest)   # Breusch-Pagan test (bptest)
library(car)      # Variance Inflation Factor (vif)

Data Preparation

house <-  read.csv("HousePrices_HalfMil.csv")
head(house)

Check for Missing Values (NA)

house %>% 
  is.na() %>% 
  colSums()
##          Area        Garage     FirePlace         Baths  White.Marble 
##             0             0             0             0             0 
##  Black.Marble Indian.Marble        Floors          City         Solar 
##             0             0             0             0             0 
##      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
##             0             0             0             0             0 
##        Prices 
##             0

Based on the output above, the house dataset does not contain any missing values.

Data Description

house %>% 
  str()
## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

From the data displayed above, we obtain the following information:

  • The dataset consists of 500,000 rows and 16 columns (variables).
  • The target variable is the Prices column, while the other columns serve as predictor variables.

Data Description:

  • Area = Land area (in square meters or square feet).
  • Garage = Number of garages.
  • FirePlace = Number of fireplaces.
  • Baths = Number of bathrooms.
  • White.Marble = Indicates whether the house uses white marble (1 = Yes, 0 = No).
  • Black.Marble = Indicates whether the house uses black marble (1 = Yes, 0 = No).
  • Indian.Marble = Indicates whether the house uses Indian marble (1 = Yes, 0 = No).
  • Floors = Indicates whether the house is multi-story (1 = Yes, 0 = No).
  • City = City code of the house location (1, 2, or 3).
  • Solar = Indicates whether the house has solar power (1 = Yes, 0 = No).
  • Electric = Indicates whether the house has electricity installed (1 = Yes, 0 = No).
  • Fiber = Indicates whether the house has fiber internet installed (1 = Yes, 0 = No).
  • Glass.Doors = Indicates whether the house has glass doors (1 = Yes, 0 = No).
  • Swiming.Pool = Indicates whether the house has a swimming pool (1 = Yes, 0 = No).
  • Garden = Indicates whether the house has a garden (1 = Yes, 0 = No).

These variables serve as predictor variables, while Prices is the target variable for house price prediction.

Exploratory Data Analysis

We will use Pearson correlation to measure the relationship between variables in the dataset. To do this, we can utilize the GGally package in R, which provides an enhanced visualization of correlations.

ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

From the plot above, the variables with the strongest correlation to house sale prices are White.Marble, Indian.Marble, Floors, City, Fiber, and Glass.Doors.
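To read the exact correlation coefficients rather than the rounded labels on the plot, we can also compute them directly. This is a quick sketch, assuming all columns are numeric (as str() shows):

# Correlation of every variable with Prices, sorted from strongest positive to strongest negative
cor(house)[, "Prices"] %>% 
  sort(decreasing = TRUE)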

Model

Before building the linear regression model, we need to split the dataset into training and testing sets. We will allocate:

  • 80% of the data as data_train (training data)
  • 20% of the data as data_test (testing data)

Cross Validation

# Random Sampling
set.seed(123)
row_data <- nrow(house)
index <- sample(row_data, row_data*0.8)

data_train <- house[index, ]
data_test <- house[-index, ]

Build Model

house_model <- lm(Prices ~ White.Marble + Indian.Marble + Floors + City + Fiber + Glass.Doors,
                  data = data_train)

house_model
## 
## Call:
## lm(formula = Prices ~ White.Marble + Indian.Marble + Floors + 
##     City + Fiber + Glass.Doors, data = data_train)
## 
## Coefficients:
##   (Intercept)   White.Marble  Indian.Marble         Floors           City  
##         18139           9024          -4997          14998           3492  
##         Fiber    Glass.Doors  
##         11752           4435
summary(house_model)
## 
## Call:
## lm(formula = Prices ~ White.Marble + Indian.Marble + Floors + 
##     City + Fiber + Glass.Doors, data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9367.0 -2152.0    -0.5  2148.0  9354.7 
## 
## Coefficients:
##                Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   18138.920     16.774  1081.4 <0.0000000000000002 ***
## White.Marble   9024.401     11.857   761.1 <0.0000000000000002 ***
## Indian.Marble -4997.352     11.857  -421.5 <0.0000000000000002 ***
## Floors        14998.245      9.678  1549.6 <0.0000000000000002 ***
## City           3491.729      5.929   589.0 <0.0000000000000002 ***
## Fiber         11752.405      9.679  1214.3 <0.0000000000000002 ***
## Glass.Doors    4434.556      9.679   458.2 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3061 on 399993 degrees of freedom
## Multiple R-squared:  0.9362, Adjusted R-squared:  0.9362 
## F-statistic: 9.782e+05 on 6 and 399993 DF,  p-value: < 0.00000000000000022

\[ Prices = 18138.920 + 9024.401 \cdot White.Marble - 4997.352 \cdot Indian.Marble + 14998.245 \cdot Floors + 3491.729 \cdot City + 11752.405 \cdot Fiber + 4434.556 \cdot Glass.Doors \]

The Adjusted R-squared value of the house_model is 0.9362, meaning that the model can explain 93.62% of the variance in the target variable (house price). The remaining 6.38% is influenced by other factors not included in the model.
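To illustrate how these coefficients turn into a prediction, the sketch below scores one hypothetical house. The feature values are assumptions made up for the example, not taken from the dataset:

# Hypothetical house: white marble, multi-story, city code 2, with fiber and glass doors
new_house <- data.frame(White.Marble = 1, Indian.Marble = 0, Floors = 1,
                        City = 2, Fiber = 1, Glass.Doors = 1)
predict(house_model, newdata = new_house)
# 18138.9 + 9024.4 + 14998.2 + 2 * 3491.7 + 11752.4 + 4434.6, roughly 65,332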

Evaluation

Model Performance

Before evaluating the model’s performance, we need to predict house prices using the trained regression model. Then, we will use MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) from the caret package to assess the model’s accuracy on both training and testing data.

\[ MAE = \frac{\sum |\hat y - y|}{n} \]

\[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \]
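As a sanity check on these formulas, the same metrics can be computed without caret. A minimal sketch in base R, where pred and actual are placeholder names for the predicted and observed prices:

# Manual MAE: average absolute difference between predictions and actual values
mae_manual  <- function(pred, actual) mean(abs(pred - actual))
# Manual RMSE: square root of the average squared difference
rmse_manual <- function(pred, actual) sqrt(mean((pred - actual)^2))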

Prediction of Model Performance on Training Data

house_predict_train <- predict(house_model, newdata = data_train)
MAE(house_predict_train, data_train$Prices)
## [1] 2481.073
RMSE(house_predict_train, data_train$Prices)
## [1] 3060.576

Interpretation:

  • MAE (Mean Absolute Error): 2481.073. On average, the model’s predictions on the training data deviate by about 2,481 (in either direction) from the actual house prices.

  • RMSE (Root Mean Squared Error): 3060.576. The model’s predictions deviate by about 3,061 on average, with larger errors penalized more heavily because the errors are squared.

Prediction of Model Performance on Testing Data

house_predict_test <- predict(house_model, newdata = data_test)
MAE(house_predict_test, data_test$Prices)
## [1] 2482.874
RMSE(house_predict_test, data_test$Prices)
## [1] 3062.738

Interpretation:

  • MAE (Mean Absolute Error): 2482.874. On average, the model’s predictions on the testing data deviate by about 2,483 (in either direction) from the actual house prices.

  • RMSE (Root Mean Squared Error): 3062.738. The model’s predictions deviate by about 3,063 on average, with larger errors penalized more heavily because the errors are squared.

error_model <- data.frame(data_train = c(MAE = MAE(house_predict_train, data_train$Prices),
                                         RMSE = RMSE(house_predict_train, data_train$Prices)),
                          data_test = c(MAE = MAE(house_predict_test, data_test$Prices),
                                        RMSE = RMSE(house_predict_test, data_test$Prices)))
error_model
error_model

We can observe that the model’s performance on both the training data and testing data is similar, with no significant differences.

Thus, we can conclude that the model does not suffer from overfitting or underfitting. This indicates that the model generalizes well to new data and provides reliable predictions.

Checking Assumptions

Linear regression has several assumptions that need to be met to ensure that the interpretation is unbiased. These assumptions are crucial if the goal of building the linear regression model is to interpret or analyze the effect of each predictor on the target variable.

However, if the purpose is only to make predictions, then meeting these assumptions is not mandatory. The model can still be useful even if some assumptions are violated, as long as its predictive performance remains strong.

Linearity

Linearity means that the relationship between the target variable and each predictor follows a straight-line pattern.

The linear relationship between the target variable and predictors can also be observed through correlation between variables. If the correlation between a predictor and the target variable is strong (either positive or negative), it indicates a linear relationship.

house_model
## 
## Call:
## lm(formula = Prices ~ White.Marble + Indian.Marble + Floors + 
##     City + Fiber + Glass.Doors, data = data_train)
## 
## Coefficients:
##   (Intercept)   White.Marble  Indian.Marble         Floors           City  
##         18139           9024          -4997          14998           3492  
##         Fiber    Glass.Doors  
##         11752           4435

\[ Prices = 18138.920 + 9024.401 \cdot White.Marble - 4997.352 \cdot Indian.Marble + 14998.245 \cdot Floors + 3491.729 \cdot City + 11752.405 \cdot Fiber + 4434.556 \cdot Glass.Doors \]

ggcorr(house %>% 
         select(c(White.Marble,Indian.Marble,Floors,City,Fiber,Glass.Doors, Prices)),
       label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Normality

Normality means that the residuals (errors) of the model are normally distributed around zero. A quick visual check is a histogram of the residuals.

hist(house_model$residuals)

Based on the histogram above, the residuals appear to be approximately normally distributed and centered around zero.
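As a complementary check, a Q-Q plot and a Shapiro-Wilk test can also be used. This is a sketch; note that shapiro.test() accepts at most 5,000 values, so it is run on a random sample of the residuals:

# Q-Q plot: points close to the reference line suggest normally distributed residuals
qqnorm(house_model$residuals)
qqline(house_model$residuals, col = "red")

# Shapiro-Wilk test on a 5,000-observation sample of the residuals
set.seed(123)
shapiro.test(sample(house_model$residuals, 5000))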

Homoscedasticity

Homoscedasticity means that the residuals (errors) have a constant variance across all levels of the predicted values. This ensures that the model’s predictions remain stable and reliable.

If the residuals show a pattern (such as a funnel shape, increasing/decreasing spread, or a clear trend), it indicates Heteroscedasticity. This can lead to:

  • Biased standard errors → Confidence intervals and hypothesis tests may become unreliable.
  • Overestimation or underestimation of predictor effects.

bptest(house_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  house_model
## BP = 8.7898, df = 6, p-value = 0.1858

Since the p-value (0.1858) is greater than 0.05, we fail to reject the null hypothesis of constant variance, so the residuals can be considered homoscedastic.
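A residuals-versus-fitted plot gives a visual confirmation of the same conclusion. A minimal sketch using base graphics; an even scatter around zero, with no funnel shape, supports homoscedasticity:

# Residuals vs fitted values: look for a random, even scatter around the zero line
# (with 400,000 residuals this plot can take a moment to render)
plot(house_model$fitted.values, house_model$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")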

Multicollinearity

Multicollinearity occurs when predictor variables in the model have a strong correlation with each other. This can cause instability in the regression model, making it difficult to determine the individual effect of each predictor on the target variable.

We can detect multicollinearity using the Variance Inflation Factor (VIF):

  • VIF > 10 → High multicollinearity (Problematic, consider removing a variable).
  • VIF < 5 → Low or no multicollinearity (Good).

vif(house_model)
##  White.Marble Indian.Marble        Floors          City         Fiber 
##      1.334860      1.334860      1.000002      1.000012      1.000013 
##   Glass.Doors 
##      1.000010

All VIF values are well below 5, so there is no indication of multicollinearity among the predictors.

Conclusion

  • The variables that significantly impact and help explain variation in house prices are White.Marble, Indian.Marble, Floors, City, Fiber, and Glass.Doors, with Prices as the target variable.

  • The adjusted R-squared value obtained from the model is quite high, indicating that the model can explain 93.62% of the variation in house prices, while the remaining 6.38% is influenced by other factors.

  • The model’s accuracy in predicting house prices is measured using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

  • On the training data (data_train), the model has: MAE = 2481.073 and RMSE = 3060.576

  • On the test data (data_test), the model has: MAE = 2482.874 and RMSE = 3062.738

  • Since the MAE and RMSE values for data_train and data_test are very similar, we can conclude that the model does not suffer from overfitting or underfitting.