Linear Regression With Multiple Features — Predicting Housing Price

Abstract

The purpose of this regression and exploratory data analysis is to gain insight into how housing prices relate to the other attributes of a house. The data is the “Housing Prices” dataset from Kaggle. The main focus is on analyzing which factors affect price, using 500,000 houses with their prices and the other columns, which are treated as possible price factors.

Importing the libraries and reading the dataset from CSV file

house <- read.csv("data_input/HousePrices_HalfMil.csv")

Understanding the Data

str(house)

## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

Variable Explanations:

Area: Land area of the building
Garage: Garage Type
FirePlace: Type of fireplace in the house
Baths: Bathroom Type
White.Marble: Type of white marble in the house
Black.Marble: Type of black marble in the house
Indian.Marble: Types of Indian marble in the house
Floors: Type of floor
City: City of home location
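Solar: Whether the house is solar powered (yes / no)
Electric: Whether the house has electricity (yes / no)
Fiber: Whether the house has a fiber roof (yes / no)
Glass.Doors: Whether the doors are made of glass (yes / no)
Swiming.Pool: Whether there is a swimming pool (yes / no)
Garden: Whether there is a garden (yes / no)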
Prices: House prices

Checking whether the dataset has null values

# check for missing values
anyNA(house)

## [1] FALSE

It is also worth checking each column for the number of missing entries. This matters because missing data reduces the expressiveness of the dataset, which can lead to weak and biased analyses. Here anyNA() returns FALSE, so no column contains missing values.
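A per-column count can be sketched as follows. The data frame `toy` is a made-up stand-in for `house` (its values and missing entries are invented for illustration), so only the technique carries over, not the numbers.

```r
# Toy data frame standing in for `house` (values are made up for illustration)
toy <- data.frame(
  Area   = c(164, 84, NA, 75),
  Garage = c(2, 2, 2, NA),
  Prices = c(43800, 37550, 49500, 50075)
)

# Number of missing entries in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of missing entries in each column
na_share <- colMeans(is.na(toy))
print(na_share)
```

On the real data, `colSums(is.na(house))` would be expected to return 0 for every column, since `anyNA(house)` is FALSE.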

Look across variables

library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The correlation plot shows that several variables are positively correlated with Prices. Of all the predictors, Floors has the highest positive correlation with Prices, at about 0.6.
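The ranking read off the correlation plot can also be computed numerically. The sketch below uses simulated data: the data frame `toy` and its price rule are assumptions, chosen only so that Floors matters most, and are not taken from the Kaggle file.

```r
set.seed(1)
n <- 200

# Simulated stand-in for `house`; all values are made up
toy <- data.frame(
  Area   = sample(50:250, n, replace = TRUE),
  Floors = rbinom(n, 1, 0.5),
  Garden = rbinom(n, 1, 0.5)
)

# Hypothetical price rule, chosen only so that Floors matters most
toy$Prices <- 30000 + 15000 * toy$Floors + 25 * toy$Area + rnorm(n, sd = 5000)

# Correlation of every predictor with Prices, strongest first
cors <- cor(toy)[, "Prices"]
cors <- cors[names(cors) != "Prices"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

Applied to the real data, the same three lines with `house` in place of `toy` would list the predictors in order of their correlation with Prices, provided the columns are still numeric.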

Regression Model

At this point, we build a regression model between Prices and Floors, because Floors has the highest correlation with Prices. Several variables are categorical but are still stored as numeric, so we first convert them to factors.

names(house)[2:15]

##  [1] "Garage"        "FirePlace"     "Baths"         "White.Marble" 
##  [5] "Black.Marble"  "Indian.Marble" "Floors"        "City"         
##  [9] "Solar"         "Electric"      "Fiber"         "Glass.Doors"  
## [13] "Swiming.Pool"  "Garden"

house[names(house)[2:15]] = lapply(house[names(house)[2:15]], as.factor)

str(house)

## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : Factor w/ 3 levels "1","2","3": 2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : Factor w/ 5 levels "0","1","2","3",..: 1 1 5 5 5 4 1 2 1 3 ...
##  $ Baths        : Factor w/ 5 levels "1","2","3","4",..: 2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 2 1 1 ...
##  $ Black.Marble : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 1 ...
##  $ Indian.Marble: Factor w/ 2 levels "0","1": 1 2 1 2 1 1 2 1 2 2 ...
##  $ Floors       : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 2 2 1 ...
##  $ City         : Factor w/ 3 levels "1","2","3": 3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 2 ...
##  $ Electric     : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 2 1 1 ...
##  $ Fiber        : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 1 1 1 ...
##  $ Glass.Doors  : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 1 1 ...
##  $ Swiming.Pool : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 2 2 2 ...
##  $ Garden       : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 2 1 1 1 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
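To see why the conversion to factor matters for regression, the following sketch compares the design matrix R builds for a numeric column with the one it builds for a factor. The vector `city_num` is a short made-up example, not the real City column.

```r
# Short made-up vector of city codes
city_num <- c(3, 2, 2, 1, 2, 1)
city_fac <- as.factor(city_num)

# As a number: one slope column, so city 3 is treated as "three times" city 1
X_num <- model.matrix(~ city_num)

# As a factor: one indicator column per non-reference level
X_fac <- model.matrix(~ city_fac)

print(colnames(X_num))
print(colnames(X_fac))
```

With the factor version, each city gets its own coefficient relative to the reference level instead of being forced onto a single straight line.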

Model Processing

mod1 <- lm(Prices ~ Floors, data = house)
mod1

## 
## Call:
## lm(formula = Prices ~ Floors, data = house)
## 
## Coefficients:
## (Intercept)      Floors1  
##       34558        15003

summary(mod1)

## 
## Call:
## lm(formula = Prices ~ Floors, data = house)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27211  -6986    -11   6592  28542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34557.66      19.00    1819   <2e-16 ***
## Floors1     15003.39      26.89     558   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9507 on 499998 degrees of freedom
## Multiple R-squared:  0.3837, Adjusted R-squared:  0.3837 
## F-statistic: 3.113e+05 on 1 and 499998 DF,  p-value: < 2.2e-16
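With a single two-level factor as predictor, the coefficients have a simple reading: the intercept is the mean price of the reference group, and the factor coefficient is the difference between the two group means. The sketch below checks this on simulated data. The vectors `floors` and `prices` are made up, with numbers only loosely modelled on the output above.

```r
set.seed(42)
n <- 1000

# Simulated two-level factor and hypothetical prices
floors <- factor(rbinom(n, 1, 0.5))
prices <- 34500 + 15000 * (floors == "1") + rnorm(n, sd = 9500)

fit <- lm(prices ~ floors)
group_means <- tapply(prices, floors, mean)

# Intercept = mean of the reference group ("0")
# floors1   = mean of group "1" minus mean of group "0"
print(coef(fit))
print(group_means)

# For one binary predictor, R-squared equals the squared correlation
r2 <- summary(fit)$r.squared
r2_from_cor <- cor(prices, as.numeric(floors == "1"))^2
print(c(r2 = r2, r2_from_cor = r2_from_cor))
```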

Judging by the multiple R-squared of 0.3837, this model is still weak: a single binary predictor (Floors) explains only about 38% of the variance in Prices. Next, we fit a model with all the variables and let backward stepwise regression select among them automatically.

mod2 <- lm(Prices ~., data = house)
mod2_back <- step(mod2, direction = "backward",trace = 0)

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

Both the full model and the backward stepwise model, which removes variables to reach the smallest AIC value, fit the data essentially perfectly, as the warnings show. An adjusted R-squared of 1 is an indication that the model may be overfitting, although it can also mean that Prices is an exact linear function of the predictors.
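One way to tell overfitting apart from an exact relationship is a holdout check: fit on part of the data and compare the error on the rest. The sketch below is a minimal illustration on simulated data. The data frame `toy` and its noise-free price rule are assumptions made for the example, not properties established for the Kaggle file.

```r
set.seed(123)
n <- 2000

# Simulated stand-in for `house`; all values are made up
toy <- data.frame(
  Area   = sample(50:250, n, replace = TRUE),
  Floors = factor(rbinom(n, 1, 0.5)),
  City   = factor(sample(1:3, n, replace = TRUE))
)

# Hypothetical deterministic price rule with no noise term
toy$Prices <- 1000 + 25 * toy$Area +
  15000 * (toy$Floors == "1") + 3500 * as.integer(toy$City)

# 80/20 train/test split
idx   <- sample(n, size = 0.8 * n)
train <- toy[idx, ]
test  <- toy[-idx, ]

fit  <- lm(Prices ~ ., data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on seen and unseen rows
rmse_train <- sqrt(mean(residuals(fit)^2))
rmse_test  <- sqrt(mean((test$Prices - pred)^2))
print(c(train = rmse_train, test = rmse_test))
```

An overfitted model would show a test error clearly larger than the training error. If both stay near zero, the perfect fit reflects the way the target was constructed.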

Evaluation of Model Assumptions

As a statistical model, linear regression has several assumptions that need to be met so that its interpretation is not biased. These assumptions matter mainly when the goal is interpretation, that is, understanding the effect of each predictor on the target variable. If the model is used only for prediction, the assumptions do not strictly need to be met.

Normality of Residual

Ideally, the errors of a linear regression model are normally distributed, which means that most errors are gathered around 0 and become rarer the further they are from 0.

hist(mod2_back$residuals, breaks = 20)

Looking at the histogram, the residuals of the backward stepwise model do not appear to be normally distributed: they are too tightly clustered at 0.
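A normal Q-Q plot is a common complement to the histogram. The sketch below contrasts roughly normal values with values clustered very tightly at 0. Both vectors are simulated stand-ins for residuals, and the helper `kurt()` is a hypothetical name introduced here for a plain kurtosis calculation.

```r
set.seed(7)

# Roughly normal stand-in residuals
res_normal <- rnorm(1000)

# Stand-in residuals clustered very tightly at 0, with a few larger values
res_peaked <- c(rnorm(950, sd = 0.01), rnorm(50, sd = 1))

# Points close to the reference line suggest normality
op <- par(mfrow = c(1, 2))
qqnorm(res_normal, main = "Roughly normal")
qqline(res_normal)
qqnorm(res_peaked, main = "Clustered at 0")
qqline(res_peaked)
par(op)

# Hypothetical helper: kurtosis is about 3 for a normal distribution
kurt <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2
print(c(normal = kurt(res_normal), peaked = kurt(res_peaked)))
```

For the model in this article, the same plot could be drawn with `qqnorm(mod2_back$residuals)`.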

Shapiro Test

tryCatch({
shapiro.test(mod2_back$residuals)
}, 
warning = 
  function(w) {message("sample size must be between 3 and 5000")},
error = 
  function(e) {message("sample size must be between 3 and 5000")})

## sample size must be between 3 and 5000

head(mod2_back$residuals)

##             1             2             3             4             5 
## -8.344602e-05  5.049946e-07 -3.748786e-07  1.851977e-09  1.199718e-08 
##             6 
## -8.005082e-08

The Shapiro-Wilk test cannot be run on this model: shapiro.test() requires a sample size between 3 and 5000, and the model has 500,000 residuals. The restriction concerns the number of observations, not their values, so the small or negative residuals shown above are not the cause.
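A common workaround is to run the test on a random subsample of at most 5000 residuals. The sketch below shows the idea on simulated values. The vector `res` is a made-up stand-in for `mod2_back$residuals`, so the resulting p-value says nothing about the real model.

```r
set.seed(2024)

# Stand-in for mod2_back$residuals: 500,000 simulated values
res <- rnorm(500000)

# The full vector is rejected because of its length
full_attempt <- try(shapiro.test(res), silent = TRUE)
print(inherits(full_attempt, "try-error"))

# A random subsample of 5000 values is within the accepted range
res_sample <- sample(res, size = 5000)
sw <- shapiro.test(res_sample)
print(sw)
```

With samples this large, even small departures from normality tend to produce a small p-value, so the result is best read together with a histogram or Q-Q plot.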

Heteroscedasticity

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(mod2_back)

## 
##  studentized Breusch-Pagan test
## 
## data:  mod2_back
## BP = 19.34, df = 20, p-value = 0.4998

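Based on the Breusch-Pagan test, the p-value of 0.4998 is greater than 0.05, so we fail to reject the null hypothesis of constant error variance. The residuals of the backward stepwise model therefore show no evidence of heteroscedasticity, meaning no pattern remains in the error variance that the model failed to capture.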
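To make the test less of a black box, the statistic can be computed by hand in base R. The sketch below follows a common textbook formulation (Koenker's studentized variant): regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. That this matches the default of bptest() is stated here as an assumption, and the data are simulated to be heteroscedastic on purpose.

```r
set.seed(99)
n <- 500
x <- runif(n, 50, 250)

# Hypothetical data whose noise grows with x (heteroscedastic by construction)
y <- 1000 + 25 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictors
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic: n times R-squared, compared with a chi-squared distribution
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_pval <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p.value = bp_pval))
```

Here the small p-value rejects constant variance, the opposite of the result for the backward stepwise model above.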

Conclusion

Based on the assumption tests above and several other considerations, the HousePricing data is not well suited to multiple linear regression, because many of the variables are categorical. However, a simple linear regression with a single predictor is still relevant and can be used in the future.