HOUSEPRICE LINEAR REGRESSION

Preparing Data & Installing Packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(tufte)
House <-  read_csv("HousePrices.csv")
## Rows: 500000 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): Area, Garage, FirePlace, Baths, White Marble, Black Marble, Indian...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
House
## # A tibble: 500,000 × 16
##     Area Garage FireP…¹ Baths White…² Black…³ India…⁴ Floors  City Solar Elect…⁵
##    <dbl>  <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1   164      2       0     2       0       1       0      0     3     1       1
##  2    84      2       0     4       0       0       1      1     2     0       0
##  3   190      2       4     4       1       0       0      0     2     0       0
##  4    75      2       4     4       0       0       1      1     1     1       1
##  5   148      1       4     2       1       0       0      1     2     1       0
##  6   124      3       3     3       0       1       0      1     1     0       0
##  7    58      1       0     2       0       0       1      0     3     0       1
##  8   249      2       1     1       1       0       0      1     1     0       1
##  9   243      1       0     2       0       0       1      1     1     0       0
## 10   242      1       2     4       0       0       1      0     2     1       0
## # … with 499,990 more rows, 5 more variables: Fiber <dbl>, `Glass Doors` <dbl>,
## #   `Swiming Pool` <dbl>, Garden <dbl>, Prices <dbl>, and abbreviated variable
## #   names ¹​FirePlace, ²​`White Marble`, ³​`Black Marble`, ⁴​`Indian Marble`,
## #   ⁵​Electric
glimpse(House)
## Rows: 500,000
## Columns: 16
## $ Area            <dbl> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189…
## $ Garage          <dbl> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, …
## $ FirePlace       <dbl> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, …
## $ Baths           <dbl> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, …
## $ `White Marble`  <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ `Black Marble`  <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, …
## $ `Indian Marble` <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, …
## $ Floors          <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, …
## $ City            <dbl> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, …
## $ Solar           <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, …
## $ Electric        <dbl> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
## $ Fiber           <dbl> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, …
## $ `Glass Doors`   <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, …
## $ `Swiming Pool`  <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Garden          <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
## $ Prices          <dbl> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425…
House <- House %>%
  select(-Area)
anyNA(House)
## [1] FALSE
colSums(is.na(House))
##        Garage     FirePlace         Baths  White Marble  Black Marble 
##             0             0             0             0             0 
## Indian Marble        Floors          City         Solar      Electric 
##             0             0             0             0             0 
##         Fiber   Glass Doors  Swiming Pool        Garden        Prices 
##             0             0             0             0             0

Exploratory Data Analysis

cor(House$Prices, House$Floors)
## [1] 0.6194508
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(House, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)

Modeling & Cross Validation

Before we create a model, we need to split the data into two sets: “data_train” and “data_test”. “data_train” is used to train the linear regression model, while “data_test” serves as a comparison, to check whether the model overfits and fails to predict new data that it has not seen during the training phase. 80% of the data is used as “data_train” and the rest as “data_test”.

set.seed(2)
samplesize <- round(0.8 * nrow(House), 0)
index <- sample(seq_len(nrow(House)), size = samplesize)

data_train <- House[index, ]
data_test <- House[-index, ]

nrow(House)
## [1] 500000
nrow(data_train)
## [1] 400000
nrow(data_test)
## [1] 100000
# Simple linear Model Creation
model1 <-  lm(Prices~ Floors, data_train)
summary(model1)
## 
## Call:
## lm(formula = Prices ~ Floors, data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27027.8  -6992.4     -2.8   6597.2  28532.6 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 34567.39      21.25  1626.4 <0.0000000000000002 ***
## Floors      14985.43      30.07   498.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9509 on 399998 degrees of freedom
## Multiple R-squared:  0.3831, Adjusted R-squared:  0.3831 
## F-statistic: 2.484e+05 on 1 and 399998 DF,  p-value: < 0.00000000000000022

The summary gives the intercept and the coefficient for each predictor, which form the regression model formula:

\[\hat{y}=\beta_0+\beta_1 x_1+...+\beta_n x_n\]

where \(\beta_0\) is the intercept, \(\beta_1, ..., \beta_n\) are the predictor coefficients, and \(x_1,...,x_n\) are the predictor variables used.

The fitted model formula is therefore:

\[Prices = 34567.39 + 14985.43 * Floors\]
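To connect this formula to the fitted object, the coefficients can be pulled out of model1 and a prediction computed by hand. The sketch below is not part of the original run; it assumes model1 from the chunk above and a hypothetical house with Floors = 1.

# Extract the estimated coefficients and apply the formula manually
coefs <- coef(model1)                        # (Intercept), Floors
coefs["(Intercept)"] + coefs["Floors"] * 1   # 34567.39 + 14985.43 = 49552.82

# The same prediction via predict()
predict(model1, newdata = data.frame(Floors = 1))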

  1. Intercept: the starting point of the regression line, indicating the predicted value of the target when the predictor value = 0

    When Floors = 0, the predicted Prices = 34567.39

  2. Coefficient/Slope: increase in the target variable every 1 unit

  • Positive coefficient = positive correlation, increasing the value of the target variable

  • Negative coefficient = negative correlation, lowering the value of the target variable

    Floors has a positive coefficient: a one-unit increase in Floors raises the predicted Prices by 14985.43

  3. Predictor significance: determines whether each predictor has a significant effect on the target variable.
  • A predictor is said to be significant when the p-value < 0.05 (alpha)
  • It can also be seen from the number of stars for each predictor

Significant variable: Floors

  4. R-squared: a measure of goodness of fit, i.e. how well the model can explain the target.
  • ranges from 0 to 1; the closer to 1, the better

The predictor used in this model (Floors) explains 38.31% of the variance of the target variable; the rest is explained by variables outside the model.
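The same figure can be pulled directly from the summary object; a minimal sketch (output not shown), assuming model1 from above:

# R-squared of the simple model, as reported in the summary above (~ 0.3831)
summary(model1)$r.squared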

# Floors on the x-axis and Prices on the y-axis, so abline(model1) overlays the fitted line correctly
plot(House$Floors, House$Prices)
abline(model1, col = "red")

Multiple Linear Model

# Multiple linear Model Creation
model_all <-  lm(Prices~ ., data_train)
summary(model_all)
## 
## Call:
## lm(formula = Prices ~ ., data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3119.20 -1555.63     0.47  1551.72  3124.08 
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error  t value            Pr(>|t|)    
## (Intercept)      4133.180     15.157  272.695 <0.0000000000000002 ***
## Garage           1500.162      3.475  431.712 <0.0000000000000002 ***
## FirePlace         750.185      2.007  373.842 <0.0000000000000002 ***
## Baths            1249.272      2.006  622.659 <0.0000000000000002 ***
## `White Marble`  14006.114      6.950 2015.387 <0.0000000000000002 ***
## `Black Marble`   4997.962      6.949  719.280 <0.0000000000000002 ***
## `Indian Marble`        NA         NA       NA                  NA    
## Floors          15000.705      5.676 2642.956 <0.0000000000000002 ***
## City             3493.313      3.478 1004.344 <0.0000000000000002 ***
## Solar             250.738      5.676   44.177 <0.0000000000000002 ***
## Electric         1252.410      5.676  220.661 <0.0000000000000002 ***
## Fiber           11748.018      5.676 2069.837 <0.0000000000000002 ***
## `Glass Doors`    4444.435      5.676  783.056 <0.0000000000000002 ***
## `Swiming Pool`      4.113      5.676    0.725               0.469    
## Garden              3.664      5.676    0.646               0.519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1795 on 399986 degrees of freedom
## Multiple R-squared:  0.978,  Adjusted R-squared:  0.978 
## F-statistic: 1.369e+06 on 13 and 399986 DF,  p-value: < 0.00000000000000022
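The line “Coefficients: (1 not defined because of singularities)” indicates that `Indian Marble` is perfectly collinear with the other predictors: in the rows shown earlier, the three marble indicators are mutually exclusive and appear to sum to 1, so one of them is redundant and lm() drops it. A quick check (a sketch, assuming the full data behaves like the preview):

# If marble_sum is always 1, the three marble dummies are linearly dependent,
# which explains why lm() drops `Indian Marble`
House %>%
  mutate(marble_sum = `White Marble` + `Black Marble` + `Indian Marble`) %>%
  count(marble_sum)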

Regression formula for the above model:

\[ Prices = 4133.180 \\ + 1500.162 * Garage\\ + 750.185 * FirePlace\\ + 1249.272 * Baths\\ + 14006.114 * `White Marble`\\ + 4997.962 * `Black Marble`\\ + 15000.705 * Floors\\ + 3493.313 * City\\ + 250.738 * Solar\\ + 1252.410 * Electric\\ + 11748.018 * Fiber\\ + 4444.435 * `Glass Doors`\\ + 4.113 * `Swiming Pool`\\ + 3.664 * Garden \]

Model evaluation with the R-squared value: “The model is better if the R-squared value is closer to 1.”

summary(model_all)$adj.r.squared
## [1] 0.9780197

Model Tuning

Model tuning using the stepwise regression function with direction = “backward”:

model_backward <- step(object = model_all, direction = "backward", trace = F)
summary(model_backward)
## 
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + `White Marble` + 
##     `Black Marble` + Floors + City + Solar + Electric + Fiber + 
##     `Glass Doors`, data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3115.43 -1555.24     0.54  1551.86  3120.15 
## 
## Coefficients:
##                 Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)     4137.026     14.627  282.83 <0.0000000000000002 ***
## Garage          1500.164      3.475  431.71 <0.0000000000000002 ***
## FirePlace        750.188      2.007  373.84 <0.0000000000000002 ***
## Baths           1249.278      2.006  622.67 <0.0000000000000002 ***
## `White Marble` 14006.107      6.950 2015.40 <0.0000000000000002 ***
## `Black Marble`  4997.966      6.949  719.28 <0.0000000000000002 ***
## Floors         15000.699      5.676 2642.96 <0.0000000000000002 ***
## City            3493.319      3.478 1004.35 <0.0000000000000002 ***
## Solar            250.724      5.676   44.17 <0.0000000000000002 ***
## Electric        1252.414      5.676  220.66 <0.0000000000000002 ***
## Fiber          11748.036      5.676 2069.86 <0.0000000000000002 ***
## `Glass Doors`   4444.450      5.676  783.06 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1795 on 399988 degrees of freedom
## Multiple R-squared:  0.978,  Adjusted R-squared:  0.978 
## F-statistic: 1.618e+06 on 11 and 399988 DF,  p-value: < 0.00000000000000022
summary(model_backward)$adj.r.squared
## [1] 0.9780198
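step() selects predictors by minimizing AIC; as a sanity check, the AIC of the full and backward-selected models can be compared directly (a sketch, output not shown):

# Compare the information criteria of the two candidate models
AIC(model_all, model_backward)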

After the model is created, we can use it to make predictions on “data_test”.

Model from stepwise result: model_backward

# Prediction
pred_model_all <- predict(model_all, data_test)
pred_model_backward <- predict(model_backward, data_test) 


# Root mean squared error (RMSE): the square root of the mean of the squared errors, used to measure how far the predictions are from the actual values
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
# Evaluation
RMSE(y_pred = pred_model_all, y_true = data_test$Prices)
## [1] 1795.243
RMSE(y_pred = pred_model_backward, y_true = data_test$Prices)
## [1] 1795.244

The RMSE of model_all and model_backward is almost identical, but we choose model_backward because it uses fewer variables, so the computation is faster.
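To put an RMSE of about 1795 into context, it can be compared with the scale of the target; a sketch using MLmetrics, assuming the prediction objects above (output not shown):

# Error relative to the average price, plus mean absolute percentage error
RMSE(y_pred = pred_model_backward, y_true = data_test$Prices) / mean(data_test$Prices)
MAPE(y_pred = pred_model_backward, y_true = data_test$Prices)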

In the prediction-interval output, fit is the point prediction (middle value), lwr is the lower limit of the interval, and upr is the upper limit.

pred_model_backward1 <- predict(model_backward, newdata = data_test, 
                             interval = "prediction", # add a prediction interval to the output
                             level = 0.95) # default confidence level
length(pred_model_backward)
## [1] 100000
nrow(pred_model_backward1)
## [1] 100000
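To see the fit, lwr, and upr columns described above, the first rows of the interval matrix can be inspected (a sketch, output not shown):

# The prediction-interval output is a matrix with columns fit, lwr and upr
head(pred_model_backward1)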

Retrieving the Model Formula

summary(model_backward)$call
## lm(formula = Prices ~ Garage + FirePlace + Baths + `White Marble` + 
##     `Black Marble` + Floors + City + Solar + Electric + Fiber + 
##     `Glass Doors`, data = data_train)

Model Evaluation

1. Linearity

Linearity hypothesis test:

# For example, we check for only 1 predictor
cor.test(House$Prices, House$Floors)
## 
##  Pearson's product-moment correlation
## 
## data:  House$Prices and House$Floors
## t = 557.96, df = 499998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6177397 0.6211561
## sample estimates:
##       cor 
## 0.6194508

Test conclusion: the variables are significantly correlated, our assumption test is fulfilled

# library(car)
# crPlots(model_backward)

If the p-value >= alpha, we fail to reject H0; if the p-value < alpha, we reject H0.

  • H0: the correlation is not significant
  • H1: significant correlation

The desired outcome is H1: a significant correlation.

Conclusion: the variables are significantly correlated, so the linearity assumption is met.
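cor.test() above checks only Floors; since every column in House is numeric, the correlation of each variable with Prices can be inspected in a single call (a sketch, output not shown):

# Correlation of every variable (including all predictors) with Prices
cor(House)[, "Prices"]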

2. Normality of Residuals

hist(model_backward$residuals, main = "Plot Model Backward", xlab = "Residual Value", ylim = c(0,40000))
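A Q-Q plot is a common companion to the histogram for checking normality of residuals; a minimal sketch (plot not shown):

# Compare the residual quantiles with theoretical normal quantiles
qqnorm(model_backward$residuals, main = "Q-Q Plot of Residuals")
qqline(model_backward$residuals, col = "red")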

Statistical test with shapiro.test()

# shapiro.test() cannot be used here because it only accepts between 3 and 5000 observations
# shapiro.test(model_backward$residuals)

Kolmogorov-Smirnov Test using the “ks.test” function

ks.test(model_backward$residuals, "pnorm")
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  model_backward$residuals
## D = 0.49934, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

Based on the Kolmogorov-Smirnov test, the p-value < 0.00000000000000022 is smaller than alpha (0.05), so we reject H0: according to this test, the residuals are not normally distributed.
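Note that “pnorm” with no parameters means a standard normal N(0, 1), while the residuals have a standard deviation of roughly 1795, so the large D statistic is partly an artifact of the scale; a fairer comparison standardizes the residuals first (a sketch, output not shown):

# Standardize the residuals before comparing them with N(0, 1)
ks.test(as.numeric(scale(model_backward$residuals)), "pnorm")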

3. Homoscedasticity of Residuals

  1. Scatterplot visualization: the model’s fitted values against the model’s residuals:
# fitted values = the model's predictions on the data used to train it
plot(x = model_backward$fitted.values, 
     y = model_backward$residuals,
     main = "Homoscedasticity of Residuals",
     ylab = "Residual Backward",
     xlab = "Predict Value ")
abline(h = 0, col = "red")

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 7.6726, df = 11, p-value = 0.7423

Based on the scatterplot above, the residuals are spread evenly around zero with no clear pattern, i.e. they are homoscedastic, which fulfills the assumption. The Breusch-Pagan test supports this: its p-value of 0.7423 is greater than 0.05, so we fail to reject the null hypothesis of constant error variance.

4. No Multicollinearity

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(model_backward)
##         Garage      FirePlace          Baths `White Marble` `Black Marble` 
##       1.000041       1.000027       1.000039       1.331639       1.331641 
##         Floors           City          Solar       Electric          Fiber 
##       1.000022       1.000019       1.000019       1.000015       1.000032 
##  `Glass Doors` 
##       1.000021

Test conclusion: there is no multicollinearity, the assumption test is met

  • VIF value > 10 : there is multicollinearity
  • VIF value < 10 : no multicollinearity

CONCLUSION:

Based on the model testing, model_backward is the best model for the HousePrices data, with an adjusted R-squared of 0.9780198, meaning that around 97.80% of the variance in Prices can be explained by the model. On data_test, predictions from model_backward have an RMSE of 1795.244.

After performing all the assumption tests, we get:

  1. Linearity: assumption met
  2. Normality of Residuals: the Kolmogorov-Smirnov test rejects normality (p-value < 0.05), so this assumption is not formally met
  3. Homoscedasticity of Residuals: assumption met
  4. No Multicollinearity: assumption met