HOUSEPRICE LINEAR REGRESSION
Preparing Data & Installing Packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(tufte)
House <- read_csv("HousePrices.csv")
## Rows: 500000 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): Area, Garage, FirePlace, Baths, White Marble, Black Marble, Indian...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
House
## # A tibble: 500,000 × 16
## Area Garage FireP…¹ Baths White…² Black…³ India…⁴ Floors City Solar Elect…⁵
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 164 2 0 2 0 1 0 0 3 1 1
## 2 84 2 0 4 0 0 1 1 2 0 0
## 3 190 2 4 4 1 0 0 0 2 0 0
## 4 75 2 4 4 0 0 1 1 1 1 1
## 5 148 1 4 2 1 0 0 1 2 1 0
## 6 124 3 3 3 0 1 0 1 1 0 0
## 7 58 1 0 2 0 0 1 0 3 0 1
## 8 249 2 1 1 1 0 0 1 1 0 1
## 9 243 1 0 2 0 0 1 1 1 0 0
## 10 242 1 2 4 0 0 1 0 2 1 0
## # … with 499,990 more rows, 5 more variables: Fiber <dbl>, `Glass Doors` <dbl>,
## # `Swiming Pool` <dbl>, Garden <dbl>, Prices <dbl>, and abbreviated variable
## # names ¹FirePlace, ²`White Marble`, ³`Black Marble`, ⁴`Indian Marble`,
## # ⁵Electric
glimpse(House)
## Rows: 500,000
## Columns: 16
## $ Area <dbl> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189…
## $ Garage <dbl> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, …
## $ FirePlace <dbl> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, …
## $ Baths <dbl> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, …
## $ `White Marble` <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ `Black Marble` <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, …
## $ `Indian Marble` <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, …
## $ Floors <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, …
## $ City <dbl> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, …
## $ Solar <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, …
## $ Electric <dbl> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
## $ Fiber <dbl> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, …
## $ `Glass Doors` <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, …
## $ `Swiming Pool` <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Garden <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
## $ Prices <dbl> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425…
House <- House %>%
  select(-Area)
anyNA(House)
## [1] FALSE
colSums(is.na(House))
## Garage FirePlace Baths White Marble Black Marble
## 0 0 0 0 0
## Indian Marble Floors City Solar Electric
## 0 0 0 0 0
## Fiber Glass Doors Swiming Pool Garden Prices
## 0 0 0 0 0
Exploratory Data Analysis
cor(House$Prices, House$Floors)
## [1] 0.6194508
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(House, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)
Modeling & Cross Validation
Before building a model, we split the data into two sets: “data_train” and “data_test”. “data_train” is used to train the linear regression model, while “data_test” serves as a hold-out set to check whether the model overfits and cannot predict new data it has not seen during training. 80% of the data is used as “data_train” and the rest as “data_test”.
set.seed(2)
samplesize <- round(0.8 * nrow(House), 0)
index <- sample(seq_len(nrow(House)), size = samplesize)
data_train <- House[index, ]
data_test <- House[-index, ]
nrow(House)
## [1] 500000
nrow(data_train)
## [1] 400000
nrow(data_test)
## [1] 100000
# Simple Linear Model Creation
model1 <- lm(Prices ~ Floors, data = data_train)
summary(model1)
##
## Call:
## lm(formula = Prices ~ Floors, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27027.8 -6992.4 -2.8 6597.2 28532.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34567.39 21.25 1626.4 <0.0000000000000002 ***
## Floors 14985.43 30.07 498.4 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9509 on 399998 degrees of freedom
## Multiple R-squared: 0.3831, Adjusted R-squared: 0.3831
## F-statistic: 2.484e+05 on 1 and 399998 DF, p-value: < 0.00000000000000022
sum(House$Floors)
## [1] 249693
The summary gives the intercept and a coefficient for each predictor, which together form the regression model formula:
\[\hat{y}=\beta_0+\beta_1 x_1+...+\beta_n x_n\]
where \(\beta_0\) is the intercept, \(\beta_1, ..., \beta_n\) are the predictor coefficients, and \(x_1,...,x_n\) are the predictor variables.
This gives the fitted model:
\[Prices = 34567.39 + 14985.43*{Floors}\]
Intercept: the starting point of the regression line, i.e. the predicted value of the target when the predictor equals 0.
When Floors = 0, the predicted Prices = 34567.39.
Coefficient/slope: the change in the target variable for every 1-unit increase in the predictor.
Positive coefficient = positive correlation, increases the value of the target variable
Negative coefficient = negative correlation, decreases the value of the target variable
Floors has a positive coefficient, so each additional floor increases the predicted price by about 14985.43.
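As an illustration (a minimal sketch; the single-row data frame below is hypothetical, not part of the original analysis), the fitted simple model can be used to predict the price of a house with a given number of floors:
# hypothetical new observation: a house with 2 floors
new_house <- data.frame(Floors = 2)
predict(model1, newdata = new_house)
# equivalent to 34567.39 + 14985.43 * 2 = 64538.25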
- Predictor significance: determines whether each predictor has a significant effect on the target variable.
- A predictor is said to be significant when the p-value < 0.05 (alpha)
- It can also be seen from the number of stars for each predictor
Significant variable: Floors
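If we want the p-values programmatically rather than reading them off the summary table, one way to extract them (a minimal sketch) is:
# p-values for the intercept and each predictor of model1
summary(model1)$coefficients[, "Pr(>|t|)"]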
- R-squared: measure of model goodness. How well the model can explain the target.
- range of values 0-1, the closer to 1 the better
The predictor used in model1 (Floors) explains 38.31% of the variance of the target variable; the rest is explained by variables outside the model.
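As a sanity check (a minimal sketch), R-squared can also be computed directly as the share of variance explained by the model:
# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(model1$residuals^2)
ss_tot <- sum((data_train$Prices - mean(data_train$Prices))^2)
1 - ss_res / ss_tot # should match summary(model1)$r.squared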
plot(House$Floors, House$Prices)
abline(model1, col = "red")
Multiple Linear Model
# Multiple Linear Model Creation
model_all <- lm(Prices ~ ., data = data_train)
summary(model_all)
##
## Call:
## lm(formula = Prices ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3119.20 -1555.63 0.47 1551.72 3124.08
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4133.180 15.157 272.695 <0.0000000000000002 ***
## Garage 1500.162 3.475 431.712 <0.0000000000000002 ***
## FirePlace 750.185 2.007 373.842 <0.0000000000000002 ***
## Baths 1249.272 2.006 622.659 <0.0000000000000002 ***
## `White Marble` 14006.114 6.950 2015.387 <0.0000000000000002 ***
## `Black Marble` 4997.962 6.949 719.280 <0.0000000000000002 ***
## `Indian Marble` NA NA NA NA
## Floors 15000.705 5.676 2642.956 <0.0000000000000002 ***
## City 3493.313 3.478 1004.344 <0.0000000000000002 ***
## Solar 250.738 5.676 44.177 <0.0000000000000002 ***
## Electric 1252.410 5.676 220.661 <0.0000000000000002 ***
## Fiber 11748.018 5.676 2069.837 <0.0000000000000002 ***
## `Glass Doors` 4444.435 5.676 783.056 <0.0000000000000002 ***
## `Swiming Pool` 4.113 5.676 0.725 0.469
## Garden 3.664 5.676 0.646 0.519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1795 on 399986 degrees of freedom
## Multiple R-squared: 0.978, Adjusted R-squared: 0.978
## F-statistic: 1.369e+06 on 13 and 399986 DF, p-value: < 0.00000000000000022
Note that `Indian Marble` has no coefficient (NA): the three marble indicators appear to sum to 1 for every house, so one of them is perfectly collinear with the others and the intercept, and lm() drops it. The regression formula for the model above:
\[ Prices = 4133.180 \\ + 1500.162 * Garage\\ + 750.185 * FirePlace\\ + 1249.272 * Baths\\ + 14006.114 * `White Marble`\\ + 4997.962 * `Black Marble`\\ + 15000.705 * Floors\\ + 3493.313 * City\\ + 250.738 * Solar\\ + 1252.410 * Electric\\ + 11748.018 * Fiber\\ + 4444.435 * `Glass Doors`\\ + 4.113 * `Swiming Pool`\\ + 3.664 * Garden \]
Evaluation of the model using the adjusted R-squared value: the closer the R-squared is to 1, the better the model.
summary(model_all)$adj.r.squared
## [1] 0.9780197
Model Tuning
Model tuning using stepwise regression with direction = "backward":
model_backward <- step(object = model_all, direction = "backward", trace = F)
summary(model_backward)
##
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + `White Marble` +
## `Black Marble` + Floors + City + Solar + Electric + Fiber +
## `Glass Doors`, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3115.43 -1555.24 0.54 1551.86 3120.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4137.026 14.627 282.83 <0.0000000000000002 ***
## Garage 1500.164 3.475 431.71 <0.0000000000000002 ***
## FirePlace 750.188 2.007 373.84 <0.0000000000000002 ***
## Baths 1249.278 2.006 622.67 <0.0000000000000002 ***
## `White Marble` 14006.107 6.950 2015.40 <0.0000000000000002 ***
## `Black Marble` 4997.966 6.949 719.28 <0.0000000000000002 ***
## Floors 15000.699 5.676 2642.96 <0.0000000000000002 ***
## City 3493.319 3.478 1004.35 <0.0000000000000002 ***
## Solar 250.724 5.676 44.17 <0.0000000000000002 ***
## Electric 1252.414 5.676 220.66 <0.0000000000000002 ***
## Fiber 11748.036 5.676 2069.86 <0.0000000000000002 ***
## `Glass Doors` 4444.450 5.676 783.06 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1795 on 399988 degrees of freedom
## Multiple R-squared: 0.978, Adjusted R-squared: 0.978
## F-statistic: 1.618e+06 on 11 and 399988 DF, p-value: < 0.00000000000000022
summary(model_backward)$adj.r.squared
## [1] 0.9780198
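By default step() keeps removing predictors as long as doing so lowers the AIC, so a quick way to compare the full and reduced models (a minimal sketch) is:
# lower AIC indicates a better trade-off between fit and model size
AIC(model_all)      # full model
AIC(model_backward) # model chosen by backward elimination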
After the model is created, we can use it to make predictions on “data_test”.
Model from stepwise result: model_backward
# Prediction
pred_model_all <- predict(model_all, data_test)
pred_model_backward <- predict(model_backward, data_test)
# Root mean squared error (RMSE): the square root of the mean squared prediction error, used to measure how far predictions are from the actual values
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
# Evaluation
RMSE(y_pred = pred_model_all, y_true = data_test$Prices)
## [1] 1795.243
RMSE(y_pred = pred_model_backward, y_true = data_test$Prices)
## [1] 1795.244
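Since RMSE is just the square root of the mean squared prediction error, the MLmetrics value can be reproduced by hand (a minimal sketch):
# manual RMSE for model_backward on the test set
sqrt(mean((pred_model_backward - data_test$Prices)^2))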
The comparison between model_all and model_backward is close, but we choose model_backward because it uses fewer variables so the computation process is faster.
In the prediction interval output, fit is the point prediction, lwr is the lower (minimum) limit, and upr is the upper (maximum) limit.
pred_model_backward1 <- predict(model_backward, newdata = data_test,
                                interval = "prediction", # add a prediction interval for each prediction
                                level = 0.95) # default confidence level
length(pred_model_backward)
## [1] 100000
nrow(pred_model_backward1)
## [1] 100000
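To inspect the interval columns for the first few test rows (a minimal sketch):
# each row: fit (point prediction), lwr (lower bound), upr (upper bound)
head(pred_model_backward1)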
Retrieving the Model Formula
summary(model_backward)$call
## lm(formula = Prices ~ Garage + FirePlace + Baths + `White Marble` +
## `Black Marble` + Floors + City + Solar + Electric + Fiber +
## `Glass Doors`, data = data_train)
Model Evaluation
1. Linearity
Linearity hypothesis test:
# For example, we check for only 1 predictor
cor.test(House$Prices, House$Floors)
##
## Pearson's product-moment correlation
##
## data: House$Prices and House$Floors
## t = 557.96, df = 499998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6177397 0.6211561
## sample estimates:
## cor
## 0.6194508
# library(car)
# crPlots(model_backward)
Decision rule: if the p-value >= alpha, fail to reject H0; if the p-value < alpha, reject H0.
- H0: the correlation is not significant
- H1: the correlation is significant
We want H1 here, and the test above gives a p-value < 0.05, so H0 is rejected.
Test conclusion: Prices and Floors are significantly correlated; the linearity assumption is met.
2. Normality of Residuals
hist(model_backward$residuals, main = "Plot Model Backward", xlab = "Residual Value", ylim = c(0,40000))
Statistical test with shapiro.test():
# shapiro.test cannot be used here because it only accepts between 3 and 5000 observations
# shapiro.test(model_backward$residuals)
Kolmogorov-Smirnov test using the ks.test() function:
ks.test(model_backward$residuals, "pnorm")
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: model_backward$residuals
## D = 0.49934, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
Based on the Kolmogorov-Smirnov test, the p-value (< 0.00000000000000022) is smaller than alpha (0.05), so H0 is rejected: as tested, the residuals do not follow a standard normal distribution.
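Note that ks.test(x, "pnorm") compares the residuals against a standard normal with mean 0 and sd 1; because the residuals here have a much larger spread, a fairer check is to standardize them first (a minimal sketch, not part of the original analysis):
# standardize the residuals before comparing them to a standard normal
res_std <- as.numeric(scale(model_backward$residuals))
ks.test(res_std, "pnorm")
# equivalently: ks.test(model_backward$residuals, "pnorm",
#                       mean = mean(model_backward$residuals),
#                       sd = sd(model_backward$residuals))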
3. Homoscedasticity of Residuals
- Scatterplot visualization of the model’s fitted values against its residuals:
# fitted values = the model's predictions on the training data
plot(x = model_backward$fitted.values,
     y = model_backward$residuals,
     main = "Homoscedasticity of Residuals",
     ylab = "Residual Backward",
     xlab = "Predicted Value")
abline(h = 0, col = "red")
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(model_backward)
##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 7.6726, df = 11, p-value = 0.7423
Based on the scatterplot above, the residuals are spread evenly around 0, and the Breusch-Pagan test agrees: p-value = 0.7423 > 0.05, so we fail to reject H0 of constant variance. The homoscedasticity assumption is met.
4. No Multicollinearity
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model_backward)
## Garage FirePlace Baths `White Marble` `Black Marble`
## 1.000041 1.000027 1.000039 1.331639 1.331641
## Floors City Solar Electric Fiber
## 1.000022 1.000019 1.000019 1.000015 1.000032
## `Glass Doors`
## 1.000021
- VIF value > 10 : there is multicollinearity
- VIF value < 10 : no multicollinearity
Test conclusion: all VIF values are close to 1, so there is no multicollinearity; the assumption is met.
CONCLUSION:
Based on the model testing, model_backward is the best model for the HousePrices data, with an adjusted R-squared of 0.9780198, meaning that around 97.80% of the variance in Prices is explained by the model. When predicting on data_test, model_backward has an RMSE of 1795.244.
After running all of the assumption tests we get:
- Linearity: assumption met
- Normality of residuals: the KS test rejects normality, so this assumption is not clearly met
- Homoscedasticity of residuals: assumption met
- No multicollinearity: assumption met