This report will try to predict house prices using linear regression, based on a house price dataset taken from Kaggle. We will explore which predictors have a significant impact on house prices.
library(dplyr)
library(tidyverse)
library(caret)
library(GGally)
library(rsample)

house <- read.csv("house_price.csv")
head(house)
#> Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
#> 1 164 2 0 2 0 1 0 0
#> 2 84 2 0 4 0 0 1 1
#> 3 190 2 4 4 1 0 0 0
#> 4 75 2 4 4 0 0 1 1
#> 5 148 1 4 2 1 0 0 1
#> 6 124 3 3 3 0 1 0 1
#> City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
#> 1 3 1 1 1 1 0 0 43800
#> 2 2 0 0 0 1 1 1 37550
#> 3 2 0 0 1 0 0 0 49500
#> 4 1 1 1 1 1 1 1 50075
#> 5 2 1 0 0 1 1 1 52400
#> 6 1 0 0 1 1 1 1 54300
glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
From the result above, some columns are not stored as the correct data type, so we need to convert them.
house <- house %>%
  mutate_if(is.numeric, as.factor) %>%
  # note: as.numeric() on a factor returns the 1-based level index rather than
  # the original value, which is why FirePlace and Prices appear re-indexed in
  # the glimpse() output below
  mutate_at(c("Area", "Garage", "FirePlace", "Baths", "Prices"), as.numeric)
glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area <dbl> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage <dbl> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace <dbl> 1, 1, 5, 5, 5, 4, 1, 2, 1, 3, 5, 1, 1, 4, 4, 5, 1, 4, 4,…
#> $ Baths <dbl> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices <dbl> 1399, 1149, 1627, 1650, 1743, 1819, 1023, 1664, 830, 539…
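As an aside, if the goal had been to keep the original values when converting a factor back to numbers, the usual idiom is to go through as.character() first. The snippet below is an illustrative sketch, not part of the original pipeline:

# as.numeric() on a factor returns level indices; going through as.character()
# first recovers the printed values.
x <- factor(c(0, 4, 2))
as.numeric(x)                # 1 3 2 (level indices)
as.numeric(as.character(x))  # 0 4 2 (original values)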
The data is now in the correct format. Look at the White.Marble, Black.Marble, and Indian.Marble columns: these are three dummy columns encoding a single categorical variable, so they should be collapsed into one factor column.
house <- house %>%
mutate(marble_type = case_when(White.Marble == 1 ~ "white_marble",
Black.Marble == 1 ~ "black_marble",
Indian.Marble == 1 ~ "indian_marble") ) %>%
select(-c(White.Marble, Black.Marble, Indian.Marble)) %>%
mutate(marble_type = as.factor(marble_type))

Now, we check whether there are missing values in the data.
colSums(is.na(house))
#> Area Garage FirePlace Baths Floors City
#> 0 0 0 0 0 0
#> Solar Electric Fiber Glass.Doors Swiming.Pool Garden
#> 0 0 0 0 0 0
#> Prices marble_type
#> 0 0
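As an extra sanity check on the marble consolidation (an illustrative aside; its output is not reproduced here), every row should fall into exactly one marble category:

# Any NA here would mean a row matched none of the three original dummy columns
table(house$marble_type, useNA = "ifany")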
There are no missing values in our data, so we can continue to the next step.
Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlation between them.
ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
The graphic shows that the Baths, FirePlace, Garage, and Area variables have a positive correlation with the Prices variable.
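These correlations can also be computed directly on the numeric columns as a quick cross-check (an illustrative sketch; output not reproduced here):

# Correlation of each numeric predictor with Prices
cor(house[, c("Area", "Garage", "FirePlace", "Baths")], house$Prices)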
Next, we split the data into 80% training and 20% test sets, then fit a linear model using all predictors.

set.seed(20)
index <- initial_split(data = house, prop = 0.8)
data_train <- training(index)
data_test <- testing(index)

model_all <- lm(Prices ~ ., data = data_train)
summary(model_all)
#>
#> Call:
#> lm(formula = Prices ~ ., data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -51.967 -0.005 0.000 0.006 44.971
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.965e+00 1.715e-03 -1.729e+03 <2e-16 ***
#> Area 1.000e+00 4.357e-06 2.295e+05 <2e-16 ***
#> Garage 6.000e+01 3.828e-04 1.567e+05 <2e-16 ***
#> FirePlace 3.000e+01 2.214e-04 1.355e+05 <2e-16 ***
#> Baths 5.000e+01 2.211e-04 2.261e+05 <2e-16 ***
#> Floors1 6.000e+02 6.254e-04 9.593e+05 <2e-16 ***
#> City2 1.400e+02 7.662e-04 1.827e+05 <2e-16 ***
#> City3 2.800e+02 7.663e-04 3.654e+05 <2e-16 ***
#> Solar1 9.999e+00 6.255e-04 1.599e+04 <2e-16 ***
#> Electric1 5.000e+01 6.254e-04 7.994e+04 <2e-16 ***
#> Fiber1 4.700e+02 6.255e-04 7.514e+05 <2e-16 ***
#> Glass.Doors1 1.780e+02 6.255e-04 2.846e+05 <2e-16 ***
#> Swiming.Pool1 7.824e-04 6.255e-04 1.251e+00 0.211
#> Garden1 2.793e-04 6.255e-04 4.470e-01 0.655
#> marble_typeindian_marble -2.000e+02 7.660e-04 -2.611e+05 <2e-16 ***
#> marble_typewhite_marble 3.600e+02 7.664e-04 4.697e+05 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1978 on 399984 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.602e+11 on 15 and 399984 DF, p-value: < 2.2e-16
Swiming.Pool and Garden are not statistically significant (p-value > 0.05), so we refit the model without them:

model_partial <- lm(Prices ~ Area + Garage + FirePlace + Baths + Floors + City + Solar + Electric + Fiber + Glass.Doors + marble_type,
                    data = data_train)
summary(model_partial)
#>
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths + Floors +
#> City + Solar + Electric + Fiber + Glass.Doors + marble_type,
#> data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -51.968 -0.005 0.000 0.006 44.970
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.965e+00 1.658e-03 -1788 <2e-16 ***
#> Area 1.000e+00 4.357e-06 229531 <2e-16 ***
#> Garage 6.000e+01 3.828e-04 156736 <2e-16 ***
#> FirePlace 3.000e+01 2.214e-04 135522 <2e-16 ***
#> Baths 5.000e+01 2.211e-04 226129 <2e-16 ***
#> Floors1 6.000e+02 6.254e-04 959305 <2e-16 ***
#> City2 1.400e+02 7.662e-04 182709 <2e-16 ***
#> City3 2.800e+02 7.663e-04 365407 <2e-16 ***
#> Solar1 9.999e+00 6.254e-04 15986 <2e-16 ***
#> Electric1 5.000e+01 6.254e-04 79938 <2e-16 ***
#> Fiber1 4.700e+02 6.255e-04 751451 <2e-16 ***
#> Glass.Doors1 1.780e+02 6.255e-04 284587 <2e-16 ***
#> marble_typeindian_marble -2.000e+02 7.660e-04 -261110 <2e-16 ***
#> marble_typewhite_marble 3.600e+02 7.664e-04 469732 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1978 on 399986 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.849e+11 on 13 and 399986 DF, p-value: < 2.2e-16
Compare the adjusted R-squared of model_all and model_partial:

summary(model_all)$adj.r.squared
#> [1] 0.9999998
summary(model_partial)$adj.r.squared
#> [1] 0.9999998
Because the adjusted R-squared values are identical, we compare the models on their test-set error using RMSE.
library(MLmetrics)

# prediction
pred_model_all <- predict(model_all, data_test)
pred_model_partial <- predict(model_partial, data_test)

# evaluation
RMSE(y_pred = pred_model_all, y_true = data_test$Prices)
#> [1] 0.2953036
RMSE(y_pred = pred_model_partial, y_true = data_test$Prices)
#> [1] 0.2953008
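For reference, MLmetrics::RMSE computes the usual root-mean-square error; a minimal base-R equivalent (an illustrative sketch, not part of the original analysis) is:

# Root-mean-square error: the square root of the mean squared prediction error
rmse_manual <- function(pred, actual) sqrt(mean((pred - actual)^2))
# rmse_manual(pred_model_partial, data_test$Prices) should match the value above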
Based on the R-squared and RMSE values, model_partial is selected: it matches model_all with fewer predictors.
Next, we try stepwise regression, which searches over combinations of predictors until the model with the smallest AIC (a measure of information loss) is found.
# stepwise regression - backward
model_backward <- step(object = model_all, direction = "backward", trace = F)

# stepwise regression - forward
model_none <- lm(Prices ~ 1, data = data_train)
model_forward <- step(object = model_none, scope = list(upper = model_all), direction = "forward", trace = F)

# stepwise regression - both
model_both <- step(object = model_none, scope = list(upper = model_all), direction = "both", trace = F)

The adjusted R-squared values of the four models can be compared:
summary(model_backward)$adj.r.squared
#> [1] 0.9999998
summary(model_forward)$adj.r.squared
#> [1] 0.9999998
summary(model_both)$adj.r.squared
#> [1] 0.9999998
summary(model_partial)$adj.r.squared
#> [1] 0.9999998
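Since stepwise selection minimizes AIC, the models can also be compared on AIC directly (a quick cross-check sketch; output not reproduced here):

# Lower AIC means less information loss
AIC(model_partial, model_backward, model_forward, model_both)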
pred_model_backward <- predict(model_backward, newdata = data_test)
pred_model_forward <- predict(model_forward, newdata = data_test)
pred_model_both <- predict(model_both, newdata = data_test)

# evaluation
RMSE(y_pred = pred_model_partial, y_true = data_test$Prices)
#> [1] 0.2953008
RMSE(y_pred = pred_model_backward, y_true = data_test$Prices)
#> [1] 0.2953008
RMSE(y_pred = pred_model_forward, y_true = data_test$Prices)
#> [1] 0.2953008
RMSE(y_pred = pred_model_both, y_true = data_test$Prices)
#> [1] 0.2953008
All four models have exactly the same RMSE: the stepwise models perform identically to the manually specified model_partial, so we can choose either.
hist(model_partial$residuals, breaks = 10)

From the histogram above, the residuals appear to be approximately normally distributed around zero.
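A Q-Q plot gives a sharper visual check of normality than a histogram (a suggested sketch; note that shapiro.test() cannot be used here because it is limited to 5,000 observations):

# Points lying close to the reference line indicate approximate normality
qqnorm(model_partial$residuals)
qqline(model_partial$residuals)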
Next, we test the homoscedasticity assumption with the Breusch-Pagan test:

library(lmtest)
bptest(model_backward)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_backward
#> BP = 23.118, df = 13, p-value = 0.04028
From the results above, we obtain a p-value of 0.04028. Because the p-value is less than alpha (0.05), H0 is rejected. We conclude that the error variance is not constant, i.e. the residuals exhibit heteroscedasticity.
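A common remedy when heteroscedasticity is detected (shown here as a hedged sketch, not something the original analysis performs) is to report heteroscedasticity-robust standard errors using the sandwich package:

library(sandwich)
# Re-test the coefficients with heteroscedasticity-consistent (HC1) standard errors
coeftest(model_backward, vcov = vcovHC(model_backward, type = "HC1"))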
We then check for multicollinearity using the variance inflation factor (VIF):

library(car)
vif(model_backward)
#> GVIF Df GVIF^(1/(2*Df))
#> Area 1.000034 1 1.000017
#> Garage 1.000078 1 1.000039
#> FirePlace 1.000014 1 1.000007
#> Baths 1.000041 1 1.000020
#> Floors 1.000025 1 1.000013
#> City 1.000055 2 1.000014
#> Solar 1.000013 1 1.000006
#> Electric 1.000015 1 1.000008
#> Fiber 1.000032 1 1.000016
#> Glass.Doors 1.000038 1 1.000019
#> marble_type 1.000054 2 1.000014
Because no VIF value exceeds 10, there is no multicollinearity among the predictors, so the multicollinearity assumption is satisfied.
Both models produce reliable predictions of house prices. Moreover, stepwise regression yielded a model with results equivalent to the manually specified one: stepwise feature selection successfully reduced the number of features, improving training efficiency while retaining the information needed for reliable predictions.
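Finally, as an illustration of how the selected model would be used in practice, we can score a hypothetical new listing (the feature values below are made up for demonstration):

# Hypothetical new house; factor columns must reuse the levels seen in training
new_house <- data.frame(
  Area = 150, Garage = 2, FirePlace = 2, Baths = 3,
  Floors = factor("1", levels = levels(house$Floors)),
  City = factor("2", levels = levels(house$City)),
  Solar = factor("1", levels = levels(house$Solar)),
  Electric = factor("0", levels = levels(house$Electric)),
  Fiber = factor("1", levels = levels(house$Fiber)),
  Glass.Doors = factor("1", levels = levels(house$Glass.Doors)),
  marble_type = factor("white_marble", levels = levels(house$marble_type))
)
predict(model_partial, newdata = new_house)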