In this article, I will predict the value of real estate using a regression model. This model will determine what factors affect the fluctuations in the value of a real estate. The market historical data set of real estate valuation are collected from Sindian Dist., New Taipei City, Taiwan.
library(car)
library(dplyr)
library(GGally)
library(ggplot2)
library(lmtest)
library(manipulate)
library(MASS)
library(MLmetrics)
library(performance)realestate <- readxl::read_xlsx("Real estate valuation data set.xlsx")
realestate <- realestate %>%
dplyr::select(-No)names(realestate) <- c("date", "age", "MRT_station", "store", "lat", "long", "price")realestate$store <- as.integer(realestate$store)anyNA(realestate)#> [1] FALSE
head(realestate)RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index <- sample(x = nrow(realestate), size= nrow(realestate)*0.75)
# splitting
train <- realestate[index,]
test <- realestate[-index,]ggcorr(train, label = T) Based on the correlation output above: - Variable store, lat, and long have a strong positive correlation with price. - Variable date has a weak positive correlation with price. - Variable age has a weak negative correlation with price. - Variable MRT_station has a strong negative correlation with price.
Initial Assumptions: - Transaction date and age of real estate have less effect on real estate prices. - The further north and east the property is located, the price of the property will increase. - The more convenience stores around the property, the higher the property price. - The further away the property is from the nearest MRT station, the lower the property price.
model_all <- lm(price ~ ., data = realestate)
summary(model_all)#>
#> Call:
#> lm(formula = price ~ ., data = realestate)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -35.667 -5.412 -0.967 4.217 75.190
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -14441.982719 6775.386097 -2.132 0.03364 *
#> date 5.149017 1.556876 3.307 0.00103 **
#> age -0.269697 0.038530 -7.000 0.0000000000106 ***
#> MRT_station -0.004488 0.000718 -6.250 0.0000000010373 ***
#> store 1.133325 0.188160 6.023 0.0000000038269 ***
#> lat 225.470143 44.565775 5.059 0.0000006382166 ***
#> long -12.429061 48.581168 -0.256 0.79820
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 8.858 on 407 degrees of freedom
#> Multiple R-squared: 0.5824, Adjusted R-squared: 0.5762
#> F-statistic: 94.6 on 6 and 407 DF, p-value: < 0.00000000000000022
model_backward <- step(model_all, direction = "backward", trace = 0)
summary(model_backward)#>
#> Call:
#> lm(formula = price ~ date + age + MRT_station + store + lat,
#> data = realestate)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -35.625 -5.373 -1.020 4.243 75.343
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -15964.8038627 3233.1114480 -4.938 0.0000011535027 ***
#> date 5.1375550 1.5544483 3.305 0.00103 **
#> age -0.2693805 0.0384660 -7.003 0.0000000000104 ***
#> MRT_station -0.0043533 0.0004899 -8.887 < 0.0000000000000002 ***
#> store 1.1361928 0.1876103 6.056 0.0000000031674 ***
#> lat 226.8794043 44.1733708 5.136 0.0000004353834 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 8.847 on 408 degrees of freedom
#> Multiple R-squared: 0.5823, Adjusted R-squared: 0.5772
#> F-statistic: 113.8 on 5 and 408 DF, p-value: < 0.00000000000000022
# model_all
predict_all <- predict(model_all, newdata = test)
# model_backward
predict_backward <- predict(model_backward, newdata = test)compare_performance(model_all, model_backward)model_all has better perfomance than model_backward based on AIC, Adjuster-R square, and RMSE
# model_all
MAPE(y_pred = predict_all, y_true = test$price)*100#> [1] 20.02665
# model_backward
MAPE(y_pred = predict_backward, y_true = test$price)*100#> [1] 20.05283
Error value of model_all lower than model_backward
check_model(model_all) ## Linearity
linear_all <- data.frame(residual = model_all$residuals, fitted = model_all$fitted.values)
linear_all %>%
ggplot(aes(fitted, residual)) +
geom_point() +
geom_smooth() +
geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank()) There is a pattern in the data, with the residuals has become more negative as the fitted values increase before increased again. The pattern indicate that our model may not be linear enough.
shapiro.test(model_all$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_all$residuals
#> W = 0.87622, p-value < 0.00000000000000022
check_normality(model_all)#> Warning: Non-normality of residuals detected (p < .001).
The null hypothesis is that the residuals follow normal distribution. With p-value < 0.05, we can conclude that our hypothesis is rejected, and our residuals are not following the normal distribution.
bptest(model_all)#>
#> studentized Breusch-Pagan test
#>
#> data: model_all
#> BP = 8.4674, df = 6, p-value = 0.2058
check_heteroscedasticity(model_all)#> Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.040).
One of method to detect heterocesdasticity is using the Breusch-Pagan test, with null hypothesis is there is no heterocesdasticity. With p-value > 0.05, we can conclude that heterocesdasticity is not present in our model.
vif(model_all)#> date age MRT_station store lat long
#> 1.014655 1.014287 4.322984 1.617021 1.610225 2.926305
There is no predictors that has VIF value exceeds 5 or 10. We can conclude there is no multicollinearity in our model.
eh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.